# Newspaper Navigator Data

This notebook shows code that takes the downloded data from the [Newspaper Navigator](https://news-navigator.labs.loc.gov/) project and organizes it by location.

Start by importing our necessary modules and opening the downloaded files. We'll only need the json module for this one. We're also starting with the 1850 headline and photo files for now.

In [2]:
import json

In [16]:
with open('1850_headlines.json') as f:
    headlines = json.load(f)
    
with open('1850_photos.json') as f:
    photos = json.load(f)

Next, define a function to figure out the places of publication. This will help us see how many locations there are and how we may need to clean this data.

In [7]:
def get_places(json):
    temp = dict()
    for element in json:
        if element['place_of_publication'] not in temp:
            temp[element['place_of_publication']] = 0
        temp[element['place_of_publication']]+=1
    return temp

In [8]:
headline_locations = get_places(headlines)

In [9]:
headline_locations

{'Fayetteville, Tenn.': 3,
 'Pickens Court House, S.C.': 113,
 'Fayetteville [N.C.]': 42,
 'Glasgow, Mo.': 72,
 'Fremont [Ohio]': 3,
 'Fremont, Sandusky County, Ohio': 3,
 'Jackson, Miss.': 12,
 'Holly Springs, Misp. [i.e. Miss.]': 2,
 'Cadiz, Ohio': 44,
 'Bath, Me.': 15,
 '[Reading, Pa.': 61,
 'Abbeville, S.C.': 2,
 'Montpelier, Vt.': 58,
 'Canton, Stark County, Ohio': 81,
 'Charlotte, N.C.': 32,
 'Sunbury, Pa.': 29,
 'Alexandria, D.C.': 1232,
 'Davenport, Scott Co., Iowa': 31,
 'New-Lisbon, Ohio': 73,
 'Indianapolis': 172,
 'Staunton, Va.': 266,
 'Woodville, Miss.': 32,
 'Washington [D.C.]': 2537,
 'Washington City [i.e. Washington, D.C.]': 6,
 'Allentaun, Pa.': 163,
 'Athens, Tenn.': 13,
 'St. Johnsbury, Vt.': 97,
 'Raleigh, N.C.': 41,
 'Camden, S.C.': 128,
 'Hillsborough, N.C.': 66,
 'Paris, Me.': 122,
 'Tarborough, (Edgecombe Co., N.C.)': 18,
 'Lewisburg, Pa.': 31,
 '[Ottumwa] Iowa': 168,
 'Lancaster, Ohio': 11,
 'Portsmouth, O. [Ohio]': 7,
 'Hillsdale, Mich.': 227,
 'Ottawa, Ill.

In [10]:
photo_locations = get_places(photos)

In [11]:
photo_locations

{'Fremont, Sandusky County, Ohio': 10,
 'St. Johnsbury, Vt.': 7,
 'Hillsborough, N.C.': 1,
 'Canton, Miss.': 3,
 'Richmond, IA [i.e. Ind.]': 1}

We can see that the places of publication are not standardized, making it much more difficult to correlate them to the Sanborn and geographic data in the project. Let's take a look at the geographic_coverage instead. This data should be more standardized but may create other complications with multiple locations

In [26]:
def get_geogs(json):
    temp = dict()
    for element in json:
        geogs = element['geographic_coverage']
        for g in geogs:
            if g not in temp:
                temp[g] = 0
            temp[g]+=1
    return temp

In [27]:
headline_geogs = get_geogs(headlines)

In [28]:
headline_geogs

{'Tennessee--Lincoln--Fayetteville': 3,
 'South Carolina--Oconee--Walhalla': 113,
 'South Carolina--Pickens--Pickens': 113,
 'North Carolina--Cumberland--Fayetteville': 42,
 'Missouri--Howard--Glasgow': 72,
 'Ohio--Sandusky--Fremont': 6,
 'Mississippi--Hinds--Jackson': 12,
 'Mississippi--Marshall--Holly Springs': 2,
 'Ohio--Harrison--Cadiz': 44,
 'Maine--Sagadahoc--Bath': 15,
 'Pennsylvania--Berks--Reading': 61,
 'South Carolina--Abbeville--Abbeville': 2,
 'Vermont--Washington--Montpelier': 58,
 'Ohio--Stark--Canton': 81,
 'North Carolina--Mecklenburg--Charlotte': 32,
 'Pennsylvania--Northumberland--Sunbury': 29,
 'District of Columbia--Alexandria': 1232,
 'Virginia--Alexandria': 1232,
 'Virginia--Alexandria--Alexandria': 1232,
 'Iowa--Scott--Davenport': 31,
 'Ohio--Columbiana--New Lisbon': 73,
 'Ohio--Columbiana--Salem': 73,
 'Indiana--Marion--Indianapolis': 172,
 'Virginia--Staunton': 266,
 'Mississippi--Wilkinson--Woodville': 32,
 'District of Columbia--Washington': 2543,
 'Pennsylv

In [29]:
photo_geogs = get_geogs(photos)

In [30]:
photo_geogs

{'Ohio--Sandusky--Fremont': 10,
 'Vermont--Caledonia--Saint Johnsbury': 7,
 'North Carolina--Orange--Hillsboro': 1,
 'Mississippi--Madison--Canton': 3,
 'Indiana--Wayne--Richmond': 1}

## Applying to the Project

In order to limit the amount of newspaper data to include, I'm going to focus on the counties that have Sanborn maps in them.

Start by importing the requests module so we can programmatically get data instead of downloading and opening each file.

In [3]:
import requests

In [21]:
all_photos = []

for n in range(1850, 1871): # goes from 1850-1870
    temp_photos = requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_photos.json')
    for item in temp_photos.json():
        all_photos.append(item)
    print(str(n) + '_photos.json')

1850_photos.json
1851_photos.json
1852_photos.json
1853_photos.json
1854_photos.json
1855_photos.json
1856_photos.json
1857_photos.json
1858_photos.json
1859_photos.json
1860_photos.json
1861_photos.json
1862_photos.json
1863_photos.json
1864_photos.json
1865_photos.json
1866_photos.json
1867_photos.json
1868_photos.json
1869_photos.json
1870_photos.json


In [24]:
all_photos[-1]

{'filepath': 'ohi_edgar_ver01/data/sn85038121/00280775356/1870081801/0133/000_0_91.jpg',
 'pub_date': '1870-08-18',
 'page_seq_num': 133,
 'edition_seq_num': 1,
 'batch': 'ohi_edgar_ver01',
 'lccn': 'sn85038121',
 'box': [0.38768919024277126,
  0.7511628702831343,
  0.49907297780594295,
  0.8920526586665525],
 'score': 0.9192439913749695,
 'ocr': ['August', '18,', "Sheriff's", 'Sale.'],
 'place_of_publication': 'Gallipolis, Ohio',
 'geographic_coverage': ['Ohio--Gallia--Gallipolis'],
 'name': 'Gallipolis journal. [volume]',
 'publisher': 'Alexander Vance',
 'url': 'https://news-navigator.labs.loc.gov/data/ohi_edgar_ver01/data/sn85038121/00280775356/1870081801/0133/000_0_91.jpg',
 'page_url': 'https://chroniclingamerica.loc.gov/data/batches/ohi_edgar_ver01/data/sn85038121/00280775356/1870081801/0133.jp2'}

In [25]:
for n in range(1871, 1891): # goes from 1871-1890
    temp_photos = requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_photos.json')
    for item in temp_photos.json():
        all_photos.append(item)
    print(str(n) + '_photos.json')

1871_photos.json
1872_photos.json
1873_photos.json
1874_photos.json
1875_photos.json
1876_photos.json
1877_photos.json
1878_photos.json
1879_photos.json
1880_photos.json
1881_photos.json
1882_photos.json
1883_photos.json
1884_photos.json
1885_photos.json
1886_photos.json
1887_photos.json
1888_photos.json
1889_photos.json
1890_photos.json


In [26]:
all_photos[-1]

{'filepath': 'msar_emerald_ver02/data/sn87065614/00383343793/1890012301/0756/001_0_93.jpg',
 'pub_date': '1890-01-23',
 'page_seq_num': 756,
 'edition_seq_num': 1,
 'batch': 'msar_emerald_ver02',
 'lccn': 'sn87065614',
 'box': [0.530294822080292,
  0.06756236509692659,
  0.9614150889598541,
  0.3008145780588211],
 'score': 0.9326552152633667,
 'ocr': ['FOR', 'SALE.'],
 'place_of_publication': 'Starkville, Miss.',
 'geographic_coverage': ['Mississippi--Oktibbeha--Starkville'],
 'name': 'Southern live-stock journal.',
 'publisher': 'Live-stock Journal Co.',
 'url': 'https://news-navigator.labs.loc.gov/data/msar_emerald_ver02/data/sn87065614/00383343793/1890012301/0756/001_0_93.jpg',
 'page_url': 'https://chroniclingamerica.loc.gov/data/batches/msar_emerald_ver02/data/sn87065614/00383343793/1890012301/0756.jp2'}

In [27]:
for n in range(1891, 1911): # goes from 1891-1910
    temp_photos = requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_photos.json')
    for item in temp_photos.json():
        all_photos.append(item)
    print(str(n) + '_photos.json')

1891_photos.json
1892_photos.json
1893_photos.json
1894_photos.json
1895_photos.json
1896_photos.json
1897_photos.json
1898_photos.json
1899_photos.json
1900_photos.json
1901_photos.json
1902_photos.json
1903_photos.json
1904_photos.json
1905_photos.json
1906_photos.json
1907_photos.json
1908_photos.json
1909_photos.json
1910_photos.json


In [28]:
all_photos[-1]

{'filepath': 'mimtptc_chelsea_ver01/data/sn85038709/00415669343/1910052601/0356/006_0_95.jpg',
 'pub_date': '1910-05-26',
 'page_seq_num': 356,
 'edition_seq_num': 1,
 'batch': 'mimtptc_chelsea_ver01',
 'lccn': 'sn85038709',
 'box': [0.037587425925514915,
  0.4260654315571155,
  0.16797755130779782,
  0.5878377304612715],
 'score': 0.9587684869766235,
 'ocr': ["si'.",
  'V"',
  'is',
  'j',
  "'",
  'ki;v.',
  'W.',
  ';.',
  'i.oiy',
  'ho',
  'will',
  'deli',
  'er',
  'the',
  'Memorial',
  'addres',
  'at',
  'the',
  'ojiera',
  'Ia',
  'i;s"',
  "r.'.ay",
  '-W).'],
 'place_of_publication': 'Alma, Mich.',
 'geographic_coverage': ['Michigan--Gratiot--Alma'],
 'name': 'Alma record.',
 'publisher': 'Miss C.M. Fleming',
 'url': 'https://news-navigator.labs.loc.gov/data/mimtptc_chelsea_ver01/data/sn85038709/00415669343/1910052601/0356/006_0_95.jpg',
 'page_url': 'https://chroniclingamerica.loc.gov/data/batches/mimtptc_chelsea_ver01/data/sn85038709/00415669343/1910052601/0356.jp2'}

In [29]:
for n in range(1911, 1931): # goes from 1911-1930
    temp_photos = requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_photos.json')
    for item in temp_photos.json():
        all_photos.append(item)
    print(str(n) + '_photos.json')

1911_photos.json
1912_photos.json
1913_photos.json
1914_photos.json
1915_photos.json
1916_photos.json
1917_photos.json
1918_photos.json
1919_photos.json
1920_photos.json
1921_photos.json
1922_photos.json
1923_photos.json
1924_photos.json
1925_photos.json
1926_photos.json
1927_photos.json
1928_photos.json
1929_photos.json
1930_photos.json


In [30]:
all_photos[-1]

{'filepath': 'dlc_perry_ver01/data/sn83045462/00280600313/1930122701/0425/005_0_95.jpg',
 'pub_date': '1930-12-27',
 'page_seq_num': 425,
 'edition_seq_num': 1,
 'batch': 'dlc_perry_ver01',
 'lccn': 'sn83045462',
 'box': [0.14503879964786726,
  0.03966031484597104,
  0.26296206286353785,
  0.23427494485579914],
 'score': 0.9525977373123169,
 'ocr': ['Aviatrix',
  'to',
  'Wed',
  'MARJORIE',
  'CRAWFORD,',
  'Twenty-two-year-old',
  'flyer,',
  'announced',
  'her',
  'engagement',
  'in',
  'Los',
  'Angeles',
  'to',
  'Wil',
  'liam',
  'lVdiman,',
  'motion',
  'picture',
  'director',
  'and',
  'a',
  'visitor',
  'enthusiast.',
  '—A.',
  'P.',
  'Photo.'],
 'place_of_publication': 'Washington, D.C.',
 'geographic_coverage': ['District of Columbia--Washington'],
 'name': 'Evening star. [volume]',
 'publisher': 'W.D. Wallach & Hope',
 'url': 'https://news-navigator.labs.loc.gov/data/dlc_perry_ver01/data/sn83045462/00280600313/1930122701/0425/005_0_95.jpg',
 'page_url': 'https://c

In [31]:
for n in range(1931, 1951): # goes from 1931-1950
    temp_photos = requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_photos.json')
    for item in temp_photos.json():
        all_photos.append(item)
    print(str(n) + '_photos.json')

1931_photos.json
1932_photos.json
1933_photos.json
1934_photos.json
1935_photos.json
1936_photos.json
1937_photos.json
1938_photos.json
1939_photos.json
1940_photos.json
1941_photos.json
1942_photos.json
1943_photos.json
1944_photos.json
1945_photos.json
1946_photos.json
1947_photos.json
1948_photos.json
1949_photos.json
1950_photos.json


In [32]:
all_photos[-1]

{'filepath': 'ohi_amaryllis_ver01/data/sn88077226/0041421020A/1950060201/0209/000_0_99.jpg',
 'pub_date': '1950-06-02',
 'page_seq_num': 209,
 'edition_seq_num': 1,
 'batch': 'ohi_amaryllis_ver01',
 'lccn': 'sn88077226',
 'box': [0.24949934578392804,
  0.5379106913527397,
  0.8852704997321956,
  0.8998293245219748],
 'score': 0.9969677329063416,
 'ocr': ['Miss', 'it....', '..', 'j•'],
 'place_of_publication': 'Dayton, Ohio',
 'geographic_coverage': ['Ohio--Montgomery--Dayton'],
 'name': 'The Ohio daily-express.',
 'publisher': 'Paige H. Strickland',
 'url': 'https://news-navigator.labs.loc.gov/data/ohi_amaryllis_ver01/data/sn88077226/0041421020A/1950060201/0209/000_0_99.jpg',
 'page_url': 'https://chroniclingamerica.loc.gov/data/batches/ohi_amaryllis_ver01/data/sn88077226/0041421020A/1950060201/0209.jp2'}

In [33]:
for n in range(1951, 1964): # goes from 1951-1963
    temp_photos = requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_photos.json')
    for item in temp_photos.json():
        all_photos.append(item)
    print(str(n) + '_photos.json')

1951_photos.json
1952_photos.json
1953_photos.json
1954_photos.json
1955_photos.json
1956_photos.json
1957_photos.json
1958_photos.json
1959_photos.json
1960_photos.json
1961_photos.json
1962_photos.json
1963_photos.json


In [34]:
all_photos[-1]

{'filepath': 'msar_nickel_ver01/data/sn87065526/0038334525A/1963112801/0189/000_0_99.jpg',
 'pub_date': '1963-11-28',
 'page_seq_num': 189,
 'edition_seq_num': 1,
 'batch': 'msar_nickel_ver01',
 'lccn': 'sn87065526',
 'box': [0.3737845981822294,
  0.06714182592637045,
  0.8290997673483456,
  0.3359057889970321],
 'score': 0.9971892237663269,
 'ocr': ['REBEL',
  'BOOSTERS—Ready',
  'to',
  'back',
  'the',
  'Rebels',
  'when',
  'they',
  'meet',
  'arch',
  'rival',
  'Mississippi',
  'State',
  'in',
  'Starkville',
  'Saturday',
  'are',
  'these',
  'Bennett,',
  'Tchula.',
  'Back',
  'row',
  'left:',
  'Tony',
  'Thaxton,',
  'Greenwood;',
  'Wynn',
  'Skipper,',
  'Columbia;',
  'Bubba',
  'McArthur,',
  'Jackson;',
  'Sally'],
 'place_of_publication': 'Pascagoula, Miss.',
 'geographic_coverage': ['Mississippi--Jackson--Moss Point',
  'Mississippi--Jackson--Pascagoula'],
 'name': 'The Chronicle.',
 'publisher': 'Ira B. Harkey, Jr.',
 'url': 'https://news-navigator.labs.loc.gov/

In [35]:
len(all_photos)

1594084

In [42]:
# writing the file to keep it in case needed later
f = open('all-photos.json', 'w')
f.write(json.dumps(all_photos))
f.close()

Now that we have the data in a list, we can go back through it and remove the information we don't need. I want to keep the url, place of publication, geographic coverage, ocr, publication date, publication name, and recreate the url to link back to the Chronicling America site.

In [65]:
def keep_data():
    for item in all_photos:
        item['site_url'] = 'https://chroniclingamerica.loc.gov/lccn/' + item['lccn'] + '/' + item['pub_date'] + '/ed-' + str(item['edition_seq_num'])
        del item['filepath']
        del item['page_seq_num']
        del item['edition_seq_num']
        del item['batch']
        del item['lccn']
        del item['box']
        del item['publisher']
        del item['page_url']
        del item['score']

In [66]:
keep_data()

In [80]:
all_photos[0]

{'pub_date': '1850-07-27',
 'ocr': ['ht',
  'I',
  "'",
  'Wll.',
  '.',
  'III',
  'tl',
  'T',
  '.',
  '"\'',
  '"\'',
  '"',
  'sssaaaaalf',
  '"',
  "'",
  '"',
  '"',
  'J!',
  '"',
  '"\'',
  "''",
  'fassi',
  '1',
  "'"],
 'place_of_publication': 'Fremont, Sandusky County, Ohio',
 'geographic_coverage': ['Ohio--Sandusky--Fremont'],
 'name': 'Fremont weekly freeman. [volume]',
 'url': 'https://news-navigator.labs.loc.gov/data/ohi_ingstad_ver01/data/sn85026051/00296027029/1850072701/0037/000_0_98.jpg',
 'site_url': 'https://chroniclingamerica.loc.gov/lccn/sn85026051/1850-07-27/ed-1'}

There's still too much data for my project (I need files to be under 1GB to be hosted on GitHub, and preferably smaller if possible). So, I'm going to use the Sanborn locations to filter out which locations I want to keep from the Newspaper Navigator data.

First, I'll need to do some reformatting of the Newspaper Navigator data. Ideally, I want to pull out the state, county, and city of publication. I'm starting by looking at what locations are present in the Newspaper Navigator data and writing that into a text file.

In [85]:
location_set = dict()

for item in all_photos:
    if item['place_of_publication'] not in location_set:
        location_set[item['place_of_publication']] = 0
    location_set[item['place_of_publication']]+=1

In [204]:
location_set

{'Fremont, Sandusky County, Ohio': 23,
 'St. Johnsbury, Vt.': 751,
 'Hillsborough, N.C.': 27,
 'Canton, Miss.': 26,
 'Richmond, IA [i.e. Ind.]': 1,
 'Woodville, Miss.': 18,
 'Yazoo City, Miss.': 218,
 'Hannibal, Mo.': 1,
 'Columbus, Miss.': 1460,
 'Montpelier, Vt.': 225,
 '[New Orleans, La.]': 1,
 'Bath, Me.': 1,
 'Marshall, Tex.': 6,
 'Port Gibson, Miss.': 1901,
 'Mineral Point, Wis.': 1744,
 'New York [N.Y.]': 110124,
 'Gallipolis, Ohio': 9,
 'Leonard Town, Md.': 452,
 'Clarksburg, Va. [W. Va.]': 5,
 'Bloomington, Iowa': 1,
 'Hillsdale, Mich.': 2,
 'Cincinnati, [Ohio]': 4,
 'Evansville, Ia. [i.e. Ind.]': 246,
 'Nashville [Tenn.]': 1,
 'Camden, South-Carolina': 3,
 'Lancaster, Ohio': 2,
 'Burlington, Vt.': 522,
 'Port Tobacco, Md.': 284,
 'Washington, D.C.': 177628,
 'Davenport, Iowa': 17,
 'Warren, Ohio': 9,
 'St. Clairsville, Ohio': 4,
 'City of Council Bluffs, Iowa': 1,
 'St. Paul, Minn.': 3455,
 'Fremont, Sandusky County [Ohio]': 24,
 'Napoleonville [La.]': 1,
 'Charles Town, Va. 

In [97]:
f = open('news-locations.txt', 'w')
for key in location_set.keys():
    f.write(key + '\n')
f.close()

Now that that's written, I'm going to go through and add lines manually for the state and city. This could also be done with code, but there's so much variation in here that it honestly would be easier to just go through manually. In places where there is an alternate option, I'm checking which, if either, is present in the Sanborn file and using that one.

I'll need to re-load in this file after modifying it and add the info to the all_photos data.

In [91]:
sanborn_txt = open('../sanborn-with-fips.json', 'r').read()

In [269]:
'Columbia Falls' in sanborn_txt

True

In [324]:
news_locations_lines = open('news-locations.txt').readlines()

To add the data into all_photos, I'm going to start by creating a dictionary that goes from the place_of_publication to the state and city.

In [325]:
news_loc_dict = dict()

for i in range(len(news_locations_lines)):
    if (i%3 == 0):
        news_loc_dict[news_locations_lines[i].rstrip()] = {'state': news_locations_lines[i+1].rstrip(),
                                                 'city': news_locations_lines[i+2].rstrip()}

In [326]:
news_loc_dict

{'Fremont, Sandusky County, Ohio': {'state': 'Ohio', 'city': 'Fremont'},
 'St. Johnsbury, Vt.': {'state': 'Vermont', 'city': 'St. Johnsbury'},
 'Hillsborough, N.C.': {'state': 'North Carolina', 'city': 'Hillsborough'},
 'Canton, Miss.': {'state': 'Mississippi', 'city': 'Canton'},
 'Richmond, IA [i.e. Ind.]': {'state': 'Indiana', 'city': 'Richmond'},
 'Woodville, Miss.': {'state': 'Mississippi', 'city': 'Woodville'},
 'Yazoo City, Miss.': {'state': 'Mississippi', 'city': 'Yazoo City'},
 'Hannibal, Mo.': {'state': 'Missouri', 'city': 'Hannibal'},
 'Columbus, Miss.': {'state': 'Mississippi', 'city': 'Columbus'},
 'Montpelier, Vt.': {'state': 'Vermont', 'city': 'Montpelier'},
 '[New Orleans, La.]': {'state': 'Louisiana', 'city': 'New Orleans'},
 'Bath, Me.': {'state': 'Maine', 'city': 'Bath'},
 'Marshall, Tex.': {'state': 'Texas', 'city': 'Marshall'},
 'Port Gibson, Miss.': {'state': 'Mississippi', 'city': 'Port Gibson'},
 'Mineral Point, Wis.': {'state': 'Wisconsin', 'city': 'Mineral Poin

Then, that information will get added into all_photos. With that, we can then sort the photos by state and city.

In [327]:
for item in all_photos:
    item['state'] = news_loc_dict[item['place_of_publication']]['state']
    item['city'] = news_loc_dict[item['place_of_publication']]['city']

In [328]:
all_photos = sorted(all_photos, key = lambda i: (i['state'], i['city']))

In [329]:
all_photos_sorted = dict()

for item in all_photos:
    if item['state'] not in all_photos_sorted:
        all_photos_sorted[item['state']] = dict()
    if item['city'].upper() not in all_photos_sorted[item['state']]:
        all_photos_sorted[item['state']][item['city'].upper()] = []
    all_photos_sorted[item['state']][item['city'].upper()].append(item)

In [344]:
all_photos_sorted['Massachusetts']

{'SPRINGFIELD': [{'pub_date': '1940-06-15',
   'ocr': ['fW\'fcWBmi\'IW\'WW\'\'iW\'W)"\'Wlllli!Mll',
    'I',
    'u.',
    'Hll)',
    'J',
    'III',
    'nil',
    '1'],
   'place_of_publication': 'Springfield, Mass.',
   'geographic_coverage': ['Massachusetts--Hampden--Springfield'],
   'name': 'Hartford-Springfield chronicle. [volume]',
   'url': 'https://news-navigator.labs.loc.gov/data/ct_clash_ver01/data/sn95063011/00414219585/1940061501/0015/002_0_95.jpg',
   'site_url': 'https://chroniclingamerica.loc.gov/lccn/sn95063011/1940-06-15/ed-1',
   'state': 'Massachusetts',
   'city': 'Springfield'},
  {'pub_date': '1940-06-15',
   'ocr': ['Dr.',
    'Mary',
    'C.',
    'Wright',
    'who',
    'was',
    'hon',
    'ored',
    'at',
    'a',
    'recent',
    'banquet',
    'with',
    'Dr.',
    'Dorothy',
    'Ferebee.',
    '(See',
    'page',
    '1',
    ')'],
   'place_of_publication': 'Springfield, Mass.',
   'geographic_coverage': ['Massachusetts--Hampden--Springfield'],
 

We can now access the photos from a city by going to all_photos_sorted[state][city]. 

Next, I want to load in the Sanborn data.

In [77]:
with open('../sanborn-with-fips.json') as f:
    sanborn = json.load(f)

Then, I can match up the locations of the Sanborn data to the Newspaper Navigator data. I'll create a new data structure for the items I want to keep. For my project, I'll be keeping the photos from cities that show up in the Sanborn maps collection. They'll be organized in an alphabetical list of dictionaries, where each dictionary holds a state and a dictionary of city names. Each of the city names maps to a list of items.

In [338]:
photos_to_keep = []

for state in sanborn:
    if state['state'] not in all_photos_sorted:
        photos_to_keep.append({'state': state['state'], 'cities': {}})
        continue
    temp = dict()
    temp['state'] = state['state']
    temp['cities'] = dict()
    for county in state['counties']:
        for city in county['cities']:
            city_name = city['city'].upper()
            if 'SAINT' in city_name:
                city_name.replace('SAINT', 'ST.')
            if city_name in all_photos_sorted[state['state']]:
                temp['cities'][city_name] = all_photos_sorted[state['state']][city_name]
    photos_to_keep.append(temp)

In [340]:
for item in photos_to_keep:
    print(item['state'], len(item['cities']))

Alabama 1
Alaska 5
Arizona 20
Arkansas 13
California 7
Colorado 15
Connecticut 11
Delaware 8
District of Columbia 1
Florida 19
Georgia 3
Hawaii 3
Idaho 34
Illinois 4
Indiana 10
Iowa 16
Kansas 20
Kentucky 23
Louisiana 29
Maine 5
Maryland 13
Massachusetts 1
Michigan 17
Minnesota 17
Mississippi 35
Missouri 32
Montana 32
Nebraska 14
Nevada 7
New Hampshire 0
New Jersey 3
New Mexico 21
New York 0
North Carolina 32
North Dakota 16
Ohio 42
Oklahoma 11
Oregon 24
Pennsylvania 30
Rhode Island 0
South Carolina 23
South Dakota 17
Tennessee 33
Texas 24
Utah 10
Vermont 14
Virginia 12
Washington 21
West Virginia 16
Wisconsin 17
Wyoming 0


As you can see, I've printed out how many items are in each of the states. In some areas, there aren't any photos — for most of them, that's accurate. For New York, however, a quick check in with the Sanborn data shows that New York City is represented, but it is listed as New York; Special Atlases. So, I'll go back in and add that before writing the file.

In [356]:
photos_to_keep[32]['cities']['NEW YORK; SPECIAL ATLASES'] = all_photos_sorted['New York']['NEW YORK']

In [363]:
# removing some more things
for state in photos_to_keep:
    for key in state['cities']:
        for item in state['cities'][key]:
            del item['ocr']
            del item['geographic_coverage']
            del item['state']
            del item['city']
            del item['place_of_publication']

In [6]:
photos_to_keep[0]

{'state': 'Alabama',
 'cities': {'BIRMINGHAM': [{'pub_date': '1894-07-06',
    'name': 'Birmingham age-herald. [volume]',
    'url': 'https://news-navigator.labs.loc.gov/data/au_abernethy_ver01/data/sn84020639/00414187407/1894070601/0042/007_0_91.jpg',
    'site_url': 'https://chroniclingamerica.loc.gov/lccn/sn84020639/1894-07-06/ed-1'},
   {'pub_date': '1895-09-20',
    'name': 'Birmingham age-herald. [volume]',
    'url': 'https://news-navigator.labs.loc.gov/data/au_brown_ver01/data/sn84020639/00340583292/1895092001/0697/001_0_96.jpg',
    'site_url': 'https://chroniclingamerica.loc.gov/lccn/sn84020639/1895-09-20/ed-1'},
   {'pub_date': '1895-04-14',
    'name': 'Birmingham age-herald. [volume]',
    'url': 'https://news-navigator.labs.loc.gov/data/au_abernethy_ver01/data/sn84020639/00340583206/1895041402/0145/000_0_94.jpg',
    'site_url': 'https://chroniclingamerica.loc.gov/lccn/sn84020639/1895-04-14/ed-2'},
   {'pub_date': '1896-05-17',
    'name': 'Birmingham state herald.',
    

I need these files to be less than 100 MB to be able to store them on GitHub, so I'm going to split photos_to_keep by state.

In [8]:
for i in range(len(photos_to_keep)):
    state = photos_to_keep[i]
    f = open('photos-trimmed-' + str(i) + '.json', 'w')
    f.write(json.dumps(state))
    f.close()