# Newspaper Navigator Data

This notebook takes the downloaded data from the [Newspaper Navigator](https://news-navigator.labs.loc.gov/) project and organizes it by location.

Start by importing the modules we need and opening the downloaded files. We'll only need the json module for this one, and for now we're starting with just the 1850 headline and photo files.

In [1]:
import json

In [16]:
with open('1850_headlines.json') as f:
    headlines = json.load(f)
    
with open('1850_photos.json') as f:
    photos = json.load(f)

Next, define a function that counts the records for each place of publication. This will help us see how many distinct locations there are and how much we may need to clean this data.

In [7]:
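def get_places(records):
    # Count how many records were published in each place.
    # The parameter is not named `json`, so it does not shadow the json module.
    temp = dict()
    for element in records:
        place = element['place_of_publication']
        if place not in temp:
            temp[place] = 0
        temp[place] += 1
    return temp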
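
The same tally can also be written with `collections.Counter` from the standard library. This is only a minimal sketch: `count_places` and the sample records are illustrative names and data, not part of the original notebook.

```python
from collections import Counter

def count_places(records):
    """Tally records by their 'place_of_publication' value."""
    return Counter(r['place_of_publication'] for r in records)

# Hypothetical sample records, for illustration only
sample = [
    {'place_of_publication': 'Cadiz, Ohio'},
    {'place_of_publication': 'Bath, Me.'},
    {'place_of_publication': 'Cadiz, Ohio'},
]
place_counts = count_places(sample)
print(place_counts.most_common())
```

A `Counter` behaves like a dictionary, and `most_common()` lists the places with the most records first, which is handy when scanning for the largest sources.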

In [8]:
headline_locations = get_places(headlines)

In [9]:
headline_locations

{'Fayetteville, Tenn.': 3,
 'Pickens Court House, S.C.': 113,
 'Fayetteville [N.C.]': 42,
 'Glasgow, Mo.': 72,
 'Fremont [Ohio]': 3,
 'Fremont, Sandusky County, Ohio': 3,
 'Jackson, Miss.': 12,
 'Holly Springs, Misp. [i.e. Miss.]': 2,
 'Cadiz, Ohio': 44,
 'Bath, Me.': 15,
 '[Reading, Pa.': 61,
 'Abbeville, S.C.': 2,
 'Montpelier, Vt.': 58,
 'Canton, Stark County, Ohio': 81,
 'Charlotte, N.C.': 32,
 'Sunbury, Pa.': 29,
 'Alexandria, D.C.': 1232,
 'Davenport, Scott Co., Iowa': 31,
 'New-Lisbon, Ohio': 73,
 'Indianapolis': 172,
 'Staunton, Va.': 266,
 'Woodville, Miss.': 32,
 'Washington [D.C.]': 2537,
 'Washington City [i.e. Washington, D.C.]': 6,
 'Allentaun, Pa.': 163,
 'Athens, Tenn.': 13,
 'St. Johnsbury, Vt.': 97,
 'Raleigh, N.C.': 41,
 'Camden, S.C.': 128,
 'Hillsborough, N.C.': 66,
 'Paris, Me.': 122,
 'Tarborough, (Edgecombe Co., N.C.)': 18,
 'Lewisburg, Pa.': 31,
 '[Ottumwa] Iowa': 168,
 'Lancaster, Ohio': 11,
 'Portsmouth, O. [Ohio]': 7,
 'Hillsdale, Mich.': 227,
 'Ottawa, Ill.

In [10]:
photo_locations = get_places(photos)

In [11]:
photo_locations

{'Fremont, Sandusky County, Ohio': 10,
 'St. Johnsbury, Vt.': 7,
 'Hillsborough, N.C.': 1,
 'Canton, Miss.': 3,
 'Richmond, IA [i.e. Ind.]': 1}

We can see that the places of publication are not standardized: Fremont, Ohio, for example, appears as both 'Fremont [Ohio]' and 'Fremont, Sandusky County, Ohio'. That makes it much more difficult to correlate them to the Sanborn and geographic data in the project. Let's take a look at the geographic_coverage field instead. This data should be more standardized, but it may create other complications because a single record can list multiple locations.
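
To give a sense of what cleaning place_of_publication would involve, here is a rough sketch that strips brackets and parentheses and pulls out any '[i.e. ...]' correction. The function name `split_correction` and the regular expressions are assumptions made for illustration; they are not part of the original notebook or of the Newspaper Navigator tooling.

```python
import re

def split_correction(place):
    """Return (cleaned_text, correction) for a place-of-publication string.

    correction is the text inside an '[i.e. ...]' note, or None if absent.
    """
    match = re.search(r'\[i\.e\.\s*(.+?)\]', place)
    correction = match.group(1).strip() if match else None
    # Drop the whole '[i.e. ...]' note, then any leftover brackets or parentheses
    text = re.sub(r'\[i\.e\..*?\]', '', place)
    text = re.sub(r'[\[\]()]', '', text)
    # Collapse repeated whitespace and trim the ends
    text = ' '.join(text.split())
    return text, correction

for raw in ['[Reading, Pa.', 'Fremont [Ohio]', 'Holly Springs, Misp. [i.e. Miss.]']:
    print(raw, '->', split_correction(raw))
```

Even after this cleanup, 'Fremont Ohio' and 'Fremont, Sandusky County, Ohio' remain distinct strings. The correction also cannot be applied blindly: it replaces one word in 'Richmond, IA [i.e. Ind.]' but the whole name in 'Washington City [i.e. Washington, D.C.]'. That is part of why geographic_coverage is the more promising field.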

In [26]:
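def get_geogs(records):
    # Count records per geographic_coverage entry.
    # A record can list several places, so it is counted once for each of them.
    temp = dict()
    for element in records:
        geogs = element['geographic_coverage']
        for g in geogs:
            if g not in temp:
                temp[g] = 0
            temp[g] += 1
    return temp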

In [27]:
headline_geogs = get_geogs(headlines)

In [28]:
headline_geogs

{'Tennessee--Lincoln--Fayetteville': 3,
 'South Carolina--Oconee--Walhalla': 113,
 'South Carolina--Pickens--Pickens': 113,
 'North Carolina--Cumberland--Fayetteville': 42,
 'Missouri--Howard--Glasgow': 72,
 'Ohio--Sandusky--Fremont': 6,
 'Mississippi--Hinds--Jackson': 12,
 'Mississippi--Marshall--Holly Springs': 2,
 'Ohio--Harrison--Cadiz': 44,
 'Maine--Sagadahoc--Bath': 15,
 'Pennsylvania--Berks--Reading': 61,
 'South Carolina--Abbeville--Abbeville': 2,
 'Vermont--Washington--Montpelier': 58,
 'Ohio--Stark--Canton': 81,
 'North Carolina--Mecklenburg--Charlotte': 32,
 'Pennsylvania--Northumberland--Sunbury': 29,
 'District of Columbia--Alexandria': 1232,
 'Virginia--Alexandria': 1232,
 'Virginia--Alexandria--Alexandria': 1232,
 'Iowa--Scott--Davenport': 31,
 'Ohio--Columbiana--New Lisbon': 73,
 'Ohio--Columbiana--Salem': 73,
 'Indiana--Marion--Indianapolis': 172,
 'Virginia--Staunton': 266,
 'Mississippi--Wilkinson--Woodville': 32,
 'District of Columbia--Washington': 2543,
 'Pennsylv

In [29]:
photo_geogs = get_geogs(photos)

In [30]:
photo_geogs

{'Ohio--Sandusky--Fremont': 10,
 'Vermont--Caledonia--Saint Johnsbury': 7,
 'North Carolina--Orange--Hillsboro': 1,
 'Mississippi--Madison--Canton': 3,
 'Indiana--Wayne--Richmond': 1}

## Applying to the Project

In order to limit the amount of newspaper data to include, I'm going to focus on the counties that have Sanborn maps in them.
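
One way to make that county filter concrete is to split each geographic_coverage string on `--` and compare the state and county against a list of counties. The sketch below assumes only the two shapes visible in the output above ('State--County--City' and two-part values such as 'Virginia--Staunton'). The names `parse_geog`, `in_counties`, and `sanborn_counties`, along with the sample data, are hypothetical and stand in for whatever county list the project actually uses.

```python
def parse_geog(geog):
    """Split a geographic_coverage string into (state, county, city).

    Two-part values such as 'Virginia--Staunton' have no county level,
    so county is returned as None for them.
    """
    parts = geog.split('--')
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        return parts[0], None, parts[1]
    # Any other shape is raised so that unexpected values surface early
    raise ValueError('unexpected geographic_coverage value: ' + geog)

def in_counties(record, wanted):
    """True if any place in the record falls in a wanted (state, county) pair."""
    for geog in record['geographic_coverage']:
        state, county, _city = parse_geog(geog)
        if (state, county) in wanted:
            return True
    return False

# Hypothetical set of counties with Sanborn maps, for illustration only
sanborn_counties = {('Ohio', 'Stark'), ('Vermont', 'Caledonia')}

# Hypothetical records shaped like the headline metadata
sample = [
    {'geographic_coverage': ['Ohio--Stark--Canton']},
    {'geographic_coverage': ['Virginia--Staunton']},
    {'geographic_coverage': ['Ohio--Columbiana--New Lisbon', 'Ohio--Columbiana--Salem']},
]
kept = [r for r in sample if in_counties(r, sanborn_counties)]
print(len(kept), 'of', len(sample), 'records kept')
```

Matching on the (state, county) pair rather than the county name alone avoids confusing same-named counties in different states.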

Start by importing the requests module so we can programmatically get data instead of downloading and opening each file.

In [42]:
import requests

In [None]:
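all_headlines = []

# Request one prepackaged headlines file per year, 1850 through 1963.
# Each item appended is a requests.Response; call .json() on it to get the parsed records.
for n in range(1850, 1964):
    all_headlines.append(requests.get('https://news-navigator.labs.loc.gov/prepackaged/' + str(n) + '_headlines.json'))
    print(str(n) + '_headlines.json')  # progress indicator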

1850_headlines.json
1851_headlines.json
1852_headlines.json
1853_headlines.json
1854_headlines.json
1855_headlines.json
1856_headlines.json
1857_headlines.json
1858_headlines.json
1859_headlines.json
1860_headlines.json
1861_headlines.json
1862_headlines.json
1863_headlines.json
1864_headlines.json
1865_headlines.json
1866_headlines.json
1867_headlines.json
1868_headlines.json
1869_headlines.json
1870_headlines.json
1871_headlines.json
1872_headlines.json
1873_headlines.json
1874_headlines.json
1875_headlines.json
1876_headlines.json
1877_headlines.json
1878_headlines.json
1879_headlines.json
1880_headlines.json
1881_headlines.json
1882_headlines.json
1883_headlines.json
1884_headlines.json
1885_headlines.json
1886_headlines.json
1887_headlines.json
1888_headlines.json
1889_headlines.json
1890_headlines.json
1891_headlines.json
1892_headlines.json
1893_headlines.json
1894_headlines.json
1895_headlines.json
1896_headlines.json
1897_headlines.json
1898_headlines.json
1899_headlines.json
