# Chronicling America API

[Chronicling America](https://chroniclingamerica.loc.gov/) is a collection of digitized American newspapers dating from 1777 to 1963 provided by the Library of Congress. The collection offers an application programming interface (API) which allows users to easily harvest large amounts of data.

In this notebook we will search Chronicling America's API, gather the search results into a Pandas dataframe, clean the data, and save it as a csv file.

In [1]:
# imports
import requests
import json
import math
import pandas as pd
import spacy

In [2]:
# initial search
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1865&proxtext="time+travel"&x=20&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)
raw = response.text
results = json.loads(raw)

## Explore search results

In [3]:
results.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

In [4]:
# explore items
print(type(results['items']))

<class 'list'>


In [5]:
print(results['items'][0])

{'sequence': 1, 'county': ['Terrebonne'], 'edition': None, 'frequency': 'Weekly', 'id': '/lccn/sn83026391/1856-09-06/ed-1/seq-1/', 'subject': ['Houma (La.)--Newspapers.', 'Louisiana--Houma.--fast--(OCoLC)fst01218915'], 'city': ['Houma'], 'date': '18560906', 'title': 'Houma Ceres. [volume]', 'end_year': 1899, 'note': ['Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', 'Democrat.', 'With legal notices in English and French.'], 'state': ['Louisiana'], 'section_label': '', 'type': 'page', 'place_of_publication': 'Houma, Parish of Terrebonne, La.', 'start_year': 1855, 'edition_label': '', 'publisher': 'E.W. Blake & Co.', 'language': ['English'], 'alt_title': [], 'lccn': 'sn83026391', 'country': 'Louisiana', 'ocr_eng': 'INDEPENDENT IN ALL TING8-N.UTRAL IN NONM\ni. . AN5E.MO.J DEVOTED TO LITERATURE, THE ARTS, AND NEWS OF THE DAY. 1[S per Anrum.\nOL. II. HO[UMA, PARISH OF TERREBONNE, LA., SATURDAY, SEPTEMBER 6, 1856. NO.

In [6]:
print('totalItems:', results['totalItems'])
print('endIndex:', results['endIndex'])
print('startIndex:', results['startIndex'])
print('itemsPerPage:', results['itemsPerPage'])
print('Length and type of items:', len(results['items']), type(results['items']))

totalItems: 10506
endIndex: 20
startIndex: 1
itemsPerPage: 20
Length and type of items: 20 <class 'list'>


In [7]:
# find total amount of pages
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)

526


In [8]:
# create empty list for data
data = []

In [9]:
# set search parameters
start_date = '1770'
end_date = '1865'
search_term = 'time+travel'
state = ''

In [11]:
# loop through search results and collect data
for i in range(1, total_pages+1):  # for sake of time I'm doing only 10, you will want to put total_pages+1
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state={state}&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')  # f-string
    response = requests.get(url)
    raw = response.text
    print(f'page {i} status code:', response.status_code)  # checking for errors
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        row_data = {}
        try:
          row_data['title'] = item_['title_normal']
        except:
          row_data['city'] = "none"
        try:
          row_data['city'] = item_['city']
        except:
          row_data['city'] = "none"
        try:
          row_data['date'] = item_['date']
        except:
          row_data['date'] = "none"
        try:
          row_data['raw_text'] = item_['ocr_eng']
        except:
          row_data['raw_text'] = 'none'
    data.append(row_data)

page 1 status code: 200
page 2 status code: 200
page 3 status code: 200
page 4 status code: 200
page 5 status code: 200
page 6 status code: 200
page 7 status code: 200
page 8 status code: 200
page 9 status code: 200
page 10 status code: 200
page 11 status code: 200
page 12 status code: 200
page 13 status code: 200
page 14 status code: 200
page 15 status code: 200
page 16 status code: 200
page 17 status code: 200
page 18 status code: 200
page 19 status code: 200
page 20 status code: 200
page 21 status code: 200
page 22 status code: 200
page 23 status code: 200
page 24 status code: 200
page 25 status code: 200
page 26 status code: 200
page 27 status code: 200
page 28 status code: 200
page 29 status code: 200
page 30 status code: 200
page 31 status code: 200
page 32 status code: 200
page 33 status code: 200
page 34 status code: 200
page 35 status code: 200
page 36 status code: 200
page 37 status code: 200
page 38 status code: 200
page 39 status code: 200
page 40 status code: 200
page 41 s

page 321 status code: 200
page 322 status code: 200
page 323 status code: 200
page 324 status code: 200
page 325 status code: 200
page 326 status code: 200
page 327 status code: 200
page 328 status code: 200
page 329 status code: 200
page 330 status code: 200
page 331 status code: 200
page 332 status code: 200
page 333 status code: 200
page 334 status code: 200
page 335 status code: 200
page 336 status code: 200
page 337 status code: 200
page 338 status code: 200
page 339 status code: 200
page 340 status code: 200
page 341 status code: 200
page 342 status code: 200
page 343 status code: 200
page 344 status code: 200
page 345 status code: 200
page 346 status code: 200
page 347 status code: 200
page 348 status code: 200
page 349 status code: 200
page 350 status code: 200
page 351 status code: 200
page 352 status code: 504


JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [12]:
# put data into DataFrame
df = pd.DataFrame.from_dict(data)

In [13]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,delaware journal.,[Wilmington],18280311,s\nT\nK\nTl\n4\nA\n4\njg\n'ËâUe&lrç JÆ. BTadïo...
1,sioux city register.,[Sioux City],18640227,"Sioirt Citir Register\nSATURDAY, FEBRUARY 27, ..."
2,vermont watchman and state journal.,[Montpelier],18500704,"VERMONT WATCHMAN STAE JOURNAL, sJULY 4, 1850.\..."
3,tarboro' press.,[Tarboro],18510315,"-4\ni\nV. 5 279.\nTarborough, Edgecombe County..."
4,st. mary's beacon.,[Leonardtown],18530512,PREPARATION OF SEED CORN. I\nCorn-planting sc&...


### Change date format
Pandas allows us to clean and edit our data easily (relatively). We can first convert the string values in the date column to properly formated dates and then sort the dataframe by date.

In [14]:
# convert date column from string to date-time object
df['date'] = pd.to_datetime(df['date'])

In [15]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,delaware journal.,[Wilmington],1828-03-11,s\nT\nK\nTl\n4\nA\n4\njg\n'ËâUe&lrç JÆ. BTadïo...
1,sioux city register.,[Sioux City],1864-02-27,"Sioirt Citir Register\nSATURDAY, FEBRUARY 27, ..."
2,vermont watchman and state journal.,[Montpelier],1850-07-04,"VERMONT WATCHMAN STAE JOURNAL, sJULY 4, 1850.\..."
3,tarboro' press.,[Tarboro],1851-03-15,"-4\ni\nV. 5 279.\nTarborough, Edgecombe County..."
4,st. mary's beacon.,[Leonardtown],1853-05-12,PREPARATION OF SEED CORN. I\nCorn-planting sc&...


In [16]:
# sort by date
df = df.sort_values(by='date')

In [17]:
df.head()

Unnamed: 0,title,city,date,raw_text
200,"gazette of the united states, & philadelphia d...",[Philadelphia],1799-06-29,"TREASURY DEPARTMENT.\nMarch tlth, <799.\nPUBLI..."
55,enquirer.,[Richmond],1810-08-21,ignorant that it is requisite to have permis\n...
33,"alexandria daily gazette, commercial & political.",[Alexandria],1812-02-13,"I Instead of agreeing to it, postpon\nscn'‘t'l..."
50,"alexandria gazette, commercial and political.",[Alexandria],1815-01-28,. ..._ •---_______;___^> ■ -»■\n'coThse salt.\...
259,elizabeth-town gazette.,[Elizabeth],1818-09-15,POETRY\nFrom the Wilmington JVatehman.\nSATURD...


### Process text
We can now porcess our text for analysis. The text provded by Chronicling America comes from optical character recognition (ocr) and the accuracy of ocr can be low. Here I will remove new line characters (`\n`), stop words, and then lemamtize the text.

**Rememeber** the decisions you make in how to process your text should be based on the kind of analysis you want to do.

In [18]:
# write fuction to process text
# load nlp model
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')  # these are unnecessary for the task at hand

def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

In [19]:
# apply process_text function
# this may take a few minutes
df['lemmas'] = df['raw_text'].apply(process_text)

In [21]:
# save to csv
df.to_csv(f'C:/Users/vivia/Downloads/cls161/cls161_fall23/{search_term}{start_date}-{end_date}.csv', index=False)