# Getting OCR files from Chronicling America in Bulk

This Jupyter notebook lets you download OCR text files in bulk from [Chronicling America](https://chroniclingamerica.loc.gov/), a Library of Congress database of digitized historical newspapers.

* Go to [Chronicling America](https://chroniclingamerica.loc.gov/) and run a search.
* Copy the URL of the search results page. You will enter it below.
* Run the cells below in order. Press Shift + Enter to run a cell.

In [53]:
#This code imports the Python libraries we need.
#The NLTK stop-word list needs a one-time download first: import nltk; nltk.download('stopwords')
from bs4 import BeautifulSoup 
import requests
import re
import csv
from nltk.corpus import stopwords
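
The `stopwords` import is used later in the notebook to strip common English words from each OCR file. A minimal sketch of that filtering step follows, assuming a tiny hand-written word list (`STOP_WORDS`) in place of the NLTK corpus so that it runs without any downloaded data. `strip_stop_words` is a hypothetical helper name, not part of the notebook.

```python
import re

# A tiny stand-in stop-word list so the sketch runs without NLTK data;
# the notebook itself uses stopwords.words('english').
STOP_WORDS = ["the", "of", "and", "a", "in", "to", "was"]

# Compile the pattern once; re.escape guards against special characters.
STOP_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOP_WORDS) + r")\b\s*"
)


def strip_stop_words(text):
    """Flatten newlines, lowercase the text, and drop stop words."""
    flattened = text.replace("\n", " ").lower()
    return STOP_PATTERN.sub("", flattened)


print(strip_stop_words("The trial of John Scopes\nwas held in Dayton"))
```

Compiling the pattern once and reusing it avoids rebuilding the same regular expression for every file.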

In [54]:
#Enter your search URL
search_url = input('Enter your search URL from Chronicling America: ')

Enter your search URL from Chronicling America: https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1925&date2=1925&proxtext=scopes+trial&x=19&y=15&dateFilterType=yearRange&rows=20&searchType=basic
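
Before running the download cell, it can help to check that the pasted URL encodes the search you intended. The sketch below uses the standard library's `urllib.parse` to list the query parameters of the example URL above. `describe_search` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Example search URL copied from the cell above.
example_url = (
    "https://chroniclingamerica.loc.gov/search/pages/results/"
    "?state=&date1=1925&date2=1925&proxtext=scopes+trial&x=19&y=15"
    "&dateFilterType=yearRange&rows=20&searchType=basic"
)


def describe_search(url):
    """Return the non-empty query parameters of a search URL as a dict."""
    query = urlsplit(url).query
    # parse_qs drops blank values (such as state=) by default
    # and decodes '+' as a space.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = describe_search(example_url)
print(params["proxtext"], params["date1"], params["date2"], params["rows"])
```

Here `proxtext` holds the search terms, `date1` and `date2` the year range, and `rows` the number of results per page.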


In [55]:
#This is the code that finds the URLs and downloads the OCR files as plain text. 

r = requests.get(search_url)
soup=BeautifulSoup(r.text, 'html.parser')

pagination = soup.findAll('div',attrs={"class": "left"})

#Finding the last page
pages = []
for div in pagination: 
    links = div.findAll('a')
    for a in links:
        pages.append(a.text)
        
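#The second-to-last pagination link is assumed to hold the last page number;
#this raises an error if the pagination links are missing or laid out differently.
last_page = int(pages[-2])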
print('The last page is', last_page, '...')

#Saving all the page urls
search_result_urls = []
for number in range(last_page):
    search_result_urls.append(search_url + '&page=' + str(number+1))
print('Saved all the page urls ...')
    
#Getting all the links to OCR content
ocr_links = []
for url in search_result_urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    ids = soup.findAll(attrs={"name": "id"})
    for item in ids:
        ocr_links.append('https://chroniclingamerica.loc.gov' + item['value'] + 'ocr.txt')
print('pulled the ids for all the relevant files...')

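#Creating CSV with Metadata
#The slices below assume the fixed layout of the links built above:
#https://chroniclingamerica.loc.gov/lccn/<10-character id>/YYYY-MM-DD/.../ocr.txt
sn = []
date = []
filename = []
for link in ocr_links:
    sn.append(link[42:50]) #the id without its two-character prefix
    date.append(link[51:61]) #issue date, YYYY-MM-DD
    filename.append(link[40:].replace('/','_'))

#newline='' keeps the csv module from writing blank rows on Windows
with open('metadata.csv','w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(zip(sn, date, filename))
print('Saved metadata in CSV...')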

#Creating and Saving Text Files
import os
os.makedirs('text-files', exist_ok=True) #make sure the output folder exists

#Compile the stop-word pattern once instead of once per file
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')

n = 1
for link in ocr_links:
    req = requests.get(link).text.replace('\n',' ').lower()
    req = pattern.sub('', req)
    filename = link[40:].replace('/','_')
    with open('text-files/' + filename, 'w', encoding='utf-8') as f:
        f.write(req)
    if n % 50 == 0: #printing out progress status for every 50 files
        print('Processed ' + str(n) + ' of ' + str(len(ocr_links)))
    n = n + 1
print('DONE!')

The last page is 7 ...
Saved all the page urls ...
pulled the ids for all the relevant files...
Saved metadata in CSV...
Processed 50 of 137
Processed 100 of 137
DONE!
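
The metadata step above relies on fixed character positions (`link[42:50]`, `link[51:61]`, `link[40:]`). As a hedged alternative, the same fields can be read from the path segments of the link, which does not depend on the identifier having a particular length. The sketch assumes links shaped like `/lccn/<id>/<date>/ed-<n>/seq-<n>/ocr.txt`. `parse_ocr_link` and the sample link are made up for illustration.

```python
from urllib.parse import urlsplit


def parse_ocr_link(link):
    """Split an OCR link into (identifier, date, filename) by path segment."""
    # Assumed path shape: /lccn/<id>/<yyyy-mm-dd>/ed-<n>/seq-<n>/ocr.txt
    parts = urlsplit(link).path.strip("/").split("/")
    if len(parts) < 6 or parts[0] != "lccn":
        raise ValueError("Unexpected OCR link layout: " + link)
    identifier, date = parts[1], parts[2]
    # Same naming scheme as the notebook: path after /lccn/ with '/' -> '_'
    filename = "_".join(parts[1:])
    return identifier, date, filename


# Hypothetical link, made up for illustration only.
sample = (
    "https://chroniclingamerica.loc.gov"
    "/lccn/sn00000000/1925-07-10/ed-1/seq-3/ocr.txt"
)
print(parse_ocr_link(sample))
```

Unlike the `sn` slice in the notebook, this sketch returns the full identifier including its prefix. The filename it produces matches `link[40:].replace('/','_')` for links of the assumed shape.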
