# Network Visualization of Chronicling America Newspapers

**Avery Fernandez and Vincent Scalfani**

UA Libraries, Data Services

April 19, 2023

Code in this notebook is MIT licensed. A copy of the license is available here: https://github.com/ualibweb/UALIB_Workshops

Some code in this workshop has been adapted from:
https://github.com/ualibweb/UALIB_Workshops/tree/master/09_Networks

https://ualibweb.github.io/UALIB_ScholarlyAPI_Cookbook/content/scripts/python/python_chronam.html


Chronicling America data is credited to the Library of Congress. Please see https://chroniclingamerica.loc.gov/about/api/ for Data Usage Policies and Disclaimers.

NetworkX Docs: https://networkx.org/documentation/stable/

# For Following Along

```
conda create --name my-env
conda activate my-env
conda install -c conda-forge jupyterlab numpy matplotlib pandas networkx
```
or through pip:

``` 
pip install numpy matplotlib pandas networkx
```

# Overview of the Workflow

1. Use the Chronicling America API to obtain some full-text newspaper data. In this example, we will
search for articles related to "University of Alabama".

2. Clean up some of the obtained OCR text:

   * Converts the text to lowercase.
   * Removes invalid characters (characters other than letters, digits, punctuation marks, and whitespace).
   * Removes excess spaces.
   * Splits the text by punctuation marks (., !, ?) and removes them.
   * Removes any empty strings from the resulting list of sentences.
   
3. Compute combinations and occurrences of co-words

4. Remove occurrences of common, "uninteresting" word pairs

5. Create a basic network of co-words

# 1. Basic Chronicling America API Call

## Import required libraries

In [None]:
import requests
import json
from pprint import pprint
from time import sleep

In [None]:
# Set the API base URL
api = "https://chroniclingamerica.loc.gov/"

# Construct the API request URL
# This request will search for articles related to "University of Alabama" in the state of Alabama
# It will return the top 10 results in JSON format
request_url = (api + "search/pages/results/?state=Alabama&andtext=(University+of+Alabama)"
               "&rows=10&format=json")

# Send the API request and parse the JSON response
request = requests.get(request_url).json()

# Print the 'ocr_eng' field of the first item in the response
# The 'ocr_eng' field contains the OCR text of the newspaper page
pprint(request['items'][0]['ocr_eng'])

In [None]:
# save JSON data to a file
with open('example_request.json', 'w') as outfile:
    json.dump(request, outfile)

In [None]:
# load JSON data from a file
#with open('example_request.json','r') as infile:
#    loadedData = json.load(infile)
#pprint(loadedData['items'][0]['ocr_eng'])

# 2. Get Chronicling America OCR Text in a Loop

The Chronicling America API tends to time out if you request too many rows in a single call, so we fetch the results in pages of 100 rows and pause for one second between calls.
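
One way to cope with occasional timeouts is to retry a failed call a few times, pausing a little longer after each failure. The sketch below is only an illustration: `call_with_retries` and `flaky` are hypothetical names that are not part of the original workshop, and the helper wraps any zero-argument callable so the retry logic can be tried out without touching the network.

```python
import time

def call_with_retries(func, retries=3, delay=1.0, exceptions=(Exception,)):
    """Call func() up to `retries` times, pausing longer after each failure.

    Hypothetical helper for illustration; not part of the original workshop.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # give up after the final attempt
            time.sleep(delay * attempt)  # linear backoff: delay, 2*delay, ...


# Stand-in for an API call that fails twice and then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return {"items": []}

result = call_with_retries(flaky, retries=3, delay=0.01)
print(result, calls["count"])
```

With `requests`, the callable could be something like `lambda: requests.get(api + "search/pages/results/", params=search_params, timeout=30).json()`, with `exceptions=(requests.exceptions.RequestException,)`. The 30-second timeout is an arbitrary choice for this sketch.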

In [None]:
# Set the API base URL
api = "https://chroniclingamerica.loc.gov/"

# Initialize the list to store OCR text data
ocr_eng_data = []

# Set the number of articles you want to fetch
articles_to_fetch = 2000

# Calculate the number of API calls required
api_calls_required = articles_to_fetch // 100
print(api_calls_required)

In [None]:
# Fetch articles in chunks of 100
for i in range(api_calls_required):
    # Define search parameters for the API request
    search_params = {
        "state": "Alabama",
        "andtext": "(University of Alabama)",
        "rows": 100,
        "page": i + 1,
        "format": "json"
    }

    # Send the API request and parse the JSON response
    request = requests.get(api + "search/pages/results/", params=search_params).json()

    # Wait for 1 second between API calls to avoid overloading the server
    sleep(1)

    # Extract 'ocr_eng' data and append to the list of lists
    ocr_eng_data.append([item['ocr_eng'] for item in request['items']])
    
# Flatten the list of lists to a single list
ocr_eng_data_flat = [ocr_eng for sublist in ocr_eng_data for ocr_eng in sublist]

# This is the same, but without a list comprehension
#ocr_eng_data_flat = []
#for sublist in ocr_eng_data:
#    for ocr_engine in sublist:
#        ocr_eng_data_flat.append(ocr_engine)
        

In [None]:
# Save the data as a pickle
import pickle

with open('ocr_eng_data.pickle', 'wb') as outfile:
    pickle.dump(ocr_eng_data, outfile, pickle.HIGHEST_PROTOCOL)

with open('ocr_eng_data_flat.pickle', 'wb') as outfile:
    pickle.dump(ocr_eng_data_flat, outfile, pickle.HIGHEST_PROTOCOL)

In [None]:
# to load pickle data

#import pickle
#with open('ocr_eng_data_flat.pickle', 'rb') as infile:
#    loaded_ocr_eng_data_flat = pickle.load(infile)

In [None]:
# Print the first 2 'ocr_eng' data items
for i, ocr_eng in enumerate(ocr_eng_data_flat[:2]):
    print(f"Article {i + 1}:")
    print(ocr_eng)
    print("------")

The following function clean_and_split_text takes a text string as input and performs the following tasks:

1. Converts the text to lowercase.
2. Removes invalid characters (characters other than letters, digits, punctuation marks, and whitespace).
3. Removes excess spaces.
4. Splits the text by punctuation marks (., !, ?) and removes them.
5. Removes any empty strings from the resulting list of sentences.

The function is then used to process the ocr_eng_data_flat list, resulting in a new list called ocr_eng_sentences. The first 10 cleaned and split text items can be printed by uncommenting the last block of code.
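
To see what these steps do in practice, here is a small worked example on a made-up sentence (the sample text is invented for illustration, not taken from the newspaper data). The function is repeated in compact form so the block runs on its own.

```python
import re

def clean_and_split_text(text):
    # Same steps as described above, condensed for a standalone demo
    text = text.lower()
    cleaned_text = re.sub(r"[^a-zA-Z0-9.,!?'\s]+", "", text)
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()
    sentences = re.split(r"[.!?]\s*", cleaned_text)
    return [sentence for sentence in sentences if sentence]

# Made-up sample with mixed case, a stray symbol, and extra spaces
sample = "The  UNIVERSITY of Alabama* opened in 1831.  Roll Tide!   Who's next?"
print(clean_and_split_text(sample))
# ['the university of alabama opened in 1831', 'roll tide', "who's next"]
```

Note that splitting on every period also breaks abbreviations: a fragment such as "dr. smith" becomes two separate "sentences", `'dr'` and `'smith'`.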

In [None]:
import re

def clean_and_split_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove invalid characters
    cleaned_text = re.sub(r"[^a-zA-Z0-9.,!?'\s]+", "", text)
    
    # Remove excess spaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    # Split text by punctuation and remove it
    sentences = re.split(r'[.!?]\s*', cleaned_text)

    # Remove any empty strings from the list
    sentences = [sentence for sentence in sentences if sentence]

    return sentences

# Process the ocr_eng_data_flat list
ocr_eng_sentences = [clean_and_split_text(ocr_eng) for ocr_eng in ocr_eng_data_flat]

# # Print the first 10 cleaned and split text items
# for i, sentences in enumerate(ocr_eng_sentences[:10]):
#     print(f"Article {i + 1}:")
#     print(sentences)
#     print("------")


The following code snippet performs the following tasks:

1. Initializes an empty list called word_pairs to store the combination pairs of words.
2. Iterates through the list of cleaned and split sentences (ocr_eng_sentences) for each article.
3. For each sentence, it splits the sentence into words.
4. Generates word pairs (consecutive words) from the list of words and adds them to the word_pairs list.
5. Finally, it prints the first 20 word pairs.
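
Because only consecutive words are paired, each sentence of n words yields n - 1 pairs. As a small illustration on a made-up sentence, the same pairing can also be written with `zip`, which is equivalent to the index-based list comprehension used in the cell below.

```python
sentence = "the university of alabama opened in 1831"
words = sentence.split()

# Pair each word with the word that immediately follows it
pairs = list(zip(words, words[1:]))
print(pairs)
# [('the', 'university'), ('university', 'of'), ('of', 'alabama'),
#  ('alabama', 'opened'), ('opened', 'in'), ('in', '1831')]
```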

In [None]:
# Compute combination pairs of words
word_pairs = []
for article_sentences in ocr_eng_sentences:
    for sentence in article_sentences:
        # Split the sentence into words
        words = sentence.split()
        
        # Generate word pairs and add them to the word_pairs list
        word_pairs += [(words[word_idx], words[word_idx+1]) for word_idx in range(len(words)-1)]

# Print the first 20 word pairs
print(word_pairs[:20])

The following code snippet performs the following tasks:

1. Initializes an empty dictionary called counted_pairs to store the occurrences of each word pair.
2. Iterates through the list of word pairs (word_pairs).
3. Removes any remaining punctuation marks (such as commas and apostrophes) from both words in the pair.
4. Checks if the cleaned word pair is already in the dictionary counted_pairs. If so, it increments the count for the existing word pair by 1.
5. If the cleaned word pair is not in the dictionary, it adds the word pair to the dictionary with an initial count of 1.
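
As an alternative sketch (not the approach used in the cell below), the same tally can be written with `collections.Counter` from the Python standard library. The pairs here are made up for illustration, and `strip_punctuation` is a hypothetical helper name.

```python
import re
from collections import Counter

def strip_punctuation(word):
    # Drop anything that is not a word character or whitespace
    return re.sub(r"[^\w\s]", "", word)

# Made-up word pairs, including leftover punctuation
toy_pairs = [("roll", "tide,"), ("roll", "tide"), ("who's", "next")]

counted = Counter(
    (strip_punctuation(word1), strip_punctuation(word2))
    for word1, word2 in toy_pairs
)
print(counted[("roll", "tide")])  # 2
print(counted[("whos", "next")])  # 1
```

A `Counter` returns 0 for pairs it has never seen, so no explicit membership check is needed.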

In [None]:
# Import the regular expression library
import re

# Count occurrences of word pairs
counted_pairs = {}
for pair in word_pairs:
    # Remove any punctuation marks from the words in the pair
    word1 = re.sub(r'[^\w\s]', '', pair[0])
    word2 = re.sub(r'[^\w\s]', '', pair[1])
    cleaned_pair = (word1, word2)
    
    # Check if the cleaned word pair is already in the dictionary
    if cleaned_pair in counted_pairs:
        # Increment the count for the existing cleaned word pair
        counted_pairs[cleaned_pair] += 1
    else:
        # Add a new cleaned word pair to the dictionary with an initial count of 1
        counted_pairs[cleaned_pair] = 1

In [None]:
list(counted_pairs.items())[0:25]

This code snippet performs the following tasks:

1. Defines a list of common words (common_words) to exclude from the filtered results.
2. Creates a new dictionary called filtered_pairs that contains word pairs from the counted_pairs dictionary, filtered based on the following conditions:
    * The first word of the pair (key[0]) should not be in the list of common words.
    * The second word of the pair (key[1]) should not be in the list of common words.
    * The occurrence count of the word pair (value) should be greater than 250.
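
The three conditions can be checked on a toy dictionary first. All counts below are invented for illustration, and `stop_words` and `min_count` are hypothetical names. A set is used for the common words because membership tests on a set are faster than on a list, though either works.

```python
# Made-up counts for illustration only
toy_counts = {
    ("university", "of"): 900,       # dropped: 'of' is a common word
    ("state", "university"): 400,    # kept
    ("tuscaloosa", "county"): 300,   # kept
    ("roll", "tide"): 250,           # dropped: count is not greater than 250
    ("football", "team"): 120,       # dropped: count too low
}
stop_words = {"of", "the", "and"}
min_count = 250

kept = {
    pair: count
    for pair, count in toy_counts.items()
    if pair[0] not in stop_words and pair[1] not in stop_words and count > min_count
}
print(kept)
# {('state', 'university'): 400, ('tuscaloosa', 'county'): 300}
```

The comparison is strict, so a pair that occurs exactly 250 times is excluded.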

In [None]:
# List of common words and numbers to exclude from the results
common_words = [
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
    'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
    'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was',
    'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',
    'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
    'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
    'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
    'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own',
    'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now',
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
]

# Filter word pairs by occurrence count and exclude common words and numbers
filtered_pairs = {
    key: value
    for key, value in counted_pairs.items()
    if key[0] not in common_words and key[1] not in common_words and value > 250
}


In [None]:
# Sort the word pairs by highest frequency
sorted_pairs = dict(sorted(filtered_pairs.items(), key=lambda x: x[1], reverse=True))

# Print the top 10 most frequent word pairs
list(sorted_pairs.items())[0:10]


The code snippet above performs the following tasks:

1. Sorts the filtered_pairs dictionary by the occurrence count (value) in descending order. The key parameter in the sorted function is set to a lambda function that sorts the items by their values. The reverse parameter is set to True to sort the items in descending order. The sorted items are then used to create a new dictionary called sorted_pairs.
2. Prints the top 10 most frequent word pairs by slicing the sorted items of the sorted_pairs dictionary and converting them back to a list.
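
A minimal illustration of the same sorting pattern on made-up counts:

```python
# Made-up counts for illustration only
toy_filtered = {
    ("state", "university"): 400,
    ("roll", "tide"): 950,
    ("tuscaloosa", "county"): 300,
}

# Sort (pair, count) items by count, largest first
ranked = sorted(toy_filtered.items(), key=lambda item: item[1], reverse=True)
sorted_toy = dict(ranked)  # dicts keep insertion order in Python 3.7+

print(ranked[:2])
# [(('roll', 'tide'), 950), (('state', 'university'), 400)]
```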

In [None]:
len(filtered_pairs.items())

In [None]:
filtered_pairs

In [None]:
for pair in filtered_pairs.items():
    if not (isinstance(pair, tuple) and len(pair) == 2 and
            isinstance(pair[0], tuple) and len(pair[0]) == 2 and
            all(isinstance(s, str) for s in pair[0]) and
            isinstance(pair[1], int)):
        raise ValueError(f"Invalid value: {pair}\n"+ "Invalid format: filtered_pairs must be a dictionary of tuples with the pattern {(String1, String2): Integer, ...}")

In [None]:
from coword_network import Coword_Network
networks = Coword_Network(filtered_pairs)

In [None]:
networks.display_main_graph()

In [None]:
networks.display_focused_graph()

In [None]:
networks.sorted_graphs[:10]

In [None]:
networks.display_selected_network(7)
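
`Coword_Network` comes from a local `coword_network` module whose code is not shown in this notebook, so nothing is assumed here about how it works internally. As a rough, independent sketch, a dictionary shaped like filtered_pairs can be turned into weighted edges for an undirected graph. One design decision worth noting is that ("state", "university") and ("university", "state") are separate keys in the dictionary but describe the same undirected edge; this sketch merges them by summing their counts. The counts are made up for illustration.

```python
from collections import defaultdict

# Made-up counts for illustration only
toy_filtered = {
    ("state", "university"): 400,
    ("university", "state"): 260,
    ("roll", "tide"): 950,
}

# Merge (a, b) and (b, a) so each undirected edge appears once
edge_weights = defaultdict(int)
for (word1, word2), count in toy_filtered.items():
    edge_key = tuple(sorted((word1, word2)))
    edge_weights[edge_key] += count

# Build (node, node, weight) triples, sorted for stable output
weighted_edges = sorted((u, v, w) for (u, v), w in edge_weights.items())
print(weighted_edges)
# [('roll', 'tide', 950), ('state', 'university', 660)]
```

A list of `(u, v, weight)` triples like this is the input shape accepted by NetworkX's `Graph.add_weighted_edges_from`, so it could be loaded into a `networkx.Graph()` and drawn with `networkx.draw`.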