<a href="https://colab.research.google.com/github/rbutterhof/Week4Warmup/blob/main/SentimentAnalysisWithChronAmPages_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis on Chronicling America Pages

In this notebook we'll run sentiment analysis and noun-phrase extraction on a collection of PDF files from Chronicling America.  Before you start, create a folder called "LibraryJuicePython" in your Google Drive.

In [None]:
#Install / load our libraries and connect to Google Drive

!pip install pypdf

#Colab has a special library just for working with files on google drive
from google.colab import drive


#This is a new library we'll use to extract text from PDFs
from pypdf import PdfReader


import textblob
from textblob import TextBlob
import pandas as pd
import numpy as np
import glob

#Some extra libraries we'll need for text analysis
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('punkt_tab')

#Connect to Gdrive
drive.mount('/content/gdrive')

print("Libraries and Drive Ready!")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Mounted at /content/gdrive
Libraries and Drive Ready!


Here's a quick demo of how sentiment analysis works.  Polarity measures positive or negative sentiment, and it ranges from -1 (very negative) to 1 (very positive).  Subjectivity ranges from 0 (factual statement) to 1 (personal opinion). Feel free to swap in your own sentences below to see how they rate.

In [None]:
happy_sentence = "I like chickens hatching."
sad_sentence = "This is just awful."
opinion_sentence = "I feel miserable."
factual_sentence = "The fish is swimming"

print("Sentiment of happy sentence ", TextBlob(happy_sentence).sentiment)
print("Sentiment of sad sentence ", TextBlob(sad_sentence).sentiment)
print("Sentiment of opinion sentence ", TextBlob(opinion_sentence).sentiment)
print("Sentiment of factual sentence ", TextBlob(factual_sentence).sentiment)

# polarity ranges from -1 to 1.
# subjectivity ranges from 0 to 1.


Sentiment of happy sentence  Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment of sad sentence  Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment of opinion sentence  Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment of factual sentence  Sentiment(polarity=0.0, subjectivity=0.0)


In [None]:
import requests
import os
import pandas as pd

This part queries the Chronicling America API to get PDFs for sentiment analysis.  You can paste in any Chronicling America search URL here.  I searched for pages with "lincoln assassination" published on 4/17/1865 in a few states.  I limited my search results to only 3 for this notebook just to make sure it would run.

In [None]:
# Perform Query - Paste your API Search Query URL into the searchURL
searchURL = 'https://www.loc.gov/collections/chronicling-america/?dl=page&end_date=1865-04-17&fa=location_state:district+of+columbia+OR+illinois+OR+indiana+OR+iowa+OR+maine+OR+massachusetts+OR+nevada+OR+virginia+OR+west+virginia&ops=~5&qs=lincoln+assassination&searchType=advanced&start_date=1865-04-17'

# Add your desired file type (extension). Options Include: pdf, jpeg, and xml (OCR files)
fileExtension = 'pdf'

# Add your Local saveTo Location
saveTo = '/content/gdrive/MyDrive/LibraryJuicePython/'
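
To see what the search step does with `searchURL`, here is a minimal sketch that appends the same JSON parameters (`fo`, `c`, `at`) the function below passes to `requests.get`. It makes no network call. The helper name `add_api_params` and the shortened `example_url` are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_api_params(search_url, per_page=100):
    """Return search_url with the JSON API parameters added to its query string.

    Hypothetical helper for illustration; the notebook lets requests do this.
    """
    parts = urlsplit(search_url)
    # Keep whatever search filters are already in the URL
    query = parse_qsl(parts.query, keep_blank_values=True)
    # fo=json asks for JSON, c is results per page, at limits the response keys
    query += [("fo", "json"), ("c", str(per_page)), ("at", "results,pagination")]
    return urlunsplit(parts._replace(query=urlencode(query)))

# Shortened, illustrative version of the search URL used in this notebook
example_url = "https://www.loc.gov/collections/chronicling-america/?dl=page&qs=lincoln+assassination"
api_url = add_api_params(example_url)
print(api_url)
```

Printing the URL this way can help when debugging a search that returns nothing: you can paste it into a browser and inspect the raw JSON.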

In [None]:
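def get_item_ids(url, items=None, conditional='True'):
    '''Run P1 search and get a list of results.'''
    # Start with a fresh list; a mutable default ([]) would be shared across calls
    if items is None:
        items = []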
    # Check that the query URL is not an item or resource link.
    exclude = ["loc.gov/item","loc.gov/resource"]
    if any(string in url for string in exclude):
        raise ValueError('Your URL points directly to an item or '
                        'resource page (you can tell because "item" '
                        'or "resource" is in the URL). Please use '
                        'a search URL instead. For example, instead '
                        'of \"https://www.loc.gov/item/2009581123/\", '
                        'try \"https://www.loc.gov/maps/?q=2009581123\". ')

    # request pages of 100 results at a time
    params = {"fo": "json", "c": 100, "at": "results,pagination"}
    call = requests.get(url, params=params)
    # Check that the API request was successful
    if call.status_code == 200 and 'json' in call.headers.get('content-type', ''):
        data = call.json()
        results = data['results']
        for result in results:
            # Filter out anything that's a collection or web page
            original_format = result.get("original_format") or []
            filter_out = ("collection" in original_format) \
                    or ("web page" in original_format) \
                    or (eval(conditional)==False)
            if not filter_out:
                # Get the link to the item record
                if result.get("id"):
                    item = result.get("id")
                    # Filter out links to Catalog or other platforms
                    if item.startswith("http://www.loc.gov/resource"):
                        items.append(item)
                    if item.startswith("http://www.loc.gov/item"):
                        items.append(item)
        # Repeat the loop on the next page, unless we're on the last page.
        if data["pagination"]["next"] is not None:
            next_url = data["pagination"]["next"]
            get_item_ids(next_url, items, conditional)

        return items
    else:
        print('There was a problem. Try running the cell again, or check your searchURL.')
        # Return what we have so far so the code below still gets a list
        return items

# Create ids_list based on searchURL results
ids_list = get_item_ids(searchURL, items=[])

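# Add 'fo=json' to each item URL so the API returns JSON instead of a web page
new_ids = []
for item_id in ids_list:
  if 'fo=json' not in item_id:
    # Use '&' if the URL already has a query string, '?' otherwise
    item_id += ('&' if '?' in item_id else '?') + 'fo=json'
  new_ids.append(item_id)
ids = new_ids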

print('\nSuccess. Your API Search Query found '+str(len(new_ids))+' related newspaper pages. You may now continue.')


Success. Your API Search Query found 21 related newspaper pages. You may now continue.
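
The function above calls itself to move from one page of results to the next. As a rough sketch of the same pagination idea written as a loop, the example below follows `pagination["next"]` until it is `None`. `FAKE_PAGES`, `fetch_page`, and `collect_ids` are made-up stand-ins so the example runs without a network connection; a real version would call `requests.get(...).json()` instead.

```python
# Hypothetical stand-in for the API: each "URL" maps to one page of JSON results
FAKE_PAGES = {
    "page1": {
        "results": [{"id": "http://www.loc.gov/resource/a/"},
                    {"id": "http://www.loc.gov/resource/b/"}],
        "pagination": {"next": "page2"},
    },
    "page2": {
        "results": [{"id": "http://www.loc.gov/resource/c/"}],
        "pagination": {"next": None},
    },
}

def fetch_page(url):
    """Pretend to call the API and return one page of parsed JSON."""
    return FAKE_PAGES[url]

def collect_ids(first_url):
    """Follow pagination['next'] links until there are no more pages."""
    ids = []
    url = first_url
    while url is not None:
        data = fetch_page(url)
        # Keep the id of every result that has one
        ids.extend(result["id"] for result in data["results"] if result.get("id"))
        url = data["pagination"]["next"]
    return ids

all_ids = collect_ids("page1")
print(len(all_ids), "ids collected")
```

A loop avoids Python's recursion limit on very large searches, though for a handful of pages either approach works.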


This part downloads the PDFs to your Google Drive.

In [None]:
print('\n'+str(len(new_ids))+' Downloaded Files')

# Collect (and print) every page URL that matches fileExtension.
# One pass does both, so each item is only requested once.
page_urls = []
for item in new_ids:
    call = requests.get(item)
    if call.status_code == 200:
        data = call.json()
        # Each entry in data['page'] describes one file format for this page
        for page_file in data.get('page', []):
            page_url = page_file.get('url', '')
            if page_url.endswith(fileExtension):
                print(page_url)
                page_urls.append(page_url)



for page_url in page_urls:
    # Extract parts of the URL to create the desired filename
    url_parts = page_url.split('/')
    sn_number = url_parts[9]
    batch = url_parts[7]
    year_month = url_parts[11]
    original_filename = url_parts[-1]

    # Create the new filename including the LCCN
    filename = f"{batch}_{sn_number}_{year_month}_{original_filename}"

    # Download the file
    response = requests.get(page_url)
    file_path = os.path.join(saveTo, filename) # Save directly to the saveTo folder
    with open(file_path, 'wb') as f:
        f.write(response.content)

print('\nSuccess! Please check your saveTo location to see the saved files. You can also redownload the selected files using the links above.')


3 Downloaded Files
https://tile.loc.gov/storage-services/service/ndnp/iahi/batch_iahi_hypno_ver01/data/sn83045646/00279529170/1865041701/0514.pdf
https://tile.loc.gov/storage-services/service/ndnp/iahi/batch_iahi_hypno_ver01/data/sn83045646/00279529170/1865041701/0515.pdf
https://tile.loc.gov/storage-services/service/ndnp/iahi/batch_iahi_hypno_ver01/data/sn83045646/00279529170/1865041701/0517.pdf

Success! Please check your saveTo location to see the saved files. You can also redownload the selected files using the links above.
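
The download loop builds each filename by splitting the page URL on `/` and picking out pieces by position. Here is a small sketch of just that step, using one of the URLs printed above. The function name `build_filename` is an assumption for illustration, and the positions only hold for URLs laid out like these `tile.loc.gov` links.

```python
def build_filename(page_url):
    """Build a descriptive filename from a tile.loc.gov page URL.

    Assumes the URL layout shown in the output above; other layouts
    would need different positions.
    """
    parts = page_url.split('/')
    batch = parts[7]               # e.g. batch_iahi_hypno_ver01
    sn_number = parts[9]           # the newspaper's LCCN, e.g. sn83045646
    issue = parts[11]              # issue date plus edition, e.g. 1865041701
    original_filename = parts[-1]  # e.g. 0514.pdf
    return f"{batch}_{sn_number}_{issue}_{original_filename}"

example = ("https://tile.loc.gov/storage-services/service/ndnp/iahi/"
           "batch_iahi_hypno_ver01/data/sn83045646/00279529170/1865041701/0514.pdf")
print(build_filename(example))
# prints: batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf
```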


**H1**
If you want to do this for non-ChronAm pages, just put any PDFs in your "LibraryJuicePython" Google Drive folder, and the following code will run sentiment analysis on them.


In [None]:
#Run this cell to see the files in your directory
for file in glob.glob("/content/gdrive/MyDrive/LibraryJuicePython/*.pdf"):
  print(file)

/content/gdrive/MyDrive/LibraryJuicePython/batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf
/content/gdrive/MyDrive/LibraryJuicePython/batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf
/content/gdrive/MyDrive/LibraryJuicePython/batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf


# Extracting the text from the PDFs

We'll use the pypdf library to extract the text of each PDF and save it as a text file.

In [None]:

for file in glob.glob("/content/gdrive/MyDrive/LibraryJuicePython/*.pdf"):
  filename = file.split("/")[-1]
  print("Extracting text for... ",filename)

  text = ""
  #This bit is new!
  reader = PdfReader(file)
  for page in reader.pages:
    text += page.extract_text()

  #Output the string variable into a text file
  output_text_file = filename+".txt"
  with open(output_text_file, "w") as text_file:
    text_file.write(text)

Extracting text for...  batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf
Extracting text for...  batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf
Extracting text for...  batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf


# Sentiment of our PDFs

Let's just print our sentiment scores to the screen for now.

In [None]:
for file in glob.glob("*.txt"):
  print("Sentiment for ",file)
  with open(file,"r") as text_file:
    text = text_file.read()
    blob = textblob.TextBlob(text)
    print(blob.sentiment)
    print("---")

Sentiment for  batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf.txt
Sentiment(polarity=0.08061743269751084, subjectivity=0.4410866397224572)
---
Sentiment for  batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf.txt
Sentiment(polarity=0.14144005831060988, subjectivity=0.48240064601274585)
---
Sentiment for  batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf.txt
Sentiment(polarity=0.11220078550787209, subjectivity=0.4379428329034628)
---
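
Raw polarity numbers can be hard to read at a glance. One possible way to interpret them is to bucket each score into a label, as in the sketch below. The `label_polarity` helper and its 0.05 cutoff are arbitrary assumptions made for illustration (TextBlob does not define them), and the scores are rounded from the output above.

```python
def label_polarity(polarity, threshold=0.05):
    """Turn a polarity score (-1 to 1) into a rough label.

    The threshold is an arbitrary illustrative choice, not a TextBlob setting.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity scores rounded from the three pages analyzed above
scores = {"0517": 0.0806, "0514": 0.1414, "0515": 0.1122}
labels = {page: label_polarity(score) for page, score in scores.items()}

for page, label in labels.items():
    print(page, scores[page], label)
```

With this cutoff all three pages land in the "positive" bucket, but only just; moving the threshold to 0.1 would make page 0517 "neutral", which is a reminder of how soft these labels are.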


# Noun Phrases of our PDFs

We'll find the three most frequent noun phrases in each file and print them to the screen.

In [None]:
for file in glob.glob("*.txt"):
  print("Top 3 Phrases for ",file)
  with open(file,"r") as text_file:
    text = text_file.read()

  blob = textblob.TextBlob(text)
  nphrases = dict()

  #Count how often each noun phrase appears
  #(named 'phrase' so we don't overwrite numpy's 'np' alias)
  for phrase in blob.noun_phrases:
    if phrase in nphrases:
         nphrases[phrase] += 1
    else:
         nphrases[phrase] = 1

  #Sort by count, highest first, and keep the top three
  for phrase in sorted(nphrases, key=nphrases.get, reverse=True)[0:3]:
      print(phrase, nphrases[phrase])

  print("---")

Top 3 Phrases for  batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf.txt
april 18
ihe 9
booth 9
---
Top 3 Phrases for  batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf.txt
davenport 10
brady 8
april 5
---
Top 3 Phrases for  batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf.txt
april 16
will 10
booth 6
---
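
The counting loop above can also be written with `collections.Counter` from the standard library. This is only a sketch of an alternative: `sample_phrases` is a made-up list standing in for `blob.noun_phrases`, so the counts here are not real results.

```python
from collections import Counter

# Made-up phrases standing in for blob.noun_phrases
sample_phrases = ["april", "booth", "april", "president lincoln", "booth", "april"]

# Counter tallies each phrase; most_common(3) returns the top three with counts
counts = Counter(sample_phrases)
top_three = counts.most_common(3)
print(top_three)
# prints: [('april', 3), ('booth', 2), ('president lincoln', 1)]
```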



# Dataframe of results

Let's put our findings into a dataframe and save them to our drive.

In [None]:

data_set = []

for file in glob.glob("*.txt"):

  pdf_detail = []

  with open(file,"r") as text_file:
    text = text_file.read()
  blob = textblob.TextBlob(text)

  print("working on ",file)
  pdf_detail.append(file)
  pdf_detail.append(blob.sentiment.polarity)
  pdf_detail.append(blob.sentiment.subjectivity)


  nphrases = dict()
  #(named 'phrase' so we don't overwrite numpy's 'np' alias)
  for phrase in blob.noun_phrases:
    if phrase in nphrases:
      nphrases[phrase] += 1
    else:
      nphrases[phrase] = 1

  #This is a weird way to find the top three keywords
  #This is an example of 'hacky' code

  #we loop through the first entry [0:1] in the sorted keyword list
  #giving us keyword one
  for phrase in sorted(nphrases, key=nphrases.get, reverse=True)[0:1]:
      keyword_one = phrase
      pdf_detail.append(keyword_one)

  #we loop through the second entry [1:2] in the sorted keyword list
  #giving us keyword two
  for phrase in sorted(nphrases, key=nphrases.get, reverse=True)[1:2]:
      keyword_two = phrase
      pdf_detail.append(keyword_two)

  #we loop through the third entry [2:3] in the sorted keyword list
  #giving us keyword three
  for phrase in sorted(nphrases, key=nphrases.get, reverse=True)[2:3]:
      keyword_three = phrase
      pdf_detail.append(keyword_three)

  data_set.append(pdf_detail)
  print(pdf_detail)
  print("---")


working on  batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf.txt
['batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf.txt', 0.08061743269751084, 0.4410866397224572, 'april', 'ihe', 'booth']
---
working on  batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf.txt
['batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf.txt', 0.14144005831060988, 0.48240064601274585, 'davenport', 'brady', 'april']
---
working on  batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf.txt
['batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf.txt', 0.11220078550787209, 0.4379428329034628, 'april', 'will', 'booth']
---


In [None]:
#We now have a list of lists
#This sorta looks like JSON doesn't it?
data_set

[['batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf.txt',
  0.08061743269751084,
  0.4410866397224572,
  'april',
  'ihe',
  'booth'],
 ['batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf.txt',
  0.14144005831060988,
  0.48240064601274585,
  'davenport',
  'brady',
  'april'],
 ['batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf.txt',
  0.11220078550787209,
  0.4379428329034628,
  'april',
  'will',
  'booth']]

In [None]:
#Let's turn this list of lists into a pandas dataframe
pdf_dataframe = pd.DataFrame(data_set)
pdf_dataframe

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | batch_iahi_hypno_ver01_sn83045646_1865041701_0... | 0.080617 | 0.441087 | april | ihe | booth |
| 1 | batch_iahi_hypno_ver01_sn83045646_1865041701_0... | 0.14144 | 0.482401 | davenport | brady | april |
| 2 | batch_iahi_hypno_ver01_sn83045646_1865041701_0... | 0.112201 | 0.437943 | april | will | booth |


**H2**

Our dataframe needs better column names. Put the appropriate values into the list below on line 1 to update the column names.

In [None]:
column_names = ["PDFname","polarity","subjectivity","NounPhrase1","NounPhrase2","NounPhrase3"]
pdf_dataframe.columns = column_names
pdf_dataframe

| | PDFname | polarity | subjectivity | NounPhrase1 | NounPhrase2 | NounPhrase3 |
|---|---|---|---|---|---|---|
| 0 | batch_iahi_hypno_ver01_sn83045646_1865041701_0... | 0.080617 | 0.441087 | april | ihe | booth |
| 1 | batch_iahi_hypno_ver01_sn83045646_1865041701_0... | 0.14144 | 0.482401 | davenport | brady | april |
| 2 | batch_iahi_hypno_ver01_sn83045646_1865041701_0... | 0.112201 | 0.437943 | april | will | booth |


# Save our data

We'll save our dataframe as a CSV file and put it in our usual folder.

In [None]:
pdf_dataframe.to_csv("adv_week3_homework.csv", index=False)
!cp adv_week3_homework.csv /content/gdrive/MyDrive/LibraryJuicePython
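
To get a feel for what ends up in the CSV file, the sketch below writes the same kind of rows with Python's built-in `csv` module instead of pandas. It writes to an in-memory buffer rather than a real file, and the values are rounded from the dataframe above, so treat it as an illustration of the file layout (a header row, no index column) rather than the notebook's own save step.

```python
import csv
import io

column_names = ["PDFname", "polarity", "subjectivity",
                "NounPhrase1", "NounPhrase2", "NounPhrase3"]

# Rows rounded from the dataframe shown earlier in the notebook
rows = [
    ["batch_iahi_hypno_ver01_sn83045646_1865041701_0517.pdf.txt", 0.080617, 0.441087, "april", "ihe", "booth"],
    ["batch_iahi_hypno_ver01_sn83045646_1865041701_0514.pdf.txt", 0.14144, 0.482401, "davenport", "brady", "april"],
    ["batch_iahi_hypno_ver01_sn83045646_1865041701_0515.pdf.txt", 0.112201, 0.437943, "april", "will", "booth"],
]

# Write to an in-memory buffer so the example leaves no files behind
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(column_names)   # header row, like the dataframe's column names
writer.writerows(rows)          # one line per PDF, with no index column
csv_text = buffer.getvalue()
print(csv_text)

# Read it back to confirm the columns line up with their names
records = list(csv.DictReader(io.StringIO(csv_text)))
print(records[0]["NounPhrase1"])
```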

**H3** Take a moment to reflect on what your data analysis says about your PDF files. Jot down some reflections in the cell below.

My analysis tells me... I ran this on newspaper pages that included the words "lincoln assassination" in April 1865. I tried to get pages with relatively good OCR, but apparently the OCR was pretty bad for the SC newspaper I picked.  I'm guessing the bad OCR was part of the reason it had a neutral score for polarity.  I am surprised that the rest of the pages had positive scores.  For the DEU/Delaware newspaper, I'm guessing content like "President Lincoln’s remains left Philadelphia . . . and were visited by great crowds." confused the scoring. "Great" does not necessarily mean something positive in that case.  There is also non-assassination newspaper content on the pages, so that might also affect polarity.  Subjectivity is also something interesting to think about.  The pages had fairly high subjectivity scores for newspapers, and I realized that there were "Letters to the Editor" on those pages that frequently used "I" statements.  (This was particularly the case for the MB/Massachusetts newspaper with sentences like, "I read with profound regret the leading editorial in the Anti-Slavery Standard . . .")