I generated the OCRs for all my documents [here](https://colab.research.google.com/drive/1peA1nxVtGToFzdIfpN_S6SXzbMv1plr8?usp=sharing). This notebook extracts keywords from those OCRs for each document and separates them into categories.

Not only will this help me assess differences in content between the different types of documents, but keywords can also be included in future datasets and documents so that researchers can find data more easily.

I will be comparing keywords from three sets of OCR results: Tesseract, Vertex AI, and Textract. I hypothesize that the higher-quality OCRs from Vertex AI and Textract will yield more keywords, while the lower-quality OCRs from Tesseract will yield fewer but still useful keywords.

You can view the dataset I'm using [here](https://drive.google.com/file/d/1ChC4ntZbo3t4IBmuNYtA8LoqPoY84LNN/view?usp=sharing). You can view the OCR results [here](https://drive.google.com/drive/folders/1Zva_i_CrqQYDJaXcK-obwaAMOtmfKZZE?usp=drive_link).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd "/content/drive/MyDrive/UCSF_ASC"

/content/drive/MyDrive/UCSF_ASC


In [None]:
!pip install nltk
!pip install git+https://github.com/casics/nostril.git
!pip install -U spacy
!python -m spacy download en_core_web_lg
!pip install textblob

Collecting git+https://github.com/casics/nostril.git
  Cloning https://github.com/casics/nostril.git to /tmp/pip-req-build-ka208vl2
  Running command git clone --filter=blob:none --quiet https://github.com/casics/nostril.git /tmp/pip-req-build-ka208vl2
  Resolved https://github.com/casics/nostril.git to commit fbc0c91249283a9fbc9036206391ce1138826fd3
  Preparing metadata (setup.py) ... done
Collecting plac>=0.9.1 (from nostril==1.2.0)
  Downloading plac-1.4.3-py2.py3-none-any.whl.metadata (5.9 kB)
Downloading plac-1.4.3-py2.py3-none-any.whl (22 kB)
Building wheels for collected packages: nostril
  Building wheel for nostril (setup.py) ... done
  Created wheel for nostril: filename=nostril-1.2.0-py3-none-any.whl size=5765788 sha256=337dface3983634d9604947d7b207f002ffe692abc29528b277ff77bc08d6717
  Stored in directory: /tmp/pip-ephem-wheel-cache-nd_2363m/wheels/de/3e/43/5b766704a7dbffce33fcbd15a63a9919cb9cc743e04780b9d6
Successfully built nostril
Installing collec

In [None]:
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import spacy
import numpy as np
from spacy import displacy
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import words
from nostril import nonsense
from textblob import TextBlob

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('words')
stop_words = set(stopwords.words('english'))
setofwords = set(words.words())
NER = spacy.load("en_core_web_lg")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [None]:
def extract_keywords(df):
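  #collects the top five keywords and the full frequency-ranked keyword list for each document
  keywords = []
  all_keywords_list = []
  for idx, row in df.iterrows():
    #lowercase the OCR text so tokens match the lowercase stop word list
    ocr = str(row['ocr']).lower()
    #keep tokens longer than two characters that are dictionary words and not stop words
    all_keywords = [word for word in word_tokenize(ocr) if (word not in stop_words and word in setofwords and len(word) > 2)]
    #rank the remaining words by frequency, most common first
    word_freq = Counter(all_keywords)
    curr_keywords = [key for key, value in word_freq.most_common()]
    all_keywords_list.append(curr_keywords)
    #the five most frequent words become the document's keywords
    top_keywords = curr_keywords[:5]
    keywords.append(top_keywords)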
  df.insert(3, 'keywords', keywords)
  df.insert(4, 'all_keywords', all_keywords_list)
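
The frequency-based selection in `extract_keywords` can be illustrated without NLTK. This is only a minimal sketch: `TOY_STOP_WORDS`, `TOY_VOCABULARY`, `top_keywords`, and the sample sentence are hypothetical stand-ins for the NLTK stop word list, the NLTK `words` corpus, and real OCR text, and a plain whitespace split stands in for `word_tokenize`.

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK stop word list and English vocabulary.
TOY_STOP_WORDS = {"the", "and", "for", "with"}
TOY_VOCABULARY = {"clinic", "patient", "treatment", "report", "the", "and", "for"}


def top_keywords(text, n=5):
    """Return up to n of the most frequent vocabulary words that are not stop words."""
    tokens = text.lower().split()  # whitespace split stands in for word_tokenize
    kept = [
        token for token in tokens
        if token not in TOY_STOP_WORDS and token in TOY_VOCABULARY and len(token) > 2
    ]
    # most_common() orders words by descending frequency
    return [word for word, _ in Counter(kept).most_common(n)]


# "xqzt" imitates an OCR error: it is dropped because it is not in the vocabulary.
sample = "The clinic report and the patient report for the clinic report xqzt"
print(top_keywords(sample))  # ['report', 'clinic', 'patient']
```

The vocabulary check is what discards garbled OCR tokens, which is why lower-quality OCR text would be expected to leave fewer keywords.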

In [None]:
def categorize_keywords(df):
  all_categories = []
  for idx, row in df.iterrows():
    #for an explanation of these categories: https://dataknowsall.com/blog/ner.html
    curr_categories = {'PERSON':[],'NORP':[],'FAC':[],'ORG':[],'GPE':[],'LOC':[],'PRODUCT':[],'EVENT':[],'WORK_OF_ART':[],'LAW':[],'DATE':[],'TIME':[]}
    ocr = row['ocr']
    text = NER(ocr)
    for word in text.ents:
      if word.label_ in ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW','DATE','TIME']:
        curr_categories[word.label_].append(str(word.text))
    all_categories.append(curr_categories)
  df.insert(5, 'categories', all_categories)
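
The grouping step inside `categorize_keywords` can be sketched on its own, without loading spaCy. In this sketch the `(text, label)` pairs are made-up stand-ins for `word.text` and `word.label_` from `text.ents`, and `group_entities` and `KEPT_LABELS` are hypothetical names.

```python
from collections import defaultdict

# The same entity labels that categorize_keywords keeps.
KEPT_LABELS = {
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC",
    "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "DATE", "TIME",
}


def group_entities(pairs):
    """Group entity texts by label, skipping labels outside KEPT_LABELS."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if label in KEPT_LABELS:
            grouped[label].append(text)
    return dict(grouped)


# Made-up (text, label) pairs standing in for spaCy entities.
toy_pairs = [
    ("San Francisco", "GPE"),
    ("1987", "DATE"),
    ("ATN Publishers", "ORG"),
    ("42", "CARDINAL"),  # not a kept label, so it is skipped
    ("California", "GPE"),
]
print(group_entities(toy_pairs))
```

Unlike `categorize_keywords`, which starts every document with an empty list for each label, this sketch only creates keys for labels that actually occur.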

In [None]:
def sentiment_analysis(df):
  all_polarity = []
  all_subjectivity = []
  for idx, row in df.iterrows():
    ocr = row['ocr']
    #compute the sentiment once and read both scores from it
    sentiment = TextBlob(ocr).sentiment
    #Determining the Polarity in a range of [-1,1] where -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment
    p = sentiment.polarity
    #Determining the Subjectivity in a range of [0,1] where a value closer to 0 denotes a piece of factual information and a value closer to 1 denotes a personal opinion
    s = sentiment.subjectivity
    all_polarity.append(p)
    all_subjectivity.append(s)
  df.insert(6, 'subjectivity', all_subjectivity)
  df.insert(7, 'polarity', all_polarity)


In [None]:
def subjectivity_polarity_per_category(tool, df):
  subjectivity_polarity_scores = pd.DataFrame(columns=['tool', 'format', 'subjectivity', 'polarity'])

  #a score of 0 usually indicates gibberish OCR; those rows are still included in the means below, which lowers accuracy but will suffice for a general analysis
  handwritten_subjectivity = df[(df['format']=='handwritten')]['subjectivity'].mean()
  typed_subjectivity = df[(df['format']=='typed')]['subjectivity'].mean()
  mixed_subjectivity = df[(df['format']=='mixed')]['subjectivity'].mean()
  subjectivities = [handwritten_subjectivity, typed_subjectivity, mixed_subjectivity]
  print(subjectivities)

  handwritten_polarity = df[(df['format']=='handwritten')]['polarity'].mean()
  typed_polarity = df[(df['format']=='typed')]['polarity'].mean()
  mixed_polarity = df[(df['format']=='mixed')]['polarity'].mean()
  polarities = [handwritten_polarity, typed_polarity, mixed_polarity]

  subjectivity_polarity_scores['tool'] = [tool] * 3
  subjectivity_polarity_scores['format'] = ['handwritten', 'typed', 'mixed']
  subjectivity_polarity_scores['subjectivity'] = subjectivities
  subjectivity_polarity_scores['polarity'] = polarities

  return subjectivity_polarity_scores
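
The means in `subjectivity_polarity_per_category` are taken over every row of each format, including rows that score 0. If zero scores are to be left out as likely gibberish, a minimal sketch of the idea looks like this. The helper `mean_excluding_zeros` and the toy scores are hypothetical, not part of the notebook's results.

```python
from statistics import mean


def mean_excluding_zeros(scores):
    """Average only the non-zero scores; return None if none remain."""
    nonzero = [score for score in scores if score != 0]
    return mean(nonzero) if nonzero else None


# Made-up subjectivity scores; the zeros imitate gibberish OCR results.
toy_scores = [0.0, 0.4, 0.0, 0.6]

print(mean(toy_scores))                  # average over every row
print(mean_excluding_zeros(toy_scores))  # average over non-zero rows only
```

With pandas, the same filter could be written as `df.loc[df['subjectivity'] != 0, 'subjectivity'].mean()`, assuming the column names used in this notebook. Filtering would change the averages reported below, so it is a design choice rather than a drop-in fix.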


#Tesseract

##Extracting keywords

In [None]:
tesseract_df = pd.read_csv('/content/drive/MyDrive/UCSF_ASC/Results/tesseract_results.csv')

In [None]:
extract_keywords(tesseract_df)

In [None]:
# tesseract_df.head(30)

##Categorizing keywords

In [None]:
#because nltk doesn't catch things like proper nouns or years in the keywords, we will be using the entire OCR
categorize_keywords(tesseract_df)

In [None]:
# tesseract_df.head(30)

##Sentiment Analysis

In [None]:
sentiment_analysis(tesseract_df)

In [None]:
# tesseract_df.head(30)

##Saving results to CSV

In [None]:
# #save the result to a csv
# tesseract_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/tesseract_content_results.csv'

# # Save the DataFrame to a CSV file
# tesseract_df.to_csv(tesseract_csv_path, index=False)

#Vertex AI

##Extracting keywords

In [None]:
vertex_df = pd.read_csv('/content/drive/MyDrive/UCSF_ASC/Results/vertex_results.csv')

In [None]:
extract_keywords(vertex_df)

In [None]:
# vertex_df.head(30)

##Categorizing keywords

In [None]:
#because nltk doesn't catch things like proper nouns or years in the keywords, we will be using the entire OCR
categorize_keywords(vertex_df)

In [None]:
# vertex_df.head(30)

##Sentiment Analysis

In [None]:
sentiment_analysis(vertex_df)

In [None]:
# vertex_df.head(30)

##Saving results to CSV

In [None]:
# #save the result to a csv
# vertex_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/vertex_content_results.csv'

# # Save the DataFrame to a CSV file
# vertex_df.to_csv(vertex_csv_path, index=False)

#Textract

##Extracting keywords

In [None]:
textract_df = pd.read_csv('/content/drive/MyDrive/UCSF_ASC/Results/textract_results.csv')

In [None]:
extract_keywords(textract_df)

In [None]:
# textract_df.head(30)

##Categorizing keywords

In [None]:
#because nltk doesn't catch things like proper nouns or years in the keywords, we will be using the entire OCR
categorize_keywords(textract_df)

In [None]:
# textract_df.head(30)

##Sentiment Analysis

In [None]:
sentiment_analysis(textract_df)

In [None]:
# textract_df.head(30)

##Saving results to CSV

In [None]:
# #save the result to a csv
# textract_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/textract_content_results.csv'

# # Save the DataFrame to a CSV file
# textract_df.to_csv(textract_csv_path, index=False)

#Comparing all subjectivity and polarity scores

Subjectivity ranges over [0,1]: a value closer to 0 denotes factual information and a value closer to 1 denotes personal opinion.

Polarity ranges over [-1,1]: -1 denotes a highly negative sentiment and 1 denotes a highly positive sentiment.

In [None]:
tesseract_sen = subjectivity_polarity_per_category('tesseract', tesseract_df)
tesseract_sen

[0.10628787878787879, 0.30394287642385065, 0.32302047534845013]


|   | tool | format | subjectivity | polarity |
|---|---|---|---|---|
| 0 | tesseract | handwritten | 0.106288 | 0.03197 |
| 1 | tesseract | typed | 0.303943 | 0.055017 |
| 2 | tesseract | mixed | 0.32302 | 0.064087 |


In [None]:
vertex_sen = subjectivity_polarity_per_category('vertex', vertex_df)
vertex_sen

[0.3176355984083257, 0.3061420117094235, 0.4236769685491276]


|   | tool | format | subjectivity | polarity |
|---|---|---|---|---|
| 0 | vertex | handwritten | 0.317636 | 0.125618 |
| 1 | vertex | typed | 0.306142 | 0.031262 |
| 2 | vertex | mixed | 0.423677 | 0.124397 |


In [None]:
textract_sen = subjectivity_polarity_per_category('textract', textract_df)
textract_sen

[0.28574615199615205, 0.3322950696720182, 0.43669450757575756]


|   | tool | format | subjectivity | polarity |
|---|---|---|---|---|
| 0 | textract | handwritten | 0.285746 | 0.111053 |
| 1 | textract | typed | 0.332295 | 0.058977 |
| 2 | textract | mixed | 0.436695 | 0.132078 |


In [None]:
sentiment_results_df = pd.concat([tesseract_sen, vertex_sen, textract_sen], ignore_index=True)
sentiment_results_df.head(9)

|   | tool | format | subjectivity | polarity |
|---|---|---|---|---|
| 0 | tesseract | handwritten | 0.106288 | 0.03197 |
| 1 | tesseract | typed | 0.303943 | 0.055017 |
| 2 | tesseract | mixed | 0.32302 | 0.064087 |
| 3 | vertex | handwritten | 0.317636 | 0.125618 |
| 4 | vertex | typed | 0.306142 | 0.031262 |
| 5 | vertex | mixed | 0.423677 | 0.124397 |
| 6 | textract | handwritten | 0.285746 | 0.111053 |
| 7 | textract | typed | 0.332295 | 0.058977 |
| 8 | textract | mixed | 0.436695 | 0.132078 |


In [None]:
#save the result to a csv
sentiment_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/sentiment_final_results.csv'

# Save the DataFrame to a CSV file
sentiment_results_df.to_csv(sentiment_csv_path, index=False)

#Topic Analysis

In [None]:
!pip3 install --upgrade google-cloud-documentai
!pip3 install --upgrade google-cloud-storage
!pip3 install --upgrade google-cloud-documentai-toolbox

Collecting google-cloud-documentai
  Downloading google_cloud_documentai-2.31.0-py2.py3-none-any.whl.metadata (5.2 kB)
Downloading google_cloud_documentai-2.31.0-py2.py3-none-any.whl (319 kB)
Installing collected packages: google-cloud-documentai
Successfully installed google-cloud-documentai-2.31.0


Collecting google-cloud-storage
  Downloading google_cloud_storage-2.18.2-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting google-resumable-media>=2.7.2 (from google-cloud-storage)
  Downloading google_resumable_media-2.7.2-py2.py3-none-any.whl.metadata (2.2 kB)
Downloading google_cloud_storage-2.18.2-py2.py3-none-any.whl (130 kB)
Downloading google_resumable_media-2.7.2-py2.py3-none-any.whl (81 kB)
Installing collected packages: google-resumable-media, google-cloud-storage
  Attempting uninstall: google-resumable-media
    Found existing installation: google-resumable-media 2.7.1
    Uninstalling google-resumable-media-2.7.1:
      Successfully uninstalled google-resumable-media-2.7.1
  Attempting uninstall: google-cloud-storage
    Found

Collecting google-cloud-documentai-toolbox
  Downloading google_cloud_documentai_toolbox-0.14.0a0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting pyarrow<16.0.0,>=15.0.0 (from google-cloud-documentai-toolbox)
  Downloading pyarrow-15.0.2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting google-cloud-vision<4.0.0dev,>=2.7.0 (from google-cloud-documentai-toolbox)
  Downloading google_cloud_vision-3.7.4-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting intervaltree>=3.0.0 (from google-cloud-documentai-toolbox)
  Downloading intervaltree-3.1.0.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pikepdf<9.0.0,>=8.0.0 (from google-cloud-documentai-toolbox)
  Downloading pikepdf-8.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.5 kB)
Collecting Pillow<11.0.0,>=10.0.0 (from google-cloud-documentai-toolbox)
  Downloading pillow-10.4.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (9.2 kB)
Collecting bottleneck>=1.3.4 (fr

In [None]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/drive/MyDrive/UCSF_ASC/nodal-keep-429420-v6-81686d3db60b.json"

In [None]:
import os
from google.api_core.client_options import ClientOptions
from google.cloud import documentai, language_v1

project_id = 'nodal-keep-429420-v6'
location = 'us'
processor_id = 'c56f7e228d63aa1e'

client = language_v1.LanguageServiceClient()
language = "en"
type_ = language_v1.Document.Type.PLAIN_TEXT

def topic_classify(df):
  all_topics = []
  for index, row in df.iterrows():
      # The OCR text for this document
      ocr = row['ocr']

      # Prepare the document for classification
      doc = {"content": ocr, "type_": type_, "language": language}

      # Classify the text
      try:
        classify_response = client.classify_text(request={'document': doc})
        # Split each category path on '/' and flatten all parts into one list
        topics = [part for category in classify_response.categories for part in category.name.lstrip('/').split('/')]
      except Exception:
        # Classification failed for this text; print it and record no topics
        print(ocr)
        topics = []

      all_topics.append(topics)
  # df.drop('topics', axis=1, inplace=True)
  df.insert(8, 'topics', all_topics)

  # Print or save the updated DataFrame as needed
  # print(vertex_df)
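
The list comprehension in `topic_classify` splits each category name on `/` and flattens the parts into one list. A minimal sketch of that step follows, using made-up category paths rather than real API responses. `flatten_category_names` and `toy_names` are hypothetical.

```python
def flatten_category_names(names):
    """Split each '/'-separated category path into parts and flatten them."""
    return [part for name in names for part in name.lstrip('/').split('/')]


# Hypothetical category paths in '/Parent/Child' form.
toy_names = ["/Health/Public Health", "/Health/Health Conditions/AIDS & HIV"]

flat = flatten_category_names(toy_names)
print(flat)
# ['Health', 'Public Health', 'Health', 'Health Conditions', 'AIDS & HIV']

# Dropping repeats while keeping first-seen order
unique = list(dict.fromkeys(flat))
print(unique)
# ['Health', 'Public Health', 'Health Conditions', 'AIDS & HIV']
```

Because a parent shared by two categories appears once per category, any count based on `len(row['topics'])` counts that parent more than once unless the list is deduplicated first.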


In [None]:
topic_classify(tesseract_df)
#save the result to a csv
tesseract_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/tesseract_content_results.csv'

# Save the DataFrame to a CSV file
tesseract_df.to_csv(tesseract_csv_path, index=False)

Audience Assad len ths, necting aan hiyhe

 

To Ww

 

cree Sub, or

Exch age Sub

ATN Publishers
P.O. Box 411256
San Francisco, CA 94141

 

 



In [None]:
topic_classify(vertex_df)
#save the result to a csv
vertex_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/vertex_content_results.csv'

# Save the DataFrame to a CSV file
vertex_df.to_csv(vertex_csv_path, index=False)

John
Free Sub, or
Exchange Sub
ATN Publishers
P.O. Box 411256
San Francisco, CA 94141



In [None]:
topic_classify(textract_df)
#save the result to a csv
textract_csv_path = '/content/drive/MyDrive/UCSF_ASC/Content_Analysis_Results/textract_content_results.csv'

# Save the DataFrame to a CSV file
textract_df.to_csv(textract_csv_path, index=False)

X
Du
with

John
Free Sub, or
Exchange Sub
ATN Publishers
P.O. Box 411256
San Francisco, CA 94141
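
The content result CSVs store list-valued columns such as `topics` as text. If a file is read back later, each cell comes back as a string rather than a list, so `len()` would count characters instead of topics. The sketch below shows one way to recover the lists with `ast.literal_eval`. The in-memory CSV and its rows are made up for illustration.

```python
import ast
import csv
import io

# Write a tiny CSV in memory; the csv module stores each list via str().
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["format", "topics"])
writer.writerow(["typed", ["Health", "Public Health"]])
writer.writerow(["handwritten", []])

# Read it back: every cell is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(type(rows[0]["topics"]).__name__)  # str

# literal_eval safely turns the text "['Health', 'Public Health']" into a list.
parsed = [ast.literal_eval(row["topics"]) for row in rows]
print(parsed)  # [['Health', 'Public Health'], []]
```

With pandas, the equivalent after `pd.read_csv` would be something like `df['topics'] = df['topics'].apply(ast.literal_eval)`, assuming every cell holds a valid list literal.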



In [None]:
def count_topics(df):
  counts = {'handwritten':0, 'typewritten':0, 'mixed':0}
  for index, row in df.iterrows():
    if row['format'] == 'handwritten' and len(row['topics']) > 0:
      counts['handwritten'] += 1
    elif row['format'] == 'typed' and len(row['topics']) > 0:
      counts['typewritten'] += 1
    elif row['format'] == 'mixed' and len(row['topics']) > 0:
      counts['mixed'] += 1
  return counts

In [None]:
print('tesseract topic count: ', count_topics(tesseract_df))

tesseract topic count:  {'handwritten': 2, 'typewritten': 9, 'mixed': 5}


In [None]:
print('vertex topic count: ', count_topics(vertex_df))

vertex topic count:  {'handwritten': 4, 'typewritten': 10, 'mixed': 6}


In [None]:
print('textract topic count: ', count_topics(textract_df))

textract topic count:  {'handwritten': 3, 'typewritten': 10, 'mixed': 6}


In [None]:
def count_topics_total(df):
  counts = {'handwritten':0, 'typewritten':0, 'mixed':0}
  for index, row in df.iterrows():
    if row['format'] == 'handwritten':
      counts['handwritten'] += len(row['topics'])
    elif row['format'] == 'typed':
      counts['typewritten'] += len(row['topics'])
    elif row['format'] == 'mixed':
      counts['mixed'] += len(row['topics'])
  return counts
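
`count_topics` and `count_topics_total` walk the same rows and differ only in what they add up, so both tallies can be produced in one pass. This is a sketch on plain dictionaries with made-up rows. `count_topics_from_rows` and `FORMAT_LABELS` are hypothetical names.

```python
from collections import Counter

# Maps the 'format' values in the data to the labels used in the printed counts.
FORMAT_LABELS = {"handwritten": "handwritten", "typed": "typewritten", "mixed": "mixed"}


def count_topics_from_rows(rows):
    """Return (documents with at least one topic, total topics) per format."""
    docs_with_topics = Counter()
    total_topics = Counter()
    for row in rows:
        label = FORMAT_LABELS.get(row["format"])
        if label is None:
            continue  # skip formats outside the three known ones
        if row["topics"]:
            docs_with_topics[label] += 1
        total_topics[label] += len(row["topics"])
    return dict(docs_with_topics), dict(total_topics)


# Made-up rows standing in for DataFrame records.
toy_rows = [
    {"format": "typed", "topics": ["Health", "Public Health"]},
    {"format": "typed", "topics": []},
    {"format": "handwritten", "topics": ["Health"]},
    {"format": "mixed", "topics": []},
]
docs, totals = count_topics_from_rows(toy_rows)
print(docs)    # {'typewritten': 1, 'handwritten': 1}
print(totals)  # {'typewritten': 2, 'handwritten': 1, 'mixed': 0}
```

To run this on one of the DataFrames, the rows could be obtained with `df.to_dict('records')`.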

In [None]:
print('tesseract topic count: ', count_topics_total(tesseract_df))

tesseract topic count:  {'handwritten': 5, 'typewritten': 46, 'mixed': 13}


In [None]:
print('vertex topic count: ', count_topics_total(vertex_df))

vertex topic count:  {'handwritten': 7, 'typewritten': 44, 'mixed': 24}


In [None]:
print('textract topic count: ', count_topics_total(textract_df))

textract topic count:  {'handwritten': 6, 'typewritten': 45, 'mixed': 21}
