# Assignment 5: OCR, RegEx, Time Series
# Part 1: OCR and RegEx

Author: Shreya Parjan

24 Oct 2019

This part of A5 examines the quality of different OCR tools to find out which works best. Using "Maintenance of a Lobby to Influence Legislation, 1913" as our reference, we follow a tutorial by Dr. Meredith Parker from the University of Oxford to remove the unnecessary lines and find the speaker names at the beginning of lines. We extend the code to count how often each person appears, so that we can compare the results to the "ground truth," a verified spreadsheet of the actual number of occurrences of each name in the text.

## Table of contents
1. [Step 1: "Ground Truth" spreadsheet](#s0)
2. [Step 2: I reviewed the RegEx notebook from class](#s1)
3. [Step 3: Tutorial Code](#s2)
4. [Step 4: Names Analysis/Patterns](#s3)
5. [Step 5: CounterDicts & JSON Files](#s4)
6. [Step 6: Compare to Ground Truth](#s5)
    - [Read in, Clean Google sheet](#s6)
    - [readTruth function to create CounterDict](#s7)
    - [getCloseMatches](#s8)
    - [Compare the “ground truth” results with the OCR result](#s9)
7. [Interpretation](#s10)

## Step 1: "ground truth" spreadsheet has been created
<a id="s0"></a>

## Step 2: I reviewed the RegEx notebook from class
<a id="s1"></a>

## Step 3: Tutorial Code
<a id="s2"></a>
The Python code below is adapted from the tutorial (see above), with minor changes to simplify it and make it more concise.

Main improvements:
- (1): made code segments into functions that can be called for different input files
- (2): created list of errors and removed repetitive if statements/cases so that possible errors are only checked against the list
- (3): deleted analysis of text outside of names of people because it isn't relevant to our project

In [68]:
import re

# Initialize list for edited text
v_edited = []

# editFile takes a filename, reads the file, and drops lines that match our list of known errors.
# Note: v_edited is module-level, so lines kept by earlier calls stay in the list on later calls.
def editFile(filename):
    errors = ["Generated", "Public", "MAINTENANCE", "Original", "NEW", "Digitized"]
    # Open file and read in line by line
    with open(filename, "r", encoding='utf-8') as f:
        file_text = f.read()
    lines = file_text.split('\n')

    for line in lines:
        split = line.split()
        # If the first word is a known OCR error, do not include the line
        if split and split[0] in errors:
            continue
        # Likewise skip lines whose second word is "MAINTENANCE"
        if len(split) > 1 and split[1] == "MAINTENANCE":
            continue
        v_edited.append(line)
    return v_edited
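
Because `v_edited` lives at module level, every call to `editFile` appends to the same list, so a later call also returns the lines kept by earlier calls. The sketch below shows one possible variant that builds a fresh list per call. The name `edit_file_local` and the sample file contents are hypothetical and only serve as illustration; this is not the tutorial's code.

```python
import os
import tempfile

# Same markers as in editFile above
ERRORS = ["Generated", "Public", "MAINTENANCE", "Original", "NEW", "Digitized"]

def edit_file_local(filename):
    """Hypothetical variant of editFile that returns a new list on every call."""
    kept = []  # local, so nothing carries over between files
    with open(filename, "r", encoding="utf-8") as f:
        for line in f.read().split("\n"):
            words = line.split()
            # Skip lines whose first word is a known marker
            if words and words[0] in ERRORS:
                continue
            # Skip lines whose second word is "MAINTENANCE"
            if len(words) > 1 and words[1] == "MAINTENANCE":
                continue
            kept.append(line)
    return kept

# Illustration with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.txt")
    path_b = os.path.join(tmp, "b.txt")
    with open(path_a, "w", encoding="utf-8") as f:
        f.write("Generated on some date\nMr. EMERY. Yes.")
    with open(path_b, "w", encoding="utf-8") as f:
        f.write("12 MAINTENANCE OF A LOBBY\nSenator REED. No.")
    lines_a = edit_file_local(path_a)
    lines_b = edit_file_local(path_b)

print(lines_a)
print(lines_b)
```

With a local list, the second call returns only the lines of the second file, so counts computed per OCR tool would not include names from the other tools' files.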

## Step 4: Names Analysis/Patterns
<a id="s3"></a>
Following from the tutorial, we generate patterns to catch errors in the spellings of the various names. We also provide a function, handleNames, that takes in the edited string from editFile and returns the post-analysis output and the list of names it picked up from the file it read.

In [69]:
# Compile regular expressions to match patterns found in the text that indicate a new speaker

# This matches patterns like "Mr. Robinson." or "Mr Smith."
# '\.*' allows an optional period (zero or more) after "Mr"
# ' *' allows optional spaces between "Mr" and the name
# '[A-Za-z]+' allows words with capital or lowercase letters of any length
# '\.' requires a period after the name, which is used in the text to indicate the end of the name

re1 = re.compile(r'Mr\.* *[A-Za-z]+\.')

# This matches patterns like "Senator Gallinger." paralleling the above
re2 = re.compile(r'Senator *[A-Za-z]+\.')

# This matches "The Chairman." allowing the whole word to also be capitalized
re3 = re.compile(r'The *Chairman\.', re.IGNORECASE)
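
To make the behaviour of these patterns concrete, here is a small sketch that applies them to a few made-up lines. The helper `speaker` and the sample sentences are assumptions for illustration only. Note that `match` only looks at the start of a line, so a name in the middle of a sentence is not treated as a new speaker.

```python
import re

# Same patterns as above, written as raw strings
re1 = re.compile(r'Mr\.* *[A-Za-z]+\.')
re2 = re.compile(r'Senator *[A-Za-z]+\.')
re3 = re.compile(r'The *Chairman\.', re.IGNORECASE)

samples = [
    "Mr. EMERY. I do not recall.",      # made-up lines
    "Mr  McCARTER. Yes, sir.",
    "Senator REED. When was that?",
    "The CHAIRMAN. Proceed.",
    "He met Mr. EMERY. Then he left.",  # name is not at the start of the line
]

def speaker(line):
    """Return the speaker prefix matched at the start of the line, or None."""
    for pattern in (re1, re2, re3):
        m = pattern.match(line)
        if m:
            return m.group().rstrip('.')
    return None

for s in samples:
    print(repr(speaker(s)))
```

Because `' *'` accepts any number of spaces, "Mr  McCARTER" (two spaces) is kept as a distinct spelling, which is one reason the counts later contain near-duplicate keys.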


In [70]:
# handleNames takes our edited lines and cycles through them to pick out the names of the speakers
def handleNames(v_edited):
    namelist = []
    # Cycle through and split on speaker names
    output = ""
    for line in v_edited:
        # Try each speaker pattern in turn and keep the first one that matches
        match = re1.match(line) or re2.match(line) or re3.match(line)

        if match:
            # Start a new output line of the form "<name># <rest of the line>"
            name = match.group().rstrip('.')
            output += "\n" + name + "# " + line[match.end():]
            namelist.append(name)
        # If a line doesn't match, it's part of the comment of the above speaker.
        # So just add it to the output
        else:
            output += line
    return output, namelist

Below, I call the relevant functions I created above for each of the files (analyzed by different technologies) that I'm interested in.

In [71]:
v_edited1 = editFile("V1.txt")
outputV1, namelistV1 = handleNames(v_edited1)

v_editedT = editFile("tesseract.txt")
outputT, namelistT = handleNames(v_editedT)

v_editedG = editFile("google-cloud.txt")
outputG, namelistG = handleNames(v_editedG)

## Step 5: Counter dicts and JSON files
<a id="s4"></a>
Now, for each file, we create a Counter that maps every name to the number of times it appears, and we save each Counter as a JSON file.

In [72]:
from collections import Counter
import json

In [73]:
# Counter for our Version 1 file

counterDictV1 = Counter(namelistV1)
print(counterDictV1)

with open('namesV1JSON','w') as outfile:
    json.dump(counterDictV1, outfile)

Counter({'Mr. EMERY': 78, 'Senator REED': 50, 'The CHAIRMAN': 15, 'Mr. McCARTER': 15, 'Senator NELSON': 6, 'Senator WALSH': 6, 'Mr. MULHALL': 5, 'Senator  REED': 3, 'Mr.  EMERY': 3, 'Senator ClrnMINS': 1, 'Mr. EiomY': 1, 'Senator  REEU': 1, 'Mr. EKEBY': 1, 'Mr. fcCARTER': 1, 'Senator wALSH': 1, 'Mr. MERY': 1, 'Mr.  McCARTER': 1, 'Mr. McCABTER': 1})
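
The saved counts can be read back later for comparison. The following is a minimal sketch of that round trip, assuming a made-up name list and a hypothetical file name `names_demo.json`; it is not tied to the files written above.

```python
import json
import os
import tempfile
from collections import Counter

names = ["Mr. EMERY", "Senator REED", "Mr. EMERY"]  # made-up name list
counts = Counter(names)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names_demo.json")  # hypothetical filename
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(counts, outfile)
    with open(path, "r", encoding="utf-8") as infile:
        # json.load returns a plain dict, so wrap it to get Counter behaviour back
        restored = Counter(json.load(infile))

print(restored.most_common())
```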


In [74]:
# Counter for our Tesseract file

import difflib

counterDictT = Counter(namelistT)
print(counterDictT)

with open('namesTJSON','w') as outfile:
    json.dump(counterDictT, outfile)

Counter({'Mr. Emery': 170, 'Senator Reep': 94, 'Mr. EMERY': 78, 'Senator REED': 58, 'Mr. McCarter': 44, 'The CHAIRMAN': 16, 'Senator Watsu': 16, 'Mr. McCARTER': 15, 'Senator Netson': 7, 'Senator NELSON': 6, 'Senator WALSH': 6, 'Senator ReEp': 6, 'Mr. MULHALL': 5, 'Senator Cummins': 4, 'Senator ReEep': 4, 'Senator ReeEp': 4, 'Senator WatsH': 4, 'Senator  REED': 3, 'Mr.  EMERY': 3, 'Senator REEp': 3, 'Senator Rreep': 3, 'Mr. Mutua': 3, 'Senator Watsn': 3, 'Senator Waxsu': 3, 'Senator Nerson': 2, 'Senator REEep': 2, 'Mr. McCarrer': 2, 'Mr. Exery': 2, 'Senator ClrnMINS': 1, 'Mr. EiomY': 1, 'Senator  REEU': 1, 'Mr. EKEBY': 1, 'Mr. fcCARTER': 1, 'Senator wALSH': 1, 'Mr. MERY': 1, 'Mr.  McCARTER': 1, 'Mr. McCABTER': 1, 'Mr. Teer': 1, 'Senator Nezson': 1, 'Senator NEtson': 1, 'Mr. MuLHALL': 1, 'Mr. Muiwa': 1, 'Mr. Mutwatt': 1, 'Mr. Mutttau': 1, 'Mr. McCarTer': 1, 'Senator Rexp': 1, 'Senator Resp': 1, 'Senator Wars': 1, 'Mr. Ttieny': 1, 'Mr. Esrery': 1, 'Senator REeep': 1, 'Senator REeEp': 1, '

In [95]:
# Counter for our Google Cloud Vision file

counterDictG = Counter(namelistG)
print(counterDictG.keys())
with open('namesGJSON','w') as outfile:
    json.dump(counterDictG, outfile)

dict_keys(['Senator REED', 'Senator ClrnMINS', 'The CHAIRMAN', 'Mr. EMERY', 'Senator NELSON', 'Mr. McCARTER', 'Mr. EiomY', 'Senator  REEU', 'Mr. MULHALL', 'Mr. EKEBY', 'Senator  REED', 'Mr. fcCARTER', 'Senator wALSH', 'Senator WALSH', 'Mr.  EMERY', 'Mr. MERY', 'Mr.  McCARTER', 'Mr. McCABTER', 'Mr. McCarter', 'Senator Reep', 'Senator Cummins', 'Mr. Emery', 'Senator Nerson', 'Mr. Teer', 'Senator REEep', 'Senator Netson', 'Senator ReEep', 'Senator REEp', 'Senator Nezson', 'Senator NEtson', 'Mr. MuLHALL', 'Mr. Muiwa', 'Mr. Mutwatt', 'Mr. Mutttau', 'Mr. McCarTer', 'Senator Rexp', 'Senator Rreep', 'Senator ReEp', 'Mr. McCarrer', 'Senator Resp', 'Senator Watsu', 'Senator Wars', 'Mr. Ttieny', 'Senator ReeEp', 'Mr. Esrery', 'Senator REeep', 'Senator REeEp', 'Mr. Exery', 'The CHairMAN', 'Senator Regp', 'Mr. Mutua', 'Senator Rerp', 'Mr. Mutwatv', 'Mr. Mornay', 'Mr. MuHa', 'Senator Watsn', 'Senator WatsH', 'Senator Waxsu', 'Senator Reev', 'Mr. Mutnact', 'Mr. McCarrTer', 'Mr. Muuatt', 'Mr. Mutaaty'

## Step 6: Compare to Ground Truth
<a id="s5"></a>
Now that we have a CounterDict for each of the technologies used to read the text, we compare each one against the ground truth counts. This tells us which software reproduces the names most accurately.

### Read in, Clean Google spreadsheet
<a id="s6"></a>

In [76]:
import pandas as pd
from collections import Counter

df = pd.read_csv('results.csv')
df.head()

Unnamed: 0,Page #,Student,Name1,Count1,Name2,Count2,Name3,Count3,Name4,Count4,Name5,Count5,Name6,Count6
0,1,Eni,Mr. McCARTER,3,Senator REED,3,,,,,,,,
1,2,Elaney,Senator CUMMINS,3,Mr. McCARTER,5,Senator REED,4.0,The CHAIRMAN,4.0,Mr. EMERY,6.0,Senator NELSON,1.0
2,3,Izzy,Mr. McCARTER,3,Mr. EMERY,6,Senator REED,2.0,Senator NELSON,1.0,,,,
3,4,Avery,Mr. Emery,4,Mr. McCarter,5,Senator REED,6.0,Senator Cummins,1.0,The Chairman,2.0,Senator Nelson,5.0
4,5,Yige,The CHAIRMAN,1,Mr. McCarter,7,Mr. MULHALL,5.0,Mr. EMERY,5.0,Senator REED,3.0,Senator Nelson,1.0


In [77]:
# remove columns we don't need
df.drop(['Page #', 'Student'], axis=1, inplace=True)
df.head()

Unnamed: 0,Name1,Count1,Name2,Count2,Name3,Count3,Name4,Count4,Name5,Count5,Name6,Count6
0,Mr. McCARTER,3,Senator REED,3,,,,,,,,
1,Senator CUMMINS,3,Mr. McCARTER,5,Senator REED,4.0,The CHAIRMAN,4.0,Mr. EMERY,6.0,Senator NELSON,1.0
2,Mr. McCARTER,3,Mr. EMERY,6,Senator REED,2.0,Senator NELSON,1.0,,,,
3,Mr. Emery,4,Mr. McCarter,5,Senator REED,6.0,Senator Cummins,1.0,The Chairman,2.0,Senator Nelson,5.0
4,The CHAIRMAN,1,Mr. McCarter,7,Mr. MULHALL,5.0,Mr. EMERY,5.0,Senator REED,3.0,Senator Nelson,1.0


In [78]:
# replace NaNs with zeros
df = df.fillna(0)
df.head()

Unnamed: 0,Name1,Count1,Name2,Count2,Name3,Count3,Name4,Count4,Name5,Count5,Name6,Count6
0,Mr. McCARTER,3,Senator REED,3,0,0.0,0,0.0,0,0.0,0,0.0
1,Senator CUMMINS,3,Mr. McCARTER,5,Senator REED,4.0,The CHAIRMAN,4.0,Mr. EMERY,6.0,Senator NELSON,1.0
2,Mr. McCARTER,3,Mr. EMERY,6,Senator REED,2.0,Senator NELSON,1.0,0,0.0,0,0.0
3,Mr. Emery,4,Mr. McCarter,5,Senator REED,6.0,Senator Cummins,1.0,The Chairman,2.0,Senator Nelson,5.0
4,The CHAIRMAN,1,Mr. McCarter,7,Mr. MULHALL,5.0,Mr. EMERY,5.0,Senator REED,3.0,Senator Nelson,1.0


### readTruth function to create CounterDict
<a id="s7"></a>

In [79]:
def readTruth(df):
    """Sums the counts for each name across all (NameN, CountN) column pairs."""
    counter = Counter()
    for i in range(0, len(df.columns), 2):
        # copy one pair of columns (e.g., Name2 and Count2) into a temp df
        df_temp = df.iloc[:, i:i+2]
        # iterate through the temp df and add each count to the name's total
        for index, row in df_temp.iterrows():
            vals = row.values.tolist()
            counter[vals[0]] += float(vals[1])  # updates the Counter dictionary
    return counter
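
The aggregation that `readTruth` performs can also be illustrated without pandas. The sketch below assumes each row is a flat list of alternating name and count values, with 0 standing in for empty cells as after `fillna(0)`. The function name `tally_pairs` and the rows are made up for this example.

```python
from collections import Counter

def tally_pairs(rows):
    """Sum counts per name from rows laid out as [name1, count1, name2, count2, ...]."""
    totals = Counter()
    for row in rows:
        # Step through the row two cells at a time: (name, count)
        for name, count in zip(row[0::2], row[1::2]):
            totals[name] += float(count)
    return totals

rows = [
    ["Mr. McCARTER", 3, "Senator REED", 3, 0, 0.0],
    ["Mr. McCARTER", 5, "Mr. EMERY", 6, "Senator REED", 4.0],
]
totals = tally_pairs(rows)
print(totals)
```

This version walks row by row rather than column pair by column pair, but the per-name sums are the same. As in `readTruth`, the placeholder key `0` ends up in the result and has to be removed before matching names.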

### getCloseMatches
<a id="s8"></a>

In [89]:
# `counter` holds the true counts for each name, built from the ground truth spreadsheet
counter = readTruth(df)

def closeMatch(dicto):
    """Sums the OCR counts in dicto over the close matches of each ground truth name."""
    names = list(counter.keys())
    names.remove(0)  # drop the placeholder key left by fillna(0)

    newDict = {}
    for n in names:
        # get_close_matches returns at most 3 matches by default (n=3)
        matches = difflib.get_close_matches(n, dicto.keys(), cutoff=0.8)
        for match in matches:
            newDict[n] = newDict.get(n, 0) + dicto[match]
    return newDict

countDictG = (closeMatch(counterDictG))
countDictV1 = (closeMatch(counterDictV1))
countDictT = (closeMatch(counterDictT))

countDictT

{'Mr. McCARTER': 17,
 'Senator CUMMINS': 1,
 'Mr. Emery': 173,
 'The CHAIRMAN': 16,
 'Senator REED': 62,
 'Mr. EMERY': 82,
 'Senator WALSH': 7,
 'Mr. McCarter': 47,
 'Mr EMERY': 82,
 'Mr. MULHALL': 6,
 'Mr McCARTER': 17,
 'Senator NELSON': 6,
 'Senator Cummins': 4,
 'Senator Walsh': 5,
 'Senator Nelson': 10}
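
To show how the `cutoff` argument affects which OCR spellings get merged, here is a small sketch using three keys taken from the OCR output above. The looser cutoff of 0.7 is only an illustrative assumption, not a value used in this notebook.

```python
import difflib

ocr_names = ["Senator REEU", "Senator Reep", "Mr. EMERY"]  # keys from the OCR output above

# Strict cutoff, as in closeMatch: only near-identical spellings survive
strict = difflib.get_close_matches("Senator REED", ocr_names, cutoff=0.8)

# Looser cutoff (hypothetical): the lowercase variant now qualifies as well
loose = difflib.get_close_matches("Senator REED", ocr_names, cutoff=0.7)

print(strict)
print(loose)
```

The comparison is case-sensitive, so "Senator Reep" shares only the prefix "Senator R" with "Senator REED" and falls below the 0.8 cutoff. Also note that `get_close_matches` returns at most three matches unless a larger `n` is passed, so a name with many OCR variants may not have all of them summed.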

### Compare the “ground truth” results with the OCR result
<a id="s9"></a>

In [90]:
counter

Counter({'Mr. McCARTER': 39.0,
         'Senator CUMMINS': 3.0,
         'Mr. Emery': 16.0,
         'The CHAIRMAN': 28.0,
         'Senator REED': 133.0,
         'Mr. EMERY': 139.0,
         'Senator WALSH': 21.0,
         'Mr. McCarter': 17.0,
         'Mr EMERY': 33.0,
         'Mr. Hughes': 4.0,
         0: 0.0,
         'Mr. MULHALL': 12.0,
         'THE CLERK OF THE COMMITTEE': 2.0,
         'Mr McCARTER': 2.0,
         'Senator NELSON': 8.0,
         'Senator Cummins': 1.0,
         'Senator Walsh': 2.0,
         'The Chairman': 9.0,
         'Mr. Stafford': 1.0,
         'Senator Nelson': 6.0})

In [98]:
def compare(c, gt):
    """Averages, over all ground truth names, the ratio of OCR count to true count."""
    p = 0
    for i in c:
        if i in gt:
            p += c[i] / gt[i]
    return p / len(gt)

g = compare(countDictG,counter)
t = compare(countDictT,counter)
v = compare(countDictV1,counter)
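
A tiny worked example may help in reading the scores that follow. The sketch repeats the scoring rule from `compare` and applies it to made-up counts; the dictionaries here are assumptions for illustration and are unrelated to the real data.

```python
def compare(c, gt):
    # Same scoring rule as above: sum of count ratios divided by the number of ground truth names
    p = 0
    for i in c:
        if i in gt:
            p += c[i] / gt[i]
    return p / len(gt)

ground_truth = {"Mr. EMERY": 10, "Senator REED": 5}  # made-up counts

exact = compare({"Mr. EMERY": 10, "Senator REED": 5}, ground_truth)  # every count matches
under = compare({"Mr. EMERY": 5}, ground_truth)                      # one name halved, one missing
over = compare({"Mr. EMERY": 20, "Senator REED": 5}, ground_truth)   # one name double-counted

print(exact, under, over)
```

A score of 1.0 corresponds to counts that match on average, values below 1 indicate missed occurrences, and values above 1 indicate that more occurrences were counted than the ground truth records. Since the ratios are averaged, an overcount for one name can offset an undercount for another.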

#### Comparison scores for the Version 1, Tesseract, and Google Cloud files

In [102]:
v

0.7392868317470741

In [103]:
t

1.8354403589474277

In [104]:
g

3.818061203534829

## Interpretation
<a id="s10"></a>

- Version1 (v): The Version 1 OCR had the lowest accuracy, since its post-processing comparison score (about 0.74) was the lowest. This means that it misread the most names of the people in the text.
- Tesseract (t): Tesseract (about 1.84) captured more names correctly than V1 but fewer than Google Cloud.
- Google Cloud (g): Note that Google Cloud's score (about 3.82) is above 1, meaning it attributed more occurrences to the names in the ground truth spreadsheet than there actually are. This suggests that it might overcorrect in its interpretation of the text. Based on these scores, we conclude that ***Google Cloud software would be the best choice for an OCR project.***