## Task

We will use Spark NLP to run OCR and then named entity recognition (NER) on public PDF documents pulled from the web. Specifically, we will build a Spark NLP pipeline that takes a set of PDF documents, OCRs them, and extracts entities such as persons, locations, and organizations.

In [1]:
import urllib.request

# PDFs that will be downloaded and used as inputs for our OCR + NER pipeline
filenames = ['https://www.cdc.gov/nchhstp/stateprofiles/pdf/Arizona_profile.pdf',
             'https://www.nrdc.org/sites/default/files/ClimateWaterFS_PhoenixAZ.pdf',
             'https://www.azcc.gov/Divisions/Corporations/Ten-Steps-to-Starting-a-Business-in-Arizona.pdf'
            ]

# Local paths of the PDFs that downloaded successfully
files_for_ocr = []

for ind, fname in enumerate(filenames):
    local_name = 'testfile_' + str(ind) + '.pdf'
    try:
        urllib.request.urlretrieve(fname, local_name)
        print('Download complete for ', fname)
        files_for_ocr.append(local_name)
    except Exception as err:
        # Report the reason and continue with the remaining files
        print('Failed to download file ', fname, err)

Download complete for  https://www.cdc.gov/nchhstp/stateprofiles/pdf/Arizona_profile.pdf
Download complete for  https://www.nrdc.org/sites/default/files/ClimateWaterFS_PhoenixAZ.pdf
Download complete for  https://www.azcc.gov/Divisions/Corporations/Ten-Steps-to-Starting-a-Business-in-Arizona.pdf
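
A successful `urlretrieve` call only means that something was saved to disk; a server can also return an HTML error page under a `.pdf` name. The sketch below is an optional sanity check, not part of the original notebook. `looks_like_pdf` is a hypothetical helper, and it assumes the common case where a PDF file starts with the bytes `%PDF-`.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file exists and starts with the '%PDF-' header bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open('rb') as fh:
        # Assumption: the header sits at the very start of the file
        return fh.read(5) == b'%PDF-'
```

It could be applied before OCR with `files_for_ocr = [f for f in files_for_ocr if looks_like_pdf(f)]`.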


In [2]:
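import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with the OCR module enabled
spark = sparknlp.start(include_ocr=True)
# Load the pretrained pipeline; its output includes the 'token' and 'ner' annotations used below
model = PretrainedPipeline('explain_document_dl')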
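
The pipeline's NER output is one tag per token, with labels such as `'I-PER'`, `'I-ORG'`, and `'I-LOC'`, so a two-word name comes back as two separate tokens. The sketch below shows one way to merge adjacent tokens that share a tag into a single entity string. `group_entities` is a hypothetical helper, not part of Spark NLP, and the token and tag lists are hand-written for illustration. It assumes only `I-` style tags and `'O'` for non-entities, and it will wrongly join two distinct entities of the same type if they sit next to each other.

```python
def group_entities(tokens, tags):
    """Merge runs of adjacent tokens that share the same non-'O' tag."""
    entities = []
    current_tokens, current_tag = [], None
    for token, tag in zip(tokens, tags):
        if tag != 'O' and tag == current_tag:
            # Same entity type as the previous token: extend the current run
            current_tokens.append(token)
            continue
        if current_tokens:
            # The run ended: store it as one entity
            entities.append((' '.join(current_tokens), current_tag))
        # Start a new run for an entity tag, or reset on 'O'
        current_tokens, current_tag = ([token], tag) if tag != 'O' else ([], None)
    if current_tokens:
        entities.append((' '.join(current_tokens), current_tag))
    return entities


# Illustrative input only; real input would be result['token'] and result['ner']
tokens = ['Michelle', 'Mehta', 'works', 'at', 'NRDC', 'in', 'Phoenix']
tags = ['I-PER', 'I-PER', 'O', 'O', 'I-ORG', 'O', 'I-LOC']
print(group_entities(tokens, tags))
# [('Michelle Mehta', 'I-PER'), ('NRDC', 'I-ORG'), ('Phoenix', 'I-LOC')]
```

Grouping at this level keeps names like "Michelle Mehta" together instead of storing each word as a separate set member.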

In [3]:
from sparknlp.ocr import OcrHelper

for cur_file in files_for_ocr:
    print('\nWorking on file: ', cur_file)
    # Run OCR on the PDF; each row of the resulting dataset has a 'text' field
    data = OcrHelper().createDataset(spark, cur_file)

    # Unique tokens tagged 'I-PER', 'I-LOC', and 'I-ORG', respectively
    people, locs, orgs = set(), set(), set()
    for row in data.collect():
        # Annotate the OCR'd text, then pair each token with its NER tag
        result = model.annotate(row['text'], 'text')
        ner_outputs = list(zip(result['token'], result['ner']))

        for token, tag in ner_outputs:
            if tag == 'I-PER':
                people.add(token)
            elif tag == 'I-ORG':
                orgs.add(token)
            elif tag == 'I-LOC':
                locs.add(token)

    print('People: ', people)
    print('Organizations: ', orgs)
    print('Locations: ', locs)



Working on file:  testfile_0.pdf
People:  {'\uf0b7'}
Organizations:  {'Initiatives', 'P&S', 'Reported', 'Health', 'National', 'Arizona', 'Profile', 'Centers', 'State', 'CDC', 'Control', 'Center', 'Prevention', 'School', '\uf0b7', 'Disease', 'Supported', 'CS2382532', 'TB', 'Ranked', 'health', 'for', 'HIV/AIDS', 'Information', 'More', 'Program'}
Locations:  {'States', 'County', 'Arizona', 'U.S.', 'Maricopa', 'United'}

Working on file:  testfile_1.pdf
People:  {'Michelle', 'Mehta'}
Organizations:  {'l_', 'Becoming', 'Phoenix’s', 'For', 'Change', 'Resilient', 'and', 'Impacts', 'Climate', 'Southwest', 'Council', 'Network', 'H', 'PDF', 'NRDC', '“', 'Arizona-Identifying', '©', 'Resources', 'Defense', 'of', 'Natural', 'Phoenix', 'The', 'More', 'CAP'}
Locations:  {'Central', 'Project', 'Granite', 'Underground', 'Storage', 'Colorado', 'Sonoran', 'Reef', 'States', 'Arizona', 'Phoenix', 'Desert', 'United', 'Verde', 'River', 'Situated', 'Salt'}

Working on file:  testfile_2.pdf
People:  set()
Org