## Task

We will use Spark NLP to perform OCR and then named entity recognition (NER) on public PDF documents pulled from the web. Specifically, we will build a Spark NLP pipeline that takes a set of PDF documents, runs OCR on them, and extracts entities such as people, locations, and organizations.
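
The NER step in this notebook labels individual tokens with tags such as 'I-PER', 'I-LOC', and 'I-ORG', so a two-word name comes back as two separate tokens. The sketch below shows one possible way to join runs of consecutive tokens that share a tag into a single phrase. It is not part of the original pipeline: `merge_entities` is a hypothetical helper, the sample sentence is made up, and it assumes non-entity tokens carry the tag 'O'.

```python
from itertools import groupby

def merge_entities(tokens, tags):
    """Join runs of consecutive tokens that share the same entity tag.

    Hypothetical helper, not part of Spark NLP. Assumes tokens outside
    any entity are tagged 'O'.
    """
    entities = {}
    # groupby yields one group per run of identical consecutive tags
    for tag, run in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        if tag == 'O':
            continue
        phrase = ' '.join(token for token, _ in run)
        entities.setdefault(tag, set()).add(phrase)
    return entities

# Made-up sentence, for illustration only
tokens = ['Michelle', 'Mehta', 'works', 'in', 'Phoenix', ',', 'Arizona']
tags = ['I-PER', 'I-PER', 'O', 'O', 'I-LOC', 'O', 'I-LOC']

merged = merge_entities(tokens, tags)
# merged == {'I-PER': {'Michelle Mehta'}, 'I-LOC': {'Phoenix', 'Arizona'}}
```

One limitation of this approach: with 'I-' tags alone, two different entities of the same type that sit directly next to each other would be joined into one phrase.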

In [1]:
# URLs of the PDFs that will be downloaded and used as inputs for our OCR + NER pipeline
import urllib.request

filenames = ['https://www.cdc.gov/nchhstp/stateprofiles/pdf/Arizona_profile.pdf',
             'https://www.nrdc.org/sites/default/files/ClimateWaterFS_PhoenixAZ.pdf',
             'https://www.azcc.gov/Divisions/Corporations/Ten-Steps-to-Starting-a-Business-in-Arizona.pdf'
            ]

# Local paths of the files that downloaded successfully
files_for_ocr = []

for ind, fname in enumerate(filenames):
    local_path = 'testfile_' + str(ind) + '.pdf'
    try:
        urllib.request.urlretrieve(fname, local_path)
        print('Download complete for ', fname)
        files_for_ocr.append(local_path)
    except Exception:
        # Skip this file but keep going; avoid a bare except, which would also swallow KeyboardInterrupt
        print('Failed to download file ', fname)

Download complete for  https://www.cdc.gov/nchhstp/stateprofiles/pdf/Arizona_profile.pdf
Download complete for  https://www.nrdc.org/sites/default/files/ClimateWaterFS_PhoenixAZ.pdf
Download complete for  https://www.azcc.gov/Divisions/Corporations/Ten-Steps-to-Starting-a-Business-in-Arizona.pdf
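
A successful download does not guarantee that the saved file is a PDF, since a server can return an HTML error page with a success status. As an optional sanity check before OCR, the sketch below tests whether a file starts with the `%PDF-` header that PDF files normally begin with. `looks_like_pdf` is a hypothetical helper, not part of Spark NLP or the original notebook.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes '%PDF-'.

    Hypothetical helper: it only inspects the first five bytes and does
    not validate the rest of the file.
    """
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'
```

In this notebook it could be applied as `files_for_ocr = [f for f in files_for_ocr if looks_like_pdf(f)]` before the OCR step.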


In [2]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start(include_ocr=True)
model = PretrainedPipeline('explain_document_dl')

In [3]:
from sparknlp.ocr import OcrHelper

for cur_file in files_for_ocr:
    print('\nOCRing file: ', cur_file)
    data = OcrHelper().createDataset(spark, cur_file)

    print('Extracting NER information')
    # Tokens tagged 'I-PER', 'I-LOC', and 'I-ORG' respectively
    people, locs, orgs = set(), set(), set()

    for row in data.collect():
        result = model.annotate(row['text'], 'text')
        # Pair each token with its NER tag and bucket it by entity type
        ner_outputs = list(zip(result['token'], result['ner']))

        for token, tag in ner_outputs:
            if tag == 'I-PER':
                people.add(token)
            elif tag == 'I-ORG':
                orgs.add(token)
            elif tag == 'I-LOC':
                locs.add(token)

    print('People: ', people)
    print('Organizations: ', orgs)
    print('Locations: ', locs)


OCRing file:  testfile_0.pdf
Extracting NER information
People:  {'\uf0b7'}
Organizations:  {'Prevention', 'Reported', 'Supported', 'Program', 'State', 'Ranked', 'Center', 'National', 'Disease', 'School', 'Arizona', 'Information', 'P&S', 'CS2382532', 'Centers', 'Initiatives', 'More', '\uf0b7', 'health', 'CDC', 'Profile', 'Control', 'for', 'TB', 'HIV/AIDS', 'Health'}
Locations:  {'County', 'U.S.', 'States', 'Maricopa', 'Arizona', 'United'}
OCRing file:  testfile_1.pdf
Extracting NER information
People:  {'Michelle', 'Mehta'}
Organizations:  {'Arizona-Identifying', 'Defense', 'Council', 'Phoenix', 'Southwest', 'Resilient', 'Network', '“', 'l_', 'Phoenix’s', 'The', 'PDF', 'CAP', 'NRDC', 'of', 'Change', '©', 'More', 'Resources', 'and', 'Climate', 'Natural', 'Impacts', 'Becoming', 'H', 'For'}
Locations:  {'Verde', 'Desert', 'States', 'Underground', 'Sonoran', 'Phoenix', 'Salt', 'Project', 'Reef', 'Storage', 'Arizona', 'Situated', 'River', 'Colorado', 'Granite', 'Central', 'United'}
OCRing f