## Research Project 2
```text
- Source: PCAOB
- Goal: Information Extraction from PDF files
- Techniques: Regular Expressions, Probabilistic Context-Free Grammar
- Tools: re, duckling
- Lines of code: ~100
```

### Get a PDF

In [11]:
import requests
url = 'https://pcaobus.org/Inspections/Reports/Documents/2005_Tamas_B._Revai_CPA.pdf'
res = requests.get(url)

In [12]:
with open('./2005_Tamas_B._Revai_CPA.pdf', 'wb') as f:
    f.write(res.content)

In [13]:
import os
os.system('pdftotext 2005_Tamas_B._Revai_CPA.pdf 2005_Tamas_B._Revai_CPA.txt')

0

In [14]:
open('./2005_Tamas_B._Revai_CPA.txt', 'r').read()

'1666 K Street, N.W.\nWashington, DC 20006\nTelephone: (202) 207-9100\nFacsimile: (202) 862-8430\nwww.pcaobus.org\n\nInspection of\nTamas B. Revai, CPA\n\nIssued by the\n\nPublic Company Accounting Oversight Board\nJune 23, 2005\n\nTHIS IS A PUBLIC VERSION OF A PCAOB INSPECTION REPORT\nPORTIONS OF THE COMPLETE REPORT ARE OMITTED\nFROM THIS DOCUMENT IN ORDER TO COMPLY WITH\nSECTIONS 104(g)(2) AND 105(b)(5)(A)\nOF THE SARBANES-OXLEY ACT OF 2002\n\nPCAOB RELEASE NO. 104-2005-022\n\n\x0cPCAOB Release No. 104-2005-022\n\nNotes Concerning this Report\n1. Portions of this report may describe deficiencies or potential deficiencies in the systems,\npolicies, procedures, practices, or conduct of the firm that is the subject of this report.\nThe express inclusion of certain deficiencies and potential deficiencies, however, should\nnot be construed to support any negative inference that any other aspect of the firm\'s\nsystems, policies, procedures, practices, or conduct is approved or condoned by

In [None]:
# Exercise 1
# Write a `get_text_from_pdf` function that, given a filename, downloads the 
# PDF and returns its text.

Fields we're interested in:
- `PCAOB Release No`
- `Firm`
- `Offices`
- `Ownership structure`
- `Date of Inspection Report`
- `Inspection Period`
- `Failures`

### Parse "`PCAOB Release No`"
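
The release number appears on the cover page and again in the page headers, so a regular expression will usually find it several times. The following is a minimal sketch on a short excerpt of the extracted text shown earlier; the helper name `find_release_numbers` is hypothetical, and the three-four-three digit layout is assumed from this one report.

```python
import re
from collections import Counter

# Excerpt of the pdftotext output shown earlier (cover page, then first header).
sample = (
    "OF THE SARBANES-OXLEY ACT OF 2002\n\n"
    "PCAOB RELEASE NO. 104-2005-022\n\n"
    "\x0cPCAOB Release No. 104-2005-022\n\n"
    "Notes Concerning this Report\n"
)

def find_release_numbers(text):
    """Return every release number found in `text` (hypothetical helper)."""
    # The dot after "No" is escaped so it only matches a literal period.
    pattern = r"PCAOB Release No\. (\d{3}-\d{4}-\d{3})"
    return re.findall(pattern, text, flags=re.IGNORECASE)

numbers = find_release_numbers(sample)
# The number repeats across pages, so keep the most frequent match.
release_no = Counter(numbers).most_common(1)[0][0]
print(release_no)
```

`re.IGNORECASE` is what lets the same pattern match both the upper-case cover line and the mixed-case header.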

### Parse "`Firm`"
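
On the cover page the firm name sits on the line after "Inspection of". A possible sketch follows, again on an excerpt of the extracted text; `find_firm` is a hypothetical helper, and using `\s+` to tolerate the line break is an assumption of this sketch, not necessarily the pattern used later in the notebook.

```python
import re

# Excerpt of the cover page as extracted by pdftotext.
sample = (
    "www.pcaobus.org\n\n"
    "Inspection of\nTamas B. Revai, CPA\n\n"
    "Issued by the\n\n"
    "Public Company Accounting Oversight Board\nJune 23, 2005\n"
)

def find_firm(text):
    """Return the text following 'Inspection of', up to the end of that line."""
    # \s+ accepts either a space or a line break after "Inspection of".
    match = re.search(r"Inspection of\s+(.*?)\n", text)
    return match.group(1).strip() if match else None

firm = find_firm(sample)
print(firm)
```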

### Parse "`Offices`"

### Parse "`Ownership structure`"
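
Both this field and `Offices` can be read as "whatever sits between two known labels". The sketch below shows the idea; the sample layout is an assumption (only the label names and the two values come from this report), and `between` is a hypothetical helper.

```python
import re

# Assumed layout of the firm-profile block; the real spacing may differ.
sample = (
    "Number of offices\n\n"
    "1 (New York, New York)\n\n"
    "Ownership structure\n\n"
    "Sole proprietorship\n\n"
    "Number of partners\n"
)

def between(text, start, end):
    """Return the whitespace-normalized text between two labels, or None."""
    # re.DOTALL lets "." cross line breaks; re.escape protects the labels.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    return " ".join(match.group(1).split()) if match else None

offices = between(sample, "Number of offices", "Ownership")
ownership = between(sample, "Ownership structure", "Number of partners")
print(offices)
print(ownership)
```

The lazy `(.*?)` stops at the first occurrence of the closing label, which keeps the capture from running on into later sections.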

### Parse "`Date of Inspection Report`"
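
The report date is printed on the cover page, which is everything before the first form feed (`\x0c`) in the extracted text. As a dependency-free alternative to a full date parser, the sketch below uses a regular expression plus `datetime.strptime`. It assumes dates are written like "June 23, 2005" and that an English locale is in effect for `%B`; `find_dates` is a hypothetical helper.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_PATTERN = re.compile(r"\b(%s) (\d{1,2}), (\d{4})\b" % MONTHS)

# Abridged excerpt of the extracted text: cover page, form feed, first header.
sample = (
    "Issued by the\n\n"
    "Public Company Accounting Oversight Board\n"
    "June 23, 2005\n\n"
    "PCAOB RELEASE NO. 104-2005-022\n\n"
    "\x0cPCAOB Release No. 104-2005-022\n\n"
    "Notes Concerning this Report\n"
)

def find_dates(text):
    """Return all 'Month D, YYYY' dates in `text`, sorted ascending."""
    found = []
    for month, day, year in DATE_PATTERN.findall(text):
        found.append(datetime.strptime("%s %s %s" % (month, day, year), "%B %d %Y"))
    return sorted(found)

cover_page = sample.split("\x0c")[0]  # pdftotext separates pages with form feeds
report_date = find_dates(cover_page)[0]
print(report_date)
```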

### Parse "`Inspection Period`"

### Parse "`Failures`"
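
Failures are listed as numbered items, each ending in a semicolon or a blank line. The sketch below applies one such pattern to an illustrative sample: the two findings are abridged from this report, but the exact layout is an assumption, and `find_failures` is a hypothetical helper.

```python
import re

# Illustrative sample: two findings from the 2005 report, abridged and laid
# out as an assumed "(n) ...;" list. The real page layout may differ.
sample = (
    "\n(1) the pervasive failure to plan, perform, and document\n"
    "performance of the audit;\n"
    "(2) the failure to properly evaluate the issuer's ability to\n"
    "continue as a going concern.\n\n"
)

# "(n)" at the start of a line, then any text containing "failure",
# up to the first semicolon or blank line.
FAILURE_PATTERN = r"\n\(\d\)((?:.*?)failure(?:.*?))(?:\n\n|;)"

def find_failures(text):
    """Return each numbered item that mentions a failure, whitespace-normalized."""
    matches = re.findall(FAILURE_PATTERN, text, flags=re.DOTALL)
    return [" ".join(m.split()) for m in matches]

failures = find_failures(sample)
for item in failures:
    print(item)
```

Because the lazy groups can cross line breaks under `re.DOTALL`, a numbered item that does not mention "failure" may be merged into the next one that does, so the results are worth a quick manual check.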

In [4]:
# Standard library
import os
import re
import json
from collections import Counter
from pprint import PrettyPrinter

# Third-party
import requests
from dateutil import parser
from duckling import DucklingWrapper

PPRINTER = PrettyPrinter()
DUCK_PARSER = DucklingWrapper()

def get_text_from_pdf(filename):
    """Download a PCAOB inspection report and return its text via pdftotext."""
    url = 'https://pcaobus.org/Inspections/Reports/Documents/%s.pdf' % filename
    res = requests.get(url)
    res.raise_for_status()  # Fail early instead of saving an error page as a PDF
    with open('./%s.pdf' % filename, 'wb') as f:
        f.write(res.content)
    # If layout matters, use "pdftohtml" instead
    os.system('pdftotext %s.pdf %s.txt' % (filename, filename))
    with open('./%s.txt' % filename, 'r') as f:
        return f.read()

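def get_field_from_regex(text, pattern, flags=0, most_common=True):
    """Return the most frequent match (whitespace-normalized), or all matches."""
    matches = re.findall(pattern, text, flags=flags)
    # Collapse runs of whitespace so line-wrapped matches compare equal
    clean = [' '.join(i.strip().split()) for i in matches]
    if not most_common:
        return clean
    # Headers repeat on every page, so the most frequent match is the safest pick
    counts = Counter(clean)
    return counts.most_common(1)[0][0] if counts else None  # None when nothing matched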

def get_dates(text):
    parsed = DUCK_PARSER.parse_time(text)
    values = [i['value']['value'] for i in parsed if 'grain' in i['value'] and 
              i['value']['grain'] == 'day']
    final = sorted([parser.parse(i).replace(tzinfo=None) for i in values])
    return final

def get_period_of_inspection(text):
    par = get_field_from_regex(text, 
                               r'INSPECTION PROCEDURES AND CERTAIN OBSERVATIONS(.*?)\n\n', 
                               flags=re.DOTALL)
    parsed = DUCK_PARSER.parse_time(par)
    values = [i for i in parsed if isinstance(i['value']['value'], dict) and
              'to' in i['value']['value']]
    if not values:
        final = get_dates(par)
    else:
        final = [parser.parse(values[0]['value']['value']['from']).replace(tzinfo=None), 
                 parser.parse(values[0]['value']['value']['to']).replace(tzinfo=None)]
    return final

def parse_text(text):
    return {
    
        'PCAOB Release No': get_field_from_regex(text=text,
                                                 pattern=r'PCAOB Release No\. (\d{3}-\d{4}-\d{3})',
                                                 flags=re.IGNORECASE),

        'Firm': get_field_from_regex(text=text,
                                     pattern=r'Inspection of (.*?)\n'),

        'Offices': get_field_from_regex(text=text,
                                        pattern=r'Number of offices(.*?)Ownership',
                                        flags=re.DOTALL),

        'Ownership structure': get_field_from_regex(text=text,
                                                    pattern=r'Ownership structure(.*?)Number of partners',
                                                    flags=re.DOTALL),

        'Date of Inspection Report': get_dates(text=text.split('\x0c')[0])[0],

        'Inspection Period': get_period_of_inspection(text=text),

        'Failures': get_field_from_regex(text=text,
                                         pattern=r'\n\(\d\)((?:.*?)failure(?:.*?))(?:\n\n|;)',
                                         flags=re.DOTALL,
                                         most_common=False)
        }

pdfs = [
    '2005_Tamas_B._Revai_CPA',
    '2015_Bravos_Associates'
]

for pdf in pdfs:
    print()
    text = get_text_from_pdf(pdf)
    parsed = parse_text(text)
    PPRINTER.pprint(parsed)




{'Date of Inspection Report': datetime.datetime(2005, 6, 23, 0, 0),
 'Failures': ['the pervasive failure to plan, perform, and document '
              'performance of the audit or the quarterly reviews of interim '
              'financial information for the first three quarters of the '
              "issuer's fiscal year",
              "the failure to properly evaluate the issuer's ability to "
              'continue as a going concern. B.'],
 'Firm': 'Tamas B. Revai, CPA',
 'Inspection Period': [datetime.datetime(2004, 9, 7, 0, 0),
                       datetime.datetime(2004, 10, 7, 0, 0)],
 'Offices': '1 (New York, New York)',
 'Ownership structure': 'Sole proprietorship',
 'PCAOB Release No': '104-2005-022'}

{'Date of Inspection Report': datetime.datetime(2014, 10, 30, 0, 0),
 'Failures': ['the failure to perform sufficient procedures related to '
              'revenue, including the failure to perform the necessary risk '
              'assessment procedures (AS No. 12, 