## Research Project 1
---
```text
- Source: PCAOB
- Goal: Information Extraction from PDF files
- Techniques: Regular Expressions
- Tools: pdftotext, re, duckling
- Lines of code: ~100
```

### Get a PDF
---

In [3]:
import requests
url = 'https://pcaobus.org/Inspections/Reports/Documents/2005_Tamas_B._Revai_CPA.pdf'
res = requests.get(url)

In [5]:
filename = '2005_Tamas_B._Revai_CPA'
with open('../data/%s.pdf' % filename, 'wb') as f:
    f.write(res.content)

In [9]:
import os
import sys

if sys.platform == 'darwin':
    # On macOS, call the pdftotext command-line tool and split pages on form feeds
    os.system('pdftotext ../data/%s.pdf ../data/%s.txt' % (filename, filename))
    with open('../data/%s.txt' % filename) as f:
        pdf = f.read().split('\x0c')
else:
    # Elsewhere, use the pdftotext Python package
    import pdftotext
    with open('../data/%s.pdf' % filename, 'rb') as f:
        pdf = pdftotext.PDF(f)
print()
print(pdf[0])


1666 K Street, N.W.
Washington, DC 20006
Telephone: (202) 207-9100
Facsimile: (202) 862-8430
www.pcaobus.org

Inspection of
Tamas B. Revai, CPA

Issued by the

Public Company Accounting Oversight Board
June 23, 2005

THIS IS A PUBLIC VERSION OF A PCAOB INSPECTION REPORT
PORTIONS OF THE COMPLETE REPORT ARE OMITTED
FROM THIS DOCUMENT IN ORDER TO COMPLY WITH
SECTIONS 104(g)(2) AND 105(b)(5)(A)
OF THE SARBANES-OXLEY ACT OF 2002

PCAOB RELEASE NO. 104-2005-022
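
The command-line branch above splits the text on the form feed character `\x0c`. A minimal sketch of that step follows, assuming the tool writes a form feed after every page, including the last one, so that a plain `split` leaves an empty trailing entry. The helper name `split_pages` and the sample string are made up for illustration.

```python
def split_pages(raw):
    """Split pdftotext-style output into pages on the form feed character."""
    pages = raw.split('\x0c')
    # Assumption: a form feed follows the last page, leaving an empty tail
    if pages and pages[-1].strip() == '':
        pages.pop()
    return pages

# Made-up two-page sample, not taken from the report
raw = "first page\n\x0csecond page\n\x0c"
print(len(split_pages(raw)))  # 2
```

Dropping the empty tail keeps `len(pages)` equal to the number of pages, which matters if later code indexes pages from the end.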




<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise 1.1</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a <span style="font-family:monospace;">get_text_from_pdf</span> function that, given a filename, downloads the PDF and returns its text as a list of pages.</p></a></font>
</div>

Fields we're interested in:
- `PCAOB Release No`
- `Firm`
- `Offices`
- `Ownership structure`
- `Date of Inspection Report`
- `Inspection Period`
- `Failures`

### Parse "`PCAOB Release No`"
---

In [4]:
pdf[0]

'1666 K Street, N.W.\nWashington, DC 20006\nTelephone: (202) 207-9100\nFacsimile: (202) 862-8430\nwww.pcaobus.org\n\nInspection of\nTamas B. Revai, CPA\n\nIssued by the\n\nPublic Company Accounting Oversight Board\nJune 23, 2005\n\nTHIS IS A PUBLIC VERSION OF A PCAOB INSPECTION REPORT\nPORTIONS OF THE COMPLETE REPORT ARE OMITTED\nFROM THIS DOCUMENT IN ORDER TO COMPLY WITH\nSECTIONS 104(g)(2) AND 105(b)(5)(A)\nOF THE SARBANES-OXLEY ACT OF 2002\n\nPCAOB RELEASE NO. 104-2005-022\n\n'

In [5]:
import re
re.findall(r'PCAOB RELEASE NO\.? ?([0-9\-]+)', pdf[0], flags=re.IGNORECASE)[0]

'104-2005-022'
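
Indexing `re.findall(...)[0]` raises an `IndexError` when a page has no release number. One possible way to guard against that is sketched below. The names `RELEASE_RE` and `find_release_no` are hypothetical, and the pattern is a slightly stricter variant of the one above (digit groups joined by hyphens), not the notebook's original.

```python
import re

# Hypothetical stricter pattern: groups of digits joined by hyphens
RELEASE_RE = re.compile(r'PCAOB RELEASE NO\.?\s*([0-9]+(?:-[0-9]+)+)', re.IGNORECASE)

def find_release_no(page_text):
    """Return the release number found in page_text, or None if there is none."""
    match = RELEASE_RE.search(page_text)
    return match.group(1) if match else None

sample = "OF THE SARBANES-OXLEY ACT OF 2002\n\nPCAOB RELEASE NO. 104-2005-022\n\n"
print(find_release_no(sample))             # 104-2005-022
print(find_release_no("no number here"))   # None
```

Because of `re.IGNORECASE`, the same function also matches the mixed-case page header `PCAOB Release No. 104-2005-022` on later pages.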

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1;
            background-color: #FCF3CF;
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4">
    <a href="../deep_dives/regex.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Deep-dive</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Regular expressions</p></a></font>
</div>

### Parse "`Firm`"
---

In [6]:
re.findall(r'Inspection of\n?(.*?)\n', pdf[0])[0]

'Tamas B. Revai, CPA'

### Parse "`Offices`"
---

In [7]:
pdf[3]

'PCAOB Release No. 104-2005-022\nInspection of Tamas B. Revai, CPA\nJune 23, 2005\nPage 2\n\nPART I\nINSPECTION PROCEDURES AND CERTAIN OBSERVATIONS\nMembers of the Board\'s inspection staff ("the inspection team") conducted\nfieldwork for the inspection on September 7, 2004 and October 7, 2004. The fieldwork\nincluded procedures tailored to the nature of the Firm, certain aspects of which the\ninspection team understood at the outset of the inspection to be as follows:\nNumber of offices\n\n1 (New York, New York)\n\nOwnership structure\n\nSole proprietorship\n\nNumber of partners\n\n1\n\nNumber of professional staff3/\n\nNone\n\nNumber of issuer audit clients4/\n\nNone5/\n\nBoard inspections are designed to identify and address weaknesses and\ndeficiencies related to how a firm conducts audits. To achieve that goal, Board\ninspections include reviews of certain aspects of selected audits performed by the firm\nand reviews of other matters related to the firm\'s quality control system.\

In [8]:
re.findall(r'Number of offices(.*?)Ownership', pdf[3], flags=re.DOTALL)[0].strip()

'1 (New York, New York)'
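
The extracted string mixes a count with a location. If the two are needed separately, a second regular expression can split them. This is only a sketch: `parse_offices` is a hypothetical helper, and it assumes the value always has the shape `N (location)` or just `N`.

```python
import re

def parse_offices(raw):
    """Split a value such as '1 (New York, New York)' into count and location."""
    match = re.match(r'\s*(\d+)\s*(?:\((.*)\))?\s*$', raw, flags=re.DOTALL)
    if match is None:
        return None  # the value does not start with a number
    count, location = match.groups()
    return {'count': int(count), 'location': location}

print(parse_offices('1 (New York, New York)'))
# {'count': 1, 'location': 'New York, New York'}
```

When the parentheses are missing, `location` comes back as `None` rather than raising an error.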

### Parse "`Ownership structure`"
---

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise 1.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">
                    Write a regular expression that extracts the ownership structure of the firm.</p></a></font>
</div>

### Parse "`Date of Inspection Report`"
---

In [9]:
from duckling import DucklingWrapper
DUCK_PARSER = DucklingWrapper()

In [10]:
parsed = DUCK_PARSER.parse_time(pdf[0])
len(parsed)

16

In [11]:
parsed[0]

{'dim': 'time',
 'end': 4,
 'start': 0,
 'text': '1666',
 'value': {'grain': 'year',
  'others': [],
  'value': '1666-01-01T00:00:00.000-04:56:02'}}

In [12]:
parsed

[{'dim': 'time',
  'end': 4,
  'start': 0,
  'text': '1666',
  'value': {'grain': 'year',
   'others': [],
   'value': '1666-01-01T00:00:00.000-04:56:02'}},
 {'dim': 'time',
  'end': 427,
  'start': 423,
  'text': '2002',
  'value': {'grain': 'year',
   'others': [],
   'value': '2002-01-01T00:00:00.000-05:00'}},
 {'dim': 'time',
  'end': 56,
  'start': 53,
  'text': '202',
  'value': {'grain': 'year',
   'others': [],
   'value': '0202-01-01T00:00:00.000-04:56:02'}},
 {'dim': 'time',
  'end': 61,
  'start': 58,
  'text': '207',
  'value': {'grain': 'year',
   'others': [],
   'value': '0207-01-01T00:00:00.000-04:56:02'}},
 {'dim': 'time',
  'end': 82,
  'start': 79,
  'text': '202',
  'value': {'grain': 'year',
   'others': [],
   'value': '0202-01-01T00:00:00.000-04:56:02'}},
 {'dim': 'time',
  'end': 87,
  'start': 84,
  'text': '862',
  'value': {'grain': 'year',
   'others': [],
   'value': '0862-01-01T00:00:00.000-04:56:02'}},
 {'dim': 'time',
  'end': 370,
  'start': 367,
  'tex

In [13]:
report = [i['value']['value'] for i in parsed if 'grain' in i['value'] and 
          i['value']['grain'] == 'day']
report

['2005-06-23T00:00:00.000-04:00']

In [14]:
from dateutil import parser
sorted([parser.parse(i).replace(tzinfo=None) for i in report])[0]

datetime.datetime(2005, 6, 23, 0, 0)
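
When Duckling is not available, a plain regular expression can serve as a rough fallback for dates written as `Month D, YYYY`. This is a swapped-in technique rather than the approach used above, and it recognizes only that one format. `MONTHS`, `DATE_RE`, and `find_dates` are names introduced here for illustration.

```python
import re
from datetime import datetime

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
# Matches dates such as "June 23, 2005"
DATE_RE = re.compile(r'\b(%s)\s+(\d{1,2}),\s+(\d{4})\b' % '|'.join(MONTHS))

def find_dates(text):
    """Return a datetime for every 'Month D, YYYY' date found in text."""
    return [datetime(int(year), MONTHS.index(month) + 1, int(day))
            for month, day, year in DATE_RE.findall(text)]

sample = "Public Company Accounting Oversight Board\nJune 23, 2005\n\nACT OF 2002\n"
print(find_dates(sample))  # [datetime.datetime(2005, 6, 23, 0, 0)]
```

Unlike the Duckling output, this does not pick up bare numbers such as `1666` or `202`, so no filtering on `grain` is needed.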

### Parse "`Inspection Period`"
---

In [15]:
parsed = DUCK_PARSER.parse_time(pdf[3])
values = [i['value']['value'] for i in parsed
          if 'grain' in i['value'] and i['value']['grain'] == 'day'
          and i['value']['value'] not in report]
values

['2004-10-07T00:00:00.000-04:00', '2004-09-07T00:00:00.000-04:00']
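
The two timestamps come back in no particular order. A small sketch of turning them into a start and end date follows, assuming the earliest and latest fieldwork dates bound the inspection period. `to_date` and `period` are illustrative names.

```python
from datetime import datetime

values = ['2004-10-07T00:00:00.000-04:00', '2004-09-07T00:00:00.000-04:00']

def to_date(stamp):
    """Keep only the calendar date from a Duckling-style timestamp."""
    # The first ten characters are the ISO date, e.g. '2004-10-07'
    return datetime.strptime(stamp[:10], '%Y-%m-%d').date()

dates = sorted(to_date(v) for v in values)
period = {'start': dates[0], 'end': dates[-1]}
print(period['start'], period['end'])  # 2004-09-07 2004-10-07
```

Slicing off the time and UTC offset sidesteps time zone handling, which is acceptable here because only the calendar day matters.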

### Parse "`Failures`"
---

In [16]:
pdf[4] + ' ' + pdf[5]

"PCAOB Release No. 104-2005-022\nInspection of Tamas B. Revai, CPA\nJune 23, 2005\nPage 3\n\naddress appropriately, respects in which an issuer's financial statements do not present\nfairly the financial position, results of operations, or cash flows of the issuer in\nconformity with GAAP.6/ It is not the purpose of an inspection, however, to review all of\na firm's audits or to identify every respect in which a reviewed audit is deficient.\nAccordingly, a Board inspection report should not be understood to provide any\nassurance that the firm's audits, or its issuer clients' financial statements, are free of any\ndeficiencies not specifically described in an inspection report.\nA.\n\nReview of Audit Engagement\n\nThe scope of the inspection procedures performed included review of aspects of\nthe performance of the Firm's audit of the financial statements of its issuer audit client.\nThose aspects were selected according to the Board's criteria, and the Firm was not\nallowed an opportu

In [17]:
failures = re.findall(r'\n\(\d\)(.*?failure.*?)(?:\.|;)', 
                      pdf[4] + ' ' + pdf[5], flags=re.DOTALL)
[' '.join(i.split()).strip() for i in failures]

["the pervasive failure to plan, perform, and document performance of the audit or the quarterly reviews of interim financial information for the first three quarters of the issuer's fiscal year",
 "the failure to properly evaluate the issuer's ability to continue as a going concern"]
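
To see how this pattern behaves in isolation, it can be wrapped in a function and run on a short string. The sample below is made up for illustration (it borrows phrases from the output above but is not the report text), and `find_failures` is a hypothetical name. The sketch uses `\d+` instead of `\d` so that item numbers above 9 also match.

```python
import re

def find_failures(text):
    """Return numbered items that mention 'failure', with whitespace normalized."""
    items = re.findall(r'\n\(\d+\)(.*?failure.*?)(?:\.|;)', text, flags=re.DOTALL)
    return [' '.join(item.split()) for item in items]

# Made-up sample in the style of the report
sample = (
    "The deficiencies identified were\n"
    "(1) the pervasive failure to plan, perform, and document the audit;\n"
    "and\n"
    "(2) the failure to properly evaluate the issuer's ability to continue as a going\n"
    "concern.\n"
)
for item in find_failures(sample):
    print(item)
```

Two limitations are worth keeping in mind: each item ends at the first period or semicolon, so an abbreviation such as `U.S.` would cut it short, and with `re.DOTALL` a numbered item without the word "failure" would be merged into the next item that has it.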

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise 1.3</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a function <span style="font-family:monospace;">parse_text</span> that, given the parsed <span style="font-family:monospace;">pdf</span>, returns a dictionary with all the fields we extracted above.</p></a></font>
</div>