## Research Project 1
---
```text
- Source: PCAOB
- Goal: Information Extraction from PDF files
- Techniques: Regular Expressions
- Tools: pdftotext, re, duckling
- Lines of code: ~100
```

### Get a PDF
---

In [None]:
import requests
# Download one PCAOB inspection report
url = 'https://pcaobus.org/Inspections/Reports/Documents/2005_Tamas_B._Revai_CPA.pdf'
res = requests.get(url)
res.raise_for_status()  # stop early if the download failed

In [None]:
filename = '2005_Tamas_B._Revai_CPA'
with open('../data/%s.pdf' % filename, 'wb') as f:
    f.write(res.content)

In [None]:
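import sys
if sys.platform == 'darwin':
    # On macOS, shell out to the pdftotext command-line tool
    import os
    os.system('pdftotext ../data/%s.pdf ../data/%s.txt' % (filename, filename))
    # The text output separates pages with a form feed character ('\x0c')
    with open('../data/%s.txt' % filename) as f:
        pdf = f.read().split('\x0c')
else:
    # Elsewhere, use the pdftotext Python bindings
    import pdftotext
    with open("../data/%s.pdf" % filename, "rb") as f:
        pdf = pdftotext.PDF(f)
print()
print(pdf[0])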
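
The `split('\x0c')` call in the macOS branch relies on the `pdftotext` command writing a form feed character after each page, which is its usual behavior. A minimal sketch of that splitting step, using a made-up two-page string in place of real `pdftotext` output (`split_pages` and `raw_text` are hypothetical names, not part of the notebook):

```python
def split_pages(raw_text):
    """Split pdftotext-style output into a list of page strings."""
    pages = raw_text.split('\x0c')
    # A form feed after the last page leaves an empty trailing entry; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    return pages

# Made-up sample standing in for the contents of the .txt file
raw_text = 'Page one text\n\x0cPage two text\n\x0c'
pages = split_pages(raw_text)
print(len(pages))
print(repr(pages[0]))
```

Dropping the empty trailing entry keeps `len(pages)` equal to the number of pages, which may matter when pages are later addressed by index.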

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise 1.1</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a <span style="font-family:monospace;">get_text_from_pdf</span> function that, given a filename, downloads the PDF and returns its text as a list of pages.</p></a></font>
</div>

Fields we're interested in:
- `PCAOB Release No`
- `Firm`
- `Offices`
- `Ownership structure`
- `Date of Inspection Report`
- `Inspection Period`
- `Failures`

### Parse "`PCAOB Release No`"
---

In [None]:
pdf[0]

In [None]:
import re
re.findall(r'PCAOB RELEASE NO\.? ?([0-9\-]+)', pdf[0], flags=re.IGNORECASE)[0]
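
Indexing the result of `re.findall` with `[0]` raises an `IndexError` when the pattern is absent from the page. A small sketch of a more forgiving variant, assuming the same pattern; `extract_release_no` is a hypothetical helper and the sample strings are made up rather than taken from the report:

```python
import re

RELEASE_RE = re.compile(r'PCAOB RELEASE NO\.? ?([0-9\-]+)', flags=re.IGNORECASE)

def extract_release_no(page_text):
    """Return the release number, or None when the pattern is absent."""
    match = RELEASE_RE.search(page_text)
    return match.group(1) if match else None

# Made-up cover-page snippets, not taken from the actual report
print(extract_release_no('PCAOB Release No. 104-2005-082'))
print(extract_release_no('No release number here'))
```

Returning `None` lets a caller decide how to handle reports whose cover page is laid out differently.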

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1;
            background-color: #FCF3CF;
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4">
    <a href="../deep_dives/regex.ipynb" style="text-decoration: none"> 
    <h3 style="font-family: monospace">Deep-dive</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Regular expressions</p></a></font>
</div>

### Parse "`Firm`"
---

In [None]:
# strip() removes any whitespace left between "Inspection of" and the firm name
re.findall(r'Inspection of\n?(.*?)\n', pdf[0])[0].strip()

### Parse "`Offices`"
---

In [None]:
pdf[3]

In [None]:
re.findall(r'Number of offices(.*?)Ownership', pdf[3], flags=re.DOTALL)[0].strip()
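
The `re.DOTALL` flag matters here because the value sits on a different line from its label, and by default `.` does not match a newline. A short sketch of the difference, using a made-up snippet shaped like the firm profile page (the office location is invented for illustration):

```python
import re

# Made-up snippet shaped like the firm profile page
sample = ('Number of offices\n'
          '1 (Springfield, Illinois)\n'
          'Ownership structure\n'
          'Sole proprietorship\n')
pattern = r'Number of offices(.*?)Ownership'

# Without DOTALL, '.' stops at the first newline, so nothing matches
without_dotall = re.findall(pattern, sample)
# With DOTALL, '.' also matches newlines and the capture can span lines
with_dotall = re.findall(pattern, sample, flags=re.DOTALL)

print(without_dotall)
print(with_dotall[0].strip())
```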

### Parse "`Ownership structure`"
---

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise 1.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">
                    Write a regular expression that extracts the ownership structure of the firm.</p></a></font>
</div>

### Parse "`Date of Inspection Report`"
---

In [None]:
from duckling import DucklingWrapper
DUCK_PARSER = DucklingWrapper()

In [None]:
parsed = DUCK_PARSER.parse_time(pdf[0])
len(parsed)

In [None]:
parsed[0]

In [None]:
parsed

In [None]:
report = [i['value']['value'] for i in parsed if 'grain' in i['value'] and 
          i['value']['grain'] == 'day']
report

In [None]:
from dateutil import parser
sorted([parser.parse(i).replace(tzinfo=None) for i in report])[0]
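
The filtering and sorting steps can be tried without a running Duckling instance. The sketch below assumes result dictionaries shaped like the ones accessed above (a nested `value` dictionary with an optional `grain` key); the entries in `parsed_sample` are hypothetical, and the standard library's `datetime.fromisoformat` stands in for `dateutil` so the example has no third-party dependencies:

```python
from datetime import datetime

# Hypothetical entries shaped like the dictionaries accessed above
parsed_sample = [
    {'text': 'October 27, 2005',
     'value': {'value': '2005-10-27T00:00:00.000-07:00', 'grain': 'day'}},
    {'text': '2005',
     'value': {'value': '2005-01-01T00:00:00.000-08:00', 'grain': 'year'}},
    {'text': 'January 5, 2006',
     'value': {'value': '2006-01-05T00:00:00.000-08:00', 'grain': 'day'}},
    # Entry with no 'grain' key, which the filter below skips
    {'text': 'May to June',
     'value': {'value': {'from': '2005-05-01T00:00:00.000-07:00',
                         'to': '2005-07-01T00:00:00.000-07:00'}}},
]

# Keep only day-level dates; .get() covers entries without a 'grain' key
day_values = [item['value']['value'] for item in parsed_sample
              if item['value'].get('grain') == 'day']

# Drop the UTC offset so all datetimes are naive and compare consistently
earliest = min(datetime.fromisoformat(v).replace(tzinfo=None) for v in day_values)
print(earliest)
```

Taking the earliest day-level date treats the first date on the cover page as the report date, which is a heuristic rather than a guarantee.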

### Parse "`Inspection Period`"
---

In [None]:
parsed = DUCK_PARSER.parse_time(pdf[3])
# Keep day-level dates that were not already found on the cover page
values = [i['value']['value'] for i in parsed
          if 'grain' in i['value']
          and i['value']['grain'] == 'day'
          and i['value']['value'] not in report]
values

### Parse "`Failures`"
---

In [None]:
pdf[4] + ' ' + pdf[5]

In [None]:
failures = re.findall(r'\n\(\d\)(.*?failure.*?)(?:\.|;)', 
                      pdf[4] + ' ' + pdf[5], flags=re.DOTALL)
[' '.join(i.split()).strip() for i in failures]
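
The pattern captures each numbered item that mentions "failure" and stops at the first period or semicolon. A sketch on made-up text (not taken from the report) shows both the whitespace clean-up and one limitation, namely that an abbreviation such as "U.S." ends the capture early:

```python
import re

# Made-up text shaped like a numbered list of findings
sample = ('The deficiencies included:\n'
          '(1) the failure to perform adequate\n'
          'audit procedures to test revenue;\n'
          '(2) the failure to evaluate the U.S. subsidiary;\n')

matches = re.findall(r'\n\(\d\)(.*?failure.*?)(?:\.|;)', sample, flags=re.DOTALL)

# Collapse line breaks and repeated spaces inside each captured item
cleaned = [' '.join(m.split()) for m in matches]
for item in cleaned:
    print(item)
```

Because `\d` matches a single digit, items numbered `(10)` and above would also be missed; `\d+` is one possible adjustment if longer lists appear.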

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="./playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise 1.3</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Write a function <span style="font-family:monospace;">parse_text</span> that, given the parsed <span style="font-family:monospace;">pdf</span>, returns a dictionary with all the fields we extracted above.</p></a></font>
</div>