# Settings

In [1]:
from PyPDF2 import PdfFileMerger, PageRange, PdfFileWriter, PdfFileReader
import fitz
from os import listdir
import re

In [2]:
# Obtain a list of all the files in the 'source_data/pdf' folder
file_list = listdir('../source_data/pdf/')

# Weed out any file in the list that may not be a PDF
file_list = [file for file in file_list if re.search(r'\.pdf$', file)]
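
As an alternative to the regular expression, the extension can be compared directly. The sketch below is only an illustration: `is_pdf` and the sample file names are hypothetical, and accepting upper-case extensions is an assumption that goes beyond what the original filter does.

```python
import os

def is_pdf(filename):
    """Return True when the file extension is .pdf, ignoring case."""
    return os.path.splitext(filename)[1].lower() == '.pdf'

# Hypothetical file names, used only to show which ones pass the filter
names = ['report_01.pdf', 'REPORT_02.PDF', 'notes_about_pdf', 'summary.pdf.bak']
print([name for name in names if is_pdf(name)])
```

Checking the extension rather than the last three characters keeps out names such as `notes_about_pdf`, which merely end in the letters "pdf".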

In [3]:
print('We have a total of', len(file_list), 'PDF files we wish to scrape.')

We have a total of 70 PDF files we wish to scrape.


# Bringing everybody together: Merge all PDF reports into three different files

The first page of each report has a different format from the rest of the report's pages, so the cropping strategy for first pages differs from that for the remaining pages. That is why we regroup all the pages of the 70 reports into three different PDF files:

1. A PDF containing all the pages of all the original reports.
1. A PDF containing only the first page of each of the 70 original reports.
1. A PDF containing only the non-first pages of each of the 70 original reports.

We will only crop files 2 and 3. File 1, containing all pages from all files, is created only for consistency testing (making sure, for example, that the page totals of files 2 and 3 add up to the total of file 1).

In [4]:
mega_pdf = PdfFileMerger() # Pulls together all files into one big one
pdf_first_pages = PdfFileMerger() # This one will contain all the first pages of each report
pdf_non_first_pages = PdfFileMerger() # This one will contain all the non-first pages of each report

for file in file_list:

    # Merge them all into one
    mega_pdf.append('../source_data/pdf/' + file, pages=PageRange(':'))
    
    # File with only the first pages of all individual PDFs
    pdf_first_pages.append('../source_data/pdf/' + file, pages=PageRange('0'))
    
    # File with the non-first pages of all individual PDFs
    pdf_non_first_pages.append('../source_data/pdf/' + file, pages=PageRange('1:'))

mega_pdf.write('../output_data/pdf/reports_all_pages.pdf')
pdf_first_pages.write('../output_data/pdf/reports_first_pages.pdf')
pdf_non_first_pages.write('../output_data/pdf/reports_non_first_pages.pdf')

mega_pdf.close()
pdf_first_pages.close()
pdf_non_first_pages.close()
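
The consistency test mentioned earlier comes down to simple page arithmetic. The sketch below shows the totals one would expect from the split. `expected_page_totals` and the sample page counts are hypothetical, and the sketch assumes every report has at least one page.

```python
def expected_page_totals(pages_per_report):
    """Return the expected (all, first, non_first) page totals after the split.

    Assumes every report has at least one page, so each report
    contributes exactly one page to the first-pages file.
    """
    all_pages = sum(pages_per_report)
    first_pages = len(pages_per_report)
    non_first_pages = all_pages - first_pages
    return all_pages, first_pages, non_first_pages

# Hypothetical example: three reports with 3, 1 and 5 pages
all_pages, first_pages, non_first_pages = expected_page_totals([3, 1, 5])
assert first_pages + non_first_pages == all_pages
print(all_pages, first_pages, non_first_pages)
```

With the real files, the three totals would be compared against the page counts of the three merged PDFs.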

# Cropping fest

Now that we have separated first pages and non-first pages, we can go ahead and crop from each group the areas that we don't want to feed to Tabula. Once we do that, we export the result into two consolidated PDF files:

- *reports_non_first_pages_cropped.pdf*
- *reports_first_pages_cropped.pdf*

These are the files we will feed to Tabula's starving digestive system.

**Warning!**
For some reason, the PDF files produced by the next couple of cells sometimes turn out corrupted and cannot be opened. If that happens, run both cells again; the second or third attempt usually succeeds. One plausible cause is an output file handle that is still open when the PDF is read, which can leave buffered bytes unwritten. (I would be grateful to anybody who can confirm the reason for this annoyance.)
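
Because a corrupted output is easy to miss until Tabula rejects it, a rough check like the one sketched below may help decide whether a rerun is needed. It relies only on the fact that a PDF file starts with a `%PDF-` header and ends with an `%%EOF` marker. It is a heuristic, not a full validation, and `looks_complete` is a hypothetical helper name.

```python
import os

def looks_complete(path, tail_bytes=1024):
    """Heuristic: True if the file starts with %PDF- and has %%EOF near its end."""
    with open(path, 'rb') as handle:
        head = handle.read(5)
        # Jump to the last `tail_bytes` bytes (or the start, for small files)
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return head == b'%PDF-' and b'%%EOF' in tail
```

Calling `looks_complete('../output_data/pdf/reports_first_pages_cropped.pdf')` after the cropping cells would flag a truncated file, although a file that passes may still have other problems.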

### Non-first pages

In [5]:
ftz = fitz.open('../output_data/pdf/reports_non_first_pages.pdf')
reader = PdfFileReader('../output_data/pdf/reports_non_first_pages.pdf')
writer = PdfFileWriter()

# Consistency test
assert len(reader.pages) == len(ftz)

# (left, top, right, bottom) = (0.0, 0.0, 842.0, 595.0)
for i in range(0, len(reader.pages)):
    reader_page = reader.pages[i]
    fitz_page = ftz[i]
    
    right = fitz_page.searchFor('Update')[-1][0]
    
    reader_page.mediaBox.lowerLeft = (0, fitz_page.rect[3]-555) # (0, 595-555)
    reader_page.mediaBox.upperRight = (right*0.99, fitz_page.rect[3]-45) # (right*0.99, 595-45)
    writer.addPage(reader_page)

# Write inside a with-block so the file is flushed and closed once writing finishes
with open('../output_data/pdf/reports_non_first_pages_cropped.pdf', 'wb') as output_file:
    writer.write(output_file)
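
The subtractions from the page height in the cell above reflect a change of coordinate system: the crop edges are measured from the top of the page (as in the fitz rectangle), while the media box corners are expressed from the bottom-left corner. A minimal sketch of that conversion, where `crop_box` is a hypothetical helper and the `right` value of 700 is an invented example:

```python
def crop_box(page_height, top, bottom, right, left=0.0):
    """Convert crop edges measured from the top of the page into
    (lower_left, upper_right) corners measured from the bottom-left."""
    lower_left = (left, page_height - bottom)
    upper_right = (right, page_height - top)
    return lower_left, upper_right

# A landscape page of 842 x 595 points, keeping the band between
# 45 and 555 points from the top and everything left of x = 700
lower_left, upper_right = crop_box(page_height=595.0, top=45.0, bottom=555.0, right=700.0)
print(lower_left, upper_right)
```

The same conversion applies to the first pages, where the top edge comes from the position of the 'Date/Time' label instead of a fixed offset.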

### First pages

In [6]:
ftz = fitz.open('../output_data/pdf/reports_first_pages.pdf')
reader = PdfFileReader('../output_data/pdf/reports_first_pages.pdf')
writer = PdfFileWriter()

# Consistency test
assert len(reader.pages) == len(ftz)

# (left, top, right, bottom) = (0.0, 0.0, 842.0, 595.0)
for i in range(0, len(reader.pages)):
    reader_page = reader.pages[i]
    fitz_page = ftz[i]
    
    top = fitz_page.searchFor('Date/Time')[-1][3]
    right = fitz_page.searchFor('Update')[-1][0]
    
    reader_page.mediaBox.lowerLeft = (0, fitz_page.rect[3]-550) # (0, 595-550)
    reader_page.mediaBox.upperRight = (right*0.99, fitz_page.rect[3]-top) # (right*0.99, 595-top)
    writer.addPage(reader_page)

# Write inside a with-block so the file is flushed and closed once writing finishes
with open('../output_data/pdf/reports_first_pages_cropped.pdf', 'wb') as output_file:
    writer.write(output_file)

After submitting the two consolidated files to Tabula and transforming them into CSV files, kindly proceed to the second script: **script_02_data_cleaning_and_analysis.ipynb**