# PDF Scrape Testing
<font size=4 color='blue'>Project: Congressional Data Scrape and Validation</font>
***

**Project Summary:**  
The Resume of Congressional Activity has been published annually since 1947. PDF versions of this document are available for download from several US government websites, including <a href="https://senate.gov">senate.gov</a>. The primary goal of this project is to scrape the data from these documents and create a dataset that can be used for analysis.


**Notebook Scope:**  
This notebook tests several Python libraries (pypdf, pdfminer, tabula-py, camelot-py, and pdfplumber) for extracting data from PDF files.

**Output:**  
n/a


***  
# Test: pypdf
Documentation available at <a href='https://pypi.org/project/pypdf/'>PyPI</a>.
***

In [2]:
import pypdf

In [3]:
# Read an example resume that has been converted to editable text in Acrobat
pdf = pypdf.PdfReader('../PDF Tests/Test3 98res - Adobe OCR.pdf')

In [4]:
pg1_text = pdf.pages[0].extract_text()
pg1_text

"<tongrtssionat Rtcord \nUnited States \nof America PROCEEDINGS AND DEBATES OF THE 9 7 lb CONGRESS, SECOND SESSION \nVol 130 WASHINGTON, WEDNESnAY, NOVEMBER 14, 1984 No. 136 \nDaily Digest \nRESUME OF CONGRESSIONAL ACTIVITY OF THE NINETY-EIGHTH CONGRESS \nFIRST SESSION \nJanuary 3 through November 18, 1983 \nSmau \nDays in session...................... 1~ \nTime in session ... ; .................. 1,010 hrs., 47' \nCongressional Record: \nPages of proceedings .... .. \nExtensions of Remarks .. . \nPublic bills enacted into law .. \nPrivate bills enacted into law. \nBills in conference ............... .. \nBills through conference ..... .. \nMeasures passed, total ........ ; .. \nSenate bills .................... . \nHouse bills .................... . \nSenate joint resolutions .. \nHouse joint resolutions .. \nSenate concurrent reso-\nlutions ....................... . \nHouse concurrent reso-\nlutions ..................... ; .. \nSimple resolutions ......... . \nMeasures reported, tota

***
<font color='blue'>**Note:**</font>  
By default, pypdf reads the page as a series of columns, taking each column in turn from top to bottom rather than reading across rows. It would be nearly impossible to match the labels to their values in this format.

In [5]:
# Let's try reading the file using layout mode
output = pdf.pages[0].extract_text(extraction_mode='layout')

In [6]:
print(repr(output[:3500]))

"                                                   <tongrtssionat         Rtcord\n\n   United             States                                                                                                                                                9 7 lb                 CONGRESS,              SECOND                  SESSION\n      of America                                                                                                                                              PROCEEDINGS                    AND              DEBATES              OF          THE\n\nVol        130                                                                                                                                                                                     WASHINGTON, WEDNESnAY, NOVEMBER 14,       1984                                                                                                                                                                                 

***
<font color='blue'>**Conclusion:**</font>  
In layout mode, the pypdf results are usable but would require a lot of manipulation.
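
Much of that manipulation would be splitting each layout-mode line back into cells. The sketch below assumes that cells are separated by runs of at least `min_gap` spaces. This is only a heuristic, not something pypdf guarantees, and the header lines in the output above show gaps of ten or more spaces inside a single phrase, so the threshold would need tuning per document. `split_layout_line` is a hypothetical helper and the sample line is made up for illustration.

```python
import re


def split_layout_line(line, min_gap=3):
    """Split one layout-mode line into cells on runs of at least `min_gap` spaces."""
    pattern = r" {%d,}" % min_gap
    # Drop empty strings so leading or trailing padding does not create blank cells
    return [cell for cell in re.split(pattern, line.strip()) if cell]


# Made-up sample line in the style of the resume table (not taken from the PDF)
sample = "Days in session      131      120"
print(split_layout_line(sample))
```

Single spaces inside a label survive because they are shorter than `min_gap`; only the wider gaps are treated as column boundaries.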

***  
# Test: pdfminer
Documentation available at <a href='https://pdfminersix.readthedocs.io/en/latest/'>pdfminersix.readthedocs.io</a>.
***

In [7]:
from pdfminer.high_level import extract_text

In [8]:
text = extract_text('../PDF Tests/Test3 98res - Adobe OCR.pdf')
print(repr(text[:2500]))

"<tongrtssionat Rtcord \n\nPROCEEDINGS  AND  DEBATES  OF  THE  9 7 lb  CONGRESS,  SECOND  SESSION \n\nUnited  States \nof America \n\nVol  130 \n\nWASHINGTON,  WEDNESnAY,  NOVEMBER  14,  1984 \n\nNo.  136 \n\nDaily Digest \n\nRESUME  OF  CONGRESSIONAL  ACTIVITY  OF  THE  NINETY-EIGHTH  CONGRESS \n\nFIRST  SESSION \n\nJanuary  3 through  November  18,  1983 \n\nSmau \nDays in session...................... \n1~ \nTime in session ... ; ..................  1,010 hrs., 47' \nCongressional Record: \n\nHr>use \n146 \n851  hrs., 45' \n\n17,224 \n\n10,665 \n\nSECOND  SESSION \n\nJanuary  23  through  October  12,  1984 \n\nHouse \n120 \n940 hrs., 28'  852  hrs., 59' \n\nSenate \n131 \n\nTotal \n\n14,650 \n\n12,29~ \n\nPages of proceedings .... .. \nExtensions of Remarks .. . \nPublic bills enacted into law .. \nPrivate bills enacted into law. \nBills in conference ............... .. \nBills through conference ..... .. \nMeasures passed, total ........ ; .. \nSenate bills .................... . 

***
<font color='blue'>**Conclusion:**</font>  
The output from pdfminer is similar to the default pypdf output. An initial review of the documentation does not show any options that would address the issue of reading columns from top to bottom.

***  
# Test: tabula-py
Documentation available at <a href='https://tabula-py.readthedocs.io/en/latest/'>tabula-py.readthedocs.io</a>. A thorough walk-through is available at <a href='https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py'>Pythonic Excursions</a>.
***

In [9]:
import tabula
import pandas as pd

In [10]:
# pandas emits a FutureWarning during the read_pdf call below. Run this cell to suppress it.
import warnings
warnings.simplefilter("ignore", category=FutureWarning)
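
If silencing every `FutureWarning` for the rest of the session is broader than wanted, the standard library's `warnings.catch_warnings` context manager restores the previous filters when the block exits. A minimal sketch, in which `noisy_call` is a hypothetical stand-in for the `read_pdf` call:

```python
import warnings


def noisy_call():
    # Hypothetical stand-in for a library call that emits a FutureWarning
    warnings.warn("this behaviour will change", FutureWarning)
    return "result"


with warnings.catch_warnings():
    # The filter applies only inside this block
    warnings.simplefilter("ignore", category=FutureWarning)
    value = noisy_call()  # the FutureWarning is suppressed here

print(value)
```

Outside the `with` block, warnings are reported as before.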

In [11]:
df_list = tabula.read_pdf('../PDF Tests/Test3 98res - Adobe OCR.pdf', pages=1)

In [12]:
page1_df = df_list[0]
page1_df.head()

| | Unnamed: 0 | Smau | Hr>use | Total | Unnamed: 1 | Senate | House | Total.1 |
|---|---|---|---|---|---|---|---|---|
| 0 | Days in session............ .......... | 1~ | 146 | | Days in session ......... .. .......... . | 131 | 120 | |
| 1 | Time in session ... ;. .......... .. ..... 1,0... | | 851 hrs., 45' | | Time in session ...... ............ .. .. | 940 hrs., 28' | 852 hrs., 59' | |
| 2 | Congressional Record: | | | | Congressional Record: | | | |
| 3 | Pages of proceedings .... .. | 17224 | 10,665 27,889 | | Pages of proceedings .... .. | 14650 | 12,29~ 26,896 | |
| 4 | Extensions of Remarks .. . | | | 5985.0 | Extensions of Remarks .. . | | | 4580.0 |


***
<font color='blue'>**Conclusion:**</font>  
The contents are read into a pandas DataFrame, with each label on the same row as its values, and appear fairly clean. This is the most promising solution.
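
Some cleanup would still be needed before analysis: labels carry dot leaders, some cells hold two numbers (row 3 above), and OCR-damaged values such as `12,29~` are not valid numbers. The sketch below shows one way to handle this in plain Python, using strings copied from the output above. The helper names `clean_label` and `parse_counts` are hypothetical, and returning `None` for unparseable tokens is an assumption about how the project would want to flag them.

```python
import re


def clean_label(label):
    """Strip trailing dot leaders, semicolons, and whitespace from a row label."""
    return re.sub(r"[\s.;]+$", "", label)


def parse_counts(cell):
    """Split a cell on whitespace and convert each token to int.

    Tokens that are not plain digits after removing thousands separators
    (for example OCR-damaged values) become None.
    """
    counts = []
    for token in cell.split():
        digits = token.replace(",", "")
        counts.append(int(digits) if re.fullmatch(r"[0-9]+", digits) else None)
    return counts


print(clean_label("Pages of proceedings .... .."))
print(parse_counts("10,665 27,889"))
print(parse_counts("12,29~ 26,896"))
```

Flagging damaged values instead of guessing at them keeps the OCR errors visible for the validation step.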

***  
# Test: camelot-py
Documentation available at <a href='https://camelot-py.readthedocs.io/en/master/'>camelot-py.readthedocs.io</a>.
***

In [13]:
import camelot

In [14]:
tables = camelot.read_pdf('../PDF Tests/Test3 98res - Adobe OCR.pdf', flavor='stream')

In [15]:
tables

<TableList n=1>

In [16]:
tables[0].df.head(10)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | | RESUME OF CONGRESSIONAL ACTIVITY OF THE ... | | | | | | |
| 1 | | FIRST SESSION | | | | SECOND SESSION | | |
| 2 | | January 3 through November 18, 1983 | | | | January 23 through October 12, 1984 | | |
| 3 | | | | | | Senate | House | Total |
| 4 | | Smau | Hr>use | Total | | | | |
| 5 | Days in session...................... | 1~ | 146 | | Days in session ......... ............ . | 131 | 120 | |
| 6 | Time in session ... ; .................. 1,01... | | 851 hrs., 45' | | Time in session .................... .. | 940 hrs., 28' | 852 hrs., 59' | |
| 7 | Congressional Record: | | | | Congressional Record: | | | |
| 8 | | | | | | 14650 | 12,29~ | 26896 |
| 9 | Pages of proceedings .... .. | 17224 | 10665 | 27889 | Pages of proceedings .... .. | | | |


***
<font color='blue'>**Conclusion:**</font>  
camelot-py successfully reads the text into a DataFrame, but the result is not as clean as the tabula-py output.
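
One visible problem is that a single logical row is sometimes split across two DataFrame rows: rows 8 and 9 above hold complementary halves of "Pages of proceedings", as do header rows 3 and 4. The sketch below shows how such a pair could be stitched together. It assumes the two halves never both have a value in the same column, and `merge_split_rows` is a hypothetical helper that works on plain lists of cell strings.

```python
def merge_split_rows(upper, lower):
    """Merge two rows cell by cell.

    Returns None when both rows have a value in the same column, since they
    are then unlikely to be two halves of one logical row.
    """
    merged = []
    for a, b in zip(upper, lower):
        if a and b:
            return None
        merged.append(a or b)
    return merged


# Cell values copied from rows 8 and 9 of the output above
row8 = ["", "", "", "", "", "14650", "12,29~", "26896"]
row9 = ["Pages of proceedings .... ..", "17224", "10665", "27889",
        "Pages of proceedings .... ..", "", "", ""]
print(merge_split_rows(row8, row9))
```

Deciding which adjacent rows are candidates for merging would still need a rule of its own, which this sketch does not attempt.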

***  
# Test: pdfplumber
Documentation available at <a href='https://pypi.org/project/pdfplumber/'>PyPI</a>.
***

In [17]:
import pdfplumber

In [18]:
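# Open the PDF with a context manager so the file is closed automatically
with pdfplumber.open('98resocr.pdf') as file:
    # extract_table returns a list of rows (lists of strings), not a DataFrame
    dfs = file.pages[0].extract_table(table_settings={'horizontal_strategy': 'text', 'vertical_strategy': 'text'})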

In [19]:
dfs[0:3]

[['Vol 1',
  '30 WASHI',
  'NGT',
  'ON,',
  'WED',
  'NESn',
  'AY,',
  'NOVEMBER 14, 198',
  '4',
  'No.',
  '136'],
 ['', '', '', '', '', '', '', '', '', '', ''],
 ['', '', '', 'D', 'ail', 'y', 'Di', 'gest', '', '', '']]

***
<font color='blue'>**Conclusion:**</font>  
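pdfplumber reads the text, but with the text-based strategies it does not parse it correctly: words are split across cells (for example, "WASHINGTON," becomes 'WASHI', 'NGT', 'ON,'). Not a workable solution.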

***  
# Findings
With this limited testing, tabula-py appears to be the most promising tool for this project.

***
**End**
***