# Import PDF and analyze content

Example notebook showing how to read the content and metadata of PDF files.

#### Python environment installation instructions

General Packages:

    conda install numpy scipy matplotlib jupyter

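PDF-specific packages (the PyPDF2 examples below use the pre-3.0 API, such as `PdfFileReader`, which PyPDF2 3.0 removed; `tika` also needs a Java runtime on the machine):

    pip install "pypdf2<3"
    conda install tika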

#### Example PDF file

The PDF file used in the example below can be downloaded from:

    https://doi.org/10.1073/pnas.1117201109

In [1]:
# set the PDF filename/filepath; this file is used in all examples below
pdf_name = '12980.full.pdf'

## Get text and metadata from PDF using PyPDF2

https://pythonhosted.org/PyPDF2/

In [2]:
# import PyPDF2
import PyPDF2

In [3]:
# read the example pdf defined above
file = open(pdf_name, 'rb')

In [4]:
# create a PDF reader object
fileReader = PyPDF2.PdfFileReader(file)

In [5]:
# print the number of pages in the PDF file
print(fileReader.numPages)

6


In [6]:
# get document metadata
fileReader.documentInfo

{'/CreationDate': "D:20120727203916-04'00'",
 '/Creator': 'Arbortext Advanced Print Publisher 9.1.405/W Unicode',
 '/ModDate': "D:20200423000344-07'00'",
 '/Producer': 'Acrobat Distiller 6.0.1 (Windows)',
 '/Title': '201117201 12980..12985'}
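
The `/CreationDate` and `/ModDate` values use the PDF date format: `D:` followed by `YYYYMMDDHHmmSS` and a UTC offset such as `-04'00'`. The following is a minimal sketch for turning such a string into a timezone-aware `datetime`. It assumes the full form with seconds, as in the output above, and `parse_pdf_date` is a hypothetical helper name, not part of PyPDF2.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_pdf_date(value):
    """Convert a PDF date string such as "D:20120727203916-04'00'" to a datetime.

    Hypothetical helper: handles only the full form with seconds; a missing
    offset or a bare 'Z' is treated as UTC.
    """
    match = re.fullmatch(r"D:(\d{14})(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?", value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    stamp, sign, hours, minutes = match.groups()
    offset = timedelta(hours=int(hours or 0), minutes=int(minutes or 0))
    if sign == "-":
        offset = -offset
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone(offset))

# Value copied from the '/CreationDate' entry shown above.
created = parse_pdf_date("D:20120727203916-04'00'")
print(created.isoformat())
print(created.astimezone(timezone.utc).isoformat())
```

Converting to UTC makes the value directly comparable with timestamps reported by other tools, which often use UTC.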

In [7]:
# get the first page (page numbers are zero-based)
page = fileReader.getPage(0)

In [8]:
# extract the text of that page as a string
page_content = page.extractText()

In [9]:
# print the text as UTF-8 bytes, so line breaks show up as \n
print(page_content.encode('utf-8'))

b'Aquantitativequasispeciestheory-basedmodelof\nvirusescapemutationunderimmuneselection\nHyung-JuneWooandJaquesReifman\n1BiotechnologyHighPerformanceComputingSoftwareApplicationsInstitute,TelemedicineandAdvancedTechnologyResearchCenter,USArmyMedica\nlResearchandMaterielCommand,FortDetrick,MD21702\nEditedbyPeterSchuster,UniversityofVienna,Vienna,andapprovedJune28,2012(receivedforreviewOctober18,2011)\nViralinfectionsinvolveacomplexinterplayoftheimmune\nresponseandescapemutationofthevirusquasispeciesinsidea\nsinglehost.Althoughfundamentalaspectsofsuchabalanceof\n\nmutationandselectionpressurehavebeenestablishedbythequa-\nsispeciestheorydecadesago,itsimplicationshavelargelyre-\nmainedqualitative.Here,wepresentaquantitativeapproachto\n\nmodelthevirusevolutionundercytotoxicT-lymphocyteimmune\nresponse.Thevirusquasispeciesdynamicsareexplicitlyrepre-\nsentedbymutationsinthecombinedsequencespaceofasetof\n\nepitopeswithintheviralgenome.Westochasticallysimulatedthe\n\ngrowthofaviralpopulationori

## Get text and metadata from PDF using TIKA

General information about TIKA: https://cwiki.apache.org/confluence/display/TIKA/TikaServer

TIKA Python API: https://github.com/chrismattmann/tika-python

In [10]:
# import parser from TIKA
from tika import parser

In [11]:
# read the example PDF defined above
parsedPDF = parser.from_file(pdf_name)

In [12]:
print(parsedPDF)

{'status': 200, 'content': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n201117201 12980..12985\n\n\nA quantitative quasispecies theory-based model of\nvirus escape mutation under immune selection\nHyung-June Woo and Jaques Reifman1\n\nBiotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical\nResearch and Materiel Command, Fort Detrick, MD 21702\n\nEdited by Peter Schuster, University of Vienna, Vienna, and approved June 28, 2012 (received for review October 18, 2011)\n\nViral infections involve a complex interplay of the immune\nresponse and escape mutation of the virus quasispecies inside a\nsingle host. Although fundamental aspects of such a balance of\nmutation and selection pressure have been established by the qua-\nsispecies theory decades ago, its implications have largely re-\nmained qualitative. Here, we present a quantitative approach to\nmodel the virus evolut

In [13]:
# the parsed PDF is stored as a dictionary; list its keys
parsedPDF.keys()

dict_keys(['status', 'content', 'metadata'])

In [14]:
# one of the keys provides the content
print(parsedPDF['content'])






































201117201 12980..12985


A quantitative quasispecies theory-based model of
virus escape mutation under immune selection
Hyung-June Woo and Jaques Reifman1

Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical
Research and Materiel Command, Fort Detrick, MD 21702

Edited by Peter Schuster, University of Vienna, Vienna, and approved June 28, 2012 (received for review October 18, 2011)

Viral infections involve a complex interplay of the immune
response and escape mutation of the virus quasispecies inside a
single host. Although fundamental aspects of such a balance of
mutation and selection pressure have been established by the qua-
sispecies theory decades ago, its implications have largely re-
mained qualitative. Here, we present a quantitative approach to
model the virus evolution under cytotoxic T-lymphocyte immune
response. The virus quasispecies dynamics a
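
The extracted content keeps the line wrapping of the PDF, including words hyphenated at line ends (`qua-` / `sispecies`, `re-` / `mained`). A minimal sketch of one way to undo this with regular expressions is shown here on a short sample copied from the output above; `unwrap` is a hypothetical helper name. It assumes that every hyphen directly before a line break is a wrap hyphen, so a genuine compound that happens to break at its hyphen would be merged too.

```python
import re

# Short sample copied from the extracted content shown above.
sample = (
    "mutation and selection pressure have been established by the qua-\n"
    "sispecies theory decades ago, its implications have largely re-\n"
    "mained qualitative. Here, we present a quantitative approach to\n"
    "model the virus evolution under cytotoxic T-lymphocyte immune\n"
)

def unwrap(text):
    """Rejoin words hyphenated at line ends, then merge wrapped lines."""
    # "qua-\nsispecies" -> "quasispecies"; hyphens inside a line are untouched.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline (not part of a blank line) is treated as a soft wrap.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()

print(unwrap(sample))
```

Blank lines are left in place, so paragraph boundaries survive the unwrapping.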

In [15]:
# one of the keys provides the metadata
parsedPDF['metadata']

{'Content-Type': 'application/pdf',
 'Creation-Date': '2012-07-28T00:39:16Z',
 'Last-Modified': '2020-04-23T07:03:44Z',
 'Last-Save-Date': '2020-04-23T07:03:44Z',
 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
  'org.apache.tika.parser.pdf.PDFParser'],
 'X-TIKA:parse_time_millis': '70',
 'access_permission:assemble_document': 'true',
 'access_permission:can_modify': 'true',
 'access_permission:can_print': 'true',
 'access_permission:can_print_degraded': 'true',
 'access_permission:extract_content': 'true',
 'access_permission:extract_for_accessibility': 'true',
 'access_permission:fill_in_form': 'true',
 'access_permission:modify_annotations': 'true',
 'created': '2012-07-28T00:39:16Z',
 'date': '2020-04-23T07:03:44Z',
 'dc:format': 'application/pdf; version=1.4',
 'dc:title': '201117201 12980..12985',
 'dcterms:created': '2012-07-28T00:39:16Z',
 'dcterms:modified': '2020-04-23T07:03:44Z',
 'meta:creation-date': '2012-07-28T00:39:16Z',
 'meta:save-date': '2020-04-23T07:03:44Z
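
In the metadata above most values are plain strings, but `'X-Parsed-By'` is a list. Code that reads individual fields may therefore want to handle both cases. A minimal sketch, using a subset of the metadata shown above; `first_value` is a hypothetical helper name and `'Author'` is just an example of a key that is absent from this subset.

```python
# Subset of the metadata shown above; 'X-Parsed-By' is a list, the rest are strings.
metadata = {
    "Content-Type": "application/pdf",
    "X-Parsed-By": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.pdf.PDFParser",
    ],
    "dc:title": "201117201 12980..12985",
}

def first_value(meta, key, default=None):
    """Return a single value for `key`, taking the first item of list values."""
    value = meta.get(key, default)
    if isinstance(value, list):
        return value[0] if value else default
    return value

print(first_value(metadata, "dc:title"))
print(first_value(metadata, "X-Parsed-By"))
print(first_value(metadata, "Author", default="unknown"))
```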

### Analyze PDF text with generic tools

In [16]:
# store content in variable
cont = parsedPDF['content']

In [17]:
# split the content at every occurrence of a given character sequence (here 'ei')
cont.split('ei')

['\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n201117201 12980..12985\n\n\nA quantitative quasispecies theory-based model of\nvirus escape mutation under immune selection\nHyung-June Woo and Jaques R',
 'fman1\n\nBiotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical\nResearch and Materiel Command, Fort Detrick, MD 21702\n\nEdited by Peter Schuster, University of Vienna, Vienna, and approved June 28, 2012 (rec',
 'ved for review October 18, 2011)\n\nViral infections involve a complex interplay of the immune\nresponse and escape mutation of the virus quasispecies inside a\nsingle host. Although fundamental aspects of such a balance of\nmutation and selection pressure have been established by the qua-\nsispecies theory decades ago, its implications have largely re-\nmained qualitative. Here, we present a quantitative approach to\nmodel the virus evolution under cytotoxic 
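
Splitting at a fixed character sequence such as `'ei'` cuts through words. A separator that follows the structure of the text is often more useful. The sketch below splits at blank lines to get paragraph-like blocks, using `re.split` from the standard library on an abbreviated sample of the content above (fewer leading newlines, shortened affiliation line).

```python
import re

# Abbreviated sample of the extracted content shown above.
sample = (
    "\n\n\n201117201 12980..12985\n\n\n"
    "A quantitative quasispecies theory-based model of\n"
    "virus escape mutation under immune selection\n"
    "Hyung-June Woo and Jaques Reifman1\n\n"
    "Biotechnology High Performance Computing Software Applications Institute\n"
)

# One or more blank lines (possibly containing whitespace) separate the blocks.
paragraphs = [block for block in re.split(r"\n\s*\n", sample.strip()) if block]

for number, block in enumerate(paragraphs, start=1):
    print(number, repr(block))
```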

In [18]:
# separate all lines in the PDF file
cont.splitlines()

['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '201117201 12980..12985',
 '',
 '',
 'A quantitative quasispecies theory-based model of',
 'virus escape mutation under immune selection',
 'Hyung-June Woo and Jaques Reifman1',
 '',
 'Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical',
 'Research and Materiel Command, Fort Detrick, MD 21702',
 '',
 'Edited by Peter Schuster, University of Vienna, Vienna, and approved June 28, 2012 (received for review October 18, 2011)',
 '',
 'Viral infections involve a complex interplay of the immune',
 'response and escape mutation of the virus quasispecies inside a',
 'single host. Although fundamental aspects of such a balance of',
 'mutation and selection pressure have been established by the qua-',
 'sispecies theory de
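
The list returned by `splitlines()` starts with many empty strings, one for each blank line in the content. A minimal sketch for keeping only the non-empty lines, shown on an abbreviated sample of the content above:

```python
# Abbreviated sample of the extracted content shown above.
sample = (
    "\n\n\n201117201 12980..12985\n\n\n"
    "A quantitative quasispecies theory-based model of\n"
    "virus escape mutation under immune selection\n"
)

# Keep a line only if something is left after stripping whitespace.
lines = [line.strip() for line in sample.splitlines() if line.strip()]

print(len(lines))
print(lines)
```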

In [19]:
# partition the content at the first occurrence of a specific word
meth = cont.partition('quasispecies')

In [20]:
# print the part of the content before the word
print(meth[0])






































201117201 12980..12985


A quantitative 
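
`str.partition` returns a 3-tuple: the text before the first occurrence of the separator, the separator itself, and the text after it. The sketch below shows all three parts on a short sample taken from the content above, and what happens when the word is not found.

```python
# Short sample copied from the extracted content shown above.
sample = (
    "A quantitative quasispecies theory-based model of\n"
    "virus escape mutation under immune selection\n"
)

before, separator, after = sample.partition("quasispecies")
print(repr(before))
print(repr(separator))
print(repr(after))

# If the word is absent, the whole string comes back first,
# followed by two empty strings.
missing = sample.partition("not-in-text")
print(missing[1] == "" and missing[2] == "")
```

Checking whether the middle element is empty is a simple way to tell whether the word was found at all.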
