# Comparing All Three PDF Extraction Tools

This notebook compares three PDF text-extraction tools (pdfminer, PyPDF2, and Tika) by running each of them on the same file.

In [2]:
pdf_file = '/PDF_comparison/3666_Module_9_Building_Chatbots.pdf'

## PDF Miner

In [3]:
import pdfminer
# rename any files that have spaces or special characters in the name, because pdfminer can't handle it
import re
import os

src = './PDF_comparison'
dst = './PDF_comparison'
# recursively walk directory structure looking for pdf files 
badchars = r"[\(\)<>?!\'\",\s]+"
for root, dirs, files in os.walk(src):
    for file in files:
        path_to_pdf = os.path.join(root, file)
        [stem, ext] = os.path.splitext(path_to_pdf)
        if ext == '.pdf':
            # when a pdf file is found, check the filename for special characters
            [fpath, fname] = os.path.split(stem)
            if re.search(badchars, fname):
                # if special characters found, build a new filename
                print("Found " + file)
                dstname=re.sub(r"[\s]+", "_", fname) 
                dstname=re.sub(badchars,"", dstname)
                dstpath = os.path.join(fpath, dstname + ext) 
                print("Renaming to " + dstpath)
                # rename original pdf file to new filename in original directory
                os.rename(path_to_pdf, dstpath)
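The renaming rule used in the cell above can be checked on its own, without touching the file system. The following is a minimal sketch that pulls the two `re.sub` steps into a helper; the function name `sanitize_stem` and the sample file name are hypothetical and are not part of the original notebook.

```python
import re

# same character class as `badchars` in the cell above
BADCHARS = r"[\(\)<>?!\'\",\s]+"

def sanitize_stem(fname):
    """Apply the notebook's two-step rule to a file name without its extension."""
    # step 1: turn each run of whitespace into a single underscore
    cleaned = re.sub(r"\s+", "_", fname)
    # step 2: delete the remaining special characters
    return re.sub(BADCHARS, "", cleaned)

# hypothetical file name, used only for illustration
print(sanitize_stem("My File (v2)!"))
```

Testing the rule this way before calling `os.rename` makes it easier to see what a file will be called after the rename.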

In [4]:
# use pdfminer in command line mode on each pdf file in the directory structure
# this code is adapted from nadya-p/pdf_to_text.py
import pdfminer
import os

# recursively walk directory structure looking for pdf files 
for root, dirs, files in os.walk(src):
    for file in files:
        path_to_pdf = os.path.join(root, file)
        [stem, ext] = os.path.splitext(path_to_pdf)
        if ext == '.pdf':
            # when a pdf file is found, construct the output path name
            print("Processing " + path_to_pdf)
            [_, fname] = os.path.split(stem)
            path_to_txt = os.path.join(dst, fname) + '.txt'
            print("Writing contents to " + path_to_txt)
            # use pdfminer in command line mode to convert pdf file to text file
            !pdf2txt.py -o {path_to_txt} {path_to_pdf}


Processing ./PDF_comparison/3666_Module_9_Building_Chatbots.pdf
Writing contents to ./PDF_comparison/3666_Module_9_Building_Chatbots.txt


## PyPDF2

In [5]:
#imports
import PyPDF2
import os
import glob

In [6]:
# this class is to extract the text using pypdf2
class PyPDF2Extract(object):
    # initialize the class
    def __init__(self, target_directory_name):
        self.target = str(target_directory_name)
        
        
    # define a function to extract a pdf 
    def pdfExtract(self, file):
        # open the pdf file; the context manager closes it when we are done
        with open(file, 'rb') as pdf:
            # convert the pdf to a PdfFileReader object
            read_pdf = PyPDF2.PdfFileReader(pdf)
            # skip encrypted files
            if read_pdf.isEncrypted:
                print(file + ' file is encrypted')
                return
            print(file)
            # collect the text of each page
            page_content = []
            # get the number of pages in the document
            number_of_pages = read_pdf.getNumPages()
            # iterate over each page to extract the text
            for i in range(number_of_pages):
                page = read_pdf.getPage(i)
                # some files throw a TypeError, which is skipped here;
                # others may throw a KeyError if there is a blank page,
                # which has not been addressed here
                try:
                    content = page.extractText()
                    content = content.replace("\n", " ")
                    page_content.append(content)
                except TypeError:
                    pass
        # once every page has been read, write the list of page strings to a text file
        out_path = os.path.join(os.getcwd(), self.target, file[:-4] + '.txt')
        with open(out_path, 'w') as f:
            f.write(str(page_content) + "\n")
        print(file + ' success')
    
    def transform(self):
        # resolve files in directory using glob
        files = list(glob.glob("*.pdf"))
        # iterate over files to run pdfExtract function
        for i in files:
            #check if the target directory exists, if it doesn't create the target
            if not os.path.exists(str(os.getcwd()) + '/' + self.target):
                os.makedirs(str(os.getcwd()) + '/' + self.target)
            self.pdfExtract(i)

In [8]:
os.chdir(str(os.getcwd()) + '/PDF_comparison')

In [10]:
# initialize the PyPDF2Extract class, specifying the target directory name
pypdf2_extractor = PyPDF2Extract(target_directory_name = 'PDF_comparison')

In [11]:
#perform the transformation
pypdf2_extractor.transform()

3666_Module_9_Building_Chatbots.pdf
3666_Module_9_Building_Chatbots.pdf success


## Tika

In [12]:
# imports
from tika import parser
import os
import datetime

In [13]:
# class for extracting text from pdf files using tika
class TikaExtract(object):
    # initialize the object
    def __init__(self, source_directory, target_directory_name):
        # assigned variables for source_directory and target_directory_name
        self.dir = source_directory
        self.target = str(target_directory_name)
    
    # define recursive function to walk through directory and convert pdfs    
    def extract_text_from_pdfs_recursively(self):
        for root, dirs, files in os.walk(self.dir):
            for file in files:
                path_to_pdf = os.path.join(root, file)
                [stem, ext] = os.path.splitext(path_to_pdf)
                if ext == '.pdf':
                    print("Processing " + path_to_pdf)
                    # use tika to parse contents from file
                    pdf_contents = parser.from_file(path_to_pdf)
                    # project specific - convert the parsed content to a string
                    # (this gives the string 'None' if tika returned no content)
                    raw_text = '{}'.format(pdf_contents['content'])
                    # project specific - replace every new line with a space
                    # (this also covers double new lines, so no separate step is needed)
                    raw_text = raw_text.replace("\n", " ")
                    # project specific - replace tabs with spaces
                    raw_text = raw_text.replace("\t", " ")
                    path_to_txt = stem + '.txt'
                    # check if target directory exists
                    if not os.path.exists(str(os.getcwd()) + self.target):
                        os.makedirs(str(os.getcwd()) + self.target)
                    # write the text file to the target directory
                    # names of the files will be the same, except have the .txt extension
                    with open(str(os.getcwd()) + self.target + str(file[:-4]) + ".txt", 'w') as txt_file:
                        print("Writing contents to " + str(os.getcwd()) + self.target + str(file[:-4]) + ".txt")
                        txt_file.write(raw_text)
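The clean-up step in the class above replaces each new line and tab with one space, so runs of blank lines turn into runs of spaces. As an alternative, here is a minimal sketch that collapses every run of whitespace to a single space with one regular expression. The helper name `collapse_whitespace` and the sample string are hypothetical, and the result differs from what the notebook writes: it has no repeated or leading spaces.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace (spaces, tabs, new lines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# hypothetical sample, not taken from the tika output
sample = "Module 9:\n\nBuilding\tChatbots  \n"
print(collapse_whitespace(sample))
```

Whether this is preferable depends on the downstream task; the original replacements keep a rough trace of where blank lines were, which this version discards.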

In [14]:
# this is an example, performing the operation on a local machine
tikaextract = TikaExtract(source_directory='/Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/PDF_comparison',
                         target_directory_name='/tika_PDF_comparison/')

In [15]:
# run the function
tikaextract.extract_text_from_pdfs_recursively()

Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/PDF_comparison/3666_Module_9_Building_Chatbots.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/PDF_comparison/tika_PDF_comparison/3666_Module_9_Building_Chatbots.txt


## Compare Output

In [17]:
with open('3666_Module_9_Building_Chatbots.txt','r') as f:
    pdf_miner_output = f.read()
print(pdf_miner_output)

Applied Natural Language 
Processing

Module 9: Building Chatbots

1

Course Plan

Module 1 – Introduction to Language Processing and Computation

Module Titles

Module 2 – Text Corpora & Pre-processing

Module 3 – Introduction to Machine Learning

Module 4 – Text Vectorization & Feature Engineering

Module 5 – Applying Classification on Text

Module 6 – Applying Clustering on Text

Module 7 – Context Aware Language Modeling

Module 8 – Text Visualization & Graph Analysis

Module 9 – Building Chatbots

Module 10 – Scaling with Multiprocessing and Spark

Module 11 – Deep learning on Text data

Module 12 – Team Project Presentations

2

Learning Outcomes for this Module

• We will

learn a conversational

chatbots, one of
applications

for building
the fastest-growing language aware

framework

• We will demonstrate this framework by constructing a kitchen
helper bot that can greet new users, perform measurement
conversions, and recommend good recipes

3

Topics for this Module

9.1 F

In [18]:
with open(os.getcwd() +'/PDF_comparison/3666_Module_9_Building_Chatbots.txt') as f:
    pypdf2_output = f.read()
print(pypdf2_output)

['1 Applied Natural Language  Processing Module 9: Building  Chatbots ', '2 Module Titles Module  1   Introduction to Language Processing and Computation Module 2   Text Corpora & Pre - processing Module 3   Introduction to Machine Learning Module 4   Text Vectorization & Feature Engineering Module 5   Applying Classification on Text Module 6   Applying Clustering on Text Module 7   Context Aware Language Modeling Module  8   Text Visualization & Graph Analysis Module  9   Building Chatbots Module  10   Scaling with Multiprocessing and Spark Module  11   Deep learning on Text data Module  12   Team Project Presentations Course Plan ', '3  We will learn a conversational framework for building chatbots , one of the fastest - growing language aware applications  We will demonstrate this framework by constructing a kitchen helper bot that can greet new users, perform measurement conversions, and recommend good recipes Learning Outcomes for this Module ', '4 9 .1 Fundamentals of Conversatio
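The text above is the string form of a Python list, because `pdfExtract` writes `str(page_content)` to the file. If the pages are needed as separate strings again, one possible approach is to parse the file contents with `ast.literal_eval`. This is a minimal sketch: the helper name `load_pages` and the two-page sample are hypothetical, and it assumes the file holds a complete, untruncated list.

```python
import ast

def load_pages(text):
    """Parse the string form of a list of page strings back into a list."""
    pages = ast.literal_eval(text.strip())
    if not isinstance(pages, list):
        raise ValueError("expected a list of page strings")
    return pages

# hypothetical two-page sample in the same format that pdfExtract writes
sample = str(['1 Title page ', "2 It's the second page "]) + "\n"
pages = load_pages(sample)

# join the pages into one plain-text string
full_text = " ".join(page.strip() for page in pages)
print(len(pages))
print(full_text)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for reading the file back.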




In [19]:
with open(os.getcwd() + '/tika_PDF_comparison/3666_Module_9_Building_Chatbots.txt') as f:
    tika_output = f.read()
print(tika_output)

                                           PowerPoint Presentation   1  Applied Natural Language   Processing  Module 9: Building Chatbots    2  Module Titles  Module 1 – Introduction to Language Processing and Computation  Module 2 – Text Corpora & Pre-processing  Module 3 – Introduction to Machine Learning  Module 4 – Text Vectorization & Feature Engineering  Module 5 – Applying Classification on Text  Module 6 – Applying Clustering on Text  Module 7 – Context Aware Language Modeling  Module 8 – Text Visualization & Graph Analysis  Module 9 – Building Chatbots  Module 10 – Scaling with Multiprocessing and Spark  Module 11 – Deep learning on Text data  Module 12 – Team Project Presentations  Course Plan    3  • We will learn a conversational framework for building  chatbots, one of the fastest-growing language aware  applications  • We will demonstrate this framework by constructing a kitchen  helper bot that can greet new users, perform measurement  conversions, and recommend good re

## Discussion

The three tools differ both in the code needed to use them and in the output they produce. Of the three, the output we obtained from pdfminer was the cleanest and easiest to read. The outputs from PyPDF2 and Tika look similar, although in this sample PyPDF2 drops the bullet and dash characters and pads hyphens with spaces ("Pre - processing"), whereas Tika keeps them.

One of the main differences is that the PyPDF2 output is a list with each page represented as a string. This follows from the code, which extracts the text page by page. With Tika, we applied some heuristics (replacing new lines and tabs with spaces) to clean up the text a little, and the output is a single string. Throughout this process, Linda favoured using pdfminer, while Rahim favoured using Tika.
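The comparison above is made by eye. To put a rough number on how similar two outputs are, one possible approach is to compare their word sequences with `difflib.SequenceMatcher` from the standard library. This is a minimal sketch, not the method used in this notebook: the helper name `word_similarity` and the sample snippets are hypothetical stand-ins for the real output strings.

```python
from difflib import SequenceMatcher

def word_similarity(text_a, text_b):
    """Return a similarity ratio between 0 and 1 for the word sequences of two texts."""
    # split() ignores differences in spacing and line breaks
    words_a = text_a.split()
    words_b = text_b.split()
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()

# hypothetical snippets standing in for pdf_miner_output and tika_output
outputs = {
    "pdfminer": "Module 9: Building Chatbots\n\n1\n",
    "tika": "1  Module 9: Building Chatbots ",
}
score = word_similarity(outputs["pdfminer"], outputs["tika"])
print(round(score, 2))
```

A ratio near 1 means the two tools produced almost the same words in the same order. The PyPDF2 file would first need converting from its list form to plain text, and a score like this says nothing about which output is more readable.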