# PDFs to Text via Tika

This notebook converts PDF files into text files using Apache Tika. It searches the specified folder recursively and writes the converted text files to a target folder created under the current working directory.

This code was initially written for the final project of UofT3666 - Applied Natural Language Processing. As a result, some lines of code are specific to cleaning up the output of the files we were converting. This code was built upon the following gist: https://gist.github.com/nadya-p/373e1dc335293e490d89d00c895ea7b3.

In [38]:
# imports
from tika import parser
import os
import datetime

In [30]:
# class for extracting tika files
class TikaExtract(object):
    # initialize the object
    def __init__(self, source_directory, target_directory_name):
        # assigned variables for source_directory and target_directory_name
        self.dir = source_directory
        self.target = str(target_directory_name)
    
    # walk the source directory recursively and convert every PDF to text
    def extract_text_from_pdfs_recursively(self):
        # resolve the target directory once, relative to the current working directory
        # (surrounding slashes are stripped so names like '/tikadocuments2/' still work)
        target_dir = os.path.join(os.getcwd(), self.target.strip('/\\'))
        for root, dirs, files in os.walk(self.dir):
            for file in files:
                path_to_pdf = os.path.join(root, file)
                # split the file name (not the full path) into stem and extension
                stem, ext = os.path.splitext(file)
                # compare case-insensitively so '.PDF' files are also picked up
                if ext.lower() == '.pdf':
                    print("Processing " + path_to_pdf)
                    # use tika to parse contents from file
                    pdf_contents = parser.from_file(path_to_pdf)
                    # 'content' can be None when tika extracts no text
                    raw_text = pdf_contents.get('content') or ''
                    # project specific - replace new lines with spaces
                    # (this also covers double new lines, so no separate pass is needed)
                    raw_text = raw_text.replace("\n", " ")
                    # project specific - replace tabs with spaces
                    raw_text = raw_text.replace("\t", " ")
                    # create the target directory if it does not exist yet
                    os.makedirs(target_dir, exist_ok=True)
                    # write the text file to the target directory
                    # names of the files will be the same, except have the .txt extension
                    path_to_txt = os.path.join(target_dir, stem + ".txt")
                    with open(path_to_txt, 'w', encoding='utf-8') as txt_file:
                        print("Writing contents to " + path_to_txt)
                        txt_file.write(raw_text)
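
The project-specific clean-up in the class replaces each newline and tab with a single space, which can leave runs of consecutive spaces in the output. A minimal sketch of one alternative, assuming that collapsing all whitespace is acceptable for the downstream NLP steps. `normalize_whitespace` is a hypothetical helper and is not part of the original code.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    # treat a missing extraction result as empty text (assumption: None means "no text")
    if text is None:
        return ""
    # \s+ matches one or more whitespace characters of any kind
    return re.sub(r"\s+", " ", text).strip()

sample = "Climate\n\nchange\tadaptation  plan\n"
print(normalize_whitespace(sample))
```

If used, it would stand in for the chain of `replace` calls on `raw_text`; whether the extra collapsing is wanted depends on how the text files are consumed later.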

In [31]:
%pwd

'/Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings'

In [36]:
# this is an example, performing the operation on a local machine
tikaextract = TikaExtract(source_directory='/Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs',
                         target_directory_name='/tikadocuments2/')
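
Before running the conversion on a large folder, it can help to list which PDFs the recursive walk would pick up. The following is a minimal sketch using only the standard library. `find_pdfs` is a hypothetical helper, not part of the original class, and matching the extension case-insensitively is an assumption made here.

```python
import os
import tempfile

def find_pdfs(source_directory):
    """Return the sorted paths of all PDF files under source_directory (recursive)."""
    matches = []
    for root, dirs, files in os.walk(source_directory):
        for name in files:
            # compare the extension in lower case so '.PDF' is also matched
            if os.path.splitext(name)[1].lower() == ".pdf":
                matches.append(os.path.join(root, name))
    return sorted(matches)

# demo on a throwaway directory so the sketch does not depend on local files
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.pdf", "notes.txt", os.path.join("sub", "b.PDF")):
        open(os.path.join(tmp, rel), "w").close()
    # show paths relative to the temporary root
    found = [os.path.relpath(p, tmp) for p in find_pdfs(tmp)]
    print(found)
```

Calling `find_pdfs` with the same `source_directory` passed to `TikaExtract` gives a quick dry run without starting Tika.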

In [37]:
# run the function
tikaextract.extract_text_from_pdfs_recursively()

Climate Change Adaptation - A Priorities Plan for Canada (2012)
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/Climate Change Adaptation - A Priorities Plan for Canada (2012).pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tikadocuments2/Climate Change Adaptation - A Priorities Plan for Canada (2012).txt
waterloo_region_climate_projections_final_revised30oct2015
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/waterloo_region_climate_projections_final_revised30oct2015.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tikadocuments2/waterloo_region_climate_projections_final_revised30oct2015.txt
the_london_plan_malp_march_2016_-_chapter_5_-_londons_response_to_climate_change
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/the_london_plan_malp_march_2016_-_cha

Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tikadocuments2/London_tech.txt
A_Canadian_Opportunity_-_Tackling_climate_change_by_switching_to_clean_power
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/A_Canadian_Opportunity_-_Tackling_climate_change_by_switching_to_clean_power.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tikadocuments2/A_Canadian_Opportunity_-_Tackling_climate_change_by_switching_to_clean_power.txt
2017BernardSoubryPolicyBrief
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/2017BernardSoubryPolicyBrief.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tikadocuments2/2017BernardSoubryPolicyBrief.txt
WP_Health_November2008
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/WP_Health_November200

In [44]:
# run the extraction with default parameters when executed as a script
# default source directory is the current working directory
# target directory name is tika_documents_<datetime>,
#        in the format "tika_documents_day_month_year_hour_minute_AM/PM"
if __name__ == "__main__":
    tikaextract = TikaExtract(source_directory=os.getcwd(),
                              target_directory_name='/tika_documents_' + datetime.datetime.now().strftime("%d_%m_%Y_%I_%M_%p") + "/")
    tikaextract.extract_text_from_pdfs_recursively()

Project_Draft_H_Parse_Classifier.i
Untitled1.i
.DS_S
Untitled.i
Draft_CorpusReader_One.i
climate_change_lda_model.
Draft_Tika_One.i
Project_Draft_I_LDA_two_sklearn.i
plot_topics_extraction_with_nmf_lda.i
gensim_climate_change_lda_model.
Draft_CorpusReader_Two.i
Project_Draft_H_LDA_gensim.i
ap
tika_test_3
tika_test_2
tika_test
Project_Draft_F_PDF_to_Text_Tika.i
stopwords
run_lda.i
Draft_LDA_one_sklearn.i
random_pdf
nips12raw_str602
env-yukon-state-play-analysis-climate-change-impacts-adaptation
IPCC_SRREN_Ch01
UK-CCRA-2017-Synthesis-Report-Committee-on-Climate-Change
20170125-en
BC-Agriculture-Climate-Change-Action-Plan
PB_Are_the_Dutch_going_green
Energy+agenda
Tr045
pac_third_report_2019_02_27
CAT_research_plan_2015
Perspectives on Climate Change Action in Canada English
info_sheets_summaires
nc6_can_resubmission_english
Climate Change Adaptation - A Priorities Plan for Canada (2012)
waterloo_region_climate_projections_final_revised30oct2015
the_london_plan_malp_march_2016_-_chapter_5

Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/slr-primer.txt
ClimatRisk-E-ACCESSIBLE
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/Not_Yet_Examined_copy/ClimatRisk-E-ACCESSIBLE.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/ClimatRisk-E-ACCESSIBLE.txt
Synthesis_Eng
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/Not_Yet_Examined_copy/Synthesis_Eng.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/Synthesis_Eng.txt
landuse-e
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/Not_Yet_Examined_copy/landuse-e.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_

WCEL_climate_change_FINAL_p21
protect-your-home-from-basement-flooding_p8
WCEL_climate_change_FINAL_p20
WCEL_climate_change_FINAL_p34
ClimatRisk-E-ACCESSIBLE_p103
Vulnerability_Guidebook_June2_EN_p76
Vulnerability_Guidebook_June2_EN_p62
Report_on_Effects_of_a_Changing_Climate_to_the_US_Department_of_Defense_p13
Report_on_Effects_of_a_Changing_Climate_to_the_US_Department_of_Defense_p3
ClimatRisk-E-ACCESSIBLE_p117
Vulnerability_Guidebook_June2_EN_p89
env-yukon-state-play-analysis-climate-change-impacts-adaptation_p63
coastal_flooded_land_guidelines_p37
coastal_flooded_land_guidelines_p23
protect-your-home-from-wildfire_p6
Vancouver-Climate-Change-Adaptation-Strategy-2012-11-07_p30
Adapting_to_Climate_Change_in_Coastal_Communities_In_Canada_White_Paper_p86
builders_guide_2010_final_p38
Adapting_to_Climate_Change_in_Coastal_Communities_In_Canada_White_Paper_p92
Vancouver-Climate-Change-Adaptation-Strategy-2012-11-07_p24
protect-your-home-from-snow-ice-storms_p8
ClimatRisk-E-ACCESSIBLE_p30

A_Residential_Guide_to_Flood_Prevention_and_Recovery_p31
A_Residential_Guide_to_Flood_Prevention_and_Recovery_p25
WCEL_climate_change_FINAL_p71
Vulnerability_Guidebook_June2_EN_p33
En56-226-2008-eng_p6
WCEL_climate_change_FINAL_p59
ClimatRisk-E-ACCESSIBLE_p146
adapt_bulletin-adapt1-eng_p2
ClimatRisk-E-ACCESSIBLE_p152
Vulnerability_Guidebook_June2_EN_p27
A_Residential_Guide_to_Flood_Prevention_and_Recovery_p19
ClimatRisk-E-ACCESSIBLE_p153
A_Residential_Guide_to_Flood_Prevention_and_Recovery_p18
Vulnerability_Guidebook_June2_EN_p26
WCEL_climate_change_FINAL_p58
En56-226-2008-eng_p7
Vulnerability_Guidebook_June2_EN_p32
ClimatRisk-E-ACCESSIBLE_p147
WCEL_climate_change_FINAL_p70
A_Residential_Guide_to_Flood_Prevention_and_Recovery_p24
A_Residential_Guide_to_Flood_Prevention_and_Recovery_p30
WCEL_climate_change_FINAL_p64
ClimatRisk-E-ACCESSIBLE_p190
landuse-e_p36
landuse-e_p22
ClimatRisk-E-ACCESSIBLE_p184
Floodproofing_p16
env-yukon-state-play-analysis-climate-change-impacts-adaptation_p27
e

Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/Climate Change Adaptation - A Priorities Plan for Canada (2012).txt
waterloo_region_climate_projections_final_revised30oct2015
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/waterloo_region_climate_projections_final_revised30oct2015.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/waterloo_region_climate_projections_final_revised30oct2015.txt
the_london_plan_malp_march_2016_-_chapter_5_-_londons_response_to_climate_change
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/the_london_plan_malp_march_2016_-_chapter_5_-_londons_response_to_climate_change.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/the_lond

Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/climate-change-ca.txt
London_tech
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/London_tech.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/London_tech.txt
A_Canadian_Opportunity_-_Tackling_climate_change_by_switching_to_clean_power
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/A_Canadian_Opportunity_-_Tackling_climate_change_by_switching_to_clean_power.pdf
Writing contents to /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/tika_documents_14_12_2019_01_21_PM/A_Canadian_Opportunity_-_Tackling_climate_change_by_switching_to_clean_power.txt
2017BernardSoubryPolicyBrief
Processing /Users/rahimjiwa/Documents/DataScience/UofT3666_AppliedNLP/Final_Testings/NewDocs/