<center><h1>Converting Multiple PDF Files to TXT files, then a <br> Pandas Dataframe, for Exploratory Data Analysis & Hypothesis Testing</h1></center><br>
<center>John Mayers, Physical Scientist <br> NOAA Space Weather Prediction Center</center>

<h2>Background:</h2>

The Space Weather Prediction Center produces and releases 27-day geomagnetic forecasts in its ["Weekly Highlights and 27-day Forecast"](https://www.swpc.noaa.gov/products/weekly-highlights-and-27-day-forecast) product, which is published as a PDF. These forecasts are not archived in a database and can only be found in these PDF documents. To assess forecast skill, the forecasts must first be extracted from these files into a machine-readable format. This program uses tabula-py to extract the forecast table (found on page 4 of each PDF) and write it to a text file. The text files are then read into Pandas dataframes for additional analysis.

**Null Hypothesis:** 27-day forecasts are skillful <br>
**Alternate Hypothesis:** 27-day forecasts are not skillful

In [76]:
import pandas as pd
import tabula
import os
import glob
    
from tabula.io import read_pdf
from tabulate import tabulate
from tabula.io import convert_into
from tqdm import tqdm

<h3>Navigating to files</h3>

In [23]:
#change working directory to location of pdf files
    
os.chdir('C:/Users/john.mayers/Documents/27_Day/Data/test_data/')
print(f'The current working directory is {os.getcwd()}.')

The current working directory is C:\Users\john.mayers\Documents\27_Day\Data\test_data.


In [65]:
pdf_ls = sorted(glob.glob("*.pdf")) #list every PDF file in the working directory

num_pdfs = len(pdf_ls)

print(f'There are {num_pdfs} PDF files in this folder. The files are: {pdf_ls}.')

There are 5 PDF files in this folder. The files are: ['prf2314.pdf', 'prf2315.pdf', 'prf2316.pdf', 'prf2317.pdf', 'prf2318.pdf'].


<h3>Preparing to use tabula-py to convert PDF to TXT</h3>

The tabula-py wrapper converts a PDF with a call of the form *convert_into(input_filename, output_filename, pages="all")*. By default the output is written as comma-separated text, which is why the TXT files can later be read with Pandas. Since hundreds of PDF files will need to be processed, loops are needed to iterate through the input and output filenames.
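The pairing of input and output filenames can be sketched on its own, without calling tabula-py. This is a minimal sketch, not the notebook's method: the helper name `pair_paths` and its parameters are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def pair_paths(folder, pattern="*.pdf", out_ext=".txt"):
    """Return (input_path, output_path) string pairs for each matching file.

    Hypothetical helper for illustration; it is not part of tabula-py.
    """
    # Sort so that the pairs come back in a predictable order
    pdfs = sorted(Path(folder).glob(pattern))
    # Each output path is the input path with its extension swapped
    return [(str(p), str(p.with_suffix(out_ext))) for p in pdfs]
```

Each pair could then be passed to `convert_into(src, dst, pages=4)` inside a single loop, which keeps the input and output names from drifting out of step.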

In [27]:
#creating a list of output_filenames from PDF filenames in dir but with .txt extension

directory = 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/' #folder holding the PDF files

filenames = sorted(glob.glob('*.pdf')) #list only the PDF files in the working directory
filenames = [os.path.splitext(i)[0] for i in filenames] #remove file extension from list of files and resave to variable

ls=[]

for i in range(len(filenames)):
    w = os.path.join(directory, filenames[i] + ".txt") #join directory and filename with .txt extension
    ls.append(w)
    
print(f'The tabula-py output files will be: {ls}.')

The tabula-py output files will be: ['C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2314.txt', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2315.txt', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2316.txt', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2317.txt', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2318.txt'].


In [28]:
#creating a list of input_filenames

ls1=[]

for i in range(len(filenames)):
    r = os.path.join(directory, filenames[i] + ".pdf") #join directory and filename with .pdf extension
    ls1.append(r)

print(f'The tabula-py input files will be: {ls1}.')

The tabula-py input files will be: ['C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2314.pdf', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2315.pdf', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2316.pdf', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2317.pdf', 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/prf2318.pdf'].


<h3>Converting PDF to TXT using tabula-py</h3>

In [78]:
#using tabula-py convert_into() function to convert PDF files in dir to TXT

directory = 'C:/Users/john.mayers/Documents/27_Day/Data/test_data/'

for i in tqdm(range(len(ls))): #one progress-bar step per PDF file
    convert_into(ls1[i], ls[i], pages=4) #iterating through input and output filenames from lists above


In [79]:
#verifying conversion was successful

num_txt = len(glob.glob(os.path.join(directory, "*.txt"))) #count the TXT files in the folder

if num_pdfs == num_txt:
    print("All PDF files were successfully converted to TXT files")
else:
    print("Some PDF files were not successfully converted")

All PDF files were successfully converted to TXT files
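Comparing counts can pass even when a PDF was skipped, for example if an unrelated TXT file sits in the same folder. A name-by-name check is one possible alternative. The sketch below assumes each TXT file shares its base name with its source PDF, as in the lists built earlier; the function name `missing_conversions` is hypothetical.

```python
import os


def missing_conversions(folder):
    """Return base names of PDFs in folder that have no matching TXT file.

    Hypothetical helper for illustration only.
    """
    names = os.listdir(folder)
    # Collect base names (without extension) for each file type
    pdf_stems = {os.path.splitext(n)[0] for n in names if n.lower().endswith(".pdf")}
    txt_stems = {os.path.splitext(n)[0] for n in names if n.lower().endswith(".txt")}
    # Anything present as a PDF but absent as a TXT was not converted
    return sorted(pdf_stems - txt_stems)
```

An empty list means every PDF has a TXT counterpart; otherwise the returned names show which files to rerun.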


<h3>Converting all TXT files to Pandas Dataframe</h3>

In [88]:
f=[]

for file in os.listdir(directory):
    if file.endswith(".txt"):
        f.append(file)

In [89]:
#read each TXT file and join the resulting dataframes side by side (axis=1)

frames = [pd.read_csv(file) for file in f]
main_dataframe = pd.concat(frames, axis=1)

In [90]:
main_dataframe.head()

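| | 06 Jan | 72 | 12 | 4 | Unnamed: 4 | 20 Jan | 70 | 5 | 2 | Unnamed: 9 | ... | 03 Feb | 70.1 | 5.2 | 2.2 | Unnamed: 4.1 | 17 Feb | 72.1 | 5.1 | 2.1 | Unnamed: 9.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 72 | 5 | 2 | NaN | 21 | 70.0 | 5.0 | 2.0 | NaN | ... | 4 | 70 | 5 | 2 | NaN | 18.0 | 72.0 | 5.0 | 2.0 | NaN |
| 1 | 8 | 72 | 8 | 3 | NaN | 22 | 70.0 | 5.0 | 2.0 | NaN | ... | 5 | 70 | 5 | 2 | NaN | 19.0 | 72.0 | 5.0 | 2.0 | NaN |
| 2 | 9 | 72 | 8 | 3 | NaN | 23 | 70.0 | 5.0 | 2.0 | NaN | ... | 6 | 70 | 12 | 4 | NaN | 20.0 | 72.0 | 5.0 | 2.0 | NaN |
| 3 | 10 | 72 | 8 | 3 | NaN | 24 | 70.0 | 5.0 | 2.0 | NaN | ... | 7 | 70 | 10 | 3 | NaN | 21.0 | 72.0 | 5.0 | 2.0 | NaN |
| 4 | 11 | 72 | 5 | 2 | NaN | 25 | 71.0 | 5.0 | 2.0 | NaN | ... | 8 | 70 | 10 | 3 | NaN | 22.0 | 72.0 | 5.0 | 2.0 | NaN |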
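In the output above, the first forecast row of each file (for example `06 Jan, 72, 12, 4`) has been used as the column labels, because `read_csv` treats the first line as a header by default. One possible way to keep that row as data is to pass `header=None`. The sketch below uses two small in-memory stand-ins rather than the real files; the file names, the values, and the added `source_file` column are made up for illustration, and stacking by rows is an assumption rather than the notebook's approach.

```python
import io

import pandas as pd

# Tiny stand-ins for converted files; names and values are illustrative only
sample_files = {
    "sample_a.txt": "06 Jan,72,12,4\n07 Jan,72,5,2\n",
    "sample_b.txt": "03 Feb,70,5,2\n04 Feb,70,5,2\n",
}

frames = []
for name, text in sample_files.items():
    # header=None keeps the first line as a data row instead of column labels
    df = pd.read_csv(io.StringIO(text), header=None)
    # Record which file each row came from (hypothetical bookkeeping column)
    df["source_file"] = name
    frames.append(df)

# Stack the files on top of each other with a fresh 0..n-1 index
stacked = pd.concat(frames, axis=0, ignore_index=True)
```

With this layout every forecast day is one row, which may be easier to clean than the side-by-side arrangement.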


<h3>Cleaning the Data for Analysis</h3>