# Exploring Different Methods for Extracting Content Stuck in PDFs

### 1. Using the `pdftools` package in R to extract text
**Main Steps:** \
• Use `pdf_text` to extract all text from a PDF\
• Collapse runs of multiple spaces\
• Use regular expressions to split the text and select the information we need\
• Output a list of text chunks, where each element holds the content of one table we want\
• Further clean the text (hard to achieve because of the poor quality of the PDFs)\
• Convert the text to a data frame (not reached yet)
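
The space-collapsing and splitting steps above can be illustrated with a minimal sketch. It is written in Python only to match the other code in this document and is not the R implementation. The sample text, the `TABLE n` marker, and the helper name `split_tables` are all hypothetical; a real PDF would need its own pattern.

```python
import re

# Hypothetical page text, standing in for what a PDF text extractor returns.
page_text = (
    "TABLE 1\n"
    "1     0.0     100.5\n"
    "2     5.0     101.2\n"
    "TABLE 2\n"
    "1     0.0     200.0\n"
)

def split_tables(text, marker=r"TABLE \d+\n"):
    """Collapse repeated spaces, then split the text at each table marker."""
    # Replace runs of two or more spaces or tabs with a single space.
    collapsed = re.sub(r"[ \t]{2,}", " ", text)
    # Split on the (assumed) marker and drop empty chunks.
    chunks = re.split(marker, collapsed)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

tables = split_tables(page_text)
for table in tables:
    print(table.splitlines())
```

Each element of `tables` then holds the text of one table, which is the list described in the fourth step.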

**Pros:** The input is the PDF file itself, so there is no need to convert it to images.\
**Cons:** The performance of the RegEx approach and the `pdf_text` function depends heavily on the quality of the input. We cannot improve their ability to read the text, since they are not trainable models.

Click [here](https://drive.google.com/file/d/1bg73-6Najxc-K1_PM63pzAUkyNXK5_gp/view?usp=drive_link) to read the code in R.

### 2. Using Tabula
Tabula is a free and open-source tool designed for extracting tables from PDF files and converting them into machine-readable formats such as CSV or JSON. It is particularly useful for journalists, researchers, and data analysts who need to work with data trapped inside PDF documents, a format that does not lend itself to data analysis. You can find more information [here](https://tabula.technology/).

**Main Steps:** 
- Download and install the Tabula app
- Upload the PDF to the Tabula app
- Manually select the areas from which we want to extract the contents
- Export the output in the format you need (e.g. CSV, ZIP, JSON)

**Example 1: P00301_003#1.pdf** (better outcome)

```python
import zipfile
import pandas as pd
import numpy as np
```

```python
# Path to the ZIP file exported from Tabula
zip_file_path = 'P00301_003.zip'

# Open the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as z:
    # List to hold dataframes
    dfs = []

    # Iterate through each file in the ZIP
    for filename in z.namelist():
        # Check if the file is a CSV
        if filename.endswith('.csv'):
            # Read the CSV file (the exported tables have no header row)
            with z.open(filename) as f:
                df = pd.read_csv(f, header=None)
                dfs.append(df)
```
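
Since the same loading loop is needed for every Tabula export, it could be wrapped in a small helper. The sketch below is one possible shape, not the code used for the results in this document: the function name `read_csvs_from_zip` is hypothetical, and the demo builds a tiny in-memory archive with made-up values so that it runs without any exported file.

```python
import io
import zipfile

import pandas as pd

def read_csvs_from_zip(zip_source):
    """Return one DataFrame per CSV file in the archive, in archive order."""
    frames = []
    with zipfile.ZipFile(zip_source, "r") as z:
        for filename in z.namelist():
            # Skip anything that is not a CSV file
            if filename.lower().endswith(".csv"):
                with z.open(filename) as f:
                    frames.append(pd.read_csv(f, header=None))
    return frames

# Demo on a made-up in-memory archive (not a real Tabula export)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("table-0.csv", "1,0.0,100.5\n2,5.0,101.2\n")
    z.writestr("notes.txt", "not a table")
buffer.seek(0)

demo_frames = read_csvs_from_zip(buffer)
print(len(demo_frames), demo_frames[0].shape)
```

With a real export, the call would simply take the path to the ZIP file, for example `read_csvs_from_zip('P00301_003.zip')`.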

```python
# manually add headers
col_names = ["lable_point", "delta_time", "presure_(PSI)", "T+DT/DT", "log", "Pw_Pf(PSI)", "comments"]
dfs[0].columns = col_names
dfs[0]  # the first dataframe in the pdf
```

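|   | lable_point | delta_time | presure_(PSI) | T+DT/DT | log | Pw_Pf(PSI) | comments |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 |  | 1870.8 |  |  |  | HYDROSTATICMUD |
| 1 | 2.0 | 0.0 | 1679.5 |  |  |  | INITIALFLOW(1) |
| 2 |  | 5.0 | 1707.6 |  |  |  |  |
| 3 |  | 10.0 | 1723.5 |  |  |  |  |
| 4 | 3.0 | 11.0 | 1726.4 |  |  |  | INITIALFL0W(2) |
| 5 | 3.0 | 0.0 | 1726.4 |  |  |  | STARTEDSHUT-IN |
| 6 |  | 5.0 | 1780.8 | 3.2 | 0.505 | 54.4 |  |
| 7 |  | 10.0 | 1798.6 | 2.1 | 0.322 | 72.2 |  |
| 8 |  | 15.0 | 1807.0 | 1.733 | 0.239 | 80.7 |  |
| 9 |  | 20.0 | 1813.6 | 1.55 | 0.19 | 87.2 |  |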


```python
# remove spaces within each cell
# (on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap)
dfs[0] = dfs[0].applymap(lambda x: x.replace(' ', '') if isinstance(x, str) else x)

# manually fix some misreadings; .loc assigns in place and avoids the
# SettingWithCopyWarning that chained indexing (df[col][row] = value) raises
dfs[0].loc[16, "delta_time"] = 55
dfs[0].loc[19, "presure_(PSI)"] = 1740.4

dfs[0]
```

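|   | lable_point | delta_time | presure_(PSI) | T+DT/DT | log | Pw_Pf(PSI) | comments |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 |  | 1870.8 |  |  |  | HYDROSTATICMUD |
| 1 | 2.0 | 0.0 | 1679.5 |  |  |  | INITIALFLOW(1) |
| 2 |  | 5.0 | 1707.6 |  |  |  |  |
| 3 |  | 10.0 | 1723.5 |  |  |  |  |
| 4 | 3.0 | 11.0 | 1726.4 |  |  |  | INITIALFL0W(2) |
| 5 | 3.0 | 0.0 | 1726.4 |  |  |  | STARTEDSHUT-IN |
| 6 |  | 5.0 | 1780.8 | 3.2 | 0.505 | 54.4 |  |
| 7 |  | 10.0 | 1798.6 | 2.1 | 0.322 | 72.2 |  |
| 8 |  | 15.0 | 1807.0 | 1.733 | 0.239 | 80.7 |  |
| 9 |  | 20.0 | 1813.6 | 1.55 | 0.19 | 87.2 |  |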


**Example 2: P00315_A004** (worse outcome)

```python
# Path to the ZIP file exported from Tabula
zip_file_path = 'P00315_A004.zip'

# Open the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as z:
    # List to hold dataframes
    dfs2 = []

    # Iterate through each file in the ZIP
    for filename in z.namelist():
        # Check if the file is a CSV
        if filename.endswith('.csv'):
            # Read the CSV file (the exported tables have no header row)
            with z.open(filename) as f:
                df2 = pd.read_csv(f, header=None)
                dfs2.append(df2)
```

```python
dfs2[1]
```

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 13 27 :30 27 -JY | 15 492 | 0 000 | 248 0 | 3343 99 | 0 00 |
| 1 | 13 28 :06 27 -JY | 15 502 | 0 010 | 249 3 | 3401 80 | 57 81 1 007 |
| 2 | 13 28 :42 27 -JY | 15 512 | 0 020 | 250 4 | 3407 11 | 63 12 0 747 |
| 3 | 13 29 :18 27 -JY | 15 522 | 0 030 | 251 5 | 3408 45 | 64 46 0 608 |
| 4 | 13 29 :54 27 -JY | 15 532 | 0 040 | 252 5 | 3409 11 | 65 12 0 517 |
| 5 | 13 30 :30 27 -JY | 15 542 | 0 050 | 253 3 | 3409 44 | 65 45 0 452 |
| 6 | 13 31 :06 27 -JY | 15 552 | 0 060 | 254 1 | 3409 65 | 65 66 0 403 |
| 7 | 13 31 :42 27 -JY | 15 562 | 0 070 | 254 7 | 3409 70 | 65 71 0 364 |
| 8 | 13 32 :18 27 -JY | 15 572 | 0 080 | 255 3 | 3409 77 | 65 78 0 332 |
| 9 | 13 32 :54 27 -JY | 15 582 | 0 090 | 255 8 | 3409 74 | 65 75 0 305 |


The output for this file was less accurate than the output from Example 1. Tabula merged the first two columns into one, did the same with the last two, and failed to capture most decimal points. This happened because of the poor quality of the PDF and the inconsistent spacing between columns.
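
As a rough illustration of the post-processing this output would need, the sketch below assumes that a space inside a numeric cell stands where a decimal point was lost, so that `3343 99` would mean 3343.99. That assumption has not been checked against the original PDF, and the helper names `restore_decimal` and `split_merged_pair` are hypothetical.

```python
import math

def restore_decimal(cell):
    """Read 'INT FRAC' as INT.FRAC; return NaN if the cell has another shape."""
    parts = str(cell).split()
    if len(parts) != 2:
        return math.nan
    return float(f"{parts[0]}.{parts[1]}")

def split_merged_pair(cell):
    """Split a cell holding two merged numbers, e.g. '57 81 1 007'."""
    parts = str(cell).split()
    first = restore_decimal(" ".join(parts[:2]))
    # The second number is missing when only one value was captured
    second = restore_decimal(" ".join(parts[2:4]))
    return first, second

# Cells copied from rows 0 and 1 of the table above
print(restore_decimal("3343 99"))
print(split_merged_pair("57 81 1 007"))
print(split_merged_pair("0 00"))
```

A rule like this would only be safe after spot-checking several rows against the source PDF, since a misread digit cannot be told apart from a lost decimal point.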

**Pros and Cons**

Pros: It is free. The process is easy and straightforward. The output can be exported in different formats.

Cons: The outcome depends heavily on the quality of the input PDF.