# Dataset Collection

- The data was collected from Google Cloud Storage (GCS), where it is freely available in buckets for bulk access.
- The command-line tool `gsutil` was used to access arXiv's physics PDF buckets and download them to a local machine.
- The download was about 7.19 GB, made up of 22.3K PDFs in different versions.
- The dataset was then uploaded to Google Drive so that it can be accessed easily from Google Colab.
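
As a rough sketch of the download step, the snippet below builds the kind of `gsutil` command that could be used for a bulk copy. The bucket URL and destination folder are placeholders (assumptions), not the paths actually used here, and the helper name `build_gsutil_copy_command` is hypothetical. The `-m` (parallel transfers) and `cp -r` (recursive copy) options are standard `gsutil` options.

```python
import shlex

def build_gsutil_copy_command(bucket_url, dest_dir):
    """Return the argument list for a recursive, parallel gsutil copy."""
    # -m runs transfers in parallel; cp -r copies the bucket prefix recursively
    return ["gsutil", "-m", "cp", "-r", bucket_url, dest_dir]

# Placeholder values (assumptions): replace with the real bucket prefix and local folder
command = build_gsutil_copy_command("gs://<bucket>/<prefix>/", "./dataset/pdf/")
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where `gsutil` is installed.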

# PDF To Text
Since our files are **native PDFs** (the text is already digitally encoded), there is **no need to apply any OCR** (optical character recognition) techniques.

- This means we can use **PyMuPDF**, **PyPDF2**, or **PDFMiner.six**.
- We will test all of them on one PDF file and compare the results.
- The evaluation is done manually (human evaluation).

In [None]:
pdf_file =  "/Users/tayssirboukrouba/Downloads/dataset/pdf/9905/9905061v3.pdf"

## Testing **PyMuPDF**

In [None]:
pip install PyMuPDF

In [None]:
import pymupdf

doc = pymupdf.open(pdf_file)
pymupdf_text = "\n".join([page.get_text() for page in doc])

In [None]:
print(pymupdf_text)

## Testing **PyPDF2**

In [None]:
pip install PyPDF2

In [None]:
from PyPDF2 import PdfReader

reader = PdfReader(pdf_file)

pypdf_text = "\n".join(page.extract_text() for page in reader.pages)

In [None]:
print(pypdf_text)

In [None]:
pypdf_text == pymupdf_text
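
A strict equality check only says whether the two outputs are identical. A similarity ratio gives a more graded picture. The sketch below uses the standard-library `difflib.SequenceMatcher`; the helper name `text_similarity` and the toy strings are assumptions for illustration.

```python
from difflib import SequenceMatcher

def text_similarity(text_a, text_b):
    """Return a similarity ratio in [0, 1] between two extracted texts."""
    # autojunk=False disables the heuristic that ignores very frequent characters
    return SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()

# Toy strings stand in for pypdf_text and pymupdf_text
print(text_similarity("the energy E of the system", "the energy E of the system"))
print(text_similarity("the energy E of the system", "theenergy Eof the system"))
```

`SequenceMatcher` can be slow on very long strings, so comparing page by page is likely more practical than comparing whole documents.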

## Testing **PDFMiner.six**

In [None]:
pip install pdfminer.six

In [None]:
from io import StringIO
import re
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open(pdf_file, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

pdfminer_text = output_string.getvalue()

In [None]:
print(pdfminer_text)
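
To support the manual comparison, a few rough statistics per tool can help spot obvious differences before reading the outputs. This is only a sketch: the helper name `summarize_extraction` is hypothetical, and counting non-ASCII characters is a crude proxy (an assumption) for how many math symbols were recognized.

```python
def summarize_extraction(text):
    """Return rough size statistics for one extracted text."""
    return {
        "characters": len(text),
        "words": len(text.split()),
        "lines": len(text.splitlines()),
        # Non-ASCII characters are a crude proxy for recognized math symbols
        "non_ascii": sum(1 for ch in text if ord(ch) > 127),
    }

# Toy strings stand in for pymupdf_text, pypdf_text and pdfminer_text
samples = {
    "tool_a": "E = mc2\nwhere m is mass",
    "tool_b": "∫ f(x) dx\nwhere f is smooth",
}
for name, text in samples.items():
    print(name, summarize_extraction(text))
```

These numbers do not replace the human evaluation; they only point to where the outputs differ.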

## **Conclusion :**

> Overall, after inspecting the text produced by each Python tool, **PDFMiner.six** gave the best results, especially at detecting variables (some equations were not detected, but that is not our concern).

# Testing on Math Notations
In this step we test our tools on a PDF of raw mathematical notation to see which one performs better.

## Using **PyPDF2**

In [None]:
from PyPDF2 import PdfReader

math_pdf = "/Users/tayssirboukrouba/Desktop/Math Notations List - Cambridge -.pdf"
reader = PdfReader(math_pdf)

pypdf_math_text = "\n".join(page.extract_text() for page in reader.pages)

In [None]:
print(pypdf_math_text)

## Using **PDFMiner.six**




In [None]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open(math_pdf, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

pdfminer_math_text = output_string.getvalue()

In [None]:
print(pdfminer_math_text)

## **Conclusion :**
> **PyPDF2** had better formatting, while **PDFMiner.six** had better symbol identification.

# PDF Text Cleanup Attempt
In this step we try to fix the problems found with the tools, to see whether we can overcome them for one tool and use it to create the text dataset.
- We'll try to fix **PyPDF2** by:
  - Fixing its symbol recognition problem (some symbols are output as LaTeX-like glyph names instead of the symbol itself)
  - Fixing its spacing problem (some variables are joined to neighboring words, which makes them harder to recognize)

## Substitution-based Symbol Correction

In [None]:
def replace_symbols(text, replacements):
  """Replaces glyph names left in the text with the symbols given in a dictionary."""
  new_text = text
  for symbol, replacement in replacements.items():
    new_text = new_text.replace(symbol, replacement)
  return new_text

In [None]:
replacements = {
	"/bardbl" : "║" ,
	"/angb∇acket∇ight" : ">" ,
	"/angb∇acketleft" : "<" ,
	"/parenleftBigg"  : "(" ,
	"/parenrightBigg" : ")" ,
	"/integraldisplay" : "∫"
}

replaced_text = replace_symbols(pypdf_text, replacements)
print(replaced_text)
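
The dictionary above was filled in by hand. To find glyph names that are still left in the text, one option is to count slash-prefixed names with a regular expression. This is a sketch under an assumption: that leftover names look like `/name` with at least four ASCII letters. It can also match unrelated text such as URL path segments, and it will miss names whose letters were garbled, so the result still needs a manual check. The helper name `count_glyph_names` is hypothetical.

```python
import re
from collections import Counter

# Assumption: leftover glyph names look like "/name" with at least four letters
GLYPH_NAME = re.compile(r"/[A-Za-z]{4,}")

def count_glyph_names(text):
    """Count slash-prefixed glyph names that still appear in the text."""
    return Counter(GLYPH_NAME.findall(text))

# Toy string standing in for pypdf_text
sample = "/integraldisplay f(x) dx = /parenleftBigg a /parenrightBigg /integraldisplay"
for name, count in count_glyph_names(sample).most_common():
    print(name, count)
```

The most frequent names are the best candidates to add to `replacements`.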

## NLTK-based text spacing correction

In [None]:
import nltk
print(nltk.__version__)

In [None]:
def improve_spacing(text):
  """Attempts to improve spacing in a sentence using NLTK tokenization."""
  tokens = nltk.word_tokenize(text)  # Split text into words (tokens)
  return " ".join(tokens)  # Join tokens with spaces

In [None]:
improved_text = improve_spacing(replaced_text)
improved_text
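
Tokenization only splits at whitespace and punctuation, so it cannot separate words that were joined without any delimiter. One possible direction, which is not used in this notebook, is dictionary-based segmentation with dynamic programming. The sketch below assumes a known vocabulary (the toy set here is hypothetical) and returns the split with the fewest words.

```python
def segment_words(token, vocabulary):
    """Split a run-together token into known words, or return None if impossible."""
    n = len(token)
    # best[i] holds a list of words covering token[:i], or None if no split exists
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                # Prefer the split that uses the fewest words
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

# Toy vocabulary (assumption); a real one would come from a word list or corpus
vocabulary = {"the", "energy", "of"}
print(segment_words("theenergyof", vocabulary))
print(segment_words("xyz", vocabulary))
```

Short variable names such as `x` or `E` make this ambiguous in scientific text, which is part of why the spacing problem is hard to fix after extraction.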

## **Conclusion :**
> We were able to fix some symbol issues for **PyPDF2**, but could not fix the spacing problem.

- Since **PDFMiner.six** showed fewer issues with variable detection, which is our main concern, we will use it to create the text data.



# Making Textual Data
In this part we go from PDF data to text data to complete our text preparation phase, in three steps:


1.  We test our pipeline on one PDF file.
2.  We wrap the pipeline in a function.
3.  We apply it to all of our PDF data.


## Testing on sample PDF

In [None]:
# getting filename
file_name = pdf_file.split("/")[-1].split(".pdf")[0]+".txt"
file_name
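
The same file name can be derived with the standard-library `pathlib`, which avoids splitting on `/` by hand. The helper name `text_file_name` and the relative path are assumptions for illustration.

```python
from pathlib import Path

def text_file_name(pdf_path):
    """Return the .txt file name that corresponds to a PDF path."""
    # with_suffix swaps only the final extension; .name drops the directories
    return Path(pdf_path).with_suffix(".txt").name

print(text_file_name("dataset/pdf/9905/9905061v3.pdf"))  # 9905061v3.txt
```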

In [None]:
path =  "/Users/tayssirboukrouba/Downloads/dataset/pdf/9905/9905061v3.pdf"
# getting file name
file_name = path.split("/")[-1].split(".pdf")[0]+".txt"
output_string = StringIO()
with open(path, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
      interpreter.process_page(page)

path2 = "/Users/tayssirboukrouba/Downloads/dataset/text/"
text = output_string.getvalue()

# Writing the text file (the context manager closes it automatically)
with open(path2 + file_name, "w") as f:
    f.write(text)

In [None]:
# Reading the file back to check its content
with open(path2 + file_name, "r") as f:
    print(f.read())

## Creating the Text Pipeline Function

In [None]:
import os

def make_text(pdf_path, save_path):
    """
    Converts a PDF file to a text file that is saved in a selected directory.

    Args:
        pdf_path (str): The path to the PDF file to be converted.
        save_path (str): The directory path where the extracted text file will be saved.

    Raises:
        OSError: If an error occurs while accessing or creating the files.
        ValueError: If either `pdf_path` or `save_path` is an empty string.

    Returns:
        None: This function does not explicitly return a value, but it creates a text file
              containing the extracted text from the PDF.

    Prints:
        A message indicating whether the text file was created or skipped.

    This function uses the `pdfminer.six` library (not included by default) to extract text
    from the PDF and save it to a new text file.
    """
    # Rejecting empty paths, as documented above
    if not pdf_path or not save_path:
        raise ValueError("`pdf_path` and `save_path` must be non-empty strings")

    # Getting filename: "<name>.pdf" -> "<name>.txt"
    file_name = os.path.splitext(os.path.basename(pdf_path))[0] + ".txt"

    # Skipping files that were already converted
    output_path = os.path.join(save_path, file_name)
    if os.path.exists(output_path):
        print(f"Skipped '{file_name}' (already exists)")
        return

    # Getting text from PDF file (pdfminer.six)
    output_string = StringIO()
    with open(pdf_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    # Writing the text file (UTF-8 so that math symbols are preserved)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(output_string.getvalue())
    print(f"'{file_name}' was created successfully!")

In [None]:
help(make_text)

In [None]:
# testing on one file
path =  "/Users/tayssirboukrouba/Downloads/dataset/pdf/9905/9905061v3.pdf"
save_path = "/Users/tayssirboukrouba/Downloads/dataset/text/"

make_text(path,save_path)

## Applying the `make_text()` function

In [None]:
import os

pdf_path = "/Users/tayssirboukrouba/Downloads/dataset/pdf/"
save_path = "/Users/tayssirboukrouba/Downloads/dataset/text/"

for root, directories, files in os.walk(pdf_path):
    # Convert every PDF within the current directory (root)
    for filename in files:
        if not filename.lower().endswith(".pdf"):
            continue  # skip non-PDF files (e.g. .DS_Store)
        filepath = os.path.join(root, filename)
        make_text(filepath, save_path)
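
With tens of thousands of files, a single unreadable PDF would stop the loop above. A possible variant, sketched here, catches errors per file and collects them for later inspection. The function name `convert_all` is hypothetical; `convert` is any callable with the same signature as `make_text`.

```python
import os

def convert_all(pdf_root, save_dir, convert):
    """Apply convert(pdf_path, save_dir) to every PDF under pdf_root; return failures."""
    failures = []
    for root, _dirs, files in os.walk(pdf_root):
        for filename in sorted(files):
            if not filename.lower().endswith(".pdf"):
                continue  # skip non-PDF files such as .DS_Store
            filepath = os.path.join(root, filename)
            try:
                convert(filepath, save_dir)
            except Exception as error:  # keep going if one file cannot be converted
                failures.append((filepath, repr(error)))
    return failures

# Usage in this notebook (not run here, since it needs the real dataset):
# failures = convert_all(pdf_path, save_path, make_text)
# print(f"{len(failures)} files could not be converted")
```

Returning the failures instead of printing them makes it easy to retry or inspect only the problematic files.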