# Dataset Collection

- The data was collected from Google Cloud Storage (GCS), where it is freely available in buckets for bulk access.
- The command-line tool `gsutil` was used to access arXiv's physics PDF buckets and download the files to a local machine.
- The download came to about 7.19 GB across 22.3K PDFs, covering different paper versions.
- The dataset was then uploaded to Google Drive so it can be accessed easily from Google Colab.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
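
Once the Drive is mounted, the figures quoted above (number of PDFs, total size, mix of versions) can be re-checked against the uploaded copy. The sketch below is one possible way to do that with the standard library only; the function name `summarize_pdfs` and the assumption that every file name ends in a version suffix such as `v3` are illustrative and not part of the original notebook.

```python
from collections import Counter
from pathlib import Path
import re

# Assumption: arXiv file names end in a version suffix such as "v3".
VERSION_RE = re.compile(r"v(\d+)$")


def summarize_pdfs(root):
    """Count the PDFs under `root`, their total size, and the files per version."""
    pdf_paths = sorted(Path(root).rglob("*.pdf"))
    total_bytes = sum(path.stat().st_size for path in pdf_paths)
    versions = Counter()
    for path in pdf_paths:
        match = VERSION_RE.search(path.stem)
        versions[match.group(0) if match else "unversioned"] += 1
    return {
        "count": len(pdf_paths),
        "gigabytes": total_bytes / 1e9,
        "versions": dict(versions),
    }
```

Calling `summarize_pdfs(...)` on the dataset's `pdf` folder on the mounted Drive should report roughly the counts listed above if the upload is complete.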


# PDF To Text
Since the files we have are **native PDFs** (the text is already digitally encoded), there is **no need to apply any OCR** (optical character recognition) techniques.

- This means we can use **PyMuPDF**, **PyPDF2**, or **PDFMiner.six**.
- We will test all of them on one PDF file and compare the results.
- The evaluation is done manually (human evaluation).
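
Because the comparison is manual, it can help to run every tool through the same small harness and look at the outputs side by side. The following is a minimal sketch of such a harness, not the procedure the notebook actually follows; `run_extractors` and `preview` are hypothetical helper names, and each extractor is assumed to be a function that takes a PDF path and returns a string.

```python
import time


def run_extractors(pdf_path, extractors):
    """Run each extractor on one PDF and record its text, runtime, and any error."""
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        try:
            text, error = extract(pdf_path), None
        except Exception as exc:  # keep going so one failing tool does not hide the others
            text, error = "", repr(exc)
        results[name] = {
            "text": text,
            "seconds": time.perf_counter() - start,
            "error": error,
        }
    return results


def preview(results, n_chars=300):
    """Print the first `n_chars` characters of every extraction for manual review."""
    for name, result in results.items():
        print(f"=== {name} ({result['seconds']:.2f}s) ===")
        print(result["error"] or result["text"][:n_chars])
```

Each library call used in the cells below can be wrapped in a one-argument function and passed in through the `extractors` dictionary.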

In [2]:
pdf_file =  "/content/drive/MyDrive/UH - Final Year Project/Dataset/pdf/9905/9905061v3.pdf"

## Testing **PyMuPDF**

In [3]:
pip install PyMuPDF

Collecting PyMuPDF
  Downloading PyMuPDF-1.24.5-cp310-none-manylinux2014_x86_64.whl (3.5 MB)
Collecting PyMuPDFb==1.24.3 (from PyMuPDF)
  Downloading PyMuPDFb-1.24.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.8 MB)
Installing collected packages: PyMuPDFb, PyMuPDF
Successfully installed PyMuPDF-1.24.5 PyMuPDFb-1.24.3


In [4]:
import pymupdf

doc = pymupdf.open(pdf_file)
pymupdf_text = "\n".join([page.get_text() for page in doc])

In [5]:
print(pymupdf_text)

arXiv:physics/9905061v3  [physics.plasm-ph]  8 Jun 1999
DPNU-99-14
Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave
Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)
Abstract
Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics. It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >
∼100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is ver

## Testing **PyPDF2**

In [6]:
pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [23]:
from PyPDF2 import PdfReader

reader = PdfReader(pdf_file)

pypdf_text = "\n".join([page.extract_text() for page in reader.pages])

In [24]:
print(pypdf_text)

arXiv:physics/9905061v3  [physics.plasm-ph]  8 Jun 1999DPNU-99-14
Electron acceleration to ultrarelativistic energies in a c ollisionless
oblique shock wave
Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602 , Japan
(July 14, 2011)
Abstract
Electron motion in an oblique shock wave is studied by means o f a one-
dimensional, relativistic, electromagnetic, particle si mulation code with full
ion and electron dynamics. It is found that an oblique shock c an produce
electrons with ultra-relativistic energies; Lorentz fact ors with γ>∼100 have
been observed in our simulations. The physical mechanisms f or the reﬂection
and acceleration are discussed, and the maximum energy is es timated. If
the electron reﬂection occurs near the end of a large-amplit ude pulse, those
particles will then be trapped in the pulse and gain a great de al of energy.
The theory predicts that the electron energies can become es pecially high at
certain propagation angles. Thi

In [9]:
pypdf_text == pymupdf_text

False
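
A bare `==` only says that the two strings differ, not how. One way to dig further, sketched below with the standard library's `difflib`, is to check whether the texts become identical once all whitespace is removed, measure their similarity, and list the first differing lines. The function name `compare_extractions` and the `limit` and `max_lines` defaults are assumptions made for this illustration.

```python
import difflib


def strip_whitespace(text):
    """Remove every whitespace character so that only the glyphs are compared."""
    return "".join(text.split())


def compare_extractions(a, b, limit=5000, max_lines=20):
    """Summarize how two extractions of the same PDF differ."""
    # The similarity is computed on a prefix only, to keep the comparison fast.
    matcher = difflib.SequenceMatcher(None, a[:limit], b[:limit])
    diff = difflib.unified_diff(
        a.splitlines(), b.splitlines(), "a", "b", lineterm="", n=0
    )
    return {
        "same_ignoring_whitespace": strip_whitespace(a) == strip_whitespace(b),
        "similarity": matcher.ratio(),
        "diff": list(diff)[:max_lines],
    }
```

If `compare_extractions(pymupdf_text, pypdf_text)["same_ignoring_whitespace"]` is true, the two tools recovered the same characters and disagree only on spacing and line breaks.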

## Testing **PDFMiner.six**

In [10]:
pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20231228


In [11]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()  # collects the extracted text in memory
with open(pdf_file, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # LAParams() turns on layout analysis with its default settings
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Interpret each page; the converter writes its text into output_string
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

pdfminer_text = output_string.getvalue()

In [12]:
print(pdfminer_text)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.

## **Conclusion :**

> Overall, after inspecting the text extracted by each Python tool, **PDFMiner.six** gave the best results, especially at detecting variables (some equations were not detected, but that is not our concern).

# Testing on Math Notations
In this step we test our tools on a PDF that consists almost entirely of mathematical notation, to see which one does better.

## Using **PyPDF2**

In [14]:
from PyPDF2 import PdfReader

math_pdf = "/content/drive/MyDrive/UH - Final Year Project/Math Notations List - Cambridge -.pdf"
reader = PdfReader(math_pdf)

pypdf_math_text = "\n".join([page.extract_text() for page in reader.pages])

In [15]:
print(pypdf_math_text)

 
  
  
 
     
 
    
 
Notation List  
 
  
For Cambridge International Mathematics 
Qualifications 
 
 
   
For use from 2020 
 
 
Notation List for Cambridge International Ma thematics Qualifications (For use from 2020) 
 
2 Mathematical notation 
Examinations for CIE syllabuses may use relevant notation from the following list. 
1 Set notation 
∈ is an element of 
∉ is not an element of 
{x1, x2, …} the set with elements x1, x2, … 
{x : …} the set of all x such that … 
n(A) the number of elements in set A 
∅ the empty set 
 the universal set 
, the universal set (for 0607 IGCSE International Mathematics) 
A′ the complement of the set A 
ℕ the set of natural numbers, {1, 2, 3, …} 
ℤ the set of integers, {0, ±1, ±2, ±3, …} 
ℚ the set of rational numbers, :,  ppqq∈ 
	ℤ, 0q≠
 
ℝ the set of real numbers 
ℂ the set of complex numbers 
(x, y) the ordered pair x, y 
⊆ is a subset of 
⊂ is a proper subset of 
⋃ union 
⋂ intersection 
[a, b] the closed interval { x ∈ ℝ : a ⩽ x ⩽ b} 
[

## Using **PDFMiner.six**




In [16]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open(math_pdf, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

pdfminer_math_text = output_string.getvalue()

In [17]:
print(pdfminer_math_text)

Notation List  

For Cambridge International Mathematics 
Qualifications 

For use from 2020 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Notation List for Cambridge International Mathematics Qualifications (For use from 2020) 

Mathematical notation 

Examinations for CIE syllabuses may use relevant notation from the following list. 

1  Set notation 

∈ 
∉ 
{x1, x2, …} 
{x : …} 
n(A) 
∅ 

, 
A′ 
ℕ 
ℤ 

ℚ 

ℝ 
ℂ 
(x, y) 
⊆ 
⊂ 
⋃ 
⋂ 
[a, b] 
[a, b) 
(a, b] 
(a, b) 
(S, ∘) 

2  Miscellaneous symbols 

is an element of 
is not an element of 
the set with elements x1, x2, … 
the set of all x such that … 
the number of elements in set A 
the empty set 
the universal set 
the universal set (for 0607 IGCSE International Mathematics) 
the complement of the set A 
the set of natural numbers, {1, 2, 3, …} 
the set of integers, {0, ±1, ±2, ±3, …} 

the set of rational numbers, 





p
q

:

p q
,  

∈

	ℤ,

q


0
≠ 


the set of real numbers 
the set of complex numbers 
the ord

## **Conclusion :**
> **PyPDF2** had better formatting, while **PDFMiner.six** had better symbol identification.
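
The symbol comparison can be made a little less subjective by counting which mathematical characters each tool actually produced. The sketch below relies on Unicode metadata from the standard `unicodedata` module: it counts characters in the "Sm" (math symbol) category plus double-struck letters such as ℝ. The helper names are hypothetical, and the choice of categories is an assumption about what counts as a "symbol" here; note that ordinary characters such as `+` and `=` are also in "Sm".

```python
from collections import Counter
import unicodedata


def math_symbol_counts(text):
    """Count characters that Unicode classes as math symbols or double-struck letters."""
    counts = Counter()
    for char in text:
        is_math = unicodedata.category(char) == "Sm"
        is_double_struck = "DOUBLE-STRUCK" in unicodedata.name(char, "")
        if is_math or is_double_struck:
            counts[char] += 1
    return counts


def missing_symbols(reference, candidate):
    """Return the symbols found in `reference` that never appear in `candidate`."""
    found_in_reference = set(math_symbol_counts(reference))
    found_in_candidate = set(math_symbol_counts(candidate))
    return sorted(found_in_reference - found_in_candidate)
```

For example, `missing_symbols(pdfminer_math_text, pypdf_math_text)` would list the symbols that PDFMiner.six extracted and PyPDF2 did not.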

# PDF Text Cleanup Attempt
In this step we try to fix the problems found with both tools, to see whether one of them can be repaired well enough to build the text dataset.
- We'll try to fix **PyPDF2** by:
  - Fixing its symbol recognition problem (some symbols come out as LaTeX-like names instead of the symbol itself)
  - Fixing its spacing problem (some variables are joined to other words, which makes them harder to recognize)

## Substitution-based Symbol correction

In [21]:
def replace_symbols(text, replacements):
    """Replace each symbol name found in `text` with the character mapped to it in `replacements`."""
    new_text = text
    for symbol, replacement in replacements.items():
        new_text = new_text.replace(symbol, replacement)
    return new_text

In [26]:
replacements = {
    "/bardbl": "║",
    "/angb∇acket∇ight": ">",
    "/angb∇acketleft": "<",
    "/parenleftBigg": "(",
    "/parenrightBigg": ")",
    "/integraldisplay": "∫",
}

replaced_text = replace_symbols(pypdf_text, replacements)
print(replaced_text)

arXiv:physics/9905061v3  [physics.plasm-ph]  8 Jun 1999DPNU-99-14
Electron acceleration to ultrarelativistic energies in a c ollisionless
oblique shock wave
Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602 , Japan
(July 14, 2011)
Abstract
Electron motion in an oblique shock wave is studied by means o f a one-
dimensional, relativistic, electromagnetic, particle si mulation code with full
ion and electron dynamics. It is found that an oblique shock c an produce
electrons with ultra-relativistic energies; Lorentz fact ors with γ>∼100 have
been observed in our simulations. The physical mechanisms f or the reﬂection
and acceleration are discussed, and the maximum energy is es timated. If
the electron reﬂection occurs near the end of a large-amplit ude pulse, those
particles will then be trapped in the pulse and gain a great de al of energy.
The theory predicts that the electron energies can become es pecially high at
certain propagation angles. Thi
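
The `replacements` dictionary only covers names that were spotted by eye. A quick scan for other slash-prefixed names left in the text can show what is still unmapped. The sketch below is one possible approach; the pattern assumes that an unmapped name looks like `/` followed by at least four letters (with `∇` allowed, since some keys above contain it), which is inferred from the dictionary keys rather than from any PyPDF2 documentation.

```python
from collections import Counter
import re

# Assumption: leftover names look like "/bardbl", i.e. a slash that does not
# follow a word character or another slash, then four or more letters.
GLYPH_NAME_RE = re.compile(r"(?<![\w/])/[A-Za-z∇]{4,}")


def find_glyph_names(text, known=()):
    """Count slash-prefixed names in `text`, skipping those listed in `known`."""
    counts = Counter(GLYPH_NAME_RE.findall(text))
    for name in known:
        counts.pop(name, None)
    return counts
```

Running `find_glyph_names(pypdf_text, known=replacements)` would list candidate names that still need an entry, most frequent first via `.most_common()`.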

## NLTK-based text spacing correction

In [29]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [31]:
def improve_spacing(text):
    """Attempts to improve spacing by re-tokenizing the text with NLTK."""
    tokens = nltk.word_tokenize(text)  # Split text into words and punctuation (tokens)
    return " ".join(tokens)  # Join tokens with single spaces

In [33]:
improved_text = improve_spacing(replaced_text)
improved_text

'arXiv : physics/9905061v3 [ physics.plasm-ph ] 8 Jun 1999DPNU-99-14 Electron acceleration to ultrarelativistic energies in a c ollisionless oblique shock wave Naoki Bessho and Yukiharu Ohsawa Department of Physics , Nagoya University , Nagoya 464-8602 , Japan ( July 14 , 2011 ) Abstract Electron motion in an oblique shock wave is studied by means o f a one- dimensional , relativistic , electromagnetic , particle si mulation code with full ion and electron dynamics . It is found that an oblique shock c an produce electrons with ultra-relativistic energies ; Lorentz fact ors with γ > ∼100 have been observed in our simulations . The physical mechanisms f or the reﬂection and acceleration are discussed , and the maximum energy is es timated . If the electron reﬂection occurs near the end of a large-amplit ude pulse , those particles will then be trapped in the pulse and gain a great de al of energy . The theory predicts that the electron energies can become es pecially high at certain pro
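
Tokenization separates tokens but cannot tell that `c` and `ollisionless` belong together, which is why fragments such as `c ollisionless` survive in the output above. A different idea, sketched here and not evaluated in this notebook, is to merge two neighbouring tokens whenever their concatenation is a known word and at least one of the two pieces is not. The function name and the `vocabulary` argument are hypothetical; a real word list (for instance NLTK's `words` corpus or a physics glossary) would have to be supplied, and the heuristic can both miss fragments and merge wrongly.

```python
import string


def _key(token):
    """Normalize a token for vocabulary lookup: drop edge punctuation, lowercase."""
    return token.strip(string.punctuation).lower()


def rejoin_split_words(text, vocabulary):
    """Merge adjacent tokens whose concatenation is a known word.

    A pair is merged only if at least one of the two pieces is not itself a
    known word, so ordinary word pairs are left alone. Whitespace (including
    line breaks) is collapsed to single spaces.
    """
    vocab = {word.lower() for word in vocabulary}
    tokens = text.split()
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            left, right = _key(tokens[i]), _key(tokens[i + 1])
            joined = _key(tokens[i] + tokens[i + 1])
            has_fragment = left not in vocab or right not in vocab
            if left and right and joined in vocab and has_fragment:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return " ".join(merged)
```

With a vocabulary containing `collisionless`, the input `"in a c ollisionless shock"` becomes `"in a collisionless shock"`.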

## **Conclusion :**
> We were able to fix some symbol issues for **PyPDF2**, but NLTK tokenization did not fix the spacing problem.

- Since **PDFMiner.six** showed fewer issues with variable detection, which is our main concern, we will use it to create the text data.
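
Creating the text data then amounts to running the chosen extractor over every PDF in the dataset. The following is a minimal sketch of such a batch step, under stated assumptions: `convert_tree` and its folder layout are hypothetical, and the default extractor uses `pdfminer.high_level.extract_text`, the library's high-level helper, which applies default layout parameters much like the longer cell above. The extractor is passed in as a parameter so that another tool can be swapped in.

```python
from pathlib import Path


def default_extract(pdf_path):
    """Extract text with pdfminer.six's high-level helper (imported lazily)."""
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_tree(pdf_root, txt_root, extract=default_extract):
    """Write one .txt file per PDF under `pdf_root`, mirroring sub-folders in `txt_root`."""
    pdf_root, txt_root = Path(pdf_root), Path(txt_root)
    converted, failed = [], []
    for pdf_path in sorted(pdf_root.rglob("*.pdf")):
        txt_path = (txt_root / pdf_path.relative_to(pdf_root)).with_suffix(".txt")
        if txt_path.exists():  # already converted, so an interrupted run can resume
            continue
        try:
            text = extract(pdf_path)
        except Exception as exc:  # a corrupt PDF should not stop the whole run
            failed.append((pdf_path, repr(exc)))
            continue
        txt_path.parent.mkdir(parents=True, exist_ok=True)
        txt_path.write_text(text, encoding="utf-8")
        converted.append(txt_path)
    return converted, failed
```

The returned `failed` list keeps the path and error of every PDF that could not be read, so those files can be inspected separately.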

