
# Dataset Collection

-	The data was collected from Google Cloud Storage (GCS), where it is freely available in buckets for bulk access.
-	The command-line tool `gsutil` was used to access arXiv's physics PDF buckets and download the files to a local machine.
-	The download totalled about 7.19 GB: 22.3K PDFs, spanning different versions of the papers.
-	The dataset was then uploaded to Google Drive so that it can be accessed easily from Google Colab.
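
As a rough illustration of the download step, the sketch below assembles a `gsutil` bulk-copy command from Python. The helper name `build_gsutil_copy`, the bucket path, and the local destination are assumptions made for illustration; the notebook does not record the exact command that was used.

```python
import shlex


def build_gsutil_copy(source_uri, dest_dir):
    """Build a gsutil command that recursively copies a bucket path.

    -m runs the copy in parallel; cp -r copies the folder recursively.
    """
    return ["gsutil", "-m", "cp", "-r", source_uri, dest_dir]


# Assumed bucket layout and destination; adjust to the real locations.
cmd = build_gsutil_copy("gs://arxiv-dataset/arxiv/physics/pdf/", "./pdf")
print(shlex.join(cmd))

# To actually run the download (requires gsutil to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.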

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# PDF To Text
Since the files we have are **native PDFs** (meaning the text is already digitally encoded), there is **no need to apply any OCR** (optical character recognition) techniques.

- This means we can use a text-extraction library: **PyMuPDF**, **PyPDF2** or **PDFMiner.six**.
- We will test all three on one PDF file and compare the results.
- The evaluation is done manually (human evaluation).
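
The "native PDF" assumption can be sanity-checked before committing to plain text extraction. The following is a minimal sketch, not part of the original notebook: `has_text_layer` and its two thresholds are hypothetical, and it only looks at how much text each page yields.

```python
def has_text_layer(page_texts, min_chars_per_page=50, min_fraction=0.8):
    """Heuristic: True if most pages yield a meaningful amount of text.

    page_texts is a list with one extracted string per page. Both
    thresholds are illustrative assumptions, not tuned values.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_fraction
```

With PyMuPDF, for example, the per-page strings could come from `[page.get_text() for page in pymupdf.open(pdf_file)]`. A scanned PDF would typically return empty or near-empty strings here, which is the case where OCR would be needed.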

In [6]:
pdf_file = "/content/drive/MyDrive/UH - Final Year Project/Dataset/pdf/9905/9905061v3.pdf"

## Testing **PyMuPDF**

In [7]:
pip install PyMuPDF



In [8]:
import pymupdf

doc = pymupdf.open(pdf_file)
pymupdf_text = "\n".join([page.get_text() for page in doc])

In [9]:
print(pymupdf_text)

arXiv:physics/9905061v3  [physics.plasm-ph]  8 Jun 1999
DPNU-99-14
Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave
Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)
Abstract
Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics. It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >
∼100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is ver

## Testing **PyPDF2**

In [10]:
pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [11]:
from PyPDF2 import PdfReader

reader = PdfReader(pdf_file)

pypdf_text = "\n".join([page.extract_text() for page in reader.pages])

In [12]:
print(pypdf_text)

arXiv:physics/9905061v3  [physics.plasm-ph]  8 Jun 1999DPNU-99-14
Electron acceleration to ultrarelativistic energies in a c ollisionless
oblique shock wave
Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602 , Japan
(July 14, 2011)
Abstract
Electron motion in an oblique shock wave is studied by means o f a one-
dimensional, relativistic, electromagnetic, particle si mulation code with full
ion and electron dynamics. It is found that an oblique shock c an produce
electrons with ultra-relativistic energies; Lorentz fact ors with γ>∼100 have
been observed in our simulations. The physical mechanisms f or the reﬂection
and acceleration are discussed, and the maximum energy is es timated. If
the electron reﬂection occurs near the end of a large-amplit ude pulse, those
particles will then be trapped in the pulse and gain a great de al of energy.
The theory predicts that the electron energies can become es pecially high at
certain propagation angles. Thi

In [13]:
pypdf_text == pymupdf_text

False
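
The equality check only reports that the two outputs differ, not where. As a rough aid to the manual comparison, the sketch below uses the standard-library `difflib` to return a line-level similarity ratio and the first few differing line groups. The helper name `compare_extractions` and the `max_diffs` parameter are assumptions made for illustration.

```python
import difflib


def compare_extractions(text_a, text_b, max_diffs=10):
    """Compare two extracted texts line by line.

    Returns (ratio, diffs): ratio is difflib's similarity in [0, 1] and
    diffs is a list of (lines_from_a, lines_from_b) pairs that differ.
    """
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((lines_a[a0:a1], lines_b[b0:b1]))
    return matcher.ratio(), diffs[:max_diffs]
```

Calling `compare_extractions(pymupdf_text, pypdf_text)` would then list the lines where the two libraries disagree, such as words that one of them splits with a stray space.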

## Testing **PDFMiner.six**

In [14]:
pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20231228


In [17]:
from io import StringIO
import re
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Collect the extracted text in an in-memory buffer.
output_string = StringIO()
with open(pdf_file, 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # LAParams() enables layout analysis with pdfminer's default settings.
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Render every page through the text converter.
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

text = output_string.getvalue()

In [19]:
print(text)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.

In [20]:
# Drop everything from the first "REFERENCES" onward (DOTALL lets "." match newlines).
print(re.sub(r"REFERENCES.*", " ", text, flags=re.DOTALL))

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.
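
The substitution above cuts the text at the first occurrence of `REFERENCES`, wherever it appears. A slightly more defensive variant is sketched below: it only treats the word as a section heading when it stands alone on a line, and it cuts at the last such heading. The function name `strip_references` and the list of accepted heading spellings are assumptions, not something the notebook defines.

```python
import re

# Heading must be the only thing on its line (assumed spellings).
_REFERENCES_HEADING = re.compile(
    r"^[ \t]*(REFERENCES|References|Bibliography)[ \t]*$", re.MULTILINE
)


def strip_references(text):
    """Return text without its trailing reference list, if one is found."""
    matches = list(_REFERENCES_HEADING.finditer(text))
    if not matches:
        return text  # no heading found: leave the text untouched
    # Cut at the last heading, assuming the bibliography ends the paper.
    return text[: matches[-1].start()].rstrip()
```

A sentence in the body that merely mentions references is left alone, because the pattern requires the heading to occupy a whole line.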

# Conclusion

> Overall, after inspecting the text extracted by each Python tool, **PDFMiner.six** gave the best results, especially at detecting variables (some equations were not detected, but that is not our concern).
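
The manual comparison could be complemented by a crude numeric proxy for one of the defects visible in the PyPDF2 output, namely words split by stray spaces (for example "c ollisionless"). The sketch below counts isolated single letters as a share of all alphabetic tokens. `stray_letter_rate` and the set of letters treated as legitimate are assumptions, and the measure is only indicative: physics text legitimately contains one-letter variable names, and splits such as "si mulation" are not caught.

```python
import re

# Single letters that are ordinary English words (assumed list).
LEGIT_SINGLE_LETTERS = {"a", "A", "I"}


def stray_letter_rate(text):
    """Fraction of alphabetic tokens that are isolated single letters."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    if not tokens:
        return 0.0
    stray = sum(
        1 for token in tokens
        if len(token) == 1 and token not in LEGIT_SINGLE_LETTERS
    )
    return stray / len(tokens)
```

Comparing `stray_letter_rate(pypdf_text)` with the value for the other two outputs gives a quick, rough indication of how often an extractor breaks words apart.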