
# Preprocessing Workflow :

- Removing References & Reference Numbers
- Removing Page Numbers
- Removing Stop words
- Removing Punctuation
- Removing URLs
- Lowercasing
- Tokenisation
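
The last steps of the list (lowercasing, tokenisation, stop-word removal) can be sketched in a few lines. This is only a minimal illustration: the `STOP_WORDS` set and the `tokenize` and `remove_stop_words` helpers are hypothetical names introduced here, and a real pipeline would likely rely on a fuller stop-word list from an NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the project's actual list).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, keeping hyphenated words whole."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

sample = "Electron motion in an oblique shock wave is studied"
tokens = remove_stop_words(tokenize(sample))
print(tokens)
# -> ['electron', 'motion', 'oblique', 'shock', 'wave', 'studied']
```

Lowercasing before the stop-word lookup means the set only needs lowercase entries.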


## Testing RegEx Preprocessing Techniques :

In [35]:
import re

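# helper for reading a text file; the context manager closes the file handle
# (encoding assumed to be UTF-8 for the extracted paper text)
def read(src):
  with open(src, "r", encoding="utf-8") as f:
    return f.read()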

# test-reading a paper
text = read("/content/drive/MyDrive/UH - Final Year Project/Dataset/text/0103020v2.txt")
print(text[:100])

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-1


### Removing in-text references :

In [36]:
x = re.sub(r"\[\d{1,2}\]|\(\d{1,2}\)","", text)
print(x)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.
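
Since the excerpt printed above contains no bracketed citations, a small made-up sentence may show the effect more clearly. The sample string below is invented for illustration; only the pattern comes from the cell above.

```python
import re

# Same pattern as in the cell above: 1-2 digits in square or round brackets.
IN_TEXT_REF = r"\[\d{1,2}\]|\(\d{1,2}\)"

# Invented sample sentence (not taken from the paper).
sample = "Shock acceleration has been studied [12], see Eq. (3) and Ref. [123]."
cleaned = re.sub(IN_TEXT_REF, "", sample)
print(cleaned)

# Two things to keep in mind with this pattern:
# - three-digit markers such as [123] and grouped ones such as [1,2] are left untouched
# - numbered equations such as (3) are removed as well
```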

### Removing Page Numbers :

In [37]:
x = re.sub(r"(?m)^[a-zA-Z]\n|^[0-9]{1,2}\n","", text)
print(x)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.
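
A short synthetic example may make the behaviour of this pattern easier to see. The sample text is invented; the pattern is the one used in the cell above.

```python
import re

# Same pattern as above: a line holding a single letter, or a 1-2 digit number.
PAGE_LINE = r"(?m)^[a-zA-Z]\n|^[0-9]{1,2}\n"

# Invented sample: "12" plays the page number, "n" a stray single-letter line.
sample = "end of page one\n12\nstart of page two\nn\n2011\nlast line\n"
cleaned = re.sub(PAGE_LINE, "", sample)
print(cleaned)

# "12" and "n" are dropped; "2011" stays because it has more than two digits.
```

Because the pattern requires a trailing `\n`, a page number on the very last line of a file without a final newline would not be removed.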

### Removing punctuation :
Removes the punctuation marks `. , : ? ; " '` while keeping mathematical operators, parentheses and brackets.

In [53]:
x =  re.sub(r'[.,:?;\"\']+','',text)
print(x)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics Nagoya University Nagoya 464-8602 Japan
(July 14 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional relativistic electromagnetic particle simulation code with full
ion and electron dynamics
It is found that an oblique shock can produce
electrons with ultra-relativistic energies Lorentz factors with γ >∼ 100 have
been observed in our simulations The physical mechanisms for the reﬂection
and acceleration are discussed and the maximum energy is estimated
If
the electron reﬂection occurs near the end of a large-amplitude pulse those
particles will then be trapped in the pulse and gain a great deal of energy
The theory predicts that the electron energies can become especially high at
certain propagation angles This is veriﬁed by the simulations

5265Cc 5235Tc 5235Mw 9870Sa

9
9
9
1

n
u
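
One side effect visible in the output above is that decimal points disappear too (`52.65.Cc` becomes `5265Cc`). The sketch below reproduces this on an invented sample and shows one possible variant, an assumption rather than the notebook's method, that keeps a dot when it sits between two digits.

```python
import re

# Pattern from the cell above.
PUNCT = r'[.,:?;"\']+'
# Hypothetical variant: a dot is removed only when it is not between two digits.
PUNCT_KEEP_DECIMALS = r'[,:?;"\']+|\.(?!\d)|(?<!\d)\.'

# Invented sample sentence.
sample = 'Energies, e.g. 52.65 keV; (a+b)/2 = c?'

plain = re.sub(PUNCT, "", sample)
keep_decimals = re.sub(PUNCT_KEEP_DECIMALS, "", sample)

print(plain)          # -> Energies eg 5265 keV (a+b)/2 = c
print(keep_decimals)  # -> Energies eg 52.65 keV (a+b)/2 = c
```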


### Removing URLs :

In [39]:
x = re.sub(r'http\S+|www\.\S+','',text)
print(x)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.
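
One detail worth checking in URL patterns is whether the dot after `www` is escaped: unescaped, it matches any character, so ordinary words containing `www` can be clipped. The sample sentence and the `example.org` addresses below are invented for illustration.

```python
import re

URL_ESCAPED = r"http\S+|www\.\S+"   # literal dot after "www"
URL_UNESCAPED = r"http\S+|www.\S+"  # "." matches any character

# Invented sample with two URLs and a word that merely contains "www".
sample = "Code at https://example.org/sim and www.example.org/data, see awwwesome notes."

escaped = re.sub(URL_ESCAPED, "", sample)
unescaped = re.sub(URL_UNESCAPED, "", sample)

print(escaped)    # -> Code at  and  see awwwesome notes.
print(unescaped)  # -> Code at  and  see a notes.
```

Note that `\S+` also swallows punctuation attached to the URL, such as the trailing comma in the sample.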

### Removing References :

In [40]:
pattern = re.compile(r'(?i)(References|Bibliography|Works Cited)(.*)',re.DOTALL)
x = re.split(pattern, text)[0]
print(x)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave

DPNU-99-14

Naoki Bessho and Yukiharu Ohsawa
Department of Physics, Nagoya University, Nagoya 464-8602, Japan
(July 14, 2011)

Abstract

Electron motion in an oblique shock wave is studied by means of a one-
dimensional, relativistic, electromagnetic, particle simulation code with full
ion and electron dynamics.
It is found that an oblique shock can produce
electrons with ultra-relativistic energies; Lorentz factors with γ >∼ 100 have
been observed in our simulations. The physical mechanisms for the reﬂection
and acceleration are discussed, and the maximum energy is estimated.
If
the electron reﬂection occurs near the end of a large-amplitude pulse, those
particles will then be trapped in the pulse and gain a great deal of energy.
The theory predicts that the electron energies can become especially high at
certain propagation angles. This is veriﬁed by the simulations.

52.65.Cc, 52.35.Tc, 52.35.
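
The pattern above cuts the text at the first occurrence of one of the keywords, wherever it appears, so a sentence that merely mentions "references" would also truncate the paper. The sketch below contrasts it with a stricter variant that only accepts the keyword when it stands alone on a line. `HEADING` and `cut_at_heading` are hypothetical names, the sample text is invented, and whether section headings really sit on their own line in the extracted files is an assumption.

```python
import re

# Pattern from the cell above: matches the keyword anywhere in the text.
ANYWHERE = re.compile(r"(?i)(References|Bibliography|Works Cited)(.*)", re.DOTALL)
# Hypothetical stricter variant: the keyword must be alone on its line.
HEADING = re.compile(r"(?im)^[ \t]*(References|Bibliography|Works Cited)[ \t]*$")

# Invented sample text.
sample = (
    "Earlier work (see references therein) is extended.\n"
    "Results are shown.\n"
    "References\n"
    "[1] A. Author, Journal (1999).\n"
)

def cut_at_heading(text):
    """Return the text before the first stand-alone heading, or all of it if none is found."""
    match = HEADING.search(text)
    return text[: match.start()] if match else text

loose = ANYWHERE.split(sample)[0]
strict = cut_at_heading(sample)

print(repr(loose))   # -> 'Earlier work (see '
print(repr(strict))
```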

### Creating `regex_preprocess()` function :

In [62]:
def regex_preprocess (text) :
  """
    Preprocesses the input text by applying various regular expression-based transformations.

    Steps involved in preprocessing:
    1. Removes page numbers and single-lettered lines.
    2. Removes in-text references in the form of one- or two-digit numbers enclosed in square or round brackets.
    3. Removes everything after and including references, bibliography, or works cited sections.
    4. Removes URLs.
    5. Removes the punctuation marks . , : ? ; " ' while keeping mathematical operation symbols (+, -, *, /) and parentheses/brackets.
    6. Collapses consecutive newlines into a single newline.

    Parameters:
    text (str): The input text to be preprocessed.

    Returns:
    str: The preprocessed text.
    """
  # getting rid of page numbers + one-lettered lines
  pattern = r"(?m)^[a-zA-Z]\n|^[0-9]{1,2}\n"
  x = re.sub(pattern,"", text)

  # getting rid of in-text references
  pattern = r"\[\d{1,2}\]|\(\d{1,2}\)"
  x = re.sub(pattern,"", x)

  # getting rid of references and everything afterwards
  pattern = re.compile(r'(?i)(References|Bibliography|Works Cited)(.*)',re.DOTALL)
  x = re.split(pattern, x)[0]

  # getting rid of URLs (done before punctuation removal, which would strip the dots URLs rely on)
  pattern = r'http\S+|www\.\S+'
  x = re.sub(pattern,'',x)

  # getting rid of punctuation (except mathematical operations)
  pattern = r'[.,:?;\"\']+'
  x =  re.sub(pattern,'',x)

  # collapsing consecutive newlines (blank lines) into one
  x = re.sub(r"\n+",'\n',x)

  return x

In [63]:
# testing it on a paper
clean_text = regex_preprocess(text)
print(clean_text)

Electron acceleration to ultrarelativistic energies in a collisionless
oblique shock wave
DPNU-99-14
Naoki Bessho and Yukiharu Ohsawa
Department of Physics Nagoya University Nagoya 464-8602 Japan
(July 14 2011)
Abstract
Electron motion in an oblique shock wave is studied by means of a one-
dimensional relativistic electromagnetic particle simulation code with full
ion and electron dynamics
It is found that an oblique shock can produce
electrons with ultra-relativistic energies Lorentz factors with γ >∼ 100 have
been observed in our simulations The physical mechanisms for the reﬂection
and acceleration are discussed and the maximum energy is estimated
If
the electron reﬂection occurs near the end of a large-amplitude pulse those
particles will then be trapped in the pulse and gain a great deal of energy
The theory predicts that the electron energies can become especially high at
certain propagation angles This is veriﬁed by the simulations
5265Cc 5235Tc 5235Mw 9870Sa
]
-
[
/
Typeset usi
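
The order of the steps inside such a function matters, in particular for URLs: if punctuation is stripped first, the dots a URL pattern relies on are already gone. A minimal sketch on an invented sample, assuming the dot after `www` is escaped in the URL pattern:

```python
import re

PUNCT = r'[.,:?;"\']+'
URL = r"http\S+|www\.\S+"

# Invented sample sentence with a placeholder address.
sample = "Data at www.example.org/data for details"

# Punctuation first: the URL loses its dots and no longer matches the URL pattern.
punct_first = re.sub(URL, "", re.sub(PUNCT, "", sample))
# URLs first: the URL is removed while it is still intact.
url_first = re.sub(PUNCT, "", re.sub(URL, "", sample))

print(punct_first)  # -> Data at wwwexampleorg/data for details
print(url_first)    # -> Data at  for details
```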