# Preprocessing Workflow :

- Removing References & Reference Numbers
- Removing Page Numbers
- Removing Stop Words
- Removing Punctuation
- Removing URLs
- Lowercasing
- Tokenisation
- Lemmatisation


In [1]:
import re

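# helper for reading a text file (assumes the extracted text is UTF-8 encoded)
def read(src):
    # the context manager closes the file handle once the content is read
    with open(src, "r", encoding="utf-8") as f:
        return f.read()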

# test-reading a paper
text = read("/Users/tayssirboukrouba/Downloads/dataset/text/0001001v1.txt")
print(text)

0
0
0
2

n
a
J

1

]
n
y
d
-
u
l
f
.
s
c
i
s
y
h
p
[

1
v
1
0
0
1
0
0
0
/
s
c
i
s
y
h
p
:
v
i
X
r
a

Under consideration for publication in J. Fluid Mech.

1

Capillary-gravity wave transport over
spatially random drift

By G U I L L A U M E B A L∗ and T O M C H O U†
∗ Department of Mathematics, University of Chicago, Chicago, IL 60637
†Department of Mathematics, Stanford University, Stanford, CA 94305

(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W(x, k, t)
to represent the envelope of the wave amplitude at position x contained in waves with
wavevector k, we describe surface wave transport over static ﬂows consisting of two length
scales; one varying smoothly on the wavelength scale, the other varying on a scale com-
parable to the wavelength. The spatially rapidly varying but weak surface ﬂows augment
the characteristic equations with scat

# Testing RegEx Preprocessing Techniques :

### Removing in-text references :

In [2]:
x = re.sub(r"\[\d{1,2}\]|\(\d{1,2}\)","", text)
print(x)

0
0
0
2

n
a
J

1

]
n
y
d
-
u
l
f
.
s
c
i
s
y
h
p
[

1
v
1
0
0
1
0
0
0
/
s
c
i
s
y
h
p
:
v
i
X
r
a

Under consideration for publication in J. Fluid Mech.

1

Capillary-gravity wave transport over
spatially random drift

By G U I L L A U M E B A L∗ and T O M C H O U†
∗ Department of Mathematics, University of Chicago, Chicago, IL 60637
†Department of Mathematics, Stanford University, Stanford, CA 94305

(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W(x, k, t)
to represent the envelope of the wave amplitude at position x contained in waves with
wavevector k, we describe surface wave transport over static ﬂows consisting of two length
scales; one varying smoothly on the wavelength scale, the other varying on a scale com-
parable to the wavelength. The spatially rapidly varying but weak surface ﬂows augment
the characteristic equations with scat
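The excerpt printed above contains no bracketed citations, so the effect of the substitution is not visible there. The following is a minimal sketch on a made-up sample string that shows what the pattern removes and what it leaves behind. The `extended` pattern is a hypothetical variant, not part of the notebook, that also accepts citation lists such as `[2, 3]`.

```python
import re

# Pattern from the cell above: one- or two-digit numbers in [] or ().
basic = re.compile(r"\[\d{1,2}\]|\(\d{1,2}\)")

# Hypothetical extension (an assumption, not the notebook's pattern):
# also accept comma-separated lists and ranges in square brackets.
extended = re.compile(r"\[\d{1,3}(?:\s*[,\-–]\s*\d{1,3})*\]|\(\d{1,2}\)")

# Made-up sample sentence for illustration only.
sample = "Ray theory [1] and CWA [2, 3] agree, see Eq. (12) and W(x, k, t)."

basic_result = basic.sub("", sample)
extended_result = extended.sub("", sample)

print(basic_result)     # "[2, 3]" is still present
print(extended_result)  # all bracketed citations are gone
```

Note that both patterns also remove parenthesised equation numbers such as `(12)`, which may or may not be desirable, while function arguments such as `W(x, k, t)` are left alone.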

### Removing Page Numbers :

In [3]:
x = re.sub(r"\s\d{1,}\n|^[a-zA-Z]\n|^[0-9]{1,2}\n","", text,flags=re.MULTILINE)
print(x)



]
-
.
[
/
:

Under consideration for publication in J. Fluid Mech.

Capillary-gravity wave transport over
spatially random drift

By G U I L L A U M E B A L∗ and T O M C H O U†
∗ Department of Mathematics, University of Chicago, Chicago, IL†Department of Mathematics, Stanford University, Stanford, CA
(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W(x, k, t)
to represent the envelope of the wave amplitude at position x contained in waves with
wavevector k, we describe surface wave transport over static ﬂows consisting of two length
scales; one varying smoothly on the wavelength scale, the other varying on a scale com-
parable to the wavelength. The spatially rapidly varying but weak surface ﬂows augment
the characteristic equations with scattering terms that are explicit functions of the cor-
relations of the random surface currents. These sc
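The output above shows a side effect of the first alternative, `\s\d{1,}\n`: it matches any number at the end of a line, so the ZIP codes `60637` and `94305` disappear together with their line breaks and the two affiliation lines are merged. The sketch below reproduces this on a small made-up sample and compares it with a stricter pattern. The `strict` pattern is a hypothetical alternative that only drops lines consisting of a short number or a single character.

```python
import re

# Pattern from the cell above.
original = re.compile(r"\s\d{1,}\n|^[a-zA-Z]\n|^[0-9]{1,2}\n", flags=re.MULTILINE)

# Hypothetical stricter variant (an assumption): remove a line only when the
# whole line is a 1-3 digit number or a single non-space character.
strict = re.compile(r"^[ \t]*(?:\d{1,3}|\S)[ \t]*\n", flags=re.MULTILINE)

# Made-up sample imitating the start of the extracted paper.
sample = (
    "n\na\nJ\n\n1\n\n"
    "University of Chicago, Chicago, IL 60637\n"
    "Stanford, CA 94305\n\n"
    "(Received 2 February 2008)\n"
)

original_result = original.sub("", sample)
strict_result = strict.sub("", sample)

print(repr(original_result))  # ZIP codes removed, affiliation lines merged
print(repr(strict_result))    # ZIP codes and line breaks preserved
```

The stricter variant also removes single-character punctuation lines such as `]` or `-`, which the original pattern leaves in place. Whether that is appropriate depends on the rest of the dataset.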

### Removing URLs :

In [4]:
x = re.sub(r'http\S+|www\.\S+','',text)
print(x)

0
0
0
2

n
a
J

1

]
n
y
d
-
u
l
f
.
s
c
i
s
y
h
p
[

1
v
1
0
0
1
0
0
0
/
s
c
i
s
y
h
p
:
v
i
X
r
a

Under consideration for publication in J. Fluid Mech.

1

Capillary-gravity wave transport over
spatially random drift

By G U I L L A U M E B A L∗ and T O M C H O U†
∗ Department of Mathematics, University of Chicago, Chicago, IL 60637
†Department of Mathematics, Stanford University, Stanford, CA 94305

(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W(x, k, t)
to represent the envelope of the wave amplitude at position x contained in waves with
wavevector k, we describe surface wave transport over static ﬂows consisting of two length
scales; one varying smoothly on the wavelength scale, the other varying on a scale com-
parable to the wavelength. The spatially rapidly varying but weak surface ﬂows augment
the characteristic equations with scat
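The printed excerpt contains no URLs, so the substitution leaves it unchanged. As a small illustration on a made-up string, the sketch below shows why the dot after `www` should be escaped: an unescaped dot matches any character, so ordinary words that start with `www` are removed as well. The `tight` pattern is an assumption and additionally requires `://` after `http` or `https`.

```python
import re

# Unescaped dot: "www" followed by ANY character counts as a URL.
loose = re.compile(r"http\S+|www.\S+")

# Hypothetical tighter variant (an assumption): literal "www." and an
# explicit scheme separator after http/https.
tight = re.compile(r"https?://\S+|www\.\S+")

# Made-up sample; the URLs are placeholders.
sample = "Data at http://example.com/paper and www.example.org; see the wwwroot folder."

loose_result = loose.sub("", sample)
tight_result = tight.sub("", sample)

print(loose_result)  # "wwwroot" is removed by mistake
print(tight_result)  # "wwwroot" is kept, both URLs are removed
```

Both patterns consume trailing punctuation that is attached to the URL (here the semicolon), because `\S+` runs until the next whitespace character.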

### Removing References :

In [5]:
pattern = re.compile(r'(?i)(References|Bibliography|Works Cited)(.*)',re.DOTALL)
x = re.split(pattern, text)[0]
print(x)

0
0
0
2

n
a
J

1

]
n
y
d
-
u
l
f
.
s
c
i
s
y
h
p
[

1
v
1
0
0
1
0
0
0
/
s
c
i
s
y
h
p
:
v
i
X
r
a

Under consideration for publication in J. Fluid Mech.

1

Capillary-gravity wave transport over
spatially random drift

By G U I L L A U M E B A L∗ and T O M C H O U†
∗ Department of Mathematics, University of Chicago, Chicago, IL 60637
†Department of Mathematics, Stanford University, Stanford, CA 94305

(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W(x, k, t)
to represent the envelope of the wave amplitude at position x contained in waves with
wavevector k, we describe surface wave transport over static ﬂows consisting of two length
scales; one varying smoothly on the wavelength scale, the other varying on a scale com-
parable to the wavelength. The spatially rapidly varying but weak surface ﬂows augment
the characteristic equations with scat
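One caveat with this pattern is that it splits at the first occurrence of the keyword anywhere in the text, including inside a sentence. The sketch below, on a made-up sample, contrasts it with a hypothetical variant that only matches when the keyword stands alone on its own line (optionally preceded by a section number). The `heading` pattern and the sample are assumptions for illustration.

```python
import re

# Pattern from the cell above: matches the keyword anywhere, case-insensitively.
anywhere = re.compile(r"(?i)(References|Bibliography|Works Cited)(.*)", re.DOTALL)

# Hypothetical variant (an assumption): the keyword must fill a whole line.
heading = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?(?:References|Bibliography|Works Cited)[ \t]*$.*",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)

# Made-up sample document.
sample = (
    "1. Introduction\n"
    "Earlier references to ray theory exist.\n"
    "2. Results\n"
    "The drift scatters waves.\n"
    "References\n"
    "Entry one\n"
)

body_anywhere = anywhere.split(sample)[0]
body_heading = heading.split(sample, maxsplit=1)[0]

print(repr(body_anywhere))  # cut off in the middle of the second line
print(repr(body_heading))   # everything up to the "References" heading
```

The line-anchored variant would miss a heading that shares its line with other text, so it trades recall for fewer accidental truncations.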

### Creating `regex_preprocess()` function :

In [6]:
def regex_preprocess (text) :
  """
    Preprocesses the input text by applying various regular expression-based transformations.

    Steps involved in preprocessing:
    1. Removes page numberings and single-lettered lines.
    2. Removes in-text references in the form of numbers enclosed in square or round brackets.
    3. Removes everything after and including references, bibliography, or works cited sections.
    4. Removes URLs.

    Parameters:
    text (str): The input text to be preprocessed.

    Returns:
    str: The preprocessed text.
    """
  # getting rid of page numberings + one-lettered objects
  a = re.sub(r"\s\d{1,}\n|^[a-zA-Z]\n|^[0-9]{1,2}\n","", text,flags=re.MULTILINE)

  # getting rid of in-text references, e.g. [12] or (12)
  pattern = r"\[\d{1,2}\]|\(\d{1,2}\)"
  b = re.sub(pattern,"", a)

  # getting rid of references and everything afterwards
  pattern = re.compile(r'(References|Bibliography|Works Cited)\n(.*)', re.IGNORECASE | re.DOTALL)
  c = re.split(pattern, b)[0]


  # getting rid of URLs
  pattern = r'http\S+|www\.\S+'
  d = re.sub(pattern,'',c)

  # getting rid of double space :
  #x = re.sub(r"[\n\n]+",'\n',x)

  return d

In [7]:
# testing it on a paper
clean_text = regex_preprocess(text)
print(clean_text)



]
-
.
[
/
:

Under consideration for publication in J. Fluid Mech.

Capillary-gravity wave transport over
spatially random drift

By G U I L L A U M E B A L∗ and T O M C H O U†
∗ Department of Mathematics, University of Chicago, Chicago, IL†Department of Mathematics, Stanford University, Stanford, CA
(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W(x, k, t)
to represent the envelope of the wave amplitude at position x contained in waves with
wavevector k, we describe surface wave transport over static ﬂows consisting of two length
scales; one varying smoothly on the wavelength scale, the other varying on a scale com-
parable to the wavelength. The spatially rapidly varying but weak surface ﬂows augment
the characteristic equations with scattering terms that are explicit functions of the cor-
relations of the random surface currents. These sc

# Stop-words & Tokenisation Preprocessing :

In [8]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [9]:
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize , sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tayssirboukrouba/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tayssirboukrouba/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Testing on sample text

In [10]:
stop_words = set(stopwords.words('english'))
punctuation_table = str.maketrans('', '', string.punctuation)
sentences = sent_tokenize(text)
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# checking the first tokenized sentence
tokenized_sentences[0]

['0',
 '0',
 '0',
 '2',
 'n',
 'a',
 'j',
 '1',
 ']',
 'n',
 'y',
 'd',
 '-',
 'u',
 'l',
 'f',
 '.']

In [11]:
joint_sentences = []

for tokens in tokenized_sentences:
    joint_tokens = ' '.join([token.translate(punctuation_table) for token in tokens if token.lower() not in stop_words])
    joint_sentences.append(joint_tokens)

print("showing random original sentence : \n " , ' '.join(tokenized_sentences[42]))
print("showing random filtered sentence : \n " , joint_sentences[42])

showing random original sentence : 
  an additional variation in height due to the velocity v ( x , z ) associated with surface waves is denoted η ( x , t ) .
showing random filtered sentence : 
  additional variation height due velocity v  x  z  associated surface waves denoted η  x   
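The filtered sentence above contains runs of spaces (`v  x  z`) because punctuation tokens are translated to empty strings and then still joined. A minimal sketch of one way to avoid this is to strip punctuation first and then drop both stop words and empty tokens. To keep the sketch runnable without NLTK, it uses a tiny hand-picked stop-word set and whitespace splitting; both are stand-ins, and `clean_tokens` is a hypothetical helper name.

```python
import string

# Stand-ins so the sketch runs without NLTK (assumptions): a small stop-word
# set instead of stopwords.words('english'), str.split instead of word_tokenize.
stop_words = {"an", "in", "to", "the", "with", "is", "by"}
punctuation_table = str.maketrans("", "", string.punctuation)

def clean_tokens(tokens):
    """Strip punctuation from each token, then drop stop words and empty strings."""
    stripped = (token.lower().translate(punctuation_table) for token in tokens)
    return [token for token in stripped if token and token not in stop_words]

tokens = "an additional variation in height due to the velocity v ( x , z )".split()
filtered_sentence = " ".join(clean_tokens(tokens))
print(filtered_sentence)
```

Unlike the cell above, this sketch checks for stop words after punctuation has been stripped, so a token such as `the,` would also be filtered out.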


In [12]:
processed_text = '\n'.join(joint_sentences)
print(processed_text[1000:2000])

doppler interaction presence slowly varying drift modiﬁes scattering processes provides mechanism coupling long wavelengths short wavelengths 
conservation wave action  cwa   typically derived slowly varying drift  extended systems rapidly varying ﬂow 
yet larger propagation distances  derive transport equations  equation wave energy diﬀusion 
associated diﬀusion constant also expressed terms surface ﬂow correlations 
results provide formal set equations analyse transport surface wave action  intensity  energy  wave scattering function slowly varying drifts correlation functions random  highly oscillatory surface ﬂows 
1 
introduction water wave dynamics altered interactions spatially varying surface ﬂows 
surface ﬂows modify free surface boundary conditions determine dis persion propagating water waves 
eﬀect smoothly varying  compared wavelength  currents analysed using ray theory  peregrine  1976   jonsson  1990   principle conservation wave action  cwa   cf 
longuethiggins  stewart

## Creating `remove_stop_words()` function :

In [13]:
def remove_stop_words(text) :
  """
    Remove stop words from the input text.

    Parameters:
    text (str): Input text containing sentences to be processed.

    Returns:
    str: Processed text where stop words have been removed from each sentence.
         Sentences are separated by newline characters

    Steps:
    1. Tokenizes the input text into sentences using NLTK's sent_tokenize.
    2. Tokenizes each sentence into words using NLTK's word_tokenize and converts them to lowercase.
    3. Removes English stop words using NLTK's stopwords.words('english').
    4. Joins the remaining tokens back into sentences, preserving sentence boundaries.
    5. Returns the processed text where each sentence is on a new line.
    """

  # defining english stop words
  stop_words = set(stopwords.words('english'))
  # getting sentence tokens
  sentences = sent_tokenize(text)
  # getting word tokens from sentence tokens
  tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
  # defining reconstructed text list
  joint_sentences = []

  # looping over word tokens in each sentence
  for tokens in tokenized_sentences:
      # reconstructing sentence from cleaned tokens
      joint_tokens = ' '.join([token for token in tokens if token.lower() not in stop_words])
      # appending sentence to the full text list
      joint_sentences.append(joint_tokens)

  # joining sentences (getting full text)
  processed_text = '\n'.join(joint_sentences)

  # returning cleaned text
  return processed_text

In [14]:
help(remove_stop_words)

Help on function remove_stop_words in module __main__:

remove_stop_words(text)
    Remove stop words from the input text.
    
    Parameters:
    text (str): Input text containing sentences to be processed.
    
    Returns:
    str: Processed text where stop words have been removed from each sentence.
         Sentences are separated by newline characters
    
    Steps:
    1. Tokenizes the input text into sentences using NLTK's sent_tokenize.
    2. Tokenizes each sentence into words using NLTK's word_tokenize and converts them to lowercase.
    3. Removes English stop words using NLTK's stopwords.words('english').
    4. Joins the remaining tokens back into sentences, preserving sentence boundaries.
    5. Returns the processed text where each sentence is on a new line.



In [15]:
text = read("/Users/tayssirboukrouba/Downloads/dataset/text/0001001v1.txt")
processed_text = remove_stop_words(text)
print(processed_text)

0 0 0 2 n j 1 ] n - u l f .
c h p [ 1 v 1 0 0 1 0 0 0 / c h p : v x r consideration publication j. fluid mech .
1 capillary-gravity wave transport spatially random drift g u l l u e b l∗ c h u† ∗ department mathematics , university chicago , chicago , il 60637 †department mathematics , stanford university , stanford , ca 94305 ( received 2 february 2008 ) derive transport equations propagation water wave action pres- ence static , spatially random surface drift .
using wigner distribution w ( x , k , ) represent envelope wave amplitude position x contained waves wavevector k , describe surface wave transport static ﬂows consisting two length scales ; one varying smoothly wavelength scale , varying scale com- parable wavelength .
spatially rapidly varying weak surface ﬂows augment characteristic equations scattering terms explicit functions cor- relations random surface currents .
scattering terms depend parametrically magnitudes directions smoothly varying drift shown give rise doppler

## Removing Custom Punctuation :

In [16]:
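# getting rid of punctuation
# NOTE: inside the character class, `,-:` is a range (',' through ':'), so this
# also strips digits, '-', '.' and '/'; escape the hyphen if those must be kept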
remove_punct = lambda text : re.sub(r'[.,-:?;\"\']+',"",text)
text = remove_punct(text)
print(text[1000:2000])

relations of the random surface currents These scattering terms depend parametrically
on the magnitudes and directions of the smoothly varying drift and are shown to give
rise to a Doppler coupled scattering mechanism The Doppler interaction in the presence
of slowly varying drift modiﬁes the scattering processes and provides a mechanism for
coupling long wavelengths with short wavelengths Conservation of wave action (CWA)
typically derived for slowly varying drift is extended to systems with rapidly varying
ﬂow At yet larger propagation distances we derive from the transport equations an
equation for wave energy diﬀusion The associated diﬀusion constant is also expressed
in terms of the surface ﬂow correlations Our results provide a formal set of equations
to analyse transport of surface wave action intensity energy and wave scattering as a
function of the slowly varying drifts and the correlation functions of the random highly
oscillatory surface ﬂows

 Introduction

Water wave dynam
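The section heading `Introduction` above has lost its number. This is consistent with how the character class is parsed: `,-:` is a range from `,` to `:`, which covers the digits `0-9` as well as `-`, `.` and `/`. The sketch below demonstrates the difference on a made-up string. The `literal` pattern is a hypothetical variant in which the hyphen is placed last so that it is read as a plain character.

```python
import re

# Character class from the cell above: ",-:" is interpreted as a range.
ranged = re.compile(r'[.,-:?;\"\']+')

# Hypothetical variant (an assumption): "-" placed last is a literal hyphen,
# so digits and "/" are no longer part of the class.
literal = re.compile(r'[.,:?;"\'-]+')

# Made-up sample with digits, a slash and a hyphen.
sample = 'IL 60637, "2 February 2008": k/2 - 1.'

ranged_result = ranged.sub("", sample)
literal_result = literal.sub("", sample)

print(repr(ranged_result))   # all digits and the slash are gone
print(repr(literal_result))  # digits and "k/2" survive
```

If the intent is to keep every arithmetic operator, the hyphen could be left out of the class entirely; the variant shown here still removes it.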

# Stemming & Lemmatization :

## Testing on sample text :

In [17]:
# TODO: add the installed packages (nltk, spacy) to the requirements.txt file
!pip install spacy



In [18]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [19]:
text = read("/Users/tayssirboukrouba/Downloads/dataset/text/0001001v1.txt")
old_text = text[1000:2000]
doc = nlp(old_text)
tokens_list = []
for token in doc :
  print(token,"===>",token.lemma_)
  tokens_list.append(token.lemma_)

new_text = " ".join(tokens_list)

tering ===> tere
terms ===> term
that ===> that
are ===> be
explicit ===> explicit
functions ===> function
of ===> of
the ===> the
cor- ===> cor-

 ===> 

relations ===> relation
of ===> of
the ===> the
random ===> random
surface ===> surface
currents ===> current
. ===> .
These ===> these
scattering ===> scatter
terms ===> term
depend ===> depend
parametrically ===> parametrically

 ===> 

on ===> on
the ===> the
magnitudes ===> magnitude
and ===> and
directions ===> direction
of ===> of
the ===> the
smoothly ===> smoothly
varying ===> vary
drift ===> drift
and ===> and
are ===> be
shown ===> show
to ===> to
give ===> give

 ===> 

rise ===> rise
to ===> to
a ===> a
Doppler ===> Doppler
coupled ===> couple
scattering ===> scattering
mechanism ===> mechanism
. ===> .
The ===> the
Doppler ===> Doppler
interaction ===> interaction
in ===> in
the ===> the
presence ===> presence

 ===> 

of ===> of
slowly ===> slowly
varying ===> vary
drift ===> drift
modiﬁes ===> modiﬁes
the ===> the
scat

In [20]:
print("old text :\n", old_text )
print("-"*30)
print("lemmatised text :\n", new_text)

old text :
 tering terms that are explicit functions of the cor-
relations of the random surface currents. These scattering terms depend parametrically
on the magnitudes and directions of the smoothly varying drift and are shown to give
rise to a Doppler coupled scattering mechanism. The Doppler interaction in the presence
of slowly varying drift modiﬁes the scattering processes and provides a mechanism for
coupling long wavelengths with short wavelengths. Conservation of wave action (CWA),
typically derived for slowly varying drift, is extended to systems with rapidly varying
ﬂow. At yet larger propagation distances, we derive from the transport equations, an
equation for wave energy diﬀusion. The associated diﬀusion constant is also expressed
in terms of the surface ﬂow correlations. Our results provide a formal set of equations
to analyse transport of surface wave action, intensity, energy, and wave scattering as a
function of the slowly varying drifts and the correlation functions 

## Testing on full text :

In [21]:
old_text = read("/Users/tayssirboukrouba/Downloads/dataset/text/0001001v1.txt")
doc = nlp(old_text)
tokens_list = []
for token in doc :
  tokens_list.append(token.lemma_)

new_text = " ".join(tokens_list)

In [22]:
print("old text :\n", old_text[400:600] )
print("-"*69)
print("lemmatised text :\n", new_text[500:700] )

old text :
 94305

(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W
---------------------------------------------------------------------
lemmatised text :
  , Stanford , CA 94305 

 ( receive 2 February 2008 ) 

 we derive transport equation for the propagation of water wave action in the pres- 
 ence of a static , spatially random surface drift . use th


## Creating `lemmatize_text()` function :


In [49]:
def lemmatize_text(text) :
  """
    Lemmatizes the input text using SpaCy's en_core_web_sm model.

    Args:
    - text (str): The input text to be lemmatized.

    Returns:
    - str: The lemmatized text where each word is replaced by its lemma.

    Steps:
    1. Loads SpaCy's English model 'en_core_web_sm'.
    2. Tokenizes the input text into words.
    3. Lemmatizes each word to its base form.
    4. Joins the lemmatized words back into a single string.
    5. Returns the lemmatized text.
    """

  # loading the spaCy English pipeline
  nlp = spacy.load('en_core_web_sm')
  # word tokenization
  doc = nlp(text)
  # defining tokens list
  tokens_list = []

  # looping over word tokens
  for token in doc :
    # replacing words by their lemma
    tokens_list.append(token.lemma_)

  # joining lemmas into a single string
  new_text = " ".join(tokens_list)

  return new_text
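Because `lemmatize_text()` calls `spacy.load()` on every invocation, the model is reloaded for each document, which can become slow when looping over many papers. A minimal sketch of one alternative is to cache the loaded pipeline and pass it in. The names `get_pipeline` and `lemmatize_with` are hypothetical, and the sketch assumes the `en_core_web_sm` model is installed when the default is used.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline once per model name and reuse it afterwards."""
    import spacy  # imported lazily so the helper can be defined without spaCy
    return spacy.load(model_name)

def lemmatize_with(text, nlp=None):
    """Join the lemma of every token; `nlp` defaults to the cached pipeline."""
    if nlp is None:
        nlp = get_pipeline()
    return " ".join(token.lemma_ for token in nlp(text))
```

Calling `lemmatize_with(old_text)` repeatedly would then load the model only on the first call. Accepting `nlp` as a parameter also makes it possible to swap in a different pipeline or a test double.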

In [50]:
help(lemmatize_text)

Help on function lemmatize_text in module __main__:

lemmatize_text(text)
    Lemmatizes the input text using SpaCy's en_core_web_sm model.
    
    Args:
    - text (str): The input text to be lemmatized.
    
    Returns:
    - str: The lemmatized text where each word is replaced by its lemma.
    
    Steps:
    1. Loads SpaCy's English model 'en_core_web_sm'.
    2. Tokenizes the input text into words.
    3. Lemmatizes each word to its base form.
    4. Joins the lemmatized words back into a single string.
    5. Returns the lemmatized text.



In [51]:
old_text = read("/Users/tayssirboukrouba/Downloads/dataset/text/0001001v1.txt")
new_text = lemmatize_text(old_text)

In [52]:
print("old text :\n", old_text[400:600] )
print("-"*69)
print("lemmatised text :\n", new_text[500:700] )

old text :
 94305

(Received 2 February 2008)

We derive transport equations for the propagation of water wave action in the pres-
ence of a static, spatially random surface drift. Using the Wigner distribution W
---------------------------------------------------------------------
lemmatised text :
  , Stanford , CA 94305 

 ( receive 2 February 2008 ) 

 we derive transport equation for the propagation of water wave action in the pres- 
 ence of a static , spatially random surface drift . use th


# Combining Preprocessing Pipeline :

In [53]:
def preprocess_text(text) :
  """
    Preprocesses the input text by applying several text preprocessing steps.

    Args:
    - text (str): The input text to be preprocessed.

    Returns:
    - str: The preprocessed text after applying regex preprocessing, stop-words removal,
           custom punctuation removal, and lemmatization.

    Steps:
    1. Applies regex preprocessing to clean the text (function `regex_preprocess`).
    2. Removes stop words from the text (function `remove_stop_words`).
    3. Removes custom punctuation from the text (function `remove_punct`).
    4. Lemmatizes the text to replace words with their base forms (function `lemmatize_text`).
    5. Returns the preprocessed text.
    """

  # regex preprocessing
  a = regex_preprocess(text)

  # stop-words preprocessing
  b = remove_stop_words(a)

  # removing custom punctuation
  c = remove_punct(b)

  # lemmatizing text
  d = lemmatize_text(c)

  return d

In [54]:
help(preprocess_text)

Help on function preprocess_text in module __main__:

preprocess_text(text)
    Preprocesses the input text by applying several text preprocessing steps.
    
    Args:
    - text (str): The input text to be preprocessed.
    
    Returns:
    - str: The preprocessed text after applying regex preprocessing, stop-words removal,
           custom punctuation removal, and lemmatization.
    
    Steps:
    1. Applies regex preprocessing to clean the text (function `regex_preprocess`).
    2. Removes stop words from the text (function `remove_stop_words`).
    3. Removes custom punctuation from the text (function `remove_punct`).
    4. Lemmatizes the text to replace words with their base forms (function `lemmatize_text`).
    5. Returns the preprocessed text.



In [55]:
text = read("/Users/tayssirboukrouba/Downloads/dataset/text/0001001v1.txt")
new = preprocess_text(text)

In [56]:
print(new)

]  
 [    consideration publication j fluid mech 
 capillarygravity wave transport spatially random drift g u l l u e b l∗ c h u† ∗ department mathematics   university chicago   chicago   il†department mathematics   stanford university   stanford   can ( receive   february   ) derive transport equation propagation water wave action pre ence static   spatially random surface drift 
 use wign distribution w ( x   k   ) represent envelope wave amplitude position x contain wave wavevector k   describe surface wave transport static ﬂow consist two length scale   one vary smoothly wavelength scale   vary scale com parable wavelength 
 spatially rapidly vary weak surface ﬂow augment characteristic equation scatter term explicit function cor relation random surface current 
 scatter term depend parametrically magnitude direction smoothly vary drift show give rise doppler couple scatter mechanism 
 doppler interaction presence slowly vary drift modiﬁes scatter process provide mechanism couple l

# Applying the Pipeline to the Dataset

In [58]:
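# helper for writing text to a file; returns the number of characters written
def write(filename, text):
    # the context manager flushes and closes the file handle after writing
    with open(filename, 'w', encoding='utf-8') as f:
        return f.write(text)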

filename = '/Users/tayssirboukrouba/Downloads/example.txt'
text = 'Hello, my name is taissir'
write(filename, text)

25
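With the read and write helpers in place, applying the pipeline to the whole dataset amounts to looping over the text files. The following is a sketch under stated assumptions, not the notebook's actual implementation: `process_directory` is a hypothetical helper, the input files are assumed to be UTF-8 `.txt` files in a single folder, and the transformation is passed in as a callable so that `preprocess_text` (or any other function) can be used.

```python
from pathlib import Path

def process_directory(src_dir, dst_dir, transform, pattern="*.txt"):
    """Apply `transform` to every matching file in `src_dir` and write the
    results under the same file names in `dst_dir`."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src_path in sorted(src_dir.glob(pattern)):
        text = src_path.read_text(encoding="utf-8")
        dst_path = dst_dir / src_path.name
        dst_path.write_text(transform(text), encoding="utf-8")
        written.append(dst_path)
    return written
```

A call could look like `process_directory("path/to/text", "path/to/clean", preprocess_text)`, where both paths are placeholders. Writing to a separate output folder keeps the raw extracted text untouched, so the pipeline can be rerun after changes.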