This notebook builds a pipeline for text tokenization and noise reduction, aimed at analyzing historical religious and biblical texts. Using the Natural Language Processing (NLP) libraries NLTK and spaCy, it preprocesses and tokenizes a dataset of Asian religious and biblical texts obtained from the UCI Machine Learning Repository.

In [None]:
# Install (or upgrade) the libraries needed for tokenization
!pip install nltk -U -q
!pip install spacy -U -q

In [None]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

import nltk
import re
import spacy
import zipfile
import requests
from io import BytesIO
from nltk.tokenize import word_tokenize

# download and install the spacy language model
!python3 -m spacy download en_core_web_sm
sp = spacy.load('en_core_web_sm')

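# download the 'punkt' tokenizer models
# (assumption: recent NLTK releases look for 'punkt_tab' instead; download that
# resource as well if word_tokenize raises a LookupError)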
nltk.download('punkt')


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     12.8/12.8 MB 64.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
url = 'https://archive.ics.uci.edu/static/public/512/a+study+of+asian+religious+and+biblical+texts.zip'

# Download the ZIP archive into memory
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

with zipfile.ZipFile(BytesIO(response.content)) as z:

  # Open 'Complete_data .txt' from the archive (the member name contains a space)
  with z.open('Complete_data .txt') as file:

    # Read and decode the file, dropping bytes that are not valid UTF-8
    working_txt = file.read().decode('utf-8', errors='ignore')
    # Collapse runs of whitespace and line breaks into single spaces
    clean_txt = re.sub(r"\s+", " ", working_txt)

print(clean_txt[:50])

0.1 1.The Buddha: "What do you think, Rahula: What
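
The member name passed to z.open must match the archive entry exactly, including the stray space before the extension. A minimal sketch of how the name could be looked up with namelist() instead of being hard-coded is shown below. It builds a small in-memory archive as a stand-in for the download, so the member name `Sample_data .txt` and its contents are made up for illustration.

```python
import zipfile
from io import BytesIO

# Build a small in-memory archive as a stand-in for the downloaded ZIP.
# The member name and contents are invented for this illustration.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("Sample_data .txt", "line one\nline   two\n")

buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    # namelist() returns member names exactly as stored, odd spaces included
    members = z.namelist()
    # Pick the first .txt member rather than typing its name by hand
    txt_name = next(name for name in members if name.lower().endswith(".txt"))
    with z.open(txt_name) as file:
        sample_txt = file.read().decode("utf-8", errors="ignore")

print(members)
print(repr(sample_txt))
```

Printing the member list first makes a KeyError from a mistyped name easier to diagnose.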


In [None]:
tokens = word_tokenize(clean_txt)
print(tokens[:50])

['0.1', '1.The', 'Buddha', ':', '``', 'What', 'do', 'you', 'think', ',', 'Rahula', ':', 'What', 'is', 'a', 'mirror', 'for', '?', '``', 'The', 'Buddha', ':', 'Rahula', ':', '``', 'For', 'reflection', ',', 'sir', '.', '``', 'Rahula', ':', 'The', 'Buddha', ':', '``', 'In', 'the', 'same', 'way', ',', 'Rahula', ',', 'bodily', 'acts', ',', 'verbal', 'acts', ',']


**Removing Noise:** Regularize the text in three steps:

* Replace non-alphabetic characters with a single space so that only alphabetic content remains.
* Collapse successive whitespace and line breaks into single spaces to standardize the text format.
* Strip leading and trailing whitespace to ensure clean token boundaries.


In [None]:
# replace every non-alphabetic, non-whitespace character with a single space
reg_txt = re.sub(r'[^a-zA-Z\s]', ' ', clean_txt)
print(reg_txt[:50])

# collapse the runs of whitespace created by the previous step
reg_txt = re.sub(r'\s+', ' ', reg_txt)
print(reg_txt[:50])

# remove any leading and trailing whitespace
reg_txt = reg_txt.strip()
print(reg_txt[:50])

      The Buddha   What do you think  Rahula  What
 The Buddha What do you think Rahula What is a mir
The Buddha What do you think Rahula What is a mirr
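
The three noise-removal steps can also be wrapped in one function so they are easy to reuse on other files. The sketch below assumes the same regular expressions as the cell above; the name `remove_noise` is hypothetical and not part of NLTK or spaCy.

```python
import re

def remove_noise(text):
    """Hypothetical helper: keep ASCII letters, single spaces, no outer whitespace."""
    # Step 1: replace anything that is not an ASCII letter or whitespace
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # Step 2: collapse runs of whitespace into one space
    text = re.sub(r"\s+", " ", text)
    # Step 3: drop leading and trailing whitespace
    return text.strip()

sample = '0.1 1.The Buddha: "What do you think, Rahula: What'
print(remove_noise(sample))
```

Note that the character class [^a-zA-Z\s] also removes digits, verse numbers, and accented letters, so a name such as Ānanda would lose its first letter. Whether that is acceptable depends on the analysis that follows.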


**Word Tokenization:** Split the cleaned text into individual words using NLTK's word_tokenize function.

In [None]:
# tokenize the regularized text
reg_tokens = word_tokenize(reg_txt)
print(reg_tokens[:50])

['The', 'Buddha', 'What', 'do', 'you', 'think', 'Rahula', 'What', 'is', 'a', 'mirror', 'for', 'The', 'Buddha', 'Rahula', 'For', 'reflection', 'sir', 'Rahula', 'The', 'Buddha', 'In', 'the', 'same', 'way', 'Rahula', 'bodily', 'acts', 'verbal', 'acts', 'mental', 'acts', 'are', 'to', 'be', 'done', 'with', 'repeated', 'reflection', 'The', 'Buddha', 'Whenever', 'you', 'want', 'to', 'perform', 'a', 'bodily', 'act', 'you']


**Character Tokenization:** Split the text into single characters, whitespace included, using NLTK's regexp_tokenize with a regular expression.

In [None]:
from nltk.tokenize import regexp_tokenize

In [None]:
print(clean_txt[:50])

0.1 1.The Buddha: "What do you think, Rahula: What


In [None]:
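# \S matches one non-whitespace character and \s one whitespace character,
# so the alternation matches every character in the text, one at a time
pattern = r"\S|\s"
character_tokens = regexp_tokenize(clean_txt, pattern)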
print(character_tokens[:50])

['0', '.', '1', ' ', '1', '.', 'T', 'h', 'e', ' ', 'B', 'u', 'd', 'd', 'h', 'a', ':', ' ', '"', 'W', 'h', 'a', 't', ' ', 'd', 'o', ' ', 'y', 'o', 'u', ' ', 't', 'h', 'i', 'n', 'k', ',', ' ', 'R', 'a', 'h', 'u', 'l', 'a', ':', ' ', 'W', 'h', 'a', 't']
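
Because the pattern matches any single character, the result should be the same as turning the string into a list. The sketch below checks this with the standard library's re.findall on a short sample. It assumes regexp_tokenize returns the pattern's matches in order, as the output above suggests.

```python
import re

sample = 'The Buddha: "What'
pattern = r"\S|\s"

# re.findall returns every non-overlapping match; each match is one character
chars_regex = re.findall(pattern, sample)
# list() splits a string into its characters directly
chars_list = list(sample)

print(chars_regex[:6])
print(chars_regex == chars_list)
```

The regular-expression route becomes useful once the pattern is narrowed, for example to r"\S" to skip whitespace characters.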


**Sentence Tokenization:** Divide the text into sentences using NLTK's sent_tokenize function.

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
print(clean_txt[:50])

0.1 1.The Buddha: "What do you think, Rahula: What


In [None]:
sentence_tokens=sent_tokenize(clean_txt)
print(sentence_tokens[:5])

['0.1 1.The Buddha: "What do you think, Rahula: What is a mirror for?', '"The Buddha:Rahula: "For reflection, sir.', '"Rahula:The Buddha: "In the same way, Rahula, bodily acts, verbal acts, & mental acts are to be done with repeated reflection.The Buddha:"Whenever you want to perform a bodily act, you should reflect on it: \'This bodily act I want to perform would it lead to self-affliction, to the affliction of others, or to both?', "Is it an unskillful bodily act, with painful consequences, painful results?'", 'If, on reflection, you know that it would lead to self-affliction, to the affliction of others, or to both; it would be an unskillful bodily act with painful consequences, painful results, then any bodily act of that sort is absolutely unfit for you to do.']
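
In the output above, the third item still contains "reflection.The Buddha:", where no space follows the period and the sentence boundary was not detected. One possible preprocessing step, sketched here as an assumption rather than part of the original pipeline, inserts a space where sentence-final punctuation directly touches a capital letter. The helper name `space_after_sentence_end` is hypothetical.

```python
import re

def space_after_sentence_end(text):
    """Hypothetical helper: add a space where a sentence end touches the next word."""
    # A lowercase letter, then ., ? or !, directly followed by an uppercase letter
    return re.sub(r"([a-z][.?!])([A-Z])", r"\1 \2", text)

sample = 'with repeated reflection.The Buddha:"Whenever you want'
print(space_after_sentence_end(sample))
```

The lowercase-letter requirement leaves numbering such as "1.The" untouched, but the rule is a heuristic and could split abbreviations that are written without spaces, so the result should be checked before passing it to sent_tokenize.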
