# Text Pre-processing

## Major and commonly used techniques

1. Tokenization: Breaking text into smaller units, such as words, subwords, or characters.

2. Normalization: Converting text to a standard form, including lowercasing, removing punctuation, accents, or diacritics.

3. Stopword Removal: Eliminating common words (e.g., "and", "the", "is") that don't carry significant meaning.

4. Stemming: Reducing words to their base or root form by removing suffixes.

5. Lemmatization: Similar to stemming, but resulting in valid words (lemmas) through linguistic analysis.

6. Part-of-Speech (POS) Tagging: Assigning grammatical categories (nouns, verbs, adjectives, etc.) to words in a sentence.

7. Named Entity Recognition (NER): Identifying and classifying named entities (e.g., person names, locations, organizations) in text.

8. Chunking: Grouping consecutive words into "chunks" based on syntactic patterns.

9. Parsing: Analyzing the grammatical structure of sentences, often represented as parse trees.

10. Sentence Segmentation: Splitting text into individual sentences.

11. Spell Checking: Correcting spelling errors in text.

12. Text Normalization: Converting text into a canonical form, such as standardizing dates, numbers, or abbreviations.

13. Removing HTML Tags: Stripping HTML markup from text documents.

14. Removing URLs: Eliminating web URLs from text data.

15. Removing Punctuation: Deleting punctuation marks from text.

16. Removing Numeric Characters: Eliminating digits and numbers from text.

17. Removing Special Characters: Deleting special characters, symbols, or non-printable characters.

18. Removing Accents/Diacritics: Stripping accents or diacritical marks from letters.

19. Removing Stopwords: Eliminating commonly occurring words that do not carry much semantic meaning.

20. Lowercasing/Uppercasing: Converting text to lowercase or uppercase.

21. Removing Whitespace: Stripping extra spaces, tabs, or line breaks from text.

22. Text Expansion: Expanding contractions (e.g., "don't" to "do not").

23. Text Compression: Contracting repeated characters or phrases (e.g., "loooove" to "love").

24. Token Normalization: Standardizing tokens with similar meanings (e.g., "USA" to "United States").

25. Removing Emoticons/Emoji: Eliminating emoticons or emoji from text data.

26. Removing Short Words: Filtering out very short words or tokens.

27. Removing Rare Words: Eliminating infrequently occurring words from text.

28. Removing Duplicates: Deleting duplicate words or tokens from text.

29. Text Augmentation: Generating synthetic text examples through techniques like synonym replacement, word swapping, or paraphrasing.

30. Entity Masking: Replacing named entities with generic labels (e.g., replacing person names with "PERSON").
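
Several of the listed techniques are not demonstrated later in this notebook. The sketch below illustrates three of them (items 13, 14 and 23) using only the standard library. The function names `strip_html_and_urls` and `squeeze_repeats` and the sample sentence are illustrative assumptions, and the regular expressions are only adequate for simple inputs, not for arbitrary HTML.

```python
import re

def strip_html_and_urls(text: str) -> str:
    """Remove URLs and simple HTML tags, then collapse leftover whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # item 14: removing URLs
    text = re.sub(r"<[^>]+>", " ", text)       # item 13: removing HTML tags
    return " ".join(text.split())

def squeeze_repeats(text: str) -> str:
    """Collapse runs of three or more identical characters into one (item 23)."""
    return re.sub(r"(.)\1{2,}", r"\1", text)

sample = "<p>I loooove NLP! See https://example.com/docs for more.</p>"
cleaned = squeeze_repeats(strip_html_and_urls(sample))
print(cleaned)
```

Collapsing a run to a single character is a crude rule: it fixes "loooove" but would also shorten a legitimate double letter that happens to be stretched, so a real pipeline would probably pair it with spelling correction.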

## Steps done here
1. Lowercasing
2. Removing numbers
3. Removing stopwords
4. Removing punctuation
5. Removing whitespace
6. Stemming
7. Lemmatization
8. Spelling correction
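
As a rough preview, the first five steps can be chained into one function. This is a minimal sketch using only the standard library: the name `preprocess` and the tiny `STOPWORDS` set are assumptions made for illustration (the notebook itself uses NLTK's English stopword list), and stemming, lemmatization and spelling correction are left out.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set used only for this illustration.
STOPWORDS = {"the", "is", "a", "of", "in", "and", "it"}

def preprocess(text: str) -> str:
    text = text.lower()                                              # 1. lowercasing
    text = re.sub(r"\d+", "", text)                                  # 2. removing numbers
    text = " ".join(w for w in text.split() if w not in STOPWORDS)   # 3. removing stopwords
    text = text.translate(str.maketrans("", "", string.punctuation)) # 4. removing punctuation
    return re.sub(r"\s+", " ", text).strip()                         # 5. removing whitespace

result = preprocess(
    "Bangladesh got independence in 1971. "
    "It is the eighth-most populous country in the world."
)
print(result)
```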

## Importing text data from file into a string

In [1]:
with open('bangladesh_small.txt','r') as file:
    # line.rstrip() removes trailing whitespace, including the newline character (\n), from each line, ensuring consistent handling of line endings.
    # The join() method combines the stripped lines into a single string, separated by spaces.
    text = " ".join(line.rstrip() for line in file)

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text: ' + '\033[0m' + text)

[1m[93mImported Text: [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.


## Importing Necessary libraries

In [2]:
# Natural Language Toolkit (NLTK) library provides various functionalities for text processing.
import nltk

# String module contains constants and utility functions related to strings,
# including the set of punctuation characters.
import string

# Regular Expressions (regex) module for pattern matching and manipulation,
# used for compressing multiple spaces.
import re

## 1. Lower Casing

In [3]:
lower_case = text.lower()

modified_text = text.lower()

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Lower Cased   : ' + '\033[0m' + lower_case)
print('\033[1m' +'\033[92m'+ 'Modified      : ' + '\033[0m' + modified_text)

[1m[93mImported Text : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mLower Cased   : [0mbangladesh1,[a] officially the people's republic of bangladesh,[b] is a country among many countries south asia got independece in 1971. it is the eighth-most populous country in the world. the official languages of bangladesh is bengali and english is also used in the government and official documents alongside bengali.
[1m[92mModified      : [0mbangladesh1,[a] officially the people's republic of bangladesh,[b] is a country among many countries south asia got independece in 1971. it is the eighth-most populous country in the world. the official languages of bangladesh is bengali and english is also used in t

## 2. Removing Numbers

In [4]:
# Removes every run of digits from the original text and assigns the result to the variable remove_number.
# The same substitution is then applied to the running modified_text.
remove_number = re.sub(r'\d+', '', text)

modified_text = re.sub(r'\d+', '', modified_text)

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Number Removed: ' + '\033[0m' + remove_number)
print('\033[1m' +'\033[92m'+ 'Modified      : ' + '\033[0m' + modified_text)

[1m[93mImported Text : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mNumber Removed: [0mBangladesh,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in . It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[92mModified      : [0mbangladesh,[a] officially the people's republic of bangladesh,[b] is a country among many countries south asia got independece in . it is the eighth-most populous country in the world. the official languages of bangladesh is bengali and english is also used in the governm

## 3. Removal of Stopwords

In [5]:
# Imports the stopwords corpus from NLTK,
# which contains common words that are often filtered out in natural language processing tasks.
from nltk.corpus import stopwords

# Imports the word_tokenize function from NLTK,
# which tokenizes (splits) text into words.
# It is not used below; the cells split on whitespace with str.split() instead.
from nltk.tokenize import word_tokenize

# Downloads the stopwords corpus from NLTK
nltk.download('stopwords')

# (Debug)
stopwords_list = stopwords.words('english')
print('\033[1m' +'\033[95m'+ 'Stopwords Corpus : ' + '\033[0m' + str(stopwords_list))

[1m[95mStopwords Corpus : [0m['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
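# Splits the text on whitespace, drops every word found in the stopword list, and joins the remaining words back into a single string separated by spaces.
# The comparison is case-sensitive, so capitalized stopwords such as "It" and "The" survive in the un-lowercased text.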
stopwords_removed = ' '.join([word for word in text.split() if word not in stopwords_list])

modified_text = ' '.join([word for word in modified_text.split() if word not in stopwords_list])

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text     : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'stopwords removed : ' + '\033[0m' + stopwords_removed)
print('\033[1m' +'\033[92m'+ 'Modified          : ' + '\033[0m' + modified_text)

[1m[93mImported Text     : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mstopwords removed : [0mBangladesh1,[a] officially People's Republic Bangladesh,[b] country among many countries South Asia got independece 1971. It eighth-most populous country world. The official languages Bangladesh Bengali English also used government official documents alongside Bengali.
[1m[92mModified          : [0mbangladesh,[a] officially people's republic bangladesh,[b] country among many countries south asia got independece . eighth-most populous country world. official languages bangladesh bengali english also used government official documents alongside bengali.


## 4. Removal of Punctuation

#### 4.1 Method 1: Using string library

In [7]:
# Temporary copy so that both methods start from the same text
temp = modified_text

In [8]:
# Removes punctuation characters from the text by
# iterating through each character in the text
# and only including it in the result if it's not a punctuation character.
# using string.punctuation
punctuation_removed = ''.join([char for char in text if char not in string.punctuation])

modified_text = ''.join([char for char in modified_text if char not in string.punctuation])

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text        : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Punctuations Removed : ' + '\033[0m' + punctuation_removed)
print('\033[1m' +'\033[92m'+ 'Modified             : ' + '\033[0m' + modified_text)

[1m[93mImported Text        : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mPunctuations Removed : [0mBangladesh1a officially the Peoples Republic of Bangladeshb is a country among many countries South Asia got independece in 1971 It is the eighthmost populous country in the world The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali
[1m[92mModified             : [0mbangladesha officially peoples republic bangladeshb country among many countries south asia got independece  eighthmost populous country world official languages bangladesh bengali english also used government official documents alongside bengali


#### 4.2 Method 2: Using an explicit loop

In [9]:
# Iterates over each character in the string,
# using string.punctuation to detect punctuation,
# and replaces each punctuation character with a space.
punctuation_removed_2 = ""
for char in text:
    if char not in string.punctuation:
        punctuation_removed_2 += char
    else:
        punctuation_removed_2 += ' '

modified_text_2 = ""
for char in temp:
    if char not in string.punctuation:
        modified_text_2 += char
    else:
        modified_text_2 += ' '

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text        : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Punctuations Removed : ' + '\033[0m' + punctuation_removed_2)
print('\033[1m' +'\033[92m'+ 'Modified             : ' + '\033[0m' + modified_text_2)

[1m[93mImported Text        : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mPunctuations Removed : [0mBangladesh1  a  officially the People s Republic of Bangladesh  b  is a country among many countries South Asia got independece in 1971  It is the eighth most populous country in the world  The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali 
[1m[92mModified             : [0mbangladesh  a  officially people s republic bangladesh  b  country among many countries south asia got independece   eighth most populous country world  official languages bangladesh bengali english also used government official documents alongsi

## 5. Removal of Whitespace

In [10]:
# Uses regular expression (re) to replace multiple consecutive whitespace characters with a single space.
whitespace_removed = re.sub(r'\s+', ' ', text)

modified_text = re.sub(r'\s+', ' ', modified_text)

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text      : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Whitespaces Removed: ' + '\033[0m' + whitespace_removed)
print('\033[1m' +'\033[92m'+ 'Modified           : ' + '\033[0m' + modified_text)

[1m[93mImported Text      : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mWhitespaces Removed: [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[92mModified           : [0mbangladesha officially peoples republic bangladeshb country among many countries south asia got independece eighthmost populous country world official languages bangladesh bengali english also used government official documents alongside bengali


> A temporary copy of `modified_text` is kept for each of stemming and lemmatization. If the stemmed text were passed on to the lemmatizer, the suffixes would already be stripped, and the lemmatizer, which performs a similar reduction, could no longer recognize the words.

In [11]:
temp_for_stemming = modified_text

temp_for_lemmatization = modified_text

## 6. Stemming

In [12]:
# Imports the PorterStemmer class from NLTK,
# which is used for stemming words.
from nltk.stem.porter import PorterStemmer


In [13]:
stemming = PorterStemmer()

# Splits the text on whitespace, stems each word using the PorterStemmer,
# and then joins the stemmed words back into a single string separated by spaces.
stemmed_words = ' '.join([stemming.stem(word) for word in text.split()])

modified_text = ' '.join([stemming.stem(word) for word in temp_for_stemming.split()])

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text: ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Stemmed Words: ' + '\033[0m' + stemmed_words)
print('\033[1m' +'\033[92m'+ 'Modified     : ' + '\033[0m' + modified_text)

[1m[93mImported Text: [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mStemmed Words: [0mbangladesh1,[a] offici the people' republ of bangladesh,[b] is a countri among mani countri south asia got independec in 1971. it is the eighth-most popul countri in the world. the offici languag of bangladesh is bengali and english is also use in the govern and offici document alongsid bengali.
[1m[92mModified     : [0mbangladesha offici peopl republ bangladeshb countri among mani countri south asia got independec eighthmost popul countri world offici languag bangladesh bengali english also use govern offici document alongsid bengali


## 7. Lemmatization

In [14]:
# Downloads the WordNet corpus from NLTK
nltk.download('wordnet')
# Imports the WordNetLemmatizer class from NLTK.
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
lemmatization = WordNetLemmatizer()

# Splits the text on whitespace,
# lemmatizes each word using the WordNetLemmatizer,
# and then joins the lemmatized words back into a single string separated by spaces.
lemmatized_words = ' '.join([lemmatization.lemmatize(word) for word in text.split()])

modified_text = ' '.join([lemmatization.lemmatize(word) for word in temp_for_lemmatization.split()])

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text    : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Lemmatized Words : ' + '\033[0m' + lemmatized_words)
print('\033[1m' +'\033[92m'+ 'Modified         : ' + '\033[0m' + modified_text)

[1m[93mImported Text    : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mLemmatized Words : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many country South Asia got independece in 1971. It is the eighth-most populous country in the world. The official language of Bangladesh is Bengali and English is also used in the government and official document alongside Bengali.
[1m[92mModified         : [0mbangladesha officially people republic bangladeshb country among many country south asia got independece eighthmost populous country world official language bangladesh bengali english also used government official document alongside bengali


## 8. Spelling correction

In [16]:
# Autocorrect package provides functionality for correcting spelling errors.
!pip install autocorrect
# imports the Speller class from the autocorrect package,
# which will be used for spell correction.
from autocorrect import Speller



In [17]:
spell = Speller(lang='en')

# Applies spell correction to the original text (corrected_words)
# and to the lemmatized running text (modified_text).
corrected_words = spell(text)

modified_text = spell(modified_text)

# (Debug)
print('\033[1m' +'\033[93m'+ 'Imported Text   : ' + '\033[0m' + text)
print('\033[1m' +'\033[96m'+ 'Corrected Words : ' + '\033[0m' + corrected_words)
print('\033[1m' +'\033[92m'+ 'Modified        : ' + '\033[0m' + modified_text)

[1m[93mImported Text   : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independece in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[96mCorrected Words : [0mBangladesh1,[a] officially the People's Republic of Bangladesh,[b] is a country among many countries South Asia got independence in 1971. It is the eighth-most populous country in the world. The official languages of Bangladesh is Bengali and English is also used in the government and official documents alongside Bengali.
[1m[92mModified        : [0mbangladesh officially people republic bangladesh country among many country south asia got independence eighthmost populous country world official language bangladesh bengali english also used government official document alongside bengali
