### **Ordered Steps for Text Preprocessing** 📝✅  

1️⃣ **Sentence Boundary Detection** – Split text into sentences.  
   **Example:** `"I love AI. It is amazing!"` → `["I love AI.", "It is amazing!"]`  

2️⃣ **Lowercase Conversion** – Convert all text to lowercase.  
   **Example:** `"Hello World!"` → `"hello world!"`  

3️⃣ **Spelling Correction** – Fix spelling mistakes.  
   **Example:** `"Ths is amzing"` → `"This is amazing"`  

4️⃣ **Punctuation Removal** – Remove special characters.  
   **Example:** `"Hello, world!"` → `"Hello world"`  

5️⃣ **Stop Words Removal** – Remove common words (e.g., "is", "the").  
   **Example:** `"This is a great day"` → `"great day"`  

6️⃣ **Stemming / Lemmatization** – Reduce words to a stem or dictionary base form.  
   **Example:** `"running", "runs"` → `"run"`  

7️⃣ **Text Normalization** – Standardize text (e.g., expand contractions and slang, collapse extra whitespace).  
   **Example:** `"I'm gonna go"` → `"I am going to go"`  
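
As a rough illustration of how the ordered steps chain together, the sketch below applies a few of them in sequence using only the standard library. The regex sentence splitter, the tiny `DEMO_STOP_WORDS` set, and all function names are assumptions made for this example; they stand in for the NLTK, TextBlob, scikit-learn, and spaCy calls used in the cells that follow.

```python
import re
import string
from functools import reduce

# Hypothetical, tiny stop-word set used only for this illustration.
DEMO_STOP_WORDS = {"a", "is", "it", "the", "this"}

def split_sentences(text):
    # Naive rule: break after ".", "!" or "?" when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def lowercase(sentence):
    return sentence.lower()

def remove_punctuation(sentence):
    # Delete every ASCII punctuation character.
    return sentence.translate(str.maketrans("", "", string.punctuation))

def remove_stop_words(sentence):
    return " ".join(w for w in sentence.split() if w not in DEMO_STOP_WORDS)

# Per-sentence steps, applied in list order.
SENTENCE_STEPS = [lowercase, remove_punctuation, remove_stop_words]

def preprocess(text):
    sentences = split_sentences(text)
    return [reduce(lambda s, step: step(s), SENTENCE_STEPS, s) for s in sentences]

result = preprocess("I love AI. It is amazing!")
print(result)
```

Keeping the steps in a list makes the order explicit and easy to rearrange, which matters because each step changes what the next one sees.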

In [1]:
uncleanTextData = """
Once upon a time!!!, there were two FRIENDS.. They are, umm, living in a small vlllage near the river. 
One day, as they were walking.. on the road, they saw.. a big dark cloud in the sky! 
"oh no...! I think it's going to rain heavily," said one friend. 

The other replied, "Nahh! don’t worry, bro. It's just a cloudd. Let's keep walking". 

Suddenly, a strong wind started blowingg! The trees began to shake, and the leaves flew everywhere. 
The friends rushed to find shelter near an old, abandoned hut. "See? I told you!" - the first friend exclaimed. 

After a few minutes, the rain poured down heavilly. They had nowhere else to go, so they sat under the hut waitingg...

"""

In [2]:
# step one sentence boundary detection

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(uncleanTextData)

In [3]:
sentences[:5]

['\nOnce upon a time!!',
 '!, there were two FRIENDS..',
 'They are, umm, living in a small vlllage near the river.',
 'One day, as they were walking.. on the road, they saw.. a big dark cloud in the sky!',
 '"oh no...!']
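
The output above shows the tokenizer splitting inside `time!!!,` because of the repeated punctuation. One possible mitigation, sketched here as an assumption rather than as part of the original pipeline, is to collapse runs of sentence-ending marks before tokenizing. The helper name `collapse_repeated_punctuation` is hypothetical, and collapsing also flattens genuine ellipses, so whether this suits a given corpus needs checking.

```python
import re

def collapse_repeated_punctuation(text):
    # Replace any run of two or more ".", "!" or "?" with the last mark in the run,
    # e.g. "!!!" -> "!", ".." -> ".", "...!" -> "!".
    return re.sub(r"[.!?]{2,}", lambda m: m.group(0)[-1], text)

sample = "Once upon a time!!!, there were two FRIENDS.. They"
print(collapse_repeated_punctuation(sample))
```

The cleaned string could then be passed to `sent_tokenize` in place of `uncleanTextData`; the resulting sentence boundaries should still be inspected by hand.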

In [None]:
# step two lowercase conversion 

lower_sentences = [sentence.lower() for sentence in sentences]

In [5]:
lower_sentences[:5]

['\nonce upon a time!!',
 '!, there were two friends..',
 'they are, umm, living in a small vlllage near the river.',
 'one day, as they were walking.. on the road, they saw.. a big dark cloud in the sky!',
 '"oh no...!']

In [6]:
# step three spelling correction 

from textblob import TextBlob

In [20]:
correct_sentences = [str(TextBlob(sentence).correct()) for sentence in lower_sentences]

In [21]:
correct_sentences[:5]

['\nonce upon a time!!',
 '!, there were two friends..',
 'they are, ulm, living in a small village near the river.',
 'one day, as they were walking.. on the road, they saw.. a big dark cloud in the sky!',
 '"oh no...!']
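
Automatic correction can also change words that were not misspelled in any harmful way: in the output above the filler `umm` became `ulm`. A minimal sketch of one way to guard against that is an allow-list of words to leave untouched. Everything named here (`correct_with_allowlist`, `demo_corrector`, `DEMO_FIXES`) is hypothetical; the demo corrector is a dictionary lookup standing in for a real spell checker.

```python
import re

# Hypothetical stand-in for a real spell checker: maps a word to its "correction".
DEMO_FIXES = {"vlllage": "village", "umm": "ulm"}

def demo_corrector(word):
    return DEMO_FIXES.get(word, word)

def correct_with_allowlist(sentence, correct_word, keep):
    """Correct each word unless it appears in the keep set."""
    def fix(match):
        word = match.group(0)
        return word if word in keep else correct_word(word)
    # Match alphabetic words, keeping an internal apostrophe (e.g. "it's") together.
    return re.sub(r"[A-Za-z]+(?:'[A-Za-z]+)?", fix, sentence)

corrected = correct_with_allowlist(
    "they are, umm, living in a small vlllage near the river.",
    demo_corrector,
    keep={"umm"},
)
print(corrected)
```

With TextBlob, `correct_word` could be something like `lambda w: str(TextBlob(w).correct())`; whether word-by-word correction gives exactly the same results as correcting whole sentences has not been verified here.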

In [9]:
# step four punctuation removal

import string 

In [24]:
punctuation_free_sentences = [sentence.translate(str.maketrans('', '', string.punctuation+'\n')) for sentence in correct_sentences]

In [25]:
punctuation_free_sentences[:5]

['once upon a time',
 ' there were two friends',
 'they are ulm living in a small village near the river',
 'one day as they were walking on the road they saw a big dark cloud in the sky',
 'oh no']
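
`string.punctuation` contains only ASCII characters, so typographic marks such as the curly apostrophe in `don’t` pass through this step unchanged. The sketch below shows one alternative, assuming it is acceptable to drop every character that Unicode classifies as punctuation; the function name is hypothetical.

```python
import unicodedata

def remove_unicode_punctuation(text):
    # Keep only characters whose Unicode general category does not start
    # with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po are the punctuation categories).
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

print(remove_unicode_punctuation("nahh! don’t worry, bro."))
```

Note that symbols such as `$` or `+` belong to the Unicode symbol categories, not punctuation, so this approach leaves them in place even though `string.punctuation` would remove them.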

In [18]:
# step five stop words removal 

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [19]:
stop_words = set(ENGLISH_STOP_WORDS)  # a set keeps the membership test below fast

In [30]:
filter_sentences = [" ".join([word for word in sentence.split() if word not in stop_words]) for sentence in punctuation_free_sentences]

In [33]:
filter_sentences[:5]

['time',
 'friends',
 'ulm living small village near river',
 'day walking road saw big dark cloud sky',
 'oh']
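
In the output above, `'oh no'` was reduced to `'oh'`, so the negation was filtered out along with the other stop words. If negations matter for the downstream task, one option is to subtract them from the stop-word set before filtering. This is a sketch under assumptions: `DEMO_STOP_WORDS` and `NEGATIONS` are small hypothetical sets chosen for the example, not the scikit-learn list.

```python
# Hypothetical, abbreviated stop-word set for illustration only.
DEMO_STOP_WORDS = frozenset(
    {"a", "are", "in", "is", "no", "not", "the", "they", "there", "were"}
)
# Words to keep even if the stop-word set contains them.
NEGATIONS = frozenset({"no", "not", "never"})

def remove_stop_words(sentence, stop_words=DEMO_STOP_WORDS, keep=NEGATIONS):
    # Set difference removes the protected words from the filter.
    active = stop_words - keep
    return " ".join(w for w in sentence.split() if w not in active)

print(remove_stop_words("oh no"))
print(remove_stop_words("they are not in the village"))
```

To use the scikit-learn list instead, `ENGLISH_STOP_WORDS` can be passed as `stop_words`; it is already a frozenset, so the set difference works unchanged.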

In [34]:
# step six (a) lemmatization; this output feeds step seven

import spacy

In [35]:
nlp = spacy.load('en_core_web_sm')

In [45]:
lemmatized_sentences = [" ".join([token.lemma_ for token in nlp(sentence)]) for sentence in filter_sentences]

In [46]:
# step six (b) stemming, shown as an alternative to lemmatization

from nltk.stem import PorterStemmer

In [47]:
stemmer = PorterStemmer()

In [53]:
stemming_sentences = [" ".join([stemmer.stem(word) for word in sentence.split()]) for sentence in filter_sentences]

In [54]:
stemming_sentences[:5]

['time',
 'friend',
 'ulm live small villag near river',
 'day walk road saw big dark cloud sky',
 'oh']

In [55]:
# step seven text normalization 

import contractions 
import re 
from unidecode import unidecode

In [56]:
def normalize_text(text):
    text = contractions.fix(text)             # expand contractions and slang
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace runs; the r prefix goes outside the quotes
    text = unidecode(text)                    # transliterate non-ASCII characters to ASCII
    return text

In [57]:
normalized_sentences  = [normalize_text(sentence) for sentence in lemmatized_sentences]

In [58]:
normalized_sentences[:5]

['time',
 'friend',
 'ulm live small village near river',
 'day walk road see big dark cloud sky',
 'oh']

In [60]:
print(" ".join(normalized_sentences))

time friend ulm live small village near river day walk road see big dark cloud sky oh think go rain heavily say friend reply do not worry brow just cloud let walk suddenly strong wind start blow tree begin shake leave fly friend rush shelter near old abandon hut  tell friend exclaim minutes rain pour heavily sit hut wait
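
One detail of the whitespace cleanup in `normalize_text` is easy to get wrong: the `r` of a raw-string prefix belongs outside the quotes. With the `r` inside the quotes, the pattern matches a literal letter "r" followed by whitespace, which silently eats the final "r" of words such as "near". The short comparison below illustrates the difference; the sample string and variable names are made up for the example.

```python
import re

sample = "shelter near   old hut"

# "r" inside the quotes: matches a literal "r" followed by whitespace.
# Written with an escaped backslash here; it is the same pattern as 'r\s+'.
literal_r = re.sub("r\\s+", " ", sample)

# "r" as a raw-string prefix: matches any run of whitespace.
raw_prefix = re.sub(r"\s+", " ", sample)

print(literal_r)
print(raw_prefix)
```

Truncated words like `nea` or `shelte` in a normalized output are a useful hint that the pattern is being read the first way.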
