# NLP Text Preprocessing Project

This project walks through basic Natural Language Processing (NLP) steps for cleaning text data with Python and NLTK: lowercasing, sentence and word tokenization, punctuation removal, stop word removal, stemming, and lemmatization.

The project helps prepare text data for tasks like text classification and sentiment analysis.
- Author : Raju Kumar


### Install & Import Libraries

In [1]:
# ===============================
# NLP Project - Setup
# ===============================

# Install packages (run once)
# !pip install nltk

import nltk
import string
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download required NLTK data
nltk.download("punkt")
nltk.download("stopwords")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rajus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rajus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

### Input Text Data


In [2]:
# ===============================
# Input Text
# ===============================

text = """
Natural language processing is fun to learn.
It helps computers to understand human language!
Let's explore tokenization today.
"""


### Convert Text to Lowercase

In [3]:
# ===============================
# Text Normalization
# ===============================

# Lowercase conversion improves consistency
lower_text = text.lower()

print("Lowercase Text:\n", lower_text)


Lowercase Text:
 
natural language processing is fun to learn.
it helps computers to understand human language!
let's explore tokenization today.



### Sentence Tokenization

In [4]:
# ===============================
# Sentence Tokenization
# ===============================

sentences = sent_tokenize(lower_text)

print("Sentences:")
for i, s in enumerate(sentences, 1):
    print(f"{i}. {s}")

print("Total Sentences:", len(sentences))


Sentences:
1. 
natural language processing is fun to learn.
2. it helps computers to understand human language!
3. let's explore tokenization today.
Total Sentences: 3
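
The first sentence above starts with a line break because the triple-quoted string begins with a newline. A minimal sketch of one way to avoid that, assuming it is acceptable to collapse every run of whitespace into a single space before tokenizing (the helper name `normalize_whitespace` is hypothetical):

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


text = """
Natural language processing is fun to learn.
It helps computers to understand human language!
Let's explore tokenization today.
"""

clean_text = normalize_whitespace(text.lower())
print(repr(clean_text))
```

Passing `clean_text` to `sent_tokenize` instead of `lower_text` should then yield sentences without embedded line breaks.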


### Word Tokenization

In [5]:
# ===============================
# Word Tokenization
# ===============================

words = word_tokenize(lower_text)

print("Tokenized Words:")
print(words)
print("Total Words:", len(words))


Tokenized Words:
['natural', 'language', 'processing', 'is', 'fun', 'to', 'learn', '.', 'it', 'helps', 'computers', 'to', 'understand', 'human', 'language', '!', 'let', "'s", 'explore', 'tokenization', 'today', '.']
Total Words: 22


In [14]:
# Tokenize each sentence separately
for i, s in enumerate(sentences, 1):
    print(f"\n Sentences {i} : {s}")
    print("Token :", word_tokenize(s))


 Sentences 1 : 
natural language processing is fun to learn.
Token : ['natural', 'language', 'processing', 'is', 'fun', 'to', 'learn', '.']

 Sentences 2 : it helps computers to understand human language!
Token : ['it', 'helps', 'computers', 'to', 'understand', 'human', 'language', '!']

 Sentences 3 : let's explore tokenization today.
Token : ['let', "'s", 'explore', 'tokenization', 'today', '.']


### Remove Punctuation

In [6]:
# ===============================
# Remove Punctuation
# ===============================

clean_words = [
    word for word in words
    if word not in string.punctuation
]

print("After Removing Punctuation:")
print(clean_words)


After Removing Punctuation:
['natural', 'language', 'processing', 'is', 'fun', 'to', 'learn', 'it', 'helps', 'computers', 'to', 'understand', 'human', 'language', 'let', "'s", 'explore', 'tokenization', 'today']
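
One caveat with the filter above: `word not in string.punctuation` is a substring test against a single string, so it reliably catches only single punctuation characters. Multi-character tokens such as `...`, or the doubled backticks and doubled apostrophes that `word_tokenize` emits for double quotes, pass through. A stricter check is sketched below; the helper `is_punctuation` and the hand-written token list are assumptions for illustration.

```python
import string


def is_punctuation(token):
    """True if the token is non-empty and made only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


# Hand-written tokens of the kind a Treebank-style tokenizer can produce
tokens = ["he", "said", "``", "wait", "...", "''", "let", "'s", "go", "!"]

# Substring test: removes "!" but keeps "``", "..." and "''"
naive = [t for t in tokens if t not in string.punctuation]

# Character-level test: removes every punctuation-only token, keeps "'s"
strict = [t for t in tokens if not is_punctuation(t)]

print(naive)
print(strict)
```

The character-level version still keeps `'s`, because that token contains a letter.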


In [22]:
# Same filter written as an explicit loop
clean_words = []
for word in words:
    if word not in string.punctuation:
        clean_words.append(word)

print("\n Original Words  : " , words ,len(words))
print("\n After Removing punctuation : ", clean_words , len(clean_words))


 Original Words  :  ['natural', 'language', 'processing', 'is', 'fun', 'to', 'learn', '.', 'it', 'helps', 'computers', 'to', 'understand', 'human', 'language', '!', 'let', "'s", 'explore', 'tokenization', 'today', '.'] 22

 After Removing punctuation :  ['natural', 'language', 'processing', 'is', 'fun', 'to', 'learn', 'it', 'helps', 'computers', 'to', 'understand', 'human', 'language', 'let', "'s", 'explore', 'tokenization', 'today'] 19


### Remove Stop Words

In [7]:
# ===============================
# Stop Word Removal
# ===============================

stop_words = set(stopwords.words("english"))

filtered_words = [
    word for word in clean_words
    if word not in stop_words
]

print("After Stopword Removal:")
print(filtered_words)


After Stopword Removal:
['natural', 'language', 'processing', 'fun', 'learn', 'helps', 'computers', 'understand', 'human', 'language', 'let', "'s", 'explore', 'tokenization', 'today']
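
The fragment `'s` survives the stop word removal above, since the stop word set used there did not match it. If such contraction fragments are unwanted, one option is to extend the stop word set. The sketch below uses a tiny hand-written `base_stop_words` set as a stand-in for `stopwords.words("english")` so that it runs without NLTK data; the `contraction_fragments` set is likewise an assumption, not an NLTK list.

```python
# Stand-in for set(stopwords.words("english")); hypothetical and heavily shortened
base_stop_words = {"is", "to", "it"}

# Pieces that Treebank-style tokenizers split off contractions (assumed list)
contraction_fragments = {"'s", "n't", "'re", "'ll", "'ve", "'d", "'m"}

stop_words = base_stop_words | contraction_fragments

# Tokens copied from the punctuation-removal output
clean_words = ['natural', 'language', 'processing', 'is', 'fun', 'to', 'learn',
               'it', 'helps', 'computers', 'to', 'understand', 'human',
               'language', 'let', "'s", 'explore', 'tokenization', 'today']

filtered_words = [w for w in clean_words if w not in stop_words]
print(filtered_words)
```

With the real NLTK list, the same union (`set(stopwords.words("english")) | contraction_fragments`) would apply.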


### Stemming Words

In [8]:
# ===============================
# Stemming
# ===============================

stemmer = PorterStemmer()

stemmed_words = [
    stemmer.stem(word)
    for word in filtered_words
]

print("Stemmed Words:")
print(stemmed_words)


Stemmed Words:
['natur', 'languag', 'process', 'fun', 'learn', 'help', 'comput', 'understand', 'human', 'languag', 'let', "'s", 'explor', 'token', 'today']


### Full Pipeline Function

In [10]:
# ===============================
# NLP Pipeline Function
# ===============================

def preprocess_text(text):
    """
    Full NLP preprocessing pipeline.
    Returns processed tokens.
    """

    text = text.lower()
    words = word_tokenize(text)

    words = [w for w in words if w not in string.punctuation]

    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]

    stemmer = PorterStemmer()
    words = [stemmer.stem(w) for w in words]

    return words


In [11]:
result = preprocess_text(text)
print(result)


['natur', 'languag', 'process', 'fun', 'learn', 'help', 'comput', 'understand', 'human', 'languag', 'let', "'s", 'explor', 'token', 'today']
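
Tasks such as text classification often start from token counts. As a possible next step, the stemmed tokens printed above can be counted with `collections.Counter` from the standard library. This is only a sketch: the list is copied from the output above so that the block runs without NLTK.

```python
from collections import Counter

# Stemmed tokens copied from the pipeline output above
result = ['natur', 'languag', 'process', 'fun', 'learn', 'help', 'comput',
          'understand', 'human', 'languag', 'let', "'s", 'explor', 'token', 'today']

# Map each distinct stem to the number of times it occurs
counts = Counter(result)

print("Distinct stems:", len(counts))
print("Count for 'languag':", counts["languag"])
```

Because both occurrences of "language" reduce to the same stem, they are counted together.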


## Performing the Same Operations on a Second Text

### Input Text Data


In [28]:
text2 = """India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area; 
the most populous country since 2023; and, since its independence in 1947, the world's most populous democracy. Bounded by the Indian 
Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, 
it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. 
In the Indian Ocean, India is near Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Myanmar, 
Thailand, and Indonesia."""

In [29]:
l_text2 = text2.lower()

print("Lower case text :",l_text2)

Lower case text : india, officially the republic of india, is a country in south asia. it is the seventh-largest country by area; 
the most populous country since 2023; and, since its independence in 1947, the world's most populous democracy. bounded by the indian 
ocean on the south, the arabian sea on the southwest, and the bay of bengal on the southeast, 
it shares land borders with pakistan to the west; china, nepal, and bhutan to the north; and bangladesh and myanmar to the east. 
in the indian ocean, india is near sri lanka and the maldives; its andaman and nicobar islands share a maritime border with myanmar, 
thailand, and indonesia.


In [30]:
sentences2 = sent_tokenize(l_text2)
print("After Sentence Tokenize : \n",sentences2,"Total Sentences are ","\n",len(sentences2))

After Sentence Tokenize : 
 ['india, officially the republic of india, is a country in south asia.', "it is the seventh-largest country by area; \nthe most populous country since 2023; and, since its independence in 1947, the world's most populous democracy.", 'bounded by the indian \nocean on the south, the arabian sea on the southwest, and the bay of bengal on the southeast, \nit shares land borders with pakistan to the west; china, nepal, and bhutan to the north; and bangladesh and myanmar to the east.', 'in the indian ocean, india is near sri lanka and the maldives; its andaman and nicobar islands share a maritime border with myanmar, \nthailand, and indonesia.'] Total Sentences are  
 4


In [31]:
for i,s in enumerate(sentences2,1):
      print(f"\n Sentences {i} : {s}")
      print("Token :",word_tokenize(s))


 Sentences 1 : india, officially the republic of india, is a country in south asia.
Token : ['india', ',', 'officially', 'the', 'republic', 'of', 'india', ',', 'is', 'a', 'country', 'in', 'south', 'asia', '.']

 Sentences 2 : it is the seventh-largest country by area; 
the most populous country since 2023; and, since its independence in 1947, the world's most populous democracy.
Token : ['it', 'is', 'the', 'seventh-largest', 'country', 'by', 'area', ';', 'the', 'most', 'populous', 'country', 'since', '2023', ';', 'and', ',', 'since', 'its', 'independence', 'in', '1947', ',', 'the', 'world', "'s", 'most', 'populous', 'democracy', '.']

 Sentences 3 : bounded by the indian 
ocean on the south, the arabian sea on the southwest, and the bay of bengal on the southeast, 
it shares land borders with pakistan to the west; china, nepal, and bhutan to the north; and bangladesh and myanmar to the east.
Token : ['bounded', 'by', 'the', 'indian', 'ocean', 'on', 'the', 'south', ',', 'the', 'arabian

In [33]:
# Tokenize the second text (print Words2, not the Words list from the first text)
Words2 = word_tokenize(l_text2)
print("After Word Tokenize : \n", Words2, "Total Words are ", "\n", len(Words2))


In [35]:
clean_words2 = []
for i in Words2:
    if i not in string.punctuation:
        clean_words2.append(i)

print("\n Original Words  : " , Words2 ,len(Words2))
print("\n After Removing punctuation : ", clean_words2 , len(clean_words2))


 Original Words  :  ['india', ',', 'officially', 'the', 'republic', 'of', 'india', ',', 'is', 'a', 'country', 'in', 'south', 'asia', '.', 'it', 'is', 'the', 'seventh-largest', 'country', 'by', 'area', ';', 'the', 'most', 'populous', 'country', 'since', '2023', ';', 'and', ',', 'since', 'its', 'independence', 'in', '1947', ',', 'the', 'world', "'s", 'most', 'populous', 'democracy', '.', 'bounded', 'by', 'the', 'indian', 'ocean', 'on', 'the', 'south', ',', 'the', 'arabian', 'sea', 'on', 'the', 'southwest', ',', 'and', 'the', 'bay', 'of', 'bengal', 'on', 'the', 'southeast', ',', 'it', 'shares', 'land', 'borders', 'with', 'pakistan', 'to', 'the', 'west', ';', 'china', ',', 'nepal', ',', 'and', 'bhutan', 'to', 'the', 'north', ';', 'and', 'bangladesh', 'and', 'myanmar', 'to', 'the', 'east', '.', 'in', 'the', 'indian', 'ocean', ',', 'india', 'is', 'near', 'sri', 'lanka', 'and', 'the', 'maldives', ';', 'its', 'andaman', 'and', 'nicobar', 'islands', 'share', 'a', 'maritime', 'border', 'with',

In [36]:
clean_words2

['india',
 'officially',
 'the',
 'republic',
 'of',
 'india',
 'is',
 'a',
 'country',
 'in',
 'south',
 'asia',
 'it',
 'is',
 'the',
 'seventh-largest',
 'country',
 'by',
 'area',
 'the',
 'most',
 'populous',
 'country',
 'since',
 '2023',
 'and',
 'since',
 'its',
 'independence',
 'in',
 '1947',
 'the',
 'world',
 "'s",
 'most',
 'populous',
 'democracy',
 'bounded',
 'by',
 'the',
 'indian',
 'ocean',
 'on',
 'the',
 'south',
 'the',
 'arabian',
 'sea',
 'on',
 'the',
 'southwest',
 'and',
 'the',
 'bay',
 'of',
 'bengal',
 'on',
 'the',
 'southeast',
 'it',
 'shares',
 'land',
 'borders',
 'with',
 'pakistan',
 'to',
 'the',
 'west',
 'china',
 'nepal',
 'and',
 'bhutan',
 'to',
 'the',
 'north',
 'and',
 'bangladesh',
 'and',
 'myanmar',
 'to',
 'the',
 'east',
 'in',
 'the',
 'indian',
 'ocean',
 'india',
 'is',
 'near',
 'sri',
 'lanka',
 'and',
 'the',
 'maldives',
 'its',
 'andaman',
 'and',
 'nicobar',
 'islands',
 'share',
 'a',
 'maritime',
 'border',
 'with',
 'myanma

In [37]:
stop_words = set(stopwords.words("english"))

In [39]:
filtered_words = []
for i in clean_words2:
    if i not in stop_words:
        filtered_words.append(i)
print("After removing punctuation and stop words : ",filtered_words)

After removing punctuation and stop words :  ['india', 'officially', 'republic', 'india', 'country', 'south', 'asia', 'seventh-largest', 'country', 'area', 'populous', 'country', 'since', '2023', 'since', 'independence', '1947', 'world', "'s", 'populous', 'democracy', 'bounded', 'indian', 'ocean', 'south', 'arabian', 'sea', 'southwest', 'bay', 'bengal', 'southeast', 'shares', 'land', 'borders', 'pakistan', 'west', 'china', 'nepal', 'bhutan', 'north', 'bangladesh', 'myanmar', 'east', 'indian', 'ocean', 'india', 'near', 'sri', 'lanka', 'maldives', 'andaman', 'nicobar', 'islands', 'share', 'maritime', 'border', 'myanmar', 'thailand', 'indonesia']
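
The list above still contains numbers (`2023`, `1947`), a hyphenated token, and the fragment `'s`. Whether to keep them depends on the task. If only plain alphabetic words are wanted, `str.isalpha()` is one simple filter. The sketch below applies it to a short excerpt of the output above; the variable names are illustrative.

```python
# Short excerpt of the filtered tokens shown above
tokens = ['seventh-largest', 'country', 'since', '2023', 'independence',
          '1947', 'world', "'s", 'populous', 'democracy']

# str.isalpha() is True only when every character is a letter,
# so digits, hyphenated words, and "'s" are all dropped.
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```

Note the trade-off: this also discards `seventh-largest`, which may carry meaning, so a looser rule may suit some tasks better.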


### Stemming in NLP
Stemming chops suffixes such as "ing", "ed", and "s" off a word to get its stem (root).
It does not guarantee a real dictionary word, only a common base form.

Examples:
- Running -> run
- Runs -> run
- Easily -> easili
- Studies -> studi

* Notice that "easili" and "studi" are not valid English words. That is normal in stemming.
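
The suffix-chopping idea can be illustrated with a deliberately naive sketch. This is not the Porter algorithm that NLTK implements; `naive_stem` and its suffix list are hypothetical, chosen only to show why real stemmers need extra rules.

```python
# Toy suffix stripper for illustration only; NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, minimal suffix list


def naive_stem(word):
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.strip().lower()
    for suffix in SUFFIXES:
        # Require at least three remaining characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Running", "Runs", "Easily", "Studies"]:
    print(w, "->", naive_stem(w))
```

The toy version leaves the doubled consonant in `runn` and does not touch `easily`, whereas the Porter stemmer produces `run` and `easili`. Handling cases like these is what the additional Porter rules are for.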

In [41]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = ["Running ", "Runs", "Change", "Studies"]

for i in word:
    # Strip any potential leading/trailing whitespace before stemming
    processed_word = i.strip()
    stemmed_word = stemmer.stem(processed_word)
    print(f"{processed_word} <-----------------------------> {stemmed_word}")



Running <-----------------------------> run
Runs <-----------------------------> run
Change <-----------------------------> chang
Studies <-----------------------------> studi


### Lemmatization
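Lemmatization maps a word to its dictionary form (lemma), so it is usually more accurate than stemming.
Note that `WordNetLemmatizer.lemmatize` is case-sensitive and treats every word as a noun unless a `pos` argument is given, so capitalized inputs, verbs, or strings with stray whitespace can come back unchanged.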

In [42]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download("wordnet")  # WordNet: a large lexical database of English words with meanings and synonyms
nltk.download("omw-1.4")  # Open Multilingual Wordnet data

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajus\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\rajus\AppData\Roaming\nltk_data...


True

In [43]:
lem = WordNetLemmatizer()

W = ["Running ", "Runs", "Change", "Studies", "mice"]
for i in W:
    print(i, "<-------------->", lem.lemmatize(i))

Running  <--------------> Running 
Runs <--------------> Runs
Change <--------------> Change
Studies <--------------> Studies
mice <--------------> mouse
