# **Text Preprocessing Pipeline for Sentiment Analysis**

## **Introduction**
Natural Language Processing (NLP) tasks usually start by converting raw text into a clean, structured form that is easier to analyze and model. This project demonstrates a text preprocessing pipeline on the manually labeled test file of the Sentiment140 dataset (`testdata.manual.2009.06.14.csv`), which contains tweets labeled with a sentiment score in the `target` column (0 = negative, 2 = neutral, 4 = positive).


### **Objective**
- Build a reusable text preprocessing pipeline for NLP projects.
- Demonstrate the use of tokenization, stopword removal, stemming, and lemmatization.
- Generate a cleaned dataset ready for sentiment analysis.


### **Techniques Used**
1. **Tokenization:** Breaking text into individual words or tokens.
2. **Stopword Removal:** Eliminating very common words that carry little meaning on their own.
3. **Stemming:** Stripping suffixes with heuristic rules to reduce words to a stem, which is not always a real word.
4. **Lemmatization:** Reducing words to their base forms using vocabulary and grammar rules.

In [1]:
# importing libraries
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

In [14]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [11]:
# Load the dataset and display the first few rows.
data_path = '/content/testdata.manual.2009.06.14.csv'  # Update the path if needed
dataset = pd.read_csv(data_path, encoding='ISO-8859-1', header=None)
dataset.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
print("Dataset Overview:")
dataset.head()

Dataset Overview:


Unnamed: 0,target,ids,date,flag,user,text
0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
2,4,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
4,4,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...


## **2. Text Tokenization**

### **What is Tokenization?**
The process of splitting text into individual components (tokens) such as words or phrases.

### **Why?**
Tokenization comes first because every later step (stopword removal, stemming, lemmatization) operates on individual tokens rather than on raw strings.




In [16]:
def tokenize_text(text):
    return word_tokenize(text)

dataset['tokens'] = dataset['text'].apply(tokenize_text)
print("\nSample Tokenized Output:")
dataset[['text', 'tokens']].head()


Sample Tokenized Output:


Unnamed: 0,text,tokens
0,@stellargirl I loooooooovvvvvveee my Kindle2. ...,"[@, stellargirl, I, loooooooovvvvvveee, my, Ki..."
1,Reading my kindle2... Love it... Lee childs i...,"[Reading, my, kindle2, ..., Love, it, ..., Lee..."
2,"Ok, first assesment of the #kindle2 ...it fuck...","[Ok, ,, first, assesment, of, the, #, kindle2,..."
3,@kenburbary You'll love your Kindle2. I've had...,"[@, kenburbary, You, 'll, love, your, Kindle2,..."
4,@mikefish Fair enough. But i have the Kindle2...,"[@, mikefish, Fair, enough, ., But, i, have, t..."
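In the output above, `word_tokenize` splits `@` and `#` away from the mention or hashtag that follows them. One possible alternative for tweets is NLTK's `TweetTokenizer`, which needs no downloaded resources. The sketch below is not part of the original pipeline; the sample tweet and the helper name `tokenize_tweet` are made up for illustration.

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions, reduce_len caps runs of a repeated
# character at three, and preserve_case=False lowercases word tokens.
tweet_tokenizer = TweetTokenizer(
    preserve_case=False, strip_handles=True, reduce_len=True
)

def tokenize_tweet(text):
    """Tokenize a tweet while keeping hashtags and emoticons in one piece."""
    return tweet_tokenizer.tokenize(text)

# Made-up sample tweet, not taken from the dataset.
sample = "@someuser I looooove my new phone #happy :-)"
tokens = tokenize_tweet(sample)
print(tokens)
```

Whether to strip handles or shorten elongated words is a modeling choice; elongation such as "looooove" can itself signal strong sentiment.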


## **3. Remove Stopwords**

### **What are Stopwords?**
Commonly used words (e.g., "is", "and", "the") that carry little meaningful information in text analysis.

### **Why?**
Removing stopwords reduces noise and focuses on words that contribute to the meaning of the text.

In [17]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

dataset['tokens_no_stopwords'] = dataset['tokens'].apply(remove_stopwords)
print("\nAfter Stopword Removal:")
dataset[['text', 'tokens_no_stopwords']].head()


After Stopword Removal:


Unnamed: 0,text,tokens_no_stopwords
0,@stellargirl I loooooooovvvvvveee my Kindle2. ...,"[@, stellargirl, loooooooovvvvvveee, Kindle2, ..."
1,Reading my kindle2... Love it... Lee childs i...,"[Reading, kindle2, ..., Love, ..., Lee, childs..."
2,"Ok, first assesment of the #kindle2 ...it fuck...","[Ok, ,, first, assesment, #, kindle2, ..., fuc..."
3,@kenburbary You'll love your Kindle2. I've had...,"[@, kenburbary, 'll, love, Kindle2, ., 've, mi..."
4,@mikefish Fair enough. But i have the Kindle2...,"[@, mikefish, Fair, enough, ., Kindle2, think,..."
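NLTK's English stopword list includes negation words such as "no" and "not", so the filter above removes them, even though they can flip the sentiment of a tweet. A minimal sketch of one way to keep them follows. The small `EXAMPLE_STOP_WORDS` set is a stand-in so the snippet runs without NLTK data, and the names `NEGATIONS` and `remove_stopwords_keep_negations` are hypothetical.

```python
# A tiny stand-in for stopwords.words('english'); the real list is much longer.
EXAMPLE_STOP_WORDS = {"how", "can", "you", "not", "no", "he", "is", "it", "i"}

# Negation words that are kept because they can reverse the sentiment.
NEGATIONS = {"no", "not", "nor", "never"}

def remove_stopwords_keep_negations(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Drop stopwords, but never drop a negation word."""
    effective_stop_words = set(stop_words) - NEGATIONS
    return [tok for tok in tokens if tok.lower() not in effective_stop_words]

tokens = ["how", "can", "you", "not", "love", "Obama", "?"]
result = remove_stopwords_keep_negations(tokens)
print(result)  # ['not', 'love', 'Obama', '?']
```

In the notebook, the same function could be called with `stop_words=stop_words` to reuse the NLTK list loaded earlier.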


## **4. Stemming**

### **What is Stemming?**
The process of reducing words to their stems by stripping suffixes with heuristic rules. The result is often not a dictionary word (e.g., "running" -> "run", but "economy" -> "economi").

### **Why?**
Stemming collapses inflected variants of a word (e.g., "loves", "loved", "loving") into a single token, which shrinks the vocabulary a model has to handle.

In [18]:
stemmer = PorterStemmer()

def stem_tokens(tokens):
    return [stemmer.stem(word) for word in tokens]

dataset['stemmed'] = dataset['tokens_no_stopwords'].apply(stem_tokens)
print("\nAfter Stemming:")
dataset[['text', 'stemmed']].head()


After Stemming:


Unnamed: 0,text,stemmed
0,@stellargirl I loooooooovvvvvveee my Kindle2. ...,"[@, stellargirl, loooooooovvvvvvee, kindle2, ...."
1,Reading my kindle2... Love it... Lee childs i...,"[read, kindle2, ..., love, ..., lee, child, go..."
2,"Ok, first assesment of the #kindle2 ...it fuck...","[ok, ,, first, asses, #, kindle2, ..., fuck, r..."
3,@kenburbary You'll love your Kindle2. I've had...,"[@, kenburbari, 'll, love, kindle2, ., 've, mi..."
4,@mikefish Fair enough. But i have the Kindle2...,"[@, mikefish, fair, enough, ., kindle2, think,..."
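To see what the Porter stemmer does to individual words, it can help to run it on a short list in isolation. The sketch below uses a few words that also appear in the tweets; the `words` list and `stems` dictionary are illustrative names only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Map each word to its Porter stem; several stems are not dictionary words.
words = ["running", "economy", "happy", "quite", "loves"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Note that "quite" becomes "quit", a different real word, which is the kind of collision that heuristic suffix stripping can introduce.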


## **5. Lemmatization**

### **What is Lemmatization?**
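The process of reducing words to their dictionary base forms (lemmas) using a vocabulary and the word's part of speech (e.g., "running" -> "run" when tagged as a verb). Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why tokens such as "Reading" are left unchanged in the output below.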

### **Why?**
Lemmatization produces standard and meaningful root words, making text analysis more precise.


In [19]:
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

dataset['lemmatized'] = dataset['tokens_no_stopwords'].apply(lemmatize_tokens)
print("\nAfter Lemmatization:")
dataset[['text', 'lemmatized']].head()


After Lemmatization:


Unnamed: 0,text,lemmatized
0,@stellargirl I loooooooovvvvvveee my Kindle2. ...,"[@, stellargirl, loooooooovvvvvveee, Kindle2, ..."
1,Reading my kindle2... Love it... Lee childs i...,"[Reading, kindle2, ..., Love, ..., Lee, child,..."
2,"Ok, first assesment of the #kindle2 ...it fuck...","[Ok, ,, first, assesment, #, kindle2, ..., fuc..."
3,@kenburbary You'll love your Kindle2. I've had...,"[@, kenburbary, 'll, love, Kindle2, ., 've, mi..."
4,@mikefish Fair enough. But i have the Kindle2...,"[@, mikefish, Fair, enough, ., Kindle2, think,..."


## **6. Combine Preprocessing Steps**

### **Why?**
To create a streamlined and reusable text preprocessing function.

In [22]:
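def preprocess_text(text):
    # 1. Split the raw tweet into tokens.
    tokens = word_tokenize(text)
    # 2. Drop stopwords (case-insensitive match).
    tokens = [word for word in tokens if word.lower() not in stop_words]
    # 3. Stem each token; PorterStemmer also lowercases it.
    tokens = [stemmer.stem(word) for word in tokens]
    # 4. Lemmatize the stems. Stems are often not dictionary words, so this step
    #    can alter them further (e.g. "asses" becomes "ass" in row 2 below).
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens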

dataset['processed'] = dataset['text'].apply(preprocess_text)
print("\nFully Preprocessed Output:")
dataset[['text', 'processed']].head(10)


Fully Preprocessed Output:


Unnamed: 0,text,processed
0,@stellargirl I loooooooovvvvvveee my Kindle2. ...,"[@, stellargirl, loooooooovvvvvvee, kindle2, ...."
1,Reading my kindle2... Love it... Lee childs i...,"[read, kindle2, ..., love, ..., lee, child, go..."
2,"Ok, first assesment of the #kindle2 ...it fuck...","[ok, ,, first, ass, #, kindle2, ..., fuck, roc..."
3,@kenburbary You'll love your Kindle2. I've had...,"[@, kenburbari, 'll, love, kindle2, ., 've, mi..."
4,@mikefish Fair enough. But i have the Kindle2...,"[@, mikefish, fair, enough, ., kindle2, think,..."
5,@richardebaker no. it is too big. I'm quite ha...,"[@, richardebak, ., big, ., 'm, quit, happi, k..."
6,Fuck this economy. I hate aig and their non lo...,"[fuck, economi, ., hate, aig, non, loan, given..."
7,Jquery is my new best friend.,"[jqueri, new, best, friend, .]"
8,Loves twitter,"[love, twitter]"
9,how can you not love Obama? he makes jokes abo...,"[love, obama, ?, make, joke, .]"
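The processed tokens above still contain punctuation such as "." and "?", and the `string` module imported at the top is never used. A minimal sketch of a variant that filters punctuation and applies a single normalizer follows. The names `is_punctuation` and `preprocess_tokens` are hypothetical, and `str.lower` is only a stand-in for `stemmer.stem` or `lemmatizer.lemmatize` so the snippet runs on its own.

```python
import string

def is_punctuation(token):
    """Return True when every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

def preprocess_tokens(tokens, stop_words, normalize):
    """Drop stopwords and punctuation, then apply one normalizer per token."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stop_words or is_punctuation(lowered):
            continue
        cleaned.append(normalize(lowered))
    return cleaned

# Tokens of "Jquery is my new best friend." with a two-word stand-in stop list.
example_tokens = ["Jquery", "is", "my", "new", "best", "friend", "."]
example_stop_words = {"is", "my"}
result = preprocess_tokens(example_tokens, example_stop_words, normalize=str.lower)
print(result)  # ['jquery', 'new', 'best', 'friend']
```

Passing the normalizer as an argument makes it easy to compare stemming against lemmatization without running one on the output of the other.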


In [23]:
cleaned_data_path = '/content/cleaned_sentiment140.csv'
dataset[['target', 'processed']].to_csv(cleaned_data_path, index=False)
print(f"\nCleaned dataset saved to {cleaned_data_path}")


Cleaned dataset saved to /content/cleaned_sentiment140.csv
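Because `processed` holds Python lists, `to_csv` writes each cell as the list's string representation (for example `['love', 'twitter']`), and reading the file back yields strings rather than lists. The sketch below shows two possible ways to handle this. The helper names are hypothetical, and the space-joined form assumes that no token contains whitespace.

```python
import ast

def tokens_to_text(tokens):
    """Join tokens into one space-separated string for storage."""
    return " ".join(tokens)

def text_to_tokens(text):
    """Recover the token list from a space-separated string."""
    return text.split()

processed = ["love", "twitter"]

# Option 1: store a plain string and split it again after loading.
stored = tokens_to_text(processed)
restored = text_to_tokens(stored)

# Option 2: parse the list representation found in an existing CSV cell.
parsed = ast.literal_eval("['love', 'twitter']")

print(stored, restored, parsed)
```

In the notebook, option 1 could be applied before saving with something like `dataset['processed'].apply(tokens_to_text)`.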


## **Conclusion**

This text preprocessing pipeline demonstrates:
- How to clean and prepare textual data for NLP tasks.
- Practical implementations of tokenization, stopword removal, stemming, and lemmatization.

### **Learnings**
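1. Text preprocessing shapes the quality of the input that NLP models receive.
2. The choice between stemming and lemmatization depends on the use case: stemming is fast but can produce non-words, while lemmatization returns dictionary forms but needs part-of-speech information to work well.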
3. Each step in the pipeline has a specific role in transforming raw text into meaningful data.