#Formative Assessment: NLP - Emotion Classification in Text

Objective: 
Develop machine learning models to classify emotions in text samples.

Dataset:
https://drive.google.com/file/d/1HWczIICsMpaL8EJyu48ZvRFcXx3_pcnb/view?usp=drive_link

Key components to be fulfilled:



##1. Loading and Preprocessing

Load the dataset and perform necessary preprocessing steps. This should include text cleaning, tokenization, and removal of stopwords. Explain the preprocessing techniques used and their impact on model performance.


In [1]:
import pandas as pd

In [2]:

# Load the dataset
url = "https://drive.google.com/uc?id=1HWczIICsMpaL8EJyu48ZvRFcXx3_pcnb"
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())


                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear


##Preprocessing Steps

Text Cleaning - This step replaces non-word characters (punctuation and special symbols) with spaces, removes digits, and lowercases the text, since these elements rarely contribute to the emotional meaning.

Text cleaning normalizes the text and reduces variability in the vocabulary, making it easier for the model to learn patterns relevant to emotion classification.




In [3]:
import re

def clean_text(text):
    text = re.sub(r'\W', ' ', text)  # Remove special characters
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.lower()               # Convert to lowercase
    return text

data['cleaned_text'] = data['Comment'].apply(clean_text)

print(data.head())


                                             Comment Emotion  \
0  i seriously hate one subject to death but now ...    fear   
1                 im so full of life i feel appalled   anger   
2  i sit here to write i start to dig out my feel...    fear   
3  ive been really angry with r and i feel like a...     joy   
4  i feel suspicious if there is no one outside l...    fear   

                                        cleaned_text  
0  i seriously hate one subject to death but now ...  
1                 im so full of life i feel appalled  
2  i sit here to write i start to dig out my feel...  
3  ive been really angry with r and i feel like a...  
4  i feel suspicious if there is no one outside l...  
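
The printed rows look unchanged because these comments already appear to be lowercase and free of punctuation. To see what clean_text does on noisier input, here is a minimal sketch that reuses the function from the cell above on a made-up sentence (the sample string is an assumption, not a row from the dataset):

```python
import re

def clean_text(text):
    text = re.sub(r'\W', ' ', text)  # replace each non-word character with a space
    text = re.sub(r'\d+', '', text)  # drop digits
    return text.lower()              # convert to lowercase

# Hypothetical noisy input, not taken from the dataset
sample = "I'm SO happy!!! Won 2 tickets :)"
cleaned = clean_text(sample)

# split() collapses the runs of spaces left behind by the substitutions
print(cleaned.split())
```

The apostrophe in "I'm" is replaced by a space, so the contraction ends up as two separate tokens, "i" and "m". This is worth keeping in mind when the input contains contractions such as "can't" or "don't".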


##Tokenization

This process splits the cleaned text into individual words, or tokens.

In [4]:

import nltk

print(nltk.__version__)

# Download the punkt resource
nltk.download('punkt')

nltk.download('punkt_tab')



3.9.1


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:


# Tokenize the cleaned text into a list of word tokens
from nltk.tokenize import word_tokenize
data['tokens'] = data['cleaned_text'].apply(word_tokenize)
print(data.head())

                                             Comment Emotion  \
0  i seriously hate one subject to death but now ...    fear   
1                 im so full of life i feel appalled   anger   
2  i sit here to write i start to dig out my feel...    fear   
3  ive been really angry with r and i feel like a...     joy   
4  i feel suspicious if there is no one outside l...    fear   

                                        cleaned_text  \
0  i seriously hate one subject to death but now ...   
1                 im so full of life i feel appalled   
2  i sit here to write i start to dig out my feel...   
3  ive been really angry with r and i feel like a...   
4  i feel suspicious if there is no one outside l...   

                                              tokens  
0  [i, seriously, hate, one, subject, to, death, ...  
1        [im, so, full, of, life, i, feel, appalled]  
2  [i, sit, here, to, write, i, start, to, dig, o...  
3  [ive, been, really, angry, with, r, and, i, fe...  
4  

##Removing Stopwords

Stopwords are common words that usually add little meaning to a sentence (e.g., "and", "the", "is"). Removing them shrinks the vocabulary, which reduces dimensionality and can help the model focus on more meaningful words.

In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not stopwords
data['filtered_tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(data.head())

                                             Comment Emotion  \
0  i seriously hate one subject to death but now ...    fear   
1                 im so full of life i feel appalled   anger   
2  i sit here to write i start to dig out my feel...    fear   
3  ive been really angry with r and i feel like a...     joy   
4  i feel suspicious if there is no one outside l...    fear   

                                        cleaned_text  \
0  i seriously hate one subject to death but now ...   
1                 im so full of life i feel appalled   
2  i sit here to write i start to dig out my feel...   
3  ive been really angry with r and i feel like a...   
4  i feel suspicious if there is no one outside l...   

                                              tokens  \
0  [i, seriously, hate, one, subject, to, death, ...   
1        [im, so, full, of, life, i, feel, appalled]   
2  [i, sit, here, to, write, i, start, to, dig, o...   
3  [ive, been, really, angry, with, r, and, i, fe...  
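
One caveat for emotion data: NLTK's English stopword list contains negation words such as "not" and "no", so filtering can drop the very word that flips a sentence's meaning. The sketch below illustrates this with a small hand-written stopword set (an assumption standing in for the full NLTK list, so the example runs without any downloads) and a hypothetical remove_stopwords helper:

```python
# Small illustrative subset; the real list comes from stopwords.words('english')
stop_words = {"i", "am", "is", "the", "a", "not", "no", "to", "and"}
negations = {"not", "no"}

def remove_stopwords(tokens, stop_words):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical tokenized comment
tokens = "i am not happy".split()

plain = remove_stopwords(tokens, stop_words)              # negation is removed
kept = remove_stopwords(tokens, stop_words - negations)   # negation is preserved

print(plain)
print(kept)
```

With the plain set only "happy" survives, while excluding the negations from the stopword set keeps "not" next to "happy". Whether this matters for accuracy on this dataset would need to be measured.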

## Explain the preprocessing techniques used and their impact on model performance.

- Lowercasing: standardizes words so that the same word in different cases (e.g., "Happy" vs. "happy") is treated as one feature.
- Removing punctuation and special characters: reduces noise and focuses the model on meaningful content.
- Tokenization: converts the text into a structured list of words that can be easily analyzed.
- Removing stopwords: reduces the input dimensionality and shifts the focus to words that carry more emotional weight. Note that the vectorizers in the next section are fitted on cleaned_text rather than filtered_tokens, so stopword removal does not affect the results reported below.


##2. Feature Extraction

Implement feature extraction using CountVectorizer or TfidfVectorizer. Describe how the chosen method transforms the text data into numerical features.


Feature extraction is a crucial step in preparing text data for machine learning models. In this case, we can use either CountVectorizer or TfidfVectorizer from the sklearn.feature_extraction.text module.

##CountVectorizer

CountVectorizer converts a collection of text documents into a matrix of token counts. It counts the occurrences of each word in each document and creates a sparse matrix where:

- Rows represent documents.
- Columns represent unique words (features).
- Values represent the count of each word in the respective document.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialise CountVectorizer
count_vectorizer = CountVectorizer()

# Fit and transform the cleaned text data
x_count = count_vectorizer.fit_transform(data['cleaned_text'])

# Convert to a DataFrame for easier inspection
count_df = pd.DataFrame(x_count.toarray(), columns=count_vectorizer.get_feature_names_out())

print(count_df.head())


   aa  aac  aaron  ab  abandon  abandoned  abandonment  abbigail  abc  \
0   0    0      0   0        0          0            0         0    0   
1   0    0      0   0        0          0            0         0    0   
2   0    0      0   0        0          0            0         0    0   
3   0    0      0   0        0          0            0         0    0   
4   0    0      0   0        0          0            0         0    0   

   abdomen  ...  zendikar  zero  zest  zhu  zipline  zombies  zone  \
0        0  ...         0     0     0    0        0        0     0   
1        0  ...         0     0     0    0        0        0     0   
2        0  ...         0     0     0    0        0        0     0   
3        0  ...         0     0     0    0        0        0     0   
4        0  ...         0     0     0    0        0        0     0   

   zonisamide  zq  zumba  
0           0   0      0  
1           0   0      0  
2           0   0      0  
3           0   0      0  
4    

##TfidfVectorizer

TF-IDF stands for Term Frequency-Inverse Document Frequency. TfidfVectorizer counts word occurrences and then reweights them by how rare each word is across all documents: words that appear in most comments are down-weighted, while words concentrated in a few comments are up-weighted.



In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialise TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the cleaned text data
x_tfidf = tfidf_vectorizer.fit_transform(data['cleaned_text'])

# Convert to a DataFrame for easier inspection
tfidf_df = pd.DataFrame(x_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print(tfidf_df.head())

    aa  aac  aaron   ab  abandon  abandoned  abandonment  abbigail  abc  \
0  0.0  0.0    0.0  0.0      0.0        0.0          0.0       0.0  0.0   
1  0.0  0.0    0.0  0.0      0.0        0.0          0.0       0.0  0.0   
2  0.0  0.0    0.0  0.0      0.0        0.0          0.0       0.0  0.0   
3  0.0  0.0    0.0  0.0      0.0        0.0          0.0       0.0  0.0   
4  0.0  0.0    0.0  0.0      0.0        0.0          0.0       0.0  0.0   

   abdomen  ...  zendikar  zero  zest  zhu  zipline  zombies  zone  \
0      0.0  ...       0.0   0.0   0.0  0.0      0.0      0.0   0.0   
1      0.0  ...       0.0   0.0   0.0  0.0      0.0      0.0   0.0   
2      0.0  ...       0.0   0.0   0.0  0.0      0.0      0.0   0.0   
3      0.0  ...       0.0   0.0   0.0  0.0      0.0      0.0   0.0   
4      0.0  ...       0.0   0.0   0.0  0.0      0.0      0.0   0.0   

   zonisamide   zq  zumba  
0         0.0  0.0    0.0  
1         0.0  0.0    0.0  
2         0.0  0.0    0.0  
3         0.0  0

##3. Model Development

Train the following machine learning models:

a) Naive Bayes

b) Support Vector Machine


To train machine learning models for emotion classification, we can use the Naive Bayes and Support Vector Machine (SVM) algorithms. Below, I'll outline the steps to develop and train these models using the features extracted with TfidfVectorizer.




In [10]:
# Split the data into training (80%) and test (20%) sets

from sklearn.model_selection import train_test_split

# x_tfidf contains the features and data['Emotion'] is the target variable
X_train, X_test, y_train, y_test = train_test_split(x_tfidf, data['Emotion'], test_size=0.2, random_state=42)
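
In the cells above, the vectorizer is fitted on the whole dataset before the split, so the vocabulary and IDF weights also reflect the test comments. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a Pipeline, so that the vectorizer only ever sees training text. A minimal sketch, assuming a small made-up corpus in place of data['cleaned_text'] and data['Emotion']:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for data['cleaned_text'] and data['Emotion']
texts = [
    "i feel so angry today", "this makes me furious",
    "i am mad at him", "angry and annoyed again",
    "i feel scared tonight", "this is terrifying",
    "i am afraid of it", "scared and nervous again",
    "i feel so happy today", "this is wonderful",
    "i am glad for her", "happy and cheerful again",
]
labels = ["anger"] * 4 + ["fear"] * 4 + ["joy"] * 4

# Split the raw text first; stratify keeps the class balance in both parts
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train_text, y_train)
preds = model.predict(X_test_text)

vectorizer = model.named_steps["tfidfvectorizer"]
print(len(vectorizer.vocabulary_), "features learned from the training text")
```

The same pipeline object can then be passed to cross-validation or grid search without refitting the vectorizer by hand.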

## Training Naive Bayes Model


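Naive Bayes is a simple yet effective algorithm for text classification. We'll use the Multinomial Naive Bayes variant, which is designed for discrete features such as word counts. The scikit-learn documentation notes that fractional counts such as TF-IDF values also work in practice, and those are the features used here.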

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_nb = nb_model.predict(X_test)

# Evaluate the model
accuracy_nb = accuracy_score(y_test, y_pred_nb)
report_nb = classification_report(y_test, y_pred_nb)

print(f"Naive Bayes Accuracy: {accuracy_nb:.2f}")
print("Classification Report:\n", report_nb)


Naive Bayes Accuracy: 0.90
Classification Report:
               precision    recall  f1-score   support

       anger       0.87      0.94      0.90       392
        fear       0.92      0.89      0.90       416
         joy       0.91      0.88      0.90       380

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188
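
The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it for the anger row; because the report rounds precision and recall to two decimals, a recomputed value can differ slightly from the printed one for other rows. The helper name f1_from is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the anger row of the Naive Bayes report above
anger_f1 = f1_from(0.87, 0.94)
print(round(anger_f1, 2))
```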



## Training Support Vector Machine (SVM) Model
A Support Vector Machine (SVM) separates classes by finding the decision boundary with the largest margin. We'll use a linear kernel, which is a common choice for high-dimensional, sparse text features.



In [12]:
from sklearn.svm import SVC

# Train the SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred_svm = svm_model.predict(X_test)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
report_svm = classification_report(y_test, y_pred_svm)

print(f"SVM Accuracy: {accuracy_svm:.2f}")
print("Classification Report:\n", report_svm)


SVM Accuracy: 0.94
Classification Report:
               precision    recall  f1-score   support

       anger       0.91      0.94      0.92       392
        fear       0.97      0.91      0.94       416
         joy       0.93      0.96      0.94       380

    accuracy                           0.94      1188
   macro avg       0.94      0.94      0.94      1188
weighted avg       0.94      0.94      0.94      1188



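On this test split (1,188 samples), the SVM model reaches 94% accuracy compared to 90% for Naive Bayes. The largest gains are in precision for fear (0.97 vs. 0.92) and recall for joy (0.96 vs. 0.88). This suggests that the linear SVM is the better fit for this emotion classification task, although the comparison rests on a single train/test split with default hyperparameters.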
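
A single split can favour one model by chance. One way to check whether the gap holds is k-fold cross-validation with the vectorizer inside each pipeline. The sketch below is an illustration on a tiny made-up corpus, so its scores mean nothing in themselves; the compare_models helper is hypothetical, and on the real data texts and labels would be data['cleaned_text'] and data['Emotion'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def compare_models(texts, labels, cv=3):
    """Return the mean cross-validated accuracy of each model."""
    models = {
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        "linear_svm": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
    }
    return {
        name: cross_val_score(model, texts, labels, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }

# Hypothetical toy corpus, used only to show that the helper runs
texts = [
    "so angry", "really furious", "mad and angry", "furious again",
    "so scared", "really afraid", "scared and afraid", "terrified again",
    "so happy", "really glad", "happy and glad", "cheerful again",
]
labels = ["anger"] * 4 + ["fear"] * 4 + ["joy"] * 4

scores = compare_models(texts, labels, cv=3)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

If the SVM stays ahead across folds on the full dataset, the conclusion above is better supported than by the single split alone.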
