The goal of the Emotion Detection project using Natural Language Processing (NLP) is to develop a machine learning model that can accurately classify and detect the emotions expressed in text data. The project aims to analyze and understand the emotional content of textual information, enabling applications such as sentiment analysis, customer feedback analysis, social media monitoring, and more.

### Dataset URL :

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

This dataset consists of two columns. 
- Comment : Comment are the statements or messages regarding to a particular event/situation.

- Emotion : Emotion feature tells whether the given comment is fear 😨, Anger 😡, Joy 😂.

This is a Multi-Class Classification problem.


In [2]:
#import pandas library
import pandas as pd

#read the dataset with name "Emotion_classify_Data.csv" and store it in a variable df
df_emotion = pd.read_csv("Emotion_classify_Data.csv")

#print the shape of dataframe
print(f'Shape of the Dataset : {df_emotion.shape}')

#print top 5 rows
df_emotion.head(5)

Shape of the Dataset : (5937, 2)


Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [3]:
#check the distribution of Emotion
df_emotion.Emotion.value_counts()

anger    2000
joy      2000
fear     1937
Name: Emotion, dtype: int64

In [4]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2
df_emotion['Emotion_num'] = df_emotion['Emotion'].map({"joy":0,"fear": 1, "anger" : 2})

#checking the results by printing top 5 rows
df_emotion.head(5)

Unnamed: 0,Comment,Emotion,Emotion_num
0,i seriously hate one subject to death but now ...,fear,1
1,im so full of life i feel appalled,anger,2
2,i sit here to write i start to dig out my feel...,fear,1
3,ive been really angry with r and i feel like a...,joy,0
4,i feel suspicious if there is no one outside l...,fear,1


### Modelling without Pre-processing Text data

In [5]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' splitting with test size of 20%
#Note: Give Random state 2022 and also do the stratify sampling
X_train,X_test,y_train,y_test =  train_test_split(df_emotion.Comment,df_emotion.Emotion_num,test_size=0.2,random_state=2022,stratify=df_emotion.Emotion_num)


In [6]:
#print the shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4749,)
(1188,)
(4749,)
(1188,)


### Attempt 1 :

using RandomForest as the classifier and CountVectorizer with only trigrams to classify the data.

#### RandomForest Classifier with trigrams

In [7]:
#1. create a pipeline object

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (3, 3))),                       #using the ngram_range parameter 
    ('random_forest', (RandomForestClassifier()))         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.59      0.25      0.35       400
           1       0.36      0.81      0.50       388
           2       0.53      0.21      0.30       400

    accuracy                           0.42      1188
   macro avg       0.50      0.42      0.39      1188
weighted avg       0.50      0.42      0.38      1188



### Attempt 2 :

using MultimonialNB as the classifier and CountVectorizer with unigrams and bigrams to classify the data.

#### MultimonialNB Classifier with unigrams and bigrams

In [8]:
#1. create a pipeline object

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1, 2))),                       #using the ngram_range parameter 
    ('multinomial_NB', (RandomForestClassifier()))         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.96      0.89       400
           1       0.94      0.86      0.90       388
           2       0.93      0.85      0.89       400

    accuracy                           0.89      1188
   macro avg       0.90      0.89      0.89      1188
weighted avg       0.90      0.89      0.89      1188



### Attempt 3 :

using RandomForest as the classifier and CountVectorizer with only trigrams to classify the data.

#### RandomForest Classifier with with unigrams and bigrams

In [14]:
#1. create a pipeline object

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,2))),                       #using the ngram_range parameter 
    ('random_forest', (RandomForestClassifier()))         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.97      0.90       400
           1       0.96      0.88      0.92       388
           2       0.94      0.86      0.90       400

    accuracy                           0.90      1188
   macro avg       0.91      0.90      0.90      1188
weighted avg       0.91      0.90      0.90      1188



### Attempt 4 :

using RandomForest as the classifier and use TF-IDF for preprocessing the text to classify the data.

#### RandomForest Classifier with TF-IDF

In [15]:
#1. create a pipeline object

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),                       #using the ngram_range parameter 
    ('random_forest', (RandomForestClassifier()))         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.95      0.90       400
           1       0.92      0.89      0.91       388
           2       0.93      0.86      0.89       400

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188



### Use text pre-processing to remove stop words, punctuations and apply lemmatization


In [18]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [19]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient
df_emotion["preprocessed_comment"]=df_emotion['Comment'].apply(preprocess)


In [21]:
df_emotion.head()

Unnamed: 0,Comment,Emotion,Emotion_num,preprocessed_comment
0,i seriously hate one subject to death but now ...,fear,1,seriously hate subject death feel reluctant drop
1,im so full of life i feel appalled,anger,2,m life feel appalled
2,i sit here to write i start to dig out my feel...,fear,1,sit write start dig feeling think afraid accep...
3,ive been really angry with r and i feel like a...,joy,0,ve angry r feel like idiot trust place
4,i feel suspicious if there is no one outside l...,fear,1,feel suspicious outside like rapture happen


### Build a model with pre processed text



In [22]:
#Do the 'train-test' splitting with test size of 20% with random state of 2022 and stratify sampling too
#Note: Use the preprocessed_Comment
X_train, X_test, y_train, y_test = train_test_split(
    df_emotion.preprocessed_comment, 
    df_emotion.Emotion_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df_emotion.Emotion_num
)

### Attempt 1 :

using RandomForest as the classifier and CountVectorizer with only trigrams to classify the data.

#### RandomForest Classifier with with unigrams and bigrams

In [23]:
#1. create a pipeline object

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range = (1,2))),                       #using the ngram_range parameter 
    ('random_forest', (RandomForestClassifier()))         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       400
           1       0.94      0.91      0.93       388
           2       0.91      0.93      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



### Attempt 2 :

using RandomForest as the classifier and use TF-IDF for preprocessing the text to classify the data.

#### RandomForest Classifier with TF-IDF

In [24]:
#1. create a pipeline object

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer', TfidfVectorizer()),                       #using the ngram_range parameter 
    ('random_forest', (RandomForestClassifier()))         
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.96      0.94       400
           1       0.92      0.93      0.93       388
           2       0.95      0.90      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188

