Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

On social media platforms like Twitter and Instagram, many people comment on a particular event or scenario, and these comments may convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

Given a comment, we will use classical NLP techniques to classify which emotion it expresses.

We will use techniques like Bag of Words, n-grams, and TF-IDF for text representation and apply different classification algorithms.
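To make the n-gram idea concrete, here is a minimal pure-Python sketch of how a bag of n-grams is counted. The helper `word_ngrams` is hypothetical and uses plain whitespace tokenization; scikit-learn's `CountVectorizer` tokenizes differently (by default it drops single-character tokens such as "i"), so treat this as an illustration rather than its actual implementation.

```python
from collections import Counter


def word_ngrams(text, n_min, n_max):
    """Return all word n-grams with lengths from n_min to n_max (inclusive)."""
    tokens = text.lower().split()  # simplified whitespace tokenization (assumption)
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


comment = "i am feeling grouchy"

# Counterpart of ngram_range=(1, 2): unigrams and bigrams together
unigrams_and_bigrams = Counter(word_ngrams(comment, 1, 2))

# Counterpart of ngram_range=(3, 3): trigrams only
trigrams_only = Counter(word_ngrams(comment, 3, 3))

print(sorted(unigrams_and_bigrams))
print(sorted(trigrams_only))
```

With trigrams only, this four-word comment yields just two features, which hints at how sparse a trigram-only representation can become on short comments.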

#### About Data: Emotion Detection
Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

This data consists of two columns:

- Comment
- Emotion

Comment is a statement or message about a particular event or situation.

Emotion is the label of the comment. The training data contains six emotions: joy 😂, sadness, anger 😡, fear 😨, love, and surprise.

As there are more than two classes, this is a multi-class classification problem.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_table('C:\\Users\\User\\Desktop\\Datasets\\Emotion detection\\train.txt', sep=';')
df.head()

|  | i didnt feel humiliated | sadness |
|---|---|---|
| 0 | i can go from feeling so hopeless to so damned... | sadness |
| 1 | im grabbing a minute to post i feel greedy wrong | anger |
| 2 | i am ever feeling nostalgic about the fireplac... | love |
| 3 | i am feeling grouchy | anger |
| 4 | ive been feeling a little burdened lately wasn... | sadness |
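Note that the header row in this output is actually the first record of the file: the file has no header line, so `read_table` used that record as column names. A minimal sketch of one way to avoid this is to pass `header=None` together with `names`. The in-memory `sample` below is a hypothetical stand-in for the first lines of `train.txt`; loading the real file this way would keep one more row than the outputs shown in the rest of this notebook, which were produced with the original load.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the first lines of train.txt
sample = io.StringIO(
    "i didnt feel humiliated;sadness\n"
    "im grabbing a minute to post i feel greedy wrong;anger\n"
    "i am feeling grouchy;anger\n"
)

# header=None stops pandas from treating the first record as column names;
# names supplies the column names explicitly
sample_df = pd.read_csv(sample, sep=';', header=None, names=['comment', 'emotion'])

print(sample_df.shape)
print(sample_df.loc[0, 'comment'])
```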


In [6]:
column_name = ['comment', 'emotion']

In [7]:
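# Rename the two existing columns in place.
# pd.DataFrame(df, columns=column_name) would instead look up columns with those
# names, which do not exist yet, and return columns filled with NaN.
df.columns = column_name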

In [8]:
df.head()

|  | comment | emotion |
|---|---|---|
| 0 | i can go from feeling so hopeless to so damned... | sadness |
| 1 | im grabbing a minute to post i feel greedy wrong | anger |
| 2 | i am ever feeling nostalgic about the fireplac... | love |
| 3 | i am feeling grouchy | anger |
| 4 | ive been feeling a little burdened lately wasn... | sadness |


In [9]:
df.shape

(15999, 2)

In [10]:
df['emotion'].value_counts()

emotion
joy         5362
sadness     4665
anger       2159
fear        1937
love        1304
surprise     572
Name: count, dtype: int64

In [11]:
from sklearn.preprocessing import LabelEncoder

In [12]:
encoder = LabelEncoder()

df['emotion'] = encoder.fit_transform(df['emotion'])

In [13]:
df.head()

|  | comment | emotion |
|---|---|---|
| 0 | i can go from feeling so hopeless to so damned... | 4 |
| 1 | im grabbing a minute to post i feel greedy wrong | 0 |
| 2 | i am ever feeling nostalgic about the fireplac... | 3 |
| 3 | i am feeling grouchy | 0 |
| 4 | ive been feeling a little burdened lately wasn... | 4 |
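The classification reports later in this notebook list classes as 0 to 5, so it helps to know which emotion each code stands for. `LabelEncoder` stores its classes in sorted order in `classes_`, and each label's code is its position in that array. The sketch below uses a separate, hypothetical `demo_encoder` fitted on the six emotion names so it runs on its own; in the notebook the same information is available from `encoder.classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# The six emotion labels seen in the value_counts() output
emotions = ['joy', 'sadness', 'anger', 'fear', 'love', 'surprise']

demo_encoder = LabelEncoder()
demo_encoder.fit(emotions)

# classes_ is sorted; the integer code of a label is its index in classes_
label_map = {str(name): code for code, name in enumerate(demo_encoder.classes_)}
print(label_map)
```

If readable labels are preferred in the reports, `classification_report` also accepts a `target_names` argument, to which the class names can be passed in this same order.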


In [14]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(df.comment, df.emotion, test_size=0.25, random_state=0, stratify=df.emotion)
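The `stratify` argument matters here because the classes are imbalanced (for example, `surprise` has only 572 comments). A small sketch on hypothetical toy data shows the effect: with `stratify`, the train and test sets keep the same class proportions as the full data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 comments of class 0 and 20 of class 1
texts = [f"comment {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

_, _, toy_y_train, toy_y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Both splits keep the original 80/20 class ratio
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```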

### Without text preprocessing

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

#### Attempt 1 :

Use the scikit-learn Pipeline class to build a classification pipeline for the data.

#### Note:

- Use CountVectorizer with trigrams only.
- Use RandomForest as the classifier.
- Print the classification report.

In [16]:
clf = Pipeline([
    ('cv_trigrams', CountVectorizer(ngram_range = (3,3))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.56      0.15      0.24       540
           1       0.66      0.20      0.31       484
           2       0.40      0.88      0.55      1341
           3       0.55      0.10      0.17       326
           4       0.56      0.34      0.42      1166
           5       0.68      0.09      0.16       143

    accuracy                           0.45      4000
   macro avg       0.57      0.29      0.31      4000
weighted avg       0.52      0.45      0.40      4000



#### Attempt 2 :

Use the scikit-learn Pipeline class to build a classification pipeline for the data.

#### Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use RandomForest as the classifier.
- Print the classification report.

In [17]:
clf1 = Pipeline([
    ('cv_bigrams', CountVectorizer(ngram_range = (1,2))),
    ('rf', RandomForestClassifier())
])

clf1.fit(X_train,y_train)
y_pred1 = clf1.predict(X_test)
print(classification_report(y_test,y_pred1))

              precision    recall  f1-score   support

           0       0.89      0.78      0.83       540
           1       0.90      0.74      0.81       484
           2       0.78      0.96      0.86      1341
           3       0.89      0.64      0.74       326
           4       0.91      0.88      0.89      1166
           5       0.83      0.66      0.73       143

    accuracy                           0.85      4000
   macro avg       0.87      0.78      0.81      4000
weighted avg       0.86      0.85      0.85      4000



#### Attempt 3 :

Use the scikit-learn Pipeline class to build a classification pipeline for the data.

#### Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use Multinomial Naive Bayes as the classifier.
- Print the classification report.

In [18]:
clf2 = Pipeline([
    ('cv_bigrams', CountVectorizer(ngram_range = (1,2))),
    ('NB', MultinomialNB())
])

clf2.fit(X_train,y_train)
y_pred2 = clf2.predict(X_test)
print(classification_report(y_test,y_pred2))

              precision    recall  f1-score   support

           0       0.92      0.33      0.49       540
           1       0.90      0.25      0.39       484
           2       0.61      0.96      0.75      1341
           3       0.92      0.07      0.14       326
           4       0.67      0.89      0.77      1166
           5       1.00      0.01      0.01       143

    accuracy                           0.66      4000
   macro avg       0.84      0.42      0.42      4000
weighted avg       0.75      0.66      0.60      4000



#### Attempt 4 :

Use the scikit-learn Pipeline class to build a classification pipeline for the data.

#### Note:

- Use the TF-IDF vectorizer for text representation.
- Use RandomForest as the classifier.
- Print the classification report.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

clf3 = Pipeline([
    ('tf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

clf3.fit(X_train,y_train)
y_pred3 = clf3.predict(X_test)
print(classification_report(y_test,y_pred3))

              precision    recall  f1-score   support

           0       0.89      0.81      0.85       540
           1       0.86      0.79      0.83       484
           2       0.81      0.94      0.87      1341
           3       0.87      0.67      0.76       326
           4       0.92      0.88      0.90      1166
           5       0.84      0.73      0.78       143

    accuracy                           0.86      4000
   macro avg       0.86      0.80      0.83      4000
weighted avg       0.86      0.86      0.86      4000



### With text preprocessing

In [21]:
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [22]:
lemmatizer = WordNetLemmatizer()

In [26]:
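# Build the stop-word set once; calling stopwords.words('english') for every word is slow
stop_words = set(stopwords.words('english'))

def preprocessing(text):
    """Keep letters only, lowercase, drop stop words and lemmatize each sentence of text."""
    cleaned_sentences = []

    for sentence in nltk.sent_tokenize(text):
        # Replace every non-letter with a space, then lowercase and split into words
        words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()

        # Remove stop words and reduce the remaining words to their lemma
        words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
        cleaned_sentences.append(' '.join(words))
    return ' '.join(cleaned_sentences)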

In [27]:
df['comment'][0]

'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake'

In [28]:
preprocessing(df['comment'][0])

'go feeling hopeless damned hopeful around someone care awake'

In [29]:
df['preprocessed_txt'] = df['comment'].apply(preprocessing)

In [30]:
X_train,X_test,y_train,y_test = train_test_split(df.preprocessed_txt, df.emotion, test_size=0.25, random_state=0, stratify=df.emotion)

#### Attempt 1 :

Use the scikit-learn Pipeline class to build a classification pipeline for the data.

#### Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use RandomForest as the classifier.
- Print the classification report.


In [31]:
clf4 = Pipeline([
    ('cv_bigrams', CountVectorizer(ngram_range = (1,2))),
    ('rf', RandomForestClassifier())
])

clf4.fit(X_train,y_train)
y_pred4 = clf4.predict(X_test)
print(classification_report(y_test,y_pred4))

              precision    recall  f1-score   support

           0       0.85      0.90      0.88       540
           1       0.92      0.77      0.84       484
           2       0.90      0.94      0.92      1341
           3       0.84      0.77      0.80       326
           4       0.94      0.94      0.94      1166
           5       0.74      0.87      0.80       143

    accuracy                           0.90      4000
   macro avg       0.87      0.86      0.86      4000
weighted avg       0.90      0.90      0.89      4000



#### Attempt 2 :

Use the scikit-learn Pipeline class to build a classification pipeline for the data.

#### Note:

- Use the TF-IDF vectorizer for text representation of the preprocessed text.
- Use RandomForest as the classifier.
- Print the classification report.

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [35]:
clf5 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

clf5.fit(X_train,y_train)
y_pred5 = clf5.predict(X_test)
print(classification_report(y_test,y_pred5))

              precision    recall  f1-score   support

           0       0.87      0.85      0.86       540
           1       0.85      0.83      0.84       484
           2       0.87      0.92      0.90      1341
           3       0.81      0.71      0.76       326
           4       0.93      0.92      0.93      1166
           5       0.77      0.77      0.77       143

    accuracy                           0.88      4000
   macro avg       0.85      0.84      0.84      4000
weighted avg       0.88      0.88      0.88      4000



### Final Observations
In this exercise we trained Multinomial Naive Bayes and Random Forest classifiers, two algorithms that are widely used for text classification.

As machine learning algorithms do not work on raw text directly, the text must first be converted into numeric vectors, which are then fed to the model during training. For this purpose, we used Bag of Words (unigrams, bigrams, and other n-grams) and TF-IDF text representations.
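As a worked illustration of how TF-IDF turns a comment into numbers, here is a minimal pure-Python sketch. It uses the smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which are scikit-learn's `TfidfVectorizer` defaults. The toy comments, the whitespace tokenization, and the helper names `idf` and `tfidf` are assumptions made for the example, not the library's code.

```python
import math
from collections import Counter

# Hypothetical toy corpus of three short comments
docs = [
    "i am feeling grouchy",
    "i am feeling nostalgic",
    "i feel greedy",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of comments each word appears in
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenized comment."""
    counts = Counter(tokens)
    weights = {word: count * idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}


vector = tfidf(tokenized[0])
for word, weight in vector.items():
    print(f"{word}: {weight:.3f}")
```

The word "i" appears in every comment and so gets the lowest weight, while "grouchy" appears in only one comment and gets the highest. Bag of Words would give both a count of 1.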

#### Key Findings

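Using only trigrams performed far worse than combining unigrams and bigrams: with Random Forest on the raw text, accuracy was 0.45 versus 0.85.

Preprocessing improved the results: with Random Forest, accuracy rose from 0.85 to 0.90 with Bag of Words (unigrams and bigrams) and from 0.86 to 0.88 with TF-IDF.

TF-IDF and Bag of Words performed comparably: macro-average F1 was 0.83 versus 0.81 without preprocessing and 0.84 versus 0.86 with it.

Random Forest clearly outperformed Multinomial Naive Bayes on the same unigram and bigram counts (accuracy 0.85 versus 0.66). Naive Bayes had very low recall on the rarer classes.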
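A comparison like the one behind these findings can be made repeatable by looping over named pipelines and collecting one metric per configuration. The sketch below is only an illustration: the toy comments and labels are hypothetical stand-ins for `X_train`, `y_train`, `X_test`, and `y_test`, and `random_state=0` is an addition that the notebook's pipelines do not set. Since `RandomForestClassifier()` is unseeded in the cells above, rerunning them may shift the reported numbers slightly.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; labels follow the notebook's encoding (0 = anger, 2 = joy)
toy_train = [
    "i feel so happy today", "what a joyful lovely day",
    "i am feeling grouchy", "this makes me so angry",
    "i feel happy and joyful", "i am angry and grouchy",
]
toy_train_labels = [2, 2, 0, 0, 2, 0]
toy_test = ["feeling happy", "feeling angry"]
toy_test_labels = [2, 0]

# One named pipeline per configuration to compare
candidates = {
    "bow(1,2) + random forest": Pipeline([
        ("vec", CountVectorizer(ngram_range=(1, 2))),
        ("clf", RandomForestClassifier(random_state=0)),
    ]),
    "bow(1,2) + naive bayes": Pipeline([
        ("vec", CountVectorizer(ngram_range=(1, 2))),
        ("clf", MultinomialNB()),
    ]),
    "tfidf + random forest": Pipeline([
        ("vec", TfidfVectorizer()),
        ("clf", RandomForestClassifier(random_state=0)),
    ]),
}

# Fit each pipeline and record its macro-average F1 on the test comments
scores = {}
for name, pipeline in candidates.items():
    pipeline.fit(toy_train, toy_train_labels)
    predictions = pipeline.predict(toy_test)
    scores[name] = f1_score(toy_test_labels, predictions, average="macro", zero_division=0)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: macro F1 = {score:.2f}")
```

Macro-average F1 is used here because it weights all classes equally, which is relevant when some emotions are much rarer than others.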

#### Machine learning is an iterative, trial-and-error process: we try the candidate algorithms and select the one that gives good results while meeting requirements such as latency and interpretability.