
In [1]:
import pandas as pd

import re

import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize


In [2]:
data = pd.read_csv('/content/NLP.csv')

In [3]:
data.head()

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5937 entries, 0 to 5936
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  5937 non-null   object
 1   Emotion  5937 non-null   object
dtypes: object(2)
memory usage: 92.9+ KB


Preprocessing Steps:

Text Cleaning:

Remove special characters, numbers, and punctuation, which contribute little to the emotional meaning of the text.

Convert all text to lowercase for uniformity.

Tokenization:

Split the cleaned text into individual words, the basic units for analysis.

Removal of Stopwords:

Stopwords are common words that usually carry little meaning for emotion classification. We remove them from the tokens using NLTK's predefined English stopword list.

Impact on Model Performance:

Cleaning the text reduces noise, allowing the model to focus on words that contribute to the emotional context.

Tokenization breaks the text into units that later steps can count and filter.

Removing stopwords reduces dimensionality and computation, which may improve model performance by emphasizing words that carry more emotional meaning.

Text Cleaning

In [5]:
def clean_text(text):

    # Keep only letters and whitespace (drops digits, punctuation, and symbols), then lowercase

    text = re.sub(r'[^a-zA-Z\s]', '', text)

    return text.lower()
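
As a quick check of what clean_text does to a raw string, the sketch below reuses the same function on a made-up sentence (the sample text is an assumption, not a row from the dataset). Apostrophes are deleted rather than replaced by a space, so contractions collapse into single tokens such as "im", which is consistent with the style of the comments shown by data.head().

```python
import re

def clean_text(text):
    # Same logic as the notebook cell: keep letters and whitespace, then lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

sample = "I'm SO happy!!! 100%"  # hypothetical input, not from NLP.csv
cleaned = clean_text(sample)
print(repr(cleaned))  # 'im so happy '
```

The trailing space is left in place; whitespace-based tokenization later ignores it.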


Preprocessing

In [6]:
nltk.download('punkt')

nltk.download('stopwords')
nltk.download('punkt_tab')

stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Apply cleaning and tokenization

In [7]:

data['cleaned_text'] = data['Comment'].apply(clean_text)

data['tokens'] = data['cleaned_text'].apply(word_tokenize)

data['tokens'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
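
One side effect worth knowing: NLTK's English stopword list contains negations such as "no" and "not", so they are dropped along with the other stopwords. The sketch below illustrates the effect and one possible workaround, using a small hand-written stopword set and a hypothetical helper named remove_stopwords (neither is part of NLTK or of this notebook). Whether keeping negations actually helps the classifier is an empirical question.

```python
# Hypothetical, hand-written subset of an English stopword list (not the full NLTK list)
STOP_WORDS = {"i", "am", "is", "if", "there", "no", "not", "to", "the"}

# Negation words we choose to keep; this set is an assumption for illustration
KEEP = {"no", "not"}

def remove_stopwords(tokens, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop tokens found in stop_words unless they are also listed in keep."""
    return [t for t in tokens if t not in stop_words or t in keep]

tokens = "i am not happy".split()
plain = remove_stopwords(tokens)                # negation is lost
with_negation = remove_stopwords(tokens, keep=KEEP)  # negation is preserved
print(plain)
print(with_negation)
```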

In [8]:

data

Unnamed: 0,Comment,Emotion,cleaned_text,tokens
0,i seriously hate one subject to death but now ...,fear,i seriously hate one subject to death but now ...,"[seriously, hate, one, subject, death, feel, r..."
1,im so full of life i feel appalled,anger,im so full of life i feel appalled,"[im, full, life, feel, appalled]"
2,i sit here to write i start to dig out my feel...,fear,i sit here to write i start to dig out my feel...,"[sit, write, start, dig, feelings, think, afra..."
3,ive been really angry with r and i feel like a...,joy,ive been really angry with r and i feel like a...,"[ive, really, angry, r, feel, like, idiot, tru..."
4,i feel suspicious if there is no one outside l...,fear,i feel suspicious if there is no one outside l...,"[feel, suspicious, one, outside, like, rapture..."
...,...,...,...,...
5932,i begun to feel distressed for you,fear,i begun to feel distressed for you,"[begun, feel, distressed]"
5933,i left feeling annoyed and angry thinking that...,anger,i left feeling annoyed and angry thinking that...,"[left, feeling, annoyed, angry, thinking, cent..."
5934,i were to ever get married i d have everything...,joy,i were to ever get married i d have everything...,"[ever, get, married, everything, ready, offer,..."
5935,i feel reluctant in applying there because i w...,fear,i feel reluctant in applying there because i w...,"[feel, reluctant, applying, want, able, find, ..."


Feature Extraction

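Use TfidfVectorizer for feature extraction. It transforms the text into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. Note that the vectorizer below is fit on the cleaned_text column, so the stopword-filtered tokens column is not used to build the features.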

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
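# Split on whitespace only; unlike the default token pattern, this keeps one-letter tokens such as "i".
# token_pattern=None only silences the warning that the pattern is unused when a tokenizer is given.
vectorizer = TfidfVectorizer(tokenizer=lambda text: text.split(), token_pattern=None)

# Fit on the cleaned text (stopwords are still present in this column)
X = vectorizer.fit_transform(data['cleaned_text'])

# Target labels
y = data['Emotion']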
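
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the formula scikit-learn documents for its default settings: raw term counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The function name tfidf_rows and the two-sentence corpus are assumptions for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using raw counts, smoothed IDF, and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit Euclidean length (guard against an empty document)
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["i feel happy", "i feel sad"]  # hypothetical mini-corpus
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "happy" receives a larger weight than "i" or "feel" because it appears in only one of the two documents.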



Model Development

Naive Bayes:

A probabilistic classifier based on Bayes’ theorem, suitable for text classification due to its simplicity and efficiency with large datasets.

Support Vector Machine (SVM):

A supervised learning model that can classify data by finding the optimal hyperplane that separates different classes. It works well with high-dimensional data, such as text.

Training the Models

In [12]:
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import MultinomialNB

from sklearn.svm import SVC


In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Naive Bayes

In [15]:
nb_model = MultinomialNB()

nb_model.fit(X_train, y_train)

Support Vector Machine (SVM)

In [16]:
svm_model = SVC(kernel='linear')

svm_model.fit(X_train, y_train)

Model Comparison

Evaluation Metrics

For evaluation, use accuracy and F1-score:

Accuracy: The ratio of correctly predicted observations to total observations.

F1-Score: The harmonic mean of precision and recall, which accounts for both false positives and false negatives and is particularly useful for imbalanced datasets. The code below uses average='weighted', which averages the per-class F1 scores weighted by each class's number of true instances.
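
The two metrics can be worked through by hand on a toy example. The sketch below is a pure-Python illustration under stated assumptions: the helper names accuracy and weighted_f1 and the four labels are made up, and the functions are not the scikit-learn implementations, although weighted_f1 follows the same support-weighted definition.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += n * f1
    return total / len(y_true)

# Hypothetical labels: one "joy" comment is misclassified as "fear"
y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", "anger"]

acc = accuracy(y_true, y_pred)
f1w = weighted_f1(y_true, y_pred)
print(acc, f1w)
```

Here the per-class F1 scores are 2/3 for joy, 2/3 for fear, and 1 for anger; weighting them by supports 2, 1, and 1 gives 0.75, which happens to equal the accuracy in this small case.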

In [17]:

from sklearn.metrics import accuracy_score, f1_score

In [18]:
nb_predictions = nb_model.predict(X_test)

svm_predictions = svm_model.predict(X_test)

In [19]:
nb_accuracy = accuracy_score(y_test, nb_predictions)

svm_accuracy = accuracy_score(y_test, svm_predictions)

nb_f1 = f1_score(y_test, nb_predictions, average='weighted')

svm_f1 = f1_score(y_test, svm_predictions, average='weighted')

print(f'Naive Bayes Accuracy: {nb_accuracy}, F1 Score: {nb_f1}')

print(f'SVM Accuracy: {svm_accuracy}, F1 Score: {svm_f1}')

Naive Bayes Accuracy: 0.9023569023569024, F1 Score: 0.9023080996440667
SVM Accuracy: 0.9351851851851852, F1 Score: 0.935248473037213
