# Project Name: Sentiment Analysis of comments.

### Goal: To predict sentiment of comments using Natural Language Processing

**Dataset**: IIT - KGP Kshitij AI Hackathon dataset

**About the Project**: With the increasing reliance on online platforms for reviews, feedback, and social media interactions, understanding customer sentiment has become a critical task for businesses across industries. Sentiment analysis allows companies to gain deeper insights into customer opinions, which can help improve products, services, and customer satisfaction. However, interpreting the true sentiment behind vast amounts of text data, such as reviews, comments, and social media posts, can be challenging due to the nuances of language, tone, and context.

To address this, we are developing a Natural Language Processing (NLP) model that analyzes customer feedback and classifies the sentiment expressed in the text. By training the model on a variety of text data with labeled sentiments, it will be able to accurately predict whether a given review or comment is positive, negative, or neutral. This tool will help businesses quickly identify areas of concern and opportunities for improvement by analyzing large volumes of customer feedback. The ultimate goal is to provide actionable insights that can enhance decision-making, improve customer relationships, and drive business success.

### Section 1: Data Collection

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
url = r"C:\Users\user\Desktop\Code Fragments\Machine Learning\Datasets\AI-Hackathon-test-data-set.csv"
data = pd.read_csv(url)
data

Unnamed: 0,"""Comments,Sentiment"""
0,"""I love the new features in the update.,Positive"""
1,"""Fast and efficient customer support.,Positive"""
2,"""The product quality is amazing!,Positive"""
3,"""Fast and efficient customer support.,Positive"""
4,"""I am disappointed with the delivery.,Neutral"""
...,...
19998,"""Fast and efficient customer support.,Positive"""
19999,"""The room was clean but noisy.,Neutral"""
20000,"""The product quality is amazing!,Positive"""
20001,"""I am disappointed with the delivery.,Negative"""


We have ~20,000 rows in our dataset, but each row was read as a single quoted string, so the comment and its sentiment share one column instead of two.<br>
We'll have to split them manually.

In [5]:
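# Fixed-width slices: the label is taken as the 8 characters before the closing quote.
# "Neutral" has only 7, so those rows yield ",Neutral" and the comment loses its last character.
data["Sentiment"] = data['"Comments,Sentiment"'].apply(lambda x: x[-9:-1])
data["Comments"] = data['"Comments,Sentiment"'].apply(lambda x: x[1:-10])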
data

Unnamed: 0,"""Comments,Sentiment""",Sentiment,Comments
0,"""I love the new features in the update.,Positive""",Positive,I love the new features in the update.
1,"""Fast and efficient customer support.,Positive""",Positive,Fast and efficient customer support.
2,"""The product quality is amazing!,Positive""",Positive,The product quality is amazing!
3,"""Fast and efficient customer support.,Positive""",Positive,Fast and efficient customer support.
4,"""I am disappointed with the delivery.,Neutral""",",Neutral",I am disappointed with the delivery
...,...,...,...
19998,"""Fast and efficient customer support.,Positive""",Positive,Fast and efficient customer support.
19999,"""The room was clean but noisy.,Neutral""",",Neutral",The room was clean but noisy
20000,"""The product quality is amazing!,Positive""",Positive,The product quality is amazing!
20001,"""I am disappointed with the delivery.,Negative""",Negative,I am disappointed with the delivery.
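
The fixed-width slices used above assume an 8-character label, which is why `Neutral` rows show up as `,Neutral` and lose the final character of the comment. A minimal sketch of a delimiter-based alternative, assuming every raw value has the form `"<comment>,<label>"` as in the rows shown (the `raw` frame below is made-up sample data, not the hackathon file):

```python
import pandas as pd

# Hypothetical sample mimicking the single combined column of the raw file.
raw = pd.DataFrame({'"Comments,Sentiment"': [
    '"I love the new features in the update.,Positive"',
    '"The room was clean but noisy.,Neutral"',
    '"The service was okay, not great.,Negative"',
]})

# Remove the surrounding quotes, then split once on the LAST comma only.
col = raw['"Comments,Sentiment"'].str.strip('"')
parts = col.str.rsplit(",", n=1, expand=True)

raw["Comments"] = parts[0]
raw["Sentiment"] = parts[1]
print(raw[["Comments", "Sentiment"]])
```

Splitting from the right with `n=1` leaves commas inside the comment itself untouched, as in "The service was okay, not great.".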


Now that the comments & sentiments are separated, we can drop the original combined column.

In [7]:
data.drop(columns = ['"Comments,Sentiment"'], inplace = True)
data

Unnamed: 0,Sentiment,Comments
0,Positive,I love the new features in the update.
1,Positive,Fast and efficient customer support.
2,Positive,The product quality is amazing!
3,Positive,Fast and efficient customer support.
4,",Neutral",I am disappointed with the delivery
...,...,...
19998,Positive,Fast and efficient customer support.
19999,",Neutral",The room was clean but noisy
20000,Positive,The product quality is amazing!
20001,Negative,I am disappointed with the delivery.


Since our dataframe is ready, let's move on to the data manipulation phase.

### Section 2: Data Cleaning / Manipulation

First, we encode the sentiment labels as integers. The neutral key is `,Neutral` because the slicing step left a leading comma on that label.

In [9]:
d = {
    "Positive": 1,
    "Negative": -1,
    ",Neutral": 0
}
data["Sentiment"] = data["Sentiment"].map(d)
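
`Series.map` returns `NaN` for any label that is missing from the dictionary, so a stray variant of a label would silently become a null value. A small sketch, on made-up labels, of normalising the leading comma first and then checking that nothing was left unmapped (`labels` and `mapping` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical labels, including the ",Neutral" form produced by the slicing step.
labels = pd.Series(["Positive", ",Neutral", "Negative", "Neutral"])

mapping = {"Positive": 1, "Negative": -1, "Neutral": 0}

# Strip any leading comma so both spellings of the neutral label share one key.
encoded = labels.str.lstrip(",").map(mapping)

# map() yields NaN for unknown keys; fail loudly if that happened.
assert encoded.notna().all(), "found labels that are not in the mapping"
print(encoded.tolist())
```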

Let's start by checking for null values and empty comments.

In [11]:
print(data[data["Comments"].isnull()])
print(data[data["Sentiment"].isnull()])
print(data[data['Comments'] == ''])

Empty DataFrame
Columns: [Sentiment, Comments]
Index: []
Empty DataFrame
Columns: [Sentiment, Comments]
Index: []
       Sentiment Comments
86             0         
153           -1         
180            1         
181           -1         
202            1         
...          ...      ...
19854         -1         
19903         -1         
19941          1         
19952         -1         
19993          1         

[1001 rows x 2 columns]


In [13]:
data = data[data["Comments"]!='']
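
The filter above matches only exact empty strings. A hedged sketch, on made-up rows, of a single mask that also catches nulls and whitespace-only comments, in case such values occur (`toy`, `is_blank` and `clean` are illustrative names):

```python
import pandas as pd

# Hypothetical rows: one real comment, one empty, one whitespace-only, one null.
toy = pd.DataFrame({
    "Comments": ["good", "", "   ", None],
    "Sentiment": [1, 0, -1, 1],
})

# A comment counts as blank if it is null or empty after stripping whitespace.
is_blank = toy["Comments"].isna() | (toy["Comments"].str.strip() == "")

clean = toy[~is_blank]
print(len(toy), "rows before,", len(clean), "rows after")
```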

As observed, our dataset has no null values, but 1001 rows had an empty comment string; those rows are dropped above.<br><br>
Textual datasets are often redundant, so let's look at redundancy next.

In [15]:
data["Sentiment"].value_counts()

Sentiment
 1    7461
-1    7358
 0    4183
Name: count, dtype: int64

As we can see, there is a deficit of neutral values. This may bias the model against the neutral class.<br>
We have to deal with this imbalance.

In [17]:
data["Comments"].value_counts()

Comments
I love the new features in the update.    1827
The product quality is amazing!           1811
Fast and efficient customer support.      1805
The app interface is confusing.           1801
I am disappointed with the delivery.      1799
Excellent ambiance and friendly staff.    1796
The food was cold and tasteless.          1793
Very poor customer service experience.    1793
The service was okay, not great           1712
The room was clean but noisy              1711
The room was clean but noisy.              200
The service was okay, not great.           193
I am disappointed with the delivery        106
Excellent ambiance and friendly staff      104
Very poor customer service experience      100
The food was cold and tasteless             95
I love the new features in the update       91
Fast and efficient customer support         90
The app interface is confusing              87
The product quality is amazing              86
Blah blah blah                               1
###r

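As we can see, our dataset is extremely poor in terms of redundancy & annotations: a handful of distinct comments account for almost every row, and most of them appear both with and without a trailing period.<br>
With the occurrence frequency in mind, the 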

Countering the problems -

- redundancy : we do not try to remove the redundancy, since dropping duplicates would greatly reduce the data size.
- annotations : where the same data is annotated differently, we can adopt the annotation assigned to the majority of the sub-data.
- imbalance : we train using about 4k values per class, and then fine-tune while adding the leftover data.
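
The majority rule from the list above can be sketched as follows. This is only an illustration on made-up rows, not the code used in this notebook: each comment is given the sentiment that occurs most often for that exact text (`toy` and `majority` are hypothetical names, and ties resolve to the smallest label).

```python
import pandas as pd

# Hypothetical rows where the same comment carries conflicting labels.
toy = pd.DataFrame({
    "Comments": ["good app", "good app", "good app", "bad food", "bad food"],
    "Sentiment": [1, 1, -1, -1, -1],
})

# Most frequent label per comment; mode() is sorted, so ties pick the smallest label.
majority = toy.groupby("Comments")["Sentiment"].agg(lambda s: s.mode().iloc[0])

# Overwrite every row's label with the majority label of its comment.
toy["Sentiment"] = toy["Comments"].map(majority)
print(toy)
```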

#### Section 2.1 : Addressing Annotation Errors

In [19]:
unique_coms = data['Comments'].unique()
unique_coms

array(['I love the new features in the update.',
       'Fast and efficient customer support.',
       'The product quality is amazing!',
       'I am disappointed with the delivery',
       'The service was okay, not great',
       'The food was cold and tasteless.',
       'I am disappointed with the delivery.',
       'Excellent ambiance and friendly staff.',
       'Very poor customer service experience.',
       'The room was clean but noisy.', 'The app interface is confusing.',
       'The room was clean but noisy',
       'Excellent ambiance and friendly staff',
       'Very poor customer service experience',
       'The product quality is amazing',
       'The food was cold and tasteless',
       'The service was okay, not great.',
       'Fast and efficient customer support',
       'The app interface is confusing',
       'I love the new features in the update', 'Blah blah blah',
       '###random-text-12'], dtype=object)

We remove the inconsistency between the two spellings of each comment (with and without a trailing period) by keeping whichever variant occurs more often and dropping the rows of the other.

In [21]:
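for i in unique_coms:
    # f1: rows with this exact text; f2: rows with the same text plus a trailing period
    f1 = len(data[data["Comments"] == i])
    f2 = len(data[data["Comments"] == i+"."])
    # keep the more frequent variant and drop every row of the other one
    if f1>f2:
        data.drop(data[data['Comments'] == i+"."].index, inplace=True)
    else:
        data.drop(data[data['Comments'] == i].index, inplace=True)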

In [23]:
data['Comments'].unique()

array(['I love the new features in the update.',
       'Fast and efficient customer support.',
       'The product quality is amazing!',
       'The service was okay, not great',
       'The food was cold and tasteless.',
       'I am disappointed with the delivery.',
       'Excellent ambiance and friendly staff.',
       'Very poor customer service experience.',
       'The app interface is confusing.', 'The room was clean but noisy',
       'The product quality is amazing', 'Blah blah blah',
       '###random-text-12'], dtype=object)
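
The same keep-the-more-frequent-variant rule can also be written without an explicit loop. A sketch on made-up comments, assuming the only difference between variants is a trailing period (`toy`, `base`, `winner` and `keep` are illustrative names):

```python
import pandas as pd

# Hypothetical comments: each text appears with and without a trailing period.
toy = pd.Series(
    ["room noisy"] * 3 + ["room noisy."] + ["food cold."] * 2 + ["food cold"]
)

# Group the variants under a common key: the text without trailing periods.
base = toy.str.rstrip(".")

# For each key, find the variant that occurs most often.
winner = toy.groupby(base).agg(lambda s: s.value_counts().idxmax())

# Keep only the rows whose text equals the winning variant of their group.
keep = toy[toy == base.map(winner)]
print(keep.value_counts())
```

Like the loop above, this drops the rows of the rarer variant instead of rewriting them, so it also reduces the row count.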

Now that our annotation issues are resolved, we can move on to data manipulation techniques for textual data.<br>
We do the following:
- remove stop words
- lemmatize
- basic formatting

In [25]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires the NLTK "stopwords" and "wordnet" corpora to be downloaded beforehand.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def process(text):
    text = text.lower()                      # basic formatting: lowercase
    text = re.sub(r'[^\w\s]', '', text)      # strip punctuation
    words = text.split()
    # drop stop words, then lemmatize what remains
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

data["Comments"] = data["Comments"].apply(process)

We create three subsets of the dataset, one per sentiment.

In [27]:
pos_data = data[data["Sentiment"] == 1]
neg_data = data[data["Sentiment"] == -1]
net_data = data[data["Sentiment"] == 0]

We merge the three subsets back into one, keeping only the first 4000 positive and 4000 negative rows so that those classes are represented roughly equally to the neutral class.

In [29]:
use_pos = pos_data[:4000]
use_neg = neg_data[:4000]
data1 = pd.concat([use_pos, use_neg, net_data])
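
Taking the first 4000 rows of each class depends on row order. As a hedged alternative, the sketch below downsamples every class at random to the size of the smallest one using `DataFrameGroupBy.sample`; the data is made up and the names `toy`, `n_min` and `balanced` are illustrative only.

```python
import pandas as pd

# Hypothetical imbalanced data: 6 positive, 5 negative, 3 neutral rows.
toy = pd.DataFrame({
    "Sentiment": [1] * 6 + [-1] * 5 + [0] * 3,
    "Comments": [f"comment {i}" for i in range(14)],
})

# Size of the rarest class.
n_min = toy["Sentiment"].value_counts().min()

# Draw n_min random rows from each class; random_state makes the draw repeatable.
balanced = toy.groupby("Sentiment").sample(n=n_min, random_state=42)
print(balanced["Sentiment"].value_counts())
```

Unlike the cell above, this gives exactly equal class sizes but discards some neutral rows whenever the neutral class is not the smallest.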

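We shuffle the data so that the three classes are mixed throughout the dataset before it is split<br>
[ This approximates the idea of a Stratified Shuffle Split; a plain shuffle does not guarantee identical class proportions in each part ]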

In [31]:
from sklearn.utils import shuffle
data1 = shuffle(data1, random_state = 42)
data1

Unnamed: 0,Sentiment,Comments
1785,-1,app interface confusing
7532,1,fast efficient customer support
9761,-1,disappointed delivery
9807,0,product quality amazing
11814,0,room clean noisy
...,...,...
18833,0,room clean noisy
3345,-1,app interface confusing
3865,-1,food cold tasteless
2255,1,app interface confusing


In [33]:
x = data1["Comments"]
y = data1["Sentiment"]

x_train = x[:int(len(x)*0.75)]
y_train = y[:int(len(y)*0.75)]
x_test = x[int(len(x)*0.75):]
y_test = y[int(len(y)*0.75):]
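
The manual 75/25 slicing relies on the earlier shuffle to spread the classes. If exact class proportions in both parts matter, one option is scikit-learn's `train_test_split` with the `stratify` argument. A sketch on made-up data (`toy` and the `*_tr` / `*_te` names are illustrative, not the notebook's variables):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 4 rows for each of the three sentiment classes.
toy = pd.DataFrame({
    "Comments": [f"comment {i}" for i in range(12)],
    "Sentiment": [1, -1, 0] * 4,
})

# stratify keeps the class ratio identical in the train and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    toy["Comments"],
    toy["Sentiment"],
    test_size=0.25,
    stratify=toy["Sentiment"],
    random_state=42,
)
print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```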

In [35]:
x_train[3]

'fast efficient customer support'

Our dataset is now ready for training phase 1.

### Section 3: Model selection

Google's BERT is a very popular & reliable model.<br>
We will proceed with its use for our goal.

In [37]:
from transformers import BertTokenizer as BT, TFBertForSequenceClassification as tbsc
from sklearn.metrics import classification_report as cr




We use TFBertForSequenceClassification for our classification task and BertTokenizer for tokenization of the text.

In [38]:
tokenizer = BT.from_pretrained('bert-base-uncased', do_lower_case=True)

x_train_en = tokenizer.batch_encode_plus(x_train.astype(str).tolist(),
                                              padding=True, 
                                              truncation=True,
                                              max_length = 128,
                                              return_tensors='tf')
 
x_test_en= tokenizer.batch_encode_plus(x_test.astype(str).tolist(),
                                              padding=True, 
                                              truncation=True,
                                              max_length = 128,
                                              return_tensors='tf')

In [41]:
model = tbsc.from_pretrained('bert-base-uncased', num_labels=3)




All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As we have three classes, we set num_labels to 3.<br>
The model then predicts class indices in the range [0, 3), i.e. 0, 1 or 2.<br>
To avoid errors, we add 1 to all values of y_train & y_test (since the minimum value is -1), so that -1, 0, 1 become 0, 1, 2.

In [43]:
y_train = y_train + 1
y_test = y_test + 1
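
Because the labels are shifted by 1 for training, predictions have to be shifted back before they are read as sentiments. A small sketch with made-up logits (the `logits` array and `id_to_name` dictionary are illustrative; a real run would take the logits from the model's output):

```python
import numpy as np

# Hypothetical logits for three comments, one column per class index 0, 1, 2.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.0, 0.3, 1.5],
    [0.1, 0.9, 0.2],
])

# The predicted class index is the position of the largest logit.
class_ids = logits.argmax(axis=1)

# Undo the +1 shift to get back to the -1 / 0 / 1 encoding.
sentiments = class_ids - 1

id_to_name = {-1: "Negative", 0: "Neutral", 1: "Positive"}
names = [id_to_name[int(s)] for s in sentiments]
print(names)
```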

Now, we compile our model with an optimizer, a loss function and an accuracy metric.

In [45]:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
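
`from_logits=True` tells the loss that the model outputs raw scores rather than probabilities, so the softmax is applied inside the loss. As a rough illustration of that computation, here is the same quantity worked out by hand in NumPy for one made-up sample (this mirrors the formula, not TensorFlow's internal implementation):

```python
import numpy as np

# Hypothetical raw scores for one comment and its true class index.
logits = np.array([[2.0, 0.5, -1.0]])
labels = np.array([0])

# Numerically stable log-softmax: subtract the row maximum before exponentiating.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Sparse categorical cross-entropy: negative log-probability of the true class.
loss = -log_probs[np.arange(len(labels)), labels].mean()
print(round(float(loss), 4))
```

"Sparse" here means the labels are plain integers (0, 1, 2) rather than one-hot vectors, which is why the shifted `y_train` can be passed directly.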

Finally, we fit the model on our data.

In [47]:
'''
model.fit(
    [x_train_en['input_ids'], x_train_en['token_type_ids'], x_train_en['attention_mask']],
    y_train,
    validation_data=(
      [x_test_en['input_ids'], x_test_en['token_type_ids'], x_test_en['attention_mask']],y_test),
    batch_size=32,
    epochs=3
)
'''

Since training takes a long time and no accelerator is available in this environment,<br>
the model has been trained using a GPU in a Kaggle notebook<br>
( https://www.kaggle.com/code/gouravjana/notebookd86c597d7a/notebook?scriptVersionId=218325013 )<br>as a continuation of this project.<br>
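
The `classification_report` imported earlier is not used in this notebook. Once predictions are available, it could be applied along these lines; the labels below are made up purely to show the call and are not results from the trained model.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted class indices (0 = Negative, 1 = Neutral, 2 = Positive).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# Per-class precision, recall and F1, with readable class names.
report = classification_report(
    y_true,
    y_pred,
    target_names=["Negative", "Neutral", "Positive"],
)
print(report)
```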