# Sentiment Analysis
This notebook performs sentiment analysis on three datasets.
Several algorithms are trained on each dataset and compared by test accuracy to pick the best one.

**Note:** For the YouTube and Yelp datasets, we tried training additional algorithms to find the best model. However, because the datasets are large and TF-IDF produces very high-dimensional vectors, training time grew prohibitively. Hence, only a few models appear in the final results.
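
One possible way to keep training time manageable on large TF-IDF matrices is to prune rare terms and use a linear model trained with stochastic gradient descent. The sketch below is not part of the original experiments: the toy corpus, the `min_df` value, and the choice of `SGDClassifier` are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real comments (hypothetical data).
texts = [
    "great video", "loved this video", "terrible video",
    "hated this", "great content", "terrible content",
]
labels = [1, 1, 0, 0, 1, 0]

# min_df drops terms seen in fewer than 2 documents; max_features caps the vocabulary.
# SGDClassifier with its default hinge loss fits a linear SVM one sample at a time.
fast_model = make_pipeline(
    TfidfVectorizer(max_features=50_000, min_df=2),
    SGDClassifier(random_state=0),
)
fast_model.fit(texts, labels)

n_features = len(fast_model.named_steps["tfidfvectorizer"].vocabulary_)
preds = fast_model.predict(["great", "terrible"])
print(n_features, preds)
```

On real data, `min_df` and `max_features` would need tuning, since pruning too aggressively can remove informative terms.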

### YouTube Comments Sentiment Analysis






In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("amaanpoonawala/youtube-comments-sentiment-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/youtube-comments-sentiment-dataset


In [None]:
!ls /root/.cache/kagglehub/datasets/amaanpoonawala/youtube-comments-sentiment-dataset/versions/1

youtube_comments_cleaned.csv


In [76]:
#Importing libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

In [None]:
#Reading the dataset
df = pd.read_csv("/root/.cache/kagglehub/datasets/amaanpoonawala/youtube-comments-sentiment-dataset/versions/1/youtube_comments_cleaned.csv")
df = df[['CommentText','Sentiment']]
df.head()

Unnamed: 0,CommentText,Sentiment
0,Anyone know what movie this is?,Neutral
1,The fact they're holding each other back while...,Positive
2,waiting next video will be?,Neutral
3,Thanks for the great video.\n\nI don't underst...,Neutral
4,Good person helping good people.\nThis is how ...,Positive


In [None]:
df.isna().sum()

Unnamed: 0,0
CommentText,0
Sentiment,0


In [None]:
df.duplicated().sum()

np.int64(40484)
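
The notebook counts the duplicate rows but keeps them. If the duplicates were to be removed before vectorizing (an assumption, not what the cells below do), a minimal sketch on a hypothetical toy frame could look like this. Identical rows left in place can land on both sides of a later train/test split.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the YouTube dataset.
toy = pd.DataFrame({
    "CommentText": ["nice", "nice", "bad", "nice"],
    "Sentiment": ["Positive", "Positive", "Negative", "Neutral"],
})

# A row counts as a duplicate only if every column matches an earlier row.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```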

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1032225 entries, 0 to 1032224
Data columns (total 2 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   CommentText  1032225 non-null  object
 1   Sentiment    1032225 non-null  object
dtypes: object(2)
memory usage: 15.8+ MB


In [None]:
#Splitting into target and text
x = df['CommentText']
y = df['Sentiment']
y = y.map({'Negative':-1,'Positive':1,'Neutral':0})

In [None]:
#Using a TFIDF Vectorizer to convert the texts into vector representations
tfidf = TfidfVectorizer(max_features=100000)
x = tfidf.fit_transform(x)

In [None]:
#Splitting into test and train
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.25, random_state = 47)

In [None]:
xtrain[0]

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2 stored elements and shape (1, 100000)>
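
In the cells above, the vectorizer is fitted on the full dataset before the split, so the vocabulary and IDF weights also see the test comments. An alternative ordering, sketched here on hypothetical toy data, splits first (with `stratify` to keep class proportions) and fits TF-IDF on the training part only. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: four positive and four negative comments.
texts = [
    "love it", "great clip", "so good", "really nice",
    "hate it", "bad clip", "so boring", "really awful",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=47, stratify=labels
)

# Learn vocabulary and IDF weights from the training text only,
# then reuse them unchanged for the test text.
vectorizer = TfidfVectorizer(max_features=100000)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)
print(X_train.shape, X_test.shape)
```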

In [None]:
#Defining a function to train multiple models on the dataset and select the best one based on the accuracy.
def sentiment_analysis(models):
  best_acc = 0
  best_model = None

  for name, model in models.items():
    model.fit(xtrain, ytrain)
    ypred = model.predict(xtest)
    acc = accuracy_score(ytest, ypred)  #scikit-learn convention: (y_true, y_pred)
    print(f"Accuracy Score for {name}: {acc}")

    if best_acc < acc:
      best_acc = acc
      best_model = name

  print(f"\nBest Model: {best_model}, Best Accuracy: {best_acc}")

In [None]:
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regressor": LogisticRegression()
}

In [None]:
sentiment_analysis(models)

Accuracy Score for Multinomial Naive Bayes: 0.6482560054561589
Accuracy Score for Logistic Regressor: 0.6807992032767954

Best Model: Logistic Regressor, Best Accuracy: 0.6807992032767954


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
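
The warning above indicates that the solver stopped at its iteration limit before converging. As the message suggests, one option is to raise `max_iter`. The sketch below shows this on hypothetical toy data; the value `1000` is an assumption and the accuracies reported above were obtained without it.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the vectorized comments.
texts = ["good video", "nice video", "bad video", "awful video"]
labels = [1, 1, 0, 0]
X = TfidfVectorizer().fit_transform(texts)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", ConvergenceWarning)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

# n_iter_ holds the number of iterations the solver actually ran.
iterations_used = int(clf.n_iter_.max())
print(iterations_used)
```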


### Yelp Reviews Sentiment Analysis

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ilhamfp31/yelp-review-dataset")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/ilhamfp31/yelp-review-dataset/versions/2


In [None]:
!ls /root/.cache/kagglehub/datasets/ilhamfp31/yelp-review-dataset/versions/2/yelp_review_polarity_csv

readme.txt  test.csv  train.csv


In [None]:
#Reading the train dataset
train = pd.read_csv("/root/.cache/kagglehub/datasets/ilhamfp31/yelp-review-dataset/versions/2/yelp_review_polarity_csv/train.csv", header = None)
train.columns = ['label', 'text']
train['label'] = train['label'].map({1:0,2:1})  #0 = Negative & 1 = Positive review
train.head()

Unnamed: 0,label,text
0,0,"Unfortunately, the frustration of being Dr. Go..."
1,1,Been going to Dr. Goldberg for over 10 years. ...
2,0,I don't know what Dr. Goldberg was like before...
3,0,I'm writing this review to give you a heads up...
4,1,All the food is great here. But the best thing...


In [None]:
train.iloc[3,1]

"I'm writing this review to give you a heads up before you see this Doctor. The office staff and administration are very unprofessional. I left a message with multiple people regarding my bill, and no one ever called me back. I had to hound them to get an answer about my bill. \\n\\nSecond, and most important, make sure your insurance is going to cover Dr. Goldberg's visits and blood work. He recommended to me that I get a physical, and he knew I was a student because I told him. I got the physical done. Later, I found out my health insurance doesn't pay for preventative visits. I received an $800.00 bill for the blood work. I can't pay for my bill because I'm a student and don't have any cash flow at this current time. I can't believe the Doctor wouldn't give me a heads up to make sure my insurance would cover work that wasn't necessary and was strictly preventative. The office can't do anything to help me cover the bill. In addition, the office staff said the onus is on me to make su

In [None]:
train.isna().sum()

Unnamed: 0,0
label,0
text,0


In [None]:
train.duplicated().sum()

np.int64(0)

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560000 entries, 0 to 559999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   label   560000 non-null  int64 
 1   text    560000 non-null  object
dtypes: int64(1), object(1)
memory usage: 8.5+ MB


In [None]:
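#Reading the test dataset
#header=None because test.csv, like train.csv, has no header row; without it pandas
#consumes the first review as column names (the outputs recorded below come from a run
#without header=None, hence 37999 rows)
test = pd.read_csv("/root/.cache/kagglehub/datasets/ilhamfp31/yelp-review-dataset/versions/2/yelp_review_polarity_csv/test.csv", header = None)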
test.columns = ['label', 'text']
test['label'] = test['label'].map({1:0,2:1})
test.head()

Unnamed: 0,label,text
0,0,Last summer I had an appointment to get new ti...
1,1,"Friendly staff, same starbucks fair you get an..."
2,0,The food is good. Unfortunately the service is...
3,1,Even when we didn't have a car Filene's Baseme...
4,1,"Picture Billy Joel's \""Piano Man\"" DOUBLED mix..."


In [None]:
test.isna().sum()

Unnamed: 0,0
label,0
text,0


In [None]:
test.duplicated().sum()

np.int64(0)

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37999 entries, 0 to 37998
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   37999 non-null  int64 
 1   text    37999 non-null  object
dtypes: int64(1), object(1)
memory usage: 593.9+ KB


In [None]:
#Splitting the datasets
xtrain, xtest, ytrain, ytest = train.iloc[:,1], test.iloc[:, 1],  train.iloc[:, 0], test.iloc[:, 0]

In [None]:
#Creating a tfidf vectorizer to create vector representations of the text
tf = TfidfVectorizer(binary=False)
xtrain = tf.fit_transform(xtrain)
xtest = tf.transform(xtest)

In [None]:
#Defining a function to train multiple models and select the best one based on accuracy
def sentiment_analysis(models):
  best_acc = 0
  best_model = None
  print("Done!")
  for name, model in models.items():
    model.fit(xtrain, ytrain)
    ypred = model.predict(xtest)
    acc = accuracy_score(ytest, ypred)  #scikit-learn convention: (y_true, y_pred)
    print(f"Accuracy Score for {name}: {acc}")

    if best_acc < acc:
      best_acc = acc
      best_model = name

  print(f"Best Model: {best_model}, Best Accuracy: {best_acc}")

In [None]:
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regressor": LogisticRegression(),
    "Decision Tree":DecisionTreeClassifier(max_depth = 9)
}

In [None]:
sentiment_analysis(models)

Done!
Accuracy Score for Multinomial Naive Bayes: 0.884207479144188
Accuracy Score for Logistic Regressor: 0.9379983683781152
Accuracy Score for Decision Tree: 0.7580462643753783
Best Model: Logistic Regressor, Best Accuracy: 0.9379983683781152


### Movie Reviews Form Sentiment Analysis

In [10]:
#Reading the dataset
df = pd.read_csv("Movie Review Survey.csv", names=['Timestamp','Email','Review','Sentiment'], header = 0)
df = df[['Review','Sentiment']]
df.head()

Unnamed: 0,Review,Sentiment
0,It's a good movie.,Good
1,It was an amazing movie. I really liked it a lot.,Good
2,Amazing comedy drama movie. Actors acting is r...,Good
3,"I could have been better, It lacked Comedy",Bad
4,The movie was a great watch. It is funny and w...,Good


In [11]:
df.isna().sum()

Unnamed: 0,0
Review,0
Sentiment,0


In [12]:
df.duplicated().sum()

np.int64(0)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Review     20 non-null     object
 1   Sentiment  20 non-null     object
dtypes: object(2)
memory usage: 452.0+ bytes


In [14]:
#Splitting into target and text
x = df['Review']
y = df['Sentiment']
y = y.map({'Good':1,'Bad':0})

In [25]:
#Using a TFIDF Vectorizer to convert the texts into vector representations
tfidf = TfidfVectorizer()
x = tfidf.fit_transform(x)

In [91]:
#Splitting the data into train and test while preserving the class balance/weights using StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size = 0.25, random_state = 99)
print(sss)

for train_index, test_index in sss.split(x,y):
  xtrain, xtest = x[train_index], x[test_index]
  ytrain, ytest = y.iloc[train_index], y.iloc[test_index]  #positional indexing, independent of the index labels

StratifiedShuffleSplit(n_splits=1, random_state=99, test_size=0.25,
            train_size=None)


In [92]:
xtrain.shape, xtest.shape, ytrain.shape, ytest.shape

((15, 166), (5, 166), (15,), (5,))

In [93]:
ytest

Unnamed: 0,Sentiment
10,1
16,0
5,1
2,1
9,1


In [94]:
xtrain[0]

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10 stored elements and shape (1, 166)>

In [95]:
#Defining a function to train multiple models on the dataset and select the best one based on the accuracy.
#Here, no resampling has been done, hence the models might not be accurate for the class 0 due to imbalance.
#Let's evaluate the accuracy on the imbalanced data first
def sentiment_analysis(models):
  best_acc = 0
  best_model = None

  for name, model in models.items():
    model.fit(xtrain, ytrain)
    ypred = model.predict(xtest)
    acc = accuracy_score(ytest, ypred)  #scikit-learn convention: (y_true, y_pred)
    print(f"Accuracy Score for {name}: {acc}")

    if best_acc < acc:
      best_acc = acc
      best_model = name

  print(f"\nBest Model: {best_model}, Best Accuracy: {best_acc}")

In [96]:
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regressor": LogisticRegression(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier()
}

In [97]:
sentiment_analysis(models)

Accuracy Score for Multinomial Naive Bayes: 0.8
Accuracy Score for Logistic Regressor: 0.8
Accuracy Score for SVM: 0.8
Accuracy Score for Decision Tree: 0.8
Accuracy Score for Random Forest: 0.8

Best Model: Multinomial Naive Bayes, Best Accuracy: 0.8


Every model scores 80%. The test set contains four class-1 samples and one class-0 sample, and the models appear to predict the majority class 1 for all of them, so the single class-0 sample is misclassified.
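
To see why accuracy alone hides this failure, the sketch below uses the `ytest` labels printed earlier together with an assumed prediction vector of all ones. The notebook does not print the actual predictions, so `y_pred` is hypothetical. The confusion matrix and the class-0 recall make the missed sample explicit.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 1]  # ytest labels as printed above
y_pred = [1, 1, 1, 1, 1]  # assumption: the model always predicts the majority class

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Fraction of true class-0 samples that were recognised as class 0.
recall_class0 = recall_score(y_true, y_pred, pos_label=0)

print(acc)
print(cm)
print(recall_class0)
```

`classification_report`, already imported at the top of the notebook, reports the same per-class breakdown in one call.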

In [105]:
sss = StratifiedShuffleSplit(n_splits=1, test_size = 0.25, random_state = 99)
print(sss)

#Applying SMOTE to resample/upsample the data.
smote = SMOTE(random_state=42, k_neighbors=3)

for train_index, test_index in sss.split(x,y):
  xtrain, xtest = x[train_index], x[test_index]
  ytrain, ytest = y.iloc[train_index], y.iloc[test_index]  #positional indexing, independent of the index labels

xtrain_resampled, ytrain_resampled = smote.fit_resample(xtrain, ytrain)

StratifiedShuffleSplit(n_splits=1, random_state=99, test_size=0.25,
            train_size=None)


In [99]:
xtrain_resampled.shape, ytrain_resampled.shape

((22, 166), (22,))

In [100]:
ytrain_resampled.value_counts() #SMOTE has done oversampling on samples of class 0 to create balanced training data.

Unnamed: 0_level_0,count
Sentiment,Unnamed: 1_level_1
1,11
0,11
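
The training split has 15 samples, of which 11 are class 1, so SMOTE had to create 7 synthetic class-0 samples from the 4 real ones. Conceptually, each synthetic sample is an interpolation between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that idea on hypothetical 2-D points. It is a simplified illustration, not the `imblearn` implementation, and `synthesize` is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four hypothetical minority-class points (the real ones are 166-dimensional TF-IDF rows).
minority = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]])

def synthesize(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

# With 4 minority points, at most 3 neighbours exist, matching k_neighbors=3 above.
synthetic = np.array([synthesize(minority, 3, rng) for _ in range(7)])
print(synthetic.shape)
```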


In [101]:
#Trying multiple models on the balanced data, to evaluate if any difference has been made
def sentiment_analysis(models):
  best_acc = 0
  best_model = None

  for name, model in models.items():
    model.fit(xtrain_resampled, ytrain_resampled)
    ypred = model.predict(xtest)
    acc = accuracy_score(ytest, ypred)  #scikit-learn convention: (y_true, y_pred)
    print(f"Accuracy Score for {name}: {acc}")

    if best_acc < acc:
      best_acc = acc
      best_model = name

  print(f"\nBest Model: {best_model}, Best Accuracy: {best_acc}")

In [102]:
models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regressor": LogisticRegression(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier()
}

In [103]:
sentiment_analysis(models)

Accuracy Score for Multinomial Naive Bayes: 1.0
Accuracy Score for Logistic Regressor: 0.8
Accuracy Score for SVM: 0.8
Accuracy Score for Decision Tree: 0.8
Accuracy Score for Random Forest: 0.8

Best Model: Multinomial Naive Bayes, Best Accuracy: 1.0


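After oversampling, the Multinomial Naive Bayes classifier labels all five test samples correctly, including the single class-0 sample, while the other models stay at 80%. With a test set of only five samples, this result is indicative rather than conclusive.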
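
With so few samples, a single split gives a noisy estimate. One option is stratified k-fold cross-validation, using the `StratifiedKFold` class already imported at the top of the notebook. The sketch below is an assumption about how that could be set up, not something the notebook ran: the toy reviews are hypothetical and the fold count is chosen to suit them. Wrapping the vectorizer and classifier in a pipeline means TF-IDF is refitted on each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the survey responses.
reviews = [
    "amazing movie", "really good movie", "great watch", "funny and good",
    "boring movie", "really bad movie", "weak plot", "bad and boring",
]
sentiment = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each fold keeps the same class ratio as the full dataset.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, reviews, sentiment, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```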