## ML Sentiment Analyzer

#### Life cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project classifies the sentiment of a YouTube comment as negative, neutral, or positive.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/atifaliak/youtube-comments-dataset

### 2.1 Import Data and Required Packages
#### Importing Pandas, NumPy, Matplotlib and Seaborn

In [68]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

#### Import the CSV Data as Pandas DataFrame

In [69]:
df = pd.read_csv('raw-data/YoutubeCommentsDataSet.csv')

#### Show Top 5 Records

In [70]:
df.head()

Unnamed: 0,Comment,Sentiment
0,lets not forget that apple pay in 2014 require...,neutral
1,here in nz 50 of retailers don’t even have con...,negative
2,i will forever acknowledge this channel with t...,positive
3,whenever i go to a place that doesn’t take app...,negative
4,apple pay is so convenient secure and easy to ...,positive


#### Shape of the dataset

In [71]:
df.shape

(18408, 2)

### 2.2 Dataset information
- Comment: text of the user's comment
- Sentiment: sentiment label of the comment (negative / neutral / positive)

### 3) Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set
- Check various categories present in the different categorical column

### 3.1 Check Missing values

In [72]:
df.isna().sum()

Comment      44
Sentiment     0
dtype: int64

#### The Comment column has 44 missing values

In [73]:
# Fixing Missing Values
# Drop rows where 'Comment' is NaN
# This is necessary to ensure that the analysis does not include empty comments
# which could skew the results or lead to errors in processing.
df = df.dropna(subset=['Comment'])

In [74]:
# verifying if missing values issue is resolved
df.isna().sum()

Comment      0
Sentiment    0
dtype: int64
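
`isna()` only flags true NaN values, so a comment that is an empty string or pure whitespace would pass the check above. A minimal sketch of an extra check follows; `toy` is a hypothetical frame used for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame; the real data lives in `df`.
toy = pd.DataFrame({
    "Comment": ["great video", "   ", None, ""],
    "Sentiment": ["positive", "neutral", "negative", "neutral"],
})

# Drop true NaN values first, as in the cell above.
toy = toy.dropna(subset=["Comment"])

# Flag comments that are empty once surrounding whitespace is stripped.
blank_mask = toy["Comment"].str.strip().eq("")
n_blank = int(blank_mask.sum())

# Keep only rows that contain actual text.
toy_clean = toy[~blank_mask]
print(n_blank, len(toy_clean))
```

On the real data the same mask could be built from `df['Comment']`; whether any blank comments exist there is not known from the outputs shown.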

### 3.2 Check Duplicates

In [75]:
df['Comment'].duplicated().sum()

np.int64(493)

In [76]:
# Fixing Duplicates
# Remove duplicate comments
df = df.drop_duplicates(subset=['Comment'])

In [77]:
# Verifying if duplicates issue is resolved
df['Comment'].duplicated().sum()

np.int64(0)
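
`drop_duplicates(subset=['Comment'])` keeps the first occurrence of each comment, so if the same text appears with different labels, the surviving label depends on row order. A small sketch of how such conflicts could be counted before dropping, using a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with one consistent and one conflicting duplicate.
toy = pd.DataFrame({
    "Comment": ["nice", "nice", "bad", "bad", "ok"],
    "Sentiment": ["positive", "positive", "negative", "neutral", "neutral"],
})

# Number of distinct labels attached to each distinct comment.
labels_per_comment = toy.groupby("Comment")["Sentiment"].nunique()

# Comments whose duplicates disagree on the label.
conflicting = labels_per_comment[labels_per_comment > 1].index.tolist()
print(conflicting)
```

If conflicts turn up in the real data, dropping all copies of those comments is one option; keeping the first copy, as done above, is another.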

### 3.3 Check data types

In [78]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17871 entries, 0 to 18407
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Comment    17871 non-null  object
 1   Sentiment  17871 non-null  object
dtypes: object(2)
memory usage: 418.9+ KB


### 3.4 Checking the number of unique values of each column

In [79]:
df.nunique()

Comment      17871
Sentiment        3
dtype: int64

### 3.5 Check statistics of data set

In [80]:
df.describe()

Unnamed: 0,Comment,Sentiment
count,17871,17871
unique,17871,3
top,lets not forget that apple pay in 2014 require...,positive
freq,1,11052
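
The `describe()` output shows that `positive` is the most frequent label, with 11052 of 17871 rows (about 62%), so the classes are imbalanced. The checklist in section 3 also lists inspecting the categories of each categorical column; a minimal sketch of that check on a hypothetical toy Series:

```python
import pandas as pd

# Hypothetical labels standing in for df['Sentiment'].
sentiment = pd.Series(
    ["positive", "positive", "positive", "neutral", "negative"],
    name="Sentiment",
)

counts = sentiment.value_counts()                 # absolute counts per class
shares = sentiment.value_counts(normalize=True)   # fraction of rows per class
print(counts)
print(shares.round(2))
```

On the notebook's data the equivalent call would be `df['Sentiment'].value_counts()`.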


## Model Training

In [None]:
# Modelling
import nltk
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from catboost import CatBoostRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenizer data used by word_tokenize (produces the download log below).
# On a fresh machine, stopwords.words('english') also needs nltk.download('stopwords').
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tanujsengar/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# Step 1: Label encoding of Sentiment
# Step 2: Converting comments to lowercase
# Step 3: Stopword removal from comments
# Step 4: Train/test split
# Step 5: Vectorization of comments
# Step 6: Model fitting and evaluation

In [90]:
# Step 1: Label Encoding of Sentiment
labelencoder = LabelEncoder()
df['Encoded_Sentiment'] = labelencoder.fit_transform(df['Sentiment'])
df.head()

Unnamed: 0,Comment,Sentiment,Encoded_Sentiment,Lower_Comments,Comment_No_Stop_Words
0,lets not forget that apple pay in 2014 require...,neutral,1,lets not forget that apple pay in 2014 require...,lets not forget apple pay 2014 required brand ...
1,here in nz 50 of retailers don’t even have con...,negative,0,here in nz 50 of retailers don’t even have con...,nz 50 retailers don’t even contactless credit ...
2,i will forever acknowledge this channel with t...,positive,2,i will forever acknowledge this channel with t...,forever acknowledge channel help lessons ideas...
3,whenever i go to a place that doesn’t take app...,negative,0,whenever i go to a place that doesn’t take app...,whenever go place doesn’t take apple pay doesn...
4,apple pay is so convenient secure and easy to ...,positive,2,apple pay is so convenient secure and easy to ...,apple pay convenient secure easy use used kore...
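
`LabelEncoder` assigns codes in sorted order of the class names, which is why the table above shows `negative` as 0, `neutral` as 1 and `positive` as 2. A short sketch that makes the mapping explicit and decodes codes back to labels (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["neutral", "negative", "positive"])

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
decoded = encoder.inverse_transform([2, 0]).tolist()
print(decoded)
```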


In [91]:
# Step 2: Converting comments to lowercase
df['Lower_Comments'] = df['Comment'].str.lower()
df.head()

Unnamed: 0,Comment,Sentiment,Encoded_Sentiment,Lower_Comments,Comment_No_Stop_Words
0,lets not forget that apple pay in 2014 require...,neutral,1,lets not forget that apple pay in 2014 require...,lets not forget apple pay 2014 required brand ...
1,here in nz 50 of retailers don’t even have con...,negative,0,here in nz 50 of retailers don’t even have con...,nz 50 retailers don’t even contactless credit ...
2,i will forever acknowledge this channel with t...,positive,2,i will forever acknowledge this channel with t...,forever acknowledge channel help lessons ideas...
3,whenever i go to a place that doesn’t take app...,negative,0,whenever i go to a place that doesn’t take app...,whenever go place doesn’t take apple pay doesn...
4,apple pay is so convenient secure and easy to ...,positive,2,apple pay is so convenient secure and easy to ...,apple pay convenient secure easy use used kore...


In [92]:
# Step 3: Stopword removal from comments
# Keep 'not' so that negation survives stopword removal
en_stopwords = set(stopwords.words('english'))
en_stopwords.discard('not')
df['Comment_No_Stop_Words'] = df['Lower_Comments'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in en_stopwords)
)
df.head()

Unnamed: 0,Comment,Sentiment,Encoded_Sentiment,Lower_Comments,Comment_No_Stop_Words
0,lets not forget that apple pay in 2014 require...,neutral,1,lets not forget that apple pay in 2014 require...,lets not forget apple pay 2014 required brand ...
1,here in nz 50 of retailers don’t even have con...,negative,0,here in nz 50 of retailers don’t even have con...,nz 50 retailers don’t even contactless credit ...
2,i will forever acknowledge this channel with t...,positive,2,i will forever acknowledge this channel with t...,forever acknowledge channel help lessons ideas...
3,whenever i go to a place that doesn’t take app...,negative,0,whenever i go to a place that doesn’t take app...,whenever go place doesn’t take apple pay doesn...
4,apple pay is so convenient secure and easy to ...,positive,2,apple pay is so convenient secure and easy to ...,apple pay convenient secure easy use used kore...
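
The sample rows above still contain `don’t` and `doesn’t` after stopword removal. NLTK's English list spells contractions with a straight apostrophe (for example `don't`), so tokens written with a curly apostrophe (U+2019) do not match. A minimal sketch of normalising the apostrophe first; `stop_words` is a hypothetical mini set standing in for the NLTK list, and `remove_stop_words` is an illustrative helper, not part of the notebook.

```python
# Hypothetical mini stopword set standing in for NLTK's English list.
stop_words = {"i", "to", "a", "that", "don't", "doesn't"}

def remove_stop_words(text, stop_words):
    """Lowercase, normalise curly apostrophes, then drop stopwords."""
    normalised = text.lower().replace("\u2019", "'")
    return " ".join(w for w in normalised.split() if w not in stop_words)

raw = "I go to a place that doesn’t take apple pay"
cleaned = remove_stop_words(raw, stop_words)
print(cleaned)
```

Whether negating contractions should be removed at all is a design choice: the notebook keeps `not` on purpose, and the same could be done for `don't` or `doesn't` by leaving them out of the set.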


In [106]:
# Step 4: Train Test Split
X = df['Comment_No_Stop_Words']
y = df['Encoded_Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 42, test_size=0.2)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

768                                         praying miners
4115     cant use timer black surface black doesnt refl...
16503                          beautiful family love movie
18094    wow massive learnings thanks wonder wouldn’t m...
6146     adam kills puts great journalistic interviewin...
                               ...                        
11599                                           420 obunga
12323    honestly really proud guy hope thrives extraor...
5532     besides ai i’m impressed man’s deep knowledge ...
876      mister kalo mister dapet 2jt subscriber mister...
16240    even 6 months movies released changes slate st...
Name: Comment_No_Stop_Words, Length: 14296, dtype: object
5000     apple product excited september best iphone ac...
2783     1948 understand beginners starting sql opinion...
9080     want go pakistan eat food ali really enjoyed p...
9849     nick youtuber call peasant camera…and make lau...
12403    got 1460 790 math 670 english got duke gf didn..
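
Because the labels are imbalanced (about 62% `positive`), a plain random split can give the train and test sets slightly different class proportions. A sketch of a stratified split, assuming toy data; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 positive (2), 20 neutral (1), 20 negative (0).
y_toy = [2] * 60 + [1] * 20 + [0] * 20
X_toy = [f"comment {i}" for i in range(len(y_toy))]

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

test_counts = Counter(y_te)
print(test_counts)
```

In the notebook's cell this would amount to adding `stratify=y` to the existing `train_test_split` call.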

In [107]:
# Step 5: Vectorizing Features
vectorizer = TfidfVectorizer(max_features = 5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
print(X_train_vec)
print(X_test_vec)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 190615 stored elements and shape (14296, 5000)>
  Coords	Values
  (0, 3371)	1.0
  (1, 714)	0.20185779033726656
  (1, 4706)	0.20760884979615354
  (1, 4504)	0.3478466332667741
  (1, 547)	0.5227952595185392
  (1, 4318)	0.33680487913719764
  (1, 1323)	0.2314244691812018
  (1, 2571)	0.2784618669218244
  (1, 4610)	0.23195720524652708
  (1, 4714)	0.22482366167101941
  (1, 4863)	0.27610130878225875
  (1, 4347)	0.31100201738277533
  (2, 479)	0.553878496503601
  (2, 1659)	0.5696734236221519
  (2, 2657)	0.3227387990659535
  (2, 2922)	0.5143252561434181
  (3, 4929)	0.24227335466342909
  (3, 2743)	0.30172023533240355
  (3, 4438)	0.17883951442510812
  (3, 4902)	0.2821723803830592
  (3, 4926)	0.32293726446069626
  (3, 2847)	0.24586721673738954
  (3, 1297)	0.3309048432626663
  (3, 2440)	0.4779054353894054
  (3, 4925)	0.1666904938024464
  :	:
  (14292, 2008)	0.41208507504401287
  (14292, 4460)	0.3769912458919671
  (14292, 2134)	0.44796921617

In [109]:
def evaluate_model(true, predicted):
    """Return MAE, RMSE and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
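
The metrics above treat the encoded sentiment as a continuous value. Since the target is a class code (0, 1 or 2), it can also help to turn a regressor's continuous output back into a class and measure accuracy. A minimal sketch, assuming the codes are treated as ordered (negative < neutral < positive); `predictions_to_labels` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def predictions_to_labels(predicted, n_classes=3):
    """Round continuous predictions to the nearest valid class code."""
    return np.clip(np.rint(predicted), 0, n_classes - 1).astype(int)

# Hypothetical true codes and continuous regressor outputs.
y_true = np.array([0, 1, 2, 2, 1])
y_pred_continuous = np.array([-0.3, 1.2, 1.6, 2.7, 0.4])

y_pred_labels = predictions_to_labels(y_pred_continuous)
accuracy = float(np.mean(y_pred_labels == y_true))
print(y_pred_labels, accuracy)
```

The same labels could be passed to `accuracy_score`, which the notebook already imports.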

In [115]:
# Step 6: Model fitting and evaluation
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "CatBoosting Regressor": CatBoostRegressor(verbose=False),
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list = []

for name, model in models.items():
    model.fit(X_train_vec, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train_vec)
    y_test_pred = model.predict(X_test_vec)

    # Evaluate Train and Test dataset
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.4103
- Mean Absolute Error: 0.3181
- R2 Score: 0.6717
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.6544
- Mean Absolute Error: 0.4980
- R2 Score: 0.1338


Lasso
Model performance for Training set
- Root Mean Squared Error: 0.7161
- Mean Absolute Error: 0.6349
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.7034
- Mean Absolute Error: 0.6258
- R2 Score: -0.0005


Ridge
Model performance for Training set
- Root Mean Squared Error: 0.4469
- Mean Absolute Error: 0.3517
- R2 Score: 0.6105
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.5603
- Mean Absolute Error: 0.4370
- R2 Score: 0.3651


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 0.6275
- Mean Absolute Error: 0.5419
- R2 Score: 0.2322
----------------------

In [113]:
pd.DataFrame(list(zip(model_list, r2_list)), columns=['Model Name', 'R2_Score']).sort_values(by=["R2_Score"],ascending=False)

Unnamed: 0,Model Name,R2_Score
2,Ridge,0.365123
6,CatBoosting Regressor,0.360134
5,Random Forest Regressor,0.308723
0,Linear Regression,0.133829
7,AdaBoost Regressor,0.029471
1,Lasso,-0.000537
4,Decision Tree,-0.159324
3,K-Neighbors Regressor,-0.20855
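
The best test R2 in the table is 0.365 (Ridge). Because the target is one of three classes, a classifier scored with accuracy is a natural alternative to regressing on the encoded codes. The sketch below is not the notebook's method: it swaps in `LogisticRegression` on a hypothetical toy corpus, and its training accuracy says nothing about results on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; codes follow the notebook's encoding
# (0 = negative, 1 = neutral, 2 = positive).
texts = [
    "love this channel", "great video thanks", "really helpful lesson",
    "terrible audio", "not good at all", "waste of time",
    "uploaded on monday", "video is ten minutes", "second part of series",
]
labels = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# TF-IDF features feeding a multiclass logistic regression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

preds = clf.predict(texts)
train_accuracy = accuracy_score(labels, preds)
print(train_accuracy)
```

On the notebook's data the pipeline would be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`.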
