# SMAI Assignment - 2

## Question 1: Naive Bayes and Clustering

### Part 1: Naive Bayes

[Files](https://drive.google.com/drive/folders/1OUVrOMp2jSSBDJSqvEyXDFTrhiyZnqit?usp=sharing)

You will perform sentiment analysis on a product review dataset containing customer reviews and star ratings that belong to four classes (1, 2, 4, 5). You can use sklearn for this question. Your tasks are as follows:

1. Clean the text by removing punctuation and preprocess it using techniques such as stop word removal, stemming, etc. You can explore anything!
1. Create BoW features using the word counts. You can choose the words that form the features such that performance is optimised. Use the train-test split provided in `train_test_index.pickle` and report any interesting observations based on metrics such as accuracy, precision, recall and F1 score (you can use the classification report in sklearn).
1. Repeat Task 2 with TfIdf features.
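
As a rough illustration of tasks 2 and 3, the sketch below chains a bag-of-words vectoriser and a Multinomial Naive Bayes classifier and prints sklearn's classification report. The tiny corpus, the labels and the variable names are made up for illustration and are not taken from the dataset; swapping `CountVectorizer` for `TfidfVectorizer` gives the TF-IDF variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-stemmed toy reviews with star labels (not real data).
train_texts = [
    "horribl servic rude staff",
    "bad food slow servic",
    "delici food friendli staff",
    "great place amaz food",
]
train_stars = [1, 1, 5, 5]
test_texts = ["rude slow servic", "great delici food"]
test_stars = [1, 5]

# The pipeline keeps the count matrix sparse, which MultinomialNB accepts directly.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_stars)
pred = model.predict(test_texts)

# classification_report expects the true labels first, then the predictions.
print(classification_report(test_stars, pred, zero_division=0))
```

Keeping the vectoriser inside the pipeline means the vocabulary is learned from the training texts only, which avoids leaking test-set words into the features.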

In [1]:
import numpy as np
import pandas as pd


In [2]:
import pickle
with open('train_test_index.pickle', 'rb') as handle:
    train_test_index_dict = pickle.load(handle)

In [3]:
import pandas as pd

data = pd.read_csv('product_reviews.csv')
data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,Went in for a lunch. Steak sandwich was delici...,5.0,1
2,This place has gone down hill. Clearly they h...,1.0,0
3,"Walked in around 4 on a Friday afternoon, we s...",1.0,0
4,Michael from Red Carpet VIP is amazing ! I rea...,4.0,1


In [4]:
# Convert all text to lower
def convert_to_lower(text):
    return text.lower()


In [5]:
data['text']=data['text'].apply(convert_to_lower)

In [6]:
# Function to replace special characters with spaces
def remove_special_characters(text):
    # Keep alphanumeric characters; replace everything else with a space
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

In [7]:
data['text']=data['text'].apply(remove_special_characters)

In [8]:
data.head(10)

Unnamed: 0,text,stars,sentiment
0,total bill for this horrible service over 8g...,1.0,0
1,went in for a lunch steak sandwich was delici...,5.0,1
2,this place has gone down hill clearly they h...,1.0,0
3,walked in around 4 on a friday afternoon we s...,1.0,0
4,michael from red carpet vip is amazing i rea...,4.0,1
5,you can t really find anything wrong with this...,5.0,1
6,great lunch today staff was very helpful in a...,4.0,1
7,we ve been a huge slim s fan since they opened...,5.0,1
8,our family loves the food here quick friendl...,5.0,1
9,the food is always good and the prices are rea...,4.0,1


In [9]:
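# Remove stop words
import nltk
from nltk.corpus import stopwords

# Build the stop-word set once. Calling stopwords.words('english') inside the
# loop reloads the list for every word, which is very slow on 26k reviews.
# If the corpus is missing, run nltk.download('stopwords') first.
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    """Return the tokens of `text` that are not English stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]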

In [10]:
data['text']=data['text'].apply(remove_stopwords)

In [11]:
data.head(10)

Unnamed: 0,text,stars,sentiment
0,"[total, bill, horrible, service, 8gs, crooks, ...",1.0,0
1,"[went, lunch, steak, sandwich, delicious, caes...",5.0,1
2,"[place, gone, hill, clearly, cut, back, staff,...",1.0,0
3,"[walked, around, 4, friday, afternoon, sat, ta...",1.0,0
4,"[michael, red, carpet, vip, amazing, reached, ...",4.0,1
5,"[really, find, anything, wrong, place, pastas,...",5.0,1
6,"[great, lunch, today, staff, helpful, assistin...",4.0,1
7,"[huge, slim, fan, since, opened, one, texas, t...",5.0,1
8,"[family, loves, food, quick, friendly, delicio...",5.0,1
9,"[food, always, good, prices, reasonable, altho...",4.0,1


In [12]:
# Stemming
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [13]:
def stem_words(tokens):
    """Apply the Porter stemmer to each token and return a new list."""
    # A local list comprehension replaces the shared global `ans` list.
    return [ps.stem(word) for word in tokens]

In [14]:
data['text']=data['text'].apply(stem_words)

In [15]:
data.head()

Unnamed: 0,text,stars,sentiment
0,"[total, bill, horribl, servic, 8g, crook, actu...",1.0,0
1,"[went, lunch, steak, sandwich, delici, caesar,...",5.0,1
2,"[place, gone, hill, clearli, cut, back, staff,...",1.0,0
3,"[walk, around, 4, friday, afternoon, sat, tabl...",1.0,0
4,"[michael, red, carpet, vip, amaz, reach, need,...",4.0,1


In [16]:
#Join back
def join_back(list_input):
    return " ".join(list_input)
    

In [17]:
data['text']=data['text'].apply(join_back)

In [18]:
data.head()

Unnamed: 0,text,stars,sentiment
0,total bill horribl servic 8g crook actual nerv...,1.0,0
1,went lunch steak sandwich delici caesar salad ...,5.0,1
2,place gone hill clearli cut back staff food qu...,1.0,0
3,walk around 4 friday afternoon sat tabl bar wa...,1.0,0
4,michael red carpet vip amaz reach need help pl...,4.0,1


### Part 2: Clustering

You will perform k-means clustering on the same product reviews dataset from Part 1. In this question, instead of statistically computing features, you will use the embeddings obtained from a neural sentiment analysis model (huggingface: siebert/sentiment-roberta-large-english).

You can use sklearn for this question. Your tasks are as follows:


1. Perform k-means clustering using sklearn. Try various values for the number of clusters (k) and plot the elbow curve. For each value of k, plot the WCSS (Within-Cluster Sum of Squares). WCSS is the sum of the squared distances between each point and the centroid of its cluster.
1. Perform task 1 with cluster initialisation methods [k-means++, forgy ("random" in sklearn)].
1. In this case, since the ground truth labels (star rating) are available we can evaluate the clustering using metrics like purity, nmi and rand score. Implement these metrics from scratch and evaluate the clustering. [Reference](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html)
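
Tasks 1 and 2 can be sketched as follows. This is a minimal example on synthetic 2-D blobs rather than the RoBERTa embeddings; the helper names `wcss_curve` and `plot_elbow` and the range of k are illustrative choices. It relies on sklearn's `KMeans.inertia_`, which is the WCSS of the fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss_curve(X, k_values, init="k-means++", seed=0):
    """Fit KMeans for each k and return the WCSS (sklearn's `inertia_`)."""
    wcss = []
    for k in k_values:
        km = KMeans(n_clusters=k, init=init, n_init=10, random_state=seed)
        km.fit(X)
        wcss.append(float(km.inertia_))
    return wcss

def plot_elbow(k_values, curves):
    """Plot one WCSS-vs-k line per initialisation method."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    for init, wcss in curves.items():
        plt.plot(k_values, wcss, marker="o", label=init)
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("WCSS")
    plt.legend()
    plt.show()

# Synthetic stand-in for the embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

k_values = list(range(1, 7))
# "random" is sklearn's name for Forgy-style initialisation.
curves = {init: wcss_curve(X_demo, k_values, init=init)
          for init in ("k-means++", "random")}
# plot_elbow(k_values, curves)  # uncomment to draw the elbow curve
```

On this toy data the WCSS drops sharply up to k = 3 and flattens afterwards, which is the "elbow" the task asks you to look for. For the real data, the embedding matrix would take the place of `X_demo`.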

In [19]:
import gzip
import numpy as np

f = gzip.GzipFile('roberta_embeds.npy.gz', "r")
embeds = np.load(f)
print(embeds.shape)

(26661, 1024)
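
For task 3, one possible from-scratch implementation of purity, NMI and the Rand index is sketched below, following the definitions in the linked Stanford IR book chapter (NMI normalised by the mean of the two entropies). The function names are illustrative, and the demo labels reproduce the 17-point worked example from that chapter rather than the review data.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Rows are clusters, columns are classes, entries are co-occurrence counts."""
    _, class_idx = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, cluster_idx = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    table = np.zeros((cluster_idx.max() + 1, class_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (cluster_idx, class_idx), 1)
    return table

def purity(labels_true, labels_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    table = contingency(labels_true, labels_pred)
    return table.max(axis=1).sum() / table.sum()

def entropy(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of cluster and class entropies."""
    table = contingency(labels_true, labels_pred)
    n = table.sum()
    joint = table / n
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n**2
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / expected[mask]))
    denom = (entropy(table.sum(axis=1)) + entropy(table.sum(axis=0))) / 2
    return mi / denom if denom > 0 else 1.0

def rand_index(labels_true, labels_pred):
    """(TP + TN) / number of point pairs, counted from the contingency table."""
    table = contingency(labels_true, labels_pred)
    pairs = lambda x: x * (x - 1) / 2
    total = pairs(table.sum())
    tp = pairs(table).sum()                        # same cluster, same class
    same_cluster = pairs(table.sum(axis=1)).sum()  # TP + FP
    same_class = pairs(table.sum(axis=0)).sum()    # TP + FN
    tn = total - same_cluster - same_class + tp
    return (tp + tn) / total

# Worked example from the reference: 3 clusters over classes x, o, d.
true_demo = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
pred_demo = [0] * 6 + [1] * 6 + [2] * 5
print(purity(true_demo, pred_demo), nmi(true_demo, pred_demo),
      rand_index(true_demo, pred_demo))
```

On this example the three values come out at roughly 0.71, 0.36 and 0.68, matching the figures quoted in the reference. For the assignment, `labels_true` would be the star ratings and `labels_pred` the cluster labels produced by k-means.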


In [20]:
X=data.iloc[:,0:2].values
X

array([['total bill horribl servic 8g crook actual nerv charg us 69 3 pill check onlin pill 19 cent avoid hospit er cost',
        1.0],
       ['went lunch steak sandwich delici caesar salad absolut delici dress perfect amount dress distribut perfectli across leaf know go salad perfect drink price pretti good server dawn friendli accommod happi summat great pub experi would go',
        5.0],
       ['place gone hill clearli cut back staff food qualiti mani review written menu chang go year food qualiti gone hill servic slow salad 15 bad get worth spend money place mani option',
        1.0],
       ...,
       ['petit café sympa peu de place assis bonn bouff bon expresso un servic correct san plu idéal pour casser la croût avec un ami ou simplement lire sur la terrass qui donn direct sur l ambianc de princ arthur',
        4.0],
       ['absolut delici food full amaz flavor owner quit love definit return highli recommend',
        5.0],
       ['best place sport event servic locat am

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv=CountVectorizer()
tfidf=TfidfVectorizer(max_features=10000)

In [45]:
X=cv.fit_transform(data['text']).toarray()
TX=tfidf.fit_transform(data['text']).toarray()
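
To make the difference between the two feature sets concrete, the sketch below recomputes TF-IDF by hand from raw counts and compares it with `TfidfVectorizer` under its default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalisation). The three documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy documents (not from the review dataset).
docs = ["good food good service", "bad service", "good place"]

# Raw term counts: this is the BoW representation.
counts = CountVectorizer().fit_transform(docs).toarray()

# Smoothed inverse document frequency, as used by sklearn's defaults.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale every row to unit L2 norm.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

auto = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Because each TF-IDF row is rescaled to unit length, the features are no longer integer counts, which is one reason the count-based and TF-IDF results of a Naive Bayes model can differ.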

In [23]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [24]:
X.shape

(26661, 20356)

In [25]:
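# Label array with shape (n, 1). sklearn flattens it with a DataConversionWarning
# when fitting; appending .ravel() would avoid the warning.
y=data.iloc[:,2:].values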

In [62]:
# Splitting data into training and testing data
from sklearn.model_selection import train_test_split
# for CountVectorizer
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
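
The task statement asks for the split stored in `train_test_index.pickle`, whereas the cell above draws a fresh random split. The structure of the pickled dictionary is not shown in this notebook, so the sketch below assumes it holds positional row indices under the hypothetical keys `train_index` and `test_index`; check `train_test_index_dict.keys()` before adapting it.

```python
import numpy as np

def split_by_index(X, y, train_idx, test_idx):
    """Split X and y using precomputed positional row indices."""
    train_idx = np.asarray(train_idx)
    test_idx = np.asarray(test_idx)
    if np.intersect1d(train_idx, test_idx).size:
        raise ValueError("train and test indices overlap")
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy stand-ins for the feature matrix, labels and index dictionary.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2
index_dict = {                        # hypothetical key names
    "train_index": [0, 1, 2, 3, 4, 5, 6, 7],
    "test_index": [8, 9],
}

X_tr, X_te, y_tr, y_te = split_by_index(
    X_demo, y_demo, index_dict["train_index"], index_dict["test_index"]
)
print(X_tr.shape, X_te.shape)
```

Using a fixed split makes the reported metrics reproducible and comparable across the CountVectorizer and TF-IDF runs.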

In [63]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

clf1=GaussianNB()
clf2=MultinomialNB()
clf3=BernoulliNB()

In [64]:
clf1.fit(X_train,y_train)



In [65]:
clf2.fit(X_train,y_train)


In [66]:
clf3.fit(X_train,y_train)


In [67]:
y_pred1=clf1.predict(X_test)
y_pred2=clf2.predict(X_test)
y_pred3=clf3.predict(X_test)

In [37]:
y_test.shape

(5333, 1)

In [38]:
y_pred1.shape

(5333,)

In [57]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score


In [68]:
print("Gaussian : ",accuracy_score(y_pred1,y_test))

Gaussian :  0.5302831426964185


In [69]:
print("Multinomial: ",accuracy_score(y_pred2,y_test))

Multinomial:  0.9341833864616539


In [70]:
print("Bernoulli: ",accuracy_score(y_pred3,y_test))

Bernoulli:  0.9216201012563285


In [71]:
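# Count Vectorizer
# Note: sklearn's signature is precision_score(y_true, y_pred). Called as
# precision_score(y_pred, y_test), it treats y_test as the prediction, so the
# value printed here is the classifier's recall (and the recall cell below
# prints its precision). Accuracy and F1 are unaffected by the order.
print("Gaussian(Count Vectorizer) : ",precision_score(y_pred1,y_test))
print("Multinomial(Count Vectorizer): ",precision_score(y_pred2,y_test))
print("Bernoulli(Count Vectorizer) : ",precision_score(y_pred3,y_test))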

Gaussian(Count Vectorizer) :  0.4851004851004851
Multinomial(Count Vectorizer):  0.9586509586509586
Bernoulli(Count Vectorizer) :  0.9595749595749595


In [72]:
# Count Vectorizer
print("Gaussian(Count Vectorizer) : ",recall_score(y_pred1,y_test))
print("Multinomial(Count Vectorizer): ",recall_score(y_pred2,y_test))
print("Bernoulli(Count Vectorizer) : ",recall_score(y_pred3,y_test))

Gaussian(Count Vectorizer) :  0.8838383838383839
Multinomial(Count Vectorizer):  0.960203609440074
Bernoulli(Count Vectorizer) :  0.9447350466226972


In [73]:
# Count Vectorizer
print("Gaussian(Count Vectorizer) : ",f1_score(y_pred1,y_test))
print("Multinomial(Count Vectorizer): ",f1_score(y_pred2,y_test))
print("Bernoulli(Count Vectorizer) : ",f1_score(y_pred3,y_test))

Gaussian(Count Vectorizer) :  0.6263982102908278
Multinomial(Count Vectorizer):  0.9594266558779332
Bernoulli(Count Vectorizer) :  0.9520971808388723
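
The argument order of sklearn's metric functions matters for precision and recall. The small sketch below, on made-up binary labels, shows that passing the predictions first swaps the two values, while `classification_report` with `(y_true, y_pred)` prints all metrics at once.

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Hypothetical binary labels: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0]

# Documented order: true labels first, predictions second.
p = precision_score(y_true, y_hat)
r = recall_score(y_true, y_hat)

# Reversed order: each function now scores y_true as if it were the prediction.
p_swapped = precision_score(y_hat, y_true)
r_swapped = recall_score(y_hat, y_true)

print("precision:", p, "recall:", r)
print("swapped precision:", p_swapped, "swapped recall:", r_swapped)
print(classification_report(y_true, y_hat, digits=3))
```

Here precision is 2/3 and recall is 1/2 in the documented order, and the two numbers trade places when the arguments are reversed.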


In [75]:
X_train,X_test,y_train,y_test=train_test_split(TX,y,test_size=0.2)

In [76]:
X_train.shape

(21328, 10000)

In [77]:
clf1.fit(X_train,y_train)
clf2.fit(X_train,y_train)
clf3.fit(X_train,y_train)



In [78]:
y_pred1=clf1.predict(X_test)
y_pred2=clf2.predict(X_test)
y_pred3=clf3.predict(X_test)
print("Gaussian(tfidf) : ",accuracy_score(y_pred1,y_test))
print("Multinomial(tfidf) : ",accuracy_score(y_pred2,y_test))
print("Bernoulli(tfidf) : ",accuracy_score(y_pred3,y_test))


Gaussian(tfidf) :  0.5327207950496906
Multinomial(tfidf) :  0.8790549409338083
Bernoulli(tfidf) :  0.9251828239264954


In [79]:
print("Gaussian(Tfidf) : ",precision_score(y_pred1,y_test))
print("Multinomial(Tfidf): ",precision_score(y_pred2,y_test))
print("Bernoulli(Tfidf) : ",precision_score(y_pred3,y_test))

Gaussian(Tfidf) :  0.48825854452452916
Multinomial(Tfidf):  0.9944199023482911
Bernoulli(Tfidf) :  0.9511741455475471


In [80]:
# TF-IDF
print("Gaussian(Tfidf) : ",recall_score(y_pred1,y_test))
print("Multinomial(Tfidf): ",recall_score(y_pred2,y_test))
print("Bernoulli(Tfidf) : ",recall_score(y_pred3,y_test))

Gaussian(Tfidf) :  0.8782936010037641
Multinomial(Tfidf):  0.8732135565536954
Bernoulli(Tfidf) :  0.9558411214953271


In [81]:
# TF-IDF
print("Gaussian(Tfidf) : ",f1_score(y_pred1,y_test))
print("Multinomial(Tfidf): ",f1_score(y_pred2,y_test))
print("Bernoulli(Tfidf) : ",f1_score(y_pred3,y_test))

Gaussian(Tfidf) :  0.6276150627615062
Multinomial(Tfidf):  0.9298836830090227
Bernoulli(Tfidf) :  0.9535019228528144


In [82]:
import pandas as pd

# Data for CountVectorizer
count_data = {
    "Classifier": ["Gaussian", "Multinomial", "Bernoulli"],
    "Accuracy Score": [0.5302831426964185, 0.9341833864616539, 0.9216201012563285],
    "Precision Score": [0.4851004851004851, 0.9586509586509586, 0.9595749595749595],
    "Recall Score": [0.8838383838383839, 0.960203609440074, 0.9447350466226972],
    "F1 Score": [0.6263982102908278, 0.9594266558779332, 0.9520971808388723]
}

# Data for TF-IDF
tfidf_data = {
    "Classifier": ["Gaussian", "Multinomial", "Bernoulli"],
    "Accuracy Score": [0.5327207950496906, 0.8790549409338083, 0.9251828239264954],
    "Precision Score": [0.48825854452452916, 0.9944199023482911, 0.9511741455475471],
    "Recall Score": [0.8782936010037641, 0.8732135565536954, 0.9558411214953271],
    "F1 Score": [0.6276150627615062, 0.9298836830090227, 0.9535019228528144]
}

# Create DataFrames
count_df = pd.DataFrame(count_data)
tfidf_df = pd.DataFrame(tfidf_data)

# Set index to 'Classifier' column
count_df.set_index('Classifier', inplace=True)
tfidf_df.set_index('Classifier', inplace=True)

# Concatenate the two DataFrames side by side
comparison_df = pd.concat([count_df, tfidf_df], axis=1, keys=['CountVectorizer', 'TF-IDF'])

print(comparison_df)


            CountVectorizer                                         \
             Accuracy Score Precision Score Recall Score  F1 Score   
Classifier                                                           
Gaussian           0.530283        0.485100     0.883838  0.626398   
Multinomial        0.934183        0.958651     0.960204  0.959427   
Bernoulli          0.921620        0.959575     0.944735  0.952097   

                    TF-IDF                                         
            Accuracy Score Precision Score Recall Score  F1 Score  
Classifier                                                         
Gaussian          0.532721        0.488259     0.878294  0.627615  
Multinomial       0.879055        0.994420     0.873214  0.929884  
Bernoulli         0.925183        0.951174     0.955841  0.953502  


# Report

I implemented three classification algorithms, each with both CountVectorizer and TF-IDF features:
1) Gaussian Naive Bayes
2) Multinomial Naive Bayes
3) Bernoulli Naive Bayes

The results are as follows:

| Features | Classifier | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| CountVectorizer | Gaussian | 0.530283 | 0.485100 | 0.883838 | 0.626398 |
| CountVectorizer | Multinomial | 0.934183 | 0.958651 | 0.960204 | 0.959427 |
| CountVectorizer | Bernoulli | 0.921620 | 0.959575 | 0.944735 | 0.952097 |
| TF-IDF | Gaussian | 0.532721 | 0.488259 | 0.878294 | 0.627615 |
| TF-IDF | Multinomial | 0.879055 | 0.994420 | 0.873214 | 0.929884 |
| TF-IDF | Bernoulli | 0.925183 | 0.951174 | 0.955841 | 0.953502 |

The metric functions were called as `metric(y_pred, y_test)`, so the Precision and Recall columns are interchanged relative to sklearn's `(y_true, y_pred)` convention; accuracy and F1 are unaffected.


