# Data Mining Term Project - Board Game Geek Rating Prediction


# **Abhishek Shinde**
# **Student ID - 1001754842**



# **MY LINKS**

## **Link for my live prediction:  https://myridgeclassifierlive.herokuapp.com/**

## **Link for my GitHub : https://github.com/Warlord3097/My-Ridge-Classifier**

## **Link to the dataset used : https://www.kaggle.com/jvanelteren/boardgamegeek-reviews**


# Reference Links
1. https://monkeylearn.com/text-classification/
2. https://www.geeksforgeeks.org/applying-multinomial-naive-bayes-to-nlp-problems/
3. https://github.com/jushih/Sentiment-Analysis
4. http://www.site.uottawa.ca/~stan/csi5387/DMNB-paper.pdf
5. https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
6. https://scikit-learn.org/stable/modules/svm.html
7. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html


This is the term project for my Data Mining Class.

Because the dataset is very large, I divided it into smaller subsets and developed my models on those subsets, as my laptop could not handle the computation required for the full dataset.

## Overview


Text classification can be defined as the process of assigning tags or categories to text according to its content. It is one of the most crucial tasks in Natural Language Processing (NLP) and is widely used in sentiment analysis, spam detection and intent detection.


In today's day and age, there is an abundance of data. Data has become an economy, and the more smartly one can use it, the more successful they become. This data can be in any format: textual, numerical, images, etc. Out of these, text data is often considered to be the richest source of information.

But it is equally challenging and time consuming to extract valuable information out of text data. All major businesses need smart decision-making algorithms which can help them predict the information they need in order to make a profit.

This is why I am implementing this project as an example of how textual data can be used: how we perform NLP and other cleaning operations on it, how we use different models to predict ratings from it, and which of those models has the best accuracy.

We shall also cover what improvements can be made to the existing logic so as to improve our prediction accuracy.


# Purpose of this project

The main and foremost objective of this project is to predict the rating for a game, given a textual review.

Other objectives include understanding the mechanisms of text cleaning and NLP.

A further objective is to understand the working of different classification models.

Another objective is to develop our presentation and documentation skills.


# **So without any further delay, let's begin**

***An understanding of our dataset***

The original Board Game Dataset is a csv file which consists of 13,170,703 rows of data.

This data is a combination of Review, Rating, Users, ID and Game Name.

For the purpose of our project, since our goal is to predict the **Rating** using the given **Review** we will not focus on any other data.




**Note:-** As it is very difficult and out of my system's capabilities to work on such a large dataset at once, we will be dividing our dataset into smaller samples.

They will be explained below.

I have used 3 models for comparison and prediction of our data. They are : 

# **1. Multinomial Naive Bayes Classifier**

The Naive Bayes classifier is a very straightforward probabilistic classifier based on Bayes' Theorem. It makes a strong (naive) assumption that the features are independent of each other given the class.

Bayes theorem calculates probability P(c|x) where c is the class of the possible outcomes and x is the given instance which has to be classified, representing some certain features.

**P(c|x) = P(x|c) * P(c) / P(x)**


Naive Bayes is one of the most common and basic text classification techniques with various applications in Spam Detection, Disaster Tweet Detection, Sentiment Analysis etc. 

Working:
We are given a dataset and we have 11 possible classes, i.e. the rounded rating of a review can lie between 0 and 10.

Therefore, for a given review,
we first perform various NLP operations such as standardizing the text by converting it all to lower case and removing punctuation, numerical data, special characters, hyperlinks and HTML tags.

For better cleaning, we use the NLTK library to remove all the stopwords present in our data.

Stopwords are common words such as "it", "is", "able", "else", "and", "that", etc.

After performing all these text cleaning operations,
we tokenize the words and find the probability of every word for every rating, starting from 0 to 10.

We use CountVectorizer. It provides a simple and efficient method to tokenize a collection of text documents and to build a vocabulary of known words, while encoding the new documents using that vocabulary.

The CountVectorizer returns a matrix with one row per review and one column per word in the vocabulary. Each entry is the number of times that word occurs in that review, i.e. for the word "game" it records how many times "game" has occurred in each review.
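To make the idea of a count matrix concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the helper `count_matrix` and the two toy reviews are assumptions made only for illustration, and the tokenizer is a plain whitespace split.

```python
from collections import Counter

def count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    # Hypothetical tokenizer: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all words seen in any document
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["great game great fun", "boring game"]  # hypothetical toy reviews
vocabulary, rows = count_matrix(docs)
print(vocabulary)  # ['boring', 'fun', 'game', 'great']
print(rows)        # [[0, 1, 1, 2], [1, 0, 1, 0]]
```

Each row is one review and each column is one vocabulary word, which is the shape of input the later steps work on.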


We use TfidfTransformer. It returns a normalized tf-idf representation for a count matrix. Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus, and it is a common weighting scheme in information retrieval.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
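The weighting can be sketched by hand as follows. This is an illustrative approximation rather than the library code: it assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, and the function name `tfidf` and the small count matrix are hypothetical.

```python
import math

def tfidf(rows):
    """rows: list of per-document count vectors. Returns L2-normalized tf-idf rows."""
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        out.append([v / norm for v in weighted])
    return out

# Hypothetical counts for the columns ['boring', 'fun', 'game', 'great']
rows = [[0, 1, 1, 2], [1, 0, 1, 0]]
weights = tfidf(rows)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first review, "fun" and "game" both occur once, but "game" appears in every review and therefore ends up with the smaller weight.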

After we get our corpus of words and have identified their counts for every rating, we use the conditional probability formula of Naive Bayes to calculate the probability of a review belonging to each rating, by multiplying the prior probability of the rating with the probabilities of all the words in the review given that rating. The rating with the highest probability is the prediction.

The benefit of using Multinomial Naive-Bayes Classifier is that it can efficiently make predictions when there are multiple possibilities for a class prediction.
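The calculation described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the scikit-learn `MultinomialNB` code: it works on raw word counts, assumes add-one (Laplace) smoothing, ignores unseen words, and the names `train_nb` and `predict_nb` and the toy reviews are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log P(c) + sum of log P(word | c)."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n in class_docs.items():
        score = math.log(n / total_docs)              # log prior
        total_words = sum(word_counts[label].values())
        for tok in doc.split():
            if tok not in vocab:
                continue                              # skip words never seen in training
            p = (word_counts[label][tok] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)                      # logs avoid numeric underflow
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = ["great fun", "great strategy", "boring slow", "boring luck"]  # toy reviews
labels = [8, 8, 3, 3]                                                  # toy ratings
model = train_nb(docs, labels)
print(predict_nb(model, "great luck"))  # 8
```

Summing logarithms is equivalent to multiplying the probabilities, but it stays numerically stable for long reviews.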





# 2. Linear SVM using SGD Training
The SGD Classifier is a linear classifier (a linear SVM, logistic regression, etc.) that is trained with Stochastic Gradient Descent (SGD).
It implements regularized linear models with stochastic gradient descent learning.

This implementation works with the data represented as dense or sparse arrays of floating point values for the features. The model it fits can be controlled with the loss parameter, while by default, it fits a Linear Support Vector machine (Linear SVM).

SVM, or Support Vector Machine, is a model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The idea of SVM is simple: the algorithm takes the data as input and outputs a line or a hyperplane that separates the data into classes, if possible. **A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts**.
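As a small illustration of how a hyperplane separates two classes, the sketch below evaluates a linear decision function and the hinge loss that a linear SVM minimizes. The weights `w`, the bias `b` and the sample points are made-up values, not parameters learned in this project.

```python
def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """Hinge loss for a label y in {-1, +1}; zero once the margin is at least 1."""
    return max(0.0, 1.0 - y * decision(w, b, x))

w, b = [1.0, -1.0], 0.0                      # hypothetical hyperplane x1 - x2 = 0
print(decision(w, b, [3.0, 1.0]))            # 2.0  -> positive side
print(decision(w, b, [1.0, 3.0]))            # -2.0 -> negative side
print(hinge_loss(w, b, [3.0, 1.0], +1))      # 0.0  -> correct with enough margin
print(hinge_loss(w, b, [1.5, 1.0], +1))      # 0.5  -> correct but inside the margin
```

The sign of the decision value gives the predicted side of the hyperplane, while the hinge loss also penalizes points that are on the correct side but too close to it.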

# 3. Ridge Classifier
This classifier uses Ridge Regression for its implementation. This classifier first converts the target values into {-1,1} and then treats the problem as a regression task. It performs a multi-output regression in our case as we have multiple classes to predict.
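The target conversion can be pictured with the following sketch. It only shows the {-1, 1} encoding and the final "pick the largest output" step; the regression itself is left out, and the helper names `encode_targets` and `pick_class` and the example scores are assumptions for illustration.

```python
def encode_targets(labels):
    """One column per class: +1 where the label matches the class, -1 elsewhere."""
    classes = sorted(set(labels))
    matrix = [[1 if label == c else -1 for c in classes] for label in labels]
    return classes, matrix

def pick_class(classes, scores):
    """Predict the class whose regression output is largest."""
    best = max(range(len(classes)), key=lambda j: scores[j])
    return classes[best]

classes, Y = encode_targets([7, 8, 7, 10])   # hypothetical rounded ratings
print(classes)  # [7, 8, 10]
print(Y)        # [[1, -1, -1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]

# Hypothetical regression outputs for one review, one score per class
print(pick_class(classes, [-0.2, 0.4, -0.9]))  # 8
```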


This documentation explains every code block in a detailed yet concise manner.

## **Now that we have a better understanding of the different models used in this project, let's start with the programming part of our project**

**Importing Libraries**

This is where we import all the important libraries which will be needed by us later in the program. 

There is no compulsion to have all your import statements together in one block. It is simply a personal preference if you would like to keep all your import statements together or if you would like to keep them wherever.

In [None]:
import math
import pandas as pd
import numpy as np
import random as rnd
import matplotlib.pyplot as plt
import time
import pickle
import string
import sys
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import re
import os
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

**Reading Dataset File**

Here, after importing our essential libraries, we read our dataset into a dataframe using Pandas. The active code reads the file from the Kaggle input directory; the commented-out lines show the equivalent read from a mounted Google Drive.

We use the .head() function to display the top 5 records of our dataset. 

As you can see below, we have 5 rows of data, and the column names are also available.

In [None]:
# df = pd.read_csv('/content/drive/My Drive/Colab_Data_set/bgg-13m-reviews.csv')
# df.head()

df =  pd.read_csv('../input/boardgamegeek-reviews/bgg-13m-reviews.csv', index_col=0)
df.head()

As you know, the scope of our project is to predict the rating, given a review. Therefore, we are only concerned with the columns "rating" and "comment". 

We drop the rest of the columns, as they serve no purpose for our data analysis.
Hence, we drop the "ID", "user" and "name" columns below.

Dropping Column "ID", "user", "name"

In [None]:
#Drop all three unused columns in a single call
dataset = df.drop(columns=["ID", "user", "name"])



**Current DataSet**

Here, we can see the modified dataset with only the review and rating column, as desired.

In [None]:
dataset.head()

**Removing all the comments which are empty**

Our dataset consists of 13 million reviews, but many of them (approx. 80%) contain empty comments. We can see those empty comments as "NaN" in the "comment" column above.

Since this data is not essential to us and we won't be able to use it to perform any sort of analysis or prediction, we delete all the rows which have no comment value.

In [None]:
#Removing Empty Comments
dataset = dataset.dropna(subset=['comment'])

**Dataset after removing empty comments**

Therefore, after removing all the comments with a "NaN" value, we get the following dataset.

In [None]:
dataset.head()


For better exploration of the data, this histogram displays the distribution of all the ratings in our dataset against the number of reviews for each rating.

In [None]:
#plot histogram of ratings
num_bins = 500
#alpha must lie between 0 and 1 (1.0 = fully opaque)
plt.hist(dataset.rating, num_bins, facecolor='blue', alpha=1.0)

#plt.xticks(range(9000))
plt.title('Histogram of Ratings')
plt.xlabel('Ratings')
plt.ylabel('Count')
plt.show()

According to the above histogram, we can see that our dataset is heavily unbalanced, i.e. it contains a large majority of reviews which have a sentimentally positive rating that lies between 6 and 8. 
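One way to put an unbalanced distribution into perspective is to check how well a trivial model that always predicts the most common rating would do. The sketch below is only an illustration on made-up ratings; `majority_baseline` is a hypothetical helper and the numbers are not taken from the real dataset.

```python
from collections import Counter

def majority_baseline(ratings):
    """Return the most frequent rating and the accuracy of always predicting it."""
    counts = Counter(ratings)
    label, n = counts.most_common(1)[0]
    return label, n / len(ratings)

ratings = [7, 8, 7, 6, 7, 10, 8, 7, 3, 7]   # hypothetical rounded ratings
label, acc = majority_baseline(ratings)
print(label, acc)  # 7 0.5
```

On a heavily unbalanced dataset this baseline can be surprisingly high, so it is a useful reference point when reading the accuracies of the real models.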


# **Splitting the Dataset into Train and Test Subsets**
Since the Total Dataset is still huge, we split it into Train And Test Set in a 75:25 ratio.

For this operation, we use the train_test_split function from the sklearn.model_selection module.

The train_test_split function is for splitting a single dataset for two different purposes: training and testing. The training subset is for building your model. The testing subset is for using the model on unknown data to evaluate the performance of the model. 
This function makes random partitions for the two subsets.

In [None]:
from sklearn.model_selection import train_test_split
train,test = train_test_split(dataset,test_size = 0.25,random_state=0)
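The idea of a random, reproducible 75:25 partition can be sketched without any library as shown below. This is not how scikit-learn implements `train_test_split`; the function `random_split`, its rounding of the test size and the list of 100 stand-in rows are assumptions for illustration.

```python
import math
import random

def random_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and cut it into (train, test)."""
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))              # hypothetical stand-in for 100 reviews
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))  # 75 25
```

Fixing the seed plays the same role as `random_state` in the cell above: running the split again gives the same two subsets.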

Train Set

In [None]:
train.head()

Our Train Set has a total of 1978317 records

Our Test Set has a total of 659439 records

We make a copy of our training set as we are going to make two samples of our training dataset and make development sets from these samples for calculating our model accuracy and performance.

In [None]:
temp_train= train.copy()

The Maximum value of ratings : 

In [None]:
temp_train['rating'].max()

The Minimum value of ratings :

In [None]:
temp_train['rating'].min()

In [None]:
temp_train.head()

# **For Ease of Computation and High Performance, we further divide our training dataset into two samples**

Here, we split the training set into 2 Sample sets.

Each of these sample sets contains 50% of the main training set.

In [None]:
sample_1,sample_2 = train_test_split(temp_train,test_size=0.5,random_state = 2)

Sample Set 1 :

In [None]:
sample_1.head()

Sample Set 2 :

In [None]:
sample_2.head()

# **We now Start Cleaning Our Samples so that we can then use the refined dataset to fit into our models**

In the cleaning process for our text data, we:
1. Convert all text data into standardized lowercase text.
2. Remove all punctuation that is present in our text data.
3. Remove all the stopwords from our text data.

Stopwords are a set of commonly used words in a language. The main reason for removing stopwords from our text data is that, once the common words are gone, we are able to focus on the important words instead.

To get the list of stopwords, we use nltk. The stopwords corpus has to be downloaded to our system once (for example with nltk.download('stopwords')) so that we can perform text cleaning without errors.

Cleaning Sample Set 1

In [None]:
#lowercase and remove punctuation
sample_1['comment'] = sample_1['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

# stopword list to use
stopwords_list = stopwords.words('english')

"""
Since this is a dataset of game reviews, and since I ran the data cleaning process once before,
I identified a few extra words which can also be added to our stopword list

"""
stopwords_list.extend(('game','play','played','players','player','people','really','board','games','one','plays','cards','would')) 
#remove stopwords
sample_1['comment'] = sample_1['comment'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
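The three cleaning steps can be tried out on a single string with the sketch below. It uses a tiny made-up stopword set instead of the NLTK list, and `clean_comment` is a hypothetical helper; storing the stopwords in a set keeps each membership test fast, which may matter on millions of reviews.

```python
import string

STOPWORDS = {"the", "is", "a", "it", "game"}   # hypothetical tiny stopword set

def clean_comment(text, stopwords=STOPWORDS):
    """Lowercase, strip ASCII punctuation, then drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stopwords)

print(clean_comment("The game is GREAT, a real classic!"))  # great real classic
```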


Here we see that 'sample_1' has been cleaned and has been converted to lowercase. 

It also no longer contains any punctuation or stopwords.

In [None]:
sample_1.head()

Cleaning Sample Set 2

In [None]:
#lowercase and remove punctuation
sample_2['comment'] = sample_2['comment'].str.lower().apply(lambda x:''.join([i for i in x if i not in string.punctuation]))

#remove stopwords
sample_2['comment'] = sample_2['comment'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))


Here is the cleaned 'sample_2'

In [None]:
sample_2.head()

We now divide each sample into a train set and a development set so that we can test the accuracy of our models.

We split each sample set in a ratio of 80:20.

In [None]:
#Splitting Sample 1 
sample_1_train, sample_1_dev = train_test_split(sample_1,test_size=0.2,random_state = 0)

#Splitting Sample 2
sample_2_train, sample_2_dev = train_test_split(sample_2,test_size=0.2,random_state = 0)

Train Set for Sample 1

In [None]:
sample_1_train.head()

Development Set for Sample 1

In [None]:
sample_1_dev.head()

Train Set for Sample 2

In [None]:
sample_2_train.head()

Development Set for Sample 2

In [None]:
sample_2_dev.head()

## **We make x and y train sets for both our samples**

Creating Train X, Train Y ,  Dev X and Dev Y for Sample 1

In [None]:
#Training and Development X and Y for Sample 1
train_1x = sample_1_train['comment'].tolist()
train_1y = sample_1_train['rating'].tolist()

dev_1x = sample_1_dev['comment'].tolist()
dev_1y = sample_1_dev['rating'].tolist()

Creating Train X, Train Y ,  Dev X and Dev Y for Sample 2

In [None]:
#Training and Development X and Y for Sample 2
train_2x = sample_2_train['comment'].tolist()
train_2y = sample_2_train['rating'].tolist()

dev_2x = sample_2_dev['comment'].tolist()
dev_2y = sample_2_dev['rating'].tolist()

Therefore, our current Train and Development Sets for Sample 1 (first 5 records)  are:

In [None]:
train_1x[:5]

In [None]:
train_1y[:5]

In [None]:
dev_1x[:5]

In [None]:
dev_1y[:5]

And our current Train and Development Sets for Sample 2 (first 5 records) are:

In [None]:
train_2x[:5]

In [None]:
train_2y[:5]

In [None]:
dev_2x[:5]

In [None]:
dev_2y[:5]

Here we perform text cleaning again. We aim to also clean out any and all numerical values present in our data samples.

**This is an optional step**

The functions below remove special characters, numerical values, hyperlinks and HTML line-break tags. We also standardize the comments by converting all text to lowercase.

In [None]:
#Text Cleaning
def TextClean(data):
    
    txt = []
    for T in data:
        T = re.sub(r'@[A-Za-z0-9_]+','',T)   #remove @mentions
        T = re.sub(r"http\S+", "", T)        #remove hyperlinks
        T = T.replace('<br />', '')          #remove html line breaks
        T = re.sub(r"&amp;?","", T)          #remove html ampersand entities before '&' itself is stripped
        #remove common punctuation one character at a time
        for ch in "?'*/\\.():\",!&":
            T = T.replace(ch, "")
        T = re.sub(r"[0-9]+", "", T)         #remove numerical values
        T = re.sub(r"[”“\-+`#;|]", "", T)    #remove the remaining special characters
        T = T.lower()
        txt.append(T)
    return txt


#Removing Special Characters
def Remove_SC(text):
    #Only the lowercase letters a-z are kept
    alphabet = set(string.ascii_lowercase)
    l = []
    for i in text:
        txt = []
        for j in i.split(' '):
            #drop every character that is not a lowercase letter
            m = ''.join(k for k in j if k in alphabet)
            if m != '':
                txt.append(m)
        #join the words, each followed by a space as before
        l.append(''.join(w + ' ' for w in txt))
    return l

Cleaning the Train and Development Sets for Sample 1

In [None]:
#Cleaning Sample 1 Train Sets and Development Sets
#Execution takes 120 seconds

#Clean Text and Remove Numerical Values 
train_1x = TextClean(train_1x)
dev_1x   = TextClean(dev_1x)
#Remove Special Characters
train_1x = Remove_SC(train_1x)
dev_1x = Remove_SC(dev_1x)

### As we can see now, we no longer have any numerical values in our dataset either. This means that our data is now completely clean and ready to be fit into the models.

After Cleaning and Removing Special Characters, the Train and Development Sets for Sample 1 (first 5 records) are : 

In [None]:
train_1x[:5]

In [None]:
dev_1x[:5]

In [None]:
#Cleaning Sample 2 Train Sets and Development Sets
#Execution takes 120 seconds
#Clean Text
train_2x = TextClean(train_2x)
dev_2x   = TextClean(dev_2x)
#Remove Special Characters
train_2x = Remove_SC(train_2x)
dev_2x = Remove_SC(dev_2x)

After Cleaning and Removing Special Characters, the Train and Development Sets for Sample 2 (first 5 records) are : 

In [None]:
train_2x[:5]

In [None]:
dev_2x[:5]

# **Rounding off Rating Values**

**In our dataset, we know that there are ratings in the float format, i.e. they have decimal values as well.**

**If we use these decimal values as they are for prediction, we will end up with far more than 11 classes, as every unique rating will be considered a separate class. Therefore, we will round off all rating values in our sample sets so that we can identify the classes easily, as we will get all the values in whole numbers, without any decimal value.**

In [None]:
#Rounding Rating Values for : 
#Training And Development Set for Sample 1

train_1y  = [round(num) for num in train_1y]
dev_1y    = [round(num) for num in dev_1y]

#Training And Development Set for Sample 2

train_2y  = [round(num) for num in train_2y]
dev_2y    = [round(num) for num in dev_2y]
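One detail worth knowing about the cell above: Python's built-in `round` rounds exact halves to the nearest even integer, so 6.5 becomes 6 while 7.5 becomes 8. The sketch below shows this on a few made-up ratings and compares it with a hypothetical `round_half_up` helper, in case conventional rounding is preferred.

```python
import math

ratings = [6.4, 6.5, 7.5, 8.5, 9.75]          # hypothetical ratings

# Built-in round: ties go to the nearest even integer
print([round(r) for r in ratings])            # [6, 6, 8, 8, 10]

def round_half_up(x):
    """Round a non-negative rating so that ties always go up."""
    return math.floor(x + 0.5)

print([round_half_up(r) for r in ratings])    # [6, 7, 8, 9, 10]
```

Which behavior is used only matters for ratings that end in exactly .5, but it does change which class those reviews are assigned to.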



---



---



Since we now have both the cleaned data and the rounded ratings, we will fit our models with the training data and calculate the accuracy.

## We now fit our training data into the Multinomial NaiveBayes Classifier and Predict the Development Set

Calculating Accuracy for the Development Set of Sample 1


In [None]:
#Takes 30 seconds to execute
nb_1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_1.fit(train_1x,train_1y)
#Prediction
y_pred = nb_1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 1 :", accuracy_score(dev_1y, y_pred)*100," %")


Calculating Accuracy for the Development Set of Sample 2

In [None]:
#Takes 30 seconds to execute
nb_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_2.fit(train_2x,train_2y)
#Prediction
y_pred = nb_2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 2 :", accuracy_score(dev_2y, y_pred)*100," %")

So we have got the accuracy of our Multinomial Naive Bayes Classifier to be around 30 % which is not bad at all, given that we are using 20% of the entire dataset and are using a subset of that dataset for our model execution.

## We now fit our training data into the Linear SVM Classifier and Predict the Development Set

In [None]:
#Takes 1 minute to execute
sgd_clf_1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',
  SGDClassifier(penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_1.fit(train_1x,train_1y)
#Prediction
y_pred = sgd_clf_1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Linear SVM Classifier for Sample 1 :", accuracy_score(dev_1y, y_pred)*100," %")

Calculating Accuracy for the Development Set of Sample 2

In [None]:
#Takes 1 minute to execute
sgd_clf_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_2.fit(train_2x,train_2y)
#Prediction
y_pred = sgd_clf_2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Linear SVM Classifier for Sample 2 :", accuracy_score(dev_2y, y_pred)*100," %")

For our Linear SVM Classifier, we have achieved an accuracy of ~26%, which is not too bad. But it suggests that, on these samples and with these settings, Naive Bayes is the better of the two.

## We now fit our training data into the Ridge Classifier and Predict the Development Set

Calculating Accuracy for the Development Set of Sample 1

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_1.fit(train_1x,train_1y)

y_pred = tridge_clf_1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Ridge Classifier for Sample 1  :", accuracy_score(dev_1y, y_pred)*100," %")

Calculating Accuracy for the Development Set of Sample 2

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_2.fit(train_2x,train_2y)

y_pred = tridge_clf_2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Ridge Classifier for Sample 2  :", accuracy_score(dev_2y, y_pred)*100," %")

### **This is interesting!! Our Ridge Classifier has a higher accuracy than Naive Bayes. It may only be by a margin of about 1 percentage point, but it is a detail that must be tracked and noted.**

# To Summarize the accuracies:

For Multinomial Naive-Bayes

1.   Accuracy of Multinomial Naive-Bayes Classifier for 
     Sample 1 : 30.30702818553116  %

2.   Accuracy of Multinomial Naive-Bayes Classifier for 
     Sample 2 : 30.147296696186665  %


For Linear SVM

1.   Accuracy of Linear SVM Classifier for 
     Sample 1 : 25.813316349225605  %

2.   Accuracy of Linear SVM Classifier for 
     Sample 2 : 26.562942294472077  %

For Ridge Classifier

1.   Accuracy of Ridge Classifier for 
     Sample 1 : 31.3134376642808  %

2.   Accuracy of Ridge Classifier for 
     Sample 2 : 31.268955477374742  %





# Contribution - Adding a smoothing tolerance of one rating point to our accuracy calculation

We know that there are 2.6 million reviews, each of which has a rating between 0 and 10. These ratings also had values with up to three decimal places, e.g. 3.234. Therefore, if a review has a rating of 10 but is predicted to be 9, it is reasonable to consider the prediction "9" to be accurate as well. But if the same review is predicted to have a rating of "8", then that is an inaccurate prediction.

Hence, to add this consideration while calculating accuracy, we create a new function which compares each prediction with the rounded rating and counts the prediction as correct if the two differ by at most 1.

For example, if a review has a rating of 10, 9.5 or 9.6 (each of which rounds to 10) but is predicted as 9, we will classify this as an accurate prediction instead of an inaccurate one.

In [None]:
#Function for calculating accuracy with a smoothing factor
def smooth_acc(yr,yp):
  c=0
  for i in range(len(yr)):
    if(yr[i] == yp[i] or yr[i]==(yp[i]+1) or yr[i]==(yp[i]-1) ):
      c=c+1
    
  return c/len(yr)
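To see what the tolerance does, the sketch below compares exact accuracy with a within-one accuracy on a handful of made-up labels. The function names and the values are hypothetical; for whole-number ratings, `abs(t - p) <= 1` is the same condition as the three comparisons used in `smooth_acc`.

```python
def exact_acc(y_true, y_pred):
    """Share of predictions that match the rating exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def within_one_acc(y_true, y_pred):
    """Share of predictions that are at most one rating point away."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)

y_true = [10, 8, 7, 3, 6]   # hypothetical rounded ratings
y_pred = [9, 8, 5, 4, 6]    # hypothetical predictions

print(exact_acc(y_true, y_pred))       # 0.4
print(within_one_acc(y_true, y_pred))  # 0.8
```

The predictions are identical in both calls; only the definition of a correct prediction changes, so the second number can never be lower than the first.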

We now calculate the accuracies for our models using the new accuracy function. Since it accepts every exact match plus every prediction that is off by one, the reported accuracy of each model can only stay the same or increase.

Let's observe :)

## Using Multinomial Naive-Bayes

**Calculating Accuracy for Sample 1**

In [None]:
nb_1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_1.fit(train_1x,train_1y)
#Prediction
y_pred = nb_1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 1 :", smooth_acc(dev_1y, y_pred)*100," %")

**Calculating Accuracy for Sample 2**

In [None]:
nb_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_2.fit(train_2x,train_2y)
#Prediction
y_pred = nb_2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 2 :", smooth_acc(dev_2y, y_pred)*100," %")

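WOW!! That's a big jump in the measured accuracy, from a mere 30% to a WHOPPING 66%. Keep in mind that the model itself has not changed; the gain comes from also counting predictions that are within one rating point as correct.

Let's see how the results look for our other classifiers :D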

## Using Linear SVM

**Calculating Accuracy for Sample 1**

In [None]:
sgd_clf_1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_1.fit(train_1x,train_1y)
#Prediction
y_pred = sgd_clf_1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Linear SVM Classifier for Sample 1 :", smooth_acc(dev_1y, y_pred)*100," %")

**Calculating Accuracy for Sample 2**

In [None]:
sgd_clf_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_2.fit(train_2x,train_2y)
#Prediction
y_pred = sgd_clf_2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Linear SVM Classifier for Sample 2 :", smooth_acc(dev_2y, y_pred)*100," %")

Great! Our SVM Classifier also has an increase in accuracy in comparison to before when we did not have any smoothing for our accuracy.

It still has a lower accuracy than our Naive Bayes so it still comes second to it.

## Using Ridge Classifier

**Calculating Accuracy for Sample 1**

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_1.fit(train_1x,train_1y)

y_pred = tridge_clf_1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Ridge Classifier for Sample 1  :", smooth_acc(dev_1y, y_pred)*100," %")

**Calculating Accuracy for Sample 2**

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_2.fit(train_2x,train_2y)

y_pred = tridge_clf_2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Ridge Classifier for Sample 2  :", smooth_acc(dev_2y, y_pred)*100," %")

Amazing! Our Ridge Classifier has had the highest accuracy so far after applying the smoothing accuracy function.

And that's a big jump from a 31% accuracy to a 69% accuracy.


This shows that a large share of our models' errors are off by only one rating point, which is why the smoothing function raises the measured accuracy so much.

# To Summarize the Accuracies using our Smoothing Accuracy Calculating Function:

**Using Smoothing Function for Calculating Accuracy for Model**

For Multinomial Naive-Bayes

1.   Accuracy of Multinomial Naive-Bayes Classifier for 
     Sample 1 : 66.01763112135549  %
2.   Accuracy of Multinomial Naive-Bayes Classifier for 
     Sample 2 : 65.79067087225525  %

For Linear SVM 

1.   Accuracy of Linear SVM Classifier for 
     Sample 1 : 63.035302681062724  %
2.   Accuracy of Linear SVM Classifier for 
     Sample 2 : 63.0494561041692  %

For Ridge Classifier

1.   Accuracy of Ridge Classifier for 
     Sample 1 : 69.1101540701201  %
2.   Accuracy of Ridge Classifier for 
     Sample 2 : 69.2319746047151  %






### **Therefore, the Ridge Classifier attains the highest accuracy of our three models when we use our smoothing function for calculating accuracy.**

But there are other approaches that can also be used to improve the rating prediction.

They are mentioned below. :)



---



---



# **Contribution - Rescale the Ratings to a scale of 0 to 5 from the original scale 0 to 10**

Another approach to improving the accuracy would be to rescale the ratings from 0-10 to 0-5.

Our sample sets have ratings ranging from 0 to 10, so the reviews are spread across 11 classes. If we condense the scale so that there are fewer classes (0 to 5 gives 6 classes), then it should, theoretically, increase the base accuracy of our models, even without the smoothing function.

So, let's just jump into it. :)

## Since the textual data will remain the same, we will use the same datasets that we have created above

Train And Development Sets for Sample 1

In [None]:
train_1x[:5]

In [None]:
dev_1x[:5]

Train And Development Sets for Sample 2

In [None]:
train_2x[:5]

In [None]:
dev_2x[:5]

## **Before we rescale the ratings, let us first have a look at the originally rounded off ratings.**

**Since they are still in a scale of 0 to 10, the maximum value should be 10 and the minimum value should be 0.**


For Sample 1

In [None]:
print("For Training Set ( Sample 1 ) ")
print("Max value : ",max(train_1y), " Min Rating : ",min(train_1y))

print("For Development Set ( Sample 1 ) ")
print("Max value : ",max(dev_1y), " Min Rating : ",min(dev_1y))

For Sample 2

In [None]:
print("For Training Set ( Sample 2 ) ")
print("Max value : ",max(train_2y), " Min Rating : ",min(train_2y))

print("For Development Set ( Sample 2 ) ")
print("Max value : ",max(dev_2y), " Min Rating : ",min(dev_2y))

**As we can see, the rating values in our sample sets range from 0 to 10.**

**10 is the maximum and 0 is the minimum, just as we predicted.**

We have 4 sets of ratings, as we have two samples and each has a train and a development set.

Reading Rating Values for Both Sample Sets and Appending to Lists

In [None]:
#Train and Development for Sample 1
new_train_1y = list(sample_1_train['rating'])
new_dev_1y   = list(sample_1_dev['rating'])

#Train and Development for Sample 2
new_train_2y = list(sample_2_train['rating'])
new_dev_2y   = list(sample_2_dev['rating'])


Since we are using the same samples as before, we already have the cleaned comments. So we now only need to rescale the rating values.

Re-Scaling for Sample 1

In [None]:
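#Re-Scaling Values: halve each rating and round to the nearest integer
#Note: Python 3's round() rounds exact halves to the nearest even integer (e.g. round(2.5) == 2)
#Training And Development Set for Sample 1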
new_train_1y  = [round(num/2) for num in new_train_1y]
new_dev_1y    = [round(num/2) for num in new_dev_1y]

print("For Training Set ( Sample 1 ) ")
print("Max value : ",max(new_train_1y), " Min Rating : ",min(new_train_1y))

print("For Development Set ( Sample 1 ) ")
print("Max value : ",max(new_dev_1y), " Min Rating : ",min(new_dev_1y))

Re-Scaling for Sample 2

In [None]:
#Re-Scaling Values
#Training And Development Set for Sample 2
new_train_2y  = [round(num/2) for num in new_train_2y]
new_dev_2y    = [round(num/2) for num in new_dev_2y]

print("For Training Set ( Sample 2 ) ")
print("Max value : ",max(new_train_2y), " Min Rating : ",min(new_train_2y))

print("For Development Set ( Sample 2 ) ")
print("Max value : ",max(new_dev_2y), " Min Rating : ",min(new_dev_2y))

As we can see, the maximum rating value in all of the ratings is now 5 and the minimum value is 0. 
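
To see which original ratings end up in each rescaled class, here is a small sketch that applies the same `round(num/2)` expression to every integer rating from 0 to 10. It assumes integer ratings (raw ratings with decimals would land differently), and the name `rescale` is only for illustration. Because Python 3 rounds exact halves to the nearest even integer, the resulting bins are not all the same width.

```python
from collections import defaultdict

def rescale(rating):
    # Same expression as in the cells above: halve, then round.
    # Python 3 rounds exact halves to the nearest even integer.
    return round(rating / 2)

# Group the original integer ratings by the class they map to.
bins = defaultdict(list)
for rating in range(11):
    bins[rescale(rating)].append(rating)

for new_rating in sorted(bins):
    print(new_rating, "<-", bins[new_rating])
```

Under this assumption, classes 0, 2 and 4 each absorb two or three original ratings while classes 1, 3 and 5 each receive a single one, which is worth keeping in mind when comparing how the models behave per class.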

## This is what we wanted to achieve, so now let's calculate the accuracy again using these new scaled rating values

Calculating Accuracy for Naive Bayes using Scaled Ratings

In [None]:
nb_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_n1.fit(train_1x,new_train_1y)
#Prediction
y_pred = nb_n1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 1 :", accuracy_score(new_dev_1y, y_pred)*100," %")

nb_n2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_n2.fit(train_2x,new_train_2y)
#Prediction
y_pred = nb_n2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 2 :", accuracy_score(new_dev_2y, y_pred)*100," %")

**Great! Our expectation held: the base accuracy has improved to about 55%.**

**Initially, when we did not rescale our ratings, Naive Bayes achieved an accuracy of about 30%. With the rescaled ratings it reaches about 55%, a large jump in base accuracy (keeping in mind that the model now chooses among 6 classes instead of 11).**
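
One way to put this jump in context is to check how much a trivial majority-class guess gains from the same rescaling. The sketch below uses a small made-up list of ratings (`toy_ratings` is hypothetical, not taken from the dataset), and the helper name `majority_share` is an assumption for illustration.

```python
from collections import Counter

def majority_share(labels):
    # Accuracy of always predicting the most frequent label.
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical ratings on the original 0 to 10 scale (illustration only).
toy_ratings = [4, 5, 6, 7, 7, 8, 8, 8, 9, 10]

# Same rescaling expression as used for the samples above.
toy_rescaled = [round(r / 2) for r in toy_ratings]

print("Majority share on 0-10 scale:", majority_share(toy_ratings))
print("Majority share on 0-5 scale :", majority_share(toy_rescaled))
```

Running the same comparison on the real training labels would show how much of the improvement a model gets simply from the coarser scale.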


Let's now see how our other models have fared with this change.

Calculating Accuracy for Linear SVM using Scaled Ratings

In [None]:
sgd_clf_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_n1.fit(train_1x,new_train_1y)
#Prediction
y_pred = sgd_clf_n1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Linear SVM Classifier for Sample 1 :", accuracy_score(new_dev_1y, y_pred)*100," %")

sgd_clf_n2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_n2.fit(train_2x,new_train_2y)
#Prediction
y_pred = sgd_clf_n2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Linear SVM Classifier for Sample 2 :", accuracy_score(new_dev_2y, y_pred)*100," %")

Yay! Although slightly lower than Naive Bayes, our SVM has roughly doubled its accuracy compared to the unscaled ratings. SVM had an accuracy of ~26% before scaling, and it now achieves about 54%.

Calculating Accuracy for Ridge Classifier using Scaled Ratings

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_n1.fit(train_1x,new_train_1y)

y_pred = tridge_clf_n1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Ridge Classifier for Sample 1  :", accuracy_score(new_dev_1y, y_pred)*100," %")

#This Code takes 230 seconds to execute
tridge_clf_n2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_n2.fit(train_2x,new_train_2y)

y_pred = tridge_clf_n2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Ridge Classifier for Sample 2  :", accuracy_score(new_dev_2y, y_pred)*100," %")

**Wow!! The accuracy of our Ridge Classifier is the highest among all the classifiers, and it has also improved considerably over its own previous accuracy.**


**Ridge Classifier showed an accuracy of 31% without scaling the ratings but has now achieved an accuracy of 59%, which is a significant improvement.**


# To Summarize the Accuracies of our models after scaling the ratings from (0 to 10) to (0 to 5)

For Multinomial Naive-Bayes

1. Accuracy of Multinomial Naive-Bayes Classifier for 
   Sample 1 : 55.466254195478996  %
2. Accuracy of Multinomial Naive-Bayes Classifier for 
   Sample 2 : 55.331796675967496  %

For Linear SVM 

1. Accuracy of Linear SVM Classifier for 
   Sample 1 : 54.846030975777424  %
2. Accuracy of Linear SVM Classifier for 
   Sample 2 : 54.76717618989851  %

For Ridge Classifier

1. Accuracy of Ridge Classifier for 
   Sample 1  : 59.35743459096607  %
2. Accuracy of Ridge Classifier for 
   Sample 2  : 59.299809939746865  %


# We will now try to achieve an even higher accuracy by combining this rescaling approach with the smoothed accuracy function that we created above

Calculating Accuracy for Naive Bayes using Scaled Ratings and Smoothing Accuracy Function

In [None]:
nb_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_n1.fit(train_1x,new_train_1y)
#Prediction
y_pred = nb_n1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 1 :", smooth_acc(new_dev_1y, y_pred)*100," %")

nb_n2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', MultinomialNB()),])
#Fitting Training Set to Model
nb_n2.fit(train_2x,new_train_2y)
#Prediction
y_pred = nb_n2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Multinomial Naive-Bayes Classifier for Sample 2 :", smooth_acc(new_dev_2y, y_pred)*100," %")

Amazing. We have achieved an accuracy of 82% on our Naive Bayes Classifier by combining our rescaled ratings and our smoothing accuracy function.

We have thus succeeded in raising the overall accuracy of our Naive Bayes model from 30% to 82%.

Let us look at the competition :)

Calculating Accuracy for Linear SVM using Scaled Ratings and Smoothing Accuracy Function


In [None]:
sgd_clf_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_n1.fit(train_1x,new_train_1y)
#Prediction
y_pred = sgd_clf_n1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Linear SVM Classifier for Sample 1 :", smooth_acc(new_dev_1y, y_pred)*100," %")

sgd_clf_n2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf', 
  SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)),])
#Fitting Model 
sgd_clf_n2.fit(train_2x,new_train_2y)
#Prediction
y_pred = sgd_clf_n2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Linear SVM Classifier for Sample 2 :", smooth_acc(new_dev_2y, y_pred)*100," %")

Great, this shows that changing the scale and the accuracy function has a large effect on the reported accuracy, as we can see from the significant improvement.

SVM has jumped from an accuracy of 26% to ~82%.



Calculating Accuracy for Ridge Classifier using Scaled Ratings and Smoothing Accuracy Function

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_n1.fit(train_1x,new_train_1y)

y_pred = tridge_clf_n1.predict(dev_1x)
#Predicting For Development Set 1
print("Accuracy of Ridge Classifier for Sample 1  :", smooth_acc(new_dev_1y, y_pred)*100," %")

#This Code takes 230 seconds to execute
tridge_clf_n2 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
tridge_clf_n2.fit(train_2x,new_train_2y)

y_pred = tridge_clf_n2.predict(dev_2x)
#Predicting For Development Set 2
print("Accuracy of Ridge Classifier for Sample 2  :", smooth_acc(new_dev_2y, y_pred)*100," %")

WOW!!
Our Ridge Classifier outperforms the other two classifiers on the rescaled ratings, both with and without smoothing, and rescaling and smoothing turn out to be big contributing factors.

We have jumped from an initial accuracy of 31% to 88%.



# To Summarise the Accuracy of all Classifiers after : 

**Rescaling Rating Intervals from 0 - 10 to 0 - 5**

**AND**

**Using Smoothing Function for Calculating Accuracy for Model**


For Multinomial Naive-Bayes

1. Accuracy of Multinomial Naive-Bayes Classifier for 
   Sample 1 : 82.80864571960048  %
2. Accuracy of Multinomial Naive-Bayes Classifier for 
   Sample 2 : 82.6347608071495  %

For Linear SVM 

1. Accuracy of Linear SVM Classifier for 
   Sample 1 : 81.92506773423915  %
2. Accuracy of Linear SVM Classifier for 
   Sample 2 : 81.81841158154394  %

For Ridge Classifier

1. Accuracy of Ridge Classifier for 
   Sample 1  : 88.37650127380809  %
2. Accuracy of Ridge Classifier for 
   Sample 2  : 88.27338347688948  %


# Therefore, we will use our Ridge Classifier model with the rescaled rating values to evaluate on our test set

Cleaning the Data for Train and Test Set

In [None]:
#lowercase and remove punctuation (applied to both train and test)
for df in (train, test):
    df['comment'] = df['comment'].str.lower().apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

#remove stopwords (applied to both train and test)
for df in (train, test):
    df['comment'] = df['comment'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))
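
As a quick check of what these two cleaning steps do to a single comment, here is a self-contained sketch. The stopword set `example_stopwords` is a tiny hypothetical stand-in for `stopwords_list`, and `clean_comment` is an illustrative helper, not a function from this notebook.

```python
import string

# Hypothetical stand-in for stopwords_list (illustration only).
example_stopwords = {"the", "is", "a", "and", "it"}

def clean_comment(text, stopwords):
    # Step 1: lowercase and drop punctuation characters.
    lowered = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    # Step 2: drop stopwords, keeping the remaining words in order.
    return " ".join(word for word in lowered.split() if word not in stopwords)

print(clean_comment("The artwork is GREAT, and it plays fast!", example_stopwords))
```

Applying the same function to both the training and the test comments keeps the vocabulary seen at prediction time consistent with the one seen during fitting.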

Building X and Y Lists for Train and Test Set


In [None]:
#Training X and Y
x_train = list(train['comment'])
y_train = list(train['rating'])

#Test X and Y
x_test = list(test['comment'])
y_test = list(test['rating'])

Cleaning Once Again to Remove all the Numerical Values 

In [None]:
#Cleaning Train Sets and Test Sets

#Execution takes 300 seconds
#Clean Text
x_train = TextClean(x_train)
x_test   = TextClean(x_test)
#Remove Special Characters
x_train = Remove_SC(x_train)
x_test = Remove_SC(x_test)

Rounding and Rescaling the Interval to 0 to 5

In [None]:
#Training And Test Set
y_train  = [round(num/2) for num in y_train]
y_test   = [round(num/2) for num in y_test]


Fitting Train Data into Ridge Classifier Model for Test Set Accuracy Calculation

In [None]:
#This Code takes 230 seconds to execute
tridge_clf_n1 = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),('clf',RidgeClassifier()),])
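#Fitting Training Set to Model
#Note: this fits on the Sample 1 training split (train_1x, new_train_1y), not on x_train / y_train built above
tridge_clf_n1.fit(train_1x,new_train_1y)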


**Using Normal Accuracy Function**

In [None]:
#Prediction
y_pred = tridge_clf_n1.predict(x_test)
#Predicting For Test Set
print("Accuracy of Ridge Classifier for Test Set  :", accuracy_score(y_test, y_pred)*100," %")


**Using Smooth Accuracy Function**

In [None]:
#Prediction
y_pred = tridge_clf_n1.predict(x_test)
#Predicting For Test Set
print("Accuracy of Ridge Classifier for Test Set  :", smooth_acc(y_test, y_pred)*100," %")


# Great!!

# For our Final Test Set,
# We get an accuracy of 57.63% if we do not apply any smoothing, and an accuracy of 86.65% when we use our smoothing accuracy function.


# Thus we conclude that our Ridge Classifier is the most accurate of the three models we compared, and we will host this model to perform live prediction.

# Challenges :
One of the major challenges in this project for me was managing the huge dataset. Since I had no prior experience in data mining, it took me quite some time to adapt to the different techniques and how they work. Using the knowledge I already had, I tried my best and was able to compare the 3 models and perform some optimizations on the accuracy of our project. Going from coming up with ideas for the project to actually implementing them was a mammoth task for me as well.

Another challenge, which I am very happy to have completed, was the extra credit part of our project, i.e. deploying our model on a live web server using a Flask web app.

This was a difficult project, but I have learned quite a lot from it and I feel that my grasp of the various topics in Data Mining has strengthened. I also feel very comfortable with Jupyter Notebook, Kaggle and Google Colab.

I also put a lot of effort and plenty of hours into the web hosting part of our assignment, as the host would not allow a dataset of 1 GB to be uploaded. Hence I had to use Pickle to save my model by serializing it, and then deserialize it and run it. After putting almost 5 to 6 hours into understanding Flask, I am now able to deploy my model on a live web server using Heroku.
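
The save-and-reload step described above can be sketched with the standard `pickle` module. To stay self-contained, the example pickles a plain dictionary as a stand-in for the fitted pipeline; the file name `model.pkl` and the variable names are assumptions, not the ones used in the deployed app.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; in the project this would be the trained pipeline.
stand_in_model = {"name": "ridge_pipeline", "classes": [0, 1, 2, 3, 4, 5]}

# Hypothetical file location inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Serialize the object once, after training.
with open(model_path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Deserialize it later, e.g. when the web app starts.
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

print(restored_model == stand_in_model)
```

Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.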

# Contribution
There are 2 contributions in this project that I can confidently say are my own thinking and my own approach. I have already mentioned them in the document as and when they come up.

The first contribution is applying a smoothing parameter to our accuracy calculation: we keep a buffer of about one rating, and if the predicted rating is within that buffer of the actual rating, we consider the prediction to be accurate as well.
For the Ridge Classifier, this increased the accuracy from about 31% to about 69% on the normal rounded-off ratings, and from about 59% to about 88% on the rescaled rating values.
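
The idea of the smoothing buffer can be sketched as follows. This is a minimal illustration assuming a tolerance of exactly one rating point; the notebook's own `smooth_acc` is defined earlier and may differ in detail, so the function name `within_tolerance_accuracy` and its `tolerance` parameter are hypothetical.

```python
def within_tolerance_accuracy(y_true, y_pred, tolerance=1):
    # Count a prediction as correct when it is at most `tolerance` away
    # from the actual rating, then return the fraction of such predictions.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)

# Made-up ratings for illustration only.
actual    = [4, 3, 5, 2, 0]
predicted = [4, 4, 3, 2, 1]

print(within_tolerance_accuracy(actual, predicted, tolerance=0))  # exact matches only
print(within_tolerance_accuracy(actual, predicted, tolerance=1))  # one rating of slack
```

On a 0 to 5 scale a slack of one rating covers a larger share of the scale than it does on 0 to 10, which is one reason the smoothed accuracy rises further after rescaling.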

This brings me to my second contribution: rescaling our rating values by dividing them by 2 and rounding them off so that all values fall within a range of 0 to 5. This raised the base accuracy of the Ridge Classifier to 59% and its smoothed accuracy to 88%. That has been my major contribution in this project, along with the detailed comparison of accuracy for each of the three models.