# Final Project
### Anoop Kunjumon Scariah
### Video Demo [click here](https://youtu.be/yQJyGSp7OLc)
### Hosted Application [click here](https://ratingprediction-ml-nlp.herokuapp.com/)
### GitHub [click here](https://github.com/AnoopKunju/RatingPredictor)


In [None]:
import os
import pickle
import string
import warnings

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

warnings.filterwarnings('ignore')

In [None]:
nltk.download('stopwords')
nlp = spacy.load("en_core_web_sm")

In [None]:
df = pd.read_csv('../input/boardgamegeek-reviews/bgg-15m-reviews.csv')

# Dataset

In [None]:
df.head(10)

*The dataset consists of 6 columns. The `ID`, `name` and `user` columns are not used in this project, because the goal is a model that predicts the rating from the comment text using NLP.*

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.drop(['Unnamed: 0', 'user', 'ID', 'name'], axis=1, inplace=True) # drop columns that are not useful for this project
df.head()

As we can see, the comment column contains null values, which contribute nothing to the prediction, so we first clean the data by dropping those rows.

In [None]:
df.dropna(inplace= True)
df.head(10)

In [None]:
df.shape

*After removing the rows with null values, the dataset shrinks from about 15 million rows to almost 3 million.*

### Understanding Data

In [None]:
plt.hist(df['rating'], bins = 50)
plt.show()

*The plot above shows that the ratings are not only integers but also floats, so for model generation we need to round each rating to the nearest integer.*

In [None]:
df.rating = df['rating'].round()
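One detail of the rounding step is worth knowing: to my understanding, `Series.round()` follows NumPy's round-half-to-even rule, which is the same tie rule as Python's built-in `round`. A rating of 6.5 therefore becomes 6, while 7.5 becomes 8. The sketch below uses only the standard library to show the effect on a few made-up ratings; `round_half_up` is a hypothetical helper for comparison and is not used anywhere in this notebook.

```python
import math


def round_half_up(x):
    """Hypothetical alternative: always round a .5 tie upwards."""
    return math.floor(x + 0.5)


# Made-up ratings, chosen to include .5 ties.
ratings = [6.5, 7.5, 8.5, 7.4]

half_even = [round(r) for r in ratings]        # ties go to the even neighbour
half_up = [round_half_up(r) for r in ratings]  # ties always go up

print(half_even)  # [6, 8, 8, 7]
print(half_up)    # [7, 8, 9, 7]
```

Which rule is preferable depends on how ties should be treated; the notebook keeps the default behaviour.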

In [None]:
plt.hist(df['rating'], bins = 50)
plt.show()

*The plot above shows that comments rated 6 to 8 are the most numerous in the dataset.*

### Data Reduction
*The cleaned data has almost 3 million rows and would need a huge amount of computation power, so I am using 10% of it for preprocessing, model creation and testing.*

In [None]:
df_final = df.sample(frac=0.1, replace=False, random_state=1)
df_final.head()

In [None]:
plt.hist(df_final['rating'], bins = 50)
plt.show()

*The graph above shows that the sample has roughly the same rating distribution as the original 3 million rows, so the sample should be representative of the full data.*
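The visual comparison of the two histograms can also be checked numerically. The sketch below is a minimal standard-library illustration on made-up labels (not the BGG data): it draws a 10% sample without replacement and reports the largest difference in class share between the full list and the sample. The helper `proportions` and the label counts are assumptions for the example only.

```python
import random
from collections import Counter


def proportions(labels):
    """Return {label: share of all labels}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up, imbalanced rating labels standing in for df['rating'].
population = [3] * 500 + [6] * 2000 + [7] * 3000 + [8] * 3500 + [10] * 1000

random.seed(1)
sample = random.sample(population, k=len(population) // 10)  # 10%, no replacement

full_share = proportions(population)
sample_share = proportions(sample)

# Largest absolute gap in class share over all ratings.
max_gap = max(abs(full_share[r] - sample_share.get(r, 0.0)) for r in full_share)
print(f"largest difference in class share: {max_gap:.3f}")
```

With pandas the same comparison could be made with `value_counts(normalize=True)` on `df['rating']` and `df_final['rating']`.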

In [None]:
X = df_final['comment'].values
Y = df_final['rating'].values

# Removing Stopwords, Punctuation and Digits from Comments

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30)

In [None]:
print(np.where(Y_train == 3))

In [None]:
exclude_list = string.digits + string.punctuation # characters to strip: digits and punctuation
stop_words = set(stopwords.words("english")) # a set gives fast membership tests
# Translation table that maps every excluded character to a space
table = str.maketrans(exclude_list, len(exclude_list) * " ")

def cleaning(text):
  # Let spaCy accept comments longer than its current length limit
  nlp.max_length = max(nlp.max_length, len(text))
  # Convert to lower case, then replace punctuation and digits with spaces
  raw_text = text.lower().translate(table)
  # Lemmatization with the spaCy 'en' model (NER and parser are not needed)
  doc = nlp(raw_text, disable = ['ner', 'parser'])
  lemmatized_output = " ".join([token.lemma_ for token in doc]) # join the lemmas back into one string
  # Removing stopwords
  words = lemmatized_output.split()
  clean_text = " ".join([w for w in words if w not in stop_words])
  return clean_text
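To make the punctuation and digit step of `cleaning` easier to follow, here is a small standalone sketch of just that part, without spaCy or NLTK. The example comment is made up.

```python
import string

exclude_list = string.digits + string.punctuation
# Map every digit and punctuation character to a single space.
table = str.maketrans(exclude_list, " " * len(exclude_list))

comment = "Great game!!! 10/10, would play again."
stripped = comment.lower().translate(table)
tokens = stripped.split()  # split() collapses the runs of spaces left behind

print(tokens)  # ['great', 'game', 'would', 'play', 'again']
```

Replacing the characters with spaces instead of deleting them keeps words such as "10/10,would" from being glued together.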

In [None]:
Xtrain_clean = [cleaning(text) for text in X_train]

In [None]:
Xtrain_clean = [st.replace('-PRON-', '') for st in Xtrain_clean] # remove the -PRON- placeholder that spaCy emits for pronouns

# Model Creation and Evaluation

In [None]:
vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.5 , max_features= 15000)
dtm = vectorizer.fit_transform(Xtrain_clean)
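# Clean the test comments the same way as the training comments before vectorizing
dtm_test = vectorizer.transform([cleaning(text).replace('-PRON-', '') for text in X_test])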

### Multinomial Naive Bayes

In [None]:
NaiveBayes = MultinomialNB()
NaiveBayes.fit(dtm, Y_train)

In [None]:
Y_pred = NaiveBayes.predict(dtm_test)

In [None]:
print("Multinomial Naive Bayes Test Accuracy is {} %".format(accuracy_score(Y_test, Y_pred)*100))

### Linear SVC

In [None]:
SVC = LinearSVC()
SVC.fit(dtm,Y_train)

In [None]:
Y_pred_svm = SVC.predict(dtm_test)

In [None]:
print("Linear SVC Test Accuracy is {} %".format(accuracy_score(Y_test, Y_pred_svm)*100))
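Because the ratings are imbalanced and ordered, accuracy on its own can be hard to interpret. The sketch below, on made-up labels and predictions, compares accuracy with a "always predict the most frequent rating" baseline and also computes MAE and RMSE, which measure how far off a wrong prediction is. It uses only the standard library; in the notebook, `mean_squared_error` from scikit-learn (already imported) could supply the squared error.

```python
import math
from collections import Counter

# Made-up true and predicted ratings, for illustration only.
y_true = [8, 8, 8, 7, 6, 8, 3, 8, 7, 8]
y_pred = [8, 8, 7, 8, 8, 8, 8, 8, 7, 8]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Baseline: always predict the most frequent rating in y_true.
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

# Ratings are ordered, so the size of each miss matters too.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(f"accuracy={accuracy:.2f} baseline={baseline_accuracy:.2f}")
print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

In this toy case the predictions are no more accurate than the baseline, which is the kind of situation that accuracy alone would hide.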

# Hyperparameter Tuning

Hyperparameters for Multinomial Naive Bayes

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
alpha = list(range(1, 11, 1))
accuracy = dict()
for x in alpha:
  MNBC  = MultinomialNB(alpha= x)
  scores = cross_val_score(MNBC, dtm, Y_train, cv=3, scoring='accuracy')
  accuracy[x] = scores.mean()
print(accuracy)

In [None]:
naiveDF = pd.DataFrame.from_dict(accuracy,orient='index')
naiveDF.sort_values(0,ascending= False)

*From the table above we can see that alpha = 1 outperforms the other values for Multinomial Naive Bayes, so I am keeping alpha = 1 as the hyperparameter for the Naive Bayes model.*
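The loop above only tries integer alpha values from 1 to 10. As a possible extension, the same search can be written with scikit-learn's `GridSearchCV`, which also makes it easy to include smoothing values below 1. The sketch below runs on a tiny made-up corpus, so its scores say nothing about the BGG data; the grid values are my assumption, not a tested recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus; the labels are illustrative ratings.
texts = [
    "great fun game", "love this game", "fun and clever", "great artwork fun",
    "really love the theme", "clever and great",
    "boring slow game", "hate the rules", "slow and dull", "boring dull theme",
    "really hate this", "dull and slow rules",
]
labels = [8] * 6 + [3] * 6

features = TfidfVectorizer().fit_transform(texts)

# Assumed grid: includes alpha values below 1 as well as above.
alpha_grid = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": alpha_grid},
    cv=3,
    scoring="accuracy",
)
search.fit(features, labels)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the mean score for every alpha, which plays the same role as the `accuracy` dictionary in the loop.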

Hyperparameters for Linear SVC

In [None]:
SVC_acc = dict()
SVC_10 = LinearSVC(C= 10)
scores = cross_val_score(SVC_10, dtm, Y_train, cv=3, scoring='accuracy') # score the C=10 model, not the default SVC
SVC_acc[10] = scores.mean()

In [None]:
SVC_100 = LinearSVC(C= 100)
scores100 = cross_val_score(SVC_100, dtm, Y_train, cv=3, scoring='accuracy') # score the C=100 model
SVC_acc[100] = scores100.mean() # use the scores of the C=100 run

In [None]:
SVC_1 = LinearSVC()
scores = cross_val_score(SVC_1, dtm, Y_train, cv=3, scoring='accuracy') # default C=1
SVC_acc[1] = scores.mean()

In [None]:
SVCDF = pd.DataFrame.from_dict(SVC_acc,orient='index')
SVCDF.sort_values(0,ascending= False)

*From the comparison above we can see that C = 100 performs best for the Linear SVC model.*
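A compact way to run this kind of comparison is a single loop that builds a fresh `LinearSVC` for each value of C and stores that model's own mean score, which avoids mixing up variables between cells. This is a sketch on a tiny made-up corpus, so the numbers it prints are not results for the BGG data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Tiny made-up corpus; the labels are illustrative ratings.
texts = [
    "great fun game", "love this game", "fun and clever", "great artwork fun",
    "really love the theme", "clever and great",
    "boring slow game", "hate the rules", "slow and dull", "boring dull theme",
    "really hate this", "dull and slow rules",
]
labels = [8] * 6 + [3] * 6

features = TfidfVectorizer().fit_transform(texts)

svc_accuracy = {}
for c in (1, 10, 100):
    model = LinearSVC(C=c)  # a fresh estimator for each C
    scores = cross_val_score(model, features, labels, cv=3, scoring="accuracy")
    svc_accuracy[c] = scores.mean()  # mean of this model's own fold scores

print(svc_accuracy)
```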

# Conclusion

In [None]:
review = list()
comment = "I hate the game"
review.append(comment)
dtm_rev = vectorizer.transform(review)
pred_naive = NaiveBayes.predict(dtm_rev)
pred_SVC = SVC.predict(dtm_rev)
print("prediction for NAIVE Bayes:",pred_naive)
print("prediction for SVC:",pred_SVC)

*The model was trained on only part of the data because of computation limits, and the data is not balanced across ratings: the rating visualization shows that comments rated 8 are more numerous than the others. This led the Naive Bayes model to overfit to the rating 8 class, whereas SVC behaved properly. In theory and by the statistics, Naive Bayes does perform well on this data, but after observing its predictions we can see that the Naive Bayes model is overfitted.*
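One common way to counter this kind of imbalance, which I have not tested on this project, is to weight each class inversely to its frequency. The sketch below computes such weights with the standard library on made-up labels, using the formula `n_samples / (n_classes * class_count)`; to my understanding this is the formula behind scikit-learn's `class_weight='balanced'` option, which `LinearSVC` accepts.

```python
from collections import Counter

# Made-up rating labels with rating 8 over-represented.
labels = [8] * 6 + [7] * 3 + [3] * 1

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# Inverse-frequency weights: rare ratings get larger weights.
class_weight = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label in sorted(class_weight):
    print(label, round(class_weight[label], 3))
```

Whether such weighting would improve the predictions for the rarer ratings here would need to be checked with cross-validation.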

*Therefore for the application development I'm using Linear SVC model for prediction*

# Saving the model

In [None]:
# import pickle

In [None]:
# pickle.dump(NaiveBayes,open('modelNaiveBayes.pkl','wb'))

In [None]:
# pickle.dump(SVC,open('modelSVM.pkl','wb'))

In [None]:
# pickle.dump(vectorizer,open('modelvectorizer.pkl','wb'))
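For completeness, here is a minimal sketch of how a saved vectorizer and model can be loaded back and used for a prediction. It trains a toy model on made-up comments and pickles to an in-memory bytes object instead of the `.pkl` files above, so it can run anywhere; the dictionary layout is my assumption and not how the hosted application is necessarily built.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy training data, for illustration only.
texts = ["great fun game", "love this game", "boring slow game", "hate the rules"]
labels = [8, 8, 3, 3]

vectorizer = TfidfVectorizer()
model = LinearSVC().fit(vectorizer.fit_transform(texts), labels)

# Serialize the vectorizer and the model together so they stay in sync.
blob = pickle.dumps({"vectorizer": vectorizer, "model": model})

# Later (for example inside a web app): restore both and predict.
restored = pickle.loads(blob)
new_comment = ["I hate the game"]
before = model.predict(vectorizer.transform(new_comment))
after = restored["model"].predict(restored["vectorizer"].transform(new_comment))

print(before, after)
```

The restored model must always be paired with the vectorizer it was trained with, because the feature columns depend on that vectorizer's vocabulary.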

# Reference 
https://monkeylearn.com/text-classification/

https://www.kaggle.com/jvanelteren/collaborative-filtering-defining-similar-games

https://www.kaggle.com/jvanelteren/exploring-the-13m-reviews-bgg-dataset


# Challenge 
The amount of data was huge, about 15 million rows, and the preprocessing alone overwhelmed my personal computer, so I implemented the whole project on Google Colab. Even there, fitting some algorithms, SVM for example, was very time consuming.

Theoretically the Naive Bayes model was outperforming Linear SVC, but when I tested it manually by passing in comments myself, I found that it had been overfitted to the class 8 comments.

