RATINGS PREDICTION PROJECT
We have a client whose website lets people write reviews of technical products. 
They are now adding a new feature: each reviewer will have to give a star rating 
along with the review. The rating is out of 5 stars, with exactly five options: 1 star, 2 stars, 
3 stars, 4 stars, or 5 stars. The client wants to predict ratings for reviews that were written in the 
past and have no rating. So we have to build an application that predicts the rating 
from the review text.

In [130]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk import pos_tag
#from wordcloud import WordCloud
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [131]:
df = pd.read_csv('scraped_data.csv')
df

,Unnamed: 0,rating,review_summary,full_review
0,0,5.0 out of 5 stars,The best value for money Macbook in years.,The M1 obliterates other laptops in price to v...
1,1,5.0 out of 5 stars,Wonderful product. Just go for it,Awesome products. Unbelievably fast and respon...
2,2,5.0 out of 5 stars,Longterm review Fanless Wonder,I had fears about getting a laptop with passiv...
3,3,5.0 out of 5 stars,Worth investing!,It is worth investing in MAC because it gives ...
4,4,5.0 out of 5 stars,Unreal,This is the device that got me into Apple ecos...
...,...,...,...,...
13490,13490,1.0 out of 5 stars,WiFi signal problem started in just 4 months,"WiFi signal problem started in just 4 months, ..."
13491,13491,1.0 out of 5 stars,High Value Low Quality,The router is defective and after an year of u...
13492,13492,1.0 out of 5 stars,Worst product,The Router has been working fine for 1.5 month...
13493,13493,1.0 out of 5 stars,Poor Customer Service,"Within 2 months of Purchase, this router has p..."


In [132]:
df.columns

Index(['Unnamed: 0', 'rating', 'review_summary', 'full_review'], dtype='object')

In [133]:
#dropping the index column
df.drop('Unnamed: 0',axis=1,inplace=True)

In [134]:
df.shape

(13495, 3)

In [135]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13495 entries, 0 to 13494
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   rating          13495 non-null  object
 1   review_summary  13494 non-null  object
 2   full_review     13138 non-null  object
dtypes: object(3)
memory usage: 158.2+ KB


In [136]:
df.dtypes

rating            object
review_summary    object
full_review       object
dtype: object

In [137]:
df.describe()

,rating,review_summary,full_review
count,13495,13494,13138
unique,5,5825,6512
top,4.0 out of 5 stars,Good,Good
freq,4565,522,294


In [138]:
df['rating'].value_counts()

4.0 out of 5 stars    4565
3.0 out of 5 stars    2610
5.0 out of 5 stars    2250
2.0 out of 5 stars    2070
1.0 out of 5 stars    2000
Name: rating, dtype: int64

In [139]:
#Removing extra text
df['rating']=df['rating'].str.replace('out of 5 stars','')

In [140]:
df['rating']=pd.to_numeric(df['rating']).astype(int)
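
The two cells above can also be collapsed into a single step. The sketch below uses a small invented Series (not the scraped data) and pulls the leading number out with a regular expression, so it does not depend on the exact trailing wording. The names `ratings` and `parsed` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the scraped 'rating' column (values invented for illustration).
ratings = pd.Series(['5.0 out of 5 stars', '1.0 out of 5 stars', '3.0 out of 5 stars'])

# Capture the leading number (e.g. '5.0'), then convert it to an integer.
parsed = ratings.str.extract(r'^(\d+(?:\.\d+)?)', expand=False).astype(float).astype(int)
print(parsed.tolist())
```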

In [141]:
df.head(5)

,rating,review_summary,full_review
0,5,The best value for money Macbook in years.,The M1 obliterates other laptops in price to v...
1,5,Wonderful product. Just go for it,Awesome products. Unbelievably fast and respon...
2,5,Longterm review Fanless Wonder,I had fears about getting a laptop with passiv...
3,5,Worth investing!,It is worth investing in MAC because it gives ...
4,5,Unreal,This is the device that got me into Apple ecos...


In [142]:
#checking the remaining columns before splitting the dataset
df.columns

Index(['rating', 'review_summary', 'full_review'], dtype='object')

In [154]:
#converting text to a TF-IDF matrix
def tf_idf(column):
    return TfidfVectorizer(min_df=3,smooth_idf=True).fit_transform(column)

In [155]:
#splitting features and target
x=df.drop('rating',axis=1)
y=df['rating']

In [156]:
x.columns

Index(['review_summary', 'full_review'], dtype='object')

In [157]:
x_merged = x['review_summary'] +' '+ x['full_review']
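
One thing to watch in the concatenation above: `df.info()` reports missing values in both text columns, and adding a string to NaN in pandas gives NaN, so a row with a summary but no full review loses its summary as well. A minimal sketch on an invented two-row frame (the names `toy`, `naive` and `safe` are hypothetical) shows the effect and one possible way around it:

```python
import numpy as np
import pandas as pd

# Invented rows: the second one has a summary but no full review.
toy = pd.DataFrame({
    'review_summary': ['Good router', 'Worst product'],
    'full_review': ['Works fine so far', np.nan],
})

# Plain concatenation: NaN in either column makes the whole result NaN.
naive = toy['review_summary'] + ' ' + toy['full_review']

# Filling missing text with '' first keeps whatever text is available.
safe = (toy['review_summary'].fillna('') + ' ' + toy['full_review'].fillna('')).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

Applying the same `fillna('')` idea to `x` would change the text that reaches the vectorizer, so the feature count reported further down may differ.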

In [158]:
x=tf_idf(x_merged.astype('U').values)

In [159]:
x.shape

(13495, 7499)
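
The matrix above comes from fitting TF-IDF on all 13495 rows, so the vocabulary and IDF weights also see the rows that later end up in the test set. A common alternative is to split the raw text first and fit the vectorizer on the training part only. The sketch below illustrates that order on invented reviews; `texts`, `labels` and the `text_train`/`text_test` names are assumptions, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Invented reviews and ratings, for illustration only.
texts = [
    'great laptop very fast', 'great value fast delivery',
    'great screen great sound', 'excellent battery great camera',
    'average phone okay battery', 'okay product average build',
    'average sound okay price', 'okay screen average speed',
    'worst router stopped working', 'worst product poor signal',
    'poor service worst support', 'poor build stopped charging',
]
labels = [5] * 4 + [3] * 4 + [1] * 4

# Split the raw text first; stratify keeps the class mix similar in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(text_train)  # vocabulary and IDF learned from training text only
x_test = vectorizer.transform(text_test)        # test text mapped onto that same vocabulary

print(x_train.shape, x_test.shape)
```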

In [160]:
#Training Testing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import joblib
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import GridSearchCV

In [161]:
models = [
    LogisticRegression(),
    BernoulliNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    
]

In [162]:
#the split does not depend on the model, so it is done once before the loop
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20,random_state=42)
for model in models:
    model.fit(x_train,y_train)
    y_pred = model.predict(x_test)
    print('Model for:',model)
    print(accuracy_score(y_test,y_pred))
    print(confusion_matrix(y_test,y_pred))
    print(classification_report(y_test,y_pred))
    print('-'*100)

Model for: LogisticRegression()
0.671359762875139
[[272  51  36  25   4]
 [ 55 237  61  48   3]
 [ 29  53 289 134  13]
 [ 15  27  64 748  76]
 [  8  10  11 164 266]]
              precision    recall  f1-score   support

           1       0.72      0.70      0.71       388
           2       0.63      0.59      0.61       404
           3       0.63      0.56      0.59       518
           4       0.67      0.80      0.73       930
           5       0.73      0.58      0.65       459

    accuracy                           0.67      2699
   macro avg       0.67      0.65      0.66      2699
weighted avg       0.67      0.67      0.67      2699

----------------------------------------------------------------------------------------------------
Model for: BernoulliNB()
0.5379770285290848
[[229  34  17 104   4]
 [ 41 164  30 167   2]
 [ 39  25 140 307   7]
 [ 22  25  28 792  63]
 [ 12   3   7 310 127]]
              precision    recall  f1-score   support

           1       0.67      

In [163]:
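# Best model is random forest, so tuning hyperparameters on that
# Note: for classifiers 'auto' means the same as 'sqrt', and recent scikit-learn releases no longer accept it
param_grid = {
    'max_depth' : range(10,14),
    'max_features' : ['auto', 'sqrt'],
    'min_samples_leaf': range(1, 4)
}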
gridSearchCV = GridSearchCV(RandomForestClassifier(),param_grid=param_grid,refit=True,verbose=3)
gridSearchCV.fit(x,y)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END max_depth=10, max_features=auto, min_samples_leaf=1;, score=0.443 total time=   2.9s
[CV 2/5] END max_depth=10, max_features=auto, min_samples_leaf=1;, score=0.413 total time=   2.4s
[CV 3/5] END max_depth=10, max_features=auto, min_samples_leaf=1;, score=0.388 total time=   4.0s
[CV 4/5] END max_depth=10, max_features=auto, min_samples_leaf=1;, score=0.369 total time=   3.7s
[CV 5/5] END max_depth=10, max_features=auto, min_samples_leaf=1;, score=0.385 total time=   5.8s
[CV 1/5] END max_depth=10, max_features=auto, min_samples_leaf=2;, score=0.427 total time=   3.3s
[CV 2/5] END max_depth=10, max_features=auto, min_samples_leaf=2;, score=0.406 total time=   2.6s
[CV 3/5] END max_depth=10, max_features=auto, min_samples_leaf=2;, score=0.383 total time=   3.0s
[CV 4/5] END max_depth=10, max_features=auto, min_samples_leaf=2;, score=0.366 total time=   3.1s
[CV 5/5] END max_depth=10, max_features=auto, min_sample

[CV 4/5] END max_depth=12, max_features=sqrt, min_samples_leaf=2;, score=0.373 total time=   4.3s
[CV 5/5] END max_depth=12, max_features=sqrt, min_samples_leaf=2;, score=0.390 total time=   4.4s
[CV 1/5] END max_depth=12, max_features=sqrt, min_samples_leaf=3;, score=0.456 total time=   3.5s
[CV 2/5] END max_depth=12, max_features=sqrt, min_samples_leaf=3;, score=0.427 total time=   3.1s
[CV 3/5] END max_depth=12, max_features=sqrt, min_samples_leaf=3;, score=0.392 total time=   5.1s
[CV 4/5] END max_depth=12, max_features=sqrt, min_samples_leaf=3;, score=0.372 total time=   4.0s
[CV 5/5] END max_depth=12, max_features=sqrt, min_samples_leaf=3;, score=0.392 total time=   4.2s
[CV 1/5] END max_depth=13, max_features=auto, min_samples_leaf=1;, score=0.504 total time=   5.6s
[CV 2/5] END max_depth=13, max_features=auto, min_samples_leaf=1;, score=0.445 total time=   4.2s
[CV 3/5] END max_depth=13, max_features=auto, min_samples_leaf=1;, score=0.419 total time=   5.3s
[CV 4/5] END max_dep

In [164]:
gridSearchCV.best_params_

{'max_depth': 13, 'max_features': 'sqrt', 'min_samples_leaf': 1}
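
The hold-out loop and the grid search use different evaluation schemes: a single shuffled 80/20 split versus `GridSearchCV`'s default 5-fold split, which is stratified but not shuffled. Their scores are therefore hard to compare directly. One way to put candidate models on the same footing is to score each with the same cross-validation object. The following is a sketch on invented data, not the notebook's procedure; `candidates`, `mean_scores` and the toy reviews are assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented reviews and ratings, for illustration only.
texts = [
    'great laptop very fast', 'great value fast delivery',
    'great screen great sound', 'excellent battery great camera',
    'average phone okay battery', 'okay product average build',
    'average sound okay price', 'okay screen average speed',
    'worst router stopped working', 'worst product poor signal',
    'poor service worst support', 'poor build stopped charging',
]
labels = [5] * 4 + [3] * 4 + [1] * 4

# One shared, shuffled, stratified splitter so every model sees the same folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

mean_scores = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring='accuracy')
    mean_scores[name] = scores.mean()
    print(name, round(mean_scores[name], 3))
```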

In [165]:
joblib.dump(gridSearchCV.best_estimator_,'randomForestClassifier.obj')

['randomForestClassifier.obj']
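
The saved file holds only the classifier. The `TfidfVectorizer` created inside `tf_idf` is discarded after `fit_transform`, so new review text cannot be mapped onto the same 7499 features later. One possible remedy is to bundle the vectorizer and the model in a scikit-learn `Pipeline` and save that single object. The sketch below assumes invented training data and a hypothetical file name `rating_pipeline.joblib`.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented reviews and ratings, for illustration only.
texts = [
    'great laptop very fast', 'great value fast delivery',
    'great screen great sound', 'excellent battery great camera',
    'average phone okay battery', 'okay product average build',
    'average sound okay price', 'okay screen average speed',
    'worst router stopped working', 'worst product poor signal',
    'poor service worst support', 'poor build stopped charging',
]
labels = [5] * 4 + [3] * 4 + [1] * 4

# Vectorizer and classifier travel together as one estimator.
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
pipeline.fit(texts, labels)

# Save to a temporary directory (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), 'rating_pipeline.joblib')
joblib.dump(pipeline, path)

# Later: load the pipeline and predict directly from raw review text.
restored = joblib.load(path)
new_reviews = ['worst router ever', 'great laptop']
predictions = restored.predict(new_reviews)
print(predictions)
```

With this layout the application only needs to load one file and call `predict` on raw strings.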