# problem statement
A client runs a website where people write reviews of technical products.
They are adding a new feature: reviewers will now have to give a star rating
along with the review. The rating is out of 5 stars, with exactly five options:
1, 2, 3, 4, or 5 stars. The client also wants ratings for reviews that were
written in the past and have none. So we have to build an application that
predicts the rating from the review text.

## importing libraries

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
from scipy.stats import zscore

In [18]:
df=pd.read_csv("electronics3.csv", encoding='cp1252')

In [19]:
df.head(10)

Unnamed: 0,Review,star
0,"Delivery was delayed a little bit, it was wort...",5.0
1,Working fine but one thing I want to mention i...,5.0
2,Amazing camera. Lot of beautiful options to co...,5.0
3,,
4,,
5,,
6,,
7,,
8,,
9,,


In [20]:
df.shape

(19361, 2)

The dataset contains 19361 rows and 2 columns.

In [21]:
df.isnull().sum()

Review    13106
star      12974
dtype: int64

In [22]:
df1 = df.dropna(how='any',axis=0) 

In [23]:
df1.isnull().sum()

Review    0
star      0
dtype: int64

In [26]:
df1.sample(15)

Unnamed: 0,Review,star
18161,Just an Average Phone\nPros:\n?The phone build...,5.0
2148,An awesome product from Sony.\nErgonomics is s...,5.0
13867,It's a good performance phone and battery is v...,5.0
11742,It is really good .I love this product .thanks...,5.0
18081,I Gift to My Mother. This Phone Purchase for m...,5.0
15361,very nice awesome ???? nice handy feel battrey...,5.0
4152,An awesome product from Sony.\nErgonomics is s...,5.0
8639,Cons\n1. Left airdope disconnects randomly. Do...,4.0
1739,Best camera ever. Loving it.\n\nIncredible spe...,5.0
10796,nice very beautiful Realme buds\nVery good sou...,4.0


Rows with a missing review or a missing star rating have been dropped.

In [35]:
x=df1.drop('star',axis=1)


In [36]:

## Get the Dependent features
y=df1['star']

In [29]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer


In [37]:

messages=df1.copy()

In [40]:

messages.reset_index(inplace=True)

In [41]:

messages.head(10)


Unnamed: 0,index,Review,star
0,0,"Delivery was delayed a little bit, it was wort...",5.0
1,1,Working fine but one thing I want to mention i...,5.0
2,2,Amazing camera. Lot of beautiful options to co...,5.0
3,10,Very bad product don't buy this,5.0
4,11,Very poor,5.0
5,12,Very bad product,5.0
6,23,Good,5.0
7,24,Super camera,5.0
8,45,Mind Blowing ?? I bought a camera for the firs...,5.0
9,130,Sony A7c best camera for photography and video...,4.0
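The `index` column in the table above comes from `reset_index`, which by default keeps the old row labels as a new column. A minimal sketch on a made-up four-row frame (the values are illustrative, not taken from `electronics3.csv`) shows the effect of `dropna(how='any')` followed by both forms of `reset_index`:

```python
import pandas as pd

# Hypothetical toy data standing in for the review file
toy = pd.DataFrame({
    "Review": ["good", None, "bad", None],
    "star": [5.0, None, 1.0, 3.0],
})

# Count missing values per column, as df.isnull().sum() does above
missing = toy.isnull().sum()

# how='any' drops a row if at least one of its columns is missing
clean = toy.dropna(how="any", axis=0)

# Default reset_index keeps the old labels in a new 'index' column
with_old_labels = clean.reset_index()

# drop=True discards the old labels instead
renumbered = clean.reset_index(drop=True)

print(missing.to_dict())
print(list(with_old_labels.columns))
print(list(renumbered.columns))
```

If the old row labels are not needed later, `drop=True` avoids carrying the extra column into the rest of the notebook.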


In [42]:

messages['Review'][42]


'Simply superb camera. Wonderful how they pack such an advance technology in a small pack.This is the magic of mirrorless technology from SONY.Dynamite in small size. It incorporate such advance camera technologies that beats bigger dslrs easily. Especially the eye auto focus is blazingly fast. At this price it beats even double priced DSLR of other brands. Pictures are razor sharp with nice pleasing color. Very easy to carry as small lightweight. Excellent buy in sale. IF you are planning ...\nREAD MORE'

In [45]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Samrat\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [46]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
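ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(messages)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['Review'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)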

In [47]:
corpus[3]

'bad product buy'
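To see how the review "Very bad product don't buy this" becomes `'bad product buy'`, here is a minimal sketch of the same cleaning steps on that one string. It is a simplification under two assumptions: `STOP_WORDS` is a small hand-picked subset rather than the full NLTK list, and stemming is left out (the Porter stemmer happens to leave these three words unchanged, as the output above shows). The helper name `clean_review` is made up for this example.

```python
import re

# Assumption: a tiny hand-picked subset of English stop words,
# used here instead of the full NLTK list
STOP_WORDS = {"very", "don", "t", "this", "the", "a", "is"}

def clean_review(text):
    """Keep letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

cleaned = clean_review("Very bad product don't buy this")
print(cleaned)  # bad product buy
```

Note that the apostrophe in "don't" is replaced by a space, so the word splits into "don" and "t", both of which are then removed as stop words.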

In [48]:
## TFidf Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v=TfidfVectorizer(max_features=5000,ngram_range=(1,3))
X=tfidf_v.fit_transform(corpus).toarray()


In [49]:
X.shape


(1984, 5000)

In [50]:
y=messages['star']

In [51]:
## Divide the dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [52]:
tfidf_v.get_feature_names()[:20]

['abl',
 'absolut',
 'absolut love',
 'absolut love super',
 'accessori',
 'accord',
 'accord price',
 'accur',
 'accur color',
 'accuraci',
 'accuraci super',
 'accuraci super respons',
 'action',
 'actual',
 'ad',
 'ad nice',
 'ad nice sound',
 'add',
 'add reason',
 'add reason price']
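The feature list mixes single stems, pairs, and triples because the vectorizer was created with `ngram_range=(1,3)`. A rough sketch of how such word n-grams can be generated from one tokenized review follows; `word_ngrams` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams(["absolut", "love", "super"])
print(grams)
# ['absolut', 'love', 'super', 'absolut love', 'love super', 'absolut love super']
```

With `max_features=5000`, only the most frequent of these n-grams across the whole corpus are kept as columns.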

In [53]:
tfidf_v.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': 5000,
 'min_df': 1,
 'ngram_range': (1, 3),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}
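The parameters above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) correspond to a weighting where each term count is multiplied by `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. A minimal sketch in plain Python, on three made-up documents and with unigrams only, may make the arithmetic easier to follow:

```python
import math
from collections import Counter

# Made-up, already-cleaned documents (not from the dataset)
docs = ["bad product", "good product", "good good camera"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def idf(term):
    # Smoothed idf, as used when smooth_idf=True
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    # L2 normalisation (norm='l2'): scale the row to unit length
    length = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / length for term, v in raw.items()}

rows = [tfidf_row(toks) for toks in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {t: round(v, 3) for t, v in row.items()})
```

Terms that appear in fewer documents, such as "bad" here, receive a larger weight than terms shared by several documents, such as "product".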

In [54]:

count_df = pd.DataFrame(X_train, columns=tfidf_v.get_feature_names())

In [55]:

count_df.head()

Unnamed: 0,abl,absolut,absolut love,absolut love super,accessori,accord,accord price,accur,accur color,accuraci,...,year,year includ,year includ free,yesterday,yet,yet power,yet power awesom,yet use,yet use fulli,youtub
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [56]:

import itertools  # used by plot_confusion_matrix below
import matplotlib.pyplot as plt

In [57]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    See full source and example: 
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first so the heatmap and the cell labels show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
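The helper above expects a confusion matrix `cm` in which row `i` is the true class and column `j` the predicted class; with `normalize=True` each row is divided by its sum. The sketch below builds such a matrix by hand from made-up labels, using plain lists instead of NumPy, to show what the counts and the row normalisation look like.

```python
# Hypothetical true and predicted star ratings
y_true = [5, 5, 5, 4, 4, 1]
y_hat = [5, 5, 4, 4, 5, 1]
classes = sorted(set(y_true) | set(y_hat))
pos = {label: k for k, label in enumerate(classes)}

# cm[i][j] counts samples of true class i predicted as class j
cm = [[0] * len(classes) for _ in classes]
for t, p in zip(y_true, y_hat):
    cm[pos[t]][pos[p]] += 1

# Row-normalise: each row then sums to 1 (fraction per true class)
cm_norm = [[cell / sum(row) for cell in row] for row in cm]

print(classes)   # [1, 4, 5]
print(cm)        # [[1, 0, 0], [0, 1, 1], [0, 1, 2]]
print(cm_norm)
```

The diagonal of the normalised matrix gives the per-class recall, which can be more informative than overall accuracy when one rating dominates the data.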


# MultinomialNB Algorithm

In [58]:

from sklearn.naive_bayes import MultinomialNB
classifier=MultinomialNB()


In [60]:

from sklearn import metrics
import numpy as np
import itertools


In [61]:

classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)



accuracy:   0.802


In [62]:

classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
score


0.8015267175572519

In [63]:

y_train.shape


(1329,)
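The 1329 training rows follow from the 1984 rows in `X` and `test_size=0.33`: `train_test_split` rounds the test share up and assigns the remainder to training. A short arithmetic sketch, which also works back from the reported accuracy to the number of correctly classified test reviews:

```python
import math

n_samples = 1984      # rows in X, from X.shape above
test_size = 0.33      # fraction passed to train_test_split

# The test share is rounded up; the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Accuracy is correct / n_test, so the reported score implies this count
reported_accuracy = 0.8015267175572519
n_correct = round(reported_accuracy * n_test)

print(n_train, n_test, n_correct)  # 1329 655 525
```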

# Passive Aggressive Classifier Algorithm

In [66]:
from sklearn.linear_model import PassiveAggressiveClassifier
linear_clf = PassiveAggressiveClassifier()

In [67]:

linear_clf.fit(X_train, y_train)
pred = linear_clf.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)



accuracy:   0.725


# MultinomialNB Classifier with Hyperparameter Tuning

In [68]:
classifier=MultinomialNB(alpha=0.1)
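The `alpha` parameter is the additive smoothing term: the estimated probability of a feature given a class is `(count + alpha) / (class_total + alpha * n_features)`. The sketch below evaluates that formula for a feature that never occurs in a class; the counts are hypothetical and `smoothed_prob` is a helper written for this example.

```python
def smoothed_prob(term_count, class_total, n_features, alpha):
    """Additive-smoothing estimate of P(term | class)."""
    return (term_count + alpha) / (class_total + alpha * n_features)

# Hypothetical counts: a term never seen in a class whose total count is 200
for alpha in (0.1, 0.5, 1.0):
    p = smoothed_prob(term_count=0, class_total=200, n_features=5000, alpha=alpha)
    print(alpha, round(p, 6))
```

A larger `alpha` moves probability mass toward unseen features, so the model is penalised less harshly when a test review contains n-grams that a class never showed in training.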

In [69]:

previous_score=0
for alpha in np.arange(0,1,0.1):
    sub_classifier=MultinomialNB(alpha=alpha)
    sub_classifier.fit(X_train,y_train)
    y_pred=sub_classifier.predict(X_test)
    score = metrics.accuracy_score(y_test, y_pred)
    # Keep the model only if it beats the best score seen so far
    if score>previous_score:
        classifier=sub_classifier
        previous_score=score
    print("Alpha: {}, Score : {}".format(alpha,score))

Alpha: 0.0, Score : 0.7099236641221374
Alpha: 0.1, Score : 0.7145038167938931
Alpha: 0.2, Score : 0.7312977099236642
Alpha: 0.30000000000000004, Score : 0.7572519083969466
Alpha: 0.4, Score : 0.7740458015267175
Alpha: 0.5, Score : 0.7847328244274809
Alpha: 0.6000000000000001, Score : 0.7923664122137405
Alpha: 0.7000000000000001, Score : 0.7908396946564885
Alpha: 0.8, Score : 0.7984732824427481
Alpha: 0.9, Score : 0.8
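To read the winner off the sweep, the printed scores can be placed in a dictionary and the best entry selected with `max`. The values below are copied from the output above, with the alpha keys rounded to one decimal.

```python
# Scores copied from the sweep output (alpha keys rounded to one decimal)
scores = {
    0.0: 0.7099236641221374,
    0.1: 0.7145038167938931,
    0.2: 0.7312977099236642,
    0.3: 0.7572519083969466,
    0.4: 0.7740458015267175,
    0.5: 0.7847328244274809,
    0.6: 0.7923664122137405,
    0.7: 0.7908396946564885,
    0.8: 0.7984732824427481,
    0.9: 0.8,
}

# Pick the alpha with the highest test accuracy
best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])  # 0.9 0.8
```

Because `np.arange(0,1,0.1)` stops before 1, the sweep never tries the default `alpha=1.0`, which scored 0.802 in the earlier `MultinomialNB()` run.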


# saving the model

In [72]:
import pickle
filename="rating.pkl"
# Use a context manager so the file is closed after writing
with open(filename,'wb') as f:
    pickle.dump(classifier,f)
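The pickled classifier only accepts TF-IDF vectors, so predicting the rating of a new review also needs the fitted `tfidf_v`. One possible approach, sketched here on made-up data and not part of the notebook above, is to wrap the vectorizer and the classifier in a scikit-learn pipeline and pickle that single object.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned reviews and ratings for illustration only
toy_corpus = ["bad product buy", "poor qualiti", "super camera", "good sound"]
toy_stars = [1.0, 1.0, 5.0, 5.0]

# Vectorizer and classifier travel together as one object
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), MultinomialNB())
model.fit(toy_corpus, toy_stars)

# Serialise to bytes and restore; a file opened with 'wb'/'rb' works the same way
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline takes cleaned text directly
prediction = restored.predict(["super camera"])
print(prediction)
```

A new review would still need the same cleaning (letters only, lowercase, stop-word removal, stemming) before being passed to `predict`.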