## Rating-Prediction

Problem


We have a client whose website lets people write reviews of technical products.
They are now adding a new feature: the reviewer will have to add a star rating
along with the review. The rating is out of 5 stars, and only 5 options are available: 1 star, 2 stars,
3 stars, 4 stars, and 5 stars. The client wants to predict ratings for reviews that were written in the
past and don’t have a rating. So, we have to build an application that can predict the rating
from the text of the review.

Data Collection Phase

You have to scrape at least 20000 rows of data. You can scrape more data as well; it’s up to you.
More data generally gives a better model.


In this section you need to scrape reviews of different laptops, phones, headphones, smart
watches, professional cameras, printers, monitors, home theaters, and routers from different e-commerce websites.
Basically, we need these columns:

1) reviews of the product.

2) rating of the product.

You can fetch other data as well if you think it can be useful or can help in the project. This
is entirely up to your own judgment.

Hint:
• Try to fetch data from different websites. Data from different websites will help
reduce overfitting in our model.

• Try to fetch an equal number of reviews for each rating. For example, if you are fetching
10000 reviews, then each of the ratings 1, 2, 3, 4, 5 should have 2000 reviews. This will balance our data set.

• Convert all ratings to whole numbers, as there are only 5 options for rating, i.e.,
1, 2, 3, 4, 5. If a rating is 4.5, convert it to 5.
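
The rounding hint has a small pitfall in Python: the built-in `round()` rounds halves to the nearest even number, so `round(4.5)` returns 4 rather than 5. Below is a minimal sketch of half-up rounding, assuming the scraped ratings arrive as numbers. The helper name `to_star_rating` is hypothetical and not part of the original notebook.

```python
import math


def to_star_rating(raw):
    """Map a raw rating such as 4.5 to a whole star rating from 1 to 5."""
    # Round half up: floor(x + 0.5). The built-in round() would turn 4.5 into 4.
    stars = math.floor(float(raw) + 0.5)
    # Clamp to the valid range, since only 1..5 stars are allowed.
    return max(1, min(5, stars))


print([to_star_rating(r) for r in (4.5, 3.5, 2.4, 0.2, 5.0)])  # [5, 4, 2, 1, 5]
```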

Model Building Phase

After collecting the data, you need to build a machine learning model. Before model building, do
all the data preprocessing steps involving NLP. Try different models with different hyperparameters
and select the best model.

Follow the complete life cycle of data science. Include all of these steps:

1. Data Cleaning

2. Exploratory Data Analysis

3. Data Preprocessing

4. Model Building

5. Model Evaluation

6. Selecting the best model
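
One possible way to "try different models with different hyperparameters" is a cross-validated grid search. The sketch below is an assumption about how that step could look, not the approach used in the notebook. The toy reviews, the parameter grid, and the choice of `MultinomialNB` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only (two ratings, four reviews each).
reviews = [
    "poor battery life", "charger stopped working",
    "very cheap product", "received defective product",
    "best laptop ever", "amazing laptop",
    "fantastic value for money", "nice product best display",
]
ratings = [1, 1, 1, 1, 5, 5, 5, 5]

# Vectorizer and classifier are tuned together as one estimator.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Hypothetical grid: unigrams vs. unigrams+bigrams, and two smoothing values.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(reviews, ratings)

print(search.best_params_)
print(search.best_score_)
```

On the real data, a larger `cv` value and a grid per candidate model would be more appropriate; the tiny settings here only keep the example fast.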

In [1]:
#importing libraries
!pip install selenium
from selenium import webdriver
import time
import pandas as pd
import numpy as np
import sys
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



In [2]:
#reading data
data = pd.read_csv("Rating_data.csv")

In [3]:
data = data.iloc[:,1:]

In [8]:
data

Unnamed: 0,rating,review
0,5,Fantastic value for money machine!! Absolute b...
1,5,"The best you can get, looks and performance bo..."
2,5,"For everyone, who is planning to buy MBA M1-\n..."
3,5,"Ultimate machine, best laptop I have ever used..."
4,5,At the current scenario where a proper Graphic...
...,...,...
166752,5,Nice product
166753,5,"Best Monitor ,New model 2021 Best display ,ver..."
166754,3,Product is good. But the refresh rate of 75Hz ...
166755,5,Bast


In [10]:
#finding the minimum number of reviews per rating so that we can balance the data
data.rating.value_counts().min()

6275

In [20]:
#separate the data by rating and keep an equal number of rows for each rating
min_count = data.rating.value_counts().min()
data5 = data[data.rating == 5].iloc[:min_count, :]
data4 = data[data.rating == 4].iloc[:min_count, :]
data3 = data[data.rating == 3].iloc[:min_count, :]
data2 = data[data.rating == 2].iloc[:min_count, :]
data1 = data[data.rating == 1].iloc[:min_count, :]

In [21]:
#concatenate all data to get a balanced dataset
data = pd.concat([data1, data2, data3, data4, data5] , axis=0)
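
The cells above keep the first N rows of each rating, which may favour whichever products were scraped first. A hedged alternative is to draw a random sample per rating instead. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample`) and uses a made-up toy DataFrame rather than `Rating_data.csv`.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "rating": [5, 5, 5, 5, 4, 4, 4, 1, 1],
    "review": ["r%d" % i for i in range(9)],
})

# Size of the smallest rating class (2 in this toy example).
n_per_class = int(toy["rating"].value_counts().min())

# Randomly draw the same number of rows from every rating group.
balanced = (
    toy.groupby("rating")
       .sample(n=n_per_class, random_state=0)
       .reset_index(drop=True)
)

print(balanced["rating"].value_counts().sort_index())
```

Fixing `random_state` keeps the sample reproducible between runs.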

In [25]:
#resetting the index (the old index is kept as an "index" column)
data = data.reset_index()

In [26]:
data

Unnamed: 0,index,rating,review
0,90,1,Charger stopped working after 2 weeks of purch...
1,171,1,Battery drain fastly
2,175,1,Poor battery life.
3,187,1,Very cheap product with very cheap service of ...
4,210,1,"Received defective product ,the display blinki..."
...,...,...,...
31370,10957,5,It's amazing laptop .
31371,10958,5,best price product ❤️🔥
31372,10959,5,The laptop isn't heavy as i thought it to be. ...
31373,10963,5,Best gaming leptop🔥


In [43]:
#remove non-letter characters, lowercase the text and lemmatize the words
lemmatizer = WordNetLemmatizer()  #create the lemmatizer once instead of once per word
corpus = []
for i in range(0, len(data)):
    review = re.sub("[^a-zA-Z]", " ", data["review"][i])
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review]
    review = " ".join(review)
    corpus.append(review)
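
The same cleaning steps are repeated later in the notebook for the test review, so it may help to wrap them in one function and reuse it for training and prediction. This is a minimal sketch; the name `clean_review` and the optional `lemmatize` argument are hypothetical. In the notebook, `WordNetLemmatizer().lemmatize` would be passed in; it is left optional here so the example runs without downloading NLTK data.

```python
import re


def clean_review(text, lemmatize=None):
    """Keep letters only, lowercase, optionally lemmatize, and re-join the words."""
    # Replace every non-letter character (digits, punctuation, emoji) with a space.
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    if lemmatize is not None:
        # lemmatize is any callable that maps one word to its base form.
        words = [lemmatize(word) for word in words]
    return " ".join(words)


print(clean_review("Best gaming leptop🔥 in 2021!!"))  # best gaming leptop in
```

With such a helper, the corpus could be built as `corpus = [clean_review(r, lemmatizer.lemmatize) for r in data["review"]]`, assuming a `lemmatizer` object has been created beforehand.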

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

In [45]:
#transform the words using count vectorizer
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()
y = data["rating"]
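
Calling `.toarray()` turns the count matrix into a dense array, which can take a lot of memory with tens of thousands of reviews. As far as the models used in this notebook are concerned, scikit-learn accepts the sparse matrix directly, so the conversion may not be needed. A small sketch on made-up data:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, made up for illustration only.
toy_corpus = ["battery drain fast", "best laptop ever", "poor battery life", "amazing laptop"]
toy_ratings = [1, 5, 1, 5]

vectorizer = CountVectorizer()
x_sparse = vectorizer.fit_transform(toy_corpus)  # stays a SciPy sparse matrix

print(issparse(x_sparse), x_sparse.shape)

# The classifier is fitted on the sparse matrix without densifying it.
model = MultinomialNB().fit(x_sparse, toy_ratings)
prediction = model.predict(vectorizer.transform(["poor battery"]))
print(prediction)
```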


In [46]:
from sklearn.model_selection import train_test_split

In [47]:
#splitting the dataset
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=.2, random_state=0)
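
Since the dataset was balanced on purpose, it may be worth keeping that balance in both splits. `train_test_split` has a `stratify` parameter for this. The sketch below uses made-up features and labels, not the notebook's `x` and `y`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 10 samples for each of the 5 ratings.
toy_x = [[i] for i in range(50)]
toy_y = [rating for rating in (1, 2, 3, 4, 5) for _ in range(10)]

x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# Each rating keeps the same share in the test split as in the full data.
print(Counter(y_test))
```

In the notebook this would correspond to adding `stratify=y` to the existing call.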

In [48]:
#trying different models to find best model 
from sklearn.naive_bayes import MultinomialNB

In [49]:
model = MultinomialNB()
model.fit(xtrain, ytrain)

MultinomialNB()

In [50]:
predict = model.predict(xtest)

In [51]:
from sklearn.metrics import accuracy_score

In [52]:
accuracy_score(ytest, predict)

0.7470916286879348

In [53]:
from sklearn.tree import DecisionTreeClassifier

In [54]:
dctree = DecisionTreeClassifier(criterion = "entropy", random_state = 0)
dctree.fit(xtrain, ytrain)
predict = dctree.predict(xtest)
accuracy_score(ytest, predict)

0.8781782201966899

In [55]:
from sklearn.ensemble import RandomForestClassifier

In [56]:
rf = RandomForestClassifier(n_estimators = 300, criterion = "entropy", random_state = 0)
rf.fit(xtrain, ytrain)
predict = rf.predict(xtest)


In [57]:
accuracy_score(ytest, predict)

0.8868733509234829

In [None]:
#RandomForest is the best model, with an accuracy of 88.7%
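
Accuracy alone does not show which ratings get confused with each other. For the model evaluation step, a confusion matrix and a per-class report can give more detail, and mean absolute error shows how far off the predicted stars are on average. This is a sketch on made-up labels; in the notebook the inputs would be `ytest` and `predict`.

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    mean_absolute_error,
)

# Made-up true and predicted ratings, for illustration only.
y_true = [1, 2, 3, 4, 5, 5, 4, 3]
y_pred = [1, 2, 3, 5, 5, 5, 4, 2]
labels = [1, 2, 3, 4, 5]

# Rows are true ratings, columns are predicted ratings.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall and F1 for each rating.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Average distance in stars between prediction and truth.
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.25
```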

In [2]:
#saving the model
import joblib

In [59]:
joblib.dump(rf,"model.obj")

['model.obj']

In [60]:
joblib.dump(cv,"cv.obj")

['cv.obj']
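
Saving the vectorizer and the model as two files works, but they always have to be loaded and used together. One alternative, shown here only as a sketch, is to bundle both in a scikit-learn `Pipeline` and save a single file. The toy data, the small `n_estimators` value, and the file name `rating_pipeline.joblib` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only.
toy_reviews = ["poor battery life", "charger stopped working",
               "best laptop ever", "amazing laptop"]
toy_ratings = [1, 1, 5, 5]

# The pipeline vectorizes raw text and then classifies it.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(toy_reviews, toy_ratings)

# Save and reload a single artifact (a temporary directory keeps the demo clean).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rating_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts cleaned review text directly.
print(restored.predict(["battery stopped working"]))
```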

In [3]:
#a test review
mag = """Overview: Good for the price"""
cv = joblib.load("cv.obj")
rf = joblib.load("model.obj")

In [4]:
corpus = []
review = re.sub("[^a-zA-Z]", " ", mag)
review = review.lower()
review = review.split()
review = [WordNetLemmatizer().lemmatize(word) for word in review ]
review = " ".join(review)
corpus.append(review)

In [5]:
x = cv.transform(corpus).toarray()

In [6]:
pr = rf.predict(x)

In [7]:
pr

array([3], dtype=int64)