<a href="https://colab.research.google.com/github/worklifesg/Natural-Language-Processing/blob/main/Projects/3.%20Rating-Reviews%20based%20Prediction%20Model/Rating_based_Prediction_Model_using_ML_techniques_and_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h3 align='center'>Rating based Prediction Model using ML techniques and NLP </h3> 

This notebook implements a rating prediction model using NLP and machine learning techniques such as Random Forest, AdaBoost and Naive Bayes.

<b> The final objective of this program is to evaluate the best estimator and identify the mismatch cases. </b>

<b> DESCRIPTION </b>

Using NLP and machine learning, make a model to predict the rating in a review based on the content of the text review. This will help identify cases with a mismatch.

<b> Problem Statement: </b>  

Zomato is India’s largest platform for discovering restaurants and ordering food. It operates in India as well as a few cities internationally. Bangalore is one of Zomato's biggest customer and restaurant bases, with 4 to 5 million users using the platform each month.

Users on the platform can also post reviews of restaurants and provide a rating accompanying the review. The content of a review should ideally reflect the rating provided by the customer. In many cases, for a variety of reasons, the rating does not match the review text. A match between review and rating is very important, as it builds customer trust in the platform and helps the user get an accurate picture of the restaurant.

You, as a data scientist, need to enable the identification and cleanup of such cases to ensure the ratings reflect the reviews and that the reviews seem trustworthy to the customer. You will need to use NLP techniques in conjunction with machine learning models to predict the rating from the review text.
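
As a rough illustration of what "identifying a mismatch" could mean once a model produces predicted ratings, the sketch below flags reviews whose given rating differs from the predicted rating by more than a cutoff. The function name `flag_mismatches`, the cutoff of 1.5 and the sample values are assumptions made for this example; they are not part of the dataset or the assignment.

```python
def flag_mismatches(actual, predicted, threshold=1.5):
    """Return positions where the given and predicted ratings differ by more than threshold."""
    return [
        i
        for i, (a, p) in enumerate(zip(actual, predicted))
        if abs(a - p) > threshold
    ]


# Made-up ratings for illustration only
actual_ratings = [1.0, 5.0, 4.0, 2.0]
predicted_ratings = [1.4, 4.6, 1.8, 2.3]

mismatch_idx = flag_mismatches(actual_ratings, predicted_ratings)
print(mismatch_idx)
```

A suitable cutoff would depend on the error of the final model; a value close to the test RMSE would flag far too many reviews.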


#### Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

#### 1. Load the data using read_csv function from pandas package

In [2]:
#Read the dataset
df = pd.read_csv('Zomato_reviews.csv')
df.head()

Unnamed: 0,rating,review_text
0,1.0,"Their service is worst, pricing in menu is dif..."
1,5.0,really appreciate their quality and timing . I...
2,4.0,"Went there on a Friday night, the place was su..."
3,4.0,A very decent place serving good food.\r\nOrde...
4,5.0,One of the BEST places for steaks in the city....


#### 2. Remove the records where the review text is null

In [3]:
#checking the dataset size

print('Zomato reviews dataset has %.f rows and %.f columns'%(df.shape[0],df.shape[1]))

Zomato reviews dataset has 27762 rows and 2 columns


In [4]:
#Checking missing data
null_sum = df.isnull().sum() #sum wise
null_count = df.isnull().count() #count wise
null_percent = null_sum/null_count*100

print(null_sum)
pd.concat([null_sum, null_percent],axis=1,keys=['Total','Percent']).transpose()

rating          0
review_text    14
dtype: int64


Unnamed: 0,rating,review_text
Total,0.0,14.0
Percent,0.0,0.050429


In [5]:
#Removing rows that have null values
df_1 = df.dropna()
print('Zomato reviews dataset has %.f rows and %.f columns'%(df_1.shape[0],df_1.shape[1]))

#Checking missing data
null_sum1 = df_1.isnull().sum() #sum wise
null_count1 = df_1.isnull().count() #count wise
null_percent1 = null_sum1/null_count1*100

print(null_sum1)
pd.concat([null_sum1, null_percent1],axis=1,keys=['Total','Percent']).transpose()

Zomato reviews dataset has 27748 rows and 2 columns
rating         0
review_text    0
dtype: int64


Unnamed: 0,rating,review_text
Total,0.0,0.0
Percent,0.0,0.0


#### 3. Perform cleanup on the data 
  - Normalize the casing
  - Remove extra line breaks from the text
  - Remove stop words. Note: Terms like ‘no’, ‘not’, ‘don’, ‘won’ are important; don’t remove them
  - Remove punctuation



In [6]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet', quiet=True)

#Normalizing to lower case
lower_text = [txt.lower() for txt in df_1['review_text']]
print('The text after applying lower case looks like: \n',lower_text[3:5])
print('-------------------------------------------------------------------------------------------')

#Remove extra line breaks
linebreak_text = [' '.join(txt.split()) for txt in lower_text]
print('The text after removing line breaks to lower case text looks like: \n', linebreak_text[3:5])
print('-------------------------------------------------------------------------------------------')

#Applying tokenization to lower case text after line breaks

token = word_tokenize(linebreak_text[0])
print('Tokens for a sample are: \n', token)
print('-------------------------------------------------------------------------------------------')
tokens_total = [word_tokenize(wd) for wd in linebreak_text]
print('The tokenized sentence looks like: \n',tokens_total[0])
print('-------------------------------------------------------------------------------------------')

#Remove stopwords and punctuation

stop_words = stopwords.words('english')
stop_punct = list(punctuation)

print('Stop words are : \n',stop_words)
print('-------------------------------------------------------------------------------------------')

stop_words.remove('no')
stop_words.remove('not')
stop_words.remove('don')
stop_words.remove('won')

print('Is word no in the stop words list : ','no' in stop_words)
print('-------------------------------------------------------------------------------------------')

stop_final = stop_words + stop_punct + ['...','``',"''",'====','must']

#creating a function for data cleaning
def delete(wd):
  return [term for term in wd if term not in stop_final]

print('Review after removing stop_final \n ',delete(tokens_total[1]))
print('-------------------------------------------------------------------------------------------')

token_clean = [delete(wd) for wd in tokens_total]

df_clean = [' '.join(wd) for wd in token_clean]

print('Final cleaned data (first two reviews): \n',df_clean[:2])
print('-------------------------------------------------------------------------------------------')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
The text after applying lower case looks like: 
 ['a very decent place serving good food.\r\nordered chilli fish, chicken & pork sizzler.\r\neverything tasted good but pork could have been slightly better cooked.\r\ntried 2 beverages, both were very sweet.', 'one of the best places for steaks in the city. tried the beef steak with chili rum & grilled fish with orange and jalapenos. both were exceptionally good. the herbed rice and mashed potatoes serves alongside were equally delecatble. service is prompt and zomato gold is a great steal. if you are a steak lover, this place is a must visit. hope they come up with another ourself somewhere in the cbd.\r\n\r\nwish to be back soon.\r\nbon appetit !']
--------------------------------------------------------------
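
The four cleanup steps can also be folded into one helper, which may be easier to reuse on new reviews. This is a minimal sketch under stated assumptions: it uses a regular expression instead of NLTK's `word_tokenize`, and `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's English stop words, so its output will not match the cell above token for token.

```python
import re
from string import punctuation

# Hypothetical, tiny stop-word list for illustration; the notebook uses NLTK's English list
STOP_WORDS = {"the", "is", "a", "and", "was", "no", "not"}
# Negation terms that must survive the cleanup
KEEP = {"no", "not", "don", "won"}


def clean_review(text, stop_words=STOP_WORDS):
    """Lower-case, collapse line breaks, tokenize, then drop stop words and punctuation."""
    text = " ".join(text.lower().split())        # normalize casing and whitespace
    tokens = re.findall(r"\w+|[^\w\s]", text)    # words and single punctuation marks
    drop = (set(stop_words) - KEEP) | set(punctuation)
    return " ".join(t for t in tokens if t not in drop)


cleaned = clean_review("The food is NOT good.\r\nService was slow!")
print(cleaned)
```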

#### 4. Separation into train and test sets

  - Use train-test method to divide your data into 2 sets: train and test
  - Use a 70-30 split


In [7]:
X_train, X_test,y_train,y_test = train_test_split(df_clean,df_1['rating'],test_size=0.3,random_state=42)

#### 5. Use TF-IDF values for the terms as features to get into a vector space model

 - Import TF-IDF vectorizer from sklearn
 - Instantiate with a maximum of 5000 terms in your vocabulary
 - Fit and apply on the train set
 - Apply on the test set

In [8]:
#creating tfidf and vectorizing training and testing dataset
vector = TfidfVectorizer(max_features=5000)

#fit on the training set only, then apply the learned vocabulary and IDF weights to both sets
X_train_tfidf = vector.fit_transform(X_train)
X_test_tfidf = vector.transform(X_test)

print('Training size : ',X_train_tfidf.shape)
print('Testing size : ', X_test_tfidf.shape)

Training size :  (19423, 5000)
Testing size :  (8325, 5000)
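
The "fit on train, apply on test" steps listed for this section can be illustrated on a toy corpus. The sketch below uses made-up documents; it shows that `transform` reuses the vocabulary learned from the training documents, so train and test columns line up and words seen only in the test set are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only
train_docs = ["good food great service", "bad food slow service", "great place"]
test_docs = ["good place awful ambience"]

vec = TfidfVectorizer(max_features=5000)
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them without refitting

print(train_matrix.shape, test_matrix.shape)
print(sorted(vec.vocabulary_))
```

Calling `fit_transform` on the test documents instead would learn a second, unrelated vocabulary, so column *j* of the test matrix would no longer refer to the same term as column *j* of the training matrix.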


#### 6. Model building: Random Forest Regressor

 - Instantiate RandomForestRegressor from sklearn (set random seed)
 - Fit on the train data
 - Make predictions for the train set

In [9]:
# RandomForestRegressor model built
model_RF = RandomForestRegressor(n_estimators=10, random_state=42)

#Training the model
model_RF.fit(X_train_tfidf,y_train)

#Prediction on training dataset
y_pred = model_RF.predict(X_train_tfidf)
print('RMSE on training dataset is : ',mean_squared_error(y_train,y_pred)**0.5)

RMSE on training dataset is :  0.2648944171413865


#### 7. Hyperparameter tuning

 - Import GridSearch
 - Provide the parameter grid to choose:
  - max_features – ‘auto’, ‘sqrt’, ‘log2’
  - max_depth – 10, 15, 20, 25