________________________________________________________________________________________________________________________________

   ###                                ---------------   Suggesting restaurants based on customer reviews for specific Cuisine ------------
________________________________________________________________________________________________________________________________
            
       ***Sumanto Mukherjee (sumanto2@iilinois.edu), Anuj Kumar (anujk2@illinois.edu)***
                                **Text Information System**
                          University of Illinois - Urbana-Champaign

#### Data Source Details:

##### URL: https://www.kaggle.com/jkgatt/restaurant-data-with-100-trip-advisor-reviews-each/home

   **Context of Data Source**:
            This dataset was used to provide a Mock API of a Review Website (such as Trip Advisor or Yelp) for a Natural Language Processing project.

   **Content description**:
    The dataset covers 57 restaurants in the San Francisco Bay Area, each with 100 reviews. For each restaurant there is a large amount of information taken from the Factual API, along with 100 reviews taken from Trip Advisor.

   **Acknowledgements for DataSource**:
    This dataset has been possible thanks to both Factual and Trip Advisor.

*********************************************************************************************************************
## Documentation for each code block and functionality
  
*********************************************************************************************************************

### Code block to import the necessary libraries:

  *packages used: numpy, pandas, nltk, sklearn (CountVectorizer, TfidfTransformer, train_test_split, MultinomialNB, confusion_matrix, classification_report)*

In [2]:
#Importing the libraries
import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords  # used to remove English stopwords from the customer review comments
nltk.download('stopwords')  # fetch the stopword list (does nothing if it is already up to date)
    
import sklearn as sk
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

import string

#import matplotlib.pyplot as plt
#import seaborn as sns

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suman\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### The block below asks the user for the cuisine they are looking for:

The original proposal was to have the user enter a specific dish. The application would then search for that dish in the reviews available on review sites such as Yelp and TripAdvisor.
**The idea was to use Named Entity Recognition and Classification (NER/C) to extract dish names from the reviews.**
Further analysis of NER was done as part of our Technical Review.

This application can be further extended to take a dish as the input and apply NER to suggest restaurants based on that dish, just as it currently does for a cuisine.

Based on the dataset we have used, below is the list of cuisines currently available:

*Indian, American, Asian, Austrian, Bakery, Barbecue, Bistro, British, Burgers, Cafe, Californian, Chinese, Chowder, Coffee, Contemporary, Crepes, Creole, Deli, Dim Sum, Diner, European, Fast Food, French, Fusion, German, Grill, Gyros, International, Italian, Latin American, Mediterranean, Mexican, Middle Eastern, Pasta, Peruvian, Pho, Pizza, Pub Food, Seafood, Southern, Spanish, Steak, Sushi, Thai, Vegetarian, Vietnamese*

Here, **Indian** is used to demonstrate the tool.

In [20]:
#User Input for Cuisine
cuisine = input('enter cuisine: ')
print('The cuisine you are looking for: ' , cuisine)
user_input = [{'craving_for':cuisine}]
dataset_cuisine = pd.DataFrame(user_input)

enter cuisine: Indian
The cuisine you are looking for:  Indian
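
The filter further down compares the typed text to the dataset values exactly, so an entry such as "indian" would match nothing. A minimal sketch of normalizing the input first, assuming title-cased cuisine names as in the list above; the helper normalize_cuisine and the short known set are hypothetical and not part of the notebook:

```python
def normalize_cuisine(raw, known_cuisines):
    """Collapse extra whitespace, title-case the text, and check it against known names."""
    cleaned = " ".join(raw.split()).title()
    return cleaned if cleaned in known_cuisines else None

# Hypothetical subset of the cuisine list, for illustration only
known = {"Indian", "Dim Sum", "Pub Food", "Thai"}

first = normalize_cuisine("  indian ", known)
second = normalize_cuisine("dim  sum", known)
unknown = normalize_cuisine("Martian", known)

print(first)    # Indian
print(second)   # Dim Sum
print(unknown)  # None
```

Returning None for an unknown cuisine lets the caller prompt again instead of silently producing an empty result.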


### The function below removes stopwords and punctuation, returning only the words that will be used to vectorize the review set:




In [21]:
def text_proc(text):
    '''
    Takes in a string of text, then performs the following:
        1. Remove all punctuation
        2. Remove all stopwords
        3. Return the cleaned text as a list of words
    '''
    # Build the stopword set once per call instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_words]
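
To see what the cleaning step produces, here is a small sketch of the same logic on one sentence. It assumes a tiny stand-in stopword set (DEMO_STOPWORDS, hypothetical) so that it runs without downloading the NLTK data; the real function uses the full NLTK English list.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"the", "was", "and", "a", "is", "it"}

def text_proc_demo(text, stop_words=DEMO_STOPWORDS):
    """Strip punctuation, split on whitespace, and drop stopwords (case-insensitive)."""
    no_punc = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punc.split() if w.lower() not in stop_words]

tokens = text_proc_demo("The food was great, and the service was quick!")
print(tokens)  # ['food', 'great', 'service', 'quick']
```

Note that the surviving words keep their original casing; only the stopword comparison is lowercased.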

The block below imports the dataset, which was converted from JSON to CSV with an online tool as a pre-processing step outside the Python implementation. The resulting file is TripAdvisorReviews.csv.

*As mentioned above, the dataset contains 100 TripAdvisor reviews for each restaurant in the Bay Area.*

The dataset is loaded with pandas into a dataframe named **dataset**.

In [22]:
dataset = pd.read_csv('TripAdvisorReviews.csv')
print(dataset.shape)

(147, 617)


### The next task is to filter the **dataset** based on the cuisine entered by the user and create the filtered dataframe called **filter_data**

Please note that the cuisines in our dataset are spread across 5 different columns, 'cuisine/0' through 'cuisine/4', so each column is filtered separately and the results are concatenated.


In [23]:
#filtering out the data based on the cuisine entered by the user
filter_d0 = dataset[dataset['cuisine/0' ].isin(dataset_cuisine['craving_for'].tolist())]
filter_d1 = dataset[dataset['cuisine/1' ].isin(dataset_cuisine['craving_for'].tolist())]
filter_d2 = dataset[dataset['cuisine/2' ].isin(dataset_cuisine['craving_for'].tolist())]
filter_d3 = dataset[dataset['cuisine/3' ].isin(dataset_cuisine['craving_for'].tolist())]
filter_d4 = dataset[dataset['cuisine/4' ].isin(dataset_cuisine['craving_for'].tolist())]

filter_data = pd.concat([filter_d0, filter_d1, filter_d2, filter_d3, filter_d4], ignore_index=True)

filter_data.shape
filter_data

Unnamed: 0,name,address,locality,region,country,tel,fax,website,email,cuisine/0,...,reviews/98/review_text,reviews/98/review_rating,reviews/98/review_date,reviews/99/review_website,reviews/99/review_url,reviews/99/review_title,reviews/99/review_text,reviews/99/review_rating,reviews/99/review_date,trip_advisor_url
0,Amber India Restaurant,25 Yerba Buena Ln,San Francisco,CA,us,(415) 777-0500,(415) 777-0560,http://www.amber-india.com/SanFrancisco/,amberindiasf@gmail.com,Indian,...,"Always busy, this restaurant offers a comprehe...",5,5/23/2012 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g3...,Yum,This is consistently one of the better Indian ...,3,5/15/2012 0:00,http://www.tripadvisor.com/Restaurant_Review-g...
1,Chutney Restaurant,511 Jones St,San Francisco,CA,us,(415) 931-5541,,http://www.chutneysf.com,send@giftrocket.com,Indian,...,There are at least 3 Indian Veg buffets on Dru...,4,8/18/2013 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g1...,End of the affair,I've been coming to Chutneys for years on my t...,3,8/12/2013 0:00,http://www.tripadvisor.com/Restaurant_Review-g...
2,Dosa,995 Valencia St,San Francisco,CA,us,(415) 642-3672,(415) 643-8823,http://www.dosasf.com,comments@dosasf.com,Indian,...,I had dinner at Dosa with several colleagues m...,4,11/15/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g6...,Good flavors but lacking the essential Indian ...,"Very nice staff, beautiful room, creative menu...",4,11/11/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...
3,DOSA on Fillmore,1700 Fillmore St,San Francisco,CA,us,(415) 441-3672,(415) 643-8823,http://dosasf.com,reservations@dosasf.com,Indian,...,I had dinner at Dosa with several colleagues m...,4,11/15/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g6...,Good flavors but lacking the essential Indian ...,"Very nice staff, beautiful room, creative menu...",4,11/11/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...
4,Lahore Karahi,612 Ofarrell St,San Francisco,CA,us,(415) 567-8603,(415) 567-5411,http://www.lahorekarahisanfrancisco.com,alinetwork@hotmail.com,Indian,...,We love the Lahore Karahi for fantastic food t...,4,4/7/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g1...,Good value for money!,Excellent and fast service and food delicious!...,5,3/29/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...
5,Punjab Restaurant,2838 24th St,San Francisco,CA,us,(415) 282-4011,,http://sunriserestaurantsf.com,,Chinese,...,Attracted by large range of veggie dishes. 5:3...,2,12/4/2015 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g1...,Delicious,It was great to visit this restaurant . Food w...,4,11/30/2015 0:00,https://www.tripadvisor.com/Restaurant_Review-...
6,B Star Cafe,127 Clement St,San Francisco,CA,us,(415) 933-9900,,http://www.bstarbar.com,hello@bstarbar.com,Cafe,...,We have been visiting Chincoteague since the 1...,4,5/11/2015 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g5...,Take out hide away,Found this gem on the way to the beach. Sandwi...,5,5/6/2015 0:00,https://www.tripadvisor.com/Restaurant_Review-...
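
The five per-column filters can also be written as a single boolean mask. The sketch below is an alternative under stated assumptions, not the notebook's code: the toy frame and restaurant names are made up, and only two cuisine columns are shown.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical rows)
toy = pd.DataFrame({
    "name": ["Curry House", "Noodle Bar", "Fusion Spot"],
    "cuisine/0": ["Indian", "Chinese", "Cafe"],
    "cuisine/1": ["Vegetarian", None, "Indian"],
})

# Pick up every cuisine column, however many there are
cuisine_cols = [c for c in toy.columns if c.startswith("cuisine/")]

# True for a row if any of its cuisine columns equals the requested cuisine
mask = toy[cuisine_cols].isin(["Indian"]).any(axis=1)
matches = toy[mask].reset_index(drop=True)

print(matches["name"].tolist())  # ['Curry House', 'Fusion Spot']
```

Compared with concatenating five filtered frames, this keeps the original row order and returns a restaurant only once even if it lists the cuisine in more than one column.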


### Next step is to:

    1. select the 100 review_text columns associated with each restaurant that serves the entered cuisine.
    2. use the melt functionality to create a dataframe with all the reviews listed in a single column.
    3. This will eventually act as our X for the classification task.
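
The steps above can be sketched on a toy frame with two restaurants and two reviews each. The column names follow the dataset's pattern, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one row per restaurant, one column per review field
toy = pd.DataFrame({
    "name": ["A", "B"],
    "reviews/0/review_text": ["great dosa", "slow service"],
    "reviews/0/review_rating": [5, 2],
    "reviews/1/review_text": ["tasty curry", "nice naan"],
    "reviews/1/review_rating": [4, 4],
})

# Keep only the columns whose name contains the given substring
texts = toy.filter(like="/review_text")
ratings = toy.filter(like="/review_rating")

# melt stacks the columns one after another into a single value column
long_text = pd.melt(texts, value_name="reviews").drop("variable", axis=1)
long_rating = pd.melt(ratings, value_name="rating").drop("variable", axis=1)

print(long_text["reviews"].tolist())
# ['great dosa', 'slow service', 'tasty curry', 'nice naan']
print(long_rating["rating"].tolist())
# [5, 2, 4, 4]
```

Because melt walks through the columns in order, the review at position i and the rating at position i belong together only as long as the text and rating columns appear in the same order in the wide frame.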
    

In [24]:
df1 = filter_data.filter(like='/review_text')
df_review = pd.melt(df1, value_name = 'reviews').drop('variable', axis=1)
print(df_review)

                                               reviews
0    They have moved to a really nice location with...
1    I've been there several times and the buffet w...
2    There are delightful surprises at Dosa. The cu...
3    There are delightful surprises at Dosa. The cu...
4    This place is very busy because it gives its p...
5    We tried to book online on the day before we t...
6    What a wonderful find! We had lunch here two d...
7    1. Well First They have changed Location , Jus...
8    Went to Chutneys for dinner off the a la carte...
9    Out of town visitors and originally dined at t...
10   Out of town visitors and originally dined at t...
11   The food was amazing, I absolutely loved every...
12   a busy restaurant with an authentic menu. you ...
13   There's a reason why there is always a line at...
14   I always look for good Indian Food when I trav...
15   It is a BYOB with great vegetarian/south-india...
16   This is by far my favorite Indian food I have ...
17   This 

### The same steps are repeated for the rating attached to each review text:
    1. select only the review_rating columns associated with each of the 100 reviews for each restaurant.
    2. use the melt functionality to create a dataframe with all the ratings listed in a single column.
    3. This will eventually act as our Y for the classification task.


In [25]:
df2 = filter_data.filter(like='/review_rating')
df_rating = pd.melt(df2, value_name = 'rating').drop('variable', axis=1)
print(df_rating)

     rating
0         5
1         5
2         5
3         5
4         5
5         4
6         5
7         5
8         5
9         2
10        2
11        5
12        4
13        5
14        4
15        4
16        5
17        5
18        4
19        5
20        5
21        4
22        5
23        5
24        5
25        4
26        5
27        5
28        5
29        2
..      ...
670       4
671       5
672       4
673       3
674       5
675       5
676       4
677       4
678       5
679       5
680       5
681       5
682       5
683       2
684       4
685       5
686       5
687       4
688       4
689       4
690       4
691       2
692       4
693       3
694       3
695       4
696       4
697       5
698       4
699       5

[700 rows x 1 columns]


**Creating the X (list of all the reviews for all the restaurants for that cuisine) and**

**Creating the Y (list of all the ratings for all the restaurants for that cuisine)**

In [26]:
X = df_review['reviews']
Y = df_rating['rating']

## Vectorization

The classification algorithm will need some sort of **feature vector** in order to perform the classification task. The simplest way to convert a corpus to a vector format is the **bag-of-words approach**, where each unique word in a text will be represented by one number.


**At this point, our reviews are still raw strings; the text_proc function defined above will turn each one into a list of tokens. To enable Scikit-learn algorithms to work on our text, we will convert each review into a vector.**

**We will now use Scikit-learn’s *CountVectorizer* to convert the text collection into a matrix of token counts. The result is a 2-D matrix in which each row is a review and each column is a unique word.**


Since most words appear in only a few reviews, we expect most of the counts to be zero. Because of this, Scikit-learn outputs a sparse matrix.

We fit an instance of CountVectorizer (imported earlier) to our review text (stored in X), passing in our text_proc function as the analyzer.


In [27]:
bow_transformer = CountVectorizer(analyzer=text_proc).fit(X)
len(bow_transformer.vocabulary_)

5087
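
A small sketch of what the fitted vectorizer holds, using three made-up documents and a simple whitespace analyzer (an assumption made to keep the example free of NLTK; the notebook passes text_proc instead):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good food", "good service", "bad food"]

# A callable analyzer receives each raw document and returns its tokens
vectorizer = CountVectorizer(analyzer=lambda doc: doc.lower().split())
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray().tolist()

print(vocab)         # ['bad', 'food', 'good', 'service']
print(counts.shape)  # (3, 4): one row per document, one column per word
print(dense)         # [[0, 1, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
```

The length of vocabulary_ is therefore the number of columns in the matrix, which is the 5087 reported above for the Indian reviews.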

**Now that the vectorizer has been fitted, we transform X (the column of review texts) into a sparse matrix by calling the .transform() method of the fitted bag-of-words transformer.**



In [28]:
X = bow_transformer.transform(X)

**We will now check out the shape of our new X.**

In [29]:
print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)

density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print('Density: {}'.format((density)))

Shape of Sparse Matrix:  (700, 5087)
Amount of Non-Zero occurrences:  25436
Density: 0.7143137970737735
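
The density figure is a percentage: the share of matrix cells that hold a non-zero count. The arithmetic can be checked directly from the numbers printed above; the average per review is an extra figure derived here, not one the notebook reports.

```python
rows, cols, non_zero = 700, 5087, 25436

# Percentage of cells in the 700 x 5087 matrix that are non-zero
density_percent = 100.0 * non_zero / (rows * cols)

# Average number of distinct vocabulary words per review
avg_distinct_words = non_zero / rows

print(round(density_percent, 4))     # 0.7143
print(round(avg_distinct_words, 1))  # 36.3
```

In other words, fewer than 1 cell in 100 is populated, which is why a sparse representation is the natural choice here.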


### Training data and test data

**Now that the review text in X has been processed, we split X and Y into a training set and a test set using train_test_split from Scikit-learn. We use 75% of the dataset for training and hold out the remaining 25% for testing.**


In [30]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=101)
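
As a quick check of what this split produces, the sketch below applies the same call to placeholder arrays of the same length (700 rows). The arrays X_demo and y_demo are hypothetical. The second call shows the optional stratify argument, which the notebook does not use; it keeps the class proportions similar in both parts and requires at least two samples per class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the review matrix
X_demo = np.arange(700).reshape(-1, 1)
y_demo = np.tile([4, 5], 350)

# Same arguments as the notebook cell
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101
)
print(len(X_tr), len(X_te))  # 525 175

# Optional variant (not used in the notebook): preserve class proportions
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101, stratify=y_demo
)
print(len(Xs_tr), len(Xs_te))  # 525 175
```

The 175 test rows match the total support shown in the classification report.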

### Training our model

**Multinomial Naive Bayes is a variant of Naive Bayes suited to count features such as word counts in text documents. We now build a Multinomial Naive Bayes model and fit it to our training set (X_train and Y_train).**

In [31]:
nb = MultinomialNB()
nb.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Testing our Model

In [32]:
preds = nb.predict(X_test)
print(confusion_matrix(Y_test, preds))
print('\n')
print(classification_report(Y_test, preds))

[[ 1  6  1  2  1]
 [ 0  1  0  1  0]
 [ 0  0  2  9  2]
 [ 0  0  1 40 30]
 [ 0  0  0 15 63]]


             precision    recall  f1-score   support

          1       1.00      0.09      0.17        11
          2       0.14      0.50      0.22         2
          3       0.50      0.15      0.24        13
          4       0.60      0.56      0.58        71
          5       0.66      0.81      0.72        78

avg / total       0.64      0.61      0.59       175
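
The figures in the report can be reproduced from the confusion matrix alone, where rows are the true ratings 1 to 5 and columns are the predicted ratings. The sketch below recomputes them with NumPy; the "within one star" share is an extra summary derived here, not something the notebook reports.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [1, 6, 1, 2, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 2, 9, 2],
    [0, 0, 1, 40, 30],
    [0, 0, 0, 15, 63],
])

total = cm.sum()
accuracy = np.trace(cm) / total             # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

# Share of predictions that are at most one star away from the true rating
idx = np.arange(5)
near = np.abs(idx[:, None] - idx[None, :]) <= 1
within_one = cm[near].sum() / total

print(int(total))             # 175
print(round(accuracy, 3))     # 0.611
print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(within_one, 2))   # 0.96
```

Most of the errors are confusions between 4 and 5 stars, while the low ratings have very little support (11, 2 and 13 test reviews), so their per-class scores are unstable.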



#### Here we define a function called calculate, which takes the customer reviews passed to it and uses the model we have trained to return a predicted rating:

In [33]:
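def calculate(customer_review):
    # Vectorize the review text with the fitted bag-of-words transformer
    review_transformed = bow_transformer.transform([customer_review])
    # Return the single rating predicted by the trained Naive Bayes model
    return nb.predict(review_transformed)[0]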
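
When calculate is applied row by row, it receives a pandas Series holding all of a restaurant's reviews rather than a single string. A minimal sketch of one way to make that explicit is to join the reviews with spaces before vectorizing. Everything here is an assumption for illustration: the names rate_restaurant, vec and model, the toy training sentences, and the use of CountVectorizer's default analyzer instead of text_proc. The ratings printed in this notebook come from the original cell, not from this variant.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical)
train_text = ["great food lovely staff", "terrible food rude staff",
              "great service", "terrible service"]
train_rating = [5, 1, 5, 1]

vec = CountVectorizer().fit(train_text)
model = MultinomialNB().fit(vec.transform(train_text), train_rating)

def rate_restaurant(reviews):
    """Join one restaurant's reviews with spaces, then predict a single rating."""
    combined = " ".join(str(r) for r in reviews if pd.notna(r))
    return model.predict(vec.transform([combined]))[0]

# One row per restaurant, one column per review, as in the wide dataset
wide = pd.DataFrame({
    "reviews/0/review_text": ["great food", "terrible food"],
    "reviews/1/review_text": ["lovely staff", "rude staff"],
})
predicted = wide.apply(rate_restaurant, axis=1)
print(predicted.tolist())  # [5, 1]
```

Joining with a space keeps the last word of one review from fusing with the first word of the next, and skipping missing values avoids errors for restaurants with fewer reviews.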

### In this block:

**1. We add a column called "calculated_rating" to our existing dataframe "filter_data".**

**2. The new column calculated_rating is obtained by passing all the reviews of a particular restaurant to the trained model and taking the rating it returns.**

***Please note that this rating comes from our model, which was trained on reviews from all the restaurants serving that cuisine.***

In [34]:
filter_data['calculated_rating'] = df1.apply(calculate, axis=1)
filter_data

Unnamed: 0,name,address,locality,region,country,tel,fax,website,email,cuisine/0,...,reviews/98/review_rating,reviews/98/review_date,reviews/99/review_website,reviews/99/review_url,reviews/99/review_title,reviews/99/review_text,reviews/99/review_rating,reviews/99/review_date,trip_advisor_url,calculated_rating
0,Amber India Restaurant,25 Yerba Buena Ln,San Francisco,CA,us,(415) 777-0500,(415) 777-0560,http://www.amber-india.com/SanFrancisco/,amberindiasf@gmail.com,Indian,...,5,5/23/2012 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g3...,Yum,This is consistently one of the better Indian ...,3,5/15/2012 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,4
1,Chutney Restaurant,511 Jones St,San Francisco,CA,us,(415) 931-5541,,http://www.chutneysf.com,send@giftrocket.com,Indian,...,4,8/18/2013 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g1...,End of the affair,I've been coming to Chutneys for years on my t...,3,8/12/2013 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,4
2,Dosa,995 Valencia St,San Francisco,CA,us,(415) 642-3672,(415) 643-8823,http://www.dosasf.com,comments@dosasf.com,Indian,...,4,11/15/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g6...,Good flavors but lacking the essential Indian ...,"Very nice staff, beautiful room, creative menu...",4,11/11/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,5
3,DOSA on Fillmore,1700 Fillmore St,San Francisco,CA,us,(415) 441-3672,(415) 643-8823,http://dosasf.com,reservations@dosasf.com,Indian,...,4,11/15/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g6...,Good flavors but lacking the essential Indian ...,"Very nice staff, beautiful room, creative menu...",4,11/11/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,5
4,Lahore Karahi,612 Ofarrell St,San Francisco,CA,us,(415) 567-8603,(415) 567-5411,http://www.lahorekarahisanfrancisco.com,alinetwork@hotmail.com,Indian,...,4,4/7/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g1...,Good value for money!,Excellent and fast service and food delicious!...,5,3/29/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,4
5,Punjab Restaurant,2838 24th St,San Francisco,CA,us,(415) 282-4011,,http://sunriserestaurantsf.com,,Chinese,...,2,12/4/2015 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g1...,Delicious,It was great to visit this restaurant . Food w...,4,11/30/2015 0:00,https://www.tripadvisor.com/Restaurant_Review-...,4
6,B Star Cafe,127 Clement St,San Francisco,CA,us,(415) 933-9900,,http://www.bstarbar.com,hello@bstarbar.com,Cafe,...,4,5/11/2015 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g5...,Take out hide away,Found this gem on the way to the beach. Sandwi...,5,5/6/2015 0:00,https://www.tripadvisor.com/Restaurant_Review-...,5


### In this block we keep only those restaurants serving the requested cuisine for which our model returned a **5-star rating**

In [35]:
suggested_restaurants_with_attributes = filter_data.query("calculated_rating == 5")
suggested_restaurants_with_attributes

Unnamed: 0,name,address,locality,region,country,tel,fax,website,email,cuisine/0,...,reviews/98/review_rating,reviews/98/review_date,reviews/99/review_website,reviews/99/review_url,reviews/99/review_title,reviews/99/review_text,reviews/99/review_rating,reviews/99/review_date,trip_advisor_url,calculated_rating
2,Dosa,995 Valencia St,San Francisco,CA,us,(415) 642-3672,(415) 643-8823,http://www.dosasf.com,comments@dosasf.com,Indian,...,4,11/15/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g6...,Good flavors but lacking the essential Indian ...,"Very nice staff, beautiful room, creative menu...",4,11/11/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,5
3,DOSA on Fillmore,1700 Fillmore St,San Francisco,CA,us,(415) 441-3672,(415) 643-8823,http://dosasf.com,reservations@dosasf.com,Indian,...,4,11/15/2014 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g6...,Good flavors but lacking the essential Indian ...,"Very nice staff, beautiful room, creative menu...",4,11/11/2014 0:00,http://www.tripadvisor.com/Restaurant_Review-g...,5
6,B Star Cafe,127 Clement St,San Francisco,CA,us,(415) 933-9900,,http://www.bstarbar.com,hello@bstarbar.com,Cafe,...,4,5/11/2015 0:00,TripAdvisor,https://www.tripadvisor.com/ShowUserReviews-g5...,Take out hide away,Found this gem on the way to the beach. Sandwi...,5,5/6/2015 0:00,https://www.tripadvisor.com/Restaurant_Review-...,5


## A clean output showing only the restaurants and the ratings assigned by our model, which was trained on the customer reviews

In [36]:
suggested_restaurants_temp = suggested_restaurants_with_attributes.filter(['name', 'calculated_rating'])
suggested_restaurants = suggested_restaurants_temp.reset_index(drop=True)
print(suggested_restaurants)

               name  calculated_rating
0              Dosa                  5
1  DOSA on Fillmore                  5
2       B Star Cafe                  5
