## Restaurant Entrepreneur: Predicting the Best Type and Location with Yelp Review Data

Bill, an investor, recently spent a few days in Las Vegas and Phoenix on a work trip. During his stay, he really liked the restaurant options that were available. Since he had been interested in opening a restaurant for some time, he wanted to know which type of restaurant would be most successful and where to open it.

As a first look, our group decided to use the Yelp API to collect data on the mix of restaurants across cities and states in the United States, focusing on the review data that Yelp is known for. Because Yelp covers a wide variety of restaurant categories, it can provide reviews and ratings for many different restaurant types in each city. Our initial attempts ran into problems importing the data: each time we built a dataframe from the API, it pulled only a small portion of the data, not enough for a full analysis. Fortunately, we found a large, recent Yelp review dataset on Kaggle.com to work with.
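Small result sets like the ones described above are often a sign that an API returns results one page at a time. The following is a minimal sketch of collecting every page, assuming the endpoint pages its results through `limit` and `offset` style parameters. `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the stand-in function below replaces a real network call.

```python
from typing import Callable, Dict, Iterator, List


def fetch_all(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 50,
    max_items: int = 1000,
) -> Iterator[Dict]:
    """Yield items page by page until a page comes back empty or max_items is reached."""
    offset = 0
    while offset < max_items:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield from page
        offset += len(page)


# Stand-in for a real API call (assumption: the real call takes an offset and a limit).
FAKE_BUSINESSES = [{"id": i} for i in range(120)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return FAKE_BUSINESSES[offset:offset + limit]


all_rows = list(fetch_all(fake_fetch))
print(len(all_rows))
```

In a real run, `fake_fetch` would be replaced by a function that calls the API with the given offset and limit and returns the list of businesses from the response.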

Before settling on a machine learning model for the analysis, we wanted to test a few to see which performed best. We began by importing the libraries we needed, including scikit-learn's Random Forest Classifier and TensorFlow for a deep learning model. A Random Forest Classifier is a good choice when you want high performance and have less need for interpretability. Deep learning models are known for their accuracy when trained on large amounts of data. We also imported train_test_split, which splits the data into training and testing sets.

For the preprocessing, we imported StandardScaler and OneHotEncoder. StandardScaler transforms each feature so that it has a mean of 0 and a standard deviation of 1. OneHotEncoder creates a binary column for each value of a categorical feature, such as each restaurant type.
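As a small illustration of these two steps, the sketch below one-hot encodes a toy `ethnic_type` column and then standardizes the result. The toy values and the names `toy`, `enc_demo`, `onehot`, and `scaled` are made up for the example and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: four restaurants with three distinct types.
toy = pd.DataFrame({"ethnic_type": ["Thai", "Greek", "Thai", "Indian"]})

# One binary column per category; categories are sorted alphabetically.
enc_demo = OneHotEncoder()
onehot = enc_demo.fit_transform(toy[["ethnic_type"]]).toarray()
print(enc_demo.categories_[0])
print(onehot)

# Standardize each column to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(onehot)
print(scaled.mean(axis=0), scaled.std(axis=0))
```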



In [1]:
#Random forest and DeepLearning
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
import pandas as pd
import tensorflow as tf
from sqlalchemy import create_engine
import numpy as np

In [2]:
# The cleaned data is loaded into a Postgres database. It is also shaped into the format required for ML during the transformation process.
# We are trying a random forest classifier because it splits the data into smaller subsets, so its predictions could be reasonably accurate.
# We are also adding a deep learning model to obtain neural network predictions.
# Based on the fitted model, the output variable is predicted for the input variables.
# Once the complete dataset is loaded and the accuracy is known, we will pick the best approach. This should happen in the next session.

The cleaned data is loaded into a postgres database. It is also formatted to fit the required format for Machine Learning during the transformation process. 

In [3]:
# Pull data from the businesses table in Postgres
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost/Yelp_db')
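The connection string above embeds the database password in the notebook. One common alternative is to read the credentials from environment variables. The sketch below builds the same kind of URL that way; `build_postgres_url` is a hypothetical helper, and the variable names `PGUSER`, `PGPASSWORD`, `PGHOST`, and `PGDATABASE` are assumptions chosen for the example.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ) -> str:
    """Assemble a SQLAlchemy-style Postgres URL from environment-like settings."""
    user = env.get("PGUSER", "postgres")
    # Percent-encode the password so special characters do not break the URL.
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "localhost")
    database = env.get("PGDATABASE", "Yelp_db")
    return f"postgresql+psycopg2://{user}:{password}@{host}/{database}"


# Example with an explicit mapping instead of the real environment.
db_url = build_postgres_url({"PGPASSWORD": "s3cret/pw"})
print(db_url)
```

The resulting string could then be passed to `create_engine` in place of the literal URL.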

In [4]:
from sqlalchemy.orm import Session
session = Session(engine)

In [5]:
#reviewsDF = pd.read_sql('select stars,city,state,postal_code,category from reviews r, businesses b where b.business_id = r.business_id and length(postal_code)>0',engine)
reviewsDF = pd.read_sql(
    'select b.stars, b.city, b.postal_code, r.ethnic_type '
    'from business_reviews r, business_info b '
    'where b.business_id = r.business_id and length(b.postal_code)>0 '
    'group by b.stars, b.city, b.postal_code, r.ethnic_type, useful '
    'order by b.postal_code',
    engine)
XInput = reviewsDF

In [6]:
y = reviewsDF.stars
yDF=round(pd.DataFrame(y))
X = reviewsDF
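One detail worth keeping in mind when rounding star ratings: as far as we can tell, pandas rounding follows NumPy and rounds exact halves to the nearest even value, so 4.5 becomes 4 while 3.5 also becomes 4. The toy series below illustrates this, along with a round-half-up alternative; the names `stars_demo`, `rounded`, and `half_up` are made up for the example.

```python
import numpy as np
import pandas as pd

stars_demo = pd.Series([1.5, 2.5, 3.5, 4.5])

# Halves go to the nearest even integer.
rounded = stars_demo.round()
print(rounded.tolist())

# A round-half-up alternative, if that behaviour is preferred.
half_up = np.floor(stars_demo + 0.5)
print(half_up.tolist())
```

This may help explain why the 4-star class ends up much larger than the others after rounding.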

In [7]:
postalCountsX=X.postal_code.value_counts()
postalCountsX

89109    1345
89102     825
89119     708
89103     644
89146     610
         ... 
85388       3
89018       3
93013       2
85303       2
89161       2
Name: postal_code, Length: 108, dtype: int64

In [8]:
categoryCountsX=X.ethnic_type.value_counts()
categoryCountsX

American         4194
Mexican          3436
Italian          2194
Asian_Fusion     1620
Chinese          1616
Japanese         1427
Thai              784
Mediterranean     666
Greek             590
Vietnamese        522
Hawaiian          517
French            508
Indian            436
Korean            327
Spanish           207
British           150
African            79
Name: ethnic_type, dtype: int64

In [9]:
replace_type=list(categoryCountsX[categoryCountsX<100].index)
replace_postalcode=list(postalCountsX[postalCountsX<100].index)

In [10]:
# Bin rare restaurant types and rare postal codes into catch-all labels
for rare_type in replace_type:
    X.ethnic_type = X.ethnic_type.replace(rare_type, "Others")
for rare_code in replace_postalcode:
    X.postal_code = X.postal_code.replace(rare_code, "Combined")
X.head()

   stars     city  postal_code  ethnic_type
0    4.5  Phoenix     Combined      Italian
1    4.5  Phoenix     Combined      Italian
2    4.5  Phoenix     Combined      Italian
3    5.0  Phoenix     Combined     American
4    5.0  Phoenix     Combined     American
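The binning step above can also be written without a loop. The sketch below is one possible vectorized version on toy data; `bin_rare` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def bin_rare(series: pd.Series, min_count: int, other_label: str) -> pd.Series:
    """Replace values that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else.
    return series.where(~series.isin(rare), other_label)


demo = pd.Series(["American"] * 3 + ["Mexican"] * 2 + ["African"])
binned = bin_rare(demo, min_count=2, other_label="Others")
print(binned.value_counts())
```

With the notebook's data, the equivalent call would use a threshold of 100 and the labels "Others" and "Combined".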


In [11]:
# Generate our categorical variable list
reviewCatX = X.dtypes[X.dtypes == "object"].index.tolist()
X[reviewCatX].nunique()

city            2
postal_code    64
ethnic_type    17
dtype: int64

In [12]:
# Generate our categorical variable list
reviewCaty = yDF.dtypes[yDF.dtypes == "int64"].index.tolist()
yDF[reviewCaty].nunique()

Series([], dtype: float64)

In [13]:
predictInputDF = pd.DataFrame(X.groupby(['stars','postal_code','city','ethnic_type']).sum()).reset_index()
predictInputDF['stars'] = round(predictInputDF['stars'])

In [14]:
X = X.drop(columns=['stars'])

In [15]:
predictInputDF

      stars  postal_code       city  ethnic_type
0       1.0        85004    Phoenix      Chinese
1       1.0        85022    Phoenix      Italian
2       1.0        85032    Phoenix       Indian
3       1.0        85033    Phoenix      Italian
4       1.0        85040    Phoenix      Mexican
...     ...          ...        ...          ...
1720    5.0     Combined  Las Vegas      Mexican
1721    5.0     Combined    Phoenix     American
1722    5.0     Combined    Phoenix     Japanese
1723    5.0     Combined    Phoenix      Mexican


In [16]:
predictInputDF=predictInputDF.drop(columns=['stars'])
predictInputDF=predictInputDF.drop_duplicates()
XInput=predictInputDF
XInput

      postal_code       city  ethnic_type
0           85004    Phoenix      Chinese
1           85022    Phoenix      Italian
2           85032    Phoenix       Indian
3           85033    Phoenix      Italian
4           85040    Phoenix      Mexican
...           ...        ...          ...
1645     Combined    Phoenix     Hawaiian
1665        85029    Phoenix         Thai
1682        89103  Las Vegas       Others
1700        89121  Las Vegas      Spanish


In [17]:
# Split training/test datasets
X_train, X_test, y_train, y_test = train_test_split(X, yDF, random_state=1, stratify=y)
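The `stratify` argument keeps the class proportions roughly the same in the training and test sets, which matters when one class dominates. A minimal sketch on made-up labels (80% in one class) shows the effect; all names and values here are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 four-star, 12 three-star, 8 two-star.
labels = np.array([4] * 80 + [3] * 12 + [2] * 8)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, random_state=1, stratify=labels
)

# Share of the majority class in each split.
train_share = (y_tr == 4).mean()
test_share = (y_te == 4).mean()
print(train_share, test_share)
```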

In [18]:
# Create a OneHotEncoder instance; ignore categories at transform time that were not seen during fitting
enc = OneHotEncoder(sparse=False, handle_unknown="ignore")
# Fit the encoder on the training data only, then transform it
encodeDFX_train = pd.DataFrame(enc.fit_transform(X_train[reviewCatX]))
# Add the encoded variable names to the DataFrame
encodeDFX_train.columns = enc.get_feature_names(reviewCatX)

# Reuse the fitted encoder on the test data so its columns match the training columns
encodeDFX_test = pd.DataFrame(enc.transform(X_test[reviewCatX]))
# Add the encoded variable names to the DataFrame
encodeDFX_test.columns = enc.get_feature_names(reviewCatX)
# Encode the prediction input with the same fitted encoder
encodeInputX = pd.DataFrame(enc.transform(XInput[reviewCatX]))
encodeInputX.columns = enc.get_feature_names(reviewCatX)
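When an encoder is fitted once on the training rows and then reused with `transform`, every encoded dataset ends up with the same columns in the same order. The sketch below shows this on toy data, with `handle_unknown="ignore"` so that a category missing from the training rows is encoded as all zeros. "Henderson" is a made-up unseen value, and the names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"city": ["Phoenix", "Las Vegas", "Phoenix"]})
test = pd.DataFrame({"city": ["Phoenix", "Henderson"]})  # "Henderson" is unseen in train

enc_demo = OneHotEncoder(handle_unknown="ignore")
train_enc = enc_demo.fit_transform(train).toarray()  # learn the categories here
test_enc = enc_demo.transform(test).toarray()        # reuse them here

print(enc_demo.categories_[0])
print(test_enc)
```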

In [19]:
# Apply one-hot encoding to the target labels
enc.fit(y_train)
#encodeDFy_train = pd.DataFrame(enc.fit_transform(y_train[reviewCaty]))
#encodeDFy_test = pd.DataFrame(enc.fit_transform(y_test[reviewCaty]))
encoded_y_train = enc.transform(y_train)                           
encoded_y_test = enc.transform(y_test)                       

In [20]:
# # Merge one-hot encoded features and drop the originals
# TestDF = reviewsDF.merge(encodeDF,left_index=True, right_index=True)
# reviewsDF = reviewsDF.drop(reviewCat,1)
# reviewsDF.head()

In [21]:
# Create a StandardScaler instance
scaler = StandardScaler()
# Fit the StandardScaler
X_scaler = scaler.fit(encodeDFX_train)

In [22]:
# Scale the data
X_train_scaled = X_scaler.transform(encodeDFX_train)
X_test_scaled = X_scaler.transform(encodeDFX_test)
XInput_scaled = X_scaler.transform(encodeInputX)

In [23]:
print(XInput_scaled.shape)

(641, 83)


In [24]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [25]:
# Create a sequential model
model = Sequential()

In [26]:
# Add the first layer; the input dimension is the number of one-hot encoded feature columns (83 here)
model.add(Dense(100, activation='relu', input_dim=X_train_scaled.shape[1]))

In [27]:
# The output layer has 5 columns that are one-hot encoded
y_train.stars.value_counts()
#number_outputs = 5

4.0    10794
3.0     1783
2.0     1582
5.0      282
1.0       13
Name: stars, dtype: int64

In [28]:
# Add output layer using 5 output nodes
model.add(Dense(5, activation="softmax"))

In [29]:
# Compile the model using categorical_crossentropy for the loss function, the adam optimizer,
# and add accuracy to the training metrics
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [30]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 100)               8400      
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 505       
Total params: 8,905
Trainable params: 8,905
Non-trainable params: 0
_________________________________________________________________
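The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. The short sketch below reproduces the numbers, assuming 83 input columns as in the shape printed earlier; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


hidden_params = dense_params(83, 100)  # first Dense layer
output_params = dense_params(100, 5)   # softmax output layer
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```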


In [31]:
model.fit(
    X_train_scaled,
    encoded_y_train,
    epochs=30,
    shuffle=True,
    verbose=2
)

Train on 14454 samples
Epoch 1/30
14454/14454 - 1s - loss: 0.7725 - accuracy: 0.7379
Epoch 2/30
14454/14454 - 1s - loss: 0.6958 - accuracy: 0.7469
Epoch 3/30
14454/14454 - 1s - loss: 0.6728 - accuracy: 0.7498
Epoch 4/30
14454/14454 - 1s - loss: 0.6602 - accuracy: 0.7511
Epoch 5/30
14454/14454 - 1s - loss: 0.6497 - accuracy: 0.7541
Epoch 6/30
14454/14454 - 1s - loss: 0.6407 - accuracy: 0.7562
Epoch 7/30
14454/14454 - 1s - loss: 0.6331 - accuracy: 0.7572
Epoch 8/30
14454/14454 - 1s - loss: 0.6266 - accuracy: 0.7577
Epoch 9/30
14454/14454 - 1s - loss: 0.6216 - accuracy: 0.7578
Epoch 10/30
14454/14454 - 1s - loss: 0.6200 - accuracy: 0.7567
Epoch 11/30
14454/14454 - 1s - loss: 0.6147 - accuracy: 0.7592
Epoch 12/30
14454/14454 - 1s - loss: 0.6128 - accuracy: 0.7603
Epoch 13/30
14454/14454 - 1s - loss: 0.6097 - accuracy: 0.7610
Epoch 14/30
14454/14454 - 1s - loss: 0.6081 - accuracy: 0.7596
Epoch 15/30
14454/14454 - 1s - loss: 0.6056 - accuracy: 0.7615
Epoch 16/30
14454/14454 - 1s - loss: 0.60

<tensorflow.python.keras.callbacks.History at 0x1f012b5db88>

In [32]:
# Evaluate the model using the test data
model_loss, model_accuracy =model.evaluate(X_test_scaled,encoded_y_test)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

Loss: 0.6243524110262808, Accuracy: 0.7484955191612244
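To put the accuracy above in context, it helps to compare it with a baseline that always predicts the most common class. The sketch below computes that baseline from the training label counts shown earlier. Since the split is stratified, the test set should have a similar share, though that is an assumption rather than something measured here.

```python
# Training label counts copied from the value_counts output above.
train_counts = {4.0: 10794, 3.0: 1783, 2.0: 1582, 5.0: 282, 1.0: 13}

n_train = sum(train_counts.values())
majority_share = max(train_counts.values()) / n_train
print(n_train, round(majority_share, 4))
```

If the test accuracy is only slightly above this share, the model may be doing little more than predicting the 4-star class, which would be worth checking with a per-class report.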


In [33]:
y_pred_output = model.predict(XInput_scaled)

In [34]:
print(y_pred_output.shape)

(641, 5)


In [35]:
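import numpy as np
# Map each argmax index (0-4) back to its star class (1-5); the encoder sorts the classes in ascending order.
# Assign a NumPy array so values are set by position rather than aligned on the DataFrame index.
predictInputDF['prediction'] = np.argmax(y_pred_output, axis=1) + 1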

In [36]:
predictInputDF["prediction"] = predictInputDF["prediction"].fillna(0)
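The argmax of a softmax output is a column position, not a star rating. One way to translate positions back into labels is to index the encoder's `categories_` array, as in the sketch below. The toy labels and probabilities are made up, and `label_enc` stands in for whichever encoder was fitted on the target.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy target column; the encoder stores its classes in sorted order.
y_demo = np.array([[4.0], [2.0], [5.0], [4.0]])
label_enc = OneHotEncoder().fit(y_demo)

# Pretend softmax output for two rows over the three classes [2.0, 4.0, 5.0].
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])

# Column position of the highest probability, then look up the class label.
predicted_labels = label_enc.categories_[0][np.argmax(probs, axis=1)]
print(predicted_labels)
```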

In [37]:
# Export the predictions to Postgres
predictInputDF.to_sql(name='review_prediction', con=engine, if_exists='replace' ,index=False)

In [38]:
# from sklearn.metrics import classification_report
# print(classification_report(predictInputDF['stars'],predictInputDF['prediction']))
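A per-class report needs true labels to compare against, so it is usually run on a held-out split rather than on the prediction input. The sketch below shows the general shape on made-up labels where the model always predicts 4 stars; the arrays are illustrative and do not come from the notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy example: true ratings versus a model that always predicts 4.
y_true = [4, 4, 4, 3, 2, 4]
y_hat = [4, 4, 4, 4, 4, 4]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=[2, 3, 4])
print(cm)

# zero_division=0 avoids warnings for classes that are never predicted.
report = classification_report(y_true, y_hat, labels=[2, 3, 4], zero_division=0)
print(report)
```

A report like this makes it easier to see whether minority star classes are ever predicted, which overall accuracy alone can hide.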