# Sites Recommendation (Yelp dataset)


* [Introduction](#introduction)
* [Data](#data)
* [Preprocessing Data](#preprocessing-data) 
    * [Get Dummies from attributes](#get-dummies) 
* [Content Based Filtering Model](#content-based)
    * [K-nearest neighbours](#knn)       
* [Collaborative Filtering - Model](#collaboritive)
    * [SVD - Singular Value Decomposition](#svd)  
        - [Building a Utility Matrix](#u-matrix)
        - [Transposing the Matrix](#transpose-matrix)
        - [Decomposing the Matrix](#decompose-matrix)
        - [Generating a correlation Matrix](#gen-corr-matrix)  
        - [Isolating the most popular restaurant from the Correlation Matrix](#isolate)
        - [Recommend highly correlated Restaurants](#recommend)                   
    * [Neural Network - keras](#NN-keras)    
        - [Prediction](#prediction)
        - [Cosine similarity](#cos-similarity)
        - [Recommendation](#recommendation)

<a id="introduction"></a>
# Introduction

In this notebook, we build a recommender system on the Yelp dataset using two main approaches: Content-Based Filtering and Collaborative Filtering.

> * **Content-Based Filtering:** It is based on the features of the restaurants rather than the user features. The idea is if the user likes a restaurant then he/she will like the other similar restaurants.
>  
> * **Collaborative Filtering:** It is based on the assumption that people like restaurants similar to other restaurants they like, and restaurants that are liked by other people with similar tastes.

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
# for dirname, _, filenames in os.walk('C:/Users/sam79/OneDrive/桌面/W210 Yelp/data'):
    
#     for filename in filenames:
#         print(os.path.join(dirname, filename))


<a id="data"></a>
# Data

We are using subsets of each table since we have a large dataset to work with. For this notebook, we used _business_ and _review_ tables.
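As a minimal sketch of how a subset can be taken from a large JSON-lines file, the example below reads only the first chunk. The three records are made-up stand-ins for the Yelp business file, and the chunk size of 2 is chosen only to keep the illustration small.

```python
import io
import pandas as pd

# Three hypothetical JSON-lines records standing in for the Yelp business file.
json_lines = io.StringIO(
    '{"business_id": "a1", "city": "Philadelphia", "stars": 4.0}\n'
    '{"business_id": "b2", "city": "Reno", "stars": 3.5}\n'
    '{"business_id": "c3", "city": "Philadelphia", "stars": 5.0}\n'
)

# With chunksize set, read_json returns an iterator of DataFrames
# instead of loading the whole file at once.
reader = pd.read_json(json_lines, lines=True, chunksize=2)

# Keep only the first chunk and ignore the rest of the file.
first_chunk = next(iter(reader))
print(first_chunk.shape)
```

Only the rows of the first chunk are parsed, which keeps memory use bounded regardless of the file size.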

In [3]:
# import the data (chunksize returns jsonReader for iteration)
# businesses = pd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_business.json", lines=True, orient='columns', chunksize=1000000)
# reviews = pd.read_json("/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json", lines=True, orient='columns', chunksize=1000000)

In [4]:
# import the data (chunksize returns jsonReader for iteration)
non_food_reviews = pd.read_csv("./yelp_dataset/non_food_merged.csv")
# businesses = pd.read_json("./yelp_dataset/smaller/yelp_academic_dataset_business.json", lines=True, orient='columns', chunksize=1000000)
businesses = pd.read_json("./yelp_dataset/yelp_academic_dataset_business.json", lines=True, orient='columns', chunksize=1000000)
# reviews = pd.read_json("./yelp_dataset/smaller//yelp_academic_dataset_review.json", lines=True, orient='columns', chunksize=1000000) # 1000000 Total is 8.2 M reviews
reviews = pd.read_json("./yelp_dataset/yelp_academic_dataset_review.json", lines=True, orient='columns', chunksize=1000000) # 1000000 Total is 8.2 M reviews

# users = pd.read_json("./data/yelp_academic_dataset_user.json", lines=True, orient='columns', chunksize=10000000)

In [5]:
# read the data 
for business in businesses:
    subset_business = business
    break
    
for review in reviews:
    subset_review = review
    break

# for user in users:
#     subset_user = user
#     break

subset_non_food_reviews = non_food_reviews


In [None]:
# peek at the tables
display(subset_business.head(2))
display(subset_review.head(2))
display(subset_non_food_reviews.head(2))


In [None]:
subset_business.columns

In [None]:
subset_review.columns

In [None]:
subset_non_food_reviews.columns

In [None]:
print(subset_business.shape)
print(subset_review.shape)
# print(subset_user.shape)

In [None]:
x=subset_business['stars'].value_counts()
x=x.sort_index()
#plot
plt.figure(figsize=(8,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)  # recent seaborn versions require keyword x/y
plt.title("Star Rating Distribution")
plt.ylabel('# of businesses', fontsize=12)
plt.xlabel('Star Ratings ', fontsize=12)

In [None]:
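# Join with ', ' so the last category of one business does not fuse with the first of the next
business_cats = ', '.join(subset_business['categories'].dropna().astype('str'))

cats = pd.DataFrame(business_cats.split(', '), columns=['categories'])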

#prep for chart
x=cats.categories.value_counts()

x=x.sort_values(ascending=False)
x=x.iloc[0:20]

#chart
plt.figure(figsize=(16,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)  # recent seaborn versions require keyword x/y
plt.title("What are the top categories?",fontsize=25)
locs, labels = plt.xticks()
plt.setp(labels, rotation=80)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('Category', fontsize=12)

#adding the text labels
# rects = ax.patches
# labels = x.values
# for rect, label in zip(rects, labels):
#     height = rect.get_height()
#     ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')

plt.show()

In [None]:
# Count businesses per city and keep the top 20
x=subset_business['city'].value_counts()
x=x.sort_values(ascending=False)
x=x.iloc[0:20]
plt.figure(figsize=(16,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Which city has the most businesses?")
locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('City', fontsize=12)

In [None]:
# New York is missing; check whether it appears under another naming convention
subset_business[(subset_business['city'].str.contains('New'))]['city'].value_counts()

In [None]:
# San Diego and San Francisco are missing; check whether they appear under another naming convention
subset_business[(subset_business['city'].str.contains('San'))]['city'].value_counts()

In [None]:
# Paris is missing; check whether it appears under another naming convention
subset_business[(subset_business['city'].str.contains('Paris'))]['city'].value_counts()

In [None]:
subset_business['city'].value_counts().sort_values(ascending=False)

### Non-food Sites

In [None]:
# Select open businesses that belong to any of the following non-food categories
non_food_categories = ['Shopping', 'Nightlife', 'Home Services', 'Health & Medical', 'Local Services',
                       'Beauty & Spas', 'Event Planning & Services', 'Automotive', 'Active Life',
                       'Home & Garden', 'Fashion']
# na=False treats businesses with missing categories as non-matches
is_non_food = subset_business['categories'].str.contains('|'.join(non_food_categories), na=False)
Non_food = subset_business[is_non_food & (subset_business['is_open'] == 1)]
print(Non_food.shape)
Non_food.head()

In [None]:
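#prep for chart
# Join with ', ' so categories of adjacent businesses do not fuse together
Non_food_cats = ', '.join(Non_food['categories'].astype('str'))

cats = pd.DataFrame(Non_food_cats.split(', '), columns=['categories'])

x = cats.categories.value_counts()
# Some non-food businesses are also tagged as food; ignore those labels if present
x = x.drop(['Restaurants', 'Food'], errors='ignore')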

x=x.sort_values(ascending=False)
x=x.iloc[0:20]

#chart
plt.figure(figsize=(16,4))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)  # recent seaborn versions require keyword x/y
plt.title("What are the Top Non-food Categories?",fontsize=25)
locs, labels = plt.xticks()
plt.setp(labels, rotation=80)
plt.ylabel('# businesses', fontsize=12)
plt.xlabel('Category', fontsize=12)

In [None]:
print('Shopping Locations:')
display(Non_food[(Non_food['categories'].str.contains('Shopping'))].head())
print('Home Services Locations:')
display(Non_food[(Non_food['categories'].str.contains('Home Services'))].head())
print('Health & Medical Locations:')
display(Non_food[(Non_food['categories'].str.contains('Health & Medical'))].head())

<a id="preprocessing-data"></a>
# Preprocessing the Data

We chose Philadelphia because it has the highest number of restaurants, and restaurants are the most common category among businesses.

In [None]:
# Businesses in Philadelphia and currently open business
city = subset_business[(subset_business['city'].str.contains('Philadelphia')) & (subset_business['is_open'] == 1)]
Philadelphia = city[['business_id','name','address', 'categories', 'attributes','stars']]
Philadelphia

In [None]:
# getting just restaurants from Philadelphia business
rest = Philadelphia[Philadelphia['categories'].str.contains('Restaurant.*')==True].reset_index()
rest

<a id="get-dummies"></a>
* **Get dummies from the attributes and categories columns**

> The "attributes" column contains nested attributes. To create a feature table, we need to separate those nested attributes into their own columns. The following functions are used to achieve this.

In [None]:
import ast

# Pop a nested attribute (stored as a string) out of the attributes dictionary
def extract_keys(attr, key):
    if attr is None:
        return "{}"
    if key in attr:
        return attr.pop(key)
    return None  # the key is absent for this business

# Convert a string such as "{'garage': False}" into a dictionary
def str_to_dict(attr):
    if attr is not None:
        return ast.literal_eval(attr)
    return {}

In [None]:
# get dummies from nested attributes
rest['BusinessParking'] = rest.apply(lambda x: str_to_dict(extract_keys(x['attributes'], 'BusinessParking')), axis=1)
rest['Ambience'] = rest.apply(lambda x: str_to_dict(extract_keys(x['attributes'], 'Ambience')), axis=1)
rest['GoodForMeal'] = rest.apply(lambda x: str_to_dict(extract_keys(x['attributes'], 'GoodForMeal')), axis=1)
rest['Dietary'] = rest.apply(lambda x: str_to_dict(extract_keys(x['attributes'], 'Dietary')), axis=1)
rest['Music'] = rest.apply(lambda x: str_to_dict(extract_keys(x['attributes'], 'Music')), axis=1)
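The flattening of nested attributes can be illustrated on a toy table. This is a sketch under assumptions: the three rows, the `pop_nested` helper, and the use of `pd.DataFrame(list_of_dicts)` in place of `apply(pd.Series)` are illustrative choices, not the notebook's exact code.

```python
import ast
import pandas as pd

# Hypothetical rows: the nested Ambience attribute is stored as a string of a dict literal.
toy = pd.DataFrame({
    "name": ["Cafe A", "Diner B", "Grill C"],
    "attributes": [
        {"WiFi": "free", "Ambience": "{'romantic': False, 'casual': True}"},
        {"WiFi": "no"},   # no nested Ambience entry
        None,             # business without any attributes
    ],
})

def pop_nested(attr, key):
    """Remove `key` from the attribute dict and parse its string value into a dict."""
    if attr is None or key not in attr:
        return {}
    return ast.literal_eval(attr.pop(key))

# One Series of parsed dicts, then one column per nested key.
ambience = toy["attributes"].apply(lambda a: pop_nested(a, "Ambience"))
ambience_cols = pd.DataFrame(ambience.tolist(), index=toy.index)
print(ambience_cols)
```

Rows without the nested attribute end up as missing values, and the popped key no longer appears in the original `attributes` dictionary, so it is not encoded twice.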

In [None]:
rest

In [None]:
# create table with attribute dummies
df_attr = pd.concat([rest['attributes'].apply(pd.Series), rest['BusinessParking'].apply(pd.Series),
                    rest['Ambience'].apply(pd.Series), rest['GoodForMeal'].apply(pd.Series), 
                    rest['Dietary'].apply(pd.Series) ], axis=1)
df_attr_dummies = pd.get_dummies(df_attr)
df_attr_dummies

In [None]:
rest_attribute = rest['attributes'].apply(pd.Series)
rest_attribute

Most attributes reflect facilities and service of a restaurant

In [None]:
rest_business_parking = rest['BusinessParking'].apply(pd.Series)
rest_business_parking

Business parking features reflect parking related information

In [None]:
rest_ambience = rest['Ambience'].apply(pd.Series)
rest_ambience

Ambience features reflect the atmosphere of a restaurant

In [None]:
rest_goodformeal = rest['GoodForMeal'].apply(pd.Series)
rest_goodformeal

GoodForMeal features reflect the meals available during the day

In [None]:
# get dummies from categories
df_categories_dummies = pd.Series(rest['categories']).str.get_dummies(',')
df_categories_dummies
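One detail worth checking when encoding comma-separated categories is the separator. The sketch below, on three made-up category strings, compares splitting on `','` with splitting on `', '`; with the former, a label that follows a comma keeps its leading space and becomes a separate column.

```python
import pandas as pd

# Hypothetical category strings in the same "A, B" layout as the Yelp data.
categories = pd.Series(["Restaurants, Pizza", "Pizza, Restaurants", "Bars, Restaurants"])

by_comma = categories.str.get_dummies(",")         # keeps leading spaces
by_comma_space = categories.str.get_dummies(", ")  # strips them via the separator

print(sorted(by_comma.columns))        # includes both 'Pizza' and ' Pizza'
print(sorted(by_comma_space.columns))  # one column per category
```

If the duplicated columns are not wanted, splitting on `', '` (or stripping whitespace before encoding) avoids them.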

##### Based on the observation above, we may adopt categories and ambience to be the features and drop attributes, parking, good for meal features

In [None]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
display(df_categories_dummies)

In [None]:
# pull out names and stars from rest table 
result = rest[['name','stars']]
result

In [None]:
# Concat all tables and drop Restaurant column
df_final = pd.concat([df_attr_dummies, df_categories_dummies, result], axis=1)
df_final.drop('Restaurants',inplace=True,axis=1)

In [None]:
# map floating point stars to an integer
mapper = {1.0:1,1.5:2, 2.0:2, 2.5:3, 3.0:3, 3.5:4, 4.0:4, 4.5:5, 5.0:5}
df_final['stars'] = df_final['stars'].map(mapper)

In [None]:
# Final table for the models 
df_final

## Check how many attributes (tags) the dataset has for restaurants for our recommendation algorithms

In [None]:
# Check how many attributes (tags) are available for restaurants for our recommendation algorithms
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
df_final.head()

In [None]:
# List out all attributes with values
df_final.drop(['name','stars'], axis =1).sum(axis=0).sort_values(ascending = False).tail(100)

<a id="content-based"></a>
# Content Based Filtering- Model

We build a system that recognizes the similarity between restaurants based on specific features and recommends the restaurants that are most similar to a particular restaurant. The __df_final__ (features) table is used to build this system.

<a id="knn"></a>
## 1. K-Nearest Neighbours model (KNN)
> 
>    - Split the data into train and test set  (80:20)
>    - Instantiate and fit the model
>    - Test the model: we used the last row as a validation set (we didn't include this last row to train the model)
>    - Recommend restaurants for the validation set (the last restaurant in the df_final table)

In [None]:
# Create X (all the features) and y (target)
X = df_final.iloc[:,:-2]
y = df_final['stars']

* **Split the data into train and test set (80:20)**

In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(X, y, test_size=0.2, random_state=1)

* **Instantiate and fit the model**

In [None]:
y_train_knn.head()

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train_knn, y_train_knn)

#y_pred = knn.predict(X_test)

accuracy_train = knn.score(X_train_knn, y_train_knn)
accuracy_test = knn.score(X_test_knn, y_test_knn)

print(f"Score on training set: {accuracy_train}")
print(f"Score on test set: {accuracy_test}")

The restaurant of the validation set

In [None]:
# look at the last row for the test
display(df_final.iloc[-1:])

# look at the restaurant name from the last row.
print("Validation set (Restaurant name): ", df_final['name'].values[-1])

* **Test the model:** 

> We used the last row as a validation set (we didn't include this last row for modeling). 

In [None]:
# validation row from the df_final table (only the last row): "Steak & Cheese & Quick Pita Restaurant"
test_set = df_final.iloc[-1:,:-2]

# data used to refit the model: every row of df_final except the last one
X_val =  df_final.iloc[:-1,:-2]
y_val = df_final['stars'].iloc[:-1]

In [None]:
# refit the model on all rows except the validation row
n_knn = knn.fit(X_val, y_val)

After refitting the KNN model without the validation row, we find the distances between the validation restaurant and the other restaurants based on their features. 
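The nearest-neighbour lookup described above can be sketched on toy data. This example makes two assumptions: the four restaurants and their binary features are invented, and it uses scikit-learn's unsupervised `NearestNeighbors` rather than the notebook's `KNeighborsClassifier`, since only the `kneighbors` query is needed here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical restaurants with four binary features each.
names = ["Cafe A", "Diner B", "Grill C", "Query Bistro"]
features = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],   # held-out row used as the query
])

# Fit on every row except the last, then query with the held-out row.
catalog, query = features[:-1], features[-1:]
nn = NearestNeighbors(n_neighbors=2).fit(catalog)
distances, indices = nn.kneighbors(query)

# Smaller Euclidean distance means more shared features.
for dist, idx in zip(distances[0], indices[0]):
    print(f"{names[idx]}: distance {dist:.3f}")
```

The returned indices are positions in the fitted catalog, which is why they can be joined back to the feature table to recover names.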

In [None]:
# distances and indices of the nearest neighbours of the validation restaurant
# (Steak & Cheese & Quick Pita Restaurant)
distances, indices = n_knn.kneighbors(test_set)

# create a table of distances and row indices, reusing the single query above
final_table = pd.DataFrame({'distance': distances[0], 'index': indices[0]})
final_table.set_index('index')

We create the following ***result*** table, which lists the restaurants most similar to the validation restaurant by distance. In this recommender, a shorter distance means greater similarity to the validation restaurant.

In [None]:
# get the names of the restaurants that are similar to the validation restaurant
result = final_table.join(df_final,on='index')
result[['distance','index','name','stars']].head(5)

The drawback of Content-Based Filtering is that it captures no information about users' preferences, since it considers only restaurant features. Next, we implement Collaborative Filtering methods.

<a id="collaboritive"></a>
# User Collaborative Filtering - Model - Restaurants

We use Collaborative Filtering to recommend restaurants to users. The technique is based on the idea that similar users tend to have similar restaurant preferences. 

We are implementing the following machine learning techniques to build a recommender system:
1. Singular Value Decomposition model (SVD)
2. Neural Network (Keras)

<a id="svd"></a>
## 1. Singular Value Decomposition model (SVD)

In [47]:
# looking at the columns of subset_review table
subset_review.columns

Index(['review_id', 'user_id', 'business_id', 'stars', 'useful', 'funny',
       'cool', 'text', 'date'],
      dtype='object')

In [48]:
subset_review.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow!  Yummy, different,  delicious.   Our favo... | 2015-01-04 00:01:03 |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |


In [49]:
# pull out needed columns from subset_review table
df_review = subset_review[['user_id','business_id','stars', 'date']]
df_review

| | user_id | business_id | stars | date |
|---|---|---|---|---|
| 0 | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 2018-07-07 22:09:11 |
| 1 | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 2012-01-03 15:28:18 |
| 2 | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 2014-02-05 20:30:30 |
| 3 | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 2015-01-04 00:01:03 |
| 4 | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 2017-01-14 20:54:15 |
| ... | ... | ... | ... | ... |
| 999995 | oX7o1TH0PHUWp9r9ry9_vw | jLn69WQupjsDKrbPw_nlGQ | 3 | 2017-11-15 09:43:07 |
| 999996 | v8wlapFKVLs2qTYCGhCdiw | t6v8g8UeNiq3O2GoEc7R4Q | 4 | 2014-09-03 18:27:33 |
| 999997 | rLlYc1RzIBnOmnX3AbpEYw | ZYRul0i1bhOjirHED6Kd0w | 3 | 2016-02-20 22:25:29 |
| 999998 | eEH-8CEPU5ndPxDGzVfHiQ | onGXKwnxPLtKnO8yqQMPSA | 1 | 2010-06-27 02:17:30 |


In [50]:
# pull out names and addresses of the restaurants from rest table
restaurant = rest[['business_id', 'name', 'address']]
restaurant


In [51]:
# combine df_review and restaurant table
combined_business_data = pd.merge(df_review, restaurant, on='business_id')
combined_business_data


In [52]:
# the most popular restaurants by number of distinct reviewers
combined_business_data.groupby('business_id')['user_id'].nunique().sort_values(ascending=False).head()


In [None]:
# the most active users by number of distinct restaurants reviewed
combined_business_data.groupby('user_id')['business_id'].nunique().sort_values(ascending=False).head()

In [None]:
# see the NAME of the most popular restaurant
Filter = combined_business_data['business_id'] == 'EtKSTHV5Qx_Q7Aur9o4kQQ'
print("Name: ", combined_business_data[Filter]['name'].unique())
print("Address:", combined_business_data[Filter]['address'].unique())

The most popular restaurant by number of reviewers is **"Village Whiskey"**.

<a id="u-matrix"></a>
* **Building a Utility Matrix (User-Restaurant Matrix)**

This matrix holds the rating each user gave to each restaurant, with users as rows and restaurants as columns. The matrix is sparse because most users review only a few restaurants; missing ratings are filled with 0.
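A small sketch of the utility matrix on invented reviews is given here. The users, restaurant names and ratings are hypothetical; the `pivot_table` call mirrors the one used in the notebook.

```python
import pandas as pd

# Five hypothetical reviews from three users.
reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "name":    ["Cafe A", "Diner B", "Cafe A", "Diner B", "Grill C"],
    "stars":   [5, 3, 4, 2, 5],
})

# Rows are users, columns are restaurants; unrated pairs become 0.
utility = reviews.pivot_table(values="stars", index="user_id", columns="name", fill_value=0)
print(utility)

# Share of user-restaurant pairs without a rating.
sparsity = (utility == 0).to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```

Note that `fill_value=0` makes "not rated" indistinguishable from a very low score, which is a simplification to keep in mind when interpreting the factorization that follows. `pivot_table` also averages the ratings when a user reviewed the same name more than once.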

In [None]:
# create a user-item matrix
rating_crosstab = combined_business_data.pivot_table(values='stars', index='user_id', columns='name', fill_value=0)
rating_crosstab.head()

In [None]:
# Perform a random Check on one restaurant and see whether we have ratings in place
rating_crosstab[rating_crosstab['1 Stop Pizza']!=0].head()

<a id="transpose-matrix"></a>
* **Transposing the Matrix**

After transposing the matrix, restaurants are represented by rows and users by columns.

In [None]:
# shape of the Utility matrix (original matrix) 
rating_crosstab.shape

In [None]:
# Transpose the Utility matrix
X = rating_crosstab.values.T
X.shape

<a id="decompose-matrix"></a>
* **Decomposing the Matrix**

We use TruncatedSVD from sklearn to compress the transposed matrix down to a (number of restaurants) x 12 matrix. Every restaurant keeps its own row, while the users are compressed into 12 components (an arbitrary choice) that represent a generalized view of users' tastes.  
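To make the shapes concrete, here is a minimal sketch of the decomposition on a random toy matrix. The 5 x 8 size, the random ratings and the choice of 2 components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(17)

# Hypothetical transposed utility matrix: 5 restaurants x 8 users,
# ratings 0..5 where 0 stands for "no review".
X = rng.integers(0, 6, size=(5, 8)).astype(float)

# Compress the 8 user columns into 2 latent "taste" components.
svd = TruncatedSVD(n_components=2, random_state=17)
latent = svd.fit_transform(X)

print(latent.shape)                          # one row per restaurant
print(svd.explained_variance_ratio_.sum())   # share of variance retained
```

The explained variance ratio is one way to judge whether the chosen number of components discards too much information.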

In [None]:
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score


SVD = TruncatedSVD(n_components=12, random_state=17)
result_matrix = SVD.fit_transform(X)
result_matrix.shape

In [None]:
result_matrix

<a id="gen-corr-matrix"></a>
* **Generating a Correlation Matrix**

We calculate the Pearson correlation coefficient for every pair of restaurants in result_matrix. The correlation is based on similarities between users' tastes. 
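As a small worked example of what `np.corrcoef` returns, the sketch below uses three hand-made rows in place of the latent restaurant vectors. Each row is treated as one variable, so the result is a square restaurant-by-restaurant matrix.

```python
import numpy as np

# Hypothetical latent vectors for three restaurants.
latent = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # same pattern as row 0, scaled
    [3.0, 2.0, 1.0],   # reversed pattern
])

# Rows are variables by default, so corr[i, j] compares restaurant i with restaurant j.
corr = np.corrcoef(latent)
print(corr.round(2))
```

Rows 0 and 1 are perfectly correlated, while rows 0 and 2 are perfectly anti-correlated; the diagonal is each restaurant's correlation with itself.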

In [None]:
# PearsonR coef 
corr_matrix = np.corrcoef(result_matrix)
corr_matrix.shape

<a id="isolate"></a>
* **Isolating the most popular restaurant from the Correlation Matrix**

In our case, the most popular restaurant is "Village Whiskey", so we extract the correlation values between this target restaurant and all other restaurants from corr_matrix.

In [None]:
# get the index of the popular restaurant
restaurant_names = rating_crosstab.columns
restaurants_list = list(restaurant_names)

popular_rest = restaurants_list.index('Village Whiskey')
print("index of the popular restaurant: ", popular_rest)

# restaurant of interest 
corr_popular_rest = corr_matrix[popular_rest]

<a id="recommend"></a>
* **Recommend Highly Correlated Restaurants**

Now we select the restaurants most correlated with "Village Whiskey" by keeping those whose correlation is above 0.9 and below 1.0, as shown below.
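The filtering step can be sketched with invented names and correlation values. The example shows the threshold mask and, as an alternative that is an assumption rather than the notebook's method, a top-k ranking that removes the target by name.

```python
import numpy as np
import pandas as pd

# Hypothetical restaurant names and their correlation with the target (first entry).
names = pd.Index(["Target Grill", "Cafe A", "Diner B", "Bar C"])
corr_with_target = np.array([1.0, 0.95, 0.40, 0.92])

# Threshold filter: above 0.9, and below 1.0 to leave out the target itself.
mask = (corr_with_target < 1.0) & (corr_with_target > 0.9)
recommended = list(names[mask])
print(recommended)

# Alternative: drop the target by name and keep the k most correlated restaurants.
ranked = pd.Series(corr_with_target, index=names).drop("Target Grill").sort_values(ascending=False)
top2 = list(ranked.index[:2])
print(top2)
```

Dropping the target by name avoids depending on its self-correlation being exactly 1.0, which floating-point rounding does not always guarantee.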

In [None]:
list(restaurant_names[(corr_popular_rest < 1.0) & (corr_popular_rest > 0.9)])

In [None]:
display(rest[rest['name'] == 'Village Whiskey'])
display(rest[rest['name'] == 'Guavaberry Foods & Drinks '])
display(rest[rest['name'] == 'Halal Food Special'])
display(rest[rest['name'] == 'Prince Pizza II'])

<a id="NN-keras"></a>
## Model Performance Validation - Restaurants

<a id="recommend"></a>
* **User Story Simulation**

In [None]:
# the most active user with most number of reviews
combined_business_data.groupby('user_id')['business_id'].nunique().sort_values(ascending=False).head()


In [None]:
# Select a user with several reviews
target_user = '0DB3Irpf_ETVXu_Ou9vPow'
combined_business_data[combined_business_data['user_id']==target_user].groupby('user_id')['business_id'].nunique()

In [None]:
# Check the reviews from target user
combined_business_data[combined_business_data['user_id']==target_user].head()

In [None]:
# Collect the matrix indices of the restaurants reviewed by the target user
rest_reviews = combined_business_data[combined_business_data['user_id']==target_user]['name'].unique()
rest_reviews_index = []
for k in rest_reviews:    
    rest_reviews_index.append(restaurants_list.index(k))
print('Number of sites with reviews from target user:',len(rest_reviews))

In [None]:
# Restaurants the target user rated 5 stars
combined_business_data[(combined_business_data['user_id']==target_user) & (combined_business_data['stars']==5)]['name'].unique()
# Average rating the target user gave each restaurant (numeric_only skips the non-numeric columns)
rating_sites = combined_business_data[combined_business_data['user_id']==target_user].groupby('name').mean(numeric_only=True).sort_values(by='stars',ascending=False)
rating_sites['site_name'] = rating_sites.index
display(rating_sites.head())
display(rating_sites.tail())

In [None]:
# rating_sites.index
subset_business.shape

In [None]:
rating_sites['site_name'].index

In [None]:
subset_business[subset_business['name'].isin(rating_sites['site_name'])]['city'].value_counts()

In [None]:
subset_business[subset_business['name'].isin(['target','st honore pastries'])].head()

In [None]:
subset_business.head()

In [None]:
subset_business['name'].head()

In [None]:
# get the index of the target user's highest-rated restaurant
target_rest = restaurants_list.index(rating_sites.index[0])
print('High rating restaurant:',rating_sites.index[0])
print("index of the high rating restaurant: ", target_rest)

# restaurant of interest 
corr_target_rest = corr_matrix[target_rest]

In [None]:
# Sites with the highest rec_score
Rec_Score = corr_target_rest[rest_reviews_index]
Rec_result = pd.DataFrame({"name":rest_reviews,"rec_score":Rec_Score}).sort_values(by='rec_score',ascending=False)
Rec_result.head()

In [None]:
# Summarize the target user's ratings alongside the recommendation scores
rec_summary = pd.merge(rating_sites,Rec_result, left_on='site_name', right_on='name').reindex(columns=['name', 'site_name', 'stars','rec_score'])
rec_summary = rec_summary.drop('name',axis=1)
rec_summary = rec_summary.drop(0)  # drop the highest-rated restaurant itself, the reference for rec_score
rec_summary['avg_star'] = rec_summary['stars'].mean()
rec_summary['rank'] = rec_summary['rec_score'].rank(ascending=False)
top5_rec = rec_summary.sort_values(by='rec_score',ascending=False)
top5_rec[0:5]

In [None]:
review_rec_corr = rec_summary['stars'].corr(rec_summary['rec_score'])
print('Correlation coefficient between actual rating and recommendation:',review_rec_corr)

In [None]:
top5_rec['stars'].mean()

In [None]:
import matplotlib.pyplot as plt

# create scatter plot
plt.scatter(rec_summary['rec_score'],rec_summary['stars'])

# set plot title and labels
plt.title('Scatter Plot')
plt.xlabel('rec_score')
plt.ylabel('stars')

# show the plot
plt.show()

In [None]:
# plt.figure(figsize=(20, 8))
# plt.title("Number Of Contribution By Receipt Amount", fontsize=24)
# plt.xlabel("Receipt Amount", fontsize=14)
# plt.ylabel("Number Of Contribution", fontsize=14)
# plt.tick_params(axis='both', labelsize=12, color='darkblue')
# # plt.xticks(bin_range)

# plt.hist(rec_summary['rec_score'], facecolor='darkblue')

# User Collaborative Filtering - Non-food Sites

In [6]:
# Select the origin and destination cities
my_city = "Brentwood".lower()
dest_city = "Philadelphia".lower()
target_user = '0DB3Irpf_ETVXu_Ou9vPow'
target_business_name = 'famous footwear'
subset_non_food_reviews["name"] = subset_non_food_reviews["name"].str.lower()
subset_non_food_reviews["city"] = subset_non_food_reviews["city"].str.lower()
subset_non_food_reviews_orgin = subset_non_food_reviews[subset_non_food_reviews["city"]==my_city]
subset_non_food_reviews_dest = subset_non_food_reviews[(subset_non_food_reviews["city"]==dest_city) | (subset_non_food_reviews["city"]==my_city)]
subset_non_food_reviews_dest.shape

(99695, 27)

In [7]:
# see the NAME of the most popular restaurant
# non_food_Filter = subset_non_food_reviews_dest['business_id'] == target_business_id
# print("Name: ", subset_non_food_reviews_dest[non_food_Filter]['name'].unique())
# print("Address:", subset_non_food_reviews_dest[non_food_Filter]['address'].unique())

In [8]:
# create a user-item matrix
rating_crosstab = subset_non_food_reviews_dest.pivot_table(values='stars_y', index='user_id', columns='name', fill_value=0)
rating_crosstab.head()

# shape of the Utility matrix (original matrix) 
rating_crosstab.shape



(58534, 2822)

In [9]:
# Transpose the Utility matrix
X = rating_crosstab.values.T
X.shape

(2822, 58534)

In [10]:
# Decomposing the matrix
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score


SVD = TruncatedSVD(n_components=12, random_state=17)
result_matrix = SVD.fit_transform(X)
result_matrix.shape


(2822, 12)

In [11]:
# PearsonR coef, generating a correlation matrix
corr_matrix = np.corrcoef(result_matrix)
corr_matrix.shape

(2822, 2822)

In [12]:
# get the index of the target location
non_food_names = rating_crosstab.columns
non_food_list = list(non_food_names)

target_location_index = non_food_list.index(target_business_name)
print("index of the target location: ", target_location_index)

corr_target_location = corr_matrix[target_location_index]

list(non_food_names[(corr_target_location < 1.0) & (corr_target_location > 0.95)])

index of the target location:  801


['18th century garden',
 'alaska airlines',
 'cole haan',
 'el nuevo estilo',
 'he his exclusively',
 'j&t nail salon',
 "lily's nail salon",
 "macy's",
 "penn's landing",
 'philadelphia taxi',
 'polo ralph lauren factory store',
 'rail park',
 'rebecca nail salon',
 "red's creative cuts unisex salon",
 'shell gas',
 'starr garden playground',
 'statewide roadservice',
 "the men's club barber shop",
 'tooba fashions']

In [13]:
corr_target_location_tbl = pd.DataFrame({'name':non_food_names,'User_collaborative_Score':corr_target_location})


# Content-based Collaborative Filtering

In [14]:
import gensim
# import nltk
import pandas as pd

from gensim import corpora
from gensim.summarization import keywords  # requires gensim < 4.0; the summarization module was removed in 4.0
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob

from modules import CBR

pd.set_option("display.width", None)
import warnings

warnings.filterwarnings("ignore")
pd.options.display.max_columns = None


In [15]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [16]:
tst = subset_non_food_reviews.iloc[:20000,]
tmp = CBR.sentiment_analysis(tst)
tst_lda, tst_dict = CBR.lda_model(tst, num_topics=10, remove_stopwords=True)
tmp = CBR.extract_topics(tmp, tst_lda, tst_dict, num_words=10, remove_stopwords=True)
tmp = CBR.extract_keywords(tmp, remove_stopwords=True)
tmp

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sam79\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sam79\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0.1,Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars_x,review_count,is_open,attributes,categories,hours,BusinessAcceptsCreditCards,WheelchairAccessible,BusinessParking,ByAppointmentOnly,review_id,user_id,stars_y,useful,funny,cool,text,date,sentiment_polarity,sentiment_subjectivity,topic_id,topic_words,keywords
0,0,n_0UpQx1hsNbnPUSlodU8w,famous footwear,"8522 Eager Road, Dierbergs Brentwood Point",brentwood,MO,63144,38.627695,-90.340465,2.5,13,1,"{'RestaurantsPriceRange2': '2', 'BikeParking':...","Sporting Goods, Fashion, Shoe Stores, Shopping...","{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', ...",{},{},"{'garage': False, 'street': False, 'validated'...",{},e_PZZ0m2sEG9UovGRxdZRQ,dT6O_rV9DWYS-zHXhA6S6w,4,3,2,2,This has become my go to place for shoes. I a...,2015-12-06 16:46:43,0.109394,0.519021,1,"[like, get, one, store, place, really, I'm, fi...","[become, go, place, shoes, rewards, member, of..."
1,1,n_0UpQx1hsNbnPUSlodU8w,famous footwear,"8522 Eager Road, Dierbergs Brentwood Point",brentwood,MO,63144,38.627695,-90.340465,2.5,13,1,"{'RestaurantsPriceRange2': '2', 'BikeParking':...","Sporting Goods, Fashion, Shoe Stores, Shopping...","{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', ...",{},{},"{'garage': False, 'street': False, 'validated'...",{},WNv6UCHTmce7wgImLKP4sg,AAYvaNRQ0TD_2Lpo-wFOUA,4,3,2,1,"Oh, I do enjoy Famous Footwear. \n\nThe occasi...",2016-03-20 21:52:13,0.109394,0.519021,1,"[like, get, one, store, place, really, I'm, fi...","[enjoy, famous, footwear, oh, occasional, bogo..."
2,2,n_0UpQx1hsNbnPUSlodU8w,famous footwear,"8522 Eager Road, Dierbergs Brentwood Point",brentwood,MO,63144,38.627695,-90.340465,2.5,13,1,"{'RestaurantsPriceRange2': '2', 'BikeParking':...","Sporting Goods, Fashion, Shoe Stores, Shopping...","{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', ...",{},{},"{'garage': False, 'street': False, 'validated'...",{},VwoJCaULB5cRGnDiYGFYJA,V9fW3-fJ-sEMz_ewPpzXXg,1,2,0,0,Ordered shoes online it clearly says free retu...,2014-10-03 21:42:17,0.109394,0.519021,5,"[would, get, back, time, told, said, got, went...","[ordered, shoes, online, clearly, says, free, ..."
3,3,n_0UpQx1hsNbnPUSlodU8w,famous footwear,"8522 Eager Road, Dierbergs Brentwood Point",brentwood,MO,63144,38.627695,-90.340465,2.5,13,1,"{'RestaurantsPriceRange2': '2', 'BikeParking':...","Sporting Goods, Fashion, Shoe Stores, Shopping...","{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', ...",{},{},"{'garage': False, 'street': False, 'validated'...",{},crAIe0dciujX2sFvi7bSkA,iSgusF1eKu23mNG87zav4Q,1,0,0,0,Poor customer service will never be shopping a...,2018-08-13 15:40:01,0.109394,0.519021,5,"[would, get, back, time, told, said, got, went...","[poor, customer, service, never, shopping, sto..."
4,4,n_0UpQx1hsNbnPUSlodU8w,famous footwear,"8522 Eager Road, Dierbergs Brentwood Point",brentwood,MO,63144,38.627695,-90.340465,2.5,13,1,"{'RestaurantsPriceRange2': '2', 'BikeParking':...","Sporting Goods, Fashion, Shoe Stores, Shopping...","{'Monday': '0:0-0:0', 'Tuesday': '10:0-18:0', ...",{},{},"{'garage': False, 'street': False, 'validated'...",{},DRvrkDpdTOXK4j-Vx9qyDg,XfdP4UU3xMcdJbM3qUIaPA,5,0,0,0,Found a great deal on a pair of Nike running s...,2017-01-25 19:00:25,0.109394,0.519021,1,"[like, get, one, store, place, really, I'm, fi...","[found, great, deal, pair, nike, running, shoe..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,19995,FSiPq3GSzHOch1axV8VCKA,blind dog tavern,"50 N Sierra St, Ste 1A",reno,NV,89501,39.525276,-119.813913,4.5,59,1,"{'RestaurantsReservations': 'False', 'Restaura...","Nightlife, Bars, Speakeasies, Lounges, Dive Ba...","{'Monday': '15:0-2:0', 'Tuesday': '15:0-2:0', ...",{},True,"{'garage': False, 'street': True, 'validated':...",False,IeIkt1BY9zKi4p3HM-tcJg,TzjQHqlWUcWkQ1zLavRvig,5,0,0,0,"Seriously, my new favorite place in Reno. My b...",2019-03-16 03:40:28,0.291484,0.561843,4,"[great, place, time, always, recommend, I've, ...","[new, favorite, place, reno, seriously, boyfri..."
19996,19996,FSiPq3GSzHOch1axV8VCKA,blind dog tavern,"50 N Sierra St, Ste 1A",reno,NV,89501,39.525276,-119.813913,4.5,59,1,"{'RestaurantsReservations': 'False', 'Restaura...","Nightlife, Bars, Speakeasies, Lounges, Dive Ba...","{'Monday': '15:0-2:0', 'Tuesday': '15:0-2:0', ...",{},True,"{'garage': False, 'street': True, 'validated':...",False,MtWNqNHfla_OmCIUc-n9Nw,AUmp265ZW1MRbcm-OPpxYw,5,2,0,2,I've been searching for the best spiked coffee...,2019-02-24 04:58:09,0.291484,0.561843,4,"[great, place, time, always, recommend, I've, ...","[searching, best, spiked, coffee, would, guess..."
19997,19997,FSiPq3GSzHOch1axV8VCKA,blind dog tavern,"50 N Sierra St, Ste 1A",reno,NV,89501,39.525276,-119.813913,4.5,59,1,"{'RestaurantsReservations': 'False', 'Restaura...","Nightlife, Bars, Speakeasies, Lounges, Dive Ba...","{'Monday': '15:0-2:0', 'Tuesday': '15:0-2:0', ...",{},True,"{'garage': False, 'street': True, 'validated':...",False,8S28hDMSH6kVytQIeAPKbQ,ZjFRD1oZRZx8jto03gb-Og,5,0,0,1,Blind Dog Tavern is the best. The Staff is ama...,2018-06-18 17:19:14,0.291484,0.561843,1,"[like, get, one, store, place, really, I'm, fi...","[blind, dog, tavern, best, staff, amazing, coc..."
19998,19998,FSiPq3GSzHOch1axV8VCKA,blind dog tavern,"50 N Sierra St, Ste 1A",reno,NV,89501,39.525276,-119.813913,4.5,59,1,"{'RestaurantsReservations': 'False', 'Restaura...","Nightlife, Bars, Speakeasies, Lounges, Dive Ba...","{'Monday': '15:0-2:0', 'Tuesday': '15:0-2:0', ...",{},True,"{'garage': False, 'street': True, 'validated':...",False,MTOhvqxG7Zl1jUrD4WuklA,U4zhpIxjPYtrqxXiz-V--A,5,0,0,0,Omg T made us the best drinks ever.. best thin...,2019-03-08 23:57:12,0.291484,0.561843,8,"[tour, place, bar, good, food, one, fun, like,...","[omg, made, us, best, drinks, ever, maybe, top..."


In [19]:
tmp_agg = CBR.content_based_aggregate(tmp)
tmp_agg

Unnamed: 0,business_id,name,city,state,stars,review_count,sentiment_polarity,sentiment_subjectivity,topic_words_top20,keywords_top20,reviews
0,-7GjicSH_rM8JeZGCXGcUg,double decker,tampa,FL,3.181818,86.0,0.175360,0.499188,"[one, place, good, like, tour, bar, food, fun,...","[bar, karaoke, n, place, fun, great, like, nig...",We stopped in while bar hopping for a friend's...
1,-9n0NDe_pP1ZnrWr-lsDXQ,ralph's barber shop,glenside,PA,3.611111,18.0,0.170614,0.521920,"[get, place, time, really, one, great, always,...","[haircut, great, get, hair, barber, n, place, ...",An incredible local family establishment. I've...
2,-BhSR6dAry5-2x3ndjX_9w,meister's barber shop,philadelphia,PA,3.913580,79.0,0.231719,0.538825,"[get, place, really, time, great, always, reco...","[cut, great, hair, n, haircut, barber, place, ...",Coming from Los Angeles where the hair style c...
3,-UNPalKlpI-_2ejgFNCBPg,francesca's,nashville,TN,3.583333,12.0,0.189286,0.586814,"[get, one, place, really, like, store, I'm, fi...","[store, n, like, cute, back, one, little, --, ...",This store is packed into a pretty small space...
4,-cK2OGOzkvSaxkb91SIjVg,fantastic sams cut & color,zephyrhills,FL,2.333333,6.0,0.181969,0.525376,"[time, get, would, back, told, said, got, went...","[get, cut, appointment, n, hair, one, first, t...",Fantastic Sam's Hair Salon in Zephyrhills...
...,...,...,...,...,...,...,...,...,...,...,...
846,zn-WXkqHag5FSVc_LC9dEQ,renee hair salon,havertown,PA,4.200000,5.0,0.254306,0.594231,"[time, get, great, place, always, recommend, I...","[salon, hair, renee, told, elena, amy, amazing...",I've been going to Renee for 16 years & I woul...
847,znTKlh4x8NoBIojm4Yo5hA,albert's transportation,mount laurel,NJ,3.850000,20.0,0.209416,0.499913,"[get, time, one, place, really, would, back, t...","[albert, back, service, wedding, time, pick, d...","This is my first ever yelp review, because I t..."
848,znwTYnVgJ1MKQguvqEtMrA,"the foot whisperer, inc",eagle,ID,4.555556,9.0,0.361408,0.561340,"[get, time, place, really, great, always, reco...","[jeanne, pedicure, best, experience, spa, rela...",I've left voicemail messages with this busines...
849,zyge4T5eSiPHq1-IaJb_Qg,nancy le nails,philadelphia,PA,1.800000,10.0,0.076507,0.483948,"[would, get, back, time, told, said, got, went...","[n, service, nails, like, get, went, salon, go...",Came here for my first time with my friend and...


In [20]:
cs_tmp = CBR.calc_cosine_similarity(
    tmp_agg, target_business_name,my_city, dest_city
)
CBR.content_based_recommender(cs_tmp)

Unnamed: 0,business_id,name,city,state,sentiment_polarity,sentiment_subjectivity,topic_words_top20,keywords_top20,reviews,sentiment_polarity_rank,sentiment_subjectivity_rank,topic_words_top20_rank,keywords_top20_rank,reviews_rank,avg_rank
0,IAj1Lw3FAOY-yZn4IO7ElQ,baum's dancewear,philadelphia,PA,1.0,1.0,0.945905,0.38949,0.0,1,1,2,1,1,1.2
1,he6ypFmnUF95PlNUWT6i5g,tj maxx,philadelphia,PA,1.0,1.0,1.0,0.270369,0.0,1,1,1,3,1,1.4
2,LvFmVnPSbi0lmgpcK_qw-Q,foot locker,philadelphia,PA,1.0,1.0,0.945905,0.324443,0.0,1,1,2,2,1,1.4
3,h-y5azB-VlQAT3m7Ff2g2Q,p's & q's - premium quality,philadelphia,PA,1.0,1.0,0.945905,0.222566,0.0,1,1,2,4,1,1.8
4,9VRg8Ho9SoZWKPmjfrpVmw,forman mills,philadelphia,PA,1.0,1.0,1.0,0.216295,0.0,1,1,1,5,1,1.8
5,eJ77e9lGxY3ArzaoDbHhYw,paddy whacks irish sports pub - south street,philadelphia,PA,1.0,1.0,0.778981,0.216295,0.0,1,1,3,5,1,2.2
6,C2KhibuDzv3HJvZubkHTJA,village thrift stores,philadelphia,PA,1.0,1.0,0.945905,0.166924,0.0,1,1,2,7,1,2.4
7,yZP6Z8sbDpkeXyjNMDiyDg,guess - walnut street,philadelphia,PA,1.0,1.0,0.945905,0.162221,0.0,1,1,2,8,1,2.6
8,swiwJRUQHt79dJ5hqAERtQ,total serenity day spa,philadelphia,PA,1.0,1.0,0.759257,0.166924,0.0,1,1,5,7,1,3.0
9,zyge4T5eSiPHq1-IaJb_Qg,nancy le nails,philadelphia,PA,1.0,1.0,0.766965,0.162221,0.0,1,1,4,8,1,3.0
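
The source of the `CBR` helper is not shown in this notebook, so the following is only a sketch of one way to obtain rank columns like those above. It assumes a dense, descending rank per similarity column (ties share a rank) and an `avg_rank` that is the plain mean of those ranks, which is consistent with the numbers in the output. The toy scores and the names `scores`, `sim_cols`, and `ranked` are hypothetical.

```python
import pandas as pd

# Hypothetical similarity scores for three candidate businesses
# (column names mirror the output above; the values are made up).
scores = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "topic_words_top20": [1.0, 0.9, 0.9],
        "keywords_top20": [0.2, 0.4, 0.3],
    }
)

sim_cols = ["topic_words_top20", "keywords_top20"]

# Dense rank per column: the highest similarity gets rank 1, ties share a rank.
for col in sim_cols:
    scores[f"{col}_rank"] = scores[col].rank(method="dense", ascending=False).astype(int)

# Average the per-column ranks and sort so the best candidates come first.
rank_cols = [f"{col}_rank" for col in sim_cols]
scores["avg_rank"] = scores[rank_cols].mean(axis=1)
ranked = scores.sort_values("avg_rank").reset_index(drop=True)
print(ranked[["name", "avg_rank"]])
```

Averaging ranks rather than raw similarities keeps a single column with a wide value range from dominating the final ordering.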


# Algorithm Evaluation

In [None]:
# Businesses the target user rated 5 stars (wrapped in display so the result is shown)
display(subset_non_food_reviews[(subset_non_food_reviews['user_id']==target_user) & (subset_non_food_reviews['stars_y']==5)]['name'].unique())
# Mean of the numeric columns per business for the target user, best-rated first
rating_sites = subset_non_food_reviews[subset_non_food_reviews['user_id']==target_user].groupby('name').mean(numeric_only=True).sort_values(by='stars_y',ascending=False)
rating_sites['site_name'] = rating_sites.index
display(rating_sites.head())
display(rating_sites.tail())

In [None]:
target_user_non_food = pd.merge(subset_non_food_reviews[(subset_non_food_reviews['user_id']==target_user)& (subset_non_food_reviews['city']==dest_city)],corr_target_location_tbl, left_on='name', right_on='name')
target_user_non_food.head()

<a id="NN-keras"></a>
## 2. Neural Network Model - Keras

Finally, we’ll build a neural network and see how it compares to the other collaborative filtering approach. 

In [None]:
# create the copy of combined_business_data table
combined_business_data_keras = combined_business_data.copy()
combined_business_data_keras.head(1)

We use LabelEncoder from sklearn to encode the business and user IDs as consecutive integers. We also store the number of unique users and restaurants, along with the minimum and maximum rating.

In [None]:
from sklearn.preprocessing import LabelEncoder

user_encode = LabelEncoder()

combined_business_data_keras['user'] = user_encode.fit_transform(combined_business_data_keras['user_id'].values)
n_users = combined_business_data_keras['user'].nunique()

item_encode = LabelEncoder()

combined_business_data_keras['business'] = item_encode.fit_transform(combined_business_data_keras['business_id'].values)
n_rests = combined_business_data_keras['business'].nunique()

combined_business_data_keras['stars'] = combined_business_data_keras['stars'].values#.astype(np.float32)

min_rating = min(combined_business_data_keras['stars'])
max_rating = max(combined_business_data_keras['stars'])

print(n_users, n_rests, min_rating, max_rating)

combined_business_data_keras
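
As a small illustration of the encoding step above, here is a toy sketch of how LabelEncoder assigns integer codes. The IDs are made up; the point is that codes follow the sorted order of the unique values (stored in `classes_`), not the order in which IDs first appear in the table.

```python
from sklearn.preprocessing import LabelEncoder

# Toy IDs standing in for the real business_id values (hypothetical).
ids = ["zeta", "alpha", "mid", "alpha"]

enc = LabelEncoder()
codes = enc.fit_transform(ids)

print(codes)         # integer code for each input ID
print(enc.classes_)  # original IDs in sorted order; position i corresponds to code i

# Map integer codes back to the original IDs.
decoded = enc.inverse_transform([0, 1, 2])
print(decoded)
```

This ordering matters later: row i of an embedding matrix learned from these codes belongs to `classes_[i]`.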

Split the data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = combined_business_data_keras[['user', 'business']].values
y = combined_business_data_keras['stars'].values

X_train_keras, X_test_keras, y_train_keras, y_test_keras = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_keras.shape, X_test_keras.shape, y_train_keras.shape, y_test_keras.shape

In [None]:
X_train_keras[:, 0]

We also need a variable that stores the number of latent factors per user/restaurant. The value is a tunable choice, but it must be the same for users and restaurants because the model takes the dot product of the two embedding vectors.

Finally, we store users and restaurants in separate arrays for the train and test sets, because Keras defines them as two distinct inputs.

In [None]:
n_factors = 50

X_train_array = [X_train_keras[:, 0], X_train_keras[:, 1]]
X_test_array = [X_test_keras[:, 0], X_test_keras[:, 1]]

In [None]:
X_train_array, X_test_array

Here, we represent each user and each restaurant with an embedding: a learned vector of size n_factors. The dot product between a user vector and a restaurant vector gives a raw score for how well that user matches that restaurant, and training adjusts the embeddings so that this score tracks the observed ratings.

To improve model performance, we also learn a scalar "bias" for each user and each restaurant and add both to the dot product. The sum is passed through a sigmoid and then rescaled to the range between the minimum and maximum ratings in the data.
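
To make the arithmetic concrete, here is a minimal NumPy sketch of the prediction for a single user/restaurant pair: dot product, plus the two biases, through a sigmoid, rescaled to the rating range. The function name `predict_rating` and the random vectors are illustrative assumptions, not part of the Keras model.

```python
import numpy as np

def predict_rating(u, m, ub, mb, min_rating, max_rating):
    """Scaled-sigmoid prediction for one user/restaurant pair."""
    raw = np.dot(u, m) + ub + mb           # dot product plus the two bias terms
    squashed = 1.0 / (1.0 + np.exp(-raw))  # sigmoid maps the raw score into (0, 1)
    return squashed * (max_rating - min_rating) + min_rating

rng = np.random.default_rng(0)
n_factors = 50
u = rng.normal(scale=0.1, size=n_factors)  # hypothetical user embedding
m = rng.normal(scale=0.1, size=n_factors)  # hypothetical restaurant embedding

print(predict_rating(u, m, ub=0.0, mb=0.0, min_rating=1.0, max_rating=5.0))
```

Because of the sigmoid, every prediction stays strictly between the minimum and maximum rating, and a raw score of zero maps to the midpoint of the range.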

In [None]:
# Embedding is imported from keras.layers; the keras.layers.embeddings path is not available in newer Keras releases
from keras.layers import Add, Activation, Lambda, Input, Reshape, Dot, Embedding
from keras.models import Model
from keras.optimizers import Adam
from keras.regularizers import l2

class EmbeddingLayer:
    def __init__(self, n_items, n_factors):
        self.n_items = n_items
        self.n_factors = n_factors
    
    def __call__(self, x):
        x = Embedding(self.n_items, self.n_factors, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(x)
        x = Reshape((self.n_factors,))(x)
        
        return x
    
def Recommender(n_users, n_rests, n_factors, min_rating, max_rating):
    user = Input(shape=(1,))
    u = EmbeddingLayer(n_users, n_factors)(user)
    ub = EmbeddingLayer(n_users, 1)(user)
    
    restaurant = Input(shape=(1,))
    m = EmbeddingLayer(n_rests, n_factors)(restaurant)
    mb = EmbeddingLayer(n_rests, 1)(restaurant)   
    
    x = Dot(axes=1)([u, m])
    x = Add()([x, ub, mb])
    x = Activation('sigmoid')(x)
    x = Lambda(lambda x: x * (max_rating - min_rating) + min_rating)(x)  
    
    model = Model(inputs=[user, restaurant], outputs=x)
    opt = Adam(learning_rate=0.001)  # 'lr' is a deprecated alias of 'learning_rate'
    model.compile(loss='mean_squared_error', optimizer=opt)  
    
    return model

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
# !pip uninstall keras

In [None]:
# !pip install keras

In [None]:
keras_model = Recommender(n_users, n_rests, n_factors, min_rating, max_rating)
keras_model.summary()

Let’s go ahead and train this for a few epochs and see what we get.

In [None]:
keras_model.fit(x=X_train_array, y=y_train_keras, batch_size=64,
                epochs=5, verbose=1, validation_data=(X_test_array, y_test_keras))

<a id="prediction"></a>
* **Prediction**

With the model trained, we can now predict ratings for the test set.

In [None]:
# prediction
predictions = keras_model.predict(X_test_array)

The following table lets us gauge the model's performance by placing the actual stars next to the predictions.

In [None]:
# create the df_test table with prediction results
df_test = pd.DataFrame(X_test_keras[:,0])
df_test.rename(columns={0: "user"}, inplace=True)
df_test['business'] = X_test_keras[:,1]
df_test['stars'] = y_test_keras
df_test["predictions"] = predictions.ravel()  # predict returns shape (n, 1); flatten to 1-D
df_test.head()
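
The comparison of actual and predicted stars can also be summarized in a single number. Below is a minimal sketch, assuming RMSE and MAE are acceptable summary metrics for this task; the helper name `rmse_mae` and the toy values are hypothetical.

```python
import numpy as np

def rmse_mae(actual, predicted):
    """Return (RMSE, MAE) between two sequences of ratings."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()  # also flattens (n, 1) model output
    errors = predicted - actual
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    return rmse, mae

# Toy values standing in for df_test['stars'] and df_test['predictions'].
actual = [5, 3, 4, 1]
predicted = [[4.5], [3.5], [4.0], [2.0]]

rmse, mae = rmse_mae(actual, predicted)
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}")
```

With the notebook's variables this would be called as `rmse_mae(df_test['stars'], df_test['predictions'])`. Note that the RMSE on the test set is the square root of the mean squared error loss the model is trained to minimize.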

In [None]:
# Plotting the distribution of actual and predicted stars
import matplotlib.pyplot as plt
import seaborn as sns
values, counts = np.unique(df_test['stars'], return_counts=True)

plt.figure(figsize=(8,6))
plt.bar(values, counts, tick_label=['1','2','3','4','5'], label='true value')
plt.hist(predictions, color='orange', label='predicted value')
plt.xlabel("Ratings")
plt.ylabel("Frequency")
plt.title("Ratings Histogram")
plt.legend()
plt.show()

In [None]:
# # plot 
# import matplotlib.pyplot as plt
# import seaborn as sns

# plt.figure(figsize=(15,6))

# ax1 = sns.distplot(df_test['stars'], hist=False, color="r", label="Actual Value")
# sns.distplot(predictions, hist=False, color="g", label="model2 Fitted Values" , ax=ax1)

# plt.title('Actual vs Fitted Values for Restaurant Ratings')
# plt.xlabel('Stars')
# plt.ylabel('Proportion of Ratings')

# plt.show()
# plt.close()

<a id="cos-similarity"></a>
* **Cosine similarity**

We use cosine similarity as a numeric measure of how similar two restaurants are. To compute it, we extract the restaurant embedding layer from the Keras model; once the embeddings are normalized, cosine similarity reduces to a dot product.

In [None]:
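# Extract the restaurant factor embeddings.
# Note: Keras auto-generates layer names, so 'embedding_3' depends on the Keras
# version and on how many layers were created earlier in the session; check
# keras_model.summary() and confirm the printed shape is (n_rests, n_factors).
emb = keras_model.get_layer('embedding_3')
emb_weights = emb.get_weights()[0]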

print("The shape of embedded weights: ", emb_weights.shape)
print("The length of embedded weights: ", len(emb_weights))

Each restaurant is now represented as a 50-dimensional vector. We need to normalize the embeddings so that the dot product between two embeddings becomes the cosine similarity.

Source:  https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9

In [None]:
# normalize and reshape embedded weights
emb_weights = emb_weights / np.linalg.norm(emb_weights, axis = 1).reshape((-1, 1))
len(emb_weights)
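
The claim that a dot product of normalized embeddings equals the cosine similarity can be checked on a small random matrix. This is a sketch with made-up data; `emb` stands in for the real weight matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.normal(size=(4, 50))  # stand-in for the (n_rests, n_factors) weight matrix

# Divide each row by its L2 norm so every row has unit length.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine similarity computed from its definition for rows 0 and 1.
a, b = emb[0], emb[1]
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The dot product of the normalized rows gives the same number.
print(np.isclose(unit[0] @ unit[1], cosine))
```

Using `keepdims=True` is equivalent to the `.reshape((-1, 1))` in the cell above; both make the norms broadcast across each row.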

In [None]:
# get business_ids (restaurants) in the order used by the label encoder,
# so that row i of emb_weights lines up with the business encoded as i
rest_id_emb = item_encode.classes_
len(rest_id_emb)

We now create a table with one row per unique restaurant, holding its 50 embedding weights.

In [None]:
rest_pd = pd.DataFrame(emb_weights)
rest_pd["business_id"] = rest_id_emb
rest_pd = rest_pd.set_index("business_id")
rest_pd

In [None]:
# merging rest_pd and temp tables to get the name of the restaurants.
temp = combined_business_data_keras[['business_id', 'name']].drop_duplicates()
df_recommend = pd.merge(rest_pd, temp, on='business_id')
df_recommend

<a id="recommendation"></a>
* **Recommendation**

Now we are going to use this model to recommend restaurants similar to a popular restaurant, "Wvrst".

In [None]:
# extract the target restaurant from the df_recommend table
target = df_recommend[df_recommend['name'] == 'Wvrst']
target.iloc[:,1:51]

We create a function that calculates the cosine similarity between the target and every restaurant in the table (including the target itself) and returns a table with the results.

In [None]:
def find_similarity_total(rest_name):
    """Recommends restaurant based on the cosine similarity between restaurants"""
    cosine_list_total = []
    result = []

    # the target's embedding does not change, so look it up once outside the loop
    sample_name = df_recommend[df_recommend["name"] == rest_name].iloc[:,1:51]

    for i in range(0, df_recommend.shape[0]):
        row = df_recommend.iloc[i,1:51]
        # embeddings are normalized, so the dot product is the cosine similarity
        cosine_total = np.dot(sample_name, row)
        
        recommended_name = df_recommend.iloc[i,51]
        cosine_list_total.append(cosine_total)
        result.append(recommended_name)
        
    cosine_df_total = pd.DataFrame({"similar_rest" : result, "cosine" : cosine_list_total})

    return cosine_df_total
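
The row-by-row loop above can also be written as a single matrix-vector product. The following is a hedged sketch of that alternative, not the notebook's method: the function `most_similar`, its parameters, and the toy data are hypothetical, and it assumes the embedding rows are already normalized to unit length.

```python
import numpy as np
import pandas as pd

def most_similar(unit_emb, names, target_name, top_n=3):
    """Rank all items by cosine similarity to the target (rows must be unit length)."""
    names = np.asarray(names)
    target_idx = np.flatnonzero(names == target_name)[0]  # first match if the name repeats
    sims = unit_emb @ unit_emb[target_idx]                # one dot product per row
    out = pd.DataFrame({"similar_rest": names, "cos": sims})
    return out.sort_values("cos", ascending=False).head(top_n).reset_index(drop=True)

# Toy unit-length embeddings for four made-up restaurants.
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
names = ["A", "B", "C", "D"]

top = most_similar(emb, names, "A")
print(top)
```

The target always ranks first with a similarity of 1, so in practice the first row would be skipped. The similarities come back as plain floats, which avoids the one-element arrays produced by the loop version.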

In [None]:
# call the function with input of "Wvrst" and store it in result variable.
result = find_similarity_total('Wvrst')

In [None]:
# head of result table
result.head()

Because np.dot returns each similarity wrapped in a one-element array, we created the following function to get rid of the "[ ]" in the "cosine" column.

In [None]:
def convert(value):
    """Strip '[' and ']' from the string form of a one-element array and return a float."""
    return float(str(value).replace('[','').replace(']',''))

In [None]:
# create new column called "cos" in result table
result['cos'] = result.apply(lambda x: convert(x['cosine']), axis=1)

# drop original 'cosine' column (which had values with np.array)
result.drop('cosine', axis=1, inplace=True)

# sort values with cos
result.sort_values('cos', ascending=False).head(10)