# Yelp Dataset Challenge

![Yelp Data Challenge](https://s3-media3.fl.yelpcdn.com/assets/srv0/engineering_pages/6d323fc75cb1/assets/img/dataset/960x225_dataset@2x.png)

## Load data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('last_2_years_restaurant_reviews.csv')

In [3]:
df.head()

,business_id,name,categories,avg_stars,cool,date,funny,review_id,stars,text,useful,user_id
0,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5,0,2016-07-03,0,c6iTbCMMYWnOd79ZiWwobg,1,"I ordered a few 12 inch sandwiches , a turkey ...",1,ih7Dmu7wZpKVwlBRbakJOQ
1,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5,0,2018-03-10,0,5iDdZvpK4jOv2w5kZ15TUA,1,Worst subway of any I have visited. I have man...,1,m3WBc9bGxn1q1ikAFq8PaA
2,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5,0,2016-12-26,0,oCUrLS4T-paZBr6WnrXg_A,2,Good luck trying to get the order right. The c...,0,H7bJDtGzhdg1fsmBL4KZWg
3,kgffcoxT6BQp-gJ-UQ7Czw,Subway,"Fast Food, Restaurants, Sandwiches",2.5,0,2016-12-16,0,qXHvWYgL-8yfcGvP_ydKGA,2,Here to get my pick up order at the moment it ...,0,58sXi_0oTgVlM3aUuFYHUA
4,0jtRI7hVMpQHpUVtUy4ITw,Omelet House Summerlin,"Beer, Wine & Spirits, Italian, Food, American ...",4.0,1,2016-12-29,0,j9l7IMJX9bvWjkJ18EWGpg,5,"My husband & I were visiting the area, found t...",0,ZS7V0uC4kVrJR_4Yi3oTHA


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398037 entries, 0 to 398036
Data columns (total 12 columns):
business_id    398037 non-null object
name           398037 non-null object
categories     398037 non-null object
avg_stars      398037 non-null float64
cool           398037 non-null int64
date           398037 non-null object
funny          398037 non-null int64
review_id      398037 non-null object
stars          398037 non-null int64
text           398037 non-null object
useful         398037 non-null int64
user_id        398037 non-null object
dtypes: float64(1), int64(4), object(7)
memory usage: 36.4+ MB


## Clean data and get rating data

#### Select relevant columns in the original dataframe
* `business_id`, `user_id`, `stars`

In [5]:
df_recommender = df[['business_id', 'user_id', 'stars']]
df_recommender.head()

,business_id,user_id,stars
0,kgffcoxT6BQp-gJ-UQ7Czw,ih7Dmu7wZpKVwlBRbakJOQ,1
1,kgffcoxT6BQp-gJ-UQ7Czw,m3WBc9bGxn1q1ikAFq8PaA,1
2,kgffcoxT6BQp-gJ-UQ7Czw,H7bJDtGzhdg1fsmBL4KZWg,2
3,kgffcoxT6BQp-gJ-UQ7Czw,58sXi_0oTgVlM3aUuFYHUA,2
4,0jtRI7hVMpQHpUVtUy4ITw,ZS7V0uC4kVrJR_4Yi3oTHA,5


#### Exclude records without a rating

In [6]:
missing_value_counts = df_recommender['stars'].isnull().sum() # No missing
print(missing_value_counts)

0


There are no missing values in the current dataset.

We still call `dropna()` so that any row with a missing value is dropped. If new data with missing values is used later, those rows will be removed.
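One detail worth keeping in mind: `DataFrame.dropna()` returns a new DataFrame and leaves the original untouched, so its result has to be assigned to take effect. A minimal sketch on a hypothetical toy frame (the `toy` and `cleaned` names and the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with one missing star value
toy = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2'],
    'user_id': ['u1', 'u2', 'u3'],
    'stars': [5, np.nan, 3],
})

toy.dropna(axis=0)            # result discarded: toy still has 3 rows
cleaned = toy.dropna(axis=0)  # result kept: the row containing NaN is gone

print(len(toy), len(cleaned))
```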

In [7]:
df_recommender = df_recommender.dropna(axis=0) # axis=0: drop every row that contains a missing value
df_recommender.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398037 entries, 0 to 398036
Data columns (total 3 columns):
business_id    398037 non-null object
user_id        398037 non-null object
stars          398037 non-null int64
dtypes: int64(1), object(2)
memory usage: 9.1+ MB


In [8]:
df_recommender.head()

,business_id,user_id,stars
0,kgffcoxT6BQp-gJ-UQ7Czw,ih7Dmu7wZpKVwlBRbakJOQ,1
1,kgffcoxT6BQp-gJ-UQ7Czw,m3WBc9bGxn1q1ikAFq8PaA,1
2,kgffcoxT6BQp-gJ-UQ7Czw,H7bJDtGzhdg1fsmBL4KZWg,2
3,kgffcoxT6BQp-gJ-UQ7Czw,58sXi_0oTgVlM3aUuFYHUA,2
4,0jtRI7hVMpQHpUVtUy4ITw,ZS7V0uC4kVrJR_4Yi3oTHA,5


In [9]:
print(df_recommender['business_id'].nunique())
print(df_recommender['user_id'].nunique())

4621
189936


#### Create utility matrix from records

* Users are rows, items are columns, and ratings are cell values
  * user: `user_id`
  * item: `business_id`
  * rating: `stars`
* Missing ratings are filled with 0.

In [10]:
df_utility = pd.pivot_table(data=df_recommender, index='user_id', columns='business_id', values='stars', fill_value=0)
df_utility.head()

business_id,--9e1ONYQuAa-CB_Rrw7Tw,-1m9o3vGRA8IBPNvNqKLmA,-3zffZUHoY8bQjGfPSoBKQ,-8R_-EkGpUhBk55K9Dd4mg,-9YyInW1wapzdNZrhQJ9dg,-AD5PiuJHgdUcAK-Vxao2A,-ADtl9bLp8wNqYX1k3KuxA,-AGdGGCeTS-njB_8GkUmjQ,-Bf8BQ3yMk8U2f45r2DRKw,-BmqghX1sv7sgsxOIS2yAg,...,znWHLW1pt19HzW1VY6KfCA,zp-K5s3pGTWuuaVBWo6WZA,zpoZ6WyQUYff18-z4ZU1mA,zr42_UsWfaIF-rcp37OpwA,zsQk990PubOHjr1YcLkQFw,zt9RLUIU32fZYOBh2L0NNQ,zttcrQP4MxNS5X5itzStXg,zuwba6QEBIDZT0tJZmNhdQ,zwNC-Ow4eIMan2__bS9-rg,zx_j6OuuHHa2afVoAZuLpA
user_id,,,,,,,,,,,,,,,,,,,,,
---1lKK3aKOuomHnwAkAow,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
---udAKDsn0yQXmzbWQNSw,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
--2bpE5vyR-2hAP7sZZ4lA,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
--2vR0DIsmQ6WfcSzKWigw,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
--3WaS23LcIXtxyFULJHTA,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
df_utility.shape

(189936, 4621)
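
A dense table of 189936 x 4621 holds roughly 878 million cells, while there are at most 398037 ratings to store. As a possible alternative, the sparse matrix could be built straight from the records without the dense pivot. The sketch below is an assumption, not the notebook's method: it uses a hypothetical toy frame `records`, and it averages repeated (user, business) pairs first because `pivot_table` does so by default.

```python
import pandas as pd
from scipy import sparse

# Hypothetical toy records in the same three-column layout as df_recommender
records = pd.DataFrame({
    'business_id': ['b1', 'b1', 'b2', 'b3', 'b3'],
    'user_id':     ['u1', 'u2', 'u2', 'u3', 'u3'],
    'stars':       [5, 1, 4, 2, 4],
})

# pivot_table averages repeated (user, business) pairs by default, so do the same here
ratings = records.groupby(['user_id', 'business_id'], as_index=False)['stars'].mean()

# Map string ids to integer positions; sort=True gives the same ordering as the pivot
user_codes, user_index = pd.factorize(ratings['user_id'], sort=True)
item_codes, item_index = pd.factorize(ratings['business_id'], sort=True)

# Rows are users, columns are businesses, unrated cells are implicit zeros
toy_utility = sparse.csr_matrix(
    (ratings['stars'].to_numpy(dtype=float), (user_codes, item_codes)),
    shape=(len(user_index), len(item_index)),
)
print(toy_utility.shape)
print(toy_utility.toarray())
```

`user_index` and `item_index` play the role of `df_utility.index` and `df_utility.columns` when mapping positions back to ids.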

## Item-item similarity recommender

* First, convert the dataframe to a sparse matrix
* The item is `business_id`, so the item-item similarity matrix is 4621 x 4621
  * The sparse matrix must be transposed before computing the item-item similarity matrix
* Use a neighborhood of the 100 most similar items
* Predict ratings for a user
* Recommend the top 5 items to the user
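
To make these steps concrete, here is a small sketch on a hypothetical 4 x 3 utility matrix, using plain NumPy and a neighborhood of 2 instead of 100. The matrix `R`, the neighborhood size `k`, and the chosen `user` are made up for illustration.

```python
import numpy as np

# Hypothetical toy utility matrix: 4 users (rows) x 3 items (columns), 0 = not rated
R = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [0, 0, 3],
    [5, 0, 1],
], dtype=float)

# Item-item cosine similarity: items are columns, so work on the transpose
items = R.T
norms = np.linalg.norm(items, axis=1, keepdims=True)
sim = (items @ items.T) / (norms @ norms.T)

# Neighborhood of size k: indices of the k most similar items (each item includes itself)
k = 2
neighborhoods = np.argsort(sim, axis=1)[:, ::-1][:, :k]

# Predict ratings for one user as a similarity-weighted average of that user's own ratings
user = 3
rated = np.flatnonzero(R[user])
pred = np.zeros(R.shape[1])
for item in range(R.shape[1]):
    relevant = np.intersect1d(neighborhoods[item], rated)
    weights = sim[item, relevant]
    if weights.sum() > 0:
        pred[item] = R[user, relevant] @ weights / weights.sum()

print(np.round(sim, 3))
print(np.round(pred, 3))
```

Because each prediction is a weighted average of the user's own ratings, a user who has only given 5 stars can only receive predictions of 5, or 0 where no rated item falls in the neighborhood.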

In [12]:
from scipy import sparse
utility_matrix = sparse.csr_matrix(df_utility.values)
print(utility_matrix.shape, utility_matrix.T.shape)
print(utility_matrix.toarray()) # the matrix is sparse, so the printed values are mostly 0 and not very informative

from sklearn.metrics.pairwise import cosine_similarity
item_item_similarity_matrix = cosine_similarity(utility_matrix.T)
print(item_item_similarity_matrix.shape)
print(item_item_similarity_matrix)

(189936, 4621) (4621, 189936)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(4621, 4621)
[[1.         0.0105333  0.01684862 ... 0.         0.0070165  0.        ]
 [0.0105333  1.         0.01565553 ... 0.         0.         0.        ]
 [0.01684862 0.01565553 1.         ... 0.         0.00724205 0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.0070165  0.         0.00724205 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]


In [13]:
neighborhoods = np.argsort(item_item_similarity_matrix, axis=1)[:, -1:-101:-1] # indices of the 100 most similar items, most similar first
print(neighborhoods.shape)
print(neighborhoods)

(4621, 100)
[[   0 2933  828 ...  756 1796  199]
 [   1 3076 2729 ...  604 3332 4168]
 [   2 2933 3914 ... 2861 2528  604]
 ...
 [4618 1085  446 ...  666 1899  101]
 [4619 3683 1170 ... 4261 2130  299]
 [4620 4523 3220 ... 1193 2829 1123]]


In [14]:
random_number = np.random.randint(0, df_utility.shape[0])
print(random_number)
rating_matrix = utility_matrix[random_number] # randomly pick a user
print(rating_matrix)
print(rating_matrix.nonzero())

n_users = utility_matrix.shape[0]
n_items = utility_matrix.shape[1]

items_rated_by_this_user = rating_matrix.nonzero()[1] # column indices of the items this user has already rated
out = np.zeros(n_items)
for item_to_rate in range(n_items):
    # items that are both in this item's neighborhood and already rated by the user
    relevant_items = np.intersect1d(neighborhoods[item_to_rate],
                                    items_rated_by_this_user,
                                    assume_unique=True) # both inputs hold unique indices, which speeds up intersect1d()

    # similarity-weighted average of the user's ratings; 0/0 gives NaN when no neighbor has been rated
    out[item_to_rate] = utility_matrix[random_number, relevant_items] * \
                        item_item_similarity_matrix[item_to_rate, relevant_items] / \
                        item_item_similarity_matrix[item_to_rate, relevant_items].sum()

pred_ratings = np.nan_to_num(out) # NaN entries in out are replaced with 0
print(pred_ratings)
print(pred_ratings.shape)
print(sparse.csr_matrix(pred_ratings)) # the sparse form makes the nonzero values easier to see

74150
  (0, 377)	5
  (0, 1595)	5
(array([0, 0], dtype=int32), array([ 377, 1595], dtype=int32))




[5. 0. 0. ... 0. 0. 0.]
(4621,)
  (0, 0)	5.0
  (0, 20)	5.0
  (0, 51)	5.0
  (0, 70)	5.0
  (0, 77)	5.0
  (0, 87)	5.0
  (0, 90)	5.0
  (0, 104)	5.0
  (0, 121)	5.0
  (0, 129)	5.0
  (0, 178)	5.0
  (0, 192)	5.0
  (0, 203)	5.0
  (0, 212)	5.0
  (0, 217)	5.0
  (0, 235)	5.0
  (0, 268)	5.0
  (0, 285)	5.0
  (0, 289)	5.0
  (0, 310)	5.0
  (0, 316)	5.0
  (0, 318)	5.0
  (0, 322)	5.0
  (0, 341)	5.0
  (0, 359)	5.0
  :	:
  (0, 4290)	5.0
  (0, 4297)	5.0
  (0, 4304)	5.0
  (0, 4329)	5.0
  (0, 4333)	5.0
  (0, 4337)	5.0
  (0, 4364)	5.0
  (0, 4366)	5.0
  (0, 4372)	5.0
  (0, 4381)	5.0
  (0, 4395)	5.0
  (0, 4396)	5.0
  (0, 4433)	5.0
  (0, 4438)	5.0
  (0, 4441)	5.0
  (0, 4454)	5.0
  (0, 4459)	5.0
  (0, 4462)	5.0
  (0, 4488)	5.0
  (0, 4525)	5.0
  (0, 4569)	5.000000000000001
  (0, 4571)	5.0
  (0, 4578)	5.0
  (0, 4582)	5.0
  (0, 4616)	5.0


Recommend the top five items from the item-item based recommender.

In [15]:
n = 5
item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))[::-1]
items_rated_by_this_user = utility_matrix[random_number].nonzero()[1]

# We want to exclude the items that have been rated by user
unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                if item not in items_rated_by_this_user]

recommends = unrated_items_by_pred_rating[:n] # column indices of the top five items
print(recommends)

[2306, 2030, 696, 4237, 2867]


In [18]:
print(pred_ratings[recommends]) # predicted ratings of the top five items

[5. 5. 5. 5. 5.]
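
When many predictions are tied at the same value, as they are here, which five items surface depends on how the sort orders ties, and NumPy's default sort does not guarantee that order. A possible variant, sketched below on hypothetical values (`toy_pred` and `already_rated` are made up), masks the rated items and uses a stable sort so the result is reproducible. It also avoids scanning the rated-items array once per item.

```python
import numpy as np

# Hypothetical predicted ratings for 6 items and the items the user already rated
toy_pred = np.array([5.0, 0.0, 4.2, 5.0, 3.1, 4.8])
already_rated = np.array([0, 4])

# Push rated items to the bottom so they can never be recommended
scores = toy_pred.copy()
scores[already_rated] = -np.inf

# A stable sort on the negated scores lists ties in ascending index order
n = 3
top_n = np.argsort(-scores, kind='stable')[:n]
print(top_n)
```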


In [21]:
recommended_business_ids = df_utility.columns[recommends] # business_id of each of the top five items
print(recommended_business_ids)

Index(['UT6L3b7Zll_nvRidijiDSA', 'R0ukZ5FgY_2Pn96Go5mftA',
       '8hDKFHyRrILlXp5DfTlSGw', 'u_8cVZyxh0J468zEZUjNDQ',
       'b3vRI8yXNK34hgC0Wd4Iag'],
      dtype='object', name='business_id')


Get the name, categories, and average stars of the top five recommended restaurants.

In [76]:
# df_recommended_restaurant = df[df['business_id'].isin(recommended_business_ids)]
df_recommended_restaurant = df[df['business_id'].isin(recommended_business_ids)][['name', 'categories', 'avg_stars']]
df_recommended_restaurant.drop_duplicates().reset_index(drop=True)

,name,categories,avg_stars
0,Chin Chin,"Sushi Bars, Chinese, Asian Fusion, Restaurants",3.5
1,Steak & Spud Factory,"Restaurants, Fast Food",3.5
2,Pin-Up Pizza,"Pizza, Restaurants",3.5
3,Five50 Pizza Bar,"Pizza, Restaurants",4.0
4,Johnny Rockets,"Restaurants, Burgers, American (Traditional), ...",3.0


## Matrix Factorization recommender

#### Use NMF and UVD and compare the results

## Other recommenders

* Popularity based
* Content based
* Hybrid