# Yelp Data Challenge - Restaurant Recommender

BitTiger DS501

Zhenning Tan, Jun 2017

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use("ggplot")

In [2]:
df = pd.read_csv('yelp_dataset_challenge_round9/last_2_years_restaurant_reviews.csv')

In [4]:
df.shape

(111548, 13)

In [5]:
df.head(2)

Unnamed: 0,business_id,name,categories,ave_stars,cool,date,funny,review_id,stars,text,type,useful,user_id
0,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-07-28,0,iHP55csZHjPGqOMwIo70qQ,5,Exceptional...exceptional steakhouse!! Ordered...,review,0,TU5j2S_Ub__ojLOpD_UepQ
1,--9e1ONYQuAa-CB_Rrw7Tw,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",4.0,0,2016-07-17,0,GWI2xpBBwxK9-w1etLz51A,5,In a city with overrated 'celebrity' steakhous...,review,0,OC_WdUmY2fK-c1SD4JqSsw


## 1. Clean data and get rating data 

#### Select relevant columns in the original dataframe

In [6]:
# Get business_id, user_id, stars for recommender
col_select = ["business_id", "user_id", "stars"]
df_select = df[col_select]
df_select.head(2)

Unnamed: 0,business_id,user_id,stars
0,--9e1ONYQuAa-CB_Rrw7Tw,TU5j2S_Ub__ojLOpD_UepQ,5
1,--9e1ONYQuAa-CB_Rrw7Tw,OC_WdUmY2fK-c1SD4JqSsw,5


#### Many users have written only a few reviews; exclude these users from the item-item similarity recommender

**Q**: How do we recommend to these users anyway?
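
One possible answer, offered as an assumption rather than as the method used in this notebook, is a popularity-based fallback: recommend the best-rated businesses that have enough ratings to be trusted. The sketch below uses made-up toy data; `toy_ratings`, `popular_items` and the `min_ratings` threshold are hypothetical names and values.

```python
import pandas as pd

# Toy ratings with the same column names as df_select (values are made up).
toy_ratings = pd.DataFrame({
    "business_id": ["b1", "b1", "b1", "b2", "b2", "b3"],
    "user_id":     ["u1", "u2", "u3", "u1", "u4", "u5"],
    "stars":       [5, 4, 5, 3, 2, 5],
})

def popular_items(ratings, n=2, min_ratings=2):
    """Rank businesses by mean stars, ignoring those with too few ratings."""
    stats = ratings.groupby("business_id")["stars"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_ratings]
    return list(stats.sort_values("mean", ascending=False).index[:n])

# The same list would be served to every user without enough reviews.
fallback_recs = popular_items(toy_ratings)
print(fallback_recs)
```

The `min_ratings` cutoff keeps a business with a single 5-star review (here `b3`) from outranking businesses with more evidence.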

In [7]:
# review count for each user
n_review_per_user = df_select["user_id"].value_counts()
n_review_per_user.shape #total number of users in this df

(64047L,)

In [8]:
# set minimum review number for user
min_review = 1

# fraction of users with at most min_review reviews
1.0*(n_review_per_user <= min_review).sum()/n_review_per_user.shape[0] 

0.70936968164004555
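
The value above means that roughly 71% of the users in this data wrote at most one review. As a rough sketch on made-up toy data, the same calculation can be repeated for several thresholds to see how many users each choice of `min_review` would exclude. The names `toy_user_ids`, `reviews_per_user` and `share_dropped` are hypothetical.

```python
import pandas as pd

# Toy user_id column standing in for df_select["user_id"] (values are made up).
toy_user_ids = pd.Series(["u1", "u1", "u1", "u2", "u3", "u3", "u4", "u5"])
reviews_per_user = toy_user_ids.value_counts()

# Share of users who would be dropped for several candidate thresholds.
share_dropped = {
    threshold: float((reviews_per_user <= threshold).mean())
    for threshold in (1, 2, 3)
}
print(share_dropped)
```

A higher threshold gives the item-item recommender more ratings per user to work with, at the cost of covering fewer users.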

In [9]:
# keep users with more than min_review reviews
user_filter = n_review_per_user.index[n_review_per_user > min_review]
user_to_keep = df_select["user_id"].isin(user_filter)

In [10]:
df_select_min_review = df_select[user_to_keep]
df_select_min_review.shape

(66115, 3)

In [11]:
df_select_min_review.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66115 entries, 0 to 111546
Data columns (total 3 columns):
business_id    66115 non-null object
user_id        66115 non-null object
stars          66115 non-null int64
dtypes: int64(1), object(2)
memory usage: 2.0+ MB


In [12]:
df_select_min_review.head(2)

Unnamed: 0,business_id,user_id,stars
0,--9e1ONYQuAa-CB_Rrw7Tw,TU5j2S_Ub__ojLOpD_UepQ,5
2,--9e1ONYQuAa-CB_Rrw7Tw,A6zYXofgFj6UhonFPrEDHw,3


#### Create utility matrix from records

In [13]:
from scipy import sparse

n_users = df_select_min_review["user_id"].value_counts().shape[0]
n_business = df_select_min_review["business_id"].value_counts().shape[0]

utility_mat = sparse.lil_matrix((n_users, n_business))
utility_mat

<18614x3508 sparse matrix of type '<type 'numpy.float64'>'
	with 0 stored elements in LInked List format>

In [14]:
business_id_dict = {}
user_id_dict = {}

business_ind = 0
user_ind = 0

for _, row in df_select_min_review.iterrows():
   
    if row["business_id"] not in business_id_dict:
        business_id_dict[row["business_id"]] = business_ind
        business_ind +=1
    if row["user_id"] not in user_id_dict:
        user_id_dict[row["user_id"]] = user_ind
        user_ind +=1
    
    utility_mat[user_id_dict[row["user_id"]], business_id_dict[row["business_id"]]] = row["stars"]


In [15]:
utility_mat 

<18614x3508 sparse matrix of type '<type 'numpy.float64'>'
	with 66115 stored elements in LInked List format>
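
The row-by-row loop above can be slow on large data. A possible vectorized alternative, sketched here on made-up toy data, uses `pd.factorize` to assign integer indices and `scipy.sparse.coo_matrix` to build the matrix in one step. The names `toy`, `user_index`, `business_index` and `toy_utility` are hypothetical.

```python
import pandas as pd
from scipy import sparse

# Toy ratings with the same columns as df_select_min_review (values are made up).
toy = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3", "b3"],
    "user_id":     ["u1", "u2", "u1", "u2", "u3"],
    "stars":       [5, 3, 4, 2, 5],
})

# factorize assigns integer codes in order of first appearance,
# like the business_ind / user_ind counters in the loop.
user_codes, user_index = pd.factorize(toy["user_id"])
business_codes, business_index = pd.factorize(toy["business_id"])

# Build the users x businesses matrix from (value, (row, col)) triples.
toy_utility = sparse.coo_matrix(
    (toy["stars"].values.astype(float), (user_codes, business_codes)),
    shape=(len(user_index), len(business_index)),
).tolil()

print(toy_utility.toarray())
```

One difference to keep in mind: if the same user rated the same business more than once, `coo_matrix` sums the duplicates when converting, whereas the loop keeps the last value. In this notebook the matrix stores 66115 elements for 66115 rows, so there are no such duplicates.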

## 2. Item-Item similarity recommender

### Let's reuse the ItemItemRecommender class from the previous exercise

Hint: we need to make modifications to accommodate the dense NumPy array

In [16]:
from sklearn.metrics.pairwise import cosine_similarity
from time import time


class ItemItemRecommender(object):

    def __init__(self, neighborhood_size):
        self.neighborhood_size = neighborhood_size

    def fit(self, ratings_mat):
        self.ratings_mat = ratings_mat
        self.n_users = ratings_mat.shape[0]
        self.n_items = ratings_mat.shape[1]
        self.item_sim_mat = cosine_similarity(self.ratings_mat.T)
        self._set_neighborhoods()

    def _set_neighborhoods(self):
        least_to_most_sim_indexes = np.argsort(self.item_sim_mat, 1)
        self.neighborhoods = least_to_most_sim_indexes[:, -self.neighborhood_size:]

    def pred_one_user(self, user_id, report_run_time=False):
        start_time = time()
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1] # 0 is row index, 1 is col index
        # Just initializing so we have somewhere to put rating preds
        out = np.zeros(self.n_items)
        for item_to_rate in range(self.n_items):
            relevant_items = np.intersect1d(self.neighborhoods[item_to_rate],
                                            items_rated_by_this_user,
                                            assume_unique=True)  # assume_unique speeds up intersection op
            out[item_to_rate] = self.ratings_mat[user_id, relevant_items] * \
                self.item_sim_mat[item_to_rate, relevant_items] / \
                self.item_sim_mat[item_to_rate, relevant_items].sum()
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        cleaned_out = np.nan_to_num(out)
        return cleaned_out

    def pred_all_users(self, report_run_time=False):
        start_time = time()
        all_ratings = [
            self.pred_one_user(user_id) for user_id in range(self.n_users)]
        if report_run_time:
            print("Execution time: %f seconds" % (time()-start_time))
        return np.array(all_ratings)

    def top_n_recs(self, user_id, n):
        pred_ratings = self.pred_one_user(user_id)
        item_index_sorted_by_pred_rating = list(np.argsort(pred_ratings))
        items_rated_by_this_user = self.ratings_mat[user_id].nonzero()[1]
        unrated_items_by_pred_rating = [item for item in item_index_sorted_by_pred_rating
                                        if item not in items_rated_by_this_user]
        return unrated_items_by_pred_rating[-n:]
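
For each item, `pred_one_user` computes a similarity-weighted average of the user's ratings over the neighbors of that item that the user has rated. A small worked example with made-up numbers (`neighbor_sims` and `neighbor_ratings` are hypothetical names) shows the arithmetic for a single item:

```python
import numpy as np

# Toy inputs (made up): similarity of one target item to the neighbors
# the user has already rated, and the user's ratings of those neighbors.
neighbor_sims = np.array([0.8, 0.5, 0.2])
neighbor_ratings = np.array([5.0, 3.0, 4.0])

# Similarity-weighted average: sum(sim * rating) / sum(sim).
predicted = neighbor_ratings.dot(neighbor_sims) / neighbor_sims.sum()
print(round(predicted, 3))
```

When a user has rated none of an item's neighbors, the denominator is zero and the result is NaN, which is why the class passes its output through `np.nan_to_num`.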

In [17]:
my_rec_engine = ItemItemRecommender(neighborhood_size=75)
my_rec_engine.fit(utility_mat)

# Show top 20 recommendations for user #1
top_recom = my_rec_engine.top_n_recs(1, 20)

In [18]:
print top_recom

[1555, 1836, 2331, 1437, 3199, 679, 1408, 1432, 1978, 2438, 1980, 1983, 1984, 1986, 1120, 2804, 1820, 1998, 517, 769]


In [32]:
# get business id from index
def get_business_id(index):
    for key, value in business_id_dict.iteritems():
        if value == index:
            return key

print "Top recommended business id:"
recom_business = [get_business_id(ind) for ind in top_recom]
print recom_business

Top recommended business id:
['RESDUcs7fIiihp38-d6_6g', 'VzUo-RURV3VnfNItAYM8yg', 'dupc1Q5bl1gwTheS1ZqEig', 'PAilv1TpsWMsLTZk5d3guw', 'u8-WDsLXAl0dXQW_wqWrDg', 'BZSzoBFhkXBUQTx4Cgl5Aw', 'OaM2Bjeo2Ftt84ruTrzPNQ', 'P6O50VeFlBIJpP0QPYsXbQ', 'YNDxeeRUARbd8GRnscJSvg', 'fvWMTH2uMQXIvWSFf5wi4A', 'YNUdy-W_ZFO9B2SZUKRrPw', 'YQ--LJ7pvjiDSqNv0TuKTQ', 'YRiQtFNteLUUEiGkdQ23vg', 'YSHLvvIOg5w7ON396yNmVA', 'JIG5xGUdWbaaPeo3MDonYg', 'mUdalYTuAtnZm2K9zoH27Q', 'Vg1C_1eqwIwkZLIXGMTW3g', 'YZGSNhgTS6YeyUYoivD-Ww', '8b5ll2kjXfjgFIqWsjkr8Q', 'DA-ddRqcReCe_DcXKicvsQ']


In [24]:
def get_user_id(index):
    for key, value in user_id_dict.iteritems():
        if value == index:
            return key
user_id_1 = get_user_id(1)
print "user id for user#1:", user_id_1

user id for user#1: A6zYXofgFj6UhonFPrEDHw
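
The lookup functions above scan the whole dictionary for every index. As an optional sketch, the mapping can be inverted once so that each lookup is a single dictionary access. The toy dictionary and the names `index_to_business_id` and `toy_top_recom` are hypothetical.

```python
# Toy mapping shaped like business_id_dict: business id -> matrix column index.
toy_business_id_dict = {"b_first": 0, "b_second": 1, "b_third": 2}

# Invert once; this is safe because every index is assigned to exactly one id.
index_to_business_id = {index: bid for bid, index in toy_business_id_dict.items()}

toy_top_recom = [2, 0]
toy_recom_business = [index_to_business_id[i] for i in toy_top_recom]
print(toy_recom_business)
```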


#### Compare recommended restaurants with the restaurants the user rated

In [21]:
business_df = pd.read_csv('yelp_dataset_challenge_round9/selected_business.csv')

In [43]:
business_df.head(2)

Unnamed: 0,business_id,name,categories,ave_stars
0,saWZO6hB4B8P-mIzS1--Xw,Kabob Palace,"[Persian/Iranian, Restaurants, Ethnic Food, Fo...",2.5
1,hMh9XOwNQcu31NAOCqhAEw,Taste of India,"[Restaurants, Vegetarian, Indian]",3.5


In [29]:
# user#1 rated restaurants
review_user1 = df[df["user_id"] == user_id_1][["name", "categories", "stars"]]
review_user1

Unnamed: 0,name,categories,stars
2,Delmonico Steakhouse,"[Steakhouses, Restaurants, Cajun/Creole]",3
78523,Yardbird Southern Table & Bar,"[Restaurants, Southern, American (New)]",5
95251,StripSteak,"[Steakhouses, American (New), Cheesesteaks, Re...",5
101202,Paris Baguette,"[Coffee & Tea, Bakeries, Cafes, Food, Restaura...",5


In [42]:
# recommended restaurants for user#1
recom_filter = business_df["business_id"].apply(lambda x: x in recom_business)
business_df[recom_filter][["name", "categories"]]

Unnamed: 0,name,categories
63,Boba Hut,"[Restaurants, Chinese, Hawaiian, Food, Coffee ..."
175,Bacchanal Buffet,"[Buffets, Sandwiches, Food, Restaurants, Break..."
349,L'Atelier de Joël Robuchon,"[Restaurants, French]"
824,Coco's Bakery Restaurant,"[American (Traditional), Food, Restaurants, Ba..."
1546,Panevino Italian Grille,"[Bars, Restaurants, Food, Nightlife, Italian, ..."
2450,Very Venice,[Restaurants]
2544,Superbook Deli,"[Sandwiches, Restaurants, Delis, American (New)]"
2761,Grimaldi's Pizzeria,"[Pizza, Restaurants]"
2872,Guy Savoy,"[Restaurants, French]"
2893,Seabreeze Cafe,"[Restaurants, Breakfast & Brunch, American (Tr..."


Overall, the user gave 5 stars to a Southern American restaurant, a steakhouse and a bakery (Paris Baguette). The recommendations include places similar to the rated restaurants, such as an American bar and grill and Italian and French restaurants. They also include some fast food places, sandwich shops, and Asian and Mexican food.

## 3. Matrix Factorization recommender

Take a look at the GraphLab Create examples.

In [44]:
import graphlab 

In [45]:
sf = graphlab.SFrame(df_select_min_review)

In [46]:
# use 10 latent features
gl_recom_10 = graphlab.recommender.factorization_recommender.create(sf,
                                                                user_id = "user_id",
                                                                item_id = "business_id",
                                                                target ="stars",
                                                                num_factors = 10,
                                                                random_seed = 100)

In [47]:
# recommended restaurants for user#1
gl_recom_10.recommend(users = ["A6zYXofgFj6UhonFPrEDHw"], k=10)

user_id,business_id,score,rank
A6zYXofgFj6UhonFPrEDHw,7wMCJ9NqL9eBEX4WdJWuIA,6.44823237577,1
A6zYXofgFj6UhonFPrEDHw,dYZqJ2S1ND9KghLIKJg71g,6.03564807096,2
A6zYXofgFj6UhonFPrEDHw,bgx6gYdktqEoQwBdo5lRbA,6.03471013227,3
A6zYXofgFj6UhonFPrEDHw,893VryJbZcCm5V9xon_aLA,6.012339317,4
A6zYXofgFj6UhonFPrEDHw,RwMLuOkImBIqqYj4SSKSPg,5.98470076242,5
A6zYXofgFj6UhonFPrEDHw,UaL6yRGSv9fYCyn2DJLu8w,5.97460647741,6
A6zYXofgFj6UhonFPrEDHw,0ps5B5C6oNaRsGyUJzFCuw,5.96976801077,7
A6zYXofgFj6UhonFPrEDHw,zOYqvrpenMaAz0rc7n41Aw,5.88827332178,8
A6zYXofgFj6UhonFPrEDHw,vRFYqRz5F41ici5IOO2_pg,5.85957118192,9
A6zYXofgFj6UhonFPrEDHw,Ozy5LlmU0k-mVs7RXNisZQ,5.84935923734,10


In [48]:
# use 100 latent features
gl_recom_100 = graphlab.recommender.factorization_recommender.create(sf,
                                                                user_id = "user_id",
                                                                item_id = "business_id",
                                                                target ="stars",
                                                                num_factors = 100,
                                                                random_seed = 100)

In [50]:
gl_recom_100.recommend(users = ["A6zYXofgFj6UhonFPrEDHw"], k=10)

user_id,business_id,score,rank
A6zYXofgFj6UhonFPrEDHw,dtqT51H8Q8mIvrLylVuiZg,6.59740635076,1
A6zYXofgFj6UhonFPrEDHw,7wMCJ9NqL9eBEX4WdJWuIA,6.40077205816,2
A6zYXofgFj6UhonFPrEDHw,dYZqJ2S1ND9KghLIKJg71g,6.39601560751,3
A6zYXofgFj6UhonFPrEDHw,893VryJbZcCm5V9xon_aLA,6.3040556542,4
A6zYXofgFj6UhonFPrEDHw,UaL6yRGSv9fYCyn2DJLu8w,6.15378900686,5
A6zYXofgFj6UhonFPrEDHw,0ps5B5C6oNaRsGyUJzFCuw,6.1432642571,6
A6zYXofgFj6UhonFPrEDHw,OQcvO5P3gH0cuJ-bPXwfQQ,6.09730728784,7
A6zYXofgFj6UhonFPrEDHw,l1GJnB9TJgGgEeI4at1M0A,6.02048107305,8
A6zYXofgFj6UhonFPrEDHw,bgx6gYdktqEoQwBdo5lRbA,5.98309298673,9
A6zYXofgFj6UhonFPrEDHw,Ozy5LlmU0k-mVs7RXNisZQ,5.94025190988,10


The models with 10 and 100 latent features give similar results: 7 of the top 10 recommended businesses are the same in both lists.
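
One simple way to quantify how close the two lists are is to count the business ids they share. The sketch below copies the ids from the two outputs above; the variable names are hypothetical.

```python
# Top-10 business ids copied from the 10-factor and 100-factor outputs.
top10_factors_10 = [
    "7wMCJ9NqL9eBEX4WdJWuIA", "dYZqJ2S1ND9KghLIKJg71g", "bgx6gYdktqEoQwBdo5lRbA",
    "893VryJbZcCm5V9xon_aLA", "RwMLuOkImBIqqYj4SSKSPg", "UaL6yRGSv9fYCyn2DJLu8w",
    "0ps5B5C6oNaRsGyUJzFCuw", "zOYqvrpenMaAz0rc7n41Aw", "vRFYqRz5F41ici5IOO2_pg",
    "Ozy5LlmU0k-mVs7RXNisZQ",
]
top10_factors_100 = [
    "dtqT51H8Q8mIvrLylVuiZg", "7wMCJ9NqL9eBEX4WdJWuIA", "dYZqJ2S1ND9KghLIKJg71g",
    "893VryJbZcCm5V9xon_aLA", "UaL6yRGSv9fYCyn2DJLu8w", "0ps5B5C6oNaRsGyUJzFCuw",
    "OQcvO5P3gH0cuJ-bPXwfQQ", "l1GJnB9TJgGgEeI4at1M0A", "bgx6gYdktqEoQwBdo5lRbA",
    "Ozy5LlmU0k-mVs7RXNisZQ",
]

# Businesses that appear in both top-10 lists, regardless of rank.
shared = set(top10_factors_10) & set(top10_factors_100)
print(len(shared))
```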

#### Compare recommended restaurants with the restaurants the user rated

In [56]:
gl_recom_business = list(gl_recom_10.recommend(users = ["A6zYXofgFj6UhonFPrEDHw"], k=20)["business_id"])

In [57]:
# recommended restaurants for user#1
recom_filter = business_df["business_id"].apply(lambda x: x in gl_recom_business)
business_df[recom_filter][["name", "categories"]]

Unnamed: 0,name,categories
169,Serrano's Mexican Restaurant,"[Restaurants, Mexican]"
252,Forte European Tapas Bar and Bistro,"[Ukrainian, Tapas Bars, Restaurants, Spanish]"
958,Those Guys Pies,"[Restaurants, Cheesesteaks, American (New), Sa..."
1785,Tacos El Gordo,"[Mexican, Restaurants]"
1883,La Papaya,"[Food, Juice Bars & Smoothies, Delis, Restaura..."
1952,Taco San Francisco,"[Mexican, Food, Food Trucks, Restaurants]"
2139,Outback Steakhouse,"[Restaurants, Steakhouses]"
2436,Port of Subs,"[Restaurants, Sandwiches, Delis]"
2438,Subway,"[Fast Food, Sandwiches, Restaurants]"
2533,Tacos El Burrito Loco,"[Restaurants, Mexican]"


The recommended items differ between the item-item similarity recommender and the GraphLab matrix factorization recommender. In the item-item recommender, similarity is measured by cosine similarity, which may not reflect the true similarity between restaurants; a different similarity metric could give different results.
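
As an illustration of how the choice of metric can change an item's neighbors, the following sketch compares cosine similarity with Jaccard similarity (the overlap between the sets of users who rated each item) on a made-up ratings matrix. Jaccard is used here only as one example of an alternative metric; the data and helper names are hypothetical.

```python
import numpy as np

# Toy utility matrix (made up): rows are users, columns are items A, B, C.
ratings = np.array([[5, 1, 5],
                    [5, 1, 0],
                    [5, 1, 0],
                    [0, 5, 0]], dtype=float)

def cosine_sim(mat):
    """Cosine similarity between item columns, based on rating values."""
    norms = np.linalg.norm(mat, axis=0)
    return mat.T.dot(mat) / np.outer(norms, norms)

def jaccard_sim(mat):
    """Jaccard similarity between item columns, based on who rated them."""
    rated = (mat > 0).astype(float)
    both = rated.T.dot(rated)
    counts = rated.sum(axis=0)
    either = counts[:, None] + counts[None, :] - both
    return both / either

def nearest_other_item(sim, item):
    """Index of the most similar item, excluding the item itself."""
    row = sim[item].copy()
    row[item] = -np.inf
    return int(np.argmax(row))

cos_neighbor = nearest_other_item(cosine_sim(ratings), 0)
jac_neighbor = nearest_other_item(jaccard_sim(ratings), 0)
print(cos_neighbor, jac_neighbor)
```

In this toy case cosine similarity picks item C as the nearest neighbor of item A, because the one shared rating has the same value, while Jaccard picks item B, because more of the same users rated both.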

The matrix factorization model decomposes the original utility matrix into user and item matrices of latent features. The recommendations with 10 and 100 latent features are not very different, which suggests that a small number of latent features already captures most of the structure in the utility matrix.
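
To make the decomposition idea concrete, here is a minimal sketch of matrix factorization trained by stochastic gradient descent on toy data. It is an illustration under stated assumptions, not GraphLab Create's implementation: the functions `factorize` and `rmse`, the toy triples and the hyperparameters are all hypothetical, and the sketch omits the bias terms that full recommenders usually include.

```python
import numpy as np

# Toy (user index, item index, stars) triples; the values are made up.
toy_triples = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

def factorize(triples, n_users, n_items, num_factors=2,
              n_epochs=500, learning_rate=0.05, reg=0.02, seed=0):
    """Fit user and item latent factors with plain SGD on observed ratings."""
    rng = np.random.RandomState(seed)
    user_factors = 0.1 * rng.randn(n_users, num_factors)
    item_factors = 0.1 * rng.randn(n_items, num_factors)
    for _ in range(n_epochs):
        for u, i, r in triples:
            p_u = user_factors[u].copy()
            q_i = item_factors[i].copy()
            err = r - p_u.dot(q_i)  # error on this observed rating
            # Gradient step on squared error with L2 regularization.
            user_factors[u] += learning_rate * (err * q_i - reg * p_u)
            item_factors[i] += learning_rate * (err * p_u - reg * q_i)
    return user_factors, item_factors

def rmse(triples, user_factors, item_factors):
    """Root mean squared error over the observed ratings only."""
    errors = [r - user_factors[u].dot(item_factors[i]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errors))))

user_factors, item_factors = factorize(toy_triples, n_users=3, n_items=3)
train_rmse = rmse(toy_triples, user_factors, item_factors)

# Baseline for comparison: error when every rating is predicted as zero.
zero_rmse = float(np.sqrt(np.mean([r ** 2 for _, _, r in toy_triples])))

# Predicted scores for every user-item pair, including unrated ones.
predicted_full = user_factors.dot(item_factors.T)
print("train RMSE: %.3f, zero-prediction RMSE: %.3f" % (train_rmse, zero_rmse))
```

Here `num_factors` plays the same role as the `num_factors` argument passed to the GraphLab model: it sets the number of latent features learned per user and per item.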

Overall, the restaurants recommended by both recommenders are quite similar in the types of food they serve, such as sushi, sandwiches, fast food and tacos.

## 4. Other recommenders (optional)

What are other ways you can build a better recommender?

* Other features (have you noticed there are other features in the Yelp dataset, e.g. tips, etc.?)
* Popularity-based
* Content-based
* Hybrid

To explore other recommenders, e.g. popularity-based and content-based, we need to construct features for each item. These features can include restaurant type, location, price level, stars, etc. After constructing the item profiles, we can calculate item similarity based on these features. With that, we can recommend similar restaurants to users based on their preferences and ratings.
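
The content-based idea can be sketched as follows, assuming item profiles are built only from category lists. The restaurant names, categories and the helper `most_similar` are made up for illustration; a real profile could add location, price level and average stars as extra columns.

```python
import numpy as np

# Hypothetical item profiles built from category lists (made-up toy data).
toy_categories = {
    "Steak Place": ["Steakhouses", "Restaurants"],
    "Grill House": ["Steakhouses", "American (New)", "Restaurants"],
    "Taco Stand":  ["Mexican", "Restaurants"],
}
names = sorted(toy_categories)
vocab = sorted({c for cats in toy_categories.values() for c in cats})

# One-hot encode categories: one row per restaurant, one column per category.
profiles = np.array([[1.0 if c in toy_categories[n] else 0.0 for c in vocab]
                     for n in names])

# Cosine similarity between item profiles.
unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
item_sim = unit.dot(unit.T)

def most_similar(name):
    """Return the restaurant whose profile is closest to the given one."""
    idx = names.index(name)
    scores = item_sim[idx].copy()
    scores[idx] = -np.inf  # exclude the restaurant itself
    return names[int(np.argmax(scores))]

print(most_similar("Steak Place"))
```

Because this similarity depends only on item features, it can also score restaurants that have few or no ratings.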