These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

# Making Recommendations Based on Correlation

In [1]:
import numpy as np
import pandas as pd


In [2]:
# rating_final.csv
url = 'https://drive.google.com/file/d/1ptu4AlEXO4qQ8GytxKHoeuS1y4l_zWkC/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
frame = pd.read_csv(path)

# chefmozcuisine.csv
url = 'https://drive.google.com/file/d/1S0_EGSRERIkSKW4D8xHPGZMqvlhuUzp1/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cuisine = pd.read_csv(path)

# 'geoplaces2.csv'
url = 'https://drive.google.com/file/d/1ee3ib7LqGsMUksY68SD9yBItRvTFELxo/view?usp=sharing' 
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
geodata = pd.read_csv(path, encoding = 'CP1252') # change encoding to 'mbcs' in Windows

### Preparing Data For Correlation

We will look for restaurants that are similar to the most popular restaurant from the last notebook "Tortas Locas Hipocampo". "Similarity" will be defined by how well other places correlate with "Tortas Locas" in the user-item matrix. In this matrix, we have all the users in the rows and all the restaurants in the columns. It has many NaNs because most of the time users have not visited many restaurants —we call this a sparse matrix.

In [14]:
frame.head(10)

Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2
5,U1068,132740,0,0,0
6,U1068,132663,1,1,1
7,U1068,132732,0,0,0
8,U1068,132630,1,1,1
9,U1067,132584,2,2,2


In [5]:
user_item_df = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
user_item_df.head(10)

placeID,132560,132561,132564,132572,132583,132584,132594,132608,132609,132613,...,135080,135081,135082,135085,135086,135088,135104,135106,135108,135109
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
U1001,,,,,,,,,,,...,,,,0.0,,,,,,
U1002,,,,,,,,,,,...,,,,1.0,,,,1.0,,
U1003,,,,,,,,,,,...,2.0,,,,,,,,,
U1004,,,,,,,,,,,...,,,,,,,,2.0,,
U1005,,,,,,,,,,,...,,,,,,,,,,
U1006,,,,1.0,,,,,,,...,,,,,,,,,,
U1007,,,,1.0,,,,,,,...,,,,1.0,0.0,,,,1.0,
U1008,,,,,,,,,,,...,,,,,,,,,1.0,
U1009,,,,,,,,,,,...,,,,,,,,,,
U1010,,,,,,,,,,,...,,,,,,,,,,


Let's look at the users that have visited "Tortas Locas":

In [15]:
tortas_id = 132572

Tortas_ratings = user_item_df.loc[:,tortas_id]
Tortas_ratings[Tortas_ratings>=0] # exclude NaNs

userID
U1006    1.0
U1007    1.0
U1013    1.0
U1033    1.0
U1046    1.0
U1055    2.0
U1061    1.0
U1073    0.0
U1083    2.0
U1090    1.0
U1091    2.0
U1092    1.0
U1108    1.0
U1112    0.0
U1134    0.0
Name: 132572, dtype: float64

## Evaluating Similarity Based on Correlation

Now we will look at how well other restaurants correlate with Tortas Locas. A strong positive correlation between two restaurants indicates that users who liked one restaruant also liked the other. A negative correlation would mean that users who liked one restaurant did not like the other. So, we will look for strong, positive correlations to find similar restaurants.

In [16]:
# we get warnings because computing the pearson correlation coefficient with NaNs, but the results are still ok
similar_to_Tortas = user_item_df.corrwith(Tortas_ratings)
similar_to_Tortas

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


placeID
132560    NaN
132561    NaN
132564    NaN
132572    1.0
132583    NaN
         ... 
135088    NaN
135104    NaN
135106    NaN
135108    NaN
135109    NaN
Length: 130, dtype: float64

Many restuarants get a NaN, because there are no users that went to both that restaurant _and_ Tortas Locas. But some of them give us a correlation score. Let's drop NaNs and look at the valid results:

In [17]:
corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head(12)

Unnamed: 0_level_0,PearsonR
placeID,Unnamed: 1_level_1
132572,1.0
132723,-0.57735
132825,0.0
132834,0.5
132862,0.866025
132875,-1.0
132921,0.612372
132951,0.870388
132954,0.912871
135025,0.333333


Some correlations are a perfect 1. It is possible that this is because very few users went to both that restaurant and "Tortas Locas" (also because there are very few rating options, only 0, 1 and 2). 

In [18]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].count()).rename(columns={"rating": "rating_count"})
rating

Unnamed: 0_level_0,rating_count
placeID,Unnamed: 1_level_1
132560,4
132561,4
132564,4
132572,15
132583,4
...,...
135088,6
135104,7
135106,10
135108,11


In [None]:
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
Tortas_corr_summary.drop(tortas_id, inplace=True) # drop Tortas Locas itself
Tortas_corr_summary

Let's filter out restaurants with a rating count below 10.

Then, take the top 10 restaurants in terms of similarity to Tortas:

In [None]:
top10 = Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
top10

In [None]:
places =  geodata[['placeID', 'name']]

In [None]:
top10 = top10.merge(places, left_index=True, right_on="placeID")
top10

Let's look at the cuisine type (some restaurants do not have a cuisine type... but for the ones that do, here it is):

In [None]:
top10.merge(cuisine)

## Challenge:

Create a function that takes as input a restaurant id and a number (n), and outputs the names of the top n most similar restuarants to the inputed one.

You can assume that the user-item matrix (user_item_df) is already created.

In [13]:
rating = pd.DataFrame(frame.groupby('placeID')['rating'].count()).rename(columns={"rating": "rating_count"})

restaurant_id = 135085
n = 5


def top_similar_rest(restaurant_id, n):
    Tortas_ratings = user_item_df.loc[:,restaurant_id]
    similar_to_Tortas = user_item_df.corrwith(Tortas_ratings)
    corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
    corr_Tortas.dropna(inplace=True)
    Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
    Tortas_corr_summary.drop(tortas_id, inplace=True) 
    top = Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(n)
    places =  geodata[['placeID', 'name']]
    top = top.merge(places, left_index=True, right_on="placeID")
    return top
    
    
    
top_similar_rest(restaurant_id,n)




  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)


Unnamed: 0,PearsonR,rating_count,placeID,name
52,1.0,12,135066,Restaurante Guerra
121,1.0,36,135085,Tortas Locas Hipocampo
13,1.0,13,135076,Restaurante Pueblo Bonito
117,0.930261,13,132754,Cabana Huasteca
28,0.912871,13,135045,Restaurante la Gran Via


### BONUS (Next iteration)
Instead of flitering out restaurants with a rating count below 10, let's consider a restaurant X as similar to Y only if at least 3 users have gone to both X and Y. 

i.e. user 143, 153, and 168 went to both restaurants - not 3 random users visited X, and a different 3 random users visited y

In [None]:
# your code here