# Introduction
For this project, I wanted to build something that would be meaningful for people in their everyday lives. This led me to build a **Content-based** filtering recommender that compares **Restaurant Profile** data with user preferences. The idea is to let a user enter preferences, such as whether smoking should be permitted, their budget, and their preferred cuisine, and then recommend the restaurants that most closely match those preferences.

The dataset that I built my models on was taken from the **<a href="https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data#">UCI Repository</a>**. The dataset contains data about restaurants, users, and ratings.
- The data contains records of 130 restaurants and 138 users.
- I used KNN to find the restaurants that most closely match the user preferences.

The project used three kinds of data:

- **User preferences**, which include cuisine types, smoking, alcohol, GPS location, height, weight, religion, and favorite color.
- **Restaurant profiles**, which describe facilities, the cuisine served, location, and parking.
- **Ratings**, which cover overall, service, and food ratings. I focused on the overall ratings only.

I combined the datasets needed for the analysis and focused on the variables I thought were most relevant for users and restaurants: cuisines, smoking area, budget, ambience, and ratings. Missing values in the dataset were marked with '?', so I replaced them with 0s. I also converted some categorical variables into numeric values.

To assess the predictive power of my content-based recommender system, I evaluated it using the hit ratio method.

**Loading the libraries**

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

import sklearn
from sklearn.neighbors import NearestNeighbors

import warnings
warnings.filterwarnings("ignore")

**Setting up working directory**

In [2]:
import os
os.chdir('C:\\Users\\shruti\\Desktop\\DSC 478\\Proj\\Files_original')

## Overview of Data Cleaning & Preprocessing
- As the first step of the data cleaning process, I removed the columns that weren't significant for the analysis, such as weight, height, marital status, birth year, religion, color, activity, latitude, longitude, geom_meter, transport, fax number, and URL.
- The variables I focused on are as follows: userID, placeID, cuisine name, restaurant name, smoking area, budget, ambience, and ratings.
- At first I converted the categorical variables to dummy variables, but I later realized that I would be converting the data into a list and therefore needed numerical values. Thus, I converted the categorical variables into numeric/ordinal variables instead.
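
The difference between the two encodings can be sketched on a toy column. This is only an illustration, not the notebook's data: the `toy` frame is hypothetical, and the `price_order` mapping simply mirrors the low/medium/high codes used later in the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical restaurant attribute.
toy = pd.DataFrame({"price": ["low", "medium", "high", "low"]})

# One-hot (dummy) encoding: one indicator column per category.
dummies = pd.get_dummies(toy["price"], prefix="price")

# Ordinal encoding: a single numeric column that preserves the order.
price_order = {"low": 1, "medium": 2, "high": 3}
toy["price_num"] = toy["price"].map(price_order)

print(dummies.columns.tolist())
print(toy["price_num"].tolist())
```

The ordinal form keeps one column per attribute, which keeps the feature vectors short when they are later compared by distance.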

### Prepare - Restaurant Data

**Loading the restaurant data file, which contains restaurant facilities and locations**

In [3]:
restaurants = pd.read_csv('geoplaces2.csv',encoding='latin-1')
restaurants.head()

Unnamed: 0,placeID,latitude,longitude,the_geom_meter,name,address,city,state,country,fax,...,alcohol,smoking_area,dress_code,accessibility,price,url,Rambience,franchise,area,other_services
0,134999,18.915421,-99.184871,0101000020957F000088568DE356715AC138C0A525FC46...,Kiku Cuernavaca,Revolucion,Cuernavaca,Morelos,Mexico,?,...,No_Alcohol_Served,none,informal,no_accessibility,medium,kikucuernavaca.com.mx,familiar,f,closed,none
1,132825,22.147392,-100.983092,0101000020957F00001AD016568C4858C1243261274BA5...,puesto de tacos,esquina santos degollado y leon guzman,s.l.p.,s.l.p.,mexico,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,f,open,none
2,135106,22.149709,-100.976093,0101000020957F0000649D6F21634858C119AE9BF528A3...,El Rincón de San Francisco,Universidad 169,San Luis Potosi,San Luis Potosi,Mexico,?,...,Wine-Beer,only at bar,informal,partially,medium,?,familiar,f,open,none
3,132667,23.752697,-99.163359,0101000020957F00005D67BCDDED8157C1222A2DC8D84D...,little pizza Emilio Portes Gil,calle emilio portes gil,victoria,tamaulipas,?,?,...,No_Alcohol_Served,none,informal,completely,low,?,familiar,t,closed,none
4,132613,23.752903,-99.165076,0101000020957F00008EBA2D06DC8157C194E03B7B504E...,carnitas_mata,lic. Emilio portes gil,victoria,Tamaulipas,Mexico,?,...,No_Alcohol_Served,permitted,informal,completely,medium,?,familiar,t,closed,none


**Keeping only the columns that are significant to the analysis, as shown below:**

In [4]:
restaurants = restaurants[['placeID', 'name', 'smoking_area', 'price', 'Rambience']]

**Converting categorical variables to ordinal/numeric :**

In [5]:
cleanup_data = {"smoking_area" : {"none" : 0 , "only at bar" : 1, 'permitted' : 1, 'not permitted' : 0, "section":1},
                "Rambience": {"familiar": 1, "quiet": 0},
                "price" : {"low": 1, "medium" : 2, "high" : 3}}

**Cleaned dataset:**

In [6]:
restaurants_num = restaurants.replace(cleanup_data)
restaurants_num

Unnamed: 0,placeID,name,smoking_area,price,Rambience
0,134999,Kiku Cuernavaca,0,2,1
1,132825,puesto de tacos,0,1,1
2,135106,El Rincón de San Francisco,1,2,1
3,132667,little pizza Emilio Portes Gil,0,1,1
4,132613,carnitas_mata,1,2,1
...,...,...,...,...,...
125,132866,Chaires,0,2,1
126,135072,Sushi Itto,0,2,1
127,135109,Paniroles,0,2,0
128,135019,Restaurant Bar Coty y Pablo,0,1,1


**Loading the restaurant data file containing information about cuisine types**

In [7]:
rest_cuisine = pd.read_csv('chefmozcuisine.csv')
print(rest_cuisine.shape)
rest_cuisine.head()

(916, 2)


Unnamed: 0,placeID,Rcuisine
0,135110,Spanish
1,135109,Italian
2,135107,Latin_American
3,135106,Mexican
4,135105,Fast_Food


#### Pivot table on restaurant-cuisine to create a matrix showing which place offers which types of cuisine

In [8]:
rest_cuisine_pvt = rest_cuisine.pivot_table(index=['placeID'],
                     columns='Rcuisine',
                     aggfunc=np.count_nonzero).reset_index().fillna(0).astype(int)
rest_cuisine_pvt

Rcuisine,placeID,Afghan,African,American,Armenian,Asian,Bagels,Bakery,Bar,Bar_Pub_Brewery,...,Soup,Southern,Southwestern,Spanish,Steaks,Sushi,Thai,Turkish,Vegetarian,Vietnamese
0,132001,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,132002,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,132003,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,132004,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,132005,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
764,135105,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
765,135106,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
766,135107,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
767,135109,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
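
As an alternative sketch, `pd.crosstab` can build the same kind of place-by-cuisine matrix directly from the two columns. The toy rows below reuse a few placeID and cuisine pairs from the previews in this notebook, and clipping at 1 is an assumption that a strict 0/1 indicator is wanted rather than a count. That may matter here, because the merged restaurant table shows a value of 2.0 under `Armenian` for a restaurant listed with a single cuisine.

```python
import pandas as pd

# A few (placeID, cuisine) pairs in the same long format as chefmozcuisine.csv.
toy_cuisine = pd.DataFrame({
    "placeID": [135110, 135109, 132005, 132005],
    "Rcuisine": ["Spanish", "Italian", "French", "Seafood"],
})

# Count each (place, cuisine) pair, then clip so every cell is 0 or 1.
indicator = pd.crosstab(toy_cuisine["placeID"], toy_cuisine["Rcuisine"]).clip(upper=1)
indicator = indicator.reset_index()

print(indicator)
```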


**Merging the restaurant location based data and restaurant cuisine data using left join**

In [9]:
restaurant_data = pd.merge(restaurants_num, rest_cuisine_pvt, on='placeID', how='left')

**Replacing missing values with zeros.**

In [10]:
restaurant_data = restaurant_data.fillna(0)
restaurant_data

Unnamed: 0,placeID,name,smoking_area,price,Rambience,Afghan,African,American,Armenian,Asian,...,Soup,Southern,Southwestern,Spanish,Steaks,Sushi,Thai,Turkish,Vegetarian,Vietnamese
0,134999,Kiku Cuernavaca,0,2,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,132825,puesto de tacos,0,1,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,135106,El Rincón de San Francisco,1,2,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,132667,little pizza Emilio Portes Gil,0,1,1,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,132613,carnitas_mata,1,2,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,132866,Chaires,0,2,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
126,135072,Sushi Itto,0,2,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
127,135109,Paniroles,0,2,0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
128,135019,Restaurant Bar Coty y Pablo,0,1,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**This is a set of cuisines we have available from restaurants**

In [11]:
colnames_lst = restaurant_data.columns[5:]
colnames_lst

Index(['Afghan', 'African', 'American', 'Armenian', 'Asian', 'Bagels',
       'Bakery', 'Bar', 'Bar_Pub_Brewery', 'Barbecue', 'Brazilian',
       'Breakfast-Brunch', 'Burgers', 'Cafe-Coffee_Shop', 'Cafeteria',
       'California', 'Caribbean', 'Chinese', 'Contemporary',
       'Continental-European', 'Deli-Sandwiches', 'Dessert-Ice_Cream', 'Diner',
       'Dutch-Belgian', 'Eastern_European', 'Ethiopian', 'Family', 'Fast_Food',
       'Fine_Dining', 'French', 'Game', 'German', 'Greek', 'Hot_Dogs',
       'International', 'Italian', 'Japanese', 'Juice', 'Korean',
       'Latin_American', 'Mediterranean', 'Mexican', 'Mongolian',
       'Organic-Healthy', 'Persian', 'Pizzeria', 'Polish', 'Regional',
       'Seafood', 'Soup', 'Southern', 'Southwestern', 'Spanish', 'Steaks',
       'Sushi', 'Thai', 'Turkish', 'Vegetarian', 'Vietnamese'],
      dtype='object')

### Preparing a dataset to print the results in a nicely formatted manner at the end

In [12]:
rest_cuisine_combined = pd.DataFrame(rest_cuisine.groupby("placeID")["Rcuisine"].apply(list))
rest_cuisine_combined.head()

placeID,Rcuisine
132001,[Dutch-Belgian]
132002,[Seafood]
132003,[International]
132004,[Seafood]
132005,"[French, Seafood]"


In [13]:
restaurant_profile = pd.merge(restaurants, rest_cuisine_combined, on='placeID', how='left')
restaurant_profile = restaurant_profile.fillna("none specified")
restaurant_profile.head()

Unnamed: 0,placeID,name,smoking_area,price,Rambience,Rcuisine
0,134999,Kiku Cuernavaca,none,medium,familiar,[Japanese]
1,132825,puesto de tacos,none,low,familiar,[Mexican]
2,135106,El Rincón de San Francisco,only at bar,medium,familiar,[Mexican]
3,132667,little pizza Emilio Portes Gil,none,low,familiar,[Armenian]
4,132613,carnitas_mata,permitted,medium,familiar,[Mexican]


### Preparing User Preferences Data
- The data consists of 138 users.

**Loading the userProfile data which contains information about user preferences**

In [14]:
userProfile = pd.read_csv('userprofile.csv')
userProfile.head()

Unnamed: 0,userID,latitude,longitude,smoker,drink_level,dress_preference,ambience,transport,marital_status,hijos,birth_year,interest,personality,religion,activity,color,weight,budget,height
0,U1001,22.139997,-100.978803,False,abstemious,informal,family,on foot,single,independent,1989,variety,thrifty-protector,none,student,black,69,medium,1.77
1,U1002,22.150087,-100.983325,False,abstemious,informal,family,public,single,independent,1990,technology,hunter-ostentatious,Catholic,student,red,40,low,1.87
2,U1003,22.119847,-100.946527,False,social drinker,formal,family,public,single,independent,1989,none,hard-worker,Catholic,student,blue,60,low,1.69
3,U1004,18.867,-99.183,False,abstemious,informal,family,public,single,independent,1940,variety,hard-worker,none,professional,green,44,medium,1.53
4,U1005,22.183477,-100.959891,False,abstemious,no preference,family,public,single,independent,1992,none,thrifty-protector,Catholic,student,black,65,medium,1.69


**Loading the userCuisine data which contains information about userID and cuisines**

In [15]:
userCuisine = pd.read_csv('usercuisine.csv')
userCuisine.head()
print(userCuisine.shape)

(330, 2)


**Dropping the variables that are not relevant to the analysis**

In [16]:
users = userProfile.drop(['latitude', 'longitude','drink_level', 'dress_preference','transport','marital_status','hijos','birth_year','interest','personality', 'religion', 'activity', 'color', 'weight', 'height'], axis=1)
users

Unnamed: 0,userID,smoker,ambience,budget
0,U1001,false,family,medium
1,U1002,false,family,low
2,U1003,false,family,low
3,U1004,false,family,medium
4,U1005,false,family,medium
...,...,...,...,...
133,U1134,false,family,medium
134,U1135,false,family,low
135,U1136,true,friends,low
136,U1137,false,family,low


**Replacing the missing-value marker '?' with NaN**

In [17]:
users = users.replace('?',np.nan)
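
The same substitution can also be done while reading the file, using the `na_values` parameter of `pd.read_csv`. The sketch below uses a hypothetical two-row excerpt held in memory rather than the real `userprofile.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt using '?' as the missing-value marker.
csv_text = "userID,smoker,budget\nU1,false,medium\nU2,?,?\n"

# na_values tells read_csv to treat '?' as NaN while parsing.
toy_users = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(toy_users.isna().sum())
```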

**Converting the categorical variables into ordinal/numeric variables**

In [18]:
cleanup_data = {"smoker" : {"false" : 0 , "true" : 1},
                "ambience": {"family": 1, "friends": 1, "solitary": 0},
                "budget" : {"low": 1, "medium" : 2, "high" : 3}}

In [19]:
users = users.replace(cleanup_data)
print(users.shape)
users.head()

(138, 4)


Unnamed: 0,userID,smoker,ambience,budget
0,U1001,0.0,1.0,2.0
1,U1002,0.0,1.0,1.0
2,U1003,0.0,1.0,1.0
3,U1004,0.0,1.0,2.0
4,U1005,0.0,1.0,2.0


### Pivot table on user-cuisine to see which user likes which cuisines

In [20]:
userCuisine['val'] = 1
user_cuisine = userCuisine.pivot_table(index='userID',
                     columns='Rcuisine', values='val')
user_cuisine = user_cuisine.fillna(0)
user_cuisine

userID,Afghan,African,American,Armenian,Asian,Australian,Austrian,Bagels,Bakery,Bar,...,Swiss,Tapas,Tea_House,Tex-Mex,Thai,Tibetan,Tunisian,Turkish,Vegetarian,Vietnamese
U1001,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1005,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U1134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1135,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
U1136,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
U1137,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Filter Cuisines List
- Both the restaurant data and the user data contain cuisine information, but users could choose from more cuisines than the restaurants in the dataset actually serve, so I kept only the cuisines that appear in the restaurant dataset.
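
One way to sketch this filtering defensively is `DataFrame.reindex`, which keeps the requested columns in order and fills any cuisine that no user selected with 0 instead of raising a `KeyError`. The frame and the `restaurant_cuisines` list below are hypothetical stand-ins for the notebook's `user_cuisine` and `colnames_lst`.

```python
import pandas as pd

# Hypothetical user-by-cuisine matrix with one cuisine no restaurant serves.
toy_user_cuisine = pd.DataFrame(
    {"American": [1, 0], "Mexican": [0, 1], "Tibetan": [1, 0]},
    index=pd.Index(["U1", "U2"], name="userID"),
)

# Cuisine columns taken from the restaurant side (illustrative subset).
restaurant_cuisines = ["American", "Mexican", "Sushi"]

# Keep only restaurant cuisines, in restaurant order; absent ones become 0.
aligned = toy_user_cuisine.reindex(columns=restaurant_cuisines, fill_value=0)

print(aligned)
```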

**Keeping the user cuisines that match the restaurant data**

In [21]:
user_cuisine = user_cuisine[colnames_lst]

**Merging the user data with user cuisine data**

In [22]:
#pd.options.display.max_rows = 999
user_profile = pd.merge(users, user_cuisine, on='userID', how='left')
user_profile = user_profile.fillna(0)
user_profile

Unnamed: 0,userID,smoker,ambience,budget,Afghan,African,American,Armenian,Asian,Bagels,...,Soup,Southern,Southwestern,Spanish,Steaks,Sushi,Thai,Turkish,Vegetarian,Vietnamese
0,U1001,0.0,1.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,U1002,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,U1003,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,U1004,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,U1005,0.0,1.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,U1006,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,U1007,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,U1008,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,U1009,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,U1010,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Listing the column names of the user profile**

In [23]:
list(user_profile.columns)

['userID',
 'smoker',
 'ambience',
 'budget',
 'Afghan',
 'African',
 'American',
 'Armenian',
 'Asian',
 'Bagels',
 'Bakery',
 'Bar',
 'Bar_Pub_Brewery',
 'Barbecue',
 'Brazilian',
 'Breakfast-Brunch',
 'Burgers',
 'Cafe-Coffee_Shop',
 'Cafeteria',
 'California',
 'Caribbean',
 'Chinese',
 'Contemporary',
 'Continental-European',
 'Deli-Sandwiches',
 'Dessert-Ice_Cream',
 'Diner',
 'Dutch-Belgian',
 'Eastern_European',
 'Ethiopian',
 'Family',
 'Fast_Food',
 'Fine_Dining',
 'French',
 'Game',
 'German',
 'Greek',
 'Hot_Dogs',
 'International',
 'Italian',
 'Japanese',
 'Juice',
 'Korean',
 'Latin_American',
 'Mediterranean',
 'Mexican',
 'Mongolian',
 'Organic-Healthy',
 'Persian',
 'Pizzeria',
 'Polish',
 'Regional',
 'Seafood',
 'Soup',
 'Southern',
 'Southwestern',
 'Spanish',
 'Steaks',
 'Sushi',
 'Thai',
 'Turkish',
 'Vegetarian',
 'Vietnamese']

#### Now that all the data files are combined and processed, we will build the content-based recommender using KNN

### Set Up the KNN Restaurant Recommender

Using the restaurant attributes, we compare each restaurant with the user's preferences to identify restaurants the user is likely to enjoy.

**Getting User Preferences:**
- Suppose we have a user, U1008, who is a non-smoker (`smoker` = 0) and whose `ambience` value is 0, which corresponds to solitary in the encoding above (or to a missing value that was filled with 0).
- The user has a low budget and likes American food, among other cuisines.

In [24]:
#select a user from our set to get recommendations for
user = user_profile[user_profile['userID'] == 'U1008']
user 

Unnamed: 0,userID,smoker,ambience,budget,Afghan,African,American,Armenian,Asian,Bagels,...,Soup,Southern,Southwestern,Spanish,Steaks,Sushi,Thai,Turkish,Vegetarian,Vietnamese
7,U1008,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
user_preferences = user_profile[user_profile['userID'] == 'U1008'].values[0]
user_preferences_arr = np.array(user_preferences)[1:]

print(user_preferences_arr.shape)
print("User Prefers: ",user_preferences)

(62,)
User Prefers:  ['U1008' 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]


**Getting Restaurants:**

In [26]:
restaurant_data_arr = np.array(restaurant_data)[:, 2:]
restaurant_data_arr[0:5]
print(restaurant_data_arr.shape)

(130, 62)
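
Because the nearest-neighbor search compares the two arrays position by position, each position needs to describe the same attribute on both sides. In the tables above, the restaurant columns appear in the order `smoking_area`, `price`, `Rambience`, while the user columns appear as `smoker`, `ambience`, `budget`. A minimal sketch of aligning them by name before building the arrays, using hypothetical one-row frames and an assumed `rename_map`:

```python
import pandas as pd

# Hypothetical one-row frames that mimic the column order of the two tables.
toy_restaurant = pd.DataFrame(
    {"smoking_area": [0], "price": [1], "Rambience": [1], "Mexican": [1]}
)
toy_user = pd.DataFrame(
    {"smoker": [0], "ambience": [1], "budget": [2], "Mexican": [1]}
)

# Assumed correspondence between user and restaurant attribute names.
rename_map = {"smoker": "smoking_area", "ambience": "Rambience", "budget": "price"}

# Rename the user columns, then reorder them to match the restaurant frame.
user_aligned = toy_user.rename(columns=rename_map)[toy_restaurant.columns]

print(user_aligned.columns.tolist())
```

Selecting columns by name rather than by position keeps budget compared with price and ambience compared with ambience, whatever order the source files use.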


#### Getting the 5 nearest recommendations using KNN
- By matching user preferences to restaurant characteristics.

#### Finding the nearest restaurants that match the user preferences

In [27]:
nbrs = NearestNeighbors(n_neighbors=5).fit(restaurant_data_arr)
print(nbrs.kneighbors([user_preferences_arr]))

(array([[3.16227766, 3.16227766, 3.16227766, 3.16227766, 3.16227766]]), array([[129,  19,   6,   1,  18]], dtype=int64))


In [28]:
dist, rest_idx = nbrs.kneighbors([user_preferences_arr])
dist, rest_idx

(array([[3.16227766, 3.16227766, 3.16227766, 3.16227766, 3.16227766]]),
 array([[129,  19,   6,   1,  18]], dtype=int64))
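
`NearestNeighbors` uses the Euclidean distance by default (Minkowski with p = 2), so each distance above is the square root of the summed squared differences between the user vector and a restaurant row. The reported value 3.16227766 is √10. A small sketch with hypothetical vectors (not rows from the dataset) that differ by 1 in exactly ten positions:

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 12-dimensional vectors that differ by 1 in ten positions.
user_vec = [1] * 10 + [0, 0]
rest_vec = [0] * 10 + [0, 0]

print(euclidean(user_vec, rest_vec))
```

Since all five neighbors are at the same distance, their relative order carries no ranking information.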

**Recommending restaurants that the user may like**

In [29]:
for i in rest_idx.tolist():
    print("Recommended Restaurants: \n",restaurant_profile.iloc[i], "\n")

Recommended Restaurants: 
      placeID                name smoking_area price Rambience        Rcuisine
129   132877    sirloin stockade         none   low  familiar  none specified
19    132668      TACOS EL GUERO         none   low  familiar       [Mexican]
6     132732  Taqueria EL amigo          none   low  familiar       [Mexican]
1     132825     puesto de tacos         none   low  familiar       [Mexican]
18    132856       Unicols Pizza         none   low  familiar       [Italian] 



### Evaluation: Hit Ratio
- I evaluated the system with a hit ratio, which measures how many of the restaurants recommended to a user are in the set of restaurants that the user already liked in the ratings data. A recommendation counts as a hit if the user has already visited that restaurant and rated it 1 or higher.
- A higher hit ratio indicates that the recommended restaurants are more likely to be ones the user will like.
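
The metric can be sketched in plain Python on toy data. `hit_ratio` follows the definition used in this notebook (total hits divided by the number of users), so it is an average number of hits per user. `users_with_hit` is a variant that is not part of the notebook, added here only for comparison. All user IDs, place IDs, and function names below are hypothetical.

```python
def hit_ratio(recommendations, likes):
    """Total hits across users divided by the number of users."""
    hits = sum(
        len(set(recs) & likes[user]) for user, recs in recommendations.items()
    )
    return hits / len(recommendations)

def users_with_hit(recommendations, likes):
    """Share of users with at least one recommended place they liked."""
    hit_users = sum(
        1 for user, recs in recommendations.items() if set(recs) & likes[user]
    )
    return hit_users / len(recommendations)

# Hypothetical users, recommendations, and liked places.
toy_recs = {"U1": [101, 102, 103], "U2": [101, 104, 105]}
toy_likes = {"U1": {101, 102}, "U2": {999}}

print(hit_ratio(toy_recs, toy_likes))       # average hits per user
print(users_with_hit(toy_recs, toy_likes))  # fraction of users with a hit
```

In this toy case the first measure is 1.0 while only half of the users received a hit, which shows why the two numbers should not be read interchangeably.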

**Loading the ratings file to see which restaurants each user has rated**
- Ratings range from 0 to 2 (0 is a low rating and 2 is a high rating).

In [30]:
ratings = pd.read_csv('rating_final.csv')
print(ratings.shape)
ratings.head()

(1161, 5)


Unnamed: 0,userID,placeID,rating,food_rating,service_rating
0,U1077,135085,2,2,2
1,U1077,135038,2,2,1
2,U1077,132825,2,2,2
3,U1077,135060,1,2,2
4,U1068,135104,1,1,2


**Selecting the restaurants each user rated 1 or higher to get the set of restaurants the user likes**

In [31]:
#get the set of restaurants each user has liked
user_likes_rows = ratings[ratings['rating'] >= 1]
user_likes = pd.DataFrame(user_likes_rows.groupby("userID")["placeID"].apply(list))
user_likes = user_likes.reset_index(level=0)
user_likes.head()

Unnamed: 0,userID,placeID
0,U1001,"[132830, 132825, 135040, 135039, 135045, 13503..."
1,U1002,"[132921, 135062, 135106, 132825, 135052, 13286..."
2,U1003,"[132825, 135075, 132862, 132937, 132922, 13272..."
3,U1004,"[135060, 135028, 135106, 135062, 135032, 13295..."
4,U1005,"[135050, 135076, 132830, 135066, 135041, 13505..."


**Hit Ratio: how many of the restaurants recommended to a user are in the set of restaurants that the user already likes in the ratings data.**

In [32]:
number_of_recs = 5
user_recs = {}
hits = 0
userlst = np.array(user_likes['userID'])
total_users = len(userlst)

# The restaurant features are the same for every user, so fit the model once.
restaurant_data_arr = np.array(restaurant_data.iloc[:, 2:].astype('int64'))
nbrs = NearestNeighbors(n_neighbors=number_of_recs).fit(restaurant_data_arr)

for u in userlst:
    # Feature vector for this user (drop the leading userID).
    user_preferences = user_profile[user_profile['userID'] == u].values[0]
    user_preferences_arr = np.array(user_preferences)[1:].astype('int64')
    dist, recommendations = nbrs.kneighbors([user_preferences_arr])
    # Map neighbor row positions back to placeIDs.
    rec_lst = [restaurant_profile.iloc[i].placeID for i in recommendations[0].tolist()]
    user_recs[u] = rec_lst
    # Count recommended places that the user has already rated 1 or higher.
    user_likes_set = set(user_likes[user_likes.userID == u]["placeID"].to_list()[0])
    common_likes = user_likes_set.intersection(rec_lst)
    hits += len(common_likes)

# Total hits divided by the number of users (average hits per user).
hit_ratio = hits / total_users
print("Hit ratio : ", hit_ratio)

Hit ratio :  0.23622047244094488


In [33]:
number_of_recs = 20
user_recs = {}
hits = 0
userlst = np.array(user_likes['userID'])
total_users = len(userlst)

# The restaurant features are the same for every user, so fit the model once.
restaurant_data_arr = np.array(restaurant_data.iloc[:, 2:].astype('int64'))
nbrs = NearestNeighbors(n_neighbors=number_of_recs).fit(restaurant_data_arr)

for u in userlst:
    # Feature vector for this user (drop the leading userID).
    user_preferences = user_profile[user_profile['userID'] == u].values[0]
    user_preferences_arr = np.array(user_preferences)[1:].astype('int64')
    dist, recommendations = nbrs.kneighbors([user_preferences_arr])
    # Map neighbor row positions back to placeIDs.
    rec_lst = [restaurant_profile.iloc[i].placeID for i in recommendations[0].tolist()]
    user_recs[u] = rec_lst
    # Count recommended places that the user has already rated 1 or higher.
    user_likes_set = set(user_likes[user_likes.userID == u]["placeID"].to_list()[0])
    common_likes = user_likes_set.intersection(rec_lst)
    hits += len(common_likes)

# Total hits divided by the number of users (average hits per user).
hit_ratio = hits / total_users
print("Hit ratio : ", hit_ratio)

Hit ratio :  0.9448818897637795


The two examples above show encouraging results. The metric is the total number of hits divided by the number of users, so it is the average number of hits per user rather than a percentage of users. With 20 recommendations per user (out of 130 restaurants), the average is about 0.94 hits per user. With a more realistic number of recommendations, say 5, the average across all users is about 0.24 hits per user, which isn't too bad.

To conclude, the hit ratio suggests that the content-based recommender can surface restaurants that users have already rated positively. The application built on this recommender is meant to help new users who haven't rated any restaurants yet: it asks for their preferences for smoking, ambience, budget, and cuisine, and the content-based recommender returns the restaurants that are most similar to those preferences.