Reference - https://www.kaggle.com/fahd09/yelp-dataset-surpriseme-recommendation-system

We capture the idea of recommending restaurants from interesting categories that differ from the suggested category, while still basing the recommendations on similarity of preferences. This gives the user a different experience every time, with credible recommendations backed by good ratings and reviews.

We will explore the techniques in detail later, but here I would like to highlight the core idea. First, we represent reviews using a bag-of-words representation. Next, we represent categories using a one-hot encoding. Then we manipulate those representations to find similarities and differences while balancing the weights of the two. Note that the core idea assumes that you are more likely to love a restaurant if its reviews are similar to the reviews of the restaurants you already love.
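
To make the core idea concrete before the real data comes in, here is a minimal, dependency-free sketch. The toy restaurants, the helper names (`bag_of_words`, `one_hot`, `correlation_distance`, `combined_distance`) and the equal weighting are illustrative assumptions, not the notebook's actual pipeline, which uses `CountVectorizer` and `cdist` further down.

```python
from collections import Counter
import math

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def one_hot(categories, all_categories):
    """1 if the restaurant has the category, else 0."""
    return [1 if c in categories else 0 for c in all_categories]

def correlation_distance(u, v):
    """1 minus the Pearson correlation of two equal-length vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [x - mean_u for x in u]
    dv = [x - mean_v for x in v]
    dot = sum(a * b for a, b in zip(du, dv))
    norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return 1 - dot / norm

# Hypothetical toy data: name -> (review text, categories)
restaurants = {
    "seed": ("great pizza great pasta", {"Italian", "Pizza"}),
    "trattoria": ("great pasta nice wine", {"Italian", "Wine Bars"}),
    "taqueria": ("spicy tacos cheap beer", {"Mexican"}),
}
vocabulary = sorted({w for text, _ in restaurants.values() for w in text.split()})
all_categories = sorted({c for _, cats in restaurants.values() for c in cats})

def combined_distance(name_a, name_b):
    """Equal-weight mean of the review distance and the category distance."""
    text_a, cats_a = restaurants[name_a]
    text_b, cats_b = restaurants[name_b]
    review_d = correlation_distance(bag_of_words(text_a, vocabulary),
                                    bag_of_words(text_b, vocabulary))
    category_d = correlation_distance(one_hot(cats_a, all_categories),
                                      one_hot(cats_b, all_categories))
    return (review_d + category_d) / 2

distances = {name: combined_distance("seed", name) for name in ("trattoria", "taqueria")}
print(distances)
```

In this toy setting the trattoria ends up closer to the seed than the taqueria, because it shares both review vocabulary and a category with it.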

Let's begin by importing libraries and making sure we only deal with valid data.

In [1]:
import os
import re
import string

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from nltk.tokenize import WordPunctTokenizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#loading the phoenix restaurant data
df_yelp_business_phx = pd.read_csv('data/final_project_phx/business_real_final_FINAL_Project.csv')
df_yelp_business_phx.head(5)

Unnamed: 0,checkin-info,city,review_count,name,neighborhoods,type,business_id,full_address,hours,reviews,state,rating_count,longitude,stars,latitude,is_open,categories,attributes
0,"{""23-6"": 1, ""19-2"": 1, ""6-1"": 3, ""5-2"": 2, ""2-...",Tempe,47,Domino's Pizza,[],business,-0Sgh0QlUKVsWosCWJzGqQ,"681 E Apache Blvd, Ste 104","{'Monday': {'close': '0:00', 'open': '10:00'},...","{'4': [""The service was great, we ordered on t...",AZ,"{'4': 6, '1': 29, '5': 4, '2': 6, '3': 4}",-111.928995,2.0,33.414346,OPEN,"Sandwiches, Restaurants, Chicken Wings, Pizza","{'Alcohol': ""u'none'"", 'Caters': 'False', 'Has..."
1,"{""7-4"": 3, ""4-5"": 32, ""3-6"": 24, ""2-3"": 5, ""5-...",Phoenix,130,Dubliner,[],business,-0tgMGl7D9B10YjSN2ujLA,"3841 E Thunderbird Rd, Ste 111","{'Monday': {'close': '2:00', 'open': '11:00'},...",{'4': ['always a good crowd and mix of ppl usu...,AZ,"{'4': 44, '3': 24, '5': 32, '2': 15, '1': 21}",-111.998513,3.5,33.611128,OPEN,"Irish Pub, Pubs, Nightlife, Irish, Bars, Music...","{'Alcohol': ""u'full_bar'"", 'Caters': 'False', ..."
2,"{""4-4"": 7, ""5-5"": 2, ""22-6"": 11, ""15-3"": 21, ""...",Phoenix,412,Matt's Big Breakfast,[],business,-1UMR00eXtwaeh59pEiDjA,"3800 E Sky Harbor Blvd, Terminal 4, Gate B5","{'Monday': {'close': '14:30', 'open': '6:00'},...",{'5': ['Good service every time.\n\nPaul is gr...,AZ,"{'5': 135, '2': 51, '1': 91, '4': 92, '3': 49}",-111.996636,3.5,33.436934,OPEN,"Nightlife, Sandwiches, Bars, Cocktail Bars, Re...","{'Alcohol': ""u'full_bar'"", 'HasTV': 'True', 'N..."
3,"{""21-2"": 1, ""23-6"": 4, ""16-2"": 1, ""21-1"": 1, ""...",Phoenix,24,Taco Bell,[],business,-2isRNVb6PDuBagELL5EBw,3507 W. Peoria Ave.,"{'Monday': {'close': '2:00', 'open': '7:00'}, ...",{'2': ['The food sucks but I buy it because I ...,AZ,"{'2': 4, '1': 7, '5': 8, '3': 4, '4': 1}",-112.135901,3.0,33.581713,OPEN,"Tex-Mex, Mexican, Fast Food, Restaurants","{'GoodForMeal': ""{'dessert': False, 'latenight..."
4,"{""20-4"": 1, ""1-2"": 1, ""0-3"": 1, ""0-1"": 1, ""17-...",Glendale,16,Subway,[],business,-34vSRcMz_RjN00dWIiQ3Q,"5026 W Cactus Rd, Ste 2","{'Monday': {'close': '21:00', 'open': '8:00'},...",{'1': ['I agree with Andrew T. When my wife a...,AZ,"{'1': 10, '5': 6, '2': 1}",-112.166708,2.5,33.596883,OPEN,"Fast Food, Restaurants, Sandwiches","{'BusinessParking': ""{'garage': False, 'street..."


In [None]:
df_yelp_business_phx.shape

In [None]:
# df_yelp_business = pd.read_json('../input/yelp-dataset/yelp_academic_dataset_business.json', lines=True)
# df_yelp_business.fillna('NA', inplace=True)
# # we want to make sure we only work with restaurants -- nothing else
# df_yelp_business = df_yelp_business[df_yelp_business['categories'].str.contains('Restaurants')]
# print('Final Shape: ',df_yelp_business.shape)

Now we bring in the reviews and perform some preprocessing on them.

In [3]:
#Load the reviews data in chunks 
df_yelp_review_iter = pd.read_json('data/final_project_phx/yelp_academic_dataset_review.json', chunksize=100000, lines=True)

We filter out reviews of places that are not in the list of businesses loaded earlier. Note that here we read 4 chunks of 100,000 reviews each, but we could have chosen any number (larger numbers may cause a MemoryError later on).

In [4]:
df_yelp_review = pd.DataFrame()
i=0
for df in df_yelp_review_iter:
    df = df[df['business_id'].isin(df_yelp_business_phx['business_id'])]
    df_yelp_review = pd.concat([df_yelp_review, df])
    i=i+1
    print(i)
    if i==4: break

1
2
3
4
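
The same filter can also be written as a plain streaming loop, which keeps only the matching records in memory. This is a dependency-free sketch, not the pandas code used above: `load_matching_reviews` and its `max_lines` cap are hypothetical names, and the in-memory `sample_file` stands in for the real JSON-lines review file.

```python
import io
import json

def load_matching_reviews(lines, wanted_business_ids, max_lines=None):
    """Keep JSON-lines records whose business_id is in wanted_business_ids.

    max_lines caps how many input lines are read, playing the role of
    "stop after N chunks" in the pandas loop.
    """
    kept = []
    for line_number, line in enumerate(lines, start=1):
        if max_lines is not None and line_number > max_lines:
            break
        record = json.loads(line)
        if record["business_id"] in wanted_business_ids:
            kept.append(record)
    return kept

# Hypothetical stand-in for the review file (one JSON object per line)
sample_file = io.StringIO(
    '{"business_id": "a", "text": "good"}\n'
    '{"business_id": "x", "text": "skip me"}\n'
    '{"business_id": "b", "text": "fine"}\n'
    '{"business_id": "a", "text": "beyond the cap"}\n'
)
kept_reviews = load_matching_reviews(sample_file, {"a", "b"}, max_lines=3)
print(kept_reviews)
```

Using a set for the wanted ids keeps each membership check cheap, which matters when the review file has millions of lines.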


We also keep only the businesses that show up in our review list and drop the rest.

In [5]:
#updating the dataframe to include the restaurants with reviews
df_yelp_business_phx = df_yelp_business_phx[df_yelp_business_phx['business_id'].isin(df_yelp_review['business_id'])]
df_yelp_business_phx.head(5)

Unnamed: 0,checkin-info,city,review_count,name,neighborhoods,type,business_id,full_address,hours,reviews,state,rating_count,longitude,stars,latitude,is_open,categories,attributes
15,"{""2-5"": 4, ""23-6"": 1, ""0-1"": 1, ""21-4"": 2, ""1-...",Phoenix,71,Yi's Chinese Restaurant,[],business,_287i8ZeEf0H1LiqPhyvBg,"1512 W Bell Rd, Ste 7","{'Monday': {'close': '0:00', 'open': '0:00'}, ...",{'2': ['Read the reviews before coming in. I s...,AZ,"{'2': 2, '5': 55, '3': 4, '4': 9, '1': 3}",-112.092288,4.5,33.640996,OPEN,"Chinese, Restaurants","{'GoodForMeal': ""{'dessert': False, 'latenight..."
21,"{""5-2"": 1, ""23-2"": 2, ""0-5"": 5, ""3-3"": 1, ""2-6...",Phoenix,20,Little Caesars Pizza,[],business,_8I19IRzDXmMSRES9cEGlw,"4920 W Baseline Rd, Ste 101","{'Monday': {'close': '22:00', 'open': '10:30'}...","{'3': [""On Friday night I take my kids for car...",AZ,"{'3': 4, '1': 15, '4': 1, '2': 2}",-112.166541,1.5,33.378923,OPEN,"Restaurants, Fast Food, Pizza","{'BusinessParking': ""{'garage': False, 'street..."
34,"{""3-2"": 4, ""16-3"": 1, ""2-4"": 10, ""15-4"": 1, ""2...",Phoenix,373,Chico Malo,[],business,_iEl9sCLsvXEFHUWPvgsAg,"50 W Jefferson St, Ste 100","{'Monday': {'close': '22:00', 'open': '11:00'}...",{'2': ['Food is great! Service is not! \n\nT...,AZ,"{'2': 26, '4': 84, '5': 207, '3': 28, '1': 44}",-112.073899,4.0,33.447891,OPEN,"Tapas/Small Plates, Tapas Bars, Restaurants, M...","{'RestaurantsTableService': 'True', 'GoodForMe..."
53,"{""20-4"": 1, ""3-6"": 1, ""4-2"": 1, ""2-2"": 1, ""14-...",Phoenix,11,Domino's Pizza,[],business,_RYkQNjV_D6xOzRp3RwVOQ,"5030 W McDowell Rd, Ste 51","{'Monday': {'close': '0:00', 'open': '10:00'},...","{'4': [""My fiance and I are visiting the area ...",AZ,"{'4': 1, '1': 8, '5': 2}",-112.167899,2.0,33.466386,OPEN,"Pizza, Restaurants, Sandwiches, Chicken Wings","{'BusinessAcceptsCreditCards': 'True', 'Restau..."
59,"{""2-0"": 17, ""2-2"": 21, ""23-2"": 10, ""1-4"": 36, ...",Phoenix,594,Blue Hound,[],business,_WvEXsx2eZ53lTWHlIx9kg,2 E Jefferson St,"{'Monday': {'close': '0:00', 'open': '11:00'},...","{'4': ['Classy place. Great section of liquor,...",AZ,"{'4': 178, '5': 251, '3': 86, '2': 62, '1': 34}",-112.073644,4.0,33.447519,OPEN,"Bars, Restaurants, American (New), Gastropubs,...","{'Alcohol': ""u'full_bar'"", 'HasTV': 'True', 'N..."


In [None]:
df_yelp_review.head(5)

In [None]:
df_yelp_business_phx.head(5)  # df_yelp_business is only defined in the commented-out cell above

In [None]:
print('Final businesses shape: ', df_yelp_business_phx.shape)
print('Final review shape: ', df_yelp_review.shape)

Now we want to process the reviews in a reasonable way. The following function is adapted from [here](https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb), which does a good deal of text preprocessing.

In [6]:
def clean_text(text):
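    ## Intended to remove punctuation. Note: str.translate expects a table built with
    ## str.maketrans, so as written this call leaves punctuation in place (it only remaps
    ## control characters such as "\n"); most punctuation is handled by the regexes below.
    text = text.translate(string.punctuation)
    
    ## Convert words to lower case and split them
    text = text.lower().split()
    
    ## Remove stop words and words shorter than 3 characters
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops and len(w) >= 3]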
    
    text = " ".join(text)
    
    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)    
    return text
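
One detail of `clean_text` is worth checking in isolation: what `str.translate` does when it is handed `string.punctuation` directly, compared with a deletion table built by `str.maketrans`. The sample sentence below is made up for illustration.

```python
import string

sample = "Great food, friendly staff!\nWould come again."

# As written in clean_text: the string is used as a lookup table indexed by
# code point, so only characters with code points below len(string.punctuation)
# are remapped (here the newline); punctuation itself is left untouched.
as_written = sample.translate(string.punctuation)

# A table that actually deletes every punctuation character.
delete_punctuation = str.maketrans("", "", string.punctuation)
stripped = sample.translate(delete_punctuation)

print(repr(as_written))
print(repr(stripped))
```

Switching `clean_text` to the deletion table would also remove the apostrophes that the contraction rules (`n't`, `'re`, `'ll`, ...) rely on, so it would change the cleaned text and is not a drop-in replacement.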

The next step applies those transformations. On this sample it takes only a few seconds, but it can take a couple of minutes on larger review sets.

In [7]:
%%time
df_yelp_review['text'] = df_yelp_review['text'].apply(clean_text)

CPU times: user 7.59 s, sys: 385 ms, total: 7.97 s
Wall time: 7.98 s


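Now we want to vectorize both reviews and categories. Note the min_df and max_df arguments in both: for reviews we drop tokens that appear in fewer than 1% or more than 99% of the reviews, while for categories we keep every category that appears at least once.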

In [8]:
#Create a feature vector of the preprocessed reviews using bag of words model
vectorizer_reviews = CountVectorizer(min_df = .01,max_df = .99, tokenizer = WordPunctTokenizer().tokenize)
vectorized_reviews = vectorizer_reviews.fit_transform(df_yelp_review['text'])
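
To see what proportion-valued `min_df` and `max_df` do, here is a small stand-alone sketch of the document-frequency rule as I understand it: a term is kept when the number of documents containing it lies between `min_df * n_docs` and `max_df * n_docs`. The helper `filter_vocabulary` and the toy documents are hypothetical; the notebook itself relies on `CountVectorizer` for this.

```python
from collections import Counter

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies within the two proportions."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set(): count each term at most once per document
        doc_freq.update(set(doc.split()))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)

docs = ["great food", "great service", "great tacos", "slow service"]

# "great" is in 3 of 4 documents (too common for max_df=0.7);
# "food", "tacos" and "slow" are in 1 of 4 (too rare for min_df=0.5).
kept = filter_vocabulary(docs, min_df=0.5, max_df=0.7)
kept_all = filter_vocabulary(docs, min_df=0.0, max_df=1.0)
print(kept, kept_all)
```

With the thresholds used for the reviews (0.01 and 0.99), the effect is to trim very rare tokens such as typos and tokens that appear in nearly every review.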

In [None]:
print(vectorized_reviews.shape)

Show the first 100 vocabulary terms:

In [None]:
' | '.join(vectorizer_reviews.get_feature_names_out()[:100]) # only the first 100 (use get_feature_names() on scikit-learn < 1.0)

In [9]:
#Vectorizing the categories
#TODO-experiment with ngram_range-> evaluate the results

#Create a feature vector of the categories using bag of words model
vectorizer_categories = CountVectorizer(min_df = 1, max_df = 1., tokenizer = lambda x: x.split(', '))
vectorized_categories = vectorizer_categories.fit_transform(df_yelp_business_phx['categories'])
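
The category vectorizer splits each `categories` cell on `', '`, so multi-word categories such as "Fast Food" stay together as single tokens. The sketch below reproduces that idea without scikit-learn; the lowercasing mirrors `CountVectorizer`'s default `lowercase=True`, and `tokenize_categories` is a hypothetical helper. The three rows are taken from the business table shown earlier.

```python
def tokenize_categories(cell):
    """Split a comma-separated category cell into lowercase category tokens."""
    return [c for c in cell.lower().split(", ") if c]

rows = [
    "Chinese, Restaurants",
    "Restaurants, Fast Food, Pizza",
    "Fast Food, Restaurants, Sandwiches",
]

# Alphabetically sorted vocabulary, one column per category
vocabulary = sorted({c for row in rows for c in tokenize_categories(row)})

# One row per restaurant: how often each category occurs in its cell
matrix = [[tokenize_categories(row).count(c) for c in vocabulary] for row in rows]

print(vocabulary)
for row in matrix:
    print(row)
```

Because each category appears at most once per restaurant, the count matrix is effectively a one-hot (multi-hot) encoding.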

In [None]:
print(vectorized_categories.shape)
print(vectorized_categories.toarray())

We also show the first 100 categories:

In [None]:
' | '.join(vectorizer_categories.get_feature_names_out()[:100]) # only the first 100 (use get_feature_names() on scikit-learn < 1.0)

We will use sparse representations to speed up dot products (and also save memory).

In [10]:
%%time
from scipy import sparse
# pd.get_dummies converts a categorical variable into dummy/indicator variables

#print(df_yelp_review['business_id'].head(5))
print(pd.get_dummies(df_yelp_review['business_id']).values)

# Build a sparse one-hot indicator matrix from the review data:
# one row per review, one column per business
businessxreview = sparse.csr_matrix(pd.get_dummies(df_yelp_review['business_id']).values)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
CPU times: user 65.5 ms, sys: 1.23 ms, total: 66.7 ms
Wall time: 68 ms
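
The indicator matrix matters because multiplying the transposed review-by-word counts with it adds up the word counts of all reviews that belong to the same business. A dependency-free sketch of that aggregation, using made-up reviews and ids:

```python
from collections import Counter, defaultdict

# Hypothetical (business_id, cleaned review text) pairs
reviews = [
    ("biz_a", "great tacos great salsa"),
    ("biz_b", "slow service"),
    ("biz_a", "tacos again"),
]

# One Counter per business: the sum of its reviews' bag-of-words vectors
business_word_counts = defaultdict(Counter)
for business_id, text in reviews:
    business_word_counts[business_id].update(text.split())

for business_id, counts in sorted(business_word_counts.items()):
    print(business_id, dict(counts))
```

One caveat, stated as an assumption about the data rather than a confirmed problem: `pd.get_dummies` orders its columns by sorted business id, so the per-business rows only line up with the rows of the category matrix if the business dataframe is in that same order.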


Let's print out the shapes of the matrices we have prepared and make sure they make sense (by matching their dimensions):

In [None]:
print('restaurants x categories: \t', vectorized_categories.shape) 
print('reviews x restaurants: \t\t' , businessxreview.shape) 
print('reviews x words: \t\t', vectorized_reviews.shape)

Now we are ready to choose a seed restaurant and find other restaurants that might be as good as the seed restaurant. We make sure to choose a restaurant with a good number of reviews and ratings.

In [None]:
# to choose a restaurant, just copy the business id and paste it in the next cell
df_yelp_business_phx.sample(10)

In [94]:
#df_business_phx_cat = pd.read_csv("../input/phx-data/Business_dishes.csv")

#Reading the categories in the dataset
df_business_phx_cat = pd.read_csv("data/dishes_count.csv")
df_business_phx_cat.sample(10)

# print(df_business_phx_cat.shape)
# df_business_phx_cat = df_business_phx_cat[df_business_phx_cat["business_id"].isin(df_yelp_business_phx["business_id"])]
# print(df_business_phx_cat.shape)

all_categories = df_business_phx_cat["dish"].unique()

# all_categories = df_business_phx_cat["cat"].unique()
#Fixing bad data: replace the standalone "N" separator with "&" to keep keywords consistent
for x in range(len(all_categories)):
    if " N " in all_categories[x]:
        all_categories[x] = all_categories[x].replace(" N ", " & ")
print(all_categories,len(all_categories))

(2903, 4)
(269, 4)
['Chinese' 'Italian' 'Mexican' 'Sandwiches' 'American' 'Vegan' 'Thai'
 'Fast Food' 'Cafes' 'Mediterranean' 'Breakfast & Brunch' 'Burgers'
 'Coffee & Tea' 'Sports Bar' 'Seafood' 'Japanese' 'Vietnamese' 'Korean'
 'Indian' 'Barbeque' ' Frozen Yogurt' 'Asian Fusion' 'Cambodian'
 'Filipino' 'Diner' 'Tea Room' 'Middle Eastern' 'Juice Bars' 'French']


In [84]:
df_temp = df_yelp_business_phx
df_temp = df_temp.sort_values(by=['stars'],ascending=False)
df_temp.head(50)

df_category_filtered = df_temp[df_temp['categories'].str.contains('Breakfast & Brunch')]
print(df_category_filtered.shape)

(25, 18)
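
A note on the `str.contains` filter used here: by default pandas treats the pattern as a regular expression, which matters for category names that contain parentheses, such as "American (New)". The sketch below shows the difference with the standard `re` module on a made-up `categories` string; in pandas, passing `regex=False` to `str.contains` requests a literal substring match instead.

```python
import re

categories = "American (New), Burgers, Restaurants"

# As a regular expression, the parentheses form a group, so the pattern
# looks for the text "American New" and does not match.
regex_hit = re.search("American (New)", categories)

# Escaping the pattern (or using a plain substring test) matches literally.
literal_hit = re.search(re.escape("American (New)"), categories)
plain_hit = "American (New)" in categories

print(regex_hit, literal_hit is not None, plain_hit)
```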


In [37]:
from scipy.spatial.distance import cdist

In [39]:
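def get_distances_cat_review(new_reviews, new_categories):
    """Return (review_dist, category_dist): correlation distances from the seed to every restaurant."""
    # Mean bag-of-words vector of the seed restaurant's reviews (1 x words)
    seed_review_vec = np.asarray(vectorizer_reviews.transform(new_reviews).mean(axis=0))
    # Summed word counts of all reviews per restaurant (restaurants x words)
    business_review_vecs = vectorized_reviews.T.dot(businessxreview).T.toarray()
    # TODO - try other similarity measures that capture the surprise
    review_dist = cdist(seed_review_vec, business_review_vecs, metric='correlation')
    # Mean category vector of the seed (1 x categories) against every restaurant
    seed_category_vec = np.asarray(vectorizer_categories.transform(new_categories).mean(axis=0))
    category_dist = cdist(seed_category_vec, vectorized_categories.toarray(), metric='correlation')
    return review_dist, category_dist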
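
For reference, the `correlation` metric of `cdist` is one minus the Pearson correlation of the two vectors. With $\bar{u}$ and $\bar{v}$ denoting the means of the entries of $u$ and $v$:

$$
d_{\text{corr}}(u, v) = 1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\lVert u - \bar{u} \rVert_2 \, \lVert v - \bar{v} \rVert_2}
$$

The value is 0 for perfectly correlated vectors, 1 for uncorrelated ones, and 2 for perfectly anti-correlated ones. In the evaluation loop further down, the two distances for restaurant $i$ are combined with equal weight (the symbols $d_i$, $d_i^{\text{review}}$ and $d_i^{\text{cat}}$ are shorthand introduced here):

$$
d_i = \tfrac{1}{2}\left(d_i^{\text{review}} + d_i^{\text{cat}}\right)
$$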

In [95]:
#Compile all code statements for evaluation of all categories
import json
"""
{
    "suggested_category":"",
    "recommended_categories":["cat1","cat2","cat3"],
    "recommended_restaurants":{"cat1":[],"cat2":[],"cat3":[]} #top3 ranked
}
"""

def build_output(category,recommended_categories,recommended_restaurants):
    output = {}
    output["suggested_category"] = category
    output["recommended_categories"] = recommended_categories
    output["recommended_restaurants"] = recommended_restaurants
    return output

final_output = []
print(all_categories.tolist())
for category in all_categories.tolist():
    # get the top-rated restaurant of this category
    category = category.strip()
    print("original category",category)
    # regex=False: match the category name literally rather than as a regular expression
    df_category_filtered = df_temp[df_temp['categories'].str.contains(category, regex=False)]
    df_category_filtered = df_category_filtered.sort_values(by=['stars'],ascending=False)
    # Skip categories with no matching restaurants
    if not df_category_filtered.empty:
        business_id = df_category_filtered.iloc[0]['business_id'] # top business of this category by star rating
        print("business_id",business_id)
        new_reviews = df_yelp_review.loc[df_yelp_review['business_id'] == business_id, 'text']
        new_categories = df_yelp_business_phx.loc[df_yelp_business_phx['business_id'] == business_id, 'categories']
        print("New categories",new_categories)
        review_dist,category_dist = get_distances_cat_review(new_reviews,new_categories)
        dists_together = np.vstack([review_dist.ravel(), category_dist.ravel()]).T
        dists = dists_together.mean(axis=1)
        closest = dists.argsort().ravel()[:10]
        df_yelp_business_phx_recd = df_yelp_business_phx.loc[df_yelp_business_phx['business_id'].isin(df_yelp_business_phx['business_id'].iloc[closest]), ['business_id', 'categories', 'name', 'stars']]
        unique_categories = set()
        for i in df_yelp_business_phx_recd["categories"].tolist():
            for w in i.split(","):
                w = w.strip()
                # compare for equality; `w not in "Restaurants"` would be a substring test
                if w and w != "Restaurants":
                    unique_categories.add(w)
        print("Unique categories",unique_categories,len(unique_categories))
        recommended_restaurants = {}
        # note: set order is arbitrary, so this picks 5 categories rather than a ranked top 5
        top_k_categories = list(unique_categories)[:5]
        print("top categories",top_k_categories)
        for i,unique_cat in enumerate(top_k_categories):
            # collapse variants such as 'American (New)' into plain 'American'
            if 'American' in unique_cat:
                top_k_categories[i] = 'American'
                unique_cat = 'American'
            print("rec cat",unique_cat)
            df_yelp_rest_recd = df_yelp_business_phx[df_yelp_business_phx['categories'].str.contains(unique_cat, regex=False)]
            recommended_rest_ids = df_yelp_rest_recd["business_id"].tolist()
            print("REST ids",recommended_rest_ids,len(recommended_rest_ids))
            recommended_restaurants[unique_cat] = recommended_rest_ids[:5] # first 5 matches in dataframe order (not ranked by stars)
        final_output.append(build_output(category,top_k_categories,recommended_restaurants))
    else:
        print("Missing category",category)

with open('recommendations.json', 'w') as f:
    json.dump(final_output, f)

['Chinese', 'Italian', 'Mexican', 'Sandwiches', 'American', 'Vegan', 'Thai', 'Fast Food', 'Cafes', 'Mediterranean', 'Breakfast & Brunch', 'Burgers', 'Coffee & Tea', 'Sports Bar', 'Seafood', 'Japanese', 'Vietnamese', 'Korean', 'Indian', 'Barbeque', ' Frozen Yogurt', 'Asian Fusion', 'Cambodian', 'Filipino', 'Diner', 'Tea Room', 'Middle Eastern', 'Juice Bars', 'French']
original category Chinese
['_287i8ZeEf0H1LiqPhyvBg', 'ZCzey5aPhd7jYIoHsUfjmQ', 'jCg6MSfu3fgXxO2QrpDV7w', 'h7As2jB8bhfFxCMCvdssWA', 'IlYNJUylnAinXsuytty70Q', 'TdjydrOFUSMUsTKdlXW6aQ', 'RdK6dhy4lOb2taNp-WrHjQ', 'lEtpTFWetCf6xnzeImLiHg', 'cJzv3fbd7jL2ZHyFosFkOw', 'atSfDP-SLY4GrvBlBPB31Q', '079CV1EE5WLdQqVEVYFeHQ', 'szhJLmdLDVFTevm8fu0T4A', 'ZIdR-IopAtU5PH_mGtUAbw', 'xVP6vpI-LGJ4Y61gIb4LQA', 'tDYcVluqZwieulc1iqxGXg', 'QdvROupQvDDQIaHrTGNgKA', 'dYmm5468BdWxWgksXpy2TQ', 'Cdywb13_07M1_g3U85VKTA', 'I_WWH2vYccjz-QxW3u1zJA', '9ULcHyUTN1O16Vr8KUMQew', '0ebavvJVXAzKKQ8C9cOt6g', 'O3PiNOj1vv5dCCywdw3_og', 'wZ3MABp8WcSHfZrlEQa5dw', 'p2OO



New categories 852    Restaurants, Bars, Nightlife, Pubs, Burgers, S...
Name: categories, dtype: object
New reviews 2467      back cactus jack went super dirty sketchy diff...
14075     definitely love place many reasons one issue i...
15998     could give fine establishment stars would exte...
37651     great food awesome bartenders fantastic drink ...
44667     great little bar awesome bar food ! give lots ...
47499     service great food better staff friendly + loc...
59990     great bar phoenix bartenders awesome keeps com...
64341     cactus tavern legit kris jen susie awesome fri...
69728     favorite bar now clean great food cold beer je...
77187     great food cocktails atmosphere shrimp cocktai...
85165     love it ! fun great place hang out play pool d...
86031     seriously nicest people ever i am happy found ...
125817    stopped early dinner saturday around 5pm nice ...
139882    excellent neighborhood bar ! clients friendly ...
146949    hiking preserve decided stop beer 

As the results show, for Italian the recommended categories include Burgers, Wine & Spirits, and Coffee & Tea. This suggests that people looking for Italian food may also want to check out places serving light snacks or breakfast, fine dining with music and wine, or, for fast-food fans, burgers.
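
The introduction mentions balancing the weights of the review and category representations, while the evaluation loop uses a plain mean of the two distances. As a possible next experiment, the sketch below makes that weight explicit. `combine_distances`, `closest_indices`, the `review_weight` parameter and the toy distances are all hypothetical; nothing here was run on the Yelp data.

```python
def combine_distances(review_dist, category_dist, review_weight=0.5):
    """Weighted average of two equal-length lists of distances."""
    if len(review_dist) != len(category_dist):
        raise ValueError("distance lists must have the same length")
    w = review_weight
    return [w * r + (1 - w) * c for r, c in zip(review_dist, category_dist)]

def closest_indices(dists, k):
    """Indices of the k smallest distances, nearest first."""
    return sorted(range(len(dists)), key=dists.__getitem__)[:k]

# Toy distances from a seed restaurant to three candidates
review_dist = [0.2, 0.9, 0.4]
category_dist = [1.0, 0.1, 0.4]

equal_weight = closest_indices(combine_distances(review_dist, category_dist), 3)
review_heavy = closest_indices(combine_distances(review_dist, category_dist, 0.9), 3)
category_heavy = closest_indices(combine_distances(review_dist, category_dist, 0.1), 3)
print(equal_weight, review_heavy, category_heavy)
```

A higher `review_weight` favors restaurants whose reviews read alike even when their categories differ, which is one way to steer the recommendations toward more surprising categories.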