# IEOR 4571 - Personalization - Final Project

#### Team members: 
Name, UNI/email, Github ID
* Megala Kannan, msk2245@columbia.edu, thisismeg
* Hojin Lee, hl3328@columbia.edu, hjlee9295
* Jung Ah Shin, js5569@columbia.edu, juliajungahshin
* Tiffany Zhu, tz2196@columbia.edu, tlzhu19


# TOC:
* [1. Introduction](#1)
* [2. Data Exploration](#2)
* [3. Modeling](#3)
    * [3.1 Baseline Model](#3-1)
    * [3.2 Something](#3-2)
* [4. Evaluation](#4)
    * [4.1 Accuracy](#4-1)
    * [4.2 Coverage](#4-2)
* [5. Conclusion](#5)


# 1. Introduction <a class="anchor" id="1"></a>

# 2. Data Exploration <a class="anchor" id="2"></a>

In [385]:
import pandas as pd
import json
from tqdm import tqdm

import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')

import itertools
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

In [2]:
def convert_json_to_df(path, file_name, column_names):
    # count the lines first so tqdm can show progress; the handle is closed afterwards
    with open(path + file_name) as f:
        line_count = sum(1 for _ in f)
    columns_dict = {name: [] for name in column_names}

    with open(path + file_name) as f:
        for line in tqdm(f, total=line_count):
            blob = json.loads(line)
            
            for key in column_names:
                columns_dict[key].append(blob[key])
    
    return pd.DataFrame(columns_dict)

In [3]:
path = "/Users/megalakannan/Documents/Sem 3/Personalization/Homework/yelp_dataset/"
file_name = "review.json"

In [6]:
# review.json
ratings = convert_json_to_df(path, file_name, ['user_id', 'business_id', 'stars', 'date'])

user_counts = ratings["user_id"].value_counts()
active_users = user_counts.loc[user_counts >= 5].index.tolist()

100%|██████████| 6685900/6685900 [01:22<00:00, 81193.39it/s] 


In [7]:
len(active_users)

286130

In [8]:
ratings.rename(columns={'stars': 'rating'}, inplace=True)
ratings.head()

| | user_id | business_id | rating | date |
|---|---|---|---|---|
| 0 | hG7b0MtEbXx5QzbzE6C_VA | ujmEBvifdJM6h6RLv4wQIg | 1.0 | 2013-05-07 04:34:36 |
| 1 | yXQM5uF2jS6es16SJzNHfg | NZnhc2sEQy3RmzKTZnqtwQ | 5.0 | 2017-01-14 21:30:33 |
| 2 | n6-Gk65cPZL6Uz8qRm3NYw | WTqjgwHlXbSFevF32_DJVw | 5.0 | 2016-11-09 20:09:03 |
| 3 | dacAIZ6fTM6mqwW5uxkskg | ikCg8xy5JIg_NGPx-MSIDA | 5.0 | 2018-01-09 20:56:38 |
| 4 | ssoyf2_x0EQMed6fgHeMyQ | b1b1eb3uo-w561D0ZfCEiQ | 1.0 | 2018-01-30 23:07:38 |


In [9]:
import random

# take random subset of active users
n = len(active_users)
subset_active_users = random.sample(active_users, round(n * 0.2))

# filter the ratings df by the subset of active users
active_user_ratings = ratings.loc[ratings['user_id'].isin(subset_active_users)]

In [10]:
active_user_ratings2 = active_user_ratings.sort_values('date')
actual_X = active_user_ratings2.groupby(['user_id'], as_index=False).apply(lambda x: x.iloc[:-1])
actual_y = active_user_ratings2.groupby(['user_id'], as_index=False).apply(lambda x: x.iloc[-1])

In [11]:
actual_X.head()

| | | user_id | business_id | rating | date |
|---|---|---|---|---|---|
| 0 | 5077203 | --0kuuLmuYBe3Rmu0Iycww | sxPwFSLoW7xx1tWgNZ-p6g | 5.0 | 2013-08-26 23:07:49 |
| 0 | 6081103 | --0kuuLmuYBe3Rmu0Iycww | 6TBfgiKpP-VWtKuM-IwR0Q | 1.0 | 2013-09-06 06:20:54 |
| 0 | 4996945 | --0kuuLmuYBe3Rmu0Iycww | loEwm40TwkQeEu3zYvU7RQ | 4.0 | 2013-09-12 00:27:27 |
| 0 | 3338704 | --0kuuLmuYBe3Rmu0Iycww | VaiYxIUfHIfYfwYgOupjMA | 4.0 | 2013-10-03 17:19:39 |
| 0 | 3298263 | --0kuuLmuYBe3Rmu0Iycww | ev1SC6q8AolQWix0n577sg | 2.0 | 2013-11-11 20:35:14 |


In [12]:
actual_y.head()

| | user_id | business_id | rating | date |
|---|---|---|---|---|
| 0 | --0kuuLmuYBe3Rmu0Iycww | PYe_FDw6QTbTf66WcGE_tw | 2.0 | 2014-04-21 16:58:28 |
| 1 | --2HUmLkcNHZp0xw6AMBPg | KW9RNyBPmc77f9FsO92qYw | 5.0 | 2018-10-04 02:02:28 |
| 2 | --2vR0DIsmQ6WfcSzKWigw | BLIJ-p5wYuAhw6Pp6mh6mw | 3.0 | 2018-01-11 04:24:17 |
| 3 | --4rAAfZnEIAKJE80aIiYg | HTaA1mo9cB1dXMwfJC6yKg | 1.0 | 2018-11-12 20:37:07 |
| 4 | --7gjElmOrthETJ8XqzMBw | UxWH8zRYIBgs6Q2oykvRdw | 4.0 | 2018-05-24 21:19:54 |


In [13]:
# can do the same for business.json, user.json, tip.json 
# for metadata info see https://www.yelp.com/dataset/documentation/main

In [14]:
# business.json
# todo: how to add 'attributes'?
'''
"attributes": {
        "RestaurantsTakeOut": true,
        "BusinessParking": {
            "garage": false,
            "street": true,
            "validated": false,
            "lot": false,
            "valet": false
        },
'''

businesses = convert_json_to_df(path, 'business.json', ['business_id', 'city', 'state', 'stars', 
                                                        'review_count', 'is_open', 'attributes', 
                                                        'categories', 'hours', 'latitude', 'longitude'])

100%|██████████| 192609/192609 [00:03<00:00, 48844.26it/s]


In [15]:
businesses.head()

Unnamed: 0,business_id,city,state,stars,review_count,is_open,attributes,categories,hours,latitude,longitude
0,1SWheh84yJXfytovILXOAQ,Phoenix,AZ,3.0,5,0,{'GoodForKids': 'False'},"Golf, Active Life",,33.522143,-112.018481
1,QXAEGFB4oINsVuTFxEYKFQ,Mississauga,ON,2.5,128,1,"{'RestaurantsReservations': 'True', 'GoodForMe...","Specialty Food, Restaurants, Dim Sum, Imported...","{'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W...",43.605499,-79.652289
2,gnKjwL_1w79qoiV3IC_xQQ,Charlotte,NC,4.0,170,1,"{'GoodForKids': 'True', 'NoiseLevel': 'u'avera...","Sushi Bars, Restaurants, Japanese","{'Monday': '17:30-21:30', 'Wednesday': '17:30-...",35.092564,-80.859132
3,xvX2CttrVhyG2z1dFg_0xw,Goodyear,AZ,5.0,3,1,,"Insurance, Financial Services","{'Monday': '8:0-17:0', 'Tuesday': '8:0-17:0', ...",33.455613,-112.395596
4,HhyxOkGAM07SRYtlQ4wMFQ,Charlotte,NC,4.0,4,1,"{'BusinessAcceptsBitcoin': 'False', 'ByAppoint...","Plumbing, Shopping, Local Services, Home Servi...","{'Monday': '7:0-23:0', 'Tuesday': '7:0-23:0', ...",35.190012,-80.887223


In [16]:
# user.json
users = convert_json_to_df(path, 'user.json', ['user_id', 'review_count', 'friends', 'useful', 
                                               'funny', 'cool', 'fans', 'elite', 'average_stars', 
                                               'compliment_hot', 'compliment_more', 'compliment_profile',
                                               'compliment_cute', 'compliment_list', 'compliment_note',
                                               'compliment_plain', 'compliment_cool', 'compliment_funny',
                                               'compliment_writer', 'compliment_photos'
                                              ])

100%|██████████| 1637138/1637138 [00:35<00:00, 46631.75it/s]


In [17]:
users.head()

Unnamed: 0,user_id,review_count,friends,useful,funny,cool,fans,elite,average_stars,compliment_hot,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
0,l6BmjZMeQD3rDxWUbiAiow,95,"c78V-rj8NQcQjOI8KP3UEA, alRMgPcngYSCJ5naFRBz5g...",84,17,25,5,201520162017.0,4.03,2,0,0,0,0,1,1,1,1,2,0
1,4XChL029mKr5hydo79Ljxg,33,"kEBTgDvFX754S68FllfCaA, aB2DynOxNOJK9st2ZeGTPg...",48,22,16,4,,3.63,1,0,0,0,0,0,0,1,1,0,0
2,bc8C_eETBWL0olvFSJJd0w,16,"4N-HU_T32hLENLntsNKNBg, pSY2vwWLgWfGVAAiKQzMng...",28,8,10,0,,3.71,0,0,0,0,0,1,0,0,0,0,0
3,dD0gZpBctWGdWo9WlGuhlA,17,"RZ6wS38wnlXyj-OOdTzBxA, l5jxZh1KsgI8rMunm-GN6A...",30,4,14,5,,4.85,1,0,0,0,0,0,2,0,0,1,0
4,MM4RJAeH6yuaN8oZDSt0RA,361,"mbwrZ-RS76V1HoJ0bF_Geg, g64lOV39xSLRZO0aQQ6DeQ...",1114,279,665,39,2015201620172018.0,4.08,28,1,0,0,1,16,57,80,80,25,5


In [18]:
# tip.json
tips =  convert_json_to_df(path, 'tip.json', ['text', 'date', 'compliment_count', 'business_id', 'user_id'])

100%|██████████| 1223094/1223094 [00:07<00:00, 160644.03it/s]


In [26]:
photos = convert_json_to_df(path, 'photo.json', ['photo_id', 'business_id'])

100%|██████████| 200000/200000 [00:01<00:00, 179038.25it/s]


In [682]:
active_user_only_ratings_df = ratings[ratings['user_id'].isin(active_users)]

sample_size = [50000]

for s in sample_size:
    unique_users = active_user_only_ratings_df['user_id'].unique()
    # np.random.randint draws indices with replacement, so the sample can contain fewer than s distinct users
    sampleUID = unique_users[np.random.randint(unique_users.shape[0], size=s)]
    active_user_only_ratings_df_sample = active_user_only_ratings_df[active_user_only_ratings_df['user_id'].isin(sampleUID)]

In [683]:
import datetime as dt
active_user_only_ratings_df_sample['date'] = pd.to_datetime(active_user_only_ratings_df_sample['date'])

In [684]:
len(active_user_only_ratings_df_sample)

736852

In [685]:
# take 80% of reviews for each user sampled
training_data = active_user_only_ratings_df_sample.sort_values(by=['user_id', 'date']).groupby('user_id').apply(lambda x: x[:round(len(x)*.8)]).reset_index(drop=True)

In [686]:
# take 20% of reviews for each user sampled
testing_data = active_user_only_ratings_df_sample.sort_values(by=['user_id', 'date']).groupby('user_id').apply(lambda x: x[round(len(x)*.8):]).reset_index(drop=True)
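
The per-user chronological split used in the two cells above can be illustrated on a toy example. This is a minimal sketch in plain Python rather than pandas; the function name `split_by_user` and the sample rows are hypothetical, and it assumes ISO-formatted date strings so that string order matches time order.

```python
from collections import defaultdict

def split_by_user(rows, train_frac=0.8):
    """Split (user_id, business_id, rating, date) rows per user by time.

    The earliest round(n * train_frac) reviews of each user go to the
    training set and the remaining, most recent reviews to the test set.
    """
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_id in sorted(by_user):
        # ISO date strings sort chronologically
        history = sorted(by_user[user_id], key=lambda r: r[3])
        cut = round(len(history) * train_frac)
        train.extend(history[:cut])
        test.extend(history[cut:])
    return train, test

# hypothetical toy data: one user with five reviews, listed out of order
rows = [
    ("u1", "b3", 4.0, "2015-03-01"),
    ("u1", "b1", 5.0, "2013-01-10"),
    ("u1", "b5", 2.0, "2018-07-22"),
    ("u1", "b2", 3.0, "2014-06-05"),
    ("u1", "b4", 1.0, "2016-11-30"),
]
train, test = split_by_user(rows)
print([r[1] for r in train])  # businesses reviewed first
print([r[1] for r in test])   # most recent review(s)
```

With this rounding rule a user with five reviews contributes four to training and one to testing, so the held-out reviews are always the most recent ones for that user.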

In [687]:
def create_features(active_user_only_ratings_df_sample):
    base_df = active_user_only_ratings_df_sample

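    # mean rating per business, computed from the rows passed in (for the test split this uses the test ratings themselves);
    # the sample is sparse, so many businesses have only a few ratings and these averages can be noisy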
    real_average = base_df[['business_id','rating']].groupby('business_id').mean().reset_index()
    real_average.rename(columns={"rating": "average_business_rating"}, inplace=True)

    #Average ratings for business added
    base_df = base_df.merge(real_average, how='left', on='business_id')

    #one-hot encoding for top5 categories
    catList = []
    businesses['categories'].fillna(value='',inplace=True)
    businesses['cat'] = businesses['categories'].apply(lambda x: x.split(','))
    catList.extend(businesses['cat'])
    merged = [x.strip() for x in list(itertools.chain(*catList))]

    #Adding state, review_count, is_open
    base_df = base_df.merge(businesses[['business_id','state','city', 'latitude', 'longitude','review_count','is_open', 'hours', 'cat']], on='business_id')
    base_df.rename(columns={"review_count": "business_review_count"}, inplace=True)

    #getting top 5 common categories items
    top5List = [name for name, _ in Counter(merged).most_common(5)]

    #one-hot encoding if the business in top 5 common category
    for item in top5List:
        base_df[item] = base_df['cat'].apply(lambda categories: 'Y' if item in [y.strip() for y in categories] else 'N')

    # is_open (categorical) change from 1 and 0 to Y and N
    base_df['is_open'] = base_df['is_open'].apply(lambda x: 'Y' if x else 'N')

    # hours: how many days per week it's open
    base_df['hours'] = base_df['hours'].apply(lambda x: len(x.keys()) if x else 0)
    base_df.rename(columns={"hours": "days_per_week_open"}, inplace=True)

    # user information
    base_df = base_df.merge(users[['user_id', 'average_stars', 'review_count', 'friends']],  on='user_id')

    # number_of_friends
    base_df['friends'] = base_df['friends'].apply(lambda x: len(x.split(',')))
    base_df.rename(columns={"friends": "number_of_friends", "review_count": "user_review_count", "average_stars": "average_user_rating"}, inplace=True)

    #number of tips for popularity measure of business
    business_numberOfTips = tips[['business_id','user_id']].groupby('business_id').count().reset_index()
    business_numberOfTips.rename(columns={"user_id": "business_numberOfTips"}, inplace=True)
    base_df = base_df.merge(business_numberOfTips, on='business_id')

    #number of photo for popularity measure of business
    business_numberOfPhotos = photos[['business_id','photo_id']].groupby('business_id').count().reset_index()
    business_numberOfPhotos.rename(columns={"photo_id": "business_numberOfPhotos"}, inplace=True)
    base_df = base_df.merge(business_numberOfPhotos, on='business_id')
    
    return base_df

In [688]:
training_features = create_features(training_data)
training_features.head()

Unnamed: 0,user_id,business_id,rating,date,average_business_rating,state,city,latitude,longitude,business_review_count,...,Restaurants,Shopping,Food,Home Services,Beauty & Spas,average_user_rating,user_review_count,number_of_friends,business_numberOfTips,business_numberOfPhotos
0,--CH8yRGXhO2MmbF-4BWXg,mnI_n7A8sxgOSmtgI3wzQQ,5.0,2015-12-11 16:20:05,4.933333,PA,Pittsburgh,40.428679,-79.983114,95,...,Y,N,Y,N,N,2.33,12,1,15,3
1,zPwZQEVmFg9cbmsEwLpA6g,mnI_n7A8sxgOSmtgI3wzQQ,5.0,2015-11-09 19:51:05,4.933333,PA,Pittsburgh,40.428679,-79.983114,95,...,Y,N,Y,N,N,4.34,225,126,15,3
2,4wp4XI9AxKNqJima-xahlg,mnI_n7A8sxgOSmtgI3wzQQ,5.0,2015-02-17 02:17:53,4.933333,PA,Pittsburgh,40.428679,-79.983114,95,...,Y,N,Y,N,N,3.88,1180,3164,15,3
3,6KUA3-IfHoAhQ3FL2djQoQ,mnI_n7A8sxgOSmtgI3wzQQ,5.0,2016-11-06 22:27:38,4.933333,PA,Pittsburgh,40.428679,-79.983114,95,...,Y,N,Y,N,N,3.45,229,800,15,3
4,HYocPAFq7nSpD93B-5z6kQ,mnI_n7A8sxgOSmtgI3wzQQ,4.0,2015-09-01 03:17:07,4.933333,PA,Pittsburgh,40.428679,-79.983114,95,...,Y,N,Y,N,N,3.78,9,86,15,3


In [689]:
testing_features = create_features(testing_data)
testing_features.head()

Unnamed: 0,user_id,business_id,rating,date,average_business_rating,state,city,latitude,longitude,business_review_count,...,Restaurants,Shopping,Food,Home Services,Beauty & Spas,average_user_rating,user_review_count,number_of_friends,business_numberOfTips,business_numberOfPhotos
0,--CH8yRGXhO2MmbF-4BWXg,TZpTyyGvQkKPnt59PVUGhg,5.0,2015-12-11 16:31:56,4.0,PA,Pittsburgh,40.449976,-79.950737,182,...,Y,N,N,N,N,2.33,12,1,23,3
1,764ZNGXISDujCsJF4XyAWw,TZpTyyGvQkKPnt59PVUGhg,1.0,2017-10-03 03:14:34,4.0,PA,Pittsburgh,40.449976,-79.950737,182,...,Y,N,N,N,N,3.38,8,1,23,3
2,EFfQZFfWWlxZ4ckTJdBLtQ,TZpTyyGvQkKPnt59PVUGhg,4.0,2017-08-03 18:22:05,4.0,PA,Pittsburgh,40.449976,-79.950737,182,...,Y,N,N,N,N,3.59,107,66,23,3
3,GkWP5QDuoF3FkiEXmM3c2Q,TZpTyyGvQkKPnt59PVUGhg,4.0,2018-09-05 00:07:27,4.0,PA,Pittsburgh,40.449976,-79.950737,182,...,Y,N,N,N,N,3.85,51,21,23,3
4,Nwjm4o1s82JTTAM1Xtvs0w,TZpTyyGvQkKPnt59PVUGhg,5.0,2014-10-22 14:12:54,4.0,PA,Pittsburgh,40.449976,-79.950737,182,...,Y,N,N,N,N,3.59,33,8,23,3


# Wide and Deep Model

In this model, a wide linear model and a deep neural network are jointly trained to combine the benefits of memorization and generalization for recommender systems. Wide linear models can effectively memorize sparse feature interactions using cross-product feature transformations, while deep neural networks can generalize to previously unseen feature interactions through low-dimensional embeddings.

## Feature Selection

A set of relevant features is selected from the original dataset for training the model. The rating distribution is highly imbalanced, so we oversample (up-sample) ratings 1 through 4 with replacement to create a more balanced dataset. This is applied to the training data only.
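
The up-sampling idea can be sketched in isolation as follows. This is a minimal illustration in plain Python, not the notebook's exact procedure: the helper `upsample`, its `target_size` parameter, and the toy labels are assumptions, and unlike the cells below it only resamples classes that are smaller than the target.

```python
import random
from collections import Counter

def upsample(labels, target_size, seed=0):
    """Return row indices in which every class smaller than target_size
    is drawn with replacement up to target_size; larger classes keep
    their original rows."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < target_size:
            chosen.extend(rng.choices(members, k=target_size))
        else:
            chosen.extend(members)
    return chosen

# hypothetical, heavily skewed toy labels (ratings 1..5)
labels = [5] * 6 + [4] * 3 + [1] * 2 + [3] + [2]
indices = upsample(labels, target_size=6)
balanced = Counter(labels[i] for i in indices)
print(sorted(balanced.items()))
```

The returned indices can be used to select both the feature rows and the labels, which keeps the two aligned. The test split is left untouched so that evaluation reflects the original rating distribution.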

In [871]:
training_features.columns

Index(['user_id', 'business_id', 'rating', 'date', 'average_business_rating',
       'state', 'city', 'latitude', 'longitude', 'business_review_count',
       'is_open', 'days_per_week_open', 'cat', 'Restaurants', 'Shopping',
       'Food', 'Home Services', 'Beauty & Spas', 'average_user_rating',
       'user_review_count', 'number_of_friends', 'business_numberOfTips',
       'business_numberOfPhotos'],
      dtype='object')

In [872]:
train_nn = training_features[['user_id','business_id','average_business_rating','state','city','latitude','longitude',
                           'business_review_count','days_per_week_open','Restaurants','Shopping','Food','Home Services',
                            'Beauty & Spas','average_user_rating','user_review_count','business_numberOfPhotos', 'rating']]

In [873]:
train_nn= train_nn.rename(columns= {'Home Services': 'Home_Services', 'Beauty & Spas': 'Beauty_and_Spas'})

In [874]:
labels = train_nn.pop('rating')

In [875]:
class1 = np.where(labels==1)[0]
class2= np.where(labels==2)[0]
class3 = np.where(labels==3)[0]
class4= np.where(labels==4)[0]
class5 = np.where(labels==5)[0]

In [876]:
class1_upsampled = np.random.choice(class1, size = 100000, replace= True)
class2_upsampled = np.random.choice(class2, size = 100000, replace= True)
class3_upsampled = np.random.choice(class3, size = 100000, replace= True)
class4_upsampled = np.random.choice(class4, size = 100000, replace= True)

In [877]:
train_nn_1 = train_nn.iloc[class1_upsampled]
train_nn_2 = train_nn.iloc[class2_upsampled]
train_nn_3 = train_nn.iloc[class3_upsampled]
train_nn_4 = train_nn.iloc[class4_upsampled]
train_nn_5 = train_nn.iloc[class5]

In [878]:
labels_1 = labels.iloc[class1_upsampled]
labels_2 = labels.iloc[class2_upsampled]
labels_3 = labels.iloc[class3_upsampled]
labels_4 = labels.iloc[class4_upsampled]
labels_5 = labels.iloc[class5]

In [879]:
# stack the per-class samples into one frame (DataFrame.append is not available in pandas 2.x)
train_nn_balanced = pd.concat([train_nn_1, train_nn_2, train_nn_3, train_nn_4, train_nn_5], ignore_index=True)

In [880]:
# stack the labels in the same class order as the features so rows stay aligned
labels_balanced = pd.concat([labels_1, labels_2, labels_3, labels_4, labels_5], ignore_index=True)

In [881]:
assert len(train_nn_balanced) == len(labels_balanced)

In [882]:
train_nn = train_nn_balanced
labels = labels_balanced

In [883]:
train_nn.head()

Unnamed: 0,user_id,business_id,average_business_rating,state,city,latitude,longitude,business_review_count,days_per_week_open,Restaurants,Shopping,Food,Home_Services,Beauty_and_Spas,average_user_rating,user_review_count,business_numberOfPhotos
0,HDajox2iB5oMdDpwII34tg,Y6ERro4P_wmneOWj07n94A,3.105263,ON,Richmond Hill,43.847866,-79.374942,124,7,Y,N,N,N,N,2.88,17,23
1,tFLFrEjD9omSqs0gqSYxvA,0imWty0eKpz8YpuZAKF83Q,2.0,NV,Henderson,36.006706,-115.112924,87,7,Y,N,Y,N,N,2.89,64,15
2,p9SE1mTHK8Hqow4E1Wunrw,dNRbhJt4wd-ZU4lFVK3iiw,2.666667,NV,Las Vegas,36.059689,-115.171527,80,0,Y,N,N,N,N,4.22,18,2
3,9Cts7waInWCn6f_RvcgW3Q,DwP10iEz5LGf3fhcVQZm0Q,3.925926,AZ,Scottsdale,33.501969,-111.924064,545,7,Y,N,Y,N,N,3.59,79,45
4,q1VxsV8Hk6EyDHn3SprU2g,UidScb9HFDSD3NBw8MLuGw,2.913043,WI,Madison,43.073546,-89.451283,149,7,Y,N,N,N,N,3.28,73,2


In [884]:
# shift ratings from 1..5 to class ids 0..4, as expected by the classifier
labels = labels.apply(lambda x: int(x)-1)

In [885]:
test_nn = testing_features[['user_id','business_id','average_business_rating','state','city','latitude','longitude',
                           'business_review_count','days_per_week_open','Restaurants','Shopping','Food','Home Services',
                            'Beauty & Spas','average_user_rating','user_review_count','business_numberOfPhotos', 'rating']]

In [886]:
test_nn= test_nn.rename(columns= {'Home Services': 'Home_Services', 'Beauty & Spas': 'Beauty_and_Spas'})
test_nn['rating']= test_nn['rating'].apply(lambda x: int(x)-1)
test_labels = test_nn.pop('rating')

## Feature Engineering

For the wide component, the feature set includes raw input features and transformed features. The sparse features are one-hot encoded, and a cross-product transformation is applied to the bucketized latitude and longitude. This captures interactions between binary features and adds non-linearity to the generalized linear model.
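
To make the cross-product transformation concrete, here is a minimal sketch in plain Python. It is not TensorFlow's implementation: the helper names, the coarse bucket boundaries, and the use of `zlib.crc32` as the hash are all assumptions chosen for illustration.

```python
import bisect
import zlib

def bucketize(value, boundaries):
    """Index of the bucket that value falls into, given sorted boundaries."""
    return bisect.bisect_right(boundaries, value)

def crossed_feature(lat, lon, lat_bounds, lon_bounds, hash_bucket_size):
    """Hash the pair (latitude bucket, longitude bucket) into a fixed number
    of slots and return a one-hot vector for the wide component."""
    key = "{}_x_{}".format(bucketize(lat, lat_bounds), bucketize(lon, lon_bounds))
    slot = zlib.crc32(key.encode("utf-8")) % hash_bucket_size
    one_hot = [0] * hash_bucket_size
    one_hot[slot] = 1
    return one_hot

# hypothetical coarse boundaries; two Pittsburgh businesses from the tables above
lat_bounds = [30.0, 35.0, 40.0, 45.0]
lon_bounds = [-120.0, -100.0, -80.0]
a = crossed_feature(40.428679, -79.983114, lat_bounds, lon_bounds, 16)
b = crossed_feature(40.449976, -79.950737, lat_bounds, lon_bounds, 16)
print(a == b)  # True: both businesses fall into the same latitude/longitude cell
```

Each active slot corresponds to one geographic cell, so the linear model can learn a separate weight per cell. Because the pairs are hashed, distinct cells may occasionally share a slot when the number of slots is small.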

For the deep component, each sparse, high-dimensional categorical feature is first converted into a low-dimensional, dense, real-valued vector using an embedding (10 dimensions per feature in this notebook).
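
The embedding lookup can be pictured as a table with one small vector per hash bucket. The sketch below is a plain-Python illustration under stated assumptions: the class `HashedEmbedding` is hypothetical, `zlib.crc32` stands in for the library's hash function, and the vectors are random here whereas in the model they are learned during training.

```python
import random
import zlib

class HashedEmbedding:
    """Maps string ids to dense vectors through a fixed number of hash buckets."""

    def __init__(self, hash_bucket_size, dimension, seed=0):
        rng = random.Random(seed)
        self.hash_bucket_size = hash_bucket_size
        # one small random vector per bucket; a real model trains these values
        self.table = [[rng.uniform(-0.05, 0.05) for _ in range(dimension)]
                      for _ in range(hash_bucket_size)]

    def bucket(self, value):
        return zlib.crc32(value.encode("utf-8")) % self.hash_bucket_size

    def lookup(self, value):
        return self.table[self.bucket(value)]

embedding = HashedEmbedding(hash_bucket_size=1000, dimension=10)
vector = embedding.lookup("Pittsburgh")
print(len(vector))
```

The same string always maps to the same vector, and the vector length is independent of how many distinct cities or users appear in the data.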


In [887]:
import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers

In [888]:
numeric_features = ['average_business_rating','business_review_count','average_user_rating',
                    'user_review_count', 'days_per_week_open','business_numberOfPhotos','latitude','longitude']
cat_features = ['user_id','business_id','state','city','Restaurants','Shopping','Food',
                'Home_Services', 'Beauty_and_Spas']

In [889]:
#continuous features
cont = {
        colname : feature_column.numeric_column(colname) for colname in numeric_features
}

In [890]:
#categorical features
cat = {
        'user_id' : feature_column.categorical_column_with_hash_bucket('user_id', hash_bucket_size=10000),
        'business_id' : feature_column.categorical_column_with_hash_bucket('business_id', hash_bucket_size=10000),
        'state' : feature_column.categorical_column_with_hash_bucket('state', hash_bucket_size=1000),
        'city' : feature_column.categorical_column_with_hash_bucket('city', hash_bucket_size=1000),
        'Restaurants' : feature_column.categorical_column_with_vocabulary_list('Restaurants', ['Y', 'N'], default_value = 0),
        'Shopping' : feature_column.categorical_column_with_vocabulary_list('Shopping', ['Y', 'N'], default_value = 0),
        'Food' : feature_column.categorical_column_with_vocabulary_list('Food', ['Y', 'N'], default_value = 0),
        'Home_Services' : feature_column.categorical_column_with_vocabulary_list('Home_Services', ['Y', 'N'], default_value = 0),
        'Beauty_and_Spas' : feature_column.categorical_column_with_vocabulary_list('Beauty_and_Spas', ['Y', 'N'], default_value = 0)
}

In [891]:
#embed sparse columns for dense layer
embed = {
       'embed_{}'.format(colname) : tf.feature_column.embedding_column(col, 10)
          for colname, col in cat.items()
}

In [892]:
#columns for dense network
dense = cont.copy()
dense.update(embed)
#print(dense.keys())

In [893]:
#one hot encode sparse columns for wide network
wide = {
    'ohe_{}'.format(colname)  : tf.feature_column.indicator_column(col)
          for colname, col in cat.items()
}

In [894]:
#crossed columns
lat_boundary = list(train_nn.latitude.sort_values().unique())
long_boundary = list(train_nn.longitude.sort_values().unique())

latitude_bucket_fc = feature_column.bucketized_column(cont.get('latitude'), lat_boundary)
longitude_bucket_fc = feature_column.bucketized_column(cont.get('longitude'), long_boundary)

#crossed_lat_lon = feature_column.crossed_column([latitude_bucket_fc, longitude_bucket_fc], 5000)
wide['crossed_lat_lon'] = feature_column.crossed_column([latitude_bucket_fc, longitude_bucket_fc], 5000)
#print(wide.keys())

In [895]:
def df_to_dataset_train(dataframe, labels,  shuffle = True, batch_size = 32):
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
      ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

In [896]:
def df_to_dataset_eval(dataframe, labels, batch_size= 32):
    if labels is None:
        ds = tf.data.Dataset.from_tensor_slices(dict(dataframe))
    else:
        ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.batch(batch_size)
    return ds

In [897]:
list(wide.keys())

['ohe_user_id',
 'ohe_business_id',
 'ohe_state',
 'ohe_city',
 'ohe_Restaurants',
 'ohe_Shopping',
 'ohe_Food',
 'ohe_Home_Services',
 'ohe_Beauty_and_Spas',
 'crossed_lat_lon']

In [898]:
list(dense.keys())

['average_business_rating',
 'business_review_count',
 'average_user_rating',
 'user_review_count',
 'days_per_week_open',
 'business_numberOfPhotos',
 'latitude',
 'longitude',
 'embed_user_id',
 'embed_business_id',
 'embed_state',
 'embed_city',
 'embed_Restaurants',
 'embed_Shopping',
 'embed_Food',
 'embed_Home_Services',
 'embed_Beauty_and_Spas']

# Modeling

The wide component is a generalized linear model and the deep component is a feed-forward neural network. The two components are combined by taking a weighted sum of their output log odds as the prediction, which is then fed into a common logistic loss function during training.
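
The way the two outputs are combined can be sketched numerically. The description above is for the binary case; the sketch below assumes the natural five-class analogue, where per-class logits from both components are summed and passed through a softmax with a cross-entropy loss. All function names and logit values are hypothetical.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_probabilities(wide_logits, deep_logits, bias):
    """Add the per-class logits of the two components and a bias,
    then normalise with a softmax."""
    logits = [w + d + b for w, d, b in zip(wide_logits, deep_logits, bias)]
    return softmax(logits)

def cross_entropy(probabilities, true_class):
    return -math.log(probabilities[true_class])

# hypothetical outputs of the wide and deep parts for one example
wide_logits = [0.2, -0.1, 0.0, 0.4, 1.0]
deep_logits = [-0.3, 0.1, 0.2, 0.6, 0.9]
bias = [0.0] * 5
probs = combined_probabilities(wide_logits, deep_logits, bias)
loss = cross_entropy(probs, true_class=4)
print(probs.index(max(probs)), round(loss, 4))
```

Since the loss is computed on the summed logits, its gradient depends on both components at once, which is what distinguishes joint training from averaging two separately trained models.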

Joint training backpropagates the gradients from the output to both the wide and deep parts of the model simultaneously, using mini-batch stochastic gradient descent. The regularizers and optimizer parameters were set to the values used in the paper 'Wide & Deep Learning for Recommender Systems'. The number of hidden units was tuned.
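
A toy version of joint training is sketched below. It is a simplified illustration, not the estimator's training loop: it assumes a binary label, a single wide feature, a deep part with one `tanh` hidden unit, and plain mini-batch SGD in place of the FTRL and proximal AdaGrad optimizers configured in the next cell. All names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x_wide, x_deep):
    """Return the hidden activation and the predicted probability."""
    hidden = math.tanh(params["v"] * x_deep)            # deep part: one hidden unit
    logit = (params["w_wide"] * x_wide                  # wide part: linear term
             + params["w_deep"] * hidden + params["b"])
    return hidden, sigmoid(logit)

def mean_loss(params, data):
    total = 0.0
    for x_wide, x_deep, y in data:
        _, p = forward(params, x_wide, x_deep)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def sgd_step(params, batch, learning_rate):
    """One mini-batch update; the same error signal (p - y) reaches both parts."""
    grads = {name: 0.0 for name in params}
    for x_wide, x_deep, y in batch:
        hidden, p = forward(params, x_wide, x_deep)
        error = p - y                                    # d(loss)/d(logit)
        grads["w_wide"] += error * x_wide
        grads["w_deep"] += error * hidden
        grads["b"] += error
        grads["v"] += error * params["w_deep"] * (1 - hidden ** 2) * x_deep
    for name in params:
        params[name] -= learning_rate * grads[name] / len(batch)

# hypothetical toy data: (wide feature, deep feature, binary label)
data = [(1.0, 0.5, 1), (0.0, 1.0, 1), (1.0, -1.0, 0), (0.0, -0.5, 0)]
params = {"w_wide": 0.0, "w_deep": 0.1, "v": 0.5, "b": 0.0}

initial_loss = mean_loss(params, data)
for epoch in range(100):
    for start in range(0, len(data), 2):                 # mini-batches of size 2
        sgd_step(params, data[start:start + 2], learning_rate=0.5)
final_loss = mean_loss(params, data)
print(initial_loss > final_loss)
```

The point of the sketch is the single `error` term: it is computed once from the combined logit and then used to update the wide weight, the deep output weight, and the hidden-layer weight in the same step.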

In [899]:
estimator = tf.estimator.DNNLinearCombinedClassifier(
    # wide settings
    linear_feature_columns=list(wide.values()),
    linear_optimizer='Ftrl',
    # deep settings
    dnn_feature_columns=list(dense.values()),
    dnn_hidden_units=[1000, 500, 100],
    dnn_optimizer=tf.train.ProximalAdagradOptimizer(learning_rate = 0.1),
    n_classes = 5)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/_6/s5sd_lxx4fj1sh0y1bz07n140000gn/T/tmpzgf8fne2', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x137dadb38>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [900]:
# Reference snippets only: neither expression below is assigned to the estimator.
# To apply L1 and L2 regularization, dnn_optimizer can be set to:
tf.train.ProximalAdagradOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.001,
    l2_regularization_strength=0.001)
# To apply learning rate decay, dnn_optimizer can be set to a callable:
lambda: tf.train.AdamOptimizer(
    learning_rate=tf.train.exponential_decay(
        learning_rate=0.1,
        global_step=tf.train.get_global_step(),
        decay_steps=10000,
        decay_rate=0.96))

<function __main__.<lambda>()>

In [901]:
def input_fn_train():
    return df_to_dataset_train(train_nn, labels)

In [902]:
def input_fn_eval():
    return df_to_dataset_eval(test_nn, test_labels)

In [903]:
def input_fn_pred():
    # df_to_dataset_eval has no default for labels; None yields a features-only dataset
    return df_to_dataset_eval(test_nn, None)

In [904]:
estimator.train(input_fn=input_fn_train, steps=100)

INFO:tensorflow:Calling model_fn.


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/_6/s5sd_lxx4fj1sh0y1bz07n140000gn/T/tmpzgf8fne2/model.ckpt.
INFO:tensorflow:loss = 283.72607, step = 1
INFO:tensorflow:Saving checkpoints for 100 into /var/folders/_6/s5sd_lxx4fj1sh0y1bz07n140000gn/T/tmpzgf8fne2/model.ckpt.
INFO:tensorflow:Loss for final step: 50.98482.


<tensorflow_estimator.python.estimator.canned.dnn_linear_combined.DNNLinearCombinedClassifier at 0x137033160>

## Evaluation 


Our model does not outperform the baseline model, which could be due to several reasons. The deep component can be improved in several ways: we can add dropout and try different optimizers and hyperparameters to find a better fit. The highly imbalanced rating distribution also affects accuracy on the test set, so accuracy alone may not represent the results well and other metrics could give a fuller picture.
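
As one example of a metric that is less sensitive to class imbalance, the sketch below computes per-class recall and its unweighted mean. This is an illustrative choice, not a metric reported in this notebook; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical labels: a classifier that always predicts the majority class 5
y_true = [5, 5, 5, 5, 5, 5, 4, 4, 3, 1]
y_pred = [5] * len(y_true)
print(accuracy(y_true, y_pred))      # 0.6
print(macro_recall(y_true, y_pred))  # 0.25
```

In this toy case accuracy looks moderate even though the classifier never predicts a low rating, while the macro-averaged recall exposes that only one of the four classes present is ever recovered.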

In [905]:
results = estimator.evaluate(input_fn=input_fn_eval, steps = 100)

INFO:tensorflow:Calling model_fn.


INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-12-14T16:12:51Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/_6/s5sd_lxx4fj1sh0y1bz07n140000gn/T/tmpzgf8fne2/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2019-12-14-16:13:01
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.4609375, average_loss = 1.5673081, global_step = 100, loss = 50.15386
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 100: /var/folders/_6/s5sd_lxx4fj1sh0y1bz07n140

In [906]:
for key in sorted(results):
    print( "%s: %s" % (key, results[key]))

accuracy: 0.4609375
average_loss: 1.5673081
global_step: 100
loss: 50.15386


In [907]:
predictor = estimator.predict( input_fn = input_fn_eval)

In [None]:
pred = list(predictor)
for val in pred:
    print(val['class_ids'][0]+1)