# Restaurant Recommendation System
## Understand the business scenario and problem

Akeed is an app-based food delivery service in Omen that allows customers to order food from their favourite restaurants and have it delivered to their address. Akeed's vision is to be the delivery and discovery platform for everything people need instantly. Akeed approaches the dimensions of food delivery by taking the order, routing it to a restaurant, picking up the order and deliverying it to the customer.

Akeed wants to build a recommendation engine to predict what restaurants customers are most likely to order from given **the customer location, restaurant information and the customer order history**. This recommendation system will allow Akeed to customise restaurant recommendations for each of their customers and ensure more positive overall user experience. 

## Plan

### Akeed Dataset

There are 3 datasets for this project. 
* First dataset is `train_customers.csv` which contains customers' informations such as `akeed_customer_id`, `gender`, `date of birth (dob)` and `language`.
  
* Second dataset is `vendors.csv` which contains vendors' informations such as vendor's `id`, `vendor_category_en`, `serving_distance`, `preparation_time`, `rank`, `vendor_rating`, and `vendor_tag_name`.
  
* Third dataset is `orders.csv` which contains individual orders informations made by customers from respective vendors such as `akeed_order_id`, `customer_id`, `item_count`, `grand_total`, `vendor_id` and `LOCATION_NUMBER` of each customer.

### Imports 
* Import packages
* Load datasets

In [1]:
# import packages

# for data manipulation
import pandas as pd
import numpy as np


# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for displaying all of the columns and decimal places in dataframes
pd.set_option('display.max_columns',None)
pd.set_option('display.float_format','{:.4f}'.format)

# for data preprocessing and deep learning
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate

In [2]:
# load datasets into a dataframe
customer_train = pd.read_csv(r'/Porfolio Projects/Recommendation System/Restraurant Recommendation System/train_customers.csv')
vendors = pd.read_csv(r'/Porfolio Projects/Recommendation System/Restraurant Recommendation System/vendors.csv')
orders = pd.read_csv(r'/Porfolio Projects/Recommendation System/Restraurant Recommendation System/orders.csv',
                    usecols=['akeed_order_id','customer_id','item_count','grand_total','vendor_discount_amount','vendor_rating','deliverydistance','vendor_id','LOCATION_NUMBER','delivery_date','created_at'])

  orders = pd.read_csv(r'/Users/salai/Library/CloudStorage/OneDrive-Personal/Desktop/Google Advance Data Analytic/Porfolio Projects/Recommendation System/Restraurant Recommendation System/orders.csv',


In [3]:
# quick look at customer_train df
print(f'Shape of df: {customer_train.shape}')
customer_train.head(5)

Shape of df: (34674, 8)


Unnamed: 0,akeed_customer_id,gender,dob,status,verified,language,created_at,updated_at
0,TCHWPBT,Male,,1,1,EN,2018-02-07 19:16:23,2018-02-07 19:16:23
1,ZGFSYCZ,Male,,1,1,EN,2018-02-09 12:04:42,2018-02-09 12:04:41
2,S2ALZFL,Male,,0,1,EN,2018-03-14 18:31:43,2018-03-14 18:31:42
3,952DBJQ,Male,,1,1,EN,2018-03-15 19:47:07,2018-03-15 19:47:07
4,1IX6FXS,Male,,1,1,EN,2018-03-15 19:57:01,2018-03-15 19:57:01


In [4]:
# quick look at vendors df
print(f'Shape of df: {vendors.shape}')
vendors.head(5)

Shape of df: (100, 59)


Unnamed: 0,id,authentication_id,latitude,longitude,vendor_category_en,vendor_category_id,delivery_charge,serving_distance,is_open,OpeningTime,OpeningTime2,prepration_time,commission,is_akeed_delivering,discount_percentage,status,verified,rank,language,vendor_rating,sunday_from_time1,sunday_to_time1,sunday_from_time2,sunday_to_time2,monday_from_time1,monday_to_time1,monday_from_time2,monday_to_time2,tuesday_from_time1,tuesday_to_time1,tuesday_from_time2,tuesday_to_time2,wednesday_from_time1,wednesday_to_time1,wednesday_from_time2,wednesday_to_time2,thursday_from_time1,thursday_to_time1,thursday_from_time2,thursday_to_time2,friday_from_time1,friday_to_time1,friday_from_time2,friday_to_time2,saturday_from_time1,saturday_to_time1,saturday_from_time2,saturday_to_time2,primary_tags,open_close_flags,vendor_tag,vendor_tag_name,one_click_vendor,country_id,city_id,created_at,updated_at,device_type,display_orders
0,4,118597.0,-0.5886,0.7544,Restaurants,2.0,0.0,6.0,1.0,11:00AM-11:30PM,-,15,0.0,Yes,0.0,1.0,1,11,EN,4.4,00:00:00,00:30:00,08:00:00,23:59:00,00:00:00,00:30:00,08:00:00,23:59:00,00:00:00,00:30:00,08:00:00,23:59:00,00:00:00,00:30:00,08:00:00,23:59:00,00:00:00,00:30:00,08:00:00,23:59:00,00:00:00,00:30:00,10:00:00,23:59:00,00:00:00,00:30:00,10:00:00,23:59:00,"{""primary_tags"":""4""}",1.0,2458912212241623,"Arabic,Breakfast,Burgers,Desserts,Free Deliver...",Y,1.0,1.0,2018-01-30 14:42:04,2020-04-07 15:12:43,3,1
1,13,118608.0,-0.4717,0.7445,Restaurants,2.0,0.7,5.0,1.0,08:30AM-10:30PM,-,14,0.0,Yes,0.0,1.0,1,11,EN,4.7,00:00:00,01:30:00,08:00:00,23:59:00,00:00:00,01:30:00,08:00:00,23:59:00,00:00:00,01:30:00,08:00:00,23:59:00,00:00:00,01:30:00,08:00:00,19:30:00,00:00:00,01:30:00,08:00:00,19:30:00,00:00:00,01:30:00,08:00:00,23:59:00,00:00:00,01:30:00,08:00:00,23:59:00,"{""primary_tags"":""7""}",1.0,44151342715241628,"Breakfast,Cakes,Crepes,Italian,Pasta,Pizzas,Sa...",Y,1.0,1.0,2018-05-03 12:32:06,2020-04-05 20:46:03,3,1
2,20,118616.0,-0.4075,0.6437,Restaurants,2.0,0.0,8.0,1.0,08:00AM-10:45PM,-,19,0.0,Yes,0.0,1.0,1,1,EN,4.5,08:00:00,22:45:00,,,08:00:00,22:45:00,,,08:00:00,22:45:00,,,08:00:00,22:45:00,,,08:00:00,22:45:00,,,08:00:00,22:45:00,,,08:00:00,22:45:00,,,"{""primary_tags"":""71""}",1.0,489110,"Breakfast,Desserts,Free Delivery,Indian",Y,1.0,1.0,2018-05-04 22:28:22,2020-04-07 16:35:55,3,1
3,23,118619.0,-0.5854,0.7538,Restaurants,2.0,0.0,5.0,1.0,10:59AM-10:30PM,-,16,0.0,Yes,0.0,1.0,1,11,EN,4.5,09:00:00,23:30:00,,,09:00:00,23:30:00,,,09:00:00,23:30:00,,,09:00:00,23:30:00,,,09:00:00,23:45:00,,,09:00:00,23:45:00,,,09:00:00,23:45:00,,,"{""primary_tags"":""46""}",1.0,583024,"Burgers,Desserts,Fries,Salads",Y,1.0,1.0,2018-05-06 19:20:48,2020-04-02 00:56:17,3,1
4,28,118624.0,0.4806,0.5529,Restaurants,2.0,0.7,15.0,1.0,11:00AM-11:45PM,-,10,0.0,Yes,0.0,1.0,1,11,EN,4.4,00:01:00,00:30:00,11:00:00,23:59:00,00:01:00,00:30:00,11:00:00,23:59:00,00:01:00,00:30:00,11:00:00,23:59:00,00:01:00,00:30:00,11:00:00,23:59:00,00:01:00,00:30:00,11:00:00,23:59:00,00:01:00,01:30:00,17:45:00,23:59:00,00:01:00,01:30:00,17:45:00,23:59:00,"{""primary_tags"":""32""}",1.0,5,Burgers,Y,1.0,1.0,2018-05-17 22:12:38,2020-04-05 15:57:41,3,1


In [5]:
# quick look at orders df
print(f'Shape of df: {orders.shape}')
orders.head(5)

Shape of df: (135303, 11)


Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,vendor_rating,deliverydistance,delivery_date,vendor_id,created_at,LOCATION_NUMBER
0,163238.0,92PEE24,1.0,7.6,0.0,,0.0,2019-07-31 05:30:00,105,2019-08-01 05:30:16,0
1,163240.0,QS68UD8,1.0,8.7,0.0,,0.0,2019-07-31 05:30:00,294,2019-08-01 05:31:10,0
2,163241.0,MB7VY5F,2.0,14.4,0.0,,0.0,2019-07-31 05:30:00,83,2019-08-01 05:31:33,0
3,163244.0,KDJ951Y,1.0,7.1,0.0,,0.0,2019-07-31 05:30:00,90,2019-08-01 05:34:54,0
4,163245.0,BAL0RVT,4.0,27.2,0.0,,0.0,2019-07-31 05:30:00,83,2019-08-01 05:35:51,0


There are some missing values in some of the columns in all three datasets. In the following section, we will select important features that could help to build content-based recommendation system and remove unimportant columns with missing values.

## Analyze

### Exploratory Data Analysis
* Understand variables
* Clean the dataset

#### Examine vendors_df

In [6]:
# select features for the vendors df
vendors_df = vendors[['id','vendor_category_en','serving_distance','prepration_time','rank','language','vendor_rating','vendor_tag_name']]

vendors_df.head(5)

Unnamed: 0,id,vendor_category_en,serving_distance,prepration_time,rank,language,vendor_rating,vendor_tag_name
0,4,Restaurants,6.0,15,11,EN,4.4,"Arabic,Breakfast,Burgers,Desserts,Free Deliver..."
1,13,Restaurants,5.0,14,11,EN,4.7,"Breakfast,Cakes,Crepes,Italian,Pasta,Pizzas,Sa..."
2,20,Restaurants,8.0,19,1,EN,4.5,"Breakfast,Desserts,Free Delivery,Indian"
3,23,Restaurants,5.0,16,11,EN,4.5,"Burgers,Desserts,Fries,Salads"
4,28,Restaurants,15.0,10,11,EN,4.4,Burgers


In [7]:
# select features for customer_train df

customer_df = customer_train[['akeed_customer_id','gender','language']]

customer_df.head(5)

Unnamed: 0,akeed_customer_id,gender,language
0,TCHWPBT,Male,EN
1,ZGFSYCZ,Male,EN
2,S2ALZFL,Male,EN
3,952DBJQ,Male,EN
4,1IX6FXS,Male,EN


In [8]:
# select features for orders df

orders_df = orders.drop(['vendor_rating'], axis=1)

orders_df.head(5)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,delivery_date,vendor_id,created_at,LOCATION_NUMBER
0,163238.0,92PEE24,1.0,7.6,0.0,0.0,2019-07-31 05:30:00,105,2019-08-01 05:30:16,0
1,163240.0,QS68UD8,1.0,8.7,0.0,0.0,2019-07-31 05:30:00,294,2019-08-01 05:31:10,0
2,163241.0,MB7VY5F,2.0,14.4,0.0,0.0,2019-07-31 05:30:00,83,2019-08-01 05:31:33,0
3,163244.0,KDJ951Y,1.0,7.1,0.0,0.0,2019-07-31 05:30:00,90,2019-08-01 05:34:54,0
4,163245.0,BAL0RVT,4.0,27.2,0.0,0.0,2019-07-31 05:30:00,83,2019-08-01 05:35:51,0


In [9]:
# check for duplicates in vendors_df

vendors_df.duplicated().sum()

0

In [10]:
# examine for the missing values in vendors_df 

vendors_df.isna().sum()

id                     0
vendor_category_en     0
serving_distance       0
prepration_time        0
rank                   0
language              15
vendor_rating          0
vendor_tag_name        3
dtype: int64

In [11]:
# examine the 'language' columns of vendors_df

vendors_df['language'].value_counts()

language
EN    85
Name: count, dtype: int64

Since 85 of the vendors use english on the app and no other languages, we will fill the missing `language` values in 15 vendors with `EN`.

In [12]:
# replace the missing values in 'language' column using 'EN'

vendors_df.loc[:,'language'] = vendors_df['language'].fillna('EN')

vendors_df['language'].value_counts()

language
EN    100
Name: count, dtype: int64

In [13]:
# replace the missing values in 'vendor_tag_name' in vendors_df with empty string

vendors_df.loc[:,'vendor_tag_name'] = vendors_df['vendor_tag_name'].fillna('')

vendors_df.isna().sum()

id                    0
vendor_category_en    0
serving_distance      0
prepration_time       0
rank                  0
language              0
vendor_rating         0
vendor_tag_name       0
dtype: int64

#### Tokenize Text Column

The feature `vendor_tag_name` is text-based. It is not a categorical variable, since it does not have a fixed number of possible values. Using `CountVectorizer`, I will extract the numerical features from it through a bag-of-words that is split by the `,`. This extracted feature will give what type of food or drinks that each vendors sell at their restaurants.

In [14]:
# to extract text
from sklearn.feature_extraction.text import CountVectorizer

# define a custom tokenizer
def custom_tokenizer(text):
    return text.split(',')
    
# set up a 'CountVectorizer' object, which converts a collection of text to a matrix of token counts
count_vec = CountVectorizer(tokenizer=custom_tokenizer,
                            max_features=30,
                            stop_words='english',
                           token_pattern=None)
count_vec

In [15]:
# extract numberical features from 'vendor_tag_name' in vendors_df

count_data = count_vec.fit_transform(vendors_df['vendor_tag_name']).toarray()

count_data.shape

(100, 30)

In [16]:
# place the numerical representation of 'vendor_tag_name' into data

count_df = pd.DataFrame(data=count_data, columns=count_vec.get_feature_names_out())

count_df.head(3)

Unnamed: 0,american,arabic,asian,biryani,breakfast,burgers,cafe,cakes,coffee,desserts,donuts,free delivery,fresh juices,fries,grills,healthy food,hot dogs,ice creams,indian,kids meal,milkshakes,mojitos,pasta,pizzas,rice,salads,sandwiches,shawarma,smoothies,soups
0,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0
1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,1
2,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [17]:
# concatenate 'vendors_df' and 'count_df' to form the final dataframe

vendors_final = pd.concat([vendors_df.drop(['vendor_tag_name'], axis=1).reset_index(drop=True), count_df], axis=1)

vendors_final.head(3)

Unnamed: 0,id,vendor_category_en,serving_distance,prepration_time,rank,language,vendor_rating,american,arabic,asian,biryani,breakfast,burgers,cafe,cakes,coffee,desserts,donuts,free delivery,fresh juices,fries,grills,healthy food,hot dogs,ice creams,indian,kids meal,milkshakes,mojitos,pasta,pizzas,rice,salads,sandwiches,shawarma,smoothies,soups
0,4,Restaurants,6.0,15,11,EN,4.4,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0
1,13,Restaurants,5.0,14,11,EN,4.7,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,1
2,20,Restaurants,8.0,19,1,EN,4.5,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0


#### Encoding varables in the vendors dataset

In [18]:
# dummy encode categorical variables in vendors_df
vendors_final = pd.get_dummies(vendors_final, columns=['vendor_category_en','language'], dtype='int')

vendors_final.head(3)

Unnamed: 0,id,serving_distance,prepration_time,rank,vendor_rating,american,arabic,asian,biryani,breakfast,burgers,cafe,cakes,coffee,desserts,donuts,free delivery,fresh juices,fries,grills,healthy food,hot dogs,ice creams,indian,kids meal,milkshakes,mojitos,pasta,pizzas,rice,salads,sandwiches,shawarma,smoothies,soups,vendor_category_en_Restaurants,vendor_category_en_Sweets & Bakes,language_EN
0,4,6.0,15,11,4.4,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,1,0,1
1,13,5.0,14,11,4.7,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,0,1,1,0,1
2,20,8.0,19,1,4.5,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1


#### Examine the customer_df

In [19]:
# check for duplicates and missing values in customer_df
print('Duplicated data in customer_df:',customer_df.duplicated().sum())

print('Missing data in customer_df:\n',customer_df.isna().sum())

Duplicated data in customer_df: 143
Missing data in customer_df:
 akeed_customer_id        0
gender               12154
language             13575
dtype: int64


In [20]:
# examine 'gender' column in customer_df
customer_df['gender'].value_counts()

gender
Male          17815
male           2914
Female         1761
Female           13
Male              9
Female            2
Female            2
?????             2
Female            1
                  1
Name: count, dtype: int64

In [21]:
# normalize the gender values by converting all entries to lowercase
customer_df.loc[:,'gender'] = customer_df['gender'].str.lower()

# check the cleaned 'gender' column
print(customer_df['gender'].value_counts())

gender
male          20729
female         1761
female           13
male              9
female            2
female            2
?????             2
female            1
                  1
Name: count, dtype: int64


Dummies variables were created for `gender` column since it will be easier to add all the males into one column and females into one. When the customers has 0 for both male and female column, it will be known as the customer did not declare the gender on the app.

In [22]:
# create dummies variable for 'gender' in customer_df

customer_final = pd.get_dummies(customer_df, columns=['gender'], dtype='int')

customer_final.head(3)

Unnamed: 0,akeed_customer_id,language,gender_,gender_?????,gender_female,gender_female.1,gender_female.2,gender_female.3,gender_female.4,gender_male,gender_male.1
0,TCHWPBT,EN,0,0,0,0,0,0,0,1,0
1,ZGFSYCZ,EN,0,0,0,0,0,0,0,1,0
2,S2ALZFL,EN,0,0,0,0,0,0,0,1,0


In [23]:
# create a new column 'male' that include all the male customers
customer_final['male'] = customer_final.iloc[:,9] + customer_final.iloc[:,10]

customer_final['male'].sum()

20738

In [24]:
# create a new column 'female' that include all the female customers
customer_final['female'] = customer_final.iloc[:, 4:9].sum(axis=1)

customer_final['female'].sum()

1779

> The final customer_final will have columns for male and female. When both male and female columns contain both `0`, this suggests that customer's gender is unknown.

In [25]:
# clean the customer_final df

customer_final = customer_final[['akeed_customer_id','male','female']]

customer_final.head(5)

Unnamed: 0,akeed_customer_id,male,female
0,TCHWPBT,1,0
1,ZGFSYCZ,1,0
2,S2ALZFL,1,0
3,952DBJQ,1,0
4,1IX6FXS,1,0


In [26]:
# drop duplicated entries in customer_final df

customer_final = customer_final.drop_duplicates()

print('Missing Values in customer_final:\n', customer_final.isna().sum())
print('Duplicated Values in customer_final:', customer_final.duplicated().sum())

customer_final.head(5)

Missing Values in customer_final:
 akeed_customer_id    0
male                 0
female               0
dtype: int64
Duplicated Values in customer_final: 0


Unnamed: 0,akeed_customer_id,male,female
0,TCHWPBT,1,0
1,ZGFSYCZ,1,0
2,S2ALZFL,1,0
3,952DBJQ,1,0
4,1IX6FXS,1,0


#### Examine the orders_df dataset

In [27]:
# check for duplicates and missing values in orders_df

print('Missing Values in orders_df:\n', orders_df.isna().sum())
print('Duplicated Values in orders_df:', orders_df.duplicated().sum())

Missing Values in orders_df:
 akeed_order_id               70
customer_id                   0
item_count                 6925
grand_total                   0
vendor_discount_amount        0
deliverydistance              0
delivery_date             99759
vendor_id                     0
created_at                    0
LOCATION_NUMBER               0
dtype: int64
Duplicated Values in orders_df: 0


Since we will be building a neural network that predict an outcome, we will use `item_count` as our outcome to approximate whether a customer will purchase from a particular vendor. Since `orders_df` records all the purchase, we make an assumption that at least one item is purchase by the customers. Another possible outcome variable that we can use is the `grand_total`. However, upon closer examination, there are rows with `0.0` which may be hard to make assumption the exact amount the customer spent on the purchase from the vendor. Hence, the `item_count` is considered as our outcome variable in this project and missing `item_count` are filled with `1.0` count.

In [28]:
# since item_count column contains missing values, we will fill the missing values with 1.0 count

orders_df['item_count'] = orders_df['item_count'].fillna(1.0)

orders_df['item_count'].isna().sum()

0

In [29]:
# drop 'delivery_date' column since there are missing values and does not add information for the recommendation system
orders_df = orders_df.drop('delivery_date', axis=1)

orders_df.head(3)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER
0,163238.0,92PEE24,1.0,7.6,0.0,0.0,105,2019-08-01 05:30:16,0
1,163240.0,QS68UD8,1.0,8.7,0.0,0.0,294,2019-08-01 05:31:10,0
2,163241.0,MB7VY5F,2.0,14.4,0.0,0.0,83,2019-08-01 05:31:33,0


In [30]:
# examine missing values in the orders_df 
missingItem_df = orders_df[orders_df.isna().any(axis=1)]

missingItem_df.head(10)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER
94367,,H0KPCSI,3.0,0.0,0.0,7.54,401,2020-01-01 22:30:57,8
94369,,H0KPCSI,3.0,0.0,0.0,7.54,401,2020-01-01 22:31:09,8
94370,,H0KPCSI,3.0,0.0,0.0,7.54,401,2020-01-01 22:32:03,8
94373,,H0KPCSI,3.0,0.0,0.0,7.54,401,2020-01-01 22:32:26,8
94379,,H0KPCSI,3.0,0.0,0.0,7.54,401,2020-01-01 22:35:31,8
96129,,D54LNVL,1.0,0.0,0.0,3.8,386,2020-01-05 18:49:39,3
96130,,D54LNVL,1.0,0.0,0.0,3.8,386,2020-01-05 18:49:49,3
96132,,D54LNVL,1.0,0.0,0.0,3.8,386,2020-01-05 18:50:32,3
96990,,6Q5428S,1.0,0.0,0.0,3.96,845,2020-01-07 00:35:54,1
100559,,C1GBONQ,2.0,0.0,0.0,6.98,386,2020-01-15 21:59:25,0


In [31]:
# percentage of missing 'akeed_order_id'
perc_missing_order = ((orders_df['akeed_order_id'].isna().sum())/(len(orders_df))) *100

perc_missing_order

0.05173573387138497

In [32]:
# since missing values only account for 0.05% of the all data entries, we will remove missing values from order_final df
order_final = orders_df.dropna(axis=0, subset=['akeed_order_id'])

order_final.isna().sum()

akeed_order_id            0
customer_id               0
item_count                0
grand_total               0
vendor_discount_amount    0
deliverydistance          0
vendor_id                 0
created_at                0
LOCATION_NUMBER           0
dtype: int64

In [33]:
# checking again to ensure no missing or duplicated values in order_final df
print('Missing Values in order_final:\n', order_final.isna().sum())
print('Duplicated Values in order_final:', order_final.duplicated().sum())

Missing Values in order_final:
 akeed_order_id            0
customer_id               0
item_count                0
grand_total               0
vendor_discount_amount    0
deliverydistance          0
vendor_id                 0
created_at                0
LOCATION_NUMBER           0
dtype: int64
Duplicated Values in order_final: 0


#### Merging all three dataframes into one

In [34]:
# merge the order_final and customer_final into data_overall using 'customer_id'
data_overall = pd.merge(order_final, customer_final, left_on='customer_id', right_on='akeed_customer_id', how='left')

data_overall.head(5)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER,akeed_customer_id,male,female
0,163238.0,92PEE24,1.0,7.6,0.0,0.0,105,2019-08-01 05:30:16,0,92PEE24,1.0,0.0
1,163240.0,QS68UD8,1.0,8.7,0.0,0.0,294,2019-08-01 05:31:10,0,QS68UD8,0.0,0.0
2,163241.0,MB7VY5F,2.0,14.4,0.0,0.0,83,2019-08-01 05:31:33,0,MB7VY5F,0.0,0.0
3,163244.0,KDJ951Y,1.0,7.1,0.0,0.0,90,2019-08-01 05:34:54,0,KDJ951Y,1.0,0.0
4,163245.0,BAL0RVT,4.0,27.2,0.0,0.0,83,2019-08-01 05:35:51,0,BAL0RVT,1.0,0.0


In [35]:
data_overall.isna().sum()

akeed_order_id               0
customer_id                  0
item_count                   0
grand_total                  0
vendor_discount_amount       0
deliverydistance             0
vendor_id                    0
created_at                   0
LOCATION_NUMBER              0
akeed_customer_id         3276
male                      3276
female                    3276
dtype: int64

In [36]:
# check the missing rows in data_overall

missingData = data_overall[data_overall.isna().any(axis=1)]

missingData.head(10)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER,akeed_customer_id,male,female
43,163315.0,UINGJGR,3.0,8.5,0.0,0.0,180,2019-08-01 16:49:53,0,,,
134,163500.0,G5108U3,1.0,8.4,0.0,0.0,113,2019-08-01 20:00:38,0,,,
136,163502.0,G5108U3,2.0,16.0,0.0,0.0,113,2019-08-01 20:02:34,0,,,
205,163633.0,RYV1HLM,3.0,6.3,0.0,0.0,231,2019-08-01 23:37:37,0,,,
208,163638.0,NSE2CP5,2.0,14.1,0.0,0.0,85,2019-08-01 23:56:31,0,,,
217,163664.0,H1ROQQD,2.0,12.7,0.0,0.0,207,2019-08-02 01:12:24,0,,,
244,163717.0,PQB9ZOG,4.0,18.1,0.0,0.0,243,2019-08-02 02:35:58,0,,,
245,163719.0,6O8CYWX,4.0,16.0,0.0,0.0,78,2019-08-02 02:37:25,0,,,
423,164109.0,OK6N5SP,2.0,12.0,0.0,0.0,113,2019-08-02 22:16:43,0,,,
438,164142.0,ABAV063,4.0,25.6,0.0,0.0,203,2019-08-02 23:21:21,0,,,


In [37]:
# Check if values in 'missingData' exist in 'customer_final'
exists = missingData['customer_id'].isin(customer_final['akeed_customer_id'])

# Print results
for customer_id, exists_flag in zip(missingData['customer_id'], exists):
    if exists_flag:
        print(f"{customer_id} exists in customer_final df")
    else:
        print(f"{customer_id} does not exist in customer_final df")

UINGJGR does not exist in customer_final df
G5108U3 does not exist in customer_final df
G5108U3 does not exist in customer_final df
RYV1HLM does not exist in customer_final df
NSE2CP5 does not exist in customer_final df
H1ROQQD does not exist in customer_final df
PQB9ZOG does not exist in customer_final df
6O8CYWX does not exist in customer_final df
OK6N5SP does not exist in customer_final df
ABAV063 does not exist in customer_final df
0SIOQGR does not exist in customer_final df
QVZ1NPG does not exist in customer_final df
ZQ48UKH does not exist in customer_final df
3195FNI does not exist in customer_final df
88R02RZ does not exist in customer_final df
RYV1HLM does not exist in customer_final df
91OOARV does not exist in customer_final df
4G19ENH does not exist in customer_final df
G023BLI does not exist in customer_final df
8PKVPJV does not exist in customer_final df
1LQL8IF does not exist in customer_final df
LVSDBMQ does not exist in customer_final df
9M4S9MG does not exist in custom

In [38]:
# since 3276 users in order_df do not exist in customer_final, we will remove these users from the dataframe

data_overall = data_overall.dropna(axis=0, subset=['akeed_customer_id'])

print("Missing values in data_overall:\n",data_overall.isna().sum())
print("Duplicated values in data_overall:", data_overall.duplicated().sum())

Missing values in data_overall:
 akeed_order_id            0
customer_id               0
item_count                0
grand_total               0
vendor_discount_amount    0
deliverydistance          0
vendor_id                 0
created_at                0
LOCATION_NUMBER           0
akeed_customer_id         0
male                      0
female                    0
dtype: int64
Duplicated values in data_overall: 0


In [39]:
# we will merge the data_overall with vendors_final 

data_overall = pd.merge(data_overall, vendors_final, left_on='vendor_id', right_on='id', how='left')

data_overall.head(3)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER,akeed_customer_id,male,female,id,serving_distance,prepration_time,rank,vendor_rating,american,arabic,asian,biryani,breakfast,burgers,cafe,cakes,coffee,desserts,donuts,free delivery,fresh juices,fries,grills,healthy food,hot dogs,ice creams,indian,kids meal,milkshakes,mojitos,pasta,pizzas,rice,salads,sandwiches,shawarma,smoothies,soups,vendor_category_en_Restaurants,vendor_category_en_Sweets & Bakes,language_EN
0,163238.0,92PEE24,1.0,7.6,0.0,0.0,105,2019-08-01 05:30:16,0,92PEE24,1.0,0.0,105,15.0,12,11,4.5,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1
1,163240.0,QS68UD8,1.0,8.7,0.0,0.0,294,2019-08-01 05:31:10,0,QS68UD8,0.0,0.0,294,10.0,15,11,4.4,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
2,163241.0,MB7VY5F,2.0,14.4,0.0,0.0,83,2019-08-01 05:31:33,0,MB7VY5F,0.0,0.0,83,15.0,15,11,4.2,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1


In [40]:
#check for missing and duplicated values again for data_overall df
print("Missing values in data_overall:\n",data_overall.isna().sum())
print("Duplicated values in data_overall:", data_overall.duplicated().sum())

Missing values in data_overall:
 akeed_order_id                       0
customer_id                          0
item_count                           0
grand_total                          0
vendor_discount_amount               0
deliverydistance                     0
vendor_id                            0
created_at                           0
LOCATION_NUMBER                      0
akeed_customer_id                    0
male                                 0
female                               0
id                                   0
serving_distance                     0
prepration_time                      0
rank                                 0
vendor_rating                        0
american                             0
arabic                               0
asian                                0
biryani                              0
breakfast                            0
burgers                              0
cafe                                 0
cakes                          

In [41]:
# drop 'akeed_customer_id' and 'id' from data_overall

data_overall = data_overall.drop(['akeed_customer_id','id'], axis=1)

data_overall.head(3)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER,male,female,serving_distance,prepration_time,rank,vendor_rating,american,arabic,asian,biryani,breakfast,burgers,cafe,cakes,coffee,desserts,donuts,free delivery,fresh juices,fries,grills,healthy food,hot dogs,ice creams,indian,kids meal,milkshakes,mojitos,pasta,pizzas,rice,salads,sandwiches,shawarma,smoothies,soups,vendor_category_en_Restaurants,vendor_category_en_Sweets & Bakes,language_EN
0,163238.0,92PEE24,1.0,7.6,0.0,0.0,105,2019-08-01 05:30:16,0,1.0,0.0,15.0,12,11,4.5,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1
1,163240.0,QS68UD8,1.0,8.7,0.0,0.0,294,2019-08-01 05:31:10,0,0.0,0.0,10.0,15,11,4.4,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
2,163241.0,MB7VY5F,2.0,14.4,0.0,0.0,83,2019-08-01 05:31:33,0,0.0,0.0,15.0,15,11,4.2,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1


#### Feature Engineering

Since vendors have categorical variables of type of food or drinks that they sell, we can use this information to understand what types of food and drinks are purchased most by each customer. This will give an food preference profile of individual customer.

In [42]:
# find out how often the customers order from certain vendor_tag_name
columns_to_agg = data_overall.columns[15:45]

user_prefer = pd.DataFrame(data_overall.groupby('customer_id')[columns_to_agg].agg('sum').reset_index(names='customer_id'))

user_prefer.head(5)

Unnamed: 0,customer_id,american,arabic,asian,biryani,breakfast,burgers,cafe,cakes,coffee,desserts,donuts,free delivery,fresh juices,fries,grills,healthy food,hot dogs,ice creams,indian,kids meal,milkshakes,mojitos,pasta,pizzas,rice,salads,sandwiches,shawarma,smoothies,soups
0,000THBA,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
1,002510Y,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
2,005ECL6,2,0,0,0,0,2,0,0,0,2,2,0,0,2,0,0,0,0,0,0,0,0,2,0,0,2,2,0,0,0
3,0075AM7,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
4,009UFS1,1,2,0,0,2,3,0,0,0,2,0,0,0,2,1,0,0,0,0,1,0,0,0,1,0,1,2,2,0,0


In [43]:
# merge the user preference df to the data_overall
data_final = pd.merge(data_overall, user_prefer, on='customer_id', how='left',suffixes=('_l','_user'))

data_final.head(3)

Unnamed: 0,akeed_order_id,customer_id,item_count,grand_total,vendor_discount_amount,deliverydistance,vendor_id,created_at,LOCATION_NUMBER,male,female,serving_distance,prepration_time,rank,vendor_rating,american_l,arabic_l,asian_l,biryani_l,breakfast_l,burgers_l,cafe_l,cakes_l,coffee_l,desserts_l,donuts_l,free delivery_l,fresh juices_l,fries_l,grills_l,healthy food_l,hot dogs_l,ice creams_l,indian_l,kids meal_l,milkshakes_l,mojitos _l,pasta_l,pizzas_l,rice_l,salads_l,sandwiches_l,shawarma_l,smoothies_l,soups_l,vendor_category_en_Restaurants,vendor_category_en_Sweets & Bakes,language_EN,american_user,arabic_user,asian_user,biryani_user,breakfast_user,burgers_user,cafe_user,cakes_user,coffee_user,desserts_user,donuts_user,free delivery_user,fresh juices_user,fries_user,grills_user,healthy food_user,hot dogs_user,ice creams_user,indian_user,kids meal_user,milkshakes_user,mojitos _user,pasta_user,pizzas_user,rice_user,salads_user,sandwiches_user,shawarma_user,smoothies_user,soups_user
0,163238.0,92PEE24,1.0,7.6,0.0,0.0,105,2019-08-01 05:30:16,0,1.0,0.0,15.0,12,11,4.5,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,4,0,0,0,0,4,0,0,0,0,0,4,0,0,0,0,4,0,0,0,0,0,4,0,0,0,0,0,0,0
1,163240.0,QS68UD8,1.0,8.7,0.0,0.0,294,2019-08-01 05:31:10,0,0.0,0.0,10.0,15,11,4.4,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,2,0,0,0,1,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,163241.0,MB7VY5F,2.0,14.4,0.0,0.0,83,2019-08-01 05:31:33,0,0.0,0.0,15.0,15,11,4.2,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,4,0,0,4,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0


## Construct

The `data_final` dataframe is separated into `vendor_array`, `user_array` and `y_array` so that `vendor_array` will be fitted into a neural network that predict `y_array` using vendor's information and likewise for the `user_array`. The dataframes are converted to numpy array for efficient computation.

In [44]:
# extract columns from data_final to create numpy array for vendors' characteristic for each order
vendor_features = ['akeed_order_id','vendor_id','serving_distance','prepration_time','rank','vendor_rating']+data_final.columns[15:47].tolist()

vendor_array = data_final[vendor_features].to_numpy()

print(vendor_array.shape)

vendor_array[:3]

(131970, 38)


array([[1.63238e+05, 1.05000e+02, 1.50000e+01, 1.20000e+01, 1.10000e+01,
        4.50000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 0.00000e+00],
       [1.63240e+05, 2.94000e+02, 1.00000e+01, 1.50000e+01, 1.10000e+01,
        4.40000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 1.00000e+00, 0.00000e+00, 0.00000e+00,
        1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
        0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
   

In [45]:
# extract columns from data_final to create numpy array for customers' characteristic for each order
user_features = ['akeed_order_id','customer_id','LOCATION_NUMBER','male','female']+data_final.columns[48:].tolist()

user_array = data_final[user_features].to_numpy()

print(user_array.shape)

user_array[:3]

(131970, 35)


array([[163238.0, '92PEE24', 0, 1.0, 0.0, 4, 0, 0, 0, 0, 4, 0, 0, 0, 0,
        0, 4, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
       [163240.0, 'QS68UD8', 0, 0.0, 0.0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1,
        0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [163241.0, 'MB7VY5F', 0, 0.0, 0.0, 0, 4, 0, 0, 4, 0, 0, 0, 0, 4,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0]],
      dtype=object)

In [46]:
# extract column 'item_count' from data_final to create y_array for the model prediction
y_array = data_final['item_count'].to_numpy()

print(y_array.shape)

y_array[:3]

(131970,)


array([1., 1., 2.])

In [47]:
# create a vendor dictionary that contain vendor ID as keys and vendor's category and tag_name as values

vendor_dict = vendors_df[['id','vendor_category_en','vendor_tag_name']].set_index('id')

vendor_dict = vendor_dict.to_dict(orient='index')

vendor_dict

{4: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'Arabic,Breakfast,Burgers,Desserts,Free Delivery,Grills,Lebanese,Salads,Sandwiches,Shawarma'},
 13: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'Breakfast,Cakes,Crepes,Italian,Pasta,Pizzas,Salads,Sandwiches,Soups'},
 20: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'Breakfast,Desserts,Free Delivery,Indian'},
 23: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'Burgers,Desserts,Fries,Salads'},
 28: {'vendor_category_en': 'Restaurants', 'vendor_tag_name': 'Burgers'},
 33: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'Desserts,Mexican'},
 43: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'American,Burgers,Fries,Sandwiches'},
 44: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'American,Burgers,Fries,Sandwiches'},
 55: {'vendor_category_en': 'Restaurants',
  'vendor_tag_name': 'Breakfast,Desserts,Grills,Milkshakes,Salads,Sandwiches,Soups'},
 66: {'v

In [48]:
# set configuration variables
num_user_features = user_array.shape[1]-2 #remove order_id and customer_id during training
num_vendor_features = vendor_array.shape[1]-2 #remove order_id and vendor_id during training
v_s = 2 #start of columns to use in training, vendor
c_s = 2 #start of columns to use in training, customer 

#### Preparing the training data

In [49]:
#scale training data
vendor_train_unscaled = vendor_array
user_train_unscaled = user_array
y_train_unscaled = y_array

scalerVendor = StandardScaler()
scalerVendor.fit(vendor_array[:,v_s:])
vendor_train = scalerVendor.transform(vendor_array[:,v_s:])

scalerUser = StandardScaler()
scalerUser.fit(user_array[:, c_s:])
user_train = scalerUser.transform(user_array[:,c_s:])

scalerTarget = MinMaxScaler((-1,1))
scalerTarget.fit(y_array.reshape(-1,1))
y_train = scalerTarget.transform(y_array.reshape(-1,1))

print(np.allclose(vendor_train_unscaled[:,v_s:].astype(np.float64), scalerVendor.inverse_transform(vendor_train)))
print(np.allclose(user_train_unscaled[:,c_s:].astype(np.float64), scalerUser.inverse_transform(user_train)))

True
True


In [50]:
# split the training datasets into training and test sets
vendor_tr, vendor_test = train_test_split(vendor_train, test_size=0.2, shuffle=True, random_state=1)
user_tr, user_test = train_test_split(user_train, test_size=0.2, shuffle=True, random_state=1)
y_tr, y_test = train_test_split(y_train, test_size=0.2, shuffle=True, random_state=1)

print(f"vendor training data shape: {vendor_tr.shape}")
print(f"vendor test data shape: {vendor_test.shape}")

vendor training data shape: (105576, 36)
vendor test data shape: (26394, 36)


### Building Neural Network for content-based filtering

In [51]:
# building NN model

# create a custom Keras layer
from tensorflow.keras.layers import Layer

@tf.keras.utils.register_keras_serializable()
class L2NormalizeLayer(tf.keras.layers.Layer):
    def __init__(self, axis=1, **kwargs):
        super(L2NormalizeLayer, self).__init__(**kwargs)
        self.axis = axis

    def call(self, inputs):
        return tf.linalg.l2_normalize(inputs, axis=self.axis)
    
    def get_config(self):
        config = super(L2NormalizeLayer, self).get_config()
        config.update({"axis": self.axis})
        return config

num_outputs = 32
tf.random.set_seed(1)
vendor_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'), 
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
])

user_NN = tf.keras.models.Sequential([
    tf.keras.layers.Dense(256, activation='relu'), 
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
])

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))
vu = user_NN(input_user)
vu = L2NormalizeLayer(axis=1)(vu)

# create the vendor input and point to the base network
input_vendor = tf.keras.layers.Input(shape=(num_vendor_features,))
vv = vendor_NN(input_vendor)
vv = L2NormalizeLayer(axis=1)(vv)

# compute dot product of two vectors vu and vv
output = tf.keras.layers.Dot(axes=1)([vu,vv])

# specify the inputs and output of the model
model = tf.keras.Model([input_user, input_vendor], output)

model.summary()

In [52]:
# compile the model with optimizer and loss function
tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt,
             loss=cost_fn)

In [53]:
# train the model
tf.random.set_seed(1)
#model.fit([user_tr, vendor_tr],y_tr, epochs=30)

In [54]:
# folder path where the fitted model will be saved at
path = '/Porfolio Projects/Recommendation System/Restraurant Recommendation System/tf_model.keras'

#model.save(path)

In [55]:
# load the keras format model
model = tf.keras.models.load_model(path,  custom_objects={'L2NormalizeLayer': L2NormalizeLayer})

In [56]:
model.evaluate([user_test, vendor_test],y_test)

[1m825/825[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 309us/step - loss: 0.0021


0.002099767094478011

> It is comparable to the training loss indicating the model has not substantially overfit the training data.

### Predictions

#### Predictions for an existing user

In [57]:
# define a function to get customer's info with matrix that match the size of vendor_dict
def get_user_vecs(user_id, user_array, vendors_final, data_final):
    """
    given a user_id, return:
    user train/predict matrix to match the size of vendor_final
    y vector with item counts for all vendors that were purchased by the user and 0 for others of the size vendor_final
    """
    if not user_id in data_final['customer_id'].values:
        print("error: unknown user id")
        return None
    else:
        user_vec_found=False
        for i in range(len(user_array)):
            if user_array[i,1] == user_id:
                user_vec = user_array[i]
                user_vec_found=True
                break
        if not user_vec_found:
            print("error in get_user_vecs, did not find user id in user_array")
        num_items = len(vendors_final)
        user_vecs = np.tile(user_vec, (num_items,1))
        
        y = np.zeros(num_items)
        user_data = data_final[data_final['customer_id'] == user_id]
        for i in range(num_items):
            vendor_id = vendors_final.iloc[i,0]
            if vendor_id in user_data['vendor_id'].values:
                item_count = user_data.loc[user_data['vendor_id']==vendor_id,'item_count'].max()
            else:
                item_count = 0
            y[i] = item_count
    return (user_vecs, y)

In [58]:
def print_existing_user(y_p, y, user, vendors, vendor_dict, maxcount=10):
    """
    print results of prediction for a user who was in the database.
    Inputs are expected to be in sorted order, unscaled.
    """
    count = 0
    disp = [['y_p','y','customer_id','vendor_id','vendor_category_en','vendor_tag_name']]
    count = 0
    for i in range(0, y.shape[0]):
        if y[i, 0] != 0:
            if count == maxcount:
                break
            count += 1
            vendor_id = vendors[i, 0].astype(int)

            disp.append([
                y_p[i,0], y[i,0],
                user[i,1], np.round(vendor_id.astype(int)),
                vendor_dict[vendor_id]['vendor_category_en'],
                vendor_dict[vendor_id]['vendor_tag_name']
            ])
    table = tabulate.tabulate(disp, tablefmt='html', headers='firstrow', floatfmt=[".1f", ".1f", ".0f", ".2f", ".1f"])
    return table

Let's look at the predictions for a customer with id `EE6DB8A`. We will compare the predicted item_count with dataset original item_count.

In [59]:
customer_id = 'EE6DB8A'

# form a set of user vectors.
user_vecs, y_vecs = get_user_vecs(customer_id, user_train_unscaled, vendors_final, data_final)

# scale our user and item vectors
vendors_vecs = vendors_final.drop('language_EN',axis=1)
vendors_vecs = vendors_vecs.to_numpy()
suser_vecs = scalerUser.transform(user_vecs[:, c_s:])
svendor_vecs = scalerVendor.transform(vendors_vecs[:,1:])

# make a prediction
y_p = model.predict([suser_vecs,svendor_vecs])

# unscale y prediction
y_pu = np.round(scalerTarget.inverse_transform(y_p))

# sort the results, highest prediction first
sorted_index = np.argsort(-y_pu, axis=0).reshape(-1).tolist()
sorted_ypu = y_pu[sorted_index]
sorted_vendors = vendors_vecs[sorted_index]
sorted_user = user_vecs[sorted_index]
sorted_y = y_vecs[sorted_index]

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step


In [60]:
# print sorted predictions for items purchased from vendors by consumer
print_existing_user(sorted_ypu, sorted_y.reshape(-1,1), sorted_user, sorted_vendors, vendor_dict, maxcount=50)

y_p,y,customer_id,vendor_id,vendor_category_en,vendor_tag_name
3.0,2.0,EE6DB8A,191.0,Restaurants,"Fresh Juices,Milkshakes,Mojitos ,Sandwiches,Shawarma"
3.0,1.0,EE6DB8A,106.0,Restaurants,"American,Burgers,Free Delivery,Hot Dogs,Pasta"


The model prediction is within 1 to 2 of the actual item count for this customer. Future tuning of the model is required to get a better predictive item_count.

## Evaluate

#### Finding similar vendors to Vendor ID #79 for customer #EE6DB8A

The neural network above produces two feature vectors, a user feature vector and a vendor feature vector. These are 32 entry vectors whose values are difficult to interpret. However, similar vendors will have similar vectors. This information can be used to make recommendation. For example, if a customer/user has bought from `vendor #120`, one could recommend similar vendors by selecting vendors with similar vendor feature vectors.

In [61]:
# write a function to compute the square distance

def sq_dist(a,b):
    """
    Returns the squared distance between two vectors
    Args:
        a (ndarray (n,)): vector with n features
        b (ndarray (n,)): vector with n features
    Returns:
        d (float) : distance
    """
    d = np.sum(np.square(a-b))
    return d

We will use the trained `vendor_NN` and build a small model to allow us to run the vendor vectors through it. 

In [62]:
#input layer
input_vendor_v = tf.keras.layers.Input(shape=(num_vendor_features,))

#use the trained vendor_NN
vv_v = vendor_NN(input_vendor_v)

# incorporate normalization as was done in the original model
vv_v = L2NormalizeLayer(axis=1)(vv_v)
model_v = tf.keras.Model(input_vendor_v, vv_v)
model_v.summary()

In [63]:
# vectors for all the vendors in the data set
scaled_vendors_vecs = scalerVendor.transform(vendors_vecs[:,1:])

vv_all = model_v.predict(scaled_vendors_vecs)

print(f"size of all predicted vendors' features: {vv_all.shape}")

[1m4/4[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step 
size of all predicted vendors' features: (100, 32)


In [64]:
vendors_vecs

array([[  4.,   6.,  15., ...,   0.,   1.,   0.],
       [ 13.,   5.,  14., ...,   1.,   1.,   0.],
       [ 20.,   8.,  19., ...,   0.,   1.,   0.],
       ...,
       [856.,   7.,  10., ...,   0.,   1.,   0.],
       [858.,   3.,  10., ...,   0.,   1.,   0.],
       [907.,  12.,  20., ...,   0.,   1.,   0.]])

In [65]:
# vectors for vendor_id #191 that was ordered by customer_id #EE6DB8A
vendor_id = 191

# Find the row where the first column matches the vendor_id
vendor191_row = vendors_vecs[vendors_vecs[:, 0] == vendor_id]

scaled_vendor191 = scalerVendor.transform(vendor191_row[:,1:])

vv_191 = model_v.predict(scaled_vendor191)

print(f"size of vendor 191 features: {vv_191.shape}")

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step
size of vendor 191 features: (1, 32)


In [66]:
import numpy.ma as ma

dim = len(vv_all)
dist = np.zeros((dim,1))

for i in range(dim):
    dist[i] = sq_dist(vv_all[i,:], vv_191)

sorted_index = np.argsort(-dist, axis=0).reshape(-1).tolist()
sorted_vendors = vendors_vecs[sorted_index]
sorted_vendors_id = sorted_vendors[:,0]

disp_recom = [['Vendor_id','Vendor_Category','Vendor_Tag_Name']]

for vendor_id in sorted_vendors_id[:50]:
    disp_recom.append([
        vendor_id,
        vendor_dict[vendor_id]['vendor_category_en'],
        vendor_dict[vendor_id]['vendor_tag_name']
    ])

recommend_table = tabulate.tabulate(disp_recom, tablefmt='html', headers='firstrow', floatfmt=[".1f", ".1f", ".0f", ".2f", ".1f"])

recommend_table

Vendor_id,Vendor_Category,Vendor_Tag_Name
265.0,Sweets & Bakes,"Desserts,Free Delivery,Fresh Juices,Healthy Food,Ice creams,Milkshakes,Mojitos"
856.0,Restaurants,"American,Breakfast,Burgers,Cafe,Desserts,Free Delivery,Fries,Ice creams,Kids meal,Salads"
115.0,Sweets & Bakes,"Desserts,Free Delivery,Healthy Food,Sweets"
846.0,Restaurants,"American,Breakfast,Burgers,Cafe,Desserts,Free Delivery,Fries,Ice creams,Kids meal,Salads"
858.0,Restaurants,"American,Breakfast,Burgers,Cafe,Desserts,Free Delivery,Fries,Ice creams,Kids meal,Salads"
113.0,Restaurants,"Arabic,Desserts,Free Delivery,Indian"
44.0,Restaurants,"American,Burgers,Fries,Sandwiches"
237.0,Restaurants,"American,Burgers,Desserts,Donuts,Fries,Pasta,Salads,Sandwiches"
67.0,Restaurants,"Breakfast,Desserts,Grills,Milkshakes,Salads,Sandwiches,Soups"
849.0,Restaurants,"American,Breakfast,Burgers,Cafe,Desserts,Free Delivery,Fries,Ice creams,Kids meal,Salads"


The above table shows that the model will generally suggest a vendor that sell similar type of food or drinks that the customer purchased from. In this case, since customer `#EE6DB8A` bought from a vendor that sells Fresh Juices, Milkshakes, Mojitos, Sandwiches, Shawarma, the model suggested vendors that also sell fresh juices, sandwishes, desserts and smoothies.

### Conclusion

1. Although this current recommendation system may require further tuning, it can help to recommend restaurants that may most likely be purchased by the existing customers. This is because the system suggests vendors that sell similar food or drinks that the customers had previously bought from another restaurant.
2. With feature engineering, I was able to create a profile of food preferences for the customers. Together with this profile, customer's gender and customer's location, the user neural network is trained as part of the recommendation system. Hence, we can use the recommendation system for the new user of the app to suggest possible restaurants that they may want to purchase their food.
  > To recommend vendors to the new users, the new user's food preferences, gender and location must be collected to provide initial suggestions. Therefore, when the new users register for the app, we may want to ask the users to fill in the survey form so that system can use the informations for the initial suggestion.  