## Product recommendation

In this project,we use the purchase data and items data to make recommendation given guest_id. The project will start from data exploration and cleaning, building models using both item based recomendation and model based collaborative filting and then make evaluations

## Content

* Data exploration
* Feature engineering
* Building model
* Model evaluation
* Further work

### Data exploration

In this part, import and checke the data at first and make some data cleaning if necessary

In [465]:
#import relevant packages
import pandas as pd
import numpy as np
import os
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors as knn

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

In [455]:
#Directly loading dataset if had alrealy on local disk. otherwise download the dataset via internet and save it to disk
if os.path.exists('purchases.csv'):
    purchase=pd.read_csv('purchases.csv')
else:
    url='https://s3-us-west-1.amazonaws.com/shapedata/purchases.csv'
    purchase=pd.read_csv(url)
    purchase.to_csv(url.split('/')[-1])
    
if os.path.exists('items.csv'):
    items=pd.read_csv('items.csv')
else:
    url='https://s3-us-west-1.amazonaws.com/shapedata/purchases.csv'
    purchase=pd.read_csv(url)
    purchase.to_csv(url.split('/')[-1])

In [3]:
#check the data at first
items.describe()

Unnamed: 0,anon_sku,attr
count,16670,16670
unique,16670,16656
top,79490,61201 1341 25404 58238 1341 31145 3292 9874 84...
freq,1,2


* As the table shows, there are 16670 unique items in items tables

In [456]:
#check the purchase data
purchase.describe()

Unnamed: 0,qty
count,37796.0
mean,1.328357
std,1.135054
min,-1.0
25%,1.0
50%,1.0
75%,1.0
max,63.0


* It is interesting to see that min value of qty is negative value(-1) rathen than zero, one possible reason may be the guest return the item they purchased before,or the data is invalid,it is common that customer returned the item once they found unstatisfied, so in this case we do nothing on the negative values 

In [460]:
purchase.columns

Index([u'qty', u'item_i', u'gst_i', u'purchase_d'], dtype='object')

In [461]:
#check data of item id
purchase['item_i'].describe()

count      37698
unique     17429
top       109559
freq          98
Name: item_i, dtype: object

* There are 17429 unique items in purchase table, the number of unique items in purchase data is larger than number in items table

In [462]:
#checek the gst_id 
purchase['gst_i'].describe()

count       37796
unique       1002
top       1904015
freq          281
Name: gst_i, dtype: object

* There are 1002 unique guests appearing in purchase table, it seems there are many loyalty consumers and purchased much for comparing number of purchases(37796) to number of guest(1002)

In [5]:
#replace the negative value with
purchase['qty']=purchase['qty'].replace(-1,0)
purchase.describe()

Unnamed: 0,qty
count,37796.0
mean,1.333542
std,1.128956
min,0.14
25%,1.0
50%,1.0
75%,1.0
max,63.0


In [6]:
#check if there is any missing values
purchase.isnull().sum()/purchase.shape[0]

qty           0.000000
item_i        0.002593
gst_i         0.000000
purchase_d    0.000000
dtype: float64

There are some missing values in item_i,decide to remove them for it only takes up only 0.25% of total number

In [7]:
#remove the rows with missing values in item_i
if purchase['item_i'].isnull().sum()>0:
    purchase=purchase.dropna(axis=0,subset=['item_i'])
purchase.isnull().sum()/purchase.shape[0]

qty           0.0
item_i        0.0
gst_i         0.0
purchase_d    0.0
dtype: float64

### Feature engineering

In this part, we create new feature of total_quantities that measures the total sells for each item and get better sense of 
popular items

In [9]:
#Create the new feature that represents total sells for each item across all guests
purchase_number=purchase.groupby(['item_i'])['qty'].sum().reset_index().rename(columns={'qty':'total_quantities'})
purchase_number.sort(['total_quantities'],ascending=False).head(5)purchase_merge=purchase.merge(purchase_number,left_on='item_i',right_on='item_i',how='left')
purchase_merge.head()

  from ipykernel import kernelapp as app


Unnamed: 0,item_i,total_quantities
1777,109559,371.87
11462,46061,121.0
11718,47330,111.0
11712,47323,101.0
1938,110407,96.06


In [467]:
purchase_merge.describe()

Unnamed: 0,qty,total_quantities
count,37698.0,37698.0
mean,1.333829,8.2906
std,1.129929,22.314439
min,0.14,0.3
25%,1.0,2.0
50%,1.0,3.48
75%,1.0,8.0
max,63.0,371.87


* The total_quantitties appears to be left skewed distributed with mean 8.29, and top four popular items,their total quantities exceed 100

In [10]:
#merge the origial purchase data with total quantities with left join
purchase_merge=purchase.merge(purchase_number,left_on='item_i',right_on='item_i',how='left')
purchase_merge.head()

Unnamed: 0,qty,item_i,gst_i,purchase_d,total_quantities
0,4.0,122464,2639949,23/08/2016_00:00:00,11.0
1,1.0,99091,4935278,16/08/2015_00:00:00,14.0
2,1.0,119976,3479638,18/12/2015_00:00:00,2.0
3,1.0,22501,257693,06/08/2016_00:00:00,4.0
4,1.0,23785,1912070,06/01/2016_00:00:00,6.0


In [17]:
#Convert the data into m*n matrix, m is the number of items, and n is the number of guest_id
train_pivot=train_data.pivot(index='item_i',columns='gst_i',values='qty').fillna(0)
#convert to sparse matrix
train_matrix=csr_matrix(train_pivot.values)
train_matrix

#Convert the data into m*n matrix, m is the number of items, and n is the number of guest_id
test_pivot=train_data.pivot(index='item_i',columns='gst_i',values='qty').fillna(0)
#convert to sparse matrix
test_matrix=csr_matrix(train_pivot.values)
test_matrix



<14690x1001 sparse matrix of type '<type 'numpy.float64'>'
	with 28273 stored elements in Compressed Sparse Row format>

### Building Model

In this part,we will build two models to approach product recommendations,the first one is item based recommendation based on similaries between items
the other one is from model based recommendation, which uses matrix decomposition to learn the latent factors of guest and then make recommendations

#### Item based recommendation

* First, get all item id product the guest purchased from purchase_merge table
* Second, select the item the guest purchased most with largest qty(quantity purchased) value as popular item
* Third, select the top k similar item ids to popular item that the guest have not purchased,using K Nearest Neightbors
* Finaly, use the item ids get from third step to select items from item table and ouput recommendation results.

In [438]:
#define the function to make recommendation given guest_id, the input parameter includes guest_id, number of item recommendations,purchase data, and item data
#The output is items recommendation in DataFrame
def item_based_recommend(guest_id,n_recommend,purchase_merge,item_data):

    #get most popular item the guest has purchased most
    popular_item=purchase_merge[purchase_merge['gst_i']==guest_id].sort_values(by='qty',ascending=False,axis=0).reset_index()['item_i'][0]
    
    #convert the purchase data to pivoit table, with row representing item id and columns representing guest id
    purchase_pivot=pd.pivot_table(purchase_merge,values='qty',index='item_i',columns='gst_i').fillna(0)

    #find the items that the guest has not purchased and most popular item into not brought table.
    #This table will be used to calculate smiliarity, recommendations items will be from not brought table
    not_bought=purchase_pivot[purchase_pivot[guest_id]==0]
    not_bought=not_bought.append(purchase_pivot[purchase_pivot.index==popular_item])
    
    #get the index of most popular item
    item_index=np.where(not_bought.index==popular_item)[0][0]
    
    
    #convert to sparse matrix
    not_bought_sparse=csr_matrix(not_bought.values)
    
    #train knn model using cosine to calculate the similarities between item vectors 
    model_knn=knn(metric='cosine',algorithm='brute')
    model_knn.fit(not_bought_sparse)
    
    
    #get the number of n_recommend similar item vectors, store the distance(similarity) and item ids into distances and indices respectively
    distances,indices=model_knn.kneighbors(not_bought.iloc[item_index,:].reshape(1,-1),n_recommend)
    
    #select the items from item table based on item id get from previous steps
    #start from the second element in indice for first element is  popular item itself
    recommendations=item_data[item_data['anon_sku'].isin(indices[1:].flatten())]

    return recommendations

In [447]:
#make recommendations for 2639949 guest for 10 items.
recommendations=item_based_recommend('2639949',10,purchase_merge,items)



In [448]:
#output the recommendation results
recommendations

Unnamed: 0,anon_sku,attr
1778,198270,51358 17140 53622 61530 56575 63269 29701 2815...
3361,53428,45564 57263 15517 7251 56227 19676 12730 6539 ...
5094,211585,27237 21839 5404 58908 59863 44925 48941 48525...
6722,5512,41702 26250 58807 30594 59348 60110 44902 1967...
8223,15776,40918 9361 45716 4163 56161 6809 45328 29701 4...
9985,95946,19273 51192 22591 28361 43218 28361 61461 1940...


In [443]:
len(np.unique(purchase_merge['item_i']))

17429

In [444]:
len(np.unique(items['anon_sku']))

16670

* It is interesting to see that number of recommendations is less than 10, that is to say some items that are from purchase table actually do not exist in items table, the number of unique items in purchase table is 17429, while in items table, it is 16670
* One reasone may be there is no stock for some items or stop selling right now, but there are some purchases records previously

#### Model based using matrix factorization

* we use the matrix factorization to find the latent preferences for guest as well as latent attributes for items,for user based and item based collaborative filtering
may not reflect well for guest preference or similarities between guests.Also, purchase marix can be built in low rank structure, which could be scaled well in large dataset
* Recommendation output would be k most high predicted qty(quantity purchased)items ids that guest has not purchased before

In [155]:
#convert the purchase_data to pivot table with rows representing guest and columns representing items
purchase_svd=pd.pivot_table(purchase_merge,values='qty',index='gst_i',columns='item_i').fillna(0)
purchase_svd.head()

item_i,100002,100003,100005,100007,100018,100021,100025,100026,100030,100033,...,99974,99983,99984,99986,99993,99996,99998,99999,?,���?���
gst_i,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1009636,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1014290,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1017018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1020846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [156]:
#demean the values in matrix 
purchase_svd_matrix=purchase_svd.as_matrix()
gst_purchase_mean=np.mean(purchase_svd_matrix,axis=1)
purchase_svd_demean=purchase_svd_matrix-gst_purchase_mean.reshape(-1,1)

In [157]:
#After normalization, SVD will be used to decompose matrix
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(purchase_svd_demean, k = 40)

sigma = np.diag(sigma)

In this case matrix u learns guest latent features or preferences, matrix v records the latent features of items, the value of k  should be selected using cross validation,for simplicity, we set k=40 to select top 40 most important features.

In [520]:
#Make prediction of quantitie for each item the guest has not purchased before and store in DataFrame
all_gst_purchase_pred= np.dot(np.dot(U, sigma), Vt) + gst_purchase_mean.reshape(-1,1)
preds_df=pd.DataFrame(all_gst_purchase_pred,columns=purchase_svd.columns)
preds_df.head()

item_i,100002,100003,100005,100007,100018,100021,100025,100026,100030,100033,...,99974,99983,99984,99986,99993,99996,99998,99999,?,���?���
0,0.000171,0.000478,0.000306,0.000154,-0.000764,-0.014531,-0.000894,0.000474,0.000176,0.000196,...,0.000343,-0.002289,0.013369,0.000541,0.002599,-0.002128,0.000717,0.010015,0.000529,6.4e-05
1,0.002586,0.002328,0.002412,0.002594,0.002775,0.000746,0.002013,0.002446,0.001761,0.002493,...,0.002438,0.002156,0.000656,0.002358,0.003156,0.00283,0.002434,-2.3e-05,0.002525,0.002744
2,0.004187,0.004708,0.005195,0.004134,0.004281,0.001104,0.003759,0.004069,-0.002865,0.003997,...,0.003867,0.000578,-0.002348,0.004188,0.006706,0.003694,0.003923,-0.008631,0.00453,0.003911
3,-3.8e-05,-0.001738,-0.002111,0.003377,0.000208,-0.054009,-0.001552,-0.002487,-0.005452,-0.000737,...,-0.001253,0.000988,0.015231,-0.001183,-0.003076,0.001543,-0.0016,0.022486,-0.000475,-0.000168
4,0.002757,0.002088,0.001893,0.005354,0.001261,0.022431,0.003617,0.002637,-0.004056,0.0025,...,0.002422,0.008873,0.012186,0.002702,0.00237,0.005068,0.002285,0.020864,0.002323,0.003067


In [502]:
#The input parameters includes predictions matrix,guest_id, items data and purchase data
#Output is item ids recommendations in DataFrame
def recommend_product(preds_df, gst_i, items_data,purchase_data, num_recommendations=5):
    
    #get the items predictions for the guest and sort
    gst_row_index = guest[guest['guest_id']==gst_i].index 
    sorted_guest_predictions = preds_df.iloc[gst_row_index].sort_values(by=gst_row_index[0],ascending=False,axis=1)
    
    #get information for given guest id
    guest_data = purchase_data[purchase_data['gst_i'] == gst_i]
    #merge data from items table using left join
    guest_full = guest_data.merge(items_data,how='left',left_on='item_i',right_on='anon_sku').sort_values(['qty'],ascending=False)
 
    #making item recommendation with top k highest predictions that guest has not purchased before 
    recommendations = (items_data[~items_data['anon_sku'].isin(guest_full['item_i'])].
             merge(pd.DataFrame(sorted_guest_predictions.unstack()).reset_index(), how = 'left',left_on = 'anon_sku',
                   right_on = 'item_i').
             rename(columns = {'level_1': 'Predictions'}).
             sort_values('Predictions', ascending = False).iloc[:num_recommendations, :-1])
    recommendations['Predictions']=gst_i

    return recommendations

In [506]:
#make 10 items recomendation for guest id 2639949
recommendations=recommend_product(preds_df,'2639949',items,purchase_merge,10)
recommendations

Unnamed: 0,anon_sku,attr,item_i,Predictions
2,102113,46957 29471 4761 11865 29471 46957 29471 52104...,102113,2639949
11021,60079,19979 58085 1708 25359 24237 32327 13708 8913 ...,60079,2639949
11024,66899,13468 22817 5488 11871 56389 60454 9361 35705 ...,66899,2639949
11028,74994,9177 59974 3808 62960 3296 41126 27065 33565 5...,74994,2639949
11029,75189,13162 25227 39838 8248 44925 4603 19223 13162 ...,75189,2639949
11032,80467,57648 44025 34566 60603 34101 24760 4603 16111...,80467,2639949
11036,85515,60033 16304 27335 19629 17140 14381 40667 3142...,85515,2639949
11037,85528,56584 18710 48374 22023 31717 25702 38134 4258...,85528,2639949
11038,87162,5592 42899 17994 2140 29906 8914 4888 6539 498...,87162,2639949
11040,90399,42905 22591 43795 18320 49207 5068 42056 56105...,90399,2639949


* Compared with the result from item based recommendation, it is interesting to see the results are different,since there is no explicit features
that help to get sense of recommendation results, we will use cross validation to evaluate the model performance in next steps.
* I will also investigate the attr feature in the future, which may help to justify the recommendation result

### Model evaluation


In this part, I will split the data into traing and test dataset and make predictions on quantity purchased, then evaluate the models using RMSE(Root Mean Square Error)

In [468]:
#split the data into train and test data
from sklearn import cross_validation as cv
train_data, test_data = cv.train_test_split(purchase_merge, test_size=0.25)

In [469]:
#Convert the data into matrix with row representing item id, and columns represent guest_id
train_pivot=train_data.pivot(index='item_i',columns='gst_i',values='qty').fillna(0)
#convert to sparse matrix
train_matrix=csr_matrix(train_pivot.values)
train_matrix

#Convert the data into matrix with row representing item id, and columns represent guest_id
test_pivot=train_data.pivot(index='item_i',columns='gst_i',values='qty').fillna(0)
#convert to sparse matrix
test_matrix=csr_matrix(train_pivot.values)
test_matrix

<14666x1000 sparse matrix of type '<type 'numpy.float64'>'
	with 28273 stored elements in Compressed Sparse Row format>

In [470]:
from sklearn.metrics.pairwise import pairwise_distances
#calculate the similarity using cosine between item vectors
item_similarity = pairwise_distances(train_matrix, metric='cosine')

In [510]:
#define the function to predict the qty on test dataset
#input parameters include dataset that is used to make prediction and similarity
#output is the prediction on test dataset
def predict(purchase, similarity):
    mean_item_qty= purchase.mean(axis=1)
    purchase_diff = (purchase - mean_item_qty)
    pred = mean_item_qty + similarity.dot(purchase_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    return pred

In [471]:
def predict(purchase, similarity, type='guest'):
    if type == 'item':
        mean_item_rating = purchase.mean(axis=1)
        #You use np.newaxis so that mean_user_rating has same format as ratings
        purchase_diff = (purchase - mean_item_rating)
        pred = mean_item_rating + similarity.dot(purchase_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'guest':
        pred = purchase.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred

In [473]:
#define function to calcualte the rmse between predicted values and true values
#input parameters include predidctions and true values
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    return sqrt(mean_squared_error(prediction, ground_truth.toarray()))

In [516]:
item_prediction=predict(test_matrix,item_similarity)
print 'Item-based CF RMSE: ' + str(rmse(item_prediction, test_matrix))

Item-based CF RMSE: 0.0777105683105


In [518]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

#get the rmse of model
u, s, vt = svds(train_matrix, k = 50)
s_diag_matrix=np.diag(s)
prediction = np.dot(np.dot(u, s_diag_matrix), vt)
print 'Model based CF RMSE: ' + str(rmse(prediction, test_matrix))

Model based CF RMSE: 0.0596617421314


* Juding from the result, it seems like Model based method performs better than item based methods with relatively RMSE value, however,it takes more time on matrix computation
* Another metric I would like to try is to calcualte the similarities between recommendation items and items guest have purchased, then make comparision between models,for time concern,I will try in futher work

### Further work

* I think more feature engineering will be performed in purchase time and attributes of items, we can create such as day_of_week, holiday by extracting
information from datetime, which can be used to explore guest purchase pattern or preference.Also I will extract more information from attributes in items,though most values are a list of numbers and don not convey significant meaings.Hopefully can build contented
based recommendation model based on attributes features 
* I will also try to combine the predictions(linear or sequential) of content based and model based collaborative filtering method, set different weights to see the performances, I would like to explore and improve the both methods if I use linear combination to determine the weights 