# Session-based E-commerce Product Recommender
> We will build a simple yet powerful session-based recommender engine on real-world data. The data contains [Trendyol's](https://www.trendyol.com/) session-level activity and product metadata.

- toc: false
- badges: true
- comments: true
- categories: [Session, Sequence, Retail, ECommerce]
- author: "<a href='https://github.com/CeyhanTurnali/ProductRecommendation'>CeyhanTurnalı</a>"
- image:

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype
from sklearn.metrics.pairwise import cosine_similarity

In [182]:
meta = pd.read_parquet('https://github.com/recohut/reco-data/raw/trendyol/trendyol/v1/meta.parquet.gzip')
meta.head(5)

Unnamed: 0,productid,brand,category,subcategory,name
0,HBV00000AX6LR,Palette,Kişisel Bakım,Saç Bakımı,Palette Kalıcı Doğal Renkler 10-4 PAPATYA
1,HBV00000BSAQG,Best,Pet Shop,Kedi,Best Pet Jöle İçinde Parça Etli Somonlu Konser...
2,HBV00000JUHBA,Tarım Kredi,Temel Gıda,"Bakliyat, Pirinç, Makarna",Türkiye Tarım Kredi Koop.Yeşil Mercimek 1 kg
3,HBV00000NE0QI,Namet,"Et, Balık, Şarküteri",Şarküteri,Namet Fıstıklı Macar Salam 100 gr
4,HBV00000NE0UQ,Muratbey,Kahvaltılık ve Süt,Peynir,Muratbey Burgu Peyniri 250 gr


In [183]:
events = pd.read_parquet('https://github.com/recohut/reco-data/raw/trendyol/trendyol/v1/events.parquet.gzip')
events.head(5)

Unnamed: 0,event,sessionid,eventtime,price,productid
0,cart,a0655eee-1267-4820-af21-ad8ac068ff7a,2020-06-01T08:59:16.406Z,14.48,HBV00000NVZE8
1,cart,d2ea7bd3-9235-4a9f-a9ea-d7f296e71318,2020-06-01T08:59:46.580Z,49.9,HBV00000U2B18
2,cart,5e594788-78a0-44dd-8e66-37022d48f691,2020-06-01T08:59:33.308Z,1.99,OFIS3101-080
3,cart,fdfeb652-22fa-4153-b9b5-4dfa0dcaffdf,2020-06-01T08:59:31.911Z,2.25,HBV00000NVZBW
4,cart,9e9d4f7e-898c-40fb-aae9-256c40779933,2020-06-01T08:59:33.888Z,9.95,HBV00000NE0T4


There are two datasets: one contains product metadata and the other contains session events. We use `productid` as the key and merge the two Parquet files.

In [184]:
data = meta.merge(events, on="productid")
data.head()

Unnamed: 0,productid,brand,category,subcategory,name,event,sessionid,eventtime,price
0,HBV00000AX6LR,Palette,Kişisel Bakım,Saç Bakımı,Palette Kalıcı Doğal Renkler 10-4 PAPATYA,cart,cd34b98c-1e65-4dbb-945c-ca4955a9ad3c,2020-06-02T07:41:35.600Z,14.9
1,HBV00000AX6LR,Palette,Kişisel Bakım,Saç Bakımı,Palette Kalıcı Doğal Renkler 10-4 PAPATYA,cart,cd34b98c-1e65-4dbb-945c-ca4955a9ad3c,2020-06-02T07:41:36.982Z,14.9
2,HBV00000BSAQG,Best,Pet Shop,Kedi,Best Pet Jöle İçinde Parça Etli Somonlu Konser...,cart,9b1bc61a-abd1-48b6-950d-e4b7e5fbdc44,2020-06-09T14:07:16.068Z,11.99
3,HBV00000BSAQG,Best,Pet Shop,Kedi,Best Pet Jöle İçinde Parça Etli Somonlu Konser...,cart,89236793-7661-4043-a33e-cbdd80b728ae,2020-06-14T11:12:10.737Z,11.99
4,HBV00000JUHBA,Tarım Kredi,Temel Gıda,"Bakliyat, Pirinç, Makarna",Türkiye Tarım Kredi Koop.Yeşil Mercimek 1 kg,cart,7648b59c-fae4-4486-afbd-25146fa154ff,2020-06-01T08:50:44.463Z,10.5
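`DataFrame.merge` performs an inner join by default, so only products that appear in both frames survive. The sketch below shows this on toy data; the frames `toy_meta` and `toy_events` and their values are made up for illustration and are not taken from the Trendyol files.

```python
import pandas as pd

# Hypothetical toy frames, not the real metadata or events.
toy_meta = pd.DataFrame({"productid": ["p1", "p2"], "name": ["Milk", "Salt"]})
toy_events = pd.DataFrame({
    "productid": ["p1", "p1", "p3"],
    "sessionid": ["s1", "s2", "s2"],
})

# Default (inner) join: keeps only product ids present in both frames.
inner = toy_meta.merge(toy_events, on="productid")

# Outer join with an indicator column shows which rows had no match.
outer = toy_meta.merge(toy_events, on="productid", how="outer", indicator=True)

print(inner)
print(outer["_merge"].value_counts())
```

Checking the `_merge` column this way is one option for seeing how many events lack metadata before they are dropped by the inner join.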


In [185]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 387656 entries, 0 to 387655
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   productid    387650 non-null  object 
 1   brand        255805 non-null  object 
 2   category     387650 non-null  object 
 3   subcategory  387650 non-null  object 
 4   name         387650 non-null  object 
 5   event        387656 non-null  object 
 6   sessionid    387656 non-null  object 
 7   eventtime    387656 non-null  object 
 8   price        387650 non-null  float64
dtypes: float64(1), object(8)
memory usage: 29.6+ MB


Identify and drop the rows that have null values in the ID columns.

In [186]:
data.isna().sum()

productid           6
brand          131851
category            6
subcategory         6
name                6
event               0
sessionid           0
eventtime           0
price               6
dtype: int64

In [187]:
data = data.dropna(subset=['sessionid','productid'])

In [188]:
data.isna().sum()

productid           0
brand          131845
category            0
subcategory         0
name                0
event               0
sessionid           0
eventtime           0
price               0
dtype: int64

In [189]:
data.describe(include=['O']).T

Unnamed: 0,count,unique,top,freq
productid,387650,10235,HBV00000NVZGU,17082
brand,255805,789,Carrefour,36683
category,387650,20,Meyve ve Sebze,76021
subcategory,387650,132,Sebze,47590
name,387650,10123,Dana Biftek 250 gr,17082
event,387650,1,cart,387650
sessionid,387650,54442,08a906d4-4999-403c-a334-d296106d49cf,308
eventtime,387650,387190,2020-06-01T10:57:05.474Z,3


`cart` is a categorical event type, but we can use it as a quantity. We treat every cart event as one purchase, so counting the events tells us how many products the customers bought.

In [190]:
data['event'] = data['event'].replace(['cart'],'1')
data['event'] = data['event'].astype(float)
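To illustrate the idea of counting cart events as quantities, here is a minimal sketch on a made-up frame. The `toy` data and the `quantity` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy events, not taken from the Trendyol data.
toy = pd.DataFrame({
    "sessionid": ["s1", "s1", "s1", "s2"],
    "productid": ["p1", "p1", "p2", "p1"],
    "event": ["cart", "cart", "cart", "cart"],
})

# Each cart event counts as one unit.
toy["event"] = toy["event"].map({"cart": 1.0})

# Summing per (session, product) gives the quantity added to the cart.
quantity = toy.groupby(["sessionid", "productid"])["event"].sum()
print(quantity)
```

Session `s1` added product `p1` twice, so its quantity is 2.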

In [191]:
data_full = data.copy()

In [192]:
data = data[['sessionid','productid','event']]
data.head()

Unnamed: 0,sessionid,productid,event
0,cd34b98c-1e65-4dbb-945c-ca4955a9ad3c,HBV00000AX6LR,1.0
1,cd34b98c-1e65-4dbb-945c-ca4955a9ad3c,HBV00000AX6LR,1.0
2,9b1bc61a-abd1-48b6-950d-e4b7e5fbdc44,HBV00000BSAQG,1.0
3,89236793-7661-4043-a33e-cbdd80b728ae,HBV00000BSAQG,1.0
4,7648b59c-fae4-4486-afbd-25146fa154ff,HBV00000JUHBA,1.0


Next, we will create a session-item matrix. In this matrix, each row represents a session, each column represents a product, and the value in each cell is the number of times the given product was added to the cart in that session (zero if it was not). Repeated (session, product) pairs are summed when the sparse matrix is built.

In [23]:
session_c = CategoricalDtype(sorted(data.sessionid.unique()), ordered=True)
product_c = CategoricalDtype(sorted(data.productid.unique()), ordered=True)

row = data.sessionid.astype(session_c).cat.codes
col = data.productid.astype(product_c).cat.codes

session_item_matrix = csr_matrix((data["event"], (row, col)), shape=(session_c.categories.size, product_c.categories.size))
session_item_matrix

<54442x10235 sparse matrix of type '<class 'numpy.float64'>'
	with 272476 stored elements in Compressed Sparse Row format>
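The same construction can be checked on a tiny example. This is a sketch on made-up data (the `toy` frame and the `toy_*` names are assumptions); it shows that `csr_matrix` adds up duplicate (row, column) entries, which is why the matrix has fewer stored elements than the data has rows.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# Hypothetical toy events: session s1 contains product p1 twice.
toy = pd.DataFrame({
    "sessionid": ["s1", "s1", "s1", "s2"],
    "productid": ["p1", "p1", "p2", "p1"],
    "event": [1.0, 1.0, 1.0, 1.0],
})

toy_session_c = CategoricalDtype(sorted(toy.sessionid.unique()), ordered=True)
toy_product_c = CategoricalDtype(sorted(toy.productid.unique()), ordered=True)

# Integer codes give the row and column position of every event.
row = toy.sessionid.astype(toy_session_c).cat.codes.to_numpy()
col = toy.productid.astype(toy_product_c).cat.codes.to_numpy()

toy_matrix = csr_matrix(
    (toy["event"].to_numpy(), (row, col)),
    shape=(toy_session_c.categories.size, toy_product_c.categories.size),
)

# Duplicate (s1, p1) entries are summed into a single cell.
dense = toy_matrix.toarray()
print(dense)
```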

In [32]:
session_item_matrix[:10,:10].todense()

matrix([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [50]:
session_c.categories[10]

'000cd8b2-7c1a-454a-b038-45e3e6da202c'

## User-User Similarity

We compute the cosine similarity between the rows of the session-item matrix to measure how similar the purchase behaviour of any two sessions is.

In [42]:
user_user_sim_matrix = cosine_similarity(session_item_matrix, dense_output=False)
user_user_sim_matrix

<54442x54442 sparse matrix of type '<class 'numpy.float64'>'
	with 112020838 stored elements in Compressed Sparse Row format>
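For intuition, the cosine similarity of two sessions is the dot product of their rows divided by the product of the row norms. The sketch below compares `cosine_similarity` with that formula on a made-up 3x3 matrix; `toy_matrix` and `manual` are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical session-item matrix: 3 sessions, 3 products.
toy_matrix = csr_matrix(np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]))

sim = cosine_similarity(toy_matrix, dense_output=False)

# Manual computation for sessions 0 and 1: a.b / (|a| * |b|).
a = toy_matrix[0].toarray().ravel()
b = toy_matrix[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sim[0, 1], manual)
```

Sessions that share no products, such as sessions 0 and 2 here, get a similarity of zero and are not stored in the sparse result.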

In [158]:
def getname(id=0, ntype='session', mode='lookup'):
  if mode=='random':
    if ntype=='session':
      id = np.random.randint(0,len(session_c.categories))
      return session_c.categories[id], id
    else:
      id = np.random.randint(0,len(product_c.categories))
      return product_c.categories[id], id
  else:
    if ntype=='session':
      return session_c.categories[id]
    else:
      return product_c.categories[id]
   

def print_topk(matrix, id, k=10, ntype='session'):
  frame = pd.DataFrame(matrix[id].todense()).T.sort_values(by=0, ascending=False).head(k)
  frame = frame.reset_index()
  frame.columns = ['id','similarity']
  frame[f'{ntype}_id'] = frame['id'].apply(lambda x: getname(x, ntype))
  return frame
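`print_topk` turns one similarity row into a dense frame before sorting. As an alternative sketch, the top entries can be read from the stored values of the sparse row directly. `topk_sparse` is a hypothetical helper, not part of the notebook, and the `toy_sim` matrix is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

def topk_sparse(matrix, row_id, k=10):
    """Return (column index, value) pairs for the k largest stored values in one row."""
    row = matrix[row_id]  # 1 x n sparse row
    # Sort stored values in descending order; a stable sort keeps ties in index order.
    order = np.argsort(-row.data, kind="stable")[:k]
    return [(int(row.indices[i]), float(row.data[i])) for i in order]

# Hypothetical similarity matrix for 4 sessions.
toy_sim = csr_matrix(np.array([
    [1.0, 0.2, 0.0, 0.7],
    [0.2, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.7, 0.0, 0.0, 1.0],
]))

top = topk_sparse(toy_sim, row_id=0, k=2)
print(top)
```

As with `print_topk`, the first hit is the session itself, because every non-empty session has similarity 1.0 with itself.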

In [167]:
random_session, id = getname(ntype='session', mode='random')
print("Let's try it for a random session {}".format(random_session))

Let's try it for a random session 628ba8b7-5b24-4d26-a529-7bd62f1fc47b


What are the similar sessions?

In [168]:
similar_sessions = print_topk(user_user_sim_matrix, id=id, k=10, ntype='session')
similar_sessions

Unnamed: 0,id,similarity,session_id
0,21076,1.0,628ba8b7-5b24-4d26-a529-7bd62f1fc47b
1,7655,0.606933,23963196-591a-4d3d-937d-9383b9e39218
2,30129,0.58728,8d698d37-d652-4cf0-9ce3-fc89724312f1
3,3322,0.571886,0f557a0b-e1ed-4409-bc59-a1ea8b2acb27
4,22107,0.571429,67512123-03fd-457c-927b-1c1dd61162d7
5,48984,0.571429,e6457f49-56c7-47c6-b236-73d343189fae
6,15102,0.571429,4601757d-fa21-43e4-afea-ca88d5401d30
7,15128,0.571429,46239b44-44a4-4751-ade7-e8e7d5c14846
8,43092,0.571429,ca46d863-0f83-42e6-b805-3c440ee59890
9,9323,0.571429,2b609c6b-8bdc-4e36-bdb2-1428b385215f


In [169]:
print("Random Session ID: {}\nTop-similar Session ID: {}".\
      format(random_session, similar_sessions.iloc[1].session_id))

Random Session ID: 628ba8b7-5b24-4d26-a529-7bd62f1fc47b
Top-similar Session ID: 23963196-591a-4d3d-937d-9383b9e39218


For reference, we call the random session A and its most similar session B. Because the two sessions are highly similar, we can reasonably assume that the products purchased by customer A but not yet by customer B are also likely to be purchased by customer B. We therefore identify the items bought by customer A and by customer B, and recommend the remaining items of customer A to customer B.

In [194]:
items_bought_by_customerA = [getname(x, ntype='product') for x in np.argwhere(session_item_matrix[id]>0)[:,1]]
print("Items Bought by Customer A:")
items_bought_by_customerA

Items Bought by Customer A:


['HBV00000EEHCZ',
 'HBV00000GNAW1',
 'HBV00000JTA3F',
 'HBV00000NE0RD',
 'HBV00000NE0UO',
 'HBV00000NE0V8',
 'HBV00000NE1MX',
 'HBV00000NFMLC',
 'HBV00000NG8K2',
 'HBV00000NG8KU',
 'HBV00000OE7BC',
 'HBV00000OE7QQ',
 'HBV00000P7VNE',
 'HBV00000PNG99',
 'HBV00000PQIQE',
 'HBV00000PQIQX',
 'HBV00000QSG5K',
 'HBV00000QU499',
 'HBV00000QUBMQ',
 'ZYHPFRITOCPS058',
 'ZYPINAR153100004',
 'ZYPINAR153100006',
 'ZYTAD7141038']

In [195]:
items_bought_by_customerB = [getname(x, ntype='product') for x in np.argwhere(session_item_matrix[similar_sessions.iloc[1].id]>0)[:,1]]
print("Items bought by other customer:")
items_bought_by_customerB

Items bought by other customer:


['HBV00000O2S62',
 'HBV00000O2SIK',
 'HBV00000OE7BC',
 'HBV00000OE7QQ',
 'HBV00000QU499']

In [197]:
items_to_recommend_to_customerB= set(items_bought_by_customerA) - set(items_bought_by_customerB)
print("Items to Recommend to customer B:")
data_full.loc[data_full['productid'].isin(items_to_recommend_to_customerB),['productid', 'name']].drop_duplicates().set_index('productid')

Items to Recommend to customer B:


Unnamed: 0_level_0,name
productid,Unnamed: 1_level_1
ZYHPFRITOCPS058,Doritos Nacho Süper Boy Mısır Cipsi 113 gr
HBV00000PQIQE,Bağdat Mahlep 30 g
HBV00000NG8K2,Billur Tuz İyotlu Tuz 750 gr
HBV00000NFMLC,Greenlife Limon Tuzu 60 g
ZYTAD7141038,Tadım Kabak Çekirdeği 180 gr
HBV00000QUBMQ,Knorr Pane Harcı 100 gr
HBV00000NE0UO,Tahsildaroğlu Ezine Koyun Peyniri 600 gr
HBV00000EEHCZ,Doğalsan Yulaf Kepeği Bran 400 gr
ZYPINAR153100006,Pınar Tam Yağlı Süt 6x200 ml
HBV00000PQIQX,Bağdat Sumak Poşet 80 g


> Tip: For item-item similarity, take the transpose of the session-item matrix and repeat the same steps.
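A minimal sketch of that tip on made-up data: transposing the matrix puts products in the rows, so `cosine_similarity` then compares products by the sessions they appear in. `toy_matrix` and `item_item_sim` are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical session-item matrix: products 0 and 1 always appear together.
toy_matrix = csr_matrix(np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]))

# Rows of the transpose are products, so the result is product x product.
item_item_sim = cosine_similarity(toy_matrix.T, dense_output=False)
print(item_item_sim.toarray())
```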