School of Computer Sciences, USM<br>Semester 2, 2020/2021

# CDS513: Predictive Business Analytics - Group Project

## Data Preprocessing for Recommender System (RS)

##### Project Title
\> Improving Sales Performance of the Ecommerce Website for an Electronics Store using Predictive Business Analytics Techniques 

##### Group No
\> Group 3 \[Lee Yong Meng (P-COM0012/20) | Lee Kar Choon (P-COM0130/19) | Lim Hang Thing (P-COM0143/20)\]

##### Dataset
\> User behaviour data: [Ecommerce Behavior Data from Multi Category Store](https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store)

<img alt="User Event" src="https://images.unsplash.com/photo-1605902711622-cfb43c4437b5?ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&ixlib=rb-1.2.1&auto=format&fit=crop&w=1949&q=80">

Photo by <a href="https://unsplash.com/@cardmapr?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">CardMapr.nl</a> on <a href="https://unsplash.com/s/photos/ecommerce?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

## Overview

**Part 1: [Load Data](#load)**
- 1.1. [User Behaviour Data](#load-user)
- 1.2. [Explore Data](#load-explore)

**Part 2: [Transform Data](#transform)**
- 2.1. [Create New Columns](#transform-new-col)
- 2.2. [Filter User Behaviour Data](#transform-filter)
- 2.3. [Train-Test Split](#transform-split)
- 2.4. [Save Data](#transform-save)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import datetime as dt
from datetime import datetime

print_start_time = lambda: print(f"Start time: {datetime.now()}")
print_end_time = lambda: print(f"End time: {datetime.now()}")

***

# 1. Load Data <a name="load"></a>

## 1.1. User Behaviour Data <a name="load-user"></a>

We use only the records from "2019-Oct.csv" because of limited computing resources. User preferences towards the products sold in the electronics store are **assumed to be time-invariant**, i.e. they do not change over an extended period.

*Note: the cell below can take up to 5 minutes to run.*

In [2]:
%%time

print_start_time()

# Load data
df_oct = pd.read_csv('src/2019-Oct.csv.zip')
display(df_oct.head())
display(df_oct.shape)

# Select important columns
df_oct = df_oct[['event_type', 'category_code', 'brand', 'user_id']]

print_end_time()
    
display(df_oct.head())

Start time: 2021-06-27 18:27:57.868493


Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2019-10-01 00:00:00 UTC,view,44600062,2103807459595387724,,shiseido,35.79,541312140,72d76fde-8bb3-4e00-8c23-a032dfed738c
1,2019-10-01 00:00:00 UTC,view,3900821,2053013552326770905,appliances.environment.water_heater,aqua,33.2,554748717,9333dfbd-b87a-4708-9857-6336556b0fcc
2,2019-10-01 00:00:01 UTC,view,17200506,2053013559792632471,furniture.living_room.sofa,,543.1,519107250,566511c2-e2e3-422b-b695-cf8e6e792ca8
3,2019-10-01 00:00:01 UTC,view,1307067,2053013558920217191,computers.notebook,lenovo,251.74,550050854,7c90fc70-0e80-4590-96f3-13c02c18c713
4,2019-10-01 00:00:04 UTC,view,1004237,2053013555631882655,electronics.smartphone,apple,1081.98,535871217,c6bd7419-2748-4c56-95b4-8cec9ff8b80d


(42448764, 9)

End time: 2021-06-27 18:29:52.102502


Unnamed: 0,event_type,category_code,brand,user_id
0,view,,shiseido,541312140
1,view,appliances.environment.water_heater,aqua,554748717
2,view,furniture.living_room.sofa,,519107250
3,view,computers.notebook,lenovo,550050854
4,view,electronics.smartphone,apple,535871217


Wall time: 1min 54s
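Because only four columns are used downstream, one possible way to reduce memory use is to have `pd.read_csv` parse just those columns via `usecols`. The sketch below is not part of the original notebook. It runs on a tiny in-memory sample instead of the real file; the `sample_csv` rows and the `KEEP_COLS` and `df_small` names are illustrative assumptions.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (same column names as the dataset).
sample_csv = io.StringIO(
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session\n"
    "2019-10-01 00:00:00 UTC,view,44600062,2103807459595387724,,shiseido,35.79,541312140,s1\n"
    "2019-10-01 00:00:01 UTC,view,1307067,2053013558920217191,computers.notebook,lenovo,251.74,550050854,s2\n"
)

# Columns the rest of the notebook relies on (hypothetical constant name).
KEEP_COLS = ['event_type', 'category_code', 'brand', 'user_id']

# Parse only the needed columns; storing `event_type` as a categorical
# can further shrink memory because it has very few distinct values.
df_small = pd.read_csv(sample_csv,
                       usecols=KEEP_COLS,
                       dtype={'event_type': 'category'})

print(df_small)
```

On the real file, the same `usecols` argument could be passed alongside the path. The trade-off is that the full nine-column preview shown above would no longer be available.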


In [3]:
df_oct.shape

(42448764, 4)

In [4]:
df_oct.describe(include='all')

Unnamed: 0,event_type,category_code,brand,user_id
count,42448764,28933155,36331684,42448760.0
unique,3,126,3444,
top,view,electronics.smartphone,samsung,
freq,40779399,11507231,5282775,
mean,,,,533537100.0
std,,,,18523740.0
min,,,,33869380.0
25%,,,,515904300.0
50%,,,,529696500.0
75%,,,,551578800.0


In [5]:
df_oct['user_id'].nunique()

3022290

In [6]:
np.sum(df_oct.isna(), axis=0)

event_type              0
category_code    13515609
brand             6117080
user_id                 0
dtype: int64

***

# 2. Transform Data <a name="transform"></a>

## 2.1. Create New Columns <a name="transform-new-col"></a>

### 2.1.1. `product_name`

Impute missing data by assigning placeholder values to missing `category_code` and `brand` entries:

- `category_code = 'unknown_category'`
- `brand = 'unknown_brand'`

Then combine `category_code` and `brand` to form a new column, `product_name`, for each record.

In [9]:
%%time

print_start_time()

# Fill missing values with "unknown_{category/brand}"
df_oct['category_code'] = df_oct['category_code'].fillna('unknown_category')
df_oct['brand'] = df_oct['brand'].fillna('unknown_brand')

# Create `product_name`
df_oct['product_name'] = df_oct['category_code'] + '-' + df_oct['brand']
display(df_oct.head())

print_end_time()

Start time: 2021-06-27 18:30:25.770334


Unnamed: 0,event_type,category_code,brand,user_id,product_name
0,view,unknown_category,shiseido,541312140,unknown_category-shiseido
1,view,appliances.environment.water_heater,aqua,554748717,appliances.environment.water_heater-aqua
2,view,furniture.living_room.sofa,unknown_brand,519107250,furniture.living_room.sofa-unknown_brand
3,view,computers.notebook,lenovo,550050854,computers.notebook-lenovo
4,view,electronics.smartphone,apple,535871217,electronics.smartphone-apple


End time: 2021-06-27 18:31:06.329314
Wall time: 40.6 s


### 2.1.2. `preference`

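Preference is the implicit feedback signal used to build recommender systems with the collaborative filtering approach. Here it is derived from the event type: each event is assigned a weight, and events that signal stronger purchase intent receive higher weights.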

In [10]:
# Assign weight to each event
weight_dict = {'view': 1, 
               'cart': 3, 
               'purchase': 10}

df_oct['preference'] = df_oct['event_type'].map(weight_dict)

df_oct.head()

Unnamed: 0,event_type,category_code,brand,user_id,product_name,preference
0,view,unknown_category,shiseido,541312140,unknown_category-shiseido,1
1,view,appliances.environment.water_heater,aqua,554748717,appliances.environment.water_heater-aqua,1
2,view,furniture.living_room.sofa,unknown_brand,519107250,furniture.living_room.sofa-unknown_brand,1
3,view,computers.notebook,lenovo,550050854,computers.notebook-lenovo,1
4,view,electronics.smartphone,apple,535871217,electronics.smartphone-apple,1


In [11]:
df_oct[['user_id', 'product_name']].nunique()

user_id         3022290
product_name       5986
dtype: int64

In [12]:
list_prod = df_oct['product_name'].unique()
list_prod.sort()
list_prod

array(['accessories.bag-a-elita', 'accessories.bag-acer',
       'accessories.bag-acron', ..., 'unknown_category-zuru',
       'unknown_category-zvezda', 'unknown_category-zyxel'], dtype=object)

In [13]:
# Generate product ID
prod_dict = {prod: i for i, prod in enumerate(list_prod)}
display(len(prod_dict))

5986

### 2.1.3. `product_id`

In RapidMiner, `product_id` will be assigned the "item identification" role.

In [14]:
df_prod = pd.DataFrame({'product_id': list(prod_dict.values()),
                        'product_name': list(list_prod)})

display(df_prod.head())

df_prod.to_csv('output/product_list.csv', index=False)

Unnamed: 0,product_id,product_name
0,0,accessories.bag-a-elita
1,1,accessories.bag-acer
2,2,accessories.bag-acron
3,3,accessories.bag-apple
4,4,accessories.bag-asus


In [15]:
df_oct['product_id'] = df_oct['product_name'].map(prod_dict)

In [16]:
%%time

print_start_time()

# Aggregate the data frame by `user_id` and `product_id`,
# summing up the preferences
df_group_user_prod = (df_oct[['user_id', 'product_id', 'preference']]
                      .groupby(['user_id', 'product_id'])
                      .sum())

display(df_group_user_prod)

print_end_time()

Start time: 2021-06-27 18:33:16.871153


Unnamed: 0_level_0,Unnamed: 1_level_0,preference
user_id,product_id,Unnamed: 2_level_1
33869381,3119,1
64078358,5793,1
183503497,5793,1
184265397,3047,4
184265397,5202,2
...,...,...
566280663,2641,2
566280676,2857,1
566280697,2554,1
566280780,5648,1


End time: 2021-06-27 18:33:44.498982
Wall time: 27.6 s
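To make the aggregated `preference` values easier to read, here is a small worked example on made-up events (the `toy_events` rows are illustrative assumptions, not taken from the dataset). It applies the same weights and the same group-and-sum step as above.

```python
import pandas as pd

# Same event weights as in the notebook.
weight_dict = {'view': 1, 'cart': 3, 'purchase': 10}

# Made-up events for two users and two products.
toy_events = pd.DataFrame({
    'user_id':    [1, 1, 1, 2, 2],
    'product_id': [10, 10, 20, 10, 10],
    'event_type': ['view', 'cart', 'view', 'view', 'purchase'],
})

# Map each event to its weight, then sum the weights per (user, product) pair.
toy_events['preference'] = toy_events['event_type'].map(weight_dict)
toy_pref = toy_events.groupby(['user_id', 'product_id'])['preference'].sum()

print(toy_pref)
# (1, 10) -> 1 + 3  = 4   (one view, one cart)
# (1, 20) -> 1            (one view)
# (2, 10) -> 1 + 10 = 11  (one view, one purchase)
```

Note that a summed score does not identify the underlying events uniquely: a total of 4, for instance, can come from one view plus one cart, or from four views.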


In [17]:
# Count unique users and products from the (user_id, product_id) MultiIndex
num_user = df_group_user_prod.index.get_level_values('user_id').nunique()
num_product = df_group_user_prod.index.get_level_values('product_id').nunique()

print(f"# unique users: {num_user}\n# unique products: {num_product}")

# unique users: 3022290
# unique products: 5986


In [18]:
# Sort the grouped data frame by total preference.
df_group_user_prod.sort_values(by='preference', ascending=False).head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,preference
user_id,product_id,Unnamed: 2_level_1
523974502,2623,3726
517728689,2672,3597
519267944,2641,2679
543312954,2672,2672
541510103,2672,2604
513117637,2641,2433
512365995,2672,1938
513320236,2672,1908
515384420,2641,1824
563459593,2641,1756


## 2.2. Filter User Behaviour Data <a name="transform-filter"></a>

In [19]:
df_group_user = (df_group_user_prod
                 .groupby(level=0)
                 .count()
                 .sort_values(by='preference', ascending=False))

df_group_prod = (df_group_user_prod
                 .groupby(level=1)
                 .count()
                 .sort_values(by='preference', ascending=False))

print("Preferences group by user")
display(df_group_user)

print("\nPreferences group by product")
display(df_group_prod)

Preferences group by user


Unnamed: 0_level_0,preference
user_id,Unnamed: 1_level_1
512786243,319
513022543,264
549922696,239
512401084,226
521109179,224
...,...
547472849,1
547472845,1
547472803,1
547472756,1



Preferences group by product


Unnamed: 0_level_0,preference
product_id,Unnamed: 1_level_1
5793,689804
2672,675556
2641,591544
2681,287430
2655,203729
...,...
777,1
4126,1
3571,1
4065,1


In [20]:
# ------------------------------------------------------------
# Generate set of users (preference >= MIN_USER_PREF)
# - after `.count()`, `preference` holds the number of distinct
#   products each user interacted with
# - approximately 2000 users
# ------------------------------------------------------------

total_user = df_group_user.shape[0]
top_n_user = 2000
MIN_USER_PREF = df_group_user.quantile(1-(top_n_user/total_user)).values[0]


# ------------------------------------------------------------
# Generate set of products (preference >= MIN_PROD_PREF)
# - after `.count()`, `preference` holds the number of distinct
#   users who interacted with each product
# - approximately 1000 products
# ------------------------------------------------------------

total_prod = df_group_prod.shape[0]
top_n_prod = 1000
MIN_PROD_PREF = df_group_prod.quantile(1-(top_n_prod/total_prod)).values[0]

print(f"Min user preference: {MIN_USER_PREF}\nMin product preference: {MIN_PROD_PREF}")

Min user preference: 71.0
Min product preference: 1634.167056465085


In [21]:
# Generate set of users and products to keep in the RS data
# ... use `df` to shorten the syntax.
df = df_group_user
df_group_user = df[df['preference'] >= MIN_USER_PREF]

df = df_group_prod
df_group_prod = df[df['preference'] >= MIN_PROD_PREF]

set_user_rs = set(df_group_user.index.values)
set_prod_rs = set(df_group_prod.index.values)

del df

print(f"# users (after filter): {len(set_user_rs)}")
print(f"# products (after filter): {len(set_prod_rs)}")

# users (after filter): 2088
# products (after filter): 1000
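The filter keeps 2088 users although roughly 2000 were targeted. A likely reason is ties at the threshold: every user whose count equals the quantile value passes the `>=` test. The toy example below (the `toy_counts` values are invented for illustration) sketches this effect with the same `quantile(1 - top_n / total)` rule.

```python
import pandas as pd

# Hypothetical number of distinct products per user, for 10 users.
toy_counts = pd.Series([9, 7, 5, 5, 5, 3, 2, 2, 1, 1], name='preference')

top_n = 3  # we would like to keep roughly the top 3 users
threshold = toy_counts.quantile(1 - top_n / len(toy_counts))

# Users tied at the threshold are all kept, so more than `top_n` survive.
kept = toy_counts[toy_counts >= threshold]

print(f"threshold: {threshold}")    # 5.0
print(f"users kept: {len(kept)}")   # 5, not 3
```

By default, `quantile` interpolates linearly between neighbouring values, which is why a threshold such as `MIN_PROD_PREF` can be a non-integer even though the counts themselves are integers.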


In [22]:
# ------------------------------------------------------------
# Define conditions for filtered records
# ------------------------------------------------------------

df = df_group_user_prod

# Convert index back into columns.
df = df.reset_index()

# Flag the records whose user and product are both in the kept sets
df['keep_user'] = df['user_id'].isin(set_user_rs)
df['keep_product'] = df['product_id'].isin(set_prod_rs)

CONDITION = (df['keep_user'] & df['keep_product'])

# ------------------------------------------------------------
# Filter user preference records
# ------------------------------------------------------------

df = df[CONDITION]
df = df.reset_index(drop=True)

df_group_user_prod = df[['user_id', 'product_id', 'preference']]

del df

df_group_user_prod

Unnamed: 0,user_id,product_id,preference
0,463020196,1576,39
1,463020196,1579,10
2,463020196,1605,1
3,463020196,1608,13
4,463020196,1610,10
...,...,...,...
133660,566165785,5751,1
133661,566165785,5793,93
133662,566165785,5844,1
133663,566165785,5891,7


## 2.3. Train-Test Split <a name="transform-split"></a>

In [23]:
# ============================================================
# Split user preference data into training and test sets
# - Train:test = 80:20
# ============================================================

from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_group_user_prod, 
                                     stratify=df_group_user_prod['user_id'], 
                                     test_size=0.20, 
                                     random_state=42)

print(f"# records in training data: {df_train.shape[0]}")
print(f"# records in test data: {df_test.shape[0]}")

# records in training data: 106932
# records in test data: 26733
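Collaborative filtering models generally cannot score users they never saw during training, so it may be worth checking that every test user also appears in the training set. Stratifying on `user_id` is meant to spread each user's records across both splits, but a quick check costs little. The helper below is a sketch on made-up frames; `users_missing_from_train`, `toy_train` and `toy_test` are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

def users_missing_from_train(train: pd.DataFrame, test: pd.DataFrame) -> set:
    """Return the user IDs that occur in `test` but never in `train`."""
    return set(test['user_id']) - set(train['user_id'])

# Made-up splits: user 4 appears only in the test frame.
toy_train = pd.DataFrame({'user_id':    [1, 1, 2, 2, 3],
                          'product_id': [10, 20, 10, 30, 20],
                          'preference': [4, 1, 11, 2, 1]})
toy_test = pd.DataFrame({'user_id':    [1, 2, 4],
                         'product_id': [30, 20, 10],
                         'preference': [1, 3, 1]})

missing = users_missing_from_train(toy_train, toy_test)
print(f"# test users absent from training data: {len(missing)}")  # 1
```

Applied to the real splits, the same call would take `df_train` and `df_test` as arguments; an empty result would indicate that no test user is unseen at training time.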


## 2.4. Save Data <a name="transform-save"></a>

In [24]:
# ============================================================
# Save data
# ============================================================

df_train.to_csv('output/train_oct.csv', index=False)
df_test.to_csv('output/test_oct.csv', index=False)