### Business goal
1. Investigate the performance of an e-commerce website using predefined metrics
2. Cluster users by activity and browsing behavior

### Performance Metrics
1. Daily sale amount
2. Daily user count
3. Distribution of user events and conversion rate
    * Monthly conversion rate = $\frac{\text{number of users who purchase in the month}}{\text{total users in the month}}$
4. Retention rate
    * Weekly retention rate = $\frac{\text{number of users online in a given week who are still online in the following week}}{\text{total users in the given week}}$
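
As a small worked illustration of the conversion-rate formula, the sketch below applies it to a made-up event log. The event tuples and the helper name `conversion_rate` are assumptions for illustration only; they are not part of the dataset or of the notebook's own code.

```python
# Hypothetical (user_id, event_type) pairs for one month; not real data.
events = [
    (1, "view"), (1, "cart"), (1, "purchase"),
    (2, "view"),
    (3, "view"), (3, "purchase"), (3, "purchase"),
    (4, "view"), (4, "cart"),
]


def conversion_rate(events):
    """Share of distinct users with at least one purchase event."""
    all_users = {user for user, _ in events}
    buyers = {user for user, event in events if event == "purchase"}
    return len(buyers) / len(all_users)


rate = conversion_rate(events)
print(f"monthly conversion rate: {rate:.2f}")  # 2 buyers out of 4 users
```

Because the rate counts distinct users, user 3's two purchases count once.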

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib.dates as dates
from datetime import datetime
import seaborn as sns

%matplotlib inline
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Use the October 2019 data as an example
df = pd.read_csv("/kaggle/input/ecommerce-events-history-in-cosmetics-shop/2019-Oct.csv")
df.head()

In [None]:
df.info()

Number of events recorded for each user in this month:

In [None]:
# Number of events per user
df['user_id'].value_counts()

In [None]:
df.isnull().mean()

### Preprocessing
1. Extract the date, time of day, and week of the month from `event_time`

In [None]:
# Parse the date and time of day from the event_time string (vectorized instead of row-wise apply)
event_time_str = df['event_time'].astype(str)
df['date'] = pd.to_datetime(event_time_str.str[:10], format='%Y-%m-%d')
df['timepoint'] = pd.to_datetime(event_time_str.str[11:19], format='%H:%M:%S').dt.time
# Week of the month: days 1-7 are week 1, days 8-14 are week 2, and so on
df['week'] = (df['date'].dt.day - 1) // 7 + 1
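
The week-of-month rule used above, `(day - 1) // 7 + 1`, can be checked on a few dates. This is a minimal sketch; the helper name `week_of_month` is introduced here for illustration and does not appear in the notebook.

```python
from datetime import date


def week_of_month(d):
    """Map days 1-7 to week 1, days 8-14 to week 2, and so on."""
    return (d.day - 1) // 7 + 1


sample_days = (1, 7, 8, 28, 29, 31)
weeks = [week_of_month(date(2019, 10, day)) for day in sample_days]
for day, week in zip(sample_days, weeks):
    print(f"2019-10-{day:02d} -> week {week}")
```

Under this scheme, week 5 of October covers only days 29 to 31, so any comparison between week 4 and week 5 sets a full week against a three-day window.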

2. Drop rows with price < 0

In [None]:
df.drop(df[df.price < 0].index, inplace=True)

3. Drop duplicate records

In [None]:
df.drop_duplicates(inplace = True)

In [None]:
df.head()

#### Metric 1: Daily sale amount

In [None]:
# Sum the price of purchase events for each day
sale_ax = df[df.event_type == "purchase"].groupby(['date'])['price'].agg(Daily_sale = ('sum')).sort_index().plot.line()
# set_ylabel is a method, so it must be called rather than assigned to
sale_ax.set_ylabel("sale amount")

#### Metric 2: Daily user count

In [None]:
# Count distinct users for each day
user_cnt_ax = df.groupby(['date'])['user_id'].agg(User_daily_count =('nunique')).sort_index().plot.line()
# set_ylabel is a method, so it must be called rather than assigned to
user_cnt_ax.set_ylabel("user count")

#### Metric 3: Distribution of event types and conversion rate (daily and monthly)

In [None]:
# Take the labels from the counts themselves so each slice is labelled with its own event type
event_counts = df['event_type'].value_counts()
colors = ['red', 'blue','green', 'orange']
plt.pie(event_counts, labels = event_counts.index, colors = colors, autopct = '%.2f%%')

In [None]:
# Daily share of each event type (normalize=True gives proportions within each day)
event_type_ax = df.groupby(['date'])['event_type'].value_counts(normalize=True).unstack(level = -1).plot.line()
event_type_ax.legend(loc='best')

In [None]:
# Monthly conversion rate: share of users with at least one purchase
df.loc[df['event_type'] == "purchase", 'user_id'].nunique() / df['user_id'].nunique()

#### Metric 4: Weekly user retention in this month
* What proportion of the users online in a given week are still online in the following week?
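
To make the definition concrete, here is a minimal sketch on made-up weekly user sets. The user ids and the helper name `retention_rates` are assumptions for illustration, not values from the dataset.

```python
# Hypothetical sets of user ids seen in each week; not real data.
weekly_users = [
    {1, 2, 3, 4},  # week 1
    {2, 3, 5},     # week 2
    {3, 5, 6, 7},  # week 3
]


def retention_rates(weekly_users):
    """For each consecutive pair of weeks, share of the earlier week's users seen again."""
    return [
        len(current & following) / len(current)
        for current, following in zip(weekly_users, weekly_users[1:])
    ]


rates = retention_rates(weekly_users)
print(rates)
```

Users 2 and 3 return in week 2, giving 2/4 for the first pair; users 3 and 5 return in week 3, giving 2/3 for the second pair. The denominator is always the earlier week, so new users in the following week do not affect the rate.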

In [None]:
# Unique users per week, then the share of each week's users who return the following week
week_user_list = df.groupby(['week'])['user_id'].unique().to_list()
retention_list = [len(set(i) & set(j)) / len(i) for i, j in zip(week_user_list[:-1], week_user_list[1:])]
positions = np.arange(len(retention_list))
plt.bar(positions, retention_list, width = 0.5)
plt.ylabel("Weekly retention rate")
plt.xticks(positions, [f"week{n}-{n + 1}" for n in range(1, len(retention_list) + 1)])
plt.title("October weekly retention rate")

### User segmentation
Users are clustered along three dimensions: how active a user is (days online, sessions, and purchase amount), and to what extent a user contributes to monthly sales, measured first by event frequency and then by rates. The first dimension covers 391055 records; no zero values were removed, because the majority of users have at least one active day or one session. In the second and third dimensions, however, most users have 0 for all of the behavior-related features and NaN for the rate-related features. These user records were therefore removed and classified as "inactive users".

1. User activity dimension
    * Monthly active = how many days online per month
    * Daily active = number of sessions per day
    * Purchase price = total transaction amount of each user

2. Revenue dimension at event frequency level
    * Monthly purchase
    * Monthly price
    * Monthly view

3. Revenue dimension at rate level
    * Monthly purchase/view rate
    * Monthly purchase/cart rate
    * Monthly price/purchase rate
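
The rate-level features are undefined whenever their denominator is zero, which is what leads to the "inactive users" split described above. The sketch below shows one way this plays out on a made-up table; the three toy users, the column names ending in `_rate`, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-user monthly counts; not real data.
toy = pd.DataFrame(
    {
        "view": [10, 4, 0],
        "cart": [2, 0, 0],
        "purchase": [1, 0, 0],
        "price": [20.0, 0.0, 0.0],
    },
    index=pd.Index([101, 102, 103], name="user_id"),
)

# Replace zero denominators with NaN so the ratio is undefined rather than infinite.
toy["purchase_view_rate"] = toy["purchase"].div(toy["view"].replace(0, np.nan))
toy["purchase_cart_rate"] = toy["purchase"].div(toy["cart"].replace(0, np.nan))
toy["price_purchase_rate"] = toy["price"].div(toy["purchase"].replace(0, np.nan))

# Users with any undefined rate are set aside as inactive.
rate_cols = ["purchase_view_rate", "purchase_cart_rate", "price_purchase_rate"]
active = toy.dropna(subset=rate_cols)
inactive_ids = toy.index.difference(active.index)

print(active[rate_cols])
print("inactive:", list(inactive_ids))
```

Only user 101 has all three rates defined. User 102 viewed products but never added to cart or purchased, so two of its rates are NaN and it is dropped along with user 103.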

#### User activity dimension

In [None]:
# Number of days each user shows up in the given month
id_daycount = df.groupby(['user_id'])['date'].agg(monthly_online_daycnt =('nunique')).sort_values(by=['monthly_online_daycnt'])
id_daycount['monthly_online_daycnt'].value_counts().plot(kind='bar')

In [None]:
# Number of distinct sessions of each user in the given month
id_sessioncount = df.groupby(['user_id'])['user_session'].agg(monthly_sessioncnt = ('nunique')).sort_values(by=['monthly_sessioncnt'])
id_sessioncount['monthly_sessioncnt'].value_counts(bins=20)

In [None]:
# Total purchase amount of each user in the given month
id_totalprice = df[df['event_type'] == 'purchase'].groupby(['user_id'])['price'].agg(monthly_totalprice = ('sum')).sort_values(by=['monthly_totalprice'])
id_totalprice['monthly_totalprice'].value_counts(bins=20).plot(kind='bar')

In [None]:
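# Concatenate the three dataframes on user_id
activeuser_df = pd.concat([id_sessioncount, id_daycount, id_totalprice], axis=1)
# Users without a purchase have no total price; assign the result back instead of a chained inplace fillna
activeuser_df['monthly_totalprice'] = activeuser_df['monthly_totalprice'].fillna(0)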

#### Revenue dimension (frequency and rate level)

In [None]:
# Count events of each type per user, then attach each user's total purchase amount
behavior_df = df.groupby(['user_id', 'event_type']).size().unstack(fill_value=0)
price_df = df[df['event_type'] == 'purchase'].groupby(['user_id'])['price'].agg('sum')
behavior_df = behavior_df.merge(price_df, how = 'left', on = 'user_id')
# Assign the result back instead of a chained inplace fillna
behavior_df['price'] = behavior_df['price'].fillna(0)

In [None]:
# Calculate the three rates; zero denominators become NaN so the rate is undefined instead of infinite
behavior_df['month_purchaseview_rate'] = behavior_df.purchase.div(behavior_df.view.replace(0, np.nan))
behavior_df['month_pricepurchase_rate'] = behavior_df.price.div(behavior_df.purchase.replace(0, np.nan))
behavior_df['month_purchasecart_rate'] = behavior_df.purchase.div(behavior_df.cart.replace(0, np.nan))

In [None]:
# drop nan rows based on the predefined three rates
behavior_df = behavior_df.dropna(subset=['month_purchaseview_rate', 'month_pricepurchase_rate', 'month_purchasecart_rate'])

In [None]:
# get the behavior feature at freq and rate level 
behavior_rate_df = behavior_df[['month_purchaseview_rate', 'month_pricepurchase_rate', 'month_purchasecart_rate']]
behavior_freq_df = behavior_df[['purchase', 'view', 'price']]

In [None]:
freq_boxplot = behavior_df.boxplot(rot=45, fontsize=10, column = ['cart', 'purchase', 'remove_from_cart', 'view'])
freq_boxplot.set_title('Behavior frequency boxplot')

In [None]:
rate_boxplot = behavior_df.boxplot(rot=45, fontsize=10, column = ['month_purchaseview_rate', 'month_pricepurchase_rate', 'month_purchasecart_rate'] )
rate_boxplot.set_title('Behavior rate boxplot')

#### Clustering using k-means

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [None]:
# Standardize the data before searching for the optimal k
X_activeuser = StandardScaler().fit_transform(activeuser_df)
X_freq_behavior = StandardScaler().fit_transform(behavior_freq_df)
X_rate_behavior = StandardScaler().fit_transform(behavior_rate_df)

##### Find the best k value for each dataset
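
To show what the elbow heuristic looks like when the true number of clusters is known, the sketch below runs it on synthetic data. The three blobs, the variable names, and the range of k are assumptions for illustration and are unrelated to the e-commerce features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])
X_toy = StandardScaler().fit_transform(points)

# inertia_ is the within-cluster sum of squared distances (SSE) of a fitted model.
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias.append(model.inertia_)

for k, sse in zip(range(1, 7), inertias):
    print(k, round(sse, 2))
```

On standardized data the SSE at k=1 equals the number of samples times the number of features (here 150 × 2), since every feature has unit variance. The curve should drop sharply up to k=3 and flatten afterwards; that bend is the "elbow". On real data the bend is often less clear-cut, so the choice of k remains a judgment call.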

In [None]:
sns.set()
# Elbow curves for the frequency-level and rate-level behavior features
for name, XX in [('frequency', X_freq_behavior), ('rate', X_rate_behavior)]:
    SSE = []
    for i in range(1, 9):
        # Fix n_init and random_state so the curve is reproducible
        kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
        kmeans.fit(XX)
        SSE.append(kmeans.inertia_)
    plt.plot(range(1, 9), SSE, marker='o', label=name)
plt.xlabel('k')
plt.ylabel('SSE')
plt.legend()

In [None]:
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3D projection on older matplotlib

def k_mean_cluster(df, n_cluster, figsize = (6, 6)):
    '''
    Fit k-means on the standardized features, then plot the cluster sizes
    and a 3D scatter of the original feature values colored by cluster.
    :param:
        df: dataframe with exactly three feature columns
        n_cluster: number of clusters
        figsize: size of the 3D scatter figure
    '''
    # Check input validity
    assert len(df.columns) == 3, "Should be three features"
    X = df.copy()

    # random_state fixes the k-means initialization, so results are reproducible
    km = KMeans(n_clusters=n_cluster,
                init = "k-means++",
                max_iter = 300,
                n_init = 10,
                random_state=0)
    X_standard = StandardScaler().fit_transform(X)
    X['y_pred'] = km.fit_predict(X_standard)

    feat_columns = df.columns
    COLORS = sns.color_palette("tab10")[:n_cluster]

    # Number of users in each cluster
    sns.countplot(x = 'y_pred', palette = COLORS, data = X)

    # 3D scatter plot of the three features, one color per cluster
    fig = plt.figure(figsize = figsize)
    ax = fig.add_subplot(projection='3d')
    for k in range(n_cluster):
        in_cluster = X.y_pred == k
        ax.scatter(X.loc[in_cluster, feat_columns[0]],
                   X.loc[in_cluster, feat_columns[1]],
                   X.loc[in_cluster, feat_columns[2]],
                   marker='o', alpha=0.5, color = COLORS[k], label = f"cluster {k}")

    ax.set_xlabel(feat_columns[0])
    ax.set_ylabel(feat_columns[1])
    ax.set_zlabel(feat_columns[2])
    ax.legend(loc='best')
    plt.show()

In [None]:
k_mean_cluster(behavior_freq_df, 4)

In [None]:
k_mean_cluster(behavior_rate_df, 4)

The same procedure is applied to the user activity features:

In [None]:
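# Elbow curve for the user activity features
SSE = []
for i in range(1, 9):
    # Fix n_init and random_state so the curve is reproducible
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=0)
    kmeans.fit(X_activeuser)
    SSE.append(kmeans.inertia_)
sns.set()
plt.plot(range(1, 9), SSE, marker='o')
plt.xlabel('k')
plt.ylabel('SSE')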

In [None]:
k_mean_cluster(activeuser_df, 4)