# DSI Case

During the COVID-19 pandemic in 2020, total e-commerce sales in **Indonesia** increased by 37%. With competition growing increasingly fierce, you and your Product Manager are discussing how to stay afloat and compete in the e-commerce industry. You decide to launch an innovation or offer so that users keep choosing you as their online shopping platform.

To support this, you are assigned to analyze users' transaction data. The constraint is that the company is tightening its promotional budget in 2021. As a data analyst, what insights and recommendations can you give to the company?

**Objectives:** 
* User acquisition and user retention through a new program or offer
    1. How new users discover and use the e-commerce platform
    2. How to retain both new and existing users

**Approach:**
* Deep dive into the data to understand our customers better, then decide the next steps from there

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
pd.set_option('display.max_rows', 10000)
pd.set_option('display.max_columns', 100)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
path = "/kaggle/input/ecommerce-data/data.csv"

In [None]:
# Read both identifier columns as strings (the invoice column in this dataset is 'InvoiceNo')
df = pd.read_csv(path, header=0, encoding="ISO-8859-1", dtype={'CustomerID': str, 'InvoiceNo': str})

In [None]:
df_clean = df.copy()
df.head()

# **Cleaning**

In [None]:
df.tail()

In [None]:
df_clean['InvoiceDate'] = pd.to_datetime(df_clean['InvoiceDate'])

In [None]:
df_clean.InvoiceDate = df_clean.InvoiceDate.astype(str)
df_clean.head()

For the purposes of this case, relabel the year so that the data ranges from January 2020 to December 2020: 2011 becomes 2020, and the rows from 2010 are dropped.

In [None]:
df_clean['InvoiceDate'] = df_clean['InvoiceDate'].apply(lambda x: x.replace('2011','2020'))
df_clean = df_clean[~df_clean.InvoiceDate.str.contains('2010')]
df_clean['InvoiceDate'] = pd.to_datetime(df_clean['InvoiceDate'])
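
The string replacement above works because the year appears only once in each timestamp. A possible alternative, sketched here on a made-up `toy` frame and assuming the data contains only 2010 and 2011, keeps the column as datetime and shifts it with `pd.DateOffset`:

```python
import pandas as pd

# Toy data standing in for the real InvoiceDate column (values are illustrative only)
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime([
        '2010-12-01 08:26:00',
        '2011-01-04 10:00:00',
        '2011-12-09 12:50:00',
    ])
})

# Keep only the 2011 rows, then move them forward by nine years to 2020
shifted = toy[toy['InvoiceDate'].dt.year == 2011].copy()
shifted['InvoiceDate'] = shifted['InvoiceDate'] + pd.DateOffset(years=9)
print(shifted)
```

This avoids the round trip through strings, at the cost of being a little less obvious to read.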

Drop the Country column, since we assume all transactions take place in Indonesia.

In [None]:
df_clean = df_clean.drop(columns = 'Country')
df_clean = df_clean.reset_index(drop=True)

In [None]:
df_clean.head()

Check the percentage of missing values in each column to get a first overview.

In [None]:
missing_percentage = df_clean.isnull().sum() / df_clean.shape[0] * 100
missing_percentage

Inspect the rows where the description is null.

In [None]:
df_clean[df_clean['Description'].isnull()].head()

In this sample, rows with a null description also have a unit price of 0, which is exactly the kind of data we want to remove. Let's check whether every null description has a unit price of 0.

In [None]:
df_clean[df_clean.Description.isnull()].UnitPrice.value_counts()

We can conclude that every null description has a unit price of 0, so we remove them all.

In [None]:
df_clean = df_clean[df_clean['Description'].notnull()]
df_clean.Description.isna().value_counts()
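
The conclusion above was read off `value_counts()` by eye. As a rough sketch, the same claim could be checked in code before dropping the rows; the `toy` frame below is invented for illustration:

```python
import pandas as pd

# Toy frame: two rows have no description, and both have a zero unit price
toy = pd.DataFrame({
    'Description': ['MUG', None, 'LAMP', None],
    'UnitPrice': [2.5, 0.0, 4.0, 0.0],
})

# True only if every row with a null description also has UnitPrice == 0
null_desc = toy[toy['Description'].isnull()]
all_zero = bool((null_desc['UnitPrice'] == 0).all())
print(all_zero)

# Dropping the null descriptions is then safe with respect to revenue
cleaned = toy[toy['Description'].notnull()]
```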

In [None]:
df_clean[df_clean['UnitPrice']==0.0].head()

Remove the remaining rows with a unit price of 0, since a zero price is not a normal sale.

In [None]:
df_clean = df_clean[df_clean['UnitPrice']!=0.0]
(df_clean['UnitPrice']==0.0).value_counts()

In [None]:
df_clean.describe()

Remove rows with a negative unit price or quantity.

In [None]:
df_clean = df_clean[(df_clean['UnitPrice']>0) & (df_clean['Quantity'] > 0)]

In [None]:
df_clean.describe()

In [None]:
df_clean.head()

Next, remove duplicate rows.

In [None]:
print('number of duplicates: {}'.format(df_clean.duplicated().sum()))

In [None]:
data = df_clean.drop_duplicates()

In [None]:
print('number of duplicates: {}'.format(data.duplicated().sum()))
data.shape

Out of curiosity, let's check the rows with a large unit price.

In [None]:
data[data.UnitPrice > 200].head() #to check it fully, remove the head()

Among the rows with a large unit price, most have the stock code DOT or M, and there is also an "AMAZONFEE" code with an extremely large unit price.

In [None]:
data[(data.StockCode == 'DOT') | (data.StockCode == 'M') | (data.StockCode == 'AMAZONFEE')].shape

Before removing them, let's draw a box plot to see whether they are extreme outliers.

In [None]:
sns.boxplot(y=data.UnitPrice)

As you can see, the box, which represents the majority of the data, is barely visible, so removing the extreme outliers is an option. In a real case, the best choice would be to verify with whoever collected the data whether these rows are really customer purchases. Since we can't do that here, let's assume they are not customer purchases (the stock codes themselves are suspicious) and remove them.

In [None]:
data = data[(data.StockCode != 'DOT') & (data.StockCode != 'M') & (data.StockCode != 'AMAZONFEE')].copy()
sns.boxplot(y=data.UnitPrice)

This box plot indicates that something suspicious remains, so we check again.

In [None]:
data[data.UnitPrice>200].head()

The items with stock codes POST and B are suspicious too. For the same reason as before, we remove them as well.

In [None]:
data = data[(data.StockCode != 'B') & (data.StockCode != 'POST')].copy()
sns.boxplot(y=data.UnitPrice)
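
The chained `!=` filters above can also be written as a single `isin` check against a list of codes. A minimal sketch on an invented `toy` frame (the list name `NON_PRODUCT_CODES` is hypothetical; the codes are the ones flagged in this notebook):

```python
import pandas as pd

# Stock codes treated as non-product entries in this notebook
NON_PRODUCT_CODES = ['DOT', 'M', 'AMAZONFEE', 'POST', 'B']

# Toy rows (illustrative values only)
toy = pd.DataFrame({
    'StockCode': ['85123A', 'DOT', 'POST', '22423', 'M'],
    'UnitPrice': [2.55, 569.77, 18.0, 12.75, 1.25],
})

# Keep every row whose stock code is NOT in the exclusion list
products = toy[~toy['StockCode'].isin(NON_PRODUCT_CODES)].copy()
print(products)
```

Keeping the codes in one list makes it easier to extend the exclusion if more suspicious codes turn up.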

In [None]:
data[data.UnitPrice>200].head()

This looks fine now.

In [None]:
data.describe()

Something is still off: the maximum quantity. Let's check it.

In [None]:
data[data.Quantity>500].head()

After checking, it looks fine.

Next, look for potentially odd descriptions using the description length.

In [None]:
data['des_len'] = data.Description.str.len()
data.head()

In [None]:
data.des_len.describe()

In [None]:
data[data.des_len < 10].head()

Everything looks normal.

Next, do the same check on the invoice number.

In [None]:
data['noinvo_len'] = data.InvoiceNo.apply(lambda x: len(x))
data.head()

In [None]:
data.noinvo_len.describe()

Everything looks fine.

In [None]:
data = data.drop(columns = ['des_len', 'noinvo_len'])
data.head()

Before moving forward, replace null values in CustomerID with 'Guest', so that rows without a customer ID are not silently dropped by later group-by steps.

In [None]:
value = {'CustomerID':'Guest'}
data = data.fillna(value = value)
data[data.CustomerID == 'Guest'].head()

# **Data Mining**

Enriching Data:

Adding 'TotalPrice' column

In [None]:
data['TotalPrice'] = data.Quantity * data.UnitPrice
data.shape

# Customer Segmentation

I want to know who our customers really are, based on their purchase behavior. Let's group the data by invoice to see how much a customer buys in each purchase.

In [None]:
# Sum only the columns that are meaningful per invoice (a summed UnitPrice is not)
data2 = data.groupby(['InvoiceNo','InvoiceDate','CustomerID'])[['Quantity','TotalPrice']].sum()
data2.head()

As stated in the case, the budget is limited. So before rolling out a promo, let's narrow the scope and focus on the majority of our customers.

In [None]:
data2.describe()

Based on the data, the large average invoice (a mean TotalPrice of roughly 519 per purchase) suggests, for now, that our customers are mainly resellers.

To narrow the scope to the majority of our customers, let's look at the outliers and remove them.

In [None]:
from scipy.stats import skew

In [None]:
skew(data2.TotalPrice)

The data is highly skewed, so let's remove the outliers by keeping only the rows within one standard deviation of the mean: mean - stddev <= data <= mean + stddev

In [None]:
data2 = data2.query('TotalPrice >= 0 and TotalPrice <= 518.593623 + 1799.695926')
# 518.593623 and 1799.695926 are the mean and stddev of TotalPrice from describe();
# the lower bound is 0 because mean - stddev is negative
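
The bounds in the query above are typed in by hand from `describe()`. As a sketch of how they could be computed instead, the hypothetical helper `std_bounds` below (not part of the original notebook) returns the same mean ± stddev range, clipped at zero:

```python
import pandas as pd

def std_bounds(values: pd.Series, floor: float = 0.0):
    """Return (lower, upper) = (mean - std, mean + std), with lower clipped at `floor`."""
    mean = values.mean()
    std = values.std()  # sample standard deviation, the same one describe() reports
    return max(mean - std, floor), mean + std

# Toy invoice totals with one extreme value (illustrative only)
toy = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])
lower, upper = std_bounds(toy)

# Keep only the values inside the range
kept = toy[(toy >= lower) & (toy <= upper)]
print(lower, upper, kept.tolist())
```

On the real data this would be called as `std_bounds(data2.TotalPrice)`, so the filter stays correct if the upstream cleaning changes.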

In [None]:
data2.describe()

In [None]:
sns.boxplot(x=data2.TotalPrice)

In [None]:
sns.displot(data2.TotalPrice)

In [None]:
skew(data2.Quantity)

Let's also remove the outliers in quantity.

In [None]:
data2 = data2.query('Quantity >= 0  and Quantity <= 220.835074 + 248.776217')
data2.describe()

In [None]:
print(skew(data2.TotalPrice))
print(skew(data2.Quantity))

Since the distributions now look reasonable, we have a smaller scope and will focus on this kind of customer.

In [None]:
sns.boxplot(x=data2.TotalPrice)

Now let's filter the main table to keep only the invoices that passed the filters above.

In [None]:
data2 = data2.reset_index()
invoice = data2['InvoiceNo'].tolist()

In [None]:
# .copy() so that new columns can be added later without a SettingWithCopyWarning
data = data[data.InvoiceNo.isin(invoice)].copy()

In [None]:
data.head()

In [None]:
data.describe()

# Timeseries Trend

Plot the number of unique users each month.

In [None]:
#first we make a new column named month
data['Month'] = data.InvoiceDate.dt.to_period('M')
data.head()

In [None]:
#f, ax = plt.subplots(figsize=(20, 6))

Since we want to see identified customers/users, let's drop the 'Guest' user.

In [None]:
data = data[data.CustomerID != 'Guest'].copy()

In [None]:
user_month = data.groupby('Month').CustomerID.nunique().reset_index()
user_month.columns = ['month','total_user']
user_month.head()
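
Note that `nunique()` counts each customer once per month, however many invoices they have, which is different from counting rows. A small sketch on invented data shows the difference:

```python
import pandas as pd

# Toy transactions: customer A buys twice in January (values are illustrative only)
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B', 'B', 'C', 'D'],
    'InvoiceDate': pd.to_datetime([
        '2020-01-05', '2020-01-20', '2020-01-11',
        '2020-02-02', '2020-02-15', '2020-02-28',
    ]),
})
toy['Month'] = toy['InvoiceDate'].dt.to_period('M')

# Unique customers per month versus raw row counts per month
monthly_users = toy.groupby('Month')['CustomerID'].nunique()
monthly_rows = toy.groupby('Month')['CustomerID'].count()
print(monthly_users)
print(monthly_rows)
```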

In [None]:
f, ax = plt.subplots(figsize=(15, 6))

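# Period values cannot be plotted directly, so convert them to strings for the x axis
sns.lineplot(x=user_month['month'].astype(str), y=user_month['total_user'])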
plt.xlabel('Month')
plt.ylabel('Unique User')
plt.title('Unique User by Month')

The graph shows a healthy number of unique users each month. Keep in mind that for December we only have data up to December 9th.

# Insights so far

* The majority of our customers are resellers
* We have a good number of unique users each month, with an upward trend toward the end of the year

# Action

Given this information, we want to roll out a promotion to reach our goals. Before that, we need to know whom the promotion should target. To answer this, we first cluster the customers to understand them better.

# Customer Classification

In [None]:
data_cust = data[['CustomerID','InvoiceDate','Quantity','UnitPrice','TotalPrice','StockCode']]
data_cust.head()

For clustering, let's summarize each customer by their purchase behavior.

In [None]:
#total unique item bought per cust
total_bought = data_cust.groupby('CustomerID').StockCode.nunique().reset_index()
total_bought.columns = ['cust_id','total_product']
total_bought.head()

In [None]:
#total transaction value
total_trx = data_cust.groupby('CustomerID').TotalPrice.sum().reset_index()
total_trx.columns = ['cust_id','total_trx']
total_trx.head()

In [None]:
data.InvoiceDate.max()

In [None]:
#Days since each transaction, measured from the latest invoice in the data (2020-12-09 12:50:00)
data['LastTrx'] = (data.InvoiceDate.max() - data.InvoiceDate).dt.days
data.tail()

In [None]:
cus_recent_trx = data.groupby('CustomerID').LastTrx.min().reset_index()
cus_recent_trx.columns = ['cust_id','recent_trx']
cus_recent_trx.head()

In [None]:
#buying frequency in a year
cus_frequency = data_cust.groupby('CustomerID').InvoiceDate.nunique().reset_index()
cus_frequency.columns = ['cust_id','freq']
cus_frequency.head()

In [None]:
#merge the 4 tables
cust = pd.DataFrame()
cust['cust_id'] = cus_recent_trx.cust_id
cust = cust.merge(total_bought, on='cust_id')
cust = cust.merge(total_trx, on='cust_id')
cust = cust.merge(cus_recent_trx, on='cust_id')
cust = cust.merge(cus_frequency, on='cust_id')
cust.head()
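
The four per-customer tables merged above could also be built in one pass with pandas named aggregation. This is only a sketch on an invented `toy` frame, using the same column names as the notebook:

```python
import pandas as pd

# Toy transactions (values are illustrative only)
toy = pd.DataFrame({
    'CustomerID': ['A', 'A', 'A', 'B'],
    'InvoiceDate': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-11-30', '2020-06-15']),
    'StockCode': ['X1', 'X2', 'X1', 'X3'],
    'TotalPrice': [10.0, 5.0, 20.0, 7.5],
})

# Days between each transaction and the latest transaction in the data
snapshot = toy['InvoiceDate'].max()
toy['LastTrx'] = (snapshot - toy['InvoiceDate']).dt.days

# One row per customer: product variety, spend, recency, and frequency
cust_toy = (
    toy.groupby('CustomerID')
       .agg(total_product=('StockCode', 'nunique'),
            total_trx=('TotalPrice', 'sum'),
            recent_trx=('LastTrx', 'min'),
            freq=('InvoiceDate', 'nunique'))
       .reset_index()
       .rename(columns={'CustomerID': 'cust_id'})
)
print(cust_toy)
```

A single `agg` call keeps all four features aligned on the same customer index, so no merges are needed.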

# K-means Clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
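# Calculate sum of squared distances
ssd = []
K = range(1,10)
# cust_id is an identifier rather than a behavioural feature, so it is left out of the fit
features = cust.drop(columns='cust_id')
for k in K:
    km = KMeans(n_clusters=k, random_state=42)
    km = km.fit(features)
    ssd.append(km.inertia_)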
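
The `inertia_` value collected in the loop is the sum of squared distances from each point to its nearest cluster centre. A small worked example with invented points and hand-picked centroids (not fitted by K-means) shows the calculation:

```python
import numpy as np

# Four toy points forming two obvious groups, and one centroid per group
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Squared distance from every point to every centroid, shape (n_points, n_centroids)
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each point contributes its squared distance to the closest centroid
ssd_toy = sq_dist.min(axis=1).sum()
print(ssd_toy)
```

Adding clusters can only lower this total, which is why the elbow plot looks for where the improvement levels off rather than for a minimum.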

In [None]:
# Plot sum of squared distances / elbow method
plt.figure(figsize=(10,6))
plt.plot(K, ssd, 'bx-')
plt.xlabel('k')
plt.ylabel('ssd')
plt.title('Elbow Method For Optimal k')
plt.show()

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion.

In this case, we select K = 4
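
K-means relies on Euclidean distance, so a column with a large range (such as `total_trx`) can dominate the others. The notebook fits on the raw columns; a possible refinement, sketched here as an assumption rather than what was actually run, is to z-score each feature first:

```python
import pandas as pd

# Toy customer features on very different scales (values are illustrative only)
toy = pd.DataFrame({
    'total_product': [5, 20, 60],
    'total_trx': [100.0, 1500.0, 9000.0],
    'recent_trx': [300, 40, 2],
    'freq': [1, 4, 15],
})

# z-score: subtract the column mean and divide by the column standard deviation
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature carries comparable weight in the distance calculation. Clusters found on scaled data may differ from the ones described in this notebook.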

In [None]:
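# Fit on the behavioural features only (cust_id is just an identifier);
# a fixed random_state keeps the cluster labels reproducible between runs
kmeans = KMeans(n_clusters=4, random_state=42)
model = kmeans.fit(cust.drop(columns='cust_id'))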

In [None]:
pred = model.labels_
cust['Cluster'] = pred
cust.head()

In [None]:
plt.figure(figsize=(10,6))

sns.scatterplot(data=cust, x="total_trx", y="recent_trx", hue="Cluster")
plt.title('Cluster by Total Transaction and Recency')
plt.show()

To summarize the clusters, let's build another table showing what a typical member of each cluster looks like.

In [None]:
# cust_id is a string column, so average only the numeric features
customers = cust.groupby('Cluster').mean(numeric_only=True).reset_index()
customers.sort_values('total_trx')

In [None]:
contribution = cust.groupby('Cluster').total_trx.sum().reset_index()
contribution['Contribution (%)'] = (contribution.total_trx/contribution.total_trx.sum())*100
contribution

From the "Customers" and "Contribution" tables, we can now classify our customers by the cluster they belong to. Let's determine what kind of customer each cluster represents.

Cluster 0: Low unique product, low spending, not recent trx, low freq, high contribution --> **Seasonal Customer**

Cluster 1: Low unique product, low spending, not recent trx, low freq, high contribution --> **Seasonal Customer**

Cluster 2: Medium unique product, medium spending, recent trx, medium freq, high contribution --> **Loyal Customer**

Cluster 3: High unique product, very high spending, recent trx, high freq, low contribution --> **Dropshipper**


Let's see what our customer distribution looks like

In [None]:
sns.displot(data=cust,x='Cluster')

From the graph, we can see that most of our customers are in clusters 0 and 1.

# Conclusion

From the insights we gathered, we know that:
1. Most of our customers are resellers
2. From further classification, our best customers are seasonal customers, and our loyal customers are medium-spending resellers
3. Based on this, we want to focus our budget on strengthening the business by targeting these kinds of customers.


# Proposed Action

**Proposed Idea 1:**

**Idea:** Create a VIP-based membership

**Goals:** To reward the customers in cluster 2 with more benefits (discounts, free shipping, etc.) so we can retain them, and to create a selling point for customers outside that cluster

**Proposed Idea 2:**
 
**Idea:** Roll out seasonal promotions such as seasonal discounts, bundle offers, etc.

**Goals:** To attract new customers and to keep loyal customers using our service.

**Proposed Idea 3:**

**Idea:** Offer seasonal personalization such as seasonal/holiday item categories, push notifications on trending items, etc.

**Goals:** To help our customers navigate our website, so that the chance of conversion is much higher.