# Introduction

**Objective**<br>
* Build clustering models to segment customers based on RFM (recency, frequency, monetary) features

**About The Dataset**
* This dataset is obtained from [Retail Transaction Data | Kaggle](https://www.kaggle.com/regivm/retailtransactiondata)

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import seaborn as sns
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load and Describe Data

## Load Dataset 

In [None]:
df = pd.read_csv('/kaggle/input/retailtransactiondata/Retail_Data_Transactions.csv')
df.head()

## Data Description

In [None]:
df.info()

## Check Null and Missing Values

In [None]:
data_missing_value = df.isnull().sum().reset_index()
data_missing_value.columns = ['feature','missing_value']
data_missing_value

**There are no missing values in this dataset**
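
Missing values and duplicated rows are two separate checks. The sketch below runs both on a small made-up table (the values are illustrative, not taken from the dataset). Note that in transaction data, identical rows may be genuine repeat purchases rather than errors.

```python
import pandas as pd

# Toy stand-in for the transactions table (values are made up for illustration)
toy = pd.DataFrame({
    'customer_id': ['CS1', 'CS1', 'CS2', 'CS2'],
    'trans_date': ['11-Feb-13', '11-Feb-13', '15-Mar-15', None],
    'tran_amount': [35, 35, 39, 59],
})

missing_per_column = toy.isnull().sum()     # number of nulls in each column
n_duplicate_rows = toy.duplicated().sum()   # rows identical to an earlier row

print(missing_per_column)
print('duplicate rows:', n_duplicate_rows)
```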

## Numerical Data

In [None]:
numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
display(df.select_dtypes(include=numerics).columns)
print(df.select_dtypes(include=numerics).shape)
data_num = df.select_dtypes(include=numerics)
data_num.head(3)

## Non Numerical Data

In [None]:
display(df.select_dtypes(include=['object']).columns)
print(df.select_dtypes(include=object).shape)
data_cat = df.select_dtypes(include=['object'])
data_cat.head(3)

**Numeric and non-numeric features are separated into two different objects.<br>
The `tran_amount` feature is numeric.<br>
The `customer_id` and `trans_date` features are non-numeric.**

# Data Pre-Processing

## Convert `trans_date` column to date data type

**Convert the `trans_date` feature from a string to a datetime data type**

In [None]:
#convert string to date type
df['trans_date'] = pd.to_datetime(df['trans_date'])

In [None]:
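#format the dates as YYYY-MM-DD strings (this drops any time component)
df['trans_date'] = df['trans_date'].dt.strftime('%Y-%m-%d')

In [None]:
#convert the formatted strings back to the datetime type
df['trans_date'] = df['trans_date'].astype('datetime64[ns]')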
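
As a sketch, the string-to-date conversion can be tried on a few made-up values first. The `'%d-%b-%y'` format is an assumption about how the raw dates are written (for example `11-Feb-13`); adjust it to match the actual file.

```python
import pandas as pd

# Made-up date strings; the day-month-year layout is an assumption
raw = pd.Series(['11-Feb-13', '15-Mar-15', '26-Feb-13'])

# An explicit format avoids ambiguous parsing and raises on rows that do not match
parsed = pd.to_datetime(raw, format='%d-%b-%y')

print(parsed.dtype)
print(parsed.min(), parsed.max())
```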

In [None]:
#first and last date available in our dataset
print(df['trans_date'].min(), df['trans_date'].max())

**The latest transaction date is used as the reference for calculating the recency value.<br>
One day is added to that date so that no transaction has a recency of zero.**

In [None]:
#use the day after the latest transaction date as the current date
now = dt.datetime(2015, 3, 17)
#days elapsed between each transaction and the current date, as a float
df['hist'] = (now - df['trans_date']) / np.timedelta64(1, 'D')
df.head()
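
The reference date above is typed in by hand. As a hedged alternative, it could be derived from the data itself, so the calculation still holds if the file changes. The toy frame below is made up; only the column names mirror the notebook.

```python
import pandas as pd

# Toy transactions (made-up values); column names mirror the notebook's
toy = pd.DataFrame({
    'customer_id': ['CS1', 'CS1', 'CS2'],
    'trans_date': pd.to_datetime(['2015-03-10', '2015-03-16', '2015-01-01']),
})

# Reference date: one day after the latest transaction in the data
reference_date = toy['trans_date'].max() + pd.Timedelta(days=1)

# Whole days elapsed between each transaction and the reference date
toy['hist'] = (reference_date - toy['trans_date']).dt.days

print(reference_date.date())
print(toy)
```

Because the reference date is one day past the latest transaction, the smallest possible `hist` value is 1 rather than 0.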

## Make recency, frequency, and monetary columns by grouping on `customer_id`

In [None]:
#group by `customer_id` and aggregate three columns into the RFM features
rfm_table = df.groupby('customer_id').agg({'hist': 'min',           # Recency: days since the latest purchase
                                           'customer_id': 'count',  # Frequency: number of transactions
                                           'tran_amount': 'sum'})   # Monetary: total amount spent

In [None]:
#rename the columns to `recency`, `frequency`, and `monetary`
rfm_table.rename(columns={'hist': 'recency', 
                         'customer_id': 'frequency', 
                         'tran_amount': 'monetary'}, inplace=True)

In [None]:
rfm_table.head()

In [None]:
#check the value from one of the customer data
df[df['customer_id']=='CS1112']
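
The same kind of sanity check can be done on a toy table small enough to verify by hand. This is a sketch with made-up values; it uses pandas named aggregation, which is an alternative to the aggregate-then-rename steps above, not the notebook's own code.

```python
import pandas as pd

# Made-up transactions: `hist` is days since each purchase
toy = pd.DataFrame({
    'customer_id': ['CS1', 'CS1', 'CS1', 'CS2'],
    'hist': [40.0, 5.0, 12.0, 90.0],
    'tran_amount': [50, 30, 20, 75],
})

# Named aggregation builds and names the three RFM columns in one step
toy_rfm = toy.groupby('customer_id').agg(
    recency=('hist', 'min'),             # days since the most recent purchase
    frequency=('tran_amount', 'count'),  # number of transactions
    monetary=('tran_amount', 'sum'),     # total amount spent
)

print(toy_rfm)
```

For `CS1`, the three purchases give a recency of 5 days, a frequency of 3, and a monetary value of 100, which matches what a manual look at the rows would suggest.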

# Features Standardization

In [None]:
rfm_table.describe()

In [None]:
#create a new data frame from the dataframe `rfm_table`
rfm_segmentation = rfm_table.copy()

In [None]:
feats = ['recency','frequency','monetary']
X = rfm_segmentation[feats].values

from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
new_df = pd.DataFrame(data = X_std, columns = feats)
new_df.describe()

**Standardization puts every feature on the same scale, so that distance-based clustering does not favor the feature with the largest range.**
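
To illustrate why scale matters, the sketch below uses made-up numbers (not the RFM table) and measures how much each feature contributes to the squared Euclidean distance between two rows, before and after `StandardScaler`. The helper `feature_share` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: column 0 is a small-scale feature, column 1 a large-scale one
X_demo = np.array([[10.0, 1000.0],
                   [20.0, 3000.0],
                   [30.0, 2000.0]])

X_demo_std = StandardScaler().fit_transform(X_demo)

def feature_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

raw_share = feature_share(X_demo[0], X_demo[1])
std_share = feature_share(X_demo_std[0], X_demo_std[1])

print('column means after scaling:', X_demo_std.mean(axis=0).round(6))
print('column std after scaling:  ', X_demo_std.std(axis=0).round(6))
print('raw share per feature:     ', raw_share.round(4))
print('scaled share per feature:  ', std_share.round(4))
```

On the raw values, almost the entire distance comes from the large-scale column; after scaling, the small-scale column contributes a visible share as well.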

# Modeling

## K-Means Clustering

In [None]:
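#K-Means internal evaluation: elbow method
from sklearn.cluster import KMeans
inertia = []
n_clusters_range = range(1, 11)

for i in n_clusters_range:
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(new_df[feats].values)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(12, 6))
#plot inertia against the actual number of clusters (1 to 10)
plt.plot(n_clusters_range, inertia, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('inertia')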

**In the elbow method, the optimal number of clusters is taken at the point after which the inertia starts to decline roughly linearly.<br>
Here that point is at three clusters, so the model is built with three clusters.**
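
Reading the elbow off a plot is a judgment call. As a sketch of a complementary check, the silhouette score can be computed for each candidate number of clusters. The example runs on synthetic, well-separated blobs (an assumption made so that the expected answer is known), not on the RFM data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

ks = list(range(2, 8))  # the silhouette score needs at least two clusters
inertias, silhouettes = [], []
for k in ks:
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X_demo)
    inertias.append(model.inertia_)
    silhouettes.append(silhouette_score(X_demo, model.labels_))

for k, inertia_k, sil_k in zip(ks, inertias, silhouettes):
    print(k, round(inertia_k, 1), round(sil_k, 3))

# Candidate with the highest mean silhouette score
best_k = ks[int(np.argmax(silhouettes))]
print('best k by silhouette:', best_k)
```

Inertia always tends to fall as clusters are added, whereas the silhouette score peaks, so the two views together give a more defensible choice than the elbow alone.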

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(new_df.values)

In [None]:
#add a `cluster` column holding the K-Means labels
new_df['cluster'] = kmeans.labels_

**The cluster results from K-Means Clustering are put into the `cluster` column**

In [None]:
#inspect a few customers assigned to cluster 0
new_df[new_df.cluster == 0].head()

In [None]:
#see the distribution of recency feature based on cluster
sns.boxplot(x=new_df.cluster, y=new_df.recency)

In [None]:
#see the distribution of frequency feature based on cluster
sns.boxplot(x=new_df.cluster, y=new_df.frequency)

In [None]:
#see the distribution of monetary feature based on cluster
sns.boxplot(x=new_df.cluster, y=new_df.monetary)

**Loyal customers have low `recency` and high `frequency` and `monetary` values, while regular customers have high `recency` and lower `frequency` and `monetary` values.<br>
The `recency`, `frequency`, and `monetary` boxplots show that the clusters, ordered from most loyal to regular customers, are cluster 2, cluster 1, and cluster 0.**
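
The ordering read from the boxplots can also be summarized as a table of per-cluster means. The sketch below uses made-up RFM values and hypothetical cluster labels chosen to mirror the description above; it is not output from the model.

```python
import pandas as pd

# Made-up RFM values with hypothetical cluster labels
toy_rfm = pd.DataFrame({
    'recency':   [5, 8, 60, 75, 200, 230],
    'frequency': [30, 28, 18, 20, 6, 5],
    'monetary':  [2400, 2200, 1200, 1300, 300, 250],
    'cluster':   [2, 2, 1, 1, 0, 0],
})

# Mean of each RFM feature within each cluster
profile = toy_rfm.groupby('cluster')[['recency', 'frequency', 'monetary']].mean()

# Lowest mean recency first, i.e. most recently active cluster at the top
ranked = profile.sort_values('recency')
print(ranked)
```

Computing the profile on the original (unstandardized) values keeps the numbers in days, transaction counts, and currency units, which are easier to interpret than standardized scores.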

In [None]:
#see clustering distribution based on three features
sns.scatterplot(data=new_df, x='monetary', y='recency', size='frequency', 
                hue='cluster')

## Agglomerative Clustering

**In addition to K-Means, I also use agglomerative clustering to compare the customer segmentation produced by the two models.**

In [None]:
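from sklearn.cluster import AgglomerativeClustering
ac = AgglomerativeClustering(n_clusters=3)
#fit on the RFM features only; `new_df` now also holds the K-Means `cluster` column
ac.fit(new_df[feats].values)
#store the agglomerative labels in their own column
new_df['cluster_ac'] = ac.labels_

In [None]:
#see the agglomerative clustering distribution based on three features
sns.scatterplot(data=new_df, x='monetary', y='recency', size='frequency',
                hue='cluster_ac')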

**There is no significant difference between the results of the two clustering models, so customers are segmented into three clusters based on their RFM features.**
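
Agreement between two clusterings can also be quantified rather than judged by eye. One option, sketched here on synthetic blobs rather than the RFM data, is the adjusted Rand index: it ignores how the cluster ids happen to be numbered and returns 1.0 for identical partitions and values near 0 for unrelated ones.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with three well-separated groups (illustrative only)
X_demo, _ = make_blobs(n_samples=150,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
ac_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)

# Label-permutation-invariant agreement between the two partitions
ari = adjusted_rand_score(km_labels, ac_labels)
print('adjusted Rand index:', round(ari, 3))
```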