# Business Problem with Customer Segmentation 

### An e-commerce company wants to segment its customers and tailor its marketing strategy to each segment. To do this, we will describe customer behavior and group customers by clustering. In other words, customers who behave similarly are placed in the same group, and we then try to develop sales and marketing techniques specific to each group.

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime as dt

import warnings
warnings.simplefilter(action = 'ignore')

## Data Set Story:
### https://www.kaggle.com/mathchi/online-retail-ii-data-set-from-ml-repository

### This Online Retail II data set contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011.

### The company mainly sells unique all-occasion gift-ware.

### Many customers of the company are wholesalers.

## Features Information:

### InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
### StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
### Description: Product (item) name. Nominal.
### Quantity: The quantities of each product (item) per transaction. Numeric.
### InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
### UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
### CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
### Country: Country name. Nominal. The name of the country where a customer resides.

## Data Understanding

In [None]:
pd.set_option('display.max_columns', None); pd.set_option('display.max_rows', None);  # to display all columns and rows
pd.set_option('display.float_format', lambda x: '%.2f' % x) # show floats with two decimal places

# read_excel has no `sep` argument (that belongs to read_csv), so it is not passed here
onret = pd.read_excel('../input/online-retail-ii-data-set-from-ml-repository/online_retail_II.xlsx', 
                      sheet_name = 'Year 2010-2011')
onret.head()

In [None]:
onret.info()

In [None]:
df = onret.dropna().copy() # Drop rows with missing values; copy so new columns can be added safely
df.info()

In [None]:
df.head(2)

In [None]:
df["Customer ID"] = df["Customer ID"].astype(int) # Convert the float customer IDs to integers
df["TotalPrice"] = df["Quantity"] * df["Price"]  # Amount per invoice line = quantity x unit price
df.head()

In [None]:
df[df['Invoice'].str.startswith("C", na = False)]  # Invoice codes starting with 'C' indicate cancelled transactions.

## Customer Segmentation with RFM Scores

### RFM stands for **Recency**, **Frequency**, **Monetary**.

### It is a technique that helps determine marketing and sales strategies based on customers' buying habits.

### - *Recency*: Time since the customer's last purchase

###      -- In other words, it is the “time since the last contact of the customer”.

###      -- Today's date - Last purchase date

###      -- For example, if we run this analysis today, recency is today's date minus the date of the last purchase.

###      -- The result may be, for example, 20 days for one customer and 100 days for another. The customer with 20 is "hotter": they have been in contact with us more recently.

### - *Frequency*: Total number of purchases.

### - *Monetary* (Monetary Value): Total spending by the customer.
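
As a minimal sketch of the three definitions above, the following computes Recency, Frequency and Monetary on a small made-up transaction table. The column names and values are hypothetical and chosen only for illustration; the reference date is assumed to be the latest purchase date in the toy data.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the Online Retail II data)
toy = pd.DataFrame({
    "customer": [1, 1, 2, 2, 2, 3],
    "invoice": ["A1", "A2", "B1", "B2", "B3", "C1"],
    "date": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-09-01",
                            "2011-10-01", "2011-11-01", "2011-12-09"]),
    "amount": [10.0, 20.0, 5.0, 5.0, 5.0, 100.0],
})

# Reference date ("today"): the most recent purchase in the toy data
reference_date = toy["date"].max()

# Last purchase date of each customer
last_purchase = toy.groupby("customer")["date"].max()

toy_rfm = pd.DataFrame({
    # Recency: days between the reference date and the last purchase
    "Recency": (reference_date - last_purchase).dt.days,
    # Frequency: number of distinct invoices
    "Frequency": toy.groupby("customer")["invoice"].nunique(),
    # Monetary: total amount spent
    "Monetary": toy.groupby("customer")["amount"].sum(),
})
print(toy_rfm)
```

Customer 3 bought on the reference date itself, so their recency is 0 even though they have only one invoice.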


## Recency




What is "today"? If we took the actual current date, the gap to the transactions in this data set would be very large.

For this reason, let us define our own "today" based on the structure of this data set.

We can set this day to the latest invoice date in the data set.

The segmentation is then made relative to the day of the last record.

In [None]:
df["InvoiceDate"].max()

In [None]:
today = df["InvoiceDate"].max()
today

In [None]:
temp_df = (today - df.groupby("Customer ID").agg({"InvoiceDate":"max"})) # Time elapsed since each customer's last purchase.
temp_df.head()

### For each customer, we subtract the customer's last purchase date from today's date.

### This leaves a single row per customer holding the time since that customer's last purchase.

In [None]:
temp_df.rename(columns={"InvoiceDate":"Recency"},inplace=True)
recency_df = temp_df["Recency"].apply(lambda x: x.days)
recency_df.head()

In [None]:
recency_df = pd.DataFrame(recency_df)
recency_df.head()

## Frequency

In [None]:
temp_df = df.groupby(["Customer ID", "Invoice"]).agg({"Invoice": "count"})
freq_df = temp_df.groupby("Customer ID").agg({"Invoice":"count"})
freq_df.rename(columns={"Invoice":"Frequency"},inplace = True)
freq_df.head()

## Monetary

In [None]:
monetary_df = df.groupby("Customer ID").agg({"TotalPrice": "sum"})
monetary_df.rename(columns={"TotalPrice":"Monetary"},inplace=True)
monetary_df.head()

### Now we need to score customers by how recently they bought (Recency), how often they bought (Frequency) and how much they spent (Monetary).


In [None]:
rfm_df = pd.concat([recency_df, freq_df, monetary_df], axis = 1)
rfm_df.head()

In [None]:
# Outlier analysis of Monetary, Frequency and Recency

attributes = ['Monetary','Frequency','Recency']
plt.rcParams['figure.figsize'] = [10,8]
sns.boxplot(data = rfm_df[attributes], orient="v", palette="Set2" ,whis=1.5,saturation=1, width=0.7)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Range", fontweight = 'bold')
plt.xlabel("Attributes", fontweight = 'bold')

In [None]:
# Removing (statistical) outliers for Monetary
# Note: Q1 and Q3 below are the 5th and 95th percentiles, not the usual quartiles
Q1 = rfm_df.Monetary.quantile(0.05)
Q3 = rfm_df.Monetary.quantile(0.95)
IQR = Q3 - Q1
rfm_df = rfm_df[(rfm_df.Monetary >= Q1 - 1.5*IQR) & (rfm_df.Monetary <= Q3 + 1.5*IQR)]

# Removing (statistical) outliers for Recency
Q1 = rfm_df.Recency.quantile(0.05)
Q3 = rfm_df.Recency.quantile(0.95)
IQR = Q3 - Q1
rfm_df = rfm_df[(rfm_df.Recency >= Q1 - 1.5*IQR) & (rfm_df.Recency <= Q3 + 1.5*IQR)]

# Removing (statistical) outliers for Frequency
Q1 = rfm_df.Frequency.quantile(0.05)
Q3 = rfm_df.Frequency.quantile(0.95)
IQR = Q3 - Q1
rfm_df = rfm_df[(rfm_df.Frequency >= Q1 - 1.5*IQR) & (rfm_df.Frequency <= Q3 + 1.5*IQR)]

In [None]:
rfm_1 = rfm_df.copy()

In [None]:
rfm_df.head()


## Scoring for RFM

- Each metric is scored from 1 to 5, where 5 is the best (for Recency, 5 means the most recent). Let's use the 'qcut' method to score.
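
To illustrate how `qcut` turns values into scores, here is a minimal sketch on made-up numbers. It also shows why ranking with `method="first"` can help for a heavily tied variable such as Frequency: when many values are identical, several quantile edges coincide and `qcut` raises an error, whereas ranks are always distinct. All values below are hypothetical.

```python
import pandas as pd

# Hypothetical recency values in days (smaller = more recent)
recency = pd.Series([1, 5, 10, 20, 40, 80, 120, 200, 300, 365])

# Five equal-sized bins; labels are reversed so recent customers get 5
recency_score = pd.qcut(recency, 5, labels=[5, 4, 3, 2, 1])
print(recency_score.tolist())

# Hypothetical frequency values with many ties
frequency = pd.Series([1, 1, 1, 1, 1, 1, 1, 2, 3, 10])

try:
    pd.qcut(frequency, 5, labels=[1, 2, 3, 4, 5])
    tie_error = False
except ValueError:
    # Several quantiles fall on the same value, so the bin edges are not unique
    tie_error = True

# Ranking first breaks ties by order of appearance, so the edges are unique
frequency_score = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(tie_error, frequency_score.tolist())
```

One consequence of ranking is that customers with identical frequencies can receive different scores, depending only on their position in the table.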

In [None]:
rfm_df["RecencyScore"] = pd.qcut(rfm_df['Recency'], 5, labels = [5, 4, 3, 2, 1])
rfm_df["FrequencyScore"] = pd.qcut(rfm_df['Frequency'].rank(method="first"), 5, labels = [1, 2, 3, 4, 5])
rfm_df["MonetaryScore"] = pd.qcut(rfm_df['Monetary'], 5, labels = [1, 2, 3, 4, 5])
rfm_df["RFM_SCORE"] = (rfm_df['RecencyScore'].astype(str) 
                                + rfm_df['FrequencyScore'].astype(str) 
                                + rfm_df['MonetaryScore'].astype(str))
rfm_df.head()

### Let's segment with regex. We set the full RFM score aside and use only the R and F scores.

### Example: if R is 1-2 and F is 1-2, label the customer 'Hibernating'
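
The mapping idea can be sketched in plain Python: concatenate the R and F scores into a two-character string and return the segment whose pattern matches it. The helper `segment_of` and the reduced pattern table are hypothetical and serve only as an illustration.

```python
import re

# A reduced, illustrative pattern table: first character = R score, second = F score
patterns = {
    r"[1-2][1-2]": "Hibernating",
    r"[1-2][3-4]": "At Risk",
    r"33": "Need Attention",
    r"5[4-5]": "Champions",
}

def segment_of(r_score, f_score):
    """Return the segment name for a pair of scores, or None if no pattern matches."""
    code = f"{r_score}{f_score}"
    for pattern, name in patterns.items():
        if re.fullmatch(pattern, code):
            return name
    return None

print(segment_of(1, 2))
print(segment_of(5, 5))
```

Because each pattern describes a rectangle in the R-F grid, the patterns in a complete table should not overlap; otherwise the result would depend on the order in which they are tried.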

In [None]:
seg_map = {r'[1-2][1-2]': 'Hibernating',
            r'[1-2][3-4]': 'At Risk',
            r'[1-2]5': 'Can\'t Lose',
            r'3[1-2]': 'About to Sleep',
            r'33': 'Need Attention',
            r'[3-4][4-5]': 'Loyal Customers',
            r'41': 'Promising',
            r'51': 'New Customers',
            r'[4-5][2-3]': 'Potential Loyalists',
            r'5[4-5]': 'Champions'}
    
rfm_df['Segment'] = rfm_df['RecencyScore'].astype(str) + rfm_df['FrequencyScore'].astype(str)
rfm_df['Segment'] = rfm_df['Segment'].replace(seg_map, regex=True)
rfm_df.head()

# KMeans:

In [None]:
rfm = rfm_1
rfm.head()

## Rescaling the Attributes

### It is extremely important to rescale the variables so that they have a comparable scale. There are two common ways of rescaling:

### Min-Max scaling
### Standardisation (mean-0, sigma-1)
### Here, we will use Min-Max scaling.
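
To make the two options concrete, here is a minimal NumPy sketch on a made-up array. Min-Max scaling computes (x - min) / (max - min), and standardisation computes (x - mean) / std. The values are hypothetical.

```python
import numpy as np

# Hypothetical monetary values with one large observation
x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Min-Max scaling: the result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: the result has mean 0 and standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```

Min-Max scaling is bounded but sensitive to extreme values: the single large observation here pushes the other four into the lower third of the range.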

In [None]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler((0,1))
cols = rfm.columns
index = rfm.index
scaled_rfm = mms.fit_transform(rfm)
scaled_rfm = pd.DataFrame(scaled_rfm, columns=cols, index = index)
scaled_rfm.head()

## Building the Model


### K-Means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.

### The algorithm works as follows:

1. First we initialize k points, called means, randomly.
2. We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.
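
The steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: the function name `simple_kmeans`, its parameters and the toy points are all hypothetical, and the initial means are assumed to be drawn from the observations themselves.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial means
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step 2a: assign every point to its closest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: move each mean to the average of the points assigned to it
        for j in range(k):
            if np.any(labels == j):  # keep the old mean if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    # Step 3: after the fixed number of iterations, return the result
    return centers, labels

# Two well-separated hypothetical groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, labels = simple_kmeans(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the random initialization.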

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 4)
kmeans = kmeans.fit(scaled_rfm)
kmeans

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans.labels_

## OPTIMAL NUMBER OF CLUSTERS


Elbow Curve to get the right number of Clusters

A fundamental step for a clustering algorithm such as K-means is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
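
The quantity usually plotted in an elbow curve is the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`. A minimal sketch of that quantity on made-up one-dimensional data follows; the helper name `inertia` and the chosen centers are hypothetical.

```python
import numpy as np

def inertia(X, centers):
    """Sum of squared distances from each point to its closest center."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# Two obvious groups on a line
X = np.array([[0.0], [1.0], [10.0], [11.0]])

one_center = np.array([[5.5]])            # k = 1: the overall mean
two_centers = np.array([[0.5], [10.5]])   # k = 2: one mean per group

print(inertia(X, one_center))
print(inertia(X, two_centers))
```

Going from one center to two reduces the value sharply. Adding further centers can only bring small improvements on this data, and that flattening is the "elbow" one looks for in the curve.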

In [None]:
ssd = []

K = range(1,30)

for k in K:
    kmeans = KMeans(n_clusters = k).fit(scaled_rfm)  # fit on the rescaled features
    ssd.append(kmeans.inertia_)
    
plt.plot(K, ssd, "bx-")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of squared distances (inertia)")
plt.title("Elbow method for Optimum number of clusters")

In [None]:
from yellowbrick.cluster import KElbowVisualizer

kmeans = KMeans()
visu = KElbowVisualizer(kmeans, k = (2,20))
visu.fit(scaled_rfm)  # fit on the rescaled features
visu.show();  # poof() was renamed to show() in newer yellowbrick versions

### MODEL OPTIMIZATION

In [None]:
# Final model with k=5
kmeans = KMeans(n_clusters = 5, max_iter=50).fit(scaled_rfm)  # fit on the rescaled features
kmeans.labels_

In [None]:
# assign the label
rfm['Cluster_Id'] = kmeans.labels_

In [None]:
# Box plot to visualize Cluster Id vs Monetary
sns.boxplot(x = 'Cluster_Id', y = 'Monetary', data = rfm);

In [None]:
# Box plot to visualize Cluster Id vs Frequency
sns.boxplot(x='Cluster_Id', y='Frequency', data = rfm);

In [None]:
# Box plot to visualize Cluster Id vs Recency
sns.boxplot(x='Cluster_Id', y='Recency', data = rfm);

In [None]:
rfm.groupby("Cluster_Id").agg({"Cluster_Id":"count"})

In [None]:
rfm.groupby("Cluster_Id").agg(np.mean)

## HIERARCHICAL CLUSTERING

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy. There are two types of hierarchical clustering:

- Divisive
- Agglomerative


Complete Linkage

In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between a point in one cluster and a point in the other. For example, the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest points.

https://www.saedsayad.com/images/Clustering_complete.png


Average Linkage

In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters “r” and “s” to the left is equal to the average length of the arrows connecting the points of one cluster to the other.

https://www.saedsayad.com/images/Clustering_average.png
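
As a small worked example of the two definitions, the following NumPy sketch computes both distances between two made-up clusters. The clusters "r" and "s" below are hypothetical points on a line, not taken from the data set.

```python
import numpy as np

# Two hypothetical clusters of 2-D points
r = np.array([[0.0, 0.0], [1.0, 0.0]])
s = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between points of r and points of s
pairwise = np.linalg.norm(r[:, None, :] - s[None, :, :], axis=2)

# Complete linkage: the largest of the pairwise distances
complete_dist = pairwise.max()

# Average linkage: the mean of the pairwise distances
average_dist = pairwise.mean()

print(pairwise)
print(complete_dist, average_dist)
```

The four pairwise distances are 4, 6, 3 and 5, so complete linkage gives 6 and average linkage gives 4.5. Complete linkage is driven by the single most distant pair, while average linkage takes every pair into account.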



In [None]:
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

# Use the rescaled R, F, M features only, so that the K-Means labels stored in rfm are not treated as a feature
hc_complete = linkage(scaled_rfm, "complete")  # Complete Linkage
hc_average = linkage(scaled_rfm, "average")  # Average Linkage


In [None]:
plt.figure(figsize = (15,10))
plt.title("Hierarchical Cluster Dendrogram")
plt.xlabel("Observation Unit")
plt.ylabel("Distance")
dendrogram(hc_complete,
           truncate_mode = "lastp",
           p = 10,
           show_contracted = True,
          leaf_font_size = 10);

In [None]:
# Cutting the Dendrogram based on K
cluster_labels = cut_tree(hc_complete, n_clusters = 5).reshape(-1, )
cluster_labels

In [None]:
# Assign cluster labels
rfm['Cluster_Labels'] = cluster_labels
rfm['Cluster_Labels'] = rfm['Cluster_Labels'] + 1
rfm.head()

In [None]:
rfm.groupby("Cluster_Labels").agg(np.mean)

In [None]:
# Plot Cluster Id vs Monetary
sns.boxplot(x = 'Cluster_Labels', y = 'Monetary', data = rfm);

In [None]:
# Plot Cluster Id vs Frequency
sns.boxplot(x = 'Cluster_Labels', y = 'Frequency', data = rfm);

In [None]:
# Plot Cluster Id vs Recency
sns.boxplot(x='Cluster_Labels', y='Recency', data=rfm);

## Final Analysis
According to K-Means Clustering with the optimized 5 clusters:

Customers belonging to Cluster Id 4 have the highest amount of transactions compared to the other clusters.\
Customers in Cluster Id 1 bring us the least amount of money and are not recent buyers.

For the Hierarchical Clustering Model, customers in cluster_label 4 have the highest amount of transactions.\
Customers in cluster_label 1 have the lowest transaction frequency.

