<a href="https://colab.research.google.com/github/sanjananasa/online-retail-customer-segmetation/blob/main/Unsupervised_machine_learning_capstone_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -Online Retail Customer Segmentation



##### **Project Type**    - UNSUPERVISED MACHINE LEARNING
##### **Contribution**    - Individual

####**Name**- Sanjana Nasa


# **Project Summary -**

####The aim of this machine learning project is to perform customer segmentation for an online retail business. Customer segmentation involves dividing the customer base into distinct groups based on shared characteristics, behaviour or preferences. By segmenting customers effectively, a business can gain valuable insights and tailor its marketing strategies to specific customer groups, leading to improved customer satisfaction and increased profitability.

####The project utilises a dataset containing relevant information about the online retail customers. The dataset includes features such as customer demographics, purchase history, frequency of purchases, monetary value of purchases, and other relevant variables that can help in segmenting customers effectively.

####The primary objective of this project is to apply machine learning techniques to segment the online retail customers into distinct groups based on their purchasing behaviour and characteristics. This segmentation will help the business to better understand its customer base, identify patterns and trends, and develop personalised marketing campaigns to target each segment.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


####In this project, your task is to identify major customer segments on a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

#**Variables Description**

##Attribute Information:
**InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.

**StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

**Description:** Product (item) name.

**Quantity:** The quantities of each product (item) per transaction. Numeric.

**InvoiceDate**: Invoice date and time. Numeric, the day and time when each transaction was generated.

**UnitPrice:** Unit price. Numeric, Product price per unit.

**CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

**Country:** Country name. Nominal, the name of the country where each customer resides.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import math  # standard-library math; the numpy.math alias was removed in NumPy 2.0

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
data=pd.read_csv("/content/drive/MyDrive/Online Retail- Online Retail.csv")

### Dataset First View

In [None]:
# Dataset First Look

#first 5 rows of data
data.head()

In [None]:
#last 5 rows of data
data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

In [None]:
print("number of rows=",data.shape[0])
print("number of columns=",data.shape[1])

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().value_counts()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
miss_val=data.isnull().sum().sort_values(ascending=False)
print(miss_val)

In [None]:
# Visualizing the missing values
import missingno as msno
msno.matrix(data)
plt.show()

### What did you know about your dataset?

1. The dataset has **541909** (five lakh forty one thousand nine hundred nine) rows and **8** columns.
2. There are 5268 duplicate rows.
3. Two columns, **Description** and **CustomerID**, have **1454** (one thousand four hundred fifty-four) and **135080** (one lakh thirty-five thousand eighty) missing/null values respectively.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe([0.75,0.95,0.99])

###**The summary statistics generated by the describe function show that some negative values exist in the data (the minimum value for both UnitPrice and Quantity is negative), so these columns need to be explored.**

###**Another noteworthy observation is that the 99th percentile for both the UnitPrice and Quantity columns is low, while the maximum value is much higher. This suggests that there are outliers in the data, which could be due to the occasional purchase of valuable items. Additionally, the UnitPrice and Quantity columns are inversely related to each other.**

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data.nunique().sort_values()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#converting data type of "InvoiceDate" from object to datetime
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

In [None]:
#looking for negative values in "Quantity"  column
data[data['Quantity']<0]

**As noted, an "InvoiceNo" starting with 'C' indicates a cancellation, and the rows above show that negative "Quantity" values correspond to invoice numbers beginning with 'C'. So we remove all those rows.**

In [None]:
#removing cancelled orders
data=data[~data['InvoiceNo'].astype(str).str.startswith('C')]

**As seen above, the dataset has null values in CustomerID and Description. We drop these rows; in particular, a transaction without a CustomerID cannot be assigned to any customer segment.**

In [None]:
#dropping null values
data.dropna(inplace=True)

In [None]:
#shape of data after dropping entries
data.shape

**Now let's check if there are still any null/missing values**

In [None]:
#checking for null values
data.isnull().sum()

**There are no null values in the data now.**

In [None]:
#lets check data summary again
data.describe()

**Now that we have removed all the cancelled orders , we can see that there is no negative value in the data. But the minimum UnitPrice is seen to be 0 (zero) which is not possible.**

In [None]:
# Checking how many values are present for unitprice==0

len(data[data['UnitPrice']==0])

**There are about 40 rows where UnitPrice is 0, so we will drop them.**

In [None]:
# taking unitprice values greater than 0.
data=data[data['UnitPrice']>0]
data.head()

In [None]:
#checking shape of data after dropping certain values
data.shape

### What all manipulations have you done and insights you found?

**1. Converted data type of "InvoiceDate" from object to datetime**

**2. Dropped the rows whose InvoiceNo starts with 'C', because 'C' indicates a cancellation.**

**3. Dropped rows with null CustomerID values, since a transaction without a customer cannot be used for customer segmentation.**

**4. dropped rows where unit price was zero.**

**5. After removing above entries, now we have 397884 rows and 8 columns.**

#**Feature Engineering**

In [None]:
# Converting InvoiceDate to datetime. InvoiceDate is in format of 01-12-2010 08:26.
data["InvoiceDate"] = pd.to_datetime(data["InvoiceDate"], format="%d-%m-%Y %H:%M")

In [None]:
# the vectorised .dt accessor gives the same values as apply(lambda ...) but runs faster
data["year"] = data["InvoiceDate"].dt.year
data["month_num"] = data["InvoiceDate"].dt.month
data["day_num"] = data["InvoiceDate"].dt.day
data["hour"] = data["InvoiceDate"].dt.hour
data["minute"] = data["InvoiceDate"].dt.minute

In [None]:
# extracting month from the Invoice date
data['Month']=data['InvoiceDate'].dt.month_name()

In [None]:
# extracting day from the Invoice date
data['Day']=data['InvoiceDate'].dt.day_name()

In [None]:
data['TotalAmount']=data['Quantity']*data['UnitPrice']

In [None]:
#lets have a look on dataframe after adding new columns
data.head()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

##**Chart-1** (**Popularity of product type**)

In [None]:
#top five most popular products among customers

# rename_axis + reset_index(name=...) gives the same column names on pandas 1.x and 2.x
# (pandas 2.x changed the default column names produced by value_counts().reset_index())
Description_df=data['Description'].value_counts().rename_axis('Description_Name').reset_index(name='Count')
#top 5 Description Name
Description_df.head()

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12,8))
plt.title('Top 5 Product Name')
sns.barplot(x='Count',y='Description_Name',data=Description_df[:5], palette='spring_r');


In [None]:
#bottom 5 products (least popular among customers)
Description_df.tail()

In [None]:
#visualisation code
plt.figure(figsize=(12,8))
plt.title('Bottom 5 Product Name')
sns.barplot(x='Count',y='Description_Name',data=Description_df[-5:], palette='spring_r');


##### 1. Why did you pick the specific chart?

**I chose bar chart to compare the counts of different products.**

##### 2. What is/are the insight(s) found from the chart?

###**Top 5 most selling products are:**
1. WHITE HANGING HEART T-LIGHT HOLDER

2.	REGENCY CAKESTAND 3 TIER

3.	JUMBO BAG RED RETROSPOT

4.	ASSORTED COLOUR BIRD ORNAMENT

5.	PARTY BUNTING

###**5 least selling products are:**
1. RUBY GLASS CLUSTER EARRING

2. PINK CHRYSANTHEMUMS ART FLOWER

3. 72 CAKE CASES VINTAGE CHRISTMAS

4. WALL ART , THE MAGIC FOREST

5. PAPER CRAFT , LITTLE BIRDIE

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**When you know your best selling and least selling items, you can spend more time in promoting your best sellers and can look for ways to improve poor performing products.**

## **Chart - 2** **(Best sellers StockCode wise)**

In [None]:
#top 5 products stock code wise
# name the columns explicitly so the result is the same on pandas 1.x and 2.x
StockCode_df=data['StockCode'].value_counts().rename_axis('StockCode_Name').reset_index(name='Count')
#top 5 stockcode name
StockCode_df.head()

In [None]:
# Chart - 2 visualization code
#plot top 5 stockcode name
plt.figure(figsize=(12,8))
plt.title('Top 5 Stock Name')
sns.barplot(x='Count',y='StockCode_Name',data=StockCode_df[:5], palette='spring_r')



##### 1. Why did you pick the specific chart?

**Used bar chart to compare the selling counts of product as per their stock code.**

##### 2. What is/are the insight(s) found from the chart?

**Product with stock code 85123A is the product with highest sales.**

##**Chart - 3** **(number of customers from different countries)**

In [None]:
#country wise number of customers
cust_count=data['Country'].value_counts()
print(cust_count)

In [None]:
# Chart - 3 visualization code
cust_count.head().plot(kind='bar')
plt.title("top five countries the customers belong to")
plt.xlabel("Countries")
plt.ylabel("counts")

##### 1. Why did you pick the specific chart?

**Here I have chosen bar chart to compare the count of customers from different countries.**

##### 2. What is/are the insight(s) found from the chart?

**We can clearly see that maximum number of customers are from United Kingdom, which makes sense as the company itself is UK-based. After that we have Germany , France , EIRE (Ireland) which have almost equal number of customers and last is  Spain.**
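
A caveat on the counts above: `value_counts()` on `Country` counts transaction rows, so a customer who buys often is counted many times. A minimal sketch of the difference, using a small made-up frame (the values below are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy transactions (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "France", "France"],
})

# Rows per country: this is what value_counts() on 'Country' measures
rows_per_country = toy["Country"].value_counts()

# Distinct customers per country
customers_per_country = (
    toy.groupby("Country")["CustomerID"].nunique().sort_values(ascending=False)
)

print(rows_per_country)
print(customers_per_country)
```

On the real data the same `groupby('Country')['CustomerID'].nunique()` call should give the number of distinct customers per country, which may rank the countries differently from the row counts.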

#**Chart - 4**

In [None]:
# Chart - 4 visualization code
cust_count.tail().plot(kind='bar')
plt.title("Countries with least number of customers")
plt.xlabel("Countries")
plt.ylabel("count")

####**Lithuania, Brazil, Czech Republic, Bahrain and Saudi Arabia have the lowest customer base.**

## **Chart - 6** **(Distribution of unit price)**

In [None]:
# Chart - 6 visualization code

#distribution of unit price
plt.figure(figsize=(12,8))
plt.title('UnitPrice distribution')
sns.distplot(data['UnitPrice'])

##### 1. Why did you pick the specific chart?

**Used distplot to show the variation in the distribution of data points of "UnitPrice"**

##### 2. What is/are the insight(s) found from the chart?

**From the distribution of unit price, we can say that most items have a lower price range.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Lower prices mean lower profit margins, so they'll have to sell a higher volume in order to survive in the market.**

## **Chart - 5** **(Distribution of quantity)**

In [None]:
# Chart - 5 visualization code

#distribution of Quantity
plt.figure(figsize=(12,8))
plt.title('distribution of Quantity')
sns.distplot(data['Quantity'],color="r")

##### 1. Why did you pick the specific chart?

**Used distplot to show the variation in the distribution of data points of "Quantity"**

##### 2. What is/are the insight(s) found from the chart?

**Here we can see that it is a positively skewed (right-skewed) distribution: most values are clustered on the left side of the distribution, and the most extreme values lie in the long right tail.**

## **Chart - 7 (Top 10 customers as per customerID)**

In [None]:
#top 10 customers in terms of purchasing counts
top_10_customers=data['CustomerID'].value_counts().rename_axis('CustomerID').reset_index(name='Product_purchasing_count').head(10)
print(top_10_customers)

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12,8))
sns.barplot(x=top_10_customers['CustomerID'],y=top_10_customers['Product_purchasing_count'].head(10),palette='spring_r')
plt.title('Top 10 frequent Customers.')

##### 1. Why did you pick the specific chart?

**Used bar chart to compare the number of purchases done by different customers.**

##### 2. What is/are the insight(s) found from the chart?

**CustomerID 17841 has the highest number of purchase records.**

**CustomerID 14911 has the second-highest number of purchase records.**

## **Chart - 8 (Monthwise sales count)**

In [None]:
#sales count in different months
sales_in_month=data['Month'].value_counts().rename_axis('Month').reset_index(name='Sales_count')
sales_in_month

In [None]:
# Chart - 8 visualization code

# Sales count in different months.
plt.figure(figsize=(20,6))
sns.barplot(x=sales_in_month['Month'],y=sales_in_month['Sales_count'],palette='spring_r')
plt.title('Sales count in different Months ')

##### 1. Why did you pick the specific chart?

**Used bar chart to compare the monthly sales count.**

##### 2. What is/are the insight(s) found from the chart?

**Most of the sales happened in November.**

**February Month had least sales.**

##**Chart-9 (sales count day wise)**

In [None]:
#sales count on different days of the week
sales_on_day_basis=data['Day'].value_counts().rename_axis('Day').reset_index(name='Sale_count')
sales_on_day_basis

In [None]:
# Chart - 9 visualization code

# Sales count on different days.
plt.figure(figsize=(20,6))
sns.barplot(x=sales_on_day_basis['Day'],y=sales_on_day_basis['Sale_count'],palette='spring_r')
plt.title('Sales count on different Days ')

##### 1. Why did you pick the specific chart?

**Used bar chart to compare sale count on different days of week.**

##### 2. What is/are the insight(s) found from the chart?

**We can clearly see the sales on Thursdays being the highest and least on Fridays.**

#**Chart - 10 (sales time in different day times)**

In [None]:
#let's check for unique values in hour column
data['hour'].unique()

In [None]:
#creating a function to categorise the purchase time as morning , afternoon or evening
def time(hour):
  if 6 <= hour <= 11:
    return 'Morning'
  elif 12 <= hour <= 17:
    return 'Afternoon'
  else:
    return 'Evening'

In [None]:
data['Day_time_type']=data['hour'].apply(time)

In [None]:
#number of sales happened during different time of the day(morning, evening or afternoon)
sales_timing=data['Day_time_type'].value_counts().rename_axis('Day_time_type').reset_index(name='Sales_count')
sales_timing

In [None]:
# Chart - 10 visualization code

# Sales count at different times of the day.
plt.figure(figsize=(12,6))
sns.barplot(x=sales_timing['Day_time_type'],y=sales_timing['Sales_count'],palette='spring_r')
plt.title('Sales count in different day timings')

##### 1. Why did you pick the specific chart?

**Used bar chart to compare the sale count with respect to different times of day.**

##### 2. What is/are the insight(s) found from the chart?

**Most of the sales happened in the afternoon and least in the evening.**

#**Chart - 11 (Average amount spent by each customer)**

In [None]:
#average amount spent by customers
average_amt=data.groupby('CustomerID')['TotalAmount'].mean().reset_index().rename(columns={'TotalAmount':'Average_amount_spent'}).sort_values('Average_amount_spent',ascending=False)
average_amt

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(12,6))
sns.barplot(x=average_amt['CustomerID'].head(5),y=average_amt['Average_amount_spent'].head(5),palette='spring_r')
plt.title('Average amount spent by top 5 Customers')


##### 1. Why did you pick the specific chart?

**Used bar chart to compare the average amount spent by customers.**

##### 2. What is/are the insight(s) found from the chart?

**CustomerID 12346 has the highest average amount per purchase record, about 77183 (in the dataset's currency, pounds sterling).**

**CustomerID 16446 has the second-highest average amount per purchase record, about 56157.**

#**MODEL BUILDING**

##**RFM Model Analysis**

###**What is RFM?**

RFM **(Recency, Frequency, Monetary)** analysis is a widely used customer segmentation technique in marketing and analytics. It helps businesses understand and categorize their customers based on three key factors:


1. How recently they made a purchase **(Recency)**,

2. How frequently they make purchases **(Frequency)**,

3. How much they spend **(Monetary value)**.


RFM analysis enables businesses to identify and target different customer segments with customized marketing approaches.


###**Why it is Needed?**

RFM Analysis is a marketing framework that is used to understand and analyze customer behaviour based on the above three factors: Recency, Frequency, and Monetary.


The RFM Analysis will help the businesses to segment their customer base into different homogenous groups so that they can engage with each group with different targeted marketing strategies.
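
To make the three quantities concrete, here is a minimal sketch that computes them for a small made-up table. The rows, the `snapshot` date and the variable names are assumptions for illustration only. Frequency is counted here as the number of distinct invoices, which is one common convention; counting rows per customer is another.

```python
import datetime as dt
import pandas as pd

# Toy transactions (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-01", "2011-12-08", "2011-11-10"]
    ),
    "TotalAmount": [10.0, 5.0, 20.0, 7.5],
})

# Reference date from which recency is measured (assumed)
snapshot = dt.datetime(2011, 12, 10)

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("TotalAmount", "sum"),                               # total spend
)
print(toy_rfm)
```

Customer 1 bought most recently, on more invoices, and spent more, so it would score better on all three dimensions than customer 2.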

In [None]:
#creating copy of original dataframe
rfm_df=data.copy()

In [None]:
rfm_df.head()

In [None]:
'''Recency = Latest Date - Last Invoice Date,
 Frequency = number of invoice lines (rows) recorded for the customer,
  Monetary = sum of TotalAmount '''

import datetime as dt

#Set Latest date 2011-12-10 as last invoice date was 2011-12-09. This is to calculate the number of days from recent purchase
Latest_Date = dt.datetime(2011,12,10)

#Create RFM Modelling scores for each customer
rfm_df = rfm_df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalAmount': lambda x: x.sum()})

#Convert Invoice Date into type int
rfm_df['InvoiceDate'] = rfm_df['InvoiceDate'].astype(int)

#Rename column names to Recency, Frequency and Monetary
rfm_df.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalAmount': 'Monetary'}, inplace=True)

rfm_df.reset_index().head()

In [None]:
# Descriptive Stats= Recency
rfm_df.Recency.describe()

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(x=rfm_df['Recency'])
plt.title('Distribution of Recency')

###**We  can clearly observe from above plot that distribution of  'Recency'  is positively skewed.**

In [None]:
# Descriptive Stats= Frequency
rfm_df.Frequency.describe()

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(x=rfm_df['Frequency'])
plt.title('Distribution of Frequency')

###**Distribution of Frequency is highly right skewed.**

In [None]:
# Descriptive Stats= Monetary
rfm_df['Monetary'].describe()

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(x=rfm_df['Monetary'])
plt.title('Distribution of Monetary')

###**Distribution of Monetary is highly right skewed.**

In [None]:
# Split the data into four segment using Quantile
quantile = rfm_df.quantile(q = [0.25,0.50,0.75])

#Converting quantiles to a dictionary, that would be easier to use.
quantile = quantile.to_dict()

In [None]:
quantile

In [None]:
#Function to create R, F and M segments
# arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)
# lower the recency, good for the company

def RScoring(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]:
        return 3
    else:
        return 4

        # arguments (x = value, p = recency, monetary_value, frequency, d = quartiles dict)
        # higher value of frequency and monetary lead to a good consumer. Here higher value = 1 in reverse way.

def FnMScoring(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]:
        return 2
    else:
        return 1
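
The two scoring functions above are mirror images of each other: both find which quartile interval a value falls into, and differ only in whether a low value earns the best score (1). A compact sketch of the same rule using the standard-library `bisect` module follows. The helper names and the cut points are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

def quartile_bucket(x, cuts):
    """Return 0..3: how many of the sorted cut points lie strictly below x."""
    return bisect_left(cuts, x)

def r_score(x, cuts):
    # Recency: a smaller value is better, so the lowest bucket scores 1
    return quartile_bucket(x, cuts) + 1

def fm_score(x, cuts):
    # Frequency / Monetary: a larger value is better, so the highest bucket scores 1
    return 4 - quartile_bucket(x, cuts)

cuts = [17, 50, 142]  # hypothetical 25th, 50th and 75th percentile values

print([r_score(v, cuts) for v in (17, 18, 142, 500)])
print([fm_score(v, cuts) for v in (10, 50, 100, 500)])
```

A value equal to a cut point falls into the lower bucket, which matches the `<=` comparisons used in `RScoring` and `FnMScoring`.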

In [None]:
# Calculating and adding R,F and M segments values columns in the existing dataset to show R,F,M segment values
rfm_df["R"] = rfm_df['Recency'].apply(RScoring,args=('Recency',quantile,))
rfm_df["F"] = rfm_df['Frequency'].apply(FnMScoring,args=('Frequency',quantile,))
rfm_df["M"] = rfm_df['Monetary'].apply(FnMScoring,args=('Monetary',quantile,))
rfm_df.head()

In [None]:
# Add a new column to combine RFM score
rfm_df['RFM_Group'] = rfm_df.R.map(str)+rfm_df.F.map(str)+rfm_df.M.map(str)

#Calculate and Add RFMScore value column showing total sum of RFMGroup values
rfm_df['RFM_Score'] = rfm_df[['R', 'F', 'M']].sum(axis = 1)
rfm_df.head()

In [None]:
rfm_df.info()

In [None]:
rfm_df['RFM_Score'].unique()

In [None]:
# Assign Loyalty Level to each customer
Loyalty_Level = ['Platinum','Gold','Silver','Bronze']

Score_cut = pd.qcut(rfm_df['RFM_Score'],q = 4,labels=Loyalty_Level)
rfm_df['RFM_Loyalty_Level'] = Score_cut.values
rfm_df.reset_index().head()

In [None]:
# Validate the data For RFM group = 111
rfm_df[rfm_df['RFM_Group'] == '111'].sort_values("Monetary",ascending = False).reset_index().head(10)

In [None]:
# Plot the loyalty level
plt.figure(figsize=(12,6))
sns.countplot(x=rfm_df['RFM_Loyalty_Level'],palette='spring_r')  # pass the column as x=; recent seaborn treats the first positional argument as data
plt.title('Loyalty Level of Customers')
plt.show()

In [None]:
segmentation_based_on_RFM = rfm_df[['Recency','Frequency','Monetary','RFM_Loyalty_Level']]

segmentation_based_on_RFM.groupby('RFM_Loyalty_Level').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

In [None]:
#Handle negative and zero values so as to handle infinite numbers during log transformation
def handle_neg_n_zero(num):
    if num <= 0:
        return 1
    else:
        return num
#Apply handle_neg_n_zero function to Recency and Monetary columns
rfm_df['Recency'] = [handle_neg_n_zero(x) for x in rfm_df.Recency]
rfm_df['Monetary'] = [handle_neg_n_zero(x) for x in rfm_df.Monetary]

In [None]:
#Perform Log transformation to bring data into normal or near normal distribution
Log_rfm_df = rfm_df[['Recency', 'Frequency', 'Monetary']].apply(np.log, axis = 1).round(3)

##**Now let's Visualize the Distribution of Recency,Frequency and Monetary.**

In [None]:
#distribution of Recency
plt.figure(figsize=(12,6))
sns.distplot(x=Log_rfm_df['Recency'])
plt.title('Distribution of Recency')

In [None]:
#distribution of Frequency
plt.figure(figsize=(12,6))
sns.distplot(x=Log_rfm_df['Frequency'])
plt.title('Distribution of Frequency')

In [None]:
#distribution of Monetary
plt.figure(figsize=(12,6))
sns.distplot(x=Log_rfm_df['Monetary'])
plt.title('Distribution of Monetary')

In [None]:
rfm_df['Recency_log'] = np.log(rfm_df['Recency'])
rfm_df['Frequency_log'] = np.log(rfm_df['Frequency'])
rfm_df['Monetary_log'] = np.log(rfm_df['Monetary'])

In [None]:
rfm_df

## ***7. ML Model Implementation***

### **ML Model - 1**

#**K-Means Clustering**

In [None]:
#Importing Libraries

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler


In [None]:
!pip install yellowbrick

##**Before implementing the K-Means clustering algorithm we need to decide the number of clusters to pass as input. We will choose a suitable number of clusters using the Elbow method.**

#**1) Applying Elbow Method on Recency and Monetary.**

The elbow method is a heuristic used to determine the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
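
The quantity usually plotted for the elbow method is the within-cluster sum of squares (WCSS): the sum of squared distances from each point to the centroid of its cluster. This is the value scikit-learn's `KMeans` exposes as `inertia_`. A minimal pure-Python sketch on four made-up points shows why the curve drops as clusters are added; the `wcss` helper and the points are assumptions for illustration.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)

    total = 0.0
    for members in groups.values():
        # centroid = coordinate-wise mean of the cluster members
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(point, centroid))
            for point in members
        )
    return total

# Two well-separated pairs of points (hypothetical data)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = wcss(points, [0, 0, 0, 0])   # everything in a single cluster
two_clusters = wcss(points, [0, 0, 1, 1])  # one cluster per pair

print(one_cluster, two_clusters)
```

Going from one cluster to two removes almost all of the within-cluster spread here; adding more clusters after that can only shave off a little more, and that flattening is the "elbow".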

In [None]:
# taking Recency_log and Monetary_log in a list.
Recency_and_Monetary_feat=['Recency_log','Monetary_log']

# taking only the values of recency and monetary in X.
X=rfm_df[Recency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#applying Elbow Method
wcss = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    wcss[k] = km.inertia_


#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x = list(wcss.keys()), y = list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

###**Here we can see that Optimal value for cluster came out to be 2.**

#**Silhouette Score (validating the optimal cluster value above, i.e. optimal_cluster=2)**

Silhouette Score is a metric to evaluate the performance of clustering algorithm. It uses compactness of individual clusters(intra cluster distance) and separation amongst clusters (inter cluster distance) to measure an overall representative score of how well our clustering algorithm has performed.
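
For a single point, the silhouette coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to the points of any other cluster. The sketch below computes it by hand for four made-up points. The `silhouette_values` helper is hypothetical and only illustrates the formula; it is not the scikit-learn implementation.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette coefficients using Euclidean distance."""
    values = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # distances from p to every other point, grouped by that point's cluster
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(math.dist(p, q))

        own = by_cluster.pop(own_label, [])
        if not own or not by_cluster:
            # sketch convention: score 0 for a singleton cluster or a single cluster
            values.append(0.0)
            continue

        a = sum(own) / len(own)                                   # intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())     # nearest other cluster
        values.append((b - a) / max(a, b))
    return values

# Two compact, well-separated pairs (hypothetical data)
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]

scores = silhouette_values(points, labels)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Each point here is about five times closer to its own cluster than to the other one, so every coefficient is close to 0.8. The overall silhouette score reported for a clustering is the mean of these per-point values.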

In [None]:
# taking Recency_log and Monetary_log in a list.
Recency_and_Monetary_feat=['Recency_log','Monetary_log']

# taking only the values of recency and monetary in X.
X=rfm_df[Recency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#Silhouette Score
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters,random_state=1)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

####**The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate.**

###**Here we can see that the silhouette score for n_clusters=2 is better than for the other values. (A value close to 1 means the data points sit well inside their own cluster and far from the neighbouring cluster.)**

In [None]:
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

range_n_clusters = [2,3,4,5,6,7,8,9,10]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with the n_clusters value and a fixed random
    # seed (random_state=1) for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

##**The silhouette plot for n_clusters=2 is the best of those tried, although a few data points still have negative silhouette coefficients.**

##**So we pass n_clusters=2 to the K-Means model.**

In [None]:
# Apply the K-Means clustering algorithm (fixed seed so the labels are reproducible)
kmeans_rec_mon = KMeans(n_clusters=2, random_state=1)
kmeans_rec_mon.fit(X)
y_kmeans= kmeans_rec_mon.predict(X)

In [None]:
# Store the cluster assigned to each customer in the dataset
rfm_df['Cluster_based_rec_mon'] = kmeans_rec_mon.labels_
rfm_df.head(10)

In [None]:
# Centers of the clusters(coordinates)
centers = kmeans_rec_mon.cluster_centers_
centers

In [None]:
# Visualise the clusters
plt.figure(figsize=(15,10))
plt.title('Customer segmentation based on Recency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='winter')

# Overlay the cluster centres in red
centers = kmeans_rec_mon.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

#**2. Applying Elbow Method on Frequency and Monetary.**

In [None]:
# Feature list: Frequency_log and Monetary_log.
Frequency_and_Monetary_feat=['Frequency_log','Monetary_log']

# Take only the frequency and monetary values in X.
X=rfm_df[Frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#applying Elbow Method
wcss = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    wcss[k] = km.inertia_


#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x = list(wcss.keys()), y = list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

##**Here we can see that the optimal number of clusters is 2.**
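
The quantity on the y-axis of the elbow curve is the within-cluster sum of squares (WCSS), which `KMeans` exposes as `inertia_`. The following is only a minimal sketch: the helper `wcss` and the toy points are hypothetical and not part of this notebook. It illustrates how sharply the value falls once a second cluster is allowed.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Hypothetical toy data: two tight pairs of points that lie far apart.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

wcss_k1 = wcss(points, np.array([0, 0, 0, 0]))  # everything in one cluster
wcss_k2 = wcss(points, np.array([0, 0, 1, 1]))  # one cluster per pair

print(wcss_k1, wcss_k2)
```

On this toy data almost all of the WCSS disappears at k=2, so further clusters can only remove the small remainder. That bend is what the elbow plot is read for.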

#**Silhouette Score (validating the optimal cluster value above, i.e. optimal_cluster=2)**

In [None]:
# Feature list: Frequency_log and Monetary_log.
Frequency_and_Monetary_feat=['Frequency_log','Monetary_log']

# Take only the frequency and monetary values in X.
X=rfm_df[Frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#Silhouette Score
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters,random_state=1)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

##**Here we can see that the silhouette score for n_clusters=2 is better than for the other values. (A value close to 1 means that a data point fits well within its own cluster and lies far from the neighbouring cluster.)**
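
To make the score concrete, the silhouette coefficient of one sample is `s = (b - a) / max(a, b)`, where `a` is the mean distance to the other members of its own cluster and `b` is the mean distance to the nearest other cluster. The sketch below recomputes it by hand with NumPy; the function `silhouette_values` and the four toy points are hypothetical and only meant to illustrate the formula, not to replace `silhouette_samples`.

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-sample silhouette coefficient s = (b - a) / max(a, b)."""
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # a cluster with a single sample keeps a score of 0
        a = dist[i, same].mean()  # mean distance within its own cluster
        # Mean distance to the closest of the other clusters.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two well-separated groups on a line.
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

scores = silhouette_values(points, labels)
print(scores, scores.mean())
```

Because the two groups are compact and far apart, every sample scores close to 1, which is the situation the sentence above describes.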

In [None]:
range_n_clusters = [2,3,4,5,6,7,8,9,10]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 1 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

###**The silhouette plot for Frequency and Monetary with n_clusters=2 is much better than the one for Recency and Monetary.**

###**No data points are on the negative side of the silhouette coefficient axis.**
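
A reading like this can also be checked numerically instead of by eye. The sketch below is an assumption-laden illustration: `negative_share` is a hypothetical helper and the numbers are made up. In this notebook the inputs would be the `sample_silhouette_values` and `cluster_labels` arrays computed in the loop above.

```python
import numpy as np

def negative_share(sample_values, labels):
    """Fraction of samples with a negative silhouette value, per cluster."""
    return {int(label): float((sample_values[labels == label] < 0).mean())
            for label in np.unique(labels)}

# Hypothetical per-sample silhouette values and cluster labels.
values = np.array([0.6, 0.4, -0.1, 0.7, 0.5, 0.3])
labels = np.array([0, 0, 0, 1, 1, 1])

shares = negative_share(values, labels)
print(shares)
```

A share of 0 for every cluster corresponds to a plot with nothing left of the zero line.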

##**So n_clusters=2 is passed to the K-Means model.**

In [None]:
# Apply the K-Means clustering algorithm (fixed seed so the labels are reproducible)
kmeans_freq_mon = KMeans(n_clusters=2, random_state=1)
kmeans_freq_mon.fit(X)
y_kmeans= kmeans_freq_mon.predict(X)

In [None]:
# Store the cluster assigned to each customer in the dataset
rfm_df['Cluster_based_on_freq_mon'] = kmeans_freq_mon.labels_
rfm_df.head(10)

In [None]:
# Centers of the clusters(coordinates)
centers = kmeans_freq_mon.cluster_centers_
centers

In [None]:
# Visualise the clusters
plt.figure(figsize=(15,10))
plt.title('Customer segmentation based on Frequency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='winter')

# Overlay the cluster centres in red
centers = kmeans_freq_mon.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

#**3. Applying Elbow Method on Recency, Frequency and Monetary.**

In [None]:
# Feature list: Recency_log, Frequency_log and Monetary_log.
Recency_Frequency_and_Monetary_feat=['Recency_log','Frequency_log','Monetary_log']

# Take only the recency, frequency and monetary values in X.
X=rfm_df[Recency_Frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#applying Elbow Method
wcss = {}
for k in range(1,15):
    km = KMeans(n_clusters= k, init= 'k-means++', max_iter= 1000)
    km = km.fit(X)
    wcss[k] = km.inertia_


#Plot the graph for the sum of square distance values and Number of Clusters
plt.figure(figsize=(12,6))
sns.pointplot(x = list(wcss.keys()), y = list(wcss.values()))
plt.xlabel('Number of Clusters(k)')
plt.ylabel('Sum of Square Distances')
plt.title('Elbow Method For Optimal k')
plt.show()

##**Silhouette Score (validating the optimal cluster value above, i.e. optimal_cluster=2)**

In [None]:
# Feature list: Recency_log, Frequency_log and Monetary_log.
Recency_Frequency_and_Monetary_feat=['Recency_log','Frequency_log','Monetary_log']

# Take only the recency, frequency and monetary values in X.
X=rfm_df[Recency_Frequency_and_Monetary_feat].values

# standardising the data
scaler=StandardScaler()
X=scaler.fit_transform(X)

#Silhouette Score
range_n_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15]
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters,random_state=1)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

###**Here we can see that the silhouette score for n_clusters=2 is better than for the other values. (A value close to 1 means that a data point fits well within its own cluster and lies far from the neighbouring cluster.)**

In [None]:
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 1 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) /n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

###**The silhouette plot for Recency, Frequency and Monetary with n_clusters=2 is good.**

###**A few data points are still on the negative side of the silhouette coefficient axis, but the clusters are acceptable.**

##**So n_clusters=2 is passed to the K-Means model.**

In [None]:
# Apply the K-Means clustering algorithm (fixed seed so the labels are reproducible)
kmeans_freq_mon_rec = KMeans(n_clusters=2, random_state=1)
kmeans_freq_mon_rec.fit(X)
y_kmeans= kmeans_freq_mon_rec.predict(X)

In [None]:
# Store the cluster assigned to each customer in the dataset
rfm_df['Cluster_based_on_freq_mon_rec'] = kmeans_freq_mon_rec.labels_
rfm_df.head(10)

In [None]:
# Centers of the clusters(coordinates)
centers = kmeans_freq_mon_rec.cluster_centers_
centers

In [None]:
# Visualise the clusters (only the first two features, Recency_log and Frequency_log, are plotted)
plt.figure(figsize=(15,10))
plt.title('Customer segmentation based on Recency, Frequency and Monetary')
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='winter')

# Overlay the cluster centres in red
centers = kmeans_freq_mon_rec.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=300, alpha=0.8)

#**ML Model - 2**

#**HIERARCHICAL CLUSTERING**

###Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters.

###Strategies for hierarchical clustering generally fall into two categories:

###**Agglomerative:** This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

###**Divisive:** This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
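
The bottom-up strategy can be sketched in a few lines. This is a deliberately naive illustration under stated assumptions: the function `agglomerate` and the toy points are hypothetical, and it uses single linkage (distance between the closest members of two clusters) rather than the Ward linkage used later in this notebook.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive bottom-up clustering with single linkage."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between any two members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical toy data: one group near the origin, one around x = 5.
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [5.0, 7.0]])

clusters = agglomerate(points, n_clusters=2)
print(clusters)
```

Each pass merges the two closest clusters, so the loop stops once the requested number of clusters remains. The result lists the point indices in each cluster.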

###In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering[1] are usually presented in a **dendrogram**, a tree-like diagram that records the sequence of merges or splits. The longer a vertical line in the dendrogram, the greater the distance between the clusters it joins.

###We can set a threshold distance and draw a horizontal line at that height. Generally, we set the threshold so that it cuts the tallest vertical line, that is, the largest vertical distance that can be covered without crossing any horizontal line.
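
The threshold rule can also be written down as a small calculation on the merge heights. The sketch below is an illustration under assumptions: `cut_at_largest_gap` is a hypothetical helper and the heights are made up. With SciPy, the heights would be the third column of the linkage matrix, for example `sch.linkage(X, method='ward')[:, 2]`.

```python
import numpy as np

def cut_at_largest_gap(merge_heights):
    """Return (threshold, n_clusters) for the widest gap between merge heights."""
    heights = np.sort(np.asarray(merge_heights, dtype=float))
    gaps = np.diff(heights)          # vertical distance between successive merges
    i = int(np.argmax(gaps))         # the widest gap lies between merge i and i + 1
    threshold = (heights[i] + heights[i + 1]) / 2
    # Every merge above the threshold is undone, and each one adds a cluster.
    n_clusters = len(heights) - i
    return float(threshold), n_clusters

# Hypothetical merge heights for five observations (four merges).
heights = [1.0, 1.2, 1.5, 9.0]

threshold, n_clusters = cut_at_largest_gap(heights)
print(threshold, n_clusters)
```

Here the widest gap lies below the final merge, so a horizontal line drawn in that gap crosses two vertical lines and yields two clusters.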

In [None]:
import scipy.cluster.hierarchy as sch

In [None]:
plt.figure(figsize=(13,8))
dendrogram = sch.dendrogram(sch.linkage(X, method = 'ward'))

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.axhline(y=80, color='r', linestyle='--')
plt.show() # find largest vertical distance we can make without crossing any other horizontal line

##**The number of clusters is the number of vertical lines intersected by the horizontal threshold line.**

##**Number of clusters = 2**

In [None]:
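# Fit hierarchical clustering on the dataset
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance; the `affinity` argument is omitted
# because newer scikit-learn releases no longer accept it.
h_clustering = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')
y_hc = h_clustering.fit_predict(X)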

In [None]:
# Visualise the clusters (first two features only)
plt.figure(figsize=(13,8))
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 100, c = 'red', label = 'Cluster 0')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 1')

plt.title('Clusters of Customers')
plt.xlabel('Recency_log (standardised)')
plt.ylabel('Frequency_log (standardised)')

plt.legend()
plt.show()

In [None]:
rfm_df.head()

# **Conclusion**

###**1. First, I segmented the customers using RFM analysis. This gave 4 customer segments based on the RFM score, as follows:**


In [None]:
segmentation_based_on_RFM = rfm_df[['Recency','Frequency','Monetary','RFM_Loyalty_Level']]

segmentation_based_on_RFM.groupby('RFM_Loyalty_Level').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

###**2. Next, I applied machine learning algorithms to cluster the customers.**

In [None]:
data_process_normalized=rfm_df[['Recency','Frequency','Monetary','Recency_log','Frequency_log','Monetary_log','RFM_Loyalty_Level','Cluster_based_on_freq_mon_rec']]
data_process_normalized.groupby('Cluster_based_on_freq_mon_rec').agg({
    'Recency': ['mean', 'min', 'max'],
    'Frequency': ['mean', 'min', 'max'],
    'Monetary': ['mean', 'min', 'max','count']
})

###**The clustering above uses recency, frequency and monetary data together (K-Means clustering), as all three together provide more information.**

###**Cluster 0 has high recency values (a long time since the last purchase) but very low frequency and monetary values. Cluster 0 contains 2414 customers.**

###**Cluster 1 has low recency values; these customers buy frequently and spend far more than the others, as their mean monetary value is very high. They therefore generate more revenue for the retail business.**

