<a href="https://colab.research.google.com/github/sanjayrawat2468/onlineretailcustomersegmentation/blob/main/Online_Retail_Customer_Segmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Online Retail Customer Segmentation**



##### **Project Type**    - EDA/Unsupervised
##### **Contribution**    - Sanjay Rawat(Individual)



# **Project Summary -**

### **Customer segmentation is the process of separating customers into groups on the basis of their shared behavior or other attributes. The groups should be homogeneous within themselves and should also be heterogeneous to each other. The overall aim of this process is to identify high-value customer base i.e. customers that have the highest growth potential or are the most profitable.**

### **Insights from customer segmentation are used to develop tailor-made marketing campaigns and for designing overall marketing strategy and planning.**

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


### **In this project, your task is to identify major customer segments in a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.**


# **Data Description**

### **InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
### **StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
### **Description:** Product (item) name. Nominal.
### **Quantity:** The quantities of each product (item) per transaction. Numeric.
### **InvoiceDate:** Invoice date and time. Numeric, the day and time when each transaction was generated.
### **UnitPrice:** Unit price. Numeric, Product price per unit in sterling.
### **CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
### **Country:** Country name. Nominal, the name of the country where each customer resides.

# **Import Libraries**

In [None]:
# Importing required libraries
import math
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
# Mounting The Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/Online Retail.xlsx - Online Retail.csv')

# **Data Wrangling**

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
# Checking For Null Values
df.isna().sum()

**The dataset has null values in the CustomerID and Description columns. Segmentation works per customer, so rows without a CustomerID cannot be used and we drop the rows with null values.**

# **Data Cleaning**

In [None]:
# Dropping Null Values
df.dropna(inplace=True)

In [None]:
df.shape

**An invoice number containing the letter 'C' marks a cancelled order, so we look for these invoices in the InvoiceNo column and remove them.**

In [None]:
# Changing Datatype
df['InvoiceNo'] = df['InvoiceNo'].astype('str')


In [None]:
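# Dropping the cancelled orders (InvoiceNo contains 'C').
# .copy() gives an independent dataframe, so adding columns later is safe.
df = df[~df['InvoiceNo'].str.contains('C')].copy()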
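The cancellation filter can be checked on a small example. This is a minimal sketch on made-up invoices (the `toy` frame and its values are hypothetical); it uses `str.startswith`, which matches the data description more strictly than `str.contains`.

```python
import pandas as pd

# Hypothetical toy invoices; a leading 'C' marks a cancellation,
# as stated in the data description.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "C536383"],
    "Quantity": [6, -1, 2, -3],
})

# Flag cancellations by prefix and keep only completed orders.
is_cancelled = toy["InvoiceNo"].astype(str).str.startswith("C")
completed = toy.loc[~is_cancelled].copy()
print(completed)
```

On the real data both filters should agree as long as no regular invoice number contains a 'C' elsewhere; that is an assumption worth checking before switching.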

# **Exploratory Data Analysis**

In [None]:
# Count rows per product; rename_axis/reset_index(name=...) gives the same
# column names regardless of the pandas version
Description_df = (
    df['Description']
    .value_counts()
    .rename_axis('Description_Name')
    .reset_index(name='Count')
)


**Top Products Based On Selling**

In [None]:
# Top 5 Description Name
Description_df.head()

In [None]:
# Plot Top 5 Product Name
plt.figure(figsize=(12,8))
plt.title('Top 5 Product Name')
sns.barplot(x='Count',y='Description_Name',data=Description_df[:5]);

**Bottom 5 Product Based On Selling**

In [None]:
# Bottom 5 Description Name
Description_df.tail()

**Country Name**



In [None]:
# Count rows per country with version-independent column names
country_df = (
    df['Country']
    .value_counts()
    .rename_axis('Country_Name')
    .reset_index(name='Count')
)

**Top 5 Country Name Based On Customer Count**

In [None]:
# Top 5 Country Name Based On Customer
country_df.head()


In [None]:
plt.figure(figsize=(12,8))
plt.title('Top 5 Country based on the Most Numbers Of Customers')
sns.barplot(x='Count',y='Country_Name',data=country_df[:5]);

**This graph shows that most of the customers are from the United Kingdom, which makes sense because the company is UK-based. Germany, France, EIRE and Spain follow.**

**Bottom 5 Country Name Based On Customers**

In [None]:
# Bottom 5 Country Name
country_df.tail()

In [None]:
plt.figure(figsize=(12,8))
plt.title('Bottom 5 Countries based on Customers')
sns.barplot(x='Count',y='Country_Name',data=country_df[-5:]);

**This graph shows that the fewest customers are from Lithuania, Brazil, Czech Republic, Bahrain and Saudi Arabia.**

**Distribution of Quantity**



In [None]:
# Distribution of Quantity (histplot replaces the deprecated distplot)
plt.figure(figsize=(15,10))
plt.title('Distribution of Quantity')
sns.histplot(df['Quantity'], kde=True, stat='density');

**The graph above clearly shows a positively skewed (right-skewed) distribution.**

In [None]:
# Transforming the skewed distribution with a log transformation
plt.figure(figsize=(15,10))
plt.title('Log distribution of Quantity')
sns.histplot(np.log(df['Quantity']), kde=True, stat='density');
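To put a number on what the log transformation does, the sketch below compares skewness before and after on synthetic data. The lognormal sample is an assumption standing in for a right-skewed positive column such as Quantity; it is not the retail data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed positive values (assumption: a stand-in for Quantity).
quantity = pd.Series(rng.lognormal(mean=1.0, sigma=1.0, size=5000))

# Sample skewness before and after taking the natural log.
skew_raw = quantity.skew()
skew_log = np.log(quantity).skew()
print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero after the transformation indicates a roughly symmetric distribution; real data will usually land somewhere in between.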

# **Feature Engineering**

In [None]:
# Converting the InvoiceDate column into datetime format
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

In [None]:
# Creating new features from InvoiceDate
df['Month']=df['InvoiceDate'].dt.month_name()
df['Day']=df['InvoiceDate'].dt.day_name()
df['Hour']=df['InvoiceDate'].dt.hour

**Transaction Made Per Month**

In [None]:
# Creating a dataframe of transaction counts per month
month_df = (
    df['Month']
    .value_counts()
    .rename_axis('Month_Name')
    .reset_index(name='Count')
)
month_df

In [None]:
# Plotting month-wise transactions
plt.figure(figsize=(13,8))
ax = sns.barplot(x='Month_Name', y='Count', data=month_df)

# Annotate the top of each bar with its share of all transactions
total = month_df['Count'].sum()
for i, v in enumerate(month_df['Count']):
    ax.text(i, v, f'{v / total * 100:.2f}%', color='black', ha='center', va='bottom', fontweight='bold')

plt.title('Month-Wise Transaction');

**November, October and December have the highest numbers of transactions. This is likely due to the festive season at the end of the year and the run-up to the new year.**

**April and February have the fewest purchases.**

In [None]:
# Creating a dataframe of transaction counts per weekday
day_df = (
    df['Day']
    .value_counts()
    .rename_axis('Day_Name')
    .reset_index(name='Count')
)
day_df

In [None]:
# Plotting the graph daywise transactions
plt.figure(figsize=(12,8))
plt.title('Day-wise transaction')
sns.barplot(x='Day_Name',y='Count',data=day_df);

**The graph above shows that Thursday has the most transactions. There are no transactions on Saturday, which may be due to missing data.**

**Most of the transactions took place on Thursday, Wednesday and Tuesday.**

In [None]:
# Creating a dataframe of transaction counts per hour
hour_df = (
    df['Hour']
    .value_counts()
    .rename_axis('Hours')
    .reset_index(name='Count')
)
hour_df

In [None]:
# Plotting the graph for hourly transactions
plt.figure(figsize=(13,8))
plt.title('Hourly transactions')
sns.barplot(x='Hours',y='Count',data=hour_df);

**The graph above shows that most transactions take place between 12 pm and 3 pm.**

In [None]:
# Dividing hours into different time periods like morning, afternoon and evening
def time_type(time):
  if(time>=6 and time<=11):
    return 'Morning'
  elif(time>=12 and time<=17):
    return 'Afternoon'
  else:
    return 'Evening'

In [None]:
# Applying function on hour column
df['Time_type']=df['Hour'].apply(time_type)
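The same bucketing can also be done without a row-wise `apply`. This is a sketch of a vectorized alternative using `np.select`, assuming the same boundaries as `time_type` above; the `hours` series is made-up sample input.

```python
import numpy as np
import pandas as pd

# Made-up sample hours covering each boundary of the time_type rules.
hours = pd.Series([6, 11, 12, 17, 18, 23, 0])

# Conditions are checked in order; anything matching none of them is 'Evening'.
# Series.between is inclusive on both ends, matching the >= / <= checks.
labels = np.select(
    [hours.between(6, 11).to_numpy(), hours.between(12, 17).to_numpy()],
    ["Morning", "Afternoon"],
    default="Evening",
)
print(labels.tolist())
```

Note that, as in the original function, hours from 0 to 5 fall into 'Evening' because they match neither condition.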

In [None]:
plt.figure(figsize=(12,8))
plt.title('Time_type wise transaction')
sns.countplot(x='Time_type',data=df);

**The graph above shows that the afternoon is the busiest time slot, when most of the transactions take place.**

# **Top Customers**

In [None]:
# Creating a dataframe of top customers by number of transactions
# (value_counts already sorts in descending order)
top_customers = (
    df['CustomerID']
    .value_counts()
    .rename_axis('CustomerID')
    .reset_index(name='count')
)

In [None]:
# Displaying the top 5 customers
top_customers.head(5)

In [None]:
plt.figure(figsize = (10,7))
sns.barplot(x = 'CustomerID',y = 'count',data = top_customers[:5])
plt.xlabel('Customer ID', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.title("Top 5 Customer's ID", fontsize = 16)
plt.show()

**The chart provides insight into the customers who make the most purchases from the business and helps identify potential loyal customers or areas for improvement in customer retention.**

# **Top Selling products**

In [None]:
# Group the data by product name and calculate the sum of the quantity sold for each product
product_group = df.groupby('Description')['Quantity'].sum()

In [None]:
# Sort the data in descending order
product_group = product_group.sort_values(ascending=False)

In [None]:
# Select the top 10 items
top_10_selling_products = product_group.index[:10]

In [None]:
# Create a new dataframe to store the top 10 selling products
top_10_products_df = pd.DataFrame({'Product': top_10_selling_products, 'Quantity Sold': product_group.values[:10]})
top_10_products_df

In [None]:
plt.figure(figsize=(15, 6))
plt.bar(top_10_products_df['Product'], top_10_products_df['Quantity Sold'])

# Set the title and axis labels
plt.title('Top 10 Selling Products', size=15, fontweight='bold')
plt.xlabel('Product', size=15)
plt.ylabel('Quantity Sold', size=15)

# Rotate the x-axis labels
plt.xticks(rotation=90)
plt.show()

**A bar chart is a good choice for showing the quantity of each product sold as it allows for easy comparison between the different products. It is also effective in highlighting the top 10 selling products.**


**This chart shows the quantity of each of the top 10 selling products, providing insight into the most popular items. It also allows for comparison between the different products and their respective quantities sold.**

# **RFM Analysis**

**RFM analysis is a marketing technique that segments customers based on their recency (time since last purchase), frequency (number of purchases), and monetary value (amount spent) of their transactions. This helps businesses understand their customers better and make data-driven decisions about marketing and customer engagement.**

**Three dimensions of RFM are -**

**Recency -** In order to find the recency value of each customer, we need to determine the last invoice date as the current date and subtract the last purchasing date of each customer from this date.

**Frequency -** In order to find the frequency value of each customer, we need to determine how many times the customers make purchases.

**Monetary -** In order to find the monetary value of each customer, we need to determine how much do the customers spend on purchases.
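The three definitions above can be sketched on a handful of made-up transactions. The `tx` frame, its values and the `snapshot` date are hypothetical. As an assumption, Frequency here counts distinct invoices with `nunique`, which is one common reading of "number of purchases".

```python
import pandas as pd

# Hypothetical transactions: one row per invoice line item.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["1001", "1002", "2001", "2001", "2002", "3001"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-06-01",
        "2011-06-01", "2011-07-15", "2011-12-08",
    ]),
    "TotalAmount": [20.0, 30.0, 5.0, 15.0, 10.0, 100.0],
})

# Reference date: the day after the last invoice in the toy data.
snapshot = pd.Timestamp("2011-12-10")

rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # distinct invoices
    Monetary=("TotalAmount", "sum"),                               # total spend
)
print(rfm)
```

Counting rows instead of distinct invoices would give customer 2 a Frequency of 3, because invoice 2001 has two line items.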

In [None]:
import datetime as dt
# Recency = latest date - last invoice date, Frequency = count of invoice records,
# Monetary = sum of TotalAmount for each customer

# Get the latest value of the 'InvoiceDate' column
last_date = df['InvoiceDate'].max()
last_date

In [None]:
# Set Latest date 2011-12-10 as last invoice date was 2011-12-09
Latest_Date = dt.datetime(2011,12,10)

In [None]:
# Creating a new feature TotalAmount from product of Quantity and Unitprice
df['TotalAmount']=df['Quantity']*df['UnitPrice']

# Aggregate per customer: days since the last purchase, number of invoice records
# (one per line item), and total amount spent
rfm_df = df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (Latest_Date - x.max()).days, 'InvoiceNo': 'count', 'TotalAmount': 'sum'})


**The RFM dataframe combines recency, frequency, and monetary value information for each customer to provide a comprehensive overview of their behavior and spending habits.**

In [None]:
#Convert Invoice Date into type int
rfm_df['InvoiceDate'] = rfm_df['InvoiceDate'].astype(int)

#Rename column names to Recency, Frequency and Monetary
rfm_df.rename(columns={'InvoiceDate': 'Recency', 
                         'InvoiceNo': 'Frequency', 
                         'TotalAmount': 'Monetary'}, inplace=True)

rfm_df.reset_index().head()

**Recency Distribution Plot**

In [None]:
# Recency distribution plot (histplot replaces the deprecated distplot)
x = rfm_df['Recency']
plt.figure(figsize=(10,8))
sns.histplot(x, kde=True, stat='density');

**Frequency Distribution Plot**

In [None]:
# Frequency distribution plot
x = rfm_df['Frequency']
plt.figure(figsize=(10,8))
sns.histplot(x, kde=True, stat='density');

**Monetary Distribution Plot**

In [None]:
# Monetary distribution plot
x = rfm_df['Monetary']
plt.figure(figsize=(10,8))
sns.histplot(x, kde=True, stat='density');

**The Recency, Frequency and Monetary graphs above all show positively skewed distributions. Quantile-based scoring depends only on the rank of each value, so it copes well with this skew, and we use the quantile method next.**



**Quantile Method**

In [None]:
# Split into quantiles
quantiles = rfm_df.quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()
quantiles

In [None]:
# Functions to create R, F and M scores from the quartiles. Score 1 is the best:
# the most recent quartile gets R = 1, and the quartile with the highest
# frequency or monetary value gets F = 1 or M = 1.
def RScoring(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4
    
def FnMScoring(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1

In [None]:
rfm_df['R'] = rfm_df['Recency'].apply(RScoring, args=('Recency',quantiles,))
rfm_df['F'] = rfm_df['Frequency'].apply(FnMScoring, args=('Frequency',quantiles,))
rfm_df['M'] = rfm_df['Monetary'].apply(FnMScoring, args=('Monetary',quantiles,))
rfm_df.head()
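Quartile scoring of this kind can also be written with `pd.qcut`. The sketch below is an alternative under stated assumptions, not the method used above: the `recency` and `monetary` series are made-up, and the label order encodes the same convention (1 = best).

```python
import pandas as pd

# Made-up values with no ties, so the quartile edges are unique.
recency = pd.Series([1, 5, 10, 20, 40, 80, 160, 320])
monetary = pd.Series([10, 20, 40, 80, 160, 320, 640, 1280])

# Lowest recency quartile gets 1; highest monetary quartile gets 1.
r_score = pd.qcut(recency, q=4, labels=[1, 2, 3, 4]).astype(int)
m_score = pd.qcut(monetary, q=4, labels=[4, 3, 2, 1]).astype(int)

print(r_score.tolist())
print(m_score.tolist())
```

On heavily tied data `pd.qcut` can raise an error about non-unique bin edges, which is one reason explicit threshold functions like the ones above may be preferable.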

In [None]:
#Handling negative and zero values so as to handle infinite numbers during log transformation
def handle_neg_n_zero(num):
    if num <= 0:
        return 1
    else:
        return num
#Applying handle_neg_n_zero function to Recency and Monetary columns 
rfm_df['Recency'] = [handle_neg_n_zero(x) for x in rfm_df.Recency]
rfm_df['Monetary'] = [handle_neg_n_zero(x) for x in rfm_df.Monetary]

#Performing Log transformation to bring data into normal or near normal distribution
Log_Tfd_Data = rfm_df[['Recency', 'Frequency', 'Monetary']].apply(np.log, axis = 1).round(3)


In [None]:
# Data distribution after log transformation for Recency
Recency_Plot = Log_Tfd_Data['Recency']
plt.figure(figsize=(10,8))
sns.histplot(Recency_Plot, kde=True, stat='density');

In [None]:
# Data distribution after log transformation for Frequency
Frequency_Plot = Log_Tfd_Data.query('Frequency < 1000')['Frequency']
plt.figure(figsize=(10,8))
sns.histplot(Frequency_Plot, kde=True, stat='density');

In [None]:
# Data distribution after log transformation for Monetary
Monetary_Plot = Log_Tfd_Data.query('Monetary < 10000')['Monetary']
plt.figure(figsize=(10,8))
sns.histplot(Monetary_Plot, kde=True, stat='density');

**The plots above show that the log transformation has substantially reduced the skewness of the data.**



In [None]:
# Calculating the concatenated score of RFM
rfm_df['RFMGroup'] = rfm_df.R.map(str) + rfm_df.F.map(str) + rfm_df.M.map(str)

In [None]:
# Calculate and Add RFMScore value column showing total sum of RFMGroup values
rfm_df['RFMScore'] = rfm_df[['R', 'F', 'M']].sum(axis = 1)
rfm_df.head()

In [None]:
# Best customers (RFMGroup '111', the best quartile on every dimension),
# sorted by Monetary in descending order
rfm_df2 = rfm_df[rfm_df['RFMGroup'] == '111'].sort_values('Monetary', ascending=False)
rfm_df2.head(10)

In [None]:
# Segmenting customers by RFM score. With the scoring functions above,
# 1 is the best quartile and 4 is the worst on every dimension.
print("Best Customers: ",len(rfm_df[rfm_df['RFMGroup']=='111']))
print('Loyal Customers: ',len(rfm_df[rfm_df['F']==1]))
print("Big Spenders: ",len(rfm_df[rfm_df['M']==1]))
print('Almost Lost: ', len(rfm_df[rfm_df['RFMGroup']=='311']))
print('Lost Customers: ',len(rfm_df[rfm_df['RFMGroup']=='411']))
print('Lost Cheap Customers: ',len(rfm_df[rfm_df['RFMGroup']=='444']))

**With the segmentation of our customers based on their RFM scores, we can now tailor our marketing strategies to each segment effectively.**

**For example, our "Best Customers" or "Champions" can be rewarded for their loyalty. These customers can also serve as early adopters for new products, so we can suggest them to participate in a "Refer a Friend" program.**

**For customers who are "At Risk", we can send them personalized emails to encourage them to make a purchase. This can help to retain them as customers and keep them engaged with our brand.**

In [None]:
# Dropping the RFMScore and its components columns from the dataframe
rfm_data = rfm_df.drop(['R','F','M','RFMScore','RFMGroup'], axis=1)

In [None]:
# Calculate the correlation between the variables
correlation = rfm_data.corr()

# Display the correlation matrix
correlation

In [None]:
# Plotting the heatmap of the feature correlations in the dataframe
sns.heatmap(rfm_data.corr(), annot=True, cmap='Reds');

**Recency is negatively correlated with both Frequency and Monetary: customers who purchased recently (low Recency) tend to buy more often and spend more. There is also a positive but weak correlation between Frequency and Monetary.**

**The insights can help create a positive business impact by helping businesses better understand customer behaviour and tailor their sales and promotions accordingly.**
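The sign of the Recency correlation is easy to misread, so here is a small sketch on made-up numbers (the `toy` values are hypothetical). It also computes the Spearman rank correlation, which is an addition for illustration and may be more informative than Pearson on skewed data.

```python
import pandas as pd

# Hypothetical customers: frequent buyers also bought recently (low Recency).
toy = pd.DataFrame({
    "Recency": [300, 200, 150, 60, 20, 5],
    "Frequency": [1, 2, 3, 5, 8, 13],
})

# Pearson measures linear association; Spearman uses ranks only.
pearson = toy.corr(method="pearson").loc["Recency", "Frequency"]
spearman = toy.corr(method="spearman").loc["Recency", "Frequency"]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

Both coefficients are negative here because Recency counts days since the last purchase: a smaller value means a more recent purchase.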

# **Plot the distribution of Recency, Frequency, and Monetary**

**A scatter matrix is a visual representation of the relationships between multiple variables or features in a dataset. It can help identify patterns, trends, and correlations between the variables. It is a useful tool for exploratory data analysis and can help provide insight into the data.**



In [None]:
# Visualizing the distribution of features in the dataset using Seaborn.
sns.pairplot(rfm_data, diag_kind='kde');

**The pairplot with kde diagonal plots was chosen as it is an effective way to visualize the distribution and pairwise relationships between multiple features in a dataset. It allows us to quickly identify any correlations or patterns between variables, making it an excellent choice for visualizing the distribution of features in the dataset.**

**We can observe that the distributions of the three variables are skewed. This suggests transforming the features before clustering: K-means relies on Euclidean distances, so heavy skew and outliers can dominate the cluster assignments.**

In [None]:
#The skew() method is used to measure the asymmetry of the data around the mean. 
rfm_data.skew()

**Plot the distribution of Recency, Frequency, and Monetary after Log Transformation**

In [None]:
sns.pairplot(data = Log_Tfd_Data, diag_kind='kde');

**The distribution of the Frequency and Monetary features have improved and appear to be more normal, but the distribution of the Recency feature has only improved to some extent and is still not as well-normalized as the other two features.**



In [None]:
Log_Tfd_Data.head()

# **Correlation Heatmap**

In [None]:
# Features correlation after log transformation or data normalization
sns.heatmap(Log_Tfd_Data.corr(),annot=True, cmap='Reds');

**The correlation between Monetary and Frequency is now stronger.**

# **Data Scaling**

In [None]:
# Assign the normalized data to a variable "X"
X = Log_Tfd_Data

In [None]:
# Define the features to use for K-means
features = ['Recency', 'Frequency', 'Monetary']

# Standardize the feature values
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(Log_Tfd_Data[features].values)

**I used standardization so that every feature has zero mean and unit variance. This matters because distance-based algorithms such as K-means are sensitive to feature scale: without it, the feature with the largest range would dominate the distance calculation.**
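As a quick check of what `StandardScaler` does, the sketch below scales a small made-up array (the `features` values are hypothetical) whose two columns differ in range by a factor of about 100.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
features = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
])

# Each column is shifted to mean 0 and divided by its standard deviation.
scaled = StandardScaler().fit_transform(features)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contribute comparably to Euclidean distances.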

# **ML Model Implementation**

# **K-means Implementation**

**K-means is a clustering algorithm that groups data points into K clusters. Choosing the right number of clusters can be challenging. The Silhouette Coefficient evaluates cluster quality by comparing how close each data point is to its own cluster with how close it is to the nearest other cluster. It ranges from -1 to 1, and a higher Silhouette Score indicates better-separated clusters. We use k-means++ initialization, which spreads out the initial centroids and usually leads to better solutions.**
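Selecting K by silhouette score can be sketched on synthetic data where the answer is known. The three blob centres, the `points` array and the range of K are assumptions chosen for illustration; they are not taken from the retail data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (assumed centres and spread).
points, _ = make_blobs(
    n_samples=300,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Fit K-means for several K and record the mean silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real customer data the peak is usually much less pronounced, so the score is best read together with business interpretability.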

# **K-Means with silhouette_score**

In [None]:
# Importing libraries (numpy and matplotlib are already imported above)
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
silhouette_scores = []

# Loop over different values of K
for n_clusters in range(2, 16):
    # Initialize the KMeans model with the number of clusters
    # (fixed n_init and random_state make the scores reproducible)
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, random_state=0)
    
    # Fit the KMeans model to the data
    kmeans.fit(X)
    
    # Predict the cluster labels for each data point
    labels = kmeans.labels_
    
    # Calculate the silhouette score for this solution
    silhouette = silhouette_score(X, labels)
    
    # Append the silhouette score to the array
    silhouette_scores.append(silhouette)
    
    # Print the silhouette score for this solution
    print(f"Silhouette score for {n_clusters} clusters: {silhouette:.3f}")
    
# Plot the silhouette scores
plt.plot(range(2, 16), silhouette_scores, '-o', color='red', markersize=10, linewidth=2)
plt.xlabel('Number of clusters (K)', fontsize=14)
plt.ylabel('Silhouette score', fontsize=14)
plt.title('Silhouette score for different values of K', fontsize=16)
plt.xticks(range(2, 16), fontsize = 12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.show()

**The silhouette score plot is commonly used to evaluate the quality of clustering.**

**The plot suggests that 2 clusters are optimal for the dataset.**





In [None]:
# Instantiate a KMeans object with 2 clusters (fixed seed for reproducibility)
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)

# Fit the input data X to the KMeans model
kmeans.fit(X)

# Predict the cluster labels for the input data X using the trained KMeans model
y_kmeans = kmeans.predict(X)

In [None]:
# Visualization of customer segmentation based On RFM features. 
# Set the figure size and title for the scatter plot
plt.figure(figsize=(12,8))
plt.title('Customer Segmentation Based on RFM Features')

# Plot the scatter plot using the first two features of the input data X and the predicted cluster labels y_kmeans
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='RdYlBu')

# Get the cluster centers from the trained KMeans model and plot them as yellow circles with transparency
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5, edgecolor='black')

# Set the x-axis and y-axis labels
plt.xlabel('Recency (log-transformed, standardized)')
plt.ylabel('Frequency (log-transformed, standardized)')

# Add a color bar to the plot to show the correspondence between the colors and the cluster labels
color_bar = plt.colorbar()
color_bar.set_ticks(np.unique(y_kmeans))
color_bar.set_ticklabels(['Cluster {}'.format(i) for i in np.unique(y_kmeans)])

# Show the plot
plt.show()
     


**The scatter plot is commonly used to visualize the distribution of data points in a 2D space. In this case, the scatter plot is used to visualize customer segmentation based on RFM (Recency, Frequency, Monetary) features.**

**The scatter plot reveals distinct clusters of customers based on their RFM features. This allows businesses to identify groups of customers with similar behavior and tailor their marketing strategies accordingly. The cluster centers (yellow circles) also provide a visual representation of the typical RFM profile of each customer segment.**

**These segments let businesses target specific customer groups with personalized marketing strategies and product recommendations. This can lead to improved customer experiences, increased customer loyalty, and ultimately, positive business impact.**

# **K-Means with Elbow method**

**The elbow method is used to find the optimal number of clusters for KMeans clustering. It involves plotting the within-cluster sum of squares (WCSS) against the number of clusters. The elbow point on the plot corresponds to the optimal number of clusters that balances the trade-off between model complexity and data structure.**
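WCSS is the sum of squared distances from each point to the centroid of its assigned cluster, and scikit-learn exposes it as `inertia_`. The sketch below recomputes it by hand on synthetic blobs (the `points` data is an assumption for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled RFM matrix.
points, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up each point's centroid, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = np.sum((points - assigned_centers) ** 2)
print(wcss_manual, km.inertia_)
```

Because adding clusters can only lower WCSS, the elbow method looks for the point where the decrease levels off rather than for a minimum.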

In [None]:
# Initialize an empty list to store the WCSS values for different number of clusters
wcss = []  

for i in range(1, 11):
    # Create a KMeans instance for each number of clusters
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    # Fit the KMeans model to the input data X 
    kmeans.fit(X)
    # Append the WCSS value to the list for the current number of clusters 
    wcss.append(kmeans.inertia_)  

# Plot the WCSS values against the number of clusters
plt.figure(figsize=(10,6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.xticks(np.arange(1, 11, 1))
plt.grid(True)
plt.show()

**The Elbow Method plot is commonly used to identify the optimal number of clusters in a K-means clustering algorithm.**


In [None]:
# Create an instance of the KMeans model with 2 clusters and initialize the centroids using the 'k-means++' method
KMean_clust = KMeans(n_clusters= 2, init= 'k-means++', max_iter= 1000)

# Fit the KMeans model to the data in the X variable
KMean_clust.fit(X)

# Add a new column to the rfm_df dataframe to store the cluster labels for each observation
rfm_df['Cluster'] = KMean_clust.labels_

# Display the first 10 rows of the rfm_df dataframe with the new 'Cluster' column
rfm_df.head(10)

# **Agglomerative Hierarchical Clustering**

**Agglomerative Hierarchical Clustering is a clustering algorithm that starts with each data point in its own cluster, and then merges the two closest clusters until only one remains, producing a tree-like structure. Different distance metrics and linkage criteria can be used to determine proximity between clusters. It is a popular and effective method for exploratory data analysis.**
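The merge process can be sketched with SciPy on six made-up points forming two obvious groups. The `points` array is hypothetical, and `fcluster` is used here as one way to cut the tree into a fixed number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two tight, well-separated groups of three points each.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# Each row of Z records one merge: the two clusters joined, the merge
# distance, and the size of the new cluster. n points give n - 1 merges.
Z = linkage(points, method="ward")

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(Z.shape, labels)
```

The large jump in merge distance at the final merge is what a dendrogram shows as a tall vertical link between the two groups.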



In [None]:
# Import necessary libraries
from sklearn.cluster import AgglomerativeClustering

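# Create an instance of AgglomerativeClustering with 2 clusters and ward linkage.
# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the 'affinity' argument has been removed in recent scikit-learn releases).
model = AgglomerativeClustering(n_clusters=2, linkage='ward')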

# Fit the input data X to the model
model.fit(X)

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Set the figure size and title for the dendrogram plot
plt.figure(figsize=(15, 12))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')

# Set the x and y-axis labels for the dendrogram plot
plt.xlabel('Sample index')
plt.ylabel('Distance')

# Create a linkage matrix using the input data X and the ward linkage method
Z = linkage(X, 'ward')

# Plot the dendrogram with specified parameters
dendrogram(Z, leaf_rotation=90.0, p=25, color_threshold=80, leaf_font_size=10, truncate_mode='level')

# Ensure tight layout of the plot
plt.tight_layout()

**I picked a dendrogram plot because it is a common way to visualize the results of hierarchical clustering, which is the clustering method used in this case.**

**The dendrogram plot shows how the data points are clustered based on their distance to each other. It helps identify the optimal number of clusters and the hierarchical structure of the clusters.**

**The insights gained from the dendrogram plot can help identify the optimal number of clusters and determine which observations or clusters are most similar to each other. This information can be used to create more targeted marketing or sales strategies and improve overall business performance.**


# **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**

**DBSCAN is a clustering algorithm that groups together data points that are close to each other and are part of a dense region of the dataset. It is useful for handling non-linearly separable data and can handle noise and outliers.**
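How DBSCAN separates dense regions from noise can be sketched on synthetic points. The two blobs, the single far-away outlier and the parameter values are assumptions chosen so that the outcome is unambiguous.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Two dense synthetic blobs plus one isolated point far from both.
blob_a = rng.normal(loc=0.0, scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.05, size=(50, 2))
outlier = np.array([[20.0, 20.0]])
points = np.vstack([blob_a, blob_b, outlier])

# eps is the neighbourhood radius; min_samples is the number of neighbours
# (including the point itself) needed for a point to count as a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Noise points receive the label -1 and are not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Unlike K-means, the number of clusters is not set in advance; it follows from `eps` and `min_samples`, so results can change noticeably when those are tuned.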

In [None]:
from sklearn.cluster import DBSCAN

# Create and fit the DBSCAN model
dbscan = DBSCAN(eps=0.5, min_samples=15)
dbscan.fit(X)

# Plot the results
plt.scatter(X[:,0], X[:,1], c=dbscan.labels_, cmap='rainbow')
plt.title('DBSCAN Clustering')
plt.xlabel('Recency (log-transformed, standardized)')
plt.ylabel('Frequency (log-transformed, standardized)')
plt.show()

**The chart used is a scatter plot, which is a suitable choice for visualizing the clustering results of DBSCAN. The x and y axes represent the first two features of the scaled data (Recency and Frequency), and the points are colored based on their assigned cluster labels.**

**The insights gained from the chart include identifying the clusters formed by the DBSCAN algorithm and their density. The points that are closer to each other are assigned to the same cluster, and the outliers or noise points are labeled as -1. By observing the distribution of the points and the density of the clusters, we can understand the structure and characteristics of the data, and potentially find any patterns or anomalies.**


**The gained insights can help in creating a positive business impact by identifying groups of similar data points, which can aid in targeting specific segments of customers or optimizing operational processes.**


# **Summary Table**

In [None]:
from prettytable import PrettyTable

# Initialize the table with specified column names
myTable = PrettyTable(['SL No.', "Model_Name", 'Data', "Optimal_Number_of_cluster"])

# Add rows to the table
myTable.add_row(['1', "K-Means with silhouette_score", "RFM", "2"])
myTable.add_row(['2', "K-Means with Elbow method", "RFM", "2"])
myTable.add_row(['3', "Hierarchical clustering", "RFM", "2"])
myTable.add_row(['4',"DBSCAN ", "RFM", "3"])

# Print the table
print(myTable)

# **Conclusion**

Write the conclusion here.
