# What is DBSCAN?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and is a clustering algorithm.
As the name suggests, the core idea of DBSCAN is built around the concept of dense regions. The assumption is that natural clusters are composed of densely located points. This requires a definition of a “dense region”. To provide one, the DBSCAN algorithm takes two parameters:

* Eps (ε) – the neighbourhood radius, i.e. the maximum distance at which two points count as neighbours
* MinPts – the minimum number of points required within distance Eps

A point that has at least MinPts points within distance Eps is a core point, and a “dense region” is built from core points that lie within Eps of one another. Points that lie within Eps of a core point but do not themselves have MinPts neighbours are treated as “border points”. The remaining points are noise, or outliers. This is shown in the picture below (for MinPts=3). Red points (D) are in a “dense region”: each one has at least 3 neighbours within distance Eps. Green points (B) are border points: they have a neighbour within distance Eps, but fewer than 3. The blue point (O) is an outlier: it has no neighbours within distance Eps.
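The three point types can be illustrated with a small NumPy sketch. The function `classify_points` and the sample coordinates are hypothetical, and the sketch assumes that a point counts itself among its neighbours (the convention scikit-learn uses for `min_samples`); it is not the full clustering algorithm, only the labelling step.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; fine for small illustrative inputs.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dists <= eps
    # Assumption: the neighbour count includes the point itself.
    is_core = within.sum(axis=1) >= min_pts
    # A border point is not core but has at least one core point within eps.
    is_border = ~is_core & (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(is_border, "border", "noise"))

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.2, 0.0), (5.0, 5.0)]
kinds = classify_points(points, eps=0.8, min_pts=3)
print(list(kinds))
```

Here the first three points are close enough to be core points, the fourth only reaches one of them and becomes a border point, and the last one is isolated.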

### Advantages of this approach:

* It finds the number of clusters itself, based on the Eps and MinPts parameters.
* It is able to separate elongated clusters, or clusters surrounded by other clusters, in contrast to e.g. K-Means, where clusters are always convex.
* It is also able to find points that do not fit into any cluster, i.e. it detects outliers.

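### The biggest drawbacks of DBSCAN:

* High computational cost: a neighbourhood query has to be executed for each point, which gives O(n log(n)) on average when a spatial index is used and O(n²) in the worst case.
* It identifies clusters of varying density poorly, because a single Eps and MinPts pair is applied to the whole dataset.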
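The varying-density drawback can be seen on a small synthetic example. The sketch below uses hypothetical grid data (two dense grids placed close together and one sparse grid far away); the `grid` helper and all parameter values are assumptions chosen for illustration, not part of the retail dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def grid(x0, y0, spacing, n=10):
    """Return an n x n grid of points starting at (x0, y0)."""
    axis = np.arange(n) * spacing
    xs, ys = np.meshgrid(axis + x0, axis + y0)
    return np.column_stack([xs.ravel(), ys.ravel()])

dense_a = grid(0.0, 0.0, 0.1)    # dense cluster
dense_b = grid(1.4, 0.0, 0.1)    # second dense cluster, 0.5 away from the first
sparse = grid(10.0, 10.0, 1.0)   # sparse cluster, far from both
X = np.vstack([dense_a, dense_b, sparse])

# A small eps separates the dense clusters but rejects the sparse one as noise.
small = DBSCAN(eps=0.15, min_samples=4).fit(X).labels_
# A large eps recovers the sparse cluster but merges the two dense ones.
large = DBSCAN(eps=1.1, min_samples=4).fit(X).labels_

sparse_is_noise = bool((small[200:] == -1).all())
dense_merged = len(set(large[:200].tolist())) == 1
print(sparse_is_noise, dense_merged)
```

No single value of `eps` in this sketch recovers all three groups, which is the behaviour the second drawback describes.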


# Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# import required libraries for clustering
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Load dataset

In [None]:
df = pd.read_csv('/kaggle/input/online-retail-customer-clustering/OnlineRetail.csv', sep=",", encoding="ISO-8859-1", header=0)

In [None]:
# first five rows
df.head()

In [None]:
# size of dataset
df.shape

In [None]:
# statistical summary of numerical variables
df.describe()

In [None]:
# summary about dataset
df.info()

# Exploratory data analysis

In [None]:
# check for missing values
df.isna().sum() / df.shape[0] * 100

In [None]:
# Dropping rows that have missing values

df = df.dropna()
df.shape

### Data Preparation

We are going to analyse the customers based on the 3 factors below:
* R (Recency): Number of days since last purchase
* F (Frequency): Number of transactions
* M (Monetary): Total amount of transactions (revenue contributed)
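As a quick worked example of these three quantities, the sketch below computes them for a hypothetical five-row table with two customers. The column names mirror the ones used in this notebook, but the data and the `toy` and `toy_rfm` names are made up for illustration.

```python
import pandas as pd

# Hypothetical transactions for two customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["A1", "A2", "B1", "B2", "B3"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-05", "2011-11-20", "2011-11-25", "2011-12-09"]
    ),
    "Amount": [10.0, 20.0, 5.0, 5.0, 15.0],
})

# Recency is measured against the latest date in the data.
snapshot = toy["InvoiceDate"].max()

toy_rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "count"),
    Amount=("Amount", "sum"),
).reset_index()
print(toy_rfm)
```

Customer 1 last bought 4 days before the snapshot date, made 2 transactions and spent 30.0; customer 2 bought on the snapshot date itself, made 3 transactions and spent 25.0.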

In [None]:
# New Attribute : Monetary

df['Amount'] = df['Quantity']*df['UnitPrice']

rfm_m = df.groupby('CustomerID')['Amount'].sum()
rfm_m = rfm_m.reset_index()
rfm_m.head()

In [None]:
# New Attribute : Frequency

rfm_f = df.groupby('CustomerID')['InvoiceNo'].count()

rfm_f = rfm_f.reset_index()
rfm_f.columns = ['CustomerID', 'Frequency']
rfm_f.head()

In [None]:
# Merging the two dfs

rfm = pd.merge(rfm_m, rfm_f, on='CustomerID', how='inner')
rfm.head()

In [None]:
# New Attribute : Recency

# Convert InvoiceDate to the proper datetime datatype

df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'],format='%d-%m-%Y %H:%M')

In [None]:
# Compute the maximum date to know the last transaction date

max_date = max(df['InvoiceDate'])
max_date

In [None]:
# Compute the difference between max date and transaction date

df['Diff'] = max_date - df['InvoiceDate']
df.head()

In [None]:
# Compute last transaction date to get the recency of customers

rfm_p = df.groupby('CustomerID')['Diff'].min()

rfm_p = rfm_p.reset_index()
rfm_p.head()

In [None]:
# Extract number of days only

rfm_p['Diff'] = rfm_p['Diff'].dt.days
rfm_p.head()

In [None]:
# Merge the dataframes to get the final RFM dataframe

rfm = pd.merge(rfm, rfm_p, on='CustomerID', how='inner')

rfm.columns = ['CustomerID', 'Amount', 'Frequency', 'Recency']
rfm.head()

There are 2 types of outliers, and we will treat them because they can skew our dataset:
* Statistical
* Domain specific

In [None]:
# Outlier Analysis of Amount Frequency and Recency

attributes = ['Amount','Frequency','Recency']
plt.rcParams['figure.figsize'] = [10,8]
sns.boxplot(data = rfm[attributes], orient="v", palette="Set2" ,whis=1.5,saturation=1, width=0.7)
plt.title("Outliers Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Range", fontweight = 'bold')
plt.xlabel("Attributes", fontweight = 'bold')

In [None]:
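# Removing (statistical) outliers for Amount
# Note: Q1 and Q3 are the 5th and 95th percentiles here, so the fences are wider than the classic quartile-based IQR rule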
Q1 = rfm.Amount.quantile(0.05)
Q3 = rfm.Amount.quantile(0.95)
IQR = Q3 - Q1
rfm = rfm[(rfm.Amount >= Q1 - 1.5*IQR) & (rfm.Amount <= Q3 + 1.5*IQR)]

# Removing (statistical) outliers for Recency
Q1 = rfm.Recency.quantile(0.05)
Q3 = rfm.Recency.quantile(0.95)
IQR = Q3 - Q1
rfm = rfm[(rfm.Recency >= Q1 - 1.5*IQR) & (rfm.Recency <= Q3 + 1.5*IQR)]

# Removing (statistical) outliers for Frequency
Q1 = rfm.Frequency.quantile(0.05)
Q3 = rfm.Frequency.quantile(0.95)
IQR = Q3 - Q1
rfm = rfm[(rfm.Frequency >= Q1 - 1.5*IQR) & (rfm.Frequency <= Q3 + 1.5*IQR)]
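The same filter is applied three times above, so it could be wrapped in a small helper. The function name `drop_outliers` and its parameters are hypothetical; the sketch keeps the 5th and 95th percentile fences used in this notebook and is shown on made-up data.

```python
import pandas as pd

def drop_outliers(frame, column, lower_q=0.05, upper_q=0.95, k=1.5):
    """Keep rows whose value lies within k * spread of the two quantiles."""
    low = frame[column].quantile(lower_q)
    high = frame[column].quantile(upper_q)
    spread = high - low
    keep = frame[column].between(low - k * spread, high + k * spread)
    return frame[keep]

# Hypothetical data: the values 1..100 plus one extreme value.
demo = pd.DataFrame({"Amount": list(range(1, 101)) + [10_000]})
cleaned = drop_outliers(demo, "Amount")
print(len(demo), len(cleaned))
```

With such a helper, the three blocks would reduce to a loop over `['Amount', 'Recency', 'Frequency']`, applied in the same order as above.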

### Rescaling the Attributes
It is extremely important to rescale the variables so that they have a comparable scale. There are two common ways of rescaling:

* Min-Max scaling
* Standardisation

Here, we will use Standardisation Scaling.
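To make the difference between the two options concrete, here is a minimal NumPy sketch on a hypothetical five-value array. Standardisation subtracts the mean and divides by the standard deviation, while Min-Max scaling maps the values onto the range [0, 1].

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical feature

# Standardisation: zero mean, unit variance.
# np.std defaults to the population formula (ddof=0), which is what StandardScaler uses.
standardised = (values - values.mean()) / values.std()

# Min-Max scaling: smallest value becomes 0, largest becomes 1.
min_max = (values - values.min()) / (values.max() - values.min())

print(standardised.round(3))
print(min_max)
```

Because DBSCAN relies on Euclidean distances, either scaling prevents a feature with a large range such as Amount from dominating the neighbourhood query.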

In [None]:
# Rescaling the attributes

rfm_df = rfm[['Amount', 'Frequency', 'Recency']]

# Instantiate
scaler = StandardScaler()

# fit_transform
rfm_df_scaled = scaler.fit_transform(rfm_df)
rfm_df_scaled.shape

In [None]:
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['Amount', 'Frequency', 'Recency']

rfm_df_scaled.head()

# Building the Model

In [None]:
# create an object
db = DBSCAN(eps=0.8, min_samples=7, metric='euclidean')

# fit the model
db.fit(rfm_df_scaled)

In [None]:
# Cluster labels (-1 marks noise points)
db.labels_
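The raw label array is hard to read on its own. A minimal sketch of how it could be summarised is shown below, using a hypothetical `labels` array in place of `db.labels_`; scikit-learn assigns the label -1 to noise points.

```python
import numpy as np

# Hypothetical stand-in for db.labels_
labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2])

unique, counts = np.unique(labels, return_counts=True)
sizes = dict(zip(unique.tolist(), counts.tolist()))

# The noise label -1 is not a cluster, so it is excluded from the count.
n_clusters = len(sizes) - (1 if -1 in sizes else 0)
n_noise = sizes.get(-1, 0)

print("clusters:", n_clusters, "noise points:", n_noise)
print("points per label:", sizes)
```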

# Evaluation

### Silhouette

In [None]:
from sklearn.metrics import silhouette_score

cluster_labels = db.labels_

# silhouette score
# note: noise points (label -1) are scored as if they formed one more cluster
silhouette_avg = silhouette_score(rfm_df_scaled, cluster_labels)
print("The silhouette score is", silhouette_avg)
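Since `silhouette_score` treats the noise label -1 like any other label, it can be informative to also compute the score on clustered points only. The sketch below does this on hypothetical synthetic data (two tight blocks of points and three isolated points); it is one possible variant, not the evaluation used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical data: two tight 3x3 blocks plus three isolated points.
axis = np.arange(3) * 0.1
block = np.array([(x, y) for x in axis for y in axis])
isolated = np.array([[5.0, 5.0], [5.0, -5.0], [-5.0, 0.0]])
X = np.vstack([block, block + [10.0, 0.0], isolated])

labels = DBSCAN(eps=0.2, min_samples=3).fit(X).labels_

# Score over all points: noise is counted as one extra "cluster".
score_all = silhouette_score(X, labels)

# Score over clustered points only.
mask = labels != -1
score_clustered = silhouette_score(X[mask], labels[mask])

print(score_all, score_clustered)
```

Note that `silhouette_score` needs at least two distinct labels, so the masked variant fails when DBSCAN finds a single cluster.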


In [None]:
rfm_df_scaled['label']=db.labels_

rfm_df_scaled.head()

In [None]:
for c in rfm_df_scaled.columns[:-1]:
    plt.figure(figsize=(6,4))
    sns.boxplot(data=rfm_df_scaled, y=c, x='label')
    plt.show()
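Alongside the boxplots, a per-label table of means and sizes gives a compact profile of each cluster. The sketch below uses a hypothetical `demo` frame with the same columns as `rfm_df_scaled`; the values are made up.

```python
import pandas as pd

# Hypothetical scaled RFM values with DBSCAN labels (-1 is noise).
demo = pd.DataFrame({
    "Amount": [-0.5, -0.4, 1.2, 1.0, 3.5],
    "Frequency": [-0.3, -0.5, 0.9, 1.1, 4.0],
    "Recency": [0.8, 1.0, -0.7, -0.9, -1.0],
    "label": [0, 0, 1, 1, -1],
})

# Mean of each attribute per label, plus the number of customers per label.
profile = demo.groupby("label").mean()
profile["size"] = demo.groupby("label").size()
print(profile)
```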