# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

> The unsupervised learning method **DBSCAN** is often preferred over k-means when real-world data are non-spherical and noisy, because the clusters found by k-means are strongly affected by noise and outliers.
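
As a rough illustration of this point, the sketch below compares k-means and DBSCAN on two noisy concentric rings generated with `make_circles`. The data are synthetic (not the customer data used later), and the `eps` and `min_samples` values are illustrative guesses for this toy set only.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Synthetic data: two noisy concentric rings (assumption: toy data, not the customer data).
X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.04, random_state=0)

# k-means assigns each point to the nearest centroid, so it cannot
# separate a ring from the ring nested inside it.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows regions of high density; eps and min_samples are
# hand-picked for this toy data and would not transfer to other data sets.
dbscan_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# Adjusted Rand Index against the known ring membership (1.0 = perfect agreement).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"ARI k-means: {ari_kmeans:.3f}")
print(f"ARI DBSCAN:  {ari_dbscan:.3f}")
```

On data like this, DBSCAN should recover the two rings almost exactly, while k-means tends to cut both rings in half.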

> Also, the elbow criterion is not a reliable way to select the optimal number of clusters k. See this paper for details: https://arxiv.org/pdf/2212.12189



>> DBSCAN demo from sklearn: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html


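> The hyperparameters are the neighborhood radius epsilon (eps) and the minimum number of samples (min_samples), which is the minimum number of points that must lie within distance eps of a point for it to count as a core point. In scikit-learn this count includes the point itself.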
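
To make the role of the two hyperparameters concrete, here is a minimal sketch that counts the samples inside each point's eps-neighborhood by hand and checks the result against the core points reported by `DBSCAN`. The data are synthetic, and the values of `eps` and `min_samples` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.neighbors import NearestNeighbors

# Synthetic toy data (assumption: not the customer data).
X, _ = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
eps, min_samples = 0.15, 5  # illustrative values only

# For every point, collect the samples within distance eps.
# The point itself is part of its own neighborhood (distance 0).
nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)
neighbor_counts = np.array([len(idx) for idx in neighborhoods])

# A core point has at least min_samples samples in its eps-neighborhood.
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

model = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
print("core points (manual):", manual_core.size)
print("core points (DBSCAN):", model.core_sample_indices_.size)
```

Points that are not core points end up either as border points (inside the neighborhood of some core point) or as noise with label -1.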

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import math
import missingno as msn

from sklearn import metrics
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
#from sklearn.decomposition import PCA

In [None]:
df = pd.read_csv('train_data.csv')
df.info()

#pd.plotting.scatter_matrix(df, alpha = 0.45, figsize = (16, 12), diagonal = 'kde')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               10000 non-null  int64  
 1   profession       9507 non-null   object 
 2   age              5822 non-null   float64
 3   gender           7421 non-null   object 
 4   highestDegree    10000 non-null  object 
 5   maritalStatus    10000 non-null  object 
 6   noOfKids         5467 non-null   float64
 7   creditRisk       10000 non-null  float64
 8   otherMembership  8553 non-null   object 
 9   pastStays1y      9050 non-null   float64
 10  blogger          10000 non-null  object 
 11  articles         10000 non-null  int64  
 12  amexCard         10000 non-null  object 
 13  purposeTravel    10000 non-null  object 
 14  staySpend        10000 non-null  float64
 15  loyaltyClass     10000 non-null  object 
dtypes: float64(5), int64(2), object(9)
memory usage: 1.2+ MB


# Data exploration

In [None]:
df.head(5)

|   | ID | profession | age | gender | highestDegree | maritalStatus | noOfKids | creditRisk | otherMembership | pastStays1y | blogger | articles | amexCard | purposeTravel | staySpend | loyaltyClass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Public Sector | NaN | Male | High School Equivalent | Single | 0.0 | 3.18 | IHG | 23.0 | No | 7 | Yes | Medical Tourism | 1433.147527 | Bronze |
| 1 | 1 | Private Sector | NaN | NaN | Associate Degree | Married | 1.0 | 2.07 | IHG | 48.0 | No | 4 | Yes | Leisure Travel | 38.379023 | Silver |
| 2 | 2 | Business | 44.0 | Female | Masters Degree | Married | NaN | 2.81 | Discovery | 32.0 | No | 1 | No | Leisure Travel | 0.0 | Bronze |
| 3 | 3 | Public Sector | NaN | Male | Bachelors Degree | Single | 0.0 | 3.46 | IHG | 6.0 | No | 16 | Yes | Medical Tourism | 0.0 | Bronze |
| 4 | 4 | Business | NaN | Female | Bachelors Degree | Widowed | NaN | 3.21 | Marriott | NaN | Yes | 0 | No | Business | 764.418767 | Silver |


In [None]:
fig = px.violin(df, x = 'gender', y = "creditRisk", color = "loyaltyClass", box = True, points = 'all')
fig.show()

In [None]:
fig = px.violin(df, x = 'loyaltyClass', y = "creditRisk", color = "gender", box = True, violinmode = 'overlay')
fig.show()

# Clustering of customers

In [None]:
df1 = df[['creditRisk', 'staySpend']]

df1 = df1.dropna(how = 'any', axis = 0)

In [None]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   creditRisk  10000 non-null  float64
 1   staySpend   10000 non-null  float64
dtypes: float64(2)
memory usage: 156.4 KB


In [None]:
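# Defaults are eps=0.5, min_samples=5, metric='euclidean'. The features are not standardized here,
# so Euclidean distances (and hence eps) are dominated by staySpend, which has the far larger range.
model = DBSCAN(eps = 1.0, min_samples = 300)
pred1 = model.fit(df1)  # fit() returns the fitted estimator itself (pred1 is model)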
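
Because eps is a distance, its meaning depends on the scale of the features. A common heuristic, sketched below, is to standardize the features and then inspect the sorted distance of every sample to its k-th nearest sample; candidate values for eps are often read off near the knee of that curve. This is only a sketch: the data are a synthetic stand-in for two features on very different scales, `k_distance_curve` is a hypothetical helper, and `k = 20` is an arbitrary choice rather than a recommendation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def k_distance_curve(X, k):
    """Distance of each sample to its k-th nearest sample (itself included), sorted ascending."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)  # first column is distance 0 (the sample itself)
    return np.sort(distances[:, -1])

# Synthetic stand-in for two features on very different scales
# (assumption: these distributions are made up for illustration).
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.normal(3.0, 0.5, size=1000),    # small-range feature
    rng.gamma(2.0, 300.0, size=1000),   # large-range feature
])

# Standardize so that both features contribute comparably to Euclidean distances.
X_scaled = StandardScaler().fit_transform(X_raw)

k = 20  # hypothetical value, playing the role of min_samples
curve = k_distance_curve(X_scaled, k)

# Smallest, median and largest k-distance; plotting `curve` shows the knee.
print(curve[[0, len(curve) // 2, -1]])
```

Plotting `curve` with `plt.plot(curve)` gives the usual k-distance plot; the knee is a starting point for eps, not a guarantee of good clusters.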

In [None]:
labels1 = pred1.labels_

# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels1)) - (1 if -1 in labels1 else 0)
n_noise_ = list(labels1).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)

Estimated number of clusters: 1
Estimated number of noise points: 4692


In [None]:
labels_true = df.loc[df1.index, 'loyaltyClass']  # align the labels with the rows kept in df1

print(f"Homogeneity: {metrics.homogeneity_score(labels_true, labels1):.4f}")
print(f"Completeness: {metrics.completeness_score(labels_true, labels1):.4f}")
print(f"V-measure: {metrics.v_measure_score(labels_true, labels1):.4f}")
print(f"Adjusted Rand Index: {metrics.adjusted_rand_score(labels_true, labels1):.4f}")
print(
    "Adjusted Mutual Information:"
    f" {metrics.adjusted_mutual_info_score(labels_true, labels1):.4f}"
)
# Note: silhouette_score treats the noise label -1 as a cluster of its own here.
print(f"Silhouette Coefficient/Compactness: {metrics.silhouette_score(df1, labels1):.4f}")

Homogeneity: 0.0017
Completeness: 0.0020
V-measure: 0.0019
Adjusted Rand Index: 0.0016
Adjusted Mutual Information: 0.0017
Silhouette Coefficient/Compactness: 0.2733
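
The silhouette value above counts the noise points (label -1) as if they formed a cluster. One possible alternative, sketched here on synthetic blobs, is to score only the points that DBSCAN actually assigned to a cluster. `silhouette_without_noise` is a hypothetical helper, not part of scikit-learn, and it returns `None` when fewer than two clusters remain, since the silhouette coefficient is undefined in that case (as it would be for the single-cluster result above).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels):
    """Silhouette over clustered points only; None if fewer than 2 clusters remain."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1                      # drop DBSCAN noise points
    n_clusters = len(set(labels[mask]))
    # silhouette_score needs 2 <= n_clusters <= n_samples - 1
    if n_clusters < 2 or n_clusters >= mask.sum():
        return None
    return silhouette_score(X[mask], labels[mask])

# Synthetic data: three well-separated blobs (assumption: toy data only).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                  cluster_std=0.6, random_state=0)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

score = silhouette_without_noise(X, labels)
print("silhouette (noise excluded):", score)

# Keep only cluster 0 and mark everything else as noise: the score is undefined.
single = silhouette_without_noise(X, np.where(labels == 0, 0, -1))
print("one cluster + noise:", single)
```

Whether to exclude noise is a judgment call: excluding it rewards compact clusters but hides how many points were left unassigned, so the noise count is worth reporting alongside the score.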
