# **Density Based Clustering / DBSCAN clustering**

* DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

* It groups densely packed data points into a single cluster.

* It can identify clusters in large spatial datasets by looking at the local density of the data points.

* A key strength of DBSCAN clustering is that it is robust to outliers: points in low-density regions are labelled as noise instead of being forced into a cluster.

* It also does not require the number of clusters to be specified beforehand, unlike K-Means, where we have to specify the number of centroids.

* DBSCAN requires only two parameters:
     1. epsilon
     2. minPoints.


*  **Epsilon** is the radius of the circle drawn around each data point to check the density.
*  **minPoints** is the minimum number of data points required inside that circle for the data point to be classified as a core point.

In higher dimensions the circle becomes a hypersphere, epsilon becomes the radius of that hypersphere, and minPoints is the minimum number of data points required inside that hypersphere.
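The core-point rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the helper name `core_point_mask` and the toy coordinates are made up for this example, and each point is counted as lying inside its own circle.

```python
import numpy as np

def core_point_mask(points, eps, min_points):
    """Return a boolean mask marking which rows of `points` are core points."""
    # Pairwise Euclidean distances between all points, shape (n, n)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Number of points inside the eps-radius circle (the point itself included)
    neighbour_counts = (dists <= eps).sum(axis=1)
    return neighbour_counts >= min_points

# Toy data (hypothetical): four points packed together and one far away
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
is_core = core_point_mask(toy, eps=1.5, min_points=4)
print(is_core)  # [ True  True  True  True False]
```

The four packed points each have four points within distance 1.5, so they are core points; the isolated point only has itself.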




In [None]:
import numpy as np
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
dataset = pd.read_csv('/content/drive/MyDrive/Mall_Customers.csv')

In [None]:
dataset.head()

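|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|------------|--------|-----|--------------------|------------------------|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |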


In [None]:
dataset.shape

(200, 5)

In [None]:
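# Features: Annual Income (k$) and Spending Score (1-100)
X = dataset.iloc[:, [3,4]].values
# Note: this flattens both columns into one column of 400 one-dimensional samples
X = X.reshape(-1,1)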

Standardizing data is recommended because otherwise the range of values in each feature will act as a weight when determining how to cluster data, which is typically undesired.
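One possible way to standardize the features is scikit-learn's `StandardScaler`. This is a sketch, not a step taken from this notebook: the five rows are the income and spending-score values shown by `dataset.head()`, and the names `X_raw` and `X_scaled` are chosen only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Annual Income (k$) and Spending Score (1-100) for the first five customers
X_raw = np.array([[15.0, 39.0],
                  [15.0, 81.0],
                  [16.0, 6.0],
                  [16.0, 77.0],
                  [17.0, 40.0]])

# Rescale every column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)

print(X_scaled.mean(axis=0))  # close to 0 for each column
print(X_scaled.std(axis=0))   # close to 1 for each column
```

After scaling, distances are measured in standard deviations, so a value of eps chosen for the raw data would need to be re-tuned.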

**Parameters :**

* ***eps*** : float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.

* ***min_samples*** : int, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

* ***metric*** : str, or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances for its metric parameter.

* ***metric_params*** : dict, default=None
Additional keyword arguments for the metric function.



* ***algorithm*** :{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors.

* ***leaf_size***: int, default=30
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

* ***p*** : float, default=None
The power of the Minkowski metric used to calculate the distance between points. The case where p = 1 is equivalent to the Manhattan distance and the case where p = 2 is equivalent to the Euclidean distance. If None, then p=2 (equivalent to the Euclidean distance).

* ***n_jobs*** : int, default=None
The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
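To make the role of ***p*** concrete, the Minkowski distance can be written out by hand. The function below is a small illustrative helper (its name and the two sample points are made up here); it is not what scikit-learn calls internally.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum of |difference|^p over all coordinates, then the p-th root
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

u, v = [0, 0], [3, 4]
manhattan = minkowski_distance(u, v, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(u, v, p=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```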

   **DBSCAN does not take the number of clusters as input, so no elbow method is needed to choose it; the clusters follow from eps and min_samples**

In [None]:
# Creating the DBSCAN model; the number of clusters is not specified
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=3, min_samples=4, metric='euclidean', metric_params=None, algorithm='kd_tree', leaf_size=50, p=None, n_jobs=None)
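Where an elbow does help with DBSCAN is in choosing eps: a common heuristic is to sort each point's distance to its k-th nearest neighbour and look for the bend in that curve. The sketch below assumes synthetic data (`X_demo`, generated here purely for illustration) rather than the mall dataset, and uses `NearestNeighbors` from scikit-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data (assumption): one dense blob plus a few scattered points
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
scattered = rng.uniform(low=-5.0, high=5.0, size=(10, 2))
X_demo = np.vstack([dense, scattered])

min_samples = 4
# Querying the training data itself returns each point as its own first
# neighbour (distance 0), matching min_samples, which counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance to the farthest of the min_samples neighbours, sorted ascending
k_distances = np.sort(distances[:, -1])
print(k_distances[[0, 50, 100, -1]])
```

Plotting `k_distances` against its index gives the usual k-distance curve; a reasonable starting value for eps is the distance at which the curve bends sharply upward.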


In [None]:
# Fitting the model
model=dbscan.fit(X)

In [None]:
# Cluster label assigned to each sample (-1 marks noise)
labels = model.labels_

In [None]:
# Boolean mask marking which samples are core points
sample_cores = np.zeros_like(labels, dtype=bool)
sample_cores[dbscan.core_sample_indices_] = True

In [None]:
#Calculating the number of clusters
n_clusters=len(set(labels))- (1 if -1 in labels else 0)
n_clusters

1

In [None]:
n_noise_ = list(labels).count(-1)
n_noise_

8
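The same bookkeeping (labels, core mask, cluster count, noise count) is easier to follow on a tiny example. The six points below are made up for illustration and are not taken from the mall dataset; samples that belong to a cluster without being core points are border points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): a packed square, one point on its edge, one far away
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                  [2.0, 1.0], [10.0, 10.0]])

db = DBSCAN(eps=1.5, min_samples=4).fit(X_toy)
toy_labels = db.labels_

# Core points: enough neighbours inside eps
core_mask = np.zeros_like(toy_labels, dtype=bool)
core_mask[db.core_sample_indices_] = True
# Noise points: label -1
noise_mask = toy_labels == -1
# Border points: assigned to a cluster but not core themselves
border_mask = ~core_mask & ~noise_mask

n_toy_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)
print(toy_labels)                                  # [ 0  0  0  0  0 -1]
print(n_toy_clusters, int(noise_mask.sum()))       # 1 1
```

Here the point at (2, 1) has only three points within eps, so it is not a core point, but it lies within eps of a core point and is therefore attached to the cluster as a border point.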

In [None]:
from sklearn import metrics
print(metrics.silhouette_score(X,labels))

0.529657695058028


The silhouette score ranges from -1 to 1: the best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
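When `silhouette_score` is given DBSCAN labels directly, the noise label -1 is scored as if it were one more cluster. One option, sketched below on made-up toy data rather than the mall dataset, is to compute the score on the non-noise samples only; whether that is appropriate depends on what the score is meant to summarize.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Toy data (hypothetical): two packed squares and one isolated point
X_two = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
                  [5.0, 30.0]])

two_labels = DBSCAN(eps=1.5, min_samples=4).fit(X_two).labels_

# Score with the noise point treated as its own "cluster"
score_all = silhouette_score(X_two, two_labels)

# Score restricted to samples that were assigned to a real cluster
keep = two_labels != -1
score_without_noise = silhouette_score(X_two[keep], two_labels[keep])

print(score_all, score_without_noise)
```

Note that `silhouette_score` needs at least two distinct labels among the samples it is given, so the filtered version only works when DBSCAN found two or more clusters.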