### Density-based spatial clustering of applications with noise (DBSCAN)

DBSCAN differs from K-means in a few important ways:
* DBSCAN does not require the analyst to select the number of clusters a priori — the algorithm determines this based on the parameters it's given. 
* It excels at clustering non-spherical data. 
* It does not force every data point into a cluster. A data point that does not sit in a sufficiently dense region, as defined by the parameters described below, is classified as **"noise" and not included in any resulting cluster**.
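
The contrast above can be sketched on a small non-spherical dataset. This is an illustrative example only: the two-moons data, the `eps=0.3` and `min_samples=5` values, and the variable names are assumptions chosen for the demo, not settings taken from the churn analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic two-moons data (an assumption for illustration, not the churn data).
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# K-means needs the number of clusters up front and assigns every point to one.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from eps and min_samples;
# points it cannot place in a dense region are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

n_dbscan_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))
print("K-means clusters:", len(set(kmeans_labels)))
print("DBSCAN clusters:", n_dbscan_clusters, "| noise points:", n_noise)
```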
---------
DBSCAN examines each data point individually. A data point that neither has the minimum number of data points in its neighborhood nor falls within a core data point's neighborhood is labeled "noise." Two parameters control this:
* **Epsilon ("eps"):** The radius of the neighborhood around a data point. A larger epsilon means data points farther away are counted as part of a point's "neighborhood"; a smaller epsilon means only closer points are.
* **Minimum points:** The number of data points that must lie within a data point's "neighborhood" for it to be considered a "core" data point. (In scikit-learn, `min_samples` counts the point itself.)
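
To make the core/noise definitions concrete, here is a minimal pure-Python sketch that labels each point from its neighborhood. The function name `classify_points`, the sample coordinates, and the "border" label (a non-core point that falls within a core point's neighborhood) are illustrative assumptions. The sketch counts a point as a member of its own neighborhood, following scikit-learn's `min_samples` convention.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions."""
    # Neighborhood of a point: indices of all points within eps of it (itself included).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nbrs) >= min_pts for nbrs in neighborhoods]

    roles = []
    for i, nbrs in enumerate(neighborhoods):
        if is_core[i]:
            roles.append("core")
        elif any(is_core[j] for j in nbrs):
            # Not dense enough itself, but inside a core point's neighborhood.
            roles.append("border")
        else:
            roles.append("noise")
    return roles

# Hypothetical toy data: a tight group of four, one nearby point, one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.35, 0.1), (2.0, 2.0)]
roles = classify_points(points, eps=0.3, min_pts=4)
print(roles)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```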
---
**"HDBSCAN"** (the H stands for "hierarchical") is an evolved version of DBSCAN:
* It attempts to allow for clusters of differing variances and densities.
* It requires us to provide only one main parameter: **minimum cluster size**.
* Every data point starts as its own cluster and is iteratively merged with its nearest neighbors until all data points are clustered together. The minimum cluster size parameter then lets us discard clusters below this threshold.

### From a business standpoint, it is more intuitive to decide how large a cluster/segment must be before it is considered "actionable"

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from collections import Counter
from itertools import combinations 

from sklearn import metrics
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

In [2]:
df1           = pd.read_pickle("./ds1_trans.pkl")
df2           = pd.read_pickle("./ds2_trans.pkl")
df1_churn     = pd.read_pickle("./df1_churn.pkl")
df2_churn     = pd.read_pickle("./df2_churn.pkl")
df1_not_churn = pd.read_pickle("./df1_not_churn.pkl")
df2_not_churn = pd.read_pickle("./df2_not_churn.pkl")

In [3]:
db = DBSCAN(eps=0.3, min_samples=10).fit(df1)
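
Once a model is fitted, the cluster assignments live in `labels_`. The sketch below shows one way to count clusters and noise points. Because `df1` is not available here, it uses synthetic blobs as a stand-in, and it standardizes the features first on the assumption that they are on different scales. Both choices are illustrative.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in data: synthetic blobs used purely for illustration.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.4, random_state=42)

# DBSCAN is distance-based, so features are standardized before fitting.
X_scaled = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X_scaled)
labels = db.labels_  # one label per row; -1 marks noise

cluster_sizes = Counter(labels)
n_noise = cluster_sizes.pop(-1, 0)  # remove the noise bucket, keep its count
n_clusters = len(cluster_sizes)

print("clusters found:", n_clusters)
print("noise points:", n_noise)
print("points per cluster:", dict(cluster_sizes))
```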


In [1]:
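# Requires the third-party `hdbscan` package (e.g. `pip install hdbscan`).
import hdbscan

# NOTE: `unbal` (a 2-D array of points) and `cmap` (a matplotlib colormap)
# are not defined in this section; they must already exist in the notebook session.

# Fit HDBSCAN for min_cluster_size = 2..20 and plot each result in a 5x4 grid.
plt.figure(figsize=(17, 12))
for plot_number, clust_number in enumerate(range(2, 21), start=1):
    hdb = hdbscan.HDBSCAN(min_cluster_size=clust_number).fit(unbal)
    plt.subplot(5, 4, plot_number, title='Min. Cluster Size = {}'.format(clust_number))
    # Color each point by its cluster label; -1 marks noise.
    plt.scatter(unbal[:, 0], unbal[:, 1], c=hdb.labels_, cmap=cmap)

plt.tight_layout()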

ModuleNotFoundError: No module named 'hdbscan'