### HDBSCAN Clustering
In this exercise, we try a different clustering method: Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).  
HDBSCAN performs DBSCAN over varying epsilon values and integrates the results to find the clustering that is most stable across epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and makes it more robust to parameter selection.

#### Import Libraries

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn import metrics
from sklearn.cluster import HDBSCAN

#### Load the Dataset
The Iris dataset is one of the datasets bundled with Scikit-learn, so you do not need to download any file from an external website. The code below loads the Iris dataset.

In [None]:
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

#### Arrange Data into Features Matrix
HDBSCAN is an <b>unsupervised</b> learning algorithm. This means you only need a features matrix. The Iris dataset has four features. In this notebook, the features matrix uses only two of them because clusters are easier to visualize in two dimensions.

In [None]:
features = ['petal length (cm)','petal width (cm)']

# Create features matrix
x = df.loc[:, features].values

In [None]:
# The variable y below is for demonstration purposes in this notebook; HDBSCAN does not need it.
y = data.target

#### Standardize the Data
HDBSCAN is distance-based, so it is affected by scale, and you should scale the features before clustering. You can transform the data onto unit scale (mean = 0 and variance = 1) so that no single feature dominates the distance computation. Scikit-learn's `StandardScaler` standardizes the dataset’s features.

In [None]:
# Apply standardization to the features matrix x
x = StandardScaler().fit_transform(x)
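
As a quick sanity check on what standardization does, the sketch below compares `StandardScaler` with a manual z-score. The names `petal`, `petal_scaled`, and `manual` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Columns 2 and 3 of the Iris data are petal length and petal width.
petal = load_iris().data[:, 2:4]
petal_scaled = StandardScaler().fit_transform(petal)

# Manual z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows.
manual = (petal - petal.mean(axis=0)) / petal.std(axis=0)

print("Matches manual z-score:", np.allclose(petal_scaled, manual))
print("Column means:", petal_scaled.mean(axis=0).round(6))
print("Column standard deviations:", petal_scaled.std(axis=0).round(6))
```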

#### Plot Data to Estimate the Number of Clusters
If your data is two- or three-dimensional, it is a good idea to plot it before clustering. With luck, you will be able to see whether there are any natural-looking clusters.

In [None]:
# Plot 
pd.DataFrame(x, columns = features).plot.scatter('petal length (cm)','petal width (cm)' )

# Add labels
plt.xlabel('petal length (cm)');
plt.ylabel('petal width (cm)');

#### HDBSCAN

In [None]:
db = HDBSCAN().fit(x)
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)

#### Silhouette Coefficient Metric

In [None]:
print(f"Silhouette Coefficient: {metrics.silhouette_score(x, labels):.3f}")
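
Note that `silhouette_score` treats the noise label `-1` as if it were an ordinary cluster, and it raises an error when fewer than two distinct labels are present. One possible way to handle both cases is sketched below; `silhouette_without_noise` is a hypothetical helper written for this notebook, not a scikit-learn function, and the toy data is made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels):
    """Silhouette score over non-noise points, or None if fewer than 2 clusters remain."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    mask = labels != -1                  # drop points labelled as noise
    if len(set(labels[mask])) < 2:       # the score is undefined for a single cluster
        return None
    return silhouette_score(X[mask], labels[mask])

# Toy data: two tight groups plus one point marked as noise.
X_toy = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]], dtype=float)
toy_labels = np.array([0, 0, 1, 1, -1])

toy_score = silhouette_without_noise(X_toy, toy_labels)
print(f"Silhouette without noise: {toy_score:.3f}")
```

In this notebook the helper could be called as `silhouette_without_noise(x, labels)`. Whether to exclude noise is a judgment call: dropping it tends to raise the score, so report the number of noise points alongside it.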

#### Visually Evaluate the Clusters

In [None]:
x = pd.DataFrame(x, columns = features)

In [None]:
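# 'k' (black) is the last entry, so the noise label -1 maps to it.
# This assumes HDBSCAN finds at most three clusters.
colormap = np.array(['r', 'g', 'b', 'k'])
plt.scatter(x['petal length (cm)'], x['petal width (cm)'], c=colormap[labels])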

plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)');
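
Indexing a fixed color array with `labels` only works while the number of clusters is small, and the noise label `-1` silently picks up the last entry. A slightly more defensive option is sketched below; `colors_for_labels` is a hypothetical helper and the palette is an arbitrary choice.

```python
def colors_for_labels(labels, palette=("r", "g", "b", "c", "m", "y"), noise_color="k"):
    """Map cluster labels to matplotlib color codes, using noise_color for label -1."""
    colors = []
    for label in labels:
        if label == -1:
            colors.append(noise_color)
        else:
            # Cycle through the palette if there are more clusters than colors.
            colors.append(palette[label % len(palette)])
    return colors

print(colors_for_labels([0, 1, -1, 7]))
```

The returned list can be passed straight to the `c` argument of `plt.scatter` in place of `colormap[labels]`.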

#### Visually Evaluate the Clusters and Compare Species

In [None]:
plt.figure(figsize=(8,4))

plt.subplot(1, 2, 1)
plt.scatter(x['petal length (cm)'], x['petal width (cm)'], c=colormap[labels])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)');
plt.title('HDBSCAN')
 
plt.subplot(1, 2, 2)
plt.scatter(x['petal length (cm)'], x['petal width (cm)'], c=colormap[y], s=40)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)');
plt.title('Flower Species')

plt.tight_layout()

Does HDBSCAN perform better than DBSCAN?