DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an algorithm used for clustering.
It groups points in high-density regions into clusters while ignoring data points that do not fit in any cluster (called "noise"). We choose two hyperparameters for this algorithm:
- Epsilon or eps (ε): You can think of this as the radius that the algorithm searches around a data point when deciding which points to add to its cluster. It is the maximum distance between two samples for one to be considered as in the neighborhood of the other. Points that lie within eps of a sufficiently dense point are candidates for the same cluster.
- min_samples: The minimum number of points that must lie within eps of a point for a cluster to form around it. In scikit-learn, this count includes the point itself.
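To make the two hyperparameters concrete, here is a minimal sketch that counts each point's neighbors within eps and checks the count against min_samples. The toy coordinates and the variable names are assumptions made for illustration; points that pass the check are usually called "core" points.

```python
import numpy as np

# Hypothetical toy data: four tightly packed points and one far-away point.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [3.0, 3.0],
])

eps = 0.5          # neighborhood radius
min_samples = 4    # points required in the neighborhood (the point itself counts)

# Pairwise Euclidean distances between all points.
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Count how many points fall within eps of each point (including itself).
neighbor_counts = (distances <= eps).sum(axis=1)

# A point with at least min_samples points in its neighborhood is dense enough
# to seed or extend a cluster.
is_core = neighbor_counts >= min_samples

print(neighbor_counts.tolist())  # [4, 4, 4, 4, 1]
print(is_core.tolist())          # [True, True, True, True, False]
```

Raising eps or lowering min_samples makes more points qualify, which tends to merge clusters and reduce noise; the opposite changes tend to split clusters and label more points as noise.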
The algorithm works by:
1. Picking a random unvisited data point to start a cluster.
2. Looking at the points within eps distance of the point from step 1. If there are at least min_samples data points, they are assigned to that cluster.
3. Looking at each point in the cluster and, if that point also has at least min_samples neighbors, adding the points within eps distance of it to the cluster.
4. Repeating step 3 until no new neighbors are found within eps distance of the points in the cluster.
5. Repeating steps 1-4 for each additional cluster.
6. Marking points that do not meet the distance or min_samples criteria as "noise"; these are not included in any cluster.
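The steps above can be sketched in plain NumPy. This is an illustrative, unoptimized version and not scikit-learn's implementation: the function name `simple_dbscan` is hypothetical, points are visited in index order rather than at random, and the full distance matrix is held in memory.

```python
import numpy as np

def simple_dbscan(X, eps, min_samples):
    """Illustrative DBSCAN; returns one label per row of X, with -1 meaning noise."""
    n = len(X)
    # Pairwise distances, then the indices of every point's eps-neighborhood.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    labels = np.full(n, -1)            # everything starts unassigned (noise)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue                   # not dense enough to start a cluster
        # Steps 1-2: start a new cluster from this dense point.
        labels[i] = cluster_id
        queue = list(neighbors[i])
        # Steps 3-4: keep expanding until no new neighbors are reachable.
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id      # claim an unassigned point
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only dense points expand the cluster
        cluster_id += 1                # step 5: move on to the next cluster
    return labels                      # step 6: anything still -1 is noise

# Hypothetical toy data: two dense groups and one isolated point.
X = np.array([
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2],
    [10.0, 0.0],
])
labels = simple_dbscan(X, eps=0.5, min_samples=3)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A point that is first passed over as not dense enough can still be claimed later if it lies within eps of a dense point, which is why labels are only final once the outer loop finishes.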
One benefit of DBSCAN is that you do not need to choose the number of clusters ahead of time; the number of clusters follows from the data and the eps and min_samples settings. DBSCAN can also produce better results than centroid-based
clustering algorithms like KMeans for data where density explains the clusters better than distance to a center does, such as non-spherical shapes. The image below illustrates this:
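The same contrast can be checked numerically. The sketch below is an assumption-laden illustration, not part of the original analysis: it uses scikit-learn's synthetic `make_moons` data, illustrative values for eps and min_samples, and the adjusted Rand index (ARI) to compare each result with the known crescent membership.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic data (an assumption for illustration): two interleaved crescents.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

# KMeans must be told the number of clusters and assigns points to the nearest centroid.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows chains of densely packed points; eps and min_samples are illustrative.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true crescents.
kmeans_ari = adjusted_rand_score(true_labels, kmeans_labels)
dbscan_ari = adjusted_rand_score(true_labels, dbscan_labels)
print(f"KMeans ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI: {dbscan_ari:.2f}")
```

On data like this, DBSCAN should score noticeably higher because each crescent is one connected dense region, while KMeans cuts the plane with a straight boundary between two centroids.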

In [None]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load Dataset
wine = pd.read_csv('/content/drive/path_to_data/modified_wine.csv')
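# Copy the two columns so a 'cluster' column can be added later without a SettingWithCopyWarning
df = wine[['malic_acid', 'flavanoids']].copy()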
df.head()

In [1]:
# Scale data

# Instantiate Standard Scaler
scaler = StandardScaler()

# Fit & transform data.
scaled_df = scaler.fit_transform(df)
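Scaling matters for DBSCAN because eps is a single distance applied across all features, so a feature with a large numeric range would otherwise dominate the neighborhood search. A minimal sketch of what StandardScaler does, using hypothetical values rather than the wine data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (not the wine data).
raw = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
])

scaled = StandardScaler().fit_transform(raw)

# After scaling, each column has mean 0 and standard deviation 1,
# so one eps value treats both features comparably.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Note that `fit_transform` returns a NumPy array rather than a DataFrame, which is why the column names are no longer attached to `scaled_df` in the cell above.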


In [None]:
# Instantiate & fit clustering - this is done in one step for DBSCAN
dbs = DBSCAN(eps = 0.5, min_samples = 5).fit(scaled_df)

# Save the cluster labels to the dataframe
df['cluster'] = dbs.labels_


# Visualize the clusters
plt.scatter(df['malic_acid'], df['flavanoids'], c = df['cluster'])
plt.xlabel('Malic Acid')
plt.ylabel('Flavanoids')
plt.title('Clusters of Wine Varieties');
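It can also help to count the clusters and the noise points directly from the labels. scikit-learn's DBSCAN assigns the label -1 to noise, so it should be excluded when counting clusters. The data below is a hypothetical stand-in for the wine features:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two tight groups plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1], [3.05, 3.05],
    [10.0, 10.0],
])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# Noise is labeled -1, so drop it before counting distinct clusters.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))

print(f"clusters: {n_clusters}, noise points: {n_noise}")  # clusters: 2, noise points: 1
```

In the scatter plot above, any noise points share the label -1 and therefore appear in a single color of their own alongside the real clusters.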