# DBSCAN - Reverse pedagogy
Giuliano RICCARDI - Victor ROBIC

## 1. Intro
**DBSCAN** stands for "**Density-Based Spatial Clustering of Applications with Noise**". It is a clustering algorithm that identifies clusters from the density of the data points in space. This allows for more flexible clustering that can perform better than centroid-based algorithms such as K-Means when dealing with nested data. DBSCAN also handles outliers and noise well, since it only assigns a point to a cluster if the point lies within a dense region.

Like the K-Means algorithm, DBSCAN relies on a distance metric. Because of this, it is **essential** to work with **scaled data** so that all the features studied are treated as equally important. Unlike K-Means, DBSCAN does not require specifying the number of clusters beforehand: it **automatically detects** the number of clusters from its given parameters.
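
To illustrate why scaling matters, here is a minimal sketch assuming scikit-learn is available. The toy data, the feature scales, the `eps` value and the helper `count_clusters` are all assumptions made up for this illustration; they are not part of the lab below.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: two groups separated along feature 0 (small scale),
# while feature 1 has a much larger scale and carries no group structure.
group_a = np.column_stack([rng.normal(0.0, 0.1, 100), rng.normal(0.0, 1000.0, 100)])
group_b = np.column_stack([rng.normal(5.0, 0.1, 100), rng.normal(0.0, 1000.0, 100)])
X_raw = np.vstack([group_a, group_b])

# Standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_raw)


def count_clusters(labels):
    """Number of distinct cluster labels, ignoring the noise label -1."""
    return len(set(labels.tolist()) - {-1})


# Same parameters on both versions of the data (illustrative values)
labels_raw = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_raw)
labels_scaled = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

print("clusters without scaling:", count_clusters(labels_raw))
print("clusters with scaling:   ", count_clusters(labels_scaled))
```

Without scaling, distances are dominated by the large-scale feature, so the same `eps` no longer describes a meaningful neighbourhood.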

DBSCAN is not limited to two dimensions: it can still identify clusters and outliers when working with multiple features, which makes it an appealing solution for multidimensional problems.

<u>A little bit of history</u>: the DBSCAN algorithm was introduced in 1996 by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu.

**Clustering**: a technique in unsupervised machine learning that involves grouping a set of observations in such a way that observations in the same group (cluster) are more similar to each other than to those in other groups. The goal is to discover natural groupings within the studied data.

**Nested data**: data in which clusters wrap around one another.

Example:

![Nested Data](./img/3.png)
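
A small sketch of nested data, assuming scikit-learn is available: two concentric rings generated with `make_circles`, clustered with DBSCAN and with K-Means for comparison. The dataset parameters, `eps=0.15` and `min_samples=5` are illustrative assumptions, not values taken from this document.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: the inner ring is "nested" inside the outer one
X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.03, random_state=0)

# DBSCAN follows the dense rings; K-Means splits the plane around two centroids
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true rings,
# values near 0 mean agreement no better than chance
ari_dbscan = adjusted_rand_score(y_true, db_labels)
ari_kmeans = adjusted_rand_score(y_true, km_labels)

print("ARI DBSCAN: ", round(ari_dbscan, 3))
print("ARI K-Means:", round(ari_kmeans, 3))
```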



## 2. Theory

To better understand how the DBSCAN algorithm works, let us start with the definitions of the important notions linked to it.

**Epsilon**: the radius of the neighborhood around a point. It defines the maximum distance (usually Euclidean distance) at which a point is considered a neighbor of another one.

**MinPts**: the minimum number of points needed to form a dense region.

**Dense Region**: a region of radius Epsilon containing at least MinPts points.

**Sparse Region**: a region that contains points, but not enough of them to be considered a Dense Region.

**Core Point**: a data point that has at least MinPts points within a radius Epsilon around it.

**Border Point**: a data point that is not a Core Point but lies within a radius Epsilon of a Core Point.

**Noise Point**: a data point that is neither a Core Point nor a Border Point. Noise Points are considered outliers and do not belong to any cluster.

![Different Kind of Points](./img/4.png)
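
The three kinds of points can be recovered from a fitted scikit-learn `DBSCAN` model, as in the sketch below. The six toy points and the values `eps=0.25` and `min_pts=4` are assumptions chosen for illustration. Note that scikit-learn's `min_samples` counts the point itself among its neighbors.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a tight group, one point on its edge, one far away
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
    [0.3, 0.05],                                     # near the edge of the group
    [5.0, 5.0],                                      # isolated point
])

eps, min_pts = 0.25, 4
model = DBSCAN(eps=eps, min_samples=min_pts).fit(X)

# Core points are listed explicitly by scikit-learn
is_core = np.zeros(len(X), dtype=bool)
is_core[model.core_sample_indices_] = True

# Noise points carry the label -1
is_noise = model.labels_ == -1

# Border points belong to a cluster without being core points
is_border = ~is_core & ~is_noise

for point, core, border, noise in zip(X, is_core, is_border, is_noise):
    kind = "core" if core else "border" if border else "noise"
    print(point, kind)
```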

## 3. Algorithm Steps

1. Select an unprocessed point at random
2. Retrieve all points within an **Epsilon** distance of this point
3. If the point is a **Core Point**, a cluster is formed. If it isn't a Core Point, it is provisionally labeled as **Noise** (it can still become a Border Point later if it turns out to lie within Epsilon distance of a Core Point)
4. Iterate through the cluster, adding any point within Epsilon distance of a Core Point to the cluster
5. **Repeat** from step 1 until all points have been processed
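
The steps above can be sketched in plain NumPy as follows. This is a teaching sketch under stated assumptions, not a reference implementation: the function name `dbscan_sketch` is hypothetical, points are visited in index order rather than at random, a point counts as its own neighbor, and all pairwise distances are computed at once, which is only reasonable for small inputs.

```python
import numpy as np

NOISE, UNVISITED = -1, -2


def dbscan_sketch(X, eps, min_pts):
    """Label each row of X with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    # Pairwise Euclidean distances: O(n^2) memory, fine for small inputs only
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: neighbors within eps (each point is its own neighbor)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for i in range(n):  # Step 1, in index order instead of random order
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # Step 3: provisional, may become a border point
            continue
        labels[i] = cluster_id  # Step 3: a core point starts a new cluster
        frontier = list(neighbors[i])
        while frontier:  # Step 4: grow the cluster through its core points
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] != UNVISITED:
                continue  # already assigned, do not expand again
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # only core points extend
        cluster_id += 1  # Step 5: move on to the next unprocessed point
    return labels


# Usage on hypothetical toy data: a dense group, an edge point, an outlier
X_demo = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [0.3, 0.05],
    [5.0, 5.0],
])
demo_labels = dbscan_sketch(X_demo, eps=0.25, min_pts=4)
print(demo_labels)
```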

Video explanation by **StatQuest with Josh Starmer**:

![Gif Illustration for DBSCAN](./img/5.gif)

Full video at: https://www.youtube.com/watch?v=RDZUdRSDOok

Notes: 

Core Points **extend** clusters, since every point within Epsilon distance of a Core Point is assigned to that Core Point's cluster, whereas Border Points can only be **assigned** to clusters.

Cluster creation is done in **sequence**. This means that if a Border Point is within Epsilon distance of both a Core Point from cluster 1 and a Core Point from cluster 2, it will be assigned to the cluster that the algorithm iterated through **first**.
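
The order dependence can be observed with a small experiment. The sketch below assumes scikit-learn's `DBSCAN` visits points in input order, so reordering the rows stands in for changing which cluster is iterated through first. The coordinates, `eps=0.5` and `min_samples=4` are illustrative assumptions: the shared point lies within `eps` of one Core Point from each group but has too few neighbors to be a Core Point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups on a line, with one point halfway between them
group_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]])
shared = np.array([[0.75, 0.0]])
group_b = np.array([[1.2, 0.0], [1.3, 0.0], [1.4, 0.0], [1.5, 0.0]])


def cluster(points):
    """Run DBSCAN with the illustrative parameters used in this sketch."""
    return DBSCAN(eps=0.5, min_samples=4).fit_predict(points)


# Same points, two different row orders; the shared point is always row 4
forward = cluster(np.vstack([group_a, shared, group_b]))
backward = cluster(np.vstack([group_b, shared, group_a]))

print("group A first:", forward)
print("group B first:", backward)
```

In both runs the shared point is row 4, so comparing its label with the labels of rows 0 and 8 shows which group claimed it.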

## Lab

In [1]:
from PIL import Image
import numpy as np
image = Image.open('img/butterfly.jpeg')

In [2]:
height = int(image.height/2)
width = int(image.width/2)

In [3]:
image.resize((width,height)).save('img/butterfly_resized.jpeg')

In [4]:
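# Alias the module so it does not shadow the PIL `image` object created above
from matplotlib import image as mpimg
import cv2
# JPEG pixels are read as integers in [0, 255]; divide to get floats in [0, 1]
I = mpimg.imread('img/butterfly.jpeg')/255
np.isnan(I).any()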

False

In [5]:
#my_image=cv2.imread('img/bateau_resized.jpeg', cv2.IMREAD_GRAYSCALE)
#my_image.shape

In [6]:

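M, N = I.shape[:2]
# NGRD - normalized green-red difference (NDVI inspired)
# Pixels where green + red == 0 give 0/0 = NaN and raise a RuntimeWarning
f = (I[:,:,1]-I[:,:,0])/(I[:,:,1]+I[:,:,0])
# Rescale from [-1, 1] to [0, 1]
f = 0.5*(f+1)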


In [7]:
f=np.nan_to_num(f)
np.isnan(f).any()

False

In [None]:
from sklearn.cluster import DBSCAN

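# One feature row per pixel: [blue channel, NGRD index].
# Row-major flattening keeps the same pixel order as a nested loop over (i, j).
X = np.column_stack([I[:,:,2].ravel(), f.ravel()])

# eps=0.0001 is a very small radius for features scaled to [0, 1]
dbscan = DBSCAN(eps=0.0001, min_samples=5).fit(X)
labels = dbscan.labels_
# Shift labels so that noise (-1) maps to 0, then reshape to the image grid
L = np.reshape(labels+1,[M,N])/2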

In [None]:
len(np.unique(labels))
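
Note that `len(np.unique(labels))` counts the noise label `-1` as if it were a cluster whenever noise is present. A minimal sketch of a separate count is given below; the helper name `summarize_labels` and the example label array are hypothetical and only serve the illustration.

```python
import numpy as np


def summarize_labels(labels):
    """Return (number of clusters, number of noise points) for DBSCAN labels."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    # Distinct labels, with the noise label -1 excluded
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


# Hypothetical labels: three clusters (0, 1, 2) and two noise points
example_labels = np.array([0, 0, 1, -1, 1, -1, 2])
n_clusters, n_noise = summarize_labels(example_labels)
print(n_clusters, n_noise)
```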

In [None]:
import matplotlib.pyplot as plt

plt.subplot(2,3,4)
plt.imshow(L,cmap='rainbow_r') 
plt.xticks(ticks=[])
plt.yticks(ticks=[])
plt.title('DBSCAN');

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5).fit(X)
labels = kmeans.labels_
L = np.reshape(labels+1,[M,N])/2

In [None]:
import matplotlib.pyplot as plt

plt.subplot(2,3,4)
plt.imshow(L,cmap='rainbow_r') 
plt.xticks(ticks=[])
plt.yticks(ticks=[])
plt.title('K-Means');

## Sources
[DBSCAN, a density-based algorithm for discovering clusters](https://cdn.aaai.org/KDD/1996/KDD96-037.pdf?source=post_page---------------------------)

[Video : Clustering with DBSCAN, clearly explained](https://youtu.be/RDZUdRSDOok?si=CPiHvumMMTC0OkmW)