# Lesson 4: Mastering DBSCAN: From Basics to Implementation

## Introduction and Overview of DBSCAN

Greetings to aspiring data scientists! Today, we'll explore the **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)** algorithm. DBSCAN stands out among clustering methods for its resilience to outliers and because it does not need the number of clusters to be specified in advance. This lesson will demystify DBSCAN through a from-scratch implementation in Python.

At its core, DBSCAN operates on the concepts of density and noise. It identifies clusters as regions of high density separated by regions of lower density, and it labels points in low-density regions as noise, which makes it robust to outliers. Two parameters drive the algorithm: **Epsilon (Eps)**, the radius of the neighborhood around a point, and **Minimum Points (MinPts)**, the number of neighbors a point needs within that radius to count as dense. Together they sort every point into one of three categories: a 'core' point has at least MinPts neighbors within Eps, a 'border' point is not core but lies within Eps of a core point, and an 'outlier' is neither.
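To make the three categories concrete, here is a minimal sketch on five one-dimensional points. The function name `classify_points`, the sample values, and the choice to leave a point out of its own neighbor count are assumptions made for illustration; they do not come from any library.

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each 1-D point as 'core', 'border', or 'noise' (illustrative helper)."""
    points = np.asarray(points, dtype=float)
    # Pairwise absolute distances via broadcasting: dist[i, j] = |points[i] - points[j]|
    dist = np.abs(points[:, None] - points[None, :])
    # Neighbor mask within eps, excluding each point itself (assumption: self is not counted)
    within = (dist <= eps) & ~np.eye(len(points), dtype=bool)
    is_core = within.sum(axis=1) >= min_pts
    # A border point is not core but has at least one core point in its neighborhood
    near_core = (within & is_core[None, :]).any(axis=1)
    kinds = []
    for core, near in zip(is_core, near_core):
        if core:
            kinds.append("core")
        elif near:
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

kinds = classify_points([0.0, 1.0, 2.0, 3.5, 10.0], eps=1.5, min_pts=2)
print(kinds)  # ['border', 'core', 'core', 'border', 'noise']
```

The points at 1.0 and 2.0 each have two neighbors within 1.5, so they are core. The points at 0.0 and 3.5 have only one neighbor each, but that neighbor is core, which makes them border points. The point at 10.0 has no neighbors at all.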

With a foundational understanding, let's roll up our sleeves and implement DBSCAN from scratch.

## Creating a Toy Dataset

We'll create a simple toy dataset using numpy arrays for the first hands-on task. This dataset represents a collection of points on a map that we'll be clustering.

```python
import numpy as np

data_points = np.array([
    [1.2, 1.9], [2.1, 2], [2, 3.5], [3.3, 3.9], [3.2, 5.1],
    [8.5, 7.9], [8.1, 7.8], [9.5, 6.5], [9.5, 7.2], [7.7, 8.6],
    [6.0, 6.0]
])
```

## Distance Function

Next, we'll write a function that calculates the **Euclidean distance** between two data points. The function uses numpy's `linalg.norm`, which returns the length of the difference vector, that is, the straight-line distance between the two points.

```python
def euclidean_distance(a, b):
    return np.linalg.norm(a - b, axis=-1)
```

This function evaluates the Euclidean distance \( d(a, b) \) between points \( a \) and \( b \) using the formula:

\[
d(a, b) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}
\]

Where \( a = (a_1, a_2, \dots, a_n) \) and \( b = (b_1, b_2, \dots, b_n) \).
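As a quick worked check of the formula, the sketch below computes the distance between the first two points of the toy dataset twice: once term by term and once with `np.linalg.norm`. The variable names `by_formula` and `by_norm` are chosen here for illustration.

```python
import numpy as np

a = np.array([1.2, 1.9])
b = np.array([2.1, 2.0])

# Apply the formula term by term: sqrt((1.2 - 2.1)^2 + (1.9 - 2.0)^2) = sqrt(0.81 + 0.01)
by_formula = np.sqrt(np.sum((a - b) ** 2))
# The same quantity through numpy's norm of the difference vector
by_norm = np.linalg.norm(a - b)

print(round(float(by_formula), 4), round(float(by_norm), 4))  # both round to 0.9055
```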

## Setting Initial Point Labels

Armed with a dataset and the Euclidean distance function, we are prepared to implement DBSCAN.

We will use the following labels:
- **0** marks noise (outlier) points that do not belong to any cluster.
- **1** marks the points in the first identified cluster.
- **2** marks the points in the second identified cluster.

Our function starts by labeling every point as an outlier (label 0). It then checks whether each point has at least **MinPts** other points within a radius of **Eps** (the parameter is named `MinPt` in the code). A point that satisfies this condition is a core point; all others are non-core for now. The code block below demonstrates these steps.

```python
def dbscan(data, Eps, MinPt):
    point_label = [0] * len(data)
    # Neighbor lists: point_count[i] holds the indices of all points within Eps of point i
    point_count = []
    core = []
    noncore = []

    # For each point i, collect every other point j that lies within the Eps radius of i
    for i in range(len(data)):
        point_count.append([])
        for j in range(len(data)):
            if euclidean_distance(data[i], data[j]) <= Eps and i != j:
                point_count[i].append(j)

        # A point with at least MinPt neighbors (excluding itself) is a core point; otherwise it is non-core
        if len(point_count[i]) >= MinPt:
            core.append(i)
        else:
            noncore.append(i)
```
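The nested loops above compute each pairwise distance one at a time. As a sketch of an alternative, the same neighbor lists can be built from a full distance matrix using numpy broadcasting. The helper name `neighbor_lists` is hypothetical, and the toy dataset is repeated so the block runs on its own.

```python
import numpy as np

data_points = np.array([
    [1.2, 1.9], [2.1, 2], [2, 3.5], [3.3, 3.9], [3.2, 5.1],
    [8.5, 7.9], [8.1, 7.8], [9.5, 6.5], [9.5, 7.2], [7.7, 8.6],
    [6.0, 6.0]
])

def neighbor_lists(data, eps):
    """Return, for each point, the indices of the other points within eps (illustrative)."""
    # diff[i, j] is the difference vector between points i and j; its norm is their distance
    diff = data[:, None, :] - data[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    within = dist <= eps
    np.fill_diagonal(within, False)  # a point is not its own neighbor
    return [np.flatnonzero(row).tolist() for row in within]

neighbors = neighbor_lists(data_points, eps=2)
for i, nbrs in enumerate(neighbors):
    print(i, nbrs)
```

This trades memory for speed: the distance matrix holds one entry per pair of points, which is fine for a toy dataset but grows quadratically with the number of points.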

## Mapping Points to Clusters

With every point classified as core or non-core, we can now build the clusters: each core point that has not been labeled yet starts a new cluster, and the cluster's ID is passed on to all points reachable from it.

```python
    ...
    ID = 1
    for point in core:
        # If the point has not been assigned to a cluster yet
        if point_label[point] == 0:
            point_label[point] = ID
            # Queue of core neighbors whose own neighborhoods still need to be expanded
            queue = []
            for x in point_count[point]:
                if point_label[x] == 0:
                    point_label[x] = ID
                    # If neighbor point is also a core point, add it to the queue 
                    if x in core:
                        queue.append(x)
            
            # Check points from the queue
            while queue:
                neighbours = point_count[queue.pop(0)]
                for y in neighbours:
                    if point_label[y] == 0:
                        point_label[y] = ID
                        if y in core:
                            queue.append(y)
            ID += 1  

    return point_label
```

This code iterates over the core points. Any core point that still has label 0 starts a new cluster, and each of its unlabeled neighbors receives the same cluster ID. Neighbors that are themselves core points are placed in a queue so that their neighborhoods are expanded in turn; non-core neighbors join the cluster as border points but do not extend it further. Once the queue is empty, the cluster is complete, and the loop moves on to the next unlabeled core point with a new ID. The function returns one label per point, with 0 marking noise.
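The expansion step can also be studied in isolation. The sketch below runs it on a small hand-written neighbor graph and uses `collections.deque`, which removes items from the front in constant time, in place of `list.pop(0)`. The function `expand_cluster`, the example graph, and the precomputed `core` set are assumptions made for illustration.

```python
from collections import deque

def expand_cluster(seed, cluster_id, neighbors, core, labels):
    """Spread cluster_id from a core seed to every point reachable through core points."""
    labels[seed] = cluster_id
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for n in neighbors[current]:
            if labels[n] == 0:          # still unlabeled
                labels[n] = cluster_id
                if n in core:           # only core points keep the expansion going
                    queue.append(n)
    return labels

# Hypothetical neighbor graph: 0 - 1 - 2 - 3 form a chain, 4 is isolated
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
core = {1, 2}  # the points with at least two neighbors
labels = expand_cluster(seed=1, cluster_id=1, neighbors=neighbors, core=core, labels=[0] * 5)
print(labels)  # [1, 1, 1, 1, 0]
```

Points 0 and 3 receive the cluster ID as border points, while the isolated point 4 keeps label 0.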

## Visualization and Results Interpretation

With the DBSCAN function ready, let's test it with our toy dataset and visualize the result using `matplotlib`.

```python
import matplotlib.pyplot as plt

# Run DBSCAN with Eps = 2 and MinPt = 2
labels = dbscan(data_points, 2, 2)

for i in range(len(labels)):
    if labels[i] == 1:
        plt.scatter(data_points[i][0], data_points[i][1], s=100, c='r')
    elif labels[i] == 2:
        plt.scatter(data_points[i][0], data_points[i][1], s=100, c='g')
    else:
        plt.scatter(data_points[i][0], data_points[i][1], s=100, c='b')

plt.show()
```

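In the resulting plot, red and green dots mark the two clusters, while blue marks points labeled as noise. With these settings, the only noise point is the isolated one at (6.0, 6.0).

Each color corresponds to one label returned by `dbscan`, so the picture depends directly on **Eps** and **MinPts**: a smaller Eps or a larger MinPts leaves more points as noise, while a larger Eps lets nearby clusters merge. The strength of DBSCAN lies in grouping points by density, which the plot shows as two dense groups and one isolated point.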
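To see the effect of **Eps** in numbers, the sketch below reruns the clustering for three radii and counts clusters and noise points. `dbscan_compact` is a condensed, hypothetical restatement of the procedure in this lesson, written so that the block runs on its own, and the three Eps values are arbitrary choices for illustration.

```python
import numpy as np
from collections import deque

data_points = np.array([
    [1.2, 1.9], [2.1, 2], [2, 3.5], [3.3, 3.9], [3.2, 5.1],
    [8.5, 7.9], [8.1, 7.8], [9.5, 6.5], [9.5, 7.2], [7.7, 8.6],
    [6.0, 6.0]
])

def dbscan_compact(data, eps, min_pts):
    """Condensed restatement of the lesson's procedure; label 0 means noise."""
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
    within = dist <= eps
    np.fill_diagonal(within, False)  # exclude each point from its own neighborhood
    neighbors = [np.flatnonzero(row) for row in within]
    is_core = within.sum(axis=1) >= min_pts
    labels = [0] * len(data)
    cluster_id = 0
    for seed in range(len(data)):
        # Only unlabeled core points may start a new cluster
        if not is_core[seed] or labels[seed] != 0:
            continue
        cluster_id += 1
        labels[seed] = cluster_id
        queue = deque([seed])
        while queue:
            for n in neighbors[queue.popleft()]:
                if labels[n] == 0:
                    labels[n] = cluster_id
                    if is_core[n]:  # border points join but do not expand the cluster
                        queue.append(n)
    return labels

for eps in (0.1, 2, 20):
    labels = dbscan_compact(data_points, eps, min_pts=2)
    n_clusters = len(set(labels) - {0})
    n_noise = labels.count(0)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

On the toy dataset this reports no clusters at `eps=0.1` (every point is noise), two clusters and one noise point at `eps=2`, and a single cluster with no noise at `eps=20`.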

## Lesson Summary and Practice

Well done! You've braved the challenging world of DBSCAN, mastering its theory and Python-based implementation. What's next? A set of practice exercises is designed to reinforce your newly acquired knowledge. Remember, **practice makes perfect**! So, don your coding hat and get ready for a fascinating ride into Python practice!

