<center><h1> DBSCAN, K-Means Clustering and KNN Classifier </h1></center>

<br><br>
This notebook implements K-Means clustering, DBSCAN clustering and a KNN classifier from scratch, and walks through the logic behind each algorithm.

## Importing all the libraries

In [None]:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.model_selection import train_test_split
from scipy.spatial import distance
from sklearn.metrics import accuracy_score

## Creating dataset

In [None]:
X1, y1 = make_blobs(n_samples=300, centers=4,random_state=0, cluster_std=0.60)
plt.scatter(X1[:, 0], X1[:, 1], s=50, alpha=.8)
plt.title("Dataset 1")
plt.show()

In [None]:
X2, y2 = make_circles(n_samples=300, factor=.5, noise=.05)
print("First five rows of X2:\n", X2[:5], "\nFirst five labels of y2:\n", y2[:5])
# cmap is dropped: it has no effect without per-point colour values (c=...)
plt.scatter(X2[:, 0], X2[:, 1], s=50, alpha=.5)
plt.title("Dataset 2")
plt.show()


<center><h2> DBSCAN </h2></center>

<br><br><br>

`Density-Based Spatial Clustering of Applications with Noise` (DBSCAN) forms clusters from points that are closely packed together. It is based on the assumption that `clusters are dense regions separated by regions of lower density`. 

It is useful for finding `arbitrarily shaped clusters` and it `separates outliers` too.

Parameters:

1. **ε** (Epsilon) - radius of neighbourhood regions
2. **minPts** - minimum number of points required to form that dense region

Each point is either a core point, a border point or an outlier.

1. **Core point** : If you draw a circle of radius ε around a point and the circle contains at least minPts points (counting the point itself), then the point is a core point.

2. **Border point** : If the circle of radius ε around a point contains fewer than minPts points, but the point lies within ε of a core point, then it is a border point.

3. **Outlier** : If the circle of radius ε around a point contains fewer than minPts points and the point is not within ε of any core point, then it is an outlier. 
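The three definitions can be checked directly on a tiny dataset. The sketch below is only an illustration: the helper name `classify_points` and the five sample points are made up for this example, and it assumes the neighbourhood count includes the point itself.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border' or 'outlier' (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within = dists <= eps                      # neighbourhood mask, includes the point itself
    is_core = within.sum(axis=1) >= min_pts    # enough neighbours inside the circle
    # a non-core point is a border point if some core point lies within eps of it
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "outlier"))

# a tight group of three, one point on its edge, and one far away
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.3, 0.0], [5.0, 5.0]])
kinds = classify_points(pts, eps=0.25, min_pts=3)
print(list(kinds))
```

The fourth point has only two points in its circle, so it is not a core point, but it sits within ε of the core point at (0.1, 0), which makes it a border point.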

The representation is shown in the following figure. `Core points that are reachable from one another, together with their border points, are merged to form a cluster.` 

*For more information, check this [youtube tutorial](https://www.youtube.com/watch?v=6jl9KkmgDIw)* 

The implementation below follows the pseudocode given in [Wikipedia: DBSCAN](https://en.wikipedia.org/wiki/DBSCAN):

    DBSCAN(DB, distFunc, eps, minPts) {
        C := 0                                                  /* Cluster counter */
        for each point P in database DB {
            if label(P) ≠ undefined then continue               /* Previously processed in inner loop */
            Neighbors N := RangeQuery(DB, distFunc, P, eps)     /* Find neighbors */
            if |N| < minPts then {                              /* Density check */
                label(P) := Noise                               /* Label as Noise */
                continue
            }
            C := C + 1                                          /* next cluster label */
            label(P) := C                                       /* Label initial point */
            SeedSet S := N \ {P}                                /* Neighbors to expand */
            for each point Q in S {                             /* Process every seed point Q */
                if label(Q) = Noise then label(Q) := C          /* Change Noise to border point */
                if label(Q) ≠ undefined then continue           /* Previously processed (e.g., border point) */
                label(Q) := C                                   /* Label neighbor */
                Neighbors N := RangeQuery(DB, distFunc, Q, eps) /* Find neighbors */
                if |N| ≥ minPts then {                          /* Density check (if Q is a core point) */
                    S := S ∪ N                                  /* Add new neighbors to seed set */
                }
            }
        }
    }

    RangeQuery(DB, distFunc, Q, eps) {
        Neighbors N := empty list
        for each point P in database DB {                      /* Scan all points in the database */
            if distFunc(Q, P) ≤ eps then {                     /* Compute distance and check epsilon */
                N := N ∪ {P}                                   /* Add to result */
            }
        }
        return N
    }
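As a rough illustration of `RangeQuery`, the linear scan can be written as a single vectorised NumPy expression. The function name `range_query` and the sample points are assumptions made for this sketch. It returns row indices rather than the points themselves.

```python
import numpy as np

def range_query(X, q, eps):
    """Indices of the rows of X that lie within distance eps of q (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # distance from q to every row of X in one pass
    dists = np.linalg.norm(X - np.asarray(q, dtype=float), axis=1)
    return np.flatnonzero(dists <= eps)

demo_pts = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
near = range_query(demo_pts, demo_pts[0], eps=0.2)
print(near)
```

Because the query point is itself a row of the data, it appears in its own result, which matches the pseudocode where `N` includes `P`.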

The algorithm is simple, and the general idea is:


    For each point in the dataset (X):
        1. If the point is already visited, skip it; otherwise find its neighbours
        2. If neighbour count < minPts, label the point as noise and continue
        3. Else, start a new cluster and add all the neighbours to a seed list
        4. For each point in the seed list (while the list is not empty),
            4.1. If the point is labelled noise, make it a border point of the current cluster
            4.2. If it is not visited, mark it visited and add it to the current cluster
            4.3. Find its neighbours
            4.4. If the neighbour count >= minPts (it is a core point), add them to the seed list to explore further 
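For comparison, here is a minimal sketch of the same steps that keeps one label per point instead of lists of points. It is not the notebook's class: the function name `dbscan_labels`, the sentinel values and the demo points are assumptions chosen for illustration.

```python
import math

UNDEFINED, NOISE = None, -1   # hypothetical sentinels: "not yet labelled" and "noise"

def dbscan_labels(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def region(i):
        # indices of all points within eps of point i (i itself included)
        return [j for j, other in enumerate(points) if math.dist(points[i], other) <= eps]

    labels = [UNDEFINED] * len(points)
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not UNDEFINED:
            continue                          # already processed
        neighbours = region(p)
        if len(neighbours) < min_pts:
            labels[p] = NOISE                 # may become a border point later
            continue
        cluster += 1                          # p is a core point: new cluster
        labels[p] = cluster
        seeds = [q for q in neighbours if q != p]
        while seeds:
            q = seeds.pop(0)
            if labels[q] == NOISE:
                labels[q] = cluster           # noise becomes a border point
            if labels[q] is not UNDEFINED:
                continue                      # already assigned
            labels[q] = cluster
            q_neighbours = region(q)
            if len(q_neighbours) >= min_pts:
                seeds.extend(q_neighbours)    # q is a core point, keep expanding
    return labels

demo = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 0)]
demo_labels = dbscan_labels(demo, eps=0.2, min_pts=3)
print(demo_labels)
```

A label array like this is convenient when the result has to be compared with other tools, since each position lines up with the corresponding input row.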



In [None]:
class DBScan:
    def __init__(self, eps = 1, minPts = 30):
        self.eps = eps
        self.minPts = minPts
        
    def train(self, X, y):
        self.X = X
        self.y = y

        # list for storing visited points, noise and core points
        visited = []
        noise = []
        all_core = []
        
        # all points in X
        for p in X:
            # storing points inside tuple like: (x,y) for easy understandability
            if ((p[0],p[1])) not in visited: # if it is not visited, it will not have any label
                visited.append((p[0],p[1]))
                neighbours = self.getNeighbours(X,p,visited) # find neighbours

                # if p has fewer than minPts neighbours, label p itself as noise
                # (it may later be claimed as a border point by a cluster)
                if len(neighbours) < self.minPts: 
                    noise.append((p[0],p[1]))
                    continue
                else:
                    # p is a core point: start a new cluster (stored in new_core)
                    new_core = []
                    new_core.append((p[0],p[1]))
                    # p is already visited, so drop it from its own neighbour list
                    neighbours.remove((p[0],p[1]))
                    
                    # seedset holds the neighbours that still have to be explored
                    seedset = neighbours                    
                    while len(seedset) > 0:
                        q = seedset.pop(0) # treat the list as a queue: take the first element, q

                        # if q was labelled noise earlier, it becomes a border point of this cluster
                        if (q[0],q[1]) in noise:
                            noise.remove((q[0],q[1]))
                            new_core.append((q[0],q[1]))
                        
                        # if q has not been visited yet, add it to this cluster
                        if (q[0],q[1]) not in visited:
                            visited.append((q[0],q[1]))
                            new_core.append((q[0],q[1]))
                            neighbours = self.getNeighbours(X,q,visited) # find neighbours of q
                            # if q is a core point, explore its neighbours too
                            if len(neighbours) >= self.minPts:
                                seedset = seedset + neighbours

                    # store the finished cluster
                    all_core.append(new_core)
        self.plot(all_core, noise)
        return all_core, noise
            
    def getNeighbours(self, X, p, visited):
        # all points within eps of p; p is included once, since its distance to itself is 0
        neighbours = []
        for q in X:
            if distance.euclidean(p, q) <= self.eps:
                neighbours.append((q[0],q[1]))
        return neighbours
   
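    def plot(self, all_core, noise):
        # one scatter series per cluster
        for i,core in enumerate(all_core):
            X_p = [x_p for x_p,y_p in core]
            Y_p = [y_p for x_p,y_p in core]
            plt.scatter(X_p,Y_p, label=i)
        # noise points, if any, are drawn as black crosses
        if len(noise) > 0:
            plt.scatter([x_p for x_p,y_p in noise], [y_p for x_p,y_p in noise], c="black", marker="x", label="noise")
        plt.title("DBScan")
        plt.legend()
        plt.show()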

In [None]:
dbscan = DBScan(eps = 0.2, minPts = 5)
all_core, noises = dbscan.train(X2,y2)
print(len(all_core))

In [None]:
dbscan = DBScan(eps = 0.5, minPts = 10)
all_core, noises = dbscan.train(X1,y1)
print(len(all_core))

<center><h2> KMeans </h2></center>
<br><br>

![KMeans Algo](https://miro.medium.com/max/700/1*7LuOOmBXcnbxm7AeD1xAQQ.png)
Source: https://heartbeat.fritz.ai/understanding-the-mathematics-behind-k-means-clustering-40e1d55e2f4c 


Based on the algorithm above, let's build our own K-Means implementation.
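K-Means alternates between two steps: assigning every point to its nearest center, and moving each center to the mean of its assigned points. A minimal vectorised sketch of one such iteration follows. The name `kmeans_step` and the demo data are assumptions for illustration, as is the choice to let an empty cluster keep its old center.

```python
import numpy as np

def kmeans_step(X, centers):
    """One K-Means iteration: assign points to the nearest center, then recompute centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # assignment step: distance from every point to every center, shape (n, k)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: mean of the points in each cluster
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = X[labels == j]
        if len(members) > 0:          # assumption: an empty cluster keeps its old center
            new_centers[j] = members.mean(axis=0)
    return labels, new_centers

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
step_labels, step_centers = kmeans_step(X_demo, centers=[[1.0, 1.0], [8.0, 1.0]])
print(step_labels, step_centers)
```

Repeating this step until the centers stop moving gives the full algorithm.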

Also, according to the algorithm, the loss is:

**Loss**: J(C<sup>1</sup>, C<sup>2</sup>, ..., C<sup>m</sup>, μ<sub>1</sub>, μ<sub>2</sub>, ..., μ<sub>k</sub>) = (1/m) Σ<sub>i=1..m</sub> ||x<sup>i</sup> - μ<sub>C<sup>i</sup></sub>||<sup>2</sup>


C<sup>1</sup>, C<sup>2</sup>, ..., C<sup>m</sup> are the cluster indices assigned to the data points x<sup>1</sup>, ..., x<sup>m</sup>, where m is the number of data points <br>
μ<sub>1</sub>, μ<sub>2</sub>, ..., μ<sub>k</sub> are the cluster centers
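The loss can be evaluated in a few lines once each point has a cluster index. This sketch assumes the usual squared-distance form of the K-Means objective. The function name `kmeans_loss` and the numbers are made up for the example.

```python
import numpy as np

def kmeans_loss(X, centers, assignments):
    """Mean squared distance between each point and the center of its assigned cluster."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # centers[assignments] picks, for every point, the center it is assigned to
    diffs = X - centers[np.asarray(assignments)]
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

X_small = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
loss = kmeans_loss(X_small, centers=[[1.0, 0.0], [10.0, 0.0]], assignments=[0, 0, 1])
print(loss)
```

Here the squared distances are 1, 1 and 0, so the loss is 2/3. A lower value means the points sit closer to their centers.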

In [None]:
class KMeans:
    def __init__(self, k=4, epoches=10):
        self.k = k
        self.epoches = epoches
        
    def train(self,X,y):
            self.X = X
            self.y = y
            n = X.shape[0] # training data instances
            m = X.shape[1] # features count
            
            # generating random centroids using mean and std deviation
            mean = np.mean(X, axis = 0) 
            std = np.std(X, axis = 0)
            new_centers = np.random.randn(self.k,m)*std + mean            
            prev_centers = np.zeros((self.k,m)) # (0,0) k times
            iter_count = 0
            cluster_info = {}
            
            # iterate until the centers stop moving or the epoch budget is used up
            while not self.centers_stable(prev_centers, new_centers, 0.005):
                
                prev_centers = new_centers
                
                # dictionary storing cluster index as key and the points assigned to it as value
                cluster_info = self.closest_cluster(X, new_centers)
               
                # re-calculating centers 
                new_centers = self.find_new_center(cluster_info)

                # stop once the epochs are exhausted, even if the centers are still moving
                iter_count += 1
                if iter_count >= self.epoches:
                    break
            
            return new_centers,cluster_info

    # centers are stable when the total distance they moved is below min_dist
    def centers_stable(self,prev_centers, new_centers, min_dist):
        # arrays of different shape cannot be compared, so treat them as not stable
        if prev_centers.shape != new_centers.shape:
            return False
        dist = 0
        for i in range(new_centers.shape[0]):
            dist += distance.euclidean(prev_centers[i], new_centers[i])
        return dist < min_dist

    # dictionary storing cluster index as key and the points assigned to it as value
    def closest_cluster(self, X, new_centers):    
        cluster_info = {}
        # initializing the dictionary
        for i in range(self.k):
            cluster_info[i] = []
        
        for i in range(X.shape[0]):
            short_dist = float("inf") # shortest distance found so far
            closest_cluster = 0
            for j in range(len(new_centers)):
                dist = distance.euclidean(X[i,:], new_centers[j])
                if dist < short_dist: # if dist < shortest dist found
                    short_dist = dist
                    closest_cluster = j # new closest cluster index
            cluster_info[closest_cluster].append(list(X[i,:]))
        return cluster_info
    
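    # re-calculating new centers
    def find_new_center(self,cluster_info):
        new_centers = []
        # key => cluster index 
        for key in cluster_info.keys():
            ls = cluster_info.get(key)
            if len(ls) == 0:
                # empty cluster: re-seed its center with a random data point so that
                # the number of centers stays k and cluster indices stay aligned
                new_centers.append(self.X[np.random.randint(len(self.X))])
                continue
            # using mean to find new centers
            mean = np.mean(ls, axis = 0)
            new_centers.append(mean)
        return np.array(new_centers)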
    
    # plot is entirely based on the dictionary, and not X,y
    def plot(self,X,y,new_centers,cluster_info):
        for i,key in enumerate(cluster_info.keys()):
            data = cluster_info.get(key)
            if len(data)>0:
                data = np.array(data)                
                plt.scatter(data[:,0], data[:,1], alpha = 0.5, label = key)
                plt.scatter(new_centers[i,0], new_centers[i,1], marker='*', s=150, c="black")
                plt.title("KMeans")
        plt.legend()
        plt.show()

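**Note**: 

The plot is based on `cluster_info`, which stores each cluster index together with the points assigned to it, and not on the original dataset. Changing

`plt.scatter(data[:,0], data[:,1], alpha = 0.5, label = key)`

to

`plt.scatter(X[:,0], X[:,1], alpha = 0.5, label = key)`

would plot from the original dataset instead, but it draws every point of X once per cluster, so the clusters are only distinguishable by colour if a per-point label array is also passed as `c`.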
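One way to get per-point labels is to flatten the `cluster_info` dictionary into a points array and a matching label array. The helper name `cluster_info_to_arrays` and the demo dictionary are assumptions for this sketch. It relies only on the dictionary layout returned by `train`, which maps each cluster index to a list of points.

```python
import numpy as np

def cluster_info_to_arrays(cluster_info):
    """Flatten {cluster index: [points]} into a points array and a label array."""
    points, labels = [], []
    for key, members in cluster_info.items():
        points.extend(members)                # the points of this cluster
        labels.extend([key] * len(members))   # the same label for each of them
    return np.array(points), np.array(labels)

demo_info = {0: [[0.0, 0.0], [0.0, 1.0]], 1: [[5.0, 5.0]], 2: []}
flat_points, flat_labels = cluster_info_to_arrays(demo_info)
print(flat_points.shape, flat_labels)
```

A single call such as `plt.scatter(flat_points[:, 0], flat_points[:, 1], c=flat_labels)` would then colour every point by its cluster.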

In [None]:
kmeans = KMeans(k=4)
new_centers,cluster_info = kmeans.train(X1,y1)
kmeans.plot(X1,y1,new_centers, cluster_info)

In [None]:
kmeans = KMeans(k=2)
new_centers,cluster_info = kmeans.train(X2,y2)
kmeans.plot(X2,y2,new_centers, cluster_info)

## K Nearest Neighbours


KNN is a simple algorithm, and it is based on: 

    For every instance in X_test 
        1. Find the n nearest points in the training set, and store their labels
        2. Return the label that occurs most often among them
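The two steps fit in a short standalone function. This is only a sketch using the standard library: the name `knn_predict` and the toy data are assumptions, and ties are broken by whichever label `Counter` encounters first.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of query by majority vote among its k nearest training points."""
    # step 1: training indices sorted by distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # step 2: count the labels of the k closest points and return the most common one
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
toy_labels = [0, 0, 0, 1, 1]
pred_a = knn_predict(toy_points, toy_labels, (0.2, 0.2), k=3)
pred_b = knn_predict(toy_points, toy_labels, (5, 5.4), k=3)
print(pred_a, pred_b)
```

For the second query, two of the three nearest points carry label 1, so the vote returns 1 even though one neighbour belongs to the other class.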

In [None]:
# creating training and testing sets, which will be used for KNN
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=1)
plt.scatter(X_train[:, 0], X_train[:, 1], s=50, alpha=.8)
plt.title("Train dataset")
plt.show()

In [None]:
class KNN:
    def __init__(self, n_neighbour = 10):
        self.n_neighbour = n_neighbour
        
    def predict_class(self, X_train, y_train, X_test):
        prediction = []
        # for each instance in test set
        for x_test_item in X_test:
            pred = self.get_nieghbour(X_train, y_train, x_test_item)
            prediction.append(pred)
        return prediction

    # finding (n_neighbour) nearest neighbours
    def get_nieghbour(self, X_train, y_train, x_test_item):
        dist_list = []
        for i in range(len(X_train)):
            X_train_item = X_train[i]
            y_train_item = y_train[i]
            dist = distance.euclidean(X_train_item, x_test_item)
            dist_list.append((dist, y_train_item))

        # sorting the list of (distance, label) pairs by distance
        dist_list.sort(key=lambda dist_x: dist_x[0])
        
        # storing labels of the n_neighbour closest points after sorting
        label_list = [ dist[1] for dist in dist_list[:self.n_neighbour]] 
        
        # return the most frequent label
        return max(set(label_list), key=label_list.count)

In [None]:
knn = KNN(n_neighbour = 10)
pred = knn.predict_class(X_train, y_train, X_test)
accuracy_score(y_test, pred)

## References:

1. [Develop k-Nearest Neighbors in Python From Scratch, Machine Learning Mastery](https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)

2. [Find most frequent element in a list, Geeks for geeks](https://www.geeksforgeeks.org/python-find-most-frequent-element-in-a-list/)

3. [DBSCAN, Wikipedia](https://en.wikipedia.org/wiki/DBSCAN)
