## Unsupervised Learning Techniques

---

The vast majority of available data is unlabeled.

We have the input features X, but not the labels y.

Unsupervised learning tasks → clustering, anomaly detection, and density estimation

---

### k-means

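If the instance labels were given, we could locate each cluster’s centroid by computing the mean of the instances in that cluster. If the centroids were given, we could label each instance with its nearest centroid. In practice neither the centroids nor the labels are given, so k-means alternates between the two steps.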

Process

1. Place the centroids randomly (for example, randomly select k instances from the training set).
2. Assign each data point to the nearest centroid (Euclidean distance), creating k clusters.
3. Recalculate the mean of all the data points in each cluster and use that mean as the cluster's new centroid.
4. Repeat steps 2 and 3 until the centroids stop moving.
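
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for this sketch.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct instances as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption of this sketch: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11)]
centroids, labels = kmeans(points, k=2)
```

On this small dataset the two centroids settle on the means of the two visible groups, whichever instances are drawn as the starting centroids.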

Example: finding the mean of the clusters

Let's say we have these data points:

`Points: (1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11)`

and we want to make k = 2 clusters.

Assume the initial centroids are:

Centroid 1: (1, 2)
Centroid 2: (10, 11)

- Points (1, 2), (2, 3), (3, 4) are closer to Centroid 1 → cluster 1
- Points (8, 8), (9, 10), (10, 11) are closer to Centroid 2 → cluster 2

Recalculate Centroids:

New Centroid 1 = Mean((1, 2), (2, 3), (3, 4))
= ((1+2+3)/3, (2+3+4)/3)
= (2, 3)

New Centroid 2 = Mean((8, 8), (9, 10), (10, 11))
= ((8+9+10)/3, (8+10+11)/3)
= (9, 9.67)
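
As a quick check of the arithmetic, one assignment step and one update step on the same points can be reproduced in a few lines. The variable names are chosen for this sketch only.

```python
import math

points = [(1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11)]
centroids = [(1, 2), (10, 11)]  # the initial centroids from the example

# Assignment: index of the nearest centroid for every point.
labels = [min((0, 1), key=lambda j: math.dist(p, centroids[j])) for p in points]

# Update: coordinate-wise mean of the points assigned to each centroid.
new_centroids = []
for j in (0, 1):
    members = [p for p, lab in zip(points, labels) if lab == j]
    new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))

print(labels)  # [0, 0, 0, 1, 1, 1]
print([tuple(round(c, 2) for c in cen) for cen in new_centroids])
# [(2.0, 3.0), (9.0, 9.67)]
```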

Clustering types

Hard clustering 

Assign each instance directly to a single cluster.

Soft clustering

Give each instance a score per cluster.
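
A small sketch of the difference, assuming the score is simply the distance to each centroid (other scores, such as a similarity measure, would work the same way). The centroids and the query point `x` are made-up values.

```python
import math

centroids = [(2.0, 3.0), (9.0, 9.67)]
x = (5, 6)

# Soft clustering: one score per cluster (here, the distance to each centroid).
scores = [math.dist(x, c) for c in centroids]

# Hard clustering: keep only the index of the best-scoring (closest) cluster.
hard_label = min(range(len(centroids)), key=lambda j: scores[j])
```

The soft result keeps the information that `x` sits between the two clusters, while the hard result reduces it to a single index.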

Centroid initialization methods

Set the `init` hyperparameter to the approximate centroid locations (if you have an idea of where the centroids should be).

Run the algorithm multiple times with different random centroid initializations and keep the best solution.

Performance metric 

model’s inertia → sum of the squared distances between the instances and their closest centroids.
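
A minimal sketch of the inertia computation, using the six example points and their converged centroids. The helper name `inertia` is an assumption of this sketch.

```python
import math


def inertia(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


points = [(1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11)]
centroids = [(2, 3), (9, 29 / 3)]
value = inertia(points, centroids)  # 4 from the first cluster + 20/3 from the second
```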

---

### k-means++

Smarter initialization step that tends to select centroids that are distant from one another

1. Randomly select the first centroid, say $\mu_1$.
2.  Calculate the distance $D(x)$ from every point $x$ to $\mu_1$, then select the second centroid $\mu_2$ with a probability proportional to the squared distance $D(x)^2$, that is, $D(x)^2$ divided by the sum of the squared distances over all points.
    
    Let’s say the distances from the current centroid to the points are as follows:
    
    - $D(x_1)$ = 1
    - $D(x_2)$ = 2
    - $D(x_3)$ = 3
    - $D(x_4)$ = 4
    
    Squaring these distances, we get:
    
    - $D(x_1)^2$ = 1
    - $D(x_2)^2$ = 4
    - $D(x_3)^2$ = 9
    - $D(x_4)^2$ =16
    
    The sum of the squared distances is 1 + 4 + 9 + 16 = 30. The probabilities for each point being selected as the next centroid are:
    
    - $P(x_1)$ = 1/30 ≈ 0.033
    - $P(x_2)$ = 4/30 ≈ 0.133
    - $P(x_3)$ = 9/30 = 0.3
    - $P(x_4)$ = 16/30 ≈ 0.533
    
    So we do not always pick the farthest point, but the probability of picking a point increases with its distance from the current centroid.
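
The selection step can be sketched with the standard library. The point names `x1` to `x4` mirror the example above; the fixed seed is an assumption made only to keep the draw reproducible.

```python
import random

distances = {"x1": 1, "x2": 2, "x3": 3, "x4": 4}

# Selection weight of each point: its squared distance to the current centroid.
weights = {name: d ** 2 for name, d in distances.items()}
total = sum(weights.values())
probabilities = {name: w / total for name, w in weights.items()}

# Draw the next centroid; random.choices accepts unnormalized weights.
rng = random.Random(0)
next_centroid = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
```

Because the draw is random, `x4` is the most likely pick but not a guaranteed one.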
    

---

### Accelerated and mini-batch k-means

1. **Accelerated k-Means**

Accelerated k-Means algorithms aim to speed up the standard k-means clustering process, which can be computationally intensive due to the need to repeatedly compute distances between data points and cluster centroids.

1. Elkan's Algorithm
    
    Elkan's algorithm speeds up k-means by reducing the number of distance calculations needed during each iteration. 
    
    1. **Initialization**: Start with initial centroids, like in standard k-means.
    2. **Distance Bounds**: Maintain upper and lower bounds for the distances between each point and the centroids.
    3. **Update Centroids**: After assigning points to the nearest centroids, update the centroids.
    4. **Bounds Update**: Update the bounds based on the new centroids.
    5. **Pruning**: Use the bounds to skip distance calculations for points that are unlikely to change their cluster assignments.
2. k-d Tree and Ball Tree Methods
    1. k-d Tree → A binary tree where each node represents a splitting hyperplane dividing the space into two subspaces.
    2. Ball tree → A hierarchical data structure where data points are encapsulated in hyperspheres (balls).
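
Elkan's pruning relies on the triangle inequality: if the distance between two centroids is at least twice the distance from a point to the first of them, the second cannot be closer. The sketch below shows only that single test, not the full upper and lower bound bookkeeping. The function name, the `current` argument, and the sample centroids are assumptions of this sketch, and a real implementation would compute the centroid-to-centroid distances once per iteration rather than once per point.

```python
import math


def assign_with_pruning(x, centroids, current):
    """Return (nearest centroid index, number of point-to-centroid distances computed)."""
    best = current
    best_dist = math.dist(x, centroids[current])
    computed = 1
    for j, c in enumerate(centroids):
        if j == current:
            continue
        # Triangle inequality: if d(c_best, c_j) >= 2 * d(x, c_best),
        # then c_j cannot be closer to x than c_best, so skip it.
        if math.dist(centroids[best], c) >= 2 * best_dist:
            continue
        d = math.dist(x, c)
        computed += 1
        if d < best_dist:
            best, best_dist = j, d
    return best, computed


centroids = [(2, 3), (9, 9.67), (20, 20)]
result = assign_with_pruning((1, 2), centroids, current=0)
```

Here the point is already assigned to its nearest centroid, so both other centroids are ruled out without computing their distance to the point.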

2. **Mini-batch k-means**
    
    Mini-batch k-means is a variant of the k-means algorithm that reduces the computational cost by using small, random samples (mini-batches) of the dataset in each iteration rather than the entire dataset.
    
    Process
    
    1. **Initialization**: Start with initial centroids, just like in standard k-means.
    2. **Mini-Batch Selection**: Randomly select a mini-batch of data points from the dataset.
    3. **Cluster Assignment**: Assign each point in the mini-batch to the nearest centroid.
    4. **Centroid Update**: Update the centroids based on the mini-batch. The update rule is typically a weighted average to account for the small size of the mini-batch.
    5. **Repeat**: Iterate the mini-batch selection, cluster assignment, and centroid update steps until convergence or a maximum number of iterations is reached.
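
A minimal sketch of the mini-batch loop described above. It assumes one common form of the weighted update, a per-centroid running average whose step size is 1 divided by the number of points that centroid has absorbed, and it simply stops after a fixed number of iterations. The function name and parameters are hypothetical.

```python
import math
import random


def minibatch_kmeans(points, k, batch_size=3, n_iter=50, seed=0):
    """Mini-batch k-means sketch with a per-centroid running average."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each centroid has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, batch_size)
        # Assign every point in the batch using the centroids as they are now.
        nearest = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in batch
        ]
        for p, j in zip(batch, nearest):
            counts[j] += 1
            lr = 1.0 / counts[j]  # step size shrinks as the centroid sees more data
            centroids[j] = [(1 - lr) * c + lr * x for c, x in zip(centroids[j], p)]
    return [tuple(c) for c in centroids]


points = [(1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11)]
centroids = minibatch_kmeans(points, k=2)
```

Each update touches only `batch_size` points, which is where the saving over full k-means comes from. The price is that the centroids are noisy estimates rather than exact cluster means.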

Finding the optimal number of clusters

1. Inertia is not a good performance metric when trying to choose k because it keeps getting lower as we increase k.
    
    $\text{Inertia} = \sum_{i=1}^n \sum_{j=1}^k \mathbf{1}(c_i=j) \, \lVert x_i - \mu_j \rVert^2$
    
    - n is the number of data points.
    - k is the number of clusters.
    - $x_i$ is the i-th data point.
    - $c_i$ is the index of the cluster that data point i is assigned to.
    - $\mu_j$ is the centroid of the j-th cluster.
    - $\mathbf{1}(c_i=j)$ is an indicator function that is 1 if data point i belongs to cluster j, and 0 otherwise.
    - $\lVert x_i -\mu_j \rVert^2$ is the squared Euclidean distance between the data point and the cluster centroid.
2. Silhouette score
    
    The Silhouette score is a metric used to evaluate the quality of a clustering algorithm. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
    
    For each data point i, the Silhouette score s(i) is calculated using the following steps
    
    1. Calculate the mean intra-cluster distance (a(i))
        
        This is the average distance between the data point i and all other points in the same cluster.
        
        $a(i) = \frac{1}{|C_i|-1}\sum_{j \in C_i, j\neq i} d(i,j)$
        
        where $C_i$ is the cluster containing i, and d(i,j) is the distance between points i and j.
        
    2. Calculate the mean nearest-cluster distance (b(i))
        
        This is the average distance between the data point i and all points in the nearest cluster that is not $C_i$.
        
        $b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|}\sum_{j \in C_k} d(i,j)$
        
        where $C_k$ is any cluster that is not $C_i$.
        
    3. Calculate the Silhouette score (s(i))
        
        $s(i) = \frac{b(i) - a(i)}{\max(a(i),b(i))}$
        
        The Silhouette score ranges from -1 to 1:
        
        1. s(i) close to 1 indicates that the data point is well-clustered and appropriately assigned.
        2. s(i) close to 0 indicates that the data point is on or very close to the decision boundary between two neighboring clusters.
        3. s(i) close to -1 indicates that the data point might have been assigned to the wrong cluster.
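
The three steps for s(i) can be sketched as follows. The function name `silhouette_samples` is chosen for this sketch, and giving singleton clusters a score of 0 is a convention assumed here because a(i) is undefined in that case.

```python
import math


def silhouette_samples(points, labels):
    """Silhouette coefficient s(i) for every point (assumes at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [(1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11)]
good = silhouette_samples(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_samples(points, [0, 0, 1, 1, 1, 1])  # (3, 4) put in the wrong cluster
```

With the natural labeling every score is well above 0, while the deliberately mislabeled point `(3, 4)` gets a negative score. Averaging the per-point scores gives a single number that can be compared across different values of k.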

---

### DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Key terms: ε (epsilon), `min_samples`, ε-neighborhood, core instance

1. For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance’s ε-neighborhood.
2. If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance.
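3. All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances, so a chain of neighboring core instances forms a single cluster.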
4. Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
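
The four rules above can be sketched in plain Python. This is a brute-force illustration that compares every pair of points. The function name, the use of -1 for anomalies, and the sample values of `eps` and `min_samples` are assumptions of this sketch.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for an anomaly (noise)."""
    n = len(points)
    # epsilon-neighborhood of every point (each point counts itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unvisited core instance.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    # Only core instances keep extending the cluster.
                    if is_core[j]:
                        stack.append(j)
        cluster_id += 1
    return labels


points = [(1, 2), (2, 3), (3, 4), (8, 8), (9, 10), (10, 11), (50, 50)]
labels = dbscan(points, eps=2.5, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point `(50, 50)` has no core instance within ε, so it stays labeled as an anomaly. Unlike k-means, the number of clusters is not chosen in advance; it falls out of ε and `min_samples`.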

---

### Gaussian Mixtures

A probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
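
To make the generative assumption concrete, here is a sketch of a hypothetical one-dimensional mixture with two components. The weights, means, and standard deviations are invented for illustration. In a real problem they are the unknown parameters to be estimated, and that fitting step is not shown here.

```python
import math
import random

# Hypothetical 1-D mixture: (weight, mean, standard deviation) per component.
components = [(0.3, 0.0, 1.0), (0.7, 5.0, 0.5)]


def sample(rng):
    """Draw one instance: pick a component by weight, then sample from its Gaussian."""
    weights = [w for w, _, _ in components]
    _, mu, sigma = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mu, sigma)


def density(x):
    """Mixture density p(x) = sum_k w_k * N(x | mu_k, sigma_k^2)."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for w, mu, sigma in components
    )


rng = random.Random(0)
data = [sample(rng) for _ in range(1000)]
```

Instances in a low-density region, such as `x = 2.5` here, are the natural candidates for anomalies under this model.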