### Unsupervised Learning Techniques

**Clustering** = groups similar instances together into clusters. Used for: 
 * data analysis: run clustering and analyze each cluster separately
 * customer segmentation: cluster customers based on their purchases and activity on a website.
 * recommender systems
 * search engines: apply clustering to all the images in a database. When a user provides a reference image, use the trained clustering model to find the image's cluster and return all the images from that cluster.  
 * image segmentation: cluster pixels according to their color, then replace each pixel's color with the mean color of its cluster to reduce the number of different colors in the image. 
 * semi-supervised learning: if we only have a few labels, we can perform clustering and propagate the labels to all the instances in the same cluster.   
 * dimensionality reduction: once a dataset has been clustered, it is usually possible to measure each instance's affinity with each cluster --> an instance's feature vector can then be replaced by the vector of its cluster affinities. If there are k clusters, the vector will be k-dimensional. 
 * anomaly detection: any instance that has low affinity to all the clusters is likely to be an anomaly. 

**Anomaly detection** = learn what *normal* data looks like and use that to detect abnormal instances: 
 * defective items on a production line 
 * new trend in a time series 

**Density estimation** = estimate the probability density function of the random process that generated the dataset. Used for: 
 * anomaly detection --> instances located in very low density regions are likely to be anomalies 
 * data analysis and visualization

### 1. Clustering algorithms 

#### 1a. K-Means 


In [1]:
from sklearn.datasets import load_iris 

iris = load_iris()
X = iris.data
y = iris.target


In [2]:
from sklearn.cluster import KMeans 

k = 4
kmeans = KMeans(n_clusters = k)
y_pred = kmeans.fit_predict(X)

In [3]:
y_pred # index of the cluster each instance was assigned to 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 3, 3, 3, 0, 3, 0, 3, 0, 3, 0, 0, 0, 0, 3, 0, 3,
       0, 0, 3, 0, 3, 0, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 3, 0, 3, 3, 3,
       0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 2, 3, 2, 3, 2, 2, 0, 2, 2, 2,
       3, 3, 2, 3, 3, 2, 2, 2, 2, 3, 2, 3, 2, 3, 2, 2, 3, 3, 2, 2, 2, 2,
       2, 3, 3, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 3, 2, 3], dtype=int32)

In [4]:
kmeans.cluster_centers_

array([[5.53214286, 2.63571429, 3.96071429, 1.22857143],
       [5.006     , 3.428     , 1.462     , 0.246     ],
       [6.95      , 3.10666667, 5.86666667, 2.15333333],
       [6.25714286, 2.86190476, 4.85      , 1.63333333]])

In [5]:
import numpy as np 

X_new = np.array([[0, 2, 3, 2]])
kmeans.predict(X_new) # Assign new instance to the cluster whose centroid is the closest 

array([0], dtype=int32)

**Hard clustering** = assign each instance to a single cluster 

**Soft clustering** = give each instance a score per cluster. It can either be a distance between the instance and the centroid or a similarity score like a Gaussian Radial Basis Function. 

In [6]:
kmeans.transform(X_new) # Gives the distances from each cluster centroid 

array([[5.70322814, 5.70448771, 7.60055919, 6.59178739]])
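
The distances returned by `transform()` can be turned into similarity scores. The sketch below applies a Gaussian RBF to them; the `gamma` value and the names `km`, `x_new` and `similarities` are assumptions chosen for illustration, not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_iris)

x_new = np.array([[0, 2, 3, 2]])
distances = km.transform(x_new)        # shape (1, 4): distance to each centroid

gamma = 0.1                            # hypothetical RBF width, picked arbitrarily
similarities = np.exp(-gamma * distances ** 2)  # 1 at a centroid, tends to 0 far away

# Hard clustering keeps only the closest centroid (i.e. the most similar one).
hard_label = km.predict(x_new)[0]
print(distances, similarities, hard_label)
```

Either representation is a soft clustering: the instance keeps one score per cluster instead of a single label.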

#### Algorithm

 1. Place centroids randomly = pick k instances at random and use their locations as centroids 
 2. Label the instances 
 3. Update the centroids 
 4. Iterate steps 2 and 3 until the centroids stop moving 

The algorithm is guaranteed to converge in a finite number of steps, but not necessarily to the global optimum: which solution it reaches depends on the centroid initialization!
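
The four steps above can be sketched in plain NumPy. This is a minimal illustration on synthetic data, not scikit-learn's implementation; the function name `kmeans_lloyd` and the demo blobs are assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct instances at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: label each instance with the index of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs of 50 instances each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centroids, labels = kmeans_lloyd(X_demo, k=2)
print(centroids)
```

Changing `seed` changes the initial centroids, which is where the dependence on initialization comes from.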

**Computational complexity** = generally linear in the number of instances, the number of clusters and the number of dimensions, provided the data has a clustering structure. 

**Centroid initialization methods**

 * If we happen to know approximately where the centroids should be, we can set the *init* hyperparameter to a numpy array containing the list of centroids and set *n_init* to 1

In [7]:
good_init = np.array([[5, 3, 1, 0.2], [6, 2, 4 , 1], [6, 3, 5, 2], [5, 2, 3, 1]])
kmeans = KMeans(n_clusters = 4, init = good_init, n_init = 1)
y_pred = kmeans.fit_predict(X)

 * Run the algorithm multiple times with different random initializations and keep the best solution 

In [8]:
kmeans.inertia_

57.25600931571815

**Inertia** = performance metric for clustering, which is the sum of the squared distances between each instance and its closest centroid. --> the KMeans class runs the algorithm *n_init* times and keeps the model with the lowest inertia. 
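
As a quick check of what `inertia_` contains, the sketch below recomputes it by hand, relying on scikit-learn's documented definition (sum of squared distances to the closest centroid). The variable names are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_iris)

# Squared distance from every instance to every centroid, keeping the closest one.
sq_dists = km.transform(X_iris) ** 2
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, km.inertia_)  # the two values should agree up to rounding
print(km.score(X_iris))             # score() returns the negative inertia
```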

 * KMeans++ initialization algorithm: 
  1. Take one centroid $c^{(1)}$, chosen uniformly at random from the dataset 
  2. Take a new centroid $c^{(i)}$, choosing an instance $x^{(i)}$ with probability $D(x^{(i)})^{2} / \sum_{j = 1}^{m} D(x^{(j)})^{2}$, where $D(x^{(i)})$ is the distance between the instance $x^{(i)}$ and the closest centroid that was already chosen. This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids 
  3. Repeat the previous steps until all k centroids have been chosen 
  

The KMeans class uses this initialization method by default.
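
The seeding procedure can be sketched as follows. This is a simplified illustration of the three steps, not the exact routine scikit-learn runs; `kmeans_pp_init` and the synthetic data are hypothetical names and inputs.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    m = len(X)
    # Step 1: first centroid chosen uniformly at random from the dataset.
    centroids = [X[rng.integers(m)]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each instance to its closest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 2: sample the next centroid with probability D(x)^2 / sum_j D(x_j)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(m, p=probs)])
    # Step 3 is the loop itself: repeat until k centroids are chosen.
    return np.array(centroids)

rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 5, 10)])
init_centroids = kmeans_pp_init(X_demo, k=3)
print(init_centroids)
```

An instance that is already a centroid has $D = 0$ and therefore zero probability of being picked again.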
  

#### Accelerated K-Means and mini-batch K-Means 

 * Exploit triangle inequality + keep track of the lower and upper bounds for distances between instances and centroids --> accelerates the algorithm by avoiding unnecessary distance calculations 
 * Use mini-batches + move the centroids just slightly at each iteration --> possible to cluster huge datasets that do not fit in memory 

In [9]:
from sklearn.cluster import MiniBatchKMeans 

minibatch_kmeans = MiniBatchKMeans(n_clusters = 5)
minibatch_kmeans.fit(X)

MiniBatchKMeans(n_clusters=5)
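
A minimal sketch of both variants, assuming the Iris data from earlier. `algorithm="elkan"` selects the triangle-inequality variant in `KMeans`, and `partial_fit()` feeds `MiniBatchKMeans` one batch at a time, which is how data that does not fit in memory could be streamed. The batch count and the variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data

# Accelerated K-Means: same objective, fewer distance computations.
elkan_kmeans = KMeans(n_clusters=5, algorithm="elkan", n_init=10, random_state=42)
elkan_kmeans.fit(X_iris)

# Mini-batch K-Means fed incrementally, one batch of 30 instances at a time.
rng = np.random.default_rng(42)
X_shuffled = X_iris[rng.permutation(len(X_iris))]
mbk = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=42)
for batch in np.array_split(X_shuffled, 5):
    mbk.partial_fit(batch)

labels = mbk.predict(X_iris)
print(elkan_kmeans.inertia_, np.bincount(labels, minlength=5))
```

Mini-batch K-Means is usually faster on large datasets, typically at the cost of a somewhat higher inertia.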

**Finding the optimal number of clusters**

Inertia is not a good performance metric for choosing k because it keeps getting lower as we increase k --> the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia!

 * Elbow method with inertia plot --> pick k where inertia starts decreasing more slowly 
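 * Silhouette score = mean silhouette coefficient over all instances. An instance's silhouette coefficient is
 
$(b - a) / \max(a, b)$

where $a$ is the mean distance to the other instances of the same cluster (mean intra-cluster distance) and $b$ is the mean nearest-cluster distance, i.e. the mean distance to the instances of the closest other cluster. The coefficient varies between -1 and +1: close to +1 means the instance is well inside its own cluster, close to 0 means it sits near a cluster boundary, and close to -1 means it may have been assigned to the wrong cluster. 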

In [10]:
from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)

0.49745518901737446
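
To compare several values of k, both metrics can be collected in a loop. This is a sketch on the Iris data; the range of k and the dictionary names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris = load_iris().data

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_iris)
    inertias[k] = km.inertia_                               # for the elbow plot
    silhouettes[k] = silhouette_score(X_iris, km.labels_)   # higher is better

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(k, round(inertias[k], 2), round(silhouettes[k], 3))
print("k with the highest silhouette score:", best_k)
```

Plotting `inertias` against k gives the elbow curve. The silhouette scores can be compared directly, since they do not automatically improve as k grows.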

**Limits of K-Means**

K-Means does not behave well when clusters have: 
 * varying sizes 
 * different densities 
 * nonspherical shapes 


**!** It is important to **scale the input features** before running K-Means; otherwise the clusters may be very stretched and K-Means will perform poorly. 
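
One way to make sure scaling is never skipped is to put the scaler and K-Means in the same pipeline. The sketch below uses `StandardScaler`, which is one possible choice of scaler, and an arbitrary `n_clusters=3`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data

scaled_kmeans = make_pipeline(
    StandardScaler(),                                   # zero mean, unit variance per feature
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = scaled_kmeans.fit_predict(X_iris)

# The scaler fitted inside the pipeline can be inspected or reused.
X_scaled = scaled_kmeans.named_steps["standardscaler"].transform(X_iris)
print(X_scaled.mean(axis=0).round(3), X_scaled.std(axis=0).round(3))
```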

#### K-Means for Image Segmentation

Image segmentation is the task of partitioning an image into multiple segments. 

 * Semantic segmentation = all pixels that are part of the same object type get assigned to the same segment. 
 * Instance segmentation = all pixels that are part of the same individual object are assigned to the same segment. 
 * Color segmentation = all pixels of the same color are assigned to the same segment 

In [None]:
from matplotlib.image import imread 
import os

image = imread(os.path.join("images", "unsupervised_learning", "ladybug.png"))
image.shape
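X = image.reshape(-1, 3) # one row per pixel, one column per color channel (assumes an RGB image)
kmeans = KMeans(n_clusters = 8).fit(X)
segmented_img = kmeans.cluster_centers_[kmeans.labels_] # replace each pixel by the mean color of its cluster
segmented_img = segmented_img.reshape(image.shape)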

#### K-Means for preprocessing 

Clustering can be an efficient preprocessing step before supervised learning algorithms.

In [14]:
from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y = True)


In [15]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

In [17]:
from sklearn.linear_model import LogisticRegression 

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [18]:
logreg.score(X_test, y_test)

0.9733333333333334

In [19]:
from sklearn.pipeline import Pipeline 

pipeline = Pipeline([
    ('kmeans', KMeans(n_clusters = 50)),
    ('logreg', LogisticRegression())
])

pipeline.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Pipeline(steps=[('kmeans', KMeans(n_clusters=50)),
                ('logreg', LogisticRegression())])

In [20]:
pipeline.score(X_test, y_test)

0.9644444444444444

In [None]:
from sklearn.model_selection import GridSearchCV 

param_grid = dict(kmeans__n_clusters = range(2,100))
grid_clf = GridSearchCV(pipeline, param_grid, cv = 3)
grid_clf.fit(X_train, y_train)

In [None]:
grid_clf.best_params_

In [None]:
grid_clf.score(X_test, y_test)

#### K-Means for semi-supervised learning 

Semi-supervised learning = plenty of unlabeled instances and few labeled instances. 

 * Label representative images 
 * Propagate the labels to all instances in the same cluster (**label propagation**) 

In [22]:
n_labeled = 50
logreg = LogisticRegression()
logreg.fit(X_train[:n_labeled], y_train[:n_labeled])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

In [23]:
logreg.score(X_test, y_test)

0.84

In [25]:
k = 50 
kmeans = KMeans(n_clusters = k)
X_digits_dist = kmeans.fit_transform(X_train)
representative_digit_idx = np.argmin(X_digits_dist, axis = 0)
X_representative_digits = X_train[representative_digit_idx]
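
The label propagation step can be sketched as follows. In practice the k representative digits would be labeled by hand; here the labels are simply read from `y_train` as a stand-in for that manual step. The `random_state` values and `max_iter=5000` are assumptions added for reproducibility and convergence.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_digits_dist = kmeans.fit_transform(X_train)
representative_digit_idx = np.argmin(X_digits_dist, axis=0)  # closest instance to each centroid

# Stand-in for manual labeling of the k representative images.
y_representative_digits = y_train[representative_digit_idx]

# Label propagation: each instance inherits the label of its cluster's representative.
y_train_propagated = y_representative_digits[kmeans.labels_]
propagation_accuracy = (y_train_propagated == y_train).mean()

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train_propagated)
test_accuracy = logreg.score(X_test, y_test)
print(propagation_accuracy, test_accuracy)
```

Only k labels are needed, yet the classifier is trained on the whole training set. Propagated labels near cluster boundaries are the ones most likely to be wrong.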

#### 1b. DBSCAN 

 * For each instance, count how many instances are located within a small distance $\epsilon$ from it --> this region is the instance's **$\epsilon$-neighborhood** 
 * If an instance has at least min_samples instances in its **$\epsilon$-neighborhood** (including itself), it is considered a **core instance** --> core instances are those located in dense regions. 
 * All instances in the neighborhood of a core instance belong to the same cluster --> a long sequence of neighboring core instances forms a single cluster 
 * Any instance that is not a core instance and does not have a core instance in its neighborhood is considered an **anomaly** 


**!** DBSCAN works well if the clusters are dense enough and are well separated by low-density regions. 

In [26]:
from sklearn.cluster import DBSCAN 
from sklearn.datasets import make_moons

X, y = make_moons(n_samples = 1000, noise = 0.05)
dbscan = DBSCAN(eps = 0.05, min_samples = 5)
dbscan.fit(X)

DBSCAN(eps=0.05)

In [27]:
dbscan.labels_

array([ 0,  5,  1,  0, -1,  0,  0,  2, -1,  0,  0,  0,  1,  2,  3,  0,  2,
        1,  0,  3,  0,  3,  4,  3, -1,  1,  0,  1,  2,  5,  3,  3,  6, -1,
        1, -1,  4,  6,  0,  0,  6,  1,  0,  2,  2,  4,  1,  0,  1,  1,  0,
        1,  3,  1,  3,  1,  3,  4,  3,  1,  0,  0,  3,  2,  0, -1,  3,  5,
        2,  0,  1,  1,  3,  3,  4,  4,  0,  0, -1,  1,  2,  1,  0,  4,  0,
       -1,  1,  5,  1,  0,  1,  3,  1,  5,  1,  2,  3,  1,  5,  1,  3,  1,
       -1,  1,  3,  3,  4,  0,  0,  0,  1, -1,  3,  1,  1,  0,  4,  3,  0,
        1,  0,  3,  4,  1,  2,  6,  3,  7,  2, -1,  1, -1,  1,  0,  7,  0,
        1,  0,  4,  4,  1,  1,  1,  5,  3,  0,  5,  1,  1,  0,  5,  1,  0,
        0, -1,  0, -1,  2,  2,  1,  2,  1,  3,  5,  2,  1,  1,  0,  1, -1,
        1,  5,  1,  0,  0,  1,  2,  1,  3,  5,  1, -1,  2,  3,  6,  1,  1,
        0,  3,  3,  0, -1,  1,  0,  0,  1,  0,  1, -1,  1,  3,  2,  1,  0,
        3, -1,  3,  2,  1, -1,  3,  6,  1,  1, -1,  7,  1,  1,  0,  0,  2,
        1,  5,  4,  4,  2

Instances that have a cluster index of -1 are considered anomalies. 

In [29]:
len(dbscan.core_sample_indices_)

782
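
The fitted attributes are enough to split instances into core instances, border instances (in a cluster but not core) and anomalies. The sketch below assumes a fixed `random_state` for reproducibility and also tries a wider `eps=0.2`, an arbitrary value, to show how the hyperparameter changes the result.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
db = DBSCAN(eps=0.05, min_samples=5).fit(X_moons)

is_core = np.zeros(len(X_moons), dtype=bool)
is_core[db.core_sample_indices_] = True
is_anomaly = db.labels_ == -1
is_border = ~is_core & ~is_anomaly   # assigned to a cluster without being core

n_clusters = len(set(db.labels_.tolist()) - {-1})
print(n_clusters, is_core.sum(), is_border.sum(), is_anomaly.sum())

# Same data with a wider neighborhood (hypothetical value).
db_wide = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
n_clusters_wide = len(set(db_wide.labels_.tolist()) - {-1})
print(n_clusters_wide, (db_wide.labels_ == -1).sum())
```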

 * DBSCAN does not have a predict() method, only fit_predict(), so it cannot predict which cluster a new instance belongs to. A classifier trained on the core instances can do this instead. 

In [30]:
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors = 50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])


KNeighborsClassifier(n_neighbors=50)

In [32]:
X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
knn.predict(X_new)

array([5, 3, 2, 1])

In [33]:
knn.predict_proba(X_new)

array([[0.  , 0.  , 0.  , 0.2 , 0.04, 0.76, 0.  , 0.  ],
       [0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 0.32, 0.68, 0.  , 0.  , 0.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]])
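
A classifier trained this way assigns every new instance to some cluster, even one far from all of them. A possible refinement, sketched below, is to look up the nearest core instance with `kneighbors()` and flag the new instance as an anomaly when that distance exceeds a threshold. The threshold `max_dist`, the `eps=0.2` setting and the fixed `random_state` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X_moons, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)

core_labels = db.labels_[db.core_sample_indices_]
knn = KNeighborsClassifier(n_neighbors=50).fit(db.components_, core_labels)

X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])
max_dist = 0.2   # hypothetical threshold: farther than this counts as an anomaly

dist, idx = knn.kneighbors(X_new, n_neighbors=1)   # nearest core instance only
y_new = core_labels[idx.ravel()]                   # its cluster label (a copy)
y_new[dist.ravel() > max_dist] = -1                # too far from any cluster
print(y_new)
```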

 * Can identify any number of clusters of any shape
 * Robust to outliers 
 * 2 hyperparameters: min_samples and eps 

#### 1c. Other clustering algorithms

 * **Agglomerative clustering** = a hierarchy of clusters is built from the bottom up. 
 * **BIRCH** = during training, it builds a tree structure containing just enough information to quickly assign each new instance to a cluster without having to store all the instances in the tree --> uses limited memory, good for huge datasets. 
 * **Mean-Shift** = places a circle centered on each instance, then for each circle computes the mean of all the instances located within it and shifts the circle so that it is centered on that mean. This mean-shifting step is iterated until the circles stop moving. Similar to DBSCAN. Has only one hyperparameter: the radius of the circles (the bandwidth). Not suited for large datasets. 
 * **Affinity propagation** = uses a voting system where instances vote for similar instances to be their representatives, and once the algorithm converges each representative and its voters form a cluster. Not suited for large datasets. 
 * **Spectral clustering** = takes a similarity matrix between instances and creates a low-dimensional embedding from it. Then uses another clustering algorithm on this low-dimensional space. 
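
All five algorithms are available in `sklearn.cluster` and share the `fit_predict()` interface, so they can be tried side by side. The sketch below runs them on small synthetic blobs; the hyperparameter values (for example `bandwidth=2.0`) are arbitrary choices for illustration.

```python
from sklearn.cluster import (AffinityPropagation, AgglomerativeClustering,
                             Birch, MeanShift, SpectralClustering)
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=42)

models = {
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "birch": Birch(n_clusters=3),
    "mean_shift": MeanShift(bandwidth=2.0),
    "affinity_propagation": AffinityPropagation(random_state=42),
    "spectral": SpectralClustering(n_clusters=3, random_state=42),
}

results = {}
for name, model in models.items():
    results[name] = model.fit_predict(X_blobs)
    print(name, "found", len(set(results[name].tolist())), "clusters")
```

Mean-Shift and affinity propagation choose the number of clusters themselves, so their counts may differ from the three requested from the other models.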