# Unsupervised Learning

**Unsupervised learning** is a type of machine learning where the algorithm learns patterns from input data without explicit supervision or labeled responses.

In **unsupervised learning**, the algorithm tries to find hidden structures or relationships within the data.

Unlike **supervised learning**, where the algorithm is provided with labeled data and learns to predict output based on input-output pairs, **unsupervised learning** deals with **unlabeled data** and aims to uncover underlying patterns or structures.

## What if you only have X?

Unsupervised learning is useful when you want to:

- **Understand your data** (exploration, visualisation, segmentation...)
- **Process features** (engineering, selection, compression...)
- **Work without targets** (too tough to annotate, too expensive, huge dataset...)

## Principal Component Analysis (PCA)

- Aims to find the linear combinations of features that best represent the underlying structure of the data.
- Squashes our high-dimensional dataset down into a lower dimension.


### Optimize linear combination of features

Remember linear regression variants?

**Polynomial**

$Y = \beta_0 + \beta_1X_1 + \beta_2X_1^2$

**Log transformation**

$Y = \beta_0 + \beta_1\log(X_1)$

**Linear combination of features**

$Y = \beta_0 + \beta_1X_1 + \beta_2(X_2 + X_3)$

With PCA you can find the best linear combination of features:

$Y = \beta_0 + \beta_1X'_1 + \beta_2X'_2$

This new set of features:

- Removes any multicollinearity.
- Ranks features by "importance": the biggest part of the explainability is in $X'_1$, then $X'_2$, and so on.

How do we build those $X'$, which are also called "principal components"?

<div>
<img src="files/PCA.png" width="75%" align='center' source='https://www.biorender.com/template/principal-component-analysis-pca-transformation'/> </div>



- PCA returns a new projection of the data.
- In the image above, we start with three different features and look to create two new features, PC1 and PC2, which are not collinear.

- On the second graph, values vary a lot along PC1, but only a little along PC2.
- The principal components are orthonormal to each other and maximize the variance explained.

[2D Interactive Visualization](https://setosa.io/ev/principal-component-analysis/)

[Stats Stack Exchange](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Example with Wine dataset

In [None]:
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
X = wine.data
y = wine.target # we do have y, but let's forget about it for now
wine_features = X.columns

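# Data must be centered around its mean before applying PCA.
# StandardScaler also scales each feature to unit variance, which matters because PCA is sensitive to scale.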
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = pd.DataFrame(scaler.transform(X), columns=wine_features)
X

### Heatmap and correlation

The information in some features is also carried by other features, such as "flavanoids" and "total_phenols". There is redundancy among the features.

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(pd.DataFrame(X).corr(), cmap='coolwarm', vmin=-1, vmax=1, annot=True);

In this case it's worth doing some PCA.

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X)

In [None]:
# The 13 new features, each a linear combination of the initial features (stored as ROWS)
#pd.DataFrame(pca.components_)
# Access our 13 PCs
W = pca.components_

# Print PCs as COLUMNS
W = pd.DataFrame(W.T,
                 index=wine_features,
                 columns=[f'PC{i}' for i in range(1, 14)])
W

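Reading down the PC1 column of `W`, PC1 $= 0.144329 * X_1 \text{(alcohol)} - 0.245188 * X_2 \text{(malic_acid)} - 0.002051 * X_3 \text{(ash)} \dots$ (the overall sign of a component is arbitrary, so your output may show all signs flipped).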

### Transforming X

In [None]:
X_proj = pca.transform(X) # After fitting, we can transform our original X
# X_proj is our new dataset: the same 178 samples, expressed in the 13 PCs
X_proj = pd.DataFrame(X_proj, columns=[f'PC{i}' for i in range(1, 14)])
X_proj

In [None]:
# The X_proj features are now uncorrelated
plt.figure(figsize=(10, 6))
sns.heatmap(pd.DataFrame(X_proj).corr(), cmap='coolwarm', vmin=-1, vmax=1,);

In [None]:
# 2D-slice

plt.figure(figsize=(13,5))
plt.subplot(1,2,1)
plt.title('X1 vs. X0 before PCA (initial space)')
plt.xlabel('X0')
plt.ylabel('X1')
plt.scatter(X.iloc[:,0], X.iloc[:,1])

plt.subplot(1,2,2)
plt.title('PC1 vs PC2 (new space)')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.scatter(X_proj.iloc[:,0], X_proj.iloc[:,1]);

Before PCA, the two raw features show no obvious structure. After PCA, a new "space" has been created in which the data looks more "clustered". Keep in mind that this plot shows only 2 of the 13 components.

### Share of variance explained by each PC

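PC1 now carries the largest share of the "explainability", PC2 the second largest, and so on. For each principal component $PC_i$, that share is:

$\frac{Var(PC_i)}{Var(X)}$

where $Var(X)$ is the total variance of the dataset. In other words: how much of the variation in X does a single PC explain?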

In [None]:
pca.explained_variance_ratio_ # Share of variance explained by each of the selected components

In [None]:
# PC1 explains 36% of the variation, PC2 19%, PC3 11% etc.

plt.plot(pca.explained_variance_ratio_);

In [None]:
# We can compute these variances
X_proj.var() / (X_proj.var()).sum()

### How do you compute PCA?

1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
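
The five steps above can be sketched directly in NumPy. This is a minimal illustration rather than how sklearn implements PCA internally; the names `X_std`, `W_k` and `X_proj_manual` are chosen for this example, and `k = 3` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Standardize the features (zero mean, unit variance)
X_std = StandardScaler().fit_transform(load_wine().data)

# 2. Covariance matrix (features are columns, hence rowvar=False)
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors / eigenvalues; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order, so we re-sort them
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the first k eigenvectors (k = 3 is an arbitrary choice here)
k = 3
W_k = eigenvectors[:, :k]

# 5. Recast the data along the principal component axes
X_proj_manual = X_std @ W_k

# Share of variance carried by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# Sanity check against sklearn (components are only defined up to a sign)
X_proj_sklearn = PCA(n_components=k).fit_transform(X_std)
same_up_to_sign = np.allclose(np.abs(X_proj_manual), np.abs(X_proj_sklearn), atol=1e-6)
print(same_up_to_sign)
```

The comparison uses absolute values because an eigenvector and its negative describe the same axis.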

### Why would you want fewer features?

- To compress data
- To reduce model complexity
- To reduce overfitting

### How many features to keep ($k$)?

If we take a look at the explained-variance graph above, we see an elbow around PC3: each later component adds comparatively little. Choosing $k$ is a trade-off between compression and performance.

In [None]:
# x starts at 1 so that it matches the number of components used
plt.plot(range(1, 14), np.cumsum(pca.explained_variance_ratio_))
plt.ylim(ymin=0)
plt.title('cumulated share of explained variance')
plt.xlabel('# of principal component used');
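
Instead of reading $k$ off the plot, one option is to pick the smallest $k$ whose cumulated share reaches a target. The sketch below assumes a 90% target, which is an arbitrary value chosen for illustration. sklearn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components the same way.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)

threshold = 0.90  # assumed target share of variance, for illustration only

# Index of the first cumulated share that reaches the threshold, plus one
k = int(np.argmax(cumulative >= threshold)) + 1

# Same selection done by sklearn when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold).fit(X_std)
print(k, pca_auto.n_components_)
```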

### PCA with fewer components

In [None]:
# Fit a PCA with only 3 components
pca3 = PCA(n_components=3).fit(X)

# Project your data into 3 dimensions (pca3 is already fitted, so transform is enough)
X_proj3 = pd.DataFrame(pca3.transform(X), columns=['PC1', 'PC2', 'PC3'])

# We have "compressed" our dataset in 3D
X_proj3

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

print("accuracy all 13 initial features")
print(cross_val_score(LogisticRegression(), X, y, cv=5).mean())

print("\n accuracy 3 PCs")
print(cross_val_score(LogisticRegression(), X_proj3, y, cv=5).mean())

### Decompress

Can you reconstruct exactly `X` from `X_proj`?

- Not if you kept $k < 13$ dimensions: information has been lost.
- We can approximate `X` by reconstructing it with `inverse_transform()`.

In [None]:
X_reconstructed = pca3.inverse_transform(X_proj3)
X_reconstructed.shape

In [None]:
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
sns.heatmap(X)
plt.title("original data")
plt.subplot(1,2,2)
plt.title("reconstructed data")
sns.heatmap(X_reconstructed);
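
To put a number on the information lost, we can measure the mean squared error between the standardized data and its reconstruction for several values of $k$. This is a small sketch; the values of $k$ tried here are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

errors = {}
for k in (1, 3, 7, 13):
    pca_k = PCA(n_components=k).fit(X_std)
    # Compress to k dimensions, then decompress back to 13
    X_back = pca_k.inverse_transform(pca_k.transform(X_std))
    errors[k] = float(np.mean((X_std - X_back) ** 2))

print(errors)
```

The error shrinks as $k$ grows and is numerically zero when all 13 components are kept.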

### Limitations of PCA

- Cannot capture non-linear structure in the data. Kernel PCA is one extension that addresses this.
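
As a rough illustration of this limitation, the sketch below compares `PCA` and sklearn's `KernelPCA` on two concentric circles, a toy dataset that is not linearly separable. The dataset, the RBF kernel and `gamma=10` are assumptions made for this example, not tuned values.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Toy data: a small circle inside a large one
X_c, y_c = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the data; kernel PCA projects it non-linearly
X_lin = PCA(n_components=2).fit_transform(X_c)
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X_c)

# Training accuracy of a linear classifier in each projected space
acc_lin = LogisticRegression().fit(X_lin, y_c).score(X_lin, y_c)
acc_rbf = LogisticRegression().fit(X_rbf, y_c).score(X_rbf, y_c)
print(acc_lin, acc_rbf)
```

A linear classifier should do noticeably better in the kernel PCA space, since that projection is able to pull the two circles apart.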

## Clustering (with K-Means)

### Definition

The process of organizing data points into groups whose members are similar in some way. So we're going to find categories (classes, segments) of unlabelled data rather than just trying to reduce dimensionality.

<div>
<img src="files/clustering.webp" width="75%" align='center' source='https://bolisettigunasekhar.medium.com/what-is-clustering-7c8c9c34bd66'> </div>

[Animation](https://shabal.in/visuals/kmeans/2.html)

### K-means algorithm

The K-means algorithm is actually quite simple.

1. Choose the number of clusters $K$ (a hyperparameter).

1. Pick $K$ initial centroids at random. (They can be points from the input dataset or other points.)

1. Assign each data point to its closest centroid; this forms the $K$ clusters.

1. Move each centroid to the mean of the points in its cluster (the position that minimizes the within-cluster variance).

1. Reassign each data point to its new closest centroid.

1. If any reassignment occurred, go back to step 4; otherwise go to FINISH.

1. The model is ready.

[source](https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)
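
The steps above can be sketched in a few lines of NumPy. This is a bare-bones illustration, not sklearn's implementation: the function name `kmeans`, its parameters and the synthetic `blobs` data are all made up for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data: three Gaussian blobs in 2D
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

labels, centroids = kmeans(blobs, k=3)
print(centroids)
```

Because the starting centroids are random, a single run can end in a poor local optimum, which is why several initializations are normally tried.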

### K-means and dimensions

If you plot two features, maybe you won't see there are different clusters. But if you add a feature, it might become obvious.

<div>
<img src="files/k-means_2D_to_3D.png" width="65%" align='center' source='https://bolisettigunasekhar.medium.com/what-is-clustering-7c8c9c34bd66'> </div>

So for the algorithm to work well, it helps if the data is already geometrically "clustered". You can use PCA first to better shape your data.

### In practice

- K-means is usually run a few times with different random initializations, and the best run is kept (sklearn does this by itself through the `n_init` parameter).
- We can use a random mini-batch at each iteration instead of the full dataset (`MiniBatchKMeans` in sklearn).
- The algorithm is quite fast.
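
Both ideas are available in sklearn. The sketch below uses `n_init` to request several initializations and `MiniBatchKMeans` for the mini-batch variant; the values `n_init=10`, `n_init=3`, `batch_size=64` and `random_state=0` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)

# Full K-means: 10 random initializations, the lowest-inertia run is kept
km_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Mini-batch variant: each update uses a random batch of 64 samples
km_mini = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=3, random_state=0).fit(X_std)

print(km_full.inertia_, km_mini.inertia_)
```

On a dataset this small the mini-batch version brings no real speed benefit; it is mainly useful when the full dataset is large.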

### Wine clustering

Let's go back to our wine dataset on which we applied a PCA.

In [None]:
from sklearn.cluster import KMeans

# Fit K-means
km = KMeans(n_clusters=3)
km.fit(X_proj)

In [None]:
km.cluster_centers_.shape # Position in 13 dimensions of the 3 centroids.

In [None]:
km.labels_ # All our samples (data points) are already labelled!

In [None]:
# Plot the first two PCs (which carry the most information, as we've seen)
plt.scatter(X_proj.iloc[:,0], X_proj.iloc[:,1], c=km.labels_, cmap='viridis_r')
plt.title('KMeans clustering')
plt.xlabel('PC 1')
plt.ylabel('PC 2');

### Mapping ```y``` and labels

Watch out! The cluster ids in `km.labels_` are arbitrary, so they may not line up with the class ids in `y`!

In [None]:
# set(km.labels_)
np.unique(km.labels_)

In [None]:
# The ids look the same as in km.labels_... but they may not refer to the same groups!
y.unique()

In [None]:
np.unique(km.labels_, return_counts=True)

In [None]:
#y.value_counts()
np.unique(y, return_counts=True)

In [None]:
# Here are three equivalent ways to do the matching.
# The mapping is hardcoded for one particular run: check the counts above and adapt it to yours.

# With list comprehension
km_labels_mapped = np.array([0 if x == 2 else 1 if x == 0 else 2 for x in km.labels_])
# With numpy vectorize
km_labels_mapped = np.vectorize(lambda x: {0:1, 1:2, 2:0}.get(x, 0))(km.labels_)
# With Pandas map
km_labels_mapped = np.array(pd.Series(km.labels_).map({0:1, 1:2, 2:0}))
km_labels_mapped
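
Since the cluster ids change from run to run, the mapping can also be derived automatically. One possible approach, sketched below, builds a confusion matrix between `y` and the cluster ids and picks the one-to-one assignment with the largest overlap using SciPy's `linear_sum_assignment`. The variable names and the `random_state=0` setting are choices made for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
y_true = wine.target

cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# Rows are true classes, columns are cluster ids
cm = confusion_matrix(y_true, cluster_ids)

# One-to-one matching that maximizes the number of agreeing samples
true_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)
mapping = {int(c): int(t) for t, c in zip(true_idx, cluster_idx)}

labels_mapped = np.array([mapping[c] for c in cluster_ids])
print(mapping)
```

This only works because we happen to have `y`; in a truly unsupervised setting there is nothing to map the clusters to.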

In [None]:
# Visualization of y vs y_pred
plt.figure(figsize=(13,5))

plt.subplot(1,2,2)
plt.scatter(X_proj.iloc[:,0], X_proj.iloc[:,1], c=y, cmap='viridis')
plt.title('True wine labels'); plt.xlabel('PC 1'); plt.ylabel('PC 2');

plt.subplot(1,2,1)
plt.scatter(X_proj.iloc[:,0], X_proj.iloc[:,1], c=km_labels_mapped, cmap='viridis')
plt.title('KMeans clustering'); plt.xlabel('PC 1'); plt.ylabel('PC 2')

### Score

In [None]:
from sklearn.metrics import accuracy_score

y_pred = pd.Series(km_labels_mapped)
#y_pred = pd.Series(km.labels_).map({0:1, 1:2, 2:0}) # Alternative way, don't forget to change the mapping manually!
accuracy_score(y, y_pred) # sklearn convention: (y_true, y_pred)
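
Accuracy depends on getting the label mapping right. A score that does not care how the clusters are numbered, such as sklearn's `adjusted_rand_score`, avoids that step. The sketch below is a side illustration under the same setup as before; the relabelling `(cluster_ids + 1) % 3` is just an arbitrary permutation used to show the invariance.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

# Score with the raw cluster ids
ari = adjusted_rand_score(wine.target, cluster_ids)

# Score after renaming the clusters (0 -> 1, 1 -> 2, 2 -> 0): unchanged
ari_relabelled = adjusted_rand_score(wine.target, (cluster_ids + 1) % 3)

print(ari, ari_relabelled)
```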

### Prediction

We can then use the fitted model to predict new values.

In [None]:
# Create a new DataFrame holding one random observation, with the right column names
new_obs = pd.DataFrame(data=np.random.random((1,13)), columns=X_proj.columns) # np.random.random((1,13)) -> 1 row and 13 columns
km.predict(new_obs)

### K-Means' Loss Function

```km.fit(X)``` finds parameters $\beta$ that minimize a loss.

Each $\beta_j$ parameter is the centroid $\mu_j$ of its respective cluster $C_j$.

The loss function is called inertia $L(\mu)$

= **sum of squared distances** between each observation and its closest centroid

= sum over clusters of the **within-cluster sum of squares** (WCSS)

= within-cluster variance, weighted by cluster size

$inertia = L(\mu) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$
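
As a check on this definition, the inertia can be recomputed by hand from the fitted centroids and labels and compared with sklearn's `inertia_` attribute. This is a small sketch; `km_demo` and `inertia_manual` are names chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_wine().data)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Centroid assigned to each sample, one row per sample
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]

# Sum of squared distances between each sample and its centroid
inertia_manual = float(np.sum((X_std - assigned_centroids) ** 2))

print(inertia_manual, km_demo.inertia_)
```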

### Choosing $K$

In [None]:
inertias = []
ks = range(1,10)

for k in ks:
    km_test = KMeans(n_clusters=k).fit(X)
    inertias.append(km_test.inertia_)

plt.plot(ks, inertias)
plt.xlabel('number of clusters k')
plt.ylabel('inertia');

### There are many other types of clustering

<div>
<img src="files/sklearn_clustering.png" width="75%" align='center' source='https://bolisettigunasekhar.medium.com/what-is-clustering-7c8c9c34bd66'> </div>

[Sklearn website](https://scikit-learn.org/stable/modules/clustering.html)