# Unsupervised Learning for Clustering and Dimensionality Reduction

## Learning objectives:
1. Distinguish supervised from **unsupervised learning**
2. Understand the necessity of **reducing dimensionality** for big datasets
    * Know at least two approaches for dimensionality reduction
    * Understand the steps of PCA
3. Distinguish **clustering** from supervised classification
    * Know how to implement the K-means algorithm and select the number of clusters
    * Know how to implement Gaussian mixture models and how to detect anomalies

### Unsupervised Learning (UL) vs. Supervised Learning (SL)

* **UL** is a type of machine learning in which the model is trained on a dataset **without** explicit supervision, i.e. with no **labeled output data**.
* In contrast, an **SL** algorithm learns from labeled examples to make predictions or classifications.
* UL aims to uncover **patterns**, **groupings**, **inherent structures**, or **representations** in the data, often through techniques such as **clustering**, **dimensionality reduction**, and **density estimation**.
* UL is particularly useful for exploring data and gaining insights when the output categories or labels are unknown or unavailable.

### Dimension Reduction

#### The Curse of Dimensionality
* **High-dimensional datasets** tend to be very **sparse**: training instances lie **far away from each other**, so a model has few nearby examples to generalize from, which increases the risk of **overfitting**.
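The sparsity claim above can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube (a choice made here for illustration, not something the notes specify); the helper name `mean_pairwise_distance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(n_points, n_dims):
    """Average Euclidean distance between random points in the unit hypercube."""
    X = rng.random((n_points, n_dims))
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    d = np.sqrt(np.clip(d2, 0, None))  # clip tiny negatives from round-off
    # Keep only distinct pairs (upper triangle, excluding the diagonal)
    iu = np.triu_indices(n_points, k=1)
    return d[iu].mean()

# The same number of points gets further apart as dimensions are added
for dims in (2, 10, 100, 1000):
    print(dims, round(mean_pairwise_distance(200, dims), 2))
```

With a fixed number of samples, the average distance keeps growing with the number of dimensions, which is one way to see why high-dimensional training sets are sparse.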

#### Main Methods for Dimension Reduction
* **Projection Methods**: Map high-dimensional data onto a lower-dimensional subspace while preserving as much of its structure as possible, e.g. PCA
* **Manifold Learning**: Captures the underlying nonlinear structure of the data, e.g. t-SNE, LLE

#### Principal Component Analysis (PCA)
* PCA identifies **orthogonal axes** (principal components) that **maximize variance** in the data. It **projects** the data onto these components, effectively reducing dimensionality.
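The steps of PCA can be sketched with NumPy alone. This is a minimal illustration, assuming the SVD route (centering, decomposition, projection); the function name `pca` and the toy data are hypothetical and not taken from the course material.

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA sketch via SVD of the centered data."""
    # 1. Center each feature on zero
    Xc = X - X.mean(axis=0)
    # 2. SVD: the rows of Vt are the orthogonal principal axes,
    #    ordered by decreasing variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # 3. Variance carried by each axis, as a fraction of the total
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # 4. Project the centered data onto the leading components
    X_reduced = Xc @ components.T
    return X_reduced, components, ratio[:n_components]

# Toy data: the third feature is almost the sum of the first two,
# so two components should capture nearly all of the variance
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 2))
X = np.column_stack([A[:, 0], A[:, 1],
                     A[:, 0] + A[:, 1] + 0.01 * rng.normal(size=500)])

X_reduced, components, ratio = pca(X, n_components=2)
print(X_reduced.shape, ratio.sum())
```

In practice a library implementation such as scikit-learn's `PCA` would normally be used; the sketch only makes the individual steps visible.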

### Clustering
#### Clustering vs. Classification
* **Clustering** groups data into clusters based on similarity, **without predefined labels**. **Classification** assigns data to predefined categories **using labeled examples**. Clustering discovers patterns and relationships, while classification predicts labels based on known outcomes.

#### K-means clustering
* K-means clustering **divides data into K clusters** by minimizing the sum of squared distances between points and cluster centroids. It **iteratively assigns points to the nearest centroid** and updates centroids. 
* **K-Means Clustering Steps**:

    1. **Initialization**: Randomly select K initial centroids.
    2. **Assignment**: Assign each point to the nearest centroid.
    3. **Update Centroids**: Recompute each centroid as the mean of the points assigned to it.
    4. **Iteration**: Repeat steps 2 and 3.
    5. **Convergence**: Stop when the centroids stabilize or after a set number of iterations.
    6. **Clusters**: The final centroids and assignments define the clusters in the data.
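The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions (initial centroids drawn from the data points, an empty cluster keeps its previous centroid); the function name `kmeans` and the toy blobs are hypothetical.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every point."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point joins its nearest centroid
        labels = nearest_centroid(X, centroids)
        # 3. Update: move each centroid to the mean of its points
        #    (assumption: an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4-5. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 6. Final assignment and inertia (sum of squared distances)
    labels = nearest_centroid(X, centroids)
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Toy data: three Gaussian blobs in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 4])])
labels, centroids, inertia = kmeans(X, k=3)
print(centroids.round(2), round(inertia, 2))
```

Because the result depends on the random initialization, library implementations typically run several initializations and keep the one with the lowest inertia.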

#### Gaussian mixture models (GMM)
* GMM is a probabilistic model for **representing data** as a **mixture** of several **Gaussian distributions**.
* GMM supports anomaly detection by modeling the **distribution of normal data**. Data points with **low probability density** under the fitted GMM are flagged as anomalies (outliers).
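As a rough illustration of density-based anomaly detection, the sketch below fits scikit-learn's `GaussianMixture` to synthetic data and flags the points with the lowest log-density. The toy data, the three hand-placed outliers, and the 2% threshold are all assumptions made for this example; in practice the threshold has to be chosen for the problem at hand.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "normal" data: two Gaussian blobs
normal = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(300, 2)),
])
# Three hand-placed points far from both blobs (hypothetical anomalies)
outliers = np.array([[2.5, 2.5], [-4.0, 6.0], [9.0, -1.0]])
X = np.vstack([normal, outliers])

# Fit a two-component mixture; n_init restarts guard against poor local optima
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

# score_samples returns the log of the probability density at each point
log_density = gmm.score_samples(X)

# Assumption: treat the 2% least likely points as anomalies
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]
print(len(anomalies), "points flagged")
```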

## Exercise 1: Dimensionality Reduction

Does PCA always reduce model training time and increase model performance?

1. Load the MNIST dataset and split it into a training set and a test set;
2. Train a Random Forest classifier on the dataset,  
    Time how long it takes,   
    Evaluate the resulting model on the test set.
3. Train a Logistic Regression classifier on the dataset,  
    Time how long it takes,   
    Evaluate the resulting model on the test set.
4. Use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%.
5. Train a new Random Forest classifier on the reduced dataset. Was training much faster? Was the performance better?
6. Train a new Logistic Regression classifier on the reduced dataset. Was training much faster? Was the performance better?
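One possible shape for this experiment is sketched below. To keep it quick and free of downloads, it uses scikit-learn's small bundled `load_digits` dataset as a stand-in for MNIST (an assumption of this sketch, so timings and accuracies will differ from the real exercise); the helper `fit_and_score` and the hyperparameters are likewise illustrative.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for MNIST: 8x8 digit images bundled with scikit-learn
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A float n_components keeps enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train_reduced = pca.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca.transform(X_test)

def fit_and_score(model, X_tr, X_te):
    """Return (training time in seconds, test accuracy)."""
    start = time.perf_counter()
    model.fit(X_tr, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(X_te, y_test)

models = {
    "random forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic regression": lambda: LogisticRegression(max_iter=1000),
}

results = {}
for name, make_model in models.items():
    results[name, "original"] = fit_and_score(make_model(), X_train, X_test)
    results[name, "pca"] = fit_and_score(make_model(), X_train_reduced, X_test_reduced)

for key, (seconds, accuracy) in results.items():
    print(key, f"{seconds:.2f}s", f"accuracy={accuracy:.3f}")
```

Comparing the four rows of output is the point of the exercise: the effect of PCA on training time and accuracy need not be the same for both model families.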

## Exercise 2: Clustering

How to choose the number of clusters when using K-means?

1. Load the MNIST dataset;
2. Time one K-Means training;
3. Use PCA for dimension reduction;
4. Train K-Means for several values of k;
5. Evaluate each k using the silhouette score;
6. Visualize the silhouette score and inertia against k;
7. Visualize clusters.
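A minimal sketch of steps 3 to 5 follows, again assuming the small `load_digits` dataset as a stand-in for MNIST and an arbitrary range of k values chosen for illustration. Plotting is left out; the two lists it builds are what would be drawn against k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Stand-in for MNIST; labels are ignored because clustering is unsupervised
X, _ = load_digits(return_X_y=True)

# Reduce dimensionality first so that each K-Means run is cheaper
X_reduced = PCA(n_components=0.95).fit_transform(X)

ks = list(range(2, 13))  # hypothetical search range
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    inertias.append(km.inertia_)  # within-cluster sum of squares
    silhouettes.append(silhouette_score(X_reduced, km.labels_))

# One common heuristic: pick the k with the highest silhouette score
best_k = max(zip(silhouettes, ks))[1]

for k, inertia, sil in zip(ks, inertias, silhouettes):
    print(k, round(inertia), round(sil, 3))
print("best k by silhouette:", best_k)
```

Inertia tends to decrease as k grows, so it is usually read through an "elbow" in the curve, whereas the silhouette score can be compared directly across values of k.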

## Exercise 3: Application to Dynamical Regime Identification - Tracking global Heating with Ocean Regimes (THOR)

### Reading: Tracking global Heating with Ocean Regimes (THOR), a transparent machine learning (ML) method that explains the governing mechanisms of the Atlantic Meridional Overturning Circulation (AMOC)
* Transparent ML
* Dynamics contributing to AMOC changes under a global heating model

### Exercise: Step 1 of THOR - Identify 2D dynamical regimes
1. Data
    * Reduced to 5 dimensions: (1) curlA, (2) curlB, (3) curlTau, (4) curlCori, (5) BPT;
    * i.e., an array of shape (360, 720, 5): five 360x720 fields, so each pixel/cell has 5 features;
    * pixels/cells are to be clustered into groups based on these features.
2. Use Xarray to format data.
3. Use K-Means to cluster the 5D training data;
4. Visualize identified clusters.
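The reshaping needed before clustering gridded data can be sketched as follows. Everything specific here is an assumption made for illustration: the random array standing in for the five fields, the block of NaN cells standing in for land, the standardization step, and the number of clusters. The sketch uses plain NumPy for the reshaping rather than Xarray.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_lat, n_lon, n_features = 360, 720, 5

# Synthetic stand-in for the five fields (curlA, curlB, curlTau, curlCori, BPT)
data = rng.normal(size=(n_lat, n_lon, n_features))
data[:40, :, :] = np.nan  # assumption: cells without data (e.g. land) are NaN

# Flatten the grid: one row per cell, one column per feature
cells = data.reshape(-1, n_features)
valid = ~np.isnan(cells).any(axis=1)

# Assumption: standardize features so that none dominates the distance
X = cells[valid]
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_clusters = 6  # hypothetical; choose k as in Exercise 2
labels_valid = KMeans(n_clusters=n_clusters, n_init=1,
                      random_state=0).fit_predict(X)

# Put the labels back on the grid, leaving NaN where there was no data
labels = np.full(cells.shape[0], np.nan)
labels[valid] = labels_valid
label_map = labels.reshape(n_lat, n_lon)
print(label_map.shape, np.unique(labels_valid))
```

The resulting `label_map` has the same 360x720 layout as the input fields, so it can be displayed as an image or map with one color per cluster.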