<img src=images/gdd-logo.png width=300px align=right>

# Unsupervised Learning

**Supervised learning** algorithms are machine learning algorithms that learn with the help of external feedback: the algorithm makes a prediction, compares its prediction with a provided ground truth, and "learns" by adjusting its internal parameters. 

In contrast, the techniques described in this lecture do not rely on some external notion of what is or is not correct; this class of learning techniques is referred to as **unsupervised learning**. 

In this notebook, we will take a look at one of the most popular dimensionality reduction techniques: principal component analysis. 

# PCA
- [Introduction to PCA](#intuition)
- [PCA for visualisation](#vis)
- [PCA for classification](#clas)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id='intuition'></a>
## Introduction to PCA
### PCA: Motivation
PCA is a technique used for dimensionality reduction, which can have a number of practical benefits:
- **Data Compression**: In high-dimensional spaces, storing and processing fewer dimensions saves computational resources.
- **Visualization**: When reducing to 2 or 3 dimensions, it’s easier to plot and analyze patterns in the data.
- **Noise Reduction**: PCA can help filter out noise by discarding components with low variance (which often represent irrelevant or random fluctuations).


### PCA: Intuition

The idea behind Principal Component Analysis (PCA) is rooted in the fact that, although our data may exist in a high-dimensional space (e.g., hundreds or thousands of features), often the important patterns or structures in the data can be captured in a lower-dimensional space without losing much information.

Imagine you’re looking at a cloud of points in 3D space (like a scatterplot of stars in the sky). While this cloud seems to span three dimensions, it might actually lie mostly along a flat plane within the 3D space. If that's true, we could describe the entire cloud using just two dimensions (the x and y coordinates of the plane) instead of all three. PCA helps us identify this lower-dimensional structure by:

- Finding the direction in which the data varies the most (called the first principal component).
- Finding additional directions (each orthogonal to the previous ones) that capture the remaining variability.

For data in *n*-dimensional space, there will be at most *n* principal components.
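
The two steps above can be sketched with plain NumPy. This is a minimal illustration rather than how any particular library implements PCA: it assumes the principal components are taken as the eigenvectors of the covariance matrix, and the toy data and all variable names (`X_demo`, `components`, `variances`) are made up for this example.

```python
import numpy as np

# Toy data: 200 points in 2D, stretched along one direction (made up for illustration).
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 2) @ np.array([[2.0, 0.0], [1.2, 0.5]])

# 1. Centre the data so that every feature has mean zero.
X_centered = X_demo - X_demo.mean(axis=0)

# 2. Covariance matrix of the features (rowvar=False: columns are features).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvectors are the candidate directions, eigenvalues the variance along them.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. eigh returns eigenvalues in ascending order, so reverse to put the largest first.
order = np.argsort(eigenvalues)[::-1]
variances = eigenvalues[order]
components = eigenvectors[:, order].T  # one principal component per row

print("variance along each component:", variances)
print("principal components (rows):\n", components)
```

The first row of `components` is the direction of maximum variance, and the second row is perpendicular to it.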

<img src="images/pca.png" alt="PCA Illustration" height=600 width=600>

### Why does this work?
1. Redundancy in Features:
    - Real-world datasets often have features that are highly correlated. For example, "height in inches" and "height in centimeters" are essentially the same information. PCA merges such redundant information into a single component.

2. Focus on Variation:
    - PCA identifies the directions where the data changes the most. These are the directions that carry the most information about the dataset.
    - By ignoring directions with minimal variation (which often represent noise or irrelevant details), PCA simplifies the data without losing its core structure.
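
The redundancy argument can be checked on a small made-up example. The sketch below assumes synthetic heights (not data from this notebook) and stores them twice, once in inches and once in centimeters; the names `X_heights` and `pca_heights` are only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
height_cm = rng.normal(loc=170, scale=10, size=300)  # made-up heights
height_in = height_cm / 2.54                         # same information, other unit

# Two columns that carry exactly the same information.
X_heights = np.column_stack([height_in, height_cm])

pca_heights = PCA(n_components=2).fit(X_heights)
print(pca_heights.explained_variance_ratio_)
```

Because the two columns are perfectly correlated, the first component captures essentially all of the variance and the second is left with nothing.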
  
### 2D Example

Let’s take a 2D example to visualize the process:

In [None]:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');

**Principal Components**:
- The first principal component (PC1) will be the line that passes through the middle of the scatter and points in the direction of maximum spread of the points.
- The second principal component (PC2) is orthogonal (perpendicular) to the first and represents the next largest direction of variation.

By projecting the data onto PC1 and PC2, you essentially rotate the axes to align with the directions of the most important variations, simplifying how the data is represented.

Let's see how it works using **scikit-learn**! 

`PCA` is just like any other scikit-learn transformer. However, as it is an **unsupervised** method, it only takes X and **no y**. 

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)

As the estimator has been fitted to the data, some attributes are set. 

You can then extract the principal axes in feature space, representing the directions of maximum variance in the data.

In [None]:
pca.components_
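
As a sanity check on what `components_` contains, the sketch below refits PCA on data generated with the same recipe as the 2D example (under new names such as `X_toy` and `pca_toy`, so nothing above is overwritten). It assumes the default `whiten=False`, in which case transforming the data amounts to centring it and projecting it onto the principal axes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_toy = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T  # same recipe as the 2D example

pca_toy = PCA(n_components=2).fit(X_toy)
axes = pca_toy.components_  # shape (n_components, n_features)

# Each row has length 1 and the rows are perpendicular to each other,
# so this product is (numerically) the identity matrix.
print(np.round(axes @ axes.T, 6))

# Transforming = centring the data and projecting it onto the principal axes.
projected_by_hand = (X_toy - pca_toy.mean_) @ axes.T
print(np.allclose(projected_by_hand, pca_toy.transform(X_toy)))
```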

Below, you can see the amount of variance explained by each of the selected components.

In [None]:
pca.explained_variance_

Lastly, the fraction of variance explained by each of the selected components can also be extracted. As the number of principal axes equals the original number of axes, these fractions sum to 1.0.

In [None]:
pca.explained_variance_ratio_
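
The relation between the two attributes can be verified by hand. The sketch below regenerates the toy data under separate names (`X_toy`, `pca_toy`) and assumes that the total variance is the sum of the per-feature sample variances with `ddof=1`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
X_toy = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T  # same recipe as the 2D example

pca_toy = PCA(n_components=2).fit(X_toy)

# Total variance of the original features (sample variance, ddof=1).
total_variance = X_toy.var(axis=0, ddof=1).sum()

# The ratio is each component's variance divided by the total variance.
ratio_by_hand = pca_toy.explained_variance_ / total_variance
print(ratio_by_hand)
print(np.allclose(ratio_by_hand, pca_toy.explained_variance_ratio_))
```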

Let's visualise these values. 

In [None]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=2,
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

In [None]:
# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)

for length, vector in zip(pca.explained_variance_, pca.components_):
    # Scale the vector based on the explained variance
    v = vector * 3 * np.sqrt(length)
    
    # Draw vector in both directions
    draw_vector(pca.mean_ - v, pca.mean_ + v)

plt.axis('equal');

<details>
    
  <summary><span style="color:blue">What is that 3 doing there?</span></summary>
  
The number 3 in this line:
    
`v = vector * 3 * np.sqrt(length)`
    
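is a scaling factor used to make the principal component vectors visually longer so they stand out clearly when plotted over the data. Since `np.sqrt(length)` is the standard deviation of the data along a component, each arrow reaches three standard deviations from the mean, which covers most of the points. The factor plays no role in PCA itself; it's purely for visualization purposes.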

</details>



So now that you have seen how to compute principal components, let's have a go at using them for dimensionality reduction.

#### <mark>Exercise: Compress to one component</mark>

1. Perform PCA with only one component on the feature matrix *X*.

In [None]:
## add your code

In [None]:
# %load answers/pca-one-component.py

2. Check the shape of the data before and after. Has it worked?

In [None]:
## add your code

3. Compare the output of the following cells to the PCA with 2 components: what is different?

In [None]:
pca.components_

In [None]:
pca.explained_variance_

In [None]:
pca.explained_variance_ratio_

Let's visualise this as well.

In [None]:
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], color='blue', alpha=0.7)
plt.axis('equal');

In [None]:
retained_info = pca.explained_variance_ratio_.sum()*100
n_components = pca.components_.shape[0]
print(f'In total, you managed to preserve {retained_info:.2f}% of information with {n_components} component(s)')

<a id='vis'></a>
## PCA for visualisation

Below, we have a dataset with measurements of abnormalities (nodules) that may be breast cancer (malignant) or simply benign. For each nodule, there are 30 different types of measurements. Taking these measurements is a time-consuming, labour-intensive task that the radiologist is tasked with, which limits the number of patients they can review in a single day. Therefore, our aim is not only to create a model that can accurately predict whether a nodule is malignant or benign, but also to see if we can limit the number of measurements the radiologist has to take.

In [None]:
df = pd.read_csv("data/cancer.csv")
df.head()

Let's start with some early preprocessing: split the data into X and y and remove unnecessary columns like 'Unnamed' and 'id'.

In [None]:
df = df.drop(['Unnamed: 32'], axis='columns').dropna()
y = df['diagnosis']
X = df.drop(['id','diagnosis'], axis = 'columns')

X.shape, y.shape

### Scaling

When you apply PCA on any regular dataset, scaling is **hugely** important. 

<mark>**Question:** Why do you think this is so important?</mark>
<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
In PCA, we are interested in the components that maximize the variance. If one feature (e.g. height) varies less than another (e.g. weight) purely because of their respective scales (meters vs. kilograms), PCA on unscaled data will find that the direction of maximal variance corresponds closely to the ‘weight’ axis. A change in height of one meter is far more significant than a change in weight of one kilogram, so this result is clearly misleading. This is why we scale the data with the `StandardScaler` before we apply PCA.

</details>



Let's scale the data using a standard scaler.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

What would you now expect the mean and standard deviation to be?

In [None]:
# check mean
X_scaled.mean()

In [None]:
# check standard deviation
X_scaled.std()
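
The checks above average over the whole matrix. Because `StandardScaler` works column by column, it can be more informative to look at each feature separately. The sketch below does so on made-up data with two features on very different scales; `X_raw` and `X_std` are illustrative names only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Two made-up features on very different scales.
X_raw = np.column_stack([
    rng.normal(loc=1.7, scale=0.1, size=100),    # e.g. height in meters
    rng.normal(loc=70.0, scale=15.0, size=100),  # e.g. weight in kilograms
])

X_std = StandardScaler().fit_transform(X_raw)

# axis=0 computes the statistic per column (per feature).
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After scaling, every column should have a mean of (numerically) zero and a standard deviation of one, regardless of its original units.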

As we're interested in visualising our data first, we choose either **two** or **three** components. 

In [None]:
df_scaled = pd.DataFrame(X_scaled, columns=X.columns)
df_scaled.head()

The data seems to be properly scaled, which means you can apply PCA! 

In [None]:
from sklearn.decomposition import PCA

pca_transformer = PCA(n_components=2)
pca_data = pca_transformer.fit_transform(X_scaled)

In [None]:
X_scaled.shape

In [None]:
pca_data.shape

In [None]:
df_pca = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])
df_pca.head()

In [None]:
pca_transformer.explained_variance_ratio_

In [None]:
retained_info = pca_transformer.explained_variance_ratio_.sum()*100
n_components = pca_transformer.components_.shape[0]
print(f'In total, you managed to preserve {retained_info:.2f}% of information with {n_components} component(s)')

Let's visualise our data on these components.  

In [None]:
def visualise(df_pca, targets, colors, labels):
    plt.figure()
    plt.title('Principal Component Analysis')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')

    for i in range(len(targets)): 
        indices = labels == targets[i]
        plt.scatter(
            df_pca.loc[indices, 'PC1'],
            df_pca.loc[indices, 'PC2'],
            c = colors[i], s = 20)

    plt.legend(targets);
    
targets = ['B', 'M']
colors = ['g', 'r']
visualise(df_pca, targets, colors, y)

This looks quite nice! The dataset seems pretty linearly separable based on these two components. 

How would it look if you hadn't scaled the data?

In [None]:
# Unscaled version
pca_transformer = PCA(n_components=2)
pca_data = pca_transformer.fit_transform(X)

df_pca_unscaled = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])
visualise(df_pca_unscaled, targets, colors, y)

Much less nice! 
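
To see why the unscaled version behaves so differently, it can help to inspect the first principal axis directly. The sketch below uses two made-up, independent features (height in meters and weight in kilograms, not columns from the cancer dataset) and compares the first component with and without scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
height_m = rng.normal(1.7, 0.1, size=500)     # small numeric spread
weight_kg = rng.normal(70.0, 15.0, size=500)  # large numeric spread
X_hw = np.column_stack([height_m, weight_kg])

# First principal axis on the raw data and on the standardised data.
pc1_unscaled = PCA(n_components=1).fit(X_hw).components_[0]
pc1_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X_hw)).components_[0]

# Absolute values, because the sign of a principal axis is arbitrary.
print("PC1 without scaling:", np.round(np.abs(pc1_unscaled), 3))
print("PC1 with scaling:   ", np.round(np.abs(pc1_scaled), 3))
```

Without scaling, the first component points almost entirely along the weight axis, simply because weight has the larger numbers. With scaling, both features contribute.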

### <mark>Exercise: Visualise the Palmer Penguins dataset with PCA.</mark>

Let's load in the data.

In [None]:
penguins = pd.read_csv('data/penguins.csv')
penguins.head() 

PCA cannot be performed on rows with missing values and does not perform well on categorical data. Hence we shall process our dataset accordingly.

In [None]:
# Preprocessing
penguins_processed = penguins.dropna()
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

X_penguins = penguins_processed.loc[:, feature_columns]
y_penguins = penguins_processed.loc[:, 'species'].reset_index(drop=True)

print(f'The shape of feature matrix X is: {X_penguins.shape}')
print(f'The shape of target vector y is: {y_penguins.shape}')

Now,
1. Scale the dataset,
2. Perform PCA, 
3. Visualise the PCA transformation (using the *visualise* function from earlier), and
4. Calculate how much information was retained.

<details>
    
  <summary><span style="color:blue">Show hint</span></summary>
  
When visualising the PCA transformation, make sure:
    
- That you use the right input data in every transformation step;
- That the `targets` and `colors` variables contain appropriate values for this penguin dataset (e.g. the targets are not "B" and "M" anymore).

</details>

In [None]:
# add your code

In [None]:
# %load answers/ex-PCA.py

### PCA and categorical values

When you preprocessed the dataset for PCA, you only took a few columns into consideration. _Species_ was dropped as this is the target variable, but _island_ and _sex_ were also dropped. 

<mark>**Question:** What's different about these and why might you not want to include them in the PCA?</mark>

<details>
    
  <summary><span style="color:blue">Show answer</span></summary>
  
PCA is designed for _continuous_ variables. It looks for the directions that maximize the variance (the squared deviations). However, the concept of squared deviations breaks down when you have binary variables (the output of one-hot encoding your categorical variables). This means that PCA can be used (in the sense that you get an output) but that output will be less _meaningful_ than you'd want it to be.

</details>


<a id='clas'></a>
## PCA for classification

There is a large variety of algorithms to choose from for classification purposes, each of which has its own advantages and disadvantages. Some of these algorithms, including the k-nearest neighbors algorithm, are not very suitable for high-dimensional data. In the case of k-nearest neighbors, this is because the distance measure becomes less informative in high-dimensional data: there is little difference between the distances to the nearest and farthest neighbors.
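
This effect can be observed on random data. The sketch below is an illustration under simple assumptions (uniformly random points in a unit cube, Euclidean distance); the helper `nearest_to_farthest_ratio` is a hypothetical name introduced for this example.

```python
import numpy as np

def nearest_to_farthest_ratio(n_features, rng, n_points=500):
    """Ratio of the nearest to the farthest distance from one random query point."""
    points = rng.rand(n_points, n_features)  # uniform random points in the unit cube
    query = rng.rand(n_features)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

rng = np.random.RandomState(0)
for n_features in (2, 10, 100, 1000):
    ratio = nearest_to_farthest_ratio(n_features, rng)
    print(f"{n_features:>5} features: nearest/farthest = {ratio:.3f}")
```

A ratio close to 1 means the nearest and the farthest point are almost equally far away, so "nearest neighbor" carries little information. The ratio tends to grow as the number of features increases.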

Feature selection focuses on **discarding irrelevant features**, which k-nearest neighbors is also not very robust against as it assigns the same importance to all features; discarding features automatically reduces the dimensionality. Alternatively, to relieve the problem of high dimensionality, dimensionality reduction techniques can be applied to transform the data before we train our classifier.

Let's see that in action on the breast cancer dataset! 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=111)

In [None]:
X_train.head()

In [None]:
y_train.head()

With the train-test split created, the next step is a baseline pipeline.

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

base = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier())
])

base.fit(X_train, y_train)
y_pred = base.predict(X_test)
accuracy_score(y_test, y_pred)

Next, you want to extend this pipeline with PCA. Remember that scaling must be performed _before_ PCA is used to transform the data! However, how do you choose the right number of components?

First of all, there are 30 features to start with, so your number of components can be at most 30 (and must be lower if you want to reduce the dimensionality). What happens if you apply PCA with no number of components specified?

In [None]:
# Scale the data. 
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA. 
pca_transformer = PCA()
pca_data = pca_transformer.fit_transform(X_scaled)

In [None]:
len(pca_transformer.components_)

PCA without the number of components specified gives you back the same number of components as the number of features you put in: 30. This also means that, with these 30 components, you should be able to perfectly model the data. You can verify this by looking at the explained variance ratio, which should sum to 1.0.

In [None]:
sum(pca_transformer.explained_variance_ratio_)

So what determines the number of components you choose if you want to reduce the dimensionality? You want the smallest number of components that still models the data accurately. Let's visualise this.

In [None]:
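fig = plt.figure()
# Start the x-axis at 1 so that it shows the actual number of components
cumulative_variance = np.cumsum(pca_transformer.explained_variance_ratio_)
plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');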

In [None]:
pca_transformer.explained_variance_ratio_
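
Instead of reading the number of components off the graph, you can also compute it for a chosen threshold. The sketch below makes two assumptions: it uses the breast cancer dataset bundled with scikit-learn as a stand-in for `data/cancer.csv`, and it picks 95% as an arbitrary example threshold. It then compares a manual calculation with passing a fraction as `n_components`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for data/cancer.csv: the breast cancer dataset bundled with scikit-learn.
X_bc, _ = load_breast_cancer(return_X_y=True)
X_bc_scaled = StandardScaler().fit_transform(X_bc)

threshold = 0.95  # arbitrary example value

# Option 1: find the first point where the cumulative curve reaches the threshold.
cumulative = np.cumsum(PCA().fit(X_bc_scaled).explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Option 2: pass the fraction as n_components and let PCA choose the number.
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X_bc_scaled)

print("by hand:", n_needed)
print("chosen by PCA:", pca_auto.n_components_)
```

A variance threshold is only a heuristic. For classification, the number of components that gives the best accuracy may be different, which is worth keeping in mind for the exercise below.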

### <mark>Exercise: Include PCA in the pipeline</mark>
Extend the pipeline to include PCA. Plug in various values based on the graph. What gives you the best results? Can you explain that?

In [None]:
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier())
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)

# Conclusion

PCA is a powerful and fast _unsupervised_ approach for reducing the number of dimensions in your dataset by finding the principal components in the dataset. It is important to scale the data before applying PCA.