In [None]:
# Run this cell.
from lec_utils import *
plotly.io.renderers.default = 'notebook'

#### DAIR-3 Workshop, Day 2 • Building Robust ML Models

# Part 2: Dimensionality Reduction

**Instructor**: Suraj Rampure (rampure@umich.edu)

### Outline

- Seeds and stratification in train-test splits.
- PCA.
- Other dimensionality reduction techniques.
    - MDS.
    - t-SNE.

### Seeds in train-test splits

- Run the cell below to import the breast cancer dataset from the previous notebook.

In [None]:
from sklearn.datasets import load_breast_cancer

full = load_breast_cancer()
df = pd.DataFrame(full['data'], columns=full['feature_names'])
# sklearn codes malignant as 0 and benign as 1; flip so that 1 = malignant.
df['target'] = 1 - full['target']
df

<div class="alert alert-danger"><h3>Warning #1: Missing Random Seeds</h3>
    
When building a predictive model, **always set a random seed** when splitting the data into training and testing sets. The splitting is done randomly; a seed ensures the same results every time, so others can reproduce your work.
    
<br>
    
Pick a random seed arbitrarily and stick with it; **don't** intentionally select a seed that yields "good" results, as this will lead to an overly confident assessment of model performance.
    
</div>

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), 
                                                    df['target'],
                                                    random_state=23)

X_train

- All model development should be done on the training data only.
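
- As a minimal sketch of what the seed buys you, the cell below splits the same data twice with `random_state=23` and once with a different seed. The names `same_seed_matches` and `different_seed_matches`, and the second seed `24`, are chosen here purely for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same seed twice: both splits select exactly the same rows, in the same order.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=23)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, random_state=23)
same_seed_matches = X_train_a.index.equals(X_train_b.index)

# A different seed (24 is arbitrary) shuffles the rows differently.
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X, y, random_state=24)
different_seed_matches = X_train_a.index.equals(X_train_c.index)

print(same_seed_matches, different_seed_matches)
```

- Anyone who reruns the notebook with the same seed gets the same training rows, so downstream numbers are comparable.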

### Stratification

- Peek at the distribution of malignant (1) and benign (0) tumors in both the training and test sets.

In [None]:
# Shows that 38.49% of patients in the training set have a malignant tumor. Why?
y_train.mean()

In [None]:
y_test.mean()

- Malignant tumors are not particularly rare in this dataset.<br>If they were, and we wanted to ensure the same frequency of malignant tumors in both our training and test sets, we could use the `stratify` parameter.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1),
                                                    df['target'],
                                                    random_state=23,
                                                    stratify=df['target'])

In [None]:
y_train.mean()

In [None]:
y_test.mean()
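
- To see stratification matter more, here is a sketch on synthetic data with a rare positive class. The data (1,000 rows, 4% positives) and all variable names are assumptions made for illustration; they are not part of the breast cancer dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(23)
X_rare = rng.normal(size=(1000, 3))
y_rare = np.zeros(1000, dtype=int)
y_rare[:40] = 1  # exactly 4% of rows are positive

# Without stratification, the positive rate in each split is left to chance.
_, _, y_tr_plain, y_te_plain = train_test_split(X_rare, y_rare, random_state=23)

# With stratification, both splits keep (as closely as possible) the overall rate.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_rare, y_rare, random_state=23, stratify=y_rare
)

print('plain:     ', y_tr_plain.mean(), y_te_plain.mean())
print('stratified:', y_tr_strat.mean(), y_te_strat.mean())
```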

### Dimensionality reduction

- The breast cancer dataset has 30 **features** (input variables).

In [None]:
X_train

- Many of these features may be redundant or noisy.<br>It's also impossible to visualize the entire dataset as-is.

In [None]:
X_train.corr()

- By converting this $(426 \times 30)$ dataset to a $(426 \times p)$ dataset, for some relatively small $p$, we can attempt to:
    - Preserve the majority of the signal in the dataset.
    - Add the ability to visualize the data.
    
<center><img src="images/dim-red.svg" width=1000></center>

- Dimensionality reduction is a form of **unsupervised learning**.

## PCA

---

### Principal component analysis (PCA)

- Principal component analysis (PCA) is one dimensionality reduction technique.

- It creates $p$ **new features**, each of which is a **linear combination** of all 30 existing features.


    $$\text{new feature 1} = 0.05 \cdot \text{mean radius} + 0.93 \cdot \text{mean texture} + ... - \: 0.35 \cdot \text{worst fractal dimension}$$

    $$\text{new feature 2} = - 0.06 \cdot \text{mean radius} + 0.5 \cdot \text{mean texture} + ... + \: 0.04 \cdot \text{worst fractal dimension}$$

    $$...$$


    These new features are chosen to capture as much variability (information) in the original data as possible, while being **orthogonal** (uncorrelated) to one another.

- It leverages the **singular value decomposition** from linear algebra:

$$X = U \Sigma V^T$$
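
- As a rough sketch of how the SVD leads to principal components, the cell below centers a small synthetic matrix, takes its SVD, and projects onto the first $p$ rows of $V^T$. The synthetic data and the names `manual_pcs` and `sklearn_pcs` are assumptions for illustration; `sklearn` may use a different solver internally, so the two results are only expected to agree up to the sign of each column.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
# Synthetic data with correlated columns and a nonzero mean.
X_demo = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5)) + 10

# Center each column, then decompose the centered matrix.
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first p right singular vectors to get the new features.
p = 2
manual_pcs = X_centered @ Vt[:p].T

sklearn_pcs = PCA(n_components=p).fit_transform(X_demo)

# Compare absolute values, since each PC is only defined up to a sign flip.
agree = np.allclose(np.abs(manual_pcs), np.abs(sklearn_pcs), atol=1e-6)
print(agree)
```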

### PCA in `sklearn`

- `sklearn` has an implementation of PCA.<br><small>Remember, PCA is an **unsupervised** technique, so the `'target'` column is irrelevant.</small>

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_train)

- Once `fit`, `pca` can transform `X_train` into a **$p$-column matrix** in a way that retains the bulk of the information.<br><small>We'll use $p = 2$ to make it easy to visualize the resulting components.</small>

    $$\mathbb{R}^{426 \times 30} \rightarrow \mathbb{R}^{426 \times p}$$

In [None]:
X_train

In [None]:
X_train_transformed = pca.transform(X_train)
X_train_transformed

- The above 2 new features are called "principal components".<br>Their values aren't directly interpretable (where do they come from?), but the new features are all uncorrelated.

In [None]:
np.corrcoef(X_train_transformed.T) # Note that the correlation between the new features is 0 (up to floating-point error).

### Visualizing principal components

- The resulting principal components aren't directly interpretable, but sometimes have intuitive interpretations.

In [None]:
fig_pca = px.scatter(x=X_train_transformed[:, 0], y=X_train_transformed[:, 1], color=y_train.replace({0: 'benign', 1: 'malignant'}))
fig_pca.update_layout(xaxis_title='PC 1', yaxis_title='PC 2', title='PCA Projection of Breast Cancer Data')

- Remember, the target variable **was not** used in computing the principal components.<br>However, clusters (in the target variable) _may_ naturally form.

### Scree plot

- We said that PCA aims to retain the **bulk** of the information in the original dataset. Let's be more precise.

- Each principal component (new feature) has associated with it a "score", which describes the **proportion of variance in the original dataset that is captured by the PC**.<br><small>The variance captured by PC $i$ is $\frac{\sigma_i^2}{n - 1}$, where $\sigma_i$ is the $i$th entry in the diagonal matrix $\Sigma$ from $X = U \Sigma V^T$ (with $X$ centered) and $n$ is the number of rows in the dataset. The score divides this by the total variance, giving $\frac{\sigma_i^2}{\sum_j \sigma_j^2}$.</small>

- A **scree plot** shows the explained variance ratio vs. the number of principal components, and is useful in helping choose the number of principal components to retain.

In [None]:
pca_large_p = PCA(n_components=10)
pca_large_p.fit(X_train)

In [None]:
fig = px.scatter(x=np.arange(1, 11), y=pca_large_p.explained_variance_ratio_)
fig.update_layout(xaxis_title='Number of Principal Components', yaxis_title='Proportion of Variation Explained')

- PC 1 captures the most variability, then PC 2, then PC 3, and so on.
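
- The relationship between singular values and explained variance can be checked numerically. This is a sketch on synthetic data (an assumption for illustration); `singular_values_`, `explained_variance_`, and `explained_variance_ratio_` are attributes of a fitted `PCA` object, and all components are kept so that the ratios sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(23)
X_demo = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
n = X_demo.shape[0]

# Keep all 6 components so the singular values account for all of the variance.
pca_all = PCA(n_components=6).fit(X_demo)
sigma = pca_all.singular_values_

variance_per_pc = sigma ** 2 / (n - 1)          # variance captured by each PC
ratio_per_pc = sigma ** 2 / np.sum(sigma ** 2)  # proportion of total variance
cumulative = np.cumsum(ratio_per_pc)            # running total, useful for choosing p

print(np.allclose(variance_per_pc, pca_all.explained_variance_))
print(np.allclose(ratio_per_pc, pca_all.explained_variance_ratio_))
print(cumulative)
```

- A common heuristic is to keep the smallest number of PCs whose cumulative proportion exceeds a threshold you choose in advance.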

### Directions

- To peek into how each PC was defined, we can look at the **loadings** vectors.<br><small>These correspond to the rows of the matrix $V^T$ in $X = U \Sigma V^T$.</small>

In [None]:
# 2 loadings vectors, since we created 2 PCs.
# These are being rounded; they aren't sparse.
pca.components_

- Sometimes, the loadings vectors give intuitive definitions for the PCs.

In [None]:
px.bar(y=pca.components_[0], x=pca.feature_names_in_, title='Coefficients used to construct PC 1')

In [None]:
px.bar(y=pca.components_[1], x=pca.feature_names_in_, title='Coefficients used to construct PC 2')
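
- The loadings are exactly the coefficients of the linear combinations described earlier. As a sketch, the cell below rebuilds the PCs by hand from `mean_` and `components_` and compares them with `transform`. For self-containment it fits on all 569 rows rather than on `X_train`, and the names `pca_demo`, `manual`, and `top_features` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X_all = load_breast_cancer(as_frame=True).data  # all rows; no train-test split here

pca_demo = PCA(n_components=2).fit(X_all)

# Each PC is a weighted sum of the centered original features.
X_centered = X_all.to_numpy() - pca_demo.mean_
manual = X_centered @ pca_demo.components_.T

matches = np.allclose(manual, pca_demo.transform(X_all), atol=1e-6)

# The three features with the largest absolute weight in PC 1.
pc1_weights = dict(zip(X_all.columns, pca_demo.components_[0]))
top_features = sorted(pc1_weights, key=lambda name: abs(pc1_weights[name]), reverse=True)[:3]

print(matches)
print(top_features)
```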

<div class="alert alert-danger"><h3>Warning #2: Inconsistent Variance/Standard Deviation Formulas</h3>
    
Depending on the package, the default method for computing the variance of a dataset $x_1, x_2, ..., x_n$ may be the "population" formula:
    
$$\sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \:\:\:\:\:\:\:\: (\text{ddof}=0)$$
    
**or** the "sample" formula:
    
$$\sigma_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\:\:\:\:\:\:\:\: (\text{ddof}=1)$$
    
</div>

In [None]:
X_train_transformed

In [None]:
# Variance of each new PC.
X_train_transformed.var(axis=0)

In [None]:
# Should also be the variance of each new PC!
pca.explained_variance_

In [None]:
X_train_transformed.var(axis=0, ddof=1)
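
- The same discrepancy shows up between NumPy and pandas on a tiny example. NumPy's `var` defaults to `ddof=0`, while pandas' `Series.var` defaults to `ddof=1`; the four-element list below is made up for illustration.

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]  # mean 2.5, sum of squared deviations 5.0

numpy_var = np.var(values)            # population formula: 5.0 / 4
pandas_var = pd.Series(values).var()  # sample formula: 5.0 / 3

# Passing ddof explicitly makes the two libraries agree.
numpy_sample_var = np.var(values, ddof=1)
pandas_population_var = pd.Series(values).var(ddof=0)

print(numpy_var, pandas_var)
print(numpy_sample_var, pandas_population_var)
```

- When comparing variances across libraries, pass `ddof` explicitly rather than relying on defaults.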

### Read the documentation!

**Activity**: Look at the documentation for `sklearn.decomposition.PCA` by running the cell below (or going [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)). Identify **2 unexpected operations** that it performs, without explicitly saying so.

<div class="alert alert-danger"><h3>Warning #3: Centering with PCA</h3>
    
Before applying PCA with singular value decomposition, you must **center each column** by subtracting each column's mean from all values in that column.
    
<br>

Most PCA implementations handle this centering automatically, but always check your library's documentation to be sure. If you're using SVD directly rather than a dedicated PCA function, you'll need to center the data manually first.
    
</div>

## Other dimensionality reduction techniques

---

### MDS

- Multidimensional scaling (MDS) is a dimensionality reduction technique that aims to **preserve pairwise (Euclidean) distances between points**.<br><small>If two individuals are close in the original space, they will be close in the lower-dimensional projected space, and if they are far in the original space, they will be far in the lower-dimensional projected space.</small>

In [None]:
from sklearn.manifold import MDS
mds = MDS(n_components=2, random_state=23)
X_train_mds = mds.fit_transform(X_train) # Only uses X_train, not y_train!
fig_mds = px.scatter(x=X_train_mds[:, 0], y=X_train_mds[:, 1], color=y_train.replace({0: 'benign', 1: 'malignant'}))
fig_mds.update_layout(title='MDS Projection of Breast Cancer Data')

- Note that unlike PCA, the algorithm is not deterministic, so a random seed should be set.

- The features produced by MDS are not interpretable, so MDS is mostly a visualization tool, unless you have a good reason to believe that the distances between points in the original space are meaningful.
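
- One way to check how well MDS preserved distances is to compare pairwise distances before and after the projection. The sketch below uses synthetic 5-dimensional points (an assumption for illustration, not the breast cancer data) and `scipy.spatial.distance.pdist`; the correlation it prints is a rough diagnostic, not a formal stress measure.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(23)
points = rng.normal(size=(40, 5))  # 40 synthetic points in 5 dimensions

# n_init=4 runs the optimizer from 4 random starts and keeps the best result.
mds_demo = MDS(n_components=2, n_init=4, random_state=23)
embedded = mds_demo.fit_transform(points)

# Pairwise Euclidean distances in the original and projected spaces.
original_d = pdist(points)
embedded_d = pdist(embedded)

# A correlation near 1 means far points stayed far and close points stayed close.
distance_corr = np.corrcoef(original_d, embedded_d)[0, 1]
print(embedded.shape, distance_corr)
```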

### MDS on Dissimilarity Matrices

- MDS can be used directly on a **dissimilarity matrix** (i.e., a matrix of pairwise "distances" between points, where distances are calculated in some meaningful way).

In [None]:
cities = ['Ann Arbor', 'Toronto', 'San Diego', 'London', 'Copenhagen', 'Singapore']

# Distances in miles (as the crow flies, approximate)
# Sources: great circle distance calculators
distances_miles = [
    [0,     153,   1950,  3760,  4120,  9360],
    [153,     0,   2130,  3550,  3910,  9460],
    [1950, 2130,      0,  5470,  5790,  8800],
    [3760, 3550,   5470,     0,   600,  6740],
    [4120, 3910,   5790,   600,     0,  6130],
    [9360, 9460,   8800,  6740,  6130,     0]
]

distances = pd.DataFrame(distances_miles, index=cities, columns=cities)
distances

In [None]:
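# Also non-deterministic!
# dissimilarity='precomputed' tells MDS that the input already holds pairwise distances;
# without it, each row would be treated as a 6-dimensional feature vector.
mds_cities = MDS(n_components=2, dissimilarity='precomputed', random_state=23)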
cities_transformed = mds_cities.fit_transform(distances_miles)
px.scatter(
    x=cities_transformed[:, 0],
    y=cities_transformed[:, 1],
    text=cities
).update_traces(textposition='middle right').update_layout(xaxis_range=[-10 ** 4, 10 ** 4])

### $t$-SNE

- t-SNE (t-distributed stochastic neighbor embedding) is a dimensionality reduction technique that aims to **preserve local structure**, mostly for the purpose of forming clusters for visualization.

In [None]:
from sklearn.manifold import TSNE
# t-SNE is also stochastic, so set a seed for reproducibility.
tsne = TSNE(n_components=2, random_state=23)
X_train_tsne = tsne.fit_transform(X_train)
fig_tsne = px.scatter(x=X_train_tsne[:, 0], y=X_train_tsne[:, 1], color=y_train.replace({0: 'benign', 1: 'malignant'}))
fig_tsne.update_layout(title='t-SNE Projection of Breast Cancer Data')

- Again, the resulting features are not interpretable in the way PCA features are.
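
- "Preserving local structure" can be made concrete by checking whether each point's nearest neighbor in the 2-D embedding comes from the same group as in the original space. This is a sketch on two synthetic, well-separated blobs; the data, the `perplexity=15` setting, and the `same_blob_fraction` diagnostic are all assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(23)
# Two well-separated blobs in 10 dimensions, 50 points each.
blob_a = rng.normal(loc=0.0, size=(50, 10))
blob_b = rng.normal(loc=8.0, size=(50, 10))
points = np.vstack([blob_a, blob_b])
labels = np.array([0] * 50 + [1] * 50)

tsne_demo = TSNE(n_components=2, perplexity=15, random_state=23)
embedded = tsne_demo.fit_transform(points)

# For each point, find its nearest neighbor in the embedding
# (column 0 is the point itself, column 1 is its closest other point).
_, idx = NearestNeighbors(n_neighbors=2).fit(embedded).kneighbors(embedded)
same_blob_fraction = np.mean(labels[idx[:, 1]] == labels)

print(embedded.shape, same_blob_fraction)
```

- Distances between clusters in a t-SNE plot should not be read quantitatively; only the local neighborhoods are meant to be faithful.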

### Comparing dimensionality reduction techniques

- A good discussion of the differences between the three techniques can be found [here](https://orangedatamining.com/blog/pca-vs-mds-vs-t-sne/).

- In short:
    - Use PCA to produce interpretable, new features.
    - Use MDS and t-SNE to visualize the data.

In [None]:
import plotly.graph_objects as go

from plotly.subplots import make_subplots

fig_compare = make_subplots(rows=1, cols=3, subplot_titles=["PCA", "MDS", "t-SNE"])

# PCA plot
for trace in fig_pca.data:
    fig_compare.add_trace(trace, row=1, col=1)
fig_compare.update_xaxes(title_text=fig_pca.layout.xaxis.title.text, row=1, col=1)
fig_compare.update_yaxes(title_text=fig_pca.layout.yaxis.title.text, row=1, col=1)

# MDS plot
for trace in fig_mds.data:
    fig_compare.add_trace(trace, row=1, col=2)
fig_compare.update_xaxes(title_text=fig_mds.layout.xaxis.title.text, row=1, col=2)
fig_compare.update_yaxes(title_text=fig_mds.layout.yaxis.title.text, row=1, col=2)

# t-SNE plot
for trace in fig_tsne.data:
    fig_compare.add_trace(trace, row=1, col=3)
fig_compare.update_xaxes(title_text=fig_tsne.layout.xaxis.title.text, row=1, col=3)
fig_compare.update_yaxes(title_text=fig_tsne.layout.yaxis.title.text, row=1, col=3)

fig_compare.update_layout(height=400, width=1200, showlegend=False)
fig_compare.show()
