## Dimensionality Reduction

*Curse of dimensionality*: a large number of features makes training slow and can make it hard to find a good solution. High-dimensional datasets are at risk of being very **sparse**, so the **training instances are likely to be far away from each other**. A new instance will then likely be far away from every training instance, so predictions rest on much larger **extrapolation** than in lower dimensions. 

--> **Reducing dimensionality** does cause some information loss but significantly speeds up training and can filter out noise. 
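
The sparsity argument above can be checked numerically. The sketch below is only an illustration: the helper `mean_pairwise_distance`, the seed, the sample size and the list of dimensions are all arbitrary choices, not part of the original notes. It draws random points in a unit hypercube and measures how far apart they are on average.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only


def mean_pairwise_distance(n_dims, n_points=200):
    """Average Euclidean distance between random points in the unit hypercube."""
    points = rng.random((n_points, n_dims))
    # Squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    dists = np.sqrt(np.clip(sq_dists, 0, None))
    # Average over distinct pairs only (upper triangle, diagonal excluded)
    return dists[np.triu_indices(n_points, k=1)].mean()


mean_distances = {d: mean_pairwise_distance(d) for d in (2, 10, 100, 1000)}
for d, dist in mean_distances.items():
    print(f"{d:>5} dims: mean distance = {dist:.2f}")
```

The mean distance keeps growing with the number of dimensions, even though every coordinate stays in the same [0, 1] range.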

### 1. Approaches for dimensionality reduction

#### 1a. Projection 

In most cases, training instances are **not spread out uniformly** across all dimensions due to: 

 * some features being almost constant 
 * some features being correlated 
 
--> as a result, all training instances **lie within (or close to) a lower-dimensional subspace** of the high-dimensional space. 

**Project every training instance perpendicularly onto this subspace** 


#### 1b. Manifold Learning 

If the subspace twists and turns, such as the famous Swiss roll case, then simply projecting onto a plane would squash different layers together. Instead, we want to "unroll" it. 

A **d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane**. Manifold learning relies on the manifold hypothesis = most high-dimensional datasets lie close to a much lower-dimensional manifold. It attempts to model the manifold on which the training instances lie. 
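
To make the "squashing" concrete, here is a small sketch using sklearn's `make_swiss_roll`. The sample size, the seed and the distance threshold of 1.0 are arbitrary assumptions. It drops one axis (a naive projection) and looks for pairs of points that end up close together even though they sit far apart along the roll.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# t is the position of each point along the rolled-up sheet
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.0, random_state=42)

# Naive projection: keep the first two axes, drop the third
X_proj = X_swiss[:, :2]

# Pairwise distances in the projected plane, and gaps along the roll
proj_dist = np.linalg.norm(X_proj[:, None, :] - X_proj[None, :, :], axis=-1)
t_gap = np.abs(t[:, None] - t[None, :])

# Pairs that look like neighbors after projection (threshold is arbitrary)
close_in_projection = (proj_dist < 1.0) & ~np.eye(len(t), dtype=bool)
largest_gap = t_gap[close_in_projection].max()
print(f"Largest gap in t among projected neighbors: {largest_gap:.2f}")
```

A large gap in `t` between points that are neighbors in the projection means that different layers of the roll were merged, which is exactly what unrolling avoids.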

### 2. Principal Component Analysis (Projection method)

 * Identifies the hyperplane that lies closest to the data: 
     1. Select the axis that preserves the maximum amount of variance (therefore likely to lose less information) or that minimizes the mean squared distance between the original dataset and its projection onto that axis. 
     2. Select a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance. 
     3. Iterate the procedure to find as many axes as the number of dimensions in the dataset. 

The i-th axis is called the i-th principal component of the data. 
For each principal component, PCA finds a zero-centered unit vector pointing in the direction of the PC. However, since two opposing unit vectors lie on the same axis, the direction of the unit vectors returned by PCA is not stable: a vector may come back with its sign flipped, though the plane the vectors define remains the same. 

**Singular Value Decomposition**: matrix factorization technique that decomposes the training set matrix $X$ into the product of three matrices: $X = U \Sigma V^{T}$ 

--> $V$ contains the unit vectors that define all the principal components we are looking for. 
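
As a quick sanity check of the factorization, the sketch below (on the iris data, using NumPy's compact SVD via `full_matrices=False`, which is a choice made here for convenience) rebuilds the centered matrix from its three factors, confirms that the principal directions are orthonormal, and shows that flipping the sign of a component only mirrors the projection.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Compact SVD: U is (m, n), s holds the n singular values, Vt is (n, n)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rebuild the centered data from U, Sigma and V^T
X_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(X_rebuilt, X_centered))

# The rows of Vt (i.e. the columns of V) are orthonormal unit vectors
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))

# Flipping the sign of the first component mirrors the projection,
# but the axis itself is unchanged
proj = X_centered @ Vt[0]
proj_flipped = X_centered @ (-Vt[0])
print(np.allclose(proj, -proj_flipped))
```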


In [7]:
from sklearn.datasets import load_iris 

iris = load_iris()
X = iris.data
y = iris.target


In [8]:
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.3, # This can be changed, though it makes sense to use 25-30% of the data for test
        random_state=1996
    )

In [9]:
import numpy as np 

X_centered = X - X.mean(axis = 0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt.T[:, 0]
c2 = Vt.T[:, 1]

**!** PCA assumes the dataset is centered around the origin. sklearn's PCA classes take care of centering themselves, but a manual implementation such as the SVD above must center the data first. 

 * Projects the data onto the hyperplane defined by the first d principal components 
 
$W_{d}$ = matrix containing the first d columns of $V$ 

--> Projection: $X_{d-proj} = X W_{d}$

In [10]:
Vt.shape

(4, 4)

In [11]:
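W2 = Vt.T[:, :2]           # first two columns of V, i.e. the first two principal components
X2D = X_centered.dot(W2)   # projected data, shape (150, 2)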

In [12]:
# with sklearn 
from sklearn.decomposition import PCA 

pca = PCA(n_components = 2)
X2D = pca.fit_transform(X)
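
The manual projection and sklearn's `PCA` should agree up to the sign of each component. The sketch below compares the two on the iris data; the variable names `X2D_manual` and `X2D_sklearn` and the tolerance are choices made here, not part of the original cells.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Manual route: SVD, then project onto the first two principal components
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = Vt.T[:, :2]
X2D_manual = X_centered @ W2

# sklearn route: centering is handled internally
X2D_sklearn = PCA(n_components=2).fit_transform(X)

# Components may come back with opposite signs, so compare absolute values
same_up_to_sign = np.allclose(np.abs(X2D_manual), np.abs(X2D_sklearn), atol=1e-6)
print(same_up_to_sign)
```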

 * The explained variance ratio of each principal component indicates the proportion of the dataset's variance that lies along that component 

In [13]:
pca.explained_variance_ratio_

array([0.92461872, 0.05306648])
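
These ratios can also be recovered from the singular values of the centered data, since the variance along each component is proportional to the square of its singular value. A minimal sketch, where `manual_ratio` and `sklearn_ratio` are names introduced here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Singular values only; variance along component i is proportional to s[i]**2
s = np.linalg.svd(X_centered, compute_uv=False)
manual_ratio = s**2 / np.sum(s**2)

sklearn_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
print(np.allclose(manual_ratio[:2], sklearn_ratio))
```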

 * Choosing the right number of dimensions by specifying the total amount of variance that we want to retain: set `n_components` to a float between 0.0 and 1.0 instead of an integer

In [14]:
pca = PCA(n_components = 0.95)
X_reduced = pca.fit_transform(X)
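
Roughly the same choice can be made by hand from the cumulative explained variance, which also makes it easy to inspect how much each extra dimension adds. A sketch, where the names `pca_full`, `cumsum` and `d` are introduced here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit PCA without reducing dimensionality to get every ratio
pca_full = PCA().fit(X)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
d = int(np.argmax(cumsum >= 0.95)) + 1
print(d)

# Compare with the number of components sklearn keeps for n_components=0.95
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)
```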

 * PCA for compression --> reconstruction error = mean squared distance between the original data and the reconstructed (compressed and then decompressed) data
 
$X_{recovered} = X_{d-proj} W_{d}^{T}$

In [15]:
X_recovered = pca.inverse_transform(X_reduced)
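
The reconstruction error itself can be computed directly. The sketch below fixes `n_components = 2` for illustration and also applies the formula by hand. Note that sklearn adds the training mean back, because the formula above assumes centered data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)

# Mean squared distance between original and reconstructed instances
reconstruction_mse = np.mean(np.sum((X - X_recovered) ** 2, axis=1))
print(reconstruction_mse)

# Same reconstruction by hand: pca.components_ plays the role of W_d^T,
# and the mean removed during centering is added back
X_recovered_manual = X_reduced @ pca.components_ + pca.mean_
print(np.allclose(X_recovered, X_recovered_manual))
```

The error is exactly the variance carried by the discarded components, so keeping more components drives it toward zero.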

#### 2a. Randomized PCA

A stochastic algorithm that quickly finds an approximation of the first d principal components. It is considerably faster than a full SVD when d is much smaller than the number of features.

In [17]:
rnd_pca = PCA(n_components = 2, svd_solver = "randomized") # specify the solver 
X_reduced = rnd_pca.fit_transform(X)

#### 2b. Incremental PCA 

Allows splitting the training dataset into mini-batches and feeding them to an IPCA algorithm one mini-batch at a time. 
Useful for:

 * Large training sets
 * Online learning 

In [20]:
from sklearn.decomposition import IncrementalPCA 

n_batches = 10
inc_pca = IncrementalPCA(n_components = 2)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)
    
X_reduced = inc_pca.transform(X_train)

#### 2c. Kernel PCA

**kernel trick** = mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. --> a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space. The same trick can be applied to PCA, which makes it possible to perform complex nonlinear projections for dimensionality reduction: this is Kernel PCA. 


In [21]:
from sklearn.decomposition import KernelPCA 

rbf_pca = KernelPCA(n_components = 2, kernel = 'rbf', gamma = 0.04)
X_reduced = rbf_pca.fit_transform(X)
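
To show what happens under the hood, here is a sketch of RBF kernel PCA done by hand: build the kernel matrix, center it in feature space, and take its leading eigenvectors. It follows the standard textbook formulation. The comparison with sklearn is made up to sign, and the tolerance is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

X = load_iris().data
gamma = 0.04

# Kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
K = rbf_kernel(X, gamma=gamma)

# Center the kernel matrix, i.e. center the data in the implicit feature space
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Eigendecomposition (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(K_centered)
top = np.argsort(eigvals)[::-1][:2]

# Projected training data: eigenvectors scaled by sqrt of their eigenvalues
X_kpca_manual = eigvecs[:, top] * np.sqrt(eigvals[top])

X_kpca_sklearn = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X)
matches = np.allclose(np.abs(X_kpca_manual), np.abs(X_kpca_sklearn), atol=1e-6)
print(matches)
```

The feature-space mapping is never computed explicitly: only the kernel matrix is needed.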

### 3. Manifold learning methods

#### 3a. Locally Linear Embedding

 * First measures how each training instance linearly relates to its closest neighbors
 * Looks for a low-dimensional representation of the training set where these local relationships are best preserved 
 
For each training instance $x^{(i)}$:
 1. Identify the k nearest neighbors
 2. Reconstruct $x^{(i)}$ as a linear function of its neighbors: find the weights $w_{i,j}$ such that the squared distance between $x^{(i)}$ and $\sum_{j = 1}^{m} w_{i,j} x^{(j)}$ is as small as possible, with $w_{i,j} = 0$ if $x^{(j)}$ is not one of the k nearest neighbors of $x^{(i)}$. 
 3. Constrain the weights of each training instance $x^{(i)}$ to be normalized: $\sum_{j = 1}^{m} w_{i,j} = 1$
 4. Map the training instances to a d-dimensional space while preserving the local relationships (weight matrix) as much as possible
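
Steps 1 to 3 can be sketched for a single instance as follows. This is a simplified illustration, not sklearn's implementation: the instance index `i`, the neighbor count `k` and the regularization factor `1e-3` are assumptions made here. The regularization is needed because the local Gram matrix is singular when k exceeds the number of features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data
k = 10
i = 0  # index of the instance to reconstruct (arbitrary choice)

# Step 1: k nearest neighbors of x_i (ask for k + 1, then drop x_i itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X[i:i + 1])
neighbors = idx[0][idx[0] != i][:k]

# Step 2: weights minimizing ||x_i - sum_j w_j x_j||^2 under sum_j w_j = 1
Z = X[neighbors] - X[i]                  # neighbors relative to x_i, shape (k, n_features)
C = Z @ Z.T                              # local Gram matrix, shape (k, k)
C += 1e-3 * np.trace(C) * np.eye(k)      # small ridge term (assumed value)
w = np.linalg.solve(C, np.ones(k))

# Step 3: normalize so the weights sum to 1
w /= w.sum()

reconstruction = w @ X[neighbors]
error = np.linalg.norm(X[i] - reconstruction)
print(f"Reconstruction error for instance {i}: {error:.4f}")
```

Repeating this for every instance fills the rows of the weight matrix that step 4 then tries to preserve in the low-dimensional space.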
 

In [22]:
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components = 2, n_neighbors = 10)
X_reduced = lle.fit_transform(X)
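
LLE is usually demonstrated on a twisted manifold such as the Swiss roll mentioned earlier. The sketch below applies the same estimator to sklearn's `make_swiss_roll`; the sample size, noise level and seed are arbitrary assumptions.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# t is the position along the roll; useful for coloring a plot of the result
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

lle_swiss = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle_swiss.fit_transform(X_swiss)

print(X_unrolled.shape)
print(lle_swiss.reconstruction_error_)
```

Plotting `X_unrolled` colored by `t` is a quick way to judge whether the roll was flattened without mixing its layers.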

#### End of notebook 