### Dimensionality Reduction

Try training the system on the original data first, before applying dimensionality reduction.

Reducing dimensionality can filter out noise and unnecessary details, which sometimes improves performance, but in general the main benefit is faster training.

Reducing the dimensionality also helps with visualization, for example spotting patterns such as clusters.

**Projection:** project the data onto a lower-dimensional subspace, e.g. 3D space -> 2D space

**Manifold:** e.g. the Swiss roll dataset, which lives in 3D but can be unrolled onto a 2D space
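
To make the difference between the two approaches concrete, here is a small sketch using Scikit-Learn's `make_swiss_roll`. The variable names (`X_swiss`, `X_projected`, `X_unrolled`) and the sample size and noise level are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# X_swiss holds the 3D points, t the position of each point along the rolled-up sheet
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Projection: drop the third axis, which squashes the layers of the roll together
X_projected = X_swiss[:, :2]

# Unrolling: use the position along the roll (t) and the roll's width axis (second column)
X_unrolled = np.column_stack([t, X_swiss[:, 1]])

print(X_swiss.shape, X_projected.shape, X_unrolled.shape)
```

Plotting `X_projected` and `X_unrolled` side by side should show overlapping layers in the first and a flat sheet in the second.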

### Principal Component Analysis (PCA)

The most popular dimensionality reduction algorithm.

It identifies the axis that accounts for the largest amount of variance in the training set.

It then finds a second axis, orthogonal to the first, that accounts for the largest amount of the remaining variance, and so on (as many axes as the number of dimensions).

How do we find the principal components of a training set?

Use a standard matrix factorization technique called singular value decomposition (SVD), which decomposes the centered training set $\mathbf{X}$ into $\mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$, where the columns of $\mathbf{V}$ are the principal components.
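
As a sanity check on the factorization, the sketch below rebuilds a centered matrix from its three factors and confirms that the principal components are orthonormal. The random data and the `_demo` names are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(20, 3))  # illustrative random data, not from the notes
X_demo_centered = X_demo - X_demo.mean(axis=0)

# full_matrices=False gives the compact SVD: U is (20, 3), s is (3,), Vt is (3, 3)
U_demo, s_demo, Vt_demo = np.linalg.svd(X_demo_centered, full_matrices=False)

# Rebuild the centered data from its factors: U @ diag(s) @ Vt
X_rebuilt = U_demo @ np.diag(s_demo) @ Vt_demo
print(np.allclose(X_rebuilt, X_demo_centered))

# The rows of Vt (the principal components) are unit vectors orthogonal to each other
print(np.allclose(Vt_demo @ Vt_demo.T, np.eye(3)))
```

Note that `np.linalg.svd` returns the singular values as a 1D array and the already transposed matrix $\mathbf{V}^T$, which is why the components are read from its rows.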

In [6]:
import numpy as np

# Create a small 3D dataset (5 points in 3D space)
X = np.array([
    [2, 3, 5],
    [3, 5, 7],
    [5, 8, 11],
    [7, 10, 13],
    [9, 12, 15]
])

# Center the data
X_centered = X - X.mean(axis=0)
print(X.mean(axis=0))

# Perform SVD
U, s, Vt = np.linalg.svd(X_centered)

# Principal components
c1 = Vt[0]  # First principal component
c2 = Vt[1]  # Second principal component

print("Principal Component 1:", c1)
print("Principal Component 2:", c2)

[ 5.2  7.6 10.2]
Principal Component 1: [0.45795989 0.58739374 0.66726407]
Principal Component 2: [-0.85099447  0.0726221   0.52012926]


Project the training set onto the plane defined by the first two principal components:

In [7]:
W2 = Vt[:2].T
X2D = X_centered @ W2

Scikit-Learn's `PCA` class reduces the dataset's dimensionality down to two dimensions (it automatically takes care of centering the data):

In [8]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
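
The manual SVD projection and Scikit-Learn's `PCA` should agree up to the sign of each component. The sketch below checks this on the same five points; the names `X_small`, `X2D_manual` and `X2D_sklearn` are introduced here only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

X_small = np.array([
    [2, 3, 5],
    [3, 5, 7],
    [5, 8, 11],
    [7, 10, 13],
    [9, 12, 15],
], dtype=float)

# Manual route: center, decompose, project onto the first two components
X_small_centered = X_small - X_small.mean(axis=0)
_, _, Vt_small = np.linalg.svd(X_small_centered)
X2D_manual = X_small_centered @ Vt_small[:2].T

# Scikit-Learn route: centering happens inside fit_transform
X2D_sklearn = PCA(n_components=2).fit_transform(X_small)

# Each component is only defined up to its sign, so compare absolute values
print(np.allclose(np.abs(X2D_manual), np.abs(X2D_sklearn)))
```

Comparing absolute values is a shortcut that works here because a sign flip affects a whole column at once.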

Explained variance ratio: the proportion of the dataset's variance that lies along each principal component

In [9]:
pca.explained_variance_ratio_

array([0.99541491, 0.00428147])

About 99.54% of the variance lies along the first PC and about 0.43% along the second PC, leaving about 0.03% for the third.

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g. 95%).
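
These ratios can also be derived directly from the singular values, since the variance along each component is proportional to the square of its singular value. The following is a minimal sketch on the same five points, with illustrative variable names that are not part of the original notes.

```python
import numpy as np
from sklearn.decomposition import PCA

X_small = np.array([
    [2, 3, 5],
    [3, 5, 7],
    [5, 8, 11],
    [7, 10, 13],
    [9, 12, 15],
], dtype=float)
X_small_centered = X_small - X_small.mean(axis=0)

# compute_uv=False returns only the singular values
s_small = np.linalg.svd(X_small_centered, compute_uv=False)

# Variance along component i is s_i**2 / (n_samples - 1); the constant cancels in the ratio
ratio_manual = s_small**2 / np.sum(s_small**2)

ratio_sklearn = PCA(n_components=2).fit(X_small).explained_variance_ratio_
print(ratio_manual[:2], ratio_sklearn)
print(np.allclose(ratio_manual[:2], ratio_sklearn))
```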

In [10]:
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
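
The `np.argmax` trick works because `argmax` on a boolean array returns the index of the first `True`. Here is a toy sketch with made-up variance ratios (the numbers are illustrative assumptions, not results from the digits data):

```python
import numpy as np

ratios = np.array([0.5, 0.3, 0.1, 0.06, 0.04])  # made-up explained variance ratios
cumsum_demo = np.cumsum(ratios)                 # approximately [0.5, 0.8, 0.9, 0.96, 1.0]
reached = cumsum_demo >= 0.95                   # [False, False, False, True, True]

# Index of the first True is 3; add 1 to turn the index into a count of components
d_demo = np.argmax(reached) + 1
print(d_demo)  # 4
```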

In [11]:
# n_components as a float between 0 and 1 is the ratio of variance to preserve
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

In [12]:
pca.n_components_

28

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

clf = make_pipeline(PCA(random_state=42), RandomForestClassifier(random_state=42))

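param_distrib = {
    # NOTE: load_digits has only 64 features, so sampled n_components values
    # above 64 make PCA raise an error (the likely cause of the failed fits below)
    "pca__n_components": np.arange(10, 80),
    "randomforestclassifier__n_estimators": np.arange(50, 500)
}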

rnd_search = RandomizedSearchCV(clf, param_distrib, n_iter=10, cv=3, random_state=42)
rnd_search.fit(X_train[:1000], y_train[:1000])

3 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/mathias/Library/Python/3.9/lib/python/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/mathias/Library/Python/3.9/lib/python/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/mathias/Library/Python/3.9/lib/python/site-packages/sklearn/pipeline.py", line 471, in fit
    Xt = self._fit(X, y, routed_params)
  File "/Users/mathias/Library/Python/3.9/lib/python/site-packages/sklearn/pipeline.py", line 408, 

In [19]:
rnd_search.best_params_

{'randomforestclassifier__n_estimators': 304, 'pca__n_components': 62}

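The search selected 62 PCA components, so the 64-dimensional digits data is reduced to 62 dimensions.

The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.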
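
The failed fits reported above are most likely caused by sampled `pca__n_components` values larger than the 64 features of the digits data. One way to avoid them, sketched below, is to bound the range by the number of features. The names `pipe`, `bounded_distrib` and `search`, the smaller forest sizes and `n_iter=5` are assumptions chosen to keep the sketch fast, so its best parameters will differ from the ones shown in the notes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, y_tr = X_digits[:1000], y_digits[:1000]
n_features = X_tr.shape[1]  # 64 for load_digits

pipe = make_pipeline(PCA(random_state=42), RandomForestClassifier(random_state=42))

bounded_distrib = {
    # np.arange excludes its upper bound, so this samples 10..64 inclusive
    "pca__n_components": np.arange(10, n_features + 1),
    # Smaller forests than in the notes, only to keep this sketch fast
    "randomforestclassifier__n_estimators": np.arange(20, 100),
}

# error_score="raise" surfaces any failing fit immediately instead of recording NaN
search = RandomizedSearchCV(pipe, bounded_distrib, n_iter=5, cv=3,
                            random_state=42, error_score="raise")
search.fit(X_tr, y_tr)
print(search.best_params_)
```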

In [20]:
X_recovered = pca.inverse_transform(X_reduced)

$$
\mathbf{X}_{\text{reconstructed}} = \mathbf{Z} \mathbf{W}^T + \mathbf{\mu}
$$

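Here $\mathbf{Z}$ is the reduced dataset, the columns of $\mathbf{W}$ are the retained principal components, and $\mathbf{\mu}$ is the per-feature mean of the training set. If mean-centering was not applied before PCA, the equation simplifies to: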

$$
\mathbf{X}_{\text{reconstructed}} = \mathbf{Z} \mathbf{W}^T
$$
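
The first equation can be checked against `inverse_transform`, and the reconstruction error computed from the result. This is a sketch under the assumption that the same first 1,500 digits and `n_components=0.95` are used; `pca_95`, `Z` and `X_manual` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits, _ = load_digits(return_X_y=True)
X_tr = X_digits[:1500]

pca_95 = PCA(n_components=0.95)
Z = pca_95.fit_transform(X_tr)

# components_ has shape (d, n_features), i.e. it already stores W^T
X_manual = Z @ pca_95.components_ + pca_95.mean_
print(np.allclose(X_manual, pca_95.inverse_transform(Z)))

# Reconstruction error: mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X_tr - X_manual) ** 2, axis=1))

# Share of the total squared distance to the mean that was lost in compression
total = np.mean(np.sum((X_tr - pca_95.mean_) ** 2, axis=1))
lost_ratio = reconstruction_error / total
print(reconstruction_error, lost_ratio)
```

Since 95% of the variance was preserved, `lost_ratio` should come out just under 0.05.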