# Curse of dimensionality

Many ML problems involve thousands or even millions of features for each training instance. This makes training extremely slow and makes it much harder to find a good solution. This problem is often called the curse of dimensionality.

High dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. This also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions. The more dimensions the training set has, the greater the risk of overfitting it. One solution could be to increase the training set size to reach a sufficient density of training instances, but the number of training instances required to reach a given density grows exponentially with the number of dimensions.
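
A rough numerical illustration of the sparsity argument, assuming points drawn uniformly at random from a unit hypercube (an illustrative setup, not a dataset from these notes). The helper name `mean_pair_distance` is made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_pair_distance(n_dims, n_pairs=2_000):
    """Average Euclidean distance between random point pairs in the unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

# Same number of points, more and more dimensions
distances = {n_dims: mean_pair_distance(n_dims) for n_dims in (2, 10, 100, 1_000)}
for n_dims, dist in distances.items():
    print(f"{n_dims:>5} dims: mean distance = {dist:.2f}")
```

The average distance keeps growing as dimensions are added (roughly like the square root of n/6 for large n), which is one way to see why instances end up far apart in high dimensional spaces.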

## Why dimensionality reduction?

Reducing dimensionality causes some information loss, so although training will be faster, the model's performance may degrade.

In some cases, reducing dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance.

It is also useful for data visualization. Reducing the number of dimensions down to 2 or 3 makes it possible to plot a condensed view of a high dimensional training set on a graph and gain some important insights by visually detecting patterns such as clusters.

The 2 main approaches to dimensionality reduction are projection and manifold learning.

The 3 most popular dimensionality reduction techniques are PCA, random projection, and locally linear embedding (LLE).

## Projection

In most real world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant while others are highly correlated. As a result, all training instances lie within (or close to) a much lower dimensional subspace of the high dimensional space.

## Manifold Learning

In some cases the subspace may twist and turn, as in the Swiss roll toy dataset, which is an example of a 2D manifold. A 2D manifold is a 2D shape that can be bent and twisted in a higher dimensional space. Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called manifold learning. It relies on the manifold assumption (or hypothesis), which holds that most real-world high dimensional datasets lie close to a much lower dimensional manifold. This is often paired with a second, implicit assumption: that the task at hand (classification or regression) will be simpler if expressed in the lower dimensional space of the manifold.
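
As a small sketch, the Swiss roll can be generated with Scikit-Learn's `make_swiss_roll`. The sample size, noise level, and the names `X_swiss` and `t` are illustrative choices, not values from these notes.

```python
from sklearn.datasets import make_swiss_roll

# X_swiss has 3 features per instance; t is the position of each instance
# along the rolled-up direction of the underlying 2D manifold.
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
print(X_swiss.shape, t.shape)
```

Each instance lives in 3D, yet its location is essentially described by two numbers: its position along the roll (`t`) and its position across the width of the roll.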


Reducing the dimensionality of the training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it depends on the dataset.

# Principal Component Analysis (PCA)

PCA first identifies the hyperplane that lies closest to the data and then projects the data onto it. For example, to project a 2D dataset onto a 1D hyperplane (a line), select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Equivalently, choose the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.

PCA then identifies a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In the 2D example there is no choice: there is only one orthogonal axis. For a higher dimensional dataset, PCA goes on to find a third axis orthogonal to both previous axes, and so on, up to as many axes as the number of dimensions in the dataset. The ith axis is called the ith principal component (PC) of the data.

For each principal component, PCA finds a zero-centered unit vector pointing in the direction of the PC. PCA assumes that the dataset is centered around the origin. Scikit-Learn's PCA classes take care of centering the data, but when implementing PCA yourself (for example with svd()), center the data first.

To find the PCs of a training set, there is a standard matrix factorization technique called singular value decomposition (SVD) that can decompose the training set matrix X into the matrix multiplication of 3 matrices U Σ V⊺, where V contains the unit vectors that define all the PCs.

The SVD factorization algorithm returns three matrices, U, Σ and V, such that X = UΣV⊺, where U is an m × m matrix, Σ is an m × n matrix, and V is an n × n matrix. But NumPy's svd() function returns U, s and V⊺ instead, where s is the vector containing all the values on the main diagonal of the top n rows of Σ. Since Σ is full of zeros elsewhere, you can easily reconstruct it from s.
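
A minimal sketch of rebuilding Σ from s, assuming a small random dataset with more instances than features (m ≥ n). The names `X_demo` and `Sigma` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 3))           # m = 6 instances, n = 3 features
X_demo_centered = X_demo - X_demo.mean(axis=0)

U, s, Vt = np.linalg.svd(X_demo_centered)  # U: (6, 6), s: (3,), Vt: (3, 3)

m, n = X_demo_centered.shape
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)                 # singular values on the main diagonal

# The product of the three factors should give back the centered data
print(np.allclose(X_demo_centered, U @ Sigma @ Vt))
```

The final check is expected to hold up to floating-point error, confirming that s carries everything needed to reconstruct Σ.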

Once all PCs are identified, we can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d PCs. To project the training set onto the hyperplane and obtain a reduced dataset Xd-proj of dimensionality d, compute the matrix multiplication of the training set matrix X by the matrix Wd, defined as the matrix containing the first d columns of V.

Xd-proj = XWd

-> explained_variance_ratio_ - the proportion of the dataset's variance that lies along each principal component.

-> Choosing the right number of dimensions -

1. Choose the number of dimensions that add up to a sufficiently large portion of the variance, say 95%.

2. If we are reducing dimensionality for data visualization, then we want to reduce it down to 2 or 3.

3. Plot the explained variance as a function of the number of dimensions. There will usually be an elbow in the curve where the explained variance stops growing fast.
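
The first and third options can be sketched with a cumulative sum over `explained_variance_ratio_`. This example assumes Scikit-Learn's bundled digits dataset (`load_digits`, 64 features) as a small stand-in; it is not the dataset used elsewhere in these notes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data             # 1,797 instances, 64 features

pca_full = PCA()                           # no n_components: keep every component
pca_full.fit(X_digits)

# cumsum[i] is the share of variance explained by the first i + 1 components
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95)) + 1     # smallest count reaching 95%
print(d, cumsum[d - 1])
```

Plotting `cumsum` against the number of components gives the curve described in option 3, where the elbow can be inspected visually.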

-> After dimensionality reduction, the training set takes up much less space, so training is faster. It's also possible to decompress the reduced dataset back to the original number of dimensions by applying the inverse transformation of the PCA projection. This won't give back the original dataset, since the projection lost some information, but it will likely be close to the original data. The mean squared distance between the original data and the reconstructed data is called the reconstruction error.

Xrecovered = Xd-proj Wd⊺
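
A sketch of the reconstruction error in plain NumPy, assuming hypothetical data whose variance lies mostly along 2 directions. Since the projection works on centered data, the mean is added back after the inverse transformation. The helper `reconstruction_error` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 instances, 5 features, most variance in 2 directions
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

mean = X_demo.mean(axis=0)
X_demo_centered = X_demo - mean
_, _, Vt = np.linalg.svd(X_demo_centered, full_matrices=False)

def reconstruction_error(d):
    """Mean squared distance between X_demo and its d-dimensional reconstruction."""
    W_d = Vt[:d].T                          # first d principal components
    X_proj = X_demo_centered @ W_d          # project down to d dimensions
    X_recovered = X_proj @ W_d.T + mean     # back to 5 dimensions, undo centering
    return np.mean(np.sum((X_demo - X_recovered) ** 2, axis=1))

errors = [reconstruction_error(d) for d in range(1, 6)]
print(errors)
```

The error shrinks as d grows and is essentially zero when all 5 components are kept, since nothing is discarded in that case.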

-> Complexity of the full SVD approach - O(m*n^2)+O(n^3)

-> Randomized PCA - set svd_solver="randomized". Complexity - O(m*d^2)+O(d^3), so it is faster than full SVD when d is much smaller than n.

-> Incremental PCA - It allows you to split the training set into mini batches and feed these in one mini batch at a time. This is useful for large training sets and for applying PCA online.

-> If we are dealing with a dataset with tens of thousands of features or more, say images, then training becomes much too slow. In that case consider random projection.

## Maths behind it

To find the best vector u, the direction of maximum variance (maximum information) along which we should rotate our existing coordinates, we follow the steps below:

1. Find the covariance matrix of the feature matrix X.

2. Then calculate the eigenvectors and eigenvalues of the covariance matrix. Each eigenvector is a candidate direction for u, and its eigenvalue measures the importance of that direction (the variance along it).

The eigenvector associated with the largest eigenvalue indicates the direction in which the data has the most variance. So we can select our PCs in the directions of the eigenvectors having large eigenvalues and drop the PCs having relatively small eigenvalues.
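
The two steps above can be sketched in NumPy and cross-checked against SVD. The data here is synthetic, with a deliberately different spread along each axis so that the eigenvalues are well separated; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])  # unequal spread per axis
X_demo_centered = X_demo - X_demo.mean(axis=0)

# Step 1: covariance matrix of the features (rowvar=False: columns are features)
cov = np.cov(X_demo_centered, rowvar=False)

# Step 2: eigendecomposition; eigh handles symmetric matrices, ascending eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Cross-check with SVD: same directions up to sign, eigenvalue = s**2 / (m - 1)
_, s, Vt = np.linalg.svd(X_demo_centered, full_matrices=False)
print(np.allclose(np.abs(eigvecs.T), np.abs(Vt)))
print(np.allclose(eigvals, s**2 / (len(X_demo) - 1)))
```

Both routes give the same principal directions. The sign of each vector is arbitrary, which is why the comparison uses absolute values.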

-> Eigenvectors and eigenvalues

For a square matrix A, an eigenvector is a nonzero vector x such that multiplying it by A gives a new vector along the same line as x, only scaled: Ax = λx. The scaling factor λ is called the eigenvalue. A matrix can have multiple eigenvectors. They are not orthogonal in general, but for a symmetric matrix such as a covariance matrix, eigenvectors with distinct eigenvalues are always orthogonal to each other.
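
A quick numerical check of the definition, using a small symmetric matrix picked for illustration (its eigenvalues are 1 and 3).

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                  # small symmetric matrix, illustrative only

eigvals, eigvecs = np.linalg.eigh(A)        # columns of eigvecs are the eigenvectors

for lam, x in zip(eigvals, eigvecs.T):
    # A @ x stays on the line spanned by x and is scaled by the eigenvalue
    print(lam, np.allclose(A @ x, lam * x))

# For a symmetric matrix the eigenvectors are orthogonal: dot product near 0
print(np.isclose(eigvecs[:, 0] @ eigvecs[:, 1], 0.0))
```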

In [None]:
# Creating 3D dataset

import numpy as np
from scipy.spatial.transform import Rotation

m = 60
X = np.zeros((m, 3))  # initialize 3D dataset
np.random.seed(42)
angles = (np.random.rand(m) ** 3 + 0.5) * 2 * np.pi  # uneven distribution
X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles) * 0.5  # oval
X += 0.28 * np.random.randn(m, 3)  # add more noise
X = Rotation.from_rotvec([np.pi / 29, -np.pi / 20, np.pi / 4]).apply(X)
X += [0.2, 0, 0.2]  # shift a bit

In [None]:
import numpy as np

# X = [...]  # the small 3D dataset was created earlier in this notebook
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]
c2 = Vt[1]

In [None]:
W2 = Vt[:2].T
X2D = X_centered @ W2

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

In [None]:
pca.components_

array([[ 0.67857588,  0.70073508,  0.22023881],
       [ 0.72817329, -0.6811147 , -0.07646185]])

In [None]:
pca.explained_variance_ratio_

array([0.7578477 , 0.15186921])

In [None]:
1 - pca.explained_variance_ratio_.sum()

0.09028309326742034

In [None]:
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False, parser="auto")
X_train, y_train = mnist.data[:60_000], mnist.target[:60_000]
X_test, y_test = mnist.data[60_000:], mnist.target[60_000:]

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

In [None]:
pca.n_components_

154

In [None]:
pca.explained_variance_ratio_.sum()

0.9501960192613035

In [None]:
pca = PCA(0.95)  # positional form of n_components=0.95
X_reduced = pca.fit_transform(X_train, y_train)  # PCA is unsupervised: y_train is ignored

In [None]:
X_recovered = pca.inverse_transform(X_reduced)

In [None]:
rnd_pca = PCA(n_components=154, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X_train)

In [None]:

from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)

# Random Projection

Random projection projects the data to a lower dimensional space using a random linear projection. Such a projection is very likely to preserve distances fairly well (the Johnson-Lindenstrauss lemma), so 2 similar instances will remain similar and 2 very different instances will remain very different. To do this, generate a random matrix P of shape [d, n], where each item is sampled randomly from a Gaussian distribution with mean 0 and variance 1/d, and use it to project a dataset from n dimensions down to d.
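
A minimal NumPy sketch of this recipe, assuming synthetic Gaussian data and illustrative sizes (m, n, d are not from these notes). The helper `pairwise_distances` is defined here for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, d = 20, 5_000, 500                    # illustrative sizes

X_demo = rng.normal(size=(m, n))
P = rng.normal(scale=np.sqrt(1 / d), size=(d, n))   # mean 0, variance 1/d
X_demo_reduced = X_demo @ P.T                       # shape (m, d)

def pairwise_distances(Z):
    """Euclidean distances between all distinct pairs of rows of Z."""
    diffs = Z[:, None, :] - Z[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return dists[np.triu_indices(len(Z), k=1)]

# Ratio of each pairwise distance after projection to the one before
ratios = pairwise_distances(X_demo_reduced) / pairwise_distances(X_demo)
print(ratios.min(), ratios.max())
```

The ratios should stay close to 1, meaning pairwise distances are roughly preserved even though the dimensionality dropped tenfold. Scikit-Learn offers the same idea through `GaussianRandomProjection` in `sklearn.random_projection`.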

# Locally Linear Embedding (LLE)

-> It's a manifold learning technique, i.e. a nonlinear dimensionality reduction (NLDR) technique.

-> LLE works by first measuring how each training instance linearly relates to its nearest neighbors and then looking for a low dimensional representation of the training set where these local relationships are best preserved.
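
As a sketch, LLE can be tried on the Swiss roll with Scikit-Learn's `LocallyLinearEmbedding`. The value `n_neighbors=10` and the dataset parameters are illustrative choices and usually need tuning for a given dataset.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# Each instance is described relative to its 10 nearest neighbors,
# then a 2D embedding is sought that keeps those local relationships.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X_swiss)
print(X_unrolled.shape)
```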