# Chapter 8: Dimensionality Reduction

Having too many dimensions (features) is a problem for three reasons:
1. Training slows down as the number of features grows.
2. The models might not generalize well.
3. Objects in a higher-dimensional space behave unintuitively, so some of our expectations break down.

How can the average distance between two random points in a 1-million-dimensional unit hypercube be roughly 408.25? I can't derive it theoretically, so let's derive it empirically.

Let's generate a million pairs of random points for each number of dimensions, up to 1M, and then calculate the mean distance between them.

In [1]:
import numpy as np


In [27]:
import numpy as np

# Run later: this is slow for the larger dimensions.
# d maps the number of dimensions (as a string) to the mean Euclidean
# distance between two random points in the unit hypercube.
d = {}
runs = 1000000

for dimensions in (1, 2, 3, 4, 10, 100, 1000, 1000000):
    sum_distance = 0.0

    for i in range(runs):
        a = np.random.rand(dimensions)
        b = np.random.rand(dimensions)
        sum_distance += np.linalg.norm(a - b)

    mean_distance = sum_distance / runs
    print("Mean distance in dimension %d is %f" % (dimensions, mean_distance))
    d[str(dimensions)] = mean_distance


In [28]:
d

{'0': 1,
 '1000000': 408.2611230474335,
 '1': 0.34803266704316954,
 '2': 0.5207446344412412,
 '3': 0.6667205574300262,
 '4': 0.7807210204305841,
 '10': 1.2710891450432376,
 '100': 4.092600984543937,
 '1000': 12.914293148945156}
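
The 1M-dimension figure can also be sanity-checked with a rough approximation, offered here as a sketch rather than a rigorous derivation. For two independent coordinates drawn uniformly from [0, 1], the expected squared difference is 1/12 + 1/12 = 1/6, so the expected squared distance in `d` dimensions is `d / 6`. For large `d` the distance concentrates around `sqrt(d / 6)`. The helper names `approx_mean_distance` and `empirical_mean_distance` below are hypothetical, and the Monte Carlo check uses far fewer runs than the cell above so that it finishes quickly.

```python
import math

import numpy as np


def approx_mean_distance(dimensions):
    """Large-dimension approximation sqrt(d / 6) of the mean distance."""
    return math.sqrt(dimensions / 6.0)


def empirical_mean_distance(dimensions, runs=1000, seed=0):
    """Monte Carlo estimate from `runs` random pairs, computed in one batch."""
    rng = np.random.default_rng(seed)
    a = rng.random((runs, dimensions))
    b = rng.random((runs, dimensions))
    # One Euclidean distance per row (per pair of points), then the average.
    return float(np.linalg.norm(a - b, axis=1).mean())


print("%.2f" % approx_mean_distance(1000000))  # 408.25
print("%.2f vs %.2f" % (empirical_mean_distance(1000),
                        approx_mean_distance(1000)))
```

The approximation is only good for large `d`: in one dimension the exact mean distance is 1/3, while `sqrt(1 / 6)` is about 0.41.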

# Principal Components Analysis

Principal Component Analysis (PCA) reduces dimensionality by finding the principal components, obtained as the eigenvectors from an SVD (Singular Value Decomposition) of the data, and then projecting the data onto the first few components, which explain most of the variance.

This shrinks the feature space (fewer attributes, smaller data) and lowers the computational cost of training classifiers and making predictions.

The only hitch is that the SVD is time-consuming, but you only need to compute it once.
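
To make the SVD step concrete, here is a minimal NumPy-only sketch on synthetic data. The data, the `mixing` matrix, and all variable names are assumptions made up for illustration; it is not the MNIST workflow used in the rest of this chapter. It centers the data, takes the SVD, reads the principal components off the rows of `Vt`, and projects onto the first two.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative assumption): 200 samples with 5 features,
# generated from 2 latent directions plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 2.0, 0.0, -1.0, 0.5],
                   [0.0, 1.0, -1.0, 2.0, 1.0]])
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA assumes centered data, so subtract the per-feature mean first.
X_centered = X_demo - X_demo.mean(axis=0)

# Rows of Vt are the principal components, ordered by singular value.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Squared singular values are proportional to the variance explained.
explained_variance_ratio = s ** 2 / np.sum(s ** 2)

# Project onto the first two components to get the reduced dataset.
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T

print(explained_variance_ratio.round(3))
print(X_reduced.shape)  # (200, 2)
```

Because the synthetic data has only two strong directions, the first two entries of `explained_variance_ratio` account for nearly all of the variance, which is the situation where dropping the remaining components loses little information.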


In [29]:
# All imports

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict


In [None]:
# Download the MNIST data if it isn't available. Cached in ~/scikit_learn_data
mnist = fetch_openml('mnist_784', version=1, cache=True)

In [30]:
# Assign attributes and labels using standard terminology
X, y = mnist["data"], mnist["target"]

print(X.shape)
print(y.shape)

(70000, 784)
(70000,)
