<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/02_dimensionality_reduction_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Dimensionality reduction based on projection

**Principal Component Analysis** (PCA) is by far the most popular dimensionality reduction
algorithm. First it identifies the hyperplane that lies closest to the data, and then
it projects the data onto it.

<img src='https://github.com/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/images/0.png?raw=1' width='600'/>
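The two steps described above (find the closest hyperplane, then project onto it) can be sketched with Scikit-Learn's `PCA` class. This is only a minimal illustration: the toy dataset `X_toy` and the names `latent` and `basis` are made up for this sketch and are not part of the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data (an assumption for this sketch): 3D points lying close to a 2D plane.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))                  # coordinates inside the plane
basis = np.array([[1.0, 0.0, 0.5],
                  [0.0, 1.0, 0.5]])                 # two vectors spanning the plane
X_toy = latent @ basis + 0.05 * rng.normal(size=(200, 3))  # add small off-plane noise

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_toy)        # identify the closest 2D hyperplane, then project
X_back = pca.inverse_transform(X_2d)   # the projected points, expressed back in 3D

# Mean squared distance between the original points and their projections
reconstruction_mse = np.mean(np.sum((X_toy - X_back) ** 2, axis=1))

print(X_2d.shape)
print(pca.explained_variance_ratio_.sum())
print(reconstruction_mse)
```

Because the toy points sit almost exactly on a plane, nearly all of the variance should survive the projection and the mean squared distance should be close to the noise level.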

**Preserving the Variance**

Before you can project the training set onto a lower-dimensional hyperplane, you
first need to choose the right hyperplane.

For example, a simple 2D dataset is represented on the left in the figure below,
along with three different axes (i.e., 1D hyperplanes).
On the right is the result of projecting the dataset onto each of these axes.

<img src='https://github.com/rahiakela/machine-learning-research-and-practice/blob/main/hands-on-machine-learning-with-scikit-learn-keras-and-tensorflow/8-dimensionality-reduction/images/1.png?raw=1' width='600'/>

As you can see, the projection onto the solid line preserves the maximum variance, while
the projection onto the dotted line preserves very little variance and the projection
onto the dashed line preserves an intermediate amount of variance.

It seems reasonable to select the axis that preserves the maximum amount of variance,
as it will most likely lose less information than the other projections. Another
way to justify this choice is that it is the axis that minimizes the mean squared distance
between the original dataset and its projection onto that axis. This is the rather
simple idea behind [**PCA**](https://www.tandfonline.com/doi/pdf/10.1080/14786440109462720).
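The equivalence between "maximum preserved variance" and "minimum mean squared distance" can be checked numerically. The sketch below uses a made-up 2D dataset (not the one in the figure) and a hypothetical helper, `project_onto_axis`. For centered data and any unit axis, the preserved variance plus the mean squared distance equals the total variance, so maximizing one minimizes the other.

```python
import numpy as np

# Hypothetical 2D dataset stretched along one direction (an assumption for this sketch).
rng = np.random.default_rng(0)
angle = np.pi / 6
stretched = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
rotation = np.array([[np.cos(angle), np.sin(angle)],
                     [-np.sin(angle), np.cos(angle)]])
X_toy = stretched @ rotation
X_centered = X_toy - X_toy.mean(axis=0)
total_variance = np.mean(np.sum(X_centered ** 2, axis=1))

def project_onto_axis(X, direction):
    """Return (preserved variance, mean squared distance) for a 1D projection."""
    u = direction / np.linalg.norm(direction)    # unit vector along the axis
    coords = X @ u                               # 1D coordinates along the axis
    projected = np.outer(coords, u)              # projected points, back in 2D
    preserved = np.mean(coords ** 2)             # variance kept (X is centered)
    mse = np.mean(np.sum((X - projected) ** 2, axis=1))
    return preserved, mse

# The rows of Vt from the SVD of the centered data are the principal axes.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
candidates = {
    "first principal axis": Vt[0],
    "x-axis": np.array([1.0, 0.0]),
    "second principal axis": Vt[1],
}
results = {name: project_onto_axis(X_centered, d) for name, d in candidates.items()}

for name, (preserved, mse) in results.items():
    print(f"{name}: preserved variance={preserved:.3f}, mean squared distance={mse:.3f}")
```

Among the three candidate axes, the first principal axis should show both the largest preserved variance and the smallest mean squared distance.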

##Setup

In [None]:
# Common imports
import numpy as np
import os
import time

from sklearn.decomposition import PCA

from sklearn.manifold import TSNE
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.manifold import MDS

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, accuracy_score

from sklearn.datasets import make_swiss_roll
from sklearn.datasets import fetch_openml

from sklearn.model_selection import train_test_split

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch
from matplotlib.offsetbox import AnnotationBbox, OffsetImage
from mpl_toolkits.mplot3d import proj3d
from mpl_toolkits.mplot3d import Axes3D

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [None]:
def plot_digits(instances, images_per_row=5, **options):
    """Plot flattened 28x28 digit images in a grid of at most `images_per_row` columns."""
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # Pad the last row with a blank strip so that every row has the same width
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    # Stack the rows vertically into a single image and display it
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=mpl.cm.binary, **options)
    plt.axis("off")

Let's load the MNIST dataset.

In [None]:
mnist = fetch_openml('mnist_784', version=1, cache=True, as_frame=False)
X_train, y = mnist["data"], mnist["target"]
X_train.shape

(70000, 784)

In [None]:
y.shape

(70000,)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y)
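
Note that `train_test_split` is called here without a `test_size`, so it falls back to its default of holding out 25% of the instances, and `X_train` is rebound from the full dataset to the training portion. The sketch below illustrates the resulting shapes on a small placeholder array (the `X_demo` and `y_demo` names are hypothetical, used so that nothing has to be downloaded). For the 70,000 MNIST instances, the same default should give 52,500 training and 17,500 test instances.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for MNIST (an assumption for this sketch).
X_demo = np.zeros((1000, 784))
y_demo = np.arange(1000) % 10

# No test_size given, so the default 25% hold-out applies.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)  # (750, 784) (250, 784)
```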

##Exercise-9