# What is Dimensionality Reduction?

Many machine learning problems involve thousands of features. Having so many features causes several problems, the most important of which are:
* Training becomes extremely slow
* Finding a good solution becomes difficult

This is known as the curse of dimensionality. In simple terms, dimensionality reduction is the process of reducing the number of features to the most relevant ones.
Like most compression processes, reducing the dimensionality loses some information: training gets faster, but the system may perform slightly worse. This is often acceptable, because "sometimes reducing the dimensionality can filter out some of the noise present and some of the unnecessary details".
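
One way to get a feel for the curse of dimensionality is to look at how distances behave as features are added. The sketch below is a small illustration on random data and is not part of the original notebook; the helper name `distance_contrast` and the uniform random points are assumptions made for the example. It compares the distance from a query point to its nearest and to its farthest neighbour. As the number of dimensions grows, the two distances become almost equal, which is one reason distance-based methods struggle with many features.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Ratio of nearest to farthest distance from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

# A ratio near 0 means "near" and "far" are clearly different;
# a ratio near 1 means every point is roughly equally far away.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```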

Dimensionality reduction is mostly used for:
* Data Compression
* Noise Reduction
* Data Classification
* Data Visualization

One of the most important applications of dimensionality reduction is data visualization. Reducing the dimensionality to two or three makes it possible to show the data on a 2D or 3D plot, where important insights can be gained by analysing patterns such as clusters.



# Main Approaches for Dimensionality Reduction
There are two main approaches to reducing dimensionality: projection and manifold learning.
* Projection: every high-dimensional data point is projected onto a suitable lower-dimensional subspace in a way that approximately preserves the distances between the points.
* Manifold learning: many dimensionality reduction algorithms work by modelling the manifold on which the training instances lie. This relies on the manifold hypothesis (or manifold assumption), which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. In most cases this assumption is based on observation or experience rather than on theory or pure logic.[4]
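
To make the manifold idea concrete, here is a toy sketch that is not part of the original notebook; the spiral and all variable names are assumptions chosen for illustration. The points live in 2D but lie on a 1D curve, so each point is fully described by a single number `t`. Two points can be close in straight-line distance while being far apart along the curve, which is the structure manifold learning methods try to respect.

```python
import numpy as np

# A 1-D manifold (a spiral) embedded in 2-D: each point is fully described by t.
t = np.linspace(0.0, 4.0 * np.pi, 400)
spiral = np.column_stack((t * np.cos(t), t * np.sin(t)))

# Straight-line distance between the point at t close to 2*pi and the last point (t = 4*pi).
i, j = 200, 399
straight = np.linalg.norm(spiral[i] - spiral[j])

# Distance measured along the spiral: sum of the small segments between i and j.
segments = np.linalg.norm(np.diff(spiral[i:j + 1], axis=0), axis=1)
along_manifold = segments.sum()

print("straight-line distance:", round(float(straight), 2))
print("distance along the manifold:", round(float(along_manifold), 2))
```

The second number is several times larger than the first: the two points sit on neighbouring turns of the spiral, so they look close in the plane even though they are far apart on the curve itself.
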
Now let's briefly explain the four techniques before jumping into solving the use case.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np
import pandas as pd
import time

# For plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

#PCA
from sklearn.decomposition import PCA
#TSNE
from sklearn.manifold import TSNE
#UMAP
import umap

import plotly.io as plt_io
import plotly.graph_objects as go

In [None]:
train = pd.read_csv('/kaggle/input/sign-language-mnist/sign_mnist_train/sign_mnist_train.csv')
train.head()

In [None]:
# Number of rows and columns in the training set
train.shape

In [None]:
# Keep only the rows whose label is below 10 so the plots stay readable
train = train[train['label'] < 10]

In [None]:
train

In [None]:
## Setting the label and the feature columns
y = train.loc[:,'label'].values
x = train.loc[:,'pixel1':].values

In [None]:
np.unique(y)

In [None]:
from sklearn.preprocessing import StandardScaler
## Standardizing the data
standardized_data = StandardScaler().fit_transform(x)

In [None]:
y

# PCA (Principal Component Analysis)
One of the best-known "unsupervised" dimensionality reduction algorithms is PCA (Principal Component Analysis).
It works by identifying the hyperplane that lies closest to the data and then projecting the data onto that hyperplane while retaining most of the variation in the data set.

**Principal components**

The axis that explains the maximum amount of variance in the training set is called the first principal component.
The axis orthogonal to it that explains the largest amount of the remaining variance is called the second principal component. In higher dimensions, PCA finds a third component orthogonal to the first two, and so on. For visualization purposes we always stick to two or at most three principal components.
It is very important to choose the right hyperplane so that, when the data is projected onto it, it retains the maximum amount of information about how the original data is distributed.
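
The steps described above can be sketched with plain NumPy. This is a minimal illustration on made-up data, not the implementation used by scikit-learn's `PCA`; the toy matrix `X` and the variable names are assumptions. It centres the data, uses a singular value decomposition to find the orthogonal axes ordered by variance, and projects the data onto the first two of them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 5 features, with most variance along two hidden directions.
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# 1. Centre the data (PCA works on zero-mean features).
X_centered = X - X.mean(axis=0)

# 2. The right singular vectors are the principal axes, ordered by variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                        # first two principal components
explained_variance = S**2 / (len(X) - 1)   # variance along every axis

# 3. Project the data onto the plane spanned by the two components.
X_2d = X_centered @ components.T

print(X_2d.shape)
print(np.round(explained_variance, 2))     # sorted from largest to smallest
```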

In [None]:
## Apply PCA and time it
start = time.time()
pca = PCA(n_components=3) # project from 784 to 3 dimensions
principalComponents = pca.fit_transform(standardized_data)
principal_df = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2','principal component 3'])
print('Duration: {} seconds'.format(time.time() - start))
principal_df.shape

In [None]:
def plot_2d(component1, component2):
    
    fig = go.Figure(data=go.Scatter(
        x = component1,
        y = component2,
        mode='markers',
        marker=dict(
            size=20,
            color=y, #set color equal to a variable
            colorscale='Rainbow', # one of plotly colorscales
            showscale=True,
            line_width=1
        )
    ))
    fig.update_layout(margin=dict( l=100,r=100,b=100,t=100),width=2000,height=1200)                 
    fig.layout.template = 'plotly_dark'
    
    fig.show()



In [None]:
def plot_3d(component1,component2,component3):

    fig = go.Figure(data=[go.Scatter3d(
        x=component1,
        y=component2,
        z=component3,
        mode='markers',
        marker=dict(
            size=10,
            color=y,                # set color to an array/list of desired values
            colorscale='Rainbow',   # choose a colorscale
            opacity=1,
            line_width=1
        )
    )])

    # tight layout
    fig.update_layout(margin=dict(l=50,r=50,b=50,t=50),width=1800,height=1000)
    fig.layout.template = 'plotly_dark'
    
    fig.show()

In [None]:
plot_2d(principalComponents[:, 0],principalComponents[:, 1])

In [None]:
plot_3d(principalComponents[:, 0],principalComponents[:, 1],principalComponents[:, 2])

# t-SNE ( T-distributed stochastic neighbour embedding )
t-SNE (t-distributed stochastic neighbour embedding), created in 2008 by Laurens van der Maaten and Geoffrey Hinton, is a dimensionality reduction technique that is particularly well suited to the visualization of high-dimensional datasets.

t-SNE takes a high-dimensional data set and reduces it to a low-dimensional map that retains a lot of the original information. It does so by giving each data point a location in a two- or three-dimensional map. The technique reveals clusters in the data, so that the embedding preserves the meaning in the data: t-SNE reduces dimensionality while trying to keep similar instances close and dissimilar instances apart.[2]
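
A small sketch can show where "keep similar instances close and dissimilar instances apart" comes from. t-SNE measures similarity with a Gaussian kernel in the original space and with a heavier-tailed Student-t kernel (one degree of freedom) in the low-dimensional map. The code below only compares the two unnormalised kernels; the function names are assumptions, and the per-point bandwidths chosen from the perplexity, the normalisation and the optimisation itself are all left out.

```python
import numpy as np

def gaussian_similarity(distance, sigma=1.0):
    """Unnormalised similarity used in the high-dimensional space."""
    return np.exp(-distance**2 / (2.0 * sigma**2))

def student_t_similarity(distance):
    """Unnormalised similarity used in the low-dimensional map (one degree of freedom)."""
    return 1.0 / (1.0 + distance**2)

# At small distances the kernels are comparable; at large distances the
# Student-t value is far larger, so it decays much more slowly.
for d in (0.5, 1.0, 3.0, 6.0):
    print(d, gaussian_similarity(d), student_t_similarity(d))
```

Because the Student-t kernel has a heavy tail, a pair of moderately dissimilar points has to be placed quite far apart in the map to match its small Gaussian similarity, which helps open up space between clusters.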

For a quick visualization of this technique, refer to the animation in Cyrille Rossant's illustrated tutorial, which I highly recommend checking out.
Link: https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/

In [None]:
# t-SNE is expensive in time and memory, so first reduce the data to 50 dimensions with PCA.

start = time.time()
pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(standardized_data)
# Note: newer scikit-learn versions rename TSNE's n_iter parameter to max_iter.
tsne = TSNE(random_state = 42, n_components=3,verbose=0, perplexity=40, n_iter=300).fit_transform(pca_result_50)
print('Duration: {} seconds'.format(time.time() - start))

In [None]:
plot_2d(tsne[:, 0],tsne[:, 1])

In [None]:
plot_3d(tsne[:, 0],tsne[:, 1],tsne[:, 2])

# UMAP ( Uniform Manifold Approximation and Projection )

Uniform Manifold Approximation and Projection (UMAP), created in 2018 by Leland McInnes, John Healy and James Melville, is a general-purpose manifold learning and dimension reduction algorithm.
UMAP is a nonlinear dimensionality reduction method that is very effective for visualizing clusters or groups of data points and their relative proximities.

The significant difference from t-SNE is scalability: UMAP can be applied directly to sparse matrices, eliminating the need for a dimensionality reduction such as PCA or truncated SVD (Singular Value Decomposition) as a pre-processing step.[1]
Put simply, it is similar to t-SNE but probably faster, and it probably gives a better visualization (let's find out below).
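
UMAP starts by building a nearest-neighbour graph of the data (the `n_neighbors` parameter of `umap.UMAP` controls its size) and then lays that graph out in the low-dimensional space. The sketch below illustrates only that first step with a brute-force search in NumPy; it is not UMAP itself, the helper `knn_graph` and the toy blobs are assumptions, and the real library relies on a much faster approximate search.

```python
import numpy as np

def knn_graph(X, n_neighbors):
    """Return, for each row of X, the indices of its n_neighbors nearest rows."""
    # Pairwise Euclidean distances (fine for a toy example, too slow for large data).
    diffs = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diffs**2).sum(axis=-1))
    np.fill_diagonal(distances, np.inf)      # a point is not its own neighbour
    return np.argsort(distances, axis=1)[:, :n_neighbors]

rng = np.random.default_rng(0)
# Two well-separated blobs of 20 points each in 10 dimensions.
X = np.vstack((rng.normal(0.0, 1.0, (20, 10)), rng.normal(20.0, 1.0, (20, 10))))
neighbors = knn_graph(X, n_neighbors=5)

print(neighbors.shape)
print(neighbors[0])   # neighbours of the first point all come from the first blob
```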

In [None]:
start = time.time()
reducer = umap.UMAP(random_state=42,n_components=3)
embedding = reducer.fit_transform(standardized_data)
print('Duration: {} seconds'.format(time.time() - start))

In [None]:
plot_2d(reducer.embedding_[:, 0],reducer.embedding_[:, 1])

In [None]:
plot_3d(reducer.embedding_[:, 0],reducer.embedding_[:, 1],reducer.embedding_[:, 2])

# LDA ( Linear Discriminant Analysis )

Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern classification.
The goal is to project a dataset onto a lower-dimensional space with good class separability in order to avoid overfitting and to reduce computational costs.

The general approach is very similar to PCA, but rather than only finding the component axes that maximize the variance of our data (PCA), we are additionally interested in the axes that maximize the separation between multiple classes (LDA). LDA is "supervised" and computes the directions ("linear discriminants") that maximize the separation between multiple classes.
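
The difference between the two criteria is easiest to see with two classes. The sketch below is an illustration on synthetic data, not the scikit-learn implementation; the data and variable names are assumptions. The classes are stretched along the x axis but separated along the y axis, so the direction of maximum variance (PCA) and the direction of maximum class separation (Fisher's two-class discriminant) point in different directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes stretched along x (large variance) but separated along y.
class_0 = rng.normal(loc=[0.0, 0.0], scale=[5.0, 0.5], size=(500, 2))
class_1 = rng.normal(loc=[0.0, 2.0], scale=[5.0, 0.5], size=(500, 2))
X = np.vstack((class_0, class_1))

# PCA direction: top eigenvector of the covariance matrix (ignores the labels).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_direction = eigvecs[:, -1]             # eigh sorts eigenvalues in ascending order

# Fisher/LDA direction for two classes: S_w^{-1} (m_1 - m_0) (uses the labels).
# The sum of class covariances equals the within-class scatter up to a constant factor.
within_scatter = np.cov(class_0, rowvar=False) + np.cov(class_1, rowvar=False)
mean_difference = class_1.mean(axis=0) - class_0.mean(axis=0)
lda_direction = np.linalg.solve(within_scatter, mean_difference)
lda_direction /= np.linalg.norm(lda_direction)

print("PCA direction:", np.round(pca_direction, 2))   # close to the x axis
print("LDA direction:", np.round(lda_direction, 2))   # close to the y axis
```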


In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

start = time.time()
X_LDA = LDA(n_components=3).fit_transform(standardized_data,y)
print('Duration: {} seconds'.format(time.time() - start))

In [None]:
plot_2d(X_LDA[:, 0],X_LDA[:, 1])

In [None]:
plot_3d(X_LDA[:, 0],X_LDA[:, 1],X_LDA[:, 2])

# Comparison between the Dimension Reduction Techniques: PCA vs t-SNE vs UMAP vs LDA

By comparing the visualisations produced by the four models, we can see that PCA did not do a good job of differentiating the signs. A main drawback of PCA is that it is highly influenced by outliers in the data. Moreover, PCA is a linear projection, which means it can't capture non-linear dependencies; its goal is to find the directions (the so-called principal components) that maximize the variance in a dataset.

t-SNE does a better job than PCA at visualising the different patterns of the clusters. Similar labels are clustered together, but there are still big agglomerates of data points on top of each other, so the result is certainly not good enough to expect a clustering algorithm to perform well.

UMAP outperformed t-SNE and PCA: in the 2D and 3D plots we can see mini-clusters that are separated well. It is very effective for visualizing clusters or groups of data points and their relative proximities. However, for this use case the result is still not good enough to expect a clustering algorithm to distinguish the patterns.

Finally, LDA outperformed all the previous techniques in all aspects: excellent computation time (second fastest) and the well-separated clusters we were expecting.
UMAP is much faster than t-SNE. Another problem with t-SNE is that it needs another dimensionality reduction method applied beforehand; otherwise it would take even longer to compute.
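
Since computation time is part of the comparison, the `start = time.time()` pattern repeated in the cells above could be wrapped in a small helper that collects every duration in one place. This is a sketch using only the standard library; the name `timed` and the `durations` dictionary are assumptions, and the loop is just a stand-in for a call such as `pca.fit_transform(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Time the enclosed block and store the duration in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

durations = {}
with timed("example", durations):
    sum(i * i for i in range(100_000))     # stand-in for a fit_transform call

# Print the recorded durations from fastest to slowest.
for name, seconds in sorted(durations.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.3f} seconds")
```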

# Summary

**We have explored four dimensionality reduction techniques for data visualization (PCA, t-SNE, UMAP and LDA) and used them to visualize a high-dimensional dataset in 2D and 3D plots.**

- **PCA** did not work well at separating the different signs (10). Instead of arbitrarily setting the number of dimensions to 3, it is generally better to choose the number of dimensions that add up to a sufficiently large proportion of the variance; since this is a data visualization problem, however, 3 was the most reasonable choice.

- **t-SNE** did a better job of separating the clusters, and its 2D and 3D visualizations were definitely better than PCA's. However, it took a very long time to compute its embeddings, and t-SNE doesn't have major uses outside visualisation.

- **UMAP** turned out to be the most effective manifold learning technique in terms of displaying the different clusters with clear separation, although the clusters are not good enough for multi-class pattern classification.

- **LDA** outperformed all the above techniques: excellent computation time (second fastest) and the well-separated clusters we were expecting.
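
The PCA point above, choosing the number of dimensions from the proportion of variance they explain, can be sketched as follows. This is an illustration on synthetic data with an assumed helper `components_for_variance` and an assumed 95% target; scikit-learn's `PCA` can make the same selection when `n_components` is given as a float between 0 and 1.

```python
import numpy as np

def components_for_variance(X, target=0.95):
    """Smallest number of principal components whose variance ratios sum to >= target."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(variance_ratio)
    # First index where the cumulative ratio reaches the target, as a count.
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))
    return k, cumulative

rng = np.random.default_rng(0)
# Toy data: 20 features that are noisy mixtures of 3 hidden factors.
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(300, 20))
k, cumulative = components_for_variance(X, target=0.95)

print("components needed:", k)
print("cumulative variance ratio:", np.round(cumulative[:5], 3))
```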
