In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf #needed for models in this script
import pylab as pl
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

In [2]:
pd.set_option('display.notebook_repr_html', True) #render dataframes as HTML tables in the notebook
%matplotlib inline

## Dimensionality Reduction

As datasets get bigger and the number of variables grows, we'll need "heavier" techniques to determine which variables and which points should be included in our analyses.

In this lesson, we're going to explore dimensionality reduction, which is the process of reducing the number of random variables under consideration. It's a way of paring down a dataset to what's important.

## Overview

We're going to cover two topics in this lesson: <b>data dimensionality reduction</b> and <b>data preprocessing</b>. Dimensionality reduction aims to produce a compact low-dimensional "encoding" of a high-dimensional data set, while data preprocessing aims to simplify, reduce and clean data for subsequent training.

To motivate our discussion, we will examine the aptly named <b>curse of dimensionality.</b> Suppose our data consists of 1000 datapoints uniformly distributed in a unit hypercube, and that we wish to apply 10-Nearest-Neighbors to this dataset. A neighborhood that captures 10 of the 1000 points must cover, on average, 10/1000 = 1% of the cube's volume. In 2 dimensions (2 independent variables), a sub-cube with that volume has an edge length of sqrt(10/1000) = 0.1. As we increase the number of independent variables, the edge length needed to capture the same 10 points “blows up”; with just 10 dimensions it is (10/1000)^(1/10) ≈ 0.631. In other words, the neighborhood must span about 63% of the range of every variable to capture just 10 datapoints, so it is no longer local and no longer informative.

![](files/dimr1.jpg)
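The edge-length arithmetic in this example is easy to check numerically. The snippet below is a minimal sketch: the helper name `edge_length` is our own, and it assumes the neighborhood is a sub-cube of the unit hypercube that holds a fraction k/N of the total volume.

```python
def edge_length(fraction, n_dims):
    """Edge length of a sub-cube that holds `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / n_dims)

n_points = 1000   # datapoints, uniformly distributed
k = 10            # neighbors we want to capture
fraction = k / float(n_points)   # 1% of the volume

for n_dims in (1, 2, 3, 10, 100):
    # Prints roughly 0.010, 0.100, 0.215, 0.631, 0.955
    print("{:>3} dimensions: edge length {:.3f}".format(n_dims, edge_length(fraction, n_dims)))
```

The required edge length approaches 1 as the number of dimensions grows, which means a "local" neighborhood ends up covering almost the full range of every variable.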

In addition to the example above, here are some other situations in which we'd want to reduce the dimensionality of our data:

* If the data contain redundancy (multiple features encode the same information)
* To eliminate redundant features and extract those that are more central to the phenomenon we are trying to learn
* If the data contain irrelevant or noisy features that could negatively impact a learning algorithm's performance
* To more effectively visualize relationships between features
* To address computational resource constraints

In this lesson we're going to study two techniques for reducing the dimensionality of data in different ways:

1. <b>feature transform:</b> determines the dependencies between features and finds a new, smaller set of features (each new feature combines several of the original ones) while preserving as much of the structure of the original data as possible.
2. <b>feature selection:</b> determines the relevant features for a given learning problem. Instead of creating a new feature set through transformations, feature selection simply subsets the feature set based on some criteria of relevance.
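To make the difference between the two approaches concrete, here is a small sketch on the Iris dataset using scikit-learn. The choice of `SelectKBest` with the ANOVA F-score (`f_classif`) as the relevance criterion is our own assumption for illustration; any other criterion could be substituted.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target   # 150 samples, 4 features

# Feature transform: each new column is a linear combination of all 4 originals.
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)

# Feature selection: keep the 2 original columns with the highest F-score.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

print(X_transformed.shape, X_selected.shape)    # both have 2 columns
print(pca.components_.shape)                    # 2 components, each with 4 weights
print([iris.feature_names[i] for i in kept])    # names of the columns that survived
```

Both results have two columns, but only the selected features are still original measurements; the transformed ones are new axes with no direct physical meaning.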

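Principal Component Analysis (PCA) is the most basic and most common feature transform used in feature extraction and dimensionality reduction. It finds orthogonal directions (the principal components) along which the data vary the most and projects the data onto the first few of them. It is easy to fit and has additional uses, such as noise reduction.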
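For intuition, PCA can be written in a few lines of NumPy using the singular value decomposition. This is a sketch, not how any particular library implements it, and the synthetic data (points along a line in 2 dimensions plus small Gaussian noise) is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic data (assumed for illustration): a 1-D signal embedded in 2-D, plus noise.
t = rng.uniform(-1, 1, size=200)
clean = np.column_stack([t, 2 * t])
noisy = clean + rng.normal(scale=0.05, size=clean.shape)

# 1. Center the data.
mean = noisy.mean(axis=0)
centered = noisy - mean

# 2. SVD of the centered data; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# 3. Project onto the first principal component (2-D -> 1-D).
scores = centered.dot(Vt[:1].T)

# 4. Map back to 2-D; the variation along the discarded direction is dropped.
denoised = scores.dot(Vt[:1]) + mean

err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((denoised - clean) ** 2)
print(explained_variance_ratio)
print(err_before, err_after)
```

Because almost all of the variance lies along one direction, keeping a single component both halves the dimensionality and discards the noise that is perpendicular to that direction, so `err_after` comes out smaller than `err_before`.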

PCA is an unsupervised learning process: it never looks at the labels. For supervised learning, we want to preserve not only the variance in the features but also the relationships between features and labels. Linear Discriminant Analysis (LDA) takes the labels into account, looking for projections that best separate the classes.
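The difference shows up directly in how the two models are fit. The sketch below assumes a reasonably recent scikit-learn, where LDA lives in `sklearn.discriminant_analysis`; the variable names are our own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

# Unsupervised: PCA is fit on the features alone.
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA needs the labels to find class-separating directions.
# With 3 classes it can produce at most 3 - 1 = 2 components.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Both projections reduce the four Iris measurements to two dimensions, but only the LDA axes were chosen with the species labels in view.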

* Review the sklearn PCA example with Iris Dataset: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html
* Review the sklearn Comparison of LDA and PCA 2D Projection of Iris Dataset: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html