# Chapter 9 - Dimensionality Reduction Using Feature Extraction

## 9.0 - Introduction

It is common to have access to thousands or even hundreds of thousands of features. For example, in Chapter 8 we transformed a 256 x 256-pixel color image into 196,608 features. Furthermore, because each of these features can take one of 256 possible values, there are $ 256^{196608} $ different configurations our observation can take. This is problematic because we will practically never be able to collect enough observations to cover even a small fraction of those configurations, so our learning algorithms will not have enough data to operate correctly.
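
The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: it assumes the image has three color channels (red, green, and blue), which is what yields 196,608 features, and the variable names are chosen here for illustration.

```python
import math

# A 256 x 256 image with three color channels (assumed: red, green, blue)
height, width, channels = 256, 256, 3
n_features = height * width * channels

# Each feature can take one of 256 values, so there are 256 ** n_features
# configurations. That number is too large to print usefully, so we count
# its decimal digits instead.
values_per_feature = 256
n_digits = math.floor(n_features * math.log10(values_per_feature)) + 1

print("Number of features:", n_features)
print("Decimal digits in the number of configurations:", n_digits)
```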

Fortunately, not all features are created equal and the goal of feature extraction for dimensionality reduction is to transform our set of features, $ p_{original} $, such that we end up with a new set, $ p_{new} $, where $ p_{original} > p_{new} $, while still keeping much of the underlying information. Put another way, we reduce the number of features with only a small loss in our data's ability to generate high-quality predictions. In this chapter, we will cover a number of feature extraction techniques to do just this.

One downside of the feature extraction techniques we discuss is that the new features we generate will not be interpretable by humans. They will retain as much, or nearly as much, ability to train our models, but will appear to the human eye as a collection of random numbers. If we want to maintain our ability to interpret our models, dimensionality reduction through feature selection is a better option.

## 9.1 - Reducing Features Using Principal Components

### Problem
Given a set of features, you want to reduce the number of features while retaining the variance in the data.

### Solution
Use principal component analysis with scikit-learn's 'PCA':

In [1]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

In [2]:
# Load the data
digits = datasets.load_digits()
digits.data

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

In [3]:
# Standardize the feature matrix
features = StandardScaler().fit_transform(digits.data)
features

array([[ 0.        , -0.33501649, -0.04308102, ..., -1.14664746,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  0.54856067,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  1.56568555,
         1.6951369 , -0.19600752],
       ...,
       [ 0.        , -0.33501649, -0.88456568, ..., -0.12952258,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -0.67419451, ...,  0.8876023 ,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649,  1.00877481, ...,  0.8876023 ,
        -0.26113572, -0.19600752]])

In [4]:
# Create a PCA that will retain 99% of variance
pca = PCA(n_components = 0.99, whiten = True)
pca

PCA(copy=True, iterated_power='auto', n_components=0.99, random_state=None,
    svd_solver='auto', tol=0.0, whiten=True)

In [5]:
# Conduct PCA
features_pca = pca.fit_transform(features)
features_pca

array([[ 0.70631939, -0.39512814, -1.73816236, ...,  0.36526417,
        -0.31369006,  0.05355504],
       [ 0.21732591,  0.38276482,  1.72878893, ..., -0.17818068,
        -0.14031747,  1.18179755],
       [ 0.4804351 , -0.13130437,  1.33172761, ..., -0.01924571,
        -0.23580029,  0.92966158],
       ...,
       [ 0.37732433, -0.0612296 ,  1.0879821 , ..., -1.05526847,
         1.75559618, -0.87894699],
       [ 0.39705007, -0.15768102, -1.08160094, ...,  0.10442881,
         0.65907949,  1.1292155 ],
       [-0.46407544, -0.92213976,  0.12493334, ..., -1.10593026,
         0.54434185, -0.26573597]])

In [6]:
# Show results
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_pca.shape[1])

Original number of features: 64
Reduced number of features: 54


### Discussion
Principal component analysis (PCA) is a popular linear dimensionality reduction technique. PCA projects observations onto the (hopefully fewer) principal components of the feature matrix that retain the most variance. PCA is an unsupervised technique, meaning that it does not use information from the target vector and instead only considers the feature matrix.

For a mathematical description of how PCA works, see the external resources listed at the end of this recipe. However, we can understand the intuition behind PCA using a simple example. In the following figure, our data contains two features, $ x_1 $ and $ x_2 $. Looking at the visualization, it should be clear that observations are spread out like a cigar, with a lot of length and very little height. More specifically, we can say that the variance of the "length" is significantly greater than that of the "height". Instead of length and height, we refer to the "direction" with the most variance as the first principal component and the "direction" with the second-most variance as the second principal component (and so on).

If we wanted to reduce our features, one strategy would be to project all observations in our 2D space onto the 1D principal component. We would lose the information captured in the second principal component, but in some situations that would be an acceptable trade-off. This is PCA.
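
The cigar intuition can be sketched with simulated data. The example below is a minimal illustration, not the data behind the figure: the two features, their scales, and the random seed are all assumptions made here so that one direction carries most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulate cigar-shaped data: x2 closely follows x1, plus a little noise
# (the scales 3.0 and 0.3 are arbitrary illustrative choices)
rng = np.random.RandomState(0)
x1 = rng.normal(loc=0.0, scale=3.0, size=500)
x2 = 0.5 * x1 + rng.normal(loc=0.0, scale=0.3, size=500)
features_2d = np.column_stack([x1, x2])

# Keep only the first principal component (2D -> 1D)
pca_1d = PCA(n_components=1)
features_1d = pca_1d.fit_transform(features_2d)

print("Shape before:", features_2d.shape)
print("Shape after:", features_1d.shape)
print("Share of variance kept:", pca_1d.explained_variance_ratio_[0])
```

Because almost all of the spread lies along the "length" of the cigar, the single remaining feature should keep nearly all of the variance.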

PCA is implemented in scikit-learn using the 'PCA' class. 'n_components' has two modes of operation, depending on the argument provided. If the argument is an integer (1 or greater), 'PCA' will return that many features. This leads to the question of how to select the number of features that is optimal. Fortunately for us, if the argument to 'n_components' is between 0 and 1, 'PCA' returns the minimum number of features that retain that much variance. It is common to use values of 0.95 and 0.99, meaning 95% and 99% of the variance of the original features has been retained, respectively. 'whiten=True' transforms the values of each principal component so that they have zero mean and unit variance. Another parameter and argument is 'svd_solver="randomized"', which implements a stochastic algorithm to find the first principal components in often significantly less time.
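
The two modes of 'n_components' and the randomized solver can be compared side by side. The following is a small sketch on the same digits data; the choices of 10 components and 95% are illustrative, and 'random_state' is set only to make the randomized run repeatable.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the digits data, as in the solution
features = StandardScaler().fit_transform(datasets.load_digits().data)

# Integer argument: return exactly that many components
pca_fixed = PCA(n_components=10)
features_fixed = pca_fixed.fit_transform(features)

# Fractional argument: return as few components as needed to retain 95% of variance
pca_ratio = PCA(n_components=0.95)
features_ratio = pca_ratio.fit_transform(features)

# Randomized solver with an integer number of components
pca_random = PCA(n_components=10, svd_solver="randomized", random_state=1)
features_random = pca_random.fit_transform(features)

print("Integer argument:", features_fixed.shape[1])
print("Fractional argument:", features_ratio.shape[1])
print("Variance retained:", pca_ratio.explained_variance_ratio_.sum())
print("Randomized solver:", features_random.shape[1])
```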

The output of our solution shows that PCA let us reduce our dimensionality by 10 features while still retaining 99% of the information (variance) in the feature matrix.
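
One way to see where the reduced number of features comes from is to look at the cumulative explained variance. This sketch fits PCA with all components kept and counts how many are needed to reach 99%; the variable names are ours, and the count should agree with the 'n_components_' attribute of a PCA fitted with 'n_components=0.99'.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the digits data, as in the solution
features = StandardScaler().fit_transform(datasets.load_digits().data)

# Fit PCA with every component kept so the full variance profile is visible
pca_full = PCA().fit(features)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 99%
n_needed = int(np.argmax(cumulative_variance >= 0.99)) + 1

# Compare with the number of components PCA chooses on its own
pca_99 = PCA(n_components=0.99).fit(features)

print("Components needed by cumulative sum:", n_needed)
print("Components chosen by PCA:", pca_99.n_components_)
```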

![features x1 vs. x2](images/featureX1vsX2.jpg)

### See Also
   * scikit-learn documentation on PCA (http://bit.ly/2FrSvyx)
   * Choosing the Number of Principal Components (http://bit.ly/2FrSGtH)
   * Principal component analysis with linear algebra (http://bit.ly/2FuzdIW)

***

## 9.2 - Reducing Features When Data Is Linearly Inseparable

### Problem

You suspect you have linearly inseparable data and want to reduce dimensions.

### Solution

Use an extension of principal component analysis that uses kernels to allow for non-linear dimensionality reduction:

In [7]:
# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

In [8]:
# Create linearly inseparable data
features, _ = make_circles(n_samples = 1000, random_state = 1, noise = 0.1, factor = 0.1)
# To look at the data
# features, _

In [9]:
# Apply kernel PCA with radial basis function (RBF) kernel
kpca = KernelPCA(kernel = "rbf", gamma = 15, n_components = 1)
features_kpca = kpca.fit_transform(features)

In [10]:
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_kpca.shape[1])

Original number of features: 2
Reduced number of features: 1


### Discussion

PCA is able to reduce the dimensionality of our feature matrix (i.e., the number of features). Standard PCA uses linear projection to reduce the features. If the data is linearly separable (i.e., you can draw a straight line or hyperplane between different classes) then PCA works well. However, if your data is not linearly separable (e.g., you can only separate classes using a curved decision boundary), the linear transformation will not work as well. In our solution we used scikit-learn's 'make_circles' to generate a simulated dataset with a target vector of two classes and two features. 'make_circles' makes linearly inseparable data; specifically, one class is surrounded on all sides by the other class.

![Linearly inseparable data](images/)

If we used linear PCA to reduce the dimensions of our data, the two classes would be linearly projected onto the first principal component such that they would become intertwined.

![Linear PCA reduction of dimensions](images/)

Ideally, we would want a transformation that would both reduce the dimensions and also make the data linearly separable. Kernel PCA can do both.

![Kernel PCA reduction of dimensions](images/)
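
The difference between the two projections can also be checked numerically. The sketch below is one possible way to do so, not part of the original solution: it keeps the target vector from 'make_circles', reduces the data to one feature with each method, and uses the training accuracy of a logistic regression as a rough stand-in for "linearly separable".

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Same simulated data as in the solution, but keep the target vector this time
features, target = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Reduce to one dimension with linear PCA and with RBF kernel PCA
features_pca = PCA(n_components=1).fit_transform(features)
features_kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1).fit_transform(features)


def linear_accuracy(features_1d, target):
    """Training accuracy of a linear classifier on a single feature."""
    classifier = LogisticRegression()
    return classifier.fit(features_1d, target).score(features_1d, target)


accuracy_pca = linear_accuracy(features_pca, target)
accuracy_kpca = linear_accuracy(features_kpca, target)

print("Linear PCA accuracy:", accuracy_pca)
print("Kernel PCA accuracy:", accuracy_kpca)
```

A linear classifier should do much better on the kernel PCA feature than on the linear PCA feature, where the inner class sits in the middle of the projected outer class.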

Kernels allow us to project the linearly inseparable data into a higher dimension where it is linearly separable; this is called the kernel trick. Don't worry if you don't understand the details of the kernel trick; just think of kernels as different ways of projecting the data. There are a number of kernels we can use in scikit-learn's 'KernelPCA', specified using the 'kernel' parameter. A common kernel to use is the Gaussian radial basis function kernel ('rbf'), but other options are the polynomial kernel ('poly') and sigmoid kernel ('sigmoid'). We can even specify a linear projection ('linear'), which will produce the same results as standard PCA.
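
The claim about the linear kernel can be verified directly. In this sketch, 'eigen_solver="dense"' is our choice so that the comparison uses an exact eigendecomposition, and absolute values are compared because the sign of a principal component is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Same simulated data as in the solution
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# One component from standard PCA and from kernel PCA with a linear kernel
features_pca = PCA(n_components=1).fit_transform(features)
features_linear_kpca = KernelPCA(
    kernel="linear", n_components=1, eigen_solver="dense"
).fit_transform(features)

# The sign of a component is arbitrary, so compare absolute values
same_up_to_sign = np.allclose(
    np.abs(features_pca), np.abs(features_linear_kpca), atol=1e-6
)
print("Same projection (ignoring sign):", same_up_to_sign)
```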

One downside of kernel PCA is that there are a number of parameters we need to specify. For example, in Recipe 9.1 we set 'n_components' to 0.99 to make PCA select the number of components to retain 99% of the variance. We don't have this option in kernel PCA. Instead we have to define the number of components (e.g., 'n_components=1'). Furthermore, kernels come with their own hyperparameters that we will have to set; for example, the radial basis function requires a 'gamma' value.

So how do we know which values to use? Through trial and error. Specifically, we can train our machine learning model multiple times, each time with a different kernel or a different value of the parameter. Once we find the combination of values that produces the highest-quality predictions, we are done. We will learn about this strategy in depth in Chapter 12.
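
As a preview, the trial-and-error strategy might look something like the sketch below. It is one possible setup rather than the approach developed in Chapter 12: the pipeline step names, the logistic regression classifier, and the candidate kernels and 'gamma' values are all illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Same simulated data as in the solution, keeping the target vector
features, target = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Chain the dimensionality reduction and a simple classifier
pipeline = Pipeline([
    ("kpca", KernelPCA(n_components=1)),
    ("classifier", LogisticRegression()),
])

# Candidate values to try (illustrative choices, not recommendations);
# gamma is ignored when the kernel is linear
search_space = {
    "kpca__kernel": ["linear", "rbf"],
    "kpca__gamma": [0.1, 1, 15],
}

# Train and score every combination with 5-fold cross-validation
search = GridSearchCV(pipeline, search_space, cv=5)
search.fit(features, target)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```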

### See Also
   * scikit-learn documentation on Kernel PCA (http://bit.ly/2HRkxC3)
   * Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA (http://bit.ly/SHReP3f)

***

## 9.3 - Reducing Features by Maximizing Class Separability

### Problem

You want to reduce the features to be used by a classifier.

### Solution

Try linear discriminant analysis (LDA) to project the features onto component axes that maximize the separation of classes:

In [11]:
# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [12]:
# Load Iris flower dataset
iris = datasets.load_iris()
features = iris.data
target = iris.target

In [13]:
# Create and run an LDA, then use it to transform the features
lda = LDA(n_components = 1)
features_lda = lda.fit(features, target).transform(features)

In [14]:
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_lda.shape[1])

Original number of features: 4
Reduced number of features: 1


We can use 'explained_variance_ratio_' to view the amount of variance explained by each component. In our solution the single component explained over 99% of the variance:

In [15]:
lda.explained_variance_ratio_

array([0.9912126])

### Discussion

LDA is a classification technique that is also a popular technique for dimensionality reduction. LDA works similarly to principal component analysis (PCA) in that it projects our feature space onto a lower-dimensional space. However, in PCA we were only interested in the component axes that maximize the variance in the data, while in LDA we have the additional goal of maximizing the differences between classes. In this pictured example, we have data comprising two target classes and two features. If we project the data onto the y-axis, the two classes are not easily separable (i.e., they overlap), while if we project the data onto the x-axis, we are left with a feature vector (i.e., we reduced our dimensionality by one) that still preserves class separability. In the real world, of course, the relationship between classes will be more complex and the dimensionality will be higher, but the concept remains the same.

![Two target classes and two features](images/)
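
The difference in goals between PCA and LDA can be illustrated on the Iris data. The sketch below projects the features onto one dimension with each method and compares a simple class-separation measure; the 'separation_ratio' helper is hypothetical, written here for illustration, and computes the between-class sum of squares divided by the within-class sum of squares.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
features, target = iris.data, iris.target

# One-dimensional projections from PCA (ignores target) and LDA (uses target)
projection_pca = PCA(n_components=1).fit_transform(features).ravel()
projection_lda = (
    LinearDiscriminantAnalysis(n_components=1).fit_transform(features, target).ravel()
)


def separation_ratio(values, target):
    """Hypothetical helper: between-class over within-class sum of squares."""
    overall_mean = values.mean()
    between, within = 0.0, 0.0
    for label in np.unique(target):
        group = values[target == label]
        between += len(group) * (group.mean() - overall_mean) ** 2
        within += ((group - group.mean()) ** 2).sum()
    return between / within


ratio_pca = separation_ratio(projection_pca, target)
ratio_lda = separation_ratio(projection_lda, target)

print("PCA separation ratio:", ratio_pca)
print("LDA separation ratio:", ratio_lda)
```

A larger ratio means the classes sit further apart relative to their own spread, so the LDA projection should score higher than the PCA projection.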

In scikit-learn, LDA is implemented using 'LinearDiscriminantAnalysis', which includes a parameter, 'n_components', indicating the number of features we want returned. To figure out what argument value to use with 'n_components' (i.e., how many features to keep), we can take advantage of the fact that 'explained_variance_ratio_' tells us the variance explained by each outputted feature and is a sorted array. For example:

In [17]:
lda.explained_variance_ratio_

array([0.9912126])
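
Building on this, one way to pick 'n_components' is to fit LDA with 'n_components=None', which keeps every available component, and then count how many are needed to reach a target share of explained variance. The 'select_n_components' helper and the 95% threshold below are hypothetical and shown only as a sketch of the idea.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
features, target = iris.data, iris.target

# With n_components=None, LDA keeps every available component
lda_all = LinearDiscriminantAnalysis(n_components=None)
lda_all.fit(features, target)
variance_ratios = lda_all.explained_variance_ratio_


def select_n_components(variance_ratios, goal_variance):
    """Hypothetical helper: count the components needed to reach goal_variance."""
    total_variance = 0.0
    for n_components, ratio in enumerate(variance_ratios, start=1):
        total_variance += ratio
        if total_variance >= goal_variance:
            return n_components
    # The goal was never reached, so keep every component
    return len(variance_ratios)


n_selected = select_n_components(variance_ratios, 0.95)
print("Components needed for 95% of variance:", n_selected)
```

For the Iris data the first component alone explains over 99% of the variance, so a single component is enough to pass a 95% threshold.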