# PCA and dimensionality reduction
Principal component analysis (PCA) is a method of dimensionality reduction that aims to reduce a dataset to only the parts that hold the most information. Its main goal is to reduce the number of features in a dataset while capturing a large amount of the variability contained within the data.


## Why it is useful
When you have a dataset with a large number of features, you can often suffer from the "curse of dimensionality". This means that you need to reduce the number of features (or dimensions) before you can effectively start building a model. Feature selection and feature extraction are two ways of approaching dimensionality reduction.

We can use PCA to extract new "latent features" from our dataset. These are features that aren't explicitly in your dataset but are based on existing features; think "performance" or "clutch", which might be produced by a variety of other features. This allows us to reduce the number of dimensions overall while holding on to the parts that carry the most information and represent the highest variance. This can be done easily with `sklearn`.

Once you have fitted a PCA, you look at the principal components and at how much of the variability of the original data each of them captures; these are the main outputs of a principal component analysis.
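As a minimal sketch of that workflow with `sklearn`, the example below fits a PCA and prints the two outputs mentioned above. The synthetic data, the variable names and the choice of three components are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 rows, 5 features,
# where the last two features are noisy copies of the first two.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base,
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1] + 0.1 * rng.normal(size=200),
])

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep 3 components (an arbitrary choice for this sketch).
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (200, 3)
print(pca.components_.shape)          # (3, 5): one row of weights per component
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```

`components_` holds the principal components and `explained_variance_ratio_` holds the fraction of the original variance that each one captures.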

## Dimensionality Reduction
Dimensionality reduction isn't always necessary, but some problems and datasets involve a lot of features: hundreds, perhaps thousands.

There are two main approaches to dimensionality reduction: 
1. Feature Selection
2. Feature Extraction

### Feature Selection
This involves finding a subset of the original features by determining which of them are most useful. There are a few methods for selecting features: 
* __Filter methods__ use a ranking or sorting algorithm to filter out the features that are deemed less useful. They focus on discerning some inherent correlations among the feature data in unsupervised learning, or on correlations with the output variable in supervised settings. Filter methods are usually applied as a preprocessing step. Common tools include Pearson's Correlation, Linear Discriminant Analysis (LDA) and Analysis of Variance (ANOVA). 
* __Wrapper methods__ select features by directly testing their impact on the performance of a model. The general idea is to run the learning algorithm repeatedly, building models on different subsets of features and measuring the performance of each model. Cross-validation is used across these tests, and the features that give the best results are selected; this is computationally intensive. Common examples are Forward Search, Backward Search and Recursive Feature Elimination. 

`sklearn` has a feature selection module that offers a variety of methods to improve a model's accuracy scores or to boost its performance on very high-dimensional datasets. 
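The sketch below shows one filter method and one wrapper method from `sklearn.feature_selection`. The synthetic dataset, the logistic regression estimator and the choice of keeping five features are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic classification data (an assumption): 20 features, 5 informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=2, random_state=0
)

# Filter method: rank features with the ANOVA F-test and keep the top 5.
selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = selector.fit_transform(X, y)

# Wrapper method: Recursive Feature Elimination repeatedly fits the model
# and drops the weakest feature until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape)                     # (300, 5)
print(selector.get_support(indices=True))   # column indices kept by the filter
print(rfe.get_support(indices=True))        # column indices kept by the wrapper
```

Note that `RFE` on its own does not cross-validate; `RFECV` in the same module adds cross-validation to choose how many features to keep.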


### Feature Extraction
This involves the construction of new features called __latent features__ (these are features, or topics, that combine a number of other features in your dataset and are therefore representative of those combined features). Constructing these latent features is exactly the goal of principal component analysis (PCA). Independent component analysis (ICA) and Random Projection are other methods of feature extraction.  

Feature extraction means that you do not have to drop features from the dataset directly. 


Other information: 
* [Introduction to feature selection](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/)
* [Dimensionality Reduction Algorithms](https://elitedatascience.com/dimensionality-reduction-algorithms)
* [Introduction to Variable and Feature Selection](http://www.ai.mit.edu/projects/jmlr/papers/volume3/guyon03a/source/old/guyon03a.pdf)

## Principal Components 
An advantage of Feature Extraction over Feature Selection is that the latent features can be constructed to incorporate data from multiple features, and thus retain more information present in the various original inputs rather than losing the information by dropping many original inputs. 

The latent features that are created as a mixture of the existing features are known as principal components; principal components are linear combinations of the original features in a dataset that aim to retain the most information in the original data.  
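To make "linear combination" concrete, the sketch below checks that the values `sklearn` returns for each principal component are just weighted sums of the centered original features. The random data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples, 4 correlated features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# Each row of pca.components_ holds the weights of one principal component.
# A component's value for a sample is the weighted sum of that sample's
# centered features.
manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(scores, manual))  # True
```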

You want to shrink the space that the data lives in: you are changing multiple dimensions into a single dimension.   

> A principal component is essentially a new feature that is a linear combination of existing features. This may mean that you don't need the original features anymore. Principal components are latent variables that reduce the number of features while retaining as much of the information in the dataset as possible. 

### The properties of principal components 
#### Capture the most variance in the dataset 
If you keep the components that capture the most variance in a dataset, then you lose the least amount of information. The first component captures the largest amount of variance, and each subsequent component then captures the largest amount of the variance that is left in the data.  

To do this, you look for a line that minimizes the distance from the points to the component, taken across all the points (similar in spirit to regression, although here the distance is measured perpendicular to the line rather than along the target axis). The amount of information lost is the distance between the component and the original data points (sort of like an error function in other algorithms).
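The link between variance captured and information lost can be checked numerically. The sketch below uses plain NumPy on made-up 2-D data (an assumption) and measures the loss as the mean squared perpendicular distance to the first component.

```python
import numpy as np

# Hypothetical 2-D data stretched along one direction.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
Xc = X - X.mean(axis=0)  # PCA works on centered data

# Rows of Vt are the principal directions, ordered by variance captured.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
first = Vt[0]

# Project every point onto the first component, then map back to 2-D.
projected = np.outer(Xc @ first, first)

n = len(Xc)
total_var = (Xc ** 2).sum() / n
captured_var = ((Xc @ first) ** 2).sum() / n
lost = ((Xc - projected) ** 2).sum() / n  # mean squared distance to the line

# Variance captured plus squared distance lost adds up to the total variance.
print(np.isclose(total_var, captured_var + lost))  # True
```

Because the total is fixed, picking the direction with the most variance is the same as picking the line with the smallest squared distance to the points.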


#### Created components are orthogonal to one another 
Orthogonal components are those that are at 90 degree angles to one another. 

Each additional component is orthogonal to all of the components before it. Depending on how the components are then used, this means that we can keep the new features independent (more precisely, uncorrelated) for further analysis.  
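Both properties can be verified on a fitted model. This is a small sketch on random data (an assumption), checking that the component vectors are orthogonal unit vectors and that the transformed features are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 300 samples, 5 correlated features.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=3).fit(X)
W = pca.components_

# Dot products between different components are 0 (90 degree angles),
# and each component has length 1, so W @ W.T is the identity matrix.
gram = W @ W.T
print(np.allclose(gram, np.eye(3)))  # True

# The transformed features are uncorrelated: their covariance matrix
# has (numerically) zero off-diagonal entries.
scores = pca.transform(X)
cov = np.cov(scores, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print(np.allclose(off_diagonal, 0.0, atol=1e-8))  # True
```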


[Q & A about PCA](https://stats.stackexchange.com/questions/110508/questions-on-pca-when-are-pcs-independent-why-is-pca-sensitive-to-scaling-why)