## Lighthouse Labs
### W04D1 Programming in Python for DS

Instructor: Socorro Dominguez  
June 14, 2021

## Agenda
1. What is dimensionality reduction, and why do we do it?


    * Variable Selection Techniques
        - Filter Methods
        - Wrapper Methods
       
    * Dimensionality Reduction
        - Principal Components Analysis (PCA)
        - Linear Discriminant Analysis (LDA)
    

# What do you think dimensionality reduction is?

Imagine that you have a dataset of patients with two features: 
- Height 
- Weight

Can you plot it?

What if we add a third variable? Can you still plot it?
- Age

A fourth variable? Can you still plot it?
- Net worth

**Dimensionality reduction** is:

- Reducing the number of features in a dataset
- E.g., 1000 rows by 20 columns (features) to 1000 rows by 10 columns

## Why do we do it?

- Helps our machine learning algorithms perform better
- Improves run-time of our algorithms
- Stores and uses less data (memory)
- Makes visualization possible

**The best solution is the most parsimonious model with acceptable accuracy (or other metric).**

## When do we do dimensionality reduction?


- Before visualization: 
    - The human visual system is the most powerful perceptual system in the known universe... But, it only works in up to 3 dimensions.



- To improve the performance of our baseline model  
Example:
   - Built a satellite imagery object detection model
   - Satellite imagery has 12 channels compared to normal images which have 3
   - We kept getting ~83-85% accuracy results until we finally tried PCA
   - We originally had 12 dimensions; feature engineered an additional 66 features for a total of 78
   - We reduced the 78 down to 3 and got 95% accuracy
    
**You don't know until you try.**

How do you think we can reduce the number of **features** from our dataset?

Actually, we have two ways:

- **Feature Selection**: selecting and excluding given features without changing them.
- **Dimensionality Reduction**: transforming features into a lower-dimensional space.



## Feature Selection

The easiest way to reduce features is to keep the most important features and eliminate the others.

The resulting feature set will still be interpretable.


## Variable Selection Techniques

**Filter methods**  

- Measure the relevance of each feature by its correlation with the dependent variable (target).
- If a feature is correlated with the target, keep it. Otherwise, discard it.
- Applied before training ML model

- Advantages: 
    - Fast, no training involved
- Disadvantages: 
    - Ignores feature combinations
    - Keeps redundant features


![imgs](imgs/Filter_Methods.png)
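A filter method can be sketched in a few lines. The example below is only an illustration on synthetic data: the feature names, the choice of absolute Pearson correlation as the score, and the 0.5 cut-off are all assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (hypothetical): one informative feature, a near-copy of it,
# and one pure-noise feature.
x_informative = rng.normal(size=n)
x_redundant = x_informative + 0.05 * rng.normal(size=n)
x_noise = rng.normal(size=n)
X = np.column_stack([x_informative, x_redundant, x_noise])
feature_names = ["informative", "redundant", "noise"]

# The target depends only on the informative feature.
y = 3.0 * x_informative + 0.5 * rng.normal(size=n)

# Score each feature on its own: absolute correlation with the target.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

threshold = 0.5  # arbitrary cut-off chosen for this illustration
kept = [name for name, s in zip(feature_names, scores) if s >= threshold]

print(dict(zip(feature_names, scores.round(2))))
print("kept:", kept)
```

No model is trained, which is why this is fast. Note that the near-copy is kept as well: each feature is scored in isolation, so redundancy goes unnoticed.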

**Wrapper methods**  
- Train ML model with different subsets of features
- If feature improves performance, add/keep it. Otherwise, ignore/remove it.
- Applied during training ML model
- Advantages:
    - Evaluates features in context of others
    - Performance-driven
- Disadvantages:
    - Slow, retrain model several times

![imgs](imgs/Wrapper_Methods.png)

## Forward selection wrapper method

1. SelectedFeatures = [ ]
2. Find F in (AllFeatures - SelectedFeatures) that, if added to SelectedFeatures, best improves model performance
3. If adding F improved performance more than some threshold, permanently add it to SelectedFeatures and go back to (2). Otherwise, stop.
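The three steps above can be sketched as a small loop. This is a minimal sketch, not a reference implementation: the helper name `forward_selection`, the use of linear regression, 5-fold cross-validated $R^2$ as "performance", the baseline score of 0 for an empty feature set, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_selection(X, y, threshold=0.01):
    """Greedy forward selection; returns the selected column indices in order."""
    selected = []          # Step 1: start with no features
    best_score = 0.0       # assumed baseline: R^2 of a model with no features
    remaining = set(range(X.shape[1]))
    while remaining:
        # Step 2: score every candidate when added to the current selection.
        trial_scores = {
            f: cross_val_score(LinearRegression(), X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        best_f = max(trial_scores, key=trial_scores.get)
        # Step 3: keep the candidate only if the gain beats the threshold.
        if trial_scores[best_f] - best_score <= threshold:
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = trial_scores[best_f]
    return selected


# Synthetic data (hypothetical): only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + 0.5 * rng.normal(size=300)

selected = forward_selection(X, y)
print(selected)
```

Each pass retrains the model once per remaining feature, which is where the "slow" disadvantage of wrapper methods comes from.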

## Backward elimination wrapper method

1. SelectedFeatures = AllFeatures
2. Find F in SelectedFeatures that, if removed from SelectedFeatures, decreases model performance the least
3. If removing F decreased performance less than some threshold, permanently remove it from SelectedFeatures and go back to (2). Otherwise, stop.
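For backward elimination, one option is scikit-learn's `SequentialFeatureSelector` with `direction="backward"`. Note one difference from the steps above: in this sketch the selector stops at a fixed number of features (`n_features_to_select=2`, an assumption) rather than at a performance threshold. The estimator and the synthetic data are also assumptions.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + 0.5 * rng.normal(size=300)

# Start from all features and repeatedly drop the one whose removal
# hurts the cross-validated score the least.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,  # stopping rule used here instead of a threshold
    direction="backward",
    cv=5,
)
sfs.fit(X, y)

kept_columns = sfs.get_support(indices=True).tolist()
print(kept_columns)
```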

## Recursive Feature Elimination

1. Decide $k$, the number of features to select.
2. Use a model (usually a linear model) to assign weights to features.
    - The weights of important features have higher absolute value.
3. Rank the features based on the absolute value of weights.
4. Drop the least useful feature.
5. Repeat steps 2-4 until the desired number of features is reached.
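These steps map closely onto `RFE` from `sklearn.feature_selection`. The sketch below assumes a linear regression as the weighting model, $k = 2$, and synthetic features that are already on the same scale (otherwise the weights would not be comparable).

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + 0.5 * rng.normal(size=300)

# k = 2 features to keep; step=1 drops one feature per iteration.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```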

## Variable Selection - Wrapper Methods Tips
- Look for existing implementations; for example, `sklearn` provides `RFE` in `sklearn.feature_selection`
- It's not possible to tell which method will work better until you try
- Different variable selection algorithms may give you different answers
- Different machine learning algorithms with the same variable selection method may give you different answers
- Over this process, you'll find out what features tend to get eliminated and which features tend to be kept (hopefully)

## What if....????

Instead of eliminating some of our features, what if we transformed them to keep as much information as possible?

## Dimensionality Reduction

**The goal is to preserve as much of the important data as possible.**

Two well-known techniques amongst many others that we'll cover today:

- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)

## Principal Component Analysis (PCA)

- Consider a data matrix $X$ with $n$ rows and $d$ columns


- We want a new data matrix $Z$ with $n$ rows and $k\leq d$ columns

$$
\textbf{X} = 
\underbrace{\left[
  \begin{array}{cccccc}
    \rule[-1ex]{0.5pt}{2.5ex} & \rule[-1ex]{0.5pt}{2.5ex} &   &  &   & \rule[-1ex]{0.5pt}{2.5ex} \\
    \textbf{X}_{1}    & \textbf{X}_{2}    & \ldots & \ldots & \ldots & \textbf{X}_{d}    \\
    \rule[-1ex]{0.5pt}{2.5ex} & \rule[-1ex]{0.5pt}{2.5ex} &   &   &  & \rule[-1ex]{0.5pt}{2.5ex} 
  \end{array}
\right]}_{\text{d columns (wider)}}\\
\textbf{Z} = 
\underbrace{\left[
  \begin{array}{ccc}
    \rule[-1ex]{0.5pt}{2.5ex} &         & \rule[-1ex]{0.5pt}{2.5ex} \\
    {\textbf{Z}}_{1}    & \ldots & \textbf{Z}_{k}    \\
    \rule[-1ex]{0.5pt}{2.5ex}  &        & \rule[-1ex]{0.5pt}{2.5ex} 
  \end{array}
\right]}_{\text{k columns (narrower)}}
$$
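In code, going from $X$ to $Z$ looks roughly like this. The sketch uses scikit-learn's `PCA` on random data with $n = 1000$, $d = 20$ and $k = 10$; the data and the sizes are assumptions chosen only to show the shapes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # n = 1000 rows, d = 20 columns

pca = PCA(n_components=10)        # k = 10
Z = pca.fit_transform(X)          # same rows, fewer columns

print(X.shape, "->", Z.shape)
```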

Let's imagine we have a dataset with two features:
- height
- weight

And we plot them this way:  
![img1](imgs/PCA01.png)

Our task is to project this data into a smaller dimension: a line.

First, we calculate the average `Height` and the average `Weight` across all observations.

![img1](imgs/PCA01a.png)

Now, let's shift the data so that the center of the data becomes the origin.

![img1](imgs/PCA01b.png)

**Note:** The data points still relate to one another in the same way.
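Centering is just subtracting each column's mean. A small sketch with made-up height (cm) and weight (kg) values, which are purely hypothetical:

```python
import numpy as np

# Hypothetical measurements: column 0 = height (cm), column 1 = weight (kg).
data = np.array([
    [150.0, 50.0],
    [160.0, 58.0],
    [170.0, 65.0],
    [180.0, 80.0],
    [190.0, 82.0],
])

means = data.mean(axis=0)    # average height and average weight
centered = data - means      # shift so the center of the data is the origin

# The offset between any two observations is unchanged by the shift.
diff_before = data[0] - data[1]
diff_after = centered[0] - centered[1]

print(means)
print(centered.mean(axis=0))
```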


Now let's try to fit a random line that captures as much of the information in our data points as possible. This line MUST pass through the origin.

![img1](imgs/PCA01c.png)

How do we get the best line?

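- PCA projects the data onto the line.
- PCA finds the line that maximizes the sum of squared distances from the projected points to the origin.
    - This is the same as minimizing the sum of squared distances between the data observations and the line, because each observation's distance to the origin is fixed (Pythagorean theorem).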

PCA measures the distance from the origin to each projected observation. 

If we only had 5 observations, we would have only 5 distances:

$d_1, d_2, d_3, d_4, d_5$

PCA then squares them and adds them up:

${d_1}^2 + {d_2}^2 + {d_3}^2 + {d_4}^2 + {d_5}^2 = SS(distances)$


![img1](imgs/PCA01d.png)

We rotate the line and repeat this until we find the largest $SS(distances)$.

This new line is called **Principal Component 1 (PC1)**

**SS(distances) for PC1** is called the **eigenvalue for PC1**
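The "try lines until $SS(distances)$ is largest" idea can be sketched as a brute-force search over angles. The synthetic height/weight data and the 0.1 degree grid are assumptions. In practice PCA does not search like this, but the result should agree with the largest eigenvalue of $X^TX$ for the centered data (dividing by $n-1$ would give the eigenvalue of the covariance matrix instead).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): weight is roughly half of height plus noise.
height = rng.normal(scale=2.0, size=200)
weight = 0.5 * height + rng.normal(scale=0.3, size=200)
data = np.column_stack([height, weight])
data = data - data.mean(axis=0)   # center so candidate lines pass through the origin


def ss_distances(angle):
    """Sum of squared distances from the origin to the projected points."""
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
    projections = data @ direction   # signed position of each point along the line
    return np.sum(projections ** 2)


# Try candidate lines from 0 to 180 degrees in 0.1 degree steps.
angles = np.linspace(0.0, np.pi, 1801)
ss = np.array([ss_distances(a) for a in angles])

best_angle = angles[ss.argmax()]
best_ss = ss.max()
best_slope = np.tan(best_angle)

# Largest eigenvalue of X^T X for comparison (eigvalsh sorts ascending).
largest_eigenvalue = np.linalg.eigvalsh(data.T @ data)[-1]

print(best_slope, best_ss, largest_eigenvalue)
```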

Let's say that our **PC1** has a slope of 0.5

That is, for every 2 units we move along **height**, we move 1 unit along **weight**.

Here, we can say then that for PC1, **height** is more important than **weight**.

**Data is more spread out on the height axis**

**PC1** ends up being a **Linear Combination** of:   
> $PC1 = 2*Height + 1*Weight$

When we scale this vector to have a length of one (by normalizing it), we get the **eigenvector** for PC1. Here, that means dividing by $\sqrt{2^2 + 1^2} = \sqrt{5}$.

With 2 features, it is easy to find **PC2**: it has to be the line that also passes through the origin and is orthogonal to **PC1**.

![img](imgs/PCA02.png)

Since we only have 2 dimensions, PC2 must be:
> $PC2 = -1*Height + 2*Weight$
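A quick numeric check of the two directions above, written as (height, weight) vectors. The variable names are just for illustration.

```python
import numpy as np

pc1 = np.array([2.0, 1.0])    # 2*Height + 1*Weight
pc2 = np.array([-1.0, 2.0])   # -1*Height + 2*Weight

# Normalizing: divide by the vector's length, sqrt(2^2 + 1^2) = sqrt(5).
pc1_unit = pc1 / np.linalg.norm(pc1)
pc2_unit = pc2 / np.linalg.norm(pc2)

print(pc1_unit)     # approximately [0.894, 0.447]
print(pc1 @ pc2)    # 0.0, so the two directions are orthogonal
```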

For our final plot, we rotate everything so that PC1 is horizontal and PC2 is vertical.

![img](imgs/PCA03.png)
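The rotation amounts to re-expressing each centered observation in PC1/PC2 coordinates. A minimal sketch, assuming synthetic height/weight data and taking the unit eigenvectors of the covariance matrix as the PC directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): weight is roughly half of height plus noise.
height = rng.normal(scale=2.0, size=200)
weight = 0.5 * height + rng.normal(scale=0.3, size=200)
data = np.column_stack([height, weight])
centered = data - data.mean(axis=0)

# Unit eigenvectors of the covariance matrix, sorted so PC1 comes first.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]   # column 0 = PC1, column 1 = PC2

# Rotate: column 0 is the position along PC1, column 1 along PC2.
rotated = centered @ components
cov_rotated = np.cov(rotated, rowvar=False)

print(cov_rotated.round(4))
```

After the rotation the two new axes are uncorrelated (the off-diagonal entries are numerically zero), and most of the spread lies along the first axis.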

If you had a third dimension, you would still need some extra optimization: find the best-fitting line through the origin (PC1), then the best-fitting line that is orthogonal to it (PC2). PC3 would then simply be the line that is orthogonal to both PC1 and PC2.

## Measuring Variation

We will use a **scree plot** to compare the variation captured by each PC.

$Variation(PC1) = \frac{SS(distances_{PC1})}{n-1}$  
$Variation(PC2) = \frac{SS(distances_{PC2})}{n-1}$  
...  
$Variation(PCk) = \frac{SS(distances_{PCk})}{n-1}$  

Here, $n$ is the number of observations and $PCk$ is the last principal component.

![img](imgs/scree_plot.png)
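The numbers behind a scree plot can be computed directly from the formula above. This sketch assumes the Iris data bundled with scikit-learn (unscaled) and compares the hand-computed values with `PCA.explained_variance_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
n = X.shape[0]

pca = PCA().fit(X)   # keep all PCs

# Project the centered data onto each PC and apply SS(distances) / (n - 1).
centered = X - X.mean(axis=0)
projections = centered @ pca.components_.T
variation = (projections ** 2).sum(axis=0) / (n - 1)

# Share of the total variation per PC: these are the bar heights of a scree plot.
ratio = variation / variation.sum()

print(variation.round(3))
print(ratio.round(3))
```

Passing `ratio` to a bar chart gives the scree plot itself.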


From a **Scree Plot** you might determine that you only need the first 2 or 3 PCs rather than the complete set of PCs for a better model.

Remember, the maximum number of PCs you can have is the smaller of:  
a) the number of features  
b) the number of observations
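A small check of this limit, assuming random data with 5 observations and 10 features. It also suggests a refinement: once the data is centered, one of those 5 directions carries no variation, so only 4 PCs are informative here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # 5 observations, 10 features (hypothetical)
centered = X - X.mean(axis=0)

# One singular value per possible PC: min(5, 10) = 5 of them.
singular_values = np.linalg.svd(centered, compute_uv=False)

# Count the directions that actually carry variation.
n_nonzero = int(np.sum(singular_values > 1e-10))

print(len(singular_values), n_nonzero)
```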

## Linear Discriminant Analysis
- Similar to PCA: Projecting onto smaller number of dimensions
- Different from PCA: Uses the `y` or class label to help us decide what to select
- Can only be used for classification (that is, when `y` is discrete)

<img src='imgs/pca-v-lda.png' width=700>

## Introducing the Iris dataset

<img src='imgs/iris-dataset.png'>

![img](imgs/LDA01.png)
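With Iris loaded from scikit-learn, the practical difference between the two methods shows up in the call itself: PCA is fit on `X` alone, while LDA also needs `y`. A minimal sketch, where projecting to 2 components is an assumption (with 3 classes, LDA can produce at most 2):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA is unsupervised: it never sees the class labels.
Z_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: the labels guide which directions are kept.
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(Z_pca.shape, Z_lda.shape)
```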

## LDA: concept of interclass variance
- Which features better delineate the classes?

<img src='imgs/lda-iris.png' width=800>

**We are trying to find components that minimize the intra-class variance and maximize the inter-class variance.**

![img](imgs/LDA02.png)
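To get a feel for the inter-class versus intra-class idea, we can compute the ratio for each original Iris feature on its own. This is only a per-feature sketch of the quantity LDA optimizes, not LDA itself, which looks for combinations of features.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

overall_mean = X.mean(axis=0)
between = np.zeros(X.shape[1])   # inter-class: spread of the class means
within = np.zeros(X.shape[1])    # intra-class: spread inside each class

for label in np.unique(y):
    group = X[y == label]
    group_mean = group.mean(axis=0)
    between += len(group) * (group_mean - overall_mean) ** 2
    within += ((group - group_mean) ** 2).sum(axis=0)

# A high ratio means the classes are far apart relative to their own spread.
ratio = between / within
best_feature = iris.feature_names[int(ratio.argmax())]

print(dict(zip(iris.feature_names, ratio.round(1))))
print(best_feature)
```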