# Introduction
The field of **Dimensionality Reduction** is all about *discovering lower-dimensional structure in data
<br> that is not obvious in the original feature space.*

## Why do we need Dimensionality Reduction?

- **Visualization**: if we reduce the data to 2 or 3 dimensions, we can plot it.
  <br>This helps us analyze the data.
<br>
<br>
- **Data Compression**
    - easier storage and processing of data
<br>
<br>
- **High-dimensional data has problems**:
    - high time/space complexity
    - prone to overfitting
    - not all features are relevant, so reducing the dimension can also reduce the noise.

## Example

### Reducing from D=2 into d=1
- rather than representing each point with **two** coordinates
    <br> we can represent each point by **one** coordinate (corresponding to its projection on the red line)
- by doing this we incur a bit of error, as the points do not lie exactly on the red line.
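
A minimal sketch of this projection, assuming made-up 2-D points and a hand-picked line direction (neither comes from the figure): each point is replaced by a single coordinate along the line, and the squared distance to the line is the error we accept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D points scattered around the line y = 0.5 * x
x = rng.normal(size=200)
points = np.column_stack([x, 0.5 * x + rng.normal(scale=0.1, size=200)])

# Unit vector along the assumed line (plays the role of the "red line")
direction = np.array([1.0, 0.5])
direction /= np.linalg.norm(direction)

# One coordinate per point: the position of its projection along the line
coords_1d = points @ direction

# Map the 1-D coordinates back into 2-D to measure what was lost
reconstructed = np.outer(coords_1d, direction)
mean_sq_error = np.mean(np.sum((points - reconstructed) ** 2, axis=1))

print(coords_1d.shape)  # (200,)
print(mean_sq_error)    # small, but not zero
```

The error is not zero because the points only lie near the line, not on it.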

### Reducing from D=3 into d=2
- similarly, in the drawing on the right, we can see that the data roughly lies on a surface. Thus it can be described using only 2 dimensions.

<img src='images/pca1.png' />

# Definitions

## Variance
The variance is a measure of **how spread out** some data is.
<br>(the average squared distance of the data points from the mean)
- $variance = (std)^2$

<img src='images/variance.png' />
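
As a small worked example (the numbers are made up for illustration), the variance can be computed directly from its definition and compared with the squared standard deviation. This sketch uses the population form, which divides by $n$ and is also NumPy's default.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()                       # 5.0
variance = np.mean((data - mean) ** 2)   # average squared distance to the mean: 4.0
std = np.sqrt(variance)                  # 2.0

# NumPy's built-ins agree: variance = (std)^2
print(np.var(data), np.std(data) ** 2)
```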

## Eigenvectors and Eigenvalues
('eigen' is German for 'own' or 'characteristic')

A (non-zero) vector $v$ of dimension $N$ is an **eigenvector** of a square $N \times N$ matrix $A$ if it satisfies the linear equation: $Av = \lambda v$
<br>where $\lambda$ is a scalar, termed the **eigenvalue** corresponding to $v$.

That is, the *eigenvectors* are the vectors that the linear transformation $A$ merely **elongates or shrinks** (or reverses, when the eigenvalue is negative),
<br>and the factor by which they are scaled is the *eigenvalue*.

<img src='images/eigen-decomposition.png' />
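
The defining equation can be checked numerically. The sketch below uses a small symmetric matrix chosen only for illustration and `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

# Illustrative symmetric matrix (not taken from the figure)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `eigenvectors` are the eigenvectors of A
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # A only rescales v by lam: A v = lam v
    print(lam, np.allclose(A @ v, lam * v))
```

For this matrix the eigenvalues are 1 and 3, with eigenvectors along $(1, -1)$ and $(1, 1)$.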

# PCA

PCA is a method for Dimensionality Reduction.

- PCA transforms a set of variables into a new set of variables, each a linear combination of the original variables.
- These new variables are known as **Principal Components**.
- PCA is an orthogonal linear transformation that transforms the data into a new coordinate system such that 
    - the greatest variance by some projection of the data lies on the **first** principal component
    - the second greatest variance lies on the **second** principal component
    - etc.

<img src='images/scree.png' />
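
One common way to obtain such a transformation is to take the eigenvectors of the data's covariance matrix, ordered by decreasing eigenvalue. The sketch below does this with NumPy on made-up data; it is meant as an illustration of the idea, not as a description of how any particular library implements PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 3 correlated features
latent = rng.normal(size=(500, 1))
X = np.hstack([latent, 2 * latent, -latent]) + rng.normal(scale=0.3, size=(500, 3))

# Center the data and compute the covariance matrix of the features
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]   # columns are the principal directions

# New variables: linear combinations of the original (centered) variables
scores = X_centered @ components
score_variances = scores.var(axis=0, ddof=1)

print(score_variances)  # decreasing: PC1 has the most variance, then PC2, ...
```

The variance of each new variable equals the corresponding eigenvalue, and the directions are mutually orthogonal.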

# PCA intuition

## Data standardization
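Our first step before we do PCA is to standardize the data.
<br>We need to standardize because PCA looks for directions of greatest variance, so a feature measured on a larger scale would dominate the components simply because of its units.
<br>Standardizing means **putting the data on the same scale**.
<br>This means each feature should have mean = 0 and variance = 1.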
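
To see the effect of scale, here is a small sketch with two made-up, independent features in very different units (the names `height` and `salary` and the helper `first_direction` are hypothetical). Without standardization the direction of greatest variance is almost entirely the large-scale feature; after standardization both features contribute.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: height in metres and salary in dollars
height = rng.normal(1.7, 0.1, size=300)
salary = rng.normal(50_000, 10_000, size=300)
X = np.column_stack([height, salary])

def first_direction(M):
    """Return the unit direction of greatest variance of M."""
    centered = M - M.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigenvectors[:, -1]  # eigh sorts eigenvalues in ascending order

# Standardize: subtract the mean and divide by the standard deviation, per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

raw_dir = first_direction(X)
std_dir = first_direction(X_std)

print(np.abs(raw_dir))  # almost all weight on salary
print(np.abs(std_dir))  # both features carry weight
```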


## Maximize variance
PCA is a variance-maximizing exercise.
<br>It projects the original data onto the direction that maximizes variance.
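
A brute-force sketch of this idea, on made-up 2-D data: try many candidate directions, measure the variance of the data projected onto each, and compare the best one with the top eigenvector of the covariance matrix. The angle grid is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical centered 2-D data
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.2, size=500)])
X = X - X.mean(axis=0)

# Candidate unit directions, one every 0.1 degrees over half a circle
angles = np.linspace(0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Variance of the data projected onto each candidate direction
projected_variance = (X @ directions.T).var(axis=0, ddof=1)
best_direction = directions[np.argmax(projected_variance)]

# Direction found analytically from the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
top_eigenvector = eigenvectors[:, -1]

# Close to 1 when the two directions agree (up to sign)
alignment = abs(best_direction @ top_eigenvector)
print(alignment)
```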


## What is the rank of a matrix
<img src='images/rank.png' />

## Here is what will be presented
- First we'll make some data to apply PCA to
- How to use the **PCA()** class from sklearn to do PCA
- How to determine how much variation each principal component accounts for
- How to draw a PCA graph using matplotlib
- How to examine the loading scores to determine which variables have the largest effect on the graph

In [None]:
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing # for scaling the data
import matplotlib.pyplot as plt

## Generate the sample dataset

In [None]:
# generate an array of 100 gene names (row names)
genes = ['gene' + str(i) for i in range(1,101)]

# we now create arrays of sample names (column names)
# we have 5 'wild types' or 'wt' samples and 5 'knock out' or 'ko' samples
wt = ['wt' + str(i) for i in range(1,6)]
ko = ['ko' + str(i) for i in range(1,6)]

# create a dataframe to store the made-up data 
# (the stars unpack the wt and ko arrays)
data = pd.DataFrame(columns=[*wt, *ko], index=genes)

# for each gene in the index, we create 5 values for the 'wt' samples and 5 values for the 'ko' samples
# the made up data comes from two poisson distributions (one for the 'wt' samples and the other for 'ko')
# for each gene we select a new mean for the poisson distribution (the means can vary between 10-1000)
for gene in data.index:
    data.loc[gene,'wt1':'wt5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5)
    data.loc[gene,'ko1':'ko5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5)

print(data.shape)
data.head()

## Scale the data
Before we do PCA we have to center and scale our data.
<br>After scaling, the average value for each gene will be 0 and the standard deviation will be 1.

In [None]:
# (the scale function expects samples as rows instead of columns, hence the transpose)
scaled_data = preprocessing.scale(data.T)

# you can also use: scaled_data = preprocessing.StandardScaler().fit_transform(data.T)
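
As a quick sanity check, the sketch below applies `preprocessing.scale` to a made-up count matrix (a stand-in for `data.T`, not the notebook's actual data) and compares it with the manual formula. Note that `scale` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)

# Hypothetical stand-in for data.T: 10 samples (rows) x 100 genes (columns)
gene_means = rng.integers(10, 1000, size=100)
counts = rng.poisson(lam=gene_means, size=(10, 100)).astype(float)

scaled = preprocessing.scale(counts)

# The same thing by hand: center each column, then divide by its standard deviation
manual = (counts - counts.mean(axis=0)) / counts.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0)[:3])  # each close to 0
print(scaled.std(axis=0)[:3])   # each close to 1
```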

## PCA

In [None]:
pca = PCA()

# this is where all the math happens
# (i.e. calculating the loading scores and the variation each principal component accounts for)
pca.fit(scaled_data)

# this is where we generate coordinates for the PCA graph based on the loading scores and the scaled data.
pca_data = pca.transform(scaled_data)
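
To make the transform step concrete, here is a sketch on made-up data showing that, with the default settings (no whitening), `transform` amounts to subtracting the fitted mean and projecting onto the rows of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 10 samples, 4 features
X = rng.normal(size=(10, 4))

pca = PCA().fit(X)
via_transform = pca.transform(X)

# Center with the mean learned during fit, then project onto each component
manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(via_transform, manual))
```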

## Draw a scree plot
We'll start with a scree plot to see how many principal components we should use.

In [None]:
# calculate the percentage of variation each principal component accounts for
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

labels = ['PC' + str(i) for i in range(1, len(per_var) + 1)] # 'PC1', 'PC2', etc.

plt.bar(x=range(1, len(per_var) + 1), height=per_var, tick_label=labels)

plt.ylabel('Percentage of explained variance')
plt.xlabel('Principal component')
plt.title('Scree Plot')
plt.show()

# we can see that almost all of the variation is captured by the first principal component, so
# a 2-D graph should do a good job of representing the original data.
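
The same decision can be made numerically from the cumulative explained variance. The sketch below uses made-up data and an assumed 95% cut-off; both are illustrative choices, not part of the analysis above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 6 features driven mostly by 2 hidden factors
factors = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + rng.normal(scale=0.1, size=(50, 6))

pca = PCA().fit(X)

# Running total of the fraction of variance explained by PC1, PC1+PC2, ...
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed cut-off, chosen for illustration
# Index of the first component at which the running total reaches the threshold
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(np.round(cumulative, 3))
print(n_components)
```

scikit-learn also documents passing a float between 0 and 1 as `n_components` to select the number of components by explained variance in a similar way.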

## Draw the PCA graph

In [None]:
# we'll put the new coordinates created by pca.transform(scaled_data) into a matrix
# where the rows have sample labels and the columns have PC labels
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels)
pca_df

In [None]:
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('PCA plot')
plt.xlabel('PC1 - {}%'.format(per_var[0]))
plt.ylabel('PC2 - {}%'.format(per_var[1]))

# add sample names to the graph
for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))

plt.show()

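The 'ko' samples are clustered on the left, suggesting they are correlated with each other.

The 'wt' samples are clustered on the right, also correlated with each other.
<br>(The data is random and the sign of a principal component is arbitrary, so the two sides may swap between runs.)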

Lastly, let's look at the loading scores for PC1 to determine which genes had the largest influence
<br> on separating the two clusters along the x-axis.

In [None]:
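# loading scores for PC1: one weight per gene
loading_scores = pd.Series(pca.components_[0], index=genes)
# rank genes by the magnitude of their loading score (the sign only gives the direction)
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
# names of the 10 genes with the largest influence on PC1
top_10_genes = sorted_loading_scores[0:10].index.values
print(sorted_loading_scores[top_10_genes])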