# PCA: Principal Component Analysis (Intro)

### Definition

1. A PC is a linear combination of variables.
2. The unit-length vector along a PC is called the **singular vector** or **eigenvector** of that PC.
3. The proportions of each variable in that unit vector are called **loading scores**.
4. PCA calls the sum of squared distances (from the projected points to the origin) for the line that best fits the data the **eigenvalue for PC1**.
5. SVD: Singular Value Decomposition.
6. The **scree plot** is a graphical representation of the percentage of variation that each PC accounts for.
7. The proportion of variation that each PC accounts for is given by the equation:
   
   For $k$ PCs: $$ \text{ratio}(PC_i) = \frac{\text{SSDistance}(PC_i)}{\sum_{j=1}^{k} \text{SSDistance}(PC_j)} $$
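
As a worked example of items 2, 3 and 7, the sketch below computes the singular vectors and each PC's share of the variation for a tiny data set using NumPy's SVD. The matrix `X` and all variable names are made up for illustration; the sum of squared distances per PC is taken as the squared length of the data projected onto that PC, which matches the squared singular value.

```python
import numpy as np

# Toy data: 6 samples (rows) x 2 variables (columns); values are made up.
X = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])

# Center each variable so that the PCs pass through the origin.
Xc = X - X.mean(axis=0)

# SVD: the rows of Vt are the unit-length singular vectors, one per PC.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Sum of squared distances from the origin of the points projected onto each PC.
# These values equal the squared singular values s**2.
ss_distance = ((Xc @ Vt.T) ** 2).sum(axis=0)

# Proportion of variation per PC: SSDistance(PC_i) divided by the sum over all PCs.
ratio = ss_distance / ss_distance.sum()

print("loading scores for PC1:", Vt[0])
print("proportion of variation per PC:", ratio)
```

Because the singular values come back sorted in decreasing order, `ratio[0]` is always the share of PC1.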

### Make up data

In [2]:
import pandas as pd 
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt

In [8]:
# Row labels: 100 genes. Column labels: 5 wild-type and 5 knock-out samples.
genes = ['gene' + str(i) for i in range(1, 101)]
wt = ['wild-type' + str(i) for i in range(1, 6)]
ko = ['knock-out' + str(i) for i in range(1, 6)]

data = pd.DataFrame(columns=[*wt, *ko], index=genes)

# For each gene, draw 5 Poisson read counts per group, using a random mean for each group.
for gene in data.index:
    data.loc[gene, 'wild-type1':'wild-type5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)
    data.loc[gene, 'knock-out1':'knock-out5'] = np.random.poisson(lam=rd.randrange(10, 1000), size=5)

data.head()

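| | wild-type1 | wild-type2 | wild-type3 | wild-type4 | wild-type5 | knock-out1 | knock-out2 | knock-out3 | knock-out4 | knock-out5 |
|---|---|---|---|---|---|---|---|---|---|---|
| gene1 | 213 | 217 | 206 | 200 | 186 | 558 | 603 | 561 | 608 | 579 |
| gene2 | 323 | 324 | 322 | 321 | 331 | 644 | 705 | 722 | 722 | 680 |
| gene3 | 618 | 681 | 657 | 672 | 675 | 222 | 223 | 233 | 225 | 234 |
| gene4 | 397 | 435 | 425 | 424 | 412 | 802 | 800 | 817 | 761 | 773 |
| gene5 | 29 | 24 | 33 | 15 | 19 | 364 | 393 | 376 | 395 | 387 |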


### Use PCA

#### Centering and scaling the data

In [11]:
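# scale() expects samples as rows, so transpose; each gene is centered to mean 0 and scaled to unit variance.
scaled_data = preprocessing.scale(data.T)
scaled_data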

array([[-0.95077924, -1.000019  ,  0.80101363, -1.1114    , -0.97081699,
         1.03241521, -1.01349126, -0.99603491, -1.16941255,  1.24146188,
        -0.95482252,  0.44910155, -1.18528681, -0.9502068 ,  1.03201055,
         1.00128688,  0.01165552,  1.09889836,  0.73664826, -0.62119537,
        -0.98403348, -1.08663607, -0.91722111, -1.08160697, -1.00131876,
         1.5589478 , -1.0458766 ,  0.93249511, -0.94563401,  1.05789195,
        -1.16146908,  0.99217807, -0.47149169, -0.98485793,  0.89175154,
        -1.26228857,  0.59202387, -1.21560082, -1.06127364, -1.11593534,
        -0.90742275,  1.188105  ,  1.10194989, -0.89703566, -0.86084933,
        -0.98432162,  0.88866374,  0.98016583,  0.75784189, -0.63392616,
         0.80807125,  2.1407011 , -0.91393515,  1.18893169,  0.30639438,
         0.8031622 , -0.77272081,  0.99233154, -0.23311582,  0.47011621,
        -0.05766599,  1.08569193, -0.93594037,  0.97439025,  1.12800598,
         0.92198099,  0.22425837,  1.40788315,  0.9
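
To make the effect of `preprocessing.scale` easier to see than in the large array above, here is a minimal sketch on a small made-up matrix (4 samples by 3 genes; the values and names are illustrative only). It checks that every column ends up with mean 0 and standard deviation 1, and that `StandardScaler` gives the same result.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Made-up counts: 4 samples (rows) x 3 genes (columns).
counts = np.array([
    [213.0, 323.0, 618.0],
    [217.0, 324.0, 681.0],
    [558.0, 644.0, 222.0],
    [603.0, 705.0, 223.0],
])

# Center and scale each column (gene).
scaled = preprocessing.scale(counts)

# Column means are (numerically) 0 and column standard deviations are 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))

# StandardScaler applies the same transformation through the fit/transform API.
same = StandardScaler().fit_transform(counts)
print(np.allclose(scaled, same))
```

`StandardScaler` is the more convenient choice when the same centering and scaling must later be applied to new samples, since the fitted object remembers the means and standard deviations.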

### Variation of each component
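
A minimal sketch of this step, assuming scikit-learn's `PCA` and its `explained_variance_ratio_` attribute are used to obtain the percentage of variation per PC, which is then drawn as a scree plot. To stay self-contained, the block rebuilds a stand-in for `data` with a seeded NumPy generator; the seed and the vectorized sampling are assumptions, not the loop used in "Make up data".

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Stand-in for `data`: 100 genes x 10 samples (assumed seed, for reproducibility).
rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(1, 101)]
samples = [f"wild-type{i}" for i in range(1, 6)] + [f"knock-out{i}" for i in range(1, 6)]
counts = np.hstack([
    rng.poisson(rng.integers(10, 1000, size=(100, 1)), size=(100, 5)),  # wild-type
    rng.poisson(rng.integers(10, 1000, size=(100, 1)), size=(100, 5)),  # knock-out
])
data = pd.DataFrame(counts, index=genes, columns=samples)

# Samples as rows, genes as columns; center and scale each gene.
scaled_data = preprocessing.scale(data.T)

pca = PCA()
pca.fit(scaled_data)

# Percentage of variation that each PC accounts for.
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = [f"PC{i}" for i in range(1, len(per_var) + 1)]

# Scree plot (call plt.show() when running outside a notebook).
fig, ax = plt.subplots()
ax.bar(x=range(1, len(per_var) + 1), height=per_var, tick_label=labels)
ax.set_xlabel("Principal component")
ax.set_ylabel("Percentage of explained variance")
ax.set_title("Scree plot")
```

With 10 samples, `PCA()` returns 10 components, so the scree plot has 10 bars.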

### PCA graph
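
A minimal sketch of a PCA graph, assuming it plots each sample's coordinates on PC1 and PC2 as returned by `PCA.fit_transform`. As a self-contained stand-in for `data`, the block samples counts with a seeded NumPy generator; the seed and the names `pca_df`, `labels` and `per_var` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Stand-in for `data`: 100 genes x 10 samples (assumed seed, for reproducibility).
rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(1, 101)]
samples = [f"wild-type{i}" for i in range(1, 6)] + [f"knock-out{i}" for i in range(1, 6)]
counts = np.hstack([
    rng.poisson(rng.integers(10, 1000, size=(100, 1)), size=(100, 5)),  # wild-type
    rng.poisson(rng.integers(10, 1000, size=(100, 1)), size=(100, 5)),  # knock-out
])
data = pd.DataFrame(counts, index=genes, columns=samples)
scaled_data = preprocessing.scale(data.T)

# Coordinates of every sample on every PC.
pca = PCA()
pca_data = pca.fit_transform(scaled_data)
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = [f"PC{i}" for i in range(1, len(per_var) + 1)]
pca_df = pd.DataFrame(pca_data, index=samples, columns=labels)

# Scatter plot of PC1 against PC2, with each point labelled by its sample name.
fig, ax = plt.subplots()
ax.scatter(pca_df["PC1"], pca_df["PC2"])
for sample in pca_df.index:
    ax.annotate(sample, (pca_df.loc[sample, "PC1"], pca_df.loc[sample, "PC2"]))
ax.set_xlabel(f"PC1 - {per_var[0]}%")
ax.set_ylabel(f"PC2 - {per_var[1]}%")
ax.set_title("PCA graph")
```

Since the wild-type and knock-out counts are drawn with different means for each gene, the two groups should fall on opposite sides of PC1.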

### Loading score (proportion/impact of each variable)
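
A minimal sketch of how loading scores can be read from a fitted model, assuming scikit-learn's `PCA`: each row of `components_` is a unit-length eigenvector, and its entries are the loading scores of the genes for that PC. The block samples its own stand-in for `data` with a seeded NumPy generator; the seed and the names `loading_scores` and `top_10_genes` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Stand-in for `data`: 100 genes x 10 samples (assumed seed, for reproducibility).
rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(1, 101)]
samples = [f"wild-type{i}" for i in range(1, 6)] + [f"knock-out{i}" for i in range(1, 6)]
counts = np.hstack([
    rng.poisson(rng.integers(10, 1000, size=(100, 1)), size=(100, 5)),  # wild-type
    rng.poisson(rng.integers(10, 1000, size=(100, 1)), size=(100, 5)),  # knock-out
])
data = pd.DataFrame(counts, index=genes, columns=samples)
scaled_data = preprocessing.scale(data.T)

pca = PCA()
pca.fit(scaled_data)

# Loading scores for PC1: one value per gene, taken from the first eigenvector.
loading_scores = pd.Series(pca.components_[0], index=genes)

# Rank genes by the magnitude of their loading score and keep the 10 largest.
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)
top_10_genes = sorted_loading_scores.index[:10]

print(loading_scores[top_10_genes])
```

The sign of a loading score indicates the direction in which a gene pushes samples along PC1, while the absolute value indicates how strongly it does so.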