# Lab: Dimensionality reduction

The goal of this lab session is to study different dimensionality reduction methods. Submit a single notebook covering both parts.

We begin with the standard imports:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('poster')
sns.set_color_codes()
plot_kwds = {'alpha': 0.25, 's': 80, 'linewidths': 0}

# Principal component analysis (PCA)

## Theoretical Problem 
Consider a random vector $X$ with $p$ features, $X = (X^1,X^2,\dots,X^p)^T$.

The goal of PCA is to create $p$ new variables (the principal components) that best summarize the variance of the original $p$ variables, so that a small subset of these new variables retains a variance close to the total variance of the original features.

We look for these new random variables, the principal components $Z^i$, as linear combinations of the original features $X^j$, i.e. we express $X$ in a new basis.

\begin{align} Z^1 &= \alpha_{11}X^1 +\alpha_{12}X^2 + \dots + \alpha_{1p}X^p = A_1^TX \\
 Z^2 &= \alpha_{21}X^1 +\alpha_{22}X^2 + \dots + \alpha_{2p}X^p = A_2^TX  \\
 &\vdots \\
 Z^p &= \alpha_{p1}X^1 +\alpha_{p2}X^2 + \dots + \alpha_{pp}X^p = A_p^TX 
\end{align}

In the end, PCA looks for $A_1,\dots,A_p$ satisfying the following conditions : 
- $\forall i,  A_i= \underset{A}{\arg \max}\; Var(A^TX)$, so that $Z^i = A_i^TX$ has the largest possible variance
- $Var(Z^1)\ge Var(Z^2)\ge \dots \ge Var(Z^p)$
- $\forall  i,j \; / \; i\neq j, cov(Z^i,Z^j) = 0 $ 
- $\forall i, \|A_i\|_2 = 1$ (otherwise $Var(Z^i)$ could be made arbitrarily large by rescaling $A_i$)

The first principal components capture most of the variance of $X$, so keeping only these first components reduces the dimensionality of the dataset.

Geometrically, this means we search for an orthogonal basis such that the inertia (spread) of the data points projected onto the first axes is maximal.

## Theoretical Solution

Assume the random vector $X = (X^1,X^2,\dots,X^p)^T$ is centered and reduced, i.e. $\forall i, E(X^i) = 0$ and $Var(X^i) = 1$.

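We call $\Sigma$ the covariance matrix of $X$. Since $\Sigma$ is symmetric, it can be diagonalized in an orthonormal basis:
$$ \Sigma = P^TDP $$
with $P$ orthogonal and $D = diag(\lambda_1,\lambda_2,\dots,\lambda_p)$ such that $ \lambda_1 \ge\lambda_2 \ge \dots \ge\lambda_p \ge 0$.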

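$\boxed{A_i \text{ is the unit-norm eigenvector of } \Sigma \text{ associated with the eigenvalue } \lambda_i \text{, i.e. the } i\text{-th row of } P}$

With the convention $\Sigma = P^TDP$, the eigenvectors are the columns of $P^T$, hence the rows of $P$.

Moreover, $Var(Z^i) = \lambda_i$.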
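
The result above can be checked numerically. The sketch below is only an illustration on hypothetical toy data (the mixing matrix and sample size are arbitrary assumptions): it standardizes the data, diagonalizes the empirical covariance matrix with `np.linalg.eigh`, stores the eigenvectors as the rows of `P` so that $\Sigma = P^TDP$, and verifies that the components $Z^i = A_i^TX$ are uncorrelated with variance $\lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n samples of a correlated 3-feature vector X
n, p = 5000, 3
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
X = rng.normal(size=(n, p)) @ mixing.T

# Center and reduce each feature (mean 0, variance 1)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Empirical covariance matrix of the standardized features
Sigma = np.cov(X, rowvar=False)

# eigh returns eigenvalues in increasing order: reorder so lambda_1 >= ... >= lambda_p
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
P = eigenvectors[:, order].T   # rows of P are the A_i, so that Sigma = P^T D P

# Principal components: Z^i = A_i^T X for every sample (one sample per row)
Z = X @ P.T

print(np.allclose(Sigma, P.T @ np.diag(eigenvalues) @ P))          # decomposition holds
print(np.allclose(np.cov(Z, rowvar=False), np.diag(eigenvalues)))  # Var(Z^i) = lambda_i, cov = 0
```

Because the features are reduced, the eigenvalues sum to $p$, which makes each $\lambda_i / p$ directly readable as a share of the total variance.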

## Practical Solution

Now imagine you have a dataset $M  \in \mathbb{R}^{n\times p}$, which can be interpreted as $n$ realizations of the random vector $X$ (one realization per row). 

First build the centered matrix $\bar M=\begin{bmatrix} M_{1,1}-\bar M_1 & \cdots & M_{1,p}-\bar M_p \\ \vdots & \ddots & \vdots \\ M_{n,1}-\bar M_1 & \cdots & M_{n,p}-\bar M_p\end{bmatrix} $

where $\bar M_j = \frac{1}{n}\sum_{i=1}^n M_{i,j}$ is the mean of the $j$-th feature.

We call $\bar \Sigma = \frac{1}{n-1}\bar M^T \bar M \in \mathbb{R}^{p\times p}$ the scatter matrix; it is an estimate of the covariance matrix.
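
As a minimal sketch of these two steps (the random dataset `M` is a hypothetical stand-in), the centering can be done with NumPy broadcasting. With one sample per row, the $p \times p$ estimate is obtained as `M_bar.T @ M_bar / (n - 1)`, which can be compared with `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(200, 4))            # hypothetical dataset with n = 200, p = 4

n, p = M.shape
M_bar = M - M.mean(axis=0)               # subtract each column mean (broadcast over rows)
Sigma_bar = M_bar.T @ M_bar / (n - 1)    # p x p estimate of the covariance matrix

# Sanity check against NumPy's own estimator (rowvar=False: one variable per column)
print(Sigma_bar.shape)
print(np.allclose(Sigma_bar, np.cov(M, rowvar=False)))
```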

#### A note on reduction
We can choose to reduce $\bar M$, i.e. rescale each feature to unit variance.

- If we do so : variables that are mostly noise will carry the same weight in the PCA as relevant variables.
- If we don't reduce our dataset : high-variance features will dominate the PCA.

However, if the features are not expressed in the same units, reduction is mandatory.
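
The effect of units can be illustrated on a hypothetical two-feature dataset (the feature names, units and distributions below are assumptions made up for the example). Without reduction, the first principal axis is almost entirely aligned with the feature expressed in the larger numbers; after reduction, both features contribute equally.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Two hypothetical, independent features measured in very different units
height_m = rng.normal(1.7, 0.1, size=n)        # metres, variance around 0.01
weight_g = rng.normal(70000, 10000, size=n)    # grams, variance around 1e8
M = np.column_stack([height_m, weight_g])

def first_axis(data):
    """Return the first principal axis and its share of the total variance."""
    centered = data - data.mean(axis=0)
    sigma = centered.T @ centered / (len(data) - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(sigma)   # increasing order
    return eigenvectors[:, -1], eigenvalues[-1] / eigenvalues.sum()

axis_raw, share_raw = first_axis(M)
axis_reduced, share_reduced = first_axis(M / M.std(axis=0, ddof=1))

print("raw     :", np.round(np.abs(axis_raw), 3), round(share_raw, 3))
print("reduced :", np.round(np.abs(axis_reduced), 3), round(share_reduced, 3))
```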





### Question : Considering  $ \bar \Sigma = P^T D P $, how do we project the vector $X$ onto the first $k$ principal components?




Answer : 

### Question : How do we compute the percentage of original variance retained by the first $k$ principal components?

Answer : 

### Tasks :  Compute the PCA transformation 

In [2]:
class my_pca:

    
    
    def __init__(self, reduce=False):
        '''
        Attributes:
        
        sigma : np.array
            the scatter matrix
        eigenvalues : np.array
            the eigenvalues of sigma, sorted in decreasing order
        eigenvectors : np.array
            the eigenvectors in a matrix : P 
        reduce : boolean 
            Reduce the dataset or not, i.e. give a variance of 1 to each feature
        '''        
        
        self.sigma = None
        self.eigenvalues = None
        self.eigenvectors = None 
        self.reduce = reduce 
        
        
    def fit(self,X):
        """ From X, compute the scatter matrix (sigma) and diagonalize sigma 
        to extract the eigenvectors and the eigenvalues
        
        Parameters:
        -----------
        X: (n, p) np.array
            Data matrix
        
        Returns:
        -----
        None. Updates self.sigma, self.eigenvalues and self.eigenvectors
        """       
        
        #TODO
    
    def projection(self,X,no_dims):
        """ Project X on the no_dims first principal components 
        
        Parameters:
        -----------
        no_dims: integer
            The number of dimension of our projected dataset
        X : np.array
            Dataset
        
        Returns:
        -----
        The projection of X on the no_dims first principal components
            np.array of size (n,no_dims)
        """
        
        #TODO
            
    def variance(self,no_dims):
        """ Returns the percentage of the total variance preserved by the projected dataset
        
        Parameters:
        -----------
        no_dims: integer
            The number of dimension of our projected dataset
         """ 
        
        #TODO
        

## Application : Biostatistics 

We are going to apply PCA to a medical dataset, first to explore the data and then to identify which medical features could help doctors diagnose breast cancer.



In [73]:
from sklearn.datasets import load_breast_cancer

In [95]:
# Load the dataset once and reuse the returned Bunch object
H = load_breast_cancer()
X = H.data
y = H.target
feature_names = H.feature_names

The features do not appear to share the same units, so we should reduce the dataset in the PCA.

### Task : Apply PCA to the dataset 

## Choosing the number of principal components for our projection 

As usual with unsupervised methods, there are several ways to answer this question. Common choices are the elbow method, or keeping enough principal components to retain 80%, 90% or 95% of the total variance.
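
The variance-threshold rule can be sketched as follows. The eigenvalues are hypothetical numbers chosen for illustration, and `n_components_for` is a helper name introduced here, not part of the lab skeleton.

```python
import numpy as np

# Hypothetical eigenvalues of a scatter matrix, sorted in decreasing order
eigenvalues = np.array([4.4, 2.3, 1.4, 0.8, 0.5, 0.4, 0.2])

explained_ratio = eigenvalues / eigenvalues.sum()   # share of variance per component
cumulative = np.cumsum(explained_ratio)             # share retained by the first k components

def n_components_for(threshold, cumulative):
    """Smallest k such that the first k components retain at least `threshold` of the variance."""
    # searchsorted gives the first index whose cumulative share reaches the threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

for threshold in (0.80, 0.90, 0.95):
    print(threshold, n_components_for(threshold, cumulative))
```

The elbow method instead looks for the point where `explained_ratio` stops decreasing sharply, which is usually judged visually on a plot.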

### Task : Print the percentage of variance retained versus the number of principal components 

### Task : If we want to keep 90% of the variance, how many principal components must we keep?

### Task : Plot the data points with their labels along the first 2 principal components. 
Use `plt.scatter`.
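
One possible way to organize such a plot is sketched below. `scatter_first_two` is a hypothetical helper, and the two Gaussian blobs merely stand in for a projected dataset; in the notebook, `Z` would be the projection of the breast cancer data and `y` its labels.

```python
import numpy as np
import matplotlib.pyplot as plt

def scatter_first_two(Z, y, class_names):
    """Scatter the first two columns of Z, with one colour per label in y."""
    fig, ax = plt.subplots(figsize=(8, 6))
    for label, name in enumerate(class_names):
        mask = (y == label)
        ax.scatter(Z[mask, 0], Z[mask, 1], alpha=0.5, s=40, linewidths=0, label=name)
    ax.set_xlabel('PC 1')
    ax.set_ylabel('PC 2')
    ax.legend()
    return ax

# Hypothetical stand-in for a projected dataset: two shifted Gaussian blobs
rng = np.random.default_rng(3)
Z_demo = np.vstack([rng.normal(-2, 1, size=(100, 2)),
                    rng.normal(2, 1, size=(100, 2))])
y_demo = np.repeat([0, 1], 100)
ax = scatter_first_two(Z_demo, y_demo, ['class 0', 'class 1'])
```

With the breast cancer data loaded above, `H.target_names` could be passed as `class_names`.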

We can see that a label can be assigned to most data points simply from their position along the first principal component. 
So the first axis alone is sufficient for a fairly good classification of malignant/benign cases. 

For our diagnosis, we must know which features contribute most to this first principal component. 
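
Since $Z^1 = A_1^TX$, the features that weigh most on the first component are those with the largest absolute coefficient in $A_1$. A minimal sketch of that ranking, with hypothetical feature names and illustrative coefficients (`top_features` is a helper introduced for this example):

```python
import numpy as np

def top_features(axis, feature_names, k=3):
    """Return the k features with the largest absolute coefficient on a principal axis."""
    order = np.argsort(np.abs(axis))[::-1][:k]
    return [(str(feature_names[i]), float(axis[i])) for i in order]

# Hypothetical first principal axis A_1 over four named features (roughly unit norm)
names = np.array(['radius', 'texture', 'smoothness', 'symmetry'])
A_1 = np.array([0.21, -0.80, 0.55, -0.10])

ranking = top_features(A_1, names, k=2)
print(ranking)
```

The sign of a coefficient only tells in which direction the feature moves the component; the ranking uses the absolute value.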



### Question : Which features most influence the first 2 principal components?

### Limitations of PCA

Because it is linear, PCA cannot extract complicated (non-linear) structure. 

Let's look at an example. 

In [None]:
X = np.loadtxt('mnist2500_X.txt')
labels = np.loadtxt('mnist2500_labels.txt')

### Tasks : 
- Apply PCA to the dataset 
- Project onto 2 dimensions and plot the data points with their labels

It is nearly impossible to distinguish most of the clusters, right?
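
The same limitation can be reproduced on a small synthetic case. In the sketch below (hypothetical data: two classes on concentric circles), no threshold on the first principal component separates the classes well, whereas a threshold on a non-linear feature, the distance to the center, separates them almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Two hypothetical classes lying on concentric circles of radius 1 and 3
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.repeat([1.0, 3.0], n) + rng.normal(0, 0.05, size=2 * n)
X_circ = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
labels_circ = np.repeat([0, 1], n)

# PCA by eigendecomposition of the scatter matrix, keeping one component
centered = X_circ - X_circ.mean(axis=0)
sigma = centered.T @ centered / (len(centered) - 1)
_, eigenvectors = np.linalg.eigh(sigma)
z = centered @ eigenvectors[:, -1]      # projection on the first principal axis

def best_threshold_accuracy(values, labels):
    """Best accuracy achievable by thresholding a single 1-D feature."""
    best = 0.0
    for t in np.sort(values):
        acc = np.mean((values > t) == labels)
        best = max(best, acc, 1 - acc)
    return best

acc_linear = best_threshold_accuracy(z, labels_circ)
acc_radius = best_threshold_accuracy(np.linalg.norm(centered, axis=1), labels_circ)
print(acc_linear, acc_radius)
```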

### Non-linear embedding : t-SNE

To visualize the dataset more faithfully, we are going to project the data with a non-linear method that aims to preserve the intrinsic structure of the data.

scikit-learn's definition of t-SNE : t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. 




In [13]:
from sklearn.manifold import TSNE
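
A minimal sketch of how `TSNE` can be called, on hypothetical stand-in data (three Gaussian clusters in 20 dimensions) rather than on the lab dataset. The parameter values are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)

# Hypothetical stand-in data: three Gaussian clusters in 20 dimensions
centers = rng.normal(0, 5, size=(3, 20))
X_demo = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])
labels_demo = np.repeat(np.arange(3), 50)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
embedding = tsne.fit_transform(X_demo)   # one 2-D point per sample

print(embedding.shape)
print(tsne.kl_divergence_)               # final value of the KL objective
```

Unlike PCA, t-SNE does not learn a reusable projection: `TSNE` has no `transform` method, so new points cannot be mapped into an existing embedding.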

### Tasks : 
- Apply t-SNE to the dataset 
- Plot the data points with their labels