# Mastering PCA - An essential tool for data science

Hello everyone, in this notebook we shall see how to apply PCA for dimensionality reduction. It is a key tool for visualization and statistical modeling. 
First, we'll see what PCA is and the maths behind it, then we'll use the **Mechanism of Action** competition to practice PCA.

I assume that you have some foundations in algebra and statistics. 

Some resources to help you: 
- [Gilbert Strang, Linear Algebra course, MIT](https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/)
- [An introductory course to Statistics, MIT](https://www.youtube.com/watch?v=VPZD_aij8H0&ab_channel=MITOpenCourseWare/)

## What is PCA?

PCA is a dimensionality reduction technique. It consists of finding a low-dimensional representation that captures as much as possible of the statistical structure of high-dimensional data. With PCA we analyze the latent structure of the data: we discard the low-variance directions, which are often dominated by noise, while retaining as much variance as possible. PCA projects the data onto a low-dimensional linear subspace, which is the simplest kind of manifold. Recall that a manifold is a topological space that locally looks like Euclidean space, here embedded in a higher-dimensional space. 

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_lle_digits_003.png">

For engineers, you might also know this technique from the **Karhunen-Loève transform** in signal processing.

**How does PCA do that?**

PCA projects high-dimensional data into a lower dimensional space spanned by orthogonal variables called **principal components**.

For an interesting read (a bit mathy) on how PCA can be computed in practice with random projections and a QR decomposition, [check this paper](https://arxiv.org/abs/0909.4061). Thanks @cdeotte for the link! Several scientific libraries can compute PCA with a randomized algorithm of the kind described in that paper. In that case the principal components can differ slightly from one run to the next, unless the random seed is fixed.
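
To make the remark on randomness concrete, here is a minimal sketch using scikit-learn's `PCA` on synthetic data. The data and the variable names (`X_demo`, `pca_a`, `pca_b`) are assumptions made for this illustration and have nothing to do with the competition. It only shows that fixing `random_state` makes the randomized solver repeatable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic, correlated data (illustrative assumption): 500 samples, 50 features
X_demo = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 50))

# Two fits with the randomized solver and the same seed
pca_a = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_demo)
pca_b = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_demo)

# With an identical seed, both fits return the same principal components
same_seed_match = np.allclose(pca_a.components_, pca_b.components_)
print("Same components with the same seed:", same_seed_match)
```

If exact reproducibility matters more than speed, `svd_solver="full"` avoids the randomized step altogether.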


## A bit of maths

There are two ways to approach PCA: 
1. PCA can be defined as the orthogonal projection that minimizes the average projection  cost, usually measured using the mean squared distance between the data points and their projections. 
2. PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space known as the principal subspace, such that the variance of the projected data is maximized.

Let $(X_i)_{1\leqslant i\leqslant n}$ be i.i.d. random variables in $\mathbb{R}^d$, assumed to be centered, and consider the matrix $X\in\mathbb{R}^{n\times d}$ such that the $i$-th row of $X$ is the observation $X'_i$. Let $\Sigma_n$ be the empirical covariance matrix:
$$
\Sigma_n = \frac{1}{n}\sum_{i=1}^n X_i X'_i\,.
$$
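
As a quick sanity check of this definition, the sketch below computes $\Sigma_n$ in three ways on a toy matrix: as an average of outer products, in matrix form $X'X/n$, and with `np.cov`. The toy data and the names `X_toy`, `sigma_sum`, `sigma_mat` and `sigma_np` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4

# Toy observations, one per row, centered column by column
X_toy = rng.normal(size=(n, d))
X_toy = X_toy - X_toy.mean(axis=0)

# Definition: average of the outer products X_i X_i'
sigma_sum = sum(np.outer(x, x) for x in X_toy) / n

# Equivalent matrix form: X'X / n
sigma_mat = X_toy.T @ X_toy / n

# NumPy's estimator; bias=True selects the 1/n normalization
sigma_np = np.cov(X_toy, rowvar=False, bias=True)

print(np.allclose(sigma_sum, sigma_mat), np.allclose(sigma_mat, sigma_np))
```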

### Minimizing the reconstruction error

Principal Component Analysis aims at reducing the dimensionality of the observations $(X_i)_{1\leqslant i \leqslant n}$ using a "compression" matrix $W\in \mathbb{R}^{p\times d}$ with $p\leqslant d$ so that for each $1\leqslant i \leqslant n$, $WX_i$ is a low-dimensional representation of $X_i$. The original observation may then be partially recovered using another matrix $U\in \mathbb{R}^{d\times p}$. Note that since $UW$ has rank at most $p$, this recovery is necessarily incomplete when $p<d$. Principal Component Analysis computes $U$ and $W$ using the least squares approach:
$$
\hspace{-0.5cm}\underset{(U,W)\in \mathbb{R}^{d\times p}\times \mathbb{R}^{p\times d}}{\mathrm{argmin}} \;\sum_{i=1}^n\|X_i - UWX_i\|^2\,.
$$

Let $\{\vartheta_1,\ldots,\vartheta_d\}$ be orthonormal eigenvectors associated with the eigenvalues $\lambda_1\geqslant \ldots \geqslant \lambda_d$ of $\Sigma_n$. Then a solution is given by the matrix $U_{\star}$ with columns $\{\vartheta_1,\ldots,\vartheta_p\}$ and $W_{\star} = U_{\star}'$.
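
The following sketch checks this solution numerically on toy data (the data and all variable names are assumptions for illustration). It builds $U_{\star}$ from the top $p$ eigenvectors, reconstructs the observations, and compares the squared reconstruction error with $n$ times the sum of the discarded eigenvalues, which is what the theory predicts. A random orthonormal basis is included as a baseline.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 300, 6, 2

# Toy centered data with very different variances per column
X_toy = rng.normal(size=(n, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
X_toy -= X_toy.mean(axis=0)

# Eigen-decomposition of the empirical covariance, sorted in decreasing order
sigma = X_toy.T @ X_toy / n
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U_star = eigvecs[:, :p]   # d x p, columns are the top-p eigenvectors
W_star = U_star.T         # p x d

# Rows of X_rec are U W X_i
X_rec = (X_toy @ W_star.T) @ U_star.T
recon_error = np.sum((X_toy - X_rec) ** 2)
discarded = n * eigvals[p:].sum()

# Baseline: projection onto a random p-dimensional subspace
Q, _ = np.linalg.qr(rng.normal(size=(d, p)))
random_error = np.sum((X_toy - X_toy @ Q @ Q.T) ** 2)

print(recon_error, discarded, random_error)
```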

### Maximizing the variance of the projection

For any dimension $1\leqslant p \leqslant  d$, let $\mathcal{F}_d^p$ be the set of all vector subspaces of $\mathbb{R}^d$ with dimension $p$. Principal Component Analysis computes a linear span $V_p$ such that
$$
V_p \in \underset{V\in \mathcal{F}_d^p}{\mathrm{argmin}} \;\sum_{i=1}^n\|X_i - \pi_V(X_i)\|^2\,, 
$$
where $\pi_V$ is the orthogonal projection onto the linear span $V$. By the Pythagorean theorem, $\|X_i - \pi_V(X_i)\|^2 = \|X_i\|^2 - \|\pi_V(X_i)\|^2$, so minimizing the reconstruction error is equivalent to maximizing $\sum_{i=1}^n\|\pi_V(X_i)\|^2$. Consequently, for $p=1$, $V_1 = \mathrm{span}\{v_1\}$ is a solution if and only if $v_1$ is a solution to:
$$
v_1 \in \underset{v \in \mathbb{R}^d\,;\, \|v\|=1}{\mathrm{argmax}} \sum_{i=1}^n   \langle X_i, v \rangle^2\,.
$$
For all $2\leqslant p \leqslant d$, following the same steps, it can be proved that  a solution is given by $V_p = \mathrm{span}\{v_1, \ldots, v_p\}$ where
$$
v_1 \in \underset{v\in \mathbb{R}^d\,;\,\|v\|=1}{\mathrm{argmax}} \sum_{i=1}^n\langle X_i,v\rangle^2 \quad\mbox{and for all}\;\; 2\leqslant k \leqslant p\;,\;\; v_k \in \underset{\substack{v\in \mathbb{R}^d\,;\,\|v\|=1\,;\\ v\perp v_1,\ldots,v\perp v_{k-1}}}{\mathrm{argmax}}\sum_{i=1}^n\langle X_i,v\rangle^2\,. 
$$
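
A small numerical check of this characterization, under the assumption of toy centered data (all names below are illustrative): the top eigenvector of the covariance matrix should reach an objective value of $n\lambda_1$ and should beat every randomly drawn unit vector, and the second eigenvector should be orthogonal to the first.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 5

# Toy correlated, centered data
X_toy = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X_toy -= X_toy.mean(axis=0)

def objective(v):
    """Sum over i of <X_i, v>^2 for a direction v."""
    return np.sum((X_toy @ v) ** 2)

# np.linalg.eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(X_toy.T @ X_toy / n)
v1 = eigvecs[:, -1]   # direction of largest variance
v2 = eigvecs[:, -2]   # second direction

# Compare against 1000 random unit vectors
random_dirs = rng.normal(size=(1000, d))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(objective(v) for v in random_dirs)

print(objective(v1), n * eigvals[-1], best_random, abs(v1 @ v2))
```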


## PCA in practice

The derivation above assumes that the data is centered (each column has zero mean), which is why $\Sigma_n$ contains no mean term. Hence, beforehand, we need to preprocess our data by centering the feature columns.

1. Compute the covariance matrix of the centered data \\(X\\) (one row per observation): \\(\frac{1}{n}X^TX\\)
2. Compute the corresponding eigenvalues and eigenvectors of the matrix 
3. Normalize each of the eigenvectors to turn them into unit vectors.
4. Once done, each eigenvector can be interpreted as an axis of the ellipsoid fitted to the data.

Obviously, in practice, everything is handled for you by Scikit-Learn, but it is also good to know how things work in the background.
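
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not what Scikit-Learn does internally: the function name `pca_from_scratch` and the toy data are assumptions made for this example. The result is compared with `sklearn.decomposition.PCA` up to the sign of each component, since an eigenvector is only defined up to its sign.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_from_scratch(X, n_components):
    """Hypothetical helper: PCA via eigen-decomposition of the covariance."""
    X_centered = X - X.mean(axis=0)                          # centering
    cov = X_centered.T @ X_centered / X_centered.shape[0]    # step 1
    eigvals, eigvecs = np.linalg.eigh(cov)                   # step 2
    # Step 3: eigh already returns unit-norm eigenvectors
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T                         # one axis per row
    return components, eigvals[order], X_centered @ components.T

rng = np.random.default_rng(3)
# Toy data with well separated variances along each column
X_toy = rng.normal(size=(250, 8)) * np.arange(8, 0, -1)

components, top_eigvals, Z = pca_from_scratch(X_toy, 3)
Z_sklearn = PCA(n_components=3, svd_solver="full").fit_transform(X_toy)

# Projections agree up to a sign flip per component
matches_sklearn = np.allclose(np.abs(Z), np.abs(Z_sklearn))
print("Matches scikit-learn up to sign:", matches_sklearn)
```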

## PCA and SVD

Some of you might have heard that PCA is closely related to the [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD). I won't go into too much detail here since this notebook is about applying PCA. However, it is good to remember that when applying PCA to a dataset, instead of computing the eigenvectors as stated above, the software usually computes the singular value decomposition of the centered design matrix \\(X\\). Writing the SVD as \\(X = PSQ^T\\), we get \\(X^TX = QS^2Q^T\\): the right singular vectors of \\(X\\) are the eigenvectors of the covariance matrix, and its eigenvalues are the squared singular values divided by \\(n\\). Computing the eigen-decomposition of the covariance matrix and computing the SVD of the design matrix are therefore equivalent, and both solve the optimization problems of PCA.
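
Here is a short numerical check of this equivalence on toy data (the data and variable names are assumptions for illustration). The eigenvalues of the covariance matrix are compared with the squared singular values divided by $n$, and the eigenvectors with the right singular vectors, up to sign.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 300, 5

# Toy centered design matrix with distinct variances per column
X_toy = rng.normal(size=(n, d)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])
X_toy -= X_toy.mean(axis=0)

# Route 1: eigen-decomposition of the covariance (reordered to descending)
eigvals, eigvecs = np.linalg.eigh(X_toy.T @ X_toy / n)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the design matrix (singular values already descending)
_, s, Vt = np.linalg.svd(X_toy, full_matrices=False)
eigvals_from_svd = s ** 2 / n

# Rows of Vt are the right singular vectors; compare up to sign
same_directions = np.allclose(np.abs(Vt), np.abs(eigvecs.T), atol=1e-6)
print(np.allclose(eigvals_from_svd, eigvals), same_directions)
```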

Now let's apply PCA on the **MoA** dataset.

## Applying PCA to MoA

First of all, if you seek a comprehensive EDA: https://www.kaggle.com/rftexas/moa-in-depth-eda-start-here

I hope you like the notebook! Don't hesitate to upvote it ;)

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import QuantileTransformer, LabelEncoder

from tqdm import tqdm

sns.set_style('whitegrid')

In [None]:
train_df = pd.read_csv('../input/lish-moa/train_features.csv')
targets_df = pd.read_csv('../input/lish-moa/train_targets_scored.csv')

cat_features = [x for x in list(train_df.columns) if 'cp_' in x]
gene_features = [x for x in list(train_df.columns) if 'g-' in x]
cell_features = [x for x in list(train_df.columns) if 'c-' in x]

features = cat_features + gene_features + cell_features 

For this experiment, I'll remove control groups since we know they don't trigger any mechanisms.

In [None]:
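# cp_type is still a string column here (it is label-encoded later), so compare
# against the control label 'ctl_vehicle' rather than 0.
train_df = train_df[train_df['cp_type'] != 'ctl_vehicle']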
targets_df = targets_df.loc[train_df.index]

train_df = train_df.reset_index(drop=True)
targets_df = targets_df.reset_index(drop=True)

Let's plot the distribution of the means of the columns.

In [None]:
means = train_df[gene_features + cell_features].mean(axis=0)


plt.figure(figsize=(7, 5))
sns.kdeplot(means, fill=True)  # `fill` replaces the deprecated `shade` argument

plt.title('Distribution of column means', fontsize=20, fontweight="bold");

Since we want to make sure the columns are centered, we apply some form of pre-processing. **StandardScaler** would center and scale each column (note that **MinMaxScaler** only rescales to a fixed range and does not center). Here I will go with **QuantileTransformer** since at the time of writing it has been shown to be particularly useful for this competition. In addition to centering our data, it maps each feature to an approximately Gaussian marginal distribution.
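
To see what this transformation does before applying it to the competition data, here is a minimal sketch on synthetic, heavily skewed data (the exponential toy data and the names `X_skewed`, `X_gauss` are assumptions for illustration). `QuantileTransformer` works column by column, so it can be fitted on a whole 2-D array at once.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(5)

# Skewed toy features: exponential draws are neither centered nor symmetric
X_skewed = rng.exponential(scale=2.0, size=(2000, 3))

qt = QuantileTransformer(n_quantiles=100, random_state=0,
                         output_distribution="normal")
X_gauss = qt.fit_transform(X_skewed)

# After the transform each column is roughly standard normal
col_means = X_gauss.mean(axis=0)
col_stds = X_gauss.std(axis=0)
print("means before:", X_skewed.mean(axis=0))
print("means after: ", col_means)
print("stds after:  ", col_stds)
```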

In [None]:
for col in tqdm(gene_features + cell_features):
    transformer = QuantileTransformer(n_quantiles=100, random_state=0, output_distribution="normal")
    
    vec_len = len(train_df[col])
    
    raw_vec = train_df[col].values.reshape(vec_len, 1)
    transformer.fit(raw_vec)

    train_df[col] = transformer.transform(raw_vec).reshape(1, vec_len)[0]

In [None]:
means = train_df[gene_features + cell_features].mean(axis=0)


plt.figure(figsize=(7, 5))
sns.kdeplot(means, fill=True)  # `fill` replaces the deprecated `shade` argument

plt.title('Distribution of column means', fontsize=20, fontweight="bold");

Much better... Now the data has been centered.
Let's finally encode the categorical features, and we will be done with preprocessing.

In [None]:
for feat in cat_features:
    le = LabelEncoder()
    train_df[feat] = le.fit_transform(train_df[feat])

The dataset is composed of 3 categorical features, 100 cell features and 772 gene features. From the EDA performed in a previous notebook, we inferred that some features have strong linear relationships. We want to remove this redundancy with PCA to reduce the computational time and the variance of our model. We will try two strategies:

1. Applying PCA to both cell features and gene features
2. Applying PCA to cell features and gene features separately

Since the categorical features are discrete and each one brings additional, unrelated information, I won't apply PCA to them.


### Applying PCA to the entire dataset

We'll apply PCA to the gene and cell features taken together.

In [None]:
# HELPER FUNCTIONS FOR CROSS_VALIDATION

#######################################
############# IGNORE ##################
#######################################

import sys
sys.path.append('../input/iterative-stratification/iterative-stratification-master')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

import os, re, random, gc
from tqdm import tqdm
from glob import glob

from datetime import datetime
import time

from sklearn.metrics import log_loss

import torch 
import torch.nn as nn
import torch.nn.functional as F

from torch.optim import Adam, lr_scheduler
from torch.utils.data import Dataset

import warnings
warnings.filterwarnings('ignore')

#######################################
############# CONFIG ##################
#######################################

class config:
    
    ###############
    # Training
    ###############
    
    num_folds = 5
    
    num_workers = 8
    batch_size = 128
    num_epochs = 30
    
    ###############
    # LR scheduling
    ###############
    
    step_scheduler = True
    lr = 1e-4
    
    ###############
    # Miscellaneous
    ###############
    
    seed = 2020
    verbose = False
    verbose_step = 5
        
#######################################
############# UTILS ###################
#######################################

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
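    # benchmark=True lets cuDNN pick algorithms non-deterministically,
    # which defeats the purpose of seeding
    torch.backends.cudnn.benchmark = False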

seed_everything(config.seed)


class AverageMeter(object):
    def __init__(self):
        self.reset()
    
    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0
    
    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
        
#######################################
############# DATASET #################
#######################################
        
class MoADataset(Dataset):
    def __init__(self, features, targets=None, train=True):
        super().__init__()
        self.features = features
        self.train = train
        
        if self.train:
            self.targets = targets
                
    def __len__(self):
        return len(self.features)
    
    def __getitem__(self, item):
        feats = self.features[item, :].astype(np.float32)
        
        if self.train: 
            
            targets = self.targets[item, :].astype(np.float32) 
            
            return {
                'features': torch.tensor(feats, dtype=torch.float),
                'targets': torch.tensor(targets, dtype=torch.float),
            }
        else: 
            return {'features': torch.tensor(feats, dtype=torch.float)}

#######################################
############# MODEL ###################
#######################################

class BaselineModel(nn.Module):
    def __init__(self, num_features):
        super(BaselineModel, self).__init__()
        
        self.num_features = num_features
        
        self.block1 = nn.Sequential(
            nn.BatchNorm1d(self.num_features),
            nn.Dropout(0.2),
            nn.utils.weight_norm(nn.Linear(self.num_features, 2048)),
            nn.ReLU(),
        )
        
        self.block2 = nn.Sequential(
            nn.BatchNorm1d(2048),
            nn.Dropout(0.5),
            nn.utils.weight_norm(nn.Linear(2048, 1024)),
            nn.ReLU(),
        )
        
        self.block3 = nn.Sequential(
            nn.BatchNorm1d(1024),
            nn.Dropout(0.5),
            nn.utils.weight_norm(nn.Linear(1024, 206)),
        )
    
    def forward(self,
                inputs):
        
        
        x = self.block1(inputs)
        x = self.block2(x)
            
        return self.block3(x)

#######################################
############# FITTER ##################
#######################################
    
class Fitter:
    def __init__(self, model, seed, fold, device, config):
        self.config = config
        self.model = model
        self.seed = seed
        self.device = device
        self.fold = fold
                        
        self.epoch = 0
        
        self.history = {
            'train_history_loss': [],
            'val_history_loss': [],
        }
        
        self.base_dir = './'
        self.log_path = f'{self.base_dir}/log.txt'
        
        self.best_loss = float('inf')
        
        self.optimizer = torch.optim.Adam(
            self.model.parameters(),
            weight_decay=1e-5
        )
        
        self.scheduler = lr_scheduler.ReduceLROnPlateau(
            self.optimizer,
            mode='min',
            factor=0.1,
            patience=3,
            eps=1e-4,
            verbose=False
        )
        
        self.criterion = nn.BCEWithLogitsLoss().to(self.device)
        self.log(f'Fitter prepared. Training on {self.device}')
    
    def fit(self, train_loader, valid_loader):
        
        for epoch in range(self.config.num_epochs):
            
            if self.config.verbose:
                lr = self.optimizer.param_groups[0]['lr']
                timestamp = datetime.utcnow().isoformat()
                self.log(f'\n{timestamp}\nLR: {lr}\n')
            
            t = time.time()
            train_loss = self.train_one_epoch(train_loader)
            self.history['train_history_loss'].append(train_loss.avg)
            
            self.log(f'[RESULT]: Train. Epoch: {self.epoch}, ' + \
                     f'loss: {train_loss.avg:.5f}, ' + \
                     f'time: {(time.time() - t):.5f}')
            self.save(f'{self.base_dir}/last-checkpoint.bin')
            
            t = time.time()
            val_loss, y_oof = self.validation_one_epoch(valid_loader)
            self.history['val_history_loss'].append(val_loss.avg)
            
            self.log(f'[RESULT]: Val. Epoch: {self.epoch}, ' + \
                     f'val_loss: {val_loss.avg:.5f}, ' + \
                     f'time: {(time.time() - t):.5f}')
            
            self.scheduler.step(val_loss.avg)
            
            if val_loss.avg < self.best_loss:
                self.best_loss = val_loss.avg
                self.model.eval()
                self.save(f'{self.base_dir}/best-loss-fold-{str(self.fold)}-seed-{str(self.seed)}.bin')
            
            self.epoch += 1 
        
        return y_oof
    
    def train_one_epoch(self, train_loader):
        self.model.train()
        
        loss_score = AverageMeter()
        
        t = time.time()
        
        for step, data in enumerate(train_loader):
            if self.config.verbose:
                if step % self.config.verbose_step == 0:
                    print(
                        f'Train Step {step}/{len(train_loader)}, ' + \
                        f'loss: {loss_score.avg:.5f}, ' + \
                        f'time: {(time.time() - t):.5f}', end='\r'
                    )
            
            features = data['features']
            targets = data['targets']
            
            features = features.to(self.device)
            targets = targets.to(self.device).float()
                
            batch_size = features.shape[0]
            
            for p in self.model.parameters(): p.grad = None
                
            outputs = self.model(
                features
            )
                
            loss = self.criterion(outputs, targets)
            loss.backward()
                
            loss_score.update(
                loss.detach().item(), 
                batch_size
            )
                
            self.optimizer.step()
        
        return loss_score

    def validation_one_epoch(self, valid_loader):
        self.model.eval()
        
        preds = []
        
        loss_score = AverageMeter()
        
        t = time.time()
        
        for step, data in enumerate(valid_loader):
            if self.config.verbose:
                if step % self.config.verbose_step == 0:
                    print(
                        f'Val Step {step}/{len(valid_loader)}, ' + \
                        f'loss: {loss_score.avg:.5f}, ' + \
                        f'time: {(time.time() - t):.5f}', end='\r'
                    )
            
            features = data['features']
            targets = data['targets']
            
            features = features.to(self.device)
            targets = targets.to(self.device).float()
            
            batch_size = features.shape[0]
            
            with torch.no_grad():
                outputs = self.model(
                    features
                )
                loss = self.criterion(outputs, targets)
                loss_score.update(loss.detach().item(), batch_size)
                
                preds.append(
                    torch.sigmoid(outputs).detach().cpu().numpy()
                )
        
        return loss_score, np.concatenate(preds)
    
    def save(self, path):
        self.model.eval()
        
        torch.save({
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict(),
            'scheduler_state_dict': self.scheduler.state_dict(),
            'best_loss': self.best_loss,
            'epoch': self.epoch,
            'history': self.history,
        }, path)
    
    def load(self, path):
        checkpoint = torch.load(path)
        self.model.load_state_dict(checkpoint['model_state_dict'])
        self.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        self.scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        self.best_loss = checkpoint['best_loss']  # restore the attribute used in fit()
        self.epoch = checkpoint['epoch'] + 1
        self.history = checkpoint['history']
        
    def log(self, message):
        if self.config.verbose:
            print(message)
        with open(self.log_path, 'a+') as logger:
            logger.write(f'{message}\n')
                
    def print_history(self):
        plt.figure(figsize=(15,5))
        
        plt.plot(
            np.arange(self.config.num_epochs),
            self.history['train_history_loss'],
            '-o',
            label='Train loss',
            color='#ff7f0e'
        )
        
        plt.plot(
            np.arange(self.config.num_epochs),
            self.history['val_history_loss'],
            '-o',
            label='Val loss',
            color='#1f77b4'
        )
        
        x = np.argmin(self.history['val_history_loss'])
        y = np.min(self.history['val_history_loss'])
        
        plt.ylim(0, 0.03)
        
        xdist = plt.xlim()[1] - plt.xlim()[0]
        ydist = plt.ylim()[1] - plt.ylim()[0]
        
        plt.scatter(x, y, s=200, color='#1f77b4')
        
        plt.text(
            x-0.03*xdist,
            y-0.13*ydist,
            'min loss\n%.5f'%y,
            size=14
        )
        
        plt.ylabel('Loss', size=14)
        plt.xlabel('Epoch', size=14)
        
        plt.title(f'FOLD {self.fold + 1}',size=18)
        
        # Single legend call: the original second call (loc=3) overrode the first one
        plt.legend(loc=3)
        plt.show()  

#######################################
############# ENGINE ##################
#######################################

def cross_validate_strategy(X, y):

    oof_preds = np.zeros((len(X), y.shape[1]))  # sized from the inputs, not from globals

    device = torch.device(
        'cuda' if torch.cuda.is_available() else 'cpu'
    )

    kfold = MultilabelStratifiedKFold(config.num_folds, shuffle=True, random_state=config.seed)

    for fold, (trn_, val_) in tqdm(enumerate(kfold.split(X, y)), total=config.num_folds):

        # Model
        model = BaselineModel(X.shape[1]).to(device)

        # Data
        X_train = X[trn_, :]
        X_valid = X[val_, :]

        y_train = y[trn_, :]
        y_valid = y[val_, :]

        # Dataset
        train_dataset = MoADataset(X_train, y_train)
        valid_dataset = MoADataset(X_valid, y_valid)

        # Dataloader
        train_loader = torch.utils.data.DataLoader(
            train_dataset,
            batch_size=config.batch_size,
            pin_memory=True,
            drop_last=True,
            shuffle=True,
            num_workers=config.num_workers
        )

        valid_loader = torch.utils.data.DataLoader(
            valid_dataset,
            batch_size=config.batch_size,
            num_workers=config.num_workers,
            shuffle=False,
            pin_memory=True,
            drop_last=False,
        )

        # Fitter
        fitter = Fitter(model, config.seed, fold, device, config)

        y_oof = fitter.fit(train_loader, valid_loader) 
        oof_preds[val_, :] = y_oof
    
    # Score against the targets passed to the function rather than the global targets_df
    oof_score = 0
    y_true = y

    for i in range(oof_preds.shape[1]):
        _score = log_loss(y_true[:,i], oof_preds[:,i])
        oof_score += _score / y_true.shape[1]
    
    return oof_score

In [None]:
%%time

X_gene_cell = train_df[gene_features + cell_features].values
y = targets_df.iloc[:, 1:].values

pca = PCA(n_components=436)
X_pca = pca.fit_transform(X_gene_cell)

X = np.concatenate((train_df[cat_features], X_pca), axis=1)

In [None]:
plt.figure(figsize=(7, 5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Explained variance ratio', fontsize=20, fontweight="bold");

As expected, adding more components explains more variance. For your theoretical understanding, the cumulative explained variance ratio is:
$$\frac{\sum_{i=1}^p \lambda_i}{\sum_{j=1}^d \lambda_j}$$

where \\(\lambda_1 \geqslant \ldots \geqslant \lambda_d\\) are the eigenvalues of the covariance matrix. As in the maths section, \\(p\\) is the number of retained principal components, while \\(d\\) is the number of features.
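
As a worked example of this ratio, the sketch below computes the cumulative explained variance from a list of eigenvalues and finds the smallest number of components that reaches a given threshold. The helper names `cumulative_explained_variance` and `n_components_for`, as well as the toy eigenvalues, are assumptions made for illustration. The threshold is assumed to lie in $(0, 1]$.

```python
import numpy as np

def cumulative_explained_variance(eigvals):
    """Ratio of the top-k eigenvalue sum to the total, for k = 1..len(eigvals)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return np.cumsum(eigvals) / eigvals.sum()

def n_components_for(eigvals, threshold):
    """Smallest number of components whose ratio reaches `threshold`."""
    ratios = cumulative_explained_variance(eigvals)
    return int(np.searchsorted(ratios, threshold) + 1)

# Toy eigenvalues (total = 10), so the ratios are 0.4, 0.7, 0.9, 1.0
toy_eigvals = [4.0, 3.0, 2.0, 1.0]
ratios = cumulative_explained_variance(toy_eigvals)

print(ratios)
print(n_components_for(toy_eigvals, 0.8))    # 3 components reach 0.9 >= 0.8
print(n_components_for(toy_eigvals, 0.95))   # all 4 components are needed
```

With a fitted scikit-learn `PCA` object, the same curve is `np.cumsum(pca.explained_variance_ratio_)`, which is what the plot above shows.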

In [None]:
scores = []
n_comps = [x for x in range(100, 450, 50)]

for n_comp in n_comps:
    pca = PCA(n_components=n_comp)
    
    X_pca = pca.fit_transform(X_gene_cell)
    X = np.concatenate((train_df[cat_features], X_pca), axis=1)
    
    score = cross_validate_strategy(X, y)
    scores.append(score)

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.plot(n_comps, scores)
plt.xlabel('Number of principal components')
plt.ylabel('Cross validation score')
plt.title('CV score vs n_components', fontsize=20, fontweight="bold");

Here we have a clear winner: the log loss is minimal with 300 components.

### Applying PCA to gene and cell features separately

In [None]:
%%time

X_gene = train_df[gene_features].values
X_cell = train_df[cell_features].values
y = targets_df.iloc[:, 1:].values

pca1 = PCA(n_components=580)
X_gene_pca = pca1.fit_transform(X_gene)

pca2 = PCA(n_components=75)
X_cell_pca = pca2.fit_transform(X_cell)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

ax[0].plot(np.cumsum(pca1.explained_variance_ratio_))
ax[0].set_xlabel('Number of components')
ax[0].set_ylabel('Cumulative explained variance')
ax[0].set_title('Gene', fontsize=14)

ax[1].plot(np.cumsum(pca2.explained_variance_ratio_))
ax[1].set_xlabel('Number of components')
ax[1].set_ylabel('Cumulative explained variance')
ax[1].set_title('Cell', fontsize=14)

fig.suptitle('Explained variance ratio', fontsize=20, fontweight="bold");

In [None]:
scores = []
n_comps1 = [100, 150, 200, 250, 300, 350]
n_comps2 = [20, 25, 30, 40, 50, 60]

for n_comp1, n_comp2 in zip(n_comps1, n_comps2):
    pca1 = PCA(n_components=n_comp1)
    pca2 = PCA(n_components=n_comp2)
    
    X_gene_pca = pca1.fit_transform(X_gene)
    X_cell_pca = pca2.fit_transform(X_cell)
    
    X = np.concatenate((train_df[cat_features], X_gene_pca, X_cell_pca), axis=1)
    
    score = cross_validate_strategy(X, y)
    scores.append(score)

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.plot(n_comps1, scores)
plt.xlabel('Number of principal components (Gene, Cell)')
plt.ylabel('Cross validation score')
plt.title('CV score vs n_components', fontsize=20, fontweight="bold");