In this short kernel, I show how to choose the number of principal components when using principal component analysis (PCA) for dimensionality reduction, as in the MoA competition.

Full post with a detailed explanation: https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# If you like it, Do Upvote :)

In [None]:
import numpy as np
import pandas as pd
import os

from sklearn import preprocessing
from sklearn.decomposition import PCA

In [None]:
os.listdir('../input/lish-moa')

In [None]:
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
test_features = pd.read_csv('../input/lish-moa/test_features.csv')

In [None]:
train_features.info()

In [None]:
train_features.head()

In [None]:
GENES = [col for col in train_features.columns if col.startswith('g-')]
CELLS = [col for col in train_features.columns if col.startswith('c-')]
len(GENES+CELLS)

## Choosing the number of PCA components for the gene columns

Don't choose the number of components manually. Instead, use the option that lets you set the fraction of the input variance that the generated components should explain (in scikit-learn, pass a float between 0 and 1 as `n_components`).

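### Remember to scale the data before using PCA!
PCA is sensitive to feature scale, so put all features on a comparable scale first. The cells below use `QuantileTransformer` with `output_distribution="normal"`, which maps each feature to an approximately standard normal distribution rather than to the range between 0 and 1.

Typically, we want the explained variance to be between 95% and 99%.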
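
As a minimal sketch of the variance-based option described above: scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` and keeps just enough components to exceed that fraction of explained variance. The data here is synthetic (an assumption for illustration, not the MoA features), and the names `X`, `pca_95`, and `n_kept` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 samples, 50 correlated features
# driven by 5 latent factors plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# A float n_components asks PCA to keep enough components
# to explain more than 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca_95.fit_transform(X)

n_kept = pca_95.n_components_
explained = pca_95.explained_variance_ratio_.sum()
print(n_kept, round(explained, 4))
```

On the real data, the same call would be applied to the transformed gene columns instead of `X`.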

In [None]:
from sklearn.preprocessing import QuantileTransformer

In [None]:
for col in (GENES + CELLS):
    # Fit on the training column only, then apply the same mapping to train and test
    transformer = QuantileTransformer(random_state=0, output_distribution="normal")
    raw_vec = train_features[col].values.reshape(-1, 1)  # scikit-learn expects a 2-D array
    transformer.fit(raw_vec)

    train_features[col] = transformer.transform(raw_vec).ravel()
    test_features[col] = transformer.transform(test_features[col].values.reshape(-1, 1)).ravel()

# Stack the transformed gene columns of train and test for fitting PCA
data = pd.concat([train_features[GENES], test_features[GENES]])

### Now we have standardized our data

The fitted scikit-learn `PCA` object exposes `explained_variance_ratio_`, which we can accumulate and plot as the cumulative explained variance.

In [None]:
pca = PCA().fit(data)

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,6)

fig, ax = plt.subplots()
y = np.cumsum(pca.explained_variance_ratio_)
xi = np.arange(1, len(y) + 1)  # 1-based component counts, one per fitted component

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, len(y) + 1, step=50))  # one tick every 50 components
plt.ylabel('Cumulative explained variance (fraction)')
plt.title('The number of components needed to explain variance')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()

The chart shows how many principal components we need: read off the point where the cumulative curve crosses the 95% line.

## In this case, to get 95% of variance explained I need 600 principal components.
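
Reading the crossing point off a chart is approximate. A minimal sketch of computing it directly from the cumulative explained variance follows; `components_for_variance` is a hypothetical helper written for this example, and the data is synthetic rather than the MoA features.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    n = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding leaving the final cumulative sum just below threshold
    return min(n, len(cumulative))


rng = np.random.default_rng(0)

# Synthetic stand-in data: 300 samples, 40 features built from 8 latent factors.
X_demo = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 40))
X_demo += 0.5 * rng.normal(size=X_demo.shape)

pca_full = PCA().fit(X_demo)
n_needed = components_for_variance(pca_full.explained_variance_ratio_, 0.95)
print(n_needed)
```

With the notebook's fitted `pca`, the equivalent call would be `components_for_variance(pca.explained_variance_ratio_)`.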