This notebook is largely reproduced from:

- [PCA clearly explained —When, Why, How to use it and feature importance: A guide in Python](https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e)
- [Principal Component Analysis Visualization by Prasad Ostwal](https://ostwalprasad.github.io/machine-learning/PCA-using-python.html)
- [PCA in 3 steps by Sebastian Raschka](http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html)
- [PCA by plotly](https://plot.ly/ipython-notebooks/principal-component-analysis/)
- [In Depth: Principal Component Analysis](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html)

# Dimensionality Reduction

- A family of unsupervised machine learning techniques. You're already familiar with unsupervised ML (e.g., clustering analysis, the K-Means algorithm).
- **Reducing the number of variables (or the dimension of your data) to a more manageable set of variables**

### Family of Machine Learning Algorithms

<img src='images/machinelearning.gif' height=500>

### Example


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
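# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs scikit-learn < 1.2
from sklearn.datasets import load_boston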

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head(2)

In [None]:
len(boston.columns)

We can **convert 13 columns to two dimensions or components**. Here is the visual representation.

<img src='https://ostwalprasad.github.io/images/p2/output_20_1.png'>

Now we can see **how influential each of the original variables was in each component**.

<img src='https://ostwalprasad.github.io/images/p2/output_26_0.png'>

In [None]:
# pay attention to PTRATIO and CRIM
sns.heatmap(boston.corr(), vmax=.8, square=True, annot=True, fmt=".1f");

### Algorithms for Dimensionality Reduction: [scikit-learn](https://scikit-learn.org/stable/modules/decomposition.html#decompositions)

1. **Principal component analysis (PCA)**
2. Truncated singular value decomposition and latent semantic analysis
3. Dictionary Learning
4. Factor Analysis
5. Independent component analysis (ICA)
6. Non-negative matrix factorization (NMF or NNMF)
7. Latent Dirichlet Allocation (LDA)

# Principal Component Analysis (PCA)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px  # if you don't have plotly, install it first (pip install plotly)

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

### Data: iris

This dataset has four variables; let's try to reduce it to two dimensions.

<img src="images/iris.png" height=25 width=200>
<img src="images/iris_3.gif" height=300>

In [None]:
df = pd.read_csv("data/iris.csv")
df.head()

In [None]:
X = df[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']].values
y = df['Name'].values

In [None]:
# Standardizing the features (zero mean, unit variance)
x = StandardScaler().fit_transform(X)
pd.DataFrame(x).head()
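
As a rough sketch of what the standardization step does: `StandardScaler` subtracts each column's mean and divides by its standard deviation (the population standard deviation, `ddof=0`). The array `toy` below is made-up illustrative data, not the iris values, and the variable names are hypothetical.

```python
import numpy as np

# Made-up measurements: 4 samples, 2 features (illustrative only)
toy = np.array([[5.0, 1.0],
                [6.0, 3.0],
                [7.0, 5.0],
                [8.0, 7.0]])

# Standardize each column: subtract the mean, divide by the standard deviation.
# ddof=0 (the NumPy default) matches the estimator StandardScaler uses.
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

Standardizing matters for PCA because features measured on larger scales would otherwise dominate the variance.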

### PCA Projection to 2D

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
components = pca.fit_transform(x)

fig = px.scatter(components, x=0, y=1, color=y)
fig.show()

In [None]:
pcadf = pd.DataFrame(data = components, columns = ['pca1', 'pca2'])
pcadf.head()

In [None]:
finalDf = pd.concat([pcadf, df[['Name']]], axis = 1)
finalDf.head(2)

### Explained Variance

*The explained variance tells you how much information (variance) can be attributed to each of the principal components. This matters because converting a 4-dimensional space to a 2-dimensional one loses some of the variance (information).*
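
One way to use these ratios is to pick the smallest number of components that retains a target share of the variance. The helper below is a minimal sketch: `components_needed` is a hypothetical function name, and `example_ratios` holds made-up values rather than results computed from iris. It assumes the ratios are sorted from largest to smallest and that the threshold does not exceed their total.

```python
import numpy as np

def components_needed(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    # First index where the running total is >= threshold, converted to a count
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical ratios, sorted from largest to smallest (not computed from iris)
example_ratios = [0.60, 0.25, 0.10, 0.05]
print(components_needed(example_ratios, 0.80))  # 2
print(components_needed(example_ratios, 0.90))  # 3
```

scikit-learn's `PCA` also accepts a fraction for `n_components` (for example `PCA(n_components=0.95)`), in which case it keeps as many components as are needed to reach that share of variance.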

In [None]:
print(pca.explained_variance_ratio_)            # explained variance of each component
print(pca.explained_variance_ratio_.cumsum())   # cumulative sum

In [None]:
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

total_var = pca.explained_variance_ratio_.sum() * 100

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={"x": "# Components", "y": "Explained Variance"}
)

### Visualize Loadings

Visualize how strongly each characteristic influences a principal component.

In [None]:
fig = px.scatter(components, x=0, y=1, color=y)
fig.show()

In [None]:
features = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']

loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig = px.scatter(components, x=0, y=1, color=y)

for i, feature in enumerate(features):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0],
        y1=loadings[i, 1]
    )
    fig.add_annotation(
        x=loadings[i, 0],
        y=loadings[i, 1],
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )
fig.show()

In [None]:
df.columns

- **SepalLength, PetalLength, and PetalWidth** are the most important for PC1
- **SepalWidth** is the most important for PC2
- **Arrows** (variables/features) that point in the same direction indicate **correlation** between the variables they represent, whereas arrows pointing in **opposite directions** indicate a contrast between the variables they represent.
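
To see why the loadings can be read this way, here is a sketch on synthetic data (not iris; `mix` and the other names are assumptions made for illustration). When the data is standardized with the sample standard deviation, a loading (eigenvector entry times the square root of the eigenvalue) equals the correlation between the original feature and the component score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: 300 samples, 4 features (illustrative, not iris)
mix = np.array([[1.0, 0.8, 0.5, 0.2],
                [0.0, 1.0, 0.3, -0.4],
                [0.0, 0.0, 1.0, 0.6],
                [0.0, 0.0, 0.0, 1.0]])
data = rng.normal(size=(300, 4)) @ mix

# Standardize with the sample standard deviation (ddof=1) so that
# np.cov of the result is exactly the correlation matrix
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

# PCA via eigendecomposition, sorted by decreasing eigenvalue
eig_vals, eig_vecs = np.linalg.eigh(np.cov(z.T))
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Loadings for the first two components: eigenvector scaled by sqrt(eigenvalue)
loadings = eig_vecs[:, :2] * np.sqrt(eig_vals[:2])

# Correlation of every original feature with every component score
scores = z @ eig_vecs[:, :2]
feature_score_corr = np.corrcoef(z.T, scores.T)[:4, 4:]

print(np.allclose(loadings, feature_score_corr))  # True
```

In the notebook itself the match is only approximate, because `StandardScaler` uses `ddof=0` while `explained_variance_` uses `ddof=1`; the two differ by a factor of sqrt(n / (n - 1)).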

In [None]:
# The above findings are verified using correlation analysis
df[features].corr()

In [None]:
# Effect of variables on each component

sns.set(rc={'figure.figsize':(11,8)})

ax = sns.heatmap(pca.components_,
                 cmap='YlGnBu',
                 yticklabels=["PCA" + str(i) for i in range(1, pca.n_components_ + 1)],
                 xticklabels=list(df[features].columns),
                 cbar_kws={"orientation": "horizontal"})
ax.set_aspect("equal")

### PCA Projection to 3D

In [None]:
pca = PCA(n_components=3)
components = pca.fit_transform(x)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=y,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.show()

# Mathematics

Reproduced from [PCA from scratch](https://towardsdatascience.com/principal-component-analysis-pca-from-scratch-in-python-7f3e2a540c51) and [PCA with Numpy](https://towardsdatascience.com/pca-with-numpy-58917c1d0391)

### Standardizing Data

In [None]:
x[:5]

### Covariance matrix

Covariance matrices, like correlation matrices, contain information about the amount of variance shared between pairs of variables.

Covariance shows **the direction (positive or negative) of the relationship** rather than its strength, which is what correlation measures. Covariance values are not standardized, so they can fall outside the range -1 to +1.
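
The difference between the two can be sketched on a small made-up array (`toy` is illustrative data, not iris, and the variable names are hypothetical). The sample covariance is the average cross-product of the centered columns; dividing by the two standard deviations rescales it into a correlation.

```python
import numpy as np

# Made-up data: 5 samples, 2 features on very different scales (illustrative only)
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 200.0],
                [4.0, 500.0],
                [5.0, 400.0]])

# Sample covariance: center the columns, then average the cross-products
# using n - 1 in the denominator (the np.cov default)
centered = toy - toy.mean(axis=0)
cov_manual = centered.T @ centered / (len(toy) - 1)

# Correlation: covariance rescaled by both standard deviations, so it lies in [-1, 1]
std = toy.std(axis=0, ddof=1)
corr_manual = cov_manual / np.outer(std, std)

print(cov_manual)   # off-diagonal entry is 200.0, well outside [-1, 1]
print(corr_manual)  # off-diagonal entry is 0.8
```

Because `x` in this notebook is already standardized, its covariance matrix is almost the same as the correlation matrix of the original features. The small difference comes from `StandardScaler` dividing by n while `np.cov` divides by n - 1.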

In [None]:
# Calculating the covariance matrix
covariance_matrix = np.cov(x.T)
covariance_matrix

### Eigenvectors & Eigenvalues

- Eigenvectors: the principal components, determining the directions of the new feature space
- Eigenvalues: determining the magnitude (variance) along those directions
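
Both definitions follow from the relation C v = λ v, where C is the covariance matrix, v an eigenvector, and λ its eigenvalue. The sketch below checks that relation on a small symmetric matrix with illustrative values (`C` here is a stand-in, not the iris covariance matrix).

```python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix (illustrative values)
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(C)

# Reorder so the largest eigenvalue (the first principal component) comes first
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Check the defining relation C v = lambda v for each column of vecs
for lam, v in zip(vals, vecs.T):
    print(lam, np.allclose(C @ v, lam * v))
```

`np.linalg.eig` does not promise any particular ordering of the eigenvalues, so sorting explicitly before reading off "the first" component is a safe habit. Eigenvectors are also only defined up to sign, so a column may appear negated between libraries.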

In [None]:
eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
print("Eigenvectors: \n", eigen_vectors, "\n")
print("Eigenvalues: \n", eigen_values, "\n")

- **Eigenvectors are the principal components**. The **first principal component is the first column with values of 0.52, -0.26, 0.58, and 0.56**. The second principal component is the second column and so on. 
- **Each eigenvector corresponds to an eigenvalue** that scales it; the eigenvalue's magnitude indicates how much of the data’s variability is explained by its eigenvector.

Just from this, we can calculate the percentage of explained variance per principal component:

In [None]:
# Calculating the variance explained by each component (in percent)
variance_explained = []
for i in eigen_values:
    variance_explained.append((i / sum(eigen_values)) * 100)

print(variance_explained)

In [None]:
# Cumulative variance explained; look for where it first reaches at least 95%
cumulative_variance_explained = np.cumsum(variance_explained)
print(cumulative_variance_explained)

The first two principal components account for around 96% of the variance in the data.

In [None]:
# Visualizing the cumulative explained variance and finding the "elbow" in the graph
sns.lineplot(x = [1,2,3,4], y=cumulative_variance_explained)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("Explained variance vs Number of components");
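
Once the number of components is chosen, the remaining step in a from-scratch PCA is to project the standardized data onto the leading eigenvectors. The sketch below uses synthetic data as a stand-in for the standardized matrix; `W`, `k`, and the other names are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a standardized data matrix: 100 samples, 4 features
data = rng.normal(size=(100, 4)) @ np.array([[1.0, 0.5, 0.2, 0.0],
                                            [0.0, 1.0, 0.4, 0.1],
                                            [0.0, 0.0, 1.0, 0.3],
                                            [0.0, 0.0, 0.0, 1.0]])
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Eigendecomposition of the covariance matrix, sorted by decreasing eigenvalue
vals, vecs = np.linalg.eigh(np.cov(z.T))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Projection matrix W: the top-k eigenvectors as columns (k = 2 here)
k = 2
W = vecs[:, :k]
projected = z @ W  # shape (100, 2)

# The sample variance of each projected column equals the corresponding eigenvalue
print(projected.shape)
print(np.allclose(projected.var(axis=0, ddof=1), vals[:k]))  # True
```

Applied to the iris matrix `x`, the same projection should agree with `PCA(n_components=2).fit_transform(x)` up to a possible sign flip in each column.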