# Principal Component Analysis

Principal component analysis (PCA) is a technique that transforms high-dimensional data into a lower-dimensional representation while retaining as much information as possible.

<table><tr>
    <td><img src="https://i.postimg.cc/TPZxdTNr/PCA1.png" width="417"/> </td>
    <td> <img src="https://i.postimg.cc/NFxvkms4/PCA2.png" width="500"/></td>
</tr></table>

### Curse of Dimensionality  

High dimensionality is a serious obstacle to efficiency for most algorithms, and adding features does not always improve accuracy. With too few features a model is likely to underfit; with too many, it is likely to overfit. This is known as the curse of dimensionality: as the number of dimensions, n, increases, the volume of the n-dimensional space grows exponentially, so a fixed number of samples covers it more and more sparsely.

<img src="https://i.postimg.cc/Yqbw46y2/curse.jpg" width="400"/>
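
One way to see this effect, offered as a minimal sketch rather than a formal argument, is to sample uniform random points and compare the nearest and farthest distances from a query point. The helper name `distance_contrast` and the sample sizes are illustrative choices, not part of the original notebook.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative gap between the farthest and nearest neighbour of a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dimensions: relative contrast = {c:.2f}")
```

As the dimension grows, the contrast shrinks: every point ends up at roughly the same distance from the query, which is one reason distance-based methods such as KNN tend to struggle with many features.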

There are two approaches to dimensionality reduction:

* Feature Selection
* Feature Extraction

<img src="https://i.postimg.cc/4xmjmr9w/vs.png" width="500"/>

### Feature Selection
In feature selection, a subset of the original features is selected.
<img src="https://i.postimg.cc/7LzcN0X2/feature-selection.png" width="300"/>

### Feature Extraction
In feature extraction, a set of new features is derived from the existing features through some mapping, which can be either linear or non-linear.
<img src="https://i.postimg.cc/4yWF354r/feature-extraction.png" width="300"/>
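
The difference between the two approaches can be sketched on the iris data. This is only an illustration under stated assumptions: `SelectKBest` with `f_classif` is one of several possible selection methods and is not used elsewhere in this notebook, and `k=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)   # indices of the retained columns

# Feature extraction: build 2 new columns as linear combinations of all 4.
X_extracted = PCA(n_components=2).fit_transform(X)

print("selected column indices:", kept)
print("selection keeps original values:", np.array_equal(X_selected, X[:, kept]))
print("extracted shape:", X_extracted.shape)
```

Both results have two columns, but the selected columns are still measured features, while the extracted ones are new axes with no direct physical meaning.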

## Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an exploratory technique for reducing a data set's dimensionality, often to 2D or 3D for visualization. It is used in exploratory data analysis and as a preprocessing step for predictive models. PCA is a `linear` transformation of the data set that defines a new coordinate system such that:

   * The greatest variance of any projection of the data set lies along the first axis (the first principal component).
   * The second-greatest variance lies along the second axis, and so on.

> 💥 In the eyes of PCA, variance is an objective and mathematical way to quantify the amount of information in our data.
**Variance is information.**

<img src="https://i.postimg.cc/X7DzSQz5/variance.jpg" width="400"/>
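
The new coordinate system described above can be sketched directly with NumPy. This is a minimal illustration on hypothetical toy data (the mixing matrix and sample count are made up), not the routine scikit-learn uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 correlated 2-D points.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

X_centered = X - X.mean(axis=0)            # PCA works on centered data
cov = np.cov(X_centered, rowvar=False)     # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # re-order so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_centered @ eigvecs              # coordinates of each point in the new axes
print("variance along each new axis:", scores.var(axis=0, ddof=1))
print("eigenvalues of the covariance:", eigvals)
```

The variance along each new axis equals the corresponding eigenvalue, and the axes are sorted so that the first carries the most variance.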

>**✨ Additional information:**  
    **Autoencoders** are neural networks that stack several non-linear transformations to compress the input into a low-dimensional latent space. The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important parts of the input.

# PCA for Data Visualization

### PCA and KNN on IRIS dataset

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import warnings 
warnings.filterwarnings('ignore')

In [None]:
iris = load_iris()

In [None]:
X = iris.data
y = iris.target

In [None]:
iris.feature_names

In [None]:
X_df = pd.DataFrame(X, columns=iris.feature_names)
y_df = pd.DataFrame(y, columns=["label"])
X_df.head()

In [None]:
px.scatter_matrix(X_df, color=y, title='Scatter Matrix of Features', height=800)

In [None]:
px.pie(y_df, names="label")

In [None]:
df = pd.concat([X_df, y_df], axis=1)
df.head()

In [None]:
px.imshow(df.corr())

>**❗ NOTE:** One of the main aims of plots like these, and of EDA in general, is to identify features that do little to explain the target. Here, the `sepal width (cm)` feature seems less relevant to the target class than the other features.
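
The observation in the note can be checked numerically with a rough sketch. It treats the integer class label as a numeric variable, which is a crude assumption (the labels are categories, not quantities), so the numbers are only indicative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Pearson correlation between each feature column and the integer class label.
corr_with_label = np.array(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
)

for name, r in zip(iris.feature_names, corr_with_label):
    print(f"{name:<20} r = {r:+.2f}")

weakest = iris.feature_names[int(np.argmin(np.abs(corr_with_label)))]
print("weakest linear relationship with the label:", weakest)
```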

In [None]:
X_df.describe()

### Train Test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

### Standardizing the features

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
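
What `fit` and `transform` do here can be sketched with plain NumPy. The data below is hypothetical (two features on very different scales); the point is that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales.
X_train = rng.normal(loc=[5.0, 200.0], scale=[0.5, 40.0], size=(120, 2))
X_test = rng.normal(loc=[5.0, 200.0], scale=[0.5, 40.0], size=(30, 2))

# "fit": learn the statistics from the training split only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses

# "transform": apply the same training statistics to both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("train std: ", X_train_scaled.std(axis=0).round(6))
print("test mean: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled test split is close to, but not exactly, zero mean, because it was scaled with the training statistics. That is intended: the test data must not influence preprocessing.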

### Apply PCA to transform iris dataset

In [None]:
pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

> The `explained_variance_ratio_` tells us how much of the total variance is explained by each principal component.

In [None]:
pca.explained_variance_ratio_
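
As a small sketch of where these ratios come from: each one is a component's variance divided by the total variance. The example fits on the whole standardized iris data rather than the training split used above, so its numbers may differ slightly from the notebook's.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca_full = PCA(n_components=4).fit(X)

# Each ratio is one component's variance divided by the total variance.
manual_ratio = pca_full.explained_variance_ / pca_full.explained_variance_.sum()

print("ratios:", pca_full.explained_variance_ratio_)
print("matches manual computation:",
      np.allclose(manual_ratio, pca_full.explained_variance_ratio_))
print("sum of ratios:", pca_full.explained_variance_ratio_.sum())
```

When all components are kept, the ratios sum to 1, since no variance has been discarded.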

In [None]:
px.bar(x= ["pca-1", "pca-2", "pca-3", "pca-4"] ,y = pca.explained_variance_ratio_)

> As we can see from the above plot:  
>* The first component covers 72.962% of the original data's information, a loss of ~27% if used alone.
>* The second component covers 22.850% of the original data's information, a loss of ~77% if used alone.
>* Together, the first and second principal components cover ~96%, a loss of ~4%.
>* The third and fourth components can be safely ignored because together they contribute only the remaining ~4% of the original data's information.

💥 **Since the first two principal components capture most of the variance, we will select them for dimensionality reduction.**
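
The "loss" can be made concrete as a reconstruction error: project to two components, map back to four features, and measure how much of the total squared deviation is not recovered. This is a sketch on the whole standardized iris data (an assumption made to keep the example self-contained), not on the training split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

pca_2d = PCA(n_components=2).fit(X)
X_reduced = pca_2d.transform(X)                    # 4 features -> 2 components
X_restored = pca_2d.inverse_transform(X_reduced)   # back to 4 features, lossy

# Fraction of the total squared deviation that the round trip fails to recover.
lost = np.sum((X - X_restored) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)
retained = pca_2d.explained_variance_ratio_.sum()

print(f"variance retained by 2 components: {retained:.3f}")
print(f"relative reconstruction error:     {lost:.3f}")
```

The two numbers add up to 1: whatever variance the kept components do not explain shows up as reconstruction error.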

### Plotting the 2 principal components with maximum variance

In [None]:
pca = PCA(n_components=2)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
pca.explained_variance_ratio_

In [None]:
px.bar(x= ["pca-1", "pca-2"] ,y = pca.explained_variance_ratio_)

In [None]:
X_train_pca_df = pd.DataFrame(X_train_pca,columns=['PCA-1','PCA-2'])
X_train_pca_df.head()

In [None]:
px.scatter(X_train_pca_df, x="PCA-1", y="PCA-2", color= y_train)

In [None]:
X_train_pca.shape, X_test_pca.shape

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv=5)
print("score before dimension reduction:",scores.mean())   

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train_pca, y_train, cv=5)
print("score after dimension reduction:",scores.mean())   

### Plotting the 3 principal components with maximum variance

In [None]:
pca3D = PCA(n_components=3)
pca3D.fit(X_train)
X_train_pca3D = pca3D.transform(X_train)
X_test_pca3D = pca3D.transform(X_test)

In [None]:
pca3D.explained_variance_ratio_

In [None]:
X_train_pca3D_df = pd.DataFrame(X_train_pca3D,columns=['PCA-1','PCA-2', 'PCA-3'])
X_train_pca3D_df.head()

In [None]:
px.scatter_3d(X_train_pca3D_df, x="PCA-1", y="PCA-2",z="PCA-3", color= y_train)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train, y_train, cv=5)
print("score before dimension reduction:",scores.mean())   

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train_pca, y_train, cv=5)
print("score after dimension reduction to 2D:",scores.mean())  

knn = KNeighborsClassifier()
scores = cross_val_score(knn, X_train_pca3D, y_train, cv=5)
print("score after dimension reduction to 3D:",scores.mean())  

# PCA to Speed-up Machine Learning Algorithms

In [15]:
import pandas as pd
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA
from time import time
import warnings 
warnings.filterwarnings('ignore')

In [3]:
train_df = pd.read_csv("./dataset/mnist_train.csv", dtype=np.uint8)
test_df = pd.read_csv("./dataset/mnist_test.csv", dtype=np.uint8)

In [4]:
train_df.head()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
train_df.describe()

Unnamed: 0,label,1x1,1x2,1x3,1x4,1x5,1x6,1x7,1x8,1x9,...,28x19,28x20,28x21,28x22,28x23,28x24,28x25,28x26,28x27,28x28
count,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,...,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0,60000.0
mean,4.453933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.200433,0.088867,0.045633,0.019283,0.015117,0.002,0.0,0.0,0.0,0.0
std,2.88927,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.042472,3.956189,2.839845,1.68677,1.678283,0.3466,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,254.0,254.0,253.0,253.0,254.0,62.0,0.0,0.0,0.0,0.0


In [6]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 785 entries, label to 28x28
dtypes: uint8(785)
memory usage: 44.9 MB


In [7]:
px.pie(train_df, names="label")

In [8]:
X_train = train_df.drop(['label'], axis=1)
y_train = train_df['label']
X_test = test_df.drop(['label'], axis=1)
y_test = test_df['label']

In [9]:
del train_df
del test_df

In [10]:
print("X_train_shape:",X_train.shape)
print("Y_train_shape:",y_train.shape)
print("X_test_shape:",X_test.shape)
print("Y_test_shape:",y_test.shape)

X_train_shape: (60000, 784)
Y_train_shape: (60000,)
X_test_shape: (10000, 784)
Y_test_shape: (10000,)


In [11]:
instance_index = 7890 
matrix_conv=X_train.iloc[instance_index].to_numpy().reshape(28,28)
px.imshow(matrix_conv)

## Standardizing the features

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Apply PCA to transform MNIST dataset

### Choosing the number of components

In [13]:
pca = PCA()
pca.fit(X_train)
dataframe = pd.DataFrame({'number of components':range(1, len(pca.explained_variance_ratio_)+1) , 'cumulative explained variance':np.cumsum(pca.explained_variance_ratio_)})
px.line(dataframe, x="number of components" ,y="cumulative explained variance")

In [24]:
pca = PCA(0.95)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

>**❗ NOTE:** The code passes 0.95 as the `n_components` parameter. This tells scikit-learn to choose the minimum number of principal components such that 95% of the variance is retained.
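
The rule in the note can be reproduced by hand from the cumulative explained variance. The sketch below uses the small iris data instead of MNIST purely to keep it quick to run; that substitution and the variable names are assumptions of the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Iris is used here only because it is small; the same steps apply to MNIST.
X = StandardScaler().fit_transform(load_iris().data)

# Manual route: cumulative explained variance of the full decomposition.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, 0.95) + 1)   # first k reaching 95%

# Shortcut: let scikit-learn pick the number of components.
pca_95 = PCA(n_components=0.95).fit(X)

print("cumulative explained variance:", cumulative.round(3))
print("components needed (manual):", k_manual)
print("components chosen by PCA(0.95):", pca_95.n_components_)
```

Reading the threshold off the cumulative curve and passing a float to `PCA` should give the same number of components.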

In [25]:
len(pca.explained_variance_ratio_)

331

### Apply SVM to the Transformed Data

In [23]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
t0 = time()
svc = SVC(kernel='rbf')
svc.fit(X_train_pca, y_train)
print(f"Time (Seconds): {time() - t0}")
y_pred = svc.predict(X_test_pca)

print(f"Accuracy score : {accuracy_score(y_test, y_pred)}")

KeyboardInterrupt: ignored

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
t0 = time()
svc = SVC(kernel='rbf')
svc.fit(X_train, y_train)
print(f"Time (Seconds): {time() - t0}")
y_pred = svc.predict(X_test)

print(f"Accuracy score : {accuracy_score(y_test, y_pred)}")