## Principal component analysis (PCA)

#### What is PCA
In machine learning, a model's accuracy generally improves as the amount of training data grows. High-dimensional data, however, suffers from the curse of dimensionality: as the number of features increases, the data becomes sparse in the feature space, so far more samples (and more computation) are needed to learn reliably. Principal component analysis (PCA) addresses this by projecting the data onto a smaller number of new axes, the principal components, chosen so that as much of the variance as possible is retained and as little information as possible is lost.

#### Procedure to compute PCA
PCA is computed in 5 steps: <br>

Step 1: Standardization of data <br> 
Step 2: Compute the covariance matrix <br>
Step 3: Calculate the eigenvectors and eigenvalues <br>
Step 4: Compute the principal components <br>
Step 5: Reduce the dimensions of the dataset <br> 
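
As a rough summary, the five steps can be written as follows. This assumes $X$ is an $n \times p$ data matrix with one row per sample and that the covariance divides by $n$; the symbols $Z$, $C$, $W_k$ and $Y$ are notation introduced here for illustration only.

$$
\begin{aligned}
&\text{Step 1:} & z_{ij} &= \frac{x_{ij} - \mu_j}{\sigma_j} \\
&\text{Step 2:} & C &= \frac{1}{n} Z^{\top} Z \\
&\text{Step 3:} & C v_i &= \lambda_i v_i, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \\
&\text{Step 4:} & W_k &= [\, v_1 \; v_2 \; \cdots \; v_k \,] \\
&\text{Step 5:} & Y &= Z W_k
\end{aligned}
$$

Here $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$, and $Y$ is the reduced $n \times k$ dataset.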

##### Standardization:
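Standardization rescales every feature to a comparable range (mean 0, standard deviation 1), so that features with large numeric values do not dominate the covariance matrix. <br>

It can be done using the formula $z = \frac{x - \mu}{\sigma}$, where $x$ is the value of the variable, $\mu$ is its mean and $\sigma$ is its standard deviation.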
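
A minimal sketch of the formula on a made-up feature, assuming the population standard deviation (`ddof=0`). The array `values` is hypothetical toy data, not part of the iris dataset.

```python
import numpy as np

# Hypothetical toy feature: five measurements in arbitrary units.
values = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

# z = (value - mean) / standard deviation, using the population
# standard deviation (ddof=0).
z_scores = (values - values.mean()) / values.std(ddof=0)

# After standardization the feature has mean ~0 and standard deviation ~1.
print(z_scores)
print(z_scores.mean(), z_scores.std(ddof=0))
```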

### Code for PCA

In [1]:
pip install seaborn

Note: you may need to restart the kernel to use updated packages.


In [2]:
import math
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb

In [3]:
#importing data (dataset = iris dataset)

data = pd.read_csv("iris.csv")

print(data)
data.describe()

     sepal-length  sepal-width  petal-length  petal-width         species
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [5]:
# drop the label column; PCA uses only the numeric features
data = data.drop(columns="species")
data



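| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |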


In [6]:
# standardizing the data
data = (data - data.mean())/data.std(ddof=0)

In [7]:
data

| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| 0 | -0.900681 | 1.032057 | -1.341272 | -1.312977 |
| 1 | -1.143017 | -0.124958 | -1.341272 | -1.312977 |
| 2 | -1.385353 | 0.337848 | -1.398138 | -1.312977 |
| 3 | -1.506521 | 0.106445 | -1.284407 | -1.312977 |
| 4 | -1.021849 | 1.263460 | -1.341272 | -1.312977 |
| ... | ... | ... | ... | ... |
| 145 | 1.038005 | -0.124958 | 0.819624 | 1.447956 |
| 146 | 0.553333 | -1.281972 | 0.705893 | 0.922064 |
| 147 | 0.795669 | -0.124958 | 0.819624 | 1.053537 |
| 148 | 0.432165 | 0.800654 | 0.933356 | 1.447956 |


In [10]:
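# calculating the covariance matrix; because the data is standardized
# (with ddof=0), it equals the correlation matrix of the features
correlationMatrix = data.corr()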
correlationMatrix.shape

(4, 4)
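
The step is described as a covariance matrix while the code calls `corr()`. The small check below, run on made-up random data rather than the iris file, suggests the two agree for standardized data when the covariance divides by n (`ddof=0`). The names `toy`, `standardized`, `covariance` and `correlation` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data: 200 samples, 3 features with different scales and offsets.
toy = pd.DataFrame(
    rng.normal(size=(200, 3)) * [1.0, 5.0, 0.2] + [0.0, 10.0, -3.0],
    columns=["a", "b", "c"],
)
toy["b"] += 2.0 * toy["a"]  # make two features correlated

# Standardize with the population standard deviation (ddof=0).
standardized = (toy - toy.mean()) / toy.std(ddof=0)

# Population covariance of the standardized data vs. correlation of the raw data.
covariance = np.cov(standardized.to_numpy(), rowvar=False, ddof=0)
correlation = toy.corr().to_numpy()

print(np.allclose(covariance, correlation))
```

With the default sample covariance (`ddof=1`) the result would differ from the correlation matrix by a factor of n / (n - 1), which rescales the eigenvalues but leaves the eigenvectors unchanged.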

In [11]:
# computing eigenvalues and eigenvectors (each column of eigVectors is one eigenvector)
eigValues, eigVectors = np.linalg.eig(correlationMatrix)
eigValues, eigVectors

(array([2.91081808, 0.92122093, 0.14735328, 0.02060771]),
 array([[ 0.52237162, -0.37231836, -0.72101681,  0.26199559],
        [-0.26335492, -0.92555649,  0.24203288, -0.12413481],
        [ 0.58125401, -0.02109478,  0.14089226, -0.80115427],
        [ 0.56561105, -0.06541577,  0.6338014 ,  0.52354627]]))
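
One way to read these eigenvalues is as the share of total variance each component explains. The sketch below hard-codes the eigenvalues printed above; the names `explained_ratio` and `cumulative_ratio` are introduced here for illustration.

```python
import numpy as np

# Eigenvalues copied from the output above.
eig_values = np.array([2.91081808, 0.92122093, 0.14735328, 0.02060771])

# np.linalg.eig does not guarantee any ordering, so sort in descending order.
order = np.argsort(eig_values)[::-1]
sorted_values = eig_values[order]

# Fraction of the total variance captured by each component,
# and the running total as components are added.
explained_ratio = sorted_values / sorted_values.sum()
cumulative_ratio = np.cumsum(explained_ratio)

print(np.round(explained_ratio, 4))
print(np.round(cumulative_ratio, 4))
```

The eigenvalues sum to 4, the number of standardized features, and the first two together cover roughly 96% of the variance.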

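The eigenvalues measure how much of the total variance each principal component (eigenvector) captures. Here the first two eigenvalues account for about 95.8% of the total ((2.911 + 0.921) / 4), so we can keep the two corresponding eigenvectors and project the standardized data onto them, reducing the dataset from four dimensions to two while retaining most of its information.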
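
Steps 4 and 5 are not coded in this notebook, so here is a minimal sketch of how they might look. It runs on made-up random data so that it is self-contained, and it uses `np.linalg.eigh` (intended for symmetric matrices) instead of the `np.linalg.eig` call used earlier. The function `project_onto_components` and its parameters are hypothetical names, not part of any library.

```python
import numpy as np

def project_onto_components(standardized, n_components):
    """Project standardized data (rows = samples) onto its top principal components."""
    # Covariance of the standardized features, dividing by n (ddof=0).
    covariance = standardized.T @ standardized / standardized.shape[0]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eig_values, eig_vectors = np.linalg.eigh(covariance)
    # Step 4: keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eig_values)[::-1][:n_components]
    components = eig_vectors[:, order]
    # Step 5: project the data onto the selected components.
    return standardized @ components, eig_values[order]

rng = np.random.default_rng(42)
raw = rng.normal(size=(150, 4))  # made-up data with the same shape as the iris features
raw[:, 2] = 2.0 * raw[:, 0] + rng.normal(scale=0.1, size=150)  # add a correlated column
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

reduced, kept_values = project_onto_components(standardized, n_components=2)
print(reduced.shape)
```

On the notebook's own data, the same function could be called as `project_onto_components(data.to_numpy(), 2)` after the standardization cell, assuming `data` still holds the four standardized columns.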