This notebook implements **Principal Component Analysis (PCA)** as explained in the book "**Python Machine Learning**" by **Sebastian Raschka** and **Vahid Mirjalili**.

**Prerequisites**:

* Python
* pandas
* numpy
* matplotlib
* scikit-learn

**Dataset**: Wine

**Note:** Descriptive comments in the code cells explain each step.

Import necessary packages:

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from matplotlib.colors import ListedColormap

Load the Wine dataset into a pandas DataFrame:

In [None]:
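# NOTE: if Wine.csv has no header row, pass header=None here;
# otherwise pandas uses the first sample as column names and that row is lost
df_wine = pd.read_csv('../input/Wine.csv')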
df_wine.head()

Add column headers to the data:

In [None]:
df_wine.columns = [
    'name',
    'alcohol',
    'malicAcid',
    'ash',
    'ashalcalinity',
    'magnesium',
    'totalPhenols',
    'flavanoids',
    'nonFlavanoidPhenols',
    'proanthocyanins',
    'colorIntensity',
    'hue',
    'od280_od315',
    'proline',
]
df_wine.head()
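Whether `Wine.csv` ships with a header row is not stated here. If it does not, the column names can be supplied while loading so that no row is lost. The following is a minimal sketch of that option on a toy in-memory CSV; the three-column layout and the values are made up for illustration and are not the Wine data.

```python
import io
import pandas as pd

# Toy stand-in for a headerless CSV file; the values are made up for illustration.
raw = io.StringIO("1,14.2,1.7\n2,12.4,0.9\n3,13.1,2.6\n")

# header=None keeps the first row as data; names= labels the columns on load.
df_toy = pd.read_csv(raw, header=None, names=['name', 'alcohol', 'malicAcid'])

print(df_toy.shape)          # all three rows are kept as data
print(list(df_toy.columns))
```

With the default `header='infer'` and no `names`, the same file would load with only two data rows, because the first line would be treated as the header.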

Step 1: Split the data into training and test sets in a 70:30 ratio, then standardize the features. PCA is sensitive to feature scale, so standardization is required for each feature to carry equal importance.

In [None]:
# make train-test sets
from sklearn.model_selection import train_test_split
# column 0 holds the class label, the remaining columns are the features
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
# print(np.unique(y))
# stratify on y so that train and test sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# standardize the features: fit the scaler on the training set only,
# then apply the same fitted scaler to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
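To make the standardization step concrete, here is a minimal NumPy sketch of the same computation on synthetic data (the arrays and their `_toy` names are illustrative assumptions, not the Wine data). It subtracts the training mean and divides by the training standard deviation, using the population standard deviation (`ddof=0`) as `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales (illustrative, not the Wine data).
X_train_toy = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(70, 2))
X_test_toy = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(30, 2))

# The statistics come from the training set only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)   # population std (ddof=0)

X_train_toy_std = (X_train_toy - mu) / sigma
X_test_toy_std = (X_test_toy - mu) / sigma   # reuse mu and sigma, do not refit

print(X_train_toy_std.mean(axis=0))  # approximately 0 for each feature
print(X_train_toy_std.std(axis=0))   # approximately 1 for each feature
```

The test set is scaled with the training statistics, so its mean and standard deviation will generally be close to, but not exactly, 0 and 1.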


Step 2: Compute the covariance matrix of the standardized training data and extract its eigenvectors and eigenvalues

In [None]:
# np.cov expects one variable per row, hence the transpose
cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print('\nEigenvalues \n%s' % eigen_vals)
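A covariance matrix is symmetric, so `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it always returns real eigenvalues, in ascending order. The sketch below uses synthetic data (the `_toy` names are assumptions for illustration) and checks two properties that the decomposition should satisfy.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))                         # synthetic data
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)  # standardize

cov_toy = np.cov(X_toy.T)                                 # 4 x 4, symmetric

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
vals, vecs = np.linalg.eigh(cov_toy)

# Reverse to descending order, keeping each eigenvector paired with its eigenvalue.
vals, vecs = vals[::-1], vecs[:, ::-1]

# Each column v of vecs satisfies cov_toy @ v = lambda * v.
eig_ok = np.allclose(cov_toy @ vecs, vecs * vals)
# The eigenvalues sum to the total variance (the trace of the covariance matrix).
trace_ok = np.isclose(vals.sum(), np.trace(cov_toy))
print(eig_ok, trace_ok)
```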


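Each eigenvector of the covariance matrix defines a principal component (a direction in feature space), and the corresponding eigenvalue gives the variance of the data along that direction. We select the top *k* eigenvectors, ranked by their eigenvalues. Let's plot the explained variance ratio of each eigenvalue, i.e. the eigenvalue divided by the sum of all eigenvalues, for the above dataset: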

In [None]:
tot = sum(eigen_vals)
# explained variance ratio of each eigenvalue, sorted from high to low
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
plt.bar(range(1,14), var_exp, alpha=0.5, align='center',label='individual explained variance')
plt.step(range(1,14), cum_var_exp, where='mid',label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.show()

The above plot shows that the first principal component alone explains almost 40% of the variance in the data, and the first two combined explain about 60%.
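The cumulative explained variance can also be used to choose *k* numerically instead of reading it off the plot. The following is a small sketch with hypothetical eigenvalues and an assumed target of 80% explained variance; neither comes from the Wine data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted from high to low.
eigen_vals_toy = np.array([4.0, 3.0, 2.0, 0.5, 0.5])

var_exp_toy = eigen_vals_toy / eigen_vals_toy.sum()
cum_var_exp_toy = np.cumsum(var_exp_toy)

threshold = 0.8   # assumed target share of explained variance
# argmax returns the first index where the condition holds; +1 turns it into a count.
k = int(np.argmax(cum_var_exp_toy >= threshold)) + 1
print(k)
```

Here the first two components explain 70% and the first three explain 90%, so three components are needed to reach the 80% target.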

Step 3: Select the top *k* eigenvectors and compose the projection matrix *W* using these vectors

In [None]:
# Make a list of (eigenvalue, eigenvector) tuples
eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i]) for i in range(len(eigen_vals))]
# Sort the (eigenvalue, eigenvector) tuples from high to low
eigen_pairs.sort(key=lambda k: k[0], reverse=True)
# Choosing k = 2 so the result can be shown in a 2-dimensional scatter plot
w = np.hstack((eigen_pairs[0][1][:, np.newaxis], eigen_pairs[1][1][:, np.newaxis]))
print('Matrix W:\n', w)

Step 4: Transform the original data (13-dimensional) into the new subspace (2-dimensional) using the projection matrix consisting of the two principal components

In [None]:
X_train_pca = X_train_std.dot(w)
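One way to sanity-check a projection like this is to look at the covariance of the projected data: it should be diagonal, with the selected eigenvalues on the diagonal, meaning the new features are uncorrelated. The sketch below verifies this on synthetic correlated data (all `_toy` names are illustrative assumptions) and uses `np.linalg.eigh` rather than `np.linalg.eig`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data: mix independent columns with a random matrix.
A = rng.normal(size=(5, 5))
X_toy = rng.normal(size=(200, 5)) @ A
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

vals, vecs = np.linalg.eigh(np.cov(X_toy.T))
order = np.argsort(vals)[::-1]            # indices from largest to smallest eigenvalue
vals, vecs = vals[order], vecs[:, order]

w_toy = vecs[:, :2]                       # 5 x 2 projection matrix
X_toy_pca = X_toy @ w_toy                 # 200 x 2 projected data

# Covariance of the projected data: diagonal, with the top-2 eigenvalues on it.
cov_pca = np.cov(X_toy_pca.T)
is_diagonal = np.allclose(cov_pca, np.diag(vals[:2]))
print(is_diagonal)
```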

Let's visualize the new 2-dimensional dataset via a scatter plot:

In [None]:
colors = ['r', 'b', 'g']
markers = ['s', 'x', 'o']
for l, c, m in zip(np.unique(y_train), colors, markers):
    plt.scatter(X_train_pca[y_train == l, 0], X_train_pca[y_train == l, 1], c=c, label=l, marker=m)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend(loc='lower left')
plt.show()

The above plot shows that the data is spread more along PC 1 than along PC 2, consistent with the explained variance ratio plot. A linear classifier should be able to separate the classes well in the new feature space.
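The claim about a linear classifier can be checked with a short experiment. The sketch below is not part of the notebook's pipeline and rests on several assumptions: it loads scikit-learn's bundled copy of the Wine data via `load_wine` instead of `Wine.csv`, it uses `sklearn.decomposition.PCA` in place of the manual eigendecomposition, and it picks logistic regression as one example of a linear classifier.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wine data (assumed stand-in for Wine.csv).
X_w, y_w = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_w, y_w, test_size=0.3, stratify=y_w, random_state=0)

# Fit the scaler and PCA on the training set only, then apply both to the test set.
sc_w = StandardScaler()
X_tr_std = sc_w.fit_transform(X_tr)
X_te_std = sc_w.transform(X_te)

pca = PCA(n_components=2)
X_tr_pca = pca.fit_transform(X_tr_std)
X_te_pca = pca.transform(X_te_std)

# Logistic regression as an example of a linear classifier.
clf = LogisticRegression()
clf.fit(X_tr_pca, y_tr)
acc = clf.score(X_te_pca, y_te)
print('Test accuracy: %.3f' % acc)
```

The test set goes through the same fitted scaler and the same fitted projection as the training set, which mirrors how `sc` and `w` would be reused on `X_test` in the steps above.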