Implementing PCA in Python with Scikit-Learn
------------------------------------------------------------------

With the availability of high-performance CPUs and GPUs, many regression, classification, clustering, and related problems can be tackled with machine learning and deep learning models. However, several factors still cause performance bottlenecks when developing such models.

A large number of features in the dataset is one factor that affects both the training time and the accuracy of machine learning models. There are several options for dealing with a large number of features in a dataset.

1. Train the models on the original set of features, which can take <font color='red'>days or weeks</font> if the number of features is very high.

2. Reduce the number of variables by merging correlated variables.

3. <font color='green'>Extract a small number of new features that capture most of the variance in the dataset.</font> Several statistical techniques are used for this purpose, e.g. linear discriminant analysis, factor analysis, and <b>principal component analysis (PCA)</b>.

About Principal Component Analysis
-----------------------------------------------------
Principal component analysis, or <b>PCA, is a statistical technique that converts high-dimensional data to low-dimensional data by constructing new features, called principal components, that capture as much of the information (variance) in the dataset as possible</b>. Each principal component is a linear combination of the original features. The components are ordered by the amount of variance in the data that they explain: the direction along which the data varies the most is the first principal component.

The direction that explains the second-highest variance is the second principal component, and so on. It is important to mention that principal components are uncorrelated with each other.
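
The ordering by variance and the lack of correlation between components can be checked numerically. The following is a minimal sketch using only NumPy on synthetic data (the data and variable names are assumptions made for illustration); it computes the components from the eigendecomposition of the covariance matrix, which is one common way to describe PCA rather than necessarily what scikit-learn does internally.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# two of which are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Centre the data, then eigendecompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort them in descending order.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the eigenvectors to get the principal components.
scores = X_centered @ eigenvectors

# The covariance matrix of the projected data is (numerically) diagonal,
# i.e. the principal components are uncorrelated.
scores_cov = np.cov(scores, rowvar=False)
off_diagonal = scores_cov - np.diag(np.diag(scores_cov))

print(np.round(eigenvalues / eigenvalues.sum(), 3))  # variance share per component
print(np.abs(off_diagonal).max())                    # close to zero
```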

Advantages of PCA
----------------------------
There are two main advantages of dimensionality reduction with PCA.

1. The training time of the algorithms drops significantly with fewer features.

2. It is not always possible to analyze data in high dimensions. For instance, if a dataset has 100 features, the total number of pairwise scatter plots required to visualize the data would be (100(100-1))/2 = 4950. In practice, it is not possible to analyze data this way.
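
The scatter-plot count is simply the number of distinct feature pairs. A small sketch of the arithmetic (the helper name `pairwise_plot_count` is made up for this example):

```python
from math import comb


def pairwise_plot_count(n_features: int) -> int:
    """Number of distinct feature pairs, i.e. n * (n - 1) / 2."""
    return comb(n_features, 2)


print(pairwise_plot_count(100))  # 4950
print(pairwise_plot_count(4))    # 6
```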

Normalization of Features
-------------------------------------

It is imperative to mention that a feature set must be normalized before applying PCA. For instance, if a feature set has data expressed in units such as kilograms, light years, or millions, the variances of the features differ by orders of magnitude. If PCA is applied to such a feature set, the loadings for features with high variance will also be large. Hence, the principal components will be biased towards features with high variance, leading to misleading results.
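
The effect of scale can be illustrated with two synthetic features. This is a rough sketch, assuming made-up features (`weight_kg` and `income`) and a hypothetical helper `first_component` that returns the direction of maximum variance:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical features on very different scales (assumed for illustration).
weight_kg = rng.normal(70, 15, size=n)        # values in the tens
income = rng.normal(50_000, 20_000, size=n)   # values in the tens of thousands
X = np.column_stack([weight_kg, income])


def first_component(data):
    """Return the unit-length direction of maximum variance."""
    centered = data - data.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]


raw_pc1 = first_component(X)
scaled_pc1 = first_component((X - X.mean(axis=0)) / X.std(axis=0))

print(np.abs(raw_pc1))     # dominated by the income axis
print(np.abs(scaled_pc1))  # both features carry equal weight
```

Without scaling, the first component is almost entirely the large-scale feature; after standardization both features contribute.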

Finally, the last point to remember before we start coding is that PCA is a statistical technique and can only be applied to **numeric data**. Therefore, categorical features must be converted into numerical features before PCA can be applied.
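
One common way to convert a categorical column is one-hot encoding. A minimal sketch with pandas, using a toy frame whose column names are assumptions for illustration (other encodings may suit a given dataset better):

```python
import pandas as pd

# Hypothetical toy frame; the column names are assumptions for illustration.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2, 177.8],
    "colour": ["red", "green", "red", "blue"],
})

# One-hot encode the categorical column so that every column is numeric.
encoded = pd.get_dummies(df, columns=["colour"], dtype=float)

print(encoded.columns.tolist())
print(encoded.dtypes.unique())
```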

Important Note
---------------------
>PCA is a feature extraction technique. <font color='red'>How?</font>
Say we have 10 independent variables. In feature extraction we create 10 "new" independent variables. However, each of these "new" independent variables is created (by PCA) from a combination of all 10 "old" independent variables. Now dimensionality reduction comes into play: we keep the most important of the "new" independent variables and drop the least important ones.

> This technique is <font color='green'>often better than simple "feature selection"</font> because each of the "new" independent variables is calculated from all the "old" independent variables. Hence we still keep the most valuable parts of our old variables, even when we drop one or more of these "new" variables.
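
That each "new" variable is a weighted combination of all the "old" ones can be seen in the fitted `components_` attribute. A minimal sketch, assuming scikit-learn's bundled `load_iris` copy of the Iris data (used here so the example needs no download):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ holds the weights of one "new" variable
# over all four "old" variables.
print(pca.components_.shape)  # (2, 4)
print(np.round(pca.components_, 3))

# transform() amounts to centring the data and taking a dot product
# with those weights (whitening is off by default).
manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X_scaled)))
```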



Implementing PCA with Scikit-Learn
---------------------------------------------------

In [15]:
# Importing Libraries
import seaborn as sns
import numpy as np
import pandas as pd  
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with. We are loading the popular Iris Data set
irisdata = sns.load_dataset('iris')
irisdata.head()  # have a look at the attributes (=> X) and labels (=> y)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [34]:
# Preprocessing data
X = irisdata.drop('species', axis=1)  
y = irisdata['species']

# Train Test Split
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)


#  PCA performs best with a standardized feature set.
#  We use StandardScaler to rescale each feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)

# Performing PCA using Scikit-Learn is a two-step process:
# 1. Initialize the PCA class by passing the number of components to the constructor.
# 2. Call fit and then transform (or fit_transform) on the training set,
#    and transform only on the test set.
#    The transform method returns the specified number of principal components.

# Applying PCA
from sklearn.decomposition import PCA
pca = PCA()  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

In the code above, we create a PCA object named pca. <font color='green'><b>We did not specify the number of components in the constructor. Hence, all <u>four</u> principal components will be returned for both the training and test sets.</b></font>

The PCA class contains `explained_variance_ratio_`, which returns the fraction of the total variance explained by each of the principal components.

In [35]:
explained_variance = pca.explained_variance_ratio_ 
explained_variance

array([0.74223545, 0.21601729, 0.03647121, 0.00527604])

It can be seen that the first principal component accounts for about 74% of the variance. Similarly, the second principal component accounts for about 22% of the variance in the dataset. Collectively we can say that 74 + 22 ≈ **96%** of the variance contained in the feature set is **captured by the first two principal components**. (The split in this cell is random, so you may see slightly different values.)
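
The cumulative share can be computed directly from the ratios. A short sketch using the values printed above; the 95% threshold is an assumed cut-off chosen only for illustration:

```python
import numpy as np

# Ratios copied from the output above.
explained_variance = np.array([0.74223545, 0.21601729, 0.03647121, 0.00527604])

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(explained_variance)
print(np.round(cumulative, 4))

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # assumed cut-off for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)
```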

In [36]:
# Let's first try to use 1 principal component to train our algorithm. 
# To do so, execute the following code:

from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20, 
                                                    random_state=100)

#  PCA performs best with a standardized feature set.
#  We use StandardScaler to rescale each feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)

from sklearn.decomposition import PCA

pca = PCA(n_components=1)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

print(X_train[:5])

[[ 0.01197422]
 [ 0.29176303]
 [ 1.27693979]
 [-2.25154785]
 [-2.15421304]]


In [19]:
# Training and Making Predictions
# In this case we'll use random forest classification 
# for making the predictions.
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0, 
                                    n_estimators=10)  
classifier.fit(X_train, y_train)
# Please note: n_estimators is set explicitly here. Its default was 10 in
# older scikit-learn releases (with a FutureWarning in 0.20 and 0.21) and
# has been 100 since version 0.22.


# Predicting the Test set results
y_pred = classifier.predict(X_test)  


# Performance Evaluation
from sklearn.metrics import confusion_matrix  
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)  
print(cm)  
print('Accuracy ' ,  accuracy_score(y_test, y_pred))

[[11  0  0]
 [ 0  5  1]
 [ 0  1 12]]
Accuracy  0.9333333333333333


It can be seen from the output that with only one principal component, the random forest algorithm correctly predicts 28 out of 30 instances, resulting in 93.33% accuracy. <i><font color='blue'>(This accuracy value will change as the train-test split changes, so you may get a different accuracy score.)</font></i>
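
Because a single 30-sample test set gives a noisy estimate, cross-validation is one way to get a steadier figure. The following is a sketch only, assuming scikit-learn's bundled `load_iris` data and a `Pipeline` so that scaling and PCA are refit inside each fold; the fold count and seeds are arbitrary choices, not part of the original experiment.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and PCA are fitted on the training part of each fold only.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=1),
    RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)
print(scores.mean(), scores.std())
```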

In [37]:
# Try PCA with 2 Principal Components
# pca = PCA(n_components=2)
#------------------------------------------
# all the above steps would have to be repeated.


from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20, 
                                                    random_state=100)

#  PCA performs best with a standardized feature set.
#  We use StandardScaler to rescale each feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

print(X_train[:5])



[[ 0.01197422 -1.532126  ]
 [ 0.29176303 -0.56916401]
 [ 1.27693979 -1.69685591]
 [-2.25154785 -0.62953216]
 [-2.15421304  1.58773001]]


In [38]:
# Train the classifier on the 2 principal components from the previous cell
#------------------------------------------
# the training and evaluation steps are the same as before.



from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0, 
                                    n_estimators=10)  
classifier.fit(X_train, y_train)
# Please note: n_estimators is set explicitly here. Its default was 10 in
# older scikit-learn releases (with a FutureWarning in 0.20 and 0.21) and
# has been 100 since version 0.22.


# Predicting the Test set results
y_pred = classifier.predict(X_test)  


# Performance Evaluation
from sklearn.metrics import confusion_matrix  
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)  
print(cm)  
print('Accuracy ' ,  accuracy_score(y_test, y_pred))



[[11  0  0]
 [ 0  4  2]
 [ 0  5  8]]
Accuracy  0.7666666666666667


In [39]:
# Try PCA with 3 Principal Components
# pca = PCA(n_components=3)
#------------------------------------------
# all the above steps would have to be repeated.


from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20, 
                                                    random_state=100)

#  PCA performs best with a standardized feature set.
#  We use StandardScaler to rescale each feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)

from sklearn.decomposition import PCA

pca = PCA(n_components=3)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

print(X_train[:5])


[[ 0.01197422 -1.532126   -0.29982543]
 [ 0.29176303 -0.56916401  0.05135883]
 [ 1.27693979 -1.69685591 -0.34542979]
 [-2.25154785 -0.62953216 -0.24698463]
 [-2.15421304  1.58773001  0.01212647]]


In [40]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0, 
                                    n_estimators=10)  
classifier.fit(X_train, y_train)
# Please note: n_estimators is set explicitly here. Its default was 10 in
# older scikit-learn releases (with a FutureWarning in 0.20 and 0.21) and
# has been 100 since version 0.22.


# Predicting the Test set results
y_pred = classifier.predict(X_test)  


# Performance Evaluation
from sklearn.metrics import confusion_matrix  
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)  
print(cm)  
print('Accuracy ' ,  accuracy_score(y_test, y_pred))

[[11  0  0]
 [ 0  6  0]
 [ 0  9  4]]
Accuracy  0.7


<font color='red'>Complete your Analysis below : </font>

1. With two principal components the classification accuracy was 76.67%, compared to 93.33% for one component.

2. With three principal components the classification accuracy decreases further to 70%.

From the above experiments, we achieved the best accuracy while significantly reducing the number of features in the dataset: a single principal component gave 93.33% accuracy on this test split. It is also pertinent to mention that the accuracy of a classifier doesn't necessarily improve with an increased number of principal components. From the results we can see that the accuracy achieved with one principal component (93.33%) was greater than that achieved with two (76.67%) or three (70%). These figures come from a single 30-sample test set and a small random forest (10 trees of depth 2), so the differences should be read with caution.
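
The repeated cells above can be condensed into a loop over the number of components. A minimal sketch, assuming scikit-learn's bundled `load_iris` in place of the seaborn copy and a hypothetical helper `evaluate`; the accuracies it prints may not match the outputs above exactly. Keeping all four components serves as a baseline that retains the full variance of the feature set.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=100
)


def evaluate(n_components):
    """Fit scaler -> PCA -> random forest and return the test accuracy."""
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components),
        RandomForestClassifier(max_depth=2, n_estimators=10, random_state=0),
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# n_components=4 keeps every component, i.e. a rotated full feature set.
results = {k: evaluate(k) for k in (1, 2, 3, 4)}
for k, acc in results.items():
    print(f"{k} component(s): accuracy {acc:.3f}")
```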

The number of principal components to retain in a feature set depends on several conditions such as storage capacity, training time, performance, etc. In some datasets all the features contribute equally to the overall variance; in that case all the principal components are crucial to the predictions and none can be ignored.

<font color='green'><b>
A general rule of thumb is to keep the principal components that account for a significant share of the variance and ignore those that contribute little.</b></font>
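
scikit-learn can apply this rule of thumb directly: passing a float between 0 and 1 as `n_components` keeps just enough components to reach that share of the variance. A minimal sketch, assuming the bundled `load_iris` data and an arbitrary 95% target:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# A float in (0, 1) asks PCA to keep enough components to explain at least
# that fraction of the variance; this needs the full SVD solver.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(pca.n_components_)
print(np.round(pca.explained_variance_ratio_.cumsum(), 4))
print(X_reduced.shape)
```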