## Dimentionality Reduction with PCA

### Introduction

Principal Component Analysis (PCA) is a tool to reduce dimensionality in the dataset. Whent the dataset has more number of columns and we do not know, which features have strong influence on the result of the model, we need to perform a feature selection techniques to identify the importance of the features. 

PCA is neither a `classifier` nor `estimator`, in `scikit-learn` both will have `predict` method, whereas PCA does the data transformation. The transformed data will be classified using classifier. 

In [2]:
# Import statements for the model implementation. 
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

## Loading a dataset from the datasets package
iris_ds = load_iris()

y = iris_ds.target
X = iris_ds.data

print("Data X : ", X)

Data X :  [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 

In [3]:
# Creating PCA object
pca = PCA()

# Creating Tree classifier
classifier = DecisionTreeClassifier()

In [4]:
# Train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)

In [5]:
# fit and transform
X_transformed = pca.fit_transform(X_train)

# Fitting the transformed data with classifier
classifier.fit(X_transformed, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [7]:
# Transforming test data using PCA
X_test_transformed = pca.fit_transform(X_test)
# Predict the y values
pred_y = classifier.predict(X_test_transformed)

In [8]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, pred_y))
print(confusion_matrix(y_test, pred_y))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        18
          1       0.94      1.00      0.97        15
          2       1.00      0.94      0.97        17

avg / total       0.98      0.98      0.98        50

[[18  0  0]
 [ 0 15  0]
 [ 0  1 16]]
