# Feature Engineering using PCA

**Principal Component Analysis (PCA) is a statistical technique that converts high-dimensional data to low-dimensional data by constructing new features, called principal components, that capture as much of the information in the dataset as possible. Each principal component is a linear combination of the original features, and the components are ordered by the amount of variance in the data that they explain. The direction along which the data varies most is the first principal component. The direction with the second-highest variance, orthogonal to the first, is the second principal component, and so on. Principal components are uncorrelated with each other.**
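To make the idea concrete, here is a minimal NumPy sketch of PCA via eigen-decomposition of the covariance matrix. The synthetic data and all variable names are assumptions made for illustration. It is not what scikit-learn does internally, only a small instance of the same idea.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# the first two strongly correlated, the third independent
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data, then eigen-decompose its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so sort them in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each principal component score is a linear combination of the original features
scores = X_centered @ eigenvectors

# Fraction of the total variance explained by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# The component scores are uncorrelated: off-diagonal covariances are ~0
score_cov = np.cov(scores, rowvar=False)
print(explained_ratio)
print(np.round(score_cov, 6))
```

The diagonal of `score_cov` holds the eigenvalues, which is why the components come out ordered by explained variance.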

*Advantages of PCA
There are two main advantages of dimensionality reduction with PCA.
Training time typically drops when the algorithms work with fewer features.
It is not always possible to analyze data in high dimensions. For instance, if a dataset has 100 features, the number of pairwise scatter plots required to visualize the data would be 100(100-1)/2 = 4950. In practice, it is not possible to analyze data this way.*
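The scatter plot count is the number of unordered feature pairs, n(n-1)/2. A quick check with the standard library (the variable names are only illustrative):

```python
from math import comb

n_features = 100

# Number of distinct feature pairs, i.e. pairwise scatter plots
n_pairs = comb(n_features, 2)

# Same value from the closed form n(n-1)/2
assert n_pairs == n_features * (n_features - 1) // 2
print(n_pairs)  # 4950
```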

**Normalization of Features
A feature set should be normalized before applying PCA. For instance, if features are expressed in units as different as kilograms, light years, or millions, their variances differ by orders of magnitude. If PCA is applied to such a feature set, the loadings for the high-variance features will also be large. The principal components will then be biased towards those features, giving misleading results.**
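The effect of scale can be seen in a small sketch. The two synthetic features below are assumptions for illustration: they are independent and differ only in scale. On the raw data the first principal component points almost entirely along the large-scale feature. After standardization both features get loadings of roughly equal magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales (illustrative assumption)
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X = np.column_stack([small, large])


def first_component(data):
    """Return the unit-length direction of maximum variance of `data`."""
    centered = data - data.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]


# First principal component of the raw data
raw_pc1 = first_component(X)

# Standardize each feature to zero mean and unit variance, then repeat
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc1 = first_component(X_scaled)

print("PC1 loadings, raw data:   ", np.abs(raw_pc1))
print("PC1 loadings, scaled data:", np.abs(scaled_pc1))
```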

*Finally, the last point to remember before we start coding is that PCA is a statistical technique and can only be applied to numeric data. Therefore, categorical features are required to be converted into numerical features before PCA can be applied.*
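One common way to make categorical columns numeric is one-hot encoding. The sketch below uses `pd.get_dummies` on a hypothetical toy frame. The column names and values are made up for illustration, and other encodings may suit a given dataset better.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "weight_kg": [61.0, 72.5, 80.2, 55.3],
    "color": ["red", "green", "red", "blue"],
})

# One-hot encode the categorical column so that every column is numeric
encoded = pd.get_dummies(toy, columns=["color"], dtype=float)
print(encoded.columns.tolist())
print(encoded.dtypes)
```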

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns


In [None]:
df = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")
df.head()

**Step 1:**
The first preprocessing step is to divide the dataset into a feature set and corresponding labels.

In [None]:
X = df.drop(columns='species')  # keyword form; the positional axis argument is not accepted by recent pandas
y = df['species']
X,y

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#X_train,X_test,y_train,y_test

**As mentioned earlier, PCA performs best with a normalized feature set. We will standardize the features with scikit-learn's `StandardScaler`, fitting it on the training set only and applying the same transformation to the test set. To do this, execute the following code:**

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#X_train, X_test

# Applying PCA

*It is only a matter of three lines of code to perform PCA using Python's Scikit-Learn library. The PCA class is used for this purpose. PCA depends only upon the feature set and not the label data. Therefore, PCA can be considered as an unsupervised machine learning technique.*

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In [None]:
# X_train was already projected onto 2 components above; calling fit_transform again would refit PCA on the reduced data
X_train.shape

In [None]:
??pca.fit_transform

*The PCA class exposes the `explained_variance_ratio_` attribute, which holds the fraction of the total variance explained by each principal component.*

In [None]:
explained_variance = pca.explained_variance_ratio_

explained_variance

**It can be seen that the first principal component accounts for 72.22% of the variance and the second for 23.9%. Together, the first two principal components capture about 96% (72.22 + 23.9 = 96.12) of the variance in the feature set.**
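A cumulative sum of the ratios is a convenient way to decide how many components to keep. The sketch below is an illustration under stated assumptions. It uses scikit-learn's bundled copy of the iris data in place of `IRIS.csv` and fits on the whole dataset, so the numbers will differ slightly from the training-split figures above. The 95% threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see the full variance breakdown
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {ratio:.4f} of variance, {total:.4f} cumulative")

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95  # arbitrary choice for illustration
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("components needed:", n_needed)
```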

*Let's first try using 1 principal component to train our algorithm. To do so, execute the following code:*

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Training and Making Predictions

*In this case we'll use random forest classification for making the predictions.*

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Performance Evaluation

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy =',accuracy_score(y_test, y_pred))

*The output shows that with only one principal component, the random forest algorithm correctly predicts 28 out of 30 instances, an accuracy of 93.33%.*

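If PCA raises `ValueError: n_components=4 must be between 0 and min(n_samples, n_features)=2`, the array passed to PCA has only 2 columns (or rows), for example because it was already transformed by an earlier PCA step. See [Stackoverflow](https://stackoverflow.com/questions/56694980/valueerror-n-components-4-must-be-between-0-and-minn-samples-n-features-2-wi).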

# Results with 2 and 3 Principal Components

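Now let's evaluate the classification performance of the random forest algorithm with 2 principal components. Re-run the train/test split and scaling cells first, so that `X_train` and `X_test` again hold all four standardized features, then update this piece of code: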
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

The number of components for PCA has been set to 2. With 2 components, the classification accuracy is 0.8.

*With two principal components, the classification accuracy decreases to 80.00%, compared to 93.33% with 1 component.*

**With three principal components, the accuracy is 0.90.**

*With the full feature set, the accuracy was 90.00%.*
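One way to run this comparison without overwriting `X_train` between experiments is to keep the scaled arrays untouched and fit a fresh PCA for each component count. The following is a sketch, assuming scikit-learn's bundled iris data in place of `IRIS.csv`. The variable names are illustrative, and the accuracies it prints may differ from the figures quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale once and keep these arrays unchanged for every experiment
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

results = {}
for n in (1, 2, 3, 4):
    # Fit a fresh PCA on the full scaled feature set for each component count
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    clf = RandomForestClassifier(max_depth=2, random_state=0)
    clf.fit(X_train_pca, y_train)
    results[n] = accuracy_score(y_test, clf.predict(X_test_pca))

for n, acc in results.items():
    print(f"{n} component(s): accuracy = {acc:.4f}")
```

Because each PCA is fit on all four scaled features, asking for 3 or 4 components does not hit the `n_components` limit that applies once the data has already been reduced.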


# Logistic Regression

In [None]:
#pima diabetes dataset
dib_df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
dib_df

In [None]:
dib_df.info()

In [None]:
corr = dib_df.corr()
corr
sns.heatmap(corr,xticklabels =corr.columns,yticklabels = corr.columns)

In [None]:
# Count diabetic (Outcome == 1) and non-diabetic rows without hard-coding the row count
size_dib = int((dib_df['Outcome'] == 1).sum())
size_non_dib = len(dib_df) - size_dib
print(f'Diabetes = {size_dib}')
print(f'Non-Diabetes = {size_non_dib}')

In [None]:
dftrain = dib_df[:650]
dftest = dib_df[650:750]
# dftrain,dftest

In [None]:
# Separate the labels from the features; use the keyword form of drop
trainLabel = np.asarray(dftrain['Outcome'])
trainData = np.asarray(dftrain.drop(columns='Outcome'))
testLabel = np.asarray(dftest['Outcome'])
testData = np.asarray(dftest.drop(columns='Outcome'))

In [None]:
means = np.mean(trainData, axis=0)
stds = np.std(trainData, axis=0)
trainData = (trainData - means)/stds
testData = (testData - means)/stds

# np.mean(trainData, axis=0) => check that new means equal 0
# np.std(trainData, axis=0) => check that new stds equal 1

In [None]:
from sklearn.linear_model import LogisticRegression
import joblib

In [None]:
diabetesCheck = LogisticRegression()
diabetesCheck.fit(trainData, trainLabel)

In [None]:
accuracy = diabetesCheck.score(testData, testLabel)
print("accuracy = ", accuracy * 100, "%")

In [None]:
# Save the trained model, together with the training means and stds, for future use with joblib

joblib.dump([diabetesCheck, means, stds], 'diabeteseModel.pkl')

In [None]:
diabetesLoadedModel, means, stds = joblib.load('diabeteseModel.pkl')
accuracyModel = diabetesLoadedModel.score(testData, testLabel)
print("accuracy = ",accuracyModel * 100,"%")
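The scaling statistics are saved next to the model because any new raw sample has to be standardized with the training means and standard deviations before prediction. Below is a minimal sketch of that round trip. The synthetic data, the temporary file name and the variable names are all assumptions for illustration. The real notebook would use `diabeteseModel.pkl` and the diabetes features instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes training data (assumption: 8 numeric features)
rng = np.random.default_rng(0)
train_data = rng.normal(loc=50.0, scale=10.0, size=(200, 8))
train_label = (train_data[:, 0] > 50.0).astype(int)

# Standardize with training statistics and fit the model
means = train_data.mean(axis=0)
stds = train_data.std(axis=0)
model = LogisticRegression().fit((train_data - means) / stds, train_label)

# Save the model together with the scaling statistics, then load them back
path = os.path.join(tempfile.mkdtemp(), "example_model.pkl")  # hypothetical file name
joblib.dump([model, means, stds], path)
loaded_model, loaded_means, loaded_stds = joblib.load(path)

# A new raw sample must be standardized with the saved training statistics
new_sample = rng.normal(loc=50.0, scale=10.0, size=(1, 8))
new_scaled = (new_sample - loaded_means) / loaded_stds

prediction = loaded_model.predict(new_scaled)
probability = loaded_model.predict_proba(new_scaled)
print(prediction, probability)
```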