## Introduction ##

This was an assignment for my Machine Learning class, and I thought I should share it here.

The purpose of the assignment was to implement a classifier that distinguishes between two classes, diabetes and no_diabetes, in the *Pima* dataset. Using 5-fold cross-validation, we had to train the best classifier twice: once on all 8 features of the data and once after reducing the dimensionality of the data with principal component analysis (PCA).

Finally, we had to compare the best accuracy with and without PCA.

You will notice that I wrote my own PCA implementation, because the assignment required it. You can ignore that part.

I was able to achieve **80.13%** average cross-validation accuracy with all 8 features using the following fully connected model.

In [None]:
# coding: utf-8
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 

# Only the pieces used below are imported
from keras.models import Sequential
from keras.layers import Dense

np.random.seed(1337)  # for reproducibility

In [None]:
# Some global variables
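num_classes = 1    # single sigmoid output for the binary label
num_features = 8   # number of features in the dataset
num_reduce = 7     # features kept after PCA; set equal to num_features to skip PCA
epochs = 200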
eig_vec = []

In [None]:
# You can ignore this, as this was part of my assignment
class PCA(object):
    def  __init__(self, k):
        self.U = None 
        self.mean = None
        self.std = None
        self.k = k

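    def process(self, X_t):
        """Standardize X_t and project it onto the top-k principal components."""
        X = X_t.astype(float)  # astype returns a copy, so the input is left untouched
        pca_var = None
        # Fit the statistics on the first call only (the training set)
        if self.mean is None:
            self.mean = np.mean(X, axis=0)
            self.std = np.std(X, axis=0)

        X -= self.mean
        X /= self.std

        # Fit the principal directions on the first call only
        if self.U is None:
            cov = X.T.dot(X) / X.shape[0]
            self.U, S, _ = np.linalg.svd(cov)
            # Fraction of the total variance kept by the first k components
            pca_var = np.sum(S[:self.k]) / np.sum(S)

        return X.dot(self.U[:, :self.k]), pca_var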
        

In [None]:
def read_data():
    df = pd.read_csv("../input/diabetes.csv")
    # as_matrix() was removed in pandas 1.0; .values works on old and new versions
    data = df.values.astype(float)
    y = data[:,  -1]
    X = data[:, :-1]
    return X, y

In [None]:
def get_model():
    model = Sequential()
    model.add(Dense(4, activation='elu', input_dim=num_reduce))
    model.add(Dense(6, activation='elu'))
    model.add(Dense(7, activation='elu'))
    model.add(Dense(8, activation='elu'))
    model.add(Dense(num_classes, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [None]:
def preprocess(X_train, X_val):
    X_t = X_train.copy()
    X_v = X_val.copy()
    mean = np.mean(X_t, axis=0)
    std = np.std(X_t, axis=0)
    X_t -= mean
    X_t /= std 
    X_v -= mean
    X_v /= std
    return X_t, X_v

In [None]:
def cross_val(X, y, k_fold = 5):
    step = X.shape[0] // k_fold 
    accuracies = []
    pca_kept_var = []

    for k in range(k_fold):
        # Divide dataset to training and validation sets
        X_val = X[k*step:((k+1)*step)]
        y_val = y[k*step:((k+1)*step)]
        X_train = np.delete(X,np.arange(k*step,((k+1)*step)), axis = 0)
        y_train = np.delete(y,np.arange(k*step,((k+1)*step)))

        if(num_features != num_reduce):
            pca = PCA(num_reduce)
            X_train, pca_var = pca.process(X_train)
            X_val, _ = pca.process(X_val)
            pca_kept_var.append(pca_var)
        else:
            X_train, X_val = preprocess(X_train, X_val)

        model = get_model()
        #model.summary() if k == 0 else None
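        history = model.fit(X_train, y_train, epochs=epochs, batch_size=50,
                            validation_data=(X_val, y_val), verbose=0)
        # Newer Keras versions log "val_accuracy"; older ones log "val_acc"
        val_key = "val_accuracy" if "val_accuracy" in history.history else "val_acc"
        # Note: this records the best epoch's validation accuracy for the fold
        accuracies.append(np.max(history.history[val_key]))
        print("accuracy #", k, ": ", accuracies[k])
    kept_var = np.mean(pca_kept_var) if len(pca_kept_var) > 0 else None
    return np.mean(accuracies), kept_var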
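
A side note on the fold split: because `step` is computed with integer division, any rows left over after `k_fold * step` never land in a validation fold (the standard Pima CSV has 768 rows, so 3 rows would be left out with 5 folds). The sketch below is one possible way to spread the remainder over the first folds. It is not what the notebook ran; `kfold_indices` is a hypothetical helper name, and it uses only the standard library.

```python
def kfold_indices(n_samples, k_fold=5):
    """Yield (train_idx, val_idx) lists so every sample is validated exactly once."""
    base, extra = divmod(n_samples, k_fold)
    start = 0
    for fold in range(k_fold):
        # The first `extra` folds receive one additional sample each
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


# 768 is the row count of the standard Pima CSV (assumed here)
folds = list(kfold_indices(768, k_fold=5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)  # [154, 154, 154, 153, 153]
```

The index lists can be used directly with NumPy fancy indexing, for example `X[val_idx]` and `X[train_idx]`.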

In [None]:
X, y = read_data()
print("X, Y shape", X.shape, y.shape)
acc, pca_var = cross_val(X, y)
print("ACCR: ", acc)
print("PCA Kept Variance: ", pca_var)

## Cross Validation Accuracy without PCA ##

I preprocessed the data first by zero-centering it and then normalizing it, using the mean and standard deviation of the training set only. This improved the accuracy significantly, from 71.76% to 80.13% with all 8 features.
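
As a minimal sketch of that idea, the statistics are fitted on the training rows and then reused unchanged on the validation rows, so no information from the validation fold leaks into the scaling. The helper names `fit_scaler` and `apply_scaler` and the toy numbers are made up for illustration; the notebook's `preprocess` does the same thing with NumPy. `pstdev` is the population standard deviation, which matches the default of `np.std`.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Return per-column means and population standard deviations."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # a constant column would give 0 here
    return means, stds


def apply_scaler(rows, means, stds):
    """Zero-center and scale each column with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data, not taken from the Pima dataset
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
val = [[3.0, 70.0]]

means, stds = fit_scaler(train)                 # fitted on the training rows only
train_scaled = apply_scaler(train, means, stds)
val_scaled = apply_scaler(val, means, stds)     # reuses the training statistics
print(val_scaled)
```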


![Cross-validation accuracy without PCA][1]


  [1]: https://image.ibb.co/cfAyea/ACCR.png

## Average Cross Validation Accuracy with PCA ##


![Average cross-validation accuracy with PCA][2]


  [2]: https://image.ibb.co/fQTUkF/ACCR_PCA.png

## PCA Kept Variance vs. Number of Eigenvectors ##

![PCA kept variance vs. number of eigenvectors][3]


  [3]: https://image.ibb.co/cVKDCv/PCA_VAR.png
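
For reference, the kept variance for a given number of eigenvectors `k` is the sum of the `k` largest eigenvalues of the covariance matrix divided by the sum of all of them, which is what the `PCA` class computes from the singular values `S`. The sketch below shows that calculation on its own. The eigenvalues are invented round numbers chosen to make the arithmetic easy to follow; they are not the eigenvalues of the Pima data.

```python
def kept_variance(eigenvalues, k):
    """Fraction of total variance retained by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical eigenvalues for an 8-feature dataset (illustration only)
hypothetical_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625, 0.0625]

curve = [kept_variance(hypothetical_eigenvalues, k) for k in range(1, 9)]
print(curve)

# Smallest k that keeps at least 95% of the variance
k_95 = next(k for k, v in enumerate(curve, start=1) if v >= 0.95)
print(k_95)  # 5
```

Plotting `curve` against `k` gives the same kind of graph as the figure above.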

## Conclusion ##
PCA did not improve my classification accuracy, although reducing the data to 6 or 7 features gave a broadly similar result: 78.6%, compared to 80.13% with all 8 features on the same network.

It is also important to note that the preprocessing step I applied to the full feature set (subtracting the mean and dividing by the standard deviation) improved the accuracy significantly. Without this step I got an accuracy of 71.76% using the full features, which is much lower than the accuracies I got using PCA, where the data is standardized inside the PCA class.