# Ensemble models for classifying glasses vs. no glasses from GAN-generated images

## Introduction

This project extends the previous Bayesian optimization work, using the same dataset and objective. Here I use a different modeling strategy to determine whether a person is wearing glasses, using data containing all features. These features are generated from a Generative Adversarial Network (GAN). For details, see the [dataset page](https://www.kaggle.com/jeffheaton/glasses-or-no-glasses).

In the second step, we determined the right NN model for the analysis using Bayesian optimization. Based on those results, we will build an ensemble of multiple models, such as random forest (RF), k-nearest neighbors (KNN) and gradient boosting, and make predictions with 10-fold cross-validation.
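
For comparison only: scikit-learn ships a `StackingClassifier` built on a similar idea, where cross-validated predictions of the base models feed a final estimator. The sketch below is not the code used in this project. It runs on synthetic data from `make_classification`, and the two base models, their settings, and the 5 folds are assumptions chosen to keep it fast. One difference worth noting: `StackingClassifier` refits each base model on the full training set before predicting new data, whereas the hand-written blend in this project averages the per-fold predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the glasses data (assumption: 20 features, not 512).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(solver="lbfgs"),  # the blender
    cv=5,                                                # folds for the out-of-fold predictions
    stack_method="predict_proba",                        # feed probabilities to the blender
)
stack.fit(X_train, y_train)

# One row per test sample, one column per class.
proba = stack.predict_proba(X_test)
print(proba.shape)
```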

### Overview of the following steps

**1** Build the NN model from the optimized parameters such as `dropout`, `neuronPct` and `neuronShrink`.

**2** Build a model ensemble with seven models and 10-fold cross-validation.

**3** For each fold, use 9/10 of the training data (9 folds) to fit the model and the remaining 1/10 to make validation predictions. The fitted model is also used to predict on the test data. The validation predictions (from cross-validation on the training data) and the submit predictions are collected with different strategies.

For the **validation prediction results**, each fold yields predictions for only 1/10 of the training data, so they must be collected fold by fold. In the line below, `test` holds the row indices to fill in for the current fold. After all 10 folds, every row of the column (one column per model) has a value.
```
dataset_blend_train[test, j] = pred[:, 1]
```
For the **test prediction results**, we collect the predictions on the submit data for each fold and each model. Once all folds of a model are done, we average its predictions over the 10 folds.

**4** Fit a logistic regression of y on the validation prediction results, then use it to predict from the averaged submit predictions. In effect, this assigns a weight to each model when producing the final output.

**5** Format the output and save the results.
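
Steps 3 and 4 can be sketched on a small scale as follows. This is a minimal illustration under stated assumptions, not the full pipeline: the data are synthetic (`make_classification`), only two tree-based models are blended, and 5 folds are used instead of 10 to keep it quick. The names `blend_train`, `blend_submit` and `fold_preds` are placeholders chosen for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): 400 "training" rows, 100 "submit" rows.
X_all, y_all = make_classification(n_samples=500, n_features=20, random_state=0)
x, y, x_submit = X_all[:400], y_all[:400], X_all[400:]

models = [
    RandomForestClassifier(n_estimators=30, random_state=0),
    ExtraTreesClassifier(n_estimators=30, random_state=0),
]
folds = list(StratifiedKFold(n_splits=5).split(x, y))

blend_train = np.zeros((x.shape[0], len(models)))          # out-of-fold predictions
blend_submit = np.zeros((x_submit.shape[0], len(models)))  # fold-averaged predictions

for j, model in enumerate(models):
    fold_preds = np.zeros((x_submit.shape[0], len(folds)))
    for i, (train, test) in enumerate(folds):
        model.fit(x[train], y[train])
        # Each training row is held out exactly once, so after the last fold
        # column j is completely filled.
        blend_train[test, j] = model.predict_proba(x[test])[:, 1]
        # Predictions on the submit data are kept per fold ...
        fold_preds[:, i] = model.predict_proba(x_submit)[:, 1]
    # ... and averaged once all folds of this model are done.
    blend_submit[:, j] = fold_preds.mean(axis=1)

# The meta-model learns one coefficient per base model.
blend = LogisticRegression(solver="lbfgs")
blend.fit(blend_train, y)
final_proba = blend.predict_proba(blend_submit)[:, 1]
print("blend coefficients:", blend.coef_.ravel())
```

The printed coefficients are what "assigning weights to the models" amounts to here: a larger coefficient means the blender relies more on that model's probability.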

### Data
Two data files are used in this project:

**(1)** training.csv (read as `train.csv` in the code below), which includes 512 features and one response variable, glasses (1 means glasses, 0 means no glasses).

**(2)** submit.csv (read as `test.csv` in the code below), which is in fact the test data used to measure how good the model is.


**Objective**: To develop a robust approach for classifying whether a person is wearing glasses using an ensemble of models, which includes machine learning models (k-nearest neighbors, random forest, gradient boosting and extra trees) and a deep learning model (an NN optimized with Bayesian optimization).

In [1]:
import numpy as np
import os
import pandas as pd
import math
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
import time
import tensorflow.keras.initializers
import statistics
import tensorflow.keras
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, InputLayer,BatchNormalization
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import StratifiedShuffleSplit
from tensorflow.keras.layers import LeakyReLU,PReLU
from tensorflow.keras.optimizers import Adam

## Use the optimized NN model's parameters from the previous Bayesian optimization

Based on the previous Bayesian optimization, we can choose the optimized parameters to build an NN model:

{'dropout': 0.07323118951773941,
 'lr': 0.009233859476879781,
 'neuronPct': 0.1943976092638942,
 'neuronShrink': 0.35210511977261727}
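
To make the effect of these values concrete, the sketch below replays the layer-sizing rule used by `nn_model` in the next section: start from `neuronPct` of 4000 neurons, multiply by `neuronShrink` for each further layer, and stop once the count drops to 25 or below or 10 layers are reached. `hidden_layer_widths` is a hypothetical helper introduced only for this illustration; it assumes each layer width is truncated to an integer.

```python
# Values copied from the optimization result above.
neuronPct = 0.1943976092638942
neuronShrink = 0.35210511977261727
init_num_neurons = 4000  # default used by nn_model


def hidden_layer_widths(neuron_pct, neuron_shrink, init_neurons,
                        min_neurons=25, max_layers=10):
    """Return the hidden-layer sizes produced by the shrink rule (hypothetical helper)."""
    widths = []
    neuron_count = int(neuron_pct * init_neurons)
    while neuron_count > min_neurons and len(widths) < max_layers:
        widths.append(int(neuron_count))             # width of the current layer
        neuron_count = neuron_count * neuron_shrink  # shrink for the next layer
    return widths


print(hidden_layer_widths(neuronPct, neuronShrink, init_num_neurons))
# → [777, 273, 96, 33]
```

With these parameters the network therefore has four hidden layers; a fifth would have about 12 neurons, below the 25-neuron cutoff.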
 
### Optimized NN model

In [3]:
## revised from the generate_model function
dropout = 0.07323118951773941
neuronPct = 0.1943976092638942
neuronShrink = 0.35210511977261727
lr = 0.01863  ## note: not the 'lr' value (0.00923...) listed in the optimization result above

def nn_model(dropout, neuronPct, neuronShrink,num_class,lr,init_num_neurons = 4000):
    # We start with some percent of init_num_neurons (default 4000) on the first hidden layer.
    neuronCount = int(neuronPct * init_num_neurons)
    
    # Construct neural network
    # kernel_initializer = tensorflow.keras.initializers.he_uniform(seed=None)
    model = Sequential()

    # So long as there would have been at least 25 neurons and fewer than 10
    # layers, create a new layer.
    layer = 0
    while neuronCount>25 and layer<10:
        # The first (0th) layer needs input_dim (the number of input features)
        if layer==0:
            model.add(Dense(neuronCount, 
                input_dim=x.shape[1], 
                activation=PReLU()))
        else:
            model.add(Dense(neuronCount, activation=PReLU())) 
            
        layer += 1
        model.add(BatchNormalization())
        # Add dropout after each hidden layer
        model.add(Dropout(dropout))

        # Shrink neuron count for each layer
        neuronCount = neuronCount * neuronShrink

    if num_class>2:
        model.add(Dense(num_class,activation='softmax')) # Output
        ## newly added part: compile inside the builder
        model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=lr))
    else:
        model.add(Dense(num_class,activation='sigmoid')) # Output
        ## newly added part: compile inside the builder
        model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=lr))
    return model

In [5]:
PATH = './data' ##input
outpath = './data/submit' ##output

SHUFFLE = False
FOLDS = 10

## in our case, we didn't really use this function
# def mlogloss(y_test, preds):
#     epsilon = 1e-15
#     sum = 0
#     for row in zip(preds,y_test):
#         x = row[0][row[1]]
#         x = max(epsilon,x)
#         x = min(1-epsilon,x)
#         sum+=math.log(x)
#     return( (-1/len(preds))*sum)

def stretch(y):
    return (y - y.min()) / (y.max() - y.min())


def blend_ensemble(x, y, x_submit):
    kf = StratifiedKFold(FOLDS)
    #folds = list(kf.split(x,y[:,0]))  # this is very important, as y should be a 1-D array
    folds = list(kf.split(x,y))
    
    models = [
        KerasClassifier(build_fn=nn_model,dropout = dropout, neuronPct = neuronPct, 
                        neuronShrink = neuronShrink,num_class = num_class,lr=lr), ## NN model defined above, with customized parameters
        KNeighborsClassifier(n_neighbors=3),
        RandomForestClassifier(n_estimators=200, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=200, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=200, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=200, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    dataset_blend_train = np.zeros((x.shape[0], len(models)))
    dataset_blend_test = np.zeros((x_submit.shape[0], len(models)))

    for j, model in enumerate(models):
        print("Model: {} : {}".format(j, model) )
        fold_sums = np.zeros((x_submit.shape[0], len(folds)))
        total_loss = 0
        for i, (train, test) in enumerate(folds):
            x_train = x[train]
            y_train = y[train]
            x_test = x[test]
            y_test = y[test]
            
            if j==0: ## the NN model defined above requires one-hot encoded labels
                tem = pd.DataFrame(y[train],columns=['yp'])
                y_train = pd.get_dummies(tem['yp']).values
                
                tem1 = pd.DataFrame(y[test],columns=['yp'])
                y_test = pd.get_dummies(tem1['yp']).values
                
            model.fit(x_train, y_train)
            pred = np.array(model.predict_proba(x_test)) ## it didn't work well for KNN as it will generate three dimension results
            
            #if len(pred.shape) == 3:
            #    pred = pred[1]
            #pred = np.array(model.predict(x_test))
            # pred = model.predict_proba(x_test)
            dataset_blend_train[test, j] = pred[:, 1]
            pred2 = np.array(model.predict_proba(x_submit))
            #if len(pred2.shape) == 3:
            #    pred2 = pred2[1]
            #fold_sums[:, i] = model.predict_proba(x_submit)[:, 1]
            fold_sums[:, i] = pred2[:, 1]
            loss = metrics.log_loss(y_test, pred) ## here we use sklearn's log_loss, not our own mlogloss
            total_loss+=loss
            print("Fold #{}: loss={}".format(i,loss))
        print("{}: Mean loss={}".format(model.__class__.__name__,total_loss/len(folds)))
        dataset_blend_test[:, j] = fold_sums.mean(1)

    print()
    print("Blending models.")
    blend = LogisticRegression(solver='lbfgs')
    blend.fit(dataset_blend_train, y)
    return blend.predict_proba(dataset_blend_test)

if __name__ == '__main__':

    np.random.seed(2)  # seed to shuffle the train set

    print("Loading data...")
    filename_train = os.path.join(PATH, "train.csv")
    df_train = pd.read_csv(filename_train, na_values=['NA', '?'])
    df_train.drop('id',axis =1, inplace = True)
    num_class = df_train['glasses'].nunique()
    
    filename_submit = os.path.join(PATH, "test.csv")
    df_submit = pd.read_csv(filename_submit, na_values=['NA', '?'])
    ids = df_submit['id']
    df_submit.drop('id',axis =1, inplace = True)
    
    
    predictors = list(df_train.columns.values)
    predictors.remove('glasses')
    x = df_train[predictors].values
    y = df_train['glasses'].values
    #dummies = pd.get_dummies(df_train['glasses']) # Classification
    #y = dummies.values
    
    
    x_submit = df_submit.values

    if SHUFFLE:
        idx = np.random.permutation(y.size)
        x = x[idx]
        y = y[idx]

    submit_data = blend_ensemble(x, y, x_submit)
    submit_data = stretch(submit_data)

    ####################
    # Build submit file
    ####################
    #ids = [id+1 for id in range(submit_data.shape[0])]
    submit_filename = os.path.join(outpath, "submit_nn_ensembles.csv")
    submit_df = pd.DataFrame({'id': ids, 'glasses': submit_data[:, 1]},
                             columns=['id','glasses'])
    submit_df.to_csv(submit_filename, index=False)

Loading data...
Model: 0 : <tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier object at 0x0000029C3302EA08>
Train on 4050 samples
Fold #0: loss=0.19948168075353725
Train on 4050 samples
Fold #1: loss=0.2835264015548779
Train on 4050 samples
Fold #2: loss=0.28316105559833993
Train on 4050 samples
Fold #3: loss=0.2954316608935089
Train on 4050 samples
Fold #4: loss=0.33633542314876386
Train on 4050 samples
Fold #5: loss=0.19641085826017335
Train on 4050 samples
Fold #6: loss=0.2701019382864923
Train on 4050 samples
Fold #7: loss=0.2734718042067352
Train on 4050 samples
Fold #8: loss=0.29325654197510176
Train on 4050 samples
Fold #9: loss=0.38209583728065244
KerasClassifier: Mean loss=0.2813273201958183
Model: 1 : KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
Fold #0: loss=0.0018020671471483923
Fold #1: loss=0.005144461362206777
Fold #2: 