## Bremen Big Data Challenge 2019

### Application of Fully-Connected Neural Network

Created on March 13, 2019 by Ralph Florent <r.florent@jacobs-university.de>

**Note**: The `dataset` provided by the University of Bremen for the challenge has been pre-processed by [Gari Ciodaro](mailto:g.ciodaroguerra@jacobs-university.de) and [Diogo Cosin](mailto:d.ayresdeoliveira@jacobs-university.de) using *PCA* for the dimension reduction, *Logistic Regression* to train the models, and *10-Fold Cross Validation* to estimate the test error rate.
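The pre-processing chain named in the note (PCA, then Logistic Regression, evaluated with 10-fold cross-validation) can be sketched roughly as below. This is not the code used by the pre-processing authors: the synthetic data from `make_classification`, the choice of 5 principal components, and `max_iter=1000` are all assumptions made only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption); the challenge features are not used here.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# PCA for dimension reduction, followed by logistic regression as the classifier.
# n_components=5 and max_iter=1000 are illustrative values, not the original ones.
model = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))

# 10-fold cross-validation: the mean fold accuracy estimates the test accuracy,
# and one minus that value estimates the test error rate.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print("mean accuracy:", scores.mean())
print("estimated error rate:", 1 - scores.mean())
```

Wrapping both steps in one pipeline means PCA is refitted inside every fold, so no information from the held-out fold leaks into the projection.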

The Fully-Connected Neural Network serves as an alternative approach, aimed at improving accuracy on the pre-processed data stored in the binary files at `../assets/data_{*}_train.file`.

In [1]:
# -*- coding: utf-8 -*-
#
# Created on March 13, 2019
# Author: Ralph Florent <r.florent@jacobs-university.de>

# Import relevant libraries
import pandas as pd
import numpy as np
import pickle

FILE_PATH = "../assets/"

# Load the pre-processed feature matrix (a pickled pandas DataFrame)
with open(FILE_PATH + "data_X_train.file", "rb") as f:
    data_X_train = pickle.load(f)

# Load the matching target labels
with open(FILE_PATH + "data_y_train.file", "rb") as f:
    data_y_train = pickle.load(f)

### View the X-data in a tabular form 

In [5]:
data_X_train.head()

| | curve-left-step | stand-to-sit | curve-right-spin-Rfirst | jump-one-leg | lateral-shuffle-right | curve-right-spin-Lfirst | v-cut-right-Lfirst | stair-down | v-cut-left-Rfirst | v-cut-right-Rfirst | ... | curve-right-step | sit-to-stand | run | v-cut-left-Lfirst | stand | curve-left-spin-Lfirst | walk | curve-left-spin-Rfirst | lateral-shuffle-left | lay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.033164 | 0.995012 | 0.993435 | 0.883137 | 0.993072 | 0.999744 | 0.982115 | 0.999704 | 0.999167 | 0.999856 | ... | 0.909114 | 0.995479 | 0.997755 | 0.945079 | 0.999936 | 0.998107 | 0.981452 | 0.998682 | 0.9844 | 0.999685 |
| 1 | 0.012396 | 0.997673 | 0.989988 | 0.912907 | 0.953943 | 0.999196 | 0.982334 | 0.996781 | 0.998068 | 0.999813 | ... | 0.426234 | 0.999056 | 0.998155 | 0.999292 | 0.999986 | 0.998123 | 0.998996 | 0.998454 | 0.995678 | 0.999839 |
| 2 | 0.975501 | 0.557959 | 0.999916 | 0.983967 | 0.997322 | 0.999982 | 0.998201 | 0.9882 | 0.971928 | 0.993051 | ... | 0.999588 | 0.999901 | 0.99217 | 0.99787 | 0.999999 | 0.999748 | 0.945848 | 0.999428 | 0.981308 | 0.999505 |
| 3 | 0.993094 | 0.999957 | 0.835326 | 0.959761 | 0.988113 | 0.992239 | 0.618073 | 0.996668 | 0.997662 | 0.991547 | ... | 0.999539 | 0.991327 | 0.998971 | 0.994514 | 0.978037 | 0.999979 | 0.992488 | 0.998575 | 0.993449 | 0.99954 |
| 4 | 0.999398 | 0.974104 | 0.993987 | 0.573354 | 0.993094 | 0.999421 | 0.911141 | 0.999611 | 0.999391 | 0.889177 | ... | 0.93261 | 0.99709 | 0.981 | 0.999972 | 0.313301 | 0.99959 | 0.857307 | 0.993626 | 0.999878 | 0.999241 |


### View the Y-data in a tabular form 

In [6]:
data_y_train.head()

| | Target |
|---|---|
| 0 | curve-left-step |
| 1 | curve-left-step |
| 2 | stand-to-sit |
| 3 | curve-right-spin-Rfirst |
| 4 | jump-one-leg |


### Summary of the X-data

In [7]:
data_X_train.describe()

| | curve-left-step | stand-to-sit | curve-right-spin-Rfirst | jump-one-leg | lateral-shuffle-right | curve-right-spin-Lfirst | v-cut-right-Lfirst | stair-down | v-cut-left-Rfirst | v-cut-right-Rfirst | ... | curve-right-step | sit-to-stand | run | v-cut-left-Lfirst | stand | curve-left-spin-Lfirst | walk | curve-left-spin-Rfirst | lateral-shuffle-left | lay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | ... | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 | 6401.0 |
| mean | 0.9493751 | 0.973442 | 0.973119 | 0.9414576 | 0.963218 | 0.968791 | 0.9558974 | 0.958242 | 0.9581618 | 0.959433 | ... | 0.951333 | 0.974778 | 0.9550495 | 0.955528 | 0.944583 | 0.969978 | 0.934495 | 0.965083 | 0.959485 | 0.997672 |
| std | 0.143912 | 0.11508 | 0.116581 | 0.1481148 | 0.12895 | 0.124731 | 0.1366952 | 0.136395 | 0.1338524 | 0.133932 | ... | 0.14523 | 0.111656 | 0.1349484 | 0.135789 | 0.156698 | 0.120698 | 0.1585219 | 0.130244 | 0.134819 | 0.004574 |
| min | 2.142298e-08 | 0.011336 | 0.01176 | 3.785107e-07 | 5e-06 | 0.002728 | 1.392495e-08 | 4.3e-05 | 4.451998e-07 | 6.5e-05 | ... | 0.000844 | 0.00138 | 4.19239e-07 | 2e-06 | 0.003994 | 0.001845 | 4.440892e-14 | 0.026992 | 9.3e-05 | 0.674407 |
| 25% | 0.9792927 | 0.996982 | 0.997924 | 0.9622077 | 0.988261 | 0.995078 | 0.9806403 | 0.982543 | 0.986735 | 0.985857 | ... | 0.981114 | 0.997618 | 0.977416 | 0.982826 | 0.977705 | 0.995571 | 0.9632756 | 0.996266 | 0.985696 | 0.997235 |
| 50% | 0.9973256 | 0.999532 | 0.999674 | 0.9901014 | 0.99653 | 0.99931 | 0.9954707 | 0.997227 | 0.9974115 | 0.996497 | ... | 0.997854 | 0.99953 | 0.992649 | 0.996762 | 0.998376 | 0.999601 | 0.9934267 | 0.999527 | 0.995496 | 0.997903 |
| 75% | 0.9994578 | 0.999911 | 0.999969 | 0.9972596 | 0.998973 | 0.9999 | 0.9992303 | 0.999512 | 0.9995336 | 0.999202 | ... | 0.999781 | 0.999892 | 0.9976874 | 0.999513 | 0.999948 | 0.999955 | 0.9986971 | 0.99993 | 0.99856 | 0.998556 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9999998 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9999996 | 1.0 | 1.0 | 0.999986 |


### Application of the Fully-Connected Neural Network on the dataset

In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 

# randomly split into 75% training data and 25% test data
split_results = train_test_split( data_X_train, data_y_train, test_size=0.25, random_state=0 ) 
X_train, X_test, y_train, y_test = split_results
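As a quick sanity check on the split, the sizes of the two partitions can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples and that the training set receives the remainder; the helper name `split_sizes` is hypothetical. The sample count of 6401 comes from the `count` row of the summary above.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training set
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(6401, 0.25)
print(n_train, n_test)  # 4800 1601
```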

### Important notes on MLP Classifier

#### This neural network is trained with the backpropagation technique. For continuous targets, the counterpart is the MLP regressor (`MLPRegressor`).

There are several settings to consider when building an MLP classifier, among them the activation function, the regularization factor `alpha` (the penalty), the number and size of the hidden layers, and so on.

See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) for more information about the parameters.

1. The solvers for the weight optimization are:
    * ‘lbfgs’, an optimizer in the family of quasi-Newton methods.
    * ‘sgd’, stochastic gradient descent.
    * ‘adam’, a stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba.

2. The hidden-layer configuration (`hidden_layer_sizes`): each entry of the tuple is the number of units in one hidden layer. The default is a single hidden layer of 100 units, and the size can be varied around that value.

3. The regularization factor `alpha` can vary between 0.1 and 0.00001. In practice, varying `alpha` often improves an MLP more than adjusting the number of layers or trying another algorithm.
4. The activation function for the hidden layer:
    * ‘identity’, a no-op activation, useful to implement a linear bottleneck, returns f(x) = x.
    * ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).
    * ‘tanh’, the hyperbolic tangent function, returns f(x) = tanh(x).
    * ‘relu’, the rectified linear unit function, returns f(x) = max(0, x).
5. Another option is to scale the data points (for example, with `StandardScaler`).
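To make items 2 and 4 concrete, here is a small plain-Python sketch of the four activation functions, together with a helper that lists the weight-matrix shapes implied by a `hidden_layer_sizes` tuple. The names `ACTIVATIONS` and `layer_shapes`, as well as the feature and class counts in the usage line, are hypothetical and chosen only for illustration.

```python
import math

# The four hidden-layer activations, written out for a single scalar input.
ACTIVATIONS = {
    "identity": lambda x: x,
    "logistic": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
}


def layer_shapes(n_features, hidden_layer_sizes, n_outputs):
    """Weight-matrix shapes of an MLP; one tuple entry per hidden layer."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))


# Evaluate every activation at a negative, a zero, and a positive input.
for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])

# Hypothetical sizes: 10 input features, one hidden layer of 50 units, 3 classes.
print(layer_shapes(10, (50,), 3))  # [(10, 50), (50, 3)]
```

Note that `(50,)` describes one hidden layer of 50 units, whereas `(50, 50)` would describe two hidden layers.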

Given that the dataset has 6401 samples (on the order of thousands), `lbfgs` seems a reasonable solver to begin with. For most of the other parameters, let's keep the defaults. Once we have assessed the prediction accuracy, we'll tune other parameters, such as `alpha`, in the hope of improving it.
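The tuning strategy described here, combined with the optional scaling step, might look like the sketch below. It runs on synthetic data from `make_classification` rather than on the challenge data, and the three `alpha` values, the hidden layer of 20 units, `max_iter=500`, and `random_state=0` are assumptions picked for a fast, repeatable run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the real features and labels.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for alpha in (1e-5, 1e-3, 1e-1):
    # Scale the features, then fit an MLP with the lbfgs solver.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(
            solver="lbfgs",
            alpha=alpha,
            hidden_layer_sizes=(20,),
            max_iter=500,
            random_state=0,
        ),
    )
    model.fit(X_tr, y_tr)
    scores[alpha] = model.score(X_te, y_te)

for alpha, score in scores.items():
    print(f"alpha={alpha:g}: test accuracy={score:.3f}")
```

Keeping the scaler inside the pipeline ensures that it is fitted on the training split only.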

In [74]:
# build classifier with the following specs:
# solver: 'lbfgs'
# penalty (alpha): 0.01
# hidden layers: one layer of 50 units

classifier = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(50,))
classifier.fit(X_train, y_train.values.ravel())

MLPClassifier(activation='relu', alpha=0.01, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(50,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [75]:
classifier.score(X_test, y_test)

0.8376014990630856

In [78]:
classifier.score(X_train, y_train)

0.8385416666666666
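`score` returns the mean accuracy, i.e., the fraction of samples whose predicted label equals the true label. A minimal sketch of that computation follows; the function name `accuracy` and the toy predictions are hypothetical. Since the training and test scores above are nearly identical (about 0.8385 and 0.8376), the baseline model shows little sign of overfitting.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Class names that appear in the data; the predictions themselves are made up.
y_true = ["walk", "run", "stand", "lay"]
y_pred = ["walk", "run", "stand", "walk"]
print(accuracy(y_true, y_pred))  # 0.75
```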

### Let's combine the above-mentioned options using Grid Search and see what parameters work best

In [76]:
from sklearn.model_selection import GridSearchCV
#import warnings
#warnings.filterwarnings('ignore')

# set the parameters
param_grid = [
        {
            'activation' : ['identity', 'logistic', 'tanh', 'relu'],
            'solver' : ['lbfgs', 'sgd', 'adam'],
            'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
            'hidden_layer_sizes': [
             (1,),(2,),(3,),(4,),(5,),(6,),(7,),(8,),(9,),(10,),(11,), (12,),(13,),(14,),(15,),
                (16,),(17,),(18,),(19,),(20,)
             ]
        }
       ]

gs_clf = GridSearchCV( MLPClassifier(), param_grid, cv=3, scoring='accuracy' )
gs_clf.fit(X_train, y_train.values.ravel())

print("Best parameters set found on the MLP classifier development set: ")
print(gs_clf.best_params_)

Best parameters set found on the MLP classifier development set: 
{'activation': 'tanh', 'alpha': 0.01, 'hidden_layer_sizes': (20,), 'solver': 'lbfgs'}
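To get a feel for the cost of this search, the number of candidate settings can be counted directly from the grid. The sketch below rebuilds the same grid in plain Python; it assumes the grid search refits the best candidate once on the full training split at the end (the default `refit` behaviour).

```python
from itertools import product

# Same grid as above, written compactly.
param_grid = {
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "hidden_layer_sizes": [(n,) for n in range(1, 21)],
}

# Every combination of one value per parameter is one candidate model.
n_candidates = len(list(product(*param_grid.values())))
cv_folds = 3
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 1200 3600
```

With 1200 candidates and 3 folds, the search trains 3600 networks before the final refit, which explains why this cell takes a long time to run.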


### Now that we know the best parameters for the MLP Classifier, let's plug them in and calculate the accuracy

```
Best parameters set found on the MLP classifier development set:
{'activation': 'tanh', 'alpha': 0.01, 'hidden_layer_sizes': (20,), 'solver': 'lbfgs'}
```

In [77]:
# build classifier with the following specs:
# activation function: 'tanh'
# penalty (alpha): 0.01
# hidden layers: one layer of 20 units
# weight-optimization solver: 'lbfgs'

best_classifier = MLPClassifier(activation='tanh', solver='lbfgs', alpha=0.01, hidden_layer_sizes=(20,))
best_classifier.fit(X_train, y_train.values.ravel())
accuracy = best_classifier.score(X_test, y_test)
print('The accuracy for the selected parameters is: ', accuracy)

The accuracy for the selected parameters is:  0.8444722048719551


### Conclusion: 
The best parameters, which yield an accuracy of about 84.45%, use the non-linear `tanh` activation. This suggests that there is a non-linear relationship between the feature vectors and the class labels. Now, let's use this classifier to predict the labels for the `challenge.csv` dataset.

In [110]:
# Predict on the held-out split (the challenge data is not loaded in this notebook)
y_pred_test = best_classifier.predict(X_test)

In [109]:
output = {}

# Subject_c = list(df_test.Subject.values)
# output['Subject'] = Subject_c

# Datafile_c = list(df_test.Datafile.values)
# output['Datafile'] = Datafile_c

# output['Label_pre'] = y_pred_test

# Label_c = list(df_test.Label.values)
# output['Label'] = Label_c

# data_frame_out = pd.DataFrame.from_dict(output)
# data_frame_out.to_csv(index=False, path_or_buf='../dist/deliverable.csv')
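The commented-out export above could be completed along the following lines. This is only a sketch: `df_test` is not defined anywhere in this notebook, so the rows below are hypothetical stand-ins, and the standard-library `csv` module replaces pandas to keep the example self-contained. The column names `Subject`, `Datafile`, and `Label` are taken from the commented code.

```python
import csv
import io

# Hypothetical stand-ins for the challenge rows and the predicted labels.
subjects = ["S1", "S1", "S2"]
datafiles = ["a.csv", "b.csv", "c.csv"]
predicted = ["walk", "run", "stand"]

# An in-memory buffer is used here; swap it for
# open(path, "w", newline="") to write an actual file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Subject", "Datafile", "Label"])
writer.writerows(zip(subjects, datafiles, predicted))

deliverable = buffer.getvalue()
print(deliverable)
```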