# Project Pipeline
Execute the cells step by step to obtain a prediction and score for your configuration.
To make the widgets work, you may need to run
```
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```
on your system.

## 1. Imports

In [10]:
import importlib
import data_utils
import widget_ui as ui

# Load potential changes in the modules during development
data_utils = importlib.reload(data_utils)
ui = importlib.reload(ui)

## 2. Load Data Sets
Load training and test data. Furthermore, create a sample set for quickly calculating transformations such as PCA.

In [11]:
data = data_utils.load_from('data/all_train.csv', 'data/all_test.csv')
sample_data = data_utils.load_sample_data('data/all_sample.csv')

## 3. Preprocessing

In [12]:
preprocessors = {}

### 3.1 Remove Correlations
Choose between two methods to remove correlations: Pearson correlation coefficient or PCA.

__Pearson correlation coefficient:__ Find pairs of correlated features and remove one feature from each pair. All pairs with a Pearson correlation coefficient above the selected threshold are considered correlated.
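
The pairwise filter described above could look roughly like the following sketch. It is not the code behind `ui.preprocessors_ui`; the function name `correlated_features_to_drop`, the use of the absolute correlation value, and the choice to keep the earlier feature of each pair are assumptions made for illustration.

```python
import numpy as np

def correlated_features_to_drop(X, threshold):
    """Return column indices to drop so that no remaining pair exceeds the threshold."""
    # Absolute Pearson correlation between columns (rowvar=False: columns are features).
    # Assumption: strongly negative correlations count as correlated, too.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n_features = corr.shape[0]
    dropped = set()
    for i in range(n_features):
        if i in dropped:
            continue
        for j in range(i + 1, n_features):
            # Assumption: keep the earlier feature i and drop the later feature j.
            if j not in dropped and corr[i, j] > threshold:
                dropped.add(j)
    return sorted(dropped)

# Toy data: column 1 is almost a copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
X = np.column_stack([a, a + 0.01 * rng.normal(size=1000), rng.normal(size=1000)])

drop = correlated_features_to_drop(X, threshold=0.6)
X_reduced = np.delete(X, drop, axis=1)
```

Lowering the threshold makes more pairs count as correlated, so more features are removed.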

__PCA:__ Apply a principal component analysis. The number of principal components can be chosen.
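
As a rough illustration of the PCA option, the sketch below fits the projection on a small sample and then applies it to a larger batch, which is how a sample set can keep the transformation cheap to compute. The synthetic arrays stand in for the real data, and the internals of `ui.preprocessors_ui` may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 10))    # stand-in for the small sample set
batch = rng.normal(size=(2000, 10))    # stand-in for one batch of the full data

# Fit the projection on the sample only, which is much cheaper than using all data.
pca = PCA(n_components=5)
pca.fit(sample)

# Apply the same fitted projection to every batch of training or test data.
batch_reduced = pca.transform(batch)
```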

In [13]:
ui.preprocessors_ui(preprocessors, sample_data)

A PCA with 5 principal components will be applied.


## 4. Configure Classifiers
Both classifiers are implemented to learn iteratively on batches of data. In the offline setting, this method enables us to process amounts of data that do not fit in memory at once. In the online setting, it allows for a continuous update of the classifier whenever new data arrives. For efficiency, it is useful to wait until enough data has arrived to form a batch rather than processing single data instances.
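
A minimal sketch of batch-wise learning, assuming the classifiers are updated through scikit-learn's `partial_fit` method. The helper names `iter_batches` and `train_in_batches` are hypothetical, and the data is synthetic.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def iter_batches(X, y, batch_size):
    """Yield consecutive (X, y) slices of at most batch_size rows."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

def train_in_batches(clf, batches, classes):
    """Update the classifier one batch at a time."""
    for X_batch, y_batch in batches:
        # All class labels must be declared up front, because a single
        # batch is not guaranteed to contain every class.
        clf.partial_fit(X_batch, y_batch, classes=classes)
    return clf

# Synthetic two-class data with well separated class means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(500, 3)), rng.normal(2, 1, size=(500, 3))])
y = np.repeat([0, 1], 500)
order = rng.permutation(len(X))
X, y = X[order], y[order]

clf = train_in_batches(GaussianNB(), iter_batches(X, y, batch_size=100),
                       classes=np.array([0, 1]))
accuracy = clf.score(X, y)
```

For data that does not fit in memory, the batches could instead come from a chunked file reader such as `pandas.read_csv` with the `chunksize` argument, so that only one batch is held in memory at a time.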

In [14]:
classifiers = {}

### 4.1 Multilayer Perceptron
Based on the [Scikit-learn MLP module](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).

In [15]:
ui.mlp_ui(classifiers)
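
For orientation, the architectures compared in the results table of section 7 could be expressed as `hidden_layer_sizes` tuples of `MLPClassifier`, as in the sketch below. Whether `ui.mlp_ui` sets exactly this parameter is an assumption, as is the reading of the label "20^10" as ten hidden layers of 20 units each.

```python
from sklearn.neural_network import MLPClassifier

# Each tuple entry is the number of units in one hidden layer.
architectures = {
    "20, 20": (20, 20),
    "40, 20": (40, 20),
    "20, 20, 20, 20": (20, 20, 20, 20),
    "20^10": (20,) * 10,  # assumption: ten hidden layers with 20 units each
}

# One untrained classifier per architecture; all other parameters keep their defaults.
models = {
    label: MLPClassifier(hidden_layer_sizes=layers)
    for label, layers in architectures.items()
}
```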

### 4.2 Naive Bayes
Based on the [Scikit-learn Gaussian Naive Bayes module](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html).

In [16]:
ui.nb_ui(classifiers)

## 5. Training

In [17]:
ui.training_ui(data, sample_data, preprocessors, classifiers)

## 6. Prediction

In [18]:
ui.prediction_ui(data, sample_data, preprocessors, classifiers)

## 7. Results
A comparison of different parameter settings in terms of accuracy can be found in the following table:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#aabcfe;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#aabcfe;color:#669;background-color:#e8edff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#aabcfe;color:#039;background-color:#b9c9fe;}
.tg .tg-1d7g{font-weight:bold;font-size:16px;vertical-align:top}
.tg .tg-qv16{font-weight:bold;font-size:16px;text-align:center;vertical-align:top}
.tg .tg-qgsu{font-size:15px;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
.tg .tg-6k2t{background-color:#D2E4FC;vertical-align:top}
.tg .tg-2nw2{background-color:#D2E4FC;font-size:15px;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-yw4l">Preprocessing</th>
    <th class="tg-yw4l">Parameter</th>
    <th class="tg-1d7g">Naive Bayes<br></th>
    <th class="tg-qv16" colspan="4">Multilayer Perceptron (hidden layer sizes)<br></th>
  </tr>
  <tr>
    <td class="tg-yw4l"></td>
    <td class="tg-6k2t"></td>
    <td class="tg-qgsu"></td>
    <td class="tg-2nw2">20, 20<br></td>
    <td class="tg-qgsu">40, 20<br></td>
    <td class="tg-2nw2">20, 20, 20, 20<br></td>
    <td class="tg-qgsu">20^10<br></td>
  </tr>
  <tr>
    <td class="tg-1d7g">None<br></td>
    <td class="tg-2nw2"></td>
    <td class="tg-yw4l">0.7956</td>
    <td class="tg-6k2t">0.8447</td>
    <td class="tg-yw4l">0.8427</td>
    <td class="tg-6k2t">0.84</td>
    <td class="tg-yw4l">0.846</td>
  </tr>
  <tr>
    <td class="tg-1d7g" rowspan="3">PCA</td>
    <td class="tg-2nw2">3</td>
    <td class="tg-yw4l">0.8002</td>
    <td class="tg-6k2t">0.7979</td>
    <td class="tg-yw4l">0.7955</td>
    <td class="tg-6k2t">0.8</td>
    <td class="tg-yw4l">0.8003</td>
  </tr>
  <tr>
    <td class="tg-2nw2">6</td>
    <td class="tg-yw4l">0.7986</td>
    <td class="tg-6k2t">0.8064</td>
    <td class="tg-yw4l">0.8025</td>
    <td class="tg-6k2t">0.82</td>
    <td class="tg-yw4l">0.812</td>
  </tr>
  <tr>
    <td class="tg-2nw2">15</td>
    <td class="tg-yw4l">0.8047</td>
    <td class="tg-6k2t">0.8319</td>
    <td class="tg-yw4l">0.8305</td>
    <td class="tg-6k2t">0.83</td>
    <td class="tg-yw4l">0.8257</td>
  </tr>
  <tr>
    <td class="tg-1d7g" rowspan="3">Pearson</td>
    <td class="tg-2nw2">0.6 (4 features removed)<br></td>
    <td class="tg-yw4l">0.7622</td>
    <td class="tg-6k2t">0.8044</td>
    <td class="tg-yw4l">0.825</td>
    <td class="tg-6k2t">0.8</td>
    <td class="tg-yw4l">0.8073</td>
  </tr>
  <tr>
    <td class="tg-2nw2">0.4 (9 features removed)<br></td>
    <td class="tg-yw4l">0.7514</td>
    <td class="tg-6k2t">0.7971</td>
    <td class="tg-yw4l">0.7806</td>
    <td class="tg-6k2t">0.7976</td>
    <td class="tg-yw4l">0.8082</td>
  </tr>
  <tr>
    <td class="tg-2nw2">0.2 (16 features removed)<br></td>
    <td class="tg-yw4l">0.5273</td>
    <td class="tg-6k2t">0.5456</td>
    <td class="tg-yw4l">0.5255</td>
    <td class="tg-6k2t">0.5603</td>
    <td class="tg-yw4l">0.5693</td>
  </tr>
</table>

The Multilayer Perceptron generally performed better than the Naive Bayes classifier.

Notably, the MLP works well on raw data: the best results were achieved without preprocessing. The more features we removed, the lower the accuracy on the test set became. Since the training set contains several million instances, the MLP can still learn its parameters despite the high dimensionality of the data or a deep network architecture.

The Naive Bayes classifier assumes normally distributed features that are independent of each other given the class. Our preliminary analysis of the data set showed that several features can be approximated by a Gaussian distribution, while others cannot. The latter are most likely the main reason for the poor classification performance. Applying PCA improved the results compared to using the raw data, since it removes correlations in the data.
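
Written out, the two assumptions correspond to the standard Gaussian Naive Bayes model. The symbols are introduced here for illustration only: $y$ is the class, $x_1, \dots, x_d$ are the features, and $\mu_{y,i}$ and $\sigma_{y,i}^2$ are the per-class mean and variance of feature $i$.

$$
P(y \mid x_1, \dots, x_d) \propto P(y) \prod_{i=1}^{d} p(x_i \mid y),
\qquad
p(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_{y,i}^2}} \exp\left( -\frac{(x_i - \mu_{y,i})^2}{2 \sigma_{y,i}^2} \right)
$$

The product encodes the conditional independence assumption, and the density on the right encodes the normality assumption. A feature whose true distribution is far from Gaussian contributes a misleading factor to the product.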

## 8. Outlook
The MLP results could be improved further by a detailed analysis of the training process. In particular, an appropriate learning rate should be determined, since training is usually very sensitive to this parameter. Monitoring the test accuracy over time could indicate whether the model needs more training iterations or has already started to overfit the training data set. To prevent overfitting and achieve better generalization, techniques such as dropout could be included.
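
One possible way to monitor training is sketched below. It tracks the accuracy on a held-out validation split (used here instead of the test set, so that the test score stays unbiased) after every pass over the training data and stops once the accuracy has not improved for a few passes. The data, the learning rate value, and the patience of 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic binary problem, split into a training and a validation part.
rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:1000], y[:1000]
X_val, y_val = X[1000:], y[1000:]

clf = MLPClassifier(hidden_layer_sizes=(20, 20), learning_rate_init=0.001,
                    random_state=0)
classes = np.array([0, 1])

history = []                       # validation accuracy after each pass
best_acc, patience, stale = 0.0, 3, 0
for epoch in range(30):
    # One pass over the training data in batches of 100 instances.
    for start in range(0, len(X_train), 100):
        stop = start + 100
        clf.partial_fit(X_train[start:stop], y_train[start:stop], classes=classes)
    acc = clf.score(X_val, y_val)
    history.append(acc)
    if acc > best_acc:
        best_acc, stale = acc, 0
    else:
        stale += 1
        if stale >= patience:
            break                  # no improvement for `patience` passes
```

Repeating the loop for several values of `learning_rate_init` and comparing the resulting `history` curves would be one simple way to look for a suitable learning rate.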

For the Naive Bayes classifier, a more careful selection of features could improve performance. Features that are not normally distributed (e.g. mass) should be removed before training.
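
A simple heuristic for such a selection is sketched below: it keeps a feature only if its sample skewness and excess kurtosis are close to zero, as they would be for a Gaussian. The function names and both thresholds are arbitrary assumptions, and a formal normality test or a visual check of the histograms may be more appropriate in practice.

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Sample skewness and excess kurtosis; both are about 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3)), float(np.mean(z ** 4) - 3.0)

def roughly_gaussian(x, max_skew=0.5, max_kurtosis=1.0):
    """Heuristic check; the thresholds are illustrative, not tuned values."""
    skew, kurtosis = skew_and_excess_kurtosis(x)
    return abs(skew) <= max_skew and abs(kurtosis) <= max_kurtosis

# Toy features: one Gaussian, one strongly skewed.
rng = np.random.default_rng(0)
features = {
    "gaussian_like": rng.normal(size=5000),
    "skewed": rng.exponential(size=5000),
}

# Keep only the features that pass the heuristic.
keep = [name for name, values in features.items() if roughly_gaussian(values)]
```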