# Project Pipeline
Execute the cells step by step to obtain a prediction and score for your configuration.
In order to make the widgets work you might need to execute 
```
jupyter nbextension enable --py --sys-prefix widgetsnbextension
```
on your system.

## 1. Imports

In [10]:
import importlib
import data_utils
import widget_ui as ui

# Load potential changes in the modules during development
data_utils = importlib.reload(data_utils)
ui = importlib.reload(ui)

## 2. Load Data Sets
Load the training and test data. In addition, load a small sample set for quickly calculating transformations such as PCA or for performing test runs during development.

In [11]:
data = data_utils.load_from('data/all_train.csv', 'data/all_test.csv')
sample_data = data_utils.load_sample_data('data/all_sample.csv')

## 3. Preprocessing

In [12]:
# Dictionary that stores the initialized preprocessors for later use
preprocessors = {}

### 3.1 Remove Correlations
Choose between two methods to remove correlations: Pearson correlation coefficient or PCA.

__Pearson correlation coefficient:__ Find correlated pairs of features and remove one feature from each pair. All pairs with a Pearson correlation coefficient above the selected threshold are considered correlated.

__PCA:__ Apply a principal component analysis. The number of principal components can be chosen.
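
The Pearson option can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name `drop_correlated_features` is hypothetical, and using the absolute value of the coefficient is an assumption.

```python
import numpy as np
import pandas as pd


def drop_correlated_features(df, threshold):
    """Drop one feature from every pair whose absolute Pearson
    correlation exceeds `threshold` (hypothetical helper)."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it is correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),             # independent of "a" and "b"
})
reduced, dropped = drop_correlated_features(toy, threshold=0.6)
print(dropped)  # ['b']
```

Lowering the threshold marks more pairs as correlated and therefore removes more features.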

In [13]:
ui.preprocessors_ui(preprocessors, sample_data)
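
For the PCA option, one plausible workflow is to fit the transformation once on the small sample set and then reuse it for every batch. The sketch below uses scikit-learn's `PCA` on random stand-in data; the array shapes and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 10))  # stand-in for the sample set
batch = rng.normal(size=(50, 10))     # stand-in for one batch of training data

pca = PCA(n_components=3)             # number of principal components to keep
pca.fit(sample)                       # fit once on the small sample
reduced = pca.transform(batch)        # apply the same projection to each batch
print(reduced.shape)  # (50, 3)
```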

## 4. Configure Classifiers
Both classifiers are implemented to learn iteratively on batches of data. In the _offline_ setting, this method enables us to process large amounts of data that do not fit in memory at the same time. In the _online_ setting, it allows for continuous updates of the classifier whenever new data arrives. For efficiency, it is useful to wait until enough data has arrived to form a batch rather than processing single instances.
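
In scikit-learn, this kind of batch-wise learning is available through the `partial_fit` method. The following sketch shows the general pattern on synthetic data; the helper `train_in_batches` and the toy data are hypothetical and only stand in for the project's own training code.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def train_in_batches(clf, X, y, batch_size, classes):
    """Feed the data to `clf` one batch at a time (hypothetical helper)."""
    for start in range(0, len(y), batch_size):
        stop = start + batch_size
        # All class labels must be known on the first call to partial_fit.
        clf.partial_fit(X[start:stop], y[start:stop], classes=classes)
    return clf


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(500, 4)),
               rng.normal(1, 1, size=(500, 4))])
y = np.array([0] * 500 + [1] * 500)
order = rng.permutation(len(y))       # shuffle so every batch contains both classes
X, y = X[order], y[order]

clf = train_in_batches(GaussianNB(), X, y, batch_size=100, classes=[0, 1])
print(clf.score(X, y))
```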

In [14]:
# Dictionary that stores the configured classifiers for later use
classifiers = {}

### 4.1 Multilayer Perceptron
The implementation is based on the [Scikit-learn MLP module](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). Even though multilayer perceptrons have many parameters that can be adapted, we decided to vary only the general structure, because this gives us enough diversity for estimating the performance of this kind of model and keeps the amount of testing manageable.

__Num hidden layers:__ This parameter defines the number of hidden layers for the MLP.

__Hidden layer sizes:__ For each hidden layer, the size can individually be specified.

In [15]:
ui.mlp_ui(classifiers)
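
As a rough illustration of how the two parameters map onto scikit-learn, the number of hidden layers is simply the length of `hidden_layer_sizes`. The helper `build_mlp` and the toy data below are hypothetical and not part of `widget_ui`.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def build_mlp(layer_sizes):
    """Create an MLP from a list of hidden layer sizes (hypothetical helper)."""
    # len(layer_sizes) plays the role of "Num hidden layers".
    return MLPClassifier(hidden_layer_sizes=tuple(layer_sizes), random_state=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data with 5 features
y = (X[:, 0] > 0).astype(int)

mlp = build_mlp([20, 20])
mlp.partial_fit(X, y, classes=[0, 1])  # a single update already creates the weights
print([w.shape for w in mlp.coefs_])   # [(5, 20), (20, 20), (20, 1)]
```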

### 4.2 Naive Bayes
The implementation is based on the [Scikit-learn Naive Bayes module](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). To work with the given continuous features, the classifier assumes that the features follow Gaussian distributions. The only parameter of this Bayesian model is the distribution of class priors; since both classes in the HEPMASS data set have the same number of samples, the priors were fixed to [0.5, 0.5].

In [16]:
ui.nb_ui(classifiers)
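
Fixing the priors corresponds to the `priors` argument of `GaussianNB`. The small sketch below, using made-up and deliberately imbalanced data, shows that the fixed priors are kept instead of being estimated from the class frequencies.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.0], [0.2], [0.1], [3.0]])
y = np.array([0, 0, 0, 1])            # three samples of class 0, one of class 1

nb = GaussianNB(priors=[0.5, 0.5])    # fixed priors, as described above
nb.fit(X, y)
print(nb.class_prior_)                # the fixed priors, not the 0.75 / 0.25 frequencies
```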

## 5. Training
The red button below starts a training run with the selected parameters. As can be seen in the following code snippet, training is performed in an online manner, even though the data set is available offline, for the reasons given in section 4.

```python
import numpy as np
import pandas as pd

# data, preprocessors, classifier, window_size and train_on_window
# are defined elsewhere in the project.

# create empty window
window = np.zeros((0, len(data.train_data.columns)))

# process instances in a continuous stream
for _, row in data.train_data.iterrows():
    # add the row's values to the window
    window = np.append(window, [row.values], axis=0)

    if len(window) == window_size:
        train_on_window(preprocessors, classifier, pd.DataFrame(window, columns=data.train_data.columns))

        # reset window
        window = np.zeros((0, len(data.train_data.columns)))
```

The user can specify each of the following parameters:

__Use fast sample data:__ Instead of the whole training set, use a small subsample for quick test runs. This option should only be used in the development process, because the sample set is not very representative.

__Window size:__ This parameter defines the size of a window/batch for the training process. Batches are collected until they reach this size and then they are given to the classifier for training.

__Classifier:__ Select one of the previously configured classifiers.
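
When the data is already available as a DataFrame, the windowing can also be expressed with positional slices. The generator below is a sketch under that assumption; `iter_windows` and its `keep_remainder` flag are hypothetical names. Note that the snippet above only trains on full windows, so whether the final, shorter window should be used is a design choice.

```python
import numpy as np
import pandas as pd


def iter_windows(df, window_size, keep_remainder=True):
    """Yield consecutive row windows of `df` as DataFrames (hypothetical helper)."""
    for start in range(0, len(df), window_size):
        window = df.iloc[start:start + window_size]
        # The last window may be shorter than window_size.
        if len(window) == window_size or keep_remainder:
            yield window


frame = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["f0", "f1"])
sizes = [len(w) for w in iter_windows(frame, window_size=4)]
print(sizes)  # [4, 4, 2]
```

Slicing avoids growing an array row by row, which matters for large training sets.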

In [17]:
ui.training_ui(data, sample_data, preprocessors, classifiers)

Training MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=[20, 20], learning_rate='constant',
       learning_rate_init=0.001, max_iter=400, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=True)


Time taken: 0.9007759094238281


## 6. Prediction
Clicking the blue button uses the previously trained model to predict labels for the test data. As in the training step, an online strategy is used and the same parameters can be specified.

__Save prediction to file:__ The resulting confusion matrix can be saved to a file in the local `results/` directory.

In [18]:
ui.prediction_ui(data, sample_data, preprocessors, classifiers)

Predict with MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=[20, 20], learning_rate='constant',
       learning_rate_init=0.001, max_iter=400, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=True)


[[   60.  3485.]
 [   12.  3388.]]
Accuracy: 0.4965
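
The reported accuracy follows directly from the confusion matrix: it is the sum of the diagonal divided by the total number of instances. The check below also computes the per-class recall, assuming that rows correspond to true labels and columns to predictions (the scikit-learn convention, which the output above does not state explicitly).

```python
import numpy as np

confusion = np.array([[60.0, 3485.0],
                      [12.0, 3388.0]])

# Correct predictions lie on the diagonal.
accuracy = np.trace(confusion) / confusion.sum()
# Assumes rows are true labels: fraction of each class that was recognized.
per_class_recall = np.diag(confusion) / confusion.sum(axis=1)

print(round(float(accuracy), 4))  # 0.4965
print(per_class_recall)
```

Under that assumption, almost all instances in this quick run are assigned to the second class, which explains an accuracy close to 0.5.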


## 7. Results
A comparison of different parameter settings in terms of test accuracy can be found in the following table:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;border-color:#aabcfe;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#aabcfe;color:#669;background-color:#e8edff;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:#aabcfe;color:#039;background-color:#b9c9fe;}
.tg .tg-1d7g{font-weight:bold;font-size:16px;vertical-align:top}
.tg .tg-qv16{font-weight:bold;font-size:16px;text-align:center;vertical-align:top}
.tg .tg-qgsu{font-size:15px;vertical-align:top}
.tg .tg-yw4l{vertical-align:top}
.tg .tg-6k2t{background-color:#D2E4FC;vertical-align:top}
.tg .tg-2nw2{background-color:#D2E4FC;font-size:15px;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-yw4l">Preprocessing</th>
    <th class="tg-yw4l">Setting</th>
    <th class="tg-1d7g">Naive Bayes<br></th>
    <th class="tg-qv16" colspan="4">Multilayer Perceptron (hidden layer sizes)<br></th>
  </tr>
  <tr>
    <td class="tg-yw4l"></td>
    <td class="tg-6k2t"></td>
    <td class="tg-qgsu"></td>
    <td class="tg-2nw2">20, 20<br></td>
    <td class="tg-qgsu">40, 20<br></td>
    <td class="tg-2nw2">20, 20, 20, 20<br></td>
    <td class="tg-qgsu">20^10<br></td>
  </tr>
  <tr>
    <td class="tg-1d7g">None<br></td>
    <td class="tg-2nw2"></td>
    <td class="tg-yw4l">0.7956</td>
    <td class="tg-6k2t">0.8447</td>
    <td class="tg-yw4l">0.8427</td>
    <td class="tg-6k2t">0.8390</td>
    <td class="tg-yw4l">0.846</td>
  </tr>
  <tr>
    <td class="tg-1d7g" rowspan="3">PCA</td>
    <td class="tg-2nw2">3</td>
    <td class="tg-yw4l">0.8002</td>
    <td class="tg-6k2t">0.7979</td>
    <td class="tg-yw4l">0.7955</td>
    <td class="tg-6k2t">0.8032</td>
    <td class="tg-yw4l">0.8003</td>
  </tr>
  <tr>
    <td class="tg-2nw2">6</td>
    <td class="tg-yw4l">0.7986</td>
    <td class="tg-6k2t">0.8064</td>
    <td class="tg-yw4l">0.8025</td>
    <td class="tg-6k2t">0.8156</td>
    <td class="tg-yw4l">0.812</td>
  </tr>
  <tr>
    <td class="tg-2nw2">15</td>
    <td class="tg-yw4l">0.8047</td>
    <td class="tg-6k2t">0.8319</td>
    <td class="tg-yw4l">0.8305</td>
    <td class="tg-6k2t">0.8311</td>
    <td class="tg-yw4l">0.8257</td>
  </tr>
  <tr>
    <td class="tg-1d7g" rowspan="3">Pearson</td>
    <td class="tg-2nw2">0.6 (4 features removed)<br></td>
    <td class="tg-yw4l">0.7622</td>
    <td class="tg-6k2t">0.8044</td>
    <td class="tg-yw4l">0.825</td>
    <td class="tg-6k2t">0.7984</td>
    <td class="tg-yw4l">0.8073</td>
  </tr>
  <tr>
    <td class="tg-2nw2">0.4 (9 features removed)<br></td>
    <td class="tg-yw4l">0.7514</td>
    <td class="tg-6k2t">0.7971</td>
    <td class="tg-yw4l">0.7806</td>
    <td class="tg-6k2t">0.7976</td>
    <td class="tg-yw4l">0.8082</td>
  </tr>
  <tr>
    <td class="tg-2nw2">0.2 (16 features removed)<br></td>
    <td class="tg-yw4l">0.5273</td>
    <td class="tg-6k2t">0.5456</td>
    <td class="tg-yw4l">0.5255</td>
    <td class="tg-6k2t">0.5603</td>
    <td class="tg-yw4l">0.5693</td>
  </tr>
</table>

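The multilayer perceptron generally outperformed the Naive Bayes classifier. Only with PCA reduced to 3 components did Naive Bayes reach a comparable accuracy (0.8002 versus 0.7955 to 0.8032 for the MLP configurations).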

The MLP's ability to work well on raw data is notable: its best results were achieved without preprocessing. The more features we removed, the more the accuracy on the test set decreased. Since the training set contains several million instances, the MLP can still learn its parameters despite the many data dimensions or a deep network architecture.

The Naive Bayes classifier expects normally distributed features that are independent of each other given the class. Our preliminary analysis of the data set showed that several features could be approximated by a Gaussian distribution, while others could not. The latter are most likely the main reason for the poorer classification performance. Applying PCA slightly improved the results compared to using the raw data (up to 0.8047 versus 0.7956), since it removes correlations in the data.

## 8. Outlook
The MLP results could be improved further by a detailed analysis of the training process. In particular, an appropriate learning rate should be determined, since training is usually very sensitive to this parameter. Monitoring the test accuracy over time could indicate whether the model needs more training iterations or has already started to overfit the training data. To prevent overfitting and improve generalization, techniques such as dropout could be included.
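
Such monitoring could look roughly like the sketch below, which records the accuracy after every pass over the training data. Everything here is illustrative: the data is synthetic, the batch size and number of passes are arbitrary, and a separate held-out validation split is used instead of the test set (an assumption, so that the test set stays untouched).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, y_train = X[:1600], y[:1600]
X_val, y_val = X[1600:], y[1600:]      # held-out split used only for monitoring

mlp = MLPClassifier(hidden_layer_sizes=(20, 20), learning_rate_init=0.001,
                    random_state=0)
history = []
for epoch in range(20):
    for start in range(0, len(y_train), 200):
        stop = start + 200
        mlp.partial_fit(X_train[start:stop], y_train[start:stop], classes=[0, 1])
    # Record the held-out accuracy after each pass over the training data.
    history.append(mlp.score(X_val, y_val))

best_epoch = int(np.argmax(history))
print(best_epoch, history[best_epoch])
```

A curve that keeps rising suggests more training is worthwhile, while a curve that peaks and then falls hints at overfitting. Repeating the loop for several values of `learning_rate_init` gives a simple way to compare learning rates.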

For the Naive Bayes classifier, a more careful selection of features could help to improve the performance. Features that are not normally distributed (e.g. mass) should be removed before training.
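
One simple way to screen features for normality is to look at their sample skewness and excess kurtosis, both of which are close to zero for Gaussian data. The sketch below uses synthetic features, and the thresholds are hypothetical values that would need tuning on the real data.

```python
import numpy as np


def normality_scores(X):
    """Return per-feature sample skewness and excess kurtosis."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)
    skew = (z ** 3).mean(axis=0)
    excess_kurtosis = (z ** 4).mean(axis=0) - 3.0
    return skew, excess_kurtosis


rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=5000),       # approximately Gaussian feature
    rng.exponential(size=5000),  # clearly skewed feature
])
skew, kurt = normality_scores(X)
# Hypothetical thresholds: keep features that look roughly Gaussian.
keep = (np.abs(skew) < 0.5) & (np.abs(kurt) < 1.0)
print(keep)  # [ True False]
```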