<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/demos/03_Model_Selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Selection

Machine learning offers a wide variety of models, such as decision trees, random forests and neural networks.
Which of them are suitable depends on the problem. Here is a small overview of the most common ones:

<img src="../images/model_selection.png"/>

If we have labeled training data, we can choose between many different supervised methods. If we don't have labels, we have to use unsupervised methods like clustering.

Some models also fit a given problem better than others. For a simple problem it makes sense to use a simpler model like a decision tree, because more complex models like neural networks can overfit. Complex models, on the other hand, tend to perform better on non-linear problems. In this notebook we want to show some models and how they perform on different tasks.

We also want to introduce time series (TS) and some state-of-the-art architectures for them. A time series is a sequence of data points in time order. A simple example is the temperature measured over one day.

<img src="../images/ts.png"/>

Time series problems in AI can be divided into two tasks:
- Classification: assign the series to one class (sometimes multiple)
- Regression/Forecasting: use the time series to predict future values
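
For forecasting, a series is usually cut into pairs of past values and future values before a model is trained. The following is a minimal sketch of that idea; the function name `sliding_windows`, its parameters and the temperature values are made up for illustration and are not part of the DataFactory or tsai.

```python
def sliding_windows(series, window, horizon=1):
    """Split a series into (input window, future values) pairs for forecasting."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        targets = series[start + window:start + window + horizon]
        pairs.append((inputs, targets))
    return pairs

# Hypothetical temperature measurements in time order.
temperatures = [12, 13, 15, 18, 21, 22, 20, 17]
pairs = sliding_windows(temperatures, window=3, horizon=1)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each pair uses three consecutive measurements as input and the next measurement as the value to predict.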

There are a variety of state-of-the-art architectures to solve these tasks. In this repo we provide a simple interface to train common architectures (and later to fine-tune them). To do that we use the library [tsai](https://github.com/timeseriesAI/tsai), which provides state-of-the-art techniques for time series.

# Introduction of Different Models

This section presents the most common machine learning models.

### Decision tree

A decision tree is one of the simplest models. Every inner node represents a logical rule (e.g. is a feature smaller than a certain threshold?). Depending on the feature values of the sample to be classified, we continue with the left or the right child node until we reach a leaf, which holds the prediction.

<img src="../images/decision_tree2.png"/>
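
To make the traversal concrete, here is a small sketch of a hand-written tree and a prediction function. The feature names, thresholds and class labels are invented for illustration; a trained tree would learn them from data.

```python
# A hand-written decision tree: each inner node tests one feature against a threshold.
# Feature names, thresholds and classes are made up for illustration.
tree = {
    "feature": "temperature", "threshold": 20.0,
    "left": {"label": "cold"},                      # temperature < 20
    "right": {
        "feature": "humidity", "threshold": 60.0,
        "left": {"label": "pleasant"},              # humidity < 60
        "right": {"label": "muggy"},                # humidity >= 60
    },
}

def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        go_left = sample[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"temperature": 25.0, "humidity": 70.0}))  # muggy
```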

With the DataFactory, we can select a model (e.g. a decision tree for classification) and fine-tune it to achieve the best results. The algorithm builds multiple decision trees with different parameters and returns the one with the best score.

### Random forest

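A random forest consists of multiple different decision trees. The final prediction is the average over the predictions of the individual trees (for classification, typically a majority vote or an average of the predicted class probabilities).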

<img src="../images/random_forest.png"/>
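
The aggregation step can be sketched in a few lines. The predictions below are hypothetical outputs of already trained trees; the helper names `forest_classify` and `forest_regress` are assumptions for this example only.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_predictions):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Average over the numeric predictions of the individual trees."""
    return mean(tree_predictions)

# Hypothetical outputs of five classification trees and three regression trees.
votes = ["cat", "dog", "cat", "cat", "dog"]
values = [2.0, 3.0, 4.0]

print(forest_classify(votes))   # cat
print(forest_regress(values))   # 3.0
```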

### Adaptive Boosting (AdaBoost)

Like a random forest, AdaBoost uses multiple decision trees to make a prediction. But the trees are built one after another, and each new tree depends on the previous one: it focuses on the samples that the previous tree predicted badly.

<img src="../images/adaboost.png"/>
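
How "focusing on badly predicted samples" can work is sketched below with the classic two-class AdaBoost weight update. This is an illustration under assumptions: the function name `adaboost_reweight` and the toy inputs are invented, and library implementations may differ in detail.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: raise the weight of misclassified samples.

    weights: current sample weights (summing to 1)
    correct: booleans, True where the current tree was right
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Influence of this tree in the final weighted vote.
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the tree got the last sample wrong
new_weights, alpha = adaboost_reweight(weights, correct)
print(new_weights, alpha)
```

After the update the misclassified sample carries more weight than the others, so the next tree pays more attention to it.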

### Gradient Boosting Decision Tree (GBDT)

Gradient Boosting Decision Tree (GBDT) also uses multiple decision trees. But instead of averaging the predictions of the trees, their predictions are summed: each decision tree predicts the remaining error of the trees before it.

<img src="../images/gbdt.png"/>
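
The following sketch shows this for a one-dimensional regression problem with squared error, using trees of depth one (stumps). The helper names, the learning rate and the data are assumptions made for this example; real GBDT libraries use deeper trees and more refined loss handling.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: the threshold split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        loss = (sum((r - left_mean) ** 2 for r in left)
                + sum((r - right_mean) ** 2 for r in right))
        if best is None or loss < best[0]:
            best = (loss, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then let every new stump fit the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    # The final prediction is the sum of the (scaled) tree outputs.
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.0, 3.1, 5.0, 5.2]
model = fit_gbdt(xs, ys)

base = sum(ys) / len(ys)
mse_base = sum((y - base) ** 2 for y in ys) / len(ys)
mse_model = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_base, mse_model)
```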

### K-Nearest Neighbour (KNN)

To classify a sample with the K-Nearest Neighbour (KNN) algorithm, we look at the proximity of the sample: we determine the most frequent class among its k nearest neighbours and assign the sample to this class.

<img src="../images/knn.png"/>
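
A minimal sketch of this procedure with Euclidean distance follows. The training points and labels are invented, and `knn_classify` is a name chosen for this example.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, sample, k=3):
    """Return the most frequent label among the k training points closest to `sample`."""
    distances = sorted(
        (math.dist(point, sample), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two hypothetical clusters in a 2-D feature space.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_points, train_labels, (2.0, 2.0)))  # a
```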

### Support Vector Machine

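The Support Vector Machine (SVM) creates a hyperplane that separates the samples of the classes. To find the best hyperplane, it maximizes the distance (margin) between the hyperplane and the nearest samples of either class. If the classes cannot be separated by a hyperplane, the data is mapped into a higher-dimensional space, i.e. additional features are introduced, in which a separating hyperplane may be found.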

<img src="../images/svm.png"/>
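
The effect of an additional feature can be seen in a tiny one-dimensional example. This is only a sketch of the idea, not an SVM solver: the data, the choice of `x**2` as the extra feature and both helper functions are assumptions for illustration.

```python
def split_classes(values, labels):
    """Group the values by their class label (+1 or -1)."""
    positives = [v for v, label in zip(values, labels) if label == 1]
    negatives = [v for v, label in zip(values, labels) if label == -1]
    return positives, negatives

def separable_by_threshold(values, labels):
    """True if a single threshold on `values` splits the two classes."""
    positives, negatives = split_classes(values, labels)
    return max(positives) < min(negatives) or max(negatives) < min(positives)

def max_margin_threshold(values, labels):
    """Midpoint between the closest samples of both classes (assumes separability)."""
    positives, negatives = split_classes(values, labels)
    if max(negatives) < min(positives):
        return (max(negatives) + min(positives)) / 2
    return (max(positives) + min(negatives)) / 2

# Class +1 lies on both sides of class -1, so no threshold on x separates them.
xs = [-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]
squared = [x * x for x in xs]   # the additional feature

print(separable_by_threshold(xs, labels))        # False
print(separable_by_threshold(squared, labels))   # True
print(max_margin_threshold(squared, labels))     # 3.25
```

In the new feature the threshold 3.25 lies exactly in the middle between the nearest samples of either class, which is the maximum-margin choice in one dimension.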

### Neural Networks

A neural network, also called multilayer perceptron, is one of the most powerful models. It consists of one input layer, one output layer and one or more hidden layers in between. Each layer consists of neurons that are connected to the previous layer by edges. The data is fed into the input layer and passes through the network to the output layer. On each edge the data is multiplied by a weight. At each node, the weighted inputs are summed, a bias is added and an activation function is applied. The output layer outputs the prediction.

<img src="../images/neural_network.png"/>
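
The forward pass described above can be written down directly. The sketch below uses a tiny network with arbitrary, hand-picked weights (a trained network would learn them); the function names are chosen for this example.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, plus bias, then activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2 inputs -> 2 hidden neurons -> 1 output neuron; all weights are arbitrary.
hidden_weights, hidden_biases = [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]
output_weights, output_biases = [[1.0, 2.0]], [0.0]

hidden = dense([2.0, 1.0], hidden_weights, hidden_biases, relu)
output = dense(hidden, output_weights, output_biases, sigmoid)

print(hidden)   # [1.0, 0.5]
print(output)
```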

Sometimes a neural network performs worse than the other models. Neural networks are more powerful, but applied to problems that are too simple they may overfit. Therefore model selection is very important.

There are some basic layers and functions that are used in neural networks. We will briefly introduce them now.

### Layers

#### Fully Connected Layer

In a fully connected layer, every neuron of the previous layer has a connection to every neuron of the following layer. A standard neural network consists mainly of these layers. In a convolutional neural network they are generally found at the end.

<img src="../images/fc.png"/>

#### Convolutional Layer

A convolutional layer applies a filter/kernel to a matrix of values. The kernel is shifted over the matrix with a certain step size (stride). In each step, each value of the kernel is multiplied by the matrix value it lies on top of, and the products are summed to one entry of a new matrix. In general the output is smaller than the input, but we can add padding to maintain the input size. During training, the convolutional layer learns the values of the kernel.

<img src="../images/convolution.gif">
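
The sliding-kernel operation, including stride and zero padding, can be sketched in plain Python. The function `conv2d`, the input matrix and the kernels are made up for illustration; like most deep learning libraries, the sketch does not flip the kernel.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the image and sum the element-wise products.

    `padding` adds that many rows/columns of zeros around the image first.
    """
    if padding:
        width = len(image[0]) + 2 * padding
        image = (
            [[0.0] * width for _ in range(padding)]
            + [[0.0] * padding + list(row) + [0.0] * padding for row in image]
            + [[0.0] * width for _ in range(padding)]
        )
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for top in range(0, len(image) - k_rows + 1, stride):
        out_row = []
        for left in range(0, len(image[0]) - k_cols + 1, stride):
            out_row.append(sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k_rows) for j in range(k_cols)
            ))
        out.append(out_row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0],
          [0, -1]]

print(conv2d(image, kernel))             # 3x3 output, smaller than the 4x4 input
print(conv2d(image, kernel, stride=2))   # 2x2 output

# A 3x3 kernel with padding=1 keeps the 4x4 input size.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
same_size = conv2d(image, identity, padding=1)
```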

#### Pooling Layer

The most commonly used pooling layer is the max pooling layer. It normally follows a convolutional layer and an activation function. It also uses a kind of kernel (normally 2x2 with stride=2) that is shifted over the image and returns the maximum value of the area it covers.

<img src="../images/pooling.png"/>
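
As a small sketch, max pooling with a 2x2 window and stride 2 on an invented feature map looks like this (`max_pool2d` is a name chosen for this example):

```python
def max_pool2d(image, size=2, stride=2):
    """Return the maximum of every size x size window, moving by `stride`."""
    return [
        [
            max(image[top + i][left + j] for i in range(size) for j in range(size))
            for left in range(0, len(image[0]) - size + 1, stride)
        ]
        for top in range(0, len(image) - size + 1, stride)
    ]

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]]

print(max_pool2d(feature_map))  # [[6, 4], [7, 9]]
```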

#### Inception Layer

An inception layer applies multiple convolutions with different kernels and pooling layers in parallel. At the end it concatenates the outputs of the different operations. It is more computationally expensive but allows better learning of useful features.

<img src="../images/inception.png"/>
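
The parallel structure can be sketched for a one-dimensional series. This is a strongly simplified illustration: the kernels are fixed instead of learned, all helper names are assumptions, and real inception layers contain further components that are omitted here.

```python
def conv1d_same(series, kernel):
    """1-D convolution with zero padding so the output keeps the input length.

    Assumes an odd kernel length.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(series) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(series))
    ]

def max_pool1d_same(series, size=3):
    """Max pooling with stride 1 that keeps the input length (odd `size`)."""
    pad = size // 2
    padded = [float("-inf")] * pad + list(series) + [float("-inf")] * pad
    return [max(padded[i:i + size]) for i in range(len(series))]

def inception_block(series, kernels):
    """Apply all branches to the same input and stack the results as channels."""
    branches = [conv1d_same(series, kernel) for kernel in kernels]
    branches.append(max_pool1d_same(series))
    return branches

series = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [[1.0], [1.0, 1.0, 1.0], [1.0] * 5]   # three kernel sizes, arbitrary weights
channels = inception_block(series, kernels)

for channel in channels:
    print(channel)
```

Every branch sees the same input, and the output has one channel per branch.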

#### Batch Normalization Layer

Batch normalization is used to make the training of a neural network faster and more stable. It recenters and rescales the input of the layer so that the mean of the output stays close to 0 and the standard deviation close to 1.
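
A minimal sketch of the normalization for one feature over a batch follows. The scale `gamma` and shift `beta` are learnable parameters in a real layer and are fixed here; the function name and the batch values are assumptions for this example.

```python
from statistics import fmean, pstdev, pvariance

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of values to mean 0 and variance 1, then scale and shift."""
    mean = fmean(batch)
    variance = pvariance(batch)
    # eps avoids a division by zero when all values in the batch are equal.
    return [gamma * (x - mean) / (variance + eps) ** 0.5 + beta for x in batch]

batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)

print(normalized)
print(fmean(normalized), pstdev(normalized))
```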

#### Dropout Layer

A dropout layer is used to decrease the risk of overfitting. During training it randomly drops out (turns off) a defined percentage of neurons.

<img src="../images/dropout.png"/>
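
The sketch below shows one common variant, often called inverted dropout, in which the remaining activations are scaled up during training so that nothing has to change at prediction time. The scaling is an assumption added for this example, and `dropout` is a name chosen here.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training.

    Surviving activations are scaled by 1 / (1 - rate) so the expected value
    stays the same. Outside of training the input is returned unchanged.
    Assumes 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)

print(dropped)
print(dropout(activations, rate=0.5, training=False))
```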

### Activation Functions

The activation function of a node is applied to the output of that node. Neural networks are able to solve complex problems because they generally use non-linear activation functions. Common functions are:
- ReLU
- Tanh
- Sigmoid (today hardly used in hidden layers)
- Softmax (commonly after last layer)
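
The four functions listed above are short enough to write out. The sketch uses only the Python standard library; the example scores are arbitrary.

```python
import math

def relu(z):
    """Pass positive values through, set negative values to 0."""
    return max(0.0, z)

def tanh(z):
    """Squash the input into the range (-1, 1)."""
    return math.tanh(z)

def sigmoid(z):
    """Squash the input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(zs)
    exps = [math.exp(z - largest) for z in zs]   # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))    # 0.0 3.0
print(tanh(0.0), sigmoid(0.0))  # 0.0 0.5
probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
```

Softmax is applied to all outputs of a layer at once, which is why it is commonly used after the last layer of a classifier: the largest score gets the highest probability.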

### Common Architectures

#### MLP
The MLP was proposed by Wang et al. in 2016 ([paper](https://arxiv.org/abs/1611.06455)). It stacks three fully connected layers. Each layer consists of 500 neurons and is followed by a dropout layer and a ReLU function. It ends with a softmax layer.

<img src="../images/mlp.png"/>

#### ResNet
ResNet was proposed by Wang et al. in 2016 ([paper](https://arxiv.org/abs/1611.06455)). It consists of three residual blocks. Each block consists of three convolutional layers, each followed by batch normalization and a ReLU function. Every residual block also has a shortcut connection. At the end there is a global average pooling layer and a softmax layer.

<img src="../images/res_net.png"/>

#### InceptionTime
InceptionTime was proposed by Fawaz et al. in 2019 ([paper](https://arxiv.org/abs/1909.04939)). Compared to ResNet, InceptionTime has three inception blocks instead of convolutional layers.

<img src="../images/inception_time.png"/>

#### MiniRocket
MiniRocket was proposed by Dempster et al. in 2021 ([paper](https://arxiv.org/abs/2102.00457)). In contrast to the other methods, it uses a linear classifier. It transforms the input TS with random convolutional kernels and uses the transformed features to train the linear classifier. As a consequence it can be less accurate than the other state-of-the-art methods, but it is much faster to train.
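
The transform step can be illustrated with a strongly simplified sketch. It is not the MiniRocket implementation: the way the kernels and biases are drawn, the missing dilation and the function name `ppv_features` are assumptions for illustration. Each kernel is convolved with the series, and the proportion of positive output values becomes one feature for the linear classifier.

```python
import math
import random

def ppv_features(series, kernels, biases):
    """For every kernel, return the proportion of positive values of (convolution + bias)."""
    features = []
    for kernel, bias in zip(kernels, biases):
        outputs = [
            sum(k * series[i + j] for j, k in enumerate(kernel)) + bias
            for i in range(len(series) - len(kernel) + 1)
        ]
        features.append(sum(o > 0 for o in outputs) / len(outputs))
    return features

rng = random.Random(42)   # fixed seed so the run is repeatable

def random_kernel(length=9, n_positive=3):
    """Kernel with weights -1 and 2 at random positions (an assumption for this sketch)."""
    positive = set(rng.sample(range(length), n_positive))
    return [2.0 if i in positive else -1.0 for i in range(length)]

kernels = [random_kernel() for _ in range(4)]
biases = [rng.uniform(-1.0, 1.0) for _ in range(4)]

series = [math.sin(0.5 * t) for t in range(30)]   # hypothetical input time series
features = ppv_features(series, kernels, biases)
print(features)
```

The resulting feature vector has one entry per kernel. Because the kernels are not trained, only the linear classifier on top of these features has to be fitted, which is what makes this family of methods fast.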