<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/model_selection/Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model selection

Machine learning offers a wide variety of models, such as decision trees, random forests and neural networks.
Which of them is suitable depends on the problem. Here is a small overview of the most common ones:

<img src="../images/model_selection.png"/>

If we have labeled training data, we can choose between many different supervised methods. If we don't have labels, we have to use unsupervised methods like clustering.

Depending on the problem, some models fit better than others. For a simple problem it makes sense to use a simpler model like a decision tree, because more complex models like neural networks can overfit. In contrast, these complex models perform better on complex, non-linear problems. In this notebook we show some models and how they perform on different tasks.
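The overfitting effect mentioned above can be made visible with a small experiment. The following is a minimal sketch and not part of the DataFactory: it uses synthetic data from scikit-learn and takes the depth of a decision tree as a stand-in for model complexity. The dataset parameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns the label of 20% of the samples at random,
# so there is noise that a very flexible model can memorize.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

scores = {}
for name, depth in [("shallow", 2), ("unrestricted", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each tree.
    scores[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(name, scores[name])
```

A large gap between training and test accuracy indicates that the model has memorized noise in the training data instead of learning a rule that generalizes.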

In [1]:
! git clone https://github.com/sdsc-bw/model_selection.git
! ls

Cloning into 'model_selection'...
remote: Repository not found.
fatal: repository 'https://github.com/sdsc-bw/model_selection.git/' not found
The command "ls" is either misspelled or
could not be found.


## Import packages

In [1]:
root = '../'

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.datasets import load_iris, load_wine, fetch_covtype
import sys
sys.path.insert(0, root + "codes")

from DataFactory import DataFactory
from util import compare_models

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Load dataset: Titanic dataset

The first dataset is the [Titanic dataset](https://www.kaggle.com/c/titanic-dataset/data) from Kaggle. It contains the following information:
- __passenger_id__ unique identifier for each passenger
- __pclass__ class of the passenger  (1 = 1st; 2 = 2nd; 3 = 3rd)
- __name__ name of the passenger
- __sex__ sex of the passenger
- __age__ age of the passenger in years
- __sibsp__ number of siblings/spouses aboard
- __parch__ number of parents/children aboard
- __ticket__ number of the ticket
- __fare__ passenger fare in British pounds
- __cabin__ cabin of the passenger
- __embarked__ port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- __boat__ lifeboat
- __body__ body identification number
- __home.dest__ home/destination
- __survived__ whether the person survived

In [4]:
df_titanic = pd.read_csv('../data/titanic.csv')

In [5]:
df_titanic.head()

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q,13.0,,,1.0
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,,S,,,Croatia,0.0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.15,,S,,,,0.0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0,,S,4.0,,"Cornwall / Akron, OH",1.0
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0,,S,,,"Barre, Co Washington, VT",0.0


The preview shows that several attributes (e.g. age, cabin, boat and body) contain missing values. As in the previous demo, we have to preprocess the data first.
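To quantify how many values are missing per column, pandas can count them directly. The sketch below is self-contained and therefore only re-types three columns of the five preview rows shown above; on the real data, the same expression could be applied to `df_titanic`.

```python
import numpy as np
import pandas as pd

# Three columns of the five preview rows, typed in by hand for illustration.
preview = pd.DataFrame({
    "age": [np.nan, 38.0, 30.0, 54.0, 40.0],
    "cabin": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "boat": [13.0, np.nan, np.nan, 4.0, np.nan],
})

# isna() marks missing cells, mean() turns the marks into a share per column.
missing_share = preview.isna().mean().sort_values(ascending=False)
print(missing_share)
```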

In [6]:
datafactory = DataFactory()
dfx_titanic, dfy_titanic = datafactory.preprocess(df_titanic, y_col='survived')

2021-11-16 18:39:19,919 - DataFactory - INFO - Remove columns with NAN-values of target feature: survived
2021-11-16 18:39:19,924 - DataFactory - INFO - Start to transform the categorical columns...
2021-11-16 18:39:19,931 - DataFactory - INFO - Start with one-hot encoding of the following categoric features: ['sex', 'embarked']...
2021-11-16 18:39:19,933 - DataFactory - INFO - ...End with one-hot encoding
2021-11-16 18:39:19,934 - DataFactory - INFO - Start label encoding of the following categoric features: ['name', 'ticket', 'cabin', 'boat', 'home.dest']...
2021-11-16 18:39:19,942 - DataFactory - INFO - ...End with label encoding
2021-11-16 18:39:19,944 - DataFactory - INFO - ...End with categorical feature transformation
2021-11-16 18:39:19,945 - DataFactory - INFO - Start to clean the given dataframe...
2021-11-16 18:39:19,950 - DataFactory - INFO - Number of INF- and NAN-values are: (0, 952)
2021-11-16 18:39:19,950 - DataFactory - INFO - Set type to float32 at first && deal with 

## Load dataset: iris dataset

The second dataset is the [iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) from sklearn. It contains the following information:
- __sepal length__ sepal length of the iris in cm
- __sepal width__ sepal width of the iris in cm
- __petal length__ petal length of the iris in cm
- __petal width__ petal width of the iris in cm
- __species__ species of the iris (0 = setosa; 1 = versicolor; 2 = virginica)

In [7]:
data = load_iris()
df_iris = pd.DataFrame(data.data, columns=data.feature_names)
df_iris['species'] = pd.Series(data.target)

In [8]:
df_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


## Load dataset: wine dataset

The third dataset is the [wine dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) from sklearn. It contains the following information:
- 13 features
- 3 classes

In [9]:
data = load_wine()
df_wine = pd.DataFrame(data.data, columns=data.feature_names)
df_wine['class'] = pd.Series(data.target)

In [10]:
# needs no preprocessing
df_wine.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,class
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


## Load dataset: covertype

The fourth dataset is the [covertype dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_covtype.html) from sklearn. It contains the following information:
- 54 features
- 7 classes

In [11]:
data = fetch_covtype()
df_covtype = pd.DataFrame(data.data, columns=data.feature_names)
df_covtype['type'] = pd.Series(data.target)

In [12]:
# needs no preprocessing
df_covtype.head()

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39,type
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


## Models

Machine learning offers a wide variety of models. In this section we present the most common ones.

### Decision tree

A decision tree is one of the simplest models. Every inner node represents a logical rule (e.g. whether a feature is smaller than a certain threshold). To classify a sample, we start at the root and, depending on the sample's feature values, follow the left or right child node until we reach a leaf, which holds the predicted class.

<img src="../images/decision_tree2.png"/>

With the DataFactory, we can select a model (e.g. a decision tree for classification) and fine-tune it to achieve the best results. The algorithm builds multiple decision trees with different hyperparameters and returns the decision tree with the best score.
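How the DataFactory performs this search internally is not shown in this notebook. As a rough illustration of the idea, the following sketch uses scikit-learn's `GridSearchCV` as a stand-in; the parameter grid and the scoring metric are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical search space: every combination is trained and cross-validated.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 over the three iris classes
    cv=5,
)
search.fit(X, y)

# best_estimator_ is the tree refitted on all data with the best parameters.
print(search.best_params_, round(search.best_score_, 3))
```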

### Random forest

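A random forest consists of multiple different decision trees, each trained on a random sample of the training data. The final prediction combines the predictions of the individual trees: a majority vote (or averaged class probabilities) for classification and the average for regression.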
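The combination step can be reproduced by hand. The sketch below uses scikit-learn's `RandomForestClassifier`, which averages the class probabilities of its trees; the number of trees is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect the class probabilities of every single tree: shape (trees, samples, classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees and pick the most probable class per sample.
manual_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, forest.predict_proba(X)))
```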

<img src="../images/random_forest.png"/>

### AdaBoost

Like a random forest, AdaBoost uses multiple decision trees to make a prediction. However, the trees are built one after another, and each new tree depends on the previous ones: samples that were predicted badly so far receive a higher weight, so the new tree focuses on them.
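The reweighting idea can be sketched in a few lines. This is a simplified version of binary AdaBoost with depth-1 trees on synthetic data, written for illustration only; it is not the implementation used by the DataFactory, and the data and the number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {-1, +1}

weights = np.full(len(y), 1 / len(y))  # start with equal sample weights
stumps, alphas = [], []
for _ in range(30):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = weights[pred != y].sum()  # weighted training error
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this tree
    # Misclassified samples (y * pred == -1) gain weight, the others lose weight.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of all tree votes.
votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_accuracy = np.mean(np.sign(votes) == y)
print(len(stumps), train_accuracy)
```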

<img src="../images/adaboost.png"/>

### GBDT

Gradient Boosting Decision Tree (GBDT) also uses multiple decision trees. But instead of averaging the predictions of the trees, their predictions are summed: each new tree is fitted to the remaining error (the residual) of the trees built before it.
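For the squared error loss, this procedure reduces to repeatedly fitting a small regression tree to the current residuals. The following sketch shows that special case on synthetic data; the learning rate, tree depth and number of rounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start with a constant prediction
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(30):
    residual = y - prediction  # error of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add the scaled correction
    mse_history.append(np.mean((y - prediction) ** 2))

print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

The training error shrinks with every round, which is why the number of trees and the learning rate are usually tuned together to avoid overfitting.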

<img src="../images/gbdt.png"/>

### K-Nearest Neighbour

To classify a sample with the K-Nearest Neighbour (KNN) algorithm, we look at its k nearest neighbours in the training data and determine their most frequent class. The sample is then assigned to this class.
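The rule is short enough to write out directly. The sketch below uses the Euclidean distance and a tiny hand-made training set; both are assumptions for illustration, and `knn_predict` is a hypothetical helper, not a DataFactory function.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Return the most frequent class among the k training samples closest to x."""
    distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every sample
    nearest = np.argsort(distances)[:k]              # indices of the k closest samples
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.1, 0.1])))
print(knn_predict(X_train, y_train, np.array([3.0, 3.1])))
```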

<img src="../images/knn.png"/>

### Support Vector Machine

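The Support Vector Machine (SVM) creates a hyperplane that separates the samples of the classes. To find the best hyperplane, it maximizes the margin, i.e. the distance between the hyperplane and the nearest samples of either class. If the classes cannot be separated by a hyperplane in the original feature space, a kernel function implicitly maps the data into a higher-dimensional space, which corresponds to introducing additional features.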
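The effect of the additional dimensions can be seen on data that is not linearly separable. The sketch below compares a linear and an RBF kernel of scikit-learn's `SVC` on two concentric circles; the dataset and its parameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print("linear:", linear_acc, "rbf:", rbf_acc)
```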

<img src="../images/svm.png"/>

### Neural Network

A neural network, here a multilayer perceptron, is one of the most powerful models. It consists of one input layer, one output layer and one or more hidden layers in between. Each layer consists of neurons that are connected to the neurons of the previous layer by edges. The data is fed into the input layer and passes through the network to the output layer. Along each edge, a value is multiplied by the weight of that edge. At each neuron, the weighted inputs are summed, a bias is added and an activation function is applied. The output layer returns the prediction.
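The forward pass described above can be written with plain NumPy. This is a minimal sketch with random, untrained weights; the layer sizes (4 inputs, hidden layers of 9, 9 and 5 neurons, 1 output) and the activation functions are assumptions for illustration.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, layers):
    """Pass a batch x through a list of (weights, bias) pairs."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = a @ W + b  # weight the inputs along the edges, then add the bias
        is_output = i == len(layers) - 1
        a = sigmoid(z) if is_output else relu(z)  # apply the activation function
    return a


rng = np.random.default_rng(0)
sizes = [4, 9, 9, 5, 1]
layers = [(rng.normal(size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes, sizes[1:])]

batch = rng.normal(size=(3, 4))  # three samples with four features each
output = forward(batch, layers)
print(output.shape)
```

Training would then adjust the weights and biases so that the outputs match the labels, which is what `model.fit` does in Keras.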

<img src="../images/neural_network.png"/>

In [13]:
# TODO add NN to datafactory
#model = keras.Sequential(
#    [
#        layers.Dense(9, kernel_initializer = 'uniform', input_dim=X_train.shape[1], activation='relu'),
#        layers.Dropout(0.2),
#        layers.Dense(9, kernel_initializer = 'uniform', activation='relu'),
#        layers.Dropout(0.2),
#        layers.Dense(5, kernel_initializer = 'uniform', activation='relu'),
#        layers.Dropout(0.2),
#        layers.Dense(1, kernel_initializer = 'uniform', activation='sigmoid'),  # sigmoid output for binary_crossentropy
#    ]
#)

#model.summary()

In [14]:
#model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
#training = model.fit(X_train, y_train, epochs=200, batch_size=64, verbose=0)

In [15]:
#test_acc = model.evaluate(X_test, y_test, batch_size=64, verbose=0)[1]

In [16]:
#test_acc

We can see that the accuracy of the neural network is worse than that of the other models. Although neural networks are more powerful, applying them to problems that are too simple can lead to worse performance than simpler models.

## Comparison of the Models

Not every model fits every problem. Here we can see the F1 scores of several models on different datasets.

The F1 score is the harmonic mean of precision and recall:
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The higher the F1 score, the better the prediction. Precision and recall are defined as:

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$
TP: true positives, FP: false positives, FN: false negatives
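As a small worked example of these formulas, the sketch below counts TP, FP and FN for a made-up binary prediction and compares the result with scikit-learn's `f1_score`. The label vectors are invented for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up ground truth and predictions for a binary problem.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, actually positive
fp = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, actually negative
fn = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, actually positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, precision, recall, f1)
print(f1_score(y_true, y_pred))
```

Here TP = 3, FP = 1 and FN = 1, so precision and recall are both 0.75 and the F1 score is 0.75 as well.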

In [None]:
results = compare_models(['decision_tree', 'random_forest', 'adaboost', 'gbdt', 'svm', 'knn'])

2021-11-16 18:39:20,769 - DataFactory - INFO - Remove columns with NAN-values of target feature: survived
2021-11-16 18:39:20,773 - DataFactory - INFO - Start to transform the categorical columns...
2021-11-16 18:39:20,777 - DataFactory - INFO - Start with one-hot encoding of the following categoric features: ['sex', 'embarked']...
2021-11-16 18:39:20,780 - DataFactory - INFO - ...End with one-hot encoding
2021-11-16 18:39:20,780 - DataFactory - INFO - Start label encoding of the following categoric features: ['name', 'ticket', 'cabin', 'boat', 'home.dest']...
2021-11-16 18:39:20,787 - DataFactory - INFO - ...End with label encoding
2021-11-16 18:39:20,789 - DataFactory - INFO - ...End with categorical feature transformation
2021-11-16 18:39:20,790 - DataFactory - INFO - Start to clean the given dataframe...
2021-11-16 18:39:20,794 - DataFactory - INFO - Number of INF- and NAN-values are: (0, 952)
2021-11-16 18:39:20,794 - DataFactory - INFO - Set type to float32 at first && deal with 

2021-11-16 18:42:52,103 - DataFactory - INFO - ...End search
2021-11-16 18:42:52,103 - DataFactory - INFO - Best parameters are: {'weights': 'uniform', 'n_neighbors': 14} with score 0.77
2021-11-16 18:42:52,561 - DataFactory - INFO - Remove columns with NAN-values of target feature: type
2021-11-16 18:42:52,733 - DataFactory - INFO - Start to transform the categorical columns...
2021-11-16 18:42:52,809 - DataFactory - INFO - ...End with categorical feature transformation
2021-11-16 18:42:52,827 - DataFactory - INFO - Start to clean the given dataframe...
2021-11-16 18:42:52,939 - DataFactory - INFO - Number of INF- and NAN-values are: (0, 0)
2021-11-16 18:42:52,939 - DataFactory - INFO - Set type to float32 at first && deal with INF
2021-11-16 18:42:53,420 - DataFactory - INFO - Remove columns with half of NAN-values
2021-11-16 18:42:53,528 - DataFactory - INFO - Remove constant columns
2021-11-16 18:42:53,725 - DataFactory - INFO - ...End with Data cleaning, number of INF- and NAN-val

In [None]:
results