<a href="https://colab.research.google.com/github/sdsc-bw/DataFactory/blob/develop/model_selection/Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model selection

Machine learning offers a wide variety of models, such as decision trees, random forests and neural networks.
Which model fits best depends on the problem. For a simple problem it makes sense to use a simpler model such as a decision tree, because more complex models such as neural networks can overfit. More complex models, in turn, tend to perform better on complex, highly non-linear problems.
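
To make the overfitting risk concrete, here is a minimal sketch on synthetic, noisy data (not the Titanic data used later). It compares a tree that may grow without limit against a depth-limited one. The dataset parameters and variable names are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data; flip_y=0.2 assigns a random
# class to roughly 20% of the samples, which acts as label noise.
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   n_informative=3, flip_y=0.2,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0)

# An unconstrained tree can grow until it memorises the training set,
# while a depth-limited tree is forced to stay simple.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("unconstrained", deep), ("max_depth=3", shallow)]:
    print(f"{name}: train={model.score(X_tr, y_tr):.2f} "
          f"test={model.score(X_te, y_te):.2f}")
```

A large gap between training and test accuracy, as the unconstrained tree typically shows here, is the usual symptom of overfitting.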

## Import packages

In [1]:
root = '../'

In [15]:
import pandas as pd
import sys
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
sys.path.insert(0, root + "codes")

from DataFactory import DataFactory

## Load dataset: titanic dataset

For this demo we use the [Titanic dataset](https://www.kaggle.com/c/titanic-dataset/data) from Kaggle. It contains the following information:
- __passenger_id__ unique identifier for each passenger
- __pclass__ class of the passenger  (1 = 1st; 2 = 2nd; 3 = 3rd)
- __name__ name of the passenger
- __sex__ sex of the passenger
- __age__ age of the passenger in years
- __sibsp__ number of siblings/spouses aboard
- __parch__ number of parents/children aboard
- __ticket__ number of the ticket
- __fare__ passenger fare in British pounds
- __cabin__ cabin of the passenger
- __embarked__ port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- __boat__ lifeboat
- __body__ body identification number
- __home.dest__ home/destination
- __survived__ target: whether the passenger survived (1 = yes; 0 = no)

In [3]:
df = pd.read_csv(root + 'data/titanic.csv')

In [4]:
df.head()

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1216,3,"Smyth, Miss. Julia",female,,0,0,335432,7.7333,,Q,13.0,,,1.0
1,699,3,"Cacic, Mr. Luka",male,38.0,0,0,315089,8.6625,,S,,,Croatia,0.0
2,1267,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,1,1,345773,24.15,,S,,,,0.0
3,449,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0,,S,4.0,,"Cornwall / Akron, OH",1.0
4,576,2,"Veal, Mr. James",male,40.0,0,0,28221,13.0,,S,,,"Barre, Co Washington, VT",0.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  1309 non-null   int64  
 1   pclass        1309 non-null   int64  
 2   name          1309 non-null   object 
 3   sex           1309 non-null   object 
 4   age           1046 non-null   float64
 5   sibsp         1309 non-null   int64  
 6   parch         1309 non-null   int64  
 7   ticket        1309 non-null   object 
 8   fare          1308 non-null   float64
 9   cabin         295 non-null    object 
 10  embarked      1307 non-null   object 
 11  boat          486 non-null    object 
 12  body          121 non-null    float64
 13  home.dest     745 non-null    object 
 14  survived      850 non-null    float64
dtypes: float64(4), int64(4), object(7)
memory usage: 153.5+ KB


The output shows that several attributes have many missing values, for example `cabin`, `boat` and `body`. As in the previous demo, we have to preprocess the data first.
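
As a complement to `df.info()`, the share of missing values per column can be computed directly with pandas. The sketch below uses a tiny made-up frame (`toy` is a hypothetical stand-in, not the Titanic data); the same expression should work on `df`.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the Titanic data.
toy = pd.DataFrame({
    "age": [38.0, np.nan, 30.0, 54.0],
    "cabin": [None, None, "C85", None],
    "fare": [7.7, 8.6, 24.1, 23.0],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```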

In [6]:
datafactory = DataFactory()

In [7]:
X_train, X_test, y_train, y_test = datafactory.preprocess_and_split(df, y_col='survived')

2021-11-02 18:20:20,947 - DataFactory - INFO - Remove columns with na values of target feature survived
2021-11-02 18:20:20,950 - DataFactory - INFO - + Start to transform the categorical columns
2021-11-02 18:20:20,957 - DataFactory - INFO -     Start to do onehot to the following categoric features: ['sex', 'embarked']
2021-11-02 18:20:20,960 - DataFactory - INFO -     End with onehot encoding
2021-11-02 18:20:20,961 - DataFactory - INFO -     Start to do label encoding to the following categoric features: ['name', 'ticket', 'cabin', 'boat', 'home.dest']
2021-11-02 18:20:20,969 - DataFactory - INFO -     End with label encoding
2021-11-02 18:20:20,972 - DataFactory - INFO - - End with categorical feature transformation
2021-11-02 18:20:20,973 - DataFactory - INFO - + Start to clean the given dataframe
2021-11-02 18:20:20,977 - DataFactory - INFO -     number of inf and nan are for dataset: (0, 952)
2021-11-02 18:20:20,978 - DataFactory - INFO -     set type to float32 at first && dea
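
The log output suggests roughly these steps: drop rows without a target value, encode the categorical columns, clean up NaN/inf values and split the data. The following is a minimal sketch of such a pipeline using pandas and scikit-learn. It is not DataFactory's actual implementation: the function name, the `onehot_max_unique` threshold and the median imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def preprocess_and_split_sketch(df, y_col, onehot_max_unique=5, test_size=0.2):
    """Hypothetical stand-in, NOT DataFactory's actual implementation."""
    # Drop rows whose target value is missing.
    df = df.dropna(subset=[y_col])
    y = df[y_col]
    X = df.drop(columns=[y_col])

    # Treat every non-numeric column as categorical.
    cat_cols = [c for c in X.columns
                if not pd.api.types.is_numeric_dtype(X[c])]
    # Assumption: few distinct values -> one-hot, many -> integer labels.
    onehot_cols = [c for c in cat_cols if X[c].nunique() <= onehot_max_unique]
    label_cols = [c for c in cat_cols if c not in onehot_cols]

    for c in label_cols:
        # factorize maps each distinct value to an integer; NaN becomes -1.
        X[c] = pd.factorize(X[c])[0]
    X = pd.get_dummies(X, columns=onehot_cols, dtype=float)

    # Replace inf by NaN, then fill remaining gaps with the column median
    # (assumed imputation strategy).
    X = X.replace([np.inf, -np.inf], np.nan)
    X = X.fillna(X.median()).astype("float32")

    return train_test_split(X, y, test_size=test_size, random_state=0)
```

Called as `preprocess_and_split_sketch(df, 'survived')`, it returns the four parts in the same order as the call above: `X_train, X_test, y_train, y_test`.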

## Models

This section presents some of the most common models.

### Decision tree

A decision tree is one of the simplest models. Every inner node represents a logical rule (e.g. is a feature smaller than a certain value?). To classify a sample, we start at the root and, depending on whether the sample's feature value satisfies the rule, continue with the left or right child node until we reach a leaf, which holds the predicted class.

<img src="../images/decision_tree2.png"/>
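
The traversal can be sketched in a few lines of plain Python. The `Node` class, the `predict_one` helper and the rules of the hand-built tree are all made up for illustration. scikit-learn stores and evaluates its trees differently, but the idea is the same.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[str] = None      # feature tested at this node
    threshold: Optional[float] = None  # rule: sample[feature] <= threshold
    left: Optional["Node"] = None      # followed when the rule is true
    right: Optional["Node"] = None     # followed when the rule is false
    label: Optional[int] = None        # set only on leaf nodes


def predict_one(node, sample):
    # Walk down from the root until a leaf is reached.
    while node.label is None:
        go_left = sample[node.feature] <= node.threshold
        node = node.left if go_left else node.right
    return node.label


# Hand-built toy tree; the rules are invented, not learned from data.
toy_tree = Node(
    feature="sex_male", threshold=0.5,
    left=Node(label=1),
    right=Node(feature="age", threshold=10.0,
               left=Node(label=1), right=Node(label=0)),
)

print(predict_one(toy_tree, {"sex_male": 1, "age": 40.0}))  # prints 0
print(predict_one(toy_tree, {"sex_male": 0, "age": 40.0}))  # prints 1
```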

In [10]:
dt = tree.DecisionTreeClassifier()
dt = dt.fit(X_train, y_train)

In [13]:
predict = dt.predict(X_test)
score = f1_score(y_test, predict)

In [14]:
score

0.9358974358974359
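
The value above is the F1 score, the harmonic mean of precision and recall for the positive class. As a rough sketch of what `f1_score` computes in the binary case, here is a hand-written version. `f1_binary` is a hypothetical helper and the labels are made up.

```python
def f1_binary(y_true, y_pred, positive=1):
    # Count true positives, false positives and false negatives.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]
print(f1_binary(y_true_toy, y_pred_toy))
```

In this toy example precision and recall are both 2/3, so the F1 score is 2/3 as well.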

### Random forest

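A random forest consists of multiple different decision trees, each trained on a random sample of the training data. The final prediction combines the predictions of the individual trees: for classification, scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees and returns the most probable class.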

<img src="../images/random_forest.png"/>
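
The averaging step can be checked by hand. The sketch below trains a small forest on synthetic data (an assumption for illustration, not the Titanic data), averages the per-tree class probabilities manually and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_toy, y_toy)

# Average the class probabilities predicted by the individual trees ...
per_tree = np.stack([t.predict_proba(X_toy) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

# ... and pick the most probable class for each sample.
manual_pred = forest.classes_[averaged.argmax(axis=1)]
print(np.array_equal(manual_pred, forest.predict(X_toy)))
```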

In [18]:
rf = RandomForestClassifier()
rf = rf.fit(X_train, y_train)

In [19]:
predict = rf.predict(X_test)
score = f1_score(y_test, predict)

In [20]:
score

0.9610389610389611

### Neural network

<img src="../images/neural_network.png"/>