# Basics of building classification models

Classification models predict a discrete variable (a class) from a number of observations. Classification modelling might be less common in (geotechnical) engineering but there are still a number of useful applications. For example, soil type classification from CPT records is a classification problem and it will be studied in this notebook.

The scikit-learn machine learning library includes several model types to define relations between the observations (features) and the outcome (target).

In this notebook, we will show how a simple classification model can be created, how it is trained and how the model accuracy is assessed. The demo is created using data from the Borssele wind farm area.

In [None]:
import numpy as np
import pandas as pd
import sklearn
from plotly import subplots
import plotly.graph_objs as go

We will import the tree module from scikit-learn to build decision trees.

In [None]:
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

## Dataset loading

We can load the data from a .csv file. The data has been preprocessed to link the soil type class according to Robertson (2010) to cone resistance properties. The soil type class was determined from sampling boreholes adjacent to CPT tests. To allow comparison with the Robertson soil type classification charts, the features $ \log (Q_t) $ and $ \log (F_r) $ have already been calculated.

In [None]:
data = pd.read_csv('Data/cleaneddata.csv')

The data contains some $ \infty $ values which can be replaced with NaN values using the ``replace`` method on the dataframe. The NaN values can then be dropped from the data:

In [None]:
data.replace([np.inf, -np.inf], np.nan, inplace=True)
data.dropna(inplace=True)
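The two cleaning steps can be tried on a small toy dataframe. This is a minimal sketch; the values below are hypothetical and are not taken from the Borssele data, only the column names mimic the features used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption): two feature columns with some bad values
toy = pd.DataFrame({
    'log(Qt)': [1.2, np.inf, 0.8, 2.1],
    'log(Fr)': [0.3, 0.1, -np.inf, np.nan],
})

# Step 1: turn +/- infinity into NaN so that one dropna call removes all bad rows
toy_clean = toy.replace([np.inf, -np.inf], np.nan)
# Step 2: drop every row which contains at least one NaN
toy_clean = toy_clean.dropna()

print(len(toy), 'rows before cleaning,', len(toy_clean), 'rows after cleaning')
```

Only rows in which every feature is finite survive, which is what the classifier needs later on.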

In [None]:
data.head()

## Robertson's chart

### Graphical comparison

The features for the Robertson soil type classification chart can be selected and the data can be visualised. It is always good to include the verbose explanation of the class numbers:

In [None]:
robertson_dict = {
    '1': '1 - Sensitive, fine-grained',
    '2': '2 - Organic soils - Peats',
    '3': '3 - Clays, clay to silty clay',
    '4': '4 - Silt mixtures, clayey silt to silty clay',
    '5': '5 - Sand mixtures, silty sand to sandy silt',
    '6': '6 - Sands, clean sands to silty sands',
    '7': '7 - Gravelly sand to sand',
    '8': '8 - Very stiff sand to clayey sand',
    '9': '9 - Very stiff fine grained'
}

In [None]:
fig = subplots.make_subplots(rows=2, cols=3, print_grid=False, subplot_titles=(
    '3 - Clays, clay to silty clay', '4 - Silt mixtures, clayey silt to silty clay',
    '5 - Sand mixtures, silty sand to sandy silt', '6 - Sands, clean sands to silty sands',
    '7 - Gravelly sand to sand', '8 - Very stiff sand to clayey sand'))

soiltypeclass = 3
for i in range(2):
    for j in range(3):
        trace1 = go.Scatter(
            x=data[data['Soil type'] == soiltypeclass]['log(Fr)'],
            y=data[data['Soil type'] == soiltypeclass]['log(Qt)'],
            showlegend=False, mode='markers')
        fig.append_trace(trace1, i+1, j+1)
        soiltypeclass += 1

fig['layout'].update(title='Borssele classification data', height=900, width=1000)  
image_list = []
for i in range(6):
    fig['layout']['xaxis%i' % (i+1)].update(
        title='Fr [%]', range=(-1, 1), tickvals=[-1, 0, 1], ticktext=['0.1', '1', '10'])
    fig['layout']['yaxis%i' % (i+1)].update(
        title='Qt [-]', range=(0, 3), tickvals=[0, 1, 2, 3], ticktext=['1', '10', '100', '1000'])
    image_list.append(
        dict(source='Images/RobertsonFr.png',
             xref='x%i' % (i+1),
             yref='y%i' % (i+1),
             x=-1, y=3,
             sizex=2, sizey=3,
             sizing='stretch', opacity=1,
             layer='below',
        ))
fig['layout'].update(images=image_list)
fig.show()

The comparison with Robertson's chart shows that cohesive soils, silts and clean sands are reasonably well captured. For silty sand and clayey sand, there are a lot of erroneous predictions.

### Confusion matrix

The Robertson chart is effectively a model based on two features: $ \log (Q_t) $ and $ \log (F_r) $. We can evaluate this model. The column ``Soil type Robertson`` contains the calculated soil type according to Robertson's chart. This allows the creation of a <i>confusion matrix</i> which contains the predicted label on the X-axis and the true label on the Y-axis.

The confusion matrix is a very useful instrument for evaluating the effectiveness of a model. The diagonal of the matrix should be as bright as possible, meaning that the model has a high accuracy.
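To make the layout of the matrix concrete, a minimal sketch with hypothetical labels for two classes is given here. The labels are made up for illustration; only ``confusion_matrix`` and its ``labels`` and ``normalize`` arguments come from scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels (assumption): six samples belonging to classes 3 and 4
y_true = [3, 3, 3, 3, 4, 4]
y_pred = [3, 3, 3, 4, 4, 3]

# Raw counts: rows are the true classes, columns are the predicted classes
cm_counts = confusion_matrix(y_true, y_pred, labels=[3, 4])
# normalize='true' divides each row by the number of samples of that true class
cm_norm = confusion_matrix(y_true, y_pred, labels=[3, 4], normalize='true')

print(cm_counts)
print(cm_norm)
```

With ``normalize='true'`` each row sums to 1, so the diagonal can be read as the fraction of each true class which is predicted correctly.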

In [None]:
cf_robertson = confusion_matrix(
    data['Soil type'], data['Soil type Robertson'], labels=[3, 4, 5, 6, 7, 8], normalize='true')

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False)

_data = go.Heatmap(
    z=cf_robertson, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=True)
fig.append_trace(_data, 1, 1)
fig['layout']['xaxis1'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis1'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout'].update(title='Confusion matrix for Robertson chart', height=600, width=600)
fig.show()

### Accuracy score

The accuracy score can also be calculated as follows:

$$ \text{Accuracy} = \frac{\text{No of correct predictions}}{\text{Total no of predictions}} $$

The ``accuracy_score`` function is directly available from scikit-learn and does not need to be coded.
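As a quick check of the definition, the sketch below compares a manual calculation with ``accuracy_score`` on a handful of hypothetical labels (the labels are an assumption and are not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted soil type classes (assumption)
y_true_demo = np.array([3, 4, 5, 5, 6, 6, 6, 7])
y_pred_demo = np.array([3, 4, 4, 5, 6, 5, 6, 6])

# Accuracy from the definition: number of correct predictions / total number
manual_accuracy = (y_true_demo == y_pred_demo).sum() / len(y_true_demo)
# The same quantity from scikit-learn
sklearn_accuracy = accuracy_score(y_true=y_true_demo, y_pred=y_pred_demo)

print(manual_accuracy, sklearn_accuracy)
```

Both values agree since five of the eight hypothetical predictions are correct.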

In [None]:
accuracy_score(y_true=data['Soil type'], y_pred=data['Soil type Robertson'])

The model is 39% accurate. This is not great; we will see if we can improve on it using tree-based classifiers.

## Creating a decision tree classifier

Creating a decision tree classifier is straightforward using scikit-learn. A decision tree is a kind of flow chart in which consecutive questions are answered based on the features of a sample. An example is shown of a decision tree with a depth of 3 (number of questions answered) based on the features $ \log (Q_t) $ and $ \log (F_r) $.

<img src="Images/decision_tree_example.png">
<br><center><b>Decision tree example</b></center>

During the model training, the questions which are asked at the nodes are optimised to ensure maximum information gain with each split.
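The idea of information gain can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the helper functions ``entropy`` and ``information_gain`` are hypothetical names written for this illustration, the data is made up, and the entropy-based gain shown here corresponds to ``criterion='entropy'`` whereas scikit-learn's default splitting criterion is the Gini impurity.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels, threshold):
    """Entropy reduction obtained by asking 'feature <= threshold?'."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    weighted_child_entropy = (
        len(left) / len(labels) * entropy(left) +
        len(right) / len(labels) * entropy(right))
    return entropy(labels) - weighted_child_entropy

# Hypothetical feature values and classes (assumption)
feature_demo = np.array([0.2, 0.4, 0.6, 1.5, 1.8, 2.2])
labels_demo = np.array([3, 3, 3, 6, 6, 6])

gain_good = information_gain(feature_demo, labels_demo, threshold=1.0)
gain_poor = information_gain(feature_demo, labels_demo, threshold=0.3)
print(gain_good, gain_poor)
```

The threshold of 1.0 separates the two hypothetical classes perfectly and therefore yields the larger gain; a training algorithm would prefer this question over the one with threshold 0.3.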

It is clear from the decision tree concept that the depth of the tree can be increased until the training data is predicted (almost) perfectly. This is undesirable since the data itself has uncertainty; a model which reproduces this noise overfits and will perform poorly on new data. We can detect this through validation, whereby the trained model is tested on a batch of unseen data. scikit-learn includes the function ``train_test_split`` which can split the available data into a train and a test set.
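A single train-test split is the simplest form of validation. The ``cross_val_score`` function imported above repeats the procedure on several folds of the data. The sketch below shows its use on synthetic data generated with ``make_classification``; the dataset and the names ``X_demo``, ``y_demo`` and ``clf_demo`` are assumptions for illustration only.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic two-feature dataset (assumption, not the Borssele data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0,
    n_classes=2, random_state=0)

clf_demo = tree.DecisionTreeClassifier(max_depth=4, random_state=0)

# 5-fold cross-validation: train on 4 folds, score (accuracy) on the remaining fold
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
print('Mean accuracy: %.2f, standard deviation: %.2f' % (scores.mean(), scores.std()))
```

The spread of the five scores gives an impression of how sensitive the accuracy is to the particular choice of test data.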

### Feature selection

We can first select features for the model. For this example, we will select $ \log (Q_t) $ and $ \log (F_r) $, the same features as in Robertson's chart. The target is the Soil type inferred from the borehole logs.

In [None]:
X = data[['log(Fr)', 'log(Qt)']]
y = data['Soil type']

### Train-test splitting

We can split the data keeping 20% as the test set. ``X_train`` and ``y_train`` are the features and target (true value) of the training set and are used to train the classifier. ``X_test`` and ``y_test`` are the features and target of the test set. These are used to evaluate whether the model generalises well.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
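The effect of ``test_size`` and ``random_state`` can be checked on a small hypothetical dataset. This is a sketch; the arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with 2 features and 4 hypothetical classes (assumption)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.repeat([3, 4, 5, 6], 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Fixing ``random_state`` makes the notebook reproducible: the same samples end up in the test set on every run.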

### Hyperparameter selection

The ``DecisionTreeClassifier`` has several hyperparameters: model parameters which can be modified by the user to control how the tree is built. These hyperparameters are described in the scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and any aspiring data scientist should get comfortable with reading function documentation. Documentation provides essential insights into the possibilities of software written by others and ensures that the code created on the basis of this software reaches maximum effectiveness. In data science, a proper choice of hyperparameters can greatly influence model accuracy and generalisation.

In this tutorial, we will focus on the hyperparameter ``max_depth`` which controls the maximum depth of the tree, i.e. the maximum number of consecutive splits between the root and a leaf.
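Besides reading the documentation, the hyperparameters of an estimator and their current values can be listed with the ``get_params`` method. The short sketch below prints a selection of them for a classifier created with default settings; the selection of names is ours.

```python
from sklearn import tree

# Hyperparameters of a classifier created with default settings
default_params = tree.DecisionTreeClassifier().get_params()

# Print a few hyperparameters which control the size of the tree
for name in ['criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf']:
    print(name, '=', default_params[name])
```

By default ``max_depth`` is ``None``, which means that the tree keeps growing until the leaves are pure or contain too few samples to split further. This is why limiting the depth is a natural first hyperparameter to study.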

## Decision tree models

### Maximum depth of 4

First, we can create a tree with 4 levels by setting the hyperparameter ``max_depth=4``.

In [None]:
clf_4 = tree.DecisionTreeClassifier(max_depth=4)

We can train this classifier using the training data.

In [None]:
clf_4.fit(X_train, y_train)

Given the relatively modest volume of data, the training process is quick. The trained model can now be used to make predictions.
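Before making predictions, it can be instructive to look at the questions a fitted tree asks. The sketch below does this on synthetic data with ``export_text`` from the ``tree`` module; the dataset (generated with ``make_blobs``) and the feature names ``feature_1`` and ``feature_2`` are assumptions for illustration.

```python
from sklearn import tree
from sklearn.datasets import make_blobs

# Synthetic two-feature data with three classes (assumption, not the Borssele data)
X_demo, y_demo = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

clf_demo = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
clf_demo.fit(X_demo, y_demo)

# Actual depth and number of leaves of the fitted tree
print(clf_demo.get_depth(), clf_demo.get_n_leaves())

# Text representation of the flow chart: one line per question or leaf
rules = tree.export_text(clf_demo, feature_names=['feature_1', 'feature_2'])
print(rules)
```

The same calls should work on a classifier trained on the notebook data, with the column names of ``X`` passed as ``feature_names``.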

First, we can predict the labels for the training dataset:

In [None]:
y_pred_train = clf_4.predict(X_train)

The accuracy score can be assessed:

In [None]:
accuracy_score(y_true=y_train, y_pred=y_pred_train)

With an accuracy of 60%, the model is already outperforming Robertson's chart. However, we need to check whether the model also performs well on unseen data.

In [None]:
y_pred_test = clf_4.predict(X_test)
accuracy_score(y_true=y_test, y_pred=y_pred_test)

The accuracy on the test set is also 60%, which indicates that the model generalises well.

Confusion matrices for the train and test set can be visualised:

In [None]:
cf_4_train = confusion_matrix(
    y_train, y_pred_train, labels=[3, 4, 5, 6, 7, 8], normalize='true')
cf_4_test = confusion_matrix(
    y_test, y_pred_test, labels=[3, 4, 5, 6, 7, 8], normalize='true')

In [None]:
fig = subplots.make_subplots(rows=1, cols=2, print_grid=False, subplot_titles=('Training data', 'Test data'))

cf_train_trace = go.Heatmap(
    z=cf_4_train, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=True)
fig.append_trace(cf_train_trace, 1, 1)
cf_test_trace = go.Heatmap(
    z=cf_4_test, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=False)
fig.append_trace(cf_test_trace, 1, 2)
fig['layout']['xaxis1'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis1'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['xaxis2'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis2'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])

fig['layout'].update(title='Confusion matrix for decision tree with max_depth=4', height=600, width=1000)
fig.show()

This shows that the accuracy score alone is insufficient to judge model quality. For example, clean sands, which were well captured by Robertson's chart, are now mostly predicted to be sand mixtures.
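The difference between overall accuracy and per-class performance can be made explicit with a small sketch. The labels below are hypothetical and deliberately imbalanced; ``recall_score`` with ``average=None`` returns the fraction of correct predictions for each true class, which is the diagonal of the confusion matrix normalised with ``normalize='true'``.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical, imbalanced labels (assumption): class 5 dominates
y_true_demo = np.array([5] * 8 + [6] * 2)
# A model which always predicts class 5
y_pred_demo = np.array([5] * 10)

overall = accuracy_score(y_true_demo, y_pred_demo)
per_class = recall_score(y_true_demo, y_pred_demo, labels=[5, 6], average=None)
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[5, 6], normalize='true')

print('Overall accuracy:', overall)
print('Fraction correct per true class:', per_class)
```

The overall accuracy looks acceptable although one of the two classes is never predicted correctly, which is exactly the kind of behaviour the confusion matrix reveals.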

For further insight into the model behaviour, the decision boundaries for this classifier can be plotted. This is possible since there are only two features. Note that the code shown below is beyond the scope of beginners.

In [None]:
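# Plot limits in log10 space, matching the axes of the Robertson chart
Fr_min, Fr_max = -1, 1
Qt_min, Qt_max = 0, 3

# Grid spacing in log10 units
h = 0.005

# Regular grid of (log(Fr), log(Qt)) points covering the chart
Fr, Qt = np.meshgrid(np.arange(Fr_min, Fr_max, h),
                     np.arange(Qt_min, Qt_max, h))
# Predict a class for every grid point; the column order matches that of X
Z_s = clf_4.predict(np.c_[Fr.ravel(), Qt.ravel()]).reshape(Fr.shape)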
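The grid construction can be easier to follow on a very coarse grid. The sketch below uses a hypothetical 3 x 2 grid and replaces the classifier by a simple stand-in (the sum of the two coordinates), so that the shapes at each step can be inspected.

```python
import numpy as np

xs = np.arange(0, 3)   # 3 hypothetical x values: 0, 1, 2
ys = np.arange(0, 2)   # 2 hypothetical y values: 0, 1

# meshgrid returns two arrays of shape (len(ys), len(xs)) with all combinations
xx, yy = np.meshgrid(xs, ys)

# ravel flattens each array; np.c_ stacks them as columns: one row per grid point
points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for clf.predict (assumption): label each point by x + y,
# then reshape the flat result back to the shape of the grid
labels_grid = points.sum(axis=1).reshape(xx.shape)

print(xx.shape, points.shape, labels_grid.shape)
print(labels_grid)
```

The reshaped array has one value per grid cell, which is the format a heatmap expects for its ``z`` argument.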

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False)
_data = go.Heatmap(
    z=Z_s, x=np.arange(Fr_min, Fr_max, h), y=np.arange(Qt_min, Qt_max, h), showlegend=False, opacity=0.8)
fig.append_trace(_data, 1, 1)
for i, _soiltype in enumerate(data['Soil type'].unique()):
    _soiltypedata = data[data['Soil type'] == _soiltype]
    # Look up the verbose class name; None if the class is not in the dictionary
    _name = robertson_dict.get("%.0f" % _soiltype)
    _data = go.Scattergl(
        y=_soiltypedata['log(Qt)'], x=_soiltypedata['log(Fr)'],
        showlegend=True, mode='markers', name=_name,
        marker=dict(size=8), opacity=1)
    fig.append_trace(_data, 1, 1)
fig['layout']['xaxis1'].update(title=r'$ \log (F_r) \ \text{[%]} $', range=(Fr_min, Fr_max))
fig['layout']['yaxis1'].update(title=r'$ \log (Q_t) $', range=(Qt_min, Qt_max))
fig['layout'].update(
    height=600, width=500,
    title='Decision boundaries for decision tree with max depth of 4',
    legend=dict(orientation='h', x=0.2, y=-0.2),
    images=[
        dict(
            source='Images/RobertsonFr.png',
            xref='x1', yref='y1',
            x=Fr_min, y=Qt_max,
            sizex=Fr_max - Fr_min, sizey=Qt_max - Qt_min,
            sizing='stretch', opacity=1, layer='below',
        )],
    )
fig.show()

### Maximum depth of 20

To demonstrate how overfitting can mislead data scientists, we can create a tree with a large depth.

In [None]:
clf_20 = tree.DecisionTreeClassifier(max_depth=20)

Training happens in exactly the same way as before:

In [None]:
clf_20.fit(X_train, y_train)

Predictions can be made and accuracy scores calculated for the test and train set.

In [None]:
y_pred_train_20 = clf_20.predict(X_train)

In [None]:
accuracy_score(y_true=y_train, y_pred=y_pred_train_20)

The accuracy on the training set is very high. This is because the higher maximum depth allows the tree to keep refining its splits until it follows the individual training samples closely.

In [None]:
y_pred_test_20 = clf_20.predict(X_test)

In [None]:
accuracy_score(y_true=y_test, y_pred=y_pred_test_20)

The accuracy score on the test set is much poorer. This shows that the model with ``max_depth=20`` overfits the training data and does not generalise well. The training data has too much influence on the model.

This can also be observed in the confusion matrices. The confusion matrix for the training data looks very good but the one for the test set is not in line with this.

In [None]:
cf_20_train = confusion_matrix(
    y_train, y_pred_train_20, labels=[3, 4, 5, 6, 7, 8], normalize='true')
cf_20_test = confusion_matrix(
    y_test, y_pred_test_20, labels=[3, 4, 5, 6, 7, 8], normalize='true')

In [None]:
fig = subplots.make_subplots(rows=1, cols=2, print_grid=False, subplot_titles=('Training data', 'Test data'))

cf_train_trace = go.Heatmap(
    z=cf_20_train, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=True)
fig.append_trace(cf_train_trace, 1, 1)
cf_test_trace = go.Heatmap(
    z=cf_20_test, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=False)
fig.append_trace(cf_test_trace, 1, 2)
fig['layout']['xaxis1'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis1'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['xaxis2'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis2'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])

fig['layout'].update(title='Confusion matrix for decision tree with max_depth=20', height=600, width=1000)
fig.show()

Plotting the boundaries in the parameter space shows this even more clearly:

In [None]:
Z_s_20 = clf_20.predict(np.c_[Fr.ravel(), Qt.ravel()]).reshape(Fr.shape)

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False)
_data = go.Heatmap(
    z=Z_s_20, x=np.arange(Fr_min, Fr_max, h), y=np.arange(Qt_min, Qt_max, h), showlegend=False, opacity=0.8)
fig.append_trace(_data, 1, 1)
for i, _soiltype in enumerate(data['Soil type'].unique()):
    _soiltypedata = data[data['Soil type'] == _soiltype]
    # Look up the verbose class name; None if the class is not in the dictionary
    _name = robertson_dict.get("%.0f" % _soiltype)
    _data = go.Scattergl(
        y=_soiltypedata['log(Qt)'], x=_soiltypedata['log(Fr)'],
        showlegend=True, mode='markers', name=_name,
        marker=dict(size=8), opacity=1)
    fig.append_trace(_data, 1, 1)
fig['layout']['xaxis1'].update(title=r'$ \log (F_r) \ \text{[%]} $', range=(Fr_min, Fr_max))
fig['layout']['yaxis1'].update(title=r'$ \log (Q_t) $', range=(Qt_min, Qt_max))
fig['layout'].update(
    height=600, width=500,
    title='Decision boundaries for decision tree with max depth of 20',
    legend=dict(orientation='h', x=0.2, y=-0.2),
    images=[
        dict(
            source='Images/RobertsonFr.png',
            xref='x1', yref='y1',
            x=Fr_min, y=Qt_max,
            sizex=Fr_max - Fr_min, sizey=Qt_max - Qt_min,
            sizing='stretch', opacity=1, layer='below',
        )],
    )
fig.show()

The boundaries in the parameter space show very rapid changes from one class to the next. This is not meaningful and is a clear sign of overfitting.
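The comparison between a shallow and a deep tree can be generalised by scanning a range of ``max_depth`` values and recording the train and test accuracy for each. The sketch below does this on noisy synthetic data from ``make_classification``; the dataset, the list of depths and the variable names are assumptions for illustration, and the numbers will differ for the Borssele data.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data (assumption): flip_y randomly reassigns 20% of the labels
X_demo, y_demo = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Train one tree per depth and store (train accuracy, test accuracy)
results = {}
for depth in [2, 4, 8, 20]:
    clf_demo = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf_demo.fit(X_tr, y_tr)
    results[depth] = (
        accuracy_score(y_tr, clf_demo.predict(X_tr)),
        accuracy_score(y_te, clf_demo.predict(X_te)))

for depth, (acc_train, acc_test) in results.items():
    print('max_depth=%2i  train=%.2f  test=%.2f' % (depth, acc_train, acc_test))
```

A growing gap between the train and test accuracy as the depth increases is the signature of overfitting; a reasonable depth is one beyond which the test accuracy no longer improves.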

## Practice

In the exercise on classifiers, we will build a model with multiple features using random forests, an ensemble model which trains several trees and averages their predictions.
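As a preview, a minimal sketch of a random forest on synthetic data is given here. The dataset and the hyperparameter values are assumptions for illustration and are not those of the exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with four features and three classes (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# 100 trees, each limited to a depth of 4
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_tr, y_tr)

# Class probabilities are the average of the probabilities predicted by the trees
proba = forest.predict_proba(X_te)
print(len(forest.estimators_), proba.shape)
print('Test accuracy: %.2f' % forest.score(X_te, y_te))
```

The interface is the same as for the single decision tree (``fit``, ``predict``), so the workflow of this notebook carries over directly.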