# Practice - Classification models for soil type determination - Assignment

In this notebook, we will expand on the tutorial on classification models by building up a model with multiple features using random forests, an ensemble model which builds up several trees and averages the predictions. We will see if the predictions with the simple decision tree from the tutorial can be improved.

In [None]:
import numpy as np
import pandas as pd
import sklearn
from plotly import subplots
import plotly.graph_objs as go

We will import the ``RandomForestClassifier`` class from the scikit-learn ``ensemble`` module to build the random forest classifier.

For more info on the principle of random forests, you can check the scikit-learn documentation.

In [None]:
from sklearn.ensemble import ____
from sklearn.model_selection import train_test_split, cross_val_score
# plot_confusion_matrix was removed in scikit-learn 1.2 and is not used below
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

## Dataset loading

We can load the data from a .csv file. The data has been preprocessed to link the soil type class according to Robertson (2010) to cone resistance properties. The soil type class was determined from sampling boreholes adjacent to the CPT tests. To allow comparison with the Robertson soil type classification charts, the features $ \log (Q_t) $ and $ \log (F_r) $ have already been calculated.

In [None]:
data = pd.___('Data/___.csv')

The data contains some $ \infty $ values, which can be replaced with NaN values using the ``replace`` method on the dataframe. The rows containing NaN values can then be dropped from the data:

In [None]:
data.replace([np.inf, -np.inf], np.nan, inplace=True)
data.dropna(inplace=True)
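
The effect of these two steps can be checked on a small dataframe. The sketch below uses made-up values (the name ``toy`` and its contents are illustrative assumptions, not part of the CPT dataset) and the non-in-place form of the same two calls:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one infinite and one missing entry
toy = pd.DataFrame({
    'log(Qt)': [1.2, np.inf, 0.8, 1.5],
    'log(Fr)': [0.1, 0.3, np.nan, -0.2],
})

# Turn +/- infinity into NaN, then drop every row containing a NaN
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(len(toy), '->', len(cleaned))  # 4 -> 2
```

Comparing the number of rows before and after cleaning is a quick way to see how much data is lost.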

In [None]:
data.head()

We can also create the dictionary with the full names of the soil types according to Robertson.

In [None]:
robertson_dict = {
    '1': '1 - Sensitive, fine-grained',
    '2': '2 - Organic soils - Peats',
    '3': '3 - Clays, clay to silty clay',
    '4': '4 - Silt mixtures, clayey silt to silty clay',
    '5': '5 - Sand mixtures, silty sand to sandy silt',
    '6': '6 - Sands, clean sands to silty sands',
    '7': '7 - Gravelly sand to sand',
    '8': '8 - Very stiff sand to clayey sand',
    '9': '9 - Very stiff fine grained'
}

## Creating a Random Forest classifier

Creating a Random Forest classifier is again straightforward using ``scikit-learn``. A Random Forest classifier builds multiple decision trees, each on a randomised subset of the training data. Predictions from these trees are averaged. The principle is that errors made by individual trees are averaged out, leading to a more accurate prediction.

During model training, a new randomised subset of the training data is drawn for each tree, and a decision tree is then trained on this subset.

The hyperparameters for the Random Forest classifier are the same as the hyperparameters for individual decision trees (e.g. ``max_depth``). However, since multiple trees are built, the number of trees also needs to be set. This is done with the ``n_estimators`` hyperparameter.
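
The averaging principle can be checked on synthetic data. The sketch below is illustrative only: the data comes from ``make_classification`` rather than the CPT dataset, and the names ``X_demo``, ``y_demo`` and ``forest`` are assumptions. It relies on the fact that scikit-learn averages the class probabilities of the individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the CPT dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X_demo, y_demo)

# One fitted decision tree per estimator
n_trees = len(forest.estimators_)

# Average the class probabilities predicted by the individual trees
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)

# The forest's own probabilities should match this average
matches = np.allclose(tree_probas, forest.predict_proba(X_demo))
print(n_trees, matches)
```

The predicted class is the one with the highest averaged probability, so a single tree that misclassifies a sample can be outvoted by the others.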

### Feature selection

We can first select features for the model. For this example, we will start by selecting $ \log (Q_t) $ and $ \log (F_r) $, the same features as in the tutorial. We can also add the depth ``'z [m]'``. The target is the soil type inferred from the borehole logs.

In [None]:
X = data[['log(Fr)', '____', '____']]
y = data['____']

### Train-test splitting

We can split the data keeping 20% as the test set. ``X_train`` and ``y_train`` are the features and target (true value) of the training set and are used to train the classifier. ``X_test`` and ``y_test`` are the features and target of the test set. These are used to evaluate whether the model generalises well.

In [None]:
X_train, X_test, y_train, y_test = ____(____, y, test_size=____, random_state=42)
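
As a quick sanity check of what the split returns, the sketch below applies ``train_test_split`` to 100 made-up samples (the arrays ``X_demo`` and ``y_demo`` are illustrative assumptions) and prints the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with two features and a binary label
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Hold out 20% of the samples; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)
```

Fixing ``random_state`` means the same rows end up in the test set on every run, which keeps accuracy scores comparable between experiments.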

### Hyperparameter selection

The ``n_estimators`` and ``max_depth`` hyperparameters can be set for the ``RandomForestClassifier``.

## Random Forest classifier

We can set up the classifier with the selected hyperparameters (``n_estimators=100`` and a maximum tree depth of 4).

In [None]:
rfc = ____(n_estimators=____, ____=4)

We can train this classifier using the training data.

In [None]:
rfc.____(____, _____)

Given the relatively modest volume of data, the training process is quick. The trained model can now be used to make predictions.

First, we can predict the labels for the training dataset:

In [None]:
y_pred_train = rfc.____(____)

The accuracy score can be assessed:

In [None]:
accuracy_score(y_true=____, y_pred=____)

Is this accuracy greater than the accuracy of the simple decision tree model?

Predictions can also be made on the test set and the accuracy can be evaluated.

In [None]:
y_pred_test = rfc.____(____)
accuracy_score(y_true=____, y_pred=____)

Does this model generalise well?
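
A single train-test split gives only one estimate of the generalisation performance. One possible complement is k-fold cross-validation with ``cross_val_score``, which was imported at the top of the notebook. The sketch below uses synthetic data from ``make_classification``; the data and the names ``X_demo``, ``y_demo`` and ``forest`` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the CPT dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)

# Fit and score the model on 5 different train/validation partitions
scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print('Mean accuracy: %.3f, standard deviation: %.3f' % (scores.mean(), scores.std()))
```

A large spread between the fold scores suggests that the accuracy obtained from a single split should be interpreted with care.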

Confusion matrices for the train and test set can be visualised:

In [None]:
cf_train = confusion_matrix(
    ____, ____, labels=[3, 4, 5, 6, 7, 8], normalize='true')
cf_test = confusion_matrix(
    ____, ____, labels=[3, 4, 5, 6, 7, 8], normalize='true')
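
With ``normalize='true'``, each row of the confusion matrix is divided by the number of samples of that true class, so a row shows how the samples of one true class are distributed over the predicted classes. A minimal sketch with made-up labels (``y_true_demo`` and ``y_pred_demo`` are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted soil type classes
y_true_demo = [3, 3, 4, 4, 4, 5]
y_pred_demo = [3, 4, 4, 4, 5, 5]

cm = confusion_matrix(
    y_true_demo, y_pred_demo, labels=[3, 4, 5], normalize='true')

# Rows correspond to true classes, columns to predicted classes
print(cm)
# Each row sums to 1 because it is normalised by the true class count
print(cm.sum(axis=1))
```

The diagonal entries are the fractions of each true class that were predicted correctly, which makes classes with few samples directly comparable to classes with many.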

In [None]:
fig = subplots.make_subplots(rows=1, cols=2, print_grid=False, subplot_titles=('Training data', 'Test data'))

cf_train_trace = go.Heatmap(
    z=____, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=True)
fig.append_trace(cf_train_trace, 1, 1)
cf_test_trace = go.Heatmap(
    z=____, x=[3, 4, 5, 6, 7, 8], y=[3, 4, 5, 6, 7, 8], showlegend=False, showscale=False)
fig.append_trace(cf_test_trace, 1, 2)
fig['layout']['xaxis1'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis1'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['xaxis2'].update(title='Predicted class', tickvals=[3, 4, 5, 6, 7, 8])
fig['layout']['yaxis2'].update(title='True class', tickvals=[3, 4, 5, 6, 7, 8])

fig['layout'].update(title='Confusion matrix for Random Forest Classifier', height=600, width=1000)
fig.show()

What conclusions does this confusion matrix allow you to draw on the accuracy of predicting a certain soil type?

For further insight into the model behaviour, the decision boundaries for this classifier can be plotted. This is directly possible for models with only two features. For models with multiple features, a projection on the $ \log (F_r) - \log (Q_t) $ space can be made by fixing the other features at constant values. Note that the code shown hereunder is beyond the scope of beginners.

In [None]:
Fr_min, Fr_max = -1, 1
Qt_min, Qt_max = 0, 3

other_feature_names = ['z [m]',]
other_feature_values = [20, ]

h = 0.005

Fr, Qt = np.meshgrid(np.arange(Fr_min, Fr_max, h),
                     np.arange(Qt_min, Qt_max, h))

param_space_features = pd.DataFrame(np.c_[Fr.ravel(), Qt.ravel()], columns=['log(Fr)', 'log(Qt)'])
for i, _value in enumerate(other_feature_values):
    param_space_features.loc[:, other_feature_names[i]] = _value

Z_s = rfc.predict(param_space_features).reshape(Fr.shape)

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False)
_data = go.Heatmap(
    z=Z_s, x=np.arange(Fr_min, Fr_max, h), y=np.arange(Qt_min, Qt_max, h), showlegend=False, opacity=0.8)
fig.append_trace(_data, 1, 1)
for i, _soiltype in enumerate(data['Soil type'].unique()):
    _soiltypedata = data[data['Soil type'] == _soiltype]
    try:
        _name = robertson_dict["%.0f" % _soiltype]
    except (KeyError, TypeError):
        # No matching entry in robertson_dict: leave the trace unnamed
        _name = None
    _data = go.Scattergl(
        y=_soiltypedata['log(Qt)'], x=_soiltypedata['log(Fr)'],
        showlegend=True, mode='markers', name=_name,
        marker=dict(size=8), opacity=1)
    fig.append_trace(_data, 1, 1)
fig['layout']['xaxis1'].update(title=r'$ \log (F_r) \ \text{[%]} $', range=(Fr_min, Fr_max))
fig['layout']['yaxis1'].update(title=r'$ \log (Q_t) $', range=(Qt_min, Qt_max))
fig['layout'].update(
    height=600, width=500,
    title='Projected decision boundaries for Random Forest Classifier',
    legend=dict(orientation='h', x=0.2, y=-0.2),
    images=[
        dict(
            source='Images/RobertsonFr.png',
            xref='x1', yref='y1',
            x=Fr_min, y=Qt_max,
            sizex=Fr_max - Fr_min, sizey=Qt_max - Qt_min,
            sizing='stretch', opacity=1, layer='below',
        )],
    )
fig.show()

How do the decision boundaries change when you change the values of the other features? Can you create a better prediction model when including more features? Can your knowledge of soil behaviour guide you in the selection of suitable features? Can you change the hyperparameters (e.g. increasing ``n_estimators``) to further improve predictions?
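
One possible aid when weighing up candidate features is the ``feature_importances_`` attribute of a fitted ``RandomForestClassifier``. The sketch below uses synthetic data in which only the first two columns carry information. The data and the names ``X_demo``, ``y_demo`` and ``forest`` are assumptions for illustration, and importances are a diagnostic, not a substitute for knowledge of soil behaviour.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 2 informative columns come first,
# followed by 2 pure-noise columns
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0,
    n_classes=2, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, normalised to sum to 1
importances = forest.feature_importances_
for i, value in enumerate(importances):
    print('feature %d: %.3f' % (i, value))
```

Features with an importance close to zero contribute little to the splits and are candidates for removal.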