# Practice - Regression models for pile driveability - Assignment

In the notebook, a Support Vector Regressor will be trained using the RBF (Radial Basis Function) kernel. This example is included to show that the mathematical details of a machine learning model are sometimes beyond the abilities of the data scientist, but that a basic insight into the internal workings of the model can still be useful.

Details on the RBF kernel can be found here (https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html) with an example of several kernel functions given here (https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html?highlight=svr). 

Support vector regression (https://scikit-learn.org/stable/modules/svm.html#svm-regression) fits a linear function in the feature space, but kernel functions allow non-linear relationships to be captured by implicitly transforming that space.
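
As a minimal sketch of what the RBF kernel computes, the snippet below evaluates $ k(a, b) = \exp(-\gamma \lVert a - b \rVert^2) $ by hand and compares it with scikit-learn's ``rbf_kernel``. The points and the value of ``gamma`` are illustrative assumptions and are not taken from the pile driving data.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two small sets of 1-D points (illustrative values only)
A = np.array([[0.0], [1.0], [3.0]])
B = np.array([[0.0], [2.0]])
gamma = 0.5  # assumed kernel width parameter for this illustration

# Manual RBF kernel: k(a, b) = exp(-gamma * ||a - b||^2)
sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
K_manual = np.exp(-gamma * sq_dist)

# scikit-learn's implementation of the same formula
K_sklearn = rbf_kernel(A, B, gamma=gamma)

print(np.round(K_manual, 3))
print(np.allclose(K_manual, K_sklearn))
```

Points that are close together give a kernel value near 1 and distant points give a value near 0, which is how the kernel encodes similarity between samples.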

## Package imports

A number of Python packages are required. Numpy, Pandas and Plotly are known from the previous tutorial. scikit-learn is a comprehensive Python package for machine learning which will be used here.

In [None]:
import pandas as pd
import numpy as np
import sklearn
from plotly import subplots
import plotly.graph_objs as go
from copy import deepcopy

## Pile driving data

### Data import

The same data used before can be imported:

In [None]:
____ = pd.___("Data/____.csv")  # Store the contents of the csv file in the variable 'data'
data.head()

### Reducing operator dependence

In the tutorial, it was observed that blowcounts were always around 75 blows/m. This is because the hammer operator will try to keep the blowcount steady. A better target can therefore be selected when the number of blows is multiplied by the hammer energy. If the soil generates more resistance, this target will increase, even when the blowcount is steady. The following formula can be used:

$$ \text{Normalised total ENTHRU per distance} = \text{Blowcount} \cdot \text{Normalised ENTHRU} $$

It is easy to implement this with Pandas. Perform the calculation and store the result in the column ``Normalised total ENTRHU per distance [-/m]`` (the column name keeps the spelling ``ENTRHU``, which later cells rely on).

In [None]:
data['____'] = data['____'] * data['____']
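
The effect of this normalisation can be sketched on a toy dataframe. The column names and values below are hypothetical and chosen only to show that a steady blowcount combined with rising hammer energy gives a rising target.

```python
import pandas as pd

# Toy values, not taken from the dataset; column names are hypothetical
toy = pd.DataFrame({
    "blows_per_m": [75.0, 75.0, 75.0],
    "normalised_energy": [0.4, 0.6, 0.8],
})

# Element-wise product of two columns, stored in a new column
toy["energy_per_m"] = toy["blows_per_m"] * toy["normalised_energy"]
print(toy)
```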

### Shaft resistance proxy

The engineered feature for approximating shaft resistance can be calculated for each location separately:

In [None]:
enhanced_data = pd.DataFrame() # Create a dataframe for the data enhanced with the shaft friction feature
for location in data['Location ID'].unique(): # Loop over all unique locations
    # Select the location-specific data
    locationdata = data[data['Location ID']==location].copy() 
    # Calculate the shaft resistance feature
    locationdata["Rs [kN]"] = \
        (np.pi * locationdata["Diameter [m]"] * locationdata["z [m]"].diff() * locationdata["qc [MPa]"]).cumsum()
    # Combine data for the different locations in 1 dataframe
    enhanced_data = pd.concat([enhanced_data, locationdata]) 
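
The same per-location calculation can be sketched with ``groupby`` instead of an explicit loop. This is an alternative formulation on toy data (two hypothetical locations with made-up values), not a replacement for the cell above. It also shows that the first row of each location becomes NaN because ``diff`` has no previous depth to subtract.

```python
import numpy as np
import pandas as pd

# Toy data for two hypothetical locations (values are illustrative only)
toy = pd.DataFrame({
    "Location ID": ["A", "A", "A", "B", "B", "B"],
    "z [m]": [0.0, 0.5, 1.0, 0.0, 1.0, 2.0],
    "qc [MPa]": [5.0, 10.0, 20.0, 1.0, 2.0, 3.0],
    "Diameter [m]": [2.0] * 6,
})

# Depth increment within each location; the first row of each location is NaN
dz = toy.groupby("Location ID")["z [m]"].diff()
# Contribution of each depth interval: circumference * interval length * qc
increment = np.pi * toy["Diameter [m]"] * dz * toy["qc [MPa]"]
# Running sum per location, restarting at every new location
toy["Rs [kN]"] = increment.groupby(toy["Location ID"]).cumsum()
print(toy)
```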

### Visualisation of new target variable

A plot can be created to show the variation of the new target variable vs cone tip resistance, depth and $ R_s $. The plot has three panels, one for each of these variables, which share the target variable on the y-axis.

In [None]:
fig = subplots.make_subplots(rows=1, cols=3, print_grid=False, shared_yaxes=True)
trace1 = go.Scatter(
    x=enhanced_data["____"], y=enhanced_data["____"], showlegend=False,
    mode='markers',name='qc')
fig.append_trace(trace1, 1, 1)
trace2 = go.Scatter(
    x=enhanced_data["____"], y=enhanced_data["____"], showlegend=False,
    mode='markers',name='z')
fig.append_trace(trace2, 1, 2)
trace3 = go.Scatter(
    x=enhanced_data["____"], y=enhanced_data["____"], showlegend=False,
    mode='markers',name='Rs')
fig.append_trace(trace3, 1, 3)

fig['layout']['xaxis1'].update(title='qc [MPa]', range=(0, 100), dtick=10)
fig['layout']['xaxis2'].update(title='z [m]', range=(0, 40), dtick=5)
fig['layout']['xaxis3'].update(title='Rs [kN]', range=(0, 12e3))
fig['layout']['yaxis1'].update(title='Normalised total ENTHRU [-/m]')

fig.show()

### Preparing data for machine learning

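The ``diff()`` call in the shaft resistance calculation leaves a NaN in the first row of each location. Rows containing NaN values can be dropped: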

In [None]:
enhanced_data.dropna(inplace=True)

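### Train-test split

A location-based train-test split is used: all data from the locations in ``validation_ids`` is held out, and the model is trained on the remaining locations. A deep copy of each subset is made so that later changes, such as adding a prediction column, are applied to an independent dataframe and leave ``enhanced_data`` untouched.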

In [None]:
validation_ids = ['EL', 'CB', 'AV', 'BV', 'EF', 'DL', 'BM']
# Training data - ID not in validation_ids
training_data = deepcopy(enhanced_data[~enhanced_data['Location ID'].isin(validation_ids)])
# Validation data - ID in validation_ids
validation_data = deepcopy(enhanced_data[enhanced_data['Location ID'].isin(validation_ids)])
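
The idea behind a location-based split can be checked on a toy dataframe. The location IDs and values below are hypothetical. The sketch uses ``DataFrame.copy()``, which makes a deep copy by default, as an equivalent of ``deepcopy`` for this purpose.

```python
import pandas as pd

# Toy frame with hypothetical location IDs
toy = pd.DataFrame({
    "Location ID": ["A", "A", "B", "B", "C", "C"],
    "value": [1, 2, 3, 4, 5, 6],
})
held_out = ["B"]  # locations reserved for validation

mask = toy["Location ID"].isin(held_out)
train = toy[~mask].copy()  # rows from all other locations
valid = toy[mask].copy()   # rows from the held-out locations

# No location appears in both subsets
shared = set(train["Location ID"]) & set(valid["Location ID"])
print(shared)

# Changing the copy leaves the original frame untouched
valid["value"] = 0
print(toy["value"].tolist())
```

Splitting by location rather than by row keeps all measurements from one pile together, so the score on the held-out set reflects performance on piles the model has not seen.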

## Model building

### Support vector regression import

The SVR class can be imported (see scikit-learn docs).

In [None]:
from sklearn.svm import ____

### SVR instance

An SVR instance can be created using the ``rbf`` kernel function and with other hyperparameters kept at their defaults. Can you identify from the documentation how these hyperparameters are determined and what their meaning is?

In [None]:
svr_rbf = SVR(kernel="____")
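
One way to inspect the hyperparameters without leaving the notebook is the ``get_params`` method that every scikit-learn estimator provides. The sketch below uses a separate, illustrative instance called ``model`` and prints the three hyperparameters that usually matter most for the RBF kernel. Their meaning is still best looked up in the documentation.

```python
from sklearn.svm import SVR

# Separate illustrative instance; not the svr_rbf object used in the notebook
model = SVR(kernel="rbf")
params = model.get_params()

# Print the hyperparameters most relevant to the RBF kernel
for name in ("C", "epsilon", "gamma"):
    print(name, "=", params[name])
```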

### Feature selection

The features on which we will train can be selected. Initially, we can work with a single feature ``Rs [kN]``. Can you add features to improve the accuracy of the model?

The target is our new normalised total ENTHRU.

In [None]:
features = ["____"]
X = training_data[features]
y = training_data["____"]

### Model training

We can train the model using ``Rs [kN]`` as the single feature and the normalised total ENTHRU as the target.

In [None]:
svr_rbf.____(____, ____)

## Model predictions

### Training set

Making predictions is as easy as running the ``predict`` method on the trained model. We can assign the result to ``y_pred_train``.

In [None]:
____ = svr_rbf.____(____)

The predictions can be visualised against the observed values.

In [None]:
fig = subplots.make_subplots(rows=1, cols=1, print_grid=False, shared_yaxes=True)
trace1 = go.Scatter(
    x=enhanced_data["Rs [kN]"], y=enhanced_data["Normalised total ENTRHU per distance [-/m]"], showlegend=True,
    mode='markers',name='True values')
fig.append_trace(trace1, 1, 1)

trace1 = go.Scatter(
    x=____["Rs [kN]"], y=____, showlegend=True,
    mode='markers',name='Predictions')
fig.append_trace(trace1, 1, 1)

fig['layout']['xaxis1'].update(title='Rs [kN]', range=(0, 12e3))
fig['layout']['yaxis1'].update(title='Normalised total ENTHRU [-/m]')

fig.show()

The $ R^2 $-score can be determined using the ``score`` method.

In [None]:
svr_rbf.____(____, ____)
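
To see how training, prediction and scoring fit together, the sketch below runs the full sequence on synthetic data (a noisy sine curve, not the pile driving data). All names ending in ``_demo`` are illustrative. It also checks that ``score`` returns the same $ R^2 $ value as ``sklearn.metrics.r2_score`` applied to the predictions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Synthetic 1-D regression problem (not the pile driving data)
rng = np.random.default_rng(0)
X_demo = np.linspace(0.0, 5.0, 80).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.normal(size=80)

demo_model = SVR(kernel="rbf")
demo_model.fit(X_demo, y_demo)

y_demo_pred = demo_model.predict(X_demo)
# score(X, y) computes R^2 between y and predict(X)
r2_from_score = demo_model.score(X_demo, y_demo)
r2_from_metric = r2_score(y_demo, y_demo_pred)
print(r2_from_score, r2_from_metric)
```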

### Test set

The features and target can be selected for the test set.

In [None]:
X_test = validation_data[____]
y_test = validation_data["____"]

Predictions can again be made.

In [None]:
y_pred_test = svr_rbf.____(____)

The $ R^2 $-score on the test set can then be displayed. Does the model generalise well?

In [None]:
svr_rbf.____(____, ____)

It should be noted that the predictions are normalised total ENTHRU per unit length. To predict blowcount, we need to divide by the normalised ENTHRU. We can first assign the predictions to the column ``'Predictions'`` in the ``validation_data`` dataframe.

In [None]:
____['____'] = y_pred_test

The predicted blowcount can then be calculated in a single line of code.

In [None]:
validation_data['Predicted blowcount'] = validation_data['____'] / validation_data['____']

### Results for selected locations

The results for selected locations in the test set can be visualised by overlaying predictions and observed values for the normalised total ENTHRU per unit length and blowcount.

The results show that the model generally performs well. However, the increasingly detrimental effect of blowcount underprediction towards the final pile penetration is not captured by any of the scikit-learn metrics. Custom coding is required to get more insight into this.

In [None]:
# Available locations are 'EL', 'CB', 'AV', 'BV', 'EF', 'DL', 'BM'
selected_location = 'EL'

In [None]:
fig = subplots.make_subplots(rows=1, cols=2, print_grid=False, shared_yaxes=True)

loc_data = validation_data[validation_data['Location ID'] == ____]
trace_observations = go.Scatter(
    x=loc_data['Normalised total ENTRHU per distance [-/m]'],
    y=loc_data['____'], showlegend=True, mode='lines',name='Observed values')
fig.append_trace(trace_observations, 1, 1)

trace_predictions = go.Scatter(
    x=loc_data['Predictions'],
    y=loc_data['z [m]'], showlegend=True, mode='markers',name='Predictions')
fig.append_trace(trace_predictions, 1, 1)

trace_observations_blct = go.Scatter(
    x=loc_data['____'],
    y=loc_data['z [m]'], showlegend=True, mode='lines',name='Observed blowcount')
fig.append_trace(trace_observations_blct, 1, 2)

trace_predictions_blct = go.Scatter(
    x=loc_data['____'],
    y=loc_data['z [m]'], showlegend=True, mode='markers',name='Predicted blowcount')
fig.append_trace(trace_predictions_blct, 1, 2)

fig['layout']['xaxis1'].update(title='Normalised total ENTHRU [-/m]', side='top', anchor='y')
fig['layout']['xaxis2'].update(title='Blowcount [Blows/m]', side='top', anchor='y')
fig['layout']['yaxis1'].update(title='Depth below mudline [m]', autorange='reversed')
fig['layout'].update(height=700, width=900)
fig.show()
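
The underprediction towards final penetration mentioned in this section is not summarised by $ R^2 $. As one possible starting point for the custom coding, the sketch below computes the mean shortfall (observed minus predicted, floored at zero) per depth band. The function name, the definition of shortfall and the equal-width banding are all assumptions made for illustration, and the numbers are toy values.

```python
import numpy as np
import pandas as pd

def underprediction_summary(depth, observed, predicted, n_bins=2):
    """Mean shortfall (observed - predicted, floored at 0) per depth band."""
    diff = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    frame = pd.DataFrame({
        "depth": np.asarray(depth, dtype=float),
        "shortfall": np.clip(diff, 0.0, None),  # overpredictions count as 0
    })
    # Equal-width depth bands between the shallowest and deepest point
    frame["band"] = pd.cut(frame["depth"], bins=n_bins)
    return frame.groupby("band", observed=True)["shortfall"].mean()

# Toy blowcounts: the prediction falls short in the deeper half only
summary = underprediction_summary(
    depth=[1.0, 2.0, 3.0, 4.0],
    observed=[50.0, 60.0, 80.0, 100.0],
    predicted=[55.0, 60.0, 70.0, 80.0],
)
print(summary)
```

A summary that grows with depth would flag the pattern described above, even when the overall score looks acceptable.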

## Modifications

When the initial model performs as expected, you can investigate the effect of adding more features and changing the hyperparameters. Can you get even better scores by doing this?
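
A manual comparison of a few hyperparameter values could look like the sketch below. It uses synthetic data (a noisy sine curve) and an arbitrary choice of three values for ``C``. Both are assumptions for illustration, and the resulting scores say nothing about the pile driving data.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data, split into a training part and a held-out part
rng = np.random.default_rng(1)
X_all = rng.uniform(0.0, 5.0, size=(120, 1))
y_all = np.sin(X_all).ravel() + 0.1 * rng.normal(size=120)
X_tr, y_tr = X_all[:80], y_all[:80]
X_te, y_te = X_all[80:], y_all[80:]

scores = {}
for C in (0.1, 1.0, 10.0):
    candidate = SVR(kernel="rbf", C=C).fit(X_tr, y_tr)
    # Record R^2 on training and held-out data for this value of C
    scores[C] = (candidate.score(X_tr, y_tr), candidate.score(X_te, y_te))

for C, (r2_train, r2_test) in scores.items():
    print(f"C={C}: train R2={r2_train:.3f}, test R2={r2_test:.3f}")
```

Comparing the training and held-out scores side by side helps to spot settings that fit the training data well but generalise poorly.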

In scikit-learn, there are tools such as ``GridSearchCV`` which allow selection of an optimal set of hyperparameters, but discussing these is beyond the scope of this tutorial.