# Using Multivariate Gaussian NGBoost (MVN NGBoost)

This notebook outlines simple modelling tasks that use MVN NGBoost to predict drifting buoy velocities from satellite-derived explanatory variables. It draws on [O'Malley et al. 2023](https://www.cambridge.org/core/journals/environmental-data-science/article/probabilistic-prediction-of-oceanographic-velocities-with-multivariate-gaussian-natural-gradient-boosting/F26F2BD51213758208B0EBAE51D1A973#article) and the supplementary materials provided with that paper.

In [1]:
# import packages
import pandas as pd
import ngboost

In [3]:
# load data
path_to_data = '../data/filtered_nao_drifters_with_sst_gradient.h5'
data = pd.read_hdf(path_to_data)
# add day of the year as a column (used below as an explanatory variable)
data['day_of_year'] = data['time'].apply(lambda t : t.timetuple().tm_yday)

# separate into explanatory and response variables
explanatory_var_labels = ['u_av','v_av','lat','lon','day_of_year','Wx','Wy','Tx','Ty','sst_x_derivative','sst_y_derivative']
response_var_labels = ['u','v']

explanatory_vars = data[explanatory_var_labels]
response_vars = data[response_var_labels]

## Training, Cross-Validation, and Testing

To do:
* Explain what each of these terms means and why each step is needed.
* Decide on the train/validation/test split.
* Explain that the split must be made by drifter (all observations from one drifter stay in the same subset), and explain why.
* Include a sketch implementation of how this might be done.
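One way to sketch the drifter-level split listed above is to assign whole drifters, rather than individual rows, to each subset, so that correlated observations from a single trajectory do not end up on both sides of the split. The block below is a minimal sketch under stated assumptions: the `drifter_id` column name, the helper `split_by_group`, and the 70/15/15 fractions are all hypothetical placeholders, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd


def split_by_group(groups, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole groups to train/validation/test; return three boolean row masks."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_groups)

    # Fractions apply to the number of groups (drifters), not the number of rows.
    n_train = int(round(fractions[0] * len(shuffled)))
    n_val = int(round(fractions[1] * len(shuffled)))
    train_groups = shuffled[:n_train]
    val_groups = shuffled[n_train:n_train + n_val]
    test_groups = shuffled[n_train + n_val:]

    return (np.isin(groups, train_groups),
            np.isin(groups, val_groups),
            np.isin(groups, test_groups))


# Toy stand-in for the drifter data: 20 hypothetical drifters, 5 rows each.
toy = pd.DataFrame({
    'drifter_id': np.repeat(np.arange(20), 5),
    'u': np.linspace(-1.0, 1.0, 100),
})
train_mask, val_mask, test_mask = split_by_group(toy['drifter_id'])
train, val, test = toy[train_mask], toy[val_mask], toy[test_mask]
```

Because the fractions are applied to drifters, the row counts in each subset will only match the fractions approximately when drifters have different record lengths.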

## The MVN NGBoost Model

Include an explainer here as to how the model works.
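
As a starting point for that explainer: NGBoost fits a sequence of base learners to the natural gradient of a scoring rule, taken with respect to the parameters of the predictive distribution. Assuming the library's default log score (the fitting code in this notebook does not override it), the per-observation loss for a response $y$ of dimension $p = 2$ is the multivariate normal negative log-likelihood. The symbols $\mu(x)$, $\Sigma(x)$, $\theta$ and $\mathcal{I}$ are notation introduced here for illustration.

```latex
\begin{aligned}
-\log p(y \mid x) &= \tfrac{1}{2}\,(y - \mu(x))^{\top}\,\Sigma(x)^{-1}\,(y - \mu(x))
  + \tfrac{1}{2}\log\det\Sigma(x) + \tfrac{p}{2}\log(2\pi), \\
\tilde{\nabla}_{\theta} &= \mathcal{I}(\theta)^{-1}\,\nabla_{\theta}\bigl(-\log p(y \mid x)\bigr).
\end{aligned}
```

Here $\theta$ collects the $p$ mean parameters and the $p(p+1)/2$ free covariance parameters (five in total for $p = 2$), and $\mathcal{I}(\theta)$ is the Fisher information of the distribution.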

In [25]:
import ngboost.distns

# 2-dimensional multivariate normal for the (u, v) response; 15 boosting iterations
multivariate_ngboost = ngboost.NGBoost(Dist=ngboost.distns.MultivariateNormal(2), n_estimators=15)
# fit the model
multivariate_ngboost.fit(X=explanatory_vars, Y=response_vars)

[iter 0] loss=0.0386 val_loss=0.0000 scale=0.5000 norm=1.5508


<ngboost.ngboost.NGBoost at 0x190ca651ca0>

In [26]:
# get the predictive distribution for each row of the explanatory variables
predicted_distribution = multivariate_ngboost.pred_dist(explanatory_vars)

In [27]:
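import numpy as np

# use the predicted means as point predictions, shape (n_samples, 2)
point_predictions = np.asarray(predicted_distribution.loc)
# in-sample RMSE, pooled over both velocity components
np.sqrt(np.mean(np.square(np.array(response_vars) - point_predictions)))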

0.23382208583662384
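
The single number above pools the errors in `u` and `v`, and it is computed on the same rows that were used for fitting, so it is an in-sample figure. A per-component breakdown can be more informative. The helper below is a minimal sketch: `rmse_per_component` is a hypothetical name, and the toy arrays stand in for `response_vars` and `predicted_distribution.loc`.

```python
import numpy as np


def rmse_per_component(observed, predicted):
    """Return one RMSE per column (e.g. u and v) instead of a single pooled value."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Average the squared errors over rows only, keeping the columns separate.
    return np.sqrt(np.mean(np.square(observed - predicted), axis=0))


# Toy arrays standing in for the observed and predicted velocities.
toy_observed = np.array([[0.0, 0.0], [1.0, 2.0]])
toy_predicted = np.array([[0.0, 1.0], [1.0, 1.0]])
toy_rmse = rmse_per_component(toy_observed, toy_predicted)
```

In the notebook, the same function could be called as `rmse_per_component(response_vars, predicted_distribution.loc)`, ideally on held-out rows rather than the training data.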