# Hypothesis Formulation and Testing

To address my research questions, I will test the following hypotheses:

In [1]:
# import modules
import numpy as np

# import models


## Impact of SST Gradient on Surface Current Prediction using MVN NGBoost

### Prediction Performance

#### 1. Including SST gradient in the MVN NGBoost model improves model prediction performance:

##### RMSE

Let the random variable $X$ be the Root Mean Square Error (RMSE) of the MVN NGBoost model evaluated on a test dataset:

$$X = \sqrt{\frac{1}{2P} \sum_{i=1}^2\sum_{j=1}^P \left(\mathbf{u}_i^{(j)} - \hat{\mathbf{u}}_i^{(j)}\right)^2},$$

where $P$ is the number of samples in the test data, $\mathbf{u}=(u,v)$ is the observed drifter velocity, $\hat{\mathbf{u}} = (\hat{u},\hat{v})$ is the expected value of the predicted distribution of the drifter velocity, and $\mathbf{u}_i^{(j)}$ denotes component $i$ of sample $j$.

Let $X_{\nabla SST}$ be the RMSE on the test data for the MVN NGBoost model that uses Sea Surface Temperature (SST) gradients and let $X_0$ be the RMSE on the test data for the MVN NGBoost model that does not. 

In [2]:
def rmse(vec1,vec2):
    return 100*np.sqrt(np.mean(np.square(vec1-vec2))) # m/s -> cm/s

##### MAE

Let the random variable $X$ be the Mean Absolute Error (MAE) of the MVN NGBoost model evaluated on a test dataset:

$$X = \frac{1}{2P} \sum_{i=1}^2\sum_{j=1}^P \left\vert\mathbf{u}_i^{(j)} - \hat{\mathbf{u}}_i^{(j)}\right\vert,$$

where $P$ is the number of samples in the test data, $\mathbf{u}=(u,v)$ is the observed drifter velocity, $\hat{\mathbf{u}} = (\hat{u},\hat{v})$ is the expected value of the predicted distribution of the drifter velocity, and $\mathbf{u}_i^{(j)}$ denotes component $i$ of sample $j$.

Let $X_{\nabla SST}$ be the MAE on the test data for the MVN NGBoost model that uses Sea Surface Temperature (SST) gradients and let $X_0$ be the MAE on the test data for the MVN NGBoost model that does not. 

In [3]:
def mae(vec1,vec2):
    return 100*np.mean(np.abs(vec1-vec2)) # m/s -> cm/s

##### MAAO

Let the random variable $X$ be the Mean Absolute Angle Offset (MAAO) of the MVN NGBoost model evaluated on a test dataset:

$$X = \frac{1}{P}\sum_{i=1}^P \arccos \left(\frac{\mathbf{u}_i\cdot\hat{\mathbf{u}}_i}{|\mathbf{u}_i||\hat{\mathbf{u}}_i|}\right),$$

where $P$ is the number of samples in the test data, $\mathbf{u}=(u,v)$ is the observed drifter velocity, $\hat{\mathbf{u}} = (\hat{u},\hat{v})$ is the expected value of the predicted distribution of the drifter velocity, and, in this formula only, the subscript $i$ indexes the sample rather than the velocity component.

Let $X_{\nabla SST}$ be the MAAO on the test data for the MVN NGBoost model that uses Sea Surface Temperature (SST) gradients and let $X_0$ be the MAAO on the test data for the MVN NGBoost model that does not. 



In [42]:
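def maao(vec1,vec2):
    # per-sample dot products and norm products for arrays of shape (P, 2)
    elem_wise_dot_product = np.einsum('ij,ij->i',vec1,vec2)
    normalisation = np.linalg.norm(vec1,axis=1)*np.linalg.norm(vec2,axis=1)
    # clip guards against round-off pushing the cosine outside [-1, 1];
    # average over samples to match the MAAO definition above
    return np.mean(np.arccos(np.clip(
        elem_wise_dot_product/normalisation,
        -1,1
    ))) # radians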
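As a quick sanity check of the MAAO definition, the sketch below evaluates it on three hypothetical velocity pairs whose angle offsets are known in advance. The function name `mean_abs_angle_offset` and the sample values are illustrative assumptions, not part of the analysis; a zero-speed row would produce a 0/0 division and is not handled here.

```python
import numpy as np

def mean_abs_angle_offset(true, pred):
    """Mean angle (radians) between matching rows of two (P, 2) arrays."""
    dots = np.einsum("ij,ij->i", true, pred)
    norms = np.linalg.norm(true, axis=1) * np.linalg.norm(pred, axis=1)
    return np.mean(np.arccos(np.clip(dots / norms, -1, 1)))

# Hypothetical velocities: parallel, perpendicular and opposite predictions
true = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
pred = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])

# Per-sample offsets are 0, pi/2 and pi, so the mean is pi/2
offset = mean_abs_angle_offset(true, pred)
```

Note that the magnitude of the prediction does not affect the result: only the direction of each vector enters the metric.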


##### MAPE

Let the random variable $X$ be the Mean Absolute Percentage Error (MAPE) of the MVN NGBoost model evaluated on a test dataset:

$$X = \frac{100}{2P} \sum_{i=1}^2\sum_{j=1}^P \left\vert\frac{\mathbf{u}_i^{(j)} - \hat{\mathbf{u}}_i^{(j)}}{\mathbf{u}_i^{(j)}}\right\vert,$$

where $P$ is the number of samples in the test data, $\mathbf{u}=(u,v)$ is the observed drifter velocity, $\hat{\mathbf{u}} = (\hat{u},\hat{v})$ is the expected value of the predicted distribution of the drifter velocity, and $\mathbf{u}_i^{(j)}$ denotes component $i$ of sample $j$.

Let $X_{\nabla SST}$ be the MAPE on the test data for the MVN NGBoost model that uses Sea Surface Temperature (SST) gradients and let $X_0$ be the MAPE on the test data for the MVN NGBoost model that does not. 

In [43]:
def mape(true,pred):
    return 100*np.mean(np.abs((
        true-pred
    )/true))

##### RMSLE

Let the random variable $X$ be the Root Mean Square Logarithmic Error (RMSLE) of the MVN NGBoost model evaluated on a test dataset:

$$X = \sqrt{\frac{1}{2P} \sum_{i=1}^2\sum_{j=1}^P \left[\ln(1 +\mathbf{u}_i^{(j)}) - \ln(1+\hat{\mathbf{u}}_i^{(j)})\right]^2},$$

where $P$ is the number of samples in the test data, $\mathbf{u}=(u,v)$ is the observed drifter velocity, $\hat{\mathbf{u}} = (\hat{u},\hat{v})$ is the expected value of the predicted distribution of the drifter velocity, and $\mathbf{u}_i^{(j)}$ denotes component $i$ of sample $j$.

Let $X_{\nabla SST}$ be the RMSLE on the test data for the MVN NGBoost model that uses Sea Surface Temperature (SST) gradients and let $X_0$ be the RMSLE on the test data for the MVN NGBoost model that does not. 

In [None]:
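def rmsle(vec1,vec2):
    # np.log1p(x) computes ln(1+x); it is only defined for x > -1,
    # so velocity components at or below -1 m/s yield NaN or -inf
    return np.sqrt(np.mean(np.square(
        np.log1p(vec1)-np.log1p(vec2)
        )))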

##### Formulate One-Sided Two-Sample t-test
Each MVN NGBoost model is fitted in 100 replications, with a different random seed for each replication so that the replications are *independent*.

Checking Normality:

In [44]:
# check samples are normally distributed
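One way the normality check could be carried out is with a Shapiro-Wilk test from SciPy. The sketch below is an assumption about how the check might look, not the analysis itself: the helper name `check_normality`, the 0.05 threshold, and the randomly generated placeholder scores are all illustrative and do not represent real results.

```python
import numpy as np
from scipy import stats

def check_normality(samples, alpha=0.05):
    """Shapiro-Wilk test; returns (statistic, p_value, looks_normal)."""
    samples = np.asarray(samples, dtype=float)
    statistic, p_value = stats.shapiro(samples)
    # Failing to reject normality is not proof of normality
    return float(statistic), float(p_value), bool(p_value > alpha)

# Placeholder values standing in for 100 replicated RMSE scores (not real results)
rng = np.random.default_rng(0)
placeholder_rmse = rng.normal(loc=25.0, scale=1.0, size=100)
stat, p, looks_normal = check_normality(placeholder_rmse)
```

A Q-Q plot of each sample is a useful visual complement, since a formal test alone can be overly sensitive or insensitive depending on sample size.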

Let $\bar{X}_{\nabla SST} = \frac{1}{N}\sum_{n=1}^N X_{\nabla SST}^{(n)}$ and $\bar{X}_0 = \frac{1}{N}\sum_{n=1}^N X_{0}^{(n)}$ be the sample means of $X_{\nabla SST}^{(1)}, \dots, X_{\nabla SST}^{(N)}$ and $X_{0}^{(1)}, \dots, X_{0}^{(N)}$, respectively, where $N=100$. Let $\mu_{\nabla SST}$ and $\mu_0$ be the corresponding population means, which the sample means estimate.

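Lower values indicate better performance for every metric above, so the claim that including SST gradients improves prediction is the alternative hypothesis, and the null hypothesis contains the equality:

$H_0$: $\mu_{\nabla SST} \geq \mu_0$.

$H_1$: $\mu_{\nabla SST} < \mu_0$.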
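A one-sided two-sample t-test of this kind can be sketched with `scipy.stats.ttest_ind`. The sketch treats "the SST-gradient model has the lower mean error" as the alternative hypothesis and uses Welch's unequal-variance form, which is an assumption on my part; the helper name `one_sided_ttest` and the placeholder scores are illustrative and are not real results.

```python
import numpy as np
from scipy import stats

def one_sided_ttest(x_sst, x_0, alpha=0.05):
    """Welch's t-test with alternative: mean(x_sst) < mean(x_0)."""
    result = stats.ttest_ind(x_sst, x_0, equal_var=False, alternative="less")
    reject_null = bool(result.pvalue < alpha)
    return float(result.statistic), float(result.pvalue), reject_null

# Placeholder metric values for N = 100 replications (not real results)
rng = np.random.default_rng(0)
x_sst = rng.normal(loc=24.0, scale=1.0, size=100)
x_0 = rng.normal(loc=25.0, scale=1.0, size=100)
t_stat, p_value, reject_null = one_sided_ttest(x_sst, x_0)
```

A negative statistic with a small p-value supports the alternative. Passing `equal_var=True` instead would give the classical pooled-variance test.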

### Uncertainty Reduction

2. Including SST gradient in the MVN NGBoost model reduces the variance of:
- RMSE
- MAE
- MAAO
- MAPE
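One possible way to test a reduction in variance is a one-sided F-test on the ratio of sample variances, sketched below. This choice of test is an assumption rather than something specified above, and the F-test is known to be sensitive to departures from normality; the helper name and the placeholder values are illustrative only.

```python
import numpy as np
from scipy import stats

def one_sided_f_test(x_sst, x_0):
    """F-test with alternative: var(x_sst) < var(x_0)."""
    x_sst = np.asarray(x_sst, dtype=float)
    x_0 = np.asarray(x_0, dtype=float)
    f_stat = np.var(x_sst, ddof=1) / np.var(x_0, ddof=1)
    # Small ratios support the alternative, so use the lower tail
    p_value = stats.f.cdf(f_stat, x_sst.size - 1, x_0.size - 1)
    return float(f_stat), float(p_value)

# Placeholder metric values for 100 replications (not real results)
rng = np.random.default_rng(0)
low_spread = rng.normal(loc=25.0, scale=0.5, size=100)
high_spread = rng.normal(loc=25.0, scale=2.0, size=100)
f_stat, p_value = one_sided_f_test(low_spread, high_spread)
```

If the normality check fails for a metric, a more robust alternative such as Levene's test (`scipy.stats.levene`) may be preferable, although it is two-sided.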

3. Including SST gradient in the MVN NGBoost model reduces the area of the prediction region.
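For a bivariate normal prediction with covariance $\Sigma$, the prediction region at level $p$ is an ellipse with area $\pi\, q_p \sqrt{\det \Sigma}$, where $q_p = -2\ln(1-p)$ is the $\chi^2$ quantile with two degrees of freedom. The sketch below computes this area; the function name and the 95% default level are assumptions made for illustration.

```python
import numpy as np

def prediction_region_area(cov, level=0.95):
    """Area of the elliptical prediction region of a 2D Gaussian.

    cov has shape (2, 2) or (P, 2, 2); the area is in squared velocity units.
    """
    cov = np.asarray(cov, dtype=float)
    quantile = -2.0 * np.log(1.0 - level)  # chi-squared quantile, 2 dof
    return np.pi * quantile * np.sqrt(np.linalg.det(cov))

unit_area = prediction_region_area(np.eye(2))
scaled_area = prediction_region_area(4.0 * np.eye(2))  # 4x the unit area
```

Because the function accepts a stack of covariance matrices, the per-sample areas on a test set can be averaged to give one value per replication.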

### Goodness of Fit
4. Including SST gradient in the MVN NGBoost model improves the fit of the model to the data, as measured by:
- Negative log-likelihood (NLL)
- $\chi^2$
- Prediction region coverage
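Two of these quantities can be sketched directly from the predicted means and covariances: the mean negative log-likelihood of a bivariate normal, and the fraction of observations falling inside the prediction region. The function names, array shapes, and the 95% level below are assumptions for illustration, not the definitions used in the analysis.

```python
import numpy as np

def _mahalanobis_sq(true, mean, cov):
    """Squared Mahalanobis distance per sample; inputs (P, 2), (P, 2), (P, 2, 2)."""
    resid = true - mean
    return np.einsum("pi,pij,pj->p", resid, np.linalg.inv(cov), resid)

def mvn_nll(true, mean, cov):
    """Mean negative log-likelihood under per-sample bivariate Gaussians."""
    _, logdet = np.linalg.slogdet(cov)
    maha = _mahalanobis_sq(true, mean, cov)
    return np.mean(0.5 * (maha + logdet + 2.0 * np.log(2.0 * np.pi)))

def coverage(true, mean, cov, level=0.95):
    """Fraction of observations inside the level-p prediction ellipse."""
    threshold = -2.0 * np.log(1.0 - level)  # chi-squared quantile, 2 dof
    return np.mean(_mahalanobis_sq(true, mean, cov) <= threshold)

# Hypothetical check: observations equal to the predicted mean, unit covariance
mean = np.zeros((3, 2))
cov = np.tile(np.eye(2), (3, 1, 1))
nll = mvn_nll(mean, mean, cov)
cover = coverage(mean, mean, cov)
```

For a well-calibrated model, the coverage should be close to the nominal level, so values well below or above 0.95 indicate over-confident or under-confident predictions respectively.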

### SST Gradient as a Physical Phenomenon

5. SST gradients are significant features for explaining the variance of the MVN NGBoost parameters.

6. SST gradients improve prediction significantly in the Gulf Stream and Labrador Current regions.


### Polar Form

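Including the velocities in polar form (speed and direction) achieves each of the improvements hypothesised above: better prediction performance, reduced uncertainty, and improved goodness of fit.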

## The SeaDucks Implementation

7. The SeaDucks implementation of MVN NGBoost improves upon the model presented by O'Malley et al. (2023) on the mean of:
- RMSE
- NLL
- Prediction Region Area
- Prediction Region Coverage

8. The SeaDucks implementation of MVN NGBoost improves upon the model presented by O'Malley et al. (2023) on the variance of:
- RMSE
- NLL
- Prediction Region Area
- Prediction Region Coverage


## Observations about the MVN NGBoost Model

9. Increasing the number of training points increases the model's reliance on (lat, lon, time) and decreases the reliance on physical features.

10. There is seasonal variation in the performance of MVN NGBoost.

11. The model residuals are normally distributed:
- RMSE
- MAE
- MAAO
- MAPE

12. There are clusters of regions that perform poorly in the following metrics:
- RMSE
- MAE
- MAAO
- MAPE

## Mean Absolute Angle Offset

13. MAAO gives us more information about improvements in direction prediction than RMSE.
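The intuition behind this hypothesis can be illustrated with two hypothetical predictions that have identical RMSE but different directional errors. The values and helper names below are illustrative assumptions; both metrics are computed without the cm/s conversion used earlier.

```python
import numpy as np

def rmse_ms(true, pred):
    """RMSE over all velocity components, in the input units (m/s)."""
    return np.sqrt(np.mean(np.square(true - pred)))

def maao_rad(true, pred):
    """Mean angle (radians) between matching rows of two (P, 2) arrays."""
    dots = np.einsum("ij,ij->i", true, pred)
    norms = np.linalg.norm(true, axis=1) * np.linalg.norm(pred, axis=1)
    return np.mean(np.arccos(np.clip(dots / norms, -1, 1)))

true = np.array([[1.0, 0.0]])
pred_speed_error = np.array([[0.5, 0.0]])      # right direction, wrong speed
pred_direction_error = np.array([[1.0, 0.5]])  # same error size, rotated

rmse_a = rmse_ms(true, pred_speed_error)
rmse_b = rmse_ms(true, pred_direction_error)
maao_a = maao_rad(true, pred_speed_error)
maao_b = maao_rad(true, pred_direction_error)
```

RMSE cannot separate the two cases, whereas MAAO is zero for the first prediction and positive for the second, which is the kind of distinction this hypothesis is meant to test.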