# Investigating the Impact of Cluster Sampling on the Size of the Training, Testing, and Validation Sets

When training the MVN NGBoost model on the drifter data, we must take extra care to ensure that information about the testing and validation sets are not inadvertedly introduced into the training data. The ~400,000 drifter observations that form our data set come from ~2000 unique drifters. We expect that observations taken by the same drifter are likely to be highly correlated so we ensure that all of the observations made by any single drifter are in precisely one of the training, testing or validation sets. Ensuring observations from each drifter are not split between sets will ensure that the training data does not contain any extra information via correlation.

O'Malley et al. (2023) deal with this issue using cluster sampling. That is, spliting the data into clusters defined by their corresponding drifter ID, then randomly sampling the clusters into the training, testing, and validation sets containing 81%, 10% and 9% of the drifter IDs respectively. However, there is significant variation between the number of observations found in each of the drifter ID clusters meaning that the propertion of the overall data found in each of the sets may be significantly different than the nominal 81-10-9 split. At its most extreme, this discrepancy could result in testing and training sets that are of comporable sizes. 

In this notebook, we will investigate whether the sizes of the training, testing, and validation sets that result from 81-10-9 cluster sampling differ signficantly from the nominal 81-10-9 values.

In [1]:
# import modules and load data
import pandas as pd
import numpy as np
path_to_data = '../data/filtered_nao_drifters_with_sst_gradient.h5'
data = pd.read_hdf(path_to_data)
# add day of the year as an index (to be added to the data later)
data['day_of_year'] = data['time'].apply(lambda t : t.timetuple().tm_yday)

In [2]:
# split the drifter IDs into training, testing and validation 

from sklearn.model_selection import train_test_split

def train_test_validation_split(X, Y,*,
                                test_frac = 0.10, validation_frac = 0.09, 
                                random_state = None, shuffle = True, stratify = None):
    
    X_aux, X_test, Y_aux, Y_test = train_test_split(X, Y, 
                                                        test_size=test_frac, random_state = random_state, shuffle = shuffle, stratify = stratify)
    if validation_frac == 0:
        return X_aux, X_test, Y_aux, Y_test
    else:
        X_train, X_val, Y_train, Y_val = train_test_split(X_aux, Y_aux,
                                                        test_size=validation_frac/(1 - test_frac), random_state = random_state, shuffle = shuffle, stratify = stratify)
        return X_train, X_test, X_val, Y_train, Y_test, Y_val

In [6]:
# randomly split drifter IDs into 81-10-9 and calculate the proportion of data in each of the sets

number_of_samples = len(data.index)
N = 100000 # number of repeats for hypothesis tests

count_by_id = data.groupby('id').size()
X, Y = np.array(count_by_id.index), np.array(count_by_id)
train_test_val_proportions = []

for ii in range(N):
    _,_,_,Y_train,Y_test,Y_val = train_test_validation_split(X, Y,
                                                             test_frac = 0.10, validation_frac = 0.09)
    train_test_val_proportions.append([sum(Y_train),sum(Y_test),sum(Y_val)])

train_test_val_proportions = np.array(train_test_val_proportions)/number_of_samples

## Testing

With the proportions of data in training, testing, and cross validations set calculated above for `N=100000` repetitions, we will test the following hypotheses:


### Two One-Sided Student's $t$-Tests (TOST)

Since we are working within an application, if the training, testing, validation split differs very slightly from 81-10-9, the impact of this will be negligible in practice so we will allow for the mean proportion to differ from 0.81 up to $\delta = 0.0005$. Since the sample means of the training, testing, validation data proportion approximately follow normal distributions, respectively (CLT) and each combination of training, testing, and validation sets is independent, we can use the two-sided Student's $t$-test.

Test 1

$H_0^{(1)}$: The mean proportion of the data assigned to training, $\mu - 0.81 \leq \delta$.

$H_1^{(1)}$: The mean proportion of the data assigned to training, $\mu - 0.81 > \delta$.

Test 2

$H_0^{(2)}$: The mean proportion of the data assigned to training, $\mu - 0.81 \geq \delta$.

$H_1^{(2)}$: The mean proportion of the data assigned to training, $\mu - 0.81 < \delta$.

Significance Level: 5%

and similarly for the proportion of the data assigned to the testing and validation sets.


In [10]:
from scipy.stats import ttest_1samp

def TOST(train_test_val_proportions,train_test_val_flag,delta,popmean):
    train_proportions = train_test_val_proportions[:,train_test_val_flag]
    _, p1 = ttest_1samp(train_proportions-popmean, -delta,
                        axis=None, nan_policy='propagate', alternative='less')

    _, p2 = ttest_1samp(train_proportions-popmean, delta,
                        axis=None, nan_policy='propagate', alternative='greater')

    set_type = ['training', 'testing', 'validation']
    alpha = 0.05
    print("\n ------------- Test 1 -------------")
    if p1 < alpha:
        print(f"\nReject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu - {popmean} <= -{delta}")
    else:
        print(f"\nFail to reject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu - {popmean} > -{delta}")
    print(f"\n p-value for Test 1: {p1:.3f}")

    print("\n ------------- Test 2 -------------")
    if p1 < alpha:
        print(f"\nReject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu - {popmean} >= {delta}")
    else:
        print(f"\nFail to reject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu - {popmean} < {delta}")
    print(f"\n p-value for Test 2: {p2:.3f}")    



In [34]:
delta = 0.0005

In [35]:
# TOST for training data

popmean = 0.81
train_test_val_flag = 0 # training set

TOST(train_test_val_proportions,train_test_val_flag,delta,popmean)


 ------------- Test 1 -------------

Fail to reject H0: The mean proportion of the data assigned to training, mu - 0.81 > -0.0005

 p-value for Test 1: 1.000

 ------------- Test 2 -------------

Fail to reject H0: The mean proportion of the data assigned to training, mu - 0.81 < 0.0005

 p-value for Test 2: 1.000


In [36]:
# TOST for testing data

popmean = 0.10
train_test_val_flag = 1 # testing set

TOST(train_test_val_proportions,train_test_val_flag,delta,popmean)


 ------------- Test 1 -------------

Fail to reject H0: The mean proportion of the data assigned to testing, mu - 0.1 > -0.0005

 p-value for Test 1: 1.000

 ------------- Test 2 -------------

Fail to reject H0: The mean proportion of the data assigned to testing, mu - 0.1 < 0.0005

 p-value for Test 2: 1.000


In [37]:
# TOST for validation data

popmean = 0.09
train_test_val_flag = 2 # validation set

TOST(train_test_val_proportions,train_test_val_flag,delta,popmean)


 ------------- Test 1 -------------

Fail to reject H0: The mean proportion of the data assigned to validation, mu - 0.09 > -0.0005

 p-value for Test 1: 1.000

 ------------- Test 2 -------------

Fail to reject H0: The mean proportion of the data assigned to validation, mu - 0.09 < 0.0005

 p-value for Test 2: 1.000


In [150]:
# prediction intervals (re do this later)
from scipy.stats import t

def prediction_interval(data, future_sample_size=1, confidence=0.95):
    """
    Compute the prediction interval for a future observation or sample mean.
    
    Parameters:
    - data: array-like, sample data.
    - future_sample_size: int, size of the future sample (default: 1 for individual observation).
    - confidence: float, confidence level (default: 0.95).
    
    Returns:
    - interval: tuple, (lower bound, upper bound) of the prediction interval.
    """
    n = len(data)  # current sample size
    mean = np.mean(data)
    std = np.std(data, ddof=1)  # sample standard deviation
    alpha = 1 - confidence
    
    # Critical t-value
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)
    
    # Standard error terms
    se_mean = std / np.sqrt(n)  # standard error of the mean
    se_prediction = np.sqrt(std**2 + (std**2 / future_sample_size))  # for individual or sample mean
    
    # Prediction interval
    margin_of_error = t_crit * se_prediction
    lower = round((mean - margin_of_error)*100,3)
    upper = round((mean + margin_of_error)*100,3)
    
    return lower, upper


In [153]:
print(f"\nPrediction interval (train proportions): {prediction_interval(train_test_val_proportions[:,0], future_sample_size=1, confidence=0.95)}")
print(f"\nPrediction interval (test proportions): {prediction_interval(train_test_val_proportions[:,1], future_sample_size=1, confidence=0.95)}")
print(f"\nPrediction interval (validation proportions): {prediction_interval(train_test_val_proportions[:,2], future_sample_size=1, confidence=0.95)}")


Prediction interval (train proportions): (78.955, 82.978)

Prediction interval (test proportions): (8.483, 11.562)

Prediction interval (validation proportions): (7.54, 10.482)


## Discussion

To do: Properly discuss
1. Whether the mean sizes are practically equal.
2. An interval in which the size of each set will sit 95% of the time


Cluster sampling the drifter data by ID according to a training, testing, validation data split of 81-10-9, on average, leads to a training, testing, and validation datasets that form within $(80.5 \leq \_ < 81.5)\%$, $(9.5 \leq \_ < 10.5)\%$, $(8.5 < \_ < 9.5)\%$ of the total drifter dataset, respectively at the 5% significance level.

The 95% prediction intervals (An interval in which the size of each set will sit 95% of the time) are:


|Data subset| Nominal Proportion| 95% Prediction Interval|
|---|---|---|
|Training |81\%| $(79.0, 83.0)\%$|
|Testing| $10\%$ | $(8.5, 11.6)\%$|
|Validation| $9\%$| $(7.5, 10.5)\%$|

There is not very much uncertainty in the mean (TOST). Individual variability is where the potential problem lies (Prediction Intervals).

It's up to my discretion as to whether this is a problem