# Investigating the Impact of Cluster Sampling on the Size of the Training, Testing, and Validation Sets

When training the MVN NGBoost model on the drifter data, we must take extra care to ensure that information about the testing and validation sets is not inadvertently introduced into the training data. The ~400,000 drifter observations that form our data set come from ~2000 unique drifters. We expect observations taken by the same drifter to be highly correlated, so we ensure that all of the observations made by any single drifter are in precisely one of the training, testing, or validation sets. Keeping each drifter's observations within a single set ensures that the training data does not contain any information about the testing or validation sets via correlation.

O'Malley et al. (2023) deal with this issue using cluster sampling: the data are split into clusters defined by drifter ID, and the clusters are then randomly assigned to the training, testing, and validation sets, which contain 81%, 10%, and 9% of the drifter IDs respectively. However, the number of observations varies significantly between drifter ID clusters, so the proportion of the overall data found in each set may differ significantly from the nominal 81-10-9 split. In the most extreme case, this discrepancy could result in testing and training sets of comparable size.
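As a minimal sketch of why the two kinds of proportion can disagree, the following splits hypothetical clusters 81-10-9 by cluster count and then measures each set's share of the observations. The cluster sizes are synthetic (drawn from a geometric distribution purely for illustration) and the variable names are assumptions, not part of the SeaDucks package or the O'Malley et al. code.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for the drifter data: 2000 clusters of very uneven size.
n_clusters = 2000
cluster_sizes = rng.geometric(p=1 / 200, size=n_clusters)

# Shuffle the cluster labels, then cut 10% / 9% / remainder by number of clusters.
order = rng.permutation(n_clusters)
n_test = round(0.10 * n_clusters)
n_val = round(0.09 * n_clusters)
split_ids = {
    "test": order[:n_test],
    "validation": order[n_test:n_test + n_val],
    "train": order[n_test + n_val:],
}

# Share of all observations that lands in each set.
total = cluster_sizes.sum()
obs_share = {name: cluster_sizes[idx].sum() / total for name, idx in split_ids.items()}

for name, idx in split_ids.items():
    print(f"{name:>10}: {len(idx) / n_clusters:.4f} of clusters, "
          f"{obs_share[name]:.4f} of observations")
```

Each cluster label appears in exactly one set, so the share of clusters is fixed by construction while the share of observations depends on which clusters happen to be drawn.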

In this notebook, we will investigate whether the sizes of the training, testing, and validation sets that result from 81-10-9 cluster sampling differ significantly from the nominal 81-10-9 values.

In [39]:
# import modules and load data
import pandas as pd
import numpy as np
from scipy.stats import t,ttest_1samp
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname('seaducks'), '..')))
from seaducks.model_selection import train_test_validation_split
from seaducks.config import config

path_to_data = '../data/filtered_nao_drifters_with_sst_gradient.h5'
data = pd.read_hdf(path_to_data)
# add the day of the year as a column
data['day_of_year'] = data['time'].apply(lambda t : t.timetuple().tm_yday)

## Testing

To determine whether the actual sizes of the training, testing, and validation sets differ significantly from the nominal 81-10-9 values, we will cluster sample the data by drifter ID `N=100000` times and calculate the proportion of the overall data made up by each of the three sets in every repeat.
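The repeated-sampling procedure can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the code used in this notebook: the cluster sizes are hypothetical, `n_repeats` is kept small so the sketch runs quickly, and `split_proportions` is a helper name introduced here.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical cluster sizes standing in for the per-drifter observation counts.
cluster_sizes = rng.geometric(p=1 / 200, size=2000)


def split_proportions(sizes, rng, test_frac=0.10, validation_frac=0.09):
    """Return the (train, test, validation) shares of observations for one random split by cluster."""
    n = len(sizes)
    shuffled = sizes[rng.permutation(n)]
    n_test = round(test_frac * n)
    n_val = round(validation_frac * n)
    test = shuffled[:n_test].sum()
    val = shuffled[n_test:n_test + n_val].sum()
    total = shuffled.sum()
    return (total - test - val) / total, test / total, val / total


n_repeats = 1000  # far fewer than the N used in this notebook
proportions = np.array([split_proportions(cluster_sizes, rng) for _ in range(n_repeats)])

print("mean:", proportions.mean(axis=0))
print("std: ", proportions.std(axis=0, ddof=1))
```

The column means summarise the average split, while the column standard deviations indicate how much a single split can stray from it.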

In [11]:
number_of_samples = len(data.index)                                     # number of drifter observations
N = 100000                                                              # number of repeats for hypothesis tests

count_by_id = data.groupby('id').size()                                 # grouping observations by id
ids, size_by_id = np.array(count_by_id.index), np.array(count_by_id)   

We will test the impact of cluster sampling using both the implementation of O'Malley et al. (2023) and the implementation used in this package. By cycling through random seeds, we have also obtained a seed that yields training, testing, and validation set proportions close to the nominal values. We include this seeded split as well; the seed is the default `random_state` in this package.

In [15]:
# % ------------------------------------ O'Malley et al. (2023) implementation ------------------------------------------ %

'''MV_Prediction/experiments/dispatcher.py lines 31-39'''
def random_id_subset(ids, pc=0.1):
    unique_id = np.unique(ids)
    N_unique = len(unique_id)
    np.random.shuffle(unique_id)
    in_test = int(N_unique * pc)
    test_ids = unique_id[:in_test]
    test_mask = np.in1d(ids, test_ids)
    train_mask = np.invert(test_mask)
    return train_mask, test_mask

''' from MV_Prediction/experiments/dispatcher.py'''
N_runs = N                                             # L97 (adapted)
shuffle_seed = 500                                              # L80
np.random.seed(shuffle_seed)                                    # L98
splits = [random_id_subset(ids) for _ in range(N_runs)]         # L99 (adapted)
total_data_size = count_by_id.sum()

OM_train_test_val_proportions = []

for (train_mask, test_mask) in splits:                                # L101 (adapted)
    test_ids = ids[test_mask]                                         # L103 (adapted)
    train_ids = ids[train_mask] # auxiliary set                       # L106 (adapted)

    train_mask, valid_mask = random_id_subset(train_ids, pc=0.1)      # L48 (adapted)
    new_train_ids = train_ids[train_mask]
    valid_ids = train_ids[valid_mask]
    
    # count the observations in each set (converted to proportions after the loop)
    train_size = count_by_id[new_train_ids].sum()
    test_size = count_by_id[test_ids].sum()
    validation_size = count_by_id[valid_ids].sum()

    OM_train_test_val_proportions.append([train_size, test_size, validation_size])


OM_train_test_val_proportions = np.array(OM_train_test_val_proportions)/total_data_size

In [16]:
# % ----------------------------- SeaDucks implementation ----------------------------------- %

train_test_val_proportions = []

for ii in range(N):
    _,_,_,Y_train,Y_test,Y_val = train_test_validation_split(ids, size_by_id,
                                                             test_frac = 0.10, validation_frac = 0.09)
    train_test_val_proportions.append([sum(Y_train),sum(Y_test),sum(Y_val)])

train_test_val_proportions = np.array(train_test_val_proportions)/number_of_samples

In [80]:
# % ----------------------------- SeaDucks implementation with preselected seed ----------------------------------- %


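# np.random.seed returns None, so the call below seeds NumPy's global RNG and passes random_state=None
_,_,_,Y_train,Y_test,Y_val = train_test_validation_split(ids, size_by_id,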
                                                             test_frac = 0.10, validation_frac = 0.09,random_state=np.random.seed(63012))
train_test_val_proportions_seeded = ([sum(Y_train),sum(Y_test),sum(Y_val)])

train_test_val_proportions_seeded = np.array(train_test_val_proportions_seeded)/number_of_samples

### Two One-Sided Student's $t$-Tests (TOST)

Since we are working within an application, a training, testing, validation split that differs very slightly from 81-10-9 will have a negligible impact in practice. We therefore allow the mean proportion of each set to differ from its nominal value (for example, 0.81 for the training set) by up to $\delta = 5\times 10^{-5}$. With the ~400,000 observations in our data set, this tolerance corresponds to each of the three sets differing from its nominal size by no more than 20 observations.

Since the sample means of the training, testing, and validation data proportions each approximately follow a normal distribution (by the central limit theorem), and each repeated train-test-validation split is drawn independently, we can use two one-sided Student's $t$-tests.

In [14]:
delta = 0.00005

print(f"Largest permitted difference from nominal values: {int(np.floor(delta*number_of_samples))} observations")

Largest permitted difference from nominal values: 20 observations


In [76]:
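def TOST(train_test_val_proportions,train_test_val_flag,delta,popmean):
    """Run two one-sided one-sample t-tests on one column of set proportions.

    train_test_val_flag selects the column (0: training, 1: testing, 2: validation).
    Test 1 has H0: mean - popmean <= -delta; Test 2 has H0: mean - popmean >= delta.
    """
    set_proportions = train_test_val_proportions[:,train_test_val_flag]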

    print(f"sample mean:{np.mean(set_proportions)}")
    _, p1 = ttest_1samp(set_proportions-popmean, -delta,
                        axis=None, nan_policy='propagate', alternative='greater')
    
    _, p2 = ttest_1samp(set_proportions-popmean, delta,
                        axis=None, nan_policy='propagate', alternative='less')

    set_type = ['training', 'testing', 'validation']
    alpha = 0.05
    print("\n ------------- Test 1 -------------")
    if p1 < alpha:
        print(f"\nReject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu_hat, is significantly small compared to the nominal value, mu:")
        print(f"mu_hat - {popmean} <= -{delta}")
    else:
        print(f"\nFail to reject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu_hat, is not signficantly small compared to the nominal value, mu:")
        print(f"mu_hat  - {popmean} > -{delta}")
    print(f"\n p-value for Test 1: {p1:.3f}")

    print("\n ------------- Test 2 -------------")
    if p2 < alpha:
        print(f"\nReject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu_hat, is significantly large compared to the nominal value, mu:")
        print(f"mu_hat - {popmean} >= {delta}")
    else:
        print(f"\nFail to reject H0: The mean proportion of the data assigned to {set_type[train_test_val_flag]}, mu_hat, is not significantly large compared to the nominal value, mu:")
        print(f"mu_hat - {popmean} < {delta}")
        
    print(f"\n p-value for Test 2: {p2:.3f}")    



Each TOST comprises two one-sided tests of the following form:

**Test 1: The mean proportion is smaller than the nominal value**

$H_0^{(1)}$: The mean proportion of the data assigned to the `{set type}` set satisfies $\hat{\mu} - \mu \leq -\delta$.

$H_1^{(1)}$: The mean proportion of the data assigned to the `{set type}` set satisfies $\hat{\mu} - \mu > -\delta$.

**Test 2: The mean proportion is larger than the nominal value**

$H_0^{(2)}$: The mean proportion of the data assigned to the `{set type}` set satisfies $\hat{\mu} - \mu \geq \delta$.

$H_1^{(2)}$: The mean proportion of the data assigned to the `{set type}` set satisfies $\hat{\mu} - \mu < \delta$.

Significance Level: 5%

where $\hat{\mu}$ is the sample mean, $\mu$ is the nominal proportion, $\delta=5\times10^{-5}$ is the tolerance, and the set types are training, testing, and validation.
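The decision rule implied by these hypotheses is that the mean is treated as practically equal to the nominal value only when both null hypotheses are rejected. The sketch below illustrates this on simulated data. It rests on several assumptions: `tost_equivalence` is a helper name introduced here, a large-sample normal approximation (two one-sided $z$-tests) stands in for the $t$ distribution, and the tolerance is deliberately larger than the $\delta$ used in this notebook so that the simulated example is clear-cut.

```python
import math
from statistics import NormalDist

import numpy as np


def tost_equivalence(samples, nominal, delta, alpha=0.05):
    """Two one-sided z-tests; equivalence is claimed only if both nulls are rejected."""
    x = np.asarray(samples, dtype=float)
    se = float(x.std(ddof=1)) / math.sqrt(x.size)  # standard error of the sample mean
    diff = float(x.mean()) - nominal
    z1 = (diff + delta) / se                       # Test 1, H0: mean - nominal <= -delta
    z2 = (diff - delta) / se                       # Test 2, H0: mean - nominal >= +delta
    p1 = 1.0 - NormalDist().cdf(z1)                # small when the mean is well above nominal - delta
    p2 = NormalDist().cdf(z2)                      # small when the mean is well below nominal + delta
    return p1, p2, (p1 < alpha) and (p2 < alpha)


rng = np.random.default_rng(2)                     # arbitrary seed
centred = 0.81 + rng.normal(0.0, 0.01, size=100_000)
shifted = centred + 0.002                          # hypothetical bias of 0.2 percentage points

delta = 5e-4                                       # illustrative tolerance, not the notebook's value
print(tost_equivalence(centred, 0.81, delta))
print(tost_equivalence(shifted, 0.81, delta))
```

For the shifted sample only Test 1 rejects, so equivalence is not concluded even though the mean is clearly above the lower bound.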

In [82]:
# TOST for training data
popmean = 0.81
train_test_val_flag = 0 # training set
print("\n Training Data Proportion vs Nominal Value TOST [for the O'Malley et al. (2023) implementation]")

TOST(OM_train_test_val_proportions,train_test_val_flag,delta,popmean)

print("\n=========================================================================================================================================")

print("\n Training Data Proportion vs Nominal Value TOST [for the SeaDucks implementation]")
TOST(train_test_val_proportions,train_test_val_flag,delta,popmean)

print("\n=========================================================================================================================================")

print("\n Training Data Proportion vs Nominal Value [for the SeaDucks seeded implementation]")
print(f"\nTraining data proportion: {train_test_val_proportions_seeded[train_test_val_flag]}")
print(f"Training proportion - nominal value: {train_test_val_proportions_seeded[train_test_val_flag]-popmean}")


 Training Data Proportion vs Nominal Value TOST [for the O'Malley et al. (2023) implementation]
sample mean:0.8105521506621232

 ------------- Test 1 -------------

Reject H0: The mean proportion of the data assigned to training, mu_hat, is significantly small compared to the nominal value, mu:
mu_hat - 0.81 <= -5e-05

 p-value for Test 1: 0.000

 ------------- Test 2 -------------

Fail to reject H0: The mean proportion of the data assigned to training, mu_hat, is not significantly large compared to the nominal value, mu:
mu_hat - 0.81 < 5e-05

 p-value for Test 2: 1.000


 Training Data Proportion vs Nominal Value TOST [for the SeaDucks implementation]
sample mean:0.8097174964563796

 ------------- Test 1 -------------

Fail to reject H0: The mean proportion of the data assigned to training, mu_hat, is not signficantly small compared to the nominal value, mu:
mu_hat  - 0.81 > -5e-05

 p-value for Test 1: 1.000

 ------------- Test 2 -------------

Reject H0: The mean proportion of t

In [None]:
# TOST for testing data
popmean = 0.10
train_test_val_flag = 1 # testing set
print("\n Testing Data Proportion vs Nominal Value TOST [for the O'Malley et al. (2023) implementation]")

TOST(OM_train_test_val_proportions,train_test_val_flag,delta,popmean)

print("\n=========================================================================================================================================")

print("\n Testing Data Proportion vs Nominal Value TOST [for the SeaDucks implementation]")
TOST(train_test_val_proportions,train_test_val_flag,delta,popmean)

print("\n=========================================================================================================================================")

print("\n Testing Data Proportion vs Nominal Value [for the SeaDucks seeded implementation]")
print(f"\nTesting data proportion: {train_test_val_proportions_seeded[train_test_val_flag]}")
print(f"Testing proportion - nominal value: {train_test_val_proportions_seeded[train_test_val_flag]-popmean}")


 Testing Data Proportion vs Nominal Value TOST [for the O'Malley et al. (2023) implementation]
sample mean:0.09978165426399488

 ------------- Test 1 -------------

Fail to reject H0: The mean proportion of the data assigned to testing, mu_hat, is not signficantly small compared to the nominal value, mu:
mu_hat  - 0.1 > -5e-05

 p-value for Test 1: 1.000

 ------------- Test 2 -------------

Reject H0: The mean proportion of the data assigned to testing, mu_hat, is significantly large compared to the nominal value, mu:
mu_hat - 0.1 >= 5e-05

 p-value for Test 2: 0.000


 Testing Data Proportion vs Nominal Value TOST [for the SeaDucks implementation]
sample mean:0.10019171016096776

 ------------- Test 1 -------------

Reject H0: The mean proportion of the data assigned to testing, mu_hat, is significantly small compared to the nominal value, mu:
mu_hat - 0.1 <= -5e-05

 p-value for Test 1: 0.000

 ------------- Test 2 -------------

Fail to reject H0: The mean proportion of the data a

In [86]:
# TOST for validation data
popmean = 0.09
train_test_val_flag = 2 # validation set
print("\n Validation Data Proportion vs Nominal Value TOST [for the O'Malley et al. (2023) implementation]")

TOST(OM_train_test_val_proportions,train_test_val_flag,delta,popmean)

print("\n=========================================================================================================================================")

print("\n Validation Data Proportion vs Nominal Value TOST [for the SeaDucks implementation]")
TOST(train_test_val_proportions,train_test_val_flag,delta,popmean)

print("\n=========================================================================================================================================")

print("\n Validation Data Proportion vs Nominal Value [for the SeaDucks seeded implementation]")
print(f"\nValidation data proportion: {train_test_val_proportions_seeded[train_test_val_flag]}")
print(f"Validation proportion - nominal value: {train_test_val_proportions_seeded[train_test_val_flag]-popmean}")


 Validation Data Proportion vs Nominal Value TOST [for the O'Malley et al. (2023) implementation]
sample mean:0.08966619507388206

 ------------- Test 1 -------------

Fail to reject H0: The mean proportion of the data assigned to validation, mu_hat, is not signficantly small compared to the nominal value, mu:
mu_hat  - 0.09 > -5e-05

 p-value for Test 1: 1.000

 ------------- Test 2 -------------

Reject H0: The mean proportion of the data assigned to validation, mu_hat, is significantly large compared to the nominal value, mu:
mu_hat - 0.09 >= 5e-05

 p-value for Test 2: 0.000


 Validation Data Proportion vs Nominal Value TOST [for the SeaDucks implementation]
sample mean:0.09009079338265277

 ------------- Test 1 -------------

Reject H0: The mean proportion of the data assigned to validation, mu_hat, is significantly small compared to the nominal value, mu:
mu_hat - 0.09 <= -5e-05

 p-value for Test 1: 0.000

 ------------- Test 2 -------------

Fail to reject H0: The mean propor

In [40]:
random_state = config['random_state']

In [8]:
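def prediction_interval(data, future_sample_size=1, alpha=0.05):
    """Return the (lower, upper) prediction interval bounds in percent, rounded to 3 decimal places."""
    n = len(data) 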
    mean = np.mean(data)
    std = np.std(data, ddof=1) 
    
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)
    se_prediction = np.sqrt(std**2 + (std**2 / future_sample_size)) 
    
    # Prediction interval
    margin_of_error = t_crit * se_prediction
    lower = round((mean - margin_of_error)*100,3)
    upper = round((mean + margin_of_error)*100,3)
    
    return lower, upper

In [98]:
set_type = ['train', 'test', 'validation']

for ii,name in enumerate(set_type):
    print(f"\nPrediction interval ({name} proportions): {prediction_interval(OM_train_test_val_proportions[:,ii], future_sample_size=1, alpha=0.05)}")



Prediction interval (train proportions): (79.042, 83.069)

Prediction interval (test proportions): (8.439, 11.517)

Prediction interval (validation proportions): (7.5, 10.434)


In [40]:
set_type = ['train', 'test', 'validation']

for ii,name in enumerate(set_type):
    print(f"\nPrediction interval ({name} proportions): {prediction_interval(train_test_val_proportions[:,ii], future_sample_size=1, alpha=0.05)}")



Prediction interval (train proportions): (78.961, 82.979)

Prediction interval (test proportions): (8.48, 11.56)

Prediction interval (validation proportions): (7.537, 10.482)


## Discussion

1. Whether the proportions of the overall data contained within the training, testing, and validation sets are practically equal to the nominal 81-10-9 proportions.

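In each TOST above, only one of the two one-sided tests rejects its null hypothesis, so the mean proportions are not shown to lie within $\delta = 5\times10^{-5}$ of the nominal values at the 5% significance level. The deviations are nonetheless small: cluster sampling the drifter data by ID with a nominal 81-10-9 split leads, on average, to training, testing, and validation sets that form approximately

* $81.06\%$ (O'Malley et al. implementation) and $80.97\%$ (SeaDucks implementation),
* $9.98\%$ and $10.02\%$,
* $8.97\%$ and $9.01\%$

of the total drifter dataset, respectively, so each sample mean is within $0.06$ percentage points of its nominal value.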

2. The nominal proportions of the overall dataset for all three set types lie within their respective 95% prediction intervals. However, variability in the number of observations per drifter ID cluster results in wide prediction intervals, so any individual train-test-validation split may deviate notably from the nominal values.

The 95% prediction intervals for the SeaDucks implementation (the intervals within which the proportion of the overall data in each set will fall 95% of the time) are:

| Data subset | Nominal proportion | 95% prediction interval |
|---|---|---|
| Training | $81\%$ | $(79.0, 83.0)\%$ |
| Testing | $10\%$ | $(8.5, 11.6)\%$ |
| Validation | $9\%$ | $(7.5, 10.5)\%$ |
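For comparison, a common textbook form of the prediction interval for a single future observation from approximately normal data is $\bar{x} \pm q\, s\sqrt{1 + 1/n}$, where $s$ is the sample standard deviation and $q$ the appropriate quantile. The sketch below evaluates it on simulated data and checks it against empirical quantiles. It is an illustration under assumptions: the data are synthetic, the normal quantile is used in place of the $t$ quantile (the two are nearly identical for large $n$), and `normal_prediction_interval` is a name introduced here.

```python
import math
from statistics import NormalDist

import numpy as np


def normal_prediction_interval(samples, alpha=0.05):
    """Interval expected to contain one future observation with probability 1 - alpha."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    q = NormalDist().inv_cdf(1 - alpha / 2)        # large-n stand-in for the t quantile
    half_width = q * float(x.std(ddof=1)) * math.sqrt(1 + 1 / n)
    return float(x.mean()) - half_width, float(x.mean()) + half_width


rng = np.random.default_rng(3)                     # arbitrary seed
simulated = 0.81 + rng.normal(0.0, 0.01, size=100_000)  # hypothetical set proportions

lower, upper = normal_prediction_interval(simulated)
emp_lower, emp_upper = np.quantile(simulated, [0.025, 0.975])  # distribution-free check

print(f"normal-theory: ({lower:.4f}, {upper:.4f})")
print(f"empirical:     ({emp_lower:.4f}, {emp_upper:.4f})")
```

When the two intervals agree closely, the normality assumption behind the formula is reasonable for the data at hand.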

There is not very much uncertainty in the mean (TOST). Individual variability is where the potential problem lies (Prediction Intervals).

It is up to my discretion whether this is a problem.