# Investigating the Impact of Cluster Sampling on the Size of the Training, Testing, and Validation Sets

When training the MVN NGBoost model on the drifter data, we must take extra care to ensure that information about the testing and validation sets is not inadvertently introduced into the training data. The ~400,000 drifter observations that form our data set come from ~2000 unique drifters. We expect observations taken by the same drifter to be highly correlated, so we ensure that all of the observations made by any single drifter fall in exactly one of the training, testing, or validation sets. Keeping each drifter's observations within a single set prevents the training data from leaking information about the other sets through this correlation.

O'Malley et al. (2023) deal with this issue using cluster sampling. That is, they split the data into clusters defined by drifter ID, then randomly assign the clusters to the training, testing, and validation sets so that the sets contain 81%, 10%, and 9% of the drifter IDs respectively. However, the number of observations varies significantly between drifter ID clusters, so the proportion of the overall data found in each set may differ significantly from the nominal 81-10-9 split. At its most extreme, this discrepancy could result in testing and training sets of comparable sizes.
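
To illustrate how unequal cluster sizes can pull the observation shares away from the nominal split, here is a minimal sketch using only the Python standard library. The function name `cluster_split`, the log-normal cluster sizes, and the rounding of set sizes are all assumptions made for illustration; they are not the drifter data or the splitting routine used in this notebook.

```python
import random

def cluster_split(cluster_sizes, test_frac=0.10, validation_frac=0.09, seed=0):
    """Assign whole clusters to train/test/validation sets.

    cluster_sizes maps a cluster ID to its number of observations.
    Returns the share of all observations in (train, test, validation).
    """
    rng = random.Random(seed)
    ids = list(cluster_sizes)
    rng.shuffle(ids)

    # Set sizes are counted in clusters (IDs), not in observations.
    n_test = round(test_frac * len(ids))
    n_val = round(validation_frac * len(ids))
    test_ids = ids[:n_test]
    val_ids = ids[n_test:n_test + n_val]
    train_ids = ids[n_test + n_val:]

    total = sum(cluster_sizes.values())
    return tuple(
        sum(cluster_sizes[i] for i in group) / total
        for group in (train_ids, test_ids, val_ids)
    )

# Hypothetical, heavily skewed cluster sizes (synthetic, not real drifters).
size_rng = random.Random(1)
sizes = {drifter_id: int(size_rng.lognormvariate(4.5, 1.2)) + 1
         for drifter_id in range(2000)}

train, test, val = cluster_split(sizes)
print(f"train={train:.3f}, test={test:.3f}, validation={val:.3f}")
```

The three shares always sum to one, but they only match 0.81, 0.10, and 0.09 exactly when every cluster holds the same number of observations.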

In this notebook, we investigate whether the sizes of the training, testing, and validation sets that result from 81-10-9 cluster sampling differ significantly from the nominal 81-10-9 values.

In [None]:
# import modules and load data
import pandas as pd
import numpy as np
path_to_data = '../data/filtered_nao_drifters_with_sst_gradient.h5'
data = pd.read_hdf(path_to_data)
# add day of the year as an index (to be added to the data later)
data['day_of_year'] = data['time'].apply(lambda t : t.timetuple().tm_yday)

In [2]:
# split the drifter IDs into training, testing and validation 

from sklearn.model_selection import train_test_split

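def train_test_validation_split(X, Y, *,
                                test_frac=0.10, validation_frac=0.09,
                                random_state=None, shuffle=True, stratify=None):
    """Split X and Y into training, testing and validation sets.

    Returns X_train, X_test, X_val, Y_train, Y_test, Y_val, or
    X_train, X_test, Y_train, Y_test when validation_frac == 0.
    """
    # first split off the testing set
    X_aux, X_test, Y_aux, Y_test = train_test_split(
        X, Y, test_size=test_frac, random_state=random_state,
        shuffle=shuffle, stratify=stratify)
    if validation_frac == 0:
        return X_aux, X_test, Y_aux, Y_test
    # validation_frac is a fraction of the full data, so rescale it to a
    # fraction of what remains after the testing set has been removed.
    # Note: stratify is reused as given (it has the length of the full data),
    # so only stratify=None, as used in this notebook, is supported here.
    X_train, X_val, Y_train, Y_val = train_test_split(
        X_aux, Y_aux, test_size=validation_frac / (1 - test_frac),
        random_state=random_state, shuffle=shuffle, stratify=stratify)
    return X_train, X_test, X_val, Y_train, Y_test, Y_val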
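
The second split uses `validation_frac/(1 - test_frac)` rather than `validation_frac` because it acts only on the data left after the testing set is removed. A short arithmetic check of this rescaling is sketched below; the variable names `remaining` and `second_stage_frac` are introduced here for illustration only.

```python
import math

test_frac = 0.10
validation_frac = 0.09

# Fraction of the drifter IDs left once the testing set is removed.
remaining = 1 - test_frac

# Fraction of the remaining IDs that must go to validation so that the
# validation set is validation_frac of the full set of IDs.
second_stage_frac = validation_frac / remaining

# Resulting fractions of the full set of IDs.
train_frac = remaining * (1 - second_stage_frac)
val_frac = remaining * second_stage_frac

print(f"second-stage test_size: {second_stage_frac:.4f}")
print(f"train/test/validation: {train_frac:.2f}/{test_frac:.2f}/{val_frac:.2f}")
```

Passing `validation_frac` directly to the second split would instead give a validation set of 0.09 × 0.90 = 8.1% of the IDs.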

In [48]:
# randomly split drifter IDs into 81-10-9 and calculate the proportion of data in each of the sets

number_of_samples = len(data.index)
N = 1000 # number of repeats for hypothesis tests

count_by_id = data.groupby('id').size()
X, Y = np.array(count_by_id.index), np.array(count_by_id)
train_test_val_proportions = []

for ii in range(N):
    _,_,_,Y_train,Y_test,Y_val = train_test_validation_split(X, Y,
                                                             test_frac = 0.10, validation_frac = 0.09)
    train_test_val_proportions.append([sum(Y_train),sum(Y_test),sum(Y_val)])

train_test_val_proportions = np.array(train_test_val_proportions)/number_of_samples

## Testing

With the proportions of data in the training, testing, and validation sets calculated above for `N=1000` repetitions, we test the following hypotheses.

### $\chi^2$ test for normality

To check that we can use the (two-sided) Student's $t$-test for the mean training data proportion and for the mean ratio between the sizes of the validation and testing sets, we first test each quantity for normality using `scipy.stats.normaltest`.
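
For context, `scipy.stats.normaltest` implements the D'Agostino-Pearson omnibus test, which combines a skewness test and a kurtosis test. As a brief sketch, writing $Z_s$ and $Z_k$ (notation introduced here) for the standardized skewness and kurtosis test statistics, the test statistic is

$$K^2 = Z_s^2 + Z_k^2, \qquad K^2 \sim \chi^2_2 \quad \text{approximately, under } H_0.$$

The $p$-value is the upper tail probability of $K^2$ under the $\chi^2$ distribution with two degrees of freedom, so a sample with pronounced skew or excess kurtosis gives a large $K^2$ and a small $p$-value.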

$H_0:$ The proportion of the dataset that is assigned to training follows a normal distribution.

$H_1:$ The proportion of the dataset that is assigned to training does not follow a normal distribution.

Significance Level: 5%

In [49]:
from scipy.stats import normaltest

train_proportions = train_test_val_proportions[:,0]

_, p = normaltest(train_proportions, nan_policy='propagate')

alpha = 0.05
if p < alpha:
    print("\nReject H0: The proportion of the data that is assigned to training does not follow a normal distribution.")
else:
    print("\nFail to reject H0: The proportion of the data that is assigned to training follows a normal distribution.")

print(f"\n p value: {p:.3f}")



Fail to reject H0: The proportion of the data that is assigned to training follows a normal distribution.

 p value: 0.544


$H_0$: The ratio between the sizes of the validation and testing sets follows a normal distribution.

$H_1$: The ratio between the sizes of the validation and testing sets does not follow a normal distribution.

Significance Level: 5%

In [50]:
validation_test_ratio = train_test_val_proportions[:,2]/train_test_val_proportions[:,1]

_, p = normaltest(validation_test_ratio, nan_policy='propagate')

alpha = 0.05
if p < alpha:
    print("\nReject H0: The ratio between the sizes of the validation and testing sets does not follow a normal distribution")
else:
    print("\nFail to reject H0: The ratio between the sizes of the validation and testing sets follows a normal distribution")

print(f"\n p value: {p:.3f}")


Reject H0: The ratio between the sizes of the validation and testing sets does not follow a normal distribution

 p value: 0.011


### Two-Sided Student's $t$-Tests

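The normality test above does not reject normality for the training data proportion, but it does reject normality for the ratio between the sizes of the validation and testing sets ($p = 0.011$). Each combination of training, testing, and validation sets is drawn independently, and with a large number of repetitions the sample mean is approximately normal by the central limit theorem, so we still apply the two-sided Student's $t$-test to both quantities, while treating the result for the ratio with more caution.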

$H_0$: The mean proportion of the data assigned to training, $\mu = 0.81$

$H_1$: The mean proportion of the data assigned to training, $\mu \neq 0.81$

Significance Level: 5%
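
As a rough illustration of what the one-sample $t$-test computes, the sketch below evaluates the statistic $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ with the standard library. The helper `one_sample_t` and the synthetic sample are hypothetical, and the $p$-value uses a normal approximation to the $t$ distribution, which is only reasonable for large $n$; the notebook itself relies on `scipy.stats.ttest_1samp`.

```python
import math
import random
import statistics

def one_sample_t(sample, popmean):
    """Return the one-sample t statistic and an approximate two-sided p-value."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - popmean) / (sd / math.sqrt(n))
    # Assumption: for large n the t distribution is close to a standard
    # normal, so the normal CDF is used in place of the t CDF.
    p_approx = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return t, p_approx

# Synthetic proportions for illustration only (not the notebook's results).
rng = random.Random(0)
sample = [rng.gauss(0.81, 0.01) for _ in range(1000)]

t_stat, p_approx = one_sample_t(sample, 0.81)
print(f"t = {t_stat:.3f}, approximate p = {p_approx:.3f}")
```

Because the standard error $s/\sqrt{n}$ shrinks as the number of repetitions grows, a very small departure of the mean from 0.81 can still produce a $p$-value below 5%.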


In [51]:
from scipy.stats import ttest_1samp
popmean = 0.81

_, p = ttest_1samp(train_proportions, popmean,
                      axis=None, nan_policy='propagate', alternative='two-sided')

alpha = 0.05
if p < alpha:
    print("\nReject H0: The mean proportion of the data assigned to training, mu != 0.81")
else:
    print("\nFail to reject H0: The mean proportion of the data assigned to training, mu = 0.81")

print(f"\n p value: {p:.3f}")


Reject H0: The mean proportion of the data assigned to training, mu != 0.81

 p value: 0.029


$H_0$: The mean ratio between the sizes of the validation and testing sets, $\mu = 0.9$

$H_1$: The mean ratio between the sizes of the validation and testing sets, $\mu \neq 0.9$

Significance Level: 5%

In [42]:
popmean = 0.9

_, p = ttest_1samp(validation_test_ratio, popmean,
                      axis=None, nan_policy='propagate', alternative='two-sided')

alpha = 0.05
if p < alpha:
    print("\nReject H0: The mean ratio between the sizes of the validation and testing sets, mu != 0.9")
else:
    print("\nFail to reject H0: The mean ratio between the sizes of the validation and testing sets, mu = 0.9")

print(f"\n p value: {p:.3f}")


Fail to reject H0: The mean ratio between the sizes of the validation and testing sets, mu = 0.9

 p value: 0.718
