# Train-Test Tutorial

Splitting data into training and testing sets is a critical step in building Machine Learning (ML) models, because a model must be evaluated on data it did not see during training. In this tutorial, we will cover three common splitting strategies using the Scikit-learn library: the holdout split, K-fold cross-validation, and Leave-One-Out cross-validation.

## Holdout Split

Holdout split is a straightforward method for splitting data into training and testing sets. It randomly divides the data into two sets, a training set and a testing set. We use the training set to train the model and the testing set to evaluate its performance.

Here is an example of how to use the train_test_split function from Scikit-learn to split the data into training and testing sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# generate random data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```


In the example above, we first generated a random dataset of shape (10, 2) for the features X and shape (10,) for the target variable y. We then used train_test_split to split the data into training and testing sets. Setting test_size to 0.3 means that 30% of the samples (3 of 10) are used for testing and the remaining 70% (7 of 10) for training. Setting random_state to 42 fixes the shuffle, so the same rows are assigned to each set on every run. Note that the data itself is generated without a seed, so the values in X and y still change between runs.

After splitting the data, we can print the shapes of the training and testing sets using the following code:

```python
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
```

```
X_train shape: (7, 2)
X_test shape: (3, 2)
```
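The split only becomes useful once a model is trained on one part and evaluated on the other. The following is a minimal sketch of that step. The synthetic linear data, the sample size of 100, and the choice of LinearRegression are assumptions made for illustration and are not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic data with a known linear relationship (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

# hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# fit on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print("Train R^2:", train_r2)
print("Test R^2:", test_r2)
```

A test score far below the training score would suggest that the model is overfitting the training data.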


## Holdout Split without Shuffle

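In some cases, we may not want to shuffle the data before splitting it. This is useful, for example, when working with time-series data, where the model should be trained on earlier observations and tested on later ones. In this case, we can set the shuffle parameter to False in train_test_split. The first 70% of the rows then form the training set and the last 30% form the testing set, in their original order. The random_state parameter has no effect when shuffle is False. Here's an example: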

```python
import numpy as np
from sklearn.model_selection import train_test_split

# generate random data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# split data into training and testing sets without shuffle
# (random_state is omitted because it has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

# X_train contains the first 7 rows of X, in their original order
print("X_train:", X_train)
```

```
X_train: [[ 0.32846033 -2.27171679]
 [-0.66579653  0.60807083]
 [ 0.0960338   0.77509136]
 [-1.71511511  0.31740218]
 [-0.48028649  0.33100787]
 [-0.35175208  0.38656938]
 [ 0.69733549  0.14115955]]
```
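With random values it is hard to see that the original order was kept. The sketch below repeats the unshuffled split on data whose rows are simply numbered, which is an assumption made here so that the order is visible at a glance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# rows are numbered 0..9 so the original order is easy to see
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# without shuffling, the split is a simple cut after the first 70% of rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

print("y_train:", y_train)  # y_train: [0 1 2 3 4 5 6]
print("y_test:", y_test)    # y_test: [7 8 9]
```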


## K-fold Cross-Validation

K-fold cross-validation is a more robust way of evaluating a model than a single holdout split. The data is divided into k folds of equal (or nearly equal) size, and the model is trained k times. Each time, one of the k folds is used as the testing set and the remaining k-1 folds are used as the training set. The performance of the model is then averaged over the k runs.
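To make the mechanics concrete, the procedure can be sketched by hand with NumPy alone. This is an illustrative sketch, not Scikit-learn's implementation, and the variable names are chosen here for the example.

```python
import numpy as np

n_samples, k = 10, 5
indices = np.arange(n_samples)

# np.array_split also works when n_samples is not divisible by k
folds = np.array_split(indices, k)

for i, test_idx in enumerate(folds):
    # every fold except the i-th one forms the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Fold {i}: train={train_idx}, test={test_idx}")
```

Each index appears in exactly one testing fold, so every sample is used for testing once and for training k-1 times.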

Here is an example of how to use the KFold class from Scikit-learn to generate the training and testing indices for K-fold cross-validation:

```python
import numpy as np
from sklearn.model_selection import KFold

# generate random data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# instantiate KFold with k=5
kf = KFold(n_splits=5)

# iterate over train_index and test_index in kf.split(X) and print them
for train_index, test_index in kf.split(X):
    print("Train indices:", train_index)
    print("Test indices:", test_index)
```

```
Train indices: [2 3 4 5 6 7 8 9]
Test indices: [0 1]
Train indices: [0 1 4 5 6 7 8 9]
Test indices: [2 3]
Train indices: [0 1 2 3 6 7 8 9]
Test indices: [4 5]
Train indices: [0 1 2 3 4 5 8 9]
Test indices: [6 7]
Train indices: [0 1 2 3 4 5 6 7]
Test indices: [8 9]
```


In the example above, we first generated a random dataset of shape (10, 2) for the features X and shape (10,) for the target variable y. We then created a KFold object with n_splits=5, which means that we will use 5 folds for cross-validation.

We then used a for loop to iterate over the training and testing indices for each fold. The kf.split(X) method yields one pair of index arrays (training indices, testing indices) per fold. Because shuffling is off by default, each testing fold is a block of consecutive indices.
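The indices are only the first half of cross-validation. The model still has to be trained and scored on each fold, and the scores averaged. The following is a minimal sketch of that loop. The synthetic linear data and the use of LinearRegression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# synthetic data with a known linear relationship (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=50)

kf = KFold(n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
    # train a fresh model on the k-1 training folds
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])
    # evaluate on the held-out fold (R^2 for regressors)
    scores.append(model.score(X[test_index], y[test_index]))

mean_score = np.mean(scores)
print("Per-fold R^2:", np.round(scores, 3))
print("Mean R^2:", mean_score)
```

Scikit-learn's cross_val_score helper from sklearn.model_selection wraps the same loop: passing the estimator, X, y, and cv=kf returns the array of per-fold scores.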

## K-fold Cross-Validation with Shuffle

By default, KFold splits the data into consecutive folds without shuffling. We can shuffle the samples before they are assigned to folds by setting the shuffle parameter to True. Here is an example:

```python
import numpy as np
from sklearn.model_selection import KFold

# generate random data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# instantiate KFold with k=5 and shuffle=True
kf = KFold(n_splits=5, shuffle=True)

# iterate over train_index and test_index in kf.split(X) and print them
for train_index, test_index in kf.split(X):
    print("Train indices:", train_index)
    print("Test indices:", test_index)
```

```
Train indices: [2 3 4 5 6 7 8 9]
Test indices: [0 1]
Train indices: [0 1 3 4 5 6 8 9]
Test indices: [2 7]
Train indices: [0 1 2 4 5 6 7 9]
Test indices: [3 8]
Train indices: [0 1 2 3 5 6 7 8]
Test indices: [4 9]
Train indices: [0 1 2 3 4 7 8 9]
Test indices: [5 6]
```


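In the example above, we set the shuffle parameter to True when creating the KFold object. The sample indices are then shuffled before being divided into folds, so the testing folds are no longer blocks of consecutive indices. The arrays X and y themselves are not modified. Because random_state is not set, the fold assignment will differ from run to run.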
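To get the same shuffled folds on every run, KFold also accepts a random_state parameter. The sketch below checks that two KFold objects created with the same seed produce identical folds. The helper name shuffled_test_folds and the seed value 0 are hypothetical choices made for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

def shuffled_test_folds(seed):
    """Return the testing indices of each fold for a given seed (hypothetical helper)."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_index.tolist() for _, test_index in kf.split(X)]

first = shuffled_test_folds(0)
second = shuffled_test_folds(0)

print("Test folds:", first)
print("Same seed gives same folds:", first == second)
```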



## Leave-One-Out Cross-Validation

The Leave-One-Out (LOO) method is a special case of K-fold cross-validation, where k is equal to the number of data points. In other words, each data point is used as the testing set once, and the remaining data points are used as the training set.

Here is an example of how to use the LeaveOneOut class from Scikit-learn to perform LOO cross-validation:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# generate random data
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

# instantiate LeaveOneOut
loo = LeaveOneOut()

# iterate over train_index and test_index in loo.split(X) and print them
for train_index, test_index in loo.split(X):
    print("Train indices:", train_index)
    print("Test indices:", test_index)
```

```
Train indices: [1 2 3 4 5 6 7 8 9]
Test indices: [0]
Train indices: [0 2 3 4 5 6 7 8 9]
Test indices: [1]
Train indices: [0 1 3 4 5 6 7 8 9]
Test indices: [2]
Train indices: [0 1 2 4 5 6 7 8 9]
Test indices: [3]
Train indices: [0 1 2 3 5 6 7 8 9]
Test indices: [4]
Train indices: [0 1 2 3 4 6 7 8 9]
Test indices: [5]
Train indices: [0 1 2 3 4 5 7 8 9]
Test indices: [6]
Train indices: [0 1 2 3 4 5 6 8 9]
Test indices: [7]
Train indices: [0 1 2 3 4 5 6 7 9]
Test indices: [8]
Train indices: [0 1 2 3 4 5 6 7 8]
Test indices: [9]
```


In the example above, we first generated a random dataset of shape (10, 2) for the features X and shape (10,) for the target variable y. We then created a LeaveOneOut object, which takes no parameters.

We used a for loop to iterate over the training and testing indices for each split. The loo.split(X) method yields one pair of index arrays per sample, so 10 samples produce 10 splits, each with a single testing index.
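The claim that LOO is K-fold with k equal to the number of data points can be checked directly. The sketch below compares the splits produced by LeaveOneOut with those of an unshuffled KFold whose n_splits equals the number of samples. The numbered data is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)

# collect (train, test) index pairs from both splitters as plain lists
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X)).split(X)]

print("Number of LOO splits:", len(loo_splits))
print("Identical to KFold(n_splits=n):", loo_splits == kf_splits)
```

Since LOO trains one model per sample, its cost grows with the size of the dataset, which is why it is mostly used on small datasets.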

## Conclusion

In this tutorial, we have covered three common splitting strategies using the Scikit-learn library: the holdout split, K-fold cross-validation, and Leave-One-Out cross-validation. These methods are essential for estimating how well an ML model performs on unseen data and for detecting overfitting.