## Train Test Frameworks

This exercise practices the syntax of the sklearn functions and classes that split data into train and test sets. The goal is to get familiar with these splitting methods before engaging with the more complex activities at the end of the day.

In [1]:
# import numpy
import numpy as np
# toy data: 10 observations with 2 features each, and 10 target values
X = np.random.normal(0, 1, 20).reshape(10, 2)
y = np.random.normal(0, 1, 10)

* print X

* print y

In [3]:
print(X)
print(y)

[[-0.08375306 -1.5030772 ]
 [-1.45264769  1.53549941]
 [-1.18435993 -1.11891047]
 [ 1.53013712 -2.20594926]
 [-1.25385228  1.98771211]
 [ 0.48488861  0.34209684]
 [ 1.35556662 -0.27954208]
 [-0.71678884  0.04057391]
 [ 0.17731041 -0.11746045]
 [ 0.0557947  -0.22812893]]
[-0.42958867  0.05343784  0.05019458  0.11582392  0.09192977  0.32115742
  0.33483784  1.463977   -0.22085896 -0.13568607]


_____________________________
### Holdout split

* import the **train_test_split** function from sklearn

In [4]:
# train_test_split is accessed below as model_selection.train_test_split
import sklearn.model_selection as model_selection

* split the data into a train set and a test set, using a 70:30 or an 80:20 ratio.

* print X_train

In [7]:
XTrain, XTest, YTrain, YTest = model_selection.train_test_split(X, y, train_size=0.7, test_size=0.3)
print(XTrain)

[[-1.25385228  1.98771211]
 [-0.08375306 -1.5030772 ]
 [ 0.0557947  -0.22812893]
 [ 1.35556662 -0.27954208]
 [ 1.53013712 -2.20594926]
 [-0.71678884  0.04057391]
 [ 0.17731041 -0.11746045]]
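
Because the split above shuffles the rows at random, rerunning the cell generally returns a different train set. The following is a minimal sketch, assuming a repeatable split is wanted: it passes `random_state` to `train_test_split`. The names `X_demo` and `y_demo` and both seed values are illustrative choices, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same shape as X and y above (seed 0 is an arbitrary choice)
rng = np.random.default_rng(0)
X_demo = rng.normal(0, 1, size=(10, 2))
y_demo = rng.normal(0, 1, size=10)

# Two calls with the same random_state produce the same shuffled split
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)                    # True
print(a_train.shape, a_test.shape)   # (7, 2) (3, 2)
```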


* split the data again, this time with the parameter shuffle=False

* print X_train

In [9]:
XTrain2, XTest2, YTrain2, YTest2 = model_selection.train_test_split(X, y, train_size=0.7, test_size=0.3, shuffle=False)
XTrain2

array([[-0.08375306, -1.5030772 ],
       [-1.45264769,  1.53549941],
       [-1.18435993, -1.11891047],
       [ 1.53013712, -2.20594926],
       [-1.25385228,  1.98771211],
       [ 0.48488861,  0.34209684],
       [ 1.35556662, -0.27954208]])
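
The output above matches the first seven rows of X, which suggests that `shuffle=False` turns the split into a plain cut. A small sketch to check this, using illustrative arrays (`X_demo`, `y_demo`) whose rows are easy to recognise by value:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Rows 0..9 are identifiable by their values
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, shuffle=False)

# Without shuffling, the first 7 rows go to train and the last 3 to test
keeps_order = np.array_equal(X_tr, X_demo[:7]) and np.array_equal(X_te, X_demo[7:])
print(keeps_order)  # True
print(y_te)         # [7 8 9]
```

This matters for ordered data such as time series, where shuffling would mix later observations into the train set.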

* print the shape of X_train and X_test

In [10]:
print(XTrain.shape)
print(XTest.shape)

(7, 2)
(3, 2)


_________________________________
### K-fold split 

* import the **KFold** class from sklearn

* instantiate KFold with k=5

* iterate over train_index and test_index in kf.split(X) and print them

In [11]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

for train_ind, test_ind in kf.split(X):
    X_train, X_test = X[train_ind], X[test_ind]
    y_train, y_test = y[train_ind], y[test_ind]
    print('train index: ', train_ind)
    print('test index: ', test_ind)

train index:  [2 3 4 5 6 7 8 9]
test index:  [0 1]
train index:  [0 1 4 5 6 7 8 9]
test index:  [2 3]
train index:  [0 1 2 3 6 7 8 9]
test index:  [4 5]
train index:  [0 1 2 3 4 5 8 9]
test index:  [6 7]
train index:  [0 1 2 3 4 5 6 7]
test index:  [8 9]
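
In the output above, every index lands in the test fold exactly once. A short sketch that checks this property programmatically; `X_demo` is a placeholder array, since only its number of rows affects the indices:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((10, 2))  # only the number of rows matters here

kf_demo = KFold(n_splits=5)
test_folds = [test_idx for _, test_idx in kf_demo.split(X_demo)]

# Each fold holds 10 / 5 = 2 observations
fold_sizes = [len(fold) for fold in test_folds]

# Taken together, the test folds contain every index 0..9 exactly once
all_test = np.concatenate(test_folds)
covers_all_once = np.array_equal(np.sort(all_test), np.arange(10))

print(fold_sizes)       # [2, 2, 2, 2, 2]
print(covers_all_once)  # True
```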


* instantiate KFold with k=5 and shuffle=True

* iterate over train_index and test_index in kf.split(X) and print them

In [13]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5,shuffle=True)
for train_ind, test_ind in kf.split(X):
    X_train, X_test = X[train_ind], X[test_ind]
    y_train, y_test = y[train_ind], y[test_ind]
    print('train index: ', train_ind)
    print('test index: ', test_ind)

train index:  [0 1 2 3 4 6 8 9]
test index:  [5 7]
train index:  [1 3 4 5 6 7 8 9]
test index:  [0 2]
train index:  [0 1 2 3 5 7 8 9]
test index:  [4 6]
train index:  [0 1 2 4 5 6 7 9]
test index:  [3 8]
train index:  [0 2 3 4 5 6 7 8]
test index:  [1 9]
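
With `shuffle=True` the fold membership changes from run to run unless a seed is fixed. A minimal sketch, assuming reproducible folds are wanted: it passes `random_state` to `KFold`. The helper `shuffled_test_folds` and the seed value are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((10, 2))  # placeholder data; only the row count matters

def shuffled_test_folds(seed):
    """Return the test indices of each fold for a given seed (illustrative helper)."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(X_demo)]

folds_a = shuffled_test_folds(0)
folds_b = shuffled_test_folds(0)

# Same seed, same folds
print(folds_a == folds_b)  # True

# Shuffling changes which rows share a fold, but the folds still partition 0..9
still_partition = sorted(i for fold in folds_a for i in fold) == list(range(10))
print(still_partition)     # True
```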


_______________________________________
### Leave-One-Out split
This is the Leave-p-out technique from the previous readings with p=1: each observation is used once as a single-element test set, and the remaining observations form the train set.
- This is a popular method for tiny datasets.
- It takes a lot of time with bigger datasets, since one model is fit per observation, and can lead to overfitting on a final model.
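
The relationship to Leave-p-out can be checked directly. The sketch below compares `LeaveOneOut` with sklearn's `LeavePOut(p=1)` on a placeholder array (`X_demo` is illustrative); `LeavePOut` is not otherwise used in this exercise.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X_demo = np.zeros((10, 2))  # placeholder data; only the row count matters

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
lpo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeavePOut(p=1).split(X_demo)]

# With p=1 both splitters yield identical train/test indices
print(loo_splits == lpo_splits)  # True

# For larger p the number of splits grows quickly: C(10, 2) = 45
print(LeavePOut(p=2).get_n_splits(X_demo))  # 45
```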

* import the **LeaveOneOut** class from sklearn

* instantiate LeaveOneOut

* iterate over train_index and test_index in loo.split(X) and print them

In [14]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [1 2 3 4 5 6 7 8 9] TEST: [0]
TRAIN: [0 2 3 4 5 6 7 8 9] TEST: [1]
TRAIN: [0 1 3 4 5 6 7 8 9] TEST: [2]
TRAIN: [0 1 2 4 5 6 7 8 9] TEST: [3]
TRAIN: [0 1 2 3 5 6 7 8 9] TEST: [4]
TRAIN: [0 1 2 3 4 6 7 8 9] TEST: [5]
TRAIN: [0 1 2 3 4 5 7 8 9] TEST: [6]
TRAIN: [0 1 2 3 4 5 6 8 9] TEST: [7]
TRAIN: [0 1 2 3 4 5 6 7 9] TEST: [8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [9]


* print the number of splits

In [15]:
loo.get_n_splits(X)

10
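
The number of splits equals the number of observations, which is why the method becomes slow on larger datasets: evaluating a model this way requires one fit per split. A hedged sketch of what that looks like, assuming a `LinearRegression` model and mean squared error as the metric (both are illustrative choices, as are `X_demo`, `y_demo` and the seed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Toy data with the same shape as X and y above (seed 0 is an arbitrary choice)
rng = np.random.default_rng(0)
X_demo = rng.normal(0, 1, size=(10, 2))
y_demo = rng.normal(0, 1, size=10)

# One fit and one score per observation; R^2 is undefined on a single
# test sample, so negative mean squared error is used instead
scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)

print(len(scores))  # 10
```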