# Cross Validation

### Best Practices
A few best practices help in choosing a validation approach:

 - Use k-fold cross-validation (`KFold` in scikit-learn) when only a small amount of data is available. With few observations, setting some aside as a hold-out set is costly, because it leaves even less data for training.
 - When there is enough data, the best practice is to combine a hold-out set with cross-validation:
    1. Split the data into a training set and a test (hold-out) set.
    2. Apply cross-validation on the training set to find the best model.
    3. Evaluate the model on the test set to check that it performs well on new (unseen) data.
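
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a prescribed recipe: the synthetic dataset from `make_regression`, the `Ridge` model, the `alpha` grid, and the `*_demo` / `*_tr` / `*_te` variable names are all assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# 1. Split into a training set and a hold-out test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 2. Cross-validate on the training set only to choose a hyperparameter.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# 3. Evaluate the selected model once on the unseen test set.
test_r2 = search.score(X_te, y_te)
print("best alpha:", search.best_params_["alpha"])
print("test R^2:", round(test_r2, 3))
```

The test set is touched only once, in step 3, so its score is not influenced by the model selection done in step 2.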


### Data splitting in scikit-learn

In [1]:
X = list(range(10))
y = [x**2 for x in X]
print(X)
print(y)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [4]:
import sklearn.model_selection as model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=101)

In [5]:
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)


X_train:  [4, 9, 3, 5, 7, 6, 1]
y_train:  [16, 81, 9, 25, 49, 36, 1]
X_test:  [8, 2, 0]
y_test:  [64, 4, 0]


In [6]:
from sklearn.model_selection import KFold
import numpy as np

In [None]:
kf = KFold(n_splits=5)
X = np.array(X)
y = np.array(y)
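for train_index, test_index in kf.split(X):
    # Select this fold's training and test samples by index
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)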
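
Instead of looping over the folds by hand, the same splitter can be passed to `cross_val_score`, which fits and scores a model once per fold. The sketch below is only an illustration: the `LinearRegression` model, the mean squared error metric, the shuffling, and the `X_cv` / `y_cv` / `kf_cv` names are assumptions that do not come from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Same toy data as above; scikit-learn expects a 2-D feature matrix.
X_cv = np.arange(10).reshape(-1, 1)
y_cv = X_cv.ravel() ** 2

# shuffle=True spreads small and large x values across the folds.
kf_cv = KFold(n_splits=5, shuffle=True, random_state=101)

# One score per fold; errors are negated so that higher is always better.
scores = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=kf_cv, scoring="neg_mean_squared_error"
)
print("MSE per fold:", -scores)
print("mean MSE:", -scores.mean())
```

A straight line cannot fit `y = x**2` exactly, so the per-fold errors are nonzero. The mean across folds is the figure one would typically compare between candidate models.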