# Understanding the data splitting functions in scikit-learn

Let’s get started and create a Python list of the numbers from `0` to `9` using `range()`:


In [11]:
X = list(range(10))
print(X)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


Then, we create a second list that contains the square of each number in `X`, using a list comprehension:

In [12]:
y = [x*x for x in X]
print(y)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


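Next, we import the `model_selection` module from scikit-learn and use its `train_test_split()` function to split our data into a training set and a test set.  
Here `train_size=0.75` and `test_size=0.25` request a 75/25 split, and `random_state=101` fixes the shuffle so that the result is reproducible: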

In [10]:
import sklearn.model_selection as model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=101)

print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

X_train:  [4, 9, 3, 5, 7, 6, 1]
y_train:  [16, 81, 9, 25, 49, 36, 1]
X_test:  [8, 2, 0]
y_test:  [64, 4, 0]
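As a quick sanity check, the sketch below confirms three properties of the call above: the 7/3 split sizes, the reproducibility that `random_state` gives, and the fact that each target stays paired with its input. The names `split_a` and `split_b` are introduced here only for illustration.

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * x for x in X]

# Two calls with the same random_state should produce the same partition.
split_a = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=101)
split_b = train_test_split(X, y, train_size=0.75, test_size=0.25, random_state=101)

X_train, X_test, y_train, y_test = split_a

# 10 samples at 75/25 gives 7 training and 3 test samples.
print(len(X_train), len(X_test))

# Same seed, same split.
print(split_a == split_b)

# X and y are shuffled together, so every target is still the square of its input.
print(all(t == x * x for x, t in zip(X_train, y_train)))
```

Changing `random_state` (or leaving it out) would generally give a different partition, but the sizes and the pairing of `X` with `y` stay the same.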


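In older code you may also see `cross_validation.train_test_split()` used for similar tasks.

In fact, it’s just the old name for the same thing: the `sklearn.cross_validation` module was deprecated in favor of `sklearn.model_selection` and removed in scikit-learn 0.20. Below we simply import `model_selection` under the old name. This call also leaves out `test_size`, which then defaults to the complement of `train_size`, so the result is identical.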

In [13]:
import sklearn.model_selection as cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.75, random_state=101)

print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)

X_train:  [4, 9, 3, 5, 7, 6, 1]
y_train:  [16, 81, 9, 25, 49, 36, 1]
X_test:  [8, 2, 0]
y_test:  [64, 4, 0]
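To make the point concrete, here is a minimal sketch showing that the two aliases refer to the same module object, and that omitting `test_size` yields the same split as passing `test_size=0.25` explicitly. The variable names `with_test_size` and `without_test_size` are illustrative only.

```python
import sklearn.model_selection as model_selection
import sklearn.model_selection as cross_validation

# Both names are bound to the very same module object.
same_module = cross_validation is model_selection
print(same_module)

X = list(range(10))
y = [x * x for x in X]

with_test_size = model_selection.train_test_split(
    X, y, train_size=0.75, test_size=0.25, random_state=101
)
# test_size is omitted, so it is taken as the complement of train_size.
without_test_size = cross_validation.train_test_split(
    X, y, train_size=0.75, random_state=101
)

print(with_test_size == without_test_size)
```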


In **scikit-learn**, you can use the `KFold` class to split your dataset into k consecutive folds. Each fold serves once as the test set while the remaining folds together form the training set.

In [14]:
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=5)
X = np.array(X)
y = np.array(y)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_test: ", X_test)

X_test:  [0 1]
X_test:  [2 3]
X_test:  [4 5]
X_test:  [6 7]
X_test:  [8 9]
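The folds above are consecutive because `KFold` does not shuffle by default. If the data is ordered, it may be preferable to shuffle before splitting. The following is a minimal sketch using the `shuffle` and `random_state` parameters of `KFold`; the seed `101` is an arbitrary choice reused from the earlier cells, and `seen` is a helper list introduced only to check coverage.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

# shuffle=True randomizes which samples land in each fold;
# random_state makes that assignment reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=101)

seen = []        # test samples collected over all folds
fold_sizes = []  # (train size, test size) for each fold
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={X[train_index]} test={X[test_index]}")
    seen.extend(X[test_index].tolist())
    fold_sizes.append((len(train_index), len(test_index)))

# Every sample appears in exactly one test fold.
print(sorted(seen))
```

With 10 samples and 5 splits, each fold holds 2 test samples and 8 training samples, whether or not shuffling is enabled.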
