# Splitting your data into training and testing sets

This notebook reviews the basic procedure for splitting your data into training and testing sets. We will do it using the **train_test_split** function from scikit-learn (sklearn).

First, we need to import two functions:

In [1]:
#Import the train_test_split function
from sklearn.model_selection import train_test_split

#Import the function that creates the data we will split
from mlb_misc_functions import create_dummy_modeling_table

Now we will generate the table (a pandas DataFrame) with the features and the target. This is just a dummy table that resembles what you might have in a real modeling table. The table has the following columns:
* row_id: A unique identifier for the row
* feature_1: The first feature
* feature_2: The second feature
* feature_3: The third feature
* feature_4: The fourth feature
* target: The target, with values 0 or 1.

Below we create and show the first 5 rows of the table (1000 rows in total).

In [10]:
#Create the modeling table
data = create_dummy_modeling_table(size=1000)
data.head(5)

Unnamed: 0,row_id,feature_1,feature_2,feature_3,feature_4,target
0,0,5,2,3,10.182413,0
1,1,5,3,1,11.201273,1
2,2,7,4,1,9.785386,1
3,3,3,4,2,9.502807,1
4,4,7,3,3,8.21751,0
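The `create_dummy_modeling_table` helper comes from a local module whose code is not shown in this notebook. To follow along without it, a hypothetical stand-in such as `make_dummy_table` below could produce a table with the same column names. The function name, the seed, and every value range and distribution here are assumptions guessed from the five rows printed above, not the original helper's logic.

```python
import numpy as np
import pandas as pd


def make_dummy_table(size=1000, seed=0):
    """Hypothetical stand-in for create_dummy_modeling_table.

    Column names match the table shown above; the value ranges and
    distributions are guesses, not the original implementation.
    """
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "row_id": np.arange(size),                       # unique row identifier
        "feature_1": rng.integers(1, 10, size=size),     # integers from 1 to 9
        "feature_2": rng.integers(1, 5, size=size),      # integers from 1 to 4
        "feature_3": rng.integers(0, 5, size=size),      # integers from 0 to 4
        "feature_4": rng.normal(10.0, 1.0, size=size),   # floats centered on 10
        "target": rng.integers(0, 2, size=size),         # binary label, 0 or 1
    })


data = make_dummy_table(size=1000)
print(data.head(5))
```

The values will not match the output above, but the table has the same layout, so the splitting steps that follow should work on it unchanged.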


Now we can split the data into training and testing sets, using 80% of the rows for training and 20% for testing.

In [6]:
#split the modeling table into training and testing
train, test = train_test_split(data, test_size=0.20, random_state=11)

Note that ```train_test_split``` is called here with the following arguments:
1.	```data```: The table to split (here, the pandas DataFrame created above).
2.	```test_size=0.20```: The desired size of the testing set, in this case 20% of the original data.
3.	```random_state=11```: An integer seed that controls the shuffling applied before the split. Reusing the same value makes the split reproducible.
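As a small illustration of these arguments, the sketch below splits a toy table twice with the same `random_state` and checks that both calls return the same rows. It also passes `test_size` as an integer, which scikit-learn interprets as an absolute number of test rows rather than a fraction. The `toy` table and the variable names are made up for this example and are not the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small illustrative table (an assumption, not the notebook's data)
toy = pd.DataFrame({"row_id": range(10), "target": [0, 1] * 5})

# Same random_state on both calls, so both splits should pick the same rows
train_a, test_a = train_test_split(toy, test_size=0.20, random_state=11)
train_b, test_b = train_test_split(toy, test_size=0.20, random_state=11)
same_seed_matches = train_a.equals(train_b) and test_a.equals(test_b)

# An integer test_size is read as an absolute number of test rows
train_c, test_c = train_test_split(toy, test_size=2, random_state=11)

print(same_seed_matches)
print(len(test_a), len(test_c))
```

Leaving `random_state` unset gives a different shuffle on each run, which is fine for exploration but makes results harder to compare between runs.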
 
Now, let’s see the results:

In [11]:
test.head()

Unnamed: 0,row_id,feature_1,feature_2,feature_3,feature_4,target
25,25,9,4,0,10.634838,0
464,464,5,3,1,8.511642,0
372,372,2,4,0,9.733826,0
730,730,9,1,0,9.76233,1
757,757,4,1,1,10.586899,1


In [12]:
train.head()

Unnamed: 0,row_id,feature_1,feature_2,feature_3,feature_4,target
832,832,1,4,4,9.174728,0
797,797,7,4,1,11.127272,1
49,49,8,1,1,12.404877,0
867,867,3,3,1,9.920206,1
514,514,1,4,4,9.789905,1


In [13]:
print(train.shape)
print(test.shape)

(800, 6)
(200, 6)
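Beyond the shapes, it can be worth confirming that no row ends up in both sets and that the two sets together cover the whole table. The sketch below does this on an illustrative 1000-row table (an assumption standing in for the notebook's data) by comparing `row_id` values. It also assumes, based on scikit-learn's documented behavior as I understand it, that a fractional `test_size` is rounded up and the training set receives the remaining rows.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative table with the same number of rows as the notebook's data
data = pd.DataFrame({"row_id": range(1000), "target": [0, 1] * 500})
train, test = train_test_split(data, test_size=0.20, random_state=11)

# Expected sizes: test rows rounded up, training gets the remainder
expected_test = math.ceil(0.20 * len(data))
expected_train = len(data) - expected_test

# Row ids that appear in both sets (should be none) and in either set (should be all)
overlap = set(train["row_id"]) & set(test["row_id"])
covered = set(train["row_id"]) | set(test["row_id"])

print(len(train), len(test))
print(len(overlap), len(covered) == len(data))
```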


# Final words

As shown above, we have split our original data set into training and testing sets. We used an 80/20 split, which resulted in a training set of 800 rows and a testing set of 200 rows.

More information about the ```train_test_split``` function can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
