## Foundations: Split data into train, validation, and test sets

This section uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

### Read in Data

_Welcome back, this lesson will be very simple - we're just going to split up our full dataset so we have 60% of it in the training set, 20% in the validation set, and 20% in the test set._

_We will start by importing the packages we need and reading in our data. Again, we're using the `train_test_split` function from `sklearn`, which will make our job here **very** easy. Also note that we're reading in the `titanic_cleaned.csv` dataset that we created in the last lesson._

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_csv('../titanic_cleaned.csv')
titanic.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Family_cnt,Cabin_ind
0,0,3,0,22.0,7.25,1,0
1,1,1,1,38.0,71.2833,1,1
2,1,3,1,26.0,7.925,0,0
3,1,1,1,35.0,53.1,1,1
4,0,3,0,35.0,8.05,0,0
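
The `Unnamed: 0` column in the output above looks like a row index that was saved together with the data, and it will travel into the features unless it is dropped. The save step from the previous lesson is not shown here, so the following is only a sketch under that assumption. It uses a tiny made-up frame (not the real file) to show how such a column can appear and how `index_col=0` keeps it out of the columns:

```python
import io

import pandas as pd

# Tiny stand-in for the cleaned data (hypothetical values, not the real file).
df = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3]})

# Writing with the default index=True stores the row index as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back naively turns that index into a regular column.
buffer.seek(0)
naive = pd.read_csv(buffer)
print(list(naive.columns))  # ['Unnamed: 0', 'Survived', 'Pclass']

# Treating the first column as the index keeps it out of the columns.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)
print(list(clean.columns))  # ['Survived', 'Pclass']
```

If the column is left in place, the rest of this section still runs as written. It just means the row number is included as a feature.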


### Split into train, validation, and test sets

![Split Data](img/split_data.png)

_We start by splitting our data into our features (the fields used to make a prediction) and our labels or target variable (in our case that's whether somebody survived or not)._

_Then we call the `train_test_split` function. We pass in our features first, then our labels, then `test_size` (the fraction of the dataset to allocate to the test set), and lastly `random_state` (the seed for the random number generator, so the split is reproducible). `train_test_split` can only split data into two sets, not three, so we handle this in two steps. First we allocate 60% to training and 40% to what it calls "test". Then we split that 40% in half in a second step, which gives us 60% training, 20% validation, and 20% test._

_Don't forget that the ordering of the outputs is important. The function splits the features into train and test first, so the first two outputs are `X_train` and `X_test`. Then it splits the labels the same way, giving `y_train` and `y_test`._

_You might be wondering why the test size is 40%. That is the first of the two steps: the 40% held out here is only temporary, and the next call splits it in half into the final validation and test sets._

_Then we can copy that call down and use it to split the held-out 40% into validation and test sets. Change `features` to `X_test`, change `labels` to `y_test`, and change `test_size` to 0.5. So we're taking the 40% of the full dataset that we assigned to the test set and splitting it in half to create our validation set and test set. Now, let's update the output names to `X_test`, `X_val`, `y_test`, and `y_val`._

In [2]:
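# Separate the features from the label we want to predict
features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

# Step 1: 60% train, 40% held out (temporarily called "test")
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
# Step 2: split the held-out 40% in half into test and validation
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)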
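
The two calls above can also be wrapped in a small function so the same 60/20/20 split is easy to repeat. This is a minimal sketch, not part of the lesson: `three_way_split` is a hypothetical helper, and the data is a synthetic stand-in with the same number of rows as the Titanic set. It also checks that every row ends up in exactly one of the three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(features, labels, seed=42):
    """Hypothetical helper: 60% train, 20% validation, 20% test."""
    # Step 1: hold out 40% of the rows.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=0.4, random_state=seed
    )
    # Step 2: split the held-out rows in half. As in the cell above,
    # the first half becomes the test set and the second the validation set.
    X_test, X_val, y_test, y_val = train_test_split(
        X_hold, y_hold, test_size=0.5, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in with the same number of rows as the Titanic data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(1, 80, size=891),
    "Survived": rng.integers(0, 2, size=891),
})
X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(
    demo.drop("Survived", axis=1), demo["Survived"]
)

# Every row should land in exactly one of the three sets.
all_idx = X_train.index.union(X_val.index).union(X_test.index)
print(len(X_train), len(X_val), len(X_test), len(all_idx))
```

Because the features and labels are split in the same call, each `X_*` and `y_*` pair keeps matching row indices.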

_Now, let's quickly take a look at the length of each of these datasets to make sure that 60% went to train and 20% each to test and validation. We print the length of `labels` (the full dataset), then the lengths of `y_train`, `y_test`, and `y_val`._

_And we can confirm that it's split the way we expected. The 357 held-out rows can't be divided evenly, so there is one more row in the validation set than in the test set, but that's not a big deal._

In [3]:
print(len(labels), len(y_train), len(y_test), len(y_val))

891 534 178 179
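
The counts above can be reproduced with a little arithmetic. This is a sketch that assumes the second part of each split is rounded up to a whole row and the first part gets the remainder, which is consistent with the numbers printed above. `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_rows, test_size):
    """Sizes for one two-way split, rounding the second part up (assumption)."""
    n_second = math.ceil(test_size * n_rows)
    return n_rows - n_second, n_second


# Step 1: 891 rows, 40% held out.
n_train, n_holdout = split_sizes(891, 0.4)
# Step 2: the held-out rows are split in half into test and validation.
n_test, n_val = split_sizes(n_holdout, 0.5)

print(n_train, n_holdout, n_test, n_val)  # 534 357 178 179
```

Half of 357 is 178.5, so one of the two sets has to take the extra row.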


### Write out all data

_Lastly, let's write all of these out to make sure we're using the exact same training, validation, and test sets as we explore various algorithms in the next few sections._

In [4]:
X_train.to_csv('../train_features.csv', index=False)
X_val.to_csv('../val_features.csv', index=False)
X_test.to_csv('../test_features.csv', index=False)

y_train.to_csv('../train_labels.csv', index=False)
y_val.to_csv('../val_labels.csv', index=False)
y_test.to_csv('../test_labels.csv', index=False)
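
Before relying on these files in later sections, it can be worth reading one pair back to check that it round-trips. The sketch below uses a temporary directory and made-up file names and values rather than the real paths. It passes `header=True` explicitly for the labels, since the default header behaviour of `Series.to_csv` has differed between pandas releases.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for one feature/label pair (hypothetical values).
X_demo = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]})
y_demo = pd.Series([0, 1, 1], name="Survived")

with tempfile.TemporaryDirectory() as tmp:
    features_path = Path(tmp) / "demo_features.csv"
    labels_path = Path(tmp) / "demo_labels.csv"

    # index=False keeps the row index out of the files; header=True is passed
    # explicitly so the label file always starts with the column name.
    X_demo.to_csv(features_path, index=False)
    y_demo.to_csv(labels_path, index=False, header=True)

    # Read both files back while the temporary directory still exists.
    X_back = pd.read_csv(features_path)
    y_back = pd.read_csv(labels_path)["Survived"]

print(X_back.shape, y_back.tolist())  # (3, 2) [0, 1, 1]
```

Because the index is not written, features and labels are matched purely by row order when they are read back, so the two files for each set should always be loaded together.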