## Foundations: Split data into train, validation, and test sets

This section uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

### Read in Data

_Welcome back, this lesson will be very simple - we're just going to split up our full dataset so we have 60% of it in the training set, 20% in the validation set, and 20% in the test set. Doing this kind of split will help us evaluate the models and perform model selection using unbiased results._

_We will import the packages we'll need and read in our data - we're going to use the `train_test_split` function imported from `sklearn` - that will make our job here **very** easy. And I also want to call out that we're reading in the `titanic_cleaned` dataset that we created in the last lesson._

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_csv('../titanic_cleaned.csv')
titanic.head()

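| Unnamed: 0 | Survived | Pclass | Sex | Age | Fare | Family_cnt | Cabin_ind |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |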


### Split into train, validation, and test set

![Split Data](img/split_data.png)

_We start by splitting our data into our features (the fields used to make a prediction) and our labels or target variable (in our case that's whether somebody survived or not)._

_Then we will call `train_test_split`: first we pass in our features, then we pass in our labels. Now, this is a good point to call out that ideally we want to split features and labels into three separate datasets (training, validation, and test). Unfortunately, `train_test_split` can only split one dataset into two. So we're going to do our split in two passes through `train_test_split`._


_So for our first pass we'll tell it to allocate 40% of the data to the test set. That will leave the 60% we want for the training set. Then we just set `random_state` (the seed for the random number generator) so the split is reproducible. Note that the ordering of the output matters. It will first take the features and split them in two - that will be `X_train` and `X_test` - and then take the labels and split them in two, and that will be `y_train` and `y_test`._
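
To make the output ordering and the effect of `random_state` concrete, here is a minimal sketch on a made-up toy dataset. The names `toy_features` and `toy_labels` and their values are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented purely for illustration (not the Titanic dataset).
toy_features = pd.DataFrame({"x": range(10)})
toy_labels = pd.Series([0, 1] * 5, name="y")

# Outputs come back in this order: features (train, test), then labels (train, test).
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.4, random_state=42
)

# Rows stay aligned: each label keeps the index of its feature row.
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))

# Reusing the same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy_features, toy_labels, test_size=0.4, random_state=42
)
print(X_tr.equals(X_tr2), y_te.equals(y_te2))

# Sizes of the two pieces produced by test_size=0.4 on 10 rows.
print(len(X_tr), len(X_te))
```

Unpacking the four return values in a different order would silently pair features with the wrong labels, which is why the ordering is worth double-checking.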

_Now we have 60% of our data in train and 40% in test. So let's copy this and call it again, and we'll pass in `X_test` and `y_test` and split them in half - that will give us our 20% for the validation set and 20% for the test set. We'll update the names that we're writing out to._

_Just to reiterate, `train_test_split` can't split data into three datasets, only two, so we handle this in two steps. First we allocate 60% to training and 40% to what it calls "test". Then we split that 40% in half in a second step, which gives us 60% training, 20% validation, and 20% test._

In [2]:
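# Separate the features from the label we want to predict
features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

# Pass 1: 60% train, 40% held out (temporarily called "test")
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
# Pass 2: split the held-out 40% in half -> 20% test, 20% validation
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)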
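
The two-pass approach can also be wrapped in a small function that works for other proportions. The sketch below is one possible way to do that, not part of the lesson's code: `three_way_split` is a hypothetical helper, and `demo_X` / `demo_y` are synthetic stand-ins with the same row count as the Titanic data. The key detail is that the second pass's `test_size` is relative to the held-out portion, not the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(features, labels, val_frac=0.2, test_frac=0.2, random_state=42):
    """Hypothetical helper: split into train/validation/test in two passes."""
    holdout_frac = val_frac + test_frac
    # Pass 1: carve off the combined validation + test portion.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=holdout_frac, random_state=random_state
    )
    # Pass 2: divide the holdout; test_size is a fraction of the holdout only.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_frac / holdout_frac, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in with the same number of rows as the Titanic data (891).
demo_X = pd.DataFrame({"feature": range(891)})
demo_y = pd.Series([i % 2 for i in range(891)], name="label")

X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(demo_X, demo_y)
print(len(y_train), len(y_val), len(y_test))
```

With the default 20%/20% arguments the second pass uses `test_size=0.5`, matching the two calls above. For a 70/15/15 split the second pass would again use 0.15 / 0.30 = 0.5, while a 80/15/5 split would use 0.05 / 0.20 = 0.25.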

_Now, let's quickly take a look at the length of each of these datasets to make sure that 60% went to train and 20% went to each of test and validation. So print out the length of `labels` (full dataset), then the lengths of `y_train`, `y_test`, and `y_val`._

_And we can confirm that it's split out the way we expected. The 357 held-out rows can't be divided evenly, so there is one more in the validation set, but that's not a big deal._

In [3]:
print(len(labels), len(y_train), len(y_test), len(y_val))

891 534 178 179
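
Raw counts are easier to judge as proportions. As a quick sanity check, the sketch below converts the sizes printed above into shares of the full dataset; `split_shares` is a hypothetical helper written for this illustration.

```python
def split_shares(total, **split_sizes):
    """Hypothetical helper: return each split's share of the full dataset."""
    return {name: round(size / total, 3) for name, size in split_sizes.items()}


# Sizes taken from the printed output: 891 total, 534 train, 178 test, 179 validation.
sizes = {"train": 534, "val": 179, "test": 178}

# Every row should land in exactly one split.
assert sum(sizes.values()) == 891

shares = split_shares(891, **sizes)
print(shares)  # {'train': 0.599, 'val': 0.201, 'test': 0.2}
```

The shares land within a fraction of a percent of the 60/20/20 target; the small deviation comes from 891 not being evenly divisible.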


### Write out all data

_Lastly, let's write these all out to make sure we're using the exact same training, validation, and test sets as we explore various algorithms in the next few sections._

_Pandas has a really great built-in method called `to_csv` to write dataframes out to CSV files. We include `index=False` here because the index in this dataset isn't really meaningful._

In [4]:
X_train.to_csv('../train_features.csv', index=False)
X_val.to_csv('../val_features.csv', index=False)
X_test.to_csv('../test_features.csv', index=False)

y_train.to_csv('../train_labels.csv', index=False)
y_val.to_csv('../val_labels.csv', index=False)
y_test.to_csv('../test_labels.csv', index=False)
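
To see what `index=False` changes, the sketch below writes a small stand-in dataframe both ways to a temporary directory and reads each file back. The `demo` frame and the file names are invented for illustration. Writing with the default settings stores the index as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back; that is presumably where the `Unnamed: 0` column in the `titanic.head()` output above came from.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; the shuffled index mimics what a random split leaves behind.
demo = pd.DataFrame(
    {"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]},
    index=[5, 0, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    demo.to_csv(with_index)                  # default: index written as an unnamed first column
    demo.to_csv(without_index, index=False)  # index left out of the file

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Pclass', 'Fare']
print(cols_without_index)  # ['Pclass', 'Fare']
```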