## Prepare Data: Split data into train, validation, and test set

This section uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview) (we are only using the training set).

In this section, we will split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

### Read in Data

_In this lesson we're going to split up our full dataset so we have 60% of our examples in the training set, 20% in the validation set, and 20% in the test set. Doing this kind of split will help us evaluate the models and perform model selection using unbiased results._

_Let's import the packages we'll need and read in our data. We're going to use the `train_test_split` function imported from `sklearn`, which will make our job here **very** easy. Note that we're reading in the `titanic_cleaned.csv` dataset that we created in the last lesson._

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

titanic = pd.read_csv('../titanic_cleaned.csv')
titanic.head()

### Split into train, validation, and test set

![Split Data](img/split_data.png)

_We start by splitting our data into our features (by dropping the `Survived` field, leaving only the fields used to make a prediction) and our labels or target variable (in our case, whether somebody survived or not)._

_Then we call `train_test_split`, passing in our features first and then our labels. This is a good point to call out that ideally we want to split the features and labels into three separate datasets (training, validation, and test). Unfortunately, `train_test_split` can only split one dataset into two, so we're going to do our split in two passes through `train_test_split`._


_For our first pass, we'll tell it to allocate 40% of the data to the test set. That leaves the 60% we want for the training set. Then we will run `train_test_split` again, taking that 40% and splitting it in half. That leaves us with 60% in the training set, 20% in the validation set, and 20% in the test set._
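
The arithmetic behind the two passes can be written out as fractions of the full dataset. This is only a restatement of the percentages above; when the row count does not divide evenly, the actual counts are rounded to whole rows, so the proportions are approximate.

$$
\text{train} = 1 - 0.4 = 0.6, \qquad
\text{validation} = 0.4 \times 0.5 = 0.2, \qquad
\text{test} = 0.4 \times 0.5 = 0.2
$$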

_Focusing on the first pass, the last argument we have to set is `random_state` (that's just the seed for the random number generator, which makes the split reproducible). Note that the ordering of the output matters. It first takes the features and splits them in two, giving `X_train` and `X_test`, and then takes the labels and splits them in two, giving `y_train` and `y_test`._

_Now we have 60% of our data in train and 40% in test. So let's copy this call and run it again, passing in `X_test` and `y_test`, which contain 40% of the data, and split them in half. That gives us our 20% for the validation set and 20% for the test set. We'll also update the names that we're writing out to._

In [2]:
features = titanic.drop('Survived', axis=1)
labels = titanic['Survived']

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
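
As a quick sanity check of the two-pass approach, here is a minimal sketch on synthetic data (not the Titanic file). The toy frame, the `two_pass_split` helper, and its `seed` parameter are hypothetical names introduced only for illustration. It shows the resulting subset sizes for 100 rows and that reusing the same `random_state` reproduces the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic data: 100 rows, one feature, a binary label.
toy = pd.DataFrame({"feature": np.arange(100), "Survived": np.arange(100) % 2})
toy_features = toy.drop("Survived", axis=1)
toy_labels = toy["Survived"]


def two_pass_split(features, labels, seed):
    """Hypothetical helper: 60/20/20 split built from two train_test_split calls."""
    # First pass: hold out 40% of the rows.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        features, labels, test_size=0.4, random_state=seed
    )
    # Second pass: split the held-out 40% in half (test and validation).
    X_te, X_va, y_te, y_va = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed
    )
    return X_tr, X_va, X_te, y_tr, y_va, y_te


first = two_pass_split(toy_features, toy_labels, seed=42)
second = two_pass_split(toy_features, toy_labels, seed=42)

# Sizes of the train, validation, and test feature sets.
split_sizes = [len(part) for part in first[:3]]
print(split_sizes)  # [60, 20, 20] for 100 rows

# With the same seed, both runs select exactly the same rows in the same order.
same_rows = all(a.index.equals(b.index) for a, b in zip(first, second))
print(same_rows)
```

Wrapping the two calls in one function is just a convenience for the demonstration; the cell above does the same thing inline.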

_Now, let's quickly take a look to make sure that 60% went to train and 20% each to validation and test._

_We'll create a loop that iterates through `y_train`, `y_val`, and `y_test`. In each iteration, we print the length of that subset divided by the length of `labels`, which represents the full dataset, rounded to two decimal places._

_And this confirms that we do have 60% in the training set, 20% in the validation set, and 20% in the test set._

In [3]:
for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2))

0.6
0.2
0.2
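
The proportions alone do not show that the three subsets are distinct. A minimal sketch, again on synthetic data with hypothetical names (`features_demo`, `labels_demo`), checks that no row lands in more than one subset and that together the subsets cover every row. It relies on `train_test_split` preserving the original pandas index on each piece.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows with a default integer index 0..49.
features_demo = pd.DataFrame({"feature": range(50)})
labels_demo = pd.Series([0, 1] * 25, name="Survived")

# Same two-pass split as in the lesson.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    features_demo, labels_demo, test_size=0.4, random_state=42
)
X_te, X_va, y_te, y_va = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

# Collect the original row labels that ended up in each subset.
train_idx, val_idx, test_idx = set(X_tr.index), set(X_va.index), set(X_te.index)

# No row should appear in two subsets.
no_overlap = (
    train_idx.isdisjoint(val_idx)
    and train_idx.isdisjoint(test_idx)
    and val_idx.isdisjoint(test_idx)
)
# Every original row should appear in exactly one subset.
full_coverage = (train_idx | val_idx | test_idx) == set(features_demo.index)

print(no_overlap, full_coverage)
```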


### Write out all data

_Lastly, let's write all of these out so that we use the exact same training, validation, and test sets as we explore various algorithms in the next few sections._

_We will use the `to_csv` method to write our dataframes out to CSV files. We include the `index=False` argument so that pandas doesn't write the index as a new column in the CSV file. Now we can pick up this data in the next chapter to test out our first algorithm._

In [4]:
X_train.to_csv('../train_features.csv', index=False)
X_val.to_csv('../val_features.csv', index=False)
X_test.to_csv('../test_features.csv', index=False)

y_train.to_csv('../train_labels.csv', index=False)
y_val.to_csv('../val_labels.csv', index=False)
y_test.to_csv('../test_labels.csv', index=False)
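
To illustrate what `index=False` does, here is a small round-trip sketch. It writes to a temporary directory rather than the `../` paths above, and the tiny frame, its column values, and the file names are made up for the example. After reading the files back, the shuffled index is gone and the labels come back as a one-column DataFrame named after the Series.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny illustrative subset with a shuffled index, as left behind by a random split.
X_demo = pd.DataFrame(
    {"Pclass": [1, 3, 2], "Fare": [71.28, 7.25, 13.0]}, index=[10, 4, 7]
)
y_demo = pd.Series([1, 0, 1], name="Survived", index=[10, 4, 7])

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = Path(tmp_dir) / "demo_features.csv"
    labels_path = Path(tmp_dir) / "demo_labels.csv"

    # index=False keeps the shuffled index out of the files.
    X_demo.to_csv(features_path, index=False)
    y_demo.to_csv(labels_path, index=False)

    # Read both files back before the temporary directory is removed.
    X_loaded = pd.read_csv(features_path)
    y_loaded = pd.read_csv(labels_path)

print(list(X_loaded.columns))  # only the feature columns, no extra index column
print(list(y_loaded.columns))  # the Series name becomes the header
print(list(X_loaded.index))    # a fresh default index
```

Because the features and labels are written in the same row order, they stay aligned by position when read back, even though the original index is dropped.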