## Foundations: Split data into train, validation, and test sets

We use the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will split the data into train, validation, and test sets in preparation for fitting a basic model in the next section.

### 1. Import libraries

In [9]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
%matplotlib inline

print("Setup completed")

Setup completed


### 2. Read clean/preprocessed data

In [10]:
titanic = pd.read_csv('../dataset/titanic_clean.csv')
titanic.head()

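| Unnamed: 0 | PassengerId | Survived | Pclass | Sex | Age | Fare | Cabin_ind | Family_cnt |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 7.25 | 0 | 1 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |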


### 3. Split into train (60%), validation (20%), and test (20%) sets

In [11]:
labels = titanic['Survived']
features = titanic.drop('Survived', axis=1)

In [12]:
labels.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [13]:
features.head()

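| Unnamed: 0 | PassengerId | Pclass | Sex | Age | Fare | Cabin_ind | Family_cnt |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 0 | 22.0 | 7.25 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 3 | 3 | 1 | 26.0 | 7.925 | 0 | 0 |
| 3 | 4 | 1 | 1 | 35.0 | 53.1 | 1 | 1 |
| 4 | 5 | 3 | 0 | 35.0 | 8.05 | 0 | 0 |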


In [14]:
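# First split: keep 60% for training and hold out the remaining 40%
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)
# Second split: divide the 40% holdout in half, giving 20% validation and 20% test
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)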
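
The two chained `train_test_split` calls can be wrapped in a small helper so the split fractions are stated once. The sketch below is only an illustration: the function name `split_train_val_test` and the synthetic `demo_features` / `demo_labels` data are assumptions made for this example and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_val_test(features, labels, val_size=0.2, test_size=0.2,
                         random_state=42):
    """Hypothetical helper: split into train/validation/test in two steps."""
    holdout_size = val_size + test_size
    # Step 1: separate the training set from the combined holdout.
    X_train, X_hold, y_train, y_hold = train_test_split(
        features, labels, test_size=holdout_size, random_state=random_state)
    # Step 2: divide the holdout into validation and test sets.
    # The test fraction is expressed relative to the holdout, not the full data.
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=test_size / holdout_size,
        random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (assumption: 100 rows, not the Titanic file).
rng = np.random.default_rng(0)
demo_features = pd.DataFrame({"Age": rng.integers(1, 80, size=100),
                              "Fare": rng.uniform(5, 100, size=100)})
demo_labels = pd.Series(rng.integers(0, 2, size=100), name="Survived")

X_tr, X_va, X_te, y_tr, y_va, y_te = split_train_val_test(demo_features,
                                                          demo_labels)
print(len(X_tr), len(X_va), len(X_te))
```

With the default fractions, the second call uses `test_size=0.5`, which matches the two calls in the cell above.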

In [15]:
# Print each split's share of the full dataset
for dataset in [y_train, y_val, y_test]:
    print(round(len(dataset) / len(labels), 2))

0.6
0.2
0.2
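
The check above confirms the split sizes, but it says nothing about whether each split has a similar share of survivors. One possible variation, not used in this notebook, is to pass the labels to the `stratify` parameter of `train_test_split`, which keeps the class ratio roughly equal across splits. The sketch below uses made-up labels (40 positives out of 100) purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data (assumption): 100 rows, 40% positive class.
demo_y = pd.Series([1] * 40 + [0] * 60, name="Survived")
demo_X = pd.DataFrame({"row_id": np.arange(100)})

# Same two-step 60/20/20 split, but stratified on the labels each time.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    demo_X, demo_y, test_size=0.4, random_state=42, stratify=demo_y)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

# Share of the positive class in each split.
rates = {name: y.mean() for name, y in
         [("train", y_tr), ("val", y_va), ("test", y_te)]}
for name, rate in rates.items():
    print(name, round(rate, 2))
```

Whether stratification is worthwhile depends on how imbalanced the labels are; on small or skewed datasets, an unstratified split can leave the validation or test set with a noticeably different class ratio.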


### 4. Write out all data

In [16]:
X_train.to_csv('../dataset/train_features.csv', index=False)
X_val.to_csv('../dataset/val_features.csv', index=False)
X_test.to_csv('../dataset/test_features.csv', index=False)

y_train.to_csv('../dataset/train_labels.csv', index=False)
y_val.to_csv('../dataset/val_labels.csv', index=False)
y_test.to_csv('../dataset/test_labels.csv', index=False)
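
Because the files are written with `index=False`, the shuffled row index is dropped, so features and labels stay aligned only by row order. A quick round-trip check can confirm that nothing was lost on the way to disk. The sketch below is an illustration with a tiny made-up split written to a temporary directory (an assumption, to avoid touching `../dataset/`); `header=True` is passed explicitly for the labels so the column name is written regardless of pandas version.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny made-up split standing in for X_train / y_train (assumption).
# The non-sequential index mimics the shuffled index left by a split.
X_demo = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]},
                      index=[10, 42, 7])
y_demo = pd.Series([0, 1, 1], name="Survived", index=[10, 42, 7])

with tempfile.TemporaryDirectory() as tmp:
    features_path = Path(tmp) / "train_features.csv"
    labels_path = Path(tmp) / "train_labels.csv"

    # Write without the index, as in the cell above.
    X_demo.to_csv(features_path, index=False)
    y_demo.to_csv(labels_path, index=False, header=True)

    # Read both files back while the temporary directory still exists.
    X_back = pd.read_csv(features_path)
    y_back = pd.read_csv(labels_path)["Survived"]

# Compare against the originals with the index reset to 0..n-1.
pd.testing.assert_frame_equal(X_back, X_demo.reset_index(drop=True))
pd.testing.assert_series_equal(y_back, y_demo.reset_index(drop=True))
print(X_back.shape, y_back.shape)
```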