# Splitting the data

Before training a model, split the dataset into three subsets, each with a distinct role.

1. Training

This is the data that the neural network (NN) is trained on. In supervised learning, it consists of the features and the target value(s).

2. Validation

This set is used to measure the model's performance during development, so that adjustments (for example, to hyperparameters) can be made.

3. Testing

This set has no effect on the model, because it is used for neither training nor tuning. That is why it is used to evaluate the final model on previously unseen data.
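
To make the three roles concrete, here is a minimal sketch on synthetic data. It is an illustration only: a polynomial fit stands in for the neural network, the polynomial degree stands in for a hyperparameter, and every name in it (`three_set_demo`, `mse`, the generated data) is hypothetical rather than part of this notebook.

```python
import numpy as np

def mse(coeffs, inputs, targets):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, inputs) - targets) ** 2))

def three_set_demo(seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic 1-D regression data (hypothetical, for illustration only)
    inputs = rng.uniform(-1, 1, size=300)
    targets = np.sin(3 * inputs) + rng.normal(scale=0.1, size=300)

    # 60:20:20 split; the generated rows are already in random order
    in_train, in_dev, in_test = inputs[:180], inputs[180:240], inputs[240:]
    out_train, out_dev, out_test = targets[:180], targets[180:240], targets[240:]

    # Training set: fit one candidate model per polynomial degree
    candidates = {d: np.polyfit(in_train, out_train, d) for d in (1, 3, 5, 7)}
    # Validation set: pick the degree with the lowest validation error
    best_degree = min(candidates, key=lambda d: mse(candidates[d], in_dev, out_dev))
    # Test set: used once, at the end, to estimate the error on unseen data
    test_error = mse(candidates[best_degree], in_test, out_test)
    return best_degree, test_error

best_degree, test_error = three_set_demo()
print(best_degree, test_error)
```

The point of the sketch is the order of operations: the test rows are touched only after the model and its hyperparameter have been fixed.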

In [1]:
%run E203.ipynb

(19735, 28)
<bound method NDFrame.head of        Appliances  lights         T1       RH_1         T2       RH_2  \
0              60      30  19.890000  47.596667  19.200000  44.790000   
1              60      30  19.890000  46.693333  19.200000  44.722500   
2              50      30  19.890000  46.300000  19.200000  44.626667   
3              50      40  19.890000  46.066667  19.200000  44.590000   
4              60      40  19.890000  46.333333  19.200000  44.530000   
...           ...     ...        ...        ...        ...        ...   
19730         100       0  25.566667  46.560000  25.890000  42.025714   
19731          90       0  25.500000  46.500000  25.754000  42.080000   
19732         270      10  25.500000  46.596667  25.628571  42.768571   
19733         420      10  25.500000  46.990000  25.414000  43.036000   
19734         430      10  25.500000  46.600000  25.264286  42.971429   

              T3       RH_3         T4       RH_4  ...         T9     RH_9  \
0  

In [3]:
x.shape # output should be (19735,27)

(19735, 27)

In [4]:
# Compute the end indices of the training and validation sets.
# They are used to split the dataset in a 60:20:20 ratio (training:validation:testing).
train_end = int(len(x) * 0.6)
dev_end = int(len(x) * 0.8)
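
As a quick check of the arithmetic, the sketch below derives the size of each set from the two boundaries. It assumes the row count of 19735 reported by `x.shape`; the names `n_rows`, `train_end_check`, `dev_end_check`, and `sizes` are illustrative only.

```python
n_rows = 19735  # number of rows reported by x.shape

train_end_check = int(n_rows * 0.6)  # first position after the training rows
dev_end_check = int(n_rows * 0.8)    # first position after the validation rows

sizes = {
    "train": train_end_check,                # rows [0, train_end)
    "dev": dev_end_check - train_end_check,  # rows [train_end, dev_end)
    "test": n_rows - dev_end_check,          # rows [dev_end, n_rows)
}
print(sizes)  # {'train': 11841, 'dev': 3947, 'test': 3947}
```

Because `int()` truncates, any rows left over from rounding fall into the last slice, so the three sizes always add up to the full row count.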

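Now we must shuffle the dataset, so that any ordering in the original rows does not bias the split. Both calls use the same `random_state`, so the features and the target are shuffled into the same order and stay aligned.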

In [5]:
x_shuffle = x.sample(frac=1, random_state=0)
y_shuffle = y.sample(frac=1, random_state=0)
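
Shuffling the features and the target separately only works if both end up in the same row order. The sketch below checks this on a small stand-in dataset (hypothetical data and names, not the notebook's `x` and `y`) by comparing the shuffled indexes.

```python
import pandas as pd

# Small stand-in features and target (hypothetical, for illustration only)
features_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})
target_demo = pd.Series(range(100, 110), name="target")

features_shuffled = features_demo.sample(frac=1, random_state=0)
target_shuffled = target_demo.sample(frac=1, random_state=0)

# Same length and same random_state give the same permutation,
# so the two shuffled indexes should be identical.
aligned = features_shuffled.index.equals(target_shuffled.index)
print(aligned)
```

A variant that does not rely on matching seeds is to shuffle the features once and then reorder the target with `target_demo.loc[features_shuffled.index]`.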

Now use positional indexing (`iloc`) to split the shuffled dataset into the three sets, for both the features and the target.

In [7]:
x_train = x_shuffle.iloc[:train_end,:]
y_train = y_shuffle.iloc[:train_end]
x_dev   = x_shuffle.iloc[train_end:dev_end,:]
y_dev   = y_shuffle.iloc[train_end:dev_end]
x_test  = x_shuffle.iloc[dev_end:,:]
y_test  = y_shuffle.iloc[dev_end:]

In [8]:
print(x_train.shape, y_train.shape)
print(x_dev.shape, y_dev.shape)
print(x_test.shape, y_test.shape)

(11841, 27) (11841,)
(3947, 27) (3947,)
(3947, 27) (3947,)
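
Matching shapes do not by themselves show that the three sets are disjoint. A minimal sketch of a stricter check follows, assuming the original index has no duplicate labels; `is_partition` and the `*_demo` names are hypothetical helpers, not part of this notebook.

```python
import pandas as pd

def is_partition(full_index, *parts):
    """Return True if the parts are pairwise disjoint and together cover full_index."""
    seen = set()
    total = 0
    for part in parts:
        seen.update(part)
        total += len(part)
    # No label appears twice, and every original label appears somewhere
    return total == len(seen) and seen == set(full_index)

# Stand-in data split 60:20:20 by position, as in the cells above
demo = pd.DataFrame({"a": range(10)}).sample(frac=1, random_state=0)
train_demo, dev_demo, test_demo = demo.iloc[:6], demo.iloc[6:8], demo.iloc[8:]

ok = is_partition(demo.index, train_demo.index, dev_demo.index, test_demo.index)
print(ok)
```

In this notebook the same check would be `is_partition(x.index, x_train.index, x_dev.index, x_test.index)`.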


In [10]:
from sklearn.model_selection import train_test_split  # imported here, next to the material that uses it, for continuity

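Alternatively, split the shuffled dataset with `train_test_split`. Each call returns only two subsets, so it is called twice to obtain three.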

In [13]:
# First split: hold out the test set. The first two arguments are the data to split,
# test_size is the fraction of rows to hold out, and random_state is set to 0 for reproducibility.
x_new, x_test_2, y_new, y_test_2 = train_test_split(x_shuffle, y_shuffle, test_size=0.2, random_state=0)
# Second split: express the validation set size as a fraction of the remaining rows
dev_per = x_test_2.shape[0] / x_new.shape[0]
x_train_2, x_dev_2, y_train_2, y_dev_2 = train_test_split(x_new, y_new, test_size=dev_per, random_state=0)
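
The second `test_size` is not 0.2, because it is measured against the rows that remain after the test set has been removed. The sketch below works through that arithmetic with the row counts printed in this notebook; the variable names are illustrative only.

```python
import math

n_rows = 19735      # rows in the full dataset
n_test = 3947       # rows held out by the first split (20%)
n_remaining = n_rows - n_test  # rows left for training and validation

# Validation should be 20% of the full dataset, which is a larger
# share of the remaining 80%: 0.2 / 0.8 = 0.25
dev_share_from_counts = n_test / n_remaining
dev_share_from_fractions = 0.2 / (1 - 0.2)

print(n_remaining, dev_share_from_counts)
print(math.isclose(dev_share_from_counts, dev_share_from_fractions))
```

Passing 0.2 again would instead give a validation set of roughly 16% of the original data (20% of the remaining 80%).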

Print the shapes of all three sets.

In [14]:
print(x_train_2.shape, y_train_2.shape)
print(x_dev_2.shape, y_dev_2.shape)
print(x_test_2.shape, y_test_2.shape)

(11841, 27) (11841,)
(3947, 27) (3947,)
(3947, 27) (3947,)


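As you can see, the resulting sets have the same shapes with either approach, although not the same rows, because `train_test_split` shuffles the data again by default. Whether to use indexing or `sklearn.model_selection.train_test_split` is a matter of preference, but I personally think the indexing approach makes more sense and is simpler code.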