# Split Dataset

This notebook shows how to split a dataset into training, validation, and test subsets.

## Read data
Use NumPy and pandas to read the list of data files. The random seed is fixed so that the split is reproducible.

In [1]:
import numpy as np
import pandas as pd
np.random.seed(1)

Tell pandas where the CSV file is:

In [2]:
csv_file_url = "data/data.csv"

Load the file, count the examples, and preview the first rows.

In [3]:
full_data = pd.read_csv(csv_file_url)
total_file_number = len(full_data)
print("There are total {} examples in this dataset.".format(total_file_number))
full_data.head()

There are total 221565 examples in this dataset.


Unnamed: 0,file_basename
0,300vw-112-000322
1,300vw-402-000872
2,300vw-022-001242
3,300vw-143-000044
4,300vw-138-000175
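
The `Unnamed: 0` column in the preview above suggests that the CSV was saved together with a row index. The following is a minimal sketch, assuming that first column really is only a saved index and carries no data of its own; the in-memory CSV text is a hypothetical stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/data.csv: the header's first field is empty,
# which is what pandas displays as "Unnamed: 0".
csv_text = ",file_basename\n0,300vw-112-000322\n1,300vw-402-000872\n"

# index_col=0 tells pandas to use the first column as the row index
# instead of exposing it as a data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(sample.columns))
```

If the column does hold meaningful identifiers, it is safer to keep the default behavior shown in the cell above.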


## Split files
There will be three groups: train, validation and test.

Set the number of samples for each group in the following cell.

In [4]:
num_train = 200000
num_validation = 11565
num_test = 10000

Make sure there are enough examples for your choice.

In [5]:
assert num_train + num_validation + num_test <= total_file_number, "Not enough examples for your choice."
print("Looks good! {} for train, {} for validation and {} for test.".format(num_train, num_validation, num_test))

Looks good! 200000 for train, 11565 for validation and 10000 for test.
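
If you would rather think in proportions than in absolute counts, the group sizes can be derived from fractions. This is only a sketch: the helper name `split_counts` and the 90% / 5% fractions are assumptions chosen for illustration, not values used by this notebook.

```python
def split_counts(total, train_fraction, validation_fraction):
    """Return (train, validation, test) counts that sum exactly to total."""
    num_train = int(total * train_fraction)
    num_validation = int(total * validation_fraction)
    # The test group takes the remainder, so no example is lost to rounding.
    num_test = total - num_train - num_validation
    return num_train, num_validation, num_test


counts = split_counts(221565, 0.9, 0.05)
print(counts)
```

Because the test count is computed as a remainder, the three numbers always pass the size check above.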


Randomly split the files.

In [6]:
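# Draw the training indices without replacement.
index_train = np.random.choice(total_file_number, size=num_train, replace=False)
# Everything not drawn for training is shared between validation and test.
index_validation_test = np.setdiff1d(np.arange(total_file_number), index_train)
index_validation = np.random.choice(index_validation_test, size=num_validation, replace=False)
# The test set takes all remaining rows, so it is larger than num_test if the three sizes sum to less than the total.
index_test = np.setdiff1d(index_validation_test, index_validation)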
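
An alternative way to get three disjoint index groups is to shuffle all indices once and slice the result. The sketch below is not the method used in this notebook. It uses small hypothetical sizes and NumPy's `default_rng` generator, so it will not reproduce the exact split produced by the cell above.

```python
import numpy as np

# Small hypothetical sizes so the example runs instantly.
total, n_train, n_validation, n_test = 100, 80, 10, 10

rng = np.random.default_rng(1)     # seeded generator for reproducibility
shuffled = rng.permutation(total)  # each index 0..total-1 exactly once, in random order

# Consecutive slices of one permutation cannot overlap.
idx_train = shuffled[:n_train]
idx_validation = shuffled[n_train:n_train + n_validation]
idx_test = shuffled[n_train + n_validation:n_train + n_validation + n_test]
```

Slicing an explicit length for the test group keeps it at exactly `n_test` rows even when the three sizes sum to less than the total.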

Select the rows for each subset.

In [7]:
train = full_data.iloc[index_train]
validation = full_data.iloc[index_validation]
test = full_data.iloc[index_test]
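
Before writing anything to disk, it can be worth confirming that the subsets do not share rows. The check below is a sketch on a hypothetical ten-row frame. The same idea should carry over to `train`, `validation` and `test`, assuming `full_data` has a unique row index.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset standing in for full_data.
frame = pd.DataFrame({"file_basename": ["sample-{:03d}".format(i) for i in range(10)]})
order = np.random.default_rng(1).permutation(len(frame))
parts = {
    "train": frame.iloc[order[:6]],
    "validation": frame.iloc[order[6:8]],
    "test": frame.iloc[order[8:]],
}

# iloc keeps the original row labels, so a repeated label means a shared row.
combined = pd.concat(list(parts.values()))
no_overlap = not combined.index.duplicated().any()
covers_everything = len(combined) == len(frame)
print(no_overlap, covers_everything)
```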

## Write to files

In [8]:
# index=False leaves the DataFrame row index out of the output files.
train.to_csv("data/data_train.csv", index=False)
validation.to_csv("data/data_validation.csv", index=False)
test.to_csv("data/data_test.csv", index=False)

In [9]:
print("All done!")

All done!
