# Train and Test Split

Splits a labeled data file into a training file and a validation (test) file according to the parameters below.

## Parameters

1. `data_file`: The input CSV file, without a header row, in the format `x1,x2,x3...y`, where `y` is `1` for a nominal point and `-1` for an anomaly.

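2. `percent_anomaly`: The fraction of anomalies in the validation set, given as a value between 0 and 1 (for example `0.05`). The split code in this notebook defines this value but does not yet apply it.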

3. `train_file`: The output training file in CSV with format `x1,x2,x3...xn`. The label column `y` is dropped.

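4. `validation_file`: The output validation file in CSV with format `x1,x2,x3...y`, where `y` is `1` for nominal points and `-1` for anomalies.

5. `percent_training`: The fraction of rows sampled for the training file, given as a value between 0 and 1 (for example `0.7`).

6. `chunk_size`: The number of rows read from `data_file` at a time.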
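
The following is a minimal sketch of one way `percent_anomaly` could be applied. It assumes the intended meaning is "anomalies make up this fraction of the validation rows" and that training uses nominal rows only; neither assumption is confirmed by this notebook, whose split samples from all rows. The helper name `split_with_anomaly_ratio` and the `seed` argument are hypothetical.

```python
import pandas as pd

def split_with_anomaly_ratio(data, percent_training, percent_anomaly, seed=0):
    """Hypothetical helper: return (training_data, validation_data)."""
    label_col = data.columns[-1]            # assumption: the label is the last column
    nominal = data[data[label_col] == 1]
    anomalies = data[data[label_col] == -1]

    # Assumption: train on nominal rows only, with the label column dropped.
    train_rows = nominal.sample(frac=percent_training, random_state=seed)
    training_data = train_rows.iloc[:, :-1]

    # Nominal rows not used for training go to validation.
    held_out = nominal.drop(train_rows.index)

    # Solve a / (a + n) = p for a, where n is the number of held-out nominal rows.
    n_anomalies = round(percent_anomaly * len(held_out) / (1 - percent_anomaly))
    n_anomalies = min(n_anomalies, len(anomalies))  # cannot use more than exist
    chosen = anomalies.sample(n=n_anomalies, random_state=seed)

    validation_data = pd.concat([held_out, chosen])
    return training_data, validation_data

# Toy input: 100 nominal rows followed by 20 anomalies, two features each.
toy = pd.DataFrame({0: range(120), 1: range(120), 2: [1] * 100 + [-1] * 20})
toy_train, toy_validation = split_with_anomaly_ratio(toy, 0.7, 0.25)
```

If fewer anomalies are available than the target ratio requires, this sketch uses all of them, so the achieved ratio can fall below `percent_anomaly`.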

In [None]:
import pandas as pd

In [None]:
data_file = "/home/ralampay/workspace/pyno/data/creditcardfraud.csv"
percent_anomaly = 0.05
percent_training = 0.7
train_file = "~/Desktop/creditcardfraud_unsupervised_train.csv"
validation_file = "~/Desktop/creditcardfraud_unsupervised_validation.csv"
chunk_size = 1000

In [None]:
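# Read the CSV in chunks to limit peak memory while parsing, then combine them.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
chunks = pd.read_csv(data_file, header=None, chunksize=chunk_size)
data = pd.concat(chunks)

# The last column is the label y; every column before it is an input feature.
input_dim = len(data.columns) - 1

# Randomly sample the training rows and drop the label column.
training_data = data.sample(frac=percent_training).iloc[:, :input_dim]
# Every row not sampled for training goes to validation, label included.
validation_data = data.drop(training_data.index)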

In [None]:
print("Saving training data to {}".format(train_file))
training_data.to_csv(train_file, header=False, index=False)

print("Saving validation data to {}".format(validation_file))
validation_data.to_csv(validation_file, header=False, index=False)
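
As an optional sanity check, the properties described above (disjoint rows, no label column in the training output, labels restricted to `1` and `-1`) can be asserted before relying on the saved files. This is only a sketch; the helper name `check_split` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_split(data, training_data, validation_data):
    """Hypothetical helper: raise AssertionError if the split looks wrong."""
    # No row may appear in both outputs.
    assert training_data.index.intersection(validation_data.index).empty
    # Together the two outputs cover every input row exactly once.
    assert len(training_data) + len(validation_data) == len(data)
    # Training rows drop the label column; validation rows keep it.
    assert training_data.shape[1] == data.shape[1] - 1
    assert validation_data.shape[1] == data.shape[1]
    # Validation labels are limited to 1 (nominal) and -1 (anomaly).
    assert set(validation_data.iloc[:, -1].unique()) <= {1, -1}
    return True

# Toy data split the same way as in the notebook: 8 nominal rows, 2 anomalies.
toy = pd.DataFrame({0: range(10), 1: range(10, 20), 2: [1] * 8 + [-1] * 2})
toy_train = toy.sample(frac=0.7, random_state=0).iloc[:, :-1]
toy_validation = toy.drop(toy_train.index)
split_ok = check_split(toy, toy_train, toy_validation)
```

In the notebook itself, the equivalent call would be `check_split(data, training_data, validation_data)` after the split cell has run.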