# Training Validation Dataset Splitter

This code was created to split the Georgia Coastal Survey 2022 dataset into training and validation sets for Digital Elevation Model correction. 

It will split any .csv file into training and validation data; the example carried throughout performs a stratified split by vegetation class abundance.

First, import the necessary modules.

In [None]:
import pandas as pd
import io
from sklearn.model_selection import train_test_split
import numpy as np

Allow Google Colab to import files from your computer. Running this command opens a file picker so you can choose the file you wish to use from your computer directory. In this case, I am using a CSV from an ESRI Shapefile that I created from our RTK data.

In [None]:
from google.colab import files
uploaded = files.upload()

Read the uploaded CSV into a DataFrame and display it.

In [None]:
df = pd.read_csv(io.StringIO(uploaded['Data_V2_All_2022.csv'].decode('utf-8')))
df
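The cell above hardcodes the uploaded filename. As a minimal sketch, assuming `uploaded` is the dictionary of filename to file bytes that `files.upload()` returns, the name can be read from the dictionary instead. The stand-in `uploaded` dictionary and its contents below are made up so the example runs outside Colab.

```python
import io
import pandas as pd

# Stand-in for the dict returned by files.upload(): {filename: file bytes}.
# The filename and rows here are invented for illustration only.
uploaded = {"example.csv": b"Plot,Dominant\nP1,A\nP2,B\n"}

# Take whichever file was uploaded instead of typing its name by hand.
filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))

print(filename, df.shape)
```

This only picks the first file, so it assumes a single file was chosen in the upload dialog.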

Define the desired x and y variables. In this case, I want the plot number and the vegetation columns from the table.

In [None]:
x = df.Plot
y = df.Dominant

In [None]:
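# Preview the plot/vegetation pairs side by side.
# (np.array(x, y) would treat y as a dtype argument and raise an error.)
np.column_stack((x, y))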

The 'train_test_split' function from sklearn can be used to split our dataset into training and validation sets. The stratify parameter stratifies the data according to the abundance of each vegetation category. The random_state parameter sets the seed to ensure we get the same split when this code is re-run, and the test_size parameter sets the fraction of the data held out for validation. The training set fraction will be (1 - test_size). For example, I would set "test_size" to 0.3 for a 70/30 Training/Validation split.

If you don't need to stratify your dataset, the stratify argument can simply be deleted.
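One thing to check before stratifying: `train_test_split` raises an error when a class in the stratify column has fewer than two rows, since a single row cannot be placed in both sets. A small sketch of that check, using a made-up vegetation column rather than the survey data:

```python
import pandas as pd

# Made-up class labels standing in for df.Dominant.
veg = pd.Series(["A", "A", "A", "B", "B", "C"], name="Dominant")

# Count rows per class and list any class with fewer than two rows.
counts = veg.value_counts()
too_rare = counts[counts < 2].index.tolist()

print(too_rare)
```

Any class listed here would need to be merged with another class, dropped, or left out of the stratification before splitting.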

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=71, stratify=y)
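To confirm that stratification did what was intended, the class shares in each set can be compared. This is a sketch on a synthetic table (100 invented plots and three invented classes), not the survey data; the column names `Plot` and `Dominant` simply mirror the ones used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey table: 100 plots, three made-up classes.
demo = pd.DataFrame({
    "Plot": [f"P{i}" for i in range(100)],
    "Dominant": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
})

plot_train, plot_test, veg_train, veg_test = train_test_split(
    demo.Plot, demo.Dominant, test_size=0.3, random_state=71,
    stratify=demo.Dominant)

# Share of each class in the full, training, and validation sets.
shares = pd.DataFrame({
    "all": demo.Dominant.value_counts(normalize=True),
    "train": veg_train.value_counts(normalize=True),
    "validation": veg_test.value_counts(normalize=True),
})
print(shares)
```

With a stratified split the three columns should be nearly identical; without `stratify`, rare classes can drift noticeably between the sets.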

In [None]:
print('x_train : ')
print(x_train.head())
print(x_train.shape)
print('')
print('x_test : ')
print(x_test.head())
print(x_test.shape)
print('')
print('y_train : ')
print(y_train.head())
print(y_train.shape)
print('')
print('y_test : ')
print(y_test.head())
print(y_test.shape)

At this point, I want to export the plot names (my x variable) for my training and validation datasets. The following code will export and download these as two .csv files, one per dataset.

In [None]:
np.array(x_train)
np.array(x_test)
np.array(y_train)
np.array(y_test)

In [None]:
dataxTrain = np.array(x_train)
dataxTest = np.array(x_test)
dataytrain = np.array(y_train)
dataytest = np.array(y_test)

In [None]:
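# Wrap the plot-name Series in DataFrames so they can be written to .csv
ExportxTraining = pd.DataFrame(x_train)
ExportxTest = pd.DataFrame(x_test)
# Only the last expression in a cell is displayed, so show the validation table here
ExportxTest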

Use the following block to set the filename for your .csv file. 

In [None]:
ExportxTraining.to_csv('TrainingTest1_3oct22.csv')
ExportxTest.to_csv('xTest_Test1_5oct22.csv')

In [None]:
from google.colab import files
files.download('TrainingTest1_3oct22.csv')
files.download('xTest_Test1_5oct22.csv')
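By default `to_csv` also writes the DataFrame index (the original row numbers) as an extra first column. If that column is not wanted, or if the vegetation class should travel with the plot name, something like the following may help. The Series contents and the output filename are hypothetical stand-ins for `x_train` and `y_train`.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-ins for x_train and y_train; after a split both keep
# the same (shuffled) row index, which is what lets concat line them up.
x_train = pd.Series(["P7", "P2", "P9"], index=[6, 1, 8], name="Plot")
y_train = pd.Series(["A", "B", "A"], index=[6, 1, 8], name="Dominant")

# Put plot names and classes side by side in one table.
training = pd.concat([x_train, y_train], axis=1)

# Hypothetical filename, written to the temp directory for this example.
out_path = os.path.join(tempfile.gettempdir(), "training_example.csv")
training.to_csv(out_path, index=False)  # index=False leaves out the row numbers

print(pd.read_csv(out_path))
```

In Colab the resulting file could then be passed to `files.download` in the same way as the files above.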

Created by T.Pudil for use in the Hladik lab at Georgia Southern University, Department of Geography