<a href="https://colab.research.google.com/github/whrc/ARTS/blob/main/Automated_Training_Validation_Testing_Data_Split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Training/Validation/Testing Data Split
Heidi Rodenhizer, Yili Yang

Jan 2024

Install the ARTS package:

In [7]:
%%capture
%pip install git+https://github.com/whrc/ARTS.git

In [2]:
from pathlib import Path  # needed for the base_dir paths below

from ARTS import autosplit
import geopandas as gpd

Are you using Colab to run this notebook? Set `colab` to `True` or `False`:

In [None]:
colab = False

if colab:
    from google.colab import drive

    # Mount Google Drive so the dataset can be read from Colab
    drive.mount("/content/drive")


Provide the location of the directory in which you are working:

In [None]:
if colab:
    base_dir = Path("/content/drive/MyDrive/ARTS")
else:
    base_dir = Path('..')

print('Your base directory is ' + str(base_dir.resolve()))

In [12]:
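# Load the RTS polygons and reproject to EPSG:3413 (NSIDC Sea Ice Polar Stereographic North)
data_to_split = gpd.read_file(
    base_dir / 'ARTS_main_dataset' / 'ARTS_main_dataset.geojson'
).to_crs(3413)

# Keep only the columns needed for the split
data_to_split = data_to_split[['ID', 'Long', 'Lat', 'geometry']]
data_to_split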

Unnamed: 0,ID,Long,Lat,geometry
0,00000000000000000000,69.211514,70.161857,"POLYGON ((1979043.254 889866.463, 1979008.648 ..."
1,00000000000000000001,69.527595,70.105949,"POLYGON ((1979732.261 903360.545, 1979736.642 ..."
2,00000000000000000002,68.034317,70.784162,"POLYGON ((1933102.273 821888.676, 1933083.649 ..."
3,00000000000000000008,67.915439,70.620506,"POLYGON ((1951565.636 824899.432, 1951554.920 ..."
4,00000000000000000009,67.915024,70.619287,"POLYGON ((1951695.422 825009.432, 1951681.044 ..."
...,...,...,...,...
133,0000000000000000001c,78.583303,70.397119,"POLYGON ((1785848.244 1185716.315, 1785842.769..."
134,00000000000000000004,67.837259,70.713714,"POLYGON ((1943127.997 818289.183, 1943108.492 ..."
135,00000000000000000016,78.846724,70.463938,"POLYGON ((1774201.764 1189783.587, 1774181.223..."
136,00000000000000000030,76.243105,70.424696,"POLYGON ((1830147.897 1110238.477, 1830143.036..."


This algorithm splits the RTS polygons into training, validation, and testing subsets while preventing data leakage during machine learning model training: polygons that could end up in the same image tile are never split across different subsets. The buffer size should be the side length of the tiles used in model training; from it, the algorithm calculates the diagonal distance across a tile. The RTS polygons are buffered by this distance and intersected to find groups of polygons, and each group is placed into a single subset. As a result, if there is any chance that any part of two polygons could fall within the same image tile, they will be placed into the same subset.
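The idea described above can be illustrated with a small standard-library sketch. This is not the ARTS implementation: it assumes each polygon is simplified to its bounding box `(minx, miny, maxx, maxy)`, and the helper names `tile_diagonal`, `group_by_buffer`, and `assign_subsets`, as well as the greedy assignment rule, are hypothetical choices made only for illustration.

```python
import math


def tile_diagonal(side_length):
    """Diagonal distance across a square tile with the given side length."""
    return side_length * math.sqrt(2)


def boxes_overlap(a, b):
    """True if two (minx, miny, maxx, maxy) boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def group_by_buffer(boxes, distance):
    """Return one group id per box; boxes whose buffered extents overlap share an id."""
    # Assumption: buffering is approximated by expanding each bounding box.
    buffered = [(x0 - distance, y0 - distance, x1 + distance, y1 + distance)
                for x0, y0, x1, y1 in boxes]
    parent = list(range(len(boxes)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(buffered)):
        for j in range(i + 1, len(buffered)):
            if boxes_overlap(buffered[i], buffered[j]):
                parent[find(i)] = find(j)  # merge the two groups
    return [find(i) for i in range(len(boxes))]


def assign_subsets(group_ids, names, ratios):
    """Assign whole groups to subsets (hypothetical greedy rule, largest groups first)."""
    groups = {}
    for idx, gid in enumerate(group_ids):
        groups.setdefault(gid, []).append(idx)
    total = len(group_ids)
    counts = [0] * len(names)
    labels = [None] * total
    for members in sorted(groups.values(), key=len, reverse=True):
        # Choose the subset that is currently furthest below its target size.
        k = max(range(len(names)), key=lambda s: ratios[s] * total - counts[s])
        for idx in members:
            labels[idx] = names[k]
        counts[k] += len(members)
    return labels


# Four made-up bounding boxes; the first two lie close together.
boxes = [(0, 0, 10, 10), (500, 0, 510, 10),
         (5000, 5000, 5010, 5010), (20000, 0, 20010, 10)]
distance = tile_diagonal(256 * 2)
group_ids = group_by_buffer(boxes, distance)
labels = assign_subsets(group_ids, ['train', 'val', 'test'], [0.8, 0.1, 0.1])
```

Because subsets are assigned per group rather than per polygon, the first two boxes always receive the same label. It also means the realized proportions can deviate from the requested ratios when groups are large.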

In [13]:
split_results = autosplit.split_with_buffer(
    data_to_split,             # dataset to be split
    ['train', 'val', 'test'],  # subset names
    [0.8, 0.1, 0.1],           # target train, val, test ratios
    256 * 2,                   # buffer size (tile side length)
)

split_results

Unnamed: 0,ID,Long,Lat,subset,geometry
0,00000000000000000000,69.211514,70.161857,train,"POLYGON ((1979043.254 889866.463, 1979008.648 ..."
1,00000000000000000001,69.527595,70.105949,train,"POLYGON ((1979732.261 903360.545, 1979736.642 ..."
2,00000000000000000002,68.034317,70.784162,val,"POLYGON ((1933102.273 821888.676, 1933083.649 ..."
3,00000000000000000008,67.915439,70.620506,train,"POLYGON ((1951565.636 824899.432, 1951554.920 ..."
4,00000000000000000009,67.915024,70.619287,train,"POLYGON ((1951695.422 825009.432, 1951681.044 ..."
...,...,...,...,...,...
108,0000000000000000001e,78.969078,70.405653,train,"POLYGON ((1777013.679 1197190.711, 1777003.242..."
109,00000000000000000021,76.651855,70.482606,val,"POLYGON ((1816679.578 1119875.498, 1816661.943..."
118,0000000000000000003a,76.206167,70.283004,train,"POLYGON ((1844376.619 1117232.150, 1844367.396..."
119,0000000000000000003b,76.206369,70.282509,train,"POLYGON ((1844407.140 1117251.812, 1844401.347..."


## Visualize the split

In [None]:
%%capture
%pip install folium matplotlib mapclassify

In [14]:
# Interactive map of the polygons, colored by subset
split_results.explore(
    column='subset',
    cmap='Set1',
    tiles="Esri WorldImagery",   # list available tiles with: import xyzservices.providers as xyz; xyz
    style_kwds=dict(weight=10),  # outline (stroke) width
)