# Split data into training, validation, and testing datasets

There are several wrinkles to consider:
1. I am aiming for a 60/20/20% training/validation/test split, as closely as the other constraints allow.
1. Because I am feeding order information to the models, I need to split the data within each plant order.
1. The classes and orders are imbalanced, and not uniformly so.
1. There are multiple images per observation. I don't want to leak information to the models, so I am splitting observations and **NOT** images.
1. Considering data leaks again, records that have both flowers and fruits have to be handled so that both wind up in the same split.
1. For underrepresented plant orders, I add records to the test set first, then the training set, and finally the validation set.
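
The constraints above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the code this notebook actually runs: the function `assign_splits`, its parameters, and the `toy` frame are hypothetical, and the rounding rule (round the test count up, round the validation count down, give the remainder to training) is just one simple way to get the test, then training, then validation priority for small orders. Splitting on unique `obs_id` values is what keeps every image and annotation of an observation in a single split.

```python
import pandas as pd


def assign_splits(obs, seed=0, test_pct=20, val_pct=20):
    """Map each obs_id to 'train', 'val', or 'test', one plant order at a time.

    Hypothetical helper: the rounding rule is an assumption, not the
    notebook's actual logic.
    """
    # Work on one row per observation so that rows sharing an obs_id
    # (several images, flower and fruit annotations) cannot be separated.
    unique_obs = obs.drop_duplicates("obs_id")
    splits = {}
    for _, group in unique_obs.groupby("order"):
        ids = group["obs_id"].sample(frac=1.0, random_state=seed).tolist()
        n = len(ids)
        n_test = -(-n * test_pct // 100)  # ceiling: test is filled first
        n_val = n * val_pct // 100        # floor: validation is filled last
        for i, obs_id in enumerate(ids):
            if i < n_test:
                splits[obs_id] = "test"
            elif i < n_test + n_val:
                splits[obs_id] = "val"
            else:
                splits[obs_id] = "train"  # training takes the remainder
    return pd.Series(splits, name="split")


# Toy data: observation 1 has two rows (e.g. two images), and
# Saxifragales has only a single observation.
toy = pd.DataFrame(
    {
        "obs_id": [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
        "order": ["Rosales"] * 11 + ["Saxifragales"],
    }
)
result = assign_splits(toy)
```

With this rule, an order holding a single observation sends it to the test set, and an order only reaches the validation set once it has at least five observations.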

In [1]:
import sqlite3
from pathlib import Path
from types import SimpleNamespace

import pandas as pd

In [2]:
DATA_DIR = Path("..") / "data"

In [3]:
args = SimpleNamespace(
    db=DATA_DIR / "inat" / "inat.sqlite",
)

## Read observation data

In [4]:
with sqlite3.connect(args.db) as cxn:
    df = pd.read_sql("select * from obs order by taxon_id", cxn)

df.head()

| | obs_id | split | order | taxon_id | taxon | ancestry | annotations | phenology |
|---|---|---|---|---|---|---|---|---|
| 0 | 146020079 | | Saxifragales | 47129 | Ribes californicum | 48460/47126/211194/47125/47124/71289/47131/47130 | 1315 | Flowering |
| 1 | 145993241 | | Saxifragales | 47129 | Ribes californicum | 48460/47126/211194/47125/47124/71289/47131/47130 | 21 | No Evidence of Flowering |
| 2 | 145877878 | | Saxifragales | 47129 | Ribes californicum | 48460/47126/211194/47125/47124/71289/47131/47130 | 13 | Flowering |
| 3 | 143674797 | | Saxifragales | 47129 | Ribes californicum | 48460/47126/211194/47125/47124/71289/47131/47130 | 13 | Flowering |
| 4 | 145357142 | | Rosales | 47146 | Adenostoma fasciculatum | 48460/47126/211194/47125/47124/47132/47148/922... | 21 | No Evidence of Flowering |


In [5]:
df.shape

(64839, 8)
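
Before splitting, a cross-tabulation is one quick way to see how uneven the orders and phenology classes are. The snippet below is a sketch on made-up rows, not the real table: `toy` and `counts` are hypothetical names, and on the notebook's data the same call would presumably take `df["order"]` and `df["phenology"]`.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the obs table.
toy = pd.DataFrame(
    {
        "order": ["Rosales", "Rosales", "Rosales", "Saxifragales"],
        "phenology": [
            "Flowering",
            "Flowering",
            "No Evidence of Flowering",
            "Flowering",
        ],
    }
)

# Rows are plant orders, columns are phenology classes, cells are counts.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(toy["order"], toy["phenology"])
```

Orders with very small row totals in `counts` are the ones that need the special handling described at the top of this notebook.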