# 3 Split the data into training (70%) and validation (30%) sets
**OBJECTIVE:** randomly split the annotated data into
- *training* set (70% of the tiles): the data the model learns from, that is, the data used to fit the model's parameters.
- *validation* set (30%): the data used to provide an unbiased evaluation of a model fitted on the training set while tuning its hyperparameters. It also plays a role in other forms of model preparation, such as feature selection and threshold cut-off selection.

Normally, one would also create a third, *test* dataset for a fully independent evaluation of the model's performance on unseen data. In this course the test data have already been set aside and will be provided later.



**INPUT:**
- `path_to_tiles`="/content/drive/MyDrive/NOVA_course_deep_learning/data/tiles/10m_krakstad_202304_sun"
- `split_train`= 0.7

**OUTPUT:**
- train and validation data organized in the following folders:

```
├── train
│   ├── images
│   └── labels
├── val
│   ├── images
│   └── labels
```


In [1]:

path_to_tiles = "/content/drive/MyDrive/NOVA_Deep/Processing/tiles_training"

# define the split between training and validation
split_train = 0.7  # fraction of tiles used for training
split_val = 1 - split_train  # remaining fraction used for validation

### 3.1 Load libraries

In [2]:
import os
import shutil
import random

# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 3.2 Create train and validation directories and subdivide each into "images" and "labels" sub-directories

In [6]:
txt_files

[]

In [12]:
# paths of the train/val folders and their "images" and "labels" sub-directories
train_dir = os.path.join(path_to_tiles, "train")
val_dir = os.path.join(path_to_tiles, "val")
train_img_dir = os.path.join(train_dir, "images")
train_ann_dir = os.path.join(train_dir, "labels")
val_img_dir = os.path.join(val_dir, "images")
val_ann_dir = os.path.join(val_dir, "labels")

# create the directories (parents included); exist_ok=True makes re-running the cell safe
for d in (train_img_dir, train_ann_dir, val_img_dir, val_ann_dir):
    os.makedirs(d, exist_ok=True)


### 3.3 Randomly sample tiles

In [13]:
# Get a list of all the .txt files in the data directory
txt_files = [f for f in os.listdir(path_to_tiles) if f.endswith(".txt")]
img_files = [f for f in os.listdir(path_to_tiles) if f.endswith(".tif")]

In [8]:
path_to_tiles

'/content/drive/MyDrive/NOVA_Deep/Processing/tiles_training'

In [14]:
# keep only the .txt files that have a matching .tif image
# (some annotation files have no image; the cause is unclear)
txt_files_with_tif = []
for txt_file in txt_files:
    # swap the .txt extension for .tif to get the corresponding image path
    img_file = os.path.join(path_to_tiles, os.path.splitext(txt_file)[0] + ".tif")
    # keep the annotation only if its image exists
    if os.path.exists(img_file):
        txt_files_with_tif.append(txt_file)



In [15]:
txt_files=txt_files_with_tif

# Shuffle the list of text files
random.shuffle(txt_files)
#train=random.sample(txt_files, )

# Calculate the number of files for the train and validation sets
train_size = int(split_train * len(txt_files))
val_size = len(txt_files) - train_size

In [16]:
len(txt_files)

0
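Because `random.shuffle` gives a different order on every run, the split above is not reproducible. The following is a minimal sketch of a seeded variant; the helper name `split_names` and the seed value are illustrative assumptions, not part of the course code.

```python
import random

def split_names(names, train_fraction=0.7, seed=42):
    """Return (train, val) lists from a seeded shuffle of `names` (hypothetical helper)."""
    # sort first so the result does not depend on the order returned by os.listdir
    shuffled = sorted(names)
    # a dedicated Random instance keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# toy file names standing in for the real annotation files
example_names = [f"tile_{i}.txt" for i in range(10)]
train_names, val_names = split_names(example_names)
print(len(train_names), len(val_names))
```

Calling `split_names` again with the same seed returns the same two lists, which makes it easier to compare models trained on different days.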

### 3.4 Move the annotation files and their images to the train and validation directories

In [None]:
# iterate through each annotated .txt file: the first train_size go to train, the rest to val
for i, txt_file in enumerate(txt_files):
    dest_dir = train_dir if i < train_size else val_dir
    src_file = os.path.join(path_to_tiles, txt_file)
    src_img = os.path.join(path_to_tiles, os.path.splitext(txt_file)[0] + ".tif")
    # move the pair only if both the annotation and its image are still in place
    if os.path.exists(src_file) and os.path.exists(src_img):
        dest_file = os.path.join(dest_dir, "labels", txt_file)
        dest_img = os.path.join(dest_dir, "images", os.path.basename(src_img))
        # note: shutil.move removes the files from path_to_tiles
        shutil.move(src_file, dest_file)
        shutil.move(src_img, dest_img)
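After moving the files, it can be worth checking that each split contains as many images as labels. The sketch below is one possible check; `count_pairs` is a hypothetical helper, and the demonstration runs on a temporary folder rather than on the course data.

```python
import os
import tempfile

def count_pairs(split_dir):
    """Count labels and images in one split folder and list names lacking a partner."""
    labels_dir = os.path.join(split_dir, "labels")
    images_dir = os.path.join(split_dir, "images")
    labels = {os.path.splitext(f)[0] for f in os.listdir(labels_dir) if f.endswith(".txt")}
    images = {os.path.splitext(f)[0] for f in os.listdir(images_dir) if f.endswith(".tif")}
    # symmetric difference: names present in only one of the two folders
    return len(labels), len(images), sorted(labels ^ images)

# demonstration on a throwaway directory with two matching label/image pairs
with tempfile.TemporaryDirectory() as demo_root:
    for sub in ("labels", "images"):
        os.makedirs(os.path.join(demo_root, "train", sub))
    for name in ("tile_0", "tile_1"):
        open(os.path.join(demo_root, "train", "labels", name + ".txt"), "w").close()
        open(os.path.join(demo_root, "train", "images", name + ".tif"), "w").close()
    demo_counts = count_pairs(os.path.join(demo_root, "train"))

print(demo_counts)  # (2, 2, [])
```

In the notebook, the same function could be called as `count_pairs(train_dir)` and `count_pairs(val_dir)`.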

It is often good practice to also add approximately 10% background images, i.e. tiles that do not contain any bounding box. This helps the model avoid producing spurious detections in areas it has otherwise not seen.

A simple (and efficient) way to do so is to scroll through the image tiles (in Google Drive), select the background tiles manually, and copy 70% of them into the training folder and 30% into the validation folder.
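If the manually selected background tiles are first gathered in a folder of their own, the 70/30 copy can also be scripted. This is only a sketch under that assumption: the `background` folder and the helper `split_background_tiles` are hypothetical, and only images are copied. Whether an empty label file is also needed depends on the training framework.

```python
import os
import random
import shutil
import tempfile

def split_background_tiles(background_dir, train_img_dir, val_img_dir,
                           train_fraction=0.7, seed=0):
    """Copy .tif tiles from background_dir into the train/val image folders."""
    tiles = sorted(f for f in os.listdir(background_dir) if f.endswith(".tif"))
    random.Random(seed).shuffle(tiles)
    n_train = int(train_fraction * len(tiles))
    for i, tile in enumerate(tiles):
        dest = train_img_dir if i < n_train else val_img_dir
        # copy (not move) so the original selection stays intact
        shutil.copy2(os.path.join(background_dir, tile), os.path.join(dest, tile))
    return n_train, len(tiles) - n_train

# demonstration on a throwaway directory with ten empty placeholder tiles
with tempfile.TemporaryDirectory() as demo_root:
    bg_dir = os.path.join(demo_root, "background")  # hypothetical folder
    demo_train = os.path.join(demo_root, "train", "images")
    demo_val = os.path.join(demo_root, "val", "images")
    for d in (bg_dir, demo_train, demo_val):
        os.makedirs(d)
    for i in range(10):
        open(os.path.join(bg_dir, f"bg_{i}.tif"), "w").close()
    n_train_bg, n_val_bg = split_background_tiles(bg_dir, demo_train, demo_val)
    n_copied = len(os.listdir(demo_train)) + len(os.listdir(demo_val))

print(n_train_bg, n_val_bg, n_copied)
```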

# The end. And now let's get to the fun part 🥳 to the [model training](https://colab.research.google.com/drive/1dZ4uJHNhjbMCdk0pSyhskkA7It1jKlnF)
