# Extracting Training Data

For any supervised classification, the training, validation and testing datasets are key. They are used as follows:

 * **Training**: Dataset used to train the algorithm

 * **Validation**: Dataset used to optimise the hyper-parameters (e.g., grid search)

 * **Testing**: Dataset which isn’t used at any point during the training or hyper-parameter search. This provides an accuracy estimate for the classifier but is not a replacement for a formal validation of the final map.


## Extracting three datasets

In this case we will perform the extraction three times, once for each of the input images (i.e., original reflectance, linear normalised and standard deviation normalised). The code below is therefore repetitive; a loop could be used, but writing each step out keeps it simpler to follow.

## Running Notebook

The notebook has been run and saved with its outputs so you can see what the outputs should be, and so the notebook can be browsed online and still make sense without being run.

If you are running the notebook yourself, it is recommended that you clear the existing outputs. This can be done using one of the following options, depending on the system you are using:

**Jupyter-lab**:

> \> _Edit_ \> _'Clear All Outputs'_

**Jupyter-notebook**:

> \> _Cell_ \> _'All Outputs'_ \> _Clear_


# 1. Import Modules

In [1]:
import os

import rsgislib
import rsgislib.classification
import rsgislib.imageutils
import rsgislib.tools.utils
import rsgislib.vectorutils
import rsgislib.vectorutils.createrasters
import rsgislib.zonalstats

# 2. Define the input images

In [2]:
# The input image files
refl_img_file = "../data/sen2_20180629_t30uvd_orb037_osgb_stdsref_20m.tif"
norm_lin_img = "norm_images/sen2_20180629_t30uvd_orb037_osgb_stdsref_norm_linear.tif"
norm_sd_img = "norm_images/sen2_20180629_t30uvd_orb037_osgb_stdsref_norm_stddev.tif"

# 3. Training Data Vector 

In [3]:
vec_train_file = "../data/cls_data/aber_sen2_cls_training.gpkg"

## 3.1 Get the Vector Layer Names

In [4]:
# Get the list of layers within the vector file.
lyr_names = rsgislib.vectorutils.get_vec_lyrs_lst(vec_train_file)

# Print out the layer names by looping through the list of layers returned.
# Note: enumerate returns the list index and the list value on each iteration.
for i, lyr_name in enumerate(lyr_names):
    print(f"{i+1}:\t{lyr_name}")

1:	Artificial_Surfaces
2:	Bare_Rock_Sand
3:	Conifer_Forest
4:	Deciduous_Forest
5:	Grass_Long
6:	Grass_Short
7:	NonPhotosynthetic_Vegetation
8:	Scrub
9:	Water_Training
10:	Bracken
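
As an aside, `enumerate` accepts a `start` argument, so the `i+1` in the cell above could be avoided. The snippet below is a small standalone illustration; the layer names are typed in by hand (three of those printed above) rather than read from the GeoPackage.

```python
# Layer names copied by hand from the output above (illustration only).
lyr_names = ["Artificial_Surfaces", "Bare_Rock_Sand", "Conifer_Forest"]

# start=1 makes the counter begin at 1, so no "i + 1" is needed.
numbered = [f"{i}:\t{lyr_name}" for i, lyr_name in enumerate(lyr_names, start=1)]
for line in numbered:
    print(line)
```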


# 4. Output Directories

In [5]:
# The output directory where the training data will be written
out_dir = "training_data"

if not os.path.exists(out_dir):
    os.mkdir(out_dir)


# The tmp directory where intermediate outputs will be written to
tmp_dir = "tmp"

if not os.path.exists(tmp_dir):
    os.mkdir(tmp_dir)

# 5. Extract Training for Reflectance Image

## 5.1 Define Input Image bands

In [6]:
img_band_info = list()
img_band_info.append(
    rsgislib.imageutils.ImageBandInfo(
        file_name=refl_img_file, name="sen2", bands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    )
)

## 5.2 Define Vector Samples

In [7]:
class_vec_sample_info = list()

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=1,
        class_name="artificial_surfaces",
        vec_file=vec_train_file,
        vec_lyr="Artificial_Surfaces",
        file_h5=os.path.join(tmp_dir, "artificial_surfaces_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=2,
        class_name="bare_rock_sand",
        vec_file=vec_train_file,
        vec_lyr="Bare_Rock_Sand",
        file_h5=os.path.join(tmp_dir, "bare_rock_sand_refl_smpls.h5"),
    )
)


# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=3,
        class_name="conifer_forest",
        vec_file=vec_train_file,
        vec_lyr="Conifer_Forest",
        file_h5=os.path.join(tmp_dir, "conifer_forest_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=4,
        class_name="deciduous_forest",
        vec_file=vec_train_file,
        vec_lyr="Deciduous_Forest",
        file_h5=os.path.join(tmp_dir, "deciduous_forest_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=5,
        class_name="grass_long",
        vec_file=vec_train_file,
        vec_lyr="Grass_Long",
        file_h5=os.path.join(tmp_dir, "grass_long_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=6,
        class_name="grass_short",
        vec_file=vec_train_file,
        vec_lyr="Grass_Short",
        file_h5=os.path.join(tmp_dir, "grass_short_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=7,
        class_name="nonphoto_veg",
        vec_file=vec_train_file,
        vec_lyr="NonPhotosynthetic_Vegetation",
        file_h5=os.path.join(tmp_dir, "nonphoto_veg_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=8,
        class_name="scrub",
        vec_file=vec_train_file,
        vec_lyr="Scrub",
        file_h5=os.path.join(tmp_dir, "scrub_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=9,
        class_name="water",
        vec_file=vec_train_file,
        vec_lyr="Water_Training",
        file_h5=os.path.join(tmp_dir, "water_refl_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=10,
        class_name="bracken",
        vec_file=vec_train_file,
        vec_lyr="Bracken",
        file_h5=os.path.join(tmp_dir, "bracken_refl_smpls.h5"),
    )
)
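
The ten near-identical blocks above could also be generated with a loop. The sketch below shows the idea without needing rsgislib installed: `SampleSpec` is a hypothetical stand-in for `rsgislib.classification.ClassVecSamplesInfoObj` with the same field names, and `build_sample_specs` is a made-up helper name. The class names, layer names and file-name pattern are copied from the cell above.

```python
import os
from typing import NamedTuple


class SampleSpec(NamedTuple):
    """Hypothetical stand-in for rsgislib.classification.ClassVecSamplesInfoObj."""

    id: int
    class_name: str
    vec_file: str
    vec_lyr: str
    file_h5: str


# (class name, vector layer name) pairs in the same order as the cell above.
CLASS_LAYERS = [
    ("artificial_surfaces", "Artificial_Surfaces"),
    ("bare_rock_sand", "Bare_Rock_Sand"),
    ("conifer_forest", "Conifer_Forest"),
    ("deciduous_forest", "Deciduous_Forest"),
    ("grass_long", "Grass_Long"),
    ("grass_short", "Grass_Short"),
    ("nonphoto_veg", "NonPhotosynthetic_Vegetation"),
    ("scrub", "Scrub"),
    ("water", "Water_Training"),
    ("bracken", "Bracken"),
]


def build_sample_specs(vec_file, tmp_dir, suffix):
    """Build one spec per class; suffix identifies the image (e.g. 'refl')."""
    specs = []
    for cls_id, (class_name, vec_lyr) in enumerate(CLASS_LAYERS, start=1):
        specs.append(
            SampleSpec(
                id=cls_id,
                class_name=class_name,
                vec_file=vec_file,
                vec_lyr=vec_lyr,
                file_h5=os.path.join(tmp_dir, f"{class_name}_{suffix}_smpls.h5"),
            )
        )
    return specs


specs = build_sample_specs(
    "../data/cls_data/aber_sen2_cls_training.gpkg", "tmp", "refl"
)
print(specs[0].file_h5)
print(specs[-1].id, specs[-1].class_name)
```

In the notebook itself, `SampleSpec(...)` would be swapped for `rsgislib.classification.ClassVecSamplesInfoObj(...)`, and calling the helper with the suffixes `linnorm` and `sdnorm` would reproduce the lists used in sections 6 and 7.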

## 5.3 Extract Sample Data

In [8]:
cls_smpls_info = rsgislib.classification.get_class_training_data(
    img_band_info, class_vec_sample_info, tmp_dir, ref_img=refl_img_file
)

Creating output image using input image
Running Rasterise now...

Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image
Running Rasterise now...

Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image
Running Ra

## 5.4 How many samples were extracted

In [9]:
for cls_name in cls_smpls_info:
    smpls_h5_file = cls_smpls_info[cls_name].file_h5
    n_smpls = rsgislib.classification.get_num_samples(smpls_h5_file)
    print(f"{cls_name}: {n_smpls}")

artificial_surfaces: 454
bare_rock_sand: 5392
conifer_forest: 3335
deciduous_forest: 4021
grass_long: 1264
grass_short: 622
nonphoto_veg: 1989
scrub: 5961
water: 34232
bracken: 1399


## 5.5 Balance and Extract Training, Validation and Testing datasets

Observing the number of samples available for each class, there are a number of things which could be done. First, the samples can be balanced (i.e., the same number per class), which requires using the class with the fewest samples as the reference for defining the number of testing, training and validation samples. Alternatively, the sample data can be oversampled, or there are algorithms which attempt to generate artificial training samples (see the functions within the `rsgislib.classification.classimblearn` module, which make use of the [imbalanced-learn](https://imbalanced-learn.org) library).
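
To make the oversampling alternative concrete, the sketch below shows naive random oversampling: samples from the smaller classes are redrawn with replacement until every class matches the largest one. It uses only the standard library on toy data. It is not the `rsgislib.classification.classimblearn` or imbalanced-learn API, and the function name `random_oversample` is made up for illustration.

```python
import random


def random_oversample(samples_by_class, seed=42):
    """Return a copy in which every class has as many samples as the largest.

    Extra samples are drawn with replacement from the class's own samples.
    """
    rng = random.Random(seed)
    target = max(len(smpls) for smpls in samples_by_class.values())
    balanced = {}
    for cls_name, smpls in samples_by_class.items():
        extra = [rng.choice(smpls) for _ in range(target - len(smpls))]
        balanced[cls_name] = list(smpls) + extra
    return balanced


# Toy data: each "sample" is just an integer ID.
toy = {"water": list(range(8)), "grass_short": list(range(3))}
balanced = random_oversample(toy)
print({cls_name: len(smpls) for cls_name, smpls in balanced.items()})
```

Because the extra samples are exact duplicates, this kind of oversampling is normally applied to the training set only, after the validation and testing samples have been set aside.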

For this tutorial, things will be kept simple and the class (artificial_surfaces) with the lowest number of samples will be used to define the number of samples for each class:

 * Training: 350
 * Validation: 50
 * Testing: 50
 
The samples are randomly selected from the population of input samples.
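
The reasoning above can be checked with a few lines of plain Python. The sketch below finds the smallest class from the counts printed in section 5.4, confirms that 350 + 50 + 50 samples fit within it, and then draws three disjoint random index sets. How rsgislib performs the selection internally is not shown in this notebook, so `split_indices` is only an assumed illustration of the idea, not the library's implementation.

```python
import random

# Sample counts copied from the output in section 5.4.
n_smpls = {
    "artificial_surfaces": 454,
    "bare_rock_sand": 5392,
    "conifer_forest": 3335,
    "deciduous_forest": 4021,
    "grass_long": 1264,
    "grass_short": 622,
    "nonphoto_veg": 1989,
    "scrub": 5961,
    "water": 34232,
    "bracken": 1399,
}
N_TRAIN, N_VALID, N_TEST = 350, 50, 50

# The smallest class limits how many samples can be taken from every class.
smallest_cls = min(n_smpls, key=n_smpls.get)
assert N_TRAIN + N_VALID + N_TEST <= n_smpls[smallest_cls]


def split_indices(n_total, n_train, n_valid, n_test, seed=42):
    """Draw three disjoint random index sets from range(n_total)."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees no index is used twice.
    picked = rng.sample(range(n_total), n_train + n_valid + n_test)
    train = picked[:n_train]
    valid = picked[n_train:n_train + n_valid]
    test = picked[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(n_smpls["water"], N_TRAIN, N_VALID, N_TEST)
print(smallest_cls, len(train), len(valid), len(test))
```

Drawing all 450 indices in a single call and then slicing them is what keeps the three sets from overlapping, which matters because the testing samples must not be seen during training or hyper-parameter optimisation.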

Again a helper function (`rsgislib.classification.create_train_valid_test_sets`) has been provided which makes it simpler to perform this analysis. For this, a dictionary of `rsgislib.classification.ClassInfoObj` objects needs to be defined, which specifies the file names for the training, validation and testing HDF5 files.

In this case, using the `get_class_info_dict` function, the existing dictionary (`cls_smpls_info`) will be looped through and the file names automatically defined by adding `_train`, `_valid` or `_test` to the existing file name of the HDF5 file.
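
The naming rule can be illustrated without rsgislib. The helper `split_file_names` below is hypothetical: it only mimics the pattern visible in the file names this notebook prints (the suffix is added before the `.h5` extension and the file is placed in the output directory) and is not the library's implementation.

```python
import os


def split_file_names(file_h5, out_dir):
    """Derive the _train/_valid/_test file names for one samples file."""
    base, ext = os.path.splitext(os.path.basename(file_h5))
    return {
        part: os.path.join(out_dir, f"{base}_{part}{ext}")
        for part in ("train", "valid", "test")
    }


names = split_file_names("tmp/artificial_surfaces_refl_smpls.h5", "training_data")
for part, file_name in names.items():
    print(f"{part}: {file_name}")
```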

In [10]:
cls_smpls_fnl_info = rsgislib.classification.get_class_info_dict(
    cls_smpls_info, out_dir
)

# Run the create_train_valid_test_sets helper function to
# create the train, valid and test datasets. The three numbers
# are the per-class sample counts: 50 testing, 50 validation
# and 350 training.
rsgislib.classification.create_train_valid_test_sets(
    cls_smpls_info, cls_smpls_fnl_info, 50, 50, 350
)

0=1: (Train:training_data/artificial_surfaces_refl_smpls_train.h5, Test:training_data/artificial_surfaces_refl_smpls_test.h5, Valid:training_data/artificial_surfaces_refl_smpls_valid.h5), (16, 2, 134)
1=2: (Train:training_data/bare_rock_sand_refl_smpls_train.h5, Test:training_data/bare_rock_sand_refl_smpls_test.h5, Valid:training_data/bare_rock_sand_refl_smpls_valid.h5), (77, 15, 100)
2=3: (Train:training_data/conifer_forest_refl_smpls_train.h5, Test:training_data/conifer_forest_refl_smpls_test.h5, Valid:training_data/conifer_forest_refl_smpls_valid.h5), (25, 29, 158)
3=4: (Train:training_data/deciduous_forest_refl_smpls_train.h5, Test:training_data/deciduous_forest_refl_smpls_test.h5, Valid:training_data/deciduous_forest_refl_smpls_valid.h5), (189, 88, 206)
4=5: (Train:training_data/grass_long_refl_smpls_train.h5, Test:training_data/grass_long_refl_smpls_test.h5, Valid:training_data/grass_long_refl_smpls_valid.h5), (118, 215, 53)
5=6: (Train:training_data/grass_short_refl_smpls_train.

# 6. Extract Training for Linear Normalised Image

## 6.1 Define Input Image bands

In [11]:
img_band_info = list()
img_band_info.append(
    rsgislib.imageutils.ImageBandInfo(
        file_name=norm_lin_img, name="sen2", bands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    )
)

## 6.2 Define Vector Samples

In [12]:
class_vec_sample_info = list()

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=1,
        class_name="artificial_surfaces",
        vec_file=vec_train_file,
        vec_lyr="Artificial_Surfaces",
        file_h5=os.path.join(tmp_dir, "artificial_surfaces_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=2,
        class_name="bare_rock_sand",
        vec_file=vec_train_file,
        vec_lyr="Bare_Rock_Sand",
        file_h5=os.path.join(tmp_dir, "bare_rock_sand_linnorm_smpls.h5"),
    )
)


# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=3,
        class_name="conifer_forest",
        vec_file=vec_train_file,
        vec_lyr="Conifer_Forest",
        file_h5=os.path.join(tmp_dir, "conifer_forest_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=4,
        class_name="deciduous_forest",
        vec_file=vec_train_file,
        vec_lyr="Deciduous_Forest",
        file_h5=os.path.join(tmp_dir, "deciduous_forest_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=5,
        class_name="grass_long",
        vec_file=vec_train_file,
        vec_lyr="Grass_Long",
        file_h5=os.path.join(tmp_dir, "grass_long_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=6,
        class_name="grass_short",
        vec_file=vec_train_file,
        vec_lyr="Grass_Short",
        file_h5=os.path.join(tmp_dir, "grass_short_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=7,
        class_name="nonphoto_veg",
        vec_file=vec_train_file,
        vec_lyr="NonPhotosynthetic_Vegetation",
        file_h5=os.path.join(tmp_dir, "nonphoto_veg_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=8,
        class_name="scrub",
        vec_file=vec_train_file,
        vec_lyr="Scrub",
        file_h5=os.path.join(tmp_dir, "scrub_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=9,
        class_name="water",
        vec_file=vec_train_file,
        vec_lyr="Water_Training",
        file_h5=os.path.join(tmp_dir, "water_linnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=10,
        class_name="bracken",
        vec_file=vec_train_file,
        vec_lyr="Bracken",
        file_h5=os.path.join(tmp_dir, "bracken_linnorm_smpls.h5"),
    )
)

## 6.3 Extract Sample Data

In [13]:
cls_smpls_info = rsgislib.classification.get_class_training_data(
    img_band_info, class_vec_sample_info, tmp_dir, ref_img=norm_lin_img
)

Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Creating output image using input imageGet Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.




Running Rasterise now...
Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Creating output image using input imageGet Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.



Running Rasterise now...

Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image

Running R

## 6.4 How many samples were extracted

In [14]:
for cls_name in cls_smpls_info:
    smpls_h5_file = cls_smpls_info[cls_name].file_h5
    n_smpls = rsgislib.classification.get_num_samples(smpls_h5_file)
    print(f"{cls_name}: {n_smpls}")

artificial_surfaces: 454
bare_rock_sand: 5392
conifer_forest: 3335
deciduous_forest: 4021
grass_long: 1264
grass_short: 622
nonphoto_veg: 1989
scrub: 5961
water: 34232
bracken: 1399


## 6.5 Balance and Extract Training, Validation and Testing datasets

Observing the number of samples available for each class, there are a number of things which could be done. First, the samples can be balanced (i.e., the same number per class), which requires using the class with the fewest samples as the reference for defining the number of testing, training and validation samples. Alternatively, the sample data can be oversampled, or there are algorithms which attempt to generate artificial training samples (see the functions within the `rsgislib.classification.classimblearn` module, which make use of the [imbalanced-learn](https://imbalanced-learn.org) library).

For this tutorial, things will be kept simple and the class (artificial_surfaces) with the lowest number of samples will be used to define the number of samples for each class:

 * Training: 350
 * Validation: 50
 * Testing: 50
 
The samples are randomly selected from the population of input samples.

Again a helper function (`rsgislib.classification.create_train_valid_test_sets`) has been provided which makes it simpler to perform this analysis. For this, a dictionary of `rsgislib.classification.ClassInfoObj` objects needs to be defined, which specifies the file names for the training, validation and testing HDF5 files.

In this case, using the `get_class_info_dict` function, the existing dictionary (`cls_smpls_info`) will be looped through and the file names automatically defined by adding `_train`, `_valid` or `_test` to the existing file name of the HDF5 file.

In [15]:
cls_smpls_fnl_info = rsgislib.classification.get_class_info_dict(
    cls_smpls_info, out_dir
)

# Run the create_train_valid_test_sets helper function to
# create the train, valid and test datasets. The three numbers
# are the per-class sample counts: 50 testing, 50 validation
# and 350 training.
rsgislib.classification.create_train_valid_test_sets(
    cls_smpls_info, cls_smpls_fnl_info, 50, 50, 350
)

0=1: (Train:training_data/artificial_surfaces_linnorm_smpls_train.h5, Test:training_data/artificial_surfaces_linnorm_smpls_test.h5, Valid:training_data/artificial_surfaces_linnorm_smpls_valid.h5), (198, 116, 255)
1=2: (Train:training_data/bare_rock_sand_linnorm_smpls_train.h5, Test:training_data/bare_rock_sand_linnorm_smpls_test.h5, Valid:training_data/bare_rock_sand_linnorm_smpls_valid.h5), (76, 75, 148)
2=3: (Train:training_data/conifer_forest_linnorm_smpls_train.h5, Test:training_data/conifer_forest_linnorm_smpls_test.h5, Valid:training_data/conifer_forest_linnorm_smpls_valid.h5), (109, 163, 53)
3=4: (Train:training_data/deciduous_forest_linnorm_smpls_train.h5, Test:training_data/deciduous_forest_linnorm_smpls_test.h5, Valid:training_data/deciduous_forest_linnorm_smpls_valid.h5), (172, 89, 141)
4=5: (Train:training_data/grass_long_linnorm_smpls_train.h5, Test:training_data/grass_long_linnorm_smpls_test.h5, Valid:training_data/grass_long_linnorm_smpls_valid.h5), (55, 82, 98)
5=6: (Tr

# 7. Extract Training for Standard Deviation Normalised Image

## 7.1 Define Input Image bands

In [16]:
img_band_info = list()
img_band_info.append(
    rsgislib.imageutils.ImageBandInfo(
        file_name=norm_sd_img, name="sen2", bands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    )
)

## 7.2 Define Vector Samples

In [17]:
class_vec_sample_info = list()

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=1,
        class_name="artificial_surfaces",
        vec_file=vec_train_file,
        vec_lyr="Artificial_Surfaces",
        file_h5=os.path.join(tmp_dir, "artificial_surfaces_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=2,
        class_name="bare_rock_sand",
        vec_file=vec_train_file,
        vec_lyr="Bare_Rock_Sand",
        file_h5=os.path.join(tmp_dir, "bare_rock_sand_sdnorm_smpls.h5"),
    )
)


# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=3,
        class_name="conifer_forest",
        vec_file=vec_train_file,
        vec_lyr="Conifer_Forest",
        file_h5=os.path.join(tmp_dir, "conifer_forest_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=4,
        class_name="deciduous_forest",
        vec_file=vec_train_file,
        vec_lyr="Deciduous_Forest",
        file_h5=os.path.join(tmp_dir, "deciduous_forest_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=5,
        class_name="grass_long",
        vec_file=vec_train_file,
        vec_lyr="Grass_Long",
        file_h5=os.path.join(tmp_dir, "grass_long_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=6,
        class_name="grass_short",
        vec_file=vec_train_file,
        vec_lyr="Grass_Short",
        file_h5=os.path.join(tmp_dir, "grass_short_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=7,
        class_name="nonphoto_veg",
        vec_file=vec_train_file,
        vec_lyr="NonPhotosynthetic_Vegetation",
        file_h5=os.path.join(tmp_dir, "nonphoto_veg_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=8,
        class_name="scrub",
        vec_file=vec_train_file,
        vec_lyr="Scrub",
        file_h5=os.path.join(tmp_dir, "scrub_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=9,
        class_name="water",
        vec_file=vec_train_file,
        vec_lyr="Water_Training",
        file_h5=os.path.join(tmp_dir, "water_sdnorm_smpls.h5"),
    )
)

# Define the file name of the samples HDF5 file, which will be created
class_vec_sample_info.append(
    rsgislib.classification.ClassVecSamplesInfoObj(
        id=10,
        class_name="bracken",
        vec_file=vec_train_file,
        vec_lyr="Bracken",
        file_h5=os.path.join(tmp_dir, "bracken_sdnorm_smpls.h5"),
    )
)

## 7.3 Extract Sample Data

In [18]:
cls_smpls_info = rsgislib.classification.get_class_training_data(
    img_band_info, class_vec_sample_info, tmp_dir, ref_img=norm_sd_img
)

Creating output image using input image

Running Rasterise now...
Get Image Min and Max.

Creating output image using input imageGet Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.




Running Rasterise now...
Get Image Min and Max.

Creating output image using input imageGet Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.




Running Rasterise now...
Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image
Running Rasterise now...

Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.


Creating output image using input image
Running Rasterise now...

Get Image Min and Max.

Get Image Histogram.

Adding Histogram and Colour Table to image file
Calculating Image Pyramids.

Creating output image using input image


Running R

## 7.4 How many samples were extracted

In [19]:
for cls_name in cls_smpls_info:
    smpls_h5_file = cls_smpls_info[cls_name].file_h5
    n_smpls = rsgislib.classification.get_num_samples(smpls_h5_file)
    print(f"{cls_name}: {n_smpls}")

artificial_surfaces: 454
bare_rock_sand: 5392
conifer_forest: 3335
deciduous_forest: 4021
grass_long: 1264
grass_short: 622
nonphoto_veg: 1989
scrub: 5961
water: 34232
bracken: 1399


## 7.5 Balance and Extract Training, Validation and Testing datasets

Observing the number of samples available for each class, there are a number of things which could be done. First, the samples can be balanced (i.e., the same number per class), which requires using the class with the fewest samples as the reference for defining the number of testing, training and validation samples. Alternatively, the sample data can be oversampled, or there are algorithms which attempt to generate artificial training samples (see the functions within the `rsgislib.classification.classimblearn` module, which make use of the [imbalanced-learn](https://imbalanced-learn.org) library).

For this tutorial, things will be kept simple and the class (artificial_surfaces) with the lowest number of samples will be used to define the number of samples for each class:

 * Training: 350
 * Validation: 50
 * Testing: 50
 
The samples are randomly selected from the population of input samples.

Again a helper function (`rsgislib.classification.create_train_valid_test_sets`) has been provided which makes it simpler to perform this analysis. For this, a dictionary of `rsgislib.classification.ClassInfoObj` objects needs to be defined, which specifies the file names for the training, validation and testing HDF5 files.

In this case, using the `get_class_info_dict` function, the existing dictionary (`cls_smpls_info`) will be looped through and the file names automatically defined by adding `_train`, `_valid` or `_test` to the existing file name of the HDF5 file.

In [20]:
cls_smpls_fnl_info = rsgislib.classification.get_class_info_dict(
    cls_smpls_info, out_dir
)

# Run the create_train_valid_test_sets helper function to
# create the train, valid and test datasets. The three numbers
# are the per-class sample counts: 50 testing, 50 validation
# and 350 training.
rsgislib.classification.create_train_valid_test_sets(
    cls_smpls_info, cls_smpls_fnl_info, 50, 50, 350
)

0=1: (Train:training_data/artificial_surfaces_sdnorm_smpls_train.h5, Test:training_data/artificial_surfaces_sdnorm_smpls_test.h5, Valid:training_data/artificial_surfaces_sdnorm_smpls_valid.h5), (166, 228, 246)
1=2: (Train:training_data/bare_rock_sand_sdnorm_smpls_train.h5, Test:training_data/bare_rock_sand_sdnorm_smpls_test.h5, Valid:training_data/bare_rock_sand_sdnorm_smpls_valid.h5), (54, 125, 207)
2=3: (Train:training_data/conifer_forest_sdnorm_smpls_train.h5, Test:training_data/conifer_forest_sdnorm_smpls_test.h5, Valid:training_data/conifer_forest_sdnorm_smpls_valid.h5), (37, 154, 124)
3=4: (Train:training_data/deciduous_forest_sdnorm_smpls_train.h5, Test:training_data/deciduous_forest_sdnorm_smpls_test.h5, Valid:training_data/deciduous_forest_sdnorm_smpls_valid.h5), (175, 56, 73)
4=5: (Train:training_data/grass_long_sdnorm_smpls_train.h5, Test:training_data/grass_long_sdnorm_smpls_test.h5, Valid:training_data/grass_long_sdnorm_smpls_valid.h5), (26, 113, 170)
5=6: (Train:training_