# Data Normalisation

Some classifiers (e.g., Support Vector Machines) require the image pixel values to be normalised so that they are all within the same range. This is even more important if you are merging data from different modalities (e.g., dB values from SAR are negative). Within the Sentinel-2 data, the range of values can differ considerably between bands: for example, the range in the visible bands is commonly quite low, while the range in the near infrared (NIR) is high.

There are different approaches to normalising the data, but for this tutorial we will try two:

- minimum -- maximum normalisation
- standard deviation normalisation

After the normalisation has been applied to the input imagery, the pixel values will be extracted from the images. This will result in three training sets: the two normalised datasets and the original dataset.

## Min-Max Normalisation

Applied on a per-band basis this normalisation calculates the minimum and maximum pixel values and then uses those to scale the rest of the data to the same range:

$$
out_X = \frac{X - min}{max-min} \times out_{range}
$$

Where $X$ is the current pixel, $min$ is the minimum for the whole image band, $max$ is the maximum for the whole image band, $out_{range}$ is the maximum image pixel value within the output image and $out_X$ is the output image pixel value written to the output image.
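
The formula above can be sketched for a single band in NumPy. This is a minimal illustration, not the RSGISLib implementation: the function name `min_max_normalise` and the sample arrays are hypothetical, and no-data handling is left out for brevity.

```python
import numpy as np


def min_max_normalise(band, out_range=1000.0):
    """Scale one band to [0, out_range] using its own minimum and maximum.

    Hypothetical helper for illustration only; no-data pixels are not handled.
    """
    band = np.asarray(band, dtype=np.float64)
    band_min = band.min()
    band_max = band.max()
    if band_max == band_min:
        # A constant band has no range to stretch, so return zeros.
        return np.zeros_like(band)
    return (band - band_min) / (band_max - band_min) * out_range


# Two made-up bands with very different input ranges.
visible = np.array([[200.0, 300.0], [400.0, 600.0]])
nir = np.array([[1000.0, 2000.0], [3000.0, 5000.0]])

visible_norm = min_max_normalise(visible)
nir_norm = min_max_normalise(nir)
print(visible_norm)
print(nir_norm)
```

Although the two sample bands start with different ranges, both end up spanning 0 to 1000 after normalisation, which is the property the classifier needs.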

## Standard Deviation Normalisation

Also applied on a per-band basis, this normalisation calculates the $mean$ and standard deviation ($stdev$) for each image band. The user provides the number of standard deviations ($n_{userstdevs}$) the data should be normalised over (e.g., 2 standard deviations). This provides the upper ($up_{stdev}$) and lower ($low_{stdev}$) bounds for the normalisations. 

$$
\begin{aligned}
low_{stdev} &= mean - (stdev \times n_{userstdevs}) \\
&\text{if } min > low_{stdev} \text{ then } low_{stdev} = min \\
up_{stdev} &= mean + (stdev \times n_{userstdevs}) \\
&\text{if } max < up_{stdev} \text{ then } up_{stdev} = max \\
out_X &= \frac{X - low_{stdev}}{up_{stdev} - low_{stdev}} \times out_{range}
\end{aligned}
$$

Where $X$ is the current pixel, $mean$ and $stdev$ are the mean and standard deviation of the image band, $min$ is the minimum for the whole image band, $max$ is the maximum for the whole image band, $out_{range}$ is the maximum image pixel value within the output image and $out_X$ is the output image pixel value written to the output image.
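
The steps above can also be sketched in NumPy. Again, this is an illustrative sketch under stated assumptions rather than the RSGISLib implementation: the function name `stddev_normalise` is hypothetical, the population standard deviation is used, and values falling outside the bounds are assumed to be clipped to the output range (the formula itself does not say what happens to them).

```python
import numpy as np


def stddev_normalise(band, n_stdevs=2.0, out_range=1000.0):
    """Stretch one band between mean -/+ n_stdevs standard deviations.

    Hypothetical helper for illustration only; no-data pixels are not handled.
    """
    band = np.asarray(band, dtype=np.float64)
    mean = band.mean()
    stdev = band.std()  # population standard deviation (assumption)

    # Bounds are limited to the actual minimum and maximum of the band.
    low = max(mean - stdev * n_stdevs, band.min())
    up = min(mean + stdev * n_stdevs, band.max())
    if up == low:
        return np.zeros_like(band)

    scaled = (band - low) / (up - low) * out_range
    # Assumption: pixels outside [low, up] are clipped to the output range.
    return np.clip(scaled, 0.0, out_range)


# Made-up band with mean 5 and standard deviation 2.
band = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

one_stdev = stddev_normalise(band, n_stdevs=1.0)
two_stdev = stddev_normalise(band, n_stdevs=2.0)
print(one_stdev)
print(two_stdev)
```

With one standard deviation the bounds are 3 and 7, so the extreme values 2 and 9 saturate at the ends of the output range. With two standard deviations the computed bounds (1 and 9) reach or exceed the band minimum and maximum, so the result is the same as a min-max stretch.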


## Running Notebook

The notebook has been run and saved with its outputs so you can see what the outputs should be, and so the notebook can be browsed online and make sense without having to be run.

If you are running the notebook yourself, it is recommended that you clear the existing outputs first. This can be done using one of the following options, depending on which system you are using:

**Jupyter-lab**:

> \> _Edit_ \> _'Clear All Outputs'_

**Jupyter-notebook**:

> \> _Cell_ \> _'All Outputs'_ \> _Clear_


# 1. Import Modules

In [1]:
import os

import rsgislib
import rsgislib.imageutils
from rsgislib.imageutils import STRETCH_LINEARMINMAX, STRETCH_LINEARSTDDEV

# 2. Define the input image

In [2]:
# The input image
input_img = "../data/sen2_20180629_t30uvd_orb037_osgb_stdsref_20m.tif"

# 3. Set up GTIFF options for output files 

In [3]:
# Define an environment variable so that output GeoTIFFs are tiled and compressed.
rsgislib.imageutils.set_env_vars_lzw_gtiff_outs()

# 4. Create the outputs directory

In [4]:
# The output directory.
out_dir = "norm_images"

# if the output directory does not exist
# then create it.
if not os.path.exists(out_dir):
    os.mkdir(out_dir)

# 5. Apply Linear Min--Max Normalisation

In [None]:
# The output image file for the linear normalisation
output_lin_img = os.path.join(
    out_dir, "sen2_20180629_t30uvd_orb037_osgb_stdsref_norm_linear.tif"
)

# Run the linear normalisation, where each input image
# band is independently normalised so that the minimum
# value is 1 and the maximum value is 1000.
rsgislib.imageutils.normalise_img_pxl_vals(
    input_img=input_img,
    output_img=output_lin_img,
    gdalformat="GTIFF",
    datatype=rsgislib.TYPE_16UINT,
    in_no_data_val=0,
    out_no_data_val=0,
    out_min=1,
    out_max=1000,
    stretch_type=STRETCH_LINEARMINMAX,
)

# Calculate image statistics and pyramids for the output image
rsgislib.imageutils.pop_img_stats(
    output_lin_img, use_no_data=True, no_data_val=0, calc_pyramids=True
)
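
The call above stretches to an output range of 1 to 1000 rather than 0 to 1000, presumably so that valid pixels cannot take the output no-data value of 0. The sketch below shows one way such a stretch to an arbitrary `[out_min, out_max]` range could be computed while leaving no-data pixels untouched. It is an assumption about the general approach, not a description of the RSGISLib internals, and the function name `rescale_to_range` is hypothetical.

```python
import numpy as np


def rescale_to_range(band, out_min=1.0, out_max=1000.0, no_data_val=0.0):
    """Min-max stretch of valid pixels to [out_min, out_max].

    Hypothetical helper for illustration only; no-data pixels keep their value.
    """
    band = np.asarray(band, dtype=np.float64)
    valid = band != no_data_val
    out = np.full_like(band, no_data_val)
    if not valid.any():
        return out

    # Statistics are calculated from the valid pixels only.
    band_min = band[valid].min()
    band_max = band[valid].max()
    if band_max == band_min:
        out[valid] = out_min
        return out

    fraction = (band[valid] - band_min) / (band_max - band_min)
    out[valid] = out_min + fraction * (out_max - out_min)
    return out


# Made-up band where 0 marks a no-data pixel.
band = np.array([0.0, 100.0, 550.0, 1000.0])
result = rescale_to_range(band)
print(result)
```

In this example the no-data pixel stays at 0, while the smallest valid pixel maps to 1 and the largest to 1000.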

# 6. Apply Linear Standard Deviation Normalisation


In [None]:
# The output image file for the standard deviation normalisation
output_sd_img = os.path.join(
    out_dir, "sen2_20180629_t30uvd_orb037_osgb_stdsref_norm_stddev.tif"
)

# Run the standard deviation normalisation
rsgislib.imageutils.normalise_img_pxl_vals(
    input_img,
    output_sd_img,
    "GTIFF",
    rsgislib.TYPE_16UINT,
    in_no_data_val=0,
    out_no_data_val=0,
    out_min=0,
    out_max=1000,
    stretch_type=STRETCH_LINEARSTDDEV,
    stretch_param=2,
)

# Calculate image statistics and pyramids for the output image
rsgislib.imageutils.pop_img_stats(
    output_sd_img, use_no_data=True, no_data_val=0, calc_pyramids=True
)