<font face='Calibri' size='2'> <i>eSBAE - Notebook Series - Part 5a, version 0.1, April 2023. Andreas Vollrath, UN-Food and Agricultural Organization, Rome</i>
</font>

![title](images/header.png)

# V-a - Unsupervised Subsampling
### Run a KMeans unsupervised clustering algorithm to get a first set of *"statistically balanced"* training data
-------

This notebook takes you through clustering your data points in an unsupervised fashion. From the clusters you can select a subsample that serves as a basis for annotation. The data is exported as a CEO-compatible file.

We use KMeans rather than a simple random selection because we want to capture rare classes over-proportionally. For example, suppose one of the clusters represents change events but contains only 1% of all samples. A random selection of 100 points would then be expected to include only about 1 point from that cluster. If we instead choose 10 clusters and sample 10 points within each cluster, 10% of the selected samples come from the change cluster. This benefits the subsequent classification steps, which need a sufficient number of samples for the rare change classes.
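
The arithmetic above can be checked with a small sketch. The population below is hypothetical (10,000 points, 10 clusters, with cluster 9 holding 1% of all points), and the code uses only the Python standard library rather than the `sampling_handler` package.

```python
import random
from collections import Counter

rng = random.Random(25)

# Hypothetical population: 10,000 points in 10 clusters; cluster 9 holds 1% of them
sizes = [1100] * 9 + [100]
labels = [cluster for cluster, n in enumerate(sizes) for _ in range(n)]

# Simple random selection of 100 points, counted per cluster
random_counts = Counter(rng.sample(labels, 100))

# Equal allocation: 10 points drawn from each of the 10 clusters
stratified_counts = Counter()
for cluster in range(10):
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    stratified_counts[cluster] = len(rng.sample(members, 10))

# Expected number of rare-cluster points in the random selection
expected_random = 100 * sizes[9] / len(labels)

print("expected rare points, random selection:", expected_random)
print("drawn rare points, random selection:", random_counts[9])
print("rare points, 10 per cluster:", stratified_counts[9])
```

The random draw varies with the seed, while the per-cluster allocation always yields 10 points from the rare cluster.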

Choosing the number of clusters is an iterative process. The statistics plots help you see whether any cluster predominantly captures forest change. The number of points per cluster is determined by the overall number of samples you are later able to interpret.

### 1 - Import libraries (*only execute this cell*)

This cell will provide us with the functionality we need for running the subsequent cells of the notebook.

In [None]:
from sampling_handler import KMeansSubSampling

### 2 - Basic Input Variables

Here a so-called class instance is initialized. The class instance needs some parameters to be set and is stored in the *esbae* variable. See the commented lines for further explanation.

In [None]:
esbae = KMeansSubSampling(

    # your project name (NEEDS to be consistent with previous notebooks of your project)
    project_name = 'my_first_esbae_project',
    
    # select the number of clusters (reasonable numbers range from 5 to 30)
    clusters=10,
    
    # select the points per cluster (multiplied by the number of clusters, this gives the overall number of samples you will get)
    points_per_cluster=10,
    
    # a random state for reproducibility (can be any integer number)
    random_state=25
)

### 3 - Run the clustering algorithm

The clustering algorithm should normally be run with standardized inputs (i.e. all input variables are standardized to the same range). In certain cases, however, non-standardized inputs may give better results.
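
To illustrate why standardization usually matters for a distance-based method such as KMeans, here is a minimal sketch with made-up values. It assumes z-score scaling, which may differ from what `standardize_inputs=True` does internally. The two columns stand in for a confidence value between 0 and 1 and a magnitude in much larger raw units.

```python
from statistics import mean, pstdev

# Hypothetical points: (confidence in 0..1, magnitude in raw units)
points = [(0.10, 1000.0), (0.90, 1010.0), (0.12, 1100.0)]


def standardize(rows):
    """Z-score every column: subtract the column mean, divide by its standard deviation."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]


def confidence_share(a, b):
    """Fraction of the squared Euclidean distance between a and b due to the first column."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[0] / sum(squared)


# The first two points differ strongly in confidence but only slightly in magnitude
raw_share = confidence_share(points[0], points[1])
scaled = standardize(points)
std_share = confidence_share(scaled[0], scaled[1])

print(f"share of distance from confidence, raw inputs: {raw_share:.3f}")
print(f"share of distance from confidence, standardized inputs: {std_share:.3f}")
```

With raw inputs the distance between the first two points is driven almost entirely by the magnitude column, even though their confidence values differ far more in relative terms. After standardization the confidence difference dominates.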

In [None]:
esbae.cluster(standardize_inputs=True)

### 4 - Plot the clusters on a map

In [None]:
import contextily as cx
esbae.plot_clusters(markersize=2, basemap=cx.providers.Esri.WorldImagery) #check other basemaps here: https://contextily.readthedocs.io/en/latest/intro_guide.html

### 5 - Plot the clusters against the input variables to get an idea of what they represent

In [None]:
esbae.plot_stats(class_column='KMeans', cols_to_plot=['cusum_confidence', 'cusum_magnitude', 'esa_lc20'])

### 6 - Subsampling

This step subsamples each cluster by the number of points per cluster defined during initialization in step 2. There are two ways of selecting the subset: *randomly*, or using a *space filling curve* based on the Hilbert distance.
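
The idea behind selection along a space filling curve can be sketched as follows. This is an illustrative toy on a 4 x 4 grid, not the implementation used by `sampling_handler`. The function names `hilbert_index` and `pick_along_curve` are hypothetical. Points are ordered by their position along a Hilbert curve and then picked at regular intervals, which tends to spread the selection evenly in space.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along a Hilbert curve on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def pick_along_curve(cells, k, n):
    """Order cells by Hilbert index and take k of them at regular intervals (assumes k <= len(cells))."""
    ordered = sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
    step = len(ordered) / k
    return [ordered[int(i * step + step / 2)] for i in range(k)]


# Toy example: all cells of a 4 x 4 grid, from which 4 are selected
grid = [(x, y) for x in range(4) for y in range(4)]
picked = pick_along_curve(grid, 4, 4)
print(picked)
```

Because neighbouring positions on the curve are also neighbours in space, evenly spaced picks along the curve land in different parts of the grid, whereas a purely random selection may cluster by chance.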

The output can be saved in a CEO-compatible format: a csv file with point coordinates is created, along with a zipped shapefile. If the bounds_reduce option was used in Notebook 3, the polygons of the reducer are taken as geometry.

In [None]:
esbae.sampling_type = 'space_filling_curve'  # or 'random'
esbae.select_samples(save_as_ceo=True)

### 7 - Plot subsample selection on a map

In [None]:
esbae.plot_samples(markersize=5, basemap=cx.providers.Esri.WorldImagery)