# Create and run a Machine Learning model as a custom-script

This notebook showcases how to create a Machine Learning (ML) custom-script for water detection. The workflow uses [eo-learn](https://eo-learn.readthedocs.io/en/latest/) to process the data and [LightGBM](https://lightgbm.readthedocs.io/en/latest/) to train an ML model that classifies water from Sentinel-2 band and index values. The resulting custom-script can be used in [the Sentinel Hub EOBrowser](https://www-test.sentinel-hub.com/explore/eobrowser/), in the [multi-temporal instance of Sentinel Playground](https://apps.sentinel-hub.com/sentinel-playground-temporal/?source=S2&lat=40.4&lng=-3.730000000000018&zoom=12&preset=1-NATURAL-COLOR&layers=B04,B03,B02&maxcc=20&gain=1.0&temporal=true&gamma=1.0&time=2015-01-01%7C2019-10-02&atmFilter=&showDates=false) and as an evalscript in the [Sentinel Hub process API](https://docs.sentinel-hub.com/api/latest/api/process/).

The workflow is as follows:

 * download training and testing data
 * prepare samples for ML algorithm
 * train ML model
 * export trained model as custom-script

In [1]:
# Jupyter notebook related
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Built-in modules
import urllib.request
import shutil
import zipfile

# Basics of Python data handling and visualization
import numpy as np  # needed by the train/test split cell below
# np.random.seed(42)
import geopandas as gpd
# import matplotlib as mpl
# import matplotlib.pyplot as plt
# import matplotlib.gridspec as gridspec
# from matplotlib.colors import ListedColormap, BoundaryNorm
# from mpl_toolkits.axes_grid1 import make_axes_locatable
# from shapely.geometry import Polygon
# from tqdm.auto import tqdm

# # Machine learning 
# import lightgbm as lgb
# import joblib
# from sklearn import metrics
# from sklearn import preprocessing

# # Imports from eo-learn and sentinelhub-py
# from eolearn.core import EOTask, EOPatch, LinearWorkflow, FeatureType, OverwritePermission, \
#     LoadTask, SaveTask, EOExecutor, ExtractBandsTask, MergeFeatureTask
# from eolearn.io import SentinelHubInputTask, ExportToTiff
# from eolearn.mask import AddMultiCloudMaskTask, AddValidDataMaskTask
# from eolearn.geometry import VectorToRaster, PointSamplingTask, ErosionTask
# from eolearn.features import LinearInterpolation, SimpleFilterTask, NormalizedDifferenceIndexTask
# from sentinelhub import UtmZoneSplitter, BBox, CRS, DataSource

## Download the data

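The labelled data used in this notebook:

* were used to improve the blue-bot observatory
* were collected with the Classification App
* are available for download from the bucket at `DATA_URL`
* are described in `data-info.geojson`, which lists what each eopatch contains

Set up the URL and data paths.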

In [2]:
DATA_URL = 'http://queryplanet.sentinel-hub.com/water-labels'

DATA_INFO = f'{DATA_URL}/data-info.geojson'
EOP_URL = f'{DATA_URL}/eopatches.zip'

EOP_ZIP = './eopatches.zip'
EOP_DIR = '.'

In [3]:
gdf = gpd.read_file(DATA_INFO)

In [4]:
gdf.head()

| Unnamed: 0 | has_DEM | has_S1_ASC | has_S1_DES | has_S2 | task_id | timestamp | window_height | window_width | geometry |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 6b73fd74a2eb11e994fbf0db728b8d14 | 2016-09-26 | 64 | 64 | POLYGON ((65.84065 52.68916, 65.84065 52.69491... |
| 1 | 1 | 0 | 1 | 1 | 0a27f4aea2eb11e9bfdaa9140581204c | 2017-08-12 | 64 | 64 | POLYGON ((64.47018 50.95125, 64.47018 50.95701... |
| 2 | 1 | 1 | 1 | 1 | 9b60193ea24111e98d7d929084d604de | 2019-03-23 | 64 | 64 | POLYGON ((31.15408 38.61106, 31.15408 38.61600... |
| 3 | 1 | 1 | 1 | 1 | 9afaabaea24111e983d185262381d01a | 2018-05-31 | 64 | 64 | POLYGON ((-1.60285 38.21084, -1.60285 38.21558... |
| 4 | 1 | 0 | 1 | 1 | 9dfb5254a24111e9a087737fe71fcff1 | 2018-05-27 | 64 | 64 | POLYGON ((48.22342 48.99523, 48.22342 49.00099... |


In [5]:
len(gdf), len(gdf[gdf.has_S2==1])

(7671, 7671)

### Download and unzip file with eopatches

In [6]:
print(f'Downloading {EOP_URL} to {EOP_ZIP}..')
with urllib.request.urlopen(EOP_URL) as response, open(EOP_ZIP, 'wb') as zip_file:
    shutil.copyfileobj(response, zip_file)
    
print(f'Unzipping {EOP_ZIP} to {EOP_DIR}..')
with zipfile.ZipFile(EOP_ZIP, 'r') as zip_file:
    zip_file.extractall(EOP_DIR)

In [13]:
ls -lth {EOP_DIR}/eopatches | wc -l 

7672
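
The count above is one higher than the 7671 rows in `gdf` because `ls -l` prints a leading `total` line before the entries. The same check can be sketched in Python; `count_entries` is a hypothetical helper name, and the commented usage assumes the archive was extracted into `eopatches` as in the cell above.

```python
from pathlib import Path


def count_entries(directory):
    """Count the entries (files and sub-directories) directly inside `directory`."""
    return sum(1 for _ in Path(directory).iterdir())


# Hypothetical usage, assuming the archive was extracted as in the cell above:
# n_patches = count_entries(Path(EOP_DIR) / 'eopatches')
# assert n_patches == len(gdf)
```

Unlike `ls`, `iterdir` also counts hidden entries, so the two numbers can differ if the folder contains dot-files.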


## Set up and run feature processing workflow with `eo-learn`

* load eopatches
* keep only B02, B03 and B04
* add NDWI, NDMI and NDBSI ?
* sample pixels
* save sampled arrays only
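
The index and sampling steps listed above can be illustrated with plain NumPy. This is a minimal sketch on synthetic data, not the `eo-learn` workflow itself: `normalized_difference` and `sample_pixels` are hypothetical helper names, the 64x64 patch size follows the `window_height` and `window_width` columns, and the pairing of a green and a near-infrared band for NDWI is the common convention rather than something this notebook specifies.

```python
import numpy as np


def normalized_difference(band_a, band_b, eps=1e-10):
    """Return (a - b) / (a + b), guarding against division by zero."""
    band_a = np.asarray(band_a, dtype=np.float64)
    band_b = np.asarray(band_b, dtype=np.float64)
    return (band_a - band_b) / (band_a + band_b + eps)


def sample_pixels(features, labels, n_samples, seed=42):
    """Randomly pick `n_samples` pixels (without replacement) from one patch.

    features: array of shape (height, width, n_features)
    labels:   array of shape (height, width)
    """
    height, width, n_features = features.shape
    rng = np.random.default_rng(seed)
    picked = rng.choice(height * width, size=n_samples, replace=False)
    return features.reshape(-1, n_features)[picked], labels.reshape(-1)[picked]


# Toy 64x64 patch with two synthetic bands standing in for green and NIR
rng = np.random.default_rng(0)
green = rng.uniform(0.01, 0.5, size=(64, 64))
nir = rng.uniform(0.01, 0.5, size=(64, 64))
ndwi = normalized_difference(green, nir)

# Stack bands and index into one feature array, then sample 100 pixels
features = np.dstack([green, nir, ndwi])
labels = (ndwi > 0).astype(np.uint8)  # synthetic labels, for illustration only
X, y = sample_pixels(features, labels, n_samples=100)
```

Sampling a fixed number of pixels per patch keeps the training table small and stops large patches from dominating the model.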

## Create train/cval/test sets

In [None]:
index_sets = []
# loop over unique time-stamps
for timestamp in gdf['timestamp'].unique():
    # get centroids of geometries with the same time-stamp
    centroids = gdf[gdf['timestamp'] == timestamp].geometry.centroid
    indices = centroids.index

    # convert centroids to an (n, 2) array of (x, y) coordinates
    centroids = np.array([[centroid.x, centroid.y] for centroid in centroids])
    ctr_std = np.std(centroids, axis=0)

    # if centroids are close together, put them in the same set
    if all(std < 1 for std in ctr_std):
        index_sets.append(indices)
        continue

    # sort geometries by x coordinate and split wherever consecutive
    # centroids are more than 1 degree apart
    order = np.argsort(centroids[:, 0])
    diff_norm = np.linalg.norm(np.diff(centroids[order], axis=0), axis=-1)
    index_breaks, = np.where(diff_norm > 1)
    ileft = 0
    for ib in index_breaks:
        iright = int(ib + 1)
        index_sets.append(indices[order][ileft:iright])
        ileft = iright
    index_sets.append(indices[order][ileft:])

# randomly assign each set of neighbouring geometries to train or test
np.random.seed(42)
train_ratio = .8
train_ids = set(np.where(np.random.rand(len(index_sets)) <= train_ratio)[0])
test_ids = set(np.arange(len(index_sets))) - train_ids
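
Because `train_ids` and `test_ids` refer to positions in `index_sets` rather than to rows of the dataframe, the sets still have to be flattened before they can select eopatches. A minimal sketch on toy data follows; `flatten_sets` and the `toy_*` names are hypothetical and only illustrate the idea.

```python
def flatten_sets(index_sets, set_ids):
    """Concatenate the row indices of the selected sets into one sorted list."""
    return sorted(int(idx) for set_id in set_ids for idx in index_sets[set_id])


# Toy example: three groups of neighbouring geometries
toy_index_sets = [[0, 1, 2], [3], [4, 5]]
toy_train_ids = {0, 2}
toy_test_ids = {1}

train_rows = flatten_sets(toy_index_sets, toy_train_ids)
test_rows = flatten_sets(toy_index_sets, toy_test_ids)

# Neighbouring geometries never end up on both sides of the split
assert not set(train_rows) & set(test_rows)
```

With the real data, the resulting lists could be passed to `gdf.loc[...]` to select the training and testing rows. Splitting by group rather than by row keeps spatially adjacent patches from leaking between the two sets.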

## Train and evaluate model 

## Convert model to evalscript
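
One possible way to approach this step is to render each decision tree of the trained model as nested JavaScript `if`/`else` statements. The sketch below assumes trees given as nested dictionaries with `split_feature`, `threshold`, `left_child`, `right_child` and `leaf_value` keys, which mirrors the layout that LightGBM's `Booster.dump_model()` is understood to produce under `tree_structure`; treat that layout as an assumption to verify against your LightGBM version. It handles only numerical `<=` splits, ignores missing-value handling, and uses a hand-written toy tree rather than a trained model. All function names are hypothetical.

```python
def tree_to_js(node, feature_names, indent=1):
    """Recursively render one tree as nested JavaScript if/else statements."""
    pad = '  ' * indent
    if 'leaf_value' in node:
        return f"{pad}return {node['leaf_value']!r};\n"
    feature = feature_names[node['split_feature']]
    return (
        f"{pad}if ({feature} <= {node['threshold']!r}) {{\n"
        + tree_to_js(node['left_child'], feature_names, indent + 1)
        + f"{pad}}} else {{\n"
        + tree_to_js(node['right_child'], feature_names, indent + 1)
        + f"{pad}}}\n"
    )


def trees_to_js(trees, feature_names):
    """Render every tree as a JavaScript function, plus one summing their raw scores."""
    args = ', '.join(feature_names)
    parts = []
    for i, tree in enumerate(trees):
        parts.append(f"function tree{i}({args}) {{\n{tree_to_js(tree, feature_names)}}}\n")
    total = ' + '.join(f"tree{i}({args})" for i in range(len(trees)))
    parts.append(f"function rawScore({args}) {{\n  return {total};\n}}\n")
    return '\n'.join(parts)


def predict_tree(node, sample):
    """Evaluate the same tree in Python, to cross-check the generated JavaScript."""
    while 'leaf_value' not in node:
        go_left = sample[node['split_feature']] <= node['threshold']
        node = node['left_child'] if go_left else node['right_child']
    return node['leaf_value']


# Hand-written toy tree (not a trained model): split on NDWI, then on B03
toy_tree = {
    'split_feature': 1, 'threshold': 0.0,
    'left_child': {'leaf_value': -1.5},
    'right_child': {
        'split_feature': 0, 'threshold': 0.2,
        'left_child': {'leaf_value': 0.5},
        'right_child': {'leaf_value': 2.0},
    },
}
js_code = trees_to_js([toy_tree], ['B03', 'NDWI'])
print(js_code)
```

For a binary objective the summed raw score would still need a sigmoid to become a probability, and the generated functions would have to be wrapped in the evalscript structure that Sentinel Hub expects; neither is shown here.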

## Test evalscript