<a href="https://colab.research.google.com/github/twaldburger/flood475-presenter/blob/master/geo475_flood_modelling_in_gee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import ee
import geemap
import geemap.colormaps as cm
import time

## set some parameters, please update if required
PROJECT_ID = 'ee-timwaldburger-flood475' # your GEE project id
SAMPLE_SIZE = 100 # number of training locations per flood event and class

## connect to GEE
try:
    ee.Initialize(project=PROJECT_ID)
except ee.EEException:
    ee.Authenticate()
    ee.Initialize(project=PROJECT_ID)

---
# Data exploration and visualization

In [2]:
## define the datasets we want to use for our model
globalFlood = ee.ImageCollection("GLOBAL_FLOOD_DB/MODIS_EVENTS/V1")
dem = ee.ImageCollection('COPERNICUS/DEM/GLO30') \
        .select('DEM')
landcover = ee.ImageCollection('ESA/WorldCover/v200')
hydro = ee.Image('MERIT/Hydro/v1_0_1')
prec = ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY') \
         .select('total_precipitation')
runoffPotential = ee.Image('projects/sat-io/open-datasets/HiHydroSoilv2_0/Hydrologic_Soil_Group_250m') \
                    .remap([1, 2, 3, 4, 14, 24, 34], [1, 2, 3, 4, 1, 3, 4])

> **Task:** Get an overview of the datasets we use. What data do they provide? Who produced them? What is their spatial and temporal resolution? A good starting point is the [Earth Engine Data Catalog](https://developers.google.com/earth-engine/datasets).  
**@presenter:** Ask for people to explain the datasets. Ask them to mention special characteristics (e.g. very low resolution) and explain where they see problems that should be kept in mind.

In [16]:
floods.reduceRegion(geometry=globalFlood.geometry(), reducer=ee.Reducer.max()).getInfo()

{'flooded': 33}

In [13]:
## subtract the permanent water bodies from the flooded areas
def subtractPermanentWater(img):
  flood = img.select('flooded')
  perm = img.select('jrc_perm_water')
  return flood.multiply(perm.eq(0))
floods = globalFlood.map(subtractPermanentWater).sum()

## color flooded areas in blue
flood_vis = {
    'min': 0,
    'max': 10,
    'palette': cm.palettes.Blues
}

## create and display an interactive map
Map = geemap.Map()
Map.add_layer(floods.selfMask(), flood_vis, 'Historic floods')
Map.add_colorbar(flood_vis, label="Number of floods", layer_name="Historic floods", font_size=9)
Map


> **Task:** Add the input datasets to the map as individual layers. You can find the geemap documentation [here](https://geemap.org/).  
**@presenter:** Ask if someone wants to share their map and explain what functions they used to create it.

In [None]:
#TODO: add layers to map

---
# Training dataset



In [None]:
def pointQuery(fc:ee.FeatureCollection, img:ee.Image, prop:str) -> ee.FeatureCollection:
  """
  Returns pixel values at locations using a feature collection and an image.

  Parameters
  ----------
  fc : ee.FeatureCollection
    Collection of points at which to query the image.
  img : ee.Image
    The image to query.
  prop : str
    Name of new property to hold the query results.

  Returns
  -------
  ee.FeatureCollection
    Input FeatureCollection with lookup values added as new property.
  """
  fc = img.reduceRegions(collection=fc, reducer=ee.Reducer.first())
  return fc.map(lambda feat: feat.set(prop, feat.get('first')))


def removeProperty(feat:ee.Feature, prop:str) -> ee.Feature:
  """
  Removes a property by name from a feature.

  Parameters
  ----------
  feat : ee.Feature
    Feature from which to remove the property.
  prop : str
    Property to remove.

  Returns
  -------
  ee.Feature
    Input feature without the removed property.
  """
  selectProperties = feat.propertyNames().filter(ee.Filter.neq('item', prop))
  return feat.select(selectProperties)


def createSample(img:ee.Image) -> ee.FeatureCollection:
  """
  Samples training dataset for a single flood image.

  Parameters
  ----------
  img : ee.Image
    Input image.

  Returns
  -------
  ee.FeatureCollection
    Sampled and enriched training data.
  """

  ## subtract permanent water bodies from flooded areas
  permanent = img.select('jrc_perm_water')
  water = img.select('flooded')
  flooded = water.subtract(permanent).gt(0)

  ## get the total and maximum precipitation over 14 days prior to the event end date
  end = img.getNumber('system:time_end')
  start = end.subtract(1209600000) # timestamp in milliseconds: 60 * 60 * 24 * 14 * 1000
  precSum = prec.filter(ee.Filter.date(start, end)).sum()
  precMax = prec.filter(ee.Filter.date(start, end)).max()

  ## sample equal number of flooded and non-flooded points
  sample = flooded.stratifiedSample(numPoints=SAMPLE_SIZE, classBand='flooded', geometries=True)

  ## add image id in case we want to join the event metadata later
  sample = sample.map(lambda x: x.set('eventId', img.get('system:index')))

  ## enrich sample by running point lookups on multiple datasets
  sample = pointQuery(sample, dem.mosaic(), 'demElevationAbs')
  sample = pointQuery(sample, ee.Terrain.aspect(dem.mosaic()), 'demAspect')
  sample = pointQuery(sample, ee.Terrain.slope(dem.mosaic()), 'demSlope')
  sample = pointQuery(sample, landcover.first(), 'landcover')
  sample = pointQuery(sample, hydro.select('upa'), 'upa')
  sample = pointQuery(sample, runoffPotential, 'runoffPot')
  sample = pointQuery(sample, precSum, 'precSum')
  sample = pointQuery(sample, precMax, 'precMax')

  ## remove first-property
  sample = sample.map(lambda feat: removeProperty(feat, 'first'))

  ## normalize elevation
  elevationRange = dem.mosaic().reduceRegion(geometry=img.geometry(), reducer=ee.Reducer.minMax())
  min = ee.Number(elevationRange.get('DEM_min'))
  max = ee.Number(elevationRange.get('DEM_max'))
  def normalizeElevation(feat:ee.Feature) -> ee.Feature:
    return feat.set('demElevationNorm', (ee.Number(feat.get('demElevationAbs')).subtract(min)).divide(max.subtract(min)))
  sample = sample.filter(ee.Filter.notNull(ee.List(['demElevationAbs']))).map(normalizeElevation)

  return sample

> **Task:** Try to understand the code in the cell above. Why are we using *ee.Image.stratifiedSample* in line 70 instead of using the much faster *ee.Image.sample* method? Why are we using the complicated GEE methods in lines 65-76 and 90-94 instead of simply using plain Python? What does *ee.FeatureCollection.map* do? Why do we not just write a simple for-loop instead? Why do we normalize the elevation values?  
**@presenter:** Ask where people looked up the GEE functions. Show the documentation section in the [GEE Code editor](https://code.earthengine.google.com/) if not mentioned.  
**@presenter:**
- *stratifiedSample* samples the same number of locations within each class. Since our flood footprints mostly contain non-flooded pixels, using *stratifiedSample* is a convenient way to avoid oversampling non-flooded pixels. However, there might also be a risk of *stratifiedSample* oversampling certain flooded areas if the flood footprint is very small.
- We want to execute all the code in GEE itself so we can benefit from its optimization and parallelisation. If we used plain Python, we would need to fetch the data from the GEE server into our Colab runtime, which would make the whole process extremely inefficient.
- *map* iterates over a feature collection and applies an algorithm to each feature. This process is run in parallel on the GEE server. Using a Python for-loop instead would lose the parallel execution and also create all the problems mentioned in the last question.
- We use a global flood dataset, but look at the individual events independently. We do not want to create a bias towards the general elevation of the region where an event took place and are therefore normalizing the elevation within each image (and therefore for each event).
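
To make the first point more tangible, here is a minimal plain-Python sketch that compares simple random sampling with stratified sampling on an imbalanced set of pixel labels. It is an illustration only: the function name `stratified_sample` and the toy label list are assumptions, and the sketch does not reproduce how *ee.Image.stratifiedSample* works on the GEE server.

```python
import random
from collections import Counter

def stratified_sample(labels, num_points, seed=0):
    """Pick up to num_points indices per class from a list of class labels."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    picked = []
    for label, indices in sorted(by_class.items()):
        k = min(num_points, len(indices))  # a class may hold fewer pixels than requested
        picked.extend(rng.sample(indices, k))
    return picked

# toy footprint (assumption): 950 non-flooded (0) and 50 flooded (1) pixels
labels = [0] * 950 + [1] * 50

simple = random.Random(0).sample(range(len(labels)), 40)
stratified = stratified_sample(labels, num_points=20)

simple_counts = Counter(labels[i] for i in simple)
stratified_counts = Counter(labels[i] for i in stratified)
print("simple random:", dict(simple_counts))
print("stratified:   ", dict(stratified_counts))
```

The stratified draw always returns 20 points per class, whereas the simple random draw mirrors the imbalance of the footprint.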

In [None]:
## create enriched sample and store the result as an asset
properties = ['demAspect', 'demElevationAbs', 'demElevationNorm', 'demSlope', 'eventId', 'flooded', 'landcover', 'precMax', 'precSum', 'runoffPot', 'upa', '.geo']
sample = globalFlood.map(createSample).flatten()
sample = sample.filter(ee.Filter.notNull(properties)).distinct(properties)
task = ee.batch.Export.table.toAsset(sample, description='flood475-sampling', assetId=f"projects/{PROJECT_ID}/assets/flood475_sample_2")

In [None]:
## run and monitor the task
task.start()
while task.active():
  ts = task.status()
  if ts['start_timestamp_ms']>0:
    s = round((ts['update_timestamp_ms']-ts['start_timestamp_ms'])/1000)
  else:
    s = round((ts['update_timestamp_ms']-ts['creation_timestamp_ms'])/1000)
  print(f"task '{ts['description']}' is {ts['state']} for {s} seconds")
  time.sleep(60)
task.status()

> **Task:** There are multiple places (outside of Google Colab) where we can also monitor our tasks. Can you find them?  
**@presenter:** Show the 3 places:
- [GEE Code editor](https://code.earthengine.google.com/)
- [Task Manager](https://code.earthengine.google.com/tasks)
- [Tasks Page in the Cloud Console](https://console.cloud.google.com/earth-engine/tasks?project=ee-timwaldburger-flood475)

> **Task:** Add the sampled data points to your map from above.

In [None]:
#TODO: add data points to map

# Creating a training dataset
In this first notebook, we create a set of training locations which we can then use to train a flood prediction model.

Please go through the explanations and code step-by-step and run each code cell.
> **Task:** Tasks and questions are marked like this. Please try to answer them before proceeding with the next cell.

---
## Preparations
In this first section, we handle all imports and set some variables. We also initialize the connection to GEE.

### Import dependencies
All dependencies required for this notebook are pre-installed in Google Colab. We can therefore just import them.

In [None]:
import ee
import geemap
import geemap.colormaps as cm
import google.auth
import google.colab
import pandas as pd
from datetime import datetime, timedelta
from pathlib import Path
from tqdm import tqdm
from time import sleep

### Define global variables
The cell below defines some global variables. You are only required to set the project ID but can change the other variables if you want.
- `PROJECT_ID` This is your GEE project ID. If you do not remember it, you can go to the [GEE code editor](https://code.earthengine.google.com/) and list your projects by clicking on your user symbol in the top-right corner.
- `SAMPLE_SIZE` This is the number of training locations we want to create. To keep computation times small, I suggest that you choose a sample size below 5000.
- `TRAINING_DATA` This is the file name under which to save the training locations. Use different file names if you generate different training data. It is not required to add a file type.
- `SEED` This is the seed value used for random sampling. Setting this here makes our sampling reproducible.
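
The effect of a seed can be tried out without GEE. The following sketch uses Python's `random` module; the helper `sample_points` and the bounding box are made up for illustration and have nothing to do with the sampling function we use later on.

```python
import random

def sample_points(n, seed):
    """Draw n pseudo-random (lon, lat) pairs inside a toy bounding box."""
    rng = random.Random(seed)
    return [(rng.uniform(8.0, 9.0), rng.uniform(47.0, 48.0)) for _ in range(n)]

first = sample_points(3, seed=123)
second = sample_points(3, seed=123)
other = sample_points(3, seed=124)

print(first == second)  # same seed: identical points
print(first == other)   # different seed: different points
```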

In [None]:
PROJECT_ID = ''    # @param {type: 'string'}
SAMPLE_SIZE = 3000 # @param {type: 'integer'}
TRAINING_DATA = 'roi_sample' # @param {type: 'string'}
SEED = 123         # @param {type: 'integer'}

### Mount Google Drive
We will use Google Drive to store our preliminary results from GEE because we can mount it to Google Colab and therefore easily write and read data without the need of manually down- and uploading datasets.

**Important!** The cell below mounts your Google Drive to Google Colab and creates a new folder (named _geo475_ee_). This folder will be removed again at the end of the exercise (you can also keep it if you want, of course). **To make sure that we are not deleting any of your personal data, do not change the `data_dir`-variable in the cell below unless you know what you are doing.**

In [None]:
data_dir = Path('/content/gdrive/MyDrive/geo475_ee')

## mount Google Drive to Colab
if not data_dir.parent.exists():
  google.colab.drive.mount('/content/gdrive')

## create output directory for the project
if not data_dir.exists():
  data_dir.mkdir()

### Initialize Google Earth Engine
In the cell below, we connect to GEE using the same approach shown in [01_Connecting_to_GEE.ipynb](https://github.com/twaldburger/flood475/blob/master/01_Connecting_to_GEE.ipynb).

In [None]:
google.colab.auth.authenticate_user()
credentials, project_id = google.auth.default()
ee.Initialize(credentials, project=PROJECT_ID)
print(ee.String('Nice! That worked! :-)').getInfo())

---
## Data
In this section, we have a quick look at the datasets we will use to generate our training dataset. We use only data from the GEE Catalog for which you can find the links below:

- [Global Flood Database v1 (2000-2018)](https://developers.google.com/earth-engine/datasets/catalog/GLOBAL_FLOOD_DB_MODIS_EVENTS_V1#description)
- [CHIRPS Daily: Climate Hazards Group InfraRed Precipitation With Station Data (Version 2.0 Final)](https://developers.google.com/earth-engine/datasets/catalog/UCSB-CHG_CHIRPS_DAILY#description)
- [Copernicus DEM GLO-30: Global 30m Digital Elevation Model](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_DEM_GLO30#description)
- [MERIT Hydro: Global Hydrography Datasets](https://developers.google.com/earth-engine/datasets/catalog/MERIT_Hydro_v1_0_1#bands)
- [ESA WorldCover 10m v200](https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v200?hl=en#description)

> **Task:** Familiarize yourself by following the links to the GEE Catalog. Try to answer the following questions:
1. Which datasets are raster datasets? Which are vector?
2. How do the data differ in (spatial) resolution?
3. How many flood events does the [Global Flood Database v1 (2000-2018)](https://developers.google.com/earth-engine/datasets/catalog/GLOBAL_FLOOD_DB_MODIS_EVENTS_V1#description) contain?
4. How many bands does the Digital Elevation Model (DEM) have?
5. How is the precipitation dataset different from the others?
6. How do these data limit us in training a flood prediction model? Try to look at the temporal and spatial extent covered by the different datasets. Hint: You can visualize a single day of CHIRPS precipitation data by running the next two code cells.

With the code below, we import the relevant datasets from GEE. We are not reading the actual data into memory (it would never fit); instead, we reference the datasets as variables so we can start using them.




### Import from GEE

In [None]:
flood_ds = ee.ImageCollection('GLOBAL_FLOOD_DB/MODIS_EVENTS/V1')
elevation_ds = ee.ImageCollection('COPERNICUS/DEM/GLO30')
landcover_ds = ee.ImageCollection("ESA/WorldCover/v200")
precipitation_ds = ee.ImageCollection('UCSB-CHG/CHIRPS/DAILY')
flowaccumulation = ee.Image("MERIT/Hydro/v1_0_1").select('upa')

### Explore interactively
This code generates an interactive map of the precipitation data we are using. Note that although the dataset contains daily precipitation of over 30 years, we are only visualizing a single day.
> **Task:** Try to visualize other datasets as well. Hint: Many datasets define the code for a simple visualisation in the catalog.

In [None]:
## add a temporal filter to the CHIRPS dataset and select the precipitation layer
precipitation = precipitation_ds.filterDate('2018-05-01').select('precipitation')

## define a color palette for nicer visualisation
precipitation_vis = {
    'min': 1,
    'max': 17,
    'palette': ['001137', '0aab1e', 'e7eb05', 'ff4a2d', 'e90000'],
}

## create and display an interactive map
Map = geemap.Map()
Map.add_layer(precipitation, precipitation_vis, 'Precipitation')
Map

---
## Defining the training locations
In this section, we sample the locations we want to use for training our model.

### Defining the region of interest
First, we define our region of interest.
> **Task:** Run the cell below to visualize all historic flood events in our dataset. Decide which area you want to use to train your model and draw a region of interest (ROI) using the _Draw a rectangle_-tool from the tool bar on the left. Keep in mind the spatial constraints of our other datasets when choosing the ROI.
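
The cell below removes permanent water from each event and then sums all events into one flood count per pixel. The same logic can be sketched on plain Python lists. This is a simplified illustration: `count_floods` is a hypothetical helper, the values are made up, and unlike the real dataset all toy events cover exactly the same four pixels.

```python
def count_floods(events):
    """Count per pixel how often it was flooded, ignoring permanent water."""
    n_pixels = len(events[0]["flooded"])
    counts = [0] * n_pixels
    for event in events:
        for i in range(n_pixels):
            not_permanent = int(event["jrc_perm_water"][i] == 0)
            counts[i] += event["flooded"][i] * not_permanent
    return counts

# two made-up events over the same four pixels (1 = yes, 0 = no)
events = [
    {"flooded": [1, 1, 0, 1], "jrc_perm_water": [0, 1, 0, 0]},
    {"flooded": [1, 1, 0, 0], "jrc_perm_water": [0, 1, 0, 0]},
]
flood_counts = count_floods(events)
print(flood_counts)
```

The second pixel is water in both events but counts as zero because it is permanent water.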

In [None]:
## subtract the permanent water bodies from the flooded areas
def mapper(img):
  flood = img.select('flooded')
  perm = img.select('jrc_perm_water')
  return flood.multiply(perm.eq(0))
all_floods = flood_ds.map(mapper).sum()

## color flooded areas in blue
flood_vis = {
    'min': 0,
    'max': 7,
    'palette': cm.palettes.Blues
    }

## create and display an interactive map
Map = geemap.Map()
Map.add_layer(all_floods.selfMask(), flood_vis, 'Historic floods')
Map.add_colorbar(flood_vis, label="Number of floods", layer_name="Historic floods", font_size=9)
Map

### Sampling random locations
We use GEE to randomly sample locations within our ROI. The ROI is taken from the Map-object we defined in the cell above.
> **Task:** What needs to be considered if we want to sample evenly distributed locations? Hint: it is also mentioned in the documentation of the function we use: [ee.FeatureCollection.randomPoints](https://developers.google.com/earth-engine/apidocs/ee-featurecollection-randompoints).

Run the cell below to randomly sample training locations. Note that we use the two variables `SAMPLE_SIZE` and `SEED` from the beginning of the notebook as input for the sampling function.

In [None]:
## get the region of interest drawn on the map
roi = ee.FeatureCollection(Map.draw_features)

## randomly sample points within the ROI
roi_sample = ee.FeatureCollection.randomPoints(
    region=roi.geometry(),
    points=SAMPLE_SIZE,
    seed=SEED
)

---
## Data enrichment
In this section, we will enrich our training locations using several different datasets. The approach is always the same, namely performing a simple point lookup on each dataset using the locations we sampled above.
As the point lookup will be repeated several times, we define a custom function which we can then simply apply to each dataset.

It is important to know that calling this function does not yet trigger any computations on GEE. This will only happen once we create and start a task for it (which we will do further down below).
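
The idea of describing work first and running it later can be illustrated with a small toy class. This is only a conceptual sketch: `Deferred` is a made-up stand-in and does not reflect how the Earth Engine client library is implemented.

```python
class Deferred:
    """Records a chain of operations without running them (toy stand-in for a server-side object)."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, func):
        # returns a new description of the work; nothing is computed here
        return Deferred(self.source, self.steps + (func,))

    def compute(self):
        # only now are the recorded steps applied to every element
        result = list(self.source)
        for func in self.steps:
            result = [func(x) for x in result]
        return result

calls = []

def add_one(x):
    calls.append(x)  # keep track of when the function is really executed
    return x + 1

pipeline = Deferred([1, 2, 3]).map(add_one).map(add_one)
calls_before = len(calls)    # the pipeline is only described so far
values = pipeline.compute()  # comparable to starting an export task
calls_after = len(calls)
print(calls_before, values, calls_after)
```

Building the pipeline calls `add_one` zero times; the function only runs once `compute()` is called.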

In [None]:
def singleband_point_query(fc, img, scale, prop):
  """
  Run a point lookup on a single image.

  Parameters
  ----------
  fc : ee.FeatureCollection
    FeatureCollection of points for which to perform the lookup.
  img : ee.Image
    Image on which to perform the lookup.
  scale : float
    Spatial resolution of the image.
  prop : str
    Name of the property under which lookup values shall be added to the feature collection.

  Returns
  -------
  ee.FeatureCollection
    Input FeatureCollection with lookup values added as new property.
  """

  ## extract the pixel values at each location
  fc = img.reduceRegions(
      collection=fc,
      reducer=ee.Reducer.first(),
      scale=scale
  )

  ## add the pixel values as a new feature property
  def mapper(feature):
    return feature.set(prop, feature.get('first'))
  fc = fc.map(mapper)

  return fc

### Elevation, slope and aspect
We start by enriching our training locations with information from the DEM. We not only use the elevation above sea level but also slope and aspect which we derive from the DEM using GEE's powerful built-in functions. Note how we are making use of the function we defined in the cell above.

Dataset: [Copernicus DEM GLO-30: Global 30m Digital Elevation Model](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_DEM_GLO30#description)
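
To get a feeling for what slope and aspect mean, the sketch below derives both for the centre pixel of a 3x3 elevation window using central differences on the four direct neighbours. It is a simplified illustration under stated assumptions: `slope_aspect` is a hypothetical helper, the elevation values are made up, the aspect convention (0° = north, clockwise, facing downslope) is assumed, and GEE's own implementation may differ in detail.

```python
import math

def slope_aspect(window, cell_size):
    """Slope and aspect (degrees) at the centre of a 3x3 elevation window.

    Rows run north to south, columns west to east. Only the four direct
    neighbours of the centre pixel are used (central differences).
    """
    north, south = window[0][1], window[2][1]
    west, east = window[1][0], window[1][2]
    dz_dx = (east - west) / (2 * cell_size)    # elevation change towards east
    dz_dy = (north - south) / (2 * cell_size)  # elevation change towards north
    slope = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    # direction of steepest descent, clockwise from north (assumed convention)
    aspect = math.degrees(math.atan2(-dz_dx, -dz_dy)) % 360
    return slope, aspect

# toy window with 30 m pixels that drops 30 m per pixel towards the east
window = [
    [60, 30, 0],
    [60, 30, 0],
    [60, 30, 0],
]
slope, aspect = slope_aspect(window, cell_size=30)
print(round(slope, 1), round(aspect, 1))
```

A drop of 30 m over a 30 m pixel gives a slope of about 45°, and the terrain faces east, which corresponds to an aspect of about 90° under the assumed convention.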

In [None]:
## elevation
dem = elevation_ds.select('DEM')
roi_sample = singleband_point_query(
    fc=roi_sample,
    img=dem.mosaic(),
    scale=dem.first().projection().nominalScale().getInfo(),
    prop='elevation'
)

## slope
slope = ee.Terrain.slope(dem.mosaic())
roi_sample = singleband_point_query(
    fc=roi_sample,
    img=slope,
    scale=slope.projection().nominalScale().getInfo(),
    prop='slope'
)

## aspect
aspect = ee.Terrain.aspect(dem.mosaic())
roi_sample = singleband_point_query(
    fc=roi_sample,
    img=aspect,
    scale=aspect.projection().nominalScale().getInfo(),
    prop='aspect'
)

### Land cover
In this cell, we look up the different landcover classes.

Dataset: [ESA WorldCover 10m v200](https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v200?hl=en#description)

> **Task:** How many different classes does the dataset contain?

In [None]:
roi_sample = singleband_point_query(
    fc=roi_sample,
    img=landcover_ds.first(),
    scale=landcover_ds.first().projection().nominalScale().getInfo(),
    prop='landcover'
)

### Flow accumulation
Next, we are determining the upstream drainage area (flow accumulation area) for each of our sample locations. Again, we are just using our custom function for the lookup.

Dataset: [MERIT Hydro: Global Hydrography Datasets](https://developers.google.com/earth-engine/datasets/catalog/MERIT_Hydro_v1_0_1#bands)

> **Task:** Which unit should we expect for the upstream drainage area?

In [None]:
roi_sample = singleband_point_query(
    fc=roi_sample,
    img=flowaccumulation,
    scale=flowaccumulation.projection().nominalScale().getInfo(),
    prop='upstream_drainage_area'
)

### Precipitation
The precipitation data is different from the other datasets as it does not represent a certain point in time but actually contains over 15'000 individual layers representing over 30 years of daily precipitation. To reduce the number of input variables for our model, we will aggregate the data and only take the maximum precipitation for each pixel and within each year between 2000 and 2018.

Dataset: [CHIRPS Daily: Climate Hazards Group InfraRed Precipitation With Station Data (Version 2.0 Final)](https://developers.google.com/earth-engine/datasets/catalog/UCSB-CHG_CHIRPS_DAILY#description)

> **Task:** Why do we not use the full dataset? Why do we limit ourselves to only the years 2000 - 2018?
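
The aggregation itself is simple and can be sketched in plain Python for a single location. The helper `yearly_max` and the precipitation values are made up for illustration; only the naming pattern of the resulting properties follows the cell below.

```python
from collections import defaultdict
from datetime import date

def yearly_max(daily, reference_year=2018):
    """Reduce (date, value) pairs to one maximum per year, keyed like the notebook's properties."""
    maxima = defaultdict(float)  # starts at 0.0, which is fine for non-negative precipitation
    for day, value in daily:
        maxima[day.year] = max(maxima[day.year], value)
    return {
        f"daily_max_precipitation_-{reference_year - year}": value
        for year, value in sorted(maxima.items())
    }

# made-up daily precipitation values in mm for a single location
daily = [
    (date(2016, 3, 1), 4.0),
    (date(2016, 7, 9), 31.5),
    (date(2017, 1, 2), 12.0),
    (date(2018, 5, 1), 7.5),
    (date(2018, 8, 20), 55.0),
]
features = yearly_max(daily)
print(features)
```

Each location ends up with one value per year, and the suffix states how many years before 2018 the value refers to.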

In [None]:
for year in range(2000, 2019):
  daily_max = precipitation_ds.select('precipitation').filter(ee.Filter.date(f"{year}-01-01", f"{year+1}-01-01")).max()
  roi_sample = singleband_point_query(
      fc=roi_sample,
      img=daily_max,
      scale=daily_max.projection().nominalScale().getInfo(),
      prop=f"daily_max_precipitation_-{2018-year}"
  )

### Historic flood
The last piece of information still missing is whether a location has been flooded in the past. This will be our target variable for the model. Again, we can just use our custom function to run the point lookup on the historic floods dataset.

Dataset: [Global Flood Database v1 (2000-2018)](https://developers.google.com/earth-engine/datasets/catalog/GLOBAL_FLOOD_DB_MODIS_EVENTS_V1#description)

In [None]:
roi_sample = singleband_point_query(
    fc=roi_sample,
    img=all_floods,
    scale=flood_ds.select('flooded').first().projection().nominalScale().getInfo(),
    prop='floods'
)

### Compute and export results
As mentioned above, calling our custom point query-function has not triggered any computations but we merely defined the exact steps which need to be executed. We will now create a task to run all these steps on Google Earth Engine. The results will be stored on your Google Drive as a csv-table. This cell might run for a few minutes so we can use the time to remember (and appreciate) what is actually happening:
1. run point lookup on the DEM,
2. compute slope from the DEM and run point lookup,
3. compute aspect from the DEM and run point lookup,
4. run point lookup on the land cover dataset,
5. run point lookup on the upstream drainage area dataset,
6. aggregate the yearly maximum daily precipitation for two decades and run point lookup,
7. run point lookup on the historic floods dataset,
8. write the result to a .csv-file.

The computation time depends on the size of your region of interest, but keeping in mind that some of the datasets we use have spatial resolutions of 30 (or even 10) meters, the speed of Google Earth Engine is pretty impressive.

In [None]:
%%time

## remove output file if it already exists
try:
  (data_dir/f"{TRAINING_DATA}.csv").unlink()
except FileNotFoundError:
  pass

## create an export task
task = ee.batch.Export.table.toDrive(
    collection=roi_sample,
    description='data export to Google Drive',
    folder=data_dir.name,
    fileNamePrefix=TRAINING_DATA,
    fileFormat='CSV'
    )

## run and monitor export task
task.start()
while task.active():
  sleep(10)
task.status()

With the code below, we import the .csv file generated in GEE from Google Drive. We can now - finally - see the results of our data enrichment.

If the code below returns an error, you need to give it a few minutes since it might take some time for GEE to write the file to your Google Drive.
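
Instead of re-running the cell by hand, one could also wait for the file in a small loop. The sketch below is one possible way to do this; `wait_for_file` is a hypothetical helper, and the file name and the short timeout are placeholders for demonstration.

```python
import time
from pathlib import Path

def wait_for_file(path, timeout_s=300, poll_s=5):
    """Return True once `path` exists, or False if it has not appeared after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if Path(path).exists():
            return True
        time.sleep(poll_s)
    return Path(path).exists()

# example call with a very short timeout; the file name is a placeholder
found = wait_for_file("geo475_placeholder_that_does_not_exist.csv", timeout_s=0.2, poll_s=0.05)
print(found)
```

In the notebook, the call would use the path of the exported file and a timeout of a few minutes.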

In [None]:
df = pd.read_csv(data_dir/f"{TRAINING_DATA}.csv")
df

### Normalize elevation
In this last post-processing step, we normalize the elevation values to a range between 0 and 1. For this, we get the minimum and maximum elevation within our ROI and apply the [formula for linear scaling](https://developers.google.com/machine-learning/data-prep/transform/normalization#scaling-to-a-range).
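
As a quick illustration of linear scaling, the sketch below applies the formula to a handful of made-up elevations. The function name `min_max_scale` and the values are assumptions for this example.

```python
def min_max_scale(value, lower, upper):
    """Linearly rescale value from the range [lower, upper] to [0, 1]."""
    if upper == lower:
        raise ValueError("lower and upper must differ")
    return (value - lower) / (upper - lower)

# made-up elevations in metres for a ROI spanning 400 m to 1200 m
elevations = [400, 600, 1000, 1200]
scaled = [min_max_scale(z, 400, 1200) for z in elevations]
print(scaled)
```

The lowest elevation maps to 0, the highest to 1, and everything in between keeps its relative position.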

In [None]:
## determine min/max elevation within roi
dem_range = dem.mosaic().reduceRegion(
    geometry=roi.geometry(),
    reducer=ee.Reducer.minMax(),
    scale=dem.first().projection().nominalScale().getInfo(),
    bestEffort=True
).getInfo()


## normalize elevation
df['elevation'] = (df['elevation'] - dem_range.get('DEM_min')) / (dem_range.get('DEM_max') - dem_range.get('DEM_min'))

## drop locations where any attribute is NaN
df = df[~df.isnull().any(axis=1)]

## write results to csv
df.to_csv(data_dir/f"{TRAINING_DATA}_norm.csv", index=False)
df

---
## Optional: Repeating the same steps but for all 913 historic events

### Defining the training locations
We sample the same number of locations as before but now, we not only sample within a single ROI but within the geometry of each and every event in our historic event dataset.
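
Because the sample size is split across all events with an integer division, the total number of points can be lower than `SAMPLE_SIZE`. The short calculation below uses the default sample size from the beginning of the notebook and the event count from the section title.

```python
SAMPLE_SIZE = 3000   # default from the beginning of the notebook
n_events = 913       # number of events stated in the section title

event_sample_size = SAMPLE_SIZE // n_events   # points per event
total_points = event_sample_size * n_events   # points actually sampled
print(event_sample_size, total_points)
```

With these numbers we get 3 points per event and 2739 points in total. Note that a `SAMPLE_SIZE` below the number of events would result in zero points per event.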

In [None]:
## sample the same number of sample points for each flood event
event_sample_size = SAMPLE_SIZE // flood_ds.size().getInfo()

## randomly sample points within each event geometry
def mapper(feat):
  return ee.FeatureCollection.randomPoints(
      region=feat.geometry(),
      points=event_sample_size,
      seed=SEED
  )
flood_sample = flood_ds.map(mapper).flatten()

### Data enrichment
Here, we can repeat exactly the same steps as before. The only difference is that we use a different set of training locations for the point lookups. Actually, it would be elegant to define not only the point lookup but the whole data enrichment pipeline as a single function but I deliberately left that out to keep the code more readable.

In [None]:
## elevation
flood_sample = singleband_point_query(
    fc=flood_sample,
    img=dem.mosaic(),
    scale=dem.first().projection().nominalScale().getInfo(),
    prop='elevation'
)

## slope
flood_sample = singleband_point_query(
    fc=flood_sample,
    img=slope,
    scale=slope.projection().nominalScale().getInfo(),
    prop='slope'
)

## aspect
flood_sample = singleband_point_query(
    fc=flood_sample,
    img=aspect,
    scale=aspect.projection().nominalScale().getInfo(),
    prop='aspect'
)

## land cover
flood_sample = singleband_point_query(
    fc=flood_sample,
    img=landcover_ds.first(),
    scale=landcover_ds.first().projection().nominalScale().getInfo(),
    prop='landcover'
)

## flow accumulation
flood_sample = singleband_point_query(
    fc=flood_sample,
    img=flowaccumulation,
    scale=flowaccumulation.projection().nominalScale().getInfo(),
    prop='upstream_drainage_area'
)

## precipitation
for year in range(2000, 2019):
  daily_max = precipitation_ds.select('precipitation').filter(ee.Filter.date(f"{year}-01-01", f"{year+1}-01-01")).max()
  flood_sample = singleband_point_query(
      fc=flood_sample,
      img=daily_max,
      scale=daily_max.projection().nominalScale().getInfo(),
      prop=f"daily_max_precipitation_-{2018-year}"
  )

## historic flood
flood_sample = singleband_point_query(
    fc=flood_sample,
    img=all_floods,
    scale=flood_ds.select('flooded').first().projection().nominalScale().getInfo(),
    prop='floods'
)

Same as above, we have to explicitly create a task to actually run the computations on GEE. This will take a bit longer than the last one but while you wait, you can have a look at some apps built with GEE:
- https://www.earthengine.app/
- https://philippgaertner.github.io/2020/12/ee-apps-table-searchable/

In [None]:
%%time

## remove output file if it already exists
try:
  (data_dir/'flood_sample.csv').unlink()
except FileNotFoundError:
  pass

## create an export task
task = ee.batch.Export.table.toDrive(
    collection=flood_sample,
    description="data export to Google Drive",
    folder=data_dir.name,
    fileNamePrefix='flood_sample',
    fileFormat='CSV'
    )

## run and monitor export task
task.start()
while task.active():
  sleep(10)
task.status()

The normalization of elevation is a bit trickier than before because we do not want to normalize the full dataset but for each event separately.

First, we want to extract all event geometries as individual polygons. This seems to be a rather hard task in GEE so we use [geopandas](https://geopandas.org/en/stable/) - a core library when working with vector data in Python.

In [None]:
## get all image geometries as multipolygon
multipolygon_fc = ee.FeatureCollection(flood_ds.select('flooded').geometry())

## explode multipolygon to a list of polygons using geopandas
gdf = geemap.ee_to_geopandas(multipolygon_fc)
gdf.set_crs(4326, inplace=True)

## create gee featurecollection of singlepolygons
singlepolygons_fc = geemap.geopandas_to_ee(gdf.explode())

Second, we use the dataset we created in the cell above to get the minimum and maximum elevation within each event geometry.

In [None]:
## define some input variables
dem = elevation_ds.select('DEM')
dem_mosaic = dem.mosaic()
scale = dem.first().projection().nominalScale().getInfo()

## define the mapper
def mapper(img):

  ## get the minimum and maximum elevation within the event geometry
  dem_range = dem_mosaic.reduceRegion(
      geometry=img.geometry(),
      reducer=ee.Reducer.minMax(),
      scale=scale,
      bestEffort=True
  )

  ## create a feature and assign the relevant properties
  feat = ee.Feature(img.geometry())
  feat = feat.set('id', img.get('id'))
  feat = feat.set('system_index', img.get('system:index'))
  feat = feat.set('elevation_min', dem_range.get('DEM_min'))
  feat = feat.set('elevation_max', dem_range.get('DEM_max'))

  return feat

## map the mapper over all images in the historic flood dataset
dem_ranges = ee.FeatureCollection(flood_ds.select('flooded').map(mapper))

And again, we need to actually run the task.

In [None]:
%%time

## remove output file if it already exists
try:
  (data_dir/'dem_ranges.csv').unlink()
except FileNotFoundError:
  pass

## create an export task
task = ee.batch.Export.table.toDrive(
    collection=dem_ranges,
    description="data export to Google Drive",
    folder=data_dir.name,
    fileNamePrefix='dem_ranges',
    fileFormat='CSV'
    )

## run and monitor export task
task.start()
while task.active():
  sleep(10)
task.status()

Finally, we are importing the enriched training data and the dataset of min/max elevation so we can bring it together in order to create our final training dataset.
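
The way the two tables are brought together can be sketched without pandas. The example assumes that the index of each sampled point consists of the event index followed by an underscore and a point number; the helper names, event names and elevations are made up for illustration.

```python
def event_id(point_index):
    """Strip the trailing point number from a flattened index such as 'EVENT_A_12'."""
    return point_index.rsplit("_", 1)[0]

def normalize_per_event(points, ranges):
    """Scale each point's elevation with the min/max of the event it belongs to."""
    result = []
    for point in points:
        low, high = ranges[event_id(point["index"])]
        scaled = (point["elevation"] - low) / (high - low)
        result.append({**point, "elevation_norm": scaled})
    return result

# made-up elevation ranges in metres: one lowland event, one mountain event
ranges = {"EVENT_A": (0.0, 100.0), "EVENT_B": (1000.0, 3000.0)}
points = [
    {"index": "EVENT_A_0", "elevation": 10.0},
    {"index": "EVENT_A_1", "elevation": 90.0},
    {"index": "EVENT_B_0", "elevation": 1200.0},
    {"index": "EVENT_B_1", "elevation": 2800.0},
]
normalized = normalize_per_event(points, ranges)
print([round(p["elevation_norm"], 2) for p in normalized])
```

Points near the bottom of their event's elevation range end up with similar values, no matter how high the event region lies in absolute terms.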

> **Task:** Why do we need to normalize the elevation values?

In [None]:
## import data
df = pd.read_csv(data_dir/'flood_sample.csv')
ele = pd.read_csv(data_dir/'dem_ranges.csv')

## add minimum and maximum elevation to training dataset
df['system:index'] = df['system:index'].apply(lambda x: x.rsplit('_', 1)[0])
df = df.merge(ele[['system:index', 'elevation_min', 'elevation_max']], how='left', on='system:index')

## normalize elevation per event using the merged min/max values
df['elevation'] = (df['elevation'] - df['elevation_min']) / (df['elevation_max'] - df['elevation_min'])

## drop locations where any attribute is NaN
df = df[~df.isnull().any(axis=1)]

## write dataframe to csv
df.to_csv(data_dir/'flood_sample_norm.csv', index=False)
df

### Exploring the training dataset
> **Task:** Run the remaining two cells and have a look at their output. Try to answer the following questions:
1. Is our training dataset biased? Try to think about the spatial and temporal limitations we mentioned when looking at the datasets.
2. How would you rate the distribution of our training data with respect to the *flooded*-attribute? Is it optimal for our purpose?
3. Where would you see room for improvement for the sampling approach? How would you implement your idea if you had to write code? No need to get too detailed but I am interested in your approach.

In [None]:
## print flood counts
for k,v in df['floods'].value_counts().to_dict().items():
  print(f"locations with {int(k)} floods: {v}")

In [None]:
## add sampled locations to the map of historic floods
Map.add_layer(flood_sample, None, 'Training dataset')
Map