<a href="https://colab.research.google.com/github/twaldburger/flood475/blob/master/geo475_flood_prediction_in_gee_master.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Geo475 - Flood prediction in Google Earth Engine

Go through the notebook and run the cells. Try to understand the code and the overall approach.  
Feel free to update code where you see room for improvement and do not hesitate to ask questions.

There are several tasks in the notebook, which are intended to trigger exploration and discussions. Please try to solve them but do not spend too much time on them. The goal is to understand the process and the code - solving all the tasks is secondary. 

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Tasks are marked like this.
</div> 

The notebook without answers to the tasks can be found [here](https://github.com/twaldburger/flood475/blob/master/geo475_flood_prediction_in_gee.ipynb).

---
## Setup

We use the *[Earth Engine](https://developers.google.com/earth-engine/guides/python_install)* and the *[geemap](https://geemap.org/)* Python libraries to interact with GEE. *geemap* was originally created because the official *ee* library had relatively little documentation and limited functionality, while *geemap* lets users analyze and visualize Earth Engine datasets interactively within a Jupyter-based environment.

In this first code cell, we are importing our dependencies, setting some global variables and authenticating with GEE. All dependencies are already pre-installed in Colab.

In [None]:
import ee
import geemap
import geemap.colormaps as cm
import time

## set some parameters, please update with your project id
PROJECT_ID = '' # your GEE project id
SAMPLE_SIZE = 100 # number of training locations per flood event and class
SEED = 3414 # for reproducible results

## connect to GEE
try:
    ee.Initialize()
except ee.EEException:
    ee.Authenticate()
    ee.Initialize(project=PROJECT_ID)

---
## Data exploration and visualization

GEE is built around two fundamental classes to represent raster and vector data: *[ee.Image](https://developers.google.com/earth-engine/apidocs/ee-image)* represents a single raster image while *[ee.Feature](https://developers.google.com/earth-engine/apidocs/ee-feature)* represents a geometry. Multiple images or geometries are represented as *[ee.ImageCollection](https://developers.google.com/earth-engine/apidocs/ee-imagecollection)* or as *[ee.FeatureCollection](https://developers.google.com/earth-engine/apidocs/ee-featurecollection)*.

The main catalogs for GEE data are the official [Earth Engine Data Catalog](https://developers.google.com/earth-engine/datasets) and the community-maintained [awesome-gee-community-catalog](https://gee-community-catalog.org/). The Earth Engine Data Catalog stores over 90 petabytes of data and over 1'000 datasets while the awesome-gee-community-catalog stores about 500 terabytes of data and around 4'000 datasets. A major benefit of these datasets is that they are curated: they have already been pre-processed and ingested and are ready for analysis. All we need to do is select and use them.

In the cell below, we select a few input datasets that we will be using to sample our training data for the model training.

In [None]:
## define the datasets from which to derive the input features
globalFlood = ee.ImageCollection("GLOBAL_FLOOD_DB/MODIS_EVENTS/V1")
dem = ee.ImageCollection('COPERNICUS/DEM/GLO30') \
        .select('DEM')
landcover = ee.ImageCollection('ESA/WorldCover/v200')
hydro = ee.Image('MERIT/Hydro/v1_0_1')
prec = ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY') \
         .select('total_precipitation')
runoffPotential = ee.Image('projects/sat-io/open-datasets/HiHydroSoilv2_0/Hydrologic_Soil_Group_250m') \
                    .remap([1, 2, 3, 4, 14, 24, 34], [1, 2, 3, 4, 1, 3, 4]) \
                    .select('remapped')

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Check out the <a href="https://developers.google.com/earth-engine/datasets" target="_blank">Earth Engine Data Catalog</a> and the <a href="https://gee-community-catalog.org/" target="_blank">awesome-gee-community-catalog</a>.
  <ol style="margin-top: 10px; margin-bottom: 0;">
    <li>How are they structured?</li>
    <li>What information do they provide?</li>
    <li>Can you find the datasets we intend to use?</li>
    <li>What are the spatial and temporal resolution of "our" datasets?</li>
  </ol>
</div>

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
    <strong style="color: #1fbd5bff;">Solution:</strong>
    <ol>
        <li>Google organizes the Earth Engine Data Catalog as a <a href="https://stacspec.org/en">SpatioTemporal Asset
                Catalog (STAC)</a> compliant repository. Datasets are grouped by Provider (e.g., NASA, ESA, USGS) and Theme
            (e.g., Landsat, Sentinel, Climate, Terrain). Datasets are classified into three types: <em>ee.Image</em>,
            <em>ee.ImageCollection</em>, and <em>ee.FeatureCollection</em>.<br>The awesome-gee-community-catalog acts more
            like a curated library of links pointing to various Google Cloud projects where researchers have stored their
            data. It is organized by Categories and Sub-categories (e.g., Hydrology -&gt; Surface Water).</li>
        <li>Both catalogs provide a short description of the dataset, as well as some band/attribute information, license
            and citation info, and some code snippets. The Earth Engine Data Catalog is more strict and structured in the
            metadata provided.</li>
        <li>See table below.</li>
        <li>See table below.</li>
    </ol>
    <br>
    <table>
        <thead>
            <tr>
                <th style="text-align:left">Dataset Name</th>
                <th style="text-align:left">Content</th>
                <th style="text-align:left">Provider</th>
                <th style="text-align:left">Spatial resolution</th>
                <th style="text-align:left">Temporal coverage</th>
                <th style="text-align:left">Spatial coverage</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td style="text-align:left"><a
                        href="https://developers.google.com/earth-engine/datasets/catalog/GLOBAL_FLOOD_DB_MODIS_EVENTS_V1">Global
                        Flood Database v1</a></td>
                <td style="text-align:left">913 selected flood events derived from Terra and Aqua MODIS sensors.</td>
                <td style="text-align:left">Dartmouth Flood Observatory</td>
                <td style="text-align:left">30m</td>
                <td style="text-align:left">2000 – 2018</td>
                <td style="text-align:left">global</td>
            </tr>
            <tr>
                <td style="text-align:left"><a
                        href="https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_DEM_GLO30">Copernicus
                        DEM GLO-30</a></td>
                <td style="text-align:left">Digital Surface Model (DSM) representing the surface of the Earth including
                    buildings, infrastructure and vegetation.</td>
                <td style="text-align:left">Copernicus/ESA</td>
                <td style="text-align:left">30m</td>
                <td style="text-align:left">2011 – 2015</td>
                <td style="text-align:left">global</td>
            </tr>
            <tr>
                <td style="text-align:left"><a
                        href="https://developers.google.com/earth-engine/datasets/catalog/ESA_WorldCover_v200">ESA
                        WorldCover 10m v200</a></td>
                <td style="text-align:left">Land cover map with 11 classes based on Sentinel-1 and Sentinel-2 data.</td>
                <td style="text-align:left">ESA</td>
                <td style="text-align:left">10m</td>
                <td style="text-align:left">2021</td>
                <td style="text-align:left">global</td>
            </tr>
            <tr>
                <td style="text-align:left"><a
                        href="https://developers.google.com/earth-engine/datasets/catalog/MERIT_Hydro_v1_0_1">MERIT
                        Hydro</a></td>
                <td style="text-align:left">Flow direction and hydrography dataset.</td>
                <td style="text-align:left">University of Tokyo</td>
                <td style="text-align:left">~90m</td>
                <td style="text-align:left">1987 – 2017</td>
                <td style="text-align:left">global</td>
            </tr>
            <tr>
                <td style="text-align:left"><a
                        href="https://developers.google.com/earth-engine/datasets/catalog/ECMWF_ERA5_LAND_HOURLY">ERA5-Land
                        Hourly</a></td>
                <td style="text-align:left">Hourly climate reanalysis variables.</td>
                <td style="text-align:left">Copernicus/ECMWF</td>
                <td style="text-align:left">~11km</td>
                <td style="text-align:left">1950 – present</td>
                <td style="text-align:left">global</td>
            </tr>
            <tr>
                <td style="text-align:left"><a href="https://gee-community-catalog.org/projects/isric/">Soil Grids 250m
                        v2.0</a></td>
                <td style="text-align:left">Soil property and class predictions.</td>
                <td style="text-align:left">ISRIC</td>
                <td style="text-align:left">250m</td>
                <td style="text-align:left">Static (v2.0)</td>
                <td style="text-align:left">global</td>
            </tr>
        </tbody>
    </table>

</div>

After checking on our input data in the catalogs, we want to visualize them on a map. *[geemap.Map](https://geemap.org/geemap/#geemap.geemap.Map)* provides a class for interactive mapping of GEE data in a Jupyter Notebook.

In the cell below, we initialize a map with some basemaps and add our historic flood layer. To correctly visualize the flood footprints, we need to ensure that we only include temporary inundation (the flood event) and not the permanent water bodies. To achieve this, we define a function *subtractPermanentWater*, which we then run over every single image in our *ee.ImageCollection* of global flood events.

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Try to understand the logic in the <em>subtractPermanentWater</em> function in the cell below.
  <ol style="margin-top: 10px; margin-bottom: 0;">
    <li>How exactly does it remove the permanent water bodies from the floods? </li>
    <li>Why are we only using classes from the <em>ee</em> library and not some other Python library such as <em>numpy</em> or <em>xarray</em>?</li>
    <li>What do we achieve by calling <code>.sum()</code> in <code>flood = globalFlood.map(subtractPermanentWater).sum()</code> in row 6?</li>
  </ol>
</div>

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
    <strong style="color: #1fbd5bff;">Solution:</strong>
    <ol>
        <li>
            <p>The core logic of that function sits in <code>flood.multiply(perm.eq(0))</code></p>
            <ul>
                <li>
                    <p><code>perm.eq(0)</code> (mask creation):</p>
                    <ul>
                        <li>The <em>jrc_perm_water</em> band (permanent water) has a value of 1 for permanent water and 0
                            for no permanent water.</li>
                        <li>The <code>.eq(0)</code> method performs a pixel-by-pixel comparison:<ul>
                                <li>Where permanent water exists (perm == 1), the result is 0 (False).</li>
                                <li>Where permanent water does not exist (perm == 0), the result is 1 (True).</li>
                            </ul>
                        </li>
                        <li>Result: an intermediate binary mask where 1 marks the land/non-permanent water areas, and 0
                            marks the permanent water areas.</li>
                    </ul>
                </li>
                <li>
                    <p><code>flood.multiply(...)</code> (mask application):</p>
                    <ul>
                        <li>This performs element-wise multiplication between the original flood band (which is 1 for
                            flooded areas) and the binary mask created in step 1.</li>
                        <li>If a pixel is a flood pixel:<ul>
                                <li>Case A (transient flood): if the flood occurred on land (mask is 1), then 1*1=1 (stays
                                    flooded).</li>
                                <li>Case B (permanent water): if the flood occurred on permanent water (mask is 0), then
                                    1*0=0 (becomes non-flooded/masked out).</li>
                            </ul>
                        </li>
                        <li>Result: a new raster where the original flood areas are preserved only if they are outside the
                            permanent water body mask.</li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>By using GEE functionality, we execute all computations on the GEE servers, benefiting from GEE&#39;s resources,
            optimisation and parallelisation. Also, GEE is a lazy system, which means that operations are defined and
            queued, but no computations are executed on the data until a specific action, like displaying, downloading, or
            printing a result, forces the evaluation. In short: GEE only computes what is needed. Because of this approach
            and its optimisation, GEE is much more efficient than our Colab environment.</li>
        <li>The <code>sum()</code> function is a reducer that collapses the entire <em>ee.ImageCollection</em> into a single
            <em>ee.Image</em>. Here, it calculates the total number of times a pixel was flooded after permanent water
            bodies were masked out.</li>
    </ol>
</div>
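
The masking and summing logic described above can also be traced by hand on toy data. The following is a minimal plain-Python sketch, not GEE code: the two "events" and their pixel values are made up for illustration, and each list stands in for one band of a tiny image.

```python
# Toy 1-D "images": each list holds one value per pixel (hypothetical data).
events = [
    {"flooded": [1, 1, 0, 1], "jrc_perm_water": [1, 0, 0, 0]},
    {"flooded": [1, 1, 1, 0], "jrc_perm_water": [1, 0, 0, 0]},
]


def subtract_permanent_water(event):
    """Keep flooded pixels only where there is no permanent water."""
    return [
        flood * (1 if perm == 0 else 0)  # mirrors flood.multiply(perm.eq(0))
        for flood, perm in zip(event["flooded"], event["jrc_perm_water"])
    ]


# Apply the function to every event, analogous to ImageCollection.map(...)
masked = [subtract_permanent_water(event) for event in events]

# Per-pixel sum across all events, analogous to ImageCollection.sum()
flood_count = [sum(pixel_values) for pixel_values in zip(*masked)]
print(flood_count)  # [0, 2, 1, 1]
```

The first pixel is permanent water in both events and ends up with a count of 0, although both events flagged it as flooded. In GEE the same steps run on the server and only when the result is actually requested.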

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Run the cell below and check out the map.
  <ol>
    <li>Try adding a few more of our input datasets to the map as additional layers so you can explore them visually. We are interested in elevation, landcover, upstream drainage area, precipitation and runoff potential. You can find the geemap documentation <a href="https://geemap.org/">here</a>. If you are blocked, you can check my solution in <a href="https://github.com/twaldburger/flood475/blob/master/geo475_flood_prediction_in_gee_master.ipynb">this notebook</a>.</li>
    <li>How are the floods distributed? Are there geographies without any or with only a few floods?</li>
  </ol>
</div>

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
<strong style="color: #1fbd5bff;">Solution:</strong>
  <ol>
    <li>See cell below for code.</li>
    <li>The dataset covers floods across all continents. Most flood events are near coastlines. Data in Africa is sparse.</li>
  </ol>
</div>

In [None]:
## subtract the permanent water bodies from the flooded areas
def subtractPermanentWater(img):
  flood = img.select('flooded')
  perm = img.select('jrc_perm_water')
  return flood.multiply(perm.eq(0))
flood = globalFlood.map(subtractPermanentWater).sum()

## initialize map and add some basemaps
# see https://stackoverflow.com/a/33023651 for Google basemap list
Map = geemap.Map(center=[27, -81], zoom=7, basemap='CartoDB.DarkMatter')
Map.add_basemap('CartoDB.Positron', show=False)
Map.add_tile_layer("https://mt1.google.com/vt/lyrs=m&x={x}&y={y}&z={z}", name="Google.Roadmap", attribution="Google", shown=False)
Map.add_tile_layer("https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}", name="Google.Satellite", attribution="Google", shown=False)

## add historic floods
flood_vis = {'min':0, 'max':10, 'palette':cm.palettes.Blues}
Map.add_layer(flood.selfMask(), flood_vis, 'Historic floods')
Map.add_colorbar(flood_vis, label="Number of floods", layer_name="Historic floods")

## add elevation
elevation_vis = {'min':0, 'max':3000, 'palette':cm.palettes.dem}
Map.addLayer(dem.mosaic(), elevation_vis, 'Elevation', shown=False)
Map.add_colorbar(elevation_vis, label="Elevation [m]", layer_name="Elevation")

## add landcover
landcover_vis = {'bands':['Map']}
Map.addLayer(landcover.first(), landcover_vis, 'Landcover', shown=False)
Map.add_legend(title="Landcover", builtin_legend="ESA_WorldCover", layer_name='Landcover')

## add upstream drainage area
upa_vis = {'min':0, 'max':10, 'palette':cm.palettes.Purples}
Map.addLayer(hydro.select('upa'), upa_vis, 'Upstream drainage area', shown=False)
Map.add_colorbar(upa_vis, label="Upstream drainage area [km^2]", layer_name="Upstream drainage area")

## add precipitation
prec_vis = {'min':0, 'max':5, 'palette':cm.palettes.turbo}
Map.addLayer(prec.filter(ee.Filter.date('2024-10-01', '2024-10-31')).sum(), prec_vis, 'Precipitation', shown=False)
Map.add_colorbar(prec_vis, label="Precipitation [m]", layer_name="Precipitation")

## add runoff potential
ee_class_table = """
Value	Color	Description
1	edf8e9	HSG-A: low runoff potential (>90% sand and <10% clay)
2	bae4b3	HSG-B: moderately low runoff potential (50-90% sand and 10-20% clay)
3	74c476	HSG-C: moderately high runoff potential (<50% sand and 20-40% clay)
4	238b45	HSG-D: high runoff potential (<50% sand and >40% clay)
"""
runoff_legend = geemap.legend_from_ee(ee_class_table)
runoff_vis = {'min':1, 'max':4, 'palette':list(runoff_legend.values())} # classes range from 1 to 4, one palette color per class
Map.addLayer(runoffPotential, runoff_vis, 'Runoff potential', shown=False)
Map.add_legend("Runoff potential", legend_dict=runoff_legend, layer_name="Runoff potential")

## display map
Map

---
## Training dataset

We are now somewhat familiar with the input data that we want to use for model training. However, the inputs are separate raster datasets with different bands and different resolutions.

In the next cell, we define the logic to create a point dataset where every point is enriched with the corresponding pixel values from our input data. For this, we define two helper functions *pointQuery* and *removeProperty*, which we then use in our main function *createSample*.


<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Run the cell below and try to understand the code.
    <ol>
        <li>What is the general logic in <em>createSample</em>?</li>
        <li>Why are we using <em>ee.Image.stratifiedSample</em> in line 70 instead of using the much faster
            <em>ee.Image.sample</em> method?</li>
        <li>Why are we using the complicated GEE methods in lines 65-76 and 90-94 instead of simply using plain Python?</li>
        <li>What does <em>ee.FeatureCollection.map</em> do? Why do we not just write a simple for-loop instead?</li>
        <li>Why do we normalize the elevation values?</li>
    </ol>
</div>

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
<strong style="color: #1fbd5bff;">Solution:</strong>
<ol>
    <li>See comments in the function code.</li>
    <li><em>stratifiedSample</em> samples the same number of locations within each class. Since our flood footprints
        mostly contain non-flooded pixels, using <em>stratifiedSample</em> is a convenient way to avoid oversampling
        non-flooded pixels. However, there might also be a risk of <em>stratifiedSample</em> oversampling certain
        flooded areas if the flood footprint is very small.</li>
    <li>We want to execute all the code in GEE itself so we can benefit from its optimization and parallelisation. If we
        used plain Python, we would need to fetch the data from the GEE server into our Colab runtime, which would make
        the whole process extremely inefficient.</li>
    <li><em>map</em> iterates over a feature collection and applies an algorithm to each feature. This process is run in
        parallel on the GEE server. Using a Python for-loop instead would lose the parallel execution and also create
        all the problems mentioned in the previous answer.</li>
    <li>We use a global flood dataset, but look at the individual events independently. We do not want to create a bias
        towards the general elevation of the region where an event took place and are therefore normalizing the
        elevation within each image (and therefore for each event).</li>
</ol>
</div>
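
To illustrate why stratified sampling helps with imbalanced footprints, here is a minimal plain-Python sketch. It is a simplified stand-in for *ee.Image.stratifiedSample*, not its implementation: the function `stratified_sample` and the pixel counts are hypothetical, and pixels are represented by list indices only.

```python
import random


def stratified_sample(labels, num_points, seed):
    """Draw up to `num_points` pixel indices per class (simplified sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    sample = {}
    for label, indices in by_class.items():
        k = min(num_points, len(indices))  # a class cannot yield more than it has
        sample[label] = rng.sample(indices, k)
    return sample


# Hypothetical footprint: 950 non-flooded pixels (0) and 50 flooded pixels (1)
labels = [0] * 950 + [1] * 50
sample = stratified_sample(labels, num_points=20, seed=3414)
counts = {label: len(indices) for label, indices in sample.items()}
print(counts)  # {0: 20, 1: 20}
```

A simple random sample of 40 pixels from the same footprint would contain about 2 flooded pixels on average, which is why the class-wise draw is worth its extra cost here.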

In [None]:
def pointQuery(fc:ee.FeatureCollection, img:ee.Image, prop:str) -> ee.FeatureCollection:
  """
  Returns pixel values at locations using a feature collection and an image.

  Parameters
  ----------
  fc : ee.FeatureCollection
    Collection of points at which to query the image.
  img : ee.Image
    The image to query.
  prop : str
    Name of new property to hold the query results.

  Returns
  -------
  ee.FeatureCollection
    Input FeatureCollection with lookup values added as new property.
  """
  fc = img.reduceRegions(collection=fc, reducer=ee.Reducer.first())
  return fc.map(lambda feat: feat.set(prop, feat.get('first')))


def removeProperty(fc:ee.FeatureCollection, prop:str) -> ee.FeatureCollection:
  """
  Removes a property by name from a feature collection.

  Parameters
  ----------
  fc : ee.FeatureCollection
    Collection from which to remove the property.
  prop : str
    Property to remove.

  Returns
  -------
  ee.FeatureCollection
    Input collection without the removed property.
  """
  selectProperties = fc.propertyNames().filter(ee.Filter.neq('item', prop))
  return fc.select(selectProperties)


def createSample(img:ee.Image) -> ee.FeatureCollection:
  """
  Samples training dataset for a single flood image.

  Parameters
  ----------
  img : ee.Image
    Input image.

  Returns
  -------
  ee.FeatureCollection
    Sampled and enriched training data.
  """

  ## subtract permanent water bodies from flooded areas
  permanent = img.select('jrc_perm_water')
  water = img.select('flooded')
  flooded = water.subtract(permanent).gt(0)

  ## get the total and maximum precipitation over 14 days prior to the event end date
  end = img.getNumber('system:time_end')
  start = end.subtract(1209600000) # timestamp in milliseconds: 60 * 60 * 24 * 14 * 1000
  precSum = prec.filter(ee.Filter.date(start, end)).sum()
  precMax = prec.filter(ee.Filter.date(start, end)).max()

  ## sample equal number of flooded and non-flooded points
  sample = flooded.stratifiedSample(numPoints=SAMPLE_SIZE, classBand='flooded', geometries=True)

  ## add image id in case we want to join the event metadata later
  sample = sample.map(lambda x: x.set('eventId', img.get('system:index')))

  ## enrich sample by running point lookups on multiple datasets
  sample = pointQuery(sample, dem.mosaic(), 'demElevationAbs')
  sample = pointQuery(sample, ee.Terrain.aspect(dem.mosaic()), 'demAspect')
  sample = pointQuery(sample, ee.Terrain.slope(dem.mosaic()), 'demSlope')
  sample = pointQuery(sample, landcover.first(), 'landcover')
  sample = pointQuery(sample, hydro.select('upa'), 'upa')
  sample = pointQuery(sample, runoffPotential, 'runoffPot')
  sample = pointQuery(sample, precSum, 'precSum')
  sample = pointQuery(sample, precMax, 'precMax')

  ## remove first-property
  sample = sample.map(lambda feat: removeProperty(feat, 'first'))

  ## normalize elevation
  elevationRange = dem.mosaic().reduceRegion(geometry=img.geometry(), reducer=ee.Reducer.minMax())
  elevMin = ee.Number(elevationRange.get('DEM_min'))
  elevMax = ee.Number(elevationRange.get('DEM_max'))
  def normalizeElevation(feat:ee.Feature) -> ee.Feature:
    return feat.set('demElevationNorm', (ee.Number(feat.get('demElevationAbs')).subtract(elevMin)).divide(elevMax.subtract(elevMin)))
  sample = sample.filter(ee.Filter.notNull(ee.List(['demElevationAbs']))).map(normalizeElevation)

  return sample

The last cell defined the functions to process flood event images and extract training samples with enriched features. We now want to execute those functions as a batch job and export the result to a GEE asset.
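
Two numeric details of *createSample* can be checked in plain Python: the 14-day window expressed in milliseconds and the per-event min-max normalization of the elevation. The sketch below uses a hypothetical event end date and made-up elevations. As a simplification, it takes the minimum and maximum from the sample points, whereas the notebook takes them from the DEM over the whole event footprint.

```python
from datetime import datetime, timedelta, timezone

# 14 days in milliseconds, the unit used by 'system:time_end'
WINDOW_MS = int(timedelta(days=14).total_seconds() * 1000)
print(WINDOW_MS)  # 1209600000

# Hypothetical event end date, converted to a millisecond timestamp
end = datetime(2018, 9, 1, tzinfo=timezone.utc)
end_ms = int(end.timestamp() * 1000)
start_ms = end_ms - WINDOW_MS
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
print(start.date())  # 2018-08-18


def normalize(values):
    """Min-max normalization within one event (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(value - lo) / (hi - lo) for value in values]


# Two hypothetical events with the same relief but different base elevation
lowland = normalize([2.0, 5.0, 12.0])
highland = normalize([1502.0, 1505.0, 1512.0])
print(lowland == highland)  # True
```

After normalization, both events describe the same relative terrain position, so the model cannot simply learn that floods happen at a particular absolute elevation.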

The next cell defines the export task.

In [None]:
## create task to enrich sample and store the result as an asset
properties = ['demAspect', 'demElevationAbs', 'demElevationNorm', 'demSlope', 'eventId', 'flooded', 'landcover', 'precMax', 'precSum', 'runoffPot', 'upa', '.geo']
sample = globalFlood.map(createSample).flatten()
sample = sample.filter(ee.Filter.notNull(properties)).distinct(properties)
task = ee.batch.Export.table.toAsset(sample, description='flood475-sampling', assetId=f"projects/{PROJECT_ID}/assets/flood475_sample")

And this cell actually runs it. The task runs on GEE, and there are several ways to monitor it. One way is to poll the task status from our notebook (the commented part in the cell below). However, this blocks our notebook, so we instead choose one of the other ways to monitor progress:
- [GEE Code editor](https://code.earthengine.google.com/)
- [Task Manager](https://code.earthengine.google.com/tasks)
- [Tasks Page in the Cloud Console](https://console.cloud.google.com/earth-engine/tasks?project=ee-timwaldburger-flood475)

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Run the cell below and check if your task is running via one of the different methods.
</div>

In [None]:
## run and monitor the task
task.start()
# while task.active():
#   ts = task.status()
#   if ts['start_timestamp_ms']>0:
#     s = round((ts['update_timestamp_ms']-ts['start_timestamp_ms'])/1000)
#   else:
#     s = round((ts['update_timestamp_ms']-ts['creation_timestamp_ms'])/1000)
#   print(f"task '{ts['description']}' is {ts['state']} for {s} seconds")
#   time.sleep(60)
# task.status()

Let us now import our exported asset and visualize our training data. Depending on your sample size, creating the training data may take 30 minutes or more. You can therefore load the training data I created previously by running the cell below:

In [None]:
## import the pre-created training dataset
sample = ee.FeatureCollection(f"projects/ee-timwaldburger-flood475/assets/flood475_sample")

## import your training dataset
# sample = ee.FeatureCollection(f"projects/{PROJECT_ID}/assets/flood475_sample")

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Add the training locations to the map we created above. If you are blocked, you can check my solution in <a href="https://github.com/twaldburger/flood475/blob/master/geo475_flood_prediction_in_gee_master.ipynb">this notebook</a>.
</div>

In [None]:
visParams = {
    'color': '#FFFFFF',
    'colorOpacity': 1,
    'pointSize': 5,
    'pointShape': 'circle',
    'width': 2,
    'lineType': 'solid',
    'fillColorOpacity': 1,
}
Map.add_styled_vector(sample, column='flooded', palette=['#ffffbf', '#d7191c'], layer_name='Training data', **visParams)
Map

---
## Model training

We have now prepared our training data and are ready to train our model. To keep dependencies as simple as possible, we chose a [random forest](https://www.ibm.com/think/topics/random-forest), which is available within GEE from *[ee.Classifier.smileRandomForest](https://developers.google.com/earth-engine/apidocs/ee-classifier-smilerandomforest)*. We train the model using only 70 % of our sample and reserve the other 30 % for validation.
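
The split relies on a seeded random column: every feature gets a uniform random number, and a threshold decides which subset it belongs to. The following plain-Python sketch shows the idea on hypothetical rows. It does not reproduce the numbers that *randomColumn* generates in GEE, and the helper `split_by_random_column` is made up for illustration.

```python
import random


def split_by_random_column(rows, fraction, seed):
    """Assign each row a uniform random value and split on a threshold."""
    rng = random.Random(seed)
    training, validation = [], []
    for row in rows:
        r = rng.random()  # uniform in [0, 1), like the 'random' column
        if r < fraction:
            training.append(row)
        else:
            validation.append(row)
    return training, validation


rows = list(range(1000))  # hypothetical feature ids
training, validation = split_by_random_column(rows, 0.7, seed=3414)
print(len(training), len(validation))

# The same seed reproduces the same split
repeat_training, _ = split_by_random_column(rows, 0.7, seed=3414)
print(training == repeat_training)  # True
```

Note that the subsets are only approximately 70 % and 30 % of the rows, because each row is assigned independently. The fixed seed is what makes the split reproducible between runs.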

In [None]:
## partition into 70% training and 30% validation samples
sample = sample.randomColumn('random', seed=SEED)
training = sample.filter(ee.Filter.lt('random', 0.7))
validation = sample.filter(ee.Filter.gte('random', 0.7))

## train a random forest
properties = ['demAspect', 'demElevationNorm', 'demSlope', 'landcover', 'precMax', 'precSum', 'runoffPot', 'upa']
randomForest = ee.Classifier.smileRandomForest(10).setOutputMode('CLASSIFICATION')
classifier = randomForest.train(
    features=training,
    classProperty='flooded',
    inputProperties=properties
)

The cell below calculates and prints some accuracy metrics on the training and the validation dataset:

In [None]:
# accuracy on training set
trainConfusionMatrix = classifier.confusionMatrix()
trainFscores = trainConfusionMatrix.fscore().getInfo()
print(f"train accuracy: {trainConfusionMatrix.accuracy().getInfo():.3f}")
print(f"train f-score non-flooded: {trainFscores[0]:.3f}")
print(f"train f-score flooded: {trainFscores[1]:.3f}")

# accuracy on validation set
testConfusionMatrix = validation.classify(classifier).errorMatrix('flooded', 'classification')
testFscores = testConfusionMatrix.fscore().getInfo()
print(f"validation accuracy: {testConfusionMatrix.accuracy().getInfo():.3f}")
print(f"validation f-score non-flooded: {testFscores[0]:.3f}")
print(f"validation f-score flooded: {testFscores[1]:.3f}")

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Check the metrics returned by the cell above.
    <ol>
        <li>What do accuracy and F1-Score describe?</li>
        <li>Do you think the model performs well given those results?</li>
    </ol>
</div>
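
To make the two metrics concrete, here is a minimal plain-Python sketch that derives accuracy and per-class F1-scores from a confusion matrix. The matrix values are hypothetical and are not results from this notebook. The sketch assumes that rows hold the actual classes and columns the predicted classes.

```python
def metrics(matrix):
    """Return accuracy and per-class F1 for matrix[actual][predicted]."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(n_classes))
    accuracy = correct / total

    f_scores = []
    for c in range(n_classes):
        true_positives = matrix[c][c]
        predicted_as_c = sum(row[c] for row in matrix)  # column sum
        actually_c = sum(matrix[c])                     # row sum
        precision = true_positives / predicted_as_c
        recall = true_positives / actually_c
        f_scores.append(2 * precision * recall / (precision + recall))
    return accuracy, f_scores


# Hypothetical matrix: class 0 = non-flooded, class 1 = flooded
matrix = [
    [80, 20],  # actual non-flooded: 80 correct, 20 predicted as flooded
    [10, 90],  # actual flooded: 10 predicted as non-flooded, 90 correct
]
accuracy, f_scores = metrics(matrix)
print(f"accuracy: {accuracy:.3f}")
print(f"f-score non-flooded: {f_scores[0]:.3f}")
print(f"f-score flooded: {f_scores[1]:.3f}")
```

Accuracy summarizes both classes in one number, while the per-class F1-scores show whether errors concentrate in one class.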

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
<strong style="color: #1fbd5bff;">Solution:</strong>
  <ol>
      <li>Accuracy: How often is the model right? An accuracy of 0.9 is generally considered good in many machine learning
          contexts. It means that your model makes the correct prediction 90% of the time.</li>
      <li>F1-Score: Metric that combines precision and recall into a single value. It provides a balanced measure of a
          model&#39;s performance, considering both the accuracy of positive predictions (precision) and the ability to
          identify all positive instances (recall). An F1-score of 0.8 is generally considered good. It indicates that
          your model is performing well in terms of both precision and recall. <ul>
              <li>Recall: How many of the actual positive instances does the model identify?</li>
              <li>Precision: How often are positive predictions correct?</li>
          </ul>
      </li>
      <li>
          <table>
              <thead>
                  <tr>
                      <th style="text-align:left">Metric</th>
                      <th style="text-align:left">Training Score</th>
                      <th style="text-align:left">Validation Score</th>
                      <th style="text-align:left">Difference</th>
                      <th style="text-align:left">Key Observation</th>
                  </tr>
              </thead>
              <tbody>
                  <tr>
                      <td style="text-align:left"><strong>Accuracy</strong></td>
                      <td style="text-align:left">0.923 (92.3%)</td>
                      <td style="text-align:left">0.820 (82.0%)</td>
                      <td style="text-align:left"><strong>10.3%</strong></td>
                      <td style="text-align:left">Significant gap (Overfitting)</td>
                  </tr>
                  <tr>
                      <td style="text-align:left"><strong>F-score (Non-Flooded)</strong></td>
                      <td style="text-align:left">0.934</td>
                      <td style="text-align:left">0.850</td>
                      <td style="text-align:left">8.4%</td>
                      <td style="text-align:left">Good on majority class</td>
                  </tr>
                  <tr>
                      <td style="text-align:left"><strong>F-score (Flooded)</strong></td>
                      <td style="text-align:left">0.907</td>
                      <td style="text-align:left">0.775</td>
                      <td style="text-align:left"><strong>13.2%</strong></td>
                      <td style="text-align:left">Poor on minority class</td>
                  </tr>
              </tbody>
          </table>
          <ul>
              <li>Overfitting (main issue)<ul>
                      <li>The large drop in accuracy from 92.3% on the training data to 82.0% on the validation data is the key
                          takeaway.<ul>
                              <li>What it means: the random forest is not learning the general patterns of flood vs. non-flood
                                  events but instead, it is memorizing the noise and specific details of the training dataset.
                                  When presented with new, unseen data (the validation set), its performance suffers
                                  significantly.</li>
                              <li>Why it happens: random forests can overfit when the individual trees are allowed to grow
                                  very deep or when the training dataset is small and lacks diversity. Adding more trees
                                  does not by itself cause overfitting.</li>
                          </ul>
                      </li>
                  </ul>
              </li>
              <li>Class imbalance<ul>
                      <li>The F-scores highlight how the model performs on the two different classes:<ul>
                              <li>F-score for non-flooded (majority class): the score is higher (0.850 validation) and the gap is
                                  smaller (8.4%). This means the model is relatively reliable at identifying non-flooded areas.
                              </li>
                              <li>F-score for flooded (minority class): the score is significantly lower (0.775 validation) and
                                  the gap is the largest (13.2%).<ul>
                                      <li>What it means: the model struggles most to correctly identify actual flood events on
                                          new data. This is where we see the biggest generalization failure.</li>
                                      <li>A lower F-score on the minority class is a common sign of class imbalance, where the
                                          model focuses on optimizing for the more frequent (non-flooded) outcomes.</li>
                                  </ul>
                              </li>
                          </ul>
                      </li>
                  </ul>
              </li>
          </ul>
      </li>
  </ol>

</div>
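
To make the relationship between precision, recall and the F1-score more concrete, here is a minimal sketch in plain Python. The helper name `f1_score` and the precision/recall pairs are made up for illustration; they are not results from our model.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# hypothetical precision/recall pairs, not results from our model
balanced = f1_score(precision=0.8, recall=0.8)
lopsided = f1_score(precision=0.95, recall=0.30)
print(f"balanced: {balanced:.3f}, lopsided: {lopsided:.3f}")
```

The arithmetic mean of the second pair would be 0.625, but the harmonic mean is much lower. A high F1-score therefore requires precision and recall to be high at the same time.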

Let us try once more by setting some parameters to address the overfitting issue. We will not worry too much about class imbalance since we have used stratified sampling to ensure the same number of sample locations per class.

In [None]:
## train another random forest with parameters to control overfitting
properties = ['demAspect', 'demElevationNorm', 'demSlope', 'landcover', 'precMax', 'precSum', 'runoffPot', 'upa']
randomForest = ee.Classifier.smileRandomForest(
    numberOfTrees=50,             # sufficient ensemble size
    maxNodes=32,                  # controls tree complexity (prevents deep memorization)
    minLeafPopulation=5,          # ensures leaf nodes have enough samples (improves robustness)
    bagFraction=0.63,             # recommended fraction for sampling
).setOutputMode('CLASSIFICATION')
classifier = randomForest.train(
    features=training,
    classProperty='flooded',
    inputProperties=properties
)

# accuracy on training set
trainConfusionMatrix = classifier.confusionMatrix()
trainFscores = trainConfusionMatrix.fscore().getInfo()
print(f"train accuracy: {trainConfusionMatrix.accuracy().getInfo():.3f}")
print(f"train f-score non-flooded: {trainFscores[0]:.3f}")
print(f"train f-score flooded: {trainFscores[1]:.3f}")

# accuracy on validation set
testConfusionMatrix = validation.classify(classifier).errorMatrix('flooded', 'classification')
testFscores = testConfusionMatrix.fscore().getInfo()
print(f"validation accuracy: {testConfusionMatrix.accuracy().getInfo():.3f}")
print(f"validation f-score non-flooded: {testFscores[0]:.3f}")
print(f"validation f-score flooded: {testFscores[1]:.3f}")
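
The cell above reads the accuracy and the per-class F-scores from a confusion matrix. The following plain-Python sketch shows the arithmetic behind such a matrix for a handful of made-up labels. It is an illustration under simple assumptions, not Earth Engine's implementation, and the helper names `error_matrix`, `accuracy` and `f_scores` are hypothetical.

```python
def error_matrix(actual: list[int], predicted: list[int], n_classes: int = 2) -> list[list[int]]:
    """Count label pairs; rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix: list[list[int]]) -> float:
    """Share of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def f_scores(matrix: list[list[int]]) -> list[float]:
    """Per-class F1-score, one value per class index."""
    scores = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                 # actual c, predicted as another class
        fp = sum(row[c] for row in matrix) - tp  # predicted c, actually another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# made-up labels: 0 = non-flooded, 1 = flooded
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 0, 0]
m = error_matrix(actual, predicted)
print(m, accuracy(m), f_scores(m))
```

As in the cell above, index 0 of the F-score list refers to the non-flooded class and index 1 to the flooded class.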

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Check the metrics returned by the new model.
    <ol>
        <li>Did we improve our model?</li>
    </ol>
</div>

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
<strong style="color: #1fbd5bff;">Solution:</strong>
  <table>
      <thead>
          <tr>
              <th style="text-align:left">Metric</th>
              <th style="text-align:left">Training Score</th>
              <th style="text-align:left">Validation Score</th>
              <th style="text-align:left">Difference</th>
              <th style="text-align:left">Key Observation</th>
          </tr>
      </thead>
      <tbody>
          <tr>
              <td style="text-align:left"><strong>Accuracy</strong></td>
              <td style="text-align:left">0.660 (66.0%)</td>
              <td style="text-align:left">0.658 (65.8%)</td>
              <td style="text-align:left"><strong>0.2%</strong></td>
              <td style="text-align:left">Excellent Generalization! (Overfitting Solved)</td>
          </tr>
          <tr>
              <td style="text-align:left"><strong>F-score (Non-Flooded)</strong></td>
              <td style="text-align:left">0.752</td>
              <td style="text-align:left">0.751</td>
              <td style="text-align:left">0.1%</td>
              <td style="text-align:left">Generalizes well on majority class</td>
          </tr>
          <tr>
              <td style="text-align:left"><strong>F-score (Flooded)</strong></td>
              <td style="text-align:left">0.459</td>
              <td style="text-align:left">0.455</td>
              <td style="text-align:left">0.4%</td>
              <td style="text-align:left"><strong>Extremely Low</strong> Predictive Power</td>
          </tr>
      </tbody>
  </table>
  <ul>
      <li>Good news: overfitting is solved!<ul>
              <li>The difference between our training and validation accuracy has collapsed from 10.3% down to just 0.2%
                  (0.660 vs. 0.658). Similarly, the F-scores for both classes are almost identical between training and
                  validation.</li>
              <li>This confirms that our model is now learning general patterns and is robust when applied to unseen data.
                  The choice of parameters like lower <em>maxNodes</em> and higher <em>minLeafPopulation</em> successfully
                  prevented the trees from memorizing the noise.</li>
          </ul>
      </li>
      <li>The bad news: low predictive power<ul>
              <li>Overall accuracy (0.658): validation accuracy dropped from 82.0% for the first model to 65.8%. That is
                  only modestly better than random guessing (50% for two equally sized classes) and certainly not
                  sufficient for reliable flood prediction.</li>
              <li>An F-score of 0.455 means the model is failing to correctly and reliably predict flood events. This low
                  score suggests our model has a very poor balance of precision (avoiding false alarms) and recall
                  (catching all actual floods) for the minority class.</li>
          </ul>
      </li>
      <li>What next?<ul>
              <li>Given that the model is stable but poorly accurate, the problem is most likely not in the model&#39;s
                  complexity, but in the quality and discriminatory power of our input features. Our next step would
                  therefore be to improve our training data by revisiting the input datasets, analyzing their predictive
                  value, removing unnecessary ones and adding new ones. We will not do that here, but proceed with
                  prediction.</li>
          </ul>
      </li>
  </ul>
</div>
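
To answer the question with numbers, we can put the scores of both models side by side. The sketch below only uses the values reported in the two solution tables; the helper name `gaps` is made up for illustration.

```python
# scores as reported in the two solution tables: (training, validation)
first_model = {
    "accuracy": (0.923, 0.820),
    "f-score non-flooded": (0.934, 0.850),
    "f-score flooded": (0.907, 0.775),
}
second_model = {
    "accuracy": (0.660, 0.658),
    "f-score non-flooded": (0.752, 0.751),
    "f-score flooded": (0.459, 0.455),
}


def gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Training minus validation score per metric, rounded to 3 decimals."""
    return {name: round(train - val, 3) for name, (train, val) in scores.items()}


first_gaps, second_gaps = gaps(first_model), gaps(second_model)
for name in first_model:
    v1, v2 = first_model[name][1], second_model[name][1]
    print(f"{name}: gap {first_gaps[name]:.3f} -> {second_gaps[name]:.3f}, "
          f"validation {v1:.3f} -> {v2:.3f}")
```

The gap shrinks for every metric, but every validation score is also lower than for the first model.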

---
## Prediction

We know that our model is flawed, but let's still go ahead and run some predictions.

The cell below defines the *predict* function, which takes a region of interest and a classifier (our random forest) as input.

<div style="padding: 15px; border-left: 6px solid #f38a21ff;">
  <strong style="color: #f38a21ff;">Task:</strong> Try to understand the code in the cell below.
    <ol>
        <li>What does the <em>predict</em> function do exactly?</li>
    </ol>
</div>

<div style="padding: 15px; border-left: 6px solid #1fbd5bff;">
<strong style="color: #1fbd5bff;">Solution:</strong>
  <ol>
    <li>The <em>predict</em> function creates a composite image by combining all the datasets we used for model training. It normalizes the elevation and aggregates precipitation. It then applies the trained model to every pixel of the composite and returns the predicted probability of the flooded class.</li>
  </ol>
</div>
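
The elevation normalization can be sketched in plain Python, assuming it is a linear rescaling of the region's elevation range to [0, 1]. The helper name `unit_scale` and the elevations are made up for illustration.

```python
def unit_scale(value: float, low: float, high: float) -> float:
    """Map [low, high] linearly onto [0, 1]; values outside are not clamped."""
    return (value - low) / (high - low)


# hypothetical elevations in metres for a region spanning 400 m to 2400 m
for elevation in (400.0, 900.0, 2400.0):
    print(elevation, unit_scale(elevation, low=400.0, high=2400.0))
```

Because the minimum and maximum are computed for each region of interest, the same absolute elevation can map to different normalized values in different regions.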

In [None]:
def predict(roi: ee.Geometry, classifier: ee.Classifier) -> ee.Image:
  """
  Predict flood probability for a given region of interest.

  Parameters
  ----------
  roi : ee.Geometry
    Region of interest.
  classifier : ee.Classifier
    Trained classifier.

  Returns
  -------
  ee.Image
    Flood probabilities.
  """

  ## normalize elevation to the elevation range within the region of interest
  elevationRange = dem.mosaic().reduceRegion(geometry=roi, reducer=ee.Reducer.minMax(), scale=30, bestEffort=True)
  elevMin = ee.Number(elevationRange.get('DEM_min'))  # avoid shadowing Python's built-in min/max
  elevMax = ee.Number(elevationRange.get('DEM_max'))

  ## calculate start and end date for precipitation data aggregation (14-day window)
  end = ee.Date('2024-10-31T00:00:00').millis()
  start = end.subtract(14 * 24 * 60 * 60 * 1000) # 14 days in milliseconds

  ## create composite of all relevant datasets
  composite = ee.Image.cat(
      ee.Terrain.aspect(dem.mosaic()),
      dem.mosaic().unitScale(elevMin, elevMax),
      ee.Terrain.slope(dem.mosaic()),
      landcover.first(),
      prec.filter(ee.Filter.date(start, end)).max(),
      prec.filter(ee.Filter.date(start, end)).sum(),
      runoffPotential,
      hydro.select('upa')
  ).rename(properties)

  ## classify the composite
  classifier = classifier.setOutputMode('MULTIPROBABILITY')
  classified = composite.classify(classifier).clip(roi)
  probabilities = classified.arrayFlatten([['non-flooded', 'flooded']])
  probability = probabilities.select('flooded')

  return probability
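
The precipitation window in *predict* spans 14 days and is expressed in milliseconds. The following check only uses the Python standard library; it confirms the length of that window and shows the resulting start date. It assumes the end date is interpreted as UTC.

```python
from datetime import datetime, timedelta, timezone

# end of the aggregation window, as in the predict function (assumed to be UTC)
end = datetime(2024, 10, 31, tzinfo=timezone.utc)
window = timedelta(days=14)
start = end - window

window_ms = int(window.total_seconds() * 1000)
print(window_ms)          # length of the window in milliseconds
print(start.isoformat())  # first instant of the window
```

Changing `days=14` is a quick way to work out the millisecond value for a different aggregation period.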

Finally, we are ready to make predictions. For this, we call our *predict* function for a given region of interest using our random forest.

<div style="padding: 15px; border-left: 6px solid #e12525ff;">
  <strong style="color: #e12525ff;">Warning:</strong> The cell below uses the current bounds of the map we used to explore the data as the extent for the prediction. Make sure to zoom in to a relatively small area to avoid long-running computations.
</div>

In [None]:
## run prediction for current map bounds
probability = predict(roi=ee.Geometry.BBox(*Map.getBounds()), classifier=classifier)

## update mask to exclude permanent water
permanentWater = globalFlood.select('jrc_perm_water').mosaic()
probability = probability.updateMask(permanentWater.neq(1))

## update the map
probability_vis = {'min':0, 'max':1, 'palette':cm.palettes.cividis}
Map.addLayer(probability.selfMask(), probability_vis, 'Flooded probability')
Map.add_colorbar(probability_vis, label="Flooded probability", layer_name="Flooded probability", font_size=9)
Map

---
## Feedback

Taking two minutes to give feedback would help me improve the exercise.
Please run the cell below to display a Google Form where you can provide your feedback. I will not collect your email address, so the feedback is anonymous.

In [None]:
%%html
<iframe src="https://docs.google.com/forms/d/e/1FAIpQLScg8j6ORkqgWw4QEHpkeOy2PxYKSdgop3PPvaA1_WT54igFIA/viewform?embedded=true" width="640" height="1304" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>

---
## Bonus: run country-scale prediction
The code below runs the trained model for all of Switzerland at 30 m resolution and exports the result to GEE. It runs in about 20 minutes. No need to run in the lab, but feel free to try it out and play around with different models, geographies and settings.

In [None]:
## get country shape
countries = ee.FeatureCollection('WM/geoLab/geoBoundaries/600/ADM0')
ch = countries.filter(ee.Filter.eq('shapeName', 'Switzerland'))  # filterMetadata is deprecated

## run prediction for Switzerland
flooded_prob = predict(roi=ch.geometry(), classifier=classifier)

## update the map
flooded_prob_vis = {'min':0, 'max':1, 'palette':cm.palettes.cividis}
Map.addLayer(flooded_prob.selfMask(), flooded_prob_vis, 'CH flooded probability')
Map.add_colorbar(flooded_prob_vis, label="CH flooded probability", layer_name="CH flooded probability", font_size=9)
Map

In [None]:
## create export task
task = ee.batch.Export.image.toAsset(
  flooded_prob,
  description='flood475-ch-prediction',
  assetId=f"projects/{PROJECT_ID}/assets/flood475_ch_prediction",
  scale=30,
  maxPixels=500_000_000
)

## run and monitor the task
task.start()
while task.active():
  ts = task.status()
  if ts['start_timestamp_ms']>0:
    s = round((ts['update_timestamp_ms']-ts['start_timestamp_ms'])/1000)
  else:
    s = round((ts['update_timestamp_ms']-ts['creation_timestamp_ms'])/1000)
  print(f"task '{ts['description']}' is {ts['state']} for {s} seconds")
  time.sleep(60)
task.status()
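
Before starting a large export it can help to estimate how many pixels are involved. The sketch below is a rough approximation that treats the bounding box as a rectangle on a sphere. The function name `estimate_pixels` and the bounding box coordinates of Switzerland are assumptions for illustration, not values read from the dataset.

```python
import math


def estimate_pixels(west: float, south: float, east: float, north: float,
                    scale_m: float = 30.0) -> int:
    """Rough pixel count of a lon/lat bounding box at a given pixel size in metres."""
    km_per_degree = 111.32  # approximate length of one degree of latitude
    mid_lat = math.radians((south + north) / 2)
    height_km = (north - south) * km_per_degree
    width_km = (east - west) * km_per_degree * math.cos(mid_lat)
    area_m2 = height_km * width_km * 1e6
    return round(area_m2 / scale_m**2)


# approximate bounding box of Switzerland (assumed values)
ch_pixels = estimate_pixels(west=5.96, south=45.82, east=10.49, north=47.81)
print(f"{ch_pixels:,} pixels")
print(ch_pixels < 500_000_000)  # compare with maxPixels used in the export
```

The same function could be applied to the list returned by `Map.getBounds()`, which the notebook passes to `ee.Geometry.BBox` in the order west, south, east, north, to get a feeling for the size of an interactive prediction.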