<a href="https://colab.research.google.com/github/twaldburger/flood475/blob/master/geo475_flood_prediction_in_gee.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Geo475 - Flood prediction in Google Earth Engine
Go through the notebook and run the cells. Try to understand the code and the overall approach.  
Feel free to update code where you see room for improvement.  

There are several tasks in the notebook. Please try to solve them but don't spend too much time on them. The goal is to understand the process and the code - solving all the tasks is secondary.  

A notebook with example solutions (there are many valid ways to solve the tasks) can be found [here](https://github.com/twaldburger/flood475/blob/master/geo475_flood_prediction_in_gee_master.ipynb).




---
## Setup

In [None]:
import ee
import geemap
import geemap.colormaps as cm
import time

## set some parameters, please update with your project id
PROJECT_ID = '' # your GEE project id
SAMPLE_SIZE = 100 # number of training locations per flood event and class
SEED = 3414 # for reproducible results

## connect to GEE, authenticate first if no valid credentials are found
try:
    ee.Initialize(project=PROJECT_ID)
except Exception:
    ee.Authenticate()
    ee.Initialize(project=PROJECT_ID)

---
## Data exploration and visualization

In [None]:
## define the datasets from which to derive the input features
globalFlood = ee.ImageCollection("GLOBAL_FLOOD_DB/MODIS_EVENTS/V1")
dem = ee.ImageCollection('COPERNICUS/DEM/GLO30') \
        .select('DEM')
landcover = ee.ImageCollection('ESA/WorldCover/v200')
hydro = ee.Image('MERIT/Hydro/v1_0_1')
prec = ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY') \
         .select('total_precipitation')
runoffPotential = ee.Image('projects/sat-io/open-datasets/HiHydroSoilv2_0/Hydrologic_Soil_Group_250m') \
                    .remap([1, 2, 3, 4, 14, 24, 34], [1, 2, 3, 4, 1, 3, 4]) \
                    .select('remapped')

> **Task:** Get an overview of the datasets we use. What data do they provide? Who produced them? What is their spatial and temporal resolution? A good starting point is the [Earth Engine Data Catalog](https://developers.google.com/earth-engine/datasets).

In [None]:
## subtract the permanent water bodies from the flooded areas
def subtractPermanentWater(img):
  flood = img.select('flooded')
  perm = img.select('jrc_perm_water')
  return flood.multiply(perm.eq(0))
flood = globalFlood.map(subtractPermanentWater).sum()

## initialize map and add some basemaps
# see https://stackoverflow.com/a/33023651 for Google basemap list
Map = geemap.Map(center=[27, -81], zoom=7, basemap='CartoDB.DarkMatter')
Map.add_basemap('CartoDB.Positron', show=False)
Map.add_tile_layer("https://mt1.google.com/vt/lyrs=m&x={x}&y={y}&z={z}", name="Google.Roadmap", attribution="Google", shown=False)
Map.add_tile_layer("https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}", name="Google.Satellite", attribution="Google", shown=False)

## add historic floods
flood_vis = {'min':0, 'max':10, 'palette':cm.palettes.Blues}
Map.add_layer(flood.selfMask(), flood_vis, 'Historic floods')
Map.add_colorbar(flood_vis, label="Number of floods", layer_name="Historic floods")

## display map
Map

> **Task:** Add the input datasets to the map so you can explore them visually. We are interested in elevation, landcover, upstream drainage area, precipitation and runoff potential. You can find the geemap documentation [here](https://geemap.org/).

---
## Training dataset



In [None]:
def pointQuery(fc:ee.FeatureCollection, img:ee.Image, prop:str) -> ee.FeatureCollection:
  """
  Returns pixel values at locations using a feature collection and an image.

  Parameters
  ----------
  fc : ee.FeatureCollection
    Collection of points at which to query the image.
  img : ee.Image
    The image to query.
  prop : str
    Name of new property to hold the query results.

  Returns
  -------
  ee.FeatureCollection
    Input FeatureCollection with lookup values added as new property.
  """
  fc = img.reduceRegions(collection=fc, reducer=ee.Reducer.first())
  return fc.map(lambda feat: feat.set(prop, feat.get('first')))


def removeProperty(feat:ee.Feature, prop:str) -> ee.Feature:
  """
  Removes a property by name from a feature.

  Parameters
  ----------
  feat : ee.Feature
    Feature from which to remove the property.
  prop : str
    Property to remove.

  Returns
  -------
  ee.Feature
    Input feature without the removed property.
  """
  selectProperties = feat.propertyNames().filter(ee.Filter.neq('item', prop))
  return feat.select(selectProperties)


def createSample(img:ee.Image) -> ee.FeatureCollection:
  """
  Samples training dataset for a single flood image.

  Parameters
  ----------
  img : ee.Image
    Input image.

  Returns
  -------
  ee.FeatureCollection
    Sampled and enriched training data.
  """

  ## subtract permanent water bodies from flooded areas
  permanent = img.select('jrc_perm_water')
  water = img.select('flooded')
  flooded = water.subtract(permanent).gt(0)

  ## get the total and maximum precipitation over 14 days prior to the event end date
  end = img.getNumber('system:time_end')
  start = end.subtract(1209600000) # 14 days in milliseconds: 60 * 60 * 24 * 14 * 1000
  precSum = prec.filter(ee.Filter.date(start, end)).sum()
  precMax = prec.filter(ee.Filter.date(start, end)).max()

  ## sample equal number of flooded and non-flooded points
  sample = flooded.stratifiedSample(numPoints=SAMPLE_SIZE, classBand='flooded', geometries=True)

  ## add image id in case we want to join the event metadata later
  sample = sample.map(lambda x: x.set('eventId', img.get('system:index')))

  ## enrich sample by running point lookups on multiple datasets
  sample = pointQuery(sample, dem.mosaic(), 'demElevationAbs')
  sample = pointQuery(sample, ee.Terrain.aspect(dem.mosaic()), 'demAspect')
  sample = pointQuery(sample, ee.Terrain.slope(dem.mosaic()), 'demSlope')
  sample = pointQuery(sample, landcover.first(), 'landcover')
  sample = pointQuery(sample, hydro.select('upa'), 'upa')
  sample = pointQuery(sample, runoffPotential, 'runoffPot')
  sample = pointQuery(sample, precSum, 'precSum')
  sample = pointQuery(sample, precMax, 'precMax')

  ## remove first-property
  sample = sample.map(lambda feat: removeProperty(feat, 'first'))

  ## normalize elevation
  elevationRange = dem.mosaic().reduceRegion(geometry=img.geometry(), reducer=ee.Reducer.minMax())
  elevMin = ee.Number(elevationRange.get('DEM_min'))
  elevMax = ee.Number(elevationRange.get('DEM_max'))
  def normalizeElevation(feat:ee.Feature) -> ee.Feature:
    return feat.set('demElevationNorm', (ee.Number(feat.get('demElevationAbs')).subtract(elevMin)).divide(elevMax.subtract(elevMin)))
  sample = sample.filter(ee.Filter.notNull(ee.List(['demElevationAbs']))).map(normalizeElevation)

  return sample

> **Task:** Try to understand the code in the cell above. Why are we using *ee.Image.stratifiedSample* in line 70 instead of the much faster *ee.Image.sample* method? Why are we using the complicated GEE methods in lines 65-76 and 90-94 instead of plain Python? What does *ee.FeatureCollection.map* do? Why do we not just write a simple for-loop instead? Why do we normalize the elevation values?
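
As a hint for the last question, here is a minimal plain-Python sketch of the min-max formula that `normalizeElevation` applies on the server side. The two events and their elevation values are made up for illustration and do not come from the flood database.

```python
def normalize_elevation(elevation, elev_min, elev_max):
    """Min-max scale an elevation to [0, 1] relative to the range of its region."""
    return (elevation - elev_min) / (elev_max - elev_min)

# Two hypothetical flood events (illustrative values only):
# a lowland event spanning 0-50 m and an alpine event spanning 400-900 m.
lowland = normalize_elevation(5.0, elev_min=0.0, elev_max=50.0)
alpine = normalize_elevation(450.0, elev_min=400.0, elev_max=900.0)

# Both points lie in the lowest tenth of their region, although their
# absolute elevations differ by several hundred metres.
print(lowland, alpine)
```

In this sketch both points receive the same normalized value, which suggests why a relative elevation may transfer between regions better than an absolute one.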

In [None]:
## create task to enrich sample and store the result as an asset
properties = ['demAspect', 'demElevationAbs', 'demElevationNorm', 'demSlope', 'eventId', 'flooded', 'landcover', 'precMax', 'precSum', 'runoffPot', 'upa', '.geo']
sample = globalFlood.map(createSample).flatten()
sample = sample.filter(ee.Filter.notNull(properties)).distinct(properties)
task = ee.batch.Export.table.toAsset(sample, description='flood475-sampling', assetId=f"projects/{PROJECT_ID}/assets/flood475_sample")

In [None]:
## run and monitor the task
task.start()
while task.active():
  ts = task.status()
  if ts['start_timestamp_ms']>0:
    s = round((ts['update_timestamp_ms']-ts['start_timestamp_ms'])/1000)
  else:
    s = round((ts['update_timestamp_ms']-ts['creation_timestamp_ms'])/1000)
  print(f"task '{ts['description']}' is {ts['state']} for {s} seconds")
  time.sleep(60)
task.status()

> **Task:** There are multiple places (outside of Google Colab) where we can also monitor our tasks. Can you find them?  

> **Task:** Add the sampled data points to your map from above.

---
## Model training

In [None]:
## import the training dataset
sample = ee.FeatureCollection(f"projects/{PROJECT_ID}/assets/flood475_sample")
# sample = ee.FeatureCollection(f"projects/ee-timwaldburger-flood475/assets/flood475_sample")

## partition into 70% training and 30% validation samples
sample = sample.randomColumn('random', seed=SEED)
training = sample.filter(ee.Filter.lt('random', 0.7))
validation = sample.filter(ee.Filter.gte('random', 0.7))

## train a random forest
properties = ['demAspect', 'demElevationNorm', 'demSlope', 'landcover', 'precMax', 'precSum', 'runoffPot', 'upa']
randomForest = ee.Classifier.smileRandomForest(10).setOutputMode('CLASSIFICATION')
classifier = randomForest.train(
    features=training,
    classProperty='flooded',
    inputProperties=properties
)

In [None]:
# accuracy on training set
trainConfusionMatrix = classifier.confusionMatrix()
trainFscores = trainConfusionMatrix.fscore().getInfo()
print(f"train accuracy: {trainConfusionMatrix.accuracy().getInfo()}")
print(f"train f-score non-flooded: {trainFscores[0]}")
print(f"train f-score flooded: {trainFscores[1]}")

# accuracy on test set
testConfusionMatrix = validation.classify(classifier).errorMatrix('flooded', 'classification')
testFscores = testConfusionMatrix.fscore().getInfo()
print(f"test accuracy: {testConfusionMatrix.accuracy().getInfo()}")
print(f"test f-score non-flooded: {testFscores[0]}")
print(f"test f-score flooded: {testFscores[1]}")

> **Task:** What do accuracy and F1-Score describe? Do you think your model performs well given those results?
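
To make the two metrics concrete, the following plain-Python sketch computes them from a 2x2 confusion matrix. The counts are invented for illustration, and the layout (rows are actual classes, columns are predicted classes) is an assumption of this sketch. The helper names `accuracy` and `f1_scores` are hypothetical and are not part of the Earth Engine API.

```python
def accuracy(matrix):
    """Share of all samples that lie on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def f1_scores(matrix):
    """Per-class F1: harmonic mean of precision and recall."""
    scores = []
    for k in range(len(matrix)):
        true_pos = matrix[k][k]
        actual = sum(matrix[k])                     # row sum: samples truly in class k
        predicted = sum(row[k] for row in matrix)   # column sum: samples predicted as k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores

# Hypothetical counts, classes ordered [non-flooded, flooded].
matrix = [[80, 20],
          [10, 90]]

acc = accuracy(matrix)
f1 = f1_scores(matrix)
print(f"accuracy: {acc:.3f}")
print(f"f-score non-flooded: {f1[0]:.3f}")
print(f"f-score flooded: {f1[1]:.3f}")
```

Accuracy summarizes all classes in one number, whereas the F1-score is reported per class and drops when either precision or recall is low for that class.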

---
## Prediction

In [None]:
def predict(roi: ee.Geometry, classifier: ee.Classifier) -> ee.Image:
  """
  Predict flood probability for a given region of interest.

  Parameters
  ----------
  roi : ee.Geometry
    Region of interest.
  classifier : ee.Classifier
    Trained classifier.

  Returns
  -------
  ee.Image
    Flood probabilities.
  """

  ## normalize elevation
  elevationRange = dem.mosaic().reduceRegion(geometry=roi, reducer=ee.Reducer.minMax(), scale=30, bestEffort=True)
  elevMin = ee.Number(elevationRange.get('DEM_min'))
  elevMax = ee.Number(elevationRange.get('DEM_max'))

  ## calculate start and end date for precipitation data aggregation
  end = ee.Date('2024-10-31T00:00:00').millis()
  start = end.subtract(1209600000) # 14 days in milliseconds: 60 * 60 * 24 * 14 * 1000

  ## create composite of all relevant datasets
  composite = ee.Image.cat(
      ee.Terrain.aspect(dem.mosaic()),
      dem.mosaic().unitScale(elevMin, elevMax),
      ee.Terrain.slope(dem.mosaic()),
      landcover.first(),
      prec.filter(ee.Filter.date(start, end)).max(),
      prec.filter(ee.Filter.date(start, end)).sum(),
      runoffPotential,
      hydro.select('upa')
  ).rename(properties)

  ## classify the composite
  classifier = classifier.setOutputMode('MULTIPROBABILITY')
  classified = composite.classify(classifier).clip(roi)
  probabilities = classified.arrayFlatten([['non-flooded', 'flooded']])
  probability = probabilities.select('flooded')

  return probability

> **Task:** Try to understand the code in the cell above. What does the *predict*-function do?
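
One detail of *predict* that is easy to miss is the precipitation window. The sketch below reproduces the millisecond arithmetic with the Python standard library only, assuming that the date string is interpreted as UTC. It is an illustration of the calculation and does not call Earth Engine.

```python
from datetime import datetime, timezone

# 14 days expressed in milliseconds, as used in predict and createSample
WINDOW_MS = 60 * 60 * 24 * 14 * 1000

# End of the aggregation window (assumed to be UTC)
end = datetime(2024, 10, 31, tzinfo=timezone.utc)
end_ms = int(end.timestamp() * 1000)

# Subtracting the window gives the start of the aggregation period
start_ms = end_ms - WINDOW_MS
start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)

print(WINDOW_MS)          # 1209600000
print(start.isoformat())  # 2024-10-17T00:00:00+00:00
```

Precipitation is therefore aggregated from 17 October to 31 October 2024 for the prediction.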

In [None]:
## run prediction for current map bounds
probability = predict(roi=ee.Geometry.BBox(*Map.getBounds()), classifier=classifier)

## update mask to exclude permanent water
permanentWater = globalFlood.select('jrc_perm_water').mosaic()
probability = probability.updateMask(permanentWater.neq(1))

## update the map
probability_vis = {'min':0, 'max':1, 'palette':cm.palettes.cividis}
Map.addLayer(probability.selfMask(), probability_vis, 'Flooded probability')
Map.add_colorbar(probability_vis, label="Flooded probability", layer_name="Flooded probability", font_size=9)
Map

> **Task:** Remove pixels representing permanent water bodies from the prediction.  

> **Task:** Check various areas on the map. Where does the model seem to perform well? Where does it perform poorly? Can you explain why it performs poorly in certain areas? Do you see any bias towards a specific feature?

> **Task:** Go back to the cell where we trained the model and play around with hyperparameters. Can you improve the model?  

> **Task:** Check in the GEE documentation which other models are available and try to run them. What differences do you see? Which model works best?

---
## Feedback

I am constantly trying to improve this exercise and would very much appreciate it if you could take 2 minutes to give me some short feedback. Thank you!

Please run the cell below to display a Google Form where you can provide your feedback. I will not collect your email address, so the feedback is anonymous.

In [None]:
%%html
<iframe src="https://docs.google.com/forms/d/e/1FAIpQLScg8j6ORkqgWw4QEHpkeOy2PxYKSdgop3PPvaA1_WT54igFIA/viewform?embedded=true" width="640" height="1304" frameborder="0" marginheight="0" marginwidth="0">Loading…</iframe>

---
## Bonus: run country-scale prediction
The code below runs the trained model for all of Switzerland at 30 m resolution and exports the result to GEE. It runs in about 20 minutes. No need to run in the lab, but feel free to try it out and play around with different models, geographies and settings.
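
A rough back-of-the-envelope check of the `maxPixels` setting used in the export below can be sketched as follows. The area of Switzerland (about 41,285 km²) is an approximate figure assumed here, and the actual export grid may cover the bounding box of the country, which is larger than the country itself.

```python
SWITZERLAND_AREA_KM2 = 41_285   # approximate area, assumption of this sketch
SCALE_M = 30                    # export resolution in metres
MAX_PIXELS = 500_000_000        # limit passed to the export task

# Area of one pixel in square metres
pixel_area_m2 = SCALE_M ** 2

# Approximate number of pixels needed to cover the country
pixels = SWITZERLAND_AREA_KM2 * 1_000_000 / pixel_area_m2

print(f"approx. {pixels / 1e6:.1f} million pixels")
print(f"within maxPixels: {pixels < MAX_PIXELS}")
```

Under these assumptions the estimate stays well below the limit, so there is headroom for larger regions or a finer scale.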

In [None]:
## get country shape
countries = ee.FeatureCollection('WM/geoLab/geoBoundaries/600/ADM0')
ch = countries.filter(ee.Filter.eq('shapeName', 'Switzerland'))

## run prediction for Switzerland
flooded_prob = predict(roi=ch.geometry(), classifier=classifier)

## update the map
flooded_prob_vis = {'min':0, 'max':1, 'palette':cm.palettes.cividis}
Map.addLayer(flooded_prob.selfMask(), flooded_prob_vis, 'CH flooded probability')
Map.add_colorbar(flooded_prob_vis, label="CH flooded probability", layer_name="CH flooded probability", font_size=9)
Map

In [None]:
## create export task
task = ee.batch.Export.image.toAsset(
  flooded_prob,
  description='flood475-ch-prediction',
  assetId=f"projects/{PROJECT_ID}/assets/flood475_ch_prediction",
  scale=30,
  maxPixels=500_000_000
)

## run and monitor the task
task.start()
while task.active():
  ts = task.status()
  if ts['start_timestamp_ms']>0:
    s = round((ts['update_timestamp_ms']-ts['start_timestamp_ms'])/1000)
  else:
    s = round((ts['update_timestamp_ms']-ts['creation_timestamp_ms'])/1000)
  print(f"task '{ts['description']}' is {ts['state']} for {s} seconds")
  time.sleep(60)
task.status()