# Forecasting urban floods

*pitched by Samuel Barsanelli Costa*

<details>
    <summary><i>💡 Hint: how to generate an HTML-slide version of the notebook</i></summary>

In VSCode, right-click on the cell and:
- Click on `Switch Slide Type` to set the proper configuration for the cell
- Click on `Add Cell Tag` to add tags if needed

Then, generate an html-slide version of this using the following command:

```
jupyter nbconvert index.ipynb --to slides --post serve --no-prompt \
--TagRemovePreprocessor.remove_input_tags=remove_input \
--TagRemovePreprocessor.remove_all_outputs_tags=remove_output
```

- `--no-prompt` removes the `In [xx]:` and `Out[xx]:` prompts to the left of each cell
- the `--TagRemovePreprocessor` options hide the inputs or outputs of cells that carry the associated tag

To host the HTML on GitHub Pages, change the output format to `html`, like this:

``` jupyter nbconvert --to html index.ipynb ```

Make sure to name the notebook as 'index' and that GitHub Pages [settings](https://pages.github.com/) are properly set.

</details>

## Problem statement

What is an urban flood and why should we care about forecasting it?

📽️ A picture is worth a thousand words...

<iframe width="600" height="340"
src="https://www.youtube.com/embed/qefFjVbrZwE?autoplay=1&mute=1">
</iframe>

This is a global issue related to climate change, as raised by the [UN](https://www.undrr.org/)'s [GAR23](https://www.undrr.org/gar/gar2023-special-report) report:
* more intense rainfall is already increasing flood risk 🌧️
* flood damage is projected to increase by **170%** under **2°C** of global warming 🔥

From a data science perspective, flood prediction is not new. The [Journal of Hydrology](https://www.sciencedirect.com/journal/journal-of-hydrology) alone has published over 5k articles on it in the last 20 years.

Even Google has its own AI model to predict daily floods. They published their approach in a [paper](https://www.nature.com/articles/s41586-024-07145-1) in Nature, and made it available online at the [Flood Hub](https://sites.research.google/floods/l/0/0/3) in 80+ countries.

![google_flood_hub](google_flood_hub.jpg)

But the available resources on flooding are far too technical for a mainstream audience.

Could anyone quickly digest this [data plot](https://app.powerbi.com/view?r=eyJrIjoiZTRjZDlmYjgtNzAzMS00ZTFmLTlmZDAtNzEwNjM0MDU0NTJhIiwidCI6ImUwYmI0MDEyLTgxMGItNDY5YS04YjRkLTY2N2ZjZDFiYWY4OCJ9) and make an informed "fight-or-flight" decision?

![guaiba_telemetry](guaiba_telemetry.jpg)

The driving questions of this idea are:
* can a classification model be trained to predict floods from rainfall data?
* how accurate could it be?
* how far in advance could we predict a flood (e.g. 5 days)?
* could we publish it in real time in an easy-to-digest fashion?
* does it make sense to communicate it just as we do with weather forecasts?

🎯 That's the goal!

![flood_forecast_app](flood_forecast_app.jpg)

## Viability analysis

At minimum, to address this problem we need:
* river stage data: how high (or low) the river level is at a certain location, relative to a fixed reference level, usually measured in cm
* rainfall data: the amount of rainfall at a certain location, collected as a volume but usually reported as a depth in mm
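As a rough illustration of how these two series could feed a classifier, the sketch below builds a daily table of rainfall features and a next-day flood label. Everything in it is an assumption made for illustration: the data is synthetic, `FLOOD_STAGE_CM` is a hypothetical threshold (not an official alert level), and the 1-, 3- and 7-day rainfall windows and the one-day horizon are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute series standing in for real telemetry (values are made up)
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=4 * 24 * 30, freq="15min")
rainfall_mm = pd.Series(rng.gamma(0.05, 2.0, len(index)), index=index)
# Toy stage: a baseline plus a response to the rainfall of the last two days
stage_cm = 100 + 40 * rainfall_mm.rolling("2D").sum()

FLOOD_STAGE_CM = 300  # hypothetical threshold, not an official alert level

# Daily features: rainfall accumulated over the last 1, 3 and 7 days
daily_rain = rainfall_mm.resample("1D").sum()
features = pd.DataFrame(
    {
        "rain_1d": daily_rain,
        "rain_3d": daily_rain.rolling(3).sum(),
        "rain_7d": daily_rain.rolling(7).sum(),
    }
)

# Target: does the maximum stage of the NEXT day exceed the threshold?
next_day_max = stage_cm.resample("1D").max().shift(-1)
dataset = features.assign(next_day_max=next_day_max).dropna()
dataset = dataset.assign(
    flood_next_day=(dataset["next_day_max"] > FLOOD_STAGE_CM).astype(int)
).drop(columns="next_day_max")

print(dataset.head())
```

Each row of `dataset` is one day, so any off-the-shelf classifier could be fitted on the `rain_*` columns against `flood_next_day`. Longer horizons would only change the `shift` value.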

Automated Telemetry Stations are the...

Sampling rate...

Frame the scope to RS...

Talk about spatially distributed rainfall data (maybe a picture would help)

In [1]:
import io

import pandas as pd
import requests

# Download the station spreadsheet from the API
url = 'https://saladesituacao.rs.gov.br/api/station/ana/sheet/87382000'
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early on an HTTP error

# Read the binary response as an Excel file into a dataframe
with io.BytesIO(response.content) as excel_file:
    df = pd.read_excel(
        excel_file,
        skiprows=8,
        names=['stage_cm', 'discharge_cms', 'rainfall_mm'],
        dtype=float,
        index_col=0,
        parse_dates=True,
    )

df

| | stage_cm | discharge_cms | rainfall_mm |
|---|---|---|---|
| 2024-07-15 16:15:00 | 346.0 | 165.46 | 0.0 |
| 2024-07-15 16:00:00 | 346.0 | 165.46 | 0.0 |
| 2024-07-15 15:45:00 | 346.0 | 165.46 | 0.0 |
| 2024-07-15 15:30:00 | 346.0 | 165.46 | 0.0 |
| 2024-07-15 15:15:00 | 346.0 | 165.46 | 0.0 |
| ... | ... | ... | ... |
| 2018-01-08 14:15:00 | 357.0 | NaN | 0.0 |
| 2018-01-08 14:00:00 | 357.0 | NaN | 0.0 |
| 2018-01-08 13:45:00 | 357.0 | NaN | 0.0 |
| 2018-01-08 13:30:00 | 358.0 | NaN | 0.0 |


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 194678 entries, 2024-07-15 16:15:00 to 2018-01-08 13:15:00
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   stage_cm       192285 non-null  float64
 1   discharge_cms  143652 non-null  float64
 2   rainfall_mm    194457 non-null  float64
dtypes: float64(3)
memory usage: 5.9 MB


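In [3]:
# Get the timestamp of the first NaN stage value (rows are sorted newest first)
first_nan = df[df.stage_cm.isna()].index[0]

# Subset the most recent stretch of stage data; dropna() removes every row
# with a NaN in any column, including the row at first_nan
subset = df[:first_nan].dropna()

# Interpolate NaN discharge and rainfall; interpolate() returns a new Series,
# so the result is assigned back (a no-op here, since dropna() left no NaN)
subset = subset.assign(
    discharge_cms=subset.discharge_cms.interpolate(method='linear'),
    rainfall_mm=subset.rainfall_mm.interpolate(method='linear'),
)

subset.describe()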

| | stage_cm | discharge_cms | rainfall_mm |
|---|---|---|---|
| count | 16959.0 | 16959.0 | 16959.0 |
| mean | 283.658117 | 246.696645 | 0.072787 |
| std | 184.663094 | 356.023986 | 0.494665 |
| min | 65.0 | 14.31 | 0.0 |
| 25% | 116.0 | 28.87 | 0.0 |
| 50% | 240.0 | 95.77 | 0.0 |
| 75% | 433.0 | 354.92 | 0.0 |
| max | 800.0 | 1865.0 | 22.0 |
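If the intent is to fill missing discharge and rainfall values rather than discard those rows, one possible variant is sketched below on a small made-up frame (the `toy` and `filled` names and all values are hypothetical). It drops rows only where the stage is missing and assigns the interpolated columns back, since `interpolate` returns a new object instead of modifying the data in place.

```python
import numpy as np
import pandas as pd

# Made-up 15-minute readings with holes in discharge and rainfall
index = pd.date_range("2024-01-01", periods=8, freq="15min")
toy = pd.DataFrame(
    {
        "stage_cm": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0],
        "discharge_cms": [10.0, np.nan, 12.0, np.nan, np.nan, np.nan, 16.0, 17.0],
        "rainfall_mm": [0.0, 0.2, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0],
    },
    index=index,
)

# Drop rows only where the stage itself is missing, keeping the others
filled = toy.dropna(subset=["stage_cm"])

# interpolate() returns new Series, so the results are assigned back
filled = filled.assign(
    discharge_cms=filled["discharge_cms"].interpolate(method="linear"),
    rainfall_mm=filled["rainfall_mm"].interpolate(method="linear"),
)

print(filled)
```

Whether linear interpolation is appropriate for rainfall, which is an amount accumulated per interval, is a modelling choice left open here.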


In [4]:
# Check for gaps in the time series: total time span covered by the subset
subset.index.max() - subset.index.min()

Timedelta('340 days 23:15:00')

In [5]:
# Time covered by the samples at a 15-minute rate; a value shorter than the span means gaps
len(subset) * pd.to_timedelta('15 min')

Timedelta('176 days 15:45:00')
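The span (about 341 days) is almost twice the time covered by the samples (about 177 days), so the subset is not continuous at a 15-minute rate. A minimal sketch of how the gaps could be located is shown below, on a made-up index rather than the real data. The `series`, `gaps` and `missing` names are hypothetical, and the index is sorted first because the station data arrives newest first.

```python
import pandas as pd

# Made-up 15-minute index with three timestamps removed
full = pd.date_range("2024-01-01 00:00", periods=12, freq="15min")
index = full.delete([3, 4, 8])
series = pd.Series(range(len(index)), index=index, dtype=float).sort_index()

step = pd.Timedelta("15min")

# Time elapsed between consecutive samples; anything above one step is a gap
deltas = series.index.to_series().diff()
gaps = deltas[deltas > step]

# Timestamps missing from the regular 15-minute grid
expected = pd.date_range(series.index.min(), series.index.max(), freq=step)
missing = expected.difference(series.index)

print(gaps)
print(f"{len(missing)} samples missing out of {len(expected)}")
```

Each entry of `gaps` is indexed by the timestamp at which sampling resumes, and its value is the length of the interruption.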