# Training a flood prediction model
In this notebook, we use the training dataset created in [02_Creating_Training_Data.ipynb](https://github.com/twaldburger/flood475/blob/master/02_Creating_Training_Data.ipynb) to train a flood prediction model.
> **Task:** Tasks and questions are marked like this. Please try to answer them before proceeding with the next cell.

---
## Preparations

### Install additional dependencies
Not all our required libraries are pre-installed in Colab. We therefore install additional libraries using [pip](https://pip.pypa.io/en/stable/). After installation, we have to restart our runtime before we can import the library. Please make sure that you click the _RESTART SESSION_-button which is displayed at the end of the code cell's output.

In [None]:
! pip install pycaret

### Import dependencies
We can now import all required dependencies.

In [None]:
import google
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from geemap.legends import builtin_legends
from pathlib import Path
from pycaret.classification import *

### Define global variables
The cell below defines some global variables.
- `TRAINING_DATA` This is the name of the dataset we created in [02_Creating_Training_Data.ipynb](https://github.com/twaldburger/flood475/blob/master/02_Creating_Training_Data.ipynb). You do not have to specify the path, only the file name, which is most likely _roi_sample.csv_ or _floods_sample.csv_.
- `MODEL_NAME` The name of your model. This becomes relevant if you want to save multiple different models.
- `SEED` This is the seed value used for splitting our training data into a train and a test dataset. Setting this value makes our results reproducible.

In [None]:
TRAINING_DATA = 'roi_sample_norm.csv' # @param {type: 'string'}
MODEL_NAME = 'flood_prediction_1' # @param {type: 'string'}
SEED = 123 # @param {type: 'integer'}
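
To illustrate why a fixed seed makes the train/test split reproducible, here is a minimal sketch using only Python's standard library. It is not how PyCaret splits the data internally; the helper `split_indices` and the 70/30 ratio are assumptions chosen for illustration.

```python
import random

def split_indices(n_rows, train_fraction, seed):
    """Shuffle row indices with a seeded generator and split them (hypothetical helper)."""
    rng = random.Random(seed)            # generator with its own, fixed starting state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

# the same seed always produces the same split
train_a, test_a = split_indices(10, 0.7, seed=123)
train_b, test_b = split_indices(10, 0.7, seed=123)

print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

Running the sketch twice with the same seed yields identical index lists, which is the property we rely on when comparing results between runs.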

### Mount Google Drive
We will use Google Drive to store our preliminary results from GEE because we can mount it to Google Colab and therefore easily read and write data without manually downloading and uploading datasets.

**Important!** The cell below mounts your Google Drive to Google Colab and creates a new folder (named _geo475_ee_). This folder will be removed again at the end of the exercise (you can also keep it if you want, of course). **To make sure that we are not deleting any of your personal data, do not change the `data_dir`-variable in the cell below unless you know what you are doing.**

In [None]:
data_dir = Path('/content/gdrive/MyDrive/geo475_ee')

## mount Google Drive to Colab
if not data_dir.parent.exists():
  google.colab.drive.mount('/content/gdrive')

## create output directory for the project
if not data_dir.exists():
  data_dir.mkdir()
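
As a side note, `Path.mkdir` can perform the existence check itself. The sketch below shows this on a temporary directory, which stands in for the mounted Drive (an assumption made so the example runs anywhere). In the notebook, keep the mount check first: with `parents=True`, calling `mkdir` before mounting would create a local folder instead of a folder on your Drive.

```python
import tempfile
from pathlib import Path

# a temporary directory stands in for the mounted Drive (assumption for this sketch)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / 'MyDrive' / 'geo475_ee'

    # parents=True creates missing parent folders,
    # exist_ok=True makes repeated calls harmless
    demo_dir.mkdir(parents=True, exist_ok=True)
    demo_dir.mkdir(parents=True, exist_ok=True)  # second call does not raise

    created = demo_dir.is_dir()

print(created)
```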

---
## Data preparation

### Data import
First, we import our training dataset from Google Drive, derive a binary target variable (`flooded`) from the flood counts, and remove unnecessary columns.

In [None]:
## import from google drive
df = pd.read_csv(data_dir/TRAINING_DATA)

## make target variable binary
df['flooded'] = 0
df.loc[df['floods']>0, 'flooded'] = 1

## remove unnecessary columns
df.drop(['floods', 'system:index', 'first', '.geo'], axis=1, inplace=True)

### Exploratory data analysis
We now have a first look at the columns in the dataframe.
> **Task:** Look at the descriptive statistics and see if you spot anything unexpected. What are your key observations?

In [None]:
## get descriptive statistics of all columns
df.describe(include='all').T

Let's check how flood frequency is distributed and how it relates to landcover by creating a stacked barplot showing the flood frequency counts per landcover class.
> **Task:** Try to answer the following questions:
1. What are your key observations when looking at the barplot?
2. Does the plot make sense? Do you see anything unexpected?
3. Do you see signs of biased or wrong training data?

In [None]:
## create a mapper to assign landcover class names
mapper = {}
for cls in builtin_legends['ESA_WorldCover']:
  k, v = cls.split(' ', 1)
  mapper[float(k)] = v
df['landcover_names'] = df['landcover'].map(mapper)

## stacked barplot showing flood frequency counts per landcover class
df_plot = df.groupby(['landcover_names', 'flooded']).size().reset_index()
df_plot = df_plot.pivot(columns='landcover_names', index='flooded', values=0)
df.drop(['landcover_names'], axis=1, inplace=True)
df_plot.plot(kind='bar', stacked=True, figsize=(10, 5))
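
The two steps in the cell above (building the mapper and counting samples per class) can be sketched without pandas. The legend keys and samples below are made up for illustration; they only mimic the `<code> <name>` form that the mapper loop expects.

```python
from collections import Counter

# illustrative legend keys in the "<code> <name>" form (not the real legend)
legend_keys = ['10 Trees', '40 Cropland', '80 Permanent water bodies']

mapper = {}
for key in legend_keys:
    code, name = key.split(' ', 1)   # split only at the first space
    mapper[float(code)] = name

# made-up samples: (landcover code, flooded flag)
samples = [(10.0, 0), (10.0, 0), (40.0, 1), (40.0, 0), (80.0, 1), (80.0, 1)]

# count samples per (landcover name, flooded) pair, like groupby(...).size()
counts = Counter((mapper[code], flooded) for code, flooded in samples)

for (name, flooded), n in sorted(counts.items()):
    print(name, flooded, n)
```

Splitting with `maxsplit=1` matters because class names may contain spaces themselves.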

> **Task:** Which other training variables are you interested in exploring? Try to create a few more plots to get an idea of your training dataset.

Next, let's have a look at correlations with our target variable.
> **Task:** Do you see any strong correlations? How could we use a high correlation to our benefit?

In [None]:
corr = df.corr(method='pearson')
corr['flooded'].sort_values(ascending=False)
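
To make the numbers above easier to interpret, the following sketch computes a Pearson correlation by hand. The `pearson` helper and the elevation values are made up for illustration. Note that Pearson correlation only captures linear association, and that with a binary variable such as `flooded` it is equivalent to the point-biserial correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# made-up values: elevation in metres and a binary flooded flag
elevation = [2.0, 3.0, 5.0, 8.0, 12.0, 15.0]
flooded = [1, 1, 1, 0, 0, 0]

r = pearson(elevation, flooded)
print(round(r, 3))  # negative: higher elevation goes with fewer floods
```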

We also look at the correlations between all other variables.
> **Task:** Try to answer the following questions:
1. Between which variables do you see (strong) correlation?
2. To what extent does it make sense to look at the correlation? Where do you see potential problems?

In [None]:
## visualize correlations between all variables as a heatmap
plt.figure(figsize=(13, 8))
sns.set(font_scale=0.6)
sns.heatmap(corr, cmap='RdYlGn', annot=True, center=0, fmt=".2g")

### Feature engineering
We saw that the _daily_max_precipitation_ columns are strongly correlated with each other. We therefore aggregate them into a single new variable by taking their mean.

In [None]:
cols = [c for c in df.columns if c.startswith('daily_max')]
df['daily_max_precipitation'] = df[cols].mean(axis=1)
df.drop(cols, axis=1, inplace=True)
df
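
The aggregation can be sketched in plain Python as follows. The rows and the `daily_max_*` column names are hypothetical stand-ins for the real dataset.

```python
# made-up rows; the daily_max_* column names are hypothetical stand-ins
rows = [
    {'elevation': 4.0, 'daily_max_2019': 30.0, 'daily_max_2020': 34.0, 'daily_max_2021': 32.0},
    {'elevation': 9.0, 'daily_max_2019': 12.0, 'daily_max_2020': 10.0, 'daily_max_2021': 14.0},
]

# select the columns by prefix, as the cell above does
cols = [c for c in rows[0] if c.startswith('daily_max')]

engineered = []
for row in rows:
    # keep everything except the original precipitation columns
    new_row = {k: v for k, v in row.items() if k not in cols}
    # row-wise mean of the selected columns
    new_row['daily_max_precipitation'] = sum(row[c] for c in cols) / len(cols)
    engineered.append(new_row)

print(engineered)
```

One thing to keep in mind: the new column name also starts with `daily_max`. If the notebook cell is run a second time, the prefix filter selects the new column as its only input and then drops it, so re-import the data before re-running that cell.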

---
## Model training


### Setup
With the `setup` function, we initialize the training environment. We are using the absolute minimum for the setup by only defining the required parameters. However, the customization options are almost endless.
> **Task:** Come back to this cell after your initial run and see whether you can get better results by adjusting your training environment. You can find the documentation for the _setup_-method [here](https://pycaret.readthedocs.io/en/latest/api/classification.html#pycaret.classification.ClassificationExperiment).

In [None]:
classifier = setup(data=df, target='flooded', session_id=SEED)

### Compare models
We now train and evaluate multiple models. The function displays a scoring grid with the average cross-validated scores of each model. Because we sort by F1-score, the model stored in `best` is the one with the highest F1-score. Here, we want to focus on the following metrics:

- _Accuracy:_ How often is the model right?
- _Recall:_ How many of the actual positive cases does the model identify?
- _Precision:_ How often are positive predictions correct?
- _F1-Score:_ Combination of recall and precision. A high F1-score signifies that the model can effectively identify positive cases while minimizing false positives and false negatives.
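
As a worked example, the sketch below computes the four metrics from made-up confusion-matrix counts for an imbalanced dataset. The helper `classification_metrics` is hypothetical and does not guard against division by zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of positive predictions that were right
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# made-up counts: 100 flooded and 900 non-flooded locations
accuracy, recall, precision, f1 = classification_metrics(tp=40, tn=880, fp=20, fn=60)
print(accuracy, recall, round(precision, 3), round(f1, 3))
```

With these counts the accuracy is high even though the model misses more than half of the flooded locations, which is why accuracy alone can be misleading when the classes are imbalanced.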

> **Task:** Do a short internet search to learn a little bit about the type of model which scored best for your training data. Then, try to answer the following questions:
1. How high is your best model's accuracy? Do you think this is a good performance?
2. When looking at the other metrics, would you argue that another model actually performs better than the one with the highest accuracy?

In [None]:
best = compare_models(sort='F1')

### Analyze best model
We now focus on the model with the highest F1-score and plot the confusion matrix by running the cell below. Below is a quick refresher on the 4 fields of the confusion matrix, but it is also worth having a look at [this page](https://www.v7labs.com/blog/confusion-matrix-guide), which shows how the metrics from above link to the confusion matrix.

- _True Positive (TP):_ a class is predicted true and is true in reality (locations that are flooded and are predicted as flooded)
- _True Negative (TN):_ a class is predicted false and is false in reality (locations that are not flooded and are predicted as not flooded)
- _False Positive (FP):_ a class is predicted true but is false in reality (locations that are not flooded but are predicted as flooded)
- _False Negative (FN):_ a class is predicted false but is true in reality (locations that are flooded but are predicted as not flooded)
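
The four fields can be counted directly from true and predicted labels. The sketch below uses made-up labels, and `confusion_counts` is a hypothetical helper, not part of PyCaret.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = flooded, 0 = not flooded)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# made-up labels for eight locations
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
```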

> **Task:** Try to answer the following questions:
1. What can you read from the confusion matrix of your model?
2. Does your model perform better when predicting flooded locations or when predicting non-flooded locations?
3. Thinking about risk and insurance, which of the 4 fields do you think is most important?
4. Thinking about risk and insurance, what does a high number of false negatives mean?
5. Thinking about risk and insurance, what does a high number of false positives mean?



In [None]:
plot_model(best, plot='confusion_matrix')

We will also have a look at the importance of our input variables.
> **Task:** Try to answer the following questions:
1. Which features are more important and which are less important?
2. How could we use this information to improve our pipeline?
3. How could the selection of our training data (and our region of interest) influence feature importance?


In [None]:
plot_model(best, plot='feature')

### Save model
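We finalize the model by training it again on the full dataset, including the 30% held out for validation. This does not change the model's hyperparameters; it only refits the model on the entire dataset. `finalize_model` returns the refitted model, so we assign the result back to `best` to make sure the finalized model is the one we save.

In [None]:
best = finalize_model(best)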

We finally save the model to Google Drive.

In [None]:
save_model(best, model_name=data_dir/MODEL_NAME, model_only=True)
