
Define evaluation metrics and compute baseline scores #11

Open
7 of 13 tasks
raspstephan opened this issue Jan 1, 2021 · 9 comments
raspstephan (Owner) commented Jan 1, 2021

Steps:

  • Interpolation baseline
  • RMSE score
  • FSS
  • Compute further deterministic scores (F1)
  • Think about the design of how to call the different functions etc.
  • Include cnn prediction in the evaluation
  • Include HRRR-mask for evaluation in addition to radar quality
  • Literature research on commonly used thresholds/scales, ... for evaluation metrics (FSS, F1, ...)
  • Compute histograms and cell sizes
  • Deal with multiple lead times
  • Consider ensemble data
  • Compute ensemble scores
  • Include other baselines as option

We need to define how we want to evaluate our forecasts. That means clearly defining the metrics as well as the region and timeframe to be used.

@HirtM we already talked about the train/valid split and decided that using one week per month for validation sounds like a good choice. So we can just take the first 7 days of each month. We still need to define an area to be evaluated; for this, let's have a look at the radar quality map and choose a region. And we should do the evaluation for a range of lead times, maybe up to 48h?
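
Just to make that split concrete, something like this is what I have in mind (only a rough sketch; the file name and the `time`/`lead_time` dimension names are assumptions about our data layout):

```python
import xarray as xr

# Hypothetical combined dataset with dimensions (time, lead_time, lat, lon).
ds = xr.open_dataset("forecasts.nc")

# Validation: first 7 days of each month, lead times up to 48 h
# (assuming lead_time is stored in hours).
valid = ds.isel(time=ds.time.dt.day <= 7).sel(lead_time=slice(0, 48))

# Training: all remaining days.
train = ds.isel(time=ds.time.dt.day > 7)
```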

In terms of metrics, there are a ton of different options but we should restrict it to a few in order to keep things simple. We will evaluate deterministic as well as ensemble forecasts. Additionally, we want to look at statistics that describe the "realism" of our forecasts. Here are some suggestions from my side (probably too many...):

Deterministic

  • RMSE
  • FSS at different thresholds (0.1, 1 and 5?) and some neighborhood scale (let's have a look at what's commonly used in the literature)
  • F1 score (also requires thresholding but very commonly used)

Probabilistic

  • CRPS
  • Rank histograms

Realism

  • Precipitation amount histogram/spectra
  • Something about cell size and shape (RDF or simply cell size distribution)

Finally, we should have some baselines. The easiest one is simply a bilinear interpolation of the TIGGE forecast.
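
To make that baseline concrete, here is a minimal sketch (the file names and the `lat`/`lon` coordinate names are assumptions, not our actual files):

```python
import numpy as np
import xarray as xr

# Hypothetical inputs: coarse TIGGE precipitation and the high-resolution radar target.
tigge = xr.open_dataarray("tigge_precip.nc")
radar = xr.open_dataarray("radar_precip.nc")

# Bilinear interpolation of the coarse forecast onto the radar grid.
baseline = tigge.interp(lat=radar.lat, lon=radar.lon, method="linear")

# Deterministic RMSE of the baseline against the radar observations.
rmse = float(np.sqrt(((baseline - radar) ** 2).mean()))
```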

@raspstephan raspstephan added this to To do in NWP downscaling Jan 1, 2021
HirtM (Collaborator) commented Jan 5, 2021

That sounds like a good list! I was thinking of quite similar scores. I would vote for simple cell size distributions since RDF or any other cell size methods are not trivial to interpret.

48 h lead time for evaluation sounds good. Either we cut out 2 days after each 7-day evaluation period from the training set, or we reduce the lead time for the last two of the 7 days, so that training and evaluation strictly see different situations. But maybe we need to cut ~1 day before and after each evaluation period anyway because of correlated weather situations?

Regarding the baselines: do we want to have both a CRM and another simple downscaling method?
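
For the simple cell-size distribution, something along these lines should already do (just a sketch; the 1 mm threshold is a placeholder, not a decision):

```python
import numpy as np
from scipy import ndimage

def cell_sizes(precip_field, threshold=1.0):
    """Sizes (in grid points) of connected precipitation cells above `threshold`."""
    labels, n_cells = ndimage.label(precip_field > threshold)
    # Count grid points per labelled cell; label 0 is the background.
    return np.bincount(labels.ravel())[1:]

# Usage: pool cell sizes over all validation fields and look at the histogram.
# all_sizes = np.concatenate([cell_sizes(f) for f in fields])
# hist, edges = np.histogram(all_sizes, bins=np.logspace(0, 4, 30))
```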

raspstephan (Owner, Author) commented

I will try to organize the CRM data.

With regards to the overlap between train/valid, my intuition would be to ignore this for starters. But should this ever end up in a paper, we should do it properly. Hopefully we will have enough data to take an entire year for validation.

BTW, here is the xskillscore package which is quite nice, especially for ensemble metrics: https://xskillscore.readthedocs.io/en/stable/index.html
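
For example, the probabilistic metrics from the list above could look roughly like this with xskillscore (the file names and the `member`/`lat`/`lon` dimension names are assumptions):

```python
import xarray as xr
import xskillscore as xs

# Hypothetical inputs: radar observations (time, lat, lon) and an
# ensemble forecast (member, time, lat, lon).
obs = xr.open_dataarray("radar_precip.nc")
fcst = xr.open_dataarray("ensemble_forecast.nc")

rmse = xs.rmse(obs, fcst.mean("member"), dim=["lat", "lon"])
crps = xs.crps_ensemble(obs, fcst, member_dim="member", dim=["lat", "lon"])
rank_hist = xs.rank_histogram(obs, fcst, member_dim="member")
```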

@HirtM HirtM moved this from To do to In progress in NWP downscaling Jan 8, 2021
HirtM (Collaborator) commented Jan 8, 2021

As the evaluation area, we can for a start use the whole domain wherever the radar quality is good enough. Using a threshold of -1, we get the following criterion (radar quality rq at the top, selected area at the bottom):
[image: radar quality map (top) and selected evaluation area (bottom)]
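
In code, the masking is essentially just this (the file names and whether the threshold is inclusive are assumptions):

```python
import xarray as xr

# Hypothetical file names.
rq = xr.open_dataarray("radar_quality.nc")
radar = xr.open_dataarray("radar_precip.nc")

# Evaluation mask: radar quality above the -1 threshold.
mask = rq >= -1
radar_eval = radar.where(mask)  # grid points outside the area become NaN
```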


raspstephan (Owner, Author) commented

Great, thanks. I moved the to-do list up to the top so that it shows up in the project.

HirtM (Collaborator) commented Jan 12, 2021

FSS has been added. I have a few questions about the code structure (using classes etc.) and how to call the whole evaluation process. Let's discuss that on Thursday.
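
As a reference point for the discussion, here is the idea behind the FSS reduced to a minimal sketch (not the actual implementation in the repo):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst, obs, threshold, window):
    """Fractions Skill Score for a single pair of 2-D precipitation fields.

    threshold: exceedance threshold (placeholder, value to be decided).
    window:    neighborhood size in grid points.
    """
    # Fraction of exceeding grid points within each neighborhood.
    f_frac = uniform_filter((fcst > threshold).astype(float), size=window)
    o_frac = uniform_filter((obs > threshold).astype(float), size=window)

    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan
```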

HirtM (Collaborator) commented Jan 14, 2021

Regarding the F1 score, I only implemented the binary version; again, different thresholds are possible. In principle, an F1 score with multiple categories instead of just two would also be possible, but I am not sure that is what we want.
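
For reference, the binary version boils down to this (the threshold value is a placeholder):

```python
import numpy as np

def f1_binary(fcst, obs, threshold=1.0):
    """Binary F1 score after thresholding both fields."""
    f = fcst > threshold
    o = obs > threshold
    tp = np.sum(f & o)    # hits
    fp = np.sum(f & ~o)   # false alarms
    fn = np.sum(~f & o)   # misses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else np.nan
```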

HirtM (Collaborator) commented Jan 19, 2021

First comparisons with the 001-Upscale_valid Generator prediction showed improvements over our baseline in RMSE, FSS and F1 score (although the precipitation fields look a bit blurred). A proper implementation of the evaluation routine (e.g. latitude selection, ...) is still required.

[image: RMSE, FSS and F1 comparison of the Generator prediction against the baseline]

@HirtM HirtM moved this from In progress to To do in NWP downscaling Jan 25, 2021
raspstephan (Owner, Author) commented

Some thoughts on the train/valid/test split now that we have 3 years of data (2018-2020).

At first I thought we could just use 2018/19 for training and 2020 for validation. But I think that manual overfitting could become an issue, especially if we use things like early stopping, and therefore it might be better to have a third dataset for testing. So my current solution is to use the first 6 days of each month in 2018/19 for validation during model training, and then use 2020 only for the external validation you have done.

The downside is that we lose 1/5 of our training data, but I think this is the more proper approach. This leaves us with 40k training samples, which is a lot, but of course they are quite unevenly distributed.
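
In xarray terms, the split would look roughly like this (the file and dimension names are assumptions):

```python
import xarray as xr

# Hypothetical combined 2018-2020 dataset with a `time` dimension.
ds = xr.open_dataset("samples.nc")

train_valid = ds.sel(time=slice("2018-01-01", "2019-12-31"))
test = ds.sel(time=slice("2020-01-01", "2020-12-31"))  # external validation only

# First 6 days of each month in 2018/19 -> validation, the rest -> training.
is_valid = train_valid.time.dt.day <= 6
valid = train_valid.isel(time=is_valid)
train = train_valid.isel(time=~is_valid)
```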

@raspstephan raspstephan moved this from To do to Backlog in NWP downscaling Apr 23, 2021