
Define evaluation metrics and compute baseline scores #11

Open
7 of 13 tasks
raspstephan opened this issue Jan 1, 2021 · 9 comments
raspstephan (Owner) commented Jan 1, 2021

Steps:

  • Interpolation baseline
  • RMSE score
  • FSS
  • Compute further deterministic scores (F1)
  • Think about the design of how to call the different functions etc.
  • Include cnn prediction in the evaluation
  • Include HRRR-mask for evaluation in addition to radar quality
  • Literature research on commonly used thresholds/scales, ... for evaluation metrics (FSS, F1, ...)
  • Compute histograms and cell sizes
  • Deal with multiple lead times
  • Consider ensemble data
  • Compute ensemble scores
  • Include other baselines as option

We need to define how we want to evaluate our forecasts. That means clearly defining the metrics as well as the region and timeframe to be used.

@HirtM we already talked about the train/valid split and decided that using one week per month for validation sounds like a good choice. So we can just take the first 7 days of each month. We still need to define an area to be evaluated; for this, let's have a look at the radar quality map and choose a region. And we should do the evaluation for a range of lead times, maybe up to 48h?
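
Just to make that split concrete, something like this is what I have in mind (only a rough sketch; the file name and the `time`/`lead_time` dimension names are assumptions about our data layout):

```python
import xarray as xr

# Hypothetical combined dataset with dimensions (time, lead_time, lat, lon).
ds = xr.open_dataset("forecasts.nc")

# Validation: first 7 days of each month, lead times up to 48 h
# (assuming lead_time is stored in hours).
valid = ds.isel(time=ds.time.dt.day <= 7).sel(lead_time=slice(0, 48))

# Training: all remaining days.
train = ds.isel(time=ds.time.dt.day > 7)
```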

In terms of metrics, there are a ton of different options but we should restrict it to a few in order to keep things simple. We will evaluate deterministic as well as ensemble forecasts. Additionally, we want to look at statistics that describe the "realism" of our forecasts. Here are some suggestions from my side (probably too many...):

Deterministic

  • RMSE
  • FSS at different thresholds (0.1, 1 and 5?) and some neighborhood scale (let's have a look at what's commonly used in the literature)
  • F1 score (also requires thresholding but very commonly used)

Probabilistic

  • CRPS
  • Rank histograms

Realism

  • Precipitation amount histogram/spectra
  • Something about cell size and shape (RDF or simply cell size distribution)

Finally, we should have some baselines. The easiest one is simply a bilinear interpolation of the TIGGE forecast.
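
To make that baseline concrete, here is a minimal sketch (the file names and the `lat`/`lon` coordinate names are assumptions, not our actual files):

```python
import numpy as np
import xarray as xr

# Hypothetical inputs: coarse TIGGE precipitation and the high-resolution radar target.
tigge = xr.open_dataarray("tigge_precip.nc")
radar = xr.open_dataarray("radar_precip.nc")

# Bilinear interpolation of the coarse forecast onto the radar grid.
baseline = tigge.interp(lat=radar.lat, lon=radar.lon, method="linear")

# Deterministic RMSE of the baseline against the radar observations.
rmse = float(np.sqrt(((baseline - radar) ** 2).mean()))
```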

@raspstephan raspstephan added this to To do in NWP downscaling Jan 1, 2021
HirtM (Collaborator) commented Jan 5, 2021

That sounds like a good list! I was thinking of quite similar scores. I would vote for simple cell size distributions since RDF or any other cell size methods are not trivial to interpret.

48 h lead time for evaluation sounds good. Either we cut out 2 days after each 7-day evaluation period from the training set, or we reduce the lead time for the last two of the 7 days, so that training and evaluation strictly see different situations. But maybe we need to cut ~1 day before and after each evaluation period anyway because of correlated weather situations?

Regarding the baselines: do we want to have both a CRM and another simple downscaling method?
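
For the simple cell-size distribution, something along these lines should already do (just a sketch; the 1 mm threshold is a placeholder, not a decision):

```python
import numpy as np
from scipy import ndimage

def cell_sizes(precip_field, threshold=1.0):
    """Sizes (in grid points) of connected precipitation cells above `threshold`."""
    labels, n_cells = ndimage.label(precip_field > threshold)
    # Count grid points per labelled cell; label 0 is the background.
    return np.bincount(labels.ravel())[1:]

# Usage: pool cell sizes over all validation fields and look at the histogram.
# all_sizes = np.concatenate([cell_sizes(f) for f in fields])
# hist, edges = np.histogram(all_sizes, bins=np.logspace(0, 4, 30))
```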

raspstephan (Owner, Author) commented

I will try to organize the CRM data.

With regards to the overlap between train/valid, my intuition would be to ignore this for starters. But should this ever end up in a paper, we should do it properly. Hopefully we will have enough data to take an entire year for validation.

BTW, here is the xskillscore package which is quite nice, especially for ensemble metrics: https://xskillscore.readthedocs.io/en/stable/index.html
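
For example, the probabilistic metrics from the list above could look roughly like this with xskillscore (the file names and the `member`/`lat`/`lon` dimension names are assumptions):

```python
import xarray as xr
import xskillscore as xs

# Hypothetical inputs: radar observations (time, lat, lon) and an
# ensemble forecast (member, time, lat, lon).
obs = xr.open_dataarray("radar_precip.nc")
fcst = xr.open_dataarray("ensemble_forecast.nc")

rmse = xs.rmse(obs, fcst.mean("member"), dim=["lat", "lon"])
crps = xs.crps_ensemble(obs, fcst, member_dim="member", dim=["lat", "lon"])
rank_hist = xs.rank_histogram(obs, fcst, member_dim="member")
```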

@HirtM HirtM moved this from To do to In progress in NWP downscaling Jan 8, 2021
HirtM (Collaborator) commented Jan 8, 2021

As the evaluation area, we can for a start use the whole domain wherever the radar quality is good enough. Using a threshold of -1, we get the following criterion (radar quality rq at the top, selected area at the bottom):
[image: radar quality map (top) and selected evaluation area (bottom)]
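
In code, the masking is essentially just this (the file names and whether the threshold is inclusive are assumptions):

```python
import xarray as xr

# Hypothetical file names.
rq = xr.open_dataarray("radar_quality.nc")
radar = xr.open_dataarray("radar_precip.nc")

# Evaluation mask: radar quality above the -1 threshold.
mask = rq >= -1
radar_eval = radar.where(mask)  # grid points outside the area become NaN
```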


raspstephan (Owner, Author) commented

Great, thanks. I moved the to-do list up to the top so that it shows up in the project.

HirtM (Collaborator) commented Jan 12, 2021

FSS has been added. I have a few questions about the code structure (using classes etc.) and how to call the whole evaluation process. Let's discuss that on Thursday.
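
As a reference point for the discussion, here is the idea behind the FSS reduced to a minimal sketch (not the actual implementation in the repo):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(fcst, obs, threshold, window):
    """Fractions Skill Score for a single pair of 2-D precipitation fields.

    threshold: exceedance threshold (placeholder, value to be decided).
    window:    neighborhood size in grid points.
    """
    # Fraction of exceeding grid points within each neighborhood.
    f_frac = uniform_filter((fcst > threshold).astype(float), size=window)
    o_frac = uniform_filter((obs > threshold).astype(float), size=window)

    mse = np.mean((f_frac - o_frac) ** 2)
    mse_ref = np.mean(f_frac ** 2) + np.mean(o_frac ** 2)
    return 1.0 - mse / mse_ref if mse_ref > 0 else np.nan
```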

HirtM (Collaborator) commented Jan 14, 2021

Regarding the F1 score, I only implemented the binary version; again, different thresholds are possible. In principle, an F1 score with multiple categories instead of just two would also be possible, but I am not sure that is what we want.
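
For reference, the binary version boils down to this (the threshold value is a placeholder):

```python
import numpy as np

def f1_binary(fcst, obs, threshold=1.0):
    """Binary F1 score after thresholding both fields."""
    f = fcst > threshold
    o = obs > threshold
    tp = np.sum(f & o)    # hits
    fp = np.sum(f & ~o)   # false alarms
    fn = np.sum(~f & o)   # misses
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else np.nan
```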

HirtM (Collaborator) commented Jan 19, 2021

First comparisons with the 001-Upscale_valid Generator prediction showed improvements over our baseline in RMSE, FSS and F1 score (although the precipitation fields look a bit blurred). A proper implementation of the evaluation routine (e.g. latitude selection, ...) is still required.

[image: RMSE, FSS and F1 comparison of the Generator prediction against the baseline]

@HirtM HirtM moved this from In progress to To do in NWP downscaling Jan 25, 2021
raspstephan (Owner, Author) commented

Some thoughts on the train/valid/test split now that we have 3 years of data (2018-2020).

At first I thought we could just use 2018/19 for training and 2020 for validation. But I think that manual overfitting could become an issue, especially if we use things like early stopping, and therefore it might be better to have a third dataset for testing. So my current solution is to use the first 6 days of each month in 2018/19 for validation during model training, and then use 2020 only for the external validation you have done.

The downside is that we lose 1/5 of our training data, but I think this is the more proper approach. This leaves us with 40k training samples, which is a lot, but of course they are quite unevenly distributed.
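
In xarray terms, the split would look roughly like this (the file and dimension names are assumptions):

```python
import xarray as xr

# Hypothetical combined 2018-2020 dataset with a `time` dimension.
ds = xr.open_dataset("samples.nc")

train_valid = ds.sel(time=slice("2018-01-01", "2019-12-31"))
test = ds.sel(time=slice("2020-01-01", "2020-12-31"))  # external validation only

# First 6 days of each month in 2018/19 -> validation, the rest -> training.
is_valid = train_valid.time.dt.day <= 6
valid = train_valid.isel(time=is_valid)
train = train_valid.isel(time=~is_valid)
```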

@raspstephan raspstephan moved this from To do to Backlog in NWP downscaling Apr 23, 2021