Define evaluation metrics and compute baseline scores #11
Comments
That sounds like a good list! I was thinking of quite similar scores. I would vote for simple cell size distributions, since RDFs or other cell-size methods are not trivial to interpret. A 48 h lead time for evaluation sounds good. Either we cut out 2 days after each 7-day evaluation period from the training data, or we reduce the lead time for the latter two of the 7 days, so that we strictly have different situations. But maybe we need to cut about a day before and after each evaluation period anyway, because of correlated weather situations? Regarding the baseline, do we want to have both a CRM and another simple downscaling method?
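The cell size distribution mentioned above can be computed by labeling contiguous regions that exceed a precipitation threshold. A minimal sketch using `scipy.ndimage.label`; the threshold value and function name are illustrative, not anything fixed in this thread:

```python
import numpy as np
from scipy import ndimage

def cell_size_distribution(precip, threshold=1.0):
    """Sizes (in pixels) of contiguous cells exceeding `threshold`.

    `precip` is a 2-D precipitation field; the 1.0 mm/h threshold is an
    illustrative choice, not one agreed in the discussion.
    """
    binary = precip > threshold
    labeled, n_cells = ndimage.label(binary)  # 4-connectivity by default
    # Pixel count of each labeled cell (background label 0 is excluded)
    sizes = ndimage.sum(binary, labeled, index=range(1, n_cells + 1))
    return np.asarray(sizes, dtype=int)

# Toy field: two separate cells of sizes 3 and 1
field = np.array([
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 3.0],
])
print(sorted(cell_size_distribution(field).tolist()))  # -> [1, 3]
```

The resulting size samples can then be histogrammed and compared between forecast and radar observations.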
I will try to organize the CRM data. Regarding the overlap between train/valid, my intuition would be to ignore it for starters. But should this ever end up in a paper, we should do it properly. Hopefully we will have enough data to take an entire year for validation. By the way, here is the xskillscore package, which is quite nice, especially for ensemble metrics: https://xskillscore.readthedocs.io/en/stable/index.html
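For the ensemble metrics, xskillscore offers `xs.crps_ensemble` on xarray objects. To make the metric itself concrete, here is a plain-numpy sketch of the ensemble CRPS in its standard form (mean absolute error of the members minus half the mean pairwise member spread); this is a reference implementation for illustration, not the package's code:

```python
import numpy as np

def crps_ensemble(obs, ens):
    """CRPS for a scalar observation `obs` and a 1-D ensemble `ens`:
    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|.
    """
    ens = np.asarray(ens, dtype=float)
    term1 = np.abs(ens - obs).mean()
    term2 = 0.5 * np.abs(ens[:, None] - ens[None, :]).mean()
    return term1 - term2

# A spread-free, perfect ensemble scores 0; for a single member the
# CRPS reduces to the absolute error.
print(crps_ensemble(1.0, [1.0, 1.0, 1.0]))  # -> 0.0
print(crps_ensemble(0.0, [2.0]))            # -> 2.0
```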
Great, thanks. I moved the to-do list up to the top so that it shows up in the project.
FSS is added. I have a few questions about the code structure (using classes etc.) and about how to call the whole evaluation process. Let's discuss that on Thursday.
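For reference, the Fractions Skill Score compares neighborhood exceedance fractions of forecast and observation. A minimal sketch with `scipy.ndimage.uniform_filter`; the threshold and window size are free parameters here, and this is not the implementation added in the repository:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fss(forecast, observed, threshold, window):
    """Fractions Skill Score:
    FSS = 1 - MSE(Pf, Po) / (mean(Pf^2) + mean(Po^2)),
    where Pf, Po are neighborhood fractions of threshold exceedance.
    1 is a perfect score, 0 means no skill.
    """
    bf = (forecast >= threshold).astype(float)
    bo = (observed >= threshold).astype(float)
    pf = uniform_filter(bf, size=window, mode="constant")
    po = uniform_filter(bo, size=window, mode="constant")
    num = np.mean((pf - po) ** 2)
    den = np.mean(pf ** 2) + np.mean(po ** 2)
    return 1.0 - num / den if den > 0 else np.nan

# Identical fields give a perfect score of 1
x = np.random.default_rng(0).random((32, 32))
print(fss(x, x, threshold=0.5, window=5))  # -> 1.0
```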
Regarding the F1-score, I only implemented the binary version; again, different thresholds are possible. In principle it would be possible to compute the F1-score over multiple categories, not just two, but I am not sure this is what we want.
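The binary, thresholded F1-score described above can be sketched as follows (again illustrative, not the repository's code; the threshold is a free parameter):

```python
import numpy as np

def f1_binary(forecast, observed, threshold):
    """Binary F1-score after thresholding both fields:
    F1 = 2*TP / (2*TP + FP + FN)."""
    f = forecast >= threshold
    o = observed >= threshold
    tp = np.sum(f & o)
    fp = np.sum(f & ~o)
    fn = np.sum(~f & o)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else np.nan

fc = np.array([0.0, 2.0, 3.0, 0.0])
ob = np.array([0.0, 2.0, 0.0, 4.0])
# threshold 1.0 -> TP=1, FP=1, FN=1 -> F1 = 2/(2+1+1) = 0.5
print(f1_binary(fc, ob, threshold=1.0))  # -> 0.5
```

A multi-category extension would compute this per class and average (e.g. macro-F1), which is the open question raised above.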
Some thoughts on the train/valid/test split now that we have 3 years of data (2018-2020). At first I thought we could just use 2018/19 for training and 2020 for validation. But manual overfitting could become an issue, especially if we use things like early stopping, so it might be better to have a third dataset for testing. My current solution is to use the first 6 days of each month in 2018/19 for validation during model training, and then use only 2020 for the external validation you have done. The downside is that we lose 1/5 of our training data, but I think this is the more proper approach. This leaves us with 40k training samples, which is a lot, though of course they are quite unevenly distributed.
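The split described above is simple enough to express as a per-sample rule; a sketch with pandas (the function name is made up for illustration):

```python
import pandas as pd

def split_label(timestamp):
    """Assign a sample to train/valid/test following the split above:
    first 6 days of each month in 2018-2019 -> validation,
    all of 2020 -> held-out test, the rest -> training."""
    t = pd.Timestamp(timestamp)
    if t.year == 2020:
        return "test"
    if t.year in (2018, 2019) and t.day <= 6:
        return "valid"
    return "train"

print(split_label("2018-03-04"))  # -> valid
print(split_label("2019-07-15"))  # -> train
print(split_label("2020-01-20"))  # -> test
```

Note this does not yet add the ~1-day buffer around validation periods discussed earlier for correlated weather situations.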
Steps:
We need to define how we want to evaluate our forecasts. That means clearly defining the metrics as well as the region and timeframe to be used.
@HirtM we already talked about the train/valid split and decided that using one week per month for validation sounds like a good choice, so we can just take the first 7 days of each month. We still need to define an area to be evaluated; for this, let's have a look at the radar quality map and choose a region. And we should do the evaluation for a range of lead times, maybe up to 48 h?
In terms of metrics, there are a ton of different options but we should restrict it to a few in order to keep things simple. We will evaluate deterministic as well as ensemble forecasts. Additionally, we want to look at statistics that describe the "realism" of our forecasts. Here are some suggestions from my side (probably too many...):
Deterministic
Probabilistic
Realism
Finally, we should have some baselines. The easiest one is simply a bilinear interpolation of the TIGGE forecast.
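The bilinear-interpolation baseline can be sketched with `scipy.ndimage.zoom` (spline order 1 is bilinear); the function name and the upsampling factor are illustrative, and a real pipeline would interpolate the TIGGE grid onto the radar grid coordinates instead:

```python
import numpy as np
from scipy.ndimage import zoom

def bilinear_baseline(coarse, factor):
    """Baseline 'downscaling': bilinear interpolation of a coarse
    forecast field to the target resolution."""
    return zoom(coarse, factor, order=1)

coarse = np.array([[0.0, 1.0],
                   [2.0, 3.0]])
fine = bilinear_baseline(coarse, factor=2)
print(fine.shape)  # -> (4, 4)
```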