# Create biweekly renku datasets from `climetlab-s2s-ai-challenge`

Goal:

- Create biweekly renku datasets from [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge).
- These renku datasets are then used in notebooks:
    - `ML_train_and_predict.ipynb` to train the ML model and do ML-based predictions
    - `RPSS_verification.ipynb` to calculate RPSS of the ML model

Requirements:
- [`climetlab`](https://github.com/ecmwf/climetlab)
- [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge)
- S2S and CPC observations uploaded on [European Weather Cloud (EWC)](https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/index.html)

Output: [renku dataset](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) `s2s-ai-challenge`
- observations
    - deterministic:
        - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`
        - `forecast-like-observations_2020_biweekly_deterministic.zarr`
    - edges:
        - `hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc`
    - probabilistic:
        - `hindcast-like-observations_2000-2019_biweekly_terciled.zarr`
        - `forecast-like-observations_2020_biweekly_terciled.nc`
- forecasts/hindcasts
    - deterministic:
        - `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
        - `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
    - more models could be added
- benchmark:
    - probabilistic:
        - `ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc`

In [1]:
import matplotlib.pyplot as plt
import xarray as xr
import xskillscore as xs
import pandas as pd

import climetlab_s2s_ai_challenge
import climetlab as cml
print(f'Climetlab version : {cml.__version__}')
print(f'Climetlab-s2s-ai-challenge plugin version : {climetlab_s2s_ai_challenge.__version__}')

xr.set_options(keep_attrs=True)
xr.set_options(display_style='text')

Climetlab version : 0.8.0
Climetlab-s2s-ai-challenge plugin version : 0.6.7




<xarray.core.options.set_options at 0x2b4cbd09de10>

In [2]:
# caching path for climetlab
cache_path = "/work/mh0727/m300524/S2S_AI/cache" # set your own path
cml.settings.set("cache-directory", cache_path)

# from here on, cache_path is the directory where the renku dataset files are written
cache_path = "../data"

# Download and cache

Download all files for the observations, forecast and hindcast.

In [3]:
# shortcut
from scripts import download
#download()

## hindcast and forecast `input`

In [4]:
# starting dates forecast_time in 2020
dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()

forecast_dataset_labels = ['training-input','test-input'] # ML community
# equiv to
forecast_dataset_labels = ['hindcast-input','forecast-input'] # NWP community

varlist_forecast = ['tp','t2m'] # can add more

center_list = ['ecmwf'] # 'ncep', 'eccc'
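
The `dates` list above holds the 53 weekly start dates of 2020. As a cross-check that does not depend on xarray, the same list can be rebuilt with the standard library. This is only an illustrative sketch and not part of the original workflow.

```python
from datetime import date, timedelta

# 53 weekly forecast start dates, beginning on Thursday 2020-01-02
start = date(2020, 1, 2)
dates = [(start + timedelta(weeks=i)).strftime("%Y%m%d") for i in range(53)]

print(dates[0], dates[-1], len(dates))
```

Because 2020 is a leap year, the 53rd weekly start date still falls inside 2020.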

In [27]:
%%time
# takes ~ 10-30 min to download for one model one variable depending on number of model realizations
# and download settings https://climetlab.readthedocs.io/en/latest/guide/settings.html 
for center in center_list:
    for ds in forecast_dataset_labels:
        cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, parameter=varlist_forecast, format='netcdf').to_xarray()

## observations `output-reference`

In [5]:
obs_dataset_labels = ['training-output-reference','test-output-reference'] # ML community
# equiv to
obs_dataset_labels = ['hindcast-like-observations','forecast-like-observations'] # NWP community

varlist_obs = ['tp', 't2m']

In [None]:
%%time
# takes 10min to download
for ds in obs_dataset_labels:
    print(ds)
    # only netcdf, no format choice
    cml.load_dataset(f"s2s-ai-challenge-{ds}", date=dates, parameter=varlist_obs).to_xarray()

In [None]:
# download obs_time to create output-reference/observations for models other than ecmwf and eccc,
# i.e. ncep or any S2S or SubX model
obs_time = cml.load_dataset("s2s-ai-challenge-observations", parameter=['t2m', 'pr']).to_xarray()

# create bi-weekly aggregates

In [6]:
from scripts import aggregate_biweekly, ensure_attributes

#aggregate_biweekly??
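
The `aggregate_biweekly` helper lives in `scripts` and its source is not reproduced in this notebook. Going by the `lead_time` attributes set in the benchmark section (`week34_t2m: mean[day 14, 27]`, `week34_tp: day 28 minus day 14`, and the same pattern for weeks 5-6), a minimal sketch of the idea could look as follows. The function names and the synthetic data are assumptions made for illustration, and the sketch assumes that `tp` is accumulated since forecast start.

```python
import numpy as np

def biweekly_mean(daily, first_day):
    """Mean over the 14 days first_day .. first_day + 13 (inclusive)."""
    return daily[first_day:first_day + 14].mean(axis=0)

def biweekly_accumulation(accumulated, first_day):
    """Biweekly total of a variable that is accumulated since forecast start."""
    return accumulated[first_day + 14] - accumulated[first_day]

# Synthetic data: 47 daily lead times (day 0 .. day 46) at a single grid point
days = np.arange(47)
t2m_daily = 280.0 + 0.1 * days   # temperature in K
tp_accumulated = 2.0 * days      # 2 kg m-2 per day, accumulated since day 0

t2m_week34 = biweekly_mean(t2m_daily, 14)               # mean[day 14, 27]
t2m_week56 = biweekly_mean(t2m_daily, 28)               # mean[day 28, 41]
tp_week34 = biweekly_accumulation(tp_accumulated, 14)   # day 28 minus day 14
tp_week56 = biweekly_accumulation(tp_accumulated, 28)   # day 42 minus day 28
```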

In [None]:
for c, center in enumerate(center_list):  # forecast centers (could also take models)
    for dsl in obs_dataset_labels + forecast_dataset_labels:  # climetlab dataset labels
        for p, parameter in enumerate(varlist_forecast):  # variables
            if c != 0 and 'observation' in dsl:  # only do once for observations 
                continue
            print(f"datasetlabel: {dsl}, center: {center}, parameter: {parameter}")
            if 'input' in dsl:
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", origin=center, parameter=parameter, format='netcdf').to_xarray()
            elif 'observation' in dsl: # obs only netcdf, no choice
                if parameter not in ['t2m', 'tp']:
                    continue
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", parameter=parameter, date=dates).to_xarray()

            if p == 0:
                ds_biweekly = ds.map(aggregate_biweekly)
            else:
                ds_biweekly[parameter] = ds.map(aggregate_biweekly)[parameter]

            ds_biweekly = ds_biweekly.map(ensure_attributes, biweekly=True)

        if 'test' in dsl:
            ds_biweekly = ds_biweekly.chunk('auto')
        else:
            ds_biweekly = ds_biweekly.chunk({'forecast_time':'auto','lead_time':-1,'longitude':-1,'latitude':-1})

        if 'hindcast' in dsl:
            time = f'{int(ds_biweekly.forecast_time.dt.year.min())}-{int(ds_biweekly.forecast_time.dt.year.max())}'
            if 'input' in dsl:
                name = f'{center}_{dsl}'
            elif 'observation' in dsl:  # observations carry no model prefix
                name = dsl

        elif 'forecast' in dsl:
            time = '2020'
            if 'input' in dsl:
                name = f'{center}_{dsl}'
            elif 'observation' in dsl:  # observations carry no model prefix
                name = dsl
        else:
            assert False, f'unexpected dataset label: {dsl}'

        # pattern: {model_if_not_observations}{observations/forecast/hindcast}_{time}_biweekly_deterministic.zarr
        zp = f'{cache_path}/{name}_{time}_biweekly_deterministic.zarr'
        ds_biweekly.attrs.update({'postprocessed':'by https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/blob/master/notebooks/renku_datasets_biweekly.ipynb'})
        print(f'save to: {zp}')
        ds_biweekly.astype('float32').to_zarr(zp, consolidated=True, mode='w')
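
The naming logic in the loop above can be isolated into a small function, which makes the file name pattern easier to check. `biweekly_store_name` is a hypothetical helper written for this illustration; it is not part of `scripts`.

```python
def biweekly_store_name(dsl, center, time, cache_path="../data"):
    """Build the zarr path for one climetlab dataset label (hypothetical helper)."""
    if "input" in dsl:
        name = f"{center}_{dsl}"  # model output is prefixed with the center
    elif "observation" in dsl:
        name = dsl                # observations are model independent
    else:
        raise ValueError(f"unexpected dataset label: {dsl}")
    return f"{cache_path}/{name}_{time}_biweekly_deterministic.zarr"


print(biweekly_store_name("hindcast-input", "ecmwf", "2000-2019"))
print(biweekly_store_name("forecast-like-observations", "ecmwf", "2020"))
```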

## add to `renku` dataset `s2s-ai-challenge`

In [None]:
# observations as hindcast
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr

In [None]:
# for further use retrieve from git lfs
# !renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr

In [8]:
obs_2000_2019 = xr.open_zarr(f"{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(obs_2000_2019.sizes,'\n',obs_2000_2019.coords,'\n', obs_2000_2019.nbytes/1e6,'MB')

Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240})) 
 Coordinates:
  * forecast_time  (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray> 
 492.546744 MB


In [8]:
# observations as forecast
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr

In [9]:
obs_2020 = xr.open_zarr(f"{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
print(obs_2020.sizes,'\n',obs_2020.coords,'\n', obs_2020.nbytes/1e6,'MB')

Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240})) 
 Coordinates:
  * forecast_time  (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray> 
 24.630096 MB


In [10]:
# ecmwf hindcast-input
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr

In [11]:
hind_2000_2019 = xr.open_zarr(f"{cache_path}/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(hind_2000_2019.sizes,'\n',hind_2000_2019.coords,'\n', hind_2000_2019.nbytes/1e6,'MB')

Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 11})) 
 Coordinates:
  * forecast_time  (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
  * realization    (realization) int64 0 1 2 3 4 5 6 7 8 9 10
    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray> 
 5417.730832 MB


In [12]:
# ecmwf forecast-input
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr

In [13]:
fct_2020 = xr.open_zarr(f"{cache_path}/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
print(fct_2020.sizes,'\n',fct_2020.coords,'\n', fct_2020.nbytes/1e6,'MB')

Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 51})) 
 Coordinates:
  * forecast_time  (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
  * realization    (realization) int64 0 1 2 3 4 5 6 7 ... 44 45 46 47 48 49 50
    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray> 
 1255.926504 MB


# tercile edges

Create two tercile edges at the 1/3 and 2/3 quantiles of the 2000-2019 biweekly distribution for each week of the year.
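
As a rough illustration of what the quantile step computes, the following NumPy sketch derives both edges for every week of the year from synthetic data at one grid point. The array shapes and the random data are assumptions; the notebook itself performs this step with xarray's `groupby(...).quantile(...)` on the full gridded dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic biweekly values: 20 years x 53 weekly start dates at one grid point
n_years, n_weeks = 20, 53
values = rng.normal(loc=0.0, scale=1.0, size=(n_years, n_weeks))

# For every week of the year, take the 1/3 and 2/3 quantiles across the years.
# The result has shape (category_edge=2, week=53).
tercile_edges = np.quantile(values, q=[1 / 3, 2 / 3], axis=0)

# Fraction of all values below the lower edge and above the upper edge
below = (values < tercile_edges[0]).mean()
above = (values > tercile_edges[1]).mean()
print(tercile_edges.shape, below, above)
```

By construction, roughly one third of the 2000-2019 values fall into each of the three categories.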

In [14]:
tercile_file = f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'

In [15]:
%%time
xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr',
             consolidated=True).chunk({'forecast_time':-1,'longitude':'auto'}).groupby('forecast_time.weekofyear').quantile(q=[1./3.,2./3.], dim=['forecast_time']).rename({'quantile':'category_edge'}).astype('float32').to_netcdf(tercile_file)



CPU times: user 21min 25s, sys: 9min 19s, total: 30min 45s
Wall time: 18min 4s


In [16]:
tercile_edges = xr.open_dataset(tercile_file)

tercile_edges

In [17]:
tercile_edges.nbytes*1e-6,'MB'

(49.255184, 'MB')

In [18]:
# run renku commands from projects root directory only
# tercile edges
#!renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc

In [19]:
# to use the file, retrieve it from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
#xr.open_dataset("../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc")

# observations in categories

- Count how many deterministic forecast realizations fall into each tercile category, as is done when computing the RPS.
- Sort the 2020 forecast-like observations into the same tercile categories.
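
The `make_probabilistic` helper used below is imported from `scripts` and is not shown in this notebook. For a single observation, sorting a value into tercile categories amounts to a one-hot encoding. The following is a minimal sketch of that idea; the function name `to_tercile_probabilities` and the sample values are assumptions.

```python
import numpy as np

def to_tercile_probabilities(values, lower_edge, upper_edge):
    """One-hot encode values into [below normal, near normal, above normal]."""
    below = values < lower_edge
    above = values > upper_edge
    near = ~below & ~above
    probs = np.stack([below, near, above]).astype("float32")
    # keep missing values (e.g. masked grid points) missing in every category
    probs[:, np.isnan(values)] = np.nan
    return probs

obs = np.array([0.2, 1.5, 3.0, np.nan])
probs = to_tercile_probabilities(obs, lower_edge=1.0, upper_edge=2.0)
print(probs)
```

Each non-missing column sums to 1, which matches the requirement that the three tercile probabilities add up to 1.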

In [20]:
obs_2020 = xr.open_zarr(f'{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr', consolidated=True)
obs_2020.sizes

Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))

In [21]:
# create a mask for land grid
mask = obs_2020.std(['lead_time','forecast_time']).notnull()

In [22]:
# mask.to_array().plot(col='variable')

In [23]:
# total precipitation in arid regions is masked
# Frederic Vitart suggested by email: "Based on your map we could mask all the areas where the lower tercile boundary is lower than 0.1 mm"
# we are using a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1
th = 0.01
tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > th
#tp_arid_mask.where(mask.tp).plot(col='forecast_time', col_wrap=4)
#plt.suptitle(f'dry mask: week 3-4 tp 1/3 category_edge > {th} kg m-2',y=1., x=.4)
#plt.savefig('dry_mask.png')
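
The dry mask keeps only grid points whose lower tercile edge exceeds the threshold `th`. A small NumPy sketch with synthetic values shows the effect; how the mask is applied downstream is not shown in this notebook, so the `np.where` step is an assumption.

```python
import numpy as np

th = 0.01  # threshold in kg m-2, as in the cell above

# Synthetic lower tercile edges and biweekly precipitation at four grid points
lower_tercile_edge = np.array([0.0, 0.005, 0.02, 1.3])
tp = np.array([0.0, 0.1, 0.4, 5.0])

keep = lower_tercile_edge > th          # True where the climatology is wet enough
tp_masked = np.where(keep, tp, np.nan)  # arid points become missing values
print(keep, tp_masked)
```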

In [24]:
# look into tercile edges

In [25]:
#tercile_edges.isel(forecast_time=0)['tp'].plot(col='lead_time',row='category_edge', robust=True)

In [26]:
#tercile_edges.isel(forecast_time=[0,20],category_edge=1)['tp'].plot(col='lead_time', row='forecast_time', robust=True)

In [27]:
# tercile_edges.tp.mean(['forecast_time']).plot(col='lead_time',row='category_edge',vmax=.5)

## categorize observations

### forecast 2020

In [28]:
from scripts import make_probabilistic

In [29]:
obs_2020_p = make_probabilistic(obs_2020, tercile_edges, mask=mask)



In [30]:
obs_2020_p.nbytes/1e6, 'MB'

(147.75984, 'MB')

In [31]:
obs_2020_p.astype('float32').to_netcdf(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc')



In [32]:
# forecast-like-observations terciled
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_terciled.nc

In [33]:
# to use the file, retrieve it from git lfs
#!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
xr.open_dataset("../data/forecast-like-observations_2020_biweekly_terciled.nc")

### hindcast 2000_2019

In [34]:
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)

In [35]:
obs_2000_2019_p = make_probabilistic(obs_2000_2019, tercile_edges, mask=mask)



In [36]:
obs_2000_2019_p.nbytes/1e6, 'MB'

(2955.138888, 'MB')

In [37]:
obs_2000_2019_p.astype('float32').chunk('auto').to_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', consolidated=True, mode='w')



<xarray.backends.zarr.ZarrStore at 0x2b34e40d80c0>

In [38]:
# hindcast-like-observations terciled
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr

In [39]:
# to use the file, retrieve it from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")

# Benchmark

center: ECMWF

The calibration has been performed by using the tercile boundaries from the model climatology rather than from observations. Script by Frederic Vitart.
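
The benchmark script itself is not part of this notebook. As a hedged illustration of what calibrating against the model climatology can mean, the sketch below takes the tercile boundaries from a synthetic hindcast ensemble and then counts the fraction of forecast members in each category. All data, ensemble sizes and the constant model bias are assumptions, and the actual procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic model climatology: 20 hindcast years x 11 members for one week and grid point
hindcast = rng.normal(loc=1.0, scale=1.0, size=(20, 11))
# Synthetic real-time forecast: 51 members for the same week and grid point
forecast = rng.normal(loc=1.5, scale=1.0, size=51)

# Tercile boundaries from the model's own climatology, not from observations
lower, upper = np.quantile(hindcast, [1 / 3, 2 / 3])

# Category probability = fraction of forecast members in each tercile
probs = np.array([
    (forecast < lower).mean(),
    ((forecast >= lower) & (forecast <= upper)).mean(),
    (forecast > upper).mean(),
])
print(probs, probs.sum())
```

Because the boundaries come from the model itself, a constant model bias affects hindcast and forecast alike and does not shift the category probabilities.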

In [40]:
bench_p = cml.load_dataset("s2s-ai-challenge-test-output-benchmark", parameter=['tp','t2m']).to_xarray()


By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. 






In [41]:
bench_p['category'].attrs = {'long_name': 'tercile category probabilities', 'units': '1',
                        'description': 'Probabilities for three tercile categories. All three tercile category probabilities must add up to 1.'}

In [42]:
bench_p['lead_time'] = [pd.Timedelta(f"{i} d") for i in [14, 28]] # take first day of biweekly average as new coordinate

bench_p['lead_time'].attrs = {'long_name':'forecast_period', 'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
                         'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
                         'week34_t2m': 'mean[day 14, 27]',
                         'week56_t2m': 'mean[day 28, 41]',
                         'week34_tp': 'day 28 minus day 14',
                         'week56_tp': 'day 42 minus day 28'}

In [43]:
bench_p = bench_p / 100 # convert percent to [0-1] probability

In [45]:
bench_p = bench_p.map(ensure_attributes, biweekly=True)



In [46]:
# bench_p.isel(forecast_time=2).t2m.plot(row='lead_time', col='category')

In [47]:
bench_p

In [48]:
bench_p.astype('float32').to_netcdf('../data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc')

In [None]:
#!renku dataset add s2s-ai-challenge data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc