# Data Access from EWC via `intake`

The data is readily available via `climetlab`: https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge
The data holdings are listed at: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-input/0.3.0/netcdf/index.html

Because the files are hosted on S3 and reachable via plain HTTPS URLs, they can also be read with `intake-xarray` and cached locally with `fsspec`.

In [1]:
import intake
import fsspec
import xarray as xr
import os, glob
import pandas as pd
xr.set_options(display_style='text')



<xarray.core.options.set_options at 0x7fa0100dcdc0>

In [2]:
# prevent aiohttp timeout errors

from aiohttp import ClientTimeout
timeout = ClientTimeout(total=600)
fsspec.config.conf['https'] = dict(client_kwargs={'timeout': timeout})

# intake

`intake-xarray` (https://github.com/intake/intake-xarray) can read and cache `grib` and `netcdf` files described in a catalog.

Caching via `fsspec`: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally

In [3]:
import intake_xarray
cache_path = '/work/s2s-ai-challenge-template/data/cache'
fsspec.config.conf['simplecache'] = {'cache_storage': cache_path, 'same_names':True}
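
To make the caching behaviour concrete, here is a minimal sketch of `fsspec` URL chaining with `simplecache::`. It assumes a small local dummy file (`demo.nc`, a hypothetical stand-in for a remote file) so that it runs without network access; the options are passed per call rather than through `fsspec.config.conf` as in the cell above.

```python
import os
import tempfile

import fsspec

# Hypothetical stand-in for a remote file: a small local file addressed via file://
src_dir = tempfile.mkdtemp()
cache_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "demo.nc")
with open(src, "wb") as f:
    f.write(b"not a real netcdf file")

# Prefixing the URL with "simplecache::" asks fsspec to copy the file into
# cache_storage on first access and to hand back the path of the local copy.
# same_names=True keeps the original file name instead of a hash.
local_path = fsspec.open_local(
    f"simplecache::file://{src}",
    simplecache={"cache_storage": cache_dir, "same_names": True},
)

print(local_path)  # a path inside cache_dir ending in demo.nc
```

With a remote source, the same prefix would go in front of the `https://` URL, so repeated reads should hit the local copy rather than the network.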

In [4]:
%%writefile EWC_catalog.yml
plugins:
  source:
    - module: intake_xarray

sources:
  training-input:
    description: climetlab name in AI/ML community naming for hindcasts as input to the ML-model in training period
    driver: netcdf
    parameters:
      model:
        description: name of the S2S model
        type: str
        default: ecmwf
        allowed: [ecmwf, eccc, ncep]
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: weekly initialization dates (Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
      version:
        description: versioning of the data
        type: str
        default: 0.3.0
      format:
        description: data type
        type: str
        default: netcdf
        allowed: [netcdf, grib]
      ending:
        description: data format compatible with format; netcdf -> nc, grib -> grib
        type: str
        default: nc
        allowed: [nc, grib]
    xarray_kwargs:
        engine: h5netcdf
    args: # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/{{version}}/{{format}}/{{model}}-hindcast-{{param}}-{{date.strftime("%Y%m%d")}}.{{ending}}

  test-input:
    description: climetlab name in AI/ML community naming for 2020 forecasts as input to ML model in test period 2020
    driver: netcdf
    parameters:
      model:
        description: name of the S2S model
        type: str
        default: ecmwf
        allowed: [ecmwf, eccc, ncep]
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: weekly initialization dates (Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
      version:
        description: versioning of the data
        type: str
        default: 0.3.0
      format:
        description: data type
        type: str
        default: netcdf
        allowed: [netcdf, grib]
      ending:
        description: data format compatible with format; netcdf -> nc, grib -> grib
        type: str
        default: nc
        allowed: [nc, grib]
    xarray_kwargs:
        engine: h5netcdf
    args: # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-input/{{version}}/{{format}}/{{model}}-forecast-{{param}}-{{date.strftime("%Y%m%d")}}.{{ending}}

  training-output-reference:
    description: climetlab name in AI/ML community naming for observations matching hindcasts, as output reference to compare ML model output to in training period
    driver: netcdf
    parameters:
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: weekly initialization dates (Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
    xarray_kwargs:
        engine: h5netcdf
    args: # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-output-reference/{{param}}-{{date.strftime("%Y%m%d")}}.nc
            
  test-output-reference:
    description: climetlab name in AI/ML community naming for observations matching 2020 forecasts, as output reference to compare ML model output to in test period 2020
    driver: netcdf
    parameters:
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: weekly initialization dates (Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
    xarray_kwargs:
        engine: h5netcdf
    args: # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-output-reference/{{param}}-{{date.strftime("%Y%m%d")}}.nc

Writing EWC_catalog.yml
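
To see what the `urlpath` templates of the two input sources expand to, the sketch below rebuilds the same string in plain Python. The helper `build_input_url` and the `ENDINGS` mapping are hypothetical and not part of `intake`; they only mirror the Jinja template above. Deriving the ending from the format avoids having to keep the separate `format` and `ending` parameters consistent by hand.

```python
from datetime import date

BASE = "https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data"

# File ending implied by each format, as stated in the catalog description
ENDINGS = {"netcdf": "nc", "grib": "grib"}

# Word used in the file name for each input source of the catalog
KINDS = {"training-input": "hindcast", "test-input": "forecast"}


def build_input_url(dataset, model, param, init_date, version="0.3.0", fmt="netcdf"):
    """Mirror the urlpath template of the training-input/test-input sources."""
    kind = KINDS[dataset]
    ending = ENDINGS[fmt]
    return (
        f"{BASE}/{dataset}/{version}/{fmt}/"
        f"{model}-{kind}-{param}-{init_date:%Y%m%d}.{ending}"
    )


url = build_input_url("test-input", "ecmwf", "t2m", date(2020, 3, 12))
print(url)
```

Unknown dataset or format names raise a `KeyError` here, which loosely corresponds to the `allowed` lists in the catalog.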


In [5]:
cat = intake.open_catalog('EWC_catalog.yml')

In [6]:
# dates for 2020 forecasts and their on-the-fly reforecasts
dates = pd.date_range(start='2020-01-02', freq='7D', end='2020-12-31')
dates

DatetimeIndex(['2020-01-02', '2020-01-09', '2020-01-16', '2020-01-23',
               '2020-01-30', '2020-02-06', '2020-02-13', '2020-02-20',
               '2020-02-27', '2020-03-05', '2020-03-12', '2020-03-19',
               '2020-03-26', '2020-04-02', '2020-04-09', '2020-04-16',
               '2020-04-23', '2020-04-30', '2020-05-07', '2020-05-14',
               '2020-05-21', '2020-05-28', '2020-06-04', '2020-06-11',
               '2020-06-18', '2020-06-25', '2020-07-02', '2020-07-09',
               '2020-07-16', '2020-07-23', '2020-07-30', '2020-08-06',
               '2020-08-13', '2020-08-20', '2020-08-27', '2020-09-03',
               '2020-09-10', '2020-09-17', '2020-09-24', '2020-10-01',
               '2020-10-08', '2020-10-15', '2020-10-22', '2020-10-29',
               '2020-11-05', '2020-11-12', '2020-11-19', '2020-11-26',
               '2020-12-03', '2020-12-10', '2020-12-17', '2020-12-24',
               '2020-12-31'],
              dtype='datetime64[ns]', freq='7D')
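
As a quick sanity check that these initialization dates are the weekly Thursdays the catalog expects, the same range can be rebuilt with the standard library only. This is a sketch; the name `init_dates` is introduced here for illustration.

```python
from datetime import date, timedelta

start, end = date(2020, 1, 2), date(2020, 12, 31)

# Step through 2020 in 7-day increments, like pd.date_range(..., freq='7D')
init_dates = []
current = start
while current <= end:
    init_dates.append(current)
    current += timedelta(days=7)

# date.weekday() counts Monday as 0, so Thursday is 3
assert all(d.weekday() == 3 for d in init_dates)

print(len(init_dates), init_dates[10])
```

`init_dates[10]` corresponds to `dates[10]`, the date used in the cells that follow.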

# `hindcast-input`

on-the-fly hindcasts corresponding to the 2020 forecasts

In [7]:
cat['training-input'](date=dates[10], param='tp', model='eccc').to_dask()

/opt/conda/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header


# `forecast-input`

real-time forecasts for 2020

In [8]:
cat['test-input'](date=dates[10], param='t2m', model='ecmwf').to_dask()

# `hindcast-like-observations`

observations matching hindcasts

In [9]:
cat['training-output-reference'](date=dates[10], param='t2m').to_dask()

# `forecast-like-observations`

observations matching 2020 forecasts

In [10]:
cat['test-output-reference'](date=dates[10], param='t2m').to_dask()