# Preconfigured preprocessors

For production runs, there are preprocessor classes that are preconfigured for different input datasets, e.g., CMIP, ERA5, or REMO output (double nesting). These are mostly set up to work with data at DKRZ, e.g., they read input data from the CMIP or ERA5 data pool and largely process it on the fly, without the need to store duplicated global model data.

In [4]:
%load_ext autoreload
%autoreload 2



For the preprocessor classes, it is usually a good idea to start a dask client to define your processing resources. If you want to write a lot of NetCDF files, it is also best to avoid multithreading. The following should usually work at DKRZ:

In [5]:
from dask.distributed import Client

client = Client(
    dashboard_address="localhost:8787", n_workers=16, threads_per_worker=1
)  # multithreading does not work well with CDO



Now, you can create a preprocessor instance, e.g., for ERA5, you can choose the `ERA5Preprocessor`:

In [6]:
from pyremo.preproc import ERA5Preprocessor

preprocessor = ERA5Preprocessor(
    expid="000000",
    surflib="/work/ch0636/remo/surflibs/cordex/lib_EUR-11_frac.nc",
    domain="EUR-11",
    vc="vc_49lev",
    scratch="/scratch/g/g300046",
    outpath="/scratch/g/g300046/000000/xa/{date:%Y}/{date:%m}",
)
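The `outpath` argument above contains `{date:%Y}` and `{date:%m}` placeholders. The following is a minimal sketch of how such a template expands, assuming it is filled with Python's standard `str.format` and a `datetime` object passed as `date`. The helper `expand_outpath` is hypothetical and only serves as an illustration; it is not part of pyremo.

```python
from datetime import datetime


def expand_outpath(template, date):
    """Fill a path template that uses {date:...} format fields.

    Assumption: the template is expanded with str.format, which hands
    the format spec (e.g. %Y) to datetime's strftime-style formatting.
    """
    return template.format(date=date)


# Same template string as the `outpath` argument above.
template = "/scratch/g/g300046/000000/xa/{date:%Y}/{date:%m}"

# Expand the template for a time step in January 2000.
print(expand_outpath(template, datetime(2000, 1, 1)))
```

Under this assumption, each output file would end up in a directory for its own year and month, so a long run is split into one directory per month.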

The preprocessor for ERA5 creates intermediate ERA5 NetCDF files (using CDO) in a CF-like format (“gfile”) and stores them in your scratch location. These files are removed automatically after processing and are only used on the fly. Ensure you have enough scratch disk space if preprocessing multiple years.

The easiest way to run the preprocessor is to use the `run` method. If you want to write NetCDF files, you should choose `write=True`. The option `compute=True` starts the processing immediately instead of returning dask delayed objects.

In [None]:
afiles = preprocessor.run(
    "2000-01-01T00:00:00", "2000-02-01T00:00:00", write=True, compute=True
)
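To get a rough idea of how many files such a call produces, the time steps can be enumerated with the standard library. This is only a sketch under stated assumptions: the six-hourly interval and the `a<expid>a<YYYYMMDDHH>.nc` naming pattern are inferred from the file names that `run` returns in this notebook, and treating the end date as exclusive is a guess rather than documented behaviour. The helper `expected_afiles` is hypothetical and not part of pyremo.

```python
from datetime import datetime, timedelta


def expected_afiles(start, end, expid="000000", step_hours=6):
    """List hypothetical afile names for each time step in [start, end).

    Assumptions: six-hourly steps, an exclusive end date, and the
    naming pattern a<expid>a<YYYYMMDDHH>.nc.
    """
    date = datetime.fromisoformat(start)
    stop = datetime.fromisoformat(end)
    names = []
    while date < stop:
        names.append(f"a{expid}a{date:%Y%m%d%H}.nc")
        date += timedelta(hours=step_hours)
    return names


names = expected_afiles("2000-01-01T00:00:00", "2000-02-01T00:00:00")
print(len(names), names[0], names[-1])
```

Comparing this count with the length of the tuple returned by `run` is a quick sanity check that no time step was skipped. If the end date turns out to be inclusive, expect one more file.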

The `run` method returns the paths of all afiles created by the preprocessor, e.g.:

In [9]:
afiles

('/scratch/g/g300046/000000/xa/2000/01/a000000a2000010100.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010106.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010112.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010118.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010200.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010206.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010212.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010218.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010300.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010306.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010312.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010318.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010400.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010406.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a2000010412.nc',
 '/scratch/g/g300046/000000/xa/2000/01/a000000a20000104