Example on using `preprocess` with `mfdataset` #2313

dcherian · 2018-07-25T21:31:34Z

I wrote this little notebook today while trying to get some satellite data in form that was nice to work with: https://gist.github.com/dcherian/66269bc2b36c2bc427897590d08472d7

I think it would make a useful example for the docs.

A few questions:

Do you think it'd be a good addition to the examples?
Is this the recommended way of adding meaningful co-ordinates, expanding dims etc.? The main bit is this function:

def preprocess(ds):

        dsnew = ds.copy()
        dsnew['latitude'] = xr.DataArray(np.linspace(90, -90, 180),
                                         dims=['phony_dim_0'])
        dsnew['longitude'] = xr.DataArray(np.linspace(-180, 180, 360),
                                          dims=['phony_dim_1'])
        dsnew = (dsnew.rename({'l3m_data': 'sss',
                               'phony_dim_0': 'latitude',
                               'phony_dim_1': 'longitude'})
                 .set_coords(['latitude', 'longitude'])
                 .drop('palette'))

        dsnew['time'] = (pd.to_datetime(dsnew.attrs['time_coverage_start'])
                         + np.timedelta64(3, 'D') + np.timedelta64(12, 'h'))
        dsnew = dsnew.expand_dims('time').set_coords('time')

        return dsnew

Also open to other feedback...

The text was updated successfully, but these errors were encountered:

fujiisoup · 2018-08-07T00:18:02Z

There is a related question on SO.

I think it is a good idea to add an example to our doc.
Agreed for the question 1.
I did not yet examine the question 2, but I think simple examples are generally nice.

raybellwaves · 2021-02-13T03:20:58Z

Edit: Copied and pasted from a duplicate issue I opened. Closing that and moving convo here.

@jhamman's SO answer circa 2018 helped me this week https://stackoverflow.com/a/51714004/6046019

I wonder if it's worth (not sure where) providing an example of how to use preprocesses with open_mfdataset?

Add an Examples entry to the doc string? (http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html /

xarray/xarray/backends/api.py

Line 895 in 5296ed1

"""

)

While not a small example (as the remote files are large) this is how I used it:

import xarray as xr
import s3fs


def preprocess(ds):
    return ds.expand_dims('time')

fs = s3fs.S3FileSystem(anon=True)
f1 = fs.open('s3://fmi-opendata-rcrhirlam-surface-grib/2021/02/03/00/numerical-hirlam74-forecast-MaximumWind-20210203T000000Z.grb2')
f2 = fs.open('s3://fmi-opendata-rcrhirlam-surface-grib/2021/02/03/06/numerical-hirlam74-forecast-MaximumWind-20210203T060000Z.grb2')

ds = xr.open_mfdataset([f1, f2], engine="cfgrib", preprocess=preprocess, parallel=True)

with one file looking like:

xr.open_dataset("LOCAL_numerical-hirlam74-forecast-MaximumWind-20210203T000000Z.grb2", engine="cfgrib")
<xarray.Dataset>
Dimensions:            (latitude: 947, longitude: 5294, step: 55)
Coordinates:
    time               datetime64[ns] ...
  * step               (step) timedelta64[ns] 01:00:00 ... 2 days 07:00:00
    heightAboveGround  int64 ...
  * latitude           (latitude) float64 25.65 25.72 25.78 ... 89.86 89.93 90.0
  * longitude          (longitude) float64 -180.0 -179.9 -179.9 ... 179.9 180.0
    valid_time         (step) datetime64[ns] ...
Data variables:
    fg10               (step, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2021-02-12T18:06:52 GRIB to CDM+CF via cfgrib-0....

A smaller example could be (WIP; note I was hoping ds would concat along t but it doesn't do what I expect)

import numpy as np
import xarray as xr

f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f1")
f1 = f1.assign_coords(t=0)
f1.to_dataset().to_zarr("f1.zarr") # What's the best way to store small files to open again with mf_dataset? csv via xarray objects? can you use open_mfdataset on pkl objects?

f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f2")
f2 = f2.assign_coords(t=1)
f2.to_dataset().to_zarr("f2.zarr")

# Concat along t
def preprocess(ds):
    return ds.expand_dims('t')
ds = xr.open_mfdataset(["f1.zarr", "f2.zarr"], engine="zarr", concat_dim="t", preprocess=preprocess)
>>> ds
<xarray.Dataset>
Dimensions:  (a: 2, t: 1)
Coordinates:
  * t        (t) int64 0
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>
    f2       (t, a) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>

chuaxr · 2022-03-09T10:13:09Z

Seconding @dcherian's comment in #4901 on an example for .encoding['source']. Working off @raybellwaves' example, something like this would have been useful to me:

>>> import xarray as xr
>>> import numpy as np
>>> model1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name="f")
>>> model1.to_dataset().to_netcdf("model1.nc")
>>> model2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name="f")
>>> model2.to_dataset().to_netcdf("model2.nc")
>>> ds = xr.open_mfdataset(
...     ["model1.nc", "model2.nc"],
...     preprocess=lambda ds: ds.expand_dims(
...         {"model_name": [ds.encoding["source"].split("/")[-1].split(".")[0]]}
...     ),
... )
>>> ds
<xarray.Dataset>
Dimensions:     (dim_0: 2, model_name: 2)
Coordinates:
  * dim_0       (dim_0) int64 0 1
  * model_name  (model_name) object 'model1' 'model2'
Data variables:
    f           (model_name, dim_0) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>

On that note, the example above seems to work with some slight changes:

>>> import numpy as np
>>> import xarray as xr
>>> 
>>> f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f1")
>>> f1 = f1.assign_coords(t='t0')
>>> f1.to_dataset().to_netcdf("f1.nc")
>>> 
>>> f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f2")
>>> f2 = f2.assign_coords(t='t1')
>>> f2.to_dataset().to_netcdf("f2.nc")
>>> 
>>> # Concat along t
>>> def preprocess(ds):
...     return ds.expand_dims("t")
... 
>>> 
>>> ds = xr.open_mfdataset(["f1.nc", "f2.nc"], concat_dim="t", preprocess=preprocess)
>>> ds
<xarray.Dataset>
Dimensions:  (a: 2, t: 2)
Coordinates:
  * t        (t) object 't0' 't1'
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
    f2       (t, a) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>

jibcar · 2022-05-24T01:31:22Z

Hello:

I have to find maximum precipitation of each year (for example: 2007 and 2008, Dataset link are: 2007 and 2008). I have done this using resample method (i.e. .resample(time='Y').max()) after concatenating it along time dimension.

Following along SO, I am wondering if I can use preprocess to find maximum (or minimum or average) for each file first and then concatenate it using time dimension. I tried the following code and was not successful. Can someone help me with this?

import numpy as np
import xarray as xr

from dask.distributed import Client
client = Client()
client

def preprocess_func(ds):
    '''Get maximum (or minimum or average) from each file and concatenate along time'''
    return ds.precip.max('time')

prec_ds=xr.open_mfdataset([prec_2007,prec_2008],
                       chunks={"lat": 25,"lon": 25,"time": -1,},
                       preprocess=preprocess_func,
                       concat_dim='time')```

dcherian · 2022-05-31T16:51:57Z

I bet you need to ds.precip.max("time", keepdims=True) so that the returned value has the time dimension. Please ask usage questions at https://github.com/pydata/xarray/discussions next time

husainridwan · 2023-03-14T12:35:00Z

I'll like to work on this @TomNicholas, where do I start from?

TomNicholas added the contrib-good-first-issue label Jul 13, 2019

dcherian mentioned this issue Feb 12, 2021

document example of preprocess with open_mfdataset #4901

Closed

dcherian added the topic-documentation label Apr 9, 2022

dcherian mentioned this issue Jun 22, 2022

Link to or absorb external tutorial material xarray-contrib/xarray-tutorial#104

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example on using `preprocess` with `mfdataset` #2313

Example on using `preprocess` with `mfdataset` #2313

dcherian commented Jul 25, 2018

fujiisoup commented Aug 7, 2018

raybellwaves commented Feb 13, 2021

chuaxr commented Mar 9, 2022

jibcar commented May 24, 2022

dcherian commented May 31, 2022

husainridwan commented Mar 14, 2023

Example on using preprocess with mfdataset #2313

Example on using preprocess with mfdataset #2313

Comments

dcherian commented Jul 25, 2018

fujiisoup commented Aug 7, 2018

raybellwaves commented Feb 13, 2021

chuaxr commented Mar 9, 2022

jibcar commented May 24, 2022

dcherian commented May 31, 2022

husainridwan commented Mar 14, 2023

Example on using `preprocess` with `mfdataset` #2313

Example on using `preprocess` with `mfdataset` #2313