Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example on using preprocess with mfdataset #2313

Open
dcherian opened this issue Jul 25, 2018 · 6 comments
Open

Example on using preprocess with mfdataset #2313

dcherian opened this issue Jul 25, 2018 · 6 comments

Comments

@dcherian
Copy link
Contributor

I wrote this little notebook today while trying to get some satellite data in form that was nice to work with: https://gist.github.com/dcherian/66269bc2b36c2bc427897590d08472d7

I think it would make a useful example for the docs.

A few questions:

  1. Do you think it'd be a good addition to the examples?
  2. Is this the recommended way of adding meaningful co-ordinates, expanding dims etc.? The main bit is this function:
def preprocess(ds):

        dsnew = ds.copy()
        dsnew['latitude'] = xr.DataArray(np.linspace(90, -90, 180),
                                         dims=['phony_dim_0'])
        dsnew['longitude'] = xr.DataArray(np.linspace(-180, 180, 360),
                                          dims=['phony_dim_1'])
        dsnew = (dsnew.rename({'l3m_data': 'sss',
                               'phony_dim_0': 'latitude',
                               'phony_dim_1': 'longitude'})
                 .set_coords(['latitude', 'longitude'])
                 .drop('palette'))

        dsnew['time'] = (pd.to_datetime(dsnew.attrs['time_coverage_start'])
                         + np.timedelta64(3, 'D') + np.timedelta64(12, 'h'))
        dsnew = dsnew.expand_dims('time').set_coords('time')

        return dsnew

Also open to other feedback...

@fujiisoup
Copy link
Member

There is a related question on SO.

I think it is a good idea to add an example to our doc.
Agreed for the question 1.
I did not yet examine the question 2, but I think simple examples are generally nice.

@raybellwaves
Copy link
Contributor

Edit: Copied and pasted from a duplicate issue I opened. Closing that and moving convo here.

@jhamman's SO answer circa 2018 helped me this week https://stackoverflow.com/a/51714004/6046019

I wonder if it's worth (not sure where) providing an example of how to use preprocesses with open_mfdataset?

Add an Examples entry to the doc string? (http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html /

)

While not a small example (as the remote files are large) this is how I used it:

import xarray as xr
import s3fs


def preprocess(ds):
    return ds.expand_dims('time')

fs = s3fs.S3FileSystem(anon=True)
f1 = fs.open('s3://fmi-opendata-rcrhirlam-surface-grib/2021/02/03/00/numerical-hirlam74-forecast-MaximumWind-20210203T000000Z.grb2')
f2 = fs.open('s3://fmi-opendata-rcrhirlam-surface-grib/2021/02/03/06/numerical-hirlam74-forecast-MaximumWind-20210203T060000Z.grb2')

ds = xr.open_mfdataset([f1, f2], engine="cfgrib", preprocess=preprocess, parallel=True)

with one file looking like:

xr.open_dataset("LOCAL_numerical-hirlam74-forecast-MaximumWind-20210203T000000Z.grb2", engine="cfgrib")
<xarray.Dataset>
Dimensions:            (latitude: 947, longitude: 5294, step: 55)
Coordinates:
    time               datetime64[ns] ...
  * step               (step) timedelta64[ns] 01:00:00 ... 2 days 07:00:00
    heightAboveGround  int64 ...
  * latitude           (latitude) float64 25.65 25.72 25.78 ... 89.86 89.93 90.0
  * longitude          (longitude) float64 -180.0 -179.9 -179.9 ... 179.9 180.0
    valid_time         (step) datetime64[ns] ...
Data variables:
    fg10               (step, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2021-02-12T18:06:52 GRIB to CDM+CF via cfgrib-0....

A smaller example could be (WIP; note I was hoping ds would concat along t but it doesn't do what I expect)

import numpy as np
import xarray as xr

f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f1")
f1 = f1.assign_coords(t=0)
f1.to_dataset().to_zarr("f1.zarr") # What's the best way to store small files to open again with mf_dataset? csv via xarray objects? can you use open_mfdataset on pkl objects?

f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f2")
f2 = f2.assign_coords(t=1)
f2.to_dataset().to_zarr("f2.zarr")

# Concat along t
def preprocess(ds):
    return ds.expand_dims('t')
ds = xr.open_mfdataset(["f1.zarr", "f2.zarr"], engine="zarr", concat_dim="t", preprocess=preprocess)
>>> ds
<xarray.Dataset>
Dimensions:  (a: 2, t: 1)
Coordinates:
  * t        (t) int64 0
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>
    f2       (t, a) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>

@chuaxr
Copy link

chuaxr commented Mar 9, 2022

Seconding @dcherian's comment in #4901 on an example for .encoding['source']. Working off @raybellwaves' example, something like this would have been useful to me:

>>> import xarray as xr
>>> import numpy as np
>>> model1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name="f")
>>> model1.to_dataset().to_netcdf("model1.nc")
>>> model2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name="f")
>>> model2.to_dataset().to_netcdf("model2.nc")
>>> ds = xr.open_mfdataset(
...     ["model1.nc", "model2.nc"],
...     preprocess=lambda ds: ds.expand_dims(
...         {"model_name": [ds.encoding["source"].split("/")[-1].split(".")[0]]}
...     ),
... )
>>> ds
<xarray.Dataset>
Dimensions:     (dim_0: 2, model_name: 2)
Coordinates:
  * dim_0       (dim_0) int64 0 1
  * model_name  (model_name) object 'model1' 'model2'
Data variables:
    f           (model_name, dim_0) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>

On that note, the example above seems to work with some slight changes:

>>> import numpy as np
>>> import xarray as xr
>>> 
>>> f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f1")
>>> f1 = f1.assign_coords(t='t0')
>>> f1.to_dataset().to_netcdf("f1.nc")
>>> 
>>> f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f2")
>>> f2 = f2.assign_coords(t='t1')
>>> f2.to_dataset().to_netcdf("f2.nc")
>>> 
>>> # Concat along t
>>> def preprocess(ds):
...     return ds.expand_dims("t")
... 
>>> 
>>> ds = xr.open_mfdataset(["f1.nc", "f2.nc"], concat_dim="t", preprocess=preprocess)
>>> ds
<xarray.Dataset>
Dimensions:  (a: 2, t: 2)
Coordinates:
  * t        (t) object 't0' 't1'
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
    f2       (t, a) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>

@jibcar
Copy link

jibcar commented May 24, 2022

Hello:

I have to find maximum precipitation of each year (for example: 2007 and 2008, Dataset link are: 2007 and 2008). I have done this using resample method (i.e. .resample(time='Y').max()) after concatenating it along time dimension.

Following along SO, I am wondering if I can use preprocess to find maximum (or minimum or average) for each file first and then concatenate it using time dimension. I tried the following code and was not successful. Can someone help me with this?

import numpy as np
import xarray as xr

from dask.distributed import Client
client = Client()
client

def preprocess_func(ds):
    '''Get maximum (or minimum or average) from each file and concatenate along time'''
    return ds.precip.max('time')

prec_ds=xr.open_mfdataset([prec_2007,prec_2008],
                       chunks={"lat": 25,"lon": 25,"time": -1,},
                       preprocess=preprocess_func,
                       concat_dim='time')```

@dcherian
Copy link
Contributor Author

I bet you need to ds.precip.max("time", keepdims=True) so that the returned value has the time dimension. Please ask usage questions at https://github.com/pydata/xarray/discussions next time

@husainridwan
Copy link

I'll like to work on this @TomNicholas, where do I start from?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants