Add option for custom user preprocessing step to data_loader.load_variable? #177

Closed
spencerkclark opened this issue May 2, 2017 · 4 comments

@spencerkclark
Collaborator

We toyed with adding this in #90, but eventually settled on just adding a custom time-offset capability for simplicity.

I know someone trying out aospy that would like to use it for looking at output from the WRF model, which unfortunately does not always comply with CF conventions; this is particularly problematic for the time variable:

<xarray.DataArray 'XTIME' (Time: 360)>
array([ 259440.,  259680.,  259920., ...,  345120.,  345360.,  345600.], dtype=float32)
Dimensions without coordinates: Time
Attributes:
    FieldType:    104
    MemoryOrder:  0
    description:  minutes since simulation start
    units:
    stagger:

It would be nice, rather than have to modify every output file to have a CF-compliant time units variable, if a user could provide a function to apply to the dataset before aospy touches it. I think it would make most sense for this to be a specification one could make as an optional argument in a DataLoader constructor.

For instance:

import datetime
import os

from aospy import Run
from aospy.data_loader import DictDataLoader


def fix_wrf_time_units(ds):
    # WRF leaves the units attribute of XTIME empty; set a CF-style string.
    if 'XTIME' in ds:
        ds['XTIME'].attrs['units'] = 'minutes since 0001-06-30 04:00:00'
    return ds

rootdir = '/archive/xrc/wrf_output/WTG_colcool/28C_1pt5_noT'

_file_map = {'4hr': os.path.join(rootdir, 'wrfout_d01_0001-06-30_04:00:00')}
example_run = Run(
    name='example_run',
    description=(
        'WRF WTG simulation'
    ),
    default_start_date=datetime.datetime(1, 6, 30) + datetime.timedelta(days=50),
    default_end_date=datetime.datetime(1, 6, 30) + datetime.timedelta(days=60),
    # `preprocess` is the proposed new argument, keyed by intvl_in.
    data_loader=DictDataLoader(_file_map, preprocess={'4hr': fix_wrf_time_units})
)

The preprocess argument could take a dictionary as input, with each intvl_in mapping to a user-defined function, which would be called before grid_attrs_to_aospy_names whenever files were loaded from that particular file set. This could be used to clean up datasets that are close to, but not quite, compatible with aospy's assumptions.
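
A minimal sketch of how that lookup could work, assuming a hypothetical `load` method and using plain nested dicts as a stand-in for an xarray Dataset. Apart from `intvl_in`, `preprocess`, and `fix_wrf_time_units`, every name here is made up for illustration and is not aospy's actual API.

```python
class SketchDictDataLoader:
    """Hypothetical stand-in for DictDataLoader with a `preprocess` option."""

    def __init__(self, file_map, preprocess=None):
        self.file_map = file_map
        # Maps intvl_in -> function(ds) -> ds; a missing key means "no-op".
        self.preprocess = preprocess or {}

    def _open(self, path):
        # Stand-in for opening files; a real loader would return an
        # xarray.Dataset. Here each variable is a dict holding its attrs.
        return {'XTIME': {'attrs': {'units': ''}}}

    def load(self, intvl_in):
        ds = self._open(self.file_map[intvl_in])
        func = self.preprocess.get(intvl_in)
        if func is not None:
            # The user function runs before anything else touches the data.
            ds = func(ds)
        # aospy's own renaming and time handling would follow here.
        return ds


def fix_wrf_time_units(ds):
    if 'XTIME' in ds:
        ds['XTIME']['attrs']['units'] = 'minutes since 0001-06-30 04:00:00'
    return ds


loader = SketchDictDataLoader({'4hr': 'wrfout_d01_0001-06-30_04:00:00'},
                              preprocess={'4hr': fix_wrf_time_units})
ds = loader.load('4hr')
```

Keeping the default as an empty mapping means existing loaders that pass no `preprocess` argument would behave exactly as before.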

@spencerahill what are your thoughts here? Does this sound like a decent option to address this issue?

@spencerahill
Owner

> WRF model, which unfortunately does not always comply with CF conventions

Good catch. Are there any other places where we're implicitly assuming CF-compliant data?

> It would be nice, rather than have to modify every output file to have a CF-compliant time units variable, if a user could provide a function to apply to the dataset before aospy touches it.

I agree.

> I think it would make most sense for this to be a specification one could make as an optional argument in a DataLoader constructor.

This definitely seems like the most straightforward solution. However, since this stems from model-level settings (i.e. the output format of WRF simulations), I'm wondering if there's a way to do this more systematically. More specifically, is there a way to implement this at the Model level?

I don't think so right now, so maybe that's beyond the scope of this particular issue. And it doesn't preclude the usefulness of your recommended solution, since that's more general.

> The preprocess argument could take a dictionary as input, with an intvl_in mapping to a user-defined function

What motivates making it intvl_in-specific?

@spencerahill
Owner

spencerahill commented May 3, 2017

> This definitely seems like the most straightforward solution. However, since this stems from model-level settings (i.e. the output format of WRF simulations), I'm wondering if there's a way to do this more systematically. More specifically, is there a way to implement this at the Model level?

Just occurred to me: this is exactly what different DataLoaders are for. I.e. we should add a WRFDataLoader that does this. What do you think of that?

Edit: In fact, if a Python wrapper exists for the wrfout_to_cf.ncl utility you linked to, we could use that directly, rather than rolling our own.

(I still think your preprocess proposal is good, to allow for arbitrary preprocessing that e.g. none of the built-in DataLoaders have.)
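
A rough sketch of the subclass idea, assuming a hypothetical base class with an overridable `_preprocess` hook. The class and method names are illustrative only (they are not aospy's actual DataLoader interface), and nested dicts stand in for an xarray Dataset.

```python
class SketchDataLoader:
    """Hypothetical base class; not aospy's actual DataLoader."""

    def _preprocess(self, ds):
        # Default hook: leave the dataset untouched.
        return ds

    def prepare(self, ds):
        # Run the hook before any further (e.g. CF-dependent) processing.
        return self._preprocess(ds)


class SketchWRFDataLoader(SketchDataLoader):
    """Hypothetical loader that always repairs WRF's empty time units."""

    def __init__(self, time_units):
        self.time_units = time_units

    def _preprocess(self, ds):
        if 'XTIME' in ds:
            ds['XTIME']['attrs']['units'] = self.time_units
        return ds


raw = {'XTIME': {'attrs': {'units': '',
                           'description': 'minutes since simulation start'}}}
wrf_loader = SketchWRFDataLoader('minutes since 0001-06-30 04:00:00')
fixed = wrf_loader.prepare(raw)
```

With this layout the WRF-specific fix lives in one place, and a user-supplied preprocessing function could still be supported by having the base hook call it.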

@spencerkclark
Collaborator Author

> Are there any other places where we're implicitly assuming CF-compliant data?

I think we are only strict about times, since xarray requires a CF-compliant units attribute to decode times (and we rely on times being decoded within aospy).
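
To illustrate why the units attribute matters: decoding amounts to parsing a string of the form "<unit> since <reference date>" and adding each numeric offset to the reference date. The sketch below does this by hand with the standard library for the first XTIME value shown above. It is a simplified illustration, not xarray's decoder, and `decode_offset` is a hypothetical helper.

```python
import datetime


def decode_offset(value, units):
    """Decode one numeric offset given a '<unit> since <date>' string."""
    unit, sep, reference = units.partition(' since ')
    if not sep:
        # An empty or malformed units string gives nothing to decode against.
        raise ValueError('units must look like "<unit> since <date>"')
    ref = datetime.datetime.fromisoformat(reference)
    # `unit` must be a timedelta keyword such as 'minutes' or 'days'.
    return ref + datetime.timedelta(**{unit: value})


first = decode_offset(259440.0, 'minutes since 0001-06-30 04:00:00')
```

With the empty units string that WRF writes, the same call raises an error, which is the situation the preprocessing step is meant to repair.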

> Just occurred to me: this is exactly what different DataLoaders are for. I.e. we should add a WRFDataLoader that does this. What do you think of that?

That does sound like a nice solution! That way you wouldn't have to remember to pass the preprocessing function to the DataLoader each time (you'd just have to make sure you used the correct one).

> I still think your preprocess proposal is good, to allow for arbitrary preprocessing that e.g. none of the built-in DataLoaders have.

I agree; it would make implementing different DataLoaders that require their own preprocessing steps for cleaning datasets before passing them to aospy more straightforward.

> What motivates making it intvl_in-specific?

Right, for this particular use case (I was thinking about just a DictDataLoader and modifying the time units attribute) intvl_in was most appropriate, but it does not generalize to other DataLoaders. We'll have to think more about how general/flexible we want to make things. Even for this relatively simple use case, I can see the need for the function to behave differently depending on the intvl_in. For instance, in the example above the time units string should be 'minutes since 0001-06-30 04:00:00', but for a different, coarser intvl_in the units might be 'days since 0001-06-30 04:00:00'.
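
One way to get per-intvl_in behavior without writing a separate function by hand for each file set is a small factory, sketched below. `make_time_units_fixer` and the `'daily'` key are hypothetical, and a `SimpleNamespace` with an `attrs` dict stands in for an xarray variable so the example runs without any data files.

```python
from types import SimpleNamespace


def make_time_units_fixer(units, time_var='XTIME'):
    """Return a preprocess function that sets `units` on `time_var`."""
    def fix_time_units(ds):
        if time_var in ds:
            ds[time_var].attrs['units'] = units
        return ds
    return fix_time_units


# One function per intvl_in, differing only in the units string.
preprocess = {
    '4hr': make_time_units_fixer('minutes since 0001-06-30 04:00:00'),
    'daily': make_time_units_fixer('days since 0001-06-30 04:00:00'),
}

# Stand-in dataset: a mapping from variable name to an object with attrs.
ds = {'XTIME': SimpleNamespace(attrs={'units': ''})}
ds = preprocess['daily'](ds)
```

A dictionary built this way could be passed as the proposed `preprocess` argument unchanged.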

@spencerahill
Owner

> Right, for this particular use case (I was thinking about just a DictDataLoader and modifying the time units attribute) intvl_in was most appropriate, but it does not generalize to other DataLoaders. We'll have to think more about how general/flexible we want to make things. Even for this relatively simple use case, I can see the need for the function to behave differently depending on the intvl_in. For instance, in the example above the time units string should be 'minutes since 0001-06-30 04:00:00', but for a different, coarser intvl_in the units might be 'days since 0001-06-30 04:00:00'.

Thinking out loud, could we make it a method that's unique to each DataLoader? E.g. for DictDataLoader it maps to intvl_in, but for others it does something else? That's kind of punting, because it still requires deciding on the behavior for each class.

Not sure if that's a good idea. Definitely need to work through the implementation a bit before proceeding.


2 participants