Skip to content

Dataset.__getitem__ returns a DatasetArray linked to the same dataset #33

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 3, 2014

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Mar 2, 2014

Originally, Dataset.__getitem__ would "select" out the given variable to use
as the dataset for new DatasetArray. The rationale was that you don't really
want to keep track of extra dataset variables that are no longer relevant.

The problem is that this means that modifying an item from a dataset would not
modify the original dataset. An example might make this clearer:

>>> ds = xray.Dataset({'x': ('x', np.arange(10))})
>>> ds['x'].attributes['units'] = 'meters'  # this was actually a no-op

This is clearly a pretty blatant violation of the norms for a Python
container, and it certaintly surprised @akleeman. So this PR simplies this
behavior so that ds['x'] gives a DatasetArray linked to the dataset ds,
and does some related clean-up of DatasetArray.from_stack. The new method
DatasetArray.select lets you reproduce the old behavior if desired, by using
ds['x'].select() instead of ds['x']. A bonus is that the new behavior is
actually faster, because it doesn't need to create a new Dataset object.

Originally, `Dataset.__getitem__` would "select" out the given variable to use
as the dataset for new DatasetArray. The rationale was that you don't really
want to keep track of extra dataset variables that are no longer relevant.

The problem is that this means that modifying an item from a dataset would not
modify the original dataset. An example might make this clearer:
```
>>> ds = xray.Dataset({'x': ('x', np.arange(10))})
>>> ds['x'].attributes['units'] = 'meters'  # this was actually a no-op
```

This is clearly a pretty blatant violation of the norms for a Python
container, and it certaintly surprised @akleeman. So this PR simplies this
behavior so that `ds['x']` gives a DatasetArray linked to the dataset `ds`,
and does some related clean-up of `DatasetArray.from_stack`. The new method
`DatasetArray.select` lets you reproduce the old behavior if desired, by using
`ds['x'].select()` instead of `ds['x']`. A bonus is that the new behavior is
actually faster, because it doesn't need to create a new Dataset object.
akleeman added a commit that referenced this pull request Mar 3, 2014
Dataset.__getitem__ returns a DatasetArray linked to the same dataset
@akleeman akleeman merged commit 08a03b3 into master Mar 3, 2014
@shoyer shoyer deleted the Dataset.__getitem__ branch March 3, 2014 05:46
shoyer added a commit that referenced this pull request Mar 15, 2014
PR #33 was definitely a useful change -- item access via [] should return
items still in the context of the dataset they were pulled from.

However, it doesn't make sense to always keep track of all dataset variables.
A particular example is when doing math between variables from different
datasets.

To be more concrete, suppose I have two datasets ("obs" and "sim"), each with
two measurement variables ("tmin" and "tmax"). It should be possible to
calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to
conflicting values of "tmax". Unfortunately, this is exactly what the current
version of xray reports.

This PR fixes this behavior, by automatically including only coordinates
necessary to describe the arrays involved (via `DatasetArray.select`) when
merging datasets resulting from mathematical operations.

A possible downside is that occasionally auxiliary coordinates worth keeping
around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a
variable "tmax"). But on the whole I think this behavior is much more in line
with reasonable expectations.
shoyer added a commit that referenced this pull request Mar 15, 2014
PR #33 was definitely a useful change -- item access via [] should return
items still in the context of the dataset they were pulled from.

However, it doesn't make sense to always keep track of all dataset variables.
A particular example is when doing math between variables from different
datasets.

To be more concrete, suppose I have two datasets ("obs" and "sim"), each with
two measurement variables ("tmin" and "tmax"). It should be possible to
calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to
conflicting values of "tmax". Unfortunately, this is exactly what the current
version of xray reports.

This PR fixes this behavior, by automatically including only coordinates
necessary to describe the arrays involved (via `DatasetArray.select`) when
merging datasets resulting from mathematical operations.

A possible downside is that occasionally auxiliary coordinates worth keeping
around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a
variable "tmax"). But on the whole I think this behavior is much more in line
with reasonable expectations.

This change also removes the DatasetArray methods `refocus` and `unselected`
from the public API. I think this is the right call, since these functions
were highly specific and really only useful for the prior version of the
internal API.
shoyer added a commit that referenced this pull request Mar 17, 2014
PR #33 was definitely a useful change -- item access via [] should return
items still in the context of the dataset they were pulled from.

However, it doesn't make sense to always keep track of all dataset variables.
A particular example is when doing math between variables from different
datasets.

To be more concrete, suppose I have two datasets ("obs" and "sim"), each with
two measurement variables ("tmin" and "tmax"). It should be possible to
calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to
conflicting values of "tmax". Unfortunately, this is exactly what the current
version of xray reports.

This PR fixes this behavior, by automatically including only coordinates
necessary to describe the arrays involved (via `DatasetArray.select`) when
merging datasets resulting from mathematical operations.

A possible downside is that occasionally auxiliary coordinates worth keeping
around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a
variable "tmax"). But on the whole I think this behavior is much more in line
with reasonable expectations.

This change also removes the DatasetArray methods `refocus` and `unselected`
from the public API. I think this is the right call, since these functions
were highly specific and really only useful for the prior version of the
internal API.
shoyer added a commit that referenced this pull request Mar 20, 2014
PR #33 was definitely a useful change -- item access via [] should return
items still in the context of the dataset they were pulled from.

However, it doesn't make sense to always keep track of all dataset variables.
A particular example is when doing math between variables from different
datasets.

To be more concrete, suppose I have two datasets ("obs" and "sim"), each with
two measurement variables ("tmin" and "tmax"). It should be possible to
calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to
conflicting values of "tmax". Unfortunately, this is exactly what the current
version of xray reports.

This PR fixes this behavior, by automatically including only coordinates
necessary to describe the arrays involved (via `DatasetArray.select`) when
merging datasets resulting from mathematical operations.

A possible downside is that occasionally auxiliary coordinates worth keeping
around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a
variable "tmax"). But on the whole I think this behavior is much more in line
with reasonable expectations.

This change also removes the DatasetArray methods `refocus` and `unselected`
from the public API. I think this is the right call, since these functions
were highly specific and really only useful for the prior version of the
internal API.
shoyer added a commit that referenced this pull request Mar 24, 2014
PR #33 was definitely a useful change -- item access via [] should return
items still in the context of the dataset they were pulled from.

However, it doesn't make sense to always keep track of all dataset variables.
A particular example is when doing math between variables from different
datasets.

To be more concrete, suppose I have two datasets ("obs" and "sim"), each with
two measurement variables ("tmin" and "tmax"). It should be possible to
calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to
conflicting values of "tmax". Unfortunately, this is exactly what the current
version of xray reports.

This PR fixes this behavior, by automatically including only coordinates
necessary to describe the arrays involved (via `DatasetArray.select`) when
merging datasets resulting from mathematical operations.

A possible downside is that occasionally auxiliary coordinates worth keeping
around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a
variable "tmax"). But on the whole I think this behavior is much more in line
with reasonable expectations.

This change also removes the DatasetArray methods `refocus` and `unselected`
from the public API. I think this is the right call, since these functions
were highly specific and really only useful for the prior version of the
internal API.
shoyer added a commit that referenced this pull request Mar 24, 2014
PR #33 was definitely a useful change -- item access via [] should return
items still in the context of the dataset they were pulled from.

However, it doesn't make sense to always keep track of all dataset variables.
A particular example is when doing math between variables from different
datasets.

To be more concrete, suppose I have two datasets ("obs" and "sim"), each with
two measurement variables ("tmin" and "tmax"). It should be possible to
calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to
conflicting values of "tmax". Unfortunately, this is exactly what the current
version of xray reports.

This PR fixes this behavior, by automatically including only coordinates
necessary to describe the arrays involved (via `DatasetArray.select`) when
merging datasets resulting from mathematical operations.

A possible downside is that occasionally auxiliary coordinates worth keeping
around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a
variable "tmax"). But on the whole I think this behavior is much more in line
with reasonable expectations.

This change also removes the DatasetArray methods `refocus` and `unselected`
from the public API. I think this is the right call, since these functions
were highly specific and really only useful for the prior version of the
internal API.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants