-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Dataset.__getitem__ returns a DatasetArray linked to the same dataset #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Originally, `Dataset.__getitem__` would "select" out the given variable to use as the dataset for new DatasetArray. The rationale was that you don't really want to keep track of extra dataset variables that are no longer relevant. The problem is that this means that modifying an item from a dataset would not modify the original dataset. An example might make this clearer: ``` >>> ds = xray.Dataset({'x': ('x', np.arange(10))}) >>> ds['x'].attributes['units'] = 'meters' # this was actually a no-op ``` This is clearly a pretty blatant violation of the norms for a Python container, and it certaintly surprised @akleeman. So this PR simplies this behavior so that `ds['x']` gives a DatasetArray linked to the dataset `ds`, and does some related clean-up of `DatasetArray.from_stack`. The new method `DatasetArray.select` lets you reproduce the old behavior if desired, by using `ds['x'].select()` instead of `ds['x']`. A bonus is that the new behavior is actually faster, because it doesn't need to create a new Dataset object.
akleeman
added a commit
that referenced
this pull request
Mar 3, 2014
Dataset.__getitem__ returns a DatasetArray linked to the same dataset
shoyer
added a commit
that referenced
this pull request
Mar 15, 2014
PR #33 was definitely a useful change -- item access via [] should return items still in the context of the dataset they were pulled from. However, it doesn't make sense to always keep track of all dataset variables. A particular example is when doing math between variables from different datasets. To be more concrete, suppose I have two datasets ("obs" and "sim"), each with two measurement variables ("tmin" and "tmax"). It should be possible to calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to conflicting values of "tmax". Unfortunately, this is exactly what the current version of xray reports. This PR fixes this behavior, by automatically including only coordinates necessary to describe the arrays involved (via `DatasetArray.select`) when merging datasets resulting from mathematical operations. A possible downside is that occasionally auxiliary coordinates worth keeping around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a variable "tmax"). But on the whole I think this behavior is much more in line with reasonable expectations.
shoyer
added a commit
that referenced
this pull request
Mar 15, 2014
PR #33 was definitely a useful change -- item access via [] should return items still in the context of the dataset they were pulled from. However, it doesn't make sense to always keep track of all dataset variables. A particular example is when doing math between variables from different datasets. To be more concrete, suppose I have two datasets ("obs" and "sim"), each with two measurement variables ("tmin" and "tmax"). It should be possible to calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to conflicting values of "tmax". Unfortunately, this is exactly what the current version of xray reports. This PR fixes this behavior, by automatically including only coordinates necessary to describe the arrays involved (via `DatasetArray.select`) when merging datasets resulting from mathematical operations. A possible downside is that occasionally auxiliary coordinates worth keeping around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a variable "tmax"). But on the whole I think this behavior is much more in line with reasonable expectations. This change also removes the DatasetArray methods `refocus` and `unselected` from the public API. I think this is the right call, since these functions were highly specific and really only useful for the prior version of the internal API.
shoyer
added a commit
that referenced
this pull request
Mar 17, 2014
PR #33 was definitely a useful change -- item access via [] should return items still in the context of the dataset they were pulled from. However, it doesn't make sense to always keep track of all dataset variables. A particular example is when doing math between variables from different datasets. To be more concrete, suppose I have two datasets ("obs" and "sim"), each with two measurement variables ("tmin" and "tmax"). It should be possible to calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to conflicting values of "tmax". Unfortunately, this is exactly what the current version of xray reports. This PR fixes this behavior, by automatically including only coordinates necessary to describe the arrays involved (via `DatasetArray.select`) when merging datasets resulting from mathematical operations. A possible downside is that occasionally auxiliary coordinates worth keeping around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a variable "tmax"). But on the whole I think this behavior is much more in line with reasonable expectations. This change also removes the DatasetArray methods `refocus` and `unselected` from the public API. I think this is the right call, since these functions were highly specific and really only useful for the prior version of the internal API.
shoyer
added a commit
that referenced
this pull request
Mar 20, 2014
PR #33 was definitely a useful change -- item access via [] should return items still in the context of the dataset they were pulled from. However, it doesn't make sense to always keep track of all dataset variables. A particular example is when doing math between variables from different datasets. To be more concrete, suppose I have two datasets ("obs" and "sim"), each with two measurement variables ("tmin" and "tmax"). It should be possible to calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to conflicting values of "tmax". Unfortunately, this is exactly what the current version of xray reports. This PR fixes this behavior, by automatically including only coordinates necessary to describe the arrays involved (via `DatasetArray.select`) when merging datasets resulting from mathematical operations. A possible downside is that occasionally auxiliary coordinates worth keeping around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a variable "tmax"). But on the whole I think this behavior is much more in line with reasonable expectations. This change also removes the DatasetArray methods `refocus` and `unselected` from the public API. I think this is the right call, since these functions were highly specific and really only useful for the prior version of the internal API.
shoyer
added a commit
that referenced
this pull request
Mar 24, 2014
PR #33 was definitely a useful change -- item access via [] should return items still in the context of the dataset they were pulled from. However, it doesn't make sense to always keep track of all dataset variables. A particular example is when doing math between variables from different datasets. To be more concrete, suppose I have two datasets ("obs" and "sim"), each with two measurement variables ("tmin" and "tmax"). It should be possible to calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to conflicting values of "tmax". Unfortunately, this is exactly what the current version of xray reports. This PR fixes this behavior, by automatically including only coordinates necessary to describe the arrays involved (via `DatasetArray.select`) when merging datasets resulting from mathematical operations. A possible downside is that occasionally auxiliary coordinates worth keeping around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a variable "tmax"). But on the whole I think this behavior is much more in line with reasonable expectations. This change also removes the DatasetArray methods `refocus` and `unselected` from the public API. I think this is the right call, since these functions were highly specific and really only useful for the prior version of the internal API.
shoyer
added a commit
that referenced
this pull request
Mar 24, 2014
PR #33 was definitely a useful change -- item access via [] should return items still in the context of the dataset they were pulled from. However, it doesn't make sense to always keep track of all dataset variables. A particular example is when doing math between variables from different datasets. To be more concrete, suppose I have two datasets ("obs" and "sim"), each with two measurement variables ("tmin" and "tmax"). It should be possible to calculate `obs['tmin'] - sim['tmin']` without a merge conflict due to conflicting values of "tmax". Unfortunately, this is exactly what the current version of xray reports. This PR fixes this behavior, by automatically including only coordinates necessary to describe the arrays involved (via `DatasetArray.select`) when merging datasets resulting from mathematical operations. A possible downside is that occasionally auxiliary coordinates worth keeping around will be lost (e.g., `(2 * obs['tmin']).dataset` no longer contains a variable "tmax"). But on the whole I think this behavior is much more in line with reasonable expectations. This change also removes the DatasetArray methods `refocus` and `unselected` from the public API. I think this is the right call, since these functions were highly specific and really only useful for the prior version of the internal API.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Originally,
Dataset.__getitem__
would "select" out the given variable to useas the dataset for new DatasetArray. The rationale was that you don't really
want to keep track of extra dataset variables that are no longer relevant.
The problem is that this means that modifying an item from a dataset would not
modify the original dataset. An example might make this clearer:
This is clearly a pretty blatant violation of the norms for a Python
container, and it certaintly surprised @akleeman. So this PR simplies this
behavior so that
ds['x']
gives a DatasetArray linked to the datasetds
,and does some related clean-up of
DatasetArray.from_stack
. The new methodDatasetArray.select
lets you reproduce the old behavior if desired, by usingds['x'].select()
instead ofds['x']
. A bonus is that the new behavior isactually faster, because it doesn't need to create a new Dataset object.