Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG fix +test .sel method gives error with float32 values #3153

Merged
merged 10 commits into from Aug 10, 2019

Conversation

HasanAhmadQ7
Copy link
Contributor

@HasanAhmadQ7 HasanAhmadQ7 commented Jul 20, 2019

.sel method gives error when it is used to select float32 values

@pep8speaks
Copy link

pep8speaks commented Jul 20, 2019

Hello @HasanAhmadQ7! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-10 22:01:05 UTC

@HasanAhmadQ7 HasanAhmadQ7 force-pushed the fix_sel_float branch 2 times, most recently from 6416a16 to 34d4115 Compare July 20, 2019 20:57
    .sel method gives error when it is used to select float32 values

    Resolves: pydata#3137
@max-sixty
Copy link
Collaborator

Hi @HasanAhmadQ7 thanks for the PR - I saw your comment here: #3137 (comment)

I see the tests are failing, but the build fails before it tests these changes - could you try merging master and pushing? That'll give them another chance to run.

@@ -2322,6 +2323,22 @@ def interp_like(self, other, method='linear', assume_sorted=False,
ds = self.reindex(object_coords)
return ds.interp(numeric_coords, method, assume_sorted, kwargs)

# Helper method for sel()
def _maybe_cast_floats(self, indexers):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone have a prior on the likely performance impact of this? We're creating a copy of the indexers on each .sel call.

If a) the requirement to coerce is rare enough, and b) it's expensive to coerce, we could loop through and check whether there are any incongruencies, and only then coerce.

Is there a viable alternative in having the indexer do these casts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@max-sixty Thank you for your time and note. I modified it to avoid unnecessary looping or copying.

context: In my understanding, the root of the problem is that pandas index is float 64. Despite that the issue only mentions float32, I tried to generalize the solution, by casting to the coords type for any kind of float.

I will be happy to further modify if anyone has additional notes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HasanAhmadQ7 !

I'll let someone who knows the indexing machinery review this next (@pydata/xarray), given how central this code path is and my relative knowledge of this. Thanks for your patience

avoid copying indexers for efficiency
@shoyer
Copy link
Member

shoyer commented Jul 29, 2019

@HasanAhmadQ7 thanks for looking into this!

I think this could be solved a little more cleanly at a lower level of xarray's indexing logic. In particular, all of our indexing calls go through convert_label_indexer, which in turn uses get_loc and get_indexer_nd helper functions to wrap the calls into pandas.Index methods:

def get_loc(index, label, method=None, tolerance=None):
kwargs = _index_method_kwargs(method, tolerance)
return index.get_loc(label, **kwargs)
def get_indexer_nd(index, labels, method=None, tolerance=None):
""" Call pd.Index.get_indexer(labels). """
kwargs = _index_method_kwargs(method, tolerance)
flat_labels = np.ravel(labels)
flat_indexer = index.get_indexer(flat_labels, **kwargs)
indexer = flat_indexer.reshape(labels.shape)
return indexer

These helper function would be a good place to implement this casting logic -- and you might even consider trying to fix it upstream in pandas as well.

array = DataArray(data_values, [('float32_coord', coord_values)])
expected = DataArray(data_values[1:3], [('float32_coord',
coord_values[1:3])])
actual = array.sel(float32_coord=float_values[1:3])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to add a test case where you're only indexing by a scalar.

@HasanAhmadQ7
Copy link
Contributor Author

HasanAhmadQ7 commented Jul 29, 2019

@shoyer @max-sixty I really appreciate your time and feedback. I will work on it within a week.

HasanAhmadQ7 and others added 4 commits August 4, 2019 19:50
@HasanAhmadQ7
Copy link
Contributor Author

HasanAhmadQ7 commented Aug 4, 2019

@shoyer A fix in pandas that I can imagine is to modify the pd.Index to retain the dtype similar to the PandasIndexAdapter in xarray/core/indexing. However, it seems to me that their design it to coerce any float to float64, so I did not feel they would consider this to be a bug. I would try to dig into such solution more if you recommend so.

I added a test case in which the coords are float16 (in addiction to the scalar case)to show that casting to the coords type is required.

The lowest level in which I can access the coords type is in the indexing/remap_label_indexers where I could get the coords from the data_obj. In the latest code, casting is done just before calling convert_label_indexer

Sorry if I am missing the point, and thank you in advance for any further feedback.

@shoyer
Copy link
Member

shoyer commented Aug 4, 2019

A fix in pandas that I can imagine is to modify the pd.Index to retain the dtype similar to the PandasIndexAdapter in xarray/core/indexing. However, it seems to me that their design it to coerce any float to float64, so I did not feel they would consider this to be a bug. I would try to dig into such solution more if you recommend so.

Yes, after thinking a little bit more about this I think you're totally right.

Pandas only has Float64Index. When xarray makes an index for float32 data, we use a Float64Index under the hood for look-ups, but the fact that index represents float32 data is only known to xarray, not pandas.

xarray/core/indexing.py Outdated Show resolved Hide resolved
def maybe_cast_to_coords_dtype(label, coords_dtype):
if isinstance(label, float):
label = coords_dtype.type(label)
elif isinstance(label, list) and coords_dtype.kind == 'f':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also want to convert arrays like np.array([0, 1/3, 2/3, 1], dtype=np.float32).

Actually, maybe we could actually convert any indexing argument to a float array of the appropriate dtype? e.g.,

def maybe_cast_to_coords_dtype(label, coords_dtype):
    if coords_dtype.kind == 'f':
        label = np.asarray(label, dtype=coords_dtype)
    return label

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely cleaner. Thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adopted the change in the latest code, but added a check for the case of label is slice

Co-Authored-By: Stephan Hoyer <shoyer@gmail.com>
HasanAhmadQ7 and others added 3 commits August 5, 2019 09:47
@shoyer
Copy link
Member

shoyer commented Aug 10, 2019

I'm going to merge this shortly assuming tests pass...

@shoyer shoyer merged commit 097306b into pydata:master Aug 10, 2019
@shoyer
Copy link
Member

shoyer commented Aug 10, 2019

thanks @HasanAhmadQ7 !

@max-sixty
Copy link
Collaborator

Thanks @HasanAhmadQ7 !

dcherian added a commit to dcherian/xarray that referenced this pull request Aug 14, 2019
* upstream/master:
  chunk sparse arrays (pydata#3202)
  Annotations for .data_vars() and .coords() (pydata#3207)
  Remove duck_array_ops.as_like_arrays() (pydata#3204)
  Match mypy version between CI and pre-commit hook (pydata#3203)
  BUG fix +test .sel method gives error with float32 values (pydata#3153)
  fix precommit file (pydata#3201)
  Enforce mypy compliance in CI (pydata#3197)
benbovy added a commit to benbovy/xarray that referenced this pull request Aug 31, 2021
Make the fix in pydata#3153 specific to pandas indexes (i.e., do not apply it
to other, custom indexes). See pydata#5697 for details.

This should also fix pydata#5700 although no test has been added yet (we need
to refactor set_index first).
shoyer pushed a commit that referenced this pull request Mar 17, 2022
* no need to wrap pandas index in lazy index adapter

* multi-index default level names

* refactor setting Dataset/DataArray default indexes

* update multi-index (text) repr

Notes:

- move the multi-index formatting logic into
  PandasMultiIndexingAdapter._repr_inline_
- inline repr: check for _repr_inline_ implementation first

* remove print

* minor fixes and improvements

* fix dtype of index variables created from Index

* fix multi-index selection regression

See #5691

* check conflicting multi-index level names

* update formatting (text and html)

* check level name conflicts for midx given as coord

* intended behavior or unwanted side effect? see #5732

* get rid of multi-index virtual coordinates

Not totally yet: need to refactor set_index / reset_index

* add level coords in indexes & keep coord order

* fix copying multi-index level variable data

* collect index for multi-index level variables

Avoid re-creating the indexes for dimension variables. Collect then
directly instead.

Note: the change here is working for building new Datasets but I haven't
checked other cases like merging different objects, etc. So I'm not sure
this is the right approach.

* wip refactor label based selection

- Index.query must now return a mapping of {dim_name:
positional_indexer} as indexes may be based on several coordinates with
different dimensions

- Added `group_coords_by_index` utility function (not used yet, not sure
we'll need it)

TODO:

- Update DataArray selection
- Update .loc and other places using remap_label_indexers
- Fix selection of multi-index that returns only scalar coordinates

* fix index query tests

* fix multi-index adapter getitem scalar

Multi-index level variables now return the scalar value that corresponds
to the level instead of the multi-index tuple element (all levels).

Also get rid of PandasMultiIndexingAdapter.__getitem__ cache
optimization, which doesn't work with level scalar values and was
premature optimization anyway.

* wip refactor label based selection

Fixed renamed dimension in the case of multi-index -> single index

Updated DataArray._overwrite_indexes

Dirty fix for alignment (not tested yet)

* wip: deeper refactoring label-based sel

Created QueryResult and MergedQueryResults classes for convenience.

* fix some tests + minor tweaks

* fix indexing PandasMultiIndexingAdapater

When level is not None:

- if result is another adapter: propagate it properly
- if result is a numpy adapter: use level values

* refactor cast label indexer to coord dtype

Make the fix in #3153 specific to pandas indexes (i.e., do not apply it
to other, custom indexes). See #5697 for details.

This should also fix #5700 although no test has been added yet (we need
to refactor set_index first).

* better handling of multi-index level labels

* label-based selection tweaks and fixes

- Use a dataclass for QueryResult

- Pass Variables and DataArrays indexers un-normalized to Index.query().
  Indexes have the responsibility of returning the expected types for
  positional (dimension) indexers by adding back dimensions and
  coordinates if needed.

- Typing fixes and tweaks

* sel: propagate multi-index vars attrs/encoding

Related to #1366

* wip refactor rename

Still needs tests

Also need to address the problem of multi-index level coordinates (data
adapters currently not updated). We'll probably need for
`Index.rename()` to also return new index variables?

* wip refactor rename: return new vars from Index

* refactor rename: update tests

Still need to address default indexes internal checks (these are
disabled for now).

* typing tweaks

* fix html formatting: dims w/ or w/o index

* minor fixes and tweaks

* wip refactor set_index

* wip refactor set_index

- Refactor reset_index
- Improve creating new xarray indexes from pandas indexes with proper
  propagation of variable metadata (dtype, attrs, encoding)

* refactor reorder_levels

* Set pd.MultiIndex name from dim name

Closes #4542

* .sel() with multi-index: return scalar coords

..instead of dropping those coordinates

Closes #1408 once tests are fixed/updated

* fix multi-index level coordinate inline repr

* wip refactor stack

Some changes:

Multi-indexes are created only if the
stacked dimensions each have a single coordinate index. In the other
cases, multi-indexes are not created, which means that it is an
irreversible operation (unstack will not work)

Multi-indexes are created from the stacked coordinate variables and not
anymore from the unstacked dimension indexes (level product). There's a
significant decrease in performance but it is probably acceptable since
it's now possible to avoid the creation of multi-indexes.

It is possible to stack a dimension with a multi-index, but it drops the
index. Otherwise it would make it hard for unstack() to figure out
what's going on. It makes it clear that this is an irreversible
operation.

* stack: better rule for add/skip multi-index

It is more robust and allows creating a multi-index from
non-default (non-dimension) coordinates, as long as there's one and only
one 1-d indexed coordinate for each dimension to stack.

* add utility methods to class Indexes

This removes some ugly & duplicate patterns.

* re-arrange class Indexes internals

* wip refactor stack / unstack

stack: revert to old way of creating multi-index

unstack: support non-default (non-dimension) multi-index (as long as
there is exactly one multi-index per specified dimension)

* add PandasMultiIndex.from_product_variables

Refactored utils.multiindex_from_product_levels utility function
into a new PandasMultiIndex classmethod.

* unstack: propagate index coordinate metadata

* wip: fix/update tests

* fix/refactor reindex

* functools.cached_property is since py38

Use the good old way as we still support py37

* update reset_coords

* fix ipython key completion test

make sure the stacked dataset used has a multi-index

* update test_map_index_queries

* do not coerce bool indexer as float coord dtype

Fixes #5727

* add Index.create_variables() method

This will probably be used instead of including index coordinate
variables in the signature (return type) of many methods of ``Index``.

It has several advantages:

- More DRY, and for custom indexes that do not need to create coordinate
variables with special data adapters, it's easier to just skip
implementing this method and not bother with returning empty dicts in
other methods

- This allows to decouple index vs. coordinates creation. For many cases
this can be done at the same time but for some cases like alignment this
is useful

- It's a more elegant solution when we need to propagate
metadata (attrs, encoding)

* improve Dataset/DataArray indexes proxy class

One possible solution to the recurring problem of accessing coordinate
variables related to a given index in some of Xarray's internals.

I feel that the ``Indexes`` proxy is the right place for storing
references to those indexed coordinate variables. It's safer than
relying on a public attribute/property of ``Index``.

* align Indexes API with Coordinates API

Expose similar properties:
- variables
- dims

* clean-up formatting using updated Indexes API

* wip: refactor alignment

* wip refactor alignment

Almost done rewriting `align`.

* tweaks and fixes

- Some fixes and clean-up in new implementation of align
- Add Index.join method (only for inner/outer join)
- Tweak index generic types
- IndexVars type: use Variable instead of IndexVariable (IndexVariable
  may eventually be dropped)

* wip refactor alignment and reindex

Imported all re-indexing logic into the `Alignator` class, which can be
reused directly for both `align()` and `obj.reindex()`.

TODO:
- import override indexes into `Alignator`
- connect Alignator to public API entry points
- clean-up (remove old implementation functions)

* wip refactor alignment / reindex: fixes and tweaks

* wip refactor alignment (support join='override')

* wip alignment: clean-up + tests

I will fix mypy errors later.

* refactor alignment fix tests & types annotations

All `*align*` and `*reindex*` tests are passing! (locally)
Mypy doesn't complain.

TODO: refactor `interp` and `interp_like` (I only plan to fix it with
the new Aligner class for now)

* refactor interp and interp_like

It now reuses the new alignment/reindex internals, but it still only
supports dimension coordinates with a single index,

* wip review merge

* refactor swap_dims

* refactor isel

* add create_index option to stack

* add Index.stack and Index.unstack methods

Also added a `index_cls` parameter to `Dataset.stack` and
`DataArray.stack`.

Custom indexes may thus implement their own logic for stack/unstack,
with the current limitation that a `pandas.MultiIndex` is still
explicitly required for unstack optimized versions.

* fix PandasIndex.isel with scalar indexers

Fix concat tests and some groupby tests

* fix index adapter (variable) dtype

* more indexes invariants checks

Check consistency between PandasIndex objects and
coordinate variables data adapters (PandasIndexingAdapter).

* fix DataArray.isel()

* refactor Dataset/DataArray copy

Filter indexes and preserve unique index objects for multi-coordinate
indexes.

Also fixed indexes tests (stack/unstack).

* refactor computation (and merge)

* minor fixes and tweaks

* propagate_indexes -> filter_indexes_from_coords

We can't just look at excluded dimensions anymore...

* Update xarray/core/coordinates.py

* refactor expand_dims

* merge: avoid compare same indexes more than once

* wip refactor update coords

Check in merge that prioritized elements won't corrupt collected
indexes.

* refactor update/remove coords

Fixes and tweaks
Refactored assign, update, setitem, delitem, drop_vars
Added or updated tests

* misc. fixes

* fix Dataset,__delitem__

* fix Dataset.reset_coords (default coords)

* refactor Dataset.from_dataframe

* Fix .sel with DataArray and multi-index

* PandasIndex.from_variables: preserve wrapped index

Prevent converting special index types like ``pd.CategoricalIndex`` when
such objects are wrapped in xarray variables.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* two minor fixes after merging main

* filter_indexes_from_coords: preserve index order

* implicit multi-index from coord: fix edge case

* groupby combine: fix missing index in result

When coord is None

* implicit multi-index coord: more robust fix

Also cover the cases where indexes are equal but not identical (caused a
couple of tests failing). Perfs may not be a concern for such edge case?

* backward compat fix: multi-index given as data-var

... in Dataset constructor

* refactor concat

Summary:

- Add the ``Index.concat()`` class method with implementations for
  ``PandasIndex`` and ``PandasMultiIndex`` (adapted from
  ``IndexVariable.concat()``).

- Use ``Index.concat()`` to create new indexes (and their corresponding
  variables) in ``_dataset_concat``. Fallback to variable concatenation
  (with default index) when no implementation is given for
  ``Index.concat`` or when no all the variables have an index.

- Refactored merging the other variables (also merge the indexes) in
  ``_dataset_concat``.

This refactor is incomplete. It should work with Pandas (multi-)indexes
but it is likely that it won't work with meta-indexes involving multiple
dimensions. We probably need to update ``_calc_concat_over`` to take
into account such meta-indexes and their related coordinates.

Other limitation (?) we use here the index of the 1st dataset for the
concatenation (i.e., ``Index.concat``). No specific check is made on
the type and/or coordinates of the other indexes.

* sel drop=True: remove multi-index scalar coords

* PandasIndex.from_variables(multi-index level var)

Make sure it gets the level index (not the whole multi-index).

* reindex: disable invalid dimension check

From the docstrings: mis-matched dimensions are simply ignored

* add index concat tests + fix positions type

* add Indexes.get_all_dims convenient method

* refactor pad

* unstack: return copies of mindex.levels

Fix error when trying to set the name of the extracted level indexes.

* strip_units: prevent index->array conversion

which caused alignment conflicts due to multi-index
variable converted to non-index variable.

* attach_units: do not preserve coord multi-indexes

To prevent conflicts when trying to implicitly create
multi-index coordinates.

* concat: avoid multiple loads of dask arrays

* groupby mindex coord: propagate level names

* Dataset.copy: don't reset indexes if data is given

The `data` parameter accepts new data for data variables only, it won't
affect the indexes.

* reindex/align: fix coord dtype of new indexes

Use `as_compatible_data` to get the right dtype to assign to the new
index coordinates created from reindex indexers.

* fix doctests

* to_stacked_array: do not coerce level dtypes

If the intent was to propagate the original coordinate dtypes, this is
now handled by PandasMultiIndex.stack.

This fixes the case where multi-index ``.levels`` do not return labels
with "nan" values (if any) but where the coordinate does, which resulted
in dtype mismatch between the coordinate variable dtype (e.g., coerced
to int64) and the actual label values (float64 due to nan values),
eventually causing numpy errors trying to coerce nan value to int64 when
indexing the coordinate.

* fix indent

* doc: fix user-guide build errors

* stack: fix new index variables not in coordinates

* unstack full-reindex: fix alignment errors

We need a mapping of all index coordinate names to the full index
so that Aligner can find matching indexes.

(Note: reindex may eventually be refactored to be a bit more clever so
that providing only a `{dim: indexer}` will be enough in this case?)

* PandasIndex coord dtype: avoid convert index->array

* refactor diff

* quick fix level coord dtype int32/64 on win

* dask presist/compute dataarray: propagate indexes

* refactor roll

* rename Index.query -> Index.sel

Also rename:
- QueryResult -> IndexSelResult
- merge_query_results -> merge_sel_results

* remove Index.union and Index.intersection

Those are not used elsewhere.
``Index.join`` is used instead for alignment.

* use future annotations in indexes.py

* Index.rename: return only the new index

Create new coordinate variables using Index.create_variables
instead (get rid of PandasIndex.from_pandas_index).

* Index.stack: return only the new index

Create new coordinate variables using Index.create_variables
instead (get rid of PandasIndex.from_pandas_index)

* PandasMultiIndex class methods: return only the index

* wip get rid of Pandas(Multi)Index.from_pandas_index

* remove Pandas(Multi)Index.from_pandas_index

* Index.from_variables: return the new index only

Use Index.create_variables to get the new coordinate variables.

* rename: propagate_attrs_encoding not needed

Index.create_variables already propagates coordinate variable metadata.

* align exclude dims: pass through indexes

* fix set_index append=True

Level variables may have updated names in this case: single index + one
level.

* refactor default_indexes and re-enable invariant check

Default indexes invariant check is now optional (enabled by default).

* fix formatting errors

* fix type

* Dataset._indexes, DataArray._indexes: always dict

Default indexes are created by the default constructors. Indexes are
always passed explicitly by the fastpath constructors.

* remove _level_coords property

It was only used for plotting and has been replaced by simpler logic.

* clean-up

Use ``_indexes`` instead of ``.xindexes`` when possible, for speed.
``xindexes`` is not cached: it would return inconsistent results if
updating the object in-place.

Fix some types.

* optimize Dataset/DataArray copy

Avoid copying pandas indexes twice (indexes + coordinate variables)

* add default implementation for Index.copy

Better than raising an error?

* add/fix indexes tests

* tweak indexes formatting

This will need more work (in a follow-up PR) to be consistent with the
other data model components (i.e., coordinates and data variables): add
summary (inline) and detailed reprs, maybe group by coordinates, etc.

* fix doctests

* PandasIndex.copy: avoid too many pd.Index copies

* DataArray stack test: revert to original

Now that default (range) indexes are created again (create_index=True).

* test indexes/indexing: rename query -> sel

* alignment: use future annotations

* misc. tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Aligner tweaks

Access aligned objects through the ``.result`` attribute.

Enable type checking in internal methods.

* remove assert_unique_multiindex_level_names

It is not needed anymore as multi-indexes have their own coordinates and
we now check for possible name conflicts when "unpacking" multi-index
levels at Dataset / DataArray creation.

* remove propagate_attrs_encoding

Only used in one place.

* assert -> check + ValueError

* misc. fixes and tweaks

* update what's new

* concat: remove fall-backs to Variable.concat

Raise an error instead when trying to concat a mix of indexed and
un-indexed coordinates or when the index doesn't support concat.

We still support the concatenation of a mix of scalar and dimension
indexed coordinates by creating a PandasIndex on-the-fly for scalar
coordinates.

* fix flaky test

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: keewis <keewis@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.sel method gives error with float32 values
4 participants