[BUG] dask_cudf loc operation fails with cudf Series and cupy array #11877

Closed
VibhuJawa opened this issue Oct 7, 2022 · 1 comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working dask Dask issue

Comments

@VibhuJawa
Member

Describe the bug
The `loc` operation on a dask_cudf DataFrame fails when the indexer is a cudf Series or a cupy array.

Steps/Code to reproduce bug

import dask_cudf
import cudf
import pandas as pd
import numpy as np


df = cudf.DataFrame({'a':np.arange(0,100, dtype=np.int32),
                     'b':np.arange(0,100, dtype=np.int32)
                    })
df = dask_cudf.from_cudf(df,10)
df = df.set_index('b')

df.loc[cudf.Series([0,10,50], dtype=np.int32)]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [45], line 1
----> 1 df.loc[ar]

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev_oct_5/lib/python3.9/site-packages/dask/dataframe/indexing.py:103, in _LocIndexer.__getitem__(self, key)
    101     iindexer = key
    102     cindexer = None
--> 103 return self._loc(iindexer, cindexer)

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev_oct_5/lib/python3.9/site-packages/dask/dataframe/indexing.py:125, in _LocIndexer._loc(self, iindexer, cindexer)
    122         return self._loc_list(iindexer.values, cindexer)
    123     else:
    124         # element should raise KeyError
--> 125         return self._loc_element(iindexer, cindexer)
    126 else:
    127     if isinstance(iindexer, (list, np.ndarray)) or (
    128         is_series_like(iindexer) and not is_bool_dtype(iindexer.dtype)
    129     ):
    130         # applying map_partitions to each partition
    131         # results in duplicated NaN rows

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev_oct_5/lib/python3.9/site-packages/dask/dataframe/indexing.py:193, in _LocIndexer._loc_element(self, iindexer, cindexer)
    191 def _loc_element(self, iindexer, cindexer):
    192     name = "loc-%s" % tokenize(iindexer, self.obj)
--> 193     part = self._get_partitions(iindexer)
    195     if iindexer < self.obj.divisions[0] or iindexer > self.obj.divisions[-1]:
    196         raise KeyError("the label [%s] is not in the index" % str(iindexer))

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev_oct_5/lib/python3.9/site-packages/dask/dataframe/indexing.py:216, in _LocIndexer._get_partitions(self, keys)
    213     return _partitions_of_index_values(self.obj.divisions, keys)
    214 else:
    215     # element
--> 216     return _partition_of_index_value(self.obj.divisions, keys)

File /datasets/vjawa/miniconda3/envs/cugraph_dgl_dev_oct_5/lib/python3.9/site-packages/dask/dataframe/indexing.py:327, in _partition_of_index_value(divisions, val)
    325     raise ValueError(msg)
    326 val = _coerce_loc_index(divisions, val)
--> 327 i = bisect.bisect_right(divisions, val)
    328 return min(len(divisions) - 2, max(0, i - 1))

File cupy/_core/core.pyx:1238, in cupy._core.core._ndarray_base.__nonzero__()

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
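
The last frames of the traceback suggest the indexer was routed to the scalar ("element") path, where `bisect.bisect_right` compares each division against the whole array and then tries to reduce the elementwise result to a single boolean. A minimal sketch of that failure mode follows. It uses a NumPy array as a stand-in for the cupy array so it runs without a GPU, and the `divisions` tuple and the `partition_of` helper are illustrative assumptions, not dask's actual code:

```python
import bisect

import numpy as np

# Hypothetical divisions for a 10-partition frame indexed 0..99 (assumption).
divisions = (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99)


def partition_of(divisions, val):
    """Simplified copy of the scalar lookup shown in the traceback."""
    i = bisect.bisect_right(divisions, val)
    return min(len(divisions) - 2, max(0, i - 1))


# A scalar label maps cleanly to one partition.
scalar_partition = partition_of(divisions, 50)

# An array-valued label makes each comparison inside bisect return an
# array, and converting that array to a single bool raises ValueError.
try:
    partition_of(divisions, np.array([0, 10, 50], dtype=np.int32))
    array_error = None
except ValueError as exc:
    array_error = str(exc)

print(scalar_partition)
print(array_error)
```

Under these assumptions, the array case raises the same kind of "truth value ... is ambiguous" error as in the report, which is consistent with the key having been treated as a single element rather than as a list of labels.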

Expected behavior
I would expect this to work as it does for a NumPy array or a pandas Series. See below.

df.loc[np.asarray([0,10,50], dtype=np.int32)].compute()
df.loc[pd.Series([0,10,50], dtype=np.int32)].compute()
	a
b	
0	0
10	100
50	500

Additional context
The only workaround is to convert the indexer to a pandas Series, a NumPy array, or a dask_cudf Series.
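
One way to apply the host-side workaround is to copy the indexer to a NumPy array before passing it to `loc`. The helper below is a hypothetical sketch, not part of cudf, cupy, or dask. It relies on duck typing and assumes that series-like objects (cudf or pandas Series) expose `to_numpy()` and that device arrays (such as `cupy.ndarray`) expose `get()` to copy to host memory:

```python
import numpy as np


def to_host_indexer(key):
    """Return a NumPy copy of `key` for use as a `.loc` indexer.

    Hypothetical helper: the name and the duck-typing checks are
    assumptions made for illustration.
    """
    if hasattr(key, "to_numpy"):
        # Series-like objects (cudf.Series, pandas.Series).
        return key.to_numpy()
    if hasattr(key, "get"):
        # Device arrays such as cupy.ndarray; get() copies device -> host.
        return key.get()
    # Lists, tuples, and NumPy arrays.
    return np.asarray(key)


# Intended usage on a GPU machine (not executed here):
# df.loc[to_host_indexer(cudf.Series([0, 10, 50], dtype=np.int32))].compute()
```

This copies the labels from device to host, so it is only a stopgap for small indexers; it does not address the underlying dispatch in dask.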

@VibhuJawa VibhuJawa added bug Something isn't working Needs Triage Need team to review and classify dask Dask issue labels Oct 7, 2022
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Oct 7, 2022
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this issue Oct 11, 2022
This PR fixes the following errors that have popped up in MNMG testing.
- [x] Fix out-of-index keys on MNMG graphs
- [x] Fix loc/get_node_storage error on MNMG graphs
(Work around rapidsai/cudf#11877)
- [x] Clear cached properties when they become invalid
- [x] Remove 6 pytest skips, as both of these PRs have landed
- #2751
- #2523
- [x] Change `vertex_col_names` to `node_col_names` to match DGL
- [x] Ensure MNMG tests pass
- [x] Work around the PG bug and also prevent redundant conversion to lists

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)
  - Erik Welch (https://github.com/eriknw)

URL: #2786
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment and removed Needs Triage Need team to review and classify labels Oct 21, 2022
@quasiben
Member

quasiben commented Nov 9, 2022

resolved by dask/dask#9634

@quasiben quasiben closed this as completed Nov 9, 2022
Bug Squashing automation moved this from Needs prioritizing to Closed Nov 9, 2022