[REVIEW] Add support for `DataFrame.from_dict`\`to_dict` and `Series.to_dict` #12048

galipremsagar · 2022-11-02T20:53:37Z

Description

Resolves: #11934

Adds DataFrame.from_dict and DataFrame.to_dict
Adds Series.to_dict

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-11-02T23:13:20Z

Codecov Report

Base: 87.47% // Head: 88.10% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (e15c7b2) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head e15c7b2 differs from pull request most recent head dc7eaa5. Consider uploading reports for the commit dc7eaa5 to get more accurate results

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #12048      +/-   ##
================================================
+ Coverage         87.47%   88.10%   +0.62%     
================================================
  Files               133      135       +2     
  Lines             21826    22133     +307     
================================================
+ Hits              19093    19500     +407     
+ Misses             2733     2633     -100

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/interval.py	`85.45% <0.00%> (-9.10%)`	⬇️
python/cudf/cudf/io/text.py	`91.66% <0.00%> (-8.34%)`	⬇️
python/strings_udf/strings_udf/__init__.py	`76.47% <0.00%> (-7.85%)`	⬇️
python/cudf/cudf/core/_base_index.py	`81.28% <0.00%> (-4.27%)`	⬇️
python/cudf/cudf/io/json.py	`92.06% <0.00%> (-2.68%)`	⬇️
python/cudf/cudf/utils/utils.py	`89.91% <0.00%> (-0.69%)`	⬇️
python/cudf/cudf/core/column/timedelta.py	`90.17% <0.00%> (-0.58%)`	⬇️
python/cudf/cudf/core/column/datetime.py	`89.21% <0.00%> (-0.51%)`	⬇️
python/cudf/cudf/core/column/column.py	`87.96% <0.00%> (-0.46%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`73.72% <0.00%> (-0.41%)`	⬇️
... and 43 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

vyasr

This adds a lot more logic than I had anticipated. Out of curiosity, I copied this implementation into my current branch, then added a second method from_dict2 with an identical signature that is just defined as:

return cls.from_pandas(pd.DataFrame.from_dict(data, orient, dtype, columns))

Here are some quick benchmark results:

In [9]: d = {f'i': np.random.rand(1000) for i in range(20)}

In [10]: %timeit cudf.DataFrame.from_dict2(d)
279 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: %timeit cudf.DataFrame.from_dict(d)
315 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [12]: d = {f'i': np.random.rand(10) for i in range(20)}

In [13]: %timeit cudf.DataFrame.from_dict2(d)
277 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [14]: %timeit cudf.DataFrame.from_dict(d)
307 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [15]: d = {f'i': np.random.rand(10000) for i in range(50)}

In [16]: %timeit cudf.DataFrame.from_dict2(d)
310 µs ± 322 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [17]: %timeit cudf.DataFrame.from_dict(d)
316 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Is there any benefit to the current, much more complex approach rather than just using the pandas fallback in this case? The complexity of our constructor is high enough that it seems like bypassing it using from_pandas recoups more than enough performance, and this way we really don't have to think about the semantics much at all.

galipremsagar · 2022-11-04T15:11:58Z

Is there any benefit to the current, much more complex approach rather than just using the pandas fallback in this case? The complexity of our constructor is high enough that it seems like bypassing it using from_pandas recoups more than enough performance, and this way we really don't have to think about the semantics much at all.

So, yes this is for sure quicker for CPU-backed data..I had to vendor the entire logic from pandas and do a special handling to support cudf.Series as values in the dict inputs, like going through pandas API will fail because of an implicit cast:

In [1]: import cudf

In [2]: def v2(data):
   ...:     return cudf.from_pandas(pd.DataFrame.from_dict(data))
   ...: 

In [3]: import pandas as pd

In [4]: data = {"a": cudf.Series([1, 2, 3, 4]), "B": cudf.Series([10, 11, 12, 13])}

In [5]: v2(data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 v2(data)

Cell In [2], line 2, in v2(data)
      1 def v2(data):
----> 2     return cudf.from_pandas(pd.DataFrame.from_dict(data))

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/frame.py:1762, in DataFrame.from_dict(cls, data, orient, dtype, columns)
   1756     raise ValueError(
   1757         f"Expected 'index', 'columns' or 'tight' for orient parameter. "
   1758         f"Got '{orient}' instead"
   1759     )
   1761 if orient != "tight":
-> 1762     return cls(data, index=index, columns=columns, dtype=dtype)
   1763 else:
   1764     realdata = data["data"]

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/frame.py:662, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    656     mgr = self._init_mgr(
    657         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    658     )
    660 elif isinstance(data, dict):
    661     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 662     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    663 elif isinstance(data, ma.MaskedArray):
    664     import numpy.ma.mrecords as mrecords

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:123, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    120         index = ensure_index(index)
    122     # don't force copy because getting jammed in an ndarray anyway
--> 123     arrays = _homogenize(arrays, index, dtype)
    124     # _homogenize ensures
    125     #  - all(len(x) == len(index) for x in arrays)
    126     #  - all(x.ndim == 1 for x in arrays)
   (...)
    129 
    130 else:
    131     index = ensure_index(index)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:617, in _homogenize(data, index, dtype)
    614             val = dict(val)
    615         val = lib.fast_multiget(val, oindex._values, default=np.nan)
--> 617     val = sanitize_array(
    618         val, index, dtype=dtype, copy=False, raise_cast_failure=False
    619     )
    620     com.require_length_match(val, index)
    622 homogenized.append(val)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/construction.py:616, in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
    613 # materialize e.g. generators, convert e.g. tuples, abc.ValueView
    614 if hasattr(data, "__array__"):
    615     # e.g. dask array GH#38645
--> 616     data = np.array(data, copy=copy)
    617 else:
    618     data = list(data)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/frame.py:447, in Frame.__array__(self, dtype)
    446 def __array__(self, dtype=None):
--> 447     raise TypeError(
    448         "Implicit conversion to a host NumPy array via __array__ is not "
    449         "allowed, To explicitly construct a GPU matrix, consider using "
    450         ".to_cupy()\nTo explicitly construct a host matrix, consider "
    451         "using .to_numpy()."
    452     )

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

galipremsagar · 2022-11-04T15:37:27Z

python/dask_cudf/dask_cudf/backends.py

@@ -474,10 +474,9 @@ def from_dict(data, npartitions, orient="columns", **kwargs):

            if orient != "columns":
                raise ValueError(f"orient={orient} is not supported")
-            # TODO: Use cudf.from_dict
-            # (See: https://github.com/rapidsai/cudf/issues/11934)
+


cc: @rjzamora for review

python/dask_cudf/dask_cudf/backends.py

python/cudf/cudf/core/dataframe.py

vyasr

It would be nice if 1) we could leverage the pandas method directly in cases where we know that it would be faster, and 2) if we didn't have to handle so many edge cases before devolving to the constructor, but I don't see how to do (1) without doing introspection that will slow down other code paths and I think (2) seems necessary. If other reviewers have more ideas I would welcome them, but I think I'm good with this code as is.

python/cudf/cudf/core/dataframe.py

Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>

wence- · 2022-11-10T12:50:03Z

It would be nice if 1) we could leverage the pandas method directly in cases where we know that it would be faster, and 2) if we didn't have to handle so many edge cases before devolving to the constructor, but I don't see how to do (1) without doing introspection that will slow down other code paths and I think (2) seems necessary. If other reviewers have more ideas I would welcome them, but I think I'm good with this code as is.

How about this rewrite:

    @staticmethod
    def __create_index(indexlist, namelist, library):
        try:
            name, = namelist
            return library.Index(indexlist, name=name)
        except ValueError:
            return library.MultiIndex.from_tuples(
                indexlist, names=namelist
            )
            
    @classmethod
    def from_dict(cls, data, orient="columns", dtype=None, index=None, columns=None):
        if orient == "index":
            if len(data) > 0 and isinstance(next(iter(data.values())), cudf.Series):
                data = cls(data).T
                return cls(data, index=None, columns=columns, dtype=dtype)
            else:
                return cls.from_pandas(
                    pd.DataFrame.from_dict(
                        data=data,
                        orient=orient,
                        dtype=dtype,
                        index=index,
                        columns=columns,
                    )
                )
        elif orient == "columns":
            if columns is not None:
                raise ValueError("Cannot use columns parameter with orient='columns'")
            return cls(data, index=index, columns=None, dtype=dtype)
        elif orient == "tight":
            if columns is not None:
                raise ValueError("Cannot use columns parameter with orient='right'")
            index = cls.__create_index(data["index"], data["index_names"], cudf)
            columns = cls.__create_index(data["columns"], data["column_names"], pd)
            return cls(data["data"], index=index, columns=columns, dtype=dtype)
        else:
            raise ValueError(f"Expected 'index', 'columns', or 'tight' got {orient=}")

Which (personally) I find a bit clearer. It simplifies the dict transpose in the case where the values are not cudf.Series by just deferring to pandas. This may, however, be bad if the values are provided as cupy arrays (I think).

Other changes (might be worthwhile anyway): handle each orient case separately with early returns (which I think makes the logic a bit easier to read); define __create_index as a staticmethod since it's not closing over any variables.

One could imagine doing some introspection of the data in each of the cases, but it's probably not worth it. I would hope that from_dict is not on the hot path of an analysis.

jakirkham · 2022-11-10T21:56:39Z

This may, however, be bad if the values are provided as cupy arrays (I think).

Think this is a use case we should prepare for (and ideally test against)

galipremsagar · 2022-11-11T15:44:05Z

@wence- @jakirkham Addressed your reviews. Could you please re-review/approve when you get a chance?

galipremsagar · 2022-11-11T15:44:54Z

Also cc: @vyasr since the code has changed.

python/cudf/cudf/core/frame.py

python/cudf/cudf/core/dataframe.py

galipremsagar · 2022-11-14T21:14:31Z

@gpucibot merge

galipremsagar added 3 commits November 2, 2022 13:21

add from_dict and to_dict

62472e7

Add Series.to_dict

83f29dd

fix docstring

c2b0c23

galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 2, 2022

galipremsagar requested a review from a team as a code owner November 2, 2022 20:53

galipremsagar self-assigned this Nov 2, 2022

galipremsagar added this to PR-WIP in v22.12 Release via automation Nov 2, 2022

galipremsagar requested review from wence- and charlesbluca November 2, 2022 20:53

galipremsagar moved this from PR-WIP to PR-Needs review in v22.12 Release Nov 2, 2022

vyasr reviewed Nov 3, 2022

View reviewed changes

Merge remote-tracking branch 'upstream/branch-22.12' into 11934

96fa517

user from_dict

cd5b23f

galipremsagar commented Nov 4, 2022

View reviewed changes

galipremsagar requested a review from rjzamora November 4, 2022 15:37

rjzamora reviewed Nov 4, 2022

View reviewed changes

python/dask_cudf/dask_cudf/backends.py Outdated Show resolved Hide resolved

galipremsagar added 2 commits November 4, 2022 09:36

Merge remote-tracking branch 'upstream/branch-22.12' into 11934

0834a89

add dask_cudf.from_dict

13fbe55

galipremsagar requested a review from a team as a code owner November 4, 2022 21:15

galipremsagar mentioned this pull request Nov 4, 2022

Generalize from_dict implementation to allow usage from other backends dask/dask#9628

Merged

3 tasks

galipremsagar requested a review from rjzamora November 4, 2022 21:29

galipremsagar added 4 commits November 7, 2022 05:45

import fix

02b88d1

Merge remote-tracking branch 'upstream/branch-22.12' into 11934

fb4d7f6

fix

f86a06b

fix

35c6f3f

galipremsagar commented Nov 8, 2022

View reviewed changes