Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add support for DataFrame.from_dict\to_dict and Series.to_dict #12048

Merged
merged 25 commits into from
Nov 14, 2022

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Nov 2, 2022

Description

Resolves: #11934

  • Adds DataFrame.from_dict and DataFrame.to_dict
  • Adds Series.to_dict

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 2, 2022
@galipremsagar galipremsagar requested a review from a team as a code owner November 2, 2022 20:53
@galipremsagar galipremsagar self-assigned this Nov 2, 2022
@galipremsagar galipremsagar added this to PR-WIP in v22.12 Release via automation Nov 2, 2022
@galipremsagar galipremsagar moved this from PR-WIP to PR-Needs review in v22.12 Release Nov 2, 2022
@codecov
Copy link

codecov bot commented Nov 2, 2022

Codecov Report

Base: 87.47% // Head: 88.10% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (e15c7b2) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head e15c7b2 differs from pull request most recent head dc7eaa5. Consider uploading reports for the commit dc7eaa5 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #12048      +/-   ##
================================================
+ Coverage         87.47%   88.10%   +0.62%     
================================================
  Files               133      135       +2     
  Lines             21826    22133     +307     
================================================
+ Hits              19093    19500     +407     
+ Misses             2733     2633     -100     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/strings_udf/strings_udf/__init__.py 76.47% <0.00%> (-7.85%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
... and 43 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a lot more logic than I had anticipated. Out of curiosity, I copied this implementation into my current branch, then added a second method from_dict2 with an identical signature that is just defined as:

return cls.from_pandas(pd.DataFrame.from_dict(data, orient, dtype, columns))

Here are some quick benchmark results:

In [9]: d = {f'i': np.random.rand(1000) for i in range(20)}

In [10]: %timeit cudf.DataFrame.from_dict2(d)
279 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: %timeit cudf.DataFrame.from_dict(d)
315 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [12]: d = {f'i': np.random.rand(10) for i in range(20)}

In [13]: %timeit cudf.DataFrame.from_dict2(d)
277 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [14]: %timeit cudf.DataFrame.from_dict(d)
307 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [15]: d = {f'i': np.random.rand(10000) for i in range(50)}

In [16]: %timeit cudf.DataFrame.from_dict2(d)
310 µs ± 322 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [17]: %timeit cudf.DataFrame.from_dict(d)
316 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Is there any benefit to the current, much more complex approach rather than just using the pandas fallback in this case? The complexity of our constructor is high enough that it seems like bypassing it using from_pandas recoups more than enough performance, and this way we really don't have to think about the semantics much at all.

@galipremsagar
Copy link
Contributor Author

Is there any benefit to the current, much more complex approach rather than just using the pandas fallback in this case? The complexity of our constructor is high enough that it seems like bypassing it using from_pandas recoups more than enough performance, and this way we really don't have to think about the semantics much at all.

So, yes this is for sure quicker for CPU-backed data..I had to vendor the entire logic from pandas and do a special handling to support cudf.Series as values in the dict inputs, like going through pandas API will fail because of an implicit cast:

In [1]: import cudf

In [2]: def v2(data):
   ...:     return cudf.from_pandas(pd.DataFrame.from_dict(data))
   ...: 

In [3]: import pandas as pd

In [4]: data = {"a": cudf.Series([1, 2, 3, 4]), "B": cudf.Series([10, 11, 12, 13])}

In [5]: v2(data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 v2(data)

Cell In [2], line 2, in v2(data)
      1 def v2(data):
----> 2     return cudf.from_pandas(pd.DataFrame.from_dict(data))

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/frame.py:1762, in DataFrame.from_dict(cls, data, orient, dtype, columns)
   1756     raise ValueError(
   1757         f"Expected 'index', 'columns' or 'tight' for orient parameter. "
   1758         f"Got '{orient}' instead"
   1759     )
   1761 if orient != "tight":
-> 1762     return cls(data, index=index, columns=columns, dtype=dtype)
   1763 else:
   1764     realdata = data["data"]

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/frame.py:662, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    656     mgr = self._init_mgr(
    657         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    658     )
    660 elif isinstance(data, dict):
    661     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 662     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    663 elif isinstance(data, ma.MaskedArray):
    664     import numpy.ma.mrecords as mrecords

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:123, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    120         index = ensure_index(index)
    122     # don't force copy because getting jammed in an ndarray anyway
--> 123     arrays = _homogenize(arrays, index, dtype)
    124     # _homogenize ensures
    125     #  - all(len(x) == len(index) for x in arrays)
    126     #  - all(x.ndim == 1 for x in arrays)
   (...)
    129 
    130 else:
    131     index = ensure_index(index)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/internals/construction.py:617, in _homogenize(data, index, dtype)
    614             val = dict(val)
    615         val = lib.fast_multiget(val, oindex._values, default=np.nan)
--> 617     val = sanitize_array(
    618         val, index, dtype=dtype, copy=False, raise_cast_failure=False
    619     )
    620     com.require_length_match(val, index)
    622 homogenized.append(val)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/pandas/core/construction.py:616, in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
    613 # materialize e.g. generators, convert e.g. tuples, abc.ValueView
    614 if hasattr(data, "__array__"):
    615     # e.g. dask array GH#38645
--> 616     data = np.array(data, copy=copy)
    617 else:
    618     data = list(data)

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/frame.py:447, in Frame.__array__(self, dtype)
    446 def __array__(self, dtype=None):
--> 447     raise TypeError(
    448         "Implicit conversion to a host NumPy array via __array__ is not "
    449         "allowed, To explicitly construct a GPU matrix, consider using "
    450         ".to_cupy()\nTo explicitly construct a host matrix, consider "
    451         "using .to_numpy()."
    452     )

TypeError: Implicit conversion to a host NumPy array via __array__ is not allowed, To explicitly construct a GPU matrix, consider using .to_cupy()
To explicitly construct a host matrix, consider using .to_numpy().

@@ -474,10 +474,9 @@ def from_dict(data, npartitions, orient="columns", **kwargs):

if orient != "columns":
raise ValueError(f"orient={orient} is not supported")
# TODO: Use cudf.from_dict
# (See: https://github.com/rapidsai/cudf/issues/11934)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @rjzamora for review

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if 1) we could leverage the pandas method directly in cases where we know that it would be faster, and 2) if we didn't have to handle so many edge cases before devolving to the constructor, but I don't see how to do (1) without doing introspection that will slow down other code paths and I think (2) seems necessary. If other reviewers have more ideas I would welcome them, but I think I'm good with this code as is.

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@wence-
Copy link
Contributor

wence- commented Nov 10, 2022

It would be nice if 1) we could leverage the pandas method directly in cases where we know that it would be faster, and 2) if we didn't have to handle so many edge cases before devolving to the constructor, but I don't see how to do (1) without doing introspection that will slow down other code paths and I think (2) seems necessary. If other reviewers have more ideas I would welcome them, but I think I'm good with this code as is.

How about this rewrite:

    @staticmethod
    def __create_index(indexlist, namelist, library):
        try:
            name, = namelist
            return library.Index(indexlist, name=name)
        except ValueError:
            return library.MultiIndex.from_tuples(
                indexlist, names=namelist
            )
            
    @classmethod
    def from_dict(cls, data, orient="columns", dtype=None, index=None, columns=None):
        if orient == "index":
            if len(data) > 0 and isinstance(next(iter(data.values())), cudf.Series):
                data = cls(data).T
                return cls(data, index=None, columns=columns, dtype=dtype)
            else:
                return cls.from_pandas(
                    pd.DataFrame.from_dict(
                        data=data,
                        orient=orient,
                        dtype=dtype,
                        index=index,
                        columns=columns,
                    )
                )
        elif orient == "columns":
            if columns is not None:
                raise ValueError("Cannot use columns parameter with orient='columns'")
            return cls(data, index=index, columns=None, dtype=dtype)
        elif orient == "tight":
            if columns is not None:
                raise ValueError("Cannot use columns parameter with orient='right'")
            index = cls.__create_index(data["index"], data["index_names"], cudf)
            columns = cls.__create_index(data["columns"], data["column_names"], pd)
            return cls(data["data"], index=index, columns=columns, dtype=dtype)
        else:
            raise ValueError(f"Expected 'index', 'columns', or 'tight' got {orient=}")

Which (personally) I find a bit clearer. It simplifies the dict transpose in the case where the values are not cudf.Series by just deferring to pandas. This may, however, be bad if the values are provided as cupy arrays (I think).

Other changes (might be worthwhile anyway): handle each orient case separately with early returns (which I think makes the logic a bit easier to read); define __create_index as a staticmethod since it's not closing over any variables.

One could imagine doing some introspection of the data in each of the cases, but it's probably not worth it. I would hope that from_dict is not on the hot path of an analysis.

@jakirkham
Copy link
Member

This may, however, be bad if the values are provided as cupy arrays (I think).

Think this is a use case we should prepare for (and ideally test against)

@galipremsagar
Copy link
Contributor Author

galipremsagar commented Nov 11, 2022

@wence- @jakirkham Addressed your reviews. Could you please re-review/approve when you get a chance?

@galipremsagar
Copy link
Contributor Author

galipremsagar commented Nov 11, 2022

Also cc: @vyasr since the code has changed.

v22.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Nov 14, 2022
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 14, 2022
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] cudf.DataFrame.from_dict
5 participants