Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Consolidate shared method between DataFrame and Series into Frame #9038

Closed
40 tasks done
isVoid opened this issue Aug 13, 2021 · 2 comments · Fixed by #11315
Closed
40 tasks done

[FEA] Consolidate shared method between DataFrame and Series into Frame #9038

isVoid opened this issue Aug 13, 2021 · 2 comments · Fixed by #11315
Assignees
Labels
improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.

Comments

@isVoid
Copy link
Contributor

isVoid commented Aug 13, 2021

DataFrame and Series shares much similarity in structure. Operations that works on DataFrame can work (without much adaptation) in Series context, thus, we should consolidate these methods into Frame to reduce duplication.

Caveats:

  • Frame base class is also inherited by Index class. This consolidation would allow extra methods that can be operated on Index. It should also be pointed out that operation that depends on the frame having an index cannot be consolidated (such as sort_index).
  • After consolidation, the method may have shared docstring for both DataFrame and Series. As pointed out by @vyasr in Refactor Frame scans #9021 (comment). Some Pandas API docs also does this by including examples for both dataframe and series in the same docstring section. It's generally more preferable to have less code duplication to have longer docstrings.

Potential candidates:

  • __array_ufunc__
  • _concat (DataFrame is by necessity substantially more complex due to the need to handle multiple columns, not all of which may exist in all inputs)
  • _n_largest_or_smallest
  • add
  • append
  • argsort
  • astype
  • describe (The _describe_categorical helper function needs access to value_counts functionality. Unless we want to port something similar to the column layer (and this would be difficult because it involves a groupby), we need the objects passed to this function (and therefore all other _describe_* helpers) to be Series, not Column, which means that the describe needs to remain implemented on Series and have DataFrame call the Series method per column).
  • drop
  • explode
  • floordiv
  • from_pandas (The requirements of pulling data out of a pandas Series vs a pandas DataFrame are different enough that we can't consolidate that much common code here)
  • groupby
  • iloc
  • index
  • isin (Decided against consolidation, the behavior of DataFrame is too different and Series/Index are one-liners)
  • iteritems
  • loc
  • memory_usage
  • mod
  • mode (Requires value_counts, same issue as describe)
  • mul
  • nlargest (merged to _n_largest_or_smallest)
  • nsmallest (merged to _n_largest_or_smallest)
  • pow
  • quantile (The Series method is quite simple and already uses _from_data, and most of the logic revolves around handling the return type so there isn't much overhead added when calling it from the DataFrame API. Attempting to combine them ends up resulting in longer code, especially since their docstrings have slightly different parameters).
  • radd
  • reset_index
  • rfloordiv
  • rmod
  • rmul
  • rpow
  • rsub
  • rtruediv
  • sort_values
  • sub
  • to_dict
  • to_pandas (This function relies on corresponding column methods that return Series objects. Since pandas is not based on a columnar store but rather a block manager, there is no analogous approach to initialize different type of pandas "Frame"s from a single type of data)
  • truediv
  • values
@isVoid isVoid added feature request New feature or request Needs Triage Need team to review and classify labels Aug 13, 2021
@isVoid isVoid added this to the CuDF Python Refactoring milestone Aug 13, 2021
@isVoid isVoid added Python Affects Python cuDF API. tech debt labels Aug 13, 2021
@vyasr
Copy link
Contributor

vyasr commented Aug 16, 2021

As of #9021 I moved forward with the pandas approach of including both Series and DataFrame examples in the same docstring (in Frame), so we can standardize on that approach for the methods mentioned in this issue.

@isVoid isVoid added improvement Improvement / enhancement to an existing function and removed feature request New feature or request labels Aug 16, 2021
@beckernick beckernick removed the Needs Triage Need team to review and classify label Aug 23, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 30, 2021
Partly addresses #9038 

This function consolidate several (trivial) functions from `Series` and `DataFrame` into Frame. `__invert__` was consolidated to shared (more efficient) code path using factory methods. 
`deserialize` was not consolidated because we have to provide backward compatibility to older classes. But factory method was used for faster class construction.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9059
rapids-bot bot pushed a commit that referenced this issue Oct 13, 2021
Partially solves #9038 

This PR moves binary ops from `Series` and `DataFrame` into `Frame` to reduce duplications. Because the arguments takes a different order in `series.py` and `dataframe.py` resort to `functools.wraps` to reorder the arguments while keeping the docstrings the same.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9357
rapids-bot bot pushed a commit that referenced this issue Oct 14, 2021
#9378)

This PR adds a new IndexedFrame subclass of Frame. For the moment this is a minimal class with one user-facing method and the indexer properties, but as we make progress on #9038 we are likely to find more methods that should live here. Additionally, once we remove references to Frame and Index types from the Cython layer of cudf we will be able to move all index-related logic from Frame into this class and provide a suitable ducktyped interface for methods that need to behave differently for frames depending on whether or not they have an index.

In the process I also moved SingleColumnFrame into its own module. One class per file is a good organizational pattern, especially for huge classes like these, and it can also help reduce issues with circular imports and improve mypy runtime.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #9378
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jan 8, 2022
Partial of #9038 

This is a rewrite of `reset_index` to share some common logics between `Series` and `DataFrame`. It extends it's capability to handle `level` argument for multi-level index, `col_level` and `col_fill` for multi-level column name support. And adds `name` argument support for series api.

Authors:
  - Michael Wang (https://github.com/isVoid)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9750
rapids-bot bot pushed a commit that referenced this issue Feb 4, 2022
This PR contributes to #9038, merging implementations of memory_usage and __iter__. It also removes an unnecessary override of take (saves about 25% of runtime for data <1M rows) and deprecates Series.merge, which doesn't exist in pandas.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #10167
rapids-bot bot pushed a commit that referenced this issue Feb 25, 2022
This PR builds on #10217 and #10287 to bring full ufunc support for Index types, expanding well beyond the small set previously supported in the `cudf.core.ops` namespace. By using most of the machinery introduced for IndexedFrame in the prior two PRs we avoid duplicating much logic so that all ufunc dispatches flow through a relatively standard path of known methods prior to a common cupy dispatch. With this change we are also able to deprecate the various ufunc operations defined in cudf/core/ops.py that exist only for this purpose as well as a number of Frame methods that are not defined for the corresponding pandas types. Users of those APIs are recommended to calling the corresponding numpy/cupy ufuncs instead to leverage the new dispatch.

This PR also fixes a bug where index binary operations that output booleans would previously return instances of GenericIndex, whereas those pandas operations would return numpy arrays. cudf now returns cupy arrays in those cases.

Resolves #9083. Contributes to #9038.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10346
rapids-bot bot pushed a commit that referenced this issue Mar 7, 2022
This PR contributes to #9038 by combining common code for `append`, `astype`, `drop`, and `explode`. It's not a small LOC change, but most of the changed lines are just moving code around. Some minor rewrites of `astype` in this PR do significantly speed up those calls (2x faster for <1M rows). Additionally, I've deprecated `append` since it is now deprecated in pandas. Note that it has not been removed in pandas, so we should not remove it until they do.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10381
rapids-bot bot pushed a commit that referenced this issue Jul 26, 2022
This PR resolves #9038. Not all the remaining functions could be consolidated (for various reasons), but the combination was explored and where necessary the reasons not to combine the functions was documented in that issue. Some additional cleanup that was done during the exploration is also included in this PR, but ultimately `groupby` is the only method that was moved up to `IndexedFrame`.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #11315
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants