Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Feature/optimize accessor copy #7660

Merged
merged 16 commits into from Mar 23, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Mar 19, 2021

This PR makes a number of changes to resolve #7646:

  1. The underlying data store in ColumnAccessor has been converted to a dict. The conversion from an OrderedDict subclass to a raw dict is safe because, as of Python 3.7, dictionary ordering is guaranteed by the CPython standard.
  2. The validation previously defined by a pair of mixin classes that were parents of the OrderedColumnDict class has instead been moved to a new method that is called in ColumnAccessor.set_by_label. This ensures that the validation only happens once, and dramatically speeds up most dict operations since they can now rely on the built-in C implementation of __setitem__ rather than Python overrides. Furthermore, this obviates the safety concerns raised in Unsafe usage of OrderedDict inheritance #7646 since we are no longer overriding these methods.
  3. The constructor of ColumnAccessor has been carefully optimized to handle validation with minimal overhead. Previously, this was handled by _data.__setitem__.
  4. The column length in a ColumnAccessor is stored the first time a column is added so that it doesn't have to be recomputed.
  5. A new parameter validate has been added to _set_by_label. This parameter is currently unused, but can be used in future refactorings to further speed up the code by bypassing validation steps when we are internally passing columns that we know to be safe.

The performance implications of these changes for shallow copying DataFrame objects are shown below. Specifically, I'm benchmarking df.copy(deep=False) for different numbers of columns and column sizes. As expected, these changes are largely invariant to the size of the columns and scale with the number of columns. The scaling isn't perfectly linear because of the constant overhead of copying the ColumnAccessor itself, but the speedup plateaus around 5x for >200 columns (which probably isn't too realistic anyway). Exactly how these speedups will propagate to operations like joins remains to be seen, but this should help since some crude benchmarking I did showed that copying can take up to 15-20% of the total time of a join depending on the data size. @shwina has better benchmarks there and I'll work with him on future changes. Here's the benchmarking notebook I used. Note that since Github doesn't support uploading notebooks I've changed the extension to txt, but it will work find to just rename it to a .ipynb file.

image

@vyasr vyasr requested a review from a team as a code owner March 19, 2021 22:51
@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label Mar 19, 2021
@vyasr vyasr added non-breaking Non-breaking change 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function Performance Performance related issue labels Mar 19, 2021
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work @vyasr!

Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@kkraus14 kkraus14 added this to PR-WIP in v0.19 Release via automation Mar 23, 2021
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Mar 23, 2021
@codecov
Copy link

codecov bot commented Mar 23, 2021

Codecov Report

Merging #7660 (0457650) into branch-0.19 (7871e7a) will increase coverage by 0.62%.
The diff coverage is n/a.

❗ Current head 0457650 differs from pull request most recent head 498b70e. Consider uploading reports for the commit 498b70e to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7660      +/-   ##
===============================================
+ Coverage        81.86%   82.49%   +0.62%     
===============================================
  Files              101      101              
  Lines            16884    17416     +532     
===============================================
+ Hits             13822    14367     +545     
+ Misses            3062     3049      -13     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/categorical.py 91.97% <ø> (+0.58%) ⬆️
python/cudf/cudf/core/column/column.py 87.86% <ø> (+0.10%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.63% <ø> (+0.54%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <ø> (-2.12%) ⬇️
python/cudf/cudf/core/column/lists.py 92.50% <ø> (+1.10%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.83% <ø> (-0.20%) ⬇️
python/cudf/cudf/core/column/string.py 86.79% <ø> (+0.30%) ⬆️
python/cudf/cudf/core/column/timedelta.py 88.57% <ø> (+0.33%) ⬆️
python/cudf/cudf/core/column_accessor.py 96.01% <ø> (+0.70%) ⬆️
python/cudf/cudf/core/dataframe.py 90.90% <ø> (+0.43%) ⬆️
... and 65 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c21bd0e...498b70e. Read the comment docs.

@kkraus14
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7f9f8f5 into rapidsai:branch-0.19 Mar 23, 2021
v0.19 Release automation moved this from PR-WIP to Done Mar 23, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 24, 2021
This PR introduces various small optimizations that should generally improve various common Python overhead. See #7454 (comment) for the motivation behind these optimizations and some benchmarks.

Merge after: #7660 

Summary:

* Adds a way to initialize a ColumnAccessor (_init_unsafe) without validating its input. This is useful when converting a `cudf::table` to a `Frame`, where we're guaranteed the columns are well formed
* Improved (faster) `is_numerical_dtype`
* Prioritize check for numeric dtypes in `astype()` and `build_column()`. Numeric types are presumably more common, and we can avoid expensive checks for other dtypes this way.

Authors:
  - Ashwin Srinath (@shwina)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7686
@vyasr vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021
@vyasr vyasr deleted the feature/optimize_accessor_copy branch March 9, 2022 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuDF (Python) Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue
Projects
No open projects
v0.19 Release
  
Done
Development

Successfully merging this pull request may close these issues.

Unsafe usage of OrderedDict inheritance
4 participants