[REVIEW] Feature/optimize accessor copy #7660
Conversation
…e the columns in the accessor.
LGTM! Nice work @vyasr!
Awesome!
Codecov Report
@@            Coverage Diff             @@
##          branch-0.19    #7660    +/- ##
===========================================
+ Coverage       81.86%   82.49%   +0.62%
===========================================
  Files             101      101
  Lines           16884    17416     +532
===========================================
+ Hits            13822    14367     +545
+ Misses           3062     3049      -13
Continue to review full report at Codecov.
@gpucibot merge
This PR introduces various small optimizations that should generally reduce common Python overhead. See #7454 (comment) for the motivation behind these optimizations and some benchmarks. Merge after: #7660

Summary:
* Adds a way to initialize a `ColumnAccessor` (`_init_unsafe`) without validating its input. This is useful when converting a `cudf::table` to a `Frame`, where we're guaranteed the columns are well formed.
* Improves (speeds up) `is_numerical_dtype`.
* Prioritizes the check for numeric dtypes in `astype()` and `build_column()`. Numeric types are presumably more common, and we can avoid expensive checks for other dtypes this way.

Authors:
- Ashwin Srinath (@shwina)

Approvers:
- Keith Kraus (@kkraus14)

URL: #7686
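The dtype-check reordering described in the summary can be sketched roughly as below. This is a hypothetical simplification, not the actual cuDF implementation: `is_numerical_dtype` and `build_column_kind` here are illustrative stand-ins, using a single NumPy `kind` lookup as the fast path.

```python
import numpy as np

def is_numerical_dtype(dtype):
    # Illustrative fast path: one NumPy kind check covers the common
    # integer/unsigned/float/boolean cases without string comparisons
    # or per-type isinstance chains.
    return np.dtype(dtype).kind in "iufb"

def build_column_kind(dtype):
    # Check the (presumably most common) numeric case first, so the
    # more expensive datetime/other checks are usually skipped.
    if is_numerical_dtype(dtype):
        return "numeric"
    if np.dtype(dtype).kind == "M":
        return "datetime"
    return "other"
```

The design point is simply ordering the branches by expected frequency: the cheap, common case is tested before rarer, costlier ones.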
This PR makes a number of changes to resolve #7646:

* `ColumnAccessor` has been converted to a `dict`. The conversion from an `OrderedDict` subclass to a raw `dict` is safe because, as of Python 3.7, dictionary insertion ordering is guaranteed by the Python language specification.
* The validation previously performed by the `OrderedColumnDict` class has instead been moved to a new method that is called in `ColumnAccessor.set_by_label`. This ensures that validation only happens once, and it dramatically speeds up most dict operations since they can now rely on the built-in C implementation of `__setitem__` rather than Python overrides. Furthermore, this obviates the safety concerns raised in "Unsafe usage of OrderedDict inheritance" (#7646), since we are no longer overriding these methods.
* `ColumnAccessor` has been carefully optimized to handle validation with minimal overhead. Previously, this was handled by `_data.__setitem__`.
* The column length of a `ColumnAccessor` is stored the first time a column is added so that it doesn't have to be recomputed.
* A `validate` parameter has been added to `_set_by_label`. This parameter is currently unused, but can be used in future refactorings to further speed up the code by bypassing validation steps when we are internally passing columns that we know to be safe.

The performance implications of these changes for shallow copying `DataFrame` objects are shown below. Specifically, I'm benchmarking `df.copy(deep=False)` for different numbers of columns and column sizes. As expected, these changes are largely invariant to the size of the columns and scale with the number of columns. The scaling isn't perfectly linear because of the constant overhead of copying the `ColumnAccessor` itself, but the speedup plateaus around 5x for >200 columns (which probably isn't too realistic anyway). Exactly how these speedups will propagate to operations like joins remains to be seen, but this should help: some crude benchmarking I did showed that copying can take up to 15-20% of the total time of a join, depending on the data size. @shwina has better benchmarks there and I'll work with him on future changes. Here's the benchmarking notebook I used. Note that since GitHub doesn't support uploading notebooks I've changed the extension to .txt, but it will work fine if you just rename it to a .ipynb file.