[REVIEW] Feature/optimize accessor copy #7660
Conversation
…e the columns in the accessor.
LGTM! Nice work @vyasr!
Awesome!
Codecov Report
@@            Coverage Diff             @@
##          branch-0.19    #7660    +/- ##
===========================================
+ Coverage       81.86%   82.49%   +0.62%
===========================================
  Files             101      101
  Lines           16884    17416     +532
===========================================
+ Hits            13822    14367     +545
+ Misses           3062     3049      -13
Continue to review full report at Codecov.
@gpucibot merge
This PR introduces various small optimizations that should generally reduce common Python overhead. See #7454 (comment) for the motivation behind these optimizations and some benchmarks. Merge after: #7660

Summary:
* Adds a way to initialize a `ColumnAccessor` (`_init_unsafe`) without validating its input. This is useful when converting a `cudf::table` to a `Frame`, where we're guaranteed the columns are well formed.
* Improves (speeds up) `is_numerical_dtype`.
* Prioritizes the check for numeric dtypes in `astype()` and `build_column()`. Numeric types are presumably more common, and we can avoid expensive checks for other dtypes this way.

Authors:
- Ashwin Srinath (@shwina)

Approvers:
- Keith Kraus (@kkraus14)

URL: #7686
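The dtype-check reordering described in the summary can be sketched roughly as below. This is a hypothetical simplification, not the actual cuDF implementation: `is_numerical_dtype` and `build_column_kind` here are illustrative stand-ins, using a single NumPy `kind` lookup as the fast path.

```python
import numpy as np

def is_numerical_dtype(dtype):
    # Illustrative fast path: one NumPy kind check covers the common
    # integer/unsigned/float/boolean cases without string comparisons
    # or per-type isinstance chains.
    return np.dtype(dtype).kind in "iufb"

def build_column_kind(dtype):
    # Check the (presumably most common) numeric case first, so the
    # more expensive datetime/other checks are usually skipped.
    if is_numerical_dtype(dtype):
        return "numeric"
    if np.dtype(dtype).kind == "M":
        return "datetime"
    return "other"
```

The design point is simply ordering the branches by expected frequency: the cheap, common case is tested before rarer, costlier ones.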
This PR makes a number of changes to resolve #7646:

* `ColumnAccessor` has been converted to a `dict`. The conversion from an `OrderedDict` subclass to a raw `dict` is safe because, as of Python 3.7, dictionary insertion ordering is guaranteed by the Python language specification.
* The validation previously performed by the `OrderedColumnDict` class has instead been moved to a new method that is called in `ColumnAccessor.set_by_label`. This ensures that validation only happens once, and it dramatically speeds up most dict operations since they can now rely on the built-in C implementation of `__setitem__` rather than Python overrides. Furthermore, this obviates the safety concerns raised in "Unsafe usage of OrderedDict inheritance" (#7646), since we are no longer overriding these methods.
* `ColumnAccessor` has been carefully optimized to handle validation with minimal overhead. Previously, this was handled by `_data.__setitem__`.
* The column length of a `ColumnAccessor` is stored the first time a column is added so that it doesn't have to be recomputed.
* A `validate` parameter has been added to `_set_by_label`. This parameter is currently unused, but can be used in future refactorings to further speed up the code by bypassing validation steps when we are internally passing columns that we know to be safe.

The performance implications of these changes for shallow copying `DataFrame` objects are shown below. Specifically, I'm benchmarking `df.copy(deep=False)` for different numbers of columns and column sizes. As expected, these changes are largely invariant to the size of the columns and scale with the number of columns. The scaling isn't perfectly linear because of the constant overhead of copying the `ColumnAccessor` itself, but the speedup plateaus around 5x for >200 columns (which probably isn't too realistic anyway). Exactly how these speedups will propagate to operations like joins remains to be seen, but this should help: some crude benchmarking I did showed that copying can take up to 15-20% of the total time of a join, depending on the data size. @shwina has better benchmarks there and I'll work with him on future changes. Here's the benchmarking notebook I used. Note that since GitHub doesn't support uploading notebooks I've changed the extension to .txt, but it will work fine if you just rename it to a .ipynb file.