[REVIEW] Optimize `DataFrame` creation across code-base #10236

galipremsagar · 2022-02-07T18:21:05Z

After internal profiling, it appears that _init_from_dict_like was being repeatedly called even for ColumnAccessor inputs, which is not necessary and is an expensive method where we do proper index re-alignment. This PR handles by raising an error when a developer tries to create a DataFrame with ColumnAccessor, and also changes multiple of those instances where such calls are present to _from_data call which will now avoid going through _init_from_dict_like method for ColumnAccessor input. For a dataframe of shape 30_00_000 x 3 the speed up is about 2.5x to 4x.

In [1]: import cudf

In [2]: x = cudf.DataFrame(
   ...:     {
   ...:         "a": [1, 2, 3] * 10_00_000,
   ...:         "b": [3, 4, 5] * 10_00_000,
   ...:         "c": ["a", "b", "c"] * 10_00_000,
   ...:     },
   ...:     index=list(range(0, 3000000)),
   ...: )

In [3]: x
Out[3]: 
         a  b  c
0        1  3  a
1        2  4  b
2        3  5  c
3        1  3  a
4        2  4  b
...     .. .. ..
2999995  2  4  b
2999996  3  5  c
2999997  1  3  a
2999998  2  4  b
2999999  3  5  c

[3000000 rows x 3 columns]

In [4]: x.index
Out[4]: 
Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            2999990, 2999991, 2999992, 2999993, 2999994, 2999995, 2999996,
            2999997, 2999998, 2999999],
           dtype='int64', length=3000000)

On branch-22.04:

In [3]: %timeit cudf.DataFrame(x._data, x._index)
35.4 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: %timeit cudf.DataFrame(x._data)
40.6 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This PR:

In [4]: %timeit cudf.DataFrame._from_data(x._data, x._index)
8.11 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [3]: %timeit cudf.DataFrame._from_data(x._data)
16.9 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

galipremsagar · 2022-02-07T18:21:23Z

cc: @randerzander @ayushdg

vyasr · 2022-02-07T18:28:46Z

Why do we need to support this constructor? Can we not just use DataFrame._from_data instead? I'm generally against adding more support for constructing objects with internal classes like ColumnAccessor. If the use case is in dask-cudf I think it's fine to use internal cudf methods; we do that anyway. If the use case is at a higher level, I'd want to understand exactly what it's doing. Is there a reason not to simply use x.copy()?

vyasr · 2022-02-07T19:09:59Z

To be clear, there are a ton of places in cudf that are doing inefficient things right now when it comes to materializing more intermediates than we need. We need to do a comprehensive review to remove things like this.

galipremsagar · 2022-02-07T19:49:15Z

After an offline discussion with @vyasr we decided to raise an error if someone passes a ColumnAccessor to DataFrame/Series constructors thus forcing devs to use a much faster Series/DataFrame._from_data method. I have also now changed the internal code that was using the ineffecient code path before.

codecov · 2022-02-07T20:27:03Z

Codecov Report

Merging #10236 (b565aaa) into branch-22.04 (a7d88cd) will increase coverage by 0.04%.
The diff coverage is 0.00%.

@@               Coverage Diff                @@
##           branch-22.04   #10236      +/-   ##
================================================
+ Coverage         10.42%   10.47%   +0.04%     
================================================
  Files               119      122       +3     
  Lines             20603    20506      -97     
================================================
- Hits               2148     2147       -1     
+ Misses            18455    18359      -96

Impacted Files	Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_fuzz_testing/io.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_fuzz_testing/main.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/comm/gpuarrow.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/_base_index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/categorical.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/column.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/datetime.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/column/methods.py	`0.00% <0.00%> (ø)`
... and 48 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e458b9...b565aaa. Read the comment docs.

vyasr

Awesome happy with these changes.

bdice

Nice. Perf improvements look awesome, too!

galipremsagar · 2022-02-07T21:59:02Z

@gpucibot merge

handle ColumnAccessor inputs in dataframe constructor

ded9906

galipremsagar added cuDF (Python) Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 7, 2022

galipremsagar added this to PR-WIP in v22.04 Release via automation Feb 7, 2022

galipremsagar requested a review from a team as a code owner February 7, 2022 18:21

galipremsagar self-assigned this Feb 7, 2022

galipremsagar requested review from trxcllnt and shwina February 7, 2022 18:21

galipremsagar moved this from PR-WIP to PR-Needs review in v22.04 Release Feb 7, 2022

change constructor calls

b565aaa

galipremsagar changed the title ~~[REVIEW] Optimize DataFrame constructor for ColumnAccessor inputs~~ [REVIEW] Optimize DataFrame creation across code-base Feb 7, 2022

vyasr approved these changes Feb 7, 2022

View reviewed changes

v22.04 Release automation moved this from PR-Needs review to PR-Reviewer approved Feb 7, 2022

bdice approved these changes Feb 7, 2022

View reviewed changes

galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuDF (Python) Reviewer labels Feb 7, 2022

rapids-bot bot merged commit 8a88490 into rapidsai:branch-22.04 Feb 7, 2022

v22.04 Release automation moved this from PR-Reviewer approved to Done Feb 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[REVIEW] Optimize `DataFrame` creation across code-base #10236

[REVIEW] Optimize `DataFrame` creation across code-base #10236

galipremsagar commented Feb 7, 2022 •

edited

galipremsagar commented Feb 7, 2022

vyasr commented Feb 7, 2022

vyasr commented Feb 7, 2022

galipremsagar commented Feb 7, 2022

codecov bot commented Feb 7, 2022 •

edited

vyasr left a comment

bdice left a comment

galipremsagar commented Feb 7, 2022

[REVIEW] Optimize DataFrame creation across code-base #10236

[REVIEW] Optimize DataFrame creation across code-base #10236

Conversation

galipremsagar commented Feb 7, 2022 • edited

galipremsagar commented Feb 7, 2022

vyasr commented Feb 7, 2022

vyasr commented Feb 7, 2022

galipremsagar commented Feb 7, 2022

codecov bot commented Feb 7, 2022 • edited

Codecov Report

vyasr left a comment

Choose a reason for hiding this comment

bdice left a comment

Choose a reason for hiding this comment

galipremsagar commented Feb 7, 2022

[REVIEW] Optimize `DataFrame` creation across code-base #10236

[REVIEW] Optimize `DataFrame` creation across code-base #10236

galipremsagar commented Feb 7, 2022 •

edited

codecov bot commented Feb 7, 2022 •

edited