Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Optimize DataFrame creation across code-base #10236

Merged
merged 2 commits into from Feb 7, 2022

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Feb 7, 2022

After internal profiling, it appears that _init_from_dict_like was being repeatedly called even for ColumnAccessor inputs, which is not necessary and is an expensive method where we do proper index re-alignment. This PR handles by raising an error when a developer tries to create a DataFrame with ColumnAccessor, and also changes multiple of those instances where such calls are present to _from_data call which will now avoid going through _init_from_dict_like method for ColumnAccessor input. For a dataframe of shape 30_00_000 x 3 the speed up is about 2.5x to 4x.

In [1]: import cudf

In [2]: x = cudf.DataFrame(
   ...:     {
   ...:         "a": [1, 2, 3] * 10_00_000,
   ...:         "b": [3, 4, 5] * 10_00_000,
   ...:         "c": ["a", "b", "c"] * 10_00_000,
   ...:     },
   ...:     index=list(range(0, 3000000)),
   ...: )

In [3]: x
Out[3]: 
         a  b  c
0        1  3  a
1        2  4  b
2        3  5  c
3        1  3  a
4        2  4  b
...     .. .. ..
2999995  2  4  b
2999996  3  5  c
2999997  1  3  a
2999998  2  4  b
2999999  3  5  c

[3000000 rows x 3 columns]

In [4]: x.index
Out[4]: 
Int64Index([      0,       1,       2,       3,       4,       5,       6,
                  7,       8,       9,
            ...
            2999990, 2999991, 2999992, 2999993, 2999994, 2999995, 2999996,
            2999997, 2999998, 2999999],
           dtype='int64', length=3000000)

On branch-22.04:

In [3]: %timeit cudf.DataFrame(x._data, x._index)
35.4 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [4]: %timeit cudf.DataFrame(x._data)
40.6 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This PR:

In [4]: %timeit cudf.DataFrame._from_data(x._data, x._index)
8.11 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [3]: %timeit cudf.DataFrame._from_data(x._data)
16.9 µs ± 220 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

@galipremsagar galipremsagar added cuDF (Python) Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 7, 2022
@galipremsagar galipremsagar added this to PR-WIP in v22.04 Release via automation Feb 7, 2022
@galipremsagar galipremsagar requested a review from a team as a code owner February 7, 2022 18:21
@galipremsagar galipremsagar self-assigned this Feb 7, 2022
@galipremsagar
Copy link
Contributor Author

cc: @randerzander @ayushdg

@galipremsagar galipremsagar moved this from PR-WIP to PR-Needs review in v22.04 Release Feb 7, 2022
@vyasr
Copy link
Contributor

vyasr commented Feb 7, 2022

Why do we need to support this constructor? Can we not just use DataFrame._from_data instead? I'm generally against adding more support for constructing objects with internal classes like ColumnAccessor. If the use case is in dask-cudf I think it's fine to use internal cudf methods; we do that anyway. If the use case is at a higher level, I'd want to understand exactly what it's doing. Is there a reason not to simply use x.copy()?

@vyasr
Copy link
Contributor

vyasr commented Feb 7, 2022

To be clear, there are a ton of places in cudf that are doing inefficient things right now when it comes to materializing more intermediates than we need. We need to do a comprehensive review to remove things like this.

@galipremsagar
Copy link
Contributor Author

After an offline discussion with @vyasr we decided to raise an error if someone passes a ColumnAccessor to DataFrame/Series constructors thus forcing devs to use a much faster Series/DataFrame._from_data method. I have also now changed the internal code that was using the ineffecient code path before.

@galipremsagar galipremsagar changed the title [REVIEW] Optimize DataFrame constructor for ColumnAccessor inputs [REVIEW] Optimize DataFrame creation across code-base Feb 7, 2022
@codecov
Copy link

codecov bot commented Feb 7, 2022

Codecov Report

Merging #10236 (b565aaa) into branch-22.04 (a7d88cd) will increase coverage by 0.04%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10236      +/-   ##
================================================
+ Coverage         10.42%   10.47%   +0.04%     
================================================
  Files               119      122       +3     
  Lines             20603    20506      -97     
================================================
- Hits               2148     2147       -1     
+ Misses            18455    18359      -96     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <0.00%> (ø)
... and 48 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e458b9...b565aaa. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome happy with these changes.

v22.04 Release automation moved this from PR-Needs review to PR-Reviewer approved Feb 7, 2022
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Perf improvements look awesome, too!

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuDF (Python) Reviewer labels Feb 7, 2022
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8a88490 into rapidsai:branch-22.04 Feb 7, 2022
v22.04 Release automation moved this from PR-Reviewer approved to Done Feb 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuDF (Python) Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants