Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Refactor memory_usage to improve performance #10537

Merged
merged 9 commits into from
Mar 31, 2022

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Mar 29, 2022

Refactored Frame.memory_usage to return a tuple of lists: (column_names, memory_usages).
Motivation for this change is to remove redundent steps i.e., dict(Frame.memory_usage)->unpack & dict (DataFrame.memory_usage)->unpack dict(in Series.init). Choosing to remove dict being returned by Frame.memory_usage will now result in a 5-10% faster execution of external API. Not a huge speedup but dask memory usage calculations that go through
sizeof_cudf_dataframe are now 200x faster.it quickly adds up when dask_cudf is used.

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3], 'b':['a', 'b', 'c'], 'd':[111, 123, 123]})

# THIS PR
In [3]: %timeit df.memory_usage()
198 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


# branch-22.06
In [3]: %timeit df.memory_usage()
219 µs ± 726 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


# branch-22.06
In [4]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
377 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# this PR
In [6]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
1.8 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 29, 2022
@galipremsagar galipremsagar requested a review from a team as a code owner March 29, 2022 23:19
@galipremsagar galipremsagar self-assigned this Mar 29, 2022
@codecov
Copy link

codecov bot commented Mar 30, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.06@2d8d913). Click here to learn what that means.
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.06   #10537   +/-   ##
===============================================
  Coverage                ?   86.33%           
===============================================
  Files                   ?      140           
  Lines                   ?    22300           
  Branches                ?        0           
===============================================
  Hits                    ?    19252           
  Misses                  ?     3048           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d8d913...87589b1. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this optimization is really noticeable, then I would suggest the following additional changes:

  • Remove all inheritance logic from this method, since the amount of shared logic is going to be minimal. Just implement it separately in each of the 4 leaf classes: DataFrame, Series, Index, and MultiIndex.
  • Don't even bother constructing the list of column names for anything except DataFrame. You don't need it.
  • MultiIndex definition: sum([col.memory_usage for col in self._data.columns])
  • GenericIndex definition: self._column.memory_usage
  • Series implementation: self._column.memory_usage + (self._index.memory_usage if index else 0)
  • DataFrame implementation: Basically what currently exists except inlining the relevant logic from the current Frame and IndexedFrame implementations.

@galipremsagar
Copy link
Contributor Author

@vyasr Updated the implementations accordingly but kept the inheritances in place for the sake of docstrings. If you want me to remove that as well and duplicate the docstrings in the actual methods let me know, I'm fine with either of the approaches.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One minor suggestion.

python/cudf/cudf/core/multiindex.py Outdated Show resolved Hide resolved
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@vyasr
Copy link
Contributor

vyasr commented Mar 30, 2022

Thanks for that catch @bdice.

@galipremsagar could you post new benchmarks so that we know what we're getting out of this?

galipremsagar and others added 2 commits March 30, 2022 14:13
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@galipremsagar
Copy link
Contributor Author

galipremsagar commented Mar 30, 2022

@galipremsagar could you post new benchmarks so that we know what we're getting out of this?

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3], 'b':['a', 'b', 'c'], 'd':[111, 123, 123]})

In [3]: df = df.set_index(['a', 'd'])

In [4]: x = df.index

# branch-22.06

In [5]: %timeit x.memory_usage()
12 µs ± 48.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [6]: %timeit df.memory_usage()
232 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



# Now
In [7]: %timeit x.memory_usage()
10.3 µs ± 70.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [6]: %timeit df.memory_usage()
217 µs ± 496 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@vyasr
Copy link
Contributor

vyasr commented Mar 30, 2022

I'm proposing some alternative solutions offline since I don't think this is quite addressing the underlying issue.

@galipremsagar galipremsagar requested a review from a team as a code owner March 30, 2022 21:06
@galipremsagar
Copy link
Contributor Author

galipremsagar commented Mar 30, 2022

Discussed offline with @vyasr and decided to optimize sizeof_cudf_dataframe:

In [4]: df = cudf.DataFrame({'a':[1, 2, 3], 'b':['a', 'b', 'c'], 'd':[111, 123, 123]})

# branch-22.06
In [4]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
377 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# this PR
In [6]: %timeit dask_cudf.backends.sizeof_cudf_dataframe(df)
1.8 µs ± 14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome updating the registered Dask function should hopefully have a much bigger impact on Dask workloads than the relatively minor optimizations that we can make to memory_usage.

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Mar 30, 2022
@galipremsagar
Copy link
Contributor Author

rerun tests

Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a couple questions below. Likely just missing context

if deep:
warnings.warn(
"The deep parameter is ignored and is only included "
"for pandas compatibility."
)
return {name: col.memory_usage for name, col in self._data.items()}
raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not implemented now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have moved all their implementations to their respective Series/DataFrame/MultiIndex/Index .memory_usage methods directly. Kept these parent class methods for the sake of docstrings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so a user is unlikely to run into this error since they would be using one of these other public facing objects. Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's correct.

python/cudf/cudf/core/indexed_frame.py Show resolved Hide resolved
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1355191 into rapidsai:branch-22.06 Mar 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants