Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLD Reduces size of wheels by stripping symbols #25123

Merged
merged 7 commits into from Dec 7, 2022

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Follow up to #25063

What does this implement/fix? Explain your changes.

This PR sets -g0 to strip symbols and reduce the file size of the linux wheels.

@thomasjpfan thomasjpfan added this to the 1.2 milestone Dec 6, 2022
@thomasjpfan
Copy link
Member Author

thomasjpfan commented Dec 6, 2022

@jeremiedbb @ogrisel This resolves the large wheel size problem on Linux. I think it would be nice to have for 1.2

@thomasjpfan
Copy link
Member Author

thomasjpfan commented Dec 6, 2022

From this PR, the total size of the artifacts zip is 142 MB. The latest wheel build in main was 231 MB.

When I unzip the artifacts and look at the wheels, I see that the manylinux wheels are around the same size as Windows and macOS. On main, the Linux wheels were 36.6 MB. With this PR, the Linux wheels are now 9.8 MB.

@thomasjpfan thomasjpfan mentioned this pull request Dec 6, 2022
11 tasks
@jeremiedbb
Copy link
Member

I think it's fine to disable all debug informations for the wheels but I'd like to have an easy way to re-enable them when building in my dev env. Maybe we can have an environnement variable ?

@thomasjpfan
Copy link
Member Author

I updated the PR to support --debug which is used to build with symbols:

python setup.py build_ext -i --debug

Are you okay with this as in a dev environment?

Maybe we can have an environnement variable ?

If the above is not agreeable, then I'm okay with adding a environment variable, such as SKLEARN_ENABLE_DEBUG_SYMBOLS.

@ogrisel
Copy link
Member

ogrisel commented Dec 6, 2022

Let's use the SKLEARN_BUILD_ prefix for env variables that only impact at build time.

@ogrisel
Copy link
Member

ogrisel commented Dec 6, 2022

About the change itself I wonder if it won't impact profilers with native code profiling such as py-spy with the --native flag on linux.

@thomasjpfan
Copy link
Member Author

Coming from py-spy's README, stripping symbols will not give the best output. If a developer is using py-spy, it is best to build scikit-learn with symbols on.

Overall, I think it is still better to strip the symbols for the wheels on PyPI because:

  • It is consistent with the macOS and Windows wheels
  • NumPy and SciPy also strip the symbols
  • Better for users with lower internet bandwidth

I updated this PR, introducing a SKLEARN_BUILD_ENABLE_DEBUG_SYMBOLS to control the striping of the symbols.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the change, but we should make sure the env variables we have around the codebase are discoverable.

Comment on lines +302 to +307
`SKLEARN_BUILD_ENABLE_DEBUG_SYMBOLS`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When this environment variable is set to a non zero value, the debug symbols
will be included in the compiled C extensions. Only debug symbols for POSIX
systems is configured.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't really belong to parallelism.rst though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this file is named parallelism.rst, the title is "Parallelism, resource management, and configuration". I'm guessing all the environment variables listed here fall under "configuration", which this PR adds onto.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for discoverability, searching for "configuration" places the parallelism.rst page as second. I'm open to figuring out a better location for all the configuration options, but I do not think it is a blocker for this PR.

@adrinjalali
Copy link
Member

adrinjalali commented Dec 7, 2022

This would be a significant improvement on the inference endpoints I have which create a new mamba environment every time, so it makes me happy :D

EDIT: oh no, this doesn't affect conda-forge packages lol. Silly me.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the --debug option. @ogrisel was your remark about the env var just about the name of such a var or that you prefer the env var over this cmd line option ?

Comment on lines +516 to +517
if build_with_debug_symbols:
default_extra_compile_args.append("-g")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it not already set by default ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On macOS+clang and Windows+MSCV, the debug symbols are stripped by default, which resulted in the ~9MB wheels on PyPI. With the new environment variable, I can enable debug symbols without changing the source on my local macOS machine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, right

Copy link
Member

@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as a first iteration to integrate as part of 1.2.

Further improvements can be:

  • restructuring information in parallelism.rst which is not about parallelism anymore
  • using binary stripper, like strip(1) to further reduce the size of wheel-shipped shared objects

Edit: I let @jeremiedbb merge if this LGTH.

@jeremiedbb
Copy link
Member

jeremiedbb commented Dec 7, 2022

I let @jeremiedbb merge if this LGTH.

I just wonder why we ruled out the --debug flag that quickly.

@adrinjalali
Copy link
Member

I just wonder why we ruled out the --debug flag that quickly.

I second that. I have a slight preference for it since it sounds more standard, but no strong opinion.

@thomasjpfan
Copy link
Member Author

thomasjpfan commented Dec 7, 2022

I just wonder why we ruled out the --debug flag that quickly.

With --debug, it's harder to build scikit-learn with debug symbols through pyproject.toml. I can imagine a use case of someone using py-spy and need debug symbols with --native. With the environment variable, they can set the variable and run pip install git+https://github.com/scikit-learn/scikit-learn and it "just works".

What makes build_ext --debug harder is that it requires more knowledge about setuptools + setup.py and how the build process works to get symbols installed.

@jeremiedbb
Copy link
Member

Thanks for the clarification. Let's merge then :)

@jeremiedbb jeremiedbb added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Dec 7, 2022
@jeremiedbb jeremiedbb merged commit 9527920 into scikit-learn:main Dec 7, 2022
jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 7, 2022
Vincent-Maladiere pushed a commit to Vincent-Maladiere/scikit-learn that referenced this pull request Dec 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants