[Bugfix] [Tests] Perform explicit garbage collection in between tests #1503

Draft
kylesayrs wants to merge 3 commits into main

Conversation

kylesayrs
Collaborator

@kylesayrs kylesayrs commented Jun 2, 2025

Purpose

  • Fix failing tests that unexpectedly run out of CUDA memory

Background

To clean up model memory, the LLM Compressor tests rely on the Python garbage collector to recognize dereferenced model objects and remove them from memory. This, in turn, drops the PyTorch tensor references those models hold, which the PyTorch caching allocator recognizes and translates into deallocated CUDA memory.
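
As a rough sketch of that chain (illustrative only, not code from this repository):

```python
import gc

import torch

# Illustrative only: a model whose parameters live entirely on the GPU.
model = torch.nn.Linear(4096, 4096).to("cuda")
print(torch.cuda.memory_allocated())  # non-zero: parameters occupy CUDA memory

del model     # drop the last Python reference to the module
gc.collect()  # needed if the module is caught in a reference cycle;
              # plain reference counting handles the simple case on its own
torch.cuda.empty_cache()  # return now-unused cached blocks to the CUDA driver

print(torch.cuda.memory_allocated())  # back to (roughly) zero
```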

This whole collection process starts with the Python garbage collector. However, the garbage collector is not perfect and sometimes takes longer to recognize some objects as dereferenced than others. Specifically, objects with cyclical references take significantly longer to collect, because detecting reference cycles is more computationally expensive than standard reference counting and only runs periodically.
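
For illustration (not from the PR), here is a minimal example of the kind of object that plain reference counting cannot free on its own:

```python
import gc

class Node:
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a   # a <-> b form a reference cycle

gc.disable()   # simulate "the cycle collector has not run yet"
make_cycle()
# Both Node objects are now unreachable but still alive: each keeps the
# other's reference count above zero, so only the cycle collector can free them.
freed = gc.collect()   # explicit collection finds and frees the cycle
print(freed)           # > 0: the collector reports the unreachable objects
gc.enable()
```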

These Python objects can take so long to collect that CUDA runs out of memory before the garbage collector ever reclaims them. Surprisingly, the PyTorch caching allocator does not call gc.collect() prior to raising an OOM error; I have confirmed this through my own tests, and it anecdotally matches @yewentao256's experience with the PyTorch CUDA caching allocator.
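
To illustrate the implication (a hedged sketch, not existing PyTorch or LLM Compressor behavior; it assumes a recent PyTorch that exposes torch.cuda.OutOfMemoryError), a caller who wants a collection pass before giving up on an allocation has to add it themselves:

```python
import gc

import torch

def retry_after_gc(fn, *args, **kwargs):
    """Hypothetical helper: retry a CUDA-allocating call once after forcing a
    Python garbage collection, since the caching allocator does not run
    gc.collect() on its own before raising an OOM error."""
    try:
        return fn(*args, **kwargs)
    except torch.cuda.OutOfMemoryError:
        gc.collect()                # reclaim dereferenced (possibly cyclic) Python objects
        torch.cuda.empty_cache()    # release the freed blocks back to the driver
        return fn(*args, **kwargs)  # retry once; re-raises if memory is truly exhausted
```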

Garbage Collection and LLM Compressor

It seems that model objects which have been wrapped by modify_save_pretrained produce reference cycles, because their overridden functions reference their own models directly. From local testing, models without reference cycles are cleaned up faster than models with them. However, this principle does not seem to generalize beyond one file: the nightly tests still fail even when no cycle is present.
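
As a purely illustrative sketch (the real modify_save_pretrained differs), the cycle appears when the override attached to the model closes over the model itself; holding a weakref instead, as an earlier revision of this PR did, breaks the cycle:

```python
import weakref

def attach_save_override(model, original_save_pretrained):
    # Hypothetical wrapper, not the actual LLM Compressor implementation.
    model_ref = weakref.ref(model)  # weak reference: does not keep the model alive

    def save_pretrained(*args, **kwargs):
        resolved = model_ref()      # resolve the weakref at call time
        if resolved is None:
            raise RuntimeError("model was already garbage collected")
        return original_save_pretrained(resolved, *args, **kwargs)

    # If the closure captured `model` directly, model -> save_pretrained -> model
    # would form a reference cycle that only the cycle collector could break.
    model.save_pretrained = save_pretrained
    return model
```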

Changes

  • To guarantee that memory which can be collected actually is collected, add a gc.collect() call after every test finishes so that memory bugs do not persist across tests (a minimal sketch follows this list).
  • This may cause tests to run slightly slower.
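
A minimal sketch of how such a hook could look in a shared conftest.py (the exact placement in this PR may differ):

```python
# conftest.py -- illustrative sketch only
import gc

import pytest

@pytest.fixture(autouse=True)
def collect_garbage_between_tests():
    """Run the test, then force a full collection so that dereferenced models
    (including those trapped in reference cycles) release their CUDA memory
    before the next test begins."""
    yield
    gc.collect()
```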

Testing

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

github-actions bot commented Jun 2, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs changed the title [Hotfix] Use weakref to model [Hotfix] Remove cyclical references created by modify_save_pretrained Jun 2, 2025
Collaborator

@brian-dellabetta brian-dellabetta left a comment

nice use of weakref

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@kylesayrs kylesayrs changed the title [Hotfix] Remove cyclical references created by modify_save_pretrained [Bugfix] [Tests] Perform explicit garbage collection in between tests Jun 3, 2025