Skip to content

Conversation

@brandtbucher
Copy link
Member

@brandtbucher brandtbucher commented Nov 21, 2025

This exposes a counter for the number of unique objects seen in the current generation. Previously, something like sum(len(gc.get_objects(g)) for g in range(gen + 1)) was needed to get the a statistic like this (in this PR I've made the choice to collect this statistic in update_refs, and omit immortal and "marked alive" members of the current generation).

It's useful for things like info["collected"] / info["visited"], to get the efficiency of a collection.


📚 Documentation preview 📚: https://cpython-previews--141814.org.readthedocs.build/

@brandtbucher brandtbucher self-assigned this Nov 21, 2025
@brandtbucher brandtbucher added the type-feature A feature request or enhancement label Nov 21, 2025
@brandtbucher brandtbucher added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 21, 2025
Copy link
Member

@pablogsal pablogsal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

I assumed that visited is not initialized because IIRC the interpreter struct is calloc-ed. I'm on my phone so is not easy to double check but just a thing to check before landing

add_stats(GCState *gcstate, int gen, struct gc_collection_stats *stats)
{
gcstate->generation_stats[gen].duration += stats->duration;
gcstate->generation_stats[gen].visited += stats->visited;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this matters but what would happen if this overflows?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we generally worried about overflowing Py_ssize_t?

I can switch this to size_t if we are so at least the overflow isn't UB. A Python int seems sort of heavy for this, but we could go that route if needed too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we generally worried about overflowing Py_ssize_t?

Nah, I think this has the same problem as the other fields just wanted to raise it here in case we both believe is worth fixing. I don't think is important, but it is possible. WDYT? I am fine to ignore just to be clear

"uncollectable": When *phase* is "stop", the number of objects
that could not be collected and were put in :data:`garbage`.

"visited": When *phase* is "stop", the number of unique objects visited
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something but if I understand this correctly this counts objects in the collection set, not all objects visited during traversal (which would include objects referenced by containers). Consider maybe renaming to "objects_in_collection" or "initial_objects" or updating implementation to count all visited objects (including referenced ones during subtract_refs)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, @nascheme pointed this out too on the issue. I think naming is hard... this can really be any of:

  • Total visits performed (edges in the graph).
  • Total unique objects visited (nodes in the graph, optionally minus immortals, marked-alive objects, untracked objects, etc.).
  • Total objects in the current generation (nodes in the generation subgraph, optionally minus immortals, marked-alive objects, untracked objects, etc.).

I'm really trying not to overthink it... I think the probably most intuitive answer is something along the lines of "How many unique objects did the GC consider before eventually freeing <collected> number of them"? Like I said above, collected / visited should be a meaningful efficiency metric to help see at a glance if you're running at the right times (in addition to memory metrics, of course).

But definitely open to suggestions. I naturally lean towards a simpler API name and more nuance in the docs for it, rather than trying to convey all of the nuance in one long name alone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put another way, it seems like objects that can't count towards collected also shouldn't count towards visited. So I'm pretty sure that would exclude anything reachable from, but not part of, the current generation?

Copy link
Member

@pablogsal pablogsal Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but even keeping the "naming is hard" optics, I would say that anything other than visited is good because at least we can agree that surely this is not "number of visited objects" no? Maybe candidates?

@brandtbucher
Copy link
Member Author

I assumed that visited is not initialized because IIRC the interpreter struct is calloc-ed. I'm on my phone so is not easy to double check but just a thing to check before landing

Yeah, per-collection stats have an initializer here (here in free-threading), so unnamed members are zeroed. The process-wide stats live on the gcstate, which is part of the interpreter state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting merge interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants