
perf: benchmark singletons — they're slower, keep them off#8

Merged
AmitMY merged 4 commits into main from perf/singletons
Apr 8, 2026

Conversation

AmitMY (Contributor) commented Apr 8, 2026

Summary

  • Add benchmarks/bench_singletons.py comparing singleton vs non-singleton performance
  • Add tests verifying identical merge output and regression guard

Stacked on #7.

What improved

  • Documented that singletons are 1.5-2x slower and use more memory due to cache overhead
  • Confirmed non-singleton mode should remain the default

Results (50 samples, 100 merges)

| Mode          | Time | Peak memory | Cache entries |
| ------------- | ---- | ----------- | ------------- |
| No singletons | 3.8s | 2.4 MB      | 0             |
| Singletons    | 7.2s | 5.3 MB      | 7,431         |

Test plan

  • All tests pass
  • ruff check . passes

🤖 Generated with Claude Code

AmitMY (Author) commented Apr 8, 2026

Why is that the case? The cache check should be fast, and given that we do see cache entries, I don't understand why it would be slower.

AmitMY force-pushed the perf/singletons branch 2 times, most recently from 50e9317 to 0c78429 on April 8, 2026 at 15:02
AmitMY and others added 2 commits on April 8, 2026 at 17:05
Singleton benchmarking shows cache overhead makes them ~1.5-2x slower
and use more memory than non-singletons. The dict lookup cost in
__new__ and growing cache dominate any deduplication savings.

Results (50 samples, 100 merges):
- No singletons: 3.8s, 2.4 MB
- Singletons: 7.2s, 5.3 MB, 7431 cache entries

Add tests verifying singletons produce identical merges and aren't
excessively slower (regression guard at 3x).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache key tuple (cls, args, kwargs) built on every Node() call is
more expensive than just allocating a fresh frozen dataclass with slots.
Merge operations create thousands of nodes, so this overhead dominates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY force-pushed the perf/singletons branch from 0c78429 to d585e70 on April 8, 2026 at 15:07
AmitMY (Author) commented Apr 8, 2026

The cache lookup is fast, but the cache key construction on every Node() call is the bottleneck. Each call to `__new__` builds:

```python
key = (cls, args, tuple(sorted(kwargs.items())))
```

This tuple creation plus the dict lookup costs more than simply allocating a fresh frozen dataclass with `__slots__` (which is essentially a single `object.__setattr__` call).

Profiling shows:

  • Without singletons: __new__ = 0.007s for 26k calls (0.26μs/call)
  • With singletons: __new__ = 0.086s for 26k calls (3.2μs/call) — 12x slower per call

The savings from singletons (fewer __eq__/__hash__ calls due to identity comparison) don't compensate because merge operations create thousands of new nodes per training step, each paying the __new__ tax.

Microbenchmark: 50k Node(value=b'x') calls takes ~1.6x longer with singletons than without. Added a test documenting this.

AmitMY (Author) commented Apr 8, 2026

Wow, ok! Would singletons help at scale? Or even then, not worth it? I am thinking about a test like: create 10k linked lists for the word "The" (t -> h -> e), then run .merge(t, h).
So to test, create a text "the the the ....." 10k times, and run with and without singletons to see.
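A sketch of that proposed scale test. `Node`, `build_word`, and `merge_pair` are hypothetical stand-ins for the project's real API; only the test shape (10k copies of "the", merge the (t, h) pair) comes from the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical linked-list node: one character plus a next pointer."""
    value: bytes
    next: "Optional[Node]" = None


def build_word(word: str) -> Node:
    # Build t -> h -> e by prepending characters in reverse order.
    head = None
    for ch in reversed(word):
        head = Node(ch.encode(), head)
    return head


def merge_pair(head: "Optional[Node]", a: bytes, b: bytes) -> "Optional[Node]":
    # Rebuild the list, fusing each adjacent (a, b) pair into one node.
    if head is None:
        return None
    if head.next is not None and head.value == a and head.next.value == b:
        return Node(a + b, merge_pair(head.next.next, a, b))
    return Node(head.value, merge_pair(head.next, a, b))


# 10k copies of "the" (t -> h -> e), then merge (t, h) in each.
lists = [build_word("the") for _ in range(10_000)]
merged = [merge_pair(h, b"t", b"h") for h in lists]
assert merged[0].value == b"th" and merged[0].next.value == b"e"
```

Timing this loop with and without singletons enabled would show whether deduplicating the 10k identical lists ever pays for the per-call cache cost.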

The old key (cls,)+args missed kwargs — dataclass constructors pass
fields as kwargs, so ALL instances of a class shared one key, creating
self-referential cycles. New key includes kwargs.values().

Add 10k-word scale test verifying singletons produce identical merges.
Overhead is now ~1.2x (was 1.4x with old sorted kwargs key).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY (Author) commented Apr 8, 2026

Fixed the key — the bug was that dataclass constructors pass fields as kwargs, not args. So the fast key (cls,) + args was just (cls,) for all instances of a class, causing self-referential cycles.

New key: (cls,) + args + tuple(kwargs.values()) — correct and still fast (~1.2x overhead vs 1.4x before).

Added a 10k-word scale test verifying identical merges with/without singletons.

At this scale (10k "the"), singletons add ~15% overhead but are correct. The overhead comes from the dict lookup per Node() call — still more expensive than raw frozen dataclass allocation, but much better than the old sorted(kwargs.items()) approach.

- Remove _instances cache and __new__ override from GraphVertex
- Remove identity-based __eq__/__hash__ from GraphVertex base class
  (dataclass subclasses already generate proper field-based versions)
- Remove USE_SINGLETONS from GraphSettings
- Delete test_singletons.py, test_singleton_perf.py, bench_singletons.py
- Clean up conftest.py and bne.py

Singletons added ~15-20% overhead due to cache key construction cost,
with no measurable benefit for training performance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY merged commit fb85271 into main on Apr 8, 2026
2 checks passed