
perf: benchmark singletons — they're slower, keep them off#8

Merged
AmitMY merged 4 commits into main from perf/singletons
Apr 8, 2026

Conversation

AmitMY (Contributor) commented Apr 8, 2026

Summary

  • Add benchmarks/bench_singletons.py comparing singleton vs non-singleton performance
  • Add tests verifying identical merge output and regression guard

Stacked on #7.

What improved

  • Documented that singletons are 1.5-2x slower and use more memory due to cache overhead
  • Confirmed non-singleton mode should remain the default

Results (50 samples, 100 merges)

| Mode          | Time | Peak memory | Cache entries |
| ------------- | ---- | ----------- | ------------- |
| No singletons | 3.8s | 2.4 MB      | 0             |
| Singletons    | 7.2s | 5.3 MB      | 7,431         |

Test plan

  • All tests pass
  • ruff check . passes

🤖 Generated with Claude Code

AmitMY (Author) commented Apr 8, 2026

Why is that the case? The cache check should be fast, and given that we do see cache entries, I don't understand why it would be slower.

AmitMY force-pushed the perf/singletons branch 2 times, most recently from 50e9317 to 0c78429 on April 8, 2026 at 15:02
AmitMY and others added 2 commits on April 8, 2026 at 17:05
Singleton benchmarking shows cache overhead makes them ~1.5-2x slower
and use more memory than non-singletons. The dict lookup cost in
__new__ and growing cache dominate any deduplication savings.

Results (50 samples, 100 merges):
- No singletons: 3.8s, 2.4 MB
- Singletons: 7.2s, 5.3 MB, 7431 cache entries

Add tests verifying singletons produce identical merges and aren't
excessively slower (regression guard at 3x).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache key tuple (cls, args, kwargs) built on every Node() call is
more expensive than just allocating a fresh frozen dataclass with slots.
Merge operations create thousands of nodes, so this overhead dominates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY force-pushed the perf/singletons branch from 0c78429 to d585e70 on April 8, 2026 at 15:07
AmitMY (Author) commented Apr 8, 2026

The cache lookup is fast, but the cache key construction on every Node() call is the bottleneck. Each call to `__new__` builds:

```python
key = (cls, args, tuple(sorted(kwargs.items())))
```

This tuple creation plus the dict lookup costs more than simply allocating a fresh frozen dataclass with `__slots__` (which is essentially a single `object.__setattr__` call).

Profiling shows:

  • Without singletons: __new__ = 0.007s for 26k calls (0.26μs/call)
  • With singletons: __new__ = 0.086s for 26k calls (3.2μs/call) — 12x slower per call

The savings from singletons (fewer __eq__/__hash__ calls due to identity comparison) don't compensate because merge operations create thousands of new nodes per training step, each paying the __new__ tax.

Microbenchmark: 50k Node(value=b'x') calls takes ~1.6x longer with singletons than without. Added a test documenting this.

AmitMY (Author) commented Apr 8, 2026

Wow, ok! Would singletons help at scale? Or even then, not worth it? I am thinking about a test like: create 10k linked lists for the word "The" (t -> h -> e), then run .merge(t, h).
So to test, create a text "the the the ....." 10k times, and run with and without singletons to see.
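A sketch of that proposed scale test. `Node`, `build_word`, and `merge_pair` are hypothetical stand-ins for the project's real API; only the test shape (10k copies of "the", merge the (t, h) pair) comes from the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical linked-list node: one character plus a next pointer."""
    value: bytes
    next: "Optional[Node]" = None


def build_word(word: str) -> Node:
    # Build t -> h -> e by prepending characters in reverse order.
    head = None
    for ch in reversed(word):
        head = Node(ch.encode(), head)
    return head


def merge_pair(head: "Optional[Node]", a: bytes, b: bytes) -> "Optional[Node]":
    # Rebuild the list, fusing each adjacent (a, b) pair into one node.
    if head is None:
        return None
    if head.next is not None and head.value == a and head.next.value == b:
        return Node(a + b, merge_pair(head.next.next, a, b))
    return Node(head.value, merge_pair(head.next, a, b))


# 10k copies of "the" (t -> h -> e), then merge (t, h) in each.
lists = [build_word("the") for _ in range(10_000)]
merged = [merge_pair(h, b"t", b"h") for h in lists]
assert merged[0].value == b"th" and merged[0].next.value == b"e"
```

Timing this loop with and without singletons enabled would show whether deduplicating the 10k identical lists ever pays for the per-call cache cost.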

The old key (cls,)+args missed kwargs — dataclass constructors pass
fields as kwargs, so ALL instances of a class shared one key, creating
self-referential cycles. New key includes kwargs.values().

Add 10k-word scale test verifying singletons produce identical merges.
Overhead is now ~1.2x (was 1.4x with old sorted kwargs key).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY (Author) commented Apr 8, 2026

Fixed the key — the bug was that dataclass constructors pass fields as kwargs, not args. So the fast key (cls,) + args was just (cls,) for all instances of a class, causing self-referential cycles.

New key: (cls,) + args + tuple(kwargs.values()) — correct and still fast (~1.2x overhead vs 1.4x before).

Added a 10k-word scale test verifying identical merges with/without singletons.

At this scale (10k "the"), singletons add ~15% overhead but are correct. The overhead comes from the dict lookup per Node() call — still more expensive than raw frozen dataclass allocation, but much better than the old sorted(kwargs.items()) approach.

- Remove _instances cache and __new__ override from GraphVertex
- Remove identity-based __eq__/__hash__ from GraphVertex base class
  (dataclass subclasses already generate proper field-based versions)
- Remove USE_SINGLETONS from GraphSettings
- Delete test_singletons.py, test_singleton_perf.py, bench_singletons.py
- Clean up conftest.py and bne.py

Singletons added ~15-20% overhead due to cache key construction cost,
with no measurable benefit for training performance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AmitMY merged commit fb85271 into main on Apr 8, 2026
2 checks passed