Potential performance improvement: cache the result of mergeBytePairs() #26

@mikethea1

When tokenizing large input strings, mergeBytePairs seems to be the bottleneck even when it isn't [degrading to quadratic](https://github.com//issues/25).

On my workload, a small change that caches the result of mergeBytePairs gave a ~33% speedup, since larger files tend to contain many repeated pieces.

This isn't a free improvement, though: checking the cache adds some overhead to every call.
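To make the idea concrete, here is a rough sketch of the kind of caching meant here. It is illustrative only and not taken from the library's code: `merge_byte_pairs` is a toy stand-in for mergeBytePairs, and `CachedMerger`, `max_entries`, and the hit/miss counters are hypothetical names. The size cap is also an assumption, added so the cache cannot grow without bound on inputs with few repeats.

```python
from typing import Dict, List, Optional, Tuple

Ranks = Dict[bytes, int]


def merge_byte_pairs(piece: bytes, ranks: Ranks) -> Tuple[int, ...]:
    """Toy stand-in for mergeBytePairs: repeatedly join the adjacent
    pair with the lowest rank until no known pair remains.

    Assumes every single byte of `piece` has an entry in `ranks`.
    """
    parts: List[bytes] = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        best_index = -1
        best_rank: Optional[int] = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_rank is None:
            break  # no adjacent pair is in the vocabulary
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return tuple(ranks[p] for p in parts)


class CachedMerger:
    """Hypothetical wrapper that memoizes merge results per piece."""

    def __init__(self, ranks: Ranks, max_entries: int = 100_000) -> None:
        self._ranks = ranks
        self._max_entries = max_entries  # assumed cap, not from the issue
        self._cache: Dict[bytes, Tuple[int, ...]] = {}
        self.hits = 0
        self.misses = 0

    def merge(self, piece: bytes) -> Tuple[int, ...]:
        cached = self._cache.get(piece)
        if cached is not None:
            self.hits += 1  # repeated piece: skip the merge loop entirely
            return cached
        self.misses += 1
        tokens = merge_byte_pairs(piece, self._ranks)
        if len(self._cache) < self._max_entries:
            self._cache[piece] = tokens
        return tokens
```

The lookup on every call is the overhead mentioned above; the hit and miss counters are only there to check whether a given workload repeats pieces often enough for the cache to pay off.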

Have you considered this optimization before? Would you entertain a PR to add it?
