Benchmark a simple BPE tokeniser #339

Merged on May 20, 2024 (2 commits)

hauntsaninja (Contributor) commented:

This isn't fully realistic, e.g. there are some optimisations you'd do if you actually cared about performance.
This benchmark currently takes about 3.5s to run; let me know if I should adjust that.
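
For readers unfamiliar with byte pair encoding, here is a minimal sketch of the kind of merge loop a simple BPE tokeniser runs. It is not the code from this PR: the function name `bpe_encode`, the `toy_ranks` vocabulary, and the sample input are all hypothetical, chosen only to illustrate the technique.

```python
def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Encode `piece` by repeatedly merging the adjacent pair with the lowest rank.

    Assumes every single byte that occurs in `piece` has an entry in `ranks`.
    """
    # Start with one part per byte.
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        # Scan all adjacent pairs for the one whose merged bytes rank lowest.
        best_idx = None
        best_rank = None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the vocabulary
        # Replace the two parts with their concatenation.
        parts[best_idx : best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
    return [ranks[p] for p in parts]


# Hypothetical toy vocabulary: all 256 single bytes plus two merges.
toy_ranks = {bytes([i]): i for i in range(256)}
toy_ranks[b"ab"] = 256
toy_ranks[b"abc"] = 257

print(bpe_encode(b"abcab", toy_ranks))  # [257, 256]
```

The loop rescans every adjacent pair after each merge, so it is quadratic in the length of the piece. That is the sort of cost an implementation tuned for performance would likely avoid.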

mdboom (Contributor) left a comment:

This looks great. Thanks for the submission.

I had just the one comment in case there is a source to link to.


Author: Shantanu Jain

Based on code from tiktoken.

Is there a repo or source we should link to for credit?

hauntsaninja merged commit 784d042 into python:main on May 20, 2024
hauntsaninja deleted the bm-bpe-token branch on May 20, 2024, 14:48

hauntsaninja (Contributor, Author) commented:

@mdboom are you around at the sprints? I'd love to chat about what other good benchmarks might look like.
