tiktoken

This is a Haskell implementation of tiktoken, covering only the tokenization logic. In other words, given an existing encoding (like cl100k_base), you can tokenize a string (into smaller strings or token ranks).
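
For example, a minimal sketch of what that looks like (the encoding value cl100k_base and the toTokens / toRanks functions are assumed names based on the description above; check the package's Haddock documentation for the exact module contents and signatures):

    {-# LANGUAGE OverloadedStrings #-}

    module Main where

    import Tiktoken (cl100k_base, toRanks, toTokens)

    main :: IO ()
    main = do
        let input = "Hello, world!"

        -- Tokenize into the underlying token byte strings …
        print (toTokens cl100k_base input)

        -- … or into the corresponding token ranks (the integer token IDs)
        print (toRanks cl100k_base input)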

This means that you can't (yet) use this package to create your own new encodings, but you can use it to consume existing encodings. In particular, this comes in handy for prompt engineering, where you want to use as many of the available prompt tokens as possible (which requires accurately counting them).
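
As a rough illustration of token counting for prompt budgeting (again a sketch: toRanks and cl100k_base are assumed names, the input is assumed to be Text, and the 8192-token limit is just an example number, not something the package defines):

    {-# LANGUAGE OverloadedStrings #-}

    import Data.Text (Text)
    import qualified Data.Vector as Vector
    import Tiktoken (cl100k_base, toRanks)

    -- How many tokens a prompt would consume, assuming toRanks returns a
    -- Maybe-wrapped Vector of ranks (an assumption; see the package docs)
    promptTokenCount :: Text -> Maybe Int
    promptTokenCount prompt = fmap Vector.length (toRanks cl100k_base prompt)

    -- Check whether a prompt fits inside a hypothetical 8192-token budget
    fitsInBudget :: Text -> Bool
    fitsInBudget prompt = maybe False (<= 8192) (promptTokenCount prompt)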

Encoding speed is ≈2.6-3.1 MB/s on an M1 MacBook Pro (using only one core since this package does not yet support parallel tokenization):

All
  Encode 10 MB of Wikipedia
    r50k_base:   OK (23.88s)
      3.356 s ± 151 ms
    p50k_base:   OK (10.39s)
      3.445 s ±  31 ms
    p50k_edit:   OK (11.13s)
      3.693 s ± 240 ms
    cl100k_base: OK (11.16s)
      3.685 s ± 143 ms
    o200k_base:  OK (11.01s)
      3.648 s ± 134 ms
