This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Add HuggingFace's Tokenizer #271

Merged: 22 commits merged into rustformers:main on May 29, 2023

Conversation

RedBoxing
Contributor

Closes #35 and #212

Collaborator

philpax left a comment

Looks good! Just a few nitpicks around error handling/naming

binaries/llm-cli/src/cli_args.rs (1 review thread, outdated and resolved)
crates/llm-base/src/loader.rs (5 review threads, outdated and resolved)
crates/llm-base/src/vocabulary.rs (3 review threads, outdated and resolved)

Contributor

danforbes left a comment

Outstanding work 💪🏻 Unbelievable first contribution 🚀

binaries/llm-cli/src/cli_args.rs (4 review threads, outdated and resolved)
crates/llm-base/src/loader.rs (1 review thread, outdated and resolved)
crates/llm-base/src/vocabulary.rs (5 review threads, outdated and resolved)

Collaborator

philpax commented May 24, 2023

I've updated this PR and made a few code quality changes, but the no-space issue means it can't be made the default option. I've asked for help with Tokenizers to determine why that's happening.

philpax mentioned this pull request May 25, 2023

Collaborator

philpax commented May 27, 2023

We just need to implement what's described in this comment on huggingface/tokenizers#1141, and we should be able to take this across the line.

In addition to this, we should update to the latest non-HF tokenizer from llama.cpp (seeing as we're supporting both), but that can be done in a separate PR.

RedBoxing requested a review from philpax May 28, 2023 09:25
philpax merged commit 0725865 into rustformers:main May 29, 2023

Collaborator

philpax commented May 29, 2023

Great work, @RedBoxing, and thanks for all of the help testing, everyone! Glad to have this in :)

hhamud mentioned this pull request Aug 7, 2023
Development

Successfully merging this pull request may close these issues.

Use the HuggingFace llama Tokenizer
3 participants