Add support for the world tokenizer #86
Conversation
rwkv/chat_with_bot.py
Outdated
@@ -42,6 +41,7 @@
 parser = argparse.ArgumentParser(description='Provide terminal-based chat interface for RWKV model')
 parser.add_argument('model_path', help='Path to RWKV model in ggml format')
+parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20b")
Suggested change:
-parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20b")
+parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20B")

An upper-case letter is used in the tokenizer file name and in the previous version of the code; let's keep it.
@@ -110,9 +118,11 @@ def split_last_end_of_line(tokens):
 # =================================================================================================
+T1 = time.time()
+prompt_tokens = tokenizer_encode(init_prompt)
+prompt_token_count = len(prompt_tokens)
Why was this moved to the bottom?
The variables are only used just below the new location.
rwkv/measure_pexplexity.py
Outdated
    tokens: List[int] = tokenizer.encode(text).ids
else:
    print(f'Unknown tokenizer: {args.tokenizer}')
    quit()
This if statement was repeated three times; I propose moving it into a separate utility file and sharing it between the scripts.
Should this be a separate file or should it be moved into rwkv_tokenizer.py?
Hm, moving it to the existing file rwkv_tokenizer.py sounds okay to me.
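As a sketch of the shared helper discussed above (the function name, the dispatch table, and the loader stand-ins below are assumptions for illustration; a real version would construct the actual world/20B tokenizers in rwkv_tokenizer.py):

```python
from typing import Callable, Dict

def load_world_tokenizer():
    # Stand-in for constructing the trie-based world tokenizer.
    return 'world-tokenizer'

def load_20b_tokenizer():
    # Stand-in for loading the 20B tokenizer from its JSON vocabulary.
    return '20B-tokenizer'

_TOKENIZERS: Dict[str, Callable[[], object]] = {
    'world': load_world_tokenizer,
    '20B': load_20b_tokenizer,
}

def get_tokenizer(name: str = '20B'):
    # Single home for the `if` chain that was previously copy-pasted into
    # chat_with_bot.py, generate_completions.py and measure_pexplexity.py.
    try:
        return _TOKENIZERS[name]()
    except KeyError:
        raise ValueError(f'Unknown tokenizer: {name}') from None

print(get_tokenizer('world'))  # world-tokenizer
```

Each script would then call `get_tokenizer(args.tokenizer)` instead of repeating the branch.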
rwkv/measure_pexplexity.py
Outdated
@@ -17,20 +16,30 @@ def parse_args():
 parser.add_argument('model_path', help='Path to model checkpoint file', type=str)
 parser.add_argument('text_path', help='Path to text file in UTF-8 encoding', type=str)
 parser.add_argument('ignore_first_n_tokens', help='How many tokens should be skipped before loss is measured', type=int)
-parser.add_argument('token_limit', help='How many tokens to process; set to -1 to process all text', nargs='?', type=int, default=-1)
+parser.add_argument('token_limit', nargs='?', help='How many tokens to process; set to -1 to process all text', type=int, default=-1)
Suggested change:
-parser.add_argument('token_limit', nargs='?', help='How many tokens to process; set to -1 to process all text', type=int, default=-1)
+parser.add_argument('token_limit', help='How many tokens to process; set to -1 to process all text', nargs='?', type=int, default=-1)

nargs goes after help in all the other places.
rwkv/rwkv_tokenizer.py
Outdated
# Tokenizer #1 (reference, naive, slow)
########################################################################################################

class RWKV_TOKENIZER():
I suggest removing the reference implementation, since it is not used and (it looks like) not intended to be used in rwkv.cpp. If anyone wants to take a look at the reference implementation, they can go to the original repository.
rwkv/rwkv_tokenizer.py
Outdated
@@ -0,0 +1,189 @@
+########################################################################################################
+# The RWKV Language Model - https://github.com/BlinkDL/RWKV-LM
+########################################################################################################
Suggested change: remove this comment block.

Yes, it is RWKV :) This comment block does not look like it adds any value.
        ch = key[idx]
    return ret

class TRIE_TOKENIZER():
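For context on what TRIE_TOKENIZER does: a trie-based tokenizer greedily matches the longest vocabulary entry at each byte position. A minimal sketch with a made-up toy vocabulary (the real tokenizer loads the RWKV world vocabulary, and it has an id for every single byte, so the unknown-byte fallback below never triggers there):

```python
class Trie:
    def __init__(self):
        self.children = {}    # byte value -> child Trie node
        self.token_id = None  # set if a token ends at this node

    def insert(self, token_bytes, token_id):
        node = self
        for b in token_bytes:
            node = node.children.setdefault(b, Trie())
        node.token_id = token_id

def encode(root, data):
    # Greedy longest match: at each position, walk the trie as far as
    # possible and emit the id of the longest token seen on the way.
    ids, i = [], 0
    while i < len(data):
        node, best_id, best_len = root, None, 0
        j = i
        while j < len(data) and data[j] in node.children:
            node = node.children[data[j]]
            j += 1
            if node.token_id is not None:
                best_id, best_len = node.token_id, j - i
        if best_id is not None:
            ids.append(best_id)
        i += best_len if best_len else 1  # skip unknown byte (toy fallback)
    return ids

# Toy vocabulary: ids are arbitrary, chosen only for this example.
root = Trie()
for tid, tok in enumerate([b'a', b'b', b'ab', b'abc'], start=1):
    root.insert(tok, tid)

print(encode(root, b'abab'))  # longest match picks "ab" twice -> [3, 3]
```

The actual implementation is more elaborate (byte-level vocabulary file, decode support), but the greedy longest-match walk is the core idea.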
I propose adding a test that this implementation of the tokenizer matches the reference implementation (no need to include the actual reference implementation; the values for the assertions should be hardcoded). This test would not be run automatically, but it's better to have it than not. As an example, you may take a look at convert_pytorch_to_ggml.test.py.
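The proposed test could look roughly like this. DummyTokenizer, the test string and the expected ids below are made up for illustration only; a real test would use TRIE_TOKENIZER and ids captured once from a run of the reference implementation:

```python
class DummyTokenizer:
    # Stand-in for TRIE_TOKENIZER: maps each character to its ordinal.
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return ''.join(chr(i) for i in ids)

def run_tokenizer_test(tokenizer):
    test_string = 'Hi'
    expected_ids = [72, 105]  # hardcoded from a one-off reference run
    ids = tokenizer.encode(test_string)
    assert ids == expected_ids, f'encode mismatch: {ids}'
    # Round-trip: decoding must reproduce the original text exactly.
    assert tokenizer.decode(ids) == test_string, 'decode mismatch'
    return True

print(run_tokenizer_test(DummyTokenizer()))  # True
```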
Everything should be fixed now, except for not having any tests.
There is now a test verifying both encoding and decoding of the main test string from the original implementation.
print('Unit test...')

# Test string taken from https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py
test_string = '''
Oof, I imagined the test to be a simple sanity check on a short string.
But I've delayed you for long enough. If you prefer, you can replace the test string with the one from the link above, which will make the test much shorter. If not, I'm okay with merging it as is and changing it later myself.
Let me know what you think about this, and I'll merge.
The long test string should be fine; the test isn't likely to be run that often, and it makes sure all edge cases are handled.
Tokenizer implementation taken from https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer with test code removed.
This pull request adds a tokenizer command line argument to chat_with_bot.py, generate_completions.py and measure_pexplexity.py.
Current options are the original 20B tokenizer (default) and the new world tokenizer.