
Add support for the world tokenizer #86

Merged Jun 8, 2023 (3 commits)

Conversation

Mathmagician8191 (Contributor)

Tokenizer implementation taken from https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer with test code removed.

This pull request adds a tokenizer command-line argument to chat_with_bot.py, generate_completions.py and measure_pexplexity.py.
The current options are the original 20B tokenizer (the default) and the new world tokenizer.
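The added argument can be sketched as a standalone parser (a minimal approximation; the help text and default mirror the diff below, but this parser itself is not the PR's actual code):

```python
import argparse

# Sketch of the tokenizer selection argument this PR adds. The positional
# argument is optional (nargs='?') and falls back to the 20B tokenizer.
def parse_args(argv):
    parser = argparse.ArgumentParser(description='Chat with an RWKV model')
    parser.add_argument('model_path', help='Path to RWKV model in ggml format')
    parser.add_argument('tokenizer', help='Which tokenizer to use',
                        nargs='?', type=str, default='20B')
    return parser.parse_args(argv)
```

Because the argument is positional with `nargs='?'`, existing invocations that pass only a model path keep working unchanged.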

@@ -42,6 +41,7 @@

parser = argparse.ArgumentParser(description='Provide terminal-based chat interface for RWKV model')
parser.add_argument('model_path', help='Path to RWKV model in ggml format')
parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20b")
Collaborator
Suggested change
parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20b")
parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20B")

An upper-case letter is used in the tokenizer file name and in the previous version of the code; let's keep it consistent.

@@ -110,9 +118,11 @@ def split_last_end_of_line(tokens):

# =================================================================================================
T1 = time.time()
prompt_tokens = tokenizer_encode(init_prompt)
prompt_token_count = len(prompt_tokens)
Collaborator
Why was this moved to the bottom?

Contributor Author
The variables are only used just below the new location.

tokens: List[int] = tokenizer.encode(text).ids
else:
print(f"Unknown tokenizer: {args.tokenizer}")
quit()
Collaborator
This if statement is repeated three times; I propose moving it into a separate utility file shared between the scripts.

Contributor Author
Should this be a separate file or should it be moved into rwkv_tokenizer.py?

Collaborator
Hm, moving it into the existing file rwkv_tokenizer.py sounds okay to me.
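The shared helper the reviewers converge on could look roughly like this (a sketch only; the function name `get_tokenizer` and the toy stand-in encoders are assumptions, not the PR's actual code):

```python
from typing import Callable, List

# Toy stand-in encoders so the dispatch logic below is runnable; the real
# helper would construct the actual 20B and world tokenizers instead.
def _encode_20b(text: str) -> List[int]:
    return [ord(c) for c in text]

def _encode_world(text: str) -> List[int]:
    return [ord(c) + 1 for c in text]

def get_tokenizer(name: str = '20B') -> Callable[[str], List[int]]:
    # One dispatch point shared by chat_with_bot.py, generate_completions.py
    # and measure_pexplexity.py, instead of repeating the if/else three times.
    if name == '20B':
        return _encode_20b
    if name == 'world':
        return _encode_world
    raise ValueError(f'Unknown tokenizer: {name}')
```

Centralizing the dispatch also means the "Unknown tokenizer" error is raised in one place rather than duplicated per script.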

@@ -17,20 +16,30 @@ def parse_args():
parser.add_argument('model_path', help='Path to model checkpoint file', type=str)
parser.add_argument('text_path', help='Path to text file in UTF-8 encoding', type=str)
parser.add_argument('ignore_first_n_tokens', help='How many tokens should be skipped before loss is measured', type=int)
parser.add_argument('token_limit', help='How many tokens to process; set to -1 to process all text', nargs='?', type=int, default=-1)
parser.add_argument('token_limit', nargs='?', help='How many tokens to process; set to -1 to process all text', type=int, default=-1)
Collaborator
Suggested change
parser.add_argument('token_limit', nargs='?', help='How many tokens to process; set to -1 to process all text', type=int, default=-1)
parser.add_argument('token_limit', help='How many tokens to process; set to -1 to process all text', nargs='?', type=int, default=-1)

nargs goes after help in all the other places.

# Tokenizer #1 (reference, naive, slow)
########################################################################################################

class RWKV_TOKENIZER():
Collaborator
I suggest removing the reference implementation, since it is not used and (it looks like) not intended to be used in rwkv.cpp. If anyone wants to take a look at the reference implementation, they can go to the original repository.
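For context, a trie tokenizer of this kind encodes text by greedy longest-match over a trie of token strings. A minimal sketch with a toy vocabulary (the real world-tokenizer vocabulary and token ids differ):

```python
# Greedy longest-match tokenization over a trie, with a toy vocabulary.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None  # set when a vocabulary token ends at this node

def build_trie(vocab):
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def encode(root, text):
    ids, i = [], 0
    while i < len(text):
        # Walk the trie as far as possible, remembering the longest token seen.
        node, match_id, match_len = root, None, 0
        j = i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                match_id, match_len = node.token_id, j - i
        if match_id is None:
            raise ValueError(f'no token matches input at position {i}')
        ids.append(match_id)
        i += match_len
    return ids
```

With vocab `{'a': 1, 'ab': 2, 'b': 3}`, the input `'aba'` encodes greedily as `'ab'` then `'a'`.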

@@ -0,0 +1,189 @@
########################################################################################################
# The RWKV Language Model - https://github.com/BlinkDL/RWKV-LM
########################################################################################################
Collaborator
Suggested change
########################################################################################################

Yes, it is RWKV :) Does not look like this comment block adds any value.

ch = key[idx]
return ret

class TRIE_TOKENIZER():
Collaborator
I propose adding a test that this implementation of the tokenizer matches the reference implementation (no need to include the actual reference implementation; the values for the assertions should be hardcoded). This test would not be run automatically, but it's better to have it than not.

As an example, you may take a look at convert_pytorch_to_ggml.test.py

@Mathmagician8191 (Contributor Author)

Everything should be fixed now, except for not having any tests.

@Mathmagician8191 (Contributor Author)

There is now a test verifying both encoding and decoding of the main test string from the original implementation.
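The shape of such a round-trip check can be sketched with toy stand-in encode/decode functions (the real test uses the world tokenizer and the long test string from the original repository):

```python
# Round-trip sanity check: encoding then decoding must reproduce the input.
# The ord/chr encoders here are toy stand-ins for the real tokenizer.
def encode(text):
    return [ord(c) for c in text]

def decode(tokens):
    return ''.join(chr(t) for t in tokens)

def check_round_trip(text):
    tokens = encode(text)
    assert decode(tokens) == text, 'round-trip mismatch'
    return tokens
```

A round-trip check alone cannot catch a tokenizer that splits text into the wrong tokens but decodes them correctly, which is why the reviewer also asked for hardcoded token-id assertions.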

print('Unit test...')

# Test string taken from https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py
test_string = '''
Collaborator
Oof, I imagined the test to be a simple sanity check on a short string like this.

But I've delayed you for long enough. If you prefer, you can replace the test string with that from the link above, which will make the test way shorter. If not, I'm okay with merging it as is and changing it later myself.

@saharNooby (Collaborator)

Let me know what you think about this, and I'll merge.

@Mathmagician8191 (Contributor Author)

The long test string should be fine; the test isn't likely to be run that often, and it makes sure all edge cases are handled.

@saharNooby saharNooby merged commit 82c4ac7 into RWKV:master Jun 8, 2023