
Add support for the world tokenizer #86

Merged Jun 8, 2023 (3 commits)

Conversation

Mathmagician8191 (Contributor)

Tokenizer implementation taken from https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer with test code removed.

This pull request adds a tokenizer command-line argument to chat_with_bot.py, generate_completions.py and measure_pexplexity.py.
The current options are the original 20B tokenizer (the default) and the new world tokenizer.
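The added argument can be sketched as a standalone parser (a minimal approximation; the help text and default mirror the diff below, but this parser itself is not the PR's actual code):

```python
import argparse

# Sketch of the tokenizer selection argument this PR adds. The positional
# argument is optional (nargs='?') and falls back to the 20B tokenizer.
def parse_args(argv):
    parser = argparse.ArgumentParser(description='Chat with an RWKV model')
    parser.add_argument('model_path', help='Path to RWKV model in ggml format')
    parser.add_argument('tokenizer', help='Which tokenizer to use',
                        nargs='?', type=str, default='20B')
    return parser.parse_args(argv)
```

Because the argument is positional with `nargs='?'`, existing invocations that pass only a model path keep working unchanged.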

@@ -42,6 +41,7 @@

parser = argparse.ArgumentParser(description='Provide terminal-based chat interface for RWKV model')
parser.add_argument('model_path', help='Path to RWKV model in ggml format')
parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20b")
Collaborator
Suggested change
parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20b")
parser.add_argument('tokenizer', help='Which tokenizer to use', nargs='?', type=str, default="20B")

An upper-case letter is used in the tokenizer file name and in the previous version of the code; let's keep it consistent.

@@ -110,9 +118,11 @@ def split_last_end_of_line(tokens):

# =================================================================================================
T1 = time.time()
prompt_tokens = tokenizer_encode(init_prompt)
prompt_token_count = len(prompt_tokens)
Collaborator
Why was this moved to the bottom?

Contributor Author
The variables are only used just below the new location.

tokens: List[int] = tokenizer.encode(text).ids
else:
print(f"Unknown tokenizer: {args.tokenizer}")
quit()
Collaborator
This if statement is repeated three times; I propose moving it into a separate utility file shared between the scripts.

Contributor Author
Should this be a separate file or should it be moved into rwkv_tokenizer.py?

Collaborator
Hm, moving it into the existing file rwkv_tokenizer.py sounds okay to me.
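The shared helper the reviewers converge on could look roughly like this (a sketch only; the function name `get_tokenizer` and the toy stand-in encoders are assumptions, not the PR's actual code):

```python
from typing import Callable, List

# Toy stand-in encoders so the dispatch logic below is runnable; the real
# helper would construct the actual 20B and world tokenizers instead.
def _encode_20b(text: str) -> List[int]:
    return [ord(c) for c in text]

def _encode_world(text: str) -> List[int]:
    return [ord(c) + 1 for c in text]

def get_tokenizer(name: str = '20B') -> Callable[[str], List[int]]:
    # One dispatch point shared by chat_with_bot.py, generate_completions.py
    # and measure_pexplexity.py, instead of repeating the if/else three times.
    if name == '20B':
        return _encode_20b
    if name == 'world':
        return _encode_world
    raise ValueError(f'Unknown tokenizer: {name}')
```

Centralizing the dispatch also means the "Unknown tokenizer" error is raised in one place rather than duplicated per script.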

@@ -17,20 +16,30 @@ def parse_args():
parser.add_argument('model_path', help='Path to model checkpoint file', type=str)
parser.add_argument('text_path', help='Path to text file in UTF-8 encoding', type=str)
parser.add_argument('ignore_first_n_tokens', help='How many tokens should be skipped before loss is measured', type=int)
parser.add_argument('token_limit', help='How many tokens to process; set to -1 to process all text', nargs='?', type=int, default=-1)
parser.add_argument('token_limit', nargs='?', help='How many tokens to process; set to -1 to process all text', type=int, default=-1)
Collaborator
Suggested change
parser.add_argument('token_limit', nargs='?', help='How many tokens to process; set to -1 to process all text', type=int, default=-1)
parser.add_argument('token_limit', help='How many tokens to process; set to -1 to process all text', nargs='?', type=int, default=-1)

nargs goes after help in all the other places.

# Tokenizer #1 (reference, naive, slow)
########################################################################################################

class RWKV_TOKENIZER():
Collaborator
I suggest removing the reference implementation, since it is not used and (it looks like) not intended to be used in rwkv.cpp. If anyone wants to take a look at the reference implementation, they can go to the original repository.
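For context, a trie tokenizer of this kind encodes text by greedy longest-match over a trie of token strings. A minimal sketch with a toy vocabulary (the real world-tokenizer vocabulary and token ids differ):

```python
# Greedy longest-match tokenization over a trie, with a toy vocabulary.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None  # set when a vocabulary token ends at this node

def build_trie(vocab):
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_id = token_id
    return root

def encode(root, text):
    ids, i = [], 0
    while i < len(text):
        # Walk the trie as far as possible, remembering the longest token seen.
        node, match_id, match_len = root, None, 0
        j = i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.token_id is not None:
                match_id, match_len = node.token_id, j - i
        if match_id is None:
            raise ValueError(f'no token matches input at position {i}')
        ids.append(match_id)
        i += match_len
    return ids
```

With vocab `{'a': 1, 'ab': 2, 'b': 3}`, the input `'aba'` encodes greedily as `'ab'` then `'a'`.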

@@ -0,0 +1,189 @@
########################################################################################################
# The RWKV Language Model - https://github.com/BlinkDL/RWKV-LM
########################################################################################################
Collaborator
Suggested change
########################################################################################################

Yes, it is RWKV :) Does not look like this comment block adds any value.

ch = key[idx]
return ret

class TRIE_TOKENIZER():
Collaborator
I propose adding a test that this implementation of the tokenizer matches the reference implementation (no need to include the actual reference implementation; the values for the assertions should be hardcoded). This test would not be run automatically, but it's better to have it than not.

As an example, you may take a look at convert_pytorch_to_ggml.test.py

@Mathmagician8191 (Contributor Author)

Everything should be fixed now, except for not having any tests.

@Mathmagician8191 (Contributor Author)

There is now a test verifying both encoding and decoding of the main test string from the original implementation.
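The shape of such a round-trip check can be sketched with toy stand-in encode/decode functions (the real test uses the world tokenizer and the long test string from the original repository):

```python
# Round-trip sanity check: encoding then decoding must reproduce the input.
# The ord/chr encoders here are toy stand-ins for the real tokenizer.
def encode(text):
    return [ord(c) for c in text]

def decode(tokens):
    return ''.join(chr(t) for t in tokens)

def check_round_trip(text):
    tokens = encode(text)
    assert decode(tokens) == text, 'round-trip mismatch'
    return tokens
```

A round-trip check alone cannot catch a tokenizer that splits text into the wrong tokens but decodes them correctly, which is why the reviewer also asked for hardcoded token-id assertions.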

print('Unit test...')

# Test string taken from https://github.com/BlinkDL/ChatRWKV/blob/main/tokenizer/rwkv_tokenizer.py
test_string = '''
Collaborator
Oof, I imagined the test to be a simple sanity check on a short string like this.

But I've delayed you for long enough. If you prefer, you can replace the test string with that from the link above, which will make the test way shorter. If not, I'm okay with merging it as is and changing it later myself.

@saharNooby (Collaborator)

Let me know what you think about this, and I'll merge.

@Mathmagician8191 (Contributor Author)

The long test string should be fine; the test isn't likely to be run that often, and it makes sure all edge cases are handled.

@saharNooby saharNooby merged commit 82c4ac7 into RWKV:master Jun 8, 2023