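- Q: What is chr(0)?
- A: It is the null character, shown as '\x00' in Python (the same character as '\0' in C). Printing it on its own produces no visible output. When it is concatenated into a string it is still part of that string, but it has no visible glyph, so the printed text looks as if the character were not there.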


In [1]:
ch = chr(0)
ch

'\x00'

In [2]:
print(ch)
print('test1' + ch + 'test2')

 
test1 test2
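A small check that the null character really is part of the concatenated string even though printing hides it. This is only a sketch; the variable name `joined` is introduced here for illustration.

```python
ch = chr(0)
joined = "test1" + ch + "test2"

# The null character is a real, one-character string with code point 0.
print(len(ch), ord(ch))

# It still counts toward the length of the concatenated string.
print(len(joined))

# repr() makes the otherwise invisible character visible as an escape.
print(repr(joined))

# Membership tests also find it.
print(ch in joined)
```

Using repr() (or just evaluating the value in a cell, as in In [1]) is the easiest way to spot such invisible characters.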


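- Q: Why use UTF-8 instead of UTF-16 or UTF-32 for our tokenizer?
- A: UTF-8 encodings are shorter, especially for text that is mostly ASCII: each ASCII character takes one byte in UTF-8, versus two bytes in UTF-16 and four bytes in UTF-32.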

In [3]:
test_string = "你好呀！Let's do a test!"
utf8_encoded = test_string.encode('utf-8')
utf16_encoded = test_string.encode('utf-16')
print(utf8_encoded)
print(utf16_encoded)

b"\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x91\x80\xef\xbc\x81Let's do a test!"
b"\xff\xfe`O}Y@T\x01\xffL\x00e\x00t\x00'\x00s\x00 \x00d\x00o\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00!\x00"
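To make the size difference concrete, the byte lengths can be compared directly. This sketch reuses the test string from the cell above; the `-le` codec variants are chosen here so that Python does not prepend a byte-order mark (the plain 'utf-16' codec adds the two bytes `\xff\xfe` seen in the output above). The `sizes` dict is just an illustrative name.

```python
test_string = "你好呀！Let's do a test!"

# Encode the same text with each codec and record the number of bytes.
sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    sizes[codec] = len(test_string.encode(codec))
    print(f"{codec:10s} {sizes[codec]:3d} bytes")

# The string has 20 characters: 4 non-ASCII (3 bytes each in UTF-8)
# and 16 ASCII (1 byte each in UTF-8).
print(len(test_string))
```

Even with four three-byte characters in the string, UTF-8 is the shortest of the three encodings here because most of the text is ASCII.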


In [7]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
print(decode_utf8_bytes_to_str_wrong("hello".encode("utf-8")))
print(decode_utf8_bytes_to_str_wrong("换个语言test".encode("utf-8")))

# Explanation: the function decodes each byte on its own, but a non-ASCII
# character is encoded as several bytes that are only valid together, so
# decoding a lone lead byte such as 0xe6 raises UnicodeDecodeError.


hello


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data
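For contrast, a minimal sketch of a version that does not fail: decode the whole byte string in one call so that multi-byte sequences stay together. The function name `decode_utf8_bytes_to_str` is hypothetical and chosen only to mirror the failing one above.

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode all bytes at once so multi-byte characters are kept intact.
    return bytestring.decode("utf-8")

ascii_result = decode_utf8_bytes_to_str("hello".encode("utf-8"))
mixed_result = decode_utf8_bytes_to_str("换个语言test".encode("utf-8"))
print(ascii_result)
print(mixed_result)

# Number of UTF-8 bytes per character: the Chinese characters need
# three bytes each, the ASCII letters one byte each.
bytes_per_char = [len(c.encode("utf-8")) for c in "换个语言test"]
print(bytes_per_char)
```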

In [8]:
invalid_bytes = b'\xff\xff'
print(invalid_bytes.decode('utf-8', errors='ignore'))  # This will ignore invalid bytes

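# These two bytes do not decode to any Unicode character: 0xff never appears
# in valid UTF-8, neither as a lead byte nor as a continuation byte, so both
# bytes are invalid. With errors='ignore' the result is an empty string.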
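As a rough illustration of how the standard error handlers of `bytes.decode` treat the same input, the sketch below compares 'ignore', 'replace', and the default strict mode. The variable names are introduced here for illustration only.

```python
invalid_bytes = b"\xff\xff"

# 'ignore' silently drops bytes that cannot be decoded.
ignored = invalid_bytes.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD (the replacement character) for them.
replaced = invalid_bytes.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))

# The default ('strict') raises UnicodeDecodeError instead.
strict_error = None
try:
    invalid_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc
    print("strict decoding failed at byte offset", exc.start)
```

For a tokenizer, 'replace' is often the more useful choice when decoding arbitrary token sequences, since it keeps a visible marker where bytes were malformed; that is a design preference, not something the notebook above prescribes.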




In [2]:
%%time
from cs336_basics.tokenizer import train_bpe
special_tokens = ["<|endoftext|>"]
input_path = "data/TinyStoriesV2-GPT4-train.txt"
vocab_size = 10_000

vocab_ts, merges_ts = train_bpe(
    input_path=input_path,
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    num_processes=8
)

CPU times: user 2min 56s, sys: 3.29 s, total: 3min
Wall time: 3min 38s


In [3]:
from cs336_basics.tokenizer import save_merges, save_vocab
save_merges(merges_ts, "data/merges_ts.txt")
save_vocab(vocab_ts, "data/vocab_ts.txt")

In [4]:
%%prun 
vocab_ts, merges_ts = train_bpe(
    input_path=input_path,
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    num_processes=8
)

 

         966108674 function calls (966108291 primitive calls) in 280.120 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       90   73.858    0.821  134.759    1.497 {method 'control' of 'select.kqueue' objects}
       90   59.083    0.656  505.183    5.613 base_events.py:1909(_run_once)
       90   52.438    0.583  254.973    2.833 selectors.py:558(select)
369218707   34.627    0.000   34.627    0.000 tokenizer.py:150(<lambda>)
9870/9853   33.401    0.003   68.001    0.007 {built-in method builtins.max}
584066668   16.680    0.000   16.682    0.000 __init__.py:609(__missing__)
        1    4.441    4.441    7.451    7.451 tokenizer.py:124(train_bpe)
    86/82    3.888    0.045   40.185    0.490 {built-in method posix.read}
   574255    0.803    0.000    1.266    0.000 tokenizer.py:182(replace_pair_in_tuple)
  8682077    0.306    0.000    0.306    0.000 {built-in method builtins.len}
  3449434    0.219    0.000    0.219   
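Reading the profile: the kqueue, event-loop, and selector rows are most likely time spent waiting, while the rows for `tokenizer.py:150(<lambda>)` and `builtins.max` suggest that the merge loop selects the best pair by scanning all pair counts with a key function on every merge. The actual code at line 150 is not shown here, so the following is only a guess at the general shape of that step. The function name `most_frequent_pair`, the dict layout, and the tie-breaking rule (prefer the lexicographically greater pair) are all assumptions.

```python
def most_frequent_pair(pair_counts: dict) -> tuple:
    # Scan every pair once; the key is (count, pair), so ties on count
    # are broken by comparing the pairs themselves (assumed rule).
    return max(pair_counts, key=lambda pair: (pair_counts[pair], pair))

# Toy counts standing in for the real pair statistics.
toy_counts = {
    (b"a", b"b"): 3,
    (b"b", b"c"): 3,
    (b"c", b"d"): 1,
}
best = most_frequent_pair(toy_counts)
print(best)
```

A full scan like this costs one key call per distinct pair per merge, which would be consistent with the hundreds of millions of lambda calls in the profile. Updating counts incrementally and keeping candidates in a heap or bucketed structure is a common way to reduce that cost.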

In [11]:
# Longest token in the vocabulary
longest_token = max(vocab_ts.values(), key=len)
print("Longest token:", longest_token, "Length:", len(longest_token))

Longest token: b' accomplishment' Length: 15


Homework questions: 
- Training time: about 3.5 minutes wall time with 8 processes in the %%time run above (about 4.7 minutes under %%prun);
- Peak memory is a bit over the input file size, which makes sense because pretokenization is the most memory-intensive step
- The longest token in the vocab is b' accomplishment' (15 bytes), which looks reasonable
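One possible way to put a number on the memory answer is the standard-library `tracemalloc` module. This is a sketch under assumptions: the helper name `measure_peak` is made up here, and a toy workload stands in for `train_bpe`. Note that `tracemalloc` only sees Python allocations in the current process, so memory used by worker processes (for example with `num_processes=8`) would not be counted.

```python
import tracemalloc

def measure_peak(func, *args, **kwargs):
    # Run func while tracing Python allocations and report the peak size.
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Toy workload: about 1 MB of byte strings held in memory at once.
result, peak_bytes = measure_peak(lambda: [bytes(1000) for _ in range(1000)])
print(f"peak traced memory: {peak_bytes / 1e6:.2f} MB")
```

To measure training itself, the same helper could wrap the `train_bpe` call, with the caveat above about worker processes.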

In [None]:
%%time
from cs336_basics.tokenizer import train_bpe
special_tokens = ["<|endoftext|>"]
input_path = "data/owt_train.txt"
vocab_size = 32_000

vocab_owt, merges_owt = train_bpe(
    input_path=input_path,
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    num_processes=2
)

In [None]:
from cs336_basics.tokenizer import save_merges, save_vocab
save_merges(merges_owt, "data/merges_owt.txt")
save_vocab(vocab_owt, "data/vocab_owt.txt")