- Q: What is chr(0)?
- A: The null character. Python shows its repr as '\x00'; it is the same character as '\0' in C. Printing it does emit the character, but most terminals render nothing for it, so a concatenated string looks as if the character were not there.


In [1]:
ch = chr(0)
ch

'\x00'

In [2]:
print(ch)
print('test1' + ch + 'test2')

 
test1 test2


- Q: Why UTF-8 instead of 16 or 32 for our tokenizer?
- A: UTF-8 encodings are shorter, especially for mostly-ASCII text, where each character takes one byte instead of two (UTF-16) or four (UTF-32).

In [3]:
test_string = "你好呀！Let's do a test!"
utf8_encoded = test_string.encode('utf-8')
utf16_encoded = test_string.encode('utf-16')
print(utf8_encoded)
print(utf16_encoded)

b"\xe4\xbd\xa0\xe5\xa5\xbd\xe5\x91\x80\xef\xbc\x81Let's do a test!"
b"\xff\xfe`O}Y@T\x01\xffL\x00e\x00t\x00'\x00s\x00 \x00d\x00o\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00!\x00"
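To make the size difference concrete, the byte lengths of the same string under each encoding can be compared directly. A small sketch; the helper name `encoded_lengths` is made up for this example. Note that Python's 'utf-16' and 'utf-32' codecs prepend a byte order mark (2 and 4 bytes respectively).

```python
def encoded_lengths(text: str) -> dict[str, int]:
    """Return the number of bytes needed to encode `text` in each encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16", "utf-32")}

test_string = "你好呀！Let's do a test!"
lengths = encoded_lengths(test_string)
for enc, n in lengths.items():
    print(f"{enc}: {n} bytes")

# Pure ASCII text shows the gap most clearly: 1 byte per character in UTF-8.
ascii_lengths = encoded_lengths("hello")
print(ascii_lengths)
```

For this 20-character string, the four CJK characters take 3 bytes each in UTF-8 and the 16 ASCII characters take 1 byte each.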


In [7]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])
print(decode_utf8_bytes_to_str_wrong("hello".encode("utf-8")))
print(decode_utf8_bytes_to_str_wrong("换个语言test".encode("utf-8")))

# Explanation: a non-ASCII character is encoded as a multi-byte sequence, so
# decoding one byte at a time fails: a lone lead byte such as 0xe6 is not
# valid UTF-8 on its own.


hello


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: unexpected end of data
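For contrast, a version that decodes the whole byte string at once handles multi-byte characters correctly. A minimal sketch; the function name `decode_utf8_bytes_to_str` and the variable `bytes_per_char` are chosen here for illustration.

```python
def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the full sequence so multi-byte characters stay intact.
    return bytestring.decode("utf-8")

text = "换个语言test"
encoded = text.encode("utf-8")
decoded = decode_utf8_bytes_to_str(encoded)
print(decoded)

# Number of UTF-8 bytes per character: the CJK characters need 3 each.
bytes_per_char = [len(c.encode("utf-8")) for c in text]
print(bytes_per_char)
```

The byte-at-a-time version only works when every character is a single byte, which is why "hello" succeeded above.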

In [8]:
invalid_bytes = b'\xff\xff'
print(invalid_bytes.decode('utf-8', errors='ignore'))  # This will ignore invalid bytes

# This is invalid because the byte 0xff can never appear in well-formed UTF-8,
# either as a lead byte or as a continuation byte. With errors='ignore' both
# bytes are dropped, so an empty string is printed.
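The effect of the `errors` argument can be seen by decoding the same bytes under each policy. A short sketch using only the standard `bytes.decode` error handlers; the variable names are illustrative.

```python
invalid_bytes = b"\xff\xff"

# Default (strict) decoding raises on the first invalid byte.
try:
    invalid_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"strict decoding failed: {exc.reason}")

# 'replace' substitutes U+FFFD for each undecodable byte; 'ignore' drops them.
replaced = invalid_bytes.decode("utf-8", errors="replace")
ignored = invalid_bytes.decode("utf-8", errors="ignore")
print(ascii(replaced))
print(repr(ignored))
```

Using 'replace' rather than 'ignore' keeps a visible marker where the bad bytes were, which can be easier to debug.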




In [None]:
%%time
from cs336_basics.tokenizer import train_bpe
special_tokens = ["<|endoftext|>"]
input_path = "data/TinyStoriesV2-GPT4-train.txt"
vocab_size = 10_000

vocab_ts, merges_ts = train_bpe(
    input_path=input_path,
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    num_processes=8
)

CPU times: user 2min 50s, sys: 2.56 s, total: 2min 52s
Wall time: 3min 32s


In [None]:
from cs336_basics.tokenizer import save_merges, save_vocab
save_merges(merges_ts, "data/merges_ts.txt")
save_vocab(vocab_ts, "data/vocab_ts.txt")

In [3]:
%%prun 
vocab_ts, merges_ts = train_bpe(
    input_path=input_path,
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    num_processes=8
)

 

         966107709 function calls (966107360 primitive calls) in 273.724 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       88  108.572    1.234  191.708    2.178 {method 'control' of 'select.kqueue' objects}
       88   51.383    0.584  301.878    3.430 selectors.py:558(select)
369218707   34.350    0.000   34.350    0.000 tokenizer.py:150(<lambda>)
9866/9847   32.998    0.003   67.332    0.007 {built-in method builtins.max}
       88   19.563    0.222  412.148    4.684 base_events.py:1909(_run_once)
584066668   16.632    0.000   16.633    0.000 __init__.py:609(__missing__)
        1    4.425    4.425    7.555    7.555 tokenizer.py:124(train_bpe)
    84/82    4.085    0.049    4.085    0.050 {built-in method posix.read}
   574255    0.787    0.000    1.243    0.000 tokenizer.py:182(replace_pair_in_tuple)
  8682067    0.303    0.000    0.303    0.000 {built-in method builtins.len}
  3449428    0.215    0.000    0.215   
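The profile above shows `builtins.max` being called about once per merge, with a key lambda at tokenizer.py:150 invoked hundreds of millions of times, which is consistent with rescanning every pair count on each merge. The actual tokenizer code is not shown in these notes, so the following is only a hypothetical sketch of that selection step; the function name `most_frequent_pair`, the toy counts, and the tie-breaking rule (prefer the lexicographically greater pair) are assumptions.

```python
from collections import Counter

def most_frequent_pair(pair_counts: Counter) -> tuple[bytes, bytes]:
    # Assumed rule: highest count wins; ties go to the lexicographically
    # greater pair. The key lambda runs once per distinct pair, so this scan
    # costs O(number of pairs) on every merge.
    return max(pair_counts, key=lambda pair: (pair_counts[pair], pair))

# Toy counts, not taken from the real training run.
pair_counts = Counter({(b"s", b"t"): 9, (b"e", b"s"): 9, (b"l", b"o"): 7})
best = most_frequent_pair(pair_counts)
print(best)
```

If the real implementation works this way, updating counts incrementally or keeping a heap would be the usual ways to cut down the time spent in `max`.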

In [9]:
vocab_ts

{0: b'<|endoftext|>',
 1: b'\x00',
 2: b'\x01',
 3: b'\x02',
 4: b'\x03',
 5: b'\x04',
 6: b'\x05',
 7: b'\x06',
 8: b'\x07',
 9: b'\x08',
 10: b'\t',
 11: b'\n',
 12: b'\x0b',
 13: b'\x0c',
 14: b'\r',
 15: b'\x0e',
 16: b'\x0f',
 17: b'\x10',
 18: b'\x11',
 19: b'\x12',
 20: b'\x13',
 21: b'\x14',
 22: b'\x15',
 23: b'\x16',
 24: b'\x17',
 25: b'\x18',
 26: b'\x19',
 27: b'\x1a',
 28: b'\x1b',
 29: b'\x1c',
 30: b'\x1d',
 31: b'\x1e',
 32: b'\x1f',
 33: b' ',
 34: b'!',
 35: b'"',
 36: b'#',
 37: b'$',
 38: b'%',
 39: b'&',
 40: b"'",
 41: b'(',
 42: b')',
 43: b'*',
 44: b'+',
 45: b',',
 46: b'-',
 47: b'.',
 48: b'/',
 49: b'0',
 50: b'1',
 51: b'2',
 52: b'3',
 53: b'4',
 54: b'5',
 55: b'6',
 56: b'7',
 57: b'8',
 58: b'9',
 59: b':',
 60: b';',
 61: b'<',
 62: b'=',
 63: b'>',
 64: b'?',
 65: b'@',
 66: b'A',
 67: b'B',
 68: b'C',
 69: b'D',
 70: b'E',
 71: b'F',
 72: b'G',
 73: b'H',
 74: b'I',
 75: b'J',
 76: b'K',
 77: b'L',
 78: b'M',
 79: b'N',
 80: b'O',
 81: b'P',
 82: b

Homework questions: 
- Training takes roughly 10 minutes.
- Peak memory is a bit over the input file size, which makes sense because pretokenization is the most memory-intensive step.
- Longest token in the vocab should be around 
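The longest token can be looked up directly from a vocab mapping of id to bytes, such as the `vocab_ts` shown earlier. A minimal sketch; the helper name `longest_token` is hypothetical and the toy vocab below is made-up data, not the trained vocabulary.

```python
def longest_token(vocab: dict[int, bytes]) -> tuple[int, bytes]:
    """Return the (id, token) pair whose token has the most bytes."""
    return max(vocab.items(), key=lambda item: len(item[1]))

# Toy vocab for illustration only.
toy_vocab = {0: b"<|endoftext|>", 1: b"a", 2: b" the", 3: b" accomplishment"}
token_id, token = longest_token(toy_vocab)
print(token_id, token, len(token))
```

On the real vocabulary the call would be `longest_token(vocab_ts)`; decoding the result with `token.decode("utf-8", errors="replace")` makes it easier to read.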

In [None]:
%%time
from cs336_basics.tokenizer import train_bpe
special_tokens = ["<|endoftext|>"]
input_path = "data/owt_train.txt"
vocab_size = 32_000

vocab_owt, merges_owt = train_bpe(
    input_path=input_path,
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    num_processes=2
)

In [None]:
from cs336_basics.tokenizer import save_merges, save_vocab
save_merges(merges_owt, "data/merges_owt.txt")
save_vocab(vocab_owt, "data/vocab_owt.txt")