# Assignment 1

## BPE Tokenizer

In [5]:
chr(0)

'\x00'

In [None]:
# print(chr(0).__repr__)
print(chr(0))

 


In [7]:
"this is a test" + chr(0) + "string"

'this is a test\x00string'

In [8]:
print("this is a test" + chr(0) + "string")

this is a test string


In [13]:
test_string = "hello! こんにちは!"
utf8_encoded = test_string.encode("utf-8")
print(utf8_encoded)
print(type(utf8_encoded))
print(list(utf8_encoded))
print(len(test_string))
print(len(utf8_encoded))
print(utf8_encoded.decode("utf-8"))

b'hello! \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf!'
<class 'bytes'>
[104, 101, 108, 108, 111, 33, 32, 227, 129, 147, 227, 130, 147, 227, 129, 171, 227, 129, 161, 227, 129, 175, 33]
13
23
hello! こんにちは!


In [14]:
test_string = "hello! こんにちは!"
utf16_encoded = test_string.encode("utf-16")
print(utf16_encoded)
print(type(utf16_encoded))
print(list(utf16_encoded))
print(len(test_string))
print(len(utf16_encoded))
print(utf16_encoded.decode("utf-16"))

b'\xff\xfeh\x00e\x00l\x00l\x00o\x00!\x00 \x00S0\x930k0a0o0!\x00'
<class 'bytes'>
[255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0, 32, 0, 83, 48, 147, 48, 107, 48, 97, 48, 111, 48, 33, 0]
13
28
hello! こんにちは!


In [15]:
test_string = "hello! こんにちは!"
utf32_encoded = test_string.encode("utf-32")
print(utf32_encoded)
print(type(utf32_encoded))
print(list(utf32_encoded))
print(len(test_string))
print(len(utf32_encoded))
print(utf32_encoded.decode("utf-32"))

b'\xff\xfe\x00\x00h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00 \x00\x00\x00S0\x00\x00\x930\x00\x00k0\x00\x00a0\x00\x00o0\x00\x00!\x00\x00\x00'
<class 'bytes'>
[255, 254, 0, 0, 104, 0, 0, 0, 101, 0, 0, 0, 108, 0, 0, 0, 108, 0, 0, 0, 111, 0, 0, 0, 33, 0, 0, 0, 32, 0, 0, 0, 83, 48, 0, 0, 147, 48, 0, 0, 107, 48, 0, 0, 97, 48, 0, 0, 111, 48, 0, 0, 33, 0, 0, 0]
13
56
hello! こんにちは!
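
The byte counts in the three cells above can be accounted for exactly. The UTF-16 and UTF-32 outputs each begin with a byte-order mark (`\xff\xfe` and `\xff\xfe\x00\x00`), which Python's generic `"utf-16"` and `"utf-32"` codecs prepend. A minimal sketch comparing them with the BOM-free `"utf-16-le"` and `"utf-32-le"` codecs, which the cells above do not use and are added here only for comparison:

```python
test_string = "hello! こんにちは!"  # 13 code points: 8 ASCII, 5 hiragana

sizes = {}
for name in ("utf-8", "utf-16", "utf-16-le", "utf-32", "utf-32-le"):
    # Record how many bytes each encoding needs for the same string.
    sizes[name] = len(test_string.encode(name))
    print(f"{name:10s} {sizes[name]:3d} bytes")

# UTF-8:  8 ASCII chars * 1 byte + 5 hiragana * 3 bytes = 23
# UTF-16: 13 code units * 2 bytes = 26, plus a 2-byte BOM = 28
# UTF-32: 13 code points * 4 bytes = 52, plus a 4-byte BOM = 56
```

For this mostly-ASCII string, UTF-8 is the most compact of the three encodings.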


In [20]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes):
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

print(decode_utf8_bytes_to_str_wrong("hello".encode("utf-8")))

hello


This function is incorrect because it attempts to decode each byte of the input `bytestring` individually.

The core problem is that **UTF-8 is a variable-length encoding**. While standard ASCII characters (like 'h', 'e', 'l', 'l', 'o') are represented by a single byte, **many other characters require multiple bytes**. The provided function breaks these multi-byte sequences, trying to decode each constituent byte in isolation. This fails because a single byte from a multi-byte sequence is not a valid UTF-8 character on its own.

-----

### Example of Incorrect Results

An input byte string that represents a character outside the ASCII range will cause the function to fail. Let's use the Euro sign (`€`), which is encoded in UTF-8 by the three-byte sequence `b'\xe2\x82\xac'`.

**Input:**

```python
euro_bytes = "€".encode("utf-8")  # This results in b'\xe2\x82\xac'
```

When `decode_utf8_bytes_to_str_wrong(euro_bytes)` is called:

1.  The loop starts with the first byte, `0xe2`.
2.  It tries to execute `bytes([0xe2]).decode("utf-8")`.
3.  The UTF-8 decoder sees the byte `0xe2` (`11100010` in binary), which signals the start of a three-byte character. Since the other two bytes are not provided, the decoder recognizes this as an incomplete and invalid sequence.
4.  A `UnicodeDecodeError` is raised, and the program crashes.
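
The failure described in the steps above can be reproduced directly. The sketch below repeats the function from the earlier cell so that it runs on its own; the `try`/`except` wrapper and the `failed` flag are additions for illustration, used only to show the error without stopping the notebook.

```python
def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Same logic as the earlier cell: decode one byte at a time.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

euro_bytes = "€".encode("utf-8")  # b'\xe2\x82\xac'

try:
    decode_utf8_bytes_to_str_wrong(euro_bytes)
except UnicodeDecodeError as err:
    # The very first byte (0xe2) already fails, so the loop never reaches 0x82.
    failed = True
    print(err)
else:
    failed = False
```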

**Correct Decoding:**
The correct way to decode is to call the `decode` method on the entire byte string at once, allowing the decoder to properly interpret the multi-byte sequences.

```python
>>> b'\xe2\x82\xac'.decode("utf-8")
'€'
```

In [1]:
b = b'\xc0\x80'
b.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte

A two-byte sequence that does not decode to any Unicode character is **`b'\xc0\x80'`**.

---

### Explanation

This sequence is invalid because it is an **overlong encoding**. Here's why:

1.  **UTF-8 Rules**: The UTF-8 encoding scheme has specific rules to ensure that each character has only one valid byte representation. A character must be encoded using the fewest bytes possible.

2.  **The `C0` Start Byte**: The first byte, `0xc0` (binary `11000000`), signals the start of a two-byte sequence.

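3.  **Decoding the Sequence**: A two-byte sequence has the form `110xxxxx 10yyyyyy`, where the `x` and `y` bits carry the code point. In `c0 80` all of those payload bits are zero, so a decoder that processed the sequence naively would interpret it as an attempt to encode the null character (U+0000).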

4.  **The Violation**: The null character already has a valid, shorter, single-byte representation: `b'\x00'`. Because a shorter representation exists, the two-byte version `b'\xc0\x80'` is classified as an illegal "overlong" encoding. Any compliant UTF-8 decoder will reject this sequence as invalid.

In short, any two-byte UTF-8 sequence starting with `0xc0` or `0xc1` is invalid for this reason.
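
The overlong argument can be checked by hand. The sketch below extracts the payload bits of a two-byte sequence without any validation (the helper `naive_two_byte_payload` is hypothetical, written for this example) and then confirms that Python's decoder rejects every two-byte sequence whose lead byte is `0xc0` or `0xc1`.

```python
def naive_two_byte_payload(seq: bytes) -> int:
    # Layout of a two-byte sequence: 110xxxxx 10yyyyyy.
    # Concatenate the x and y bits with no validity checks.
    lead, cont = seq
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)

# The payload of c0 80 is 0, which already has the one-byte form b'\x00'.
print(hex(naive_two_byte_payload(b"\xc0\x80")))
# A legitimate two-byte sequence for comparison: c3 a9 carries U+00E9 ('é').
print(hex(naive_two_byte_payload(b"\xc3\xa9")))

rejected = 0
for lead in (0xC0, 0xC1):
    for cont in range(0x80, 0xC0):  # every valid continuation byte
        try:
            bytes([lead, cont]).decode("utf-8")
        except UnicodeDecodeError:
            rejected += 1
print(rejected)  # all 2 * 64 = 128 combinations are rejected
```

With a lead byte of `0xc0` or `0xc1`, the largest payload that fits is `0x7f`, so every such sequence would encode a code point that already has a one-byte ASCII form.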

In [3]:
print(list(b'the'))

[116, 104, 101]


In [7]:
import regex as re
# (used by GPT-2; Radford et al., 2019) from github.com/openai/tiktoken/pull/234/files:
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
for item in re.finditer(PAT, "some text that i'll pre-tokenize"):
    print(item)

<regex.Match object; span=(0, 4), match='some'>
<regex.Match object; span=(4, 9), match=' text'>
<regex.Match object; span=(9, 14), match=' that'>
<regex.Match object; span=(14, 16), match=' i'>
<regex.Match object; span=(16, 19), match="'ll">
<regex.Match object; span=(19, 23), match=' pre'>
<regex.Match object; span=(23, 24), match='-'>
<regex.Match object; span=(24, 32), match='tokenize'>
