# CS 5293 Assignment-1

Welcome to CS 5293 (Text Analytics)!

The main goal of Assignment 1 is to help you set up your Python environment and do some warm-up coding in the following two parts:
* Part 1. Analyze the vocabulary of large language model Llama-3.1 (40')
* Part 2. Build your own N-gram model (60')

# Part 1. Analyze Llama-3.1 Vocabulary (40')

In lecture 2, we learned about Heaps' law. As the corpus grows, we do not want the vocabulary to keep growing with it, but we still want the vocabulary to cover every word so that we avoid out-of-vocabulary (OOV) errors. Hence, we learned about subword tokenization, typically Byte Pair Encoding (BPE).

From BERT, RoBERTa, and T5 to Llama-1 and Llama-2, the vocabulary of most language models has roughly 30K to 50K tokens. However, Llama-3 replaced the original SentencePiece tokenizer [1] with the tiktoken tokenizer [2] used in OpenAI models, which has a vocabulary of 128K tokens.

A larger vocabulary also makes the tokenizer more token-efficient (i.e., fewer tokens are needed to encode the same piece of text than with Llama-2), which makes inference more efficient. However, the optimal vocabulary size is still unknown.

Here, we have a list of the "tokens" in the "Llama-3.1-8B" vocabulary file (../data/Llama-3.1-8b/vocab.txt), with one "token" per line. The goal of this assignment is to write a program that identifies what the 128K tokens in the Llama-3.1-8B vocabulary are.

In this Jupyter notebook, we have conducted an initial investigation with simple regexes in Linux commands to answer the simple question "How many punctuation marks/digits/special tokens are in there?"
For the remaining tokens in the vocabulary, please select only ONE research question about a special subset of tokens, (1) write a Python program to obtain them, and (2) report the total number of tokens in the selected subset.

## Grading Criteria
* (10') Be able to set up the environment, read and run through all the code via VSCode or your own Jupyter notebook, and understand how to use regex patterns to analyze text. To get this score, you have to run every cell of Part 1 and keep its output to show that you could run it.
* (30') Investigate your own research question.

## Recommended open questions (not limited to this list, and not limited to English)
* How many English roots are covered in the vocabulary? (../data/english.roots.list.build.json)
* How many whole English words are there in the vocabulary? (../check_word.ipynb)
* How are emojis tokenized by the tokenizer? What are the related tokens in the vocabulary?
* How are combining characters tokenized by the tokenizer? What are the related tokens in the vocabulary?
* If we define the token length as the number of bytes in a token, what is the longest token in the vocabulary? Please analyze.

## Other Helpful Reading for This Assignment:
1. SentencePiece. https://github.com/google/sentencepiece
2. Tiktoken. Think of the simple version of the BPE algorithm we learned in lecture 2. The _educational.py file in tiktoken offers an implementation of it.
 https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/tiktoken/_educational.py#L119
3. Llama-3.1 uses a different `pat_str` to split the sentence, a set of special tokens, and a larger vocabulary: https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L21, please treat the special tokens as having the shape "<|.*|>"
4. Here is an OpenAI tutorial that compares the vocabularies of their models from GPT2 to the recent GPT-o1. https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
5. Research on scaling laws has also studied vocabulary size. https://arxiv.org/abs/2407.13623. Researchers have also studied the unfairness of tokenizers across languages. https://arxiv.org/pdf/2305.15425 

### Setup:  Python Environment and VSCode
Import the environment.yml under the root assignment folder, then create and activate your environment by following setup_tutorial.ipynb. 

```
mamba env create -n cs5293-1 -f ../environment.yml
mamba activate cs5293-1
```

If the above commands fail in your terminal, it is because the environment.yml file exported by mamba is not cross-platform, so it may fail in your OS environment. 
In that case, you need to create your own "cs5293-1" environment from scratch. A simple rule for mixing mamba and pip: use mamba only for system-related packages such as cuda or python. For Python libraries, e.g. transformers and torch, prefer `pip install` first and fall back to `mamba install`. 

Please use the following commands:
```
mamba create -n cs5293-1 python=3.10
mamba activate cs5293-1
pip install transformers torch
```
Hopefully all the above steps are successful. Good luck!
Then go to the top-right corner of VSCode and click `Select Kernel` (and `Select Another Kernel` if needed) to select the cs5293-1 environment created above. 

If VSCode does not find the `cs5293-1` environment, please fully quit VSCode, reopen the project folder, and try `Select Kernel` again. 

As mentioned on the OSCER support page, starting with VS Code version 1.86.0, Microsoft dropped support for older operating systems with glibc<2.28, which includes ALL OSCER compute nodes. Until the entire supercomputer is upgraded to a newer operating system, your ONLY choice is to use *VS Code Desktop AND CLI version 1.85.2* via the links on this page: https://www.ou.edu/oscer/support/VS_Code.

Launch VSCode and *open the whole assignment-1 folder*.
Then open this Jupyter notebook, Assignment-1.ipynb, by clicking the Explorer in the left side bar. 


In [36]:
from transformers import AutoTokenizer
import os
import sys
sys.path.append("../src")
print(sys.path)  
#Disabling the parallelism of the tokenizers to avoid issues with the multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "false"
#Importing the necessary modules for this assignment 
import vocab_utils

['/Users/sairishith/Downloads/Spring2025-cs5293-1-master/notebooks', '/Users/sairishith/miniforge3/envs/cs5293-1/lib/python310.zip', '/Users/sairishith/miniforge3/envs/cs5293-1/lib/python3.10', '/Users/sairishith/miniforge3/envs/cs5293-1/lib/python3.10/lib-dynload', '', '/Users/sairishith/miniforge3/envs/cs5293-1/lib/python3.10/site-packages', '/var/folders/s1/jq1jylpx1zq5fcqqyfyt0jg00000gn/T/tmpzpbsdhn2', '../src', '../src']


In [37]:
# DON'T USE THIS LLAMA2 MODEL FOR NOW
#llama2_model_name = "meta-llama/Llama-2-7b"
#llama2_local_model = "../data/Llama-2-7b"

# If you understand the vocab_utils.py file, 
# you can freely change the following variables to test other models.
# Otherwise, please DON'T change these variables.
llama3_model_name = "meta-llama/Llama-3.1-8b"
llama3_local_model = "../data/Llama-3.1-8b"

vocab3_file = os.path.join(llama3_local_model,"vocab.txt")
vocab3_file

'../data/Llama-3.1-8b/vocab.txt'

We load the tokenizer from the remote huggingface repo via the name of the model. 

In [38]:
# Please ignore this cell, because it requires a license. 
# Its purpose is to load the tokenizer remotely and save it in your local folder. 
# You need to obtain a license for your own access to the model.
# https://huggingface.co/meta-llama/Meta-Llama-3-8B
# via the steps here. https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/172
# Feel free to do that, because you will need it in Assignment 4 or your project.
# But you don't need it for this assignment. 
# We have saved the tokenizer for you locally in the data folder.
#tokenizer = vocab_utils.save_tokenizer_to_local(llama3_model_name, llama3_local_model)
#vocab_utils.save_vocab(tokenizer, vocab3_file)

### Load Llama3 Tokenizer Locally
For this assignment, we will load the Llama3 tokenizer locally

In [39]:
tokenizer = vocab_utils.load_local_tokenizer(llama3_local_model)

tokenizer loaded from ../data/Llama-3.1-8b


### Test the Llama3 Tokenizer

#### Test on English

Please investigate the whitespace, line breaks, and so on. 

In [40]:
non_white_space_sequence = "BuyableInstoreAndOnline\n"
tokenized_sentences = tokenizer.tokenize(non_white_space_sequence)
print(f"tokenized_sentences for \n{non_white_space_sequence}\n{tokenized_sentences}")

tokenized_sentences for 
BuyableInstoreAndOnline

['Buy', 'able', 'In', 'store', 'And', 'Online', 'Ċ']


In [41]:
white_space_sequence = "Buyable In store And Online\n"
tokenized_sentences = tokenizer.tokenize(white_space_sequence)
# Please pay attention to the special whitespace and \n symbols in the sequence.
# They are ordinary unicode characters that stand in for bytes, not newly added characters.
# (0x20 is space; it is shown as U+0120, i.e. shifted by 0x100)
# (0x0a is newline; it is shown as U+010a, i.e. shifted by 0x100) 
for token in tokenized_sentences:
    print(f"for token = {token}")
    for ch in token:
        print(f"{ch}, U+{ord(ch):04x}")
print(f"tokenized_sentences for \n{white_space_sequence}\n{tokenized_sentences}")

for token = Buy
B, U+0042
u, U+0075
y, U+0079
for token = able
a, U+0061
b, U+0062
l, U+006c
e, U+0065
for token = ĠIn
Ġ, U+0120
I, U+0049
n, U+006e
for token = Ġstore
Ġ, U+0120
s, U+0073
t, U+0074
o, U+006f
r, U+0072
e, U+0065
for token = ĠAnd
Ġ, U+0120
A, U+0041
n, U+006e
d, U+0064
for token = ĠOnline
Ġ, U+0120
O, U+004f
n, U+006e
l, U+006c
i, U+0069
n, U+006e
e, U+0065
for token = Ċ
Ċ, U+010a
tokenized_sentences for 
Buyable In store And Online

['Buy', 'able', 'ĠIn', 'Ġstore', 'ĠAnd', 'ĠOnline', 'Ċ']
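
The `Ġ` and `Ċ` symbols above come from the byte-to-unicode table that byte-level BPE tokenizers use to make every byte printable. The sketch below rebuilds that table as GPT-2 defines it. That the Llama-3.1 vocabulary file uses the same table is an assumption, although the output above is consistent with it. The helper names `byte_to_unicode_table` and `decode_token` are illustrative and are not part of `vocab_utils`.

```python
def byte_to_unicode_table():
    """Map each of the 256 byte values to one printable unicode character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(0x21, 0x7F))      # '!' .. '~'
        + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))   # U+00AE .. U+00FF
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (space, control bytes, ...) is moved to U+0100 and up,
    # in increasing byte order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(0x100 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = byte_to_unicode_table()
CHAR_TO_BYTE = {ch: b for b, ch in BYTE_TO_CHAR.items()}


def decode_token(token):
    """Turn a vocabulary entry back into the text it stands for."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # A token may hold only part of a multi-byte character, hence errors="replace".
    return raw.decode("utf-8", errors="replace")


print(repr(BYTE_TO_CHAR[0x20]), repr(BYTE_TO_CHAR[0x0A]))
print(repr(decode_token("\u0120store")))
```

Only the bytes that are not already printable are shifted, so "add 0x100" holds for space and newline but not for every byte value.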


#### Test Emojis

In [42]:
# testing emojis, https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)
sequence = "😀🙅🏻" # The second emoji is a so-called emoji with a modifier.
# The string is naturally in unicode in Python.
# You could print the unicode for each character in the string.
for ch in sequence:
    print(f"{ch}, U+{ord(ch):04x}")
tokenized_sentences = tokenizer.tokenize(sequence)
print(f"tokenized_sentences for \n{sequence}\n{tokenized_sentences}")

😀, U+1f600
🙅, U+1f645
🏻, U+1f3fb
tokenized_sentences for 
😀🙅🏻
['ðŁĺ', 'Ģ', 'ðŁ', 'Ļ', 'ħ', 'ðŁ', 'ı', '»']


#### Test on Combining Characters

In [43]:
# The combining characters for 'Tokenization is fun!'
# You could generate more here https://lingojam.com/ZalgoText 
sequence = "T̵o̵k̴e̵n̷i̸z̵a̵t̴i̴o̴n̵ ̴i̶s̵ ̷f̸u̴n̶!̵" 
# The string is naturally in unicode in Python.
# You could print the unicode for each character in the string.
for ch in sequence:
    print(ch, ord(ch))
tokenized_sentences = tokenizer.tokenize(sequence)
print(f"tokenized_sentences for \n{sequence}\n{tokenized_sentences}")

T 84
̵ 821
o 111
̵ 821
k 107
̴ 820
e 101
̵ 821
n 110
̷ 823
i 105
̸ 824
z 122
̵ 821
a 97
̵ 821
t 116
̴ 820
i 105
̴ 820
o 111
̴ 820
n 110
̵ 821
  32
̴ 820
i 105
̶ 822
s 115
̵ 821
  32
̷ 823
f 102
̸ 824
u 117
̴ 820
n 110
̶ 822
! 33
̵ 821
tokenized_sentences for 
T̵o̵k̴e̵n̷i̸z̵a̵t̴i̴o̴n̵ ̴i̶s̵ ̷f̸u̴n̶!̵
['T', 'Ì', 'µ', 'o', 'Ì', 'µ', 'k', 'Ì', '´', 'e', 'Ì', 'µ', 'n', 'Ì', '·', 'i', 'Ì', '¸', 'z', 'Ì', 'µ', 'a', 'Ì', 'µ', 't', 'Ì', '´', 'i', 'Ì', '´', 'o', 'Ì', '´', 'n', 'Ì', 'µ', 'Ġ', 'Ì', '´', 'i', 'Ì', '¶', 's', 'Ì', 'µ', 'Ġ', 'Ì', '·', 'f', 'Ì', '¸', 'u', 'Ì', '´', 'n', 'Ì', '¶', '!', 'Ì', 'µ']


### Initial Exploration for Llama3's Vocabulary

As an example, the following cells show a list of Linux commands for an initial analysis of the vocabulary. Please try to find the corresponding Python code with the same function, which is not hard. This is just optional practice for verifying your Python code.

The following command counts the total number of lines in vocab.txt.

In [44]:
!wc -l ../data/Llama-3.1-8b/vocab.txt

  128256 ../data/Llama-3.1-8b/vocab.txt


So we're in the right ballpark with 128256 lines (tokens). 

In [45]:
# Python code for count lines in the file ../data/Llama-3.1-8b/vocab.txt
line_count = 0
with open(vocab3_file) as f:
    line_count = sum(1 for _ in f)

print(f"line_count = {line_count}")

line_count = 128256


So the above Python code reads the vocabulary file and counts its lines (one token per line).
Let's see what's in there.
Because the list is too long, you can use the `head` command to print the first 10 lines.

In [46]:
!head ../data/Llama-3.1-8b/vocab.txt

!
"
#
$
%
&
'
(
)
*


OK, those look like the initial set of characters we mentioned in our lectures. Remember we said that subword tokenization algorithms start with an initial vocabulary of characters. In lecture 2, we only considered letters and numbers. That's really not quite right: if you're using arbitrary web docs and things like Wikipedia, then you're going to run into a lot of odd characters, such as the combining characters used in the fentanyl example. Better to just use all the unicode characters that occur in the training text. Let's see what we get when we look at all the single-character entries in the list.

In [47]:
!grep '^.$' ../data/Llama-3.1-8b/vocab.txt  | head

!
"
#
$
%
&
'
(
)
*


In [48]:
%%bash
single_char_num=$(grep '^.$' ../data/Llama-3.1-8b/vocab.txt  | wc -l)
echo "With grep command, there are ${single_char_num} single characters, which are all the 2^8 bytes"

With grep command, there are      256 single characters, which are all the 2^8 bytes


In [49]:
import re
# please learn to use python regex. https://www.w3schools.com/python/python_regex.asp
single_char_num = 0
with open(vocab3_file) as f:
    for token in f:
        if re.match(r'^.$', token):
            single_char_num += 1
print(f"With python regex, there are {single_char_num} single characters, which are all the 2^8 bytes")

With python regex, there are 256 single characters, which are all the 2^8 bytes


Next, let us take a look at the special tokens with the shape "<|.*|>". https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py#L21 
Please try to write a Python program that replicates this result for detecting the special tokens. 

In [50]:
!grep -E -o '^<\|.*\|>$' ../data/Llama-3.1-8b/vocab.txt| head -50

<|begin_of_text|>
<|end_of_text|>
<|reserved_special_token_0|>
<|reserved_special_token_1|>
<|finetune_right_pad_id|>
<|reserved_special_token_2|>
<|start_header_id|>
<|end_header_id|>
<|eom_id|>
<|eot_id|>
<|python_tag|>
<|reserved_special_token_3|>
<|reserved_special_token_4|>
<|reserved_special_token_5|>
<|reserved_special_token_6|>
<|reserved_special_token_7|>
<|reserved_special_token_8|>
<|reserved_special_token_9|>
<|reserved_special_token_10|>
<|reserved_special_token_11|>
<|reserved_special_token_12|>
<|reserved_special_token_13|>
<|reserved_special_token_14|>
<|reserved_special_token_15|>
<|reserved_special_token_16|>
<|reserved_special_token_17|>
<|reserved_special_token_18|>
<|reserved_special_token_19|>
<|reserved_special_token_20|>
<|reserved_special_token_21|>
<|reserved_special_token_22|>
<|reserved_special_token_23|>
<|reserved_special_token_24|>
<|reserved_special_token_25|>
<|reserved_special_token_26|>
<|reserved_special_token_27|>
<|reserved_special_token_28|>
<|res

In [51]:
!grep -E -o '^<\|.*\|>$' ../data/Llama-3.1-8b/vocab.txt|wc -l

     256
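
One possible Python counterpart of the `grep` command above is sketched here. It runs on a small made-up sample so that it is self-contained. The function name `find_special_tokens` and the sample lines are illustrative assumptions; with the real file you would pass `open(vocab3_file, encoding="utf-8")` instead.

```python
import re

# Same shape as the grep pattern: the whole line is <| ... |>.
SPECIAL_TOKEN = re.compile(r"<\|.*\|>")


def find_special_tokens(lines):
    """Return the entries that consist entirely of a <|...|> special token."""
    tokens = (line.rstrip("\n") for line in lines)
    # fullmatch plays the role of the ^ and $ anchors in the grep pattern.
    return [tok for tok in tokens if SPECIAL_TOKEN.fullmatch(tok)]


# Hypothetical sample lines, not taken from the real vocabulary order.
sample_lines = ["!\n", "<|begin_of_text|>\n", "\u0120the\n", "<|eot_id|>\n", "<|\n"]
special = find_special_tokens(sample_lines)
print(len(special), special)
```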


So there are 256 used or reserved special tokens in the Llama-3.1-8b vocabulary. Next, we see that the special character Ġ marks a leading whitespace in a token. 

In [52]:
!grep '^Ġ'< ../data/Llama-3.1-8b/vocab.txt | wc -l

   57875


In [53]:
!grep -E -o '[[:digit:]]+' < ../data/Llama-3.1-8b/vocab.txt| head -200

0
1
2
3
4
5
6
7
8
9
00
20
10
201
12
19
11
32
16
15
25
000
30
18
14
13
100
200
17
50
24
64
40
22
60
23
99
80
27
28
26
33
29
21
01
35
45
37
36
90
34
38
70
75
44
55
39
31
48
66
05
08
202
04
65
88
02
49
78
09
199
07
68
47
500
06
95
46
77
03
59
58
42
69
67
300
41
255
57
98
43
400
56
97
198
150
51
87
52
001
256
96
86
102
53
120
54
128
197
123
89
79
101
800
76
111
600
110
250
196
180
85
72
63
999
62
61
74
84
130
91
192
81
73
71
83
92
82
003
195
94
160
93
194
125
105
108
002
127
360
104
140
103
700
190
112
193
115
106
900
404
191
107
109
010
204
121
114
116
113
170
122
240
512
005
117
220
350
004
333
210
124
135
118
144
168
119
131
141
188
189
126
132
133
134
320
145
203
187
151


Let's see how many tokens start with the whitespace character, which is shown as the unicode character 'Ġ'.

In [54]:
!grep '^Ġ' ../data/Llama-3.1-8b/vocab.txt |  wc -l


   57875
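
A Python counterpart of this count could look like the sketch below. It uses a small made-up sample instead of the real file, and `leading_space_tokens` is an illustrative name rather than a function from `vocab_utils`.

```python
LEADING_SPACE = "\u0120"  # the printable stand-in for a space byte


def leading_space_tokens(lines):
    """Return the tokens whose first symbol is the leading-space marker."""
    tokens = (line.rstrip("\n") for line in lines)
    return [tok for tok in tokens if tok.startswith(LEADING_SPACE)]


# Hypothetical sample lines; with the real file, iterate over the open file instead.
sample_lines = ["in\n", "\u0120the\n", "\u0120store\n", "\u010a\n"]
found = leading_space_tokens(sample_lines)
# Show the tokens with the marker turned back into a normal space.
print(len(found), [tok.replace(LEADING_SPACE, " ") for tok in found])
```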


In [55]:
!grep -v '\[' ../data/Llama-3.1-8b/vocab.txt | grep -v '^.$' | grep -v '^Ġ' | head -200

in
er
on
re
at
st
en
or
ĊĊ
le
it
an
ar
al
;Ċ
ou
is
ing
es
ion
ed
ic
et
ĉĉ
ro
as
el
ct
nd
ent
id
am
--
om
);Ċ
im
čĊ
il
//
ur
se
ex
ad
ch
ut
if
**
em
ol
th
)Ċ
ig
iv
,Ċ
ce
od
ate
ag
ay
ot
us
un
ul
ue
ow
ew
ation
()
ab
ort
um
ame
pe
tr
ck
âĢ
ist
----
.ĊĊ
he
lo
ers
ap
ub
ass
int
>Ċ
ly
urn
;ĊĊ
av
port
ir
->
nt
ction
end
00
ith
out
turn
our
lic
res
pt
==
ver
age
ht
ext
="
****
ess
os
and
ect
ke
rom
con
("
qu
lass
iz
de
op
up
get
ile
ata
ore
ri
;čĊ
ĉĉĉĉ
ter
ain
art
ack
import
ublic
est
ment
able
ine
ill
ind
ere
::
ity
elf
ight
('
orm
ult
str
..
",
ype
pl
20
ld
oc
:Ċ
--------
.s
{Ċ
',
ant
ase
.c
</
ave
ang
âĢĻ
_t
ert
ial
act
}Ċ
ive
ode
ost
og
ord
alue
all
ff
();Ċ
ont
ime
are
ies
ize
ure
ire
.p
ice
ast
ption
tring
ok
grep: stdout: Broken pipe


Although it's not stated, this looks like a frequency-ordered list: "in" is at the top. (For a BPE vocabulary this is more precisely the merge order, which roughly follows frequency.) Some of these are recognizable as English suffixes (-ed, -ing, -ly, etc.).

Let's just sort it alphanumerically to see what's in there.

In [56]:
!grep -v '\[' < ../data/Llama-3.1-8b/vocab.txt | grep -v '^.$' | grep -v '^Ġ'  | sort | head -200

!!
!!!
!!!!
!!!!!
!!!!!!!!
!!!!ĊĊ
!!!Ċ
!!!ĊĊ
!!");Ċ
!!)Ċ
!!,
!!.
!!Ċ
!!ĊĊ
!"
!")
!");
!");Ċ
!");ĊĊ
!");čĊ
!")Ċ
!")ĊĊ
!")čĊ
!",
!",Ċ
!".
!";Ċ
!";čĊ
!"Ċ
!"ĊĊ
!'
!');Ċ
!')Ċ
!')ĊĊ
!',
!',Ċ
!';Ċ
!'Ċ
!(
!("
!("{
!("{}",
!(:
!(Ċ
!)
!),
!).
!).ĊĊ
!)Ċ
!)ĊĊ
!*
!*\Ċ
!,
!,Ċ
!--
!.
!..
!.ĊĊ
!/
!:
!;Ċ
!<
!</
!=
!="
!='
!=(
!=-
!==
!?
!I
!\
!]
!important
!Â»
!âĢĻ
!âĢľ
!âĢľĊĊ
!âĢĿ
!âĢĿĊĊ
!Ċ
!ĊĊ
!ĊĊĊ
!ĊĊĊĊ
!ĊĊĊĊĊĊ
!čĊ
""
"""
"""),Ċ
""")Ċ
""",Ċ
""".
"""Ċ
"""ĊĊ
"""ĊĊĊ
"""čĊ
"""čĊčĊ
"",
"",Ċ
"".
"":
""Ċ
"#
"$
"${
"%
"%(
"&
"'
"',
"',Ċ
"';
"';Ċ
"'Ċ
"(
")
")!=
")(
"))
")))
"))))Ċ
")));
")));Ċ
")));ĊĊ
")));čĊ
")))Ċ
")),
")),Ċ
")).
"));
"));Ċ
"));ĊĊ
"));čĊ
"));čĊčĊ
")){Ċ
")){čĊ
"))Ċ
"))ĊĊ
"))čĊ
")+
"),
"),"
"),Ċ
"),ĊĊ
"),čĊ
")->
").
").Ċ
").ĊĊ
"):
"):Ċ
"):čĊ
");
");//
");}Ċ
");Ċ
");ĊĊ
");ĊĊĊ
");čĊ
");čĊčĊ
")==
")]
")]Ċ
")]ĊĊ
")]čĊ
"){
"){Ċ
"){čĊ
")}
")},Ċ
")}Ċ
")Ċ
")ĊĊ
")ĊĊĊ
")čĊ
")čĊčĊ
"*
"+
"+"
"+Ċ
",
","
","");Ċ
","",
","","
","#
","+
","\
",$
",&
",'
",(
",-
",@"
",__
",{
",Ċ
",ĊĊ
",čĊ
"-
sort: Broken pip

Hmm. A lot of punctuation, and many entries with line breaks.

In [57]:
!grep -v '\[' < ../data/Llama-3.1-8b/vocab.txt | grep -v '^.$' | grep -v '^Ġ' | grep -v '^Ċ' | wc -l

   69647


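Let's count the runs of digits. Note that this pattern is not anchored with `^` and `$`, so it also counts digits embedded in longer tokens such as `<|reserved_special_token_0|>`. An anchored pattern such as `^[[:digit:]]+$` would keep only the tokens that consist entirely of digits.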

In [58]:
!grep -E -o '[[:digit:]]+' ../data/Llama-3.1-8b/vocab.txt | wc -l

    1358
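
The difference between counting digit runs and counting whole-line numbers can be sketched in Python as follows. The sample tokens are made up for illustration, and both function names are hypothetical.

```python
import re

DIGIT_RUN = re.compile(r"[0-9]+")


def count_digit_runs(tokens):
    """Like `grep -o ... | wc -l`: count every run of digits, wherever it occurs."""
    return sum(len(DIGIT_RUN.findall(tok)) for tok in tokens)


def count_numeric_tokens(tokens):
    """Count only the tokens that are digits from start to end."""
    return sum(1 for tok in tokens if DIGIT_RUN.fullmatch(tok))


# Hypothetical sample: three pure numbers and one special token with a digit inside.
sample_tokens = ["0", "201", "<|reserved_special_token_3|>", "\u0120the", "999"]
print(count_digit_runs(sample_tokens), count_numeric_tokens(sample_tokens))
```

On this sample the two counts differ by one because of the reserved special token, which is the kind of gap to look for on the real vocabulary.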


## Your Submission for Part 1

* (1) Make sure that all cells of this Jupyter notebook have been run and show their outputs (including the cells above and the new cells you add below), and export a PDF file as your main submission.
* (2) Submit the whole Assignment-1 folder as a zip file. Your code and instructions should either live in this Jupyter notebook or in separate Python files in the src folder. Any external resources should be placed in the data folder. We will use this to rerun your code. 

#### Put Your Research Question Here
### How are emojis tokenized by the tokenizer? What are the related tokens in the vocabulary?

```Please add a paragraph here to explain what is your research question?```

### Why This Question?
Emojis are a key part of modern communication, and understanding how they are tokenized is crucial for tasks like sentiment analysis, social media text processing, and chatbot interactions.

In [59]:
vocab_file = "../data/Llama-3.1-8b/vocab.txt"
with open(vocab_file, "r", encoding="utf-8") as f:
    vocab_tokens = [line.strip() for line in f]

In [60]:
import emoji

# Function to check if a token contains an emoji
def is_emoji(token):
    return bool(emoji.emoji_list(token))

# Filter tokens that contain emojis
emoji_tokens = [token for token in vocab_tokens if is_emoji(token)]

In [61]:
sample_emojis = ["😀", "🙅🏻", "👨‍👩‍👧‍👦"]  # Simple emoji, emoji with modifier, and family emoji

for emoji_char in sample_emojis:
    tokenized = tokenizer.tokenize(emoji_char)
    print(f"Emoji: {emoji_char} → Tokens: {tokenized}")

Emoji: 😀 → Tokens: ['ðŁĺ', 'Ģ']
Emoji: 🙅🏻 → Tokens: ['ðŁ', 'Ļ', 'ħ', 'ðŁ', 'ı', '»']
Emoji: 👨‍👩‍👧‍👦 → Tokens: ['ðŁ', 'ĳ', '¨', 'âĢį', 'ðŁ', 'ĳ', '©', 'âĢį', 'ðŁ', 'ĳ', '§', 'âĢį', 'ðŁ', 'ĳ', '¦']


In [62]:
num_emoji_tokens = len(emoji_tokens)
print(f"Number of emoji-related tokens in the vocabulary: {num_emoji_tokens}")
print("Sample emoji-related tokens:")
print(emoji_tokens[:10])  # Print the first 10 emoji-related tokens

Number of emoji-related tokens in the vocabulary: 1641
Sample emoji-related tokens:
['©', '®', 'Ã©', 'ĠÃ©', 'Ã©s', 'ĠÂ©', 'ĠdÃ©', 'Ã©e', 'å®', 'ĠrÃ©']
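
The sample tokens printed above (for example 'Ã©') look like byte-level spellings of accented letters rather than emoji: under the GPT-2 style byte table, 'Ã©' stands for the bytes C3 A9, which is UTF-8 for 'é'. If so, the count of 1641 likely includes many non-emoji tokens. A hedged alternative is to decode each token back to text first and only then test for emoji. The sketch below assumes that the vocabulary uses the GPT-2 style byte table, and it uses two rough code point ranges instead of the `emoji` package so that it stays self-contained. The names `char_to_byte_table`, `decode_token`, `contains_emoji`, and `EMOJI_RANGES` are illustrative.

```python
def char_to_byte_table():
    """Inverse of the GPT-2 style byte-to-unicode table (assumed to apply here)."""
    printable = set(range(0x21, 0x7F)) | set(range(0xA1, 0xAD)) | set(range(0xAE, 0x100))
    table = {chr(b): b for b in printable}
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shown as U+0100, U+0101, ... in byte order.
            table[chr(0x100 + shift)] = b
            shift += 1
    return table


CHAR_TO_BYTE = char_to_byte_table()

# Rough, illustrative ranges only; the Unicode emoji data files are the real reference.
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]


def decode_token(token):
    """Return the decoded text, or None when the bytes are not complete UTF-8."""
    try:
        raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
        return raw.decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        return None


def contains_emoji(token):
    """True if the decoded token holds a character in the rough emoji ranges."""
    text = decode_token(token)
    if text is None:
        return False
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in EMOJI_RANGES)


print(contains_emoji("\u00c3\u00a9"))              # byte-level spelling of 'é'
print(contains_emoji("\u00f0\u0141\u013a\u0122"))  # byte-level spelling of U+1F600
```

Tokens that hold only a prefix of an emoji's bytes (such as 'ðŁĺ') do not decode on their own, so finding them needs a separate check, for example comparing the token's bytes against the first bytes of known emoji.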


In [63]:
from transformers import AutoTokenizer

# Define the emoji sequence
text = "😀🙅🏻🔥"

# Tokenizer models to test
models = {
    "GPT-2 (Byte-Level BPE)": "gpt2",
    "BERT (WordPiece)": "bert-base-uncased",
    "T5 (SentencePiece)": "t5-small"
}

# Process each tokenizer
for model_name, model_path in models.items():
    print(f"\n{model_name} Tokenization:")
    
    # Load tokenizer (separate name so the Llama-3.1 `tokenizer` above is not overwritten)
    other_tokenizer = AutoTokenizer.from_pretrained(model_path)
    
    # Tokenize emojis
    tokens = other_tokenizer.tokenize(text)
    token_ids = other_tokenizer.convert_tokens_to_ids(tokens)
    
    # Display results
    print(f"Original Text: {text}")
    print(f"Tokenized Output: {tokens}")
    print(f"Token IDs: {token_ids}")
    print("-" * 50)



GPT-2 (Byte-Level BPE) Tokenization:
Original Text: 😀🙅🏻🔥
Tokenized Output: ['ðŁĺ', 'Ģ', 'ðŁ', 'Ļ', 'ħ', 'ðŁ', 'ı', '»', 'ðŁ', 'Ķ', '¥']
Token IDs: [47249, 222, 8582, 247, 227, 8582, 237, 119, 8582, 242, 98]
--------------------------------------------------

BERT (WordPiece) Tokenization:
Original Text: 😀🙅🏻🔥
Tokenized Output: ['[UNK]']
Token IDs: [100]
--------------------------------------------------

T5 (SentencePiece) Tokenization:
Original Text: 😀🙅🏻🔥
Tokenized Output: ['▁', '😀🙅🏻🔥']
Token IDs: [3, 2]
--------------------------------------------------


### Conclusion

Emoji tokenization depends on the tokenizer type:
* GPT-2 (byte-level BPE): Emojis are split into multiple byte-level tokens because GPT-2 operates on UTF-8 bytes. Example: "😀🙅🏻" → ['ðŁĺ', 'Ģ', 'ðŁ', 'Ļ', 'ħ', 'ðŁ', 'ı', '»']. Llama-3.1 produces the same split for these emojis in the cells above.
* BERT (WordPiece): The emoji sequence is replaced with [UNK] (ID 100), so the emoji information is lost.
* T5 (SentencePiece): In the run above, the whole emoji sequence comes back as one piece with ID 2, which is T5's unknown-token ID, so these emojis are not covered by the T5 vocabulary either. Whether other emojis exist as full T5 tokens was not checked here.

Related tokens in the vocabulary:
* In GPT-2 (and Llama-3.1), emojis appear as byte-sequence tokens such as 'ðŁĺ' and 'ðŁı'; the emojis tested here do not exist as single tokens.
* In BERT, the tested emojis are missing and map to [UNK].
* In T5, the tested emojis are also missing and map to the unknown token.

#### Put All Your Code Cells Here

In [64]:
# Then please add more cells to study your question, as above.
# You can write all your code in the cells below.
# Or you can write your code in a Python file in the src folder and call it in the cells below.
# Take vocab_utils.py and hello_world.py as examples: they are single files in the src folder,
# but they are imported in the Jupyter notebook and run in the cells.

# Part 2. N-Gram Language Model

Your task for Part 2 is to build an N-gram language model from scratch. Your language modeling
program should accept three input files: (1) a training corpus file, (2) a test sentences file, and (3) a seeds file, which will contain a list of words to begin the language generation process. 
In Part 1, you learned to use Jupyter notebooks. In Part 2, you will learn to run your program from the command line. Your program should accept three files as command-line arguments in the following order:

```python ngram.py <training_file> <test_file> <seeds_file>```

## Grading Criteria

We will run your program on the files that we give you as well as new files to evaluate the
generality and correctness of your code. So please test your program thoroughly! Even
if your program works perfectly on the examples that we give you, that does not guarantee
that it will work perfectly on different test cases.

## Input Files

All the input files are located in the data folder.

### Training File
The training file will consist of sentences, one sentence per line. For example, a training file might look like this:
```
I love natural language processing .
This assignment looks like fun !
```
You should divide each sentence into unigrams based **solely on white space**. Note that this
can produce isolated punctuation marks (when white space separates a punctuation mark
from adjacent words) as well as words with punctuation symbols that are still attached (when
white space does NOT separate a punctuation mark from an adjacent word). For example,
consider the following sentence:
```
“This is a funny-looking sentence” , she said !
```
This sentence should be divided into exactly nine unigrams:
(1) “This (2) is (3) a (4) funny-looking (5) sentence” (6) , (7) she (8) said (9) !
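
The nine-unigram split above is what a plain whitespace split produces. A minimal sketch, with `tokenize` as an illustrative function name:

```python
def tokenize(sentence):
    """Split on white space only; punctuation stays attached unless it is spaced out."""
    return sentence.split()


# The example sentence from the text, with its curly quotes written as escapes.
example = "\u201cThis is a funny-looking sentence\u201d , she said !"
unigrams = tokenize(example)
print(len(unigrams), unigrams)
```

Lowercasing is left out here on purpose; it can be applied when the frequency tables are built, since those are the part that must be case-insensitive.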

### Test File
The test file will have exactly the same format as the training file and it should be divided
into unigrams exactly the same way. So please don't worry about punctuation: just use white space to tokenize each raw sentence, without any other processing. You may ask about a validation or development set. Since N-gram training is almost deterministic, in this assignment we will not use a development set to select model hyperparameters or a held-out set to further test generalization.

### Seeds File
The seeds file will have one word per line, and each word should be used to start the language generation process.

## Building the N-gram Language Models

To create the N-gram language models, you will need to generate tables of frequency counts
from the training corpus for unigrams (1-grams) and bigrams (2-grams). An N-gram should
not cross sentence boundaries. All of your N-gram tables should be case-insensitive (i.e.,
“the”, “The”, and “THE” should be treated as the same word).

You should create three different types of language models:
* (a) A unigram language model with no smoothing.
* (b) A bigram language model with no smoothing.
* (c) A bigram language model with add-one smoothing.

You can assume that the set of unigrams found in the training corpus is the entire universe
of unigrams. We will not give you test sentences that contain unseen unigrams. So the
vocabulary $V$ for this assignment is the set of unique unigrams that occur in the training
corpus.

However, we will give you test sentences that contain bigrams that did not appear in the
training corpus. The n-grams will consist entirely of unigrams that appeared in the training
corpus, but there may be new (previously unseen) combinations of the unigrams. The first
two language models (a and b) do not use smoothing, so unseen bigrams should be assigned
a probability of zero. For the last language model (c), you should use add-one smoothing to
compute the probabilities for all of the bigrams.

For bigrams, you will need a special pseudo-word “\<s\>” as a beginning-of-sentence symbol. Bigrams of the form "\<s\> $w_i$" mean that word $w_i$ occurs at the beginning of the sentence. Do NOT include "\<s\>" as a word in your vocabulary for the unigram language model or include "\<s\>" in the sentence probability for the unigram model.
For simplicity, just use the unigram frequency count of $w_{k-1}$ to compute the conditional probability $P(w_k \mid w_{k-1})$. (This means you won’t have to worry about cases where $w_{k-1}$ occurs at the end of the sentence and isn’t followed by anything.) For example, just compute $P(w_k \mid w_{k-1}) = count(w_{k-1} w_k) / count(w_{k-1})$.
You should NOT use an end-of-sentence symbol. The last bigram for a sentence of length $n$ should
represent the last 2 words of the sentence: $w_{n-1} w_n$.
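
The counting rules above can be sketched as follows. This is one possible reading, not a reference solution. Two points are assumptions: the context count for "\<s\>" is taken to be the number of training sentences, and add-one smoothing is written in its usual form, $P(w_k \mid w_{k-1}) = (count(w_{k-1} w_k) + 1) / (count(w_{k-1}) + |V|)$, with $|V|$ the number of unique training unigrams. All function names are illustrative.

```python
from collections import Counter

BOS = "<s>"  # beginning-of-sentence pseudo-word, never stored as a unigram


def count_ngrams(sentences):
    """Build case-insensitive unigram and bigram counts, one sentence at a time."""
    unigrams, bigrams = Counter(), Counter()
    n_sentences = 0
    for sentence in sentences:
        words = sentence.lower().split()
        if not words:
            continue
        n_sentences += 1
        unigrams.update(words)
        # Pairs (<s>, w1), (w1, w2), ..., (w_{n-1}, w_n); no end-of-sentence symbol.
        bigrams.update(zip([BOS] + words[:-1], words))
    return unigrams, bigrams, n_sentences


def bigram_prob(prev, word, unigrams, bigrams, n_sentences, smooth=False):
    """P(word | prev), unsmoothed or with add-one smoothing."""
    # Assumption: count(<s>) is the number of training sentences.
    context = n_sentences if prev == BOS else unigrams[prev]
    pair = bigrams[(prev, word)]
    if smooth:
        return (pair + 1) / (context + len(unigrams))
    return pair / context if context else 0.0


train = ["I love NLP .", "I love fun !"]
unigrams, bigrams, n_sentences = count_ngrams(train)
print(bigram_prob("love", "fun", unigrams, bigrams, n_sentences))
print(bigram_prob("love", "fun", unigrams, bigrams, n_sentences, smooth=True))
```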

### N-gram LM Task 1: Computing Sentence Probability (30')

For each of the language models, you should create a function that computes the probability
of a sentence $P(w_1 ... w_n)$ using that language model. Since the probabilities will get very
small, you must do the probability computations in log space (as discussed in class, also see
the lecture slides). Please do these calculations *using log base 2*. 

#### Output Specifications for Task 1

Your program should print the following information for each test sentence. When printing
the logprob numbers, please only print 4 digits after the decimal point. For example, print
-8.9753864210 as -8.9754. The programming language will have a mechanism for controlling
the number of digits that are printed. If $P(S) = 0$, then the logarithm is not defined, so
print logprob(S) = undefined.


Please print the following information, a sentence line, an empty line, and three probabilities formatted like this:

```
S = <sentence>

Unigrams: logprob(S) = #
Bigrams: logprob(S) = #
Smoothed Bigrams: logprob(S) = #
```
For example, your output might look like this (the examples below are not real, they are
just for illustration!):
```
S = Trump has given his second inaugural speech .

Unigrams: logprob(S) = -6.5712
Bigrams: logprob(S) = -9.2253
Smoothed Bigrams: logprob(S) = -10.4291
```
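
The log-space computation and the output format can be sketched as below. The helpers `sentence_logprob` and `format_line` are hypothetical names, and the probabilities passed in are made-up numbers for illustration.

```python
import math


def sentence_logprob(probabilities):
    """Sum log2 of each factor; None signals that P(S) = 0."""
    total = 0.0
    for p in probabilities:
        if p == 0:
            return None  # log of zero is undefined
        total += math.log2(p)
    return total


def format_line(label, logprob):
    """Render one output line with 4 digits after the decimal point."""
    value = "undefined" if logprob is None else f"{logprob:.4f}"
    return f"{label}: logprob(S) = {value}"


print(format_line("Unigrams", sentence_logprob([0.5, 0.25])))
print(format_line("Bigrams", sentence_logprob([0.5, 0.0])))
```

Summing logarithms instead of multiplying probabilities avoids underflow on long sentences, which is the reason for working in log space.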

### N-gram LM Task 2: Language Generator (30')

Your language generator should use the unsmoothed bigram language model to produce new
sentences, probabilistically! Given a seed word, the language generation algorithm is:

1. Find all bigrams that begin with the seed word - let’s call this set $B_{seed}$. Probabilistically select one of the bigrams in $B_{seed}$ with a likelihood proportional to its probability.
For example, suppose “crazy” is the seed and exactly two bigrams begin with “crazy”:
“crazy people” (frequency=10) and “crazy horse” (frequency=15).

Consequently, $P (people | crazy) = \frac{10}{25} = .40$ and $P (horse | crazy) = \frac{15}{25} = .60$
There should be a 40% chance that your program selects “crazy people” and a 60%
chance that it selects “crazy horse”.
An easy way to do this is to generate a random number x in [0,1]. Then establish
ranges based on the bigram probabilities. For example, if 0 ≤ x ≤ .40 then your program
selects “crazy people”, but if .40 < x ≤ 1 then your program selects “crazy horse”.

2. Let’s call the selected bigram $B^\prime=w_0w_1$ (where $w_0$ is the seed). Generate $w_1$ as the next word in your new sentence.

3. Return to Step 1 using $w_1$ as the new seed word to generate the next $w_2$ and continue.

Your program should stop generating words when one of the following conditions exists:

* your program generates one of three words: [. ? !]
* your program generates 40 words (NOT including the original seed word)
* $B_{seed}$ is empty (i.e., there are no bigrams that begin with the seed word).
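
The generation loop and its three stop conditions can be sketched as follows. This is an illustration under assumptions, not a reference solution: bigram counts are stored in a dict keyed by `(previous_word, word)`, the seed is lowercased for lookup, and `random.choices` with weights replaces the hand-built ranges while giving the same proportions. The names `next_word` and `generate` are hypothetical.

```python
import random

END_MARKS = {".", "?", "!"}
MAX_WORDS = 40  # generated words, not counting the seed


def next_word(seed, bigrams, rng):
    """Pick a follower of `seed` with probability proportional to its bigram count."""
    candidates = [(w, c) for (prev, w), c in bigrams.items() if prev == seed]
    if not candidates:
        return None  # B_seed is empty
    words, counts = zip(*candidates)
    return rng.choices(words, weights=counts, k=1)[0]


def generate(seed, bigrams, rng=random):
    """Generate one sentence that starts with `seed`."""
    sentence = [seed]
    current = seed.lower()  # assumption: tables are lowercased
    while len(sentence) - 1 < MAX_WORDS:
        word = next_word(current, bigrams, rng)
        if word is None:
            break
        sentence.append(word)
        if word in END_MARKS:
            break
        current = word
    return " ".join(sentence)


# Hypothetical counts matching the "crazy" example in the text.
bigrams = {("crazy", "people"): 10, ("crazy", "horse"): 15, ("people", "."): 1}
print(generate("crazy", bigrams, random.Random(0)))
```

Scanning every bigram at each step keeps the sketch short; grouping the counts by first word once, before generation, would be the more efficient layout.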

IMPORTANT: For each seed word, your language generator should randomly generate 10
sentences that begin with that word. Since each sentence is generated probabilistically, the
sentences will (usually) be different from each other.

#### Output Specifications for Generation Task

Your program should print each seed word followed by a blank line and then the 10 sentences
generated from that seed word. You should format your output like this:


```
Seed = <seed>

Sentence 1: <sentence>
Sentence 2: <sentence>
Sentence 3: <sentence>
Sentence 4: <sentence> 
Sentence 5: <sentence> 
Sentence 6: <sentence> 
Sentence 7: <sentence> 
Sentence 8: <sentence> 
Sentence 9: <sentence> 
Sentence 10: <sentence> 

Seed = <seed>

Sentence 1: <sentence>
Sentence 2: <sentence>
Sentence 3: <sentence>
Sentence 4: <sentence> 
Sentence 5: <sentence> 
Sentence 6: <sentence> 
Sentence 7: <sentence> 
Sentence 8: <sentence> 
Sentence 9: <sentence> 
Sentence 10: <sentence> 
```

### Your Submission for Part 2

Part 2 does not depend on any of the previous cells in this Jupyter notebook. Instead, you need to create your own Python environment and write your ngram.py from scratch. 
You need to output the log probabilities for the test file and generate 10 sentences for each seed in the seeds file. 
Hence, please submit three things:

1. The source code for your program. Be sure to include all files that are needed to run your program, including a conda environment file!
2. A README file that includes the following information:
   * how to run your code (Python 3.10 is suggested)
   * any known bugs, problems, or limitations of your program
3. Two trace files: (1) ngram-prob.trace for the logprob outputs and (2) ngram-gen.trace for the generated sentences for each seed.