<a href="https://colab.research.google.com/github/sdey17/LLM-tests/blob/main/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Chapter 2 - Tokens and Token Embeddings</h1>
<i>Exploring tokens and embeddings as an integral part of building LLMs</i>


<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter02/Chapter%202%20-%20Tokens%20and%20Token%20Embeddings.ipynb)

---

This notebook is for Chapter 2 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


# Downloading and Running An LLM

The first step is to load the model onto the GPU for faster inference. We load the model and the tokenizer separately, and keep them as separate objects, so that we can explore each on its own.

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

In [34]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|>"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=20,
)

# Print the output
print(tokenizer.decode(generation_output[0]))

Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened.<|assistant|> Subject: Sincere Apologies for the Gardening Mishap


Dear


In [23]:
print(input_ids)

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001]], device='cuda:0')


In [24]:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))

Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
<|assistant|>


In [27]:
generation_output

tensor([[14350,   385,  4876, 27746,  5281,   304, 19235,   363,   278, 25305,
           293, 16423,   292,   286,   728,   481, 29889, 12027,  7420,   920,
           372,  9559, 29889, 32001,  3323,   622, 29901,   317,  3742,   406,
          6225, 11763,   363,   278, 19906,   292,   341,   728,   481,    13,
            13,    13, 29928,   799]], device='cuda:0')

In [28]:
print(tokenizer.decode(3323))
print(tokenizer.decode(622))
print(tokenizer.decode([3323, 622]))
print(tokenizer.decode(29901))

Sub
ject
Subject
:
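
The decoded pieces above suggest how ids and text relate: each id stands for a piece of text, and decoding a list of ids joins the pieces in order. The sketch below mimics that with a three-entry table copied from the output above. `TOY_VOCAB`, `toy_decode`, and `toy_encode` are hypothetical names, and the greedy longest-match split is a simplification chosen for illustration, not the algorithm the Phi-3 tokenizer actually uses.

```python
# Toy id-to-piece table. The three ids and pieces come from the output above;
# everything else in this sketch is a simplification for illustration only.
TOY_VOCAB = {3323: "Sub", 622: "ject", 29901: ":"}


def toy_decode(token_ids):
    """Concatenate the text piece of each id, in order."""
    return "".join(TOY_VOCAB[i] for i in token_ids)


def toy_encode(text):
    """Greedy longest-match split of `text` into known pieces (an assumption,
    not how a trained tokenizer decides its splits)."""
    piece_to_id = {piece: i for i, piece in TOY_VOCAB.items()}
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it
        for end in range(len(text), pos, -1):
            if text[pos:end] in piece_to_id:
                ids.append(piece_to_id[text[pos:end]])
                pos = end
                break
        else:
            raise ValueError(f"no piece matches at position {pos}")
    return ids


print(toy_decode([3323, 622]))
print(toy_encode("Subject:"))
```

Decoding `[3323, 622]` together yields one word even though neither piece is a word on its own, which is the same effect seen in the cell above.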


# Comparing Trained LLM Tokenizers


In [58]:
from transformers import AutoModelForCausalLM, AutoTokenizer

colors_list = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence, tokenizer_name):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )
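
The coloring in `show_tokens` relies on ANSI escape sequences: `\x1b[0;30;48;2;R;G;Bm` switches to black text on a 24-bit background color, and `\x1b[0m` resets the style. The sketch below isolates that part so it can be tried without downloading a tokenizer. `palette` and `colorize` are hypothetical names introduced here, and the three pieces are placeholders rather than real tokenizer output.

```python
# Same values as colors_list above: "R;G;B" strings for 24-bit ANSI backgrounds.
palette = ['102;194;165', '252;141;98', '141;160;203',
           '231;138;195', '166;216;84', '255;217;47']


def colorize(piece, idx):
    """Wrap `piece` in black-on-color ANSI codes, cycling through the palette."""
    background = palette[idx % len(palette)]
    return f'\x1b[0;30;48;2;{background}m{piece}\x1b[0m'


# Placeholder pieces; a real call would pass decoded tokens instead.
pieces = ["Sub", "ject", ":"]
print(' '.join(colorize(p, i) for i, p in enumerate(pieces)))
```

Because the index is taken modulo the palette length, the seventh token reuses the first color, so neighboring tokens stay visually distinct without needing one color per token.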

In [63]:
text = """

English and CAPITALIZATION

🎵鸟
show_tokens False None elif == >= else: two tabs:" " Three tabs: "   "

12.0*50=600

"""

In [64]:
sentence = text
tokenizer_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
token_ids = tokenizer(sentence).input_ids
for idx, t in enumerate(token_ids):
    print(f'\x1b[0;30;48;2;{colors_list[idx % len(colors_list)]}m' + tokenizer.decode(t) + '\x1b[0m', end=' ')
    print(token_ids[idx])
print(tokenizer.decode([220]))


 198

 198
English 15823
 and 290
 CAP 20176
ITAL 40579
IZ 14887
ATION 6234

 198

 198
� 8582
� 236
� 113
� 165
� 116
� 253

 198
show 12860
_ 62
t 83
ok 482
ens 641
 False 10352
 None 6045
 el 1288
if 361
 == 6624
 >= 18189

In [42]:
show_tokens(text, "bert-base-uncased")

[CLS] english and capital ##ization [UNK] [UNK] show _ token ##s false none eli ##f = = > = else : two tab ##s : " " three tab ##s :

In [43]:
show_tokens(text, "bert-base-cased")

[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] [UNK] show _ token ##s F ##als ##e None el ##if = = > = else : two ta

In [66]:
show_tokens(text, "gpt2")


 
 English  and  CAP ITAL IZ ATION 
 
 � � � � � � 
 show _ t ok ens  False  None  el if  ==  >=  else :  two  tabs

In [61]:
show_tokens(text, "google/flan-t5-small")

English and CA PI TAL IZ ATION  <unk>  <unk> show _ to ken s Fal s e None  e l if = = > = else : two tab

In [65]:
# The official tokenizer is `tiktoken`, but this is the same tokenizer on the HF platform
show_tokens(text, "Xenova/gpt-4")



 English  and  CAPITAL IZATION 

 � � � � � � 
 show _tokens  False  None  elif  ==  >=  else :  two  tabs :"  "  Three  tabs :  "    

In [67]:
# You need to request access before being able to use this tokenizer
show_tokens(text, "bigcode/starcoder2-15b")


 
 English  and  CAPITAL IZATION 
 
 � � � � � 
 show _ tokens  False  None  elif  ==  >=  else :  two  tabs :"  "  Three  tabs :

In [68]:
show_tokens(text, "facebook/galactica-1.3b")


 
 English  and  CAP ITAL IZATION 
 
 � � � � � � � 
 show _ tokens  False  None  elif   ==   > =  else :  two  t

In [69]:
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")

[0;30;48;2;102;194;165m[0m [0;30;48;2;252;141;98m
[0m [0;30;48;2;141;160;203m
[0m [0;30;48;2;231;138;195mEnglish[0m [0;30;48;2;166;216;84mand[0m [0;30;48;2;255;217;47mC[0m [0;30;48;2;102;194;165mAP[0m [0;30;48;2;252;141;98mIT[0m [0;30;48;2;141;160;203mAL[0m [0;30;48;2;231;138;195mIZ[0m [0;30;48;2;166;216;84mATION[0m [0;30;48;2;255;217;47m
[0m [0;30;48;2;102;194;165m
[0m [0;30;48;2;252;141;98m�[0m [0;30;48;2;141;160;203m�[0m [0;30;48;2;231;138;195m�[0m [0;30;48;2;166;216;84m�[0m [0;30;48;2;255;217;47m�[0m [0;30;48;2;102;194;165m�[0m [0;30;48;2;252;141;98m�[0m [0;30;48;2;141;160;203m
[0m [0;30;48;2;231;138;195mshow[0m [0;30;48;2;166;216;84m_[0m [0;30;48;2;255;217;47mto[0m [0;30;48;2;102;194;165mkens[0m [0;30;48;2;252;141;98mFalse[0m [0;30;48;2;141;160;203mNone[0m [0;30;48;2;231;138;195melif[0m [0;30;48;2;166;216;84m==[0m [0;30;48;2;255;217;47m>=[0m [0;30;48;2;102;194;165melse[0m [0;30;48;2;252;141;98m:[0m [0;30;48;2;141;160

# Contextualized Word Embeddings From a Language Model (Like BERT)

In [70]:
from transformers import AutoModel, AutoTokenizer

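# Load a tokenizer
# Note: this tokenizer comes from a different checkpoint (deberta-base) than the
# model loaded below (deberta-v3-xsmall), so its token ids do not correspond to
# the vocabulary that model was trained with. The outputs shown below were
# produced with this pairing; for meaningful embeddings, load the tokenizer and
# the model from the same checkpoint.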
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")

# Load a language model
model = AutoModel.from_pretrained("microsoft/deberta-v3-xsmall")

# Tokenize the sentence
tokens = tokenizer('Hello world', return_tensors='pt')

# Process the tokens
output = model(**tokens)[0]

In [71]:
output.shape

torch.Size([1, 4, 384])

In [74]:
output

tensor([[[-3.4816,  0.0861, -0.1819,  ..., -0.0612, -0.3911,  0.3017],
         [ 0.1898,  0.3208, -0.2315,  ...,  0.3714,  0.2478,  0.8048],
         [ 0.2071,  0.5036, -0.0485,  ...,  1.2175, -0.2292,  0.8582],
         [-3.4278,  0.0645, -0.1427,  ...,  0.0658, -0.4367,  0.3834]]],
       grad_fn=<NativeLayerNormBackward0>)

In [73]:
model

DebertaV2Model(
  (embeddings): DebertaV2Embeddings(
    (word_embeddings): Embedding(128100, 384, padding_idx=0)
    (LayerNorm): LayerNorm((384,), eps=1e-07, elementwise_affine=True)
    (dropout): StableDropout()
  )
  (encoder): DebertaV2Encoder(
    (layer): ModuleList(
      (0-11): 12 x DebertaV2Layer(
        (attention): DebertaV2Attention(
          (self): DisentangledSelfAttention(
            (query_proj): Linear(in_features=384, out_features=384, bias=True)
            (key_proj): Linear(in_features=384, out_features=384, bias=True)
            (value_proj): Linear(in_features=384, out_features=384, bias=True)
            (pos_dropout): StableDropout()
            (dropout): StableDropout()
          )
          (output): DebertaV2SelfOutput(
            (dense): Linear(in_features=384, out_features=384, bias=True)
            (LayerNorm): LayerNorm((384,), eps=1e-07, elementwise_affine=True)
            (dropout): StableDropout()
          )
        )
        (intermedia

In [75]:
for token in tokens['input_ids'][0]:
    print(tokenizer.decode(token))

[CLS]
Hello
 world
[SEP]


In [76]:
tokens

{'input_ids': tensor([[    1, 31414,   232,     2]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}
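
The model output above has shape `[1, 4, 384]`: one batch entry, four tokens, and one 384-dimensional vector per token. One common way to collapse per-token vectors into a single vector for the whole text is to average them while skipping padded positions, using the `attention_mask`. The sketch below shows that idea in plain Python. `mean_pool` is a hypothetical helper, and the vectors and mask are made-up toy values rather than model output.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors at positions whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Made-up 3-dimensional vectors for four positions; the last one is padding.
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 4.0], [9.0, 9.0, 9.0]]
toy_mask = [1, 1, 1, 0]

print(mean_pool(toy_vectors, toy_mask))
```

The padded position is ignored, so the result depends only on the three real tokens. In the `tokens` dictionary above the mask is all ones, so every position would contribute.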

# Text Embeddings (For Sentences and Whole Documents)

In [78]:
!pip install sentence_transformers


In [79]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to text embeddings
vector = model.encode("Best movie ever!")

In [80]:
vector.shape

(768,)

In [83]:
vector

array([-2.02203933e-02,  4.57696877e-02, -1.26637146e-02, -3.37991631e-03,
       -2.54910300e-03, -3.31096235e-03, -4.58366945e-02,  2.32390799e-02,
       -3.12585123e-02, -3.15588824e-02, -2.19783355e-02,  3.67821753e-03,
        3.39105981e-03, -1.88130587e-02, -2.65821870e-02, -1.45892315e-02,
        3.20066400e-02, -3.66493082e-03,  1.75410733e-02,  5.43209612e-02,
       -3.96661647e-02,  9.93637647e-03, -3.27205285e-02, -1.62909944e-02,
        6.36525359e-03, -1.24459760e-02,  1.04737049e-02,  3.08674313e-02,
        7.47453189e-03, -3.86319868e-02, -6.19924963e-02, -3.25308628e-02,
       -1.40268244e-02,  5.16586639e-02,  1.69651707e-06, -1.74092929e-04,
       -1.27881172e-03, -4.15052772e-02,  5.00347884e-03,  4.58600894e-02,
       -2.85337027e-02,  5.02370670e-02, -1.75183658e-02, -3.98830482e-04,
        1.63311344e-02,  5.89936748e-02, -2.21730098e-02, -4.00975086e-02,
       -6.60439581e-03, -2.97632981e-02, -3.18221636e-02, -1.77945811e-02,
       -1.54363075e-02, -

In [81]:
print(model.decode([768]))

AttributeError: 'SentenceTransformer' object has no attribute 'decode'

# Word Embeddings Beyond LLMs


In [84]:
import gensim.downloader as api

# Download embeddings (66MB, glove, trained on wikipedia, vector size: 50)
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-data
model = api.load("glove-wiki-gigaword-50")



In [90]:
model.most_similar([model['mercedes']], topn=11)

[('mercedes', 1.0000001192092896),
 ('benz', 0.8691064119338989),
 ('bmw', 0.8641209602355957),
 ('renault', 0.8392857909202576),
 ('honda', 0.8238522410392761),
 ('ferrari', 0.7953569889068604),
 ('sedan', 0.7913727164268494),
 ('toyota', 0.7825417518615723),
 ('suv', 0.7775290608406067),
 ('nissan', 0.768851637840271),
 ('ford', 0.7544342875480652)]
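
The scores above are cosine similarities between the query vector and every word vector in the vocabulary. Because the query here is the vector of `mercedes` itself rather than the word, the word appears in its own results with a score of about 1, which is why `topn=11` is used to get ten other words. The sketch below reproduces the ranking logic in plain Python. `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the two-dimensional vectors are made up, not real GloVe values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, vectors, topn=3):
    """Return the `topn` (word, score) pairs closest to `query`."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-d vectors for illustration only.
toy = {"car": [1.0, 0.1], "truck": [0.9, 0.2], "banana": [0.0, 1.0]}

print(rank_by_similarity(toy["car"], toy, topn=2))
```

As in the GloVe output, the query's own word ranks first, followed by the nearest neighbor.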

# Recommending songs by embeddings

In [91]:
import pandas as pd
from urllib import request

# Get the playlist dataset file
data = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/train.txt')

# Parse the playlist dataset file. Skip the first two lines as
# they only contain metadata
lines = data.read().decode("utf-8").split('\n')[2:]

# Remove playlists with only one song
playlists = [s.rstrip().split() for s in lines if len(s.split()) > 1]

# Load song metadata
songs_file = request.urlopen('https://storage.googleapis.com/maps-premium/dataset/yes_complete/song_hash.txt')
songs_file = songs_file.read().decode("utf-8").split('\n')
songs = [s.rstrip().split('\t') for s in songs_file]
songs_df = pd.DataFrame(data=songs, columns = ['id', 'title', 'artist'])
songs_df = songs_df.set_index('id')
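
To make the playlist parsing steps concrete, here is a small sketch that applies the same logic to an inline string instead of the downloaded file. The sample text is made up to mimic the layout described in the comments above (two metadata lines, then one playlist per line), and `parse_playlists` is a hypothetical helper name.

```python
def parse_playlists(raw_text, header_lines=2):
    """Split raw text into playlists, each a list of song-id strings."""
    # Skip the metadata lines at the top of the file
    lines = raw_text.split('\n')[header_lines:]
    # Keep only playlists with more than one song
    return [line.rstrip().split() for line in lines if len(line.split()) > 1]


# Made-up sample: two metadata lines, then playlists of 3, 1, 0, and 2 songs.
sample = "metadata line 1\nmetadata line 2\n0 1 2 \n7 \n\n3 4 \n"

print(parse_playlists(sample))
```

The single-song playlist and the empty lines are dropped, which matters later because a playlist with one song provides no neighboring songs to learn from.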

In [92]:
print( 'Playlist #1:\n ', playlists[0], '\n')
print( 'Playlist #2:\n ', playlists[1])

Playlist #1:
  ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '2', '42', '43', '44', '45', '46', '47', '48', '20', '49', '8', '50', '51', '52', '53', '54', '55', '56', '57', '25', '58', '59', '60', '61', '62', '3', '63', '64', '65', '66', '46', '47', '67', '2', '48', '68', '69', '70', '57', '50', '71', '72', '53', '73', '25', '74', '59', '20', '46', '75', '76', '77', '59', '20', '43'] 

Playlist #2:
  ['78', '79', '80', '3', '62', '81', '14', '82', '48', '83', '84', '17', '85', '86', '87', '88', '74', '89', '90', '91', '4', '73', '62', '92', '17', '53', '59', '93', '94', '51', '50', '27', '95', '48', '96', '97', '98', '99', '100', '57', '101', '102', '25', '103', '3', '104', '105', '106', '107', '47', '108', '109', '110', '111', '112', '113', '25', '63', '62', '114', '115', '84', '116', '117',

In [93]:
from gensim.models import Word2Vec

# Train our Word2Vec model
model = Word2Vec(
    playlists, vector_size=32, window=20, negative=50, min_count=1, workers=4
)
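
Each playlist is treated like a sentence and each song id like a word, so songs that appear near each other in playlists end up with similar vectors. The `window` parameter sets the maximum distance between a song and the neighbors it is associated with during training. The sketch below lists those neighbor pairs for a tiny playlist. It only illustrates the windowing idea and leaves out everything else gensim does during training, such as negative sampling. `context_pairs` is a hypothetical helper name.

```python
def context_pairs(playlist, window):
    """Yield (song, neighbor) pairs for songs at most `window` positions apart."""
    for i, song in enumerate(playlist):
        start = max(0, i - window)
        stop = min(len(playlist), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield song, playlist[j]


# First four song ids of Playlist #2 shown above
toy_playlist = ['78', '79', '80', '3']

print(list(context_pairs(toy_playlist, window=1)))
```

With `window=20`, as in the cell above, most songs in a short playlist count as neighbors of each other.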

In [94]:
song_id = 2172

# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))

[('3126', 0.9973883032798767),
 ('2976', 0.9967131018638611),
 ('2849', 0.9966433644294739),
 ('3094', 0.9959104061126709),
 ('3116', 0.9958090782165527),
 ('6641', 0.9953957796096802),
 ('5586', 0.9950486421585083),
 ('10105', 0.9946891069412231),
 ('3136', 0.9946587681770325),
 ('3119', 0.9946584105491638)]

In [95]:
print(songs_df.iloc[2172])

title     Fade To Black
artist        Metallica
Name: 2172 , dtype: object


In [96]:
import numpy as np

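def print_recommendations(song_id):
    # most_similar returns (song id, similarity) pairs; keep only the ids
    # and cast them to integers so they can be used as row positions
    similar_songs = np.array(
        model.wv.most_similar(positive=str(song_id), topn=5)
    )[:, 0].astype(int)
    return songs_df.iloc[similar_songs]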

# Extract recommendations
print_recommendations(2172)

| id | title | artist |
|---|---|---|
| 3126 | Heavy Metal | Sammy Hagar |
| 2976 | I Don't Know | Ozzy Osbourne |
| 2849 | Run To The Hills | Iron Maiden |
| 3094 | Breaking The Law | Judas Priest |
| 3116 | Communication Breakdown | Led Zeppelin |


In [105]:
print_recommendations(200)

| id | title | artist |
|---|---|---|
| 20108 | Ain't Thinkin Bout You (w\/ Chris Brown) | Bow Wow |
| 8 | Lay It Down | Lloyd |
| 5887 | Real Love | Mary J. Blige |
| 23778 | What Could Have Been | Ginuwine |
| 64 | Neighbors Know My Name | Trey Songz |


In [99]:
print_recommendations(0)

| id | title | artist |
|---|---|---|
| 1425 | The Boss (w\/ T-Pain) | Rick Ross |
| 94 | I Like (w\/ Ludacris) | Jeremih |
| 5900 | Peaches & Cream | 112 |
| 6719 | Good Life (w\/ T-Pain) | Kanye West |
| 27087 | Lose Control (w\/ Nelly) | Keri Hilson |


In [104]:
songs_df.sort_values(by=['artist'])

| id | title | artist |
|---|---|---|
| 75052 | Jump Back | !!! |
| 18747 | The Hammer | !!! |
| 15685 | AM\/FM | !!! |
| 15717 | Wannagain Wannagain | !!! |
| 41379 | Dear Can (Clean Edit) | !!! |
| ... | ... | ... |
| 61436 | Wooly Wolly Gong | tUnE-YaRdS |
| 50842 | Gangsta | tUnE-YaRdS |
| 34933 | Bizness | tUnE-YaRdS |
| 63690 | Adore | |


In [112]:
# Passing the song as a negative example returns the songs least similar to it
songs_df.iloc[np.array(
        model.wv.most_similar(negative=str(2172), topn=5)
    )[:, 0].astype(int)]

| id | title | artist |
|---|---|---|
| 31887 | #1 Crush | Garbage |
| 72667 | A Gift | Basia |
| 34693 | Sexual Healing | Max-A-Million |
| 41415 | Text Yuh | Rikki Jai |
| 66237 | Girl U Know | Buju Banton |
