# Tokenization

Task: Convert text to numbers; interpret subword tokenization.

There are many ways to convert text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subword tokens).

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

- [Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.
- [Model Card](https://github.com/openai/gpt-2/blob/master/model_card.md) for GPT-2.

The `transformers` library is pre-installed on many systems, but in case you need to install it, you can run the following cell.

In [1]:
# Uncomment the following line to install the transformers library
#!pip install -q transformers

In [2]:
import torch
from torch import tensor

### Download and load the model

This cell downloads the model and tokenizer, and loads them into memory.

In [3]:
# https://huggingface.co/docs/transformers/en/generation_strategies
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, set_seed
model_name = "openai-community/gpt2"
# Here are a few larger models you could try:
# model_name = "EleutherAI/pythia-1.4b-deduped"
# model_name = "google/gemma-2b"
# model_name = "google/gemma-2b-it"
# Note: you'll need to accept the license agreement on https://huggingface.co/google/gemma-7b to use Gemma models
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)

model = AutoModelForCausalLM.from_pretrained(model_name)
# Use the EOS token as the PAD token to avoid warnings during generation.
if model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = model.generation_config.eos_token_id
streamer = TextStreamer(tokenizer)
# Silence a warning.
tokenizer.decode([tokenizer.eos_token_id]);


In [4]:
token_to_id_dict = tokenizer.get_vocab()
print(f"The tokenizer has {len(token_to_id_dict)} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 124,439,808 parameters.


In [5]:
# warning: this assumes that there are no gaps in the token ids, which happens to be true for this tokenizer.
id_to_token = [token for token, id in sorted(token_to_id_dict.items(), key=lambda x: x[1])]
print(f"The first 10 tokens are: {id_to_token[:10]}")
print(f"The last 10 tokens are: {id_to_token[-10:]}")

The first 10 tokens are: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
The last 10 tokens are: ['Ġ(/', 'âĢ¦."', 'Compar', 'Ġamplification', 'ominated', 'Ġregress', 'ĠCollider', 'Ġinformants', 'Ġgazed', '<|endoftext|>']


## Demo

In [6]:
set_seed(0)
model.generate(
    **tokenizer("A list of colors: red, blue,", return_tensors="pt"),
    max_new_tokens=10, do_sample=True, temperature=0.3, penalty_alpha=.5, top_k=5, streamer=streamer);

 A list of colors: red, blue, green, yellow, orange, yellow, orange,


## Task

Consider the following phrase:

In [7]:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

### Getting familiar with tokens

1: Use `tokenizer.tokenize` to convert the phrase into a list of tokens. (What do you think the `Ġ` means?)

In [8]:
tokens = tokenizer.tokenize(phrase)
tokens

['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']

2: Use `tokenizer.convert_tokens_to_string` to convert the tokens back into a string.


In [9]:
tokenizer.convert_tokens_to_string(tokens)

' I visited Muskegon'

In [10]:
# for comparison:
''.join(tokens)

'ĠIĠvisitedĠMuskegon'

**What is the difference between the output from `convert_tokens_to_string` and the result of `''.join(tokens)`?**
<br>
<br>Result of `''.join(tokens)`: 'ĠIĠvisitedĠMuskegon'
<br>Result from `convert_tokens_to_string`: ' I visited Muskegon'

The difference is that `convert_tokens_to_string` turns each `Ġ` marker back into an ordinary space, while `''.join(tokens)` simply concatenates the raw tokens, so the `Ġ` stays in the output. The `Ġ` is not a plain letter G: it is a special character (a G with a dot above) that this tokenizer uses to mark a space at the start of a token.
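As a rough sketch of where `Ġ` comes from: GPT-2's byte-level tokenizer represents every byte as a printable character, and the space byte ends up as `Ġ`. The code below re-creates that byte-to-character table in plain Python, modeled on the `bytes_to_unicode` helper in the GPT-2 reference encoder. The `tokens_to_text` helper is a hypothetical name used only for illustration; it is not part of `transformers`.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every other byte (space, newline, control bytes, ...) is shifted
    # to an unused code point starting at 256.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def tokens_to_text(tokens):
    """Hypothetical helper: undo the byte-to-character mapping."""
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


print(repr(byte_to_char[ord(" ")]))  # 'Ġ'
print(repr(tokens_to_text(["ĠI", "Ġvisited", "ĠMus", "ke", "gon"])))  # ' I visited Muskegon'
```

The same table explains odd-looking vocabulary entries such as `âĢ¦`: they are multi-byte UTF-8 characters shown one byte at a time.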

3: Use `tokenizer.encode` to convert the original phrase into token ids. (*Note: this is equivalent to `tokenize` followed by `convert_tokens_to_ids`*.) Call the result `input_ids`.


In [11]:
input_ids = tokenizer.encode(phrase)
input_ids

[314, 8672, 2629, 365, 14520]
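To make the "tokenize, then look up ids" idea concrete, here is a minimal sketch that uses a tiny hand-built vocabulary instead of the real tokenizer. The five entries are copied from the outputs above; the dictionary and helper names (`toy_vocab`, `tokens_to_ids`, `ids_to_tokens`) are assumptions made for illustration, not the library's internals.

```python
# Tiny stand-in vocabulary: token string -> integer id (values taken from the outputs above).
toy_vocab = {"ĠI": 314, "Ġvisited": 8672, "ĠMus": 2629, "ke": 365, "gon": 14520}
# Inverse lookup: integer id -> token string.
toy_inverse = {token_id: token for token, token_id in toy_vocab.items()}


def tokens_to_ids(tokens):
    """Look up each token's id (the 'numbers' half of a tokenizer)."""
    return [toy_vocab[token] for token in tokens]


def ids_to_tokens(ids):
    """Map ids back to their token strings."""
    return [toy_inverse[token_id] for token_id in ids]


toy_tokens = ["ĠI", "Ġvisited", "ĠMus", "ke", "gon"]
toy_ids = tokens_to_ids(toy_tokens)
print(toy_ids)                 # [314, 8672, 2629, 365, 14520]
print(ids_to_tokens(toy_ids))  # ['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']
```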

4: Turn `input_ids` back into a readable string. Try this two ways: (1) using `tokenizer.decode` and (2) using `convert_ids_to_tokens` followed by `convert_tokens_to_string`. **The result of (1) should be the same as the result of (2).**

In [12]:
# using convert_ids_to_tokens
convert = tokenizer.convert_ids_to_tokens(input_ids)
tokenizer.convert_tokens_to_string(convert)

' I visited Muskegon'

In [13]:
# using tokenizer.decode
tokenizer.decode(input_ids)

' I visited Muskegon'

### Applying what you learned

5: Use `model.generate(input_ids_batch)` to generate a completion of this phrase. (Note that we needed to add `[]`s to give a "batch" dimension to the input, and convert the result to a PyTorch `tensor` for the model code to use it.) Call the result `output_ids`. This one is done for you.


In [14]:
input_ids_batch = tensor([input_ids])
output_ids = model.generate(input_ids_batch, max_new_tokens=20, do_sample=True, top_k=50)
output_ids

tensor([[  314,  8672,  2629,   365, 14520,   287,  9656,    13,   383,  3952,
           373,  1363,   284,  1811,  1957, 17245,    11,  1390,  9935,  1709,
          3266,   694,   290,  3941, 25732]])

6: Convert your `output_ids` into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use `output_ids[0]`.)

In [15]:
readable = tokenizer.convert_ids_to_tokens(output_ids[0])
tokenizer.convert_tokens_to_string(readable)

' I visited Muskegon in 1993. The park was home to several local musicians, including Dave Brubeck and Bill Kre'

Note: `generate` uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

- Turn on `do_sample=True`. Run it a few times to see what it gives.
- Set `top_k=5`. Or 50.

**When I turned on `do_sample=True`, I started to get results like:**<br>
- " I visited Muskegon two years ago, and had a wonderful experience with Mr. O'Brien. He had written a"
- ' I visited Muskegon in 1993. The park was home to several local musicians, including Dave Brubeck and Bill Kre'
- " I visited Muskegon on New Year's Eve. This was a strange place. We got up before sunrise and all the"

**When I set top_k=5, I started to get results like:**<br>
- ' I visited Muskegon. I was in my 20s and was in the process of finishing a degree in English. I'
- ' I visited Muskegon, the largest of the two rivers in North America, and was told that the water was "so'
- ' I visited Muskegon. I was there to visit the city of Muskegon, where the city of Muskegon'

**When I set top_k=50, I started to get results like:**<br>
- " I visited Muskegon High School on August 17, 2017.\n\n'Our goal was to learn about the community from"
- ' I visited Muskegon in February to observe the wildlife in action, it was clear the situation was different.\n\nOn'
- ' I visited Muskegon. I remember saying to the people there, \'Well, what are they gonna do?" and said'
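To see why a small `top_k` tends to give safer, more repetitive text, here is a minimal sketch of top-k sampling over a toy next-token distribution. The probabilities are made up, and `sample_top_k` is a hypothetical helper; this illustrates the idea only and is not how `generate` is implemented internally.

```python
import random

# Made-up probabilities for the token that follows " I visited Muskegon".
TOY_PROBS = {" in": 0.40, ".": 0.30, ",": 0.15, " on": 0.10, " High": 0.05}


def sample_top_k(probs, k, rng):
    """Keep the k most probable tokens, then sample among them by weight."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [weight for _, weight in top]
    # random.choices renormalizes the weights, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
for k in (1, 2, 5):
    draws = [sample_top_k(TOY_PROBS, k, rng) for _ in range(8)]
    print(f"top_k={k}: {draws}")
```

With `k=1` the sketch always picks the single most likely token, which matches greedy decoding; larger `k` lets rarer tokens through.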

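7: What is the largest possible token id for this tokenizer? What token does it correspond to?

In [20]:
# The largest possible id comes from the vocabulary, not from one generated sequence.
max_token_id = max(token_to_id_dict.values())
print(max_token_id, id_to_token[max_token_id])

50256 <|endoftext|>


The largest possible token id for this tokenizer is 50,256, which corresponds to the `<|endoftext|>` token. (Taking `output_ids.max()` only gives the largest id that happened to appear in one generated sequence, 25,732 in the run above.)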

## Analysis

**Q1: Write a brief explanation of what a tokenizer does.** Note that we worked with two parts of a tokenizer in this exercise (one that deals only with strings, and another that deals with numbers); make sure your explanation addresses both parts.

A tokenizer breaks text into small pieces called tokens and maps each token to a number. It does this in two parts.

**Strings**: The tokenizer splits a piece of text, such as a sentence or phrase, into individual tokens. For this tokenizer the tokens are whole words or parts of words (for example, "Muskegon" became `ĠMus`, `ke`, `gon`).

**Numbers**: The tokenizer then maps each token to a unique integer id from its vocabulary, and it can map ids back to tokens. This lets the model process the text numerically.

**Q2: Suppose a language model has learned to spell, e.g., after the prefix "The word dog is spelled d o g". Will it then already know how to spell any other word, or does it have to re-learn spelling for each word? Why or why not? (For example, try tokenizing the phrase "The word walking is spelled: w a" and then asking the LM to complete it.)**

No, I don't think a language model will automatically know how to spell other words just because it can spell one. It has to learn the spelling of each word individually, because it doesn't see words the way humans do: it sees token ids rather than letters, so nothing in the input tells it which letters make up a given token.

**Q3: Suppose you made a typo in your input. Explain what the tokenizer we used in this notebook will do with the new input. (Go back and try an input with a typo to see what happens.)**

If I make a typo in my input, the tokenizer still produces tokens, since it can always fall back on smaller pieces; it does not reject the input or correct it. The misspelled word is typically split into more, shorter subword tokens than the correct spelling. The model then does its best with that unfamiliar token sequence, but it can get confused if the mistake changes the text too much.
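The effect of a typo can be sketched with a toy subword tokenizer. Note the assumptions: the vocabulary below is invented, and the splitting rule is a simple greedy longest-match with a single-character fallback, which is a simplified stand-in for the byte-pair-encoding merges that the GPT-2 tokenizer actually uses. `toy_tokenize` is a hypothetical helper.

```python
# Invented vocabulary of known subword pieces (not GPT-2's real vocabulary).
TOY_VOCAB = {"visit", "vis", "it", "ed"}


def toy_tokenize(word, vocab):
    """Greedy longest-match split; unknown text falls back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in vocab or end == start + 1:
                pieces.append(candidate)
                start = end
                break
    return pieces


print(toy_tokenize("visited", TOY_VOCAB))  # ['visit', 'ed']
print(toy_tokenize("visitde", TOY_VOCAB))  # ['visit', 'd', 'e']
```

The misspelled word is still tokenized, but into more and smaller pieces, so the model receives a sequence it has seen less often.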