# Tokenization

In this notebook we demonstrate *tokenization*: the process of splitting raw text into smaller pieces called (drum-roll please) *tokens*. Tokens can be individual characters, words, or sentences.

Examples of character and word tokenization are shown for the following raw text.

```Show me the money```

Character tokenization:

```['S', 'h', 'o', 'w', 'm', 'e', 't', 'h', 'e', 'm', 'o', 'n', 'e', 'y']```

Word tokenization:

```['Show', 'me', 'the', 'money']```

We can achieve character and word tokenization very easily in Python.


In [1]:
# Character and word tokenization

sentence = "Show me the money"
word_tokens = sentence.split()
print(word_tokens)
character_tokens = [char for char in sentence if char != ' ']
print(character_tokens)

['Show', 'me', 'the', 'money']
['S', 'h', 'o', 'w', 'm', 'e', 't', 'h', 'e', 'm', 'o', 'n', 'e', 'y']


Different tokenization methods have different advantages and disadvantages. The strategies shown above are very simple; more advanced ones exist. Whatever the method, the goal is for the tokens to preserve as much of the text's meaning as possible. Character-based tokenization certainly loses meaning, especially for a language like English, where (unlike in character-based writing systems) a single letter carries little meaning on its own.
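To make the loss of meaning concrete, here is a small sketch in plain Python (not part of the original notebook) that checks which character tokens are shared between different words of the example sentence. The names `char_to_words` and `shared` are only illustrative.

```python
from collections import defaultdict

sentence = "Show me the money"
words = sentence.split()

# Map each character token to the set of words it appears in.
char_to_words = defaultdict(set)
for word in words:
    for char in word:
        char_to_words[char].add(word)

# A character shared by several words says little about any one of them.
shared = {c: sorted(ws) for c, ws in char_to_words.items() if len(ws) > 1}
print(shared)

# Character tokenization also produces a much longer sequence.
n_chars = sum(len(w) for w in words)
print(len(words), "word tokens vs", n_chars, "character tokens")
```

The token `'e'`, for instance, occurs in three of the four words, so on its own it tells us very little about which word it came from.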

We now demonstrate how to tokenize using the [transformers](https://huggingface.co/docs/transformers/en/index) package from Hugging Face.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", 
                                          cache_dir="/projectnb/scottml/hf_cache", 
                                          clean_up_tokenization_spaces=True)
tokens = tokenizer.tokenize(sentence)
print(tokens)

# Try a more advanced sentence
sentence2 = "Let's try to see if we can get this transformer to tokenize."
tokens2 = tokenizer.tokenize(sentence2)
print(tokens2)

['Show', 'me', 'the', 'money']
['Let', "'", 's', 'try', 'to', 'see', 'if', 'we', 'can', 'get', 'this', 'transform', '##er', 'to', 'token', '##ize', '.']
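Notice that `transformer` and `tokenize` were split into pieces, with `##` marking a piece that continues a word. BERT tokenizers use a subword scheme called WordPiece, which splits each word greedily into the longest pieces found in the vocabulary. The sketch below illustrates that idea on a tiny, made-up vocabulary. It is a simplified illustration, not the library's actual implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the real tokenizer also handles punctuation, casing, and other details.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-cased vocabulary.
toy_vocab = {"transform", "##er", "token", "##ize", "to"}

print(wordpiece_tokenize("transformer", toy_vocab))
print(wordpiece_tokenize("tokenize", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because the longest match is tried first, `tokenize` starts with `token` rather than the shorter `to`, even though both are in the toy vocabulary.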


Each token is associated with a unique token ID. The *vocabulary* is the collection of all the unique tokens that a model can recognize and process, and the *vocabulary size* is the number of tokens in that collection.

In [3]:
tokens2_ids = tokenizer.convert_tokens_to_ids(tokens2)
print("Length of tokens2", len(tokens2))
print("Length of tokens2_ids", len(tokens2_ids))
print(tokens2_ids)

Length of tokens2 17
Length of tokens2_ids 17
[2421, 112, 188, 2222, 1106, 1267, 1191, 1195, 1169, 1243, 1142, 11303, 1200, 1106, 22559, 3708, 119]
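Conceptually, the mapping between tokens and IDs is just a lookup table. The following sketch builds one by hand for a made-up five-token vocabulary; the IDs here are illustrative and unrelated to the `bert-base-cased` IDs printed above, and `to_ids` is a hypothetical helper.

```python
# Hypothetical toy vocabulary: each unique token gets one integer ID.
toy_tokens = ["[UNK]", "Show", "me", "the", "money"]
token_to_id = {token: idx for idx, token in enumerate(toy_tokens)}
id_to_token = {idx: token for token, idx in token_to_id.items()}

vocabulary_size = len(token_to_id)


def to_ids(tokens):
    # Tokens outside the vocabulary fall back to the unknown-token ID.
    return [token_to_id.get(t, token_to_id["[UNK]"]) for t in tokens]


ids = to_ids(["Show", "me", "the", "honey"])
print(vocabulary_size)
print(ids)
print([id_to_token[i] for i in ids])
```

The word `honey` is not in the toy vocabulary, so it maps to the ID of `[UNK]` and cannot be recovered when converting back to tokens.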


The tokens (and token IDs) alone hold no semantic information. There are different ways to *encode* this information numerically. A common approach is to create word embeddings, which are the subject of a separate notebook. In short, each token ID is transformed from an index value into a vector in a high-dimensional space.
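As a rough preview of that idea, an embedding can be viewed as a table with one row per token ID, so that converting an ID into a vector is a row lookup. The sketch below uses only the standard library, and random numbers stand in for weights that a real model would learn. The sizes and the name `embedding_table` are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)
vocabulary_size = 5  # hypothetical toy vocabulary
embedding_dim = 4    # hypothetical; real models use far more dimensions

# One row per token ID; random values stand in for learned weights.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocabulary_size)
]

# Looking up an embedding is just indexing the table by token ID.
token_ids = [1, 2, 3, 4]
vectors = [embedding_table[i] for i in token_ids]
print(len(vectors), len(vectors[0]))
```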