## Basic coding for AI systems

We interact with AI models through text, but internally the models operate on "tokens", the basic units of processing. Real tokenizers often split text into sub-word pieces; here we'll keep things simple and treat whole words and punctuation marks as tokens. Let's see how to build a tokenizer from a large body of text, using the collected works of Jane Austen. The tokenizer code is taken from the examples provided by Lightning AI.
  
We'll ignore warnings as we run this notebook.

In [1]:
import warnings
warnings.filterwarnings('ignore')

We define a class that provides `encode` and `decode` methods, which turn text into token IDs and token IDs back into text. We aren't calling these methods yet; for now we only set up the class, which takes the dictionary `vocab` it needs in order to work. We'll build that dictionary from janes.txt, which contains the consolidated works of Jane Austen.

In [2]:
import re

class Tokenizer:
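  def __init__(self,vocab):
    # vocab maps each token string to its integer ID
    self.str_to_int = vocab
    # reverse mapping from ID to token string, used by decode
    self.int_to_str = {i:s for s,i in vocab.items()}
  def encode(self,text):
    # split on punctuation, double dashes and whitespace, keeping the separators
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)
    # drop empty and whitespace-only pieces
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    # raises KeyError for any token that is not in the vocabulary
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids
  def decode(self,ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # remove the space that join() placed before punctuation
    text = re.sub(r'\s+([,.?!"()\'])',r'\1',text)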
    return text
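
To make the splitting step in `encode` more concrete, here is a minimal sketch that applies the same regular expression to a short sentence. The sample sentence and the names `SPLIT_PATTERN`, `sample` and `tokens` are made up for illustration and are not part of the original notebook.

```python
import re

# Same split pattern as in the Tokenizer class; the capturing group
# makes re.split keep the separators in the result.
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

# Illustrative sentence, not taken from janes.txt.
sample = "Hello, world. Is this-- a test?"

pieces = re.split(SPLIT_PATTERN, sample)
# Drop empty strings and whitespace-only pieces, as encode does.
tokens = [p.strip() for p in pieces if p.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Punctuation marks and the double dash come out as tokens of their own, while the whitespace between words is discarded.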



## Building a dictionary
Now that the tokenizer methods are defined, we'll read the text of Jane Austen's works and build a dictionary that assigns an integer token ID to each unique word or punctuation mark. Once that's done, we'll display the size of the dictionary.

In [4]:
with open("janes.txt","r",encoding="utf-8") as f:
  raw_text=f.read()

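# split on punctuation, double dashes and whitespace, keeping the separators
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',raw_text)
# drop empty strings only; whitespace pieces remain and also get vocabulary entries
preprocessed = [item for item in preprocessed if item]
# sort the unique tokens so that ID assignment is reproducible
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
# map each token to its position in the sorted list
vocab={token:integer for integer,token in enumerate(all_words)}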
print("Vocabulary size: ",vocab_size)


Vocabulary size:  19151
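
The same procedure is easier to follow on a tiny input. The sketch below builds a vocabulary from a made-up sentence (`toy_text` and `toy_vocab` are hypothetical names). Unlike the cell above, it strips whitespace pieces before building the dictionary, so only words and punctuation receive IDs.

```python
import re

# Made-up text standing in for janes.txt.
toy_text = "the cat sat. the dog sat, too."

pieces = re.split(r'([,.:;?_!"()\']|--|\s)', toy_text)
# Assumption for this sketch: discard whitespace-only pieces as well as empty ones.
tokens = [p.strip() for p in pieces if p.strip()]

# Sort the unique tokens, then number them in order.
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

print("Vocabulary size: ", len(toy_vocab))
print(toy_vocab)
# {',': 0, '.': 1, 'cat': 2, 'dog': 3, 'sat': 4, 'the': 5, 'too': 6}
```

Because the tokens are sorted before numbering, punctuation ends up with the lowest IDs and repeated words such as "the" and "sat" share a single entry.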


## Tokenizing and recovering text

Now that we have a dictionary, we can use it to turn text into a stream of token IDs and to recover the text from those IDs.

In [5]:
# ...following on
tokenizer=Tokenizer(vocab)

text="It is a truth universally acknowledged"

ids=tokenizer.encode(text)
print("Token stream: ",ids)

print("Recovery:     ",tokenizer.decode(ids))


Token stream:  [1267, 9884, 2492, 16409, 16744, 2645]
Recovery:      It is a truth universally acknowledged
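
The round trip works here because every word in the phrase occurs in the vocabulary. The `encode` method looks each token up directly, so a word that is missing from the dictionary would raise a `KeyError`. One common workaround, which is not part of the original code, is to reserve a placeholder token for unknown words. The sketch below assumes a hypothetical `<|unk|>` entry and a hypothetical helper `encode_with_unk`, and uses a toy vocabulary rather than the Austen one.

```python
import re

UNK = "<|unk|>"  # hypothetical placeholder token, an assumption of this sketch

def encode_with_unk(text, str_to_int):
    """Encode text, mapping out-of-vocabulary tokens to the UNK id."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    # dict.get falls back to the UNK id instead of raising KeyError
    return [str_to_int.get(tok, str_to_int[UNK]) for tok in tokens]

# Toy vocabulary for illustration only.
toy_vocab = {",": 0, "a": 1, "is": 2, "truth": 3, UNK: 4}

ids = encode_with_unk("a truth is a spaceship", toy_vocab)
print(ids)  # [1, 3, 2, 1, 4]
```

The trade-off is that decoding such a stream yields the placeholder rather than the original word, so the round trip is no longer lossless.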


## Embeddings
Embeddings are vectors that capture the semantic content of text. They are produced by embedding models trained on vast amounts of text, which tokenize their input and learn how tokens relate to one another. To demonstrate this, we'll use the sentence_transformers library with the all-MiniLM-L6-v2 model and embed the phrase from the previous step.

In [6]:
%pip install -q sentence_transformers
%pip install -q tf_keras

In [7]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

embeddings = model.encode([text])
print("Embedding structure size: ",embeddings.shape)


Embedding structure size:  (1, 384)


We can dump out the full embedding structure to see what it looks like.

In [8]:
print("Full embedding structure:")
print(embeddings)

Full embedding structure:
[[-1.96266267e-02  4.27056402e-02 -4.58967388e-02  3.39551903e-02
  -6.53073099e-03  6.21899553e-02 -7.37903873e-03 -5.07542454e-02
  -1.18517559e-02 -1.23127495e-04 -1.62150580e-02 -2.55365856e-02
   5.89982048e-03  4.37062085e-02 -3.97319272e-02 -6.48405701e-02
   5.55757359e-02 -2.76492294e-02  1.01935875e-04 -7.99029619e-02
  -3.20597328e-02  8.34374204e-02  9.98163819e-02  6.28955476e-03
   1.47306789e-02 -1.77241284e-02  6.25883192e-02 -3.90182100e-02
   2.86961030e-02 -5.97629398e-02 -9.57401767e-02  5.13124354e-02
   7.17045739e-02 -3.83142233e-02 -1.07453605e-02  5.29964082e-02
   2.79421005e-02 -2.08029486e-02  6.44168705e-02 -3.09844054e-02
  -4.46723066e-02 -6.71222135e-02 -1.06379399e-02 -9.90883820e-03
  -7.84582086e-03  9.46811289e-02  7.71141751e-03  8.55962634e-02
  -2.82240286e-02  3.10915103e-03 -4.39242506e-03  2.29647458e-02
   2.53019221e-02 -3.27661373e-02 -1.01104276e-02  2.02602558e-02
  -3.55585106e-02 -7.63279852e-03  1.72803123e-02 

Great - we've successfully embedded a prompt.
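
A common way to use embeddings is to compare two of them with cosine similarity, which measures how closely the vectors point in the same direction. The sketch below uses only the standard library and made-up three-dimensional vectors (`v1`, `v2`, `v3`) standing in for the 384-dimensional vectors the model returns; the helper `cosine_similarity` is written here for illustration and is not taken from the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, not real model output.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [-3.0, 0.0, 1.0]  # perpendicular to v1

sim_parallel = cosine_similarity(v1, v2)
sim_orthogonal = cosine_similarity(v1, v3)

print(sim_parallel)    # approximately 1.0
print(sim_orthogonal)  # 0.0
```

With real embeddings, phrases of similar meaning would be expected to score closer to 1 than unrelated phrases.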