## Key topics:

**Tokens**: the basic units of text or code that LLMs read and generate.

**Tokenization**: the process of splitting input and output text into tokens.

**Vocabulary size**: the number of distinct tokens a model's tokenizer can produce, which varies among GPT models.

**Tokenization cost**: the number of tokens in a request affects the memory and compute a model needs, which in turn influences the cost and performance of running an Azure OpenAI model.

In [5]:
import os
import openai
from dotenv import load_dotenv

# Set up Azure OpenAI
load_dotenv("credentials.env")

openai.api_type = "azure"
# The API base is the "Endpoint" shown in the Azure portal for your Azure OpenAI resource.
# It looks like https://xxxxxx.openai.azure.com/
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_version = "2023-03-15-preview"
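
Before making any calls, it can help to confirm that `credentials.env` actually supplied both settings. The following is a minimal sketch, assuming the two variable names used in the setup cell; `missing_settings` is a hypothetical helper written for this notebook, not part of the `openai` package.

```python
import os

# Names of the environment variables read in the setup cell above.
REQUIRED_SETTINGS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")


def missing_settings(env=None, names=REQUIRED_SETTINGS):
    """Return the names in `names` that are unset or empty in `env`.

    `env` defaults to os.environ; pass a plain dict to try the helper out.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```

An empty value is treated the same as a missing one, since an empty endpoint or key would fail at request time anyway.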

In [None]:
# Using Azure Key Vault for storing AOAI secrets
"""from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient
from azureml.core import Workspace

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

try:
    ml_client = MLClient.from_config(credential=credential, path="workspace.json")
except Exception as ex:
    raise Exception(
        "Failed to create MLClient from config file. Please modify and then run the above cell with your AzureML Workspace details."
    ) from ex
ws = Workspace(
    subscription_id=ml_client.subscription_id,
    resource_group=ml_client.resource_group_name,
    workspace_name=ml_client.workspace_name,
)
keyvault = ws.get_default_keyvault()

aoai_endpoint=keyvault.get_secret(name="aoai-endpoint")
aoai_key=keyvault.get_secret(name="key")
"""

In [2]:
# The GPT family of models processes text using tokens, which are common sequences of
# characters found in text. The models learn the statistical relationships between these
# tokens and excel at producing the next token in a sequence.
import os
import openai
from transformers import GPT2TokenizerFast


# GPT2TokenizerFast tokenizes text with the GPT-2 vocabulary. It is based on byte-level
# Byte-Pair Encoding and can encode or decode text quickly.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "The road to creating new medicines and vaccines has traditionally been long and winding!"

tokens = tokenizer(prompt)
print('Total number of tokens:', len(tokens['input_ids']))
print('Tokens : ', [tokenizer.decode(t) for t in tokens['input_ids']])
print("Tokens' numerical values:", tokens['input_ids'])

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Total number of tokens: 15
Tokens :  ['The', ' road', ' to', ' creating', ' new', ' medicines', ' and', ' vaccines', ' has', ' traditionally', ' been', ' long', ' and', ' winding', '!']
Tokens' numerical values: [464, 2975, 284, 4441, 649, 23533, 290, 18336, 468, 16083, 587, 890, 290, 28967, 0]


In [32]:
# pip install tiktoken  # The open-source tiktoken package can be installed from PyPI

Note: you may need to restart the kernel to use updated packages.


In [3]:
import tiktoken 

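# cl100k_base is the encoding used by gpt-35-turbo and GPT-4.
cl100k_base = tiktoken.get_encoding("cl100k_base")

# Build an encoding that extends cl100k_base with the chat special tokens.
enc = tiktoken.Encoding(
    name="gpt-35-turbo",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)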

tokens = enc.encode( 
    "The road to creating new medicines and vaccines has traditionally been long and winding!"
) 

print('Total number of tokens:', len(tokens))
print('Tokens : ', [enc.decode([t]) for t in tokens])
print("Tokens' numerical values:", tokens)

#https://platform.openai.com/tokenizer

Total number of tokens: 15
Tokens :  ['The', ' road', ' to', ' creating', ' new', ' medicines', ' and', ' vaccines', ' has', ' traditionally', ' been', ' long', ' and', ' winding', '!']
Tokens' numerical values: [791, 5754, 311, 6968, 502, 39653, 323, 40300, 706, 36342, 1027, 1317, 323, 54826, 0]
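
For this sentence the GPT-2 tokenizer and `cl100k_base` happen to produce the same 15 pieces, yet almost every numerical ID differs. The sketch below simply compares the two ID lists copied from the outputs above, to illustrate that token IDs are specific to an encoding and cannot be reused across models.

```python
# Token IDs copied from the two cell outputs above.
gpt2_ids = [464, 2975, 284, 4441, 649, 23533, 290, 18336, 468, 16083, 587, 890, 290, 28967, 0]
cl100k_ids = [791, 5754, 311, 6968, 502, 39653, 323, 40300, 706, 36342, 1027, 1317, 323, 54826, 0]

same_length = len(gpt2_ids) == len(cl100k_ids)

# Positions where both encodings happen to assign the same ID.
shared_positions = [i for i, (a, b) in enumerate(zip(gpt2_ids, cl100k_ids)) if a == b]

print("Same number of tokens:", same_length)
print("Positions with identical IDs:", shared_positions)
```

Only the final token, `!`, shares an ID (0) in both encodings; every other position maps to a different number.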


In [9]:
# `engine` is the name of your Azure OpenAI deployment; `n=2` requests two completions.
response = openai.Completion.create(
    engine="gpt-35-turbo",
    prompt=prompt,
    max_tokens=60,
    n=2
)

# Show 2 returned results

In [10]:
print('='*30, 'ANSWER #1', '='*30)
print(response['choices'][0]['text'])
print('='*30, 'ANSWER #2', '='*30)
print(response['choices'][1]['text'])


 Even considering the advances made by modern technologies, COVID-19 has posed new hurdles that are being hastily addressed to contain a pandemic that has besieged people across the world. Data privacy and building trust with test subjects are just two of the many challenges to overcome.

But there is another challenge that must be
 But you can speed up the process of turning medical bench results into products in just one day. A game-themed workshop led by Ksenia Opaleva, a project manager at R-Pharm (a leading diversified pharmaceutical holding in Russia), will introduce you to the principles of clinical trials design and


# Usage

In [11]:
response

<OpenAIObject text_completion id=cmpl-7nqNr8N2MSvdKEU1k0V2KnmIceoBm at 0x7fb6ec22b970> JSON: {
  "id": "cmpl-7nqNr8N2MSvdKEU1k0V2KnmIceoBm",
  "object": "text_completion",
  "created": 1692113791,
  "model": "gpt-35-turbo",
  "choices": [
    {
      "text": " Even considering the advances made by modern technologies, COVID-19 has posed new hurdles that are being hastily addressed to contain a pandemic that has besieged people across the world. Data privacy and building trust with test subjects are just two of the many challenges to overcome.\n\nBut there is another challenge that must be",
      "index": 0,
      "finish_reason": "length",
      "logprobs": null
    },
    {
      "text": " But you can speed up the process of turning medical bench results into products in just one day. A game-themed workshop led by Ksenia Opaleva, a project manager at R-Pharm (a leading diversified pharmaceutical holding in Russia), will introduce you to the principles of clinical trials design and",


In [12]:
response['usage']

<OpenAIObject at 0x7f2385199c10> JSON: {
  "completion_tokens": 120,
  "prompt_tokens": 15,
  "total_tokens": 135
}
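
The `usage` figures can be checked by hand against the request. The prompt is counted once (15 tokens, matching the tokenizer output earlier), while the completion tokens add up across both choices. A small sketch of that arithmetic, using only values copied from the cells above:

```python
# Values copied from the request and from the `usage` output above.
usage = {"completion_tokens": 120, "prompt_tokens": 15, "total_tokens": 135}
n_choices = 2     # n=2 in the request
max_tokens = 60   # max_tokens=60 in the request

# total_tokens is the sum of prompt and completion tokens.
total_matches = usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

# Upper bound on completion tokens: every choice runs all the way to max_tokens.
completion_upper_bound = n_choices * max_tokens
hit_upper_bound = usage["completion_tokens"] == completion_upper_bound

print("total = prompt + completion:", total_matches)
print("completion upper bound:", completion_upper_bound)
print("all choices used their full budget:", hit_upper_bound)
```

Here the 120 completion tokens equal 2 x 60, which is consistent with the `finish_reason` of `length` shown for the first choice: the answers were cut off at `max_tokens` rather than ending naturally.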

Azure OpenAI uses a subword tokenization method called Byte-Pair Encoding (BPE) for its GPT-based models. **BPE repeatedly merges the most frequently occurring pair of adjacent characters or bytes into a single token** until a target vocabulary size is reached. BPE helps the model handle rare or unseen words and produces more compact, consistent representations of text. Because any string can be built from existing tokens, the model can also generate new words by combining them.
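
The merge step can be illustrated with a toy trainer. This is a simplified, character-level sketch for intuition only: production tokenizers such as `cl100k_base` operate on bytes and ship with pretrained merge tables. The function names and the tiny corpus are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print("Learned merges:", merges)
print("Segmented words:", dict(words))
```

With two merges the toy corpus first joins `l` and `o` into `lo`, then `lo` and `w` into `low`, so the frequent word `low` becomes a single token while `lower` and `lowest` are split into `low` plus leftover characters.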

Tokenization differs between models mainly in the encoding, and therefore the vocabulary, that each model uses. The original GPT-3 base models (Ada, Babbage, Curie, and Davinci) share the GPT-2 byte-level BPE vocabulary of roughly 50,000 tokens (`r50k_base` in tiktoken), whereas gpt-35-turbo and GPT-4 use `cl100k_base`, with roughly 100,000 tokens. As the two tokenizer outputs above show, the same text maps to different token IDs, and for other inputs possibly a different token count, depending on the encoding.

**Pricing**: for example, Davinci costs $0.06 per 1,000 tokens, while Ada costs $0.0008 per 1,000 tokens.
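
Since billing is per 1,000 tokens, the cost of a call is roughly `total_tokens / 1000 * rate`. The sketch below applies the two example rates quoted above to the 135 tokens reported in the `usage` output, purely for illustration: that call used gpt-35-turbo, and `estimate_cost` is a hypothetical helper, not an Azure API.

```python
def estimate_cost(total_tokens, price_per_1k_tokens):
    """Estimated charge in dollars for `total_tokens` at a per-1,000-token rate."""
    return total_tokens / 1000 * price_per_1k_tokens


# Example rates quoted above, in dollars per 1,000 tokens.
RATES = {"Davinci": 0.06, "Ada": 0.0008}

# total_tokens from the `usage` output earlier in this notebook.
total_tokens = 135

for model, rate in RATES.items():
    print(f"{model}: ${estimate_cost(total_tokens, rate):.6f}")
```

At these rates the same 135 tokens would cost about $0.0081 on Davinci and about $0.000108 on Ada, a 75-fold difference that grows linearly with token volume.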

https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens