## Key topics:

**Tokens**: Tokens are the units of text that Azure OpenAI models process. Each token is a whole word or a chunk of characters, and is represented by an integer ID. For English text, 1 token is approximately 4 characters or 0.75 words.
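The rule of thumb above can be turned into a quick estimator. This is only a rough sketch: the function names and constants are illustrative (not part of any library), and the ratios hold approximately for English text only. For exact counts, use a real tokenizer such as `tiktoken`.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb: ~4 characters per token (English)
WORDS_PER_TOKEN = 0.75   # rule of thumb: ~0.75 words per token (English)

def estimate_tokens_from_chars(text: str) -> int:
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_tokens_from_words(text: str) -> int:
    """Rough token estimate based on whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

sample = "Mortgages from Nationwide"
print("Estimate from characters:", estimate_tokens_from_chars(sample))
print("Estimate from words:", estimate_tokens_from_words(sample))
```

For this 25-character, 3-word sample the two estimates are 7 and 4, so short strings can deviate noticeably from the true count; the heuristic is more useful for budgeting long prompts.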

**Tokenization**: splitting input and output text into tokens, the smaller units that an LLM operates on.

**Vocabulary size**: the number of distinct tokens a model's tokenizer defines, which varies among GPT models.

In [6]:
import tiktoken  # Python library for tokenizing text

cl100k_base = tiktoken.get_encoding("cl100k_base")  # pretrained tokenizer

# Extend cl100k_base with the chat special tokens
enc = tiktoken.Encoding(
    name="gpt-35-turbo",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265
    }
)

tokens = enc.encode(
    "Mortgages from Nationwide" #tbc
)

print('Total number of tokens:', len(tokens))
print('Tokens : ', [enc.decode([t]) for t in tokens])  # decode each ID back to its text piece
print("Tokens' numerical values:", tokens)

# Interactive tokenizer: https://platform.openai.com/tokenizer

Total number of tokens: 5
Tokens :  ['M', 'ort', 'gages', ' from', ' Nationwide']
Tokens' numerical values: [44, 371, 56144, 505, 90754]


In [2]:
# Returns the number of tokens used by a string
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [7]:
num_tokens_from_string("Mortgages from Nationwide")#tbc

5

In [1]:
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

# Set up Azure OpenAI: load the API key and endpoint from a local env file
load_dotenv("credentials.env")

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-01",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

# This corresponds to the custom name you chose for your deployment when you deployed a model.
deployment_name = 'gpt-35-turbo'
 

In [23]:
#pip install --upgrade openai

Collecting openai
  Downloading openai-1.41.1-py3-none-any.whl (362 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (318 kB)
Collecting typing-extensions<5,>=4.11 (from openai)
  Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, jiter, openai
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.8.0
    Uninstalling typing_extensions-4.8.0:
      Successfully uninstalled typing_extensions-4.8.0
  Attempting uninstall: openai
    Found existing installation: openai 1.30.1
    Uninstalling openai-1.30.1:
      Successfully uninstalled openai-1.30.1
ERROR: pip's d

In [24]:
#import openai
#print(openai.__version__)

1.30.1


In [2]:
# Send a completion call to generate an answer
print('Sending a test completion job')
start_phrase = "Nationwide is a " #tbc
response = client.completions.create(
    model=deployment_name, 
    prompt=start_phrase, 
    max_tokens=1000)
print(response.choices[0].text)

Sending a test completion job
44-year-old company that is among the top personal auto and homeowners insurance companies in America as measured by premiums written. Its market position has gone from being principally a direct writer in Ohio to one of a leading multi-line insurance companies, with products ranging from pet insurance to term life insurance.

Over time, Nationwide learned that to succeed it needed to establish long-term, profitable relationships with its customers based on a win-win proposition. “What our customer needed and wanted and what we were prepared to offer had to be in complete alignment,” said Steven Schreibman, SVP and chief marketing officer.

The Great Recession taught Nationwide that people needed help in managing their finances while maintaining adequate protection. The company responded with new product offerings such as a suite of offerings called Nationwide My Pet Protection, which includes insurance for pets and 24/7 access to a veterinary helpline.

N



# Usage

In [3]:
response

Completion(id='cmpl-9yKRjH7HGjaa16v17eEYrvRGzJUq3', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text='44-year-old company that is among the top personal auto and homeowners insurance companies in America as measured by premiums written. Its market position has gone from being principally a direct writer in Ohio to one of a leading multi-line insurance companies, with products ranging from pet insurance to term life insurance.\n\nOver time, Nationwide learned that to succeed it needed to establish long-term, profitable relationships with its customers based on a win-win proposition. “What our customer needed and wanted and what we were prepared to offer had to be in complete alignment,” said Steven Schreibman, SVP and chief marketing officer.\n\nThe Great Recession taught Nationwide that people needed help in managing their finances while maintaining adequate protection. The company responded with new product offerings such as a suite of offerings called Na

In [4]:
response.usage

CompletionUsage(completion_tokens=1000, prompt_tokens=5, total_tokens=1005)
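The usage figures above can be read as follows: the 5 prompt tokens match the tokenizer count for a short prompt, and the completion used the full `max_tokens=1000` budget, which is consistent with `finish_reason='length'` in the response. A minimal sketch of that check, using plain integers copied from the output (the helper `summarize_usage` is hypothetical, not part of the `openai` library):

```python
def summarize_usage(prompt_tokens: int, completion_tokens: int, max_tokens: int) -> dict:
    """Summarize token usage and flag completions that used the whole budget."""
    return {
        # Total tokens are prompt plus completion tokens
        "total_tokens": prompt_tokens + completion_tokens,
        # True when the completion consumed the entire max_tokens budget,
        # which usually means the text was cut off rather than finished
        "hit_max_tokens": completion_tokens >= max_tokens,
    }

# Values copied from the CompletionUsage output above
summary = summarize_usage(prompt_tokens=5, completion_tokens=1000, max_tokens=1000)
print(summary)
```

When `hit_max_tokens` is true, the generated text was likely truncated mid-sentence, as in the sample output above.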

Azure OpenAI uses a subword tokenization method called Byte-Pair Encoding (BPE) for its GPT-based models. **BPE repeatedly merges the most frequently occurring pair of adjacent characters or bytes into a single token**, until a target vocabulary size is reached. BPE helps the model handle rare or unseen words, because any word can be built from smaller known pieces, and it produces more compact and consistent representations of text. It also allows the model to generate new words by combining existing tokens.
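The merge loop described above can be sketched on a toy corpus. This is an illustration under simplifying assumptions, not the tokenizer Azure OpenAI actually ships: it works on characters rather than bytes, skips the regex pre-splitting that `tiktoken` applies, and stops after a fixed number of merges instead of a target vocabulary size. All function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences

merges, sequences = train_toy_bpe(["low", "lower", "lowest", "slow"], num_merges=2)
print("Learned merges:", merges)
print("Tokenized words:", sequences)
```

After two merges the shared stem `low` has become a single token, while rarer endings such as `e`, `r`, `s`, `t` remain separate pieces, which is how BPE represents unseen words from known parts.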

https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens