## Key topics:

**Tokens**: Tokens are the units in which Azure OpenAI models read and generate text. A token can be a whole word or just a chunk of characters, and each token is represented by an integer ID. For English text, 1 token is approximately 4 characters or 0.75 words.

**Tokenization**: splitting input and output text into tokens so that an LLM can process it.

**Vocabulary size**: the number of distinct tokens in a model's tokenizer, which varies among GPT models.
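
The rule of thumb above can be turned into a quick estimate for when a tokenizer is not at hand. The sketch below is illustrative only: the helper names `estimate_tokens_from_chars` and `estimate_tokens_from_words` are hypothetical, and the result is a rough figure for English text, not a substitute for counting with the actual tokenizer.

```python
# Rough token estimates from the "4 characters or 0.75 words per token" rule of thumb.
# Helper names are hypothetical; use a real tokenizer when an exact count matters.

CHARS_PER_TOKEN = 4      # approximate, English text
WORDS_PER_TOKEN = 0.75   # approximate, English text


def estimate_tokens_from_chars(text: str) -> float:
    """Estimate the token count from the number of characters."""
    return len(text) / CHARS_PER_TOKEN


def estimate_tokens_from_words(text: str) -> float:
    """Estimate the token count from the number of whitespace-separated words."""
    return len(text.split()) / WORDS_PER_TOKEN


text = "Boost employee health and productivity with Bupa by your side. That's better for business."
print("Estimate from characters:", estimate_tokens_from_chars(text))
print("Estimate from words:", round(estimate_tokens_from_words(text), 1))
```

For this 90-character, 14-word sentence the two estimates come out at 22.5 and about 18.7 tokens, so they are best treated as a ballpark rather than an exact count.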

In [3]:
import tiktoken  # OpenAI's Python library for tokenizing text

cl100k_base = tiktoken.get_encoding("cl100k_base")  # pretrained tokenizer

# Extend cl100k_base with the chat special tokens
enc = tiktoken.Encoding(
    name="gpt-35-turbo",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)

tokens = enc.encode( 
    "Boost employee health and productivity with Bupa by your side. That's better for business."
) 

print('Total number of tokens:', len(tokens))
print('Tokens : ', [enc.decode([t]) for t in tokens])
print("Tokens' numerical values:", tokens)

# Compare with the interactive tokenizer: https://platform.openai.com/tokenizer

Total number of tokens: 18
Tokens :  ['Boost', ' employee', ' health', ' and', ' productivity', ' with', ' B', 'upa', ' by', ' your', ' side', '.', ' That', "'s", ' better', ' for', ' business', '.']
Tokens' numerical values: [53463, 9548, 2890, 323, 26206, 449, 426, 46931, 555, 701, 3185, 13, 3011, 596, 2731, 369, 2626, 13]


In [4]:
# Returns the number of tokens used by a string
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [5]:
num_tokens_from_string("Boost employee health and productivity with Bupa by your side. That's better for business.")

18

In [6]:
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

# Set up Azure OpenAI: load the API key and endpoint from a local env file
load_dotenv("credentials.env")

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-15-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

# The custom name you chose for the deployment when you deployed the model
deployment_name = "gpt-35-turbo"
 

In [7]:
# Send a completion call to generate an answer
print('Sending a test completion job')
start_phrase = 'Bupa UK is a '
response = client.completions.create(
    model=deployment_name, 
    prompt=start_phrase, 
    max_tokens=1000)
print(response.choices[0].text)

Sending a test completion job
1.5m member health insurance provider, which also offers dental services, care homes, retirement villages, health assessments, and occupational health services.

By Prathima Nandakumar

[email protected]_reports

Related

First Name:* First Name Required

Last Name:* Last Name Required

Company Name:* Company Name is Required

No of employees:* No of employees is Required

Please fix the errors above




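## Usage

The `usage` field of the response reports how many tokens the call consumed: `prompt_tokens` for the input, `completion_tokens` for the generated text, and `total_tokens` for their sum.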

In [8]:
response

Completion(id='cmpl-9j21UnUW4Za4tlzHktQ8I6o1887o6', choices=[CompletionChoice(finish_reason='stop', index=0, logprobs=None, text='1.5m member health insurance provider, which also offers dental services, care homes, retirement villages, health assessments, and occupational health services.\n\nBy Prathima Nandakumar\n\n[email protected]_reports\n\nRelated\n\nFirst Name:* First Name Required\n\nLast Name:* Last Name Required\n\nCompany Name:* Company Name is Required\n\nNo of employees:* No of employees is Required\n\nPlease fix the errors above\n\n', content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1720519564, model='gpt-35-turbo', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=84, prompt_tokens=6, total_tokens=90), prompt_filter_results=[{'prompt_index'

In [9]:
response.usage

CompletionUsage(completion_tokens=84, prompt_tokens=6, total_tokens=90)
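
The usage figures can be sanity-checked and turned into a rough cost estimate. The sketch below uses a small stand-in dataclass so that it runs without an API call; `UsageSnapshot`, `summarize_usage`, and the price per 1,000 tokens are all hypothetical placeholders, not Azure OpenAI names or real prices.

```python
from dataclasses import dataclass


@dataclass
class UsageSnapshot:
    """Stand-in mirroring the three fields shown in response.usage above."""
    completion_tokens: int
    prompt_tokens: int
    total_tokens: int


def summarize_usage(usage, price_per_1k_tokens: float) -> dict:
    """Check that the totals add up and estimate a cost from a placeholder price."""
    consistent = usage.total_tokens == usage.prompt_tokens + usage.completion_tokens
    estimated_cost = usage.total_tokens / 1000 * price_per_1k_tokens
    return {"consistent": consistent, "estimated_cost": estimated_cost}


# Values copied from the output above; the price is an assumed placeholder.
usage = UsageSnapshot(completion_tokens=84, prompt_tokens=6, total_tokens=90)
summary = summarize_usage(usage, price_per_1k_tokens=0.002)
print(summary)
```

Because the function only reads the three attributes, a real `response.usage` object could be passed in place of the stand-in.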

Azure OpenAI uses a subword tokenization method called Byte-Pair Encoding (BPE) for its GPT-based models. **BPE repeatedly merges the most frequently occurring pair of adjacent characters or bytes into a single token**, until a target vocabulary size is reached. Because any word can be broken down into known subword pieces, BPE helps the model handle rare or unseen words and produces more compact and consistent representations of text. It also allows the model to generate new words by combining existing tokens.

https://learn.microsoft.com/en-us/semantic-kernel/prompt-engineering/tokens
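
The merge loop described above can be illustrated with a toy example. This is a minimal character-level sketch under simplifying assumptions, not the tokenizer that tiktoken or Azure OpenAI actually use (those work on bytes with a pretrained merge table); the function names `most_frequent_pair`, `merge_pair`, and `learn_bpe` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus: str, num_merges: int):
    """Learn up to `num_merges` merges from a whitespace-separated toy corpus."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print("Merges:", merges)
print("Words after merging:", words)
```

After two merges the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still represented as "low" plus leftover characters, which is how BPE covers unseen words with known pieces.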