
# USE CASE: CUSTOM TOKENIZER

This notebook uses the tiktoken library to show how input text is tokenized under a specific encoding (cl100k_base). It takes the user's input text, converts it into tokens, and displays the tokens in alternating colors so that token boundaries are easy to see. Seeing how text is broken down into tokens is useful for Natural Language Processing (NLP) tasks such as language modeling and text generation. The code also counts the tokens and characters in the tokenized output. Comparing the two counts gives a rough measure of how compactly the encoding represents the text, which helps when estimating input sizes for models that work on tokens.

## Tech Stack:

1. Python 3.11.1
2. tiktoken library (for tokenization)
3. IPython (for output control)
4. Jupyter Notebook (as the development environment)
5. ANSI escape codes (for color-coded terminal output)


## Library Imports and Setup

In [1]:
# Install tiktoken
!pip install tiktoken



## Tokenizer Initialization and User Input Handling

In [2]:
# Load the cl100k_base encoding, used to encode the user's input and later decode the tokens
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")

user_input = input("Enter the text here: ")

tokens = tokenizer.encode(user_input)

print(tokens)

Enter the text here: Tokenization class today
[3404, 2065, 538, 3432]


In [3]:
# Decode the token IDs back into the original text
decoded_text = tokenizer.decode(tokens)
print(decoded_text)

Tokenization class today


In [4]:
# Repeat the encoding step, then decode each token ID to its raw bytes
user_input = input("")
tokens = tokenizer.encode(user_input)
decode_to_bytes = tokenizer.decode_tokens_bytes(tokens)

print(tokens)
print(decode_to_bytes)

Tokenization class today
[3404, 2065, 538, 3432]
[b'Token', b'ization', b' class', b' today']
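
The per-token output above is a list of `bytes` objects rather than strings. One likely reason this matters is that a token can hold only part of a multi-byte UTF-8 character, so decoding each piece on its own may fail. The sketch below illustrates the idea in plain Python. The byte pieces are hypothetical and hand-written for illustration; they are not actual tiktoken output.

```python
# Hypothetical byte pieces (not real tokenizer output): the two UTF-8 bytes
# of "é" are placed in separate pieces to imitate a split character.
pieces = [b"caf", b"\xc3", b"\xa9"]


def decode_each(byte_pieces):
    """Decode every piece on its own, replacing invalid bytes with U+FFFD."""
    return [p.decode("utf-8", errors="replace") for p in byte_pieces]


def decode_joined(byte_pieces):
    """Concatenate the raw bytes first, then decode the whole sequence once."""
    return b"".join(byte_pieces).decode("utf-8")


print(decode_each(pieces))    # the split character shows up as replacement characters
print(decode_joined(pieces))  # the full text is recovered
```

Joining the bytes before decoding recovers the text, while piece-by-piece decoding needs `errors="replace"` (or similar) to avoid a `UnicodeDecodeError`.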


## Token and Character Count Calculation

In [5]:
import tiktoken
from IPython.display import clear_output

# Initialize tokenizer and variables
tokenizer = tiktoken.get_encoding("cl100k_base")
count = 0
token_list = []

user_input = input("")
clear_output(wait=True)

# Encode input and decode tokens
encode = tokenizer.encode(user_input)
decode = tokenizer.decode_tokens_bytes(encode)

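# Store decoded tokens as strings; a token may hold only part of a multi-byte
# UTF-8 character, so undecodable bytes are replaced instead of raising an error
for token in decode:
    token_list.append(token.decode("utf-8", errors="replace"))

# Calculate character count and token count
character_count = sum(len(i) for i in token_list)
length = len(encode)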

# Print tokens with alternating color codes
for tk in token_list:
    if count == 0:
        print('\x1b[0;47;1m' + tk + '\x1b[0m', end='')
    elif count == 1:
        print('\x1b[0;42;1m' + tk + '\x1b[0m', end='')
    elif count == 2:
        print('\x1b[0;43;1m' + tk + '\x1b[0m', end='')
    elif count == 3:
        print('\x1b[0;44;1m' + tk + '\x1b[0m', end='')
    elif count == 4:
        print('\x1b[0;46;1m' + tk + '\x1b[0m', end='')
    elif count == 5:
        print('\x1b[0;45;1m' + tk + '\x1b[0m', end='')
        count = -1
    count += 1

# Print token details
print("\n\n" + str(token_list) + "\n")
print(str(encode) + "\n")
print("Token Count: " + str(length))
print("Characters: " + str(character_count))

Tokenization class today

['Token', 'ization', ' class', ' today']

[3404, 2065, 538, 3432]

Token Count: 4
Characters: 24
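
The two counts above can be combined into a characters-per-token ratio, a rough indicator of how compactly the encoding represents a given text. The following is a minimal sketch that reuses the token strings printed above. The helper name `chars_per_token` is an assumption introduced here for illustration and is not part of tiktoken.

```python
def chars_per_token(token_strings):
    """Return the average number of characters per token (0.0 for no tokens)."""
    if not token_strings:
        return 0.0
    total_chars = sum(len(t) for t in token_strings)
    return total_chars / len(token_strings)


# Token strings copied from the cell output above
token_list = ['Token', 'ization', ' class', ' today']

ratio = chars_per_token(token_list)
print(f"{ratio:.2f} characters per token")
```

A higher ratio means fewer tokens are needed for the same amount of text.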


In [6]:
import tiktoken
from IPython.display import clear_output

# Initialize tokenizer and color codes
tokenizer = tiktoken.get_encoding("cl100k_base")
color_codes = ['0;47', '0;42', '0;43', '0;44', '0;46', '0;45']

user_input = input("")
clear_output(wait=True)

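# Encode the input, then decode each token to a string
# (undecodable partial UTF-8 bytes are replaced instead of raising an error)
encoded = tokenizer.encode(user_input)
decoded = tokenizer.decode_tokens_bytes(encoded)
token_list = [token.decode("utf-8", errors="replace") for token in decoded]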

# Calculate character count
character_count = sum(len(i) for i in token_list)

# Print tokens with alternating color codes
for idx, token in enumerate(token_list):
    print(f'\x1b[{color_codes[idx % len(color_codes)]};1m{token}\x1b[0m', end='')

# Print token details
print("\n\n" + str(token_list) + "\n")
print(str(encoded) + "\n")
print("Token Count: " + str(len(encoded)))
print("Characters: " + str(character_count))

Tokenization class today

['Token', 'ization', ' class', ' today']

[3404, 2065, 538, 3432]

Token Count: 4
Characters: 24
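
The color-cycling step can also be separated from printing, which makes it easier to reuse and check. The sketch below builds the colored string with `itertools.cycle` instead of index arithmetic. It uses the same color codes as the cell above; the function name `colorize_tokens` is an assumption made for this example.

```python
from itertools import cycle

# Background color codes taken from the cell above
COLOR_CODES = ['0;47', '0;42', '0;43', '0;44', '0;46', '0;45']
RESET = '\x1b[0m'  # ANSI sequence that resets all text attributes


def colorize_tokens(token_strings, color_codes=COLOR_CODES):
    """Wrap each token in an ANSI color sequence, cycling through color_codes."""
    parts = []
    # zip stops when the tokens run out; cycle repeats the colors as needed
    for token, code in zip(token_strings, cycle(color_codes)):
        parts.append(f'\x1b[{code};1m{token}{RESET}')
    return ''.join(parts)


colored = colorize_tokens(['Token', 'ization', ' class', ' today'])
print(colored)
```

Returning the string rather than printing inside the loop lets the caller decide where the output goes, for example a terminal or a log file.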
