
SN BPE - A Byte Pair Encoder for ServiceNow

Overview

This project introduces a Byte Pair Encoding (BPE) tokenizer for use in ServiceNow. BPE is a subword tokenization method that encodes text efficiently by decomposing words into frequently occurring subwords or characters. This approach is useful for NLP tasks where the vocabulary is vast or cannot be predefined. The cl100k_base tokenizer implemented here is the GPT-4 tokenizer current as of March 2024.
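
As a rough illustration of the core idea (a hypothetical sketch, not this project's internals), one BPE training step counts adjacent token pairs and merges the most frequent pair into a new token:

// Minimal sketch of one BPE merge step (illustrative only, not this project's API).
function countPairs(tokens) {
    var counts = {};
    for (var i = 0; i < tokens.length - 1; i++) {
        var pair = tokens[i] + ',' + tokens[i + 1];
        counts[pair] = (counts[pair] || 0) + 1;
    }
    return counts;
}

function merge(tokens, pair, newToken) {
    var out = [];
    var i = 0;
    while (i < tokens.length) {
        if (i < tokens.length - 1 && tokens[i] + ',' + tokens[i + 1] === pair) {
            out.push(newToken); // replace the pair with its new merged token id
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    return out;
}

// e.g. the bytes of "aaab" are [97, 97, 97, 98]; the most frequent pair "97,97"
// becomes a new token 256: merge([97, 97, 97, 98], '97,97', 256) -> [256, 97, 98]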

Features

  • Tokenization using the Byte Pair Encoding (BPE) method.
  • Support for custom tokenization patterns and special tokens through configuration in ServiceNow.
  • Efficient tokenizer load/cache processing within ServiceNow using GlideRecord for data retrieval and storage.
  • Polyfill for TextEncoder and TextDecoder, enabling UTF-8 encoding in ServiceNow without external dependencies (sketched below).
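
For reference, the kind of UTF-8 encoding the polyfill provides looks roughly like this (a simplified sketch; the actual polyfill is based on TextEncoderLite and handles more edge cases):

// Simplified sketch of UTF-8 encoding without TextEncoder (illustrative only).
function utf8Encode(str) {
    var bytes = [];
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // combine a surrogate pair into a single code point
        if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
            var low = str.charCodeAt(i + 1);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00);
                i++;
            }
        }
        if (code < 0x80) {
            bytes.push(code);
        } else if (code < 0x800) {
            bytes.push(0xC0 | (code >> 6), 0x80 | (code & 0x3F));
        } else if (code < 0x10000) {
            bytes.push(0xE0 | (code >> 12), 0x80 | ((code >> 6) & 0x3F), 0x80 | (code & 0x3F));
        } else {
            bytes.push(0xF0 | (code >> 18), 0x80 | ((code >> 12) & 0x3F),
                       0x80 | ((code >> 6) & 0x3F), 0x80 | (code & 0x3F));
        }
    }
    return bytes;
}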

Installation

  1. Ensure you have access to ServiceNow Studio and the necessary permissions to create and manage applications.
  2. Follow the ServiceNow SDK guide to set up your instance connection for deployment (ServiceNow SDK links).
  3. Run npm run upload to deploy the project to your default instance connection.

Usage

Initializing the Tokenizer

The Tokenizer class is used to create tokenizer instances capable of processing text according to the BPE method.

To initialize a tokenizer:

Create a record in x_13131_bpe_tokenizer with a .tiktoken.txt file attachment.

See ./examples for sample data records that can be imported directly as XML; the folder includes an XML record for the GPT-4 (cl100k_base) tokenizer.

Note that ServiceNow's JS engine does not support (?i:...) case-insensitive inline flags or \p{L} Unicode character classes. As a result, the regex in that registry must be substantially rewritten to run in ServiceNow, which is why the demo data contains a very long regex.
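
For example, a case-insensitive group can be expanded into explicit character classes, and \p{L} approximated with explicit Unicode ranges (a hypothetical illustration of the kind of rewriting involved, not the exact regex used here):

// Hypothetical illustration of rewriting unsupported regex constructs.
// A tiktoken-style case-insensitive group such as (?i:'s|'t|'re) can be
// expanded into explicit character classes:
var contraction = /'(?:[sS]|[tT]|[rR][eE])/;

// \p{L} (any Unicode letter) has no direct equivalent, so it has to be
// approximated with explicit ranges, e.g. ASCII plus some Latin ranges:
var letter = /[A-Za-z\u00C0-\u024F]/;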

A tokenizer can then be initialized and used from a script:

const { Tokenizer } = require('./src/Tokenizer.js');
const tokenizer = new Tokenizer('c14ba3f74738021051711288c26d430c'); // sys_id of the x_13131_bpe_tokenizer record
const output = tokenizer.encode("TEXT_TO_TEST", "all");
gs.info(output);                   // encoded token ids
gs.info(output.length);            // token count
gs.info(tokenizer.decode(output)); // decodes back to the original text

When the tokenizer first runs, some pre-processing happens based on the tiktoken file. To reduce latency on subsequent runs, the result of this processing is cached as a file on the tokenizer GlideRecord and reused.
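
Conceptually, the cache lookup might work like this (a hypothetical sketch using the scoped GlideSysAttachment API; the project's actual file names and logic may differ):

// Hypothetical sketch of the cache lookup (not the project's actual code).
// Assumes a cache file named 'bpe_cache.json' attached to the tokenizer record.
var tokenizerGr = new GlideRecord('x_13131_bpe_tokenizer');
if (tokenizerGr.get('c14ba3f74738021051711288c26d430c')) {
    var attachment = new GlideRecord('sys_attachment');
    attachment.addQuery('table_sys_id', tokenizerGr.getUniqueValue());
    attachment.addQuery('file_name', 'bpe_cache.json');
    attachment.query();

    var sa = new GlideSysAttachment();
    if (attachment.next()) {
        // Cache hit: reuse the pre-processed merge/rank tables.
        var cached = JSON.parse(sa.getContent(attachment));
    } else {
        // Cache miss: pre-process the .tiktoken.txt file, then store the result.
        var ranks = {}; // ... built from the tiktoken attachment ...
        sa.write(tokenizerGr, 'bpe_cache.json', 'application/json', JSON.stringify(ranks));
    }
}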

TODO:

  • Expand tests for the decode logic

Acknowledgments

This tokenizer is largely based on Andrej Karpathy's minbpe and the associated YouTube video: Let's build the GPT Tokenizer.

The UTF-8 encoder polyfill is based on TextEncoderLite.
