
tokenkit

Fast, Rust-backed word-level tokenization for Ruby with pattern preservation.

TokenKit is a Ruby wrapper around Rust's unicode-segmentation crate, providing lightweight, Unicode-aware tokenization designed for NLP pipelines, search applications, and text processing where you need consistent, high-quality word segmentation. Unlike subword tokenizers (BPE, WordPiece) built for LLMs, TokenKit performs linguistic, word-level tokenization that preserves domain-specific patterns such as gene names, measurements, and technical terms while handling Unicode correctly.

Quickstart

# Install the gem
gem install tokenkit

# Or add to your Gemfile
gem 'tokenkit'

require 'tokenkit'

# Basic tokenization - handles Unicode, contractions, accents
TokenKit.tokenize("Hello, world! café can't")
# => ["hello", "world", "café", "can't"]

# Preserve domain-specific terms even when lowercasing
TokenKit.configure do |config|
  config.lowercase = true
  config.preserve_patterns = [
    /\d+ug/i,           # Measurements: 100ug
    /[A-Z][A-Z0-9]+/    # Gene names: BRCA1, TP53
  ]
end

TokenKit.tokenize("Patient received 100ug for BRCA1 study")
# => ["patient", "received", "100ug", "for", "BRCA1", "study"]

Features

  • Thirteen tokenization strategies: whitespace, unicode (recommended), custom regex patterns, sentence, grapheme, keyword, edge n-gram, n-gram, path hierarchy, URL/email-aware, character group, letter, and lowercase
  • Pattern preservation: Keep domain-specific terms (gene names, measurements, antibodies) intact even with case normalization
  • Fast: Rust-backed implementation (~100K docs/sec)
  • Thread-safe: Safe for concurrent use
  • Simple API: Configure once, use everywhere
  • Zero runtime dependencies: Pure Ruby API backed by a Rust native extension

Tokenization Strategies

Unicode (Recommended)

Uses Unicode word segmentation for proper handling of contractions, accents, and multi-language text.

✅ Supports preserve_patterns

TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
end

TokenKit.tokenize("Don't worry about café!")
# => ["don't", "worry", "about", "café"]

Whitespace

Simple whitespace splitting.

✅ Supports preserve_patterns

TokenKit.configure do |config|
  config.strategy = :whitespace
  config.lowercase = true
end

TokenKit.tokenize("hello world")
# => ["hello", "world"]

Pattern (Custom Regex)

Custom tokenization using regex patterns.

✅ Supports preserve_patterns

TokenKit.configure do |config|
  config.strategy = :pattern
  config.regex = /[\w-]+/  # Keep words and hyphens
  config.lowercase = true
end

TokenKit.tokenize("anti-CD3 antibody")
# => ["anti-cd3", "antibody"]

Sentence

Splits text into sentences using Unicode sentence boundaries.

✅ Supports preserve_patterns (preserves patterns within each sentence)

TokenKit.configure do |config|
  config.strategy = :sentence
  config.lowercase = false
end

TokenKit.tokenize("Hello world! How are you? I am fine.")
# => ["Hello world! ", "How are you? ", "I am fine."]

Useful for document-level processing, sentence embeddings, or paragraph analysis.
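
Sentence tokens keep their trailing whitespace (as in the output above), so a small cleanup step is common. A sketch using the per-call overrides described later:

TokenKit.tokenize("Hello world! How are you? I am fine.", strategy: :sentence, lowercase: false)
        .map(&:strip)
# => ["Hello world!", "How are you?", "I am fine."]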

Grapheme

Splits text into grapheme clusters (user-perceived characters).

⚠️ Does NOT support preserve_patterns (patterns will be lowercased if lowercase: true)

TokenKit.configure do |config|
  config.strategy = :grapheme
  config.grapheme_extended = true  # Use extended grapheme clusters (default)
  config.lowercase = false
end

TokenKit.tokenize("👨‍👩‍👧‍👦café")
# => ["👨‍👩‍👧‍👦", "c", "a", "f", "é"]

Perfect for handling emoji, combining characters, and complex scripts. Set grapheme_extended = false for legacy grapheme boundaries.
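
One practical use is counting user-perceived characters (for display limits and the like), since Ruby's String#length counts codepoints. A sketch reusing the string above (the codepoint count assumes a precomposed é):

text = "👨‍👩‍👧‍👦café"

text.length
# => 11 codepoints (the family emoji alone is 7: four people joined by three ZWJs)

TokenKit.tokenize(text, strategy: :grapheme, lowercase: false).length
# => 5 user-perceived characters, matching the token list above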

Keyword

Treats entire input as a single token (no splitting).

⚠️ Does NOT support preserve_patterns (patterns will be lowercased if lowercase: true)

TokenKit.configure do |config|
  config.strategy = :keyword
  config.lowercase = false
end

TokenKit.tokenize("PROD-2024-ABC-001")
# => ["PROD-2024-ABC-001"]

Ideal for exact matching of SKUs, IDs, product codes, or category names where splitting would lose meaning.
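
For one-off lookups you can also request keyword behavior per call (see Per-Call Options below); the category string here is illustrative:

TokenKit.tokenize("Electronics > Gaming Laptops", strategy: :keyword, lowercase: false)
# => ["Electronics > Gaming Laptops"]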

Edge N-gram (Search-as-you-type)

Generates prefixes from the beginning of words for autocomplete functionality.

⚠️ Does NOT support preserve_patterns (patterns will be lowercased if lowercase: true)

TokenKit.configure do |config|
  config.strategy = :edge_ngram
  config.min_gram = 2        # Minimum prefix length
  config.max_gram = 10       # Maximum prefix length
  config.lowercase = true
end

TokenKit.tokenize("laptop")
# => ["la", "lap", "lapt", "lapto", "laptop"]

Essential for autocomplete, type-ahead search, and prefix matching. At index time, generate edge n-grams of your product names or search terms.
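
A minimal index-time sketch, assuming edge n-grams are generated per word as shown above (the Hash-based prefix index is illustrative, not a TokenKit API):

products = ["laptop", "laptop stand", "lamp"]

# Hypothetical in-memory prefix index: prefix => matching product names
index = Hash.new { |h, k| h[k] = [] }

products.each do |name|
  TokenKit.tokenize(name, strategy: :edge_ngram, min_gram: 2, max_gram: 10)
          .each { |prefix| index[prefix] << name }
end

index["lap"]
# => ["laptop", "laptop stand"]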

N-gram (Fuzzy Matching)

Generates all substring n-grams (sliding window) for fuzzy matching and misspelling tolerance.

⚠️ Does NOT support preserve_patterns (patterns will be lowercased if lowercase: true)

TokenKit.configure do |config|
  config.strategy = :ngram
  config.min_gram = 2        # Minimum n-gram length
  config.max_gram = 3        # Maximum n-gram length
  config.lowercase = true
end

TokenKit.tokenize("quick")
# => ["qu", "ui", "ic", "ck", "qui", "uic", "ick"]

Perfect for fuzzy search, typo tolerance, and partial matching. Unlike edge n-grams which only generate prefixes, n-grams generate all possible substrings.

Path Hierarchy (Hierarchical Navigation)

Creates tokens for each level of a path hierarchy.

⚠️ Partially supports preserve_patterns (has limitations with hierarchical structure)

TokenKit.configure do |config|
  config.strategy = :path_hierarchy
  config.delimiter = "/"     # Use "\\" for Windows paths
  config.lowercase = false
end

TokenKit.tokenize("/usr/local/bin/ruby")
# => ["/usr", "/usr/local", "/usr/local/bin", "/usr/local/bin/ruby"]

# Works for category hierarchies too
TokenKit.tokenize("electronics/computers/laptops")
# => ["electronics", "electronics/computers", "electronics/computers/laptops"]

Perfect for filesystem paths, URL structures, category hierarchies, and breadcrumb navigation.
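
A small breadcrumb-rendering sketch built on the category example above (the label formatting is illustrative):

TokenKit.tokenize("electronics/computers/laptops", strategy: :path_hierarchy, delimiter: "/")
        .map { |path| path.split("/").last.capitalize }
# => ["Electronics", "Computers", "Laptops"]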

URL/Email-Aware (Web Content)

Preserves URLs and email addresses as single tokens while tokenizing surrounding text.

✅ Supports preserve_patterns (preserves patterns alongside URLs/emails)

TokenKit.configure do |config|
  config.strategy = :url_email
  config.lowercase = true
end

TokenKit.tokenize("Contact support@example.com or visit https://example.com")
# => ["contact", "support@example.com", "or", "visit", "https://example.com"]

Essential for user-generated content, customer support messages, product descriptions with links, and social media text.
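
Because preserve_patterns is also honored here, a hedged sketch for support messages could keep ticket IDs intact alongside addresses (the pattern and the expected output are assumptions):

TokenKit.configure do |config|
  config.strategy = :url_email
  config.lowercase = true
  config.preserve_patterns = [/TICKET-\d+/i]   # e.g. TICKET-4821
end

TokenKit.tokenize("Re: TICKET-4821 please email billing@example.com")
# Expected (approximate): ["re", "TICKET-4821", "please", "email", "billing@example.com"]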

Character Group (Fast Custom Splitting)

Splits text based on a custom set of characters (faster than regex for simple delimiters).

⚠️ Partially supports preserve_patterns (works best with whitespace delimiters; non-whitespace delimiters may have issues)

TokenKit.configure do |config|
  config.strategy = :char_group
  config.split_on_chars = ",;"  # Split on commas and semicolons
  config.lowercase = false
end

TokenKit.tokenize("apple,banana;cherry")
# => ["apple", "banana", "cherry"]

# CSV parsing
TokenKit.tokenize("John Doe,30,Software Engineer")
# => ["John Doe", "30", "Software Engineer"]

Ideal for structured data (CSV, TSV), log parsing, and custom delimiter-based formats. Default split characters are " \t\n\r" (space, tab, newline, and carriage return).
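
For example, a tab-delimited log line can be split per call (a sketch; the log fields are illustrative):

line = "2024-05-01T12:00:00Z\tERROR\tdisk space low"

TokenKit.tokenize(line, strategy: :char_group, split_on_chars: "\t", lowercase: false)
# => ["2024-05-01T12:00:00Z", "ERROR", "disk space low"]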

Letter (Language-Agnostic)

Splits on any non-letter character (simpler than Unicode tokenizer, no special handling for contractions).

✅ Supports preserve_patterns

TokenKit.configure do |config|
  config.strategy = :letter
  config.lowercase = true
end

TokenKit.tokenize("hello-world123test")
# => ["hello", "world", "test"]

# Handles multiple scripts
TokenKit.tokenize("Hello-世界-test")
# => ["hello", "世界", "test"]

Great for noisy text, mixed scripts, and cases where you want aggressive splitting on any non-letter character.

Lowercase (Efficient Case Normalization)

Like the Letter tokenizer but always lowercases in a single pass (more efficient than letter + lowercase filter).

✅ Supports preserve_patterns (preserved patterns maintain original case despite always lowercasing)

TokenKit.configure do |config|
  config.strategy = :lowercase
  # Note: config.lowercase setting is ignored - this tokenizer ALWAYS lowercases
end

TokenKit.tokenize("HELLO-WORLD")
# => ["hello", "world"]

# Case-insensitive search indexing
TokenKit.tokenize("User-Agent: Mozilla/5.0")
# => ["user", "agent", "mozilla"]

⚠️ Important: The :lowercase strategy always lowercases text, regardless of the config.lowercase setting. If you need control over lowercasing, use the :letter strategy instead with config.lowercase = true/false.

Perfect for case-insensitive search indexing, normalizing product codes, and cleaning social media text. Handles Unicode correctly, including characters that lowercase to multiple characters (e.g., Turkish İ).
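
A sketch of the multi-character case mapping mentioned above; the token shown is an assumption based on Unicode's default mapping of İ (U+0130) to "i" plus a combining dot above (U+0307):

TokenKit.tokenize("İSTANBUL", strategy: :lowercase)
# => ["i̇stanbul"]  (assumed: one character longer than the 8-character input)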

Pattern Preservation

Preserve domain-specific terms even when lowercasing.

Fully Supported by: Unicode, Pattern, Whitespace, Letter, Lowercase, Sentence, and URL/Email tokenizers.

Partially Supported by: Character Group (works best with whitespace delimiters) and Path Hierarchy (limitations with hierarchical structure) tokenizers.

Not Supported by: Grapheme, Keyword, Edge N-gram, and N-gram tokenizers.

TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
  config.preserve_patterns = [
    /\d+(ug|mg|ml|units)/i,  # Measurements: 100ug, 50mg
    /anti-cd\d+/i,            # Antibodies: Anti-CD3, anti-CD28
    /[A-Z][A-Z0-9]+/          # Gene names: BRCA1, TP53, EGFR
  ]
end

text = "Patient received 100ug Anti-CD3 with BRCA1 mutation"
tokens = TokenKit.tokenize(text)
# => ["patient", "received", "100ug", "Anti-CD3", "with", "BRCA1", "mutation"]

Pattern matches maintain their original case despite lowercase=true.

Regex Flags

TokenKit supports Ruby regex flags for both preserve_patterns and the :pattern strategy:

# Case-insensitive matching (i flag)
TokenKit.configure do |config|
  config.preserve_patterns = [/gene-\d+/i]
end

TokenKit.tokenize("Found GENE-123 and gene-456")
# => ["found", "GENE-123", "and", "gene-456"]

# Multiline mode (m flag) - dot matches newlines
TokenKit.configure do |config|
  config.strategy = :pattern
  config.regex = /test./m
end

# Extended mode (x flag) - allows comments and whitespace
pattern = /
  \w+       # word characters
  @         # at sign
  \w+\.\w+  # domain.tld
/x

TokenKit.configure do |config|
  config.preserve_patterns = [pattern]
end

# Combine flags
TokenKit.configure do |config|
  config.preserve_patterns = [/code-\d+/im]  # case-insensitive + multiline
end

Supported flags:

  • i - Case-insensitive matching
  • m - Multiline mode (. matches newlines)
  • x - Extended mode (ignore whitespace, allow comments)

Flags work with both Regexp objects and string patterns passed to :pattern strategy.

Configuration

Global Configuration

TokenKit.configure do |config|
  config.strategy = :unicode              # :whitespace, :unicode, :pattern, :sentence, :grapheme, :keyword, :edge_ngram, :ngram, :path_hierarchy, :url_email, :char_group, :letter, :lowercase
  config.lowercase = true                 # Normalize to lowercase
  config.remove_punctuation = false       # Remove punctuation from tokens
  config.preserve_patterns = []           # Regex patterns to preserve

  # Strategy-specific options
  config.regex = /\w+/                    # Only for :pattern strategy
  config.grapheme_extended = true         # Only for :grapheme strategy (default: true)
  config.min_gram = 2                     # For :edge_ngram and :ngram strategies (default: 2)
  config.max_gram = 10                    # For :edge_ngram and :ngram strategies (default: 10)
  config.delimiter = "/"                  # Only for :path_hierarchy strategy (default: "/")
  config.split_on_chars = " \t\n\r"       # Only for :char_group strategy (default: whitespace)
end

Per-Call Options

Override global config for specific calls:

# Override general options
TokenKit.tokenize("BRCA1 Gene", lowercase: false)
# => ["BRCA1", "Gene"]

# Override strategy-specific options
TokenKit.tokenize("laptop", strategy: :edge_ngram, min_gram: 3, max_gram: 5)
# => ["lap", "lapt", "lapto"]

TokenKit.tokenize("C:\\Windows\\System", strategy: :path_hierarchy, delimiter: "\\")
# => ["C:", "C:\\Windows", "C:\\Windows\\System"]

# Combine multiple overrides
TokenKit.tokenize(
  "TEST",
  strategy: :edge_ngram,
  min_gram: 2,
  max_gram: 3,
  lowercase: false
)
# => ["TE", "TES"]

All strategy-specific options can be overridden per-call:

  • :pattern - regex: /pattern/
  • :grapheme - extended: true/false
  • :edge_ngram - min_gram: n, max_gram: n
  • :ngram - min_gram: n, max_gram: n
  • :path_hierarchy - delimiter: "/"
  • :char_group - split_on_chars: ",;"

Get Current Config

config = TokenKit.config_hash
# Returns a Configuration object with accessor methods

config.strategy           # => :unicode
config.lowercase          # => true
config.remove_punctuation # => false
config.preserve_patterns  # => [...]

# Strategy predicates
config.edge_ngram?        # => false
config.ngram?             # => false
config.pattern?           # => false
config.grapheme?          # => false
config.path_hierarchy?    # => false
config.char_group?        # => false
config.letter?            # => false
config.lowercase?         # => false

# Strategy-specific accessors
config.min_gram           # => 2 (for edge_ngram and ngram)
config.max_gram           # => 10 (for edge_ngram and ngram)
config.delimiter          # => "/" (for path_hierarchy)
config.split_on_chars     # => " \t\n\r" (for char_group)
config.extended           # => true (for grapheme)
config.regex              # => "..." (for pattern)

# Convert to hash if needed
config.to_h
# => {"strategy" => "unicode", "lowercase" => true, ...}

Reset to Defaults

TokenKit.reset

Use Cases

Biotech/Life Sciences

TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
  config.preserve_patterns = [
    /\d+(ug|mg|ml|ul|units)/i,  # Measurements
    /anti-[a-z0-9-]+/i,          # Antibodies
    /[A-Z]{2,10}/,               # Gene names (CDK10, BRCA1, TP53)
    /cd\d+/i,                    # Cell markers (CD3, CD4, CD8)
    /ig[gmaed]/i                 # Immunoglobulins (IgG, IgM)
  ]
end

text = "Anti-CD3 IgG antibody 100ug for BRCA1 research"
tokens = TokenKit.tokenize(text)
# => ["Anti-CD3", "IgG", "antibody", "100ug", "for", "BRCA1", "research"]

E-commerce/Catalogs

TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
  config.preserve_patterns = [
    /\$\d+(\.\d{2})?/,          # Prices: $99.99
    /\d+(-\d+)+/,               # SKUs: 123-456-789
    /\d+(mm|cm|inch)/i          # Dimensions: 10mm, 5cm
  ]
end

text = "Widget $49.99 SKU: 123-456 size: 10cm"
tokens = TokenKit.tokenize(text)
# => ["widget", "$49.99", "sku", "123-456", "size", "10cm"]

Search Applications

# Exact matching with case normalization
TokenKit.configure do |config|
  config.strategy = :lowercase
  config.lowercase = true
end

# Index time: normalize documents
doc_tokens = TokenKit.tokenize("Product Code: ABC-123")
# => ["product", "code", "abc"]

# Query time: normalize user input
query_tokens = TokenKit.tokenize("product abc")
# => ["product", "abc"]

# Fuzzy matching with n-grams
TokenKit.configure do |config|
  config.strategy = :ngram
  config.min_gram = 2
  config.max_gram = 4
  config.lowercase = true
end

# Index time: generate n-grams
TokenKit.tokenize("search")
# => ["se", "ea", "ar", "rc", "ch", "sea", "ear", "arc", "rch", "sear", "earc", "arch"]

# Query time: typo "serch" still has significant overlap
TokenKit.tokenize("serch")
# => ["se", "er", "rc", "ch", "ser", "erc", "rch", "serc", "erch"]
# Overlap: ["se", "rc", "ch", "rch"] allows matching despite typo
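
# Hedged scoring sketch: a Jaccard-style helper (illustrative, not a TokenKit API)
def ngram_overlap(indexed, query)
  shared = indexed & query                  # n-grams in common
  shared.size.to_f / (indexed | query).size # shared / total unique n-grams
end

ngram_overlap(TokenKit.tokenize("search"), TokenKit.tokenize("serch"))
# => ~0.24 with the :ngram config above - nonzero despite the typo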

# Autocomplete with edge n-grams
TokenKit.configure do |config|
  config.strategy = :edge_ngram
  config.min_gram = 2
  config.max_gram = 10
end

TokenKit.tokenize("laptop")
# => ["la", "lap", "lapt", "lapto", "laptop"]
# Matches "la", "lap", "lapt" as user types

Performance

TokenKit has been extensively optimized for production use:

  • Unicode tokenization: ~870K tokens/sec (baseline)
  • Pattern preservation: ~410K tokens/sec with 4 patterns (was 3.6K/sec before v0.3.0 optimizations)
  • Memory efficient: Pre-allocated buffers and in-place operations
  • Thread-safe: Cached instances with mutex protection, safe for concurrent use
  • 110x speedup: For pattern-heavy workloads through intelligent caching

Key optimizations:

  • Regex patterns compiled once and cached (not per-tokenization)
  • String allocations minimized through index-based operations
  • Tokenizer instances reused across calls
  • In-place post-processing for lowercase and punctuation removal
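
A minimal throughput check you can run locally, sketched with Ruby's standard-library Benchmark (numbers vary by machine and corpus, and this is not the repo's benchmark script):

require 'benchmark'
require 'tokenkit'

docs = ["Patient received 100ug for BRCA1 study"] * 10_000

elapsed = Benchmark.realtime do
  docs.each { |doc| TokenKit.tokenize(doc) }
end

puts "#{(docs.size / elapsed).round} docs/sec"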

See the Performance Guide for detailed benchmarks and optimization techniques.

Integration

TokenKit is designed to work with other gems in the scientist-labs ecosystem:

  • PhraseKit: Use TokenKit for consistent phrase extraction
  • SpellKit: Tokenize before spell correction
  • red-candle: Tokenize before NER/embeddings

Documentation

Generating Documentation Locally

# Install documentation dependencies
bundle install

# Generate YARD documentation
bundle exec yard doc

# Open documentation in browser
open doc/index.html

Development

# Setup
bundle install
bundle exec rake compile

# Run tests
bundle exec rspec

# Run tests with coverage
COVERAGE=true bundle exec rspec

# Run linter
bundle exec standardrb

# Run benchmarks
ruby benchmarks/tokenizer_benchmark.rb

# Build gem
gem build tokenkit.gemspec

Requirements

  • Ruby >= 3.1.0
  • Rust toolchain (for building from source)

License

MIT License. See LICENSE.txt for details.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/tokenkit.

This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.

Credits

Built with:

  • unicode-segmentation - Rust crate providing Unicode-aware word, sentence, and grapheme segmentation
