Fast, Rust-backed word-level tokenization for Ruby with pattern preservation.
TokenKit is a Ruby wrapper around Rust's unicode-segmentation crate, providing lightweight, Unicode-aware tokenization designed for NLP pipelines, search applications, and text processing where you need consistent, high-quality word segmentation.
# Install the gem
gem install tokenkit
# Or add to your Gemfile
gem 'tokenkit'
require 'tokenkit'
# Basic tokenization - handles Unicode, contractions, accents
TokenKit.tokenize("Hello, world! café can't")
# => ["hello", "world", "café", "can't"]
# Preserve domain-specific terms even when lowercasing
TokenKit.configure do |config|
config.lowercase = true
config.preserve_patterns = [
/\d+ug/i, # Measurements: 100ug
/[A-Z][A-Z0-9]+/ # Gene names: BRCA1, TP53
]
end
TokenKit.tokenize("Patient received 100ug for BRCA1 study")
# => ["patient", "received", "100ug", "for", "BRCA1", "study"]
- Thirteen tokenization strategies: whitespace, unicode (recommended), custom regex patterns, sentence, grapheme, keyword, edge n-gram, n-gram, path hierarchy, URL/email-aware, character group, letter, and lowercase
- Pattern preservation: Keep domain-specific terms (gene names, measurements, antibodies) intact even with case normalization
- Fast: Rust-backed implementation (~100K docs/sec)
- Thread-safe: Safe for concurrent use
- Simple API: Configure once, use everywhere
- Zero dependencies: Pure Ruby API with Rust extension
Uses Unicode word segmentation for proper handling of contractions, accents, and multi-language text.
✅ Supports preserve_patterns
TokenKit.configure do |config|
config.strategy = :unicode
config.lowercase = true
end
TokenKit.tokenize("Don't worry about café!")
# => ["don't", "worry", "about", "café"]
Simple whitespace splitting.
✅ Supports preserve_patterns
TokenKit.configure do |config|
config.strategy = :whitespace
config.lowercase = true
end
TokenKit.tokenize("hello world")
# => ["hello", "world"]
Custom tokenization using regex patterns.
✅ Supports preserve_patterns
TokenKit.configure do |config|
config.strategy = :pattern
config.regex = /[\w-]+/ # Keep words and hyphens
config.lowercase = true
end
TokenKit.tokenize("anti-CD3 antibody")
# => ["anti-cd3", "antibody"]
Splits text into sentences using Unicode sentence boundaries.
✅ Supports preserve_patterns (preserves patterns within each sentence)
TokenKit.configure do |config|
config.strategy = :sentence
config.lowercase = false
end
TokenKit.tokenize("Hello world! How are you? I am fine.")
# => ["Hello world! ", "How are you? ", "I am fine."]
Useful for document-level processing, sentence embeddings, or paragraph analysis.
Splits text into grapheme clusters (user-perceived characters).
❌ Does not support preserve_patterns (patterns will be lowercased if lowercase: true)
TokenKit.configure do |config|
config.strategy = :grapheme
config.grapheme_extended = true # Use extended grapheme clusters (default)
config.lowercase = false
end
TokenKit.tokenize("👨‍👩‍👧‍👦café")
# => ["👨‍👩‍👧‍👦", "c", "a", "f", "é"]
Perfect for handling emoji, combining characters, and complex scripts. Set grapheme_extended = false for legacy grapheme boundaries.
Treats entire input as a single token (no splitting).
❌ Does not support preserve_patterns (patterns will be lowercased if lowercase: true)
TokenKit.configure do |config|
config.strategy = :keyword
config.lowercase = false
end
TokenKit.tokenize("PROD-2024-ABC-001")
# => ["PROD-2024-ABC-001"]
Ideal for exact matching of SKUs, IDs, product codes, or category names where splitting would lose meaning.
Generates prefixes from the beginning of words for autocomplete functionality.
❌ Does not support preserve_patterns (patterns will be lowercased if lowercase: true)
TokenKit.configure do |config|
config.strategy = :edge_ngram
config.min_gram = 2 # Minimum prefix length
config.max_gram = 10 # Maximum prefix length
config.lowercase = true
end
TokenKit.tokenize("laptop")
# => ["la", "lap", "lapt", "lapto", "laptop"]
Essential for autocomplete, type-ahead search, and prefix matching. At index time, generate edge n-grams of your product names or search terms.
Generates all substring n-grams (sliding window) for fuzzy matching and misspelling tolerance.
❌ Does not support preserve_patterns (patterns will be lowercased if lowercase: true)
TokenKit.configure do |config|
config.strategy = :ngram
config.min_gram = 2 # Minimum n-gram length
config.max_gram = 3 # Maximum n-gram length
config.lowercase = true
end
TokenKit.tokenize("quick")
# => ["qu", "ui", "ic", "ck", "qui", "uic", "ick"]
Perfect for fuzzy search, typo tolerance, and partial matching. Unlike edge n-grams which only generate prefixes, n-grams generate all possible substrings.
Creates tokens for each level of a path hierarchy.
⚠️ Partially supports preserve_patterns (has limitations with hierarchical structure)
TokenKit.configure do |config|
config.strategy = :path_hierarchy
config.delimiter = "/" # Use "\\" for Windows paths
config.lowercase = false
end
TokenKit.tokenize("/usr/local/bin/ruby")
# => ["/usr", "/usr/local", "/usr/local/bin", "/usr/local/bin/ruby"]
# Works for category hierarchies too
TokenKit.tokenize("electronics/computers/laptops")
# => ["electronics", "electronics/computers", "electronics/computers/laptops"]
Perfect for filesystem paths, URL structures, category hierarchies, and breadcrumb navigation.
Preserves URLs and email addresses as single tokens while tokenizing surrounding text.
✅ Supports preserve_patterns (preserves patterns alongside URLs/emails)
TokenKit.configure do |config|
config.strategy = :url_email
config.lowercase = true
end
TokenKit.tokenize("Contact support@example.com or visit https://example.com")
# => ["contact", "support@example.com", "or", "visit", "https://example.com"]
Essential for user-generated content, customer support messages, product descriptions with links, and social media text.
Splits text based on a custom set of characters (faster than regex for simple delimiters).
⚠️ Partially supports preserve_patterns (works best with whitespace delimiters; non-whitespace delimiters may have issues)
TokenKit.configure do |config|
config.strategy = :char_group
config.split_on_chars = ",;" # Split on commas and semicolons
config.lowercase = false
end
TokenKit.tokenize("apple,banana;cherry")
# => ["apple", "banana", "cherry"]
# CSV parsing
TokenKit.tokenize("John Doe,30,Software Engineer")
# => ["John Doe", "30", "Software Engineer"]
Ideal for structured data (CSV, TSV), log parsing, and custom delimiter-based formats. The default split characters are " \t\n\r" (whitespace).
Splits on any non-letter character (simpler than the Unicode tokenizer, with no special handling for contractions).
✅ Supports preserve_patterns
TokenKit.configure do |config|
config.strategy = :letter
config.lowercase = true
end
TokenKit.tokenize("hello-world123test")
# => ["hello", "world", "test"]
# Handles multiple scripts
TokenKit.tokenize("Hello-世界-test")
# => ["hello", "世界", "test"]
Great for noisy text, mixed scripts, and cases where you want aggressive splitting on any non-letter character.
Like the Letter tokenizer but always lowercases in a single pass (more efficient than letter + lowercase filter).
✅ Supports preserve_patterns (preserved patterns maintain their original case even though this tokenizer always lowercases)
TokenKit.configure do |config|
config.strategy = :lowercase
# Note: config.lowercase setting is ignored - this tokenizer ALWAYS lowercases
end
TokenKit.tokenize("HELLO-WORLD")
# => ["hello", "world"]
# Case-insensitive search indexing
TokenKit.tokenize("User-Agent: Mozilla/5.0")
# => ["user", "agent", "mozilla"]
The :lowercase strategy always lowercases text, regardless of the config.lowercase setting. If you need control over lowercasing, use the :letter strategy with config.lowercase = true or false.
Perfect for case-insensitive search indexing, normalizing product codes, and cleaning social media text. Handles Unicode correctly, including characters that lowercase to multiple characters (e.g., Turkish İ).
Preserve domain-specific terms even when lowercasing.
Fully Supported by: Unicode, Pattern, Whitespace, Letter, Lowercase, Sentence, and URL/Email tokenizers.
Partially Supported by: Character Group (works best with whitespace delimiters) and Path Hierarchy (limitations with hierarchical structure) tokenizers.
Not Supported by: Grapheme, Keyword, Edge N-gram, and N-gram tokenizers.
TokenKit.configure do |config|
config.strategy = :unicode
config.lowercase = true
config.preserve_patterns = [
/\d+(ug|mg|ml|units)/i, # Measurements: 100ug, 50mg
/anti-cd\d+/i, # Antibodies: Anti-CD3, anti-CD28
/[A-Z][A-Z0-9]+/ # Gene names: BRCA1, TP53, EGFR
]
end
text = "Patient received 100ug Anti-CD3 with BRCA1 mutation"
tokens = TokenKit.tokenize(text)
# => ["patient", "received", "100ug", "Anti-CD3", "with", "BRCA1", "mutation"]
Pattern matches maintain their original case despite lowercase = true.
TokenKit supports Ruby regex flags for both preserve_patterns and the :pattern strategy:
# Case-insensitive matching (i flag)
TokenKit.configure do |config|
config.preserve_patterns = [/gene-\d+/i]
end
TokenKit.tokenize("Found GENE-123 and gene-456")
# => ["found", "GENE-123", "and", "gene-456"]
# Multiline mode (m flag) - dot matches newlines
TokenKit.configure do |config|
config.strategy = :pattern
config.regex = /test./m
end
# Extended mode (x flag) - allows comments and whitespace
pattern = /
\w+ # word characters
@ # at sign
\w+\.\w+ # domain.tld
/x
TokenKit.configure do |config|
config.preserve_patterns = [pattern]
end
# Combine flags
TokenKit.configure do |config|
config.preserve_patterns = [/code-\d+/im] # case-insensitive + multiline
end
Supported flags:
- i - Case-insensitive matching
- m - Multiline mode (. matches newlines)
- x - Extended mode (ignore whitespace, allow comments)
Flags work with both Regexp objects and string patterns passed to the :pattern strategy.
TokenKit.configure do |config|
config.strategy = :unicode # :whitespace, :unicode, :pattern, :sentence, :grapheme, :keyword, :edge_ngram, :ngram, :path_hierarchy, :url_email, :char_group, :letter, :lowercase
config.lowercase = true # Normalize to lowercase
config.remove_punctuation = false # Remove punctuation from tokens
config.preserve_patterns = [] # Regex patterns to preserve
# Strategy-specific options
config.regex = /\w+/ # Only for :pattern strategy
config.grapheme_extended = true # Only for :grapheme strategy (default: true)
config.min_gram = 2 # For :edge_ngram and :ngram strategies (default: 2)
config.max_gram = 10 # For :edge_ngram and :ngram strategies (default: 10)
config.delimiter = "/" # Only for :path_hierarchy strategy (default: "/")
config.split_on_chars = " \t\n\r" # Only for :char_group strategy (default: whitespace)
end
Override global config for specific calls:
# Override general options
TokenKit.tokenize("BRCA1 Gene", lowercase: false)
# => ["BRCA1", "Gene"]
# Override strategy-specific options
TokenKit.tokenize("laptop", strategy: :edge_ngram, min_gram: 3, max_gram: 5)
# => ["lap", "lapt", "lapto"]
TokenKit.tokenize("C:\\Windows\\System", strategy: :path_hierarchy, delimiter: "\\")
# => ["C:", "C:\\Windows", "C:\\Windows\\System"]
# Combine multiple overrides
TokenKit.tokenize(
"TEST",
strategy: :edge_ngram,
min_gram: 2,
max_gram: 3,
lowercase: false
)
# => ["TE", "TES"]
All strategy-specific options can be overridden per-call (see the sketch after this list):
- :pattern - regex: /pattern/
- :grapheme - extended: true/false
- :edge_ngram - min_gram: n, max_gram: n
- :ngram - min_gram: n, max_gram: n
- :path_hierarchy - delimiter: "/"
- :char_group - split_on_chars: ",;"
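For example, a brief sketch of two per-call overrides not shown above; the expected outputs follow the strategy semantics described earlier and assume the default global configuration:
# Character group: split only on commas for this call
TokenKit.tokenize("apple,banana cherry", strategy: :char_group, split_on_chars: ",")
# => ["apple", "banana cherry"]
# N-gram: bigrams only for this call
TokenKit.tokenize("ruby", strategy: :ngram, min_gram: 2, max_gram: 2)
# => ["ru", "ub", "by"]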
config = TokenKit.config_hash
# Returns a Configuration object with accessor methods
config.strategy # => :unicode
config.lowercase # => true
config.remove_punctuation # => false
config.preserve_patterns # => [...]
# Strategy predicates
config.edge_ngram? # => false
config.ngram? # => false
config.pattern? # => false
config.grapheme? # => false
config.path_hierarchy? # => false
config.char_group? # => false
config.letter? # => false
config.lowercase? # => false
# Strategy-specific accessors
config.min_gram # => 2 (for edge_ngram and ngram)
config.max_gram # => 10 (for edge_ngram and ngram)
config.delimiter # => "/" (for path_hierarchy)
config.split_on_chars # => " \t\n\r" (for char_group)
config.extended # => true (for grapheme)
config.regex # => "..." (for pattern)
# Convert to hash if needed
config.to_h
# => {"strategy" => "unicode", "lowercase" => true, ...}
# Reset configuration to defaults
TokenKit.reset
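A minimal sketch of reset in action, assuming TokenKit.reset restores the defaults shown in the configuration reference above (:unicode strategy, lowercasing enabled):
TokenKit.configure do |config|
  config.strategy = :whitespace
  config.lowercase = false
end
TokenKit.config_hash.strategy # => :whitespace
TokenKit.reset
TokenKit.config_hash.strategy # => :unicode
TokenKit.config_hash.lowercase # => true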
TokenKit.configure do |config|
config.strategy = :unicode
config.lowercase = true
config.preserve_patterns = [
/\d+(ug|mg|ml|ul|units)/i, # Measurements
/anti-[a-z0-9-]+/i, # Antibodies
/[A-Z][A-Z0-9]+/, # Gene names (CDK10, BRCA1, TP53)
/cd\d+/i, # Cell markers (CD3, CD4, CD8)
/ig[gmaed]/i # Immunoglobulins (IgG, IgM)
]
end
text = "Anti-CD3 IgG antibody 100ug for BRCA1 research"
tokens = TokenKit.tokenize(text)
# => ["Anti-CD3", "IgG", "antibody", "100ug", "for", "BRCA1", "research"]
TokenKit.configure do |config|
config.strategy = :unicode
config.lowercase = true
config.preserve_patterns = [
/\$\d+(\.\d{2})?/, # Prices: $99.99
/\d+(-\d+)+/, # SKUs: 123-456-789
/\d+(mm|cm|inch)/i # Dimensions: 10mm, 5cm
]
end
text = "Widget $49.99 SKU: 123-456 size: 10cm"
tokens = TokenKit.tokenize(text)
# => ["widget", "$49.99", "sku", "123-456", "size", "10cm"]
# Exact matching with case normalization
TokenKit.configure do |config|
config.strategy = :lowercase
config.lowercase = true
end
# Index time: normalize documents
doc_tokens = TokenKit.tokenize("Product Code: ABC-123")
# => ["product", "code", "abc"]
# Query time: normalize user input
query_tokens = TokenKit.tokenize("product abc")
# => ["product", "abc"]
# Fuzzy matching with n-grams
TokenKit.configure do |config|
config.strategy = :ngram
config.min_gram = 2
config.max_gram = 4
config.lowercase = true
end
# Index time: generate n-grams
TokenKit.tokenize("search")
# => ["se", "ea", "ar", "rc", "ch", "sea", "ear", "arc", "rch", "sear", "earc", "arch"]
# Query time: typo "serch" still has significant overlap
TokenKit.tokenize("serch")
# => ["se", "er", "rc", "ch", "ser", "erc", "rch", "serc", "erch"]
# Overlap: ["se", "rc", "ch", "rch"] allows matching despite typo
# Autocomplete with edge n-grams
TokenKit.configure do |config|
config.strategy = :edge_ngram
config.min_gram = 2
config.max_gram = 10
end
TokenKit.tokenize("laptop")
# => ["la", "lap", "lapt", "lapto", "laptop"]
# Matches "la", "lap", "lapt" as user types
TokenKit has been extensively optimized for production use:
- Unicode tokenization: ~870K tokens/sec (baseline)
- Pattern preservation: ~410K tokens/sec with 4 patterns (was 3.6K/sec before v0.3.0 optimizations)
- Memory efficient: Pre-allocated buffers and in-place operations
- Thread-safe: Cached instances with mutex protection, safe for concurrent use (see the sketch below)
- 110x speedup: For pattern-heavy workloads through intelligent caching
Key optimizations:
- Regex patterns compiled once and cached (not per-tokenization)
- String allocations minimized through index-based operations
- Tokenizer instances reused across calls
- In-place post-processing for lowercase and punctuation removal
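As an illustration of thread safety, here is a minimal sketch that tokenizes from several threads sharing the global configuration; the outputs match the single-threaded examples above:
require "tokenkit"
TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
end
texts = ["Hello, world! café can't", "Don't worry about café!"]
# Each thread calls TokenKit.tokenize concurrently; the cached tokenizer
# is protected internally, so no external locking is needed.
threads = texts.map { |text| Thread.new { TokenKit.tokenize(text) } }
threads.map(&:value)
# => [["hello", "world", "café", "can't"], ["don't", "worry", "about", "café"]]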
See the Performance Guide for detailed benchmarks and optimization techniques.
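For a quick local measurement, here is a rough micro-benchmark sketch using the benchmark-ips gem (not a TokenKit dependency; install it separately):
require "tokenkit"
require "benchmark/ips"
text = "Patient received 100ug Anti-CD3 with BRCA1 mutation"
TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
end
Benchmark.ips do |x|
  # Measures end-to-end tokenization throughput on a short sentence
  x.report("unicode tokenize") { TokenKit.tokenize(text) }
end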
TokenKit is designed to work with other gems in the scientist-labs ecosystem:
- PhraseKit: Use TokenKit for consistent phrase extraction
- SpellKit: Tokenize before spell correction
- red-candle: Tokenize before NER/embeddings
- API Documentation - Full API reference
- Architecture Guide - Internal design and structure
- Performance Guide - Benchmarks and optimization details
# Install documentation dependencies
bundle install
# Generate YARD documentation
bundle exec yard doc
# Open documentation in browser
open doc/index.html
# Setup
bundle install
bundle exec rake compile
# Run tests
bundle exec rspec
# Run tests with coverage
COVERAGE=true bundle exec rspec
# Run linter
bundle exec standardrb
# Run benchmarks
ruby benchmarks/tokenizer_benchmark.rb
# Build gem
gem build tokenkit.gemspec
- Ruby >= 3.1.0
- Rust toolchain (for building from source)
MIT License. See LICENSE.txt for details.
Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/tokenkit.
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
Built with:
- Magnus for Ruby-Rust bindings
- unicode-segmentation for Unicode word boundaries
- linkify for robust URL and email detection
- regex for pattern matching