Worka PII

Worka PII is a Rust library for detecting and anonymizing personally identifiable information (PII). It provides deterministic, capability-aware NLP pipelines designed to run on CPU-only environments with explicit auditability and controlled degradation when language features are unavailable.

This crate was extracted from the Worka internal monorepo to become a standalone, reusable component. The APIs and the RFCs are maintained here to support independent development and external adoption.

Features

Deterministic PII detection with stable byte offsets
Regex, validator, dictionary, and NER-backed recognizers
Capability-aware pipeline (tokenization, lemma, POS, NER)
Configurable anonymization operators (redact, mask, replace, hash)
Optional Candle-based NER via candle-ner feature

Examples

cargo run --example redact
cargo run --example extract

Redaction Example

use pii::anonymize::{AnonymizeConfig, Anonymizer};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), pii::anonymize::Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), pii::anonymize::Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let redacted = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
assert!(redacted.text.contains("<EMAIL>"));
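
For readers curious how span-based replacement can work in general, here is a minimal, self-contained sketch. It is not the crate's implementation: the `Span` type and `apply_replacements` function are hypothetical names introduced for illustration, and the sketch assumes the spans do not overlap. It applies replacements from the last span to the first, so the byte offsets of earlier spans stay valid while the string changes length.

```rust
/// A detected span: byte offsets into the original text plus its replacement.
/// Hypothetical type for illustration; not the crate's own entity type.
struct Span {
    start: usize,
    end: usize,
    replacement: String,
}

/// Apply replacements from the last span to the first so that earlier
/// byte offsets remain valid. Assumes non-overlapping spans that fall
/// on UTF-8 character boundaries.
fn apply_replacements(text: &str, spans: &[Span]) -> String {
    let mut ordered: Vec<&Span> = spans.iter().collect();
    ordered.sort_by_key(|s| s.start);
    let mut out = text.to_string();
    for span in ordered.iter().rev() {
        out.replace_range(span.start..span.end, &span.replacement);
    }
    out
}

fn main() {
    let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
    let email = "jane@example.com";
    let phone = "+1 415-555-1212";
    let email_start = text.find(email).unwrap();
    let phone_start = text.find(phone).unwrap();
    let spans = vec![
        Span {
            start: email_start,
            end: email_start + email.len(),
            replacement: "<EMAIL>".to_string(),
        },
        Span {
            start: phone_start,
            end: phone_start + phone.len(),
            replacement: "<PHONE>".to_string(),
        },
    ];
    // prints "Contact Jane at <EMAIL> or <PHONE>."
    println!("{}", apply_replacements(text, &spans));
}
```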

Span Extraction Example

This example keeps the input text intact and uses the detected spans directly.

use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Reach me at jane@example.com from 10.0.0.5.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

for detection in &result.entities {
    let span = &text[detection.start..detection.end];
    println!(
        "type={} start={} end={} value={}",
        detection.entity_type.as_str(),
        detection.start,
        detection.end,
        span
    );
}
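
Because the example slices `text` with `detection.start..detection.end`, the offsets behave as byte offsets, not character indices. The difference only shows up with non-ASCII input. The following standalone sketch uses the standard library alone (the helper `byte_and_char_offset` is a hypothetical name, not part of the crate) to show how the two diverge.

```rust
/// Return the byte offset and the character index at which `needle`
/// first occurs in `text`. Hypothetical helper for illustration only.
fn byte_and_char_offset(text: &str, needle: &str) -> Option<(usize, usize)> {
    let byte_start = text.find(needle)?;
    // Count the characters that precede the match to get the char index.
    let char_start = text[..byte_start].chars().count();
    Some((byte_start, char_start))
}

fn main() {
    // "\u{eb}" is the two-byte UTF-8 character 'ë'.
    let text = "Zo\u{eb}: zoe@example.com";
    let needle = "zoe@example.com";
    let (byte_start, char_start) = byte_and_char_offset(text, needle).unwrap();

    // Slicing a &str always uses byte offsets.
    let span = &text[byte_start..byte_start + needle.len()];
    // prints "byte=6 char=5 value=zoe@example.com"
    println!("byte={} char={} value={}", byte_start, char_start, span);
}
```

If you pass offsets to a system that counts characters or UTF-16 code units (for example, a browser front end), convert them first rather than reusing the byte values.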

Custom Operators + Audit Log Example

This example applies per-entity operators and emits a simple audit log that records the original value alongside the replacement.

use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Email jane@example.com or call +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let anonymized = Anonymizer::anonymize(text, &result.entities, &config).unwrap();

for item in &anonymized.items {
    let original = &text[item.entity.start..item.entity.end];
    println!(
        "type={} value={} replacement={}",
        item.entity.entity_type.as_str(),
        original,
        item.replacement
    );
}

Supported Entity Types (Built-in)

The following entity types are supported out of the box via built-in recognizers:

Email
Phone
IpAddress (IPv4)
Ipv6
CreditCard
Iban
Ssn
Itin
TaxId
Passport
DriverLicense
BankAccount
RoutingNumber
CryptoAddress
MacAddress
Uuid
Vin
Imei
Url
Domain
Hostname

The following types are supported when a NER engine is enabled:

Person
Location
Organization

Custom Entities and Recognizers

You can add custom entities and recognizers to the pipeline.

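use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::recognizers::regex::RegexRecognizer;
use pii::types::EntityType;
use pii::{Analyzer, PolicyConfig};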

let mut recognizers = default_recognizers();
let employee_id = RegexRecognizer::new(
    "regex_employee_id",
    EntityType::Custom("EmployeeId".to_string()),
    r"\bEMP-\d{4}\b",
    0.7,
    "employee_id",
).unwrap();
recognizers.push(Box::new(employee_id));

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    recognizers,
    Vec::new(),
    PolicyConfig::default(),
);

Custom Pipeline

The pipeline is fully customizable: you can supply your own NLP engine, recognizers, and context enhancers.

Implement NlpEngine if you want custom tokenization, lemma/POS, or NER.
Add domain-specific recognizers and context enhancers for tuned detection.
Swap the default recognizers with your own curated set for strict control.
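
The exact signature of the NlpEngine trait is not reproduced in this README, so the sketch below does not implement it. It only illustrates the core piece a custom engine would need to supply: tokenization that preserves byte offsets into the original text. The `Token` struct and `tokenize` function are hypothetical names, and splitting on whitespace is a simplifying assumption.

```rust
/// A token with byte offsets into the source text.
/// Hypothetical type for illustration; not the crate's token type.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
    start: usize,
    end: usize,
}

/// Split on Unicode whitespace while recording where each token starts
/// and ends, so downstream recognizers can report stable offsets.
fn tokenize(text: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (idx, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // Whitespace closes the token in progress, if any.
            if let Some(s) = start.take() {
                tokens.push(Token { text: &text[s..idx], start: s, end: idx });
            }
        } else if start.is_none() {
            start = Some(idx);
        }
    }
    // Flush a trailing token that runs to the end of the input.
    if let Some(s) = start {
        tokens.push(Token { text: &text[s..], start: s, end: text.len() });
    }
    tokens
}

fn main() {
    for token in tokenize("call  EMP-1234") {
        println!("{:?}", token);
    }
}
```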

Language Support and Degradation

The default SimpleNlpEngine is language-agnostic and provides tokenization plus sentence splitting for any language tag. For EN/DE/ES, you can provide richer language profiles and context terms to improve recall.

For unsupported languages:

Regex and validator recognizers still work (language-neutral).
Lemma/POS/NER capabilities will be absent unless your NlpEngine provides them.
Context enhancement falls back to surface terms when lemma is unavailable.
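
The fallback in the last point can be pictured with a small standalone sketch. It is an assumption about the general shape of such logic, not the crate's code: `Token`, `context_key`, and `has_context` are hypothetical names, and the lowercase comparison is a simplification.

```rust
/// A token whose lemma may be missing when the engine lacks that capability.
/// Hypothetical type for illustration; not the crate's token type.
struct Token {
    surface: String,
    lemma: Option<String>,
}

/// Prefer the lemma when available; otherwise fall back to the surface form.
fn context_key(token: &Token) -> String {
    token.lemma.as_deref().unwrap_or(&token.surface).to_lowercase()
}

/// True if any token matches one of the (lowercase) context terms.
fn has_context(tokens: &[Token], context_terms: &[&str]) -> bool {
    tokens
        .iter()
        .any(|t| context_terms.contains(&context_key(t).as_str()))
}

fn main() {
    let with_lemma = vec![Token {
        surface: "Called".to_string(),
        lemma: Some("call".to_string()),
    }];
    let without_lemma = vec![Token {
        surface: "Called".to_string(),
        lemma: None,
    }];

    // With a lemma, the inflected form matches the base term "call".
    println!("{}", has_context(&with_lemma, &["call"])); // prints "true"
    // Without a lemma, only the surface form "called" can match.
    println!("{}", has_context(&without_lemma, &["call"])); // prints "false"
    println!("{}", has_context(&without_lemma, &["called"])); // prints "true"
}
```

Under this reading, a language without lemma support still benefits from context terms, but the term list needs to include inflected forms to reach comparable recall.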

Adding Languages

To add a new language with higher fidelity:

Implement or integrate an NlpEngine that can emit token offsets, lemmas, POS tags, and/or NER.
Provide a LanguageProfile with context terms for that language.
Attach those to the analyzer via your pipeline configuration.

Specification

The full specification is in docs/rfc-1200-pii.md and defines the data model, pipeline behavior, capability reporting, and conformance requirements.

Tests

cargo test

Candle NER tests are ignored by default and require --features candle-ner plus a model:

PII_CANDLE_MODEL_DIR=/path/to/model \
  cargo test --features candle-ner --test candle_ner -- --ignored

You can also set PII_CANDLE_MODEL_ID to download a model via hf-hub.

Benchmarks

cargo bench

License

Licensed under either of:

Apache License, Version 2.0
MIT license
