
A library to identify and help redact Personally Identifiable Information (PII) from text. It gives you deterministic PII detection and anonymization in Rust (CPU‑only, capability‑aware).

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

worka-ai/pii

Worka PII


Worka PII is a Rust library for detecting and anonymizing personally identifiable information (PII). It provides deterministic, capability-aware NLP pipelines designed to run on CPU-only environments with explicit auditability and controlled degradation when language features are unavailable.

This crate was extracted from the Worka internal monorepo to become a standalone, reusable component. The APIs and the RFCs are maintained here to support independent development and external adoption.

Features

  • Deterministic PII detection with stable byte offsets
  • Regex, validator, dictionary, and NER-backed recognizers
  • Capability-aware pipeline (tokenization, lemma, POS, NER)
  • Configurable anonymization operators (redact, mask, replace, hash)
  • Optional Candle-based NER via candle-ner feature
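A validator recognizer typically confirms a pattern match with a checksum before reporting it. The sketch below shows the Luhn check, which is commonly used for payment card numbers. The function name luhn_valid is hypothetical, and whether the built-in CreditCard recognizer uses exactly this check is an assumption, not something this README states.

```rust
/// Hypothetical helper, not part of the crate's API.
/// Runs the Luhn checksum over the digits in `candidate`, ignoring spaces and dashes.
/// Returns false if any other character appears or fewer than two digits are present.
fn luhn_valid(candidate: &str) -> bool {
    let mut digits = Vec::new();
    for c in candidate.chars() {
        match c {
            ' ' | '-' => continue,
            _ => match c.to_digit(10) {
                Some(d) => digits.push(d),
                None => return false,
            },
        }
    }
    if digits.len() < 2 {
        return false;
    }
    // Walk from the rightmost digit; double every second digit and fold values above 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

fn main() {
    // A widely used test card number passes; changing the last digit breaks the checksum.
    println!("{}", luhn_valid("4111 1111 1111 1111"));
    println!("{}", luhn_valid("4111 1111 1111 1112"));
}
```

Pairing a permissive pattern with a checksum like this is one way to keep false positives low without any language-specific processing.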

Examples

cargo run --example redact
cargo run --example extract

Redaction Example

use pii::anonymize::{AnonymizeConfig, Anonymizer};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), pii::anonymize::Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), pii::anonymize::Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let redacted = Anonymizer::anonymize(text, &result.entities, &config).unwrap();
assert!(redacted.text.contains("<EMAIL>"));
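To illustrate why byte-offset spans make anonymization straightforward, here is a minimal std-only sketch that rewrites a string from a list of spans. The Span type and apply_replacements function are hypothetical and only approximate the idea; the crate's Anonymizer may work differently.

```rust
/// Hypothetical span type for illustration; the crate's own types may differ.
struct Span {
    start: usize, // byte offset, inclusive
    end: usize,   // byte offset, exclusive
    replacement: String,
}

/// Replace each span in `text`, working from the last span to the first so that
/// earlier byte offsets stay valid while the string changes length.
/// Assumes spans do not overlap and fall on UTF-8 character boundaries.
fn apply_replacements(text: &str, spans: &[Span]) -> String {
    let mut ordered: Vec<&Span> = spans.iter().collect();
    ordered.sort_by(|a, b| b.start.cmp(&a.start));

    let mut out = text.to_string();
    for span in ordered {
        out.replace_range(span.start..span.end, &span.replacement);
    }
    out
}

fn main() {
    let text = "Contact Jane at jane@example.com or +1 415-555-1212.";
    let email = "jane@example.com";
    let phone = "+1 415-555-1212";
    // Locate the values by plain substring search, standing in for real detection.
    let e = text.find(email).unwrap();
    let p = text.find(phone).unwrap();
    let spans = vec![
        Span { start: e, end: e + email.len(), replacement: "<EMAIL>".into() },
        Span { start: p, end: p + phone.len(), replacement: "<PHONE>".into() },
    ];
    println!("{}", apply_replacements(text, &spans));
}
```

Processing spans right to left avoids recomputing offsets after each substitution, which is the main design point of the sketch.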

Span Extraction Example

This example keeps the input text intact and uses the detected spans directly.

use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Reach me at jane@example.com from 10.0.0.5.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

for detection in &result.entities {
    let span = &text[detection.start..detection.end];
    println!(
        "type={} start={} end={} value={}",
        detection.entity_type.as_str(),
        detection.start,
        detection.end,
        span
    );
}
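Because the loop above slices the input with detection.start and detection.end, those values are byte offsets rather than character indices. The std-only sketch below shows how the two diverge once the text contains multi-byte characters. The byte_span helper is hypothetical and uses substring search in place of real detection.

```rust
/// Hypothetical helper (not part of the crate): byte span of `needle` within `text`.
fn byte_span(text: &str, needle: &str) -> Option<(usize, usize)> {
    let start = text.find(needle)?;
    Some((start, start + needle.len()))
}

fn main() {
    // "é" occupies two bytes in UTF-8, so byte and character positions diverge after it.
    let text = "Café: jane@example.com";
    let (start, end) = byte_span(text, "jane@example.com").expect("needle is present");

    // Slicing a &str uses byte offsets, so the span can be used directly.
    assert_eq!(&text[start..end], "jane@example.com");

    // The character index of the same position is smaller than the byte offset here.
    let char_index = text[..start].chars().count();
    println!("byte start={start} char start={char_index}");
}
```

If a downstream consumer expects character indices (for example a UI that highlights text), convert the offsets explicitly rather than passing them through unchanged.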

Custom Operators + Audit Log Example

This example applies per-entity operators and emits a simple audit log that records the original value alongside the replacement.

use pii::anonymize::{AnonymizeConfig, Anonymizer, Operator};
use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::{Analyzer, PolicyConfig};
use pii::types::Language;
use std::collections::HashMap;

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    default_recognizers(),
    Vec::new(),
    PolicyConfig::default(),
);

let text = "Email jane@example.com or call +1 415-555-1212.";
let result = analyzer.analyze(text, &Language::from("en")).unwrap();

let mut config = AnonymizeConfig::default();
let mut per_entity = HashMap::new();
per_entity.insert("Email".to_string(), Operator::Replace { with: "<EMAIL>".into() });
per_entity.insert("Phone".to_string(), Operator::Mask { ch: '*', from_end: 4 });
config.per_entity = per_entity;

let anonymized = Anonymizer::anonymize(text, &result.entities, &config).unwrap();

for item in &anonymized.items {
    let original = &text[item.entity.start..item.entity.end];
    println!(
        "type={} value={} replacement={}",
        item.entity.entity_type.as_str(),
        original,
        item.replacement
    );
}

Supported Entity Types (Built-in)

The following entity types are supported out of the box via built-in recognizers:

  • Email
  • Phone
  • IpAddress (IPv4)
  • Ipv6
  • CreditCard
  • Iban
  • Ssn
  • Itin
  • TaxId
  • Passport
  • DriverLicense
  • BankAccount
  • RoutingNumber
  • CryptoAddress
  • MacAddress
  • Uuid
  • Vin
  • Imei
  • Url
  • Domain
  • Hostname

The following types are supported when a NER engine is enabled:

  • Person
  • Location
  • Organization
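One way to picture the split between always-available and NER-dependent entity types is a capability check that decides which recognizers run. The sketch below is an illustration under assumptions: Capabilities, RecognizerSpec, and active_entities are hypothetical names, and the crate's actual capability reporting is defined in its specification.

```rust
/// Hypothetical capability flags; names are illustrative, not the crate's API.
#[derive(Clone, Copy)]
struct Capabilities {
    lemma: bool,
    pos: bool,
    ner: bool,
}

impl Capabilities {
    /// True when every capability in `required` is also available in `self`.
    fn satisfies(self, required: Capabilities) -> bool {
        (!required.lemma || self.lemma)
            && (!required.pos || self.pos)
            && (!required.ner || self.ner)
    }
}

/// Hypothetical description of an entity type and what its recognizer needs.
struct RecognizerSpec {
    entity: &'static str,
    requires: Capabilities,
}

/// Entity types whose recognizers can run on an engine with the given capabilities.
fn active_entities(engine: Capabilities, specs: &[RecognizerSpec]) -> Vec<&'static str> {
    specs
        .iter()
        .filter(|spec| engine.satisfies(spec.requires))
        .map(|spec| spec.entity)
        .collect()
}

fn main() {
    let none = Capabilities { lemma: false, pos: false, ner: false };
    let ner_only = Capabilities { lemma: false, pos: false, ner: true };
    let specs = [
        RecognizerSpec { entity: "Email", requires: none },
        RecognizerSpec { entity: "Person", requires: ner_only },
    ];
    // Without NER only the pattern-based entity remains active.
    println!("{:?}", active_entities(none, &specs));
    println!("{:?}", active_entities(ner_only, &specs));
}
```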

Custom Entities and Recognizers

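You can add custom entities and recognizers to the pipeline by extending the recognizer list before constructing the analyzer.

use pii::nlp::SimpleNlpEngine;
use pii::presets::default_recognizers;
use pii::recognizers::regex::RegexRecognizer;
use pii::types::EntityType;
use pii::{Analyzer, PolicyConfig};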

let mut recognizers = default_recognizers();
let employee_id = RegexRecognizer::new(
    "regex_employee_id",
    EntityType::Custom("EmployeeId".to_string()),
    r"\bEMP-\d{4}\b",
    0.7,
    "employee_id",
).unwrap();
recognizers.push(Box::new(employee_id));

let analyzer = Analyzer::new(
    Box::new(SimpleNlpEngine::default()),
    recognizers,
    Vec::new(),
    PolicyConfig::default(),
);

Custom Pipeline

The pipeline is fully customizable: you can supply your own NLP engine, recognizers, and context enhancers.

  • Implement NlpEngine if you want custom tokenization, lemma/POS, or NER.
  • Add domain-specific recognizers and context enhancers for tuned detection.
  • Swap the default recognizers with your own curated set for strict control.

Language Support and Degradation

The default SimpleNlpEngine is language-agnostic and provides tokenization plus sentence splitting for any language tag. For EN/DE/ES, you can provide richer language profiles and context terms to improve recall.

For unsupported languages:

  • Regex and validator recognizers still work (language-neutral).
  • Lemma/POS/NER capabilities will be absent unless your NlpEngine provides them.
  • Context enhancement falls back to surface terms when lemma is unavailable.
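The fallback to surface terms can be sketched as follows. This is a minimal illustration, assuming a token carries an optional lemma; the Token type, context_key, and has_context are hypothetical and do not reflect the crate's internal representation.

```rust
/// Hypothetical token type; the crate's real token representation may differ.
struct Token {
    surface: String,
    lemma: Option<String>, // None when the engine provides no lemmatizer
}

/// Term used for context matching: the lemma if present, else the surface form.
/// Both are lowercased so matching is case-insensitive.
fn context_key(token: &Token) -> String {
    match &token.lemma {
        Some(lemma) => lemma.to_lowercase(),
        None => token.surface.to_lowercase(),
    }
}

/// True if any token's context key appears in `context_terms`.
fn has_context(tokens: &[Token], context_terms: &[&str]) -> bool {
    tokens.iter().any(|token| {
        let key = context_key(token);
        context_terms.iter().any(|term| *term == key)
    })
}

fn main() {
    let with_lemma = [Token { surface: "Emails".into(), lemma: Some("email".into()) }];
    let without_lemma = [Token { surface: "Emails".into(), lemma: None }];

    // With a lemma, the inflected form still matches the base context term.
    println!("{}", has_context(&with_lemma, &["email"]));
    // Without a lemma, only an exact surface term matches.
    println!("{}", has_context(&without_lemma, &["email"]));
    println!("{}", has_context(&without_lemma, &["email", "emails"]));
}
```

In this model, a language without lemma support needs inflected forms listed explicitly among its context terms to reach comparable recall.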

Adding Languages

To add a new language with higher fidelity:

  1. Implement or integrate an NlpEngine that can emit token offsets, lemmas, POS tags, and/or NER.
  2. Provide a LanguageProfile with context terms for that language.
  3. Attach those to the analyzer via your pipeline configuration.

Specification

The full specification is in docs/rfc-1200-pii.md and defines the data model, pipeline behavior, capability reporting, and conformance requirements.

Tests

cargo test

Benchmarks

cargo bench

Candle NER tests are ignored by default and require --features candle-ner plus a model:

PII_CANDLE_MODEL_DIR=/path/to/model \
  cargo test --features candle-ner --test candle_ner -- --ignored

You can also set PII_CANDLE_MODEL_ID to download a model via hf-hub.

License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT license
