patons02/pd-anonymiser

pd-anonymiser

A privacy-focused tool to anonymise and optionally re-identify personal data using Microsoft Presidio.

Supports NER with Hugging Face and spaCy transformer models, pseudonym mapping, and encrypted, reversible transformations.


🚀 Quick Start

git clone https://github.com/patons02/pd-anonymiser.git
cd pd-anonymiser

# Create and activate Python 3.10 virtual environment
python3.10 -m venv .venv
source .venv/bin/activate

# Install all dependencies (prod + dev)
make install-dev

# Download models (once)
make download-models

Core Library Usage

from pd_anonymiser.anonymiser import anonymise_text
from pd_anonymiser.reidentifier import reidentify_text

# Anonymise input text (the result exposes text, session_id and key)
result = anonymise_text(
    "Alice from Acme Corp emailed Bob at 5pm.",
    allow_reidentification=True
)
print(result.text, result.session_id, result.key)

# Re-identify the text using the session map
original = reidentify_text(
    result.text,
    result.session_id,
    result.key
)
print(original)

🧪 Run Examples

With Re-identification

python sample/reidentification.py

🔒 Without Re-identification (e.g. irreversible UUIDs)

python sample/no_reidentification.py

💻 MCP Server

The MCP server implements the Model Context Protocol for three-step pipelines:

  1. Anonymisation (resource)
  2. ChatGPT call via OpenAI (tool)
  3. Re-identification (resource)

The repo ships an MCP server and a sample client.
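
The three steps can be sketched as a single function. This is a minimal illustration, not the server's actual code: `run_pipeline` and the three `fake_*` stand-ins are hypothetical names, and the stand-ins replace the real anonymiser, LLM call and re-identifier so the example runs without models or an API key.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures for the three pipeline stages.
Anonymise = Callable[[str], Tuple[str, str, str]]  # text -> (anonymised, session_id, key)
CallLLM = Callable[[str], str]                     # anonymised prompt -> anonymised reply
Reidentify = Callable[[str, str, str], str]        # (text, session_id, key) -> restored text


def run_pipeline(text: str, anonymise: Anonymise, call_llm: CallLLM,
                 reidentify: Reidentify) -> str:
    """Anonymise the input, send only the anonymised text to the LLM,
    then restore the original values in the reply."""
    anonymised, session_id, key = anonymise(text)
    reply = call_llm(anonymised)
    return reidentify(reply, session_id, key)


# Toy stand-ins (assumptions for illustration only).
seen_by_llm: List[str] = []


def fake_anonymise(text: str) -> Tuple[str, str, str]:
    return text.replace("Alice", "Person A"), "session-1", "key-1"


def fake_llm(prompt: str) -> str:
    seen_by_llm.append(prompt)  # record what the model actually received
    return f"Reply to: {prompt}"


def fake_reidentify(text: str, session_id: str, key: str) -> str:
    return text.replace("Person A", "Alice")


result = run_pipeline("Alice says hi", fake_anonymise, fake_llm, fake_reidentify)
print(result)       # Reply to: Alice says hi
print(seen_by_llm)  # ['Person A says hi']
```

Passing the stages in as callables keeps the personal data out of the LLM call by construction: the model only ever receives the anonymised string.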

Launch the server

python src/pd_anonymiser_mcp/server.py \
  --transport streamable-http --host 0.0.0.0 --port 9000 --path /mcp

By default it exposes JSON‑RPC over HTTP at http://localhost:9000/mcp.

📑 What the server exposes

| Type | Name | URI / behaviour |
| --- | --- | --- |
| resource | anonymisation | mcp://pd-anonymiser/anonymisation?text={text}&allow_reidentification={allow_reidentification} → returns { anonymised_text, session_id, key } |
| resource | reidentification | mcp://pd-anonymiser/reidentification?text={text}&session_id={session_id}&key={key} → returns { reidentified_text } |
| tool | execute‑prompt‑with‑anonymisation | Takes raw text, internally (1) anonymises it, (2) calls the client’s LLM via ctx.sample(), (3) returns { llm_response_anonymised, session_id, key } |
| prompt | anonymisePrompt | Prompt template that forces any assistant to strip personal data in both input & output |
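
Because both resource URIs carry their arguments as query parameters, free text and keys need URL-encoding before they are substituted into the templates. The helpers below are a sketch of one way to build them with the standard library. The function names are hypothetical, and the lowercase `true`/`false` rendering of the boolean is an assumption about what the server accepts.

```python
from urllib.parse import quote, urlencode

BASE = "mcp://pd-anonymiser"


def anonymisation_uri(text: str, allow_reidentification: bool = True) -> str:
    """Fill the anonymisation URI template with percent-encoded values."""
    query = urlencode(
        {
            "text": text,
            # Assumption: the server parses lowercase "true"/"false".
            "allow_reidentification": str(allow_reidentification).lower(),
        },
        quote_via=quote,  # encode spaces as %20 rather than "+"
    )
    return f"{BASE}/anonymisation?{query}"


def reidentification_uri(text: str, session_id: str, key: str) -> str:
    """Fill the re-identification URI template with percent-encoded values."""
    query = urlencode(
        {"text": text, "session_id": session_id, "key": key},
        quote_via=quote,
    )
    return f"{BASE}/reidentification?{query}"


print(anonymisation_uri("Hi Bob & co"))
# mcp://pd-anonymiser/anonymisation?text=Hi%20Bob%20%26%20co&allow_reidentification=true
```

Encoding matters most for the key: base64-style keys often end in `=`, which would otherwise be read as a query-string separator.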

Launch with default settings

python src/pd_anonymiser_mcp/server.py

Without flags, the server listens on http://0.0.0.0:9000 and serves JSON-RPC over HTTP.

👾 Use in VSCode Agent Mode

  1. Install the GitHub Copilot Chat extension.
  2. In the Chat panel choose Agent mode → Tools ➜ Add MCP Server → URL http://localhost:9000/mcp.
  3. Enable the pd-anonymiser.* tools and experiment interactively: anonymise → chat → re‑identify, all inside VSCode.

Example: curl-based Pipeline

# 1) Anonymise input
curl -s http://localhost:9000/mcp \
  -H "Content-Type: application/json" \
  -d '{
      "jsonrpc":"2.0",
      "method":"invoke",
      "params":{
        "toolId":"mcp://pd-anonymiser/anonymisation",
        "params":{"text":"Hello, I’m Stuart from London.","allow_reidentification":true}
      },
      "id":1
    }' | jq

# 3) Re-identification (if needed separately)
curl -s http://localhost:9000/mcp \
  -H "Content-Type: application/json" \
  -d '{
      "jsonrpc":"2.0",
      "method":"invoke",
      "params":{
        "toolId":"mcp://pd-anonymiser/reidentification",
        "params":{"text":"...anonymised reply...","session_id":"...","key":"..."}
      },
      "id":3
    }' | jq
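
The same request bodies can be built in Python rather than hand-written JSON. The sketch below only constructs the payload. It copies the `invoke` method and `toolId` field from the curl example as-is, and `build_invoke_request` is a hypothetical helper name, not part of the repo.

```python
import json
from typing import Any, Dict


def build_invoke_request(tool_id: str, params: Dict[str, Any], request_id: int) -> str:
    """Serialise a JSON-RPC 2.0 request shaped like the curl examples."""
    payload = {
        "jsonrpc": "2.0",
        "method": "invoke",  # method name taken from the curl example
        "params": {"toolId": tool_id, "params": params},
        "id": request_id,
    }
    # ensure_ascii=False keeps characters such as typographic quotes readable
    return json.dumps(payload, ensure_ascii=False)


body = build_invoke_request(
    "mcp://pd-anonymiser/anonymisation",
    {"text": "Hello, I'm Stuart from London.", "allow_reidentification": True},
    request_id=1,
)
print(body)
```

Letting `json.dumps` do the quoting avoids the shell-escaping problems that arise when the input text itself contains quotes or apostrophes.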

Pricing Estimator Server

Token counting uses tiktoken. The pricing logic lives in src/pd_anonymiser_mcp/estimate_openai_cost.py, and the server is defined in src/pd_anonymiser_mcp/cost_estimation_server.py.

Run the server with:

uvicorn src.pd_anonymiser_mcp.cost_estimation_server:app --reload

POST /cost-estimator/open-ai

Estimates the USD cost of an OpenAI API call.

Request

{
  "prompt":                "Hello world",
  "model":                 "gpt-4",
  "max_completion_tokens": 100
}

Response (200)

{
  "prompt_token_count":  3,
  "cost":                0.0123
}
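
The arithmetic behind such an estimate can be sketched as follows. This is an illustration under stated assumptions, not the code in estimate_openai_cost.py: it assumes the completion consumes every token allowed by `max_completion_tokens` (an upper bound), it takes the token count as an input instead of calling tiktoken, and the per-1,000-token prices are placeholders rather than real OpenAI pricing.

```python
def estimate_cost(prompt_tokens: int, max_completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> dict:
    """Upper-bound USD cost: prompt tokens at the input rate plus the
    maximum number of completion tokens at the output rate."""
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = max_completion_tokens / 1000 * output_price_per_1k
    return {
        "prompt_token_count": prompt_tokens,
        "cost": round(input_cost + output_cost, 6),
    }


# Placeholder prices in USD per 1,000 tokens (hypothetical values).
estimate = estimate_cost(3, 100, input_price_per_1k=0.03, output_price_per_1k=0.06)
print(estimate)
```

Treating `max_completion_tokens` as fully used gives a ceiling on spend, which is usually the safer number to show before a call is made.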

💼 Key Features

  • Combine multiple recognisers
  • OperatorConfig injection for anonymisation
  • Reusable tag pseudonyms (e.g. Person A, Company B)
  • Optional irreversible UUID redaction
  • Re-identification with Fernet-encrypted session-based mappings
  • Designed for English (UK), but extensible
  • Built-in FastAPI MCP server for text-based integrations
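
To make the reusable tag pseudonyms concrete, the sketch below assigns one lettered tag per distinct entity, counted separately for each entity type. It is an illustration only: the `(surface text, entity type)` input stands in for real NER output, and the function names and the `LABELS` table are hypothetical.

```python
from string import ascii_uppercase
from typing import Dict, List, Tuple

# Hypothetical mapping from entity type to the tag prefix shown to readers.
LABELS = {"PERSON": "Person", "ORG": "Company", "LOCATION": "Location"}


def build_pseudonym_map(entities: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each distinct surface string to a tag such as 'Person A'.
    Repeated mentions reuse the same tag. Supports up to 26 entities per type."""
    mapping: Dict[str, str] = {}
    counters: Dict[str, int] = {}
    for surface, entity_type in entities:
        if surface in mapping:
            continue  # already tagged: reuse the existing pseudonym
        index = counters.get(entity_type, 0)
        counters[entity_type] = index + 1
        label = LABELS.get(entity_type, entity_type.title())
        mapping[surface] = f"{label} {ascii_uppercase[index]}"
    return mapping


def apply_map(text: str, mapping: Dict[str, str]) -> str:
    """Replace longer surface strings first so substrings do not clash."""
    for surface in sorted(mapping, key=len, reverse=True):
        text = text.replace(surface, mapping[surface])
    return text


entities = [("Theresa May", "PERSON"), ("Boris Johnson", "PERSON"),
            ("Downing Street", "LOCATION"), ("Theresa May", "PERSON")]
mapping = build_pseudonym_map(entities)
print(apply_map("Theresa May met with Boris Johnson at Downing Street.", mapping))
# Person A met with Person B at Location A.
```

Keeping the same tag for every mention of an entity preserves who did what in the text, which a fresh random identifier per mention would lose.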

🧱 Project Structure

pd-anonymiser/
├── BLOG-POST-1_MCP_FEATURE.md    # Design notes for MCP integration
├── Makefile                      # Helpers: install-dev, test, download-models
├── sample/
│   ├── reidentification.py       # Example with re-identification
│   └── no_reidentification.py    # Example with UUID-only anonymisation
├── src/
│   ├── pd_anonymiser/            # Core library
│   │   ├── anonymiser.py         # NER-based anonymisation logic
│   │   ├── reidentifier.py       # Session-based reverse mapping
│   │   ├── models.py             # Model registry
│   │   ├── utils.py              # Fernet encryption, session storage
│   │   └── recognisers/
│   │       ├── huggingface.py
│   │       └── spacy.py
│   └── pd_anonymiser_mcp/               # MCP server implementation
│       ├── cost_estimation_server.py    # FastAPI server to check OpenAI API cost
│       ├── estimate_openai_cost.py      # Cost estimator for OpenAI APIs
│       ├── client.py                    # MCP client example
│       └── server.py                    # FastMCP JSON-RPC server
├── tests/
│   ├── unit/                     # Unit tests (anonymiser + cost estimator)
│   └── integration/              # Integration tests (end-to-end)
├── .gitignore
├── requirements.txt
├── requirements-dev.txt
├── setup.py
└── README.md                     # ← this file

📦 Development Tasks

make install-dev        # Editable install with dev deps
make test               # Run pytest with coverage
make freeze             # Generate requirements.txt and requirements-dev.txt
make download-models    # Pull transformer-based spaCy model

πŸ” Re-identification Flow

  1. During anonymisation, a Fernet key + session ID are generated
  2. A JSON pseudonym map is encrypted and saved in sessions/
  3. To re-identify, call:
reidentify_text(anonymised_text, session_id, encoded_key)
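
The reverse-mapping step at the end of this flow can be sketched with the standard library. The example assumes the session map has already been decrypted into a JSON object of pseudonym to original value. That layout, and the function name `restore_originals`, are assumptions for illustration, and the Fernet decryption itself is left out.

```python
import json
import re


def restore_originals(anonymised_text: str, pseudonym_map_json: str) -> str:
    """Swap every pseudonym back to its original value in a single pass."""
    mapping = json.loads(pseudonym_map_json)  # assumed shape: {pseudonym: original}
    if not mapping:
        return anonymised_text
    # Longest pseudonyms first so a longer tag is never shadowed by a shorter
    # tag that is a prefix of it.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in alternatives))
    # A single regex pass means restored text is never re-scanned for tags.
    return pattern.sub(lambda match: mapping[match.group(0)], anonymised_text)


session_map = json.dumps({
    "Person A": "Theresa May",
    "Person B": "Boris Johnson",
    "Location A": "Downing Street",
})
print(restore_originals("Person A met with Person B at Location A...", session_map))
# Theresa May met with Boris Johnson at Downing Street...
```

Doing the substitution in one pass, rather than with repeated `str.replace` calls, avoids accidentally rewriting an original value that happens to contain a pseudonym.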

✅ Example Output

Original Text:
Theresa May met with Boris Johnson at Downing Street...

Anonymised:
Person A met with Person B at Location A...

Reidentified:
Theresa May met with Boris Johnson at Downing Street...

🧰 Requirements

  • Python 3.10
  • presidio-analyzer, presidio-anonymizer
  • transformers, spacy, cryptography
  • fastapi, uvicorn, openai, tiktoken
  • Various spaCy and Hugging Face models (download via make download-models)
  • Dev: pytest, pytest-cov, pip-tools

Maintainer

Built and maintained by @patons02


🪪 License

MIT License. See LICENSE.md

