A privacy-focused tool to anonymise and optionally re-identify personal data using Microsoft Presidio.
Supports NER with Hugging Face and spaCy transformer models, pseudonym mapping, and encrypted reversible transformations.
git clone https://github.com/your-org/pd-anonymiser.git
cd pd-anonymiser
# Create and activate Python 3.10 virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
# Install all dependencies (prod + dev)
make install-dev
# Download models (once)
make download-models
from pd_anonymiser.anonymiser import anonymise_text
from pd_anonymiser.reidentifier import reidentify_text
# Anonymise input text (the result exposes text, session_id and key)
result = anonymise_text(
"Alice from Acme Corp emailed Bob at 5pm.",
allow_reidentification=True
)
print(result.text, result.session_id, result.key)
# Re-identify the text using the session map
original = reidentify_text(
result.text,
result.session_id,
result.key
)
print(original)
python sample/reidentification.py
python sample/no_reidentification.py
The MCP server implements the Model Context Protocol for three-step pipelines:
- Anonymisation (resource)
- ChatGPT call via OpenAI (tool)
- Re-identification (resource)
The repo ships an MCP server and a sample client.
python src/pd_anonymiser_mcp/server.py \
--transport streamable-http --host 0.0.0.0 --port 9000 --path /mcp
By default it exposes JSON-RPC over HTTP at http://localhost:9000/mcp.
| Type | Name | URI / behaviour |
|---|---|---|
| resource | anonymisation | `mcp://pd-anonymiser/anonymisation?text={text}&allow_reidentification={allow_reidentification}` → returns `{ anonymised_text, session_id, key }` |
| resource | reidentification | `mcp://pd-anonymiser/reidentification?text={text}&session_id={session_id}&key={key}` → returns `{ reidentified_text }` |
| tool | execute-prompt-with-anonymisation | Takes raw `text`, internally (1) anonymises it, (2) calls the client's LLM via `ctx.sample()`, (3) returns `{ llm_response_anonymised, session_id, key }` |
| prompt | anonymisePrompt | Prompt template that forces any assistant to strip personal data in both input & output |
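The two resource URIs in the table are templates, so free text and keys have to be escaped before they are substituted in. The sketch below fills them using only the standard library. The helper names are hypothetical, and percent-encoding plus the lowercase `true`/`false` spelling are assumptions about what the server accepts, not something this README confirms.

```python
from urllib.parse import quote, urlencode

BASE = "mcp://pd-anonymiser"  # scheme and host as shown in the table above


def anonymisation_uri(text: str, allow_reidentification: bool = True) -> str:
    """Fill the anonymisation URI template (hypothetical helper)."""
    query = urlencode(
        {
            "text": text,
            # Assumption: the server parses lowercase booleans.
            "allow_reidentification": str(allow_reidentification).lower(),
        },
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{BASE}/anonymisation?{query}"


def reidentification_uri(text: str, session_id: str, key: str) -> str:
    """Fill the re-identification URI template (hypothetical helper)."""
    query = urlencode(
        {"text": text, "session_id": session_id, "key": key},
        quote_via=quote,
    )
    return f"{BASE}/reidentification?{query}"


print(anonymisation_uri("Alice emailed Bob."))
```

Escaping matters most for the key: URL-safe base64 keys can end in `=` padding, which would otherwise be read as a query delimiter.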
python src/pd_anonymiser_mcp/server.py
By default, it listens on http://0.0.0.0:9000 with JSON-RPC over HTTP.
- Install the GitHub Copilot Chat extension.
- In the Chat panel choose Agent mode → Tools → Add MCP Server → URL `http://localhost:9000/mcp`.
- Enable the `pd-anonymiser.*` tools and experiment interactively: anonymise → chat → re-identify, all inside VS Code.
# 1) Anonymise input
curl -s localhost:9000/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0",
"method":"invoke",
"params":{
"toolId":"mcp://pd-anonymiser/anonymisation",
"params":{"text":"Hello, I’m Stuart from London.","allow_reidentification":true}
},
"id":1
}' | jq
# 3) Re-identification (if needed separately)
curl -s localhost:9000/mcp \
-H "Content-Type: application/json" \
-d '{
"jsonrpc":"2.0",
"method":"invoke",
"params":{
"toolId":"mcp://pd-anonymiser/reidentification",
"params":{"text":"...anonymised reply...","session_id":"...","key":"..."}
},
"id":3
}' | jq
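The same requests can be issued from Python. The sketch below mirrors the JSON-RPC envelope used in the curl examples (`method: "invoke"`, `toolId`, nested `params`). The function names are hypothetical, and the endpoint URL is an assumption taken from the server defaults described earlier; adjust it to wherever your server actually listens.

```python
import json
from urllib import request

# Assumption: the MCP server is reachable at its documented default address.
MCP_URL = "http://localhost:9000/mcp"


def build_invoke(tool_id: str, params: dict, request_id: int) -> dict:
    """Build the JSON-RPC 2.0 envelope shown in the curl examples."""
    return {
        "jsonrpc": "2.0",
        "method": "invoke",
        "params": {"toolId": tool_id, "params": params},
        "id": request_id,
    }


def post_json(payload: dict, url: str = MCP_URL) -> dict:
    """POST a payload as JSON and decode the JSON reply (needs a running server)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


anonymise_request = build_invoke(
    "mcp://pd-anonymiser/anonymisation",
    {"text": "Hello, I’m Stuart from London.", "allow_reidentification": True},
    request_id=1,
)
print(json.dumps(anonymise_request, indent=2, ensure_ascii=False))
```

With the server running, `post_json(anonymise_request)` would send the request. The shape of the reply is not specified in this README, so the sketch does not parse it further.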
Uses tiktoken for token counting.
Pricing lives in `src/pd_anonymiser_mcp/estimate_openai_cost.py`, with the server in `src/pd_anonymiser_mcp/cost_estimation_server.py`.
The server can be run via:
uvicorn src.pd_anonymiser_mcp.cost_estimation_server:app --reload
Estimates the USD cost of an OpenAI API call.
Request
{
"prompt": "Hello world",
"model": "gpt-4",
"max_completion_tokens": 100
}
Response (200)
{
"prompt_token_count": 3,
"cost": 0.0123
}
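As a rough illustration of the arithmetic behind such an estimate, the sketch below multiplies token counts by per-1,000-token prices. It is not the code in `estimate_openai_cost.py`: the function name is hypothetical, the prices are placeholders rather than real OpenAI rates, and treating `max_completion_tokens` as the billed output size (a worst case) is an assumption. Token counting itself, which the project does with tiktoken, is left out, so the prompt token count is passed in directly.

```python
def estimate_cost(
    prompt_tokens: int,
    max_completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Upper-bound USD cost: prompt tokens plus the full completion budget."""
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = max_completion_tokens / 1000 * output_price_per_1k
    return round(input_cost + output_cost, 6)


# Placeholder prices for illustration only; look up current rates for your model.
PLACEHOLDER_INPUT_PRICE = 0.03
PLACEHOLDER_OUTPUT_PRICE = 0.06

response = {
    "prompt_token_count": 3,
    "cost": estimate_cost(3, 100, PLACEHOLDER_INPUT_PRICE, PLACEHOLDER_OUTPUT_PRICE),
}
print(response)
```

Because the placeholder prices are invented, the resulting figure will not match the sample response above.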
- Combine multiple recognisers
- `OperatorConfig` injection for anonymisation
- Reusable tag pseudonyms (e.g. `Person A`, `Company B`)
- Optional irreversible UUID redaction
- Re-identification with Fernet-encrypted session-based mappings
- Designed for English (UK), but extensible
- Built-in FastAPI MCP server for text-based integrations
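To make the "reusable tag pseudonyms" feature concrete, here is a minimal sketch of one way such tags could be assigned: each distinct entity gets a stable tag, and repeated mentions reuse it. The entity labels, the label-to-prefix table, and the per-type counter are all assumptions for illustration; the library's actual scheme may differ.

```python
from string import ascii_uppercase

# Hypothetical mapping from recogniser labels to human-readable tag prefixes.
LABEL_PREFIXES = {"PERSON": "Person", "ORGANIZATION": "Company", "LOCATION": "Location"}


def letter_suffix(index: int) -> str:
    """Convert 0, 1, ..., 25, 26 into A, B, ..., Z, AA (spreadsheet-style)."""
    suffix = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        suffix = ascii_uppercase[remainder] + suffix
    return suffix


def assign_pseudonyms(entities: list[tuple[str, str]]) -> dict[str, str]:
    """Map each distinct entity text to a tag; repeated mentions reuse the tag."""
    mapping: dict[str, str] = {}
    counters: dict[str, int] = {}
    for text, label in entities:
        if text in mapping:
            continue  # already tagged: keep the pseudonym stable
        index = counters.get(label, 0)
        counters[label] = index + 1
        prefix = LABEL_PREFIXES.get(label, label.title())
        mapping[text] = f"{prefix} {letter_suffix(index)}"
    return mapping


detected = [
    ("Alice", "PERSON"),
    ("Acme Corp", "ORGANIZATION"),
    ("Bob", "PERSON"),
    ("Alice", "PERSON"),
]
print(assign_pseudonyms(detected))
```

Stable tags keep the anonymised text readable, since a reader or an LLM can still tell that two mentions refer to the same person.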
pd-anonymiser/
├── BLOG-POST-1_MCP_FEATURE.md         # Design notes for MCP integration
├── Makefile                           # Helpers: install-dev, test, download-models
├── sample/
│   ├── reidentification.py            # Example with re-identification
│   └── no_reidentification.py         # Example with UUID-only anonymisation
├── src/
│   ├── pd_anonymiser/                 # Core library
│   │   ├── anonymiser.py              # NER-based anonymisation logic
│   │   ├── reidentifier.py            # Session-based reverse mapping
│   │   ├── models.py                  # Model registry
│   │   ├── utils.py                   # Fernet encryption, session storage
│   │   └── recognisers/
│   │       ├── huggingface.py
│   │       └── spacy.py
│   └── pd_anonymiser_mcp/             # MCP server implementation
│       ├── cost_estimation_server.py  # FastAPI server to check OpenAI API cost
│       ├── estimate_openai_cost.py    # Cost estimator for OpenAI APIs
│       ├── client.py                  # MCP client example
│       └── server.py                  # FastMCP JSON-RPC server
├── tests/
│   ├── unit/                          # Unit tests (anonymiser + cost estimator)
│   └── integration/                   # Integration tests (end-to-end)
├── .gitignore
├── requirements.txt
├── requirements-dev.txt
├── setup.py
└── README.md                          # ← this file
make install-dev # Editable install with dev deps
make test # Run pytest with coverage
make freeze # Generate requirements.txt and requirements-dev.txt
make download-models # Pull transformer-based spaCy model
- During anonymisation, a Fernet key + session ID are generated
- A JSON pseudonym map is encrypted and saved in `sessions/`
- To re-identify, call:
reidentify_text(anonymised_text, session_id, encoded_key)
Original Text:
Theresa May met with Boris Johnson at Downing Street...
Anonymised:
Person A met with Person B at Location A...
Reidentified:
Theresa May met with Boris Johnson at Downing Street...
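The round trip in this example can be sketched with a plain pseudonym map. This is an illustration only: it uses exact string matching in place of NER, the helper name is hypothetical, and it leaves out the Fernet encryption and `sessions/` storage that the tool applies to the serialised map.

```python
import json
import re


def apply_map(text: str, replacements: dict[str, str]) -> str:
    """Replace every key with its value in one pass, longest keys first."""
    if not replacements:
        return text
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


pseudonym_map = {
    "Theresa May": "Person A",
    "Boris Johnson": "Person B",
    "Downing Street": "Location A",
}

original = "Theresa May met with Boris Johnson at Downing Street..."
anonymised = apply_map(original, pseudonym_map)

# The tool would encrypt this JSON before saving it; here it stays in memory.
stored = json.dumps(pseudonym_map)

# Re-identification inverts the stored map and applies it to the anonymised text.
reverse_map = {tag: name for name, tag in json.loads(stored).items()}
reidentified = apply_map(anonymised, reverse_map)

print(anonymised)
print(reidentified)
```

Replacing in a single regex pass, longest key first, avoids one substitution rewriting the output of another and keeps a short tag from matching inside a longer one.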
- Python 3.10
- `presidio-analyzer`, `presidio-anonymizer`
- `transformers`, `spacy`, `cryptography`
- `fastapi`, `uvicorn`, `openai`, `tiktoken`
- Various spaCy and Hugging Face models (download via `make download-models`)
- Dev: `pytest`, `pytest-cov`, `pip-tools`
Built and maintained by @patons02
MIT License. See LICENSE.md