# Reversible Anonymization with LCEL and Fuzzy Matching

This notebook demonstrates how to:
1. Use `PresidioReversibleAnonymizer` with the LangChain Expression Language (LCEL)
2. Apply the combined exact + fuzzy matching strategy during deanonymization
3. Run the full workflow: anonymization -> LLM processing -> deanonymization
4. Add a custom recognizer to anonymize company names

Taken from:
- https://python.langchain.com/v0.1/docs/guides/productionization/safety/presidio_data_anonymization/
- https://python.langchain.com/api_reference/experimental/data_anonymizer/langchain_experimental.data_anonymizer.deanonymizer_matching_strategies.combined_exact_fuzzy_matching_strategy.html


In [None]:
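# Run the notebook magics first: %reset -f clears the namespace, so anything imported before it would be lost
%load_ext autoreload
%autoreload 2
%reset -f

from devtools import debug  # noqa: F401, F811
from dotenv import load_dotenv
from rich import print  # noqa: F401

# Fail fast if no .env file could be loaded
assert load_dotenv(verbose=True)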

In [None]:
from faker import Faker
from langchain_core.prompts import PromptTemplate
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer
from langchain_experimental.data_anonymizer.deanonymizer_matching_strategies import (
    combined_exact_fuzzy_matching_strategy,
)
from langchain_openai import ChatOpenAI
from presidio_analyzer import Pattern, PatternRecognizer
from presidio_anonymizer import OperatorConfig

# Initialize anonymizer with common PII types
anonymizer = PresidioReversibleAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"],
    faker_seed=42,  # For deterministic fake data
)

In [None]:
# Original sensitive text
text = """John Doe recently lost his wallet. 
Inside is $500 cash and his Visa card 4111 1111 1111 1111. 
Contact him at johndoe@example.com or 555-123-4567."""

# Anonymize the text
anonymized_text = anonymizer.anonymize(text)
print("Anonymized:\n", anonymized_text)
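
The idea behind reversible anonymization can be sketched without Presidio: each detected value is replaced by a fake one, and a fake -> original mapping is kept so the substitution can be undone later. The snippet below is only a toy illustration under that assumption; `toy_anonymize`, `toy_deanonymize` and the hard-coded fake values are hypothetical and are not part of `langchain_experimental`, which detects entities with Presidio and generates fakes with Faker.

```python
# Hypothetical fake values; the real anonymizer generates them with Faker.
FAKE_VALUES = {
    "John Doe": "Maria Lynch",
    "johndoe@example.com": "jamesmichael@example.com",
}


def toy_anonymize(text: str, replacements: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace each original value with its fake and record fake -> original."""
    mapping: dict[str, str] = {}
    for original, fake_value in replacements.items():
        if original in text:
            text = text.replace(original, fake_value)
            mapping[fake_value] = original
    return text, mapping


def toy_deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Undo the substitution by exact string replacement."""
    for fake_value, original in mapping.items():
        text = text.replace(fake_value, original)
    return text


sample_text = "John Doe can be reached at johndoe@example.com."
masked_text, toy_mapping = toy_anonymize(sample_text, FAKE_VALUES)
restored_text = toy_deanonymize(masked_text, toy_mapping)
```

Exact replacement only round-trips when the fake values come back unchanged, which is why a fuzzy strategy is useful once an LLM has rewritten the text.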

In [None]:
# Create LCEL chain with anonymization and deanonymization
template = """Convert this message into a formal notification:
{anonymized_text}
"""

prompt = PromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0)

# Chain: anonymize -> LLM -> deanonymize
chain = (
    {"anonymized_text": anonymizer.anonymize}
    | prompt
    | llm
    | (
        lambda msg: anonymizer.deanonymize(
            msg.content, deanonymizer_matching_strategy=combined_exact_fuzzy_matching_strategy
        )
    )
)

# Invoke the chain
response = chain.invoke(text)
print("\nFinal Response:\n", response)

In [None]:
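# Demonstrate fuzzy matching robustness with a simulated LLM output in which the fake
# values were reworded (name shortened, phone reformatted, card number truncated).
# The values below assume the fakes generated earlier; adjust them to match your own run.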
llm_altered_text = """
We regret to inform that Mr. Lynch (contactable at 734-413-1647 
or jamesmichael@example.com) mislaid a Visa card ending with 40262.
"""

# Deanonymize with different strategies
print("Without fuzzy matching:\n", anonymizer.deanonymize(llm_altered_text))
print(
    "\nWith fuzzy matching:\n",
    anonymizer.deanonymize(llm_altered_text, deanonymizer_matching_strategy=combined_exact_fuzzy_matching_strategy),
)
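
To make the difference between the two passes concrete, here is a minimal sketch of an exact-then-fuzzy replacement built on the standard library's `difflib`. It is not the implementation used by `combined_exact_fuzzy_matching_strategy`; the function name `exact_then_fuzzy`, the word-window comparison and the `0.8` threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def exact_then_fuzzy(text: str, mapping: dict[str, str], threshold: float = 0.8) -> str:
    """Restore originals with exact replacement, then a fuzzy pass for the leftovers."""
    # Pass 1: exact matches. Fakes that are not found verbatim are kept for pass 2.
    pending: dict[str, str] = {}
    for fake_value, original in mapping.items():
        if fake_value in text:
            text = text.replace(fake_value, original)
        else:
            pending[fake_value] = original

    # Pass 2: slide a window of words over the text and keep the most similar one.
    for fake_value, original in pending.items():
        width = len(fake_value.split())
        words = text.split(" ")
        best_score, best_start = 0.0, -1
        for start in range(len(words) - width + 1):
            candidate = " ".join(words[start : start + width])
            score = SequenceMatcher(None, candidate.lower(), fake_value.lower()).ratio()
            if score > best_score:
                best_score, best_start = score, start
        if best_start >= 0 and best_score >= threshold:
            words[best_start : best_start + width] = [original]
            text = " ".join(words)
    return text


# Hypothetical mapping and LLM output: the name is misspelled, the email is intact.
sketch_mapping = {
    "Maria Lynch": "John Doe",
    "jamesmichael@example.com": "johndoe@example.com",
}
altered = "Please contact Maria Linch at jamesmichael@example.com about the card."
fuzzy_restored = exact_then_fuzzy(altered, sketch_mapping)
```

The email is restored by the exact pass, while the misspelled name is only recovered by the fuzzy pass. A lower threshold recovers more variations at the cost of more false replacements.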

## Custom Company/Product Recognizer Example

The following cell adds a custom recognizer for specific company and product names:

In [None]:
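import re

# Faker instance used below to generate replacement codes for company names
fake = Faker(locale=["fr-FR", "en-US"])

# List of sensitive company/product names to detect (case-insensitive)
COMPANY_NAMES = ["Atos", "CapGemini", "IBM", "CNES", "Thales", "Google", "Microsoft"]

# Create regex pattern to match any of the names as whole words.
# re.escape keeps names containing regex metacharacters literal; (?i) makes the match case-insensitive.
company_pattern = r"(?i)\b(" + "|".join(re.escape(name) for name in COMPANY_NAMES) + r")\b"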

# Create custom recognizer for company/product names
company_recognizer = PatternRecognizer(
    supported_entity="COMPANY", patterns=[Pattern(name="company_pattern", regex=company_pattern, score=0.9)]
)

# Add custom recognizer and fake replacement operator
anonymizer.add_recognizer(company_recognizer)
anonymizer.add_operators(
    {
        "COMPANY": OperatorConfig(
            "custom",
            {
                "lambda": lambda _: fake.bothify(text="CCC####")  # Generate codes like CCC1221
            },
        )
    }
)

# Test with mixed case company names
text_with_companies = """
Our partners include ATOS, Capgemini and ibm. 
Recent projects with Thales and cnes have been successful.
"""

print("Original:\n", text_with_companies)
print("\nAnonymized:\n", anonymizer.anonymize(text_with_companies))
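
The whole-word, case-insensitive pattern can also be checked on its own with the standard `re` module, independently of Presidio. This is a small sketch; the shortened name list and the sample sentence are made up for the check.

```python
import re

sketch_names = ["Atos", "CapGemini", "IBM", "CNES", "Thales"]

# Same construction as the recognizer pattern: alternation wrapped in word boundaries.
sketch_pattern = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in sketch_names) + r")\b",
    re.IGNORECASE,
)

sketch_sample = "Our partners include ATOS, Capgemini and ibm. Thalesgroup is not matched."
sketch_matches = sketch_pattern.findall(sketch_sample)
```

The `\b` boundaries prevent a listed name from matching inside a longer word, so `Thalesgroup` is left alone while `ATOS`, `Capgemini` and `ibm` are found regardless of case.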

Display the mapping table (fake value -> original value, grouped by entity type):

In [None]:
anonymizer.deanonymizer_mapping

## Key Takeaways
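- LCEL lets anonymization and deanonymization run as ordinary steps of a chain
- The combined exact + fuzzy matching strategy restores original values even when the LLM rewords or reformats the fake ones
- Reversibility preserves data utility: the LLM only sees fake values, yet the final output contains the real ones
- The deanonymizer mapping can be saved and reloaded, which keeps anonymization consistent across sessions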
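
Persisting the mapping amounts to serializing a nested dictionary. The sketch below round-trips such a dictionary through JSON with the standard library only; the mapping values are hypothetical and the shape `{entity_type: {fake_value: original_value}}` is assumed from what `anonymizer.deanonymizer_mapping` displays. The LangChain documentation describes `save_deanonymizer_mapping` and `load_deanonymizer_mapping` methods for the same purpose; check that they are available in the version you have installed.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mapping with the assumed shape {entity_type: {fake_value: original_value}}.
example_mapping = {
    "PERSON": {"Maria Lynch": "John Doe"},
    "EMAIL_ADDRESS": {"jamesmichael@example.com": "johndoe@example.com"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = Path(tmp_dir) / "deanonymizer_mapping.json"
    # Save the mapping at the end of one session...
    mapping_path.write_text(json.dumps(example_mapping, indent=2), encoding="utf-8")
    # ...and reload it at the start of the next one.
    reloaded_mapping = json.loads(mapping_path.read_text(encoding="utf-8"))
```

The saved file contains the original PII, so it should be stored and access-controlled as carefully as the source data.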