<a href="https://colab.research.google.com/github/raaz0000002/Training-Samples/blob/main/Reversible_Anonymizing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**We will use a third-party LLM endpoint (OpenAI, Gemini, Phi-3, etc.) to process anonymized text and then reverse the anonymization.**

Presidio is an open-source framework by Microsoft for detecting and anonymizing sensitive information (PII) in text.

The AnalyzerEngine is the core component that scans your data for entities like names, emails, phone numbers, dates, etc.

`from faker import Faker`

Faker is a Python library for generating fake but realistic data (names, emails, dates, addresses, etc.).

You’ll use it to create pseudonymized values for detected entities.

`from faker.providers import internet, person, date_time`

These are Faker providers: modular plugins that add more types of fake data.

- `internet`: fake emails, URLs, and IP addresses.
- `person`: names, genders, and titles.
- `date_time`: dates and timestamps.

This lets you generate pseudonyms relevant to the detected data type.
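One way to picture "pseudonyms relevant to the detected data type" is a dispatch table from entity type to generator. The sketch below is only illustrative: the generator functions are hypothetical stand-ins written with the standard library so the block runs on its own, and in the notebook their place would be taken by Faker calls such as `fake.name` or `fake.safe_email`. The `<ENTITY_TYPE>` fallback tag is also an assumption, not something Presidio or Faker produces.

```python
import random
import string

# Hypothetical stand-ins for Faker calls such as fake.name or fake.safe_email;
# they use only the standard library so the sketch runs on its own.
def fake_name(rng):
    first = rng.choice(["Alex", "Jordan", "Sam", "Taylor"])
    last = rng.choice(["Reed", "Morgan", "Hayes", "Quinn"])
    return f"{first} {last}"

def fake_email(rng):
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{user}@example.org"

def fake_date(rng):
    # Days are capped at 28 so every month/day combination is valid.
    return f"{rng.randint(1950, 2010):04d}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}"

# Dispatch table: detected entity type -> generator for a matching pseudonym.
GENERATORS = {
    "PERSON": fake_name,
    "EMAIL_ADDRESS": fake_email,
    "DATE_TIME": fake_date,
}

def pseudonym_for(entity_type, rng):
    """Return a fake value for the entity type, or a generic tag if unsupported."""
    generator = GENERATORS.get(entity_type)
    if generator is None:
        return f"<{entity_type}>"
    return generator(rng)

rng = random.Random(0)  # seeded so repeated runs give the same pseudonyms
for entity_type in ["PERSON", "EMAIL_ADDRESS", "DATE_TIME", "PHONE_NUMBER"]:
    print(entity_type, "->", pseudonym_for(entity_type, rng))
```

Keeping the table separate from the replacement logic means supporting a new entity type is a one-line change.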

`import openai`

This imports the OpenAI Python library.

This is used to interact with OpenAI’s GPT models for more advanced NLP, if needed (optional in pseudonymization, but useful for LLM-based processing).



In [1]:
# Import necessary libraries and define helper functions
from presidio_analyzer import AnalyzerEngine
from faker import Faker
from faker.providers import internet, person, date_time
import openai


ModuleNotFoundError: No module named 'presidio_analyzer'

In [3]:
%pip install presidio-analyzer
!pip install faker

Collecting faker
  Downloading faker-37.5.3-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.5.3-py3-none-any.whl (1.9 MB)
Installing collected packages: faker
Successfully installed faker-37.5.3


In [4]:
# Import necessary libraries and define helper functions
from presidio_analyzer import AnalyzerEngine
from faker import Faker
from faker.providers import internet, person, date_time
import openai


In [8]:
# Import necessary libraries
from presidio_analyzer import AnalyzerEngine
from faker import Faker
from faker.providers import internet, person, date_time

# Initialize Faker and add providers for generating fake data
fake = Faker("en_US")
fake.add_provider(internet)  # For generating fake emails, URLs
fake.add_provider(person)    # For generating fake names
fake.add_provider(date_time) # For generating fake dates and times

# Initialize Presidio's AnalyzerEngine to scan and detect sensitive data
analyzer = AnalyzerEngine()

# Function to anonymize text and map original values to fake values
def anonymize_text(analyzer_results, text_to_anonymize):
    """
    Anonymizes the given text using Faker and creates a mapping for de-anonymization.

    Args:
    analyzer_results: List of results from Presidio's analysis (detected PII entities)
    text_to_anonymize: The original text that contains sensitive data to be anonymized

    Returns:
    updated_text: The anonymized text
    entity_mapping: A dictionary mapping fake values back to original values for de-anonymization
    """
    # Create an empty mapping dictionary
    entity_mapping = {}

    # Make a copy of the text to avoid modifying the original
    updated_text = text_to_anonymize

    # Function to replace the detected PII with a fake value and store the original value
    def replace_and_store(entity_type, replacement_func):
        nonlocal updated_text  # Access the outer scope variable updated_text
        for result in analyzer_results:
            if result.entity_type == entity_type:
                # Extract the original value of the detected entity
                original_value = text_to_anonymize[result.start:result.end]

                # Generate a fake replacement value using the specified Faker function
                fake_value = replacement_func()

                # Update the text by replacing the original value with the fake value
                updated_text = updated_text.replace(original_value, fake_value, 1)

                # Store the mapping between fake and original value
                entity_mapping[fake_value] = original_value

        return updated_text

    # Replace detected entities with their fake equivalents
    updated_text = replace_and_store("EMAIL_ADDRESS", fake.safe_email)
    updated_text = replace_and_store("PERSON", fake.name)
    updated_text = replace_and_store("DATE_TIME", lambda: fake.date_time().strftime('%Y-%m-%d'))

    # Return the anonymized text and the mapping for future de-anonymization
    return updated_text, entity_mapping

# Function to de-anonymize text using the stored entity mapping
def de_anonymize_text(anonymized_text, entity_mapping):
    """
    De-anonymizes the text by replacing fake values with their corresponding original values.

    Args:
    anonymized_text: The text containing fake values
    entity_mapping: A dictionary mapping fake values to the original values

    Returns:
    anonymized_text: The text with fake values replaced by the original values
    """
    # Iterate through the entity mapping and reverse the replacement
    for fake_value, real_value in entity_mapping.items():
        anonymized_text = anonymized_text.replace(fake_value, real_value)

    return anonymized_text

# Example of usage:
text = "Hari Bahadur's email is Haribahadur@example.com, and his birth date is 1990-01-01."

# Analyze the text to find PII entities
analyzer_results = analyzer.analyze(text=text, entities=["EMAIL_ADDRESS", "PERSON", "DATE_TIME"], language="en")

# Display the initial text and the analysis results
print(f"Original Text:\n{text}\n")
print(f"Analyzer result:\n{analyzer_results}\n")

# Anonymize the text
anonymized_text, entity_mapping = anonymize_text(analyzer_results, text)

# Display the anonymized result and the entity mapping
print(f"Anonymized Text:\n{anonymized_text}\n")
print(f"Entity Mapping:\n{entity_mapping}\n")

# De-anonymize the text
de_anonymized_text = de_anonymize_text(anonymized_text, entity_mapping)

# Display the de-anonymized text
print(f"De-anonymized Text:\n{de_anonymized_text}")




Original Text:
Hari Bahadur's email is Haribahadur@example.com, and his birth date is 1990-01-01.

Analyzer result:
[type: EMAIL_ADDRESS, start: 24, end: 47, score: 1.0, type: DATE_TIME, start: 71, end: 81, score: 0.95, type: PERSON, start: 0, end: 14, score: 0.85]

Anonymized Text:
Nicholas Blackwell email is mcdonaldkyle@example.net, and his birth date is 2010-09-08.

Entity Mapping:
{'mcdonaldkyle@example.net': 'Haribahadur@example.com', 'Nicholas Blackwell': "Hari Bahadur's", '2010-09-08': '1990-01-01'}

De-anonymized Text:
Hari Bahadur's email is Haribahadur@example.com, and his birth date is 1990-01-01.
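Two details in the output above are worth noting. The `PERSON` span covers `Hari Bahadur's`, so the possessive is swallowed by the replacement, and `str.replace(..., 1)` substitutes the first textual match, which is not necessarily the span the analyzer reported. A minimal sketch of an alternative, assuming the spans do not overlap, is to splice by the reported offsets from right to left. `Span` here is a hypothetical stand-in for a Presidio result, and the pseudonyms are fixed per entity type purely for illustration (a real run would generate one per span).

```python
from dataclasses import dataclass

@dataclass
class Span:
    """Hypothetical stand-in for a Presidio result (entity_type, start, end)."""
    entity_type: str
    start: int
    end: int

def replace_by_offsets(text, spans, pseudonyms):
    """Replace each span with its pseudonym, working right to left.

    pseudonyms maps entity_type -> fake value (fixed here for illustration).
    Returns the new text and a fake -> original mapping.
    """
    mapping = {}
    # Right-to-left keeps the earlier offsets valid after each substitution.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        original = text[span.start:span.end]
        suffix = ""
        # Keep a trailing possessive outside the replaced name.
        if span.entity_type == "PERSON" and original.endswith("'s"):
            original, suffix = original[:-2], "'s"
        fake_value = pseudonyms[span.entity_type]
        text = text[:span.start] + fake_value + suffix + text[span.end:]
        mapping[fake_value] = original
    return text, mapping

text = "Hari Bahadur's email is Haribahadur@example.com, and his birth date is 1990-01-01."
# Offsets copied from the analyzer result printed above.
spans = [
    Span("EMAIL_ADDRESS", 24, 47),
    Span("DATE_TIME", 71, 81),
    Span("PERSON", 0, 14),
]
pseudonyms = {
    "PERSON": "Nicholas Blackwell",
    "EMAIL_ADDRESS": "mcdonaldkyle@example.net",
    "DATE_TIME": "2010-09-08",
}
anonymized, mapping = replace_by_offsets(text, spans, pseudonyms)
print(anonymized)
print(mapping)
```

With this variant the anonymized sentence keeps its `'s`, and the mapping stores the bare name `Hari Bahadur`.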


In [20]:
import os
import google.generativeai as genai

# Configure the Gemini client. The API key is read from an environment variable
# (GEMINI_API_KEY is an assumed name) instead of being hard-coded in the notebook.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel('gemini-2.0-flash')

# Define the prompt with anonymized text
prompt = anonymized_text + " Re-write that information a little differently."

# Generate response from Gemini
response = model.generate_content(prompt)

# Print the response
print(response)

# Extract the generated text from the response
generated_text = response.text.strip()

# Display the sent prompt and Gemini response
print("\nAnonymized Text Sent to LLM:\n", prompt, "\n")
print("\nLLM Response:\n", generated_text, "\n")

# De-anonymize the response and display the final result
de_anonymized_text = de_anonymize_text(generated_text, entity_mapping)
print("\nDe-anonymized Response:\n", de_anonymized_text, "\n")

response:
GenerateContentResponse(
    done=True,
    iterator=None,
    result=protos.GenerateContentResponse({
      "candidates": [
        {
          "content": {
            "parts": [
              {
                "text": "Here's the information re-written in a few different ways:\n\n*   **Email:** Nicholas Blackwell can be contacted at mcdonaldkyle@example.net. **Birthday:** September 8th, 2010.\n*   Nicholas Blackwell's email address is mcdonaldkyle@example.net. He was born on September 8, 2010.\n*   Born on 2010-09-08, Nicholas Blackwell's email is mcdonaldkyle@example.net."
              }
            ],
            "role": "model"
          },
          "finish_reason": "STOP",
          "avg_logprobs": -0.17633222650598596
        }
      ],
      "usage_metadata": {
        "prompt_token_count": 37,
        "candidates_token_count": 108,
        "total_token_count": 145
      },
      "model_version": "gemini-2.0-flash"
    }),
)

Anonymized Text Sent to LLM:
 Nicholas 
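The response above spells the fake date as "September 8th, 2010" and "September 8, 2010", which a plain lookup of `2010-09-08` will not match, so those dates would stay fake after de-anonymization. One possible workaround, sketched here under stated assumptions, is to expand the mapping with a few date spellings before reversing it. The helper names (`ordinal`, `date_variants`, `expand_mapping`, `de_anonymize_variants`) are hypothetical, the set of spellings is not exhaustive, and the sample mapping assumes the name was stored without its possessive.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def ordinal(day):
    """Return the day with an English ordinal suffix, e.g. 8 -> '8th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"

def date_variants(iso_date):
    """Spell an ISO date a few ways an LLM might rewrite it (not exhaustive)."""
    d = date.fromisoformat(iso_date)
    month = MONTHS[d.month - 1]
    return [
        iso_date,
        f"{month} {ordinal(d.day)}, {d.year}",
        f"{month} {d.day}, {d.year}",
    ]

def expand_mapping(entity_mapping):
    """Add alternative date spellings to a fake -> original mapping."""
    expanded = dict(entity_mapping)
    for fake_value, real_value in entity_mapping.items():
        if ISO_DATE.fullmatch(fake_value) and ISO_DATE.fullmatch(real_value):
            # Pair each fake spelling with the original date in the same style.
            expanded.update(zip(date_variants(fake_value), date_variants(real_value)))
    return expanded

def de_anonymize_variants(text, entity_mapping):
    """Reverse the mapping, longest keys first so short keys cannot clobber long ones."""
    for fake_value in sorted(entity_mapping, key=len, reverse=True):
        text = text.replace(fake_value, entity_mapping[fake_value])
    return text

entity_mapping = {
    "mcdonaldkyle@example.net": "Haribahadur@example.com",
    "Nicholas Blackwell": "Hari Bahadur",  # assumed: stored without the possessive
    "2010-09-08": "1990-01-01",
}
llm_response = ("Nicholas Blackwell's email address is mcdonaldkyle@example.net. "
                "He was born on September 8, 2010.")
restored = de_anonymize_variants(llm_response, expand_mapping(entity_mapping))
print(restored)
```

This only covers spellings that are anticipated in advance. If the model paraphrases a value in some other way (for example "early September 2010"), the fake value will still survive in the output.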

In [23]:
print("\nOriginal Text:\n", text, "\n")
print("\nAnalyzer result:\n", analyzer_results, "\n")
# Display the anonymized text and the entity mapping
print("\nAnonymized Text:\n", anonymized_text, "\n")
print("\nEntity Mapping:\n", entity_mapping, "\n")
print("\nAnonymized Text Sent to LLM:\n", prompt, "\n")
print("\nLLM Response:\n", generated_text, "\n")
print("\nDe-anonymized Response:\n", de_anonymized_text, "\n")


Original Text:
 Hari Bahadur's email is Haribahadur@example.com, and his birth date is 1990-01-01. 


Analyzer result:
 [type: EMAIL_ADDRESS, start: 24, end: 47, score: 1.0, type: DATE_TIME, start: 71, end: 81, score: 0.95, type: PERSON, start: 0, end: 14, score: 0.85] 


Anonymized Text:
 Nicholas Blackwell email is mcdonaldkyle@example.net, and his birth date is 2010-09-08. 


Entity Mapping:
 {'mcdonaldkyle@example.net': 'Haribahadur@example.com', 'Nicholas Blackwell': "Hari Bahadur's", '2010-09-08': '1990-01-01'} 


Anonymized Text Sent to LLM:
 Nicholas Blackwell email is mcdonaldkyle@example.net, and his birth date is 2010-09-08. Re-write that information a little differently. 


LLM Response:
 Here's the information re-written in a few different ways:

*   **Email:** Nicholas Blackwell can be contacted at mcdonaldkyle@example.net. **Birthday:** September 8th, 2010.
*   Nicholas Blackwell's email address is mcdonaldkyle@example.net. He was born on September 8, 2010.
*   Born on 