# De-identifying Data with Microsoft Presidio

To get started with Microsoft Presidio, we reviewed its online documentation to understand what it does and which problems it solves. Presidio identifies private information in a dataset and replaces it with anonymized values to protect data privacy.
For this exercise, we created sample text containing several types of Personally Identifiable Information (PII) and used the `analyze` method of `AnalyzerEngine` to identify these PII entities, specifying the entity types we wanted.
Presidio recognizes multiple formats for each type of PII and reports the character positions where each entity starts and ends in the text. We then defined which PII entity types to replace and the value to substitute. In our case, each detected PII entity was replaced with the placeholder `<ANONYMIZED>`.
We also examined how Presidio handles different formats of the same field, to check that identification and anonymization stay accurate. For example, Presidio correctly anonymized dates in formats like "dd/mm/yyyy" or "July 15, 2024, at 2:30 PM." In the final run, however, we retained the original date, because it records the time of the conversation and is not private information; accordingly, no date entity appears in the list passed to `analyze`.
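
To make the role of the start and end positions concrete, the following is a minimal sketch of offset-based replacement in plain Python. It is not Presidio's implementation: the `Detection` class, the `replace_spans` helper, the sample sentence, and the hand-written offsets are all hypothetical stand-ins for what an analyzer would return, and the end offset is assumed to be exclusive, as in Python slicing. It also shows how leaving an entity type (here a date) out of the allowed set keeps the original value.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    start: int  # index of the first character of the span
    end: int    # index one past the last character (assumed exclusive)


def replace_spans(text, detections, allowed_types, placeholder="<ANONYMIZED>"):
    """Replace each detection whose type is in allowed_types with placeholder."""
    selected = [d for d in detections if d.entity_type in allowed_types]
    # Go from the last span to the first so that earlier offsets remain valid
    # even though the placeholder differs in length from the original span.
    for d in sorted(selected, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + placeholder + text[d.end:]
    return text


sample = "John called from Tel Aviv on 15/07/2024."
detections = [
    Detection("PERSON", 0, 4),       # "John"
    Detection("LOCATION", 17, 25),   # "Tel Aviv"
    Detection("DATE_TIME", 29, 39),  # "15/07/2024", deliberately not replaced
]

redacted = replace_spans(sample, detections, allowed_types={"PERSON", "LOCATION"})
print(redacted)
```

Because the date's entity type is not in `allowed_types`, the date survives unchanged while the name and location are replaced.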

The final output consists of the original text and the anonymized text, in which every detected PII element is replaced with `<ANONYMIZED>`. This demonstrates that Presidio can detect and anonymize a wide range of PII entities in a customizable manner, so that sensitive information is safeguarded.

## Python code customized with Presidio

In [3]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

Define Analyzer & Anonymizer

In [4]:
analyzer = AnalyzerEngine()
anonymize = AnonymizerEngine()

A sample text built around entities supported by the library, as documented at https://microsoft.github.io/presidio/supported_entities/

In [None]:
text = "In addressing the legalities of data protection, John shared his personal information to illustrate the importance of privacy laws. He mentioned, 'My phone number, 123-456-7890, and my email, john.doe@example.com, are sensitive pieces of information that need safeguarding.' Furthermore, he emphasized, 'Even my bank account number, GB82WEST12345698765432, is at risk if not properly protected.' Call details: Tel Aviv, 0.0.0.0, July 15, 2024, at 2:30 PM"

Analyze the text for PII and configure the anonymizer to replace each identified entity with `<ANONYMIZED>`

In [13]:
results = analyzer.analyze(text=text, entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "IBAN_CODE", "LOCATION", "IP_ADDRESS"], language="en")

anonymizers_config = {
    "PERSON": OperatorConfig(operator_name="replace", params={"new_value": "<ANONYMIZED>"}),
    "PHONE_NUMBER": OperatorConfig(operator_name="replace", params={"new_value": "<ANONYMIZED>"}),
    "EMAIL_ADDRESS": OperatorConfig(operator_name="replace", params={"new_value": "<ANONYMIZED>"}),
    "IBAN_CODE": OperatorConfig(operator_name="replace", params={"new_value": "<ANONYMIZED>"}),
    "LOCATION": OperatorConfig(operator_name="replace", params={"new_value": "<ANONYMIZED>"}),
    "IP_ADDRESS": OperatorConfig(operator_name="replace", params={"new_value": "<ANONYMIZED>"}),
}

anonymized_text = anonymize.anonymize(
    text=text,
    analyzer_results=results,
    operators=anonymizers_config,
)
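
The six entries in the operator map above differ only in their key, so the map could also be generated from a single list of entity types. The sketch below is one possible way to do that, assuming the `OperatorConfig(operator_name=..., params=...)` call used in this notebook. The names `ENTITIES`, `build_operators`, `config_factory`, and `describe` are hypothetical; the factory is passed in as a parameter only so that the sketch runs without Presidio installed.

```python
ENTITIES = [
    "PERSON",
    "PHONE_NUMBER",
    "EMAIL_ADDRESS",
    "IBAN_CODE",
    "LOCATION",
    "IP_ADDRESS",
]


def build_operators(entities, config_factory, placeholder="<ANONYMIZED>"):
    """Return {entity_type: config} with one 'replace' config per entity type."""
    return {
        entity: config_factory(
            operator_name="replace",
            params={"new_value": placeholder},
        )
        for entity in entities
    }


def describe(operator_name, params):
    """Stand-in factory for illustration; the notebook would pass OperatorConfig."""
    return (operator_name, params)


operators = build_operators(ENTITIES, describe)
print(operators["PERSON"])
```

In the notebook this would be called as `build_operators(ENTITIES, OperatorConfig)`, and the same `ENTITIES` list could be passed to `analyzer.analyze(...)` so that the detected types and the configured operators stay in sync.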


Print the original and the anonymized text

In [14]:
print("Original text:")
print(text)
print("\nAnonymized text:")
print(anonymized_text.text)


Original text:
In addressing the legalities of data protection, John shared his personal information to illustrate the importance of privacy laws. He mentioned, 'My phone number, 123-456-7890, and my email, john.doe@example.com, are sensitive pieces of information that need safeguarding.' Furthermore, he emphasized, 'Even my bank account number, GB82WEST12345698765432, is at risk if not properly protected.' Call details: Tel Aviv, 0.0.0.0, July 15, 2024, at 2:30 PM

Anonymized text:
In addressing the legalities of data protection, <ANONYMIZED> shared his personal information to illustrate the importance of privacy laws. He mentioned, 'My phone number, <ANONYMIZED>, and my email, <ANONYMIZED>, are sensitive pieces of information that need safeguarding.' Furthermore, he emphasized, 'Even my bank account number, <ANONYMIZED>, is at risk if not properly protected.' Call details: <ANONYMIZED>, <ANONYMIZED>, July 15, 2024, at 2:30 PM
