# Detecting PII with Presidio

In this example, we'll show how to detect PII using [Microsoft's Presidio](https://microsoft.github.io/presidio/) and Langkit.

Note: The `regexes` module also provides basic PII detection capabilities. The difference is that `regexes` uses a simple list of regex patterns, while `pii` (the module in this example) uses much more sophisticated models and patterns to detect PII, at the expense of requiring more resources to do so.

## Setup

The `pii` module is included when installing the `all` extra, so let's install it. This module depends on `presidio`, which in turn requires `spacy`. Additional models, such as spaCy's `en_core_web_lg`, will also be installed automatically.

In [None]:
%pip install langkit[all]==0.0.29

## Extracting PII from Prompts and Responses

To extract PII entities from your prompts and responses, import the module and use it with langkit's `extract`. Let's define some examples to test it out.


In [6]:
prompts = [
    "Hello, my name is David Johnson and I live in Maine. \
    My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.",
    "On September 18 I visited microsoft.com and sent an email to test@presidio.site,  from the IP 192.168.0.1.",
    "My passport: 191280342 and my phone number: (212) 555-1234.",
    "This is a valid International Bank Account Number: IL150120690000003111111 . \
    Can you please check the status on bank account 954567876544?",
    "Kate's social security number is 078-05-1126.  Her driver license? it is 1234567A.",
    "Hi, My name is John.",
]


You can use `extract` by passing a single prompt and/or response in a dictionary, like this:

In [7]:
from langkit import extract, pii

data = {"prompt": prompts[0],
        "response": prompts[-1]}
result = extract(data)

result

{'prompt': 'Hello, my name is David Johnson and I live in Maine.     My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.',
 'response': 'Hi, My name is John.',
 'prompt.pii_presidio.result': '[{"type": "CREDIT_CARD", "start": "82", "end": "101", "score": "1.0"}, {"type": "CRYPTO", "start": "129", "end": "163", "score": "1.0"}]',
 'prompt.pii_presidio.entities_count': 2,
 'response.pii_presidio.result': '[]',
 'response.pii_presidio.entities_count': 0}

The entities to search for are defined in the `PII_entities.json` file in the Langkit folder. You can also pass your own list of entities when initializing the `pii` module.

Let's write a file called `my_custom_entities.json` and then use it to initialize the `pii` module.

In [10]:
%%writefile my_custom_entities.json

{
  "entities": ["CREDIT_CARD"]
}

Overwriting my_custom_entities.json


In [9]:
pii.init(entities_file_path="my_custom_entities.json")
data = {"prompt": prompts[0],
        "response": prompts[-1]}
result = extract(data)

result

{'prompt': 'Hello, my name is David Johnson and I live in Maine.     My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.',
 'response': 'Hi, My name is John.',
 'prompt.pii_presidio.result': '[{"type": "CREDIT_CARD", "start": "82", "end": "101", "score": "1.0"}]',
 'prompt.pii_presidio.entities_count': 1,
 'response.pii_presidio.result': '[]',
 'response.pii_presidio.entities_count': 0}

Note that, in the cell above, only the `CREDIT_CARD` entity is detected, since it is the only one defined in `my_custom_entities.json`.
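
Outside a notebook, where the `%%writefile` magic is not available, the same file can be produced with the standard `json` module. The sketch below assumes only the file layout shown above (a single `entities` key holding a list of entity names). The helper name `write_entities_file` is hypothetical, and the `pii.init` call is left as a comment because it requires Langkit to be installed.

```python
import json
import os
import tempfile


def write_entities_file(path, entities):
    """Write a custom entities file in the layout shown above (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"entities": list(entities)}, f, indent=2)
    return path


# A temporary directory keeps the example from touching the working directory.
tmp_dir = tempfile.mkdtemp()
entities_path = write_entities_file(
    os.path.join(tmp_dir, "my_custom_entities.json"), ["CREDIT_CARD"]
)

# Read the file back to confirm its contents.
with open(entities_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded)

# With Langkit installed, the file would then be used as in the cell above:
# pii.init(entities_file_path=entities_path)
```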

### Supported Entities

You can check the list of supported entities in Presidio's documentation [here](https://microsoft.github.io/presidio/supported_entities/).

## Extracting PII from a Pandas DataFrame

You can also use the `pii` module to extract PII from a Pandas DataFrame. The `extract` function takes a Pandas DataFrame as input and returns a DataFrame that includes the additional PII columns.

In [11]:
import pandas as pd

pii.init()
data = pd.DataFrame({"prompt": prompts, "response": prompts})

result = extract(data)

result

| | prompt | response | prompt.pii_presidio.result | prompt.pii_presidio.entities_count | response.pii_presidio.result | response.pii_presidio.entities_count |
|---|---|---|---|---|---|---|
| 0 | Hello, my name is David Johnson and I live in ... | Hello, my name is David Johnson and I live in ... | [{"type": "CREDIT_CARD", "start": "82", "end":... | 2 | [{"type": "CREDIT_CARD", "start": "82", "end":... | 2 |
| 1 | On September 18 I visited microsoft.com and se... | On September 18 I visited microsoft.com and se... | [{"type": "IP_ADDRESS", "start": "94", "end": ... | 3 | [{"type": "IP_ADDRESS", "start": "94", "end": ... | 3 |
| 2 | My passport: 191280342 and my phone number: (2... | My passport: 191280342 and my phone number: (2... | [{"type": "PHONE_NUMBER", "start": "44", "end"... | 5 | [{"type": "PHONE_NUMBER", "start": "44", "end"... | 5 |
| 3 | This is a valid International Bank Account Num... | This is a valid International Bank Account Num... | [{"type": "IBAN_CODE", "start": "51", "end": "... | 3 | [{"type": "IBAN_CODE", "start": "51", "end": "... | 3 |
| 4 | Kate's social security number is 078-05-1126. ... | Kate's social security number is 078-05-1126. ... | [{"type": "US_SSN", "start": "33", "end": "44"... | 3 | [{"type": "US_SSN", "start": "33", "end": "44"... | 3 |
| 5 | Hi, My name is John. | Hi, My name is John. | [] | 0 | [] | 0 |
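
Because each `pii_presidio.result` value is a JSON string, one way to summarize a batch is to decode every value and tally the entity types. The sketch below uses only the standard library and the two result strings shown in full earlier in this example. `count_entity_types` is a hypothetical helper, not part of Langkit; with a real DataFrame, the input would presumably be the values of a column such as `result["prompt.pii_presidio.result"]`.

```python
import json
from collections import Counter


def count_entity_types(result_strings):
    """Tally entity types across an iterable of JSON result strings."""
    counts = Counter()
    for raw in result_strings:
        for entity in json.loads(raw):
            counts[entity["type"]] += 1
    return counts


# Result strings copied from the single-prompt output shown earlier.
sample_results = [
    '[{"type": "CREDIT_CARD", "start": "82", "end": "101", "score": "1.0"}, '
    '{"type": "CRYPTO", "start": "129", "end": "163", "score": "1.0"}]',
    "[]",
]

type_counts = count_entity_types(sample_results)
print(type_counts)
```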


## Results Format
For each column (prompt or response), `pii` will generate two additional columns:

__`pii_presidio.result`__

This is a list of dictionaries, where each dictionary represents a single detected entity. In it, we have the following keys:

- type: the type of entity detected
- start: the start index of the entity in the text
- end: the end index of the entity in the text
- score: the confidence score of the entity

The result is provided as a JSON-formatted string.

__`pii_presidio.entities_count`__

Contains the total number of entities detected in the text. It's equal to the length of the list in `pii_presidio.result`.
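
To make the format concrete, here is a minimal sketch that decodes one result string and uses the offsets to recover the matched text. It reuses the first prompt and its result from the earlier output; `parse_entities` is a hypothetical helper, not part of Langkit. In that output, `start`, `end`, and `score` are themselves strings, so the sketch converts them to numbers before using them.

```python
import json

# First prompt from the examples; the five spaces come from the line
# continuation inside the original string literal.
text = (
    "Hello, my name is David Johnson and I live in Maine."
    + " " * 5
    + "My credit card number is 4095-2609-9393-4932 "
    + "and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ."
)

# Value of 'prompt.pii_presidio.result' copied from the output shown earlier.
raw_result = (
    '[{"type": "CREDIT_CARD", "start": "82", "end": "101", "score": "1.0"}, '
    '{"type": "CRYPTO", "start": "129", "end": "163", "score": "1.0"}]'
)


def parse_entities(raw, source_text):
    """Decode a result string and attach the matched substring to each entity."""
    entities = []
    for item in json.loads(raw):
        start, end = int(item["start"]), int(item["end"])
        entities.append(
            {
                "type": item["type"],
                "start": start,
                "end": end,
                "score": float(item["score"]),
                "value": source_text[start:end],
            }
        )
    return entities


entities = parse_entities(raw_result, text)
for entity in entities:
    print(entity["type"], entity["value"])

# The entities count equals the number of decoded entities.
entities_count = len(entities)
```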
