# Amazon Comprehend PII detection

Amazon Comprehend has added new capabilities to detect PII entities in text data. In this notebook, we will explore different ways to access and use the Comprehend PII detection service.


        
## Overview

1. [PII detection via Console](#console)
1. [PII detection via CLI](#cli)
1. [Async APIs to Redact PII Entities](#redact)
1. [Async APIs to Redact / Mask PII Entities](#mask)
1. [Cleanup](#cleanup)

## PII detection via Console <a class="anchor" id="console"/>

To get started with Amazon Comprehend, all you need is an [AWS account](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/).

In the [Amazon Comprehend console](https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#home), go to the **Input text** section and select the **Built-in** analysis type. Paste the following text into **Input text** and click **Analyze**:

```
   Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away. Let's check.
```

1. Which entities do you see detected under the **Insights** `PII` tab?


2. Examine the JSON response for one of these entities so you can see how `BeginOffset` and `EndOffset` could be used to highlight text.
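
As a rough illustration of the second question, the sketch below shows one way `BeginOffset` and `EndOffset` could be used to pull out and highlight entity text. It assumes the offsets are 0-based character positions with an exclusive end, which matches Python slicing. The `highlight` helper and the `[[ ]]` markers are hypothetical names chosen for this example, and the sample text is only the opening of the input above.

```python
# Opening of the sample input; offsets are character positions in this string.
text = (
    "Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel "
    "like sharing a whole lot of personal information with you. Let's start "
    "with my Email address SerdarvanBokhorst@dayrep.com."
)

# Entity records shaped like the items of the "Entities" array in the response.
entities = [
    {"Type": "NAME", "BeginOffset": 36, "EndOffset": 55},
    {"Type": "EMAIL", "BeginOffset": 167, "EndOffset": 195},
]


def highlight(text, entities, left="[[", right="]]"):
    """Wrap every entity span in markers (hypothetical helper)."""
    out = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = ent["BeginOffset"], ent["EndOffset"]
        out = out[:begin] + left + out[begin:end] + right + out[end:]
    return out


# Slice the original text to recover what each entity covers.
spans = [(e["Type"], text[e["BeginOffset"]:e["EndOffset"]]) for e in entities]
highlighted = highlight(text, entities)

print(spans)
print(highlighted)
```

Processing the spans back to front is the key design choice here: inserting markers changes the length of the string, so editing from the end keeps the remaining offsets correct.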

## PII detection via CLI  <a class="anchor" id="cli"/>

Let's try to use the [AWS CLI](https://aws.amazon.com/cli/) for PII detection.


1. Confirm that the AWS CLI is set up and configured, for example by running `aws sagemaker list-notebook-instances`.

In [22]:
#!aws sagemaker list-notebook-instances

2. Now let's try to identify PII entities using the command line.

In [4]:
!aws comprehend detect-pii-entities \
--language-code en --text \
"Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check."

{
    "Entities": [
        {
            "Score": 0.9970920085906982,
            "Type": "NAME",
            "BeginOffset": 36,
            "EndOffset": 55
        },
        {
            "Score": 0.9974018335342407,
            "Type": "EMAIL",
            "BeginOffset": 167,
            "EndOffset": 195
        },
        {
            "Score": 0.9999964237213135,
            "Type": "ADDRESS",
            "BeginOffset": 211,
            "EndOffset": 245
        },
        {
            "Score": 0.9999964237213135,
            "Type": "PHONE",
            "BeginOffset": 265,
            "EndOffset": 277
        },
        {
            "Score": 0.9999970197677612,
            "Type": "SSN",
            "BeginOffset": 308,
            "EndOffset": 319
        },
        {
            "Score": 0.9999761581420898,
            "Type": "BANK_ACCOUNT_NUMBER",
            "BeginOffset": 347,
            "EndOffset": 359
        },
        {
            "Score": 0.9999786615371704,
      

Install `jq` to parse the output. `jq` is a lightweight and flexible command-line JSON processor.

In [9]:
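# Install jq (alternatively, open a new terminal and run these commands there)
!apt-get update
# -y answers the confirmation prompt so the cell does not wait for input
!apt-get install -y jq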

Hit:1 http://security.debian.org/debian-security buster/updates InRelease
Hit:2 http://deb.debian.org/debian buster InRelease
Hit:3 http://deb.debian.org/debian buster-updates InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
jq is already the newest version (1.5+dfsg-2+b1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [10]:
!aws comprehend detect-pii-entities \
--language-code en --text \
"Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check." \
| jq -r '.Entities[] |  .Type '

NAME
EMAIL
ADDRESS
PHONE
SSN
BANK_ACCOUNT_NUMBER
BANK_ROUTING
CREDIT_DEBIT_NUMBER
CREDIT_DEBIT_EXPIRY
CREDIT_DEBIT_CVV
PIN
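
The same filtering can be done in Python instead of `jq`. The sketch below is a minimal example, assuming `boto3` is installed and credentials are configured for the live call. The `detect_pii` and `entity_types` helpers and the `min_score` threshold are hypothetical additions for illustration, and `sample_response` reuses the first three entities from the CLI output above so that the helper can be tried offline.

```python
def detect_pii(text, language_code="en"):
    """Call Comprehend's DetectPiiEntities API (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper below also works offline

    client = boto3.client("comprehend")
    return client.detect_pii_entities(Text=text, LanguageCode=language_code)


def entity_types(response, min_score=0.0):
    """Return entity types in response order, like jq '.Entities[] | .Type'."""
    return [e["Type"] for e in response["Entities"] if e["Score"] >= min_score]


# First three entities copied from the CLI response shown earlier.
sample_response = {
    "Entities": [
        {"Score": 0.9970920085906982, "Type": "NAME", "BeginOffset": 36, "EndOffset": 55},
        {"Score": 0.9974018335342407, "Type": "EMAIL", "BeginOffset": 167, "EndOffset": 195},
        {"Score": 0.9999964237213135, "Type": "ADDRESS", "BeginOffset": 211, "EndOffset": 245},
    ]
}

print(entity_types(sample_response))
# Optionally keep only high-confidence detections.
print(entity_types(sample_response, min_score=0.999))
```

With credentials in place, `entity_types(detect_pii(text))` would be the live equivalent of the piped CLI command.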


## Async APIs to Redact PII Entities <a class="anchor" id="redact"/>

Let's look at the input content we want to redact. When redacting, we will replace each PII entity with the name of its entity type.


In [11]:
!aws s3 cp s3://ai-ml-services-lab/public/labs/comprehend/pii/input/redact/pii-s3-input.txt -

Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check.

### Async request
1. Using the async APIs, we can redact the content of an input file in S3.

In [15]:
!aws comprehend start-pii-entities-detection-job \
 --input-data-config S3Uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/input/redact/pii-s3-input.txt"  \
 --output-data-config S3Uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/"  \
 --mode "ONLY_REDACTION" \
 --redaction-config PiiEntityTypes="BANK_ACCOUNT_NUMBER","BANK_ROUTING","CREDIT_DEBIT_NUMBER","CREDIT_DEBIT_CVV","CREDIT_DEBIT_EXPIRY","PIN","EMAIL","ADDRESS","NAME","PHONE","SSN",MaskMode="REPLACE_WITH_PII_ENTITY_TYPE" \
 --data-access-role-arn "arn:aws:iam::<ACCT>:role/ComprehendBucketAccessRole" \
 --job-name "comprehend-blog-redact-001" \
 --language-code "en"

{
    "JobId": "1fbe531aafad163b2fd3bf7287525482",
    "JobStatus": "SUBMITTED"
}


2. Monitor the redaction job, using the `JobId` returned by the request above.

In [23]:
#!aws comprehend describe-pii-entities-detection-job --job-id "1fbe531aafad163b2fd3bf7287525482"
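
The CLI request in step 1 could also be submitted from Python. The sketch below only builds the keyword arguments for boto3's `start_pii_entities_detection_job`, mirroring the CLI flags used above. `build_redaction_job_args` is a hypothetical helper, and the role ARN keeps the same `<ACCT>` placeholder as the CLI example, which you would replace with your own account ID.

```python
# The same entity types that the CLI request lists in --redaction-config.
PII_TYPES = [
    "BANK_ACCOUNT_NUMBER", "BANK_ROUTING", "CREDIT_DEBIT_NUMBER",
    "CREDIT_DEBIT_CVV", "CREDIT_DEBIT_EXPIRY", "PIN",
    "EMAIL", "ADDRESS", "NAME", "PHONE", "SSN",
]


def build_redaction_job_args(input_uri, output_uri, role_arn, job_name,
                             mask_mode="REPLACE_WITH_PII_ENTITY_TYPE",
                             mask_character="*"):
    """Build kwargs for start_pii_entities_detection_job (hypothetical helper)."""
    redaction = {"PiiEntityTypes": list(PII_TYPES), "MaskMode": mask_mode}
    # MaskCharacter only applies when the mask mode is MASK.
    if mask_mode == "MASK":
        redaction["MaskCharacter"] = mask_character
    return {
        "InputDataConfig": {"S3Uri": input_uri},
        "OutputDataConfig": {"S3Uri": output_uri},
        "Mode": "ONLY_REDACTION",
        "RedactionConfig": redaction,
        "DataAccessRoleArn": role_arn,
        "JobName": job_name,
        "LanguageCode": "en",
    }


args = build_redaction_job_args(
    input_uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/input/redact/pii-s3-input.txt",
    output_uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/",
    role_arn="arn:aws:iam::<ACCT>:role/ComprehendBucketAccessRole",  # placeholder
    job_name="comprehend-blog-redact-001",
)
print(args["RedactionConfig"]["MaskMode"])

# To submit for real (needs boto3, credentials, and a valid role ARN):
#   import boto3
#   response = boto3.client("comprehend").start_pii_entities_detection_job(**args)
#   print(response["JobId"], response["JobStatus"])
```

Keeping the argument construction separate from the API call makes it easy to inspect the request before anything is sent.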

### Output
Let's look at the output.

In [24]:
#!aws s3 cp s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/<acct>-PII-1fbe531aafad163b2fd3bf7287525482/output/pii-s3-input.txt.out -
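
The output object path above appears to follow the pattern `<output prefix>/<account>-PII-<job id>/output/<input file name>.out`. The sketch below assembles such a URI. The pattern is inferred only from the paths shown in this notebook, so treat it as an assumption, and `redacted_output_uri` is a hypothetical helper. The `OutputDataConfig` returned by `describe-pii-entities-detection-job` is the more reliable source for the actual location.

```python
def redacted_output_uri(output_prefix, account_id, job_id, input_name):
    """Assemble the redacted file's S3 URI (pattern inferred from this notebook)."""
    prefix = output_prefix.rstrip("/")
    return f"{prefix}/{account_id}-PII-{job_id}/output/{input_name}.out"


uri = redacted_output_uri(
    output_prefix="s3://ai-ml-services-lab/public/labs/comprehend/pii/output/redact/",
    account_id="<acct>",  # placeholder, as in the command above
    job_id="1fbe531aafad163b2fd3bf7287525482",
    input_name="pii-s3-input.txt",
)
print(uri)
```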
    

## Async APIs to Redact / Mask PII Entities <a class="anchor" id="mask"/>

Let's look at the input content we want to redact. This time, we will replace each character of a PII entity with the mask character `*`.


In [18]:
!aws s3 cp s3://ai-ml-services-lab/public/labs/comprehend/pii/input/mask/pii-s3-input.txt -

Good morning, everybody. My name is Van Bokhorst Serdar, and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address SerdarvanBokhorst@dayrep.com. My address is 2657 Koontz Lane, Los Angeles, CA. My phone number is 818-828-6231. My Social security number is 548-95-6370. My Bank account number is 940517528812 and routing number 195991012. My credit card number is 5534816011668430, Expiration Date 6/1/2022, my C V V code is 121, and my pin 123456. Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check.

### Async request

1. Using the async APIs, we can redact the content of an input file in S3 and mask the redacted content.

In [19]:
!aws comprehend start-pii-entities-detection-job \
 --input-data-config S3Uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/input/mask/pii-s3-input.txt"  \
 --output-data-config S3Uri="s3://ai-ml-services-lab/public/labs/comprehend/pii/output/mask/"  \
 --mode "ONLY_REDACTION" \
 --redaction-config PiiEntityTypes="BANK_ACCOUNT_NUMBER","BANK_ROUTING","CREDIT_DEBIT_NUMBER","CREDIT_DEBIT_CVV","CREDIT_DEBIT_EXPIRY","PIN","EMAIL","ADDRESS","NAME","PHONE","SSN",MaskMode="MASK",MaskCharacter="*" \
 --data-access-role-arn "arn:aws:iam::<ACCT>:role/ComprehendBucketAccessRole" \
 --job-name "comprehend-blog-redact-mask-001" \
 --language-code "en"

{
    "JobId": "960d6d5347840302b722edd115fb8195",
    "JobStatus": "SUBMITTED"
}


2. Monitor the masking job, substituting the `JobId` returned by your own request.

In [25]:
#!aws comprehend describe-pii-entities-detection-job --job-id "46e49284a3ea037d48f80371c053bf74"
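
Instead of re-running the describe command by hand, the job status could be polled from Python. The sketch below assumes a boto3 Comprehend client is passed in and relies on the `JobStatus` field of `PiiEntitiesDetectionJobProperties` in the describe response. `wait_for_job`, its polling interval, and its retry limit are hypothetical choices for this example.

```python
import time

# Job states after which no further progress is expected.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll a PII detection job until it reaches a terminal state (hypothetical helper)."""
    for _ in range(max_polls):
        response = client.describe_pii_entities_detection_job(JobId=job_id)
        props = response["PiiEntitiesDetectionJobProperties"]
        if props["JobStatus"] in TERMINAL_STATES:
            return props
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")


# Example usage (needs boto3 and AWS credentials; use your own job ID):
#   import boto3
#   props = wait_for_job(boto3.client("comprehend"), "<JOB_ID>")
#   print(props["JobStatus"], props["OutputDataConfig"]["S3Uri"])
```

Passing the client and the `sleep` function as parameters keeps the helper easy to try with a stub client before pointing it at a real job.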

### Output
Let's look at the output.

In [21]:
!aws s3 cp s3://ai-ml-services-lab/public/labs/comprehend/pii/output/mask/<Acct>-PII-46e49284a3ea037d48f80371c053bf74/output/pii-s3-input.txt.out -

Good morning, everybody. My name is ******************** and today I feel like sharing a whole lot of personal information with you. Let's start with my Email address ***************************** My address is ********************************** My phone number is ************* My Social security number is ************ My Bank account number is ************ and routing number ********** My credit card number is ***************** Expiration Date ********* my C V V code is **** and my pin ******* Well, I think that's it. You know a whole lot about me. And I hope that Amazon comprehend is doing a good job at identifying PII entities so you can redact my personal information away from this document. Let's check.
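
To build intuition for the two mask modes, the sketch below applies them locally to a short string using entity offsets. It is only an approximation of what the service does: the `redact` helper is hypothetical, the `[TYPE]` bracket format for `REPLACE_WITH_PII_ENTITY_TYPE` is an assumption, and the job output above also appears to cover the punctuation that follows each entity, which this sketch does not reproduce.

```python
def redact(text, entities, mask_mode="MASK", mask_character="*"):
    """Apply a mask mode to entity spans locally (hypothetical helper)."""
    out = text
    # Replace from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = ent["BeginOffset"], ent["EndOffset"]
        if mask_mode == "MASK":
            # One mask character per original character.
            replacement = mask_character * (end - begin)
        elif mask_mode == "REPLACE_WITH_PII_ENTITY_TYPE":
            # Assumed output format: the entity type in square brackets.
            replacement = "[" + ent["Type"] + "]"
        else:
            raise ValueError(f"Unsupported mask mode: {mask_mode}")
        out = out[:begin] + replacement + out[end:]
    return out


# Illustrative one-sentence input; these offsets were computed for this string only.
text = "My phone number is 818-828-6231."
entities = [{"Type": "PHONE", "BeginOffset": 19, "EndOffset": 31}]

masked = redact(text, entities, mask_mode="MASK")
replaced = redact(text, entities, mask_mode="REPLACE_WITH_PII_ENTITY_TYPE")
print(masked)
print(replaced)
```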


## Cleanup <a class="anchor" id="cleanup"/>
TBD: clean up all the resources created in this notebook.