# DPK Example: Remove Personally Identifiable Information (PII)

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/data-prep-kit/pii_1.ipynb)

 This notebook illustrates how to use Data Prep Kit's [PII Redactor transform](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/pii_redactor) to remove personally identifiable information (PII) from documents.

 References and credits

 - https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/language/pii_redactor
 - https://github.com/data-prep-kit/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb
  

## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 1.2 - Install dependencies if running on Google Colab

In [2]:
## Download any code files we may need

if RUNNING_IN_COLAB:
    !wget -O 'file_utils.py'   'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data-prep-kit/file_utils.py'

In [3]:
%%capture
# %%time

import os

if RUNNING_IN_COLAB:
  # setup a sandbox env to avoid conflicts with colab libraries
  !pip install -q condacolab
  import condacolab
  condacolab.install()

  !conda create -n my_env python=3.11 -y
  !conda activate my_env
  !pip install  --default-timeout=100  \
        data-prep-toolkit-transforms[all]==1.1.0 \
        humanfriendly
  # terminate the current kernel, so we restart the runtime
  os.kill(os.getpid(), 9)
  ## restart the session

### 1.3 - Restart Runtime

After installing dependencies, <font color="red">restart the runtime</font> so that the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)
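
After the restart, it can help to confirm that the packages used later in this notebook are visible to the kernel. The following is a minimal sketch, not part of the original notebook: the helper `find_missing_modules` is a hypothetical name, and the module list assumes the two top-level packages imported in Step-4 and Step-5 (`dpk_pdf2parquet` and `dpk_pii_redactor`).

```python
import importlib.util

# Top-level module names imported later in this notebook (Step-4 and Step-5).
REQUIRED_MODULES = ["dpk_pdf2parquet", "dpk_pii_redactor"]


def find_missing_modules(module_names):
    """Return the names that cannot be located on the current Python path.

    Hypothetical helper: find_spec() only looks the module up, it does not import it.
    """
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules (install them, then restart the runtime):", missing)
else:
    print("All required modules are importable")
```

If the list of missing modules is not empty, re-running the install cell and restarting the session again is the usual next step.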

## Step-2: Configuration  & Utils

### 2.1 - Basic Config

In [4]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Set up input/output directories

In [5]:
## setup path to utils folder
import sys
sys.path.append('../utils')

In [6]:
# Route Hugging Face downloads through a mirror.
# Comment out the next two lines if https://huggingface.co/ is reachable directly.
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
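
Instead of always switching to the mirror, the endpoint could be chosen based on whether the default host responds. The sketch below is an assumption-laden illustration: `can_reach` and `select_hf_endpoint` are hypothetical helpers, and it assumes the setting needs to be in place before any Hugging Face library is imported, since the endpoint is typically read at import time.

```python
import os
import urllib.error
import urllib.request


def can_reach(url, timeout=5.0):
    """Return True if an HTTP request to `url` gets any response from the server."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return False


def select_hf_endpoint(reachable, env, mirror="https://hf-mirror.com"):
    """Set HF_ENDPOINT in `env` to the mirror only when the default host is unreachable."""
    if not reachable:
        env["HF_ENDPOINT"] = mirror
    return env.get("HF_ENDPOINT")


# In the notebook this would be called once, before importing the transforms:
# select_hf_endpoint(can_reach("https://huggingface.co"), os.environ)
```

Passing the environment mapping as an argument keeps the helper easy to try out with a plain dictionary before touching `os.environ`.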

In [7]:
import os, sys
import shutil

if RUNNING_IN_COLAB:
    input_dir = "input/"
    os.makedirs(input_dir, exist_ok=True)
else:
    input_dir = "../data/pii/"

output_dir = "output"
output_pdf2pq_dir = os.path.join (output_dir, '01_pdf2pq_out')
output_pii_dir = os.path.join (output_dir, '02_pii_out')
output_md_dir = os.path.join (output_dir, "03_md")


## clear output folder
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_md_dir, exist_ok=True)
print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: Inspect the Data

We will use an invoice PDF.  The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/invoices)

- [invoice-3.pdf](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/invoices/invoice-3.pdf)

### 3.1 - Download Data

In [8]:
from file_utils import download_file

if RUNNING_IN_COLAB:
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/pii/invoice-3.pdf', os.path.join(input_dir, 'invoice-3.pdf'))
else:
    print ('input : ', input_dir)

input :  ../data/pii/


## Step-4: Extract Data from PDF (pdf2parquet)

In this step, we read the PDF files and extract their text.

[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)

We use the [Docling package](https://github.com/DS4SD/docling).


In [9]:
%%time

from dpk_pdf2parquet.transform_python import Pdf2Parquet
from dpk_pdf2parquet.transform import pdf2parquet_contents_types

print (f"🏃🏼 Processing input='{input_dir}' --> output='{output_pdf2pq_dir}'\n", flush=True)

result = Pdf2Parquet(input_folder= input_dir,
                    output_folder= output_pdf2pq_dir,
                    data_files_to_use=['.pdf'],
                    pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,   # markdown
                    ).transform()

if result == 0:
    print (f"✅ Operation completed successfully")
else:
    raise Exception (f"❌ Operation  failed")

🏃🏼 Processing input='../data/pii/' --> output='output/01_pdf2pq_out'



22:32:07 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
22:32:07 INFO - pipeline id pipeline_id
22:32:07 INFO - code location None
22:32:07 INFO - data factory data_ is using local data access: input_folder - ../data/pii/ output_folder - output/01_pdf2pq_out
22:32:07 INFO - data factory data_ max_files -1, n_sample -1
22:32:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
22:32:07 INFO - orchestrator pdf2parquet started at 2025-03-25 22:32:07
22:32:07 INFO - Number of files is 1, source profile {'max_file_size': 0.03161430358886719, 'min_file_size': 0.03

✅ Operation completed successfully
CPU times: user 7.73 s, sys: 1.41 s, total: 9.15 s
Wall time: 10.9 s


### 4.2 - Inspect Generated output


In [10]:
from file_utils import read_parquet_files_as_df

print ("Displaying contents of : ", output_pdf2pq_dir)
output_df = read_parquet_files_as_df(output_pdf2pq_dir)
output_df


Displaying contents of :  output/01_pdf2pq_out
Successfully read 1 parquet files with 1 total rows


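|   | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|----------|----------|-----------|------------|------------------|-------------|---------------|-----|------|------|---------------|------------------|-----------------|
| 0 | invoice-3.pdf | ## INVOICE\n\nApple Inc.\n\nInvoice Details:\n... | 2 | 1 | 26 | 9ca5e70e-7494-49e5-ba5a-1c90597cffd5 | 4603599332662865930 | pdf | c826f7ef2880228f98fec86be400d018710d335706bef2... | 1063 | 2025-03-25T22:32:13.489930 | 1.078686 | invoice-3.pdf |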


In [11]:
print (output_df[output_df['filename'] == 'invoice-3.pdf'].iloc[0,]['contents'])

## INVOICE

Apple Inc.

Invoice Details:

Invoice Number: INV-2024-001

Invoice Date: November 15, 2024

Due Date: November 30, 2024

Billing Information:

Customer Name: John Doe

Address: 123 Elm Street, Apt 45, Springfield, IL 62704

Email: john.doe@example.com

Phone: +1-312-555-7890

Shipping Information:

Recipient Name: John Doe

Address: 123 Elm Street, Apt 45, Springfield, IL 62704

## Item Details:

| Description               | Quantity   | Unit Price   | Total                               |
|---------------------------|------------|--------------|-------------------------------------|
| MacBook Air (13-inch, M2) | 1          | $999.00      | $999.00                             |
| 1                         |            | $199.00      | AppleCare+ for MacBook Air  $199.00 |

Payment Method: Credit Card (Visa)

Transaction ID: 9876543210ABCDE

Notes:

Thank you for your purchase!

For assistance, please contact our support team at support@apple.com or 1-800-MY-APPLE.

Subtot

## Step-5: PII Redactor

[PII Documentation](https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/language/pii_redactor/)

Entities are detected using [spaCy](https://spacy.io/). The run log for this step also shows a Flair NER model (`flair/ner-english-large`) being loaded.

**Configure the transform parameters.**

`--pii_redactor_entities PII_ENTITIES`: list of PII entities to capture, for example `["PERSON", "EMAIL_ADDRESS"]`

Supported entities:

- PERSON: Names of individuals
- EMAIL_ADDRESS: Email addresses
- ORGANIZATION: Names of organizations
- DATE_TIME: Dates and times
- PHONE_NUMBER: Phone numbers
- CREDIT_CARD: Credit card numbers
- LOCATION: Names of places, such as streets, cities, and states (used in the example below)
- More [entities](https://microsoft.github.io/presidio/supported_entities/)

`--pii_redactor_operator REDACTOR_OPERATOR`: the redaction technique to apply. Two are supported: `replace` (the default) and `redact`.

`--pii_redactor_transformed_contents PII_TRANSFORMED_CONTENT_COLUMN_NAME`: name of the column to which the transformed contents are written. This argument is required.

`--pii_redactor_score_threshold SCORE_THRESHOLD`: the minimum confidence score an entity must reach to be considered a match. Provide a value above 0.6.
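
To make the operator and threshold parameters concrete, here is a toy sketch in plain Python. It is not the transform's implementation: the `Detection` class, the `find_span` and `apply_operator` helpers, the hand-written scores, and the default threshold of 0.6 are all assumptions made for illustration. It treats `replace` as substituting an `<ENTITY>` placeholder (as seen in the output of this notebook) and `redact` as removing the matched text.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """A hypothetical detected span: [start, end) offsets, entity type, confidence."""
    start: int
    end: int
    entity: str
    score: float


def find_span(text, value, entity, score):
    """Build a Detection for the first occurrence of `value` (stand-in for a real detector)."""
    start = text.index(value)
    return Detection(start, start + len(value), entity, score)


def apply_operator(text, detections, operator="replace", score_threshold=0.6):
    """Apply a toy version of the 'replace' or 'redact' operator to `text`."""
    if operator not in ("replace", "redact"):
        raise ValueError(f"unsupported operator: {operator}")
    # Ignore detections whose confidence is below the threshold.
    kept = [d for d in detections if d.score >= score_threshold]
    # Work right to left so earlier offsets stay valid after each edit.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        substitute = f"<{d.entity}>" if operator == "replace" else ""
        text = text[:d.start] + substitute + text[d.end:]
    return text


sample = "Customer Name: John Doe, Email: john.doe@example.com"
detections = [
    find_span(sample, "John Doe", "PERSON", 0.85),               # made-up score
    find_span(sample, "john.doe@example.com", "EMAIL_ADDRESS", 1.0),  # made-up score
]

print(apply_operator(sample, detections, operator="replace"))
print(apply_operator(sample, detections, operator="redact"))
print(apply_operator(sample, detections, operator="replace", score_threshold=0.9))
```

Raising the threshold trades recall for precision: in the last call the lower-scoring name is left in place while the email address is still replaced.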

In [12]:
from dpk_pii_redactor.transform_python import PIIRedactor

result = PIIRedactor(input_folder=output_pdf2pq_dir,
            output_folder= output_pii_dir,
            pii_redactor_entities = ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "LOCATION" ],
            pii_redactor_operator = "replace",
            pii_redactor_transformed_contents = "redacted_contents").transform()

if result == 0:
    print (f"✅ Operation completed successfully")
else:
    raise Exception (f"❌ Operation  failed")

22:32:14 INFO - pipeline id pipeline_id
22:32:14 INFO - code location None
22:32:14 INFO - data factory data_ is using local data access: input_folder - output/01_pdf2pq_out output_folder - output/02_pii_out
22:32:14 INFO - data factory data_ max_files -1, n_sample -1
22:32:14 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
22:32:14 INFO - orchestrator pii_redactor started at 2025-03-25 22:32:14
22:32:14 INFO - Number of files is 1, source profile {'max_file_size': 0.012164115905761719, 'min_file_size': 0.012164115905761719, 'total_file_size': 0.012164115905761719}
22:32:14 INFO - Loading model from flair/ner-english-large


2025-03-25 22:32:24,688 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


22:32:25 INFO - Completed 1 files (100.0%) in 0.002 min
22:32:25 INFO - Done processing 1 files, waiting for flush() completion.
22:32:25 INFO - done flushing in 0.0 sec
22:32:25 INFO - Completed execution in 0.19 min, execution result 0


✅ Operation completed successfully


### 5.2 - Inspect Generated output

Let's see the output

In [13]:
from file_utils import read_parquet_files_as_df

print ("Displaying contents of : ", output_pii_dir)
output_df = read_parquet_files_as_df(output_pii_dir)
output_df


Displaying contents of :  output/02_pii_out
Successfully read 1 parquet files with 1 total rows


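|   | detected_pii | redacted_contents | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|--------------|-------------------|----------|----------|-----------|------------|------------------|-------------|---------------|-----|------|------|---------------|------------------|-----------------|
| 0 | [PERSON, LOCATION, LOCATION, LOCATION, EMAIL_A... | ## INVOICE\n\nApple Inc.\n\nInvoice Details:\n... | invoice-3.pdf | ## INVOICE\n\nApple Inc.\n\nInvoice Details:\n... | 2 | 1 | 26 | 9ca5e70e-7494-49e5-ba5a-1c90597cffd5 | 4603599332662865930 | pdf | c826f7ef2880228f98fec86be400d018710d335706bef2... | 1063 | 2025-03-25T22:32:13.489930 | 1.078686 | invoice-3.pdf |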


In [14]:
print (output_df[output_df['filename'] == 'invoice-3.pdf'].iloc[0,]['redacted_contents'])

## INVOICE

Apple Inc.

Invoice Details:

Invoice Number: INV-2024-001

Invoice Date: November 15, 2024

Due Date: November 30, 2024

Billing Information:

Customer Name: <PERSON>

Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704

Email: <EMAIL_ADDRESS>

Phone: <PHONE_NUMBER>

Shipping Information:

Recipient Name: <PERSON>

Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704

## Item Details:

| Description               | Quantity   | Unit Price   | Total                               |
|---------------------------|------------|--------------|-------------------------------------|
| MacBook Air (13-inch, M2) | 1          | $999.00      | $999.00                             |
| 1                         |            | $199.00      | AppleCare+ for MacBook Air  $199.00 |

Payment Method: Credit Card (Visa)

Transaction ID: 9876543210ABCDE

Notes:

Thank you for your purchase!

For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.
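
The redacted text above still contains some items a reviewer may want to look at, such as the street number, the ZIP code, and the `1-800-MY-APPLE` support line. One quick way to summarize what was replaced is to count the `<ENTITY>` placeholders. This is a minimal sketch: `count_placeholders` is a hypothetical helper, and the sample string is copied from a few lines of the output shown above.

```python
import re
from collections import Counter

# Matches placeholders such as <PERSON> or <EMAIL_ADDRESS>.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def count_placeholders(redacted_text):
    """Count how often each entity placeholder appears in the redacted text."""
    return Counter(PLACEHOLDER.findall(redacted_text))


sample = (
    "Customer Name: <PERSON>\n"
    "Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n"
    "Email: <EMAIL_ADDRESS>\n"
    "Phone: <PHONE_NUMBER>\n"
)

counts = count_placeholders(sample)
for entity, count in counts.most_common():
    print(f"{entity}: {count}")
```

In the notebook, the same helper could be applied to `output_df.iloc[0]['redacted_contents']` and compared against the `detected_pii` column.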



## Step-6: Save as MD

In [15]:
## save markdown text
from file_utils import read_parquet_files_as_df

df = read_parquet_files_as_df(output_pii_dir)

for index, row in df.iterrows():
    # Write each redacted document as a Markdown file, e.g. 'invoice-3.pdf.md'
    output_file_name = os.path.join(output_md_dir, row['filename'] + '.md')
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(row['redacted_contents'])

print (f"✅ Saved CLEAN markdown output to '{output_md_dir}'")

Successfully read 1 parquet files with 1 total rows
✅ Saved CLEAN markdown output to 'output/03_md'
