## First, let's install the packages we will need

The following libraries are used throughout the project:

- huggingface_hub
- presidio_analyzer
- presidio_anonymizer
- presidio_image_redactor
- spacy

In [1]:
!pip install -U -r requirements.txt -q
# below is a fix for HuggingFace + Tensorflow 2.13+
!pip install -U git+https://github.com/huggingface/transformers.git -q

### We are going to download and use the dslim/bert-base-NER model to augment PII detection.

_bert-base-NER is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PER) and Miscellaneous (MISC)._

In [2]:
from huggingface_hub import snapshot_download

repo_id = 'dslim/bert-base-NER'
model_id = repo_id.split('/')[-1]

snapshot_download(repo_id=repo_id, local_dir=model_id)

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

'/Users/spm1976/development/pii-analyzer-anonymizer/bert-base-NER'
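
Because `snapshot_download` writes the files into `local_dir`, a later run can check whether that directory already has content before fetching again. The following is a minimal sketch using only the standard library. The helper names `local_model_dir` and `already_downloaded` are hypothetical and not part of `huggingface_hub`, and the check only looks for a non-empty directory, not for a complete download.

```python
from pathlib import Path


def local_model_dir(repo_id: str) -> str:
    """Return the last segment of a Hub repo id ('org/name' -> 'name')."""
    return repo_id.split("/")[-1]


def already_downloaded(model_dir: str) -> bool:
    """Treat a directory as downloaded if it exists and holds at least one entry."""
    path = Path(model_dir)
    return path.is_dir() and any(path.iterdir())


repo_id = "dslim/bert-base-NER"
model_id = local_model_dir(repo_id)
print(model_id, already_downloaded(model_id))
```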

### Next we will implement our Presidio anonymizer.

_First, the base analyzer is created and initialized. Second, we create a class to extend the base analyzer instantiation._

Because the spaCy model is large, we don't want to download it every time. This code checks whether it is already installed and downloads it only if it is missing.

In [3]:
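import spacy
from spacy.cli import download

try:
  nlp_lg = spacy.load("en_core_web_lg")
except OSError:
  # spacy.load raises OSError when the model is not installed
  download("en_core_web_lg")
  nlp_lg = spacy.load("en_core_web_lg")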
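
spaCy models installed with `spacy download` are regular Python packages, so another way to test for one is to ask the import system whether the package can be found. This is a small sketch of that idea, not what the cell above does; the helper name `is_package_installed` is made up for illustration.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


# 'json' ships with Python; the spaCy model is only present if it was installed.
print(is_package_installed("json"))
print(is_package_installed("en_core_web_lg"))
```

The check does not load the model, so it is cheap to run at the top of a notebook.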


This cell imports the packages we need and defines the entity types that we want to search for and anonymize. The list can be customized. See https://microsoft.github.io/presidio/supported_entities/#list-of-supported-entities

In [4]:
from typing import List

from presidio_analyzer import AnalyzerEngine, EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

from transformers import pipeline

# load spacy model -> workaround
#import os
#os.system("spacy download en_core_web_lg")

# list of entities: https://microsoft.github.io/presidio/supported_entities/#list-of-supported-entities
DEFAULT_ANOYNM_ENTITIES = [
    "CREDIT_CARD", 
    "CRYPTO",
    "DATE_TIME",
    "EMAIL_ADDRESS",
    "IBAN_CODE",
    "IP_ADDRESS",
    "NRP",
    "LOCATION",
    "PERSON",
    "PHONE_NUMBER",
    "MEDICAL_LICENSE",
    "URL",
    "ORGANIZATION",
    "US_SSN"
]
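
`AnalyzerEngine.analyze` accepts an optional `entities` argument that restricts detection to the listed types, which is one way a list like the one above can be used. The effect is roughly the filter sketched below. Plain tuples stand in for Presidio results here, and the sample values and the helper name `keep_allowed` are illustrative only.

```python
from typing import List, Tuple

# (entity_type, start, end) stand-ins for analyzer results
Result = Tuple[str, int, int]


def keep_allowed(results: List[Result], allowed: List[str]) -> List[Result]:
    """Keep only the results whose entity type is in the allowed list."""
    allowed_set = set(allowed)
    return [r for r in results if r[0] in allowed_set]


allowed_entities = ["PERSON", "PHONE_NUMBER", "US_SSN"]
sample = [
    ("PERSON", 16, 21),
    ("US_DRIVER_LICENSE", 30, 40),
    ("PHONE_NUMBER", 46, 58),
]
print(keep_allowed(sample, allowed_entities))
```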


This is the implementation of our NER EntityRecognizer. More information on this can be found at: https://microsoft.github.io/presidio/analyzer/adding_recognizers/#extending-the-analyzer-for-additional-pii-entities

In [5]:
# implement EntityRecognizer class for HuggingFace NER model
class TransformerRecognizer(EntityRecognizer):
    """Presidio recognizer backed by a Hugging Face token-classification pipeline."""

    def __init__(
        self,
        model_id_or_path=None,
        aggregation_strategy='simple',
        supported_language='en',
        ignore_labels=None
    ):
        # labels dropped by the pipeline; MISC has no Presidio mapping below
        if ignore_labels is None:
            ignore_labels = ['0', 'O', 'MISC']

        # initialize transformers pipeline for the given model id or path
        self.pipeline = pipeline(
            "token-classification",
            model=model_id_or_path,
            aggregation_strategy=aggregation_strategy,
            ignore_labels=ignore_labels
        )

        # map model labels to presidio labels
        self.label2presidio = {
            "PER": "PERSON",
            "LOC": "LOCATION",
            "ORG": "ORGANIZATION"
        }

        # pass entities from model to parent class
        super().__init__(
            supported_entities=list(self.label2presidio.values()),
            supported_language=supported_language
        )

    def load(self):
        """No loading is required; the pipeline is created in __init__."""
        pass

    def analyze(
        self,
        text,
        entities=None,
        nlp_artifacts=None
    ):
        """Run the NER pipeline on text and return Presidio RecognizerResult objects."""
        predicted_entities = self.pipeline(text)

        results = [
            RecognizerResult(
                entity_type=self.label2presidio[e['entity_group']],
                start=e['start'],
                end=e['end'],
                score=e['score']
            )
            for e in predicted_entities
        ]

        return results
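
The `analyze` method depends on `ignore_labels` to keep unmapped labels such as `MISC` away from the `label2presidio[...]` lookup, which would otherwise raise a `KeyError`. A more defensive variant of the mapping step could skip unknown labels explicitly. The sketch below uses plain dictionaries instead of `RecognizerResult` so it runs without Presidio; the sample predictions are hand-written to resemble aggregated token-classification output and are not real model output.

```python
from typing import Dict, List

LABEL_TO_PRESIDIO = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def to_presidio_spans(predictions: List[Dict]) -> List[Dict]:
    """Convert pipeline predictions to Presidio-style spans, skipping unmapped labels."""
    spans = []
    for pred in predictions:
        entity_type = LABEL_TO_PRESIDIO.get(pred["entity_group"])
        if entity_type is None:
            continue  # label has no Presidio equivalent in this mapping
        spans.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],
                "end": pred["end"],
                "score": pred["score"],
            }
        )
    return spans


# Hand-written sample, shaped like aggregated pipeline output (assumption)
sample = [
    {"entity_group": "PER", "score": 0.94, "word": "Jones", "start": 16, "end": 21},
    {"entity_group": "MISC", "score": 0.60, "word": "Honda", "start": 30, "end": 35},
]
print(to_presidio_spans(sample))
```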

To detect PII, we use the Presidio AnalyzerEngine and register our NER EntityRecognizer in its recognizer registry. This adds model-based detection of people, locations, and organizations on top of Presidio's built-in recognizers.

In [6]:
model_dir = 'bert-base-NER' # directory that the HuggingFace model was downloaded to above

xfmr_recognizer = TransformerRecognizer(model_dir)
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(xfmr_recognizer)

Some weights of the model checkpoint at bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


With the default settings, this is what the output from Presidio looks like. It uses generic tags and is not nicely formatted.

In [7]:
text = "His name is Mr. Jones and his phone number is 212-555-5555"

analyzer_results = analyzer.analyze(text=text, language="en")

print(analyzer_results)

[type: PERSON, start: 16, end: 21, score: 0.944421648979187, type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]
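
The `start` and `end` values in each result are character offsets into the input text, with `end` exclusive, so slicing the text recovers what was matched. A quick check, with the offsets copied by hand from the printed results above:

```python
text = "His name is Mr. Jones and his phone number is 212-555-5555"

# (entity_type, start, end) copied from the printed analyzer results
spans = [("PERSON", 16, 21), ("PHONE_NUMBER", 46, 58)]

matches = [(entity_type, text[start:end]) for entity_type, start, end in spans]
for entity_type, matched in matches:
    print(f"{entity_type}: {matched!r}")
```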


In [8]:
# initialize the anonymizer. this is not the extended EntityRecognizer
anonymizer_engine = AnonymizerEngine()

# create anonymized results
anonymized_results = anonymizer_engine.anonymize(
    text=text, analyzer_results=analyzer_results
)

print(anonymized_results)

text: His name is Mr. <PERSON> and his phone number is <PHONE_NUMBER>
items:
[
    {'start': 49, 'end': 63, 'entity_type': 'PHONE_NUMBER', 'text': '<PHONE_NUMBER>', 'operator': 'replace'},
    {'start': 16, 'end': 24, 'entity_type': 'PERSON', 'text': '<PERSON>', 'operator': 'replace'}
]
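
The items in the output are listed from the end of the text toward the start. One simple way to get the same text is to apply replacements right to left, so that the offsets of spans not yet processed stay valid. The sketch below illustrates that idea for non-overlapping spans; it is not Presidio's implementation, and `replace_spans` is a hypothetical helper.

```python
def replace_spans(text, spans):
    """Replace each (entity_type, start, end) span with an <ENTITY_TYPE> tag."""
    # Work from the last span to the first so earlier offsets are unaffected.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


text = "His name is Mr. Jones and his phone number is 212-555-5555"
spans = [("PERSON", 16, 21), ("PHONE_NUMBER", 46, 58)]
print(replace_spans(text, spans))
```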



In [9]:
operators = {
    "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "*",
            "chars_to_mask": 12,
            "from_end": True,
        },
    ),
    "US_SSN": OperatorConfig(
        "mask",
        {
            "type": "mask",
            "masking_char": "#",
            "chars_to_mask": 11,
            "from_end": False
        }
    ),
    "TITLE": OperatorConfig("redact", {}),
}

By passing operators to the AnonymizerEngine's anonymize call, the output can be customized to produce a more desirable result.

In [10]:
# initialize the anonymizer. this is not the extended EntityRecognizer
anonymizer_engine = AnonymizerEngine()

# create anonymized results
anonymized_results = anonymizer_engine.anonymize(
    text=text, analyzer_results=analyzer_results, operators=operators
)

print(anonymized_results)

text: His name is Mr. <ANONYMIZED> and his phone number is ************
items:
[
    {'start': 53, 'end': 65, 'entity_type': 'PHONE_NUMBER', 'text': '************', 'operator': 'mask'},
    {'start': 16, 'end': 28, 'entity_type': 'PERSON', 'text': '<ANONYMIZED>', 'operator': 'replace'}
]
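
The `mask` operator settings used above (`masking_char`, `chars_to_mask`, `from_end`) can be pictured with a small pure-Python function. This is an approximation written for illustration, not Presidio's code. It shows why a value longer than `chars_to_mask` keeps some of its characters when masking starts from the end.

```python
def mask(value: str, masking_char: str, chars_to_mask: int, from_end: bool) -> str:
    """Overwrite up to chars_to_mask characters, starting from either end."""
    n = max(0, min(chars_to_mask, len(value)))
    if n == 0:
        return value
    if from_end:
        return value[: len(value) - n] + masking_char * n
    return masking_char * n + value[n:]


print(mask("212-555-5555", "*", 12, True))    # 12 characters, fully masked
print(mask("(206) 555-1234", "*", 12, True))  # 14 characters, first two remain
print(mask("995-12-2716", "#", 11, False))    # 11 characters, fully masked
```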



This is a longer test of the anonymizer engine with the custom operators:

In [11]:
text = '''
John Smith, born in 1987, lives in Seattle, Washington. 
He is a software engineer and has a Bachelor's degree in Computer Science from the University of Washington. 
He drives a blue Honda Accord and his driver's license number is A123456789. 
His social security number is 995-12-2716 and his phone number is (206) 555-1234. 
John enjoys playing basketball and hiking in his free time. 
He is married to Sarah Smith and they have two children, Emma and Jake.
He banks at JPMC and his account number is 99953153415
'''

analyzer_results =  analyzer.analyze(text=text, language="en")

anonymized_results = anonymizer_engine.anonymize(
    text=text, analyzer_results=analyzer_results, operators=operators
)

print(anonymized_results)

text: 
<ANONYMIZED>, born in <ANONYMIZED>, lives in <ANONYMIZED>, <ANONYMIZED>. 
He is a software engineer and has a Bachelor's degree in Computer Science from the <ANONYMIZED>. 
He drives a blue Honda Accord and his driver's license number is <ANONYMIZED>. 
His social security number is ########### and his phone number is (2************. 
<ANONYMIZED> enjoys playing basketball and hiking in his free time. 
He is married to <ANONYMIZED> and they have two children, <ANONYMIZED> and <ANONYMIZED>.
He banks at <ANONYMIZED> and his account number is ***********

items:
[
    {'start': 545, 'end': 556, 'entity_type': 'PHONE_NUMBER', 'text': '***********', 'operator': 'mask'},
    {'start': 506, 'end': 518, 'entity_type': 'ORGANIZATION', 'text': '<ANONYMIZED>', 'operator': 'replace'},
    {'start': 480, 'end': 492, 'entity_type': 'PERSON', 'text': '<ANONYMIZED>', 'operator': 'replace'},
    {'start': 463, 'end': 475, 'entity_type': 'PERSON', 'text': '<ANONYMIZED>', 'operator': 'replace'},
   

#### Streaming
Start Redpanda and produce messages from JSON. start_container.bash creates redpanda.env for the next sections.

In [12]:
%%bash

L_NUM_CONTAINERS=3

rpk container start -n ${L_NUM_CONTAINERS} | grep export | sed -e 's/^[ \t]*//' > redpanda.env
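
The command above keeps only the `export ...` lines from the `rpk` output, so `redpanda.env` ends up holding shell-style assignments. A rough sketch of how such lines turn into key/value pairs is shown below. The sample line, including the broker addresses, is invented for illustration, and `parse_env_lines` is a simplified stand-in, not the parser that `python-dotenv` uses.

```python
def parse_env_lines(lines):
    """Parse `export KEY=value` or `KEY=value` lines into a dict."""
    env = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


# Invented sample; real addresses depend on the local containers
sample = ['export RPK_BROKERS="127.0.0.1:9092,127.0.0.1:9093"']
print(parse_env_lines(sample))
```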

In [13]:
from dotenv import load_dotenv
import json
from kafka import KafkaProducer
import os
import time

""" read in information about started redpanda environment """
load_dotenv('redpanda.env')

""" create producer """
producer = KafkaProducer(
    bootstrap_servers = os.environ.get('RPK_BROKERS'),
    value_serializer=lambda m: json.dumps(m).encode('ascii')
)

topic = "random-pii-text"

def on_success(metadata):
  print(f"Message produced to topic '{metadata.topic}' at offset {metadata.offset}")

def on_error(e):
  print(f"Error sending message: {e}")

""" read in OpenAI generated PII """
with open('../data/pii_records.json') as f:
  l_json_data = json.load(f)

""" push messages to topic from OpenAI """
for ii, record in enumerate(l_json_data):
  msg = dict(id=ii, inputs=record['inputs'])
  future = producer.send(topic, msg)
  future.add_callback(on_success)
  future.add_errback(on_error)
  time.sleep(0.100) # sleep for 1/10 sec to cause a delay.

""" flush and close producer """
producer.flush()
producer.close()

NoBrokersAvailable: NoBrokersAvailable
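
A `NoBrokersAvailable` error generally means the client could not reach any of the configured brokers. One possible cause here is that `RPK_BROKERS` was never loaded, or that the containers are not running. Checking the variable before creating the producer gives a clearer message. The sketch below uses only the standard library; `brokers_from_env` and `serialize` are hypothetical helpers, with `serialize` mirroring the `value_serializer` lambda above.

```python
import json
import os


def brokers_from_env(var_name: str = "RPK_BROKERS") -> list:
    """Read a comma-separated broker list from the environment or fail loudly."""
    raw = os.environ.get(var_name, "")
    brokers = [b.strip() for b in raw.split(",") if b.strip()]
    if not brokers:
        raise RuntimeError(f"{var_name} is not set; is the Redpanda container running?")
    return brokers


def serialize(message: dict) -> bytes:
    """Same encoding as the producer's value_serializer."""
    # json.dumps escapes non-ASCII characters by default, so ASCII encoding is safe.
    return json.dumps(message).encode("ascii")


payload = serialize({"id": 0, "inputs": "José lives in Seattle"})
print(payload)
```

The list returned by `brokers_from_env` could then be passed as `bootstrap_servers`, which accepts either a string or a list of `host:port` strings.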

In [None]:
!#cd redpanda && ./stop_container.bash