### Regex and spaCy in Conjunction
##### The guiding principle behind PII removal for the BLOOM LLM was to remove the most dangerous PII (credit card and bank account numbers, government ID numbers, etc.) with complete reliability, while still capturing most general PII without removing too much information.

##### This script detects Social Security numbers, credit card numbers, and phone numbers using regular expressions, and named entities such as person names, dates, organizations, geopolitical entities, and locations using spaCy. It can be customized further by adding more regex patterns or improving the named-entity recognition.
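
As an illustration of the "adding more regex patterns" route, here is a minimal sketch that adds an email matcher. The names `email_pattern`, `EXTRA_PATTERNS`, and `detect_extra_pii` are hypothetical and not part of the original script, and the regex is intentionally simple rather than RFC 5322 compliant. It returns the same `(label, start, end)` tuples the script produces, so the results could be appended to its entity list.

```python
import re

# Hypothetical extra pattern: a deliberately simple email matcher.
# It is not RFC 5322 compliant and will miss unusual addresses.
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

# Keeping patterns in one mapping makes a new PII type a one-line change.
EXTRA_PATTERNS = {"EMAIL": email_pattern}


def detect_extra_pii(text, patterns=EXTRA_PATTERNS):
    """Return (label, start, end) tuples for every match of every pattern."""
    found = []
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.append((label, match.start(), match.end()))
    return found


sample = "Contact jane.doe@example.com for details."
for label, start, end in detect_extra_pii(sample):
    print(f"{label}: {sample[start:end]}")
```

A looser pattern favors recall over precision, which suits the goal of not letting dangerous PII slip through, at the cost of some false positives.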

In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import re
import spacy

# Load the small English pipeline (tokenizer, POS tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")

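# Regex patterns for SSN, credit card, and phone numbers.
# The phone pattern is broad, so it also matches SSN- and card-shaped
# digit runs; the same span can therefore be reported under several labels.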
ssn_pattern = re.compile(r'\b\d{3}[-]?\d{2}[-]?\d{4}\b')
credit_card_pattern = re.compile(r'\b(?:\d[ -]*?){13,16}\b')
phone_number_pattern = re.compile(r'\b(?:\d{1,4}[ -]?){2,4}\d{1,4}\b')


def detect_pii(text):
    pii_entities = []

    # Detect SSN, credit card, and phone numbers using regex
    for match in ssn_pattern.finditer(text):
        pii_entities.append(("SSN", match.start(), match.end()))

    for match in credit_card_pattern.finditer(text):
        pii_entities.append(("CREDIT_CARD", match.start(), match.end()))

    for match in phone_number_pattern.finditer(text):
        pii_entities.append(("PHONE_NUMBER", match.start(), match.end()))

    # Named-entity recognition using spaCy
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "DATE", "ORG", "GPE", "LOC"}:
            pii_entities.append((ent.label_, ent.start_char, ent.end_char))

    return pii_entities


if __name__ == "__main__":
    text = "My name is John Doe and I live in New York. My phone number is 555-123-4567 and my SSN is 123-45-6789. My credit card number is 1234 5678 9012 3456."
    pii_entities = detect_pii(text)

    for entity_type, start, end in pii_entities:
        print(f"{entity_type}: {text[start:end]}")




SSN: 123-45-6789
CREDIT_CARD: 1234 5678 9012 3456
PHONE_NUMBER: 555-123-4567
PHONE_NUMBER: 123-45-6789
PHONE_NUMBER: 1234 5678 9012 3456
PERSON: John Doe
GPE: New York
ORG: SSN
DATE: 1234 5678 9012 3456
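
The output above reports some spans more than once (the SSN and the card number also appear as `PHONE_NUMBER`, and the card number as `DATE`). One possible way to handle this before redaction is to keep a single label per overlapping region. The sketch below is an assumption, not part of the original script: the `PRIORITY` ordering, `resolve_overlaps`, and `redact` are hypothetical names, and the example uses hand-written spans so that it runs without spaCy.

```python
# Lower number = higher priority. This ordering is an assumption:
# the most dangerous PII types win over broader or noisier labels.
PRIORITY = {"SSN": 0, "CREDIT_CARD": 1, "PHONE_NUMBER": 2}
DEFAULT_PRIORITY = 10  # spaCy labels such as PERSON, DATE, ORG


def resolve_overlaps(entities):
    """Keep one label per overlapping region: higher priority first, then longer spans."""
    ranked = sorted(
        entities,
        key=lambda e: (PRIORITY.get(e[0], DEFAULT_PRIORITY), -(e[2] - e[1]), e[1]),
    )
    kept = []
    for label, start, end in ranked:
        # Keep the span only if it is disjoint from every span already kept.
        if all(end <= k_start or start >= k_end for _, k_start, k_end in kept):
            kept.append((label, start, end))
    return sorted(kept, key=lambda e: e[1])


def redact(text, entities):
    """Replace each resolved span with [LABEL], right to left so offsets stay valid."""
    for label, start, end in reversed(resolve_overlaps(entities)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


example = "SSN 123-45-6789, call 555-123-4567"
spans = [("SSN", 4, 15), ("PHONE_NUMBER", 4, 15), ("PHONE_NUMBER", 22, 34)]
print(redact(example, spans))  # SSN [SSN], call [PHONE_NUMBER]
```

Because `detect_pii` returns the same `(label, start, end)` tuples, its result could be passed to `redact` directly. False positives such as `ORG: SSN` would still be redacted, which is consistent with erring on the side of removal.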
