<a href="https://colab.research.google.com/github/rayaneghilene/Mask_Personal_Data/blob/main/Mask_Personal_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mask Personal Data with NER

This notebook masks personal data in text to improve security. It uses NuMind's open-source NuExtract model: https://huggingface.co/numind/NuExtract

Blog post:  https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction

---



In [16]:
import json, torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def predict_NuExtract(model, tokenizer, text, schema, example=("", "", "")):
    # Normalise the schema to indented JSON, the layout used in the prompt template.
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    # Append each non-empty example as its own "### Example:" block.
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"

    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    # Tokenize (truncating long inputs) and move the tensors to the model's device.
    inputs = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to(model.device)

    output = tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True)
    # Keep only the text between the output markers.
    return output.split("<|output|>")[1].split("<|end-output|>")[0]

## Uncomment the following section if you wish to use the Tiny model (0.5B)
# model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)

## The following model is the 3.7B version
model = AutoModelForCausalLM.from_pretrained("numind/NuExtract", torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract", trust_remote_code=True)

model.to("cuda")

model.eval()



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206
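
The prompt that `predict_NuExtract` sends to the model can be inspected without loading any weights. The sketch below mirrors only the string assembly from that function; `build_prompt` and the sample sentence are hypothetical names and data introduced here for illustration, not part of the notebook or of NuExtract.

```python
import json


def build_prompt(text, schema, examples=()):
    """Assemble the prompt string the same way predict_NuExtract does (no model needed)."""
    # Re-serialise the schema with 4-space indentation.
    prompt = "<|input|>\n### Template:\n" + json.dumps(json.loads(schema), indent=4) + "\n"
    for example in examples:
        if example:  # skip empty placeholders such as ""
            prompt += "### Example:\n" + json.dumps(json.loads(example), indent=4) + "\n"
    prompt += "### Text:\n" + text + "\n<|output|>\n"
    return prompt


# Hypothetical input, used only to show the layout of the prompt.
demo_prompt = build_prompt("Call Bob at 555-0100.", '{"Names": [], "Phone Numbers": []}')
print(demo_prompt)
```

Printing the prompt this way makes it easier to check that the schema and any examples are valid JSON before spending GPU time on generation.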

In [23]:
text = """John Doe and his wife, Jane Smith, recently moved to 123 Maple Street, Springfield.
They used to live at 456 Oak Avenue, Metropolis.
John's friend, Alice Johnson, helped them with the move.
You can contact John at (555)123-4567 for more details.
For secure access, use the API key: abc123XYZ789."""

schema = """{
    "Addresses": [],
    "Names": [],
    "Phone Numbers": [],
    "API Keys": [],
    "Spouse Names": []
}"""

prediction = predict_NuExtract(model, tokenizer, text, schema, example=["","",""])
print(prediction)


{
    "Addresses": [
        "123 Maple Street, Springfield",
        "456 Oak Avenue, Metropolis"
    ],
    "Names": [
        "John Doe",
        "Jane Smith",
        "Alice Johnson"
    ],
    "Phone Numbers": [
        "(555)123-4567"
    ],
    "API Keys": [
        "abc123XYZ789"
    ],
    "Spouse Names": [
        "Jane Smith"
    ]
}
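
The prediction above is a JSON string, and the masking step further down parses it with `json.loads`. The following is a minimal sketch of a more defensive parse, assuming the model could occasionally return malformed JSON or omit a key (the run above shows neither). `parse_prediction` is a hypothetical helper, not part of the notebook.

```python
import json


def parse_prediction(prediction, schema):
    """Parse model output into a dict that holds every schema key as a list of strings."""
    try:
        data = json.loads(prediction)
    except json.JSONDecodeError:
        data = {}  # malformed output: treat it as "nothing extracted"
    if not isinstance(data, dict):
        data = {}

    cleaned = {}
    for key in json.loads(schema):
        values = data.get(key, [])
        if isinstance(values, str):
            values = [values]  # tolerate a bare string where a list was expected
        elif not isinstance(values, list):
            values = []
        # Keep non-empty strings only, dropping duplicates while preserving order.
        cleaned[key] = list(dict.fromkeys(v for v in values if isinstance(v, str) and v.strip()))
    return cleaned


# Hypothetical outputs, used only to exercise the helper.
print(parse_prediction('{"Names": ["A", "A", ""], "Extra": ["x"]}', '{"Names": [], "API Keys": []}'))
print(parse_prediction("not json", '{"Names": []}'))
```

Keys that are not in the schema are dropped, and keys the model left out come back as empty lists, so downstream code can iterate over the result without extra checks.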



In [24]:
import json
import re

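def mask_words_by_class(text, extracted_data_str):
    """Replace every value extracted by the model with '<mask>'."""
    extracted_data = json.loads(extracted_data_str)

    # Collect the extracted values from all classes, lowercased to drop duplicates.
    word_set = set()
    for key in extracted_data:
        for item in extracted_data[key]:
            word_set.add(item.lower())

    # Nothing extracted: an empty alternation would match at every word boundary.
    if not word_set:
        return text

    # Longest values first, so the longest alternative is tried first at each position.
    sorted_word_set = sorted(word_set, key=len, reverse=True)
    # Note: \b only matches next to a word character, so a value that starts or
    # ends with punctuation, such as "(555)123-4567", is usually not matched.
    pattern = re.compile(r'\b(' + '|'.join(re.escape(word) for word in sorted_word_set) + r')\b', re.IGNORECASE)
    masked_text = pattern.sub('<mask>', text)

    return masked_text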
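
`mask_words_by_class` replaces every value with the same `<mask>` token. A variant that keeps the class in the placeholder could look like the sketch below. The function name, the placeholder format (`<NAMES>`, `<PHONE_NUMBERS>`) and the rule that the first class wins for a value listed under several keys are all assumptions made for illustration. It also swaps `\b` for the lookarounds `(?<!\w)` and `(?!\w)` so that values beginning or ending with punctuation can still match.

```python
import json
import re


def mask_with_class_labels(text, extracted_data_str):
    """Replace each extracted value with a placeholder named after its class."""
    extracted = json.loads(extracted_data_str)

    label_for = {}
    for key, items in extracted.items():
        label = "<" + key.upper().replace(" ", "_") + ">"
        for item in items:
            # setdefault keeps the first class seen when a value appears in several.
            label_for.setdefault(item.lower(), label)

    if not label_for:
        return text  # nothing to mask

    # Longest values first, so the longest alternative is tried first at each position.
    alternatives = sorted(label_for, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Fall back to a generic token if case folding makes the lookup miss.
    return pattern.sub(lambda m: label_for.get(m.group(0).lower(), "<mask>"), text)


# Hypothetical input, used only to show the placeholder format.
demo = mask_with_class_labels(
    "Call Jane Smith at (555)123-4567.",
    '{"Names": ["Jane Smith"], "Phone Numbers": ["(555)123-4567"], "Spouse Names": ["Jane Smith"]}',
)
print(demo)  # Call <NAMES> at <PHONE_NUMBERS>.
```

Class-specific placeholders keep the masked text readable, at the cost of revealing what kind of data was removed.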

In [25]:
masked_text = mask_words_by_class(text, prediction)
print("Original Text: \n", text)
print("Masked Text: \n", masked_text)

Original Text: 
 John Doe and his wife, Jane Smith, recently moved to 123 Maple Street, Springfield. 
They used to live at 456 Oak Avenue, Metropolis. 
John's friend, Alice Johnson, helped them with the move. 
You can contact John at (555)123-4567 for more details. 
For secure access, use the API key: abc123XYZ789.
Masked Text: 
 <mask> and his wife, <mask>, recently moved to <mask>. 
They used to live at <mask>. 
John's friend, <mask>, helped them with the move. 
You can contact John at (555)123-4567 for more details. 
For secure access, use the API key: <mask>.
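
In the masked output above, the phone number `(555)123-4567` is still visible, even though the model extracted it. The cause is the `\b` anchor in the pattern: it only matches between a word character and a non-word character, so it cannot match between a space and `(`. The standalone mentions "John" and "John's" also remain, because only the full name "John Doe" was extracted. A simple post-check, sketched below with a hypothetical helper named `find_leaks`, can flag extracted values that survive masking. It only catches verbatim survivors, not partial mentions such as a first name on its own.

```python
import json


def find_leaks(masked_text, extracted_data_str):
    """Return the extracted values that still occur verbatim in the masked text."""
    extracted = json.loads(extracted_data_str)
    haystack = masked_text.lower()
    leaks = []
    for items in extracted.values():
        for item in items:
            # Case-insensitive substring check; report each value once.
            if item.lower() in haystack and item not in leaks:
                leaks.append(item)
    return leaks


# Hypothetical data that reproduces the situation in the output above.
demo_prediction = '{"Names": ["John Doe"], "Phone Numbers": ["(555)123-4567"]}'
demo_masked = "<mask> can be reached at (555)123-4567."
print(find_leaks(demo_masked, demo_prediction))  # ['(555)123-4567']
```

Running a check like this after masking turns a silent miss into an explicit list that can be logged or used to fail the pipeline.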
