<a href="https://colab.research.google.com/github/rayaneghilene/Mask_Personal_Data/blob/main/Mask_Personal_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mask Personal Data with NER

This notebook contains code to mask personal data in text for improved security. It uses NuMind's open-source NuExtract model: https://huggingface.co/numind/NuExtract

Blog post:  https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction

---



In [1]:
import json, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

def predict_NuExtract(model, tokenizer, text, schema, example=("", "", "")):
    # Normalize the schema to pretty-printed JSON, the layout used in the prompt
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    # Append each non-empty example as its own "### Example:" block
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"

    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to(device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
    # Keep only what the model wrote between the output markers
    return output.split("<|output|>")[1].split("<|end-output|>")[0]

# The Tiny model (0.5B) is loaded by default
model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)

# To use the larger NuExtract model (3.8B, based on Phi-3-mini) instead, comment out the two lines above and uncomment the two lines below
# model = AutoModelForCausalLM.from_pretrained("numind/NuExtract", torch_dtype=torch.bfloat16, trust_remote_code=True)
# tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract", trust_remote_code=True)

model.to(device)

model.eval()

Using device: cpu


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (lm_head): Line
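
The prompt that `predict_NuExtract` assembles can be inspected without loading a model. The sketch below reproduces only the string-building step from the function above. `build_prompt` and the sample sentence are hypothetical names introduced for illustration; they are not part of the notebook or of the NuExtract library.

```python
import json

def build_prompt(text, schema, examples=()):
    """Hypothetical helper: build the NuExtract-style prompt string only."""
    # Pretty-print the schema, as predict_NuExtract does
    schema = json.dumps(json.loads(schema), indent=4)
    prompt = "<|input|>\n### Template:\n" + schema + "\n"
    # Each non-empty example becomes its own "### Example:" block
    for ex in examples:
        if ex != "":
            prompt += "### Example:\n" + json.dumps(json.loads(ex), indent=4) + "\n"
    prompt += "### Text:\n" + text + "\n<|output|>\n"
    return prompt

# Made-up input, used only to show the layout of the prompt
demo_prompt = build_prompt("Jane lives in Paris.", '{"Names": [], "Cities": []}')
print(demo_prompt)
```

Printing the prompt this way is a quick check that the schema is valid JSON and that any examples land in the expected place before running generation.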

In [2]:
text = """John Doe and his wife, Jane Smith, recently moved to 123 Maple Street, Springfield.
They used to live at 456 Oak Avenue, Metropolis.
John's friend, Alice Johnson, helped them with the move.
You can contact John at (555)123-4567 for more details.
For secure access, use the API key: abc123XYZ789."""

schema = """{
    "Addresses": [],
    "Names": [],
    "Phone Numbers": [],
    "API Keys": [],
    "Spouse Names": []
}"""

prediction = predict_NuExtract(model, tokenizer, text, schema, example=["","",""])
print(prediction)

Setting `pad_token_id` to `eos_token_id`:151646 for open-end generation.



{
    "Addresses": [
        "123 Maple Street",
        "456 Oak Avenue, Metropolis"
    ],
    "Names": [
        "John Doe",
        "Jane Smith"
    ],
    "Phone Numbers": [
        "(555)123-4567"
    ],
    "API Keys": [
        "abc123XYZ789"
    ],
    "Spouse Names": [
        "Jane Smith"
    ]
}
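
The prediction is a plain string, and the masking step below passes it straight to `json.loads`. A minimal sketch of a more defensive parse follows, assuming the model may occasionally return malformed JSON or unexpected keys. `parse_prediction` is a hypothetical helper, not something the notebook or the model provides.

```python
import json

def parse_prediction(prediction_str, schema_str):
    """Hypothetical helper: parse model output, falling back to the empty schema."""
    schema = json.loads(schema_str)
    try:
        data = json.loads(prediction_str)
    except json.JSONDecodeError:
        # Malformed output: return the schema with its empty lists
        return schema
    if not isinstance(data, dict):
        return schema
    # Keep only keys declared in the schema, and only string entries of list values
    result = {}
    for key in schema:
        value = data.get(key, [])
        result[key] = [v for v in value if isinstance(v, str)] if isinstance(value, list) else []
    return result

demo_schema = '{"Names": [], "Phone Numbers": []}'
print(parse_prediction('{"Names": ["John Doe"], "Extra": ["x"]}', demo_schema))
print(parse_prediction('not json at all', demo_schema))
```

Falling back to the empty schema means downstream code can always iterate over the same keys, at the cost of silently masking nothing when the output cannot be parsed.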



In [3]:
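import json
import re

def mask_words_by_class(text, extracted_data_str):
    """Replace every extracted value found in `text` with '<mask>'."""
    extracted_data = json.loads(extracted_data_str)

    # Collect every non-empty extracted value, lowercased so duplicates collapse
    word_set = set()
    for key in extracted_data:
        for item in extracted_data[key]:
            if item:
                word_set.add(item.lower())

    # Nothing extracted: an empty alternation would match at every word boundary
    if not word_set:
        return text

    # Longest entries first, so a longer entry is tried before a shorter one it starts with
    sorted_word_set = sorted(word_set, key=len, reverse=True)
    # Note: \b needs a word character on one side, so an entry that begins or ends with
    # punctuation, such as "(555)123-4567", is missed when a space sits next to it
    pattern = re.compile(r'\b(' + '|'.join(re.escape(word) for word in sorted_word_set) + r')\b', re.IGNORECASE)
    masked_text = pattern.sub('<mask>', text)

    return masked_text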
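
The function above replaces every match with the same `<mask>` token, even though the extraction is grouped by class. A minimal sketch of a class-aware variant follows, assuming each value should be replaced by the name of the first class it appears under. `mask_with_labels` is a hypothetical name, and unlike the function above it matches plain substrings without word-boundary checks.

```python
import json
import re

def mask_with_labels(text, extracted_data_str):
    """Hypothetical variant: replace each extracted value with its class name."""
    extracted = json.loads(extracted_data_str)
    label_for = {}
    for label, items in extracted.items():
        for item in items:
            if item:
                # setdefault keeps the first class seen for a value
                label_for.setdefault(item.lower(), label)
    if not label_for:
        return text
    # Longest values first so they are tried before shorter ones
    ordered = sorted(label_for, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered), re.IGNORECASE)
    # Look up the class of whatever was matched; fall back to a generic label
    return pattern.sub(lambda m: "<" + label_for.get(m.group(0).lower(), "mask") + ">", text)

demo_extraction = '{"Names": ["Jane Smith"], "Spouse Names": ["Jane Smith"], "API Keys": ["abc123XYZ789"]}'
print(mask_with_labels("Jane Smith uses the key abc123XYZ789.", demo_extraction))
```

Keeping the class in the placeholder preserves some meaning of the sentence for downstream readers, while a value listed under several classes is labeled with only the first one.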

In [4]:
masked_text = mask_words_by_class(text, prediction)
print("Original Text: \n", text)
print("Masked Text: \n", masked_text)

Original Text: 
 John Doe and his wife, Jane Smith, recently moved to 123 Maple Street, Springfield.
They used to live at 456 Oak Avenue, Metropolis.
John's friend, Alice Johnson, helped them with the move.
You can contact John at (555)123-4567 for more details.
For secure access, use the API key: abc123XYZ789.
Masked Text: 
 <mask> and his wife, <mask>, recently moved to <mask>, Springfield.
They used to live at <mask>.
John's friend, Alice Johnson, helped them with the move.
You can contact John at (555)123-4567 for more details.
For secure access, use the API key: <mask>.
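
In the output above, the phone number `(555)123-4567` stays in the text although it appears in the prediction. This follows from the `\b` anchors in `mask_words_by_class`: `\b` only matches between a word character and a non-word character, and here a space precedes the opening parenthesis. The standalone `John` and `Alice Johnson` are a separate matter, since neither string is in the prediction. A minimal sketch of one way to handle the first case follows, swapping `\b` for lookarounds. `mask_extracted_values` is a hypothetical name, and the sketch assumes every value in the extraction is a list of strings.

```python
import json
import re

def mask_extracted_values(text, extracted_data_str, mask="<mask>"):
    """Hypothetical variant of mask_words_by_class that uses lookarounds."""
    extracted = json.loads(extracted_data_str)
    # Assumes each value is a list; keep only non-empty strings
    values = {item for items in extracted.values() for item in items
              if isinstance(item, str) and item}
    if not values:
        return text
    alternation = "|".join(re.escape(v) for v in sorted(values, key=len, reverse=True))
    # (?<!\w) and (?!\w): no word character directly before or after the match,
    # which also works when the value itself starts or ends with punctuation
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(mask, text)

sample_text = "You can contact John at (555)123-4567 for more details."
sample_extraction = '{"Names": ["John Doe"], "Phone Numbers": ["(555)123-4567"]}'
print(mask_extracted_values(sample_text, sample_extraction))
```

With this variant the phone number is masked, while `John` on its own is still left alone because only the full name `John Doe` was extracted.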
