## **Data Masking**

In [None]:
%pip install transformers torch

In [None]:
import pandas as pd
import re
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from tqdm import tqdm
import spacy
import torch

#### Reading dataset

In [None]:
df = pd.read_json("hf://datasets/AGBonnet/augmented-clinical-notes/augmented_notes_30K.jsonl", lines=True)

In [None]:
df.head()

| | note | conversation | idx | summary | full_note |
|---|---|---|---|---|---|
| 0 | A a sixteen year-old girl, presented to our Ou... | Doctor: Good morning, what brings you to the O... | 155216 | {\n"visit motivation": "Discomfort in the neck... | A a sixteen year-old girl, presented to our Ou... |
| 1 | This is the case of a 56-year-old man that was... | Doctor: Hi, how are you feeling today?\nPatien... | 77465 | {\n"visit motivation": "Complaints of a dull p... | This is the case of a 56-year-old man that was... |
| 2 | A 36-year old female patient visited our hospi... | Doctor: Hello, what brings you to the hospital... | 133948 | {\n"visit motivation": "Pain and restricted ra... | A 36-year old female patient visited our hospi... |
| 3 | A 49-year-old male presented with a complaint ... | Doctor: Good morning, Mr. [Patient's Name]. I'... | 80176 | {\n"visit motivation": "Pain in the left proxi... | A 49-year-old male presented with a complaint ... |
| 4 | A 47-year-old male patient was referred to the... | Doctor: Good morning, how are you feeling toda... | 72232 | {\n"visit motivation": "Recurrent attacks of p... | A 47-year-old male patient was referred to the... |


In [4]:
df.dropna(inplace=True)

In [5]:
# Remove newlines (real and literal "\n"), carriage returns, brackets and parentheses
# from the 'note' and 'full_note' columns
df['note'] = df['note'].astype(str).str.replace(r'(\\n|\n|\r|\[|\]|\(|\))', '', regex=True)
df['full_note'] = df['full_note'].astype(str).str.replace(r'(\\n|\n|\r|\[|\]|\(|\))', '', regex=True)
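
To see what the pattern above strips, here is a minimal sketch that applies the same regular expression with `re.sub` to a single string. The sample text is hypothetical and not taken from the dataset.

```python
import re

# Same pattern as in the cell above: literal "\n" sequences, real newlines,
# carriage returns, square brackets and parentheses.
CLEAN_PATTERN = r'(\\n|\n|\r|\[|\]|\(|\))'

# Hypothetical sample, not taken from the dataset
sample = "Mr. [Patient's Name]\nBP (120/80)\\n stable"

cleaned = re.sub(CLEAN_PATTERN, '', sample)
print(cleaned)  # Mr. Patient's NameBP 120/80 stable
```

The matched characters are deleted rather than replaced with a space, so words separated only by a newline are joined (`NameBP` here). This may explain fused tokens such as `self-care.She` in the printed notes further down.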

In [6]:
len(df)

30000

#### Counting the most frequent tokens

In [7]:
from collections import Counter

# Frequent function words to count (mask_word later leaves them unmasked).
# All entries are lowercase because the text is lowercased before lookup.
exclude_tokens = ["the", "be", "to", "of", "and", "a", "an", "in", "that", "these",
                  "those", "i", "it", "for", "not", "on", "with", "as", "you",
                  "do", "at"]

exclude_tokens_set = set(exclude_tokens)

# Count occurrences of the tracked tokens in one text
def count_exclude_tokens(text):
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter(word for word in words if word in exclude_tokens_set)

# Running totals
total_counts = Counter()
total_words = 0
total_rows = 0

# Accumulate counts over both columns
for _, row in df.iterrows():
    for field in ["note", "full_note"]:
        text = row[field]
        counts = count_exclude_tokens(text)
        total_counts.update(counts)
        total_words += len(re.findall(r"\b\w+\b", text.lower()))
        total_rows += 1

# Per-token statistics. total_rows counts every text processed
# (two per DataFrame row: 'note' and 'full_note').
token_stats = {
    token: {
        "total_count": total_counts[token],
        "mean_per_row": total_counts[token] / total_rows,
        "relative_freq": total_counts[token] / total_words
    }
    for token in exclude_tokens
}

# Sort by relative frequency, descending
sorted_stats = dict(sorted(token_stats.items(), key=lambda x: x[1]["relative_freq"], reverse=True))
sorted_stats_df = pd.DataFrame.from_dict(sorted_stats, orient='index')
sorted_stats_df

| | total_count | mean_per_row | relative_freq |
|---|---|---|---|
| the | 1706581 | 28.443017 | 0.066148 |
| of | 918881 | 15.314683 | 0.035616 |
| and | 918000 | 15.3 | 0.035582 |
| a | 610443 | 10.17405 | 0.023661 |
| to | 470057 | 7.834283 | 0.01822 |
| with | 414384 | 6.9064 | 0.016062 |
| in | 379412 | 6.323533 | 0.014706 |
| for | 208113 | 3.46855 | 0.008067 |
| on | 185870 | 3.097833 | 0.007204 |
| at | 137799 | 2.29665 | 0.005341 |
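
As a worked illustration of the three statistics, the sketch below runs the same computation on two hypothetical sentences with a small stand-in token set (`tracked`). Note that `mean_per_row` divides by the number of texts processed. In the run above that is 60,000 (two columns × 30,000 rows), which is consistent with 1706581 / 28.443017.

```python
import re
from collections import Counter

tracked = {"the", "a", "of", "in"}  # small stand-in for exclude_tokens
texts = ["The pain in the neck", "A pain of the back"]  # hypothetical notes

counts = Counter()
total_words = 0
for text in texts:
    words = re.findall(r"\b\w+\b", text.lower())
    counts.update(w for w in words if w in tracked)
    total_words += len(words)

stats = {
    tok: {
        "total_count": counts[tok],
        "mean_per_row": counts[tok] / len(texts),    # average occurrences per text
        "relative_freq": counts[tok] / total_words,  # share of all words
    }
    for tok in tracked
}

print(stats["the"])
# {'total_count': 3, 'mean_per_row': 1.5, 'relative_freq': 0.3}
```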


### **CLINICAL-BERT NER**

In [8]:
# Gender terms to mask completely
gender_terms = {
    "male", "female", "man", "woman", "boy", "girl",
    "he", "she", "his", "her", "him", "hers", "himself", "herself"
}

# Mask a word while keeping its first two letters (e.g. neck --> ne**).
# Digits, gender terms and words of one or two characters are hidden completely.
# The most frequent tokens (exclude_tokens) are left unmasked.
def mask_word(word):
    if word.lower() in exclude_tokens_set:  # set lookup instead of scanning the list
        return word
    if word.isdigit() or word.lower() in gender_terms or len(word) <= 2:
        return "*" * len(word)
    return word[:2] + "*" * (len(word) - 2)
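
A quick check of the three masking rules on individual words. The function is restated so the snippet runs on its own, and the two sets are shortened stand-ins for `exclude_tokens` and `gender_terms`, used only for illustration.

```python
exclude_tokens_set = {"the", "of", "and", "in", "with"}  # subset, for illustration
gender_terms = {"he", "she", "her", "girl", "boy"}        # subset, for illustration

def mask_word(word):
    if word.lower() in exclude_tokens_set:
        return word                          # frequent tokens stay readable
    if word.isdigit() or word.lower() in gender_terms or len(word) <= 2:
        return "*" * len(word)               # hidden completely
    return word[:2] + "*" * (len(word) - 2)  # keep the first two letters

for w in ["neck", "discomfort", "16", "girl", "She", "mg", "the", "of"]:
    print(f"{w!r:>14} -> {mask_word(w)!r}")
```

The exclusion check runs first, so a two-letter token such as `of` is returned unchanged even though other two-letter words (`mg`) are fully hidden.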

In [16]:
model_name = "samrawal/bert-base-uncased_clinical-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# GPU if available
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)
df_masked_text = {'index': [], 'note': [], 'masked_note': [], 'full_note': [], 'masked_full_note': []}

# Iterate over the whole dataset
for index, row in tqdm(df.iterrows(), total=len(df), desc="Masked dataset"):
    note = row["note"]
    full_note = row["full_note"]

    df_masked_text['index'].append(index)
    df_masked_text['note'].append(note)
    df_masked_text['full_note'].append(full_note)

    for idx, text in enumerate([note, full_note]):
        ner_results = ner_pipeline(text)

        # Character spans (start, end) of the entities to mask
        to_mask = []
        for entity in ner_results:
            if entity["entity_group"] in {"problem", "treatment"}:
                to_mask.append((entity["start"], entity["end"]))

        # Mask words inside the selected spans, copying the text between them unchanged
        masked_text = ""
        i = 0
        for start, end in sorted(to_mask):
            masked_text += text[i:start]
            original = text[start:end]
            masked = re.sub(r'\b\w+\b', lambda m: mask_word(m.group()), original)
            masked_text += masked
            i = end

        masked_text += text[i:]

        if idx == 0:
            df_masked_text['masked_note'].append(masked_text)
        else:
            df_masked_text['masked_full_note'].append(masked_text)

        # print("\nMasked text (first two letters are visible for PROBLEM/TREATMENT):\n")
        # print(masked_text)

Device set to use cuda
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Masked dataset: 100%|██████████| 30000/30000 [50:00<00:00, 10.00it/s]
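
The span-stitching step of the loop above can be isolated from the model. The sketch below is an illustration under assumptions, not the notebook's exact code: `partial_mask` is a simplified stand-in for `mask_word`, `KEEP` stands in for `exclude_tokens`, and the span is written by hand as if the NER pipeline had returned it. Spans are assumed not to overlap.

```python
import re

KEEP = {"in", "the", "of", "and"}  # stand-in for exclude_tokens

def partial_mask(word):
    """Simplified stand-in for mask_word: keep two letters, star the rest."""
    if word.lower() in KEEP:
        return word
    if word.isdigit() or len(word) <= 2:
        return "*" * len(word)
    return word[:2] + "*" * (len(word) - 2)

def mask_spans(text, spans):
    """Mask the words inside (start, end) character spans; spans must not overlap."""
    parts, i = [], 0
    for start, end in sorted(spans):
        parts.append(text[i:start])  # untouched text before the span
        parts.append(re.sub(r"\b\w+\b",
                            lambda m: partial_mask(m.group()),
                            text[start:end]))
        i = end
    parts.append(text[i:])           # untouched tail
    return "".join(parts)

text = "complaints of discomfort in the neck and lower back"
start = text.index("discomfort")
spans = [(start, start + len("discomfort in the neck"))]  # hand-written span

print(mask_spans(text, spans))
# complaints of di******** in the ne** and lower back
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which can matter for long notes. The result is the same as building the string with `+=`.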


In [17]:
df_masked = pd.DataFrame(df_masked_text)

In [18]:
print(df_masked['masked_note'][0])

A a sixteen year-old girl, presented to our Outpatient department with the complaints of di******** in the ne** and lo*** ba** as well as re********* of bo** mo*******. She was not able to maintain an erect posture and would tend to fall on either side while standing up from a sitting position. She would keep her head turned to the right and upwards due to the su******* co********* of the ne** mu*****. There was a si****** be***** of the ba** in the lumbar region. To counter the ab****** po********* of the ba** and ne**, she would keep her limbs in a specific position to allow her body weight to be supported. Due to the restrictions with the body movements at the neck and in the lumbar region, she would require assistance in standing and walking. She would require her parents to help her with daily chores, including all activities of self-care.She had been experiencing these di********** for the past four months since when she was introduced to ol****pi** ta***** for the control of ***

### **SPACY NER**

In [12]:
spacy.cli.download("en_core_web_lg")

# spaCy model
nlp = spacy.load("en_core_web_lg")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [19]:
# Entity types to mask
entities_to_mask = {"PERSON", "NORP", "DATE", "CARDINAL", "QUANTITY", "FAC", "ORG",
                    "GPE", "LOC", "LANGUAGE", "TIME"}

for index, row in tqdm(df_masked.iterrows(), total=len(df_masked), desc="Masked dataset"):
    for col in ["masked_note", "masked_full_note"]:
        text = row[col]

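        to_mask = []        # every entity span that must be masked
        other_to_mask = []  # subset of spans that are hidden completely

        doc = nlp(text)

        for ent in doc.ents:
            if ent.label_ in entities_to_mask:
                to_mask.append((ent.start_char, ent.end_char))
            # Numbers, quantities and times keep no visible prefix
            if ent.label_ in ["CARDINAL", "QUANTITY", "TIME"]:
                other_to_mask.append((ent.start_char, ent.end_char))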

        masked_text_new = ""
        i = 0
        for start, end in sorted(to_mask):
            if (start, end) in other_to_mask:
                masked_text_new += text[i:start]
                masked_text_new += '*' * (end - start)
                i = end
                continue
            masked_text_new += text[i:start]
            original = text[start:end]
            masked = re.sub(r'\b\w+\b', lambda m: mask_word(m.group()), original)
            masked_text_new += masked
            i = end

        masked_text_new += text[i:]

        # Final pass: fully hide any gender term left anywhere in the text
        masked_text_new = re.sub(
          r'\b\w+\b',
          lambda m: mask_word(m.group()) if m.group().lower() in gender_terms else m.group(),
          masked_text_new
        )

        df_masked.at[index, col] = masked_text_new

Masked dataset: 100%|██████████| 30000/30000 [2:22:03<00:00,  3.52it/s]
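
The label-dependent logic of this pass can be tried without loading spaCy. The following is a minimal sketch under stated assumptions: the `(label, start, end)` triples are hand-written stand-ins for `doc.ents`, `GENDER` is a shortened stand-in for `gender_terms`, and `partial_mask` and `mask_entities` are hypothetical helper names.

```python
import re

GENDER = {"he", "she", "her", "his"}  # shortened stand-in for gender_terms
FULL_MASK_LABELS = {"CARDINAL", "QUANTITY", "TIME"}

def partial_mask(word):
    if word.isdigit() or word.lower() in GENDER or len(word) <= 2:
        return "*" * len(word)
    return word[:2] + "*" * (len(word) - 2)

def mask_entities(text, ents):
    """ents: (label, start, end) triples, assumed non-overlapping."""
    parts, i = [], 0
    for label, start, end in sorted(ents, key=lambda e: e[1]):
        parts.append(text[i:start])
        if label in FULL_MASK_LABELS:
            parts.append("*" * (end - start))  # hide the whole span
        else:
            parts.append(re.sub(r"\b\w+\b",
                                lambda m: partial_mask(m.group()),
                                text[start:end]))
        i = end
    parts.append(text[i:])
    masked = "".join(parts)
    # Final pass: hide gender terms anywhere in the text
    return re.sub(
        r"\b\w+\b",
        lambda m: partial_mask(m.group()) if m.group().lower() in GENDER else m.group(),
        masked,
    )

text = "She was seen in London for 3 weeks"
city, num = text.index("London"), text.index("3")
ents = [("GPE", city, city + len("London")), ("CARDINAL", num, num + 1)]

print(mask_entities(text, ents))
# *** was seen in Lo**** for * weeks
```

One side effect of the final pass is worth knowing: a word already masked by the previous stage leaves its two-letter prefix visible to `\b\w+\b`, so a token like `he***` (for example from "heart") is matched as `he` and becomes `*****`.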


In [20]:
from google.colab import files

df_masked.to_csv('masked_dataset.csv', index=False)
files.download('masked_dataset.csv')

#### Reading masked dataset

In [21]:
print(df_masked['masked_note'][0])

A a si***** ye**-ol* ****, presented to our Ou******** department with the complaints of di******** in the ne** and lo*** ba** as well as re********* of bo** mo*******. *** was not able to maintain an erect posture and would tend to fall on either side while standing up from a sitting position. *** would keep *** head turned to the right and upwards due to the ********* *********** of the **** mu*****. There was a si****** be***** of the ba** in the lumbar region. To counter the ab****** po********* of the ba** and ne**, *** would keep *** limbs in a specific position to allow *** body weight to be supported. Due to the restrictions with the body movements at the neck and in the lumbar region, *** would require assistance in standing and walking. *** would require *** parents to help *** with daily chores, including all activities of self-care.*** had been experiencing these di********** for the pa** fo** mo**** since when *** was introduced to ol****pi** ta***** for the control of ***

In [22]:
print(df_masked['masked_full_note'][0])

A a si***** ye**-ol* ****, presented to our Ou******** department with the complaints of di******** in the ne** and lo*** ba** as well as re********* of bo** mo*******. *** was not able to maintain an erect posture and would tend to fall on either side while standing up from a sitting position. *** would keep *** head turned to the right and upwards due to the ********* *********** of the **** mu*****. There was a si****** be***** of the ba** in the lumbar region. To counter the ab****** po********* of the ba** and ne**, *** would keep *** limbs in a specific position to allow *** body weight to be supported. Due to the restrictions with the body movements at the neck and in the lumbar region, *** would require assistance in standing and walking. *** would require *** parents to help *** with daily chores, including all activities of self-care.*** had been experiencing these di********** for the pa** fo** mo**** since when *** was introduced to ol******** ta***** for the control of ***