<a href="https://colab.research.google.com/github/zelal-Eizaldeen/DLH-Project-Reproduce-HurtfulWords/blob/main/Fairness_metric_recreation_using_pretrained_model_withadversarial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Names: Payel Chakraborty && Zilal Eiz Al Din

NetIDs: payelc2 && zelalae2

Purpose: This file lists the paper, its pretrained models, the environment and prerequisites, and the script usage.

Dataset: HurtfulWordsDataset (MIMIC3)

Paper Reference: Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings

The paper: https://arxiv.org/abs/2003.11515

Github Repo of the paper: https://github.com/MLforHealth/HurtfulWords

Github Repo of the Reproduction: https://github.com/zelal-Eizaldeen/DLH-Project-Reproduce-HurtfulWords

Prerequisite: the cleaned demographic dataset "data.csv" and the listfile.csv in the phenotyping folder of the mimic3-benchmark folder.

Usage: This script merges the demographic data (subject, language, ethnicity, gender, insurance, and cleaned clinical notes) with the 25 phenotype labels from the "phenotyping" folder in mimic3-benchmark. The phenotyping folder contains both train and test splits. We tokenize the train split and evaluate the model on it to measure performance gaps within each demographic subgroup. We use the train split because no model is trained in this script, only evaluated, and the train split is the larger of the two, so each subgroup has more samples. The same dataset is also evaluated with the adversarially debiased model.
The goal is to assess fairness and bias in clinical prediction models (particularly BERT-based ones) across demographic subgroups: gender, ethnicity, language, and insurance.

# Mounting the Google Drive and importing necessary packages and libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
#Import modules
import sys
sys.path.append('/content/drive/MyDrive/Payel-DLH-related/Colab Notebooks')
import pandas as pd
from pathlib import Path
import os
from transformers import BertForMaskedLM, BertModel, BertTokenizer, AutoModelForMaskedLM, AutoTokenizer
import torch

# Pretrained Clinical BERT Model that was downloaded from the Original paper Github repository

In [None]:
SCIBERT_DIR = Path('/content/drive/MyDrive/Payel-DLH-related/HurtfulWords/pretrainedModels')

In [None]:
baseline_model_path = SCIBERT_DIR / 'Baseline_Clinical_BERT/baseline_clinical_BERT_1_epoch_512'

In [None]:
print(baseline_model_path)

/content/drive/MyDrive/Payel-DLH-related/HurtfulWords/pretrainedModels/Baseline_Clinical_BERT/baseline_clinical_BERT_1_epoch_512


# There are two kinds of input files: the demog file (data.csv) holds the notes per subject, and the MIMIC files hold the phenotype labels (25 of them), originally produced by the mimic3-benchmark build steps. The latter come as both train and test files.

In [None]:
# DEMOG_FILE = Path('/content/drive/MyDrive/Payel-DLH-related/DataFiles/sampled_files')
DEMOG_FILE = Path('/content/drive/MyDrive/Payel-DLH-related/DataFiles/cleaned_data/data.csv')
MIMIC_FILE = Path('/content/drive/MyDrive/Payel-DLH-related/mimic3-benchmarks/data/phenotyping/train')
MIMIC_FILE_TEST = Path('/content/drive/MyDrive/Payel-DLH-related/mimic3-benchmarks/data/phenotyping/test')

# Dataframe creation for the demog file

In [None]:
df_demog = pd.read_csv(DEMOG_FILE)

In [None]:
print(len(df_demog))

5018


In [None]:
df_demog.columns

Index(['row_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input'],
      dtype='object')

# Renaming the note identifier column from row_id to note_id in the demog file.

In [None]:
# df_demog=df_demog.reset_index()()
df_demog=df_demog.rename(columns={'row_id':'note_id'})
df_demog['cleaned_sent'].head()

Unnamed: 0,cleaned_sent
0,"['Patient made CMO last night, bradycardic 20-..."
1,"['Patient made CMO last night, bradycardic 20-..."
2,['PHIAGEPHI yr old female admitted to the micu...
3,['PHIAGEPHI yr old female admitted to the micu...
4,"['Comfort care (CMO, Comfort Measures) Asse..."


In [None]:
df_demog['cleaned_sent'][1]

"['Patient made CMO last night, bradycardic 20-30 bpm when I came in,    apneic, appears comfortable.', 'Family at bedside with patient, emotional    support given.', 'Went into asystole, house officer intormed.', 'Pronounced    dead by doctor PHINAMEPHI at  Family offered for autopsy which they    refused to be done to the patient.', 'Post mortem care done, after family    have spent time with patient.', 'Patient s wife not at bedside, children    will inform their mother personally.', 'PHINAMEPHI rites  sacrament of sick    done yesterday.', 'Patient was brought to PHINAMEPHI with name band intact.', 'Comfort care (CMO, Comfort Measures)    Assessment:    Throughout shift pt.', 'became increasingly hypotensive despite being    given blood and albumin.', 'Sats progressively worsened on 100% FIO    Family called and MDs conformed CMO status.', 'Action:    PRN morphine as needed.', 'Morphine gtt ordered but not hung as pt.', 'unresponsive and appears comfortable.', 'Emotional support to

# Dataframe creation for the MIMIC files: listfile.csv contains the stay filename and the 25 labels. There are two of them, one for train and the other for test. The subject_id is extracted from the first field in listfile.csv, which contains the episode timeseries filenames.

In [None]:
df_mimic = pd.read_csv(MIMIC_FILE/'listfile.csv',index_col=0)
df_mimic = df_mimic.reset_index()

In [None]:
df_mimic.head(5)

Unnamed: 0,stay,period_length,Acute and unspecified renal failure,Acute cerebrovascular disease,Acute myocardial infarction,Cardiac dysrhythmias,Chronic kidney disease,Chronic obstructive pulmonary disease and bronchiectasis,Complications of surgical procedures or medical care,Conduction disorders,...,Gastrointestinal hemorrhage,Hypertension with complications and secondary hypertension,Other liver diseases,Other lower respiratory disease,Other upper respiratory disease,Pleurisy; pneumothorax; pulmonary collapse,Pneumonia (except that caused by tuberculosis or sexually transmitted disease),Respiratory failure; insufficiency; arrest (adult),Septicemia (except in labor),Shock
0,10036_episode1_timeseries.csv,43.3632,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
1,1004_episode1_timeseries.csv,144.9864,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
2,1004_episode2_timeseries.csv,53.8656,1,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
3,10064_episode1_timeseries.csv,4.968,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,10093_episode1_timeseries.csv,21.6408,1,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1


In [None]:

df_mimic['subject_id'] = df_mimic['stay'].str.split('_').str[0]
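
The `stay` values follow the pattern `<subject_id>_episode<k>_timeseries.csv`, as the preview rows show. A minimal sketch of the same extraction on a few of those filenames, which also pulls out the episode number. The toy frame and the `episode` column are illustrative additions and are not part of the notebook's pipeline:

```python
import pandas as pd

# Toy frame using filenames copied from the listfile preview
toy = pd.DataFrame({"stay": [
    "10036_episode1_timeseries.csv",
    "1004_episode1_timeseries.csv",
    "1004_episode2_timeseries.csv",
]})

# Capture the leading digits (subject) and the digits after "episode"
parts = toy["stay"].str.extract(r"^(?P<subject_id>\d+)_episode(?P<episode>\d+)_")
toy["subject_id"] = parts["subject_id"].astype(int)
toy["episode"] = parts["episode"].astype(int)

print(toy)
```

Keeping the episode number makes it visible that one subject can contribute several label rows, which matters once this frame is merged with the notes.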

In [None]:
df_mimic_test = pd.read_csv(MIMIC_FILE_TEST/'listfile.csv',index_col=0)
df_mimic_test = df_mimic_test.reset_index()

df_mimic_test['subject_id'] = df_mimic_test['stay'].str.split('_').str[0]

In [None]:
print(len(df_mimic_test))
print(len(df_mimic))

225
1261


In [None]:
df_demog.columns

Index(['note_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input'],
      dtype='object')

# Merging the demog file and the MIMIC file (creating one merged copy for train and another for test)

In [None]:
# df_demog=df_demog.rename(columns={'subject_id_x':'subject_id'})
df_demog['subject_id'] = df_demog['subject_id'].astype(int)
# df_demog = df_demog[['subject_id_x','gender','ethnicity','language_to_use','insurance']]
df_mimic['subject_id'] = df_mimic['subject_id'].astype(int)
df_mimic_test['subject_id'] = df_mimic_test['subject_id'].astype(int)
merged_df = pd.merge(df_demog, df_mimic, on='subject_id', how='inner')
merged_df.head()


Unnamed: 0,note_id,subject_id,hadm_id,chartdate,charttime,storetime,category,description,cgid,iserror,...,Gastrointestinal hemorrhage,Hypertension with complications and secondary hypertension,Other liver diseases,Other lower respiratory disease,Other upper respiratory disease,Pleurisy; pneumothorax; pulmonary collapse,Pneumonia (except that caused by tuberculosis or sexually transmitted disease),Respiratory failure; insufficiency; arrest (adult),Septicemia (except in labor),Shock
0,316599,29080,181664.0,2163-02-23,2163-02-23 09:11:00,2163-02-23 09:11:59,Nursing,Nursing Progress Note,21297.0,,...,1,1,0,0,0,0,1,0,0,0
1,316573,29080,181664.0,2163-02-23,2163-02-23 04:31:00,2163-02-23 04:31:07,Nursing,Nursing Progress Note,20706.0,,...,1,1,0,0,0,0,1,0,0,0
2,315769,29548,182189.0,2195-01-18,2195-01-18 16:24:00,2195-01-18 16:25:03,Physician,Physician Resident Admission Note,15235.0,,...,0,1,0,1,0,0,0,1,0,0
3,315777,29548,182189.0,2195-01-18,2195-01-18 16:50:00,2195-01-18 16:50:40,Physician,Physician Attending Admission Note,20047.0,,...,0,1,0,1,0,0,0,1,0,0
4,317118,30699,108670.0,2121-01-07,2121-01-07 05:50:00,2121-01-07 05:50:42,Nursing,Nursing Progress Note,21290.0,,...,0,0,0,0,0,0,0,1,0,0


In [None]:
merged_df_test = pd.merge(df_demog, df_mimic_test, on='subject_id', how='inner')
merged_df_test.head()

Unnamed: 0,note_id,subject_id,hadm_id,chartdate,charttime,storetime,category,description,cgid,iserror,...,Gastrointestinal hemorrhage,Hypertension with complications and secondary hypertension,Other liver diseases,Other lower respiratory disease,Other upper respiratory disease,Pleurisy; pneumothorax; pulmonary collapse,Pneumonia (except that caused by tuberculosis or sexually transmitted disease),Respiratory failure; insufficiency; arrest (adult),Septicemia (except in labor),Shock
0,325855,28313,156864.0,2117-04-04,2117-04-04 01:30:00,2117-04-04 05:23:20,Nursing,Nursing Progress Note,18469.0,,...,0,0,0,0,0,0,0,0,0,0
1,325855,28313,156864.0,2117-04-04,2117-04-04 01:30:00,2117-04-04 05:23:20,Nursing,Nursing Progress Note,18469.0,,...,1,0,0,0,0,0,0,1,1,1
2,325844,28313,156864.0,2117-04-04,2117-04-04 01:30:00,2117-04-04 01:30:36,Nursing,Nursing Progress Note,18469.0,,...,0,0,0,0,0,0,0,0,0,0
3,325844,28313,156864.0,2117-04-04,2117-04-04 01:30:00,2117-04-04 01:30:36,Nursing,Nursing Progress Note,18469.0,,...,1,0,0,0,0,0,0,1,1,1
4,325847,28313,156864.0,2117-04-04,2117-04-04 03:15:00,2117-04-04 03:15:16,Physician,Physician Resident Admission Note,18767.0,,...,0,0,0,0,0,0,0,0,0,0


In [None]:
print(len(merged_df_test))
print(len(merged_df))

945
4490
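
Because a subject can have several notes in data.csv and several episodes in listfile.csv, an inner merge on `subject_id` alone pairs every note with every episode of that subject. This is visible in the test preview, where `note_id` 325855 appears twice with different label vectors. A small sketch with made-up rows (all values below are hypothetical) shows the row multiplication and how the `validate` argument of `pd.merge` can flag it:

```python
import pandas as pd

# Hypothetical notes: subject 1 has two notes, subject 2 has one
notes = pd.DataFrame({"subject_id": [1, 1, 2], "note_id": [10, 11, 20]})
# Hypothetical episodes: subject 1 has two episodes, subject 2 has one
episodes = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "stay": [
        "1_episode1_timeseries.csv",
        "1_episode2_timeseries.csv",
        "2_episode1_timeseries.csv",
    ],
})

merged = pd.merge(notes, episodes, on="subject_id", how="inner")
# Subject 1 contributes 2 notes x 2 episodes, subject 2 contributes 1 x 1
print(len(notes), len(episodes), len(merged))

# validate="many_to_one" raises if subject_id repeats on the right-hand side
try:
    pd.merge(notes, episodes, on="subject_id", how="inner", validate="many_to_one")
    fan_out = False
except pd.errors.MergeError:
    fan_out = True
print("episodes side has repeated subject_id:", fan_out)
```

Whether the duplication is acceptable depends on the analysis; the check only makes it explicit.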


# Preparation for Tokenization

In [None]:
!pip install nltk



In [None]:
import nltk
nltk.data.path.append('/root/nltk_data')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import os
print(os.listdir('/root/nltk_data/tokenizers'))

['punkt_tab', 'punkt_tab.zip', 'punkt', 'punkt.zip']


# Creating copies of the train and test dataframes as merged_df_copy and merged_df_test_copy, which we work with from here on

In [None]:
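merged_df_copy = merged_df.copy()  # .copy() creates an independent frame; plain assignment would only add a second name

In [None]:
merged_df_test_copy = merged_df_test.copy()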
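
In pandas, plain assignment binds a second name to the same DataFrame object, while `.copy()` creates an independent one. A short sketch on a toy frame (all names here are illustrative) shows the difference when columns are added later, as happens with `tokenized_notes` and `labels` further down:

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2]})
alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias["b"] = [3, 4]            # also shows up in `original`
independent["c"] = [5, 6]      # does not show up in `original`

print(alias is original)
print(list(original.columns))
print(list(independent.columns))
```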

In [None]:
print(len(merged_df))

4490


In [None]:
print(len(merged_df_test))

945


In [None]:
merged_df_copy.columns

Index(['note_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input', 'stay', 'period_length',
       'Acute and unspecified renal failure', 'Acute cerebrovascular disease',
       'Acute myocardial infarction', 'Cardiac dysrhythmias',
       'Chronic kidney disease',
       'Chronic obstructive pulmonary disease and bronchiectasis',
       'Com

# Tokenization of our dataset (we evaluate on the train dataset because it has more samples than the test dataset; the purpose of this script is to evaluate the model, not to train it).

In [None]:
from transformers import BertTokenizer
from tqdm import tqdm
import torch

# Setup
tqdm.pandas()
tokenizer = BertTokenizer.from_pretrained(baseline_model_path)

# Sliding Window Tokenization Function
def sliding_window_tokenize(text, tokenizer, max_len=512, stride=64):
    encoded = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=max_len,
        stride=stride,
        return_overflowing_tokens=True,
        return_tensors='pt'
    )
    return {
        "input_ids": encoded["input_ids"],              # Shape: [num_chunks, max_len]
        "attention_mask": encoded["attention_mask"]
    }

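import ast

# Apply per note (from cleaned sentences)
def process_note_with_sliding_window(row, tokenizer, max_len=512, stride=64):
    sents = row['cleaned_sent']
    if isinstance(sents, str):  # read_csv returns the list as its string repr; parse it back
        sents = ast.literal_eval(sents)
    note = " ".join(sents)  # Join all cleaned sentences into one string
    return sliding_window_tokenize(note, tokenizer, max_len=max_len, stride=stride)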

# Apply to the DataFrame
merged_df_copy['tokenized_notes'] = merged_df_copy.progress_apply(
    lambda row: process_note_with_sliding_window(row, tokenizer, max_len=512, stride=64),
    axis=1
)


100%|██████████| 4490/4490 [08:08<00:00,  9.19it/s]
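
In the Hugging Face tokenizer call, `stride` is the number of tokens shared between consecutive chunks when `return_overflowing_tokens=True`. The idea can be sketched in plain Python on a list of token IDs. This sketch is an illustration only: the function name is made up, and it ignores the special tokens and padding that the real tokenizer adds.

```python
def sliding_windows(token_ids, max_len=8, overlap=2):
    """Split token_ids into chunks of at most max_len that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap  # how far the window start advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reached the end of the sequence
        start += step
    return chunks

# 20 fake token IDs, windows of 8 with 2 tokens of overlap
chunks = sliding_windows(list(range(20)), max_len=8, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last `overlap` tokens of the previous one, so text near a chunk boundary is seen with context on both sides.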


In [None]:
merged_df_copy.columns

Index(['note_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input', 'stay', 'period_length',
       'Acute and unspecified renal failure', 'Acute cerebrovascular disease',
       'Acute myocardial infarction', 'Cardiac dysrhythmias',
       'Chronic kidney disease',
       'Chronic obstructive pulmonary disease and bronchiectasis',
       'Com

# Creating one list column called labels having all the 25 labels as the list elements.

In [None]:
label_col = ['Acute and unspecified renal failure',
       'Acute cerebrovascular disease', 'Acute myocardial infarction',
       'Cardiac dysrhythmias', 'Chronic kidney disease',
       'Chronic obstructive pulmonary disease and bronchiectasis',
       'Complications of surgical procedures or medical care',
       'Conduction disorders', 'Congestive heart failure; nonhypertensive',
       'Coronary atherosclerosis and other heart disease',
       'Diabetes mellitus with complications',
       'Diabetes mellitus without complication',
       'Disorders of lipid metabolism', 'Essential hypertension',
       'Fluid and electrolyte disorders', 'Gastrointestinal hemorrhage',
       'Hypertension with complications and secondary hypertension',
       'Other liver diseases', 'Other lower respiratory disease',
       'Other upper respiratory disease',
       'Pleurisy; pneumothorax; pulmonary collapse',
       'Pneumonia (except that caused by tuberculosis or sexually transmitted disease)',
       'Respiratory failure; insufficiency; arrest (adult)',
       'Septicemia (except in labor)', 'Shock']

In [None]:
merged_df_copy['labels'] = merged_df_copy[label_col].values.tolist()

In [None]:
merged_df_copy.columns

Index(['note_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input', 'stay', 'period_length',
       'Acute and unspecified renal failure', 'Acute cerebrovascular disease',
       'Acute myocardial infarction', 'Cardiac dysrhythmias',
       'Chronic kidney disease',
       'Chronic obstructive pulmonary disease and bronchiectasis',
       'Com

In [None]:
merged_df_copy['labels'][1]

[1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]

# Preparing and evaluating the dataset with the Pretrained Clinical BERT Model

In [None]:
from torch.utils.data import Dataset
import torch

class ClinicalNotesDataset(Dataset):
    def __init__(self, df, max_len=512):
        self.samples = []

        for _, row in df.iterrows():
            # Get tokenized notes and labels
            tokenized = row['tokenized_notes']
            label = torch.tensor(row['labels'], dtype=torch.float)  # Assuming it's a multi-label tensor

            # Extract chunks for each tokenized note
            input_ids_chunks = tokenized['input_ids']  # Shape: [num_chunks, max_len]
            attention_mask_chunks = tokenized['attention_mask']

            # Create a sample for each chunk of the note
            for input_ids, attention_mask in zip(input_ids_chunks, attention_mask_chunks):
                # Repeat the label for every chunk (because label applies to the entire note)
                self.samples.append({
                    'input_ids': input_ids,
                    'attention_mask': attention_mask,
                    'labels': label
                })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {
            'input_ids': self.samples[idx]['input_ids'],
            'attention_mask': self.samples[idx]['attention_mask'],
            'labels': self.samples[idx]['labels']
        }
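
`ClinicalNotesDataset` emits one sample per chunk, so whenever a note produces more than one chunk the model's predictions have more rows than the DataFrame. Any later step that indexes predictions with per-note boolean masks then needs the predictions brought back to one row per note. A minimal sketch of one way to do that, assuming a hypothetical `note_index` array recorded while building the samples and max-pooling as the aggregation rule (neither is in the notebook):

```python
import numpy as np

# Hypothetical chunk-level probabilities: 5 chunks, 3 labels
chunk_probs = np.array([
    [0.1, 0.8, 0.2],
    [0.3, 0.4, 0.9],
    [0.6, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [0.7, 0.3, 0.1],
])
# note_index[i] is the row position of the note that produced chunk i
note_index = np.array([0, 0, 1, 2, 2])

n_notes = note_index.max() + 1
# Start from zeros (valid because probabilities are non-negative)
note_probs = np.zeros((n_notes, chunk_probs.shape[1]))
# Element-wise maximum over all chunks that belong to the same note
np.maximum.at(note_probs, note_index, chunk_probs)

print(note_probs)
```

Mean-pooling would be an equally reasonable choice; the point is only that predictions and demographic masks end up with the same number of rows.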


In [None]:
dataset = ClinicalNotesDataset(merged_df_copy)
sample = dataset[0]

print(sample['input_ids'].shape)       # torch.Size([512])
print(sample['attention_mask'].shape)  # torch.Size([512])
print(sample['labels'].shape)          # torch.Size([25])

torch.Size([512])
torch.Size([512])
torch.Size([25])


In [None]:
train_dataset_phase2 = ClinicalNotesDataset(merged_df_copy, max_len=512)

In [None]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import recall_score, confusion_matrix, precision_score, f1_score, accuracy_score
import numpy as np

# Load the pretrained Phase 1 model
model = BertForSequenceClassification.from_pretrained(baseline_model_path,num_labels=25)

# Define training arguments for evaluation
eval_args = TrainingArguments(
    output_dir="./eval_results",  # You can update this path
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    do_train=False,
    do_eval=True
)

# Custom metrics
def compute_metrics(pred):
    labels = pred.label_ids
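    # pred.predictions holds raw logits, so 0.5 here is a logit cutoff (a probability of about 0.62 after a sigmoid)
    preds = (pred.predictions > 0.5).astype(int)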

    recall = recall_score(labels, preds, average='micro')
    precision = precision_score(labels, preds, average='micro')
    f1 = f1_score(labels, preds, average='micro')
    accuracy = accuracy_score(labels, preds)

    return {
        'recall': recall,
        'precision': precision,
        'f1': f1,
        'accuracy': accuracy,
    }

# Initialize Trainer for evaluation
trainer = Trainer(
    model=model,
    args=eval_args,
    eval_dataset=train_dataset_phase2,
    compute_metrics=compute_metrics
)

# Run evaluation
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)

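# Get predictions (y_true and y_pred are reused by the fairness analysis below)
predictions = trainer.predict(train_dataset_phase2)
y_true = predictions.label_ids
# Raw logits thresholded at 0.5, matching compute_metrics above
y_pred = (predictions.predictions > 0.5).astype(int)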

# Compute specificity
specificities = []
for i in range(y_true.shape[1]):
    cm = confusion_matrix(y_true[:, i], y_pred[:, i])
    if cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
        specificities.append(specificity)
    else:
        specificities.append(np.nan)

average_specificity = np.nanmean(specificities)
print(f"Average Specificity: {average_specificity:.2f}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /content/drive/MyDrive/Payel-DLH-related/HurtfulWords/pretrainedModels/Baseline_Clinical_BERT/baseline_clinical_BERT_1_epoch_512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluation Results: {'eval_loss': 0.7342121601104736, 'eval_model_preparation_time': 0.0029, 'eval_recall': 0.06156236129604971, 'eval_precision': 0.10101230791639357, 'eval_f1': 0.07650092385758804, 'eval_accuracy': 0.0, 'eval_runtime': 131.5058, 'eval_samples_per_second': 34.143, 'eval_steps_per_second': 2.137}
Average Specificity: 0.88
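
The `predictions` array returned by `Trainer.predict` for a sequence classification model contains logits rather than probabilities, so the meaning of a cutoff depends on the space it is applied in. The sketch below, on made-up logits, compares a 0.5 cutoff on sigmoid probabilities with cutoffs of 0 and 0.5 applied directly to logits:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up logits for illustration
logits = np.array([-2.0, -0.1, 0.0, 0.3, 0.5, 0.7, 3.0])

via_probability = sigmoid(logits) > 0.5  # cutoff in probability space
via_logit_zero = logits > 0.0            # the same decision expressed on logits
via_logit_half = logits > 0.5            # stricter: requires probability > sigmoid(0.5)

print(np.array_equal(via_probability, via_logit_zero))
print(round(float(sigmoid(0.5)), 4))
print(int(via_logit_half.sum()), int(via_logit_zero.sum()))
```

A logit cutoff of 0.5 therefore labels fewer samples positive than a probability cutoff of 0.5 would, which lowers recall and raises specificity relative to it.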


In [None]:
merged_df_copy.columns

Index(['note_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input', 'stay', 'period_length',
       'Acute and unspecified renal failure', 'Acute cerebrovascular disease',
       'Acute myocardial infarction', 'Cardiac dysrhythmias',
       'Chronic kidney disease',
       'Chronic obstructive pulmonary disease and bronchiectasis',
       'Com

# Table 4 structure formation: first map the category fields (gender, ethnicity_to_use, insurance, and language_to_use) to the group names used in the comparisons in the original paper.

In [None]:
gender_map = {
    'F': 'Female',
    'M': 'Male'
}

merged_df_copy['gender_mapped'] = merged_df_copy['gender'].map(gender_map)
ethnicity_map = {
    "WHITE": "White",
    "BLACK": "Black",
    "ASIAN": "Asian",
    "HISPANIC/LATINO": "Hispanic",
    "OTHER": "Other",
    "UNKNOWN/NOT SPECIFIED": "Other"
}

merged_df_copy['ethnicity_mapped'] = merged_df_copy['ethnicity_to_use'].map(ethnicity_map)


language_map = {
    "English": "English",
    "Other": "Other",
    "Missing": "Other"  # You can also exclude Missing instead
}

merged_df_copy['language_mapped'] = merged_df_copy['language_to_use'].map(language_map)
insurance_map = {
    "Medicare": "Medicare",
    "Private": "Private",
    "Medicaid": "Medicaid",
    "Government": "Other",
    "Self Pay": "Other",
    "Other": "Other"
}

merged_df_copy['insurance_mapped'] = merged_df_copy['insurance'].map(insurance_map)
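
`Series.map` with a dict returns NaN for any value that is not a key of the dict. Rows with NaN compare as not-equal to every group name, so in a "group vs everyone else" comparison they silently land in the second group. A small sketch for listing unmapped values before relying on the mapped column (the shortened map and the sample values are hypothetical):

```python
import pandas as pd

# Shortened, hypothetical version of a category map
ethnicity_map = {"WHITE": "White", "BLACK": "Black", "OTHER": "Other"}
raw = pd.Series(["WHITE", "BLACK", "OTHER", "ASIAN", None])

mapped = raw.map(ethnicity_map)

# Values that were present in the raw column but had no entry in the map
unmapped_values = sorted(raw[mapped.isna() & raw.notna()].unique())

print(int(mapped.isna().sum()), unmapped_values)
```

An empty `unmapped_values` list means that every NaN in the mapped column comes from a value that was already missing in the source data.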


# Fairness metrics calculation and reproduction of Table 4 of the paper. (From the paper: classifiers trained with baseline clinical BERT embeddings have multi-group fairness performance gaps across gender, language, ethnicity, and insurance status. We count the number of downstream classification tasks with statistically significant differences, as well as the percentage of significant tasks that favor a subgroup. The paper uses 57 tasks in total; this reproduction uses 25. For many comparisons, there are a large number of tasks for which a significant difference exists across subgroups.)

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix
from tqdm import tqdm
from IPython.display import display  # For rendering the table

# np.random.seed(42)  # Any number works; just keep it fixed across runs
# --- Config ---
demographic_columns = {
    "gender_mapped": [("Male", "Female")],  # Ensuring Male vs Female
    "language_mapped": [("English", "Other")],
    # A second element of "Other" is handled below as "all rows not in group 1", so the
    # ("Other", "Other") pair compares rows labeled "Other" against all remaining rows
    "ethnicity_mapped": [("White", "Other"), ("Black", "Other"), ("Hispanic", "Other"), ("Asian", "Other"), ("Other", "Other")],
    "insurance_mapped": [("Medicare", "Other"), ("Private", "Other"), ("Medicaid", "Other")]
}
n_bootstrap = 100
alpha = 0.05

# --- Results placeholder ---
table4_results = []

# Efficient specificity function
def safe_specificity(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    if cm.shape == (2, 2):
        tn, fp, _, _ = cm.ravel()
        denom = tn + fp
        return tn / denom if denom else 0.0
    return 0.0

# Safe recall
def safe_recall(y_true, y_pred):
    return recall_score(y_true, y_pred) if len(np.unique(y_true)) > 1 else 0.0

# Bootstrap the metric difference between two groups and test it with a percentile confidence interval
def bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, metric_fn):
    n1, n2 = len(y_true1), len(y_true2)
    if n1 == 0 or n2 == 0:
        return 0.0, False, False
    diffs = []
    for _ in range(n_bootstrap):
        idx1 = np.random.randint(0, n1, n1)
        idx2 = np.random.randint(0, n2, n2)
        try:
            m1 = metric_fn(y_true1[idx1], y_pred1[idx1])
            m2 = metric_fn(y_true2[idx2], y_pred2[idx2])
            diffs.append(m1 - m2)
        except Exception:
            continue
    if not diffs:
        return 0.0, False, False
    diffs = np.array(diffs)
    ci_low, ci_high = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    significant = (ci_low > 0) or (ci_high < 0)
    favors_group1 = diffs.mean() > 0
    return diffs.mean(), significant, favors_group1

# Main loop with tqdm
for fairness_attr, group_pairs in demographic_columns.items():
    for group1, group2 in group_pairs:
        recall_sig = parity_sig = spec_sig = 0
        recall_fav = parity_fav = spec_fav = 0

        # Precompute masks
        if group1 == "Other":
            # When group1 is the literal string "Other", match only rows labeled "Other"
            mask1 = (merged_df_copy[fairness_attr] == "Other").values
        else:
            mask1 = (merged_df_copy[fairness_attr] == group1).values

        if group2 == "Other":
            # When group2 is "Other", dynamically create mask2 as all rows NOT in group1
            mask2 = (merged_df_copy[fairness_attr] != group1).values
        else:
            mask2 = (merged_df_copy[fairness_attr] == group2).values

        for label_idx in tqdm(range(y_true.shape[1]), desc=f"{fairness_attr} {group1} vs {group2}"):
            y_true1, y_pred1 = y_true[mask1, label_idx], y_pred[mask1, label_idx]
            y_true2, y_pred2 = y_true[mask2, label_idx], y_pred[mask2, label_idx]

            if len(y_true1) < 1 or len(y_true2) < 1:
                continue

            # Bootstrapping
            _, sig_r, fav_r = bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, safe_recall)
            _, sig_p, fav_p = bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, accuracy_score)
            _, sig_s, fav_s = bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, safe_specificity)

            recall_sig += sig_r
            parity_sig += sig_p
            spec_sig += sig_s

            recall_fav += sig_r and fav_r
            parity_fav += sig_p and fav_p
            spec_fav += sig_s and fav_s

        # Adjust the comparison name to "Other vs Other" if applicable
        comparison_name = f"{group1} vs {group2}"
        if group1 == "Other" and group2 == "Other":
            comparison_name = "Other vs Other"

        table4_results.append({
            "Fairness Definition": fairness_attr,
            "Comparison (Group 1 vs Group 2)": comparison_name,
            "Recall Gap": f"{recall_sig} ({recall_fav / max(recall_sig, 1) * 100:.1f}%)",
            "Parity Gap": f"{parity_sig} ({parity_fav / max(parity_sig, 1) * 100:.1f}%)",
            "Specificity Gap": f"{spec_sig} ({spec_fav / max(spec_sig, 1) * 100:.1f}%)"
        })

# Final output
# Map the demographic columns to their more readable labels
fairness_map = {
    "gender_mapped": "Gender",
    "language_mapped": "Language",
    "ethnicity_mapped": "Ethnicity",
    "insurance_mapped": "Insurance"
}

# Convert to DataFrame
table4_df = pd.DataFrame(table4_results)

# Update the first column in the results to show the new labels
table4_df['Fairness Definition'] = table4_df['Fairness Definition'].map(fairness_map)

# Add a MultiIndex header above the Gap columns
table4_df.columns = pd.MultiIndex.from_tuples([
    ("Fairness Definition", ""),
    ("Comparison (Group 1 vs Group 2)", ""),
    ("Significant Differences by Fairness Definition", "Recall Gap"),
    ("Significant Differences by Fairness Definition", "Parity Gap"),
    ("Significant Differences by Fairness Definition", "Specificity Gap")
])

# Style the DataFrame
styled_table = (
    table4_df.style
    .set_table_styles(
        [   {'selector': 'th', 'props': [('font-weight', 'bold'), ('border', '1px solid black')]},
            {'selector': 'td', 'props': [('border', '1px solid black')]},
            {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
        ]
    )
    # Bold the "Fairness Definition" column (addressed by its MultiIndex column key)
    .set_properties(subset=[("Fairness Definition", "")], **{'font-weight': 'bold'})
    .set_properties(**{'text-align': 'center'})
    .set_caption("Final Fairness Table 4 (Styled)")
)

# Display in Jupyter/IPython
display(styled_table)

# Save as HTML
styled_table.to_html("final_fairness_table4.html", escape=False, index=False)


gender_mapped Male vs Female: 100%|██████████| 25/25 [00:14<00:00,  1.72it/s]
language_mapped English vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.71it/s]
ethnicity_mapped White vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.74it/s]
ethnicity_mapped Black vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.75it/s]
ethnicity_mapped Hispanic vs Other: 100%|██████████| 25/25 [00:13<00:00,  1.83it/s]
ethnicity_mapped Asian vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.76it/s]
ethnicity_mapped Other vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.77it/s]
insurance_mapped Medicare vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.76it/s]
insurance_mapped Private vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.76it/s]
insurance_mapped Medicaid vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.76it/s]


Significant Differences by Fairness Definition

| | Fairness Definition | Comparison (Group 1 vs Group 2) | Recall Gap | Parity Gap | Specificity Gap |
|---|---|---|---|---|---|
| 0 | Gender | Male vs Female | 3 (66.7%) | 15 (53.3%) | 2 (0.0%) |
| 1 | Language | English vs Other | 4 (50.0%) | 20 (30.0%) | 4 (50.0%) |
| 2 | Ethnicity | White vs Other | 3 (66.7%) | 17 (29.4%) | 2 (50.0%) |
| 3 | Ethnicity | Black vs Other | 4 (50.0%) | 18 (61.1%) | 4 (75.0%) |
| 4 | Ethnicity | Hispanic vs Other | 5 (20.0%) | 12 (83.3%) | 3 (100.0%) |
| 5 | Ethnicity | Asian vs Other | 5 (60.0%) | 14 (78.6%) | 4 (50.0%) |
| 6 | Ethnicity | Other vs Other | 4 (0.0%) | 16 (68.8%) | 4 (25.0%) |
| 7 | Insurance | Medicare vs Other | 6 (66.7%) | 18 (5.6%) | 2 (0.0%) |
| 8 | Insurance | Private vs Other | 4 (25.0%) | 18 (88.9%) | 2 (100.0%) |
| 9 | Insurance | Medicaid vs Other | 5 (40.0%) | 15 (100.0%) | 2 (100.0%) |


In [None]:
import random
import numpy as np
import torch

# Set seeds for reproducibility
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

# Save files to Google Drive
save_dir = "/content/drive/MyDrive/Payel-DLH-related/DataFiles/Model files-save for future use-2"
torch.save(train_dataset_phase2, f"{save_dir}/train_dataset_phase2.pt")
np.save(f"{save_dir}/y_pred.npy", y_pred)
np.save(f"{save_dir}/y_true.npy", y_true)
merged_df_copy.to_csv(f"{save_dir}/merged_df_copy.csv", index=False)
table4_df.to_csv(f"{save_dir}/table4_fairness_results.csv", index=False)

# Trying the Pretrained Adversarially Debiased Model

In [None]:
SCIBERT_DIR = Path('/content/drive/MyDrive/Payel-DLH-related/HurtfulWords/pretrainedModels')
debiased_model_path = SCIBERT_DIR / 'Adversarially_Debiased_Clinical_BERT/adv_clinical_BERT_1_epoch_512'
print(debiased_model_path)

/content/drive/MyDrive/Payel-DLH-related/HurtfulWords/pretrainedModels/Adversarially_Debiased_Clinical_BERT/adv_clinical_BERT_1_epoch_512


In [None]:
for item in os.listdir(debiased_model_path):
    print(item)

config.json
pytorch_model.bin
vocab.txt


In [None]:
merged_df.columns

Index(['note_id', 'subject_id', 'hadm_id', 'chartdate', 'charttime',
       'storetime', 'category', 'description', 'cgid', 'iserror', 'text',
       'row_id_x', 'gender', 'dob', 'dod', 'dod_hosp', 'dod_ssn',
       'expire_flag', 'fold', 'row_id_y', 'admittime', 'dischtime',
       'deathtime', 'admission_type', 'admission_location',
       'discharge_location', 'insurance', 'language', 'religion',
       'marital_status', 'ethnicity', 'edregtime', 'edouttime', 'diagnosis',
       'hospital_expire_flag', 'has_chartevents_data', 'dod_merged',
       'ethnicity_to_use', 'age', 'icd9_code', 'language_to_use', 'icustay_id',
       'intime', 'hours_from_icu_admit', 'concat_notes', 'cleaned_sent',
       'bert_input', 'stay', 'period_length',
       'Acute and unspecified renal failure', 'Acute cerebrovascular disease',
       'Acute myocardial infarction', 'Cardiac dysrhythmias',
       'Chronic kidney disease',
       'Chronic obstructive pulmonary disease and bronchiectasis',
       'Com

In [None]:
# # Load tokenizer and model
# debiased_tokenizer = BertTokenizer.from_pretrained(debiased_model_path)
# debiased_model = BertModel.from_pretrained(debiased_model_path)

In [None]:
from transformers import BertTokenizer
from tqdm import tqdm
import torch

# Setup
tqdm.pandas()
tokenizer = BertTokenizer.from_pretrained(debiased_model_path, trust_remote_code=True)

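# Sliding Window Tokenization Function
# Note: BertTokenizer is a "slow" (pure-Python) tokenizer. With
# return_overflowing_tokens=True it keeps only the first max_len-token window in
# input_ids and returns the overflow under a separate key, so each note yields a
# single chunk here. This keeps one sample per note, which the per-row
# demographic masks used later rely on.
def sliding_window_tokenize(text, tokenizer, max_len=512, stride=64):
    encoded = tokenizer(
        text,
        truncation=True,
        padding='max_length',
        max_length=max_len,
        stride=stride,
        return_overflowing_tokens=True,
        return_tensors='pt'
    )
    return {
        "input_ids": encoded["input_ids"],              # Shape: [1, max_len] with the slow tokenizer
        "attention_mask": encoded["attention_mask"]
    }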

# Apply per note (from cleaned sentences)
def process_note_with_sliding_window(row, tokenizer, max_len=512, stride=64):
    note = " ".join(row['cleaned_sent'])  # Join all cleaned sentences into one string
    return sliding_window_tokenize(note, tokenizer, max_len=max_len, stride=stride)

# Apply to the DataFrame
merged_df['tokenized_notes'] = merged_df.progress_apply(
    lambda row: process_note_with_sliding_window(row, tokenizer, max_len=512, stride=64),
    axis=1
)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'. 
The class this function is called from is 'BertTokenizer'.
100%|██████████| 4490/4490 [08:21<00:00,  8.95it/s]
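
To illustrate what the `max_len` and `stride` arguments in the cell above are meant to control, here is a minimal pure-Python sketch of overlapping windows over a list of token ids. The helper `sliding_windows` is hypothetical and is not part of `transformers`; it assumes that `stride` is the number of tokens shared by two consecutive windows, and it does no padding.

```python
def sliding_windows(token_ids, max_len=512, stride=64):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens (assumption of this sketch).
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride          # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break                    # this window reached the end of the note
        start += step
    return windows


# Toy "note" of 1000 token ids
chunks = sliding_windows(list(range(1000)), max_len=512, stride=64)
print([len(c) for c in chunks])           # window lengths
print(chunks[0][-64:] == chunks[1][:64])  # consecutive windows share 64 tokens
```

If every window were scored separately, the per-note label would have to be repeated for each window and the predictions aggregated back to note level before any per-row demographic mask could be applied.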


In [None]:
merged_df['labels'] = merged_df[label_col].values.tolist()

In [None]:
from torch.utils.data import Dataset
import torch

class ClinicalNotesDataset(Dataset):
    def __init__(self, df, max_len=512):
        self.samples = []

        for _, row in df.iterrows():
            # Get tokenized notes and labels
            tokenized = row['tokenized_notes']
            label = torch.tensor(row['labels'], dtype=torch.float)  # Assuming it's a multi-label tensor

            # Extract chunks for each tokenized note
            input_ids_chunks = tokenized['input_ids']  # Shape: [num_chunks, max_len]
            attention_mask_chunks = tokenized['attention_mask']

            # Create a sample for each chunk of the note
            for input_ids, attention_mask in zip(input_ids_chunks, attention_mask_chunks):
                # Repeat the label for every chunk (because label applies to the entire note)
                self.samples.append({
                    'input_ids': input_ids,
                    'attention_mask': attention_mask,
                    'labels': label
                })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return {
            'input_ids': self.samples[idx]['input_ids'],
            'attention_mask': self.samples[idx]['attention_mask'],
            'labels': self.samples[idx]['labels']
        }


In [None]:
dataset = ClinicalNotesDataset(merged_df)
sample = dataset[0]

print(sample['input_ids'].shape)       # torch.Size([512])
print(sample['attention_mask'].shape)  # torch.Size([512])
print(sample['labels'].shape)          # torch.Size([25])


torch.Size([512])
torch.Size([512])
torch.Size([25])


In [None]:
train_dataset_debiased = ClinicalNotesDataset(merged_df, max_len=512)

In [None]:

from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import recall_score, confusion_matrix, precision_score, f1_score, accuracy_score
import numpy as np

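# Load the adversarially debiased Clinical BERT checkpoint.
# The checkpoint has no classification head, so the 25-label classifier layer is
# newly (randomly) initialized; see the warning in the output of this cell.
model = BertForSequenceClassification.from_pretrained(debiased_model_path, num_labels=25)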

# Define training arguments for evaluation
eval_args = TrainingArguments(
    output_dir="./eval_results",  # You can update this path
    per_device_eval_batch_size=16,
    logging_dir="./logs",
    do_train=False,
    do_eval=True
)

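# Custom metrics
def compute_metrics(pred):
    labels = pred.label_ids
    # pred.predictions holds raw logits, so a 0.5 cutoff here corresponds to a
    # sigmoid probability of about 0.62 rather than 0.5
    preds = (pred.predictions > 0.5).astype(int)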

    recall = recall_score(labels, preds, average='micro')
    precision = precision_score(labels, preds, average='micro')
    f1 = f1_score(labels, preds, average='micro')
    accuracy = accuracy_score(labels, preds)

    return {
        'recall': recall,
        'precision': precision,
        'f1': f1,
        'accuracy': accuracy,
    }

# Initialize Trainer for evaluation
trainer = Trainer(
    model=model,
    args=eval_args,
    eval_dataset=train_dataset_debiased ,
    compute_metrics=compute_metrics
)

# Run evaluation
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)

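# Get per-label predictions (reused by the fairness analysis in the next cells)
predictions = trainer.predict(train_dataset_debiased)
y_true = predictions.label_ids
# Same 0.5 cutoff on raw logits as in compute_metrics
y_pred = (predictions.predictions > 0.5).astype(int)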

# Compute specificity
specificities = []
for i in range(y_true.shape[1]):
    cm = confusion_matrix(y_true[:, i], y_pred[:, i])
    if cm.shape == (2, 2):
        tn, fp, fn, tp = cm.ravel()
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
        specificities.append(specificity)
    else:
        specificities.append(np.nan)

average_specificity = np.nanmean(specificities)
print(f"Average Specificity: {average_specificity:.2f}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /content/drive/MyDrive/Payel-DLH-related/HurtfulWords/pretrainedModels/Adversarially_Debiased_Clinical_BERT/adv_clinical_BERT_1_epoch_512 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluation Results: {'eval_loss': 0.6916810274124146, 'eval_model_preparation_time': 0.0027, 'eval_recall': 0.09316466932978251, 'eval_precision': 0.2531966224366707, 'eval_f1': 0.13621025308241402, 'eval_accuracy': 0.0, 'eval_runtime': 140.1366, 'eval_samples_per_second': 32.04, 'eval_steps_per_second': 2.005}
Average Specificity: 0.93


In [None]:
gender_map = {
    'F': 'Female',
    'M': 'Male'
}

merged_df['gender_mapped'] = merged_df['gender'].map(gender_map)

ethnicity_map = {
    "WHITE": "White",
    "BLACK": "Black",
    "ASIAN": "Asian",
    "HISPANIC/LATINO": "Hispanic",
    "OTHER": "Other",
    "UNKNOWN/NOT SPECIFIED": "Other"
}

merged_df['ethnicity_mapped'] = merged_df['ethnicity_to_use'].map(ethnicity_map)


language_map = {
    "English": "English",
    "Other": "Other",
    "Missing": "Other"  # You can also exclude Missing instead
}

merged_df['language_mapped'] = merged_df['language_to_use'].map(language_map)

insurance_map = {
    "Medicare": "Medicare",
    "Private": "Private",
    "Medicaid": "Medicaid",
    "Government": "Other",
    "Self Pay": "Other",
    "Other": "Other"
}

merged_df['insurance_mapped'] = merged_df['insurance'].map(insurance_map)



In [None]:

import pandas as pd
import numpy as np
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix
from tqdm import tqdm
from IPython.display import display  # For rendering the table

# np.random.seed(42)  # Any number works; just keep it fixed across runs
# --- Config ---
demographic_columns = {
    "gender_mapped": [("Male", "Female")],  # Male vs Female
    "language_mapped": [("English", "Other")],
    # "Other" as Group 2 means "all rows not in Group 1", so the ("Other", "Other")
    # pair compares rows labeled "Other" against every remaining row
    "ethnicity_mapped": [("White", "Other"), ("Black", "Other"), ("Hispanic", "Other"), ("Asian", "Other"), ("Other", "Other")],
    "insurance_mapped": [("Medicare", "Other"), ("Private", "Other"), ("Medicaid", "Other")]
}
n_bootstrap = 100
alpha = 0.05

# --- Results placeholder ---
table4_results = []

# Efficient specificity function
def safe_specificity(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    if cm.shape == (2, 2):
        tn, fp, _, _ = cm.ravel()
        denom = tn + fp
        return tn / denom if denom else 0.0
    return 0.0

# Safe recall
def safe_recall(y_true, y_pred):
    return recall_score(y_true, y_pred) if len(np.unique(y_true)) > 1 else 0.0

# Percentile-bootstrap estimate of the metric difference between two groups (resampling loop)
def bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, metric_fn):
    n1, n2 = len(y_true1), len(y_true2)
    if n1 == 0 or n2 == 0:
        return 0.0, False, False
    diffs = []
    for _ in range(n_bootstrap):
        idx1 = np.random.randint(0, n1, n1)
        idx2 = np.random.randint(0, n2, n2)
        try:
            m1 = metric_fn(y_true1[idx1], y_pred1[idx1])
            m2 = metric_fn(y_true2[idx2], y_pred2[idx2])
            diffs.append(m1 - m2)
        except Exception:  # skip resamples on which the metric cannot be computed
            continue
    if not diffs:
        return 0.0, False, False
    diffs = np.array(diffs)
    ci_low, ci_high = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    significant = (ci_low > 0) or (ci_high < 0)
    favors_group1 = diffs.mean() > 0
    return diffs.mean(), significant, favors_group1

# Main loop with tqdm
for fairness_attr, group_pairs in demographic_columns.items():
    for group1, group2 in group_pairs:
        recall_sig = parity_sig = spec_sig = 0
        recall_fav = parity_fav = spec_fav = 0

        # Precompute masks
        # Group 1: rows whose attribute equals group1 (including the literal "Other")
        mask1 = (merged_df[fairness_attr] == group1).values

        if group2 == "Other":
            # "Other" as Group 2 means every row NOT in Group 1 (unmapped/NaN rows included)
            mask2 = (merged_df[fairness_attr] != group1).values
        else:
            mask2 = (merged_df[fairness_attr] == group2).values

        for label_idx in tqdm(range(y_true.shape[1]), desc=f"{fairness_attr} {group1} vs {group2}"):
            y_true1, y_pred1 = y_true[mask1, label_idx], y_pred[mask1, label_idx]
            y_true2, y_pred2 = y_true[mask2, label_idx], y_pred[mask2, label_idx]

            if len(y_true1) < 1 or len(y_true2) < 1:
                continue

            # Bootstrapping
            _, sig_r, fav_r = bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, safe_recall)
            _, sig_p, fav_p = bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, accuracy_score)
            _, sig_s, fav_s = bootstrap_metric_diff(y_true1, y_pred1, y_true2, y_pred2, safe_specificity)

            recall_sig += sig_r
            parity_sig += sig_p
            spec_sig += sig_s

            recall_fav += sig_r and fav_r
            parity_fav += sig_p and fav_p
            spec_fav += sig_s and fav_s

        # Label for this comparison row ("Other vs Other" = rows labeled "Other" vs all remaining rows)
        comparison_name = f"{group1} vs {group2}"

        table4_results.append({
            "Fairness Definition": fairness_attr,
            "Comparison (Group 1 vs Group 2)": comparison_name,
            "Recall Gap": f"{recall_sig} ({recall_fav / max(recall_sig, 1) * 100:.1f}%)",
            "Parity Gap": f"{parity_sig} ({parity_fav / max(parity_sig, 1) * 100:.1f}%)",
            "Specificity Gap": f"{spec_sig} ({spec_fav / max(spec_sig, 1) * 100:.1f}%)"
        })

# Final output
# Map the demographic columns to their more readable labels
fairness_map = {
    "gender_mapped": "Gender",
    "language_mapped": "Language",
    "ethnicity_mapped": "Ethnicity",
    "insurance_mapped": "Insurance"
}

# Convert to DataFrame
table4_df_adv = pd.DataFrame(table4_results)

# Update the first column in the results to show the new labels
table4_df_adv['Fairness Definition'] = table4_df_adv['Fairness Definition'].map(fairness_map)

# Add a MultiIndex header above the Gap columns
table4_df_adv.columns = pd.MultiIndex.from_tuples([
    ("Fairness Definition", ""),
    ("Comparison (Group 1 vs Group 2)", ""),
    ("Significant Differences by Fairness Definition", "Recall Gap"),
    ("Significant Differences by Fairness Definition", "Parity Gap"),
    ("Significant Differences by Fairness Definition", "Specificity Gap")
])

# Style the DataFrame
styled_table = (
    table4_df_adv.style
    .set_table_styles(
        [   {'selector': 'th', 'props': [('font-weight', 'bold'), ('border', '1px solid black')]},
            {'selector': 'td', 'props': [('border', '1px solid black')]},
            {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
        ]
    )
    # Bold the "Fairness Definition" column (addressed by its MultiIndex column key)
    .set_properties(subset=[("Fairness Definition", "")], **{'font-weight': 'bold'})
    .set_properties(**{'text-align': 'center'})
    .set_caption("Final Fairness Table 4 (Styled)")
)

# Display in Jupyter/IPython
display(styled_table)

# Save as HTML (separate file name so the baseline table is not overwritten)
styled_table.to_html("final_fairness_table4_adv.html", escape=False, index=False)

gender_mapped Male vs Female: 100%|██████████| 25/25 [00:15<00:00,  1.66it/s]
language_mapped English vs Other: 100%|██████████| 25/25 [00:15<00:00,  1.66it/s]
ethnicity_mapped White vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.69it/s]
ethnicity_mapped Black vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.71it/s]
ethnicity_mapped Hispanic vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.72it/s]
ethnicity_mapped Asian vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.71it/s]
ethnicity_mapped Other vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.74it/s]
insurance_mapped Medicare vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.69it/s]
insurance_mapped Private vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.72it/s]
insurance_mapped Medicaid vs Other: 100%|██████████| 25/25 [00:14<00:00,  1.68it/s]


Significant Differences by Fairness Definition

| | Fairness Definition | Comparison (Group 1 vs Group 2) | Recall Gap | Parity Gap | Specificity Gap |
|---|---|---|---|---|---|
| 0 | Gender | Male vs Female | 1 (100.0%) | 16 (37.5%) | 2 (100.0%) |
| 1 | Language | English vs Other | 2 (50.0%) | 19 (15.8%) | 1 (100.0%) |
| 2 | Ethnicity | White vs Other | 0 (0.0%) | 17 (29.4%) | 2 (50.0%) |
| 3 | Ethnicity | Black vs Other | 0 (0.0%) | 19 (68.4%) | 2 (0.0%) |
| 4 | Ethnicity | Hispanic vs Other | 1 (100.0%) | 13 (84.6%) | 1 (100.0%) |
| 5 | Ethnicity | Asian vs Other | 1 (0.0%) | 13 (61.5%) | 2 (0.0%) |
| 6 | Ethnicity | Other vs Other | 1 (100.0%) | 17 (76.5%) | 2 (50.0%) |
| 7 | Insurance | Medicare vs Other | 2 (50.0%) | 17 (11.8%) | 2 (0.0%) |
| 8 | Insurance | Private vs Other | 1 (100.0%) | 18 (83.3%) | 2 (100.0%) |
| 9 | Insurance | Medicaid vs Other | 2 (50.0%) | 15 (80.0%) | 1 (0.0%) |
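
The counts in this table rest on a significance rule: a gap is counted when the 95% percentile-bootstrap interval of the between-group difference excludes zero. The following is a minimal, self-contained sketch of that rule on toy data. The helper name `bootstrap_gap`, the toy arrays, and the use of a seeded `np.random.default_rng` (instead of the global `np.random` state used in the notebook) are assumptions made for illustration only.

```python
import numpy as np


def bootstrap_gap(correct1, correct2, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(correct1) - mean(correct2).

    correct1 / correct2 are 0/1 arrays marking whether each sample in a group
    was predicted correctly (toy stand-in for a per-group metric).
    """
    rng = np.random.default_rng(seed)
    correct1 = np.asarray(correct1, dtype=float)
    correct2 = np.asarray(correct2, dtype=float)
    diffs = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample each group with replacement, keeping its original size
        s1 = rng.choice(correct1, size=correct1.size, replace=True)
        s2 = rng.choice(correct2, size=correct2.size, replace=True)
        diffs[b] = s1.mean() - s2.mean()
    ci_low, ci_high = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    significant = bool(ci_low > 0 or ci_high < 0)   # interval excludes zero
    favors_group1 = bool(diffs.mean() > 0)
    return ci_low, ci_high, significant, favors_group1


# Toy data: group 1 is right 90% of the time, group 2 only 50% of the time
group1 = np.array([1] * 90 + [0] * 10)
group2 = np.array([1] * 50 + [0] * 50)
_, _, sig_diff, fav_diff = bootstrap_gap(group1, group2)

# Two groups with identical performance should not show a significant gap
_, _, sig_same, _ = bootstrap_gap(group2, group2)
print(sig_diff, fav_diff, sig_same)
```

With only 100 resamples, as in the notebook, the interval endpoints are noisier, so counts near the significance boundary can change between runs unless the seed is fixed.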


# Extract Gaps for Male and Female

In [None]:
def extract_gender_gap_counts(df):
    # Filter for Gender: Male vs Female only
    df_gender = df[
        (df[("Fairness Definition", "")] == "Gender") &
        (df[("Comparison (Group 1 vs Group 2)", "")] == "Male vs Female")
    ]

    # Extract the (count, % favoring male) from each cell
    def parse(val):
        count, percent = val.split()
        return int(count), float(percent.strip("()%"))

    results = {}
    for metric in ["Parity Gap", "Recall Gap", "Specificity Gap"]:
        count, favoring_male = parse(df_gender[("Significant Differences by Fairness Definition", metric)].values[0])
        results[metric] = (count, favoring_male)
    return results

# Get Results for Baseline and Debiased Models (Gender Only)

In [None]:
baseline_gender_results = extract_gender_gap_counts(table4_df)
debiased_gender_results = extract_gender_gap_counts(table4_df_adv)

# Build Table 5

In [None]:
import pandas as pd

table5 = pd.DataFrame({
    "Model": ["Baseline", "Debiased"],
    "Recall Gap": [
        f"{baseline_gender_results['Recall Gap'][0]} ({baseline_gender_results['Recall Gap'][1]:.1f}%)",
        f"{debiased_gender_results['Recall Gap'][0]} ({debiased_gender_results['Recall Gap'][1]:.1f}%)"
    ],
    "Parity Gap": [
        f"{baseline_gender_results['Parity Gap'][0]} ({baseline_gender_results['Parity Gap'][1]:.1f}%)",
        f"{debiased_gender_results['Parity Gap'][0]} ({debiased_gender_results['Parity Gap'][1]:.1f}%)"
    ],
    "Specificity Gap": [
        f"{baseline_gender_results['Specificity Gap'][0]} ({baseline_gender_results['Specificity Gap'][1]:.1f}%)",
        f"{debiased_gender_results['Specificity Gap'][0]} ({debiased_gender_results['Specificity Gap'][1]:.1f}%)"
    ]
})

# Style the Table 5 DataFrame
styled_table5 = (
    table5.style
    .set_table_styles(
        [   {'selector': 'th', 'props': [('font-weight', 'bold'), ('border', '1px solid black')]},
            {'selector': 'td', 'props': [('border', '1px solid black')]},
            {'selector': 'table', 'props': [('border-collapse', 'collapse')]},
        ]
    )
    .set_properties(**{'text-align': 'center'})
    .set_caption("Table 5: Comparison of classifiers based on our original clinical BERT and the gender-debiased clinical BERT on 57 tasks (25 in our case)")
)

# Display it in Jupyter/IPython
display(styled_table5)

# Optionally save as HTML
styled_table5.to_html("final_fairness_table5.html", escape=False, index=False)


| | Model | Recall Gap | Parity Gap | Specificity Gap |
|---|---|---|---|---|
| 0 | Baseline | 3 (66.7%) | 15 (53.3%) | 2 (0.0%) |
| 1 | Debiased | 1 (100.0%) | 16 (37.5%) | 2 (100.0%) |


The tables above summarize performance disparities across demographic subgroups (gender, language, ethnicity, insurance) for the clinical prediction tasks. For each group comparison, they report how often there was a statistically significant gap in model performance, measured by:

- Recall Gap (True Positive Rate)
- Parity Gap (Accuracy)
- Specificity Gap (True Negative Rate)
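
As a small worked example of these three quantities, the sketch below computes recall, accuracy (the notebook's proxy for the parity gap), and specificity for two toy groups and takes the Group 1 minus Group 2 difference. The helper `group_rates` and the toy labels are hypothetical and only illustrate the definitions; they are not the notebook's evaluation code.

```python
import numpy as np


def group_rates(y_true, y_pred):
    """Return recall (TPR), specificity (TNR) and accuracy for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }


# Toy predictions for one task, split by a demographic attribute
g1 = group_rates(y_true=[1, 1, 0, 0], y_pred=[1, 0, 0, 0])
g2 = group_rates(y_true=[1, 1, 0, 0], y_pred=[1, 1, 1, 0])

# A positive gap means the metric is higher for Group 1
gaps = {name: g1[name] - g2[name] for name in g1}
print(gaps)  # {'recall': -0.5, 'specificity': 0.5, 'accuracy': 0.0}
```

In this toy case the two groups have the same accuracy but opposite recall and specificity gaps, which is why the three columns of the tables can disagree for the same comparison.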

Example from the Baseline model evaluation: a Recall Gap of 3 (66.7%) shows:

- the number of tasks where a significant gap was found (3), and
- the percentage of those tasks where the gap favored the first group, e.g., Male, English, or White (66.7%).
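
A cell in this format can also be unpacked programmatically. The sketch below, with a hypothetical helper `parse_gap_cell`, splits a cell into the count of significant tasks and the percentage favoring Group 1, then recovers the approximate number of tasks behind that percentage by rounding.

```python
def parse_gap_cell(cell):
    """Parse a cell such as '3 (66.7%)' into (count, percent, tasks_favoring_group1)."""
    count_str, percent_str = cell.split()
    count = int(count_str)
    percent = float(percent_str.strip("()%"))
    # The percentage was rounded to one decimal, so round back to a whole task count
    tasks_favoring_group1 = round(count * percent / 100)
    return count, percent, tasks_favoring_group1


print(parse_gap_cell("3 (66.7%)"))   # (3, 66.7, 2)
print(parse_gap_cell("15 (53.3%)"))  # (15, 53.3, 8)
```

Read this way, the baseline gender Recall Gap of 3 (66.7%) means three tasks showed a significant gap, two of which favored Male.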

