<a href="https://colab.research.google.com/github/yzm9393/swineBRET-ICD/blob/main/04_Pseudo_Labeling_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pseudo-labeling
1. Load Data: The script begins by loading the large, pre-processed training dataset of ~38,400 records.

2. Isolate Unlabeled Records: It then filters out the 2,000 records that were already annotated by an expert (those with a value in `expert_labels_vector`), so that it only works with the data that still needs labels.

3. Apply Teacher Models: The script loops through each of the 11 saved "teacher" models. In each iteration, it loads one specialist model (e.g., the "respiratory" model) and its corresponding text vectorizer, then predicts the probability of that single label for every unlabeled record (35,058 in the run recorded below) and stores the scores in a new column.

4. Save the Final Dataset: After the loop completes, the script saves a new CSV file containing the record ID, the anonymized text, and 11 new "pseudo-label" columns, each filled with the probability scores from one teacher model. This becomes the final, enriched dataset for training the SwineBERT model.
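
Step 3 only works if each teacher's files can be found on disk. The code below derives the filenames by stripping every non-alphanumeric character (and underscores) from the label name. The following is a minimal sketch of that mapping; the helper name `teacher_artifact_paths` and the `teacher_models` directory are hypothetical and used only for illustration, while the regular expression and filename patterns mirror the ones in the script.

```python
import os
import re
from typing import Tuple


def teacher_artifact_paths(label_name: str, model_dir: str) -> Tuple[str, str, str]:
    """Return (model path, vectorizer path, output column) for one label.

    Hypothetical helper: it mirrors the sanitization used in the loop,
    which removes every character that is not a letter or digit.
    """
    safe = re.sub(r"[\W_]+", "", label_name)
    return (
        os.path.join(model_dir, f"model_{safe}.joblib"),
        os.path.join(model_dir, f"vectorizer_{safe}.joblib"),
        f"pseudo_label_{safe}",
    )


model_path, vectorizer_path, column = teacher_artifact_paths(
    "[12] Diseases of the respiratory system",
    "teacher_models",  # placeholder directory, not the real project path
)
print(column)  # pseudo_label_12Diseasesoftherespiratorysystem
```

Because the sanitized name drops brackets, spaces, and punctuation, two labels that differ only in those characters would collide on the same filename. The 11 labels used here do not.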

In [None]:
import pandas as pd
import numpy as np
import joblib
import os
import re
from tqdm import tqdm

# --- 1. Configuration ---
# The file with ALL ~38,400 pre-processed/anonymized training records
PROCESSED_TRAINING_FILE = '/Users/zimoyang/Documents/swine_project/data/training_set_processed.csv'

# The directory where you saved your teacher models
MODEL_INPUT_DIR = '/Users/zimoyang/Documents/swine_project/teacher_models_calibrated'
# The final output file this script will create
OUTPUT_PSEUDO_LABELED_FILE = '/Users/zimoyang/Documents/swine_project/data/pseudo_label.csv'

ID_COLUMN = 'U_SUBMISSIONID'
PROCESSED_TEXT_COLUMN = 'anonymized_text_for_ml'

# The final list of modeling labels you trained teacher models for
FINAL_MODELING_LABELS = [
    '[01] Certain infectious or parasitic diseases',
    '[08] Diseases of the nervous system',
    '[12] Diseases of the respiratory system',
    '[13] Diseases of the digestive system',
    '[14] Diseases of the skin',
    '[15] Diseases of the musculoskeletal system or connective tissue',
    '[18] Pregnancy, childbirth or the puerperium',
    '[19] Certain conditions originating in the perinatal period',
    'Monitoring',
    'Unknown',
    'Symptoms not classified elsewhere'
]

# --- 2. Load Data ---

df_master = pd.read_csv(PROCESSED_TRAINING_FILE)

# --- 3. Isolate Records for Pseudo-Labeling ---
# We will only predict on records that were NOT part of the gold-standard set
df_to_label = df_master[df_master['expert_labels_vector'].isna()].copy()
print(f"Found {len(df_to_label)} records to pseudo-label.")


# --- 4. Loop Through Teacher Models and Predict Probabilities ---
print("\n--- Applying Teacher Models to Generate Pseudo-Labels ---")

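# Replace missing text with an empty string so NaN does not become the literal token 'nan'
text_to_predict = df_to_label[PROCESSED_TEXT_COLUMN].fillna('').astype(str)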

for label_name in tqdm(FINAL_MODELING_LABELS, desc="Pseudo-labeling progress"):
    # Sanitize label name to match saved filenames
    safe_label_name = re.sub(r'[\W_]+', '', label_name)
    model_path = os.path.join(MODEL_INPUT_DIR, f'model_{safe_label_name}.joblib')
    vectorizer_path = os.path.join(MODEL_INPUT_DIR, f'vectorizer_{safe_label_name}.joblib')

    pseudo_label_col = f'pseudo_label_{safe_label_name}'

    if os.path.exists(model_path) and os.path.exists(vectorizer_path):
        # Load the saved model and its corresponding vectorizer
        model = joblib.load(model_path)
        vectorizer = joblib.load(vectorizer_path)

        # Transform the EXISTING anonymized text using the loaded vectorizer
        X_tfidf = vectorizer.transform(text_to_predict)

        # Predict probabilities for the positive class (class 1)
        predicted_probabilities = model.predict_proba(X_tfidf)[:, 1]

        # Store the probabilities in a new column
        df_to_label[pseudo_label_col] = predicted_probabilities
    else:
        print(f"Warning: Model or vectorizer for '{label_name}' not found. Filling with 0.0 probability.")
        df_to_label[pseudo_label_col] = 0.0

# --- 5. Save the Pseudo-Labeled Dataset ---
print("\n--- Saving the Final Pseudo-Labeled Dataset ---")

# We only need the ID, text, and the new pseudo-labels for the next step
columns_to_save = [ID_COLUMN, PROCESSED_TEXT_COLUMN] + [col for col in df_to_label.columns if col.startswith('pseudo_label_')]
df_to_label[columns_to_save].to_csv(OUTPUT_PSEUDO_LABELED_FILE, index=False)

print(f"Successfully created '{OUTPUT_PSEUDO_LABELED_FILE}'.")

Found 35058 records to pseudo-label.

--- Applying Teacher Models to Generate Pseudo-Labels ---


Pseudo-labeling progress: 100%|█████████████████| 11/11 [00:03<00:00,  2.92it/s]



--- Saving the Final Pseudo-Labeled Dataset ---
Successfully created '/Users/zimoyang/Documents/swine_project/data/pseudo_label.csv'.
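
A quick sanity check on the saved file can catch the fallback case in the loop above, where a missing model or vectorizer leaves a column filled with 0.0. The sketch below is not part of the original notebook. The helper name `summarize_pseudo_labels` and the toy values are assumptions for illustration, and in practice the DataFrame would come from `pd.read_csv(OUTPUT_PSEUDO_LABELED_FILE)`.

```python
import pandas as pd


def summarize_pseudo_labels(df: pd.DataFrame, prefix: str = "pseudo_label_") -> pd.DataFrame:
    """Summarize each pseudo-label column and flag suspicious ones.

    Hypothetical helper: reports min, max, and mean per column, whether the
    column is entirely 0.0 (the script's fallback value), and whether all
    values are valid probabilities in [0, 1].
    """
    cols = [c for c in df.columns if c.startswith(prefix)]
    summary = df[cols].agg(["min", "max", "mean"]).T
    summary["all_zero"] = (df[cols] == 0.0).all()
    summary["in_unit_interval"] = df[cols].apply(lambda s: s.between(0.0, 1.0).all())
    return summary


# Toy stand-in for the saved CSV; the values are made up for illustration.
toy = pd.DataFrame({
    "U_SUBMISSIONID": [1, 2, 3],
    "anonymized_text_for_ml": ["a", "b", "c"],
    "pseudo_label_Monitoring": [0.10, 0.85, 0.40],
    "pseudo_label_Unknown": [0.0, 0.0, 0.0],  # simulates the missing-model fallback
})
report = summarize_pseudo_labels(toy)
print(report)
```

A column flagged `all_zero` most likely means the corresponding teacher files were not found, although a model that genuinely predicts 0.0 everywhere would look the same.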
