
# Anxiety Notes Classifier — Databricks + MLflow Scaffold

**Goal:** Build and compare simple, interpretable text models to flag anxiety-related clinical notes; log to MLflow; pick a winner; register & batch score; produce artifacts for slides.

> Replace the data-loading placeholders with your actual Delta table or CSV paths. Run cells top-to-bottom.


In [0]:

# Databricks setup (run in a single cell at the top of the notebook)
# If you are on Databricks, uncomment the %pip lines and run once per cluster restart.

#%pip install mlflow==2.* scikit-learn==1.* pandas numpy matplotlib scipy
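#%pip install spacy scispacy
# Note: `python -m spacy download en_core_sci_sm` fails (no compatible package found);
# SciSpaCy models are installed from their release URL instead (see section 5).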

#%pip install transformers==4.* torch --extra-index-url https://download.pytorch.org/whl/cpu

import os
import mlflow
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (precision_recall_curve, average_precision_score,
                             roc_auc_score, confusion_matrix, classification_report)

mlflow.set_experiment("/Users/modi.boutrs@wellforce.org/anxiety_notes_experiment")  # TODO: set to your path
print("MLflow experiment set.")


Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.

✘ No compatible package found for 'en_core_sci_sm' (spaCy v3.7.5)

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
MLflow experiment set.


In [0]:
%restart_python

## 1) Load & prepare data

In [0]:
%run "/Users/modi.boutrs@wellforce.org/mimic_mod/io"



In [0]:
%run "/Users/modi.boutrs@wellforce.org/mimic_mod/prep"


In [0]:
%run "/Users/modi.boutrs@wellforce.org/mimic_mod/nlp"


In [0]:
%run "/Users/modi.boutrs@wellforce.org/mimic_mod/train"

In [0]:
# ---- paths & tables (adjust to your volume & tables) ----
notes_tbl = "default.df_anxiety_notes"   # clinical notes table
icd_tbl   = "default.diagnoses_icd"
icd_tbl_d = "default.d_icd_diagnoses"

df_notes = load_table(notes_tbl)
df_icd   = load_table(icd_tbl)
df_icd = df_icd.join(load_table(icd_tbl_d), "icd_code", "left")

# clean/prepare
df_notes = filter_valid_text(df_notes, "text")
df_notes = clean_text(df_notes, "text", "text_clean")

df_labels = anxiety_label_from_icd(df_icd, "long_title")
df_join   = join_notes_labels(df_notes, df_labels, "left")

display(df_join.select("hadm_id","label_anxiety","text_clean").limit(1000))
pdf = df_join.select("text_clean", "label_anxiety").toPandas()
y = pdf["label_anxiety"].astype(int).values
texts = pdf["text_clean"].tolist()

### Cell output cleared to keep patient data out of the public GitHub repo. For the version with output, see the notes attached to the assignment.

In [0]:

X_train_txt, X_test_txt, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=42, stratify=y)
len(X_train_txt), len(X_test_txt), sum(y), len(y)


(12118, 3030, 7628, 15148)
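
The tuple printed above is (train size, test size, positive count, total count). The short sketch below redoes the arithmetic from those four numbers only. The positive rate is worth keeping in mind when reading the PR-AUC values later: a classifier that scores at random has an expected average precision close to the positive rate, so the floor here is about 0.50 rather than 0.

```python
# Numbers copied from the cell output above.
n_train, n_test, n_pos, n_total = 12118, 3030, 7628, 15148

# The two partitions should account for every note.
assert n_train + n_test == n_total

test_fraction = n_test / n_total   # share of notes held out for evaluation
prevalence = n_pos / n_total       # share of notes labeled anxiety-positive

print(f"test fraction: {test_fraction:.3f}")
print(f"positive rate: {prevalence:.3f}")
```

Because the split is stratified, the train and test partitions should each have nearly the same positive rate as the full set.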

## 2) MLflow helpers — log artifacts & metrics

In [0]:

def log_eval_artifacts(y_true, y_prob, run_params=None, threshold=None):
    import mlflow, numpy as np, matplotlib.pyplot as plt
    from sklearn.metrics import (
        precision_recall_curve,
        confusion_matrix,
        classification_report,
        average_precision_score,
        roc_auc_score,
    )
    # Persist run params
    if run_params:
        for k, v in run_params.items():
            mlflow.log_param(k, v)
    # Threshold selection: lowest threshold whose precision is >= 0.80 (if not provided)
    if threshold is None:
        p, r, t = precision_recall_curve(y_true, y_prob)
        # p has one more entry than t (the final point has no threshold), so drop it
        idx = np.where(p[:-1] >= 0.80)[0]
        threshold = float(t[idx[0]]) if len(idx) else 0.5
    mlflow.log_param("decision_threshold", threshold)
    # Predictions
    y_pred = (y_prob >= threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred)

    # Confusion matrix plot
    plt.figure()
    plt.imshow(cm, interpolation='nearest')
    plt.title('Confusion Matrix'); plt.colorbar()
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], ha="center", va="center")
    plt.xlabel('Predicted'); plt.ylabel('Actual'); plt.tight_layout()
    mlflow.log_figure(plt.gcf(), "confusion_matrix.png")
    plt.close()

    # PR curve
    p, r, _ = precision_recall_curve(y_true, y_prob)
    plt.figure()
    plt.plot(r, p)
    plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title("Precision-Recall Curve")
    plt.tight_layout()
    mlflow.log_figure(plt.gcf(), "pr_curve.png")
    plt.close()

    # Scalar metrics
    ap = average_precision_score(y_true, y_prob)
    auroc = roc_auc_score(y_true, y_prob)
    mlflow.log_metric("pr_auc", float(ap))
    mlflow.log_metric("auroc", float(auroc))

    # Classification report (at threshold)
    rep = classification_report(y_true, y_pred, digits=3)
    with open("classification_report.txt", "w") as f:
        f.write(rep)
    mlflow.log_artifact("classification_report.txt")

    return {"threshold": threshold, "pr_auc": ap, "auroc": auroc}
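
The threshold rule inside `log_eval_artifacts` can be hard to see through the `precision_recall_curve` arrays, so here is a minimal pure-Python sketch of the same idea on toy data. The function names `precision_recall_at` and `pick_threshold` are hypothetical and exist only for illustration; the helper above uses scikit-learn instead. The rule is: scan candidate thresholds from low to high and take the first one whose precision reaches the target, falling back to 0.5 if none does.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(y_true, y_prob, target_precision=0.80, default=0.5):
    """Lowest observed score that reaches the target precision, else the default."""
    for t in sorted(set(y_prob)):
        precision, _ = precision_recall_at(y_true, y_prob, t)
        if precision >= target_precision:
            return t
    return default


# Toy labels and scores (illustrative, not from the notes data).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

thr = pick_threshold(y_true, y_prob)
print("threshold:", thr)
print("precision, recall:", precision_recall_at(y_true, y_prob, thr))
```

Recall can only fall as the threshold rises, so the lowest threshold that meets the precision target is also the one that keeps the most recall among the thresholds that qualify.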


## 3) Baseline — TF‑IDF ➜ Logistic Regression

In [0]:

from sklearn.pipeline import Pipeline

with mlflow.start_run(run_name="tfidf_logreg"):
    vec = TfidfVectorizer(ngram_range=(1,2), max_features=30000, min_df=2)
    clf = LogisticRegression(max_iter=2000, n_jobs=-1)
    pipe = Pipeline([("tfidf", vec), ("clf", clf)])
    pipe.fit(X_train_txt, y_train)
    
    # Use predict_proba for probabilities
    y_prob = pipe.predict_proba(X_test_txt)[:,1]
    
    results = log_eval_artifacts(y_test, y_prob, run_params={
        "feature_set":"tfidf_bigrams",
        "estimator":"logreg",
        "max_features":30000,
        "min_df":2
    })
    
    # Log model
    mlflow.sklearn.log_model(pipe, "model")
    print(results)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-8540165991169680>, line 3
      1 from sklearn.pipeline import Pipeline
----> 3 with mlflow.start_run(run_name="tfidf_logreg"):
      4     vec = TfidfVectorizer(ngram_range=(1,2), max_features=30000, min_df=2)
      5     clf = LogisticRegression(max_iter=2000, n_jobs=-1)

NameError: name 'mlflow' is not defined

## 4) Stronger linear — TF‑IDF ➜ Linear SVM 

In [0]:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV


In [0]:

with mlflow.start_run(run_name="tfidf_linear_svm_calibrated"):
    vec = TfidfVectorizer(ngram_range=(1,2), max_features=40000, min_df=2)
    svm = LinearSVC()  # no predict_proba; we'll calibrate
    calibrated = CalibratedClassifierCV(svm, cv=5, method="sigmoid")
    
    pipe = Pipeline([("tfidf", vec), ("svm", calibrated)])
    pipe.fit(X_train_txt, y_train)
    
    y_prob = pipe.predict_proba(X_test_txt)[:,1]
    results = log_eval_artifacts(y_test, y_prob, run_params={
        "feature_set":"tfidf_bigrams",
        "estimator":"linear_svm_calibrated",
        "max_features":40000,
        "min_df":2
    })
    mlflow.sklearn.log_model(pipe, "model")
    print(results)







{'threshold': 0.6116622927609477, 'pr_auc': 0.814517743423736, 'auroc': 0.8232053972281868}
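
`LinearSVC` only produces signed margins, which is why the run above wraps it in `CalibratedClassifierCV(..., method="sigmoid")`. The sketch below illustrates the sigmoid (Platt-style) idea in pure Python: fit two numbers, `a` and `b`, so that `1 / (1 + exp(a * margin + b))` behaves like a probability. The toy margins, the labels, and the plain gradient-descent fit are assumptions for illustration; scikit-learn's own fitting procedure may differ in its details.

```python
import math


def platt_prob(margin, a, b):
    """Map a raw SVM margin to a probability with a fitted sigmoid."""
    return 1.0 / (1.0 + math.exp(a * margin + b))


def fit_platt(margins, labels, lr=0.1, steps=2000):
    """Fit a and b by gradient descent on the mean log loss (toy optimizer)."""
    a, b = 0.0, 0.0
    n = len(margins)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for f, y in zip(margins, labels):
            p = platt_prob(f, a, b)
            # With z = a*f + b and p = 1/(1+exp(z)), d(log loss)/dz = y - p
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy margins and labels (illustrative); deliberately not perfectly separable.
margins = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(margins, labels)
probs = [platt_prob(f, a, b) for f in margins]
for f, p in zip(margins, probs):
    print(f"margin {f:+.1f} -> probability {p:.3f}")
```

The fitted `a` comes out negative, so larger margins map to higher probabilities. Those calibrated probabilities are what the precision-targeted threshold in `log_eval_artifacts` operates on.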


## 5) Add light clinical features via SciSpaCy

In [0]:
# Run this once in a separate cell
%pip install scispacy
%pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_craft_md-0.5.4.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_jnlpba_md-0.5.4.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bionlp13cg_md-0.5.4.tar.gz
#pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_md-0.5.4.tar.gz (119.1 MB)

In [0]:

# Build extra features and concatenate with TF-IDF.
# For simplicity in this scaffold, we'll compute a few counts and append them to the TF-IDF matrix.
import numpy as np
import spacy
try:
    nlp = spacy.load("en_core_sci_md")
except Exception:
    nlp = None
    print("SciSpaCy model not loaded. Install en_core_sci_md and re-run if you want these features.")

ANXIETY_TERMS = {"anxiety", "panic", "anxious", "worry", "rumination"}
SYMPTOMS = {"tachycardia","palpitations","dyspnea","sweating","dizziness","tremor"}

def featurize_clinical_batch(texts):
    if nlp is None:
        return np.zeros((len(texts), 4), dtype=float)
    feats = []
    for t in texts:
        doc = nlp(t)
        toks = {tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop}
        ents = " ".join([e.text.lower() for e in doc.ents])
        feats.append([
            float(len(ANXIETY_TERMS & toks) > 0),
            float(sum(1 for w in ANXIETY_TERMS if w in toks)),
            float(sum(1 for s in SYMPTOMS if s in ents)),
            float(sum(1 for tok in doc if tok.dep_ == "neg"))
        ])
    return np.array(feats, dtype=float)

from scipy.sparse import hstack

with mlflow.start_run(run_name="tfidf_logreg_scispacy_feats"):
    vec = TfidfVectorizer(ngram_range=(1,2), max_features=30000, min_df=2)
    Xtr = vec.fit_transform(X_train_txt)
    Xte = vec.transform(X_test_txt)
    
    Ftr = featurize_clinical_batch(X_train_txt)
    Fte = featurize_clinical_batch(X_test_txt)
    
    Xtr_aug = hstack([Xtr, Ftr])
    Xte_aug = hstack([Xte, Fte])
    
    clf = LogisticRegression(max_iter=2000, n_jobs=-1)
    clf.fit(Xtr_aug, y_train)
    y_prob = clf.predict_proba(Xte_aug)[:,1]
    
    results = log_eval_artifacts(y_test, y_prob, run_params={
        "feature_set":"tfidf_bigrams+scispacy4",
        "estimator":"logreg",
        "max_features":30000,
        "min_df":2
    })
    
    # Log vectorizer & model together as a pyfunc (minimal)
    import mlflow.pyfunc
    class TfidfLogRegWithClin(mlflow.pyfunc.PythonModel):
        def load_context(self, context):
            import pickle, json, numpy as np
            with open(context.artifacts["vec"], "rb") as f:
                self.vec = pickle.load(f)
            with open(context.artifacts["clf"], "rb") as f:
                self.clf = pickle.load(f)
        def predict(self, context, model_input):
            texts = model_input["text"].astype(str).tolist()
            X = self.vec.transform(texts)
            F = featurize_clinical_batch(texts)
            from scipy.sparse import hstack
            X_aug = hstack([X, F])
            return self.clf.predict_proba(X_aug)[:,1]
    
    import pickle, tempfile
    with open("vec.pkl","wb") as f: pickle.dump(vec, f)
    with open("clf.pkl","wb") as f: pickle.dump(clf, f)
    
    mlflow.pyfunc.log_model(
        "model",
        python_model=TfidfLogRegWithClin(),
        artifacts={"vec":"vec.pkl", "clf":"clf.pkl"}
    )
    print(results)



{'threshold': 0.5900916627222337, 'pr_auc': 0.8346019203323355, 'auroc': 0.8452584283762303}
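
To make the four appended columns concrete, here is a simplified sketch that computes the same kind of features without SciSpaCy. It is a stand-in, not the featurizer used in the run above: it tokenizes with a regular expression instead of lemmatizing, matches symptoms against tokens rather than entity text, and counts words from a hypothetical `NEGATION_CUES` set instead of `neg` dependency labels. The function name `featurize_simple` is also an assumption.

```python
import re

ANXIETY_TERMS = {"anxiety", "panic", "anxious", "worry", "rumination"}
SYMPTOMS = {"tachycardia", "palpitations", "dyspnea", "sweating", "dizziness", "tremor"}
# Hypothetical cue list standing in for the parser's "neg" dependency label.
NEGATION_CUES = {"no", "not", "denies", "without"}


def featurize_simple(text):
    """Return [any anxiety term, distinct anxiety terms, distinct symptoms, negation cues]."""
    toks = re.findall(r"[a-z]+", text.lower())
    tokset = set(toks)
    hits = ANXIETY_TERMS & tokset
    return [
        float(bool(hits)),                                 # 1.0 if any anxiety term appears
        float(len(hits)),                                  # number of distinct anxiety terms
        float(sum(1 for s in SYMPTOMS if s in tokset)),    # number of distinct symptoms
        float(sum(1 for t in toks if t in NEGATION_CUES)), # count of negation cue words
    ]


example = "Patient anxious with palpitations and tremor; denies panic."
print(featurize_simple(example))
```

Note that "denies panic" still counts `panic` as an anxiety term. The negation count is a separate column, so the model has to learn any interaction between the two on its own.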


## 6) BERT CLS embedding ➜ Linear head (Bio_ClinicalBERT)

In [0]:

# This block extracts CLS embeddings and trains a simple linear classifier.
# On small clusters, consider DistilBERT to keep runtime reasonable.

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # or "distilbert-base-uncased" for speed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME)

def cls_embed_batch(texts, max_length=256):
    bert.eval()
    embs = []
    with torch.no_grad():
        for t in texts:
            inputs = tokenizer(t, return_tensors="pt", truncation=True, max_length=max_length)
            outputs = bert(**inputs)
            cls_vec = outputs.last_hidden_state[:,0,:].squeeze(0).cpu().numpy()
            embs.append(cls_vec)
    return np.vstack(embs)

with mlflow.start_run(run_name="bert_cls_linear_head"):
    Xtr = cls_embed_batch(X_train_txt[:500])  # cap for demo speed; adjust/remove for full data
    Xte = cls_embed_batch(X_test_txt[:200])
    ytr = y_train[:len(Xtr)]
    yte = y_test[:len(Xte)]
    
    head = LogisticRegression(max_iter=2000, n_jobs=-1)
    head.fit(Xtr, ytr)
    y_prob = head.predict_proba(Xte)[:,1]
    
    results = log_eval_artifacts(yte, y_prob, run_params={
        "feature_set":"bert_cls",
        "estimator":"logreg_head",
        "bert_model":MODEL_NAME,
        "max_length":256
    })
    import pickle
    with open("bert_head.pkl","wb") as f: pickle.dump(head, f)
    mlflow.log_artifact("bert_head.pkl")
    print(results)



{'threshold': 0.9392115684555318, 'pr_auc': 0.5878154218827862, 'auroc': 0.5740350877192982}


## 7) Register the best model in MLflow Model Registry

In [0]:

# After deciding which run is best (by PR-AUC and recall at the target precision), register it.
# Replace the run ID in model_uri with the winning run's ID from the MLflow UI.
mlflow.register_model(
    model_uri="runs:/0c3d8ac4b4b8441bae2a154c23301373/model",
    name="anxiety_notes_classifier"
)
print("Registered the selected run under 'anxiety_notes_classifier'.")


Registered model 'anxiety_notes_classifier' already exists. Creating a new version of this model...
2025/10/26 12:03:13 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: anxiety_notes_classifier, version 2


Registered the selected run under 'anxiety_notes_classifier'.


Created version '2' of model 'anxiety_notes_classifier'.
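
The "pick a winner" step can be sketched as a small comparison over the metrics printed in the sections above. The values below are copied (rounded) from those outputs; the `n_test` field and the rule of comparing only runs scored on the full test split are assumptions added here, since the BERT run was evaluated on 200 notes and is not directly comparable. The baseline logistic regression run is omitted because its cell output above shows an error rather than metrics.

```python
# Metrics copied from the printed results of each run (rounded to 4 places).
runs = [
    {"run_name": "tfidf_linear_svm_calibrated", "pr_auc": 0.8145, "auroc": 0.8232, "n_test": 3030},
    {"run_name": "tfidf_logreg_scispacy_feats", "pr_auc": 0.8346, "auroc": 0.8453, "n_test": 3030},
    {"run_name": "bert_cls_linear_head", "pr_auc": 0.5878, "auroc": 0.5740, "n_test": 200},
]

# Only compare runs evaluated on the same (full) test split.
full_test_size = max(r["n_test"] for r in runs)
comparable = [r for r in runs if r["n_test"] == full_test_size]

# Rank by PR-AUC, the primary selection metric named in the registration cell.
best = max(comparable, key=lambda r: r["pr_auc"])
print(f"best run: {best['run_name']} (pr_auc={best['pr_auc']:.4f}, auroc={best['auroc']:.4f})")
```

In practice the same comparison can be done by sorting runs on `pr_auc` in the MLflow UI and copying the chosen run ID into `model_uri`.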


## 8) Batch score new notes to a Delta table (extra)

In [0]:

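# Example for Databricks (uncomment and adapt table names before running).
# Assumes the registered version is the pyfunc model from section 5, which reads a "text"
# column and returns probabilities, and that a version has been moved to the "Production" stage.
# import mlflow
# import pandas as pd
# model_uri = "models:/anxiety_notes_classifier/Production"
# loaded = mlflow.pyfunc.load_model(model_uri)
# df_new = spark.table("schema.notes_to_score").toPandas()
# scores = loaded.predict(df_new[["text"]])
# out = pd.DataFrame({"note_id": df_new["note_id"], "anxiety_prob": scores})
# out["ts"] = pd.Timestamp.now()  # assumed scoring timestamp; the appendix queries group by ts
# spark.createDataFrame(out).write.format("delta").mode("overwrite").saveAsTable("analytics.anxiety_predictions")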
print("Scoring pseudocode ready — customize table names and run in Databricks.")


Scoring pseudocode ready — customize table names and run in Databricks.


(Bonus learning after completing the step above)
## Appendix: SQL dashboard snippets 
- **Daily volume & % flagged**
```sql
SELECT date_trunc('day', ts) AS day,
       COUNT(*) AS notes,
       AVG(CASE WHEN anxiety_prob >= 0.5 THEN 1 ELSE 0 END) AS pct_flagged
FROM analytics.anxiety_predictions
GROUP BY 1 ORDER BY 1;
```
- **Top services by % flagged**
```sql
SELECT service_line,
       COUNT(*) AS notes,
       AVG(CASE WHEN anxiety_prob >= 0.5 THEN 1 ELSE 0 END) AS pct_flagged
FROM analytics.anxiety_predictions ap
JOIN clinical.encounters e USING (note_id)
GROUP BY 1 ORDER BY pct_flagged DESC;
```


In [0]:
%sql
SELECT date_trunc('day', ts) AS day,
       COUNT(*) AS notes,
       AVG(CASE WHEN anxiety_prob >= 0.5 THEN 1 ELSE 0 END) AS pct_flagged
FROM analytics.anxiety_predictions
GROUP BY 1 ORDER BY 1;