# 03 - Feature Engineering

## 📌 Purpose

This notebook generates the baseline feature datasets for the thesis pipeline:
- **Original structured features** (scaled).
- **Word2Vec embeddings** (Radiology, Discharge, Combined).
- **Merged datasets** (scaled structured features + scaled embeddings).

The Word2Vec feature engineering proceeds in four steps:
- Train or load Word2Vec models on the Radiology, Discharge, and Combined notes.
- Generate subject-level averaged embeddings.
- Standardize the embeddings.
- Merge the scaled embeddings with the scaled structured features.

All final outputs are saved into `data/processed/` for downstream modeling.


## 0. Imports

In [1]:
import pandas as pd
from src.data_prep import split_data
from src.features import (
    scale_features,
    train_word2vec,
    get_w2v_params,
    load_word2vec,
    apply_embeddings_to_subjects,
    scale_w2v_embeddings,
    merge_embeddings_with_features,
    save_feature_dataset,
    validate_saved_datasets
)
from src.utils import resolve_path
import os

## 1. Load NLP-ready dataset

We start from `nlp_ready_df`, generated in **02_data_preprocessing.ipynb**.  

This step loads the preprocessed dataset containing both
structured EHR variables and the concatenated note text columns for
radiology, discharge summaries, and combined notes.


In [2]:
nlp_ready_path = resolve_path("data/interim/data_nlp_ready.csv")
nlp_ready_df = pd.read_csv(nlp_ready_path)

print(f"✅ Loaded NLP-ready dataset: {nlp_ready_df.shape}")
print(f"Columns: {nlp_ready_df.columns.tolist()[:10]} ...")

✅ Loaded NLP-ready dataset: (5208, 51)
Columns: ['subject_id', 'hospital_expire_flag', 'max_age', 'los_icu', 'first_hosp_stay', 'suspected_infection', 'sofa_score', 'sepsis3', 'avg_urineoutput', 'glucose_min'] ...


## 2. Drop note text columns for structured modeling

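Drop the raw text columns (`Radiology_notes`, `Discharge_summary_notes`, and `combined_notes`)
to isolate the structured numeric and categorical features.
These form the base input for the tabular (non-NLP) models.

Then separate the target (`hospital_expire_flag`) and drop three further columns
(`first_hosp_stay`, `suspected_infection`, `sepsis3`) that would lead to target leakage.
`subject_id` is kept for now because later steps use it to align the embeddings.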

In [3]:
# Drop note text columns
original_df = nlp_ready_df.drop(
    columns=["Radiology_notes", "Discharge_summary_notes", "combined_notes"]
)

X_original = original_df.drop(columns=[
    'hospital_expire_flag',
    'first_hosp_stay',
    'suspected_infection',
    'sepsis3'])
y_original = original_df["hospital_expire_flag"]


print(f"✅ Structured features: {X_original.shape}, Target: {y_original.shape}")


✅ Structured features: (5208, 44), Target: (5208,)


## 3. Train/Test Split of Structured Features

Perform a single consistent train/test split that defines all downstream
processing for both structured and text-based feature sets.
The same `subject_id` partitions will be reused for all data modalities.

In [4]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig = split_data(X_original, y_original, test_size=0.2, random_state=42)
print(f"Train: {X_train_orig.shape}, Test: {X_test_orig.shape}")

Train: (4166, 44), Test: (1042, 44)
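
Because later steps select note text by `subject_id`, the split can only be reused safely if each subject lands in exactly one partition. The sketch below shows one way to check that on toy data. The helper `assert_disjoint_subjects` is hypothetical (it is not part of `src`), and it assumes one row per subject, which the row counts above suggest but do not prove.

```python
import pandas as pd

def assert_disjoint_subjects(train_df, test_df, id_col="subject_id"):
    """Raise if an ID repeats within a split or appears in both splits."""
    train_ids, test_ids = train_df[id_col], test_df[id_col]
    if train_ids.duplicated().any() or test_ids.duplicated().any():
        raise ValueError(f"duplicate {id_col} values within a split")
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} {id_col} values appear in both splits")
    return len(train_ids), len(test_ids)

# Toy frames standing in for X_train_orig / X_test_orig
toy_train = pd.DataFrame({"subject_id": [1, 2, 3, 4], "sofa_score": [2, 5, 3, 7]})
toy_test = pd.DataFrame({"subject_id": [5, 6], "sofa_score": [4, 1]})

n_train, n_test = assert_disjoint_subjects(toy_train, toy_test)
```

In the notebook, the same call could be made with `X_train_orig` and `X_test_orig` directly after the split.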


## 4. Scale Structured Features

Fit a `StandardScaler` on the training structured features only, then transform both the train and test sets with it.
Scaling is performed before merging with the embeddings, so the structured block is standardized using its own training statistics.

In [5]:
X_train_orig_scaled, X_test_orig_scaled, y_train_orig, y_test_orig = scale_features(
    X_train_orig, X_test_orig, y_train_orig, y_test_orig, prefix="original"
)

✅ Scaled original features prepared (not saved — handled downstream)
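
The internals of `scale_features` are not shown in this notebook. As a minimal sketch of the fit-on-train, apply-to-test pattern it describes, the NumPy code below learns the mean and population standard deviation from the training rows and reuses them for the test rows. The function names are hypothetical, and the handling of constant columns (left unscaled) is an assumption intended to mirror the default behaviour of scikit-learn's `StandardScaler`.

```python
import numpy as np

def fit_standardizer(train):
    """Learn per-column mean and population std (ddof=0) from training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)   # NumPy's default is ddof=0
    std[std == 0] = 1.0       # constant columns are centered but not rescaled
    return mean, std

def apply_standardizer(x, mean, std):
    """Standardize any split with statistics learned from the training split."""
    return (x - mean) / std

# Toy data: the second column is constant in the training set
train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
test = np.array([[3.0, 12.0]])

mean, std = fit_standardizer(train)
train_scaled = apply_standardizer(train, mean, std)
test_scaled = apply_standardizer(test, mean, std)
```

The test rows never contribute to `mean` or `std`, which is what keeps test information out of the fitted transformation.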


## 5. Train or load Baseline Word2Vec Models

Train (or load, if already available) three baseline Word2Vec models,
one each for the Radiology, Discharge, and Combined notes.
These models are stored under `embedding_cache/w2v/baseline/`.


In [6]:
# Relative paths (pass these to load/train; helpers will resolve)
paths = {
    "radiology": {
        "corpus": "data/interim/w2v_interim/w2v_Radiology_notes.txt",
        "model":  "embedding_cache/w2v/baseline/w2v_radiology.model",
    },
    "discharge": {
        "corpus": "data/interim/w2v_interim/w2v_Discharge_notes.txt",
        "model":  "embedding_cache/w2v/baseline/w2v_discharge.model",
    },
    "combined": {
        "corpus": "data/interim/w2v_interim/w2v_combined_notes.txt",
        "model":  "embedding_cache/w2v/baseline/w2v_combined.model",
    },
}

models = {}
for note_type, p in paths.items():
    params = get_w2v_params(note_type)
    model_abs_path = resolve_path(p["model"])  # resolved only for the existence check

    # NOTE: the helper's log output below reports a nested baseline\baseline save path,
    # so this check may miss a model written by train_word2vec(baseline=True).
    # Verify the save location against src/features.py.
    if os.path.exists(model_abs_path):
        models[note_type] = load_word2vec(p["model"])  # ✅ auto-detects baseline if present
        print(f"✅ Loaded {note_type} Word2Vec model.")
    else:
        models[note_type] = train_word2vec(
            corpus_path=p["corpus"],
            model_out=p["model"],
            baseline=True,
            **params
        )
        print(f"✅ Trained & saved {note_type} Word2Vec model to {model_abs_path}")

print("✅ All Word2Vec models ready.")

✅ Word2Vec model (baseline) trained and saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\baseline\w2v_radiology.model
✅ Trained & saved radiology Word2Vec model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\w2v_radiology.model
✅ Word2Vec model (baseline) trained and saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\baseline\w2v_discharge.model
✅ Trained & saved discharge Word2Vec model to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\w2v_discharge.model
✅ Word2Vec model (baseline) trained and saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\baseline\w2v_combined.model
✅ Trained & saved combined Word2Vec model to C:\Users\tyle

## 6. Generate Subject-Level Embeddings per Note Type

Using the train/test subject IDs from Step 3,
generate averaged document embeddings for each subject
across all three note types (`Radiology_notes`, `Discharge_summary_notes`, `combined_notes`).
Each embedding set will be created separately with its respective model.


In [7]:
train_ids = set(X_train_orig["subject_id"])
test_ids  = set(X_test_orig["subject_id"])

# Align note text subsets to train/test subjects
train_notes = nlp_ready_df.loc[nlp_ready_df["subject_id"].isin(train_ids)].copy()
test_notes  = nlp_ready_df.loc[nlp_ready_df["subject_id"].isin(test_ids)].copy()

# Generate averaged embeddings for each model

# radiology
w2v_train_rad = apply_embeddings_to_subjects(train_notes, "Radiology_notes", models["radiology"], prefix="w2v_rad_")
w2v_test_rad  = apply_embeddings_to_subjects(test_notes,  "Radiology_notes", models["radiology"], prefix="w2v_rad_")
# discharge
w2v_train_dis = apply_embeddings_to_subjects(train_notes, "Discharge_summary_notes", models["discharge"], prefix="w2v_dis_")
w2v_test_dis  = apply_embeddings_to_subjects(test_notes,  "Discharge_summary_notes", models["discharge"], prefix="w2v_dis_")
# combined
w2v_train_comb = apply_embeddings_to_subjects(train_notes, "combined_notes", models["combined"], prefix="w2v_comb_")
w2v_test_comb  = apply_embeddings_to_subjects(test_notes,  "combined_notes", models["combined"], prefix="w2v_comb_")

print(f"✅ Radiology embeddings: {w2v_train_rad.shape}")
print(f"✅ Discharge embeddings: {w2v_train_dis.shape}")
print(f"✅ Combined embeddings:  {w2v_train_comb.shape}")


✅ Radiology embeddings: (4166, 101)
✅ Discharge embeddings: (4166, 101)
✅ Combined embeddings:  (4166, 101)
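
The implementation of `apply_embeddings_to_subjects` is not shown here. A common way to build an averaged document embedding, sketched below, is to take the mean of the vectors of all in-vocabulary tokens. Everything in this sketch is an assumption for illustration: the function name `average_embedding`, the lowercase whitespace tokenization, the zero-vector fallback for notes with no known tokens, and the plain dictionary standing in for a trained model's word vectors.

```python
import numpy as np

def average_embedding(text, vectors, dim):
    """Mean of the vectors for in-vocabulary tokens; zeros if none are found."""
    tokens = str(text).lower().split()
    hits = [vectors[t] for t in tokens if t in vectors]
    if not hits:
        return np.zeros(dim)
    return np.mean(hits, axis=0)

# Toy lookup table standing in for a trained model's word vectors
toy_vectors = {
    "pleural": np.array([1.0, 0.0]),
    "effusion": np.array([0.0, 1.0]),
}

# "small" is out of vocabulary and is skipped
doc_vec = average_embedding("Small pleural effusion", toy_vectors, dim=2)
empty_vec = average_embedding("", toy_vectors, dim=2)
```

With a gensim model, `model.wv` supports the same membership test and indexing as the dictionary used here, so it could be passed in place of `toy_vectors`.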


## 7. Scale Embeddings

Fit a `StandardScaler` on the training embeddings for each note type
and apply the transformation to the test embeddings.
This ensures all embedding features have zero mean and unit variance
without introducing data leakage.


In [8]:
w2v_train_rad_scaled, w2v_test_rad_scaled = scale_w2v_embeddings(w2v_train_rad, w2v_test_rad, prefix="w2v_rad")
w2v_train_dis_scaled, w2v_test_dis_scaled = scale_w2v_embeddings(w2v_train_dis, w2v_test_dis, prefix="w2v_dis")
w2v_train_comb_scaled, w2v_test_comb_scaled = scale_w2v_embeddings(w2v_train_comb, w2v_test_comb, prefix="w2v_comb")

print("✅ Embeddings scaled")


✅ Scaled w2v_rad embeddings saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\w2v_rad (embeddings only, not merged)
✅ Scaled w2v_dis embeddings saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\w2v_dis (embeddings only, not merged)
✅ Scaled w2v_comb embeddings saved to C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\embedding_cache\w2v\baseline\w2v_comb (embeddings only, not merged)
✅ Embeddings scaled


## 8. Merge Structured Features + Embeddings

Merge the scaled structured features with each set of scaled Word2Vec embeddings.
This produces final merged datasets for radiology, discharge, and combined feature spaces.
Each merged train/test pair is saved under `data/processed/{variant}/`.

In [9]:
X_train_w2v_rad_scaled, X_test_w2v_rad_scaled = merge_embeddings_with_features(
    X_train_orig_scaled, X_test_orig_scaled, w2v_train_rad_scaled, w2v_test_rad_scaled, prefix="w2v_rad"
)

X_train_w2v_dis_scaled, X_test_w2v_dis_scaled = merge_embeddings_with_features(
    X_train_orig_scaled, X_test_orig_scaled, w2v_train_dis_scaled, w2v_test_dis_scaled, prefix="w2v_dis"
)

X_train_w2v_comb_scaled, X_test_w2v_comb_scaled = merge_embeddings_with_features(
    X_train_orig_scaled, X_test_orig_scaled, w2v_train_comb_scaled, w2v_test_comb_scaled, prefix="w2v_comb"
)
print("✅ Original + embeddings data merged")


✅ Merged w2v_rad train/test sets saved under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_rad
✅ Merged w2v_dis train/test sets saved under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_dis
✅ Merged w2v_comb train/test sets saved under C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_comb
✅ Original + embeddings data merged
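
How `merge_embeddings_with_features` joins the two blocks is not shown in this notebook. One plausible approach, sketched here on toy frames, is a pandas merge on `subject_id` with `validate="one_to_one"`, so that a duplicated subject raises an error instead of silently multiplying rows. The column names and values below are illustrative only.

```python
import pandas as pd

# Toy scaled structured features and scaled embeddings, keyed by subject_id
structured = pd.DataFrame({
    "subject_id": [101, 102, 103],
    "sofa_score": [0.4, -1.2, 0.8],
})
embeddings = pd.DataFrame({
    "subject_id": [103, 101, 102],  # deliberately in a different row order
    "w2v_rad_0": [0.1, 0.2, 0.3],
    "w2v_rad_1": [-0.5, 0.0, 0.5],
})

# An inner join keeps the left frame's key order; validate guards against duplicates
merged = structured.merge(
    embeddings, on="subject_id", how="inner", validate="one_to_one"
)

# Fail loudly if any subject lost its row in the join
if len(merged) != len(structured):
    raise ValueError("some subjects have no embedding row")
```

The column arithmetic in the notebook is consistent with a key-based join of this kind: 44 structured columns and 101 embedding columns share `subject_id`, giving 144 merged columns, or 143 once `subject_id` is dropped in Step 9.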


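## 9. Remove `subject_id` from Feature Space

`subject_id` is an identifier rather than a predictor, so it is dropped from every dataset variant before saving.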

In [11]:
# Define all dataset variants
variant_scaled_sets = {
    "original": (X_train_orig_scaled, X_test_orig_scaled),
    "w2v_radiology": (X_train_w2v_rad_scaled, X_test_w2v_rad_scaled),
    "w2v_discharge": (X_train_w2v_dis_scaled, X_test_w2v_dis_scaled),
    "w2v_combined": (X_train_w2v_comb_scaled, X_test_w2v_comb_scaled)
}

for variant, (xtrain, xtest) in variant_scaled_sets.items():
    for df_name, df in {"xtrain": xtrain, "xtest": xtest}.items():
        if "subject_id" in df.columns:
            df.drop(columns=["subject_id"], inplace=True)
            print(f"✅ Removed subject_id from {variant} {df_name}")


✅ Removed subject_id from original xtrain
✅ Removed subject_id from original xtest
✅ Removed subject_id from w2v_radiology xtrain
✅ Removed subject_id from w2v_radiology xtest
✅ Removed subject_id from w2v_discharge xtrain
✅ Removed subject_id from w2v_discharge xtest
✅ Removed subject_id from w2v_combined xtrain
✅ Removed subject_id from w2v_combined xtest


## 10. Save All Processed Datasets

All scaled and merged datasets (structured-only and Word2Vec variants) are saved
to their respective folders under `data/processed/{variant}/`.  

Each folder contains the four core files:

- `data_{variant}_xtrain.csv`
- `data_{variant}_xtest.csv`
- `data_{variant}_ytrain.csv`
- `data_{variant}_ytest.csv`

This centralized save step ensures consistent versioned outputs across all feature modalities.


In [15]:
# dictionary of datasets: prefix → (X_train, X_test)

datasets_to_save = {
    "original": (X_train_orig_scaled, X_test_orig_scaled),
    "w2v_radiology": (X_train_w2v_rad_scaled, X_test_w2v_rad_scaled),
    "w2v_discharge": (X_train_w2v_dis_scaled, X_test_w2v_dis_scaled),
    "w2v_combined": (X_train_w2v_comb_scaled, X_test_w2v_comb_scaled)
}

for name, (Xtr, Xte) in datasets_to_save.items():
    assert "subject_id" not in Xtr.columns
    assert "subject_id" not in Xte.columns
print("✅ Verified: no subject_id columns remain.")

# corresponding y values (shared across all variants)
if isinstance(y_train_orig, pd.Series):
    y_train_orig = y_train_orig.to_frame()
if isinstance(y_test_orig, pd.Series):
    y_test_orig = y_test_orig.to_frame()


for prefix, (Xtr, Xte) in datasets_to_save.items():
    base_dir = f"data/processed/{prefix}"
    # ensure directory exists
    os.makedirs(base_dir, exist_ok=True)

    # save X_train / X_test
    save_feature_dataset(Xtr, f"data_{prefix}_xtrain.csv", base_dir=base_dir)
    save_feature_dataset(Xte, f"data_{prefix}_xtest.csv",  base_dir=base_dir)

    # save y_train / y_test once per variant for clarity
    save_feature_dataset(y_train_orig, f"data_{prefix}_ytrain.csv", base_dir=base_dir)
    save_feature_dataset(y_test_orig,  f"data_{prefix}_ytest.csv",  base_dir=base_dir)

print("✅ All scaled structured and merged datasets successfully saved.")

✅ Verified: no subject_id columns remain.
✅ Saved feature dataset → C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\original\data_original_xtrain.csv
✅ Saved feature dataset → C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\original\data_original_xtest.csv
✅ Saved feature dataset → C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\original\data_original_ytrain.csv
✅ Saved feature dataset → C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\original\data_original_ytest.csv
✅ Saved feature dataset → C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_radiology\data_w2v_radiology_xtrain.csv
✅ Saved feature dataset → C:\Users\tyler\OneDrive - University of Pittsburgh\BIOST 2021 Thesis\Masters-Thesis\data\processed\w2v_

## 11. Validate all saved dataset outputs

This step calls `validate_saved_datasets()`
(from `src/features.py`) to check that all processed datasets
exist, have valid shapes, and (optionally) preserve `subject_id`
alignment between features and labels.

A summary table is displayed below; if every row shows `Exists=True`,
all expected output files were written.
The `Aligned` column is empty in this run, presumably because `subject_id`
was removed in Step 9 and is no longer available for the alignment check.

In [22]:
output_summary = validate_saved_datasets(check_alignment=True)
display(output_summary)

if not output_summary["Exists"].all():
    missing = output_summary.loc[~output_summary["Exists"], ["Variant", "Split", "File"]]
    print("\n⚠️ Missing or invalid files:")
    display(missing)
else:
    print("\n✅ All processed datasets found and validated.")

| | Variant | Split | File | Exists | Rows | Columns | Aligned |
|---|---|---|---|---|---|---|---|
| 0 | original | X_train | data_original_xtrain.csv | True | 4166 | 43 | |
| 1 | original | X_test | data_original_xtest.csv | True | 1042 | 43 | |
| 2 | original | y_train | data_original_ytrain.csv | True | 4166 | 1 | |
| 3 | original | y_test | data_original_ytest.csv | True | 1042 | 1 | |
| 4 | w2v_radiology | X_train | data_w2v_radiology_xtrain.csv | True | 4166 | 143 | |
| 5 | w2v_radiology | X_test | data_w2v_radiology_xtest.csv | True | 1042 | 143 | |
| 6 | w2v_radiology | y_train | data_w2v_radiology_ytrain.csv | True | 4166 | 1 | |
| 7 | w2v_radiology | y_test | data_w2v_radiology_ytest.csv | True | 1042 | 1 | |
| 8 | w2v_discharge | X_train | data_w2v_discharge_xtrain.csv | True | 4166 | 143 | |
| 9 | w2v_discharge | X_test | data_w2v_discharge_xtest.csv | True | 1042 | 143 | |



✅ All processed datasets found and validated.
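
For readers without access to `src/features.py`, the sketch below shows roughly what an existence-and-shape check of this kind can look like. It is not the actual `validate_saved_datasets` implementation: the helper `summarize_outputs` is hypothetical, it follows the `data_{variant}_{split}.csv` naming from Step 10, and it omits the alignment check.

```python
import tempfile
from pathlib import Path

import pandas as pd

def summarize_outputs(base_dir, variants, splits=("xtrain", "xtest", "ytrain", "ytest")):
    """Return one row per expected file with its existence flag and shape."""
    rows = []
    for variant in variants:
        for split in splits:
            path = Path(base_dir) / variant / f"data_{variant}_{split}.csv"
            exists = path.is_file()
            # Only read the file when it is present
            n_rows, n_cols = pd.read_csv(path).shape if exists else (None, None)
            rows.append({"Variant": variant, "Split": split, "File": path.name,
                         "Exists": exists, "Rows": n_rows, "Columns": n_cols})
    return pd.DataFrame(rows)

# Demonstrate on a temporary folder containing a single expected file
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "original"
    out_dir.mkdir()
    toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    toy.to_csv(out_dir / "data_original_xtrain.csv", index=False)
    summary = summarize_outputs(tmp, ["original"])
```

Filtering `summary` on `~summary["Exists"]` then lists the missing files, in the same way the cell above filters `output_summary`.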


## 12. Next Steps
- Proceed to `04_model_training.ipynb` for baseline and tuned models.