# 1.0 Data Imports and Environment Setup

**Project:** Predicting ventilator requirements for COVID-19 (Target: `INTUBATED`)  
**Notebook:** `03_feature_engineering.ipynb` (Feature engineering + dataset variants)  

> **Goal:** Construct clinically motivated engineered features and produce multiple dataset variants for subsequent modelling and comparison. All artefacts are saved to `OUTPUT_DIR`, which section 1.2 resolves to `outputs/` under the project root.


In [1]:
# ---------------------------------------------------------------------
# 1.1 Imports
# ---------------------------------------------------------------------
from __future__ import annotations

from pathlib import Path
import json
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)


In [None]:
# ---------------------------------------------------------------------
# 1.2 Reproducibility + paths
# ---------------------------------------------------------------------
SEED = 42
np.random.seed(SEED)

PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
OUTPUT_DIR = PROJECT_ROOT / "outputs"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CLEAN_DATA_PATH = OUTPUT_DIR / "covid_clean.parquet"   # produced by 01_eda.ipynb
print("PROJECT_ROOT:", PROJECT_ROOT)
print("CLEAN_DATA_PATH:", CLEAN_DATA_PATH)


# 2.0 Load Clean Dataset

**Assumption:** `01_eda.ipynb` has already been run and produced `covid_clean.parquet` in `OUTPUT_DIR` (`outputs/` under the project root; see `CLEAN_DATA_PATH` in section 1.2).


In [3]:
# ---------------------------------------------------------------------
# 2.1 Load cleaned data
# ---------------------------------------------------------------------
if not CLEAN_DATA_PATH.exists():
    raise FileNotFoundError(
        f"Missing cleaned dataset at {CLEAN_DATA_PATH}. "
        "Run 01_eda.ipynb first (or update CLEAN_DATA_PATH)."
    )

df = pd.read_parquet(CLEAN_DATA_PATH)
print("Shape:", df.shape)
df.head()


Shape: (192706, 22)


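| | usmer | medical_unit | sex | patient_type | date_died | intubed | pneumonia | age | pregnant | diabetes | copd | asthma | inmsupr | hipertension | other_disease | cardiovascular | obesity | renal_chronic | tobacco | clasiffication_final | icu | intubated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 09/06/2020 | 1 | 0 | 55 | 97 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 1 |
| 1 | 0 | 1 | 1 | 0 | 9999-99-99 | 0 | 1 | 40 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 0 |
| 2 | 0 | 1 | 1 | 0 | 9999-99-99 | 0 | 0 | 37 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 3 | 2 | 0 |
| 3 | 0 | 1 | 1 | 0 | 9999-99-99 | 0 | 0 | 25 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 0 |
| 4 | 0 | 1 | 0 | 0 | 9999-99-99 | 0 | 0 | 24 | 97 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 0 |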


In [4]:
# ---------------------------------------------------------------------
# 2.2 Identify target + basic checks
# ---------------------------------------------------------------------
TARGET = "intubated"
if TARGET not in df.columns:
    raise KeyError(f"Expected target column '{TARGET}' not found. Columns: {list(df.columns)[:30]} ...")

print("Target positive rate:", (df[TARGET].astype(int) == 1).mean())
print("Missing values (any):", df.isna().any().any())


Target positive rate: 0.174649466025967
Missing values (any): False


# 3.0 Feature Engineering Plan

The purpose of feature engineering here is to incorporate **clinically meaningful structure** that may improve predictive performance.

**Process:**
1. Identify sets of comorbidity/risk-factor indicators (binary columns) where available.
2. Create **COMORBIDITY_COUNT** to summarise overall health burden.
3. Create a simple **SEVERITY_SCORE** to capture acute severity signals (e.g., pneumonia / ICU).
4. Construct multiple dataset variants so that downstream modelling can quantify trade-offs.

> **Note:** Column names vary across COVID datasets. This notebook matches columns by case-insensitive keyword substrings and prints which columns were used.


In [5]:
# ---------------------------------------------------------------------
# 3.1 Helper functions
# ---------------------------------------------------------------------
def find_columns_like(df: pd.DataFrame, keywords: list[str]) -> list[str]:
    """Return columns containing any keyword (case-insensitive)."""
    cols = []
    low_cols = {c: c.lower() for c in df.columns}
    for c, lc in low_cols.items():
        if any(k.lower() in lc for k in keywords):
            cols.append(c)
    return cols

def ensure_binary_int(df: pd.DataFrame, cols: list[str]) -> list[str]:
    """Keep columns whose non-missing values look like binary 0/1."""
    kept = []
    for c in cols:
        s = df[c].dropna()
        if pd.api.types.is_numeric_dtype(s):
            # Compare values directly: casting to int first would truncate
            # floats such as 0.5 to 0 and wrongly accept the column.
            if set(s.unique()).issubset({0, 1}):
                kept.append(c)
        else:
            # Accept common string encodings of a binary flag.
            # Missing values are dropped above so NaN never becomes "nan".
            uniq = set(s.astype(str).str.lower().unique())
            if uniq.issubset({"0", "1", "true", "false", "yes", "no"}):
                kept.append(c)
    return kept


# 4.0 Identify Candidate Clinical Indicator Columns

We programmatically identify common indicator groups. These lists can be adjusted depending on the specific dataset schema.
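
As a minimal sketch of how keyword matching on column names behaves, the example below uses hypothetical column names (not taken from the dataset) and a hypothetical helper `match_columns`. It illustrates that short keywords such as `icu` can match unrelated columns through substring matching, and that comparing whole underscore-separated tokens is one stricter alternative.

```python
import pandas as pd


def match_columns(columns, keywords):
    """Return column names containing any keyword (case-insensitive substring)."""
    lowered = [k.lower() for k in keywords]
    return [c for c in columns if any(k in c.lower() for k in lowered)]


# Hypothetical column names, for illustration only.
toy = pd.DataFrame(columns=["ICU", "pneumonia", "difficulty_score", "age"])
keywords = ["icu", "pneumonia"]

# Substring matching: "difficulty_score" is picked up because it contains "icu".
substring_matches = match_columns(toy.columns, keywords)
print(substring_matches)  # ['ICU', 'pneumonia', 'difficulty_score']

# Stricter alternative: require a whole underscore-separated token to match.
token_matches = [
    c for c in toy.columns if set(c.lower().split("_")) & set(keywords)
]
print(token_matches)  # ['ICU', 'pneumonia']
```

Printing the matched columns, as the notebook does, is what makes such accidental matches visible.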


In [6]:
# ---------------------------------------------------------------------
# 4.1 Candidate indicator groups (keyword-based)
# ---------------------------------------------------------------------
# Common comorbidity/risk factors in public COVID clinical datasets:
comorbidity_keywords = [
    "diabetes", "hypertension", "asthma", "copd", "obesity", "cardio",
    "cardiovascular", "renal", "kidney", "immuno", "immunosupp",
    "tobacco", "smoker", "smoking", "pregnan", "cancer", "other_disease",
    "inmsupr", "hipertension", "epoc"  # Spanish variants seen in some datasets
]

# Acute severity / clinical course indicators:
severity_keywords = [
    "pneumonia", "icu", "intensive", "hospital", "hosp", "patient_type",
    "vent", "respir", "dysp", "shortness", "saturation", "spo2"
]

comorb_cols = find_columns_like(df.drop(columns=[TARGET]), comorbidity_keywords)
sev_cols = find_columns_like(df.drop(columns=[TARGET]), severity_keywords)

# Keep only columns that look binary-like for comorbidities
comorb_cols = ensure_binary_int(df, comorb_cols)

print("Detected comorbidity indicator columns:", len(comorb_cols))
print(comorb_cols[:30])

print("\nDetected severity-related columns (may include non-binary):", len(sev_cols))
print(sev_cols[:30])


Detected comorbidity indicator columns: 0
[]

Detected severity-related columns (may include non-binary): 3
['patient_type', 'pneumonia', 'icu']
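
The preview in section 2.1 shows comorbidity columns holding values such as 1, 2, and 97 rather than 0/1, which would explain why no column passes the binary check. The sketch below shows one way such columns could be recoded before counting. It assumes a coding of 1 = yes, 2 = no, and any other code = unknown; this coding is an assumption and should be verified against the dataset's data dictionary. The helper `recode_yes_no` and the toy values are hypothetical.

```python
import pandas as pd


def recode_yes_no(s: pd.Series) -> pd.Series:
    """Map 1 -> 1.0 and 2 -> 0.0; every other code becomes NaN (ASSUMED coding)."""
    return s.map({1: 1.0, 2: 0.0})


# Toy data for illustration only.
toy = pd.DataFrame({"diabetes": [1, 2, 2, 97], "obesity": [2, 2, 1, 2]})

recoded = toy.apply(recode_yes_no)

# Row-wise sum skips NaN, so an unknown value counts as "not present" here.
toy_count = recoded.sum(axis=1)

print(recoded)
print(toy_count.tolist())  # [1.0, 0.0, 1.0, 0.0]
```

Whether an unknown code should count as absent, or be kept as missing, is a modelling decision that this sketch does not settle.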


# 5.0 Engineer Features

## 5.1 COMORBIDITY_COUNT
- Defined as the sum of detected comorbidity indicator columns per patient.

## 5.2 SEVERITY_SCORE
- A simple additive score: `comorbidity_count` plus each available binary acute-severity indicator.
- In many COVID datasets, **pneumonia** and **ICU** are strong markers of clinical severity.

> The exact columns used will be recorded in `feature_dictionary.json`.
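
The two definitions above can be worked through on a toy table. This is a sketch with made-up values and hypothetical frame names; it assumes all indicator columns are already coded 0/1.

```python
import pandas as pd

# Toy patients for illustration only; all indicators are assumed to be 0/1.
toy = pd.DataFrame(
    {
        "diabetes": [1, 0, 0],
        "obesity": [1, 0, 1],
        "pneumonia": [1, 0, 0],
        "icu": [1, 0, 0],
    }
)

toy_comorb_cols = ["diabetes", "obesity"]
toy_severity_cols = ["pneumonia", "icu"]

# comorbidity_count: number of comorbidity indicators set per patient.
toy["comorbidity_count"] = toy[toy_comorb_cols].sum(axis=1)

# severity_score: comorbidity_count plus the acute severity indicators.
toy["severity_score"] = toy["comorbidity_count"] + toy[toy_severity_cols].sum(axis=1)

print(toy[["comorbidity_count", "severity_score"]])
# comorbidity_count: 2, 0, 1
# severity_score:    4, 0, 1
```

Because the score is a plain sum, a comorbidity and an acute signal carry equal weight; downstream models can be compared with and without it to see whether that simplification helps.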


In [7]:
# ---------------------------------------------------------------------
# 5.1 Create COMORBIDITY_COUNT
# ---------------------------------------------------------------------
df_fe = df.copy()

if len(comorb_cols) == 0:
    print("WARNING: No comorbidity indicator columns detected. comorbidity_count is set to 0 for all rows.")
    df_fe["comorbidity_count"] = 0
    used_comorb_cols = []
else:
    df_fe["comorbidity_count"] = df_fe[comorb_cols].fillna(0).astype(int).sum(axis=1)
    used_comorb_cols = comorb_cols

df_fe["comorbidity_count"].describe()




count    192706.0
mean          0.0
std           0.0
min           0.0
25%           0.0
50%           0.0
75%           0.0
max           0.0
Name: comorbidity_count, dtype: float64

In [8]:
# ---------------------------------------------------------------------
# 5.2 Create SEVERITY_SCORE (pneumonia + ICU + comorbidity_count by default)
# ---------------------------------------------------------------------
# Try to identify pneumonia/ICU columns robustly
pneumonia_candidates = find_columns_like(df_fe, ["pneumonia"])
icu_candidates = find_columns_like(df_fe, ["icu"])

# Keep only binary 0/1 columns among candidates
pneumonia_cols = ensure_binary_int(df_fe, pneumonia_candidates)
icu_cols = ensure_binary_int(df_fe, icu_candidates)

pneumonia_col = pneumonia_cols[0] if len(pneumonia_cols) > 0 else None
icu_col = icu_cols[0] if len(icu_cols) > 0 else None

# Build severity score with available signals
severity_components = []

if pneumonia_col is not None:
    severity_components.append(pneumonia_col)
if icu_col is not None:
    severity_components.append(icu_col)

# comorbidity_count is numeric; include as-is
df_fe["severity_score"] = df_fe["comorbidity_count"]

if len(severity_components) > 0:
    df_fe["severity_score"] = df_fe["severity_score"] + df_fe[severity_components].fillna(0).astype(int).sum(axis=1)

print("Using pneumonia column:", pneumonia_col)
print("Using ICU column:", icu_col)
print("Severity components:", severity_components + ["comorbidity_count"])
df_fe["severity_score"].describe()


Using pneumonia column: pneumonia
Using ICU column: None
Severity components: ['pneumonia', 'comorbidity_count']


count    192706.000000
mean          0.602021
std           0.489482
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: severity_score, dtype: float64

# 6.0 Construct Dataset Variants

We create multiple variants to evaluate the incremental value of engineered features.

**Variants saved:**
1. `dataset_v0_base`: original cleaned features
2. `dataset_v1_comorbidity`: base + `COMORBIDITY_COUNT`
3. `dataset_v2_severity`: base + `COMORBIDITY_COUNT` + `SEVERITY_SCORE`
4. `dataset_v3_drop_low_corr`: `dataset_v2_severity` minus numeric features whose absolute correlation with the target falls below a threshold (screening heuristic; engineered features are always kept)

> Note: correlation screening is used as a *heuristic* to reduce noise and dimensionality; it does not imply causality.
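
The screening heuristic can be sketched on synthetic data as follows. The example also flags features whose correlation with the target is close to 1, since such a feature usually duplicates the target rather than predicts it. The names `toy`, `LOW_THRESH`, and `LEAK_THRESH`, and the 0.99 cut-off, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000
target = rng.integers(0, 2, size=n)

# Synthetic features for illustration only.
toy = pd.DataFrame(
    {
        "target": target,
        "copy_of_target": target,                  # duplicates the target
        "signal": target + rng.normal(0, 1, n),    # genuinely informative
        "noise": rng.normal(0, 1, n),              # unrelated to the target
        "constant": np.zeros(n),                   # zero variance
    }
)

LOW_THRESH = 0.05   # same role as the screening threshold in this notebook
LEAK_THRESH = 0.99  # assumed cut-off for "suspiciously perfect" features

# A constant column has undefined correlation (NaN); silence the NumPy warning.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = toy.drop(columns="target").corrwith(toy["target"]).abs()

leak_candidates = corr[corr >= LEAK_THRESH].index.tolist()
kept = corr[(corr >= LOW_THRESH) & (corr < LEAK_THRESH)].index.tolist()

print("Leak candidates:", leak_candidates)
print("Kept:", kept)
```

Comparisons against NaN evaluate to false, so the constant column is neither kept nor flagged; a feature forced back in after screening should be checked for zero variance first.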


In [9]:
# ---------------------------------------------------------------------
# 6.1 Prepare base design matrix
# ---------------------------------------------------------------------
TARGET = "intubated"
base_cols = [c for c in df.columns if c != TARGET]

dataset_v0_base = df[base_cols + [TARGET]].copy()
dataset_v1_comorbidity = df_fe[base_cols + ["comorbidity_count", TARGET]].copy()
dataset_v2_severity = df_fe[base_cols + ["comorbidity_count", "severity_score", TARGET]].copy()

print("v0:", dataset_v0_base.shape)
print("v1:", dataset_v1_comorbidity.shape)
print("v2:", dataset_v2_severity.shape)


v0: (192706, 22)
v1: (192706, 23)
v2: (192706, 24)


In [10]:
# ---------------------------------------------------------------------
# 6.2 Correlation screening variant (numeric-only heuristic)
# ---------------------------------------------------------------------
# Compute absolute correlation for numeric columns; keep columns above a threshold.
# Threshold can be tuned; a small value prevents overly aggressive dropping.
THRESH = 0.05

v2 = dataset_v2_severity.copy()
numeric_cols = v2.select_dtypes(include=[np.number]).columns.tolist()

if TARGET not in numeric_cols:
    numeric_cols.append(TARGET)

corr = v2[numeric_cols].corr(numeric_only=True)[TARGET].drop(TARGET)
keep_numeric = corr[ corr.abs() >= THRESH ].index.tolist()

# Always keep engineered features if present
for forced in ["comorbidity_count", "severity_score"]:
    if forced in v2.columns and forced not in keep_numeric:
        keep_numeric.append(forced)

# Build dataset: keep all non-numeric columns + selected numeric columns,
# preserving the original column order of v2
non_numeric_cols = [c for c in v2.columns if c not in numeric_cols and c != TARGET]
keep_set = set(non_numeric_cols + keep_numeric + [TARGET])
final_keep = [c for c in v2.columns if c in keep_set]

dataset_v3_drop_low_corr = v2[final_keep].copy()

print("Correlation screening threshold:", THRESH)
print("Kept numeric features:", len(keep_numeric))
print("v3:", dataset_v3_drop_low_corr.shape)

# Quick look at top correlated features for transparency
top_corr = corr.abs().sort_values(ascending=False).head(20)
top_corr


Correlation screening threshold: 0.05
Kept numeric features: 6
v3: (192706, 8)


intubed                 1.000000
severity_score          0.156074
pneumonia               0.156074
clasiffication_final    0.124715
age                     0.095533
medical_unit            0.049875
sex                     0.049182
pregnant                0.048970
asthma                  0.019600
renal_chronic           0.018881
icu                     0.018770
cardiovascular          0.018726
tobacco                 0.018532
copd                    0.017926
inmsupr                 0.015798
usmer                   0.014203
other_disease           0.013191
hipertension            0.012969
diabetes                0.012235
obesity                 0.008320
Name: intubated, dtype: float64

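***Summary of Feature Engineering***

- `COMORBIDITY_COUNT` is intended to summarise the overall burden of detected comorbidities. In this run no comorbidity column passed the 0/1 check (section 4.1), so the feature is 0 for every patient and carries no information.
- `SEVERITY_SCORE` is intended to combine comorbidity burden with acute severity signals (pneumonia/ICU where available). In this run only `pneumonia` contributed, so the score is identical to `pneumonia`; both show an absolute correlation of 0.156074 with the target.
- `intubed` shows an absolute correlation of 1.000000 with the target `intubated`, which suggests it duplicates the target. It should be reviewed, and normally excluded, before modelling to avoid leakage.
- Multiple dataset variants are created to support controlled downstream comparisons.
- The exact columns used for engineered features are recorded for transparency and reproducibility.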


# 7.0 Save Dataset Variants + Feature Dictionary

All datasets are saved as parquet files to preserve dtypes.
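
To illustrate what "preserving dtypes" guards against, the sketch below round-trips a toy frame through CSV and compares dtypes before and after. The frame and its column names are hypothetical. Parquet stores the schema alongside the data, so it avoids this loss; note that `DataFrame.to_parquet` needs an engine such as `pyarrow` or `fastparquet` installed, which is why the runnable part here uses CSV only.

```python
import io

import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame(
    {
        "sex": pd.Categorical(["f", "m", "f"]),
        "admitted": pd.to_datetime(["2020-06-09", "2020-06-10", "2020-06-11"]),
        "age": pd.Series([55, 40, 37], dtype="int8"),
    }
)

# Round trip through CSV text held in memory.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(toy.dtypes)
print(from_csv.dtypes)  # category and datetime are gone; int8 widens to int64
```

With CSV, every reader has to re-declare categorical, datetime, and narrow integer types, which is easy to get subtly different between notebooks.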


In [12]:
# ---------------------------------------------------------------------
# 7.1 Save artefacts
# ---------------------------------------------------------------------
# Map names to DataFrames explicitly instead of looking them up via locals()
variants = {
    "dataset_v0_base": dataset_v0_base,
    "dataset_v1_comorbidity": dataset_v1_comorbidity,
    "dataset_v2_severity": dataset_v2_severity,
    "dataset_v3_drop_low_corr": dataset_v3_drop_low_corr,
}

for name, frame in variants.items():
    path = OUTPUT_DIR / f"{name}.parquet"
    frame.to_parquet(path, index=False)
    print("Saved:", name, "->", path.name)

feature_dict = {
    "seed": SEED,
    "target": TARGET,
    "engineered_features": {
        "comorbidity_count": {
            "description": "Sum of detected binary comorbidity indicator columns (0/1).",
            "columns_used": used_comorb_cols,
        },
        "severity_score": {
            "description": "Additive score combining comorbidity_count with acute severity indicators (pneumonia/ICU) where available.",
            "pneumonia_col": pneumonia_col,
            "icu_col": icu_col,
            "components": (severity_components + ["comorbidity_count"]),
        },
    },
    "correlation_screening": {
        "threshold_abs_corr": THRESH,
        "numeric_features_kept": keep_numeric,
    },
}

feature_dict_path = OUTPUT_DIR / "feature_dictionary.json"
with open(feature_dict_path, "w", encoding="utf-8") as f:
    json.dump(feature_dict, f, indent=2)

print("Saved:", feature_dict_path.name)


Saved: dataset_v0_base -> dataset_v0_base.parquet
Saved: dataset_v1_comorbidity -> dataset_v1_comorbidity.parquet
Saved: dataset_v2_severity -> dataset_v2_severity.parquet
Saved: dataset_v3_drop_low_corr -> dataset_v3_drop_low_corr.parquet
Saved: feature_dictionary.json
