# Methodology

This section outlines our approach to evaluating the impact of different imputation methods on predictive modeling for heart failure patients in MIMIC-IV.

## Overview
1. Feature Selection: LASSO → XGBoost
2. Missing Data Handling: GPLVM Imputation
3. Model Development: Baseline (LR, RF) vs Primary (XGBoost)
4. Model Interpretation: SHAP Analysis
5. Hyperparameter Tuning: GridSearchCV

## 1. Feature Selection: LASSO → XGBoost

### Feature Selection Pipeline
- Initial feature selection using LASSO regression to identify the most important predictors
- Further refinement using XGBoost feature importance
- Final feature set used across all imputation methods for fair comparison

In [21]:
import numpy as np
import pandas as pd
import torch
import pyro
import pyro.contrib.gp as gp
import pyro.distributions as dist
import pyro.ops.stats as stats
from torch.nn import Parameter
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC
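# IterativeImputer is experimental and must be enabled explicitly before import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split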
import xgboost as xgb
import shap
import matplotlib.pyplot as plt

def feature_selection_pipeline(data, target_col):
    """
    Two-step feature selection using LASSO and XGBoost
    """
    # Split data
    X = data.drop(columns=[target_col])
    y = data[target_col]
    
    # LASSO feature selection
    lasso = LassoCV(cv=5)
    lasso.fit(X, y)
    
    # Get non-zero coefficients
    lasso_features = X.columns[lasso.coef_ != 0]
    
    # XGBoost feature importance
    xgb_model = xgb.XGBClassifier()
    xgb_model.fit(X[lasso_features], y)
    
    # Get feature importance
    importance = pd.DataFrame({
        'feature': lasso_features,
        'importance': xgb_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    return importance
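
LASSO coefficients depend on feature scale, so a predictor measured in large units can be penalized for reasons unrelated to its predictive value. The sketch below shows one way to standardize features before the LASSO step. It uses synthetic data with a continuous outcome, and the helper name `lasso_select` is hypothetical rather than part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_select(X, y, cv=5, random_state=0):
    """Return the columns of X with non-zero LASSO coefficients (hypothetical helper)."""
    # Standardize so the L1 penalty treats every feature on the same scale
    X_scaled = StandardScaler().fit_transform(X)
    lasso = LassoCV(cv=cv, random_state=random_state).fit(X_scaled, y)
    return list(X.columns[lasso.coef_ != 0])

# Synthetic data: "a" and "b" drive the outcome; "b" is recorded in much larger units
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "a": rng.normal(size=300),
    "b": rng.normal(scale=1000.0, size=300),
    "noise": rng.normal(size=300),
})
y_demo = 2.0 * X_demo["a"] + 0.003 * X_demo["b"] + rng.normal(scale=0.1, size=300)

selected = lasso_select(X_demo, y_demo)
print(selected)
```

The returned column list could then be passed to the XGBoost step in place of `lasso_features`.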

## 2. Missing Data Handling


### Imputation Methods
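We evaluate three imputation approaches:
1. **Mean/Mode Imputation**: Simple baseline method
2. **Regression-based Imputation**: Using predictive models for each feature
3. **GPLVM Imputation**: Probabilistic latent-variable approach, described after the baseline implementations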

### Missingness Scenarios
- Full data (0% missing)
- Subset with 0% missing values
- Subset with 20% missing values
- Subset with 40% missing values

In [24]:
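def create_missing_data(data, missing_percentage, random_state=None):
    """
    Set a random fraction of entries to NaN (missing completely at random)
    """
    rng = np.random.default_rng(random_state)
    # Each cell is masked independently with probability missing_percentage
    mask = rng.random(data.shape) < missing_percentage
    # DataFrame.mask returns a copy with NaN wherever the mask is True
    return data.mask(mask)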

def mean_mode_imputation(data):
    # Separate numeric and categorical columns
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    categorical_cols = data.select_dtypes(include=['object']).columns
    
    # Create imputers
    numeric_imputer = SimpleImputer(strategy='mean')
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    
    # Impute data
    data_imputed = data.copy()
    if len(numeric_cols) > 0:
        data_imputed[numeric_cols] = numeric_imputer.fit_transform(data[numeric_cols])
    if len(categorical_cols) > 0:
        data_imputed[categorical_cols] = categorical_imputer.fit_transform(data[categorical_cols])
    
    return data_imputed

def regression_imputation(data):
    # Use Random Forest for imputation
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(),
        max_iter=10,
        random_state=42
    )
    
    # Impute only numeric columns
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    data_imputed = data.copy()
    data_imputed[numeric_cols] = imputer.fit_transform(data[numeric_cols])
    
    return data_imputed
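
Because the missing values are introduced artificially, each imputation method can be scored against the values that were masked. The sketch below computes the RMSE over the masked entries only. `imputation_rmse` is a hypothetical helper, the data are synthetic, and mean imputation stands in for any of the methods.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def imputation_rmse(original, imputed, mask):
    """RMSE between original and imputed values, over masked entries only."""
    diff = (original.to_numpy() - imputed.to_numpy())[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(42)
original = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))

# Mask roughly 20% of entries completely at random
mask = rng.random(original.shape) < 0.2
data_missing = original.mask(mask)

# Mean imputation as a stand-in for any imputation method
imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(data_missing),
    index=original.index, columns=original.columns,
)

rmse = imputation_rmse(original, imputed, mask)
print(f"RMSE on masked entries: {rmse:.3f}")
```

Restricting the error to masked entries keeps the score from being diluted by observed values, which every method reproduces exactly.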

**GPLVM Imputation**: Probabilistic latent-variable approach based on Gaussian processes

The Gaussian Process Latent Variable Model (GPLVM) implementation includes:

1. **Model Structure**:
   - Sparse GP regression for efficient computation
   - RBF kernel for smooth latent space representation
   - Two-dimensional latent space for visualization

2. **Key Components**:
   - Prior mean initialization for latent variables
   - Inducing points for sparse approximation
   - Automatic guide for variational inference

3. **Features**:
   - Imputation of missing values with uncertainty estimates
   - Visualization of training progress and latent space
   - Support for both continuous and categorical variables

4. **Usage Example**:
```python
# Example usage
results = gplvm_svm_pipeline(
    data=your_data,
    target_col='target',
    missing_percentage=0.2,
    latent_dim=2,
    num_inducing=32,
    num_steps=4000
)

# Visualize results
visualize_gplvm_svm_results(
    results,
    X=your_data.drop(columns=['target']),
    feature_names=your_feature_names
)
```

The Constrained SVM approach accounts for the uncertainty in the imputed values when making predictions. Instead of using a single imputed dataset, we generate multiple candidate datasets from the GPLVM uncertainty estimates, each representing one possible realization of the true values. We train an SVM on each dataset and keep the model with the highest cross-validated score.

**Key features**:

- The GPLVM implementation is adapted from the Pyro GPLVM tutorial: https://github.com/pyro-ppl/pyro/blob/dev/tutorial/source/gplvm.ipynb
- GPLVM handles missing value imputation with uncertainty estimates
- Constrained SVM uses these uncertainty estimates to create robust models
- SHAP analysis helps interpret the model's decisions
- Visualization tools help understand the results
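
To illustrate the sampling step, the sketch below draws several plausible datasets from per-entry imputed means and variances. It assumes the variance is zero for observed entries, so only imputed values are perturbed. `sample_datasets` and the toy arrays are hypothetical.

```python
import numpy as np

def sample_datasets(mean, variance, n_samples=10, seed=0):
    """Draw n_samples datasets from independent Normal(mean, variance) entries."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(variance)
    # Entries with zero variance (observed values) are returned unchanged
    return [mean + rng.normal(0.0, std) for _ in range(n_samples)]

# Toy example: only the entry at row 0, column 1 was imputed
mean = np.array([[1.0, 2.0], [3.0, 4.0]])
variance = np.array([[0.0, 0.25], [0.0, 0.0]])

samples = sample_datasets(mean, variance, n_samples=5)
for s in samples:
    print(s[0, 1])  # the imputed entry varies between draws
```

Each draw can then be used to train one candidate SVM, as the constrained SVM function does.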


In [23]:
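def gplvm_imputation(data, latent_dim=2, num_inducing=32, num_steps=4000):
    """
    GPLVM-based imputation for missing values (numeric columns only).

    Returns the imputed data and the predictive variance of each imputed
    entry (zero for observed entries).
    """
    missing_mask = data.isnull().to_numpy()

    # NaN targets would propagate into the training loss, so standardize each
    # column and start missing entries at the column mean (0 after scaling)
    col_mean = data.mean().fillna(0.0)
    col_std = data.std().replace(0, 1.0).fillna(1.0)
    data_scaled = ((data - col_mean) / col_std).fillna(0.0)

    # Convert data to torch tensor; y has shape (n_features, n_samples)
    data_tensor = torch.tensor(data_scaled.to_numpy(), dtype=torch.get_default_dtype())
    y = data_tensor.t()

    # Create prior mean for latent variables from a PCA projection; an all-zero
    # prior mean would make all inducing points identical at initialization
    _, _, V = torch.pca_lowrank(data_tensor, q=latent_dim)
    X_prior_mean = data_tensor @ V[:, :latent_dim]

    # Define RBF kernel
    kernel = gp.kernels.RBF(input_dim=latent_dim, lengthscale=torch.ones(latent_dim))

    # Initialize latent variables
    X = Parameter(X_prior_mean.clone())

    # Initialize inducing points (at most one per sample)
    Xu = stats.resample(X_prior_mean.clone(), min(num_inducing, y.size(1)))

    # Create sparse GP model
    gplvm = gp.models.SparseGPRegression(
        X, y, kernel, Xu,
        noise=torch.tensor(0.01),
        jitter=1e-5
    )

    # Set up prior for X
    gplvm.X = pyro.nn.PyroSample(
        dist.Normal(X_prior_mean, 0.1).to_event()
    )

    # Set up guide
    gplvm.autoguide("X", dist.Normal)

    # Train the model
    losses = gp.util.train(gplvm, num_steps=num_steps)

    # Switch to the guide and take the learned latent locations
    gplvm.mode = "guide"
    X_loc = gplvm.X_loc.detach()

    # Predict every feature at the learned latent locations;
    # both outputs have shape (n_features, n_samples)
    with torch.no_grad():
        pred_mean, pred_var = gplvm.forward(X_loc, full_cov=False)

    # Map predictions back to the original units
    pred_mean = pred_mean.t().numpy() * col_std.to_numpy() + col_mean.to_numpy()
    pred_var = pred_var.t().numpy() * col_std.to_numpy() ** 2

    # Replace only the missing entries; observed values stay untouched
    imputed_data = pd.DataFrame(
        np.where(missing_mask, pred_mean, data.to_numpy(dtype=float)),
        index=data.index, columns=data.columns
    )
    uncertainty = pd.DataFrame(
        np.where(missing_mask, pred_var, 0.0),
        index=data.index, columns=data.columns
    )

    return imputed_data, uncertainty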

## 3, 4, 5. Model Development, Model Interpretation (SHAP), and Hyperparameter Tuning

In [25]:
def train_models(X_train, X_test, y_train, y_test):
    """
    Train baseline and primary models with hyperparameter tuning
    """
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(),
        'XGBoost': xgb.XGBClassifier()
    }
    
    # Define parameter grids for GridSearchCV
    param_grids = {
        'Logistic Regression': {'C': [0.1, 1, 10]},
        'Random Forest': {'n_estimators': [100, 200], 'max_depth': [None, 10]},
        'XGBoost': {'n_estimators': [100, 200], 'max_depth': [3, 6]}
    }
    
    # Store SHAP values
    shap_values = {}
    
    for name, model in models.items():
        # Perform grid search
        grid_search = GridSearchCV(model, param_grids[name], cv=5, scoring='roc_auc')
        grid_search.fit(X_train, y_train)
        
        # Get best model
        best_model = grid_search.best_estimator_
        models[name] = best_model
        
        # Calculate SHAP values (tree explainer for tree ensembles, linear otherwise)
        if name in ('XGBoost', 'Random Forest'):
            explainer = shap.TreeExplainer(best_model)
        else:  # Logistic Regression
            explainer = shap.LinearExplainer(best_model, X_train)
        shap_values[name] = explainer.shap_values(X_test)
    
    return models, shap_values

## 6. Complete Analysis Pipeline

The complete pipeline combines the steps above to predict in-hospital mortality for ICU heart failure (HF) patients:

1. **Feature Selection**
   - Two-step reduction using LASSO and XGBoost
   - Identifies most important predictors

2. **Missing Data Handling**
   - GPLVM imputation with uncertainty estimation
   - Handles different missingness levels (0%, 20%, 40%)

3. **Model Development**
   - Baseline models (Logistic Regression, Random Forest)
   - Primary model (XGBoost)
   - Constrained SVM with uncertainty incorporation

4. **Model Interpretation**
   - SHAP analysis for feature importance
   - Comparison across missingness levels
   - Uncertainty visualization


In [27]:
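def constrained_svm(X, y, uncertainty, n_samples=10, random_state=None):
    """
    Constrained SVM implementation that incorporates uncertainty.

    X is a DataFrame of imputed features; uncertainty holds per-entry
    variances and is aligned to the index and columns of X.
    """
    rng = np.random.default_rng(random_state)

    # Align the variances with X; entries without a variance get no noise
    variance = uncertainty.reindex(index=X.index, columns=X.columns)
    std = np.sqrt(variance.fillna(0.0).to_numpy(dtype=float))

    # Store all trained models and their performance
    models = []
    scores = []

    for _ in range(n_samples):
        # Generate a new dataset by sampling from uncertainty distribution
        X_sampled = X.to_numpy(dtype=float) + rng.normal(0.0, std)

        # Train SVM on this sampled dataset
        model = SVC(kernel='rbf', probability=True)
        model.fit(X_sampled, y)

        # Evaluate model performance with 5-fold cross-validation
        score = cross_val_score(model, X_sampled, y, cv=5).mean()

        models.append(model)
        scores.append(score)

    # Select the best performing model
    best_idx = int(np.argmax(scores))
    best_model = models[best_idx]

    return best_model, scores[best_idx]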

In [28]:
def complete_analysis_pipeline(data, target_col):
    """
    Complete analysis pipeline following all requirements
    """
    results = {}

    # 1. Feature Selection (run once so every missingness level shares one feature set)
    print("Performing feature selection...")
    importance = feature_selection_pipeline(data, target_col)
    selected_features = importance['feature'].tolist()
    y = data[target_col]

    # 2. Analyze different missingness levels
    for missing_pct in [0, 0.2, 0.4]:
        print(f"\nProcessing {missing_pct:.0%} missing data...")

        # Create missing data in the features only; the target stays observed
        data_missing = create_missing_data(data[selected_features], missing_pct)

        # GPLVM imputation
        imputed_data, uncertainty = gplvm_imputation(data_missing)

        # Prepare data for modeling
        X = imputed_data

        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        # Train models
        models, shap_values = train_models(X_train, X_test, y_train, y_test)

        # Train constrained SVM on the training rows and their uncertainty
        best_svm, svm_score = constrained_svm(
            X_train, y_train, uncertainty.loc[X_train.index]
        )

        # Store results
        results[missing_pct] = {
            'feature_importance': importance,
            'imputed_data': imputed_data,
            'uncertainty': uncertainty,
            'models': models,
            'shap_values': shap_values,
            'svm_model': best_svm,
            'svm_score': svm_score
        }

    return results

def visualize_results(results, X, feature_names):
    """
    Visualize results from analysis
    """
    plt.figure(figsize=(15, 5))
    
    # Plot 1: Feature importance across missingness levels
    plt.subplot(131)
    for missing_pct, result in results.items():
        importance = result['feature_importance']
        # Plot by rank position, not by the DataFrame's shuffled index
        plt.plot(importance['importance'].to_numpy(),
                 label=f'{missing_pct:.0%} missing')
    plt.title('Feature Importance Across Missingness Levels')
    plt.xlabel('Feature Rank')
    plt.ylabel('Importance')
    plt.legend()
    
    # Plot 2: SHAP summary plot
    plt.subplot(132)
    if isinstance(results[0]['shap_values']['XGBoost'], list):
        shap_values = results[0]['shap_values']['XGBoost'][0]
    else:
        shap_values = results[0]['shap_values']['XGBoost']
    shap.summary_plot(shap_values, X, feature_names=feature_names, show=False)
    plt.title("SHAP Summary Plot")
    
    # Plot 3: Uncertainty distribution
    plt.subplot(133)
    plt.hist(results[0]['uncertainty'].values.flatten(), bins=50)
    plt.title("Distribution of Imputation Uncertainty")
    plt.xlabel("Uncertainty")
    plt.ylabel("Count")
    
    plt.tight_layout()
    plt.show()