# ICU Prolonged Stay Prediction - Baseline

This notebook demonstrates a simple baseline approach for the ICU Prolonged Stay Prediction challenge.

## Task
Predict whether an ICU patient will have a prolonged stay (>3 days) based on admission features.

## Evaluation Metric
Macro-averaged F1 score: the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many patients fall in each.
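
As a minimal sketch of what macro averaging means, the snippet below computes F1 for each class separately, takes the plain mean, and compares the result with scikit-learn's `f1_score`. The toy labels are invented for illustration and are not challenge data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration (not challenge data)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])


def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = np.sum((y_true == label) & (y_pred == label))
    fp = np.sum((y_true != label) & (y_pred == label))
    fn = np.sum((y_true == label) & (y_pred != label))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Macro F1 is the unweighted mean of the per-class scores
per_class = [f1_for_class(y_true, y_pred, c) for c in (0, 1)]
macro_f1 = float(np.mean(per_class))

print(f"Per-class F1: {np.round(per_class, 4)}")
print(f"Macro F1 (by hand): {macro_f1:.4f}")
print(f"Macro F1 (sklearn): {f1_score(y_true, y_pred, average='macro'):.4f}")
```

Because each class contributes equally to the mean, a model that ignores the minority class is penalized heavily even if its accuracy looks good.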

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
# Silence library warnings to keep the output readable (this also hides ConvergenceWarning)
import warnings
warnings.filterwarnings('ignore')

## 1. Load Data

In [2]:
# Load training and test data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

print(f"Training set: {train_df.shape}")
print(f"Test set: {test_df.shape}")
print(f"\nColumns: {train_df.shape[1]}")

Training set: (1512, 82)
Test set: (1008, 81)

Columns: 82


## 2. Basic EDA

In [3]:
# Check class distribution
print("Class distribution:")
print(train_df['prolonged_stay'].value_counts())
print("\nClass balance:")
print(train_df['prolonged_stay'].value_counts(normalize=True))

Class distribution:
prolonged_stay
0    1183
1     329
Name: count, dtype: int64

Class balance:
prolonged_stay
0    0.782407
1    0.217593
Name: proportion, dtype: float64


In [4]:
# Percentage of missing values per column, highest first
missing_pct = (train_df.isnull().sum() / len(train_df) * 100).sort_values(ascending=False)
print("Top 10 columns with missing values:")
print(missing_pct.head(10))

Top 10 columns with missing values:
temperature_max           93.452381
temperature_mean          93.452381
temperature_min           93.452381
temperature_std           93.452381
systemicdiastolic_std     84.126984
systemicsystolic_std      84.126984
systemicdiastolic_mean    84.060847
systemicdiastolic_min     84.060847
systemicdiastolic_max     84.060847
systemicmean_std          84.060847
dtype: float64


## 3. Prepare Features

In [5]:
# Separate features and target
id_col = 'patientunitstayid'
target_col = 'prolonged_stay'

# Get feature columns (exclude ID and target)
feature_cols = [col for col in train_df.columns if col not in [id_col, target_col]]

X_train = train_df[feature_cols].copy()
y_train = train_df[target_col].copy()
X_test = test_df[feature_cols].copy()
test_ids = test_df[id_col].copy()

print(f"Training features: {X_train.shape}")
print(f"Training labels: {y_train.shape}")
print(f"Test features: {X_test.shape}")

Training features: (1512, 80)
Training labels: (1512,)
Test features: (1008, 80)


In [6]:
# Identify categorical and numerical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X_train.select_dtypes(include=['number']).columns.tolist()

print(f"Categorical columns: {len(categorical_cols)}")
print(f"Numerical columns: {len(numerical_cols)}")
print(f"\nCategorical: {categorical_cols}")

Categorical columns: 2
Numerical columns: 78

Categorical: ['gender', 'ethnicity']


## 4. Preprocessing

Simple preprocessing strategy:
- Drop the two categorical columns (gender, ethnicity) for simplicity
- Impute missing numerical values with the training-set median
- Standardize numerical features to zero mean and unit variance
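
The imputation and scaling steps above can also be chained with the classifier in a scikit-learn `Pipeline`, so that medians and scaling statistics are learned only from the rows each model is fitted on. The sketch below is one possible arrangement, not the notebook's own code: it runs on synthetic data (`X_demo`, `y_demo`), and the sample size, feature count, and missing rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (assumption: not challenge data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0.8).astype(int)
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan  # knock out roughly 20% of values

# Same three steps as the baseline, wrapped in a single estimator
baseline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Each fold refits the imputer and scaler on that fold's training rows only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="f1_macro")

print(f"Macro F1 per fold: {np.round(scores, 3)}")
print(f"Mean macro F1: {scores.mean():.3f}")
```

The scores printed here describe the synthetic data only. On the real data, the same pipeline object could be passed the numerical feature frame and target in place of `X_demo` and `y_demo`.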

In [7]:
# For this baseline, drop categorical columns
X_train_num = X_train[numerical_cols].copy()
X_test_num = X_test[numerical_cols].copy()

print(f"Using {len(numerical_cols)} numerical features")

Using 78 numerical features


In [8]:
# Impute missing values with the median learned from the training set
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_num)
X_test_imputed = imputer.transform(X_test_num)

print(f"Missing values after imputation: {np.isnan(X_train_imputed).sum()}")

Missing values after imputation: 0


In [9]:
# Standardize features using the training-set mean and standard deviation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

print(f"Training data shape: {X_train_scaled.shape}")
print(f"Test data shape: {X_test_scaled.shape}")

Training data shape: (1512, 78)
Test data shape: (1008, 78)


## 5. Train Baseline Model

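We use logistic regression with class_weight='balanced', which weights each class inversely to its frequency to offset the roughly 78/22 class imbalance.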
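
To make the effect of class balancing concrete, the sketch below reproduces the weights that scikit-learn's "balanced" mode derives, `n_samples / (n_classes * count_per_class)`, using the class counts printed in the EDA section (1183 and 329). The label vector `y_counts` is rebuilt from those counts for illustration only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Label vector rebuilt from the class counts printed earlier (illustration only)
y_counts = np.array([0] * 1183 + [1] * 329)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_counts
)

# The same formula written out by hand
manual = len(y_counts) / (2 * np.bincount(y_counts))

print("sklearn:", np.round(weights, 4))
print("manual: ", np.round(manual, 4))
```

With these counts the weights come out to roughly 0.64 for class 0 and 2.30 for class 1, so each prolonged-stay example counts about 3.6 times as much in the loss as a non-prolonged one.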

In [10]:
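# Create validation split for local evaluation
# Note: the imputer and scaler above were fit on all training rows, including
# the validation rows, so the validation score may be slightly optimistic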
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_scaled, 
    y_train, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_train
)

print(f"Train: {X_tr.shape}, Val: {X_val.shape}")

Train: (1209, 78), Val: (303, 78)


In [11]:
# Train Logistic Regression
model = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    class_weight='balanced',
    random_state=42
)

model.fit(X_tr, y_tr)
print("Model trained successfully!")

Model trained successfully!


## 6. Evaluate on Validation Set

In [12]:
# Predict on validation set
y_val_pred = model.predict(X_val)

# Compute metrics
val_f1 = f1_score(y_val, y_val_pred, average='macro')

print(f"Validation F1 (macro): {val_f1:.4f}")
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred, target_names=['Not Prolonged', 'Prolonged']))

Validation F1 (macro): 0.7945

Classification Report:
               precision    recall  f1-score   support

Not Prolonged       0.94      0.86      0.90       237
    Prolonged       0.61      0.80      0.69        66

     accuracy                           0.84       303
    macro avg       0.77      0.83      0.79       303
 weighted avg       0.87      0.84      0.85       303



## 7. Train Final Model on Full Training Set

In [13]:
# Retrain with the same hyperparameters on the full training set
final_model = LogisticRegression(
    max_iter=1000,
    solver='lbfgs',
    class_weight='balanced',
    random_state=42
)

final_model.fit(X_train_scaled, y_train)
print("Final model trained on full training set!")

Final model trained on full training set!


## 8. Make Predictions on Test Set

In [14]:
# Predict on test set
test_predictions = final_model.predict(X_test_scaled)

print(f"Test predictions shape: {test_predictions.shape}")
print("Prediction distribution:")
print(pd.Series(test_predictions).value_counts())

Test predictions shape: (1008,)
Prediction distribution:
0    688
1    320
Name: count, dtype: int64


## 9. Create Submission File

In [15]:
# Create submission dataframe
submission = pd.DataFrame({
    'patientunitstayid': test_ids,
    'prediction': test_predictions
})

# Save to CSV
submission.to_csv('predictions.csv', index=False)

print("Submission file created: predictions.csv")
print("\nSubmission preview:")
print(submission.head(10))

Submission file created: predictions.csv

Submission preview:
   patientunitstayid  prediction
0            3186183           0
1            1718412           0
2             349322           0
3            1318254           1
4            3142950           1
5            2639649           0
6            2349210           0
7             709257           1
8            3133636           1
9            1259283           0


## Next Steps

This baseline scores a macro F1 of **0.7547** on the test set, compared with 0.7945 on the local validation split. You can improve it by:

1. **Better feature engineering**: Use categorical features (one-hot encoding)
2. **Advanced imputation**: KNN imputation, iterative imputation
3. **Feature selection**: Remove low-importance features
4. **Model tuning**: Hyperparameter optimization
5. **Better models**: Random Forest, Gradient Boosting, XGBoost
6. **Ensemble methods**: Combine multiple models
7. **Handle imbalance**: SMOTE, different class weights
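
As a possible starting point for item 1, the sketch below brings the categorical columns back in with a `ColumnTransformer` that one-hot encodes them next to the numerical preprocessing. It is an illustration under assumptions: the small frame `demo` and its values are made up, `feature_a` and `feature_b` are hypothetical column names, and only `gender` and `ethnicity` correspond to columns in the notebook. Missing categories are filled with a placeholder string in pandas before encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame for illustration; 'feature_a' and 'feature_b' are hypothetical names
demo = pd.DataFrame({
    "gender": ["Male", "Female", None, "Female", "Male", "Female", "Male", "Male"],
    "ethnicity": ["Caucasian", "Asian", "Caucasian", None,
                  "Hispanic", "Caucasian", "Asian", "Caucasian"],
    "feature_a": [1.0, np.nan, 3.5, 2.0, 0.5, 4.0, np.nan, 2.5],
    "feature_b": [70.0, 82.0, np.nan, 95.0, 60.0, 88.0, 75.0, 91.0],
})
labels = np.array([0, 0, 1, 1, 0, 1, 0, 1])

categorical = ["gender", "ethnicity"]
numerical = ["feature_a", "feature_b"]

# Treat a missing category as its own level
demo[categorical] = demo[categorical].fillna("Unknown")

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numerical),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

clf = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(demo, labels)

encoded = clf.named_steps["prep"].transform(demo)
preds = clf.predict(demo)
print(f"Encoded feature matrix shape: {encoded.shape}")
print(f"Predictions: {preds}")
```

Whether the extra columns actually help is an empirical question; comparing cross-validated macro F1 with and without them would be one way to check.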

Good luck!