# Patient Recovery Prediction - XGBoost Model
## Advanced Gradient Boosting Approach

**Project Deadline:** October 26th, 11:55 P.M.

**Model:** XGBoost Regressor with Hyperparameter Tuning

**Objective:** Leverage gradient boosting for improved predictions

---

## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder

# XGBoost
try:
    import xgboost as xgb
    from xgboost import XGBRegressor
    print("✅ XGBoost imported successfully!")
    print(f"   XGBoost version: {xgb.__version__}")
except ImportError:
    print("❌ XGBoost not found. Installing...")
    import sys
    !{sys.executable} -m pip install xgboost
    import xgboost as xgb
    from xgboost import XGBRegressor
    print("✅ XGBoost installed and imported!")

# Machine Learning - Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)
np.random.seed(42)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("\n✅ All libraries imported successfully!")
print(f"Numpy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Load and Prepare Data

In [None]:
# Load datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')

print("📊 Dataset loaded successfully!")
print(f"\nTraining set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"\nFeatures: {train_df.columns.tolist()}")
print(f"\nFirst few rows:")
display(train_df.head())

In [None]:
# Quick data summary
print("=" * 80)
print("DATA SUMMARY")
print("=" * 80)
print(f"\nMissing values:\n{train_df.isnull().sum()}")
print(f"\nDuplicate rows: {train_df.duplicated().sum()}")
print(f"\nBasic statistics:")
display(train_df.describe())

## 3. Data Preprocessing

In [None]:
# Prepare features and target
X = train_df.drop(['Id', 'Recovery Index'], axis=1).copy()
y = train_df['Recovery Index'].copy()

# Encode categorical variable
label_encoder = LabelEncoder()
X['Lifestyle Activities'] = label_encoder.fit_transform(X['Lifestyle Activities'])

print("=" * 80)
print("FEATURE PREPARATION")
print("=" * 80)
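# Report the mapping the encoder actually learned instead of assuming No=0, Yes=1
encoding_map = {label: code for code, label in enumerate(label_encoder.classes_)}
print(f"✅ Lifestyle Activities encoded: {encoding_map}")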
print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"\nFeature columns: {X.columns.tolist()}")
print(f"\nTarget statistics:")
print(f"  Mean: {y.mean():.2f}")
print(f"  Std: {y.std():.2f}")
print(f"  Range: [{y.min()}, {y.max()}]")

In [None]:
# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("=" * 80)
print("TRAIN-VALIDATION SPLIT")
print("=" * 80)
print(f"Training set: {X_train.shape[0]} samples ({(X_train.shape[0]/X.shape[0])*100:.1f}%)")
print(f"Validation set: {X_val.shape[0]} samples ({(X_val.shape[0]/X.shape[0])*100:.1f}%)")
print(f"\nTraining target statistics:")
print(f"  Mean: {y_train.mean():.2f}")
print(f"  Std: {y_train.std():.2f}")

## 4. Baseline XGBoost Model (Default Parameters)

In [None]:
# Train baseline XGBoost with default parameters
print("=" * 80)
print("BASELINE XGBOOST MODEL (Default Parameters)")
print("=" * 80)

xgb_baseline = XGBRegressor(
    random_state=42,
    n_jobs=-1,
    verbosity=0
)

xgb_baseline.fit(X_train, y_train)

# Make predictions
y_train_pred_baseline = xgb_baseline.predict(X_train)
y_val_pred_baseline = xgb_baseline.predict(X_val)

# Calculate metrics
train_r2_baseline = r2_score(y_train, y_train_pred_baseline)
val_r2_baseline = r2_score(y_val, y_val_pred_baseline)
train_rmse_baseline = np.sqrt(mean_squared_error(y_train, y_train_pred_baseline))
val_rmse_baseline = np.sqrt(mean_squared_error(y_val, y_val_pred_baseline))
train_mae_baseline = mean_absolute_error(y_train, y_train_pred_baseline)
val_mae_baseline = mean_absolute_error(y_val, y_val_pred_baseline)

print(f"\n{'Metric':<20} {'Training':>15} {'Validation':>15}")
print("-" * 80)
print(f"{'R² Score':<20} {train_r2_baseline:>15.4f} {val_r2_baseline:>15.4f}")
print(f"{'RMSE':<20} {train_rmse_baseline:>15.4f} {val_rmse_baseline:>15.4f}")
print(f"{'MAE':<20} {train_mae_baseline:>15.4f} {val_mae_baseline:>15.4f}")
print("=" * 80)

# Check for overfitting (a positive gap means training R² exceeds validation R²)
overfitting_diff = train_r2_baseline - val_r2_baseline
if overfitting_diff < 0.05:
    print(f"\n✅ Model is well-balanced (R² difference: {overfitting_diff:.4f})")
else:
    print(f"\n⚠️ Potential overfitting detected (R² difference: {overfitting_diff:.4f})")

## 5. Feature Importance Analysis (Baseline)

In [None]:
# Get feature importance
feature_importance_baseline = pd.DataFrame({
    'Feature': X.columns,
    'Importance': xgb_baseline.feature_importances_
}).sort_values('Importance', ascending=False)

print("=" * 80)
print("FEATURE IMPORTANCE (Baseline XGBoost)")
print("=" * 80)
print(feature_importance_baseline)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_baseline['Feature'], feature_importance_baseline['Importance'], color='teal')
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Importance - XGBoost (Default)', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## 6. Hyperparameter Tuning with RandomizedSearchCV

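RandomizedSearchCV samples a fixed number of parameter combinations (`n_iter=100` below) instead of evaluating the full grid, which keeps tuning time manageable when the parameter space is large.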
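
To make the efficiency argument concrete, the sketch below counts how many combinations the grid in the next cell contains and how many model fits each strategy would need. The grid sizes are copied by hand from `param_distributions`. The "top 5%" probability is only a rough guide: it assumes independent, uniform draws.

```python
from math import prod

# Number of candidate values per parameter, copied from param_distributions.
grid_sizes = {
    "n_estimators": 4,
    "max_depth": 5,
    "learning_rate": 4,
    "subsample": 5,
    "colsample_bytree": 5,
    "min_child_weight": 4,
    "gamma": 5,
    "reg_alpha": 4,
    "reg_lambda": 4,
}

n_iter = 100   # random combinations tried by the search
cv_folds = 5   # folds per combination

total_combinations = prod(grid_sizes.values())
fits_random = n_iter * cv_folds
fits_exhaustive = total_combinations * cv_folds
fraction_sampled = n_iter / total_combinations

# Assumption: draws are independent and uniform over the grid.
# Probability that at least one draw lands in the best 5% of combinations.
p_hit_top_5pct = 1 - 0.95 ** n_iter

print(f"Total combinations:      {total_combinations:,}")
print(f"Fits (randomized, CV):   {fits_random:,}")
print(f"Fits (exhaustive, CV):   {fits_exhaustive:,}")
print(f"Fraction of grid tried:  {fraction_sampled:.4%}")
print(f"P(hit top 5% at least once): {p_hit_top_5pct:.3f}")
```

The search visits only a tiny fraction of the grid, yet under the stated assumption it is very likely to land on at least one combination from the best-performing slice.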

In [None]:
# Define parameter distribution for RandomizedSearchCV
param_distributions = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [3, 5, 7, 9, 11],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5, 7],
    'gamma': [0, 0.1, 0.2, 0.3, 0.5],
    'reg_alpha': [0, 0.01, 0.1, 1],
    'reg_lambda': [0.1, 1, 5, 10]
}

print("=" * 80)
print("HYPERPARAMETER TUNING WITH RANDOMIZED SEARCH")
print("=" * 80)
print(f"\nParameter distributions:")
for param, values in param_distributions.items():
    print(f"  {param}: {values}")

print(f"\nRandomized search will test 100 random combinations")
print(f"This is more efficient than testing all combinations!")
print("\n🔄 Starting Randomized Search (this may take several minutes)...")

In [None]:
# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=XGBRegressor(random_state=42, n_jobs=1, verbosity=0),  # the search-level n_jobs=-1 below already parallelizes across fits
    param_distributions=param_distributions,
    n_iter=100,  # Number of random combinations to try
    cv=5,
    scoring='r2',
    verbose=1,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

print("\n" + "=" * 80)
print("RANDOMIZED SEARCH RESULTS")
print("=" * 80)
print(f"\n✅ Randomized Search completed!")
print(f"\nBest parameters:")
for param, value in random_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest cross-validation R² score: {random_search.best_score_:.4f}")

In [None]:
# Get top 10 parameter combinations
cv_results = pd.DataFrame(random_search.cv_results_)
top_10 = cv_results.nlargest(10, 'mean_test_score')[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]

print("\n" + "=" * 80)
print("TOP 10 PARAMETER COMBINATIONS")
print("=" * 80)
for idx, row in top_10.iterrows():
    print(f"\nRank {int(row['rank_test_score'])}:")
    print(f"  Mean R² Score: {row['mean_test_score']:.4f} (+/- {row['std_test_score']:.4f})")
    print(f"  Parameters: {row['params']}")

## 7. Evaluate Tuned XGBoost Model

In [None]:
# Get the best model
xgb_tuned = random_search.best_estimator_

# Make predictions
y_train_pred_tuned = xgb_tuned.predict(X_train)
y_val_pred_tuned = xgb_tuned.predict(X_val)

# Calculate metrics
train_r2_tuned = r2_score(y_train, y_train_pred_tuned)
val_r2_tuned = r2_score(y_val, y_val_pred_tuned)
train_rmse_tuned = np.sqrt(mean_squared_error(y_train, y_train_pred_tuned))
val_rmse_tuned = np.sqrt(mean_squared_error(y_val, y_val_pred_tuned))
train_mae_tuned = mean_absolute_error(y_train, y_train_pred_tuned)
val_mae_tuned = mean_absolute_error(y_val, y_val_pred_tuned)

print("=" * 80)
print("TUNED XGBOOST MODEL PERFORMANCE")
print("=" * 80)
print(f"\n{'Metric':<20} {'Training':>15} {'Validation':>15}")
print("-" * 80)
print(f"{'R² Score':<20} {train_r2_tuned:>15.4f} {val_r2_tuned:>15.4f}")
print(f"{'RMSE':<20} {train_rmse_tuned:>15.4f} {val_rmse_tuned:>15.4f}")
print(f"{'MAE':<20} {train_mae_tuned:>15.4f} {val_mae_tuned:>15.4f}")
print("=" * 80)

# Check for overfitting (a positive gap means training R² exceeds validation R²)
overfitting_diff_tuned = train_r2_tuned - val_r2_tuned
if overfitting_diff_tuned < 0.05:
    print(f"\n✅ Model is well-balanced (R² difference: {overfitting_diff_tuned:.4f})")
else:
    print(f"\n⚠️ Potential overfitting detected (R² difference: {overfitting_diff_tuned:.4f})")

## 8. Model Comparison: Baseline vs Tuned

In [None]:
# Compare models
comparison_df = pd.DataFrame({
    'Model': ['XGB Baseline', 'XGB Tuned'],
    'Train R²': [train_r2_baseline, train_r2_tuned],
    'Val R²': [val_r2_baseline, val_r2_tuned],
    'Train RMSE': [train_rmse_baseline, train_rmse_tuned],
    'Val RMSE': [val_rmse_baseline, val_rmse_tuned],
    'Train MAE': [train_mae_baseline, train_mae_tuned],
    'Val MAE': [val_mae_baseline, val_mae_tuned]
})

print("=" * 80)
print("MODEL COMPARISON: BASELINE vs TUNED")
print("=" * 80)
display(comparison_df)

# Calculate improvement
r2_improvement = ((val_r2_tuned - val_r2_baseline) / val_r2_baseline) * 100
rmse_improvement = ((val_rmse_baseline - val_rmse_tuned) / val_rmse_baseline) * 100

print(f"\n📈 Improvement with tuning:")
print(f"  Validation R² improvement: {r2_improvement:+.2f}%")
print(f"  Validation RMSE improvement: {rmse_improvement:+.2f}%")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

metrics = ['Val R²', 'Val RMSE', 'Val MAE']
baseline_vals = [val_r2_baseline, val_rmse_baseline, val_mae_baseline]
tuned_vals = [val_r2_tuned, val_rmse_tuned, val_mae_tuned]

for idx, (metric, baseline_val, tuned_val) in enumerate(zip(metrics, baseline_vals, tuned_vals)):
    # Use dedicated names so the target vector `y` from Section 3 is not overwritten
    model_labels = ['Baseline XGB', 'Tuned XGB']
    metric_values = [baseline_val, tuned_val]
    colors = ['lightcoral', 'darkgreen']
    
    bars = axes[idx].bar(model_labels, metric_values, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)
    axes[idx].set_ylabel(metric, fontsize=11)
    axes[idx].set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    axes[idx].grid(alpha=0.3, axis='y')
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        axes[idx].text(bar.get_x() + bar.get_width()/2., height,
                      f'{height:.4f}',
                      ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## 9. Learning Curves

In [None]:
# Plot learning curves for tuned model
eval_set = [(X_train, y_train), (X_val, y_val)]

# Train a new model with early stopping to get learning curves
xgb_learning = XGBRegressor(**xgb_tuned.get_params())
xgb_learning.set_params(early_stopping_rounds=20, eval_metric='rmse')

xgb_learning.fit(
    X_train, y_train,
    eval_set=eval_set,
    verbose=False
)

# Get results
results = xgb_learning.evals_result()

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(results['validation_0']['rmse'], label='Training RMSE', linewidth=2)
plt.plot(results['validation_1']['rmse'], label='Validation RMSE', linewidth=2)
plt.xlabel('Boosting Round', fontsize=11)
plt.ylabel('RMSE', fontsize=11)
plt.title('Learning Curve - RMSE', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
epochs = len(results['validation_0']['rmse'])
x_axis = range(0, epochs)
plt.plot(x_axis, results['validation_0']['rmse'], label='Training', linewidth=2, color='blue')
plt.plot(x_axis, results['validation_1']['rmse'], label='Validation', linewidth=2, color='orange')
plt.xlabel('Boosting Round', fontsize=11)
plt.ylabel('RMSE', fontsize=11)
plt.title('Convergence Analysis', fontsize=12, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nBest iteration: {xgb_learning.best_iteration}")
print(f"Best validation RMSE: {results['validation_1']['rmse'][xgb_learning.best_iteration]:.4f}")

## 10. Visualization: Actual vs Predicted

In [None]:
# Plot Actual vs Predicted for Tuned Model
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set
axes[0].scatter(y_train, y_train_pred_tuned, alpha=0.5, s=30, color='blue', edgecolors='black', linewidth=0.3)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 
            'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Recovery Index', fontsize=12)
axes[0].set_ylabel('Predicted Recovery Index', fontsize=12)
axes[0].set_title(f'Training Set: Actual vs Predicted (Tuned XGBoost)\nR² = {train_r2_tuned:.4f}', 
                 fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Validation set
axes[1].scatter(y_val, y_val_pred_tuned, alpha=0.5, s=30, color='green', edgecolors='black', linewidth=0.3)
axes[1].plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], 
            'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Recovery Index', fontsize=12)
axes[1].set_ylabel('Predicted Recovery Index', fontsize=12)
axes[1].set_title(f'Validation Set: Actual vs Predicted (Tuned XGBoost)\nR² = {val_r2_tuned:.4f}', 
                 fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 11. Residual Analysis

In [None]:
# Calculate residuals
residuals_train_tuned = y_train - y_train_pred_tuned
residuals_val_tuned = y_val - y_val_pred_tuned

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Training residuals vs predicted
axes[0, 0].scatter(y_train_pred_tuned, residuals_train_tuned, alpha=0.5, s=30, 
                   color='blue', edgecolors='black', linewidth=0.3)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Predicted Recovery Index', fontsize=11)
axes[0, 0].set_ylabel('Residuals', fontsize=11)
axes[0, 0].set_title('Training Set: Residual Plot (Tuned XGBoost)', fontsize=12, fontweight='bold')
axes[0, 0].grid(alpha=0.3)

# Validation residuals vs predicted
axes[0, 1].scatter(y_val_pred_tuned, residuals_val_tuned, alpha=0.5, s=30, 
                   color='green', edgecolors='black', linewidth=0.3)
axes[0, 1].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 1].set_xlabel('Predicted Recovery Index', fontsize=11)
axes[0, 1].set_ylabel('Residuals', fontsize=11)
axes[0, 1].set_title('Validation Set: Residual Plot (Tuned XGBoost)', fontsize=12, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Histogram of training residuals
axes[1, 0].hist(residuals_train_tuned, bins=50, color='blue', edgecolor='black', alpha=0.7)
axes[1, 0].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Residuals', fontsize=11)
axes[1, 0].set_ylabel('Frequency', fontsize=11)
axes[1, 0].set_title('Training Set: Residual Distribution (Tuned XGBoost)', fontsize=12, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# Histogram of validation residuals
axes[1, 1].hist(residuals_val_tuned, bins=50, color='green', edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1, 1].set_xlabel('Residuals', fontsize=11)
axes[1, 1].set_ylabel('Frequency', fontsize=11)
axes[1, 1].set_title('Validation Set: Residual Distribution (Tuned XGBoost)', fontsize=12, fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Residual statistics
print("=" * 80)
print("RESIDUAL STATISTICS")
print("=" * 80)
print(f"\nTraining Set:")
print(f"  Mean residual: {residuals_train_tuned.mean():.4f}")
print(f"  Std residual: {residuals_train_tuned.std():.4f}")
print(f"\nValidation Set:")
print(f"  Mean residual: {residuals_val_tuned.mean():.4f}")
print(f"  Std residual: {residuals_val_tuned.std():.4f}")

## 12. Feature Importance (Tuned Model)

In [None]:
# Get feature importance from tuned model
feature_importance_tuned = pd.DataFrame({
    'Feature': X.columns,
    'Importance': xgb_tuned.feature_importances_
}).sort_values('Importance', ascending=False)

print("=" * 80)
print("FEATURE IMPORTANCE (Tuned XGBoost)")
print("=" * 80)
print(feature_importance_tuned)

# Visualize feature importance (multiple types)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Weight importance
xgb.plot_importance(xgb_tuned, ax=axes[0], importance_type='weight', max_num_features=10)
axes[0].set_title('Feature Importance (Weight)', fontsize=12, fontweight='bold')

# Gain importance
xgb.plot_importance(xgb_tuned, ax=axes[1], importance_type='gain', max_num_features=10)
axes[1].set_title('Feature Importance (Gain)', fontsize=12, fontweight='bold')

# Cover importance
xgb.plot_importance(xgb_tuned, ax=axes[2], importance_type='cover', max_num_features=10)
axes[2].set_title('Feature Importance (Cover)', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 13. Cross-Validation Analysis

In [None]:
# Perform 10-fold cross-validation on tuned model
from sklearn.model_selection import cross_validate

print("=" * 80)
print("10-FOLD CROSS-VALIDATION (Tuned Model)")
print("=" * 80)
print("🔄 Running cross-validation...")

cv_results = cross_validate(
    xgb_tuned, 
    X_train, 
    y_train, 
    cv=10, 
    scoring=['r2', 'neg_mean_squared_error', 'neg_mean_absolute_error'],
    return_train_score=True,
    n_jobs=-1
)

# Calculate RMSE from MSE
cv_train_rmse = np.sqrt(-cv_results['train_neg_mean_squared_error'])
cv_test_rmse = np.sqrt(-cv_results['test_neg_mean_squared_error'])

print(f"\n✅ Cross-validation completed!")
print(f"\nR² Score:")
print(f"  Train: {cv_results['train_r2'].mean():.4f} (+/- {cv_results['train_r2'].std() * 2:.4f})")
print(f"  Test:  {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std() * 2:.4f})")
print(f"\nRMSE:")
print(f"  Train: {cv_train_rmse.mean():.4f} (+/- {cv_train_rmse.std() * 2:.4f})")
print(f"  Test:  {cv_test_rmse.mean():.4f} (+/- {cv_test_rmse.std() * 2:.4f})")
print(f"\nMAE:")
print(f"  Train: {-cv_results['train_neg_mean_absolute_error'].mean():.4f} (+/- {cv_results['train_neg_mean_absolute_error'].std() * 2:.4f})")
print(f"  Test:  {-cv_results['test_neg_mean_absolute_error'].mean():.4f} (+/- {cv_results['test_neg_mean_absolute_error'].std() * 2:.4f})")

In [None]:
# Visualize cross-validation scores
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# R² scores
axes[0].boxplot([cv_results['train_r2'], cv_results['test_r2']], 
                labels=['Train', 'Test'], patch_artist=True,
                boxprops=dict(facecolor='lightblue'))
axes[0].set_ylabel('R² Score', fontsize=11)
axes[0].set_title('Cross-Validation R² Scores', fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3, axis='y')

# RMSE
axes[1].boxplot([cv_train_rmse, cv_test_rmse], 
                labels=['Train', 'Test'], patch_artist=True,
                boxprops=dict(facecolor='lightcoral'))
axes[1].set_ylabel('RMSE', fontsize=11)
axes[1].set_title('Cross-Validation RMSE', fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3, axis='y')

# MAE
axes[2].boxplot([-cv_results['train_neg_mean_absolute_error'], 
                 -cv_results['test_neg_mean_absolute_error']], 
                labels=['Train', 'Test'], patch_artist=True,
                boxprops=dict(facecolor='lightgreen'))
axes[2].set_ylabel('MAE', fontsize=11)
axes[2].set_title('Cross-Validation MAE', fontsize=12, fontweight='bold')
axes[2].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 14. Predictions on Test Set

In [None]:
# Prepare test data
X_test = test_df.drop('Id', axis=1).copy()
X_test['Lifestyle Activities'] = label_encoder.transform(X_test['Lifestyle Activities'])

# Make predictions using tuned model
test_predictions = xgb_tuned.predict(X_test)

print("=" * 80)
print("TEST SET PREDICTIONS (Tuned XGBoost)")
print("=" * 80)
print(f"✅ Predictions generated for {len(test_predictions)} test samples")
print(f"\nPrediction statistics:")
print(f"  Mean: {test_predictions.mean():.2f}")
print(f"  Std: {test_predictions.std():.2f}")
print(f"  Min: {test_predictions.min():.2f}")
print(f"  Max: {test_predictions.max():.2f}")
print(f"  Median: {np.median(test_predictions):.2f}")

In [None]:
# Visualize prediction distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(test_predictions, bins=30, color='purple', edgecolor='black', alpha=0.7)
axes[0].axvline(test_predictions.mean(), color='red', linestyle='--', 
               linewidth=2, label=f'Mean: {test_predictions.mean():.2f}')
axes[0].axvline(np.median(test_predictions), color='green', linestyle='--', 
               linewidth=2, label=f'Median: {np.median(test_predictions):.2f}')
axes[0].set_xlabel('Predicted Recovery Index', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Test Predictions (Tuned XGBoost)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Box plot
axes[1].boxplot(test_predictions, vert=True, patch_artist=True,
                boxprops=dict(facecolor='plum', color='purple'),  # 'lightpurple' is not a valid Matplotlib color name
                medianprops=dict(color='red', linewidth=2))
axes[1].set_ylabel('Predicted Recovery Index', fontsize=12)
axes[1].set_title('Box Plot of Test Predictions', fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Create submission file
submission = pd.DataFrame({
    'Id': test_df['Id'],
    'Recovery Index': test_predictions
})

# Save to CSV
submission.to_csv('xgboost_tuned_submission.csv', index=False)

print("=" * 80)
print("SUBMISSION FILE CREATED")
print("=" * 80)
print("✅ File saved: xgboost_tuned_submission.csv")
print(f"\nFirst 10 predictions:")
display(submission.head(10))
print(f"\nLast 10 predictions:")
display(submission.tail(10))

## 15. Model Summary and Insights

In [None]:
# Create comprehensive summary
print("=" * 80)
print("XGBOOST MODEL - COMPREHENSIVE SUMMARY")
print("=" * 80)

print("\n📊 MODEL CONFIGURATION:")
print(f"  Model Type: XGBoost Regressor")
print(f"  Best Parameters:")
for param, value in random_search.best_params_.items():
    print(f"    - {param}: {value}")

print("\n📈 PERFORMANCE METRICS:")
print(f"\n  Baseline XGBoost:")
print(f"    - Training R²: {train_r2_baseline:.4f}")
print(f"    - Validation R²: {val_r2_baseline:.4f}")
print(f"    - Validation RMSE: {val_rmse_baseline:.4f}")
print(f"    - Validation MAE: {val_mae_baseline:.4f}")

print(f"\n  Tuned XGBoost:")
print(f"    - Training R²: {train_r2_tuned:.4f}")
print(f"    - Validation R²: {val_r2_tuned:.4f}")
print(f"    - Validation RMSE: {val_rmse_tuned:.4f}")
print(f"    - Validation MAE: {val_mae_tuned:.4f}")

print(f"\n  10-Fold Cross-Validation (Tuned):")
print(f"    - Mean Test R²: {cv_results['test_r2'].mean():.4f} (+/- {cv_results['test_r2'].std() * 2:.4f})")
print(f"    - Mean Test RMSE: {cv_test_rmse.mean():.4f} (+/- {cv_test_rmse.std() * 2:.4f})")

print(f"\n🔍 TOP 3 MOST IMPORTANT FEATURES:")
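# Number by rank; the DataFrame index still holds the pre-sort column positions
for rank, (_, row) in enumerate(feature_importance_tuned.head(3).iterrows(), start=1):
    print(f"  {rank}. {row['Feature']}: {row['Importance']:.4f}")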

print(f"\n📊 TEST SET PREDICTIONS:")
print(f"  Number of predictions: {len(test_predictions)}")
print(f"  Prediction range: [{test_predictions.min():.2f}, {test_predictions.max():.2f}]")
print(f"  Prediction mean: {test_predictions.mean():.2f}")

print("\n" + "=" * 80)

---

## 16. Key Findings & Next Steps

### 🎯 Key Findings

1. **XGBoost Performance:**
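   - Gradient-boosted trees often outperform Random Forest and Linear Regression on tabular data, but this notebook does not benchmark those models, so verify the claim on this dataset
   - Tree ensembles capture non-linear relationships without manual feature transformations
   - Built-in regularization (`gamma`, `reg_alpha`, `reg_lambda`) helps limit overfitting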

2. **Hyperparameter Tuning:**
   - RandomizedSearchCV efficiently explores large parameter spaces
   - Key parameters: learning_rate, max_depth, n_estimators
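   - Early stopping (used here only for the learning curves in Section 9) helps limit overfitting by halting training once validation RMSE stops improving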

3. **Feature Importance:**
   - XGBoost provides multiple importance metrics (weight, gain, cover)
   - `Initial Health Score` and `Therapy Hours` are expected to dominate; confirm this against the importance tables printed in Sections 5 and 12
   - Understanding feature interactions is crucial

### 🚀 Next Steps for Further Improvement

1. **Try Other Boosting Algorithms:**
   - **LightGBM** (faster, histogram-based)
   - **CatBoost** (handles categorical features natively)
   - **Gradient Boosting** (sklearn implementation)

2. **Advanced Feature Engineering:**
   - Create interaction terms (e.g., `Therapy Hours × Initial Health Score`)
   - Polynomial features of degree 2-3
   - Ratio features (e.g., `Therapy Hours / Follow-Up Sessions`)
   - Domain-specific transformations

3. **Ensemble Strategies:**
   - **Stacking**: Use XGBoost + Random Forest + Linear models
   - **Voting**: Weight-averaged predictions
   - **Blending**: Multiple model combinations

4. **Further Hyperparameter Optimization:**
   - Use **Optuna** or **Hyperopt** for Bayesian optimization
   - Fine-tune learning rate schedule
   - Experiment with different evaluation metrics

5. **Model Analysis:**
   - SHAP values for model interpretability
   - Partial dependence plots
   - Error analysis on specific data segments
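
As a starting point for the feature engineering ideas above, the sketch below adds one interaction term and one ratio feature. The helper `add_engineered_features` and the three demo rows are hypothetical. The column names are taken from the bullets above and are assumed to match the dataset. Zero session counts are turned into `NaN` before dividing, which avoids infinite values, and XGBoost accepts `NaN` inputs.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from train.csv.
demo = pd.DataFrame({
    "Therapy Hours": [5, 8, 2],
    "Initial Health Score": [60, 85, 40],
    "Follow-Up Sessions": [2, 0, 4],
})


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an interaction term and a ratio feature."""
    out = df.copy()
    # Interaction term: Therapy Hours x Initial Health Score
    out["Therapy x Health"] = out["Therapy Hours"] * out["Initial Health Score"]
    # Ratio feature: mask zero denominators as NaN to avoid division by zero
    sessions = out["Follow-Up Sessions"].where(out["Follow-Up Sessions"] != 0)
    out["Therapy per Session"] = out["Therapy Hours"] / sessions
    return out


engineered = add_engineered_features(demo)
print(engineered)
```

The same function should be applied to both the training and test frames so that the model sees identical columns at prediction time.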

### 💡 XGBoost Advantages

- ✅ **Speed**: Optimized C++ implementation
- ✅ **Performance**: Strong accuracy on tabular data
- ✅ **Regularization**: Built-in L1/L2 regularization (`reg_alpha`, `reg_lambda`)
- ✅ **Handling missing values**: Learns a default split direction for missing entries
- ✅ **Feature importance**: Multiple importance metrics
- ✅ **Flexibility**: Extensive hyperparameter options

### 📝 Documentation Tips

- Compare XGBoost with Random Forest and Linear Regression
- Document the impact of different hyperparameters
- Show learning curves and convergence
- Explain feature importance insights
- Discuss regularization benefits

**Remember:** XGBoost is often a top performer in Kaggle competitions! Combine it with good feature engineering for best results. 🏆