# Day 2 (Part 2): Model Selection & Tuning (R Version)

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, you'll learn how to systematically improve your ML models through:
- Cross-validation for robust evaluation
- Hyperparameter tuning (grid search, random search)
- Feature importance analysis
- Final model selection

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sysylvia/ethiopia-ds-workshop-2026/blob/main/notebooks/03-model-tuning-r.ipynb)

## Setup

::: {.callout-important}
## Google Colab R Runtime
Make sure you're using the R runtime: **Runtime -> Change runtime type -> R**
:::

In [None]:
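# Install any missing packages (run once in Colab)
# Checking each package separately avoids skipping ranger, xgboost, or vip
# when tidymodels happens to be installed already
pkgs <- c("tidymodels", "tidyverse", "ranger", "xgboost", "vip")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}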

# Load packages
library(tidymodels)
library(tidyverse)
library(vip)

# Settings
set.seed(42)
theme_set(theme_minimal())

cat("Packages loaded!\n")

## Part 1: Load Data

We'll use the same supply chain dataset from the previous notebook.

In [None]:
# Load the supply chain dataset
url <- "https://raw.githubusercontent.com/sysylvia/ethiopia-ds-workshop-2026/main/data/supply-chain-sample.csv"

# tryCatch() returns the value of whichever branch runs, so assign its result directly
df <- tryCatch({
  data <- read_csv(url, show_col_types = FALSE)
  cat("Data loaded from GitHub!\n")
  data
}, error = function(e) {
  # Fall back to simulated data if the URL is not available
  cat("Creating sample data...\n")
  set.seed(42)
  n <- 1000

  tibble(
    region = sample(c('Addis Ababa', 'Oromia', 'Amhara', 'SNNP', 'Tigray'), n, replace = TRUE),
    facility_type = sample(c('Hospital', 'Health Center', 'Clinic'), n, replace = TRUE, prob = c(0.2, 0.5, 0.3)),
    season = sample(c('dry', 'rainy'), n, replace = TRUE),
    population_served = round(runif(n, 5000, 100000)),
    month = sample(1:12, n, replace = TRUE),
    previous_demand = round(100 + rnorm(n, 0, 30)),
    distance_to_warehouse = round(runif(n, 10, 500)),
    stockout_last_month = sample(0:1, n, replace = TRUE, prob = c(0.8, 0.2)),
    avg_delivery_days = round(runif(n, 3, 21)),
    storage_capacity = round(runif(n, 500, 5000)),
    # Simulated demand depends only on facility type and previous demand, plus noise
    actual_demand = round(100 +
      ifelse(facility_type == 'Hospital', 80, ifelse(facility_type == 'Health Center', 40, 0)) +
      0.3 * previous_demand +
      rnorm(n, 0, 20))
  )
})

cat("Data shape:", nrow(df), "rows x", ncol(df), "columns\n")
cat("Columns:", paste(names(df), collapse = ", "), "\n")
head(df)

In [None]:
# Prepare features
# Convert categorical to factors for proper handling
df <- df %>%
  mutate(
    region = factor(region),
    facility_type = factor(facility_type),
    season = factor(season)
  )

cat("Features prepared\n")

In [None]:
# Split data
set.seed(42)
data_split <- initial_split(df, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

cat("Training set:", nrow(train_data), "samples\n")
cat("Test set:", nrow(test_data), "samples\n")

## Part 2: Why Tune Models?

### Hyperparameters vs Learned Parameters

| Type | What it is | Examples | How set? |
|------|-----------|----------|----------|
| **Learned parameters** | Values learned from data | Regression coefficients, tree splits | Training algorithm |
| **Hyperparameters** | Choices that control learning | Number of trees, learning rate, max depth | You choose! |

**Key insight:** Poorly chosen hyperparameters lead to underfitting (the model is too simple to capture the signal) or overfitting (the model is flexible enough to memorize noise in the training data).
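
The distinction can be sketched with base R on simulated data (not the workshop dataset). Here the polynomial `degree` plays the role of a hyperparameter that we choose, while `lm()` learns the coefficients. The data, the degrees tried, and the helper `rmse_of()` are all illustrative assumptions.

```r
# Illustrative only: simulated data, not the supply chain dataset
set.seed(42)
n <- 60
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

# Hold out 20 rows to measure error on data the model has not seen
train_idx <- sample(n, 40)
sim_train <- sim[train_idx, ]
sim_test  <- sim[-train_idx, ]

# Hypothetical helper: root mean squared error of a fitted model on a data frame
rmse_of <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# degree is the hyperparameter (we pick it); lm() learns the coefficients
degrees <- c(1, 3, 12)
fit_results <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = sim_train)
  data.frame(
    degree = d,
    train_rmse = rmse_of(fit, sim_train),
    test_rmse = rmse_of(fit, sim_test)
  )
}))
print(fit_results)
```

Training error can only fall as the degree increases, because each model contains the simpler ones. The test error column is what shows whether the extra flexibility actually helped.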

## Part 3: Cross-Validation Deep Dive

Cross-validation gives us a more robust estimate of model performance than a single train/test split.

### K-Fold Cross-Validation

```
Fold 1: [TEST] [train] [train] [train] [train]
Fold 2: [train] [TEST] [train] [train] [train]
Fold 3: [train] [train] [TEST] [train] [train]
Fold 4: [train] [train] [train] [TEST] [train]
Fold 5: [train] [train] [train] [train] [TEST]

Final score = average of all 5 folds
```
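
The diagram above can be written out by hand in a few lines of base R. This is a sketch of the idea on simulated data with a linear model, not what `vfold_cv()` and `fit_resamples()` do internally; the variable names are illustrative.

```r
# Illustrative only: manual 5-fold CV on simulated data with a linear model
set.seed(42)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 2 * sim$x + rnorm(n)

k <- 5
# Randomly assign each row to exactly one of k folds
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  train <- sim[fold_id != i, ]   # the four [train] blocks
  test  <- sim[fold_id == i, ]   # the one [TEST] block
  fit <- lm(y ~ x, data = train)
  sqrt(mean((test$y - predict(fit, newdata = test))^2))
})

# Final score = average across folds; the spread shows how stable it is
cv_summary <- c(mean = mean(fold_rmse), sd = sd(fold_rmse))
print(round(fold_rmse, 3))
print(round(cv_summary, 3))
```

Every row is used for testing exactly once and for training four times, which is why the averaged score is less sensitive to one lucky or unlucky split.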

In [None]:
# Define recipe for tree models
tree_recipe <- recipe(actual_demand ~ population_served + month + previous_demand + 
                                      distance_to_warehouse + stockout_last_month + 
                                      avg_delivery_days + storage_capacity + 
                                      region + facility_type + season, 
                     data = train_data)

# Baseline: Random Forest with default settings
rf_default_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_default_wf <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(rf_default_spec)

# 5-fold cross-validation
set.seed(42)
cv_folds <- vfold_cv(train_data, v = 5)

# Fit and evaluate across folds
cv_results <- fit_resamples(
  rf_default_wf,
  resamples = cv_folds,
  metrics = metric_set(rmse, rsq)
)

# Collect metrics
cv_metrics <- collect_metrics(cv_results)

cat("5-Fold Cross-Validation Results:\n")
print(cv_metrics)

In [None]:
# Visualize cross-validation results per fold
cv_rmse_per_fold <- cv_results %>%
  collect_metrics(summarize = FALSE) %>%
  filter(.metric == "rmse")

mean_rmse <- mean(cv_rmse_per_fold$.estimate)
sd_rmse <- sd(cv_rmse_per_fold$.estimate)

ggplot(cv_rmse_per_fold, aes(x = id, y = .estimate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_hline(yintercept = mean_rmse, color = "red", linetype = "dashed") +
  geom_ribbon(aes(ymin = mean_rmse - sd_rmse, ymax = mean_rmse + sd_rmse, group = 1), 
              fill = "red", alpha = 0.1) +
  labs(
    x = "Fold",
    y = "RMSE",
    title = "Cross-Validation Performance Across Folds",
    subtitle = sprintf("Mean RMSE: %.2f (+/- %.2f)", mean_rmse, sd_rmse)
  ) +
  theme_minimal()

## Part 4: Grid Search for Hyperparameter Tuning

**Grid Search** tries every combination of hyperparameters you specify.

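For Random Forest, key hyperparameters include:
- `trees`: Number of trees in the forest (more trees give more stable predictions, with diminishing returns and longer training time)
- `mtry`: Number of predictors randomly sampled as split candidates at each split
- `min_n`: Minimum number of observations a node must contain to be split further (larger values give shallower trees and less overly specific rules)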
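
As a rough sketch of what these three settings control, the block below calls the `ranger` engine directly on the built-in `mtcars` data (a stand-in, not the workshop dataset) and varies only the number of trees. The comments show which `ranger()` argument each tidymodels name corresponds to; the specific values are arbitrary choices for illustration.

```r
library(ranger)

# mtcars ships with R; it stands in for the workshop data here
tree_counts <- c(25, 100, 500)

oob_mse <- sapply(tree_counts, function(nt) {
  fit <- ranger(
    mpg ~ ., data = mtcars,
    num.trees = nt,        # tidymodels name: trees
    mtry = 3,              # tidymodels name: mtry
    min.node.size = 5,     # tidymodels name: min_n
    seed = 42
  )
  fit$prediction.error     # out-of-bag mean squared error for regression
})

names(oob_mse) <- tree_counts
print(round(oob_mse, 2))
```

The out-of-bag error usually stabilizes once the forest is large enough, which is why `trees` tends to matter less than `mtry` and `min_n` beyond a certain point.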

In [None]:
# Define tunable Random Forest specification
rf_tune_spec <- rand_forest(
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_tune_wf <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(rf_tune_spec)

# Define hyperparameter grid
rf_grid <- grid_regular(
  trees(range = c(50, 200)),
  mtry(range = c(2, 8)),
  min_n(range = c(2, 10)),
  levels = 3
)

cat("Total combinations to try:", nrow(rf_grid), "\n")
cat("With 5-fold CV:", nrow(rf_grid) * 5, "model fits\n")
print(head(rf_grid))

In [None]:
# Run grid search
cat("Running grid search (this may take a minute)...\n")

set.seed(42)
rf_tune_results <- tune_grid(
  rf_tune_wf,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(rmse, rsq)
)

# Show best results
cat("\nTop 5 Configurations:\n")
show_best(rf_tune_results, metric = "rmse", n = 5)

In [None]:
# Get best parameters
best_rf_params <- select_best(rf_tune_results, metric = "rmse")
cat("Best parameters:\n")
print(best_rf_params)

## Part 5: Visualize Hyperparameter Effects

In [None]:
# Visualize tuning results
autoplot(rf_tune_results) +
  theme_minimal() +
  labs(title = "Hyperparameter Tuning Results")

In [None]:
# Effect of trees on performance
tune_metrics <- collect_metrics(rf_tune_results) %>%
  filter(.metric == "rmse")

ggplot(tune_metrics, aes(x = trees, y = mean, color = factor(min_n))) +
  geom_line() +
  geom_point() +
  facet_wrap(~mtry, labeller = label_both) +
  labs(
    x = "Number of Trees",
    y = "Mean CV RMSE",
    color = "min_n",
    title = "Effect of Hyperparameters on Performance"
  ) +
  theme_minimal()

## Part 6: Random Search (Faster Alternative)

When the hyperparameter space is large, **Random Search** samples random combinations instead of trying everything.
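
To get a feel for the savings, the sketch below counts every integer combination in the ranges used in this part and then draws 20 of them at random. It uses base R only and illustrates the arithmetic; it is not how `grid_random()` is implemented.

```r
# Every integer combination in the search space used in this part of the notebook
full_space <- expand.grid(
  trees = 50:300,
  mtry  = 2:10,
  min_n = 2:20
)
n_full <- nrow(full_space)   # 251 * 9 * 19 combinations

# Random search: evaluate only a fixed-size random sample of that space
set.seed(42)
random_candidates <- full_space[sample(n_full, 20), ]

cat("Exhaustive search:", n_full, "combinations\n")
cat("Random search:", nrow(random_candidates), "combinations\n")
head(random_candidates)
```

An exhaustive search over this space would need 42,921 combinations (214,605 model fits with 5-fold CV), while random search keeps the cost fixed at whatever budget you choose.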

In [None]:
# Define a larger search space for random search
set.seed(42)  # seed before sampling so the random grid itself is reproducible
rf_random_grid <- grid_random(
  trees(range = c(50, 300)),
  mtry(range = c(2, 10)),
  min_n(range = c(2, 20)),
  size = 20  # Only try 20 random combinations
)

cat("Random search will try", nrow(rf_random_grid), "combinations\n")

# Run random search
set.seed(42)
rf_random_results <- tune_grid(
  rf_tune_wf,
  resamples = cv_folds,
  grid = rf_random_grid,
  metrics = metric_set(rmse, rsq)
)

cat("\nBest from random search:\n")
show_best(rf_random_results, metric = "rmse", n = 3)

## Part 7: Feature Importance Analysis

Understanding which features matter helps:
- Interpret the model
- Identify important supply chain factors
- Guide future data collection
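
This notebook uses impurity-based importance from `ranger` (`importance = "impurity"`), which can be biased toward predictors with many possible split points. A different, model-agnostic technique is permutation importance: shuffle one column and measure how much the error grows. The sketch below shows the idea on simulated data with a linear model; the column names `signal` and `noise` are made up for illustration.

```r
# Illustrative only: permutation importance on simulated data
set.seed(42)
n <- 300
sim <- data.frame(
  signal = rnorm(n),   # truly drives the outcome
  noise  = rnorm(n)    # unrelated to the outcome
)
sim$y <- 3 * sim$signal + rnorm(n)

train <- sim[1:200, ]
test  <- sim[201:300, ]
fit <- lm(y ~ signal + noise, data = train)

# RMSE of the fitted model on any data frame with the same columns
rmse <- function(data) sqrt(mean((data$y - predict(fit, newdata = data))^2))
baseline <- rmse(test)

# Importance = how much test RMSE worsens when one column is shuffled
perm_importance <- sapply(c("signal", "noise"), function(col) {
  shuffled <- test
  shuffled[[col]] <- sample(shuffled[[col]])
  rmse(shuffled) - baseline
})
print(round(perm_importance, 3))
```

Shuffling `signal` breaks its link to the outcome, so the error rises sharply, while shuffling `noise` changes almost nothing. If the two importance methods disagree strongly on your real data, that is worth investigating.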

In [None]:
# Finalize and fit the best model
final_rf_wf <- rf_tune_wf %>%
  finalize_workflow(best_rf_params)

set.seed(42)  # random forests are stochastic; fix the seed so the fit is reproducible
final_rf_fit <- final_rf_wf %>%
  fit(data = train_data)

# Feature importance using vip
final_rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10) +
  labs(title = "Feature Importance (Tuned Random Forest)") +
  theme_minimal()

In [None]:
# Get importance values as a table
importance_df <- final_rf_fit %>%
  extract_fit_parsnip() %>%
  vi() %>%
  arrange(desc(Importance))

cat("Feature Importance Table:\n")
print(importance_df)

## Part 8: Final Model Evaluation

Now we evaluate the tuned model on the **held-out test set** (which we haven't touched during tuning).
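
tidymodels also offers `last_fit()`, which fits a workflow on the training portion of a split and evaluates it once on the test portion. The sketch below shows the pattern on `mtcars` with a linear model as a stand-in. In this notebook the analogous call would presumably be `last_fit(final_rf_wf, data_split)`, which is a suggestion and not part of the original workflow.

```r
library(tidymodels)

set.seed(42)
# mtcars stands in for the workshop data in this sketch
car_split <- initial_split(mtcars, prop = 0.8)

car_wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg())

# Fits on training(car_split), then evaluates once on testing(car_split)
car_final <- last_fit(car_wf, split = car_split)
test_metrics <- collect_metrics(car_final)
print(test_metrics)
```

Because the test rows are only touched inside this single call, it is harder to leak test data into tuning by accident.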

In [None]:
# Compare default vs tuned model on test set

# Fit default model
default_fit <- rf_default_wf %>% fit(data = train_data)
default_pred <- predict(default_fit, test_data) %>%
  bind_cols(test_data %>% select(actual_demand))

# Predict with tuned model
tuned_pred <- predict(final_rf_fit, test_data) %>%
  bind_cols(test_data %>% select(actual_demand))

# Calculate metrics
default_metrics <- default_pred %>% metrics(truth = actual_demand, estimate = .pred)
tuned_metrics <- tuned_pred %>% metrics(truth = actual_demand, estimate = .pred)

# Comparison table
comparison <- tibble(
  Model = c('Default RF', 'Tuned RF'),
  RMSE = c(
    default_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
    tuned_metrics %>% filter(.metric == "rmse") %>% pull(.estimate)
  ),
  MAE = c(
    default_metrics %>% filter(.metric == "mae") %>% pull(.estimate),
    tuned_metrics %>% filter(.metric == "mae") %>% pull(.estimate)
  ),
  R_squared = c(
    default_metrics %>% filter(.metric == "rsq") %>% pull(.estimate),
    tuned_metrics %>% filter(.metric == "rsq") %>% pull(.estimate)
  )
) %>%
  mutate(across(where(is.numeric), ~round(., 3)))

cat("Test Set Performance:\n")
print(comparison)

In [None]:
# Visualize predictions
all_preds <- bind_rows(
  default_pred %>% mutate(Model = "Default RF"),
  tuned_pred %>% mutate(Model = "Tuned RF")
)

ggplot(all_preds, aes(x = actual_demand, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  facet_wrap(~Model) +
  labs(
    x = "Actual Demand",
    y = "Predicted Demand",
    title = "Default vs Tuned Random Forest"
  ) +
  theme_minimal()

## Part 9: Save Model for Day 3

We'll save the trained model so we can use it in the ML-to-ABM integration notebook.
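
Loading works the same way in reverse with `readRDS()`. The round trip below uses a small `lm()` model and a temporary file purely for illustration. For the workflow saved in this notebook, the loading session would also need `library(tidymodels)` and the `ranger` package installed before calling `predict()`.

```r
# Illustrative round trip with a small lm() model and a temporary file
fit <- lm(mpg ~ wt + hp, data = mtcars)

model_path <- tempfile(fileext = ".rds")
saveRDS(fit, model_path)

# Later (for example, in another notebook): restore the object and predict
restored <- readRDS(model_path)
new_rows <- mtcars[1:3, ]

original_pred <- predict(fit, newdata = new_rows)
restored_pred <- predict(restored, newdata = new_rows)
print(restored_pred)
```

The restored object should produce the same predictions as the original, which is a quick sanity check worth running after any save.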

In [None]:
# Save the tuned workflow (includes preprocessing and model)
saveRDS(final_rf_fit, "demand_model.rds")

# Also save model info
model_info <- list(
  best_params = best_rf_params,
  test_rmse = tuned_metrics %>% filter(.metric == "rmse") %>% pull(.estimate),
  feature_importance = importance_df
)
saveRDS(model_info, "demand_model_info.rds")

cat("Model saved as 'demand_model.rds'\n")
cat("\nModel summary:\n")
cat("  Best parameters: trees =", best_rf_params$trees, 
    ", mtry =", best_rf_params$mtry, 
    ", min_n =", best_rf_params$min_n, "\n")
cat("  Test RMSE:", round(model_info$test_rmse, 2), "\n")

## Summary

In this notebook, you learned:

1. **Cross-validation** (`vfold_cv()`) provides robust performance estimates
2. **Grid search** (`tune_grid()` + `grid_regular()`) systematically explores hyperparameter combinations
3. **Random search** (`grid_random()`) is faster for large search spaces
4. **Feature importance** (`vip()`) helps interpret model decisions
5. **Final evaluation** on held-out test data gives an honest estimate of how the model generalizes and reveals overfitting that happened during training or tuning

### Key tidymodels Functions for Tuning

| Task | Function |
|------|----------|
| Mark parameter for tuning | `tune()` |
| Create CV folds | `vfold_cv()` |
| Create parameter grid | `grid_regular()`, `grid_random()` |
| Tune model | `tune_grid()` |
| Get best parameters | `select_best()` |
| Finalize workflow | `finalize_workflow()` |
| Feature importance | `vip()`, `vi()` |

### Key Takeaways for Supply Chain Forecasting

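- Previous demand is often a strong predictor, but check this against your feature importance table before relying on it
- Facility characteristics (type, capacity) can capture structural differences between sites
- Seasonal patterns (rainy/dry) may affect demand in real data
- Distance and delivery time may influence supply chain dynamics; note that in the simulated fallback dataset only `facility_type` and `previous_demand` drive `actual_demand`, so the other features should show little importance there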

**Next:** Use these predictions as inputs to Agent-Based Models (Day 3)

---

## Exercise (Optional)

Try tuning a Gradient Boosting model using the same process:

1. Define a parameter grid for `boost_tree()` (try `trees`, `learn_rate`, `tree_depth`)
2. Run grid search with 5-fold CV
3. Compare the best GB model with the tuned RF model

Which performs better on this data?
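
One thing to watch if you build the grid with dials: `learn_rate()` takes its range on the log10 scale, so `learn_rate(range = c(-2, -0.5))` spans 0.01 to roughly 0.32. A plain data frame sidesteps the transformation, since `tune_grid()` accepts a data frame whose column names match the tuned parameters. The sketch below builds such a grid with base R from the values in the hint for this exercise; the name `gb_grid` is an assumption.

```r
# Explicit grid built with base R; values taken from the hint in this exercise
gb_grid <- expand.grid(
  trees      = c(50, 100, 200),
  learn_rate = c(0.01, 0.1, 0.3),   # actual learning rates, not log10 values
  tree_depth = c(3, 5, 7)
)

cat("Combinations:", nrow(gb_grid), "\n")
cat("Model fits with 5-fold CV:", nrow(gb_grid) * 5, "\n")
head(gb_grid)
```

You could then pass `gb_grid` as the `grid` argument of `tune_grid()` in place of a `grid_regular()` result.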

In [None]:
# Your code here
# Hint: boost_tree key parameters:
# - trees: 50, 100, 200
# - learn_rate: 0.01, 0.1, 0.3
# - tree_depth: 3, 5, 7

# gb_spec <- boost_tree(
#   trees = tune(),
#   learn_rate = tune(),
#   tree_depth = tune()
# ) %>%
#   set_engine("xgboost") %>%
#   set_mode("regression")