# Notebook 05: Drug Shortage Prediction (R Version)

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, we'll replicate the approach from Roe et al. (2025), who applied Random Forest to predict drug shortages in South Korea. We adapt their methods to our Ethiopian supply chain dataset using tidymodels.

**Reference:** Roe et al. (2025). "Drug shortage in South Korea: machine learning-based prediction models and analysis of duration and causal factors." *Frontiers in Pharmacology*. [DOI: 10.3389/fphar.2025.1608843](https://doi.org/10.3389/fphar.2025.1608843)

## Part 1: Background

### The Drug Shortage Problem

Drug shortages disrupt healthcare delivery worldwide. Key challenges:
- **Unpredictable timing**: When will shortages occur?
- **Unknown duration**: How long will they last?
- **Multiple causes**: Manufacturing, raw materials, regulatory, demand surges

### The Korean Study Approach

Roe et al. built two ML models:

| Model | Task | Target | Performance |
|-------|------|--------|-------------|
| **Model 1** | Duration prediction | Short/Medium/Long | 62% accuracy |
| **Model 2** | Cause classification | 5 cause categories | >70% F1-score |

**Top predictors identified:**
1. Shortage frequency (how often the drug has been short before)
2. Import status (domestic vs. imported)
3. Alternative drug availability

## Setup

::: {.callout-important}
## Google Colab R Runtime
Make sure you're using the R runtime: **Runtime -> Change runtime type -> R**
:::

In [None]:
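# Install any missing packages (run once in Colab)
pkgs <- c("tidymodels", "tidyverse", "ranger", "vip", "gridExtra")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}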

# Load packages
library(tidymodels)
library(tidyverse)
library(vip)

# Settings
set.seed(42)
theme_set(theme_minimal())

cat("Packages loaded!\n")

## Part 2: Load and Prepare Data

We'll use our Ethiopian supply chain dataset, which now includes shortage-related columns.

In [None]:
# Load data from GitHub
url <- "https://raw.githubusercontent.com/sysylvia/ethiopia-ds-workshop-2026/main/data/supply-chain-sample.csv"

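df <- tryCatch({
  dat <- read_csv(url, show_col_types = FALSE)
  cat("Data loaded from GitHub!\n")
  dat
}, error = function(e) {
  # Fall back to simulated data if the URL is unavailable.
  # Targets are drawn independently of the features, so models fitted to
  # this fallback should score near chance level.
  cat("Creating sample data with shortage columns...\n")
  set.seed(42)
  n <- 1000

  tibble(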
    region = sample(c('Addis Ababa', 'Oromia', 'Amhara', 'SNNP', 'Tigray'), n, replace = TRUE),
    facility_type = sample(c('Hospital', 'Health Center', 'Clinic'), n, replace = TRUE, prob = c(0.2, 0.5, 0.3)),
    season = sample(c('dry', 'rainy'), n, replace = TRUE),
    previous_demand = round(100 + rnorm(n, 0, 30)),
    distance_to_warehouse = round(runif(n, 10, 500)),
    stockout_last_month = sample(0:1, n, replace = TRUE, prob = c(0.8, 0.2)),
    avg_delivery_days = round(runif(n, 3, 21)),
    shortage_frequency = rpois(n, 2),  # Historical shortage count
    shortage_occurred = sample(0:1, n, replace = TRUE, prob = c(0.6, 0.4)),
    shortage_duration_days = ifelse(shortage_occurred == 1, sample(1:60, n, replace = TRUE), NA),
    shortage_duration_category = ifelse(shortage_occurred == 1, 
                                        sample(c('Short', 'Medium', 'Long'), n, replace = TRUE, prob = c(0.4, 0.35, 0.25)),
                                        NA),
    shortage_cause = ifelse(shortage_occurred == 1,
                           sample(c('Manufacturing', 'Logistics', 'Demand Surge', 'Regulatory', 'Raw Materials'), 
                                  n, replace = TRUE, prob = c(0.25, 0.3, 0.2, 0.1, 0.15)),
                           NA)
  )
})

cat("Data shape:", nrow(df), "rows x", ncol(df), "columns\n")
cat("Columns:", paste(names(df), collapse = ", "), "\n")
head(df)

In [None]:
# Explore the shortage-related columns
cat("Shortage Duration Categories:\n")
print(table(df$shortage_duration_category, useNA = "ifany"))

cat("\nShortage Cause Categories:\n")
print(table(df$shortage_cause, useNA = "ifany"))

### Data Dictionary: New Shortage Columns

| Column | Description | Mapping to Korean Study |
|--------|-------------|------------------------|
| `shortage_occurred` | Binary: did a shortage happen? | Similar to their outcome |
| `shortage_duration_days` | Duration in days (if occurred) | Their Model 1 target |
| `shortage_duration_category` | Short/Medium/Long | Their Model 1 target (binned) |
| `shortage_cause` | Cause category | Their Model 2 target |
| `shortage_frequency` | Historical shortage count | Their top predictor |
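
The binned target `shortage_duration_category` can be derived from `shortage_duration_days` with `cut()`. The sketch below assumes hypothetical cut-offs of 7 and 30 days; these thresholds and the helper name `bin_duration` are illustrative and are not taken from Roe et al. or from the workshop dataset.

```r
# Hypothetical cut-offs in days; adjust to whatever definition your data uses
bin_duration <- function(days, short_max = 7, medium_max = 30) {
  cut(
    days,
    breaks = c(0, short_max, medium_max, Inf),  # (0, 7], (7, 30], (30, Inf)
    labels = c("Short", "Medium", "Long"),
    right = TRUE
  )
}

# Missing durations (no shortage) stay NA
binned <- bin_duration(c(3, 7, 8, 30, 45, NA))
print(binned)
```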

## Part 3: Duration Prediction (Model 1)

Like Roe et al., we'll predict shortage duration categories using Random Forest.

**Their result:** 62% accuracy

In [None]:
# Filter to rows where shortage occurred
df_shortage <- df %>%
  filter(shortage_occurred == 1) %>%
  mutate(
    shortage_duration_category = factor(shortage_duration_category),
    facility_type = factor(facility_type),
    region = factor(region),
    season = factor(season)
  )

cat("Shortage events:", nrow(df_shortage), "\n")

# Check class distribution
cat("\nDuration category distribution:\n")
print(prop.table(table(df_shortage$shortage_duration_category)))
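
Before comparing against the 62% reported by Roe et al., it helps to know what accuracy a model would reach by always predicting the most common category. A minimal sketch of that majority-class baseline, using a made-up label vector (the helper `baseline_accuracy` is hypothetical):

```r
# Share of observations in the most frequent class, ignoring missing labels
baseline_accuracy <- function(y) {
  y <- y[!is.na(y)]
  max(table(y)) / length(y)
}

# Toy labels for illustration only
toy_labels <- c(rep("Short", 40), rep("Medium", 35), rep("Long", 25))
baseline <- baseline_accuracy(toy_labels)
print(baseline)
```

In this notebook, `baseline_accuracy(df_shortage$shortage_duration_category)` gives the figure a trained model should beat.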

In [None]:
# Prepare features for Model 1 (Duration Prediction)
# Define features (similar to Korean study predictors)
feature_cols <- c(
  'shortage_frequency',      # Top predictor in Korean study
  'facility_type',           # Maps to their "drug characteristics"
  'region',                  # Geographic factor
  'distance_to_warehouse',   # Supply chain factor (like import status)
  'season',                  # Temporal factor
  'previous_demand',         # Demand pressure
  'stockout_last_month',     # Recent history
  'avg_delivery_days'        # Logistics factor
)

cat("Features:", paste(feature_cols, collapse = ", "), "\n")

In [None]:
# Create recipe for duration prediction, reusing feature_cols from the previous cell
duration_formula <- reformulate(feature_cols, response = "shortage_duration_category")
duration_recipe <- recipe(duration_formula, data = df_shortage)

# Train/test split with stratification
set.seed(42)
dur_split <- initial_split(df_shortage, prop = 0.8, strata = shortage_duration_category)
dur_train <- training(dur_split)
dur_test <- testing(dur_split)

cat("Training set:", nrow(dur_train), "samples\n")
cat("Test set:", nrow(dur_test), "samples\n")

In [None]:
# Train Random Forest classifier (like Korean study)
rf_dur_spec <- rand_forest(trees = 100, mtry = 3, min_n = 5) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_dur_wf <- workflow() %>%
  add_recipe(duration_recipe) %>%
  add_model(rf_dur_spec)

rf_dur_fit <- rf_dur_wf %>% fit(data = dur_train)

# Predictions
dur_pred <- predict(rf_dur_fit, dur_test) %>%
  bind_cols(dur_test %>% select(shortage_duration_category))

# Evaluate
accuracy_dur <- accuracy(dur_pred, truth = shortage_duration_category, estimate = .pred_class)

# F1 score, averaged across classes and weighted by class frequency
# (yardstick defaults to an unweighted macro average for multiclass outcomes)
f1_dur <- f_meas(dur_pred, truth = shortage_duration_category, estimate = .pred_class,
                 estimator = "macro_weighted")

cat("=" |> rep(50) |> paste(collapse = ""), "\n")
cat("MODEL 1: Duration Prediction Results\n")
cat("=" |> rep(50) |> paste(collapse = ""), "\n")
cat("Our Accuracy:", round(accuracy_dur$.estimate * 100, 1), "%\n")
cat("Korean Study: 62%\n")
cat("\nOur F1-Score:", round(f1_dur$.estimate * 100, 1), "%\n")
cat("=" |> rep(50) |> paste(collapse = ""), "\n")
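
For a multiclass outcome, F1 is computed per class and then averaged. A macro average treats every class equally, while a weighted average weights each class by how often it occurs. The hand computation below, on made-up labels, is a sketch of the difference between the two.

```r
lvls <- c("Short", "Medium", "Long")

# Toy truth and predictions for illustration only
truth <- factor(
  c("Short", "Short", "Short", "Short", "Medium", "Medium", "Medium",
    "Long", "Long", "Long"),
  levels = lvls
)
estimate <- factor(
  c("Short", "Short", "Short", "Medium", "Medium", "Medium", "Long",
    "Long", "Short", "Long"),
  levels = lvls
)

# Rows are predictions, columns are true classes
cm <- table(estimate, truth)

tp        <- diag(cm)
precision <- tp / rowSums(cm)   # correct / all predicted as that class
recall    <- tp / colSums(cm)   # correct / all truly in that class
f1        <- 2 * precision * recall / (precision + recall)

macro_f1    <- mean(f1)                         # every class counts equally
weighted_f1 <- sum(f1 * colSums(cm)) / sum(cm)  # weighted by class support

print(round(c(macro = macro_f1, weighted = weighted_f1), 3))
```

In yardstick, the same choice is made through the `estimator` argument of `f_meas()`: `"macro"` or `"macro_weighted"`.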

In [None]:
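# 5-fold cross-validation over all shortage events (including the held-out
# test rows) for a more stable estimate than a single train/test split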
set.seed(42)
cv_folds_dur <- vfold_cv(df_shortage, v = 5, strata = shortage_duration_category)

cv_dur_results <- fit_resamples(
  rf_dur_wf,
  resamples = cv_folds_dur,
  metrics = metric_set(accuracy, f_meas)
)

cv_dur_metrics <- collect_metrics(cv_dur_results)
cat("5-Fold CV Results:\n")
print(cv_dur_metrics)

In [None]:
# Confusion matrix
cm_dur <- conf_mat(dur_pred, truth = shortage_duration_category, estimate = .pred_class)

autoplot(cm_dur, type = "heatmap") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Duration Prediction: Confusion Matrix") +
  theme_minimal()

In [None]:
# Per-class recall: share of each true class that the model predicted correctly
cat("\nPer-Class Recall:\n")
dur_pred %>%
  group_by(shortage_duration_category) %>%
  summarize(
    n = n(),
    correct = sum(.pred_class == shortage_duration_category),
    recall = correct / n
  ) %>%
  print()

## Part 4: Cause Classification (Model 2)

Like Roe et al.'s second model, we'll predict the cause of shortages.

**Their result:** >70% F1-score

In [None]:
# Check cause distribution
cat("Shortage Cause Distribution:\n")
print(prop.table(table(df_shortage$shortage_cause)))

In [None]:
# Prepare data for Model 2
df_shortage <- df_shortage %>%
  mutate(shortage_cause = factor(shortage_cause))

# Create recipe for cause classification, reusing feature_cols from Part 3
cause_formula <- reformulate(feature_cols, response = "shortage_cause")
cause_recipe <- recipe(cause_formula, data = df_shortage)

# Train/test split with stratification
set.seed(42)
cause_split <- initial_split(df_shortage, prop = 0.8, strata = shortage_cause)
cause_train <- training(cause_split)
cause_test <- testing(cause_split)

cat("Training set:", nrow(cause_train), "samples\n")
cat("Test set:", nrow(cause_test), "samples\n")

In [None]:
# Train Random Forest for cause classification
rf_cause_spec <- rand_forest(trees = 100, mtry = 3, min_n = 5) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_cause_wf <- workflow() %>%
  add_recipe(cause_recipe) %>%
  add_model(rf_cause_spec)

rf_cause_fit <- rf_cause_wf %>% fit(data = cause_train)

# Predictions
cause_pred <- predict(rf_cause_fit, cause_test) %>%
  bind_cols(cause_test %>% select(shortage_cause))

# Evaluate
accuracy_cause <- accuracy(cause_pred, truth = shortage_cause, estimate = .pred_class)
# F1 weighted by class frequency (matching the Model 1 cell)
f1_cause <- f_meas(cause_pred, truth = shortage_cause, estimate = .pred_class,
                   estimator = "macro_weighted")

cat("=" |> rep(50) |> paste(collapse = ""), "\n")
cat("MODEL 2: Cause Classification Results\n")
cat("=" |> rep(50) |> paste(collapse = ""), "\n")
cat("Our Accuracy:", round(accuracy_cause$.estimate * 100, 1), "%\n")
cat("Our F1-Score:", round(f1_cause$.estimate * 100, 1), "%\n")
cat("Korean Study: >70% F1\n")
cat("=" |> rep(50) |> paste(collapse = ""), "\n")

In [None]:
# Confusion matrix for cause prediction
cm_cause <- conf_mat(cause_pred, truth = shortage_cause, estimate = .pred_class)

autoplot(cm_cause, type = "heatmap") +
  scale_fill_gradient(low = "white", high = "forestgreen") +
  labs(title = "Cause Classification: Confusion Matrix") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Part 5: Feature Importance Comparison

The Korean study found these were the top predictors:
1. **Shortage frequency** (most important)
2. Import status
3. Alternative drug availability

Let's see what our models find!

In [None]:
# Feature importance from both models
p1 <- rf_dur_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 8) +
  labs(title = "Model 1: Duration Prediction\nFeature Importance") +
  theme_minimal()

p2 <- rf_cause_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 8) +
  labs(title = "Model 2: Cause Classification\nFeature Importance") +
  theme_minimal()

# Display plots
gridExtra::grid.arrange(p1, p2, ncol = 2)
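
The `importance = "impurity"` setting used in both models ranks features by how much they reduce node impurity. Impurity-based scores are known to lean toward continuous predictors and factors with many levels, so a permutation-based ranking is a common cross-check. A minimal sketch with `ranger` directly, using the built-in `iris` data as a stand-in (not the workshop dataset):

```r
library(ranger)

set.seed(42)

# Same engine as the notebook, but asking for permutation importance
fit_perm <- ranger(
  Species ~ .,
  data = iris,
  num.trees = 100,
  importance = "permutation"
)

# Named numeric vector with one entry per predictor, sorted high to low
perm_importance <- sort(fit_perm$variable.importance, decreasing = TRUE)
print(perm_importance)
```

Within the tidymodels workflow, the equivalent change would be `set_engine("ranger", importance = "permutation")`.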

In [None]:
# Compare our top predictors to Korean study
cat("Top 3 Predictors Comparison\n")
cat("=" |> rep(60) |> paste(collapse = ""), "\n")

cat("\nKorean Study (Roe et al., 2025):\n")
cat("  1. Shortage frequency\n")
cat("  2. Import status\n")
cat("  3. Alternative drug availability\n")

# Get importance from our models
dur_imp <- rf_dur_fit %>%
  extract_fit_parsnip() %>%
  vi() %>%
  head(3)

cause_imp <- rf_cause_fit %>%
  extract_fit_parsnip() %>%
  vi() %>%
  head(3)

cat("\nOur Duration Model:\n")
for (i in 1:3) {
  cat(sprintf("  %d. %s (%.3f)\n", i, dur_imp$Variable[i], dur_imp$Importance[i]))
}

cat("\nOur Cause Model:\n")
for (i in 1:3) {
  cat(sprintf("  %d. %s (%.3f)\n", i, cause_imp$Variable[i], cause_imp$Importance[i]))
}

## Part 6: Discussion Questions

**Think about:**

1. **How does our accuracy compare to the Korean study?**
   - They achieved 62% for duration, >70% F1 for causes
   - What might explain differences?

2. **Are the important features similar?**
   - Korean study: shortage frequency was #1
   - What drives predictions in our context?

3. **How would you improve these models?**
   - More features? (e.g., manufacturer data, import status, alternative drug availability)
   - Different algorithms?
   - More data?

4. **How could these predictions be used?**
   - Early warning systems
   - Resource allocation
   - Policy planning
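
For question 3, one option is to tune the Random Forest hyperparameters rather than fixing `mtry = 3` and `min_n = 5`. The sketch below uses the built-in `iris` data as a stand-in, and the grid values are illustrative assumptions, not recommendations for the workshop dataset.

```r
library(tidymodels)

set.seed(42)
folds <- vfold_cv(iris, v = 5, strata = Species)

# Mark mtry and min_n as tunable instead of fixing them
rf_tune_spec <- rand_forest(trees = 200, mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_tune_wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(rf_tune_spec)

# Small explicit grid; column names must match the tuned parameters
rf_grid <- expand.grid(mtry = c(1, 2, 3), min_n = c(2, 5, 10))

tune_res <- tune_grid(
  rf_tune_wf,
  resamples = folds,
  grid = rf_grid,
  metrics = metric_set(accuracy)
)

# Pick the best combination and plug it back into the workflow
best_params <- select_best(tune_res, metric = "accuracy")
final_wf <- finalize_workflow(rf_tune_wf, best_params)
print(best_params)
```

To apply this to the duration model, `rf_dur_wf` would need a spec built with `tune()` placeholders, and the resamples should come from the training set only.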

## Summary

In this notebook, we:

1. **Replicated** the Roe et al. (2025) approach for drug shortage prediction
2. **Built Model 1** for duration prediction (classification)
3. **Built Model 2** for cause classification
4. **Compared** our feature importance to their findings
5. **Discussed** implications for supply chain management

### Key Takeaways

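- In Roe et al., Random Forest predicted shortage duration and cause with moderate accuracy (62% accuracy; >70% F1); compare these figures with your own results above
- Historical shortage frequency was their strongest predictor
- Models like these can inform proactive supply chain management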

### tidymodels for Classification

| Task | Function |
|------|----------|
| Classification model | `set_mode("classification")` |
| Stratified split | `initial_split(strata = outcome)` |
| Stratified CV | `vfold_cv(strata = outcome)` |
| Accuracy | `accuracy()` |
| F1 score | `f_meas()` |
| Confusion matrix | `conf_mat()` |

### Connection to WISE Project

This same approach can be applied to predict:
- Which facilities are at risk of stockouts
- How long disruptions might last
- What interventions might be most effective
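
To flag which facilities are at risk, class probabilities are usually more useful than hard class labels, because they let you rank facilities. A minimal sketch on simulated data follows; the column names mirror this notebook's features, but the data, the coefficients in the simulation, and the names `sim` and `risk_ranking` are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(42)
n <- 400

# Simulated facilities whose stockout risk depends on two features
sim <- tibble(
  facility_id = seq_len(n),
  shortage_frequency = rpois(n, 2),
  avg_delivery_days = round(runif(n, 3, 21))
) %>%
  mutate(
    p_true = plogis(-2 + 0.5 * shortage_frequency + 0.08 * avg_delivery_days),
    stockout = factor(ifelse(runif(n) < p_true, "yes", "no"),
                      levels = c("no", "yes"))
  )

sim_split <- initial_split(sim, prop = 0.8, strata = stockout)

risk_fit <- rand_forest(trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification") %>%
  fit(stockout ~ shortage_frequency + avg_delivery_days,
      data = training(sim_split))

# Predicted probabilities for held-out facilities, highest risk first
risk_ranking <- predict(risk_fit, testing(sim_split), type = "prob") %>%
  bind_cols(testing(sim_split) %>% select(facility_id)) %>%
  arrange(desc(.pred_yes))

print(head(risk_ranking, 5))
```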

**Reference:** Roe, Y., Lee, S., Kim, C., & Lee, J. (2025). Drug shortage in South Korea: machine learning-based prediction models and analysis of duration and causal factors. *Frontiers in Pharmacology*, 16, 1608843. [DOI: 10.3389/fphar.2025.1608843](https://doi.org/10.3389/fphar.2025.1608843)