# Day 2: Demand Forecasting with Machine Learning (R Version)

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, you'll build your first machine learning models to forecast demand in the pharmaceutical supply chain using tidymodels.

## Setup

::: {.callout-important}
## Google Colab R Runtime
Make sure you're using the R runtime: **Runtime -> Change runtime type -> R**
:::

In [None]:
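# Install any missing packages (run once in Colab)
pkgs <- c("tidymodels", "tidyverse", "ranger", "xgboost", "vip")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}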

# Load packages
library(tidymodels)
library(tidyverse)
library(vip)  # For variable importance plots

# Settings
set.seed(42)
theme_set(theme_minimal())

cat("Packages loaded!\n")

## Part 1: Load and Prepare Data

In [None]:
# Create sample data (will be replaced with real data URL)
set.seed(42)
n_rows <- 1000

# Generate data with patterns
dates <- seq(as.Date('2023-01-01'), by = 'day', length.out = n_rows)
regions <- sample(c('Addis Ababa', 'Oromia', 'Amhara', 'SNNP', 'Tigray'), n_rows, replace = TRUE)
facility_types <- sample(c('Hospital', 'Health Center', 'Clinic'), n_rows, replace = TRUE, 
                        prob = c(0.2, 0.5, 0.3))

# Create demand with seasonal pattern and facility effects
base_demand <- 100
facility_effect <- ifelse(facility_types == 'Hospital', 100,
                         ifelse(facility_types == 'Health Center', 50, 0))
day_of_year <- as.numeric(format(dates, '%j'))
seasonal_effect <- 20 * sin(2 * pi * day_of_year / 365)
noise <- rnorm(n_rows, 0, 15)

demand <- pmax(base_demand + facility_effect + seasonal_effect + noise, 10)

df <- tibble(
  date = dates,
  region = factor(regions),
  facility_type = factor(facility_types),
  demand = as.integer(demand)
)

cat("Data shape:", nrow(df), "rows x", ncol(df), "columns\n")
head(df)
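
To make the simulated pattern concrete, the sketch below evaluates only the seasonal term from the cell above, `20 * sin(2 * pi * day / 365)`, over one year and adds it to the base demand and facility effects. It uses base R only and leaves out the noise term; the variable names `peak_day`, `trough_day`, and `expected_peak` are introduced here for illustration.

```r
# Seasonal term used in the simulated demand: 20 * sin(2 * pi * day / 365)
day_of_year <- 1:365
seasonal_effect <- 20 * sin(2 * pi * day_of_year / 365)

# Days with the largest and smallest seasonal contribution
peak_day <- day_of_year[which.max(seasonal_effect)]
trough_day <- day_of_year[which.min(seasonal_effect)]

# Expected demand before noise for each facility type on the peak day
# (base demand 100 plus the facility effects from the cell above)
facility_effect <- c(Clinic = 0, `Health Center` = 50, Hospital = 100)
expected_peak <- 100 + facility_effect + max(seasonal_effect)

cat("Peak day:", peak_day, " Trough day:", trough_day, "\n")
print(round(expected_peak, 1))
```

The seasonal swing (about 20 units either way) is small next to the facility effects (50 to 100 units), so facility type should dominate any model fit to this data.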

## Part 2: Feature Engineering

In [None]:
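# Extract time features with lubridate (attached by library(tidyverse) in tidyverse 2.0+)
df <- df %>%
  mutate(
    month = month(date),
    day_of_week = wday(date),   # 1 = Sunday by default
    quarter = quarter(date),
    day_of_year = yday(date)    # created for exploration; not used in the recipes below
  )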

cat("Features created:\n")
head(df)
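
The numeric codes returned by these lubridate helpers depend on their defaults, which matters when reading the model output later. A minimal sketch on the first date of the simulated data (2023-01-01, a Sunday) shows what each function returns; the variable `d` is introduced here for illustration.

```r
library(lubridate)

d <- as.Date("2023-01-01")  # a Sunday, the first date in the simulated data

sunday_first <- wday(d)                  # Sunday = 1 by default
monday_first <- wday(d, week_start = 1)  # same date with Monday = 1
day_label <- wday(d, label = TRUE)       # ordered factor with abbreviated day names

cat("wday (Sunday first):", sunday_first, "\n")
cat("wday (Monday first):", monday_first, "\n")
cat("month:", month(d), " quarter:", quarter(d), " yday:", yday(d), "\n")
print(day_label)
```

If Monday-first numbering is preferred, passing `week_start = 1` in the `mutate()` call is one option.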

In [None]:
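# Define the preprocessing recipe
# In tidymodels, a recipe declares the outcome, the predictors, and the preprocessing steps.
# `data` is only a template for column names and types; steps are estimated when the workflow is fit.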

demand_recipe <- recipe(demand ~ region + facility_type + month + day_of_week + quarter, 
                       data = df) %>%
  # Convert categorical to dummy variables (needed for linear regression)
  step_dummy(all_nominal_predictors())

cat("Recipe defined with features: region, facility_type, month, day_of_week, quarter\n")
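
A recipe is only a plan until it is estimated. To see what `step_dummy()` actually produces, one option is to call `prep()` and `bake()` by hand, as in the sketch below. It uses a small made-up data frame (`toy`, hypothetical values) rather than the workshop data, so the output stays readable.

```r
library(tidymodels)

# Toy data with one factor column (hypothetical values, for illustration only)
toy <- data.frame(
  demand = c(120, 180, 210, 95, 160, 205),
  facility_type = factor(c("Clinic", "Health Center", "Hospital",
                           "Clinic", "Health Center", "Hospital")),
  month = c(1, 1, 2, 2, 3, 3)
)

toy_recipe <- recipe(demand ~ facility_type + month, data = toy) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the steps from the data;
# bake(new_data = NULL) returns the processed version of that same data
baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

print(names(baked))
print(baked)
```

The factor column is replaced by one indicator column per level except the first (the reference level), which is why a three-level factor yields two dummy columns. Inside a `workflow()`, `fit()` and `predict()` run these steps automatically.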

## Part 3: Train/Test Split

In [None]:
# Split data (80% train, 20% test)
set.seed(42)
data_split <- initial_split(df, prop = 0.8)
train_data <- training(data_split)
test_data <- testing(data_split)

cat("Training set:", nrow(train_data), "samples\n")
cat("Test set:", nrow(test_data), "samples\n")
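
`initial_split()` assigns rows at random, so some training rows fall on later dates than some test rows. For forecasting, a chronological split is a common alternative. The sketch below shows `initial_time_split()` from rsample on a made-up daily series (`toy`, hypothetical values); it assumes the rows are already sorted by date.

```r
library(rsample)

# Toy daily series (hypothetical values), sorted by date
toy <- data.frame(
  date = seq(as.Date("2023-01-01"), by = "day", length.out = 100),
  demand = 100 + 1:100
)

# initial_time_split() keeps row order: the first 80% of rows train, the rest test
time_split <- initial_time_split(toy, prop = 0.8)
time_train <- training(time_split)
time_test <- testing(time_split)

cat("Train:", format(min(time_train$date)), "to", format(max(time_train$date)), "\n")
cat("Test: ", format(min(time_test$date)), "to", format(max(time_test$date)), "\n")
```

With this split every test date comes after every training date, which is closer to how a forecast is used in practice. The rest of this notebook continues with the random split defined above.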

## Part 4: Train Models

We'll train three models:
1. Linear Regression (baseline)
2. Random Forest (using `ranger` engine)
3. Gradient Boosting (using `xgboost` engine)

In [None]:
# Model 1: Linear Regression (Baseline)
lr_spec <- linear_reg() %>%
  set_engine("lm")

lr_workflow <- workflow() %>%
  add_recipe(demand_recipe) %>%
  add_model(lr_spec)

lr_fit <- lr_workflow %>% fit(data = train_data)

lr_pred <- predict(lr_fit, test_data) %>%
  bind_cols(test_data %>% select(demand))

lr_metrics <- lr_pred %>%
  metrics(truth = demand, estimate = .pred)

lr_rmse <- lr_metrics %>% filter(.metric == "rmse") %>% pull(.estimate)
lr_rsq <- lr_metrics %>% filter(.metric == "rsq") %>% pull(.estimate)

cat("Linear Regression:\n")
cat("  RMSE:", round(lr_rmse, 2), "\n")
cat("  R-squared:", round(lr_rsq, 3), "\n")
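
To clarify what the two reported numbers mean, the sketch below computes them by hand on a few made-up values (`toy_pred`, hypothetical) and compares them with yardstick's vector functions `rmse_vec()` and `rsq_vec()`. Note that yardstick's `rsq()` is the squared correlation between actual and predicted values; the traditional `1 - SSE/SST` form is available separately as `rsq_trad()`.

```r
library(yardstick)

# Hypothetical actual and predicted values, for illustration only
toy_pred <- data.frame(
  demand = c(100, 150, 200, 120, 180),
  .pred  = c(110, 140, 190, 130, 170)
)

# RMSE: square root of the mean squared error, in the same units as demand
rmse_manual <- sqrt(mean((toy_pred$demand - toy_pred$.pred)^2))

# R-squared as yardstick defines it: squared correlation of truth and estimate
rsq_manual <- cor(toy_pred$demand, toy_pred$.pred)^2

rmse_yardstick <- rmse_vec(toy_pred$demand, toy_pred$.pred)
rsq_yardstick <- rsq_vec(toy_pred$demand, toy_pred$.pred)

cat("RMSE manual:", rmse_manual, " yardstick:", rmse_yardstick, "\n")
cat("R-squared manual:", rsq_manual, " yardstick:", rsq_yardstick, "\n")
```

Because RMSE is in demand units, it can be read directly as a typical forecast error per facility-day.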

In [None]:
# Model 2: Random Forest
# Note: Random Forest can handle categorical variables directly (no need for dummies)

# Simpler recipe for tree-based models
tree_recipe <- recipe(demand ~ region + facility_type + month + day_of_week + quarter, 
                     data = df)

rf_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_workflow <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(rf_spec)

rf_fit <- rf_workflow %>% fit(data = train_data)

rf_pred <- predict(rf_fit, test_data) %>%
  bind_cols(test_data %>% select(demand))

rf_metrics <- rf_pred %>%
  metrics(truth = demand, estimate = .pred)

rf_rmse <- rf_metrics %>% filter(.metric == "rmse") %>% pull(.estimate)
rf_rsq <- rf_metrics %>% filter(.metric == "rsq") %>% pull(.estimate)

cat("Random Forest:\n")
cat("  RMSE:", round(rf_rmse, 2), "\n")
cat("  R-squared:", round(rf_rsq, 3), "\n")

In [None]:
# Model 3: Gradient Boosting (XGBoost)
# The xgboost engine needs numeric predictors, so factors are dummy-encoded
# (same steps as demand_recipe)

xgb_recipe <- recipe(demand ~ region + facility_type + month + day_of_week + quarter, 
                    data = df) %>%
  step_dummy(all_nominal_predictors())

gb_spec <- boost_tree(trees = 100) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

gb_workflow <- workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(gb_spec)

gb_fit <- gb_workflow %>% fit(data = train_data)

gb_pred <- predict(gb_fit, test_data) %>%
  bind_cols(test_data %>% select(demand))

gb_metrics <- gb_pred %>%
  metrics(truth = demand, estimate = .pred)

gb_rmse <- gb_metrics %>% filter(.metric == "rmse") %>% pull(.estimate)
gb_rsq <- gb_metrics %>% filter(.metric == "rsq") %>% pull(.estimate)

cat("Gradient Boosting (XGBoost):\n")
cat("  RMSE:", round(gb_rmse, 2), "\n")
cat("  R-squared:", round(gb_rsq, 3), "\n")
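
The three model cells repeat the same fit, predict, and score steps. One way to reduce the repetition is a small helper function, sketched below. `evaluate_workflow()` is a hypothetical name, not part of tidymodels, and the helper assumes the outcome column is called `demand`. The example runs it on made-up data (`toy`) with a linear model so that it stays fast.

```r
library(tidymodels)

# Hypothetical helper: fit a workflow on the training set, score it on the test set.
# Assumes the outcome column is named `demand`.
evaluate_workflow <- function(wf, train, test) {
  wf_fit <- fit(wf, data = train)
  preds <- predict(wf_fit, test) %>%
    bind_cols(test %>% select(demand))
  scores <- preds %>% metrics(truth = demand, estimate = .pred)
  list(fit = wf_fit, predictions = preds, metrics = scores)
}

# Toy data (hypothetical): hospitals have demand about 100 units above clinics
set.seed(42)
toy <- tibble(
  facility_type = factor(sample(c("Clinic", "Hospital"), 200, replace = TRUE)),
  month = sample(1:12, 200, replace = TRUE)
) %>%
  mutate(demand = 100 + 100 * (facility_type == "Hospital") + rnorm(200, 0, 15))

toy_split <- initial_split(toy, prop = 0.8)

toy_wf <- workflow() %>%
  add_recipe(
    recipe(demand ~ facility_type + month, data = toy) %>%
      step_dummy(all_nominal_predictors())
  ) %>%
  add_model(linear_reg() %>% set_engine("lm"))

toy_result <- evaluate_workflow(toy_wf, training(toy_split), testing(toy_split))
print(toy_result$metrics)
```

In this notebook the same helper could be called with `lr_workflow`, `rf_workflow`, and `gb_workflow` together with `train_data` and `test_data`.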

## Part 5: Compare Models

In [None]:
# Summary table
results <- tibble(
  Model = c('Linear Regression', 'Random Forest', 'Gradient Boosting'),
  RMSE = round(c(lr_rmse, rf_rmse, gb_rmse), 2),
  R_squared = round(c(lr_rsq, rf_rsq, gb_rsq), 3)
)

cat("Model Comparison:\n")
print(results)
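
An RMSE value is easier to judge against a naive reference. A minimal sketch, assuming the simplest possible forecast (predict the training mean for every test row), is shown below in base R on made-up numbers; `train_demand` and `test_demand` are hypothetical vectors, not the workshop data.

```r
# Hypothetical demand values, for illustration only
train_demand <- c(120, 180, 210, 95, 160, 205, 150, 130)
test_demand <- c(110, 190, 200, 100)

# Naive forecast: predict the training mean for every test row
naive_pred <- rep(mean(train_demand), length(test_demand))
naive_rmse <- sqrt(mean((test_demand - naive_pred)^2))

cat("Training mean:", mean(train_demand), "\n")
cat("Naive baseline RMSE:", round(naive_rmse, 2), "\n")
```

Applied to this notebook, the same calculation with `train_data$demand` and `test_data$demand` gives a floor that each of the three models should beat.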

In [None]:
# Visualize predictions vs actual
all_preds <- bind_rows(
  lr_pred %>% mutate(Model = "Linear Regression"),
  rf_pred %>% mutate(Model = "Random Forest"),
  gb_pred %>% mutate(Model = "Gradient Boosting")
)

ggplot(all_preds, aes(x = demand, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) +
  facet_wrap(~Model) +
  labs(
    x = "Actual Demand",
    y = "Predicted Demand",
    title = "Predictions vs Actual by Model"
  ) +
  theme_minimal()

## Part 6: Feature Importance

One practical advantage of tree-based models is that they report feature importance scores, which indicate how much each predictor contributed to the model's splits.

In [None]:
# Feature importance from Random Forest using vip package
rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10) +
  labs(title = "Random Forest Feature Importance") +
  theme_minimal()

In [None]:
# Feature importance from XGBoost
gb_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10) +
  labs(title = "XGBoost Feature Importance") +
  theme_minimal()
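
The plots show relative rankings; the raw scores can also be pulled out as a table. The sketch below fits a small random forest on made-up data (`toy`, hypothetical) and reads the `variable.importance` element that ranger stores when `importance = "impurity"` is set. The names `importance_scores` and `importance_tbl` are introduced here for illustration.

```r
library(tidymodels)

# Toy data (hypothetical): facility type drives demand, month is pure noise
set.seed(42)
toy <- tibble(
  facility_type = factor(sample(c("Clinic", "Hospital"), 300, replace = TRUE)),
  month = sample(1:12, 300, replace = TRUE)
) %>%
  mutate(demand = 100 + 100 * (facility_type == "Hospital") + rnorm(300, 0, 15))

toy_rf <- rand_forest(trees = 100) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression") %>%
  fit(demand ~ facility_type + month, data = toy)

# The underlying ranger object stores the scores as a named numeric vector
importance_scores <- extract_fit_engine(toy_rf)$variable.importance

importance_tbl <- tibble(
  feature = names(importance_scores),
  importance = unname(importance_scores)
) %>%
  arrange(desc(importance))

print(importance_tbl)
```

Impurity-based scores are relative within one model, so they are best used to rank features rather than to compare values across models.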

## Summary

In this notebook, you:
- Created time features with `mutate()` and defined preprocessing with `recipe()`
- Trained three different ML models using tidymodels `workflow()`
- Compared model performance using `metrics()`
- Analyzed feature importance using `vip()`

### Key tidymodels Functions

| Task | Function |
|------|----------|
| Define preprocessing | `recipe()`, `step_*()` |
| Define model | `linear_reg()`, `rand_forest()`, `boost_tree()` |
| Combine into workflow | `workflow()`, `add_recipe()`, `add_model()` |
| Train model | `fit()` |
| Make predictions | `predict()` |
| Evaluate | `metrics()`, `rmse()`, `rsq()` |

**Next:** the Model Tuning notebook, where you will optimize the best-performing model.