# Day 1: Project Setup & ML Demo (R Version)

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, you'll:
1. Set up a reproducible analysis project using tidymodels
2. Explore the workshop dataset
3. **See overfitting in action** with a sine wave demonstration

## Part 1: Environment Setup

::: {.callout-important}
## Google Colab R Runtime
Make sure you're using the R runtime: **Runtime → Change runtime type → R**
:::

In [None]:
# Check environment
cat("R version:", R.version.string, "\n")
cat("Environment:", ifelse(Sys.getenv("COLAB_RELEASE_TAG") != "", "Colab", "Local"), "\n")

In [None]:
# Install any missing packages (run once)
# This may take 1-2 minutes in Colab
for (pkg in c("tidymodels", "tidyverse")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load packages
library(tidymodels)
library(tidyverse)

# Set seed for reproducibility
set.seed(42)

# Settings
theme_set(theme_minimal())

cat("Packages loaded successfully!\n")
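
To see what `set.seed()` contributes to reproducibility, here is a minimal base-R sketch: resetting the seed replays the same random draws, and printing the session information records which R and package versions produced a result. The helper name `draw_sample()` is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: fix the seed, then draw five uniform random numbers
draw_sample <- function(seed) {
  set.seed(seed)
  runif(5)
}

first_run  <- draw_sample(42)
second_run <- draw_sample(42)
other_seed <- draw_sample(43)

# Same seed gives identical draws; a different seed gives different draws
cat("Same seed reproduces draws:", identical(first_run, second_run), "\n")
cat("Different seed changes draws:", !identical(first_run, other_seed), "\n")

# Record the R version and attached packages alongside your results
print(sessionInfo())
```

Saving the `sessionInfo()` output with your results is one simple way to let a colleague rebuild a matching environment later.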

## Part 2: Load the Workshop Dataset

In [None]:
# Load supply chain data from GitHub
url <- "https://raw.githubusercontent.com/sysylvia/ethiopia-ds-workshop-2026/main/data/supply-chain-sample.csv"

# tryCatch() returns the value of whichever branch runs, so we assign it to df directly
df <- tryCatch({
  dat <- read_csv(url, show_col_types = FALSE)
  cat("Data loaded successfully! Shape:", nrow(dat), "rows x", ncol(dat), "columns\n")
  dat
}, error = function(e) {
  # If the URL is not available, create sample data
  cat("Creating sample data for demonstration...\n")
  set.seed(42)
  n_rows <- 500

  dat <- tibble(
    facility_id = sample(c('ETH001', 'ETH002', 'ETH003', 'ETH004', 'ETH005'), n_rows, replace = TRUE),
    region = sample(c('Addis Ababa', 'Oromia', 'Amhara', 'SNNP', 'Tigray'), n_rows, replace = TRUE),
    facility_type = sample(c('Hospital', 'Health Center', 'Clinic'), n_rows, replace = TRUE,
                          prob = c(0.2, 0.5, 0.3)),
    # Year-month string (e.g. "2023-01") derived from a daily sequence
    date = format(seq(as.Date('2023-01-01'), length.out = n_rows, by = 'day'), '%Y-%m'),
    medication_class = sample(c('Antibiotics', 'Antimalarials', 'Chronic Disease', 'Vaccines', 'Other'),
                             n_rows, replace = TRUE),
    demand = rpois(n_rows, 100) + sample(0:49, n_rows, replace = TRUE),
    stock_level = rpois(n_rows, 150),
    lead_time_days = sample(c(7, 14, 21, 30), n_rows, replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1))
  )
  cat("Sample data created! Shape:", nrow(dat), "rows x", ncol(dat), "columns\n")
  dat
})
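
Before exploring, it can be worth confirming that the loaded table has the columns the rest of the notebook relies on. The sketch below uses base R only. It assumes the CSV shares its column names with the fallback sample data, which is not confirmed here, and the helper `check_dataset()` is a hypothetical name introduced for illustration.

```r
# Assumption: the CSV has the same columns as the fallback sample data
expected_cols <- c("facility_id", "region", "facility_type", "date",
                   "medication_class", "demand", "stock_level", "lead_time_days")

# Hypothetical helper: report absent columns and per-column NA counts
check_dataset <- function(data, required) {
  list(
    missing_cols = setdiff(required, names(data)),
    na_counts = colSums(is.na(data))
  )
}

# Tiny stand-in data frame so the sketch runs on its own;
# in the notebook you would call check_dataset(df, expected_cols)
toy <- data.frame(
  facility_id = c("ETH001", "ETH002"),
  region = c("Oromia", NA),
  demand = c(120, 95)
)

report <- check_dataset(toy, expected_cols)
cat("Missing columns:", paste(report$missing_cols, collapse = ", "), "\n")
print(report$na_counts)
```

On the toy data the report flags the columns that were left out and the single missing `region` value. On the real `df` an empty `missing_cols` would indicate the expected structure.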

## Part 3: Data Exploration

In [None]:
# First look at the data
glimpse(df)

In [None]:
# View first few rows
head(df)

In [None]:
# Summary statistics
summary(df)

In [None]:
# Check categorical variables
cat("Region:\n")
print(table(df$region))

cat("\nFacility Type:\n")
print(table(df$facility_type))

cat("\nMedication Class:\n")
print(table(df$medication_class))

## Part 4: Initial Visualizations

In [None]:
# Distribution of demand
ggplot(df, aes(x = demand)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(
    x = "Demand (units)",
    y = "Frequency",
    title = "Distribution of Demand"
  ) +
  theme_minimal()

In [None]:
# Demand by region
df %>%
  group_by(region) %>%
  summarize(avg_demand = mean(demand), .groups = "drop") %>%
  ggplot(aes(x = avg_demand, y = reorder(region, avg_demand))) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  labs(
    x = "Average Demand",
    y = "Region",
    title = "Average Demand by Region"
  ) +
  theme_minimal()

In [None]:
# Demand by facility type
ggplot(df, aes(x = facility_type, y = demand)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  labs(
    x = "Facility Type",
    y = "Demand",
    title = "Demand Distribution by Facility Type"
  ) +
  theme_minimal()

---

## Part 5: Understanding Overfitting with the Sine Wave

Now let's see the **bias-variance tradeoff** in action! We'll generate noisy data from a sine wave and try to fit polynomials of increasing complexity.

**Key questions:**
- When does the model fit too much noise?
- Why does training error alone mislead us?
- How do we find the right complexity?

In [None]:
# Generate sine wave data with noise
set.seed(42)

n <- 30  # number of data points
X <- sort(runif(n, 0, 2 * pi))  # random x values
y_true <- sin(X)  # the true underlying function
y <- y_true + rnorm(n, 0, 0.3)  # add noise

# Create data frame
sine_data <- tibble(x = X, y = y, y_true = y_true)

cat("Generated", n, "noisy observations from sin(x)\n")

In [None]:
# Visualize the data and true function
x_smooth <- seq(0, 2 * pi, length.out = 100)
smooth_df <- tibble(x = x_smooth, y = sin(x_smooth))

ggplot() +
  geom_line(data = smooth_df, aes(x = x, y = y), 
            color = "red", linetype = "dashed", linewidth = 1) +
  geom_point(data = sine_data, aes(x = x, y = y), 
             size = 3, color = "blue", shape = 21, fill = "blue") +
  labs(
    x = "x",
    y = "y",
    title = "The Challenge: Recover the True Pattern from Noisy Data",
    subtitle = "Red dashed = True function sin(x), Blue points = Observed data (noisy)"
  ) +
  theme_minimal()

### Fitting Polynomials of Increasing Degree

Let's fit polynomials with degrees 1 (linear), 3, 10, and 20 to see what happens.

In [None]:
# Fit polynomials of different degrees
degrees <- c(1, 3, 10, 20)
colors <- c("green", "orange", "purple", "red")

# Create base plot (true function in black so it is not confused with the red degree-20 fit)
p <- ggplot() +
  geom_point(data = sine_data, aes(x = x, y = y), 
             size = 3, color = "blue", shape = 21, fill = "blue") +
  geom_line(data = smooth_df, aes(x = x, y = y), 
            color = "black", linetype = "dashed", alpha = 0.5, linewidth = 1)

# Fit and add each polynomial
results <- list()
for (i in seq_along(degrees)) {
  degree <- degrees[i]
  
  # Fit polynomial using lm with poly()
  # Note: raw high-degree terms are nearly collinear, so lm() may drop some of
  # them and predict() may warn about a rank-deficient fit at degree 20
  model <- lm(y ~ poly(x, degree, raw = TRUE), data = sine_data)
  
  # Predictions on smooth x
  pred_df <- tibble(
    x = x_smooth,
    y_pred = predict(model, newdata = tibble(x = x_smooth))
  )
  
  # Training MSE
  train_pred <- predict(model, newdata = sine_data)
  train_mse <- mean((sine_data$y - train_pred)^2)
  
  results[[i]] <- list(degree = degree, mse = train_mse, pred_df = pred_df, color = colors[i])
  
  # Add to plot
  p <- p + geom_line(data = pred_df, aes(x = x, y = y_pred), 
                     color = colors[i], linewidth = 1)
}

# Print MSE results
cat("Training MSE by polynomial degree:\n")
for (r in results) {
  cat(sprintf("  Degree %2d: MSE = %.3f\n", r$degree, r$mse))
}

p + 
  labs(
    x = "x",
    y = "y",
    title = "Polynomial Fits of Increasing Complexity",
    subtitle = "Green=Deg1, Orange=Deg3, Purple=Deg10, Red=Deg20, Black dashed=sin(x)"
  ) +
  # Zoom the view without dropping predictions that fall outside the range
  coord_cartesian(ylim = c(-2, 2)) +
  theme_minimal()

### What do you notice?

- **Degree 1 (green)**: Too simple! Misses the curve entirely.
- **Degree 3 (orange)**: Captures the sine pattern reasonably well.
- **Degree 10+ (purple, red)**: Starts wiggling through individual points.

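**But look at the training MSE!** Higher-degree polynomials have *lower* training error. That is expected: each higher-degree polynomial contains the lower-degree ones as special cases, so least squares can, in principle, always match or beat the simpler fit on the training data. Does that mean they're better?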

### Train vs Test Error: The Real Test

Let's split our data and see what happens on **held-out test data**.

In [None]:
# Split data using rsample: 70% train, 30% test
set.seed(42)
data_split <- initial_split(sine_data, prop = 0.7)
train_data <- training(data_split)
test_data <- testing(data_split)

cat("Training samples:", nrow(train_data), "\n")
cat("Test samples:", nrow(test_data), "\n")

In [None]:
# Compare train vs test error across polynomial degrees
degrees_to_test <- 1:15
train_errors <- numeric(length(degrees_to_test))
test_errors <- numeric(length(degrees_to_test))

for (i in seq_along(degrees_to_test)) {
  degree <- degrees_to_test[i]
  
  # Fit model on training data
  model <- lm(y ~ poly(x, degree, raw = TRUE), data = train_data)
  
  # Calculate errors
  train_pred <- predict(model, newdata = train_data)
  test_pred <- predict(model, newdata = test_data)
  
  train_errors[i] <- mean((train_data$y - train_pred)^2)
  test_errors[i] <- mean((test_data$y - test_pred)^2)
}

# Create data frame for plotting
error_df <- tibble(
  degree = rep(degrees_to_test, 2),
  error = c(train_errors, test_errors),
  type = rep(c("Training Error", "Test Error"), each = length(degrees_to_test))
)

# Find optimal degree
optimal_degree <- degrees_to_test[which.min(test_errors)]

# Plot
ggplot(error_df, aes(x = degree, y = error, color = type, shape = type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_vline(xintercept = optimal_degree, linetype = "dashed", color = "green", alpha = 0.7) +
  annotate("text", x = optimal_degree + 1, y = max(error_df$error) * 0.8, 
           label = paste("Optimal: Degree", optimal_degree), hjust = 0) +
  scale_y_log10() +
  scale_color_manual(values = c("Training Error" = "blue", "Test Error" = "red")) +
  labs(
    x = "Polynomial Degree",
    y = "Mean Squared Error (log scale)",
    title = "The Bias-Variance Tradeoff in Action",
    color = "",
    shape = ""
  ) +
  theme_minimal() +
  theme(legend.position = "top")

### The "Aha" Moment!

**Training error** keeps decreasing as we add complexity.

**Test error** eventually starts INCREASING!

This is **overfitting**: the model memorizes training data (including noise) and fails to generalize.

---

**Key lessons:**
1. Training error alone is misleading
2. We need held-out test data to evaluate models honestly
3. More complex isn't always better
4. The optimal complexity balances bias and variance
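
Lesson 4 can be made precise with the standard bias-variance decomposition of squared error. As a sketch, assume a new observation at a fixed point $x$ is generated as $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and that $\hat{f}$ is fitted on a random training set independent of $\varepsilon$. These symbols are notation introduced here for illustration, not names used elsewhere in the notebook.

$$
E\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

In the sine demo, $f(x) = \sin(x)$ and the noise standard deviation is 0.3, so $\sigma^2 = 0.09$. No polynomial degree can push the expected test MSE below that floor. Low degrees are dominated by the bias term and high degrees by the variance term.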

### Cross-Validation: A Better Approach

Instead of a single train/test split, let's use **cross-validation** to get more stable estimates.

In tidymodels, we use `vfold_cv()` for k-fold cross-validation.

In [None]:
# Use cross-validation to find optimal degree
degrees_to_test <- 1:11
cv_scores <- numeric(length(degrees_to_test))

# Create 5-fold cross-validation folds
n_folds <- 5
set.seed(42)
folds <- vfold_cv(sine_data, v = n_folds)

for (i in seq_along(degrees_to_test)) {
  degree <- degrees_to_test[i]
  fold_mses <- numeric(n_folds)
  
  # Manual CV loop (for educational purposes)
  for (j in seq_len(n_folds)) {
    # Get the analysis (training) and assessment (held-out) sets for this fold
    train_fold <- analysis(folds$splits[[j]])
    test_fold <- assessment(folds$splits[[j]])
    
    # Fit model
    model <- lm(y ~ poly(x, degree, raw = TRUE), data = train_fold)
    
    # Predict and calculate MSE
    pred <- predict(model, newdata = test_fold)
    fold_mses[j] <- mean((test_fold$y - pred)^2)
  }
  
  cv_scores[i] <- mean(fold_mses)
}

# Find optimal degree
optimal_cv_degree <- degrees_to_test[which.min(cv_scores)]

# Plot
cv_df <- tibble(degree = degrees_to_test, mse = cv_scores)

ggplot(cv_df, aes(x = degree, y = mse)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(color = "darkgreen", size = 3) +
  geom_vline(xintercept = optimal_cv_degree, linetype = "dashed", color = "red", alpha = 0.7) +
  annotate("text", x = optimal_cv_degree + 0.5, y = max(cv_df$mse) * 0.8,
           label = paste("CV Optimal: Degree", optimal_cv_degree), hjust = 0, color = "red") +
  labs(
    x = "Polynomial Degree",
    y = "Cross-Validation MSE",
    title = "Using Cross-Validation to Select Model Complexity"
  ) +
  theme_minimal()

cat("\n✓ Cross-validation suggests degree", optimal_cv_degree, "is optimal!\n")
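
The claim that cross-validation gives more stable estimates than a single train/test split can be checked empirically. The base-R sketch below regenerates data from the same sine process, fixes the model at degree 3, and repeats both evaluation methods over many random re-splits. The function names `single_split_mse()` and `kfold_mse()` and the choice of 50 repeats are assumptions made for this illustration.

```r
set.seed(42)

# Same data-generating process as the sine demo
n <- 30
x <- sort(runif(n, 0, 2 * pi))
y <- sin(x) + rnorm(n, 0, 0.3)
dat <- data.frame(x = x, y = y)

degree <- 3      # fixed model, so only the evaluation method varies
n_repeats <- 50  # number of random re-splits (an arbitrary choice)

# Test MSE from one random 70/30 split
single_split_mse <- function(data) {
  train_idx <- sample(nrow(data), size = floor(0.7 * nrow(data)))
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = data[train_idx, ])
  test <- data[-train_idx, ]
  mean((test$y - predict(fit, newdata = test))^2)
}

# Mean held-out MSE from one random k-fold partition
kfold_mse <- function(data, k = 5) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(k), function(j) {
    fit <- lm(y ~ poly(x, degree, raw = TRUE), data = data[fold_id != j, ])
    held_out <- data[fold_id == j, ]
    mean((held_out$y - predict(fit, newdata = held_out))^2)
  })
  mean(fold_mse)
}

split_estimates <- replicate(n_repeats, single_split_mse(dat))
cv_estimates <- replicate(n_repeats, kfold_mse(dat))

cat(sprintf("Single split: mean = %.3f, sd = %.3f\n",
            mean(split_estimates), sd(split_estimates)))
cat(sprintf("5-fold CV:    mean = %.3f, sd = %.3f\n",
            mean(cv_estimates), sd(cv_estimates)))
```

Because every observation is held out exactly once per partition, the fold-averaged estimate will usually vary less across re-splits than the single-split estimate. The exact numbers depend on the seed and the data.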

### Connection to Day 2: Regularization

Instead of choosing polynomial degree, tomorrow we'll learn a more elegant approach:

**LASSO and Ridge regression** add penalties that automatically constrain model complexity!

```
Today:     Choose degree to control complexity
Tomorrow:  Use regularization penalty (λ) to control complexity
```

The same principle applies: **constrain complexity to prevent overfitting**.
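
As a small preview of how a penalty constrains complexity, the sketch below applies the closed-form ridge solution to degree-10 polynomial features of the same sine data, using base R only. It is not the workflow Day 2 will use. The helper `ridge_coefs()` and the three λ values are assumptions chosen for illustration.

```r
set.seed(42)

# Same data-generating process as the sine demo
n <- 30
x <- sort(runif(n, 0, 2 * pi))
y <- sin(x) + rnorm(n, 0, 0.3)

# Degree-10 polynomial features, standardized so the penalty treats all columns alike
features <- scale(poly(x, 10, raw = TRUE))
# Centering y removes the need for an (unpenalized) intercept
y_centered <- y - mean(y)

# Hypothetical helper: closed-form ridge solution
# beta = (X'X + lambda * I)^(-1) X'y
ridge_coefs <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

lambdas <- c(0.01, 1, 100)
coef_norms <- sapply(lambdas, function(lambda) {
  beta <- ridge_coefs(features, y_centered, lambda)
  sqrt(sum(beta^2))  # Euclidean length of the coefficient vector
})

print(data.frame(lambda = lambdas, coef_norm = round(coef_norms, 3)))
```

The polynomial degree stays at 10 throughout. Only λ changes, and the coefficient vector shrinks as λ grows, which is the sense in which the penalty replaces the choice of degree as the complexity control.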

---

## Part 6: Save Your Work

Don't forget to save a copy of this notebook to your Google Drive!

**File → Save a copy in Drive**

## Summary

In this notebook, you:

1. ✅ Set up your R environment with tidymodels
2. ✅ Loaded and explored the workshop dataset
3. ✅ Created initial visualizations with ggplot2
4. ✅ **Saw overfitting in action** with the sine wave demo
5. ✅ Learned why train/test splits and cross-validation matter

**Key takeaway**: Complex models that fit training data perfectly often fail on new data. We need validation strategies to find the right balance.

---

**Next:** Day 2 - Regularization (LASSO, Ridge) & Tree-Based Methods