# Day 1: Project Setup & ML Demo

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, you'll:
1. Set up a reproducible analysis project
2. Explore the workshop dataset
3. **See overfitting in action** with a sine wave demonstration

## Part 1: Environment Setup

In [None]:
# Check environment
import sys
print(f"Python: {sys.version}")
print(f"Environment: {'Colab' if 'google.colab' in sys.modules else 'Local'}")

In [None]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', 50)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Packages loaded successfully!")

## Part 2: Load the Workshop Dataset

In [None]:
# Load supply chain data from GitHub
url = "https://raw.githubusercontent.com/sysylvia/ethiopia-ds-workshop-2026/main/data/supply-chain-sample.csv"

try:
    df = pd.read_csv(url)
    print(f"Data loaded successfully! Shape: {df.shape}")
except Exception:  # e.g. no network access or the URL has moved
    # If URL not available, create sample data matching the real data structure
    print("Creating sample data for demonstration...")
    np.random.seed(42)
    n_rows = 500
    
    df = pd.DataFrame({
        'facility_id': [f'F{str(i).zfill(3)}' for i in np.random.choice(range(1, 51), n_rows)],
        'region': np.random.choice(['Addis Ababa', 'Oromia', 'Amhara', 'SNNP', 'Tigray'], n_rows),
        'facility_type': np.random.choice(['Hospital', 'Health Center', 'Clinic'], n_rows, p=[0.2, 0.5, 0.3]),
        'season': np.random.choice(['dry', 'rainy'], n_rows),
        'month': np.random.choice(range(1, 13), n_rows),
        'previous_demand': np.random.poisson(100, n_rows) + np.random.randint(0, 50, n_rows),
        'actual_demand': np.random.poisson(100, n_rows) + np.random.randint(0, 50, n_rows),
        'distance_to_warehouse': np.random.randint(10, 500, n_rows),
        'avg_delivery_days': np.random.choice([3, 5, 7, 10, 14], n_rows)
    })
    print(f"Sample data created! Shape: {df.shape}")

## Part 3: Data Exploration

In [None]:
# First look at the data
df.head()

In [None]:
# Data types and missing values
df.info()

In [None]:
# Summary statistics
df.describe()

In [None]:
# Check categorical variables
for col in ['region', 'facility_type', 'season']:
    print(f"\n{col}:")
    print(df[col].value_counts())

## Part 4: Initial Visualizations

In [None]:
# Distribution of demand
fig, ax = plt.subplots(figsize=(10, 4))
df['actual_demand'].hist(bins=30, ax=ax, edgecolor='black')
ax.set_xlabel('Demand (units)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Demand')
plt.tight_layout()
plt.show()

In [None]:
# Demand by region
fig, ax = plt.subplots(figsize=(10, 5))
df.groupby('region')['actual_demand'].mean().sort_values().plot(kind='barh', ax=ax)
ax.set_xlabel('Average Demand')
ax.set_ylabel('Region')
ax.set_title('Average Demand by Region')
plt.tight_layout()
plt.show()

In [None]:
# Demand by facility type
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df, x='facility_type', y='actual_demand', ax=ax)
ax.set_xlabel('Facility Type')
ax.set_ylabel('Demand')
ax.set_title('Demand Distribution by Facility Type')
plt.tight_layout()
plt.show()

---

## Part 5: Understanding Overfitting with the Sine Wave

Now let's see the **bias-variance tradeoff** in action! We'll generate noisy data from a sine wave and try to fit polynomials of increasing complexity.

**Key questions:**
- When does the model fit too much noise?
- Why does training error alone mislead us?
- How do we find the right complexity?

In [None]:
# Generate sine wave data with noise
np.random.seed(42)

n = 30  # number of data points
X = np.sort(np.random.uniform(0, 2 * np.pi, n))  # random x values
y_true = np.sin(X)  # the true underlying function
y = y_true + np.random.normal(0, 0.3, n)  # add noise

print(f"Generated {n} noisy observations from sin(x)")

In [None]:
# Visualize the data and true function
fig, ax = plt.subplots(figsize=(10, 5))

# Plot true function
x_smooth = np.linspace(0, 2 * np.pi, 100)
ax.plot(x_smooth, np.sin(x_smooth), 'r--', linewidth=2, label='True function: sin(x)')

# Plot noisy observations
ax.scatter(X, y, s=60, c='blue', edgecolors='black', zorder=5, label='Observed data (noisy)')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('The Challenge: Recover the True Pattern from Noisy Data')
ax.legend()
plt.tight_layout()
plt.show()

### Population vs Sample: The Heart of Out-of-Sample Prediction

In the real world, we only observe a **sample** from a larger **population**. Our goal is to learn patterns that generalize to the entire population, not just to memorize our sample.

Let's make this concrete:
- **Population**: 200 points from sin(x) + noise (imagine this is "all possible data")
- **Sample**: We only get to see 30 of these points for training
- **Test**: Evaluate how well our model predicts the other 170 points

In [None]:
# Generate the POPULATION: many more points
np.random.seed(42)

n_population = 200  # Full population
n_sample = 30       # Our training sample

# Generate population data
X_pop = np.sort(np.random.uniform(0, 2 * np.pi, n_population))
y_true_pop = np.sin(X_pop)
y_pop = y_true_pop + np.random.normal(0, 0.3, n_population)

# Take a random sample for training
sample_idx = np.random.choice(n_population, n_sample, replace=False)
sample_idx = np.sort(sample_idx)

X_sample = X_pop[sample_idx]
y_sample = y_pop[sample_idx]

# The "unseen" population points (for testing)
unseen_idx = np.setdiff1d(np.arange(n_population), sample_idx)
X_unseen = X_pop[unseen_idx]
y_unseen = y_pop[unseen_idx]

print(f"Population: {n_population} points")
print(f"Sample (training): {n_sample} points")
print(f"Unseen (testing): {len(unseen_idx)} points")

In [None]:
# Visualize population vs sample
fig, ax = plt.subplots(figsize=(12, 5))

# Plot unseen population (light gray)
ax.scatter(X_unseen, y_unseen, s=40, c='lightgray', alpha=0.5,
           label=f'Unseen population ({len(unseen_idx)} points)')

# Plot training sample (blue)
ax.scatter(X_sample, y_sample, s=80, c='blue', edgecolors='black',
           zorder=5, label=f'Training sample ({n_sample} points)')

# Plot true function
ax.plot(x_smooth, np.sin(x_smooth), 'r--', linewidth=2, alpha=0.7,
        label='True: sin(x)')

ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Population vs Sample: We Only See the Blue Points!', fontsize=14)
ax.legend()
plt.tight_layout()
plt.show()

print("\nWe train on the BLUE points, but want to predict the GRAY points well!")

### Fitting Polynomials of Increasing Degree

Let's fit polynomials with degrees 1 (linear), 3, 10, and 20 to see what happens. We'll add each one step by step to watch the progression from underfitting to overfitting.

In [None]:
# Setup: imports and storage for fitted models
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Store fitted models and predictions for cumulative plotting
fitted_models = {}
colors = {1: 'green', 3: 'orange', 10: 'purple', 20: 'red'}

def fit_and_plot_polynomial(degree, fitted_models):
    """Fit a polynomial to the SAMPLE and plot all models fitted so far."""
    # Fit the new polynomial on SAMPLE data
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X_sample.reshape(-1, 1))
    model = LinearRegression()
    model.fit(X_poly, y_sample)
    
    # Calculate training MSE (on sample)
    y_pred_train = model.predict(X_poly)
    train_mse = mean_squared_error(y_sample, y_pred_train)
    
    # Store model info
    X_smooth_poly = poly.transform(x_smooth.reshape(-1, 1))
    y_pred_smooth = model.predict(X_smooth_poly)
    fitted_models[degree] = {'predictions': y_pred_smooth, 'mse': train_mse, 'poly': poly, 'model': model}
    
    # Create plot with all models so far
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Plot unseen population in background (light gray)
    ax.scatter(X_unseen, y_unseen, s=30, c='lightgray', alpha=0.3, label='Unseen population')
    
    # Plot sample data and true function
    ax.scatter(X_sample, y_sample, s=60, c='blue', edgecolors='black', zorder=5, label='Training sample')
    ax.plot(x_smooth, np.sin(x_smooth), 'r--', linewidth=2, alpha=0.5, label='True: sin(x)')
    
    # Plot all fitted models
    for d in sorted(fitted_models.keys()):
        info = fitted_models[d]
        ax.plot(x_smooth, info['predictions'], color=colors[d], linewidth=2, 
                label=f'Degree {d} (Sample MSE: {info["mse"]:.3f})')
    
    ax.set_xlabel('x', fontsize=12)
    ax.set_ylabel('y', fontsize=12)
    ax.set_title(f'Polynomial Fits on Sample Data (Now showing degree {degree})', fontsize=14)
    ax.legend(loc='upper right')
    ax.set_ylim(-2, 2)
    plt.tight_layout()
    plt.show()
    
    return train_mse

print("Helper function ready. Let's fit polynomials one at a time!")
print("(Notice: We're training on the SAMPLE, with the unseen population shown in gray)")

#### Degree 1: Linear fit (too simple?)

In [None]:
# Fit degree 1 polynomial (linear)
mse = fit_and_plot_polynomial(1, fitted_models)
print(f"\nDegree 1 (Linear): Training MSE = {mse:.3f}")
print("Notice: The line can't capture the curve at all! This is UNDERFITTING.")

#### Degree 3: A cubic polynomial (getting better?)

In [None]:
# Fit degree 3 polynomial (cubic)
mse = fit_and_plot_polynomial(3, fitted_models)
print(f"\nDegree 3 (Cubic): Training MSE = {mse:.3f}")
print("Better! The cubic captures the wave pattern. MSE dropped significantly.")

#### Degree 10: More flexibility (is more always better?)

In [None]:
# Fit degree 10 polynomial
mse = fit_and_plot_polynomial(10, fitted_models)
print(f"\nDegree 10: Training MSE = {mse:.3f}")
print("Hmm... MSE is even lower, but look at those wiggles! Is it fitting the data or the noise?")

#### Degree 20: Maximum flexibility (surely this is best?)

In [None]:
# Fit degree 20 polynomial
mse = fit_and_plot_polynomial(20, fitted_models)
print(f"\nDegree 20: Training MSE = {mse:.3f}")
print("Lowest training error yet! But look at those wild oscillations...")
print("This model is memorizing the noise, not learning the pattern. This is OVERFITTING!")

### Sample Error vs Population Error: The Truth Revealed

Now let's see what happens when we evaluate our polynomial models on:
1. **Sample (training data)**: The 30 points we used to fit
2. **Unseen population**: The 170 points we never saw

This reveals the true cost of overfitting!

In [None]:
# Compare sample error vs population error for different polynomial degrees
degrees_to_compare = [1, 3, 10, 20]
results = []

for degree in degrees_to_compare:
    poly = PolynomialFeatures(degree)
    X_sample_poly = poly.fit_transform(X_sample.reshape(-1, 1))
    model = LinearRegression()
    model.fit(X_sample_poly, y_sample)

    # Error on sample (training)
    y_sample_pred = model.predict(X_sample_poly)
    sample_mse = mean_squared_error(y_sample, y_sample_pred)

    # Error on unseen population
    X_unseen_poly = poly.transform(X_unseen.reshape(-1, 1))
    y_unseen_pred = model.predict(X_unseen_poly)
    unseen_mse = mean_squared_error(y_unseen, y_unseen_pred)

    results.append({
        'Degree': degree,
        'Sample MSE': round(sample_mse, 4),
        'Unseen MSE': round(unseen_mse, 4)
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

In [None]:
# Visualize the gap between sample and population error
fig, ax = plt.subplots(figsize=(10, 6))

x_pos = np.arange(len(degrees_to_compare))
width = 0.35

bars1 = ax.bar(x_pos - width/2, results_df['Sample MSE'], width,
               label='Sample (Training) MSE', color='blue', alpha=0.7)
bars2 = ax.bar(x_pos + width/2, results_df['Unseen MSE'], width,
               label='Unseen (Test) MSE', color='red', alpha=0.7)

ax.set_xlabel('Polynomial Degree', fontsize=12)
ax.set_ylabel('Mean Squared Error', fontsize=12)
ax.set_title('The Overfitting Gap: Sample Error vs Population Error', fontsize=14)
ax.set_xticks(x_pos)
ax.set_xticklabels(degrees_to_compare)
ax.legend()

plt.tight_layout()
plt.show()

print("\nKey insight: sample error keeps falling as degree increases, but unseen error eventually rises sharply!")
print("The gap between blue and red bars is the 'overfitting penalty'.")

### Population vs Sample: Key Takeaways

1. **We only see a sample** of the true population
2. **Training error** measures fit on the sample we saw
3. **Test/population error** measures generalization to unseen data
4. **Overfitting** = great on sample, terrible on population
5. **The goal**: Find a model that performs well on *unseen* data

This is why we need train/test splits, cross-validation, and regularization!

### 🤔 What do you notice?

- **Degree 1 (green)**: Too simple! Misses the curve entirely.
- **Degree 3 (orange)**: Captures the sine pattern reasonably well.
- **Degree 10+ (purple, red)**: Starts wiggling through individual points.

**But look at the training MSE!** Higher-degree polynomials have *lower* training error. Does that mean they're better?
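
One reason to be suspicious: the drop in training error is guaranteed by construction, whatever the data look like. A degree-$d$ polynomial is just a degree-$(d+1)$ polynomial whose top coefficient is zero, so least squares over the larger family can never do worse on the training sample. In notation introduced only for this note ($n$ training points, coefficients $\beta_j$):

$$
\min_{\beta_0,\dots,\beta_{d+1}} \sum_{i=1}^{n}\Big(y_i - \sum_{j=0}^{d+1}\beta_j x_i^j\Big)^2
\;\le\;
\min_{\beta_0,\dots,\beta_{d}} \sum_{i=1}^{n}\Big(y_i - \sum_{j=0}^{d}\beta_j x_i^j\Big)^2
$$

The inequality says nothing about how either fit behaves on points outside the training sample.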

### Train vs Test Error: The Real Test

Let's split our data and see what happens on **held-out test data**.

In [None]:
from sklearn.model_selection import train_test_split

# Split the SAMPLE data: 70% train, 30% test
# (This mimics what we'd do in practice when we only have sample data)
X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print("\n(Note: We're splitting our sample to simulate train/test validation)")

In [None]:
# Compare train vs test error across polynomial degrees
degrees_to_test = range(1, 16)
train_errors = []
test_errors = []

for degree in degrees_to_test:
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))
    
    # Fit model
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    
    # Calculate errors
    train_pred = model.predict(X_train_poly)
    test_pred = model.predict(X_test_poly)
    
    train_errors.append(mean_squared_error(y_train, train_pred))
    test_errors.append(mean_squared_error(y_test, test_pred))

# Plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(degrees_to_test, train_errors, 'b-o', linewidth=2, markersize=8, label='Training Error')
ax.plot(degrees_to_test, test_errors, 'r-s', linewidth=2, markersize=8, label='Test Error')

# Find optimal degree
optimal_degree = degrees_to_test[np.argmin(test_errors)]
ax.axvline(x=optimal_degree, color='green', linestyle='--', alpha=0.7, 
           label=f'Optimal: Degree {optimal_degree}')

ax.set_xlabel('Polynomial Degree', fontsize=12)
ax.set_ylabel('Mean Squared Error', fontsize=12)
ax.set_title('The Bias-Variance Tradeoff in Action', fontsize=14)
ax.legend()
ax.set_yscale('log')  # Log scale to see the pattern better
plt.tight_layout()
plt.show()

### 💡 The "Aha" Moment!

**Training error** keeps decreasing as we add complexity.

**Test error** eventually starts INCREASING!

This is **overfitting**: the model memorizes training data (including noise) and fails to generalize.

---

**Key lessons:**
1. Training error alone is misleading
2. We need held-out test data to evaluate models honestly
3. More complex isn't always better
4. The optimal complexity balances bias and variance
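
Lesson 4 can be made concrete with a small simulation. The sketch below is not part of the workshop code: it repeatedly draws fresh training samples from sin(x) plus noise (noise level 0.3 as above), fits a polynomial each time with `np.polyfit`, and estimates squared bias and variance on a fixed grid. The number of repeats, the evaluation grid, and the helper name `bias_variance` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 2 * np.pi, 50)   # fixed points where predictions are compared
f_true = np.sin(x_grid)                  # the true function on that grid

def bias_variance(degree, n_train=30, n_repeats=200, noise_sd=0.3):
    """Monte Carlo estimate of (squared bias, variance) for one polynomial degree."""
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Draw a fresh training sample from the same process each time
        x = rng.uniform(0, 2 * np.pi, n_train)
        y = np.sin(x) + rng.normal(0, noise_sd, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)   # how far the average fit is from the truth
    variance = np.mean(preds.var(axis=0))          # how much fits move between samples
    return bias_sq, variance

for d in (1, 3, 9):
    b, v = bias_variance(d)
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

With settings like these, one would expect squared bias to fall and variance to grow as the degree increases, although the exact numbers depend on the seed and the grid.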

### Cross-Validation: A Better Approach

Instead of a single train/test split, let's use **cross-validation** to get more stable estimates.
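
Before handing the job to scikit-learn, it may help to see what k-fold cross-validation does step by step. The following is a minimal NumPy-only sketch, not the workshop's implementation: it uses freshly generated toy data rather than `X_sample`, and `kfold_mse` is a hypothetical helper written for this illustration. Each point is held out exactly once, and the held-out errors are averaged.

```python
import numpy as np

# Toy data for illustration only (assumed: same sin(x) + noise process as the demo)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 30)
y = np.sin(x) + rng.normal(0, 0.3, 30)

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average held-out MSE of a polynomial fit over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(x))  # shuffle before splitting
    folds = np.array_split(idx, k)                          # k roughly equal index groups
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)  # fit on the other k-1 folds
        pred = np.polyval(coefs, x[test_idx])                    # predict the held-out fold
        fold_mse.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_mse))

for d in (1, 3, 6):
    print(f"degree {d}: CV MSE = {kfold_mse(x, y, d):.3f}")
```

Because every observation serves as test data once, the estimate depends less on one lucky or unlucky split than a single train/test partition does.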

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Use cross-validation on the SAMPLE to find optimal degree
degrees_to_test = range(1, 12)
cv_scores = []

# X_sample is sorted by x, so unshuffled folds would be contiguous blocks and
# the first and last folds would test pure extrapolation. Shuffle the folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

for degree in degrees_to_test:
    # Create pipeline: polynomial features + linear regression
    pipeline = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    
    # 5-fold cross-validation (scores are negative MSE because sklearn maximizes scores)
    scores = cross_val_score(pipeline, X_sample.reshape(-1, 1), y_sample, 
                            cv=cv, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Convert back to positive MSE

# Plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(degrees_to_test, cv_scores, 'g-o', linewidth=2, markersize=8)

optimal_cv_degree = degrees_to_test[np.argmin(cv_scores)]
ax.axvline(x=optimal_cv_degree, color='red', linestyle='--', alpha=0.7,
           label=f'CV Optimal: Degree {optimal_cv_degree}')

ax.set_xlabel('Polynomial Degree', fontsize=12)
ax.set_ylabel('Cross-Validation MSE', fontsize=12)
ax.set_title('Using Cross-Validation to Select Model Complexity', fontsize=14)
ax.legend()
plt.tight_layout()
plt.show()

print(f"\n✓ Cross-validation suggests degree {optimal_cv_degree} is optimal!")

### Connection to Day 2: Regularization

Instead of choosing polynomial degree, tomorrow we'll learn a more elegant approach:

**LASSO and Ridge regression** add penalties that automatically constrain model complexity!

```
Today:     Choose degree to control complexity
Tomorrow:  Use regularization penalty (λ) to control complexity
```

The same principle applies: **constrain complexity to prevent overfitting**.
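
As a small preview, here is a hedged sketch of how a penalty constrains a flexible model. It is not tomorrow's code: it fits a degree-10 polynomial by closed-form ridge regression in plain NumPy on toy data, and `ridge_poly_coefs` and the λ values are assumptions chosen for illustration. The point to watch is that the coefficient vector shrinks as λ grows, even though the degree stays fixed.

```python
import numpy as np

# Toy data for illustration only (assumed: sin(x) + noise, as in today's demo)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, 30)
y = np.sin(x) + rng.normal(0, 0.3, 30)

def ridge_poly_coefs(x, y, degree, lam):
    """Closed-form ridge fit on standardized polynomial features (intercept unpenalized)."""
    X = np.vander(x, degree + 1, increasing=True)[:, 1:]  # columns x, x^2, ..., x^degree
    X = (X - X.mean(axis=0)) / X.std(axis=0)              # standardize so the penalty treats columns alike
    y_centered = y - y.mean()                             # centering absorbs the intercept
    # Solve (X'X + lam * I) beta = X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(degree), X.T @ y_centered)

for lam in (1e-3, 1e-1, 10.0):
    beta = ridge_poly_coefs(x, y, degree=10, lam=lam)
    print(f"lambda = {lam:g}: coefficient norm = {np.linalg.norm(beta):.3f}")
```

In scikit-learn, the same penalty strength is exposed as the `alpha` parameter of `Ridge`.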

---

## Part 6: Save Your Work

If you're working in Colab, don't forget to save a copy of this notebook to your Google Drive!

**File > Save a copy in Drive**

## Summary

In this notebook, you:

1. ✅ Set up your Python environment
2. ✅ Loaded and explored the workshop dataset
3. ✅ Created initial visualizations
4. ✅ **Saw overfitting in action** with the sine wave demo
5. ✅ Learned why train/test splits and cross-validation matter

**Key takeaway**: Complex models that fit training data perfectly often fail on new data. We need validation strategies to find the right balance.

---

**Next:** Day 2 - Regularization (LASSO, Ridge) & Tree-Based Methods