# Day 1: Project Setup & ML Demo

**WISE Workshop | Addis Ababa, Feb 2026**

In this notebook, you'll:
1. Set up a reproducible analysis project
2. Explore the workshop dataset
3. **See overfitting in action** with a sine wave demonstration

## Part 1: Environment Setup

In [None]:
# Check environment
import sys
print(f"Python: {sys.version}")
print(f"Environment: {'Colab' if 'google.colab' in sys.modules else 'Local'}")

In [None]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', 50)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Packages loaded successfully!")

## Part 2: Load the Workshop Dataset

In [None]:
# Load supply chain data from GitHub
url = "https://raw.githubusercontent.com/sysylvia/ethiopia-ds-workshop-2026/main/data/supply-chain-sample.csv"

try:
    df = pd.read_csv(url)
    print(f"Data loaded successfully! Shape: {df.shape}")
except Exception:  # a bare except would also swallow KeyboardInterrupt
    # If URL not available, create sample data
    print("Creating sample data for demonstration...")
    np.random.seed(42)
    n_rows = 500
    
    df = pd.DataFrame({
        'facility_id': np.random.choice(['ETH001', 'ETH002', 'ETH003', 'ETH004', 'ETH005'], n_rows),
        'region': np.random.choice(['Addis Ababa', 'Oromia', 'Amhara', 'SNNP', 'Tigray'], n_rows),
        'facility_type': np.random.choice(['Hospital', 'Health Center', 'Clinic'], n_rows, p=[0.2, 0.5, 0.3]),
        'date': pd.date_range('2023-01-01', periods=n_rows, freq='D').strftime('%Y-%m'),
        'medication_class': np.random.choice(['Antibiotics', 'Antimalarials', 'Chronic Disease', 'Vaccines', 'Other'], n_rows),
        'demand': np.random.poisson(100, n_rows) + np.random.randint(0, 50, n_rows),
        'stock_level': np.random.poisson(150, n_rows),
        'lead_time_days': np.random.choice([7, 14, 21, 30], n_rows, p=[0.3, 0.4, 0.2, 0.1])
    })
    print(f"Sample data created! Shape: {df.shape}")

## Part 3: Data Exploration

In [None]:
# First look at the data
df.head()

In [None]:
# Data types and missing values
df.info()

In [None]:
# Summary statistics
df.describe()

In [None]:
# Check categorical variables
for col in ['region', 'facility_type', 'medication_class']:
    print(f"\n{col}:")
    print(df[col].value_counts())

## Part 4: Initial Visualizations

In [None]:
# Distribution of demand
fig, ax = plt.subplots(figsize=(10, 4))
df['demand'].hist(bins=30, ax=ax, edgecolor='black')
ax.set_xlabel('Demand (units)')
ax.set_ylabel('Frequency')
ax.set_title('Distribution of Demand')
plt.tight_layout()
plt.show()

In [None]:
# Demand by region
fig, ax = plt.subplots(figsize=(10, 5))
df.groupby('region')['demand'].mean().sort_values().plot(kind='barh', ax=ax)
ax.set_xlabel('Average Demand')
ax.set_ylabel('Region')
ax.set_title('Average Demand by Region')
plt.tight_layout()
plt.show()

In [None]:
# Demand by facility type
fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=df, x='facility_type', y='demand', ax=ax)
ax.set_xlabel('Facility Type')
ax.set_ylabel('Demand')
ax.set_title('Demand Distribution by Facility Type')
plt.tight_layout()
plt.show()

---

## Part 5: Understanding Overfitting with the Sine Wave

Now let's see the **bias-variance tradeoff** in action! We'll generate noisy data from a sine wave and try to fit polynomials of increasing complexity.

**Key questions:**
- When does the model fit too much noise?
- Why does training error alone mislead us?
- How do we find the right complexity?

In [None]:
# Generate sine wave data with noise
np.random.seed(42)

n = 30  # number of data points
X = np.sort(np.random.uniform(0, 2 * np.pi, n))  # random x values
y_true = np.sin(X)  # the true underlying function
y = y_true + np.random.normal(0, 0.3, n)  # add noise

print(f"Generated {n} noisy observations from sin(x)")

In [None]:
# Visualize the data and true function
fig, ax = plt.subplots(figsize=(10, 5))

# Plot true function
x_smooth = np.linspace(0, 2 * np.pi, 100)
ax.plot(x_smooth, np.sin(x_smooth), 'r--', linewidth=2, label='True function: sin(x)')

# Plot noisy observations
ax.scatter(X, y, s=60, c='blue', edgecolors='black', zorder=5, label='Observed data (noisy)')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('The Challenge: Recover the True Pattern from Noisy Data')
ax.legend()
plt.tight_layout()
plt.show()

### Fitting Polynomials of Increasing Degree

Let's fit polynomials with degrees 1 (linear), 3, 10, and 20 to see what happens.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

degrees = [1, 3, 10, 20]
colors = ['green', 'orange', 'purple', 'red']

fig, ax = plt.subplots(figsize=(12, 6))

# Plot data and true function
ax.scatter(X, y, s=60, c='blue', edgecolors='black', zorder=5, label='Data')
ax.plot(x_smooth, np.sin(x_smooth), 'r--', linewidth=2, alpha=0.5, label='True: sin(x)')

# Fit and plot each polynomial
for degree, color in zip(degrees, colors):
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X.reshape(-1, 1))
    
    # Fit linear regression on polynomial features
    model = LinearRegression()
    model.fit(X_poly, y)
    
    # Predict on smooth x for plotting
    X_smooth_poly = poly.transform(x_smooth.reshape(-1, 1))
    y_pred_smooth = model.predict(X_smooth_poly)
    
    # Calculate training MSE
    y_pred_train = model.predict(X_poly)
    train_mse = mean_squared_error(y, y_pred_train)
    
    ax.plot(x_smooth, y_pred_smooth, color=color, linewidth=2, 
            label=f'Degree {degree} (Train MSE: {train_mse:.3f})')

ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('Polynomial Fits of Increasing Complexity', fontsize=14)
ax.legend(loc='upper right')
ax.set_ylim(-2, 2)
plt.tight_layout()
plt.show()

### 🤔 What do you notice?

- **Degree 1 (green)**: Too simple! Misses the curve entirely.
- **Degree 3 (orange)**: Captures the sine pattern reasonably well.
- **Degree 10+ (purple, red)**: Starts wiggling through individual points.

**But look at the training MSE!** Higher-degree polynomials have *lower* training error. Does that mean they're better?

### Train vs Test Error: The Real Test

Let's split our data and see what happens on **held-out test data**.

In [None]:
from sklearn.model_selection import train_test_split

# Split data: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# Compare train vs test error across polynomial degrees
degrees_to_test = range(1, 16)
train_errors = []
test_errors = []

for degree in degrees_to_test:
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train.reshape(-1, 1))
    X_test_poly = poly.transform(X_test.reshape(-1, 1))
    
    # Fit model
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    
    # Calculate errors
    train_pred = model.predict(X_train_poly)
    test_pred = model.predict(X_test_poly)
    
    train_errors.append(mean_squared_error(y_train, train_pred))
    test_errors.append(mean_squared_error(y_test, test_pred))

# Plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(degrees_to_test, train_errors, 'b-o', linewidth=2, markersize=8, label='Training Error')
ax.plot(degrees_to_test, test_errors, 'r-s', linewidth=2, markersize=8, label='Test Error')

# Find optimal degree
optimal_degree = degrees_to_test[np.argmin(test_errors)]
ax.axvline(x=optimal_degree, color='green', linestyle='--', alpha=0.7, 
           label=f'Optimal: Degree {optimal_degree}')

ax.set_xlabel('Polynomial Degree', fontsize=12)
ax.set_ylabel('Mean Squared Error', fontsize=12)
ax.set_title('The Bias-Variance Tradeoff in Action', fontsize=14)
ax.legend()
ax.set_yscale('log')  # Log scale to see the pattern better
plt.tight_layout()
plt.show()

### 💡 The "Aha" Moment!

**Training error** keeps decreasing as we add complexity.

**Test error** eventually starts INCREASING!

This is **overfitting**: the model memorizes training data (including noise) and fails to generalize.
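To make "memorizes training data" concrete, here is a minimal sketch of the extreme case. It assumes 8 evenly spaced points and a degree-7 polynomial (both chosen only for illustration, not taken from the cells above), so the fit has as many coefficients as there are observations.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Eight evenly spaced noisy observations of sin(x) (illustrative values)
x_small = np.linspace(0, 2 * np.pi, 8)
y_small = np.sin(x_small) + rng.normal(0, 0.3, x_small.size)

# A degree-7 polynomial has 8 coefficients, so it can pass through all 8 points
interpolant = np.polynomial.Polynomial.fit(x_small, y_small, deg=7)
train_mse = np.mean((interpolant(x_small) - y_small) ** 2)

# Fresh noisy observations drawn from the same process
x_new = rng.uniform(0, 2 * np.pi, 200)
y_new = np.sin(x_new) + rng.normal(0, 0.3, x_new.size)
new_mse = np.mean((interpolant(x_new) - y_new) ** 2)

print(f"Training MSE:      {train_mse:.2e}")
print(f"MSE on fresh data: {new_mse:.3f}")
```

The training error is essentially zero because the curve reproduces every noisy point, yet the error on fresh data stays well above zero: the noise the model memorized does not repeat.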

---

**Key lessons:**
1. Training error alone is misleading
2. We need held-out test data to evaluate models honestly
3. More complex isn't always better
4. The optimal complexity balances bias and variance
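The fourth lesson can be checked numerically. The sketch below is a rough simulation, not part of the original demo: it redraws the noisy sine data many times, refits each polynomial, and estimates squared bias (how far the *average* fit is from sin(x)) and variance (how much individual fits scatter around that average). The function name `simulate`, the evaluation grid, the number of repeats, and the degrees compared are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Evaluate fits on interior points of the range (an illustrative choice)
x_grid = np.linspace(0.5, 2 * np.pi - 0.5, 50)
f_true = np.sin(x_grid)

def simulate(degree, n_points=30, noise_sd=0.3, n_repeats=200):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training set from sin(x)
        x = rng.uniform(0, 2 * np.pi, n_points)
        y = np.sin(x) + rng.normal(0, noise_sd, n_points)
        # Least-squares polynomial fit, evaluated on the common grid
        fit = np.polynomial.Polynomial.fit(x, y, degree)
        preds[i] = fit(x_grid)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_true) ** 2)   # systematic error of the average fit
    variance = np.mean(preds.var(axis=0))          # sensitivity to the particular sample
    return bias_sq, variance

results = {d: simulate(d) for d in (1, 3, 9)}
for d, (bias_sq, variance) in results.items():
    print(f"degree {d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Under these assumptions, the straight line should show the largest squared bias and the degree-9 fit the largest variance; a middle degree keeps both small.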

### Cross-Validation: A Better Approach

Instead of a single train/test split, let's use **cross-validation** to get more stable estimates.
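Before relying on a helper function, it may help to see what k-fold cross-validation does step by step. The following is a minimal sketch on stand-in data that mirrors the sine-wave setup; `manual_cv_mse`, `x_demo`, and `y_demo` are hypothetical names introduced here. Folds are shuffled because the x values are sorted, and unshuffled folds would otherwise be contiguous x-ranges.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data mirroring the sine-wave setup (values are illustrative)
rng = np.random.default_rng(42)
x_demo = np.sort(rng.uniform(0, 2 * np.pi, 30))
y_demo = np.sin(x_demo) + rng.normal(0, 0.3, 30)

def manual_cv_mse(x, y, degree, n_splits=5, seed=42):
    """Return the mean held-out MSE and the per-fold MSEs for one polynomial degree."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, val_idx in kf.split(x):
        # Fit on k-1 folds...
        poly = PolynomialFeatures(degree)
        X_tr = poly.fit_transform(x[train_idx].reshape(-1, 1))
        model = LinearRegression().fit(X_tr, y[train_idx])
        # ...and score on the fold that was held out
        X_val = poly.transform(x[val_idx].reshape(-1, 1))
        fold_mse.append(mean_squared_error(y[val_idx], model.predict(X_val)))
    return float(np.mean(fold_mse)), fold_mse

mean_mse, per_fold = manual_cv_mse(x_demo, y_demo, degree=3)
print(f"Per-fold MSE: {[round(m, 3) for m in per_fold]}")
print(f"Mean CV MSE:  {mean_mse:.3f}")
```

Every observation is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single train/test division does.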

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import make_pipeline

# Use cross-validation to find optimal degree
degrees_to_test = range(1, 12)
cv_scores = []

# X is sorted, so unshuffled folds would be contiguous x-ranges and the
# outer folds would test extrapolation; shuffle to mix points across folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

for degree in degrees_to_test:
    # Create pipeline: polynomial features + linear regression
    pipeline = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    
    # 5-fold cross-validation (sklearn scorers are "higher is better", hence negative MSE)
    scores = cross_val_score(pipeline, X.reshape(-1, 1), y, 
                            cv=cv, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Convert back to positive MSE

# Plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(degrees_to_test, cv_scores, 'g-o', linewidth=2, markersize=8)

optimal_cv_degree = degrees_to_test[np.argmin(cv_scores)]
ax.axvline(x=optimal_cv_degree, color='red', linestyle='--', alpha=0.7,
           label=f'CV Optimal: Degree {optimal_cv_degree}')

ax.set_xlabel('Polynomial Degree', fontsize=12)
ax.set_ylabel('Cross-Validation MSE', fontsize=12)
ax.set_title('Using Cross-Validation to Select Model Complexity', fontsize=14)
ax.legend()
plt.tight_layout()
plt.show()

print(f"\n✓ Cross-validation suggests degree {optimal_cv_degree} is optimal!")

### Connection to Day 2: Regularization

Instead of choosing polynomial degree, tomorrow we'll learn a more elegant approach:

**LASSO and Ridge regression** add penalties that automatically constrain model complexity!

```
Today:     Choose degree to control complexity
Tomorrow:  Use regularization penalty (λ) to control complexity
```

The same principle applies: **constrain complexity to prevent overfitting**.
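As a small preview, the sketch below keeps the polynomial degree fixed at 10 and lets a Ridge penalty do the constraining. It is an illustration under stated assumptions rather than tomorrow's material: the stand-in data, the helper `coef_norm`, the scaling step, and the penalty value are all choices made here. Note that scikit-learn calls the penalty strength `alpha` rather than λ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Stand-in data mirroring the sine-wave setup (values are illustrative)
rng = np.random.default_rng(42)
x_demo = rng.uniform(0, 2 * np.pi, 30)
y_demo = np.sin(x_demo) + rng.normal(0, 0.3, 30)
X_demo = x_demo.reshape(-1, 1)

degree = 10  # deliberately too flexible

def coef_norm(regressor):
    """Fit a degree-10 polynomial with the given regressor; return the coefficient norm."""
    pipe = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),  # assumption: scale features, since penalties are scale-sensitive
        regressor,
    )
    pipe.fit(X_demo, y_demo)
    return float(np.linalg.norm(pipe.steps[-1][1].coef_))

norm_ols = coef_norm(LinearRegression())
norm_ridge = coef_norm(Ridge(alpha=1.0))  # alpha plays the role of λ; 1.0 is arbitrary

print(f"Coefficient norm without penalty: {norm_ols:.2f}")
print(f"Coefficient norm with Ridge:      {norm_ridge:.2f}")
```

The penalty shrinks the coefficients, which smooths the fitted curve without changing the degree. Choosing `alpha` is itself a job for cross-validation.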

---

## Part 6: Save Your Work

If you are running in Colab, don't forget to save a copy of this notebook to your Google Drive!

**File > Save a copy in Drive**

## Summary

In this notebook, you:

1. ✅ Set up your Python environment
2. ✅ Loaded and explored the workshop dataset
3. ✅ Created initial visualizations
4. ✅ **Saw overfitting in action** with the sine wave demo
5. ✅ Learned why train/test splits and cross-validation matter

**Key takeaway**: Complex models that fit training data perfectly often fail on new data. We need validation strategies to find the right balance.

---

**Next:** Day 2 - Regularization (LASSO, Ridge) & Tree-Based Methods