# Week 14 Workshop: Build Your First Machine Learning Model

## Water Consumption Dataset

### Objectives
1. Prepare data for machine learning
2. Train a Decision Tree model (regression)
3. Evaluate performance and detect overfitting
4. Experiment with max_depth to find the best model

### Duration: 2 hours

---

## Setup

Run this cell to load all necessary libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score
)

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

In [None]:
# Load Water Consumption dataset from datos.gov.co
url = "https://www.datos.gov.co/resource/k9gy-47jj.csv?$limit=10000"

df = pd.read_csv(url)

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print("\nColumns:")
print(df.columns.tolist())

In [None]:
# Preview the data
df.head()

In [None]:
# Check data types and missing values
print("Data Types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nMissing Values:")
print(df.isnull().sum())

In [None]:
# Numeric summary
df.describe()

---

# Part 1: Data Preparation

Prepare your data for modeling.

---

## 1.1 Identify Numeric Columns

In [None]:
# Identify column types
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

print("Numeric columns:")
for col in numeric_cols:
    print(f"  - {col}: min={df[col].min():.2f}, max={df[col].max():.2f}")

## 1.2 Select Target Variable

For regression, choose a continuous numeric column (e.g., consumption amount).

In [None]:
# YOUR CODE HERE: Select the target column
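# Suggest candidate target columns by keyword in the column name
target_keywords = ['consumo', 'consumption', 'valor', 'cantidad', 'total']
consumption_candidates = [
    col for col in numeric_cols
    if any(keyword in col.lower() for keyword in target_keywords)
]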
print(f"Potential target columns: {consumption_candidates}")

# Select target (UPDATE this based on your dataset)
target_col = ___  # e.g., consumption_candidates[0]

print(f"\nSelected target: {target_col}")
print(df[target_col].describe())

## 1.3 Select Features and Clean Data

In [None]:
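# Select feature columns (numeric, excluding target and IDs)
feature_cols = [col for col in numeric_cols if col != target_col]

# Drop ID-like columns. Match whole name parts (split on "_") instead of
# substrings: a substring test for 'id' would also drop names like 'cantidad'.
# This assumes column names use underscores as separators.
id_patterns = ['id', 'codigo', 'code', 'key']
feature_cols = [col for col in feature_cols if not any(p in col.lower().split('_') for p in id_patterns)]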

print(f"Selected features: {feature_cols}")
print(f"Number of features: {len(feature_cols)}")

In [None]:
# Handle missing values and prepare X, y
df_clean = df[feature_cols + [target_col]].dropna()

X = df_clean[feature_cols]
y = df_clean[target_col]

print(f"Rows before cleaning: {len(df)}")
print(f"Rows after cleaning: {len(df_clean)}")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

## 1.4 Train-Test Split

In [None]:
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
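
To see what the split does in isolation, here is a minimal sketch on a made-up 10-row table. The column names `x1` and `target` are placeholders chosen for illustration, not columns from the water dataset. With `test_size=0.2`, two of the ten rows go to the test set, and fixing `random_state` sends the same rows there on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with hypothetical column names (illustration only)
toy = pd.DataFrame({"x1": range(10), "target": [v * 2.0 for v in range(10)]})
X_toy = toy[["x1"]]
y_toy = toy["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Repeating the split with the same seed selects the same rows
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"Train rows: {len(X_tr)}, test rows: {len(X_te)}")
print(f"Same test rows on re-run: {X_te.index.equals(X_te2.index)}")
print(f"Train and test overlap: {not set(X_tr.index).isdisjoint(X_te.index)}")
```

No row appears in both sets, which is what makes the test score an estimate of performance on unseen data.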

---

# Part 2: Decision Tree Regressor

Build your first model using the sklearn pattern: Create -> Fit -> Predict -> Evaluate.

---

## 2.1 Train the Model

In [None]:
# YOUR CODE HERE: Create and train a Decision Tree Regressor

# Step 1: Create the model (use max_depth=5 to limit overfitting)
model = ___

# Step 2: Train the model
___

# Step 3: Make predictions
y_pred = ___

print("Model trained and predictions made!")
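
The Create -> Fit -> Predict pattern has the same shape for every scikit-learn estimator. The sketch below applies it to synthetic data, so it shows the calls without touching the workshop dataset. The `demo_` names, the sine-shaped target, and `max_depth=3` are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature, noisy sine target (illustration only)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 6, size=(200, 1))
demo_y = np.sin(demo_X[:, 0]) + rng.normal(0, 0.1, size=200)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# Create: choose the estimator and its hyperparameters
demo_model = DecisionTreeRegressor(max_depth=3, random_state=42)

# Fit: learn the split rules from the training data only
demo_model.fit(demo_X_train, demo_y_train)

# Predict: one prediction per test row
demo_pred = demo_model.predict(demo_X_test)

print(f"Tree depth: {demo_model.get_depth()}")
print(f"Predictions shape: {demo_pred.shape}")
```

The test set is only passed to `predict`, never to `fit`.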

## 2.2 Evaluate the Model

In [None]:
# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== DECISION TREE RESULTS ===")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
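
It can help to compute the three metrics by hand once. This sketch uses four made-up actual/predicted pairs (not results from the dataset) and checks plain NumPy formulas against the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up values for illustration only
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

errors = actual - predicted  # [-2, 2, -3, 3]

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE manual: {rmse_manual:.4f}  sklearn: {np.sqrt(mean_squared_error(actual, predicted)):.4f}")
print(f"MAE  manual: {mae_manual:.4f}  sklearn: {mean_absolute_error(actual, predicted):.4f}")
print(f"R2   manual: {r2_manual:.4f}  sklearn: {r2_score(actual, predicted):.4f}")
```

RMSE and MAE are in the units of the target, while R-squared is unitless. Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to large individual errors.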

## 2.3 Check for Overfitting

In [None]:
# Compare train vs test performance
y_pred_train = model.predict(X_train)
r2_train = r2_score(y_train, y_pred_train)

print(f"Training R-squared: {r2_train:.4f}")
print(f"Test R-squared:     {r2:.4f}")
print(f"Gap:                {r2_train - r2:.4f}")

if r2_train - r2 > 0.1:
    print("\nWarning: Possible overfitting! Consider reducing max_depth.")
else:
    print("\nGood: No significant overfitting detected.")
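
To see what the check above is looking for, the following sketch compares a shallow tree with an unlimited-depth tree on noisy synthetic data. The data, the helper `train_test_r2`, and the chosen depths are all assumptions for illustration. An unlimited tree can memorize the training rows, noise included, so its training score is typically near 1 while its test score falls behind.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: linear trend plus substantial noise (illustration only)
rng = np.random.default_rng(1)
noisy_X = rng.uniform(0, 10, size=(300, 1))
noisy_y = 2.0 * noisy_X[:, 0] + rng.normal(0, 3.0, size=300)

Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    noisy_X, noisy_y, test_size=0.2, random_state=42
)

def train_test_r2(max_depth):
    """Hypothetical helper: return (train R2, test R2) for one tree depth."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    tree.fit(Xn_train, yn_train)
    return (
        r2_score(yn_train, tree.predict(Xn_train)),
        r2_score(yn_test, tree.predict(Xn_test)),
    )

shallow_train, shallow_test = train_test_r2(3)
deep_train, deep_test = train_test_r2(None)

print(f"max_depth=3    train={shallow_train:.3f} test={shallow_test:.3f} gap={shallow_train - shallow_test:.3f}")
print(f"max_depth=None train={deep_train:.3f} test={deep_test:.3f} gap={deep_train - deep_test:.3f}")
```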

## 2.4 Visualize: Actual vs Predicted

In [None]:
# Actual vs Predicted scatter plot
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(y_test, y_pred, alpha=0.5, color='steelblue')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect Prediction')

ax.set_xlabel('Actual Values', fontsize=12)
ax.set_ylabel('Predicted Values', fontsize=12)
# Read max_depth from the fitted model so the title matches what was trained
ax.set_title(f'Decision Tree (max_depth={model.max_depth}): Actual vs Predicted\nR-squared = {r2:.4f}', fontsize=14)
ax.legend()

plt.tight_layout()
plt.show()

---

# Part 3: Experiment with max_depth

Try different tree depths to find the best balance between test performance and overfitting.

---

In [None]:
# YOUR CODE HERE: Try different max_depth values
# For each depth, record train R-squared, test R-squared, and the gap

depths = [2, 3, 5, 7, 10, 15, 20, None]  # None = unlimited
results = []

for depth in depths:
    dt = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    
    r2_tr = r2_score(y_train, dt.predict(X_train))
    r2_te = r2_score(y_test, dt.predict(X_test))
    
    results.append({
        'max_depth': str(depth),  # str(None) gives 'None' (unlimited depth)
        'R2_train': r2_tr,
        'R2_test': r2_te,
        'gap': r2_tr - r2_te
    })

results_df = pd.DataFrame(results)
print("=== DEPTH COMPARISON ===")
print(results_df.to_string(index=False))
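
One possible way to turn the comparison table into a choice is to keep only the depths whose gap stays within a tolerance and then take the highest test R-squared among them. The sketch below is a suggestion, not the required answer: `pick_depth` is a hypothetical helper, the 0.1 tolerance simply mirrors the threshold used in section 2.3, and the numbers in `example_results` are made up.

```python
import pandas as pd

# Made-up results in the same layout as results_df (illustration only)
example_results = pd.DataFrame({
    "max_depth": ["2", "3", "5", "10", "None"],
    "R2_train": [0.60, 0.70, 0.80, 0.95, 1.00],
    "R2_test": [0.58, 0.66, 0.72, 0.70, 0.62],
})
example_results["gap"] = example_results["R2_train"] - example_results["R2_test"]

def pick_depth(table, max_gap=0.1):
    """Hypothetical helper: best test R2 among rows whose gap is within max_gap."""
    acceptable = table[table["gap"] <= max_gap]
    if acceptable.empty:
        # No depth meets the tolerance: fall back to the smallest gap
        return table.loc[table["gap"].idxmin()]
    return acceptable.loc[acceptable["R2_test"].idxmax()]

chosen = pick_depth(example_results)
print(f"Chosen max_depth: {chosen['max_depth']} "
      f"(test R2 = {chosen['R2_test']:.2f}, gap = {chosen['gap']:.2f})")
```

With your own results, call `pick_depth(results_df)` and compare its suggestion with what you read off the table.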

In [None]:
# YOUR CODE HERE: Plot the overfitting curve
# X-axis: max_depth, Y-axis: R-squared (two lines: train and test)

fig, ax = plt.subplots(figsize=(10, 6))

x_labels = results_df['max_depth'].tolist()
x_pos = range(len(x_labels))

ax.plot(x_pos, results_df['R2_train'], 'o-', color='steelblue', linewidth=2, label='Train R-squared')
ax.plot(x_pos, results_df['R2_test'], 's-', color='coral', linewidth=2, label='Test R-squared')

ax.set_xticks(list(x_pos))
ax.set_xticklabels(x_labels)
ax.set_xlabel('max_depth', fontsize=12)
ax.set_ylabel('R-squared', fontsize=12)
ax.set_title('Overfitting Curve: Train vs Test R-squared', fontsize=14)
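# Shade the gap (drawn before the legend so its label is included)
ax.fill_between(x_pos, results_df['R2_train'], results_df['R2_test'], alpha=0.2, color='red', label='Overfitting gap')

ax.legend(fontsize=11)
# Test R-squared can be negative, so extend the lower limit when needed
ax.set_ylim(min(0, results_df['R2_test'].min()), 1.05)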

plt.tight_layout()
plt.show()

print("\nIf the gap between train and test R-squared widens as depth increases, the deeper trees are overfitting.")

### Depth Analysis

**Write your analysis below:**

*Which max_depth gives the best test R-squared with acceptable overfitting?*

*YOUR ANALYSIS HERE*

---

# Part 4: Decision Tree Classifier (Bonus)

Convert the problem to classification and try a DecisionTreeClassifier.

---

In [None]:
# Create consumption categories based on terciles (33rd and 66th percentiles)
def categorize(value, thresholds):
    if value < thresholds[0]:
        return 'Low'
    elif value < thresholds[1]:
        return 'Medium'
    else:
        return 'High'

thresholds = [y.quantile(0.33), y.quantile(0.66)]
print(f"Thresholds: Low < {thresholds[0]:.2f} < Medium < {thresholds[1]:.2f} < High")

y_cat = y.apply(lambda x: categorize(x, thresholds))
print("\nClass distribution:")
print(y_cat.value_counts())
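
The same three-way binning can also be expressed with `pd.cut`, which avoids a row-by-row `apply`. This is a sketch with made-up values and thresholds (the names `values`, `low_cut`, and `high_cut` are assumptions). Setting `right=False` makes each bin include its left edge, which matches the `<` comparisons in `categorize`.

```python
import numpy as np
import pandas as pd

# Made-up consumption values and thresholds (illustration only)
values = pd.Series([5.0, 12.0, 15.0, 22.0, 30.0])
low_cut, high_cut = 12.0, 22.0

# Bins: [-inf, 12) -> Low, [12, 22) -> Medium, [22, inf) -> High
categories = pd.cut(
    values,
    bins=[-np.inf, low_cut, high_cut, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)

print(categories.tolist())
# ['Low', 'Medium', 'Medium', 'High', 'High']
```

A value exactly equal to a threshold lands in the higher category, just as it does in `categorize`.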

In [None]:
# YOUR CODE HERE: Train a DecisionTreeClassifier

# Split for classification
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y_cat, test_size=0.2, random_state=42
)

# Create and train classifier
clf = ___
___

# Predict and evaluate
y_pred_c = ___
accuracy = accuracy_score(y_test_c, y_pred_c)

print(f"Classification Accuracy: {accuracy:.2%}")

In [None]:
# Check for overfitting (classification)
train_accuracy = accuracy_score(y_train_c, clf.predict(X_train_c))

print(f"Training accuracy: {train_accuracy:.2%}")
print(f"Test accuracy:     {accuracy:.2%}")
print(f"Gap:               {train_accuracy - accuracy:.2%}")

---

# Part 5: Interpretation and Reflection

---

In [None]:
# Feature importance from the best model
# A fitted decision tree exposes impurity-based importances via feature_importances_

# Use the best regression model (retrain with best depth if needed)
best_depth = 5  # UPDATE based on your Part 3 analysis
best_model = DecisionTreeRegressor(max_depth=best_depth, random_state=42)
best_model.fit(X_train, y_train)

importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance:")
print(importance.to_string(index=False))
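
To build intuition for what `feature_importances_` reports, the sketch below fits a tree on synthetic data with one informative column and one pure-noise column. The column names `signal` and `noise`, the data, and `max_depth=4` are assumptions for illustration. The importances are normalized, so for a tree with at least one split they sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends only on 'signal' (illustration only)
rng = np.random.default_rng(2)
synthetic = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=500),
    "noise": rng.uniform(0, 10, size=500),
})
synthetic_target = 3.0 * synthetic["signal"] + rng.normal(0, 1.0, size=500)

demo_tree = DecisionTreeRegressor(max_depth=4, random_state=42)
demo_tree.fit(synthetic, synthetic_target)

demo_importance = pd.Series(demo_tree.feature_importances_, index=synthetic.columns)

print(demo_importance.sort_values(ascending=False))
print(f"Sum of importances: {demo_importance.sum():.2f}")
```

A high importance says the tree relied on that column for its splits. It does not by itself show a causal effect, which is why the reflection below asks whether the ranking makes sense from a domain perspective.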

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 6))

ax.barh(importance['feature'], importance['importance'], color='steelblue')
ax.set_xlabel('Importance', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)
ax.set_title('Decision Tree: Feature Importance', fontsize=14)
ax.invert_yaxis()

plt.tight_layout()
plt.show()

## Reflection

**Answer these questions:**

### 1. Best max_depth
*Which max_depth gave you the best test R-squared without excessive overfitting? Why?*

*YOUR ANSWER HERE*

### 2. Model Quality
*Is the R-squared acceptable? What does it mean for your predictions?*

*YOUR ANSWER HERE*

### 3. Feature Importance
*Which features are most important? Does this make sense from a domain perspective?*

*YOUR ANSWER HERE*

### 4. Next Steps
*What would you try next to improve the model? (e.g., more features, different model, feature engineering)*

*YOUR ANSWER HERE*

---

## Final Checklist

Before submitting, verify:

- [ ] All cells have been executed (Kernel > Restart & Run All)
- [ ] Part 1: Data is properly prepared
- [ ] Part 2: Decision Tree trained and evaluated
- [ ] Part 3: Multiple depths tested and overfitting curve plotted
- [ ] Part 4 (Bonus): Classification attempted
- [ ] Part 5: Reflection questions answered

---

*Week 14 Workshop - Data Analytics Course - Universidad Cooperativa de Colombia*