# Week 14: In-Class Exercise - Your First Machine Learning Model

## Objective
Build a Decision Tree model end-to-end using the Water Consumption dataset.

## Time: ~30 minutes

## Dataset
Water Consumption data from datos.gov.co.

### What You Will Do:
1. Load and prepare the data (features and target)
2. Split into training and test sets
3. Train a Decision Tree
4. Evaluate the model and check for overfitting

---

## Setup
Run this cell to load the necessary libraries and dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn imports
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully!")

In [None]:
# Load Water Consumption dataset from datos.gov.co
url = "https://www.datos.gov.co/resource/k9gy-47jj.csv?$limit=10000"

df = pd.read_csv(url)

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Quick data inspection
df.head()

In [None]:
# Check data types and missing values
print("Data Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nNumeric columns summary:")
df.describe()

---

## Part 1: Data Preparation (10 minutes)

Before building a model, we need to:
1. Identify numeric columns
2. Choose a target variable
3. Select features
4. Handle missing values
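
As a rough illustration of these four steps, here is a minimal sketch on a tiny made-up table. The column names (`estrato`, `usuarios`, `consumo_m3`, `municipio`) and the `toy_` variables are hypothetical and are not taken from the real dataset; the cells below apply the same idea to `df`.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are invented for illustration only.
toy = pd.DataFrame({
    "estrato": [1, 2, 3, 4, 2],
    "usuarios": [120, 85, np.nan, 40, 95],
    "consumo_m3": [1500.0, 1100.0, 900.0, np.nan, 1250.0],
    "municipio": ["A", "B", "C", "D", "E"],  # non-numeric, excluded below
})

# 1. Identify numeric columns
toy_numeric = toy.select_dtypes(include=[np.number]).columns.tolist()

# 2. Choose a target variable
toy_target = "consumo_m3"

# 3. Select features: every numeric column except the target
toy_features = [c for c in toy_numeric if c != toy_target]

# 4. Handle missing values by dropping incomplete rows
toy_clean = toy[toy_features + [toy_target]].dropna()

X_toy = toy_clean[toy_features]
y_toy = toy_clean[toy_target]
print(f"Features: {toy_features}")
print(f"Rows kept: {len(X_toy)} of {len(toy)}")
```

Dropping rows is the simplest way to handle missing values, but it can discard a lot of data when many columns have gaps.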

---

In [None]:
# Identify potential target and feature columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

print("Numeric columns:")
for col in numeric_cols:
    print(f"  - {col}: min={df[col].min():.2f}, max={df[col].max():.2f}, mean={df[col].mean():.2f}")

In [None]:
# Select target variable (consumption-related column)
consumption_candidates = [col for col in numeric_cols if any(x in col.lower() for x in ['consumo', 'consumption', 'valor', 'cantidad', 'total'])]
print(f"Potential target columns: {consumption_candidates}")

# Select the target column (update based on your dataset)
if consumption_candidates:
    target_col = consumption_candidates[0]
else:
    target_col = df[numeric_cols].var().idxmax()

print(f"\nSelected target column: {target_col}")
print("Target statistics:")
print(df[target_col].describe())

In [None]:
# Prepare features (X) and target (y)
# Use all numeric columns except the target
feature_cols = [col for col in numeric_cols if col != target_col]

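# Remove columns that look like IDs or codes.
# Match whole underscore-separated tokens so that a name such as 'unidad' is not dropped just for containing 'id'.
id_patterns = {'id', 'codigo', 'code', 'key'}
feature_cols = [col for col in feature_cols if not id_patterns & set(col.lower().split('_'))]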

print(f"Feature columns: {feature_cols}")
print(f"Number of features: {len(feature_cols)}")

In [None]:
# Clean and prepare data
df_clean = df[feature_cols + [target_col]].dropna()

X = df_clean[feature_cols]
y = df_clean[target_col]

print(f"Samples after cleaning: {len(X)}")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

---

## Part 2: Train/Test Split (5 minutes)

Split the data so that the model is evaluated on samples it did not see during training.
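
To see what a split does, the sketch below uses small made-up arrays (`X_demo`, `y_demo` and the `Xd_`/`yd_` names are hypothetical). It checks that no sample lands in both sets and that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each; y_demo doubles as a sample ID
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# No sample appears in both sets
overlap = set(yd_train) & set(yd_test)
print(f"Train: {len(yd_train)} samples, Test: {len(yd_test)} samples, overlap: {len(overlap)}")

# The same random_state reproduces the same split
_, _, _, yd_test_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(f"Same test samples on a second run: {np.array_equal(yd_test, yd_test_again)}")
```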

---

### Task 2.1: Split the data into 80% training and 20% test

Use `train_test_split` with `test_size=0.2` and `random_state=42`.

In [None]:
# Task 2.1: Split data
# YOUR CODE HERE
X_train, X_test, y_train, y_test = ___

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

---

## Part 3: Train a Decision Tree (10 minutes)

Remember the sklearn pattern:
1. Create the model
2. Fit on training data
3. Predict on test data
4. Evaluate
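
The same four steps apply to most scikit-learn estimators. As a sketch, here they are with `LinearRegression` on synthetic data; the data and the `demo_`/`Xs_`/`ys_` names are made up for illustration, and the model is deliberately not the one used in the tasks below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly 3 * x plus a little noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 1.0, size=200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

demo_model = LinearRegression()        # 1. Create the model
demo_model.fit(Xs_train, ys_train)     # 2. Fit on training data
ys_pred = demo_model.predict(Xs_test)  # 3. Predict on test data
demo_r2 = r2_score(ys_test, ys_pred)   # 4. Evaluate

print(f"Test R-squared: {demo_r2:.3f}")
```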

---

### Task 3.1: Create and train a Decision Tree Regressor

Use `max_depth=5` to limit the tree's complexity and reduce the risk of overfitting.
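
To get a feel for what the depth cap does, the sketch below fits two trees on made-up noisy data (`X_noise`, `y_noise`, `shallow` and `deep` are hypothetical names) and compares their size. A tree of depth 5 can have at most 2^5 = 32 leaves, while an unrestricted tree keeps splitting until it has nearly one leaf per training sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve with noise
rng = np.random.default_rng(1)
X_noise = rng.uniform(0, 10, size=(300, 1))
y_noise = np.sin(X_noise[:, 0]) + rng.normal(0, 0.3, size=300)

shallow = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_noise, y_noise)
deep = DecisionTreeRegressor(max_depth=None, random_state=42).fit(X_noise, y_noise)

print(f"max_depth=5:    depth={shallow.get_depth()}, leaves={shallow.get_n_leaves()}")
print(f"max_depth=None: depth={deep.get_depth()}, leaves={deep.get_n_leaves()}")
```

Many small leaves mean the tree is fitting the noise in individual samples, which is exactly what the train-versus-test comparison in Part 4 is meant to reveal.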

In [None]:
# Task 3.1: Create and train the model

# Step 1: Create the model
# YOUR CODE HERE
model = ___

# Step 2: Fit the model on training data
# YOUR CODE HERE
___

print("Decision Tree model trained!")

### Task 3.2: Make predictions on the test set

In [None]:
# Task 3.2: Make predictions
# YOUR CODE HERE
y_pred = ___

print(f"First 5 predictions: {y_pred[:5]}")
print(f"First 5 actual values: {y_test.values[:5]}")

---

## Part 4: Evaluate and Check for Overfitting (5 minutes)
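
The three metrics used below can be computed by hand. This sketch does so on four made-up values (`y_true_demo` and `y_pred_demo` are hypothetical) and compares the results against the scikit-learn functions imported in Setup.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared error (same units as the target)
rmse_manual = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: manual={rmse_manual:.4f}, sklearn={np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)):.4f}")
print(f"MAE:  manual={mae_manual:.4f}, sklearn={mean_absolute_error(y_true_demo, y_pred_demo):.4f}")
print(f"R2:   manual={r2_manual:.4f}, sklearn={r2_score(y_true_demo, y_pred_demo):.4f}")
```

RMSE penalizes large errors more heavily than MAE, so a big difference between the two suggests a few predictions are far off.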

---

In [None]:
# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== DECISION TREE RESULTS ===")
print(f"RMSE (Root Mean Squared Error): {rmse:.4f}")
print(f"MAE (Mean Absolute Error): {mae:.4f}")
print(f"R-squared: {r2:.4f}")
print(f"\nInterpretation: The model explains {r2*100:.1f}% of the variance in the target.")

In [None]:
# Check for overfitting: Compare train vs test performance
y_pred_train = model.predict(X_train)
r2_train = r2_score(y_train, y_pred_train)

print(f"Training R-squared: {r2_train:.4f}")
print(f"Test R-squared:     {r2:.4f}")
print(f"Gap:                {r2_train - r2:.4f}")

if r2_train - r2 > 0.1:
    print("\nWarning: Possible overfitting! Consider reducing max_depth.")
else:
    print("\nGood: No significant overfitting detected.")

In [None]:
# Visualize: Actual vs Predicted
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(y_test, y_pred, alpha=0.5, color='steelblue')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2, label='Perfect Prediction')

ax.set_xlabel('Actual Values', fontsize=12)
ax.set_ylabel('Predicted Values', fontsize=12)
ax.set_title(f'Decision Tree: Actual vs Predicted\nR-squared = {r2:.4f}', fontsize=14)
ax.legend()

plt.tight_layout()
plt.show()

---

## Bonus: Try Different max_depth Values

See how changing the tree depth affects overfitting.

---

In [None]:
# Bonus: Compare different max_depth values
depths = [2, 3, 5, 10, 20, None]  # None = unlimited depth
results = []

for depth in depths:
    dt = DecisionTreeRegressor(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    
    r2_train_d = r2_score(y_train, dt.predict(X_train))
    r2_test_d = r2_score(y_test, dt.predict(X_test))
    
    results.append({
        'max_depth': str(depth),
        'R2_train': r2_train_d,
        'R2_test': r2_test_d,
        'gap': r2_train_d - r2_test_d
    })

results_df = pd.DataFrame(results)
print("=== DEPTH COMPARISON ===")
print(results_df.to_string(index=False))
print("\nNotice: as depth increases, the training score typically rises, and the gap to the test score (overfitting) tends to widen.")

---

## Summary

In this exercise, you learned:

1. **The sklearn pattern**: Create -> Fit -> Predict -> Evaluate
2. **Train/Test Split**: Always evaluate on data the model has not seen
3. **Decision Tree Regressor**: Predicts continuous values using decision rules
4. **Key Metrics**: RMSE, MAE, R-squared
5. **Overfitting Detection**: Compare train vs test scores; limit max_depth to reduce it

---

*End of Exercise*