# 5.2 Hands-On: Vibecoding the ML Pipeline

## Course 3: Advanced Classification Models for Student Success

## Introduction

In this notebook, we walk through practical examples of using AI coding tools to build complete machine learning pipelines. Each section shows:
1. The **prompt** you would give to an AI tool
2. The **expected output** (code you should get back)
3. **What to check** in the generated code

### Learning Objectives

1. Practice writing effective prompts for common ML tasks
2. See real examples of AI-generated ML code
3. Learn to identify and fix common issues in generated code
4. Build confidence in the vibecoding workflow

## 1. Prompt Engineering for ML Tasks

### Task 1: Data Loading and Exploration

**Your prompt:**
> "Load the CSV file at '../../data/training.csv' into a pandas DataFrame. Show the shape, first 5 rows, data types, and missing value counts. Create a binary target called DEPARTED where SEM_3_STATUS != 'E' maps to 1. Show the class distribution."

**What to check in the output:**
- Does it use `pd.read_csv` with the correct path?
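- Does the target map `SEM_3_STATUS != 'E'` to 1 and `'E'` to 0? Note that a missing status also satisfies `!= 'E'`, so it is counted as departed.
- Does it report every item the prompt asked for: shape, first 5 rows, data types, missing value counts, and class distribution?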

In [None]:
# This is what good AI-generated code looks like for the prompt above:
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('../../data/training.csv')

# Basic exploration
print(f"Shape: {df.shape}")
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
missing_counts = df.isnull().sum()
print(missing_counts[missing_counts > 0])  # only columns with at least one missing value

# Create binary target
df['DEPARTED'] = (df['SEM_3_STATUS'] != 'E').astype(int)
print("\nClass distribution:")
print(df['DEPARTED'].value_counts())
print(f"Departure rate: {df['DEPARTED'].mean():.2%}")
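
One way to check the target creation is to cross-tabulate the raw status column against the derived target. The sketch below uses a small made-up DataFrame rather than the course data; the status codes `'W'` and `'N'` and the name `toy` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the real data; status codes other than 'E' are invented here.
toy = pd.DataFrame({"SEM_3_STATUS": ["E", "E", "W", "N", "E", None]})
toy["DEPARTED"] = (toy["SEM_3_STATUS"] != "E").astype(int)

# Cross-tabulate raw status against the derived target, keeping missing values visible.
check = pd.crosstab(toy["SEM_3_STATUS"].fillna("<missing>"), toy["DEPARTED"])
print(check)

# Every 'E' row should map to 0 and every other row to 1.
assert (toy.loc[toy["SEM_3_STATUS"] == "E", "DEPARTED"] == 0).all()
assert (toy.loc[toy["SEM_3_STATUS"] != "E", "DEPARTED"] == 1).all()

# Rows with a missing status also end up as 1; decide whether that is what you want.
n_missing_as_departed = int(toy["SEM_3_STATUS"].isna().sum())
print(f"Missing statuses counted as departed: {n_missing_as_departed}")
```

If the missing-status count is not zero on the real data, it may be worth asking the AI tool to handle those rows explicitly instead of letting the comparison decide.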

### Task 2: Complete Model Pipeline

**Your prompt:**
> "Build a complete ML pipeline for student departure prediction:
> 1. Load training.csv and testing.csv from ../../data/
> 2. Target: DEPARTED = 1 if SEM_3_STATUS != 'E'
> 3. Features: HS_GPA, GPA_1, GPA_2, DFW_RATE_1, DFW_RATE_2, UNITS_COMPLETED_1, UNITS_COMPLETED_2
> 4. Train a RandomForestClassifier with 200 trees, balanced weights
> 5. Evaluate: accuracy, precision, recall, F1, ROC-AUC
> 6. Show feature importances as a horizontal bar chart using plotly"

**What to check:**
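- Does the feature list match the prompt exactly?
- Are the provided training and testing files kept separate, with imputation values computed from the training data only?
- Are precision, recall, and F1 computed from class predictions and ROC-AUC from predicted probabilities?
- Does the chart render as a horizontal bar chart with labeled axes?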

In [None]:
# Expected AI-generated output for the pipeline prompt:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import plotly.graph_objects as go

# Step 1: Load data
train_df = pd.read_csv('../../data/training.csv')
test_df = pd.read_csv('../../data/testing.csv')

# Step 2: Create target
train_df['DEPARTED'] = (train_df['SEM_3_STATUS'] != 'E').astype(int)
test_df['DEPARTED'] = (test_df['SEM_3_STATUS'] != 'E').astype(int)

# Step 3: Select features
features = ['HS_GPA', 'GPA_1', 'GPA_2', 'DFW_RATE_1', 'DFW_RATE_2',
            'UNITS_COMPLETED_1', 'UNITS_COMPLETED_2']

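# Impute with training-set medians so no test-set statistics leak into preprocessing
train_medians = train_df[features].median()
X_train = train_df[features].fillna(train_medians)
X_test = test_df[features].fillna(train_medians)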
y_train = train_df['DEPARTED']
y_test = test_df['DEPARTED']

# Step 4: Train model
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

# Step 5: Evaluate
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print("Model Performance:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"  ROC-AUC:   {roc_auc_score(y_test, y_prob):.4f}")

# Step 6: Feature importance plot
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=True)

fig = go.Figure(go.Bar(
    x=importance_df['Importance'],
    y=importance_df['Feature'],
    orientation='h',
    marker_color='steelblue'
))
fig.update_layout(title='Feature Importances (Random Forest)', height=400,
                  xaxis_title='Importance', yaxis_title='Feature')
fig.show()
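
A point worth checking in any generated pipeline is that imputation statistics are learned from the training data only. One possible way to make that hard to get wrong is scikit-learn's `Pipeline` with a `SimpleImputer` step. The sketch below is not the course's reference solution; it runs on synthetic data, and `make_frame`, `X_train_toy`, and the other `*_toy` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
toy_features = ["GPA_1", "GPA_2"]

def make_frame(n):
    """Synthetic GPA-like columns with roughly 20% of GPA_2 set to missing."""
    frame = pd.DataFrame(rng.uniform(0, 4, size=(n, 2)), columns=toy_features)
    frame.loc[rng.random(n) < 0.2, "GPA_2"] = np.nan
    return frame

X_train_toy = make_frame(100)
X_test_toy = make_frame(40)
y_train_toy = (X_train_toy["GPA_1"] < 2.0).astype(int)  # invented target rule

# The imputer is fitted inside the pipeline, so it only ever sees training rows.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                     random_state=42)),
])
pipe.fit(X_train_toy, y_train_toy)

# The medians stored by the imputer should equal the training medians.
learned_medians = pipe.named_steps["impute"].statistics_
train_medians_toy = X_train_toy.median().to_numpy()
print("Imputer medians: ", learned_medians)
print("Training medians:", train_medians_toy)

# Test rows are imputed with those same training medians at prediction time.
test_prob = pipe.predict_proba(X_test_toy)[:, 1]
```

Wrapping preprocessing and model in one object also means the same steps are applied automatically whenever the pipeline is reused on new data.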

## 2. Common Issues in AI-Generated Code

Watch out for these common problems:

| Issue | Example | Fix |
|:------|:--------|:----|
| **Wrong file path** | `pd.read_csv('data.csv')` | Check your actual file location |
| **Missing imports** | Uses `roc_auc_score` without importing | Add missing imports |
| **Deprecated API** | `use_label_encoder=True` in XGBoost | Remove the deprecated argument and check the docs for your installed version |
| **Incorrect target encoding** | Reverses 0/1 labels | Cross-tabulate the target against its source column |
| **No random seed** | Results differ each run | Add `random_state=42` |
| **Scaling for tree models** | Unnecessarily scales data | Remove scaling for trees |
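
Some of these issues can be caught before running anything. As a rough illustration for the missing-imports row, the sketch below uses Python's standard `ast` module to list names that are read but never imported, assigned, or built in. The helper `undefined_names` is hypothetical, and it ignores scoping and statement order, so treat its output as a hint rather than a verdict.

```python
import ast
import builtins

def undefined_names(source):
    """Rough static check: names that are read but never imported, assigned, or built in."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # 'import a.b' binds 'a'; 'import a as b' binds 'b'
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                defined.add(node.id)
    return used - defined

# Example of generated code that forgot to import roc_auc_score
snippet = """
import pandas as pd
y_true = [0, 1, 1]
y_prob = [0.2, 0.7, 0.9]
print(roc_auc_score(y_true, y_prob))
"""
missing = undefined_names(snippet)
print(missing)
```

Because the snippet is only parsed, never executed, this check is safe to run on code you have not reviewed yet.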

## 3. Summary

### The Vibecoding Checklist

Before running AI-generated code, verify:

- [ ] File paths match your project structure
- [ ] All imports are present
- [ ] Target variable is encoded correctly
- [ ] Features are the ones you intended
- [ ] Model hyperparameters are reasonable
- [ ] Evaluation metrics are appropriate for your problem
- [ ] Random seeds are set for reproducibility
- [ ] The code actually runs without errors
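
The first and fourth checklist items can be partly automated. The sketch below is one possible pre-flight check, not a required part of the workflow: the function `preflight` is hypothetical, and the demo writes a throwaway CSV instead of touching the course data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def preflight(csv_path, required_columns):
    """Return a list of problems; an empty list means these basic checks passed."""
    path = Path(csv_path)
    if not path.is_file():
        return [f"File not found: {path}"]
    # nrows=0 reads only the header row, so this stays fast on large files
    header = set(pd.read_csv(path, nrows=0).columns)
    return [f"Missing column: {col}" for col in required_columns if col not in header]

# Demo on a temporary file with a few of the course's column names
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("HS_GPA,GPA_1,SEM_3_STATUS\n3.1,2.8,E\n")

    ok_report = preflight(demo_path, ["HS_GPA", "GPA_1", "SEM_3_STATUS"])
    bad_report = preflight(demo_path, ["HS_GPA", "GPA_2"])
    missing_file_report = preflight(Path(tmp) / "nope.csv", ["HS_GPA"])

print(ok_report)
print(bad_report)
print(missing_file_report)
```

Running a check like this before the generated pipeline turns a confusing `KeyError` or `FileNotFoundError` deep in the code into a short, readable list of problems.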

### Next Module

**Proceed to:** `Module 6: Special Topics`