# 5.1 Introduction to AI-Assisted Coding for Data Analytics

## Course 3: Advanced Classification Models for Student Success

## Introduction

**AI-assisted coding tools** are changing how data analytics and machine learning work gets done. These tools let practitioners describe what they want in natural language and have an AI generate the code, a practice sometimes called **"vibecoding."**

This module introduces you to tools like **Codex** (OpenAI) and **Antigravity** that can help you:

- Write machine learning code faster
- Debug and optimize existing code
- Generate boilerplate code for common ML workflows
- Translate ideas into working implementations without deep coding expertise

### Why This Matters for Higher Education

Institutional researchers and analysts often have deep domain expertise but may not be full-time software engineers. AI-assisted coding bridges this gap, enabling you to:

1. **Prototype faster**: Go from idea to working model in minutes
2. **Learn by example**: See well-structured code generated from your descriptions
3. **Focus on the problem, not the syntax**: Spend time on analytics, not debugging
4. **Democratize ML**: Make advanced techniques accessible to more staff

### Learning Objectives

By the end of this module, you will be able to:

1. Understand the capabilities and limitations of AI coding assistants
2. Write effective prompts to generate ML code
3. Use AI tools to build, train, and evaluate models from natural language descriptions
4. Critically evaluate and modify AI-generated code
5. Apply AI-assisted coding to the assignments and lessons in this course

## 1. What is Vibecoding?

**Vibecoding** is the practice of using AI tools to write code by describing your intent in natural language. Instead of writing every line yourself, you:

1. **Describe** what you want to accomplish
2. **Review** the generated code
3. **Iterate** by refining your description or editing the output
4. **Validate** that the code works correctly

### Example: Building a Random Forest Model

**Traditional approach** (write every line):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
# ... 20+ lines of setup, training, evaluation
```

**Vibecoding approach** (describe your intent):
> "Load the student training data from '../../data/training.csv'. Create a binary target where DEPARTED=1 if SEM_3_STATUS is not 'E'. Build a Random Forest with 200 trees and balanced class weights. Evaluate with 5-fold cross-validation using ROC-AUC. Show the top 10 most important features."

The AI generates a complete draft implementation from this description, which you then review, run, and validate.
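
For a sense of scale, a response to a prompt like this might resemble the sketch below. It is an illustration, not actual tool output. To stay runnable without the course file, it builds a small synthetic table in place of the CSV: every value is made up, the status code `'W'` is a hypothetical non-enrolled code, and only three of the course's feature columns are included.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the course CSV; all values are made up.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "HS_GPA": rng.uniform(2.0, 4.0, n),
    "GPA_1": rng.uniform(0.0, 4.0, n),
    "DFW_RATE_1": rng.uniform(0.0, 1.0, n),
})
# Hypothetical status codes: 'E' = enrolled, 'W' = any non-enrolled status.
# In this toy data, a lower first-term GPA makes departure more likely.
p_depart = 1 / (1 + np.exp(2.0 * (df["GPA_1"] - 1.5)))
df["SEM_3_STATUS"] = np.where(rng.uniform(size=n) < p_depart, "W", "E")

# Binary target: DEPARTED = 1 if SEM_3_STATUS is not 'E'.
y = (df["SEM_3_STATUS"] != "E").astype(int)
X = df.drop(columns=["SEM_3_STATUS"])

model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)

# 5-fold cross-validated ROC-AUC.
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"ROC-AUC: {auc_scores.mean():.3f} (+/- {auc_scores.std():.3f})")

# Fit on all rows to read the impurity-based feature importances.
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
# head(10) returns at most 10 rows; this toy table only has 3 features.
top_features = importances.sort_values(ascending=False).head(10)
print(top_features)
```

Even in a short script like this there are choices worth reviewing, such as whether the target encoding matches the prompt and whether the importances are computed on the data you intended.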

## 2. Tools for AI-Assisted Coding

### 2.1 Codex / ChatGPT (OpenAI)

- **What it is**: An AI model that generates code from natural language
- **Access**: Via ChatGPT, GitHub Copilot, or OpenAI API
- **Strengths**: Excellent at Python, scikit-learn, pandas code generation
- **Best for**: Generating complete scripts, explaining code, debugging

### 2.2 Claude Code (Anthropic)

- **What it is**: An AI coding assistant from Anthropic
- **Access**: Via the terminal (command-line tool), IDE integrations, or the Claude apps
- **Strengths**: Strong reasoning, handles complex multi-step tasks
- **Best for**: Architecture decisions, code review, complex workflows

### 2.3 Antigravity

- **What it is**: An agent-focused AI development environment from Google
- **Strengths**: AI agents that can plan and carry out multi-step coding tasks
- **Best for**: Data exploration, model building, visualization

### 2.4 GitHub Copilot

- **What it is**: AI-powered code completion in your IDE
- **Access**: VS Code, JetBrains, Neovim
- **Strengths**: Real-time suggestions as you type
- **Best for**: Autocomplete, inline code generation

## 3. Effective Prompting for ML Code

The quality of AI-generated code depends heavily on your **prompt**. Here are key principles:

### 3.1 Be Specific About Your Data

**Weak prompt**: "Build a classification model"

**Strong prompt**: "Load student data from '../../data/training.csv'. The target variable is SEM_3_STATUS where 'E' means enrolled (class 0) and everything else means departed (class 1). Use features: HS_GPA, GPA_1, GPA_2, DFW_RATE_1, DFW_RATE_2. Build an XGBoost classifier."
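
A prompt this specific also gives you something concrete to check the generated code against. The sketch below shows the target encoding and feature selection that the strong prompt describes. The helper name `make_xy`, the status codes `'W'` and `'S'`, and the three sample rows are assumptions made for illustration.

```python
import pandas as pd

FEATURES = ["HS_GPA", "GPA_1", "GPA_2", "DFW_RATE_1", "DFW_RATE_2"]


def make_xy(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Return (features, target); target is 1 for departed, 0 for enrolled."""
    missing = [c for c in FEATURES + ["SEM_3_STATUS"] if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    # Anything other than 'E' counts as departed. Note that a missing status
    # also compares as "not 'E'", so decide how you want to handle blanks.
    y = (df["SEM_3_STATUS"] != "E").astype(int).rename("DEPARTED")
    return df[FEATURES], y


# Three made-up rows; 'W' and 'S' are hypothetical non-enrolled codes.
sample = pd.DataFrame({
    "HS_GPA": [3.6, 2.4, 3.1],
    "GPA_1": [3.2, 1.8, 2.5],
    "GPA_2": [3.4, 1.5, 2.2],
    "DFW_RATE_1": [0.0, 0.5, 0.25],
    "DFW_RATE_2": [0.0, 0.6, 0.2],
    "SEM_3_STATUS": ["E", "W", "S"],
})
X, y = make_xy(sample)
print(y.tolist())  # [0, 1, 1]
```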

### 3.2 Specify the ML Framework

**Weak**: "Make a random forest"

**Strong**: "Using scikit-learn's RandomForestClassifier, build a model with 200 estimators, balanced class weights, and max_depth=12."

### 3.3 Include Evaluation Requirements

**Weak**: "Train and test the model"

**Strong**: "Split data 80/20, train the model, then report accuracy, precision, recall, F1, and ROC-AUC on the test set. Also plot the ROC curve using plotly."
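
As a rough sketch, the evaluation requested in the strong prompt could look like the following. It runs on synthetic data from `make_classification` rather than the student dataset, and uses a plain logistic regression as a placeholder model. The plotly chart is left out so the example depends only on scikit-learn; the `fpr` and `tpr` arrays are what a ROC plot would draw.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the student data (about 30% positives).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0,
)

# 80/20 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),  # uses probabilities, not labels
}
for name, value in metrics.items():
    print(f"{name:>9}: {value:.3f}")

# Points on the ROC curve, ready to hand to a plotting library.
fpr, tpr, _ = roc_curve(y_test, y_prob)
```

One thing to look for when reviewing generated evaluation code: ROC-AUC should be computed from predicted probabilities (or scores), while accuracy, precision, recall, and F1 use the predicted labels.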

### 3.4 Request the Output Format

"Generate a complete Jupyter notebook cell that I can run directly. Include all imports at the top. Add comments explaining each step."

### 3.5 The Vibecoding Workflow

```
Step 1: Describe → "Build an XGBoost model on student data..."
Step 2: Review   → Read the generated code carefully
Step 3: Test     → Run it and check results
Step 4: Iterate  → "Now add SHAP feature importance plots"
Step 5: Validate → Verify results make sense domain-wise
```

## 4. Applying Vibecoding to Course Assignments

You can use AI coding tools to work through the lessons and assignments in this course. Here's how:

### Example Prompts for Each Module

**Module 1 (Regularized Logistic Regression):**
> "Build a logistic regression model with L2 regularization on the student departure dataset. Compare C values of 0.01, 0.1, 1, and 10 using 5-fold cross-validation. Plot the coefficient paths."

**Module 2 (Tree-Based Models):**
> "Build a Decision Tree, Random Forest, and XGBoost model on the student departure data. Use the same train/test split. Compare all three on ROC-AUC, F1, and Recall. Create a side-by-side bar chart."

**Module 3 (Model Comparison):**
> "Create a comprehensive model comparison of Regularized Logistic Regression, Random Forest, and XGBoost. Include ROC curves, precision-recall curves, confusion matrices, and a radar chart of model capabilities."
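
To illustrate what one of these prompts might produce, here is a minimal sketch for the Module 1 prompt. It assumes synthetic data from `make_classification` in place of the student departure dataset, adds feature scaling (a common choice with regularized models, not something the prompt asks for), and skips the coefficient-path plot.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student departure dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)

cv_auc = {}
for C in [0.01, 0.1, 1, 10]:
    # LogisticRegression applies an L2 penalty by default;
    # smaller C means stronger regularization.
    pipe = make_pipeline(
        StandardScaler(), LogisticRegression(C=C, max_iter=1000)
    )
    cv_auc[C] = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

for C, auc in cv_auc.items():
    print(f"C={C:<5} mean ROC-AUC={auc:.3f}")

best_C = max(cv_auc, key=cv_auc.get)
print(f"Best C by mean ROC-AUC: {best_C}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation fold never influences the scaling.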

### Important Guidelines

1. **Always review generated code** before running it—AI can make mistakes
2. **Understand what the code does**—don't just run it blindly
3. **Verify results** against your domain knowledge
4. **Credit AI assistance** in your work (academic integrity)
5. **Use AI as a learning tool**, not a replacement for understanding

## 5. Limitations and Best Practices

### Limitations of AI Coding Tools

| Limitation | Mitigation |
|:-----------|:-----------|
| Can generate incorrect code | Always test and validate |
| May use outdated APIs | Specify library versions in prompts |
| Doesn't know your specific data | Describe your data structure clearly |
| Can hallucinate functions | Check that function calls are valid |
| No guarantee of optimal code | Review for efficiency and best practices |
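
For the "hallucinated functions" row, one quick check is to confirm that a name actually exists in the installed library before trusting code that calls it. The helper below is a hypothetical sketch (`api_exists` is not part of any library). It is demonstrated on the standard library's `math` module so it runs anywhere; the same call works for names such as `sklearn.ensemble` and `RandomForestClassifier` if scikit-learn is installed.

```python
import importlib


def api_exists(module_name: str, attr_path: str) -> bool:
    """Return True if module_name imports and attr_path resolves to a callable."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    # Walk dotted paths such as "SomeClass.some_method" one part at a time.
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return callable(obj)


print(api_exists("math", "sqrt"))            # True
print(api_exists("math", "fast_sqrt"))       # False: no such function
print(api_exists("not_a_real_pkg", "fit"))   # False: module is not installed
```

This only confirms that the name exists. It does not confirm that the arguments in the generated call are valid, so running the code on a small sample is still necessary.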

### Best Practices

1. **Start simple, then add complexity**: Begin with a basic prompt, then iterate
2. **Provide context**: Tell the AI about your data, goals, and constraints
3. **Review line by line**: Understand every line before running
4. **Test incrementally**: Run small sections, not entire scripts at once
5. **Keep your domain expertise**: You know student success better than AI does
6. **Document your prompts**: Keep a record of what you asked, so your results can be reproduced
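
One lightweight way to keep a prompt record is to append each prompt to a log file as you work. The sketch below is an assumption about how that might look, not a course requirement: the function name `log_prompt`, the field names, and the JSON Lines format are all illustrative choices. The demo writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_prompt(log_path, tool, prompt, note=""):
    """Append one prompt record to a JSON Lines file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "prompt": prompt,
        "note": note,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Demo: log two prompts, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "prompt_log.jsonl"
    log_prompt(log_file, "ChatGPT", "Build an XGBoost model on student data...")
    log_prompt(log_file, "ChatGPT", "Now add SHAP feature importance plots",
               note="iteration 2")
    lines = log_file.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines]

print(len(entries))          # 2
print(entries[1]["note"])    # iteration 2
```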

## 6. Hands-On Exercise

Try using an AI coding tool to generate code for the following task:

### Task: Build and Compare Models

Write a prompt that asks an AI tool to:

1. Load the student departure dataset
2. Build three models (Logistic Regression, Random Forest, XGBoost)
3. Evaluate all three with cross-validation
4. Create a comparison visualization
5. Identify the top 5 most important features from each model

**Suggested prompt template:**
```
Using Python with scikit-learn and xgboost:

1. Load training data from '../../data/training.csv'
2. Create target: DEPARTED = 1 if SEM_3_STATUS != 'E', else 0
3. Use these features: [list features]
4. One-hot encode categorical features
5. Build: LogisticRegression(C=0.1), RandomForestClassifier(n_estimators=200), XGBClassifier(n_estimators=150)
6. 5-fold cross-validation with ROC-AUC scoring
7. Print comparison table and plot bar chart using plotly
8. Show top 5 features from each model
```

Copy this prompt into ChatGPT, Claude, or your preferred AI tool and compare the output!
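
When you compare outputs, it helps to have a rough idea of the shape a reasonable answer takes. The sketch below covers the encoding, modeling, and cross-validation steps of the template under several assumptions: the data is synthetic, `RESIDENCY` is a hypothetical categorical column, `'W'` is a hypothetical non-enrolled status, and XGBoost, the plotly chart, and the feature-importance step are omitted so that only scikit-learn and pandas are required.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training CSV; RESIDENCY is a hypothetical
# categorical column and every value below is made up.
rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({
    "HS_GPA": rng.uniform(2.0, 4.0, n),
    "GPA_1": rng.uniform(0.0, 4.0, n),
    "RESIDENCY": rng.choice(["IN_STATE", "OUT_OF_STATE"], n),
})
p_depart = 1 / (1 + np.exp(2.0 * (df["GPA_1"] - 1.5)))
df["SEM_3_STATUS"] = np.where(rng.uniform(size=n) < p_depart, "W", "E")

# Target: DEPARTED = 1 if SEM_3_STATUS != 'E', else 0.
y = (df["SEM_3_STATUS"] != "E").astype(int)
X = df.drop(columns=["SEM_3_STATUS"])

# One-hot encode the categorical column; scale the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["RESIDENCY"]),
    ("num", StandardScaler(), ["HS_GPA", "GPA_1"]),
])

models = {
    "LogisticRegression": LogisticRegression(C=0.1, max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=7),
}

rows = []
for name, estimator in models.items():
    # Preprocessing lives inside the pipeline so it is refit on each fold.
    pipe = make_pipeline(preprocess, estimator)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    rows.append(
        {"model": name, "mean_auc": scores.mean(), "std_auc": scores.std()}
    )

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

If a tool's output differs from this, that is not necessarily a problem. The useful questions are whether the target is encoded as the prompt specifies, whether all models are evaluated on the same folds, and whether preprocessing is kept inside the cross-validation loop.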

## 7. Summary

### Key Takeaways

1. **AI-assisted coding (vibecoding)** lets you describe intent in natural language and get a draft implementation to review and test
2. **Effective prompts** are specific about data, framework, evaluation, and output format
3. **Always review and validate** AI-generated code—it's a tool, not a replacement for understanding
4. **Multiple tools are available**: Codex, Claude Code, Antigravity, GitHub Copilot
5. **Apply to this course**: Use AI tools to accelerate your work on assignments and lessons
6. **Maintain academic integrity**: Understand the code, credit AI assistance

### The Future of Analytics Work

AI-assisted coding doesn't replace the analyst—it amplifies them. Your domain expertise in higher education, combined with AI-generated code, creates a powerful combination for building and deploying student success models.

> *The best analysts of tomorrow will be those who can effectively direct AI tools while maintaining deep understanding of their domain.*