# Lab 00b: ML Concepts Primer

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/depalmar/ai_for_the_win/blob/main/notebooks/lab00b_ml_concepts.ipynb)

Understand machine learning theory before you start coding. You won't need to write any code - just run the cells and focus on the concepts!

## Learning Objectives
- Supervised vs unsupervised learning
- Features and labels
- Training, validation, and testing
- Evaluation metrics

## 1. What is Machine Learning?

Machine learning teaches computers to learn patterns from data instead of following explicitly programmed rules.

**Traditional Programming:**
```
Rules + Data → Program → Output
```

**Machine Learning:**
```
Data + Expected Output → ML Algorithm → Model (learned rules)
```
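
To make the contrast concrete, here is a minimal sketch in plain Python. The URL-count feature, the toy data, and both function names (`rule_based_filter`, `learn_threshold`) are illustrative assumptions, not part of any library. The first function hard-codes a rule; the second derives the rule from labeled examples.

```python
# Traditional programming: a human writes the rule.
def rule_based_filter(url_count):
    """Hand-written rule: flag any email with more than 3 URLs."""
    return "phishing" if url_count > 3 else "legitimate"


# Machine learning: the rule (here, a single threshold) is learned from data.
def learn_threshold(url_counts, labels):
    """Return the threshold that classifies the most training examples correctly."""
    best_threshold, best_correct = None, -1
    for candidate in sorted(set(url_counts)):
        correct = sum(
            ("phishing" if count > candidate else "legitimate") == label
            for count, label in zip(url_counts, labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


# Toy labeled data (hypothetical values, for illustration only)
url_counts = [0, 1, 1, 2, 5, 6, 8, 9]
labels = ["legitimate"] * 4 + ["phishing"] * 4

threshold = learn_threshold(url_counts, labels)
print(f"Learned rule: flag emails with more than {threshold} URLs")
```

Real algorithms search a far richer space of rules than a single cutoff, but the flow is the same: data and expected outputs go in, and a learned rule comes out.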

## 2. Supervised Learning

Learning from labeled examples.

**Example: Phishing Detection**
- Input: Email text
- Label: "phishing" or "legitimate"
- Goal: Learn to classify new emails

In [None]:
# Visualization of supervised learning
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)
legitimate = np.random.randn(50, 2) + [2, 2]
phishing = np.random.randn(50, 2) + [5, 5]

plt.figure(figsize=(8, 6))
plt.scatter(legitimate[:, 0], legitimate[:, 1], c="green", label="Legitimate", alpha=0.7)
plt.scatter(phishing[:, 0], phishing[:, 1], c="red", label="Phishing", alpha=0.7)
plt.xlabel("Feature 1 (e.g., URL count)")
plt.ylabel("Feature 2 (e.g., urgency words)")
plt.title("Supervised Learning: Labeled Data")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
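
The plot shows labeled data but does not yet learn anything from it. As a rough sketch of the "learn to classify new emails" step, the block below uses a nearest-centroid classifier, chosen here only because it fits in a few lines of plain Python. The feature values and the helper names `fit_centroids` and `predict` are hypothetical.

```python
import math

# Labeled training emails: (url_count, urgent_words) -> label
# Values are made up for illustration.
training_data = [
    ((1, 0), "legitimate"),
    ((0, 1), "legitimate"),
    ((2, 1), "legitimate"),
    ((6, 4), "phishing"),
    ((8, 5), "phishing"),
    ((7, 3), "phishing"),
]


def fit_centroids(examples):
    """Training: average the feature vectors of each class."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(values) / len(values) for values in zip(*points))
        for label, points in grouped.items()
    }


def predict(centroids, features):
    """Prediction: pick the class whose centroid is closest to the new point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = fit_centroids(training_data)
print(predict(centroids, (5, 4)))  # a new, unseen email
print(predict(centroids, (1, 1)))  # another unseen email
```

The labels are what make this supervised: without them, `fit_centroids` would have no way to know which points belong together.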

## 3. Unsupervised Learning

Finding patterns without labels.

**Example: Malware Clustering**
- Input: Malware samples with features
- No labels provided
- Goal: Group similar samples together

In [None]:
from sklearn.cluster import KMeans

# Generate unlabeled data
np.random.seed(42)
cluster1 = np.random.randn(30, 2) + [0, 0]
cluster2 = np.random.randn(30, 2) + [4, 4]
cluster3 = np.random.randn(30, 2) + [0, 4]
data = np.vstack([cluster1, cluster2, cluster3])

# Apply clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(data)

plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis", alpha=0.7)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    c="red",
    marker="X",
    s=200,
    label="Centers",
)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Unsupervised Learning: Clustering")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
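
For intuition about what a clustering algorithm does internally, here is a simplified one-dimensional k-means sketch in plain Python. It is not how `KMeans` in scikit-learn is implemented (that version adds smarter initialization and convergence checks); the function name `kmeans_1d`, the entropy values, and the starting centers are assumptions made for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means: alternate between assigning points and moving centers."""
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: each center moves to the mean of its members
        centers = [
            sum(members) / len(members) if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters


# Unlabeled file-entropy values (hypothetical): nothing says which group is which
entropy = [1.1, 1.3, 0.9, 7.2, 7.5, 6.9]
centers, clusters = kmeans_1d(entropy, centers=[0.0, 8.0])
print("Centers:", centers)
print("Clusters:", clusters)
```

The algorithm only groups samples; deciding what each group means (for example, "packed" versus "unpacked" binaries) is still up to the analyst.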

## 4. Features and Labels

**Features (X):** Input variables the model learns from
- Email: word count, URL count, sender domain
- Malware: file size, imports, entropy
- Network: bytes sent, packet count, duration

**Labels (y):** What we want to predict
- Classification: phishing/legitimate, malware/benign
- Regression: threat score (0-10)
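
Features rarely arrive as numbers; they are usually computed from raw data. The sketch below shows one possible way to turn email text into a feature row. The `extract_features` helper, the `URGENT_WORDS` list, and the sample email are all hypothetical.

```python
import re

# Hypothetical list of urgency cues; a real project would tune this.
URGENT_WORDS = {"urgent", "immediately", "verify", "suspended"}


def extract_features(email_text):
    """Turn raw email text into numeric features (X) a model can use."""
    words = re.findall(r"[a-z']+", email_text.lower())
    return {
        "email_length": len(email_text),
        "url_count": len(re.findall(r"https?://", email_text)),
        "urgent_words": sum(word in URGENT_WORDS for word in words),
    }


email = "URGENT: verify your account immediately at http://example.test/login"
features = extract_features(email)  # X: what the model sees
label = "phishing"                  # y: what we want it to predict
print(features, "->", label)
```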

In [None]:
import pandas as pd

# Example feature table
data = {
    "email_length": [150, 500, 200, 1000],
    "url_count": [5, 1, 8, 0],
    "urgent_words": [3, 0, 5, 1],
    "label": ["phishing", "legitimate", "phishing", "legitimate"],
}

df = pd.DataFrame(data)
print("Features (X) and Labels (y):")
print(df)

## 5. Train/Test Split

**Why split data?**
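- Training set: The model learns from this
- Validation set: Held out from training and used to tune settings and compare models
- Test set: Used at the end to evaluate on unseen data
- Reveals overfitting (memorizing the training data instead of learning general patterns)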
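
To see why evaluating on unseen data matters, consider a deliberately bad "model" that simply memorizes its training examples. This is a toy sketch: the data and the names `fit_memorizer`, `memorizer_predict`, and `accuracy` are made up for illustration.

```python
def fit_memorizer(examples):
    """A 'model' that stores every training example verbatim."""
    return dict(examples)


def memorizer_predict(model, url_count, default="legitimate"):
    """Look up the stored answer; unseen inputs fall back to a default guess."""
    return model.get(url_count, default)


def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(memorizer_predict(model, x) == y for x, y in examples)
    return correct / len(examples)


# Toy data: url_count -> label (hypothetical values)
train_set = [(0, "legitimate"), (1, "legitimate"), (6, "phishing"), (9, "phishing")]
test_set = [(2, "legitimate"), (5, "phishing"), (7, "phishing"), (8, "phishing")]

model = fit_memorizer(train_set)
print(f"Training accuracy: {accuracy(model, train_set):.2f}")
print(f"Test accuracy:     {accuracy(model, test_set):.2f}")
```

Measured only on the training set, the memorizer looks perfect. The held-out test set exposes that it has learned nothing it can apply to new emails.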

In [None]:
from sklearn.model_selection import train_test_split

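# Generate sample data (seeded so the result is reproducible)
np.random.seed(42)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label is 1 when the two features sum to more than 0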

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
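
The learning objectives also mention validation. One common arrangement is a three-way split, sketched below in plain Python as a stand-in for calling `train_test_split` twice. The `three_way_split` helper and the 60/20/20 proportions are assumptions for illustration, not a fixed rule.

```python
import random


def three_way_split(samples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then slice into train / validation / test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test_part = shuffled[:n_test]
    val_part = shuffled[n_test:n_test + n_val]
    train_part = shuffled[n_test + n_val:]
    return train_part, val_part, test_part


samples = list(range(100))  # stand-in for 100 labeled examples
train_set, val_set, test_set = three_way_split(samples)
print(f"Train: {len(train_set)}, Validation: {len(val_set)}, Test: {len(test_set)}")
```

The validation set is used while tuning the model; the test set stays untouched until the final evaluation so that its score remains an honest estimate.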

## 6. Evaluation Metrics

For classification:
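- **Accuracy**: Percentage of all predictions that are correct (can mislead when classes are imbalanced)
- **Precision**: Of the samples predicted positive, how many are actually positive?
- **Recall**: Of the actual positives, how many did the model find?
- **F1 Score**: Harmonic mean of precision and recall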

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Example predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # Actual labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # Model predictions

print("Confusion Matrix:")
print(confusion_matrix(y_true, y_pred))
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=["Benign", "Malicious"]))
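
The figures for the "Malicious" class in the report can be reproduced by hand, which helps show what each metric counts. The sketch below reuses the same example labels and treats class 1 (malicious) as the positive class.

```python
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # actual labels (same as the cell above)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]  # model predictions

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # threats correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # missed threats
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign correctly passed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} Recall={recall:.2f} F1={f1:.2f}")
```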

## 7. Security Context: Why Metrics Matter

**High Recall needed when:**
- Missing a threat is costly (malware detection)
- Better to have false alarms than miss attacks

**High Precision needed when:**
- False positives are expensive (blocking legitimate users)
- Alert fatigue is a concern

In [None]:
# Visualization of precision vs recall tradeoff
# (hand-picked illustrative values, not output from a trained model)
thresholds = np.linspace(0.1, 0.9, 9)
precision = [0.5, 0.55, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 0.98]
recall = [0.98, 0.95, 0.9, 0.85, 0.75, 0.65, 0.5, 0.35, 0.2]

plt.figure(figsize=(8, 6))
plt.plot(thresholds, precision, "b-o", label="Precision")
plt.plot(thresholds, recall, "r-o", label="Recall")
plt.xlabel("Detection Threshold")
plt.ylabel("Score")
plt.title("Precision vs Recall Tradeoff")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
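
In practice, curves like these come from sweeping a threshold over a model's scores. The following sketch computes precision and recall at three thresholds. The scores, labels, and the `precision_recall_at` helper are hypothetical.

```python
# Hypothetical model scores (estimated probability of "malicious") and true labels
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
y_true = [1, 1, 0, 1, 1, 0, 0, 0]


def precision_recall_at(threshold):
    """Flag every sample scoring at or above the threshold, then measure the result."""
    y_pred = [int(score >= threshold) for score in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # If nothing is flagged there are no false alarms; report 1.0 by convention
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


for threshold in (0.25, 0.50, 0.80):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold flags fewer samples, so false alarms drop (higher precision) while more real threats slip through (lower recall). Which side to favor depends on the costs listed above.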

## Key Takeaways

1. **Supervised**: Learn from labeled examples
2. **Unsupervised**: Find patterns without labels
3. **Features**: Input data the model uses
4. **Train/Test Split**: Evaluate on unseen data
5. **Metrics**: Choose based on security context

## Next Steps
- **Lab 00c**: Prompt Engineering Mastery
- **Lab 01**: Build your first classifier!