# Lesson 05: Your First ML Model

**What you'll learn:**
- The fit-predict pattern (all models use this!)
- Train a Decision Tree classifier
- Train a KNN classifier
- Train a Random Forest classifier
- Make predictions and check accuracy

---

## Section 1: The Fit-Predict Pattern

### READ

Every scikit-learn classifier follows the same three-step pattern:

```python
# Step 1: Create the model
model = SomeClassifier()

# Step 2: Train (fit) on training data
model.fit(X_train, y_train)

# Step 3: Predict on new data
predictions = model.predict(X_test)
```

Once you learn this pattern, you can use ANY classifier!
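
To see the pattern run end to end before touching the lesson's data, here is a minimal sketch on a synthetic dataset. The data from `make_classification` and the names `demo_models`, `demo_scores`, `Xd_train`, and so on are illustrative assumptions, not part of the lesson's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the lesson's CSV)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Three different classifiers, one identical loop body
demo_models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

demo_scores = {}
for name, model in demo_models.items():
    model.fit(Xd_train, yd_train)       # Step 2: train
    preds = model.predict(Xd_test)      # Step 3: predict
    demo_scores[name] = accuracy_score(yd_test, preds)
    print(f"{name}: {demo_scores[name]:.1%}")
```

Only the line that creates each model differs; the `fit` and `predict` calls are the same for all three.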

### TRY IT - Setup

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

# Load and prepare data
df = pd.read_csv('../datasets/tomatjus.csv')
X = df.drop('quality', axis=1)
y = df['quality']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale data (fit the scaler on the training set only, then reuse it on the test set)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X.shape[1]}")
print(f"Classes: {y.unique()}")

---

## Section 2: Decision Tree Classifier

### READ

A **Decision Tree** makes decisions by asking yes/no questions:

```
Is pH > 3.5?
├── Yes: Is pulp > 10?
│   ├── Yes: Predict "Premium"
│   └── No: Predict "Average"
└── No: Predict "Special"
```

- **Pros:** Easy to understand and visualize
- **Cons:** Can overfit (memorize the training data)
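
The sketch below illustrates both points on synthetic data: `export_text` prints a small tree as readable yes/no questions, and comparing training and test accuracy for a depth-limited tree against an unrestricted one is one way to spot memorization. The dataset, the feature names `f0` to `f3`, and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, slightly noisy data (an assumption, not the lesson's CSV)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.1, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# A shallow tree (at most 2 levels of questions) and an unrestricted one
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(Xd_train, yd_train)
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(Xd_train, yd_train)

# Print the shallow tree's questions as text
print(export_text(shallow_tree, feature_names=["f0", "f1", "f2", "f3"]))

# A large gap between train and test accuracy suggests overfitting
for label, clf in [("max_depth=2", shallow_tree), ("unrestricted", deep_tree)]:
    train_acc = clf.score(Xd_train, yd_train)
    test_acc = clf.score(Xd_test, yd_test)
    print(f"{label}: train={train_acc:.1%}, test={test_acc:.1%}")
```

The unrestricted tree keeps splitting until every training sample is classified correctly, so its training accuracy says little about how it will do on new data.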

### TRY IT

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Step 1: Create the model
tree = DecisionTreeClassifier(random_state=42)

# Step 2: Train (fit) on training data
tree.fit(X_train_scaled, y_train)

# Step 3: Predict on test data
tree_predictions = tree.predict(X_test_scaled)

# Check accuracy
tree_accuracy = accuracy_score(y_test, tree_predictions)
print(f"Decision Tree Accuracy: {tree_accuracy:.1%}")

In [None]:
# See some predictions vs actual
print("\nFirst 10 predictions:")
print(f"Predicted: {list(tree_predictions[:10])}")
print(f"Actual:    {list(y_test.iloc[:10])}")

---

## Section 3: K-Nearest Neighbors (KNN)

### READ

**KNN** is simple: to predict a new sample, find the K closest samples in the training data and predict the most common class among them.

Example with K=3:
- 3 nearest neighbors are: [Average, Average, Premium]
- Prediction: **Average** (2 out of 3)

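- **Pros:** Simple; `fit` mostly just stores the training data, so training is fast
- **Cons:** Prediction is slow on large datasets, and results depend on feature scaling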
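
The K=3 example above can be sketched by hand with NumPy. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the helper `knn_predict_one` and the toy points are hypothetical, and this is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_known, y_known, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every known sample
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(y_known[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two features, two classes (made-up values)
toy_X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [3.0, 3.0], [3.2, 2.9]])
toy_y = np.array(["Average", "Average", "Premium", "Premium", "Premium"])

# The 3 nearest neighbors of (1.0, 1.1) are Average, Premium, Average
toy_pred = knn_predict_one(toy_X, toy_y, np.array([1.0, 1.1]), k=3)
print(toy_pred)  # Average (2 votes out of 3)
```

Because the prediction is driven entirely by distances, a feature with a much larger numeric range would dominate the vote, which is why the setup cell scales the data first.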

### TRY IT

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create KNN with K=5 (look at 5 nearest neighbors)
knn = KNeighborsClassifier(n_neighbors=5)

# Train and predict
knn.fit(X_train_scaled, y_train)
knn_predictions = knn.predict(X_test_scaled)

# Check accuracy
knn_accuracy = accuracy_score(y_test, knn_predictions)
print(f"KNN (K=5) Accuracy: {knn_accuracy:.1%}")

In [None]:
# Try different K values (exploration only: choosing K by test accuracy makes the reported score optimistic)
print("\nTrying different K values:")
for k in [1, 3, 5, 7, 9]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scaled, y_train)
    acc = accuracy_score(y_test, knn_k.predict(X_test_scaled))
    print(f"  K={k}: {acc:.1%}")
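
If you want to choose K without looking at the test set, one common option is cross-validation on the training data. The sketch below is an illustration on synthetic data; the dataset and the names `cv_scores` and `best_k` are assumptions. Putting the scaler inside a pipeline means it is refit on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a training set (an assumption, not the lesson's CSV)
X_demo_train, y_demo_train = make_classification(
    n_samples=240, n_features=6, n_informative=4, random_state=0
)

cv_scores = {}
for k in [1, 3, 5, 7, 9]:
    # Scaling happens inside each fold, so no fold sees the others' statistics
    pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_demo_train, y_demo_train, cv=5).mean()
    print(f"  K={k}: mean CV accuracy {cv_scores[k]:.1%}")

# Pick the K with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(f"Best K by cross-validation: {best_k}")
```

The test set then stays untouched until the very end, when it gives an honest estimate for the chosen K.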

---

## Section 4: Random Forest

### READ

**Random Forest** builds MANY decision trees and combines their votes.

Like asking 100 experts instead of 1 - usually more accurate!

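**Why "Random"?** Each tree is trained on a random sample of the training rows (drawn with replacement), and each split considers only a random subset of the features. Because the trees make different mistakes, their combined vote overfits less than a single tree.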

- **Pros:** Often works great "out of the box"
- **Cons:** Slower than a single tree, harder to interpret
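
To make the idea of many trees voting concrete, here is a hand-rolled sketch: each tree is trained on a bootstrap sample of the rows, considers a random subset of features at each split, and the final prediction is a majority vote. The synthetic data and the names `mini_forest`, `all_votes`, and `forest_pred` are assumptions for illustration. This is not scikit-learn's internal implementation; its `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with labels 0 and 1 (an assumption, not the lesson's CSV)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

rng = np.random.default_rng(0)
mini_forest = []
for i in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap sample: draw row indices with replacement
    rows = rng.integers(0, len(Xd_train), size=len(Xd_train))
    # max_features="sqrt": each split looks at a random subset of features
    tree_i = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree_i.fit(Xd_train[rows], yd_train[rows])
    mini_forest.append(tree_i)

# One row of predictions per tree: shape (n_trees, n_test_samples)
all_votes = np.array([t.predict(Xd_test) for t in mini_forest])

# Majority vote for 0/1 labels: predict 1 when more than half the trees say 1
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

single_acc = (mini_forest[0].predict(Xd_test) == yd_test).mean()
forest_acc = (forest_pred == yd_test).mean()
print(f"First tree alone: {single_acc:.1%}")
print(f"Vote of 25 trees: {forest_acc:.1%}")
```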

### TRY IT

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create Random Forest with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train and predict
rf.fit(X_train_scaled, y_train)
rf_predictions = rf.predict(X_test_scaled)

# Check accuracy
rf_accuracy = accuracy_score(y_test, rf_predictions)
print(f"Random Forest Accuracy: {rf_accuracy:.1%}")

In [None]:
# Random Forest can tell us which features are important!
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (top 5):")
print(feature_importance.head())

---

## Section 5: Comparing Models

In [None]:
print("="*40)
print("MODEL COMPARISON")
print("="*40)
print(f"Decision Tree: {tree_accuracy:.1%}")
print(f"KNN (K=5):     {knn_accuracy:.1%}")
print(f"Random Forest: {rf_accuracy:.1%}")
print("="*40)

# Find winner
best = max([('Decision Tree', tree_accuracy), 
            ('KNN', knn_accuracy), 
            ('Random Forest', rf_accuracy)], key=lambda x: x[1])
print(f"\nWinner: {best[0]} with {best[1]:.1%} accuracy!")

---

## Quick Reference

| Classifier | Code | When to Use |
|------------|------|-------------|
| Decision Tree | `DecisionTreeClassifier()` | When you need interpretability |
| KNN | `KNeighborsClassifier(n_neighbors=5)` | Small datasets, simple problems |
| Random Forest | `RandomForestClassifier(n_estimators=100)` | Most problems, good default |

**The Pattern (same for ALL):**
```python
model = Classifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

---

## Next Lesson

In **Lesson 06: Model Evaluation**, you'll learn:
- Why accuracy isn't always enough
- Confusion matrix
- Precision, Recall, F1-score
- ROC curves