# Lesson 02: What is Machine Learning?

**What you'll learn:**
- What Machine Learning actually is (in simple words)
- Supervised vs Unsupervised Learning
- Classification vs Regression
- The ML Pipeline (step-by-step process)
- How this relates to YOUR assignment

---

## Section 1: The Big Picture

### READ

**Machine Learning** means teaching computers to learn patterns from data and then use those patterns to make predictions on NEW data.

Think of it like this:
- You show a child many pictures of cats and dogs
- The child learns to recognize the differences (pointy ears, whiskers, etc.)
- Now the child can identify NEW cats and dogs they've never seen before

That's the core idea behind ML!

```
Training Data --> [ML Algorithm] --> Trained Model
                                          |
New Data -----> [Trained Model] -----> Prediction
```

### TRY IT

Let's see a simple example:

In [None]:
# This is what ML looks like in code (don't worry about details yet!)

# Step 1: We have some data
emails = [
    "Buy cheap watches now!",      # spam
    "Meeting at 3pm tomorrow",      # not spam
    "You won $1000000!",            # spam
    "Project report attached",      # not spam
]

labels = ["spam", "not spam", "spam", "not spam"]

# Step 2: ML learns patterns from this data
# (In real ML, we'd train a model here)

# Step 3: Now we can predict on NEW emails!
new_email = "Congratulations! You won a prize!"
# A trained model would likely predict: "spam" (words like "won" appeared in the spam examples)

print("The ML model learned patterns from training emails.")
print("Now it can predict if NEW emails are spam or not!")

### EXPLAIN

What happened:
- **Training**: We showed the model examples with correct answers
- **Learning**: The model found patterns (spam emails have words like "buy", "won", "$")
- **Predicting**: The model uses those patterns on new data

**Key point**: The model learns FROM DATA, not from rules we write!
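
The training step that the TRY IT cell leaves out can be sketched with scikit-learn. This is only a minimal illustration under assumptions made here: `CountVectorizer` and `MultinomialNB` are one possible choice, not necessarily what later lessons use, and four emails are far too few for a real spam filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same toy data as in the TRY IT cell
emails = [
    "Buy cheap watches now!",
    "Meeting at 3pm tomorrow",
    "You won $1000000!",
    "Project report attached",
]
labels = ["spam", "not spam", "spam", "not spam"]

# Turn each email into word counts (these counts are the features)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)

# Training: the model learns which words go with which label
model = MultinomialNB()
model.fit(X_train, labels)

# Predicting: convert the new email with the SAME vectorizer, then predict
new_email = "Congratulations! You won a prize!"
X_new = vectorizer.transform([new_email])
prediction = model.predict(X_new)[0]
print(f"Prediction: {prediction}")
```

The model picks "spam" here because the words "you" and "won" appear only in the spam training examples. Nobody wrote a rule saying so; the pattern came from the data.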

---

## Section 2: Supervised Learning

### READ

In **Supervised Learning**, you give the computer:
1. **Input data** (features) - the information to learn from
2. **Correct answers** (labels) - what you want to predict

It's called "supervised" because you're supervising the learning by giving correct answers.

**Example: Predicting if email is spam**
- Features: words in email, sender address, time sent
- Label: "spam" or "not spam"

**YOUR ASSIGNMENT uses supervised learning:**
- Features: network traffic measurements (packet size, duration, etc.)
- Label: the traffic category ("benign" or one of four attack types)

### TRY IT

In [None]:
import pandas as pd

# Load our practice dataset
df = pd.read_csv('../datasets/tomatjus.csv')

# In supervised learning, we split data into:
# X = Features (what we use to predict)
# y = Labels (what we want to predict)

# Features: all columns EXCEPT 'quality'
X = df.drop('quality', axis=1)

# Labels: the 'quality' column (what we predict)
y = df['quality']

print("Features (X) - what we use to predict:")
print(X.head())
print(f"\nShape: {X.shape}")

In [None]:
print("Labels (y) - what we want to predict:")
print(y.head())
print(f"\nUnique labels: {y.unique()}")

### EXPLAIN

What we did:
- **X** (features) contains 11 measurements like pH, acidity, sugar
- **y** (labels) contains the quality rating we want to predict
- The model will learn: "Given these 11 measurements, predict the quality"

**Important terms:**
- X = Features = Input = Independent variables
- y = Labels = Target = Output = Dependent variable
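
If you don't have the CSV file at hand, the same X/y separation can be tried on a tiny hand-made table. The column names and values below are made up for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical mini dataset: names and numbers are invented for this sketch
data = pd.DataFrame({
    "ph": [4.1, 4.3, 3.9, 4.4],
    "sugar": [3.2, 2.8, 3.5, 2.9],
    "quality": ["Average", "Premium", "Average", "Special"],
})

X = data.drop("quality", axis=1)  # features: every column except the target
y = data["quality"]               # labels: the column we want to predict

print(X.shape)  # (4, 2): 4 samples, 2 features
print(y.shape)  # (4,): one label per sample
```

X always has one row per sample, and y has exactly one label for each of those rows.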

---

## Section 3: Classification vs Regression

### READ

Two main types of supervised learning:

### CLASSIFICATION (predicting categories)
- Is this email **spam or not**? (2 categories)
- What **type of attack** is this? (multiple categories)
- Is this transaction **fraud**? (yes/no)

### REGRESSION (predicting numbers)
- What will the **house price** be? ($350,000)
- How many **units will sell**? (1,500)
- What's the **temperature** tomorrow? (25.3 degrees)

---

**YOUR ASSIGNMENT is CLASSIFICATION:**
- Given network traffic features, predict the category: benign (normal), dos, probe, r2l, or u2r
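
The difference shows up in the type of the output. Here is a small sketch on made-up house data; the numbers, the two model classes, and the "small"/"large" labels are assumptions chosen only for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up feature: house size in square metres
sizes = [[50], [80], [100], [120]]

# REGRESSION: the target is a number (price)
prices = [150_000, 240_000, 300_000, 360_000]
regressor = LinearRegression().fit(sizes, prices)
predicted_price = regressor.predict([[110]])[0]

# CLASSIFICATION: the target is a category (size class)
size_classes = ["small", "small", "large", "large"]
classifier = DecisionTreeClassifier(random_state=0).fit(sizes, size_classes)
predicted_class = classifier.predict([[110]])[0]

print(f"Regression output: {predicted_price:,.0f}")  # a number
print(f"Classification output: {predicted_class}")   # a category
```

Same input (a 110 square metre house), two different kinds of answer: the regressor returns a price, the classifier returns one of the known categories.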

### TRY IT

In [None]:
# Classification example - predicting categories
print("CLASSIFICATION: Predicting categories")
print("="*40)

# In our tomato juice dataset, the target is a quality CATEGORY
print("Tomato juice quality categories:")
print(df['quality'].value_counts())
print("\nThis is classification because we predict a CATEGORY (Average/Premium/Special)")

In [None]:
# Let's look at your assignment dataset
print("YOUR ASSIGNMENT: Network Intrusion Detection")
print("="*40)

# Load the training data
nsl = pd.read_csv('../datasets/NSL_KDD/NSL_ppTrain.csv')

# Check the attack categories
print("Attack categories you'll predict:")
print(nsl['atakcat'].value_counts())
print("\nThis is MULTI-CLASS classification (5 categories)!")

### EXPLAIN

**Classification types:**
- **Binary**: 2 classes (spam/not spam, fraud/not fraud)
- **Multi-class**: 3+ classes (your assignment has 5: benign, dos, probe, r2l, u2r)

**Your assignment categories:**
- **benign**: Normal, safe network traffic
- **dos**: Denial of Service attacks (flood the network)
- **probe**: Scanning/surveillance attacks
- **r2l**: Remote to Local (unauthorized access from outside)
- **u2r**: User to Root (privilege escalation)
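
The same labels can be viewed as a multi-class or a binary problem. The sketch below uses a short made-up list with the five category names from this lesson. Your assignment stays multi-class; the binary version is shown only to make the distinction concrete.

```python
import pandas as pd

# Made-up sample of labels, using the five category names from this lesson
atakcat = pd.Series(["benign", "dos", "benign", "probe", "u2r", "dos", "r2l"])

# Multi-class view: keep all five categories
print(f"Number of classes: {atakcat.nunique()}")

# Binary view: keep "benign", replace every attack type with "attack"
binary = atakcat.where(atakcat == "benign", "attack")
print(binary.value_counts())
```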

---

## Section 4: The ML Pipeline

### READ

Most ML projects follow roughly these steps:

```
1. GET DATA       --> Load your dataset
2. EXPLORE DATA   --> Understand what you have (EDA)
3. PREPROCESS     --> Clean and prepare the data
4. SPLIT DATA     --> Separate into training and testing
5. TRAIN MODEL    --> Let the algorithm learn patterns
6. EVALUATE       --> Check how well it performs
7. OPTIMIZE       --> Improve the model
8. PREDICT        --> Use model on new data
```

This course will teach you each step!
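
Steps 1 and 3 to 6 fit in a few lines. The sketch below uses the wine dataset bundled with scikit-learn (`load_wine`) as a stand-in so that it runs without any course files; that dataset choice is an assumption made here, not part of the course material.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Steps 1 and 3: get the data, already separated into features and labels
X, y = load_wine(return_X_y=True)

# Step 4: hold back 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: train on the training part only
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 6: evaluate on the held-back test part
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Train samples: {len(X_train)}, test samples: {len(X_test)}")
print(f"Accuracy: {accuracy:.1%}")
```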

### TRY IT

In [None]:
# Let's walk through the ML pipeline with a simple example
# (Don't worry about understanding every line - just see the flow!)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# STEP 1: GET DATA
print("Step 1: Loading data...")
df = pd.read_csv('../datasets/tomatjus.csv')
print(f"   Loaded {len(df)} samples")

In [None]:
# STEP 2: EXPLORE DATA (briefly)
print("Step 2: Exploring data...")
print(f"   Features: {df.shape[1]-1}")
print(f"   Target classes: {df['quality'].unique()}")

In [None]:
# STEP 3: PREPROCESS (separate features and labels)
print("Step 3: Preprocessing...")
X = df.drop('quality', axis=1)
y = df['quality']
print(f"   X shape: {X.shape}")
print(f"   y shape: {y.shape}")

In [None]:
# STEP 4: SPLIT DATA (80% train, 20% test)
print("Step 4: Splitting data...")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"   Training samples: {len(X_train)}")
print(f"   Testing samples: {len(X_test)}")

In [None]:
# STEP 5: TRAIN MODEL
print("Step 5: Training model...")
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("   Model trained!")

In [None]:
# STEP 6: EVALUATE
print("Step 6: Evaluating...")
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"   Accuracy: {accuracy:.1%}")

In [None]:
# STEP 7 & 8: OPTIMIZE and PREDICT (we'll learn later!)
print("Step 7: Optimization (Lessons 7-9)")
print("Step 8: Make predictions on new data")

# Example prediction
sample = X_test.iloc[0:1]  # Get first test sample
prediction = model.predict(sample)
print(f"\nExample prediction: {prediction[0]}")

### EXPLAIN

**Why split into Train and Test?**

Imagine studying for an exam:
- If you only see practice questions (training) and the exam has the SAME questions → easy 100%
- But if the exam has NEW questions → that tests if you really learned

Same with ML:
- Training data: Model learns patterns
- Test data: Check if model learned real patterns (not just memorized)

**Never test on training data!** It's like grading a student on questions they've already seen.
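
You can see the effect directly by scoring one model on both sets. This is a sketch on synthetic data: `make_classification` with random label noise and an unrestricted decision tree are choices made here to make memorization easy to observe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns 20% of the labels at random (pure noise)
X, y = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A decision tree with no depth limit can memorize its training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on data it has seen
test_acc = model.score(X_test, y_test)     # scored on data it has NOT seen
print(f"Accuracy on training data: {train_acc:.1%}")
print(f"Accuracy on test data:     {test_acc:.1%}")
```

The training score is perfect because the tree memorized even the noisy labels. Only the test score tells you how the model behaves on new data.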

---

## Section 5: Your Assignment Explained Simply

### READ

**What you need to do:**

1. **Build a BASELINE model**
   - Train a classifier with default settings
   - Check how well it performs

2. **OPTIMIZE it** using ONE technique:
   - **Hyperparameter tuning**: Adjust model settings (Lesson 07)
   - **Feature selection**: Use only the best features (Lesson 08)
   - **Handle imbalance**: Fix unequal class sizes (Lesson 09)

3. **COMPARE** baseline vs optimized
   - Show the improvement in numbers
   - Explain what changed

4. **Write a REPORT**
   - What you did
   - What improved
   - Which model is better
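
The baseline-versus-optimized comparison can be sketched as follows. Everything here is a stand-in: the data is synthetic, `LogisticRegression` is an arbitrary model choice, `class_weight="balanced"` is just one possible imbalance technique, and macro F1 is one possible metric. The lessons referenced above may use different ones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: about 95% of samples in one class
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 1. Baseline: default settings
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Optimized: ONE change (here: reweight classes to counter imbalance)
optimized = LogisticRegression(max_iter=1000, class_weight="balanced")
optimized.fit(X_train, y_train)

# 3. Compare both models on the SAME test set with the SAME metric
results = {}
for name, model in [("baseline", baseline), ("optimized", optimized)]:
    results[name] = f1_score(y_test, model.predict(X_test), average="macro")
    print(f"{name:>9}: macro F1 = {results[name]:.3f}")
```

Changing only one thing between the two models is what makes the comparison meaningful: any difference in the score can be attributed to that change.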

### TRY IT

In [None]:
# Let's look at your assignment dataset more closely
nsl = pd.read_csv('../datasets/NSL_KDD/NSL_ppTrain.csv')

print("YOUR ASSIGNMENT DATASET:")
print("="*40)
print(f"Total samples: {len(nsl):,}")
print(f"Features: {nsl.shape[1] - 2}")  # minus 2 for label columns
print(f"\nClass distribution:")
print(nsl['atakcat'].value_counts())

In [None]:
# Notice the IMBALANCE problem!
counts = nsl['atakcat'].value_counts()
print("\nIMBALANCE PROBLEM:")
print(f"Most common class (benign): {counts['benign']:,} samples")
print(f"Least common class (u2r): {counts['u2r']:,} samples")
print(f"\nRatio: benign has about {counts['benign'] // counts['u2r']:,} times as many samples as u2r!")
print("\nThis is why Lesson 09 (Handling Imbalance) is important!")

### EXPLAIN

**The challenge in your assignment:**

The data is **highly imbalanced**:
- benign (normal): ~67,000 samples
- u2r (attacks): only 52 samples!

If a model just predicts "benign" for everything, it still gets roughly 50% accuracy, because about half of the samples are benign.
But it NEVER catches a single attack, u2r or otherwise, so it is useless for security.

You'll learn how to handle this in Lesson 09.
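
This accuracy trap is easy to reproduce. The sketch below uses made-up class counts that only mimic the imbalance (they are not the real dataset sizes) and scikit-learn's `DummyClassifier`, which always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up label counts that mimic the imbalance (1,000 samples in total)
y = np.array(
    ["benign"] * 530 + ["dos"] * 360 + ["probe"] * 90
    + ["r2l"] * 15 + ["u2r"] * 5
)
X = np.zeros((len(y), 1))  # placeholder features; this model ignores them

# A "model" that always answers with the most common class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

accuracy = accuracy_score(y, pred)
# Share of real u2r samples that were predicted as u2r
u2r_recall = (pred[y == "u2r"] == "u2r").mean()

print(f"Accuracy: {accuracy:.0%}")
print(f"u2r attacks caught: {u2r_recall:.0%}")
```

The accuracy looks passable while the share of u2r attacks caught is zero, which is why accuracy alone is not enough on imbalanced data.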

---

## Key Vocabulary

| Term | Simple Meaning |
|------|----------------|
| **Machine Learning** | Computer learns patterns from data |
| **Supervised Learning** | Learning with correct answers provided |
| **Classification** | Predicting categories (spam/not spam) |
| **Regression** | Predicting numbers (house price) |
| **Features (X)** | Input data used for prediction |
| **Labels (y)** | Correct answers we want to predict |
| **Training data** | Data the model learns from |
| **Test data** | Held-out data the model never saw during training, used to check performance |
| **Model** | The trained "brain" that makes predictions |
| **Accuracy** | % of correct predictions (can be misleading on imbalanced data) |

---

## Quiz Yourself

1. Is predicting house prices **classification** or **regression**?
2. Is detecting spam emails **classification** or **regression**?
3. In your assignment (NSL-KDD), what are the **features** and what is the **label**?
4. Why do we split data into **training and testing** sets?
5. What happens if we test on the **same data** we trained on?

---

## Answers

<details>
<summary>Click to see answers</summary>

1. **Regression** (predicting a number: $350,000)
2. **Classification** (predicting a category: spam or not spam)
3. Features: network traffic measurements (41 columns like packet size, duration, protocol)
   Label: attack category (benign, dos, probe, r2l, u2r)
4. To check if the model learned real patterns, not just memorized the training data
5. The score can look very high even if the model fails on new data, so overfitting (memorizing instead of learning) goes unnoticed
</details>

---

## Next Lesson

In **Lesson 03: Data Exploration**, you'll learn:
- How to visualize your data with charts
- How to find patterns and correlations
- How to detect outliers
- How to check class balance