# Lecture 06: Classifier Evaluation
## Possible Subjective Exam Questions
---

## Section 1: Introduction to Classifier Evaluation

### Q1. Why is error on training data not a good indicator of performance on future data?

**Answer:**

Training error is not a good indicator because:

1. **New data is different:** Future data will probably not be exactly the same as training data

2. **Overfitting problem:** The model might fit the training data too precisely
   - It memorizes the training data instead of learning the pattern
   - This leads to poor results on new data

3. **Training error underestimates test error:** The training error can be very low but test error can be much higher

4. **False confidence:** A model with 100% training accuracy might fail badly on real-world data

### Q2. What is overfitting? Why is it a problem in machine learning?

**Answer:**

**Overfitting:**
Overfitting occurs when a model fits the training data too precisely, capturing noise and random fluctuations instead of the true underlying pattern.

**Why it's a problem:**

1. Model performs very well on training data but poorly on new/test data
2. The model memorizes instead of generalizes
3. It captures noise as if it were a real pattern
4. Makes the model unreliable for real-world predictions
5. Reduces the practical usefulness of the model

### Q3. What is the difference between training error and test error?

**Answer:**

| Training Error | Test Error |
|----------------|------------|
| Error calculated on data used for training | Error calculated on new unseen data |
| Easy to calculate | More meaningful for real performance |
| Usually lower | Usually higher |
| Can be misleading | Better indicator of true performance |
| May underestimate true error | Closer to actual real-world error |

**Key Point:** Training error often dramatically underestimates the test error.

## Section 2: Classifier Construction

### Q4. Explain the components involved in classifier construction.

**Answer:**

The main components in classifier construction are:

**1. Target Function:**
$$f(x, y) = z, \quad z \in [-l, +l]$$
This is the true function we want to learn.

**2. Classifier Hypothesis:**
The type of function we assume (linear, quadratic, cubic, etc.)

**3. Iterative Optimization:**
Finding the best hypothesis by iteratively improving the model

**4. Classifier Parameters:**
Values learned from data (e.g., slope and intercept for a line)

**5. Hyperparameters:**
Values set before training (e.g., regularization coefficients)

### Q5. What is the difference between classifier parameters and hyperparameters? Give examples.

**Answer:**

| Classifier Parameters | Hyperparameters |
|----------------------|------------------|
| Learned from training data | Set before training begins |
| Model finds optimal values | User/developer chooses values |
| Internal to the model | External to the model |
| **Examples:** slope, intercept, weights | **Examples:** regularization coefficient, learning rate |
| Change during training | Fixed during training |
| Determined by optimization | Determined by validation |

## Section 3: Evaluation Measures

### Q6. Define Classification Accuracy Rate and Classification Error Rate.

**Answer:**

**Classification Accuracy Rate:**

$$\text{Accuracy} = \frac{\text{Number of correct classifications}}{\text{Total number of test samples}}$$

**Classification Error Rate:**

$$\text{Error Rate} = \frac{\text{Number of incorrect classifications}}{\text{Total number of test samples}}$$

**Relationship:**
$$\text{Accuracy} + \text{Error Rate} = 1$$

**Example:**
If 90 out of 100 samples are correctly classified:
- Accuracy = 90/100 = 0.90 or 90%
- Error Rate = 10/100 = 0.10 or 10%
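
The two rates above can be computed directly from true and predicted labels. The following is a minimal sketch; the function names `accuracy` and `error_rate` and the label lists are illustrative assumptions, not part of the lecture.

```python
def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of misclassified test samples; equals 1 - accuracy."""
    return 1.0 - accuracy(y_true, y_pred)


# Hypothetical test set: 100 samples, 90 of them classified correctly.
y_true = [1] * 100
y_pred = [1] * 90 + [0] * 10
print(accuracy(y_true, y_pred))    # 0.9
print(error_rate(y_true, y_pred))  # approximately 0.1 (floating-point rounding)
```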

### Q7. Explain True Positive, False Positive, True Negative, and False Negative with a medical diagnosis example.

**Answer:**

Consider a disease detection classifier:

**1. True Positive (TP):**
- Classifier says: "You have the disease"
- Reality: You are actually ill
- Correct positive prediction ✓

**2. False Positive (FP):**
- Classifier says: "You have the disease"
- Reality: You are actually healthy
- Wrong positive prediction ✗ (Type I Error)

**3. True Negative (TN):**
- Classifier says: "You don't have the disease"
- Reality: You are actually healthy
- Correct negative prediction ✓

**4. False Negative (FN):**
- Classifier says: "You don't have the disease"
- Reality: You are actually ill
- Wrong negative prediction ✗ (Type II Error)

### Q8. Draw and explain the Confusion Matrix.

**Answer:**

**Confusion Matrix:**

```
                    Predicted Class
                 Positive    Negative
Actual    Positive   TP          FN
Class     Negative   FP          TN
```

**Explanation:**
- **TP (True Positive):** Actually positive, predicted positive
- **FN (False Negative):** Actually positive, predicted negative
- **FP (False Positive):** Actually negative, predicted positive
- **TN (True Negative):** Actually negative, predicted negative

**Uses:**
- Calculate accuracy, precision, recall
- Understand types of errors
- Compare different classifiers
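
As an illustration, the four cells of the matrix can be counted from label lists. This is a sketch under assumptions: the helper name `confusion_counts`, the choice of `1` as the positive label, and the example labels are all hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary classification problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1          # actually positive, predicted positive
        elif actual == positive:
            fn += 1          # actually positive, predicted negative
        elif predicted == positive:
            fp += 1          # actually negative, predicted positive
        else:
            tn += 1          # actually negative, predicted negative
    return tp, fn, fp, tn


# Hypothetical labels for eight test samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)

# Same layout as the matrix above: rows are actual, columns are predicted.
print("Actual Pos:", tp, fn)  # Actual Pos: 3 1
print("Actual Neg:", fp, tn)  # Actual Neg: 1 3
```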

### Q9. Define and explain Sensitivity (True Positive Rate) and Specificity (True Negative Rate).

**Answer:**

**Sensitivity (True Positive Rate / Recall):**

$$\text{TPR} = \frac{TP}{TP + FN}$$

- Probability of positive test result among those who actually have the disease
- Also called "Recall"
- Measures: How good is the model at finding positive cases?

**Specificity (True Negative Rate):**

$$\text{TNR} = \frac{TN}{TN + FP}$$

- Probability of negative test result among those who don't have the disease
- Measures: How good is the model at avoiding false alarms?

**Example:**
- High sensitivity: Important in disease screening (don't miss sick people)
- High specificity: Important to avoid unnecessary treatments

### Q10. Define False Positive Rate and False Negative Rate with formulas.

**Answer:**

**False Positive Rate (FPR):**

$$\text{FPR} = \frac{FP}{TN + FP}$$

- Probability of positive test result among those who are actually negative
- Also called "Fall-out"
- Related to specificity: $\text{FPR} = 1 - \text{Specificity}$

**False Negative Rate (FNR):**

$$\text{FNR} = \frac{FN}{TP + FN}$$

- Probability of negative test result among those who are actually positive
- Also called "Miss Rate"
- Related to sensitivity: $\text{FNR} = 1 - \text{Sensitivity}$
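
The four rates from Q9 and Q10 can be sketched as one small function of the confusion-matrix counts. The function name `rates` and the example counts are assumptions made for illustration.

```python
def rates(tp, fn, fp, tn):
    """Return sensitivity, specificity, FPR and FNR from confusion-matrix counts."""
    positives = tp + fn  # all samples that are actually positive
    negatives = tn + fp  # all samples that are actually negative
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return {
        "sensitivity": tp / positives,  # TPR = TP / (TP + FN)
        "specificity": tn / negatives,  # TNR = TN / (TN + FP)
        "fpr": fp / negatives,          # FPR = FP / (TN + FP)
        "fnr": fn / positives,          # FNR = FN / (TP + FN)
    }


# Hypothetical counts: 50 actual positives, 50 actual negatives.
r = rates(tp=40, fn=10, fp=5, tn=45)
print(r["sensitivity"], r["specificity"])  # 0.8 0.9
print(r["fpr"], r["fnr"])                  # 0.1 0.2
```

Because each pair shares a denominator, `fpr` and `specificity` sum to 1, as do `fnr` and `sensitivity`.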

### Q11. Why might we consider total cost/benefit instead of simple accuracy?

**Answer:**

Different errors have different costs in real-world applications:

**Examples:**

1. **Medical Diagnosis:**
   - False Negative (missing a disease): Very costly - patient doesn't get treatment
   - False Positive (wrong diagnosis): Less costly - extra tests done

2. **Spam Detection:**
   - False Positive (important email marked as spam): Very costly
   - False Negative (spam in inbox): Less costly

3. **Fraud Detection:**
   - False Negative (missing fraud): Very costly - money lost
   - False Positive (blocking valid transaction): Annoying but less costly

**Conclusion:** Simple accuracy treats all errors equally, but in practice, some errors are more expensive than others.
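
To make the point concrete, here is a small sketch that compares two classifiers with identical accuracy but different error mixes. The cost values (a false negative costing 20 times a false positive) and the error counts are invented purely for illustration.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a classifier's errors given a per-error cost for each type."""
    return fp * cost_fp + fn * cost_fn


# Two hypothetical classifiers, each making 50 errors on 1000 samples
# (so both have 95% accuracy), evaluated with assumed costs.
COST_FP, COST_FN = 1, 20

cost_a = total_cost(fp=40, fn=10, cost_fp=COST_FP, cost_fn=COST_FN)
cost_b = total_cost(fp=10, fn=40, cost_fp=COST_FP, cost_fn=COST_FN)

print(cost_a)  # 240
print(cost_b)  # 810
```

Under these assumed costs, classifier A is clearly preferable even though accuracy cannot tell the two apart.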

## Section 4: Bias-Variance Tradeoff

### Q12. Explain the Bias-Variance Tradeoff with a diagram description.

**Answer:**

**Bias:**
- Difference between the model's expected prediction (averaged over possible training sets) and the true value
- Low bias = Model accurately estimates the target function
- High bias = Model is too simple (underfitting)

**Variance:**
- How much the estimate changes when training set varies
- Low variance = Model is stable across different training sets
- High variance = Model is too sensitive to training data (overfitting)

**Tradeoff:**

| Model Complexity | Bias | Variance | Training Error | Test Error |
|-----------------|------|----------|----------------|------------|
| Low (simple) | High | Low | High | High |
| Medium (optimal) | Medium | Medium | Medium | Lowest |
| High (complex) | Low | High | Low | High |

**Goal:** Find the sweet spot where both bias and variance are balanced for minimum test error.

### Q13. What happens to training error and test error as model complexity increases?

**Answer:**

**As model complexity increases:**

**Training Error:**
- Generally keeps decreasing
- Can become very low or even zero
- Model fits training data better and better

**Test Error:**
- Initially decreases (model learns useful patterns)
- Reaches a minimum at optimal complexity
- Then increases (overfitting begins)
- Model starts fitting noise instead of patterns

**Key Insight:**
The gap between training error and test error increases with complexity, indicating overfitting.

## Section 5: Holdout Method

### Q14. Explain the Holdout Method for classifier evaluation.

**Answer:**

**Holdout Method:**

A simple evaluation technique for large datasets.

**Steps:**
1. Randomly split data into two parts:
   - Training set: Usually 2/3 of data (67%)
   - Test set: Usually 1/3 of data (33%)
2. Build classifier using only the training set
3. Evaluate classifier using only the test set
4. Report the error/accuracy on test set

**When to use:**
- When you have thousands of examples
- When each class has several hundred examples

**Limitation:**
- For small or unbalanced datasets, samples might not be representative
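
The split in step 1 can be sketched with the standard library alone. The function name `holdout_split`, its parameters, and the fixed seed are assumptions for illustration; training and evaluating a classifier (steps 2 to 4) are left out.

```python
import random


def holdout_split(data, test_fraction=1 / 3, seed=0):
    """Randomly split data into (train, test) without overlap."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset of 900 sample indices.
train, test = holdout_split(range(900))
print(len(train), len(test))  # 600 300
```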

### Q15. What are the limitations of the simple Holdout Method?

**Answer:**

**Limitations:**

1. **Randomness:** Results depend on the random split
   - Different splits give different results

2. **Data waste:** Only 2/3 of the data is used for training
   - Less training data means potentially weaker model

3. **Small datasets:** May not work well
   - Some classes may have very few samples in test set

4. **Unbalanced data:** Some classes may be missing
   - Rare classes might not appear in test set

5. **Single estimate:** Only one error estimate
   - No measure of variance/reliability

## Section 6: Handling Unbalanced Data

### Q16. What is unbalanced data? Give real-world examples.

**Answer:**

**Unbalanced Data:**
Data where classes have very unequal frequencies - one class has many more samples than others.

**Real-world Examples:**

| Application | Majority Class | Minority Class |
|------------|----------------|----------------|
| Attrition Prediction | 97% stay | 3% leave |
| Medical Diagnosis | 90% healthy | 10% disease |
| eCommerce | 99% don't buy | 1% buy |
| Security/Terrorism | >99.99% normal citizens | <0.01% terrorists |
| Fraud Detection | 99.9% legitimate | 0.1% fraud |

**Problem:**
In the attrition example, a majority class classifier (one that always predicts the majority class) achieves 97% accuracy but is completely useless!

### Q17. How do you balance unbalanced data for training?

**Answer:**

**Balancing Approach for Two Classes:**

1. Randomly select desired number of minority class instances
2. Add equal number of randomly selected majority class instances
3. Train model on this balanced set

**For Multiple Classes:**
- Ensure each class is represented with approximately equal proportions in both training and test sets

**Other Techniques:**
- **Oversampling:** Duplicate minority class samples
- **Undersampling:** Remove majority class samples
- **SMOTE:** Generate synthetic minority samples
- **Cost-sensitive learning:** Assign higher cost to minority class errors
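
The two-class balancing approach above (a form of undersampling) might look like the following sketch. The function name `balance_two_classes`, the class names, and the dataset sizes are hypothetical.

```python
import random


def balance_two_classes(samples, labels, n_per_class, seed=0):
    """Return a shuffled list of (sample, label) with n_per_class items per class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    if len(by_class) != 2:
        raise ValueError("this sketch handles exactly two classes")
    balanced = []
    for y, xs in by_class.items():
        if n_per_class > len(xs):
            raise ValueError(f"class {y!r} has only {len(xs)} samples")
        # rng.sample draws without replacement within each class
        balanced.extend((x, y) for x in rng.sample(xs, n_per_class))
    rng.shuffle(balanced)
    return balanced


# Hypothetical attrition data: 970 "stay" and 30 "leave" employees.
samples = list(range(1000))
labels = ["stay"] * 970 + ["leave"] * 30
balanced = balance_two_classes(samples, labels, n_per_class=30)
print(len(balanced))  # 60
```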

### Q18. Why can a 97% accurate classifier be useless?

**Answer:**

**Example: Attrition Prediction**
- 97% employees stay, 3% leave

**Majority Class Classifier:**
- Always predicts "stay" for everyone
- Accuracy = 97%

**Why it's useless:**
1. It never identifies any employee who will leave
2. The whole purpose was to predict attrition, but it misses 100% of actual attrition cases
3. Sensitivity/Recall = 0%
4. No actionable insights for HR

**Lesson:**
High accuracy doesn't mean useful predictions. We need to look at other metrics like sensitivity, specificity, and per-class accuracy.
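
The attrition example can be reproduced in a few lines. As a simplifying assumption, the majority class is read directly off the evaluation labels rather than learned from a separate training set.

```python
from collections import Counter

# 100 hypothetical employees: 97 stay, 3 leave ("leave" is the positive class).
y_true = ["stay"] * 97 + ["leave"] * 3

# Majority class classifier: always predict the most frequent label.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == "leave" and p == "leave" for t, p in zip(y_true, y_pred))
fn = sum(t == "leave" and p != "leave" for t, p in zip(y_true, y_pred))
sensitivity = tp / (tp + fn)

print(accuracy)     # 0.97
print(sensitivity)  # 0.0
```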

## Section 7: Parameter Tuning and Validation

### Q19. Why should test data never be used for parameter tuning?

**Answer:**

**Reason:**

Test data must be completely unseen by the model until final evaluation because:

1. **Data leakage:** Using test data for tuning means the model has "seen" it
2. **Biased evaluation:** Error estimate will be optimistically biased
3. **Overfitting to test set:** Model gets tuned to perform well on that specific test set
4. **Invalid results:** The reported accuracy won't reflect true performance on new data

**Proper Procedure:**
Use three separate sets:
- **Training data:** Build the basic model
- **Validation data:** Tune hyperparameters
- **Test data:** Final evaluation only

### Q20. Explain the purpose of Training, Validation, and Test sets.

**Answer:**

| Set | Purpose | When Used |
|-----|---------|----------|
| **Training Set** | Build the basic structure of the model | During model training |
| **Validation Set** | Optimize hyperparameters, select best model | During tuning phase |
| **Test Set** | Final unbiased evaluation | Only at the end |

**Two-Stage Learning:**
- Stage 1: Build basic structure using training data
- Stage 2: Optimize hyperparameters using validation data
- Final: Evaluate using test data

**Important:** After evaluation is complete, all data can be used to build the final classifier for deployment.
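
The three-set workflow can be sketched end to end with a toy model. Everything specific here is an assumption for illustration: the 60/20/20 split, the synthetic 1-D data, and the use of a k-nearest-neighbour classifier whose hyperparameter `k` is tuned on the validation set.

```python
import random


def three_way_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut into disjoint (train, validation, test) sets."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points closest to x."""
    neighbours = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes_for_one = sum(label for _, label in neighbours)
    return 1 if 2 * votes_for_one > len(neighbours) else 0


def error_rate(train, eval_set, k):
    wrong = sum(knn_predict(train, x, k) != y for x, y in eval_set)
    return wrong / len(eval_set)


# Synthetic data: label is 1 when x > 0.5, flipped with probability 0.1.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

train, val, test = three_way_split(data)

# Tune the hyperparameter on the validation set only.
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: error_rate(train, val, k))

# Touch the test set exactly once, for the final estimate.
test_error = error_rate(train, test, best_k)
print("chosen k:", best_k, "test error:", test_error)
```

The test set plays no part in choosing `k`, so `test_error` is not biased by the tuning step.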

### Q21. What should be done after evaluation is complete?

**Answer:**

After evaluation is complete:

1. **Use all data for final model:** Combine training, validation, and test sets
2. **Retrain the model:** Build final classifier on complete data
3. **Deploy:** Use this model for real predictions

**Why?**
- Larger training data generally gives better classifiers
- Returns diminish, but more data is still helpful
- The evaluation phase was just to estimate performance
- Final deployed model should use maximum available data

## Section 8: Stratified Sampling

### Q22. What is stratified sampling? Why is it important?

**Answer:**

**Stratified Sampling:**
A sampling technique where samples are selected in the same proportion as they appear in the population.

**How it works:**
1. Divide population into groups called 'strata' based on a characteristic (usually class label)
2. Sample from each stratum proportionally
3. Each class then appears in the training and test subsets in approximately the same proportion as in the full dataset

**Why important:**
- Ensures all classes are represented
- Reduces variance in error estimates
- Especially important for small or unbalanced datasets
- More reliable evaluation
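
A stratified train/test split can be sketched by splitting each class separately and merging the results. The function name `stratified_split`, the one-third test fraction, and the example labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) where each class is split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)          # one stratum per class label
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 90 healthy, 30 disease (75% / 25%).
labels = ["healthy"] * 90 + ["disease"] * 30
train_idx, test_idx = stratified_split(labels)

test_disease = sum(labels[i] == "disease" for i in test_idx)
print(len(train_idx), len(test_idx))  # 80 40
print(test_disease / len(test_idx))   # 0.25, same as in the full dataset
```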

## Section 9: Repeated Holdout Method

### Q23. Explain the Repeated Holdout Method with algorithm.

**Answer:**

**Repeated Holdout Method:**
Making holdout estimate more reliable by repeating with different subsamples.

**Algorithm:**

```
Input: A = [{x_i, y_i, l_i}] with n samples

1. Initialize: Error_Rate = 0

2. Repeat for k = 1 to 10:
   a. Randomly sample n/4 samples for training and test
   b. Train the classifier
   c. Compute error rate e(k) on test set
   d. Error_Rate = Error_Rate + e(k)

3. Return: Error_Rate / 10
```

**Advantage:** More reliable than single holdout

**Limitation:** Test sets overlap between iterations
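
A runnable sketch of the repeated holdout idea follows. Several details are assumptions rather than the lecture's specification: the one-third test fraction, the callback names `train_fn` and `error_fn`, and the toy threshold classifier used as a stand-in for a real model.

```python
import random


def repeated_holdout(data, train_fn, error_fn, repeats=10, test_fraction=1 / 3, seed=0):
    """Average the holdout error over several independent random splits."""
    rng = random.Random(seed)
    total_error = 0.0
    for _ in range(repeats):
        shuffled = list(data)
        rng.shuffle(shuffled)                  # a fresh random split each time
        n_test = round(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = train_fn(train)
        total_error += error_fn(model, test)
    return total_error / repeats


def train_threshold(train):
    """Toy 1-D classifier: threshold halfway between the two class means.
    Assumes both classes occur in the training split."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def threshold_error(threshold, test):
    wrong = sum((x > threshold) != (y == 1) for x, y in test)
    return wrong / len(test)


# Synthetic data: label is 1 exactly when x > 0.5.
rng = random.Random(1)
data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]

print(repeated_holdout(data, train_threshold, threshold_error))
```

Because every repetition draws its test set independently, the same instance can land in several test sets, which is the overlap limitation noted above.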

### Q24. What is the main limitation of Repeated Holdout Method?

**Answer:**

**Main Limitation:**
The different test sets overlap with each other.

**Problem with overlapping:**
- Same instances appear in multiple test sets
- Error estimates are not fully independent
- Some instances are tested multiple times, others never

**Solution:**
Cross-validation avoids overlapping test sets by ensuring each instance is tested exactly once.

## Section 10: Cross-Validation

### Q25. Explain K-Fold Cross-Validation with diagram description.

**Answer:**

**K-Fold Cross-Validation:**

**Steps:**
1. Split data into k subsets of equal size
2. For each iteration i (from 1 to k):
   - Use subset i as test set
   - Use remaining (k-1) subsets as training set
   - Train model and calculate error
3. Average all k error estimates

**Example with k=5:**
```
Iteration 1: [TEST] [Train] [Train] [Train] [Train]
Iteration 2: [Train] [TEST] [Train] [Train] [Train]
Iteration 3: [Train] [Train] [TEST] [Train] [Train]
Iteration 4: [Train] [Train] [Train] [TEST] [Train]
Iteration 5: [Train] [Train] [Train] [Train] [TEST]
```

**Advantage:** No overlapping test sets - each instance is tested exactly once.
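
The fold construction can be sketched as an index generator. The name `k_fold_indices` and the shuffling seed are assumptions; model training is omitted so that only the partitioning logic is shown.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# 5-fold cross-validation on 100 samples.
for train_idx, test_idx in k_fold_indices(100, 5):
    print(len(train_idx), len(test_idx))  # 80 20 in every iteration
```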

### Q26. Why is 10-fold cross-validation considered the standard method?

**Answer:**

**10-Fold Cross-Validation is standard because:**

1. **Empirical evidence:** Extensive experiments have shown that k=10 is about the right choice for a reliable error estimate

2. **Good bias-variance tradeoff:**
   - Low k (like 2): High bias, low variance
   - High k (like n): Low bias, high variance
   - k=10: Good balance

3. **Sufficient training data:** 90% of the data is used for training in each fold

4. **Reliable test estimate:** 10 different test sets provide stable average

5. **Computationally feasible:** Not too many iterations

**Best practice:** Stratified ten-fold cross-validation (stratification reduces variance further)

### Q27. What is Stratified Cross-Validation? What are its benefits?

**Answer:**

**Stratified Cross-Validation:**
Cross-validation where each fold maintains the same class distribution as the original dataset.

**How it works:**
1. Before splitting, stratify the data by class
2. Ensure each fold has proportional representation of all classes
3. Then perform k-fold cross-validation

**Benefits:**
1. Reduces variance in error estimates
2. More reliable for unbalanced datasets
3. Each test fold is representative of the whole dataset
4. Avoids having folds with missing classes

**Best Practice:**
Repeated stratified cross-validation (e.g., 10-fold repeated 10 times) for most reliable estimates.
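
One simple way to build stratified folds is to deal each class's shuffled samples round-robin across the folds. This is a sketch of that idea only; the function name `stratified_folds` and the example data are assumptions, and fold sizes can differ slightly when class counts are not multiples of k.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k lists of indices, each with roughly the original class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for position, i in enumerate(idxs):
            folds[position % k].append(i)   # deal this class evenly across folds
    return folds


# Hypothetical unbalanced data: 90 legitimate, 10 fraud.
labels = ["legit"] * 90 + ["fraud"] * 10
folds = stratified_folds(labels, k=10)

for fold in folds:
    n_fraud = sum(labels[i] == "fraud" for i in fold)
    print(len(fold), n_fraud)  # 10 1 for every fold
```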

## Section 11: Leave-One-Out Cross-Validation

### Q28. Explain Leave-One-Out Cross-Validation (LOOCV). When is it used?

**Answer:**

**Leave-One-Out Cross-Validation (LOOCV):**

A special case of k-fold cross-validation where k = n (number of samples).

**How it works:**
- For n training instances, build classifier n times
- Each time:
  - Leave one instance out for testing
  - Train on remaining (n-1) instances
  - Test on the single left-out instance
- Average all n results

**When to use:**
- Acute shortage of labeled data (e.g., medical domain)
- Need to use maximum data for training

**Advantages:**
- Makes best use of limited data
- No random subsampling involved
- Deterministic results

**Disadvantages:**
- Very computationally expensive (train n models)
- High variance in estimates
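
The procedure can be sketched with a toy model. The 1-nearest-neighbour classifier, the callback names, and the six-point dataset are assumptions chosen so the loop is easy to follow.

```python
def loocv_error(data, train_fn, predict_fn):
    """Train n models, each tested on the single instance it did not see."""
    errors = 0
    for i in range(len(data)):
        x, y = data[i]                      # the one left-out instance
        train = data[:i] + data[i + 1:]     # the remaining n - 1 instances
        model = train_fn(train)
        if predict_fn(model, x) != y:
            errors += 1
    return errors / len(data)


def train_1nn(train):
    """1-NN has no fitting step: the model is the stored training data."""
    return list(train)


def predict_1nn(model, x):
    nearest = min(model, key=lambda s: abs(s[0] - x))
    return nearest[1]


# Hypothetical 1-D dataset with two well-separated classes.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
print(loocv_error(data, train_1nn, predict_1nn))  # 0.0
```

No random number generator appears anywhere in the loop, which is why LOOCV results are deterministic.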

### Q29. Compare K-Fold Cross-Validation and Leave-One-Out Cross-Validation.

**Answer:**

| Aspect | K-Fold CV | LOOCV |
|--------|-----------|-------|
| Number of folds | k (typically 10) | n (number of samples) |
| Training set size | (k-1)/k of data | (n-1)/n of data |
| Test set size | 1/k of data | 1 sample |
| Computational cost | Train k models | Train n models |
| Variance | Lower | Higher |
| Bias | Slightly higher | Lower |
| Best for | Medium-large datasets | Very small datasets |
| Random subsampling | Yes | No |

## Section 12: Bootstrap Method

### Q30. Explain the Bootstrap method for evaluation.

**Answer:**

**Bootstrap Method:**

Uses sampling **with replacement** to form training sets.

**How it works:**
1. From dataset of n instances, sample n times **with replacement**
2. This forms a new training set of n instances (some repeated)
3. Use instances NOT in the new training set for testing
4. Repeat multiple times and average results

**Key difference from Cross-Validation:**
- Cross-validation: Sampling **without** replacement
- Bootstrap: Sampling **with** replacement

**Note:** Same instance can appear multiple times in training set.
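
One bootstrap round (steps 1 to 3) can be sketched over sample indices. The function name `bootstrap_split` and the dataset size are assumptions; averaging over repeated rounds is left out.

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement for training; unused indices form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]   # duplicates are expected
    drawn = set(train_idx)
    test_idx = [i for i in range(n) if i not in drawn]
    return train_idx, test_idx


train_idx, test_idx = bootstrap_split(1000)
distinct_fraction = len(set(train_idx)) / 1000

print(len(train_idx))      # 1000 draws, duplicates included
print(distinct_fraction)   # typically close to 0.632
print(len(test_idx) / 1000)
```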

### Q31. What is the 0.632 Bootstrap? Derive the probability.

**Answer:**

**0.632 Bootstrap:**

Named because approximately 63.2% of instances appear in the training set.

**Derivation:**

For a single instance:
- Probability of NOT being picked in one draw = $1 - \frac{1}{n}$
- Probability of NOT being picked in n draws = $\left(1 - \frac{1}{n}\right)^n$

As n becomes large:
$$\left(1 - \frac{1}{n}\right)^n \approx e^{-1} \approx 0.368$$

**Therefore:**
- Probability of being in test set = 0.368 (36.8%)
- Probability of being in training set = 1 - 0.368 = 0.632 (63.2%)

**Meaning:** A bootstrap training set of n draws will contain approximately 63.2% of the distinct original instances; its remaining entries are duplicates.
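
The limit used in the derivation can be checked numerically. This short sketch (the function name `prob_never_drawn` is an assumption) evaluates the exact probability for a few values of n and compares it with $e^{-1}$.

```python
import math


def prob_never_drawn(n):
    """Exact probability that one given instance is missed in n draws with replacement."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000, 100000):
    print(n, prob_never_drawn(n))

print(round(math.exp(-1), 4))  # 0.3679
```

The values increase toward $e^{-1}$ as n grows, so for small datasets the out-of-sample fraction is slightly below 36.8%.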

### Q32. What is the difference between sampling with replacement and without replacement?

**Answer:**

**Sampling Without Replacement (Cross-Validation):**
- Once an instance is selected, it cannot be selected again
- Each instance appears at most once in training set
- Like drawing cards without putting them back

**Sampling With Replacement (Bootstrap):**
- After selecting an instance, it's "put back" and can be selected again
- Same instance can appear multiple times in training set
- Like drawing cards and putting them back

**Example:**
Dataset: {A, B, C, D, E}

Without replacement: {A, C, E} - each appears once
With replacement: {A, A, C, E, E} - some appear multiple times

## Section 13: Numerical Problems

### Q33. Given the following confusion matrix, calculate Accuracy, Sensitivity, Specificity, FPR, and FNR.

```
                Predicted
              Pos    Neg
Actual  Pos   80     20
        Neg   10     90
```

**Answer:**

From the confusion matrix:
- TP = 80, FN = 20
- FP = 10, TN = 90
- Total = 200

**Accuracy:**
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{80 + 90}{200} = \frac{170}{200} = 0.85 = 85\%$$

**Sensitivity (TPR):**
$$\text{TPR} = \frac{TP}{TP + FN} = \frac{80}{80 + 20} = \frac{80}{100} = 0.80 = 80\%$$

**Specificity (TNR):**
$$\text{TNR} = \frac{TN}{TN + FP} = \frac{90}{90 + 10} = \frac{90}{100} = 0.90 = 90\%$$

**False Positive Rate:**
$$\text{FPR} = \frac{FP}{TN + FP} = \frac{10}{100} = 0.10 = 10\%$$

**False Negative Rate:**
$$\text{FNR} = \frac{FN}{TP + FN} = \frac{20}{100} = 0.20 = 20\%$$

### Q34. In 5-fold cross-validation with 100 samples, how many samples are in each training and test set?

**Answer:**

**Given:**
- Total samples = 100
- Number of folds (k) = 5

**Calculation:**

Each fold contains:
$$\frac{100}{5} = 20 \text{ samples}$$

In each iteration:
- **Test set size** = 1 fold = 20 samples
- **Training set size** = 4 folds = 80 samples

**Summary:**
- 5 iterations total
- Each iteration: 80 training, 20 testing
- Each sample is tested exactly once

### Q35. If a dataset has 50 samples, how many models are trained in Leave-One-Out Cross-Validation?

**Answer:**

**Given:** n = 50 samples

In LOOCV:
- k = n = 50 folds
- Each fold has exactly 1 sample for testing

**Answer:**
- **50 models** are trained
- Each model is trained on 49 samples and tested on 1 sample
- Total test predictions = 50
- Average of 50 error values gives final error estimate

### Q36. Calculate the probability that a specific instance is NOT selected in any of the n draws of a bootstrap sample.

**Answer:**

**Probability Calculation:**

For a dataset with n instances:

Probability of NOT selecting a specific instance in one draw:
$$P(\text{not selected in 1 draw}) = 1 - \frac{1}{n}$$

Probability of NOT selecting in n draws (n samples with replacement):
$$P(\text{not selected in n draws}) = \left(1 - \frac{1}{n}\right)^n$$

As $n \to \infty$:
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368$$

**Answer:** Approximately **36.8%** probability of not being selected.

## Section 14: Conceptual Questions

### Q37. Compare all evaluation methods: Holdout, Repeated Holdout, K-Fold CV, LOOCV, and Bootstrap.

**Answer:**

| Method | Training % | Test % | Overlap | Best For |
|--------|------------|--------|---------|----------|
| Holdout | 67% | 33% | N/A | Large data |
| Repeated Holdout | Variable | Variable | Yes | Medium data |
| K-Fold CV | (k-1)/k | 1/k | No | Most cases |
| LOOCV | (n-1)/n | 1/n | No | Very small data |
| Bootstrap | ~63.2% (distinct instances) | ~36.8% | Duplicates in training set | Statistical estimates |

**Recommendations:**
- **Large data:** Simple holdout with train/validation/test split
- **Medium data:** 10-fold stratified cross-validation
- **Small data:** LOOCV or repeated stratified cross-validation

### Q38. What is the relationship between model complexity and training/test error?

**Answer:**

**As model complexity increases:**

| Complexity | Training Error | Test Error | Phenomenon |
|------------|---------------|------------|------------|
| Too Low | High | High | Underfitting (High Bias) |
| Optimal | Medium | Lowest | Good Generalization |
| Too High | Very Low | High | Overfitting (High Variance) |

**Key Observations:**
1. Training error generally decreases as complexity increases
2. Test error has a U-shape
3. The gap between them indicates overfitting
4. Goal: Find complexity where test error is minimum

### Q39. Summarize the key points to remember for classifier evaluation.

**Answer:**

**Key Summary Points:**

1. **Use appropriate data splits:**
   - Large data: Train, Validation, Test sets
   - Small data: Cross-validation

2. **Balance unbalanced data:**
   - Don't let majority class dominate
   - Stratify samples proportionally

3. **Never use test data for parameter tuning:**
   - Use separate validation data

4. **Most Important: Avoid Overfitting:**
   - Monitor training vs test error gap
   - Use regularization
   - Don't make model too complex

5. **After evaluation, use all data for final model**

---
## Summary of Important Formulas

| Metric | Formula |
|--------|--------|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| Error Rate | $\frac{FP + FN}{TP + TN + FP + FN}$ |
| Sensitivity (TPR/Recall) | $\frac{TP}{TP + FN}$ |
| Specificity (TNR) | $\frac{TN}{TN + FP}$ |
| False Positive Rate | $\frac{FP}{TN + FP}$ |
| False Negative Rate | $\frac{FN}{TP + FN}$ |
| Bootstrap Probability | $\left(1 - \frac{1}{n}\right)^n \approx 0.368$ |

---