# __EVALUATION METRICS__:
### **1. What Is a Confusion Matrix?**
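A confusion matrix is a table of size $n \times n$, where $n$ is the number of classes in the classification task. Each cell counts how often an actual class (row) was predicted as a given class (column). For a binary classification problem, the confusion matrix looks like this: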

| **Actual \ Predicted** | **Positive** | **Negative** |
|-|--|--|
| **Positive**            | **True Positive (TP)** | **False Negative (FN)** |
| **Negative**            | **False Positive (FP)** | **True Negative (TN)** |


### **2. Terminology in Binary Classification**

#### **True Positive (TP):**
- Correctly predicted **positive** instances.  
  Example: A spam email correctly identified as spam.

#### **True Negative (TN):**
- Correctly predicted **negative** instances.  
  Example: A non-spam email correctly identified as non-spam.

#### **False Positive (FP):**
- Incorrectly predicted **positive** instances (also called a **Type I Error**).  
  Example: A non-spam email incorrectly identified as spam.

#### **False Negative (FN):**
- Incorrectly predicted **negative** instances (also called a **Type II Error**).  
  Example: A spam email incorrectly identified as non-spam.



### **3. Metrics Derived From the Confusion Matrix**
The confusion matrix allows the calculation of several performance metrics:

#### **Accuracy:**
- Proportion of correctly predicted instances:
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

#### **Precision (Positive Predictive Value):**
- Proportion of correctly predicted positives out of all predicted positives:
$$
\text{Precision} = \frac{TP}{TP + FP}
$$

#### **Recall (Sensitivity or True Positive Rate):**
- Proportion of correctly predicted positives out of all actual positives:
$$
\text{Recall} = \frac{TP}{TP + FN}
$$

#### **F1-Score:**
- Harmonic mean of Precision and Recall, balancing both metrics:
$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

#### **Specificity (True Negative Rate):**
- Proportion of correctly predicted negatives out of all actual negatives:
$$
\text{Specificity} = \frac{TN}{TN + FP}
$$
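
As a rough illustration, the formulas above can be wrapped in a small helper. The function name `binary_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for this sketch, not a standard API.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the metrics above from the four confusion-matrix counts."""

    def safe_div(num, den):
        # Assumption: report 0.0 instead of raising when a denominator is zero.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


# Hypothetical counts chosen only for illustration.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning 0.0 for an undefined ratio is only one possible convention; raising an error or returning `float("nan")` may be preferable when an undefined metric should not be mistaken for a bad one.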


### **4. Why Use a Confusion Matrix?**

1. **Detailed Analysis:** Unlike accuracy, which provides a single value, the confusion matrix breaks down results for each class.
2. **Imbalanced Data:** For datasets with uneven class distribution, metrics like Precision and Recall from the confusion matrix provide better insight.
3. **Multi-Class Classification:** For $n$ classes, the confusion matrix shows performance for each class.


### **5. Example of a Confusion Matrix**
#### Scenario:
- Task: Binary classification to detect spam emails.
- Dataset: 100 emails (80 spam, 20 non-spam).
- Results:
  - 70 spam emails correctly identified (TP).
  - 10 spam emails misclassified as non-spam (FN).
  - 15 non-spam emails incorrectly identified as spam (FP).
  - 5 non-spam emails correctly identified (TN).

#### Confusion Matrix:
| **Actual \ Predicted** | **Spam** | **Non-Spam** |
|-|-|--|
| **Spam**               | 70       | 10           |
| **Non-Spam**           | 15       | 5            |

#### Metrics:
- Accuracy:
  $$
  \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{70 + 5}{70 + 10 + 15 + 5} = 75\%
  $$
- Precision:
  $$
  \text{Precision} = \frac{TP}{TP + FP} = \frac{70}{70 + 15} = 82.35\%
  $$
- Recall:
  $$
  \text{Recall} = \frac{TP}{TP + FN} = \frac{70}{70 + 10} = 87.5\%
  $$
- F1-Score:
  $$
  F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.8235 \cdot 0.875}{0.8235 + 0.875} \approx 85\%
  $$
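
The matrix and metrics of this scenario can be reproduced with scikit-learn. The sketch below rebuilds label arrays from the four counts; encoding spam as 1 and non-spam as 0 is an assumption made here for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Encode spam as 1 and non-spam as 0 (an arbitrary choice for this sketch).
# 80 actual spam emails: 70 predicted spam (TP), 10 predicted non-spam (FN).
# 20 actual non-spam emails: 15 predicted spam (FP), 5 predicted non-spam (TN).
y_true = np.array([1] * 80 + [0] * 20)
y_pred = np.array([1] * 70 + [0] * 10 + [1] * 15 + [0] * 5)

# labels=[1, 0] puts the positive class first, matching the table layout above.
# scikit-learn's default order is the sorted labels, i.e. [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

Passing `labels` explicitly is worth the extra argument: without it, the rows and columns come out in sorted label order, which is easy to misread as the positive-first layout used in these notes.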

The **F1-score** uses the **harmonic mean** of Precision and Recall instead of the arithmetic mean because the harmonic mean emphasizes balance between the two metrics, ensuring that low values in either Precision or Recall have a significant impact on the score.


### **Why Harmonic Mean for F1-Score?**

1. **Balancing Precision and Recall:**
   - The F1-score ensures a **trade-off** between Precision and Recall. 
   - If either Precision or Recall is low, the F1-score will be heavily penalized.
   - For example, if Precision = 1 and Recall = 0.1, the arithmetic mean is:
     $$
     \text{Arithmetic Mean} = \frac{1 + 0.1}{2} = 0.55
     $$
     The harmonic mean, however, is:
     $$
     \text{Harmonic Mean (F1)} = 2 \cdot \frac{1 \cdot 0.1}{1 + 0.1} \approx 0.18
     $$
     This better reflects the poor Recall.

2. **Sensitivity to Extremes:**
   - The harmonic mean is more sensitive to **low values** than the arithmetic mean. This makes it ideal for scenarios where **both Precision and Recall are critical** (e.g., in medical diagnosis or fraud detection).

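3. **Reciprocal Interpretation:**
   - The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, $F1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}$, so Precision and Recall contribute equally and the smaller of the two dominates the result.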
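
The comparison between the two means can be checked with Python's standard library. This is a minimal sketch using the Precision = 1, Recall = 0.1 example from the list above.

```python
from statistics import harmonic_mean, mean

precision, recall = 1.0, 0.1

arithmetic = mean([precision, recall])
harmonic = harmonic_mean([precision, recall])
# The F1 formula is the harmonic mean written out for two values.
f1 = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic:.2f}")
print(f"Harmonic mean:   {harmonic:.2f}")
print(f"F1 formula:      {f1:.2f}")
```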

### **Formula Recap**
The F1-score is given by:
$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
Where:
- **Precision** measures how many predicted positives are actually correct:
  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$
- **Recall** measures how many actual positives are correctly predicted:
  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

### **Why Is F1-Score Important?**
1. **Imbalanced Datasets:**
   - In datasets with imbalanced classes, accuracy can be misleading.
   - The F1-score ensures that both Precision and Recall are considered equally.
   - Example: A classifier predicting all negatives in a rare event dataset may achieve high accuracy but fail in Recall, which the F1-score penalizes.

2. **Applications:**
   - **Medical Diagnosis:** Where both false positives and false negatives carry risks.
   - **Spam Detection:** Keeping legitimate email out of the spam folder requires high Precision, while catching most spam requires high Recall.
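
The rare-event case mentioned above can be sketched as follows. The 5-in-100 class ratio and the always-negative classifier are made-up assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical rare-event data: 5 positives among 100 samples.
y_true = [1] * 5 + [0] * 95
# A degenerate classifier that always predicts the negative class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 reports 0 without a warning when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
print(f"F1-score: {f1:.2f}")
```

Accuracy looks strong here only because the negative class dominates; Recall and F1 both drop to zero because no positive instance is ever found.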

The **classification report** in machine learning provides a detailed summary of the performance of a classification model. It includes metrics like Precision, Recall, and F1-score for each class, as well as overall averages like **macro average** and **weighted average**.

### **1. What is a Classification Report?**

A **classification report** breaks down the model's performance across individual classes, showing how well it predicts each one. It typically includes:
- **Precision**
- **Recall**
- **F1-Score**
- **Support** (number of true instances for each class)

#### Example Format (For a 3-Class Problem):
| Class       | Precision | Recall | F1-Score | Support |
|--|--|--|--|--|
| Class 0     | 0.90      | 0.85   | 0.87     | 50      |
| Class 1     | 0.80      | 0.75   | 0.77     | 30      |
| Class 2     | 0.70      | 0.65   | 0.67     | 20      |
| **Macro Avg** | 0.80      | 0.75   | 0.77     | 100     |
| **Weighted Avg** | 0.83      | 0.78   | 0.80     | 100     |

### **2. Key Metrics in a Classification Report**

#### **Precision**  
Proportion of correctly predicted positives out of all predicted positives:
$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

#### **Recall (Sensitivity)**  
Proportion of correctly predicted positives out of all actual positives:
$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$

#### **F1-Score**  
Harmonic mean of Precision and Recall:
$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

#### **Support**  
The number of true instances for each class in the dataset.  
- Useful to understand the distribution of data across classes.

### **3. Macro Average vs. Weighted Average**

#### **Macro Average**
- **Definition:** Calculates metrics for each class independently and takes the unweighted mean.  
$$
\text{Macro Average} = \frac{\text{Metric(Class 0)} + \text{Metric(Class 1)} + \ldots}{\text{Number of Classes}}
$$

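- **Purpose:**  
  - Gives equal importance to all classes, regardless of their frequency.
  - Useful when minority classes matter as much as majority classes, since every class gets the same weight.

- **Drawback:**  
  - If some classes have very few instances, their per-class metrics are noisy, and the macro average gives that noise the same weight as the metrics of large classes.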

#### **Weighted Average**
- **Definition:** Calculates metrics for each class, weighted by the class's support (number of instances).  
$$
\text{Weighted Avg Metric} = \frac{\sum (\text{Metric(Class)} \times \text{Support(Class)})}{\text{Total Instances}}
$$

- **Purpose:**  
  - Accounts for class imbalance.
  - Dominant classes contribute more to the average.

- **Example:**
  If Class 0 has 80 instances and Class 1 has 20, Class 0 contributes 80% to the weighted average, while Class 1 contributes 20%.
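
A minimal sketch of both averages, using the per-class Precision and Support values from the 3-class example table earlier in this section:

```python
import numpy as np

# Per-class precision and support taken from the 3-class example table.
precision = np.array([0.90, 0.80, 0.70])
support = np.array([50, 30, 20])

# Macro: every class counts equally.
macro_avg = precision.mean()
# Weighted: each class counts in proportion to its support.
weighted_avg = np.average(precision, weights=support)

print(f"Macro avg precision:    {macro_avg:.2f}")
print(f"Weighted avg precision: {weighted_avg:.2f}")
```

The weighted average lands above the macro average here because the largest class (Class 0) also has the highest Precision.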

### **4. Micro Average**
- **Definition:** Combines all True Positives, False Positives, and False Negatives across classes and calculates metrics globally.  
$$
\text{Micro Precision} = \frac{\text{Total TP}}{\text{Total TP + Total FP}}
$$

- **Purpose:**  
  - Useful for summarizing overall performance in multi-class problems; frequent classes dominate the result because every instance counts once.  
  - Treats all instances equally, regardless of class.
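
In single-label multi-class problems, every misclassification counts as one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged Precision equals accuracy. The sketch below checks this with scikit-learn; the toy labels are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy labels for a 3-class problem (illustrative only).
y_true = [0, 1, 2, 0, 1, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 0, 2, 1, 0, 1, 2]

micro = precision_score(y_true, y_pred, average="micro")
macro = precision_score(y_true, y_pred, average="macro")
weighted = precision_score(y_true, y_pred, average="weighted")
accuracy = accuracy_score(y_true, y_pred)

print(f"Micro precision:    {micro:.4f}")
print(f"Macro precision:    {macro:.4f}")
print(f"Weighted precision: {weighted:.4f}")
print(f"Accuracy:           {accuracy:.4f}")
```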

### **5. When to Use Each Average?**

| **Metric**        | **Use Case**                                                                 |
|--|--|
| **Macro Average**  | When all classes are equally important, regardless of their size.           |
| **Weighted Average** | When class imbalance exists and larger classes are more significant.      |
| **Micro Average**  | For multi-class problems where the overall performance is the focus.        |

### **6. Python Example: Generating a Classification Report**

#### **Code Example:**
```python
from sklearn.metrics import classification_report

# Example true and predicted labels
y_true = [0, 1, 2, 0, 1, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 0, 2, 1, 0, 1, 2]

# Generate the classification report
report = classification_report(y_true, y_pred, target_names=["Class 0", "Class 1", "Class 2"])
print(report)
```

#### **Output:**
```
              precision    recall  f1-score   support

     Class 0       0.75      1.00      0.86         3
     Class 1       0.67      0.50      0.57         4
     Class 2       0.67      0.67      0.67         3

    accuracy                           0.70        10
   macro avg       0.69      0.72      0.70        10
weighted avg       0.69      0.70      0.69        10
```
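
When the numbers are needed programmatically rather than as printed text, `classification_report` also accepts `output_dict=True`. A short sketch using the same toy labels as the example above:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 0, 1, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 0, 0, 2, 1, 0, 1, 2]

# output_dict=True returns a nested dict instead of a formatted string.
report = classification_report(
    y_true,
    y_pred,
    target_names=["Class 0", "Class 1", "Class 2"],
    output_dict=True,
)

print(report["Class 1"]["recall"])
print(report["macro avg"]["f1-score"])
print(report["accuracy"])
```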

In **regression tasks**, the objective is to predict continuous values, and the performance of the model is evaluated using various metrics. Each metric assesses different aspects of the model's predictive accuracy and error.

### **1. Common Regression Metrics**

#### **1.1. Mean Absolute Error (MAE):**
- Measures the average absolute difference between predicted values ($\hat{y}$) and true values ($y$).
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^n \lvert \hat{y}_i - y_i \rvert
$$

- **Characteristics:**
  - Easy to interpret: It shows the average error in the same units as the target variable.
  - Less sensitive to outliers than MSE and RMSE, because errors are not squared.

- **Use Case:** When you want a simple and robust measure of average prediction error.

#### **1.2. Mean Squared Error (MSE):**
- Measures the average squared difference between predicted and actual values.
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2
$$

- **Characteristics:**
  - Penalizes large errors more than small ones because of squaring.
  - Sensitive to outliers.

- **Use Case:** When large errors are undesirable and should be heavily penalized (e.g., financial forecasting).

#### **1.3. Root Mean Squared Error (RMSE):**
- The square root of MSE, providing error in the same units as the target variable.
$$
\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\hat{y}_i - y_i)^2}
$$

- **Characteristics:**
  - Easier to interpret than MSE because it’s in the same scale as the target variable.
  - Sensitive to outliers.

- **Use Case:** Similar to MSE but more interpretable in terms of scale.
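
The difference in outlier sensitivity between MAE and RMSE can be sketched with NumPy. The helper names `mae` and `rmse` and the data are assumptions made for illustration.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_pred - y_true))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_clean = y_true + 1.0               # every prediction is off by 1
y_outlier = y_clean.copy()
y_outlier[-1] = y_true[-1] + 21.0    # one prediction is off by 21

print(f"Clean:   MAE={mae(y_true, y_clean):.2f}  RMSE={rmse(y_true, y_clean):.2f}")
print(f"Outlier: MAE={mae(y_true, y_outlier):.2f}  RMSE={rmse(y_true, y_outlier):.2f}")
```

With one large error, MAE rises from 1 to 5 while RMSE rises from 1 to about 9.43, which shows how squaring amplifies the single bad prediction.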

#### **1.4. Mean Absolute Percentage Error (MAPE):**
- Expresses error as a percentage of the true values.
$$
\text{MAPE} = \frac{1}{n} \sum_{i=1}^n \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100
$$

- **Characteristics:**
  - Useful when the magnitude of the target variable varies widely.
  - Can be misleading if $y_i \approx 0$ because percentages can explode.

- **Use Case:** When relative errors matter (e.g., sales forecasting).
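
The near-zero problem can be seen in a small sketch. The helper name `mape` and the two data points are assumptions for illustration; both predictions are off by the same absolute amount.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# Both predictions miss by 1.0, but the first true value is close to zero.
y_true = np.array([0.01, 100.0])
y_pred = np.array([1.01, 101.0])

per_point = np.abs((y_true - y_pred) / y_true) * 100
print("Per-point percentage errors:", per_point)
print(f"MAPE: {mape(y_true, y_pred):.1f}%")
```

The first point contributes an error of roughly 10,000% and the second only 1%, so the average is dominated entirely by the near-zero target.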

#### **1.5. R-Squared ($R^2$ or Coefficient of Determination):**
- Measures how well the model explains the variance in the target variable.
$$
R^2 = 1 - \frac{\sum_{i=1}^n (\hat{y}_i - y_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}
$$

- **Interpretation:**
  - $R^2 = 1$: Perfect fit.
  - $R^2 = 0$: Model performs as poorly as predicting the mean $\bar{y}$.
  - $R^2 < 0$: Model is worse than predicting $\bar{y}$.

- **Use Case:** To assess the proportion of variance explained by the model.

#### **1.6. Adjusted R-Squared:**
- Adjusts $R^2$ for the number of predictors in the model:
$$
R^2_{\text{adjusted}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
$$
Where:
- $n$: Number of observations.
- $p$: Number of predictors.

- **Use Case:** When comparing models with different numbers of predictors.
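
A minimal NumPy sketch of the $R^2$ and adjusted $R^2$ formulas above. The helper names, the data, and the value `p = 1` (a single predictor) are assumptions made for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2_value, n, p):
    """Adjust R^2 for n observations and p predictors."""
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)


y_true = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
y_pred = np.array([110.0, 190.0, 310.0, 420.0, 490.0])
p = 1  # hypothetical number of predictors

r2_value = r2(y_true, y_pred)
adj_value = adjusted_r2(r2_value, n=len(y_true), p=p)

print(f"R-Squared:          {r2_value:.4f}")
print(f"Adjusted R-Squared: {adj_value:.4f}")
```

Note that the adjusted formula requires $n > p + 1$; otherwise the denominator is zero or negative.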

### **2. Choosing the Right Metric**

| **Metric** | **When to Use** |
|--|--|
| **MAE**    | When simplicity and interpretability are important, and outliers are less critical. |
| **MSE**    | When penalizing large errors is essential. |
| **RMSE**   | When you need interpretable error in the same units as the target variable. |
| **MAPE**   | When relative error is more meaningful than absolute error. |
| **R-Squared** | To measure how much variance the model explains. |
| **Adjusted R-Squared** | To compare models with different numbers of predictors. |

### **3. Python Implementation**

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Example data
y_true = [100, 200, 300, 400, 500]
y_pred = [110, 190, 310, 420, 490]

# MAE
mae = mean_absolute_error(y_true, y_pred)

# MSE
mse = mean_squared_error(y_true, y_pred)

# RMSE
rmse = np.sqrt(mse)

# R-Squared
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
print(f"R-Squared: {r2}")
```

The **$R^2$ (Coefficient of Determination)** metric is widely used to evaluate how well a regression model explains the variability of the target variable. However, it has some significant **drawbacks** that limit its effectiveness in certain scenarios.

### **1. $R^2$ Does Not Indicate Model Accuracy**
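- **Why:** A high $R^2$ does not guarantee that the model's predictions are accurate in absolute terms.
- **Example:**
  - $R^2$ is measured relative to the variance of the target. If the target varies widely, $R^2$ can be high even though individual errors are large in the target's units. When $R^2$ is computed as the squared correlation between predictions and actual values (instead of the residual-based formula given earlier), it also ignores a systematic offset such as consistent overestimation.

- **Key Insight:** $R^2$ measures how much of the target's variability the model captures, not how close predictions are to actual values in the target's units. Report it alongside MAE or RMSE.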
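
How a systematic offset shows up depends on which definition of $R^2$ is in use. The sketch below compares scikit-learn's `r2_score` (the residual-based formula) with the squared Pearson correlation on made-up data where every prediction is too high by the same amount.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 2.0  # hypothetical model that always overestimates by 2

# Residual-based definition: 1 - SS_res / SS_tot.
r2_residual = r2_score(y_true, y_pred)
# Squared-correlation definition: blind to constant offsets and rescaling.
r2_corr = np.corrcoef(y_true, y_pred)[0, 1] ** 2

print(f"r2_score:            {r2_residual:.2f}")
print(f"Squared correlation: {r2_corr:.2f}")
```

The squared correlation is 1 because the predictions follow the trend perfectly, while the residual-based value is negative because the offset makes the squared errors larger than those of a mean-only prediction.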



### **2. Sensitive to Outliers**
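- **Why:** Outliers can inflate or deflate $R^2$ because both its numerator and denominator are sums of squared differences.
- **Example:**
  - A single extreme target value inflates the total variance in the denominator. If the model happens to fit that point, $R^2$ increases even if the model performs poorly on the rest of the data; if the model misses it, $R^2$ can drop sharply.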
  
- **Key Insight:** $R^2$ does not robustly handle datasets with outliers.



### **3. Biased Toward Complex Models**
- **Why:** For ordinary least squares evaluated on the training data, adding more predictors, even irrelevant ones, never decreases $R^2$: it increases or stays constant.

- **Key Insight:** This overestimation can give the illusion of a better model, even if the added predictors do not improve performance.
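
This effect can be sketched with synthetic data: one informative feature plus ten pure-noise features. The data-generating setup, the random seed, and the helper `adjusted_r2` are assumptions for illustration, and both scores are computed on the training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50

# One informative feature and a noisy linear target (synthetic data).
x = rng.normal(size=(n, 1))
y = 3.0 * x[:, 0] + rng.normal(size=n)

# Ten extra predictors that are pure noise, unrelated to y.
X_more = np.hstack([x, rng.normal(size=(n, 10))])


def adjusted_r2(r2, n, p):
    """Adjust R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# .score() returns R^2; here it is evaluated on the training data.
r2_base = LinearRegression().fit(x, y).score(x, y)
r2_more = LinearRegression().fit(X_more, y).score(X_more, y)

print(f"R^2, 1 predictor:            {r2_base:.4f}")
print(f"R^2, 11 predictors:          {r2_more:.4f}")
print(f"Adjusted R^2, 1 predictor:   {adjusted_r2(r2_base, n, 1):.4f}")
print(f"Adjusted R^2, 11 predictors: {adjusted_r2(r2_more, n, 11):.4f}")
```

The training $R^2$ of the larger model cannot be lower than that of the smaller one, which is why adjusted $R^2$ or a held-out evaluation is the fairer comparison.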



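### **4. Harder to Interpret for Non-Linear Models**
- **Why:** Reading $R^2$ as the "proportion of variance explained", bounded between 0 and 1, relies on a linear least-squares fit with an intercept. For other models, the total variance no longer splits cleanly into an explained part and a residual part.
- **Example:**
  - A straight line fitted to strongly curved data gets a low $R^2$ because it cannot capture the curvature. A non-linear model can score a high $R^2$ on the same data, but the value is then a descriptive score rather than a strict fraction of variance explained.

- **Key Insight:** For non-linear regression models, treat $R^2$ as a descriptive score and pair it with error metrics such as RMSE or MAE.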



### **5. Does Not Reflect Practical Significance**
- **Why:** $R^2$ is a statistical measure and does not necessarily align with the real-world importance of a model's predictions.

- **Example:**
  - A model with $R^2 = 0.80$ may still be unacceptable in medical diagnostics if its predictions lead to critical errors.




### **6. Negative $R^2$ Values**
- **Why:** $R^2$ can be negative when the model performs worse than predicting the mean $\bar{y}$.
- **Example:**
  - A poorly fitted model with random predictions might have $R^2 < 0$, indicating that even using the mean as a prediction would be better.
  
- **Key Insight:** Negative $R^2$ values highlight a lack of explanatory power but are often misunderstood.
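
A minimal sketch of a negative $R^2$, using made-up values where the predictions run in the opposite direction to the targets:

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
y_bad = [3.0, 2.0, 1.0]    # predictions in reverse order
y_mean = [2.0, 2.0, 2.0]   # always predict the mean of y_true

r2_bad = r2_score(y_true, y_bad)
r2_mean = r2_score(y_true, y_mean)

print(f"Reversed predictions: {r2_bad}")   # -3.0
print(f"Mean predictions:     {r2_mean}")  # 0.0
```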



### **7. Does Not Handle Class Imbalances (For Classification Tasks)**
- $R^2$ is a regression metric. When it is mistakenly applied to classification-like tasks with numeric labels, it says nothing about per-class performance and does not account for imbalanced label distributions; Precision, Recall, and F1-score are the appropriate tools there.

### **Alternative Metrics to Overcome $R^2$ Drawbacks**
1. **Adjusted $R^2$:**
   - Penalizes models for adding irrelevant predictors.
   - Formula:
     $$
     R^2_{\text{adjusted}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
     $$
     Where:
     - $n$: Number of observations.
     - $p$: Number of predictors.

2. **Mean Absolute Error (MAE):**
   - Measures average absolute prediction error.

3. **Root Mean Squared Error (RMSE):**
   - Provides error in the same units as the target variable.

4. **Mean Absolute Percentage Error (MAPE):**
   - Expresses error as a percentage of actual values, offering interpretability.

**Standard Scaling** is a preprocessing technique used in machine learning to transform features so that they have a **mean of 0** and a **standard deviation of 1**. This ensures that all features are on the same scale, which is crucial for many machine learning algorithms to perform optimally.

### **1. Formula for Standard Scaling**
Each feature in the dataset is scaled using the formula:
$$
x' = \frac{x - \mu}{\sigma}
$$
Where:
- $x$: Original feature value.
- $\mu$: Mean of the feature.
- $\sigma$: Standard deviation of the feature.
- $x'$: Scaled feature value.

### **2. Why Use Standard Scaling?**

#### **2.1. Avoiding Feature Dominance**
- Features with large ranges (e.g., income in thousands vs. age in years) can dominate distance-based models like **K-Nearest Neighbors (KNN)** and **SVM**, and can distort regularized or gradient-trained models like **Logistic Regression**.
- Standard scaling ensures all features contribute equally.

#### **2.2. Improving Gradient Descent**
- For optimization algorithms (e.g., gradient descent), scaling helps the algorithm converge faster by maintaining balanced updates across all weights.

#### **2.3. Required for Models Sensitive to Feature Magnitudes**
Standard scaling is particularly useful for:
- **Distance-based models:** KNN, SVM, clustering algorithms.
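- **Linear models trained with regularization or gradient descent:** Logistic Regression, Ridge and Lasso Regression. (Plain least-squares Linear Regression gives the same predictions with or without scaling.)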
- **Principal Component Analysis (PCA):** PCA projects data onto orthogonal components, and unscaled features can skew the results.

### **3. How Standard Scaling Works**
#### Example Dataset:
| Feature 1 | Feature 2 |
|-----------|-----------|
| 10        | 100       |
| 20        | 200       |
| 30        | 300       |

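1. Compute the **mean ($\mu$)** and **sample standard deviation ($\sigma$, with an $n - 1$ denominator)** for each feature:
   - Feature 1: $\mu = 20, \sigma = 10$
   - Feature 2: $\mu = 200, \sigma = 100$

   With the population formula (an $n$ denominator, which scikit-learn's `StandardScaler` uses), $\sigma \approx 8.16$ and $\sigma \approx 81.65$, and the scaled values become $\pm 1.22$ instead of $\pm 1$.

2. Apply the standard scaling formula to each value:
   - For $x = 10$ in Feature 1:
     $$
     x' = \frac{10 - 20}{10} = -1
     $$
   - For $x = 300$ in Feature 2:
     $$
     x' = \frac{300 - 200}{100} = 1
     $$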

3. Scaled Dataset:
| Feature 1 | Feature 2 |
|-----------|-----------|
| -1        | -1        |
| 0         | 0         |
| 1         | 1         |
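
The worked example can be reproduced with NumPy. The sketch below computes both the sample standard deviation (`ddof=1`, which gives $\sigma = 10$ and $\sigma = 100$) and the population standard deviation (`ddof=0`); variable names are chosen for illustration.

```python
import numpy as np

data = np.array([[10.0, 100.0], [20.0, 200.0], [30.0, 300.0]])
mean = data.mean(axis=0)

# Sample standard deviation (n - 1 denominator): [10, 100].
std_sample = data.std(axis=0, ddof=1)
# Population standard deviation (n denominator): about [8.16, 81.65].
std_population = data.std(axis=0, ddof=0)

scaled_sample = (data - mean) / std_sample
scaled_population = (data - mean) / std_population

print("Scaled with sample std:\n", scaled_sample)
print("Scaled with population std:\n", scaled_population)
```

The two conventions differ only by a constant factor per feature, so the choice rarely matters for model training, but it explains why hand calculations and library output can disagree on small datasets.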

### **4. Standard Scaling vs. Other Scaling Methods**
| **Scaling Method**       | **Description**                                                                                   | **When to Use**                                                                          |
|--------------------------|---------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| **Standard Scaling**      | Scales data to mean 0, standard deviation 1.                                                     | Models sensitive to magnitude (SVM, Logistic Regression, KNN, PCA).                    |
| **Min-Max Scaling**       | Scales data to range $[0, 1]$ or $[-1, 1]$.                                                      | Neural networks or when absolute feature bounds are required.                          |
| **Robust Scaling**        | Uses median and interquartile range (IQR) to scale data, making it robust to outliers.           | Data with significant outliers.                                                        |
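
The three methods in the table can be compared side by side with scikit-learn. This is a small sketch on the same toy dataset used in this section, with default scaler settings assumed throughout.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

data = np.array([[10.0, 100.0], [20.0, 200.0], [30.0, 300.0]])

# Fit each scaler on the same data and keep the results by class name.
results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(data)
    print(name)
    print(results[name])
```

On this tiny, outlier-free dataset the methods differ only in range: `MinMaxScaler` maps each feature to $[0, 1]$, while `RobustScaler` centers on the median and divides by the interquartile range.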

### **5. Python Implementation**

#### **Using Scikit-learn:**
```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Example data
data = np.array([[10, 100], [20, 200], [30, 300]])

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print("Original Data:\n", data)
print("Scaled Data:\n", scaled_data)
```

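#### **Output:**
```
Original Data:
 [[ 10 100]
 [ 20 200]
 [ 30 300]]
Scaled Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
```

The scaled values are $\pm 1.22$ rather than the $\pm 1$ of the hand-worked example because `StandardScaler` divides by the population standard deviation ($\sigma \approx 8.16$ for Feature 1), whereas $\sigma = 10$ in the worked example is the sample standard deviation.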