<a href="https://colab.research.google.com/github/samiha-mahin/Data-Analysis/blob/main/Imbalanced_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is Imbalanced Data in Machine Learning?

**Imbalanced data** refers to a classification problem where the classes (or categories) are **not represented equally**. One class has **significantly more samples** than the other(s). This is common in many real-world problems, especially when the **event of interest is rare**.

---

### Why It’s a Problem

Many learning algorithms minimize overall error, which treats every sample, and therefore every mistake, as equally important. When one class dominates, the model can become **biased** towards the majority class and ignore the minority class, leading to poor performance in detecting the rare class (which is often the more important one).

---

### Example 1: Fraud Detection

| Transaction | Fraud (1) | Not Fraud (0) |
| ----------- | --------- | ------------- |
| Count       | 100       | 9900          |

Here, only 1% of transactions are fraudulent. A naive model could predict **"Not Fraud" for every transaction** and still achieve **99% accuracy**, but it would **completely fail** to detect actual fraud cases.
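
The arithmetic behind this claim can be checked directly. The snippet below is a minimal sketch that rebuilds the label counts from the table and scores a hypothetical "always predict Not Fraud" baseline; the variable names are illustrative only.

```python
# Labels from the table above: 100 fraud (1) and 9,900 non-fraud (0).
y_true = [1] * 100 + [0] * 9900

# Hypothetical baseline that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: detected fraud / all actual fraud.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy     = {accuracy:.2%}")  # 99.00%
print(f"fraud recall = {recall:.2%}")    # 0.00%
```

High accuracy and zero recall on the class of interest can coexist, which is why accuracy alone is not enough here.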

---

### Example 2: Disease Diagnosis

| Patient | Disease Present (1) | Healthy (0) |
| ------- | ------------------- | ----------- |
| Count   | 200                 | 9800        |

The disease is rare. A model that always predicts "Healthy" reaches 98% accuracy, yet it would be **useless in real-world medical decision-making**, because it misses every diseased patient.

---

### Consequences of Imbalanced Data

* **Misleading accuracy**: High accuracy doesn't mean the model is good.
* **Poor recall for the minority class**: The model may miss important cases.
* **Unreliable predictions**: Especially in high-risk domains like healthcare or security.

---

### Metrics to Use Instead of Accuracy

In imbalanced settings, use these metrics:

* **Precision**: How many predicted positives are actually positive?
* **Recall (Sensitivity)**: How many actual positives did we detect?
* **F1 Score**: Harmonic mean of precision and recall.
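* **ROC-AUC / PR-AUC**: Threshold-independent summaries of how well the model ranks positives above negatives. PR-AUC is often more informative when the positive class is rare, because it does not count true negatives.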
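
As a rough illustration of the first three metrics, the sketch below computes them from confusion-matrix counts. The counts are hypothetical (a fraud model that flags 150 of 10,000 transactions, 60 of them correctly) and are not taken from any real experiment.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 60 fraud cases caught, 90 false alarms, 40 fraud cases missed.
precision, recall, f1 = precision_recall_f1(tp=60, fp=90, fn=40)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```

With the remaining 9,810 transactions counted as true negatives, this hypothetical model would score 98.7% accuracy while finding only 60% of the fraud, so the three metrics give a much clearer picture.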

---

### Ways to Handle Imbalanced Data

1. **Resampling Methods**

   * **Oversampling**: Duplicate or synthetically generate minority class samples (e.g., SMOTE).
   * **Undersampling**: Remove some majority class samples.

2. **Class Weights**

   * Assign more importance to the minority class during training.

3. **Anomaly Detection Models**

   * Treat the rare class as an anomaly and use specialized models.

4. **Ensemble Methods**

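   * Tree ensembles such as Random Forest or XGBoost can cope with imbalance better when tuned for it, for example via `class_weight` in scikit-learn's Random Forest or `scale_pos_weight` in XGBoost.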

5. **Change the Evaluation Metric**

   * Focus on precision, recall, or F1 instead of accuracy.
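
For the class-weights option (item 2), one common heuristic is to weight each class inversely to its frequency: `n_samples / (n_classes * class_count)`. This is the formula scikit-learn documents for `class_weight="balanced"`; the helper below is a plain-Python sketch of it, and its name is made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Fraud example from earlier: 9,900 legitimate (0) and 100 fraudulent (1) transactions.
labels = [0] * 9900 + [1] * 100
weights = balanced_class_weights(labels)
print(weights)  # the minority class gets a much larger weight than the majority class
```

With these weights, misclassifying one fraud case costs about as much as misclassifying 99 legitimate ones, which counteracts the 99:1 class ratio.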

---

### Visual Example

If you plot the dataset:

```plaintext
Class 0 (Majority): ooooooooooooooooooooooooooooooo
Class 1 (Minority): x x x
```

A naive classifier might just learn to output “o” every time.






## 🔁 Resampling Techniques for Imbalanced Data

In imbalanced classification, **resampling** helps balance the dataset by modifying the class distribution. The main techniques are undersampling, oversampling, and SMOTE (a synthetic form of oversampling):

---

## 1. ⚖️ **Undersampling**

### 🔍 What It Does:

Reduces the number of samples in the **majority class** to match the minority class.

### ✅ Pros:

* Faster training
* Simple to implement

### ❌ Cons:

* Risk of **losing important data**
* Might **underfit** the model

### 📊 Example:

```plaintext
Original Data:
- Class 0 (Not Fraud): 9500 samples
- Class 1 (Fraud):     500 samples

After Undersampling:
- Class 0: 500 samples (randomly selected)
- Class 1: 500 samples
```

Now both classes are balanced with 500 each.
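
A minimal sketch of random undersampling, using only the standard library. The function name and the integer "samples" are placeholders; in practice each sample would be a feature row.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = min(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        kept = rng.sample(group, target)  # sampling without replacement
        out_samples.extend(kept)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Placeholder data matching the example: 9,500 class-0 and 500 class-1 samples.
samples = list(range(10_000))
labels = [0] * 9500 + [1] * 500
X_res, y_res = random_undersample(samples, labels)
print(y_res.count(0), y_res.count(1))  # 500 500
```

Resampling is normally applied to the training split only, so that the test set keeps the real class distribution.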

---

## 2. 🔁 **Oversampling**

### 🔍 What It Does:

Increases the number of samples in the **minority class** by duplicating existing samples or generating new ones.

### ✅ Pros:

* No information is lost from the majority class
* Easy to implement

### ❌ Cons:

* Can lead to **overfitting** if the same samples are repeated

### 📊 Example:

```plaintext
Original Data:
- Class 0: 9500 samples
- Class 1: 500 samples

After Oversampling:
- Class 0: 9500 samples
- Class 1: 9500 samples (by duplicating minority class samples)
```
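
A matching sketch of random oversampling by duplication, again with a made-up function name and placeholder integer samples.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing samples with replacement, so duplicates are expected.
        extra = rng.choices(group, k=target - len(group))
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


samples = list(range(10_000))
labels = [0] * 9500 + [1] * 500
X_res, y_res = random_oversample(samples, labels)
print(y_res.count(0), y_res.count(1))  # 9500 9500
```

Because the duplicates are exact copies, oversampling before a train/test split would leak copies of the same sample into both sets; splitting first avoids that.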

---

## 3. 🧠 **SMOTE (Synthetic Minority Over-sampling Technique)**

### 🔍 What It Does:

Instead of just duplicating, **SMOTE generates synthetic data** points for the minority class by interpolating between a minority sample and its nearest minority-class neighbors.

### ✅ Pros:

* Often works better than simple duplication
* Can reduce overfitting to repeated samples
* Makes the minority class more diverse

### ❌ Cons:

* Can create **ambiguous synthetic samples** if classes overlap
* More complex and slower

### 📊 Example:

Suppose the minority class has these 2D data points:

```
[2.0, 3.0]
[2.1, 3.2]
[1.9, 2.9]
```

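SMOTE will pick a point and one of its nearest minority-class neighbors (e.g., `[2.0, 3.0]` and `[2.1, 3.2]`), then generate a new point at a random position on the line segment between them. At the midpoint, for instance, this gives: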

```
New point = [2.05, 3.1]
```

This point is **not duplicated**, but **synthetically created** based on the feature space.
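
The interpolation step can be sketched in a few lines. This is a simplified illustration of the idea, not the full SMOTE algorithm or any library's implementation; the function names and the choice of `k=2` are assumptions made for this example.

```python
import math
import random


def interpolate(point, neighbor, gap):
    """Return the point at fraction `gap` of the way from `point` to `neighbor`."""
    return [a + gap * (b - a) for a, b in zip(point, neighbor)]


def smote_like_point(point, minority_points, k=2, rng=None):
    """Create one synthetic sample between `point` and one of its k nearest minority neighbors."""
    rng = rng or random.Random(0)
    neighbors = sorted(
        (p for p in minority_points if p != point),
        key=lambda p: math.dist(point, p),
    )[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random position in [0, 1) along the segment
    return interpolate(point, neighbor, gap)


minority = [[2.0, 3.0], [2.1, 3.2], [1.9, 2.9]]

# Midpoint between the first two points, as in the example above.
print(interpolate(minority[0], minority[1], 0.5))  # approximately [2.05, 3.1]

# A randomly placed synthetic point near the first sample.
print(smote_like_point(minority[0], minority))
```

Every synthetic point lies on a segment between two existing minority samples, which is also why overlapping classes can produce ambiguous points.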

---

## 🔧 When to Use What?

| Method        | Best When                                                              |
| ------------- | ---------------------------------------------------------------------- |
| Undersampling | You have **a lot** of data and can afford to discard some              |
| Oversampling  | You have **limited data** and don’t want to lose any samples           |
| SMOTE         | You want to create **realistic** synthetic data for the minority class |




# **ROC** and **AUC**

## 🔍 What is ROC?

### **ROC** stands for **Receiver Operating Characteristic** curve.

It is a **graph** that shows the performance of a classification model at all **classification thresholds**.

---

### 📈 ROC Curve Plots:

* **Y-axis:** **True Positive Rate (TPR)** = Recall = Sensitivity

  $$
  \text{TPR} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
  $$

* **X-axis:** **False Positive Rate (FPR)**

  $$
  \text{FPR} = \frac{\text{False Positives}}{\text{False Positives + True Negatives}}
  $$

---

## 💡 What is AUC?

### **AUC** = **Area Under the ROC Curve**

* It’s a **single number** summary of the ROC curve.
* **Higher AUC** means **better** model performance.
* AUC ranges from **0 to 1**:

  * **1.0** = perfect classifier
  * **0.5** = no skill (random guess)
  * **< 0.5** = worse than guessing (the ranking is inverted, so flipping the model's predictions would beat random)

---

## 🔧 Example: Binary Classification (Cancer Detection)

Let’s say we built a cancer classifier.

| Patient | Actual | Predicted Probability (Cancer) |
| ------- | ------ | ------------------------------ |
| A       | 1      | 0.95                           |
| B       | 0      | 0.90                           |
| C       | 1      | 0.80                           |
| D       | 0      | 0.60                           |
| E       | 0      | 0.40                           |
| F       | 1      | 0.30                           |
| G       | 0      | 0.10                           |

Now, if we try different thresholds like 0.9, 0.7, 0.5, etc., and predict "cancer" whenever the probability is at or above the threshold, we’ll get different True Positives and False Positives. For each threshold, we calculate:

* TPR (Recall)
* FPR

Plot these points on a graph → that’s your **ROC curve**.
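
The threshold sweep can be sketched as follows, using the seven patients from the table. The rule "predict positive when probability >= threshold" is an assumption of this sketch; the function name is illustrative.

```python
# (actual label, predicted probability) for patients A-G from the table above.
patients = [(1, 0.95), (0, 0.90), (1, 0.80), (0, 0.60), (0, 0.40), (1, 0.30), (0, 0.10)]


def tpr_fpr(records, threshold):
    """Return (TPR, FPR) when scores at or above `threshold` are predicted positive."""
    tp = sum(1 for y, s in records if y == 1 and s >= threshold)
    fn = sum(1 for y, s in records if y == 1 and s < threshold)
    fp = sum(1 for y, s in records if y == 0 and s >= threshold)
    tn = sum(1 for y, s in records if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


for threshold in (0.9, 0.7, 0.5, 0.3):
    tpr, fpr = tpr_fpr(patients, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold can only keep or increase both rates, which is why the ROC curve moves up and to the right.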

---

### 📊 Sample Points on ROC:

| Threshold | TPR (Recall) | FPR |
| --------- | ------------ | --- |
| 0.9       | 1/3          | 1/4 |
| 0.7       | 2/3          | 1/4 |
| 0.5       | 2/3          | 2/4 |
| 0.3       | 3/3          | 3/4 |

Plot (FPR, TPR) = (0.25, 0.33), (0.25, 0.67), (0.50, 0.67), (0.75, 1.00).

Then, the **area under this curve** = **AUC**
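
The area can also be computed without plotting: AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count as half). The sketch below applies that pairwise definition to the seven patients from the table; the function name is illustrative.

```python
from itertools import product

# (actual label, predicted probability) for patients A-G from the table above.
patients = [(1, 0.95), (0, 0.90), (1, 0.80), (0, 0.60), (0, 0.40), (1, 0.30), (0, 0.10)]


def pairwise_auc(records):
    """AUC as the fraction of (positive, negative) pairs ranked in the right order."""
    positives = [s for y, s in records if y == 1]
    negatives = [s for y, s in records if y == 0]
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


auc = pairwise_auc(patients)
print(f"AUC = {auc:.3f}")  # 8 of the 12 pairs are ordered correctly -> 0.667
```

Patient F (a true positive scored at only 0.30) is outranked by three of the four negatives, which accounts for most of the lost area.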

---

## 🔍 Visual Summary

Imagine this:

```plaintext
Perfect ROC Curve (marked with *):
      |
   1  **********************
      *
TPR   *
      *
      *
      *
   0  *---------------------
      0      FPR       1
```

A model that performs better will **curve closer to the top-left**.

---

## ✅ Interpretation:

| AUC Value | Meaning                    |
| --------- | -------------------------- |
| 0.90–1.0  | Excellent model            |
| 0.80–0.90 | Very good                  |
| 0.70–0.80 | Good/fair                  |
| 0.60–0.70 | Poor                       |
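| 0.50      | Random (no discrimination) |

These bands are a common rule of thumb rather than a standard; what counts as acceptable depends on the domain and on the cost of each type of error.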


