<a href="https://colab.research.google.com/github/samiha-mahin/Data-Analysis/blob/main/Outliers_Detect_and_Remove.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🌟 What is an Outlier?

An **outlier** is a data point that is **significantly different** from the rest of the data.

* It **lies far away** from the main group of data.
* It **doesn’t follow the general pattern** of the dataset.

---

### 🎯 Example

Suppose these are test scores of 10 students:

`[85, 88, 90, 92, 87, 89, 91, 86, 95, 30]`

* Most scores are around **85–95**.
* But **30** is much lower than the rest — it’s an **outlier**.

---


# 🔍 How to Detect Outliers

---

# 1. **Using the IQR (Interquartile Range) method**:


### 🌼 Step-by-Step Explanation of the IQR Method


### 🔸 Step 1: **Sort the Data**

Start with sorting your data in **ascending order**.

**Example Data (unsorted):**

`[85, 88, 90, 92, 87, 89, 91, 86, 95, 30]`

**Sorted:**

`[30, 85, 86, 87, 88, 89, 90, 91, 92, 95]`


### 🔸 Step 2: **Find Q1 and Q3**

* **Q1 (First Quartile)** is the **25th percentile**. Here it is taken as the median of the **lower half** of the sorted data.
* **Q3 (Third Quartile)** is the **75th percentile**. Here it is taken as the median of the **upper half** of the sorted data.

There are 10 values:

* First half: `[30, 85, 86, 87, 88]` → Q1 is the **middle** = **86**
* Second half: `[89, 90, 91, 92, 95]` → Q3 is the **middle** = **91**

So:

* **Q1 = 86**
* **Q3 = 91**
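
Quartile conventions differ between tools, so it can help to check which rule a library applies. The sketch below reproduces the median-of-halves calculation with Python's standard library and compares it with `statistics.quantiles`, which interpolates linearly. The helper name `quartiles_by_halves` is made up for this illustration.

```python
from statistics import median, quantiles

scores = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves (hypothetical helper)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so both halves have the same size.
    upper = ordered[half + len(ordered) % 2:]
    return median(lower), median(upper)

q1, q3 = quartiles_by_halves(scores)
print(q1, q3)  # 86 91

# Linear interpolation between neighbouring values gives slightly different quartiles.
q1_lin, _, q3_lin = quantiles(scores, n=4, method="inclusive")
print(q1_lin, q3_lin)  # 86.25 90.75
```

Either convention is acceptable as long as it is applied consistently; the rest of this walkthrough uses the median-of-halves values.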



### 🔸 Step 3: **Calculate the IQR**

**IQR = Q3 − Q1**

So:

**IQR = 91 − 86 = 5**


### 🔸 Step 4: **Calculate Outlier Boundaries**

Use the formulas:

* **Lower Bound** = Q1 − 1.5 × IQR
* **Upper Bound** = Q3 + 1.5 × IQR

Now plug in the numbers:

* **Lower Bound** = 86 − 1.5 × 5 = 86 − 7.5 = **78.5**
* **Upper Bound** = 91 + 1.5 × 5 = 91 + 7.5 = **98.5**


### 🔸 Step 5: **Identify Outliers**

Any number:

* **Less than 78.5** or
* **Greater than 98.5**

...is an **outlier**.

Our data: `[30, 85, 86, 87, 88, 89, 90, 91, 92, 95]`

* Only **30** is below 78.5 → it's an **outlier** ✅


### ✅ Summary

| Term        | Value |
| ----------- | ----- |
| Q1          | 86    |
| Q3          | 91    |
| IQR         | 5     |
| Lower Bound | 78.5  |
| Upper Bound | 98.5  |
| Outlier(s)  | 30    |
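
The five steps can be sketched as one small function. This is a minimal illustration that assumes the median-of-halves quartiles used above; the function name `iqr_outliers` and the parameter `k` are hypothetical, not part of any library.

```python
from statistics import median

scores = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using median-of-halves quartiles."""
    ordered = sorted(values)                        # Step 1: sort
    half = len(ordered) // 2
    q1 = median(ordered[:half])                     # Step 2: quartiles
    q3 = median(ordered[half + len(ordered) % 2:])
    iqr = q3 - q1                                   # Step 3: IQR
    lower, upper = q1 - k * iqr, q3 + k * iqr       # Step 4: boundaries
    outliers = [v for v in ordered if v < lower or v > upper]  # Step 5: identify
    return lower, upper, outliers

lower, upper, outliers = iqr_outliers(scores)
print(lower, upper, outliers)  # 78.5 98.5 [30]
```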


---

# 2. **Z-score method**:



### 🌟 What is a Z-score?

A **Z-score** tells us how many **standard deviations** a value is away from the **mean** of the data.

* If a Z-score is **very high or very low** (usually beyond ±3), it’s considered an **outlier**.


### 🧠 Z-score Formula:

$$
Z = \frac{(X - \mu)}{\sigma}
$$

Where:

* $X$ = the value from the dataset
* $\mu$ = the **mean** of the dataset
* $\sigma$ = the **standard deviation**






### 📌 Given Data:

$\text{Data} = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]$



### 🔸 Step 1: Calculate the Mean (μ)

$$
\mu = \frac{85 + 88 + 90 + 92 + 87 + 89 + 91 + 86 + 95 + 30}{10} = \frac{833}{10} = 83.3
$$



### 🔸 Step 2: Compute Squared Differences from the Mean

| X  | X − 83.3 | (X − 83.3)² |
| -- | -------- | ----------- |
| 85 | 1.7      | 2.89        |
| 88 | 4.7      | 22.09       |
| 90 | 6.7      | 44.89       |
| 92 | 8.7      | 75.69       |
| 87 | 3.7      | 13.69       |
| 89 | 5.7      | 32.49       |
| 91 | 7.7      | 59.29       |
| 86 | 2.7      | 7.29        |
| 95 | 11.7     | 136.89      |
| 30 | -53.3    | 2840.89     |



### 🔸 Step 3: Sum of Squared Differences

$$
\text{Total} = 2.89 + 22.09 + 44.89 + 75.69 + 13.69 + 32.49 + 59.29 + 7.29 + 136.89 + 2840.89 = \mathbf{3236.10}
$$



### 🔸 Step 4: Divide by n = 10

Dividing by n (rather than n − 1) gives the **population** variance:

$$
\text{Variance} = \frac{3236.10}{10} = 323.61
$$



### 🔸 Step 5: Square Root of Variance

$$
\sigma = \sqrt{323.61} \approx \mathbf{17.99}
$$


### 🔸 Step 6: Calculate Z-scores

$$
Z = \frac{X - \mu}{\sigma}
$$

Examples:

* For 30:

  $$
  Z = \frac{30 - 83.3}{17.99} = \frac{-53.3}{17.99} \approx -2.96
  $$
* For 85:

  $$
  Z = \frac{85 - 83.3}{17.99} = \frac{1.7}{17.99} \approx 0.09
  $$



### 🔸 Step 7: Interpret Z-scores

* If **Z > 3** or **Z < -3**, the value is treated as an outlier.
* In our case, **30 has Z ≈ -2.96**, so it falls just inside the ±3 cutoff and is **not flagged**, even though it is clearly unusual.
* With only 10 values and the population standard deviation, |Z| can never exceed √(n − 1) = 3. For small datasets a lower cutoff such as ±2.5 is often used, and under that cutoff **30 is an outlier** ✅
* The other values have Z-scores between about 0.09 and 0.65 → **not outliers**



### 🧾 Summary Table

| Value | Z-score        | Outlier?                          |
| ----- | -------------- | --------------------------------- |
| 30    | -2.96          | ❌ No at ±3, ✅ Yes at ±2.5         |
| 85–95 | ≈ 0.09 to 0.65 | ❌ No                              |
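
The Z-score steps can be reproduced with Python's standard library. This is a minimal sketch: `pstdev` is the population standard deviation (it divides by n, as in Step 4), the function names `z_scores` and `z_outliers` are hypothetical, and the looser cutoff of 2.5 is an assumption for small samples rather than a fixed rule.

```python
from statistics import mean, pstdev

scores = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]

def z_scores(values):
    """Z-score of every value, using the population standard deviation (divide by n)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds `threshold`."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Print each score next to its Z-score, rounded to two decimals.
for x, z in zip(scores, z_scores(scores)):
    print(f"{x:>3}  z = {z:+.2f}")

print(z_outliers(scores))                 # [] with the strict cutoff of 3
print(z_outliers(scores, threshold=2.5))  # [30] with the assumed looser cutoff
```

On this data the strict cutoff flags nothing and the 2.5 cutoff flags 30, because 30 sits at roughly −2.96.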

---

# 3. **Boxplot**:

* A boxplot is a simple plot that shows outliers as individual points beyond the whiskers, which usually extend up to 1.5 × IQR from the edges of the box.
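
A boxplot of the test scores could be drawn as follows. This is a sketch that assumes matplotlib is installed; `draw_boxplot` and `points_beyond_whiskers` are hypothetical helper names. The second helper estimates which points would be drawn separately, assuming linearly interpolated quartiles and a whisker factor of 1.5, so its bounds differ slightly from the median-of-halves values in the IQR section.

```python
from statistics import quantiles

scores = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]

def points_beyond_whiskers(values, whis=1.5):
    """Values a boxplot would draw as separate points (hypothetical helper).

    Assumes linearly interpolated quartiles and whiskers limited to
    whis * IQR from the edges of the box.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    reach = whis * (q3 - q1)
    return [v for v in values if v < q1 - reach or v > q3 + reach]

def draw_boxplot(values):
    """Draw the boxplot; matplotlib is imported here so it is only needed for plotting."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(3, 4))
    ax.boxplot(values, whis=1.5)  # 1.5 is the usual whisker factor
    ax.set_ylabel("Test score")
    ax.set_title("Test scores")
    return fig

print(points_beyond_whiskers(scores))  # [30]
```

Calling `draw_boxplot(scores)` in a notebook cell should render the figure, with 30 appearing as a lone point below the lower whisker.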

---

### ⚠️ Why Outliers Matter

* They can **distort averages** and **machine learning models**.
* Sometimes they are **errors** (e.g., data entry mistakes).
* Sometimes they are **important** (e.g., rare events or fraud cases).
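
To see how a single outlier distorts an average, the sketch below compares the mean and the median of the test scores with and without the value 30. The variable name `without_30` is just for illustration.

```python
from statistics import mean, median

scores = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]
without_30 = [s for s in scores if s != 30]

# The mean shifts by almost six points when 30 is included.
print(mean(scores), round(mean(without_30), 1))  # 83.3 89.2

# The median barely moves, which is why it is called a robust statistic.
print(median(scores), median(without_30))        # 88.5 89
```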

---

### 🛠️ What to Do with Outliers?

* **Remove them** if they are clearly errors.
* **Cap or transform them** if they’re too extreme.
* **Keep them** if they’re meaningful and you want to study them.
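
The three options can be sketched in plain Python, reusing the IQR bounds of 78.5 and 98.5 computed earlier. The variable names are illustrative, and which option is appropriate depends on the dataset.

```python
scores = [85, 88, 90, 92, 87, 89, 91, 86, 95, 30]
LOWER, UPPER = 78.5, 98.5  # IQR bounds from the worked example

# Option 1: remove values outside the bounds.
removed = [s for s in scores if LOWER <= s <= UPPER]

# Option 2: cap (clip) values to the nearest bound instead of dropping them.
capped = [min(max(s, LOWER), UPPER) for s in scores]

# Option 3: keep everything, but flag the outliers for later inspection.
flagged = [(s, not (LOWER <= s <= UPPER)) for s in scores]

print(removed)  # [85, 88, 90, 92, 87, 89, 91, 86, 95]
print(capped)   # 30 becomes 78.5, the other values are unchanged
```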
