# 🚀 Solutions for Outliers

Once outliers are detected (e.g., using the IQR method), there are several strategies to handle them. The choice depends on the context of the data and the problem you are solving.

---

#### » Load the diamonds dataset and compute the boundaries with the IQR method

In [1]:
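import seaborn as sns
# Load the diamonds dataset and work on a copy
dia = sns.load_dataset("diamonds")
df = dia.copy()
table = df["table"]
# Quartiles and interquartile range of the "table" column
Q1 = table.quantile(0.25)
Q3 = table.quantile(0.75)
IQR = Q3 - Q1
# IQR boundaries
low_lim = Q1 - 1.5 * IQR
high_lim = Q3 + 1.5 * IQR
# Boolean mask for values below the lower boundary only
outlier = table < low_lim
table[outlier]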

1515     51.0
3238     50.1
3979     51.0
4150     51.0
5979     49.0
7418     50.0
8853     51.0
11368    43.0
22701    49.0
25179    50.0
26387    51.0
33586    51.0
35633    44.0
45798    51.0
46040    51.0
47630    51.0
Name: table, dtype: float64
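
The cell above flags only the values below the lower boundary. As a minimal sketch of flagging both sides, the same calculation can be wrapped in a small helper. The helper name `iqr_bounds` and the toy values are hypothetical and are not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR boundaries of a numeric Series (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data standing in for the "table" column (made-up values)
toy = pd.Series([43.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 95.0])

low, high = iqr_bounds(toy)
low_mask = toy < low      # same kind of mask as `outlier` above
high_mask = toy > high    # the upper side, which the cell above does not flag
both_mask = low_mask | high_mask

print(toy[both_mask])
```

Keeping the two masks separate makes it possible to treat low and high outliers differently later, for example capping each side at its own boundary.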

## 🔹 1. Directly Removing Outliers
- Remove data points that fall outside the acceptable range (below the lower bound or above the upper bound).
- Useful when outliers are due to errors, noise, or irrelevant extreme values.
- **Pros**: Simple and effective.  
- **Cons**: Risk of losing valuable information if outliers are genuine.

---

#### » Display the shape before removing outliers

In [10]:
import pandas as pd
table = pd.DataFrame(table)
table.shape

(53940, 1)

#### » Display the shape after removing outliers

In [11]:
df_pure = table[~( ( table < low_lim ) | ( table > high_lim ) ).any(axis=1)]
df_pure.shape

(53335, 1)

After removing the outliers, the row count drops from 53,940 to 53,335: 605 rows fell outside the IQR boundaries and were removed.
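
Converting the column to a one-column data frame, as above, leaves the other columns behind. A minimal sketch of filtering whole rows instead, using `Series.between` on a toy frame; the `price` values and all variable names here are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the diamonds data (made-up values)
toy_df = pd.DataFrame({
    "table": [43.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 95.0],
    "price": [326, 327, 334, 335, 336, 337, 338, 339, 340],
})

q1 = toy_df["table"].quantile(0.25)
q3 = toy_df["table"].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# between() is inclusive on both ends by default,
# so values exactly equal to a boundary are kept
inside = toy_df["table"].between(low, high)
toy_pure = toy_df[inside]

print(toy_df.shape, "->", toy_pure.shape)
```

The boolean mask is built from one column but applied to the whole frame, so every remaining row keeps all of its columns.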

## 🔹 2. Filling with Mean/Median
- Replace outlier values with the **mean** or **median** of the dataset/column.
- Median is preferred when the distribution is skewed (more robust to outliers).
- **Pros**: Keeps dataset size the same.  
- **Cons**: May distort variability and reduce natural variance.

---

#### » Copy the dia dataframe so the previous dataframe is not modified

In [17]:
df2 = dia.copy()
df2.table[outlier]

1515     51.0
3238     50.1
3979     51.0
4150     51.0
5979     49.0
7418     50.0
8853     51.0
11368    43.0
22701    49.0
25179    50.0
26387    51.0
33586    51.0
35633    44.0
45798    51.0
46040    51.0
47630    51.0
Name: table, dtype: float64

#### » Display the mean of the table column

In [21]:
df2.table.mean()

np.float64(57.45950597136711)

#### » Fill the outliers of the table column with the mean 

In [23]:
df2.loc[outlier,"table"] = df2.table.mean()
df2.table[outlier]

1515     57.459506
3238     57.459506
3979     57.459506
4150     57.459506
5979     57.459506
7418     57.459506
8853     57.459506
11368    57.459506
22701    57.459506
25179    57.459506
26387    57.459506
33586    57.459506
35633    57.459506
45798    57.459506
46040    57.459506
47630    57.459506
Name: table, dtype: float64
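
The cells above fill with the mean, which is computed over the full column, outliers included. A minimal sketch of the median variant mentioned in the bullets follows, on made-up toy values. Taking the median of the non-outlier values only is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd

# Toy data standing in for the "table" column (made-up values)
toy = pd.Series([43.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 95.0], name="table")

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
mask = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# Median of the non-outlier values only (assumption of this sketch)
fill_value = toy[~mask].median()

# Work on a copy so the original Series stays untouched
filled = toy.copy()
filled.loc[mask] = fill_value

print(filled.tolist())
```

The Series keeps its original length; only the flagged positions change.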

## 🔹 3. Compression (Capping)
- Closely related to **winsorization**: replace each outlier with the closest acceptable value (the lower or upper bound).
  
$x_{new} = \begin{cases} Q1 - 1.5 \times IQR & \text{if } x < \text{Lower Bound} \\Q3 + 1.5 \times IQR & \text{if } x > \text{Upper Bound} \\x & \text{otherwise}\end{cases}$

- **Pros**: Preserves dataset size and limits extreme influence.  
- **Cons**: Artificially compresses distribution tails.

---

#### » Copy the dia dataframe so the previous dataframes are not modified

In [24]:
df3 = dia.copy()
df3.table[outlier]

1515     51.0
3238     50.1
3979     51.0
4150     51.0
5979     49.0
7418     50.0
8853     51.0
11368    43.0
22701    49.0
25179    50.0
26387    51.0
33586    51.0
35633    44.0
45798    51.0
46040    51.0
47630    51.0
Name: table, dtype: float64

#### » Fill the outliers of the table column with the lower boundary 

In [27]:
df3.loc[outlier,"table"] = low_lim
df3.table[outlier]

1515     51.5
3238     51.5
3979     51.5
4150     51.5
5979     51.5
7418     51.5
8853     51.5
11368    51.5
22701    51.5
25179    51.5
26387    51.5
33586    51.5
35633    51.5
45798    51.5
46040    51.5
47630    51.5
Name: table, dtype: float64
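
The cells above cap only the low-side outliers. A minimal sketch of capping both sides in one step with `Series.clip`, matching the piecewise formula in this section; the toy values are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the "table" column (made-up values)
toy = pd.Series([43.0, 55.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 95.0], name="table")

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values below `low` become `low`, values above `high` become `high`,
# everything else is left unchanged
capped = toy.clip(lower=low, upper=high)

print(capped.tolist())
```

`clip` returns a new Series, so the original data is not modified unless the result is assigned back to the column.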