# Stage 7 Homework — Outliers + Risk Assumptions
In this assignment you will implement outlier detection/handling and run a simple sensitivity analysis.

**Chain:** In the lecture, we learned detection (IQR, Z-score), options for handling (remove/winsorize), and sensitivity testing. Now, you will adapt those methods to a provided dataset and document the risks and assumptions behind your choices.

In [2]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
np.random.seed(17)

In [3]:
data_path = Path('data/raw/outliers_homework.csv')
if data_path.exists():
    df = pd.read_csv(data_path)
else:
    # Synthetic fallback: linear trend with noise and a few extremes
    x = np.linspace(0, 10, 200)
    y = 2.2 * x + 1 + np.random.normal(0, 1.2, size=x.size)
    y[10] += 15; y[120] -= 13; y[160] += 18
    df = pd.DataFrame({'x': x, 'y': y})

In [10]:
target_col = 'y'

# Option A: summary stats (mean, median, std) with and without outlier handling.
# Note: df['outliers_iqr_y'] and winsorize_series come from the detection cell (In [11]); run it first.
def summarize(s: pd.Series) -> pd.Series:
    return s.describe()[['mean', '50%', 'std']].rename({'50%': 'median'})

summ_all = summarize(df[target_col])
summ_filtered = summarize(df.loc[~df['outliers_iqr_y'], target_col])
summ_w = None
if 'winsorize_series' in globals():
    # No limits are passed here, so winsorize_series falls back to its defaults
    summ_w = summarize(winsorize_series(df[target_col]))

comp = pd.concat(
    {
        'all': summ_all,
        'filtered_iqr': summ_filtered,
        **({'winsorized': summ_w} if summ_w is not None else {})
    }, axis=1
)
comp

|        | all       | filtered_iqr | winsorized |
|--------|-----------|--------------|------------|
| mean   | 12.171936 | 12.047851    | 12.119643  |
| median | 12.349663 | 12.285113    | 12.349663  |
| std    | 6.761378  | 6.546157     | 6.386781   |


In [11]:
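import sys
sys.path.append("..")  # make the parent directory importable so `src` resolves
from src.outliers import detect_outliers_iqr, detect_outliers_zscore, winsorize_series

# Boolean outlier flags on the target column
df["outliers_iqr_y"] = detect_outliers_iqr(df['y'], k=1.5)
df["outliers_zscore"] = detect_outliers_zscore(df['y'], threshold=3)

# Handling option 1: drop the rows flagged by the IQR rule
df_filtered = df.loc[~df['outliers_iqr_y'], ['x', 'y']].reset_index(drop=True)

# Handling option 2: winsorize y with lower=0.05, upper=0.95 (row count unchanged)
df_w = df.copy()
df_w['y_w'] = winsorize_series(df_w['y'], lower=0.05, upper=0.95)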

In [12]:
import numpy as np

def fit_and_metrics(X: np.ndarray, y: np.ndarray) -> dict:
    model = LinearRegression()
    model.fit(X, y)
    y_hat = model.predict(X)
    return {
        'slope': float(model.coef_[0]),
        'intercept': float(model.intercept_),
        'r2': float(r2_score(y, y_hat)),
        'mae': float(mean_absolute_error(y, y_hat))
    }

# All data
m_all = fit_and_metrics(df[['x']].to_numpy(), df['y'].to_numpy())
# Filtered (no IQR outliers)
m_flt = fit_and_metrics(df_filtered[['x']].to_numpy(), df_filtered["y"].to_numpy())
# Winsorized
m_win = fit_and_metrics(df[["x"]].to_numpy(), df_w['y_w'].to_numpy())

sens_table = pd.DataFrame([m_all, m_flt, m_win], index=['all', 'filtered_iqr', 'winsorized'])
sens_table

|              | slope    | intercept | r2       | mae      |
|--------------|----------|-----------|----------|----------|
| all          | 2.169679 | 1.323542  | 0.871082 | 1.200432 |
| filtered_iqr | 2.13665  | 1.397242  | 0.900777 | 1.118809 |
| winsorized   | 2.095902 | 1.640134  | 0.910996 | 1.051938 |


## Which method(s) and thresholds you chose and why

 To detect and handle outliers in the sample data we chose:
  - IQR rule: flags points below `Q1 - 1.5*IQR` or above `Q3 + 1.5*IQR`. We chose the multiplier 1.5 because it keeps most of the data in the centre while excluding even moderate outliers.
  - Z-score: flags only extreme points, those with `abs(z) > 3`, where `z = (x - μ) / σ`.
  - Winsorization: clips the data at the lower and upper percentiles (0.05 and 0.95) and replaces anything beyond them with the percentile value, so no rows are dropped.
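
The helpers imported from `src.outliers` are not shown in this notebook, so the following is only a minimal sketch of how the three rules above could be implemented. The `_sketch` function names, the use of the population standard deviation (`ddof=0`), and the toy series are assumptions made for illustration, not the actual module contents.

```python
import pandas as pd

def detect_outliers_iqr_sketch(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def detect_outliers_zscore_sketch(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values whose absolute z-score exceeds the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)  # assumption: population std
    return z.abs() > threshold

def winsorize_series_sketch(s: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Clip values to the [lower, upper] quantiles instead of dropping them."""
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

# Toy series (hypothetical): the values 1..20 plus one extreme point
demo = pd.Series(list(range(1, 21)) + [100.0])
iqr_flags = detect_outliers_iqr_sketch(demo)
z_flags = detect_outliers_zscore_sketch(demo)
wins = winsorize_series_sketch(demo)

print(demo[iqr_flags].tolist(), demo[z_flags].tolist())
print(wins.min(), wins.max(), len(wins))
```

Both detectors return a boolean mask aligned with the input, which is why the notebook can store them as columns and negate them with `~` to filter rows. Winsorizing returns a series of the same length as the input.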

## Assumptions behind your choices

 - IQR -> the sample distribution is unimodal and roughly symmetric; it does not need to be Gaussian.
 - Z-score -> the sample distribution is approximately normal.
 - Winsorize -> extreme values are likely noise rather than a true representation of the data. When the sample is small, it is better to replace outliers than to drop them.
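
These shape assumptions can be checked roughly before trusting either detector. The sketch below compares sample skewness and excess kurtosis for a symmetric and a right-skewed sample; the two synthetic samples and the `shape_summary` helper are hypothetical stand-ins, and in practice the column under study would be passed in instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)

# Hypothetical samples standing in for a target column
symmetric = pd.Series(rng.normal(0, 1, 5000))
skewed = pd.Series(rng.lognormal(0, 1, 5000))

def shape_summary(s: pd.Series) -> dict:
    """Sample skewness and excess kurtosis (both near 0 for normal data)."""
    return {'skew': float(s.skew()), 'excess_kurtosis': float(s.kurt())}

sym_shape = shape_summary(symmetric)
skew_shape = shape_summary(skewed)
print(sym_shape)
print(skew_shape)
```

Values far from zero suggest that the symmetry or normality assumption is doubtful, in which case the Z-score cutoff in particular deserves less trust.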

## Observed impacts on results

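 - Winsorization gives the highest `r2` (0.911) and the lowest `mae` (1.052), ahead of IQR filtering (0.901, 1.119) and the full data (0.871, 1.200). This suggests the outliers are likely noise rather than a true representation of the data. Because these metrics are computed against the winsorized target itself, part of the improvement comes from compressing the extremes.
 - IQR filtering shifts the mean (12.17 to 12.05) and median (12.35 to 12.29) more than winsorization does (mean 12.12, median unchanged), because the extreme points are dropped entirely. The standard deviation, however, falls further under winsorization (6.76 to 6.39) than under IQR filtering (6.55).
 - The fitted slope moves from 2.17 on all data to 2.14 after IQR filtering and 2.10 after winsorization, so the choice of handling method changes the estimated trend as well as the fit quality.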
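
One way to quantify these shifts is to subtract the all-data column from each treated column. The sketch below re-enters the values reported in the summary table above (rather than recomputing them from the data) and prints the differences; the variable names are illustrative.

```python
import pandas as pd

# Values copied from the summary-statistics table in this notebook
comp_reported = pd.DataFrame(
    {
        'all': [12.171936, 12.349663, 6.761378],
        'filtered_iqr': [12.047851, 12.285113, 6.546157],
        'winsorized': [12.119643, 12.349663, 6.386781],
    },
    index=['mean', 'median', 'std'],
)

# Change in each statistic relative to the untreated data
delta = comp_reported[['filtered_iqr', 'winsorized']].sub(comp_reported['all'], axis=0)
print(delta.round(3))
```

Reading the result row by row shows which treatment moves each statistic further from the baseline.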

## Risks if assumptions are wrong

- IQR -> If data is fat-tailed, there is a chance of dropping a significant chunk of data.
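- Z-score -> The mean and standard deviation are themselves sensitive to skewness and extreme values, so if the data is not approximately normal the `abs(z) > 3` cutoff can miss or mislabel outliers.
- Winsorize -> Replacing outliers with a threshold value may introduce unwanted bias, for example by understating the spread when the extremes are genuine observations.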
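
The fat-tail risk for the IQR rule can be illustrated with a small simulation. The sketch below applies the same `k=1.5` rule to a normal sample and to a Student-t sample with 2 degrees of freedom; both samples and the `iqr_flag_rate` helper are hypothetical, chosen only to contrast thin and fat tails.

```python
import numpy as np

rng = np.random.default_rng(17)
n = 20_000

def iqr_flag_rate(values: np.ndarray, k: float = 1.5) -> float:
    """Fraction of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    flagged = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(flagged.mean())

normal_rate = iqr_flag_rate(rng.normal(size=n))
fat_tail_rate = iqr_flag_rate(rng.standard_t(df=2, size=n))

print(f"normal: {normal_rate:.2%}, fat-tailed: {fat_tail_rate:.2%}")
```

For normal data the rule is expected to flag well under 1% of points, while the fat-tailed sample loses a much larger share, which is the "significant chunk" risk described above.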