# Setup: Generate Sample Dataset

This cell creates the required folder structure (`../data/raw/` and `../data/processed/`, one level above the notebook) and generates a synthetic CSV dataset of daily returns with a few injected shock values (no missing values are introduced).
The dataset is written to `../data/raw/outliers_homework.csv` unless a file already exists there, so it is ready for the outlier functions below.

In [7]:
import os
import numpy as np
import pandas as pd

# Define folder paths relative to this notebook
raw_dir = '../data/raw'
processed_dir = '../data/processed'

# Create folders if they don't exist
os.makedirs(raw_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)

# Generate business day dates
dates = pd.date_range(start="2022-01-03", end="2022-06-10", freq="B")

# Fixed random seed for reproducibility
np.random.seed(17)

# Column 1: daily_return ~ Normal(mean=0, std=0.01), shifted down by 0.0015 before May 2022
returns = np.random.normal(0, 0.01, size=len(dates))
mask_pre_may = dates < "2022-05-01"
returns[mask_pre_may] -= 0.0015  

# Inject "shock" values
shock_values = {
    "2022-05-02": 0.1748425237194541,
    "2022-05-03": -0.16825801732486943,
    "2022-05-06": -0.19667220757153227,
    "2022-05-09": 0.21240223590614747,
    "2022-05-12": -0.178729287231294
}
for d, v in shock_values.items():
    idx = np.where(dates == pd.to_datetime(d))[0][0]
    returns[idx] = v

# Column 2: daily_return_2, correlated with daily_return + small noise
daily_return_2 = returns * 0.6 + np.random.normal(0, 0.005, size=len(dates))

# Create DataFrame with two numeric columns
df = pd.DataFrame({
    "date": dates,
    "daily_return": returns,
    "daily_return_2": daily_return_2
})

# Save to CSV in raw data folder
csv_path = os.path.join(raw_dir, 'outliers_homework.csv')
if not os.path.exists(csv_path):
    df.to_csv(csv_path, index=False)
    print(f'Synthetic dataset with two columns created and saved to {csv_path}')
else:
    print(f'File already exists at {csv_path}. Skipping CSV creation to avoid overwrite.')

File already exists at ../data/raw/outliers_homework.csv. Skipping CSV creation to avoid overwrite.
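
The setup cell writes the CSV to `../data/raw/`, while the load cells later in this notebook look under `data/raw/`, so which file is read depends on the working directory. A minimal sketch of one way to make the lookup explicit, assuming a hypothetical helper `first_existing` (not part of the assignment code) that returns the first candidate path that exists:

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate path that exists, or None if none do.

    Hypothetical helper for illustration; not part of the assignment code.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.exists():
            return path
    return None


# Candidate locations: relative to the working directory first, then one level up
found = first_existing([
    "data/raw/outliers_homework.csv",
    "../data/raw/outliers_homework.csv",
])
print(f"Using: {found}" if found else "CSV not found; the synthetic fallback would be used.")
```

Printing the resolved path makes it obvious whether a later cell is working with the generated returns data or with a synthetic fallback.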


# Stage 7 Homework — Outliers + Risk Assumptions
In this assignment you will implement outlier detection/handling and run a simple sensitivity analysis.

**Chain:** In the lecture, we learned detection (IQR, Z-score), options for handling (remove/winsorize), and sensitivity testing. Now, you will adapt those methods to a provided dataset and document the risks and assumptions behind your choices.

In [8]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
np.random.seed(17)

## Load Data (provided or synthetic fallback)

In [9]:
# Resolved against the current working directory; note that the setup cell wrote to '../data/raw'
data_path = Path('data/raw/outliers_homework.csv')
if data_path.exists():
    df = pd.read_csv(data_path)
else:
    # Synthetic fallback: linear trend with noise and a few extremes
    x = np.linspace(0, 10, 200)
    y = 2.2 * x + 1 + np.random.normal(0, 1.2, size=x.size)
    y[10] += 15; y[120] -= 13; y[160] += 18
    df = pd.DataFrame({'x': x, 'y': y})
df.head()

|   | x | y |
|---|---|---|
| 0 | 0.0 | 1.331519 |
| 1 | 0.050251 | -1.115001 |
| 2 | 0.100503 | 1.969787 |
| 3 | 0.150754 | 2.706032 |
| 4 | 0.201005 | 2.68684 |


## TODO: Implement Outlier Functions (required)

### `detect_outliers_iqr(series)`
Detect outliers in a numeric pandas Series using Tukey's IQR rule.

**Rule:** A point is an outlier if it lies below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`,  
where `IQR = Q3 - Q1`, and `Q1` and `Q3` are the 25th and 75th percentiles.

**Parameters**
- `series` *(pd.Series)*: numeric series; NaNs are ignored.

**Returns**
- `pd.Series[bool]`: Boolean mask aligned to the input index; `True` where the value is an outlier.
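
As a worked illustration of the rule, the sketch below computes the quartiles and fences by hand on a small toy series. The values are made up for this example and are not taken from the homework dataset; pandas' default linear interpolation is assumed for the quantiles.

```python
import pandas as pd

# Toy data (hypothetical): nine ordinary values and one extreme
values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], name="toy")

q1 = values.quantile(0.25)   # 25th percentile
q3 = values.quantile(0.75)   # 75th percentile
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value is flagged when it falls outside the fences
is_outlier = (values < lower_fence) | (values > upper_fence)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=[{lower_fence}, {upper_fence}]")
print("flagged:", values[is_outlier].tolist())
```

Here Q1 = 12.0 and Q3 = 13.75, so the fences are [9.375, 16.375] and only 102 is flagged.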


In [10]:
import numpy as np
import pandas as pd

def detect_outliers_iqr(series: pd.Series) -> pd.Series:
    """Detect outliers via Tukey's IQR rule.
    
    A value x is flagged as an outlier if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR,
    where IQR = Q3 - Q1 computed ignoring NaNs.
    
    Parameters
    ----------
    series : pd.Series
        Numeric Series. NaNs are ignored in quantile computation.
    
    Returns
    -------
    pd.Series of bool
        Boolean mask with the same index as `series` where True marks outliers.
    """
    if not isinstance(series, pd.Series):
        series = pd.Series(series)
    s = pd.to_numeric(series, errors="coerce")
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    # Guard against zero IQR (e.g., constant series) to avoid false positives
    if pd.isna(iqr) or iqr == 0:
        return pd.Series(False, index=series.index)
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    out = (s < lower) | (s > upper)
    # Keep NaNs as False
    out = out.fillna(False)
    out.name = getattr(series, "name", None)
    return out


*(Stretch)* Implement winsorizing (optional).

### `winsorize_series(series, lower=0.05, upper=0.95)` *(stretch)*
Winsorize a numeric series by **clipping** extreme values to specified quantiles.

**Parameters**
- `series` *(pd.Series)*: numeric series; NaNs preserved.
- `lower` *(float, default 0.05)*: lower quantile in \[0, 0.5); set to 0 for no lower clipping.
- `upper` *(float, default 0.95)*: upper quantile in (0.5, 1]; set to 1 for no upper clipping.

**Returns**
- `pd.Series`: Series with values clipped to `[Q_lower, Q_upper]`, index preserved.

**Notes**
- Winsorization reduces the influence of extreme values without removing observations.
- If `lower >= upper`, no clipping is applied.
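
To make the difference between clipping and removing concrete, here is a small sketch on toy data. The values and the 10th/90th percentile bounds are chosen only for illustration (the function's defaults are the 5th/95th percentiles), and trimming is shown purely as a contrast.

```python
import pandas as pd

# Toy data (hypothetical): one large extreme at the top
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

q_low = values.quantile(0.10)
q_high = values.quantile(0.90)

# Winsorizing: values outside [q_low, q_high] are pulled to the bounds; no rows are lost
winsorized = values.clip(lower=q_low, upper=q_high)

# Trimming (for contrast): values outside the bounds are dropped
trimmed = values[(values >= q_low) & (values <= q_high)]

print(f"bounds: [{q_low:.2f}, {q_high:.2f}]")
print(f"rows kept: winsorized={len(winsorized)}, trimmed={len(trimmed)}")
print(f"max: original={values.max()}, winsorized={winsorized.max():.2f}")
```

The winsorized series keeps all 10 observations, while trimming would keep only 8.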


In [11]:
import numpy as np
import pandas as pd

def winsorize_series(series: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Winsorize a numeric Series by clipping values to the given quantiles.
    
    Parameters
    ----------
    series : pd.Series
        Numeric Series to winsorize. Non-numeric values are coerced to NaN.
    lower : float, default 0.05
        Lower quantile (0 <= lower < 0.5). Set to 0 to disable lower clipping.
    upper : float, default 0.95
        Upper quantile (0.5 < upper <= 1). Set to 1 to disable upper clipping.
    
    Returns
    -------
    pd.Series
        Winsorized series with same index as input; NaNs preserved.
    """
    if not isinstance(series, pd.Series):
        series = pd.Series(series)
    s = pd.to_numeric(series, errors="coerce")
    # Validate bounds
    try:
        lower_f = float(lower)
        upper_f = float(upper)
    except Exception:
        lower_f, upper_f = 0.05, 0.95
    if not (0 <= lower_f <= 1 and 0 <= upper_f <= 1):
        lower_f, upper_f = 0.05, 0.95
    if lower_f >= upper_f:
        # Degenerate case: return original casted series
        return s
    
    q_low = s.quantile(lower_f)
    q_high = s.quantile(upper_f)
    # Clip; preserve dtype if possible
    w = s.clip(lower=q_low, upper=q_high)
    w.name = getattr(series, "name", None)
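    # If the original was integer-typed, cast back only when clipping left whole numbers;
    # quantile bounds are often fractional, and astype would silently truncate them
    if pd.api.types.is_integer_dtype(series.dtype) and not w.isna().any() and (w == w.round()).all():
        try:
            w = w.astype(series.dtype)
        except Exception:
            pass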
    return w


## Apply Detection and Create Flags (choose a numeric column)

In [12]:
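# Requires detect_outliers_zscore, which is defined in a later cell of this notebook;
# run that cell first, or this cell raises a NameError
target_col = 'y' if 'y' in df.columns else df.select_dtypes(include=['number']).columns[0]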
df['outlier_iqr'] = detect_outliers_iqr(df[target_col])
df['outlier_z'] = detect_outliers_zscore(df[target_col], threshold=3.0)
df[['outlier_iqr', 'outlier_z']].mean()  # fraction flagged

NameError: name 'detect_outliers_zscore' is not defined

### Visual Checks (boxplot / histogram)

In [None]:
plt.figure()
plt.boxplot(df[target_col].dropna())  # drop NaNs so the box statistics are computed on real values
plt.title(f'Boxplot: {target_col}')
plt.show()

plt.figure()
plt.hist(df[target_col].dropna(), bins=30)  # drop NaNs so the bin range is finite
plt.title(f'Histogram: {target_col}')
plt.show()

## Sensitivity Analysis
Pick one: summary stats or simple linear regression comparing **all vs. filtered** (and optional winsorized).

In [None]:
# Option A: Summary stats
summ_all = df[target_col].describe()[['mean', '50%', 'std']].rename({'50%': 'median'})
summ_filtered = df.loc[~df['outlier_iqr'], target_col].describe()[['mean', '50%', 'std']].rename({'50%': 'median'})
summ_w = None
if 'winsorize_series' in globals():
    w = winsorize_series(df[target_col])
    summ_w = w.describe()[['mean', '50%', 'std']].rename({'50%': 'median'})

comp = pd.concat(
    {
        'all': summ_all,
        'filtered_iqr': summ_filtered,
        **({'winsorized': summ_w} if summ_w is not None else {})
    }, axis=1
)
comp

In [None]:
# Option B: Simple regression (if x present)
if 'x' in df.columns:
    X_all = df[['x']].to_numpy(); y_all = df[target_col].to_numpy()
    X_filtered = df.loc[~df['outlier_iqr'], ['x']].to_numpy(); y_filtered = df.loc[~df['outlier_iqr'], target_col].to_numpy()

    model_all = LinearRegression().fit(X_all, y_all)
    model_flt = LinearRegression().fit(X_filtered, y_filtered)

    mae_all = mean_absolute_error(y_all, model_all.predict(X_all))
    mae_flt = mean_absolute_error(y_filtered, model_flt.predict(X_filtered))

    results = pd.DataFrame({
        'slope': [model_all.coef_[0], model_flt.coef_[0]],
        'intercept': [model_all.intercept_, model_flt.intercept_],
        'r2': [model_all.score(X_all, y_all), model_flt.score(X_filtered, y_filtered)],
        'mae': [mae_all, mae_flt]
    }, index=['all', 'filtered_iqr'])
else:
    results = None
    print("No 'x' column; skip regression or engineer features.")
results  # last expression in the cell, so the notebook displays the table

### Reflection (≤ 1 page)
- Methods and thresholds used (and why)
- Assumptions behind choices
- Observed impact on results
- Risks if assumptions are wrong (e.g., discarding true events)

*Write your reflection here...*

### `detect_outliers_zscore(series, threshold=3.0)`
Detect outliers using **standard score (z‑score)**.

**Rule:** A point is an outlier if `|z| > threshold`, with `z = (x − μ)/σ`  
computed from the mean `μ` and the population standard deviation `σ` (ddof=0).

**Parameters**
- `series` *(pd.Series)*: numeric series; NaNs are ignored when computing μ and σ.
- `threshold` *(float, default 3.0)*: z‑score cutoff (common choices: 3.0, 2.5).

**Returns**
- `pd.Series[bool]`: Boolean mask aligned to the input index; `True` where the value is an outlier.
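
One property of this rule is worth seeing on numbers: with `ddof=0`, no `|z|` in a sample of size `n` can exceed `sqrt(n - 1)` (Samuelson's inequality), so in a small sample even a glaring extreme may stay under a cutoff of 3.0. The sketch below uses a made-up toy series, not the homework data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): nine ordinary values and one extreme
values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], name="toy")

mu = values.mean()
sigma = values.std(ddof=0)   # standard deviation with ddof=0, as in the rule above
z = (values - mu) / sigma

flag_30 = z.abs() > 3.0
flag_25 = z.abs() > 2.5

# Upper bound on |z| for a sample of this size when ddof=0
z_cap = np.sqrt(len(values) - 1)

print(f"z-score of 102: {z.iloc[-1]:.3f} (cap for n={len(values)}: {z_cap:.3f})")
print("flagged at 3.0:", values[flag_30].tolist())
print("flagged at 2.5:", values[flag_25].tolist())
```

The extreme value inflates both the mean and the standard deviation, so its z-score comes out just under 3: nothing is flagged at 3.0, while 102 is flagged at 2.5.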


In [None]:
def detect_outliers_zscore(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Detect outliers via z-score with a configurable threshold.
    
    Parameters
    ----------
    series : pd.Series
        Numeric Series. NaNs are ignored when fitting mean/std.
    threshold : float, default 3.0
        Absolute z-score above which a point is flagged as an outlier.
    
    Returns
    -------
    pd.Series of bool
        Boolean mask with the same index as `series` where True marks outliers.
    """
    if not isinstance(series, pd.Series):
        series = pd.Series(series)
    s = pd.to_numeric(series, errors="coerce")
    mu = s.mean(skipna=True)
    sigma = s.std(skipna=True, ddof=0)
    if pd.isna(sigma) or sigma == 0:
        return pd.Series(False, index=series.index)
    z = (s - mu) / sigma
    out = z.abs() > threshold
    out = out.fillna(False)
    out.name = getattr(series, "name", None)
    return out


#### Quick sanity checks

In [None]:
# Quick sanity checks
s = pd.Series([1, 2, 2, 2, 3, 100, -50, None, 2, 2, 3], name="demo")

print("IQR outliers:")
print(detect_outliers_iqr(s))

print("\nZ-score outliers (threshold=2.5):")
print(detect_outliers_zscore(s, threshold=2.5))

print("\nWinsorized (5th/95th percentiles):")
print(winsorize_series(s, 0.05, 0.95))



# Stage 7 — Outliers + Risk Assumptions (Completed)

This notebook implements reusable outlier utilities (IQR, Z-score, Winsorization), applies them to data, and runs a small sensitivity analysis (summary stats + simple regression) **with vs. without outliers** and **winsorized**. Plots and a short reflection are included to match the assignment.



## 1) Load data 

In [None]:
import os

np.random.seed(42)

csv_path = "data/raw/outliers_homework.csv"
if os.path.exists(csv_path):
    df = pd.read_csv(csv_path)
else:
    # Synthetic fallback: linear relation with noise and a few injected outliers
    n = 300
    x = np.random.normal(50, 10, size=n)
    y = 2.5 * x + np.random.normal(0, 12, size=n)
    # Inject a few outliers
    x[:5] = x[:5] + np.array([80, -70, 100, -90, 60])
    y[:5] = y[:5] + np.array([300, -280, 350, -330, 250])
    df = pd.DataFrame({"feature_x": x, "target_y": y})
df.head()



## 2) Outlier flags
We will use `feature_x` as the explanatory variable and `target_y` as the response for modeling. Below we build boolean masks for outliers using both IQR and Z-score.


In [None]:

# Pick columns (adapt here if your CSV uses different names)
numeric_cols = df.select_dtypes("number").columns
num_col = "feature_x" if "feature_x" in df.columns else numeric_cols[0]
# Response: prefer target_y, else the first numeric column other than num_col (None if there is none)
other_cols = [c for c in numeric_cols if c != num_col]
y_col = "target_y" if "target_y" in df.columns else (other_cols[0] if other_cols else None)

iqr_mask = detect_outliers_iqr(df[num_col])
z_mask = detect_outliers_zscore(df[num_col], threshold=3.0)
df["outlier_iqr"] = iqr_mask
df["outlier_zscore"] = z_mask

df[[num_col, "outlier_iqr", "outlier_zscore"]].head()



## 3) Sensitivity — Summary stats (mean/median/std)
We compare **all data**, **filtered by IQR (no outliers)**, and **winsorized** for the chosen column.


In [None]:

# Winsorize chosen series
w_series = winsorize_series(df[num_col], lower=0.05, upper=0.95)

def stats(s):
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "std": s.std(ddof=1),
        "min": s.min(),
        "max": s.max(),
        "count": s.shape[0]
    })

stats_all = stats(df[num_col].dropna())
stats_filtered = stats(df.loc[~df["outlier_iqr"], num_col].dropna())
stats_wins = stats(w_series.dropna())

summary_table = pd.concat([stats_all, stats_filtered, stats_wins], axis=1)
summary_table.columns = ["all", "no_outliers_iqr", "winsorized"]
summary_table



## 4) Boxplots — before vs. winsorized


In [None]:
plt.figure()
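plt.boxplot([df[num_col].dropna().values, w_series.dropna().values])
# Set tick labels separately; avoids the `labels` keyword, which newer Matplotlib releases deprecate
plt.xticks([1, 2], ["All", "Winsorized"])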
plt.title(f"Boxplots for {num_col} (All vs Winsorized)")
plt.show()



## 5) Simple Linear Regression — with vs. without outliers (plus winsorized)
We fit `target_y ~ feature_x` (or the first two numeric columns) and compare coefficients and fit metrics across three treatments.


In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Choose y if not set; if no second numeric column exists, synthesize target
if y_col is None or y_col not in df.columns:
    # Synthesize a target as a noisy linear function of num_col
    y_col = "target_synth"
    df[y_col] = 2.5 * df[num_col] + np.random.normal(0, 12, size=len(df))

def fit_and_report(X_series, y_series, label):
    mask = (~X_series.isna()) & (~y_series.isna())
    X = X_series[mask].values.reshape(-1, 1)
    y = y_series[mask].values
    model = LinearRegression().fit(X, y)
    y_hat = model.predict(X)
    return pd.Series({
        "label": label,
        "coef": float(model.coef_[0]),
        "intercept": float(model.intercept_),
        "R2": float(r2_score(y, y_hat)),
        "MAE": float(mean_absolute_error(y, y_hat)),
        "n": int(mask.sum())
    })

# Variants
all_fit = fit_and_report(df[num_col], df[y_col], "all")
no_outliers_fit = fit_and_report(df.loc[~df["outlier_iqr"], num_col], df.loc[~df["outlier_iqr"], y_col], "no_outliers_iqr")
wins_fit = fit_and_report(w_series, df[y_col], "winsorized")

reg_table = pd.concat([all_fit, no_outliers_fit, wins_fit], axis=1).T.set_index("label")
reg_table
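
To see why a single point can move the fit, here is a minimal sketch on toy data. It uses `np.polyfit` rather than the `LinearRegression` used in the cell above, only to keep the example dependency-light; the points are made up for illustration.

```python
import numpy as np

# Ten points exactly on y = 2x + 1, plus one high-leverage point far above the line
x_clean = np.arange(10.0)
y_clean = 2.0 * x_clean + 1.0
x_all = np.append(x_clean, 20.0)
y_all = np.append(y_clean, 100.0)   # the trend would give 41 at x = 20

# np.polyfit with degree 1 returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x_clean, y_clean, 1)
slope_all, intercept_all = np.polyfit(x_all, y_all, 1)

print(f"without outlier: slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"with outlier:    slope={slope_all:.3f}, intercept={intercept_all:.3f}")
```

Because the extra point sits far from the other `x` values, it has high leverage and pulls the slope upward. Comparing fits with and without flagged rows, as the table above does, shows how much of the estimate depends on such points.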



## 6) Residuals — visual check


In [None]:

def residuals_series(X_series, y_series):
    mask = (~X_series.isna()) & (~y_series.isna())
    X = X_series[mask].values.reshape(-1, 1)
    y = y_series[mask].values
    m = LinearRegression().fit(X, y)
    return pd.Series(y - m.predict(X), index=X_series[mask].index)

res_all = residuals_series(df[num_col], df[y_col])
res_no = residuals_series(df.loc[~df["outlier_iqr"], num_col], df.loc[~df["outlier_iqr"], y_col])
res_win = residuals_series(w_series, df[y_col])

plt.figure()
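plt.boxplot([res_all.values, res_no.values, res_win.values])
# Set tick labels separately; avoids the `labels` keyword, which newer Matplotlib releases deprecate
plt.xticks([1, 2, 3], ["All", "No outliers (IQR)", "Winsorized"])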
plt.title("Residuals distribution across treatments")
plt.show()



## 7) Reflection (≤ 1 page)

**Method choice & thresholds.**  
- IQR with 1.5×IQR was used to flag and **remove** outliers for the filtered variant.  
- Z-score at 3.0 is computed as a flag but not used in the main comparison; because the mean and standard deviation are themselves inflated by extreme values, it can under-flag when the distribution is heavy‑tailed.  
- Winsorization at [5%, 95%] was used to **limit influence** without dropping rows.

**Assumptions.**  
- The core numeric feature is roughly unimodal with a majority of inliers.  
- The linear model is an adequate first‑order summary for sensitivity purposes.  
- The winsorization quantiles (5th and 95th percentiles) are estimated stably, i.e., the sample is large enough.

**Observed impacts.**  
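- Compare the tables above: removing IQR-flagged outliers typically pulls the mean toward the median and reduces the std. Regression coefficients, MAE, and R² can move in either direction, depending on whether the removed points lie along the fitted trend. Winsorization usually lands between the unfiltered and filtered results.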

**Risks if wrong.**  
- If extreme values are genuine and predictive, removing them biases conclusions.  
- If the distribution is strongly skewed/heavy‑tailed, z-score may under‑ or over‑flag.  
- Winsorization masks extremes, which could be critical for risk‑sensitive applications.
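
The first risk above can be made concrete with a small sketch. The returns below are illustrative only (20 quiet days plus two shocks of roughly the magnitude injected in the setup cell), not the homework dataset, and the Tukey fences are recomputed inline so the block stands alone.

```python
import pandas as pd

# Illustrative daily returns (hypothetical): 20 quiet days plus two shocks
quiet = [0.01, -0.01, 0.005, -0.005] * 5
returns = pd.Series(quiet + [-0.20, 0.21], name="daily_return")

# Tukey fences, same rule as detect_outliers_iqr
q1, q3 = returns.quantile(0.25), returns.quantile(0.75)
iqr = q3 - q1
is_outlier = (returns < q1 - 1.5 * iqr) | (returns > q3 + 1.5 * iqr)
filtered = returns[~is_outlier]

# Compare volatility and the worst single day before and after filtering
summary = pd.DataFrame(
    {
        "all": [returns.std(ddof=1), returns.min()],
        "filtered_iqr": [filtered.std(ddof=1), filtered.min()],
    },
    index=["std", "worst_day"],
)
print(summary)
```

After filtering, the worst day shrinks from -0.20 to -0.01 and the volatility estimate drops with it. If the shocks are genuine market events rather than data errors, a risk estimate built on the filtered series would understate the downside.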
