# 📘 Notebook 4: From Clean to Dirty Data
This notebook explores how real-world data deviates from ideal assumptions. We introduce noise, outliers, multicollinearity, and missing values — and see how they affect linear regression.

**Goals:**
- Learn how each type of imperfection alters model performance
- Visualize their impact
- Diagnose common data issues with Python
- Develop intuition for robustness

## 🧪 Step 1: Start with Perfectly Clean Data

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

np.random.seed(42)
n = 100
X = np.random.rand(n, 1) * 10
y = 2 * X.flatten() + 5
df = pd.DataFrame({'x': X.flatten(), 'y': y})
px.scatter(df, x='x', y='y', title='Perfectly Linear Data')

## 😈 Step 2: Add Noise

In [None]:
noise = np.random.normal(0, 2, size=n)
df['y_noise'] = df['y'] + noise
px.scatter(df, x='x', y='y_noise', title='Data with Noise')
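
To put a number on what the noise does, one option is to fit a regression to both the clean and the noisy target and compare the recovered coefficients with the true slope 2 and intercept 5. The sketch below regenerates the data so that it runs on its own; it uses a separate random generator, so the values differ from the cells above, and the names `fit_clean` and `fit_noisy` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate data like Steps 1-2 (own generator, so values differ from the cells above)
rng = np.random.default_rng(42)
n = 100
x = rng.random((n, 1)) * 10
y_true = 2 * x.ravel() + 5
y_noisy = y_true + rng.normal(0, 2, size=n)

# Fit the same model to the noise-free and the noisy target
fit_clean = LinearRegression().fit(x, y_true)
fit_noisy = LinearRegression().fit(x, y_noisy)

for name, fit, target in [("clean", fit_clean, y_true), ("noisy", fit_noisy, y_noisy)]:
    print(f"{name}: slope={fit.coef_[0]:.3f}, "
          f"intercept={fit.intercept_:.3f}, R^2={fit.score(x, target):.3f}")
```

On the clean target the fit recovers the coefficients essentially exactly. With noise, the estimates scatter around the true values and R² drops below 1, but the fit is not systematically pulled in one direction.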

## ❗ Step 3: Add Outliers

In [None]:
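df_outliers = df.copy()
# Shift every 10th point upward; draw exactly one offset per selected row
outlier_idx = df_outliers.index[::10]
df_outliers.loc[outlier_idx, 'y_noise'] += np.random.normal(20, 5, size=len(outlier_idx))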
px.scatter(df_outliers, x='x', y='y_noise', title='Data with Outliers')

## 🧠 Step 4: Fit Regression and See Effect of Outliers

In [None]:
from sklearn.linear_model import LinearRegression

# Baseline: noisy data without outliers
X_clean = df[['x']]
y_clean = df['y_noise']
model_clean = LinearRegression().fit(X_clean, y_clean)
y_pred_clean = model_clean.predict(X_clean)

# Same model fitted to the data containing outliers
X_outliers = df_outliers[['x']]
y_outliers = df_outliers['y_noise']
model_outliers = LinearRegression().fit(X_outliers, y_outliers)
y_pred_outliers = model_outliers.predict(X_outliers)

import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_outliers['x'], y=y_outliers, mode='markers', name='Data with Outliers'))
fig.add_trace(go.Scatter(x=df['x'], y=y_pred_clean, mode='lines', name='Fit (Clean)'))
fig.add_trace(go.Scatter(x=df_outliers['x'], y=y_pred_outliers, mode='lines', name='Fit (Outliers)'))
fig.update_layout(title='Effect of Outliers on Regression', xaxis_title='x', yaxis_title='y')
fig.show()
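
To build intuition for robustness, it can help to compare ordinary least squares with an estimator that down-weights large residuals. The sketch below uses scikit-learn's `HuberRegressor` with its default settings as one possible robust alternative; this notebook does not prescribe it. The data is regenerated with its own generator so the block runs standalone, and `ols_err` and `huber_err` are illustrative names for the mean absolute distance between each fitted line and the true line `y = 2x + 5`.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Noisy linear data with every 10th point shifted upward, as in Steps 2-3
rng = np.random.default_rng(0)
n = 100
x = rng.random((n, 1)) * 10
y_out = 2 * x.ravel() + 5 + rng.normal(0, 2, size=n)
outlier_idx = np.arange(0, n, 10)
y_out[outlier_idx] += rng.normal(20, 5, size=outlier_idx.size)

# Ordinary least squares vs. a robust fit (Huber loss, default epsilon)
ols = LinearRegression().fit(x, y_out)
huber = HuberRegressor().fit(x, y_out)

# Distance of each fitted line from the true line, averaged over a grid of x values
grid = np.linspace(0, 10, 50).reshape(-1, 1)
true_line = 2 * grid.ravel() + 5
ols_err = np.mean(np.abs(ols.predict(grid) - true_line))
huber_err = np.mean(np.abs(huber.predict(grid) - true_line))

print(f"OLS:   slope={ols.coef_[0]:.3f}, intercept={ols.intercept_:.3f}, error={ols_err:.3f}")
print(f"Huber: slope={huber.coef_[0]:.3f}, intercept={huber.intercept_:.3f}, error={huber_err:.3f}")
```

Because the Huber loss grows only linearly for large residuals, the shifted points have less influence and the robust line should stay closer to the true relationship.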

## 🔄 Step 5: Simulate Multicollinearity

In [None]:
# Create two highly correlated features
x1 = np.random.rand(n)
x2 = x1 * 0.95 + np.random.rand(n) * 0.05  # nearly a linear function of x1 (correlation above 0.99)
y_multi = 2 * x1 + 3 * x2 + np.random.normal(0, 0.1, n)  # true coefficients: 2 and 3, intercept 0

df_multi = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y_multi})
model_multi = LinearRegression().fit(df_multi[['x1', 'x2']], df_multi['y'])
model_multi.coef_, model_multi.intercept_
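
A single fit does not show how unstable the coefficients are, so a diagnostic is useful here. The sketch below computes the feature correlation and the variance inflation factor (VIF), defined as `1 / (1 - R^2)` from regressing each feature on the others, and then refits the model on bootstrap resamples to see how much the coefficients move. The helper `vif`, the 200 resamples, and the separate random generator are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate two nearly collinear features as in Step 5 (own generator)
rng = np.random.default_rng(1)
n = 100
x1 = rng.random(n)
x2 = 0.95 * x1 + 0.05 * rng.random(n)
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + rng.normal(0, 0.1, n)

# Diagnostic 1: pairwise correlation between the features
corr = np.corrcoef(x1, x2)[0, 1]

def vif(X, j):
    """Hypothetical helper: VIF of column j, from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

# Diagnostic 2: variance inflation factor per feature
vifs = [vif(X, j) for j in range(X.shape[1])]

# Diagnostic 3: spread of the coefficients across bootstrap resamples
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)  # sample rows with replacement
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)
coef_std = coefs.std(axis=0)        # spread of each coefficient
sum_std = coefs.sum(axis=1).std()   # spread of their sum

print(f"correlation(x1, x2) = {corr:.4f}")
print(f"VIFs = {np.round(vifs, 1)}")
print(f"std of coefficients = {np.round(coef_std, 3)}, std of their sum = {sum_std:.3f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign. The bootstrap shows the typical pattern: each coefficient varies a lot from resample to resample, while their sum is far more stable, because the data constrains the combined effect of `x1` and `x2` much better than it constrains either one alone.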

## ❓ What We Learn:
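- Outliers pull the least-squares fit toward them: shifting every 10th point up by about 20 raises the mean of `y_noise`, and with it the fitted line at the mean of `x`, by about 2
- Multicollinearity makes individual coefficients unstable: with `x1` and `x2` almost perfectly correlated, the estimates can drift noticeably from the true values 2 and 3 even though predictions remain accurate
- Regression is sensitive to 'dirty' data, so its assumptions must be checked before trusting the fit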
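
The introduction also lists missing values, which the steps above do not simulate. The sketch below is one possible illustration, assuming 20% of the target values go missing completely at random. It counts the gaps and then compares dropping incomplete rows with naively filling them with the column mean. The fill is shown only to demonstrate how a careless fix can bias the fit; the frame `df_missing` and the 20% rate are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noisy linear data as in Step 2 (own generator)
rng = np.random.default_rng(7)
n = 100
df_missing = pd.DataFrame({'x': rng.random(n) * 10})
df_missing['y'] = 2 * df_missing['x'] + 5 + rng.normal(0, 2, size=n)

# Hypothetical scenario: 20% of the y values are missing completely at random
missing_idx = rng.choice(n, size=20, replace=False)
df_missing.loc[missing_idx, 'y'] = np.nan

# Diagnose: count missing values per column
# (LinearRegression raises an error if NaNs are passed in directly)
print(df_missing.isna().sum())

# Option 1: drop incomplete rows
df_drop = df_missing.dropna()
fit_drop = LinearRegression().fit(df_drop[['x']], df_drop['y'])

# Option 2: fill the gaps with the column mean (naive)
df_mean = df_missing.assign(y=df_missing['y'].fillna(df_missing['y'].mean()))
fit_mean = LinearRegression().fit(df_mean[['x']], df_mean['y'])

print(f"drop rows: slope={fit_drop.coef_[0]:.3f}, intercept={fit_drop.intercept_:.3f}")
print(f"mean fill: slope={fit_mean.coef_[0]:.3f}, intercept={fit_mean.intercept_:.3f}")
```

Dropping rows keeps the estimate centered on the true slope at the cost of fewer observations. The mean-filled points all sit on a horizontal line at the average of `y`, which flattens the fitted slope.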

## ✅ Next Up: Assumptions and Diagnostics