## Section 1: What is EDA?

**Explanation**  
Exploratory Data Analysis (EDA) is typically the first step in a data science workflow. It helps you understand the dataset before modeling: you check distributions, relationships, and anomalies to avoid surprises later.

**Why important?**  
- Detect patterns and outliers.  
- Understand scale and variability.  
- Prevent data leakage (do EDA on training set only).  

**Common tools**  
- Scatter plots → relationships.
- Histograms → distribution.
- pandas.DataFrame.describe() → summary stats.
- Correlation matrices → linear relationships.
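
As a minimal sketch of the last two tools, the snippet below builds a small synthetic DataFrame and calls `describe()` and `corr()` on it. The column names `x` and `y`, the seed, and the values are illustrative assumptions and are not part of this notebook's dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y is roughly linear in x plus small noise
rng_toy = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 2.0 * x + rng_toy.normal(scale=1.0, size=x.size)
toy = pd.DataFrame({"x": x, "y": y})

summary = toy.describe()             # count, mean, std, min, quartiles, max
corr_xy = toy.corr().loc["x", "y"]   # Pearson correlation is the pandas default

print(summary)
print(f"corr(x, y) = {corr_xy:.3f}")
```

Because `y` is almost a straight line in `x` here, the correlation comes out very close to 1.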

**Questions**  
Why should EDA be done only on the training set?    
What does a correlation coefficient close to 1 mean?    


## Section 2: Heteroscedastic Data  
**Explanation**  
We simulate data where noise increases with X (heteroscedasticity). This is common in real-world scenarios (e.g., measurement error grows with magnitude).

In [None]:
# Requirements: numpy, pandas, matplotlib, scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
plt.rcParams['figure.dpi'] = 120
plt.rcParams['axes.grid'] = False
print('Imports OK')


#### Create the data (heteroscedastic, N=600)
Noise standard deviation increases linearly from 5 to 50 as x grows.
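
Written out, the noise model can be sketched as follows, where $x_{\min}$ and $x_{\max}$ denote the smallest and largest values of $x$, $\sigma_{\min} = 5$ and $\sigma_{\max} = 50$. The symbols $w(x)$ and $\varepsilon$ are notation introduced here for the rescaled position and the noise term.

$$ w(x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad \sigma(x) = \sigma_{\min} + (\sigma_{\max} - \sigma_{\min})\, w(x), \qquad Y = x + \varepsilon, \quad \varepsilon \sim \mathcal{N}\big(0, \sigma(x)^2\big) $$

At $x = x_{\min}$ the weight is $w = 0$, so $\sigma = 5$; at $x = x_{\max}$ the weight is $w = 1$, so $\sigma = 50$.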

In [None]:
N = 600
rng = np.random.default_rng(seed=42)
X = np.arange(1, N + 1, dtype=float).reshape(-1, 1)
w = (X.ravel() - X.min()) / (X.max() - X.min())
sigma_min, sigma_max = 5.0, 50.0
sigma = sigma_min + (sigma_max - sigma_min) * w
eps = rng.normal(loc=0.0, scale=sigma, size=N)
Y = X.ravel() + eps
X[:5].ravel(), Y[:5], sigma[:5]  # peek

**Task**  
- Plot sigma vs X. What do you observe?
- Why does increasing noise make modeling harder?

## Section 3: Train/Test Split

**Explanation** 
We split data into training and test sets to evaluate generalization.  
*Rule:* Never peek at test data during training or EDA.

In [None]:
X_train, X_test, y_train, y_test, sigma_train, sigma_test = train_test_split(
    X, Y, sigma, test_size=0.30, random_state=123, shuffle=True
)
len(X_train), len(X_test)
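
A quick way to convince yourself that the two sets do not share rows is to split an array of row indices and check for overlap. The sketch below uses a hypothetical toy array of 20 indices and a 25% test share, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy row indices (hypothetical), standing in for the rows of a dataset
idx = np.arange(20)
idx_train, idx_test = train_test_split(
    idx, test_size=0.25, random_state=0, shuffle=True
)

# Every row should land in exactly one of the two sets
overlap = np.intersect1d(idx_train, idx_test)
print(len(idx_train), len(idx_test), overlap.size)
```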

**Question**  
What happens if you do EDA on the full dataset?

## Section 4: EDA on training data  
**Explanation**  
Scatter plots show relationships. Histograms reveal distribution and skew. describe() gives quick stats.

In [None]:
# Build a small DataFrame for TRAIN
df_train = pd.DataFrame({'X': X_train.ravel(), 'y': y_train})

# 4A) Scatter: X vs y (train)
plt.figure(figsize=(6.8, 4.2))
plt.scatter(df_train['X'], df_train['y'], s=10, alpha=0.7, color='tab:blue')
plt.xlabel('X [train]')
plt.ylabel('y [train]')
plt.title('Training data: y vs X')
plt.tight_layout(); plt.show()

# 4B) Histogram of y (train)
plt.figure(figsize=(6.2, 4.0))
df_train['y'].plot(kind='hist', bins=30, color='tab:green', alpha=0.85, edgecolor='white')
plt.xlabel('y [train]')
plt.ylabel('count')
plt.title('Histogram: y (train)')
plt.tight_layout(); plt.show()

# 4C) Simple statistical description with pandas
display(df_train.describe())

# 4D) Quick correlation (train)
corr = df_train[['X','y']].corr().loc['X','y']
print(f'TRAIN correlation corr(X,y) = {corr:.3f}')


**Task**  
- Interpret the five-number summary.
- Explain the meaning of the reported correlation.
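
For reference, the five-number summary is the subset of `describe()` made up of the minimum, the quartiles, and the maximum. A minimal sketch on a hypothetical toy series (values chosen so the quartiles are easy to check by hand):

```python
import pandas as pd

# Toy values (hypothetical), not the notebook's training data
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="y")
desc = s.describe()

# Five-number summary: min, Q1, median, Q3, max
five = desc[["min", "25%", "50%", "75%", "max"]]
print(five)
```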

## Section 5: Fit Linear Model

**Explanation**  
We use LinearRegression from scikit-learn (OLS method).  

`scikit-learn` is a widely used Python library for machine learning.
- Provides tools for regression, classification, clustering, preprocessing, and model evaluation.
- `LinearRegression` implements Ordinary Least Squares (OLS) for fitting a linear model: 
  $$ \hat{y} = aX + b $$
- Assumes a linear relationship between predictors and response.
- In this exercise, we use it to fit a simple 1D linear model on the training data.
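
For a single predictor, the OLS solution has the closed form $a = \mathrm{Cov}(x, y) / \mathrm{Var}(x)$ and $b = \bar{y} - a\,\bar{x}$. The sketch below checks this against `LinearRegression` on hypothetical toy data; the names `a_manual`, `b_manual` and `model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): line y = 3x + 1 plus Gaussian noise
rng_ols = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 40)
y = 3.0 * x + 1.0 + rng_ols.normal(scale=0.5, size=x.size)

# Closed-form OLS for one predictor
a_manual = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b_manual = y.mean() - a_manual * x.mean()

# Same fit with scikit-learn (expects a 2D feature array)
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(f"manual:  a = {a_manual:.4f}, b = {b_manual:.4f}")
print(f"sklearn: a = {model.coef_[0]:.4f}, b = {model.intercept_:.4f}")
```

Both routes should agree up to floating-point rounding.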

In [None]:
lin = LinearRegression().fit(X_train, y_train)
a, b = lin.coef_[0], lin.intercept_
print(f'Fit on TRAIN: Y_hat = {a:.3f} * X + {b:.3f}')

**Question**  
Why does OLS minimize squared errors?

## Section 6: Evaluate on Test Set
**Explanation**  
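- OP plot (Observed vs Predicted): observed values on the y-axis, predicted on the x-axis; this is the appropriate layout for checking bias.
- PO plot (Predicted vs Observed): swapping the axes changes the fitted slope and intercept, so it is misleading for bias checks.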
- Residual plots reveal heteroscedasticity.

In [None]:
yhat_test = lin.predict(X_test)
yhat_test[:5]  # peek

### Plot Interpretation
- **OP plot (Observed vs Predicted):**
  - Observed values on y-axis, predicted on x-axis.
  - Overlay 1:1 line: ideal predictions fall on this line.
  - OP regression line close to 1:1 indicates unbiased predictions.
- **PO plot (Predicted vs Observed):**
  - Swapping axes distorts slope/intercept; misleading for bias checks.
- **Residual plots:**
  - Residuals vs predicted (correct): should center around 0; spread pattern reveals heteroscedasticity.
  - Residuals vs true (pitfall): can show spurious trends because the residual e = y - ŷ contains the same noise as y, so the two are correlated by construction.


#### OP plot (Observed vs Predicted) — **correct** evaluation on TEST
Scatter **y** (y-axis) vs **ŷ** (x-axis); overlay the 1:1 line and the OP regression line.

In [None]:
def fit_line(x, y):
    lr = LinearRegression().fit(x.reshape(-1,1), y)
    return lr.coef_[0], lr.intercept_, lr.predict(x.reshape(-1,1))

slope_op, intercept_op, y_op_line = fit_line(yhat_test, y_test)
xy_min = min(y_test.min(), yhat_test.min())
xy_max = max(y_test.max(), yhat_test.max())
grid = np.linspace(xy_min, xy_max, 100)

plt.figure(figsize=(6.2, 4.4))
plt.scatter(yhat_test, y_test, s=20, alpha=0.7, label='test points')
plt.plot(grid, grid, 'k--', lw=1.3, label='1:1')
plt.plot(yhat_test, y_op_line, color='tab:green', lw=2, label=f'OP fit: y = {slope_op:.2f}·ŷ + {intercept_op:.1f}')
plt.xlabel('Predicted (ŷ) [test]')
plt.ylabel('Observed (y) [test]')
plt.title('Evaluation (OP, correct) on TEST — heteroscedastic')
plt.legend(); plt.tight_layout(); plt.show()


### Performance Indicators Explained
- **MAE (Mean Absolute Error):** Average absolute difference between predictions and true values.
  - Easy to interpret; less sensitive to outliers than MSE.
- **MSE (Mean Squared Error):** Average squared difference between predictions and true values.
  - Penalizes large errors more heavily.
- **R² (Coefficient of Determination):** Proportion of variance in observed values explained by predictions.
  - At most 1 (higher is better); typically falls between 0 and 1.
  - Can be negative if the model performs worse than always predicting the mean of the observed values.
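
The definitions above can be checked by hand. The following sketch computes the three metrics with NumPy on a hypothetical four-point example and compares them with scikit-learn; it also shows a deliberately bad constant prediction (`y_bad`, an illustrative name) that yields a negative R².

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (hypothetical)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae_manual = np.mean(np.abs(y_true - y_pred))
mse_manual = np.mean((y_true - y_pred) ** 2)
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# A constant prediction far from the mean does worse than predicting the mean
y_bad = np.full_like(y_true, 100.0)
r2_bad = r2_score(y_true, y_bad)

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
print(r2_bad)
```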


#### Performance metrics on TEST (scikit-learn)
Compute **MAE**, **MSE**, and **R²** on the **test set**.

In [None]:
mae = mean_absolute_error(y_test, yhat_test)
mse = mean_squared_error(y_test, yhat_test)
r2  = r2_score(y_test, yhat_test)
print(f'MAE (test): {mae:.3f}')
print(f'MSE (test): {mse:.3f}')
print(f'R^2 (test): {r2:.3f}')


## Section 7: Residual Analysis
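Before turning to residuals, we plot PO (Predicted vs Observed) for comparison with the OP plot. Swapping the axes changes the fitted slope and intercept even though the squared correlation $R^2$ between $y$ and $\hat{y}$ is unchanged, so judging bias from a PO plot can lead to wrong conclusions.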
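
To see why the two slopes differ, note that regressing observed on predicted gives slope $\mathrm{Cov}(\hat{y}, y)/\mathrm{Var}(\hat{y})$, while regressing predicted on observed gives $\mathrm{Cov}(\hat{y}, y)/\mathrm{Var}(y)$; their product is the squared correlation. A minimal sketch on hypothetical data, where the predictions are unbiased by construction (all names ending in `_demo` are illustrative):

```python
import numpy as np

rng_demo = np.random.default_rng(2)
y_hat_demo = np.linspace(0.0, 100.0, 200)   # hypothetical predictions
# Observations scatter around the predictions with no bias
y_obs_demo = y_hat_demo + rng_demo.normal(scale=20.0, size=y_hat_demo.size)

cov_demo = np.cov(y_hat_demo, y_obs_demo, ddof=1)[0, 1]
slope_op_demo = cov_demo / np.var(y_hat_demo, ddof=1)   # observed regressed on predicted
slope_po_demo = cov_demo / np.var(y_obs_demo, ddof=1)   # predicted regressed on observed
r_demo = np.corrcoef(y_hat_demo, y_obs_demo)[0, 1]

print(f"OP slope = {slope_op_demo:.3f}")
print(f"PO slope = {slope_po_demo:.3f}")
print(f"product  = {slope_op_demo * slope_po_demo:.3f}, r^2 = {r_demo**2:.3f}")
```

The OP slope stays near 1, whereas the PO slope is pulled below 1 by the noise in the observations, even though the predictions are unbiased.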

In [None]:
slope_po, intercept_po, y_po_line = fit_line(y_test, yhat_test)

plt.figure(figsize=(6.2, 4.4))
plt.scatter(y_test, yhat_test, s=20, alpha=0.7, label='test points')
plt.plot(grid, grid, 'k--', lw=1.3, label='1:1')
plt.plot(y_test, y_po_line, color='tab:orange', lw=2, label=f'PO fit: ŷ = {slope_po:.2f}·y + {intercept_po:.1f}')
plt.xlabel('Observed (y) [test]')
plt.ylabel('Predicted (ŷ) [test]')
plt.title('Evaluation (PO, pitfall) on TEST — heteroscedastic')
plt.legend(); plt.tight_layout(); plt.show()


#### Residual plots on TEST **with fitted trend lines**
- **Correct**: residuals vs predicted (ŷ) — center should be ~0; spread increases.
- **Pitfall**: residuals vs true (y) — misleading structure may appear since y contains noise.
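
The difference between the two residual plots can be illustrated numerically. For an OLS fit with intercept, evaluated on the same data it was fitted on, the residuals are exactly uncorrelated with the fitted values, while their covariance with the observed values equals the residual variance. On a separate test set these identities hold only approximately. The sketch below uses hypothetical toy data and illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): linear signal plus Gaussian noise
rng_res = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 300)
y = 2.0 * x + rng_res.normal(scale=3.0, size=x.size)

fit = LinearRegression().fit(x.reshape(-1, 1), y)
y_fit = fit.predict(x.reshape(-1, 1))
e = y - y_fit   # in-sample residuals

cov_e_fit = np.cov(e, y_fit, ddof=1)[0, 1]   # ~0: no trend against fitted values
cov_e_obs = np.cov(e, y, ddof=1)[0, 1]       # equals var(e): built-in upward trend
var_e = np.var(e, ddof=1)

print(f"cov(e, y_fit) = {cov_e_fit:.2e}")
print(f"cov(e, y)     = {cov_e_obs:.3f}, var(e) = {var_e:.3f}")
```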

In [None]:
res_test = y_test - yhat_test

# Residuals vs Predicted (correct) + fitted trend line
slope_rp, intercept_rp, res_fit_pred = fit_line(yhat_test, res_test)
plt.figure(figsize=(6.2, 3.8))
plt.scatter(yhat_test, res_test, s=18, alpha=0.7, label='residuals')
plt.axhline(0, color='k', lw=1)
plt.plot(yhat_test, res_fit_pred, color='tab:red', lw=2, label=f'fit: e = {slope_rp:.3f}·ŷ + {intercept_rp:.2f}')
plt.xlabel('Predicted (ŷ) [test]')
plt.ylabel('Residual e = y - ŷ [test]')
plt.title('Residuals vs Predicted (correct) — with fitted trend')
plt.legend(); plt.tight_layout(); plt.show()

# Residuals vs True (pitfall) + fitted trend line
slope_rt, intercept_rt, res_fit_true = fit_line(y_test, res_test)
plt.figure(figsize=(6.2, 3.8))
plt.scatter(y_test, res_test, s=18, alpha=0.7, label='residuals')
plt.axhline(0, color='k', lw=1)
plt.plot(y_test, res_fit_true, color='tab:purple', lw=2, label=f'fit: e = {slope_rt:.3f}·y + {intercept_rt:.2f}')
plt.xlabel('Observed (y) [test]')
plt.ylabel('Residual e = y - ŷ [test]')
plt.title('Residuals vs True (pitfall) — with fitted trend')
plt.legend(); plt.tight_layout(); plt.show()


**Questions**  
- Why is residual vs true misleading?