***Demo Prepared by Prof Monali Mavani***


This notebook explains how to interpret:
- Linear regression **coefficients (parameters)**
- **Evaluation metrics** (MAE, RMSE, R², Adjusted R², MAPE)
- and **explainability** of the model

The **California Housing dataset** is used for demonstration.


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, mean_absolute_percentage_error
import numpy as np
import pandas as pd


In [None]:
# Load data
data = fetch_california_housing()
X = data.data[:, [0]]  # Feature: MedInc
y = data.target        # Target: MedHouseVal
print(f"Description: {data.DESCR}")


Description: .. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using o

In [None]:
print(f"Data shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature names: {data.feature_names[0]}")
print(f"Target name: {data.target_names[0]}")
print(f"X[0]: {X[0]}")
print(f"y[0]: {y[0]}")

Data shape: (20640, 1)
Target shape: (20640,)
Feature names: MedInc
Target name: MedHouseVal
X[0]: [8.3252]
y[0]: 4.526


In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Fit simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


##  Model Parameters (Intercept and Coefficient)

In [None]:
# Print intercept and coefficient
intercept = model.intercept_
coefficient = model.coef_[0]
print(f"Intercept: {intercept:.4f}")
print(f"Coefficient (MedInc): {coefficient:.4f}")


Intercept: 0.4446
Coefficient (MedInc): 0.4193


**Interpretation:**

| Parameter                           | Raw value  | In **dollar** terms\*                           | Interpretation |
| ----------------------------------- | ---------- | ----------------------------------------------- | -------------- |
| **Intercept ( w₀ )**                | **0.4446** | 0.4446 × \$100,000 ≈ **\$44,460**               | The model's baseline prediction for median house value **if median income were zero**. This is an extrapolation of the fitted line, not a realistic price. |
| **Coefficient ( w₁ ) for `MedInc`** | **0.4193** | 0.4193 × \$100,000 ≈ **\$41,930** per **unit**  | A one-unit increase in `MedInc` (**\$10,000** in the dataset's units) raises the predicted median house value by about **\$41,930**. |

\* The California Housing target is in units of \$100,000, so multiply by 100,000 to get USD.
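
To make the table concrete, the fitted line can be evaluated by hand. The sketch below plugs the rounded intercept and coefficient printed above into ŷ = w₀ + w₁ · MedInc and converts the result to dollars. The helper name `predict_value_usd` and the example incomes of 5.0 and 6.0 are illustrative assumptions, not part of the original notebook.

```python
# Rounded parameters printed by the notebook for the single-feature model
INTERCEPT = 0.4446    # w0, in units of $100,000
COEFFICIENT = 0.4193  # w1, change in the target per unit of MedInc


def predict_value_usd(med_inc: float) -> float:
    """Hypothetical helper: predicted median house value in USD.

    med_inc is in the dataset's units (one unit = $10,000 of median income).
    """
    prediction = INTERCEPT + COEFFICIENT * med_inc  # still in $100,000s
    return prediction * 100_000


# Illustrative inputs: block groups with median income 5.0 and 6.0
value_at_5 = predict_value_usd(5.0)
value_at_6 = predict_value_usd(6.0)

print(f"Predicted value at MedInc=5.0: ${value_at_5:,.0f}")
print(f"Increase for one more unit:    ${value_at_6 - value_at_5:,.0f}")
```

With these rounded parameters, the prediction at `MedInc = 5.0` is roughly \$254,110, and each additional unit of income adds about \$41,930, which is the coefficient expressed in dollars.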

##  Evaluation Metrics

In [None]:
# Compute metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

# Adjusted R²
n = len(y_test)
p = X_train.shape[1]
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Display metrics
pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'RMSE', 'R²', 'Adjusted R²', 'MAPE'],
    'Value': [mae, mse, rmse, r2, r2_adj, mape]
})


,Metric,Value
0,MAE,0.629909
1,MSE,0.709116
2,RMSE,0.84209
3,R²,0.458859
4,Adjusted R²,0.458728
5,MAPE,0.390558


**Interpretation:**

| Metric          | Value    | Interpretation |
| --------------- | -------- | -------------- |
| **MAE**         | 0.629909 | On average, the model's predictions deviate from the actual median house value by about \$62,991. |
| **MSE**         | 0.709116 | The average of the squared prediction errors, in squared target units. It penalizes large errors more heavily than MAE. |
| **RMSE**        | 0.842090 | The typical magnitude of prediction error is about \$84,209. It is in the same units as the target variable. |
| **R²**          | 0.458859 | About **45.89%** of the variance in house values is explained by the median income feature. The rest of the variation is due to factors not included in the model. |
| **Adjusted R²** | 0.458728 | Very close to $R^2$ because there is only one predictor and the test set is large, so the penalty for feature count is negligible. It is most useful when comparing models with different numbers of features. |
| **MAPE**        | 0.390558 | On average, the model's predictions are off by **39.06%** of the actual value. For example, a \$200,000 house might be predicted as about \$122,000 or \$278,000. |
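
The definitions behind these metrics are easier to see when computed by hand. The following is a minimal sketch in plain Python on a small made-up example. The function name `regression_metrics` and the toy values are assumptions for illustration only, and the MAPE line assumes no true value is zero.

```python
import math


def regression_metrics(y_true, y_hat, p):
    """Hypothetical helper: compute the notebook's metrics from scratch.

    p is the number of predictors, used only for adjusted R².
    """
    n = len(y_true)
    errors = [t - h for t, h in zip(y_true, y_hat)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e ** 2 for e in errors) / n   # mean squared error
    rmse = math.sqrt(mse)                   # back in target units

    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    # Mean absolute percentage error, as a fraction (0.39 means 39%)
    mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "Adjusted R2": r2_adj, "MAPE": mape}


# Toy values (not taken from the California Housing data)
toy = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 5.0], p=1)
for name, value in toy.items():
    print(f"{name}: {value:.4f}")
```

On the toy data the single error of 1.0 contributes two thirds of the squared error but only half of the absolute error, which is why MSE and RMSE react more strongly to outliers than MAE.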


In [None]:
# Load data - all features (multiple linear regression)
data1 = fetch_california_housing()
X = data1.data  # all features
y = data1.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:

# Compute metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

# Adjusted R²
n = X_test.shape[0]
p = X_test.shape[1]
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Summary
metrics = pd.DataFrame({
    "Metric": ["MAE", "MSE", "RMSE", "R²", "Adjusted R²", "MAPE"],
    "Value": [mae, mse, rmse, r2, r2_adj, mape]
})
metrics


,Metric,Value
0,MAE,0.5332
1,MSE,0.555892
2,RMSE,0.745581
3,R²,0.575788
4,Adjusted R²,0.574964
5,MAPE,0.319522


| **Metric**      | **Single Feature** (`MedInc`) | **All Features** | **% Improvement**  | **Interpretation**                                                                 |
| --------------- | ----------------------------- | ---------------- | ------------------ | ---------------------------------------------------------------------------------- |
| **MAE**         | 0.6299                        | 0.5332           | **15.4%**        | Average prediction error reduced by \~\$9,670                                      |
| **MSE**         | 0.7091                        | 0.5559           | **21.6%**        | Fewer large errors; better overall fit                                             |
| **RMSE**        | 0.8421                        | 0.7456           | **11.5%**        | Typical error reduced by \~\$9,650                                                 |
| **R²**          | 0.4589                        | 0.5758           | **+11.7 points** | More variance in house prices is now explained                                     |
| **Adjusted R²** | 0.4587                        | 0.5750           | **+11.6 points** | Improved fit that accounts for feature count                                       |
| **MAPE**        | 0.3906                        | 0.3195           | **18.2%**        | Relative prediction error reduced by \~7 percentage points (e.g. the typical error on a \$200k house drops by \~\$14k) |
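
The relative changes in this comparison can be recomputed directly from the reported metric values. The sketch below assumes the rounded test-set numbers from the two output tables. The helper name `pct_reduction` is illustrative, and R² is reported as a difference in percentage points rather than a relative change.

```python
# Test-set metrics as printed in the two output tables above
single = {"MAE": 0.629909, "MSE": 0.709116, "RMSE": 0.842090, "MAPE": 0.390558}
full = {"MAE": 0.533200, "MSE": 0.555892, "RMSE": 0.745581, "MAPE": 0.319522}


def pct_reduction(before: float, after: float) -> float:
    """Hypothetical helper: relative reduction of an error metric, in percent."""
    return (before - after) / before * 100


improvement = {name: pct_reduction(single[name], full[name]) for name in single}
for name, pct in improvement.items():
    print(f"{name}: {pct:.2f}% lower with all features")

# R² is already a proportion, so the gain is stated in percentage points
r2_gain_points = (0.575788 - 0.458859) * 100
print(f"R² gain: {r2_gain_points:.1f} percentage points")
```

Error metrics are compared as relative reductions because lower is better, while R² is compared as an absolute difference because it is bounded above by 1.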


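Adding the remaining features improves every metric on the held-out test set.

MAPE dropped from ~39% to ~32%, which makes the model more suitable for business-level forecasting, although an average error of about a third of the house value is still substantial.

R² increased from ~46% to ~58%, showing that the model now explains a noticeably larger portion of the variance in housing prices.

Adjusted R² also increased, indicating that the gain is not merely an artifact of adding more features. Because all metrics are computed on the test set, the improvement also reflects better generalization rather than a closer fit to the training data.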
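
As a worked check of the adjusted R² values, the formula used in the code cells can be applied to the reported R² scores. This sketch assumes a test set of 4,128 rows (20% of the 20,640 instances) and 8 predictors for the full model. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for n samples and p predictors (same formula as the notebook)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


N_TEST = 4128  # assumed test-set size: 20% of 20,640 rows

adj_single = adjusted_r2(0.458859, N_TEST, p=1)  # MedInc only
adj_full = adjusted_r2(0.575788, N_TEST, p=8)    # all eight features

print(f"Single feature: adjusted R² = {adj_single:.6f}")
print(f"All features:   adjusted R² = {adj_full:.6f}")

# Size of the penalty relative to plain R² in each case
print(f"Penalty (single): {0.458859 - adj_single:.6f}")
print(f"Penalty (full):   {0.575788 - adj_full:.6f}")
```

With thousands of test rows and at most eight predictors, the penalty stays below 0.001 in both cases, so adjusted R² and R² tell nearly the same story here. The adjustment matters more when the number of features is large relative to the number of samples.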