
# START: Anjali Manju Gowda, Lexi Chen

# Decision Tree Regression Model


A Decision Tree is a strong next model after linear regression because:

- **Nonlinear patterns:** Housing prices often depend on *thresholds* (e.g., LivingArea > 2500 sqft) and *interactions* (e.g., size matters differently by neighborhood).
- **Handles feature interactions automatically:** Unlike OLS, we do not need to manually add interaction terms.
- **Interpretability:** Trees can be visualized and explained through decision rules.

We still predict `LogClosePrice` rather than `ClosePrice` because:
- `ClosePrice` is right-skewed, while `LogClosePrice` is closer to normally distributed.
- On the log scale, errors are measured in relative (percentage) terms rather than absolute dollars, so expensive homes do not dominate the fit.
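
To make the point about relative error concrete, here is a minimal sketch with made-up prices (not taken from the dataset). The same 10% overestimate produces very different dollar errors for a cheap and an expensive home, but nearly the same error on the log scale.

```python
import numpy as np

# Hypothetical prices, chosen only for illustration
actual = np.array([300_000.0, 3_000_000.0])
predicted = actual * 1.10  # both predictions are 10% too high

dollar_error = np.abs(predicted - actual)
log_error = np.abs(np.log(predicted) - np.log(actual))

print(dollar_error)  # the second home's dollar error is about 10x the first
print(log_error)     # both are approximately log(1.10)
```

A model trained on dollar error would be pulled toward the expensive home; on the log scale both misses count about equally.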


## Import Required Libraries

We import core libraries for data manipulation and numerical computation,
along with machine learning tools required to build and evaluate a
Decision Tree regression model.

- **pandas / numpy**: data handling and numerical operations  
- **DecisionTreeRegressor**: tree-based regression model  
- **Evaluation metrics**: R² and MAPE for performance measurement

In [None]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_percentage_error

## Load Training and Test Datasets

The dataset is split **temporally** to prevent information leakage.

- **Training set**: June 2025 – November 2025  
- **Test set**: December 2025  

This setup simulates a real-world forecasting scenario where
future prices are predicted using past data.
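
The notebook loads files that are already split, so the split itself is not shown. As a rough sketch of how such a temporal split could be produced, assuming a hypothetical `CloseDate` column (the actual column name is not given here) and toy data:

```python
import pandas as pd

# Toy data; `CloseDate` is an assumed column name, not confirmed by the notebook
sales = pd.DataFrame({
    "CloseDate": pd.to_datetime(
        ["2025-06-15", "2025-09-01", "2025-11-30", "2025-12-01", "2025-12-20"]
    ),
    "ClosePrice": [500_000, 620_000, 710_000, 680_000, 905_000],
})

cutoff = pd.Timestamp("2025-12-01")  # first day of the test month

train_part = sales[sales["CloseDate"] < cutoff]
test_part = sales[sales["CloseDate"] >= cutoff]

# Every training sale closes strictly before every test sale
assert train_part["CloseDate"].max() < test_part["CloseDate"].min()
print(len(train_part), len(test_part))
```

Splitting on a date cutoff, rather than sampling rows at random, keeps later sales out of the training set.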

In [None]:
train = pd.read_csv('/content/train_cleaned.csv')
test  = pd.read_csv('/content/test_cleaned.csv')

## Create LogClosePrice Target Variable

Housing prices are highly right-skewed.

We apply a natural logarithm transformation to:
- Stabilize variance
- Reduce the influence of extreme prices
- Improve model generalization

The model is trained on **log prices**, not raw prices.

In [None]:
train['LogClosePrice'] = np.log(train['ClosePrice'])
test['LogClosePrice']  = np.log(test['ClosePrice'])
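
`np.log` returns `-inf` for a price of zero and `NaN` for a negative or missing price. The cleaned files presumably contain none of these, but a defensive variant might look like the sketch below. `add_log_price` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def add_log_price(df: pd.DataFrame, price_col: str = "ClosePrice") -> pd.DataFrame:
    """Return a copy with LogClosePrice, keeping only rows with a positive price."""
    valid = df[df[price_col] > 0].copy()  # NaN > 0 is False, so missing prices are dropped too
    valid["LogClosePrice"] = np.log(valid[price_col])
    return valid

# Toy frame with one zero and one missing price
toy = pd.DataFrame({"ClosePrice": [450_000.0, 0.0, np.nan, 1_200_000.0]})
toy_logged = add_log_price(toy)
print(toy_logged)
```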

## ZIP-Level Location Encoding

One-hot encoding ZIP codes creates thousands of sparse binary columns, one per ZIP code.
A depth-limited tree can split on only a handful of them, so most ZIP codes would never be separated.

Instead, we encode location using the **median log sale price per ZIP code**
calculated from the training data only.

This provides a strong numeric signal for neighborhood price levels
while avoiding high dimensionality.

> ZipMedianPrice acts as a learned neighborhood price prior,
allowing the tree to focus on within-area structural differences.

In [None]:
zip_median_price = (
    train
    .groupby('PostalCode')['LogClosePrice']
    .median()
)

train['ZipMedianPrice'] = train['PostalCode'].map(zip_median_price)
test['ZipMedianPrice']  = test['PostalCode'].map(zip_median_price)

# Handle unseen ZIPs in test
global_median = train['LogClosePrice'].median()
test['ZipMedianPrice'] = test['ZipMedianPrice'].fillna(global_median)
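
A ZIP code with only one or two training sales yields a noisy median. The sketch below is a variation on the cell above, not the notebook's method: it wraps the encoding in two hypothetical helpers and also sends ZIP codes with fewer than `min_sales` training sales to the global median. The `min_sales` threshold and the toy data are assumptions for illustration.

```python
import pandas as pd

def fit_zip_encoding(train_df, zip_col="PostalCode", target_col="LogClosePrice", min_sales=2):
    """Learn per-ZIP medians from training rows only; drop ZIPs with too few sales."""
    grouped = train_df.groupby(zip_col)[target_col]
    medians = grouped.median()
    counts = grouped.count()
    return medians[counts >= min_sales], train_df[target_col].median()

def apply_zip_encoding(df, zip_medians, fallback, zip_col="PostalCode"):
    """Map ZIPs to learned medians; unseen or sparse ZIPs get the fallback."""
    return df[zip_col].map(zip_medians).fillna(fallback)

toy_train = pd.DataFrame({
    "PostalCode": ["90001", "90001", "90001", "90210", "90210", "91000"],
    "LogClosePrice": [13.0, 13.2, 13.4, 14.8, 15.0, 12.0],
})
toy_test = pd.DataFrame({"PostalCode": ["90001", "91000", "99999"]})

zip_medians, fallback = fit_zip_encoding(toy_train)
encoded = apply_zip_encoding(toy_test, zip_medians, fallback)
print(encoded)
```

Here `91000` has a single training sale and `99999` is unseen, so both receive the global training median. Both helpers read only training targets, so the test set still contributes nothing to the encoding.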

## Feature Selection

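We select features that:
- Capture **location** (`ZipMedianPrice`) and **property size** (`LivingArea`, bathroom and bedroom counts)
- Are numeric, so the tree can split on them directly without scaling or one-hot encoding
- Keep the feature set small, which limits the opportunities for spurious splits

The selected features balance interpretability and predictive power.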

In [None]:
tree_features = [
    'ZipMedianPrice',          # location (continuous, log scale)
    'LivingArea',              # property size
    'BathroomsTotalInteger',   # bathroom count
    'BedroomsTotal'            # bedroom count
]

X_train = train[tree_features]
y_train = train['LogClosePrice']

X_test  = test[tree_features]
y_test  = test['LogClosePrice']

## Define Improved Decision Tree Model

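We constrain the tree to improve generalization:

- **max_depth**: caps the number of successive splits, limiting tree complexity  
- **min_samples_leaf**: requires each leaf to hold a minimum number of training sales, preventing splits that fit noise  
- **criterion='absolute_error'**: splits minimize absolute rather than squared error, so each leaf predicts the median of its training targets and is less sensitive to extreme prices  

The first two settings regularize the tree; the third is a choice of loss function. Together they reduce overfitting while preserving nonlinear patterns.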

In [None]:
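tree_model = DecisionTreeRegressor(
    max_depth=6,                # at most 6 successive splits (up to 64 leaves)
    min_samples_leaf=100,       # every leaf holds at least 100 training sales
    criterion='absolute_error', # minimize absolute error; more robust to outliers
    random_state=42             # fixed seed for reproducibility
)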

## Train the Decision Tree Model

The model is trained using the selected features and
log-transformed target variable.

In [None]:
tree_model.fit(X_train, y_train)

## Model Evaluation (R² on Log Scale)

We evaluate model performance using **R² on the log price scale**:

- **Train R²**: measures how well the model fits training data  
- **Test R²**: measures generalization to unseen data  

A small train–test gap indicates good generalization.

In [None]:
train_pred = tree_model.predict(X_train)
test_pred  = tree_model.predict(X_test)

train_r2 = r2_score(y_train, train_pred)
test_r2  = r2_score(y_test, test_pred)

print(f"Train R² (log): {train_r2:.4f}")
print(f"Test  R² (log): {test_r2:.4f}")

Train R² (log): 0.8702
Test  R² (log): 0.8473
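
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below reproduces `r2_score` by hand on a few made-up log prices; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([12.5, 13.0, 13.4, 14.1])  # made-up log prices
y_pred = np.array([12.6, 12.9, 13.5, 13.8])  # made-up predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))
```

Because the targets are log prices, the reported R² describes variance explained in log price, not in dollars.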


## Percentage Error Metrics (MAPE & MdAPE)

After converting predictions back to dollars with `np.exp`, we compute:
- **MAPE**: average percentage error
- **MdAPE**: median percentage error (more robust to outliers)

These metrics are easier to interpret in real estate pricing contexts.

In [None]:
# Convert back to dollars
y_train_d = np.exp(y_train)
y_test_d  = np.exp(y_test)

train_pred_d = np.exp(train_pred)
test_pred_d  = np.exp(test_pred)

train_mape = mean_absolute_percentage_error(y_train_d, train_pred_d) * 100
test_mape  = mean_absolute_percentage_error(y_test_d, test_pred_d) * 100

train_mdape = np.median(np.abs((y_train_d - train_pred_d) / y_train_d)) * 100
test_mdape  = np.median(np.abs((y_test_d - test_pred_d) / y_test_d)) * 100

print(f"Train MAPE (%):  {train_mape:.2f}")
print(f"Train MdAPE (%): {train_mdape:.2f}")

print(f"Test  MAPE (%):  {test_mape:.2f}")
print(f"Test  MdAPE (%): {test_mdape:.2f}")

Train MAPE (%):  16.42
Train MdAPE (%): 11.90
Test  MAPE (%):  17.77
Test  MdAPE (%): 12.52
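
The gap between MAPE and MdAPE comes from a few large misses pulling the mean upward. A minimal sketch with made-up dollar prices shows the effect; `mape` and `mdape` are hypothetical helper names that mirror the formulas in the cell above.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def mdape(actual, predicted):
    """Median absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.median(np.abs((actual - predicted) / actual)) * 100

# Made-up prices: four predictions are off by 5%, one is off by 100%
actual = [400_000, 500_000, 600_000, 700_000, 800_000]
predicted = [420_000, 475_000, 630_000, 665_000, 1_600_000]

print(f"MAPE:  {mape(actual, predicted):.1f}%")
print(f"MdAPE: {mdape(actual, predicted):.1f}%")
```

A single bad prediction moves the mean a long way but leaves the median at the typical 5% error.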


## Interpret Decision Tree Rules

We extract human-readable decision rules from the trained tree.

This allows us to:
- Understand how price predictions are made
- Verify that splits align with domain intuition
- Increase model interpretability

In [None]:
from sklearn.tree import export_text

rules = export_text(
    tree_model,
    feature_names=tree_features
)

print(rules)

|--- ZipMedianPrice <= 13.78
|   |--- ZipMedianPrice <= 13.32
|   |   |--- ZipMedianPrice <= 13.07
|   |   |   |--- LivingArea <= 1690.50
|   |   |   |   |--- ZipMedianPrice <= 12.83
|   |   |   |   |   |--- LivingArea <= 1256.00
|   |   |   |   |   |   |--- value: [12.52]
|   |   |   |   |   |--- LivingArea >  1256.00
|   |   |   |   |   |   |--- value: [12.71]
|   |   |   |   |--- ZipMedianPrice >  12.83
|   |   |   |   |   |--- LivingArea <= 1206.50
|   |   |   |   |   |   |--- value: [12.75]
|   |   |   |   |   |--- LivingArea >  1206.50
|   |   |   |   |   |   |--- value: [12.90]
|   |   |   |--- LivingArea >  1690.50
|   |   |   |   |--- LivingArea <= 2185.00
|   |   |   |   |   |--- ZipMedianPrice <= 12.86
|   |   |   |   |   |   |--- value: [12.86]
|   |   |   |   |   |--- ZipMedianPrice >  12.86
|   |   |   |   |   |   |--- value: [13.06]
|   |   |   |   |--- LivingArea >  2185.00
|   |   |   |   |   |--- LivingArea <= 3165.50
|   |   |   |   |   |   |--- value: [13.18]
... (remaining rules truncated)
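
Leaf values in the rules are log prices. As a rough aid to reading them, the sketch below converts the leaf values printed above back to dollars. Because the tree was fit with `criterion='absolute_error'`, each leaf value is the median log price of the training sales in that leaf, so the converted figure approximates that leaf's median sale price.

```python
import numpy as np

# Leaf values copied from the printed rules above (log dollars)
leaf_log_values = [12.52, 12.71, 12.75, 12.90, 12.86, 13.06, 13.18]

# Undo the log transform to express each leaf prediction in dollars
leaf_dollars = [float(np.exp(v)) for v in leaf_log_values]

for log_value, dollars in zip(leaf_log_values, leaf_dollars):
    print(f"{log_value:.2f} -> ${dollars:,.0f}")
```

The same conversion applies to `ZipMedianPrice` thresholds, which are also on the log scale.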

## Final Model Summary

The improved Decision Tree:
- Outperforms the linear regression baseline (test R² of 0.8473 vs 0.7269)
- Captures nonlinear price behavior
- Uses ZIP-level price encoding instead of sparse dummies
- Generalizes well (train R² of 0.8702 vs test R² of 0.8473) while remaining interpretable through its rules

This model balances accuracy, robustness, and interpretability.

# Comparison of Models: Linear Regression vs Improved Decision Tree

## 1. Model Overview

| Model | Features | Target Variable | Key Characteristics |
|-------|---------|----------------|-------------------|
| Linear Regression (Baseline) | PostalCode (one-hot) | LogClosePrice | Simple linear model, only captures ZIP-level average effect, log-transform stabilizes variance |
| Decision Tree (Improved) | ZipMedianPrice, LivingArea, BathroomsTotalInteger, BedroomsTotal | LogClosePrice | Tree-based model, captures nonlinear interactions, uses ZIP-level median price encoding, regularized with max_depth=6 and min_samples_leaf=100 |

---

## 2. R² Performance (Log Scale)

| Model | Train R² | Test R² |
|-------|-----------|----------|
| Linear Regression | 0.7621 | 0.7269 |
| Decision Tree | 0.8702 | 0.8473 |

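**Analysis:**  
- The improved Decision Tree **outperforms Linear Regression** on both train and test sets.  
- It can capture **nonlinear relationships** between features and log-prices, but it also uses property-size features that the baseline lacks, so the higher explained variance reflects both the model type and the richer feature set.  
- Its train–test R² gap (0.0229) is smaller than the baseline's (0.0352), which indicates **good generalization**.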

---

## 3. Error Metrics (Original Dollar Scale)

| Model | Train MAPE (%) | Train MdAPE (%) | Test MAPE (%) | Test MdAPE (%) |
|-------|----------------|----------------|---------------|----------------|
| Linear Regression | 22.16 | 15.91 | 23.90 | 17.08 |
| Decision Tree | 16.42 | 11.90 | 17.77 | 12.52 |

**Analysis:**  
- The Decision Tree shows **lower percentage errors** on both sets, indicating better practical prediction accuracy.  
- Test MdAPE falls from 17.08% to 12.52%, so the tree predicts **typical house prices more accurately**. For both models MdAPE sits well below MAPE, which shows that a minority of large misses inflates the mean.
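
The size of the improvement can be expressed both in percentage points and relative to the baseline. The short sketch below does that arithmetic on the test-set figures copied from the table above.

```python
# Test-set error metrics copied from the table above (percent)
linear = {"mape": 23.90, "mdape": 17.08}
tree = {"mape": 17.77, "mdape": 12.52}

for metric in ("mape", "mdape"):
    absolute_drop = linear[metric] - tree[metric]            # percentage points
    relative_drop = absolute_drop / linear[metric] * 100     # percent of baseline error
    print(f"{metric}: -{absolute_drop:.2f} points ({relative_drop:.1f}% relative)")
```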

---

## 4. Interpretability

- **Linear Regression:** Easy to interpret, but with one-hot ZIP codes as its only features it can only assign each ZIP code its own average log-price.  
- **Decision Tree:** Extracted rules show how **location and property size interact** to determine prices.  
  - Example: Higher `ZipMedianPrice` + larger `LivingArea` → higher predicted log-price.  
  - Rules are human-readable and align with domain intuition.

---

## 5. Conclusion

The **improved Decision Tree model**:

- Explains **~85% of variance** in unseen data (Test R²).  
- Reduces test MdAPE to ~12.5%, compared with ~17% for Linear Regression.  
- Balances **accuracy, interpretability, and robustness**.  
- Clearly **outperforms the baseline Linear Regression** on all key metrics.  

> This Decision Tree serves as the new benchmark.

# END: Anjali Manju Gowda, Lexi Chen
