
#START:LEXI CHEN

# Decision Tree Regression Model


A Decision Tree is a strong next model after linear regression because:

- **Nonlinear patterns:** Housing prices often depend on *thresholds* (e.g., LivingArea > 2500 sqft) and *interactions* (e.g., size matters differently by neighborhood).
- **Handles feature interactions automatically:** Unlike OLS, we do not need to manually add interaction terms.
- **Interpretability:** Trees can be visualized and explained through decision rules.

We will still predict `LogClosePrice` (instead of `ClosePrice`) because:
- `ClosePrice` is right-skewed, while `LogClosePrice` is closer to normal.
- Log scale makes the model focus more on relative/percentage error rather than absolute dollars.
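
To make the relative-error point concrete, here is a minimal sketch with two made-up sale prices (not taken from the dataset). Both are over-predicted by 10%, so the dollar errors differ by a factor of ten while the log-scale errors are identical.

```python
import numpy as np

# Two hypothetical sales, each over-predicted by 10%.
actual = np.array([200_000.0, 2_000_000.0])
predicted = actual * 1.10

# Dollar errors scale with the price of the home.
abs_error = predicted - actual

# Log errors depend only on the ratio predicted / actual.
log_error = np.log(predicted) - np.log(actual)

print(abs_error)
print(log_error)
```

A model trained on `LogClosePrice` therefore penalizes a 10% miss equally on a starter home and on a luxury home.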


In [None]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

## Load the Train/Test Datasets

Test set: the last month of data provided: December 2025

Training set: a minimum of 6 months of data prior to the test month (June 2025 to November 2025)

In [None]:
test = pd.read_csv('/content/test_cleaned.csv')
train = pd.read_csv('/content/train_cleaned.csv')

## Create Log of Close Price

We log-transform `ClosePrice` to:

- Reduce right skew  
- Improve model stability  
- Improve predictive performance  

In [None]:
train['LogClosePrice'] = np.log(train['ClosePrice'])
test['LogClosePrice']  = np.log(test['ClosePrice'])

## Decision Tree Predictor Selection

Before fitting the Decision Tree model, we select predictors based on:

- Statistical signal strength (single-variable $R^2$)
- Real estate domain knowledge
- Overfitting risk
- Model interpretability

The goal is to explain variation in `LogClosePrice`.

---

### 1. Keep Strong Location Features

Highest single-variable $R^2$ values:

$$
R^2_{\text{PostalCode}} = 0.7621
\quad
R^2_{\text{City}} = 0.6917
\quad
R^2_{\text{MLSAreaMajor}} = 0.4690
$$

Location dominates housing prices:

$$
\text{Price} = f(\text{Location}, \text{Size}, \text{Quality})
$$

Selected:

- `PostalCode`
- `City`
- `MLSAreaMajor`
- `HighSchoolDistrict`

---

### 2. Keep Strong Structural Features

$$
R^2_{\text{LivingArea}} = 0.3260
\quad
R^2_{\text{BathroomsTotalInteger}} = 0.2849
$$

Selected:

- `LivingArea`
- `BathroomsTotalInteger`
- `BedroomsTotal`

---

### Exclude Weak Features

Features with:

$$
R^2 < 0.05
$$

are excluded to reduce noise and overfitting.
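
The notebook does not show how the single-variable $R^2$ values were computed. One plausible way, sketched below, uses a hypothetical helper named `single_feature_r2`. For a numeric feature, the $R^2$ of a one-variable linear fit is the squared Pearson correlation. For a categorical feature, a one-hot OLS fit predicts each group's mean, so $R^2$ is one minus the within-group share of variance. The toy frame is made up for illustration.

```python
import pandas as pd

def single_feature_r2(df, feature, target):
    """R^2 of a one-variable OLS fit of `target` on `feature` (hypothetical helper)."""
    y = df[target]
    if pd.api.types.is_numeric_dtype(df[feature]):
        # Simple linear regression: R^2 equals the squared Pearson correlation.
        return df[feature].corr(y) ** 2
    # Categorical: OLS on one-hot dummies predicts the mean of each group.
    fitted = df.groupby(feature)[target].transform('mean')
    ss_resid = ((y - fitted) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    return 1 - ss_resid / ss_total

# Made-up rows, only to exercise both branches.
toy = pd.DataFrame({
    'City': ['A', 'A', 'B', 'B'],
    'LivingArea': [1000, 1200, 2000, 2400],
    'LogClosePrice': [12.0, 12.2, 13.0, 13.4],
})

r2_city = single_feature_r2(toy, 'City', 'LogClosePrice')
r2_area = single_feature_r2(toy, 'LivingArea', 'LogClosePrice')
print(r2_city, r2_area)
```

Code-like columns such as `PostalCode` would need to be cast to strings first. Otherwise this helper would treat them as ordinary numbers.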


In [None]:
tree_features = [
    'PostalCode',
    'City',
    'MLSAreaMajor',
    'HighSchoolDistrict',
    'LivingArea',
    'BathroomsTotalInteger',
    'BedroomsTotal'
]

X_train_raw = train[tree_features]
y_train = train['LogClosePrice']

X_test_raw = test[tree_features]
y_test = test['LogClosePrice']

In [None]:
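# scikit-learn decision trees need numeric input, so one-hot encode the
# categorical columns. Columns already stored as numbers pass through unchanged.
X_train = pd.get_dummies(X_train_raw, drop_first=True)
X_test  = pd.get_dummies(X_test_raw, drop_first=True)


# Align the test columns to the training columns: categories seen only in the
# test set are dropped, and training categories absent from it are filled with 0.
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)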
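
The encoding step has a few behaviors worth checking, illustrated below on made-up toy frames. First, `pd.get_dummies` leaves numeric columns untouched, so if `PostalCode` was parsed as an integer (an assumption, since it depends on the CSV) the tree sees it as an ordered number rather than as categories. Second, because `drop_first=True` is applied to each frame separately, the two frames can drop different baseline categories. Third, encoding without `drop_first` and reindexing to the training columns avoids that mismatch, and trees do not need a dropped baseline.

```python
import pandas as pd

# Toy frames: 'A' appears only in train, 'D' only in test.
toy_train = pd.DataFrame({'City': ['A', 'B', 'C'], 'PostalCode': [92501, 92502, 92503]})
toy_test = pd.DataFrame({'City': ['B', 'D'], 'PostalCode': [92502, 92599]})

# Same pattern as the notebook cell above.
enc_train = pd.get_dummies(toy_train, drop_first=True)
enc_test = pd.get_dummies(toy_test, drop_first=True)
enc_train, enc_test = enc_train.align(enc_test, join='left', axis=1, fill_value=0)

print(list(enc_train.columns))  # numeric PostalCode is passed through as-is
print(enc_test)                 # the test row with City 'B' ends up with City_B == 0

# Alternative sketch: no drop_first, then reindex test to the training columns.
safe_train = pd.get_dummies(toy_train)
safe_test = pd.get_dummies(toy_test).reindex(columns=safe_train.columns, fill_value=0)
print(safe_test)
```

In the first version the test row for city `B` loses its indicator, because `B` is the first category within the test frame and is dropped there. Whether this affects the real data depends on whether both files share the same first category in every encoded column.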

In [None]:
tree_model = DecisionTreeRegressor(
    max_depth=5,
    random_state=42
)

tree_model.fit(X_train, y_train)
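
A binary tree of depth 5 has at most $2^5 = 32$ leaves, so the fitted model can output at most 32 distinct log prices regardless of how many one-hot columns it is given. The sketch below checks this on synthetic data (random features and targets, not the housing data) using the estimator's `get_depth` and `get_n_leaves` methods.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 3))
y = rng.normal(size=2000)

shallow = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X, y)

depth = shallow.get_depth()
n_leaves = shallow.get_n_leaves()
n_distinct = len(np.unique(shallow.predict(X)))  # one constant value per leaf

print(depth, n_leaves, n_distinct)
```

The same two calls on `tree_model` would report how much of the depth budget the housing tree actually used.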

In [None]:
from sklearn.tree import export_text

tree_rules = export_text(tree_model, feature_names=list(X_train.columns))
print(tree_rules)

|--- LivingArea <= 2297.50
|   |--- LivingArea <= 1594.50
|   |   |--- LivingArea <= 1107.50
|   |   |   |--- HighSchoolDistrict_Other <= 0.50
|   |   |   |   |--- HighSchoolDistrict_Rim of the World <= 0.50
|   |   |   |   |   |--- value: [13.28]
|   |   |   |   |--- HighSchoolDistrict_Rim of the World >  0.50
|   |   |   |   |   |--- value: [12.70]
|   |   |   |--- HighSchoolDistrict_Other >  0.50
|   |   |   |   |--- MLSAreaMajor_699 - Not Defined <= 0.50
|   |   |   |   |   |--- value: [13.34]
|   |   |   |   |--- MLSAreaMajor_699 - Not Defined >  0.50
|   |   |   |   |   |--- value: [13.73]
|   |   |--- LivingArea >  1107.50
|   |   |   |--- MLSAreaMajor_699 - Not Defined <= 0.50
|   |   |   |   |--- MLSAreaMajor_SRCAR - Southwest Riverside County <= 0.50
|   |   |   |   |   |--- value: [13.51]
|   |   |   |   |--- MLSAreaMajor_SRCAR - Southwest Riverside County >  0.50
|   |   |   |   |   |--- value: [13.16]
|   |   |   |--- MLSAreaMajor_699 - Not Defined >  0.50
|   |   |   |   

#END: LEXI CHEN

#START : Anjali Manju Gowda


**Predicting with Decision Tree**
- Predict the log of closing prices (`LogClosePrice`) using the trained Decision Tree model.
- `dt_train_pred_log`: predictions on the **training set**.
- `dt_test_pred_log`: predictions on the **test set**.
- Predictions are still in **log scale**.

In [None]:
dt_train_pred_log = tree_model.predict(X_train)
dt_test_pred_log  = tree_model.predict(X_test)

**Compute R² (log scale)**
- Computes **R² (coefficient of determination)** for both training and test sets using log-transformed prices.
- R² measures the proportion of variance explained by the model (1 = perfect, 0 = none).
- Observed values:
  - **Train R² (log): 0.4061**
  - **Test R² (log): 0.4008**
- Train and test R² are nearly identical and both low, so the model is **underfitting** rather than overfitting.


In [None]:
from sklearn.metrics import r2_score

dt_train_r2_log = r2_score(y_train, dt_train_pred_log)
dt_test_r2_log  = r2_score(y_test, dt_test_pred_log)

print(f"Decision Tree Train R² (log): {dt_train_r2_log:.4f}")
print(f"Decision Tree Test  R² (log): {dt_test_r2_log:.4f}")

Decision Tree Train R² (log): 0.4061
Decision Tree Test  R² (log): 0.4008


**Convert Predictions to Original Scale**
- Converts **log-scale predictions** back to original house prices using `np.exp`, the inverse of the `np.log` used to create `LogClosePrice`.
- Also converts the true values to original scale for later error calculations.
- Needed to interpret predictions in **dollars**.

In [None]:
# np.exp inverts np.log. np.expm1 inverts np.log1p and would return prices $1 too low.
dt_train_pred_orig = np.exp(dt_train_pred_log)
dt_test_pred_orig  = np.exp(dt_test_pred_log)

y_train_orig = np.exp(y_train)
y_test_orig  = np.exp(y_test)
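
The back-transformation has to use the inverse of whichever log was applied. `LogClosePrice` was created with `np.log`, so its inverse is `np.exp`. `np.expm1` is the inverse of `np.log1p`. The round-trip check below uses two made-up prices to show the difference.

```python
import numpy as np

# Hypothetical prices, for illustration only.
price = np.array([350_000.0, 1_250_000.0])
log_price = np.log(price)

roundtrip_exp = np.exp(log_price)      # recovers the original prices
roundtrip_expm1 = np.expm1(log_price)  # computes exp(x) - 1, so it is $1 short

print(roundtrip_exp - price)
print(roundtrip_expm1 - price)
```

At typical house prices a $1 offset barely moves percentage-error metrics, but pairing `np.log` with `np.exp` (or `np.log1p` with `np.expm1`) keeps the transform exact.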

**Percentage Error Function**
- Defines a function to compute **percentage error**:
$$
\text{Percentage Error} = \frac{\text{Actual} - \text{Predicted}}{\text{Actual}} \times 100
$$
- Ignores zero values in `y_true` to avoid division errors.
- Normalizes errors to make them interpretable as percentages.


In [None]:
def percentage_error(y_true, y_pred):
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    mask = y_true != 0
    return (y_true[mask] - y_pred[mask]) / y_true[mask] * 100

**Mean Absolute Percentage Error (MAPE)**
- Computes **MAPE**: the mean of the absolute percentage errors.
- Shows the **average deviation** of predictions as a percentage of actual values.
- Sensitive to outliers: a few large percentage misses, which are most likely on very cheap homes where the denominator is small, can pull the mean up.


In [None]:
def mean_abs_percentage_error(y_true, y_pred):
    return np.mean(np.abs(percentage_error(y_true, y_pred)))

**Median Absolute Percentage Error (MdAPE)**
- Computes **MdAPE**: median of absolute percentage errors.
- Robust to outliers; indicates a **typical prediction error**.

---

In [None]:
def median_abs_percentage_error(y_true, y_pred):
    return np.median(np.abs(percentage_error(y_true, y_pred)))
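
To see how the two summaries differ, the sketch below applies the same percentage-error logic to four made-up sales. Three are off by 10% and one low-priced home is off by 200%. The compact helper `abs_pct_error` is a hypothetical stand-in for the functions defined above, so that the example runs on its own.

```python
import numpy as np

def abs_pct_error(y_true, y_pred):
    """Absolute percentage errors, skipping rows where the actual value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]) * 100

# Hypothetical sales: the last one is a cheap home with a large miss.
actual = [400_000, 500_000, 600_000, 100_000]
predicted = [440_000, 450_000, 660_000, 300_000]

ape = abs_pct_error(actual, predicted)
mape = ape.mean()
mdape = np.median(ape)

print(f"MAPE:  {mape:.1f}%")
print(f"MdAPE: {mdape:.1f}%")
```

The single 200% miss lifts MAPE to 57.5% while MdAPE stays at 10%, which is why both are reported side by side.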

**Compute Decision Tree MAPE and MdAPE**
- Calculates **MAPE and MdAPE** for both training and test sets on the original dollar scale.
- Results:
  - **Train MAPE:** 40.88%, **MdAPE:** 30.45%
  - **Test MAPE:** 41.65%, **MdAPE:** 30.43%
- These errors are well above those of the baseline Linear Regression (test MAPE 23.90%, MdAPE 17.08%), so the Decision Tree performs **worse than the baseline**.


In [None]:
dt_train_mape  = mean_abs_percentage_error(y_train_orig, dt_train_pred_orig)
dt_train_mdape = median_abs_percentage_error(y_train_orig, dt_train_pred_orig)

dt_test_mape  = mean_abs_percentage_error(y_test_orig, dt_test_pred_orig)
dt_test_mdape = median_abs_percentage_error(y_test_orig, dt_test_pred_orig)

print(f"Decision Tree Train MAPE (%):  {dt_train_mape:.2f}")
print(f"Decision Tree Train MdAPE (%): {dt_train_mdape:.2f}")
print(f"Decision Tree Test  MAPE (%):  {dt_test_mape:.2f}")
print(f"Decision Tree Test  MdAPE (%): {dt_test_mdape:.2f}")

Decision Tree Train MAPE (%):  40.88
Decision Tree Train MdAPE (%): 30.45
Decision Tree Test  MAPE (%):  41.65
Decision Tree Test  MdAPE (%): 30.43


**Compare Models (R²)**
- Creates a comparison table of **Test R² (log scale)** for:
  - **Baseline Linear Regression (PostalCode):** 0.7269  
  - **Decision Tree:** 0.4008
- Quick check of **which model explains more variance** on unseen data.
- Linear Regression clearly **outperforms the Decision Tree** in this scenario.


In [None]:
comparison = pd.DataFrame({
    'Model': ['Linear Regression (PostalCode)', 'Decision Tree'],
    'Test R² (log scale)': [0.7269, dt_test_r2_log]
})

comparison

|   | Model | Test R² (log scale) |
|---|-------|---------------------|
| 0 | Linear Regression (PostalCode) | 0.7269 |
| 1 | Decision Tree | 0.400773 |


The Decision Tree **underperforms** the baseline Linear Regression that uses only PostalCode:
- Lower R² on both the train and test sets.
- Much higher percentage errors (MAPE/MdAPE).
- Conclusion: For this dataset and feature set, **baseline linear regression is better than the Decision Tree model**.

# Comparison of Models: Linear Regression vs Decision Tree (Shallow)

## 1. Model Overview

| Model | Features | Target Variable | Key Characteristics |
|-------|---------|----------------|-------------------|
| Linear Regression (Baseline) | PostalCode (one-hot) | LogClosePrice | Simple linear model, captures ZIP-level average effect, log-transform stabilizes variance |
| Decision Tree (max_depth=5) | PostalCode, City, MLSAreaMajor, HighSchoolDistrict, LivingArea, BathroomsTotalInteger, BedroomsTotal | LogClosePrice | Tree-based model, max_depth=5, captures nonlinear patterns but underfits; interpretable via decision rules |

---

## 2. R² Performance (Log Scale)

| Model | Train R² | Test R² |
|-------|-----------|----------|
| Linear Regression | 0.7621 | 0.7269 |
| Decision Tree | 0.4061 | 0.4008 |

**Analysis:**  
- Linear Regression **outperforms the Decision Tree** in both training and test sets.  
- The shallow Decision Tree **underfits**, capturing only about 40% of the variance in log price.  
- Linear Regression generalizes better on this dataset even though it uses only the ZIP-level predictor.

---

## 3. Error Metrics (Original Dollar Scale)

| Model | Train MAPE (%) | Train MdAPE (%) | Test MAPE (%) | Test MdAPE (%) |
|-------|----------------|----------------|---------------|----------------|
| Linear Regression | 22.16 | 15.91 | 23.90 | 17.08 |
| Decision Tree | 40.88 | 30.45 | 41.65 | 30.43 |

**Analysis:**  
- The Decision Tree shows **much higher percentage errors**, indicating poorer predictive accuracy.  
- Median errors are nearly double compared to Linear Regression, suggesting the tree is unreliable for typical house price predictions.  
- At this depth the tree cannot make enough splits to capture ZIP-level price differences.
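
The gap between a ZIP-level linear model and a shallow tree can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not a rerun of the notebook. It invents 200 ZIP codes with random price shifts, one-hot encodes them, and compares in-sample R² for group-mean predictions (what one-hot OLS on a single categorical produces) against a depth-5 tree on the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic setup: all sizes and noise levels are arbitrary assumptions.
rng = np.random.default_rng(0)
n_zips, n_rows = 200, 4000
zip_effect = rng.normal(0.0, 0.5, size=n_zips)   # ZIP-level shift in log price
zips = rng.integers(0, n_zips, size=n_rows)
y = 13.0 + zip_effect[zips] + rng.normal(0.0, 0.2, size=n_rows)

df = pd.DataFrame({'zip': zips.astype(str), 'y': y})
X = pd.get_dummies(df['zip'])                    # one indicator column per ZIP

# One-hot OLS on a single categorical predicts each group's mean.
group_mean_pred = df.groupby('zip')['y'].transform('mean')

# A depth-5 tree can isolate only a handful of ZIP indicators.
tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X, y)
tree_pred = tree.predict(X)

group_r2 = r2_score(y, group_mean_pred)
tree_r2 = r2_score(y, tree_pred)
print(f"group means: {group_r2:.3f}   depth-5 tree: {tree_r2:.3f}")
```

Each split on a one-hot column peels off a single ZIP, so a shallow tree must pool most ZIP codes into a few leaves. Increasing `max_depth` in this sketch narrows the gap.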

---

## 4. Interpretability

- **Linear Regression:** Simple, interpretable coefficients; directly shows ZIP-level impact.  
- **Decision Tree:** Rules are human-readable, but **max_depth=5 is too shallow**, so it cannot fully capture complex patterns in housing data.  
  - Example rule (first leaf of the printed tree): `LivingArea <= 1107.5 & HighSchoolDistrict_Other <= 0.5 & HighSchoolDistrict_Rim of the World <= 0.5 → predicted log-price = 13.28`  
  - Some interactions captured, but overall variance explained is low.

---

## 5. Conclusion

For this dataset:

- **Linear Regression (PostalCode)** is the better model for predicting `LogClosePrice`.  
- The Decision Tree (max_depth=5) **underfits**, producing lower R² (~0.40) and higher prediction errors (~41% MAPE).  
- Key takeaway: shallow trees with these features cannot outperform a ZIP-level linear model.  

# END : Anjali Manju Gowda