#START:LEXI CHEN

# Decision Tree Regression Model


A Decision Tree is a strong next model after linear regression because:

- **Nonlinear patterns:** Housing prices often depend on *thresholds* (e.g., LivingArea > 2500 sqft) and *interactions* (e.g., size matters differently by neighborhood).
- **Handles feature interactions automatically:** Unlike OLS, we do not need to manually add interaction terms.
- **Interpretability:** Trees can be visualized and explained through decision rules.

We continue to predict `LogClosePrice` (instead of `ClosePrice`) because:
- `ClosePrice` is right-skewed, while `LogClosePrice` is closer to normal.
- On the log scale, errors are relative (percentage) rather than absolute dollars, so a 10% miss is penalized equally on an inexpensive home and an expensive one.
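
To make the relative-error point concrete, here is a small sketch with two hypothetical homes (the prices are made up for illustration, not taken from the dataset). Each prediction is 10% too high: the dollar errors differ by a factor of ten, while the log-scale errors are identical.

```python
import numpy as np

# Hypothetical sale prices and predictions that are each 10% too high.
actual = np.array([300_000.0, 3_000_000.0])
predicted = actual * 1.10

abs_error = predicted - actual                  # error in dollars
log_error = np.log(predicted) - np.log(actual)  # error on the log scale

print(abs_error)  # the dollar errors differ by a factor of ten
print(log_error)  # both entries equal log(1.10)
```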


In [6]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

## Load the Train/Test Datasets

Test set: the last month of data provided (December 2025).

Training set: at least 6 months of data prior to the test month (June 2025 to November 2025).

In [7]:
test = pd.read_csv('../processed_data/test_cleaned.csv')
train = pd.read_csv('../processed_data/train_cleaned.csv')

## Create Log of Close Price

We log-transform `ClosePrice` to:

- Reduce right skew  
- Improve model stability  
- Improve predictive performance  

In [8]:
train['LogClosePrice'] = np.log(train['ClosePrice'])
test['LogClosePrice']  = np.log(test['ClosePrice'])
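
Because the target is on the log scale, anything expressed in `LogClosePrice` units has to be exponentiated before it can be read as dollars. A minimal sketch of the round trip, using made-up prices rather than the notebook's data, with a guard for the fact that `np.log` is undefined for non-positive values:

```python
import numpy as np

# Hypothetical close prices, for illustration only.
close_price = np.array([450_000.0, 725_000.0, 1_200_000.0])

# np.log is only defined for positive values, so check before transforming.
if not (close_price > 0).all():
    raise ValueError("ClosePrice must be strictly positive before taking logs")

log_price = np.log(close_price)

# np.exp inverts np.log, so the round trip recovers the original dollars.
recovered = np.exp(log_price)
print(recovered)
```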

## Decision Tree Predictor Selection

Before fitting the Decision Tree model, we select predictors based on:

- Statistical signal strength (single-variable $R^2$)
- Real estate domain knowledge
- Overfitting risk
- Model interpretability

The goal is to explain variation in `LogClosePrice`.
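
The notebook does not show how the single-variable $R^2$ values were computed, so the following is only one plausible sketch. It assumes an ordinary least-squares fit of the target on one feature at a time: for a numeric feature that is the squared correlation, and for a categorical feature the fitted value is the group mean. The helper name `single_feature_r2` and the toy data are hypothetical.

```python
import pandas as pd

def single_feature_r2(df, feature, target):
    """R^2 from regressing `target` on one feature (hypothetical helper)."""
    y = df[target]
    if pd.api.types.is_numeric_dtype(df[feature]):
        # Simple linear regression: R^2 equals the squared Pearson correlation.
        return float(df[feature].corr(y) ** 2)
    # Categorical feature: a least-squares fit on dummies predicts each group's mean.
    fitted = y.groupby(df[feature]).transform('mean')
    ss_res = ((y - fitted) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return float(1 - ss_res / ss_tot)

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'City': ['A', 'A', 'B', 'B'],
    'LivingArea': [1000, 2000, 1000, 2000],
    'LogClosePrice': [12.0, 12.4, 13.0, 13.4],
})

r2_city = single_feature_r2(toy, 'City', 'LogClosePrice')
r2_area = single_feature_r2(toy, 'LivingArea', 'LogClosePrice')
print(round(r2_city, 4), round(r2_area, 4))
```

Note that a high-cardinality categorical such as a postal code can reach a large $R^2$ partly because it has many groups, which is one reason overfitting risk is listed alongside signal strength.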

---

### 1. Keep Strong Location Features

Highest $R^2$ values:

$$
R^2_{\text{PostalCode}} = 0.7621
\quad
R^2_{\text{City}} = 0.6917
\quad
R^2_{\text{MLSAreaMajor}} = 0.4690
$$

Location dominates housing prices:

$$
\text{Price} = f(\text{Location}, \text{Size}, \text{Quality})
$$

Selected:

- `PostalCode`
- `City`
- `MLSAreaMajor`
- `HighSchoolDistrict`

---

### 2. Keep Strong Structural Features

$$
R^2_{\text{LivingArea}} = 0.3260
\quad
R^2_{\text{BathroomsTotalInteger}} = 0.2849
$$

Selected:

- `LivingArea`
- `BathroomsTotalInteger`
- `BedroomsTotal`

---

### 3. Exclude Weak Features

Features with:

$$
R^2 < 0.05
$$

are excluded to reduce noise and overfitting.


In [9]:
tree_features = [
    'PostalCode',
    'City',
    'MLSAreaMajor',
    'HighSchoolDistrict',
    'LivingArea',
    'BathroomsTotalInteger',
    'BedroomsTotal'
]

X_train_raw = train[tree_features]
y_train = train['LogClosePrice']

X_test_raw = test[tree_features]
y_test = test['LogClosePrice']
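
Whether `PostalCode` is treated as a category depends on its dtype after `pd.read_csv`. If the column was parsed as integers (an assumption about this dataset that the notebook does not confirm), `pd.get_dummies` leaves it as a single numeric column and the tree would split on ZIP-code thresholds instead of individual codes. The sketch below, on made-up rows, shows the difference and how casting to string changes it:

```python
import pandas as pd

# Made-up rows; PostalCode is stored as integers here on purpose.
df = pd.DataFrame({
    'PostalCode': [92101, 92102, 92101],
    'City': ['San Diego', 'San Diego', 'La Mesa'],
})

# By default get_dummies only encodes object, string, and category columns,
# so the integer PostalCode passes through unchanged.
encoded_numeric = pd.get_dummies(df)

# Casting to string makes each postal code its own indicator column.
df_cat = df.assign(PostalCode=df['PostalCode'].astype(str))
encoded_categorical = pd.get_dummies(df_cat)

print(list(encoded_numeric.columns))
print(list(encoded_categorical.columns))
```

A quick `X_train_raw.dtypes` check before encoding would show which case applies.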

In [10]:
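# Decision Trees in sklearn require numeric input, so one-hot encode the categoricals.
X_train = pd.get_dummies(X_train_raw, drop_first=True)
# Encode the test set without drop_first: the dropped category is chosen per
# frame and may differ from the one dropped in training.
X_test  = pd.get_dummies(X_test_raw)


# Keep exactly the training columns: test-only columns are dropped and
# training columns missing from the test set are filled with 0.
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)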
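
A detail worth checking when the two frames are encoded separately is which baseline category `drop_first=True` removes, since it is chosen within each frame. The toy example below (made-up data, hypothetical variable names) shows a test row losing its indicator when `drop_first=True` is applied to the test frame, and how encoding all test categories before `align(join='left')` avoids that:

```python
import pandas as pd

# Made-up data: the test set lacks city 'A' and contains an unseen city 'D'.
train_raw = pd.DataFrame({'City': ['A', 'B', 'C'], 'LivingArea': [1000, 1500, 2000]})
test_raw = pd.DataFrame({'City': ['B', 'D'], 'LivingArea': [1200, 1800]})

X_tr = pd.get_dummies(train_raw, drop_first=True)  # drops City_A

# drop_first on the test frame drops City_B, the test frame's own first category.
bad_te = pd.get_dummies(test_raw, drop_first=True)
_, bad_te = X_tr.align(bad_te, join='left', axis=1, fill_value=0)

# Encoding every test category keeps City_B; align then trims to the training columns.
good_te = pd.get_dummies(test_raw)
_, good_te = X_tr.align(good_te, join='left', axis=1, fill_value=0)

bad_city_b = bad_te['City_B'].astype(int).tolist()    # the 'B' row is lost
good_city_b = good_te['City_B'].astype(int).tolist()  # the 'B' row is kept
print(bad_city_b, good_city_b)
```

In both cases the unseen city 'D' is discarded by the left join, so such rows are treated like the training baseline category.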

In [11]:
tree_model = DecisionTreeRegressor(
    max_depth=5,
    random_state=42
)

tree_model.fit(X_train, y_train)
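
Once a tree is fitted, it can be scored with the `r2_score` and `mean_squared_error` functions imported at the top of the notebook. The sketch below is not run on the housing data: it uses synthetic data with an assumed linear relationship plus noise, purely to show the pattern of comparing training and held-out scores at the same `max_depth=5`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for (LivingArea, LogClosePrice); not the real dataset.
rng = np.random.default_rng(42)
X = rng.uniform(500, 4000, size=(400, 1))
y = 12.5 + 0.0003 * X[:, 0] + rng.normal(0, 0.1, size=400)

X_fit, X_hold = X[:300], X[300:]
y_fit, y_hold = y[:300], y[300:]

model = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X_fit, y_fit)

scores = {}
for name, X_part, y_part in [('train', X_fit, y_fit), ('test', X_hold, y_hold)]:
    pred = model.predict(X_part)
    scores[name] = {
        'r2': r2_score(y_part, pred),
        # RMSE is in log-price units because the target is on the log scale.
        'rmse': float(np.sqrt(mean_squared_error(y_part, pred))),
    }
    print(name, scores[name])
```

A large gap between the training and test scores would suggest the tree is overfitting and that a smaller `max_depth` may be worth trying.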

In [12]:
from sklearn.tree import export_text

tree_rules = export_text(tree_model, feature_names=list(X_train.columns))
print(tree_rules)

|--- LivingArea <= 2297.50
|   |--- LivingArea <= 1594.50
|   |   |--- LivingArea <= 1107.50
|   |   |   |--- HighSchoolDistrict_Other <= 0.50
|   |   |   |   |--- HighSchoolDistrict_Rim of the World <= 0.50
|   |   |   |   |   |--- value: [13.28]
|   |   |   |   |--- HighSchoolDistrict_Rim of the World >  0.50
|   |   |   |   |   |--- value: [12.70]
|   |   |   |--- HighSchoolDistrict_Other >  0.50
|   |   |   |   |--- MLSAreaMajor_699 - Not Defined <= 0.50
|   |   |   |   |   |--- value: [13.34]
|   |   |   |   |--- MLSAreaMajor_699 - Not Defined >  0.50
|   |   |   |   |   |--- value: [13.73]
|   |   |--- LivingArea >  1107.50
|   |   |   |--- MLSAreaMajor_699 - Not Defined <= 0.50
|   |   |   |   |--- MLSAreaMajor_SRCAR - Southwest Riverside County <= 0.50
|   |   |   |   |   |--- value: [13.51]
|   |   |   |   |--- MLSAreaMajor_SRCAR - Southwest Riverside County >  0.50
|   |   |   |   |   |--- value: [13.16]
|   |   |   |--- MLSAreaMajor_699 - Not Defined >  0.50
|   |   |   |   

#END: LEXI CHEN