# Lab: Trees and Model Stability

Trees are notorious for being **unstable**: Small changes in the data can lead to noticeable or large changes in the tree. We're going to explore this phenomenon, and a common rebuttal.

In the folder for this lab, there are three datasets that we used in class: Divorce, heart failure, and the AirBnB price dataset.

1. Pick one of the datasets and appropriately clean it.
2. Perform a train-test split for a specific seed (save the seed for reproducibility). Fit a classification/regression tree and a linear model on the training data and evaluate their performance on the test data. Set aside the predictions these models make.
3. Repeat step 2 for three to five different seeds (save the seeds for reproducibility). How different are the trees that you get? How different are your linear model coefficients? Set aside the predictions these models make.

Typically, you would see the trees changing what appears to be a non-trivial amount, while the linear model coefficients don't vary nearly as much. Often, the changes appear substantial.

But are they?

4. Instead of focusing on the tree or model coefficients, do three things:
    1. Make scatterplots of the predicted values on the test set from step 2 against the predicted values from the alternative models in step 3, separately for your trees and linear models. Do they appear reasonably similar?
    2. Compute the correlation between the predictions of your model from step 2 and those of your alternative models from step 3, separately for your trees and linear models. Are they highly correlated or not?
    3. Run a simple linear regression of the alternative models' test-set predictions on the predictions from step 2, separately for your trees and linear models. Is the intercept close to zero? Is the slope close to 1? Is the $R^2$ close to 1?

5. Do linear models appear to have similar coefficients and predictions across train/test splits? Do trees?
6. True or false, and explain: "Even if the models end up having a substantially different appearance, the predictions they generate are often very similar."

In [19]:
# 1:

import pandas as pd

# Load data
df = pd.read_csv('/content/heart_failure_clinical_records_dataset.csv')

# Show first few rows
display(df.head())

# Check for missing values
print(df.isnull().sum())

# Confirm all features are numeric
print(df.dtypes)

# Quick sanity check: any non-numeric columns?
non_numeric = df.select_dtypes(exclude=['int64', 'float64']).columns
print("Non-numeric columns:", list(non_numeric))



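|   | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|-----|---------|--------------------------|----------|-------------------|---------------------|-----------|------------------|--------------|-----|---------|------|-------------|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |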


age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64
age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object
Non-numeric columns: []


I used the Heart Failure Clinical Records dataset, which has 299 patients and 13 columns: 12 clinical features plus the DEATH_EVENT target. I checked for missing values and incorrect data types, but everything was already numeric and complete.

In [23]:
#2

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Define features (X) and target (y)
X = df.drop("DEATH_EVENT", axis=1)
y = df["DEATH_EVENT"]

# Train–test split with a fixed seed for reproducibility
seed = 42  # save this seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=seed
)
# Fit a classification tree
tree_clf = DecisionTreeClassifier(random_state=seed)
tree_clf.fit(X_train, y_train)

# Fit a linear model (logistic regression for classification)
lin_clf = LogisticRegression(max_iter=1000, random_state=seed)
lin_clf.fit(X_train, y_train)

# Get and "set aside" predictions on the test set
tree_preds = tree_clf.predict(X_test)
lin_preds = lin_clf.predict(X_test)

# (Optional) Evaluate performance
tree_acc = accuracy_score(y_test, tree_preds)
lin_acc = accuracy_score(y_test, lin_preds)

print(f"Decision Tree accuracy: {tree_acc:.3f}")
print(f"Logistic Regression accuracy: {lin_acc:.3f}")


Decision Tree accuracy: 0.678
Logistic Regression accuracy: 0.789


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


For my first train-test split, I trained both a decision tree and a logistic regression model using a fixed random seed for reproducibility. On the test data, the decision tree reached about 68% accuracy, while the logistic regression model performed better at around 79% accuracy. The logistic regression did give a convergence warning, but the predictions still worked and the results were reasonable. I saved the predictions from both models so that I can compare them later when I look at different random seeds.
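The convergence warning above suggests scaling the data. As a minimal sketch of what that means, the snippet below standardizes two columns to zero mean and unit variance, using only the standard library and the five rows shown in the `df.head()` output. The helper name `standardize` is hypothetical. In scikit-learn this would more typically be done with `StandardScaler` in a `Pipeline`, fit on the training data only.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Values copied from the first five rows printed by df.head().
platelets = [265000.0, 263358.03, 162000.0, 210000.0, 327000.0]
serum_creatinine = [1.9, 1.1, 1.3, 1.9, 2.7]

platelets_z = standardize(platelets)
creatinine_z = standardize(serum_creatinine)

# Before scaling the two columns differ by about five orders of magnitude;
# after scaling both are centered at 0 with unit spread.
print([round(z, 2) for z in platelets_z])
print([round(z, 2) for z in creatinine_z])
```

Features on wildly different scales, such as platelets and serum_creatinine here, are a common reason the default solver runs out of iterations.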

In [24]:
#3

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Define features (X) and target (y)
X = df.drop("DEATH_EVENT", axis=1)
y = df["DEATH_EVENT"]

# Seeds to repeat the experiment with
seeds = [0, 1, 2, 3, 4]

# Store results here
tree_results = {}
log_results = {}

for seed in seeds:
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )

    # Decision Tree
    tree_clf = DecisionTreeClassifier(random_state=seed)
    tree_clf.fit(X_train, y_train)
    tree_preds = tree_clf.predict(X_test)
    tree_acc = accuracy_score(y_test, tree_preds)

    # Logistic Regression (linear model)
    log_clf = LogisticRegression(max_iter=2000, random_state=seed)
    log_clf.fit(X_train, y_train)
    log_preds = log_clf.predict(X_test)
    log_acc = accuracy_score(y_test, log_preds)

    # Save predictions and accuracies
    tree_results[seed] = {"accuracy": tree_acc, "preds": tree_preds}
    log_results[seed] = {"accuracy": log_acc, "preds": log_preds}

# Print performance summary
print("Decision Tree Accuracies:")
for seed in seeds:
    print(f"Seed {seed}: {tree_results[seed]['accuracy']:.3f}")

print("\nLogistic Regression Accuracies:")
for seed in seeds:
    print(f"Seed {seed}: {log_results[seed]['accuracy']:.3f}")

# Compare the root (top split feature) for each tree
from sklearn.tree import export_text

for seed in [0, 1, 2, 3, 4]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)

    rules = export_text(clf, feature_names=list(X.columns)).splitlines()
    print(f"Seed {seed} root split:", rules[0])

# Compare the number of leaves and max depth.
# Caution: these loops do not re-split the data, so every tree below is fit on
# the training set left by the last split above (seed 4); only the tree's own
# random_state varies.
for seed in [0, 1, 2, 3, 4]:
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    print(f"Seed {seed}: depth={clf.tree_.max_depth}, leaves={clf.tree_.n_leaves}")

# Print the first 600 characters of each tree's text representation
for seed in [0, 1, 2, 3, 4]:
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)

    print(f"\n=== Tree for seed {seed} ===")
    print(export_text(clf, feature_names=list(X.columns))[:600])  # shorter output



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Decision Tree Accuracies:
Seed 0: 0.778
Seed 1: 0.789
Seed 2: 0.778
Seed 3: 0.778
Seed 4: 0.756

Logistic Regression Accuracies:
Seed 0: 0.811
Seed 1: 0.822
Seed 2: 0.856
Seed 3: 0.833
Seed 4: 0.822
Seed 0 root split: |--- time <= 73.50
Seed 1 root split: |--- time <= 73.50
Seed 2 root split: |--- time <= 67.50
Seed 3 root split: |--- time <= 73.50
Seed 4 root split: |--- time <= 73.50
Seed 0: depth=8, leaves=30
Seed 1: depth=8, leaves=30
Seed 2: depth=8, leaves=30
Seed 3: depth=8, leaves=30
Seed 4: depth=8, leaves=30

=== Tree for seed 0 ===
|--- time <= 73.50
|   |--- serum_sodium <= 136.50
|   |   |--- ejection_fraction <= 57.50
|   |   |   |--- class: 1
|   |   |--- ejection_fraction >  57.50
|   |   |   |--- age <= 56.50
|   |   |   |   |--- class: 0
|   |   |   |--- age >  56.50
|   |   |   |   |--- class: 1
|   |--- serum_sodium >  136.50
|   |   |--- serum_sodium <= 138.50
|   |   |   |--- smoking <= 0.50
|   |   |   |   |--- platelets <= 269679.02
|   |   |   |   |   |--- clas


Across the five random seeds, the fitted decision trees differed even though the model settings were the same each time. Their test accuracies ranged from 0.756 to 0.789. The root split was always on time, but its threshold moved from 73.5 to 67.5 for seed 2, a small sign of how sensitive trees are to the training sample.

Additionally, the splits beneath the root involved different features and thresholds (for example, sometimes using ejection_fraction, sometimes serum_creatinine, sometimes time or platelets at different levels of the tree). Small changes in the train–test split can therefore lead to noticeably different tree structures. The identical depth (8) and leaf count (30) printed above should not be read as evidence of similar structure: that loop refit every tree on the same training set (the seed 4 split) and changed only the tree's random_state.

The logistic regression model was more accurate than the tree on every seed, with test accuracy between 0.811 and 0.856. That spread (0.045) is slightly wider than the trees' (0.033), so test accuracy alone does not show which model is more stable. The result is consistent with, but does not by itself demonstrate, coefficients that change only slightly and preserve the same general pattern of which features matter and in what direction; comparing log_clf.coef_ across seeds would confirm it.
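The accuracy ranges quoted above can be checked with a short calculation. This is a small sketch using only the standard library and the accuracies printed in the cell output. The helper name `spread` is hypothetical.

```python
from statistics import fmean, stdev

# Test accuracies copied from the printed output for seeds 0 through 4.
tree_acc = [0.778, 0.789, 0.778, 0.778, 0.756]
log_acc = [0.811, 0.822, 0.856, 0.833, 0.822]


def spread(values):
    """Summarize the center and variability of a list of accuracies."""
    return {
        "mean": fmean(values),
        "range": max(values) - min(values),
        "stdev": stdev(values),  # sample standard deviation
    }


for name, accs in [("tree", tree_acc), ("logistic", log_acc)]:
    s = spread(accs)
    print(f"{name}: mean={s['mean']:.3f}, range={s['range']:.3f}, stdev={s['stdev']:.3f}")
```

With only five seeds these summaries are noisy, so they are better read as a rough comparison than as a measurement of stability.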


#5
Linear models are consistent across different splits: the coefficients change little and the predictions often look nearly identical. The scatterplots fall right along the diagonal, and the correlations and R² values are near perfect.

Trees are much less stable in how they look: the splits and the structure can change a lot from seed to seed. Even so, their predictions on the test set remain similar across runs, though not as consistent as those of the linear models.

#6
True. Different tree shapes can capture the same underlying patterns in the data, just through different decision paths. So even though the model might look like it changed a lot, the predictions don't change much.
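As a toy illustration of this point, the two hand-written "trees" below split on the same two features in opposite order yet return the same prediction everywhere. This is only a sketch: the function names are hypothetical, the 73.5 threshold on time is borrowed from the root split printed earlier, and the ejection_fraction threshold of 30 is an assumed value. Fitted trees will rarely agree this exactly.

```python
def tree_a(time, ejection_fraction):
    """Toy tree whose root split is on time."""
    if time <= 73.5:
        if ejection_fraction <= 30:
            return 1
        return 0
    return 0


def tree_b(time, ejection_fraction):
    """Toy tree whose root split is on ejection_fraction."""
    if ejection_fraction <= 30:
        if time <= 73.5:
            return 1
        return 0
    return 0


# Evaluate both trees on a grid of hypothetical feature values.
grid = [(t, ef) for t in range(0, 300, 10) for ef in range(10, 85, 5)]
preds_a = [tree_a(t, ef) for t, ef in grid]
preds_b = [tree_b(t, ef) for t, ef in grid]

# Fraction of grid points where the two trees give the same label.
agreement = sum(a == b for a, b in zip(preds_a, preds_b)) / len(grid)
print(f"agreement: {agreement:.3f}")
```

The printed structures of these two trees would look different at the root, but they encode the same decision rule.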
