# Arborium: Simplified Tree Representations

This notebook demonstrates how to use Arborium to create simplified tree representations of complex XGBoost models. This can be especially useful for understanding and explaining models with many trees and deep structures.

## Installation

If you're running this notebook in Colab or outside the arborium repository, uncomment and run the following cell to install the package:

In [1]:
# Uncomment if running in Colab or if you haven't installed arborium yet
# !pip install arborium[xgboost]

## Importing Libraries

First, let's import the necessary libraries:

In [11]:
from arborium import XGBTreeVisualizer
from sklearn.datasets import load_iris
import xgboost as xgb
import numpy as np

## Loading and Preparing Data

We'll use the Iris dataset for this example. It has 150 samples, four features, and three classes:

In [12]:
iris = load_iris()
X, y = iris.data, iris.target

# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X, label=y)

## Training a Complex XGBoost Model

Let's train a more complex XGBoost model with many trees. With 300 boosting rounds and three classes, the ensemble holds 900 trees, one per class per round:

In [13]:
# Set parameters for XGBoost
params = {
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,  # iris has 3 classes
    'max_depth': None,  # None leaves XGBoost's default depth (6) in place
    'learning_rate': 0.1,
    'eval_metric': 'mlogloss'
}

# Train XGBoost model
num_rounds = 300
model = xgb.train(params, dtrain, num_rounds)

## Creating a Visualizer

Now, let's create an Arborium visualizer for the model:

In [5]:
# Create a visualizer
visualizer = XGBTreeVisualizer(model, X, y, feature_names=iris.feature_names, target_names=iris.target_names)
visualizer.show_tree()

As you can see, individual trees in this complex model can be quite deep and hard to interpret. This is where simplified trees come in handy.

## Creating a Simplified Tree Representation

Arborium can create a simplified decision tree that approximates the behavior of the entire ensemble:

In [15]:
# Show a simplified representation of the entire model
simplified_tree = visualizer.show_simplified_tree(
    max_depth=1024,     # Upper bound on the simplified tree's depth (1024 is effectively no limit)
    n_components=None,  # Use all features (no dimensionality reduction)
    n_samples=5000      # Use 5000 samples to build the simplified model
)
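The idea of approximating an ensemble with one tree can be sketched with scikit-learn alone. This is a minimal illustration of the general surrogate-model technique, not Arborium's actual implementation: the `GradientBoostingClassifier` is a stand-in for the XGBoost model, and the names `ensemble` and `surrogate` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stand-in for the complex ensemble (assumption: any fitted classifier works here).
ensemble = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Label the data with the ensemble's own predictions, not the ground truth,
# so the tree learns to imitate the ensemble.
ensemble_labels = ensemble.predict(X)

# Fit a single shallow tree to those predicted labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, ensemble_labels)

print(f"Surrogate depth: {surrogate.get_depth()}")
print(f"Surrogate leaves: {surrogate.get_n_leaves()}")
```

Training on the ensemble's predictions rather than on `y` is the key design choice: the surrogate then describes what the ensemble does, including its mistakes.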

## Using the Simplified Model for Predictions

The simplified model can also be used to make predictions. Let's see how it compares to the full model on the training data:

In [16]:
from sklearn.metrics import accuracy_score

# Get predictions from the full XGBoost model
y_pred_xgb = model.predict(dtrain)
xgb_accuracy = accuracy_score(y, y_pred_xgb)
print(f"XGBoost model accuracy: {xgb_accuracy:.4f}")

# Get predictions from the simplified tree model
simplified_model = visualizer.get_simplified_model()
y_pred_simplified = simplified_model.predict(X)
simplified_accuracy = accuracy_score(y, y_pred_simplified)
print(f"Simplified tree model accuracy: {simplified_accuracy:.4f}")



XGBoost model accuracy: 1.0000
Simplified tree model accuracy: 0.9867
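Accuracy against the true labels measures each model separately. A complementary check is how often the two models agree with each other, often called fidelity. The helper below is a minimal sketch; `fidelity` is a hypothetical name, not part of Arborium, and the arrays are toy values rather than output from the models above.

```python
import numpy as np

def fidelity(full_preds, simplified_preds):
    """Return the fraction of samples on which two label arrays agree."""
    full_preds = np.asarray(full_preds)
    simplified_preds = np.asarray(simplified_preds)
    if full_preds.shape != simplified_preds.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(full_preds == simplified_preds))

# Toy labels for illustration only.
full = np.array([0, 1, 2, 2, 1, 0])
simple = np.array([0, 1, 2, 1, 1, 0])
print(f"Fidelity: {fidelity(full, simple):.4f}")
```

With the variables from the cell above, the call would be `fidelity(y_pred_xgb, y_pred_simplified)`.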


## Experimenting with Different Simplification Parameters

Let's try different parameters for the simplified tree:

In [8]:
# Try without an explicit depth cap (max_depth=None)
deeper_tree = visualizer.show_simplified_tree(
    max_depth=None,
    n_samples=5000
)

In [9]:
# Try with dimensionality reduction
small_tree = visualizer.show_simplified_tree(
    max_depth=2,
    n_components=2,
    n_samples=5000
)
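How Arborium applies `n_components` is not shown in this notebook. One plausible reading, assumed here, is a PCA projection before the tree is fitted. The sketch below shows that idea with scikit-learn only, and fits directly to the labels for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Project the four features onto two components, then fit a depth-2 tree.
reduced_tree = make_pipeline(
    PCA(n_components=2),
    DecisionTreeClassifier(max_depth=2, random_state=0),
)
reduced_tree.fit(X, y)

# make_pipeline names each step after its lowercased class name.
tree = reduced_tree.named_steps["decisiontreeclassifier"]
print(f"Inputs seen by the tree: {tree.n_features_in_}")
print(f"Depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
```

The trade-off is that splits are now made on principal components rather than on the original features, so the tree is smaller but each rule is harder to read.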

## Getting the Simplified Model

You can also access the simplified model directly, which is a scikit-learn decision tree:

In [10]:
# Get the most recently created simplified model
dt_model = visualizer.get_simplified_model()

# Show information about the model
print(f"Type: {type(dt_model).__name__}")
print(f"Max depth: {dt_model.max_depth}")
print(f"Number of leaves: {dt_model.get_n_leaves()}")

Type: DecisionTreeClassifier
Max depth: 2
Number of leaves: 3
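Because the simplified model is a `DecisionTreeClassifier`, the standard scikit-learn inspection tools apply to it. The sketch below fits a fresh depth-2 tree on Iris as a stand-in for `dt_model` (an assumption, so that the example runs on its own) and prints its decision rules and feature importances.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Stand-in for dt_model: a depth-2 tree fitted directly on the Iris data.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Text rendering of the decision rules, one line per split or leaf.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Relative contribution of each feature to the tree's splits.
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The same two calls should work on the object returned by `visualizer.get_simplified_model()`, provided the feature names match the columns the tree was trained on.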


## Conclusion

You've now learned how to use Arborium to create simplified tree representations of complex XGBoost models. These simplified trees can help with:

1. Model interpretation and explanation
2. Understanding the most important features and decision rules
3. Creating approximate but more interpretable models

Simplified trees sacrifice some performance compared to the full ensemble: in this example, 0.9867 versus 1.0000 accuracy on the training data. In return, they show how the model makes predictions, which can be crucial for explaining model behavior to stakeholders or debugging model issues.