# Decision Tree Classifier

Decision trees are a popular class of machine learning algorithms. In this notebook we build a simple decision tree classifier with `scikit-learn` and show that it can be executed homomorphically using Concrete Numpy.

State-of-the-art classifiers are generally more complex than a single decision tree. The goal here is to demonstrate FHE decision trees, so the results may not compete with the best models available.

Converting a tree working over quantized data to its FHE equivalent takes only a few lines of code thanks to Concrete Numpy.

Let's dive in!

## The Use Case

The use case is a spam classification task from OpenML you can find here: https://www.openml.org/d/44

The dataset has 4601 samples. Each sample comes with pre-extracted features (such as word frequencies) and a class: `0` for a normal e-mail and `1` for spam.

Let's first get the dataset.

In [1]:
import numpy
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

features, classes = fetch_openml(data_id=44, as_frame=False, cache=True, return_X_y=True)
classes = classes.astype(numpy.int64)

print(features.shape)
print(classes.shape)

num_features = features.shape[1]
print(f"Number of features: {num_features}")

x_train, x_test, y_train, y_test = train_test_split(
    features,
    classes,
    test_size=0.15,
    random_state=42,
)


(4601, 57)
(4601,)
Number of features: 57


We first train a decision tree on the dataset as is and see what performance we can get.

In [2]:
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

clear_clf = DecisionTreeClassifier()
clear_clf = clear_clf.fit(x_train, y_train)

print(f"Depth: {clear_clf.get_depth()}")

preds = clear_clf.predict(x_test)

mean_accuracy = numpy.mean(preds == y_test)
print(f"Mean accuracy: {mean_accuracy}")

true_negative, false_positive, false_negative, true_positive = confusion_matrix(
    y_test, preds, normalize="true"
).ravel()

num_samples = len(y_test)
num_spam = sum(y_test)

print(f"Number of test samples: {num_samples}")
print(f"Number of spams in test samples: {num_spam}")

print(f"True Negative (legit mail well classified) rate: {true_negative}")
print(f"False Positive (legit mail classified as spam) rate: {false_positive}")
print(f"False Negative (spam mail classified as legit) rate: {false_negative}")
print(f"True Positive (spam well classified) rate: {true_positive}")


Depth: 29
Mean accuracy: 0.91027496382055
Number of test samples: 691
Number of spams in test samples: 304
True Negative (legit mail well classified) rate: 0.9328165374677002
False Positive (legit mail classified as spam) rate: 0.06718346253229975
False Negative (spam mail classified as legit) rate: 0.11842105263157894
True Positive (spam well classified) rate: 0.881578947368421


We now quantize the features and train the tree directly on the quantized data. This makes the trained tree FHE-friendly by default, and it also lets us compare the two trees.

The choice here is to compute the quantization parameters over the training set. Each feature is quantized individually to 6 bits, because programmable bootstrapping (PBS) in Concrete Numpy is more accurate at 6 bits of precision.

In [3]:
from concrete.quantization import QuantizedArray

# Quantize the training and test samples, one feature at a time
q_x_train = numpy.zeros_like(x_train, dtype=numpy.int64)
q_x_test = numpy.zeros_like(x_test, dtype=numpy.int64)
for feature_idx in range(num_features):
    q_x_train[:, feature_idx] = QuantizedArray(6, x_train[:, feature_idx]).qvalues
    q_x_test[:, feature_idx] = QuantizedArray(6, x_test[:, feature_idx]).qvalues

print(q_x_train[0])
print(q_x_test[-1])


[ 0  0  6  0  3  5  0  0  0  2  0 19  0  0  0  0  0  0  3  0  0  0  0  0
  4  4  0  7  3  0  0  0  2  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0
  0  1  0  0  0  0  0  0  1]
[ 0  0  0  0  6  0  0  0  0  0  0 10  0  0  0  0  0  0  4  0  7  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0]
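
As a rough illustration of what per-feature quantization to 6 bits involves, here is a minimal sketch of uniform (min/max) quantization in plain NumPy. It is a stand-in written for this explanation, not the implementation behind `QuantizedArray`: the helper names `fit_uniform_quantizer` and `apply_uniform_quantizer` and the sample values are hypothetical. The parameters are fitted on a training column and then reused on a test column.

```python
import numpy


def fit_uniform_quantizer(values, n_bits):
    """Return (minimum, scale) mapping values onto the integers 0 .. 2**n_bits - 1."""
    minimum = float(numpy.min(values))
    maximum = float(numpy.max(values))
    n_levels = 2**n_bits - 1
    # Guard against constant columns, which would otherwise give a zero scale
    scale = (maximum - minimum) / n_levels if maximum > minimum else 1.0
    return minimum, scale


def apply_uniform_quantizer(values, minimum, scale, n_bits):
    """Quantize values with previously fitted parameters, clipping out-of-range inputs."""
    q_values = numpy.rint((values - minimum) / scale)
    return numpy.clip(q_values, 0, 2**n_bits - 1).astype(numpy.int64)


# Made-up feature values; 9.0 lies outside the range seen during fitting
train_column = numpy.array([0.0, 0.21, 1.5, 3.2, 6.3])
test_column = numpy.array([0.1, 2.0, 9.0])

minimum, scale = fit_uniform_quantizer(train_column, n_bits=6)
q_train = apply_uniform_quantizer(train_column, minimum, scale, n_bits=6)
q_test = apply_uniform_quantizer(test_column, minimum, scale, n_bits=6)
print(q_train, q_test)
```

Clipping is one possible way to handle test values outside the fitted range; other choices are possible and may suit other datasets better.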


So far so good. We can now train a `DecisionTreeClassifier` on the quantized dataset.

In [4]:
# We limit the depth to have reasonable FHE runtimes, but deep trees can still compile properly!
clf = DecisionTreeClassifier(max_depth=7)
clf = clf.fit(q_x_train, y_train)

print(f"Depth: {clf.get_depth()}")

preds = clf.predict(q_x_test)

mean_accuracy = numpy.mean(preds == y_test)
print(f"Mean accuracy: {mean_accuracy}")

true_negative, false_positive, false_negative, true_positive = confusion_matrix(
    y_test, preds, normalize="true"
).ravel()

num_samples = len(y_test)
num_spam = sum(y_test)

print(f"Number of test samples: {num_samples}")
print(f"Number of spams in test samples: {num_spam}")

print(f"True Negative (legit mail well classified) rate: {true_negative}")
print(f"False Positive (legit mail classified as spam) rate: {false_positive}")
print(f"False Negative (spam mail classified as legit) rate: {false_negative}")
print(f"True Positive (spam well classified) rate: {true_positive}")


Depth: 7
Mean accuracy: 0.8813314037626628
Number of test samples: 691
Number of spams in test samples: 304
True Negative (legit mail well classified) rate: 0.9276485788113695
False Positive (legit mail classified as spam) rate: 0.07235142118863049
False Negative (spam mail classified as legit) rate: 0.17763157894736842
True Positive (spam well classified) rate: 0.8223684210526315


This simple classifier achieves a false positive rate (legit mail classified as spam) of about 7% and a false negative rate (spam mail classified as legit) of about 18%. In a more common setting, not shown in this tutorial, we would use gradient boosting to combine several small classifiers into a single, more effective one.

We can see that the accuracy (about 0.88) is relatively close to that of the tree trained in the clear (about 0.91), despite the quantization (needed for FHE compatibility) and the smaller depth (chosen for faster FHE computations). The main difference is a higher false negative rate (spam mail classified as legit), which rises from about 12% to about 18%, while the false positive rate stays around 7%.
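
The normalized rates printed above can be turned back into sample counts, which makes the two trees easier to compare. The short sketch below only reuses figures copied from the two result cells (691 test samples, 304 of them spam, and the reported rates). The dictionary layout and variable names are choices made for this illustration.

```python
# Figures copied from the two result cells above
num_samples = 691
num_spam = 304
num_legit = num_samples - num_spam

reported_rates = {
    "clear tree": {
        "false_positive": 0.06718346253229975,
        "false_negative": 0.11842105263157894,
    },
    "quantized tree": {
        "false_positive": 0.07235142118863049,
        "false_negative": 0.17763157894736842,
    },
}

summary = {}
for name, rates in reported_rates.items():
    # Rates are normalized per true class, so multiply by the size of that class
    misflagged_legit = round(rates["false_positive"] * num_legit)
    missed_spam = round(rates["false_negative"] * num_spam)
    accuracy = 1 - (misflagged_legit + missed_spam) / num_samples
    summary[name] = (misflagged_legit, missed_spam, accuracy)
    print(
        f"{name}: {misflagged_legit} legit mails flagged, "
        f"{missed_spam} spams missed, accuracy {accuracy:.4f}"
    )
```

The recomputed accuracies should agree with the mean accuracies printed by the two cells, which is a quick consistency check on the reported rates.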

The point here is not to beat state-of-the-art methods for spam detection, but rather to show that a given tree classifier can be run homomorphically.

## Homomorphic Trees

Before we can do that we need to convert the tree to a form that is easy to run homomorphically.

The Hummingbird paper from Microsoft (https://scnakandala.github.io/papers/TR_2020_Hummingbird.pdf and https://github.com/microsoft/hummingbird) gives a method to convert tree evaluation to tensor operations which we support in Concrete Numpy.
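
For intuition, the tensor formulation can be tried by hand on a toy tree. The sketch below is an illustration only: the two-node tree, its thresholds, and the names `A` to `E`, `toy_tree_predict` and `toy_tree_reference` are made up for this example. The tensors follow the same roles as the ones built later in this notebook (feature selection, thresholds, path encoding, left-turn counts, leaf classes).

```python
import numpy

# Hypothetical toy tree (not the spam model):
#   node n0: x[0] <= 3 ? go to n1 : leaf L2 (class 1)
#   node n1: x[1] <= 5 ? leaf L0 (class 0) : leaf L1 (class 1)
A = numpy.array([[1, 0], [0, 1]])  # A[i, j] = 1 if internal node j tests feature i
B = numpy.array([3, 5])  # threshold of each internal node
C = numpy.array([[1, 1, -1], [1, -1, 0]])  # +1: leaf in left subtree, -1: right, 0: unrelated
D = numpy.array([2, 1, 0])  # number of left turns on the path to each leaf
E = numpy.array([[1, 0], [0, 1], [0, 1]])  # one-hot class of each leaf


def toy_tree_predict(inputs):
    """Evaluate the toy tree with matrix products and comparisons only."""
    t = inputs @ A  # pick the feature tested by each internal node
    t = (t <= B).astype(numpy.int64)  # 1 where the node decision is "go left"
    t = t @ C  # score each leaf against the decisions
    t = (t == D).astype(numpy.int64)  # one-hot vector of the reached leaf
    return t @ E  # one-hot vector of the predicted class


def toy_tree_reference(sample):
    """Evaluate the same toy tree with ordinary branching."""
    if sample[0] <= 3:
        return 0 if sample[1] <= 5 else 1
    return 1


samples = numpy.array([[2, 4], [2, 9], [7, 1], [7, 9]])
tensor_classes = numpy.argmax(toy_tree_predict(samples), axis=1)
reference_classes = numpy.array([toy_tree_reference(sample) for sample in samples])
print(tensor_classes, reference_classes)
```

Because the tensor version has no data-dependent branching, every sample goes through exactly the same sequence of operations, which is what makes this form convenient to evaluate on encrypted data.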

The next few cells implement the functions needed for the conversion. They are deliberately left unoptimized so that they remain readable.


In [5]:
# First an sklearn import we need
from sklearn.tree import _tree

In [6]:
def create_hummingbird_tensor_a(tree_, features, internal_nodes):
    """Create Hummingbird tensor A."""
    a = numpy.zeros((len(features), len(internal_nodes)), dtype=numpy.int64)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            a[i, j] = tree_.feature[internal_nodes[j]] == features[i]

    return a
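
The double loop above favors readability. As a possible alternative, tensor A can be built with a single broadcast comparison. The sketch below compares both approaches on a hypothetical `node_features` array that mimics `tree_.feature`; the function names and the toy values are assumptions made for this example. Leaves carry the value `-2`, which is the `TREE_UNDEFINED` marker scikit-learn uses.

```python
import numpy

TREE_UNDEFINED = -2  # marker scikit-learn stores in tree_.feature for leaf nodes

# Hypothetical stand-in for tree_.feature: nodes 0 and 1 are internal, the rest are leaves
node_features = numpy.array([0, 1, TREE_UNDEFINED, TREE_UNDEFINED, TREE_UNDEFINED])
internal_nodes = [
    node for node, feature in enumerate(node_features) if feature != TREE_UNDEFINED
]
num_features = 3


def tensor_a_with_loops():
    """Readable version: one comparison per (feature, internal node) pair."""
    a = numpy.zeros((num_features, len(internal_nodes)), dtype=numpy.int64)
    for i in range(num_features):
        for j, node in enumerate(internal_nodes):
            a[i, j] = node_features[node] == i
    return a


def tensor_a_vectorized():
    """Broadcast version: compare a column of feature ids with a row of tested features."""
    tested_features = node_features[internal_nodes]
    feature_ids = numpy.arange(num_features)
    return (feature_ids[:, None] == tested_features[None, :]).astype(numpy.int64)


a_loops = tensor_a_with_loops()
a_vectorized = tensor_a_vectorized()
print(a_vectorized)
```

Each column of the result has a single `1`, in the row of the feature tested by that internal node.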

In [7]:
def create_hummingbird_tensor_b(tree_, internal_nodes, is_integer_tree=False):
    """Create Hummingbird tensor B."""
    b = numpy.array([tree_.threshold[int_node] for int_node in internal_nodes])

    return b.astype(numpy.int64) if is_integer_tree else b
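
Casting the thresholds to `numpy.int64` truncates them. The small check below suggests why this should not change any comparison when the inputs are non-negative integers, as is the case for the 6-bit quantized features. The half-integer thresholds are made up for the example.

```python
import numpy

# Hypothetical thresholds; a tree trained on integer data typically splits between two values
float_thresholds = numpy.array([0.5, 2.5, 17.5, 62.5])
int_thresholds = float_thresholds.astype(numpy.int64)  # truncation toward zero

# Every value a 6-bit feature can take, as a column so it broadcasts against the thresholds
values = numpy.arange(64).reshape(-1, 1)

same_decisions = numpy.array_equal(values <= float_thresholds, values <= int_thresholds)
print(int_thresholds, same_decisions)
```

This relies on the thresholds being non-negative. For negative thresholds truncation rounds toward zero, so `numpy.floor` would be the safer choice.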

In [8]:
def create_subtree_nodes_set_per_node(
    all_nodes, leaf_nodes, is_left_child_of: dict, is_right_child_of: dict
):
    """Create subtrees nodes set for each node in the tree."""
    left_subtree_nodes_per_node = {node: set() for node in all_nodes}
    right_subtree_nodes_per_node = {node: set() for node in all_nodes}

    current_nodes = {node: None for node in leaf_nodes}
    while current_nodes:
        next_nodes = {}
        for node in current_nodes:
            parent_as_left_child = is_left_child_of.get(node, None)
            if parent_as_left_child is not None:
                left_subtree = left_subtree_nodes_per_node[parent_as_left_child]
                left_subtree.add(node)
                left_subtree.update(left_subtree_nodes_per_node[node])
                left_subtree.update(right_subtree_nodes_per_node[node])
                next_nodes.update({parent_as_left_child: None})

            parent_as_right_child = is_right_child_of.get(node, None)
            if parent_as_right_child is not None:
                right_subtree = right_subtree_nodes_per_node[parent_as_right_child]
                right_subtree.add(node)
                right_subtree.update(left_subtree_nodes_per_node[node])
                right_subtree.update(right_subtree_nodes_per_node[node])
                next_nodes.update({parent_as_right_child: None})

        current_nodes = next_nodes

    return left_subtree_nodes_per_node, right_subtree_nodes_per_node

In [9]:
def create_hummingbird_tensor_c(
    all_nodes, internal_nodes, leaf_nodes, is_left_child_of: dict, is_right_child_of: dict
):
    """Create Hummingbird tensor C."""
    left_subtree_nodes_per_node, right_subtree_nodes_per_node = create_subtree_nodes_set_per_node(
        all_nodes, leaf_nodes, is_left_child_of, is_right_child_of
    )

    c = numpy.zeros((len(internal_nodes), len(leaf_nodes)), dtype=numpy.int64)

    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            if leaf_nodes[j] in right_subtree_nodes_per_node[internal_nodes[i]]:
                c[i, j] = -1
            elif leaf_nodes[j] in left_subtree_nodes_per_node[internal_nodes[i]]:
                c[i, j] = 1

    return c

In [10]:
def create_hummingbird_tensor_d(leaf_nodes, is_left_child_of, is_right_child_of):
    """Create Hummingbird tensor D."""
    d = numpy.zeros((len(leaf_nodes)), dtype=numpy.int64)
    for k in range(d.shape[0]):
        current_node = leaf_nodes[k]
        num_left_children = 0
        while True:
            if (parent_as_left_child := is_left_child_of.get(current_node, None)) is not None:
                num_left_children += 1
                current_node = parent_as_left_child
            elif (parent_as_right_child := is_right_child_of.get(current_node, None)) is not None:
                current_node = parent_as_right_child
            else:
                break
        d[k] = num_left_children

    return d
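
To see why comparing `t @ c` with `d` selects a leaf, it can help to enumerate every possible combination of node decisions on a toy tree. In the sketch below, the balanced three-node tree and its hand-written `C` and `D` are hypothetical. For each decision pattern, exactly one leaf should satisfy the equality.

```python
import itertools

import numpy

# Hypothetical balanced toy tree: root n0 with children n1 (left) and n2 (right),
# leaves L0 and L1 under n1, leaves L2 and L3 under n2.
C = numpy.array(
    [
        [1, 1, -1, -1],  # n0: L0, L1 in its left subtree; L2, L3 in its right subtree
        [1, -1, 0, 0],  # n1: L0 on the left, L1 on the right
        [0, 0, 1, -1],  # n2: L2 on the left, L3 on the right
    ]
)
D = numpy.array([2, 1, 1, 0])  # left turns needed to reach L0, L1, L2, L3

selected_leaves = {}
for decisions in itertools.product([0, 1], repeat=3):
    # decisions[i] is 1 when internal node i says "go left"
    matches = (numpy.array(decisions) @ C) == D
    assert matches.sum() == 1  # exactly one leaf matches every pattern
    selected_leaves[decisions] = int(numpy.argmax(matches))

print(selected_leaves)
```

A leaf's score only reaches its left-turn count when every left ancestor says "go left" and no right ancestor does, so decisions taken at nodes off the path cannot affect the result.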

In [11]:
def create_hummingbird_tensor_e(tree_, leaf_nodes, classes):
    """Create Hummingbird tensor E."""
    e = numpy.zeros((len(leaf_nodes), len(classes)), dtype=numpy.int64)
    for i in range(e.shape[0]):
        leaf_node = leaf_nodes[i]
        assert tree_.feature[leaf_node] == _tree.TREE_UNDEFINED  # Sanity check
        for j in range(e.shape[1]):
            value = None
            if tree_.n_outputs == 1:
                value = tree_.value[leaf_node][0]
            else:
                value = tree_.value[leaf_node].T[0]
            class_name = numpy.argmax(value)
            e[i, j] = class_name == j

    return e

In [12]:
def tree_to_numpy(tree, num_features, classes):
    """Convert an sklearn tree to its Hummingbird tensor equivalent."""
    tree_ = tree.tree_

    number_of_nodes = tree_.node_count
    all_nodes = list(range(number_of_nodes))
    internal_nodes = [
        node_idx
        for node_idx, feature in enumerate(tree_.feature)
        if feature != _tree.TREE_UNDEFINED
    ]
    leaf_nodes = [
        node_idx
        for node_idx, feature in enumerate(tree_.feature)
        if feature == _tree.TREE_UNDEFINED
    ]

    features = list(range(num_features))

    a = create_hummingbird_tensor_a(tree_, features, internal_nodes)

    b = create_hummingbird_tensor_b(tree_, internal_nodes, is_integer_tree=True)

    # Leaves have no children: scikit-learn marks this with _tree.TREE_LEAF (-1),
    # not _tree.TREE_UNDEFINED (-2), in children_left and children_right
    is_left_child_of = {
        left_child: parent
        for parent, left_child in enumerate(tree_.children_left)
        if left_child != _tree.TREE_LEAF
    }
    is_right_child_of = {
        right_child: parent
        for parent, right_child in enumerate(tree_.children_right)
        if right_child != _tree.TREE_LEAF
    }

    c = create_hummingbird_tensor_c(
        all_nodes, internal_nodes, leaf_nodes, is_left_child_of, is_right_child_of
    )

    d = create_hummingbird_tensor_d(leaf_nodes, is_left_child_of, is_right_child_of)

    e = create_hummingbird_tensor_e(tree_, leaf_nodes, classes)

    def tree_predict(inputs):
        t = inputs @ a
        t = t <= b
        t = t @ c
        t = t == d
        r = t @ e
        return r

    return tree_predict
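
The child-to-parent dictionaries are the least obvious part of the conversion. The sketch below builds them from hypothetical arrays shaped like `tree_.children_left` and `tree_.children_right`, in which scikit-learn stores `-1` (`TREE_LEAF`) for nodes without children. It then walks from each leaf up to the root to count left turns. The array contents and the helper `count_left_turns` are assumptions made for this illustration.

```python
TREE_LEAF = -1  # value scikit-learn stores in children_left / children_right for leaves

# Hypothetical tree: node 0 is the root, node 1 its left child, nodes 2, 3 and 4 are leaves
children_left = [1, 2, TREE_LEAF, TREE_LEAF, TREE_LEAF]
children_right = [4, 3, TREE_LEAF, TREE_LEAF, TREE_LEAF]

is_left_child_of = {
    child: parent for parent, child in enumerate(children_left) if child != TREE_LEAF
}
is_right_child_of = {
    child: parent for parent, child in enumerate(children_right) if child != TREE_LEAF
}


def count_left_turns(leaf):
    """Walk from a leaf up to the root, counting the edges taken as a left child."""
    node, left_turns = leaf, 0
    while True:
        if node in is_left_child_of:
            left_turns += 1
            node = is_left_child_of[node]
        elif node in is_right_child_of:
            node = is_right_child_of[node]
        else:
            return left_turns  # the root is nobody's child


leaf_nodes = [node for node, child in enumerate(children_left) if child == TREE_LEAF]
left_turns_per_leaf = [count_left_turns(leaf) for leaf in leaf_nodes]
print(is_left_child_of, is_right_child_of, left_turns_per_leaf)
```

The resulting counts play the role of tensor D for this toy tree.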

In [13]:
# We can finally convert our tree!
tree_predict = tree_to_numpy(clf, num_features, classes=[0, 1])

In [14]:
# Let's see if it works as expected
tensor_predictions = tree_predict(q_x_test)
tensor_predictions = numpy.argmax(tensor_predictions, axis=1)

tree_predictions = clf.predict(q_x_test)

print(f"Results are identical: {numpy.array_equal(tensor_predictions, tree_predictions)}")

Results are identical: True


We now have a tensor equivalent of our `DecisionTreeClassifier`. Pretty neat, isn't it?

The last step is compiling the tensor equivalent to FHE using Concrete Numpy, and it's nearly as easy as 1, 2, 3.

We use the quantized training data as the input set to calibrate the circuit during compilation.

In [15]:
import concrete.numpy as hnp

compiler = hnp.NPFHECompiler(tree_predict, {"inputs": "encrypted"})
fhe_tree = compiler.compile_on_inputset((sample for sample in q_x_train))

And now we can start running the tree homomorphically!

In [16]:
from tqdm import tqdm
from time import perf_counter

num_runs = 10
fhe_preds = []
clear_preds = []
fhe_eval_times = []
for i in tqdm(range(num_runs)):
    start = perf_counter()
    fhe_pred = fhe_tree.run(q_x_test[i].astype(numpy.uint8))
    stop = perf_counter()
    fhe_eval_times.append(stop - start)
    fhe_pred = numpy.argmax(fhe_pred)
    fhe_preds.append(fhe_pred)
    clear_pred = clf.predict(numpy.expand_dims(q_x_test[i], axis=0))
    clear_pred = clear_pred[0]
    clear_preds.append(clear_pred)

fhe_preds = numpy.array(fhe_preds)
clear_preds = numpy.array(clear_preds)

same_preds = fhe_preds == clear_preds
n_same_preds = sum(same_preds)
print(
    f"Same predictions of FHE compared to clear: {n_same_preds}/{num_runs} "
    f"({numpy.mean(same_preds)})"
)
for idx, eval_time in enumerate(fhe_eval_times, 1):
    print(f"FHE evaluation #{idx} took {eval_time} s")

print(f"Mean FHE evaluation time: {numpy.mean(fhe_eval_times)}")


100%|██████████| 10/10 [05:01<00:00, 30.17s/it]

Same predictions of FHE compared to clear: 10/10 (1.0)
FHE evaluation #1 took 30.765692999993917 s
FHE evaluation #2 took 30.604038099998434 s
FHE evaluation #3 took 30.70741419999831 s
FHE evaluation #4 took 30.64609560000099 s
FHE evaluation #5 took 29.945520399996894 s
FHE evaluation #6 took 30.155333900002006 s
FHE evaluation #7 took 29.776400299997476 s
FHE evaluation #8 took 30.12118709999777 s
FHE evaluation #9 took 29.526597299998684 s
FHE evaluation #10 took 29.392055899996194 s
Mean FHE evaluation time: 30.16403357999807





## Conclusion

In this notebook we showed how to quantize a dataset and train a tree directly on integer data so that it is FHE-friendly. We saw that, despite quantization and its smaller depth, the quantized tree's classification performance was close to that of a tree trained on the original real-valued dataset.

We then used the Hummingbird paper's algorithm to transform tree evaluation into a few tensor operations, which Concrete Numpy can compile to an FHE circuit.

Finally, we ran the compiled circuit on a few samples (inference takes about 30 seconds per sample) to show that the clear and FHE predictions were the same.