# Decision Tree Classifier

Tree-based models are a popular class of algorithms in machine learning. In this notebook, we build a simple Decision Tree Classifier using `scikit-learn` to show that such trees can be executed homomorphically using Concrete.

Converting a tree working over quantized data to its FHE equivalent takes only a few lines of code thanks to Concrete ML.

Let's dive in!

## The use case

The use case is a spam classification task from OpenML, available here: https://www.openml.org/d/44

Some pre-extracted features (such as word frequencies) are provided, along with a class label (`0` for a normal e-mail, `1` for spam), for 4601 samples.

Let's first get the dataset.

In [1]:
import time

import numpy
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

features, classes = fetch_openml(data_id=44, as_frame=False, cache=True, return_X_y=True)
classes = classes.astype(numpy.int64)

x_train, x_test, y_train, y_test = train_test_split(
    features,
    classes,
    test_size=0.15,
    random_state=42,
)

### Let's use the sklearn cross-validation tool to find the best hyper parameters for our model

In [2]:
# Find best hyper parameters with cross validation
from sklearn.model_selection import GridSearchCV

from concrete.ml.sklearn import DecisionTreeClassifier as ConcreteDecisionTreeClassifier

# List of hyper parameters to tune
param_grid = {
    "max_features": [None, "sqrt", "log2"],
    "min_samples_leaf": [1, 10, 100],
    "min_samples_split": [2, 10, 100],
    "max_depth": [None, 2, 4, 6, 8],
}

grid_search = GridSearchCV(
    ConcreteDecisionTreeClassifier(),
    param_grid,
    cv=10,
    scoring="average_precision",
    error_score="raise",
    n_jobs=-1,
)

gs_results = grid_search.fit(x_train, y_train)
print("Best hyper parameters:", gs_results.best_params_)
print("Best score:", gs_results.best_score_)

# Build the model with best hyper parameters
model = ConcreteDecisionTreeClassifier(
    max_features=gs_results.best_params_["max_features"],
    min_samples_leaf=gs_results.best_params_["min_samples_leaf"],
    min_samples_split=gs_results.best_params_["min_samples_split"],
    max_depth=gs_results.best_params_["max_depth"],
    n_bits=6,
)

Best hyper parameters: {'max_depth': None, 'max_features': None, 'min_samples_leaf': 10, 'min_samples_split': 100}
Best score: 0.9294174079511626
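
The `average_precision` scoring used in the grid search summarizes the precision-recall curve as the mean of the precision reached at each positive sample, taken in order of decreasing score. The sketch below is a minimal pure-Python illustration of that idea, not scikit-learn's implementation. It assumes there are no tied scores: `average_precision_score` groups tied scores into a single threshold, so the two can differ when ties occur, which is common with tree outputs. The function name `average_precision` and the toy data are hypothetical.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels, assuming all scores are distinct."""
    # Rank the samples from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    num_positives = sum(y_true)
    if num_positives == 0:
        raise ValueError("average precision is undefined without positive samples")

    true_positives = 0
    total = 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index] == 1:
            true_positives += 1
            # Precision at this rank; each positive adds a recall step of 1 / num_positives
            total += true_positives / rank
    return total / num_positives


# Toy example (hypothetical data): two positives ranked 1st and 3rd
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(toy_labels, toy_scores)
print(f"Average precision: {ap:.4f}")
```

A perfect ranking (all spam scored above all legitimate mail) gives a value of 1.0 with this definition.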


### Let's compute some metrics on the test set.

In [3]:
model, sklearn_model = model.fit_benchmark(x_train, y_train)

In [4]:
# Compute average precision on test
from sklearn.metrics import average_precision_score

# pylint: disable=no-member
y_pred_concrete = model.predict_proba(x_test)[:, 1]
y_pred_sklearn = sklearn_model.predict_proba(x_test)[:, 1]
concrete_average_precision = average_precision_score(y_test, y_pred_concrete)
sklearn_average_precision = average_precision_score(y_test, y_pred_sklearn)
print(f"Sklearn average precision score: {sklearn_average_precision:0.2f}")
print(f"Concrete average precision score: {concrete_average_precision:0.2f}")

Sklearn average precision score: 0.95
Concrete average precision score: 0.97


Note that the Concrete average precision score above is computed in the clear, not in FHE, because FHE execution would take much longer. To run the model in FHE, set the `fhe` argument to `"execute"` in `predict_proba()` once the model has been compiled (as done further below). Also, the average precision of the Concrete model is higher here, likely because quantization acts as a form of regularization that improved the test set metric. In general, however, quantization should be expected to decrease the average precision.
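
To give an intuition for what `n_bits=6` implies, here is a minimal sketch of uniform quantization with `numpy`. It is an illustration under stated assumptions, not Concrete ML's actual quantizer: the helper names `quantize_uniform` and `dequantize`, the min/max calibration, and the random data are all hypothetical.

```python
import numpy


def quantize_uniform(values, n_bits):
    """Map floats onto 2**n_bits evenly spaced integer levels (illustrative only)."""
    v_min, v_max = values.min(), values.max()
    num_steps = 2**n_bits - 1
    # Guard against a constant input, which would give a zero scale
    scale = (v_max - v_min) / num_steps if v_max > v_min else 1.0
    quantized = numpy.rint((values - v_min) / scale).astype(numpy.int64)
    return quantized, scale, v_min


def dequantize(quantized, scale, v_min):
    """Map integer levels back to approximate float values."""
    return quantized * scale + v_min


rng = numpy.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=1000)  # hypothetical feature column

q, scale, v_min = quantize_uniform(x, n_bits=6)
x_hat = dequantize(q, scale, v_min)
max_error = numpy.abs(x - x_hat).max()

print(f"Distinct levels used: {len(numpy.unique(q))} (at most {2**6})")
print(f"Largest rounding error: {max_error:.4f} (half a step is {scale / 2:.4f})")
```

With 6 bits, each feature can take at most 64 distinct values, so nearby thresholds in the tree may collapse onto the same level. This loss of resolution is why quantization usually costs some accuracy.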

In [5]:
# Show the confusion matrix on x_test
from sklearn.metrics import confusion_matrix

y_pred = model.predict(x_test)
true_negative, false_positive, false_negative, true_positive = confusion_matrix(
    y_test, y_pred, normalize="true"
).ravel()

num_samples = len(y_test)
num_spam = sum(y_test)

print(f"Number of test samples: {num_samples}")
print(f"Number of spams in test samples: {num_spam}")

print(f"True Negative (legit mail well classified) rate: {true_negative}")
print(f"False Positive (legit mail classified as spam) rate: {false_positive}")
print(f"False Negative (spam mail classified as legit) rate: {false_negative}")
print(f"True Positive (spam well classified) rate: {true_positive}")

Number of test samples: 691
Number of spams in test samples: 304
True Negative (legit mail well classified) rate: 0.9612403100775194
False Positive (legit mail classified as spam) rate: 0.03875968992248062
False Negative (spam mail classified as legit) rate: 0.14473684210526316
True Positive (spam well classified) rate: 0.8552631578947368
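
The rates above come from `normalize="true"`, which divides each row of the confusion matrix by the number of samples of that true class. The sketch below reproduces them in plain Python. The raw counts are not printed by the notebook: they are inferred here from the printed rates and class totals (691 test samples, 304 of them spam), so treat them as a reconstruction.

```python
# Counts inferred from the printed rates and class totals (387 legit, 304 spam)
counts = [
    [372, 15],  # true class 0 (legit): predicted legit, predicted spam
    [44, 260],  # true class 1 (spam): predicted legit, predicted spam
]

# normalize="true" divides each row by the number of samples of that true class
rates = [[count / sum(row) for count in row] for row in counts]
(true_negative, false_positive), (false_negative, true_positive) = rates

print(f"True Negative rate:  {true_negative:.4f}")
print(f"False Positive rate: {false_positive:.4f}")
print(f"False Negative rate: {false_negative:.4f}")
print(f"True Positive rate:  {true_positive:.4f}")
```

Because each row is normalized separately, the true negative and false positive rates sum to 1, and so do the false negative and true positive rates.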


### Now we are ready to go in the FHE domain

In [6]:
# We first compile the model with some data, here the training set
circuit = model.compile(x_train)

### Generate the key

In [7]:
print(f"Generating a key for an {circuit.graph.maximum_integer_bit_width()}-bit circuit")

Generating a key for an 8-bit circuit


In [8]:
time_begin = time.time()
circuit.client.keygen(force=False)
print(f"Key generation time: {time.time() - time_begin:.2f} seconds")

Key generation time: 0.48 seconds


In [9]:
# Reduce the sample size for a faster total execution time
FHE_SAMPLES = 10
x_test = x_test[:FHE_SAMPLES]
y_pred = y_pred[:FHE_SAMPLES]
y_reference = y_test[:FHE_SAMPLES]

In [10]:
# Predict in FHE for a few examples
time_begin = time.time()
y_pred_fhe = model.predict(x_test, fhe="execute")
print(f"Execution time: {(time.time() - time_begin) / len(x_test):.2f} seconds per sample")

Execution time: 0.53 seconds per sample


In [11]:
# Check FHE predictions against the clear (quantized) Concrete model
print(f"Ground truth:       {y_reference}")
print(f"Prediction clear:   {y_pred}")
print(f"Prediction FHE:     {y_pred_fhe}")

Ground truth:       [0 0 0 1 0 1 0 0 0 0]
Prediction clear:   [0 0 0 1 0 1 0 0 0 0]
Prediction FHE:     [0 0 0 1 0 1 0 0 0 0]


In [12]:
print(
    f"{numpy.sum(y_pred_fhe == y_pred)}/{FHE_SAMPLES} "
    "predictions are identical between the FHE model and the clear quantized model."
)

10/10 predictions are identical between the FHE model and the clear quantized model.


Here, FHE inference is run on 10 samples to check that it gives the same results as the clear model. Running FHE inference over the full dataset (to get the real FHE average precision score) would be too expensive.
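
The comparison above can be wrapped in a small helper that returns the fraction of matching predictions. This is a minimal sketch: the function name `agreement_rate` is hypothetical, and the arrays are copied from the output printed earlier.

```python
import numpy


def agreement_rate(y_a, y_b):
    """Return the fraction of positions where two prediction arrays match."""
    y_a = numpy.asarray(y_a)
    y_b = numpy.asarray(y_b)
    if y_a.shape != y_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(numpy.mean(y_a == y_b))


# Predictions copied from the output printed earlier in this notebook
y_clear_example = numpy.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])
y_fhe_example = numpy.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])

rate = agreement_rate(y_clear_example, y_fhe_example)
print(f"Agreement between clear and FHE predictions: {rate:.0%}")
```

To check agreement on more samples without paying the full FHE cost, recent Concrete ML versions are assumed here to accept `fhe="simulate"` in `predict()`. Please check this against the documentation of the installed version.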

## Conclusion

Fully Homomorphic Decision Trees are now within reach for any data scientist familiar with scikit-learn APIs.