# Decision Tree Classifier

Decision trees are a popular class of algorithms in machine learning. In this notebook, we build a simple decision tree classifier using `scikit-learn` and show that it can be executed homomorphically using Concrete Numpy.

Thanks to Concrete ML, converting a tree that works over quantized data to its Fully Homomorphic Encryption (FHE) equivalent takes only a few lines of code.
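To make "quantized data" more concrete, here is a minimal sketch of uniform quantization. It assumes each feature is mapped linearly onto `n_bits`-bit unsigned integers using the minimum and maximum seen in a calibration set. The helper names `fit_quantizer` and `quantize` and the sample values are hypothetical, and the scheme Concrete ML actually applies may differ.

```python
def fit_quantizer(values, n_bits):
    """Return (low, scale) mapping the observed range onto [0, 2**n_bits - 1]."""
    low, high = min(values), max(values)
    levels = 2**n_bits - 1
    # Guard against a constant feature, whose range would be zero
    scale = (high - low) / levels if high > low else 1.0
    return low, scale


def quantize(value, low, scale, n_bits):
    """Round a float to the nearest integer level, clipped to the valid range."""
    level = round((value - low) / scale)
    return max(0, min(2**n_bits - 1, level))


# Hypothetical word-frequency feature (illustrative values only)
calibration = [0.0, 0.32, 1.5, 4.0]
low, scale = fit_quantizer(calibration, n_bits=3)
quantized = [quantize(v, low, scale, n_bits=3) for v in calibration]
print(quantized)
```

Once features are integers, each tree node's test becomes an integer comparison against a quantized threshold.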

Let's dive in!

# The Use Case

The use case is a spam classification task from OpenML, available at https://www.openml.org/d/44

The dataset contains 4601 samples. Each sample comes with pre-extracted features (such as word frequencies) and a class label: `0` for a normal e-mail and `1` for spam.

Let's first get the dataset.

In [1]:
import numpy
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

features, classes = fetch_openml(data_id=44, as_frame=False, cache=True, return_X_y=True)
classes = classes.astype(numpy.int64)

x_train, x_test, y_train, y_test = train_test_split(
    features,
    classes,
    test_size=0.15,
    random_state=42,
)
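
As a quick sanity check on the split: scikit-learn rounds a fractional `test_size` up to a whole number of samples, so the sizes of the two sets can be worked out by hand from the 4601 samples mentioned above.

```python
import math

n_samples = 4601  # dataset size, as stated in the use case description
test_size = 0.15  # fraction passed to train_test_split

# A float test_size is rounded up; the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(f"train: {n_train}, test: {n_test}")
```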

### Let's use scikit-learn's cross-validation tools to find the best hyperparameters for our model

In [2]:
# Find best hyperparameters with cross validation
from sklearn.model_selection import GridSearchCV

from concrete.ml.sklearn import DecisionTreeClassifier as ConcreteDecisionTreeClassifier

# List of hyperparameters to tune
param_grid = {
    # Note: recent scikit-learn releases no longer accept "auto" for max_features
    "max_features": [None, "auto", "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80, 100],
    "min_samples_split": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80, 100],
    "max_depth": [None, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80, 100],
}

grid_search = GridSearchCV(
    ConcreteDecisionTreeClassifier(),
    param_grid,
    cv=10,
    scoring="average_precision",
    n_jobs=-1,
)

gs_results = grid_search.fit(x_train, y_train)
print("Best hyperparameters:", gs_results.best_params_)
print("Best score:", gs_results.best_score_)

# Build the model with the best hyperparameters found by the grid search
model = ConcreteDecisionTreeClassifier(**gs_results.best_params_)
model.fit(x_train, y_train)

Best hyperparameters: {'max_depth': 8, 'max_features': None, 'min_samples_leaf': 40, 'min_samples_split': 100}
Best score: 0.7608810439720429
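
Grid search is exhaustive, so its cost grows with the product of the list lengths in `param_grid`. The sketch below counts the candidates and the resulting number of model fits for the grid and `cv=10` used above. It only does arithmetic and does not train anything.

```python
import math

# Same grid as in the cell above
param_grid = {
    "max_features": [None, "auto", "sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80, 100],
    "min_samples_split": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80, 100],
    "max_depth": [None, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40, 60, 80, 100],
}
cv_folds = 10

# One candidate per combination of values, one fit per candidate and fold
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates, {n_fits} fits")
```

This is why `n_jobs=-1` is useful here: the fits are independent and can run in parallel on all available cores.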


### Let's compute some metrics on the test set.

In [3]:
# Compute average precision on the test set
# Note: y_pred contains hard 0/1 labels rather than probability scores
from sklearn.metrics import average_precision_score

y_pred = model.predict(x_test)
average_precision = average_precision_score(y_test, y_pred)
print(f"Average precision-recall score: {average_precision:0.2f}")

Average precision-recall score: 0.85
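
Average precision summarizes a precision-recall curve as the sum over thresholds of (R_n - R_{n-1}) * P_n. The sketch below is a hypothetical pure-Python re-implementation of that definition (not scikit-learn's code) on made-up toy data. It illustrates that feeding hard 0/1 labels, as `model.predict` returns, leaves only a single informative threshold and can give a different value than feeding continuous scores.

```python
def average_precision(y_true, y_score):
    """Sum of (recall step) * precision over thresholds taken in decreasing order."""
    n_positive = sum(y_true)
    ap, previous_recall = 0.0, 0.0
    for threshold in sorted(set(y_score), reverse=True):
        predicted = [score >= threshold for score in y_score]
        true_pos = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
        precision = true_pos / sum(predicted)
        recall = true_pos / n_positive
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy data (illustrative only, unrelated to the spam dataset)
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
y_label = [int(score >= 0.5) for score in y_score]  # hard predictions

ap_scores = average_precision(y_true, y_score)
ap_labels = average_precision(y_true, y_label)
print(f"from scores: {ap_scores:0.3f}, from hard labels: {ap_labels:0.3f}")
```

If the model exposes class probabilities, passing those to the metric instead of hard labels would use the full curve.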


In [4]:
# Show the confusion matrix on x_test, normalized over the true classes (rows)
from sklearn.metrics import confusion_matrix

y_pred = model.predict(x_test)
true_negative, false_positive, false_negative, true_positive = confusion_matrix(
    y_test, y_pred, normalize="true"
).ravel()

num_samples = len(y_test)
num_spam = numpy.sum(y_test)

print(f"Number of test samples: {num_samples}")
print(f"Number of spam e-mails in test samples: {num_spam}")

print(f"True Negative (legit mail correctly classified) rate: {true_negative}")
print(f"False Positive (legit mail classified as spam) rate: {false_positive}")
print(f"False Negative (spam mail classified as legit) rate: {false_negative}")
print(f"True Positive (spam correctly classified) rate: {true_positive}")

Number of test samples: 691
Number of spam e-mails in test samples: 304
True Negative (legit mail correctly classified) rate: 0.9612403100775194
False Positive (legit mail classified as spam) rate: 0.03875968992248062
False Negative (spam mail classified as legit) rate: 0.21052631578947367
True Positive (spam correctly classified) rate: 0.7894736842105263
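
Because `normalize="true"` divides each row of the confusion matrix by the number of samples in that true class, the printed rates can be mapped back to raw counts. The counts below are inferred from the output above (691 test samples, 304 of them spam). They are an assumption consistent with the rates, not something the notebook prints.

```python
num_samples, num_spam = 691, 304
num_legit = num_samples - num_spam  # 387 legitimate e-mails

# Counts inferred from the printed rates (assumption, not notebook output)
true_neg, false_pos = 372, 15   # row for true class 0 (legit)
false_neg, true_pos = 64, 240   # row for true class 1 (spam)

# Row-wise normalization: divide by the size of the true class
tn_rate = true_neg / num_legit
fp_rate = false_pos / num_legit
fn_rate = false_neg / num_spam
tp_rate = true_pos / num_spam
print(tn_rate, fp_rate, fn_rate, tp_rate)

# Precision is column-wise instead: among e-mails flagged as spam, how many are spam
precision = true_pos / (true_pos + false_pos)
print(f"precision: {precision:0.3f}")
```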


### Now we are ready to move to the FHE domain

In [5]:
# First compile the model to its FHE equivalent, using a representative input set (here, the training set)
model.compile(x_train)

In [6]:
# Predict in FHE for a few examples
y_pred_fhe = model.predict(x_test[:10], use_fhe=True)

In [7]:
# Check prediction FHE vs sklearn
print(f"Prediction FHE: {y_pred_fhe}")
print(f"Prediction sklearn: {y_pred[:10]}")

# We can also check the prediction from the tensor version of the tree
y_pred_tensor = model._predict_with_tensors(x_test[:10])
print(f"Prediction tensor: {y_pred_tensor}")

Prediction FHE: [0 0 0 1 0 1 0 0 0 0]
Prediction sklearn: [0 0 0 1 0 1 0 0 0 0]
Prediction tensor: [0 0 0 1 0 1 0 0 0 0]
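
The "tensor version of the tree" replaces branching with comparisons and matrix products, which suits FHE because the same operations run whatever the input is. The sketch below shows one such formulation (in the style of the GEMM strategy from the Hummingbird project) on a hypothetical three-leaf tree over integer features. All names and matrices are illustrative assumptions, not necessarily the tensors Concrete ML builds.

```python
def matvec(vector, matrix):
    """Row vector times matrix, with the matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]


# Hypothetical tree: node 0 tests x[0] <= 3, node 1 tests x[1] <= 5
FEATURE_SELECT = [[1, 0], [0, 1]]  # features x nodes: which feature each node reads
THRESHOLDS = [3, 5]                # one threshold per node
# nodes x leaves: +1 if the leaf lies in the node's left subtree, -1 if right, 0 otherwise
PATHS = [[1, 1, -1], [1, -1, 0]]
LEFT_COUNTS = [2, 1, 0]            # number of "go left" decisions on the path to each leaf
LEAF_CLASS = [0, 1, 1]             # class predicted by each leaf


def tensor_predict(x):
    """Evaluate the tree without any data-dependent branching."""
    selected = matvec(x, FEATURE_SELECT)
    goes_left = [int(s <= t) for s, t in zip(selected, THRESHOLDS)]
    path_score = matvec(goes_left, PATHS)
    # Exactly one leaf has a path score equal to its left-decision count
    leaf_one_hot = [int(p == c) for p, c in zip(path_score, LEFT_COUNTS)]
    return sum(h * c for h, c in zip(leaf_one_hot, LEAF_CLASS))


def branch_predict(x):
    """Reference implementation of the same tree with ordinary if/else."""
    if x[0] <= 3:
        return 0 if x[1] <= 5 else 1
    return 1


samples = [[2, 4], [2, 6], [4, 0]]
print([tensor_predict(x) for x in samples])
print([branch_predict(x) for x in samples])
```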


In [8]:
print(
    f"{numpy.sum(y_pred_fhe == y_pred[:10])}/"
    "10 predictions match between the FHE model and the clear sklearn model."
)

10/10 predictions match between the FHE model and the clear sklearn model.


# Conclusion

Fully homomorphic decision trees are now within reach of any data scientist familiar with the scikit-learn API.