feat: support `from_sklearn` for trees (#689)

Conversation
Force-pushed from b88907c to b344aa4
Thanks! I have a few comments / questions.
Force-pushed from 016d323 to f0d2dfd
New flaky test, it seems:
That's a test I added recently. I might have an idea of why this happens, actually. I'll let you open an issue @fd0r and I'll see if I can reproduce it.
@RomanBredehoft, I already have a fix, just adding some tests as we speak.
Quick summary of things to do here:
Anything else @RomanBredehoft @jfrery?
@@ -773,7 +773,7 @@ def quant(self, values: numpy.ndarray) -> numpy.ndarray:

         return qvalues.astype(numpy.int64)

-    def dequant(self, qvalues: numpy.ndarray) -> Union[numpy.ndarray, Tracer]:
+    def dequant(self, qvalues: numpy.ndarray) -> Union[float, int, numpy.ndarray, Tracer]:
How can the following create a float or an int if the input is a numpy.ndarray?

values = self.scale * (qvalues - numpy.asarray(self.zero_point, dtype=numpy.float64))

This should remain an array, no? Looks like something is fishy here.
The thing is that I'm cheating a bit when using this function: I provide a Python int instead of a numpy array.
Not sure we would want to keep such a change if it's avoidable, though. I guess this is related to the other similar change @jfrery saw.
Risky change here: dequant should return a float.
I can remove int from the return type of dequant.
I would prefer both, personally. I don't see why we should allow dequant to return a float (worse, an int) rather than a numpy array!
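For what it's worth, here is a minimal standalone sketch of the behavior under discussion (plain numpy, not the actual quantizer code): passing a Python int through the dequantization expression collapses the result to a numpy scalar rather than an array.

```python
import numpy

scale = 0.5
zero_point = 3

# Array in, array out: the usual dequantization path
qvalues = numpy.array([1, 2, 3])
out = scale * (qvalues - numpy.asarray(zero_point, dtype=numpy.float64))
print(type(out))  # <class 'numpy.ndarray'>

# Plain Python int in: numpy collapses the result to a 0-d scalar, whose type
# is numpy.float64 (a subclass of float), hence the widened annotation
out_scalar = scale * (7 - numpy.asarray(zero_point, dtype=numpy.float64))
print(type(out_scalar))  # <class 'numpy.float64'>
```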
else:
    for n_bits, cml_tolerance, sklearn_tolerance in [
        (max_n_bits, 0.8, 1e-5),
        (reasonable_n_bits, 1.8, 1.8),
Why these 2 configs? How were they chosen?
Mainly to check that with an increasing number of bits we get better results. The thresholds were chosen by trial and error (the easiest way I found).
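To illustrate the "more bits, better results" intuition behind the two configs, a standalone sketch using uniform quantization (illustrative only, not the suite's actual check):

```python
import numpy

def quantize_dequantize(values, n_bits):
    # Uniform quantization over the value range, then back to floats
    scale = (values.max() - values.min()) / (2**n_bits - 1)
    return numpy.round((values - values.min()) / scale) * scale + values.min()

preds = numpy.random.RandomState(0).uniform(size=1000)
err_hi = numpy.abs(quantize_dequantize(preds, 12) - preds).max()
err_lo = numpy.abs(quantize_dequantize(preds, 6) - preds).max()
assert err_hi < err_lo  # higher bit-width: a tighter tolerance can be enforced
```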
tests/sklearn/test_sklearn_models.py (outdated)
# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
if n_bits <= 8:
We have constants like N_BITS_THRESHOLD_FOR_CRT_FHE_CIRCUITS in our tests, if that's what you had in mind here.
Actually that won't really work with trees since we have rounding. So I don't think it's possible, from parameters only, to know if the circuit will use the CRT representation.
But what is this <= 8 then?
Oh yeah, I should use reasonable_n_bits here.
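A sketch of the agreed fix, assuming reasonable_n_bits is available in the test scope (the branch body is my assumption, since the quoted diff cuts off after the condition):

```python
# Replace the magic 8 with the shared test constant
mode = "disable"
if n_bits <= reasonable_n_bits:
    mode = "simulate"
```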
Added some new comments, but overall I agree with @jfrery's ones.
In addition to the comments above, can you please:
- improve the ImportingFromScikitLearn notebook to make it work exactly like ClassifierComparison
- add a comparison of the imported XGBRegressor in the https://github.com/zama-ai/concrete-ml/blob/main/docs/advanced_examples/XGBRegressor.ipynb notebook
# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
I think you should use simulate here
Oh yeah, I forgot to activate simulation mode with a reasonable number of bits.
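Something like the following, assuming the `fhe` keyword that Concrete ML estimators accept on predict (values "disable", "simulate", "execute"):

```python
# Compile, then run inference in simulation instead of skipping FHE entirely
concrete_model.compile(x)
predictions = concrete_model.predict(x, fhe="simulate")
```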
# Compile both the initial Concrete ML model and the loaded one
concrete_model.compile(x)
mode = "disable"
again, simulate would be better
same
Here we should show nice graphs with decision boundaries, like in ClassifierComparison. We shouldn't show the variation of accuracy with n_bits, just good configs. Ideally the default configs (don't set n_bits in import_sklearn) should be used and they should work well (except for PTQ, where you should set n_bits; we don't have a good default for that).
Can you add a section where you import the xgboost sklearn classifier to compare to the CML xgb one?
src/concrete/ml/sklearn/base.py (outdated)
    cls,
    sklearn_model: sklearn.base.BaseEstimator,
    X: Optional[numpy.ndarray] = None,
    n_bits: int = 8,
So 8 bits is OK? I remember it wasn't always the best choice in the graphs and we had discussed 9 bits. What is the accuracy on the regressors?
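For context, a hypothetical usage of the constructor under discussion (the method name and signature follow the quoted diff; the data and model choice are made up):

```python
import numpy
from sklearn.tree import DecisionTreeClassifier as SklearnDTC
from concrete.ml.sklearn import DecisionTreeClassifier

X = numpy.random.RandomState(0).uniform(size=(100, 4))
y = (X[:, 0] > 0.5).astype(int)

sk_model = SklearnDTC(max_depth=4).fit(X, y)

# n_bits=8 is the default value being questioned above
cml_model = DecisionTreeClassifier.from_sklearn_model(sk_model, X=X, n_bits=8)
```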
Looks good to me. Thanks!
Thanks a lot!!
The notebooks can still be improved, but I think I addressed most comments.
Looks good, just a few questions!
if model_inputs is None:
    # If we have no data we can just randomly generate a dataset
    assert isinstance(n_features, int)
    calibration_set_size = 100_000
Do you really need 100,000 samples here?
Probably not, it was to be safe when developing. I can reduce this.
Changed it to 1000
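A sketch of the fallback after that change (the uniform generation scheme is an assumption; only the size and names come from this thread):

```python
import numpy

if model_inputs is None:
    # No user data: randomly generate a calibration dataset instead
    assert isinstance(n_features, int)
    calibration_set_size = 1_000  # reduced from 100_000 per the review
    model_inputs = numpy.random.uniform(size=(calibration_set_size, n_features))
```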
"""Test `from_sklearn_model` functionnality of tree-based models.""" | ||
|
||
numpy.random.seed(0) | ||
os.environ["TREES_USE_ROUNDING"] = str(int(use_rounding)) |
are you sure this works?
Yes it does.
I got to rebase on
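For reference, a sketch of the round-trip: the test writes the flag before the model code reads it. The reading side shown here is assumed for illustration, not the actual Concrete ML code.

```python
import os

# Test side: force rounding on or off before the model is built
use_rounding = True
os.environ["TREES_USE_ROUNDING"] = str(int(use_rounding))  # "1" or "0"

# Library side (assumed): read the flag back at tree-compilation time
rounding_enabled = bool(int(os.environ.get("TREES_USE_ROUNDING", "1")))
```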
Support `from_sklearn` for tree-based models. Two options:
- Quantization from thresholds: the main idea is to consider the thresholds of the tree nodes for quantization, so that no data is needed.
- Quantization from data: build a quantizer from the data provided by the user and quantize the thresholds based on it.

This also raises the question of non-uniform input quantization. We could quantize the data based on the thresholds, thus reducing the number of bits required to log2(max_{feature}(node_{feature})). That would leak the thresholds used in the model per feature, but not the structure of the tree itself, while significantly increasing the number of bits required. We could try to automatically determine the n_bits needed to properly represent all thresholds, but this might result in a very high bit-width.

This commit also changes the comparison so that it uses truncation instead of rounding.
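To make the "quantization from thresholds" idea concrete, an illustrative sketch (names and approach are mine, not the PR's implementation): each feature only ever gets compared against its own decision thresholds, so mapping values to threshold-bin indices needs about log2(number of thresholds per feature) bits.

```python
import numpy

def quantize_from_thresholds(x, thresholds_per_feature):
    """Map each feature value to its bin index among the sorted tree thresholds.

    x: array of shape (n_samples, n_features)
    thresholds_per_feature: one sorted 1-D array of thresholds per feature
    """
    q = numpy.empty_like(x, dtype=numpy.int64)
    for j, thresholds in enumerate(thresholds_per_feature):
        # Index i means: thresholds[i-1] <= value < thresholds[i]
        q[:, j] = numpy.searchsorted(thresholds, x[:, j], side="right")
    return q

x = numpy.array([[0.1, 5.0], [0.7, 2.5]])
thresholds = [numpy.array([0.3, 0.6]), numpy.array([3.0])]
print(quantize_from_thresholds(x, thresholds))  # [[0 1], [2 0]]
```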
Coverage passed ✅