# Vertically Federated XGB (SecureBoost) 

>The following code is for demonstration only. It is **NOT for production use** because of system security concerns; please **DO NOT** use it directly in production.

Welcome to this tutorial on SecureBoost!

In this tutorial, we will explore how to use SecretFlow's tree modeling capabilities to perform vertical federated learning with the SecureBoost algorithm. SecureBoost is a classical algorithm that prioritizes protecting label information on vertically partitioned datasets. It does so with homomorphic encryption, which allows labels to be encrypted and the key tree-boosting steps to be executed on ciphertext. The result is a distributed boosted-trees model composed of PYUObjects, in which each party knows only its own split points. The implementation uses both HEU and PYU devices to achieve high performance.
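To make the idea of "boosting steps in ciphertext" more concrete, here is a minimal sketch of additively homomorphic encryption using a toy Paillier scheme in pure Python. It is an illustration only: the tiny fixed key, the `SCALE` fixed-point factor, and the helper names `encrypt`, `decrypt`, and `add_encrypted` are assumptions made for this example, and it is not how HEU is implemented (the tutorial's configuration uses the OU scheme through HEU).

```python
import math
import random

# Toy Paillier key built from two known Mersenne primes. Far too small to be secure.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # modular inverse; valid for the generator g = N + 1
SCALE = 10**6  # fixed-point scale that turns floats into integers


def encrypt(value: float) -> int:
    """Encrypt a float after fixed-point encoding (negatives wrap modulo N)."""
    m = round(value * SCALE) % N
    r = random.randrange(1, N)
    return (1 + m * N) * pow(r, N, N2) % N2


def decrypt(cipher: int) -> float:
    """Decrypt and map the upper half of the range back to negative values."""
    m = (pow(cipher, LAM, N2) - 1) // N * MU % N
    if m > N // 2:
        m -= N
    return m / SCALE


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % N2


# The label holder encrypts per-sample gradients and shares only ciphertexts.
gradients = [0.25, -0.5, 0.125, -0.0625]
encrypted = [encrypt(g) for g in gradients]

# Another party sums the ciphertexts of the samples that fall in one bucket
# without ever seeing the individual gradient values.
bucket = [0, 2, 3]
bucket_sum = encrypted[bucket[0]]
for i in bucket[1:]:
    bucket_sum = add_encrypted(bucket_sum, encrypted[i])

# Only the key holder can decrypt the aggregated value.
print(decrypt(bucket_sum))
```

The point of the sketch is the division of roles: the party holding the secret key encrypts and decrypts, while the other party only ever adds ciphertexts.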

Let's dive into the details and learn how to use SecureBoost with SecretFlow!

### Set up the devices

Similar to other algorithms, setting up a secure cluster and specifying devices is necessary for SecureBoost implementation. 

In particular, SecureBoost requires an HEU device, which is used to encrypt the labels and protect sensitive information.

In [1]:
import spu
from sklearn.metrics import roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import (
    Sgb,
    get_classic_XGB_params,
    get_classic_lightGBM_params,
)
from secretflow.ml.boost.sgb_v.model import load_model
import pprint

pp = pprint.PrettyPrinter(depth=4)

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

The version of SecretFlow: 1.4.0.dev20240222


In [None]:
alice_ip = '127.0.0.1'
bob_ip = '127.0.0.1'
ip_party_map = {bob_ip: 'bob', alice_ip: 'alice'}

_system_config = {'lineage_pinning_enabled': False}
sf.shutdown()
# init cluster
sf.init(
    ['alice', 'bob'],
    address='local',
    _system_config=_system_config,
    object_store_memory=5 * 1024 * 1024 * 1024,
)

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'id': 'local:0', 'address': alice_ip + ':12945'},
        {'party': 'bob', 'id': 'local:1', 'address': bob_ip + ':12946'},
        # {'party': 'carol', 'id': 'local:2', 'address': '127.0.0.1:12347'},
    ],
    'runtime_config': {
        # SEMI2K supports 2PC/3PC, ABY3 supports only 3PC, CHEETAH supports only 2PC.
        # Make sure the number of nodes above matches the protocol's party count.
        'protocol': spu.ProtocolKind.SEMI2K,
        'field': spu.FieldType.FM128,
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',
    'he_parameters': {
        # ou is a fast encryption scheme that is as secure as Paillier.
        'schema': 'ou',
        'key_pair': {
            'generate': {
                # bit size should be 2048 to provide sufficient security.
                'bit_size': 2048,
            },
        },
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

2024-02-22 17:39:52,841	INFO worker.py:1538 -- Started a local Ray instance.
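The comment in `cluster_def` notes that the number of nodes has to match the protocol. A small sanity check along those lines could look like the sketch below. The helper `check_party_count` and the table `SUPPORTED_PARTY_COUNTS` are hypothetical names introduced for this example, the party counts are copied from that comment, and protocols are referred to by plain strings so the snippet runs without `spu` installed.

```python
# Party counts per protocol, transcribed from the config comment (assumption:
# the comment is accurate for the installed SPU version).
SUPPORTED_PARTY_COUNTS = {"SEMI2K": {2, 3}, "ABY3": {3}, "CHEETAH": {2}}


def check_party_count(protocol_name: str, nodes: list) -> None:
    """Raise ValueError if the node list does not fit the protocol."""
    allowed = SUPPORTED_PARTY_COUNTS[protocol_name]
    if len(nodes) not in allowed:
        raise ValueError(
            f"{protocol_name} expects {sorted(allowed)} parties, got {len(nodes)}"
        )


# Same two-node layout as in the tutorial's cluster_def.
demo_nodes = [
    {"party": "alice", "id": "local:0", "address": "127.0.0.1:12945"},
    {"party": "bob", "id": "local:1", "address": "127.0.0.1:12946"},
]
check_party_count("SEMI2K", demo_nodes)  # passes silently for two parties
```

Calling `check_party_count("ABY3", demo_nodes)` would raise a `ValueError`, because the two-node layout does not fit a three-party protocol.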


In [3]:
alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])

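### Prepare Data
We prepare a vertically partitioned dataset: alice holds the first 15 feature columns and the labels, and bob holds the remaining feature columns.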

In [4]:
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()
x, y = ds['data'], ds['target']

v_data = FedNdarray(
    {
        alice: (alice(lambda: x[:, :15])()),
        bob: (bob(lambda: x[:, 15:])()),
    },
    partition_way=PartitionWay.VERTICAL,
)
label_data = FedNdarray(
    {alice: (alice(lambda: y)())},
    partition_way=PartitionWay.VERTICAL,
)
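For readers new to vertical partitioning, the following NumPy-only sketch shows what the column split above amounts to. It uses a small random matrix instead of the breast cancer data, and the names `x_demo`, `alice_part`, and `bob_part` are made up for this illustration; no SecretFlow devices are involved.

```python
import numpy as np

# A stand-in feature matrix: 6 samples, 30 features (the same width as the
# breast cancer dataset used in the tutorial).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(6, 30))

# Vertical partitioning: every party sees all rows but only its own columns.
alice_part = x_demo[:, :15]
bob_part = x_demo[:, 15:]

print(alice_part.shape, bob_part.shape)

# Stacking the column blocks side by side recovers the original matrix,
# which is only possible if the rows are aligned across parties.
restored = np.hstack([alice_part, bob_part])
print(np.array_equal(restored, x_demo))
```

In a real deployment the rows would first have to be aligned across parties, since each party only ever holds its own block.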

### Prepare Params

In [5]:
params = get_classic_XGB_params()
params['num_boost_round'] = 3
params['max_depth'] = 3
pp.pprint(params)

{'audit_paths': {},
 'base_score': 0.0,
 'batch_encoding_enabled': True,
 'bottom_rate': 0.5,
 'colsample_by_tree': 1.0,
 'enable_early_stop': False,
 'enable_goss': False,
 'enable_monitor': False,
 'enable_packbits': False,
 'enable_quantization': False,
 'eval_metric': 'roc_auc',
 'first_tree_with_label_holder_feature': True,
 'fixed_point_parameter': 20,
 'gamma': 0.0,
 'learning_rate': 0.3,
 'max_depth': 3,
 'max_leaf': 15,
 'num_boost_round': 3,
 'objective': 'logistic',
 'quantization_scale': 10000.0,
 'reg_lambda': 0.1,
 'rowsample_by_tree': 1.0,
 'save_best_model': False,
 'seed': 1212,
 'sketch_eps': 0.1,
 'stopping_rounds': 1,
 'stopping_tolerance': 0.001,
 'top_rate': 0.3,
 'tree_growing_method': 'level',
 'validation_fraction': 0.1}
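Several of these parameters, such as `reg_lambda` and `gamma`, come from the standard XGBoost formulation. As a rough guide to what they control, here is a plaintext sketch of the textbook XGBoost split gain for a logistic objective. The function names are hypothetical, and this is not SecretFlow's internal code, which evaluates the corresponding sums on encrypted values.

```python
import numpy as np


def logistic_grad_hess(y, margin):
    """First and second derivatives of the logistic loss w.r.t. the margin."""
    p = 1.0 / (1.0 + np.exp(-margin))
    return p - y, p * (1.0 - p)


def split_gain(g, h, left_mask, reg_lambda=0.1, gamma=0.0):
    """Textbook XGBoost gain for sending `left_mask` samples to the left child."""

    def score(g_sum, h_sum):
        return g_sum**2 / (h_sum + reg_lambda)

    g_left, h_left = g[left_mask].sum(), h[left_mask].sum()
    g_right, h_right = g[~left_mask].sum(), h[~left_mask].sum()
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Four samples, starting from a zero margin (predicted probability 0.5).
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
g, h = logistic_grad_hess(y_demo, np.zeros(4))

good_split = split_gain(g, h, np.array([True, True, False, False]))
bad_split = split_gain(g, h, np.array([True, False, True, False]))
print(good_split, bad_split)
```

A larger `reg_lambda` shrinks every score, and `gamma` is subtracted from each candidate gain, so both make splits harder to justify.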


### Run Sgb
We create an Sgb object with the HEU device and train it on the data.

In [6]:
sgb = Sgb(heu)
model = sgb.train(params, v_data, label_data)

INFO:root:Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party bob.
INFO:root:training the first tree with label holder only.
INFO:root:train tree context set up.
(SGBActor pid=116420) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(SGBActor pid=116420) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(SGBActor pid=116420) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(SGBActor pid=116420) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/

(_run pid=109091) [2024-02-22 17:40:02.833] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(HEUSkKeeper(heu_id=140509190467248, party=alice) pid=115648) [2024-02-22 17:40:03.036] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(_run pid=109555) [2024-02-22 17:40:03.003] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(HEUEvaluator(heu_id=140509190467248, party=bob) pid=116302) [2024-02-22 17:40:03.045] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63


INFO:root:epoch 1 time 4.750282560009509s
INFO:root:train tree context set up.
INFO:root:begin train tree.
INFO:root:epoch 2 time 0.18019765900680795s


### Model Evaluation
Now we can compare the model's predictions with the true labels.

In [7]:
yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"auc: {roc_auc_score(y, yhat)}")

(_run pid=109403) [2024-02-22 17:40:08.134] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
auc: 0.9970072934834311


(_run pid=109403) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=109403) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=109403) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(_run pid=109403) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.


### Model Save and Load
We can now save the model and load it for later use. Note that the model is a distributed entity, so we save it to and load it from multiple parties.

Let's first define the paths.

In [8]:
# each participating party needs its own storage location
saving_path_dict = {
    # in production we may use remote oss, for example.
    device: "./" + device.party
    for device in v_data.partitions.keys()
}

Then let's save the model.

In [9]:
r = model.save_model(saving_path_dict)
wait(r)

Now you can check the files at the specified locations.
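One way to do that check from Python is sketched here. It assumes the local simulation setup used in this tutorial, where both parties write to the same filesystem and the paths are `./alice` and `./bob` (derived from `"./" + device.party`). The helper `missing_paths` is a hypothetical name introduced for this example.

```python
import os


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on the local filesystem."""
    return [p for p in paths if not os.path.exists(p)]


# An empty list means every expected location was written.
print(missing_paths(["./alice", "./bob"]))
```

In a real multi-machine deployment each party would have to run such a check on its own host, since no single machine can see all of the saved pieces.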

Finally, let's load the model and do a sanity check.

In [10]:
# alice is our label holder
model_loaded = load_model(saving_path_dict, alice)
fed_yhat_loaded = model_loaded.predict(v_data, alice)
yhat_loaded = reveal(fed_yhat_loaded.partitions[alice])

assert (
    yhat == yhat_loaded
).all(), "loaded model predictions should match original, yhat {} vs yhat_loaded {}".format(
    yhat, yhat_loaded
)

## More Training Options

What if we want to train a boosting model in LightGBM style? We can do that by using leaf-wise tree growth and enabling GOSS.

In [11]:
params = get_classic_lightGBM_params()
params['num_boost_round'] = 3
params['max_leaf'] = 2**3
pp.pprint(params)
model = sgb.train(params, v_data, label_data)

INFO:root:Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party bob.
INFO:root:training the first tree with label holder only.
INFO:root:train tree context set up.


{'audit_paths': {},
 'base_score': 0.0,
 'batch_encoding_enabled': True,
 'bottom_rate': 0.5,
 'colsample_by_tree': 1.0,
 'enable_early_stop': False,
 'enable_goss': True,
 'enable_monitor': False,
 'enable_packbits': False,
 'enable_quantization': False,
 'eval_metric': 'roc_auc',
 'first_tree_with_label_holder_feature': True,
 'fixed_point_parameter': 20,
 'gamma': 0.0,
 'learning_rate': 0.3,
 'max_depth': 5,
 'max_leaf': 8,
 'num_boost_round': 3,
 'objective': 'logistic',
 'quantization_scale': 10000.0,
 'reg_lambda': 0.1,
 'rowsample_by_tree': 1.0,
 'save_best_model': False,
 'seed': 1212,
 'sketch_eps': 0.1,
 'stopping_rounds': 1,
 'stopping_tolerance': 0.001,
 'top_rate': 0.3,
 'tree_growing_method': 'leaf',
 'validation_fraction': 0.1}


(SGBActor pid=118378) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(SGBActor pid=118378) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(SGBActor pid=118378) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(SGBActor pid=118378) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
INFO:root:begin train tree.
INFO:root:epoch 0 time 6.653825129964389s
INFO:root:train tree context set up.
INFO:root:begin train tree.
INFO:root:epoch 1 time 0.5482736540143378s
INFO:root:train tree context set up.
INFO:root:begin train 

In [12]:
yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"auc: {roc_auc_score(y, yhat)}")

auc: 0.9966901855081655
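GOSS (gradient-based one-side sampling) is the sampling idea from LightGBM that the `enable_goss`, `top_rate`, and `bottom_rate` parameters refer to. The sketch below follows the description in the LightGBM paper: keep the samples with the largest absolute gradients, randomly sample from the rest, and up-weight the sampled small-gradient rows. How SecretFlow interprets these rates internally is an assumption here, and `goss_sample` is a hypothetical helper written for this illustration.

```python
import numpy as np


def goss_sample(grad, top_rate=0.3, bottom_rate=0.5, seed=1212):
    """Return (selected indices, per-sample weights) following the GOSS idea."""
    n = len(grad)
    top_n = int(round(top_rate * n))
    rand_n = int(round(bottom_rate * n))

    # Sort by absolute gradient, largest first, and always keep the top part.
    order = np.argsort(-np.abs(grad))
    top_idx = order[:top_n]

    # Randomly sample from the remaining small-gradient rows.
    rng = np.random.default_rng(seed)
    rest_idx = rng.choice(order[top_n:], size=rand_n, replace=False)

    # Amplify the sampled small-gradient rows to compensate for the sampling.
    weights = np.ones(n)
    weights[rest_idx] = (1.0 - top_rate) / bottom_rate

    chosen = np.concatenate([top_idx, rest_idx])
    return chosen, weights[chosen]


grad_demo = np.array([0.9, -0.05, 0.4, -0.8, 0.1, 0.02, -0.3, 0.6, -0.15, 0.2])
chosen, weights = goss_sample(grad_demo)
print(sorted(chosen.tolist()))
print(weights)
```

With the default rates, 8 of the 10 rows are used per tree: the 3 with the largest gradients at weight 1, and 5 randomly chosen others at weight (1 - 0.3) / 0.5.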


## Conclusion

Great job on completing the tutorial!

In conclusion, we have learned how to train tree models in SecretFlow and explored SecureBoost, a high-performance boosting algorithm designed specifically for vertically partitioned datasets. SecureBoost is similar to XGBoost but focuses on protecting sensitive labels in vertical learning scenarios. By using homomorphic encryption and PYUObjects, SecureBoost lets us train powerful distributed boosted-tree models while maintaining the privacy and security of our data.

Thank you for participating in this tutorial, and we hope you found it informative and helpful!
