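This code implements split learning in SecretFlow on the bankmarketing dataset with a DNN model, using the LLDP secure aggregation strategy, and compares it experimentally with the built-in DP strategy of the SecretFlow bank marketing example.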

In [None]:
# Copyright 2022 Ant Group Co., Ltd.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Author: Yuanran Song
# E-mail: 809127446@qq.com

In [None]:
import math
from typing import List

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import secretflow as sf

# Restart the local SecretFlow cluster with two simulated parties.
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')

In [None]:
Define the federated table, preprocess the data, and split it into training and test sets

In [None]:
from secretflow.utils.simulation.datasets import dataset

df = pd.read_csv(dataset('bank_marketing'), sep=';')
print(df)
# Preview the raw columns that each party would hold.
alice_data = df[["age", "job", "marital", "education", "y"]]
print(alice_data)
bob_data = df[["default", "balance", "housing", "loan", "contact",
               "day", "month", "duration", "campaign", "pdays", "previous", "poutcome"]]
from secretflow.data.split import train_test_split
from secretflow.ml.nn import SLModel
# spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
from secretflow.utils.simulation.datasets import load_bank_marketing

# Alice holds the first four features,
# while Bob holds the remaining twelve.
data = load_bank_marketing(parts={alice: (0, 4), bob: (4, 16)}, axis=1)
# Alice holds all the labels.
label = load_bank_marketing(parts={alice: (16, 17)}, axis=1)
# Each partition belongs to a device, and only the owning device can operate on its data.
print(data['age'].partitions[alice].data)
# print(data['age'].partitions[bob])
from secretflow.preprocessing.scaler import MinMaxScaler
from secretflow.preprocessing.encoder import LabelEncoder
encoder = LabelEncoder()
# Label-encode every categorical feature column, then the label itself.
for col in ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'poutcome', 'month']:
    data[col] = encoder.fit_transform(data[col])
label = encoder.fit_transform(label)
print(f"label= {type(label)},\ndata = {type(data)}")
scaler = MinMaxScaler()

data = scaler.fit_transform(data)
random_state = 1234
train_data,test_data = train_test_split(data, train_size=0.8, random_state=random_state)
train_label,test_label = train_test_split(label, train_size=0.8, random_state=random_state)
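
The features and the labels are split by two separate calls that share one `random_state`. The sketch below is a plain-Python illustration (not SecretFlow's implementation) of why a shared seed keeps rows aligned: the same seed over the same number of rows gives the same permutation. The helper `split_indices` and the toy columns are hypothetical.

```python
import random


def split_indices(n_rows, train_size, seed):
    """Shuffle row indices with a seeded RNG and cut them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_size)
    return indices[:cut], indices[cut:]


# Hypothetical toy columns standing in for the feature and label tables.
features = ["row%d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

train_idx_x, test_idx_x = split_indices(len(features), 0.8, seed=1234)
train_idx_y, test_idx_y = split_indices(len(labels), 0.8, seed=1234)

# Same seed and same length give the same permutation, so rows stay aligned.
print(train_idx_x == train_idx_y, len(train_idx_x), len(test_idx_x))
```

If the two calls used different seeds, a training feature row could be paired with the label of a different sample.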

In [None]:
Create the local base_model of each party for the vertically split scenario, and the fuse_model held by the party that owns the labels

In [None]:
def create_base_model(input_dim, output_dim,  name='base_model'):
    # Create model
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        model = keras.Sequential(
            [
                keras.Input(shape=input_dim),
                layers.Dense(100,activation ="relu" ),
                layers.Dense(output_dim, activation="relu"),
            ]
        )
        # Compile model
        model.summary()
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model
# prepare model
hidden_size = 64
alice_input_feature_num = train_data.values.partition_shape()[alice][1]
bob_input_feature_num = train_data.values.partition_shape()[bob][1]

model_base_alice = create_base_model(alice_input_feature_num, hidden_size)
model_base_bob = create_base_model(bob_input_feature_num, hidden_size)
model_base_alice()
model_base_bob()
def create_fuse_model(input_dim, output_dim, party_nums, name='fuse_model'):
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        # input
        input_layers = []
        for i in range(party_nums):
            input_layers.append(keras.Input(input_dim,))

        merged_layer = layers.concatenate(input_layers)
        fuse_layer = layers.Dense(64, activation='relu')(merged_layer)
        output = layers.Dense(output_dim, activation='sigmoid')(fuse_layer)

        model = keras.Model(inputs=input_layers, outputs=output)
        model.summary()

        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model
model_fuse = create_fuse_model(
    input_dim=hidden_size, party_nums=2, output_dim=1)
model_fuse()
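
As a rough cross-check of what `model.summary()` prints, the parameter counts can be worked out by hand. This is a small sketch that assumes the layer widths used above (a 100-unit layer and a 64-unit output in each base model, a 64-unit layer and a 1-unit output in the fuse model) and the 4/12 feature split between alice and bob; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: kernel plus bias."""
    return n_in * n_out + n_out


def base_model_params(input_dim, hidden_size, width=100):
    # Dense(width) followed by Dense(hidden_size), as in create_base_model.
    return dense_params(input_dim, width) + dense_params(width, hidden_size)


def fuse_model_params(hidden_size, party_nums, output_dim, width=64):
    # The fuse model concatenates one hidden vector per party.
    merged_dim = hidden_size * party_nums
    return dense_params(merged_dim, width) + dense_params(width, output_dim)


alice_params = base_model_params(4, 64)
bob_params = base_model_params(12, 64)
fuse_params = fuse_model_params(64, 2, 1)
print(alice_params, bob_params, fuse_params)  # 6964 7764 8321
```

Each Dense layer contributes two weight tensors (kernel and bias), so the fuse model exposes four tensors through `get_weights()`.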

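Next we define lldp_strategy, which adds layer-wise differential-privacy noise to base_model and fuse_model. Compared with embedding_dp (noise on the embedding layer) and label_dp (noise on the labels) in SecretFlow's built-in dp_strategy, a global model trained with the LLDP strategy converges faster and predicts more accurately. These improvements come from the layer-wise allocation of the privacy budget in the LLDP algorithm. The experimental results are given at the end of the code. The code below shows how to add noise to fuse_model; passing a base_model to the same function applies the noise strategy to a local model instead. Users can choose either according to their privacy requirements.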

In [None]:
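def create_lldp_strategy(model_builder):
    # delta and the per-tensor privacy budgets; a Dense layer contributes two
    # tensors (kernel and bias), so six entries cover up to three layers.
    delta = math.exp(-3)
    epsilon = [80, 80, 40, 40, 30, 30]

    def create_model():
        import math
        import numpy as np
        # Build the model once and noise that same instance, so the model
        # that is returned actually carries the perturbed weights.
        model = model_builder()
        weights = model.get_weights()
        for i in range(len(weights)):
            sigma = math.sqrt(2 * math.log(1.25 / delta)) / epsilon[i]
            # print("sigma:", sigma)
            noise = np.random.normal(0, sigma, weights[i].shape)
            # add_noise
            weights[i] = weights[i] + noise
        model.set_weights(weights)
        return model
    return create_model
model_fuse = create_lldp_strategy(model_fuse)
model_fuse()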
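
To make the layer-wise budget concrete, the following standalone sketch computes the noise scale for each entry of the epsilon list with the same formula as above, sigma = sqrt(2 ln(1.25/delta)) / epsilon, with the sensitivity assumed to be 1. It uses only the standard library and toy weight lists; the helper names are hypothetical and this is an illustration, not the notebook's training path.

```python
import math
import random


def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Noise scale of the classic Gaussian mechanism."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


def add_layerwise_noise(weights, epsilons, delta, rng):
    """Return a noised copy of `weights`, one epsilon per weight tensor."""
    noised = []
    for tensor, eps in zip(weights, epsilons):
        sigma = gaussian_sigma(eps, delta)
        noised.append([w + rng.gauss(0.0, sigma) for w in tensor])
    return noised


delta = math.exp(-3)
epsilons = [80, 80, 40, 40, 30, 30]
sigmas = [gaussian_sigma(eps, delta) for eps in epsilons]
print([round(s, 4) for s in sigmas])

# Toy flat "weight tensors" (hypothetical values) to show the mechanics.
toy_weights = [[0.1, -0.2, 0.3], [0.0, 0.5]]
noised = add_layerwise_noise(toy_weights, epsilons, delta, random.Random(0))
print(noised)
```

A larger epsilon gives a smaller sigma, so the tensors listed first receive the least noise. Note that the classic analysis of this formula is stated for epsilon below 1, so with budgets this large the values are best read as noise-scale settings rather than as a tight formal guarantee.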

Create the split learning model by initializing the three parameters of SLModel

In [None]:
base_model_dict = {
    alice: model_base_alice,
    bob:   model_base_bob
}


train_batch_size = 128

sl_model = SLModel(
    base_model_dict=base_model_dict,
    device_y=alice,
    model_fuse=model_fuse,
    # dp_strategy_dict=dp_strategy_dict,
)

sf.reveal(test_data.partitions[alice].data), sf.reveal(test_label.partitions[alice].data)
sf.reveal(train_data.partitions[alice].data), sf.reveal(train_label.partitions[alice].data)
history =  sl_model.fit(train_data,
             train_label,
             validation_data=(test_data,test_label),
             epochs=20,
             batch_size=train_batch_size,
             shuffle=True,
             verbose=1,
             validation_freq=1,)
             # dp_spent_step_freq=dp_spent_step_freq,)
print(history)
print(history.keys())

Plot the accuracy, loss, and AUC curves

In [None]:
# Plot the change of loss during training
plt.plot(history['train_loss'])
plt.plot(history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Val'], loc='upper right')
plt.show()

In [None]:
# Plot the change of accuracy during training
plt.plot(history['train_accuracy'])
plt.plot(history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()


In [None]:

# Plot the change of AUC (area under the ROC curve) during training
plt.plot(history['train_auc_1'])
plt.plot(history['val_auc_1'])
plt.title('Model Area Under Curve')
plt.ylabel('Area Under Curve')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()
global_metric = sl_model.evaluate(test_data, test_label, batch_size=128)
print(global_metric)
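
The `_1` suffix in `train_auc_1` comes from Keras numbering metric instances as they are created, so the exact key can change with how many AUC metrics were built in the same process. A small helper such as the hypothetical `find_metric_key` below can look the key up instead of hard-coding it; the example dictionary only mimics the key layout printed by `history.keys()`.

```python
def find_metric_key(history, split, metric):
    """Return the first history key like 'train_auc' or 'train_auc_1'."""
    prefix = f"{split}_{metric}"
    matches = sorted(
        k for k in history if k == prefix or k.startswith(prefix + "_")
    )
    if not matches:
        raise KeyError(f"no key starting with {prefix!r} in {sorted(history)}")
    return matches[0]


# Hypothetical history dict with the same key layout as the one printed above.
example_history = {
    "train_loss": [0.5, 0.4],
    "train_accuracy": [0.8, 0.85],
    "train_auc_1": [0.7, 0.75],
    "val_loss": [0.6, 0.5],
    "val_accuracy": [0.78, 0.8],
    "val_auc_1": [0.68, 0.72],
}
train_key = find_metric_key(example_history, "train", "auc")
val_key = find_metric_key(example_history, "val", "auc")
print(train_key, val_key)  # train_auc_1 val_auc_1
```

With this, the plotting cell could call `plt.plot(history[find_metric_key(history, 'train', 'auc')])` and keep working if the suffix changes.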

Use the SecretFlow split learning bank marketing example as the baseline for a comparison experiment

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import secretflow as sf

# Restart the cluster so that the baseline runs in a fresh environment.
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
from secretflow.utils.simulation.datasets import dataset

df = pd.read_csv(dataset('bank_marketing'), sep=';')
print(df)
# Preview the raw columns that each party would hold.
alice_data = df[["age", "job", "marital", "education", "y"]]
print(alice_data)
bob_data = df[["default", "balance", "housing", "loan", "contact",
               "day", "month", "duration", "campaign", "pdays", "previous", "poutcome"]]
from secretflow.data.split import train_test_split
from secretflow.ml.nn import SLModel
# spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
from secretflow.utils.simulation.datasets import load_bank_marketing

# Alice holds the first four features,
# while Bob holds the remaining twelve.
data = load_bank_marketing(parts={alice: (0, 4), bob: (4, 16)}, axis=1)
# Alice holds all the labels.
label = load_bank_marketing(parts={alice: (16, 17)}, axis=1)
# Each partition belongs to a device, and only the owning device can operate on its data.
print(data['age'].partitions[alice].data)
# print(data['age'].partitions[bob])
from secretflow.preprocessing.scaler import MinMaxScaler
from secretflow.preprocessing.encoder import LabelEncoder
encoder = LabelEncoder()
# Label-encode every categorical feature column, then the label itself.
for col in ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'poutcome', 'month']:
    data[col] = encoder.fit_transform(data[col])
label = encoder.fit_transform(label)
print(f"label= {type(label)},\ndata = {type(data)}")
scaler = MinMaxScaler()

data = scaler.fit_transform(data)
random_state = 1234
train_data,test_data = train_test_split(data, train_size=0.8, random_state=random_state)
train_label,test_label = train_test_split(label, train_size=0.8, random_state=random_state)
def create_base_model(input_dim, output_dim,  name='base_model'):
    # Create model
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        model = keras.Sequential(
            [
                keras.Input(shape=input_dim),
                layers.Dense(100,activation ="relu" ),
                layers.Dense(output_dim, activation="relu"),
            ]
        )
        # Compile model
        model.summary()
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model


hidden_size = 64

model_base_alice = create_base_model(4, hidden_size)
model_base_bob = create_base_model(12, hidden_size)
model_base_alice()
model_base_bob()

def create_fuse_model(input_dim, output_dim, party_nums, name='fuse_model'):
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        # input
        input_layers = []
        for i in range(party_nums):
            input_layers.append(keras.Input(input_dim, ))

        merged_layer = layers.concatenate(input_layers)
        fuse_layer = layers.Dense(64, activation='relu')(merged_layer)
        output = layers.Dense(output_dim, activation='sigmoid')(fuse_layer)

        model = keras.Model(inputs=input_layers, outputs=output)
        model.summary()

        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy", tf.keras.metrics.AUC()])
        return model


    return create_model
model_fuse = create_fuse_model(
    input_dim=hidden_size, party_nums=2, output_dim=1)
model_fuse()
base_model_dict = {
    alice: model_base_alice,
    bob:   model_base_bob
}




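Alice uses the embedding DP noise method and Bob uses LabelDP. (In the data partitioning above, Alice is the party that holds the label column, so it is unclear why the noise methods are assigned this way.)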

In [None]:
from secretflow.security.privacy import DPStrategy, GaussianEmbeddingDP, LabelDP

# Define the differential privacy (DP) operations
train_batch_size = 128
gaussian_embedding_dp = GaussianEmbeddingDP(
    noise_multiplier=0.5,
    l2_norm_clip=1.0,
    batch_size=train_batch_size,
    num_samples=train_data.values.partition_shape()[alice][0],
    is_secure_generator=False,
)
dp_strategy_alice = DPStrategy(embedding_dp=gaussian_embedding_dp)
label_dp = LabelDP(eps=64.0)
dp_strategy_bob = DPStrategy(label_dp=label_dp)
dp_strategy_dict = {alice: dp_strategy_alice, bob: dp_strategy_bob}
dp_spent_step_freq = 10
sl_model = SLModel(
    base_model_dict=base_model_dict,
    device_y=alice,
    model_fuse=model_fuse,
    dp_strategy_dict=dp_strategy_dict)
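
To get a feel for what `eps=64.0` means for label noise, the sketch below implements standard binary randomized response, where a label is kept with probability e^eps / (1 + e^eps) and flipped otherwise. This is an assumption for illustration: SecretFlow's `LabelDP` may differ in its details, and the helper names are hypothetical.

```python
import math
import random


def keep_probability(eps):
    """Probability of reporting the true label under binary randomized response."""
    # Written as 1 / (1 + e^-eps) to avoid overflow for large eps.
    return 1.0 / (1.0 + math.exp(-eps))


def randomized_response(label, eps, rng):
    """Keep a 0/1 label with probability keep_probability(eps), else flip it."""
    return label if rng.random() < keep_probability(eps) else 1 - label


rng = random.Random(0)
labels = [0, 1] * 500
for eps in (0.5, 2.0, 64.0):
    noisy = [randomized_response(y, eps, rng) for y in labels]
    flipped = sum(a != b for a, b in zip(labels, noisy))
    print(eps, round(keep_probability(eps), 4), flipped)
```

Under this mechanism a budget of 64 flips essentially no labels, so the label noise in the baseline is expected to be negligible compared with the embedding noise.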




In [None]:
sl_model.fit(train_data,
             train_label,
             validation_data=(test_data,test_label),
             epochs=10,
             batch_size=train_batch_size,
             shuffle=True,
             verbose=1,
             validation_freq=1,
             dp_spent_step_freq=None,)

sf.reveal(test_data.partitions[alice].data), sf.reveal(test_label.partitions[alice].data)
sf.reveal(train_data.partitions[alice].data), sf.reveal(train_label.partitions[alice].data)

history = sl_model.fit(
    train_data,
    train_label,
    validation_data=(test_data, test_label),
    epochs=20,
    batch_size=train_batch_size,
    shuffle=True,
    verbose=1,
    validation_freq=1,
    dp_spent_step_freq=dp_spent_step_freq,
)

print(history)
print(history.keys())

Plot the accuracy, loss, and AUC curves

In [None]:
# Plot the change of loss during training
plt.plot(history['train_loss'])
plt.plot(history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Val'], loc='upper right')
plt.show()

In [None]:
# Plot the change of accuracy during training
plt.plot(history['train_accuracy'])
plt.plot(history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()

In [None]:
# Plot the change of AUC (area under the ROC curve) during training
plt.plot(history['train_auc_1'])
plt.plot(history['val_auc_1'])
plt.title('Model Area Under Curve')
plt.ylabel('Area Under Curve')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()
global_metric = sl_model.evaluate(test_data, test_label, batch_size=128)
print(global_metric)

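The experimental results show that, in the SecretFlow trusted computing environment, the LLDP differential privacy strategy does not degrade the task performance of the original model, and two of the training metrics are better than with the built-in strategy.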