# Split Learning: Bank Marketing

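In this tutorial we use a bank marketing model as an example to show how to do split learning in a vertical (feature-partitioned) scenario with the `SecretFlow` framework.
`SecretFlow` provides a user-friendly API that makes it easy to apply your Keras or PyTorch model to split learning, turning it into a split learning model for joint modeling in vertical scenarios.
In the rest of this tutorial we walk step by step through turning an existing `keras` model into a split learning model under `secretflow` and completing a multi-party federated modeling task.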

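## What is split learning?

The core idea of split learning is to split the network structure: each device (organization) keeps only part of the network, and the sub-networks of all devices together form a complete model. During training, each device (organization) runs the forward or backward computation only on its local part of the network and passes the result to the next device. The devices train the joint model together until it converges.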
 <img alt="split_learning_tutorial.png" src="resource/split_learning_tutorial.png" width="600">  


`Alice`: holds `data_alice` and `model_base_alice`  
`Bob`: holds `data_bob`, `model_base_bob`, and `model_fuse`  

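1. `Alice` passes her own data through `model_base_alice` to get `hidden0` and sends it to `Bob`
2. `Bob` passes his own data through `model_base_bob` to get `hidden1`
3. `hidden0` and `hidden1` are fed into the `AggLayer` for aggregation, which outputs the aggregated `hidden_merge`
4. `Bob` feeds `hidden_merge` into `model_fuse`, combines it with `label` to compute the gradient, and propagates it back
5. The `AggLayer` splits the gradient into two parts, `g0` and `g1`, which are sent to `Alice` and `Bob` respectively
6. The base models of `Alice` and `Bob` are updated locally according to `g0` and `g1`  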
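The six steps above can be sketched in plain NumPy. This is only an illustration under simplifying assumptions: each model is a single linear layer, the aggregation is a plain concatenation, and the names `w_alice`, `w_bob`, `w_fuse`, and `train_step` are hypothetical. It is not how SecretFlow implements split learning internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vertically partitioned batch: Alice has 4 features, Bob has 12 and the label.
x_alice = rng.normal(size=(8, 4))
x_bob = rng.normal(size=(8, 12))
label = rng.integers(0, 2, size=(8, 1)).astype(float)

# Hypothetical one-layer stand-ins for model_base_alice, model_base_bob, model_fuse.
w_alice = rng.normal(scale=0.1, size=(4, 3))
w_bob = rng.normal(scale=0.1, size=(12, 3))
w_fuse = rng.normal(scale=0.1, size=(6, 1))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_step(lr=0.1):
    global w_alice, w_bob, w_fuse
    # Steps 1-2: each party runs its own base model on its own features.
    hidden0 = x_alice @ w_alice
    hidden1 = x_bob @ w_bob
    # Step 3: the aggregation layer concatenates the hidden outputs.
    hidden_merge = np.concatenate([hidden0, hidden1], axis=1)
    # Step 4: the label holder runs the fuse model and computes the loss gradient.
    prob = sigmoid(hidden_merge @ w_fuse)
    loss = -np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob))
    grad_logit = (prob - label) / len(label)
    grad_merge = grad_logit @ w_fuse.T
    # Step 5: split the gradient into g0 (for Alice) and g1 (for Bob).
    g0, g1 = grad_merge[:, :3], grad_merge[:, 3:]
    # Step 6: every model is updated by the party that owns it.
    w_fuse -= lr * hidden_merge.T @ grad_logit
    w_alice -= lr * x_alice.T @ g0
    w_bob -= lr * x_bob.T @ g1
    return loss, g0.shape, g1.shape


losses = [train_step()[0] for _ in range(100)]
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Only `hidden0` and `g0` cross the party boundary in this sketch; the raw features and the base-model weights stay with their owners.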


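## Task

Marketing in banking is the overall set of business and sales activities a bank carries out in a constantly changing market in order to meet customer needs and reach its business goals. With today's big data, data analysis gives banks more effective tools: it can inform customer demand analysis, the understanding of target-market trends, and broader market strategy.  
  
The data comes from [kaggle](https://www.kaggle.com/janiobachmann/bank-marketing-dataset) and is a classic bank marketing dataset: the telephone direct-marketing campaigns of a Portuguese banking institution. The target variable is whether the client subscribed to a term deposit.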

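## Data

1. 11162 samples in total: 8929 in the training set and 2233 in the test set
2. 16 features; the label is binary
3. We split the data in advance: alice holds 4 basic-attribute features, bob holds 12 banking-transaction features, and only alice holds the label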
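The numbers in this list can be checked with a few lines of arithmetic. The sketch below assumes the 80/20 train/test split used later in this tutorial, with the training size rounded down; the column names are taken from the two CSV previews in this tutorial.

```python
# Feature columns held by each party (excluding the id column and the label y).
alice_features = ["age", "job", "marital", "education"]
bob_features = ["default", "balance", "housing", "loan", "contact", "day",
                "month", "duration", "campaign", "pdays", "previous", "poutcome"]

total_samples = 11162
train_samples = int(total_samples * 0.8)      # assumes an 80/20 split, rounded down
test_samples = total_samples - train_samples

print(len(alice_features), len(bob_features), len(alice_features) + len(bob_features))
print(train_samples, test_samples)
```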

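Let's first look at what the bank marketing data looks like.  

The original data is split into bank_alice and bank_bob, stored at alice and bob respectively. The CSV files here are the raw data after splitting only, without any preprocessing.  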

In [1]:
import requests
import io
import pandas as pd
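# URL of each party's raw CSV file
data_dict = {'alice': 'https://federal.oss-cn-hangzhou.aliyuncs.com/dataset/public/bank_alice/bank.csv',
             'bob': 'https://federal.oss-cn-hangzhou.aliyuncs.com/dataset/public/bank_bob/bank.csv'}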
dataset_dict = {}
for device, url in data_dict.items():
    response = requests.get(url)
    response.raise_for_status()
    dataset_dict[device] = pd.read_csv(io.BytesIO(response.content))

Assume Alice is a new bank: it only has basic user information, plus a purchased label indicating whether the user has bought a financial product.

In [2]:
dataset_dict['alice']

Unnamed: 0,id,age,job,marital,education,y
0,0,59,admin.,married,secondary,yes
1,1,56,admin.,married,secondary,yes
2,2,41,technician,married,secondary,yes
3,3,55,services,married,secondary,yes
4,4,54,admin.,married,tertiary,yes
...,...,...,...,...,...,...
11157,11157,33,blue-collar,single,primary,no
11158,11158,39,services,married,secondary,no
11159,11159,32,technician,single,secondary,no
11160,11160,43,technician,married,secondary,no


In [3]:
type(dataset_dict['alice'])

pandas.core.frame.DataFrame

Bob is an established bank: it has each user's account balance, housing and loan status, and recent marketing feedback.

In [4]:
dataset_dict['bob']

Unnamed: 0,id,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,0,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown
1,1,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown
2,2,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown
3,3,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown
4,4,no,184,no,no,unknown,5,may,673,2,-1,0,unknown
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11157,11157,no,1,yes,no,cellular,20,apr,257,1,-1,0,unknown
11158,11158,no,733,no,no,unknown,16,jun,83,4,-1,0,unknown
11159,11159,no,29,no,no,cellular,19,aug,156,2,-1,0,unknown
11160,11160,no,0,no,yes,cellular,8,may,9,2,172,5,failure


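## Setting up the environment

Create two parties, [Alice, Bob], in the secretflow environment.  
`alice` and `bob` are two PYUs.  
Once these two objects are constructed, we are ready to start split learning.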

In [5]:
%load_ext autoreload
%autoreload 2

In [6]:
import secretflow as sf

sf.init(['alice', 'bob'], num_cpus=8, log_to_driver=True)
alice, bob = sf.PYU('alice'), sf.PYU('bob')



### Import the libraries needed for training  

In [7]:
from secretflow.data.split import train_test_split
from secretflow.model.sl_model import SLModelTF



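## Preparing the training data 

**Building the federated table**  
A federated table is a virtual concept that spans multiple parties. For the vertical scenario we define `VDataFrame`.  
1. Each party's data in a federated table is stored **locally at that party and is not allowed to leave its domain**.
2. Apart from the party that owns the data, nobody else **has access** to the underlying data storage.
3. Any operation on a federated table is scheduled by the driver to the workers. The instructions are passed down layer by layer and are only executed in the Python runtime of the specific worker. The framework guarantees that data can only be operated on when worker.device and object.device are the same.
4. The federated table design provides a central view for managing and operating on multi-party data.
5. The federated table interface is aligned with `pandas.DataFrame`, which lowers the cost of working with multi-party data.
6. The secretflow framework supports mixed plaintext/ciphertext programming. When a vertical federated table is built, a PPU is used to securely intersect and align the parties' data via MPC-PSI.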

<img alt="vdataframe.png" src="resource/vdataframe.png" width="600">  



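The `secretflow` VDataFrame provides a `read_csv` interface similar to the one in pandas. The difference is that `read_csv` takes a dictionary that defines the path of each party's data. We can use `secretflow.data.vertical.read_csv` to build the vertical federated table VDataFrame.
```
read_csv(filepath, delimiter, ppu, keys, drop_keys)
  Some of the key parameters:
    filepath: file path of each party; a relative or absolute path to a local file
    ppu: PPU device used for PSI data alignment; if not specified, the data is assumed to be pre-aligned
    keys: column name(s) used for alignment; intersecting on multiple join keys is supported
```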
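To make the role of `keys` concrete, here is a plaintext stand-in for the alignment step. It is only a sketch: real PSI computes the same intersection without revealing the non-matching ids to the other party, which this toy code does not attempt. The helper `align_by_key` and the ids 7 and 9 are made up for illustration.

```python
def align_by_key(alice_rows, bob_rows, key="id"):
    """Keep only the rows whose key appears on both sides, in the same order."""
    alice_by_key = {row[key]: row for row in alice_rows}
    bob_by_key = {row[key]: row for row in bob_rows}
    common = sorted(alice_by_key.keys() & bob_by_key.keys())
    return [alice_by_key[k] for k in common], [bob_by_key[k] for k in common]


alice_rows = [{"id": 3, "age": 55}, {"id": 1, "age": 56}, {"id": 7, "age": 30}]
bob_rows = [{"id": 1, "balance": 45}, {"id": 9, "balance": 10}, {"id": 3, "balance": 2476}]

aligned_alice, aligned_bob = align_by_key(alice_rows, bob_rows)
print([row["id"] for row in aligned_alice], [row["id"] for row in aligned_bob])
```

After alignment, row `i` on both sides refers to the same sample, which is what lets the two base models be trained on matching batches.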

Create the ppu, which is used later for secure intersection and alignment.

In [8]:
ppu = sf.PPU(sf.utils.testing.cluster_def(['alice', 'bob']))

In [9]:
from secretflow.data.vertical import read_csv
data_dict = {alice: 'https://federal.oss-cn-hangzhou.aliyuncs.com/dataset/public/bank_alice/bank.csv',
               bob: 'https://federal.oss-cn-hangzhou.aliyuncs.com/dataset/public/bank_bob/bank.csv'}

vdf = read_csv(data_dict,ppu=ppu,keys='id',drop_keys=True)

`vdf` is the resulting vertical federated table. From the global view it only holds the `Schema` of all the data.

In [10]:
vdf.columns

Index(['age', 'job', 'marital', 'education', 'y', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome'],
      dtype='object')

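Let's explore how vdf manages data a bit further.  
As the example shows, the age field belongs to alice, so the corresponding column can be obtained from alice's partition, whereas trying to get age from bob raises a `KeyError`.  
This introduces the concept of a Partition: a data shard that we define. Each partition belongs to a device, and only the owning device can operate on its data.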

In [11]:
print(vdf['age'].partitions[alice].data)
print(vdf['age'].partitions[bob])

<secretflow.device.device.pyu.PYUObject object at 0x7f602a47c970>


KeyError: <secretflow.device.device.pyu.PYU object at 0x7f5f93bdfbb0>
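The partition behaviour can be mimicked with a small toy class. `ToyVerticalTable` is a hypothetical miniature written for this explanation, not SecretFlow's `VDataFrame`; it only shows why selecting a column leaves no partition for a party that does not own it.

```python
class ToyVerticalTable:
    """Hypothetical miniature of a vertically partitioned table."""

    def __init__(self, partitions):
        # partitions maps a party name to that party's local columns.
        self.partitions = partitions

    @property
    def columns(self):
        return [name for cols in self.partitions.values() for name in cols]

    def __getitem__(self, column):
        # Keep only the parties that actually own the requested column.
        owners = {party: {column: cols[column]}
                  for party, cols in self.partitions.items() if column in cols}
        if not owners:
            raise KeyError(column)
        return ToyVerticalTable(owners)


table = ToyVerticalTable({
    "alice": {"age": [59, 56], "job": ["admin.", "admin."]},
    "bob": {"balance": [2343, 45]},
})
age = table["age"]
print(age.partitions["alice"])      # alice owns 'age'
try:
    age.partitions["bob"]           # bob has no partition for 'age'
except KeyError as err:
    print("KeyError:", err)
```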

Next we preprocess the federated table.  
We use LabelEncoder and MinMaxScaler as examples. Both have counterparts in `sklearn`, and they are used in much the same way.

In [12]:
from secretflow.preprocessing.scaler import MinMaxScaler
from secretflow.preprocessing.encoder import LabelEncoder

In [13]:
encoder = LabelEncoder()
vdf['job'] = encoder.fit_transform(vdf['job'])
vdf['marital'] = encoder.fit_transform(vdf['marital'])
vdf['education'] = encoder.fit_transform(vdf['education'])
vdf['default'] = encoder.fit_transform(vdf['default'])
vdf['housing'] = encoder.fit_transform(vdf['housing'])
vdf['loan'] = encoder.fit_transform(vdf['loan'])
vdf['contact'] = encoder.fit_transform(vdf['contact'])
vdf['poutcome'] = encoder.fit_transform(vdf['poutcome'])
vdf['month'] = encoder.fit_transform(vdf['month'])
vdf['y'] = encoder.fit_transform(vdf['y'])
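For readers unfamiliar with label encoding, the following NumPy sketch shows the idea on a single column: each distinct category is replaced by its index in the sorted list of categories. The helper `label_encode` is introduced here for illustration and runs on one plain array, whereas the cell above applies the encoder to columns of the federated table.

```python
import numpy as np


def label_encode(values):
    # np.unique returns the sorted distinct classes and, for every value,
    # the index of its class in that sorted list.
    classes, codes = np.unique(values, return_inverse=True)
    return classes, codes


classes, codes = label_encode(np.array(["yes", "no", "no", "yes"]))
print(classes, codes)
```

Because a fresh mapping is fitted for every column, the integer codes are only meaningful within one column.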

We split the data into two parts, data and label.

In [14]:
label = vdf['y']
data = vdf.drop(columns='y', inplace=False)

In [15]:
print(f"label= {type(label)},\ndata = {type(data)}")

label= <class 'secretflow.data.vertical.dataframe.VDataFrame'>,
data = <class 'secretflow.data.vertical.dataframe.VDataFrame'>


Scale the features to a common range with MinMaxScaler

In [16]:
scaler = MinMaxScaler()

data = scaler.fit_transform(vdf[list(data.columns)])
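Min-max scaling maps every column to the range [0, 1] with `(x - min) / (max - min)`. The sketch below computes this on a small made-up array; `min_max_scale` is a hypothetical helper, and guarding constant columns by dividing by 1 is a choice made for this example.

```python
import numpy as np


def min_max_scale(x):
    x = np.asarray(x, dtype=float)
    col_min, col_max = x.min(axis=0), x.max(axis=0)
    # Avoid division by zero for constant columns.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (x - col_min) / span


# Toy values: first column plays the role of an age, second an encoded category.
scaled = min_max_scale([[18, 0], [59, 5], [95, 11]])
print(scaled)
```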




Next we split the dataset into a train set and a test set.

In [17]:
from secretflow.data.split import train_test_split
random_state = 1234
train_data,test_data = train_test_split(data,train_size=0.8,random_state=random_state)
train_label,test_label = train_test_split(label,train_size=0.8,random_state=random_state)
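Note that the features and the labels are split in two separate calls with the same `random_state`. The NumPy sketch below illustrates why that keeps rows and labels aligned: the same seed produces the same shuffle. `split_rows` is a hypothetical helper for this illustration and is not the SecretFlow implementation.

```python
import numpy as np


def split_rows(values, train_size, random_state):
    values = np.asarray(values)
    # The same seed always yields the same permutation of row indices.
    order = np.random.default_rng(random_state).permutation(len(values))
    cut = int(len(values) * train_size)
    return values[order[:cut]], values[order[cut:]]


features = np.arange(10) * 10        # toy feature for rows 0..9
labels = np.arange(10) * 10 + 1      # toy label tied to the same row

train_x, test_x = split_rows(features, 0.8, random_state=1234)
train_y, test_y = split_rows(labels, 0.8, random_state=1234)
print(train_x, train_y)
```

Using different seeds in the two calls would pair features with the wrong labels, so the shared `random_state` matters here.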

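**Summary:** At this point we have **defined the federated table**, **preprocessed the data**, and **split it into training and test sets**.  
The secretflow framework defines the concept of a `federated table` that spans multiple parties, a set of operations on it that mirror `pandas.DataFrame`, and preprocessing operations that mirror `sklearn`. If you run into problems, refer to our documentation and API reference to learn about the other features.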

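## The model

**Single-machine version**:  
A basic DNN is enough for this task: it takes the 16 features as input and outputs the probability of the positive and negative class.

**Federated version**:
* Alice:
    - base_net: takes the 4 features as input and passes them through a DNN to get hidden
    - fuse_net: receives its own hidden_alice and the hidden features computed by bob, feeds them into fuse_net for feature fusion, and passes the result through the remaining network to complete the whole forward and backward pass
* Bob:
    - base_net: takes the 12 features as input, passes them through a DNN to get hidden, then sends hidden to alice to complete the rest of the computation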

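### Defining the model

Next we create the federated model.  
For the vertical scenario we define `SLModelTF` and `SLTorchModel` (WIP) for building split learning models. The interfaces are simple, easy to use, and extensible, so an existing model can easily be converted into an SF model for vertical federated modeling.

Split learning splits a model apart: one part runs locally where the data is, and the other part runs at the party that holds the label, or on the server.  
First we define the model that runs locally, base_model.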

In [18]:
# Create the base model
def create_base_model(input_dim, output_dim,  name='base_model'):
    # Create model
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        model = keras.Sequential(
            [
                keras.Input(shape=input_dim),
                layers.Dense(100,activation ="relu" ),
                layers.Dense(output_dim, activation="relu"),
            ]
        )
        # Compile model
        model.summary()
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model  # the model itself is not serializable, so a builder function is returned
    return create_model


We use create_base_model to create the base model for `alice` and for `bob`.

In [19]:
# prepare model
hidden_size = 64
# Builder functions for the user-defined, compiled keras models
model_base_alice = create_base_model(4, hidden_size)
model_base_bob = create_base_model(12, hidden_size)

In [20]:
model_base_alice()
model_base_bob()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 100)               500       
                                                                 
 dense_1 (Dense)             (None, 64)                6464      
                                                                 
Total params: 6,964
Trainable params: 6,964
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 100)               1300      
                                                                 
 dense_3 (Dense)             (None, 64)                6464      
                                                                 
Total params: 7,764
Trainable pa



<keras.engine.sequential.Sequential at 0x7f60299d43d0>

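Next we define the model for the party that holds the label, or the server side: fuse_model  
In the definition of fuse_model we need to define the loss, optimizer, and metrics correctly. All configurations of your existing keras models are supported here.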

In [21]:
# Create the fuse model
def create_fuse_model(input_dim, output_dim, party_nums, name='fuse_model'):
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        # input
        input_layers = []
        for i in range(party_nums):
            input_layers.append(keras.Input(input_dim,))
        
        # Define the fusion logic
        merged_layer = layers.concatenate(input_layers)
        fuse_layer = layers.Dense(64, activation='relu')(merged_layer)
        output = layers.Dense(output_dim, activation='sigmoid')(fuse_layer)
        # Build the model
        model = keras.Model(inputs=input_layers, outputs=output)
        model.summary()
        # Compile the model; define the loss, optimizer, and metrics
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model

In [22]:
# Define the fuse model
model_fuse = create_fuse_model(
    input_dim=hidden_size, party_nums=2, output_dim=1)

In [23]:
model_fuse()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 64)]         0           []                               
                                                                                                  
 input_4 (InputLayer)           [(None, 64)]         0           []                               
                                                                                                  
 concatenate (Concatenate)      (None, 128)          0           ['input_3[0][0]',                
                                                                  'input_4[0][0]']                
                                                                                                  
 dense_4 (Dense)                (None, 64)           8256        ['concatenate[0][0]']        

<keras.engine.functional.Functional at 0x7f602a34d8b0>
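The parameter counts printed in the model summaries can be checked by hand: a Dense layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. `dense_params` is a small helper introduced here for illustration.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out


hidden_size = 64

# Base models: input -> Dense(100) -> Dense(hidden_size)
alice_base = dense_params(4, 100) + dense_params(100, hidden_size)
bob_base = dense_params(12, 100) + dense_params(100, hidden_size)

# Fuse model: the two hidden vectors are concatenated (2 * 64 = 128) before Dense(64).
fuse_first_dense = dense_params(2 * hidden_size, 64)

print(alice_base, bob_base, fuse_first_dense)
```

These values match the `Total params` lines of the two base models and the `dense_4` row of the fuse model summary.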

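### Creating the split learning model
secretflow provides the split learning model SLModelTF.  
Initializing SLModelTF requires 3 parameters:
* base_model_dict: a dictionary mapping every client that takes part in training to its base_model
* device_y: PYU, the party that holds the label
* model_fuse: the fuse model; the optimizer and loss function are defined in this model

Define base_model_dict  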
```python
base_model_dict:Dict[PYU,model_fn]
```

In [24]:
base_model_dict = {
    alice: model_base_alice,
    bob:   model_base_bob
}

In [25]:
sl_model = SLModelTF(
    base_model_dict=base_model_dict, 
    device_y=alice,  
    model_fuse=model_fuse)

In [26]:
sl_model.fit(train_data, train_label,validation_data=(test_data,test_label), epochs=10, batch_size=128, shuffle=True,verbose=1,validation_freq=1)

2022-03-15 17:00:19.430 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate={'loss': 0.61008704, 'accuracy': 0.7113402, 'auc_2': 0.73550725}
2022-03-15 17:00:21.394 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate={'loss': 0.55679965, 'accuracy': 0.7010309, 'auc_2': 0.8009378}
2022-03-15 17:00:23.390 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate={'loss': 0.5069495, 'accuracy': 0.7525773, 'auc_2': 0.8307758}
2022-03-15 17:00:25.317 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate={'loss': 0.49515337, 'accuracy': 0.78350514, 'auc_2': 0.83567774}
2022-03-15 17:00:27.265 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate={'loss': 0.48970804, 'accuracy': 0.78350514, 'auc_2': 0.83823526}
2022-03-15 17:00:29.204 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate={'loss': 0.461964, 'accuracy': 0.814433, 'auc_2': 0.8572037}
2022-03-15 17:00:31.091 | INFO     | secretflow.model.sl_model:fit:159 - valid evaluate

Let's call the evaluation function to see how well training went.

In [27]:
global_metric = sl_model.evaluate(test_data, test_label, batch_size=128)
print(global_metric)

{'loss': 0.4776467, 'accuracy': 0.7938144, 'auc_2': 0.85507244}
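To make the two reported metrics concrete, the sketch below computes accuracy and AUC exactly on a tiny made-up example. The helpers `accuracy` and `auc` are written for this illustration; Keras' `AUC` metric approximates the same quantity using a fixed set of thresholds, so its values can differ slightly from an exact computation.

```python
import numpy as np


def accuracy(labels, scores, threshold=0.5):
    labels, scores = np.asarray(labels), np.asarray(scores)
    return float(np.mean((scores >= threshold) == (labels == 1)))


def auc(labels, scores):
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2]
print(accuracy(toy_labels, toy_scores), auc(toy_labels, toy_scores))
```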


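## Comparison with a single-party model

#### Model
The model structure is consistent with the split learning model above, but only the model structure of alice, the party holding the label, is used. See the code below for the model definition.
#### Data
The data is the same kaggle bank marketing data; for the single-party model we only use the data of the new bank, alice.
1. 11162 samples in total: 8929 in the training set and 2233 in the test set
2. 4 features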

In [28]:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
from sklearn.model_selection import train_test_split

def create_model():

    model = keras.Sequential(
        [
            keras.Input(shape=4),
            layers.Dense(100,activation ="relu" ),
            layers.Dense(64, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ]
    )
    model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
    return model

single_model = create_model()

Data preprocessing

In [29]:
dataset_dict['alice']= dataset_dict['alice'].drop(columns="id",inplace=False)

In [30]:
dataset_dict['alice']

Unnamed: 0,age,job,marital,education,y
0,59,admin.,married,secondary,yes
1,56,admin.,married,secondary,yes
2,41,technician,married,secondary,yes
3,55,services,married,secondary,yes
4,54,admin.,married,tertiary,yes
...,...,...,...,...,...
11157,33,blue-collar,single,primary,no
11158,39,services,married,secondary,no
11159,32,technician,single,secondary,no
11160,43,technician,married,secondary,no


In [31]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

alice_data = dataset_dict['alice']
encoder = LabelEncoder()
alice_data['job'] = encoder.fit_transform(alice_data['job'])
alice_data['marital'] = encoder.fit_transform(alice_data['marital'])
alice_data['education'] = encoder.fit_transform(alice_data['education'])
alice_data['y'] =  encoder.fit_transform(alice_data['y'])

In [32]:
y = alice_data['y']
alice_data = alice_data.drop(columns=['y'],inplace=False)

In [33]:
scaler = MinMaxScaler()
alice_data = scaler.fit_transform(alice_data)

In [34]:
alice_data

array([[0.53246753, 0.        , 0.5       , 0.33333333],
       [0.49350649, 0.        , 0.5       , 0.33333333],
       [0.2987013 , 0.81818182, 0.5       , 0.33333333],
       ...,
       [0.18181818, 0.81818182, 1.        , 0.33333333],
       [0.32467532, 0.81818182, 0.5       , 0.33333333],
       [0.20779221, 0.81818182, 0.5       , 0.33333333]])

In [35]:
train_data,test_data = train_test_split(alice_data,train_size=0.8,random_state=random_state)
train_label,test_label = train_test_split(y,train_size=0.8,random_state=random_state)

In [36]:
single_model.fit(train_data,train_label,validation_data=(test_data,test_label),batch_size=128,epochs=10,shuffle=False)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f6014fb9d60>

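### Summary
The two experiments above simulate a typical vertical-scenario training problem: alice and bob share the same set of samples, but each holds only part of the features. If alice trains on her own data alone, she gets a model with accuracy 0.583 and AUC 0.62; after joining bob's data, the model reaches accuracy 0.793 and AUC 0.855.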

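## Conclusion

* In this tutorial we introduced what split learning is and how to do split learning with the secretflow framework  
* The experimental results show that split learning has a clear advantage in expanding the feature dimensions of the samples and improving model quality through joint multi-party training
* This document uses plaintext aggregation for the demonstration and does not consider leakage from the hidden layers. secretflow provides AggLayer, which avoids the leakage of hidden layers transmitted in plaintext by means of MPC, TEE, HE, and DP; see the related documentation if you are interested.
* As a next step, you may want to try a different dataset: first split the dataset vertically, then follow the flow of this tutorial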
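As a rough illustration of the hidden-layer leakage point, one common idea in the DP family is to clip each hidden vector and add noise before it leaves the party. The sketch below is an assumption-laden toy: `privatize_hidden`, `clip_norm`, and `noise_std` are hypothetical names, the noise is not calibrated to any formal privacy budget, and this is not SecretFlow's AggLayer.

```python
import numpy as np


def privatize_hidden(hidden, clip_norm=1.0, noise_std=0.1, seed=None):
    """Clip each row to at most clip_norm (L2) and add Gaussian noise."""
    hidden = np.asarray(hidden, dtype=float)
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    # Rows longer than clip_norm are scaled down; shorter rows are unchanged.
    clipped = hidden * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return clipped + rng.normal(scale=noise_std, size=hidden.shape)


hidden = np.array([[3.0, 4.0], [0.3, 0.4]])
print(privatize_hidden(hidden, noise_std=0.0))   # clipping only
print(privatize_hidden(hidden, seed=0))          # clipping plus noise
```

More noise protects the hidden representation better but usually costs model quality, so the noise level has to be tuned for each task.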
