# Split Learning: Bank Marketing

*The following code is for demonstration only. It is **NOT for production** because of system security concerns; please **DO NOT** use it directly in production.*

In this tutorial, we use a bank marketing model as an example to show how to do split learning in vertical scenarios with the `SecretFlow` framework.
`SecretFlow` provides a user-friendly API that makes it easy to apply your Keras or PyTorch model to split learning and complete joint modeling tasks in vertical scenarios.

We will show how to turn an existing Keras model into a split learning model under `SecretFlow` to complete a federated multi-party modeling task.

## What is Split Learning?

The core idea of split learning is to split the network structure. Each device (silo) retains only part of the network, and the sub-networks of all devices combine to form the complete model.
During training, each device (silo) performs forward or backward computation only on its local sub-network and passes the results to the next device. The devices train jointly in this way until the model converges.
 <img alt="split_learning_tutorial.png" src="resource/split_learning_tutorial.png" width="600">  


`Alice`: holds `data_alice` and `model_base_alice`  
`Bob`: holds `data_bob`, `model_base_bob`, and `model_fuse`  
In this illustration `Bob` holds the label; in the experiment later in this tutorial the roles are swapped and `Alice` holds it.

1. `Alice` feeds her data through `model_base_alice` to get `hidden0` and sends it to `Bob`.
2. `Bob` feeds his data through `model_base_bob` to get `hidden1`.
3. `hidden0` and `hidden1` are fed into the `AggLayer` for aggregation, which outputs `hidden_merge`.
4. `Bob` feeds `hidden_merge` into `model_fuse`, computes the gradient with `label`, and sends it back.
5. The `AggLayer` splits the gradient into two parts, `g0` and `g1`, which are sent to `Alice` and `Bob` respectively.
6. `Alice` and `Bob` then update their local base nets with `g0` and `g1` respectively.
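
The six steps can be traced in a few lines of plain Python. This is a toy sketch, not SecretFlow code: each base model is assumed to be a single scalar weight, the aggregation is a plain concatenation, and the names `train_step`, `w_alice`, `w_bob`, and `v` are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x_alice, x_bob, label, lr=0.1):
    """Run one toy split learning step on one sample; return the loss before the update."""
    # Steps 1-2: each party runs its own base model on its own feature.
    hidden0 = params["w_alice"] * x_alice  # computed by Alice, then sent to Bob
    hidden1 = params["w_bob"] * x_bob      # computed by Bob
    # Step 3: aggregate the hidden outputs (assumed here: simple concatenation).
    hidden_merge = [hidden0, hidden1]
    # Step 4: the label holder runs the fuse model and computes the gradient.
    v0, v1 = params["v"]
    pred = sigmoid(v0 * hidden_merge[0] + v1 * hidden_merge[1])
    loss = -(label * math.log(pred) + (1 - label) * math.log(1 - pred))
    d_logit = pred - label  # d(loss)/d(logit) for sigmoid + binary cross-entropy
    # Step 5: the gradient w.r.t. hidden_merge is split into g0 and g1.
    g0, g1 = d_logit * v0, d_logit * v1
    # The fuse model is updated on the label holder's side only.
    params["v"] = [v0 - lr * d_logit * hidden0, v1 - lr * d_logit * hidden1]
    # Step 6: each party updates its base model with its own gradient slice.
    params["w_alice"] -= lr * g0 * x_alice
    params["w_bob"] -= lr * g1 * x_bob
    return loss


params = {"w_alice": 0.5, "w_bob": -0.3, "v": [0.2, 0.4]}
losses = [train_step(params, x_alice=1.0, x_bob=2.0, label=1) for _ in range(50)]
print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
```

Note that Alice never sees `x_bob` or the label, and Bob never sees `x_alice`: only `hidden0` and `g0` cross the party boundary.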


## Task

Bank marketing covers the overall operations and sales activities a bank carries out to meet customer needs and achieve its business objectives in an ever-changing market. In the current big-data environment, data analysis gives the banking industry a more effective means of analysis: customer demand analysis, an understanding of target market trends, and broader market strategy can all draw their basis and direction from it.  
  
The data, from [Kaggle](https://www.kaggle.com/janiobachmann/bank-marketing-dataset), is a classic bank marketing dataset collected from the telephone direct-marketing campaigns of a Portuguese banking institution. The target variable is whether the customer subscribes to a deposit product.

## Data

1. The total sample size is 11162, with 8929 samples in the training set and 2233 in the test set. (The outputs shown in this notebook come from a smaller copy of the data; for example, the local test split below has 905 rows.)
2. There are 16 features, and the target is a binary classification label.
3. The data has been split in advance: Alice holds the 4 basic attribute features, Bob holds the 12 bank transaction features, and only Alice holds the corresponding label.
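
To make the vertical split concrete, the sketch below cuts one full record into the part Alice holds and the part Bob holds. The column names are the ones used in this tutorial; the helper `split_record` and the constants are hypothetical and only illustrate the idea.

```python
# Columns held by each party; "y" is the label and stays with Alice.
ALICE_COLUMNS = ["age", "job", "marital", "education", "y"]
BOB_COLUMNS = ["default", "balance", "housing", "loan", "contact", "day",
               "month", "duration", "campaign", "pdays", "previous", "poutcome"]


def split_record(record):
    """Split one full record into Alice's part and Bob's part, both keyed by id."""
    alice_part = {"id": record["id"], **{c: record[c] for c in ALICE_COLUMNS}}
    bob_part = {"id": record["id"], **{c: record[c] for c in BOB_COLUMNS}}
    return alice_part, bob_part


record = {
    "id": 0, "age": 30, "job": "unemployed", "marital": "married",
    "education": "primary", "y": "no", "default": "no", "balance": 1787,
    "housing": "no", "loan": "no", "contact": "cellular", "day": 19,
    "month": "oct", "duration": 79, "campaign": 1, "pdays": -1,
    "previous": 0, "poutcome": "unknown",
}
alice_part, bob_part = split_record(record)
print(len(ALICE_COLUMNS) - 1, "features for Alice,", len(BOB_COLUMNS), "for Bob")
```

Both parts keep the `id` column so that the two parties can later align their rows.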

Let's start by looking at what the bank marketing data looks like.  

The original data is divided into Bank_Alice and Bank_Bob, stored on Alice's and Bob's side respectively. The CSV files contain the raw data, which has only been split and not preprocessed; we will use `secretflow.preprocessing` to preprocess the federated data.

In [1]:
%load_ext autoreload
%autoreload 2

import secretflow as sf

sf.init(['alice', 'bob'], num_cpus=8, log_to_driver=True)
alice, bob = sf.PYU('alice'), sf.PYU('bob')



### Prepare data

In [2]:
from secretflow.data.simulation.dataset import load_bank_marketing_data

column_split = {
    alice: ["age", "job", "marital", "education", "y"],
    bob: [
        "default", "balance", "housing", "loan", "contact", "day",
        "month", "duration", "campaign", "pdays", "previous", "poutcome",
    ],
}
file_uris = load_bank_marketing_data(party_ratio=column_split)

In [3]:
import pandas as pd
dataset_dict = {}
for device, file_path in file_uris.items():
    dataset_dict[device] = pd.read_csv(file_path)

We assume that Alice is a new bank that holds only users' basic information and has purchased the labels for financial products from another bank.

In [4]:
dataset_dict[alice]

|  | id | age | job | marital | education | y |
|---|---|---|---|---|---|---|
| 0 | 0 | 30 | unemployed | married | primary | no |
| 1 | 1 | 33 | services | married | secondary | no |
| 2 | 2 | 35 | management | single | tertiary | no |
| 3 | 3 | 30 | management | married | tertiary | no |
| 4 | 4 | 59 | blue-collar | married | secondary | no |
| ... | ... | ... | ... | ... | ... | ... |
| 4516 | 4516 | 33 | services | married | secondary | no |
| 4517 | 4517 | 57 | self-employed | married | tertiary | no |
| 4518 | 4518 | 57 | technician | married | secondary | no |
| 4519 | 4519 | 28 | blue-collar | married | secondary | no |


In [5]:
type(dataset_dict[alice])

pandas.core.frame.DataFrame

Bob is an established bank that holds the users' account balance, housing and loan status, and recent marketing feedback.

In [6]:
dataset_dict[bob]

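|  | id | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown |
| 1 | 1 | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure |
| 2 | 2 | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure |
| 3 | 3 | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown |
| 4 | 4 | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | 4516 | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown |
| 4517 | 4517 | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown |
| 4518 | 4518 | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown |
| 4519 | 4519 | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other |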


## Create the SecretFlow Environment

Create two entities, Alice and Bob, in the SecretFlow environment; both are `PYU` devices.
Once these two objects are constructed, you can start split learning.

### Import Dependencies

In [7]:
from secretflow.data.split import train_test_split
from secretflow.ml.nn import SLModelTF

## Prepare Data

**Build Federated Table**  
A federated table is a virtual concept that spans multiple parties. We define `VDataFrame` for the vertical setting.
1. The data of each party in a federated table is stored locally and is not allowed to leave its domain.
2. No one except the party that owns the data has access to the data store.
3. Any operation on the federated table is scheduled by the driver to each worker, and the execution instructions are delivered layer by layer down to the Python runtime of the specific worker. The framework ensures that data can be operated on only when `worker.device` and `Object.device` are the same.
4. Federated tables are designed to manage and manipulate multi-party data from a central perspective.
5. The interfaces of federated tables are aligned with `pandas.DataFrame` to reduce the cost of multi-party data operations.
6. The SecretFlow framework provides hybrid plaintext and ciphertext programming capabilities. Vertical federated tables are built using `SPU`, and MPC-PSI is used to securely compute the intersection and align the data of all parties.
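
The alignment mentioned in the last point can be pictured with a plaintext stand-in. The sketch below only shows the *result* of the alignment (rows with shared keys, in the same order on both sides); a real PSI protocol obtains it cryptographically, without revealing the keys that are not shared. The helper `align_by_key` is hypothetical, and the rows with `id` 7 and 9 are made up.

```python
def align_by_key(alice_rows, bob_rows, key="id"):
    """Keep only rows whose key exists on both sides, sorted by that key."""
    alice_index = {row[key]: row for row in alice_rows}
    bob_index = {row[key]: row for row in bob_rows}
    shared = sorted(alice_index.keys() & bob_index.keys())
    return [alice_index[k] for k in shared], [bob_index[k] for k in shared]


alice_rows = [{"id": 3, "age": 30}, {"id": 1, "age": 33}, {"id": 7, "age": 41}]
bob_rows = [{"id": 1, "balance": 4789}, {"id": 9, "balance": 12}, {"id": 3, "balance": 1476}]

alice_aligned, bob_aligned = align_by_key(alice_rows, bob_rows)
print([row["id"] for row in alice_aligned])
print([row["id"] for row in bob_aligned])
```

After alignment, row `i` on Alice's side and row `i` on Bob's side describe the same user, which is what lets the two base models process the same sample in the same batch position.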

<img alt="vdataframe.png" src="resource/vdataframe.png" width="600">  



`VDataFrame` provides a `read_csv` interface similar to that of pandas, except that it receives a dictionary defining the data path of each party. We can use `secretflow.data.vertical.read_csv` to build the `VDataFrame`.

Create the SPU object:

In [8]:
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))

[2m[36m(pid=2230314)[0m 2022-06-29 11:39:41.097060: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib:/opt/rh/devtoolset-10/root/usr/lib64/dyninst:/opt/rh/devtoolset-10/root/usr/lib/dyninst:/opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib
[2m[36m(pid=2230315)[0m 2022-06-29 11:39:41.097060: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib:/opt/rh/devtoolset-10/root/usr/lib64/dyninst:/opt/rh/devtoolset-10/root/usr/lib/dyninst:/opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10

In [9]:
from secretflow.data.vertical import read_csv

vdf = read_csv(file_uris,spu=spu,keys='id',drop_keys=True)

[2m[36m(SPURuntime pid=2230314)[0m I0629 11:39:42.788682 2230314 external/com_github_brpc_brpc/src/brpc/server.cpp:1065] Server[yasl::link::internal::ReceiverServiceImpl] is serving on port=15261.
[2m[36m(SPURuntime pid=2230314)[0m I0629 11:39:42.788750 2230314 external/com_github_brpc_brpc/src/brpc/server.cpp:1068] Check out http://i85c08157.eu95sqa:15261 in web browser.
[2m[36m(SPURuntime pid=2230315)[0m I0629 11:39:42.790211 2230315 external/com_github_brpc_brpc/src/brpc/server.cpp:1065] Server[yasl::link::internal::ReceiverServiceImpl] is serving on port=50563.
[2m[36m(SPURuntime pid=2230315)[0m I0629 11:39:42.790271 2230315 external/com_github_brpc_brpc/src/brpc/server.cpp:1068] Check out http://i85c08157.eu95sqa:50563 in web browser.
[2m[36m(_run pid=2230316)[0m 2022-06-29 11:39:43.080818: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No

[2m[36m(SPURuntime pid=2230315)[0m [2022-06-29 11:39:44.808] [info] [executor_base.cc:246] Begin sanity check for input file: .data/1/psi-input.csv
[2m[36m(SPURuntime pid=2230315)[0m [2022-06-29 11:39:44.811] [info] [executor_base.cc:196] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable .data/1/psi-input.csv.keys.1656473984808312593 | LC_ALL=C uniq -d > .data/1/psi-input.csv.duplicated.1656473984808312593
[2m[36m(SPURuntime pid=2230315)[0m [2022-06-29 11:39:44.814] [info] [executor_base.cc:199] Finished duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable .data/1/psi-input.csv.keys.1656473984808312593 | LC_ALL=C uniq -d > .data/1/psi-input.csv.duplicated.1656473984808312593, ret=0
[2m[36m(SPURuntime pid=2230315)[0m [2022-06-29 11:39:44.814] [info] [executor_base.cc:249] End sanity check for input file: .data/1/psi-input.csv, size=4521
[2m[36m(SPURuntime pid=2230315)[0m [2022-06-29 11:39:44.821]

`vdf` is the vertically federated table we have just built. From a global view it exposes only the `Schema` of all the data.

In [10]:
vdf.columns

Index(['age', 'job', 'marital', 'education', 'y', 'default', 'balance',
       'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome'],
      dtype='object')

Let's take a closer look at how the `VDataFrame` manages data.

As the example shows, the `age` field belongs to Alice, so the corresponding column can be obtained from Alice's partition, whereas trying to obtain `age` from Bob's partition raises a `KeyError`.  
A `Partition` is a data fragment that we define. Each partition belongs to a device, and only that device can operate on its data.

In [11]:
print(vdf['age'].partitions[alice].data)
print(vdf['age'].partitions[bob])

<secretflow.device.device.pyu.PYUObject object at 0x7f06a6751a30>


KeyError: <secretflow.device.device.pyu.PYU object at 0x7f06e82c04f0>
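
The behaviour above can be mimicked with a small plain-Python class. This is only a mental model under simplifying assumptions (partitions are ordinary dicts, parties are strings); `ToyVerticalTable` is a hypothetical name and not part of SecretFlow.

```python
class ToyVerticalTable:
    """A toy stand-in for a vertically partitioned table: one column dict per party."""

    def __init__(self, partitions):
        self.partitions = partitions  # e.g. {"alice": {"age": [...]}, "bob": {...}}

    @property
    def columns(self):
        # The global schema is the concatenation of every party's column names.
        return [col for part in self.partitions.values() for col in part]

    def __getitem__(self, column):
        # Selecting a column keeps only the partitions that actually own it.
        owners = {
            party: {column: part[column]}
            for party, part in self.partitions.items()
            if column in part
        }
        if not owners:
            raise KeyError(column)
        return ToyVerticalTable(owners)


table = ToyVerticalTable({
    "alice": {"age": [30, 33, 35]},
    "bob": {"balance": [1787, 4789, 1350]},
})
age = table["age"]
print(age.partitions["alice"])  # Alice owns the column
try:
    age.partitions["bob"]       # Bob has no partition for "age"
except KeyError as err:
    print("KeyError:", err)
```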

Next we preprocess the `VDataFrame`.  
Here we take `LabelEncoder` and `MinMaxScaler` as examples. Both preprocessors have counterparts in `sklearn` and are used in much the same way.
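
For readers unfamiliar with the two transformations, here is a minimal plain-Python sketch of what they compute on a single column. It follows the usual sklearn conventions (classes sorted before numbering, scaling to the range 0 to 1); the function names are hypothetical, and mapping a constant column to zeros is an assumption of this sketch.

```python
def label_encode(values):
    """Map each distinct category to an integer, numbering the sorted categories."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def min_max_scale(values):
    """Rescale numbers linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant column becomes all zeros
    return [(v - lo) / (hi - lo) for v in values]


# First five rows of Alice's table shown earlier.
marital = ["married", "married", "single", "married", "married"]
ages = [30, 33, 35, 30, 59]

encoded, classes = label_encode(marital)
scaled = min_max_scale(ages)
print(classes, encoded)
print([round(v, 3) for v in scaled])
```

In the federated setting the same computation runs inside the partition that owns the column, so the raw values never leave their party.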

In [12]:
from secretflow.preprocessing.scaler import MinMaxScaler
from secretflow.preprocessing.encoder import LabelEncoder

In [13]:
encoder = LabelEncoder()
vdf['job'] = encoder.fit_transform(vdf['job'])
vdf['marital'] = encoder.fit_transform(vdf['marital'])
vdf['education'] = encoder.fit_transform(vdf['education'])
vdf['default'] = encoder.fit_transform(vdf['default'])
vdf['housing'] = encoder.fit_transform(vdf['housing'])
vdf['loan'] = encoder.fit_transform(vdf['loan'])
vdf['contact'] = encoder.fit_transform(vdf['contact'])
vdf['poutcome'] = encoder.fit_transform(vdf['poutcome'])
vdf['month'] = encoder.fit_transform(vdf['month'])
vdf['y'] = encoder.fit_transform(vdf['y'])

We split the table into the features (`data`) and the label (`label`).

In [14]:
label = vdf['y']
data = vdf.drop(columns='y', inplace=False)

In [15]:
print(f"label= {type(label)},\ndata = {type(data)}")

label= <class 'secretflow.data.vertical.dataframe.VDataFrame'>,
data = <class 'secretflow.data.vertical.dataframe.VDataFrame'>


Scale the features with `MinMaxScaler`:

In [16]:
scaler = MinMaxScaler()

data = scaler.fit_transform(vdf[list(data.columns)])




Next we divide the dataset into a training set and a test set.

In [17]:
from secretflow.data.split import train_test_split
random_state = 1234
train_data,test_data = train_test_split(data,train_size=0.8,random_state=random_state)
train_label,test_label = train_test_split(label,train_size=0.8,random_state=random_state)
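
The features and the label are split in two separate calls, so both calls must use the same `random_state`; otherwise the rows would be shuffled differently and features would no longer match their labels. The sketch below illustrates the point with a toy splitter written for this purpose (`toy_train_test_split` is hypothetical and says nothing about how SecretFlow implements its own split).

```python
import random


def toy_train_test_split(rows, train_size, random_state):
    """Shuffle row indices with a seeded generator, then cut at train_size."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Row i has features [i, 10 * i] and label i % 2, so alignment is easy to check.
features = [[i, i * 10] for i in range(10)]
labels = [i % 2 for i in range(10)]

train_x, test_x = toy_train_test_split(features, 0.8, random_state=1234)
train_y, test_y = toy_train_test_split(labels, 0.8, random_state=1234)

aligned = all(x[0] % 2 == y for x, y in zip(train_x + test_x, train_y + test_y))
print("rows still aligned:", aligned)
```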

**Summary:** At this point, we have defined the **federated table**, **preprocessed the data**, and **split it into a training set and a test set**.
The SecretFlow framework defines a set of table operations on the federated table (their logical counterpart is `pandas.DataFrame`) and a set of preprocessing operations (their logical counterpart is `sklearn`). Refer to our documentation and API introduction to learn more about other features.

## Introduce Model

**Local version**:
For this task a basic DNN is enough: it takes the 16-dimensional features as input and outputs the probability that a sample is positive.


**Federated version**:
* Alice:
    - base_net: takes the 4-dimensional features as input and passes them through a DNN to get the hidden representation
    - fuse_net: receives the hidden features computed by Alice and Bob, feeds them into the fuse net for feature fusion, and completes the forward and backward passes
* Bob:
    - base_net: takes the 12-dimensional features as input, passes them through a DNN to get the hidden representation, and sends it to Alice for the subsequent steps

### Define Model

Next we create the federated model.
We define `SLModelTF` and `SLTorchModel` (WIP) for building split learning models in vertical scenarios. They expose a simple, extensible interface that makes it easy to turn an existing model into an SF model and then carry out vertical federated modeling.

Split learning breaks a model up so that one part runs locally where the data is and the other part runs on the side that holds the label.
First, let's define the locally executed model, `base_model`.

In [18]:
def create_base_model(input_dim, output_dim,  name='base_model'):
    # Create model
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        model = keras.Sequential(
            [
                keras.Input(shape=input_dim),
                layers.Dense(100,activation ="relu" ),
                layers.Dense(output_dim, activation="relu"),
            ]
        )
        # Compile model
        model.summary()
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model


We use `create_base_model` to create the base models for `Alice` and `Bob` respectively.

In [19]:
# prepare model
hidden_size = 64

model_base_alice = create_base_model(4, hidden_size)
model_base_bob = create_base_model(12, hidden_size)

In [20]:
model_base_alice()
model_base_bob()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 100)               500       
                                                                 
 dense_1 (Dense)             (None, 64)                6464      
                                                                 
Total params: 6,964
Trainable params: 6,964
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_2 (Dense)             (None, 100)               1300      
                                                                 
 dense_3 (Dense)             (None, 64)                6464      
                                                                 
Total params: 7,764
Trainable pa



<keras.engine.sequential.Sequential at 0x7f06a5c1bdf0>

Next we define the model for the side that holds the label, i.e. the server-side model, `fuse_model`.
In the definition of `fuse_model` we need to define `loss`, `optimizer`, and `metrics` correctly. Any configuration of your existing Keras model can be used here.

In [21]:
def create_fuse_model(input_dim, output_dim, party_nums, name='fuse_model'):
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        # input
        input_layers = []
        for i in range(party_nums):
            input_layers.append(keras.Input(input_dim,))
        
        merged_layer = layers.concatenate(input_layers)
        fuse_layer = layers.Dense(64, activation='relu')(merged_layer)
        output = layers.Dense(output_dim, activation='sigmoid')(fuse_layer)

        model = keras.Model(inputs=input_layers, outputs=output)
        model.summary()
        
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model

In [22]:
model_fuse = create_fuse_model(
    input_dim=hidden_size, party_nums=2, output_dim=1)

In [23]:
model_fuse()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, 64)]         0           []                               
                                                                                                  
 input_4 (InputLayer)           [(None, 64)]         0           []                               
                                                                                                  
 concatenate (Concatenate)      (None, 128)          0           ['input_3[0][0]',                
                                                                  'input_4[0][0]']                
                                                                                                  
 dense_4 (Dense)                (None, 64)           8256        ['concatenate[0][0]']        

<keras.engine.functional.Functional at 0x7f06a4aa83a0>
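
The parameter counts in the summaries above can be checked by hand: a `Dense` layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(input_dim, units):
    """Number of trainable parameters of a fully connected layer with bias."""
    return input_dim * units + units


hidden_size = 64

# Base models: input -> Dense(100) -> Dense(hidden_size)
alice_base = dense_params(4, 100) + dense_params(100, hidden_size)
bob_base = dense_params(12, 100) + dense_params(100, hidden_size)

# Fuse model: the two hidden vectors are concatenated before the first Dense(64).
fuse_first_dense = dense_params(2 * hidden_size, 64)

print(alice_base, bob_base, fuse_first_dense)
```

The only difference between the two base models is the first layer, which reflects the 4 features held by Alice versus the 12 held by Bob.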

### Create Split Learning Model
SecretFlow provides the split learning model `SLModelTF`.
Initializing `SLModelTF` requires only 3 parameters:
* base_model_dict: a dictionary that maps every client participating in training to its base model
* device_y: the `PYU` device that holds the label
* model_fuse: the fusion model

Define base_model_dict  
```python
base_model_dict:Dict[PYU,model_fn]
```

In [24]:
base_model_dict = {
    alice: model_base_alice,
    bob:   model_base_bob
}

In [25]:
from secretflow.security.privacy import DPStrategy, GaussianEmbeddingDP, LabelDP

# Define DP operations
train_batch_size = 128
gaussian_embedding_dp = GaussianEmbeddingDP(
    noise_multiplier=0.5,
    l2_norm_clip=1.0,
    batch_size=train_batch_size,
    num_samples=train_data.values.partition_shape()[alice][0],
    is_secure_generator=False,
)
dp_strategy_alice = DPStrategy(embedding_dp=gaussian_embedding_dp)
label_dp = LabelDP(eps=64.0)
dp_strategy_bob = DPStrategy(label_dp=label_dp)
dp_strategy_dict = {alice: dp_strategy_alice, bob: dp_strategy_bob}
dp_spent_step_freq = 10

In [26]:
sl_model = SLModelTF(
    base_model_dict=base_model_dict, 
    device_y=alice,  
    model_fuse=model_fuse,
    dp_strategy_dict=dp_strategy_dict,)

In [27]:
sl_model.fit(
    train_data,
    train_label,
    validation_data=(test_data, test_label),
    epochs=10,
    batch_size=train_batch_size,
    shuffle=True,
    verbose=1,
    validation_freq=1,
    dp_spent_step_freq=dp_spent_step_freq,
)

[2m[36m(PYUSLTFModel pid=2230316)[0m 2022-06-29 11:40:28.379543: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib:/opt/rh/devtoolset-10/root/usr/lib64/dyninst:/opt/rh/devtoolset-10/root/usr/lib/dyninst:/opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/devtoolset-10/root/usr/lib
[2m[36m(PYUSLTFModel pid=2230316)[0m 2022-06-29 11:40:28.379573: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
[2m[36m(PYUSLTFModel pid=2230318)[0m 2022-06-29 11:40:28.381398: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-10/root/usr/lib64:/opt/rh/de

[2m[36m(PYUSLTFModel pid=2230316)[0m Model: "sequential"
[2m[36m(PYUSLTFModel pid=2230316)[0m _________________________________________________________________
[2m[36m(PYUSLTFModel pid=2230316)[0m  Layer (type)                Output Shape              Param #   
[2m[36m(PYUSLTFModel pid=2230316)[0m  dense (Dense)               (None, 100)               500       
[2m[36m(PYUSLTFModel pid=2230316)[0m                                                                  
[2m[36m(PYUSLTFModel pid=2230316)[0m  dense_1 (Dense)             (None, 64)                6464      
[2m[36m(PYUSLTFModel pid=2230316)[0m                                                                  
[2m[36m(PYUSLTFModel pid=2230316)[0m Total params: 6,964
[2m[36m(PYUSLTFModel pid=2230316)[0m Trainable params: 6,964
[2m[36m(PYUSLTFModel pid=2230316)[0m Non-trainable params: 0
[2m[36m(PYUSLTFModel pid=2230316)[0m _________________________________________________________________
[2m[36m(

100%|██████████| 29/29 [00:01<00:00, 18.39it/s, epoch: 0/10 -  train_loss:0.42552024126052856  train_accuracy:0.8590909242630005  train_auc_2:0.5520472526550293  val_loss:0.34310784935951233  val_accuracy:0.8877212405204773  val_auc_2:0.6612115502357483 ]
100%|██████████| 29/29 [00:01<00:00, 19.32it/s, epoch: 1/10 -  train_loss:0.33578866720199585  train_accuracy:0.8902838230133057  train_auc_2:0.6446549892425537  val_loss:0.3264640271663666  val_accuracy:0.8877212405204773  val_auc_2:0.7030531167984009 ]
100%|██████████| 29/29 [00:01<00:00, 19.05it/s, epoch: 2/10 -  train_loss:0.3369784355163574  train_accuracy:0.8823689818382263  train_auc_2:0.700323760509491  val_loss:0.31710633635520935  val_accuracy:0.8877212405204773  val_auc_2:0.7425126433372498 ]
100%|██████████| 29/29 [00:01<00:00, 19.38it/s, epoch: 3/10 -  train_loss:0.31362512707710266  train_accuracy:0.8870298862457275  train_auc_2:0.751912534236908  val_loss:0.2964775562286377  val_accuracy:0.8877212405204773  val_auc_2:0.

{'train_loss': [0.42552024,
  0.33578867,
  0.33697844,
  0.31362513,
  0.2956601,
  0.2666193,
  0.25582612,
  0.25017434,
  0.24141279,
  0.23765592],
 'train_accuracy': [0.8590909,
  0.8902838,
  0.882369,
  0.8870299,
  0.8855101,
  0.89097536,
  0.89396834,
  0.8987832,
  0.89961284,
  0.9034845],
 'train_auc_2': [0.55204725,
  0.644655,
  0.70032376,
  0.75191253,
  0.80064297,
  0.84383744,
  0.86289334,
  0.86585206,
  0.8756056,
  0.87462044],
 'val_loss': [0.34310785,
  0.32646403,
  0.31710634,
  0.29647756,
  0.27681866,
  0.25742745,
  0.25881812,
  0.24163179,
  0.24190263,
  0.24357407],
 'val_accuracy': [0.88772124,
  0.88772124,
  0.88772124,
  0.88772124,
  0.8885509,
  0.8949115,
  0.89463496,
  0.89850664,
  0.8993363,
  0.9004425],
 'val_auc_2': [0.66121155,
  0.7030531,
  0.74251264,
  0.7922882,
  0.83285224,
  0.85833603,
  0.87410724,
  0.8827099,
  0.88052267,
  0.8792129]}

Let's call the evaluation function

In [28]:
global_metric = sl_model.evaluate(test_data, test_label, batch_size=128)
print(global_metric)

Evaluate Processing:: 100%|██████████| 29/29 [00:00<00:00, 62.88it/s, loss:0.24504481256008148 accuracy:0.8976770043373108 auc_2:0.8779380321502686]

{'loss': 0.24504481, 'accuracy': 0.897677, 'auc_2': 0.87793803}





## Comparison with a Local Model

#### Model
The model structure is consistent with the split learning model above, but only Alice's part of the structure is used here. See the code below for the model definition.
#### Data
The data is the same Kaggle bank marketing data. Here we use only the data held by Alice, the new bank.
1. The total sample size is 11162, with 8929 samples in the training set and 2233 in the test set. (In the run shown here the test split has 905 rows, so the loaded copy of the data is smaller.)
2. The feature dimension is 4.

In [29]:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
from sklearn.model_selection import train_test_split

def create_model():

    model = keras.Sequential(
        [
            keras.Input(shape=4),
            layers.Dense(100,activation ="relu" ),
            layers.Dense(64, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ]
    )
    model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
    return model

single_model = create_model()

Data processing

In [30]:
dataset_dict[alice]= dataset_dict[alice].drop(columns="id",inplace=False)

In [31]:
dataset_dict[alice]

|  | age | job | marital | education | y |
|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no |
| 1 | 33 | services | married | secondary | no |
| 2 | 35 | management | single | tertiary | no |
| 3 | 30 | management | married | tertiary | no |
| 4 | 59 | blue-collar | married | secondary | no |
| ... | ... | ... | ... | ... | ... |
| 4516 | 33 | services | married | secondary | no |
| 4517 | 57 | self-employed | married | tertiary | no |
| 4518 | 57 | technician | married | secondary | no |
| 4519 | 28 | blue-collar | married | secondary | no |


In [32]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

alice_data = dataset_dict[alice]
encoder = LabelEncoder()
alice_data['job'] = encoder.fit_transform(alice_data['job'])
alice_data['marital'] = encoder.fit_transform(alice_data['marital'])
alice_data['education'] = encoder.fit_transform(alice_data['education'])
alice_data['y'] =  encoder.fit_transform(alice_data['y'])

In [33]:
y = alice_data['y']
alice_data = alice_data.drop(columns=['y'],inplace=False)

In [34]:
scaler = MinMaxScaler()
alice_data = scaler.fit_transform(alice_data)

In [35]:
train_data,test_data = train_test_split(alice_data,train_size=0.8,random_state=random_state)
train_label,test_label = train_test_split(y,train_size=0.8,random_state=random_state)

In [36]:
test_data.shape

(905, 4)

In [37]:
single_model.fit(train_data,train_label,validation_data=(test_data,test_label),batch_size=128,epochs=10,shuffle=False)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f06a4a35940>

### Summary
The two experiments above simulate a typical vertical training problem. Alice and Bob share the same group of samples, but each side holds only part of the features. If Alice trains a model on her own data alone, she obtains a model with an accuracy of **0.872** and an AUC of **0.53**. If Bob's data is combined, the model reaches an accuracy of **0.893** and an AUC of **0.883**. Exact figures vary from run to run; the split learning run shown above evaluates to an accuracy of about 0.898 and an AUC of about 0.878.
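
The gap between the two AUC values matters more than the gap between the accuracies. On an imbalanced label, a model that has learned nothing can still reach a high accuracy by always predicting the majority class, while its AUC stays near 0.5. The sketch below illustrates this with a made-up 88/12 class split (an assumption for illustration, not a statistic taken from the dataset) and hand-written metric helpers.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0] * 88 + [1] * 12

# A model that gives every sample the same low score: no ranking ability at all.
constant_scores = [0.1] * 100
# A model that ranks every positive above every negative, but never crosses 0.5.
ranking_scores = [0.1] * 88 + [0.4] * 12

print(accuracy(labels, constant_scores), auc(labels, constant_scores))
print(accuracy(labels, ranking_scores), auc(labels, ranking_scores))
```

Both toy models have the same accuracy, yet only the second one separates the classes, which is the kind of difference the AUC figures in the summary capture.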

## Conclusion

* This tutorial introduced what split learning is and how to do it in SecretFlow.
* The experimental results show that split learning has a significant advantage in expanding the feature dimension of the samples and improving model performance through joint multi-party training.
* This tutorial demonstrates plaintext aggregation and does not address leakage from the hidden layer. SecretFlow provides `AggLayer` to avoid the leakage caused by transmitting hidden-layer outputs in plaintext, using MPC, TEE, HE, and DP. If you are interested, please refer to the relevant documentation.
* As a next step, you may want to try other datasets: vertically partition the data first and then follow the flow of this tutorial.
