# Using Custom DataBuilder in SecretFlow (TensorFlow)

The following code is for demonstration only. Because of system security concerns, it is **NOT for production**; please **DO NOT** use it directly in production.

In this tutorial, we show how to load data and train a model using a custom DataBuilder in SecretFlow's multi-party secure environment.
We use an image classification task on the Flower dataset to show how a custom DataBuilder is used in federated learning.


## Environment Setting

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=False)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')

2023-04-17 15:18:33,602	INFO worker.py:1538 -- Started a local Ray instance.


## Interface Introduction

SecretFlow's `FLModel` supports reading data through a custom DataBuilder, so users can handle data input flexibly according to their needs.
Let's use an example to demonstrate how to use a custom DataBuilder for federated model training.


Steps to use DataBuilder:

1. Use a single-machine engine (TensorFlow or PyTorch) to develop the Builder function that produces the Dataset.
2. Wrap each party's Builder function in a `create_dataset_builder` function. *Note: The dataset_builder must accept a `stage` parameter.*
3. Build the `data_builder_dict`, which maps each PYU to its dataset_builder.
4. Pass the `data_builder_dict` to the `dataset_builder` parameter of the `fit` function. The `x` argument then carries the input that each dataset_builder expects (e.g., in this example, the path to the image folder).

Using a DataBuilder in FLModel requires a pre-defined `data_builder_dict`. Each builder must return a `tf.data.Dataset` and `steps_per_epoch`, and the `steps_per_epoch` returned by all parties must be consistent.
```python
data_builder_dict = {
    # The builder created for alice must return (Dataset, steps_per_epoch)
    alice: create_alice_dataset_builder(
        batch_size=32,
    ),
    # The builder created for bob must return (Dataset, steps_per_epoch)
    bob: create_bob_dataset_builder(
        batch_size=32,
    ),
}
```
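
The consistency requirement can be checked before training starts. The following is a minimal sketch, not part of SecretFlow: `check_builder_outputs` is a hypothetical helper, and plain strings and lists stand in for the PYU devices and the `tf.data.Dataset` objects.

```python
def check_builder_outputs(outputs):
    """Validate the (dataset, steps_per_epoch) pairs returned per party.

    `outputs` maps a party name to the pair its builder returned.
    Hypothetical helper for illustration only.
    """
    if not outputs:
        raise ValueError("no builder outputs to check")
    steps = {}
    for party, result in outputs.items():
        if not isinstance(result, tuple) or len(result) != 2:
            raise TypeError(f"{party}: expected (dataset, steps_per_epoch)")
        _, steps_per_epoch = result
        if not isinstance(steps_per_epoch, int) or steps_per_epoch <= 0:
            raise ValueError(f"{party}: steps_per_epoch must be a positive int")
        steps[party] = steps_per_epoch
    # All parties must agree on the number of steps in one epoch.
    if len(set(steps.values())) > 1:
        raise ValueError(f"steps_per_epoch differs across parties: {steps}")
    return next(iter(steps.values()))


# Stand-in data: a list of 11 placeholder batches per party.
common_steps = check_builder_outputs(
    {"alice": (["batch"] * 11, 11), "bob": (["batch"] * 11, 11)}
)
print(common_steps)
```

In a real deployment the datasets stay with their owners, so a check like this would only compare the integer step counts.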

## Download Data

Flower Dataset Introduction: The Flower dataset consists of 4323 color images of 5 types of flowers (daisy, dandelion, rose, sunflower, and tulip). Each type is photographed from multiple angles and under different lighting conditions, and the resolution of each image is 320x240.
This dataset is commonly used for training and testing image classification and machine learning algorithms. The number of images per category is as follows: daisy (633), dandelion (898), rose (641), sunflower (699), and tulip (852).

Download link: [http://download.tensorflow.org/example_images/flower_photos.tgz](http://download.tensorflow.org/example_images/flower_photos.tgz)

<img alt="flower_dataset_demo.png" src="resources/flower_dataset_demo.png" width="600">



### Download Data and Unzip

In [3]:
import tempfile
import tensorflow as tf

_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/tf_flowers/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)

Downloading data from https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/tf_flowers/flower_photos.tgz


Next, let's start building a custom DataBuilder.

## 1. Develop DataBuilder with single-machine version engine

When developing a DataBuilder, we are free to follow the usual single-machine development logic.
The goal is to build a `tf.data.Dataset` object.


In [4]:
import math
import tensorflow as tf

img_height = 180
img_width = 180
batch_size = 32
# In this example, we use the TensorFlow interface for development.
data_set = tf.keras.utils.image_dataset_from_directory(
    path_to_flower_dataset,
    validation_split=0.2,
    subset="both",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
)

Found 436 files belonging to 5 classes.
Using 349 files for training.
Using 87 files for validation.


2023-04-10 13:16:34.492390: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


In [5]:
train_set = data_set[0]
test_set = data_set[1]

In [6]:
print(type(train_set), type(test_set))

<class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'> <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'>


In [7]:
x, y = next(iter(train_set))
print(f"x.shape = {x.shape}")
print(f"y.shape = {y.shape}")

x.shape = (32, 180, 180, 3)
y.shape = (32,)
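
The file counts and batch shape in the output above fit together as follows. This sketch assumes that the validation split rounds down and that the final partial batch is kept, which is consistent with the reported counts; `split_and_batch` is a hypothetical helper, not a TensorFlow function.

```python
import math


def split_and_batch(num_files, validation_split, batch_size):
    """Return split sizes and batch counts for a directory of images."""
    num_val = int(validation_split * num_files)  # assumed to round down
    num_train = num_files - num_val
    return {
        "train_files": num_train,
        "val_files": num_val,
        "train_steps": math.ceil(num_train / batch_size),
        "val_steps": math.ceil(num_val / batch_size),
        # Size of the final training batch (a full batch if it divides evenly).
        "last_train_batch": num_train % batch_size or batch_size,
    }


summary = split_and_batch(num_files=436, validation_split=0.2, batch_size=32)
print(summary)
# {'train_files': 349, 'val_files': 87, 'train_steps': 11, 'val_steps': 3, 'last_train_batch': 29}
```

The first batch drawn from `train_set` is a full one, which is why `x.shape` starts with 32.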


## 2. Wrap the developed DataBuilder

The DataBuilder we developed needs to be distributed to each party's machine for execution, so we wrap it in a function so that it can be serialized.
Note: **FLModel requires the DataBuilder to return two results: (data_set, steps_per_epoch).**


In [8]:
def create_dataset_builder(
    batch_size=32,
):
    def dataset_builder(folder_path, stage="train"):
        # Import inside the function so that the builder is self-contained
        # when it is serialized and executed on each party.
        import math

        import tensorflow as tf

        img_height = 180
        img_width = 180
        # With subset="both", this returns a (train, validation) pair.
        data_set = tf.keras.utils.image_dataset_from_directory(
            folder_path,
            validation_split=0.2,
            subset="both",
            seed=123,
            image_size=(img_height, img_width),
            batch_size=batch_size,
        )
        if stage == "train":
            train_dataset = data_set[0]
            train_step_per_epoch = math.ceil(len(data_set[0].file_paths) / batch_size)
            return train_dataset, train_step_per_epoch
        elif stage == "eval":
            eval_dataset = data_set[1]
            eval_step_per_epoch = math.ceil(len(data_set[1].file_paths) / batch_size)
            return eval_dataset, eval_step_per_epoch
        else:
            # Fail loudly instead of silently returning None.
            raise ValueError(f"Unsupported stage: {stage!r}")

    return dataset_builder
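
The control flow of the wrapper can be exercised without TensorFlow. The sketch below is a stand-in, not the builder used in this tutorial: `create_list_dataset_builder` is a hypothetical name, and a list of batches replaces the `tf.data.Dataset`. It keeps the same contract: the outer function captures the configuration, and the inner function takes the input and `stage` and returns `(dataset, steps_per_epoch)`.

```python
import math


def create_list_dataset_builder(batch_size=32, validation_split=0.2):
    def dataset_builder(items, stage="train"):
        # Split the items into a training part and an evaluation part.
        num_val = int(validation_split * len(items))
        split = len(items) - num_val
        if stage == "train":
            subset = items[:split]
        elif stage == "eval":
            subset = items[split:]
        else:
            raise ValueError(f"Unsupported stage: {stage!r}")
        # A list of batches stands in for the dataset object.
        batches = [
            subset[i : i + batch_size] for i in range(0, len(subset), batch_size)
        ]
        return batches, math.ceil(len(subset) / batch_size)

    return dataset_builder


builder = create_list_dataset_builder(batch_size=32)
train_batches, train_steps = builder(list(range(436)), stage="train")
eval_batches, eval_steps = builder(list(range(436)), stage="eval")
print(train_steps, eval_steps)
```

Calling the builder once per stage like this is a quick way to confirm that both branches return a two-element result before the builder is handed to `FLModel`.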

## 3. Build data_builder_dict

In the horizontal scenario, all parties process data with the same logic, so a single wrapped DataBuilder constructor is enough.
Next, we build the `data_builder_dict`.

In [9]:
data_builder_dict = {
    alice: create_dataset_builder(
        batch_size=32,
    ),
    bob: create_dataset_builder(
        batch_size=32,
    ),
}
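
When every party uses the same builder, the dictionary can also be written as a comprehension over the parties. This is a minimal sketch with stand-ins: the strings replace the `alice` and `bob` PYU objects, and the hypothetical `create_stub_dataset_builder` replaces the wrapper defined in step 2.

```python
def create_stub_dataset_builder(batch_size=32):
    # Stub standing in for the wrapper defined in step 2.
    def dataset_builder(folder_path, stage="train"):
        return (folder_path, stage, batch_size), 1

    return dataset_builder


parties = ["alice", "bob"]  # stand-ins for the PYU devices
data_builder_dict = {
    party: create_stub_dataset_builder(batch_size=32) for party in parties
}
print(sorted(data_builder_dict))
```

Each party receives its own builder instance, so per-party settings can still be introduced later by changing the arguments for one key.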

## 4. Pass data_builder_dict to the model

Next, we define the model and train it with the custom data builders constructed above.

In [10]:
def create_conv_flower_model(input_shape, num_classes, name='model'):
    def create_model():
        # Import inside the function so that the model definition is
        # self-contained when it is executed on each party.
        from tensorflow import keras

        # Create model
        model = keras.Sequential(
            [
                keras.Input(shape=input_shape),
                keras.layers.Rescaling(1.0 / 255),
                keras.layers.Conv2D(32, 3, activation='relu'),
                keras.layers.MaxPooling2D(),
                keras.layers.Conv2D(32, 3, activation='relu'),
                keras.layers.MaxPooling2D(),
                keras.layers.Conv2D(32, 3, activation='relu'),
                keras.layers.MaxPooling2D(),
                keras.layers.Flatten(),
                keras.layers.Dense(128, activation='relu'),
                keras.layers.Dense(num_classes),
            ]
        )
        # Compile model
        model.compile(
            loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            optimizer='adam',
            metrics=["accuracy"],
        )
        return model

    return create_model
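
The size of the flattened feature map, and hence of the first dense layer, follows from the input shape. The sketch below traces the spatial size through the three convolution and pooling stages, assuming the Keras defaults that the model above relies on (`Conv2D` with `padding='valid'` and stride 1, `MaxPooling2D` with a 2x2 window and stride 2). `trace_feature_size` is a hypothetical helper.

```python
def trace_feature_size(size, num_blocks=3, kernel=3, pool=2):
    """Return the spatial size after each Conv2D + MaxPooling2D block."""
    sizes = [size]
    for _ in range(num_blocks):
        size = size - kernel + 1  # 'valid' convolution with stride 1
        size = size // pool       # max pooling with stride equal to the window
        sizes.append(size)
    return sizes


sizes = trace_feature_size(180)
flat_features = sizes[-1] ** 2 * 32        # 32 channels after the last Conv2D
dense_params = flat_features * 128 + 128   # weights plus biases of Dense(128)
print(sizes, flat_features, dense_params)
# [180, 89, 43, 20] 12800 1638528
```

Most of the model's parameters therefore sit in the first dense layer, which is worth keeping in mind because the parties exchange model weights during aggregation.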

In [11]:
from secretflow_fl.ml.nn import FLModel
from secretflow.security.aggregation import SecureAggregator

In [12]:
device_list = [alice, bob]
aggregator = SecureAggregator(charlie, [alice, bob])

# prepare model
num_classes = 5
input_shape = (180, 180, 3)

# keras model
model = create_conv_flower_model(input_shape, num_classes)


fed_model = FLModel(
    device_list=device_list,
    model=model,
    aggregator=aggregator,
    backend="tensorflow",
    strategy="fed_avg_w",
    random_seed=1234,
)

INFO:root:Create proxy actor <class 'secretflow.security.aggregation.secure_aggregator._Masker'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.security.aggregation.secure_aggregator._Masker'> with party bob.
INFO:root:Create proxy actor <class 'secretflow_fl.ml.nn.fl.backend.tensorflow.strategy.fed_avg_w.PYUFedAvgW'> with party alice.
INFO:root:Create proxy actor <class 'secretflow_fl.ml.nn.fl.backend.tensorflow.strategy.fed_avg_w.PYUFedAvgW'> with party bob.


The input to our dataset builder is the path of the image dataset, so the input data must be a `Dict` that maps each party to its folder path.
```python
data = {
    alice: folder_path_of_alice,
    bob: folder_path_of_bob
}
```
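
Each party listed in `data` should have a matching entry in `data_builder_dict`. The following is a minimal sketch of that check, with strings standing in for the PYU devices; `find_unmatched_parties` is a hypothetical helper, and the check reflects an assumption about good practice rather than a documented SecretFlow validation.

```python
def find_unmatched_parties(data, builders):
    """Return parties that lack a builder and builders that lack input data."""
    missing_builder = sorted(set(data) - set(builders))
    missing_data = sorted(set(builders) - set(data))
    return missing_builder, missing_data


# Stand-in dictionaries keyed by party name instead of PYU objects.
data = {"alice": "path/of/alice", "bob": "path/of/bob"}
builders = {"alice": object(), "bob": object()}
print(find_unmatched_parties(data, builders))
# ([], [])
```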

In [13]:
data = {
    alice: path_to_flower_dataset,
    bob: path_to_flower_dataset,
}
history = fed_model.fit(
    data,
    None,
    validation_data=data,
    epochs=5,
    batch_size=32,
    aggregate_freq=2,
    sampler_method="batch",
    random_seed=1234,
    dp_spent_step_freq=1,
    dataset_builder=data_builder_dict,
)

INFO:root:FL Train Params: {'self': <secretflow_fl.ml.nn.fl.fl_model.FLModel object at 0x7f7b7a28b8e0>, 'x': {alice: '../../public_dataset/datasets/flower_photos', bob: '../../public_dataset/datasets/flower_photos'}, 'y': None, 'batch_size': 32, 'batch_sampling_rate': None, 'epochs': 5, 'verbose': 1, 'callbacks': None, 'validation_data': {alice: '../../public_dataset/datasets/flower_photos', bob: '../../public_dataset/datasets/flower_photos'}, 'shuffle': False, 'class_weight': None, 'sample_weight': None, 'validation_freq': 1, 'aggregate_freq': 2, 'label_decoder': None, 'max_batch_size': 20000, 'prefetch_buffer_size': None, 'sampler_method': 'batch', 'random_seed': 1234, 'dp_spent_step_freq': 1, 'audit_log_dir': None, 'dataset_builder': {alice: <function create_dataset_builder.<locals>.dataset_builder at 0x7f7b7a2bb1f0>, bob: <function create_dataset_builder.<locals>.dataset_builder at 0x7f7b7a2bb0d0>}}
32it [00:18,  1.71it/s, epoch: 1/5 -  loss:1.5339548587799072  accuracy:0.314255982

Next, you can try this with your own dataset.