# Loading NumPy Data in SecretFlow

The following code is for demonstration only. It is **NOT for production** due to system security concerns; please **DO NOT** use it directly in production.

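This tutorial shows how to load NumPy data in SecretFlow's multi-party secure environment.  
SecretFlow supports both the `npy` and `npz` formats, and its interface is consistent with that of `numpy`.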
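
As a refresher on the two formats in plain NumPy (outside SecretFlow), the sketch below saves and reloads a toy array both ways. The array contents and file names are made up for illustration: an `npy` file holds a single array, while an `npz` file is an archive of named arrays that `np.load` returns as a dict-like object.

```python
import os
import tempfile

import numpy as np

# Toy arrays standing in for real features and labels (illustrative only).
features = np.arange(12, dtype=np.float32).reshape(4, 3)
labels = np.array([0, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "features.npy")
    npz_path = os.path.join(tmp_dir, "dataset.npz")

    # An .npy file stores exactly one array.
    np.save(npy_path, features)
    loaded_array = np.load(npy_path)

    # An .npz file is an archive of named arrays; np.load returns a dict-like NpzFile.
    np.savez(npz_path, x=features, y=labels)
    with np.load(npz_path) as archive:
        npz_keys = sorted(archive.files)
        loaded_x = archive["x"]

print(type(loaded_array).__name__, loaded_array.shape)
print(npz_keys, loaded_x.shape)
```

This difference is why the later sections treat the two cases separately: an `npz` file yields one entry per stored key, while an `npy` file yields a single array.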

## Environment Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=True)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')

2023-04-17 16:47:03,538	INFO worker.py:1538 -- Started a local Ray instance.


## API Overview

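SecretFlow provides `secretflow.data.ndarray.load`, an interface similar to `numpy.load`, which reads each party's `ndarray` data into a single federated data object.  

`secretflow.data.ndarray.load` reads NumPy files from multiple parties and combines them into a `FedNdarray`.

API reference: [secretflow.data.ndarray.load](https://www.secretflow.org.cn/docs/secretflow/en/source/secretflow.data.html#secretflow.data.ndarray.load)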


## Downloading and Splitting the Data

In [3]:
%%capture
%%!
wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz
pip install opencv-python

E0417 16:47:08.224327619   13264 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers


In [4]:
import numpy as np
all_data = np.load("./mnist.npz")

Split the data between alice and bob.

In [5]:
# MNIST has 60,000 training samples and 10,000 test samples; each party gets half of both.
alice_train_x = all_data["x_train"][:30000]
alice_test_x = all_data["x_test"][:5000]
alice_train_y = all_data["y_train"][:30000]
alice_test_y = all_data["y_test"][:5000]

bob_train_x = all_data["x_train"][30000:]
bob_test_x = all_data["x_test"][5000:]
bob_train_y = all_data["y_train"][30000:]
bob_test_y = all_data["y_test"][5000:]
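
Hard-coded split points are easy to get wrong when the training and test sets differ in size (MNIST has 60,000 training and 10,000 test samples). A minimal sketch of deriving the split point from each array's own length follows. The helper `split_in_half` is hypothetical and not part of SecretFlow or NumPy, and the zero-filled arrays only mimic the sample counts of MNIST.

```python
import numpy as np


def split_in_half(array):
    """Split an array into two parts along the first (sample) axis."""
    midpoint = array.shape[0] // 2
    return array[:midpoint], array[midpoint:]


# Synthetic stand-ins: only the first-axis length matters for this illustration.
x_train = np.zeros((60000, 4), dtype=np.uint8)
x_test = np.zeros((10000, 4), dtype=np.uint8)

alice_part, bob_part = split_in_half(x_train)
alice_test_part, bob_test_part = split_in_half(x_test)
print(alice_part.shape, bob_part.shape)
print(alice_test_part.shape, bob_test_part.shape)

# Slicing past the end of an array does not raise; it silently returns no rows.
print(x_test[30000:].shape)
```

Because NumPy slicing never raises on out-of-range bounds, printing the resulting shapes is a cheap way to confirm that no party ends up with an empty partition.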

Save each party's data as an npz file.

In [6]:
np.savez("./alice_mnist.npz", train_x=alice_train_x, test_x=alice_test_x, train_y=alice_train_y, test_y=alice_test_y)
np.savez("./bob_mnist.npz", train_x=bob_train_x, test_x=bob_test_x, train_y=bob_train_y, test_y=bob_test_y)

Save alice's and bob's `train_x` in npy format as well, for use in the npy loading section later.

In [7]:
np.save("./alice_mnist_train_x.npy", alice_train_x)
np.save("./bob_mnist_train_x.npy", bob_train_x)

## Loading npz Files

In [8]:
alice_path = "./alice_mnist.npz"
bob_path = "./bob_mnist.npz"

In [9]:
from secretflow.data.ndarray import load
from secretflow.data.split import train_test_split

In [10]:
fed_npz = load({alice: alice_path, bob: bob_path}, allow_pickle=True)

In [11]:
fed_npz

{'train_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fba90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fb580>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_x': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec825fbdc0>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570070>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'train_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570190>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec825704f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_y': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570280>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7fec82570850>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}

Each value in `fed_npz` is a `FedNdarray`.

In [12]:
type(fed_npz["train_x"])

secretflow.data.ndarray.FedNdarray
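
To picture the layout of this result, the plain-NumPy analogy below regroups per-party files into one entry per key, each holding one partition per party. This is only an illustration of the shape of the result, not how SecretFlow implements `load`. In SecretFlow each partition stays on its owner's PYU device, whereas here everything lives in one process. The `party_files` data and the `group_by_key` helper are hypothetical.

```python
import numpy as np

# Hypothetical in-memory stand-ins for the contents of each party's .npz file.
party_files = {
    "alice": {"train_x": np.zeros((3, 28, 28)), "train_y": np.zeros(3)},
    "bob": {"train_x": np.ones((2, 28, 28)), "train_y": np.ones(2)},
}


def group_by_key(files_by_party):
    """Regroup {party: {key: array}} into {key: {party: array}}."""
    grouped = {}
    for party, arrays in files_by_party.items():
        for key, array in arrays.items():
            grouped.setdefault(key, {})[party] = array
    return grouped


grouped = group_by_key(party_files)
for key, partitions in grouped.items():
    print(key, {party: part.shape for party, part in partitions.items()})
```

Each key maps to a set of per-party partitions, which mirrors the `partitions={alice: ..., bob: ...}` field shown in the `FedNdarray` output above.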

## Loading npy Files

Loading an npy file is straightforward: calling `load` directly returns a standard `FedNdarray` object.

In [13]:
alice_path = "./alice_mnist_train_x.npy"
bob_path = "./bob_mnist_train_x.npy"

In [14]:
fed_ndarray = load({alice: alice_path, bob: bob_path}, allow_pickle=True)

In [15]:
type(fed_ndarray)

secretflow.data.ndarray.FedNdarray

## How Do I Convert My Existing Data into a FedNdarray and Load It?

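How can other types of data be converted into a `FedNdarray`?  
For example, given an image dataset or a speech dataset, how can the data be passed to a federated model through `FedNdarray`?  
Here we use the Flower classification dataset as an example.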

In [16]:
import tempfile
import tensorflow as tf

_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)

2023-04-17 11:41:31.422425: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
2023-04-17 11:41:31.456434: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-17 11:41:32.352006: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/us

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz


After download and extraction, the dataset root directory is `flower_photos`.

In [17]:
import os, glob
import numpy as np
import cv2  # install this dependency yourself: pip install opencv-python

root = path_to_flower_dataset
classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
img_paths = []  # paths of all images
labels = []  # class label of each image (0, 1, 2, 3, 4)
for i, label in enumerate(classes):
    cls_img_paths = glob.glob(os.path.join(root, label, "*.jpg"))
    img_paths.extend(cls_img_paths)
    labels.extend([i] * len(cls_img_paths))

# Convert each image to a numpy array
img_numpys = []
labels = np.array(labels)
for img_path in img_paths:
    img_numpy = cv2.imread(img_path)
    img_numpy = cv2.resize(img_numpy, (240, 240))
    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))
    # If you use the PyTorch backend, the dimensions should be reordered
    # img_numpy = np.transpose(img_numpy, (0,3,1,2))
    img_numpys.append(img_numpy)

images = np.concatenate(img_numpys, axis=0)
print(images.shape)
print(labels.shape)

# Split images and labels between the two parties, 50% of the data each
per = 0.5
split_idx = int(per * images.shape[0])
alice_images = images[:split_idx, :, :, :]
alice_label = labels[:split_idx]
bob_images = images[split_idx:, :, :, :]
bob_label = labels[split_idx:]
print(f"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}")
print(f"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}")

# Save each party's data as npz, then send the files to the two machines
np.savez("flower_alice.npz", image=alice_images, label=alice_label)
np.savez("flower_bob.npz", image=bob_images, label=bob_label)

(3670, 240, 240, 3)
(3670,)
alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)
bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)


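Once you have the required npz files, use the `load` function introduced above to read them into `FedNdarray` objects, which can then be fed into a model for training.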

In [18]:
fed_flower_npz = load({alice: "./flower_alice.npz", bob: "./flower_bob.npz"}, allow_pickle=True)

In [19]:
fed_flower_npz

{'image': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7feccc781d90>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea640>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'label': FedNdarray(partitions={alice: <secretflow.device.device.pyu.PYUObject object at 0x7febc2cea790>, bob: <secretflow.device.device.pyu.PYUObject object at 0x7febc2ceaa30>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}

In [20]:
fed_image = fed_flower_npz["image"]

In [21]:
fed_image.partition_shape()

{alice: (1835, 240, 240, 3), bob: (1835, 240, 240, 3)}

## Tips

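After converting your data to `ndarray`, we recommend first testing it with a single-machine training engine to check that the data format matches the model, and only then testing with SecretFlow's federated framework. This makes troubleshooting more efficient.  
*Note: when working with image datasets, pay attention to the dimension order.*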
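
As a small illustration of the dimension-order point, the sketch below converts a batch from channels-last (NHWC, the Keras default) to channels-first (NCHW, the layout PyTorch convolution layers expect) and runs a simple local shape check before any federated training. The helper `check_image_batch` is hypothetical, and the 240x240x3 shape simply follows the Flower example above.

```python
import numpy as np

# A small batch in NHWC layout: (batch, height, width, channels).
batch_nhwc = np.zeros((8, 240, 240, 3), dtype=np.uint8)

# Move the channel axis to position 1 to get NCHW: (batch, channels, height, width).
batch_nchw = np.transpose(batch_nhwc, (0, 3, 1, 2))


def check_image_batch(batch, channels_last=True, num_channels=3):
    """Raise ValueError unless the batch is 4-D with channels on the expected axis."""
    if batch.ndim != 4:
        raise ValueError(f"expected a 4-D batch, got shape {batch.shape}")
    channel_axis = 3 if channels_last else 1
    if batch.shape[channel_axis] != num_channels:
        raise ValueError(
            f"expected {num_channels} channels on axis {channel_axis}, got shape {batch.shape}"
        )


check_image_batch(batch_nhwc, channels_last=True)
check_image_batch(batch_nchw, channels_last=False)
print(batch_nhwc.shape, "->", batch_nchw.shape)
```

Running a check like this on each party's local array before saving it to npz catches layout mismatches early, when they are still easy to debug on a single machine.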