# DataFrame

>The following code is for demonstration only. For system security reasons, it is **NOT for production**; please **DO NOT** use it directly in production.

We recommend running this tutorial in [Jupyter](https://jupyter.org/).

SecretFlow provides federated data encapsulation in the form of a DataFrame. A DataFrame is composed of data blocks held by multiple parties and supports both horizontally and vertically partitioned data.

<img alt="dataframe.png" src="../resources/dataframe.png" width="600">

Currently, secretflow.DataFrame provides a subset of pandas operations, with interfaces that are essentially the same as their pandas counterparts. During computation, the original data stays with its holder and never leaves that holder's domain.
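
To make the two partitioning modes concrete, here is a minimal plain-Python sketch that does not use SecretFlow at all. The toy table, the column names, and the `take_columns` helper are assumptions made for illustration, and rows are assumed to be aligned by position across parties.

```python
# Toy table: each row is (sample_id, feature_a, feature_b, label).
rows = [
    (0, 5.1, 3.5, 0),
    (1, 4.9, 3.0, 0),
    (2, 6.3, 2.5, 2),
    (3, 5.9, 3.0, 2),
]

# Horizontal partitioning: every party holds all columns for a subset of rows.
horizontal = {"alice": rows[:2], "bob": rows[2:]}


def take_columns(table, start, stop):
    """Hypothetical helper: keep columns [start, stop) of every row."""
    return [row[start:stop] for row in table]


# Vertical partitioning: every party holds all rows for a subset of columns.
vertical = {
    "alice": take_columns(rows, 0, 2),
    "bob": take_columns(rows, 2, 4),
}

# Recombining either layout gives back the original table.
rebuilt_h = horizontal["alice"] + horizontal["bob"]
rebuilt_v = [a + b for a, b in zip(vertical["alice"], vertical["bob"])]
print(rebuilt_h == rows, rebuilt_v == rows)  # True True
```

In a real federated setting the recombination step never happens in one place; it is shown here only to confirm that both layouts describe the same table.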



The following sections demonstrate how to use a DataFrame.

## Preparation

Initialize SecretFlow and create three parties: alice, bob, and carol.

In [1]:
import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob', 'carol'], address='local')
alice, bob, carol = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('carol')

## Data preparation

Here we use [iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) as example data.

In [2]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
data = pd.concat([iris.data, iris.target], axis=1)
data

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                  5.1               3.5                1.4               0.2       0
1                  4.9               3.0                1.4               0.2       0
2                  4.7               3.2                1.3               0.2       0
3                  4.6               3.1                1.5               0.2       0
4                  5.0               3.6                1.4               0.2       0
..                 ...               ...                ...               ...     ...
145                6.7               3.0                5.2               2.3       2
146                6.3               2.5                5.0               1.9       2
147                6.5               3.0                5.2               2.0       2
148                6.2               3.4                5.4               2.3       2


We partition the data both horizontally (every party holds the same features but different samples) and vertically (every party holds different features of the same samples) for use in the demonstrations that follow.

In [3]:
# Horizontal partitioning.
h_alice, h_bob, h_carol = data.iloc[:40, :], data.iloc[40:100, :], data.iloc[100:, :]

# Save to temporary files.
import tempfile
import os

temp_dir = tempfile.mkdtemp()

h_alice_path = os.path.join(temp_dir, 'h_alice.csv')
h_bob_path = os.path.join(temp_dir, 'h_bob.csv')
h_carol_path = os.path.join(temp_dir, 'h_carol.csv')
h_alice.to_csv(h_alice_path, index=False)
h_bob.to_csv(h_bob_path, index=False)
h_carol.to_csv(h_carol_path, index=False)
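
The three `iloc` slices above are meant to split the 150 rows into disjoint pieces that together cover the whole dataset. The following plain-Python sketch checks that property using only the slice boundaries from the cell above; the variable names are illustrative and not part of the tutorial's code.

```python
n_rows = 150  # number of samples in the iris dataset

# (start, stop) bounds matching data.iloc[:40], data.iloc[40:100], data.iloc[100:].
bounds = {"alice": (0, 40), "bob": (40, 100), "carol": (100, n_rows)}

sizes = {party: stop - start for party, (start, stop) in bounds.items()}
covered = [i for start, stop in bounds.values() for i in range(start, stop)]

# Every row index appears exactly once, in order: no gaps and no overlaps.
assert covered == list(range(n_rows))
print(sizes)  # {'alice': 40, 'bob': 60, 'carol': 50}
```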

In [4]:
h_alice.head(), h_bob.head(), h_carol.head()

(   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 0                5.1               3.5                1.4               0.2   
 1                4.9               3.0                1.4               0.2   
 2                4.7               3.2                1.3               0.2   
 3                4.6               3.1                1.5               0.2   
 4                5.0               3.6                1.4               0.2   
 
    target  
 0       0  
 1       0  
 2       0  
 3       0  
 4       0  ,
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 40                5.0               3.5                1.3               0.3   
 41                4.5               2.3                1.3               0.3   
 42                4.4               3.2                1.3               0.2   
 43                5.0               3.5                1.6               0.6   
 44                5.1            

In [5]:
# Vertical partitioning.
v_alice, v_bob, v_carol = data.iloc[:, :2], data.iloc[:, 2:4], data.iloc[:, 4:]

# Save to temporary files.
v_alice_path = os.path.join(temp_dir, 'v_alice.csv')
v_bob_path = os.path.join(temp_dir, 'v_bob.csv')
v_carol_path = os.path.join(temp_dir, 'v_carol.csv')
v_alice.to_csv(v_alice_path, index=False)
v_bob.to_csv(v_bob_path, index=False)
v_carol.to_csv(v_carol_path, index=False)

In [6]:
v_alice, v_bob, v_carol

(     sepal length (cm)  sepal width (cm)
 0                  5.1               3.5
 1                  4.9               3.0
 2                  4.7               3.2
 3                  4.6               3.1
 4                  5.0               3.6
 ..                 ...               ...
 145                6.7               3.0
 146                6.3               2.5
 147                6.5               3.0
 148                6.2               3.4
 149                5.9               3.0
 
 [150 rows x 2 columns],
      petal length (cm)  petal width (cm)
 0                  1.4               0.2
 1                  1.4               0.2
 2                  1.3               0.2
 3                  1.5               0.2
 4                  1.4               0.2
 ..                 ...               ...
 145                5.2               2.3
 146                5.0               1.9
 147                5.2               2.0
 148                5.4               2.3
 149   

## Creation

### Horizontal DataFrame

Create a DataFrame consisting of horizontally partitioned data.

> 💡 The original data is still stored locally in the data holder and is not transmitted out of the domain.

Here, as a simple showcase, we use secure aggregation and SPU-based comparison. Refer to [Secure Aggregation](../../developer/algorithm/secure_aggregation.ipynb) to learn more about secure aggregation solutions, and choose a security policy that fits your needs.

In [7]:
from secretflow.data.horizontal import read_csv as h_read_csv
from secretflow.security.aggregation import SecureAggregator
from secretflow.security.compare import SPUComparator

# The aggregator and comparator are respectively used to aggregate
# or compare data in subsequent data analysis operations.
aggr = SecureAggregator(device=alice, participants=[alice, bob, carol])

spu = sf.SPU(sf.utils.testing.cluster_def(parties=['alice', 'bob', 'carol']))
comp = SPUComparator(spu)
hdf = h_read_csv(
    {alice: h_alice_path, bob: h_bob_path, carol: h_carol_path},
    aggregator=aggr,
    comparator=comp,
)
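
For intuition about what an aggregator can achieve, the sketch below shows one well-known idea behind secure aggregation: pairwise masks that cancel in the sum. It is a toy illustration under stated assumptions, not the implementation of SecretFlow's `SecureAggregator`. The function name `masked_inputs`, the modulus, and the use of a single seeded generator in place of pairwise-agreed secrets are all assumptions made for the example.

```python
import random

MODULUS = 2**32


def masked_inputs(values, seed=0):
    """Return one masked value per party; the pairwise masks cancel in the sum."""
    # A single seeded generator stands in for secrets that each pair of
    # parties would agree on privately in a real protocol.
    rng = random.Random(seed)
    masked = list(values)
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.randrange(MODULUS)
            masked[i] = (masked[i] + mask) % MODULUS  # party i adds the mask
            masked[j] = (masked[j] - mask) % MODULUS  # party j subtracts it
    return masked


# Row counts held by alice, bob and carol in this tutorial's horizontal split.
local_counts = [40, 60, 50]
masked = masked_inputs(local_counts)

# The aggregating party only sees masked values, yet their sum is exact.
total = sum(masked) % MODULUS
print(total)  # 150
```

Each masked value on its own reveals nothing useful about the party's input as long as the masks stay secret, while the sum is still exact.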

### Vertical DataFrame

Create a DataFrame consisting of vertically partitioned data.

> 💡 The original data is still stored locally in the data holder and is not transmitted out of the domain.

In [8]:
from secretflow.data.vertical import read_csv as v_read_csv

vdf = v_read_csv({alice: v_alice_path, bob: v_bob_path, carol: v_carol_path})

## Data analysis

To protect data privacy, DataFrame does not allow raw data to be viewed. Instead, it provides a pandas-like interface for analyzing data. These interfaces generally support both horizontally and vertically partitioned data.

> 💡 During the following operations, the original data of the DataFrame is still stored locally on the node and is not transmitted out of the domain.

In [9]:
hdf.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

In [10]:
vdf.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

Get the minimum values. The results are consistent with those computed on the original data.

In [11]:
print('Horizontal df:\n', hdf.min())
print('\nVertical df:\n', vdf.min())
print('\nPandas:\n', data.min())

Horizontal df:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

Vertical df:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

Pandas:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64
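
The agreement between the three results follows from how a minimum decomposes over partitions. The sketch below illustrates this in plain Python with made-up values. It shows only the arithmetic, not how SecretFlow protects the intermediate values (for horizontal data, that is presumably the role of the comparator passed in at creation time).

```python
# Horizontal split: each party holds some rows of the same column.
h_parts = {"alice": [5.1, 4.9, 4.7], "bob": [4.3, 5.0], "carol": [6.3, 5.8]}

# Each party computes a local minimum; the global minimum is the
# minimum of those local results.
local_mins = {party: min(values) for party, values in h_parts.items()}
global_min = min(local_mins.values())

# Vertical split: each party holds whole columns, so every column's
# minimum can be computed by its owner and the results are simply combined.
v_parts = {
    "alice": {"sepal length (cm)": [5.1, 4.3, 6.3]},
    "bob": {"petal length (cm)": [1.4, 1.0, 5.0]},
}
v_min = {
    column: min(values)
    for columns in v_parts.values()
    for column, values in columns.items()
}

print(global_min)  # 4.3
print(v_min)  # {'sepal length (cm)': 4.3, 'petal length (cm)': 1.0}
```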


You can also view statistics such as the maximum, mean, and count.

In [12]:
hdf.max()

sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
target               2.0
dtype: float64

In [13]:
vdf.max()

sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
target               2.0
dtype: float64

In [14]:
hdf.mean(numeric_only=True)

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64

In [15]:
vdf.mean(numeric_only=True)

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64
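
For horizontally partitioned data, a global mean cannot be obtained by averaging the per-party means when the parties hold different numbers of rows (40, 60 and 50 in this tutorial). A minimal sketch of the arithmetic, using made-up values and plain Python rather than SecretFlow: each party contributes a local sum and a local count, and only those are combined.

```python
# Two parties holding different numbers of rows of the same column.
h_parts = {"alice": [1.0, 2.0], "bob": [3.0, 4.0, 5.0, 6.0]}

# Each party reports (local sum, local count).
local_stats = {party: (sum(v), len(v)) for party, v in h_parts.items()}

total_sum = sum(s for s, _ in local_stats.values())
total_count = sum(c for _, c in local_stats.values())
global_mean = total_sum / total_count

# Averaging the local means weights each party equally, which is wrong
# when partition sizes differ.
mean_of_means = sum(s / c for s, c in local_stats.values()) / len(local_stats)

print(global_mean)    # 3.5
print(mean_of_means)  # 3.0
```

The sums and counts are exactly the kind of values that an aggregator would combine, so that no party has to reveal its individual rows.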

In [16]:
hdf.count()

sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
target               150
dtype: int64

In [17]:
vdf.count()

sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
target               150
dtype: int64

### Selection

Select a subset of columns.

In [18]:
hdf_part = hdf[['sepal length (cm)', 'target']]
hdf_part.mean(numeric_only=True)

sepal length (cm)    5.843333
target               1.000000
dtype: float64

In [19]:
vdf_part = vdf[['sepal width (cm)', 'target']]
vdf_part.mean(numeric_only=True)

sepal width (cm)    3.057333
target              1.000000
dtype: float64
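
On a vertically partitioned DataFrame, the selected columns may live with different parties. The sketch below shows one way to think about this: the selection is routed to whichever party owns each column. The `owners` map mirrors the vertical split used in this tutorial, while `route_selection` is a hypothetical helper and not a SecretFlow function.

```python
# Column ownership matching data.iloc[:, :2], data.iloc[:, 2:4], data.iloc[:, 4:].
owners = {
    "alice": ["sepal length (cm)", "sepal width (cm)"],
    "bob": ["petal length (cm)", "petal width (cm)"],
    "carol": ["target"],
}


def route_selection(owners, wanted):
    """Map each party to the requested columns it holds; skip parties with none."""
    known = {column for columns in owners.values() for column in columns}
    missing = set(wanted) - known
    if missing:
        raise KeyError(f"unknown columns: {sorted(missing)}")
    routed = {
        party: [column for column in columns if column in wanted]
        for party, columns in owners.items()
    }
    return {party: columns for party, columns in routed.items() if columns}


selection = route_selection(owners, ["sepal width (cm)", "target"])
print(selection)  # {'alice': ['sepal width (cm)'], 'carol': ['target']}
```

In this picture bob is not involved at all, because none of the requested columns belong to him.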

### Modification

Horizontal DataFrame

In [20]:
hdf_copy = hdf.copy()
print('Min of target: ', hdf_copy['target'].min()[0])
print('Max of target: ', hdf_copy['target'].max()[0])

Min of target:  0.0
Max of target:  2.0


In [21]:
# Set target to 1.
hdf_copy['target'] = 1

# You can see that the value of target has become 1.
print('Min of target: ', hdf_copy['target'].min()[0])
print('Max of target: ', hdf_copy['target'].max()[0])

Min of target:  1.0
Max of target:  1.0


Vertical DataFrame

In [22]:
vdf_copy = vdf.copy()
print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])
print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])

Min of sepal width (cm):  2.0
Max of sepal width (cm):  4.4


In [23]:
# Set sepal width (cm) to 20.
vdf_copy['sepal width (cm)'] = 20

# You can see that the value of sepal width (cm) has become 20.
print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])
print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])

Min of sepal width (cm):  20
Max of sepal width (cm):  20
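
The modification cells above work on copies so that the original DataFrames stay usable. The following plain-Python sketch shows the assumed semantics: a scalar assignment is broadcast to every row of every partition of the copy, and the original partitions are left untouched. The data layout and the `assign_scalar` helper are assumptions for illustration, and whether SecretFlow's `copy()` behaves like a deep copy in every case is not confirmed by this tutorial.

```python
import copy

# Toy horizontal partitions: party -> column name -> values.
parts = {"alice": {"target": [0, 0]}, "bob": {"target": [1, 2, 2]}}


def assign_scalar(partitions, column, value):
    """Overwrite `column` with `value` in every partition, keeping row counts."""
    for table in partitions.values():
        n_rows = len(next(iter(table.values())))
        table[column] = [value] * n_rows


parts_copy = copy.deepcopy(parts)
assign_scalar(parts_copy, "target", 1)

print(parts_copy)  # {'alice': {'target': [1, 1]}, 'bob': {'target': [1, 1, 1]}}
print(parts)       # {'alice': {'target': [0, 0]}, 'bob': {'target': [1, 2, 2]}}
```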


## Ending

In [24]:
# Clean up temporary files

import shutil

shutil.rmtree(temp_dir, ignore_errors=True)
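
As an alternative to calling `shutil.rmtree` by hand, the standard library's `tempfile.TemporaryDirectory` removes the directory automatically when its `with` block exits. This only fits when all work that needs the files happens inside the block, which is not how this notebook is structured, so treat the sketch below as an option for scripts rather than a drop-in replacement. The file name and contents are illustrative.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "h_alice.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sepal length (cm)", "target"])
        writer.writerow([5.1, 0])
    # The file exists for as long as we stay inside the with block.
    existed_inside = os.path.exists(path)

# Leaving the block deletes the directory and everything in it.
removed_after = not os.path.exists(tmp)
print(existed_inside, removed_after)  # True True
```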

## What's Next?

Learn how to preprocess data with DataFrame in [this tutorial](../../tutorial/data_preprocessing_with_data_frame.ipynb).