# Encrypted Data-frames

The following notebook shows how to encrypt Pandas data-frames and run a left join on them using Fully Homomorphic Encryption (FHE) in a client-server setting with Concrete ML. This example is separated into three main sections:
1) Two independent clients load their own CSV file using Pandas, encrypt their data and send it to a server
2) The server runs a left join in FHE
3) One of the clients receives the encrypted output data-frame and decrypts it

In such a setting, several parties are thus able to merge private databases without ever disclosing any of their sensitive data. Additionally, Concrete ML provides a user-friendly API meant to be as close as possible to Pandas. 

In [1]:
import shutil
import time
from pathlib import Path
from tempfile import TemporaryDirectory

import numpy
import pandas

from concrete.ml.pandas import ClientEngine, load_encrypted_dataframe
from concrete.ml.pytest.utils import pandas_dataframe_are_equal

numpy.random.seed(0)

DATA_PATH = Path("data/encrypted_pandas")

# pylint: disable=pointless-statement, consider-using-with

## Clients

### User 1

On the first user's side, load the private data using Pandas. For this example, we took the [Tips](https://www.kaggle.com/code/sanjanabasu/tips-dataset/input) dataset and separated it into two CSV files so that:
- all columns are different, except for column "index", representing the initial data-frame's index
- some indexes are common, some others are not
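
A split with these two properties could be produced along the following lines. This is only a sketch: the miniature table, its values and the chosen row subsets are hypothetical stand-ins for the full Tips dataset, and it does not reproduce how the CSV files of this example were actually generated.

```python
import pandas

# Hypothetical miniature table standing in for the full Tips dataset
df_full = pandas.DataFrame(
    {
        "total_bill": [10.0, 12.5, 20.0, 15.0],
        "tip": [1.5, 2.0, 3.0, 2.5],
        "day": ["Sun", "Thur", "Sat", "Sun"],
        "size": [2, 2, 3, 4],
    }
)

# Turn the row labels into an explicit "index" column, used later as the merge key
df_full = df_full.reset_index()

# Each side keeps "index" plus its own columns, on partially overlapping rows
df_left = df_full.loc[[0, 1, 2], ["index", "total_bill", "tip"]]
df_right = df_full.loc[[1, 3], ["index", "day", "size"]]

# Only "index" is shared between the two sides, and only some keys overlap
common_columns = set(df_left.columns) & set(df_right.columns)
common_keys = set(df_left["index"].tolist()) & set(df_right["index"].tolist())
print(common_columns, common_keys)
```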

In [2]:
CLIENT_1_DIR = DATA_PATH / "client_1"

df_left = pandas.read_csv(CLIENT_1_DIR / "df_left.csv")

df_left

Unnamed: 0,index,total_bill,tip,sex,smoker
0,1,12.54,2.5,Male,No
1,2,11.17,1.5,Female,No
2,3,20.29,2.75,Female,No
3,4,14.07,2.5,Male,No
4,5,15.69,3.0,Male,Yes
5,6,18.29,3.0,Male,No
6,7,16.93,3.07,Female,No
7,8,24.27,2.03,Male,Yes
8,9,8.77,2.0,Male,No


A `ClientEngine` instance is then initialized, which is used for managing keys (encryption, decryption).

In [3]:
client_1_temp_dir = TemporaryDirectory(dir=str(CLIENT_1_DIR))
client_1_temp_path = Path(client_1_temp_dir.name)

# Define the path where the serialized keys are stored
client_1_keys_path = client_1_temp_path / "keys"

client_1 = ClientEngine(keys_path=client_1_keys_path)

Using the `ClientEngine` instance, the user is now able to encrypt the Pandas data-frame, building a new `EncryptedDataFrame` instance.

In [4]:
df_left_enc = client_1.encrypt_from_pandas(df_left)

`EncryptedDataFrame` objects can handle multiple data types: integers, floating points and strings. Under the hood, the data needs to be quantized to a certain precision before encryption (for more information on why, see [Key Concepts](../getting-started/concepts.md) and [Quantization](../explanations/inner-workings/quantization_internal.md)). More specifically:
- integers: the values are kept as they are but an error is raised if they are not within the range currently allowed
- floating points: the values are quantized to a certain precision, and quantization parameters (scale, zero-point) are sent to the server
- strings: the values are mapped to integers using a dict, which is sent to the server as well

More generally, the quantized values need to be within the range currently allowed. This notably means that the number of rows allowed in a data-frame is also limited, as we expect the keys on which to merge to be unique.
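
The string mapping and the range constraint described above can be pictured with a short sketch. It is not Concrete ML's implementation: the helper names `build_str_to_int` and `check_in_range`, the choice to start numbering at 1 and the 4-bit limit are all assumptions made for illustration.

```python
def build_str_to_int(values):
    """Map each distinct string to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1  # assumption: numbering starts at 1
    return mapping


def check_in_range(values, n_bits=4):
    """Raise if an integer falls outside the assumed [1, 2**n_bits - 1] range."""
    upper = 2**n_bits - 1
    for value in values:
        if not 1 <= value <= upper:
            raise ValueError(f"{value} is outside the allowed range [1, {upper}]")


sex = ["Male", "Female", "Female", "Male"]
str_to_int = build_str_to_int(sex)
encoded = [str_to_int[value] for value in sex]
check_in_range(encoded)

print(str_to_int)  # {'Male': 1, 'Female': 2}
```

Under these assumptions, a column of unique merge keys can hold at most `2**n_bits - 1` distinct values, which is why a bound on the values also bounds the number of rows.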

Once the inputs are quantized and encrypted, the user can print the encrypted data-frame's schema. A schema represents the data-frame's columns as well as their dtype and associated quantization parameters or mappings.  

In [5]:
df_left_enc.get_schema()

Unnamed: 0,index,total_bill,tip,sex,smoker
dtype,int64,float64,float64,object,object
scale,,0.903226,8.917197,,
zero_point,,6.92129,12.375796,,
str_to_int,,,,"{'Male': 1, 'Female': 2}","{'No': 1, 'Yes': 2}"
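
The `scale` and `zero_point` entries can be read as the parameters of an affine quantization. The sketch below assumes the mapping `q = round(scale * x - zero_point)` and its inverse `x ≈ (q + zero_point) / scale`. This convention is inferred from the values printed in this notebook and is not confirmed as Concrete ML's internal formula; the helper names `quantize` and `dequantize` are hypothetical.

```python
# Quantization parameters of the "total_bill" column, copied from the schema above
SCALE = 0.903226
ZERO_POINT = 6.92129


def quantize(value, scale, zero_point):
    """Map a float to an integer (assumed convention, not the library's API)."""
    return round(value * scale - zero_point)


def dequantize(q_value, scale, zero_point):
    """Approximate inverse of quantize."""
    return (q_value + zero_point) / scale


# Smallest, intermediate and largest "total_bill" values of the first user
for value in (8.77, 12.54, 24.27):
    q_value = quantize(value, SCALE, ZERO_POINT)
    recovered = dequantize(q_value, SCALE, ZERO_POINT)
    print(f"{value} -> {q_value} -> {recovered:.4f}")
```

Under this assumption, the column's extreme values land on the integers 1 and 15, while 12.54 is stored as 4 and comes back as roughly 12.09, which is the `total_bill` value of the first row in the decrypted output shown later in this notebook.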


The encrypted data-frame can be serialized and saved using the `save` method. 

In [6]:
df_left_enc_path = client_1_temp_path / "df_left_enc"
df_left_enc.save(df_left_enc_path)

### User 2

The second user's steps are very similar to the first one's. It is important to note that neither user is expected to share their database with the other.

In [7]:
CLIENT_2_DIR = DATA_PATH / "client_2"

df_right = pandas.read_csv(CLIENT_2_DIR / "df_right.csv")

df_right

Unnamed: 0,index,day,time,size
0,2,Thur,Lunch,2
1,5,Sat,Dinner,3
2,9,Sun,Dinner,2


Currently, the users need to share the private keys in order to run an encrypted merge. We are working on new techniques that would avoid this.

In [8]:
client_2_temp_dir = TemporaryDirectory(dir=str(CLIENT_2_DIR))
client_2_temp_path = Path(client_2_temp_dir.name)

# Define the path where the serialized keys are stored
client_2_keys_path = client_2_temp_path / "keys"

# Copy the first user's keys
shutil.copy2(client_1_keys_path, client_2_keys_path)

client_2 = ClientEngine(keys_path=client_2_keys_path)

Encrypt the second user's data-frame. It is possible to get the encrypted data-frame's representation by simply returning the variable.

In [9]:
df_right_enc = client_2.encrypt_from_pandas(df_right)

df_right_enc

index,day,time,size
..4c6228db5e..,..34201c3528..,..4b06cde26f..,..a8e057e092..
..128796dd3e..,..a8585d0f21..,..b5b5bb545f..,..c82afbda96..
..7790a7620f..,..c3c49176fd..,..472743ea49..,..e9202edb1c..


Save the second user's encrypted data-frame.

In [10]:
df_right_enc_path = client_2_temp_path / "df_right_enc"
df_right_enc.save(df_right_enc_path)

## Server

The server only receives serialized encrypted data-frames. Once it has them, anyone is able to decide which operation to run on which data-frames, but only the parties that encrypted them will be able to decrypt the result.

First, the server can deserialize the data-frames using Concrete ML's `load_encrypted_dataframe` function. 

In [11]:
df_left_enc = load_encrypted_dataframe(df_left_enc_path)
df_right_enc = load_encrypted_dataframe(df_right_enc_path)

We now choose to run a left join on the encrypted data-frames' common column "index" using FHE. This step can take several seconds.

In [12]:
start = time.time()
df_joined_enc_server = df_left_enc.merge(df_right_enc, how="left", on="index")
elapsed = time.time() - start

print(f"Total execution time: {elapsed:.2f}s")

Total execution time: 8.26s
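
A server cannot branch on encrypted values, so a join has to be expressed as arithmetic applied uniformly to every pair of rows. The plaintext sketch below illustrates that idea with NumPy on the keys of this example. It is an illustration under the assumptions of unique keys and of 0 standing for "no match", not a description of the circuit Concrete ML actually runs.

```python
import numpy

# Merge keys of both sides (the "index" columns of this example)
left_keys = numpy.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
right_keys = numpy.array([2, 5, 9])

# One right-hand column to bring over ("size")
right_size = numpy.array([2, 3, 2])

# match[i, j] is 1 when left row i and right row j share the same key, else 0
match = (left_keys[:, None] == right_keys[None, :]).astype(int)

# With unique keys, each row of `match` holds at most one 1, so the product
# selects the matching value, or yields 0 when there is no match
joined_size = match @ right_size

print(joined_size.tolist())  # [0, 2, 0, 0, 3, 0, 0, 0, 2]
```

Every left row goes through the same operations whether or not it has a match, so the computation does not depend on the data. This is also one way to see why unique merge keys matter: duplicated keys would add several values together.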


The encrypted output data-frame is then serialized.

In [13]:
df_joined_enc_server_path = client_1_temp_path / "df_joined_enc"

df_joined_enc_server.save(df_joined_enc_server_path)

## Clients

Both user 1 and user 2 are able to decrypt the server's encrypted output data-frame, but it first needs to be deserialized.

In [14]:
df_joined_enc = load_encrypted_dataframe(df_joined_enc_server_path)

The user can now decrypt it and recover the joined data-frame as a Pandas `DataFrame` object. 

In [15]:
df_joined_cml = client_1.decrypt_to_pandas(df_joined_enc)

df_joined_cml

Unnamed: 0,index,total_bill,tip,sex,smoker,day,time,size
0,1,12.091429,2.509286,Male,No,,,
1,2,10.984286,1.5,Female,No,Thur,Lunch,2.0
2,3,19.841429,2.733571,Female,No,,,
3,4,14.305714,2.509286,Male,No,,,
4,5,15.412857,2.957857,Male,Yes,Sat,Dinner,3.0
5,6,18.734286,2.957857,Male,No,,,
6,7,16.52,3.07,Female,No,,,
7,8,24.27,2.060714,Male,Yes,,,
8,9,8.77,1.948571,Male,No,Sun,Dinner,2.0


### Concrete ML vs Pandas comparison

As this is only a demo in a notebook, we are able to compute Pandas' expected output (in a non-private setting) and compare it to the result above. 

In [16]:
df_joined_pandas = pandas.merge(df_left, df_right, how="left", on="index")

df_joined_pandas

Unnamed: 0,index,total_bill,tip,sex,smoker,day,time,size
0,1,12.54,2.5,Male,No,,,
1,2,11.17,1.5,Female,No,Thur,Lunch,2.0
2,3,20.29,2.75,Female,No,,,
3,4,14.07,2.5,Male,No,,,
4,5,15.69,3.0,Male,Yes,Sat,Dinner,3.0
5,6,18.29,3.0,Male,No,,,
6,7,16.93,3.07,Female,No,,,
7,8,24.27,2.03,Male,Yes,,,
8,9,8.77,2.0,Male,No,Sun,Dinner,2.0


We can observe slight differences between Pandas and Concrete ML on floating-point values. This is only due to quantization artifacts, as we currently only allow a few bits of precision. We can still check that both data-frames are equal under a float relative tolerance of 0.1.

In [17]:
df_are_equal = pandas_dataframe_are_equal(
    df_joined_pandas, df_joined_cml, float_rtol=0.1, equal_nan=True
)

print("Concrete ML data-frame is equal to Pandas data-frame:", df_are_equal, "\n")

Concrete ML data-frame is equal to Pandas data-frame: True 
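
The comparison above can be approximated with NumPy alone. The sketch below is not the implementation of `pandas_dataframe_are_equal`; it only applies `numpy.allclose` with the same relative tolerance to the `total_bill` and `size` values printed earlier, treating missing values at the same positions as equal.

```python
import numpy

# "total_bill" values: Pandas (expected) vs. decrypted Concrete ML output
expected_bill = numpy.array([12.54, 11.17, 20.29, 14.07, 15.69, 18.29, 16.93, 24.27, 8.77])
decrypted_bill = numpy.array(
    [12.091429, 10.984286, 19.841429, 14.305714, 15.412857, 18.734286, 16.52, 24.27, 8.77]
)

# "size" values contain NaN wherever the left join found no match
nan = numpy.nan
expected_size = numpy.array([nan, 2.0, nan, nan, 3.0, nan, nan, nan, 2.0])
decrypted_size = expected_size.copy()

# Passes when |decrypted - expected| <= atol + rtol * |expected| element-wise;
# equal_nan=True makes NaN values at the same position count as equal
bill_ok = numpy.allclose(decrypted_bill, expected_bill, rtol=0.1, equal_nan=True)
size_ok = numpy.allclose(decrypted_size, expected_size, rtol=0.1, equal_nan=True)

# Largest relative error actually observed on "total_bill"
max_rel_error = numpy.max(numpy.abs(decrypted_bill - expected_bill) / numpy.abs(expected_bill))

print(bill_ok, size_ok, round(float(max_rel_error), 3))  # True True 0.036
```

The largest relative error on `total_bill` is about 3.6%, so a tolerance of 0.1 leaves some margin, while a tolerance of 0.01 would reject the result.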



In [18]:
# Clean the temporary directories and their content
client_1_temp_dir.cleanup()
client_2_temp_dir.cleanup()

## Conclusion

Concrete ML provides a way for multiple parties to run Pandas operations on their data-frames without ever disclosing any sensitive data. This is done through a Pandas-like API that enables users to encrypt the data-frames and a server to run the operations in a private and secure manner using Fully Homomorphic Encryption (FHE). The users are then able to decrypt the output and obtain a result similar to what Pandas would have provided in a non-private setting.  

#### Future Work

We are currently working on improving the encrypted data-frame feature. In the near future, we plan to allow higher precision, which would let encrypted data-frames handle larger integers, floating-point values with better precision and more unique string values, as well as more rows. We will also add support for more encrypted operations on data-frames. Additionally, we are working on new techniques that would avoid users having to share private keys between themselves.