# Federated Deep Learning on Vertically Partitioned SGCP Dataset - Proof of Concept on Paillier Encryption Scheme

By Xiaochen Zhu

## Background

This notebook implements `vFedCCE`, a private deep learning method that uses categorical cross-entropy loss and gradient-based optimization to solve multi-class classification problems on vertically partitioned datasets where the labels are stored on only one of the clients.

This particular implementation uses two categories, but the same method applies to any classification problem where the categorical cross-entropy loss is the objective to minimize.

This notebook aims to demonstrate the use of the Paillier encryption scheme, a widely used additively homomorphic encryption technique. Because encryption is computationally expensive, this is a proof of concept rather than a full encrypted implementation of the original algorithm: only one batch is run, to demonstrate that the encrypted computation is feasible.
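To make "additively homomorphic" concrete, here is a minimal toy sketch of textbook Paillier in pure Python. It is not the `phe` library used later in this notebook: the function names (`keygen`, `encrypt`, `decrypt`, `add`, `mul_const`) and the tiny fixed primes are assumptions chosen for illustration only, and the key is far too small to be secure. It only handles non-negative integers smaller than `n`, whereas `phe` additionally encodes floats and negative numbers.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation from two small fixed primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt an integer 0 <= m < n as (n+1)^m * r^n mod n^2 with random r."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, c):
    """Recover m from a ciphertext using the private pair (lam, mu)."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def mul_const(n, c, k):
    """Raising a ciphertext to a plaintext power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


n, private_key = keygen()
c_a, c_b = encrypt(n, 15), encrypt(n, 27)

# Sum and scalar product are computed on ciphertexts only.
assert decrypt(n, private_key, add(n, c_a, c_b)) == 15 + 27
assert decrypt(n, private_key, mul_const(n, c_a, 3)) == 15 * 3
```

These two operations, adding ciphertexts and scaling a ciphertext by a plaintext number, are the only ones the encrypted gradient assembly in this notebook relies on.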

## Set up environment

If you encounter an error when running these `import`s, please restart the runtime and try again.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
import numpy as np
np.set_printoptions(precision=3, suppress=True)
import seaborn as sns
sns.set(style='whitegrid')
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from keras import layers
from keras import regularizers
from sklearn.model_selection import train_test_split
from keras.layers.experimental import preprocessing
import math
import uuid
import random
import zipfile

! pip -q install phe
from phe import paillier

# ! pip -q install clkhash
# from clkhash import clk, randomnames

## Data preparation

We load the data and vertically partition it between two clients: each client holds half of the features, and one of them also stores the labels.

### Load and vertically partition the data

Download the archive, unzip it, and load the complete space-separated `.asc` file into a dataframe.

In [2]:
# Download the zip file from the internet and unzip it
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00573/SouthGermanCredit.zip
with zipfile.ZipFile('SouthGermanCredit.zip', 'r') as zip_ref:
    zip_ref.extractall('./SouthGermanCredit/')

--2021-07-06 19:51:43--  https://archive.ics.uci.edu/ml/machine-learning-databases/00573/SouthGermanCredit.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13130 (13K) [application/x-httpd-php]
Saving to: ‘SouthGermanCredit.zip.2’


2021-07-06 19:51:44 (90.1 MB/s) - ‘SouthGermanCredit.zip.2’ saved [13130/13130]



In [3]:
original_df = pd.read_csv('./SouthGermanCredit/SouthGermanCredit.asc', sep=' ')
original_df.describe()

|  | laufkont | laufzeit | moral | verw | hoehe | sparkont | beszeit | rate | famges | buerge | wohnzeit | verm | alter | weitkred | wohn | bishkred | beruf | pers | telef | gastarb | kredit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 2.577 | 20.903 | 2.545 | 2.828 | 3271.248 | 2.105 | 3.384 | 2.973 | 2.682 | 1.145 | 2.845 | 2.358 | 35.542 | 2.675 | 1.928 | 1.407 | 2.904 | 1.845 | 1.404 | 1.963 | 0.7 |
| std | 1.257638 | 12.058814 | 1.08312 | 2.744439 | 2822.75176 | 1.580023 | 1.208306 | 1.118715 | 0.70808 | 0.477706 | 1.103718 | 1.050209 | 11.35267 | 0.705601 | 0.530186 | 0.577654 | 0.653614 | 0.362086 | 0.490943 | 0.188856 | 0.458487 |
| min | 1.0 | 4.0 | 0.0 | 0.0 | 250.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 1.0 | 12.0 | 2.0 | 1.0 | 1365.5 | 1.0 | 3.0 | 2.0 | 2.0 | 1.0 | 2.0 | 1.0 | 27.0 | 3.0 | 2.0 | 1.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 |
| 50% | 2.0 | 18.0 | 2.0 | 2.0 | 2319.5 | 1.0 | 3.0 | 3.0 | 3.0 | 1.0 | 3.0 | 2.0 | 33.0 | 3.0 | 2.0 | 1.0 | 3.0 | 2.0 | 1.0 | 2.0 | 1.0 |
| 75% | 4.0 | 24.0 | 4.0 | 3.0 | 3972.25 | 3.0 | 5.0 | 4.0 | 3.0 | 1.0 | 4.0 | 3.0 | 42.0 | 3.0 | 2.0 | 2.0 | 3.0 | 2.0 | 2.0 | 2.0 | 1.0 |
| max | 4.0 | 72.0 | 4.0 | 10.0 | 18424.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.0 | 4.0 | 4.0 | 75.0 | 3.0 | 3.0 | 4.0 | 4.0 | 2.0 | 2.0 | 2.0 | 1.0 |


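Give every entry a UUID so that the two clients can later match records on that identifier. The shuffling step (`.sample(frac=1)`) is commented out in the cell below, so both partitions keep the original row order.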

In [4]:
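# Assign every row a random UUID that serves as the shared record identifier.
# (Named `ids` to avoid shadowing the built-in `id`.)
ids = [str(uuid.uuid4()) for _ in range(len(original_df))]
df_with_id = original_df.copy()
df_with_id['id'] = ids
df_with_id = df_with_id.set_index('id')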
client1_data = df_with_id[['moral','verw','beszeit','famges','wohnzeit','alter','wohn','beruf','telef','gastarb']]#.sample(frac=1)
client2_data = df_with_id[['laufkont','laufzeit','hoehe','sparkont','rate','buerge','verm','weitkred','bishkred','pers','kredit']]#.sample(frac=1)
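The matching-by-identifier idea can be sketched without pandas. The following is a minimal illustration on made-up records (the field names `feature_a`, `feature_b`, and `label` are hypothetical): each client shuffles its rows independently, and the shared UUID is enough to line them up again.

```python
import random
import uuid

# Hypothetical toy records: (feature_a, feature_b, label).
records = [(1, 10, 0), (2, 20, 1), (3, 30, 1), (4, 40, 0)]
ids = [str(uuid.uuid4()) for _ in records]

# Vertical partition: client 1 gets feature_a, client 2 gets feature_b and the label.
client1 = {i: {"feature_a": r[0]} for i, r in zip(ids, records)}
client2 = {i: {"feature_b": r[1], "label": r[2]} for i, r in zip(ids, records)}

# Each client shuffles its own row order independently.
rng = random.Random(0)
order1 = list(client1)
order2 = list(client2)
rng.shuffle(order1)
rng.shuffle(order2)

# Re-align: walk through client 1's order and look up the same id on client 2.
joined = [
    (client1[i]["feature_a"], client2[i]["feature_b"], client2[i]["label"])
    for i in order1
]

# Every original record is recovered regardless of either shuffle.
assert sorted(joined) == sorted(records)
```

In the notebook itself, the same lookup is done with pandas, for example `client2_data.loc[client1_train.index]`.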

In [5]:
# Client 1 holds 10 features and no labels.
# Client 2 holds 10 features plus the label column `kredit` (11 columns in total).
# Row order is unchanged because the `.sample(frac=1)` shuffles are commented out.
print(client1_data)
print(client2_data)

                                      moral  verw  ...  telef  gastarb
id                                                 ...                
677a5bbf-0ef5-48f7-b960-9fb9ea564cf9      4     2  ...      1        2
ee6deefa-8b94-434b-9918-d949b763d602      4     0  ...      1        2
c177b067-accc-4b42-9672-25c4c7f86cba      2     9  ...      1        2
f6b8cef6-16d5-43d2-8588-0222959754d3      4     0  ...      1        1
787aaa76-64ae-4d25-a4f1-44d7a169ce40      4     0  ...      1        1
...                                     ...   ...  ...    ...      ...
0b12cfea-acef-4e6d-8dde-f3c338414152      2     3  ...      1        2
9be219fa-2909-4efe-8b7d-76de34e92b11      2     0  ...      1        2
ca6658a8-4de4-4489-a475-03abc7b2f46c      4     0  ...      2        2
eac08885-26ee-4442-8616-4fbbaffcb8e2      2     3  ...      2        2
5c8c5733-33cb-414b-96be-c845da911f06      2     2  ...      1        2

[1000 rows x 10 columns]
                                      laufkont  lau

### Train/test split (overlapping)

In [6]:
client1_train, client1_test = train_test_split(client1_data, test_size=0.2, random_state=69)
client2_train = client2_data.loc[client1_train.index]
client2_test = client2_data.loc[client1_test.index]

### Train/test datasets info 

In [7]:
common_train_index = client1_train.index.intersection(client2_train.index)
common_test_index = client1_test.index.intersection(client2_test.index)

print(
    'There are {} common entries (out of {}) in client 1 and client 2\'s training datasets,\nand {} common entries (out of {}) in their test datasets'
    .format(
        len(common_train_index),
        len(client1_train),
        len(common_test_index),
        len(client1_test)))

There are 800 common entries (out of 800) in client 1 and client 2's training datasets,
and 200 common entries (out of 200) in their test datasets


## `vFedCCE`

### Parameters

In [8]:
batch_size = 32
learning_rate = 1e-3
epochs = 1

# Instantiate an optimizer.
optimizer=keras.optimizers.Adam(learning_rate=learning_rate)
# Instantiate a loss function.
# Not from logits because of the softmax layer converting logits to probability.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
# Instantiate a metric function (accuracy)
train_acc_metric = tf.keras.metrics.SparseCategoricalAccuracy()

### The `Client` class

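Client 1 and client 2 each hold half of the features; client 2 also stores the labels. For each batch, client 1 sends client 2 its partial prediction together with the per-example gradients of that prediction, encrypted under client 1's Paillier public key. Client 2 combines both predictions to compute the loss, updates its own model, and then weights and sums the encrypted gradients to assemble the gradient for client 1's model, which client 1 decrypts and applies. In this proof of concept the plain gradients are passed along as well, purely so the two update paths can be compared.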
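The coefficient that client 2 computes in `coefficient_and_update` can be read as the derivative of the per-example cross-entropy with respect to the combined probability. The derivation below is a sketch reconstructed from the code, not taken from a separate specification; the symbols $a_i$, $b_i$, $p_i$, $c_i$, $\theta_1$ and $B$ are notation introduced here for client 1's class-1 output, client 2's class-1 output, their average, the coefficient, client 1's weights, and the batch size.

$$
p_i = \frac{a_i + b_i}{2}, \qquad
\ell_i = -\bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
$$

$$
c_i = \frac{\partial \ell_i}{\partial p_i}
    = -\frac{y_i}{p_i} + \frac{1 - y_i}{1 - p_i}
    = \frac{p_i - y_i}{p_i (1 - p_i)}
$$

$$
\frac{\partial}{\partial \theta_1} \frac{1}{B} \sum_{i=1}^{B} \ell_i
  = \frac{1}{B} \sum_{i=1}^{B} c_i \cdot \frac{1}{2} \cdot \frac{\partial a_i}{\partial \theta_1}
$$

Client 1 can compute each $\partial a_i / \partial \theta_1$ without the labels, and client 2 can compute each $c_i$ without client 1's features, which is why only a weighted sum has to be evaluated on ciphertexts. As far as can be told from the code, the constant factor $\tfrac{1}{2}$ is omitted on both clients, which only rescales the gradient.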

Note that this iteration of the implementation does not include entity resolution. Examples with the same index on the two clients are therefore guaranteed to refer to the same end user.

In [9]:
class Client:

  def __init__(self, train_data, test_data, labelled):
    self.__trainX = train_data.copy()
    self.__testX = test_data.copy()
    self.labelled = labelled

    if (labelled):
      self.__trainY = self.__trainX.pop('kredit')
      self.__testY = self.__testX.pop('kredit')
    else:
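      # Client 1 owns the Paillier key pair. A 256-bit modulus keeps this demo fast
      # but is far too short to be secure in practice.
      self.public_key, self.private_key = paillier.generate_paillier_keypair(n_length=256)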

    normalizer = preprocessing.Normalization()
    normalizer.adapt(np.array(self.__trainX.loc[common_train_index]))

    self.model = tf.keras.Sequential([
      normalizer,
      layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)),
      layers.Dropout(0.5),
      layers.Dense(128, activation='elu', kernel_regularizer=regularizers.l2(0.01)),
      layers.Dropout(0.5),
      layers.Dense(2),
      layers.Softmax()])
    
    self.shapes = [[10,128], [128], [128,128], [128], [128,2], [2]]
    
  def next_batch(self, index):
    self.batchX = self.__trainX.loc[index]
    if not self.labelled:
      grads = []
      self.model_output = np.zeros((len(index), 2))
      for i in range(len(index)):
        with tf.GradientTape() as gt:
          gt.watch(self.model.trainable_weights)
          output_by_example = self.model(self.batchX.iloc[i:i+1], training=True)
          output_for_grad = output_by_example[:,1]
        self.model_output[i] = output_by_example
        grads.append(gt.gradient(output_for_grad, self.model.trainable_weights))
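      # Flatten every gradient tensor and encrypt each scalar under client 1's public key.
      # Result layout: examples -> tensors -> flat list of encrypted scalars.
      cipher_grads = [[[self.public_key.encrypt(v) for v in tf.reshape(x, [-1]).numpy().tolist()] for x in g] for g in grads]
      return grads, cipher_grads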
    else:
      self.batchY = self.__trainY.loc[index]
      with tf.GradientTape() as self.gt:
        self.gt.watch(self.model.trainable_weights)
        self.model_output = self.model(self.batchX, training=True)

  def cal_model(self):
    return self.model_output
  
  def predict(self, test_index):
    return self.model.predict(self.__testX.loc[test_index])# + 1e-8

  def test_answers(self, test_index):
    if self.labelled:
      return self.__testY.loc[test_index]
  
  def batch_answers(self):
    if self.labelled:
      return self.batchY

  def loss_and_update(self, a):
    if not self.labelled:
      raise AssertionError("This method can only be called by client 2")
    self.prob = (a + self.model_output)/2
    self.c = self.coefficient_and_update()/len(self.batchX)
    return self.prob, loss_fn(self.batchY, self.prob)
  
  def coefficient_and_update(self):
    if not self.labelled:
      raise AssertionError("This method can only be called by client 2")
    p = self.prob[:,1]
    c = (p-self.batchY)/((p)*(1-p))
    with self.gt:
      output = sum(c * self.model_output[:,1])/len(c)
    grads = self.gt.gradient(output, self.model.trainable_weights)
    optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
    return c
  
  def update_with_plain(self, grads):
    weights = self.model.trainable_weights
    optimizer.apply_gradients(zip(grads, weights))
    return weights

  def update_with_cipher(self, cipher_grads):
    weights = self.model.trainable_weights
    # cipher_grads is a list of six lists
    for i in range(len(cipher_grads)):
      cipher_grads[i] = [self.private_key.decrypt(x) for x in cipher_grads[i]]
      cipher_grads[i] = tf.reshape(tf.convert_to_tensor(cipher_grads[i]), self.shapes[i])
    optimizer.apply_gradients(zip(cipher_grads, weights))
    return weights

  def assemble_grad_plain(self, partial_grads):
    if not self.labelled:
      raise AssertionError("This method can only be called by client 2")
    # to assemble the gradient for client 1
    for i in range(len(self.c)):
      partial_grads[i] = [x * self.c[i] for x in partial_grads[i]]
    return [sum(x) for x in zip(*partial_grads)]
  
  def assemble_grad_cipher(self, cipher_partial_grads):
    if not self.labelled:
      raise AssertionError("This method can only be called by client 2")
    # assemble the cipher gradient for client 1 from cipher partial gradients
    for i in range(len(self.c)):
      # cipher_partial_grads[i] is a list of 6 lists, c[i] is a float
      # x in cipher_partial_grads[i] is a 1d list flatten from gradients
      cipher_partial_grads[i] = [[t * float(self.c[i]) for t in x] for x in cipher_partial_grads[i]]
    return [[sum(i) for i in zip(*x)] for x in zip(*cipher_partial_grads)]

In [10]:
client1 = Client(client1_train, client1_test, False)
client2 = Client(client2_train, client2_test, True)
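The nested `zip(*...)` in `assemble_grad_cipher` is easier to follow on a tiny plaintext example. The sketch below mirrors the data layout only (examples, then tensors, then a flat list of scalars); the helper names `assemble` and `reshape`, the shapes, and all numbers are made up for illustration. Unlike the method in the class, it does not modify its input, and it uses plain floats where the notebook uses encrypted numbers.

```python
# Hypothetical layout: 3 examples, each with 2 flattened "tensors"
# whose original shapes are [2, 2] and [2].
shapes = [[2, 2], [2]]
partial_grads = [
    [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5]],  # example 0
    [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0]],  # example 1
    [[2.0, 2.0, 2.0, 2.0], [0.0, 1.0]],  # example 2
]
coeffs = [0.5, -1.0, 0.25]  # one coefficient per example


def assemble(partial_grads, coeffs):
    """Scale each example's gradients by its coefficient, then sum over examples."""
    weighted = [
        [[t * c for t in flat] for flat in example]
        for example, c in zip(partial_grads, coeffs)
    ]
    # Outer zip groups the same tensor across examples;
    # inner zip groups the same scalar position across examples.
    return [[sum(vals) for vals in zip(*per_tensor)] for per_tensor in zip(*weighted)]


def reshape(flat, shape):
    """Restore a 1-D or 2-D shape from a flat list."""
    if len(shape) == 1:
        return list(flat)
    rows, cols = shape
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


total = assemble(partial_grads, coeffs)
restored = [reshape(flat, shape) for flat, shape in zip(total, shapes)]

assert total == [[1.0, 0.5, 2.0, 1.5], [-0.75, 0.5]]
assert restored[0] == [[1.0, 0.5], [2.0, 1.5]]
```

With encrypted inputs the same two operations appear: a ciphertext multiplied by a plaintext coefficient, and a sum of ciphertexts. The receiving side then decrypts and reshapes, as `update_with_cipher` does with `self.shapes`.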

### Trial run on single batch

In [11]:
# train_index_batches = [common_train_index[i:i + batch_size] for i in range(0, len(common_train_index), batch_size)] 
common_train_index_list = common_train_index.to_list()

In [12]:
for epoch in range(epochs):
  random.shuffle(common_train_index_list)
  train_index_batches = [common_train_index_list[i:i + batch_size] for i in range(0, len(common_train_index_list), batch_size)]
  total_loss = 0.0
  # Only iterate over the first batch to prove the concept.
  train_index_batches = train_index_batches[:1]
  for step, batch_index in enumerate(train_index_batches):
    
    plain_partial_grads, cipher_partial_grads = client1.next_batch(batch_index)
    client2.next_batch(batch_index)

    prob, loss_value = client2.loss_and_update(client1.cal_model())
    cipher_grad = client2.assemble_grad_cipher(cipher_partial_grads)
    plain_grad = client2.assemble_grad_plain(plain_partial_grads)
    weights_from_cipher = client1.update_with_cipher(cipher_grad)
    weights_from_plain = client1.update_with_plain(plain_grad)
    
    # No need to record train loss or accuracy, or further predictions

### Compare the two gradients

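Finally, compare the weights returned by the two update paths (updated with the encrypted gradient versus updated with the plain gradient). Note that `update_with_cipher` and `update_with_plain` both return `self.model.trainable_weights` of the same model, that is, the same variable objects, so the element-wise difference below is zero by construction. A stricter check would compare the decrypted gradient against the plain gradient before either one is applied.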

In [13]:
diff = [weights_from_cipher[i] - weights_from_plain[i] for i in range(len(weights_from_cipher))]
print(diff)

[<tf.Tensor: shape=(10, 128), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: shape=(128,), dtype=float32, numpy=
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>, <tf.Tensor: shape=(128, 128), dtype=float32, numpy=
arra