## Privacy-preserving Linear Regression (PPLR): secure multiparty computation techniques

* This notebook shows how to do privacy-preserving linear regression when the data is vertically partitioned.
* The code assumes two data parties that hold the same data instances but different features (attributes/variables).
* It can be easily extended to more than two data parties.
* We assume both data parties are semi-honest: they follow the protocol but are still curious about each other's data.
* The data parties learn only the outcome of the linear regression, y = b1 * x + b0 (b1: coefficients, b0: intercept).
* We use Ordinary Least Squares (OLS), which minimizes the sum of squared differences between the observed values of the dependent variable (the variable being predicted) and the values predicted by the linear function.
* For more mathematical background on OLS, see: https://en.wikipedia.org/wiki/Ordinary_least_squares
* This code applies the secure scalar product protocol from: https://dl.acm.org/citation.cfm?id=775142
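
As a minimal sketch of the OLS step mentioned above, the cell below solves the normal equations on made-up toy data and compares the result with NumPy's least-squares solver. The data, the seed, and all variable names are assumptions for illustration and are not part of the protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 50 instances, 3 features, known coefficients plus small noise
X = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 4.0 + 0.01 * rng.normal(size=50)

# Prepend a column of ones so that the first coefficient is the intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equations: (X^T X) beta = X^T y
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Reference solution from NumPy's least-squares solver
beta_ref, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("intercept b0:", beta[0])
print("coefficients b1:", beta[1:])
print("matches lstsq:", np.allclose(beta, beta_ref))
```

Only the products X^T X and X^T y enter the solution, which is why the rest of the notebook concentrates on computing those products without sharing the raw columns.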

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model, datasets
from numpy.linalg import inv
from IPython.display import Image
from IPython.core.display import HTML 

In [2]:
# With OLS, the problem we are solving is
#   b = (X.T * X)^-1 * X.T * Y      (b stacks the intercept b0 and the coefficients b1)
# The challenging part is computing the cross term X_a.T * X_b without disclosing the original data.
# A secure scalar product, one of the secure multiparty computation techniques, solves this problem.
# Please see the picture below.
Image(url= "matrix.JPG", width=500, height=500)
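
The block structure behind this idea can be sketched on toy data: with X = [X_a | X_b], the diagonal blocks of X.T * X are local to each site, and only the off-diagonal block mixes data from both. The toy shapes and the `_toy` names below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy split: site A holds 2 columns, site B holds 3 columns
X_a_toy = rng.normal(size=(8, 2))
X_b_toy = rng.normal(size=(8, 3))
X_full = np.hstack([X_a_toy, X_b_toy])

# Diagonal blocks: each site computes its own block locally
AtA = X_a_toy.T @ X_a_toy
BtB = X_b_toy.T @ X_b_toy

# Off-diagonal block: computed in the clear here, but this is the part
# that needs the secure scalar product in the real protocol
AtB = X_a_toy.T @ X_b_toy

# Reassemble X^T X from its four blocks
XtX_blocks = np.block([[AtA, AtB],
                       [AtB.T, BtB]])

print(np.allclose(XtX_blocks, X_full.T @ X_full))
```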

#### Let's use a dataset from scikit-learn. It is the easiest one to obtain if you want to run this code yourself.

In [3]:
# Load the diabetes dataset from scikit-learn
diabetes = datasets.load_diabetes()
df = pd.DataFrame.from_records(diabetes.data)
# Note: diabetes.data holds ten feature columns. This notebook uses the last one
# as the regression target; diabetes.target itself is not used.
col = ['feature0', 'feature1', 'feature2', 'feature3', 'feature4', 'feature5',
       'feature6', 'feature7', 'feature8', 'target']
df.columns = col

# Make a centralized dataset, which is used later to check the PPLR result
X_centralized = df.drop(['target'], axis=1)
Y = df['target']
B_divide_set = 17 # must be a divisor of the number of instances (442 = 2 * 13 * 17)
print('The number of rows:', len(df))
print('The number of features:', len(col))

('The number of rows:', 442)
('The number of features:', 10)


### We simulate the vertically partitioned data scenario by splitting the dataset into two sets, one for each data party.
* Data Site A: feature 0 to feature 4
* Data Site B: feature 5 to feature 8, plus the target feature (feature 9)
* Note that the target feature is only available at Data Site B

### Data site A ###

In [4]:
X_a = df[col[0:5]].copy() # feature0 - feature4 (copy, so the insert below does not act on a slice of df)

# Add a column of ones, which is used to compute the intercept b0
b0 = np.ones(len(X_a))
X_a.insert(loc=0, column='b0', value=b0)

# Calculate X_a.T * X_a locally at data site A 
XaTXa = np.matrix(X_a).T * np.matrix(X_a)
len_A = len(X_a.columns)

In [5]:
# Generate random numbers and add to data at Data Site A
A_randoms = []
for i in range(0, len_A):
#     np.random.seed(1)
    A_randoms.append(np.random.randint(0,5, len(X_a.iloc[:,i])))
    
C_matrix = [] # C_matrix is shared between A and B
for i in range(0, len_A):
#     np.random.seed(2)
    C_matrix.append(np.random.randint(0,5, (len(X_a.iloc[:,i]), len(X_a.iloc[:,i]))))

In [6]:
Sum_noises_A = [] # which will be sent to B
for i in range(0, len_A):
    Sum_noises_A.append(np.add(X_a.iloc[:,i], np.dot(C_matrix[i], A_randoms[i])))
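
The masking step in the cell above can be sketched for a single toy column. The names `x`, `r`, `C` and `x_masked` are stand-ins introduced here for `X_a.iloc[:, i]`, `A_randoms[i]`, `C_matrix[i]` and `Sum_noises_A[i]`; the small integer range mirrors the notebook and is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6                                  # toy number of instances

x = rng.normal(size=n)                 # one private column at site A (toy data)
r = rng.integers(0, 5, size=n)         # A's secret random vector
C = rng.integers(0, 5, size=(n, n))    # random matrix known to both sites

# What A sends to B: the column plus the mask C r
x_masked = x + C @ r

# Removing the mask requires r, which never leaves site A
x_recovered = x_masked - C @ r

print("original :", np.round(x, 3))
print("masked   :", np.round(x_masked, 3))
print("recovered equals original:", np.allclose(x_recovered, x))
```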

In [24]:
np.save('C_matrix.npy', C_matrix)
C_matrix = np.load('C_matrix.npy')

np.save('Sum_noises_A.npy', Sum_noises_A)
Sum_noises_A = np.load('Sum_noises_A.npy')

Data Site A sends __C_matrix__ and __Sum_noises_A__ to Data Site B. (__A_randoms_Sumset__ is computed and sent in a later step.)

### Data site B ###

In [28]:
X_b = df[col[5:10]] # feature5 - feature8 and target feature
XbTXb = np.matrix(X_b).T * np.matrix(X_b)
len_B = len(X_b.columns)

In [29]:
Sum_coef_B = []
for i in range(0, len_B):
    Sum_noises_temp = []
    for j in range(0, len_A):
        Sum_noises_temp.append(np.dot(C_matrix[j].transpose(), X_b.iloc[:,i])) 
    Sum_coef_B.append(Sum_noises_temp)

In [30]:
B_random_set = []
for i in range(0, len_A):
#     np.random.seed(3)
    B_random_set.append(np.random.randint(0,5, int(len(X_b.iloc[:,0])/B_divide_set))) 

Sum_noises_B = [] # which will be sent to A
for n in range(0, len_B):
    B_noise = []
    for i in range(0, len_A):
        B_random_inter = []
        for j in range(0, len(B_random_set[i])): 
            for k in range(0, B_divide_set):
                B_random_inter.append(B_random_set[i][j])
        B_noise.append(Sum_coef_B[n][i] + B_random_inter)
    Sum_noises_B.append(B_noise)
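
The inner two loops above repeat each random value `B_divide_set` times. As a sketch under toy assumptions (the names `block`, `n` and `R` are hypothetical stand-ins for `B_divide_set`, the number of instances and `B_random_set[i]`), `np.repeat` produces the same expanded noise vector:

```python
import numpy as np

rng = np.random.default_rng(3)
block = 3                                  # plays the role of B_divide_set
n = 12                                     # number of instances, a multiple of block

R = rng.integers(0, 5, size=n // block)    # one random value per block

# Loop form, as in the cell above
expanded_loop = []
for j in range(len(R)):
    for k in range(block):
        expanded_loop.append(R[j])

# Vectorized form: each value repeated `block` times in a row
expanded = np.repeat(R, block)

print(expanded)
print(np.array_equal(expanded, np.array(expanded_loop)))
```

Because B only uses `n // block` distinct random values, A can later report block sums of its own random vector instead of the vector itself.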

In [31]:
# Dot product of each masked column received from A with each column of B
Sum_noises_AB = []
for i in range(0, len_B):
    Sum_noises_temp = []
    for j in range(0, len_A):
        Sum_noises_temp.append(np.dot(Sum_noises_A[j], X_b.iloc[:,i])) # X_b[:,i]
    Sum_noises_AB.append(Sum_noises_temp)
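
To see what the cell above computes, write $x$ for one column held by A, $y$ for one column held by B, $C$ for the shared random matrix and $r$ for A's secret random vector. These symbols are introduced here for illustration and correspond to `X_a.iloc[:, j]`, `X_b.iloc[:, i]`, `C_matrix[j]` and `A_randoms[j]`. Each entry of `Sum_noises_AB` then expands as:

$$
(x + C r)^\top y \;=\; x^\top y + r^\top \left(C^\top y\right)
$$

The first term is the scalar product we want. The second term is the mask: B sends $C^\top y$ plus its own noise to A (`Sum_noises_B`), and A returns its dot product with $r$ (`Sum_noises_B_Arand`) so that B can subtract it.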

In [32]:
np.save('Sum_noises_B.npy', Sum_noises_B)
Sum_noises_B = np.load('Sum_noises_B.npy')

B sends __Sum_noises_B__ to A

### Back to Data Site A ###

In [33]:
A_randoms_Sumset = []
for i in range(0, len_A):
    sum_temp = []
    for j in range(0, int(len(X_a)/B_divide_set)):
        temp = 0
        for k in range(0, B_divide_set):
            temp = temp + A_randoms[i][B_divide_set*j + k]
        sum_temp.append(temp)
        
    A_randoms_Sumset.append(sum_temp)
 
    
Sum_noises_B_Arand = []
for n in range(0, len_B):
    temp = []
    for i in range(0, len_A):
        temp.append(np.dot(A_randoms[i],Sum_noises_B[n][i]))
    Sum_noises_B_Arand.append(temp)
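
The block sums in `A_randoms_Sumset` can also be written as a reshape followed by a row sum. The sketch below uses toy values (`block`, `n`, `r` and `R` are hypothetical stand-ins for `B_divide_set`, the number of instances, `A_randoms[i]` and `B_random_set[i]`) and checks why the block sums are enough for B to remove its own noise.

```python
import numpy as np

rng = np.random.default_rng(4)
block = 3
n = 12

r = rng.integers(0, 5, size=n)             # A's secret random vector
R = rng.integers(0, 5, size=n // block)    # B's per-block random values

# Loop form, as in the cell above
block_sums_loop = []
for j in range(n // block):
    total = 0
    for k in range(block):
        total += r[block * j + k]
    block_sums_loop.append(total)

# Vectorized form: one row per block, then sum each row
block_sums = r.reshape(-1, block).sum(axis=1)

# r . repeat(R) equals block_sums . R, so B can compute the noise term
# from the block sums without seeing r itself
lhs = r @ np.repeat(R, block)
rhs = block_sums @ R

print(block_sums, lhs, rhs)
```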

In [34]:
np.save('A_randoms_Sumset.npy', A_randoms_Sumset)
A_randoms_Sumset = np.load('A_randoms_Sumset.npy')

np.save('Sum_noises_B_Arand.npy', Sum_noises_B_Arand)
Sum_noises_B_Arand = np.load('Sum_noises_B_Arand.npy')

A sends __A_randoms_Sumset__ and __Sum_noises_B_Arand__ back to B

### Back to Data Site B --> As B has the target feature, B calculates the final results ###

In [24]:
rand_sums = []
for i in range(0, len_A):
    r_sum = 0
    for j in range(0, len(B_random_set[0])):
        r_sum = r_sum + A_randoms_Sumset[i][j] * B_random_set[i][j]
    rand_sums.append(r_sum)

In [25]:
outcomes = []
for n in range(0, len_B):
    out = []
    for i in range(0, len_A):
        out.append(Sum_noises_AB[n][i] - Sum_noises_B_Arand[n][i] + rand_sums[i]) 
    outcomes.append(out)
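
The whole exchange can be sketched end to end for a single pair of toy columns. This is a simplified illustration under assumed toy sizes, not the notebook's exact data flow; the names `x`, `y`, `r`, `C`, `R` and `s` are introduced here for readability.

```python
import numpy as np

rng = np.random.default_rng(5)
n, block = 12, 3

x = rng.normal(size=n)    # private column at site A
y = rng.normal(size=n)    # private column at site B

# --- Site A: mask x and send x_masked (C is shared with B)
r = rng.integers(0, 5, size=n)
C = rng.integers(0, 5, size=(n, n))
x_masked = x + C @ r

# --- Site B: send s = C^T y + block-repeated noise; keep masked_dot locally
R = rng.integers(0, 5, size=n // block)
s = C.T @ y + np.repeat(R, block)
masked_dot = x_masked @ y                 # = x.y + r.(C^T y)

# --- Site A: send r.s and the block sums of r
r_dot_s = r @ s                           # = r.(C^T y) + r.repeat(R)
r_block_sums = r.reshape(-1, block).sum(axis=1)

# --- Site B: cancel the mask and add back its own noise term
recovered = masked_dot - r_dot_s + r_block_sums @ R

print("secure result:", recovered)
print("plain result :", x @ y)
```

The final line mirrors `Sum_noises_AB[n][i] - Sum_noises_B_Arand[n][i] + rand_sums[i]` in the cell above.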

### Combine the matrices to compute the linear regression (b1 and b0)

In [26]:
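outcomes_m = np.matrix(outcomes)    # row n, column i holds X_b[:, n] . X_a[:, i]

XbTXa = outcomes_m[:-1]             # B's features (without the target) against A's columns
XaTXb = XbTXa.T

XaTY = outcomes_m[-1]               # the target against A's columns

XbTXb_exclY = XbTXb[:-1].T[:-1]     # B's local block without the target row and column
XbTY = np.delete(XbTXb[-1], -1)     # the target against B's features

# Assemble the block matrix [[Xa'Xa, Xa'Xb], [Xb'Xa, Xb'Xb]] and the right-hand side
pp_XTX = np.concatenate((np.concatenate((XaTXa, XaTXb), axis=1),
                         np.concatenate((XbTXa, XbTXb_exclY), axis=1)), axis=0)
pp_XTY = np.concatenate((XaTY, XbTY), axis=1).T

pp_out = np.linalg.inv(pp_XTX) * pp_XTY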

In [27]:
b1 = pp_out[1:]
b0 = pp_out.item(0)
print('Coefficients: \n' ,b1)
print('Intercept: ', b0)

Coefficients: 
 [[ 0.11921674]
 [ 0.06715709]
 [ 0.15204223]
 [ 0.15622151]
 [-0.07554103]
 [ 0.08227142]
 [ 0.08514922]
 [ 0.1732423 ]
 [ 0.23266615]]
Intercept:  -1.5298482250575266e-16


### Checking against the centralized data/method ###
scikit-learn's linear regression is fitted on the centralized data to compare the results.

In [28]:
# normalize=True was removed in scikit-learn 1.2; it does not change the fitted coefficients of plain OLS
regr = linear_model.LinearRegression(fit_intercept=True)
regr.fit(X_centralized, Y)

# The coefficients
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)

Coefficients: 
 [ 0.11921674  0.06715709  0.15204223  0.15622151 -0.07554103  0.08227142
  0.08514922  0.1732423   0.23266615]
Intercept: 
 -1.59626897613e-16
