### Pre-processing your data to preserve privacy ###

In privacy-preserving machine learning and data mining on distributed data, existing solutions are often neither feasible nor scalable enough for real-world datasets.
The most popular method, secure multiparty computation, is very costly in time, communication, and computation. To reduce this cost, I suggest classifying features by privacy level before running any analysis. By privacy level, features are categorized as:
1. Identifiable features
2. (Quasi-)identifiable features (instances can be re-identified by combining with other features)
3. Sensitive features (domain-dependent, e.g. health data)
4. Others 
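
To make the second category concrete, the sketch below counts how many records share each combination of values. The records and their fields (an age band, a three-digit postcode prefix, and sex) are invented for illustration and are not taken from the notebook's dataset; the helper name `smallest_group` is likewise an assumption.

```python
from collections import Counter

# Hypothetical toy records: (age_band, zip3, sex)
records = [
    ("30-40", "123", "F"),
    ("30-40", "123", "M"),
    ("30-40", "456", "F"),
    ("40-50", "123", "F"),
    ("40-50", "456", "M"),
    ("40-50", "456", "M"),
]

def smallest_group(rows, positions):
    """Size of the smallest group of rows sharing the values at `positions`."""
    groups = Counter(tuple(row[p] for p in positions) for row in rows)
    return min(groups.values())

# Each feature on its own hides every record among other records...
single = [smallest_group(records, [p]) for p in range(3)]
# ...but the three features together single out at least one record.
combined = smallest_group(records, [0, 1, 2])
print(single, combined)
```

A smallest group of size 1 means at least one record is unique on that combination, which is what makes a set of otherwise harmless features quasi-identifiable.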

Pre-selecting features also reduces the risk of breaching personal privacy. For instance, removing outliers protects 'outstanding' instances in the dataset, which makes them harder to recognize. For sensitive features that need anonymization, this code generalizes the values with K-means, and users choose how general the values become. This Jupyter notebook automates the pre-processing procedure and uses ipywidgets for its interface, so users can pre-process the data, detect and remove outliers, and anonymize features without editing code.
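
A minimal sketch of the K-means generalization idea, assuming a single numeric column: every value is replaced by the centre of the cluster it falls into, so fewer clusters give coarser values. The helper name `generalize` and the sample ages are made up for illustration, and this is a simplification rather than the notebook's own `kmeans_func`.

```python
import numpy as np
from sklearn.cluster import KMeans

def generalize(values, n_clusters, seed=0):
    """Replace each value by the centre of its K-means cluster."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # Look up the centre of the cluster assigned to each row
    return km.cluster_centers_[km.labels_].ravel()

ages = [21, 22, 23, 40, 41, 42, 70, 71, 72]   # made-up example values
coarse = generalize(ages, n_clusters=3)
coarsest = generalize(ages, n_clusters=1)
print(coarse)
print(coarsest)
```

The number of clusters is the knob for "how general" the output is: with one cluster every record receives the same value, and with as many clusters as distinct values nothing is generalized.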

This code presents the following steps:
* Step 0: Input your data. Missing values are checked automatically. If missing values in the dataset are represented by particular characters (e.g. '?', 'missing', 'NaN'), you can indicate this when you load the data.
* Step 1: Handle missing values. You can remove a column that contains too many missing values, or fill the missing values with a chosen number or character.
* Step 2: Detect outliers to keep 'outstanding' instances from being re-identified. The Local Outlier Factor (LOF) method is used to detect outliers, and you can choose whether to remove them based on the detection results. You can set the expected percentage of outliers and the number of neighbors to suit your dataset. The advantage of LOF is that it detects local outliers as well as global ones.
* Step 3: Detect (quasi-)identifiable features. This code provides two methods: 1) check the diversity of each feature, i.e. how many distinct values it contains; 2) use a decision tree or random forest to rank the features by how well they identify individual records (based on feature importance).
* Step 4: Anonymize identifiable/sensitive features. Generalization, one way to anonymize values, is implemented with K-means. You select which features to anonymize according to your data, your domain, and the questions you want the data to answer, and you set how 'general' the values become. Before anonymization, you can optionally normalize your data with a standard, min-max, or robust scaler. Finally, the pre-processed data is saved to your local computer.
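
Step 2 can be sketched on synthetic data as follows. The points are randomly generated for illustration, and the parameter values (`n_neighbors=5`, `contamination=0.05`) are assumptions chosen for this tiny example rather than recommendations.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
# 20 made-up points clustered around the origin plus one far-away record
X = np.vstack([rng.normal(0, 1, size=(20, 2)), [[15.0, 15.0]]])

lof = LocalOutlierFactor(n_neighbors=5, contamination=0.05)
labels = lof.fit_predict(X)            # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_  # more negative = more abnormal

print(np.where(labels == -1)[0])       # row positions flagged as outliers
```

The `contamination` value corresponds to the "percentage of outliers" setting and `n_neighbors` to the number of neighbors; both change which rows are flagged, so it is worth inspecting the scores before removing anything.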

In [2]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import linear_model
from numpy.linalg import inv
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import LocalOutlierFactor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.metrics.pairwise import euclidean_distances

import ipywidgets as widgets
from traitlets import traitlets
from IPython.display import display
from ipywidgets import Layout, Button, Box, interact, interactive, fixed, interact_manual, Checkbox, FloatSlider

In [3]:
### Extend the ipywidgets Button so it can carry the result of the previous step
class LoadedButton(widgets.Button):
    """A button that can hold a value as an attribute."""

    def __init__(self, value=None, *args, **kwargs):
        super(LoadedButton, self).__init__(*args, **kwargs)
        # Create the value attribute.
        self.add_traits(value=traitlets.Any(value))

## Input data ##

In [4]:
### Set ipywidget layout and style ### 
style = {'description_width': 'initial'}
uniLayout = Layout(width='50%', height='30px')

##### ##### #####
init_file = 'diabetic_data.csv'
pathiwg = widgets.Text(value=init_file, description='Data File Path: ', style=style,layout=uniLayout)

##### ##### #####
mis_charac = widgets.Text(value='?', description='Any special characters to represent missing values: ', style=style,layout=uniLayout)

##### ##### #####
check_mis = widgets.Checkbox(value=False, description='Check the missing values', disabled=False,\
                             style=style, layout=uniLayout)

##### ##### #####
button1 = LoadedButton(description="Input data", button_style='success', value=None)
#widgets.Button(description="Input data", button_style='success')


In [5]:
### check if there is any missing value in the dataset ###
def check_missing(df, col):
    missing  = 0
    CheckNull = df.isnull().sum()
    for var in range(0, len(CheckNull)):
        # Use .iloc for positional access; CheckNull is indexed by column name
        if CheckNull.iloc[var] != 0:
            print(col[var], '\t', CheckNull.iloc[var])
            missing = missing + 1

    if missing == 0:
        print('Dataset is complete with no blanks.')
    else:
        print('Totally, %d features have missing values (blanks).' %missing)

In [6]:
### Input data function ###
def input_data(event):
    
    file_path = pathiwg.value
    # pd.DataFrame.from_csv was removed from pandas; read_csv is the replacement
    df_ori = pd.read_csv(file_path, index_col=None)
    col = df_ori.columns
    
    if event != 'silent':
        print('***** Step 0 *****')
        print('Input data successfully!')

        ##### Replace the user-specified missing-value marker with NaN #####
        mis_value_code = mis_charac.value
        if len(mis_value_code) > 0 :
            df_ori = df_ori.replace({mis_value_code : np.nan})

        ##### Check missing values #####
        if check_mis.value == True: 
            check_missing(df_ori, col)
    print('****** Done ****** \n')
    event.value = df_ori

In [7]:
group0 = [pathiwg, mis_charac, check_mis, button1]

box_layout = Layout(display='flex',
                    flex_flow='column',
                    align_items='stretch',
                    align_content='flex-start',
                    width='80%')
box0 = widgets.Box(children=group0, layout=box_layout)
accordion = widgets.Accordion(children=[box0], style=style)
accordion.set_title(0, 'Step 0 Input data')
button1.on_click(input_data)
accordion

***** Step 0 *****
Input data successfully!
race 	 2273
weight 	 98569
payer_code 	 40256
medical_specialty 	 49949
diag_1 	 21
diag_2 	 358
diag_3 	 1423
Totally, 7 features have missing values (blanks).
****** Done ****** 



## Detect local outliers that can easily be re-identified ##

In [11]:
### Set ipywidget layout and style ### 

##### ##### #####
df_ori = button1.value
col = df_ori.columns
col_forRemove = col.tolist()
col_forRemove.insert(0, 'No columns to remove')
remove_col = widgets.SelectMultiple(
                                options=col_forRemove,
                                value=[col_forRemove[0]],
                                description='Remove certain columns: ',
                                disabled=False,
                                style=style,
                                layout=Layout(width='50%', height='150px')
                                )

##### ##### ####
fill_missing = widgets.Text(value='?', description='Fill missing value with certain character/num: ', \
                            style=style,layout=uniLayout)

##### ##### #####
button2 = LoadedButton(description="Process", button_style='success', value = button1.value)

##### ##### #####
id_feature = widgets.SelectMultiple(
                                options=col,
                                value=[col[0]],
                                description='Select ID/data index column: ',
                                disabled=False,
                                style=style,
                                layout=Layout(width='50%', height='150px')
                                )

##### ##### #####
ngb_wgt = widgets.BoundedIntText(value=50,min=1,step=1,description='Neighbors:',\
                                 disabled=False, style=style, layout=uniLayout)

##### ##### #####
# LOF accepts a contamination fraction in (0, 0.5], so step in hundredths
perc_wgt = widgets.BoundedFloatText(value=0.05,min=0, max=0.5,step=0.01,description='Outliers percent:',\
                                    disabled=False, style=style, layout=uniLayout)

##### ##### #####
checkbox_remove = widgets.Checkbox(value=False, description='Remove detected outliers', disabled=False,\
                             style=style, layout=uniLayout)

In [12]:
##### Replace string data #####
def encode_string(df_ori):
    le = LabelEncoder()

    obj = 0
    colList = df_ori.columns
    for i in range(0, len(colList)):
        if df_ori.dtypes.values[i] == 'object':
            obj = obj + 1
            le.fit(df_ori[colList[i]].drop_duplicates()) 
            df_ori[colList[i]] = le.transform(df_ori[colList[i]])
    return df_ori

##### Outlier detection #####
def detectOUT (df, IDs, neighbours, percent):
    X = df.drop(IDs, axis=1)
    clf = LocalOutlierFactor(n_neighbors=neighbours, contamination=percent, leaf_size=1)
    y_pred = clf.fit_predict(X)
    X_scores = clf.negative_outlier_factor_
    return y_pred, X_scores

##### Remove outliers #####
def remove_outliers(df_ori, y_pred):
    # Keep rows labelled 1 (inliers). A boolean mask works by position,
    # so it does not depend on the DataFrame index and leaves df_ori unchanged.
    df_inliers = df_ori[np.asarray(y_pred) == 1]

    return df_inliers

##### Remove empty or un-diverse features #####
def removeCol(df, reCol):
    new_df = df.drop(reCol, axis=1)
    return new_df
# Replace blank with new character
def replaceBlank(df, new):
    new_df = df.fillna(new)
    return new_df

In [13]:
def pre_process(event2):
    print("Step 2 Processing... ")
    df_ori = event2.value
    
    ##### Remove certain columns
    reCol = []
    for i in remove_col.value:
        if i != 'No columns to remove':
            reCol.append(i)
    if len(reCol) > 0:
        df_ori = removeCol(df_ori, reCol)

    ##### Fill missing values
    if len(fill_missing.value) > 0:
        df_ori = replaceBlank(df_ori, fill_missing.value)
    
    ##### String to numbers
    df_ori = encode_string(df_ori)
        
    ##### Outlier detection #####
    y_pred = None
    if perc_wgt.value > 0:
        y_pred, X_scores = detectOUT (df_ori, list(id_feature.value), ngb_wgt.value, perc_wgt.value)
        print('Number of outliers', Counter(y_pred).get(-1))
        print('Number of inliers', Counter(y_pred).get(1))
    
    ##### Remove outliers (only possible if detection actually ran) #####
    if checkbox_remove.value and y_pred is not None:
        df_inliers = remove_outliers(df_ori, y_pred)
        print('Outliers removed successfully!')
        event2.value = df_inliers
    else:
        event2.value = df_ori

In [14]:
group1 = [fill_missing, remove_col]
group2 = [id_feature, ngb_wgt, perc_wgt, checkbox_remove, button2]

box1 = widgets.Box(children=group1, layout=box_layout)
box2 = widgets.Box(children=group2, layout=box_layout)

accordion1 = widgets.Accordion(children=[box1, box2], style=style)
accordion1.set_title(0, 'Step 1 Pre-process')
accordion1.set_title(1, 'Step 2 Detect potential outliers')

button2.on_click(pre_process)
accordion1

## Detect (Potential) identifiable features ##

In [15]:
##### Choose the detection method #####
id_detect_wgt = widgets.RadioButtons(options=['Diversity Checking', \
                                               'Decision Tree', 'Random Forest'],
                                        value='Diversity Checking',
                                        description='Detect potential (quasi-)identifiable features:',
                                        disabled=False,
                                        style=style,
                                        layout=Layout(width='80%', height='60px')
                                        )

##### Select the ID column used as the prediction target #####
id_options = list(id_feature.value)
id_col_wgt = widgets.RadioButtons(options=id_options,
                                    value=id_options[0],
                                    description='Select your ID column:',
                                    disabled=False,
                                    style=style,
                                    layout=Layout(width='80%', height='60px')
                                    )

##### ##### #####
rank_wgt = widgets.BoundedIntText(value=len(button2.value.columns),min=1,max=len(button2.value.columns),\
                                  step=1,description='Number of features to show:', disabled=False, style=style, layout=uniLayout)

##### ##### #####
button3 = LoadedButton(description="Process", button_style='success', value = button2.value)

In [16]:
def id_feature_detect(event3):
    print("Step 3 Processing... ")

    df_inliers = event3.value
    limitNum = rank_wgt.value
    
    if id_detect_wgt.index == 0:
        ##### Check diversity of features #####
        diver = []
#         diver2 = []
        for c in df_inliers.columns:
            diver.append(len(Counter(df_inliers[c]).keys()))
#             diver2.append(len(Counter(df_ori[c]).keys()))

        diver_df = pd.DataFrame.from_records([df_inliers.columns, diver]).transpose()
        diver_df.columns = ['Features', 'Inliers']
        print(diver_df.sort_values(by=['Inliers'], ascending=False)[0:limitNum])
    else:
        tree_detection(id_detect_wgt.index, limitNum, df_inliers, id_col_wgt.value)
        

In [17]:
##### Feature Selection/ranking #####
def tree_detection(index, limitNum, df_inliers, id_column):
    # Build a forest and compute the feature importances
    if index == 1:
        regressor = DecisionTreeRegressor(min_impurity_decrease=0)
    elif index == 2:
        regressor = RandomForestRegressor(n_estimators=200,
                                              random_state=0, 
                                            min_samples_split=2, 
                                            min_samples_leaf=1, 
                                            min_impurity_decrease=0)

    y = df_inliers[id_column]
    X = df_inliers.drop(id_column, axis=1)
    X_col = X.columns
    regressor.fit(X, y)
    importances = regressor.feature_importances_

    
    indices = np.argsort(importances)[::-1]

    # Print the feature ranking
    print("Feature ranking:")

    # X has one column fewer than the full table (the ID column was dropped)
    for f in range(0, min(limitNum, X.shape[1])):
        print("%d. feature %s (%f)" % (f + 1, X_col[indices[f]], importances[indices[f]]))
        
    if index == 2:
        std = np.std([tree.feature_importances_ for tree in regressor.estimators_], axis=0)
        # Plot the feature importances of the forest
        plt.figure(figsize=(12,9))
        plt.title("Feature importances")
        plt.bar(range(X.shape[1]), importances[indices],
               color="r", yerr=std[indices], align="center")
        plt.xticks(range(X.shape[1]), indices)
        plt.xlim([-1, X.shape[1]])
        plt.show()
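
The idea behind the tree-based ranking can be seen on a toy table: a model is trained to predict the ID column, and features that let it tell records apart receive high importance. The two feature columns below are synthetic and their names are hypothetical; this is a sketch of the principle, not a reproduction of `tree_detection`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n = 100
record_id = np.arange(n)                 # the ID column, used as the target
unique_code = rng.permutation(n)         # hypothetical feature: one distinct value per record
binary_flag = rng.randint(0, 2, size=n)  # hypothetical feature: only two distinct values

X = np.c_[unique_code, binary_flag]
tree = DecisionTreeRegressor(min_impurity_decrease=0, random_state=0).fit(X, record_id)
importances = tree.feature_importances_
print(importances)                       # importance of unique_code, then binary_flag
```

A feature with one value per record lets the tree isolate every ID, so it dominates the ranking, while a two-valued feature can separate records only once along each branch.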

## Quasi-identifiable features/attributes/variables ##

In [18]:
##### Remove highly identifiable features #####
df_ori = button2.value
col = df_ori.columns
col_forRemove = col.tolist()
col_forRemove.insert(0, 'No columns to remove')
remove_id_col = widgets.SelectMultiple(
                                options=col_forRemove,
                                value=[col_forRemove[0]],
                                description='Exclude ID/categorical columns: ',
                                disabled=False,
                                style=style,
                                layout=Layout(width='50%', height='150px')
                                )

##### ##### #####
clu_method_wgt = widgets.RadioButtons(options=['None', 'K-means', 'K-means (extra distance column)'],
                                    value='None',
                                    description='Generalize features by using',
                                    disabled=False,
                                    style=style,
                                    layout=Layout(width='80%', height='80px')
                                    )


##### ##### #####
scaler_wgt = widgets.RadioButtons(options=['None', 'Standardization', 'Min-Max Scaler', 'Robust Scaler'],
                                    value='None',
                                    description='Normalize data by',
                                    disabled=False,
                                    style=style,
                                    layout=Layout(width='80%', height='90px')
                                    )


##### Generalize (Quasi-) identifiable features #####
# BoundedIntText expects integer bounds, so use integer division (and at least 1)
clu_wgt = widgets.BoundedIntText(value=20,min=1,max=max(1, len(button2.value)//10),\
                                  step=1,description='Number of clusters:', disabled=False, style=style, layout=uniLayout)


##### ##### #####
saveiwg = widgets.Text(value='preprocessed_dataFile.csv', description='Save processed data: ', style=style,layout=uniLayout)

##### ##### #####
button4 = LoadedButton(description="Process", button_style='success', value = button2.value)

In [19]:
##### K-means clustering #####
def kmeans_func (df_inliers_3, cluster_mode, scaler_mode, clusterNum):
    
    scaler = [None, StandardScaler(), MinMaxScaler(), RobustScaler()]
    
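    if scaler_mode == 0:
        scaled_df_inliers_3 = df_inliers_3
    else:
        # fit_transform returns a NumPy array; wrap it so .columns and .iloc keep working below
        scaled_df_inliers_3 = pd.DataFrame(scaler[scaler_mode].fit_transform(df_inliers_3),
                                           columns=df_inliers_3.columns, index=df_inliers_3.index)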
    if cluster_mode > 0:
        kmeans = KMeans(n_clusters=clusterNum, random_state=0).fit(scaled_df_inliers_3)
        data_clusterLabels = kmeans.labels_
        data_clusterCenters = kmeans.cluster_centers_

        anonymized = []
        dist_center = []
        for i in range(0, len(data_clusterLabels)):
            anonymized.append(data_clusterCenters[data_clusterLabels[i]])
            if cluster_mode == 2:
                # [0, 0] extracts the scalar distance from the 1x1 result matrix
                dist_center.append(euclidean_distances([data_clusterCenters[data_clusterLabels[i]]], \
                                                  [scaled_df_inliers_3.iloc[i]])[0, 0])
        df_anonymized = pd.DataFrame.from_records(anonymized, columns=scaled_df_inliers_3.columns)

        if cluster_mode == 2:
            df_anonymized['dist_center'] = pd.Series(dist_center)

        print(Counter(kmeans.labels_))
    else:
        df_anonymized = scaled_df_inliers_3
        
    print('Done!!')
    return df_anonymized

In [20]:
def anonymization(event4):
    print('\n\n Step 4 Processing ...') 
    df_inliers_2 = event4.value

    ##### Remove certain columns
    reCol = []
    for i in remove_id_col.value:
        if i != 'No columns to remove':
            reCol.append(i)
            
    if len(reCol) > 0:
        df_inliers_3 = removeCol(df_inliers_2, reCol)
    else:
        df_inliers_3 = df_inliers_2
        
    ##### K-means clustering #####
    df_anonymized = kmeans_func(df_inliers_3, clu_method_wgt.index, scaler_wgt.index, clu_wgt.value)
    
    replace_col = df_inliers_3.columns
    for i in replace_col:
        # Assign by position: after clustering, df_anonymized has a fresh 0..n-1 index
        df_inliers_2[i] = df_anonymized[i].values
    
    if clu_method_wgt.index == 0:
        df_inliers_3.to_csv(saveiwg.value, sep=',', encoding='utf-8')
        event4.value = df_inliers_3
    else:
        df_inliers_2.to_csv(saveiwg.value, sep=',', encoding='utf-8')
        event4.value = df_inliers_2

In [21]:
group3 = [id_detect_wgt, id_col_wgt, rank_wgt, button3]
group4 = [remove_id_col, scaler_wgt, clu_method_wgt, clu_wgt, saveiwg, button4]

box3 = widgets.Box(children=group3, layout=box_layout)
box4 = widgets.Box(children=group4, layout=box_layout)

accordion3 = widgets.Accordion(children=[box3, box4], style=style)
accordion3.set_title(0, 'Step 3 Detect potential (quasi-)identifiable features')
accordion3.set_title(1, 'Step 4 Anonymize (generalize) (quasi-)identifiable features')

button3.on_click(id_feature_detect)
button4.on_click(anonymization)
accordion3

In [None]:
##### Clustering #####
# from sklearn.cluster import KMeans
# from sklearn.cluster import MiniBatchKMeans
# from sklearn.preprocessing import MinMaxScaler


# inpu = button2.value[['diag_2', 'diag_3']]

# scalermm = MinMaxScaler(feature_range=(0,1))
# scal_inpu = scalermm.fit_transform(inpu)

# kmeans = KMeans(n_clusters=25, random_state=0).fit(scal_inpu)
# # minikmeans = MiniBatchKMeans(n_clusters=20, random_state=0, batch_size=6).fit(scal_inpu)
# print(Counter(kmeans.labels_))
# print(kmeans.cluster_centers_)


# plot the learned frontier, the points
# xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))

# resMap = np.c_[xx.ravel(), yy.ravel()]
# Z = kmeans.predict(resMap)

# plt.figure(figsize=(8,8))
# plt.scatter(resMap[:,0], resMap[:,1], c = Z, marker='o',  alpha=0.02)

# plt.scatter(scal_inpu[:,0], scal_inpu[:,1], c = kmeans.labels_, marker='.',  alpha=0.5)
# plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], marker='o')
# plt.show()

### Sensitive feature analysis with secure multiparty computation (SMC) ###

In [None]:
##### Level 1: Naive Bayes (all parties know target class) #####
from sklearn.model_selection import train_test_split

df_inliers = button2.value  # pre-processed data from Step 2
scalermm = MinMaxScaler(feature_range=(1,2))
inliers_sld = scalermm.fit_transform(df_inliers)
df_inliers_sld = pd.DataFrame.from_records(inliers_sld)
df_inliers_sld.columns = df_inliers.columns

TargetClass = df_inliers['readmitted']
df_pcd = df_inliers_sld.drop(['encounter_id', 'patient_nbr', 'readmitted'], axis=1)

col_data = df_pcd.columns
X_train, X_test, y_train, y_test = train_test_split(df_pcd, TargetClass, test_size=0.25, random_state=0)

In [None]:
SA_train = X_train[col_data[0:20]]
SA_test = X_test[col_data[0:20]]

SB_train = X_train[col_data[20:40]]
SB_test = X_test[col_data[20:40]]

tot_train = pd.concat([SA_train, SB_train], axis=1, join='inner')#X_train[col_data[0:40]]
tot_test = pd.concat([SA_test, SB_test], axis=1, join='inner')#X_test[col_data[0:40]]

In [None]:
# Train three classifiers separately for A, B sites and entire data

In [None]:
from sklearn.naive_bayes import GaussianNB
from scipy.special import comb, logsumexp

gnb1 = GaussianNB()
gnb1.fit(SA_train, y_train)

gnb2 = GaussianNB()
gnb2.fit(SB_train, y_train)

gnb3 = GaussianNB()
gnb3.fit(tot_train, y_train)
# print("Number of mislabeled points out of a total %d points : %d")

In [None]:
lh1 = gnb1._joint_log_likelihood(SA_train.iloc[0:2])
lhl1 = logsumexp(lh1, axis=1)
np.exp(lh1 - np.atleast_2d(lhl1).T)

In [None]:
lh2 = gnb2._joint_log_likelihood(SB_train.iloc[0:2])
lhl2 = logsumexp(lh2, axis=1)
np.exp(lh2 - np.atleast_2d(lhl2).T)

In [None]:
lh3 = gnb3._joint_log_likelihood(tot_train.iloc[0:2])
lhl3 = logsumexp(lh3, axis=1)
np.exp(lh3 - np.atleast_2d(lhl3).T)

In [None]:
lht = lh1 + lh2 - np.log(gnb1.class_prior_)

lhlt = logsumexp(lht, axis=1)
np.exp(lht - np.atleast_2d(lhlt).T)
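
The combination `lh1 + lh2 - log(prior)` used above works because each site's joint log-likelihood already contains the class prior once. The sketch below checks this on synthetic data with a hand-written Gaussian naive Bayes score; the function name `gaussian_joint_log_likelihood` and the data are assumptions for illustration, and the formula is a plain restatement of Gaussian naive Bayes rather than scikit-learn's internal code.

```python
import numpy as np

def gaussian_joint_log_likelihood(X, y, X_query):
    """log P(c) + sum_j log N(x_j | mean_cj, var_cj) for every class c."""
    out = []
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-9          # small term avoids division by zero
        log_prior = np.log(len(Xc) / len(X))
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var)
                                + (X_query - mean) ** 2 / var, axis=1)
        out.append(log_prior + log_lik)
    return np.array(out).T                   # shape: (n_query, n_classes)

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
X = rng.normal(size=(200, 4)) + y[:, None]   # synthetic data with 4 features
XA, XB = X[:, :2], X[:, 2:]                  # sites A and B hold different columns

log_prior = np.log(np.bincount(y) / len(y))
jll_A = gaussian_joint_log_likelihood(XA, y, XA[:5])
jll_B = gaussian_joint_log_likelihood(XB, y, XB[:5])
jll_all = gaussian_joint_log_likelihood(X, y, X[:5])

# Each site's score includes the prior once, so one copy is subtracted
combined = jll_A + jll_B - log_prior
print(np.allclose(combined, jll_all))
```

This only holds when every site knows the class labels, which is the "Level 1" setting; the sites exchange per-class scores rather than raw feature values.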

In [None]:
# joint_log_likelihood = []
# for i in range(np.size(gnb1.classes_)):###
#     jointi = np.log(gnb3.class_prior_[i])
#     n_ij = - 0.5 * np.sum(np.log(2. * np.pi * np.append(gnb1.sigma_[i, :], gnb2.sigma_[i, :])))
#     n_ij -= 0.5 * np.sum(((tot_train.iloc[0:2] - np.append(gnb1.theta_[i, :], gnb2.theta_[i, :]))**2)/np.append(gnb1.sigma_[i, :], gnb2.sigma_[i, :]), 1)
#     joint_log_likelihood.append(jointi + n_ij)

# joint_log_likelihood = np.array(joint_log_likelihood).T
# joint_log_likelihood

### Level 2: Naive Bayes (only some parties hold the target class) ###

In [None]:
SA = df_inliers['age'].iloc[0:100]
SB = df_inliers['readmitted'].iloc[0:100]
SB_classes = list(Counter(SB).keys())

In [None]:
from sklearn.preprocessing import label_binarize
bi_SB = label_binarize(SB, classes=SB_classes)

In [None]:
SA_mul = len(SA) * [1] # len(SA_train)
SA_mul = np.multiply(SA_mul, SA)
    
SB_mul = len(bi_SB[:,0]) * [1] # len(SB_train)
SB_mul = np.multiply(SB_mul, bi_SB[:,0])

In [None]:
# SA_mul = len(df_inliers['age']) * [1] # len(SA_train)
# for s in SA_train.columns:
#     SA_mul = np.multiply(SA_mul, SA_train[s].values)
    
# SB_mul = len(df_inliers['diabetesMed']) * [1] # len(SB_train)
# for s in SB_train.columns:
#     SB_mul = np.multiply(SB_mul, SB_train[s].values)

In [None]:
scaPro = np.dot(SA_mul, SB_mul)
scaPro

In [None]:
np.random.seed(1)
A_random = np.random.randint(0,5, len(SA_mul))
np.random.seed(2)
C_noise = np.random.randint(0,5, (len(SA_mul), len(SA_mul)))

In [None]:
SA_noise = np.add(SA_mul, np.dot(C_noise, A_random))

In [None]:
S_noise = np.dot(SA_noise, SB_mul)  
SB_noise = np.dot(C_noise.transpose(), SB_mul)
SB_noise

In [None]:
np.random.seed(3)
B_random_single = np.random.randint(0,5, int(len(SB_mul)/2)) #in this case r =5
B_random = []
for i in range(0, len(B_random_single)): 
    B_random.append(B_random_single[i])
    B_random.append(B_random_single[i])
B_noise_2 = SB_noise + B_random

#B sends A: Snoise_1 and ynoise_2

In [None]:
rand_sum = 0
for i in range(0, len(B_random_single)):
    rand_sum = rand_sum + (A_random[2*i] + A_random[2*i+1]) * B_random_single[i]

In [None]:
S_noise - np.dot(A_random,B_noise_2) + rand_sum
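
The cells above can be condensed into one function that shows why the masking terms cancel. This is a single-process simulation that mirrors the arithmetic of those cells; the function name `masked_dot` and the sample vectors are assumptions, and nothing here should be read as a reviewed security protocol.

```python
import numpy as np

def masked_dot(x, y, seed=0):
    """Recover x . y from masked intermediate values (single-process simulation)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    assert n == len(y) and n % 2 == 0, "vectors must have the same even length"
    rng = np.random.RandomState(seed)

    # Site A: mask x with a random matrix C and a random vector r_a
    r_a = rng.randint(0, 5, n)
    C = rng.randint(0, 5, (n, n))
    x_masked = x + C.dot(r_a)

    # Site B: work on the masked x, then mask its own reply with paired random values
    s_masked = x_masked.dot(y)                     # = x.y + r_a.(C^T y)
    r_b = np.repeat(rng.randint(0, 5, n // 2), 2)  # each random value used twice
    y_masked = C.T.dot(y) + r_b

    # Unmasking: the r_a.(C^T y) terms cancel, and r_a.r_b is added back
    return s_masked - r_a.dot(y_masked) + r_a.dot(r_b)

x = np.array([3, 1, 4, 1, 5, 9])
y = np.array([1, 0, 1, 1, 0, 1])
print(masked_dot(x, y), x.dot(y))
```

Because all quantities are integers, the recovered value matches the plain dot product exactly. In a real deployment the masked values would travel between sites instead of living in one process.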

#### Secure Sum ####

In [None]:
D = np.array([[0,0,0,0,0,0],[2,2,2,2,2,2],[3,3,3,3,3,3]])

k = 2
n = 3

In [None]:
def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0

    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg

    return out

In [None]:
DB = []
for i in range(0, len(D)):
    DB.append(chunkIt(D[i], k))
DB

In [None]:
s = np.zeros((3, 2, 3))

for rc in range(k, 0, -1):
    for j in range(0, k):
        for i in range(0, n):
            s[i][j] = DB[i][j] + s[i][j]
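
For comparison, here is a minimal sketch of one common way to compute a secure sum, using additive shares. It is offered as an assumption about the general goal: the cells above split each site's vector into `k` segments and may be building towards a different scheme. The function name `secure_sum` is hypothetical, and Python's `random` module is not a cryptographic generator, so this is a simulation only.

```python
import random

def secure_sum(values, modulus=2**32, seed=0):
    """Sum private non-negative integers using additive shares modulo `modulus`."""
    rng = random.Random(seed)
    n = len(values)
    # shares[i][j] is the share that party i hands to party j
    shares = []
    for v in values:
        parts = [rng.randrange(modulus) for _ in range(n - 1)]
        parts.append((v - sum(parts)) % modulus)   # last share makes the row add up to v
        shares.append(parts)
    # Each party j reveals only the total of the shares it received
    partial = [sum(shares[i][j] for i in range(n)) % modulus for j in range(n)]
    return sum(partial) % modulus

print(secure_sum([0, 2, 3]))
```

No single share or partial total reveals an individual input, yet the partial totals add up to the true sum as long as that sum is smaller than the modulus.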

### Privacy-Preserving Linear Regression ###

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from numpy.linalg import inv

df_inliers = button2.value
### Centralized data ###
X = df_inliers[['age','insulin','time_in_hospital', 'diag_1', 'diag_2']].iloc[0:1000]
b0_X = np.c_[np.ones((len(X),1)), X]
Y = np.array(df_inliers['diag_3'].iloc[0:1000])
Y_chunk = chunkIt(Y, len(Y))

In [None]:
### Data sites A and B (target class Y is at site B) ###
X_a = b0_X[:,0:3]
X_b = b0_X[:,3:6]

### At site A ###
XaTXa = np.matrix(X_a).T * X_a
len_A = len(X_a[0])
### At site B
XbTXb = np.matrix(X_b).T * X_b
len_B = len(X_b[0])

### Computation ###
# Computation = 2 * (len(X_a[0]) * len(X_b[0]))

In [None]:
### At site A ###
A_randoms = []
for i in range(0, len_A):
    np.random.seed(1)
    A_randoms.append(np.random.randint(0,5, len(X_a[:,i])))
    
C_noises = []    
for i in range(0, len_A):
    np.random.seed(2)
    C_noises.append(np.random.randint(0,5, (len(X_a[:,i]), len(X_a[:,i]))))

In [None]:
### At site A ###
SA_noises = []
for i in range(0, len_A):
    SA_noises.append(np.add(X_a[:,i], np.dot(C_noises[i], A_randoms[i])))

In [None]:
### At site B ###
S_noises = []
for i in range(0, len_B):
    S_noises_site = []
    for j in range(0, len_A):
        S_noises_site.append(np.dot(SA_noises[j], X_b[:,i])) # X_b[:,i]
    S_noises.append(S_noises_site)
    
SB_noises = []
for i in range(0, len_B): 
    SB_noise_site = []
    for j in range(0, len_A):
        SB_noise_site.append(np.dot(C_noises[j].transpose(), X_b[:,i])) # X_b[:,i]
    SB_noises.append(SB_noise_site)

In [None]:
### At site B ###
B_random_singles = []
for i in range(0, len_A):
    np.random.seed(3)
    B_random_singles.append(np.random.randint(0,5, int(len(X_b[:,0])/2))) #in this case r =5 # X_b[:,0]
# B_random_singles = np.array(B_random_singles)

B_noises_add = []
for n in range(0, len_B):
    B_noise_2 = []
    for i in range(0, len_A):
        B_random = []
        for j in range(0, len(B_random_singles[i])): 
            B_random.append(B_random_singles[i][j])
            B_random.append(B_random_singles[i][j])
        B_noise_2.append(SB_noises[n][i] + B_random)
    B_noises_add.append(B_noise_2)
#B sends A: Snoise_1 and ynoise_2

In [None]:
rand_sums = []
for i in range(0, len_A):
    r_sum = 0
    for j in range(0, len(B_random_singles[0])):
        r_sum = r_sum + (A_randoms[i][2*j] + A_randoms[i][2*j+1]) * B_random_singles[i][j]
    rand_sums.append(r_sum)


outcomes = []
for n in range(0, len_B):
    out = []
    for i in range(0, len_A):
        out.append(S_noises[n][i] - np.dot(A_randoms[i],B_noises_add[n][i]) + rand_sums[i]) 
    outcomes.append(out)

In [None]:
outcomes

In [None]:
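# `outcomes[n][i]` above is the dot product of B's column n with A's column i,
# so the values pasted from it form Xb^T * Xa (rows follow site B's columns).
XbTXa = [[4616.0, 26824.0, 6676.0],
 [319209.0, 1823609.0, 457053.0],
 [258630.0, 1450363.0, 367140.0]]

# Its transpose is Xa^T * Xb
XaTXb = [[4616.0, 319209.0, 258630.0],
 [26824.0, 1823609.0, 1450363.0],
 [6676.0, 457053.0, 367140.0]]

XaTY = np.matrix(chunkIt(([285387.0, 1559025.0, 411779.0]),3))
XbTY = np.matrix(X_b).T * Y_chunk

In [None]:
# Assemble X^T X = [[Xa^T Xa, Xa^T Xb], [Xb^T Xa, Xb^T Xb]] and X^T Y, then solve the normal equations
pp_XTX = np.concatenate((np.concatenate((XaTXa, XaTXb), axis=1), np.concatenate((XbTXa, XbTXb), axis=1)),axis=0) 
pp_XTY = np.concatenate((XaTY, XbTY),axis=0)
np.linalg.inv(pp_XTX) * pp_XTY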
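
The block structure of the normal equations can be checked on synthetic data. In this sketch the cross-site products are computed in the clear, purely to verify the algebra; in the setting above they would come from the masked scalar product. All data and variable names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 50
Xa = np.c_[np.ones(n), rng.normal(size=(n, 2))]   # site A: intercept + 2 features
Xb = rng.normal(size=(n, 3))                      # site B: 3 features
y = rng.normal(size=n)                            # target, held at site B

# Diagonal blocks and Xb^T y are local to one site. Xa^T Xb and Xa^T y
# combine columns from both sites, so those need a secure computation.
AtA, BtB = Xa.T @ Xa, Xb.T @ Xb
AtB = Xa.T @ Xb
XtX = np.block([[AtA, AtB], [AtB.T, BtB]])
Xty = np.concatenate([Xa.T @ y, Xb.T @ y])

beta_blocks = np.linalg.solve(XtX, Xty)
beta_central = np.linalg.lstsq(np.c_[Xa, Xb], y, rcond=None)[0]
print(np.allclose(beta_blocks, beta_central))
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical-stability choice and does not change the result being illustrated.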

In [None]:
### correct X.T * X, X.T * Y ###
# XTX = np.dot(b0_X.T, b0_X)
# XTY = (np.matrix(b0_X).T * Y_chunk) 
# corr_Out = np.linalg.inv(XTX) * (np.matrix(b0_X).T * Y_chunk) 

In [None]:
### Checking with the centralized method ####
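# lstsq returns (coefficients, residuals, rank, singular values); keep the first two
coef, residuals = np.linalg.lstsq(b0_X, Y_chunk, rcond=None)[0:2]
print(coef, residuals)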

In [None]:
### Checking with the scikit learn method ####
# The `normalize` argument was removed in scikit-learn 1.2
regr = linear_model.LinearRegression(fit_intercept=True)
regr.fit(X, Y)

# The coefficients
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)