### Privacy-preserving logistic regression
#### Unweighted, discrete

In this notebook we give examples of performing more complicated analyses, such as regression, as post-processing on data that has already had noise infused.

Several common formally private noise injection methods depend on the concept of global sensitivity: the most that a given output can change when a single person is added or deleted, taken over all possible datasets we could observe.
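
The definition can be written compactly. The notation below is an assumption, since the notebook does not introduce symbols of its own: $f$ is the query, and $D \sim D'$ means that the two datasets differ by adding or deleting one person.

$$
\Delta f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1
$$

For a table of counts, adding or deleting one person changes exactly one cell by 1, so $\Delta f = 1$.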

Regression poses a problem for these methods. If we consider a simple ordinary least squares model with a single predictor, we can imagine scenarios where adding or deleting a single person can have a marked effect upon the slope of the regression line:
[Illustration here]
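
The effect of one record on the slope can be checked numerically. The following is a minimal sketch on synthetic data; the data, the seed, the extra point, and the helper name `ols_slope` are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 people on a line with slope 2 plus unit-variance noise
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)

def ols_slope(x, y):
    """Slope of the ordinary least squares line y = slope * x + intercept."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

slope_without = ols_slope(x, y)

# Add one hypothetical person far from everyone else (a high-leverage point)
x_with = np.append(x, 100.0)
y_with = np.append(y, 0.0)
slope_with = ols_slope(x_with, y_with)

print(f"slope without the extra person: {slope_without:.3f}")
print(f"slope with the extra person:    {slope_with:.3f}")
```

Because the extra point can be placed arbitrarily far away, the change in the slope is not bounded by any fixed constant, which is what makes a direct global-sensitivity argument difficult here.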

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import requests, zipfile, io
import sklearn
import random
from typing import Union
import statsmodels.api as sm

## Logistic Regression as Post-Processing

In [3]:
if 'z' not in locals():
    r = requests.get('https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pla.zip')
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extract('psam_p22.csv')

keepcols = ['PUMA','RACWHT','PINCP','AGEP','SCHL','MIGPUMA','MIGSP']
pa = pd.read_csv("psam_p22.csv", usecols=keepcols)

In [4]:
# Restrict to PUMA 2400 and ages 18 through 65
pa.query('PUMA==2400 and AGEP>=18 and AGEP<=65', inplace=True)
pa.head()
#TODO clean data

Unnamed: 0,PUMA,AGEP,SCHL,MIGPUMA,MIGSP,PINCP,RACWHT
207,2400,30,18.0,2390.0,22.0,35000.0,0
275,2400,56,22.0,,,68000.0,0
276,2400,57,20.0,,,58700.0,0
382,2400,26,12.0,,,18000.0,1
383,2400,31,11.0,3700.0,6.0,0.0,1


In [21]:
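def migrecode(migpuma):
    """Return 1 if a migration PUMA is recorded, else 0."""
    return 0 if pd.isnull(migpuma) else 1

def agerecode(age):
    """Return 1 for adults (age 18 or older), else 0."""
    return 0 if age < 18 else 1

def schlrecode(schl):
    """Return 1 if the SCHL code is 18 or higher, else 0."""
    return 0 if schl < 18 else 1

pa['MIGRATED']=pa.MIGPUMA.apply(migrecode)
# After the AGEP filter above, ADULT is 1 for every row, so it acts as the intercept
pa['ADULT']=pa.AGEP.apply(agerecode)
pa['COLLEGE']=pa.SCHL.apply(schlrecode)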
print(pa.head())
print(pa.MIGRATED.value_counts())

     PUMA  AGEP  SCHL  MIGPUMA  MIGSP    PINCP  RACWHT  MIGRATED  ADULT  \
207  2400    30  18.0   2390.0   22.0  35000.0       0         1      1   
275  2400    56  22.0      NaN    NaN  68000.0       0         0      1   
276  2400    57  20.0      NaN    NaN  58700.0       0         0      1   
382  2400    26  12.0      NaN    NaN  18000.0       1         0      1   
383  2400    31  11.0   3700.0    6.0      0.0       1         1      1   

     COLLEGE  
207        1  
275        1  
276        1  
382        0  
383        0  
0    4775
1     695
Name: MIGRATED, dtype: int64


In [22]:
# Feature selection
# TODO: change Y value to rent or mortgage
X = pa[['RACWHT','ADULT','COLLEGE']]
y = pa.MIGRATED

logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.375800
         Iterations 6
                         Results: Logit
Model:              Logit            Pseudo R-squared: 0.013     
Dependent Variable: MIGRATED         AIC:              4117.2525 
Date:               2019-08-16 11:02 BIC:              4137.0736 
No. Observations:   5470             Log-Likelihood:   -2055.6   
Df Model:           2                LL-Null:          -2082.7   
Df Residuals:       5467             LLR p-value:      1.7131e-12
Converged:          1.0000           Scale:            1.0000    
No. Iterations:     6.0000                                       
------------------------------------------------------------------
              Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
------------------------------------------------------------------
RACWHT        0.6363    0.0867    7.3360  0.0000   0.4663   0.8063
ADULT        -2.0941    0.0749  -27.9769  0.0000  -2.2409  -1.

## Laplace Noise

In [24]:
def laplace_mech(mu: Union[float, np.ndarray], epsilon: float, sensitivity: float = 1.0):
    """
    Implementation of the Laplace Mechanism

    Args:
      mu (float or numpy array): the true answer
      epsilon (float): the privacy budget (must be positive)
      sensitivity (float): the global sensitivity of the query

    Returns:
      mu plus independent Laplace noise with scale sensitivity / epsilon
    """
    # Noise scale b = sensitivity / epsilon
    scale = float(sensitivity) / epsilon
    np_shape = np.shape(mu)
    shape = None if np_shape == () else np_shape
    z = np.random.laplace(0.0, scale=scale, size=shape)
    return mu + z
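
The noise scale used by the mechanism is the sensitivity divided by epsilon, and the expected absolute value of a Laplace draw equals its scale. The sketch below checks this empirically; the epsilon values, the sample size, and the name `mean_abs_error` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sensitivity = 1.0  # e.g. a count query under add/delete-one-person

mean_abs_error = {}
for epsilon in (0.1, 1.0, 5.0):
    scale = sensitivity / epsilon          # Laplace scale b
    noise = rng.laplace(0.0, scale, size=200_000)
    # For Laplace(0, b), the expected value of |noise| is b
    mean_abs_error[epsilon] = np.abs(noise).mean()
    print(f"epsilon={epsilon}: scale={scale:.2f}, "
          f"mean |noise|={mean_abs_error[epsilon]:.3f}")
```

Smaller epsilon means a larger scale, so each cell of a noisy table is expected to be further from its true count.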

In [68]:
tab=pd.crosstab(pa.MIGRATED, [pa.RACWHT, pa.ADULT, pa.COLLEGE])
#tab=pd.crosstab(pa.MIGRATED, [pa.RACWHT, pa.ADULT, pa.COLLEGE]).unstack()
noise = laplace_mech(np.zeros(tab.shape), 0.1, 1.0)
noisy_tab = tab + noise
# Index.get_values() was removed in pandas 1.0; to_numpy() returns the same tuples
print(tab.unstack().index.to_numpy())

[(0, 1, 0, 0) (0, 1, 0, 1) (0, 1, 1, 0) (0, 1, 1, 1) (1, 1, 0, 0)
 (1, 1, 0, 1) (1, 1, 1, 0) (1, 1, 1, 1)]
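
Once a table of counts has been noised, fitting a model to it is post-processing. One possible approach, sketched here with NumPy only, is to treat every cell of the table as one row and use its noisy count as a frequency weight. Everything in this block is an assumption for illustration: the cell counts are made up (they are not the PUMS counts), `fit_weighted_logit` is a hypothetical helper, and clipping negative noisy counts to zero is only one simple choice.

```python
import numpy as np

def fit_weighted_logit(X, y, w, n_iter=50, tol=1e-10):
    """Logistic regression by Newton-Raphson with per-row frequency weights."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                       # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# One row per (RACWHT, COLLEGE, MIGRATED) cell; counts are hypothetical
cells = np.array([[r, c, m] for r in (0, 1) for c in (0, 1) for m in (0, 1)])
true_counts = np.array([900., 100., 700., 150., 1500., 200., 1600., 320.])

# Laplace noise with sensitivity 1 and an assumed epsilon of 1
rng = np.random.default_rng(0)
epsilon = 1.0
noisy_counts = true_counts + rng.laplace(0.0, 1.0 / epsilon, size=true_counts.shape)
weights = np.clip(noisy_counts, 0.0, None)  # frequency weights cannot be negative

# Design matrix: intercept, RACWHT, COLLEGE; outcome: MIGRATED
X = np.column_stack([np.ones(len(cells)), cells[:, 0], cells[:, 1]])
y = cells[:, 2].astype(float)

beta_true = fit_weighted_logit(X, y, true_counts)
beta_noisy = fit_weighted_logit(X, y, weights)
print("coefficients from true counts: ", np.round(beta_true, 4))
print("coefficients from noisy counts:", np.round(beta_noisy, 4))
```

With cell counts in the hundreds and noise of scale 1, the two sets of coefficients should be close; with small cells or a small epsilon the gap can be much larger.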


In [None]:
def avg_l1_laplace(epsilon, mu, n=1000):
    """Average accuracy of the Laplace mechanism on a 2-D array over n samples.

    Args:
      epsilon (float): the privacy budget
      mu (2-D numpy array or DataFrame): the true answer
      n (int): number of samples
    """
    total = 0
    for i in range(n):
        noisy_arr = laplace_mech(mu, epsilon, sensitivity=1.0)
        # For 2-D input, np.linalg.norm(..., 1) is the matrix 1-norm
        # (largest absolute column sum), not the entrywise L1 sum.
        accuracy = 1 - (np.linalg.norm(noisy_arr-mu, 1)/(2*noisy_arr.shape[1]))
        total += accuracy
    return total/n


In [None]:
# new_pa is assumed to be a numeric DataFrame built in an earlier cell (not shown here)
orig_arr = pd.DataFrame(new_pa.fillna(0))
accuracy_df = pd.DataFrame()
eps_range = np.arange(1,6.0,1)
accuracy_df['Privacy Loss (ϵ)'] = eps_range
accuracy_df['Accuracy'] = [avg_l1_laplace(x, orig_arr) for x in eps_range]
# Set the style before plotting so that it takes effect
# (matplotlib 3.6 and later name this style 'seaborn-v0_8-paper')
plt.style.use('seaborn-paper')
accuracy_df.plot.scatter('Privacy Loss (ϵ)', 'Accuracy')
plt.title('Trade-Off Between Privacy Loss and Accuracy')
plt.savefig('out/fig.png',facecolor='w', edgecolor='w',
        orientation='portrait', transparent=False, bbox_inches=None, pad_inches=0.1)
plt.show()

In [None]:
noisy_microdata = laplace_mech(orig_arr,3)
X = noisy_microdata[['RAC1P','AGEP','SCHL']]
y = noisy_microdata['PINCP'].apply(labeler)
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

## Randomized Response

In [None]:
print(noisy_microdata)
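
This section does not yet contain a randomized-response example. As a placeholder, here is a minimal sketch of the classic mechanism for a single binary attribute: each person reports the true bit with probability e^ε / (1 + e^ε) and the flipped bit otherwise, and the population proportion is then recovered with an unbiased correction. The function names, the synthetic bits, and the choice of ε = 1 are all assumptions for illustration.

```python
import numpy as np

def randomized_response(bits, epsilon, rng):
    """Report each bit truthfully with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    keep = rng.random(bits.shape) < p_truth
    return np.where(keep, bits, 1 - bits), p_truth

def debias_mean(reported, p_truth):
    """Unbiased estimate of the true proportion of ones from the reported bits."""
    return (reported.mean() - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = np.random.default_rng(0)

# Synthetic binary attribute with roughly 13% ones (a made-up population)
true_bits = (rng.random(100_000) < 0.13).astype(int)

reported, p_truth = randomized_response(true_bits, epsilon=1.0, rng=rng)
estimate = debias_mean(reported, p_truth)

print(f"true proportion:      {true_bits.mean():.4f}")
print(f"raw reported mean:    {reported.mean():.4f}")
print(f"debiased estimate:    {estimate:.4f}")
```

The raw reported mean is biased toward one half, which is why the correction step is needed before the reported bits are used in any downstream analysis.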