# Binary Classifier 2
TJ Kim, 21 Oct 2019

We build a logistic regression model for the data in adult_encoded.csv. The inputs are all the encoded characteristics; the output is a binary label indicating whether a person's income is greater than 50k.

## Importing Data
First we load the CSV data into a pandas dataframe and split it into training and test sets.

In [4]:
import re
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Import Dataset
filename = "adult_encoded.csv"
df = pd.read_csv(filename, sep=r'\s*,\s*', engine='python')  # raw string avoids invalid-escape warnings

# Separate each dataset into training and testing data
total_train,total_test = train_test_split(df, random_state=40, test_size=0.3, shuffle=True)

# Separate each sub dataset into inputs and output
total_train_data = total_train.loc[:, total_train.columns != 'income_over_50k']
total_train_label = total_train.loc[:, total_train.columns == 'income_over_50k']
# The test inputs and labels must come from total_test, not total_train
total_test_data = total_test.loc[:, total_test.columns != 'income_over_50k']
total_test_label = total_test.loc[:, total_test.columns == 'income_over_50k']

df

Unnamed: 0.1,Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,Cls_Cat_?,Cls_Cat_Federal-gov,Cls_Cat_Local-gov,Cls_Cat_Private,...,Job_Cat_Protective-serv,Job_Cat_Sales,Job_Cat_Tech-support,Job_Cat_Transport-moving,Race_Cat_Amer-Indian-Eskimo,Race_Cat_Asian-Pac-Islander,Race_Cat_Black,Race_Cat_Other,Race_Cat_White,Is_Male
35574,35574,53,9,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
22261,22261,45,9,0,0,40,0,0,0,1,...,0,0,0,1,0,0,0,0,1,1
16944,16944,36,12,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1
34290,34290,35,9,0,0,36,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1
3795,3795,53,4,0,0,60,0,0,0,1,...,0,0,0,1,0,0,0,0,1,1
40738,40738,44,13,0,0,45,0,0,0,1,...,0,1,0,0,0,0,0,0,1,1
26597,26597,40,10,0,0,40,0,0,0,1,...,0,1,0,0,0,0,0,0,1,1
16323,16323,59,10,0,0,40,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
23196,23196,49,15,15024,0,65,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
16211,16211,23,14,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,1,1


## Build and Run Logistic Binary Classifier
We fit scikit-learn's logistic regression on the training dataframe built above and measure its accuracy on the binary income label.

In [2]:
# Run Logistic Regression (labels are flattened to 1D, which is the shape fit() expects)
logreg = LogisticRegression(solver="liblinear")
logreg.fit(total_train_data, total_train_label.values.ravel())

# Predict test result and calculate accuracy
y_pred = logreg.predict(total_test_data)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(total_test_data, total_test_label)))



Accuracy of logistic regression classifier on test set: 0.82


## Obtain Parameter Vector
Here we check the shape of the learned weight vector, which tells us how many unknowns the attack below has to solve for.

In [3]:
logreg.coef_.shape

(1, 40)
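
To make concrete what the parameter vector is, the following sketch checks on synthetic data that a fitted `LogisticRegression` computes its confidence as the sigmoid of the intercept plus the dot product of the input with `coef_`. The data, the names `X`, `true_w`, `model`, and `params`, and the random seed are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (illustrative only, not the adult dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = (X @ true_w + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(solver="liblinear").fit(X, y)

# Parameter vector: intercept followed by the d coefficients, so d + 1 values
params = np.append(model.intercept_, model.coef_)

# Reproduce predict_proba by hand: sigmoid(b + x . w)
logits = model.intercept_[0] + X @ model.coef_.ravel()
manual_proba = 1.0 / (1.0 + np.exp(-logits))
matches = np.allclose(manual_proba, model.predict_proba(X)[:, 1])

print(params.shape, matches)
```

For the model in this notebook the same reasoning gives 40 coefficients plus one intercept, that is, 41 unknowns.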

## Linear System of Equation Solving Attack

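Next, we attack the binary logistic classifier we built by querying it with d+1 inputs, where d is the number of values in the weight vector; the extra query accounts for the intercept. The attack needs confidence values rather than just binary classifications, because the confidences can be passed through the inverse sigmoid to give equations that are linear in the weights.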
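
The reason confidence values are needed can be written out as follows. This is a sketch; the symbols $p_i$ (returned confidence for query $x_i$), $w$ (weights), and $b$ (intercept) are notation introduced here for illustration.

$$
p_i = \sigma(w^\top x_i + b) = \frac{1}{1 + e^{-(w^\top x_i + b)}}
\quad\Longrightarrow\quad
w^\top x_i + b = \sigma^{-1}(p_i) = -\ln\!\left(\frac{1}{p_i} - 1\right)
$$

Stacking $d+1$ queries gives a linear system in the $d+1$ unknowns $(b, w)$:

$$
\begin{bmatrix} 1 & x_1^\top \\ \vdots & \vdots \\ 1 & x_{d+1}^\top \end{bmatrix}
\begin{bmatrix} b \\ w \end{bmatrix}
=
\begin{bmatrix} \sigma^{-1}(p_1) \\ \vdots \\ \sigma^{-1}(p_{d+1}) \end{bmatrix}
$$

A hard 0/1 label discards $p_i$, so no such equation can be formed from it.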

With 40 weights plus the intercept, we need 41 linearly independent equations. Checking by hand that every equation is linearly independent would be tedious.

Instead, we simply sample rows from the dataset and use them as queries.
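
As a minimal sketch of the idea, the following runs the equation-solving attack against a small synthetic model rather than the notebook's classifier. Everything here is an assumption for illustration: the synthetic data, the random Gaussian queries, and the names `victim`, `extract`, `full_est`, and `short_est`. It also checks the rank of the query matrix, and contrasts d+1 queries with too few.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d = 6

# Train a small "victim" model on noisy synthetic data (illustrative only)
X = rng.normal(size=(300, d))
y = (X @ rng.normal(size=d) + rng.normal(size=300) > 0).astype(int)
victim = LogisticRegression(solver="liblinear").fit(X, y)
true_params = np.append(victim.intercept_, victim.coef_)

def extract(n_queries):
    """Query the victim n_queries times and solve for [intercept, weights]."""
    queries = rng.normal(size=(n_queries, d))
    conf = victim.predict_proba(queries)[:, 1]
    logits = np.log(conf) - np.log1p(-conf)            # inverse sigmoid
    A = np.hstack([np.ones((n_queries, 1)), queries])  # leading 1s for the intercept
    est, *_ = np.linalg.lstsq(A, logits, rcond=None)
    return est, np.linalg.matrix_rank(A)

# d + 1 independent equations: the parameters are recovered
full_est, full_rank = extract(d + 1)
# Fewer equations than unknowns: lstsq returns a minimum-norm solution instead
short_est, short_rank = extract(d - 2)

print(full_rank, np.allclose(full_est, true_params, atol=1e-6))
print(short_rank, np.allclose(short_est, true_params, atol=1e-3))
```

Checking `np.linalg.matrix_rank` on the query matrix is a cheap way to confirm the sampled rows are linearly independent before trusting the solution.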

In [5]:
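# Function - compute the inverse sigmoid (logit) of sigmoid outputs; accepts 1D or 2D arrays
def get_sigmoid_inv(out_val):
    # Work on a float copy so the caller's array is not modified
    out_val = np.array(out_val, dtype=float)
    # An output of exactly 1 would give an infinite logit, so nudge it just below 1
    out_val[out_val == 1] = 0.999999999999999
    in_val = -1*np.log(np.divide(1,out_val) - 1)
    return in_val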
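
The guard against outputs equal to 1 matters because the sigmoid saturates in floating point. The sketch below shows this on a few hand-picked logits. The helper names `sigmoid` and `sigmoid_inv` and the symmetric clipping at both 0 and 1 are assumptions for illustration, a variant of the function above rather than a copy of it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inv(p, eps=1e-15):
    # Clip away from 0 and 1 so the logit stays finite (hypothetical variant)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)

z = np.array([-3.0, 0.0, 2.5, 40.0])
p = sigmoid(z)             # the last entry rounds to exactly 1.0 in float64
recovered = sigmoid_inv(p)

print(p[-1] == 1.0)
print(recovered)
```

Moderate logits survive the round trip, while the saturated one comes back as a large finite value instead of infinity. The original value of 40 is lost, which is why saturated confidences give less reliable equations.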

In [6]:
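# Get an appropriate number of samples from the dataset to attack with.
# Note: coef_.size - 20 gives 20 queries, fewer than the 41 unknowns,
# so the least-squares system below is underdetermined.
num_queries = logreg.coef_.size - 20
attack_sample = df.sample(num_queries)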

# Divide data and labels
sample_x = attack_sample.loc[:,attack_sample.columns != 'income_over_50k']
sample_y = attack_sample.loc[:,attack_sample.columns =='income_over_50k']

# Obtain P(y = 1) vals for entire array - the confidence values for income over 50k (1) vs not (0)
conf = logreg.predict_proba(sample_x)[:,1]
xw_s = get_sigmoid_inv(conf)
bias = np.ones((num_queries,1))
x_testvals = np.concatenate((bias, sample_x.values), 1)

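# Solve the least-squares system for [bias, weights]; rcond=None selects NumPy's current default cutoff
w_approx = np.linalg.lstsq(x_testvals, xw_s, rcond=None)
w_approx[0]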



array([-2.82108726e-01, -1.19753702e-05, -1.17845873e-02,  5.29326058e-02,
        3.11177707e-04, -6.93889390e-16, -1.58044667e-02, -8.08849658e-02,
        8.90181325e-02, -9.02802439e-03, -3.11702172e-01,  5.55111512e-17,
        3.04883040e-02,  0.00000000e+00, -1.87684011e-01,  8.92960020e-01,
       -9.16468069e-02, -8.56743910e-01, -3.89940177e-02,  0.00000000e+00,
       -2.52878111e-01, -8.08849658e-02, -8.82999978e-02, -1.21620633e-01,
        3.63712572e-01,  3.04883040e-02, -9.15515367e-02, -1.03104989e-01,
       -5.62291114e-01,  0.00000000e+00,  2.82425501e-01,  0.00000000e+00,
        8.90181325e-02,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -2.82108726e-01,
        1.82556396e-01])

In [86]:
real_weights = np.append(logreg.intercept_,logreg.coef_)
real_weights

array([-4.49094231e-01, -1.30403203e-05, -1.44312943e-02,  5.03137221e-02,
        3.09383102e-04,  6.76566732e-04, -4.19043504e-03, -1.49623969e-01,
        5.01265503e-02,  1.34829539e-02, -4.28483040e-01,  1.00617457e-01,
       -2.35188858e-02, -1.16952966e-02, -2.80953837e-01,  9.31203644e-01,
       -3.07384610e-02, -8.99961277e-01, -8.94433673e-02, -7.92009328e-02,
       -3.22767694e-01, -1.48468597e-01, -1.72619718e-01, -7.57962444e-02,
        3.19681468e-01, -6.75095317e-02, -1.14746125e-01, -1.22040218e-01,
       -2.90428464e-01, -1.65977976e-02,  2.64438922e-01,  1.39764301e-02,
       -1.22162243e-02,  1.50941845e-02, -4.18623162e-02, -1.98327115e-02,
       -9.78331990e-03, -1.79557204e-01, -1.78689053e-02, -2.22052090e-01,
        2.11488232e-01])
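
One way to quantify how close the recovered vector is to the real one is to compare them entry by entry. The sketch below does this for only the first four entries of each array printed above (intercept first), copied by hand. The names `w_approx_head` and `real_head` are hypothetical, and the full 41-entry comparison would use `w_approx[0]` and `real_weights` directly.

```python
import numpy as np

# First four entries of the recovered and real vectors printed above
w_approx_head = np.array([-2.82108726e-01, -1.19753702e-05, -1.17845873e-02, 5.29326058e-02])
real_head = np.array([-4.49094231e-01, -1.30403203e-05, -1.44312943e-02, 5.03137221e-02])

abs_err = np.abs(w_approx_head - real_head)
max_err = abs_err.max()
rel_l2 = np.linalg.norm(w_approx_head - real_head) / np.linalg.norm(real_head)

print(abs_err)
print(max_err, rel_l2)
```

On these entries the largest gap is in the intercept. A mismatch of this kind is what one would expect when the system has fewer equations than unknowns, as with the 20 queries used here.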