# Week 1: Building a Binary Logistic Regression Classifier From Iris Data
September 28, 2019

Goal: Build a binary logistic classifier from the iris.csv data using sklearn.
Stretch Goal: Build a three-class classifier.

## Running Binary Classifier

Here we load the iris.csv dataset, split it by class, and create training and test sets.

In [1]:
import re
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the dataset; the regex separator strips whitespace around the commas
filename = "iris.csv"
df = pd.read_csv(filename, sep=r'\s*,\s*', engine='python')

# Split the dataset into one DataFrame per class
setosa_set = df[df["class"]=="Iris-setosa"]
versicolor_set = df[df["class"]=="Iris-versicolor"]
virginica_set = df[df["class"]=="Iris-virginica"]

# Split the full dataset and each per-class subset into training (70%) and test (30%) data
total_train,total_test = train_test_split(df, random_state=42, test_size=0.3, shuffle=True)
setosa_train, setosa_test = train_test_split(setosa_set, random_state=42, test_size=0.3, shuffle=True)
versicolor_train, versicolor_test = train_test_split(versicolor_set, random_state=42, test_size=0.3, shuffle=True)
virginica_train, virginica_test = train_test_split(virginica_set, random_state=42, test_size=0.3, shuffle=True)
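
Splitting each class separately, as above, gives every class the same train/test proportions. As a minimal sketch of an alternative, `train_test_split` also accepts a `stratify` argument that preserves class proportions in a single call. The example assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of iris.csv so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled iris data rather than iris.csv
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```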

## Run Logistic Fitting On Iris Data
We use pandas DataFrames to hold the data and scikit-learn's LogisticRegression, which accepts the string class labels directly.

In [19]:
# Logistic Model Fitting
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Split training/test data to input and output
cols = ['sepal length','sepal width', 'petal length', 'petal width']
y_val = 'class'
# Train Data
x_setosa_train = setosa_train[cols]
y_setosa_train = setosa_train[y_val]
x_versicolor_train = versicolor_train[cols]
y_versicolor_train = versicolor_train[y_val]

# Two-class case: setosa vs. versicolor
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
x_train = pd.concat([x_setosa_train, x_versicolor_train])
y_train = pd.concat([y_setosa_train, y_versicolor_train])

x_test = pd.concat([setosa_test[cols], versicolor_test[cols]])
y_test = pd.concat([setosa_test[y_val], versicolor_test[y_val]])

# Three-class case: these assignments overwrite the two-class data above.
# Skip this block to train the binary classifier used in the attack section.
x_train = total_train[cols]
y_train = total_train[y_val]

x_test = total_test[cols]
y_test = total_test[y_val]

# Run Logistic Regression
logreg = LogisticRegression()
logreg.fit(x_train,y_train)

# Predict Test Result and calculate Accuracy
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.98
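
Because the three-class assignments in the cell above replace the two-class ones, the binary case may be easier to follow in isolation. The following is a minimal standalone sketch of it, not the notebook's exact cell: it assumes scikit-learn's bundled iris data (where labels 0 and 1 are setosa and versicolor), and the names `binary_model`, `X_bin`, and `y_bin` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data; labels 0 and 1 are setosa and versicolor
X, y = load_iris(return_X_y=True)
mask = y < 2
X_bin, y_bin = X[mask], y[mask]

# Same 70/30 split settings as the notebook, stratified by class
X_train, X_test, y_train, y_test = train_test_split(
    X_bin, y_bin, test_size=0.3, random_state=42, stratify=y_bin
)

# Fit the two-class model and score it on the held-out data
binary_model = LogisticRegression()
binary_model.fit(X_train, y_train)
accuracy = binary_model.score(X_test, y_test)
print('Binary test accuracy: {:.2f}'.format(accuracy))
```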




## Obtain Parameter Vector
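Here we read the fitted parameter vector from the model so that we can later compare it with the values recovered by the attack. With the three-class data from the previous cell, coef_ holds one row of weights per class.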

In [20]:
logreg.coef_

array([[ 0.37158254,  1.35098324, -2.09936396, -0.93263471],
       [ 0.46758048, -1.57259888,  0.39692171, -1.0678223 ],
       [-1.52865509, -1.43245908,  2.30484329,  2.08586834]])
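
The weights in `coef_` are only part of the fitted model; scikit-learn stores the bias term separately in `intercept_`. As a small sketch, assuming the bundled iris data and illustrative model names (`three_class`, `two_class`), the shapes of both attributes can be compared for the three-class and two-class cases. `max_iter=1000` is set here only to avoid a possible convergence warning.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris data instead of iris.csv
X, y = load_iris(return_X_y=True)

# Three classes: one weight row and one intercept per class
three_class = LogisticRegression(max_iter=1000).fit(X, y)

# Two classes (labels 0 and 1 only): a single weight row and intercept
mask = y < 2
two_class = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

print(three_class.coef_.shape, three_class.intercept_.shape)
print(two_class.coef_.shape, two_class.intercept_.shape)
```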

## Equation-Solving Attack With a Linear System

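Next, we attack the binary logistic classifier we built by querying it on random inputs and solving a linear system for its weights. Here "d" is the number of values in the weight vector (4 for the iris features). Recovering the d weights plus the intercept takes d+1 queries; the cells below use only d = 4 queries and solve for the weights alone, so the recovered values are approximate. The attack requires confidence values rather than just binary classifications, because only a probability can be inverted through the sigmoid to give the model's linear score.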
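
The attack rests on one identity, written out here as a short derivation. The symbols are notation introduced for this sketch: $w$ and $b$ are the model's weight vector and intercept, $x_i$ is the $i$-th query, and $p_i$ is the probability the model returns for it.

$$
p_i = \sigma(w^\top x_i + b) = \frac{1}{1 + e^{-(w^\top x_i + b)}}
\quad\Longrightarrow\quad
\sigma^{-1}(p_i) = -\ln\!\left(\frac{1}{p_i} - 1\right) = w^\top x_i + b
$$

Stacking $d+1$ queries gives a square linear system in the $d+1$ unknowns $(w, b)$:

$$
\begin{bmatrix} x_1^\top & 1 \\ \vdots & \vdots \\ x_{d+1}^\top & 1 \end{bmatrix}
\begin{bmatrix} w \\ b \end{bmatrix}
=
\begin{bmatrix} \sigma^{-1}(p_1) \\ \vdots \\ \sigma^{-1}(p_{d+1}) \end{bmatrix}
$$

If the column of ones is dropped and only $d$ queries are used, the system can still be solved, but the intercept is absorbed into the weights and the result is only approximate.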

In [4]:
# Generate random query data: 4 inputs (d queries; d+1 = 5 would be needed to also recover the intercept)
sepal_lengths = np.random.rand(4) * 5
sepal_widths = np.random.rand(4) * 4
petal_lengths = np.random.rand(4) * 7
petal_widths = np.random.rand(4) * 2.5

# Generate Pandas Dataframe and use model to predict output
d = {'sepal length': sepal_lengths, 'sepal width': sepal_widths,'petal length':petal_lengths,'petal width':petal_widths}
x_test1 = pd.DataFrame(data=d)
y_test1 = logreg.predict(x_test1)

In [5]:
# Inverse sigmoid (logit): maps a sigmoid output back to its input; works elementwise on arrays
def get_sigmoid_inv(out_val):
    in_val = -1*np.log(np.divide(1,out_val) - 1)
    return in_val
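
As a quick sanity check on this function, the sketch below passes a few scores through a sigmoid and back. The `sigmoid` helper and the sample `scores` are introduced here for illustration; `get_sigmoid_inv` repeats the formula from the cell above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def get_sigmoid_inv(out_val):
    """Inverse of the sigmoid, same formula as in the cell above."""
    return -1 * np.log(np.divide(1, out_val) - 1)

# Illustrative scores; the round trip should return them unchanged
scores = np.array([-3.0, -0.5, 0.0, 2.0, 6.0])
recovered = get_sigmoid_inv(sigmoid(scores))
print(np.allclose(recovered, scores))
```

Probabilities that round to exactly 0 or 1 would make the inverse infinite, so the check uses moderate scores.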

In [6]:
# Obtain P(y = 1) for every query - the confidence values for iris setosa (0) vs iris versicolor (1)
conf = logreg.predict_proba(x_test1)[:,1]
xw_s = get_sigmoid_inv(conf)

In [13]:
xw_s

array([-0.26585045,  5.62252251,  7.852434  , -2.57833001])

In [8]:
np.transpose(x_test1.values)

array([[3.76858352, 3.70895561, 1.56246442, 3.43561954],
       [3.01621121, 1.58559   , 1.6590499 , 1.80230802],
       [2.39763095, 4.12439864, 4.68714323, 0.24307408],
       [0.50940415, 0.96771208, 1.33256781, 0.96003377]])

In [14]:
# Solve X w = sigmoid_inv(conf) for w; the intercept is not modeled, so w is approximate
w_approx = np.linalg.solve(x_test1.values,xw_s)
w_approx

array([-0.40890087, -1.35253721,  2.06460538,  0.7940704 ])

In [15]:
logreg.coef_

array([[-0.38897155, -1.31781989,  2.06492322,  0.90383684]])
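
The recovered weights are close to `coef_` but not equal, which is consistent with the 4-query system leaving out the intercept. A sketch of the full d+1 variant follows, under stated assumptions: it trains its own model (`victim`, a name introduced here) on scikit-learn's bundled iris data rather than reusing `logreg`, draws 5 queries from the same ranges as above with a fixed seed, and appends a column of ones so the intercept becomes a fifth unknown.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def get_sigmoid_inv(out_val):
    """Inverse sigmoid, same formula as earlier in the notebook."""
    return -1 * np.log(np.divide(1, out_val) - 1)

# Assumption: bundled iris data, setosa (0) vs. versicolor (1)
X, y = load_iris(return_X_y=True)
mask = y < 2
victim = LogisticRegression().fit(X[mask], y[mask])

# d + 1 = 5 random queries, drawn from the same ranges as above
rng = np.random.default_rng(0)
n_features = X.shape[1]
queries = rng.random((n_features + 1, n_features)) * np.array([5, 4, 7, 2.5])

# The attacker only sees P(y = 1) for each query
conf = victim.predict_proba(queries)[:, 1]
scores = get_sigmoid_inv(conf)

# Append a column of ones so the last unknown is the intercept
augmented = np.hstack([queries, np.ones((n_features + 1, 1))])
solution = np.linalg.solve(augmented, scores)
w_recovered, b_recovered = solution[:-1], solution[-1]

print(w_recovered, victim.coef_[0])
print(b_recovered, victim.intercept_[0])
```

With the intercept included, the recovered parameters should match the model's up to floating-point error, provided the random query matrix is well conditioned.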

In [16]:
x_test1

|   | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| 0 | 3.768584 | 3.016211 | 2.397631 | 0.509404 |
| 1 | 3.708956 | 1.58559 | 4.124399 | 0.967712 |
| 2 | 1.562464 | 1.65905 | 4.687143 | 1.332568 |
| 3 | 3.43562 | 1.802308 | 0.243074 | 0.960034 |


In [18]:
conf

array([0.43392609, 0.99639752, 0.99961135, 0.07054615])