# Logistic regression

This exercise is adapted from the [Stanford Machine Learning Course](https://github.com/krasserm/machine-learning-notebooks). The original exercise, written for Octave, is available [here](https://github.com/krasserm/machine-learning-notebooks/blob/master/data/ml-ex2/ex2.pdf).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from matplotlib import colors
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

%matplotlib inline

### Unregularized Logistic Regression

In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant's scores on two exams and the admissions decision. Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams.
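
As a minimal sketch of what such a model computes: logistic regression passes a linear combination of the two exam scores through the sigmoid function to obtain a probability. The parameter values and applicants below are hypothetical, chosen only for illustration; they are not fitted to the dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: intercept, weight for exam 1, weight for exam 2
theta = np.array([-25.0, 0.2, 0.2])

# Two made-up applicants: exam 1 and exam 2 scores
scores = np.array([[45.0, 85.0],
                   [50.0, 50.0]])

# Linear score theta0 + theta1*x1 + theta2*x2, then squash with the sigmoid
z = theta[0] + scores @ theta[1:]
p_admit = sigmoid(z)

# Predict admission when the estimated probability is at least 0.5
admitted = p_admit >= 0.5
```

With these assumed parameters the first applicant lands above the 0.5 threshold and the second below it.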

In [None]:
# Admission data: 
# - exam 1 score (x1) 
# - exam 2 score (x2)
# - admitted (y)
data = np.loadtxt('/aiuoa/datasets/students_1.txt', delimiter=',')

In [None]:
# Separate features (x1, x2) from target (y)
X, y = np.hsplit(data, np.array([2]))

In [None]:
# LogisticRegression expects y as a 1-D array
y = y.ravel()

Now let's visualise the data that has been provided to us.

In [None]:
fig, ax = plt.subplots()

# Mask for selecting positive and negative examples
y_pos = y == 1
y_neg = y == 0

# Plot examples
ax.plot(X[y_pos,0], X[y_pos,1], 'b+', label='Admitted')
ax.plot(X[y_neg,0], X[y_neg,1], 'yo', label='Not admitted')
ax.set_xlabel('Exam 1 score')
ax.set_ylabel('Exam 2 score')
ax.legend(loc='upper right')

In [None]:
# Use the 'lbfgs' solver, a quasi-Newton method comparable to what Octave's fminunc uses.
# C is the inverse regularization strength; a very large C effectively disables regularization.
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

In [None]:
theta0 = clf.intercept_[0]
theta1 = clf.coef_[0,0]
theta2 = clf.coef_[0,1]

In [None]:
# Computes x2 on the linear decision boundary from x1 and the model parameters.
# A predicted probability of 0.5 corresponds to theta0 + theta1*x1 + theta2*x2 = 0.
def x2(x1):
    return -(theta0 + theta1*x1) / theta2
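
The decision boundary can be derived from the model itself. Assuming the course's notation, the predicted probability is $h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ with $\sigma(z) = 1 / (1 + e^{-z})$. Since $\sigma(z) = 0.5$ exactly when $z = 0$, and provided $\theta_2 \neq 0$:

$$
h_\theta(x) = 0.5
\iff \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0
\iff x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}
$$

The threshold 0.5 applies to the probability, not to the linear score, so it does not appear in the expression for $x_2$.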

In [None]:
x1_min = X[:,0].min()
x1_max = X[:,0].max()

In [None]:
# x1 and x2 data of linear decision boundary
x1_plot = np.array([x1_min, x1_max])
x2_plot = x2(x1_plot)

In [None]:
fig, ax = plt.subplots()

# Mask for selecting positive and negative examples
y_pos = y == 1
y_neg = y == 0

# Plot examples
ax.plot(X[y_pos,0], X[y_pos,1], 'b+', label='Admitted')
ax.plot(X[y_neg,0], X[y_neg,1], 'yo', label='Not admitted')
ax.set_xlabel('Exam 1 score')
ax.set_ylabel('Exam 2 score')
ax.legend(loc='upper right')

# Plot decision boundary
ax.plot(x1_plot, x2_plot)

In [None]:
X_test = np.array([
    [45., 85.],
    [50., 50.],
    [80., 80.]
])

In [None]:
# Predict class
clf.predict(X_test)

In [None]:
# Predict class probabilities
clf.predict_proba(X_test)

In [None]:
theta0, theta1, theta2

In [None]:
# Classification accuracy on training set
clf.score(X, y)
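
For a classifier, `score` reports mean accuracy: the fraction of examples whose predicted label equals the true label. A minimal sketch with made-up labels (not taken from the dataset):

```python
import numpy as np

# Made-up true and predicted labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Accuracy = fraction of predictions that match the true labels
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.8
```

Because this is measured on the training set, it says how well the model fits the data it has seen, not how well it generalizes.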

### Regularized Logistic Regression

In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly. Suppose you are the product manager of the factory and you have the results of two different tests for some microchips. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.
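
A sketch of an L2-regularized cost, written in the form commonly used in the course exercise: mean cross-entropy plus a penalty weighted by `lam` that leaves the intercept unpenalized. The function name, argument names, and toy data are illustrative assumptions. scikit-learn's `LogisticRegression` expresses the penalty through `C`, the inverse regularization strength, so its objective is scaled differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Mean cross-entropy plus an L2 penalty on all parameters but the intercept.

    X is expected to carry a leading column of ones, so theta[0] is the intercept.
    """
    m = y.size
    h = sigmoid(X @ theta)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return cross_entropy + penalty

# Toy data, for illustration only
X = np.array([[1.0, 0.5],
              [1.0, -0.5]])
y = np.array([1.0, 0.0])
theta = np.array([0.0, 2.0])

cost_unregularized = regularized_cost(theta, X, y, lam=0.0)
cost_regularized = regularized_cost(theta, X, y, lam=1.0)
```

A larger `lam` makes large weights more expensive, which pushes the fit toward smoother decision boundaries.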

In [None]:
# Microchip test: 
# - test 1 (x1) 
# - test 2 (x2)
# - accepted=1, rejected=0 (y)
data = np.loadtxt('/aiuoa/datasets/students_2.txt', delimiter=',')

In [None]:
# Separate features (x1, x2) from target (y)
X, y = np.hsplit(data, np.array([2]))

In [None]:
# LogisticRegression expects y as a 1-D array
y = y.ravel()

In [None]:
fig, ax = plt.subplots()

# Mask for selecting positive and negative examples
y_pos = y == 1
y_neg = y == 0

# Plot examples
ax.plot(X[y_pos,0], X[y_pos,1], 'b+', label='Accepted')
ax.plot(X[y_neg,0], X[y_neg,1], 'yo', label='Rejected')
ax.set_xlabel('Microchip test 1')
ax.set_ylabel('Microchip test 2')
ax.legend(loc='upper right')

In [None]:
# Preprocessor to include polynomial features up to degree 6
poly = PolynomialFeatures(6, include_bias=False)

# Mean and standard deviation scaler
scaler = StandardScaler()

# Logistic regression classifier. 
# - C=1.0 will result in a good fit
# - C=1e4 will result in overfitting (too little regularization)
# - C=1e-2 will result in underfitting (too much regularization)
clf = LogisticRegression(C=1.0, solver='lbfgs')

# Pipeline of polynomial feature generator, feature scaler and logistic regression classifier
model = Pipeline([('poly', poly), ('scaler', scaler), ('clf', clf)])

# Fit data to model
model.fit(X, y)
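
To make the feature expansion concrete, the sketch below builds all monomials $x_1^a x_2^b$ with $1 \le a + b \le 6$ by hand. The helper name `map_features` and the sample points are made up for this illustration. For two inputs this yields 27 columns, the same number of features that `PolynomialFeatures(6, include_bias=False)` generates (28 with the bias column).

```python
import numpy as np

def map_features(x1, x2, degree=6):
    """Return all monomials x1**a * x2**b with 1 <= a + b <= degree as columns."""
    columns = []
    for total in range(1, degree + 1):
        for b in range(total + 1):
            columns.append(x1 ** (total - b) * x2 ** b)
    return np.column_stack(columns)

# Two made-up points, for illustration only
x1 = np.array([0.5, -1.0])
x2 = np.array([2.0, 0.25])

features = map_features(x1, x2)
print(features.shape)  # (2, 27)
```

A linear boundary in this 27-dimensional space corresponds to a curved boundary in the original two test scores, which is why regularization matters here.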

In [None]:
# Classification accuracy on training set
model.score(X, y)

In [None]:
grid = np.mgrid[-1:1:500j, -1:1:500j]

# Compute class predictions (0 or 1) on a 500*500 grid
X_grid = np.c_[grid[0].ravel(), grid[1].ravel()]
y_grid = model.predict(X_grid).reshape(grid[0].shape)

# Plot decision boundary on previous figure
cs = ax.contour(grid[0], grid[1], y_grid, levels=[0.5], colors='g')
ax.clabel(cs)

# Show previous figure with decision boundary
fig
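
The grid construction relies on a NumPy convention: a complex step such as `5j` in `np.mgrid` means "this many points, endpoints included" rather than a step size. A reduced 5×5 version of the grid, as a sketch:

```python
import numpy as np

# 5 points per axis from -1 to 1 inclusive (the notebook uses 500)
grid = np.mgrid[-1:1:5j, -1:1:5j]
print(grid.shape)  # (2, 5, 5)

# Flatten into one (x1, x2) row per grid point, the layout predict() expects
X_grid = np.c_[grid[0].ravel(), grid[1].ravel()]
print(X_grid.shape)  # (25, 2)
print(X_grid[0], X_grid[-1])  # [-1. -1.] [1. 1.]

# Predictions for X_grid can be reshaped back onto the grid with
# .reshape(grid[0].shape) before contouring.
```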