# Binary Classification - Credit Approval

Dataset:
https://archive.ics.uci.edu/ml/datasets/Credit+Approval

## Dataset observations

https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names

- Class distribution is quite balanced
- Columns are anonymized
- There are missing values

## Workflow

Data Gathering
1. read_csv

Data Transformation
2. transform dataframe
3. PCA to plot (for classification)
4. train-test split
5. scale

Training
6. logistic regression
7. SGD logistic regression

Validation
8. metrics
9. learning curve
10. prediction

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import learning_curve

## Data Gathering

1. read_csv

In [3]:
df = pd.read_csv('D:/tmp/credit-approval/crx.data',
                names=['A1', 'A2', 'A3', 'A4', 'A5', 'A6',
                      'A7', 'A8', 'A9', 'A10', 'A11', 'A12',
                      'A13', 'A14', 'A15', 'y'],
                na_values=['?', 'nan'])
df.head()

FileNotFoundError: File b'D:/tmp/credit-approval/crx.data' does not exist

## Data Transformation
2. transform dataframe
3. PCA to plot (for classification)
4. train-test split
5. scale

2. transform dataframe
  - change to numeric types
  - handle NaN values

In [None]:
# since we cannot interpolate the values, we'll drop them
# drop the NaN values before we perform encoding
df.dropna(inplace=True)

In [None]:
df.dtypes

In [None]:
df.A1.unique()

In [None]:
df.A4.unique()

In [None]:
# let's try one-hot encoding
columns_to_encode = ['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
np.testing.assert_array_equal(columns_to_encode, df.loc[:, columns_to_encode].columns)

dummies = pd.get_dummies(df.loc[:, columns_to_encode])
dummies.columns

In [None]:
df_1 = pd.concat([df, dummies], axis=1)
df_1.columns

In [None]:
# let's clean up some columns
df_1.drop(columns_to_encode, axis=1, inplace=True)
df_1.columns

In [None]:
# now check to make sure dtypes are all numeric
df_1.dtypes

In [None]:
# the last one we deal with is class
# since this is the classification output, the convention is to use 1, 0
df_1.y.unique()

In [None]:
# We can use sklearn.preprocessing.LabelEncoder, but that doesn't give us the
# ability to assign labels to numbers. For example, if we want '+' to be 1,
# and '-' to be 0. This is because LabelEncoder picks the first class it encounters
# and assigned the number accordingly.

y_enc = df_1.y.map({'+': 1, '-': 0})
y_enc

In [None]:
df_1.drop(['y'], axis=1, inplace=True) # drop the original y column
df_2 = pd.concat([df_1, y_enc], axis=1) # add the encoded y column
df_2.columns

In [None]:
df_2.dtypes

In [None]:
df_2.describe()

2. PCA to plot (for classification)

  - Plot a scatter plot with 2 feature dimensions (or 3 feature dimensions)  
  - Use colours for y_enc

In [None]:
pca = PCA(n_components=3)

X = df_2.loc[:, 'A2':'A13_s']
y = df_2.y

X.columns

In [None]:
X_3d = pca.fit_transform(X)
print('Before:', X.shape, 'After:', X_3d.shape)

In [None]:
# A better tutorial:
# https://matplotlib.org/gallery/mplot3d/scatter3d.html

# interactive plot
%matplotlib notebook

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X_3d[y==0, 0], X_3d[y==0, 1], X_3d[y==0, 2], color='r', label='y=0')
ax.scatter(X_3d[y==1, 0], X_3d[y==1, 1], X_3d[y==1, 2], color='b', label='y=1')
ax.set(xlabel='X_3d[:, 0]', ylabel='X_3d[:, 1]', zlabel='X_3d[:, 2]',
       title='PCA plot of Credit Approval dataset')
ax.legend()
plt.show()

The plot looks squished, let's try scaling the features first to see if we get a better view.

In [None]:
# we are plotting all the datapoints, so we want to fit to the whole dataset
# during training/testing, we'll still fit separately.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_scaled)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X_3d[y==0, 0], X_3d[y==0, 1], X_3d[y==0, 2], color='r', label='y=0')
ax.scatter(X_3d[y==1, 0], X_3d[y==1, 1], X_3d[y==1, 2], color='b', label='y=1')
ax.set(xlabel='X_3d[:, 0]', ylabel='X_3d[:, 1]', zlabel='X_3d[:, 2]',
       title='Scaled PCA plot of Credit Approval dataset')
ax.legend()
plt.show()

Okay, lastly, let's try a 2D plot.

In [None]:
# we are plotting all the datapoints, so we want to fit to the whole dataset
# during training/testing, we'll still fit separately.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111)

ax.scatter(X_2d[y==0, 0], X_2d[y==0, 1], color='r', label='y=0')
ax.scatter(X_2d[y==1, 0], X_2d[y==1, 1], color='b', label='y=1', alpha=.2) # alpha sets transparency
ax.set(xlabel='X_2d[:, 0]', ylabel='X_2d[:, 1]',
       title='Scaled PCA plot of Credit Approval dataset')
ax.grid()
ax.legend()
plt.show()

As you can see above, visualization is also iterative. 

4. train-test split

In [None]:
# we'll split the unscaled features
# then scale them using just the mean & variance of the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

5. scale

In [None]:
X_scaler = StandardScaler()
X_train_scaled = X_scaler.fit_transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# note that you don't scale y. It's a class output, which has only individual (discrete)
# values such as 0 vs 1.

## Training
6. logistic regression
7. SGD logistic regression

In [None]:
logistic = LogisticRegression(random_state=42)
logistic.fit(X_train_scaled, y_train)
y_pred_logistic = logistic.predict(X_test_scaled)

In [None]:
sgd = SGDClassifier(tol=1e-4, max_iter=1000, verbose=True, random_state=42)
sgd.fit(X_train_scaled, y_train)
y_pred_sgd = sgd.predict(X_test_scaled)

## Validation
8. metrics
9. learning curve
10. prediction

8. metrics

In [None]:
# Classification report. See classification.ipynb for details
print(classification_report(y_test, y_pred_logistic))
print(classification_report(y_test, y_pred_sgd))

In [None]:
# Confusion matrix. See classification.ipynb for details
cm_logistic = confusion_matrix(y_test, y_pred_logistic)
cm_sgd = confusion_matrix(y_test, y_pred_sgd)

In [None]:
cm_sgd

In [None]:
# !conda install -y seaborn

# matplotlib can plot confusion matrices, isn't as easy as seaborn

%matplotlib inline
import seaborn as sns

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
ax = axes.flatten()

# annotate cells with ticks
sns.heatmap(cm_logistic, annot=True, ax=ax[0])
sns.heatmap(cm_sgd, annot=True, ax=ax[1])

ax[0].set(xlabel='Predicted labels', ylabel='True labels', title='Confusion Matrix (Logistic Regression)') 
ax[0].xaxis.set_ticklabels(['Denied', 'Approved'])
ax[0].yaxis.set_ticklabels(['Denied', 'Approved'])

ax[1].set(xlabel='Predicted labels', ylabel='True labels', title='Confusion Matrix (Logistic Regression using SGD)'); 
ax[1].xaxis.set_ticklabels(['Denied', 'Approved'])
ax[1].yaxis.set_ticklabels(['Denied', 'Approved'])
plt.show()

9. learning curve

In [None]:
logistic_2 = LogisticRegression(random_state=42)
train_sizes, train_score, val_score = learning_curve(logistic_2, X_train_scaled, y_train)

train_mean = np.mean(train_score, axis=1)
val_mean = np.mean(val_score, axis=1)

print('train_size', 'mean_train_score (3-fold cv)', 'mean_val_score (3-fold cv)')
for train_size, t, m in zip(train_sizes, train_mean, val_mean):
    print(train_size, t, m)

fig, ax = plt.subplots(figsize=(15, 10))
ax.plot(train_sizes, train_mean, label='train score', marker='x')
ax.plot(train_sizes, val_mean, label='val score', marker='o')

# LogisticRegression.score() is the mean accuracy
ax.set(xlabel='train size', ylabel='mean accuracy', title='Learning Curve for Logistic Regression')
ax.grid()
ax.legend()
plt.show()    

In [None]:
sgd_2 = SGDClassifier(tol=1e-4, max_iter=1000, random_state=42)
train_sizes, train_score, val_score = learning_curve(sgd_2, X_train_scaled, y_train)

train_mean = np.mean(train_score, axis=1)
val_mean = np.mean(val_score, axis=1)

print('train_size', 'mean_train_score (3-fold cv)', 'mean_val_score (3-fold cv)')
for train_size, t, m in zip(train_sizes, train_mean, val_mean):
    print(train_size, t, m)

fig, ax = plt.subplots(figsize=(15, 10))
ax.plot(train_sizes, train_mean, label='train score', marker='x')
ax.plot(train_sizes, val_mean, label='val score', marker='o')

# LogisticRegression.score() is the mean accuracy
ax.set(xlabel='train size', ylabel='mean accuracy', title='Learning Curve for Logistic Regression (SGD)')
ax.grid()
ax.legend()
plt.show()    

10. prediction

In [147]:
test = X_test
truth = y_test.values

pred_lr = logistic.predict(test)
pred_sgd = sgd.predict(test)

print('Number of mislabeled points out of %d points:' % test.shape[0])
print('Logistic Regression: %d, Mean Accuracy: %.3f' % ((truth != pred_lr).sum(),
                                              logistic.score(test, truth)))
print('Logistic Regression (SGD): %d, Mean Accuracy: %.3f' % ((truth != pred_sgd).sum(),
                                                            sgd.score(test, truth)))

# print first 10 test datapoints and predictions
print()
print('Truth (1=approved, 0=denied)', truth[:10])
print('Logistic Regression', pred_lr[:10])
print('Logistic Regression with SGD', pred_sgd[:10])

Number of mislabeled points out of 164 points:
Logistic Regression: 53, Mean Accuracy: 0.677
Logistic Regression (SGD): 52, Mean Accuracy: 0.683

Truth (1=approved, 0=denied) [0 0 1 0 0 0 0 1 0 0]
Logistic Regression [1 1 1 1 0 1 1 1 0 1]
Logistic Regression with SGD [1 1 1 1 0 1 0 1 0 1]


## Naive Bayes Classifier

This is a probability-based classifier using the Bayes Theorem.

http://scikit-learn.org/stable/modules/naive_bayes.html

The probability distribution used here is Gaussian. There are other distributions supported (multinomial, bernoulli), depending on how the dataset could be distributed.

In [1]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train_scaled, y_train)
y_pred_nb = nb.predict(X_test_scaled)

print(classification_report(y_test, y_pred_nb))
cm_nb = confusion_matrix(y_test, y_pred_nb)

fig, ax = plt.subplots()
sns.heatmap(cm_nb, annot=True, ax=ax)

ax.set(xlabel='Predicted labels', ylabel='True labels', title='Confusion Matrix (Gaussian Naive Bayes)') 
ax.xaxis.set_ticklabels(['Denied', 'Approved'])
ax.yaxis.set_ticklabels(['Denied', 'Approved'])
plt.show()

NameError: name 'X_train_scaled' is not defined

In [None]:
# learning curve
train_sizes, train_score, val_score = learning_curve(GaussianNB(), X_train_scaled, y_train)

train_mean = np.mean(train_score, axis=1)
val_mean = np.mean(val_score, axis=1)

print('train_size', 'mean_train_score (3-fold cv)', 'mean_val_score (3-fold cv)')
for train_size, t, m in zip(train_sizes, train_mean, val_mean):
    print(train_size, t, m)

fig, ax = plt.subplots(figsize=(15, 10))
ax.plot(train_sizes, train_mean, label='train score', marker='x')
ax.plot(train_sizes, val_mean, label='val score', marker='o')

ax.set(xlabel='train size', ylabel='mean accuracy', title='Learning Curve for Gaussian Naive Bayes')
ax.grid()
ax.legend()
plt.show()

## Support Vector Classifier

We'll try the SVC classifier with the default kernel (radial basis function). 

This kernel function is used to generate the predictions for a test data value, based on finding the separation boundary (hyperplane) with the largest margin.

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Other supported kernels are: linear, polynomial.

In [None]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train_scaled, y_train)
y_pred_svc = svc.predict(X_test_scaled)

print(classification_report(y_test, y_pred_svc))
cm_nb = confusion_matrix(y_test, y_pred_svc)

fig, ax = plt.subplots()
sns.heatmap(cm_nb, annot=True, ax=ax)

ax.set(xlabel='Predicted labels', ylabel='True labels', title='Confusion Matrix (SVM with RBF kernel)') 
ax.xaxis.set_ticklabels(['Denied', 'Approved'])
ax.yaxis.set_ticklabels(['Denied', 'Approved'])
plt.show()

In [None]:
# learning curve
train_sizes, train_score, val_score = learning_curve(SVC(), X_train_scaled, y_train)

train_mean = np.mean(train_score, axis=1)
val_mean = np.mean(val_score, axis=1)

print('train_size', 'mean_train_score (3-fold cv)', 'mean_val_score (3-fold cv)')
for train_size, t, m in zip(train_sizes, train_mean, val_mean):
    print(train_size, t, m)

fig, ax = plt.subplots(figsize=(15, 10))
ax.plot(train_sizes, train_mean, label='train score', marker='x')
ax.plot(train_sizes, val_mean, label='val score', marker='o')

ax.set(xlabel='train size', ylabel='mean accuracy', title='Learning Curve for SVM with RBF kernel')
ax.grid()
ax.legend()
plt.show()

## K-nearest Neighbours Classifier

The last classifier we'll try is K-nearest neighbours.

As the name implies, this looks at the k-closest neighbouring labeled datapoints, and determines (by majority vote) which class a test data point belongs to.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [None]:
from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier()
kn.fit(X_train_scaled, y_train)
y_pred_kn = kn.predict(X_test_scaled)

print(classification_report(y_test, y_pred_kn))
cm_nb = confusion_matrix(y_test, y_pred_kn)

fig, ax = plt.subplots()
sns.heatmap(cm_nb, annot=True, ax=ax)

ax.set(xlabel='Predicted labels', ylabel='True labels', title='Confusion Matrix (k=5 nearest neighbours)') 
ax.xaxis.set_ticklabels(['Denied', 'Approved'])
ax.yaxis.set_ticklabels(['Denied', 'Approved'])
plt.show()

In [None]:
# learning curve
train_sizes, train_score, val_score = learning_curve(KNeighborsClassifier(), X_train_scaled, y_train)

train_mean = np.mean(train_score, axis=1)
val_mean = np.mean(val_score, axis=1)

print('train_size', 'mean_train_score (3-fold cv)', 'mean_val_score (3-fold cv)')
for train_size, t, m in zip(train_sizes, train_mean, val_mean):
    print(train_size, t, m)

fig, ax = plt.subplots(figsize=(15, 10))
ax.plot(train_sizes, train_mean, label='train score', marker='x')
ax.plot(train_sizes, val_mean, label='val score', marker='o')

ax.set(xlabel='train size', ylabel='mean accuracy', title='Learning Curve for k=5 Nearest Neighbors')
ax.grid()
ax.legend()
plt.show()

## Visualizing Boundaries

In classification (and clustering) problems, it is helpful to visualize how each model defines the boundaries separating the classes.

This can also help give some intuition on what each classifier does, and more helpfully, how the boundaries can change with different hyperparameters.

To do this, we fit each model on the 2-dimensional features and then plot a contour.

In [None]:
# Helper plotting functions to plot boundaries
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Args:
        x: data to base x-axis meshgrid on
        y: data to base y-axis meshgrid on
        h: stepsize for meshgrid, optional

    Returns:
        xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Args:
        ax: matplotlib axes object
        clf: a classifier
        xx: meshgrid ndarray
        yy: meshgrid ndarray
        params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

In [None]:
# For visualization purposes only
pca = PCA(n_components=2)
X_train_scaled_2d = pca.fit_transform(X_train_scaled)

X0, X1 = X_train_scaled_2d[:, 0], X_train_scaled_2d[:, 1]
xx, yy = make_meshgrid(X0, X1)

In [None]:
# Logistic Regression
logistic_2d = LogisticRegression()
logistic_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, logistic_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_train_scaled_2d[:, 0]', ylabel='X_train_scaled_2d[:, 1]',
       title='2d boundary plot (Logistic Regression)')
plt.show()

In [None]:
# Gaussian Naive Bayes
nb_2d = GaussianNB()
nb_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, nb_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_train_scaled_2d[:, 0]', ylabel='X_train_scaled_2d[:, 1]',
       title='2d boundary plot (Naive Bayes with Gaussian distribution)')
plt.show()

In [None]:
# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB

bnb_2d = BernoulliNB()
bnb_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, bnb_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_train_scaled_2d[:, 0]', ylabel='X_train_scaled_2d[:, 1]',
       title='2d boundary plot (Naive Bayes with Bernoulli distribution)')
plt.show()

In [None]:
# SVM with RBF kernel
svc_2d = SVC()
svc_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, svc_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_train_scaled_2d[:, 0]', ylabel='X_train_scaled_2d[:, 1]',
       title='2d boundary plot (SVM with RBF kernel)')
plt.show()

In [None]:
# SVM with polynomial kernel
svc_2d_poly = SVC(kernel='poly') # defaults to degrees=3
svc_2d_poly.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, svc_2d_poly, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_train_scaled_2d[:, 0]', ylabel='X_train_scaled_2d[:, 1]',
       title='2d boundary plot (SVM with 3-degree polynomial kernel)')
plt.show()

In [None]:
# K-nearest neighbours, k=5 (default)
kn_2d = KNeighborsClassifier()
kn_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, kn_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_train_scaled_2d[:, 0]', ylabel='X_train_scaled_2d[:, 1]',
       title='2d boundary plot (k=5 Nearest Neighbours)')
plt.show()

In [None]:
# K-nearest neighbours, k=3
kn_2d = KNeighborsClassifier(n_neighbors=3)
kn_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, kn_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_2d[:, 0]', ylabel='X_2d[:, 1]',
       title='2d boundary plot (k=3 Nearest Neighbours)')
plt.show()

In [None]:
# K-nearest neighbours, k=10
kn_2d = KNeighborsClassifier(n_neighbors=10)
kn_2d.fit(X_train_scaled_2d, y_train)

fig, ax = plt.subplots()
plot_contours(ax, kn_2d, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X0, X1, c=y_train, cmap=plt.cm.coolwarm, s=20, edgecolors='k')
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set(xlabel='X_2d[:, 0]', ylabel='X_2d[:, 1]',
       title='2d boundary plot (k=10 Nearest Neighbours)')
plt.show()

### Classifier Quality

Area under ROC curve - the larger the better.

![example](https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/ROC_space-2.png/1024px-ROC_space-2.png)

In [None]:
# For comparison, let's see how an ROC curve for a dummy classifier looks like
# http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html
from sklearn.dummy import DummyClassifier

# Baseline
baseline = DummyClassifier(random_state=42)
baseline.fit(X_train_scaled, y_train)
y_confidence_baseline = baseline.predict_proba(X_test_scaled)

# predict_proba: prediction confidence
# decision_function: distance to the decision boundary/hyperplane

# y_confidence_baseline[:, 0] returns probabilities for class 0
# y_confidence_baseline[:, 1] returns probabilities for class 1
fpr_baseline, tpr_baseline, _ = roc_curve(y_test, y_confidence_baseline[:, 1], pos_label=1)
auc_baseline = auc(fpr_baseline, tpr_baseline)

# Logistic Regression
y_confidence_logistic = logistic.predict_proba(X_test_scaled)
fpr_logistic, tpr_logistic, _ = roc_curve(y_test, y_confidence_logistic[:, 1], pos_label=1)
auc_logistic = auc(fpr_logistic, tpr_logistic)

# Naive Bayes
y_confidence_nb = nb.predict_proba(X_test_scaled)
fpr_nb, tpr_nb, _ = roc_curve(y_test, y_confidence_nb[:, 1], pos_label=1)
auc_nb = auc(fpr_nb, tpr_nb)

# Support Vector Machine
y_confidence_svc = svc.decision_function(X_test_scaled)
fpr_svc, tpr_svc, _ = roc_curve(y_test, y_confidence_svc, pos_label=1)
auc_svc = auc(fpr_svc, tpr_svc)

# K-nearest Neighbor
y_confidence_kn = kn.predict_proba(X_test_scaled)
fpr_kn, tpr_kn, _ = roc_curve(y_test, y_confidence_kn[:, 1], pos_label=1)
auc_kn = auc(fpr_kn, tpr_kn)

# Plot the ROCs
fig, ax = plt.subplots(figsize=(10, 8))

ax.plot(fpr_baseline, tpr_baseline, label='Baseline (area = %0.2f)' % auc_baseline,
        linestyle='dashed')
ax.plot(fpr_logistic, tpr_logistic, label='Logistic Regression (area = %0.2f)' % auc_logistic)
ax.plot(fpr_nb, tpr_nb, label='Naive Bayes (area = %0.2f)' % auc_nb)
ax.plot(fpr_svc, tpr_svc, label='SVM (area = %0.2f)' % auc_svc)
ax.plot(fpr_kn, tpr_kn, label='K Nearest Neighbors (area = %0.2f)' % auc_kn)

# bigger area is better
ax.set(xlabel='false positive rate', ylabel='true positive rate',
       title='ROC curves for Credit Approval Dataset (positive label=1)')
ax.legend()
ax.grid()
plt.show()

### ROC curve - how is it derived

- y-axis = True positive rate
- x-axis = False positive rate

What it does:
1. Gets a subset of thresholds from the confidences or decision_function output (how far from decision boundary) from predicting X_test_scaled
2. Sorts the thresholds, dropping duplicates.
3. At each threshold, preform a cumulative sum to get its true positive count. Derive the false positive count from the true positive count.
4. Compute the tpr and fpr from the counts.

Source:
- https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/metrics/ranking.py#L453

In [None]:
fpr_logistic, tpr_logistic, threshold = roc_curve(y_test, y_confidence_logistic[:, 1],
                                                  pos_label=1)

fig, ax = plt.subplots(figsize=(10, 8))

ax.plot(fpr_logistic, tpr_logistic, label='Logistic Regression (area = %0.2f)' % auc_logistic,
       marker='o')

for i, (th, fpr, tpr) in enumerate(zip(threshold, fpr_logistic, tpr_logistic)):
    # label every few points
    if i % 5 == 0:
        ax.annotate('%.2f' % th, (fpr, tpr))

ax.set(xlabel='false positive rate', ylabel='true positive rate',
       title='ROC curve for Credit Approval Dataset (positive label=1)')

ax.legend()
ax.grid()
plt.show()

In [None]:
print(threshold[:5]) # prediction confidence values
print(fpr_logistic[:5]) # false positive rates
print(tpr_logistic[:5]) # true positive rates

In [None]:
# prediction confidence for each X_test_scaled
# either the estimator.predict_proba() or the estimator.decision_function()
#
# only a subset will be selected to be thresholds
# (drop colinear (same line) and duplicates)
#
print(y_confidence_logistic)