<div style="text-align:center;">
    <img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">
    <h1>CSCI 416-01/516-01: Fundamentals of AI/ML</h1>
    <h1>Fall 2025</h1>
    <h1>Model evaluation and tuning</h1>
</div>

## The Wisconsin cancer dataset

We will use the Wisconsin cancer dataset.

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

print(data.DESCR)
print("Feature names: ", data.feature_names)
print("Target names:  ", data.target_names)

In [None]:
X = data.data
y = data.target
print(X[0:9,:])
print(y[0:9])

The version of the dataset in Scikit-Learn does not explain this, but the label 0 means **malignant** and label 1 means **benign**.  I find this confusing, so let's reverse the labels:

In [None]:
y = (y == 0)

print(y[0:9])

In [None]:
print(X.shape)

## Create training and test sets

We will use a 70/30 training/test split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Combining transformers and estimators in a pipeline

Our pipeline will consist of (in order):
1. the standard scaler (mean 0, variance 1), and
2. a logistic regression classifier.

We will also fit and test the model.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('scl', StandardScaler()),
            ('clf', LogisticRegression(random_state=42))])

pipe_lr.fit(X_train, y_train)
print(f"Test Accuracy: {pipe_lr.score(X_test, y_test):.3f}")

# Using k-fold cross-validation to assess model performance

[**Cross-validation**](http://scikit-learn.org/stable/modules/cross_validation.html) is one approach to assessing model performance.  Cross-validation allows us to assess our model **before** we try it on the test set.

In basic k-fold cross-validation, we randomly partition the t training cases into k disjoint sets (**folds**) of size t/k.

We then treat, in turn, each of the k sets as a validation set.
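
The partitioning described above can be sketched on a toy problem.  This is only an illustration: the array `X_toy`, the choice of t = 10 cases, and k = 5 folds are assumptions made for the example, and `KFold` is scikit-learn's plain (unstratified) splitter.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: t = 10 training cases, k = 5 folds (both chosen for illustration).
X_toy = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy)):
    # Each fold holds out t/k = 2 cases and trains on the remaining 8.
    print(f"Fold {fold + 1}: train={train_idx}, validation={val_idx}")
    validation_indices.extend(val_idx.tolist())

# Every case lands in exactly one validation fold.
print(sorted(validation_indices))
```

Because the folds are disjoint and cover the training set, each case is used for validation exactly once and for training k - 1 times.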

For very small datasets one would typically choose t-fold cross-validation.  That is, there are t validation sets, each of size 1.  This is called **leave-one-out (LOO) cross-validation**.

**Stratified k-fold cross-validation** is a variation of k-fold cross-validation that can yield better results, particularly when the class proportions are markedly unequal.  In stratified cross-validation, the class proportions are preserved in each fold.  This ensures that each fold is representative of the class proportions in the training dataset.

The term "stratified" comes from **stratified sampling**.  In stratified sampling, each sub-population, or stratum, is sampled independently so that the sample proportions will reflect the population proportions.  Stratification refers to the process of dividing the population into disjoint, homogeneous subgroups before sampling.

We will use stratified cross-validation as implemented in the [StratifiedKFold class](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) to evaluate our classifier.
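
To see what stratification changes, here is a small sketch on made-up labels (90 cases of one class, 10 of the other; the names `X_toy` and `y_toy` are assumptions for this example).  It compares the class counts in each validation fold for plain `KFold` and for `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 cases of class 0 followed by 10 of class 1 (illustrative only).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # features are irrelevant to how the folds are formed

# Count the members of each class in every validation fold.
plain_counts = [np.bincount(y_toy[val], minlength=2).tolist()
                for _, val in KFold(n_splits=5).split(X_toy)]
strat_counts = [np.bincount(y_toy[val], minlength=2).tolist()
                for _, val in StratifiedKFold(n_splits=5).split(X_toy, y_toy)]

print("KFold validation class counts:          ", plain_counts)
print("StratifiedKFold validation class counts:", strat_counts)
```

Without shuffling, plain `KFold` puts all of the rare class in the last fold, while every stratified fold keeps the 9:1 ratio of the full set.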

In [None]:
import numpy as np

print('no. of benign cases:   ', np.sum(y_train == 0))
print('no. of malignant cases:', np.sum(y_train == 1))

In [None]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=10)

scores = []
for k, (train, test) in enumerate(skfold.split(X_train, y_train)):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print(f"Fold: {k+1:2d}, Class distributions: {str(np.bincount(y_train[train])):s}, Accuracy: {score:f}")
    
print()
print(f"Cross-validation accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

We can obtain the same cross-validation score more succinctly using scikit-learn's [cross_val_score()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) function.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=pipe_lr, 
                         X=X_train, 
                         y=y_train, 
                         cv=10,
                         n_jobs=1)
print(f"CV accuracy scores: {str(scores):s}")
print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

Cross-validation is "embarrassingly parallel": the calculations for each fold are independent and can be spread across multiple processors.  We can set the number of cores/processors to use with the <code>n_jobs</code> argument.

# Debugging algorithms with learning curves

An ML model must navigate the Scylla and Charybdis of **bias** and **variance**.

Bias means we underfit the data due to modeling errors or other deficiencies in our model.  **Bias typically decreases with model complexity &ndash; more complex models are capable of capturing more features of the problem**.

Variance refers to how our model's performance varies from one set of data to another.  **Variance increases with model complexity &ndash; in general, the more complex a model, the more sensitive it becomes to the choice of training data, and the more likely it is that the model will not generalize to new data because of overfitting to the training set**.

There is a trade-off between bias and variance: a certain level of model complexity is needed to avoid bias, but too complex a model will suffer from high variance.
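
As a rough illustration of the trade-off, the sketch below fits a very simple model (a depth-1 decision tree) and a very flexible one (an unrestricted tree) to the cancer data and compares training and test accuracy.  The choice of decision trees, the split, and the names `X_demo`, `Xtr`, and so on are assumptions made for this example; the labels are used as scikit-learn provides them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

results = {}
for depth in [1, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    results[depth] = (tree.score(Xtr, ytr), tree.score(Xte, yte))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A gap between training and test accuracy for the unrestricted tree is the signature of variance; a low score on both sets for the depth-1 tree would be the signature of bias.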

## Diagnosing bias and variance problems with learning curves

We can understand the bias and variance behavior of our model by examining how it does on training/test sets of varying size.
* We look at the trends of performance on the training and test sets as the size of each increases.
* When both sets are small, we should not expect great results.  But what happens as the size increases?
* If the bias is large, we should not expect good performance for either the training set or the test set. &#128577; &#128577;
* If the variance is high, we should see good performance on the training set but significantly poorer performance on the test set. &#128578; &#128577;

Ideally we will have low bias and low variance achieving good accuracy on both the training and test sets. &#128578; &#128578;

In Scikit-Learn we can generate plots like this using the <code>[learning_curve()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html)</code> function.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

pipe_lr = Pipeline([('scl', StandardScaler()),
            ('clf', LogisticRegression(penalty='l2', random_state=0))])

train_sizes, train_scores, test_scores = \
                learning_curve(estimator=pipe_lr, 
                X=X_train, 
                y=y_train, 
                train_sizes=np.linspace(0.1, 1.0, 10), 
                cv=10,
                n_jobs=1)

train_mean = np.mean(train_scores, axis=1)
train_std  = np.std(train_scores, axis=1)
test_mean  = np.mean(test_scores, axis=1)
test_std   = np.std(test_scores, axis=1)

plt.plot(train_sizes, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='training accuracy')

plt.fill_between(train_sizes, 
                 train_mean + train_std,
                 train_mean - train_std, 
                 alpha=0.15, color='blue')

plt.plot(train_sizes, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='validation accuracy')

plt.fill_between(train_sizes, 
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')

plt.grid()
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.8, 1.0])
plt.tight_layout()
# plt.savefig('./figures/learning_curve.png', dpi=300)
plt.show()

How does this classifier do with regard to bias (underfitting) and variance (overfitting)?

Bias: &#128578;

Variance: &#128578;

# Addressing overfitting and underfitting with validation curves

We can often address overfitting (variance) and underfitting (bias) by tuning hyperparameters in our model during training.

We can see the effects of changing one hyperparameter using Scikit-Learn's <code>[validation_curve()](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.validation_curve.html#sklearn.model_selection.validation_curve)</code> function.

Let's look at logistic regression.  The problem being solved in [training the scikit-learn implementation of logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for binary classification is
$$
  \newcommand{\norm}[1]{\| #1 \|}
  \newcommand{\minimize}{\mbox{minimize}}
  \minimize_{w,b} \sum_{i=1}^{N} \log\left(1 + \exp(-y_{i}(x_{i}^{T}w + b))\right) + \frac{1}{C} \norm{w}_{2}^{2}.
$$
The term in the summation is the negative of the log-likelihood function (minimizing the negative is equivalent to maximizing the log-likelihood), with $y_{i} = \pm 1$ being the class indicators.  We used this trick to simplify notation for SVC (compare the preceding with the formulation in the notes).

$\newcommand{\twonorm}[1]{\norm{#1}_{2}}$
The term $\frac{1}{C} \twonorm{w}^{2}$ is a **regularization** or **penalty** term that keeps the model parameters from becoming too big.

The weight $1/C$ controls the tradeoff between keeping $w$ small (a simpler model with more bias and less variance) and minimizing the negative log-likelihood (a more complex model with less bias and more variance).

If $C$ is small, we are placing a greater emphasis on reducing variance; if $C$ is large we are placing a greater emphasis on reducing bias.
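
One way to see the role of $C$ is to fit the same model for several values and look at the size of the learned weights.  The sketch below is a minimal illustration under assumed settings: the three values of `C`, the large `max_iter`, and the names `X_demo` and `weight_norms` are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

weight_norms = {}
for C in [0.001, 1.0, 1000.0]:
    model = Pipeline([("scl", StandardScaler()),
                      ("clf", LogisticRegression(C=C, max_iter=10000))])
    model.fit(X_demo, y_demo)
    # coef_ holds the weight vector w of the fitted classifier.
    w = model.named_steps["clf"].coef_
    weight_norms[C] = np.linalg.norm(w)
    print(f"C={C:8.3f}  ||w||_2 = {weight_norms[C]:.3f}")
```

Small $C$ penalizes large weights heavily and shrinks $w$ toward zero; large $C$ lets the weights grow to fit the training data more closely.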

In [None]:
from sklearn.model_selection import validation_curve

param_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
                estimator=pipe_lr, 
                X=X_train, 
                y=y_train, 
                param_name='clf__C', 
                param_range=param_values,
                cv=10)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(param_values, train_mean, 
         color='blue', marker='o', 
         markersize=5, label='training accuracy')

plt.fill_between(param_values, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_values, test_mean, 
         color='green', linestyle='--', 
         marker='s', markersize=5, 
         label='validation accuracy')

plt.fill_between(param_values,
                 test_mean + test_std,
                 test_mean - test_std, 
                 alpha=0.15, color='green')

plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1.0])
plt.tight_layout()
plt.show()

* When $C$ is small, the accuracies for the training and validation sets are similar, but the accuracy is lower than we can achieve for other values of $C$.  This is a sign of higher bias (underfitting), but lower variance (overfitting).
* As $C$ increases, we obtain higher accuracy (lower bias) but performance on the two sets begins to diverge.  This is a sign of decreasing bias, but increasing variance (overfitting).
* When $C$ is large, we obtain higher accuracy on the training set but decreasing accuracy on the validation set.  This is a sign of increasing variance and overfitting.

# Tuning hyperparameters via grid search

We can tune the hyperparameter values for our ML algorithms by searching over a grid of candidate values.

The search is embarrassingly parallel, so we can distribute it over multiple cores/processors.

On the other hand, the number of models we need to train can grow quickly: if we are changing, say, 6 hyperparameters, and we want to try 10 values for each, then we have a $10 \times 10 \times 10 \times 10 \times 10 \times 10$ grid, which has $10^{6}$ grid points! 
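
The growth in the number of grid points can be checked with scikit-learn's `ParameterGrid`, which enumerates the combinations in a grid specification.  In this sketch the hyperparameter names `p0` through `p5` are hypothetical placeholders, and the two-part SVC grid is an illustrative example of a grid given as a list of dictionaries.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical hyperparameter names p0..p5, each with 10 candidate values.
big_grid = {f"p{i}": list(range(10)) for i in range(6)}
n_big = len(ParameterGrid(big_grid))
print(n_big)

# A grid given as a list of dictionaries is the union of the individual grids.
values = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
svc_grid = [{"clf__C": values, "clf__kernel": ["linear"]},
            {"clf__C": values, "clf__gamma": values, "clf__kernel": ["rbf"]}]
n_svc = len(ParameterGrid(svc_grid))
print(n_svc)
```

The second grid has 8 linear-kernel points plus 8 &times; 8 RBF-kernel points; if each point is evaluated with 10-fold cross-validation, the number of model fits is ten times the number of grid points.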

We can perform exhaustive grid search using scikit-learn's [<code>GridSearchCV()</code>](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class.  Rather than training a single model for each combination of hyperparameter values, <code>GridSearchCV()</code> evaluates each combination using cross-validation.  By default it uses five-fold cross-validation, which is stratified when the estimator is a classifier.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range, 
               'clf__kernel': ['linear']},
                 {'clf__C': param_range, 
                  'clf__gamma': param_range, 
                  'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  cv=10,
                  n_jobs=-1)  # n_jobs = -1 means use all cores
gs = gs.fit(X_train, y_train)
print(f"{gs.best_score_:.3f}")
print(gs.best_params_)

In [None]:
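# GridSearchCV refits the best model on all of X_train by default (refit=True),
# so best_estimator_ is ready to use without another call to fit().
clf = gs.best_estimator_
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")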

## Algorithm selection with nested cross-validation

Careful! Cross-validation further increases the computational cost!

For instance, if we pass <code>cv=2</code> in our call to <code>GridSearchCV()</code> and then evaluate the resulting search object with five-fold <code>cross_val_score()</code>, we are actually performing a $5 \times 2$ nested CV: each of the 5 outer folds runs its own complete grid search with 2 inner folds.

In [None]:
gs = GridSearchCV(estimator=pipe_svc, 
                  param_grid=param_grid, 
                  scoring='accuracy', 
                  cv=2)

scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")

Here is grid search applied to a decision tree.  We vary the maximum depth of the tree in the search.

In [None]:
from sklearn.tree import DecisionTreeClassifier
gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0), 
                            param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}], 
                            scoring='accuracy', 
                            cv=2)
scores = cross_val_score(gs, X_train, y_train, scoring='accuracy', cv=5)
print('CV accuracy: {0:.3f} +/- {1:.3f}'.format(np.mean(scores), np.std(scores)))

## Randomized search

There is also a [<code>RandomizedSearchCV()</code>](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) function that performs a randomized search of a subset of the hyperparameter space as a means of reducing the computational cost of tuning.
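
Here is a minimal sketch of a randomized search over the SVC hyperparameters.  The log-uniform sampling ranges, `n_iter=20`, and the variable names are illustrative assumptions, not tuned or recommended values; the labels are used as scikit-learn provides them.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

pipe = Pipeline([("scl", StandardScaler()), ("clf", SVC(kernel="rbf"))])

# Sample C and gamma from log-uniform distributions instead of a fixed grid.
# The ranges and n_iter below are illustrative choices, not tuned values.
param_distributions = {"clf__C": loguniform(1e-4, 1e3),
                       "clf__gamma": loguniform(1e-4, 1e3)}

rs = RandomizedSearchCV(estimator=pipe,
                        param_distributions=param_distributions,
                        n_iter=20,
                        scoring="accuracy",
                        cv=5,
                        random_state=0)
rs.fit(Xtr, ytr)
print(f"{rs.best_score_:.3f}")
print(rs.best_params_)
```

The cost is fixed by `n_iter` (here 20 candidates times 5 folds) no matter how many hyperparameters are sampled, which is what makes the approach attractive when a full grid would be too large.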

# The confusion matrix for the cancer data

Let's look at the confusion matrix for the cancer data, using our SVC pipeline for classification.

In [None]:
from sklearn.metrics import confusion_matrix
pipe_svc.fit(X_train, y_train)
y_pred = pipe_svc.predict(X_test)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay.from_estimator(
    pipe_svc,
    X_test,
    y_test,
    # After reversing the labels, class 0 (False) is benign and class 1 (True) is malignant.
    display_labels=["Benign", "Malignant"],
    normalize="true"
)
disp.ax_.set_title("Normalized confusion matrix for the SVC")

## Optimizing the precision and recall of a classification model

The cancer data provide a good context for reviewing the concepts of precision and recall.
* We will denote the malignant class by $\oplus$, meaning "positive for cancer".
* We will denote the benign class by $\ominus$, meaning "negative for cancer".

**Precision (P)**: of all the $\oplus$'s you found, how many of them were really $\oplus$'s?  Good precision means that the cases we diagnose as cancer really are cancer.

**Recall (R)**: of all the $\oplus$'s that are really there, how many of them did you find?  Good recall means that we are not missing cases of cancer.

The F1 score measures both precision and recall by taking their harmonic mean:
$$
  \mbox{F1} = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)}.
$$
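
As a check on these definitions, the sketch below computes precision, recall, and F1 by hand from the entries of a confusion matrix and compares them with scikit-learn's functions.  The toy label vectors `y_true` and `y_hat` are made up for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (1 = positive); made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat  = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

P = tp / (tp + fp)                       # of the predicted positives, how many are real?
R = tp / (tp + fn)                       # of the real positives, how many did we find?
F1 = 1.0 / (0.5 * (1.0 / P + 1.0 / R))   # harmonic mean of P and R
print(f"P = {P:.3f}, R = {R:.3f}, F1 = {F1:.3f}")

print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```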

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_true=y_test, y_pred=y_pred):.3f}")
print(f"Recall:    {recall_score(y_true=y_test, y_pred=y_pred):.3f}")
print(f"F1:        {f1_score(y_true=y_test, y_pred=y_pred):.3f}")

By default the grid search algorithms use the performance metric (scoring function) provided by the classifier.

However, we can create our own scorer and use that to optimize hyperparameters in a grid search.

Here is an example of using the F1 score:

In [None]:
from sklearn.metrics import make_scorer, f1_score

scorer = make_scorer(f1_score, pos_label=1)

c_gamma_range = [0.01, 0.1, 1.0, 10.0]

param_grid = [{'clf__C': c_gamma_range, 
               'clf__kernel': ['linear']},
                 {'clf__C': c_gamma_range, 
                  'clf__gamma': c_gamma_range, 
                  'clf__kernel': ['rbf'],}]

gs = GridSearchCV(estimator=pipe_svc, 
                  param_grid=param_grid, 
                  scoring=scorer, 
                  cv=2,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

## Precision vs recall: which is right for you?

**Precision**: of all the $\oplus$'s you found, how many of them were really $\oplus$'s?  Good precision means that the cases we diagnose as cancer really are cancer.

**Recall**: of all the $\oplus$'s that are really there, how many of them did you find?  Good recall means that we are not missing cases of cancer.

Sometimes we care more about precision, sometimes we care more about recall, and sometimes we care about both.

Suppose our classifier is deciding whether websites are safe for children.  Let $\oplus$ be acceptable sites.  Then we probably care more about precision (not letting bad sites past) than recall (we're willing to reject quite a few good sites).

On the other hand, if our classifier is looking for a disease, and $\oplus$ are people with the disease, then we'd probably care more about recall (finding all the people with the disease) than precision (since once we see the people our dragnet rounds up we can tell they are not sick).

## The precision/recall trade-off

It's not always easy to achieve high precision and high recall at the same time.  

In fact, for a given classifier there is usually a trade-off: changes that increase precision tend to reduce recall, and changes that increase recall tend to reduce precision.

To understand this, let's consider logistic regression as implemented in scikit-learn.

In binary classification, logistic regression builds a score $S(x)$ that instance $x$ belongs to one class (let's call it $\oplus$).  If $S(x)$ exceeds some threshold, the model says that $x$ is an $\oplus$; otherwise, it says $x$ is an $\ominus$.

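The <code>predict()</code> method does not let us change the threshold, but Scikit-Learn will tell us the values of the score $S(x)$ it is using.  We can get at them using the ```decision_function()``` method.  Below we apply it to the tuned SVC pipeline from the grid search above, which turns a score into a class prediction in the same way.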

In [None]:
gs.fit(X_train, y_train)

y_pred = gs.predict(X_train)
y_scores = gs.decision_function(X_train)
print('    score        class prediction')
print(np.column_stack((y_scores, y_pred)))

We see that negative scores are Class 0 (benign) and positive scores are Class 1 (malignant).

### Precision and recall vs scoring threshold

Let's look at the effect of changing the scoring threshold on precision and recall.


We use the [```precision_recall_curve()```](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) function to create a plot of the precision and recall for different values of the threshold used by our classifier to decide whether cells are malignant or benign.

In [None]:
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="right", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([np.amin(thresholds), np.amax(thresholds)])
plt.show()

Recall:
- **Precision**: of all the $\oplus$'s you found, how many of them were really $\oplus$'s?  Good precision means that the cases we diagnose as cancer really are cancer.
- **Recall**: of all the $\oplus$'s that are really there, how many of them did you find?  Good recall means that we are not missing cases of cancer.

We can see that as we increase the decision threshold in our classifier, the precision increases because we are making the requirement to be classified as $\oplus$ more stringent.

At the same time, recall is decreasing because we are missing more and more $\oplus$ cases.

Thus, if someone says they want 99.9% precision in their classifier, they should also ask about the associated level of recall!  One is not meaningful without the other.
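
The effect of the threshold can also be reproduced by hand.  This sketch thresholds the scores from `decision_function()` ourselves rather than calling `predict()`; the logistic regression pipeline, the five threshold values, and the variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
y_demo = (y_demo == 0).astype(int)  # 1 = malignant, as in this notebook

model = Pipeline([("scl", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_demo, y_demo)
demo_scores = model.decision_function(X_demo)

recall_by_threshold = []
for threshold in [-4.0, -2.0, 0.0, 2.0, 4.0]:
    # Classify as positive only when the score exceeds the threshold.
    y_hat = (demo_scores > threshold).astype(int)
    p = precision_score(y_demo, y_hat, zero_division=0)
    r = recall_score(y_demo, y_hat)
    recall_by_threshold.append(r)
    print(f"threshold={threshold:5.1f}  precision={p:.3f}  recall={r:.3f}")
```

A threshold of 0 reproduces what `predict()` does for this model.  Raising the threshold can only shrink the set of cases labeled $\oplus$, so recall never increases as the threshold goes up.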

## The precision-recall curve

Now we will plot the **precision-recall** curve that shows the trade-off between precision and recall.

We have values
\begin{align*}
  (\mbox{threshold}_{1}, \mbox{precision}_{1}) &\quad (\mbox{threshold}_{1}, \mbox{recall}_{1}) \\
  (\mbox{threshold}_{2}, \mbox{precision}_{2}) &\quad (\mbox{threshold}_{2}, \mbox{recall}_{2}) \\
  \vdots &\quad \vdots \\
  (\mbox{threshold}_{r}, \mbox{precision}_{r}) &\quad (\mbox{threshold}_{r}, \mbox{recall}_{r})
\end{align*}

In the precision-recall curve we plot precision vs recall:
\begin{align*}
  (\mbox{recall}_{1}, \mbox{precision}_{1}) \\
  (\mbox{recall}_{2}, \mbox{precision}_{2}) \\
  \vdots \\
  (\mbox{recall}_{r}, \mbox{precision}_{r}).
\end{align*} 

In [None]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(6, 6))
plot_precision_vs_recall(precisions, recalls)
plt.show()

This is a good precision-recall plot.  As our recall improves, we do not lose much precision until the upper limit of recall is reached.

A classifier with more room for improvement would be one whose curve does not hug the upper right corner of the enclosing square.

### A precision-recall curve for an ungood classifier

Let's look at an example where the classifier does not do well.  

We will use two features of the iris data set and a logistic regression classifier.

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

# Extract sepal length and width from columns 0 and 1.
iris_X = iris.data[:,[0,1]]
iris_y = iris.target
print(iris.target_names)

# Select I. versicolor and I. virginica.
mask = (iris_y != 0)
iris_X = iris_X[mask,:]
iris_y = iris_y[mask]

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
iris_y = le.fit_transform(iris_y)

from sklearn.model_selection import train_test_split

iris_X_train, iris_X_test, iris_y_train, iris_y_test = \
        train_test_split(iris_X, iris_y, test_size=0.30, random_state=1)

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)

lr.fit(iris_X_train, iris_y_train)
print(f"Test accuracy: {lr.score(iris_X_test, iris_y_test):.3f}")

In [None]:
from sklearn.metrics import precision_recall_curve

iris_y_scores = lr.decision_function(iris_X_train)

precisions, recalls, thresholds = precision_recall_curve(iris_y_train, iris_y_scores)

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 8))
plot_precision_vs_recall(precisions, recalls)
plt.show()

This is not a good precision-recall curve.  There is a rapid loss of precision as recall increases.

## Plotting a receiver operating characteristic curve

Another tool for evaluating binary classifiers is the **receiver operating characteristic (ROC) curve** or **normalized coverage plot**.
The name comes from ROC curves in radio engineering, which were introduced in World War II to measure radio and radar performance.

Recall:
* **true positives** are positives that are correctly classified as positive;
* **false positives** are negatives that are incorrectly classified as positive.

In true/false positive/negative, 
* positive/negative refers to the classifier's prediction, while 
* true/false refers to whether this prediction is correct.

Let
\begin{align*}
  P  &= \mbox{true number of $\oplus$'s}, \\
  N  &= \mbox{true number of $\ominus$'s}, \\
  TP &= \mbox{number of true positives}, \\
  TN &= \mbox{number of true negatives}, \\
  FP &= \mbox{number of false positives}, \\
  FN &= \mbox{number of false negatives}.
\end{align*}

The **true positive rate** (TPR) and **false positive rate** (FPR) are the fraction of positives correctly classified and negatives incorrectly classified, respectively:
\begin{align*}
  TPR &= \frac{TP}{P} = \frac{TP}{FN + TP}, \\
  FPR &= \frac{FP}{N} = \frac{FP}{FP + TN}.
\end{align*}
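
The two rates can be computed directly from a confusion matrix.  The toy label vectors below are made up for this example, and the names `TPR` and `FPR` simply mirror the formulas above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (1 = positive); made up for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat  = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

TPR = tp / (fn + tp)   # TP / P: fraction of positives correctly classified
FPR = fp / (fp + tn)   # FP / N: fraction of negatives incorrectly classified
print(f"TPR = {TPR:.3f}, FPR = {FPR:.3f}")
```

Note that the true positive rate is the same quantity as recall.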

An ROC curve is constructed in much the same way as the precision-recall curve.  For various levels of the scoring threshold we compute the associated true positive rate (TPR) and false positive rate (FPR):
\begin{align*}
  (\mbox{threshold}_{1}, \mbox{TPR}_{1}) &\quad (\mbox{threshold}_{1}, \mbox{FPR}_{1}) \\
  (\mbox{threshold}_{2}, \mbox{TPR}_{2}) &\quad (\mbox{threshold}_{2}, \mbox{FPR}_{2}) \\
  \vdots &\quad \vdots \\
  (\mbox{threshold}_{r}, \mbox{TPR}_{r}) &\quad (\mbox{threshold}_{r}, \mbox{FPR}_{r})
\end{align*}

In an ROC curve we plot the true positive rates (vertical axis) versus the false positive rates (horizontal axis):
\begin{align*}
  (\mbox{FPR}_{1}, \mbox{TPR}_{1}) \\
  (\mbox{FPR}_{2}, \mbox{TPR}_{2}) \\
  \vdots \\
  (\mbox{FPR}_{r}, \mbox{TPR}_{r}).
\end{align*} 

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train, y_scores)

In [None]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(6, 6))
plot_roc_curve(fpr, tpr)
plt.show()

The blue curve is the ROC curve for our classifier.

The dotted 45&#176; line represents the performance of a random classifier.  By that we mean a classifier that selects a random number $S$ from a uniform distribution on the interval [0,1], and classifies an instance as $\oplus$ if $S > \theta$, where $\theta$ is the scoring threshold.
* At one extreme, when $\theta = 0$, all instances are classified as $\oplus$.  This means TPR = 1 and FPR = 1.
* At the other extreme, when $\theta = 1$, all instances are classified as $\ominus$.  This means TPR = 0 and FPR = 0.
* For intermediate values of $\theta$, equal fractions of true $\oplus$ and true $\ominus$ are classified as $\oplus$, so TPR = FPR.
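
The behavior of the random classifier can be simulated.  In this sketch the sample size, the seed, and the names `y_rand` and `s_rand` are arbitrary choices for the illustration; the scores are drawn independently of the labels.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n = 20000
y_rand = rng.integers(0, 2, size=n)       # true labels, unrelated to the scores
s_rand = rng.uniform(0.0, 1.0, size=n)    # S drawn uniformly from [0, 1]

fpr_rand, tpr_rand, _ = roc_curve(y_rand, s_rand)
max_gap = np.max(np.abs(tpr_rand - fpr_rand))
auc_rand = roc_auc_score(y_rand, s_rand)

print(f"max |TPR - FPR| = {max_gap:.3f}")
print(f"AUC = {auc_rand:.3f}")
```

With a sample this large, the simulated curve should stay close to the diagonal, with small deviations due to sampling noise.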

## The area under the ROC curve

From the perspective of the ROC curve, the ideal classifier's curve would be the constant value 1.
* Such a classifier would achieve perfect recall (TPR = 1) without any false positives (FPR = 0).
* The area under such a curve would be 1 (height = 1 $\times$ width = 1).
* The area under the random classifier's ROC is 1/2.

The area under an ROC curve is abbreviated AUC: **area under curve**.  AUC values near 1 are desirable.
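
To connect the number to the picture, this sketch computes the area in three ways on a small set of made-up scores: with the trapezoidal rule applied to the ROC points (`auc()`), with `roc_auc_score()`, and as the fraction of ($\oplus$, $\ominus$) pairs in which the $\oplus$ case receives the higher score.  The toy arrays are assumptions for the example and contain no tied scores.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores, made up for illustration (no tied scores).
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

fpr_toy, tpr_toy, _ = roc_curve(y_true, y_score)
area = auc(fpr_toy, tpr_toy)   # trapezoidal rule on the ROC points

# Fraction of (positive, negative) pairs ranked in the right order.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(f"{area:.4f}  {roc_auc_score(y_true, y_score):.4f}  {pairwise:.4f}")
```

The pairwise view gives a useful reading of AUC: it is the probability that a randomly chosen $\oplus$ case scores higher than a randomly chosen $\ominus$ case.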

In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score

y_pred = gs.predict(X_train)

print('ROC AUC:  {0:.3f}'.format(roc_auc_score(y_train, y_scores)))
print('Accuracy: {0:.3f}'.format(accuracy_score(y_train, y_pred)))

# The precision/recall curve vs the ROC curve

The precision/recall curve typically gives better insight when the number of $\oplus$ cases is small and you care more about false positives than false negatives.