# L3c: Logistic Regression Models for Binary Classification 
This lecture introduces our second binary classification method: [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression). 
[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a statistical method used for binary classification problems, where the dependent variable (label) is a binary categorical variable (e.g., $\pm{1}$), and the independent variables (features) are continuous or categorical variables. Unlike the Perceptron model, which outputs the class label directly, logistic regression models the _probability_ that a given input belongs to a particular class based on the input features.

The key concepts covered in this lecture include:
* __Logistic regression__ is a binary classification method that models the relationship between a dependent categorical variable (label) and one or more independent variables (features) by estimating the probability of a label using the [logistic function](https://en.wikipedia.org/wiki/Logistic_function).
* __Maximum likelihood estimation (MLE)__ is an approach to estimate the parameters of a probability distribution by maximizing the likelihood function of the output (label), given the input (features), thereby determining the parameter values that make the output most probable given the input.
* __Gradient descent__ is an optimization algorithm that minimizes a function by iteratively adjusting parameters in the opposite direction of [the gradient](https://en.wikipedia.org/wiki/Gradient). Iteration continues until some stopping criterion is met, e.g., a local minimum is found, or we run out of iterations. Sometimes, computing [the gradient](https://en.wikipedia.org/wiki/Gradient) is a hassle.
* __Alternatives to gradient descent__ include heuristic optimization algorithms such as the [Nelder-Mead Simplex Algorithm](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method), [Simulated Annealing](https://en.wikipedia.org/wiki/Simulated_annealing), [Genetic Algorithms](https://en.wikipedia.org/wiki/Genetic_algorithm), [Particle Swarm Optimization](https://en.wikipedia.org/wiki/Particle_swarm_optimization), etc., which can estimate model parameters without relying on the gradient.

Today, we'll analyze the [banknote authentication dataset from the UCI archive](https://archive.ics.uci.edu/dataset/267/banknote+authentication) first using [the Perceptron](https://en.wikipedia.org/wiki/Perceptron) and then [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression). Lecture notes for logistic regression can be found [here!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3c/docs/Notes.pdf)

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. The `Include.jl` file loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem.

In [3]:
include("Include.jl");

### Data
This lecture will look at a [banknote authentication dataset](https://archive.ics.uci.edu/dataset/267/banknote+authentication) for classification tasks. We'll load the banknote dataset and split it into `training` and `test` data subsets (randomly).
* __Training data__: Training datasets are collections of labeled data used to teach machine learning models, allowing these tools to learn patterns and relationships within the data.
* __Test data__: Test datasets, on the other hand, are separate sets of labeled data used to evaluate the performance of trained models on unseen examples, providing an unbiased assessment of the _model's generalization capabilities_.

#### Banknote Authentication Dataset
The dataset we will explore is the [banknote authentication dataset from the UCI archive](https://archive.ics.uci.edu/dataset/267/banknote+authentication). This dataset has `1372` instances of 4 continuous features and an integer $\{-1,1\}$ class variable. 
* __Description__: Data were extracted from images taken from genuine and forged banknote-like specimens. An industrial camera, usually used for print inspection, was used for digitization. The final images have 400 x 400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. Wavelet Transform tools were used to extract features from images.
* __Features__: The data has four continuous features from each image: `variance` of the wavelet transformed image, `skewness` of the wavelet transformed image, `kurtosis` of the wavelet transformed image, and the `entropy` of the wavelet transformed image. The class is $\{-1,1\}$ where a class value of `-1` indicates genuine, `1` forged.

In [6]:
df_banknote = CSV.read(joinpath(_PATH_TO_DATA, "data-banknote-authentication.csv"), DataFrame)

| Row | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Int64 |
| 1 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | -1 |
| 2 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | -1 |
| 3 | 3.866 | -2.6383 | 1.9242 | 0.10645 | -1 |
| 4 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | -1 |
| 5 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | -1 |
| 6 | 4.3684 | 9.6718 | -3.9606 | -3.1625 | -1 |
| 7 | 3.5912 | 3.0129 | 0.72888 | 0.56421 | -1 |
| 8 | 2.0922 | -6.81 | 8.4636 | -0.60216 | -1 |
| 9 | 3.2032 | 5.7588 | -0.75345 | -0.61251 | -1 |
| 10 | 1.5356 | 9.1772 | -2.2718 | -0.73535 | -1 |


In [7]:
D_banknote = Matrix(df_banknote); # get the data as a Matrix (alias for Array{Float64,2})
number_of_training_examples_banknote = 1000; # how many training points for the banknote dataset?

In [8]:
banknote_training, banknote_test = let

    number_of_features = size(D_banknote,2); # number of cols of the banknote data (4 features + 1 label)
    number_of_examples = size(D_banknote,1); # number of rows of the banknote data
    full_index_set = range(1,stop=number_of_examples,step=1) |> collect |> Set;
    
    # build index sets for training and testing
    training_index_set = Set{Int64}();
    should_stop_loop = false;
    while (should_stop_loop == false)
        i = rand(1:number_of_examples);
        push!(training_index_set,i);

        if (length(training_index_set) == number_of_training_examples_banknote)
            should_stop_loop = true;
        end
    end
    test_index_set = setdiff(full_index_set,training_index_set);

    # build the test and train datasets -
    banknote_training = D_banknote[training_index_set |> collect,:];
    banknote_test = D_banknote[test_index_set |> collect,:];

    # return
    banknote_training,banknote_test
end;
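
The loop above draws random row indices until the training set is full. An equivalent idea, sketched below under stated assumptions, is to shuffle all row indices once and cut the shuffled list in two. The helper name `split_rows` and the toy matrix are hypothetical (they are not part of `Include.jl`); only `randperm` and `MersenneTwister` from the Julia `Random` standard library are used.

```julia
using Random

# Hypothetical helper (not part of the course code): split the rows of a matrix
# into training and test subsets using one random permutation of the row indices.
function split_rows(D::AbstractMatrix, number_of_training_examples::Int; rng = Random.default_rng())
    number_of_examples = size(D, 1)
    @assert 0 < number_of_training_examples < number_of_examples
    idx = randperm(rng, number_of_examples)                  # shuffled row indices, no repeats
    training_idx = idx[1:number_of_training_examples]        # first block -> training
    test_idx = idx[(number_of_training_examples + 1):end]    # remainder -> test
    return D[training_idx, :], D[test_idx, :], training_idx, test_idx
end

# toy stand-in for the banknote matrix: 10 rows, 3 columns
D_toy = reshape(collect(1.0:30.0), 10, 3)
train_toy, test_toy, train_idx, test_idx = split_rows(D_toy, 7; rng = MersenneTwister(1234))
```

Because the two index blocks come from a single permutation, the training and test rows cannot overlap, which is the same property the `setdiff` call guarantees in the cell above.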

In [9]:
banknote_training

1000×5 Matrix{Float64}:
 -2.4621     2.7645   -0.62578  -2.8573     1.0
 -3.2051    -0.14279   0.97565   0.045675   1.0
  4.0932     5.4132   -1.8219    0.23576   -1.0
  0.11686    3.735    -4.4379   -4.3741     1.0
  3.2403    -3.7082    5.2804    0.41291   -1.0
 -0.9607     2.6963   -3.1226   -1.3121     1.0
 -1.7064     3.3088   -2.2829   -2.1978     1.0
  1.5799    -4.7076    7.9186   -1.5487    -1.0
 -2.9821     4.1986   -0.5898   -3.9642     1.0
 -1.7559    11.9459    3.0946   -4.8978    -1.0
 -3.0731    -0.53181   2.3877    0.77627    1.0
  1.9572    -5.1153    8.6127   -1.4297    -1.0
 -3.7503   -13.4586   17.5932   -2.7771     1.0
  ⋮                                        
 -3.1273    -7.1121   11.3897   -0.083634   1.0
 -3.6961   -13.6779   17.5795   -2.6181     1.0
  1.8799     2.4707    2.4931    0.37671   -1.0
 -1.1391     1.8127    6.9144    0.70127   -1.0
  4.3848    -3.0729    3.0423    1.2741    -1.0
  3.6277     0.9829    0.68861   0.63403   -1.0
 -1.6514    -8.4985 

## Method 1: Perceptron
[The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) takes the (scalar) output of a linear regression model $y_{i}\in\mathbb{R}$ and then transforms it using the $\sigma(\star) = \texttt{sign}(\star)$ function to a discrete set of values representing categories, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,1\}$ in the binary classification case. 
* Suppose there exists a data set $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ _labeled_ examples, where each example has been labeled by an expert (e.g., a human) as belonging to a category $y_{i}\in\{-1,1\}$, given the $m$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 
* [The Perceptron](https://en.wikipedia.org/wiki/Perceptron) _incrementally_ learns a linear decision boundary between _two_ classes of possible objects (binary classification) by repeatedly processing the dataset $\mathcal{D}$. During each pass, a regression parameter vector $\mathbf{\beta}$ is updated until it makes no more than a specified number of mistakes. 

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) computes the estimated label $\hat{y}_{i}$ for feature vector $\hat{\mathbf{x}}_{i}$ using the $\texttt{sign}:\mathbb{R}\to\{-1,1\}$ function:
$$
\begin{equation*}
    \hat{y}_{i} = \texttt{sign}\left(\hat{\mathbf{x}}_{i}^{\top}\cdot\beta\right)
\end{equation*}
$$
where $\beta=\left(w_{1},\dots,w_{m}, b\right)$ is a column vector of (unknown) classifier parameters, $w_{j}\in\mathbb{R}$ corresponds to the importance of feature $j$ and $b\in\mathbb{R}$ is a bias parameter, the features $\hat{\mathbf{x}}^{\top}_{i}=\left(x^{(i)}_{1},\dots,x^{(i)}_{m}, 1\right)$ are $p$-dimensional (row) vectors with $p = m+1$ (features augmented with a bias term), and $\texttt{sign}(z)$ is the function:
$$
\begin{equation*}
    \texttt{sign}(z) = 
    \begin{cases}
        1 & \text{if}~z\geq{0}\\
        -1 & \text{if}~z<0
    \end{cases}
\end{equation*}
$$
__Hypothesis__: [If the dataset $\mathcal{D}$ is linearly separable](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-LinearSeperableData-Hyperplane.pdf), the Perceptron will _incrementally_ learn a separating hyperplane in a finite number of passes through the data set $\mathcal{D}$. However, if the [dataset $\mathcal{D}$ is __not__ linearly separable](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-NotLinearSeperableData-Hyperplane.svg), the Perceptron may not converge. Check out a [perceptron pseudo-code here!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf)
* __Training__: Our Perceptron implementation [based on pseudo-code](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf) stores problem information in [a `MyPerceptronClassificationModel` instance, which holds the (initial) parameters and other data](src/Types.jl) required by the problem.
* We then _learn_ the model parameters [using the `learn(...)` method](src/Compute.jl), which takes the training features array `X`, the training labels vector `y`, and the problem instance `model` and returns an updated problem instance holding the updated parameters.
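
To make the update rule concrete, here is a minimal textbook Perceptron sketch on a toy, linearly separable dataset. It is not the course's `learn(...)` or `classify(...)` implementation (that lives in `src/Compute.jl`); the names `perceptron_sign`, `perceptron_learn`, and `perceptron_classify`, the zero initial guess, and the toy data are all assumptions made for illustration.

```julia
using LinearAlgebra

# sign convention from the text: sign(z) = 1 if z ≥ 0, and -1 otherwise
perceptron_sign(z::Real) = z ≥ 0 ? 1.0 : -1.0

# Hypothetical training loop: sweep the data, and nudge β toward any example
# that is misclassified (or sits exactly on the boundary).
function perceptron_learn(X::AbstractMatrix, y::AbstractVector; maxiter::Int = 100, mistakes::Int = 0)
    n, p = size(X)
    β = zeros(p)                              # last entry is the bias (X has a trailing column of ones)
    for _ in 1:maxiter
        errors = 0
        for i in 1:n
            if y[i] * dot(X[i, :], β) ≤ 0     # wrong side of (or on) the hyperplane
                β .+= y[i] .* X[i, :]         # Perceptron update
                errors += 1
            end
        end
        errors ≤ mistakes && break            # stop once a full pass makes few enough mistakes
    end
    return β
end

perceptron_classify(X::AbstractMatrix, β::AbstractVector) =
    [perceptron_sign(dot(X[i, :], β)) for i in 1:size(X, 1)]

# toy, linearly separable data: two features plus the bias column of ones
X_toy = [ 2.0  1.0 1.0;
          1.0  3.0 1.0;
         -1.0 -2.0 1.0;
         -2.0 -1.0 1.0]
y_toy = [1.0, 1.0, -1.0, -1.0]
β_toy = perceptron_learn(X_toy, y_toy)
ŷ_toy = perceptron_classify(X_toy, β_toy)
```

On this toy set the loop makes one update on the first pass and none on the second, so it stops early with every training point classified correctly.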

In [11]:
model_perceptron = let

    # data -
    D = banknote_training; # What dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # number of cols (4 features + 1 label); equals the number of parameters (4 weights + 1 bias)
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the target data (label)

    # model
    model = build(MyPerceptronClassificationModel, (
        parameters = ones(number_of_features), # initial value for the parameters: these will be updated
        mistakes = 0 # willing to live with m mistakes
    ));

    # train -
    model = learn(X,y,model, maxiter = 10000, verbose = true); # this is learning the model parameters

    # return -
    model;
end;

Stopped after number of iterations: 10000. We have number of errors: 11


In [12]:
model_perceptron.β

5-element Vector{Float64}:
 -180.25442099982783
  -99.66765799978478
 -104.83100600015813
   -9.344243000011186
  168.0

__Inference__: Now that we have parameters estimated from the `training` data, we can use those parameters on the `test` dataset to see how well the model can differentiate between an actual banknote and a forgery on data it has never seen. We run the classification operation on the (unseen) test data [using the `classify(...)` method](src/Compute.jl). This method takes a feature array `X` and the (trained) model instance. It returns the estimated labels. 
* We store the actual (correct) label in the `y_banknote_perceptron::Array{Float64,1}` vector (the labels are read from a `Matrix{Float64}`), while the model predicted label is stored in the `ŷ_banknote_perceptron` array.

In [14]:
ŷ_banknote_perceptron,y_banknote_perceptron = let

    D = banknote_test; # what dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # number of cols (4 features + 1 label)
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    ŷ = classify(X,model_perceptron)

    # return -
    ŷ,y
end;

### Confusion Matrix
The confusion matrix is a $2\times{2}$ matrix that contains four entries: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). [Click me for a confusion matrix schematic!](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-BinaryConfusionMatrix.pdf) The four cases are:
* The __true positive (TP)__ case $(\text{actual}, \text{model}) = (+,+)$ in the confusion matrix is the number of positive examples that were correctly classified as positive.
* The __false negative (FN)__ case $(\text{actual}, \text{model}) = (+,-)$ is the number of actual positive examples the model incorrectly classified as negative.
* The __false positive (FP)__ case $(\text{actual}, \text{model}) = (-,+)$ is the number of actual negative examples that were incorrectly classified as positive by the model.
* The __true negative (TN)__ case $(\text{actual}, \text{model}) = (-,-)$ is the number of actual negative examples that were correctly classified as negative by the model.
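
The four counts can be tallied with a short loop. The sketch below is a hypothetical stand-in, not the course's `confusion(...)` method: the name `confusion_sketch`, the choice of label `1` as the positive class, and the `[TP FN; FP TN]` layout (rows are actual, columns are predicted) are all assumptions, and the real method may order the entries differently.

```julia
# Hypothetical helper: 2×2 confusion matrix with assumed layout [TP FN; FP TN],
# rows = actual (+, -), columns = predicted (+, -).
function confusion_sketch(y::AbstractVector, ŷ::AbstractVector; positive = 1)
    @assert length(y) == length(ŷ)
    CM = zeros(Int, 2, 2)
    for (actual, predicted) in zip(y, ŷ)
        row = actual == positive ? 1 : 2        # actual + -> row 1, actual - -> row 2
        col = predicted == positive ? 1 : 2     # predicted + -> col 1, predicted - -> col 2
        CM[row, col] += 1
    end
    return CM
end

# toy labels: 4 actual positives and 4 actual negatives
y_actual    = [1,  1, 1, -1, -1, -1, -1, 1]
y_predicted = [1, -1, 1, -1,  1, -1, -1, 1]
CM_toy = confusion_sketch(y_actual, y_predicted)
```

For these toy labels the tally is 3 true positives, 1 false negative, 1 false positive, and 3 true negatives.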

Let's compute these four values [using the `confusion(...)` method](src/Compute.jl) and store them in the `CM_perceptron::Array{Int64,2}` variable:

In [16]:
CM_perceptron = confusion(y_banknote_perceptron, ŷ_banknote_perceptron) # call with the perceptron values

2×2 Matrix{Int64}:
 164    5
   1  202

Let's compute the overall error rate for the perceptron using [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). The [`confusion(...)` method](src/Compute.jl) takes the actual labels and the computed labels and returns the confusion matrix.

In [18]:
number_of_test_banknotes = length(y_banknote_perceptron);
correct_prediction_perceptron = CM_perceptron[1,1] + CM_perceptron[2,2];
(correct_prediction_perceptron/number_of_test_banknotes) |> f-> println("Fraction correct: $(f) Fraction incorrect $(1-f)")

Fraction correct: 0.9838709677419355 Fraction incorrect 0.016129032258064502


## Method 2: Logistic Regression Model
[Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) is a method for binary classification problems, where the dependent variable (label) is a binary categorical variable (e.g., $\pm{1}$), and the independent variables (features) are continuous or categorical variables. Unlike the Perceptron model, which outputs the class label directly, logistic regression models the _probability_ that a given input belongs to a particular class based on the input features.

### Model
Suppose we have a dataset $\mathcal{D} = \left\{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\right\}$ where the features $\mathbf{x}\in\mathbb{R}^{m}$ are $m$-dimensional vectors composed of continuous or categorical variables, and $y\in\mathbb{R}$ is a scalar label, e.g., $y\in\left\{-1,1\right\}$.
The logistic regression problem attempts to estimate the parameters $\theta\in\mathbb{R}^{p}$ (where $p=m+1$) of the conditional probability $P_{\theta}(y|\hat{\mathbf{x}})$, i.e., the probability that a particular label is observed given the augmented feature vector $\hat{\mathbf{x}} = \left(x_{1},x_{2},\dots,x_{m},1\right)$. We model this probability using [the logistic function](https://en.wikipedia.org/wiki/Logistic_function):
$$
\begin{equation}
P_{\theta}(y|\hat{\mathbf{x}}) = \frac{1}{1 + e^{-y\cdot\left(\hat{\mathbf{x}}^{\top}\theta\right)}}
\end{equation}
$$
The challenge is to estimate the parameters $\theta\in\mathbb{R}^{p}$. One (standard) way we do this is to minimize the negative log-likelihood function $\mathcal{L}(\theta) = -\log{L}(\theta)$ (see the [notes](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3c/docs/Notes.pdf) for a more detailed discussion):
$$
\begin{equation}
    \theta^{\star}  = \arg\min_{\theta}\left[\sum_{i=1}^{n}\log\left(1 + e^{-y_{i}\cdot\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)}\right)\right]
\end{equation}
$$
Unfortunately, this problem has no closed-form analytical solution. Thus, we have to use some numerical technique, e.g., [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent), to estimate an approximate value for the parameters, i.e., $\hat{\theta}^{\star}\approx\theta^{\star}$.
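
The two model equations above translate almost directly into code. The sketch below evaluates the logistic probability for a single augmented feature vector and the negative log-likelihood for a small dataset. The function names `logistic_probability` and `negative_log_likelihood`, and the toy numbers, are assumptions for illustration and are not part of the course's `src/` code.

```julia
using LinearAlgebra

# P_θ(y | x) = 1 / (1 + exp(-y * xᵀθ)), with y ∈ {-1, 1} and x already augmented with a trailing 1
logistic_probability(x::AbstractVector, y::Real, θ::AbstractVector) = 1 / (1 + exp(-y * dot(x, θ)))

# negative log-likelihood: sum over the rows of X of log(1 + exp(-yᵢ * xᵢᵀθ))
function negative_log_likelihood(X::AbstractMatrix, y::AbstractVector, θ::AbstractVector)
    return sum(log(1 + exp(-y[i] * dot(X[i, :], θ))) for i in 1:size(X, 1))
end

# one example: two features plus the bias entry, so xᵀθ = 0.5 - 2.0 + 0.5 = -1
x = [0.5, -1.0, 1.0]
θ = [1.0, 2.0, 0.5]
p_pos = logistic_probability(x, 1, θ)     # probability of label +1
p_neg = logistic_probability(x, -1, θ)    # probability of label -1

# a small toy dataset (rows are augmented feature vectors)
X_toy = [0.5 -1.0 1.0;
         1.0  0.0 1.0;
        -2.0  1.0 1.0]
y_toy = [-1.0, 1.0, -1.0]
nll_at_zero = negative_log_likelihood(X_toy, y_toy, zeros(3))
```

Two sanity checks follow from the formulas: the probabilities of the two labels sum to one, and at $\theta = 0$ every example contributes $\log 2$ to the negative log-likelihood.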

### Gradient descent
Gradient descent is a numerical search algorithm that minimizes a function by iteratively adjusting the parameters in the opposite direction of the gradient. Suppose there exists an objective function, e.g., the negative log-likelihood $\mathcal{L}(\theta)$ that we want to minimize with respect to parameters $\theta\in\mathbb{R}^{p}$. We assume $\mathcal{L}(\theta)$ is _at least once differentiable_ with respect to the parameters, i.e., we can compute the gradient $\nabla_{\theta}{\mathcal{L}}(\theta)$. The gradient points in the direction of the steepest increase of the function. Thus, we can iteratively update the parameters to minimize the objective function using the update rule:
$$
\begin{equation*}
\theta_{k+1} = \theta_{k} - \alpha(k)\cdot\nabla_{\theta}\mathcal{L}(\theta_{k})\quad\text{where}{~k = 0,1,2,\dots}
\end{equation*}
$$
where $k$ denotes the iteration index, and $\nabla_{\theta}\mathcal{L}(\theta)$ is the gradient of the negative log-likelihood function with respect to the parameters $\theta$.
* __What is $\alpha(k)$?__ The (hyper) parameter $\alpha(k)>0$ is the _learning rate_ which can be a function of the iteration count $k$. This is a user-adjustable parameter, and we'll assume it's constant for today.
* __Stopping?__ Gradient descent will continue to iterate until a stopping criterion is met, e.g., the parameters stop changing $\lVert\theta_{k+1} - \theta_{k}\rVert\leq\epsilon$, the gradient is small at the current iteration $\lVert\nabla_{\theta}\mathcal{L}(\theta_{k})\rVert\leq\epsilon$, or the maximum number of iterations is reached.
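
The update rule and stopping criteria can be sketched with a fixed learning rate. This sketch uses the analytic gradient of the negative log-likelihood, obtained by differentiating the objective term by term:
$$
\begin{equation*}
\nabla_{\theta}\mathcal{L}(\theta) = -\sum_{i=1}^{n}\frac{y_{i}\cdot\hat{\mathbf{x}}_{i}}{1 + e^{y_{i}\cdot\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)}}
\end{equation*}
$$
Note that this is a different choice from the course implementation, which approximates the gradient by finite differences. The function names, the learning rate, the tolerance, and the toy data are all assumptions made for illustration.

```julia
using LinearAlgebra

# negative log-likelihood for labels y ∈ {-1, 1} and augmented feature rows in X
nll(X, y, θ) = sum(log(1 + exp(-y[i] * dot(X[i, :], θ))) for i in 1:size(X, 1))

# analytic gradient: -Σᵢ yᵢ xᵢ / (1 + exp(yᵢ xᵢᵀθ))
function nll_gradient(X, y, θ)
    g = zeros(length(θ))
    for i in 1:size(X, 1)
        g .-= (y[i] / (1 + exp(y[i] * dot(X[i, :], θ)))) .* X[i, :]
    end
    return g
end

# naive gradient descent with a constant learning rate α
function gradient_descent(X, y, θ0; α = 0.1, ϵ = 1e-6, maxiter = 10_000)
    θ = copy(θ0)
    for k in 1:maxiter
        g = nll_gradient(X, y, θ)
        norm(g) ≤ ϵ && return θ, k               # stop: gradient is small
        θ_new = θ .- α .* g                      # step against the gradient
        norm(θ_new .- θ) ≤ ϵ && return θ_new, k  # stop: parameters stopped changing
        θ = θ_new
    end
    return θ, maxiter                            # stop: out of iterations
end

# toy data: one feature plus bias; the labels are NOT separable, so a finite minimizer exists
X_toy = [-2.0 1.0; -1.0 1.0; 0.0 1.0; 0.5 1.0; 1.0 1.0; 2.0 1.0]
y_toy = [-1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
θ_hat, iterations = gradient_descent(X_toy, y_toy, zeros(2))
```

The toy labels overlap on purpose: for perfectly separable data the negative log-likelihood keeps decreasing as $\lVert\theta\rVert$ grows, so the iteration would only stop at the iteration limit.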

Pseudocode for a naive gradient descent algorithm (for a fixed learning rate) is shown in the [lecture notes](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3c/docs/Notes.pdf). If you don't like computing derivatives (who does, am I right?) there are alternatives:
* __Alternatives to gradient descent__ include heuristic optimization algorithms such as the [Nelder-Mead Simplex Algorithm](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method), [Simulated Annealing](https://en.wikipedia.org/wiki/Simulated_annealing), [Genetic Algorithms](https://en.wikipedia.org/wiki/Genetic_algorithm), [Particle Swarm Optimization](https://en.wikipedia.org/wiki/Particle_swarm_optimization), etc., which can estimate model parameters without relying on the gradient.
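
As one illustration of a gradient-free alternative, here is a bare-bones simulated annealing sketch applied to the same kind of negative log-likelihood. It only needs function evaluations. The function name `simulated_annealing`, the Gaussian proposal, the geometric cooling schedule, and every numeric setting below are assumptions chosen for a small toy problem, not tuned recommendations.

```julia
using LinearAlgebra, Random

# Minimal simulated annealing: random Gaussian proposals, Metropolis acceptance,
# geometric cooling. Returns the best point seen and its objective value.
function simulated_annealing(f, θ0; T0 = 1.0, cooling = 0.995, step = 0.1,
        maxiter = 5_000, rng = Random.default_rng())
    θ, fθ = copy(θ0), f(θ0)
    θ_best, f_best = copy(θ), fθ
    T = T0
    for _ in 1:maxiter
        candidate = θ .+ step .* randn(rng, length(θ))   # random nearby point
        f_candidate = f(candidate)
        Δ = f_candidate - fθ
        # Metropolis rule: always accept improvements, sometimes accept worse moves
        if Δ ≤ 0 || rand(rng) < exp(-Δ / T)
            θ, fθ = candidate, f_candidate
            if fθ < f_best
                θ_best, f_best = copy(θ), fθ             # remember the best point so far
            end
        end
        T *= cooling                                     # cool down: fewer uphill moves over time
    end
    return θ_best, f_best
end

# toy negative log-likelihood (one feature plus bias, non-separable labels)
X_toy = [-2.0 1.0; -1.0 1.0; 0.0 1.0; 0.5 1.0; 1.0 1.0; 2.0 1.0]
y_toy = [-1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
loss(θ) = sum(log(1 + exp(-y_toy[i] * dot(X_toy[i, :], θ))) for i in 1:size(X_toy, 1))

θ_sa, loss_sa = simulated_annealing(loss, zeros(2); rng = MersenneTwister(42))
```

The trade-off is the usual one: no derivatives are required, but many more function evaluations are spent, and the answer depends on the random seed and the cooling schedule.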

### Implementation
We implemented [the `MyLogisticRegressionClassificationModel` type](src/Types.jl), which contains data required to solve the logistic regression problem, i.e., parameters, the learning rate, a stopping tolerance parameter $\epsilon$, and a loss (objective) function that we want to minimize. 
* __Technical note__: In this implementation, we approximated the gradient calculation using [a forward finite difference](https://en.wikipedia.org/wiki/Finite_difference). In general, this is not a great idea. This is one of my super pet peeves of gradient descent; computing the gradient is usually a hassle, and we do a bunch of function evaluations to get a good approximation of the gradient. However, finite difference is easy to implement.
* In the code block below, we [build a `model::MyLogisticRegressionClassificationModel` instance using a `build(...)` method](src/Factory.jl). The model instance initially has a random guess for the classifier parameters. We use gradient descent to refine that guess [using the `learn(...)` method](src/Compute.jl), which returns an updated model instance (with the best parameters that we found so far). We return the updated model instance and save it in the `model_logistic::MyLogisticRegressionClassificationModel` variable.
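
As an aside before the training cell, the forward finite difference mentioned in the technical note can be sketched in a few lines. This is a generic version under stated assumptions, not the code in `src/Compute.jl`: the function name `forward_difference_gradient`, the step size `h`, and the check function are hypothetical.

```julia
# Forward finite difference: ∂f/∂θⱼ ≈ (f(θ + h⋅eⱼ) - f(θ)) / h for each coordinate j.
# Costs p + 1 function evaluations for a p-dimensional parameter vector.
function forward_difference_gradient(f, θ::AbstractVector; h = 1e-4)
    f0 = f(θ)                          # baseline value, reused for every coordinate
    g = zeros(length(θ))
    for j in eachindex(θ)
        θ_step = copy(θ)
        θ_step[j] += h                 # perturb only coordinate j
        g[j] = (f(θ_step) - f0) / h
    end
    return g
end

# check against a function with a known gradient: f(θ) = θ₁² + 3θ₂, so ∇f = (2θ₁, 3)
f_check(θ) = θ[1]^2 + 3 * θ[2]
g_fd = forward_difference_gradient(f_check, [1.0, 2.0])
```

The approximation error is proportional to `h`, which is why the result here is close to, but not exactly, $(2, 3)$. When the loss sums over every training example, each of the $p+1$ evaluations is a full pass through the data, which is the cost the technical note complains about.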

In [21]:
model_logistic = let

    # data -
    D = banknote_training; # What dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # number of cols (4 features + 1 label); equals the number of parameters (4 weights + 1 bias)
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the target data (label)

    # model
    model = build(MyLogisticRegressionClassificationModel, (
        parameters = 0.01*ones(number_of_features), # initial value for the parameters: these will be updated
        learning_rate = 0.005, # you pick this
        ϵ = 1e-4, # you pick this (this is also the step size for the fd approx to the gradient)
        loss_function = (x,y,θ) -> log10(1+exp(-y*(dot(x,θ)))) # what??!? Wow, that is nice. Yes, we can pass functions as args! Note: log10 rescales the natural-log loss in the text by a constant, so the minimizer is unchanged
    ));

    # train -
    model = learn(X,y,model, maxiter = 10000, verbose = true); # this is learning the model parameters

    # return -
    model;
end;

Stopped after number of iterations: 10001. We have error: 0.025613373045017016


In [43]:
model_logistic.β

5-element Vector{Float64}:
 -8.382256499125878
 -4.262197574175405
 -5.448355414826784
 -0.5491634846777512
  7.544177029375256

Let's use the updated `model_logistic::MyLogisticRegressionClassificationModel` instance (that has learned some parameters from the `training` data) and test how well we classify data that we have never seen, i.e., how well we classify the `test` dataset.

__Inference__: We run the classification operation on the (unseen) test data [using the `classify(...)` method](src/Compute.jl). This method takes a feature array `X` and the (trained) model instance. It returns the probability of each label in the `P::Array{Float64,2}` array (unlike the Perceptron, which returns the labels directly). Each row of `P` corresponds to a test instance, and each column corresponds to a label, in this case `1` and `-1`.
* We store the actual (correct) label in the `y_banknote_logistic::Array{Float64,1}` vector. We compute the predicted label for each test instance by finding the highest probability column. We store the predicted labels in the `ŷ_banknote_logistic::Array{Float64,1}` vector.

In [23]:
ŷ_banknote_logistic,y_banknote_logistic, P = let

    D = banknote_test; # What dataset are you going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    P = classify(X,model_logistic) # logistic regression returns an n x 2 array holding the probability of each label

    # convert the probability to a choice ... for each row (test instance), compute the col with the highest probability
    ŷ = zeros(number_of_examples);
    for i ∈ 1:number_of_examples
        a = argmax(P[i,:]); # col index with largest value
        ŷ[i] = 1; # default
        if (a == 2)
            ŷ[i] = -1;
        end
    end
    
    # return -
    ŷ, y, P
end;

__Performance__: Once the model has converged, we can evaluate the binary classifier's performance using various metrics. The central idea is to compare the predicted labels $\hat{y}_{i}$ to the actual labels $y_{i}$ in the `test` dataset. 
Various metrics can be used to evaluate the performance of a binary classifier, but they all start with computing [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).
* Let's compute the confusion matrix [using the `confusion(...)` method](src/Compute.jl) and store it in the `CM_logistic::Array{Int64,2}` variable. The [`confusion(...)` method](src/Compute.jl) takes the actual labels and the computed labels and returns the confusion matrix.

In [25]:
CM_logistic = confusion(y_banknote_logistic, ŷ_banknote_logistic)

2×2 Matrix{Int64}:
 167    2
   1  202

Let's compute the overall error rate for the logistic regression using [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

In [27]:
number_of_test_banknotes = length(y_banknote_logistic);
correct_prediction_logistic = CM_logistic[1,1] + CM_logistic[2,2];
(correct_prediction_logistic/number_of_test_banknotes) |> f-> println("Fraction correct: $(f) Fraction incorrect $(1-f)")

Fraction correct: 0.9919354838709677 Fraction incorrect 0.008064516129032251
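
To compare the two classifiers side by side, the fraction correct can be recomputed directly from the two confusion matrices printed in this notebook. The sketch below hard-codes those printed values. The helper name `accuracy` is an assumption, and because the training/test split is random, a fresh run of the notebook will generally produce different matrices.

```julia
using LinearAlgebra

# fraction correct = (sum of the diagonal) / (sum of all entries);
# this does not depend on which class is treated as "positive"
accuracy(CM::AbstractMatrix) = tr(CM) / sum(CM)

# values copied from the outputs shown above (372 test banknotes in this run)
CM_perceptron_reported = [164 5; 1 202]
CM_logistic_reported   = [167 2; 1 202]

acc_perceptron = accuracy(CM_perceptron_reported)   # 366 / 372
acc_logistic   = accuracy(CM_logistic_reported)     # 369 / 372
```

In this particular run, logistic regression misclassifies 3 of the 372 test banknotes and the Perceptron misclassifies 6.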


## Today?
That's a wrap! What are some of the interesting things we discussed today?