# L9d: Estimating a Logistic Regression Binary Classifier using Gradient Descent
In this lab, we will implement a logistic regression binary classifier using gradient descent. Logistic regression is a statistical method for predicting binary outcomes based on one or more predictor variables. It is widely used in various fields, including machine learning, medical research, and social sciences.

> __Learning Objectives:__
>
> After completing this activity, students will be able to: 
> * **Derive logistic regression from statistical mechanics:** We develop the logistic regression model by viewing binary class labels as states in a Boltzmann distribution and derive the cross-entropy loss function. This approach connects statistical mechanics concepts with machine learning classification problems.
> * **Implement gradient descent optimization:** We use gradient descent to minimize the cross-entropy loss function and find optimal classifier parameters. The implementation uses finite difference approximations to compute gradients and iteratively updates parameters until convergence.
> * **Train and evaluate a binary classifier:** We train a logistic regression model on the banknote authentication dataset and evaluate its performance using a confusion matrix. The model learns to distinguish genuine from forged banknotes based on wavelet-transformed image features.

This is going to be cool, so let's get started!
___

## Background: Cross-entropy loss
Suppose we view our two-class labels $y\in\{-1,1\}$ as _states_ in a Boltzmann distribution conditioned on the input $\hat{\mathbf{x}}\in\mathbb{R}^{m+1}$ (the original $m$-dimensional feature vector with a `1` appended as the last element to account for a bias). Then, for any state $y$ with energy $E(y,\hat{\mathbf{x}})$ at unit temperature, the conditional probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ can be represented as
$$
\begin{align*}
P(y\mid \hat{\mathbf{x}})
=\frac{\exp\bigl(-E(y,\hat{\mathbf{x}})\bigr)}
      {\underbrace{\sum_{y' \in\{-1,1\}} \exp\bigl(-E(y',\hat{\mathbf{x}})\bigr)}_{Z(\hat{\mathbf{x}})}}.
\end{align*}
$$
For the energy function, we can use a linear model of the form:
$$
\begin{align*}
E(y,\hat{\mathbf{x}})\;=\;-\,y\;\bigl(\hat{\mathbf{x}}^{\top}\theta \bigr).
\end{align*}
$$
where $\theta\in\mathbb{R}^{m+1}$ is a vector of __unknown__ parameters (weights plus bias) that we want to learn. When $y=+1$, the energy $E(1,\hat{\mathbf{x}})=-\hat{\mathbf{x}}^{\top}\theta$ is *lower* (more probable) if $\hat{\mathbf{x}}^{\top}\theta$ is large and positive. On the other hand, when $y=-1$, the energy is $E(-1,\hat{\mathbf{x}})=+\hat{\mathbf{x}}^{\top}\theta$, so $y=-1$ is favored when $\hat{\mathbf{x}}^{\top}\theta$ is very negative.

Let's substitute the energy function into the conditional probability expression and do some algebra:
$$
\begin{align*}
P_{\theta}(y\mid \hat{\mathbf{x}})
& =\frac{\exp\bigl(-E(y,\hat{\mathbf{x}})\bigr)}
      {\underbrace{\sum_{y' \in\{-1,1\}} \exp\bigl(-E(y',\hat{\mathbf{x}})\bigr)}_{Z(\hat{\mathbf{x}})}}\\
&=\frac{\exp\bigl(y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}
      {\exp\bigl(\hat{\mathbf{x}}^{\top}\theta\bigr) + \exp\bigl(-\hat{\mathbf{x}}^{\top}\theta\bigr)}\quad\Longrightarrow\;{\text{substituting } z = \hat{\mathbf{x}}^{\top}\theta}\\
& = \frac{\exp\bigl(yz\bigr)}
      {\exp\bigl(z\bigr) + \exp\bigl(-z\bigr)}\quad\Longrightarrow\;{\text{factor out}\; \exp(yz)\;\text{from denominator}}\\
& = \frac{\exp\bigl(yz\bigr)}
      {\exp\bigl(yz\bigr)\left(\exp\bigl((1-y)z\bigr) + \exp\bigl(-(1+y)z\bigr)\right)}\quad\Longrightarrow\;\text{cancel}\;\exp(yz)\\
& = \frac{1}
      {\exp\bigl((1-y)z\bigr) + \exp\bigl(-(1+y)z\bigr)}\quad\blacksquare\\
\end{align*}
$$

This expression is the probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ and the parameters $\theta$. Let's look at the case when $y=+1$ and $y=-1$:

> __Cases:__
>
> When $y=+1$, we have:
> $$
\begin{align*}
P_{\theta}(y = +1\mid \hat{\mathbf{x}})
& = \frac{1}
      {\exp\bigl(0\bigr) + \exp\bigl(-2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\\
& = \frac{1}
      {1 + \exp\bigl(-2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\blacksquare\\
\end{align*}
$$
> 
> When $y=-1$, we have:
> $$\begin{align*}
P_{\theta}(y = -1\mid \hat{\mathbf{x}})
& = \frac{1}
      {\exp\bigl(2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr) + \exp\bigl(0\bigr)}\\
& = \frac{1}
      {1+\exp\bigl(2\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\blacksquare\\
\end{align*}
$$
> Putting this all together, and writing $\sigma(u) = 1/\bigl(1+\exp(-u)\bigr)$ for the logistic function, we can write the conditional probability of observing the label $y$ given the feature vector $\hat{\mathbf{x}}$ and the parameters $\theta$ as:
> $$\begin{align*}
P_{\theta}(y\mid \hat{\mathbf{x}}) & = \frac{1}{1+\exp\bigl(-2y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)}\quad\Longrightarrow\;\text{Logistic function!}\\
& = \sigma\bigl(2y\left(\hat{\mathbf{x}}^{\top}\theta\right)\bigr)\\
\end{align*}$$
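
As a quick numerical sanity check of the algebra above, the sketch below compares the Boltzmann form with the logistic form for a few illustrative values of $z = \hat{\mathbf{x}}^{\top}\theta$. The function names (`energy`, `boltzmann`, `logistic`) and the test values are assumptions made for this example; they are not part of the course package.

```julia
# Boltzmann form: P(y|x) = exp(-E(y,x)) / Σ_{y'} exp(-E(y',x)), with E(y,x) = -y * z and z = x'θ
energy(y, z) = -y * z
boltzmann(y, z) = exp(-energy(y, z)) / (exp(-energy(1, z)) + exp(-energy(-1, z)))

# Logistic form derived above: P(y|x) = σ(2y * z)
σ(u) = 1 / (1 + exp(-u))
logistic(y, z) = σ(2 * y * z)

# compare the two forms for a few (illustrative) values of z
for z in (-2.0, -0.5, 0.0, 0.5, 2.0)
    for y in (-1, 1)
        @assert isapprox(boltzmann(y, z), logistic(y, z); atol = 1e-12)
    end
    # the two class probabilities sum to one
    @assert isapprox(logistic(1, z) + logistic(-1, z), 1.0; atol = 1e-12)
end
println("P(y = +1 | z = 0.5) = ", logistic(1, 0.5))
```

If the assertions pass, the two expressions agree to floating-point precision for these values, which is what the derivation predicts.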

### Parameter Estimation
Of course, we want to learn the parameters $\theta$ so that we maximize the log likelihood (or minimize the negative log-likelihood) of the observed labels given the feature vectors. The likelihood function is given by:
$$
\begin{align*}
\mathcal{L}(\theta) & = \prod_{i=1}^{n} P_{\theta}(y_{i}\mid \hat{\mathbf{x}}_{i})\\
& = \prod_{i=1}^{n} \frac{1}{1+\exp\bigl(-2y_{i}\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)}\quad\Longrightarrow\;\text{Product is $\textbf{hard}$ to optimize! Take the $\log$}\\
\log\mathcal{L}(\theta) & = -\sum_{i=1}^n \log\!\bigl(1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\bigr)\\
\end{align*}
$$  

We can use gradient descent to minimize the negative log-likelihood (also known as the cross-entropy loss function):
$$
\boxed{
\begin{align*}
J(\theta) & = -\log\mathcal{L}(\theta)\\
& = \sum_{i=1}^n \log\!\bigl(1+\exp\bigl(-2y_i\,\left(\hat{\mathbf{x}}^{\top}_{i}\theta\right)\bigr)\bigr)\quad\blacksquare\\
\end{align*}}
$$      
This will give us the optimal parameters $\theta$ for our logistic regression model:
$$
\hat{\theta} = \arg\min_{\theta} J(\theta)
$$
Ok, let's give this a try.
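
Before moving to the data, here is a minimal sketch of the loss $J(\theta)$ on a tiny made-up dataset. It also includes the analytic gradient obtained by differentiating $J(\theta)$ term by term, $\nabla J(\theta) = -2\sum_{i} y_i\,\sigma\bigl(-2y_i\,\hat{\mathbf{x}}_i^{\top}\theta\bigr)\,\hat{\mathbf{x}}_i$, and checks it against a central finite difference. The names `softplus`, `J`, `∇J` and the data values are assumptions for illustration; this is not how the course package computes the gradient.

```julia
# numerically stable log(1 + exp(u))
softplus(u) = u > 0 ? u + log1p(exp(-u)) : log1p(exp(u))

# J(θ) = Σ_i log(1 + exp(-2 y_i x̂_i'θ)); rows of X are the augmented feature vectors x̂_i
J(θ, X, y) = sum(softplus.(-2 .* y .* (X * θ)))

# analytic gradient: ∇J(θ) = -2 Σ_i y_i σ(-2 y_i x̂_i'θ) x̂_i, where σ(u) = 1/(1 + exp(-u))
σ(u) = 1 / (1 + exp(-u))
∇J(θ, X, y) = -2 .* (X' * (y .* σ.(-2 .* y .* (X * θ))))

# tiny made-up dataset: two features plus a trailing 1 for the bias
X = [1.0 2.0 1.0; -1.5 0.5 1.0; 0.3 -2.0 1.0; -0.7 -1.1 1.0]
y = [1, -1, 1, -1]
θ = [0.1, -0.2, 0.05]

# central-difference check of the analytic gradient
h = 1e-6
g_fd = map(1:length(θ)) do j
    e = zeros(length(θ)); e[j] = h
    (J(θ + e, X, y) - J(θ - e, X, y)) / (2h)
end
println("J(θ) = ", J(θ, X, y))
println("max |analytic - finite difference| = ", maximum(abs.(∇J(θ, X, y) - g_fd)))
```

At $\theta = 0$ every term equals $\log 2$, so $J(0) = n\log 2$, which is a handy reference value when debugging a loss implementation.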
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [4]:
include(joinpath(@__DIR__, "Include-banknote-solution.jl")); # set up the environment: paths, packages, and helper code

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
The dataset we will explore is the [banknote authentication dataset from the UCI archive](https://archive.ics.uci.edu/dataset/267/banknote+authentication). This dataset has `1372` instances with 4 continuous features and an integer $\{-1,1\}$ class variable. 

> __Description of the dataset__ 
> 
> * Data were extracted from images taken from genuine and forged banknote-like specimens. An industrial camera, usually used for print inspection, was used for digitization. The final images have 400x400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. Wavelet Transform tools were used to extract features from images.
> * __Features__: The data has four continuous features from each image: `variance` of the wavelet transformed image, `skewness` of the wavelet transformed image, `kurtosis` of the wavelet transformed image (the column is spelled `curtosis` in the dataset), and the `entropy` of the wavelet transformed image. The class is $\{-1,1\}$ where a class value of `-1` indicates genuine and `1` indicates forged.

We've included this dataset in [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl) and have provided [the `MyBanknoteAuthenticationDataset(...)` helper function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyBanknoteAuthenticationDataset) for easy access. 

This method returns the data in [a `DataFrame` instance](https://github.com/JuliaData/DataFrames.jl), which we'll save in the `df_banknote` variable.

In [33]:
df_banknote =  MyBanknoteAuthenticationDataset()

| Row | variance | skewness | curtosis | entropy  | class |
|-----|----------|----------|----------|----------|-------|
|     | Float64  | Float64  | Float64  | Float64  | Int64 |
| 1   | 3.6216   | 8.6661   | -2.8073  | -0.44699 | -1    |
| 2   | 4.5459   | 8.1674   | -2.4586  | -1.4621  | -1    |
| 3   | 3.866    | -2.6383  | 1.9242   | 0.10645  | -1    |
| 4   | 3.4566   | 9.5228   | -4.0112  | -3.5944  | -1    |
| 5   | 0.32924  | -4.4552  | 4.5718   | -0.9888  | -1    |
| 6   | 4.3684   | 9.6718   | -3.9606  | -3.1625  | -1    |
| 7   | 3.5912   | 3.0129   | 0.72888  | 0.56421  | -1    |
| 8   | 2.0922   | -6.81    | 8.4636   | -0.60216 | -1    |
| 9   | 3.2032   | 5.7588   | -0.75345 | -0.61251 | -1    |
| 10  | 1.5356   | 9.1772   | -2.2718  | -0.73535 | -1    |


Now let's split the dataset into the system input matrix $\mathbf{X}$ (independent variables, characteristics of the banknote) and the output vector $\mathbf{y}$ (dependent variable, the banknote class).

The input matrix $\mathbf{X}$ will contain all the columns except for the `class` column (the output variable). The output vector $\mathbf{y}$ will contain only the `class` column.

In [9]:
X = Matrix(df_banknote[:, Not(:class)]); # data matrix: select all the columns *except* class
y = Vector(df_banknote[:, :class]); # output vector: select the class column

Finally, let's partition the data into a `training` and `testing` set so that we can determine how well the model can predict unseen data, i.e., how well the model generalizes.

In [11]:
training, testing = let

    # initialize -
    s = 0.80; # fraction of data for training
    number_of_training_samples = round(Int, s * size(X,1)); # 80% of the data for training
    i = randperm(size(X,1)); # random permutation of the indices
    training_indices = i[1:number_of_training_samples]; # first 80% of the indices
    testing_indices = i[number_of_training_samples+1:end]; # remaining 20% of the indices
    

    # setup training -
    one_vector = ones(number_of_training_samples);
    training = (X=[X[training_indices, :] one_vector], y=y[training_indices]);

    # setup testing -
    one_vector = ones(length(testing_indices));
    testing = (X=[X[testing_indices, :] one_vector], y=y[testing_indices]);

    training, testing;
end;
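
The cell above calls `randperm` without a fixed seed, so the split (and every number computed from it) can change from run to run. One way to make the split repeatable is to pass a seeded random number generator, as in the sketch below. The helper name `split_indices` and the seed value are hypothetical and chosen only for this example.

```julia
using Random

# hypothetical helper: returns (training_indices, testing_indices) for n rows
function split_indices(n::Int; fraction::Float64 = 0.80, seed::Int = 1234)
    rng = MersenneTwister(seed)            # seeded generator, so the shuffle is repeatable
    i = randperm(rng, n)                   # random permutation of 1:n
    n_train = round(Int, fraction * n)     # number of training rows
    return i[1:n_train], i[(n_train + 1):end]
end

train_idx, test_idx = split_indices(1372)  # 1372 = number of rows in the banknote dataset
println("training rows: ", length(train_idx), ", testing rows: ", length(test_idx))
```

With 1372 rows and an 80% training fraction, this gives 1098 training rows and 274 testing rows.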

___

## Gradient descent
Let's develop a simple gradient descent algorithm for this classification problem. We'll first present the general gradient descent algorithm that can handle inequality and equality constraints, then simplify it for our specific unconstrained logistic regression problem where we only need to minimize the cross-entropy loss function $J(\theta)$.

### General gradient descent algorithm
The general algorithm iteratively updates the parameter vector $\theta_k$ using the gradient of an augmented objective function $P_{\mu,\rho}(\theta)$ that includes penalty and barrier terms for constraints.

__Initialization__: Given an initial guess $\theta_0$, set $\mu > 0$ and $\rho > 0$. Specify a tolerance $\epsilon > 0$, a maximum number of iterations $K$, and a step size (learning rate) $\alpha > 0$. Set $\texttt{converged} \gets \texttt{false}$ and the iteration counter $k \gets 0$, and specify values for the penalty update parameters $\tau_{\mu},\tau_{\rho}\in\left(0,1\right)$.

While not $\texttt{converged}$ __do__:
1. Compute the gradient: $\nabla P_{\mu,\rho}(\theta_k) = \nabla f(\theta_k) + \frac{1}{\mu} \sum_{i=1}^m \frac{\nabla g_i(\theta_k)}{-g_i(\theta_k)} + \frac{1}{\rho} \sum_{j=1}^p h_j(\theta_k) \nabla h_j(\theta_k)$ evaluated at the current solution $\theta_k$, where $f(\theta)$ is the objective function, $g_i(\theta)$ are inequality constraints, and $h_j(\theta)$ are equality constraints.
2. Update the solution: $\theta_{k+1} = \theta_k - \alpha \nabla P_{\mu,\rho}(\theta_k)$. $\texttt{Note}$: $\alpha$ is fixed here, but it can be adapted dynamically based on the convergence behavior.
3. Check convergence: 
     - If $\|\theta_{k+1} - \theta_k\|_{2} \leq \epsilon$, set $\texttt{converged} \gets \texttt{true}$. Return $\theta_{k+1}$ as the approximate solution. $\texttt{Note}$: here we look at the Euclidean norm of the difference between the current and next solution. However, many other criteria can be used, such as the change in the objective function value or the gradient norm.
     - If $k \geq K$, set $\texttt{converged} \gets \texttt{true}$. Warn that the maximum number of iterations has been reached without convergence.
4. Increment the iteration counter: $k \gets k + 1$, update $\mu\gets \tau_\mu\,\mu$ and $\rho\gets \tau_\rho\,\rho$ as needed, and repeat.

As $\mu\to0$, the coefficient $\frac{1}{\mu}$ in the barrier term grows, creating an increasingly strong barrier that keeps the solution away from constraint boundaries (where $g_i(\theta)\to 0^-$). Similarly, as $\rho\to0$, the coefficient $\frac{1}{\rho}$ in the penalty term grows, enforcing $h_j(\theta)\to0$ ever more strictly.

### Simplified algorithm for logistic regression
For our logistic regression problem, we have no inequality constraints ($m=0$) and no equality constraints ($p=0$). Thus, the augmented objective function reduces to the original cross-entropy loss function $P_{\mu,\rho}(\theta) = J(\theta)$, and the gradient descent update simplifies to:
$$
\theta_{k+1} = \theta_k - \alpha \nabla J(\theta_k)
$$
This is the form we'll implement below to find the optimal parameters $\theta$ for our binary classifier.
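
The update rule above can be sketched in a few lines. The example below is a minimal, self-contained illustration on a small made-up dataset. It uses a forward finite difference for the gradient and the step-norm stopping rule from the general algorithm. The names `fd_gradient` and `gradient_descent`, the data, and the values of `α`, `ϵ`, and `h` are assumptions for this sketch; the package's `learn(...)` method may differ in its details.

```julia
using LinearAlgebra  # for dot and norm

# cross-entropy loss from the derivation above, summed over the dataset
J(θ, X, y) = sum(log(1 + exp(-2 * y[i] * dot(X[i, :], θ))) for i in eachindex(y))

# forward finite difference approximation of the gradient of f at θ, with step size h
function fd_gradient(f, θ; h = 1e-6)
    f0 = f(θ)
    g = similar(θ)
    for j in eachindex(θ)
        θh = copy(θ)
        θh[j] += h                 # perturb one coordinate at a time
        g[j] = (f(θh) - f0) / h
    end
    return g
end

# fixed-step gradient descent: θ_{k+1} = θ_k - α ∇J(θ_k)
function gradient_descent(f, θ0; α = 0.05, ϵ = 1e-6, maxiter = 10_000)
    θ = copy(θ0)
    for k in 1:maxiter
        θnew = θ - α * fd_gradient(f, θ)
        if norm(θnew - θ) ≤ ϵ      # stop when the step is small enough
            return θnew, k, true
        end
        θ = θnew
    end
    return θ, maxiter, false       # hit the iteration limit without meeting the tolerance
end

# small made-up, non-separable dataset (last column is the bias term)
X = [2.0 1.0; 1.0 1.0; -1.0 1.0; -2.0 1.0; 0.5 1.0; -0.5 1.0]
y = [1, 1, -1, -1, -1, 1]
θ0 = [0.01, 0.01]
θ_hat, iterations, converged = gradient_descent(θ -> J(θ, X, y), θ0)
println("θ_hat = ", θ_hat, ", iterations = ", iterations, ", converged = ", converged)
```

Each finite difference gradient costs one extra loss evaluation per parameter, which is the overhead the technical note in the next section refers to.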
___

## Implementation
We implemented [the `MyLogisticRegressionClassificationModel` type](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/types/#VLDataScienceMachineLearningPackage.MyLogisticRegressionClassificationModel), which contains data required to solve the logistic regression problem, i.e., parameters, the learning rate, a stopping tolerance parameter $\epsilon$, and a loss (objective) function that we want to minimize. 
* __Technical note__: In this implementation, we approximated the gradient calculation using [a forward finite difference](https://en.wikipedia.org/wiki/Finite_difference). In general, this is not a great idea. This is one of my super pet peeves of gradient descent; computing the gradient is usually a hassle, and we do a bunch of function evaluations to get a good approximation of the gradient. However, finite difference is easy to implement.
* __Note on the loss function__: In the code below, we use the natural logarithm `log` in the loss function, which matches the derivation above. You could also use `log10`. That would differ from the derivation, but it doesn't change the location of the minimum, since `log10(x) = log(x)/log(10)` is just the natural log scaled by a positive constant. Gradient descent would aim for the same optimal parameters $\theta$, although the gradient (and hence the effective step size for a given learning rate) is scaled by the same constant.
* In the code block below, we [build a `model::MyLogisticRegressionClassificationModel` instance using a `build(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/factory/#Factory-methods). The model instance initially has a random guess for the classifier parameters. We use gradient descent to refine that guess [using the `learn(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.learn), which returns an updated model instance (with the best parameters that we found so far). 

We return the updated model instance and save it in the `model_logistic::MyLogisticRegressionClassificationModel` variable.

In [47]:
model_logistic = let

    # data -
    X = training.X; # feature matrix
    y = training.y; # labels
    number_of_features = size(X,2); # number of features + 1

    # model
    model = build(MyLogisticRegressionClassificationModel, (
        parameters = 0.01*ones(number_of_features), # initial value for the parameters: these will be updated
        learning_rate = 0.005, # you pick this
        ϵ = 1e-6, # you pick this (this is also the step size for the fd approx to the gradient)
        loss_function = (x,y,θ) -> log(1+exp(-2*y*(dot(x,θ)))) # what??!? Wow, that is nice. Yes, we can pass functions as args!
    ));

    # train -
    model = learn(X,y,model, maxiter = 20000, verbose = true); # this is learning the model parameters

    # return -
    model;
end;

Stopped after number of iterations: 40001. We have error: 142.0130872410051


Let's use the updated `model_logistic::MyLogisticRegressionClassificationModel` instance (that has learned some parameters from the `training` data) and test how well we classify data that we have never seen, i.e., how well we classify the `testing` dataset.

__Inference__: We run the classification operation on the (unseen) test data [using the `classify(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.classify). This method takes a feature array `X` and the (trained) model instance. It returns the probability of each label in the `P::Array{Float64,2}` array (which is different from the Perceptron). Each row of `P` corresponds to a test instance and each column corresponds to a label: in this case, column `1` holds the probability of label `1` and column `2` holds the probability of label `-1`.

We store the actual (correct) label in the `y_banknote_logistic::Array{Int64,1}` vector. We compute the predicted label for each test instance by finding the highest probability column. We store the predicted labels in the `ŷ_banknote_logistic::Array{Int64,1}` vector.

In [43]:
ŷ_banknote_logistic,y_banknote_logistic, P = let

    # initialize -
    X = testing.X; # feature matrix
    y = testing.y; # labels
    number_of_examples = size(X,1); # how many examples do we have (rows)
    number_of_features = size(X,2); # how many features do we have (cols) + 1

    # compute the estimated labels -
    P = classify(X, model_logistic) # logistic regression returns an n x 2 array of label probabilities (one row per test instance)

    # convert the probability to a choice ... for each row (test instance), compute the col with the highest probability
    ŷ = zeros(Int64, number_of_examples); # integer labels, consistent with the Array{Int64,1} type stated above
    for i ∈ 1:number_of_examples
        a = argmax(P[i,:]); # col index with largest value
        ŷ[i] = 1; # default
        if (a == 2)
            ŷ[i] = -1;
        end
    end
    
    # return -
    ŷ, y, P
end;
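
The loop above maps the most probable column of each row to a label. A compact way to express the same mapping is sketched below on a made-up probability matrix. The names `P_example`, `labels`, and `predicted_labels` are hypothetical, and the column-to-label order (column 1 for label `1`, column 2 for label `-1`) is taken from the loop above.

```julia
# made-up probability rows: column 1 ↔ label 1, column 2 ↔ label -1
P_example = [0.996 0.004; 1e-12 1.0; 0.042 0.958; 0.7 0.3]
labels = [1, -1]   # label associated with each column of P_example

# pick the most probable column in every row, then look up its label
predicted_labels = [labels[argmax(row)] for row in eachrow(P_example)]
println(predicted_labels)   # prints [1, -1, -1, 1]
```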

In [45]:
P

274×2 Matrix{Float64}:
 0.995993     0.00400706
 3.46843e-63  1.0
 1.0          1.65177e-26
 1.15496e-26  1.0
 1.0          1.13219e-40
 3.61692e-73  1.0
 1.0          4.60244e-36
 1.125e-61    1.0
 4.35644e-68  1.0
 1.0          6.72363e-20
 3.88299e-66  1.0
 6.4706e-91   1.0
 1.0          3.95856e-30
 ⋮            
 1.0          2.27876e-19
 1.0          9.0244e-37
 1.7604e-42   1.0
 1.0          6.61743e-39
 1.82287e-61  1.0
 2.36476e-71  1.0
 1.0          1.2333e-34
 1.0          5.69645e-24
 1.0          1.17819e-35
 1.57079e-84  1.0
 0.042262     0.957738
 4.31939e-76  1.0

### Confusion Matrix
Let's now compute the __confusion matrix__. The confusion matrix for a binary classifier is typically structured as follows:

|                     | **Predicted Positive** | **Predicted Negative** |
|---------------------|------------------------|------------------------|
| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |

We've implemented [the `confusion(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.confusion) to compute the confusion matrix given the actual and predicted labels. Let's save the confusion matrix in the `CM_logistic::Array{Int64,2}` variable.

In [19]:
CM_logistic = confusion(y_banknote_logistic, ŷ_banknote_logistic)

2×2 Matrix{Int64}:
 112    2
   2  158
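
To make the bookkeeping concrete, the sketch below tallies a confusion matrix by hand for a handful of made-up labels. It assumes the layout of the table above (rows are actual labels, columns are predicted labels, positive class `1` first). The helper `tally_confusion` is hypothetical and is not the package's `confusion(...)` method, whose internals may differ.

```julia
# hypothetical helper: rows = actual, columns = predicted, positive class (label 1) first
function tally_confusion(actual::AbstractVector{<:Integer}, predicted::AbstractVector{<:Integer})
    CM = zeros(Int, 2, 2)
    index(label) = label == 1 ? 1 : 2   # label 1 → row/column 1, label -1 → row/column 2
    for (a, p) in zip(actual, predicted)
        CM[index(a), index(p)] += 1
    end
    return CM
end

# made-up labels for illustration
actual    = [1, 1, 1, -1, -1, -1, -1, 1]
predicted = [1, 1, -1, -1, -1, 1, -1, 1]
CM = tally_confusion(actual, predicted)
println(CM)   # prints [3 1; 1 3]
```

Here three positives and three negatives are classified correctly (the diagonal), with one false negative and one false positive (the off-diagonal entries).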

Now let's compute the accuracy of the classifier on the test data. 

> __Overall accuracy:__ The overall accuracy is the proportion of correctly classified instances among the total instances in the test set. In our case, it is the trace of the confusion matrix (sum of the diagonal elements) divided by the total number of instances. This gives us a measure of the overall performance of the classifier, but does not tell us if we are biased towards one class or the other.

What is the overall accuracy?

In [21]:
let

    # initialize -
    number_of_test_banknotes = length(y_banknote_logistic); # what is the total number of test banknotes
    correct_prediction_logistic = CM_logistic[1,1] + CM_logistic[2,2]; # true positives + true negatives
    (correct_prediction_logistic/number_of_test_banknotes) |> f-> round(f, digits=4) |> f-> println("Overall test fraction correct: $(f) versus incorrect $((1-f) |> x-> round(x, digits=4))");
end

Overall test fraction correct: 0.9854 versus incorrect 0.0146
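
Because overall accuracy does not reveal whether errors concentrate in one class, it can help to also look at per-class rates. The sketch below computes precision and recall for the positive class from the confusion matrix reported above, assuming the `[TP FN; FP TN]` layout of the table in the previous section. The variable names are hypothetical, and the numbers will change if the random split changes.

```julia
CM = [112 2; 2 158]   # confusion matrix reported above

# assuming the layout [TP FN; FP TN]
TP, FN, FP, TN = CM[1, 1], CM[1, 2], CM[2, 1], CM[2, 2]

accuracy      = (TP + TN) / sum(CM)   # trace divided by the number of test instances
precision_pos = TP / (TP + FP)        # of the predicted positives, the fraction that is correct
recall_pos    = TP / (TP + FN)        # of the actual positives, the fraction that is found

println("accuracy = ", round(accuracy, digits = 4),
        ", precision = ", round(precision_pos, digits = 4),
        ", recall = ", round(recall_pos, digits = 4))
```

For this matrix the false positive and false negative counts are equal, so precision and recall coincide.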


___

## Summary
In this activity, we built a logistic regression binary classifier from first principles using gradient descent optimization. 

We derived the cross-entropy loss function from a Boltzmann distribution, implemented the gradient descent algorithm with finite difference gradient approximations, and trained the model on the banknote authentication dataset.

> __Key Takeaways:__
> 
> * **Cross-entropy loss minimization:** We minimized the negative log-likelihood function (cross-entropy loss) to find optimal classifier parameters. The gradient descent algorithm iteratively updated parameters using a learning rate of 0.005 until the stopping tolerance was met or the iteration limit was reached.
> * **Training and testing data split:** We randomly partitioned the banknote dataset into 80% training data and 20% testing data. This split allowed us to evaluate how well the trained model generalizes to unseen examples.
> * **Performance evaluation with confusion matrix:** We computed a confusion matrix to assess the classifier's performance on the test set. The matrix shows true positives, true negatives, false positives, and false negatives for genuine versus forged banknote classifications.

The logistic regression model successfully learned to classify banknotes based on wavelet-transformed image features, demonstrating the effectiveness of gradient descent for parameter estimation in binary classification problems.
___