# Activity: Linear Models for Classification
In this activity, we will use the perceptron algorithm to solve a binary classification problem: distinguishing genuine banknotes from forgeries.

> __Learning Objectives:__
> 
> * **Train a classifier**: You will train a perceptron classifier to distinguish between genuine and forged banknotes using real image features.
> * **Test model performance**: You will test your trained model on unseen data to measure how well it performs on new examples.
> * **Analyze classification errors**: You will analyze classification mistakes using confusion matrices to understand false positives and false negatives.
>

Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

Let's set up our code environment:

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
The dataset we will explore is the [banknote authentication dataset from the UCI archive](https://archive.ics.uci.edu/dataset/267/banknote+authentication). This dataset has `1372` instances with 4 continuous features and an integer $\{-1,1\}$ class variable. 

> __Description of the dataset__ 
> 
> * Data were extracted from images taken from genuine and forged banknote-like specimens. An industrial camera, usually used for print inspection, was used for digitization. The final images have 400x400 pixels. Due to the object lens and distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. Wavelet Transform tools were used to extract features from images.
> * __Features__: The data has four continuous features from each image: `variance` of the wavelet transformed image, `skewness` of the wavelet transformed image, kurtosis of the wavelet transformed image (stored in the `curtosis` column), and the `entropy` of the wavelet transformed image. The class is $\{-1,1\}$ where a class value of `-1` indicates genuine and `1` indicates forged.

We've included this dataset in [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl) and have provided [the `MyBanknoteAuthenticationDataset(...)` helper function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyBanknoteAuthenticationDataset) for easy access. 

This method returns the data in [a `DataFrame` instance](https://github.com/JuliaData/DataFrames.jl), which we'll save in the `df_banknote` variable.

In [2]:
df_banknote =  MyBanknoteAuthenticationDataset();

In [3]:
df_banknote

| Row | variance | skewness | curtosis | entropy | class |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Int64 |
| 1 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | -1 |
| 2 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | -1 |
| 3 | 3.866 | -2.6383 | 1.9242 | 0.10645 | -1 |
| 4 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | -1 |
| 5 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | -1 |
| 6 | 4.3684 | 9.6718 | -3.9606 | -3.1625 | -1 |
| 7 | 3.5912 | 3.0129 | 0.72888 | 0.56421 | -1 |
| 8 | 2.0922 | -6.81 | 8.4636 | -0.60216 | -1 |
| 9 | 3.2032 | 5.7588 | -0.75345 | -0.61251 | -1 |
| 10 | 1.5356 | 9.1772 | -2.2718 | -0.73535 | -1 |


Now let's split the dataset into the system input matrix $\mathbf{X}$ (independent variables, characteristics of the banknote) and the output vector $\mathbf{y}$ (dependent variable, the banknote class).

The input matrix $\mathbf{X}$ will contain all the columns except for the `class` column (the output variable). The output vector $\mathbf{y}$ will contain only the `class` column.

In [4]:
X = Matrix(df_banknote[:, Not(:class)]); # data matrix: select all the columns *except* class
y = Vector(df_banknote[:, :class]); # output vector: select the class column

Finally, let's partition the data into a `training` and `testing` set so that we can determine how well the model can predict unseen data, i.e., how well the model generalizes.

In [5]:
training, testing = let

    # initialize -
    s = 0.80; # fraction of data for training
    number_of_training_samples = round(Int, s * size(X,1)); # 80% of the data for training
    i = randperm(size(X,1)); # random permutation of the row indices
    training_indices = i[1:number_of_training_samples]; # first 80% of the shuffled indices
    testing_indices = i[number_of_training_samples+1:end]; # remaining 20% of the shuffled indices
    

    # setup training: append a column of ones so that the bias b is learned as the last parameter -
    one_vector = ones(number_of_training_samples);
    training = (X=[X[training_indices, :] one_vector], y=y[training_indices]);

    # setup testing: same augmentation for the held-out rows -
    one_vector = ones(length(testing_indices));
    testing = (X=[X[testing_indices, :] one_vector], y=y[testing_indices]);

    training, testing;
end;
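
Because the split above uses a random permutation, the results change from run to run. The following is a minimal sketch, not part of the package, of how the same 80/20 split could be made reproducible with a seeded random number generator. The function name `split_indices`, the seed value, and the small `X_small` matrix are hypothetical and used only for illustration.

```julia
using Random

# Hypothetical helper (not part of the package): split the row indices 1:n into
# training and testing indices using a seeded random number generator.
function split_indices(n::Int; fraction::Float64 = 0.80, seed::Int = 1234)
    rng = MersenneTwister(seed)           # fixed seed => same permutation on every run
    idx = randperm(rng, n)                # random permutation of 1:n
    ntrain = round(Int, fraction * n)     # number of training rows
    return idx[1:ntrain], idx[(ntrain + 1):end]
end

train_idx, test_idx = split_indices(1372) # 1372 = number of rows in the banknote dataset

# Augmentation on a made-up 3 x 2 matrix: append a column of ones for the bias term
X_small = [1.0 2.0; 3.0 4.0; 5.0 6.0]
X_aug = [X_small ones(size(X_small, 1))]
```

With `n = 1372` and `fraction = 0.80`, this gives 1098 training rows and 274 testing rows, the same sizes as the split above.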

___

## Implement the Perceptron Algorithm
We now train a perceptron classifier on the banknote authentication dataset. We'll use the online learning version of the perceptron algorithm, in which the parameters are updated each time a misclassified example is encountered.

Let's examine the pseudocode for the Perceptron learning algorithm. 

__Initialize__: Given a linearly separable dataset $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with labels $y_{i}\in\{-1,1\}$, the maximum number of iterations $T$, and the maximum number of mistakes $M$ (e.g., $M=1$), initialize the parameter vector $\theta = \left(\mathbf{w}, b\right)$ to small random values and set the loop counter $t\gets{0}$. Each feature vector $\mathbf{x}$ is augmented with a trailing `1`, so that $\theta^{\top}\mathbf{x}$ includes the bias $b$.

> **Rule of thumb for $T$**: Set $T = 10n$ to $100n$, where $n$ is the number of training examples. The algorithm often converges faster for linearly separable data.

While $\texttt{true}$ __do__:
1. Initialize the number of mistakes $\texttt{mistakes} = 0$.
2. For each training example $(\mathbf{x}, y) \in \mathcal{D}$: check whether $y\;\left(\theta^{\top}\;\mathbf{x}\right)\leq{0}$. If this condition is $\texttt{true}$, then the training example $(\mathbf{x}, y)$ is misclassified: update the parameter vector $\theta \gets \theta + y\;\mathbf{x}$ and increment the error counter $\texttt{mistakes} \gets \texttt{mistakes} + 1$.
3. After processing all training examples, if $\texttt{mistakes} \leq {M}$ or $t \geq T$, break the loop. Otherwise, increment the loop counter $t \gets t + 1$ and repeat from step 1.
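
The steps above can be sketched in a few lines of Julia. This is a minimal illustration on made-up toy data, not the package's `learn(...)` method. The function name `perceptron_sketch` and the toy arrays are hypothetical, and the sketch starts from zero parameters (rather than small random values) so that the result is deterministic.

```julia
using LinearAlgebra

# Hypothetical sketch of the loop above; this is NOT the package's learn(...) method.
# X is the augmented feature matrix (last column all ones), y holds labels in {-1, 1}.
function perceptron_sketch(X::Matrix{Float64}, y::Vector{Int}; T::Int = 1000, M::Int = 0)
    θ = zeros(size(X, 2))   # assumption: start from zeros instead of small random values
    t = 0                   # loop counter
    mistakes = 0
    while true
        mistakes = 0                          # step 1: reset the mistake counter
        for i in axes(X, 1)                   # step 2: one pass over the examples
            x = X[i, :]
            if y[i] * dot(θ, x) <= 0          # misclassified (or exactly on the boundary)
                θ .+= y[i] .* x               # update rule: θ ← θ + y x
                mistakes += 1
            end
        end
        (mistakes <= M || t >= T) && break    # step 3: stopping conditions
        t += 1
    end
    return θ, mistakes, t
end

# Toy linearly separable data: two points per class, bias column appended
X_toy = [ 2.0  1.0 1.0;
          3.0  2.0 1.0;
         -2.0 -1.0 1.0;
         -3.0 -2.0 1.0]
y_toy = [1, 1, -1, -1]

θ_toy, mistakes_toy, t_toy = perceptron_sketch(X_toy, y_toy)
```

On this toy problem, a single update is enough: the second pass over the data makes no mistakes, so the loop stops.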

__Hypothesis__: If the banknote dataset $\mathcal{D}$ is linearly separable, the Perceptron will _incrementally_ learn a separating hyperplane in a finite number of passes through the dataset $\mathcal{D}$. However, if the dataset $\mathcal{D}$ is not linearly separable, the Perceptron may not converge. Check out the [perceptron pseudocode here!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf)

__Training__: Our Perceptron implementation [based on pseudocode](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf) stores problem information in [a `MyPerceptronClassificationModel` instance, which holds the (initial) parameters and other data](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/types/#VLDataScienceMachineLearningPackage.MyPerceptronClassificationModel) required by the problem. We then _learn_ the model parameters [using the `learn(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.learn), which takes the (augmented) training features array `X`, the training labels vector `y`, and the problem instance and returns an updated problem instance holding the updated parameters.

We save the trained classifier in the `model::MyPerceptronClassificationModel` variable.

In [None]:
model = let

    # data -
    X = training.X; # input matrix
    y = training.y; # output vector
    number_of_examples = size(X,1); # how many examples do we have (rows)
    number_of_features = size(X,2); # how many features do we have (cols)?

    # model
    model = build(MyPerceptronClassificationModel, (
        parameters = ones(number_of_features), # initial value for the parameters: these will be updated
        mistakes = 0 # willing to live with m mistakes
    ));

    # train -
    model = learn(X,y,model, maxiter = 1000, verbose = true);

    # return -
    model;
end;

Stopped after number of iterations: 1000. We have number of errors: 12


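In the run shown above, training stopped at the `maxiter = 1000` limit with the log reporting `12` errors, so the perceptron did not find a hyperplane that perfectly separates this training split. Now that we have parameters estimated from the `training` data, we can use those parameters on the `testing` dataset to see how well the model can differentiate between an actual banknote and a forgery on data it has never seen. 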
> __Inference__: We run the classification operation on the (unseen) test data [using the `classify(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.classify). This method takes a feature array `X` and the (trained) model instance. It returns the estimated labels. We store the actual (correct) labels in the `y_banknote::Array{Int64,1}` vector, while the model predicted labels are stored in the `ŷ_banknote::Array{Int64,1}` array.

Let's run the classifier on the `testing` data and see how well it performs.

In [7]:
ŷ_banknote,y_banknote = let

    X = testing.X; # what dataset are we going to use?
    y = testing.y; # what are the actual labels?
    number_of_examples = size(X,1); # how many examples do we have (rows)
    number_of_features = size(X,2); # how many features do we have (cols)?

    # compute the estimated labels -
    ŷ = classify(X,model)

    # return -
    ŷ,y
end;
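
To build intuition for the inference step, here is a minimal sketch of linear classification, assuming the label is the sign of the linear score $\theta^{\top}\mathbf{x}$. This is an assumption about what a perceptron classifier does, not the source of the package's `classify(...)` method. The name `classify_sketch`, the parameter values, and the treatment of a zero score are hypothetical.

```julia
# Hypothetical sketch of linear classification; NOT the package's classify(...) source.
# X is the augmented feature matrix (last column all ones), θ = (w, b).
function classify_sketch(X::Matrix{Float64}, θ::Vector{Float64})
    scores = X * θ                               # one linear score per row
    return [s >= 0 ? 1 : -1 for s in scores]     # assumption: a score of exactly 0 maps to +1
end

θ_demo = [2.0, 1.0, 1.0]                         # made-up parameters (w₁, w₂, b)
X_two = [ 1.0  1.0 1.0;
         -1.0 -1.0 1.0]
ŷ_two = classify_sketch(X_two, θ_demo)
```

The first row has score `4.0` and is labeled `1`, while the second row has score `-2.0` and is labeled `-1`.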

How many mistakes did the classifier make on the `testing` dataset? Let's count the number of times $\hat{y}_{i}\neq{y}_{i}$, i.e., when the inference predicts the wrong label.
> __Note__: Does having the total error count tell us the whole story? It would be helpful to know how many were false positives and how many were false negatives. Let's compute those values as well. Let's start with the total number of errors.

Let's store the total number of errors in the `number_of_prediction_mistakes::Int64` variable.

In [8]:
number_of_prediction_mistakes = let

    number_of_test_examples = length(ŷ_banknote);
    error_counter = 0;

    for i ∈ 1:number_of_test_examples
        if (ŷ_banknote[i] != y_banknote[i])
            error_counter += 1;
        end
    end
    
    error_counter
end;

In [9]:
println("Perceptron mistake percentage: $((number_of_prediction_mistakes/length(ŷ_banknote))*100)%")

Perceptron mistake percentage: 1.094890510948905%
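
The explicit loop above can also be written as a broadcast comparison. The sketch below uses made-up label vectors (not the notebook's results) to show the equivalent one-line count.

```julia
# Made-up labels for illustration only
ŷ_example = [1, -1, 1, 1, -1]    # predicted labels
y_example = [1, -1, -1, 1, 1]    # actual labels

n_wrong = count(ŷ_example .!= y_example)          # number of positions where the labels differ
pct_wrong = 100 * n_wrong / length(y_example)     # mistake percentage
```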


If we were predicting a continuous variable, we could compute the residuals of the predictions, but since we are predicting a categorical variable, we can only count the number of times we predicted the wrong label.

> __Error analysis__: Knowing the total number of mistakes is only part of the story. It is useful to know if we are biased towards false positives or false negatives, that is, how many times did we predict a banknote was a forgery when it was genuine (false positive), and how many times did we predict a banknote was genuine when it was a forgery (false negative).

We can get the false positive and false negative counts by computing the __confusion matrix__. We've implemented a helper function, [the `confusion(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.confusion), that computes the confusion matrix for us. This method takes the actual labels vector `y` and the estimated labels vector `ŷ` and returns the confusion matrix.

In [10]:
confusion_matrix = confusion(y_banknote,ŷ_banknote); # important: actual labels first, estimated labels second

In [11]:
confusion_matrix

2×2 Matrix{Int64}:
 112    2
   1  159

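According to the confusion matrix, how many false positives and false negatives did we have? We have `2` false positives and `1` false negative, i.e., `3` mistakes in total out of `274` test examples, which is consistent with the mistake percentage computed above.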
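
The four outcomes can also be tallied directly from the label vectors. The following is a minimal sketch on made-up labels, treating `1` (forged) as the positive class and `-1` (genuine) as the negative class. The helper `confusion_counts` is hypothetical: it returns named counts rather than a matrix, and it makes no claim about how the package's `confusion(...)` method arranges its output.

```julia
# Hypothetical helper: tally the four outcomes, with +1 (forged) as the positive
# class and -1 (genuine) as the negative class. y = actual labels, ŷ = predicted labels.
function confusion_counts(y::Vector{Int}, ŷ::Vector{Int})
    TP = count((y .== 1)  .& (ŷ .== 1))     # forged, predicted forged
    FN = count((y .== 1)  .& (ŷ .== -1))    # forged, predicted genuine
    FP = count((y .== -1) .& (ŷ .== 1))     # genuine, predicted forged
    TN = count((y .== -1) .& (ŷ .== -1))    # genuine, predicted genuine
    return (TP = TP, FP = FP, FN = FN, TN = TN)
end

# Made-up labels for illustration only
y_made_up = [1, 1, 1, -1, -1, -1, -1]
ŷ_made_up = [1, 1, -1, -1, -1, 1, 1]
counts = confusion_counts(y_made_up, ŷ_made_up)
```

The four counts always sum to the number of examples, and `FP + FN` equals the total number of mistakes.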

> __Note__: A perfect classifier would have `0` false positives and `0` false negatives. The results we reported here may vary slightly each time you run the notebook because of the random partitioning of the data into `training` and `testing` sets.

Not bad for a simple linear classifier! We could take this further by computing other metrics such as accuracy, precision, recall, and specificity from the confusion matrix. However, we'll leave that for another time.
___