# Example: Linear Models for Classification
In this example, we implement the Perceptron algorithm for a binary classification problem.

> __Learning Objectives:__
> 
> * **Train a classifier**: Implement the Perceptron algorithm and understand how it learns a linear decision boundary. Train a classifier to distinguish between genuine and forged banknotes using image features, and observe how the algorithm updates parameters with each pass through the data.
> * **Test model performance**: Evaluate your trained model on unseen test data to measure how well it generalizes. Understand the difference between training error and test error, and why we evaluate on data the model has never seen.
> * **Analyze classification errors**: Use confusion matrices to break down classification mistakes into false positives and false negatives. Interpret what these errors mean for your specific problem and understand the trade-offs between different types of errors.
>

Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading required resources.

> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the input source file `Include.jl` in the notebook's global scope. This file sets paths, loads packages, and more. For more information on functions and types, see the [Julia documentation](https://docs.julialang.org/en/v1/). 

Let's set up the code environment:


In [3]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

We also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). See [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for details on functions, types, and data.


### Data
We use the [banknote authentication dataset from UCI](https://archive.ics.uci.edu/dataset/267/banknote+authentication), which has 1372 instances with 4 continuous features and a $\{-1,1\}$ class variable. 

> __Dataset Description__ 
> 
> * Images of genuine and forged banknote specimens were captured using an industrial camera. The final images are 400x400 pixels, gray-scale, with a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the images.
> * __Features__: Four continuous features from each image: the `variance`, `skewness`, `kurtosis`, and `entropy` of the wavelet transform (the kurtosis column is spelled `curtosis` in the data). The class is $\{-1,1\}$ where $-1$ indicates genuine and $1$ indicates forged.

The [VLDataScienceMachineLearningPackage.jl](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl) includes this dataset. We use the [MyBanknoteAuthenticationDataset(...) helper function](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/data/#VLDataScienceMachineLearningPackage.MyBanknoteAuthenticationDataset) for easy access. 

This method returns the data in a [DataFrame](https://github.com/JuliaData/DataFrames.jl), which we save in the `df_banknote` variable.


In [6]:
df_banknote =  MyBanknoteAuthenticationDataset();

In [7]:
df_banknote

| Row | variance (Float64) | skewness (Float64) | curtosis (Float64) | entropy (Float64) | class (Int64) |
|---|---|---|---|---|---|
| 1 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | -1 |
| 2 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | -1 |
| 3 | 3.866 | -2.6383 | 1.9242 | 0.10645 | -1 |
| 4 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | -1 |
| 5 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | -1 |
| 6 | 4.3684 | 9.6718 | -3.9606 | -3.1625 | -1 |
| 7 | 3.5912 | 3.0129 | 0.72888 | 0.56421 | -1 |
| 8 | 2.0922 | -6.81 | 8.4636 | -0.60216 | -1 |
| 9 | 3.2032 | 5.7588 | -0.75345 | -0.61251 | -1 |
| 10 | 1.5356 | 9.1772 | -2.2718 | -0.73535 | -1 |


Now we split the dataset into the input matrix $\mathbf{X}$ (independent variables, banknote features) and output vector $\mathbf{y}$ (dependent variable, banknote class).

We create $\mathbf{X}$ with all columns except the `class` column, and $\mathbf{y}$ with only the `class` column.


In [9]:
X = Matrix(df_banknote[:, Not(:class)]); # data matrix: select all the columns *except* class
y = Vector(df_banknote[:, :class]); # output vector: select the class column

Finally, we partition the data into `training` and `testing` sets. This allows us to measure how well the model generalizes to unseen data.


In [36]:
training, testing = let

    # initialize -
    s = 0.80; # fraction of data for training
    number_of_training_samples = round(Int, s * size(X,1)); # 80% of the data for training
    i = randperm(size(X,1)); # random permutation of the row indices
    training_indices = i[1:number_of_training_samples]; # first 80% of the shuffled indices
    testing_indices = i[number_of_training_samples+1:end]; # remaining 20% of the shuffled indices
    

    # setup training: append a column of ones so the bias is learned as the last parameter -
    one_vector = ones(number_of_training_samples);
    training = (X=[X[training_indices, :] one_vector], y=y[training_indices]);

    # setup testing -
    one_vector = ones(length(testing_indices));
    testing = (X=[X[testing_indices, :] one_vector], y=y[testing_indices]);

    training, testing;
end;
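
The column of ones appended to `X` in the cell above lets the bias be treated as one more parameter. The following is a minimal sketch of that idea on toy numbers; the values and the names `X_toy`, `w`, `b`, and `θ` are illustrative assumptions, not part of the banknote data.

```julia
# toy data (hypothetical values, not from the banknote dataset)
X_toy = [1.0 2.0; 3.0 4.0; 5.0 6.0]; # 3 examples, 2 features
w = [0.5, -1.0];                      # weights
b = 2.0;                              # bias

# augment: append a column of ones so the bias becomes the last parameter
X_aug = [X_toy ones(size(X_toy, 1))];
θ = [w; b];

# both forms produce the same scores
scores_explicit = X_toy * w .+ b;
scores_augmented = X_aug * θ;
@assert scores_explicit ≈ scores_augmented
```

With this layout, a single product `X_aug * θ` computes $\mathbf{w}^{\top}\mathbf{x} + b$ for every example.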

In [42]:
training.y

1098-element Vector{Int64}:
  1
  1
 -1
  1
 -1
 -1
 -1
  1
  1
  1
 -1
 -1
  1
  ⋮
 -1
 -1
  1
  1
 -1
 -1
 -1
 -1
  1
 -1
  1
  1

___

## Implement the Perceptron Algorithm
We implement the Perceptron algorithm for the banknote dataset using the online learning version.

Here is the Perceptron learning algorithm pseudocode: 

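__Initialize__: Given a linearly separable dataset $\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with labels $y_{i}\in\{-1,1\}$, where each feature vector $\mathbf{x}_{i}$ is augmented with a trailing $1$ so that the bias $b$ is absorbed into the parameter vector. Choose the maximum number of iterations $T$ and the maximum number of mistakes $M$ (e.g., $M=1$), initialize the parameter vector $\theta = \left(\mathbf{w}, b\right)$ to small random values, and set the loop counter $t\gets{0}$.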

> **Rule of thumb for $T$**: Set $T = 10n$ to $100n$, where $n$ is the number of training examples. Convergence is usually faster for linearly separable data.

While $\texttt{true}$ __do__:
1. Initialize mistakes $\texttt{mistakes} = 0$.
2. For each training example $(\mathbf{x}, y) \in \mathcal{D}$: check whether $y\;\left(\theta^{\top}\;\mathbf{x}\right)\leq{0}$. If so, the example is misclassified: update $\theta \gets \theta + y\;\mathbf{x}$ and increment $\texttt{mistakes} \gets \texttt{mistakes} + 1$.
3. After processing all examples, if $\texttt{mistakes} \leq {M}$ or $t \geq T$, exit. Otherwise, increment $t \gets t + 1$ and repeat from step 1.

__Convergence__: If the dataset $\mathcal{D}$ is linearly separable, the Perceptron converges to a separating hyperplane in finite iterations. If not linearly separable, the Perceptron may not converge. See the [Perceptron pseudocode](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf) for details.
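
The pseudocode above can be sketched in a few lines of Julia. This is a minimal illustration under stated assumptions, not the package's `learn(...)` implementation: the function name `perceptron_sketch`, its keyword arguments, and the toy data are all hypothetical, and the feature matrix is assumed to already carry a trailing column of ones.

```julia
using LinearAlgebra # for dot

# Hypothetical helper that follows the pseudocode; not the package's learn(...).
function perceptron_sketch(X::Matrix{Float64}, y::Vector{Int}; T::Int = 1000, M::Int = 0)
    n, d = size(X)
    θ = 0.01 .* randn(d)   # small random initial values
    t = 0
    while true
        mistakes = 0
        for i in 1:n
            x = X[i, :]
            if y[i] * dot(θ, x) <= 0   # misclassified (or on the boundary)
                θ .+= y[i] .* x        # perceptron update
                mistakes += 1
            end
        end
        # exit when a pass makes at most M mistakes, or the pass limit is reached
        if mistakes <= M || t >= T
            return θ, mistakes, t
        end
        t += 1
    end
end

# toy, linearly separable data (last column is the bias term)
X_toy = [2.0 1.0 1.0; 1.0 3.0 1.0; -1.0 -2.0 1.0; -2.0 -1.0 1.0];
y_toy = [1, 1, -1, -1];
θ, mistakes, passes = perceptron_sketch(X_toy, y_toy);
```

Because the toy data are separable, the loop exits once a full pass produces no mistakes. On data that are not separable, the same loop would instead stop at the pass limit `T`.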

__Training__: Our Perceptron implementation stores problem data in a [MyPerceptronClassificationModel instance](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/types/#VLDataScienceMachineLearningPackage.MyPerceptronClassificationModel). We learn parameters using the [learn(...) method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.learn), which takes the feature array `X`, labels vector `y`, and problem instance, returning an updated instance with the learned parameters.

The trained classifier is stored in the `model` variable.


In [14]:
model = let

    # data -
    X = training.X; # input matrix
    y = training.y; # output vector
    number_of_examples = size(X,1); # how many examples do we have (rows)
    number_of_features = size(X,2); # how many features do we have (cols)?

    # model
    model = build(MyPerceptronClassificationModel, (
        parameters = ones(number_of_features), # initial value for the parameters: these will be updated
        mistakes = 0 # willing to live with m mistakes
    ));

    # train -
    model = learn(X,y,model, maxiter = 1000, verbose = true);

    # return -
    model;
end;

Stopped after number of iterations: 1000. We have number of errors: 12


In [50]:
typeof(model) |> T-> fieldnames(T)

(:β, :mistakes)

In [56]:
model.mistakes # the mistake threshold M set when we built the model (not the number of training errors)

0

Now we evaluate the trained model on unseen test data.

> __Inference__: We run classification on test data using the [classify(...) method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.classify). This takes the feature array `X` and trained model, returning estimated labels. We store actual labels in `y_banknote` and predicted labels in `ŷ_banknote`.

Let's evaluate the classifier on test data.


In [58]:
ŷ_banknote,y_banknote = let

    X = testing.X; # what dataset are we going to use?
    y = testing.y; # what are the actual labels?
    number_of_examples = size(X,1); # how many examples do we have (rows)
    number_of_features = size(X,2); # how many features do we have (cols)?

    # compute the estimated labels -
    ŷ = classify(X,model)

    # return -
    ŷ,y
end;
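
As a rough picture of what a linear classifier does at inference time, each example is scored with $\theta^{\top}\mathbf{x}$ and labeled by the sign of the score. The sketch below assumes this rule; `classify_sketch` is a hypothetical stand-in, and the package's `classify(...)` may differ in details such as how a score of exactly zero is handled.

```julia
# Hypothetical stand-in for the decision rule: label by the sign of the score.
# A score of exactly zero is mapped to -1 here, which is an arbitrary choice.
classify_sketch(X::Matrix{Float64}, θ::Vector{Float64}) = [s > 0 ? 1 : -1 for s in X * θ]

# toy inputs (last column is the bias term); values are illustrative only
X_toy = [1.0 2.0 1.0; -3.0 0.5 1.0];
θ_toy = [1.0, 1.0, -0.5];
ŷ_toy = classify_sketch(X_toy, θ_toy);
```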

How many mistakes did the classifier make on the test dataset? We count the number of test examples for which $\hat{y}_{i}\neq{y}_{i}$, that is, the number of wrong predictions.

> __Note__: The total error count alone doesn't tell the whole story. We also need to know how many of the errors were false positives and how many were false negatives. We compute the total first and then break it down with a confusion matrix.

We store the total number of errors in the `number_of_prediction_mistakes` variable.


In [18]:
number_of_prediction_mistakes = let

    number_of_test_examples = length(ŷ_banknote);
    error_counter = 0;

    for i ∈ 1:number_of_test_examples
        if (ŷ_banknote[i] != y_banknote[i])
            error_counter += 1;
        end
    end
    
    error_counter
end;

In [19]:
println("Perceptron mistake percentage: $((number_of_prediction_mistakes/length(ŷ_banknote))*100)%")

Perceptron mistake percentage: 1.094890510948905%
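
The counting loop above can also be written with broadcasting. The following sketch uses small made-up label vectors (`ŷ_toy` and `y_toy` are hypothetical) so that it runs on its own.

```julia
using Statistics # for mean

ŷ_toy = [1, -1, 1, 1, -1];   # hypothetical predicted labels
y_toy = [1, -1, -1, 1, -1];  # hypothetical actual labels

n_mistakes = count(ŷ_toy .!= y_toy);  # number of mismatched labels
error_rate = mean(ŷ_toy .!= y_toy);   # fraction of mismatched labels
println("mistakes: $(n_mistakes), error rate: $(100 * error_rate)%")
```

The same two expressions would apply to `ŷ_banknote` and `y_banknote` in place of the toy vectors.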


For categorical predictions, a prediction is either right or wrong, so we count wrong labels rather than computing residuals as we would for continuous predictions.

> __Error analysis__: Total mistakes is only part of the story. We should understand whether we are biased toward false positives or false negatives. How many times did we predict a banknote was forged when it was genuine (false positive)? How many times did we predict genuine when it was forged (false negative)?

We compute the __confusion matrix__ to get these counts. We use the [confusion(...) method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.confusion), which takes actual labels and estimated labels, returning the confusion matrix.


In [21]:
confusion_matrix = confusion(y_banknote,ŷ_banknote); # important: actual labels first, estimated labels second

In [22]:
confusion_matrix

2×2 Matrix{Int64}:
 112    2
   1  159

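According to the confusion matrix, we have 2 false positives and 1 false negative. The diagonal entries (112 and 159) count correct predictions, and the off-diagonal entries (2 and 1) account for the 3 mistakes out of 274 test examples counted above. Which off-diagonal entry holds the false positives depends on the row and column ordering used by `confusion(...)`, so check its documentation when reading the matrix.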
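
To make the four outcome types concrete, here is a minimal sketch that tallies them directly, treating forged ($1$) as the positive class. The helper `tally_errors` and the toy labels are hypothetical, and the helper returns a named tuple rather than the 2×2 matrix that `confusion(...)` returns.

```julia
# Hypothetical helper: count each outcome type, with 1 (forged) as the positive class.
function tally_errors(y::Vector{Int}, ŷ::Vector{Int})
    tp = count((y .== 1) .& (ŷ .== 1))    # forged, predicted forged
    fn = count((y .== 1) .& (ŷ .== -1))   # forged, predicted genuine
    fp = count((y .== -1) .& (ŷ .== 1))   # genuine, predicted forged
    tn = count((y .== -1) .& (ŷ .== -1))  # genuine, predicted genuine
    return (tp = tp, fn = fn, fp = fp, tn = tn)
end

y_toy = [1, 1, -1, -1, -1, 1];  # hypothetical actual labels
ŷ_toy = [1, -1, -1, 1, -1, 1];  # hypothetical predicted labels
counts = tally_errors(y_toy, ŷ_toy);
```

Named fields avoid any ambiguity about which matrix entry corresponds to which outcome.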

> __Note__: A perfect classifier has 0 false positives and 0 false negatives. Results may vary slightly each run due to random data partitioning.

This is good performance for a simple linear classifier! We could compute additional metrics (accuracy, precision, recall, specificity) from the confusion matrix, but we'll leave that for now.
___

## Summary

In this activity, we implemented the Perceptron algorithm to classify genuine and forged banknotes:

> __Key takeaways:__
>
> 1. **Training data preparation**: We split the banknote dataset into training (80%) and testing (20%) sets to evaluate generalization. The input features $\mathbf{X}$ included four wavelet-based measurements, and the labels $\mathbf{y}$ were $-1$ for genuine and $1$ for forged banknotes.
> 2. **Perceptron learning**: We trained the model using the online Perceptron algorithm, which incrementally updates parameters when misclassifications occur. In the run shown, training stopped at the 1000-iteration limit with 12 errors reported, rather than reaching zero mistakes.
> 3. **Performance evaluation**: On the test data, the classifier misclassified 3 of 274 examples (about 1.1%) in the run shown, and we broke those errors down using a confusion matrix. The matrix separates true positives, false positives, false negatives, and true negatives, enabling detailed error analysis beyond simple accuracy.

The Perceptron provides a straightforward approach to linear classification. Understanding its behavior and limitations prepares you for more advanced methods that handle complex, non-linearly separable data.

___