# PS2: Logistic Regression Classification of a Clinical Dataset

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file. The `Include.jl` file loads external packages, various functions that we will use in the exercise, and custom types to model the components of our lab problem.

In [3]:
include("Include.jl");

### Data
Next, let's load up the dataset that we will explore. The data for this lab was taken from this `2020` publication:
* [Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone." BMC Medical Informatics and Decision Making 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5](https://pubmed.ncbi.nlm.nih.gov/32013925/)

In this paper, the authors analyzed a dataset of `299` heart failure patients collected in 2015. The patients comprised 105 women and 194 men, aged between 40 and 95 years old. The dataset contains `13` features (a mixture of continuous/categorical data and the label), which report clinical, body, and lifestyle information:
* Some features are binary: anemia, high blood pressure, diabetes, sex, and smoking status.
* The remaining features are continuous measurements, such as the level of the creatinine phosphokinase (CPK) enzyme in the blood, the number of platelets, etc.
* The class (target) variable is encoded as a binary (boolean) death event: `1` if the patient died during the follow-up period, `0` if the patient did not die during the follow-up period.

We'll load this dataset as a [DataFrame instance](https://dataframes.juliadata.org/stable/) and store it in the `originaldataset::DataFrame` variable:

In [5]:
originaldataset = CSV.read(joinpath(_PATH_TO_DATA, "heart_failure_clinical_records_dataset.csv"), DataFrame)

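| Row | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | death_event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 2 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.0 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 3 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 4 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 5 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| 6 | 90.0 | 1 | 47 | 0 | 40 | 1 | 204000.0 | 2.1 | 132 | 1 | 1 | 8 | 1 |
| 7 | 75.0 | 1 | 246 | 0 | 15 | 0 | 127000.0 | 1.2 | 137 | 1 | 0 | 10 | 1 |
| 8 | 60.0 | 1 | 315 | 1 | 60 | 0 | 454000.0 | 1.1 | 131 | 1 | 1 | 10 | 1 |
| 9 | 65.0 | 0 | 157 | 0 | 65 | 0 | 263358.0 | 1.5 | 138 | 0 | 0 | 10 | 1 |
| 10 | 80.0 | 1 | 123 | 0 | 35 | 1 | 388000.0 | 9.4 | 133 | 1 | 1 | 10 | 1 |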


#### Data wrangling
We know from lecture that our classification algorithms work on [`Matrix` instances](https://docs.julialang.org/en/v1/base/arrays/#Base.Matrix-Tuple{UndefInitializer,%20Any,%20Any}) and not [`DataFrame` instances](https://dataframes.juliadata.org/stable/). Thus, we need to convert the data to [a `Matrix`](https://docs.julialang.org/en/v1/base/arrays/#Base.Matrix-Tuple{UndefInitializer,%20Any,%20Any}). In addition, there are several ways we can pretreat the data to make the classification easier.
* Convert `0,1` data to `-1,1`. This is a preference (not technically required), but it makes (binary) classification problems easier, so let's convert all categorical `0,1` data to `-1,1`. In this rescaled data, `0` will be replaced by `-1`. Thus, _false_ (no death event) will be mapped to `-1`, and _true_ (death event) will remain `1`.
* Next, let's [z-score center](https://en.wikipedia.org/wiki/Feature_scaling) the continuous feature data. In [z-score feature scaling](https://en.wikipedia.org/wiki/Feature_scaling), we subtract off the mean of each feature and then divide by the standard deviation, i.e., $x^{\prime} = (x - \mu)/\sigma$ where $x$ is the unscaled data, and $x^{\prime}$ is the scaled data. Under this scaling regime, $x^{\prime}\leq{0}$ will be values that are less than or equal to the mean value $\mu$, while $x^{\prime}>0$ indicates values that are greater than the mean. The scaled data is measured in units of the standard deviation $\sigma$.
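
As a quick illustration of the scaling formula, here is a minimal sketch on a made-up vector. The `zscore_sketch` helper and the values are hypothetical (they are not part of the lab code); `mean` and `std` come from Julia's `Statistics` standard library, and `std` uses the sample ($n-1$) normalization by default.

```julia
using Statistics

# hypothetical helper: z-score a vector, i.e., (x - μ)/σ for each element
function zscore_sketch(x::AbstractVector{<:Real})
    μ = mean(x); # mean of the feature
    σ = std(x);  # sample standard deviation (divides by n-1 by default)
    return (x .- μ) ./ σ
end

x = [2.0, 4.0, 6.0, 8.0];      # made-up feature values
x_scaled = zscore_sketch(x);   # scaled values: mean 0, standard deviation 1
```

After scaling, entries below the mean are negative and entries above the mean are positive, which is the pattern we should see in the continuous columns of `D`.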

In [7]:
(D, dataset) = let

    # convert 0,1 into -1,1
    treated_dataset = copy(originaldataset);
    transform!(treated_dataset, :anaemia => ByRow(x -> (x==0 ? -1 : 1)) => :anaemia); # maps anaemia to -1,1
    transform!(treated_dataset, :diabetes => ByRow(x -> (x==0 ? -1 : 1)) => :diabetes); # maps diabetes to -1,1
    transform!(treated_dataset, :high_blood_pressure => ByRow(x -> (x==0 ? -1 : 1)) => :high_blood_pressure); # maps high_blood_pressure to -1,1
    transform!(treated_dataset, :sex => ByRow(x -> (x==0 ? -1 : 1)) => :sex); # maps sex to -1,1
    transform!(treated_dataset, :smoking => ByRow(x -> (x==0 ? -1 : 1)) => :smoking); # maps smoking to -1,1
    transform!(treated_dataset, :death_event => ByRow(x -> (x==0 ? -1 : 1)) => :death_event); # maps death_event to -1,1
    
    D = treated_dataset[:,1:end] |> Matrix; # build a data matrix from the DataFrame
    (number_of_examples, number_of_features) = size(D);

    # Which cols do we want to rescale?
    index_to_z_scale = [
        1 ; # 1 age
        3 ; # 2 creatinine_phosphokinase
        5 ; # 3 ejection_fraction
        7 ; # 4 platelets
        8 ; # 5 serum_creatinine
        9 ; # 6 serum_sodium
        12 ; # 7 time
    ];

    D̂ = copy(D);
    for i ∈ eachindex(index_to_z_scale)
        j = index_to_z_scale[i];
        μ = mean(D[:,j]); # compute the mean
        σ = std(D[:,j]); # compute std

        # rescale -
        for k ∈ 1:number_of_examples
            D̂[k,j] = (D[k,j] - μ)/σ;
        end
    end
    
    D̂, treated_dataset
end;

In [8]:
D

299×13 Matrix{Float64}:
  1.19095    -1.0   0.000165451  -1.0  …   1.0  -1.0  -1.62678   1.0
 -0.490457   -1.0   7.50206      -1.0      1.0  -1.0  -1.60101   1.0
  0.350246   -1.0  -0.449186     -1.0      1.0   1.0  -1.58812   1.0
 -0.910808    1.0  -0.485257     -1.0      1.0  -1.0  -1.58812   1.0
  0.350246    1.0  -0.434757      1.0     -1.0  -1.0  -1.57524   1.0
  2.452       1.0  -0.551217     -1.0  …   1.0   1.0  -1.57524   1.0
  1.19095     1.0  -0.346124     -1.0      1.0  -1.0  -1.54947   1.0
 -0.0701056   1.0  -0.275011      1.0      1.0   1.0  -1.54947   1.0
  0.350246   -1.0  -0.437849     -1.0     -1.0  -1.0  -1.54947   1.0
  1.6113      1.0  -0.47289      -1.0      1.0   1.0  -1.54947   1.0
  1.19095     1.0  -0.516176     -1.0  …   1.0   1.0  -1.54947   1.0
  0.098035   -1.0  -0.361583     -1.0      1.0   1.0  -1.54947   1.0
 -1.33116     1.0   0.411384     -1.0      1.0  -1.0  -1.53659   1.0
  ⋮                                    ⋱         ⋮              

Next, let's split that dataset `D` into `training` and `test` subsets. We do this randomly, where the `number_of_training_examples::Int64` variable specifies the number of training points. The `training::Array{Float64,2}` data will be used to estimate the model parameters, and `test::Array{Float64,2}` will be used for model testing.

In [10]:
training, test = let

    number_of_training_examples = 199; # how many training examples do we want?
    number_of_examples = size(D,1); # number of rows in the full dataset
    full_index_set = range(1,stop=number_of_examples,step=1) |> collect |> Set;
    
    # build index sets for training and testing
    training_index_set = Set{Int64}();
    should_stop_loop = false;
    while (should_stop_loop == false)
        i = rand(1:number_of_examples);
        push!(training_index_set,i);

        if (length(training_index_set) == number_of_training_examples)
            should_stop_loop = true;
        end
    end
    test_index_set = setdiff(full_index_set,training_index_set);

    # build the test and train datasets -
    training = D[training_index_set |> collect,:];
    test = D[test_index_set |> collect,:];

    # return
    training, test
end;
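
The sampling loop above works, but a random permutation of the row indices gives the same kind of split in fewer lines. The sketch below is an alternative, not the lab's implementation: `split_sketch` and the toy matrix are hypothetical, and `randperm` comes from the `Random` standard library.

```julia
using Random

# hypothetical helper: randomly split the rows of A into training and test sets
function split_sketch(A::AbstractMatrix, number_of_training_examples::Int; rng = Random.default_rng())
    number_of_examples = size(A,1);
    idx = randperm(rng, number_of_examples);              # random ordering of the row indices
    training_idx = idx[1:number_of_training_examples];    # first block of indices -> training
    test_idx = idx[(number_of_training_examples+1):end];  # remaining indices -> test
    return A[training_idx,:], A[test_idx,:]
end

A = reshape(collect(1.0:30.0), 10, 3); # toy 10 x 3 data matrix
training_sketch, test_sketch = split_sketch(A, 7; rng = MersenneTwister(1234));
```

Because every row index appears exactly once in the permutation, the two subsets are disjoint and together cover the full dataset.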

In [11]:
training

199×13 Matrix{Float64}:
  0.350246   -1.0  -0.502778      1.0  …   1.0  -1.0  -1.30467     1.0
  0.938738   -1.0  -0.22451       1.0      1.0   1.0  -0.918142    1.0
 -0.490457   -1.0   0.000165451   1.0     -1.0  -1.0   0.859883   -1.0
  0.350246    1.0  -0.460523     -1.0      1.0  -1.0   0.82123    -1.0
  2.03165    -1.0   5.46246      -1.0      1.0   1.0  -0.750647    1.0
 -0.910808   -1.0   1.99957      -1.0  …  -1.0  -1.0   1.07891    -1.0
 -0.0701056   1.0   0.177432      1.0      1.0  -1.0  -0.505846   -1.0
 -0.490457   -1.0  -0.537819     -1.0      1.0   1.0  -0.518731   -1.0
 -0.826738   -1.0  -0.519268     -1.0      1.0  -1.0  -0.660457   -1.0
 -0.154176   -1.0  -0.531635      1.0      1.0  -1.0   0.0610601   1.0
  0.350246    1.0  -0.333756      1.0  …   1.0  -1.0   1.34948     1.0
  0.350246   -1.0   0.000165451   1.0      1.0   1.0   1.05315    -1.0
  0.350246   -1.0  -0.192561      1.0      1.0   1.0   0.305861    1.0
  ⋮                                    ⋱         ⋮   

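Next, let's compute the covariance matrix of the (scaled) features, i.e., every column of `D` except the label, and store it in the `Σ` variable. We'll use `Σ` in Task 4.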

In [13]:
Σ = cov(D[:,1:end-1])

12×12 Matrix{Float64}:
  1.0         0.0873213  -0.0815839   …   0.0174608   -0.224068
  0.0873213   0.98449    -0.189256       -0.0995713   -0.140313
 -0.0815839  -0.189256    1.0             0.00226468  -0.00934565
 -0.0998138  -0.0124801  -0.00952414     -0.136024     0.0333253
  0.0600984   0.0313113  -0.0440796      -0.0629621    0.0417292
  0.0892094   0.0362281  -0.0675033   …  -0.0498305   -0.18785
 -0.0523544  -0.0434447   0.0244634       0.0264088    0.0105139
  0.159187    0.0517674  -0.0164085      -0.0256416   -0.149315
 -0.0459658   0.0415555   0.0595502       0.00450198   0.08764
  0.0625685  -0.0899194   0.0763016       0.398824    -0.0149257
  0.0174608  -0.0995713   0.00226468  …   0.874863    -0.0213622
 -0.224068   -0.140313   -0.00934565     -0.0213622    1.0

Finally, let's set up the color dictionary to visualize the classification datasets. The keys of the `my_color_dictionary::Dict{Int64,RGB}` dictionary are the class labels, i.e., $y\in\{1,-1\}$, while the values are the colors mapped to each label.

In [15]:
my_color_dictionary = Dict{Int64,RGB}();
my_color_dictionary[1] = colorant"#03045e"; # color for Label = 1
my_color_dictionary[-1] = colorant"#e36414"; # color for Label = -1

## Task 1: Build a Perceptron Classification Model and Learn the Parameters
In this task, we'll build a model of our classification problem and train the model using an online learning method. 
* __Training__: Our Perceptron implementation [based on pseudo-code](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf) stores problem information in [a `MyPerceptronClassificationModel` instance, which holds the (initial) parameters and other data](src/Types.jl) required by the problem. We initialize the parameters using a vector of `1`'s.
* We then _learn_ the model parameters [using the `learn(...)` method](src/Compute.jl), which takes the training features array `X`, the training labels vector `y`, and the problem instance, and returns an updated problem instance holding the updated parameters.
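
To make the training step concrete, here is a minimal sketch of the classic perceptron update rule on toy data. This is an assumption about what a perceptron `learn(...)` typically does, not the course implementation in `src/Compute.jl`; the `perceptron_sketch` name, the stopping rule, and the data are hypothetical.

```julia
using LinearAlgebra

# hypothetical stand-in for the course `learn(...)` method: classic perceptron updates
function perceptron_sketch(X::Matrix{Float64}, y::Vector{Float64}; maxiter::Int = 100)
    θ = ones(size(X,2)); # initial parameters: a vector of 1's, as in the text
    for _ ∈ 1:maxiter
        mistakes = 0;
        for i ∈ 1:size(X,1)
            x = X[i,:];
            if y[i]*dot(x,θ) ≤ 0 # misclassified (or exactly on the boundary)
                θ += y[i]*x;     # nudge the boundary toward the correct side
                mistakes += 1;
            end
        end
        mistakes == 0 && break # a full pass with no mistakes: stop
    end
    return θ
end

# toy, linearly separable data: the last column is the bias term
X = [2.0 2.0 1.0; 3.0 1.0 1.0; 0.5 0.2 1.0; 0.2 0.5 1.0];
y = [1.0, 1.0, -1.0, -1.0];
θ = perceptron_sketch(X, y);
predicted_labels = sign.(X*θ); # predicted label is the sign of the score
```

On separable toy data like this, the loop stops as soon as a full pass makes no mistakes. On our clinical dataset the classes are unlikely to be perfectly separable, which is why the training run below stops at `maxiter` with a nonzero error count.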

In [17]:
perceptron_model = let
    
    # setup
    D = training; # what dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the target data (label)
    
    # build an initial model
    model = build(MyPerceptronClassificationModel, (
        parameters = ones(number_of_features),
        mistakes = 0 # willing to live with m mistakes
    ));

    # train the model -
    trainedmodel = learn(X,y,model, maxiter = 1000, verbose = true);

    # return -
    trainedmodel;
end;

Stopped after number of iterations: 1000. We have number of errors: 56


__Inference__: Now that we have parameters estimated from the `training` data, we can use those parameters on the `test` dataset to see how well the model can differentiate between classes on data it has never seen. 
* We run the classification operation on the (unseen) test data [using the `classify(...)` method](src/Compute.jl). This method takes a feature array `X` and the (trained) model instance. It returns the estimated labels. We store the actual (correct) labels in the `y_perceptron::Array{Float64,1}` vector (the last column of the `test` matrix), while the model-predicted labels are stored in the `ŷ_perceptron` array.

In [19]:
ŷ_perceptron,y_perceptron = let

    D = test; # what dataset are going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    ŷ = classify(X,perceptron_model)

    # return -
    ŷ,y
end;

__Confusion matrix__: Let's compute the confusion matrix for the perceptron [using the `confusion(...)` method](src/Compute.jl) and store it in the `CM_perceptron::Array{Int64,2}` variable. [Click me for a confusion matrix schematic!](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-BinaryConfusionMatrix.pdf). 

In [21]:
CM_perceptron = confusion(y_perceptron, ŷ_perceptron)

2×2 Matrix{Int64}:
 25  10
  9  56

Finally, we can compute the overall error rate for the perceptron (or other performance metrics) using values from [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix). The [`confusion(...)` method](src/Compute.jl) takes the actual labels and the computed labels and returns the confusion matrix.

In [23]:
number_of_test_points = length(y_perceptron);
correct_prediction_perceptron = CM_perceptron[1,1] + CM_perceptron[2,2];
(correct_prediction_perceptron/number_of_test_points) |> f-> println("Fraction correct: $(f) Fraction incorrect $(1-f)")

Fraction correct: 0.81 Fraction incorrect 0.18999999999999995
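
The other performance metrics mentioned above follow from the same four counts. The sketch below assumes a particular layout (rows are the actual label, columns are the predicted label, with label `1` first), which may differ from the layout used by the lab's `confusion(...)` method, so check the schematic before reusing it. The `confusion_sketch` helper and the labels are hypothetical.

```julia
# hypothetical helper: 2 x 2 confusion matrix with an ASSUMED layout:
# rows = actual label (1, -1), cols = predicted label (1, -1)
function confusion_sketch(actual::AbstractVector, predicted::AbstractVector)
    CM = zeros(Int64, 2, 2);
    for (a, p) ∈ zip(actual, predicted)
        r = (a == 1) ? 1 : 2; # row index from the actual label
        c = (p == 1) ? 1 : 2; # col index from the predicted label
        CM[r,c] += 1;
    end
    return CM
end

actual_labels = [1, 1, 1, -1, -1, -1, -1, 1];   # made-up actual labels
predicted_labels = [1, -1, 1, -1, 1, -1, -1, 1]; # made-up predictions
CM = confusion_sketch(actual_labels, predicted_labels);

TP, FN, FP, TN = CM[1,1], CM[1,2], CM[2,1], CM[2,2];
accuracy_value = (TP + TN)/sum(CM);  # fraction of all predictions that are correct
precision_value = TP/(TP + FP);      # of the predicted 1's, how many are actually 1?
recall_value = TP/(TP + FN);         # of the actual 1's, how many did we find?
```

Accuracy alone can hide an imbalance between the two kinds of mistakes, which is why precision and recall are worth reporting alongside it for a clinical label such as a death event.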


## Task 2: Build and Train Logistic Regression Classification Model using Gradient Descent
In this task, we build and train a [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) classifier using the training data, and then challenge this classifier using the `test` dataset. We'll use [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent) for parameter estimation.

We implemented [the `MyLogisticRegressionClassificationModel` type](src/Types.jl), which contains data required to solve the logistic regression problem, i.e., parameters, the learning rate, a stopping tolerance parameter $\epsilon$, and a loss (objective) function that we want to minimize. 
* __Technical note__: We approximated the gradient calculation using [a forward finite difference](https://en.wikipedia.org/wiki/Finite_difference). This is generally not a great idea. This is one of my super pet peeves with gradient descent; computing the gradient is (usually) a hassle. Typically, we must do at least two function evaluations per partial derivative to approximate the gradient well. Why use a finite difference? It is easy to implement.
* In the code below, we [build a `model::MyLogisticRegressionClassificationModel` instance using a `build(...)` method](src/Factory.jl). The model instance initially holds a guess for the classifier parameters (a small constant value in the code below). We use gradient descent to refine that guess [using the `learn(...)` method](src/Compute.jl), which returns an updated model instance (with the best parameters that we found so far). We return the updated model instance and save it in the `model_logistic::MyLogisticRegressionClassificationModel` variable.
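
The forward finite difference idea in the technical note can be sketched as follows. This is an illustration under stated assumptions, not the course `learn(...)` method: the helper names, the toy objective, the step size `h`, and the learning rate `α` are all hypothetical.

```julia
# hypothetical helper: forward finite-difference approximation of the gradient of f at θ
function forward_difference_gradient(f::Function, θ::Vector{Float64}; h::Float64 = 1e-4)
    g = zeros(length(θ));
    f_base = f(θ); # one evaluation at the current point, shared by all components
    for i ∈ eachindex(θ)
        θ_step = copy(θ);
        θ_step[i] += h;                 # perturb only parameter i
        g[i] = (f(θ_step) - f_base)/h;  # slope along coordinate i
    end
    return g
end

# hypothetical helper: plain gradient descent with a fixed learning rate α
function gradient_descent_sketch(f::Function, θ_init::Vector{Float64}; α::Float64 = 0.1, maxiter::Int = 200)
    θ = copy(θ_init);
    for _ ∈ 1:maxiter
        θ = θ - α*forward_difference_gradient(f, θ); # step downhill
    end
    return θ
end

f(θ) = (θ[1] - 1)^2 + (θ[2] + 2)^2; # toy objective with a known minimum at (1, -2)
θ_final = gradient_descent_sketch(f, [0.0, 0.0]);
```

Note the cost: every gradient needs one extra evaluation of the loss per parameter, and the answer is biased by a term of order `h`. That trade (cheap to write, expensive and slightly inaccurate to run) is the reason for the warning above.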

In [25]:
model_logistic = let

    # data -
    D = training; # What dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the target data (label)

    # model
    model = build(MyLogisticRegressionClassificationModel, (
        parameters = 0.01*ones(number_of_features), # initial value for the parameters: these will be updated
        learning_rate = 0.01, # you pick this
        ϵ = 1e-4, # you pick this (this is also the step size for the fd approx to the gradient)
        loss_function = (x,y,θ) -> log10(1+exp(-y*(dot(x,θ)))) # what??!? Wow, that is nice. Yes, we can pass functions as args!
    ));

    # train -
    model = learn(X,y,model, maxiter = 10000, verbose = true); # this is learning the model parameters

    # return -
    model;
end;

Stopped after number of iterations: 10001. We have error: 32.34784690545399


Let's use the updated `model_logistic::MyLogisticRegressionClassificationModel` instance (with parameters learned from the `training` data) and test how well we can classify data in the `test` dataset.

* __Inference__: We run the classification operation on the (unseen) test data [using the `classify(...)` method](src/Compute.jl). This method takes a feature array `X` and the (trained) model instance. It returns the probability of each label in the `P::Array{Float64,2}` array (unlike the Perceptron, which returns the labels directly). Each row of `P` corresponds to a test instance, and each column corresponds to a label, in this case `1` and `-1`.
* We store the actual (correct) label in the `y_logistic::Array{Float64,1}` vector. We compute the predicted label for each test instance by finding the highest probability column. We store the predicted labels in the `ŷ_logistic::Array{Float64,1}` vector.
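
The step from probabilities to labels can be sketched on toy data. The snippet assumes the standard logistic function for the probability of label `1` and the column order described above (column 1 for label `1`, column 2 for label `-1`); `classify_sketch`, the features, and the parameters are hypothetical and are not the course `classify(...)` method.

```julia
using LinearAlgebra

# hypothetical stand-in for `classify(...)`: col 1 = P(y = 1), col 2 = P(y = -1)
function classify_sketch(X::Matrix{Float64}, θ::Vector{Float64})
    P = zeros(size(X,1), 2);
    for i ∈ 1:size(X,1)
        P[i,1] = 1/(1 + exp(-dot(X[i,:], θ))); # logistic function of the score
        P[i,2] = 1 - P[i,1];                    # the two probabilities sum to 1
    end
    return P
end

X_toy = [1.0 1.0; -2.0 1.0; 0.5 1.0]; # toy features, the last col is the bias term
θ_toy = [2.0, -0.5];                  # made-up parameters
P_toy = classify_sketch(X_toy, θ_toy);

# pick the label whose column has the largest probability
predicted_labels = [argmax(P_toy[i,:]) == 1 ? 1 : -1 for i ∈ 1:size(P_toy,1)];
```

With two labels, taking the largest column is the same as thresholding the probability of label `1` at `0.5`.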

In [27]:
ŷ_logistic,y_logistic, P = let

    D = test; # What dataset are you going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    P = classify(X,model_logistic) # logistic regression returns a number_of_examples x 2 array holding the probability: 1 = col 1, -1 = col 2

    # convert the probability to a choice ... for each row (test instance), compute the col with the highest probability
    ŷ = zeros(number_of_examples);
    for i ∈ 1:number_of_examples
        a = argmax(P[i,:]); # col index with largest value
        ŷ[i] = 1; # default
        if (a == 2)
            ŷ[i] = -1;
        end
    end
    
    # return -
    ŷ, y, P
end;

__Performance__: Once the training has converged (or we have exhausted our iterations), we can evaluate the binary classifier's performance using various metrics. The central idea is to compare the predicted labels $\hat{y}_{i}$ to the actual labels $y_{i}$ in the `test` dataset and measure wins (when the label is the same) and losses (when the label is different). This is easily represented in [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).
* We compute the confusion matrix [using the `confusion(...)` method](src/Compute.jl) and store it in the `CM_logistic::Array{Int64,2}` variable. The [`confusion(...)` method](src/Compute.jl) takes the actual labels and the computed labels and returns the confusion matrix.

In [29]:
CM_logistic = confusion(y_logistic, ŷ_logistic)

2×2 Matrix{Int64}:
 26   9
  6  59

Let's compute the overall error rate for the logistic regression classifier using [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

In [31]:
number_of_test_points = length(y_logistic); # number of test points for the logistic regression case
correct_prediction_logistic = CM_logistic[1,1] + CM_logistic[2,2];
(correct_prediction_logistic/number_of_test_points) |> f-> println("Fraction correct: $(f) Fraction incorrect $(1-f)")

Fraction correct: 0.85 Fraction incorrect 0.15000000000000002


## Task 3: Build and Train Logistic Regression Classification Model using Simulated Annealing
In this task, we build and train a [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) classifier using the `training` dataset, and then challenge this classifier using the `test` dataset. We'll use [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing#:~:text=Simulated%20annealing%20(SA)%20is%20a,can%20find%20the%20global%20optimum.) for parameter estimation (instead of [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)).

Simulated annealing, inspired by the annealing process in metallurgy, is a probabilistic optimization technique used to estimate the global optimum of a given function in large search spaces _without_ computing the gradient. It is particularly effective for problems with numerous local optima. 

__Algorithm__: The simulated annealing algorithm generates a random parameter set and evaluates its loss function. This loss is compared with the best found so far. If the loss decreases, the candidate set becomes the best. If the loss increases, the candidate set may be kept with a probability tied to a temperature $T$. The pseudocode for simulated annealing can be found in [L3c course notes](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3c/docs/Notes.pdf).
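
The accept/reject logic described above can be sketched as follows. This is a generic version of simulated annealing under stated assumptions, not the pseudocode from the course notes: the Metropolis acceptance rule $\exp(-\Delta/T)$, the geometric cooling schedule, the Gaussian perturbation, and every name and default value in the snippet are hypothetical choices.

```julia
using Random

# hypothetical sketch of simulated annealing; not the course `learn(...)` method
function simulated_annealing_sketch(f::Function, θ_init::Vector{Float64};
    T_init::Float64 = 1.0, T_min::Float64 = 1e-4, cooling::Float64 = 0.9,
    steps_per_T::Int = 50, rng = MersenneTwister(0))

    current, current_loss = copy(θ_init), f(θ_init);
    best, best_loss = copy(current), current_loss;
    T = T_init;
    while T > T_min
        for _ ∈ 1:steps_per_T
            candidate = current .+ 0.1*randn(rng, length(current)); # random perturbation
            candidate_loss = f(candidate);
            Δ = candidate_loss - current_loss;

            # always accept downhill moves; accept uphill moves with probability exp(-Δ/T)
            if Δ ≤ 0 || rand(rng) < exp(-Δ/T)
                current, current_loss = candidate, candidate_loss;
            end

            # keep track of the best parameters found so far
            if current_loss < best_loss
                best, best_loss = copy(current), current_loss;
            end
        end
        T *= cooling; # geometric cooling (an assumed schedule)
    end
    return best, best_loss
end

f(θ) = sum(abs2, θ .- [1.0, -2.0]); # toy loss with a minimum at (1, -2)
θ_best, loss_best = simulated_annealing_sketch(f, [0.0, 0.0]);
```

At high temperature almost every move is accepted, so the search wanders widely; as $T$ falls, uphill moves become rare and the search settles into a nearby minimum. No gradient is ever computed.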

### Implementation

* We implemented [the `MyLogisticRegressionSimulatedAnnealingClassificationModel` type](src/Types.jl), which contains data required to train the logistic regression problem using simulated annealing. To build this type, we [use a `build(...)` method](src/Factory.jl) and pass in an initial value for the parameters, a cooling rate parameter (important to the probability of accepting an uphill move), the loss function, etc. 
* We then estimate the model parameters [using the `learn(...)` method](src/Compute.jl). This method takes the `training` feature matrix $\hat{\mathbf{X}}$, the label vector $\mathbf{y}$, and the model instance, along with some optional parameters, such as the `maxiter::Int64` argument. The `maxiter::Int64` argument has a slightly different meaning in simulated annealing; it's the number of steps we take at each temperature $T$.



In [33]:
model_simulated_annealing = let

    # data -
    D = training; # What dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the target data (label)

    # model
    model = build(MyLogisticRegressionSimulatedAnnealingClassificationModel, (
        parameters = ones(number_of_features), # initial value for the parameters: these will be updated
        cooling_rate = 0.01, # you pick this: the smaller the value, the *slower* we cool.
        ϵ = 1e-4, # you pick this
        loss_function = (x,y,θ) -> log10(1+exp(-y*(dot(x,θ)))) # what??!? Wow, that is nice. Yes, we can pass functions as args!
    ));

    # train -
    model = learn(X,y,model, maxiter = 100, verbose = true); # this is learning the model parameters

    # return -
    model;
end;

SA routine: Stopped with best loss: 32.356797539351746


Let's use the updated `model_simulated_annealing::MyLogisticRegressionSimulatedAnnealingClassificationModel` instance (with parameters learned from the `training` data) and test how well we can classify data in the `test` dataset (using simulated annealing instead of gradient descent).

* __Inference__: We run the classification operation on the (unseen) test data [using the `classify(...)` method](src/Compute.jl). This method takes a feature array `X` and the (trained) model instance. It returns the probability of each label in the `P::Array{Float64,2}` array (unlike the Perceptron, which returns the labels directly). Each row of `P` corresponds to a test instance, and each column corresponds to a label, in this case `1` and `-1`.
* We store the actual (correct) label in the `y_logistic_sa::Array{Float64,1}` vector. We compute the predicted label for each test instance by finding the highest probability column. We store the predicted labels in the `ŷ_logistic_sa::Array{Float64,1}` vector.

In [35]:
ŷ_logistic_sa,y_logistic_sa, P = let

    D = test; # What dataset are you going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = [D[:,1:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    P = classify(X,model_simulated_annealing) # logistic regression returns a number_of_examples x 2 array holding the probability: 1 = col 1, -1 = col 2

    # convert the probability to a choice ... for each row (test instance), compute the col with the highest probability
    ŷ = zeros(number_of_examples);
    for i ∈ 1:number_of_examples
        a = argmax(P[i,:]); # col index with largest value
        ŷ[i] = 1; # default
        if (a == 2)
            ŷ[i] = -1;
        end
    end
    
    # return -
    ŷ, y, P
end;

__Performance__: Let's compute [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for the simulated annealing case.
* We compute the confusion matrix [using the `confusion(...)` method](src/Compute.jl) and store it in the `CM_logistic_sa::Array{Int64,2}` variable. The [`confusion(...)` method](src/Compute.jl) takes the actual labels and the computed labels and returns the confusion matrix.

In [37]:
CM_logistic_sa = confusion(y_logistic_sa, ŷ_logistic_sa)

2×2 Matrix{Int64}:
 26   9
  6  59

In [38]:
number_of_test_points = length(y_logistic_sa); # number of test points for the simulated annealing case
correct_prediction_logistic_sa = CM_logistic_sa[1,1] + CM_logistic_sa[2,2];
(correct_prediction_logistic_sa/number_of_test_points) |> f-> println("Fraction correct: $(f) Fraction incorrect $(1-f)")

Fraction correct: 0.85 Fraction incorrect 0.15000000000000002


## Task 4: Let's visualize what we missed using PCA!
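In L3d, we had a kind of cool visualization of which points our various classifiers missed. Let's try to do the same here. However, we have a problem: each example has `12` features, so we can't plot the data directly. Instead, we project the features onto the two eigenvectors of the covariance matrix `Σ` with the largest eigenvalues (the leading principal components) and plot the data in that two-dimensional space.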

In [40]:
P = let
    F = eigen(Σ);
    λ = F.values;
    V = F.vectors;

    # get the eigenvectors for the largest and next largest eigenvalues (eigen sorts a symmetric problem in ascending order)
    v₁ = V[:,end]; # largest
    v₂ = V[:,end-1]; # next largest

    # build the projection matrix -
    P = [
        transpose(v₁) ;
        transpose(v₂)
    ];

    # return -
    P
end

2×12 Matrix{Float64}:
 -0.492007  -0.259916  0.18883   …  -0.145568  -0.0837004  0.499135
 -0.059407  -0.322449  0.247051      0.579358   0.484378   0.102824
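
Before trusting a two-dimensional picture, it is worth asking how much of the variance those two directions capture. The sketch below estimates that share from the eigenvalues of a covariance matrix. It uses a made-up dataset and hypothetical names (`X_toy`, `Σ_toy`, `fraction_explained`), so the number it produces says nothing about the clinical data.

```julia
using LinearAlgebra, Statistics, Random

# toy data: four independent features with very different spreads
X_toy = randn(MersenneTwister(42), 200, 4) * Diagonal([3.0, 2.0, 0.5, 0.1]);

Σ_toy = cov(X_toy);                 # 4 x 4 covariance matrix of the toy features
F_toy = eigen(Symmetric(Σ_toy));    # eigenvalues come back sorted in ascending order
λ_toy = F_toy.values;

# share of the total variance carried by the two leading directions
fraction_explained = (λ_toy[end] + λ_toy[end-1])/sum(λ_toy);
```

Applying the same ratio to the eigenvalues of `Σ` would indicate how faithful the two-dimensional projection of our `12` features is; a low value means points that look close in the plot may be far apart in the original feature space.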

In [41]:
Z = let

    D = training; # what data set?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2); # how many features do we have (cols)?
    X = D[:,1:end-1]; # features: remove the label (no bias column is needed for the projection)
    
    Z = P*transpose(X) |> transpose |> Matrix

    Z
end

199×2 Matrix{Float64}:
 -0.378744  -0.579692
 -1.11256    1.18376
  1.85559   -1.27028
 -0.448313  -0.16056
 -0.525391   3.24188
  2.1989    -0.123081
 -0.936017  -0.681483
 -0.224546   1.8979
  1.06482    0.160619
 -0.558047   1.10854
 -0.245946   0.382246
  0.654841   1.70119
  0.271468   1.67965
  ⋮         
 -1.455     -1.36458
 -1.36046    0.107387
 -0.568989  -2.60082
 -0.377279  -0.945909
 -0.53634    1.02711
  0.126468   0.851097
  1.13029   -0.305329
  0.67227   -1.07147
  1.6972     0.661798
 -0.580629  -1.30722
  1.20747   -1.33693
  1.95118    2.77718

In [42]:
test

100×13 Matrix{Float64}:
  2.87235     1.0  -0.217296     -1.0  …   1.0  -1.0  -1.0341     1.0
 -1.33116    -1.0  -0.298715      1.0      1.0   1.0  -0.544499  -1.0
  2.452       1.0  -0.551217     -1.0      1.0   1.0  -1.57524    1.0
 -1.58337     1.0  -0.342001      1.0     -1.0  -1.0  -0.840837   1.0
 -0.910808   -1.0  -0.481135     -1.0      1.0   1.0  -0.157972  -1.0
 -0.238246    1.0  -0.450216     -1.0  …   1.0   1.0   0.512008   1.0
 -0.0140307   1.0  -0.492472      1.0      1.0  -1.0   0.524893   1.0
  0.098035   -1.0  -0.310052      1.0     -1.0  -1.0  -0.286814  -1.0
 -0.910808    1.0  -0.485257     -1.0      1.0  -1.0  -1.58812    1.0
 -1.75151     1.0  -0.495564     -1.0     -1.0  -1.0   0.731041  -1.0
 -1.58337    -1.0   4.76885      -1.0  …   1.0   1.0  -0.557383  -1.0
  0.686527   -1.0   0.862796     -1.0      1.0   1.0   0.215671  -1.0
  0.098035   -1.0  -0.361583     -1.0      1.0   1.0  -1.54947    1.0
  ⋮                                    ⋱         ⋮               


In [43]:
let
    model = perceptron_model; # which model am I using?
    dataset = training; # what dataset am I looking at? (Z was computed from this dataset)
    caselabel = "training";
    number_of_points = size(dataset,1); # number of rows
    X = [dataset[:,1:end-1] ones(number_of_points)]; # features: need to add a 1 to each row (for bias), after removing the label
    actual = dataset[:,end]; # actual labels for the *same* rows that we plot
    predicted = classify(X,model); # predicted labels for the *same* rows that we plot
    p = plot(bg="gray95", background_color_outside="white", framestyle = :box, fg_legend = :transparent); # make an empty plot
    
    # plot label = 1
    testlabel = 1;
    i = findfirst(label -> label == testlabel,  dataset[:,end])
    c = my_color_dictionary[testlabel]
    scatter!([Z[i,1]], [Z[i,2]], label="Label: $(testlabel)", c=c)

    # plot label = -1
    testlabel = -1;
    i = findfirst(label -> label == testlabel,  dataset[:,end])
    c = my_color_dictionary[testlabel]
    scatter!([Z[i,1]], [Z[i,2]], label="Label: $(testlabel)", c=c)

    # let's draw the separating hyperplane (in our case, a line)
    # NOTE: this formula assumes two features plus a bias; with our 13 parameters it only
    # uses β[1], β[2] and β[3], so treat the line as a rough guide in the projected space
    β = model.β; # model parameters (named β so that we don't overwrite the plot handle p)
    number_of_plane_points = 200;
    x₂ = zeros(number_of_plane_points);
    x₁ = range(-3,stop=3,length = number_of_plane_points) |> collect;
    for i ∈ 1:number_of_plane_points
        x₂[i] = -1*((β[1]/β[2])*x₁[i] + β[3]/β[2]);
    end
    plot!(x₁,x₂,lw=2, c=:green, label="Learned boundary")
    
    # data -
    for i ∈ 1:number_of_points
        actuallabel = actual[i]; # actual label
        testlabel = predicted[i]; # predicted label

        c = :gray60; # gray: the model missed this point
        if (actuallabel == testlabel)
            c = my_color_dictionary[actuallabel]
        end
        scatter!([Z[i, 1]], [Z[i, 2]], label="", mec=:navy, c=c)
    end

    title!("Perceptron: $(caselabel)", fontsize=18)
    xlabel!("Feature 1 (AU)", fontsize=18);
    ylabel!("Feature 2 (AU)", fontsize=18);
end

## Tests
In the code block below, we check some values in your notebook and give you feedback on which items are correct or different. `Unhide` the code block below (if you are curious about how we implemented the tests and what we are testing).

In [45]:
let
    # fill me in here
end