# L3b: Classification of Clinical Breast Cancer Samples
Linear regression can be adapted for classification tasks by transforming the continuous output of the linear regression model either directly into a class designation, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,+1\}$, or into a probability using an output function $\sigma:\mathbb{R}\rightarrow[0,1]$. Let's take a look at two examples of these strategies:

* [The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) is a simple yet powerful algorithm used in machine learning for binary classification tasks. It operates by _incrementally_ learning a linear decision boundary (linear regression model) that separates two classes based on input features by directly mapping the continuous output to a class such as $\sigma:\mathbb{R}\rightarrow\{-1,+1\}$, where the output function is $\sigma(\star) = \texttt{sign}(\star)$.
* [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression#) is a statistical method used in machine learning for binary classification tasks using the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) as the transformation function. Applying the logistic function transforms the output of a linear regression model into a probability, which can then be thresholded to make a class decision. We'll consider this approach next time.
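
To make the contrast between the two output functions concrete, here is a minimal sketch. The function names `perceptron_output` and `logistic_output` are hypothetical (they are not part of the lab code); both take the scalar output $z$ of a linear model.

```julia
# Two output functions applied to the same scalar linear-model output z.
# Both names are hypothetical helpers for illustration only.
perceptron_output(z::Real) = z ≥ 0 ? 1 : -1     # hard class label in {-1, +1}
logistic_output(z::Real) = 1 / (1 + exp(-z))     # value in (0, 1), read as a probability

z = 0.0                   # a point exactly on the decision boundary
perceptron_output(z)      # 1 (by the z ≥ 0 convention)
logistic_output(z)        # 0.5
```

The Perceptron commits to a label immediately, while the logistic output keeps a graded value that must still be thresholded to obtain a class.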

### Perceptron
[The Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) takes the (scalar) output of a linear regression model $y_{i}\in\mathbb{R}$ and then transforms it using the $\sigma(\star) = \texttt{sign}(\star)$ function to a discrete set of values representing categories, e.g., $\sigma:\mathbb{R}\rightarrow\{-1,1\}$ in the binary classification case. 
* Suppose there exists a data set
$\mathcal{D} = \left\{(\mathbf{x}_{1},y_{1}),\dotsc,(\mathbf{x}_{n},y_{n})\right\}$ with $n$ _labeled_ examples, where each example has been labeled by an expert (i.e., a human) as belonging to a category $y_{i}\in\{-1,1\}$, given the $m$-dimensional feature vector $\mathbf{x}_{i}\in\mathbb{R}^{m}$. 
* [The Perceptron](https://en.wikipedia.org/wiki/Perceptron) _incrementally_ learns a linear decision boundary between _two_ classes of possible objects (binary classification) by repeatedly processing the dataset $\mathcal{D}$. During each pass, a regression parameter vector $\mathbf{\beta}$ is updated until it makes no more than a specified number of mistakes.  

[The Perceptron](https://en.wikipedia.org/wiki/Perceptron) computes the estimated label $\hat{y}_{i}$ for feature vector $\hat{\mathbf{x}}_{i}$ using the $\texttt{sign}:\mathbb{R}\to\{-1,1\}$ function:
$$
\begin{equation*}
    \hat{y}_{i} = \texttt{sign}\left(\hat{\mathbf{x}}_{i}^{\top}\cdot\beta\right)
\end{equation*}
$$
where $\beta=\left(w_{1},\dots,w_{m}, b\right)$ is a column vector of (unknown) classifier parameters, $w_{j}\in\mathbb{R}$ corresponds to the importance of feature $j$, and $b\in\mathbb{R}$ is a bias parameter. The features $\hat{\mathbf{x}}^{\top}_{i}=\left(x^{(i)}_{1},\dots,x^{(i)}_{m}, 1\right)$ are $p$-dimensional (row) vectors with $p = m+1$ (the features augmented with a constant `1` for the bias term), and $\texttt{sign}(z)$ is the function:
$$
\begin{equation*}
    \texttt{sign}(z) = 
    \begin{cases}
        1 & \text{if}~z\geq{0}\\
        -1 & \text{if}~z<0
    \end{cases}
\end{equation*}
$$
__Hypothesis__: [If the dataset $\mathcal{D}$ is linearly separable](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-LinearSeperableData-Hyperplane.pdf), the Perceptron will _incrementally_ learn a separating hyperplane in a finite number of passes through the data set $\mathcal{D}$. However, if the [dataset $\mathcal{D}$ is __not__ linearly separable](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-NotLinearSeperableData-Hyperplane.svg), the Perceptron may not converge. Check out a [perceptron pseudo-code here!](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf)
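
The incremental learning idea can be sketched in a few lines. This is a minimal sketch, assuming the classic mistake-driven rule $\beta\leftarrow\beta + y_{i}\hat{\mathbf{x}}_{i}$ whenever example $i$ is misclassified; the names `sign_label` and `perceptron_sketch` and the keyword arguments are hypothetical, and this is not the lab's `learn(...)` implementation.

```julia
# Hypothetical helpers for illustration; not the lab's learn(...) method.
sign_label(z::Real) = z ≥ 0 ? 1 : -1

function perceptron_sketch(X::AbstractMatrix, y::AbstractVector;
    maxiter::Int = 1000, mistakes::Int = 0)

    β = ones(size(X, 2))                    # initial guess: a vector of ones
    for _ ∈ 1:maxiter                       # one iteration = one pass through the data
        errors = 0
        for i ∈ axes(X, 1)
            x = X[i, :]                     # augmented feature vector (last entry is 1)
            if sign_label(x' * β) != y[i]   # misclassified example
                β .+= y[i] .* x             # nudge the boundary toward the correct side
                errors += 1
            end
        end
        errors ≤ mistakes && break          # stop once a pass makes few enough mistakes
    end
    return β
end

# Tiny, made-up, linearly separable data set: one feature plus the bias column
X_demo = [1.0 1.0; 2.0 1.0; -1.0 1.0; -2.0 1.0]
y_demo = [1, 1, -1, -1]
β_demo = perceptron_sketch(X_demo, y_demo)
```

On separable data such as this toy example the loop exits early; on data that is not separable it simply runs until `maxiter` is reached.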

__Challenge__: We've never seen the dataset in this lab and have no idea if it's linearly separable. Thus, we have no theoretical guarantee that [the Perceptron](https://en.wikipedia.org/wiki/Perceptron) will work. Let's load the dataset, do some preprocessing, and then explore the performance of [the Perceptron](https://en.wikipedia.org/wiki/Perceptron) on this data.


### Tasks
Before we start, divide into teams and familiarize yourself with the lab. Then, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you have code issues, raise your hand, and let's get those fixed!

* __Task 1: Setup, Data, Constants (10 min)__: Let's take 10 minutes to review the dataset we'll explore today and set up some values we'll use in the other tasks. We'll load the data and do some initial _data munging_ (also called [data wrangling](https://en.wikipedia.org/wiki/Data_wrangling)) to get the dataset in a form that we'll use in our analysis.
* __Task 2: Build a Classification Model and Learn the Parameters (10 min)__: In this task, we'll build a model of our classification problem and train the model using an online learning method.
* __Task 3: Classify the test data and compute the Confusion matrix (20 min)?__: In this task, we'll use the updated `model::MyPerceptronClassificationModel` instance (that has learned its parameters from the `training` data in _Task 2_) and test how well we classify data that we have never seen, i.e., how well we classify the `test` dataset. We'll then compute [the confusion matrix](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-BinaryConfusionMatrix.pdf) which we can use to test how well the classifier is working.

## Task 1: Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. The `Include.jl` file loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem.

In [3]:
include("Include.jl");

### Data
In this lab, we'll use [the Perceptron (Rosenblatt, 1957)](https://en.wikipedia.org/wiki/Perceptron) to classify clinical Breast Cancer samples taken from the University of Wisconsin. 
* __Description__: The breast cancer dataset developed by [Wolberg, W. (1990)](https://doi.org/10.24432/C5HP4Z) was obtained from the University of Wisconsin Hospitals, Madison, from [Dr. William H. Wolberg](https://pages.cs.wisc.edu/~olvi/uwmp/cancer.html), and is available [from the UCI dataset archive](https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original). It contains `699` instances, with `9` clinical features and a class label `{benign | malignant}`.

In [5]:
df = CSV.read(joinpath(_PATH_TO_DATA, "breast-cancer-wisconsin.csv"), DataFrame)

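| Row | id | ClumpThickness | UniformityCellSize | UniformityCellShape | MarginalAdhesion | SingleEpithelialCellSize | BareNuclei | BlandChromatin | NormalNucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |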


__Data wrangling__: The `Class` label is not in the $\{-1,1\}$ form that the perceptron expects, so let's transform the original data by mapping $2\rightarrow{-1}$ and $4\rightarrow{1}$. We'll save the transformed data in the `dataset::DataFrame` variable. In the transformed dataset, the label `-1` denotes _not cancer_ (`benign`), while the label `1` denotes _cancer_ (`malignant`).

In [7]:
dataset = let

    number_of_examples = nrow(df);
    for i ∈ 1:number_of_examples
        c = df[i,:Class];
        if (c == 2)
            df[i,:Class] = -1 # not cancer
        elseif (c == 4)
            df[i,:Class] = 1 # cancer
        end
    end
    df
end

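| Row | id | ClumpThickness | UniformityCellSize | UniformityCellShape | MarginalAdhesion | SingleEpithelialCellSize | BareNuclei | BlandChromatin | NormalNucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | -1 |
| 2 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | -1 |
| 3 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | -1 |
| 4 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | -1 |
| 5 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | -1 |
| 6 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 1 |
| 7 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | -1 |
| 8 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | -1 |
| 9 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | -1 |
| 10 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | -1 |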
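
The loop above relabels the `Class` column one row at a time. As an alternative, the same $2\rightarrow{-1}$, $4\rightarrow{1}$ mapping can be sketched with Julia's built-in `replace` function, which returns a new vector and leaves its input untouched. The `raw_labels` vector below is made up for illustration and is not read from the dataset.

```julia
raw_labels = [2, 2, 4, 2, 4]    # made-up labels in the original {2, 4} coding

# replace(collection, old => new, ...) returns a relabeled copy
mapped_labels = replace(raw_labels, 2 => -1, 4 => 1)
```

Any value other than `2` or `4` would pass through unchanged, so it may be worth checking that only `-1` and `1` remain after the mapping.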


Next, let's (randomly) partition the clinical data into `training` and `test` sets. 
* __Training data__: Training datasets are collections of labeled data used to teach machine learning models, allowing these tools to learn patterns and relationships within the data. In our case, we'll use the training data to estimate the classifier parameters $\beta$.
* __Test data__: Test datasets, on the other hand, are separate sets of labeled data used to evaluate the performance of trained models on unseen examples, providing an unbiased assessment of the _model's generalization capabilities_.

In [9]:
training, test = let

    number_of_training_examples = 456; # from Sidey-Gibbons, 2019
    D = Matrix(dataset);
    number_of_features = size(D,2); # number of cols of the clinical data (id, features, label)
    number_of_examples = size(D,1); # number of rows of the clinical data
    full_index_set = range(1,stop=number_of_examples,step=1) |> collect |> Set;
    
    # build index sets for training and testing
    training_index_set = Set{Int64}();
    should_stop_loop = false;
    while (should_stop_loop == false)
        i = rand(1:number_of_examples);
        push!(training_index_set,i);

        if (length(training_index_set) == number_of_training_examples)
            should_stop_loop = true;
        end
    end
    test_index_set = setdiff(full_index_set,training_index_set);

    # build the test and train datasets -
    training = D[training_index_set |> collect,:];
    test = D[test_index_set |> collect,:];

    # return
    training, test
end;
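
The cell above draws random indices until it has collected 456 unique ones. A shorter way to obtain a comparable random partition is to shuffle the row indices once and cut the shuffled list in two. This is a sketch, assuming the `Random` standard library; the function name `split_indices` and the seed are hypothetical choices, not part of the lab code.

```julia
using Random

# Hypothetical helper: shuffle 1:number_of_examples once, then split the permutation.
function split_indices(number_of_examples::Int, number_of_training_examples::Int;
    rng::AbstractRNG = Random.default_rng())

    shuffled = randperm(rng, number_of_examples)    # random permutation of the row indices
    training_idx = shuffled[1:number_of_training_examples]
    test_idx = shuffled[(number_of_training_examples + 1):end]
    return training_idx, test_idx
end

# 699 examples, 456 for training; the seed makes the split reproducible
training_idx, test_idx = split_indices(699, 456; rng = MersenneTwister(1234))
```

Passing an explicit random number generator makes the partition repeatable between runs, which helps when comparing results with a neighbor.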

In [40]:
training

456×11 Matrix{Int64}:
 1111249  10   6   6   3  4   5   3   6  1   1
 1223967   6   1   3   1  2   1   3   1  1  -1
  897471   4   8   8   5  4   5  10   4  1   1
 1222047  10  10  10  10  3  10  10   6  1   1
 1354840   2   1   1   1  2   1   3   1  1  -1
 1124651   1   3   3   2  2   1   7   2  1  -1
  718641   1   1   1   1  5   1   3   1  1  -1
  183913   1   2   2   1  2   1   1   1  1  -1
 1330361   5   1   1   1  2   1   2   1  1  -1
  536708   1   1   1   1  2   1   1   1  1  -1
 1171845   8   6   4   3  5   9   3   1  1   1
  831268   1   1   1   1  1   1   1   3  1  -1
 1205579   8   7   6   4  4  10   5   1  1   1
       ⋮                  ⋮                  ⋮
 1183596   3   1   3   1  3   4   1   1  1  -1
 1266124   5   1   2   1  2   1   1   1  1  -1
 1164066   1   1   1   1  2   1   3   1  1  -1
 1238777   6   1   1   3  2   1   1   1  1  -1
 1187457   3   1   1   3  8   1   5   8  1  -1
 1190485   1   1   1   1  2   1   1   1  1  -1
 1285722   4   1   1   3  2   1   1   

## Task 2: Build a Classification Model and Learn the Parameters
In this task, we'll build a model of our classification problem and train the model using an online learning method. 
* __Training__: Our Perceptron implementation [based on pseudo-code](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-3/L3a/docs/Notes.pdf) stores problem information in [a `MyPerceptronClassificationModel` instance, which holds the (initial) parameters and other data](src/Types.jl) required by the problem. We initialize the parameters using a vector of `1`'s.

In [11]:
model = let

    # How many features do we have?
    D = training; # let's look at the training data
    number_of_features = size(D,2) - 1; # why minus one?
    
    # build a model
    model = build(MyPerceptronClassificationModel, (
        parameters = ones(number_of_features),
        mistakes = 0 # willing to live with m mistakes
    ));

    model;
end;

Next, we _learn_ the model parameters [using the `learn(...)` method](src/Compute.jl), which takes the training features array `X`, the training labels vector `y`, and the problem instance, and returns an updated problem instance holding the updated parameters. 

In [13]:
trainedmodel = let

    D = training; # what dataset are we going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2) - 1; # how many features do we have (cols)?
    X = [D[:,2:end-1] ones(number_of_examples)]; # features, what??
    y = D[:,end]; # output: this is the target data (label)
    
    # train the model -
    trainedmodel = learn(X,y,model, maxiter = 1000, verbose = true);

    # return
    trainedmodel;
end;

Stopped after number of iterations: 1000. We have number of errors: 15
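
The training cell builds `X` by dropping the `id` and label columns and appending a column of ones. The short check below illustrates why that works: with the bias stored as the last entry of $\beta$, the product $X\beta$ equals the weighted features plus the bias. All numbers here are made up for illustration.

```julia
F = [5 1 1; 8 10 10]         # two made-up examples with three features each
w = [0.5, -1.0, 2.0]         # hypothetical feature weights
b = -3.0                     # hypothetical bias

X = [F ones(size(F, 1))]     # augment: append a column of ones
β = [w; b]                   # parameters with the bias as the last entry

X * β ≈ F * w .+ b           # the augmented product reproduces F*w + b
```

This is also why the parameter vector has one more entry than the number of clinical features.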


__Hmmmm__: Given the exit message above, can we say if this dataset is linearly separable?

## Task 3: Classify the test data and compute the Confusion matrix
In this task, we'll use the updated `model::MyPerceptronClassificationModel` instance (that has learned some parameters from the `training` data) and test how well we classify data that we have never seen, i.e., how well we classify the `test` dataset.
* __Inference__: We run the classification operation on the (unseen) test data [using the `classify(...)` method](src/Compute.jl). This method takes a feature array `X` and the (trained) model instance, and returns the estimated labels. We store the actual (correct) labels in the `y::Array{Int64,1}` vector, while the model-predicted labels are stored in the `ŷ::Array{Int64,1}` array.
* __Performance__: Once the Perceptron (or any binary classifier) has been trained, we can evaluate the binary classifier's performance using various metrics. The central idea is to compare the predicted labels $\hat{y}_{i}$ to the actual labels $y_{i}$ in the data set $\mathcal{D}$. 
Various metrics can be used to evaluate the performance of a binary classifier, but they all start with computing the confusion matrix.

In [16]:
ŷ,y = let

    D = test; # What dataset are you going to use?
    number_of_examples = size(D,1); # how many examples do we have (rows)
    number_of_features = size(D,2) - 1; # how many features do we have (cols)?
    X = [D[:,2:end-1] ones(number_of_examples)]; # features: need to add a 1 to each row (for bias), after removing the label
    y = D[:,end]; # output: this is the *actual* target data (label)

    # compute the estimated labels -
    ŷ = classify(X,trainedmodel) # use the model instance returned by learn(...)

    # return -
    ŷ,y
end;

### Confusion Matrix
The confusion matrix is a $2\times{2}$ matrix that contains four entries: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). [Click me for a confusion matrix schematic!](https://github.com/varnerlab/CHEME-5820-Labs-Spring-2025/blob/main/labs/week-3/L3b/figs/Fig-BinaryConfusionMatrix.pdf) The four cases are:
* The __true positive (TP)__ case $(\text{actual}, \text{model}) = (+,+)$ in the confusion matrix is the number of positive examples that were correctly classified as positive.
* The __false negative (FN)__ case $(\text{actual}, \text{model}) = (+,-)$ is the number of actual positive examples the model incorrectly classified as negative.
* The __false positive (FP)__ case $(\text{actual}, \text{model}) = (-,+)$ is the number of actual negative examples that were incorrectly classified as positive by the model.
* The __true negative (TN)__ case $(\text{actual}, \text{model}) = (-,-)$ is the number of actual negative examples that were correctly classified as negative by the model.

Let's compute these four values and store them in the `confusion_matrix::Array{Int64,2}` variable:

In [18]:
confusion_matrix = let

    # initialize -
    number_of_test_examples = length(ŷ);
    confusion_matrix = Array{Int64,2}(undef,2,2); # 2 x 2 array

    # True positive: TP (cancer)
    counter = 0;
    for i ∈ 1:number_of_test_examples
        if (y[i] == 1 && ŷ[i] == 1)
            counter+=1;
        end
    end
    confusion_matrix[1,1] = counter;

    # False negative: FN
    counter = 0;
    for i ∈ 1:number_of_test_examples
        if (y[i] == 1 && ŷ[i] == -1)
            counter+=1;
        end
    end
    confusion_matrix[1,2] = counter;

    # False positive: FP
    counter = 0;
    for i ∈ 1:number_of_test_examples
        if (y[i] == -1 && ŷ[i] == 1)
            counter+=1;
        end
    end
    confusion_matrix[2,1] = counter;

    # True negative: TN
    counter = 0;
    for i ∈ 1:number_of_test_examples
        if (y[i] == -1 && ŷ[i] == -1)
            counter+=1;
        end
    end
    confusion_matrix[2,2] = counter;

    # return -
    confusion_matrix
end

2×2 Matrix{Int64}:
 77    1
  9  156
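
The four counting loops above can also be written compactly with broadcasting and `count`. The sketch below uses the same layout as the cell above (rows are actual labels, columns are predicted labels); the function name `confusion_sketch` and the demo vectors are hypothetical.

```julia
# Hypothetical helper: rows = actual (+, -), columns = predicted (+, -)
function confusion_sketch(y::AbstractVector{<:Integer}, ŷ::AbstractVector{<:Integer})
    TP = count((y .== 1) .& (ŷ .== 1))      # actual +, predicted +
    FN = count((y .== 1) .& (ŷ .== -1))     # actual +, predicted -
    FP = count((y .== -1) .& (ŷ .== 1))     # actual -, predicted +
    TN = count((y .== -1) .& (ŷ .== -1))    # actual -, predicted -
    return [TP FN; FP TN]
end

y_demo = [1, 1, -1, -1, -1]      # made-up actual labels
ŷ_demo = [1, -1, 1, -1, -1]      # made-up predicted labels
confusion_sketch(y_demo, ŷ_demo) # [1 1; 1 2]
```

The entries always sum to the number of test examples, which is a quick sanity check on any confusion matrix.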

### Discussion questions
Let's answer some questions using the confusion matrix about how well the Perceptron worked, i.e., how well it correctly predicted the class label `{benign | malignant}` given the clinical features.
1. What fraction of the `test` examples did the classifier agent get correct, i.e., the observed label was correctly predicted?

In [20]:
number_of_test_samples = length(ŷ);

In [21]:
TC = confusion_matrix[1,1] + confusion_matrix[2,2];
TC/number_of_test_samples |> f-> println("Fraction correct: $(f)")

Fraction correct: 0.9588477366255144


2. For those cases where the classifier was wrong, is it biased toward being (incorrectly) positive or (incorrectly) negative?

In [23]:
TW = confusion_matrix[2,1] + confusion_matrix[1,2]; # total wrong
FN = confusion_matrix[1,2]; # (+,-)
FP = confusion_matrix[2,1]; # (-,+)

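In [24]:
println("Fraction of errors that are false negatives: $(FN/TW)")

Fraction of errors that are false negatives: 0.1


In [25]:
println("Fraction of errors that are false positives: $(FP/TW)")

Fraction of errors that are false positives: 0.9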
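
The ratios `FN/TW` and `FP/TW` above describe how the classifier's errors split between the two error types. The conventionally defined false negative rate and false positive rate instead normalize by the number of actual positives and actual negatives, respectively. The sketch below computes these, along with a few related metrics, from the confusion matrix entries reported above; the variable names are illustrative.

```julia
TP, FN, FP, TN = 77, 1, 9, 156    # entries of the confusion matrix reported above

accuracy = (TP + TN) / (TP + FN + FP + TN)    # fraction of all test examples classified correctly
false_negative_rate = FN / (FN + TP)          # missed cancers among actual positives
false_positive_rate = FP / (FP + TN)          # false alarms among actual negatives
sensitivity = TP / (TP + FN)                  # equals 1 - false_negative_rate
specificity = TN / (TN + FP)                  # equals 1 - false_positive_rate
```

With these counts, the false negative rate is 1/78 (about 0.013) and the false positive rate is 9/165 (about 0.055), so the classifier's errors on this test set lean toward false positives under either view.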
