# PS3: Let's Classify RNA data using K-Nearest Neighbors (KNN)
In this problem set, we will implement the K-Nearest Neighbors (KNN) algorithm to classify RNA data. 

> __Learning Objectives:__
>
> By the end of this problem set, you should be able to:
> * __Prepare RNA data for distance-based classification:__ Parse the LIBSVM training and test files and package each example as a feature-label pair. Apply z-score scaling to each feature so no single feature range dominates distance calculations.
> * __Build and run a kernelized KNN model:__ Define an RBF kernel and verify the sampled kernel matrix is positive semidefinite. Construct a weighted KNN classifier from reference samples and predict labels on a configurable test subset.
> * __Evaluate model behavior with task-relevant metrics:__ Compute confusion-matrix counts together with accuracy, precision, recall, balanced accuracy, and F1 for RNA label predictions. Interpret a sweep over `K` and `gamma` to understand tradeoffs between overall correctness and positive-class recovery.

Let's get started!
___

## Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.

> __Environment Setup with Include.jl__
>
> The [`include(...)` command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/).

Let's set up our code environment:

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
Let's load [a dataset from the `LIBSVM` data archive](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) on the detection of non-coding RNA sequences, originally published by:
* [Andrew V Uzilov, Joshua M Keegan, and David H Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(173), 2006.](https://pubmed.ncbi.nlm.nih.gov/16566836/)


Non-coding RNAs (ncRNAs) have many roles in cells. However, detecting novel ncRNAs in biochemical screens is challenging. Accurate computational methods for detecting ncRNAs in sequenced genomes are important to understanding the roles ncRNAs play in cells. 

> __What's in the dataset?__
> 
> In this dataset, there are `59535` training instances in the `training` data; each instance has `8` continuous features and a binary label $y\in\left\{-1,1\right\}$, where a label of `1` indicates that the RNA sequence is non-coding and a label of `-1` indicates that the RNA sequence is coding. 
> 
> The features are continuous values for each RNA sequence pair consisting of Dynalign-predicted total folding free energy (`ΔG_total`), the shorter-sequence length, and the A/U/C nucleotide frequencies for each sequence.
> 
> The `test` dataset has `271617` instances (with the same `8` continuous features and a binary label).

We begin by loading the `training` and `test` datasets. The [`LIBSVM` library authors](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) have developed these subsets. Let's start by setting some constants and then loading the data. Please look at the comment next to the constant for a definition of what it is, units, permissible values, etc.

In [2]:
number_of_features = 8; # there are eight continuous features
number_of_tests_to_run = 1000; # how many test examples to run through the KNN algorithm?
number_of_reference_samples = 10001; # how many training examples to use as reference samples for the KNN algorithm?

In the code block below, we preprocess the `training` dataset. We shuffle the rows and [z-score scale](https://en.wikipedia.org/wiki/Standard_score) each of the first `1:number_of_features` columns, which hold the features (the last column holds the label). We then package each instance as a named tuple with fields `x` (the scaled feature vector) and `y` (the integer label).

We store the training data in the `training::Array{NamedTuple,1}` array:

In [3]:
training = let

    # load the training data -
    data = parser(joinpath(_PATH_TO_DATA, "cod-rna-training.data"));
    number_of_rows = size(data,1);
    data_perm = randperm(number_of_rows);
    training_data = Array{NamedTuple,1}(undef, number_of_rows);
    X = data[data_perm,:];
    
    # z-score center the data -
    μ = mean(X[:,1:number_of_features],dims=1);
    σ = std(X[:,1:number_of_features],dims=1);
    X̂ = zeros(number_of_rows,number_of_features+1);
    for i ∈ 1:number_of_rows
        for j ∈ 1:number_of_features
            X̂[i,j] = (X[i,j] - μ[j])/(σ[j]);
        end
        X̂[i,end] = X[i,end]; # get the label
    end

    # package the data into an array of named tuples -
    for i ∈ 1:number_of_rows
        features = X̂[i,1:number_of_features];
        label = X̂[i,end];
        training_data[i] = (x = features, y = (label |> Int ));
    end

    training_data; # return the shuffled, z-scored data (the labels are not rebalanced)
end;
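The nested scaling loop above is equivalent to a broadcast over columns. Here is a minimal sketch on a hypothetical toy matrix; the values and the names `F` and `Z` are illustrative assumptions, not part of the RNA dataset:

```julia
using Statistics

# Hypothetical toy data: 4 instances, 2 features on very different scales
F = [100.0 0.20;
     120.0 0.25;
     140.0 0.30;
     160.0 0.45];

μ = mean(F, dims=1);   # 1×2 row of column means
σ = std(F, dims=1);    # 1×2 row of column standard deviations
Z = (F .- μ) ./ σ;     # z-score each column via broadcasting

# after scaling, every column has mean ≈ 0 and standard deviation ≈ 1
println(round.(mean(Z, dims=1), digits=6))
println(round.(std(Z, dims=1), digits=6))
```

After scaling, both columns contribute on a comparable numeric range to any Euclidean distance, which is the property the KNN classifier relies on.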

Next, we preprocess the `test` dataset in the same way: we shuffle the rows, [z-score scale](https://en.wikipedia.org/wiki/Standard_score) each of the first `1:number_of_features` feature columns (the last column holds the label), and package each instance as a named tuple. Note that the scaling here uses the test set's own mean and standard deviation, not the training statistics.

We store the test data in the `test::Array{NamedTuple,1}` array:

In [4]:
test = let

    # load the test data -
    data = parser(joinpath(_PATH_TO_DATA, "cod-rna-testing.data"));
    number_of_rows = size(data,1);
    data_perm = randperm(number_of_rows);
    test_data = Array{NamedTuple,1}(undef, number_of_rows);
    X = data[data_perm,:];
    
    # z-score center the data -
    μ = mean(X[:,1:number_of_features],dims=1);
    σ = std(X[:,1:number_of_features],dims=1);
    X̂ = zeros(number_of_rows,number_of_features+1);
    for i ∈ 1:number_of_rows
        for j ∈ 1:number_of_features
            X̂[i,j] = (X[i,j] - μ[j])/(σ[j]);
        end
        X̂[i,end] = X[i,end]; # get the label
    end

    # package the data into an array of named tuples -
    for i ∈ 1:number_of_rows
        features = X̂[i,1:number_of_features];
        label = X̂[i,end];
        test_data[i] = (x=features, y = (label |> Int ));
    end

    test_data; # return the shuffled, z-scored data (the labels are not rebalanced)
end;

> __Are the labels in the training dataset balanced?__
>
> Since we are using a KNN classifier, it is important to check label balance in the reference dataset. If labels are imbalanced, neighbor voting can bias predictions toward the majority class and reduce minority-class performance.

Let's compute the positive and negative label fractions in the training data.

In [5]:
let 

    # initialize -
    D = training; # specify what data set we are working with
    number_of_training_examples = length(D); # how many examples are in the training dataset?

    # Fancy! Let's use a generator expression to count the number of positive and negative labels in the training dataset. The `sum()` function adds up the number of times the condition is true across the examples in the training dataset.
    count_positive_labels = sum(D[i].y == 1 for i ∈ 1:number_of_training_examples); # how many positive labels are in the training dataset?
    count_negative_labels = sum(D[i].y == -1 for i ∈ 1:number_of_training_examples); # how many negative labels are in the training dataset?
    
    # print the results (fraction of positive and negative labels in the training dataset)
    println("Fraction of positive labels in the training dataset: ", count_positive_labels / number_of_training_examples);
    println("Fraction of negative labels in the training dataset: ", count_negative_labels / number_of_training_examples);
end

Fraction of positive labels in the training dataset: 0.3333333333333333
Fraction of negative labels in the training dataset: 0.6666666666666666


___

## Task 2: Let's look at our KNN Classifier
In this task, we will build and evaluate the KNN classifier that we will be using to make predictions on the test dataset. We'll build [a `MyWeightedKernelizedKNNClassificationModel` model](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.MyWeightedKernelizedKNNClassificationModel) using the `training` dataset as the reference data.

Let's build a kernel function $k:\mathbb{R}^{m}\times\mathbb{R}^{m}\to\mathbb{R}$ to measure similarity. For now, let's write our own radial basis function (RBF) kernel with width parameter $\gamma$, and save it as the function `k(x,y,γ)`.

In [6]:
k(x,y,γ) = exp(-γ * norm(x-y,2)^2) # RBF kernel

k (generic function with 1 method)
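To get a feel for the values this kernel returns, here is a small sketch on hypothetical points. The name `rbf` and the vectors `a`, `b`, and `c` are illustrative assumptions; `rbf` has the same functional form as `k` but a different name so that nothing in the notebook is redefined.

```julia
using LinearAlgebra

# same functional form as k above, under a hypothetical name
rbf(x, y, γ) = exp(-γ * norm(x - y, 2)^2)

a = [0.0, 0.0];
b = [1.0, 1.0];   # squared distance to a is 2
c = [3.0, 4.0];   # squared distance to a is 25

s_aa = rbf(a, a, 0.5);  # identical points: exp(0) = 1
s_ab = rbf(a, b, 0.5);  # exp(-0.5 * 2) = exp(-1)
s_ac = rbf(a, c, 0.5);  # exp(-0.5 * 25), essentially zero
println((s_aa, s_ab, s_ac))
```

Similarity is 1 for identical points and decays toward 0 as the squared distance grows, so nearby reference samples carry more weight than distant ones.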

### Check: Are we using a valid kernel function?
Let's check to see if the distance (similarity) metric we built is a valid kernel function.

> __Condition:__
>
> A function $k:\mathbb{R}^{m}\times\mathbb{R}^{m}\to\mathbb{R}$ is a _valid kernel function_ if and only if the kernel matrix $\mathbf{K}\in\mathbb{R}^{n\times{n}}$ is positive (semi)definite for all possible choices of the data vectors $\mathbf{v}_i$, where $K_{ij} = k(\mathbf{v}_i, \mathbf{v}_j)$. If $\mathbf{K}$ is positive (semi)definite, then for any real-valued vector $\mathbf{x} \in \mathbb{R}^n$, the kernel matrix $\mathbf{K}$ must satisfy $\mathbf{x}^{\top}\mathbf{K}\mathbf{x} \geq 0$. 

Let's compute the kernel matrix `KM::Array{Float64,2}` for a data matrix `X::Array{Float64,2}` (built from the first `100` training instances) using the kernel function `k(x,y,γ)` we built above.

In [7]:
KM = let

    D = training; # specify what data set we are working with
    number_of_training_examples = 100; # how many training examples to use (a small sample keeps the kernel matrix cheap to compute)
    number_of_features = D[1].x |> length; # number of features in the dataset
    γ = 0.5; # kernel parameter

    # fill up the feature matrix -
    X = zeros(number_of_training_examples, number_of_features); # initialize a matrix to hold the features
    for i ∈ 1:number_of_training_examples
        X[i,:] = D[i].x; # fill the matrix with the features from the training dataset
    end

    # fill up the kernel matrix -
    K = zeros(number_of_training_examples,number_of_training_examples);
    for i ∈ 1:number_of_training_examples
        vᵢ = X[i,:];
        for j ∈ 1:number_of_training_examples
            vⱼ = X[j,:];
            K[i,j] = k(vᵢ,vⱼ,γ) # compute kernel value
        end
    end
    K
end;

Next, let's check whether the kernel matrix `KM::Array{Float64,2}` is positive (semi)definite by checking that all of its eigenvalues are non-negative.

> __Check:__
>
> For this kernel to be valid, the kernel matrix $\mathbf{K}$ needs to be positive (semi)definite, i.e., all eigenvalues $\lambda_i \geq 0$. We compute the eigenvalues using [`eigvals`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigvals) and verify they are all non-negative using [the `@assert` macro](https://docs.julialang.org/en/v1/base/base/#Base.@assert) in combination with [the `all` function](https://docs.julialang.org/en/v1/base/collections/#Base.all-Tuple%7BAny%7D).

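Do we blow up? If not, the sampled kernel matrix is PSD (to numerical tolerance). A single sample does not prove validity for all possible data, but it is consistent with a valid kernel and supports using this kernel in this lab.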

In [8]:
let
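    λ = eigvals(Symmetric(KM)); # KM is symmetric by construction; Symmetric guarantees real eigenvalues
    @assert all(λ .≥ -1e-10) "Kernel matrix is not PSD: min eigenvalue = $(minimum(λ))" # small negative tolerance absorbs round-off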
end
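The eigenvalue test and the quadratic-form condition $\mathbf{x}^{\top}\mathbf{K}\mathbf{x} \geq 0$ can be seen side by side on a tiny example. This is a sketch on three hypothetical points; the names `rbf`, `V`, `G`, and `z` are illustrative assumptions:

```julia
using LinearAlgebra

rbf(x, y, γ) = exp(-γ * norm(x - y, 2)^2)  # hypothetical name, same form as k

# three hypothetical, distinct points in the plane
V = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]];
G = [rbf(v, w, 0.5) for v in V, w in V];   # 3×3 kernel (Gram) matrix

λ = eigvals(Symmetric(G));                  # real eigenvalues, ascending order
z = [1.0, -2.0, 0.5];                       # an arbitrary test vector
q = dot(z, G * z);                          # quadratic form zᵀGz

println("min eigenvalue = ", minimum(λ), ", zᵀGz = ", q)
```

Both checks probe the same property: if every eigenvalue is non-negative, the quadratic form is non-negative for every vector, not only the one tried here.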

Ok, so now let's build the KNN classifier model. There are a few design choices to make, such as the number of neighbors to consider (`K`), the kernel function, and any kernel parameters. 

> __How do we choose the `K` parameter?__
>
> The choice of `K` controls the bias-variance tradeoff: __bias__ means a systematic error from an oversimplified neighborhood vote, and __variance__ means sensitivity to which training points happen to be in the dataset. 
> 
> A small `K` uses local voting (low bias, high variance), while a large `K` uses more global voting (high bias, low variance). We use $K = mC + 1$, where $m \geq 0$ is an adjustable integer and $C$ is the number of classes, to explore this tradeoff systematically. 
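One plausible reading of the $K = mC + 1$ rule, offered here as an assumption rather than the stated rationale, is that it keeps $K$ from ever being a multiple of $C$. For $C = 2$ this makes $K$ odd, so an unweighted two-class majority vote cannot tie. A short sketch (the helper `K_of` is hypothetical):

```julia
C = 2;                        # number of classes (binary problem)
K_of(m) = m * C + 1;          # hypothetical helper for K = mC + 1

Ks = [K_of(m) for m in 0:10]; # m = 0 gives K = 1 (nearest neighbor only)
println(Ks)

# K leaves remainder 1 when divided by C, so it is never a multiple of C
println(all(K % C == 1 for K in Ks))
```

With $m = 10$ and $C = 2$, the rule gives $K = 21$, which is the neighborhood size used when the model is built later in this task.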

Can we see this tradeoff in action by varying `K`? Let's test that on the dataset. Next let's consider our second design choice: the kernel width parameter $\gamma$ in the RBF kernel.

> __Gating parameter $\gamma$:__
>
> The parameter $\gamma>0$ controls how quickly similarity decays with squared distance in $k(\mathbf{x},\mathbf{y})=\exp\left(-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert_2^2\right)$. Larger $\gamma$ makes the similarity measure more sensitive, so even small-to-moderate $\lVert\mathbf{x}-\mathbf{y}\rVert_2^2$ values can drive similarity toward zero and emphasize very local neighbors. Smaller $\gamma$ makes the kernel less sensitive, so even larger squared-distance differences can still produce moderate similarity and allow more distant neighbors to influence the vote.
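To make the effect of $\gamma$ concrete, the sketch below evaluates $\exp(-\gamma d^2)$ at one fixed squared distance for several $\gamma$ values. The squared distance `d2 = 4.0` is a hypothetical choice; the $\gamma$ values match the grid used in the Task 3 sweep.

```julia
γ_values = [0.01, 0.05, 0.10, 0.25, 0.50];  # same values as the sweep grid in Task 3
d2 = 4.0;                                    # hypothetical squared distance ‖x − y‖²

sims = [exp(-γ * d2) for γ in γ_values];     # similarity at this distance for each γ
for (γ, s) in zip(γ_values, sims)
    println("γ = ", γ, "  similarity = ", round(s, digits=4))
end
```

At the same distance, the similarity falls steadily as $\gamma$ grows, so a point that still counts as a moderately similar neighbor at small $\gamma$ contributes very little to the vote at large $\gamma$.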

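Let's build the classifier using the first `number_of_reference_samples` training instances as reference data, with $\gamma = 0.5$ and $K = mC + 1 = 21$ (for $m = 10$ and $C = 2$). We will see how varying $\gamma$ and `K` affects performance on the test dataset in Task 3.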

In [9]:
model = let
    
    # initialize -
    D = training; # specify what data set we are working with
    number_of_training_examples = number_of_reference_samples; # how many examples are in the training dataset?
    number_of_features = D[1].x |> length; # number of features in the dataset
    γ = 0.50; # kernel parameter
    m = 10; # neighborhood size multiple
    C = 2; # number of classes

    # fill up the feature matrix -
    X = zeros(number_of_training_examples, number_of_features); # initialize a matrix to hold the features
    for i ∈ 1:number_of_training_examples
        X[i,:] = D[i].x; # fill the matrix with the features from the training dataset
    end

    # fill up the label vector -
    y = zeros(number_of_training_examples); # initialize a vector to hold the labels
    for i ∈ 1:number_of_training_examples
        y[i] = D[i].y; # fill the vector with the labels from the training dataset
    end

    # build a model -
    model = build(MyWeightedKernelizedKNNClassificationModel, (
        K = (m*C+1), # we look at this many points
        features = X,
        labels = y,
        k = (x,y) -> k(x,y,γ), # RBF kernel similarity metric
    ));

    model; # return the model
end;

### Inference
Now that we have defined a kernel function and built the model, let's use it to classify our data. We use the KNN classifier from [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). 

> __What is going on in this code block?__
>
> In the code block below, we:
> * Reuse the `model` instance built above, which is [a `MyWeightedKernelizedKNNClassificationModel` model](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.MyWeightedKernelizedKNNClassificationModel) constructed using [a `build(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/factory/). The `model` instance holds the data for the problem, i.e., the reference features and labels, how many neighbors to look at (`K`), and the kernel function $k$.
> * Pass this `model` instance to [the `classify(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.classify), which takes a test feature vector $\mathbf{z}$ and the classifier `model` instance and returns the predicted label $\hat{y}$ for $\mathbf{z}$. We repeat this for the first `number_of_tests_to_run` test instances.

We return the predicted labels in `ŷ_KNN` and the actual labels in `y_KNN`.

In [10]:
ŷ_KNN,y_KNN = let

    # Data -
    D = test; # what dataset are we working with?
    number_of_test_examples = number_of_tests_to_run; # how many samples are we going to test?
    number_of_features = D[1].x |> length; # number of features in the dataset

     # fill up the feature matrix -
    X = zeros(number_of_test_examples, number_of_features); # initialize a matrix to hold the features
    for i ∈ 1:number_of_test_examples
        X[i,:] = D[i].x; # fill the matrix with the features from the test dataset
    end

    # fill up the label vector (actual labels) -
    y = zeros(number_of_test_examples); # initialize a vector to hold the labels
    for i ∈ 1:number_of_test_examples
        y[i] = D[i].y; # fill the vector with the labels from the test dataset
    end

    # process each vector in the test set, compare that to training (reference), and compute the predicted label -
    ŷ = zeros(number_of_test_examples);  # initialize some storage for the predicted label
    for i ∈ 1:number_of_test_examples
        z = X[i,:]; # get feature vector for test
        ŷ[i] = classify(z,model) # classify the test vector using the training data
    end
 
    # return -
    ŷ,y
end;
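To make the classification step above more concrete, here is a minimal sketch of one way a kernel-weighted neighbor vote could work. This is an illustrative assumption, not the package's implementation: the function `toy_weighted_knn`, the name `rbf`, and the toy data are all hypothetical, and the actual `classify(...)` method may differ in its weighting and tie-breaking.

```julia
using LinearAlgebra

rbf(x, y, γ) = exp(-γ * norm(x - y, 2)^2)  # hypothetical name, same form as k

# Hypothetical kernel-weighted vote: keep the K most similar reference rows,
# then sum similarity × label and take the sign.
function toy_weighted_knn(z, X, y, K, kernel)
    s = [kernel(z, X[i, :]) for i in 1:size(X, 1)];  # similarity to every reference row
    idx = partialsortperm(s, 1:K; rev=true);         # indices of the K largest similarities
    score = sum(s[i] * y[i] for i in idx);           # similarity-weighted vote
    return score ≥ 0 ? 1 : -1                        # ties go to +1 in this sketch
end

X_ref = [0.0 0.0; 0.1 0.0; 0.0 0.1; 5.0 5.0; 5.1 5.0];  # toy reference features
y_ref = [1, 1, 1, -1, -1];                               # toy reference labels
kern = (x, y) -> rbf(x, y, 0.5);

println(toy_weighted_knn([0.05, 0.05], X_ref, y_ref, 3, kern))
println(toy_weighted_knn([5.0, 5.05], X_ref, y_ref, 3, kern))
```

In the second call, the third-nearest neighbor belongs to the other cluster, but its similarity is close to zero, so it barely affects the vote. That is the practical difference between a weighted vote and a plain majority count.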

### Performance
We can evaluate the binary classifier's performance using various metrics. The central idea is to compare the predicted labels $\hat{y}_{i}$ to the actual labels $y_{i}$ in the `test` dataset and measure wins (when the label is the same) and losses (label is different). This is easily represented in [the confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

> __Error analysis using the confusion matrix__
>
> Total mistakes (or mistake percentage) is only part of the story. We should understand whether we are biased toward false positives or false negatives for RNA labels. 
>
> The confusion matrix for a binary classifier is typically structured as:
>
>|                     | **Predicted Positive** | **Predicted Negative** |
>|---------------------|------------------------|------------------------|
>| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
>| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |
>
> From the confusion matrix, we derive five key metrics:
>
> * __Accuracy__ is the fraction of correct predictions overall: $\texttt{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. It tells us the overall success rate but can be misleading if classes are imbalanced; a classifier that predicts "negative" for everything might achieve 95% accuracy on an imbalanced dataset.
> * __Precision__ answers: "When we predict positive, how often are we right?" $\texttt{precision} = \frac{TP}{TP + FP}$. In this RNA context, high precision means fewer false positives; if we predict non-coding, that prediction is usually correct. Low precision means many coding sequences are predicted as non-coding.
> * __Recall__ (also called sensitivity) answers: "Of all the true positives, how many did we catch?" $\texttt{recall} = \frac{TP}{TP + FN}$. High recall means we are catching most non-coding sequences. Low recall means many non-coding sequences are predicted as coding.
> * __Balanced Accuracy__ averages recall (sensitivity) and specificity: $\texttt{balanced\_accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$. A high balanced accuracy means the model performs well on both classes, while a low balanced accuracy means the model is weak on at least one class.
> * __F1 Score__ is the harmonic mean of precision and recall: $\texttt{F1} = \frac{2\cdot\texttt{precision}\cdot\texttt{recall}}{\texttt{precision} + \texttt{recall}}$. A high F1 means we are balancing missed positives and false positives well, while a low F1 means one or both of those error types are large.

We compute the __confusion matrix__ to get these counts. We use the [confusion(...) method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/binaryclassification/#VLDataScienceMachineLearningPackage.confusion), which takes actual labels and estimated labels, returning the confusion matrix. We save the confusion matrix in the `CM_KNN::Array{Int64,2}` variable.

In [11]:
CM_KNN = confusion(y_KNN,ŷ_KNN)

2×2 Matrix{Int64}:
  52  276
 123  549

Let's compute the __accuracy__, __precision__, __recall__, __balanced_accuracy__, and __f1__ for our KNN classifier using the confusion matrix. We save these metrics in the `accuracy_KNN::Float64`, `precision_KNN::Float64`, `recall_KNN::Float64`, `balanced_accuracy_KNN::Float64`, and `f1_KNN::Float64` variables, respectively.

In [12]:
accuracy_KNN, precision_KNN, recall_KNN, balanced_accuracy_KNN, f1_KNN = let

    # compute the confusion matrix -
    CM = CM_KNN; # confusion matrix for KNN classifier

    # compute accuracy, precision, recall, balanced accuracy, and F1 -
    TP = CM[1,1]; # true positives
    TN = CM[2,2]; # true negatives
    FP = CM[2,1]; # false positives
    FN = CM[1,2]; # false negatives

    accuracy = (TP + TN) / sum(CM); # overall accuracy
    precision = (TP + FP) == 0 ? 0.0 : TP / (TP + FP); # precision
    recall = (TP + FN) == 0 ? 0.0 : TP / (TP + FN); # recall
    specificity = (TN + FP) == 0 ? 0.0 : TN / (TN + FP); # specificity
    balanced_accuracy = 0.5 * (recall + specificity); # balanced accuracy
    f1 = (precision + recall) == 0 ? 0.0 : 2 * precision * recall / (precision + recall); # F1 score

    # print so the user can see -
    println("KNN Accuracy: $(round(accuracy*100,digits=2))%")
    println("KNN Precision: $(round(precision*100,digits=2))%")
    println("KNN Recall: $(round(recall*100,digits=2))%")
    println("KNN Balanced Accuracy: $(round(balanced_accuracy*100,digits=2))%")
    println("KNN f1: $(round(f1*100,digits=2))%")

    (accuracy, precision, recall, balanced_accuracy, f1); # return the metrics
end;

KNN Accuracy: 60.1%
KNN Precision: 29.71%
KNN Recall: 15.85%
KNN Balanced Accuracy: 48.78%
KNN f1: 20.68%
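Two quantities that are not printed above are useful when reading these numbers: the specificity that enters balanced accuracy, and the accuracy of a trivial classifier that predicts `-1` for every test point. The sketch below recomputes them from the counts in the `CM_KNN` output shown above; the variable names are illustrative assumptions.

```julia
# counts read off CM_KNN as printed above (rows: actual, columns: predicted)
TP, FN = 52, 276;
FP, TN = 123, 549;

specificity = TN / (TN + FP);                                 # fraction of actual negatives recovered
f1_from_counts = 2TP / (2TP + FP + FN);                       # algebraically equal to the harmonic-mean form
always_negative_accuracy = (TN + FP) / (TP + TN + FP + FN);   # baseline: predict -1 for everything

println("specificity = ", round(100 * specificity, digits=2), "%")
println("F1 (counts) = ", round(100 * f1_from_counts, digits=2), "%")
println("always-negative accuracy = ", round(100 * always_negative_accuracy, digits=2), "%")
```

For this run, the specificity is about 81.7% while recall is 15.85%, and the always-negative baseline reaches 67.2% accuracy on the same 1000 test points, above the 60.1% reported for the KNN model. This is why accuracy alone is a weak summary on imbalanced labels.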


___

## Task 3: Let's do a hyperparameter sweep
In this task, we perform a grid search to see how the number of neighbors `K` and kernel gating parameter $\gamma$ affect classification performance on the test dataset.

> __What is going on in this code block?__
>
> We sweep over several $\left(K,\gamma\right)$ combinations. For each pair, we build a `MyWeightedKernelizedKNNClassificationModel`, classify the first `number_of_tests_to_run` test points, compute a confusion matrix, then calculate accuracy, precision, recall, F1, and balanced accuracy. Finally, we sort all results (by F1) so the best precision-recall tradeoff settings appear first.

Let's execute the sweep and inspect the top-performing settings. We sort the results by F1 in descending order and store the ten best rows in the `knn_sweep_results::DataFrame` variable, which contains the `K`, `gamma`, `TP`, `TN`, `FP`, `FN`, `accuracy`, `precision`, `recall`, `f1`, and `balanced_accuracy` values for each of those combinations.

In [13]:
knn_sweep_results = let

    # load data -
    Dtr = training;
    Dte = test;
    ntr = number_of_reference_samples; # how many training examples do we have?
    number_of_test_examples = number_of_tests_to_run; # how many samples are we going to test?

    # feature/label arrays for training -
    Xtr = zeros(ntr, number_of_features);
    ytr = zeros(ntr);
    for i in 1:ntr
        Xtr[i, :] = Dtr[i].x;
        ytr[i] = Dtr[i].y;
    end

    # feature/label arrays for test -
    Xte = zeros(number_of_test_examples, number_of_features);
    yte = zeros(number_of_test_examples);
    for i in 1:number_of_test_examples
        Xte[i, :] = Dte[i].x;
        yte[i] = Dte[i].y;
    end

    # hyperparameter grids -
    K_grid = [3, 5, 9, 15, 25, 41, 65, 101, 151, 201, 301];
    γ_grid = [0.01, 0.05, 0.10, 0.25, 0.50];

    # initialize a DataFrame to hold the results of the sweep -
    results = DataFrame(
        K = Int[],
        gamma = Float64[],
        TP = Int[],
        TN = Int[],
        FP = Int[],
        FN = Int[],
        accuracy = Float64[],
        precision = Float64[],
        recall = Float64[],
        f1 = Float64[],
        balanced_accuracy = Float64[]
    );

    # sweep all (K, gamma) combinations -
    for γ ∈ γ_grid
        for K_neighbors ∈ K_grid
            
            # Create a local model for this combination of hyperparameters -
            model_local = build(MyWeightedKernelizedKNNClassificationModel, (
                K = K_neighbors, # we are looking at this many neighbors
                features = Xtr,
                labels = ytr,
                k = (x, y) -> k(x, y, γ), # RBF kernel with the current gamma
            ));

            # classify test set -
            ŷ = zeros(number_of_test_examples);
            for i ∈ 1:number_of_test_examples
                ŷ[i] = classify(Xte[i, :], model_local);
            end

            # confusion-matrix metrics -
            CM = confusion(yte, ŷ);
            TP = CM[1, 1]; # true positives
            TN = CM[2, 2]; # true negatives
            FP = CM[2, 1]; # false positives
            FN = CM[1, 2]; # false negatives

            accuracy = (TP + TN) / sum(CM);
            precision = (TP + FP) == 0 ? 0.0 : TP / (TP + FP);
            recall = (TP + FN) == 0 ? 0.0 : TP / (TP + FN);

            specificity = (TN + FP) == 0 ? 0.0 : TN / (TN + FP)
            balanced_accuracy = 0.5 * (recall + specificity)
            f1_score = (precision + recall) == 0 ? 0.0 : 2 * precision * recall / (precision + recall)


            # package results into the DataFrame -
            push!(results, (
                K = K_neighbors, # how many neighbors were used?
                gamma = γ, # what was the kernel parameter?
                TP = TP, # true positives
                TN = TN, # true negatives
                FP = FP, # false positives
                FN = FN, # false negatives
                accuracy = accuracy, # overall accuracy
                precision = precision, # precision
                recall = recall, # recall
                f1 = f1_score, # F1 score
                balanced_accuracy = balanced_accuracy, # balanced accuracy
            ));
        end
    end

    # sort the results -
    df = sort(results, :f1, rev = true); # sort results by F1 (descending order);
    df[1:10, :] # return the top 10 results
end;
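The sweep can take a while to run. A rough cost estimate, assuming each `classify(...)` call compares the test vector against every reference sample (a brute-force assumption about the package, not something confirmed here), is sketched below using the grid and sample sizes from this notebook:

```julia
K_grid = [3, 5, 9, 15, 25, 41, 65, 101, 151, 201, 301];
γ_grid = [0.01, 0.05, 0.10, 0.25, 0.50];
n_reference = 10001;   # number_of_reference_samples
n_test = 1000;         # number_of_tests_to_run

n_models = length(K_grid) * length(γ_grid);          # one model per (K, γ) pair
n_kernel_evals = n_models * n_test * n_reference;    # assumes brute-force comparison

println("models: ", n_models, ", kernel evaluations: ", n_kernel_evals)
```

Under that assumption, the 55 grid points require roughly 550 million kernel evaluations, so reducing `number_of_tests_to_run` or `number_of_reference_samples` is the most direct way to shorten the sweep.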

Next, let's build a table of the results from our hyperparameter search using [the `pretty_table(...)` function exported by the `PrettyTables.jl` package](https://github.com/ronisbr/PrettyTables.jl).

In [14]:
pretty_table(knn_sweep_results; 
    backend = :text,
    fit_table_in_display_horizontally = false,
    table_format = TextTableFormat(borders = text_table_borders__compact)
);

 ------- --------- ------- ------- ------- ------- ---------- ----------- ---------- ---------- -------------------
      K     gamma      TP      TN      FP      FN   accuracy   precision     recall         f1   balanced_accuracy
  Int64   Float64   Int64   Int64   Int64   Int64    Float64     Float64    Float64    Float64             Float64
 ------- --------- ------- ------- ------- ------- ---------- ----------- ---------- ---------- -------------------
      5       0.5     241     192     480      87      0.433    0.334258   0.734756   0.459485            0.510235
      5      0.25     235     199     473      93      0.434    0.331921   0.716463   0.453668            0.506297
     15       0.5      89     523     149     239      0.612     0.37395   0.271341   0.314488            0.524808
    

## Discussion
We have now set up our KNN classifier for the RNA sequence classification task. Let's explore some of the results and discuss what we see.

**DQ1: Why is feature scaling essential for KNN on this RNA dataset?** KNN predictions are driven by distances between feature vectors. If one feature (for example, sequence length or total folding energy) spans a much larger numeric range than the nucleotide-frequency features, it can dominate distance calculations and bias neighbor voting.

> __Strategy__: Use the preprocessing cells to explain what z-score scaling is doing here and why it matters for Euclidean-distance KNN with an RBF kernel. In 2-3 sentences, describe what could go wrong if we skipped scaling and which features would likely dominate (you can modify the scaling cell to explore your answer).

Write your response in the next code cell.

In [15]:
# Answer DQ1 here after evaluating why scaling matters for KNN distances.

In [16]:
did_I_answer_DQ1 = true; # TODO: update to true if answered DQ1 {true | false}

**DQ2: What do accuracy, precision, recall, balanced accuracy, and F1 mean for this RNA classification problem?** This dataset is class-imbalanced, so high accuracy can still occur even when the model misses many true positives. Precision and recall tell us different things about false positives versus false negatives, which can change how we judge model quality.

> __Strategy__: Using `CM_KNN`, `accuracy_KNN`, `precision_KNN`, `recall_KNN`, `balanced_accuracy_KNN`, and `f1_KNN`, interpret each metric in the context of predicting RNA class labels. In 2-3 sentences, state whether this model is better at ruling out negatives or finding positives, and which metric you would prioritize for this task.

Write your response in the next code cell.

In [17]:
# Answer DQ2 here after interpreting CM_KNN and the five metrics.

In [18]:
did_I_answer_DQ2 = true; # TODO: update to true if answered DQ2 {true | false}

**DQ3: What does the hyperparameter sweep tell us about KNN behavior on this dataset?** The `knn_sweep_results` table shows how changing `K` and `gamma` affects TP, TN, FP, FN, and the derived metrics. Similar accuracies can still correspond to very different precision/recall tradeoffs, which reveals what the model is optimizing in practice.

> __Strategy__: Inspect the top-performing and lower-performing rows in `knn_sweep_results`. In 2-3 sentences, describe the trend you see as `K` and `gamma` change, and explain what that suggests about local-vs-global voting and the model's ability to recover positive examples.

Write your response in the next code cell.

In [19]:
# Answer DQ3 here after analyzing trends in `knn_sweep_results`.

In [20]:
did_I_answer_DQ3 = true; # TODO: update to true if answered DQ3 {true | false}

___

## Summary
This problem set implemented an end-to-end kernelized KNN workflow for classifying RNA sequences as coding or non-coding.

> __Key Takeaways:__
>
> * **Feature scaling controls distance behavior:** Z-score scaling puts all features on comparable numeric ranges before neighbor search. Without scaling, large-range features can dominate distances and distort KNN votes.
> * **Confusion-matrix metrics reveal different error modes:** Accuracy summarizes total correctness across both classes. Precision and recall show whether the model is producing too many false non-coding calls or missing true non-coding sequences.
> * **Hyperparameter sweeps expose model tradeoffs:** Changing `K` and `gamma` changes how local the voting behavior is and how fast similarity decays. Similar accuracy values can hide large recall differences, so parameter selection should match the task objective.

The next step is to choose `K` and `gamma` based on the error type you want to minimize for RNA screening.
___

## Tests
In the code block below, we check some values in your notebook and give you feedback on which items are correct or different. If you are curious about how we implemented the tests and what we are testing, unhide the code block below.

In [21]:
let
    @testset verbose = true "CHEME 5820 problem set 3 test suite" begin

        @testset "Setup, Data, and Prerequisites" begin
            @test number_of_features == 8
            @test number_of_tests_to_run > 0
            @test number_of_reference_samples > 0
            @test isnothing(training) == false
            @test isnothing(test) == false
            @test length(training) > number_of_reference_samples
            @test length(test) > number_of_tests_to_run

            ncheck_train = min(1000, length(training))
            ncheck_test = min(1000, length(test))
            @test all(d -> length(d.x) == number_of_features, training[1:ncheck_train])
            @test all(d -> length(d.x) == number_of_features, test[1:ncheck_test])
            @test all(d -> d.y in (-1, 1), training[1:ncheck_train])
            @test all(d -> d.y in (-1, 1), test[1:ncheck_test])

            count_positive_labels_training = sum(training[i].y == 1 for i in eachindex(training))
            count_negative_labels_training = sum(training[i].y == -1 for i in eachindex(training))
            count_positive_labels_test = sum(test[i].y == 1 for i in eachindex(test))
            count_negative_labels_test = sum(test[i].y == -1 for i in eachindex(test))
            @test count_positive_labels_training > 0
            @test count_negative_labels_training > 0
            @test count_positive_labels_test > 0
            @test count_negative_labels_test > 0
        end

        @testset "Task 2: Kernel Function, Model, and Inference" begin
            @test isnothing(k) == false
            @test k(training[1].x, training[2].x, 0.5) isa Real

            @test isnothing(KM) == false
            km_n1, km_n2 = size(KM)
            @test km_n1 == km_n2
            @test km_n1 >= 2
            @test all(isfinite, KM)
            @test isapprox(KM, transpose(KM), atol=1e-8)
            @test all(isapprox.(diag(KM), 1.0, atol=1e-8))
            λ = eigvals(Symmetric(KM))
            @test minimum(λ) >= -1e-8

            @test isnothing(model) == false
            @test length(ŷ_KNN) == number_of_tests_to_run
            @test length(y_KNN) == number_of_tests_to_run
            @test all(v -> v in (-1.0, 1.0), ŷ_KNN)
            @test all(v -> v in (-1.0, 1.0), y_KNN)
        end

        @testset "Task 2: Confusion Matrix and Metrics" begin
            @test isnothing(CM_KNN) == false
            @test size(CM_KNN) == (2, 2)
            @test all(CM_KNN .>= 0)
            @test sum(CM_KNN) == number_of_tests_to_run

            TP = CM_KNN[1,1]
            TN = CM_KNN[2,2]
            FP = CM_KNN[2,1]
            FN = CM_KNN[1,2]

            accuracy_expected = (TP + TN) / sum(CM_KNN)
            precision_expected = (TP + FP) == 0 ? 0.0 : TP / (TP + FP)
            recall_expected = (TP + FN) == 0 ? 0.0 : TP / (TP + FN)
            specificity_expected = (TN + FP) == 0 ? 0.0 : TN / (TN + FP)
            balanced_accuracy_expected = 0.5 * (recall_expected + specificity_expected)
            f1_expected = (precision_expected + recall_expected) == 0 ? 0.0 : 2 * precision_expected * recall_expected / (precision_expected + recall_expected)

            @test 0.0 <= accuracy_KNN <= 1.0
            @test 0.0 <= precision_KNN <= 1.0
            @test 0.0 <= recall_KNN <= 1.0
            @test 0.0 <= balanced_accuracy_KNN <= 1.0
            @test 0.0 <= f1_KNN <= 1.0
            @test isapprox(accuracy_KNN, accuracy_expected, atol=1e-12)
            @test isapprox(precision_KNN, precision_expected, atol=1e-12)
            @test isapprox(recall_KNN, recall_expected, atol=1e-12)
            @test isapprox(balanced_accuracy_KNN, balanced_accuracy_expected, atol=1e-12)
            @test isapprox(f1_KNN, f1_expected, atol=1e-12)
        end

        @testset "Task 3: Hyperparameter Sweep" begin
            @test isnothing(knn_sweep_results) == false
            @test nrow(knn_sweep_results) > 0
            @test names(knn_sweep_results) == ["K", "gamma", "TP", "TN", "FP", "FN", "accuracy", "precision", "recall", "f1", "balanced_accuracy"]
            @test issorted(knn_sweep_results.f1, rev=true)
            @test all(knn_sweep_results.K .> 0)
            @test all(isodd.(knn_sweep_results.K))
            @test all(knn_sweep_results.gamma .> 0.0)
            @test all((knn_sweep_results.TP .+ knn_sweep_results.TN .+ knn_sweep_results.FP .+ knn_sweep_results.FN) .== number_of_tests_to_run)
            @test all((0.0 .<= knn_sweep_results.accuracy) .& (knn_sweep_results.accuracy .<= 1.0))
            @test all((0.0 .<= knn_sweep_results.precision) .& (knn_sweep_results.precision .<= 1.0))
            @test all((0.0 .<= knn_sweep_results.recall) .& (knn_sweep_results.recall .<= 1.0))
            @test all((0.0 .<= knn_sweep_results.f1) .& (knn_sweep_results.f1 .<= 1.0))
            @test all((0.0 .<= knn_sweep_results.balanced_accuracy) .& (knn_sweep_results.balanced_accuracy .<= 1.0))

            for row in eachrow(knn_sweep_results)
                TP = row.TP
                TN = row.TN
                FP = row.FP
                FN = row.FN

                accuracy_expected = (TP + TN) / (TP + TN + FP + FN)
                precision_expected = (TP + FP) == 0 ? 0.0 : TP / (TP + FP)
                recall_expected = (TP + FN) == 0 ? 0.0 : TP / (TP + FN)
                specificity_expected = (TN + FP) == 0 ? 0.0 : TN / (TN + FP)
                balanced_accuracy_expected = 0.5 * (recall_expected + specificity_expected)
                f1_expected = (precision_expected + recall_expected) == 0 ? 0.0 : 2 * precision_expected * recall_expected / (precision_expected + recall_expected)

                @test isapprox(row.accuracy, accuracy_expected, atol=1e-12)
                @test isapprox(row.precision, precision_expected, atol=1e-12)
                @test isapprox(row.recall, recall_expected, atol=1e-12)
                @test isapprox(row.balanced_accuracy, balanced_accuracy_expected, atol=1e-12)
                @test isapprox(row.f1, f1_expected, atol=1e-12)
            end
        end

        @testset "Discussion Questions" begin
            @test did_I_answer_DQ1 == true
            @test did_I_answer_DQ2 == true
            @test did_I_answer_DQ3 == true
        end
    end
end;

[0m[1mTest Summary:                                   | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
CHEME 5820 problem set 3 test suite             | [32m 109  [39m[36m  109  [39m[0m1.0s
  Setup, Data, and Prerequisites                | [32m  15  [39m[36m   15  [39m[0m0.5s
  Task 2: Kernel Function, Model, and Inference | [32m  14  [39m[36m   14  [39m[0m0.2s
  Task 2: Confusion Matrix and Metrics          | [32m  14  [39m[36m   14  [39m[0m0.1s
  Task 3: Hyperparameter Sweep                  | [32m  63  [39m[36m   63  [39m[0m0.2s
  Discussion Questions                          | [32m   3  [39m[36m    3  [39m[0m0.0s
