# Lab 1d: K-Means Clustering Analysis of a Heart Failure Clinical Dataset
In this lab, we will cluster a dataset describing the clinical risk factors linked with death from heart disease using [the K-means clustering algorithm](https://en.wikipedia.org/wiki/K-means_clustering). Several risk factors (features) are measured per patient (some continuous, some categorical), along with the binary clinical outcome (target variable) `{death | not death}`.

### Tasks

* __Initialize (5 min)__: Break up into teams and familiarize yourself with the components and the underlying code we'll use in the lab. Execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues.


## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file. The `Include.jl` file loads external packages, various functions that we will use in the exercise, and custom types to model the components of our lab problem.

In [3]:
include("Include.jl"); # loads external packages, helper functions, and custom types

### Data
Next, let's load up the dataset that we will explore. The data for this lab was taken from this `2020` publication:
* [Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone." BMC Medical Informatics and Decision Making 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5](https://pubmed.ncbi.nlm.nih.gov/32013925/)

In this paper, the authors analyzed a dataset of 299 heart failure patients collected in 2015. The patients comprised 105 women and 194 men, aged between 40 and 95 years old. The dataset contains 13 features (a mixture of continuous and categorical data), which report clinical, body, and lifestyle information:
* Some features are binary: anemia, high blood pressure, diabetes, sex, and smoking status.
* The remaining features are continuous measurements, such as the level of the creatinine phosphokinase (CPK) enzyme in the blood, the number of platelets, etc.
* The class (target) variable is encoded as a binary (boolean) death event: `1` if the patient died during the follow-up period, `0` otherwise.

We'll load this data as a [DataFrame instance](https://dataframes.juliadata.org/stable/) and store it in the `originaldataset::DataFrame` variable:

In [5]:
originaldataset = CSV.read(joinpath(_PATH_TO_DATA, "heart_failure_clinical_records_dataset.csv"), DataFrame)

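| Row | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | death_event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 2 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.0 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 3 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 4 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 5 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| 6 | 90.0 | 1 | 47 | 0 | 40 | 1 | 204000.0 | 2.1 | 132 | 1 | 1 | 8 | 1 |
| 7 | 75.0 | 1 | 246 | 0 | 15 | 0 | 127000.0 | 1.2 | 137 | 1 | 0 | 10 | 1 |
| 8 | 60.0 | 1 | 315 | 1 | 60 | 0 | 454000.0 | 1.1 | 131 | 1 | 1 | 10 | 1 |
| 9 | 65.0 | 0 | 157 | 0 | 65 | 0 | 263358.0 | 1.5 | 138 | 0 | 0 | 10 | 1 |
| 10 | 80.0 | 1 | 123 | 0 | 35 | 1 | 388000.0 | 9.4 | 133 | 1 | 1 | 10 | 1 |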


We know from lecture that [the K-means approach](https://en.wikipedia.org/wiki/K-means_clustering) works on [an instance of a `Matrix`](https://docs.julialang.org/en/v1/base/arrays/#Base.Matrix-Tuple{UndefInitializer,%20Any,%20Any}) and not [a `DataFrame` instance](https://dataframes.juliadata.org/stable/). Thus, we need to convert the data to [a `Matrix`](https://docs.julialang.org/en/v1/base/arrays/#Base.Matrix-Tuple{UndefInitializer,%20Any,%20Any}). In addition, there are several ways we can pretreat the data to make the clustering easier.

In [7]:
(D, dataset) = let

    # convert 0,1 into -1,1
    treated_dataset = copy(originaldataset);
    transform!(treated_dataset, :anaemia => ByRow(x -> (x==0 ? -1 : 1)) => :anaemia); # maps anaemia to -1,1
    transform!(treated_dataset, :diabetes => ByRow(x -> (x==0 ? -1 : 1)) => :diabetes); # maps diabetes to -1,1
    transform!(treated_dataset, :high_blood_pressure => ByRow(x -> (x==0 ? -1 : 1)) => :high_blood_pressure); # maps high_blood_pressure to -1,1
    transform!(treated_dataset, :sex => ByRow(x -> (x==0 ? -1 : 1)) => :sex); # maps sex to -1,1
    transform!(treated_dataset, :smoking => ByRow(x -> (x==0 ? -1 : 1)) => :smoking); # maps smoking to -1,1
    transform!(treated_dataset, :death_event => ByRow(x -> (x==0 ? -1 : 1)) => :death_event); # maps death_event to -1,1
    
    D = treated_dataset[:,1:end] |> Matrix; # build a data matrix from the DataFrame
    (number_of_examples, number_of_features) = size(D);

    # Which cols do we want to rescale?
    index_to_scale = [
        1 ; # 1 age
        3 ; # 2 creatinine_phosphokinase
        5 ; # 3 ejection_fraction
        7 ; # 4 platelets
        8 ; # 5 serum_creatinine
        9 ; # 6 serum_sodium
        12 ; # 7 time
    ];

    D̂ = copy(D);
    for i ∈ eachindex(index_to_scale)
        j = index_to_scale[i];
        μ = mean(D[:,j]); # compute the mean
        σ = std(D[:,j]); # compute std

        # rescale -
        for k ∈ 1:number_of_examples
            D̂[k,j] = (D[k,j] - μ)/σ;
        end
    end

    # remove categorical cols -
    #D̂₂ = D̂[:,index_to_scale];
    D̂₂ = D̂;
    
    D̂₂, treated_dataset
end;
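
As a quick sanity check on the z-score rescaling used in the cell above, here is a minimal sketch on a tiny matrix. The name `X` is hypothetical, and the values are the `age` and `creatinine_phosphokinase` entries of the first four rows shown earlier. It only illustrates the idea that each rescaled column ends up with approximately zero mean and unit standard deviation; it is not the lab's preprocessing code.

```julia
using Statistics

# Hypothetical toy matrix: 4 examples (rows) × 2 continuous features (columns)
X = [75.0 582.0; 55.0 7861.0; 65.0 146.0; 50.0 111.0];

# z-score each column: subtract the column mean, divide by the column standard deviation
μ = mean(X, dims = 1);
σ = std(X, dims = 1);
Z = (X .- μ) ./ σ;

# after scaling, every column has (approximately) zero mean and unit standard deviation
println("column means: ", vec(mean(Z, dims = 1)))
println("column stds : ", vec(std(Z, dims = 1)))
```

Rescaling matters for K-means because the algorithm compares distances: without it, a feature such as `platelets` (values in the hundreds of thousands) would dominate features such as `serum_creatinine` (values near one).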

In [8]:
D

299×13 Matrix{Float64}:
  1.19095    -1.0   0.000165451  -1.0  …   1.0  -1.0  -1.62678   1.0
 -0.490457   -1.0   7.50206      -1.0      1.0  -1.0  -1.60101   1.0
  0.350246   -1.0  -0.449186     -1.0      1.0   1.0  -1.58812   1.0
 -0.910808    1.0  -0.485257     -1.0      1.0  -1.0  -1.58812   1.0
  0.350246    1.0  -0.434757      1.0     -1.0  -1.0  -1.57524   1.0
  2.452       1.0  -0.551217     -1.0  …   1.0   1.0  -1.57524   1.0
  1.19095     1.0  -0.346124     -1.0      1.0  -1.0  -1.54947   1.0
 -0.0701056   1.0  -0.275011      1.0      1.0   1.0  -1.54947   1.0
  0.350246   -1.0  -0.437849     -1.0     -1.0  -1.0  -1.54947   1.0
  1.6113      1.0  -0.47289      -1.0      1.0   1.0  -1.54947   1.0
  1.19095     1.0  -0.516176     -1.0  …   1.0   1.0  -1.54947   1.0
  0.098035   -1.0  -0.361583     -1.0      1.0   1.0  -1.54947   1.0
 -1.33116     1.0   0.411384     -1.0      1.0  -1.0  -1.53659   1.0
  ⋮                                    ⋱         ⋮              
 -1.33116    -

Let's set some constants that we'll need in the examples below. See the comment next to each constant for a description of what it is, its permissible values, etc.

In [10]:
n = nrow(originaldataset); # how many example data points do we have?
m = size(D,2); # number of features (number of cols)
maxiter = 100000; # maximum iterations
K = 3; # number of clusters. What number should we pick?
ϵ = 1e-6; # tolerance for termination. We can set this to whatever we want

Finally, let's set up the color dictionary for the visualizations in the lecture. The keys of the `my_color_dictionary::Dict{Int64,RGB}` dictionary are the cluster indexes, while the values are the colors mapped to that index.

In [12]:
my_color_dictionary = Dict{Int64,RGB}();
my_color_dictionary[1] = colorant"#03045e";
my_color_dictionary[2] = colorant"#0077b6";
my_color_dictionary[3] = colorant"#00b4d8";
my_color_dictionary[4] = colorant"#ffc300";
my_color_dictionary[5] = colorant"#e36414";

## Task 1: Build a K-means model and cluster the data
Based on the [lecture notes](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-1/L1c/docs/Notes.pdf), we've developed an [initial K-means implementation](src/Cluster.jl). Let's explore how this implementation performs on our sample clinical dataset. 
* Build [a `MyNaiveKMeansClusteringAlgorithm` instance](src/Types.jl), which holds information about the clustering, i.e., the number of clusters `K::Int`, information about the dataset such as the number of features `m::Int` and the number of points `n::Int`, and stopping criteria information such as the maximum number of iterations `maxiter::Int` and tolerance `ϵ::Float64`. We'll store this model in the `model::MyNaiveKMeansClusteringAlgorithm` variable.
* This is an example of [a factory type pattern](https://en.wikipedia.org/wiki/Factory_method_pattern), which uses [a `build(...)` method](src/Factory.jl) to construct and configure a complex object, i.e., set values for the properties on the model that must be computed. The [`build(...)` method](src/Factory.jl) takes the type of thing we want to build as the first argument, and the data required to construct the object (encoded [in a `NamedTuple` type](https://docs.julialang.org/en/v1/base/base/#Core.NamedTuple)) as the second argument.
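
To make the factory idea concrete, here is a minimal sketch of the pattern. The names `MyAbstractSketchAlgorithm`, `MySketchKMeans`, and `build_sketch` are hypothetical stand-ins, and the random initialization is an assumption; the lab's actual definitions live in `src/Types.jl` and `src/Factory.jl` and may differ.

```julia
# Hypothetical stand-ins for the lab's types (the real ones are in src/Types.jl)
abstract type MyAbstractSketchAlgorithm end

mutable struct MySketchKMeans <: MyAbstractSketchAlgorithm
    K::Int
    m::Int
    n::Int
    centroids::Dict{Int64, Vector{Float64}}
    assignments::Vector{Int64}
    MySketchKMeans() = new() # empty constructor: the factory fills in the fields
end

# The factory takes the type to construct and a NamedTuple holding the required data
function build_sketch(modeltype::Type{MySketchKMeans}, data::NamedTuple)
    model = modeltype();
    model.K = data.K;
    model.m = data.dimension;
    model.n = data.number_of_points;

    # derived fields (assumed here): random initial centroids and random initial assignments
    model.centroids = Dict(k => rand(model.m) for k ∈ 1:model.K);
    model.assignments = rand(1:model.K, model.n);
    return model;
end

sketch = build_sketch(MySketchKMeans, (K = 3, dimension = 13, number_of_points = 299));
println(length(sketch.centroids), " centroids, ", length(sketch.assignments), " assignments")
```

The design choice is that the caller passes only raw inputs, while anything that must be computed (here, the initial centroids and assignments) is set in one place inside the factory.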

In [14]:
model = build(MyNaiveKMeansClusteringAlgorithm, (
        maxiter = maxiter,
        dimension = m,
        number_of_points = n,
        K = K,
        ϵ = ϵ,
        scale_factor = 1.0, # scale of the data
));

### Initial clustering values
The `model::MyNaiveKMeansClusteringAlgorithm` contains the data that we passed in, as well as two derived fields that we computed in [the `build(...)` method](src/Factory.jl), the centroids and initial assignments: 
* The `centroids::Dict{Int64, Vector{Float64}}` dictionary holds the centroid values $\mu_1, \dots, \mu_K$ for each cluster. The dictionary's keys are the cluster index, while the values are the `m`-dimensional centroids (means) of the data points in that cluster. We initialize the centroids randomly.
* The `assignments::Vector{Int64}` field is an `n`-dimensional vector holding the cluster index that each data point is assigned to. We initialize the assignments randomly.

What is in the `centroids` and `assignments` fields of your model?

In [16]:
model.centroids

Dict{Int64, Vector{Float64}} with 3 entries:
  2 => [0.0329206, 0.0514251, 0.0673781, 0.611732, 0.97196, 0.00661736, 0.48513…
  3 => [0.47667, 0.204957, 0.727464, 0.0553144, 0.985399, 0.994552, 0.674892, 0…
  1 => [0.725998, 0.958233, 0.110244, 0.827162, 0.00502242, 0.763321, 0.656501,…

### Execute the clustering
We call [the `cluster(...)` method](src/Cluster.jl) to refine our initial random cluster assignments and centroid values. The [`cluster(...)` method](src/Cluster.jl) takes a few arguments:

* `D::Array{<:Number, 2}`: The first argument is the data matrix `D::Array{<:Number,2}` which we want to cluster. The data matrix has the features along the columns, and each row is a data (feature) vector $\mathbf{x}$. Its values can be any type that is a subtype [of Number](https://docs.julialang.org/en/v1/base/numbers/#Core.Number). 
* `model::<:MyAbstractUnsupervisedClusteringAlgorithm`: The second argument is the cluster model instance, in this case, the `model::MyNaiveKMeansClusteringAlgorithm` instance that we built above. However, this can be any subtype of `MyAbstractUnsupervisedClusteringAlgorithm`.
    * __Why?__ Suppose we have different k-means implementations or different clustering logic altogether. In that case, we can take advantage of Julia's multiple dispatch functionality by passing in a different clustering model. This provides a single method for a user to call, which calls a different implementation.
* `verbose::Bool`: The `verbose::Bool` argument tells our implementation whether to save data from each algorithm iteration. The default value is `false`. However, if the value is set to `true`, a save file holding the assignments, centroids, and loop index is written at each iteration.
* `d::Any`: The optional distance argument changes how the similarity between feature vectors $\mathbf{x}\in\mathcal{D}$ is calculated. We can use [any metric exported by the `Distances.jl` package](https://github.com/JuliaStats/Distances.jl); by default, we [use the Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance).

The [`cluster(...)` method](src/Cluster.jl) returns the cluster centroids, the assignments, and the number of iterations that it took to reach the final assignment in a `NamedTuple`, which we store in the `result` variable.
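
For intuition about what happens inside a call like this, here is a minimal sketch of the standard K-means (Lloyd's) iteration on toy data. The function name `naive_kmeans`, the initialization from random data points, and the matrix storage of centroids are assumptions made for illustration; this is not the lab's `cluster(...)` implementation, which stores centroids in a dictionary and may differ in other details.

```julia
using Statistics, LinearAlgebra, Random

# Hypothetical helper: rows of X are data points, columns are features
function naive_kmeans(X::Matrix{Float64}, K::Int; maxiter::Int = 100, ϵ::Float64 = 1e-6)
    n, m = size(X);
    centroids = X[randperm(n)[1:K], :]; # assumption: start from K distinct random data points
    assignments = zeros(Int, n);
    iterations = 0;

    for iter ∈ 1:maxiter
        iterations = iter;

        # assignment step: each point goes to its nearest centroid (Euclidean distance)
        for i ∈ 1:n
            assignments[i] = argmin([norm(X[i, :] - centroids[k, :]) for k ∈ 1:K]);
        end

        # update step: each centroid becomes the mean of its assigned points
        new_centroids = copy(centroids);
        for k ∈ 1:K
            members = findall(==(k), assignments);
            isempty(members) && continue; # keep the old centroid if a cluster is empty
            new_centroids[k, :] = vec(mean(X[members, :], dims = 1));
        end

        # stop when the centroids have (numerically) stopped moving
        shift = norm(new_centroids - centroids);
        centroids = new_centroids;
        shift ≤ ϵ && break;
    end
    return (centroids = centroids, assignments = assignments, iterations = iterations);
end

# Toy data: two well-separated blobs in 2D
Random.seed!(1);
X = vcat(randn(20, 2) .- 5.0, randn(20, 2) .+ 5.0);
toy = naive_kmeans(X, 2);
println("iterations: ", toy.iterations)
```

Because the starting centroids are random, different runs can converge to different local solutions, which is one reason the cluster labels in this lab may not match from run to run.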

In [18]:
result = cluster(D, model, verbose = false); # cluster the data

In [19]:
result.assignments

299-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 3
 1
 1
 1
 1
 ⋮
 2
 2
 3
 2
 2
 2
 2
 2
 2
 2
 2
 2

In [20]:
result.centroids

Dict{Int64, Vector{Float64}} with 3 entries:
  2 => [-0.339002, -0.572519, 0.0795391, -0.0229008, -0.160577, -0.648855, 0.07…
  3 => [0.0780186, 0.376623, -0.186002, -0.454545, 0.832411, 0.0649351, -0.0173…
  1 => [0.421998, 0.0549451, 0.0428853, -0.120879, -0.473187, -0.0989011, -0.09…

## Task 2: How many clusters should we choose?
One of K-means' shortcomings is that we must specify the number of clusters $K$ in advance.
Several heuristic methods can help estimate $K$; here, we'll explore three: the [silhouette method](https://en.wikipedia.org/wiki/Silhouette_(clustering)), the [Calinski-Harabasz index](https://en.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_index) and [the Xie-Beni method](https://ieeexplore.ieee.org/document/7066274).

### Method 1: Silhouette method
The silhouette method evaluates data consistency in clusters. The score ranges from -1 to 1; a high score means a data point fits well in its cluster but not in neighboring ones. The number of clusters $K$ is likely correct if most points score high. Conversely, many low scores suggest there may be too many or too few clusters. 
* The silhouette method is exported by [the `Clustering.jl` package](https://github.com/JuliaStats/Clustering.jl), so let's choose the buy side of the buy-versus-build trade-off and use its prebuilt implementations of the clustering and of the clustering-quality evaluation.
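
To see what the score measures, here is a minimal sketch that computes the mean silhouette directly from its textbook definition: for each point, $a$ is the mean distance to the other points in its own cluster, $b$ is the smallest mean distance to the points of any other cluster, and $s = (b-a)/\max(a,b)$. The helper name `mean_silhouette` and the toy data are hypothetical; it assumes Euclidean distance and at least two clusters, and it is not the `Clustering.jl` implementation.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper: rows of X are points, labels[i] is the cluster of point i
function mean_silhouette(X::Matrix{Float64}, labels::Vector{Int})
    n = size(X, 1);
    clusters = unique(labels);
    dist(i, j) = norm(X[i, :] - X[j, :]); # Euclidean distance between points i and j
    s = zeros(n);
    for i ∈ 1:n
        own = [j for j ∈ 1:n if labels[j] == labels[i] && j != i];
        isempty(own) && continue; # singleton cluster: s(i) = 0 by convention

        # cohesion: mean distance to the other members of the point's own cluster
        a = mean([dist(i, j) for j ∈ own]);

        # separation: smallest mean distance to the members of any other cluster
        b = minimum([
            mean([dist(i, j) for j ∈ 1:n if labels[j] == c])
            for c ∈ clusters if c != labels[i]
        ]);
        s[i] = (b - a) / max(a, b);
    end
    return mean(s);
end

# Toy data: two tight pairs of points that are far apart
X = [0.0 0.0; 0.0 1.0; 10.0 0.0; 10.0 1.0];
good = mean_silhouette(X, [1, 1, 2, 2]); # natural split
bad = mean_silhouette(X, [1, 2, 1, 2]);  # split that mixes the two pairs
println("natural split: ", good, ", mixed split: ", bad)
```

On this toy data, the natural split scores close to 1, while the mixed split scores below zero, which matches the interpretation given above.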

In [23]:
performance_array_silhouette = let

    number_of_clusters_to_explore = 10;
    KA = range(2,stop=number_of_clusters_to_explore, step=1) |> collect;
    tmp = Array{Float64,2}(undef, length(KA), 2);

    for i ∈ eachindex(KA)

        Kᵢ = KA[i]; # how many clusters?
        R = kmeans(transpose(D), Kᵢ; maxiter=maxiter, display=:none) # Clustering.jl expects points as columns, so we pass the transpose of D; results can be sensitive to maxiter
        M = R.centers # get the cluster centers
        a = assignments(R) # get the assignments of points to clusters
        value = clustering_quality(transpose(D), a; quality_index = :silhouettes)
        
        tmp[i,1] = Kᵢ;
        tmp[i,2] = value;
    end
    tmp
end;

In [24]:
performance_array_silhouette

9×2 Matrix{Float64}:
  2.0  0.253674
  3.0  0.156472
  4.0  0.157629
  5.0  0.0954838
  6.0  0.158075
  7.0  0.162528
  8.0  0.159886
  9.0  0.14845
 10.0  0.160525

In [25]:
let
    kopt = argmax(performance_array_silhouette[:,2]) |> j -> performance_array_silhouette[j,1]
    println("Predicted number of clusters by the silhouette method : $(kopt)")
end

Predicted number of clusters by the silhouette method : 2.0


### Method 2: Calinski–Harabasz index (CHI)
The [Calinski–Harabasz index (CHI)](https://en.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_index), also known as the Variance Ratio Criterion, is a widely used metric for assessing the quality of clustering algorithms. Introduced by [Tadeusz Caliński and Jerzy Harabasz in 1974](https://www.tandfonline.com/doi/abs/10.1080/03610927408827101), it evaluates the clustering performance by comparing the ratio of between-cluster variance to within-cluster variance, with higher values indicating better-defined clusters.
* The CHI method is exported by [the `Clustering.jl` package](https://github.com/JuliaStats/Clustering.jl), so let's choose the buy side of the buy-versus-build trade-off and use its prebuilt implementations of the clustering and of the clustering-quality evaluation.
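
The variance ratio can be written out in a few lines. The sketch below assumes the usual definition $\text{CHI} = \frac{B/(K-1)}{W/(n-K)}$, where $B$ is the size-weighted squared distance of the cluster centers from the global center and $W$ is the summed squared distance of the points from their own cluster center. The helper name `calinski_harabasz` and the toy data are hypothetical; this is not the `Clustering.jl` implementation.

```julia
using Statistics

# Hypothetical helper: rows of X are points, labels[i] is the cluster of point i
function calinski_harabasz(X::Matrix{Float64}, labels::Vector{Int})
    n = size(X, 1);
    clusters = unique(labels);
    K = length(clusters);
    global_center = vec(mean(X, dims = 1));

    B = 0.0; # between-cluster dispersion
    W = 0.0; # within-cluster dispersion
    for c ∈ clusters
        members = X[labels .== c, :];
        center = vec(mean(members, dims = 1));
        B += size(members, 1) * sum(abs2, center - global_center);
        W += sum(abs2, members .- center');
    end
    return (B / (K - 1)) / (W / (n - K));
end

# Toy data: two tight pairs of points that are far apart
X = [0.0 0.0; 0.0 1.0; 10.0 0.0; 10.0 1.0];
good = calinski_harabasz(X, [1, 1, 2, 2]); # natural split
bad = calinski_harabasz(X, [1, 2, 1, 2]);  # split that mixes the two pairs
println("natural split: ", good, ", mixed split: ", bad)
```

On this toy data, the natural split gives $B = 100$ and $W = 1$, so the index is 200, while the mixed split gives $B = 1$ and $W = 100$, so the index is 0.02.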

In [27]:
performance_array_CHI = let

    number_of_samples = size(D,1);
    number_of_features = size(D,2);
    number_of_clusters_to_explore = 10;
    KA = range(2,stop=number_of_clusters_to_explore, step=1) |> collect;
    tmp = Array{Float64,2}(undef, number_of_clusters_to_explore-1, 2);

    for i ∈ eachindex(KA)

        Kᵢ = KA[i]; # how many clusters?
        R = kmeans(transpose(D), Kᵢ; maxiter=maxiter, display=:none); # Clustering.jl expects points as columns, so we pass the transpose of D
        M = R.centers # get the cluster centers
        a = assignments(R) # get the assignments of points to clusters
        value = clustering_quality(transpose(D), M, a; quality_index = :calinski_harabasz)
        
        tmp[i,1] = Kᵢ;
        tmp[i,2] = value;
    end
    tmp
end;

In [28]:
let
    kopt = argmax(performance_array_CHI[:,2]) |> j -> performance_array_CHI[j,1]
    println("Predicted number of clusters by CHI: $(kopt)")
end

Predicted number of clusters by CHI: 2.0


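### Method 3: Xie-Beni index (XBI)
The Xie-Beni index measures the ratio between the summed inertia of clusters and the minimum distance between cluster centers. Lower values indicate more compact, better-separated clusters, which is why we select $K$ with `argmin` below. See: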

* [M. Muranishi, K. Honda and A. Notsu, "Application of xie-beni-type validity index to fuzzy co-clustering models based on cluster aggregation and pseudo-cluster-center estimation," 2014 14th International Conference on Intelligent Systems Design and Applications, Okinawa, Japan, 2014, pp. 34-38, doi: 10.1109/ISDA.2014.7066274.](https://ieeexplore.ieee.org/document/7066274)
* The Xie-Beni method is exported by [the `Clustering.jl` package](https://github.com/JuliaStats/Clustering.jl), so let's choose the buy side of the buy-versus-build trade-off and use its prebuilt implementations of the clustering and of the clustering-quality evaluation.
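
The ratio described above can be sketched for hard (non-fuzzy) assignments as follows. This assumes the common crisp form $\text{XBI} = \frac{\sum_i \lVert \mathbf{x}_i - \mu_{a_i}\rVert^2}{n\,\min_{j\neq k}\lVert \mu_j - \mu_k\rVert^2}$; the normalization by $n$, the helper name `xie_beni`, and the toy data are assumptions, and the `Clustering.jl` implementation may differ in detail.

```julia
# Hypothetical helper: rows of X are points, rows of centers are cluster centers,
# labels[i] is the index of the center assigned to point i
function xie_beni(X::Matrix{Float64}, centers::Matrix{Float64}, labels::Vector{Int})
    n = size(X, 1);
    K = size(centers, 1);

    # summed inertia: squared distance of every point to its assigned center
    inertia = sum(sum(abs2, X[i, :] - centers[labels[i], :]) for i ∈ 1:n);

    # smallest squared distance between any two distinct centers
    min_separation = minimum([
        sum(abs2, centers[j, :] - centers[k, :]) for j ∈ 1:K for k ∈ (j + 1):K
    ]);
    return inertia / (n * min_separation);
end

# Toy data: two tight pairs of points that are far apart
X = [0.0 0.0; 0.0 1.0; 10.0 0.0; 10.0 1.0];
good = xie_beni(X, [0.0 0.5; 10.0 0.5], [1, 1, 2, 2]); # natural split
bad = xie_beni(X, [5.0 0.0; 5.0 1.0], [1, 2, 1, 2]);   # split that mixes the two pairs
println("natural split: ", good, ", mixed split: ", bad)
```

On this toy data, the natural split gives 0.0025 and the mixed split gives 25, so smaller values correspond to the better clustering.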

In [30]:
performance_array_xie_beni = let

    number_of_samples = size(D,1);
    number_of_features = size(D,2);
    number_of_clusters_to_explore = 10;
    KA = range(2,stop=number_of_clusters_to_explore, step=1) |> collect;
    tmp = Array{Float64,2}(undef, number_of_clusters_to_explore-1, 2);

    for i ∈ eachindex(KA)

        Kᵢ = KA[i]; # how many clusters?
        R = kmeans(transpose(D), Kᵢ; maxiter=maxiter, display=:none); # Clustering.jl expects points as columns, so we pass the transpose of D
        M = R.centers # get the cluster centers
        a = assignments(R) # get the assignments of points to clusters
        value = clustering_quality(transpose(D), M, a; quality_index = :xie_beni)
        
        tmp[i,1] = Kᵢ;
        tmp[i,2] = value;
    end
    tmp
end;

In [31]:
performance_array_xie_beni

9×2 Matrix{Float64}:
  2.0  1.70343
  3.0  2.51436
  4.0  1.85367
  5.0  2.00084
  6.0  1.96317
  7.0  2.00121
  8.0  1.73321
  9.0  1.92793
 10.0  1.55098

In [32]:
let
    kopt = argmin(performance_array_xie_beni[:,2]) |> j -> performance_array_xie_beni[j,1]
    println("Predicted number of clusters by XBI: $(kopt)")
end

Predicted number of clusters by XBI: 10.0


## Task 3: Interpretation of the clusters
We've run the clustering algorithm on the clinical dataset. Now, let's figure out what is in each cluster.

In [97]:
my_cluster_index = 2; # i ∈ {1,2,...,K}

In [98]:
let
    df = DataFrame(); 
    assignment = result.assignments;
    index_array = findall(a -> a == my_cluster_index, assignment);

    for i ∈ eachindex(index_array)
        a = index_array[i]; # row index of a patient assigned to this cluster
        row_df = (
            c = my_cluster_index,
            age = dataset[a,:age],
            gender = dataset[a,:sex],
            high_blood_pressure = dataset[a,:high_blood_pressure],
            smoking = dataset[a,:smoking],
            death = dataset[a, :death_event]
        );
        push!(df, row_df);
    end
    pretty_table(df, tf = tf_simple)
end

      c       age   gender   high_blood_pressure   smoking   death
  Int64   Float64    Int64                 Int64     Int64   Int64
      2      60.0        1                    -1        -1      -1
      2      55.0        1                    -1         1      -1
      2      45.0        1                    -1        -1       1
      2      41.0        1                    -1         1      -1
      2      58.0        1                    -1         1      -1
      2      65.0        1                    -1         1      -1
      2      42.0        1                    -1        -1      -1
      2      67.0        1                    -1         1      -1
      2      44.0        1                     1        -1      -1
      2      70.0        1                    -1         1      -1
      2      60.0        1                    -1        -1      -1
      2      80.0        

We can see the averages, i.e., the centroids of each feature in the `my_cluster_index::Int` by looking at the centroids dictionary:

In [100]:
names(dataset)

13-element Vector{String}:
 "age"
 "anaemia"
 "creatinine_phosphokinase"
 "diabetes"
 "ejection_fraction"
 "high_blood_pressure"
 "platelets"
 "serum_creatinine"
 "serum_sodium"
 "sex"
 "smoking"
 "time"
 "death_event"

In [101]:
let

    # names - 
    names_dictionary = Dict{String,Int}();
    loopcounter = 1;
    for name ∈ names(dataset)
        names_dictionary[name] = loopcounter;
        loopcounter += 1;
    end
    
    μ = result.centroids[my_cluster_index]
    df = DataFrame();
    row_df = (
        c = my_cluster_index,
        age = μ[names_dictionary["age"]],
        gender = μ[names_dictionary["sex"]],
        high_blood_pressure = μ[names_dictionary["high_blood_pressure"]],
        smoking = μ[names_dictionary["smoking"]],
        death = μ[names_dictionary["death_event"]],
    );
    push!(df, row_df);
    pretty_table(df, tf=tf_simple)
end

      c         age     gender   high_blood_pressure     smoking       death
  Int64     Float64    Float64               Float64     Float64     Float64
      2   -0.339002   0.526718             -0.648855   -0.145038   -0.923664
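
One way to read the centroid entries for the binary features: because those features were recoded to `-1` and `1`, the centroid value $\mu$ is the mean of `-1`/`1` entries, so the fraction of patients in the cluster with value `1` is $(\mu + 1)/2$. The sketch below applies this to the cluster `2` centroid values printed above; the helper name `fraction_positive` is hypothetical, and your values will differ if your run produced different clusters.

```julia
# Fraction of +1 entries implied by the mean μ of a -1/+1 coded feature
fraction_positive(μ::Float64) = (μ + 1) / 2;

# Values copied from the cluster-2 centroid row printed above
centroid_row = (
    gender = 0.526718,
    high_blood_pressure = -0.648855,
    smoking = -0.145038,
    death = -0.923664,
);

for (name, value) ∈ pairs(centroid_row)
    println(rpad(name, 22), round(100 * fraction_positive(value), digits = 1), " %")
end
```

For example, a `death` centroid of `-0.923664` corresponds to roughly 3.8% of the patients in this cluster having `death_event = 1`. The `age` entry is different: it is a z-score, so `-0.339002` means the cluster's mean age is about 0.34 standard deviations below the dataset mean.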
