## HMS 2024 - Brain Clusters

Classifying different types of <b>Harmful Brain Activity</b>, <b>HBA</b>, using EEG recordings and spectra.
This notebook looks at the k-means clustering of the 6-dimensional probability vectors of the HBA samples.

General observations: <br>
- The number of votes falls into two ranges: 62.6% are in 1--7 and 37.4% in 10--28. <br>
- The distribution of the expert HBAs is very different between the vote ranges; most notably, there are very few seizures in the 10--28 vote range. <br>
- Is the LB test data similar to the train data? Probably not: using the train average probabilities with train data gives a KL value (i.e., CV) of 1.38, but when those probabilities are submitted as predictions the LB value is 1.07. <br>
- The vote entropy gives a measure of how spread out the voting is; 89.3% have entropy at or below the value for votes spread evenly over 2 bins, and almost half have zero entropy (unanimous voting). <br>
- p-values for the hypothesis test H0: p_max = 0.5 versus Ha: p_max > 0.5 give a measure of how convincing the votes are that the HBA with the most votes has p > 0.5, i.e. is the main HBA. <br>
- Each patient has only about 2 HBA types among their samples.<br>
- Each eeg_id is from a single patient and i) mostly has one HBA type, and ii) often multiple eeg_sub_ids have the same voting values (number of votes in HBA and total votes) suggesting they were evaluated together.<br>


Clustering results: <br>
- If clusters are determined using only the 10--28 vote range HBA samples, then 5 clusters are indicated, one for each HBA type except seizure (not surprisingly). <br>
- Using all the data, 6 clusters are indicated, one for each HBA type. Each cluster has some admixture of the other HBA types. For example, the cluster center for `LRDA` is
`[ 1.6% 6.1% 0.8% 69.9% 6.7% 15.0% ]` and shows a strong `LRDA` component, 69.9%, but also a strong 15% in the `Other` component. <br>
- Adding a 7th cluster produces a new center away from the extremes: `[3.4% 12.1% 5.5% 15.7% 18.8% 44.5%]` and perhaps captures the main ambiguity in the voting. <br>
- A submission could be made by using a multi-class classifier to predict the class (i.e., cluster membership) of each sample. Using 6 (or 7) clusters and assigning them perfectly to the train data gives a KL divergence score of 0.30 (0.28); this compares to a score of 0.38 if the probability is set to an optimal 81.5% for the consensus HBA and 3.7% for each of the others.<br>
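
The near-one-hot comparison in the last bullet is mostly arithmetic, so here is a minimal sketch of it. The helper name `consensus_probs` is hypothetical (it is not defined elsewhere in this notebook), and the KL value at the end is for a single unanimous sample only, not for the train set.

```python
import numpy as np

def consensus_probs(p_yes, n_classes=6):
    """Hypothetical helper: near-one-hot rows with p_yes on the diagonal and
    the remaining probability split evenly over the other classes."""
    p_not = (1.0 - p_yes) / (n_classes - 1)
    return np.full((n_classes, n_classes), p_not) + (p_yes - p_not) * np.eye(n_classes)

probs = consensus_probs(0.815)
print(probs[0])            # 0.815 for the consensus class, 0.037 for each other class
print(probs.sum(axis=1))   # every row sums to 1

# For a unanimous sample only the true class has non-zero weight,
# so its KL contribution reduces to -log(p_yes).
kl_unanimous = -np.log(0.815)
print(round(kl_unanimous, 4))   # 0.2046
```

Samples with split votes contribute more than this, which is consistent with the overall train score of 0.38 being larger than the unanimous-sample value.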


## Things to use

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# For calculating p-values
from scipy.stats import binom

# For k-means
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

In [None]:
# Directory prefix for the data
above_dir = "../input/hms-harmful-brain-activity-classification/"
# or, offline, use my local directory
##above_dir = "D:/Kaggle/input/hms-harmful-brain-activity-classification/"

#### Functions, Etc.

In [None]:
# Define these since they get used a lot
HBA_names = ["seizure", "lpd", "gpd", "lrda", "grda", "other"]
iHBA_of_expert = {"Seizure":0,"LPD":1,"GPD":2,"LRDA":3,"GRDA":4,"Other":5}
HBA_votes = ["seizure_vote", "lpd_vote", "gpd_vote",
             "lrda_vote", "grda_vote", "other_vote"]
HBA_probs = ["seizure_prob", "lpd_prob", "gpd_prob",
             "lrda_prob", "grda_prob", "other_prob"]

# Output format for arrays
np.set_printoptions(precision=6, suppress=True)

In [None]:
def kld_score(solution, submission):
    '''
    Calculate the average KL divergence score.
    Ignores the "row id" assumed in the first column.
    '''
    sumsum = 0.0
    # Go through the probability columns, skipping the id in column 0
    for prob_col in solution.columns.values[1:]:
        # Rows where the solution prob is zero evaluate to nan;
        # nansum drops them, i.e. 0*log(0) is treated as 0
        sumsum += np.nansum(-1.0*solution[prob_col] *
                        np.log(submission[prob_col] / solution[prob_col]))
    return sumsum/(len(solution))

# Example of KL Divergence result from its Kaggle metric page
solution = pd.DataFrame({'id': range(3), 'ham': [0, 0.5, 0.5], 
                        'spam': [0.1, 0.5, 0.5], 'other': [0.9, 0, 0]})
submission = pd.DataFrame({'id': range(3), 'ham': [0.2, 0.3, 0.5], 
                        'spam': [0.1, 0.5, 0.5], 'other': [0.7, 0.2, 0]})
# score(solution, submission, 'id')
#    0.160531...

# Check that this simple version above gives the same value:
kld_score(solution, submission)

In [None]:
def read_hms_meta():
    '''
    Read in the train.csv and test.csv files.
    Add total_vote, _prob columns, and vote entropy to train_meta.
    Add extra cols to test to allow the same processing as train:
        eeg[spectro]_sub_id, eeg[spectro]_label_offset_seconds, label_id
    Make various plots of the train_meta values.
    '''

    # Read the test meta data
    test_meta = pd.read_csv(above_dir+"test.csv")
    test_meta_len = len(test_meta)
    print("Test has length", test_meta_len)
    # Add columns to allow similar train/test processing:
    test_meta["eeg_sub_id"] = 0
    test_meta["eeg_label_offset_seconds"] = 0.0
    test_meta["spectrogram_sub_id"] = 0
    test_meta["spectrogram_label_offset_seconds"] = 0.0
    test_meta["label_id"] = test_meta.eeg_id
    # Can decide to replace the not-real test with training data instead
    if test_meta_len > 1:
        REAL_TEST = True
    else:
        REAL_TEST = False
        print("  --> not the real LB test data.\n")
        # Replace the test_meta?
        pass
  
    # Read the train meta data
    train_meta = pd.read_csv(above_dir+"train.csv")
    train_meta_len = len(train_meta)
    print("Train has length", train_meta_len, " with:")
    
    # Add a total_vote column
    train_meta["total_vote"] = train_meta[HBA_votes].sum(axis=1)
    # Add a max_vote column (i.e. number of votes in the expert consensus)
    train_meta["max_vote"] = train_meta[HBA_votes].max(axis=1)
    
    # Show various unique numbers
    for this_col in ["label_id","eeg_id","spectrogram_id",
                     "patient_id","total_vote"]:
        print("   ", len(train_meta[this_col].unique()),
            "unique "+this_col+" values.")
    
    # Look at the total votes values:  1 to 28 with missing 8 and 9
    print("\nHistogram of the total votes")
    plt.figure(figsize=(6,3))
    plt.hist(train_meta["total_vote"],bins=55,log=True)
    plt.title("Histogram of Total Votes")
    plt.show()
    
    # Distribution of the "Expert consensus" for less/more than 9 votes
    allvt = train_meta.expert_consensus.value_counts()
    less9 = train_meta[train_meta["total_vote"] < 9].expert_consensus.value_counts()
    more9 = train_meta[train_meta["total_vote"] > 9].expert_consensus.value_counts()
    vnot3 = train_meta[train_meta["total_vote"] != 3].expert_consensus.value_counts()
    print("Counts for votes less than 9:\n",less9[allvt.index])
    print("\nCounts for votes more than 9:\n",more9[allvt.index])
    ##print("\nCounts for votes not equal to 3:\n",vnot3[allvt.index])
    
    # Create _prob values from the _vote values
    print("\nHistograms of the probabilities of the different HBAs:")
    print("   (note that the large Prob=0 bin is not included.)")
    for col_pre in HBA_names:
        train_meta[col_pre + "_prob"] = (train_meta[col_pre + "_vote"] / 
                                     train_meta["total_vote"] )
        # Show the probability histogram for each type
        plt.figure(figsize=(6,1.5))
        plt.hist(train_meta[col_pre + "_prob"],bins=20,range=(0.02,1))
        plt.ylim(0,len(train_meta)/5)
        plt.title("Histogram of   "+col_pre+"_prob")
        plt.show()
        
    # Calculate the entropy for each row (~ amount of vote variation)
    print("Calculating voting entropy values ...")
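    def calc_entropy(row):
        # Select the _prob columns by name; a positional slice would
        # shift whenever a column is added to train_meta
        the_probs = np.clip(row[HBA_probs].values.astype(float), 1.e-8,1.0)
        return np.nansum(the_probs * -1*np.log(the_probs))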
    # Add an entropy column
    train_meta["entropy"] = train_meta.apply(calc_entropy, axis=1)
   
    return train_meta, test_meta


In [None]:
def prob_prob_scatter(name1, name2, probs2plot, clust_ids):
    '''
    Make a prob1 vs prob2 scatter plot.
    Include an x at the cluster centers in chosen axes.
    Use sqrt scaling.
    External: HBA_probs, iHBA_of_expert[ ], clust_probs
    '''
    # Color-code the vectors by their k-means label, arbitrary
    kmclrs = 2*["red","blue","green","black","purple","orange"]
    clstclrs = []
    for ilab in clust_ids:
        clstclrs.append(kmclrs[ilab])
        
    ixax = iHBA_of_expert[name1]
    iyax = iHBA_of_expert[name2]
    lenprob = len(probs2plot)
    plt.figure(figsize=(5,5))
    plt.scatter(np.sqrt(probs2plot[HBA_probs[ixax]]) + 
                0.04*(np.random.rand(lenprob)-0.5),
            np.sqrt(probs2plot[HBA_probs[iyax]]) + 
                0.04*(np.random.rand(lenprob)-0.5),
           s=3, c=clstclrs, alpha=0.02)
    # Add the centers
    for iclust in range(0,len(clust_probs)):
        plt.plot(np.sqrt([clust_probs[iclust,ixax]]),
                 np.sqrt([clust_probs[iclust,iyax]]),
                 c=kmclrs[iclust],marker="x",markersize=15)
    plt.xlabel("sqrt( "+name1+" )")
    plt.ylabel("sqrt( "+name2+" )")
    plt.show()

<HR>

## Get and look at the csv meta data

In [None]:
# Read in the meta data, routine also looks at the train values
train_meta, test_meta = read_hms_meta()

In [None]:
# The columns in train_meta
train_meta.info()

In [None]:
# The columns in test_meta
test_meta.info()

### Look at the vote entropies
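
For reference, a minimal sketch of the entropy calculation on raw vote counts, together with the "spread evenly over k bins" values that appear as reference lines in the plot. The helper name `vote_entropy` and the example vote rows are made up for illustration.

```python
import numpy as np

def vote_entropy(votes):
    """Shannon entropy (natural log) of one row of vote counts."""
    votes = np.asarray(votes, dtype=float)
    probs = votes / votes.sum()
    nonzero = probs[probs > 0]   # 0 * log(0) is treated as 0
    return float(np.sum(nonzero * np.log(1.0 / nonzero)))

# Votes spread evenly over k classes give an entropy of log(k):
# 0, 0.6931, 1.0986, 1.3863, 1.6094, 1.7918 for k = 1..6
reference = [np.log(k) for k in range(1, 7)]
print(np.round(reference, 4))

print(vote_entropy([3, 0, 0, 0, 0, 0]))   # unanimous vote
print(vote_entropy([5, 5, 0, 0, 0, 0]))   # even split over two classes
```

The thresholds 0.70 and 1.10 used in the cell below sit just above log(2) and log(3), so they count rows at or below the 2-bin and 3-bin reference values.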

In [None]:
if True:
    print("Histogram of the votes entropies")
    plt.figure(figsize=(6,3))
    plt.hist(train_meta["entropy"],bins=50,log=True)
    plt.title("Histogram of Vote Entropy (~ vote variation)")
    plt.show()
    
    # Show the Vote Entropy vs the Number of Votes
    plt.figure(figsize=(8,6))
    plt.scatter(train_meta['total_vote'] + 0.7*(np.random.rand(len(train_meta))-0.5),
            train_meta['entropy'] + 0.05*(np.random.rand(len(train_meta))-0.5),
           s=3, alpha=0.02)
    # For reference, plot entropy values if spread evenly into n bins
    ref_ents = []
    for ispread in [1,2,3,4,5,6]:
        spread_ent = np.log(ispread)
        ref_ents.append(spread_ent)
        plt.plot([ispread,28],[spread_ent,spread_ent],
                 lw=2, c='pink', alpha=0.5)
        plt.text(24.0, spread_ent+0.03, "{} x p=1/{}".format(
                ispread,ispread))
    plt.title("Vote Entropy vs Number of Votes") 
    plt.ylim(-0.05,1.90)
    plt.ylabel("Entropy of the Votes")
    plt.xlabel("Number of Votes")
    plt.show()
    # List the reference entropy values
    fmtstr = (len(ref_ents)-1) * '{:.4f}, ' + '{:.4f}'
    print("          The entropy reference lines are at:",
                      fmtstr.format(*ref_ents),"\n")
    frac_z = sum(train_meta["entropy"] < 0.02)/len(train_meta["entropy"])
    frac_lt2 = sum(train_meta["entropy"] < 0.70)/len(train_meta["entropy"])
    frac_lt3 = sum(train_meta["entropy"] < 1.10)/len(train_meta["entropy"])
    print("          The fraction with zero entropy is {:.1f}%".format(
                        100.0*frac_z))
    print("          The fraction below 2 x p=1/2 is {:.1f}%".format(
                        100.0*frac_lt2))
    print("          The fraction below 3 x p=1/3 is {:.1f}%".format(
                        100.0*frac_lt3))
    

### p-values for votes
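
The p-value here is the upper tail of a binomial distribution: the chance of seeing at least `x_votes` out of `n_votes` for one HBA if its true probability were only 0.5. The sketch below computes that tail directly with the standard library; it should agree with the `1 - binom.cdf(x_votes-1, n_votes, 0.5)` expression used in the next cell. The function name `upper_tail_p_value` is hypothetical.

```python
from math import comb

def upper_tail_p_value(x_votes, n_votes, p0=0.5):
    """P(X >= x_votes) for X ~ Binomial(n_votes, p0)."""
    return sum(comb(n_votes, k) * p0**k * (1.0 - p0)**(n_votes - k)
               for k in range(x_votes, n_votes + 1))

print(upper_tail_p_value(12, 15))   # 0.017578125
print(upper_tail_p_value(3, 3))     # 0.125
```

Note that a unanimous 3-of-3 vote only reaches p = 0.125 under this test, which is why the cell below handles unanimous rows separately and assigns them a fixed small p-value.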

In [None]:
# p-value for rejecting Ho: p_max = 0.5
#      and so accepting Ha: p_max > 0.5
##  x_votes = 12
##  n_votes = 15
##  1 - binom.cdf(x_votes-1, n_votes, 0.5)

# Get p-values for all the rows
p_vals = []
for (x_votes, n_votes) in zip(train_meta.max_vote, train_meta.total_vote):
    if x_votes == n_votes:
        # If it is unanimous assume very high confidence, very low p-value
        p_vals.append(np.exp(-7.2))
    else:
        # Assign the p-value based on binomial distribution 
        p_vals.append(np.clip((1 - binom.cdf(x_votes-1, n_votes, 0.5)),0.001,1.0))
    
p_vals = np.array(p_vals)
log_p_vals = np.log(p_vals)

In [None]:
# Various plots using the p-values
p_markers = [0.001,0.01,0.05,0.20]

plt.figure(figsize=(6,5))
plt.scatter(log_p_vals + 0.12*(np.random.rand(len(train_meta))-0.5),
            train_meta.entropy + 0.03*(np.random.rand(len(train_meta))-0.5),
           s=3,alpha=0.02)
for alpha in p_markers:
    plt.plot(2*[np.log(alpha)],[0.0,1.4],c='orange',lw=1)
    plt.text(np.log(alpha)-0.2, 1.43,str(100*alpha)+"%",c='orange')
plt.xlabel("log( p-value )")
plt.ylabel("One HBA  < - - Entropy - - > Spread among HBAs")
plt.title("Vote Entropy  vs  p-value")
plt.show()

plt.figure(figsize=(6,5))
plt.scatter(log_p_vals + 0.18*(np.random.rand(len(train_meta))-0.5),
            train_meta.max_vote/train_meta.total_vote +
                            0.03*(np.random.rand(len(train_meta))-0.5),
           s=3,alpha=0.02)
for alpha in p_markers:
    plt.plot(2*[np.log(alpha)],[0.01,0.95],c='orange',lw=1)
    plt.text(np.log(alpha)-0.2, 0.98,str(100*alpha)+"%",c='orange')
plt.xlabel("log( p-value )")
plt.ylabel("Votes in max HBA / Total votes")
plt.ylim(0.15,1.05)  # 1/6 is the minimum possible with six HBA types
plt.title("Max Votes / Total Votes  vs  p-value")
plt.show()


print("\nFraction with p-value < 1.0% is {:.1f}%\n".format(
    100.0 * sum(p_vals < 0.01) / len(p_vals)))
print("Fraction with p-value > 20.0% is {:.1f}%\n".format(
    100.0 * sum(p_vals > 0.20) / len(p_vals)))

### Identical voting within an eeg
By using the number of rows when grouped in different ways, we can look into @patrob 's discussion comment about [multiple identical voting distributions](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/470645#2618899).
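
To make the row-counting idea concrete, here is a small sketch on a toy dataframe (all values made up). Rows that share the same eeg, consensus and vote counts collapse into one group, so fewer groups than rows means repeated voting patterns.

```python
import pandas as pd

# Toy rows (made-up values): two EEGs, several sub-ids each
toy = pd.DataFrame({
    "eeg_id":           [1, 1, 1, 1, 2, 2],
    "eeg_sub_id":       [0, 1, 2, 3, 0, 1],
    "expert_consensus": ["GPD", "GPD", "GPD", "LPD", "Other", "Other"],
    "total_vote":       [3, 3, 3, 12, 5, 5],
    "max_vote":         [3, 3, 3, 7, 4, 4],
})

# Rows that share eeg_id, consensus and both vote counts fall in one group;
# the group size is the number of "identical" sub-ids.
identical = toy.groupby(
    ["eeg_id", "expert_consensus", "total_vote", "max_vote"]).size()
print(identical)
print("groups:", len(identical), " rows:", len(toy))
```

In the toy data, 6 rows collapse into 3 groups. The cells below apply the same kind of grouping to `train_meta`, with a random label added as a no-similarity baseline.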


In [None]:
# Look into the idea that a given patient_id--eeg_id will have sub ids
# with mostly similar expert consensus and vote numbers.

if True:   # results are in comments below
    
    # To show what happens if there is no similarity,
    # include randomly assigned HBA to all rows -- using 2 or 3 choices.
    train_meta["rand_id"] = np.random.choice(
               3, size=len(train_meta), replace=True, p=None)

    grouped = train_meta[["eeg_id","eeg_sub_id","patient_id","spectrogram_id",
            "expert_consensus","rand_id",
            "total_vote","max_vote"]].groupby(
                ["patient_id","eeg_id","expert_consensus",
                                "total_vote","max_vote"]).count()
    # Histogram of "number of identicals" in eegs
    plt.figure(figsize=(5,3))
    plt.hist(grouped[grouped.eeg_sub_id < 50].rand_id,bins=49,log=True)
    plt.xlabel("Number of identical votes in the eeg")
    plt.ylabel("Number of eegs")
    plt.show()

In [None]:
#     Number of types of HBAs for a patient: looks like about 2 for each.
#  1950 rows, grouped by: ["patient_id"]
#  3625 rows, grouped by: ["patient_id","expert_consensus"]
#  3799 rows, grouped by: ["patient_id","rand_id"] <-- 2 choices, 0,1
#  5533 rows, grouped by: ["patient_id","rand_id"] <-- 3 choices, 0,1,2

#     Variety of HBAs and votes in patient--eeg combinations
# 17089 rows, grouped by: ["patient_id","eeg_id"]
# 18013 rows, grouped by: ["patient_id","eeg_id","expert_consensus"]
# 19783 rows, grouped by: ["patient_id", . . . _consensus","total_vote"]
# 20072 rows, grouped by: ["patient_id", . . . _consensus","total_vote","max_vote"]
# 26266 rows, grouped by: ["patient_id","eeg_id","rand_id"] <-- 2 choices, 0,1
 

# Show the dataframe. 
# Can select by the number of "identicals" = the count is in non-groupby columns
grouped[grouped.eeg_sub_id > 5]

<HR>

## Make a constant-probabilities Submission
Make a submission file with constant probabilities: all 1/6, the mean probabilities, etc. Evaluate those probabilities against the train probabilities using the KL divergence metric. Comparing those KL values with the LB values shows how different train and test may be.
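
Among constant predictions, the column means of the true probabilities minimise the average KL divergence, so they should do at least as well as a uniform 1/6 or any sharpened or flattened version of the means. The sketch below checks this on synthetic data; the Dirichlet parameters are made up and `mean_kl` is a hypothetical helper, not the notebook's `kld_score`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "vote" probabilities: 1000 rows on the 6-class simplex
true_probs = rng.dirichlet(alpha=[0.6, 0.3, 0.3, 0.3, 0.4, 0.5], size=1000)

def mean_kl(true_p, const_q):
    """Average KL(true_p || const_q) over rows; 0 * log(0) is treated as 0."""
    ratio = np.where(true_p > 0, true_p / const_q, 1.0)
    return float(np.mean(np.sum(true_p * np.log(ratio), axis=1)))

uniform = np.full(6, 1 / 6)
mean_q = true_probs.mean(axis=0)
sharpened = mean_q**2 / np.sum(mean_q**2)   # same idea as the expon option

for name, q in [("uniform", uniform), ("mean", mean_q), ("mean^2", sharpened)]:
    print(f"{name:8s} {mean_kl(true_probs, q):.4f}")
```

The "mean" row should print the smallest value of the three.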

In [None]:
# Create a 'solution' of actual probabilities from the training data
# (copy, so the assignments below do not write into a view of train_meta)
solution_train = train_meta[["eeg_id"] + HBA_votes].copy()
# In solution_train replace the _votes values with probabilities
for col_pre in HBA_names:
    solution_train[col_pre + "_vote"] = train_meta[col_pre + "_prob"]
    
##solution_train

In [None]:
# Make a constant-probabilities 'submission' dataframe
submission_train = solution_train.copy()

# Calculate the mean prob values
mean_probs = solution_train.iloc[ : , 1:].mean().values
print("Mean train probabilities:", mean_probs)

# Put values into the 'submission' dataframe
##submit_probs = 6*[1/6]
submit_probs = mean_probs
# accentuate/reduce the differences between the mean probs
##expon = 0.25
##submit_probs = mean_probs**expon / np.sum(mean_probs**expon)
#
print("Submitting these prob.s:", submit_probs)
for iprob, col_pre in enumerate(HBA_names):
    submission_train.loc[ : , col_pre + "_vote"] = submit_probs[iprob]

##submission_train

In [None]:
# Run the KL divergence metric
kld_score(solution_train, submission_train)

# A small improvement using the mean probs:
#                   train KL      LB KL
# A constant 1/6  :   1.4023     1.09 v11
# Mean probs^.25  :   1.3926     1.08 v10
# Mean probs^.50  :   1.3856     1.07 v9
# Mean probs^.75  :   1.3815
#  The mean probs : * 1.3801     1.07 v3
# Mean probs^1.25 :   1.3815
# Mean probs^1.5  :   1.3855     1.07 v6
# Mean probs^2.0  :   1.4016     1.09 v8
#                   * = minimum, very broad

In [None]:
# Assemble the submission using the desired prob.s
# Start with a dataframe with just the eeg_id column from test_meta
test_submit = test_meta[["eeg_id"]].copy()
# Add "_vote" columns with the predicted probabilities
for iprob, col_pre in enumerate(HBA_names):
    test_submit[col_pre + "_vote"] = submit_probs[iprob]

print(test_submit)

# Output the file
test_submit.to_csv("submission.csv", header=True, 
                        index=False, na_rep='', float_format='%.6f')
# Look at the file
##!more submission.csv


<HR>

## Clustering of HBA Probability Vectors
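
As a quick illustration of what k-means produces on probability vectors, here is a minimal sketch on synthetic data. The three Dirichlet groups and their parameters are made up; only `KMeans` and its `cluster_centers_` attribute are the same as used in the cells below.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the vote probabilities: three groups, each
# concentrated on one of the first three classes (made-up parameters).
alphas = ([8, 1, 1, 1, 1, 1], [1, 8, 1, 1, 1, 1], [1, 1, 8, 1, 1, 1])
X = np.vstack([rng.dirichlet(alpha, size=200) for alpha in alphas])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_
print(np.round(centers, 3))

# Each center is a mean of probability vectors, so it also sums to 1
print(centers.sum(axis=1))
```

Because each center is itself a probability vector, it can be used directly as a prediction; the admixture of the non-dominant classes is what separates it from a one-hot guess.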

In [None]:
# Fractions of HBAs in regions of the votes--entropy plane
#
#   votes<9 is 62.6% , just votes=3 is 48.6%
#   votes<9 and entropy<0.01 is 44.5%
#   votes<9 and entropy<0.70 is 59.5%
#
#   votes>9 is 37.4% , just votes=15 is 10.0%
#   votes>9 and entropy<0.01 is 3.3%  <-- very few are unanimous
#   votes>9 and entropy<0.70 is 19.4%

##print(len(train_meta[(train_meta.total_vote > 9) &
##           (train_meta.entropy < 0.70)]) / len(train_meta))

# Select all or a subset of the HBA samples to make clusters...
# Down-select to votes>9 ?
##prob_vectors = train_meta.loc[(train_meta.total_vote > 9), HBA_probs]
# or...    exclude votes=3 ?
##prob_vectors = train_meta.loc[(train_meta.total_vote != 3), HBA_probs]
# or...    Use all HBAs ?
prob_vectors = train_meta.loc[(train_meta.total_vote > -1), HBA_probs]

##prob_vectors

In [None]:
# Mean and standard deviation of each HBA's prob.s
print(prob_vectors.apply(np.mean,axis=0).values)
print(prob_vectors.apply(np.std,axis=0).values)

# For >9 votes:   The Seizures are under-represented
# [0.05205155 0.17736811 0.20306643 0.13129067 0.12840864 0.30781461]
# [0.11620024 0.28729203 0.31122879 0.22205968 0.22739241 0.30195994]
# Excluding votes=3   Some more Seizures
# [0.07298742 0.19346076 0.19402469 0.11244978 0.10907672 0.31800064]
# [0.18098827 0.31381592 0.31565785 0.21338437 0.21958587 0.32906653]
# For all HBAs:
# [0.20831852 0.13211966 0.12853332 0.13891264 0.1792938  0.21282206]
# [0.37827328 0.27772958 0.27617022 0.28005733 0.33636886 0.31519539]

In [None]:
# Use the k-means routine in sklearn:
#   KMeans(n_clusters=8, *, init='k-means++', n_init='auto', max_iter=300,
#   tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')

# Determine the appropriate number of clusters
if True:
    # Look for the "elbow" number of clusters
    # Use a fraction of the samples
    prob_array = np.array(prob_vectors[0::10])
    maxclsts = 15
    inertias = []
    silhos = []
    for iclust in range(2, maxclsts+1):
        kmeans = KMeans(n_clusters = iclust, init = 'k-means++', n_init = 10,
                    max_iter = 300, random_state=None)
        kmeans.fit(prob_array)
        inertias.append(kmeans.inertia_)
        # This metric has "The best value is 1 and the worst value is -1"
        silhos.append(silhouette_score(prob_array, kmeans.labels_))
    # Plot the metrics vs number of clusters
    plt.plot(range(2, maxclsts+1), inertias, 'b-')
    plt.plot(range(2, maxclsts+1), max(inertias)*np.array(silhos), 'g-')
    plt.title('Clustering Metrics vs Number of Clusters')
    plt.xlabel('Number of clusters')
    plt.ylabel('inertia (blue), scaled silhouette (green)')
    plt.show()

# For Votes>9: The inertia elbow (blue) is at 5, peak of silho' is also at 5.
#       Not=3: The inertia elbow (blue) is at 6, peak of silho' is also at 6.
#   Using all: The inertia elbow (blue) is at 6, peak of silho' is also at 6.

In [None]:
# Find the centers for the optimum (or other) number of clusters 
##iclust = 5  # when using votes > 9
iclust = 6+1  # when using all the vectors - adding a 7th cluster

prob_array = np.array(prob_vectors)
kmeans = KMeans(n_clusters = iclust, init = 'k-means++', n_init = 10,
                    max_iter = 300, random_state=None)
kmeans.fit(prob_array)
# Can get the cluster center coord.s/probs from:
clust_probs = kmeans.cluster_centers_
print("The centers for {} clusters:".format(iclust))
print(clust_probs)

# Using votes > 9 rows only
# The centers for 5 clusters:   * ordered by main column *
# [   none that is mostly seizure ]
# [0.03971239 0.75258142 0.03044835 0.05662892 0.00803234 0.11259658]
# [0.12050526 0.04732145 0.70046799 0.00519864 0.02985535 0.09665131]
# [0.04379057 0.11936496 0.01280059 0.53104421 0.05826677 0.2347329 ]
# [0.00887504 0.02846734 0.05263628 0.04578226 0.60112928 0.2631098 ]
# [0.02070779 0.05335455 0.03500729 0.04696622 0.06429828 0.77966588]

# Using all except votes=3
# The centers for 6 clusters:   * ordered by main column *
# [0.73180621 0.0631708  0.04338586 0.02012522 0.01133559 0.13017633]
# [0.0298963  0.79677792 0.02293553 0.04389117 0.00579372 0.10070537]
# [0.08950668 0.05180871 0.71981674 0.00475706 0.02952677 0.10458404]
# [0.03100587 0.11865955 0.01128878 0.53514721 0.05215337 0.25174522]
# [0.00691587 0.02549976 0.05270252 0.04168656 0.61507058 0.25812471]
# [0.01257443 0.04000609 0.02475847 0.03336829 0.05108967 0.83820305]

# Using All rows
# The centers for 6 clusters:   * ordered by main column * 
# [0.97144241 0.00613458 0.00485837 0.00248338 0.00133784 0.01374342]
# [0.03762207 0.78874927 0.02092333 0.04629211 0.00839672 0.0980165 ]
# [0.08720708 0.05977927 0.72124703 0.00427439 0.0329343  0.09455794]
# [0.01610773 0.06060393 0.00790338 0.69877109 0.06680986 0.149804  ]
# [0.0033314  0.00741154 0.01295589 0.02588149 0.88386741 0.06655228]
# [0.01802665 0.03991739 0.02326481 0.04377798 0.07160484 0.80340834]
#
# The centers for 7 clusters:   * ordered by main column *
# [0.97295318 0.00590648 0.00483242 0.00223349 0.00114005 0.01293438]
# [0.0359414  0.8263946  0.02026326 0.0416359  0.0039719  0.07179293]
# [0.09039322 0.05736363 0.73959869 0.0036609  0.02577728 0.08320629]
# [0.01311906 0.05057075 0.00535362 0.76954905 0.05295756 0.10844996]
# [0.00294371 0.00356274 0.00838875 0.02236721 0.92515794 0.03757965]
# [0.00794713 0.01162694 0.01349516 0.01482245 0.02306207 0.92904625]
# The new 7th cluster:
# [0.03401079 0.12059055 0.05457865 0.15692139 0.18847755 0.44542107] 

In [None]:
# Not sure what is the best way to view these...
# Try pair scatter plots with sqrt scaling
# Color code by cluster id
clust_ids = kmeans.labels_

prob_prob_scatter("LPD","GRDA", prob_vectors, clust_ids)

prob_prob_scatter("LRDA","Other", prob_vectors, clust_ids)

prob_prob_scatter("Seizure","GPD", prob_vectors, clust_ids)


## Evaluate KL metric when cluster probabilities are assigned
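
The idea is to replace each sample's true probabilities with the center of the cluster it belongs to, then score that against the truth. A minimal sketch with toy numbers (three classes, two centers, all values made up):

```python
import numpy as np

# Toy values (not from the competition data): two cluster centers and
# three samples, each row a probability vector over three classes.
centers = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.2, 0.7]])
true_probs = np.array([[1.0, 0.0, 0.0],
                       [0.6, 0.2, 0.2],
                       [0.0, 0.5, 0.5]])

# Nearest center by Euclidean distance, which is how k-means assigns points
dists = np.linalg.norm(true_probs[:, None, :] - centers[None, :, :], axis=2)
cluster_ids = dists.argmin(axis=1)

# Fancy indexing turns the ids into one predicted row per sample
predicted = centers[cluster_ids]

# Mean KL(true || predicted), treating 0 * log(0) as 0
ratio = np.where(true_probs > 0, true_probs / predicted, 1.0)
kl = float(np.mean(np.sum(true_probs * np.log(ratio), axis=1)))
print(cluster_ids, round(kl, 4))
```

This is a perfect-assignment score: it assumes the cluster of every sample is known, so it is a lower bound on what a classifier predicting cluster membership could reach with these centers.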

In [None]:
# Assign clusters to the train_meta
train_meta["clust_id"] = kmeans.predict(np.array(train_meta[HBA_probs]))
clust_probs = kmeans.cluster_centers_


# For comparison:
# Use the expert consensus classes with optimal near-'unit' probabilities:
if False:
    for irow in train_meta.index:
        train_meta.loc[irow,"clust_id"] = iHBA_of_expert[
                            train_meta.loc[irow,"expert_consensus"]]
    pyes = 0.815
    pnot = (1-pyes)/5.0
    clust_probs = np.array([[pyes-pnot,0.0,0.0,0.0,0.0,0.0],
                            [0.0,pyes-pnot,0.0,0.0,0.0,0.0],
                            [0.0,0.0,pyes-pnot,0.0,0.0,0.0],
                            [0.0,0.0,0.0,pyes-pnot,0.0,0.0],
                            [0.0,0.0,0.0,0.0,pyes-pnot,0.0],
                            [0.0,0.0,0.0,0.0,0.0,pyes-pnot]]) + pnot


# The cluster probability vectors
print(clust_probs)

# main component of each cluster
max_probs = np.argmax(clust_probs,axis=1)
clust_names = []
for iclust in range(len(clust_probs)):
    clust_names.append(HBA_names[max_probs[iclust]])
clust_names

In [None]:
# Compare cluster counts and expert counts
print(train_meta["clust_id"].value_counts())
print(train_meta["expert_consensus"].value_counts())

# These agree that the most are in Seizure and the fewest are in LPD.

In [None]:
# Start with the previous submission_train and replace the const probs
# with the appropriate cluster probs: 
# Go through the 6 probability columns and
# use cluster id to select the correct cluster prob value
for iprob in range(len(clust_probs[0])):
    this_col_probs = clust_probs[: , iprob]
    submission_train[HBA_votes[iprob]] = this_col_probs[train_meta["clust_id"]]

submission_train

In [None]:
# Evaluate the KL divergence when predictions are the cluster probs
kld_score(solution_train, submission_train)

# Using the "votes>9" 5 clusters  :  0.8852
# Using expert consensus, 90%,2%  :  0.4163
#   "                 81.5% 3.7%  :  0.3841
# Using votes-not-3  6 clusters   :  0.3705
# Using the "all rows" 6 clusters :  0.3001
# Using all rows and 6+1 clusters :  0.2780 (+/-0.0020, variation in 7th cluster)

# This suggests that a multi-class classifier using the clusters could do well. 

<HR>

<HR>