# Comments on EDA

Looked at how evaluators were doing with regard to two of the Main Term designations in the ACNS Guidelines.

Created a train.csv using only rows with high agreement among experts (0.9 or better). Tried the result on a couple of the popular shared notebooks. The result was consistently much worse than with the standard train.csv.

A discussion post talked about the need to model the 'experts' rather than model labels that match the ACNS definitions.
https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/476007#2651000


Further analysis indicates that two different distributions of labels exist, depending on the number of evaluators used.

The popular shared notebooks use only the first row of data for each eeg_id.

I believe this is an error and a misinterpretation of the overview.

This notebook shows that using all rows per eeg_id provides valuable information.

# Load Libraries

In [None]:
import os
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

import warnings

# Ignore all warnings
warnings.filterwarnings('ignore')




# Load Train Data

In [None]:
df = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
TARGETS = df.columns[-6:]
print('df shape:', df.shape )
print('Targets', list(TARGETS))
df = df.sort_values(by=['patient_id', 'eeg_id', 'eeg_sub_id'])
df

# Create Row Accuracy

In 2003, when I was trained on the Six Sigma process, a standard starting point was to evaluate the measurement system. Multiple readings were made on the same 'part'.

Multiple rows of data are available for many eeg_id values. Let's use those to see how often the evaluators agreed on the 'expert_consensus' for each row.



In [None]:
# Adding a new column 'total_evaluators' that sums up the six specified columns
df['total_evaluators'] = df[['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']].sum(axis=1)

df.sample(25)  # Display the DataFrame with the new column

In [None]:
import matplotlib.pyplot as plt

# Plotting a histogram for the 'total_evaluators' column in the 'df' DataFrame

plt.figure(figsize=(10, 6))
plt.hist(df['total_evaluators'], bins=10, color='blue', edgecolor='black')
plt.title('Histogram of Total Evaluators')
plt.xlabel('Total Evaluators')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

It would seem that the training data might be aggregated from at least two different studies with different numbers of people doing the evaluations.

Rows with more evaluators should probably carry more weight when the consensus is good.
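A histogram with 10 bins can blur exactly where the evaluator counts separate. A minimal sketch of one way to check: tabulate the exact counts and look for the largest gap between observed values. The numbers below are made-up toy values standing in for `df['total_evaluators']`, and using the largest gap as a split point is my assumption, not something the competition data guarantees.

```python
import pandas as pd

# Toy evaluator counts standing in for df['total_evaluators'] (made-up values)
total_evaluators = pd.Series([3, 3, 4, 5, 3, 6, 12, 15, 13, 18, 3, 14])

# Exact tabulation of each observed evaluator count, sorted by count
counts = total_evaluators.value_counts().sort_index()
print(counts)

# Differences between consecutive observed values; the largest one is a candidate split
observed = counts.index.to_series().reset_index(drop=True)
gaps = observed.diff()
split_pos = gaps.idxmax()
lower, upper = observed[split_pos - 1], observed[split_pos]
print(f'largest gap: between {lower} and {upper} evaluators')
```

On the real data, replacing the toy Series with `df['total_evaluators']` would show whether there is a clean gap to split on.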

In [None]:
# Largest vote count in each row, stored in 'consensus'
df['consensus'] = df[['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']].max(axis=1)

# Name of the vote column holding that largest count
# (on ties, idxmax returns the first matching column in the list)
df['consensus_column'] = df[['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']].idxmax(axis=1)

df.sample(25)  # Display a random sample with the new columns

In [None]:
# Create a new column with the fraction of evaluators who voted for the consensus label
df['row_agreement'] = df['consensus']/df['total_evaluators']
df.sample(25)

In [None]:
df.to_csv('row_agreement.csv', index = False)

In [None]:
# Plotting a histogram for the 'row_agreement' column
plt.figure(figsize=(10, 6))
plt.hist(df['row_agreement'], bins=10, color='green', edgecolor='black')
plt.title('Histogram of Row Agreement')
plt.xlabel('Row Agreement')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

This looks a bit odd. Again, it suggests that multiple studies were put together to form the train data, with some of them having poor agreement among evaluators. Many of the 1.0 ratings are for rows with a small number of evaluators.

Rows that have very low agreement on the consensus are a problem to be addressed. Not sure what is best?
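One option, sketched here under the assumption that the model can train on soft labels: instead of dropping low-agreement rows, turn the votes into per-row probabilities so the disagreement itself is kept as information, and flag the weakest rows for separate inspection. The toy rows are made up, and `LOW_AGREEMENT` is a hypothetical threshold, not a recommended value.

```python
import pandas as pd

VOTE_COLS = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']

# Toy rows standing in for train.csv (made-up values)
toy = pd.DataFrame(
    [[3, 0, 0, 0, 0, 0],
     [1, 1, 1, 0, 0, 0],
     [0, 2, 0, 0, 0, 13]],
    columns=VOTE_COLS,
)

# Soft labels: each row's votes divided by that row's total number of votes
total = toy[VOTE_COLS].sum(axis=1)
soft_labels = toy[VOTE_COLS].div(total, axis=0)

# Row agreement is the largest soft-label probability in the row
row_agreement = soft_labels.max(axis=1)

# Hypothetical threshold for flagging rows with weak consensus
LOW_AGREEMENT = 0.5
low = row_agreement < LOW_AGREEMENT
print(soft_labels.round(2).assign(row_agreement=row_agreement.round(2), low=low))
```

The flag could be used to inspect, down-weight, or exclude rows; which of those helps is an open question here.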

In [None]:
# Plotting an XY plot for 'row_agreement' vs 'total_evaluators'

plt.figure(figsize=(10, 6))
plt.scatter(df['row_agreement'], df['total_evaluators'], color='purple', edgecolor='black')
plt.title('XY Plot of Row Agreement vs Total Evaluators')
plt.xlabel('Row Agreement')
plt.ylabel('Total Evaluators')
plt.grid(True)
plt.show()



I am tempted to suggest that each of the 6 label types has a different degree of difficulty for evaluators.


In [None]:
# Plot the 'row_agreement' values grouped by which of the 6 vote columns is the consensus for each row

# 'consensus_column' is the vote column with the most votes in each row (same definition as above)
df['consensus_column'] = df[['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']].idxmax(axis=1)

# Now, let's plot 'row_agreement' for each of the 6 columns when they are the consensus
plt.figure(figsize=(12, 8))

for column in ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']:
    # Filter the DataFrame for rows where this column is the consensus
    filtered_df = df[df['consensus_column'] == column]
    # Plotting
    plt.scatter(filtered_df['row_agreement'], [column] * len(filtered_df), label=column)

plt.title('Row Agreement for Each Column as Consensus')
plt.xlabel('Row Agreement')
plt.yticks(['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote'])
plt.ylabel('Consensus Column')
plt.legend()
plt.grid(True)
plt.show()



It is a little hard to read the visual information in this plot.

Let's revise the plot to use distribution curves.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# Plotting distribution curves for each column
for column in ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']:
    # Filter the DataFrame for rows where this column is the consensus
    filtered_df = df[df['consensus_column'] == column]

    # Plotting the distribution curve with clipping the x-axis range
    sns.kdeplot(filtered_df['row_agreement'], label=column, clip=(0, 1.0))

plt.title('Distribution of Row Agreement for Each Column as Consensus')
plt.xlabel('Row Agreement')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()


OK - this plot is a little easier for me to interpret.

1.  Some rows have very poor consensus, at 20% agreement or less.   Might not want to use this data?
2.  seizure_vote seems the easiest to rate.
3.  gpd_vote would seem to be the hardest.
4.  Again, it appears that at least two different studies were aggregated to form our train data.


# eeg Agreement

Let's repeat this analysis by eeg_id.  A number of eeg_id values have multiple rows of data.

First let's see the distribution of evaluations (rows) per eeg_id.

In [None]:
# Count the number of rows for each unique 'eeg_id'
# (a Series with 'eeg_id' as the index and the counts as values)
eeg_id_counts = df['eeg_id'].value_counts()

# Map each 'eeg_id' in 'df' to its count
df['eeg_id_counts'] = df['eeg_id'].map(eeg_id_counts)




# Plotting the histogram
plt.figure(figsize=(12, 6))
plt.hist(eeg_id_counts, bins=100, color='orange', edgecolor='black')
plt.title('Histogram of Row Counts for Each Unique EEG ID')
plt.xlabel('Number of Rows per EEG ID')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()



Looks like a patient or two were hooked up to an EEG in a single session for a long time.

The current popular shared notebooks are looking at only the first row per unique eeg_id.  That's how they go from 100K to 17K rows of data.

I think using only the first row is leaving information on the table, but I am not sure I want to include 700 rows for a single eeg_id - hmmm, what to do?
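Two possible middle grounds, sketched on made-up toy data: pool the votes from all rows of an eeg_id into one soft label, or cap the number of rows taken per eeg_id. `MAX_ROWS` is a hypothetical cap. Pooling assumes the sub-segments of one recording can reasonably share a label, which may not hold for long recordings.

```python
import pandas as pd

VOTE_COLS = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']

# Toy rows standing in for train.csv (made-up values)
toy = pd.DataFrame({
    'eeg_id':       [1, 1, 1, 2, 2, 3],
    'seizure_vote': [3, 3, 2, 0, 0, 0],
    'lpd_vote':     [0, 0, 1, 0, 0, 5],
    'gpd_vote':     [0, 0, 0, 0, 0, 0],
    'lrda_vote':    [0, 0, 0, 0, 0, 0],
    'grda_vote':    [0, 0, 0, 1, 2, 0],
    'other_vote':   [0, 0, 0, 2, 1, 10],
})

# What the shared notebooks do: keep only the first row of each eeg_id
first_only = toy.groupby('eeg_id')[VOTE_COLS].first()

# Option 1: pool the votes of all rows per eeg_id, then normalise to probabilities
pooled = toy.groupby('eeg_id')[VOTE_COLS].sum()
pooled_probs = pooled.div(pooled.sum(axis=1), axis=0)

# Option 2: keep at most MAX_ROWS rows per eeg_id (hypothetical cap)
MAX_ROWS = 2
capped = toy.groupby('eeg_id').head(MAX_ROWS)

print(first_only, pooled_probs.round(2), capped, sep='\n\n')
```

In the toy data, eeg_id 2 looks like 'other' from its first row alone but is an even split between 'grda' and 'other' once both rows are pooled.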

In [None]:
df.head(25)  # Display the first few rows of the DataFrame to show the new column


In [None]:
# Sum of the winning vote counts over all rows of each eeg_id
row_agreement_agg = df.groupby('eeg_id')['consensus'].agg('sum')

# Mapping this aggregated value back to each row in 'df'
df['row_consensus_agg'] = df['eeg_id'].map(row_agreement_agg)

# Sum of the evaluator counts over all rows of each eeg_id
row_evaluators_agg = df.groupby('eeg_id')['total_evaluators'].agg('sum')

# Mapping this aggregated value back to each row in 'df'
df['row_evaluators_agg'] = df['eeg_id'].map(row_evaluators_agg)

# eeg-level agreement: share of all votes in the eeg_id that went to each row's winning label
df['eeg_agreement'] = df['row_consensus_agg']/df['row_evaluators_agg']

df.head(25)  # Display the first few rows of the DataFrame to show the new column



In [None]:
df.to_csv('eeg_agreement.csv', index = False)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# Plotting distribution curves for each column
for column in ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']:
    # Filter the DataFrame for rows where this column is the consensus
    filtered_df = df[df['consensus_column'] == column]

    # Plotting the distribution curve with clipping the x-axis range
    sns.kdeplot(filtered_df['eeg_agreement'], label=column, clip=(0, 1.0))

plt.title('Distribution of EEG_ID Agreement for Each Column as Consensus')
plt.xlabel('EEG Agreement')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()



Row and eeg_id appear to have similar plots.

Expanding on this theme - let's look at patient_id.

In [None]:
# Count the number of rows for each unique 'patient_id'
# (a Series with 'patient_id' as the index and the counts as values)
patient_id_counts = df['patient_id'].value_counts()

# Map each 'patient_id' in 'df' to its count
df['patient_id_counts'] = df['patient_id'].map(patient_id_counts)




# Plotting the histogram
plt.figure(figsize=(12, 6))
plt.hist(patient_id_counts, bins=100, color='orange', edgecolor='black')
plt.title('Histogram of Row Counts for Each Unique Patient ID')
plt.xlabel('Number of Rows per patient ID')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [None]:


row_agreement_agg = df.groupby('patient_id')['consensus'].agg('sum')

# Mapping this aggregated value back to each row in 'df'
df['patient_consensus_agg'] = df['patient_id'].map(row_agreement_agg)


row_evaluators_agg = df.groupby('patient_id')['total_evaluators'].agg('sum')

# Mapping this aggregated value back to each row in 'df'
df['patient_evaluators_agg'] = df['patient_id'].map(row_evaluators_agg)

df['patient_agreement'] = df['patient_consensus_agg']/df['patient_evaluators_agg']

df.sample(25)

In [None]:
plt.figure(figsize=(12, 8))

# Plotting distribution curves for each column
for column in ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']:
    # Filter the DataFrame for rows where this column is the consensus
    filtered_df = df[df['consensus_column'] == column]

    # Plotting the distribution curve with clipping the x-axis range
    sns.kdeplot(filtered_df['patient_agreement'], label=column, clip=(0, 1.0))

plt.title('Distribution of Patient Agreement for Each Column as Consensus')
plt.xlabel('Patient Agreement')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()


There are a couple of ways to look at this plot.

seizure_vote - patients with this condition have consistent EEGs
OR
it is easy for evaluators to spot and agree on.

gpd_vote - patients with this condition don't show the issue consistently over time - it comes and goes
OR
it is very hard for evaluators to spot - the plot suggests we have no eeg of this type where all the evaluators agreed.


One conclusion - the shared notebooks that use only the 'first' row are missing valuable information.

As an initial use of this information, I think I will use the patient agreement values as weights in my fork of Chris's catboost model.
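A minimal sketch of how the weights could be prepared, on made-up toy values. Rescaling the weights to a mean of 1 is my assumption, made so the overall loss scale stays comparable to an unweighted run. The commented `fit` call is illustrative only: CatBoost's `fit` accepts a `sample_weight` argument, but the model, `X`, and `y` are not defined here.

```python
import pandas as pd

# Toy rows standing in for the notebook's df (made-up values)
toy = pd.DataFrame({
    'patient_id':        [10, 10, 11, 12],
    'patient_agreement': [0.9, 0.9, 0.5, 1.0],
})

# Rescale so the average weight is 1; rows from high-agreement patients count more
weights = toy['patient_agreement'] / toy['patient_agreement'].mean()
print(weights.round(3).tolist())

# Illustrative only, assuming an estimator whose fit() accepts sample_weight:
# model.fit(X, y, sample_weight=weights)
```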




In [None]:
df

In [None]:
# use this file if you wish to include any of these agreement values in your model.
df.to_csv('train_upgraded.csv', index = False)

In [None]:
sns.boxplot(x='expert_consensus', y='patient_agreement', data = df)
plt.show()

Another way to look at agreement.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(df['patient_agreement'], bins=100, color='green', edgecolor='black')
plt.title('Histogram of Patient Agreement')
plt.xlabel('Patient Agreement')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

The overview describes EEGs where the experts agree as "idealized".   If there are only 3 experts, but they all agree, what level of confidence can we have that the eeg is 'ideal'?

If there is only a single eeg for a patient, but agreement among a large number of experts, what confidence can we have that it is 'ideal'?
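One hedged way to put a number on this question is the Wilson score lower bound for a proportion. It treats each evaluator as an independent vote, which is my simplifying assumption and probably too generous for experts trained on the same guidelines. The function name `wilson_lower_bound` is mine.

```python
import math

def wilson_lower_bound(agree: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for agree/n (z=1.96 is about 95%)."""
    if n == 0:
        return 0.0
    p = agree / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# Perfect agreement with a small panel versus a large panel
for n in (3, 15):
    print(n, 'evaluators, all agree ->', round(wilson_lower_bound(n, n), 3))
```

With 3 out of 3 agreeing, the lower bound is roughly 0.44. With 15 out of 15 it is roughly 0.80. Under this assumption, a unanimous row from a small panel is much weaker evidence of an 'ideal' eeg than the raw 1.0 suggests.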

From the previous plot we can see that many of the perfect agreement eeg's are 'other' or 'seizure'.  


# Evaluator group size impact

A couple of the plots above have suggested that our data might be the result of two studies that were combined for this competition.

The number of evaluators used seems to separate the two studies.

In [None]:
# Create 'large' DataFrame with rows where 'total_evaluators' is greater than 9
large = df[df['total_evaluators'] > 9]

# Create 'small' DataFrame with rows where 'total_evaluators' is less than 10.  actual values are 3 to 6 
small = df[df['total_evaluators'] < 10]


In [None]:
large["expert_consensus"].value_counts().plot(kind='bar');

In [None]:
small["expert_consensus"].value_counts().plot(kind='bar');

Did not want to see this result.

When the number of evaluators is 6 or less, Seizure is the major label, but in the grouping where greater than 9 evaluators were used, this label is the least seen.

Note that all rows are used. A couple of patients have very long eeg records with many evaluations, which might change these plots depending on the use of 'first', 'last', etc.
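A sketch of how that sensitivity could be checked: build the label shares per evaluator group twice, once from all rows and once from the first row per eeg_id, and compare the two tables. The toy data is made up. The split at more than 9 evaluators follows the cell above.

```python
import pandas as pd

# Toy rows standing in for the notebook's df (made-up values)
toy = pd.DataFrame({
    'eeg_id':           [1, 1, 1, 2, 3, 3, 4],
    'total_evaluators': [3, 3, 3, 4, 15, 15, 12],
    'expert_consensus': ['Seizure', 'Seizure', 'Seizure', 'GPD', 'Other', 'Other', 'GRDA'],
})

# Same split as above: more than 9 evaluators is 'large', otherwise 'small'
toy['group'] = toy['total_evaluators'].gt(9).map({True: 'large', False: 'small'})

# Label shares within each group, using every row
all_rows = pd.crosstab(toy['group'], toy['expert_consensus'], normalize='index')

# Label shares within each group, using only the first row of each eeg_id
first_rows = toy.groupby('eeg_id').head(1)
first_only = pd.crosstab(first_rows['group'], first_rows['expert_consensus'], normalize='index')

print(all_rows.round(2), first_only.round(2), sep='\n\n')
```

In the toy data, one long recording inflates the 'Seizure' share of the small group from 0.5 to 0.75. The same comparison on the real df would show how much the long records drive the bar charts.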

There is no way to know how many evaluators were used for the test data.

The small vs large group of evaluators presents a real problem.

With a large group of evaluators the predominant category becomes 'other'.  For many years I tracked defect types for a number of different glass products at a number of different production locations.
Whenever 'other' or 'misc' or 'unknown' was the leading type, it always suggested a badly trained group of production inspectors.

My guess - the large-group data was generated at an ACNS seminar or training session.   Of course, it is also possible that the large group was highly trained and the 5 available categories were too simplistic for experts.

A key question - was the distribution of EEGs similar for both groups (so we are seeing measurement error), or was the distribution of EEGs vastly different, so that we are just seeing the results of completely different studies?

Not sure how to address this question.
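One partial check, sketched on made-up toy data: see whether any eeg_id or patient_id was rated by both the small and the large group. If such overlap exists in the real data, comparing the labels on the shared recordings would help separate evaluator effects from differences in the EEGs themselves. If there is no overlap, this check cannot answer the question.

```python
import pandas as pd

# Toy rows standing in for the notebook's df (made-up values)
toy = pd.DataFrame({
    'patient_id':       [1, 1, 2, 3, 3, 4],
    'eeg_id':           [10, 11, 20, 30, 30, 40],
    'total_evaluators': [3, 14, 3, 5, 16, 12],
})

# Same split as above: more than 9 evaluators is 'large', otherwise 'small'
toy['group'] = toy['total_evaluators'].gt(9).map({True: 'large', False: 'small'})

# Recordings that have rows in both groups
groups_per_eeg = toy.groupby('eeg_id')['group'].nunique()
shared_eegs = groups_per_eeg[groups_per_eeg > 1].index.tolist()

# Patients that have rows in both groups
groups_per_patient = toy.groupby('patient_id')['group'].nunique()
shared_patients = groups_per_patient[groups_per_patient > 1].index.tolist()

print('eeg_ids rated by both groups:', shared_eegs)
print('patient_ids rated by both groups:', shared_patients)
```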

The ACNS web site has a section where folks who want to be blessed can take a test on 25 EEGs.  I wonder if the large-group evaluations include results from some of that testing effort.

# Idealized Only

In [None]:
# Creating a new DataFrame with the condition for high agreement only
# 
ideal_df = df[df['patient_agreement'] > 0.90]
ideal_df.to_csv('ideal_train.csv', index = False)
ideal_df

Run this file rather than train.csv.  My experience - it performs much worse.

My conclusion - we are not trying to create a model that conforms well to the ACNS guidelines (see link to the document in the overview), but rather we are attempting to model the performance of the evaluators.

In [None]:
plt.figure(figsize=(12, 8))

# Plotting distribution curves for each column
for column in ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']:
    # Filter the DataFrame for rows where this column is the consensus
    filtered_df = ideal_df[ideal_df['consensus_column'] == column]

    # Plotting the distribution curve with clipping the x-axis range
    sns.kdeplot(filtered_df['patient_agreement'], label=column, clip=(0, 1.0))

plt.title('Distribution of Patient Agreement for Idealized EEG')
plt.xlabel('Patient Agreement')
plt.ylabel('Density')
plt.legend()
plt.grid(True)
plt.show()


# Generalized or Lateral

The EEGs for 4 of the labels are evaluated based on being Generalized or Lateral.
    Generalized - gpd and grda
    Lateral - lpd and lrda
    
Let's create some category labels and investigate whether the evaluators are in agreement on the "Main Term" per the ACNS Guidelines.

In [None]:
df['general_type'] = df['expert_consensus'].astype(str).str[0]
unique_values = df['general_type'].unique()
unique_values

In [None]:
sns.boxplot(x='general_type', y='patient_agreement', data = df)
plt.show()

As shown earlier in various formats, seizure has best agreement among evaluators.  Also seizure seems to have the clearest outliers.  Tempted to drop those - add it to my todo list and hope to find time.

Interested in only looking at the L and G rows of data.

In [None]:
check_general_lateral = df[df['general_type'].isin(['G', "L"])]
check_general_lateral

In [None]:
sns.boxplot(x='general_type', y='patient_agreement', data = check_general_lateral)
plt.show()

I keep getting surprised by some of these results.  I expected to see that lateral would have the best agreement.  This would be a basic look at right vs left EEGs, deciding whether the waveforms for the label were distributed on both sides or on a single side.

I was assuming I could make a feature that used math to determine whether right and left had similar distributions for the 10 seconds being evaluated, to separate the two generalized labels from the two lateral ones.  That would seem to be a waste of time.  Several discussion posts have suggested that evaluators are not looking at only the 10 seconds, but are likely being led to conclusions based on the full 50 seconds or the full 10 minutes.



In [None]:
# Create 'large' DataFrame with rows where 'total_evaluators' is greater than 9
large = check_general_lateral[check_general_lateral['total_evaluators'] > 9]

# Create 'small' DataFrame with rows where 'total_evaluators' is less than 10.  actual values are 3 to 6 
small = check_general_lateral[check_general_lateral['total_evaluators'] < 10]

In [None]:
large

In [None]:
small

In [None]:
large["general_type"].value_counts().plot(kind='bar');

In [None]:
small["general_type"].value_counts().plot(kind='bar');

While the two apparent groups of evaluators were shown earlier to have different distributions of labels, they don't seem to have any major difference in the separation of Generalized from Lateral.