# Features+Head Starter for HMS Brain Comp
This is the Features+Head Starter notebook for Kaggle's HMS brain competition. The model uses Kaggle's spectrograms and a modified version of Chris's EEG spectrograms. This Features+Head Starter achieves CV 0.58 and LB 0.42!

Features+Head Starter uses Chris Deotte's Kaggle dataset [here][1]. This dataset is a single file which contains all of Kaggle's 11,138 spectrogram parquets. Reading this single file is much faster than reading 11k separate files. Don't forget to upvote his Kaggle [dataset][1] too! 

It also uses a modified version of Chris's EEG spectrograms, available [here][3].

### Train and Infer Tips

This notebook can be used both to train and to submit (infer) to the Kaggle LB. When training, set `submission = False`. You can also set `TEST_MODE = True` to quickly load 500 samples instead of the whole dataset for testing.

To submit after training, save the trained models in the `LOAD_MODELS_FROM` dataset, then run this notebook with `submission = True`.

This notebook is made as generic as possible to expand and try different experiments.

What you could do:
- Resize images with `IMG_SIZE`
- Change EfficientNetB(0-7) with `LOAD_BACKBONE_FROM`
- Data augmentation by setting DataGenerator's parameter to `augment = True`

Many other experiments could be done by modifying code, such as:
- Input augmentation: the data generator outputs (400x600x3) but fills only a single, randomly chosen channel and leaves the other two empty, so you could make use of those slots.
- Custom loss functions.
- Learning Rate scheduler.
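
As a rough sketch of the first idea above, the snippet below copies one 400x600 image into all three channels instead of a single random one. It uses NumPy only and a random toy image; the helper name `fill_all_channels` is hypothetical and is not part of this notebook's `DataGenerator`.

```python
import numpy as np

def fill_all_channels(img_2d):
    """Stack one 2-D image into all three channels (hypothetical helper)."""
    return np.repeat(img_2d[:, :, np.newaxis], 3, axis=2).astype('float32')

# Toy 400x600 image standing in for the concatenated spectrograms.
rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(400, 600)).astype('float32')

X = fill_all_channels(img)
print(X.shape)  # (400, 600, 3)
```

Other choices are possible, for example putting differently preprocessed versions of the image in each channel; whether any of them helps the score would need to be checked by experiment.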

This notebook is a direct descendant of Chris's notebook [here][2].

[1]: https://www.kaggle.com/datasets/cdeotte/brain-spectrograms
[2]: https://www.kaggle.com/code/cdeotte/efficientnetb2-starter-lb-0-57
[3]: https://www.kaggle.com/datasets/nartaa/eeg-spectrograms

In [None]:
import os
import tensorflow as tf
import tensorflow
import tensorflow.keras.backend as K
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import load_model

IMG_SIZE = (400,600)
VER = 14
LOAD_BACKBONE_FROM = '/kaggle/input/efficientnetb-tf-keras/EfficientNetB2.h5'
LOAD_MODELS_FROM = '/kaggle/input/features-head-starter-models/'
TEST_MODE = False
submission = True
np.random.seed(42)

# USE A SINGLE GPU OR MULTIPLE GPUS
gpus = tf.config.list_physical_devices('GPU')
# WE USE MIXED PRECISION
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": True})
if len(gpus)>1:
    strategy = tf.distribute.MirroredStrategy()
    print(f'Using {len(gpus)} GPUs')
else:
    strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    print(f'Using {len(gpus)} GPU')

# Load and create Non-Overlapping Eeg Id Train Data
The competition data description says that test data does not have multiple crops from the same `eeg_id`. Therefore we will train and validate using only 1 crop per `eeg_id`. There is a discussion about this [here][1].

EEGs with many NaNs are removed. They can be kept by commenting out that code.

[1]: https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/467021
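
The aggregation described above (one row per `eeg_id`, votes summed and then normalized to sum to 1) can be sketched on a toy frame. The values and the reduced two-column vote list `VOTES` are made up for illustration; the real cell below uses all six vote columns plus the metadata columns.

```python
import pandas as pd

# Toy frame: two crops of eeg_id 1, one crop of eeg_id 2 (made-up values).
df = pd.DataFrame({
    'eeg_id':       [1, 1, 2],
    'seizure_vote': [3, 1, 0],
    'other_vote':   [0, 4, 5],
})
VOTES = ['seizure_vote', 'other_vote']

# Sum the votes over all crops of each eeg_id ...
agg = df.groupby('eeg_id')[VOTES].sum().reset_index()
# ... then divide each row by its total so the targets sum to 1.
agg[VOTES] = agg[VOTES] / agg[VOTES].values.sum(axis=1, keepdims=True)
print(agg)
```

Here `eeg_id` 1 ends up with targets (0.5, 0.5) and `eeg_id` 2 with (0.0, 1.0).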

In [None]:
TARGETS = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']
if not submission:
    train = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
    TARGETS = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']
    META = ['spectrogram_id','spectrogram_label_offset_seconds','patient_id','expert_consensus']
    train = train.groupby('eeg_id')[META+TARGETS
                           ].agg({**{m:'first' for m in META},**{t:'sum' for t in TARGETS}}).reset_index()
    train[TARGETS] = train[TARGETS]/train[TARGETS].values.sum(axis=1,keepdims=True)
    train.columns = ['eeg_id','spec_id','offset','patient_id','target'] + TARGETS
    train.head(1)

    # REMOVE EEGs WITH MORE THAN 150 NANs
    eeg_nans = np.load('/kaggle/input/eeg-spectrograms/eeg_nans.npy', allow_pickle=True).item()
    df = pd.DataFrame.from_dict(eeg_nans, orient='index',columns=['nans']).reset_index(names='eeg_id')
    df  = df[df['nans'] > 150]
    df['eeg_id'] = df['eeg_id'].map(lambda x: x.split('.')[0]).astype('int')
    train = train[~train['eeg_id'].isin(df['eeg_id'])]
    train.head(1)

# Read Train Spectrograms and EEGs

We can read 1 file from Chris's [Kaggle dataset here][1], which contains all the 11k spectrograms, in less than 1 minute! Don't forget to upvote this helpful [dataset][1].

[1]: https://www.kaggle.com/datasets/cdeotte/brain-spectrograms
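
The single-file pattern works because a Python dictionary can be saved in one `.npy` file and read back with `allow_pickle=True` followed by `.item()`. The sketch below shows the round trip with a made-up dictionary and a temporary file; the file name `specs_demo.npy` is only an example.

```python
import os
import tempfile
import numpy as np

# Toy dictionary standing in for {spectrogram_id: array}; values are made up.
specs = {101: np.zeros((4, 3), dtype='float32'),
         202: np.ones((2, 3), dtype='float32')}

path = os.path.join(tempfile.mkdtemp(), 'specs_demo.npy')
np.save(path, specs)  # the dict is pickled inside a 0-d object array

# allow_pickle=True is required to read it back; .item() unwraps the dict.
loaded = np.load(path, allow_pickle=True).item()
print(sorted(loaded.keys()))
```

Since pickled files can execute code on load, this pattern should only be used with files from a source you trust.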

In [None]:
%%time
if not submission:
    # FOR TESTING SET TEST_MODE TO TRUE
    if TEST_MODE:
        train = train.sample(500,random_state=42).reset_index(drop=True)
        spectrograms = {}
        for i,e in enumerate(train.spec_id.values):
            if i%100==0: print(i,', ',end='')
            x = pd.read_parquet(f'/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/{e}.parquet')
            spectrograms[e] = x.values
        all_eegs = {}
        for i,e in enumerate(train.eeg_id.values):
            if i%100==0: print(i,', ',end='')
            x = np.load(f'/kaggle/input/eeg-spectrograms/EEG_Spectrograms/{e}.npy')
            all_eegs[e] = x
    else:
        spectrograms = np.load('/kaggle/input/brain-spectrograms/specs.npy',allow_pickle=True).item()
        all_eegs = np.load('/kaggle/input/eeg-spectrograms/eeg_specs.npy',allow_pickle=True).item()

# DATA GENERATOR
This data generator outputs 400x600x3. The spectrogram and EEG images are concatenated into a single image, which is then copied to one of the 3 channels. To use data augmentation, set `augment = True` when creating the train data generator. You can also resize by setting `img_size`.
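
Before the images are placed in the output array, each Kaggle spectrogram block is clipped, log-transformed, and rescaled per image. A minimal NumPy sketch of that step on a tiny made-up array follows; the function name `log_minmax` is hypothetical, while the clip bounds and epsilon mirror the generator code below.

```python
import numpy as np

def log_minmax(img, eps=1e-5):
    """Clip, log-transform, and rescale one image to roughly [0, 255]."""
    img = np.clip(img, np.exp(-4), np.exp(8))  # keep log values in [-4, 8]
    img = np.log(img)
    img = np.nan_to_num(img, nan=0.0)          # NaNs become 0 after the log
    mn, mx = img.min(), img.max()
    return 255 * (img - mn) / (mx - mn + eps)  # eps avoids division by zero

# Tiny made-up input with a very small value, a NaN, and a very large value.
raw = np.array([[1e-3, 1.0], [np.nan, 1e5]])
out = log_minmax(raw)
print(out)
```

The smallest entry maps to 0 and the largest to just under 255, because of the epsilon in the denominator.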

In [None]:
import albumentations as albu

class DataGenerator():
    'Generates data for Keras'
    def __init__(self, data, specs, eeg_specs, augment=False, mode='train',img_size=IMG_SIZE): 
        self.data = data
        self.augment = augment
        self.mode = mode
        self.img_size = img_size
        self.specs = specs
        self.eeg_specs = eeg_specs
        self.on_epoch_end()
        
    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        X, y = self.data_generation(index)
        if self.img_size != (400,600): X = self.resize(X)
        if self.augment: X = self.augmentation(X)
        return X, y
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
                
    def on_epoch_end(self):
        if self.mode=='train': 
            self.data = self.data.sample(frac=1).reset_index(drop=True)
                        
    def data_generation(self, index):
        X = np.zeros((400,600,3),dtype='float32')
        y = np.zeros((6,),dtype='float32')
        
        row = self.data.iloc[index]
        if self.mode=='test': 
            offset = 0
        else:
            offset = int(row.offset/2)
            
        cnl = np.random.choice([0,1,2])
        spec = self.specs[row.spec_id]
        eeg = self.eeg_specs[row.eeg_id]
        for k in range(4):
            # EXTRACT 300 ROWS OF SPECTROGRAM
            img = spec[offset:offset+300,k*100:(k+1)*100].T
            
            # LOG TRANSFORM SPECTROGRAM
            img = np.clip(img,np.exp(-4),np.exp(8))
            img = np.log(img)
            
            # STANDARDIZE PER IMAGE
            img = np.nan_to_num(img, nan=0.0)                

            mn = img.flatten().min()
            mx = img.flatten().max()
            ep = 1e-5
            img = 255 * (img - mn) / (mx - mn + ep)
            X[k*100:(k+1)*100,:300,cnl] = img
            
            # EEG SPECTROGRAMS, ADD FROM 300 WIDTH AND MAKE A 3 CHANNEL IMAGE
            # STANDARDIZE PER IMAGE
            img = eeg[:,:,k]
            mn = img.flatten().min()
            mx = img.flatten().max()
            ep = 1e-5
            img = 255 * (img - mn) / (mx - mn + ep)
            X[k*100:(k+1)*100,300:,cnl] = img


        if self.mode!='test':
            y[:] = row[TARGETS]
            
        return X,y
    
    def resize(self, img):
        # RESIZE TO THE GENERATOR'S OWN img_size (NOT THE GLOBAL IMG_SIZE)
        composition = albu.Compose([
                albu.Resize(self.img_size[0],self.img_size[1])
            ])
        return composition(image=img)['image']
            
    def augmentation(self, img):
        composition = albu.Compose([
                albu.HorizontalFlip(p=0.4),
                albu.VerticalFlip(p=0.4),
                albu.RandomRotate90(p=0.4),
                albu.Rotate(p=0.5,limit=10),
            ])
        return composition(image=img)['image']

# DISPLAY DATA GENERATOR
Below we display example data generator spectrogram images.

In [None]:
if not submission:
    gen = DataGenerator(train, augment=False, specs=spectrograms, eeg_specs=all_eegs)
    for x,y in gen:
        break
    # THE IMAGE SITS IN ONE RANDOM CHANNEL AND THE OTHERS ARE ZERO, SO TAKE THE MAX OVER CHANNELS
    plt.imshow(x.max(axis=2))
    plt.title(f'Target = {y.round(1)}',size=12)
    plt.yticks([])
    plt.ylabel('Frequencies (Hz)',size=12)
    plt.xlabel('Time (sec)',size=12)
    plt.show()

# TRAINING

## LEARNING RATE
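
With the settings used in this notebook (no ramp-up, no sustain, decay by 0.1 every 2 epochs, 4 epochs), the schedule reduces to a plain step decay. The pure-Python sketch below reproduces only that branch; `step_lr` is a hypothetical name, not the `lrfn` function defined in the next cell.

```python
LR_MAX = 1e-3
LR_STEP_DECAY = 0.1
EVERY = 2
EPOCHS = 4

def step_lr(epoch):
    """Step decay only: the ramp-up and sustain phases are zero epochs long."""
    return LR_MAX * LR_STEP_DECAY ** (epoch // EVERY)

schedule = [step_lr(e) for e in range(EPOCHS)]
print(schedule)
```

The first two epochs run at about 1e-3 and the last two at about 1e-4.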

In [None]:

if not submission:
    LR_START = 1e-4
    LR_MAX = 1e-3
    LR_RAMPUP_EPOCHS = 0
    LR_SUSTAIN_EPOCHS = 0
    LR_STEP_DECAY = 0.1
    EVERY = 2
    EPOCHS = 4

    def lrfn(epoch):
        if epoch < LR_RAMPUP_EPOCHS:
            lr = (LR_MAX - LR_START) / LR_RAMPUP_EPOCHS * epoch + LR_START
        elif epoch < LR_RAMPUP_EPOCHS + LR_SUSTAIN_EPOCHS:
            lr = LR_MAX
        else:
            lr = LR_MAX * LR_STEP_DECAY**((epoch - LR_RAMPUP_EPOCHS - LR_SUSTAIN_EPOCHS)//EVERY)
        return lr

    rng = [i for i in range(EPOCHS)]
    y = [lrfn(x) for x in rng]
    plt.figure(figsize=(6, 2))
    plt.plot(rng, y, 'o-'); 
    plt.xlabel('epoch',size=14); plt.ylabel('learning rate',size=14)
    plt.title('Step Training Schedule',size=16); plt.show()

    LR = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose = True)
    Constant_LR = tf.keras.callbacks.LearningRateScheduler(lambda x: 0.001, verbose = True)

## MODEL AND UTILITY FUNCTIONS
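
The loss used in this notebook adds KL divergence and categorical cross-entropy. For soft targets, cross-entropy equals KL divergence plus the entropy of the targets, so the sum differs from twice the KL divergence only by a term that does not depend on the predictions. The NumPy sketch below checks that identity on made-up vectors; the helper names and the `EPS` clipping constant are assumptions of this sketch and do not reproduce Keras's internal clipping exactly.

```python
import numpy as np

EPS = 1e-7  # small constant to avoid log(0); an assumption of this sketch

def kl_div(y_true, y_pred):
    y_true = np.clip(y_true, EPS, 1.0)
    y_pred = np.clip(y_pred, EPS, 1.0)
    return np.sum(y_true * np.log(y_true / y_pred), axis=-1)

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(np.clip(y_pred, EPS, 1.0)), axis=-1)

def entropy(y_true):
    return -np.sum(y_true * np.log(np.clip(y_true, EPS, 1.0)), axis=-1)

# Made-up soft target and prediction over the 6 classes.
y_true = np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])
y_pred = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])

kl = kl_div(y_true, y_pred)
ce = cross_entropy(y_true, y_pred)
h = entropy(y_true)
print(kl, ce, kl + h)  # ce and kl + h agree up to floating-point error
```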

In [None]:
def build_model():  
    inp = tf.keras.layers.Input((IMG_SIZE[0],IMG_SIZE[1],3))
    base_model = load_model(f'{LOAD_BACKBONE_FROM}')    
    x = base_model(inp)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    output = tf.keras.layers.Dense(6,activation='softmax', dtype='float32')(x)
    model = tf.keras.Model(inputs=inp, outputs=output)
    opt = tf.keras.optimizers.Adam(learning_rate = 1e-3)
    kl = tf.keras.metrics.KLDivergence(name='kl')
    model.compile(loss=loss_fn, optimizer=opt, metrics=[kl])  
    return model

def loss_fn(y_true, y_pred):
    kl = tf.keras.losses.KLDivergence(reduction=tf.keras.losses.Reduction.NONE)
    cce = tf.keras.losses.CategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
    return tf.nn.compute_average_loss(kl(y_true, y_pred)) + tf.nn.compute_average_loss(cce(y_true, y_pred))

def score(y_true, y_pred):
    kl = tf.keras.metrics.KLDivergence()
    return kl(y_true, y_pred)

def plot_hist(hist):
    metrics = ['loss','kl']
    for i,metric in enumerate(metrics):
        plt.figure(figsize=(10,4))
        plt.subplot(1,2,i+1)
        plt.plot(hist[metric])
        plt.plot(hist[f'val_{metric}'])
        plt.title(f'{metric}',size=12)
        plt.ylabel(f'{metric}',size=12)
        plt.xlabel('epoch',size=12)
        plt.legend(["train", "validation"], loc="upper left")
        plt.show()

## TRANSFER LEARNING

In [None]:
from sklearn.model_selection import KFold, GroupKFold
import tensorflow.keras.backend as K, gc

if not submission:
    all_oof = []
    all_true = []
    losses = []
    val_losses = []
    kls = []
    val_kls = []
    total_hist = {}

    gkf = GroupKFold(n_splits=5)
    for i, (train_index, valid_index) in enumerate(gkf.split(train, train.target, train.patient_id)):   
        
        print('#'*25)
        print(f'### Fold {i+1}')
        
        train_gen = DataGenerator(train.iloc[train_index], augment=False, specs=spectrograms, eeg_specs=all_eegs)
        valid_gen = DataGenerator(train.iloc[valid_index], mode='valid', specs=spectrograms, eeg_specs=all_eegs)
        EPOCHS = 4
        BATCH_SIZE_PER_REPLICA = 32
        BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

        train_dataset = tf.data.Dataset.from_generator(generator=train_gen, 
                                                   output_signature=(tf.TensorSpec(shape=(IMG_SIZE[0],IMG_SIZE[1],3), dtype=tf.float32),
                                                                     tf.TensorSpec(shape=(6,), dtype=tf.float32))).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
        val_dataset = tf.data.Dataset.from_generator(generator=valid_gen, 
                                                   output_signature=(tf.TensorSpec(shape=(IMG_SIZE[0],IMG_SIZE[1],3), dtype=tf.float32),
                                                                     tf.TensorSpec(shape=(6,), dtype=tf.float32))).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
        save_best = tf.keras.callbacks.ModelCheckpoint(f'model_{VER}_{i}.weights.h5', 
                                                   monitor='val_loss', save_best_only=True, save_weights_only=True)
        
        print(f'### train size {len(train_index)}, valid size {len(valid_index)}')
        print('#'*25)
        
        K.clear_session()
        with strategy.scope():
            model = build_model()
        
        hist = model.fit(train_dataset, verbose=1, validation_data = val_dataset, 
                         epochs=EPOCHS, callbacks=[save_best,LR])
        losses.append(hist.history['loss'])
        val_losses.append(hist.history['val_loss'])
        kls.append(hist.history['kl'])
        val_kls.append(hist.history['val_kl'])
        oof = model.predict(val_dataset, verbose=1)
        all_oof.append(oof)
        all_true.append(train.iloc[valid_index][TARGETS].values)    
        del model, oof
        gc.collect()
        
    total_hist['loss'] = np.mean(losses,axis=0)
    total_hist['val_loss'] = np.mean(val_losses,axis=0)
    total_hist['kl'] = np.mean(kls,axis=0)
    total_hist['val_kl'] = np.mean(val_kls,axis=0)
    all_oof = np.concatenate(all_oof)
    all_true = np.concatenate(all_true)
    plot_hist(total_hist)
    print('#'*25)
    print(f'CV KL SCORE: {score(all_true,all_oof)}')

# Infer Test and Create Submission CSV
Infer the test data and create a `submission.csv` file.
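
The final prediction is the mean of the five fold models' softmax outputs. Since each fold's rows sum to one, the averaged rows do too, which is what the sanity check at the end confirms. A minimal NumPy sketch with made-up fold predictions (the local `softmax` helper and the random inputs are only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Five made-up fold predictions for 3 test rows and 6 classes.
fold_preds = [softmax(rng.normal(size=(3, 6))) for _ in range(5)]

pred = np.mean(fold_preds, axis=0)
print(pred.shape)        # (3, 6)
print(pred.sum(axis=1))  # each entry is 1 up to floating-point error
```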

In [None]:
import pywt, librosa

USE_WAVELET = None 

NAMES = ['LL','LP','RP','RR']

FEATS = [['Fp1','F7','T3','T5','O1'],
         ['Fp1','F3','C3','P3','O1'],
         ['Fp2','F8','T4','T6','O2'],
         ['Fp2','F4','C4','P4','O2']]

# DENOISE FUNCTION
def maddest(d, axis=None):
    return np.mean(np.absolute(d - np.mean(d, axis)), axis)

def denoise(x, wavelet='haar', level=1):    
    coeff = pywt.wavedec(x, wavelet, mode="per")
    sigma = (1/0.6745) * maddest(coeff[-level])

    uthresh = sigma * np.sqrt(2*np.log(len(x)))
    coeff[1:] = (pywt.threshold(i, value=uthresh, mode='hard') for i in coeff[1:])

    ret=pywt.waverec(coeff, wavelet, mode='per')
    
    return ret

import librosa

def spectrogram_from_eeg(parquet_path, display=False):
    
    # LOAD MIDDLE 50 SECONDS OF EEG SERIES
    eeg = pd.read_parquet(parquet_path)
    middle = (len(eeg)-10_000)//2
    eeg = eeg.iloc[middle:middle+10_000]
    
    # VARIABLE TO HOLD SPECTROGRAM
    img = np.zeros((100,300,4),dtype='float32')
    
    if display: plt.figure(figsize=(10,7))
    signals = []
    for k in range(4):
        COLS = FEATS[k]
        
        for kk in range(4):
            # FILL NANS
            x1 = eeg[COLS[kk]].values
            x2 = eeg[COLS[kk+1]].values
            m = np.nanmean(x1)
            if np.isnan(x1).mean()<1: x1 = np.nan_to_num(x1,nan=m)
            else: x1[:] = 0
            m = np.nanmean(x2)
            if np.isnan(x2).mean()<1: x2 = np.nan_to_num(x2,nan=m)
            else: x2[:] = 0
                
            # COMPUTE PAIR DIFFERENCES
            x = x1 - x2

            # DENOISE
            if USE_WAVELET:
                x = denoise(x, wavelet=USE_WAVELET)
            signals.append(x)

            # RAW SPECTROGRAM
            mel_spec = librosa.feature.melspectrogram(y=x, sr=200, hop_length=len(x)//300, 
                  n_fft=1024, n_mels=100, fmin=0, fmax=20, win_length=128)
            
            # LOG TRANSFORM
            width = (mel_spec.shape[1]//30)*30
            mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max).astype(np.float32)[:,:width]
            img[:,:,k] += mel_spec_db
                
        # AVERAGE THE 4 MONTAGE DIFFERENCES
        img[:,:,k] /= 4.0
        
        if display:
            plt.subplot(2,2,k+1)
            plt.imshow(img[:,:,k],aspect='auto',origin='lower')
            
    if display: 
        plt.show()
        plt.figure(figsize=(10,5))
        offset = 0
        for k in range(4):
            if k>0: offset -= signals[3-k].min()
            plt.plot(range(10_000),signals[k]+offset,label=NAMES[3-k])
            offset += signals[3-k].max()
        plt.legend()
        plt.show()
        
    return img

In [None]:
if submission:
    test = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/test.csv')
    print('Test shape',test.shape)
    test.head()

In [None]:
# READ ALL SPECTROGRAMS
if submission:
    PATH2 = '/kaggle/input/hms-harmful-brain-activity-classification/test_spectrograms/'
    files2 = os.listdir(PATH2)
    print(f'There are {len(files2)} test spectrogram parquets')
    
    spectrograms2 = {}
    for i,f in enumerate(files2):
        if i%100==0: print(i,', ',end='')
        tmp = pd.read_parquet(f'{PATH2}{f}')
        name = int(f.split('.')[0])
        spectrograms2[name] = tmp.iloc[:,1:].values
    
    # RENAME FOR DATA GENERATOR
    test = test.rename({'spectrogram_id':'spec_id'},axis=1)

In [None]:
# READ ALL EEG SPECTROGRAMS
if submission:
    PATH2 = '/kaggle/input/hms-harmful-brain-activity-classification/test_eegs/'
    DISPLAY = 0
    EEG_IDS2 = test.eeg_id.unique()
    all_eegs2 = {}

    print('Converting Test EEG to Spectrograms...'); print()
    for i,eeg_id in enumerate(EEG_IDS2):
        
        # CREATE SPECTROGRAM FROM EEG PARQUET
        img = spectrogram_from_eeg(f'{PATH2}{eeg_id}.parquet', i<DISPLAY)
        all_eegs2[eeg_id] = img

In [None]:
# INFER EFFICIENTNET ON TEST
if submission:
    preds = []
    test_gen = DataGenerator(test, mode='test',specs = spectrograms2, eeg_specs = all_eegs2)
    test_dataset = tf.data.Dataset.from_generator(generator=test_gen, 
                                               output_signature=(tf.TensorSpec(shape=(IMG_SIZE[0],IMG_SIZE[1],3), dtype=tf.float32),
                                                                 tf.TensorSpec(shape=(6,), dtype=tf.float32))).batch(64).prefetch(tf.data.AUTOTUNE)
    model = build_model()
    
    for i in range(5):
        print(f'Fold {i+1}')
        model.load_weights(f'{LOAD_MODELS_FROM}model_{VER}_{i}.weights.h5')
        pred = model.predict(test_dataset, verbose=1)
        preds.append(pred)
    pred = np.mean(preds,axis=0)
    print('Test preds shape',pred.shape)

In [None]:
if submission:
    sub = pd.DataFrame({'eeg_id':test.eeg_id.values})
    sub[TARGETS] = pred
    sub.to_csv('submission.csv',index=False)
    print('Submission shape',sub.shape)
    print()
    print(sub.head().to_string())

In [None]:
# SANITY CHECK TO CONFIRM PREDICTIONS SUM TO ONE
if submission:
    print(sub.iloc[:,-6:].sum(axis=1).to_string())