<img src="https://i.imgur.com/HqSaJ5J.jpg">

<center><h1> Harmful Brain Activity Classification </h1></center>
<center><h1>- data understanding -</h1></center>

> 📌 **Competition Scope**: Detect and classify seizures and other types of harmful brain activity in electroencephalography (EEG) data. Even experts find this to be a challenging task and *often disagree* about the correct labels.

### About the Problem

**There are 6 patterns to be identified**:
* seizure (SZ)
* generalized periodic discharges (GPD)
* lateralized periodic discharges (LPD)
* lateralized rhythmic delta activity (LRDA)
* generalized rhythmic delta activity (GRDA)
* other

The annotations were made by a group of experts; *however*, the challenge is that not even the experts fully agree on every case. Hence, the competition defines a second set of labels:
* where there are high levels of agreement => “idealized” patterns
* where ~1-2 experts give a label as “other” and ~1-2 give one of the remaining five labels => “proto” patterns
* where experts are approximately split between 2 of the 5 named patterns => “edge” cases

<img src="https://i.imgur.com/gTV9STa.png">

> 📌 **Note**: some patterns can look like both a seizure and an LPD or GPD; others can look like both an LRDA and a GRDA, and so on.

### ○ Libraries

In [None]:
# general
import os
import gc
import wandb
import random
import math
from glob import glob
from tqdm import tqdm
from time import time
from pprint import pprint
import warnings
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype  # used in get_general_info
import numpy as np
from scipy.signal import spectrogram

# visuals
import seaborn as sns
import matplotlib as mpl
from matplotlib import cm
import matplotlib.patches as patches
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})

# env check
warnings.filterwarnings('ignore')
os.environ["WANDB_SILENT"] = "true"
COMP_ID = '2024_hms'
CONFIG = {'competition': COMP_ID, '_wandb_kernel': 'aot', "source_type": "artifact"}

# color
class clr:
    S = '\033[1m' + '\033[90m'
    E = '\033[0m'
    
my_colors = ["#FECF72", "#DB8C0F", "#E39A7F",
            "#D87AA0", "#91D5DF", "#7BAEC8",]

print(clr.S+"Notebook Color Schemes:"+clr.E)
sns.palplot(sns.color_palette(my_colors))
plt.show()

### 🐝 W&B Fork & Run

To run this notebook you will need to store your own **secret API key** as a Kaggle secret named `wandb_key`; it is then passed to the `! wandb login $wandb_key` line.

🐝**How do you get your own API key?**

Super simple! Go to **https://wandb.ai/site** -> Login -> Click on your profile in the top right corner -> Settings -> Scroll down to API keys -> copy your very own key (for more info check [this amazing notebook for ML Experiment Tracking on Kaggle](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases)).

<center><img src="https://i.imgur.com/fFccmoS.png" width=500></center>

In [None]:
# 🐝 secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
wandb_key = user_secrets.get_secret("wandb_key")

! wandb login $wandb_key

### ○ Helper Functions Below

In [None]:
# === data discover ===

def jitter(values,j):
    return values + np.random.normal(j,0.05,values.shape)


def find_rectangles(arr):
    '''
    return indices where the rectangle starts and ends
    '''
    rectangles = []
    start = None
    for i, val in enumerate(arr):
        if val == 3:
            if start is None:
                start = i
        elif start is not None:
            rectangles.append((start, i - 1))
            start = None
    if start is not None:
        rectangles.append((start, len(arr) - 1))
    return rectangles


def get_general_info(df, desc=None):
    
    # 🐝 new exp
    run = wandb.init(project=COMP_ID, name=f'{desc}_data_summary', config=CONFIG)

    print(clr.S+"--- General Info ---"+clr.E)
    print(clr.S+"Data Shape:"+clr.E, df.shape)
    print(clr.S+"Data Cols:"+clr.E, df.columns.tolist())
    print(clr.S+"Total No. of Cols:"+clr.E, len(df.columns.tolist()))
    print(clr.S+"No. Missing Values:"+clr.E, df.isna().sum().sum())
    print(clr.S+"Columns with missing data:"+clr.E, "\n",
          df.isna().sum()[df.isna().sum() != 0], "\n")

    for col in df.columns:
        if is_string_dtype(df[col]):
            print(clr.S+f"--- {col} --- is type string"+clr.E)
            print(clr.S+f"[nunique] {col}:"+clr.E, 
                  df[col].nunique())
        
        elif is_numeric_dtype(df[col]):
            print(clr.S+f"--- {col} --- is type numeric"+clr.E)
            print(clr.S+f"[describe] {col}:"+clr.E, "\n",
                  df[col].describe())
        
    # log data (the call parentheses must open on the same line as wandb.log)
    wandb.log(
        {"data_shape": len(df),
         "missing_values": int(df.isna().sum().sum())
        }
    )
    wandb.finish()
    print("🐝 Info saved to dashboard.")
            

def get_missing_values_plot(df):
    '''
    Plots missing values barchart for a given dataframe.
    '''
    
    # count missing values
    missing_counts = df.isnull().sum().reset_index()\
                            .sort_values(0, ascending=False)\
                            .reset_index(drop=True)
    missing_counts.columns = ["col_name", "missing_count"]

    # plot
    plt.figure(figsize=(24, 16))
    axs = sns.barplot(y=missing_counts.col_name, x=missing_counts.missing_count, 
                      color=my_colors[0])
    show_values_on_bars(axs, h_v="h", space=0.4)
    plt.xlabel('no. missing values', size=20, weight="bold")
    plt.ylabel('column name', size=20, weight="bold")
    plt.title('Missing Values', size=22, weight="bold")
    plt.show();
            
            
# === plots ===
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar of a seaborn barplot.
    axs: the ax of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")
    space: horizontal offset of the text, used for horizontal bars'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)
        
        
# === 🐝 w&b ===
def save_dataset_artifact(run_name, artifact_name, path, data_type="dataset"):
    '''Saves dataset to W&B Artifactory.
    run_name: name of the experiment
    artifact_name: under what name should the dataset be stored
    path: path to the dataset'''
    
    run = wandb.init(project=COMP_ID, 
                     name=run_name, 
                     config=CONFIG)
    artifact = wandb.Artifact(name=artifact_name, 
                              type=data_type)
    artifact.add_file(path)

    wandb.log_artifact(artifact)
    wandb.finish()
    print(f"🐝Artifact {artifact_name} has been saved successfully.")
    
    
def create_wandb_plot(x_data=None, y_data=None, x_name=None, y_name=None, title=None, log=None, plot="line"):
    '''Create and save lineplot/barplot in W&B Environment.
    x_data & y_data: Pandas Series containing x & y data
    x_name & y_name: strings containing axis names
    title: title of the graph
    log: string containing name of log'''
    
    data = [[label, val] for (label, val) in zip(x_data, y_data)]
    table = wandb.Table(data=data, columns = [x_name, y_name])
    
    if plot == "line":
        wandb.log({log : wandb.plot.line(table, x_name, y_name, title=title)})
    elif plot == "bar":
        wandb.log({log : wandb.plot.bar(table, x_name, y_name, title=title)})
    elif plot == "scatter":
        wandb.log({log : wandb.plot.scatter(table, x_name, y_name, title=title)})
        
        
def create_wandb_hist(x_data=None, x_name=None, title=None, log=None):
    '''Create and save histogram in W&B Environment.
    x_data: Pandas Series containing x values
    x_name: strings containing axis name
    title: title of the graph
    log: string containing name of log'''
    
    data = [[x] for x in x_data]
    table = wandb.Table(data=data, columns=[x_name])
    wandb.log({log : wandb.plot.histogram(table, x_name, title=title)})

In [None]:
# 🐝 log cover
run = wandb.init(project=COMP_ID, name='cover', config=CONFIG)
cover = plt.imread("/kaggle/input/hmd-additional-data/leonardo_ai_cover.jpg")
wandb.log({"cover": wandb.Image(cover)})
wandb.finish()

# 1. Understanding the train columns

**train.csv**:
* all `_vote` cols are our target columns
* `eeg_id` marks one recording (17,089 in total)
* `spectrogram_id` represents the "training" data available to predict the classification - there are 11,138 spectrograms in total available in the training set
* `patient_id` is the ID of the patient the data belongs to - 1,950 in total
* `expert_consensus` contains the consensus label for each of these sub-segments - seizure is the most frequent one.

In [None]:
# 🐝
run = wandb.init(project=COMP_ID, name='understanding', config=CONFIG)

In [None]:
train = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/train.csv")
train.head()

In [None]:
print("Train shape:", train.shape, "\n")
print("Unique eeg_ids: ", train.eeg_id.nunique())
print(train.groupby("eeg_id")["eeg_sub_id"].count().describe(), "\n")
print("Unique spectrogram_ids: ", train.spectrogram_id.nunique())
print("Unique patient_ids: ", train.patient_id.nunique(), "\n")

In [None]:
# overall expert consensus
# data
dt = train.expert_consensus.value_counts().reset_index()
dt.columns = ["consensus", "frequency"]

# plot
plt.figure(figsize=(20, 10))

figure = sns.barplot(data=dt,
                     x="consensus", y="frequency", palette=my_colors[1:])
show_values_on_bars(figure, h_v="v", space=0.4)
plt.title('[train] Expert Consensus - Frequency', weight="bold", size=20)

plt.xlabel("Consensus", size = 18, weight="bold")
plt.ylabel("Count", size = 18, weight="bold")
    
sns.despine(right=True, top=True, left=True);

In [None]:
# 🐝 log plot
create_wandb_plot(x_data=dt["consensus"],
                  y_data=dt["frequency"],
                  x_name="Consensus", y_name="Count",
                  title="[train] Expert Consensus Frequency",
                  log="bar_consensus", plot="bar")

The `test.csv` dataset contains only the columns `eeg_id`, `spectrogram_id` and `patient_id`. In the end, this is the format we will need to bring our `train.csv` data into as well.

We will also need to create another column that will contain the "patterns"
* idealized - high level of agreement
* proto - some say "other" and some agree on another activity
* edge - split ~equally between two activities

In [None]:
# Grouped train.csv
vote_cols = [col for col in train.columns if '_vote' in col]
print("vote cols:", vote_cols)

train_group = train.groupby(by=["eeg_id", "spectrogram_id", "patient_id"])\
                    [vote_cols].sum().reset_index()
train_group.head(7)

In [None]:
def categorize_votes(row):
    # compute max and sum
    col_names = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']
    max_vote = row[col_names].max()
    total_votes = row[col_names].sum()

    # % of votes received by the top-voted label
    percentage = max_vote / total_votes * 100

    high_agreement_threshold = 70
    equal_splitting_threshold = 40

    if percentage >= high_agreement_threshold:
        return 'idealized'
    elif row['other_vote'] / total_votes >= 0.4 and percentage >= equal_splitting_threshold:
        return 'proto'
    elif row['other_vote'] == 0 and percentage >= equal_splitting_threshold:
        return 'edge'
    else:
        return 'undecided'

# create new set of "pattern" labels
train_group['pattern'] = train_group.apply(categorize_votes, axis=1)
train_group.head(7)

In [None]:
# data
dt = train_group.pattern.value_counts().reset_index()
dt.columns = ["pattern", "frequency"]

# plot
plt.figure(figsize=(20, 10))

figure = sns.barplot(data=dt,
                     x="pattern", y="frequency", palette=my_colors[1:])
show_values_on_bars(figure, h_v="v", space=0.4)
plt.title('[train] Categorized Pattern - Frequency', weight="bold", size=20)

plt.xlabel("Pattern", size = 18, weight="bold")
plt.ylabel("Count", size = 18, weight="bold")
    
sns.despine(right=True, top=True, left=True);

In [None]:
# 🐝 log plot
create_wandb_plot(x_data=dt["pattern"],
                  y_data=dt["frequency"],
                  x_name="Pattern", y_name="Count",
                  title="[train] Categorized pattern Frequency",
                  log="bar_pattern", plot="bar")

However, why the extra information in the `eeg_sub_id`, `spectrogram_sub_id`, `spectrogram_label_offset_seconds` and `eeg_label_offset_seconds` columns?

Because the specialists can disagree with the overall assessment of an entire spectrogram, the 10-minute spectrogram data was split into multiple sub-segments, each evaluated individually.
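
To make the role of the offset columns more concrete, here is a minimal sketch of how a 10-minute window could be cut out of a longer spectrogram. Everything in it is synthetic: the 2-second row spacing, the `time` column and the `slice_window` helper are assumptions made for illustration, not something read from the competition files.

```python
import numpy as np
import pandas as pd

STEP_S = 2      # assumption: seconds between two spectrogram rows
WINDOW_S = 600  # 10-minute window

# toy spectrogram: 20 minutes long, a single frequency column
toy_spec = pd.DataFrame({
    "time": np.arange(0, 1200, STEP_S),
    "LL_0.59": np.random.default_rng(0).random(1200 // STEP_S),
})

def slice_window(spec, offset_seconds, window_s=WINDOW_S):
    """Return the rows whose time falls in [offset, offset + window_s)."""
    start, end = offset_seconds, offset_seconds + window_s
    mask = (spec["time"] >= start) & (spec["time"] < end)
    return spec.loc[mask].reset_index(drop=True)

# hypothetical sub-segment that starts 36 seconds into the spectrogram
window = slice_window(toy_spec, offset_seconds=36)
print(window.shape)
```

With a real file, the offset would come from the matching row in `train.csv`, and the row spacing should be checked against the actual `time` values before slicing.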

In [None]:
# an example of spectrogram id with one EEG recording
train[train.eeg_id==722738444]

However, not all cases look like the one above (the `nunique` number of `eeg_id` differs from the `nunique` number of `spectrogram_id`).

One spectrogram (like the case below - 1219001) can be a part of multiple EEG recordings, all being from the same `patient_id`.

In [None]:
# an example of spectrogram id with multiple EEG records
# train[train.eeg_sub_id != train.spectrogram_sub_id]
train[train.spectrogram_id==1219001]

Hence, our "consolidated" `train.csv` dataset looks something like this (from a schematic point of view):

<img src="https://i.imgur.com/vB1WY96.jpg">

# 2. Understanding the Spectrograms

The `train_spectrograms` folder contains a .parquet file for each spectrogram.

The column names indicate the *frequency in hertz* (400 cols in total) and the recording regions of the EEG electrodes:
* LL = left lateral;
* RL = right lateral;
* LP = left parasagittal; 
* RP = right parasagittal.
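
As a small illustration of the naming scheme, the sketch below groups column names by their region prefix. The column list is a hypothetical subset written by hand, and `group_by_region` is a helper invented for this example; it assumes every feature column follows a `<region>_<frequency>` pattern.

```python
from collections import defaultdict

# hypothetical column names following the "<region>_<frequency>" pattern
toy_columns = ["time", "LL_0.59", "LL_0.78", "RL_0.59",
               "LP_0.59", "RP_0.59", "RP_0.78"]

def group_by_region(columns):
    """Map each region prefix to the list of frequencies found for it."""
    groups = defaultdict(list)
    for col in columns:
        if "_" not in col:
            continue  # skip non-feature columns such as "time"
        region, freq = col.split("_", 1)
        groups[region].append(float(freq))
    return dict(groups)

regions = group_by_region(toy_columns)
print(regions)
```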

In [None]:
spectrogram_id = 789577333

# read in the data
spec_base_path = "/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/"
spec_data = pd.read_parquet(spec_base_path + str(spectrogram_id) + ".parquet")

print(spec_data.shape)
spec_data.head()

To visualize a dataset this big, I will plot the log-scaled values as an image with matplotlib's `imshow`. We call `data.T` to transpose the dataset and have the "time" variable on the x axis.

Let's look at some spectrograms from each category. I will *select only the spectrograms that have the highest votes of confidence*, meaning that there were close to no disagreements between the experts.
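
A plain `np.log` returns `-inf` for zeros and propagates missing values, which can leave holes in the image. The sketch below shows one possible way to guard against that on synthetic data; the `log_scale` helper and the `eps` value are assumptions for illustration and are not part of the plotting code in this notebook.

```python
import numpy as np

# synthetic spectrogram: rows = time steps, columns = frequency bins
rng = np.random.default_rng(0)
toy = rng.random((300, 100))
toy[0, 0] = 0.0     # a zero would give -inf under a plain log
toy[1, 1] = np.nan  # a gap in the recording

def log_scale(values, eps=1e-6):
    """Replace NaNs with eps, clip from below at eps, then take the log."""
    filled = np.nan_to_num(values, nan=eps)
    return np.log(np.clip(filled, eps, None))

# transpose so that time runs along the x axis when passed to imshow
img = log_scale(toy).T
print(img.shape, np.isfinite(img).all())
```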

In [None]:
# number of spectrograms for each category
N = 5

spec_dict = {
    "seizure_vote": 0,
    "lpd_vote": 0,
    "gpd_vote": 0,
    "lrda_vote":0, 
    "grda_vote":0,
    "other_vote":0
}
idealized_df = train_group[train_group.pattern=="idealized"].reset_index(drop=True)

for key in spec_dict.keys():
    col_idx = idealized_df[key].sort_values(ascending=False).head(N).index
    spec_dict[key] = idealized_df.loc[col_idx, "spectrogram_id"].values
    
pprint(spec_dict)

### ⬇️ plot function below

In [None]:
def plot_spectrograms_by_category(spectrogram_ids, category):
    
    # read in the data
    spec_base_path = "/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/"
    # `spec_id` avoids shadowing the built-in `id`
    spec_data = [pd.read_parquet(spec_base_path + str(spec_id) + ".parquet") for spec_id in spectrogram_ids]

    # create plots (N is the global number of spectrograms per category)

    fig, axes = plt.subplots(1, N, figsize=(20, 5), sharey=True)
    plt.suptitle(f"{category}", weight="bold")
    axes = axes.flatten()

    for i in range(N):
        axes[i].imshow(np.log(spec_data[i].T), cmap='magma', aspect='auto')
        axes[i].set_title(f'id {spectrogram_ids[i]}', size=15)
        axes[i].set_xlabel('Time', size=15)
        axes[i].set_ylabel('(Hz)', size=15)
        
#         axes[i].axis("off")
        axes[i].tick_params(axis='both', which='both', labelsize=10)

    plt.subplots_adjust(top=0.85)
    plt.show()

In [None]:
for key, values in spec_dict.items():
    plot_spectrograms_by_category(spectrogram_ids=values,
                                  category=key)

Now I would like to look at some **edge** cases.

The reason I want to do that is to see whether some pairs of activities come up together more often than others (e.g. is seizure most often confused with GRDA?).

In [None]:
# filter only edge cases
edge_df = train_group[train_group.pattern=="edge"].reset_index(drop=True)

# get the names of the first two columns with the largest values
def top_columns(row, n=2):
    l = row.nlargest(n).index.tolist()
    return str(l[0]) + ", " + str(l[1])

edge_df["edge_cases"] = edge_df.iloc[:, 3:-1].apply(top_columns, axis=1)
edge_df.head()
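
Note that the pair string built above follows the vote order, so "a, b" and "b, a" are counted as two different edge cases. If the pairs should be treated as unordered, one option is to sort the names before joining them. This is a sketch on a toy frame; `top_pair` is a hypothetical variant and not the function used in the rest of the notebook.

```python
import pandas as pd

toy_votes = pd.DataFrame({
    "seizure_vote": [3, 2, 0],
    "lpd_vote":     [2, 3, 0],
    "gpd_vote":     [0, 0, 4],
    "lrda_vote":    [0, 0, 5],
})

def top_pair(row, n=2):
    """Names of the n most-voted columns, sorted alphabetically."""
    return ", ".join(sorted(row.nlargest(n).index))

# rows 0 and 1 now map to the same label regardless of which side leads
toy_votes["edge_cases"] = toy_votes.apply(top_pair, axis=1)
print(toy_votes["edge_cases"].tolist())
```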

In [None]:
# data
dt = edge_df.edge_cases.value_counts().reset_index()
dt.columns = ["edge_cases", "frequency"]

# plot
plt.figure(figsize=(20, 15))
figure = sns.barplot(data=dt,
                     y="edge_cases", x="frequency", color=my_colors[4])
show_values_on_bars(figure, h_v="h", space=0.4)
plt.title('[train] Edge Cases - Frequency', weight="bold", size=20)

plt.xlabel("Count", size = 18, weight="bold")
plt.ylabel("Edge cases", size = 18, weight="bold")
    
sns.despine(right=True, top=True, left=True);

In [None]:
# 🐝 log plot
create_wandb_plot(x_data=dt["edge_cases"],
                  y_data=dt["frequency"],
                  x_name="edge_cases", y_name="Count",
                  title="[train] Edge Cases Frequency",
                  log="bar_edge_cases", plot="bar")

Let us now see a few of these spectrograms that contain disagreement between experts.

In [None]:
spec_dict2 = {key:0 for 
              key in edge_df.edge_cases.value_counts()[:6].index}

# get top N ids
N = 5
for key in spec_dict2.keys():
    col_idx = edge_df[edge_df.edge_cases == key].head(N).index
    spec_dict2[key] = edge_df.loc[col_idx, "spectrogram_id"].values
pprint(spec_dict2)

In [None]:
for key, values in spec_dict2.items():
    plot_spectrograms_by_category(spectrogram_ids=values,
                                  category=key)

In [None]:
# 🐝
wandb.finish()

<img src="https://i.imgur.com/pTggvs1.jpg">
<center><h2>- XGBoost using RAPIDS -</h2></center>

I am using the [rapids library](https://rapids.ai/) to handle the data and for preprocessing (much faster), and XGBoost for training.

As for the spectrograms dataset (as there are 11,138 `.parquet` files with 400 columns each), I will be using the dataset Chris has already put together (you can find it [here](https://www.kaggle.com/datasets/cdeotte/brain-spectrograms)).

### ○ ML Libraries

> 📌**Note**: CuML doesn't work with the newest pandas version - there are a few fixes available, but they are too much of an overhead, so for now I'll just use `sklearn`.

In [None]:
# import cuml
import cupy
import cudf
import xgboost as xgb

# from cuml.model_selection import train_test_split
from sklearn.model_selection import train_test_split

# 3. Feature Engineering

The `spectrogram` files contain values for each frequency (in hertz) over time, for multiple recording regions of the EEG electrodes. We can aggregate these over time to create features, which we can then use to train the model.

In [None]:
# import spectrogram info
spect_data = np.load("/kaggle/input/brain-spectrograms/specs.npy", allow_pickle=True).item()

In [None]:
# example data

pprint(spect_data[319287046])
pprint(spect_data[319287046].shape)

In [None]:
# get all column names
sample_path = "/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/1000086677.parquet"
feature_col_names = cudf.read_parquet(sample_path).columns[1:]

print(feature_col_names)

In [None]:
# create features across all cols
# ~ 1min to run
fe_data = {}

for spect_id, data in tqdm(spect_data.items()):
    fe_data[spect_id] = {}
    
    for k, feature in enumerate(feature_col_names):
        fe_data[spect_id][f"{feature}_mean"] = data[:, k].mean()
        fe_data[spect_id][f"{feature}_min"] = data[:, k].min()
        fe_data[spect_id][f"{feature}_max"] = data[:, k].max()
        fe_data[spect_id][f"{feature}_std"] = data[:, k].std()

The end result is a dictionary of dicts of the format:

```
{
    spectrogram_id: 
    {
      'LL_0.59_mean': 51.703323,
      'LL_0.78_mean': 66.76726,
      'LL_0.98_mean': 78.36359,
      ...
    }
}
```
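
The same four statistics can also be computed for all columns at once instead of one column at a time. Below is a minimal sketch on synthetic data; it uses the NaN-aware NumPy reductions on the assumption that some readings may be missing, which was not checked here. `summarize` and the toy names are hypothetical.

```python
import numpy as np

# synthetic stand-in for one spectrogram array: (time steps, features)
rng = np.random.default_rng(0)
toy_spec = rng.random((300, 4))
toy_spec[5, 2] = np.nan  # simulate a missing reading
toy_names = ["LL_0.59", "LL_0.78", "RL_0.59", "RL_0.78"]  # hypothetical subset

def summarize(data, names):
    """Column-wise mean/min/max/std that ignore NaNs, as a flat dict."""
    stats = {
        "mean": np.nanmean(data, axis=0),
        "min": np.nanmin(data, axis=0),
        "max": np.nanmax(data, axis=0),
        "std": np.nanstd(data, axis=0),
    }
    return {f"{name}_{stat}": values[k]
            for stat, values in stats.items()
            for k, name in enumerate(names)}

features = summarize(toy_spec, toy_names)
print(len(features))
```

The keys come out grouped by statistic rather than by feature, so the column order of the resulting dataframe would differ from the loop version.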

TODO: create vote target labels - convert from classification to regression task
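
One way this TODO could be approached is to turn the raw vote counts into per-row proportions, so that each class gets a continuous target between 0 and 1. The sketch below uses a hand-made toy frame with the same vote column names; it is only an illustration of the normalization step, not the approach used in this notebook.

```python
import pandas as pd

vote_cols = ["seizure_vote", "lpd_vote", "gpd_vote",
             "lrda_vote", "grda_vote", "other_vote"]

toy = pd.DataFrame(
    [[3, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 0, 1],
     [0, 2, 2, 0, 0, 4]],
    columns=vote_cols,
)

# divide each row by its total number of votes, so every row sums to 1
soft_targets = toy[vote_cols].div(toy[vote_cols].sum(axis=1), axis=0)
print(soft_targets)
```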

In [None]:
# convert to df
fe_data_df = pd.DataFrame.from_dict(fe_data, orient='index').reset_index()

# append target labels
# target_df = train_group\
#             .groupby("spectrogram_id")[train_group.filter(regex='_vote$').columns]\
#             .sum().reset_index()
target_df = train\
            .groupby("spectrogram_id")["expert_consensus"]\
            .first().reset_index()
# encoding from string to numbers
target_df['expert_consensus'] = pd.factorize(target_df['expert_consensus'])[0]

final_df = pd.merge(left=fe_data_df, right=target_df, 
                    left_on="index", right_on="spectrogram_id")

final_df.head()

# 4. Model Training

TODO: code a better data validation strategy
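
A possible direction for this TODO is to split by patient, so that the same patient never appears in both the training and the validation set. The sketch below uses scikit-learn's `GroupShuffleSplit` on a synthetic frame. Note that `final_df` as built above does not carry a `patient_id` column, so it would have to be merged in first; that step is assumed and not shown.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# synthetic data: 200 rows spread over 20 hypothetical patients
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "patient_id": rng.integers(0, 20, size=200),
    "feature": rng.random(200),
    "target": rng.integers(0, 6, size=200),
})

# train_size applies to the groups (patients), not to the rows
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, valid_idx = next(splitter.split(toy, groups=toy["patient_id"]))
toy_train, toy_valid = toy.iloc[train_idx], toy.iloc[valid_idx]

# no patient should appear on both sides of the split
shared = set(toy_train["patient_id"]) & set(toy_valid["patient_id"])
print(len(toy_train), len(toy_valid), shared)
```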

In [None]:
# data validation
dtrain, dvalid = train_test_split(final_df, train_size=0.8, random_state=42)

FEATURE_COLS = final_df.columns[1:-2]
TARGET_COL = final_df.columns[-1]

### XGBoost

In [None]:
# xgboost train function
def train_xgboost(dtrain, dvalid, config):
    '''
    Train the XGBoost model.
    '''    
    params = {
        'objective': config.objective,
        'eval_metric': config.eval_metric,
        'num_class': config.num_class,
        'tree_method': config.tree_method,
        'device': config.device,  # set in config_defaults; without it training stays on the CPU
        "random_state": config.random_state,
        "learning_rate": config.learning_rate,
        "max_depth": config.max_depth,
        "min_child_weight": config.min_child_weight,
    }
    

    # Matrix
    dtrain_matrix = xgb.DMatrix(dtrain[FEATURE_COLS], label=dtrain[TARGET_COL])
    dvalid_matrix = xgb.DMatrix(dvalid[FEATURE_COLS], label=dvalid[TARGET_COL])

    # Training ...
    model = xgb.train(params, dtrain_matrix, 
                      evals=[(dvalid_matrix, 'test')], 
                      num_boost_round=100,
                      verbose_eval=False)

    # Evaluate ...
    y_pred = model.predict(dvalid_matrix)
    y_pred = np.asarray(y_pred)

    # Compute accuracy
    y_true = np.asarray(dvalid[TARGET_COL])
    accuracy = np.sum(y_pred == y_true) / len(y_pred)
    wandb.log({"accuracy": np.float64(accuracy)})

    print(clr.S+f"Accuracy: {accuracy:.4f}"+clr.E)

### Train pipeline

In [None]:
def train_pipeline():
    
    # XGBoost hyperparameters
    config_defaults = {
        'objective': 'multi:softmax',
        'eval_metric': 'mlogloss',
        'num_class': 6,
        'tree_method': 'hist',
        'device': 'cuda',
        "random_state": 24,
        "learning_rate": 0.1,
        "max_depth": 1,
        "min_child_weight": 1,
    }
    
    # 🐝 W&B Experiment
    config_defaults.update(CONFIG)
    run = wandb.init(project=COMP_ID, config=config_defaults)
    config = wandb.config
    
    train_xgboost(dtrain, dvalid, config)
    
    # 🐝
    wandb.finish()

### First iteration:

In [None]:
train_pipeline()

# 5. Sweeps

> 📌**Note**: I am fine tuning the model using [Sweeps](https://docs.wandb.ai/guides/sweeps). The run from the image below can be found [here](https://wandb.ai/andrada/2024_hms/sweeps/txg90cxn?workspace=user-andrada). **Best run has an accuracy of 0.57.**

<img src="https://i.imgur.com/MOuYWKo.png">

In [None]:
# Sweep Config
sweep_config = {
    "method": "random",
    "metric": {
      "name": "accuracy",
      "goal": "maximize"   
    },
    "parameters": {
        "max_depth": {
            "values": [1, 4, 6, 10, 15, 20]
        },
        "min_child_weight": {
            "values": [1, 2, 3, 4, 5, 8, 10]
        },
        "learning_rate": {
            "values": [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7]
        },
        "random_state": {
            "values": [10, 24, 30, 45, 50, 75, 80, 100]
        }
    }
}

# Sweep ID
sweep_id = wandb.sweep(sweep_config, project=COMP_ID)

In [None]:
# 🐝 RUN SWEEPS
start = time()

# count = the number of trials/experiments to run
wandb.agent(sweep_id, train_pipeline, count=30)
print("Sweeping took:", round((time()-start)/60, 1), "mins")
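
For context on how much of the search space a random sweep of this size can cover, here is a quick count of the possible combinations. The values are copied from the sweep config above; the coverage figure is an upper bound, on the assumption that random search may draw the same combination more than once.

```python
from math import prod

# values copied from the sweep config
grid = {
    "max_depth": [1, 4, 6, 10, 15, 20],
    "min_child_weight": [1, 2, 3, 4, 5, 8, 10],
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7],
    "random_state": [10, 24, 30, 45, 50, 75, 80, 100],
}

n_combinations = prod(len(values) for values in grid.values())  # 6 * 7 * 9 * 8
count = 30  # number of sweep runs requested
coverage = count / n_combinations

print(n_combinations, f"{coverage:.2%}")
```

So 30 runs touch at most about 1% of the 3,024 combinations, which is worth keeping in mind when reading the "best run" figures.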

# 6. Feature Importance after Sweeps

After running the sweeps, I want to train a final model and analyse it further to see which of the features we created contribute the most information during training.

**I am using the hyperparameters of one of the best sweep runs so far**:
* lr: 0.1
* max_depth: 20
* min_child_weight: 10
* random_state: 100

In [None]:
# best model train
params = {
        'objective': 'multi:softmax',
        'eval_metric': 'mlogloss',
        'num_class': 6,
        'tree_method': 'hist',
        'device': 'cuda',
        "random_state": 100,
        "learning_rate": 0.1,
        "max_depth": 20,
        "min_child_weight": 10,
    }
    

# matrix
dtrain_matrix = xgb.DMatrix(dtrain[FEATURE_COLS], label=dtrain[TARGET_COL])
dvalid_matrix = xgb.DMatrix(dvalid[FEATURE_COLS], label=dvalid[TARGET_COL])

# training ...
best_model = xgb.train(params, dtrain_matrix, 
                       evals=[(dvalid_matrix, 'test')], 
                       num_boost_round=100,
                       verbose_eval=False)

# evaluate ...
y_pred = best_model.predict(dvalid_matrix)
y_pred = np.asarray(y_pred)

In [None]:
# Accuracy
y_true = np.asarray(dvalid[TARGET_COL])
print(np.sum(y_pred == y_true) / len(y_pred))

### Feature Importance

**What do the scores mean?**

They show the importance: the bigger the score, the more important the feature. With `importance_type='weight'`, the score counts how many times a decision split was made on that feature (hence, a high score suggests the feature gave a lot of information).
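
To make the counting idea concrete, here is a toy illustration in plain Python. The "trees" are hand-written lists of the features they split on, with hypothetical feature names; this only mimics what a split count measures and does not inspect XGBoost's internals.

```python
from collections import Counter

# hypothetical trees, each written as the list of features it splits on
toy_trees = [
    ["LL_0.59_mean", "RP_0.78_std", "LL_0.59_mean"],
    ["LL_0.59_mean", "LP_1.17_max"],
    ["RP_0.78_std"],
]

# count how often each feature is used for a split across all trees
weight_importance = Counter(f for tree in toy_trees for f in tree)
print(weight_importance.most_common())
```

A split count says nothing about how much each split improved the model; `get_score` also accepts other importance types such as `'gain'`, which may rank the features differently.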

In [None]:
importance_dict = best_model.get_score(importance_type='weight')
importance_df = pd.DataFrame(list(importance_dict.items()), 
                             columns=['Feature', 'Importance'])\
                            .sort_values('Importance', ascending=False)\
                            .reset_index(drop=True)
importance_df.head()

In [None]:
importance_df.describe()

In [None]:
# Plot
plt.figure(figsize=(20, 15))

figure = sns.barplot(data=importance_df[importance_df.Importance >= 100],
                     x="Importance", y="Feature", color=my_colors[0])
show_values_on_bars(figure, h_v="h", space=0.4)
plt.title('Top Features in terms of Importance', weight="bold", size=20)

plt.xlabel("Importance", size = 18, weight="bold")
plt.ylabel("Feature Name", size = 18, weight="bold")
    
sns.despine(right=True, top=True, left=True);

In [None]:
importance_df["kpi"] = importance_df["Feature"].apply(lambda x: x.split("_")[-1])
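dt = importance_df[importance_df.Importance>=30]["kpi"]\
            .value_counts().reset_index()
# name the columns explicitly, as reset_index() output differs across pandas versions
dt.columns = ["kpi", "count"]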

# Plot
plt.figure(figsize=(20, 15))

figure = sns.barplot(data=dt,
                     x="kpi", y="count", palette=my_colors)
show_values_on_bars(figure, h_v="v", space=0.4)
plt.title('KPI that yields the most information in top 25% features', weight="bold", size=20)

plt.xlabel("KPI", size = 18, weight="bold")
plt.ylabel("No. of features", size = 18, weight="bold")
    
sns.despine(right=True, top=True, left=True);

### 🐝 [my W&B dash](https://wandb.ai/andrada/2024_hms?workspace=user-andrada)
    
<center><img src="https://i.imgur.com/TKsfAVQ.png"></center>

------

<center><img src="https://i.imgur.com/FDMMaAD.png"></center>

### My Specs

* 🖥 Z8 G4 Workstation
* 💾 2 CPUs & 96GB Memory
* 🎮 2x NVIDIA A6000
* 💻 Zbook Studio G9 on the go