# Group Importances
This notebook compares the computational cost of EBM group importances with that of a more traditional method, grouped permutation feature importance (GPFI; see https://link.springer.com/article/10.1007/s10618-022-00840-5). We measure runtimes on several popular, openly available OpenML datasets that differ in their number of samples and features. These datasets need little to no preprocessing, which keeps the code short.

# Installs and imports

In [None]:
!pip install interpret numpy pandas openml --quiet
!pip install --upgrade scikit-learn --quiet

!pip install git+https://github.com/lucasplagwitz/grouped_permutation_importance --quiet

In [None]:
# Standard
import pandas as pd
import numpy as np
import random
import openml
import time
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.utils.class_weight import compute_sample_weight

import warnings
warnings.filterwarnings('ignore')

# Imports for Explainable Boosting Machine
from interpret.glassbox import ExplainableBoostingClassifier

# Group importance methods
from interpret.glassbox._ebm._research import * # includes compute_group_importance
from grouped_permutation_importance import grouped_permutation_importance

# Datasets
We pick datasets that have a varying number of samples and features. This enables comparing group importance methods across an array of settings.

In [None]:
#                                               Samples x Features
dataset_ids = [3,     # kr-vs-kp:               3196 x 37
               31,    # Credit-g:               1000 x 21
               1216,  # Click_prediction_small: 1.5M x 10
               1489,  # Phoneme:                5404 x 6
               45085, # Breast:                 97 x 24482
               45570, # Higgs:                  11M x 29
               ]

In this notebook, we are concerned only with the computational cost of computing group importances, not with their interpretation. To that end, we split the features of every dataset into 5 random groups of roughly equal size. The two group importance methods take different inputs (EBM's method takes feature names, GPFI takes column indices), so `partition` returns each grouping both as lists of feature names and as lists of column indices.

In [None]:
def partition(df, num_groups=5):
    "Partition df columns into num_groups groups of roughly equal size."
    columns = list(enumerate(df.columns))
    random.shuffle(columns)
    group_size = len(columns) // num_groups

    groups = [columns[i * group_size: (i + 1) * group_size] for i in range(num_groups - 1)]
    # Add remaining columns to the last group
    groups.append(columns[(num_groups - 1) * group_size:])

    # Get lists of indices and lists of names for same groupings
    index_groups = [[index for index, _ in group] for group in groups]
    name_groups = [[col_name for _, col_name in group] for group in groups]

    return index_groups, name_groups
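
The grouping step can also be made reproducible. The sketch below is a hypothetical variant, not the function used in this notebook: the name `partition_seeded` and the `seed` parameter are assumptions, and it swaps the floor-division split for `np.array_split`, which spreads the leftover columns across groups instead of adding them all to the last one.

```python
import numpy as np
import pandas as pd

def partition_seeded(df, num_groups=5, seed=0):
    """Hypothetical variant of partition: seeded shuffle, balanced group sizes."""
    rng = np.random.default_rng(seed)
    # Shuffle the column positions, then cut them into num_groups chunks
    order = rng.permutation(len(df.columns))
    index_groups = [sorted(chunk.tolist()) for chunk in np.array_split(order, num_groups)]
    # Look up the column names for the same groupings
    name_groups = [[df.columns[i] for i in group] for group in index_groups]
    return index_groups, name_groups

# Toy frame with 7 columns, split into 3 groups
toy = pd.DataFrame(np.zeros((3, 7)), columns=[f"f{i}" for i in range(7)])
idx_groups, name_groups = partition_seeded(toy, num_groups=3, seed=42)
print([len(g) for g in idx_groups])  # [3, 2, 2]
```

Calling it twice with the same `seed` returns the same groups, which makes runtime comparisons easier to repeat.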

# Computational Cost of Group Importances
For each dataset, we load X and y and fit the classifier, an EBM in this case. We then draw random groupings and time the computation of group importances using (1) EBM's internal method for group importances, and (2) grouped permutation feature importance (GPFI) as defined in https://link.springer.com/article/10.1007/s10618-022-00840-5.

In [None]:
# Keep track of runtimes
EBM_runtime = []
GPFI_runtime = []

for dataset_id in tqdm(dataset_ids):  # avoid shadowing the built-in id()
    # Pull dataset
    print("Dataset ID:", dataset_id)
    openml_dataset = openml.datasets.get_dataset(dataset_id)
    X, y, _, _ = openml_dataset.get_data(target=openml_dataset.default_target_attribute)
    w = compute_sample_weight(class_weight="balanced", y=y) # Sample weights

    # Build classifier
    model = ExplainableBoostingClassifier(max_bins=128, max_rounds=5000, smoothing_rounds=500, outer_bags=8)
    model.fit(X, y, sample_weight=w)

    # Compute random groupings
    idxs, names = partition(X, num_groups=5)

    # Compute group importances using EBM's group importance method. Track runtime
    t1_EBM = time.perf_counter()
    for g in names: # EBMs compute group importance one by one, so loop over them
        compute_group_importance(g, model, X)
    t2_EBM = time.perf_counter()
    EBM_runtime.append(t2_EBM - t1_EBM) # Track EBM runtime
    print("EBM:", t2_EBM - t1_EBM)

    # Compute group importances using GPFI: GPFI implements it with all groups together in idxs, so only one call is needed.
    t1_GPFI = time.perf_counter()
    r_GPFI = grouped_permutation_importance(model, X.values, y.values, idxs=idxs, n_repeats=5, sample_weight=w) # Reuse w so that computing the weights is not counted in the GPFI runtime
    t2_GPFI = time.perf_counter()
    GPFI_runtime.append(t2_GPFI - t1_GPFI) # Track GPFI runtime
    print("GPFI:", t2_GPFI - t1_GPFI, '\n')
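
To make the cost of the permutation approach concrete, here is a minimal sketch of the general idea behind grouped permutation importance, assuming a fitted scikit-learn style model with a `score` method. It is not the implementation in the `grouped_permutation_importance` package; the function name `grouped_permutation_sketch` and the toy data are assumptions made for illustration. Each group needs `n_repeats` full passes of prediction over the data, which is where the runtime goes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def grouped_permutation_sketch(model, X, y, groups, n_repeats=5, seed=0):
    """Mean drop in model.score when each group's columns are shuffled jointly."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = []
    for group in groups:
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # One row order for the whole group keeps within-group structure intact
            row_order = rng.permutation(X.shape[0])
            X_perm[:, group] = X[row_order][:, group]
            drops.append(baseline - model.score(X_perm, y))
        importances.append(float(np.mean(drops)))
    return importances

# Toy data: with shuffle=False the 3 informative features are columns 0-2
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=3,
                                   n_redundant=0, n_clusters_per_class=1,
                                   shuffle=False, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
scores = grouped_permutation_sketch(clf, X_toy, y_toy, groups=[[0, 1, 2], [3, 4, 5]])
print(scores)  # one importance per group
```

The group holding the informative columns should score higher than the group of noise columns.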

In [None]:
runtime_df = pd.DataFrame({'EBM Runtime': EBM_runtime, "GPFI Runtime": GPFI_runtime, "Dataset ID": dataset_ids})

In [None]:
runtime_df
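
A ratio column can make the table easier to read. The helper below is a small sketch, not part of the original notebook; the name `add_runtime_ratio` and the new column label are assumptions, while the default column names match those used in `runtime_df`.

```python
import pandas as pd

def add_runtime_ratio(df, numerator="GPFI Runtime", denominator="EBM Runtime"):
    """Return a copy of df with a column holding numerator / denominator per row."""
    out = df.copy()
    out["GPFI / EBM"] = out[numerator] / out[denominator]
    return out

# Usage with the table built above:
# runtime_df = add_runtime_ratio(runtime_df)
```

A value above 1 in a row means GPFI took longer than the EBM method on that dataset.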