# Ubiquant Feature Importance with LOFO

![](https://raw.githubusercontent.com/aerdem4/lofo-importance/master/docs/lofo_logo.png)

**LOFO** (Leave One Feature Out) Importance calculates the importances of a set of features based on **a metric of choice**, for **a model of choice**, by **iteratively removing each feature from the set**, and **evaluating the performance** of the model, with **a validation scheme of choice**, based on the chosen metric.

LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set. The mean and standard deviation (across the folds) of the importance of each feature are then reported.
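
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `lofo-importance` implementation: the helper names (`fit_predict`, `neg_mse`, `lofo_sketch`), the least-squares stand-in model, and the synthetic data are all assumptions made for the example. The importance of a feature is taken here as the baseline score minus the score obtained without that feature, so a positive value means the feature helps.

```python
import numpy as np

def fit_predict(X_train, y_train, X_val):
    """Ordinary least squares with an intercept (stand-in for any model)."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_val, np.ones(len(X_val))]) @ coef

def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher is better."""
    return -np.mean((y_true - y_pred) ** 2)

def lofo_sketch(X, y, folds):
    """Return (mean, std) of per-feature importance across the folds."""
    n_features = X.shape[1]
    importances = np.zeros((len(folds), n_features))
    for k, (tr, va) in enumerate(folds):
        # Baseline: score with all features included.
        base = neg_mse(y[va], fit_predict(X[tr], y[tr], X[va]))
        for j in range(n_features):
            # Leave feature j out, retrain, and re-score on the same fold.
            keep = [c for c in range(n_features) if c != j]
            score = neg_mse(y[va], fit_predict(X[tr][:, keep], y[tr], X[va][:, keep]))
            importances[k, j] = base - score
    return importances.mean(axis=0), importances.std(axis=0)

# Synthetic data (hypothetical): only column 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=600)

# Three expanding-window folds given as (train indices, validation indices).
idx = np.arange(600)
folds = [(idx[:300], idx[300:400]), (idx[:400], idx[400:500]), (idx[:500], idx[500:])]

mean_imp, std_imp = lofo_sketch(X, y, folds)
print(mean_imp, std_imp)
```

In this toy setup, column 0 should receive a large positive importance, while the two noise columns should land close to zero and may come out slightly negative.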

While other feature importance methods usually calculate how much a feature is used by the model, LOFO estimates how much a feature can make a difference by itself given that we have the other features. Here are some advantages of LOFO:
* It generalises well to unseen test sets since it uses a validation scheme.
* It is model agnostic.
* It gives negative importance to features that hurt performance upon inclusion.
* It can group features, which is especially useful for high-dimensional features such as TF-IDF or one-hot-encoded (OHE) features. It is also good practice to group highly correlated features to avoid misleading results.
* It can automatically group highly correlated features to avoid underestimating their importance.
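
To make the idea of grouping correlated features concrete, here is a rough sketch that merges features whose absolute Pearson correlation exceeds a threshold. It is only an illustration under stated assumptions: the function `group_correlated`, the union-find merging rule, and the synthetic columns are hypothetical, and the library's own automatic grouping may work differently.

```python
import numpy as np

def group_correlated(X, names, threshold=0.6):
    """Group columns of X whose absolute correlation exceeds the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] > threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

# Synthetic columns (hypothetical): f_b is a noisy copy of f_a, f_c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + 0.1 * rng.normal(size=1000)
c = rng.normal(size=1000)

groups = group_correlated(np.column_stack([a, b, c]), ["f_a", "f_b", "f_c"])
print(groups)
```

Each resulting group would then be removed as a unit during a leave-one-out pass, so that a feature is not judged unimportant merely because a near-duplicate of it is still available to the model.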

https://github.com/aerdem4/lofo-importance

In [None]:
!pip install lofo-importance

In [None]:
import cupy
import cudf

PATH = "../input/ubiquant-market-prediction"

# Load only the first 200,000 rows onto the GPU to keep the LOFO run fast.
df = cudf.read_csv(f"{PATH}/train.csv", nrows=200000)
print(df.shape)
df.head()

In [None]:
df["time_id"].max()

In [None]:
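WINDOW = 5     # number of time_ids in each validation window
START = 60     # last time_id in the first training fold
N_SPLITS = 5

# Expanding-window splits: train on every time_id up to a cutoff,
# then validate on the next WINDOW time_ids.
time_ids = df["time_id"].values
cv = []

for i in range(N_SPLITS):
    cutoff = START + i * WINDOW
    train_ind = cupy.where(time_ids <= cutoff)[0]
    val_ind = cupy.where((time_ids > cutoff) & (time_ids <= cutoff + WINDOW))[0]
    # The Dataset used later is a pandas frame, so store the indices as NumPy arrays.
    cv.append((cupy.asnumpy(train_ind), cupy.asnumpy(val_ind)))
    print(len(train_ind), len(val_ind))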

In [None]:
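# Every column except the row identifier and the target is treated as a feature.
# Note that identifier-like columns such as time_id stay in the feature set.
features = [col for col in df.columns if col not in {"row_id", "target"}]
len(features)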

In [None]:
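from lofo import Dataset, LOFOImportance, plot_importance

# Convert the cuDF frame to pandas before handing it to LOFO.
# No manual groups are given; auto_group_threshold=0.6 asks LOFO to group
# highly correlated features automatically.
ds = Dataset(
    df.to_pandas(),
    target="target",
    features=features,
    feature_groups=None,
    auto_group_threshold=0.6,
)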

In [None]:
import xgboost as xgb


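param = {
    "objective": "reg:squarederror",
    "learning_rate": 0.1,
    "max_depth": 5,
    "min_child_weight": 200,
    # GPU training; XGBoost 2.0+ deprecates "gpu_hist" in favour of
    # tree_method="hist" with device="cuda".
    "tree_method": "gpu_hist",
    "gpu_id": 0,
    "disable_default_eval_metric": 1,
    "n_estimators": 300,
}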

model = xgb.XGBRegressor(**param)

In [None]:
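# cv is the list of (train, validation) index pairs built above.
# The scorer is negated MSE, so higher scores are better.
lofo_imp = LOFOImportance(ds, cv=cv, scoring="neg_mean_squared_error", model=model)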

importance_df = lofo_imp.get_importance()
importance_df

In [None]:
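# Keep the full name, then truncate the displayed name to 100 characters so that
# long names (for example, auto-generated group names) stay readable in the plots.
importance_df["feature_full_name"] = importance_df["feature"].values
importance_df["feature"] = importance_df["feature_full_name"].apply(lambda x: x[:100])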

In [None]:
plot_importance(importance_df, figsize=(16, 12))

In [None]:
plot_importance(importance_df.head(16), figsize=(16, 12))

In [None]:
plot_importance(importance_df.tail(16), figsize=(16, 12))

In [None]:
importance_df.to_csv("feature_importance.csv", index=False)