<div style="color: #f8f8ff;
           display:fill;
           border-radius:10px;
           border-style: solid;
           border-color:#424949;
           text-align:center;
           background-color:#69541b ;
           font-size:20px;
           letter-spacing:0.5px;
           padding: 0.7em;
           text-align:left">  
<center>  Background  </center> 
 <hr>
     The content of this notebook mainly comes from <a href="https://www.kaggle.com/code/ogrellier/feature-selection-with-null-importances" style="color: red">here</a>. Thanks to @olivier for his contribution.
 <hr>
 The notebook implements the following steps:
 <ul>
<li>Create the null importance distributions: these are created by fitting the model over several runs on a shuffled version of the target. This shows how the model can make sense of a feature irrespective of the target.</li>
<li>Fit the model on the original target and gather the feature importances. This gives a benchmark whose significance can be tested against the null importance distribution.</li>
<li>For each feature, test the actual importance:
  <ul>
  <li>Compute the probability of the actual importance with respect to the null distribution. I will use a very simple estimate based on occurrences, whereas the article proposes fitting a known distribution to the gathered data. In fact, I compute 1 - the probability so that things are in the right order.</li>
  <li>Simply compare the actual importance to the mean and max of the null importances. This gives a kind of feature importance that reveals the major features in the dataset, since the previous method may give lots of ones.</li>
  </ul>
</li>
 </ul>
 <hr>
</div>
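The steps above can be illustrated on a toy problem. The sketch below is not the notebook's LightGBM pipeline: it assumes synthetic data and uses the absolute Pearson correlation with the target as a stand-in "importance" (the names `toy_importance`, `signal` and `noise` are hypothetical), only to show how shuffling the target yields a null distribution to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)   # feature that really drives the target
noise = rng.normal(size=n)    # feature unrelated to the target
X = np.column_stack([signal, noise])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(float)


def toy_importance(X, y):
    """Stand-in importance: absolute Pearson correlation of each column with y."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])


# Step 1: null importances, obtained by recomputing the importance on shuffled targets
null = np.array([toy_importance(X, rng.permutation(y)) for _ in range(50)])

# Step 2: actual importances on the real target
actual = toy_importance(X, y)

# Step 3: compare the actual importance with the mean and max of the null importances
ratio_to_null_mean = actual / null.mean(axis=0)
ratio_to_null_max = actual / null.max(axis=0)
print("actual / null mean:", ratio_to_null_mean)
print("actual / null max :", ratio_to_null_max)
```

A feature that carries real signal should sit far above its null distribution, while an unrelated feature should be hard to tell apart from it.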

In [None]:
import numpy as np 
import pandas as pd 
import os
import eli5
import lightgbm as lgb
import time
import gc
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import warnings
warnings.simplefilter('ignore', UserWarning)
gc.enable()

from typing import Dict, Tuple, List, Union
from pandas import DataFrame, Series
from contextlib import contextmanager
from sklearn.model_selection import  train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score
from eli5.sklearn import PermutationImportance

<div style="color: #fff7f7;
           display:fill;
           border-radius:10px;
           border-style: solid;
           border-color:#424949;
           text-align:center;
           background-color:#69541b ;
           font-size:20px;
           letter-spacing:0.5px;
           padding: 0.7em;
           text-align:left">  
<center>  Seed and load data </center> 
</div>

In [None]:
SEED=2022
def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

@contextmanager
def timer(name: str):
    s = time.time()
    yield
    S_time = time.time() - s
    print(f'[{name}] {S_time: .2f}sec')

seed_everything(SEED)

In [None]:
train = pd.read_feather("../input/amexfeather/train_data.ftr")

<div style="color: #fff7f7;
           display:fill;
           border-radius:10px;
           border-style: solid;
           border-color:#424949;
           text-align:center;
           background-color:#69541b ;
           font-size:20px;
           letter-spacing:0.5px;
           padding: 0.7em;
           text-align:left">  
<center>  Num-feature and Cat-feature </center> 
</div>

In [None]:
cols = train.columns.to_list()
category_cols = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
numerical_cols = [col for col in cols if col not in category_cols + ['target',"S_2","customer_ID"]]
all_cols = category_cols + numerical_cols


for cat in category_cols:
    train[cat] = pd.factorize(train[cat])[0]
    train[cat] = train[cat].astype('category')

In [None]:
# Split the data and keep only the 65% training part to reduce memory usage
train_x, valid_x, train_y, valid_y = train_test_split(
    train[all_cols], train[["target"]], test_size=0.35, stratify=train["target"], random_state=SEED
)
train = pd.concat([train_x, train_y], axis=1)
train = train.reset_index(drop=True)


del train_x, train_y, valid_x, valid_y
gc.collect()

<div style="color: #fff7f7;
           display:fill;
           border-radius:10px;
           border-style: solid;
           border-color:#424949;
           text-align:center;
           background-color:#69541b ;
           font-size:20px;
           letter-spacing:0.5px;
           padding: 0.7em;
           text-align:left">  
<center>  Train lgbm-model </center> 
</div>

In [None]:
def get_feature_importance_lgbm(data:DataFrame, shuffle:bool = False, seed:int = 2022) -> Tuple:
    
    """
    Fit a LightGBM random-forest model and collect its feature importances.
    
    args:
        data: pd.DataFrame holding the features and the 'target' column
        shuffle: default=False. If True, the target is randomly shuffled before fitting.
        seed: random seed for the lgbm-model
        
    returns:
        Tuple: 
          Tuple[0](model object): trained model
          Tuple[1](DataFrame): feature importances after training
    
    """
    
    all_features = [f for f in data if f not in ['target', 'S_2', "customer_ID"]]
    
    # With shuffle=True the target is scrambled, which breaks the link between features
    # and target, so the resulting importances are null importances.
    y = data['target'].copy()
    if shuffle:
        # to_numpy() drops the shuffled index so y cannot be re-aligned with the features
        y = data['target'].sample(frac=1.0).to_numpy()
        
    lgb_params = {
        'boosting_type': 'rf',
        'subsample': 0.6,
        'colsample_bytree': 0.6,
        'num_leaves': 200,
        'max_depth': 10,
        'seed': seed,
        'bagging_freq': 1,
        "n_jobs":4,
        "bagging_seed":seed,
        "min_gain_to_split":0.10
    }
    
    clf = lgb.LGBMClassifier(**lgb_params)
    clf.fit(data[all_features], y, categorical_feature=category_cols)
    
    # Predict once and reuse the predictions for all training metrics
    preds = clf.predict(data[all_features])

    imp_df = pd.DataFrame()
    imp_df["feature"] = list(all_features)
    imp_df["importance_gain"] = clf.booster_.feature_importance(importance_type='gain')
    imp_df["importance_split"] = clf.booster_.feature_importance(importance_type='split')
    imp_df['train_accuracy'] = accuracy_score(y, preds)
    imp_df['train_recall'] = recall_score(y, preds)
    imp_df['train_precision'] = precision_score(y, preds)
    
    return clf, imp_df

In [None]:
with timer("train model:"):
    model, truly_imp_df = get_feature_importance_lgbm(data=train)
    truly_imp_df.to_csv("truly_feature_importance.csv")  #Save the data for easy comparison of filtered features in another notebook.

In [None]:
truly_imp_df.head(5)

<div style="color: #fff7f7;
           display:fill;
           border-radius:10px;
           border-style: solid;
           border-color:#424949;
           text-align:center;
           background-color:#69541b ;
           font-size:20px;
           letter-spacing:0.5px;
           padding: 0.7em;
           text-align:left">  
<center>  Explain feature importances - weights </center> 
</div>

In [None]:
eli5.show_weights(model, feature_names = all_cols, importance_type="split", top=20)

In [None]:
eli5.show_weights(model, feature_names = all_cols, importance_type="gain", top=20)

<div style="color: #fff7f7;
           display:fill;
           border-radius:10px;
           border-style: solid;
           border-color:#424949;
           text-align:center;
           background-color:#69541b ;
           font-size:20px;
           letter-spacing:0.5px;
           padding: 0.7em;
           text-align:left">  
<center>  Null feature importance </center>
</div>

In [None]:
with timer("build null feature importance"):
    
    runs = 5
    null_imp_runs = []
    for i in range(runs):
        # Use a separate name so the model fitted on the real target is not overwritten
        null_model, imp_df = get_feature_importance_lgbm(data=train, shuffle=True) # returns (model, df)
        imp_df["run_num"] = i+1
        null_imp_runs.append(imp_df)
        
        del null_model
        gc.collect()
        print(f"======running:{i+1}======")
        
    # Concatenate once at the end instead of growing the DataFrame inside the loop
    null_imp = pd.concat(null_imp_runs, axis=0)
    null_imp.to_csv("null_feature_importance_with_5.csv")

In [None]:
null_imp.head()

In [None]:
def display_distributions(actual_imp_df_:DataFrame, null_imp_df_:DataFrame, feature:str) -> None:
    
    """
    args:
      actual_imp_df_: Importances from the model fitted on the unshuffled target.
      null_imp_df_: Importances from the models fitted on shuffled targets (currently 5 shuffles).
      feature: Name of the feature to plot.
      
    """
    
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    
    # One panel per importance type: split on the left, gain on the right
    for i, kind in enumerate(['split', 'gain']):
        col = 'importance_%s' % kind
        ax = plt.subplot(gs[0, i])
        a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature, col].values, label='Null importances')
        # The red line marks the importance obtained with the real target
        ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature, col].mean(), 
                   ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
        ax.legend()
        ax.set_title('%s Importance of %s' % (kind.capitalize(), feature.upper()), fontweight='bold')
        ax.spines[['top', 'right']].set_visible(False)
        ax.spines[['left','bottom']].set_linewidth(1.5)
        ax.grid(False)
        ax.set_xlabel('Null Importance (%s) Distribution for %s ' % (kind, feature.upper()))

- For some of the top features, compare the actual importance (red line) with the distribution of null importances obtained after shuffling the target (histogram).

In [None]:
display_distributions(truly_imp_df, null_imp, feature="P_2")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="D_42")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="B_9")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="S_3")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="B_3")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="D_48")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="D_63")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="D_64")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="D_66")

In [None]:
display_distributions(truly_imp_df, null_imp, feature="D_68")

The plots above demonstrate the power of this feature selection method. In particular, it is well known that:

- Any feature with sufficient variance can be used and made sense of by tree models. You can always find splits that help the model score better.
- Correlated features have decaying importances once one of them is used by the model. The chosen feature will have a strong importance, and the features correlated with it will have decaying importances.

The current method makes it possible to:

- Drop high-variance features if they are not really related to the target.
- Remove the decaying factor on correlated features, showing their real (or unbiased) importance.

#### Score features <br>
There are several ways to score features:

- Compute the number of samples in the actual importances that are away from the recorded null importance distribution.
- Compute ratios like Actual / Null Max, Actual / Null Mean, Actual Mean / Null Max <br>

As a first step I will use the log of the actual feature importance divided by (1 + the 75th percentile of the null distribution).
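As a small worked example of the log-ratio score described above, the sketch below isolates the formula in a helper. The name `log_ratio_score` and the toy numbers are hypothetical; the formula mirrors the one used in the scoring cell that follows.

```python
import numpy as np


def log_ratio_score(actual_imp, null_imps, q=75):
    """Log of the actual importance over (1 + q-th percentile of the null importances).

    The 1 in the denominator avoids dividing by zero; the 1e-10 avoids log(0).
    """
    return np.log(1e-10 + actual_imp / (1 + np.percentile(null_imps, q)))


toy_null = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # 75th percentile is 3.0
print(log_ratio_score(40.0, toy_null))           # log(40 / (1 + 3)) = log(10)
print(log_ratio_score(0.0, toy_null))            # unused feature: log(1e-10)
```

A positive score means the actual importance exceeds 1 + the 75th percentile of the null importances; a feature the model never used gets a strongly negative score.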

In [None]:
def plot_feature_scores(true_df:DataFrame, null_df:DataFrame) -> None:
    
    def score_df():
        
        feature_scores = []
        for f in true_df['feature'].unique():
            f_null_imps_gain = null_df.loc[null_df['feature'] == f, 'importance_gain'].values
            f_act_imps_gain = true_df.loc[true_df['feature'] == f, 'importance_gain'].mean()
            # 1 is added to the denominator to avoid dividing by zero; 1e-10 avoids log(0)
            gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))
            f_null_imps_split = null_df.loc[null_df['feature'] == f, 'importance_split'].values
            f_act_imps_split = true_df.loc[true_df['feature'] == f, 'importance_split'].mean()
            split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))
            feature_scores.append((f, split_score, gain_score))
            
        scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])
        
        scores_df.to_csv("score_feature_with_mean.csv")
        return scores_df
    
    
    # Compute the scores once and reuse them for both panels
    scores_df = score_df()
    
    plt.figure(figsize=(16, 16))
    gs = gridspec.GridSpec(1, 2)
    
    # 1. Plot Split importances
    ax = plt.subplot(gs[0, 0])
    sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
    ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
    ax.spines[['top', 'right']].set_visible(False)
    ax.spines[['left','bottom']].set_linewidth(1.5)
    ax.grid(False)
    
    # 2. Plot Gain importances
    ax = plt.subplot(gs[0, 1])
    sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
    ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
    ax.spines[['top', 'right']].set_visible(False)
    ax.spines[['left','bottom']].set_linewidth(1.5)
    ax.grid(False)
    plt.tight_layout()
        

#=======================================
plot_feature_scores(truly_imp_df, null_imp)

#### Check the impact of removing uncorrelated features <br>
- Use a different metric to assess correlation to the target: the percentage of null importances that fall below the actual importance.
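The metric computed in the next cell can be read in isolation as follows. This is a minimal sketch with a hypothetical helper name (`pct_null_below`) and toy numbers; it reproduces the expression used in that cell.

```python
import numpy as np


def pct_null_below(actual_imps, null_imps):
    """Percentage of null importances below the 25th percentile of the actual importances."""
    return 100 * (null_imps < np.percentile(actual_imps, 25)).sum() / null_imps.size


toy_null = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # one value per shuffled run
toy_actual = np.array([3.5])                     # single fit on the real target
print(pct_null_below(toy_actual, toy_null))      # 3 of the 5 null values are below 3.5
```

Because only one model is fitted on the real target, the 25th percentile of the actual importances is just that single value. With 5 null runs the score can only take the values 0, 20, 40, 60, 80 or 100, which is worth keeping in mind when reading the thresholds used later.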

In [None]:
def plot_corr_scores(true_df:DataFrame, null_df:DataFrame) -> None:
    
    
    correlation_scores = []
    def corr_score_df():
        
        
        for f in true_df['feature'].unique():
            f_null_imps = null_df.loc[null_df['feature'] == f, 'importance_gain'].values
            f_act_imps = true_df.loc[true_df['feature'] == f, 'importance_gain'].values
            gain_score = 100 * (f_null_imps < np.percentile(f_act_imps, 25)).sum() / f_null_imps.size
            f_null_imps = null_df.loc[null_df['feature'] == f, 'importance_split'].values
            f_act_imps = true_df.loc[true_df['feature'] == f, 'importance_split'].values
            split_score = 100 * (f_null_imps < np.percentile(f_act_imps, 25)).sum() / f_null_imps.size
            correlation_scores.append((f, split_score, gain_score))

        corr_scores_df = pd.DataFrame(correlation_scores, columns=['feature', 'split_score', 'gain_score'])
        corr_scores_df.to_csv("corr_score_filter.csv")
        
        return correlation_scores, corr_scores_df
    
    
    correlation_scores, corr_scores_df = corr_score_df()
    
    fig = plt.figure(figsize=(16, 16))
    gs = gridspec.GridSpec(1, 2)

    # Plot Split importances
    ax = plt.subplot(gs[0, 0])
    sns.barplot(x='split_score', y='feature', data=corr_scores_df.sort_values('split_score', ascending=False).iloc[0:70], ax=ax)
    ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
    ax.spines[['top', 'right']].set_visible(False)
    ax.spines[['left','bottom']].set_linewidth(1.5)
    ax.grid(False)

    # Plot Gain importances
    ax = plt.subplot(gs[0, 1])
    sns.barplot(x='gain_score', y='feature', data=corr_scores_df.sort_values('gain_score', ascending=False).iloc[0:70], ax=ax)
    ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
    ax.spines[['top', 'right']].set_visible(False)
    ax.spines[['left','bottom']].set_linewidth(1.5)
    ax.grid(False)
    plt.tight_layout()
    plt.suptitle("Features' split and gain scores", fontweight='bold', fontsize=16)
    fig.subplots_adjust(top=0.93)
    
    gc.collect()
    
    return correlation_scores


#=====================
correlation_scores = plot_corr_scores(truly_imp_df, null_imp)

In [None]:
def com_feature(a:List, b:List) -> List:
    
    # Return the elements shared by both lists, sorted for a stable display order
    
    return sorted(set(a) & set(b))

### Score feature removal for different thresholds :)

In [None]:
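def score_feature_selection(df=None, train_features=None, cat_feats=None, target=None):

    # verbose=-1 silences the dataset construction logs
    dtrain = lgb.Dataset(df[train_features], target, free_raw_data=False, params={'verbose': -1})
    lgb_params = {
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'learning_rate': .1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'num_leaves': 31,
        'max_depth': -1,
        'seed': 13,
        'n_jobs': 4,
        'min_split_gain': .01,
        'reg_alpha': .0001,
        'reg_lambda': .0001,
        'metric': 'auc'
    }
    
    # Fit the model with 5-fold stratified cross-validation.
    # Early stopping and logging are passed as callbacks because LightGBM 4.0
    # removed the early_stopping_rounds and verbose_eval keyword arguments.
    hist = lgb.cv(
        params=lgb_params, 
        train_set=dtrain, 
        num_boost_round=300,
        categorical_feature=cat_feats,
        nfold=5, 
        stratified=True,
        shuffle=True,
        callbacks=[lgb.early_stopping(20, verbose=False), lgb.log_evaluation(0)],
        seed=17
    )
    
    # LightGBM >= 4.0 prefixes the result keys with 'valid '; older versions do not
    prefix = 'valid ' if 'valid auc-mean' in hist else ''
    
    # Return the last mean / std values 
    return hist[prefix + 'auc-mean'][-1], hist[prefix + 'auc-stdv'][-1]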


for threshold in [10, 30, 50, 70, 90, 99]:
    split_feats = [f for f, score, _ in correlation_scores if score >= threshold]
    split_cat_feats = [f for f, score, _ in correlation_scores if score >= threshold and f in category_cols]
    gain_feats = [f for f, _, score in correlation_scores if score >= threshold]
    gain_cat_feats = [f for f, _, score in correlation_scores if score >= threshold and f in category_cols]
    
    
    print('Results for threshold %3d' % threshold)
    print("The selected features are now:")
    
    
    print("[features selected by both split and gain]: {}".format(com_feature(split_feats, gain_feats)))
    print("[categorical features selected by both]: {}".format(com_feature(split_cat_feats, gain_cat_feats)))

    split_results = score_feature_selection(df=train, train_features=split_feats, cat_feats=split_cat_feats, target=train['target'])
    print('\t SPLIT : %.6f +/- %.6f' % (split_results[0], split_results[1]))
    
    gain_results = score_feature_selection(df=train, train_features=gain_feats, cat_feats=gain_cat_feats, target=train['target'])
    print('\t GAIN  : %.6f +/- %.6f' % (gain_results[0], gain_results[1]))
    
    print("==================")
    del split_feats, split_cat_feats, gain_feats, gain_cat_feats
    gc.collect()

Because of memory constraints, the next step is done in another notebook: it models and analyzes the saved feature importance files and compares the performance of the models trained on the filtered features. :)