# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> - | Notebook summary</div>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Hi Kagglers! In this notebook I compare two feature selection techniques, permutation feature importance and mutual information, first on the raw features and then again after some feature engineering. I hope you like it; any suggestions are welcome. Greetings! </p>


# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> - | Table of contents</div>



* [1-Libraries](#section-one)
* [2-Data loading](#section-two)
* [3-Folds creation](#section-three)
* [4-Initial feature selection](#section-four)
* [5-Feature engineering](#section-five)
* [6-Final feature selection](#section-six)

<a id="section-one"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 1 | Libraries</div>

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
from sklearn import metrics, model_selection 
from sklearn.feature_selection import SelectFromModel
from sklearn.cluster import KMeans

In [None]:
import warnings
warnings.filterwarnings('ignore')

<a id="section-two"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 2 | Data loading</div>

In [None]:
train = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv', nrows=250000)

<a id="section-three"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 3 | Folds creation</div>

In [None]:
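kf = model_selection.KFold(n_splits=5)
train['kfold'] = -1  # placeholder, overwritten for every row below

def kfold(df):
    # Tag each row with the index of the fold in which it is used for validation
    for fold, (train_idx, test_idx) in enumerate(kf.split(X=df)):
        df.loc[test_idx, 'kfold'] = fold

    return df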

In [None]:
train = kfold(train)
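
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
To make the content of the <code>kfold</code> column concrete: <code>KFold</code> without <code>shuffle=True</code> cuts the rows into contiguous blocks, in file order. The sketch below reproduces that assignment in plain Python. The helper name <code>assign_folds</code> is hypothetical (it is not part of scikit-learn), and it assumes the documented rule that the first <code>n_rows % n_splits</code> folds receive one extra row. </p>

```python
def assign_folds(n_rows, n_splits):
    """Return the validation fold of every row, in row order (no shuffling)."""
    base, extra = divmod(n_rows, n_splits)
    folds = []
    for fold in range(n_splits):
        # The first `extra` folds get one additional row
        size = base + 1 if fold < extra else base
        folds.extend([fold] * size)
    return folds


folds = assign_folds(n_rows=12, n_splits=5)
print(folds)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Because the blocks follow row order, a file sorted by time or by target would give unrepresentative folds. Passing <code>shuffle=True</code> together with a <code>random_state</code> to <code>KFold</code> is the usual remedy. </p>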

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Splitting the features </p>

In [None]:
features = [feature for feature in train.columns if feature not in ('id', 'kfold','target')]
binary_features = [feature for feature in features if len(train[feature].unique()) == 2]
conts_features = [feature for feature in features if feature not in binary_features]

<a id="section-four"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 4 | Initial feature selection</div>

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
    Permutation feature importance </p>

In [None]:
fold = 0
X_train = train[train.kfold != fold].reset_index(drop=True)
X_valid = train[train.kfold == fold].reset_index(drop=True)

y_train = X_train['target']
y_valid = X_valid['target']

X_train = X_train[features]
X_valid = X_valid[features]

my_model = xgb.XGBClassifier(eval_metric='logloss',
                             tree_method='gpu_hist', 
                             gpu_id=0,
                             predictor="gpu_predictor",
                             random_state=0).fit(X_train, y_train)

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(X_valid, y_valid)
eli5.show_weights(perm, top=25, feature_names = X_valid.columns.tolist())

<p style="font-size:16px; font-family:verdana; line-height: 1.7em; margin-left:20px">
<b> We can see that the most important feature by far is f22, followed by f179, and then all the others.</b></p>
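
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
For readers who have not used permutation importance before, here is a minimal sketch of the idea in plain Python: shuffle one column of the validation data, re-score the model, and record how much the score drops. Everything in it is an assumption made for illustration (the toy data, the rule-based <code>toy_model</code>, and the helper names); it is not how <code>eli5</code> is implemented, only the same principle. </p>

```python
import random


def accuracy(predict, rows, targets):
    hits = sum(predict(row) == target for row, target in zip(rows, targets))
    return hits / len(rows)


def permutation_importance_sketch(predict, rows, targets, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with the shuffled value in position `col`
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, targets))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy validation set: three random features, target depends on feature 0 only
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(500)]
targets = [int(row[0] > 0.5) for row in rows]


def toy_model(row):
    # Stand-in for a fitted model: it looks at feature 0 and nothing else
    return int(row[0] > 0.5)


importances = permutation_importance_sketch(toy_model, rows, targets)
print(importances)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Shuffling a column the model never reads cannot change its predictions, so those columns score exactly zero, while the column the model relies on shows a large drop. </p>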


<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">   
Mutual information </p>

In [None]:
discrete_features_i = X_train.dtypes == int

In [None]:
from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete_features,n_neighbors=10)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

In [None]:
# Because this approach is quite resource-intensive, I'm going to use only the first 50000 rows
mi_scores_i = make_mi_scores(X_train.head(50000), y_train.head(50000), discrete_features_i)
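
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
As a reminder of what a mutual information score measures, the sketch below computes it from counts for two discrete columns. This covers only the discrete case: for continuous columns <code>mutual_info_classif</code> relies on a nearest-neighbour estimator instead, which is what the <code>n_neighbors</code> argument controls. The function name <code>discrete_mutual_info</code> and the toy columns are assumptions made for illustration. </p>

```python
import math
from collections import Counter


def discrete_mutual_info(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Compare the joint probability with what independence would predict
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


target = [0, 0, 1, 1]
copy_of_target = [0, 0, 1, 1]  # fully informative about the target
unrelated = [0, 1, 0, 1]       # independent of the target

mi_copy = discrete_mutual_info(copy_of_target, target)
mi_unrelated = discrete_mutual_info(unrelated, target)
print(mi_copy, mi_unrelated)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
A feature that determines a balanced binary target reaches the maximum of log 2 nats, and an independent feature scores zero. Unlike permutation importance, the score does not involve any model. </p>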

In [None]:
def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")


plt.figure(dpi=100, figsize=(10, 7))
plot_mi_scores(mi_scores_i.head(30))

<p style="font-size:16px; font-family:verdana; line-height: 1.7em; margin-left:20px">
<b>This second selection technique confirms the results of permutation feature importance.</b></p>
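
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
One way to make the agreement between two rankings less of an eyeball judgement is to count how many features their top-k lists share. The sketch below does that for two score dictionaries. The helper name <code>top_k_overlap</code>, the feature names, and all scores are made up for illustration; they are not results from this notebook. </p>

```python
def top_k_overlap(scores_a, scores_b, k):
    """Features that appear in the top k of both score dictionaries."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return sorted(top_a & top_b)


# Made-up scores for hypothetical features, NOT results from this notebook
perm_scores_toy = {"feat_a": 0.30, "feat_b": 0.10, "feat_c": 0.02, "feat_d": 0.00}
mi_scores_toy = {"feat_a": 0.12, "feat_b": 0.01, "feat_c": 0.05, "feat_d": 0.00}

shared = top_k_overlap(perm_scores_toy, mi_scores_toy, k=2)
print(shared)
```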

<a id="section-five"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 5 | Feature engineering</div>

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Counting the binary features  </p>

In [None]:
train["binary_count"] = train[binary_features].sum(axis=1)

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Stats </p>

In [None]:
train['mean'] = train[conts_features].mean(axis=1)
train['std'] = train[conts_features].std(axis=1)
train['median'] = train[conts_features].median(axis=1)
train['kurt'] = train[conts_features].kurtosis(axis=1)
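
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
To show what these row-wise features contain, here is the same computation for a single row using only the standard library. The helper <code>row_features</code> and the example values are hypothetical. Two assumptions are worth stating: the binary columns are coded as 0/1, so that their sum is a count of ones, and the standard deviation is the sample version (<code>ddof=1</code>), which is the pandas default. Kurtosis is left out to keep the sketch short. </p>

```python
import statistics


def row_features(binary_values, continuous_values):
    """Row-level aggregates for one observation."""
    return {
        "binary_count": sum(binary_values),              # number of ones
        "mean": statistics.mean(continuous_values),
        "std": statistics.stdev(continuous_values),      # sample std, ddof=1
        "median": statistics.median(continuous_values),
    }


row = row_features(binary_values=[1, 0, 1],
                   continuous_values=[1.0, 2.0, 3.0, 4.0])
print(row)
```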

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Clustering </p>

In [None]:
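kmeans = KMeans(n_clusters=5, random_state=0)

# Cluster on the feature columns only: fitting on the whole frame would also use
# 'id', 'kfold' and 'target', which leaks the label into the new feature.
cluster_cols = [col for col in train.columns if col not in ('id', 'kfold', 'target')]
train["cluster"] = kmeans.fit_predict(train[cluster_cols])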
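
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
The cluster label is simply the index of the nearest centroid once k-means has converged. The sketch below shows the two alternating steps (assign, then update) in plain Python. It is a simplified illustration under stated assumptions: the names <code>sq_dist</code> and <code>kmeans_sketch</code> are hypothetical, the starting centroids are passed in by hand, and the number of iterations is fixed, whereas scikit-learn's <code>KMeans</code> chooses its own initial centroids and stops on convergence. </p>

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sketch(points, centroids, n_iter=10):
    """Return (labels, centroids) after a fixed number of Lloyd iterations."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: every point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))
            for point in points
        ]
        # Update step: every centroid moves to the mean of its members
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append([sum(axis) / len(members) for axis in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster where it was
        centroids = updated
    return labels, centroids


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans_sketch(points, centroids=[points[0], points[3]])
print(labels)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Since the labels depend on every column passed in, they are only safe to use as a feature when those columns exclude the target. </p>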

<a id="section-six"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 6 | Final feature selection</div>

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Permutation </p>

In [None]:
features = [feature for feature in train.columns if feature not in ('id', 'kfold','target')]
binary_features = [feature for feature in features if len(train[feature].unique()) == 2]
conts_features = [feature for feature in features if feature not in binary_features]

In [None]:
fold = 0
X_train = train[train.kfold != fold].reset_index(drop=True)
X_valid = train[train.kfold == fold].reset_index(drop=True)

y_train = X_train['target']
y_valid = X_valid['target']

X_train = X_train[features]
X_valid = X_valid[features]

In [None]:
my_model = xgb.XGBClassifier(eval_metric='logloss',
                             tree_method='gpu_hist', 
                             gpu_id=0,
                             predictor="gpu_predictor",
                             random_state=0).fit(X_train, y_train)

In [None]:
perm = PermutationImportance(my_model, random_state=1).fit(X_valid, y_valid)
eli5.show_weights(perm, top=25, feature_names = X_valid.columns.tolist())

<p style="font-size:20px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Mutual information </p>

In [None]:
discrete_features_f = X_train.dtypes == int

In [None]:
mi_scores_f = make_mi_scores(X_train.head(50000), y_train.head(50000), discrete_features_f)

In [None]:
plt.figure(dpi=100, figsize=(10, 7))
plot_mi_scores(mi_scores_f.head(30))

<p style="font-size:16px; font-family:verdana; line-height: 1.7em; margin-left:20px">
<b> Of all the features we created, only binary_count appears among the most important ones, and only according to mutual information, which in my opinion is less reliable than permutation feature importance because the latter is measured on a validation set.</b></p>