# <b> <span style='color:#808080'>- |</span> Explanation </b>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">   
This notebook presents an idea for dealing with the large number of features in this dataset. First, we apply a feature selection method to reduce the number of features and keep only the best ones. I tried mutual information and SHAP, and I think SHAP works better. Then, to establish a baseline score, we train a model on the selected features only; this score will be comparatively low, but it uses only 20 features. After that, we train another model on the SHAP-selected features plus each non-selected feature, one at a time, to see whether any of them adds some "info" for predicting the target value. Finally, we train a model on the SHAP-selected features together with the features picked out in that second step. </p>
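
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The steps above can be sketched in a few lines. This is only a minimal illustration, not the code used later in the notebook: <code>select_extra_features</code>, <code>toy_score</code> and the toy feature names are hypothetical, and <code>toy_score</code> stands in for "train a model and return the validation AUC".
</p>

```python
from typing import Callable, Dict, List, Tuple


def select_extra_features(
    base_features: List[str],
    candidates: List[str],
    score_fn: Callable[[List[str]], float],
    threshold: float,
) -> Tuple[float, Dict[str, float], List[str]]:
    """Score the base set, then base + one candidate at a time,
    and keep the candidates whose gain over the baseline exceeds threshold."""
    baseline = score_fn(base_features)
    gains = {c: score_fn(base_features + [c]) - baseline for c in candidates}
    kept = [c for c, gain in gains.items() if gain > threshold]
    return baseline, gains, kept


# Hypothetical per-feature contributions, used only to fake a score
TOY_VALUE = {"a": 0.2, "b": 0.1, "c": 0.03, "d": 0.0}


def toy_score(features: List[str]) -> float:
    # Stand-in for "train a model on these features and return its AUC"
    return 0.5 + sum(TOY_VALUE[f] for f in features)


baseline, gains, kept = select_extra_features(
    ["a", "b"], ["c", "d"], toy_score, threshold=0.01
)
final_score = toy_score(["a", "b"] + kept)
print(baseline, kept, final_score)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
In the notebook itself, the role of <code>score_fn</code> is played by fitting an XGBoost model and computing the AUC on one validation fold.
</p>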


# <b> <span style='color:#808080'>- |</span> Table of Contents</b>

* [1-Libraries and data loading](#section-one)
* [2-Feature management](#section-two)
* [3-Baseline](#section-three)
* [4-Comparison](#section-four)
* [5-Final AUC with the selected features](#section-five)

<a id="section-one"></a>
# <b>1 <span style='color:#808080'>|</span> Libraries and Data loading</b>

In [None]:
import numpy as np 
import pandas as pd
import xgboost as xgb
from sklearn import metrics

In [None]:
train = pd.read_parquet('../input/playgroundkfold/train_kfold_play_oct_orig.parquet')

<a id="section-two"></a>
# <b>2 <span style='color:#808080'>|</span> Feature management</b>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The process begins by separating the set of features to use in the analysis.
</p>

In [None]:
features = [feature for feature in train.columns if feature not in ('id', 'kfold','target')]

In [None]:
print(f'Starting feature len =  {len(features)}')

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The features below were selected with the SHAP study.
</p>

In [None]:
shaped_features = ['f22','f179','f69','f58','f214','f78','f136','f156','f8','f3',
                   'f77','f92','f19','f200','f18','f247','f12','f211','f43','f201']
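
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The SHAP study itself is not part of this notebook. As a rough sketch of how such a list could be obtained, assuming the features were ranked by mean absolute SHAP value (a common convention, but an assumption here), one could do something like the following. The matrix <code>toy_shap</code> is made up; in practice it would come from something like <code>shap.TreeExplainer</code>, which is not run here, and <code>top_k_by_mean_abs_shap</code> is a hypothetical helper.
</p>

```python
import numpy as np


def top_k_by_mean_abs_shap(shap_values, feature_names, k):
    """Rank features by mean |SHAP value| over the samples and return the top k names."""
    importance = np.abs(shap_values).mean(axis=0)   # one value per feature
    order = np.argsort(importance)[::-1][:k]        # indices, most important first
    return [feature_names[i] for i in order]


# Made-up SHAP matrix: 3 samples (rows) x 4 features (columns)
toy_shap = np.array([
    [0.10, -0.50, 0.00, 0.02],
    [-0.20, 0.40, 0.01, -0.03],
    [0.15, -0.60, 0.00, 0.01],
])
toy_names = ['f0', 'f1', 'f2', 'f3']

top_two = top_k_by_mean_abs_shap(toy_shap, toy_names, k=2)
print(top_two)
```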

In [None]:
features_to_comp = [feature for feature in features if feature not in shaped_features]

In [None]:
print(f'Length of features to compare = {len(features_to_comp)}')

<a id="section-three"></a>
# <b>3 <span style='color:#808080'>|</span> Baseline</b>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
To establish a baseline, I calculated the AUC for a single fold. </p>

In [None]:
fold = 0 
    
X_train = train[train.kfold != fold].reset_index(drop=True)
X_valid = train[train.kfold == fold].reset_index(drop=True)

y_train = X_train['target'].values
y_valid = X_valid['target'].values
    
X_train = X_train[shaped_features].values
X_valid = X_valid[shaped_features].values

# Model 
model = xgb.XGBClassifier(n_estimators = 20000, random_state=0, objective = 'binary:logistic',use_label_encoder=False,
                          tree_method='gpu_hist', gpu_id=0,predictor="gpu_predictor"
                             )
model.fit(X_train, y_train, early_stopping_rounds=20, eval_set=[(X_valid, y_valid)], eval_metric=['auc'],verbose=False) 

preds_valid = model.predict_proba(X_valid)[:,1]
auc_baseline = metrics.roc_auc_score (y_valid, preds_valid)
    
print(f'fold {fold} auc = {auc_baseline}')
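
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
As a reminder of what this score measures: for a binary target, the AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher predicted score, with ties counted as half. The pure-Python sketch below is only meant to illustrate that definition; <code>pairwise_auc</code> is a hypothetical helper and far too slow for the real data, where <code>metrics.roc_auc_score</code> is used.
</p>

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two negatives and two positives; one of the four pairs is ranked the wrong way
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```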

<a id="section-four"></a>
# <b>4 <span style='color:#808080'>|</span> Comparison</b>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
In the comparison, I loop over the non-selected features, adding one at a time to the SHAP-selected set, to see whether any of them adds value and improves the prediction. </p>

In [None]:
scores = dict()

# The fold split does not depend on the candidate feature, so do it once
fold = 0

train_fold = train[train.kfold != fold].reset_index(drop=True)
valid_fold = train[train.kfold == fold].reset_index(drop=True)

y_train = train_fold['target'].values
y_valid = valid_fold['target'].values

# Loop over all the non-selected features
for feature in features_to_comp:
    # Add the candidate feature to the SHAP-selected ones for train and valid
    candidate_features = shaped_features + [feature]
    X_train = train_fold[candidate_features].values
    X_valid = valid_fold[candidate_features].values

    # Same model settings as the baseline, so the AUC values are comparable
    model = xgb.XGBClassifier(n_estimators=20000, random_state=0, objective='binary:logistic',
                              use_label_encoder=False,
                              tree_method='gpu_hist', gpu_id=0, predictor='gpu_predictor')
    model.fit(X_train, y_train, early_stopping_rounds=20, eval_set=[(X_valid, y_valid)],
              eval_metric=['auc'], verbose=False)

    preds_valid = model.predict_proba(X_valid)[:, 1]
    auc = metrics.roc_auc_score(y_valid, preds_valid)

    scores[feature] = np.round(auc, 10)

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Convert the dict of per-feature AUC values to a DataFrame.
</p>

In [None]:
scores_features_df = pd.DataFrame.from_dict(scores, orient='index').reset_index().rename(columns={'index': 'Feature', 0: 'AUC'})

In [None]:
scores_features_df.head()

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Add the baseline score as a column so that the difference can be computed.
</p>

In [None]:
scores_features_df['Baseline AUC'] = auc_baseline

In [None]:
scores_features_df.head()

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Finally, we compute the difference between the score with the added feature and the baseline score. We can also set a threshold on this difference to decide which features to keep.</p>

In [None]:
scores_features_df['Difference'] =  scores_features_df['AUC'] - scores_features_df['Baseline AUC']

In [None]:
pd.set_option('display.float_format', lambda x: '%.10f' % x) # Show floats in fixed-point notation instead of scientific

In [None]:
scores_features_df.head()

In [None]:
scores_features_df.to_csv('features_selected.csv', index=False)

In [None]:
selec_features = list(scores_features_df[scores_features_df['Difference'] > 0.0001].Feature.values)
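
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The 0.0001 threshold is just one possible choice. To get a feeling for how sensitive the selection is to it, one could count how many features survive at a few different thresholds. The sketch below does this on a made-up table with the same columns as <code>scores_features_df</code>; the feature names and AUC values in <code>toy_df</code> are invented for illustration only.
</p>

```python
import pandas as pd

# Made-up scores with the same layout as scores_features_df
toy_df = pd.DataFrame({
    'Feature': ['fa', 'fb', 'fc', 'fd'],
    'AUC': [0.8402, 0.83995, 0.8398, 0.8412],
})
toy_df['Baseline AUC'] = 0.8399
toy_df['Difference'] = toy_df['AUC'] - toy_df['Baseline AUC']

# Largest gains first, to inspect the most promising candidates
ranked = toy_df.sort_values('Difference', ascending=False).reset_index(drop=True)

# Number of features kept for each candidate threshold
kept_per_threshold = {
    t: int((toy_df['Difference'] > t).sum())
    for t in (0.0, 0.0001, 0.0005, 0.001)
}
print(ranked[['Feature', 'Difference']])
print(kept_per_threshold)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Since every score here comes from a single fold, very small differences may just be noise, so it can be worth checking how stable the kept set is around the chosen threshold.
</p>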

In [None]:
print(f'Length of the final set of features = {len(shaped_features+selec_features)}')

<a id="section-five"></a>
# <b>5 <span style='color:#808080'>|</span> Final AUC with the selected features</b>

In [None]:
fold = 0 
    
X_train = train[train.kfold != fold].reset_index(drop=True)
X_valid = train[train.kfold == fold].reset_index(drop=True)

y_train = X_train['target'].values
y_valid = X_valid['target'].values
    
X_train = X_train[shaped_features+selec_features].values
X_valid = X_valid[shaped_features+selec_features].values

# Model 
model = xgb.XGBClassifier(n_estimators = 20000, random_state=0, objective = 'binary:logistic',use_label_encoder=False,
                          tree_method='gpu_hist', gpu_id=0,predictor="gpu_predictor"
                             )
model.fit(X_train, y_train, early_stopping_rounds=20, eval_set=[(X_valid, y_valid)], eval_metric=['auc'],verbose=False) 

preds_valid = model.predict_proba(X_valid)[:,1]
auc_final = metrics.roc_auc_score (y_valid, preds_valid)
    
print(f'fold {fold}, final auc = {auc_final}')

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
<b> Insights:</b> By keeping only the features whose AUC gain over the baseline is greater than 0.0001 (for example), we reduce the number of continuous features from 285 to 105, and the score increases from 0.8399 (with only the SHAP features) to 0.8503. Selecting a reduced number of features also shortens the training time and should reduce overfitting, which should help the model on unseen data.
</p>

<p style="font-size:20px; font-family:verdana; line-height: 1.7em">
<b>I hope you found this interesting. If you have a question or suggestion, please let me know in the comments. Greetings to all!
</b></p>