<div style="color:White; display:fill; border-radius:5px;background-color:#336b87;font-size:300%;font-family:sans-serif;letter-spacing:0.5px;text-align: center">
Feature selection & engineering
</div>

# Table of Contents

* [Introduction](#section-one)
* [Data loading](#section-two)
* [Initial feature selection (Baseline)](#section-three)
    - [Shap study](#subsection-three-one)
    - [Lime study](#subsection-three-two)
    - [Permutation feature importance](#subsection-three-three)
* [Feature engineering](#section-four)
    - [PCA decomposition](#subsection-four-one)
    - [SVD decomposition](#subsection-four-two)
    - [Polynomial features](#subsection-four-three)
    - [Tsfresh features](#subsection-four-four)
    - [Binning](#subsection-four-five)
    - [Stats features](#subsection-four-six)
    - [Log transformation](#subsection-four-seven)
* [Final feature selection](#section-five)
    - [Final shap study](#subsection-five-one)
* [Final thoughts](#section-six)

<a id="section-one"></a>

# Introduction

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   In this notebook I first make an initial feature selection with Shap, Lime and permutation feature importance, to see whether the three methods agree on which features matter. I then run a feature engineering process and check what results it yields. The Shap technique helped me in the previous competition. I hope you like it.
    <br><br>

In [None]:
import numpy as np 
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing, decomposition
from tsfresh.feature_extraction import feature_calculators as fc

In [None]:
import shap, lime
from lime import lime_tabular

<a id="section-two"></a>
# Data loading

In [None]:
train = pd.read_parquet('../input/playgroundkfold/train_kfold_play_nov_orig.parquet')
test = pd.read_parquet('../input/playgroundkfold/test_play_nov.parquet')

In [None]:
features = [feature for feature in train.columns if feature not in ('id','kfold', 'target')]

<a id="section-three"></a>
# Initial feature selection with Shap, Lime & Permutation (Baseline)

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   Split the data for the Shap study and train an XGBoost model: fold 0 is held out for validation and the remaining folds are used for training.
    <br><br>

In [None]:
X_train = train.query('kfold != 0')
X_valid = train.query('kfold == 0')

y_train = X_train['target'].copy()
y_valid = X_valid['target'].copy()
X_train = X_train[features].values.copy()
X_valid = X_valid[features].values.copy()

In [None]:
model1 = xgb.XGBClassifier(n_estimators=1000, use_label_encoder=False, eval_metric = 'auc',
                          tree_method='gpu_hist', gpu_id=0,predictor="gpu_predictor").fit(X_train, y_train)

<a id="subsection-three-one"></a>
<h2> SHAP study
    </h2> 

In [None]:
booster_xgb1 = model1.get_booster()
shap_values_xgb1 = booster_xgb1.predict(xgb.DMatrix(X_train, y_train), pred_contribs=True)

In [None]:
shap_values_xgb1 = shap_values_xgb1[:, :-1]

<h3> 
SHAP Summary Plot
</h3>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   With this plot we can visualize the overall impact of the features across multiple instances. For that reason, the result of the Shapley study is much more reliable for establishing which features are the most relevant than, for example, the built-in feature importance of a tree-based model.
    <br><br>

In [None]:
shap.summary_plot(shap_values_xgb1, X_train, feature_names=train[features].columns);

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
This graph tells us which features are the most important and the range of their effects over the data set. The features are sorted by their impact on the model's prediction. This technique helped me a lot in the previous competition: it reduced the training times of the models, and I think it also helped to avoid some overfitting, because reducing the number of features gives a more generalizable model.
    <br><br>
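
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The ranking described above can be approximated in code instead of being read off the plot. The following is a minimal sketch, assuming the features are ordered by their mean absolute SHAP value; the helper name <code>rank_features_by_shap</code> and the toy matrix are illustrative and not part of the original notebook.
    <br><br>

```python
import numpy as np

def rank_features_by_shap(shap_values, feature_names, top_k=20):
    """Return the top_k feature names ordered by mean absolute SHAP value."""
    mean_abs = np.abs(shap_values).mean(axis=0)   # one importance score per feature
    order = np.argsort(mean_abs)[::-1][:top_k]    # indices of the largest scores first
    return [feature_names[i] for i in order]

# Toy contributions (rows = samples, columns = features), not competition data
toy_shap = np.array([[0.5, -0.1, 0.0],
                     [-0.7, 0.2, 0.1],
                     [0.6, -0.3, 0.0]])
toy_ranking = rank_features_by_shap(toy_shap, ['f0', 'f1', 'f2'], top_k=2)
print(toy_ranking)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
In this notebook the analogous call would be <code>rank_features_by_shap(shap_values_xgb1, features, top_k=20)</code>. Whether it reproduces exactly the hand-written list in the next cell depends on the trained model.
    <br><br>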

In [None]:
shaped_features = ['f34','f55','f8','f43','f91','f71','f80','f27','f50','f97',
                   'f41','f66','f57','f25','f22','f96','f82','f81','f26','f40']

<a id="subsection-three-two"></a>
<h2> 
Lime
</h2>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Take this approach with a grain of salt, because it is a "unirow" method: it can only explain one row at a time, which is very different from Shap and Permutation. I just want to show it to you because the results are quite nice, and the fact that it is a unirow approach may help to understand the problem better, thanks to the obvious reduction in complexity. You can also use the results to run some experiments and see what happens. I think this study could be repeated on randomly sampled rows to check how much the output varies. Maybe I will do this in the future.
    <br><br>
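
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The idea of repeating the study on randomly sampled rows could look roughly like the sketch below. It is only an outline under stated assumptions: <code>aggregate_local_explanations</code> is a hypothetical helper, and <code>fake_explainer</code> is a stand-in that replaces the real Lime call so the example runs without a trained model.
    <br><br>

```python
import numpy as np

def aggregate_local_explanations(explain_fn, X, n_rows=50, seed=0):
    """Average the absolute local weight of each feature over randomly sampled rows.

    explain_fn(row) must return a list of (feature_index, weight) pairs.
    """
    rng = np.random.default_rng(seed)
    row_ids = rng.choice(X.shape[0], size=min(n_rows, X.shape[0]), replace=False)
    totals = np.zeros(X.shape[1])
    for i in row_ids:
        for feature_index, weight in explain_fn(X[i]):
            totals[feature_index] += abs(weight)
    return totals / len(row_ids)

def fake_explainer(row):
    """Hypothetical stand-in for a local explainer: weight = coefficient * value."""
    coefs = np.array([2.0, 0.0, -0.5])
    return [(j, coefs[j] * row[j]) for j in range(len(row))]

X_toy = np.random.default_rng(1).normal(size=(200, 3))
mean_abs_weights = aggregate_local_explanations(fake_explainer, X_toy, n_rows=50)
print(mean_abs_weights)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
With Lime, <code>explain_fn</code> could wrap <code>explain_instance</code> and read the (feature index, weight) pairs from the explanation's <code>as_map()</code> output. That wiring is an assumption to check against the Lime documentation, and each explanation is slow, so a small <code>n_rows</code> is advisable.
    <br><br>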

In [None]:
lime_explainer = lime_tabular.LimeTabularExplainer(training_data=X_train,
                                                   feature_names=train[features].columns,
                                                   mode='classification',  # model1 is a classifier
                                                   verbose=False,
                                                   random_state=42)

In [None]:
test_sample = X_valid[5,:]

In [None]:
lime_exp = lime_explainer.explain_instance(data_row=test_sample,
                                           num_features=20,
                                           predict_fn=model1.predict_proba)

<h3> 
Lime results
</h3>

In [None]:
plot = lime_exp.as_pyplot_figure()
plot.tight_layout()

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
This graph shows the 20 most important features for row 6 (index 5) of the validation set; keep in mind that it explains only that row. The red bars correspond to negative contributions to the prediction, and the green bars to positive ones. You can check more features by changing the num_features parameter in the explain_instance call.
    <br><br>

In [None]:
lime_exp.show_in_notebook(show_table=True, show_all=False)

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Another way to see the result is this one. I really like the format, which is why I show it to you.
    <br><br>

<a id="subsection-three-three"></a>
<h2> 
Permutation feature importance
</h2>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
This is another technique, which consists of randomly shuffling a single column of the validation data, leaving the target and all other columns in place, and measuring how this shuffle affects the accuracy of the predictions on the now-shuffled data. There is a mini Kaggle course where you can learn more about it; the link is below.
https://www.kaggle.com/learn/machine-learning-explainability    <br><br>
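
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
To make the shuffling procedure concrete, here is a minimal NumPy sketch of it. It is not the eli5 implementation: it uses plain accuracy instead of ROC AUC, and the "model" is a fixed illustrative rule (<code>toy_predict</code>) rather than a trained classifier.
    <br><br>

```python
import numpy as np

def permutation_importance(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled on its own."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict_fn(X) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffle one column only, so its link with the target is broken
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops[j] += baseline - np.mean(predict_fn(X_shuffled) == y)
    return baseline, drops / n_repeats

def toy_predict(X):
    """Illustrative rule standing in for a model: only column 0 is used."""
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)   # the label depends on column 0 only

baseline_acc, acc_drops = permutation_importance(toy_predict, X_toy, y_toy)
print(baseline_acc, acc_drops)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Shuffling column 0 should cost a large share of the accuracy, while shuffling the two unused columns should cost nothing. That difference is what the importance table reports.
    <br><br>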

In [None]:
!pip install eli5 --upgrade --quiet

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(model1,scoring = 'roc_auc', random_state=0).fit(X_valid, y_valid)
eli5.show_weights(perm, feature_names = train[features].columns.tolist())

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
We can see that the permutation results are very similar to the Shap ones. I think that is because both methods measure how much the model's predictions depend on each feature, but it is nice to see some confirmation.
    <br><br>

<a id="section-four"></a>
# Feature engineering

<h3>
Standardizing the features
</h3>

In [None]:
scl = preprocessing.StandardScaler()
train_scl = scl.fit_transform(train[shaped_features])
test_scl = scl.transform(test[shaped_features])

<a id="subsection-four-one"></a>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   <b> 1 - PCA decomposition:</b>
    PCA, or Principal Component Analysis, gives us our ideal set of features. It creates a set of principal components that are rank-ordered by variance (the first component has higher variance than the second, and so on), uncorrelated, and low in number (we can throw away the lower-ranked components, as they contain little signal). In this case I apply the PCA decomposition only to the features selected by Shap, not to the whole set. For more information about PCA, you can check the Kaggle course, https://www.kaggle.com/ryanholbrook/principal-component-analysis.
    <br><br>
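
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The two properties mentioned above (components ordered by variance and uncorrelated) can be checked on toy data. This is a small sketch that computes PCA through NumPy's SVD on standardized data; it is meant as an illustration of the idea, not as a replacement for the scikit-learn call used in the next cell, and the toy columns are made up.
    <br><br>

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two strongly correlated columns plus one independent column
base = rng.normal(size=500)
X_toy = np.column_stack([base,
                         base + 0.1 * rng.normal(size=500),
                         rng.normal(size=500)])

# Standardize, then decompose the centered matrix
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

components = X_std @ Vt.T                  # principal component scores
explained_ratio = S**2 / np.sum(S**2)      # share of the variance per component
component_cov = np.cov(components, rowvar=False)

print(explained_ratio)
print(np.round(component_cov, 6))
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The explained variance ratios come out in decreasing order, and the covariance matrix of the components is diagonal up to rounding, which is what "uncorrelated" means here.
    <br><br>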

In [None]:
# Create principal components
pca = decomposition.PCA()
train_pca = pca.fit_transform(train_scl)
test_pca = pca.transform(test_scl)

# Convert to dataframe
component_names_1 = [f"PC{i+1}" for i in range(train_pca.shape[1])]
train_pca = pd.DataFrame(train_pca, columns=component_names_1)

component_names_2 = [f"PC{i+1}" for i in range(test_pca.shape[1])]
test_pca = pd.DataFrame(test_pca, columns=component_names_2)

<a id="subsection-four-two"></a>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   <b> 2 - SVD decomposition:</b>
    I am going to use SVD, or Singular Value Decomposition, for dimensionality reduction, and also to see whether it helps to denoise the data. In this case I apply the SVD to the whole dataset. You can check the documentation here, https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html, or explanations on TDS here, https://towardsdatascience.com/search?q=svd.
    <br><br>
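
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
As a rough illustration of the denoising idea, the sketch below builds a toy low-rank signal, adds noise, and keeps only the first k singular directions. It uses NumPy's exact SVD rather than scikit-learn's TruncatedSVD, and the data, the noise level and k = 2 are assumptions chosen for the example.
    <br><br>

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy rank-2 signal with 10 columns, plus additive noise
signal = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 10))
noisy = signal + 0.3 * rng.normal(size=signal.shape)

U, S, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 2
reduced = U[:, :k] * S[:k]     # reduced representation with k columns
denoised = reduced @ Vt[:k]    # rank-k reconstruction in the original space

error_noisy = np.linalg.norm(noisy - signal)
error_denoised = np.linalg.norm(denoised - signal)
print(error_noisy, error_denoised)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
On this toy data the rank-k reconstruction is closer to the clean signal than the noisy matrix is. On real data the benefit depends on how much of the signal actually lives in the first few components.
    <br><br>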

<h3>
Standardizing the features
</h3>

In [None]:
scl = preprocessing.StandardScaler()
train_scl = scl.fit_transform(train[features])
test_scl = scl.transform(test[features])

In [None]:
# Create svd components
svd = decomposition.TruncatedSVD()  # keeps n_components=2 by default
train_svd = svd.fit_transform(train_scl)
test_svd = svd.transform(test_scl)

# Convert to dataframe
component_names_1 = [f"SVD{i+1}" for i in range(train_svd.shape[1])]
train_svd = pd.DataFrame(train_svd, columns=component_names_1)

component_names_2 = [f"SVD{i+1}" for i in range(test_svd.shape[1])]
test_svd = pd.DataFrame(test_svd, columns=component_names_2)

<a id="subsection-four-three"></a>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   <b> 3 - Polynomial features:</b>
    This process generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]. Here I set interaction_only=True and include_bias=False, so only the original features and their pairwise products are kept ([a, b, ab] in the example). You can learn more about it here, https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html.
    <br><br>
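
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
The [a, b] example can be reproduced by hand. The sketch below is a plain-Python stand-in for the degree-2 expansion without the bias term; <code>degree2_features</code> is an illustrative helper, not the scikit-learn implementation.
    <br><br>

```python
from itertools import combinations, combinations_with_replacement

import numpy as np

def degree2_features(row, interaction_only=False):
    """Degree-2 expansion of one sample, without the bias term."""
    # combinations skips the squares (i == j); combinations_with_replacement keeps them
    pairs = combinations if interaction_only else combinations_with_replacement
    products = [row[i] * row[j] for i, j in pairs(range(len(row)), 2)]
    return np.concatenate([row, products])

sample = np.array([2.0, 3.0])                              # [a, b]
full = degree2_features(sample)                            # [a, b, a^2, ab, b^2]
inter = degree2_features(sample, interaction_only=True)    # [a, b, ab]
print(full, inter)
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
For 20 input features, the interaction-only variant gives the 20 original columns plus 190 pairwise products, 210 columns in total.
    <br><br>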

In [None]:
poly = preprocessing.PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
train_poly = poly.fit_transform(train[shaped_features])
test_poly = poly.transform(test[shaped_features])  # reuse the transformer fitted on train

train_poly_df = pd.DataFrame(train_poly, columns=[f"POLY_{i}" for i in range(train_poly.shape[1])])
test_poly_df = pd.DataFrame(test_poly, columns=[f"POLY_{i}" for i in range(test_poly.shape[1])])


<a id="subsection-four-four"></a>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
   <b>4 - Tsfresh Features:</b>
     tsfresh offers hundreds of features, and tens of variations of them, that you can use to build time-series-based features. You can learn more about it here, https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html.
    <br><br>
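
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
To show what the four calculators used in the next cell compute, here are NumPy equivalents written from the definitions in the tsfresh documentation, as I understand them. Each one reduces a whole series to a single number, so when it is applied to an entire column the new column holds the same value in every row. The row-wise call at the end is only one possible alternative, assuming each row can be treated as a short sequence.
    <br><br>

```python
import numpy as np

def abs_energy(x):
    """Sum of squared values."""
    return float(np.dot(x, x))

def absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return float(np.sum(np.abs(np.diff(x))))

def mean_abs_change(x):
    """Mean of absolute differences between consecutive values."""
    return float(np.mean(np.abs(np.diff(x))))

def mean_change(x):
    """Mean of the signed differences, i.e. (last - first) / (n - 1)."""
    return float((x[-1] - x[0]) / (len(x) - 1))

series = np.array([1.0, 3.0, 2.0, 5.0])
summary = {
    'abs_energy': abs_energy(series),
    'absolute_sum_of_changes': absolute_sum_of_changes(series),
    'mean_abs_change': mean_abs_change(series),
    'mean_change': mean_change(series),
}
print(summary)

# Possible row-wise variant: one value per row instead of one per column
X_rows = np.array([[1.0, 3.0, 2.0, 5.0],
                   [0.0, 0.0, 1.0, 1.0]])
row_energy = np.apply_along_axis(abs_energy, 1, X_rows)
print(row_energy)
```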

In [None]:
for feature in shaped_features:
    train[f'AbsEnergy_{feature}'] = fc.abs_energy(train[feature])
    train[f'AbsSumChanges_{feature}'] = fc.absolute_sum_of_changes(train[feature])
    train[f'MeanAbsChange_{feature}'] = fc.mean_abs_change(train[feature])
    train[f'MeanChange_{feature}'] = fc.mean_change(train[feature])

<a id="subsection-four-five"></a>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
<b>5 - Binning: </b> Each feature is cut into 5 equal-width bins, and the bin index is stored as a new column.
    <br><br>
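
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
If the same bins are needed on the test set, one option is to keep the edges learned on train and reuse them. The sketch below assumes that approach; the toy series and the choice to open the outer edges to infinity (so that unseen values still get a bin) are illustrative, not taken from the notebook.
    <br><br>

```python
import numpy as np
import pandas as pd

train_col = pd.Series([0.0, 1.0, 2.0, 5.0, 10.0])
test_col = pd.Series([-1.0, 3.0, 12.0])

# Learn 5 equal-width bins on train and keep the edges
train_bins, edges = pd.cut(train_col, bins=5, labels=False, retbins=True)

# Open the outer edges so test values outside the train range are not set to NaN
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf
test_bins = pd.cut(test_col, bins=edges, labels=False)

print(train_bins.tolist(), test_bins.tolist())
```

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
Calling <code>pd.cut</code> with <code>bins=5</code> separately on train and test would compute different edges for each, so the same bin index could mean different value ranges.
    <br><br>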

In [None]:
for col in features:
    train[col+'_bin'] = pd.cut(train[col], bins=5, labels=False)

<a id="subsection-four-six"></a>
<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
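<b>6 - Stats features: </b> Row-wise statistics (mean, median, standard deviation, variance and kurtosis) computed across all the original features.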
    <br><br>

In [None]:
train['mean'] = train[features].mean(axis=1)
train['median'] = train[features].median(axis=1)
train['std'] = train[features].std(axis=1)
train['var'] = train[features].var(axis=1)
train['kurt'] = train[features].kurtosis(axis=1)

<a id="section-five"></a>
# Final feature selection

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
  The first thing we do is concatenate the train and test datasets with their respective PCA, SVD and polynomial datasets.
    <br><br>

In [None]:
train = pd.concat([train, train_pca], axis=1)
test = pd.concat([test, test_pca], axis=1)

In [None]:
train = pd.concat([train, train_svd], axis=1)
test = pd.concat([test, test_svd], axis=1)

In [None]:
train = pd.concat([train, train_poly_df], axis=1)
test = pd.concat([test, test_poly_df], axis=1)

<a id="subsection-five-one"></a>
<h2>
Final Shap study
</h2>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
For the final selection of features I am going to use only Shap, because the Permutation approach is quite resource-demanding.
<br><br>

In [None]:
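X_train = train.query('kfold != 0')
X_valid = train.query('kfold == 0')

# Use every column except the identifiers and the target,
# so that the engineered features are part of the study
final_features = [col for col in train.columns if col not in ('id', 'kfold', 'target')]
X_train = X_train[final_features].copy()
X_valid = X_valid[final_features].copy()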

In [None]:
model2 = xgb.XGBClassifier(n_estimators=1000, use_label_encoder=False, eval_metric = 'auc',
                          tree_method='gpu_hist', gpu_id=0,predictor="gpu_predictor").fit(X_train, y_train)

In [None]:
booster_xgb2 = model2.get_booster()
shap_values_xgb2 = booster_xgb2.predict(xgb.DMatrix(X_train, y_train), pred_contribs=True)

In [None]:
shap_values_xgb2 = shap_values_xgb2[:, :-1]

<h3>
SHAP Summary Plot
</h3>

In [None]:
shap.summary_plot(shap_values_xgb2, X_train, feature_names=X_train.columns);

<a id="section-six"></a>
# Final thoughts

<p style="font-size:15px; font-family:verdana; line-height: 1.7em">
    We can see that the features we created are useful, because they provide some information to predict the target. I will be adding more features as they occur to me. If you have any questions or suggestions, or if I made a mistake, please let me know. Good luck to all!
    <br><br>