## Feature selection
Although this approach to feature selection is handy when little is known about the underlying data, business knowledge should be incorporated in this phase as well. Furthermore, this step takes a long time even for small to medium-sized datasets, so it does not scale. Possible solutions:
- increase step_size of the RFECV functions
- decrease the number of estimators in the RFECV functions
- use a subset of the original dataset
- ..
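
As a rough illustration of the "use a subset" option, the sketch below draws a stratified sample so that the churn rate in the subset stays close to that of the full data. The synthetic frame and its feature_a / feature_b columns are made up for the example; only the target name toon_churn comes from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (assumption: roughly 10% churners)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "feature_a": rng.normal(size=1000),
        "feature_b": rng.integers(0, 5, size=1000),
        "toon_churn": rng.choice([0, 1], size=1000, p=[0.9, 0.1]),
    }
)

# Sample 20% of the rows within each target class (stratified subset)
subset = demo.groupby("toon_churn").sample(frac=0.2, random_state=0)

print(len(demo), len(subset))
print(demo["toon_churn"].mean(), subset["toon_churn"].mean())
```

Whether features selected on a 20% sample carry over to the full dataset is something to verify, not a given.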

The goal of this approach is to keep the best features. Since we combine the different lists, and may add features beyond the ones these methods select, we are probably better off keeping a high-level indicator of the most relevant features.

To measure the difference in run time formally, we use the timeit module: https://www.pythoncentral.io/time-a-python-function/

In [1]:
import os
from collections import Counter

import numpy as np
import pandas as pd

import src.features.feature_selection as feat_select

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
train_path_cleaned = os.path.join("..", "data", "train_set_cleaned")

In [4]:
df = pd.read_csv(
    train_path_cleaned, sep=";", decimal=".", low_memory=False, compression="zip"
)

In [5]:
df.dtypes.value_counts()

int64      308
float64      3
object       1
dtype: int64

#### Fix objects

In [6]:
df.select_dtypes(include="object").head()

Unnamed: 0,klant_min_begindatum_dat
0,2013-10-31 00:00:00
1,2001-10-31 00:00:00
2,2004-04-01 00:00:00
3,1992-12-31 00:00:00
4,2016-06-14 00:00:00


In [7]:
df.drop("klant_min_begindatum_dat", axis=1, inplace=True)

### Feature selection

n_features_rf, n_features_xgb, and n_features_logreg all use recursive feature elimination with cross-validation (RFECV). From the scikit-learn documentation on RFE:

"Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached."
- use the different feat select methods
- combine the output
- add business sense
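
The procedure quoted above can be sketched with scikit-learn's RFECV directly. This is a minimal example on synthetic data, not the implementation in src.features.feature_selection; the dataset sizes, the random forest settings, and the variable names are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the churn data (assumption: 30 features, 5 informative)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

# Small forest so the example finishes quickly
estimator = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

# step=5 drops five features per iteration; a larger step is faster but coarser
selector = RFECV(estimator=estimator, step=5, cv=3, scoring="accuracy", n_jobs=1)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns
selected_columns = [i for i, keep in enumerate(selector.support_) if keep]
print(f"Optimal number of features: {selector.n_features_}")
print(f"Selected column indices: {selected_columns}")
```

Because features are removed in blocks of `step`, the reported optimum can only fall on the feature counts that the elimination actually visits, which is why the outputs further down remind the reader of the step size.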

Notes:
- The initial accuracy is quite high (88%). Are there still near-perfect predictors in the set?

Notes on improved efficiency:
- n_features_rf with cpus=4, max_depth=10, n_estimators=100, and step_size=5: 17.44 min, feat=65, max_acc=88%
- n_features_rf with cpus=4, max_depth=10, n_estimators=100, and step_size=10: 6.9 min, feat=70, max_acc=87%
- n_features_rf with cpus=4, max_depth=10, n_estimators=50, and step_size=10: 3.5 min, feat=70, max_acc=87.9%

Code to measure improved efficiency:
```python
import timeit

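# timeit expects a callable without arguments, so bind the arguments in a closure
def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

# X and y are defined in cell In [8] below
wrapped = wrapper(feat_select.n_features_rf, x_train=X, y_train=y, cpus=4)

# Run the call once and return the elapsed time in seconds
timeit.timeit(wrapped, number=1)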
```
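
The same argument binding can also be done with functools.partial from the standard library, which avoids the hand-written closure. The sketch below uses a made-up stand-in function (slow_sum) instead of the real feature selection call so that it runs on its own.

```python
import functools
import timeit


def slow_sum(n, repeat=1):
    """Hypothetical stand-in for an expensive call such as a feature selection run."""
    total = 0
    for _ in range(repeat):
        total += sum(range(n))
    return total


# Bind positional and keyword arguments; the result takes no arguments
timed_call = functools.partial(slow_sum, 100_000, repeat=5)

# number=1 runs the call once; timeit returns the elapsed time in seconds
elapsed_seconds = timeit.timeit(timed_call, number=1)
print(f"Elapsed: {elapsed_seconds / 60:.4f} min")
```

For the notebook itself, the partial would presumably wrap feat_select.n_features_rf with the same keyword arguments as in the block above.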

In [8]:
y = df["toon_churn"]
X = df.drop(["toon_churn"], axis=1)

### Changed step_size to 10 for all RFECV feature selection functions

In [10]:
rf_list = feat_select.n_features_rf(X, y, cpus=4)

RandomForest: The optimal number of features is 70 with a maximum accuracy
of 87.9%. Please keep in mind that the stepsize is 10.


In [11]:
xgb_list = feat_select.n_features_xgb(X, y, cpus=4)

XGboost: The optimal number of features is 30 with a maximum accuracy
of 85.8%. Please keep in mind that the stepsize is 10.


In [12]:
logreg_list = feat_select.n_features_logreg(X, y, cpus=4)

Log. regression: The optimal number of features is 280 with a maximum accuracy
of 87.5%. Please keep in mind that the stepsize is 5.


In [13]:
boruta_list = feat_select.n_features_boruta(X, y)

Boruta: The optimal number of features is 160


#### Combine lists and continue
- Abstract the code, notify the user that files are being written to local folders, and catch errors via Exception
- Or build the destination folder relative to the current folder
- Use os.path.join, os.scandir, and os.mkdir
- Assume the user is working with the modeling template and a Jupyter notebook
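
One possible shape for these to-dos is sketched below. The function name write_feature_list and the demo folder and file names are hypothetical. The sketch also uses os.makedirs with exist_ok=True rather than os.mkdir, and catches OSError rather than a bare Exception, so that only file system problems are handled.

```python
import os
import tempfile


def write_feature_list(features, folder, filename):
    """Write one feature name per line; return the path, or None on failure."""
    try:
        # Create the destination folder if it does not exist yet
        os.makedirs(folder, exist_ok=True)
        path = os.path.join(folder, filename)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(features))
    except OSError as error:
        print(f"Could not write {filename} to {folder}: {error}")
        return None
    # Tell the user that a file was written and where
    print(f"Wrote {len(features)} features to {path}")
    return path


# Demo in a temporary folder (assumption: feature names are placeholders)
demo_root = tempfile.mkdtemp()
demo_path = write_feature_list(
    ["feature_a", "feature_b"],
    os.path.join(demo_root, "feature_lists"),
    "combined_list_demo",
)
```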

In [15]:
extensive_list = feat_select.create_feature_lists(
    "toon_churn", rf_list, xgb_list, logreg_list, boruta_list
)

The list with 4 vote(s) contains 27 variables and can be found in ..\feature_lists\combined_list_4_votes
The list with 3 vote(s) contains 71 variables and can be found in ..\feature_lists\combined_list_3_votes
The list with 2 vote(s) contains 146 variables and can be found in ..\feature_lists\combined_list_2_votes
The list with 1 vote(s) contains 296 variables and can be found in ..\feature_lists\combined_list_1_votes
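
The counts above grow as the vote threshold drops, which suggests that each list holds the features selected by at least that many methods. The internals of create_feature_lists are not shown here, so the following is only a sketch of that voting idea using collections.Counter, with made-up feature names.

```python
from collections import Counter


def combine_by_votes(*feature_lists):
    """Map each vote threshold to the features picked by at least that many lists."""
    votes = Counter()
    for features in feature_lists:
        # set() ensures each method votes at most once per feature
        votes.update(set(features))
    max_votes = len(feature_lists)
    return {
        threshold: sorted(name for name, count in votes.items() if count >= threshold)
        for threshold in range(max_votes, 0, -1)
    }


# Toy selections from four hypothetical methods
rf_demo = ["age", "tenure", "usage"]
xgb_demo = ["age", "usage"]
logreg_demo = ["age", "tenure", "region"]
boruta_demo = ["age", "region", "usage"]

combined = combine_by_votes(rf_demo, xgb_demo, logreg_demo, boruta_demo)
for threshold, names in combined.items():
    print(f"{threshold} vote(s): {len(names)} variables")
```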


### Export dataset for modeling

In [16]:
train_path_processed = os.path.join("..", "data", "train_set_processed")

In [17]:
df[extensive_list].to_csv(
    train_path_processed, sep=";", encoding="utf-8", index=False, compression="zip"
)