... continued from [this kernel]().

**Outline**

- Feature Engineering (see the [previous kernel](http://www.kaggle.com/brendanhasz/elo-feature-engineering-and-feature-selection#feature-engineering))
- Feature Aggregations (see the [previous kernel](http://www.kaggle.com/brendanhasz/elo-feature-engineering-and-feature-selection#feature-aggregations))
- [Feature Selection](#feature-selection)
  - [Mutual Information](#mutual-information)
  - [Permutation-based Feature Importance](#permutation-based-feature-importance)
- [Conclusion](#conclusion)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc

from scipy.stats import spearmanr
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.impute import SimpleImputer

from catboost import CatBoostRegressor

# Plot settings
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
sns.set()

!pip install git+http://github.com/brendanhasz/dsutils.git
from dsutils.encoding import one_hot_encode
from dsutils.encoding import TargetEncoderCV
from dsutils.printing import print_table
from dsutils.evaluation import permutation_importance_cv
from dsutils.evaluation import plot_permutation_importance

In [None]:
# Load data containing all the features
fname = '../input/elo-feature-engineering-and-feature-selection/card_features_all.feather'
cards = pd.read_feather(fname)
cards.set_index('card_id', inplace=True)

# Test data
test = cards['target'].isnull()
X_test = cards[test].copy()
del X_test['target']

# Training data
y_train = cards.loc[~test, 'target'].copy()
X_train = cards[~test].copy()
del X_train['target']

# Clean up
del cards
gc.collect()

<a id="feature-selection"></a>
## Feature Selection

Machine learning models don't generally perform well when they're given a huge number of features, many of which are not very informative.  The more superfluous features we give our model to train on, the more likely it is to overfit!  To prune out features which could confuse our predictive model, we'll perform some feature selection. 

Ideally, we'd fit our model a bunch of different times, using every possible different combination of features, and use the set of features which gives the best cross-validated results. But, there are a few problems with that approach.  First, that would lead to overfitting to the training data.  Second, and perhaps even more importantly, it would take forever.
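
To put a rough number on "forever": with n candidate features there are 2^n - 1 non-empty subsets to try. The sketch below counts them; the one-second-per-fit cost is a made-up figure for illustration, not a measurement from this kernel.

```python
def n_feature_subsets(n_features):
    """Number of non-empty subsets of n_features features."""
    return 2 ** n_features - 1

# Hypothetical cost: assume one cross-validated fit takes 1 second
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
for n in [10, 20, 50]:
    n_subsets = n_feature_subsets(n)
    years = n_subsets / SECONDS_PER_YEAR
    print('%d features: %d subsets, ~%.2g years' % (n, n_subsets, years))
```

Even at 50 features the exhaustive search is out of reach, and we have far more than that.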

There are a bunch of different ways to perform feature selection in a less exhaustive, but more expedient manner.  Forward selection, backward selection, selecting features based on their correlation with the target variable, and "embedded" methods such as Lasso regressions (where the model itself performs feature selection during training) are all options.  However, here we'll use two different methods: the mutual information between each feature and the target variable, and the permutation-based feature importance of each feature.
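
For contrast with the two methods used in this kernel, here is a minimal sketch of what greedy forward selection could look like. It is not used anywhere else in this kernel. The helper names (`holdout_rmse`, `forward_selection`), the least-squares stand-in model, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def holdout_rmse(X_tr, y_tr, X_va, y_va):
    """Fit ordinary least squares on the training split, return validation RMSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_va = np.column_stack([np.ones(len(X_va)), X_va])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return np.sqrt(np.mean((A_va @ coef - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, max_features):
    """Greedily add the feature that most reduces validation RMSE."""
    selected = []
    remaining = list(range(X_tr.shape[1]))
    best_err = np.inf
    while remaining and len(selected) < max_features:
        # Score every candidate feature added to the current selection
        errs = [holdout_rmse(X_tr[:, selected + [j]], y_tr,
                             X_va[:, selected + [j]], y_va)
                for j in remaining]
        i_best = int(np.argmin(errs))
        if errs[i_best] >= best_err:
            break  # no remaining feature improves the error
        best_err = errs[i_best]
        selected.append(remaining.pop(i_best))
    return selected

# Toy data: only columns 0 and 3 actually drive the target
rng = np.random.RandomState(0)
X = rng.randn(400, 6)
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.randn(400)
chosen = forward_selection(X[:200], y[:200], X[200:], y[200:], max_features=2)
print(chosen)
```

Each step refits the model once per remaining feature, which is why this gets expensive with hundreds of features and a slow model.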


<a id="mutual-information"></a>
### Mutual Information

To get some idea of how well each feature corresponds to the target (the loyalty score), we can compute the [mutual information](https://en.wikipedia.org/wiki/Mutual_information) between each feature and the target. Let's make a function to compute the mutual information between two vectors.

In [None]:
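def mutual_information(xi, yi, res=20):
    """Compute the mutual information (in nats) between two vectors"""
    # Drop samples where either value is NaN or infinite
    ix = ~(np.isnan(xi) | np.isinf(xi) | np.isnan(yi) | np.isinf(yi))
    x = xi[ix]
    y = yi[ix]
    # Joint and marginal histograms on a res x res grid
    N, xe, ye = np.histogram2d(x, y, res)
    Nx, _ = np.histogram(x, xe)
    Ny, _ = np.histogram(y, ye)
    # Normalize counts to probabilities
    N = N / len(x)
    Nx = Nx / len(x)
    Ny = Ny / len(y)
    # Product of the marginals: the joint we'd expect under independence
    Ni = np.outer(Nx, Ny)
    # Empty bins contribute nothing (and would give log(0)), so skip them
    Ni[Ni == 0] = np.nan
    N[N == 0] = np.nan
    return np.nansum(N * np.log(N / Ni))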

The mutual information represents the amount of information that can be gained about one variable by knowing the value of some other variable.  This is directly relevant to feature selection: we want to choose the features whose values tell us as much as possible about the target variable.
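
For reference, this is the standard definition of mutual information for discrete (or binned) variables. Here $p(x, y)$ is the joint probability of a pair of bins and $p(x)$, $p(y)$ are the marginal probabilities. In the `mutual_information` function these correspond to the normalized `N`, `Nx`, and `Ny`, and because the function uses the natural log, its result is in nats.

$$
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x) \, p(y)}
$$

When the two variables are independent, $p(x, y) = p(x) \, p(y)$, so every log term is zero and $I(X; Y) = 0$.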

Practically speaking, the nice thing about using mutual information instead of, say, the correlation coefficient, is that it is sensitive to nonlinear relationships.  We'll be using nonlinear predictive models (like gradient boosted decision trees), and so we don't want to limit the features we select to be only ones which have a linear relationship to the target variable.  Notice how the sine-like relationship in the middle plot below has a high mutual information, but a low correlation coefficient.

In [None]:
# Show mutual information vs correlation
x = 5*np.random.randn(1000)
y = [x + np.random.randn(1000),
     2*np.sin(x) + np.random.randn(1000),
     x + 10*np.random.randn(1000)]
plt.figure(figsize=(10, 4))
for i in range(3):    
    plt.subplot(1, 3, i+1)
    plt.plot(x, y[i], '.')
    rho, _ = spearmanr(x, y[i])
    plt.title('Mutual info: %0.3f\nCorr coeff: %0.3f'
              % (mutual_information(x, y[i]), rho))
    plt.gca().tick_params(labelbottom=False, labelleft=False)

We'll use the mutual information of the quantile-transformed aggregation scores (just so outliers don't mess up the mutual information calculation).  So, we'll need a function to perform the [quantile transform](https://en.wikipedia.org/wiki/Quantile_normalization), and one to compute the mutual information after applying the quantile transform:

In [None]:
def quantile_transform(v, res=101):
    """Quantile-transform a vector to lie between 0 and 1"""
    x = np.linspace(0, 100, res)
    prcs = np.nanpercentile(v, x)
    return np.interp(v, prcs, x/100.0)
    
    
def q_mut_info(x, y):
    """Mutual information between quantile-transformed vectors"""
    return mutual_information(quantile_transform(x),
                              quantile_transform(y))
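
To see why this helps with outliers, here is a small sketch on a made-up vector with one extreme value. The function is repeated from the cell above only so that the snippet runs on its own.

```python
import numpy as np

def quantile_transform(v, res=101):
    """Quantile-transform a vector to lie between 0 and 1 (as defined above)"""
    x = np.linspace(0, 100, res)
    prcs = np.nanpercentile(v, x)
    return np.interp(v, prcs, x/100.0)

# Toy vector with one extreme outlier (values are made up)
v = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
q = quantile_transform(v)
print(q)
```

The outlier ends up at 1 and the other values stay evenly spread over the unit interval, so the fixed-width histogram bins used by `mutual_information` aren't wasted on the empty stretch between 4 and 1000.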

Now we can compute the mutual information between each feature and the loyalty score.

In [None]:
%%time

# Compute the mutual information
cols = []
mis = []
for col in X_train:
    mi = q_mut_info(X_train[col], y_train)
    cols.append(col)
    mis.append(mi)
    
# Print mut info of each feature
print_table(['Column', 'Mutual_Information'],
            [cols, mis])

Let's only bother keeping the features with the top 200 mutual information scores.

In [None]:
# Create DataFrame with scores
mi_df = pd.DataFrame()
mi_df['Column'] = cols
mi_df['mut_info'] = mis

# Sort by mutual information
mi_df = mi_df.sort_values('mut_info', ascending=False)
top200 = mi_df.iloc[:200,:]
top200 = top200['Column'].tolist()

# Keep only top 200 columns
X_train = X_train[top200]
X_test = X_test[top200]

<a id="permutation-based-feature-importance"></a>
### Permutation-based Feature Importance

A different way to select features is to try and train a model using *all* the features, and then determine how heavily the model's performance depends on the features.  But, we'll need to use a model which can handle a lot of features without overfitting too badly (i.e., an unregularized linear regression wouldn't be a good idea here).  So, we'll use a gradient boosted decision tree, specifically [CatBoost](http://catboost.ai/).  

Let's create a data processing and prediction pipeline.  First, we'll target encode the categorical columns (basically just set each category to the mean target value for samples having that category - see my [post on target encoding](https://brendanhasz.github.io/2019/03/04/target-encoding.html)).  Then, we can normalize the data and impute missing data (we'll just fill in missing data with the median of the column).  Finally, we can use CatBoost to predict the loyalty scores from the features we've engineered.
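
As a rough illustration of the idea behind target encoding, here is a naive sketch. The function name `target_encode` and the toy data are assumptions for illustration. This is not the `TargetEncoderCV` implementation, which, as its name suggests, is a cross-validated variant that this sketch does not attempt to reproduce.

```python
import numpy as np

def target_encode(categories, target):
    """Replace each category with the mean target of samples in that category."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    encoded = np.empty(len(categories))
    for cat in np.unique(categories):
        mask = categories == cat
        encoded[mask] = target[mask].mean()
    return encoded

# Toy data: three categories with made-up target values
cats = ['a', 'b', 'a', 'c', 'b', 'a']
y = [1.0, 4.0, 3.0, 10.0, 6.0, 2.0]
encoded = target_encode(cats, y)
print(encoded)
```

Note that category `c` has a single sample, so its encoding is just that sample's own target value. That kind of leakage is the reason to prefer a cross-validated encoder in a real pipeline.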

In [None]:
# Regression pipeline
cat_cols = [c for c in X_train if 'mode' in c] 
reg_pipeline = Pipeline([
    ('target_encoder', TargetEncoderCV(cols=cat_cols)),
    ('scaler', RobustScaler()),
    ('imputer', SimpleImputer(strategy='median')),
    ('regressor', CatBoostRegressor(loss_function='RMSE', 
                                    verbose=False))
])

We can measure how heavily the model depends on various features by using permutation-based feature importance.  Basically, we train the model on all the data, and then measure its error after shuffling each column in turn.  When the model's error increases a lot after shuffling a column, that means that the feature which was shuffled was important for the model's predictions.
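
The core of the idea can be sketched in a few lines: shuffle one feature column at a time and record how much the error grows. This is a simplified illustration with a least-squares stand-in model, a single held-out split, and toy data. It is not the `permutation_importance_cv` implementation from dsutils, and the helper names here are assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def permutation_importance(predict, X, y, seed=0):
    """Increase in RMSE when each column of X is shuffled in turn."""
    rng = np.random.RandomState(seed)
    base_err = rmse(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_shuf = X.copy()
        X_shuf[:, j] = rng.permutation(X_shuf[:, j])  # shuffle one column only
        importances[j] = rmse(y, predict(X_shuf)) - base_err
    return importances

# Toy data: column 0 matters a lot, column 1 a little, column 2 not at all
rng = np.random.RandomState(1)
X = rng.randn(500, 3)
y = 5 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(500)

# Stand-in model: ordinary least squares fit on the first half of the data
coef, *_ = np.linalg.lstsq(X[:250], y[:250], rcond=None)
predict = lambda A: A @ coef

# Importance is measured on the held-out second half
imp = permutation_importance(predict, X[250:], y[250:])
print(imp)
```

Shuffling breaks the link between that one feature and the target while leaving its marginal distribution intact, so the model never has to be retrained.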

The advantage of permutation-based feature importance is that it gives a super-clear view and a single score as to how important each feature is.  The downside is that this score is intrinsically linked to the model.  Whereas computing the mutual information between the features and the target only depends on the data, permutation-based feature importance scores depend on the data, the model being used, and the interaction between the two.  If your model can't fit the data very well, your permutation scores will be garbage!

Luckily CatBoost nearly always does a pretty good job of prediction, even in the face of lots of features!  So, let's compute the permutation-based feature importance for each feature (the complete code is [on my GitHub](https://github.com/brendanhasz/dsutils/blob/master/src/dsutils/evaluation.py#L126)).

In [None]:
%%time 

# Compute the cross-validated feature importance
imp_df = permutation_importance_cv(
    X_train, y_train, reg_pipeline, 'rmse', n_splits=2)

Then, we can plot the importance scores for each feature.  These scores are just the difference between the model's error with no shuffled features and the error with the feature of interest shuffled.  So, larger scores correspond to features which the model needs to have a low error.

In [None]:
# Plot the feature importances
plt.figure(figsize=(8, 100))
plot_permutation_importance(imp_df)
plt.show()

Finally, we'll want to save the features so that we can use them to train a model to predict the loyalty scores.  Let's save the top 100 most important features to a [feather](https://github.com/wesm/feather) file, so that we can quickly load them back in when we do the modeling.  First though, we need to figure out which features *are* the ones with the best importance scores.

In [None]:
# Get top 100 most important features
df = pd.melt(imp_df, var_name='Feature', value_name='Importance')
dfg = (df.groupby(['Feature'])['Importance']
       .mean()
       .reset_index()
       .sort_values('Importance', ascending=False))
top100 = dfg['Feature'][:100].tolist()

Then, we can save those features (and the corresponding target variable!) to a feather file.

In [None]:
# Save file w/ 100 most important features
cards = pd.concat([X_train[top100], X_test[top100]])
# Assignment aligns on card_id, so test cards keep a NaN target
cards['target'] = y_train
cards.reset_index(inplace=True)
cards.to_feather('card_features_top100.feather')

<a id="conclusion"></a>
## Conclusion

Now that we've engineered features for each card account, the next thing to do is create models to predict the target value from those features.  In [the next kernel](https://www.kaggle.com/brendanhasz/elo-modeling), we'll try different modeling techniques to see which gives the best predictions.