In [1]:
# This file collects a few basic feature selection techniques (a part of
# feature engineering) to refer back to when writing new code.

import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

import os

clicks = pd.read_parquet('data/baseline_data.pqt')
data_files = ['count_encodings.pqt',
              'catboost_encodings.pqt',
              'interactions.pqt',
              'past_6hr_events.pqt',
              'downloads.pqt',
              'time_deltas.pqt',
              'svd_encodings.pqt']
data_root = 'data'
for file in data_files:
    features = pd.read_parquet(os.path.join(data_root, file))
    clicks = clicks.join(features)

In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
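    # Note: LightGBM 4.0 removed these two keyword arguments from lgb.train;
    # there, pass callbacks=[lgb.early_stopping(20, verbose=False)] instead.
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)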
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score
    
train, valid, test = get_data_splits(clicks)
_, baseline_score = train_model(train, valid)

Training model!
Validation AUC score: 0.9658334271834417


**Which data to use for feature selection?**

Since many feature selection methods require calculating statistics from the dataset, should you use all the data for feature selection?

Including validation and test data in feature selection is a source of leakage. Perform feature selection on the train set only, then use those results to remove the same features from the validation and test sets.
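
As a minimal sketch of that rule, the example below fits a selector on the training rows only and reuses its decision for the validation rows. The synthetic data, the column names f0..f19, and the 80/20 split are assumptions made for illustration, not part of the clicks dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real data (assumption: 20 numeric features)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=7)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])
y = pd.Series(y)

# Ordered split: first 80% of rows for training, last 20% for validation
X_train, X_valid = X.iloc[:800], X.iloc[800:]
y_train = y.iloc[:800]

# Fit the selector on the training rows only, so no validation
# statistics influence which features are kept
selector = SelectKBest(f_classif, k=8).fit(X_train, y_train)

# Reuse the training-set decision for every other split
kept_cols = X_train.columns[selector.get_support()]
X_train_sel = X_train[kept_cols]
X_valid_sel = X_valid[kept_cols]
```

The validation labels are never passed to the selector, so the validation score remains an honest estimate.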

**Univariate Feature Selection**

Below, use SelectKBest with the f_classif scoring function to choose 40 features from the 91 features in the data.

In [3]:
from sklearn.feature_selection import SelectKBest, f_classif
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Create the selector, keeping 40 features
selector = SelectKBest(f_classif, k = 40)

# Use the selector to retrieve the best features
X_new = selector.fit_transform(train[feature_cols], train["is_attributed"])

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                index = train.index,
                                columns = feature_cols)

# Find the columns that were dropped
dropped_columns = selected_features.columns[selected_features.var() == 0]

In [4]:
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1))

Training model!
Validation AUC score: 0.9625481759576047


**The best value of K**

With this method we can choose the best K features, but we still have to choose K ourselves. How would you find the "best" value of K? You want K small enough that only the most useful features remain, but not so small that the model's performance degrades.

To find the best value of K, fit multiple models with increasing values of K, then choose the smallest K whose validation score meets some criterion, such as staying above a chosen threshold. A good way to do this is to loop over values of K and record the validation score for each iteration.
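
The loop described above could look roughly like the sketch below. It runs on synthetic data and uses a LogisticRegression as a quick stand-in model; in this notebook you would call train_model on the real splits instead. The grid k_values and the tolerance of 0.005 AUC are hypothetical choices, not values from the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (assumption); swap in the real train/valid splits
X, y = make_classification(n_samples=2000, n_features=30, n_informative=6,
                           random_state=7)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(30)])
X_train, X_valid = X.iloc[:1500], X.iloc[1500:]
y_train, y_valid = y[:1500], y[1500:]

def score_with_k(k):
    """Select k features on the train split and return the validation AUC."""
    selector = SelectKBest(f_classif, k=k).fit(X_train, y_train)
    cols = X_train.columns[selector.get_support()]
    # Stand-in model (assumption); the notebook would use train_model here
    model = LogisticRegression(max_iter=1000).fit(X_train[cols], y_train)
    pred = model.predict_proba(X_valid[cols])[:, 1]
    return roc_auc_score(y_valid, pred)

# Record the validation score for each candidate K
k_values = [5, 10, 15, 20, 25, 30]
scores = {k: score_with_k(k) for k in k_values}

# Hypothetical criterion: smallest K within `tolerance` AUC of the best score
tolerance = 0.005
best_score = max(scores.values())
best_k = min(k for k, s in scores.items() if s >= best_score - tolerance)
print(scores, best_k)
```

Choosing the smallest K within a tolerance of the best score, rather than the K with the single highest score, favors simpler models when the scores are nearly tied.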

**Use L1 regularization for feature selection**

Now try a more powerful approach using L1 regularization. Implement a function select_features_l1 that returns a list of features to keep.

Use a LogisticRegression classifier model with an L1 penalty to select the features. For the model, set:

- the random state to 7,
- the regularization parameter C to 0.1,
- and the solver to 'liblinear'.

Fit the model, then use SelectFromModel to return a model with the selected features.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def select_features_l1(X, y):
    # The liblinear solver supports the L1 penalty; the default lbfgs solver does not
    logistic = LogisticRegression(C = 0.1, penalty = "l1", solver = "liblinear",
                                  random_state = 7).fit(X, y)
    model = SelectFromModel(logistic, prefit = True)
    
    X_new = model.transform(X)
    
    # Get back the kept features as a DataFrame with dropped columns as all 0's
    selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                    index = X.index,
                                    columns = X.columns)
    
    # Dropped columns have values of all 0's, keep other columns
    cols_to_keep = selected_features.columns[selected_features.var() != 0]
    return cols_to_keep

In [6]:
n_samples = 10000
X, y = train[feature_cols][:n_samples], train['is_attributed'][:n_samples]
selected = select_features_l1(X, y)

dropped_columns = feature_cols.drop(selected)
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1))

**Feature Selection with Trees**

Since we're using a tree-based model, using another tree-based model for feature selection might produce better results. What would you do differently to select the features using a tree-based classifier?

You could use something like RandomForestClassifier or ExtraTreesClassifier to find feature importances. SelectFromModel can use the feature importances to find the best features.
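
One way this might look is sketched below with ExtraTreesClassifier on synthetic data. The helper name select_features_trees and the choice of 100 trees are assumptions for illustration. When no threshold is given and the estimator has no L1 penalty, SelectFromModel keeps the features whose importance is at least the mean importance.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training split (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=7)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

def select_features_trees(X, y, n_estimators=100):
    """Hypothetical helper: keep features with above-average tree importance."""
    forest = ExtraTreesClassifier(n_estimators=n_estimators, random_state=7)
    # SelectFromModel fits the forest and thresholds feature_importances_
    # at the mean importance by default
    selector = SelectFromModel(forest).fit(X, y)
    return X.columns[selector.get_support()]

selected = select_features_trees(X, y)
print(list(selected))
```

As with the other methods, fit this on the train split only and then drop the unselected columns from the validation and test sets.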

**Top K features with L1 regularization**

Here you've set the regularization parameter C=0.1 which led to some number of features being dropped. However, by setting C you aren't able to choose a certain number of features to keep. What would you do to keep the top K important features using L1 regularization?

To select a certain number of features with L1 regularization, you need to find the regularization parameter that leaves the desired number of features. To do this, iterate over models with different regularization parameters from low to high and choose the one that leaves K features. Note that for the scikit-learn models C is the inverse of the regularization strength, so a smaller C drops more features.
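
A rough sketch of that search follows, run on synthetic, standardized data. The grid c_grid and the target target_k = 5 are hypothetical values. A finite grid can jump past exactly K features, so this version stops at the first C that keeps at least K; if it overshoots, you could keep the K largest coefficients by absolute value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=7)
# Put features on one scale so the L1 penalty treats them comparably
X = StandardScaler().fit_transform(X)

def count_kept_features(C):
    """Fit an L1-penalized model and count its nonzero coefficients."""
    model = LogisticRegression(C=C, penalty="l1", solver="liblinear",
                               random_state=7).fit(X, y)
    return int(np.sum(model.coef_[0] != 0))

target_k = 5
# Small C means strong regularization and few surviving features,
# so sweep C upward until enough features survive
c_grid = np.logspace(-3, 1, 20)
chosen_c, chosen_n = None, None
for C in c_grid:
    n_kept = count_kept_features(C)
    if n_kept >= target_k:
        chosen_c, chosen_n = C, n_kept
        break

print(chosen_c, chosen_n)
```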