# Exercise - Feature Importance and Feature Selection


In this exercise, you will train a baseline model and investigate which features drive its performance. You will then apply feature selection techniques to reduce the feature set and examine how this affects the model's performance.


In [1]:
# DO NOT MODIFY - imports
import pandas as pd
import numpy as np

## 1. Setup, Baseline Model and Baseline Performance Score


Execute the cells below to create a synthetic dataset for binary classification with 50 features and 10,000 examples. Imagine the target `y` is the direction of price movements which we would like to predict using the 50 features at our disposal.

In [2]:
# DO NOT MODIFY - create dataset
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10_000, n_classes=2, n_features=50, n_informative=10, n_redundant=10, class_sep=0.4, n_clusters_per_class=3, random_state=52)

X = pd.DataFrame(X)
y = pd.Series(y)

# DO NOT MODIFY - Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

Before we continue, we would like to establish a baseline score. We will choose accuracy as the relevant performance metric.  
Write code to calculate and display the accuracy score on the _test set_ of a naive baseline model that always predicts the majority class (based on the majority class in the _training set_).

> **HINT:**
> First, find the majority class (either `0` or `1`) of the target variable in the _training set_. You can inspect `y_train.value_counts()`, or take `y_train.mode()`, which returns the most frequent value.  
> Next, create an array with the same length as `y_test` with all elements equal to the majority class you just found.  
> Finally, use this as the vector of predictions to evaluate this naive baseline model on the _test set_.
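
The steps in the hint can be sketched on a tiny, made-up label vector. This is only an illustration under stated assumptions: the `toy_*` names and values are hypothetical and are not the exercise data, and NumPy's `bincount` stands in for the pandas calls mentioned above.

```python
import numpy as np

# Hypothetical toy labels, NOT the exercise data
toy_y_train = np.array([0, 1, 1, 1, 0, 1])
toy_y_test = np.array([1, 0, 1, 1])

# The majority class is determined from the training labels only
toy_majority = np.bincount(toy_y_train).argmax()

# Constant prediction vector with the same length as the test labels
toy_pred = np.full(len(toy_y_test), toy_majority)

# Accuracy is the fraction of test labels that match the prediction
toy_baseline_acc = (toy_pred == toy_y_test).mean()
print(toy_majority, toy_baseline_acc)
```

Taking the majority class from the training labels rather than the test labels keeps the baseline honest: no information from the test set leaks into the "model".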


In [3]:
# DO NOT MODIFY - imports
from sklearn.metrics import accuracy_score

# FILL IN - Find the majority class in the training set

# FILL IN - Calculate the accuracy of the majority class classifier on the test set
baseline_test_acc = ...
baseline_test_acc

## 2. Basic Feature Selection with Permutation Importance

Run the code cell below to train a `LogisticRegression` model with its default hyperparameter values, using all 50 features.

In [4]:
# DO NOT MODIFY - import and train a LogisticRegression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=52)
clf.fit(X_train, y_train)

Run the code cells below to get the cross-validated accuracy score and the actual accuracy score on the test set.

In [5]:
# DO NOT MODIFY
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()

In [6]:
# DO NOT MODIFY - evaluate the model on the test set
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Compute the permutation importance scores of the features on the test set. Use `n_repeats=10` and `random_state=52`. Store the mean permutation importance of each feature (averaged over the repeats) in `mean_perm_imps`.
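
To build intuition for what a permutation importance score measures, here is a from-scratch sketch in NumPy. It is not scikit-learn's implementation, and it is not the solution to the cell below: the function `manual_permutation_importance`, the toy data, and the hard-coded `toy_predict` rule are all hypothetical names introduced for illustration. The idea is to shuffle one column at a time and record how much the accuracy drops.

```python
import numpy as np

def manual_permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return an (n_features, n_repeats) array of accuracy drops."""
    rng = np.random.default_rng(seed)
    base_acc = (predict(X) == y).mean()
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffle only column j, breaking its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = base_acc - (predict(X_perm) == y).mean()
    return drops

# Hypothetical toy data: only column 0 determines the label
toy_rng = np.random.default_rng(0)
toy_X = toy_rng.normal(size=(200, 3))
toy_y = (toy_X[:, 0] > 0).astype(int)

# Stand-in "model" that looks at column 0 and ignores the rest
def toy_predict(X):
    return (X[:, 0] > 0).astype(int)

toy_imps = manual_permutation_importance(toy_predict, toy_X, toy_y)
toy_mean_imps = toy_imps.mean(axis=1)  # average over the repeats
print(toy_mean_imps)
```

In this toy setup, shuffling column 0 destroys the predictions, so its score is large, while shuffling the ignored columns changes nothing and their scores are exactly zero. On real data, scores for uninformative features tend to hover around zero and can be slightly negative by chance.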

In [7]:
# DO NOT MODIFY - import
from sklearn.inspection import permutation_importance

# FILL IN - Calculate permutation importance scores for the features in the test set
# Use n_repeats=10 and random_state=52
perm_imps = ...
mean_perm_imps = ...

Run the cell below to print out the features listed in decreasing order of absolute value of mean permutation importance.

In [8]:
# DO NOT MODIFY - Sort and print the features by decreasing absolute permutation importance
sorted_idx = np.argsort(np.abs(mean_perm_imps))[::-1]
for i in sorted_idx:
    print(f"{i}: {mean_perm_imps[i]}")

Reduce the feature set by dropping features that have a permutation importance score less than `0.003`. Store the resulting reduced feature sets in `X_train_reduced` and `X_test_reduced`.
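
Threshold-based filtering of this kind is usually done with a boolean mask over the columns. The sketch below uses hypothetical toy scores and plain NumPy arrays rather than the exercise's DataFrames; with a pandas DataFrame, the analogous selection would be `.loc[:, mask]`.

```python
import numpy as np

# Hypothetical importance scores for four features, NOT exercise results
toy_scores = np.array([0.020, -0.001, 0.004, 0.0005])
toy_train = np.arange(12, dtype=float).reshape(3, 4)
toy_test = np.arange(8, dtype=float).reshape(2, 4)

threshold = 0.003
# Keep a feature only if its score is at least the threshold;
# negative scores fall below it and are dropped as well
keep = toy_scores >= threshold

# Apply the SAME mask to train and test so the columns stay aligned
toy_train_reduced = toy_train[:, keep]
toy_test_reduced = toy_test[:, keep]
print(keep.sum(), toy_train_reduced.shape, toy_test_reduced.shape)
```

Note that the mask compares the raw scores, not their absolute values, which matches the instruction to drop features with a score less than the threshold.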

In [9]:
# FILL IN - Filter out features with permutation importance less than 0.003

X_train_reduced = ...
X_test_reduced = ...

Run the cell below to see how many features remain. (There should be 9.)

In [10]:
# DO NOT MODIFY - There should be 9 features remaining
X_train_reduced.shape[1]

Re-train the classifier from earlier on the reduced feature set, then check its mean cross-validated accuracy and its test accuracy.

In [11]:
# FILL IN - Train a new LogisticRegression model on the reduced feature set


In [12]:
# FILL IN - Calculate the mean cross-validated accuracy of the new model


In [13]:
# FILL IN - Calculate the accuracy of the new model on the test set


**NOTE:** Reducing the feature set may or may not improve performance. After all, even some less "important" features still provide some information and eliminating them might result in a hit to performance scores. But a reduction in performance may still be worthwhile if it means faster model training and a more interpretable model.