## Select with Target Mean as Performance Proxy

The `SelectByTargetMeanPerformance` transformer implements the feature selection method described in the notebook **06.2-Method-used-in-a-KDD-competition**.

The functionality has now been included in Feature-engine.

Feature-engine automatically detects categorical and numerical variables. 

- Categories in categorical variables will be replaced by the mean value of the target.

- Numerical variables are first discretised, and then each bin is replaced by the mean value of the target.
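To make the two bullets above concrete, here is a minimal sketch of target mean encoding written with plain pandas. The toy data frame and the helper `target_mean_encode` are invented for this illustration; this is not Feature-engine's internal code.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "fare": [80.0, 7.5, 60.0, 8.0, 30.0, 10.0],
    "survived": [1, 0, 1, 0, 1, 1],
})


def target_mean_encode(feature: pd.Series, target: pd.Series) -> pd.Series:
    """Replace each category (or bin) by the mean of the target within it."""
    means = target.groupby(feature).mean()
    return feature.map(means)


# Categorical variable: each category becomes the target mean of that category
sex_encoded = target_mean_encode(df["sex"], df["survived"])

# Numerical variable: discretise first (here into 2 equal-frequency bins),
# then replace each bin by its target mean
fare_bins = pd.qcut(df["fare"], q=2, labels=False)
fare_encoded = target_mean_encode(fare_bins, df["survived"])

print(sex_encoded.tolist())
print(fare_encoded.tolist())
```

The encoded column can then be treated as a prediction of the target, and its performance (for example the ROC-AUC) serves as a proxy for how informative the feature is.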

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from feature_engine.selection import SelectByTargetMeanPerformance

In [2]:
# load the titanic dataset

data = pd.read_csv('../titanic.csv')
data.shape

(1306, 9)

In [3]:
data.head()

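|   | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 29.0 | 0 | 0 | 211.3375 | B5 | S |
| 1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.55 | C22 | S |
| 2 | 1 | 0 | female | 2.0 | 1 | 2 | 151.55 | C22 | S |
| 3 | 1 | 0 | male | 30.0 | 1 | 2 | 151.55 | C22 | S |
| 4 | 1 | 0 | female | 25.0 | 1 | 2 | 151.55 | C22 | S |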


In [4]:
# Variable preprocessing:

# narrow down the different cabins by keeping only the first letter,
# which represents the deck in which the cabin was located
data['cabin'] = data['cabin'].str[0]

# group the cabin letters that appear only once or twice in the
# dataset under the label N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])

data['cabin'].unique()

array(['B', 'C', 'E', 'D', 'A', 'N', 'F'], dtype=object)
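The cell above hard-codes the rare decks `'T'` and `'G'`. As an alternative, the rare labels could be found from their counts. The sketch below uses a toy series and a hypothetical cut-off `min_count`; neither comes from the original notebook.

```python
import pandas as pd

# Toy deck letters, invented for illustration
cabin = pd.Series(["B", "C", "C", "T", "B", "G", "C", "B"])

min_count = 2  # hypothetical cut-off: labels seen fewer times are "rare"

# Count each label and keep those below the cut-off
counts = cabin.value_counts()
rare_labels = counts[counts < min_count].index

# Replace rare labels by "N", keep the rest unchanged
cabin_grouped = cabin.where(~cabin.isin(rare_labels), "N")

print(sorted(cabin_grouped.unique()))
```

Deriving the rare labels from the counts keeps the preprocessing valid if the data changes, at the cost of having to choose a cut-off.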

In [5]:
# number of passengers per cabin

data['cabin'].value_counts()

N    1019
C      94
B      63
D      46
E      41
A      22
F      21
Name: cabin, dtype: int64

In [6]:
# number of passengers per value
data['parch'].value_counts()

0    999
1    170
2    113
3      8
4      6
5      6
6      2
9      2
Name: parch, dtype: int64

In [7]:
# cap the variable at 3; higher values are
# represented by too few observations

data['parch'] = np.where(data['parch'] > 3, 3, data['parch'])

In [8]:
data['sibsp'].value_counts()

0    888
1    319
2     42
4     22
3     20
8      9
5      6
Name: sibsp, dtype: int64

In [9]:
# cap the variable at 3; higher values are
# represented by too few observations

data['sibsp'] = np.where(data['sibsp'] > 3, 3, data['sibsp'])
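The capping done with `np.where` in the cells above can also be expressed with pandas' `Series.clip`. The short sketch below, on a toy series invented for this example, shows that both approaches give the same result.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration
sibsp = pd.Series([0, 1, 2, 4, 3, 8, 5])

# Approach used in the notebook: replace values above 3 by 3
capped_where = pd.Series(np.where(sibsp > 3, 3, sibsp))

# Equivalent pandas idiom: clip the upper end at 3
capped_clip = sibsp.clip(upper=3)

print(capped_clip.tolist())  # [0, 1, 2, 3, 3, 3, 3]
```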

In [10]:
# cast discrete variables as categorical

# Feature-engine treats variables of type object as categorical.
# To work with numerical variables as if they were categorical,
# we need to cast them as object

data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')
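To see the effect of this cast, the sketch below splits the columns of a toy data frame by dtype before and after casting. It only illustrates how the pandas dtype changes which group a column falls into; it is not Feature-engine's own detection code, and the toy data is invented.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({
    "pclass": [1, 2, 3],
    "fare": [211.3, 151.55, 7.25],
    "sex": pd.Series(["female", "male", "male"], dtype="object"),
})

# Before casting, pclass is an integer column and counts as numerical
numerical_before = df.select_dtypes(include="number").columns.tolist()

# After casting to object, pclass is grouped with the categorical columns
df["pclass"] = df["pclass"].astype("O")
categorical_after = df.select_dtypes(include="object").columns.tolist()
numerical_after = df.select_dtypes(include="number").columns.tolist()

print(numerical_before)
print(categorical_after, numerical_after)
```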

In [11]:
# check absence of missing data

data.isnull().sum()

pclass      0
survived    0
sex         0
age         0
sibsp       0
parch       0
fare        0
cabin       0
embarked    0
dtype: int64

**Important**

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps avoid overfitting.
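As a small illustration of this point, the sketch below learns the category-to-target-mean mapping from a toy training set and applies it to a toy test set without ever looking at the test target. The data frames and the column name `deck` are invented for this example.

```python
import pandas as pd

# Toy train and test sets, invented for illustration
train = pd.DataFrame({"deck": ["A", "A", "B", "B"], "survived": [1, 0, 1, 1]})
test = pd.DataFrame({"deck": ["A", "B", "C"], "survived": [0, 1, 1]})

# Learn the category -> target mean mapping from the training set only
mapping = train.groupby("deck")["survived"].mean()

# Apply the learned mapping to the test set; the test target is never used
test_encoded = test["deck"].map(mapping)

print(test_encoded.tolist())
```

Note that deck `C` does not appear in the training set, so it has no learned mean and is encoded as `NaN`. Unseen categories like this need explicit handling in practice.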

In [12]:
# separate train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((914, 8), (392, 8))

In [13]:
# Feature-engine automates the selection for both
# categorical and numerical variables

sel = SelectByTargetMeanPerformance(
    variables=None,  # automatically finds categorical and numerical variables
    scoring="roc_auc",  # the metric used to evaluate performance
    threshold=0.6,  # the performance threshold below which features are dropped
    bins=3,  # the number of intervals used to discretise the numerical variables
    strategy="equal_frequency",  # whether the intervals should be of equal width or contain an equal number of observations
    cv=2,  # number of cross-validation folds
    regression=False,  # whether this is regression or classification
)

sel.fit(X_train, y_train)
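To give an intuition for what `scoring="roc_auc"` combined with `cv=2` measures, here is a rough sketch of a cross-validated target mean score for a single categorical feature. The helper `target_mean_cv_score`, the toy data, and the choice of `StratifiedKFold` with a fallback for unseen categories are assumptions made for this illustration, not a description of the library's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Toy data, invented for illustration
X = pd.Series(["f", "m", "f", "m", "f", "m", "f", "m"], name="sex")
y = pd.Series([1, 0, 1, 0, 1, 1, 0, 0], name="survived")


def target_mean_cv_score(feature, target, n_splits=2):
    """Average ROC-AUC of the target-mean 'prediction' across CV folds."""
    scores = []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, valid_idx in cv.split(feature, target):
        f_train, y_train = feature.iloc[train_idx], target.iloc[train_idx]
        f_valid, y_valid = feature.iloc[valid_idx], target.iloc[valid_idx]
        # Learn the mean target per category on the training fold only
        means = y_train.groupby(f_train).mean()
        # Use that mean as the predicted score for the validation fold;
        # categories unseen in the training fold get the training mean
        preds = f_valid.map(means).fillna(y_train.mean())
        scores.append(roc_auc_score(y_valid, preds))
    return float(np.mean(scores))


score = target_mean_cv_score(X, y)
print(round(score, 3))
```

A score close to 0.5 means the encoded feature ranks the two classes no better than chance, while a score close to 1 means the feature alone separates them well.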

In [14]:
# the selector stores the ROC-AUC of each feature

sel.feature_performance_

{'pclass': 0.6798614277309268,
 'sex': 0.7491001943282519,
 'age': 0.5581472195335474,
 'sibsp': 0.5563082996205047,
 'parch': 0.5696414230138407,
 'fare': 0.6603974806466256,
 'cabin': 0.63880017706053,
 'embarked': 0.5630695122556864}

In [15]:
# and these are the features that will be dropped

sel.features_to_drop_

['age', 'sibsp', 'parch', 'embarked']
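The list above can be reproduced by hand from the stored performance values and the threshold of 0.6. The sketch below assumes a strict less-than comparison against the threshold, which is consistent with the output shown; the variable names are chosen for this example.

```python
# ROC-AUC per feature, copied from the output of sel.feature_performance_
feature_performance = {
    "pclass": 0.6798614277309268,
    "sex": 0.7491001943282519,
    "age": 0.5581472195335474,
    "sibsp": 0.5563082996205047,
    "parch": 0.5696414230138407,
    "fare": 0.6603974806466256,
    "cabin": 0.63880017706053,
    "embarked": 0.5630695122556864,
}

threshold = 0.6  # same value passed to the selector

# Keep the names of the features whose score falls below the threshold
features_to_drop = [
    name for name, score in feature_performance.items() if score < threshold
]

print(features_to_drop)  # ['age', 'sibsp', 'parch', 'embarked']
```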

In [16]:
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape

((914, 4), (392, 4))
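In terms of the resulting columns, the transformation above amounts to removing the four features listed in `features_to_drop_`, which is why the number of columns goes from 8 to 4. A minimal pandas sketch of the same reduction, on a two-row toy data frame invented for this example:

```python
import pandas as pd

# Two toy rows with the same 8 columns as X_train
X = pd.DataFrame({
    "pclass": [1, 3], "sex": ["female", "male"], "age": [29.0, 30.0],
    "sibsp": [0, 1], "parch": [0, 2], "fare": [211.3375, 151.55],
    "cabin": ["B", "C"], "embarked": ["S", "S"],
})

# Features flagged by the selector in the output above
features_to_drop = ["age", "sibsp", "parch", "embarked"]

# Drop the flagged columns and keep the rest
X_reduced = X.drop(columns=features_to_drop)

print(X_reduced.columns.tolist())  # ['pclass', 'sex', 'fare', 'cabin']
```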

That is all for this lecture, I hope you enjoyed it and see you in the next one!