## Method used in a KDD 2009 competition

This notebook demonstrates the feature selection approach used by data scientists at the University of Melbourne in the [KDD 2009](http://www.kdd.org/kdd-cup/view/kdd-cup-2009) data science competition. The task was to predict churn from a dataset with a very large number of features.

The authors describe it as an aggressive non-parametric feature selection procedure based on examining the relationship between each feature and the target. Because each feature is scored on its own, without training a machine learning model on the full feature set, the method can be classified as a filter method.

**The procedure consists of the following steps**:

For each categorical variable:

    1) Separate into train and test

    2) Determine the mean value of the target within each label of the categorical variable using the train set

    3) Use that mean target value per label as the prediction (using the test set) and calculate the roc-auc.

For each numerical variable:

    1) Separate into train and test
    
    2) Divide the variable into 100 quantiles

    3) Calculate the mean target within each quantile using the training set 

    4) Use that mean target value per bin as the prediction (using the test set) and calculate the roc-auc
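
The steps for a categorical variable can be sketched as a single function. This is a minimal illustration and not the authors' code: the helper name `target_mean_auc`, the toy data, and the fallback to the overall train mean for labels unseen in the train set are all assumptions made here.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def target_mean_auc(feature, target, test_size=0.3, random_state=0):
    """Hypothetical helper: roc-auc of one categorical feature used alone."""
    # step 1: separate into train and test
    x_train, x_test, y_train, y_test = train_test_split(
        feature, target, test_size=test_size, random_state=random_state
    )
    # step 2: mean target per label, learned on the train set only
    label_means = y_train.groupby(x_train).mean()
    # step 3: the per-label mean is the prediction for the test set;
    # labels unseen in train fall back to the overall train mean (assumption)
    preds = x_test.map(label_means).fillna(y_train.mean())
    return roc_auc_score(y_test, preds)


# synthetic toy data (an assumption, not the competition dataset):
# the label determines the target completely
toy = pd.DataFrame({
    "colour": ["red", "blue"] * 50,
    "target": [1, 0] * 50,
})
auc = target_mean_auc(toy["colour"], toy["target"])
print(auc)
```

For a numerical variable, the same function could be applied after the variable has been replaced by its bin number.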


The authors quote the following advantages of the method:

- Speed: computing mean and quantiles is direct and efficient
- Stability with respect to scale: extreme values of continuous variables do not skew the predictions
- Comparable between categorical and numerical variables
- Accommodation of non-linearities

See my notes at the end of the notebook for a discussion on the method.

**Important**
The authors use the roc-auc, but in principle any metric could be used, including those suited to regression.
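
As a rough sketch of the regression case, the per-label target mean can be scored with the mean squared error instead of the roc-auc, where a lower value indicates a more informative feature. The toy data and the column names `segment` and `other` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic toy data with a continuous target (an assumption):
# "segment" determines the target, "other" carries no information
toy = pd.DataFrame({
    "segment": ["a", "b"] * 50,
    "other": ["x", "x", "y", "y"] * 25,
    "target": [10.0, 20.0] * 50,
})

train, test = train_test_split(toy, test_size=0.3, random_state=0)

mse = {}
for col in ["segment", "other"]:
    # mean target per label, learned on the train set
    means = train.groupby(col)["target"].mean()
    # the per-label mean is the prediction for the test set
    preds = test[col].map(means)
    mse[col] = mean_squared_error(test["target"], preds)

# lower error first
ranking = pd.Series(mse).sort_values()
print(ranking)
```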

**Reference**:
[Predicting customer behaviour: The University of Melbourne's KDD Cup Report. Miller et al. JMLR Workshop and Conference Proceedings 7:45-55](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

In [2]:
# load the titanic dataset
data = pd.read_csv('../titanic.csv')
data.shape

(1309, 14)

In [3]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


In [4]:
# Variable preprocessing:

# note: the cabin values are narrowed down by keeping only the
# first letter, which represents the deck on which the cabin was located

data['cabin'] = data['cabin'].str[0]
data['cabin'].unique()

array(['B', 'C', 'E', 'D', 'A', nan, 'T', 'F', 'G'], dtype=object)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [6]:
data.cabin = data.cabin.fillna(0)

## 1. Feature selection on categorical variables

First, I will demonstrate the feature selection procedure on categorical variables. The Titanic dataset contains 4 categorical variables: Sex, Pclass, Cabin and Embarked.

In [7]:
# separate train and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data[['pclass', 'sex', 'embarked', 'cabin', 'survived']],
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 5), (393, 5))

In [8]:
X_train.groupby("cabin")['survived'].mean().to_dict()

{0: 0.30484330484330485,
 'A': 0.5294117647058824,
 'B': 0.7619047619047619,
 'C': 0.5633802816901409,
 'D': 0.71875,
 'E': 0.71875,
 'F': 0.6666666666666666,
 'G': 0.5,
 'T': 0.0}

### Replace categories by target mean

In [9]:
# function that determines the target mean per category

def mean_encoding(df_train, df_test, categorical_vars):
    
    # temporary copy of the original dataframes
    df_train_temp = df_train.copy()
    df_test_temp = df_test.copy()
    
    for col in categorical_vars:
        
        # make a dictionary of categories, target-mean pairs
        target_mean_dict = df_train.groupby([col])['survived'].mean().to_dict()
        
        # replace the categories by the mean of the target
        df_train_temp[col] = df_train[col].map(target_mean_dict)
        df_test_temp[col] = df_test[col].map(target_mean_dict)
    
    # drop the target from the dataset
    df_train_temp.drop(['survived'], axis=1, inplace=True)
    df_test_temp.drop(['survived'], axis=1, inplace=True)
    
    # return  remapped datasets
    return df_train_temp, df_test_temp

In [10]:
categorical_vars = ['pclass', 'sex', 'embarked', 'cabin']

X_train_enc, X_test_enc = mean_encoding(X_train, X_test, categorical_vars)

X_train_enc.head()

Unnamed: 0,pclass,sex,embarked,cabin
501,0.43617,0.728358,0.338957,0.304843
588,0.43617,0.728358,0.338957,0.304843
402,0.43617,0.728358,0.553073,0.304843
1193,0.259036,0.187608,0.373494,0.304843
686,0.259036,0.728358,0.373494,0.304843


The categories were replaced by the target mean.
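
One caveat with mapping train means onto the test set: a category that appears only in the test set has no entry in the mapping and becomes NaN, and NaN predictions cannot be scored. The sketch below shows the effect on toy data; filling with the overall train mean is one possible fallback and an assumption made here, not part of the original procedure.

```python
import pandas as pd

# toy data (an assumption): "Z" never appears in the train set
train = pd.DataFrame({"cabin": ["A", "A", "B", "B"], "survived": [1, 0, 1, 1]})
test = pd.DataFrame({"cabin": ["A", "B", "Z"]})

# target mean per category, learned on train
means = train.groupby("cabin")["survived"].mean()

# "Z" has no entry in the mapping, so it becomes NaN
encoded = test["cabin"].map(means)

# hypothetical fallback: use the overall train mean for unseen categories
encoded_filled = encoded.fillna(train["survived"].mean())
print(encoded_filled.tolist())
```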

### Determine the roc-auc using the variable values as input

In [11]:
# calculate a roc-auc value, using the encoded variables as predictions

roc_values = []

for feature in categorical_vars:
    
    roc_values.append(roc_auc_score(y_test, X_test_enc[feature])) 

In [12]:
# into series

m1 = pd.Series(roc_values)
m1.index = categorical_vars
m1.sort_values(ascending=False)

sex         0.749959
pclass      0.670096
cabin       0.628809
embarked    0.593358
dtype: float64

All the features can be kept, because the roc-auc of each of them is higher than 0.5.

Sex seems to be the most important feature to predict survival, as its roc_auc is the highest.
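
To turn the ranking into an actual selection, the roc-auc values can be filtered against a cut-off. The sketch below reuses the values printed above; the threshold of 0.6 is a hypothetical choice for illustration, not a value taken from the authors.

```python
import pandas as pd

# roc-auc values copied from the output above
roc_auc = pd.Series(
    {"sex": 0.749959, "pclass": 0.670096, "cabin": 0.628809, "embarked": 0.593358}
)

threshold = 0.6  # hypothetical cut-off, chosen only for illustration

# keep the features whose single-feature roc-auc exceeds the cut-off
selected = roc_auc[roc_auc > threshold].index.tolist()
print(selected)
```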


## 2. Feature Selection on numerical variables

The procedure is exactly the same, but it requires one additional first step: dividing the continuous variable into bins.

The authors of the method divide each variable into 100 quantiles, that is, 100 bins. With a dataset this small there is no point in such a fine division, so I will divide each variable into 10 bins only.

Working with the numerical variables Age and Fare:

In [13]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data[['age', 'fare', 'survived']],
    data['survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 3), (393, 3))

In [14]:
# fill missing values

X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

#### Bin variable Age

In [15]:
# binning the *train* set // getting the intervals

X_train['age_binned'], intervals = pd.qcut(
    X_train['age'],
    q = 10,
    labels=False,
    retbins=True,
    precision=3,
    duplicates='drop',
)

X_train[['age_binned', 'age']].head(10)

Unnamed: 0,age_binned,age
501,1,13.0
588,1,4.0
402,5,30.0
1193,0,0.0
686,2,22.0
971,0,0.0
117,5,30.0
540,1,2.0
294,8,49.0
261,6,35.0


In [16]:
# count the number of distinct bins

X_train['age_binned'].nunique()

9

In [17]:
# there are only 9 bins, but that is all right: duplicates='drop'
# merges bins whose quantile edges coincide

# display the bins
X_train['age_binned'].unique()

array([1, 5, 0, 2, 8, 6, 7, 4, 3], dtype=int64)
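
Fewer bins than requested can come out of `pd.qcut` when many observations share the same value, because several quantile edges then coincide and `duplicates='drop'` removes the repeats. That is a plausible explanation here, given that missing ages were filled with 0. The sketch below reproduces the effect on synthetic ages, which are an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# synthetic ages (an assumption): 15 zeros standing in for
# filled-in missing values, followed by the values 1 to 85
ages = pd.Series(np.concatenate([np.zeros(15), np.arange(1, 86)]))

binned, edges = pd.qcut(
    ages, q=10, labels=False, retbins=True, duplicates="drop"
)

# the 0% and 10% quantiles are both 0, so one edge is dropped
n_bins = len(edges) - 1
print(n_bins, binned.nunique())
```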

In [18]:
# use the interval limits calculated in the previous cell to
# bin the *test* set

X_test['age_binned'] = pd.cut(x = X_test['age'], bins=intervals, labels=False)

X_test[['age_binned', 'age']].head(10)

Unnamed: 0,age_binned,age
1139,6.0,38.0
533,2.0,21.0
459,7.0,42.0
1150,,0.0
393,3.0,25.0
1189,1.0,4.0
5,7.0,48.0
231,8.0,52.0
330,8.0,57.0
887,,0.0


In [19]:
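# filling in zeros instead of NaNs: pd.cut excludes the lowest edge by
# default, so ages equal to it (0.0) come out as NaN; they belong in bin 0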
X_test.age_binned = X_test.age_binned.fillna(0) 
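
The NaN values in the test set come from the default behaviour of `pd.cut`, which builds intervals that are open on the left, so a value equal to the lowest edge falls outside the first bin. A minimal sketch with hypothetical edges shows the difference that `include_lowest=True` makes. Note that values beyond the highest edge remain NaN in both cases, so filling with 0 would place them in the lowest bin.

```python
import pandas as pd

edges = [0.0, 20.0, 40.0, 80.0]  # hypothetical bin edges learned on train
ages = pd.Series([0.0, 10.0, 20.0, 35.0, 90.0])

# default: intervals are (0, 20], (20, 40], (40, 80], so 0.0 is left out
default_bins = pd.cut(ages, bins=edges, labels=False)

# include_lowest=True closes the first interval on the left
closed_bins = pd.cut(ages, bins=edges, labels=False, include_lowest=True)

print(default_bins.tolist())
print(closed_bins.tolist())
```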

#### Bin Variable Fare

In [20]:
# train
X_train['fare_binned'], intervals = pd.qcut(
    X_train['fare'],
    q=10,
    labels=False,
    retbins=True,
    precision=3,
    duplicates='drop',
)

# test
X_test['fare_binned'] = pd.cut(x = X_test['fare'], bins=intervals, labels=False)

In [21]:
X_test['fare_binned'].nunique()

10

In [22]:
X_test.isnull().sum()

age            0
fare           0
survived       0
age_binned     0
fare_binned    4
dtype: int64

In [23]:
# filling in zeros instead of NaNs
X_test = X_test.fillna(0)

### Replace bins with target mean

In [24]:
# encode the variables with the target mean

binned_vars = ['age_binned', 'fare_binned']

X_train_enc, X_test_enc = mean_encoding(
    X_train[binned_vars+['survived']], X_test[binned_vars+['survived']], binned_vars)

X_train_enc.head()

Unnamed: 0,age_binned,fare_binned
501,0.432432,0.390805
588,0.432432,0.445652
402,0.444444,0.354167
1193,0.290323,0.382022
686,0.350515,0.382022


### Determine roc-auc using encodings

In [25]:
# calculate a roc-auc value, using the encoded variables as predictions

roc_values = []

for feature in binned_vars:
    
    roc_values.append(roc_auc_score(y_test, X_test_enc[feature])) 

In [26]:
# I make a series for easy visualisation

m1 = pd.Series(roc_values)
m1.index = binned_vars
m1.sort_values(ascending=False)

fare_binned    0.711313
age_binned     0.507923
dtype: float64

Fare is a much better predictor of survival than age. Age produces an almost random output: its roc-auc is close to 0.5.


**Some thoughts**

The authors mention that this method lets you compare numerical and categorical variables directly. In a sense this is true. However, we need to keep in mind that the labels of a categorical variable typically do not contain the same percentage of observations, whereas dividing a numerical variable into quantile bins guarantees that each bin contains roughly the same percentage of observations.

Alternatively, instead of binning into quantiles, we can bin into equal-width intervals.
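
The difference between the two binning strategies can be seen by counting observations per bin. The sketch below uses synthetic right-skewed values as a stand-in for a variable such as fare; the data and the choice of 10 bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed values (an assumption, not the Titanic data)
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=30.0, size=1000))

# quantile bins: every bin receives the same number of observations
quantile_bins = pd.qcut(values, q=10, labels=False)

# equal-width bins: every bin spans the same range of values,
# so the number of observations per bin can differ widely
width_bins = pd.cut(values, bins=10, labels=False)

quantile_counts = quantile_bins.value_counts().sort_index()
width_counts = width_bins.value_counts().sort_index()

print(quantile_counts.tolist())
print(width_counts.tolist())
```

With equal-width bins, sparsely populated bins yield noisy target means, which is worth keeping in mind when comparing the resulting roc-auc values.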