## Bonus: Method used in the KDD 2009 competition

I will describe the feature selection approach used by data scientists at the University of Melbourne in the [KDD 2009](http://www.kdd.org/kdd-cup/view/kdd-cup-2009) data science competition. The task was to predict churn from a dataset with a huge number of features.

The authors describe it as an aggressive non-parametric feature selection procedure, based on examining the relationship between each feature and the target. It should therefore be classified as a filter method.

**The procedure consists of the following steps**:

For each categorical variable:

1) Separate the data into train and test sets
2) Determine the mean value of the target within each label of the categorical variable, using the train set
3) Use that mean target value per label as the prediction on the test set and calculate the roc-auc

For each numerical variable:

1) Separate the data into train and test sets
2) Divide the variable into 100 quantiles
3) Calculate the mean target within each quantile, using the train set
4) Use that mean target value per bin as the prediction on the test set and calculate the roc-auc
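
The categorical steps can be sketched on a toy example. The data, the column names (`colour`, `target`) and the helper `mean_target_auc` are hypothetical, invented for illustration; this is a minimal sketch of the idea, not the authors' code.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score


def mean_target_auc(train, test, feature, target):
    """Score one categorical feature (hypothetical helper).

    The mean target per label is learned on the train set, used as the
    prediction on the test set, and evaluated with roc-auc.
    """
    # mean target within each label, train set only
    label_means = train.groupby(feature)[target].mean()

    # the mean target of the label is the prediction for the test set
    predictions = test[feature].map(label_means)

    return roc_auc_score(test[target], predictions)


# toy data, invented for this example
train = pd.DataFrame({
    'colour': ['red', 'red', 'blue', 'blue', 'blue', 'green'],
    'target': [1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({
    'colour': ['red', 'blue', 'green', 'red'],
    'target': [1, 0, 0, 1],
})

auc = mean_target_auc(train, test, 'colour', 'target')
print(auc)
```

In this toy data the label separates the classes perfectly on the test set, so the roc-auc is as high as it can be. Real features will land somewhere between 0.5 and 1.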


The authors quote the following advantages of the method:

- Speed: computing mean and quantiles is direct and efficient
- Stability with respect to scale: extreme values of continuous variables do not skew the predictions
- Comparable between categorical and numerical variables
- Accommodation of non-linearities

See my notes at the end of the notebook for a discussion of the method and the authors' assumptions.

The procedure will become clearer as I proceed with the demonstration. I will use the Titanic dataset from Kaggle.

**Reference**:
[Predicting customer behaviour: The University of Melbourne's KDD Cup Report. Miller et al. JMLR Workshop and Conference Proceedings 7:45-55](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf)

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

In [None]:
# load the titanic dataset
data = pd.read_csv('titanic_train.csv')
data.shape

(891, 12)

In [None]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# Variable preprocessing:
# Cabin contains missing data
# I will replace missing data by adding a category "Missing"
# then I will narrow down the different cabins by selecting only the
# first letter, which represents the deck in which the cabin was located

data['Cabin'] = data['Cabin'].fillna('Missing')

# captures first letter of string (the letter of the cabin)
data['Cabin'] = data['Cabin'].str[0]
data['Cabin'].unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

### Important

In all feature selection procedures, it is good practice to select the features by examining only the training set. This helps to avoid overfitting.

In [None]:
# separate train and test sets
# I will only use the categorical variables and the target

X_train, X_test, y_train, y_test = train_test_split(
    data[['Pclass', 'Sex', 'Embarked', 'Cabin', 'Survived']],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((623, 5), (268, 5))

### Feature selection on categorical variables

First, I will demonstrate the feature selection procedure over categorical variables. The Titanic dataset contains 4 categorical variables, which are Sex, Pclass, Cabin and Embarked.

In the next cell I create a function that calculates the mean of Survived (which is equivalent to the probability of survival) within each label of a categorical variable. Using the training set only, it creates a dictionary that maps each label of the variable to a probability of survival.

Then the function replaces the labels, in both the train and test sets, with the probability of survival. It is like making a prediction of the outcome using only the label of the variable.

In this way, the function replaces the original strings, by probabilities. 

The bottom line of this method is that we **use just the label of the variable to estimate the probability of survival of the passenger**. 
A bit like "Tell me which one was your Cabin, and I will tell you your probability of Survival".

If the labels of a categorical variable, and therefore the variable itself, are good predictors, we should obtain a roc-auc above 0.5 for that variable when we compare those probabilities with the real outcome, that is, whether the passenger survived or not.

In [None]:
def mean_encoding(df_train, df_test):
    # temporary copy of the original dataframes
    df_train_temp = df_train.copy()
    df_test_temp = df_test.copy()
    
    # Pclass is already numeric and is left as it is here
    for col in ['Sex', 'Cabin', 'Embarked']:
        # make a dictionary mapping labels / categories to the mean target for that label
        risk_dict = df_train.groupby([col])['Survived'].mean().to_dict()
        
        # re-map the labels
        df_train_temp[col] = df_train[col].map(risk_dict)
        df_test_temp[col] = df_test[col].map(risk_dict)
    
    # drop the target
    df_train_temp.drop(['Survived'], axis=1, inplace=True)
    df_test_temp.drop(['Survived'], axis=1, inplace=True)        
    return df_train_temp, df_test_temp
        
X_train_enc, X_test_enc = mean_encoding(X_train, X_test)
X_train_enc.head()

Unnamed: 0,Pclass,Sex,Embarked,Cabin
857,1,0.196078,0.341357,0.740741
52,1,0.753488,0.564815,0.692308
386,3,0.196078,0.341357,0.303609
124,1,0.196078,0.341357,0.692308
578,3,0.753488,0.564815,0.303609


The strings were replaced by probabilities.
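
One caveat: a label that appears only in the test set has no entry in the dictionary, so `map` returns NaN for it, and `roc_auc_score` will not accept NaN. The sketch below shows the effect on toy data and one possible fallback, the overall mean target of the train set. The fallback and the `deck` column are my assumptions for illustration and are not part of the authors' procedure.

```python
import pandas as pd

# toy data, invented for this example
train = pd.DataFrame({
    'deck': ['A', 'A', 'B', 'B'],
    'target': [1, 0, 1, 1],
})
test = pd.DataFrame({'deck': ['A', 'B', 'T']})  # 'T' never appears in train

# mean target per label, learned on the train set only
risk_dict = train.groupby('deck')['target'].mean().to_dict()

# overall mean target of the train set, used as a fallback (an assumption)
global_mean = train['target'].mean()

encoded = test['deck'].map(risk_dict)         # 'T' becomes NaN
encoded_filled = encoded.fillna(global_mean)  # NaN replaced by the fallback

print(encoded.isnull().sum())
print(encoded_filled.tolist())
```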

In [None]:
# now, I calculate a roc-auc value, using the probabilities that we used to
# replace the labels, and comparing it with the true target:

roc_values = []
for feature in ['Sex', 'Cabin', 'Embarked']:
    roc_values.append(roc_auc_score(y_test, X_test_enc[feature])) 

In [None]:
# I make a series for easy visualisation
m1 = pd.Series(roc_values)
m1.index = ['Sex', 'Cabin', 'Embarked']
m1.sort_values(ascending=False)

Sex         0.771667
Cabin       0.641637
Embarked    0.577500
dtype: float64

We can see that all the features carry some predictive signal, because the roc-auc of each of them is higher than 0.5. In addition, Sex seems to be the most important feature for predicting survival, as its roc-auc is the highest.

As you can see, this is a powerful yet straightforward approach to feature selection.
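
To turn the ranking into a selection, one option is to keep only the features whose roc-auc exceeds a cut-off. The sketch below reuses the roc-auc values reported above. The cut-off of 0.6 is a hypothetical value picked for illustration, not a recommendation from the authors.

```python
import pandas as pd

# roc-auc values reported above
roc_values = pd.Series({'Sex': 0.771667, 'Cabin': 0.641637, 'Embarked': 0.577500})

# hypothetical cut-off; choose it to suit your own data
threshold = 0.6

# keep the features whose roc-auc is above the cut-off
selected = roc_values[roc_values > threshold].index.tolist()
print(selected)
```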

### Feature Selection on numerical variables

The procedure is the same, but it requires one additional first step: dividing the continuous variable into bins. The authors of the method divide the variable into 100 quantiles, that is, 100 bins. In principle, you could use fewer bins. Here I will use only 10.

I will work with the numerical variables Age and Fare.
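
Before working through the Titanic variables, here is a compact sketch of the whole numerical procedure on synthetic data. The variable `x`, the target `y` and the way they are generated are invented for illustration. Clipping the test values to the train range is my own simplification so that every test value falls into a bin; the walkthrough below does not do this.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# synthetic data: the target is more likely to be 1 for larger x
x = rng.normal(size=1000)
y = (x + rng.normal(size=1000) > 0).astype(int)
df = pd.DataFrame({'x': x, 'y': y})
train, test = df.iloc[:700].copy(), df.iloc[700:].copy()

# quantile bins learned on the train set; edges holds the bin limits
train['x_bin'], edges = pd.qcut(train['x'], q=10, labels=False, retbins=True)

# clip the test values to the train range (an assumption), then reuse the edges
clipped = test['x'].clip(edges[0], edges[-1])
test['x_bin'] = pd.cut(clipped, bins=edges, labels=False, include_lowest=True)

# mean target per bin, train set only
bin_means = train.groupby('x_bin')['y'].mean()

# the mean target of the bin is the prediction for the test set
auc = roc_auc_score(test['y'], test['x_bin'].map(bin_means))
print(auc)
```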

In [None]:
# load dataset
data = pd.read_csv('titanic_train.csv')
data.shape

(891, 12)

In [None]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare', 'Survived']],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((623, 3), (268, 3))

In [None]:
# I will divide Age in 10 bins. I use the qcut (quantile cut)
# function from pandas and I indicate that I want 9 cutting points,
# thus 10 bins.
# retbins= True indicates that I want to capture the limits of
# each interval (so I can then use them to cut the test set)

# create 10 labels, one for each quantile
# instead of having the quantile limits, the new variable
# will have labels in its bins

labels = ['Q' + str(i + 1) for i in range(0, 10)]

X_train['Age_binned'], intervals = pd.qcut(
    X_train.Age,
    10,
    labels=labels,
    retbins=True,
    precision=3,
    duplicates='drop',
)

X_train[['Age_binned', 'Age']].head(10)

Unnamed: 0,Age_binned,Age
857,Q10,51.0
52,Q9,49.0
386,Q1,1.0
124,Q10,54.0
578,,
549,Q1,8.0
118,Q4,24.0
12,Q3,20.0
157,Q6,30.0
127,Q4,24.0


In [None]:
# we can count the number of bins. It has 11 because Age contains missing data. 
# Those are kept in a separate bin (NaN)
len(X_train.Age_binned.unique())

11

In [None]:
# here we see the NaN values
X_train.Age_binned.unique()

[Q10, Q9, Q1, NaN, Q4, ..., Q6, Q2, Q7, Q5, Q8]
Length: 11
Categories (10, object): [Q1 < Q2 < Q3 < Q4 ... Q7 < Q8 < Q9 < Q10]

In [None]:
# and these are the cutting points of the intervals
intervals, len(intervals)

(array([ 0.67, 13.1 , 19.  , 22.  , 25.4 , 29.  , 32.  , 36.  , 41.  ,
        49.  , 80.  ]), 11)

In [None]:
# now I use the boundaries calculated in the previous cell to
# bin the testing set

X_test['Age_binned'] = pd.cut(x = X_test.Age, bins=intervals, labels=labels)
X_test[['Age_binned', 'Age']].head(10)

Unnamed: 0,Age_binned,Age
495,,
648,,
278,Q1,7.0
31,,
255,Q5,29.0
298,,
609,Q8,40.0
318,Q6,31.0
484,Q4,25.0
367,,


In [None]:
# same as before, it shows 10 bins and the NaN separately.
len(X_test.Age_binned.unique())

11

In [None]:
# here we see the NaN values
X_test.Age_binned.unique()

[NaN, Q1, Q5, Q8, Q6, ..., Q2, Q7, Q3, Q10, Q9]
Length: 11
Categories (10, object): [Q1 < Q2 < Q3 < Q4 ... Q7 < Q8 < Q9 < Q10]

In [None]:
# and here we count the NaN values
X_train[['Age_binned']].isnull().sum(), X_test[['Age_binned']].isnull().sum()

(Age_binned    121
 dtype: int64, Age_binned    57
 dtype: int64)

In [None]:
# in order to replace the NaN values by a new category
# called "Missing", first I need to recast the variables as
# objects

X_train['Age_binned'] = X_train['Age_binned'].astype('O')
X_test['Age_binned'] = X_test['Age_binned'].astype('O')

In [None]:
# and now I replace the missing values with a new category
X_train['Age_binned'] = X_train['Age_binned'].fillna('Missing')
X_test['Age_binned'] = X_test['Age_binned'].fillna('Missing')

In [None]:
# I create a dictionary that maps the bins to the mean of survival
risk_dict = X_train.groupby(['Age_binned'])['Survived'].mean().to_dict()

# re-map the labels, I replace the bins by the probability of survival
X_train['Age_binned'] = X_train['Age_binned'].map(risk_dict)
X_test['Age_binned'] = X_test['Age_binned'].map(risk_dict)

X_train['Age_binned'].head()

857    0.360000
52     0.360000
386    0.568627
124    0.360000
578    0.305785
Name: Age_binned, dtype: float64

In [None]:
# now, I calculate a roc-auc value, using the probabilities that we used to
# replace the labels, and comparing it with the true target:

roc_auc_score(y_test, X_test['Age_binned'])

0.5723809523809524

It is higher than 0.5, so in principle Age does have some predictive power, although it seems worse than any of the categorical variables we evaluated before.

Let's do the same quickly for Fare:

#### Fare

In [None]:
# separate the Fare values into 10 bins

labels = ['Q' + str(i + 1) for i in range(0, 10)]

# train
X_train['Fare_binned'], intervals = pd.qcut(
    X_train.Fare,
    10,
    labels=labels,
    retbins=True,
    precision=3,
    duplicates='drop',
)

# test
X_test['Fare_binned'] = pd.cut(x = X_test.Fare, bins=intervals, labels=labels)

In [None]:
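# test shows some missing data. These appear when the Fare values can't
# be allocated to any of the bins calculated on the train set, for example
# values outside the train range, or equal to the lowest edge, which
# pd.cut excludes by default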

X_test['Fare_binned'].isnull().sum(), X_train['Fare_binned'].isnull().sum()

(8, 0)
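
The toy sketch below reproduces this effect and shows one possible remedy, widening the outer edges to infinity. The fare values are invented, and the remedy is my assumption rather than part of the authors' procedure; in this notebook I simply treat those observations as missing.

```python
import numpy as np
import pandas as pd

# toy data, invented for this example
train_fare = pd.Series([0.0, 5.0, 10.0, 20.0, 40.0, 80.0])
test_fare = pd.Series([0.0, 7.5, 100.0])

# quantile edges learned on the train set
_, edges = pd.qcut(train_fare, q=3, retbins=True)

# reusing the edges as they are: 0.0 (the lowest edge) and 100.0
# (above the train maximum) are not allocated to any bin
naive = pd.cut(test_fare, bins=edges)

# widening the outer edges so that every value falls into a bin
wide_edges = edges.copy()
wide_edges[0], wide_edges[-1] = -np.inf, np.inf
widened = pd.cut(test_fare, bins=wide_edges)

print(naive.isnull().sum(), widened.isnull().sum())
```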

In [None]:
# cast both variables as objects, as I did for Age

X_train['Fare_binned'] = X_train['Fare_binned'].astype('O')
X_test['Fare_binned'] = X_test['Fare_binned'].astype('O')

In [None]:
# I create a dictionary that maps the bins to the mean of survival
risk_dict = X_train.groupby(['Fare_binned'])['Survived'].mean().to_dict()

# re-map the labels, I replace the bins by the probability of survival
X_train['Fare_binned'] = X_train['Fare_binned'].map(risk_dict)
X_test['Fare_binned'] = X_test['Fare_binned'].map(risk_dict)

X_train['Fare_binned'].head()


857    0.492063
52     0.533333
386    0.354839
124    0.730159
578    0.396825
Name: Fare_binned, dtype: float64

In [None]:
# now, I calculate a roc-auc value, using the probabilities that we used to
# replace the labels, and comparing it with the true target:

# first I assign a survival probability of zero to the missing data
X_test['Fare_binned'] = X_test['Fare_binned'].fillna(0)

# then I calculate the roc_auc
roc_auc_score(y_test, X_test['Fare_binned'])

Fare is a much better predictor of Survival than Age.

The authors mention that this method lets you compare numerical and categorical variables directly. In a sense this is true. However, keep in mind that the labels of a categorical variable typically do not each hold the same percentage of observations, whereas the quantile bins of a numerical variable each hold roughly the same percentage of observations (ties aside).
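
A quick way to see the difference is to compare the share of observations per bin with the share per label. The two series below are toy data invented for this example.

```python
import pandas as pd

# toy data, invented for this example
numeric = pd.Series(range(100))
categorical = pd.Series(['a'] * 70 + ['b'] * 25 + ['c'] * 5)

# share of observations in each of 10 quantile bins
bin_share = pd.qcut(numeric, q=10).value_counts(normalize=True)

# share of observations in each label
label_share = categorical.value_counts(normalize=True)

print(bin_share.round(2).tolist())
print(label_share.round(2).to_dict())
```

Here every quantile bin holds a tenth of the observations, while the labels hold very different shares, so a mean target computed for a rare label rests on far fewer observations.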

Alternatively, instead of binning into quantiles, we can bin into equal-distance bins. To do this, we calculate the range (max value - min value) and divide it by the number of bins we want to construct. That determines the cut-points for the bins.
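
A minimal sketch of equal-distance binning, assuming toy age values and 5 bins (both invented for illustration):

```python
import numpy as np
import pandas as pd

# toy data, invented for this example
train_age = pd.Series([2.0, 10.0, 18.0, 35.0, 52.0])
n_bins = 5

# cut-points: split the (max - min) range into n_bins intervals of equal width
edges = np.linspace(train_age.min(), train_age.max(), n_bins + 1)

# include_lowest=True keeps the minimum value inside the first bin
binned = pd.cut(train_age, bins=edges, labels=False, include_lowest=True)

print(edges.tolist())
print(binned.tolist())
```

Unlike quantile bins, equal-distance bins can end up with very different numbers of observations, and some can be empty, as the third bin is here.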

To learn more about discretisation, see my course "Feature Engineering for Machine Learning", also on Udemy. You will find more details in the final section of this course.

That is all for this lecture, I hope you enjoyed it and see you in the next one!