### The procedure consists of the following steps
For categorical variables:
1. Separate the data into train and test sets.
2. Determine the mean value of the target within each label of the categorical
    variable, using the train set.
3. Use that mean target value per label as the prediction on the test set and calculate the ROC-AUC.
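
The three categorical steps can be sketched on a small made-up dataset. This is only an illustration under stated assumptions: the `deck` and `target` columns and their values are invented here and are not the Titanic data used later.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): one categorical feature and a binary target.
toy = pd.DataFrame({
    "deck":   ["A", "A", "B", "B", "C", "C"] * 20,
    "target": [1, 1, 1, 0, 0, 0] * 20,
})

# Step 1: split into train and test.
train, test = train_test_split(toy, test_size=0.3, random_state=0, stratify=toy["target"])

# Step 2: mean target per label, learned on the training set only.
label_means = train.groupby("deck")["target"].mean()

# Step 3: use those means as the "prediction" on the test set and score it.
test_scores = test["deck"].map(label_means)
auc = roc_auc_score(test["target"], test_scores)
print(label_means.to_dict(), round(auc, 3))
```

Learning the means on the training set only keeps the test-set score an honest estimate of how well the feature ranks unseen rows.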

For numerical variables:
1. Separate the data into train and test sets.
2. Divide the variable into 100 quantiles.
3. Calculate the mean target within each quantile, using the train set.
4. Use that mean target value per bin as the prediction on the test set and calculate the ROC-AUC.
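
A minimal sketch of the numerical steps on synthetic data follows. The data, the choice of 10 bins, and the decision to open the outer bin edges to infinity are assumptions made for this illustration, not part of the original method description.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (made up): the target is more likely to be 1 for larger x.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = (rng.random(2000) < 1 / (1 + np.exp(-2 * x))).astype(int)
toy = pd.DataFrame({"x": x, "target": y})

# Step 1: split into train and test.
train, test = train_test_split(toy, test_size=0.3, random_state=0)

# Step 2: quantile bin edges from the training set (10 bins to keep them well populated).
_, edges = pd.qcut(train["x"], q=10, retbins=True, duplicates="drop")
edges[0], edges[-1] = -np.inf, np.inf  # assumption: outer bins absorb unseen extremes

# Step 3: mean target per bin on the training set (bins identified by integer codes).
train_bin = pd.cut(train["x"], bins=edges, labels=False)
bin_means = train["target"].groupby(train_bin).mean()

# Step 4: put the test rows into the same bins and score the bin means.
test_bin = pd.cut(test["x"], bins=edges, labels=False)
test_scores = test_bin.map(bin_means)
auc = roc_auc_score(test["target"], test_scores)
print(round(auc, 3))
```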

### Advantages
1. Speed: computing means and quantiles is direct and efficient.
2. Stability with respect to scale: extreme values of continuous variables do not skew the predictions.
3. Comparability: categorical and numerical variables are scored with the same metric, so they can be compared directly.
4. Accommodation of non-linearities.

[Note: works only for binary classification problems.]

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the titanic dataset
data=pd.read_csv('datasets/titanic.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Variable preprocessing:
# Cabin contains missing data.
# We'll replace the missing values with a new category, "Missing",
# then narrow down the different cabins by keeping only the
# first letter, which represents the deck the cabin was on.

data['Cabin']=data['Cabin'].fillna('Missing')
print(data['Cabin'].unique())
data['Cabin']=data['Cabin'].str[0]
data['Cabin'].unique()       # There is no 'M' deck, so 'M' can safely stand for a missing deck

['Missing' 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78'
 'D33' 'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26'
 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7'
 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87'
 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26'
 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124'
 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44'
 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20'
 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101'
 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48'
 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24'
 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3'
 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24'
 'C50' 'B42' 'C148']


array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)
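
The same two-step transformation can be checked on a handful of values. The short series below is made up in the format of the `Cabin` column and is only meant to show what `fillna` followed by `.str[0]` produces.

```python
import numpy as np
import pandas as pd

# A few cabin values in the same format as the Titanic column (NaN = unknown cabin).
cabin = pd.Series(["C85", np.nan, "B57 B59 B63 B66", "F G73", np.nan])

# Label missing values first, then keep only the first character (the deck letter).
deck = cabin.fillna("Missing").str[0]
print(deck.tolist())  # ['C', 'M', 'B', 'F', 'M']
```

Passengers listed with several cabins keep the deck of the first cabin only, which is a simplification.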

In [5]:
# Separate the train and test sets
# We'll only use the categorical variables and the target
print(data.info())      # string-valued categorical variables show up with dtype 'object'

x_train,x_test,y_train,y_test=train_test_split(data[['Pclass','Sex','Embarked','Cabin','Survived']],
                                              data['Survived'],test_size=0.3,random_state=0)
x_train.shape,x_test.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None


((623, 5), (268, 5))

## Feature Selection on Categorical Variables
First, we will work with the categorical variables.

We'll create a function that calculates the mean of Survived (the probability of survival)
within each label of a categorical variable. Using the training set, it builds a dictionary that maps each
label to a probability of survival.

The function then replaces the labels in both the training and the testing set with those probabilities. This makes the
question a bit like "Tell me which one was your cabin, and I'll tell you your probability of survival."

For a feature to be a useful predictor, its ROC-AUC must be greater than 0.5, the value expected from random guessing.
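
To see where the 0.5 threshold comes from, recall that ROC-AUC equals the fraction of (positive, negative) pairs in which the positive row receives the higher score, with ties counted as half. The helper `pairwise_auc` below is a hypothetical, plain-Python illustration of that definition; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0]
print(pairwise_auc(y, [0.9, 0.8, 0.3, 0.1]))  # 1.0: every positive scores higher
print(pairwise_auc(y, [0.5, 0.5, 0.5, 0.5]))  # 0.5: a constant score carries no information
print(pairwise_auc(y, [0.1, 0.3, 0.8, 0.9]))  # 0.0: perfectly inverted ranking
```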

In [6]:
# CHECKING MANUALLY FOR PCLASS
data.groupby('Pclass')['Survived'].mean() # the mean of a 0/1 target is the share of 1s, i.e. the survival probability per class

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [7]:
# Understanding the data and what mean() does.
data.groupby('Pclass')['Survived'].describe()

Pclass,count,mean,std,min,25%,50%,75%,max
1,216.0,0.62963,0.484026,0.0,0.0,1.0,1.0,1.0
2,184.0,0.472826,0.500623,0.0,0.0,0.0,1.0,1.0
3,491.0,0.242363,0.428949,0.0,0.0,0.0,0.0,1.0


In [8]:
print(data.groupby('Sex')['Survived'].mean()) # Looks like very few males survived the sinking.
data.groupby('Sex')['Survived'].describe()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64


Sex,count,mean,std,min,25%,50%,75%,max
female,314.0,0.742038,0.438211,0.0,0.0,1.0,1.0,1.0
male,577.0,0.188908,0.391775,0.0,0.0,0.0,0.0,1.0


In [9]:
def mean_encoding(df_train,df_test):
    # Temporary copy of the original dataset since we are dropping 'Survived' col at the end.
    df_train_temp=df_train.copy()
    df_test_temp=df_test.copy()
    
    for col in ['Pclass','Sex','Cabin','Embarked']:
        risk_dict=df_train.groupby([col])['Survived'].mean().to_dict()
        
        # Remap the labels
        df_train_temp[col]=df_train[col].map(risk_dict)
        df_test_temp[col]=df_test[col].map(risk_dict)
        
    # Dropping the target variables
    df_train_temp.drop(labels=['Survived'],inplace=True,axis=1)
    df_test_temp.drop(labels=['Survived'],inplace=True,axis=1)
        
    return df_train_temp,df_test_temp

In [10]:
# Calling the function
x_train_enc,x_test_enc=mean_encoding(x_train,x_test)
x_train_enc.shape,x_test_enc.shape

((623, 4), (268, 4))

In [11]:
x_train_enc.head()

Unnamed: 0,Pclass,Sex,Embarked,Cabin
857,0.621795,0.196078,0.341357,0.740741
52,0.621795,0.753488,0.564815,0.692308
386,0.241791,0.196078,0.341357,0.303609
124,0.621795,0.196078,0.341357,0.692308
578,0.241791,0.753488,0.564815,0.303609
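
One caveat with this kind of mapping: a label that appears only in the test set has no entry in the dictionary, so `map` returns NaN for it and `roc_auc_score` would then fail. The helper below is a hypothetical sketch (the name `encode_with_fallback` and the choice of the overall training mean as fallback are assumptions, not part of the original function) of one way to guard against that.

```python
import pandas as pd

def encode_with_fallback(train_col, train_target, test_col):
    """Mean-encode test_col from training data; unseen labels get the overall training mean."""
    label_means = train_target.groupby(train_col).mean()
    return test_col.map(label_means).fillna(train_target.mean())

train_col = pd.Series(["C", "C", "M", "M", "M"])
train_target = pd.Series([1, 0, 0, 0, 1])
test_col = pd.Series(["C", "M", "T"])  # 'T' never appears in the training data

encoded = encode_with_fallback(train_col, train_target, test_col)
print(encoded.tolist())
```

Here `'T'` receives the overall training mean (0.4) instead of NaN, while `'C'` and `'M'` receive their own label means.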


In [12]:
# Now we'll use these probability values
# against the y values to calculate the roc_auc values
roc_values=[]
for col in x_train_enc.columns:
    roc_values.append(roc_auc_score(y_test,x_test_enc[col].values))

In [13]:
# Converting into series
roc_values=pd.Series(roc_values)
roc_values.index=x_test_enc.columns
roc_values

Pclass      0.680476
Sex         0.771667
Embarked    0.577500
Cabin       0.641637
dtype: float64

Hence, all four features have a ROC-AUC above 0.5, so each carries some predictive power, with Sex being the most predictive (0.77).

## Feature Selection on numerical variables
The procedure is exactly the same, but it requires one additional first step: dividing the continuous
variable into bins. The author of the method divides the variables into 100 quantiles, that is, 100 bins. In principle, we could divide the variables into fewer bins. Here we'll divide the variables into 10 bins.
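
As a quick illustration of what quantile binning does, the sketch below bins synthetic, right-skewed values (made up for this example, not the Titanic data) with `pd.qcut` and counts the rows per bin.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values (made up for illustration).
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=30.0, size=1000))

# qcut: 10 quantile bins, so each bin holds about the same number of rows.
quantile_bins, edges = pd.qcut(values, q=10, retbins=True)
counts = quantile_bins.value_counts()
print(counts.tolist())  # roughly 100 rows in every bin
print(len(edges))       # 11 edges delimit 10 bins
```

Because every bin holds a similar number of rows, the mean target per bin is estimated from a comparable sample size everywhere, even in the long tail. Fewer bins give more rows per bin and therefore steadier means, at the cost of resolution.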

In [14]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,M,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,M,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,M,S


In [15]:
# we'll work with 'Age' and 'Fare' as numerical variables
# separate the dataset into train and test
x_train,x_test,y_train,y_test=train_test_split(data[['Age','Fare','Survived']],
                                              data['Survived'],test_size=0.3,random_state=0)
x_train.shape,x_test.shape

((623, 3), (268, 3))

In [16]:
# We'll divide Age into 10 bins using the qcut (quantile cut)
# function from pandas; q=10 means 9 interior cut points.
# retbins=True: also return the limits of each interval.
# We create 10 labels, one for each quantile bin, so that
# the new variable holds these labels instead of the
# interval limits.

labels=['Q'+str(i+1) for i in range(0,10)]

x_train['Age_binned'],intervals=pd.qcut(x_train['Age'],q=10,labels=labels,retbins=True,precision=3,duplicates='drop')
x_train[['Age_binned','Age']].head()

Unnamed: 0,Age_binned,Age
857,Q10,51.0
52,Q9,49.0
386,Q1,1.0
124,Q10,54.0
578,,


In [19]:
# 11 unique values: the 10 bins plus NaN for the rows where Age is missing.
x_train['Age_binned'].unique(),intervals

([Q10, Q9, Q1, NaN, Q4, ..., Q6, Q2, Q7, Q5, Q8]
 Length: 11
 Categories (10, object): [Q1 < Q2 < Q3 < Q4 ... Q7 < Q8 < Q9 < Q10],
 array([ 0.67, 13.1 , 19.  , 22.  , 25.4 , 29.  , 32.  , 36.  , 41.  ,
        49.  , 80.  ]))

In [22]:
# now we'll use the boundaries calculated in the previous cell to
# bin the testing set

x_test['Age_binned']=pd.cut(x_test['Age'],bins=intervals,labels=labels)
x_test[['Age_binned','Age']].head(10)

Unnamed: 0,Age_binned,Age
495,,
648,,
278,Q1,7.0
31,,
255,Q5,29.0
298,,
609,Q8,40.0
318,Q6,31.0
484,Q4,25.0
367,,
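
Note that `pd.cut` with fixed edges returns NaN not only for missing values but also for values outside the edges, and by default the lowest edge itself is excluded. The sketch below uses made-up edges and values to show this, together with one possible alternative (opening the outer edges to infinity), which is an assumption of this example and not what the notebook does.

```python
import numpy as np
import pandas as pd

# Bin edges as they might come from a training set (values chosen for illustration).
train_edges = np.array([1.0, 20.0, 40.0, 80.0])
test_values = pd.Series([0.5, 1.0, 30.0, 95.0])

# Default behaviour: bins are (1, 20], (20, 40], (40, 80], so 0.5, 1.0 and 95.0 fall outside.
default_bins = pd.cut(test_values, bins=train_edges, labels=False)
print(default_bins.tolist())  # [nan, nan, 1.0, nan]

# Possible alternative: open the outer edges so every finite value lands in a bin.
open_edges = train_edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
open_bins = pd.cut(test_values, bins=open_edges, labels=False)
print(open_bins.tolist())     # [0, 0, 1, 2]
```

In this notebook such out-of-range test values are treated the same way as missing values, since both end up as NaN after binning.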


In [33]:
# Let's deal with NaN values, replace NaN values by a new Category
# called 'Missing' as we did with categorical variables.

# First count the null values
print('TRAIN SET: {}'.format(x_train[x_train['Age_binned'].isna()].shape[0]))
print('TEST SET : {}'.format(x_test.Age_binned.isna().sum()))

# Checking the dtype
print('dtype for age_binned : {}'.format(x_train['Age_binned'].dtypes))

# 'Missing' is not an existing category, so cast to object ('O') before filling
x_train['Age_binned']=x_train['Age_binned'].astype('O')
x_test['Age_binned']=x_test['Age_binned'].astype('O')

# Now replacing the missing data
x_train['Age_binned']=x_train['Age_binned'].fillna('Missing')
x_test['Age_binned']=x_test['Age_binned'].fillna('Missing')

x_train.head()

TRAIN SET: 121
TEST SET : 57
dtype for age_binned : category


Unnamed: 0,Age,Fare,Survived,Age_binned
857,51.0,26.55,1,Q10
52,49.0,76.7292,1,Q9
386,1.0,46.9,0,Q1
124,54.0,77.2875,0,Q10
578,,14.4583,0,Missing


In [34]:
# Let's create a dictionary that maps each bin to its mean survival rate
risk_dict=x_train.groupby('Age_binned')['Survived'].mean().to_dict()

print("Dictionary contains : {}".format(risk_dict))

x_train['Age_binned']=x_train['Age_binned'].map(risk_dict)
x_test['Age_binned']=x_test['Age_binned'].map(risk_dict)

x_train['Age_binned'].head()

Dictionary contains : {'Missing': 0.30578512396694213, 'Q1': 0.5686274509803921, 'Q10': 0.36, 'Q2': 0.43103448275862066, 'Q3': 0.2826086956521739, 'Q4': 0.32608695652173914, 'Q5': 0.4166666666666667, 'Q6': 0.4666666666666667, 'Q7': 0.48214285714285715, 'Q8': 0.35, 'Q9': 0.36}


857    0.360000
52     0.360000
386    0.568627
124    0.360000
578    0.305785
Name: Age_binned, dtype: float64

In [35]:
# Now, let's calculate the roc-auc values, using the probabilities that we used to
# replace the labels, and compare to the target.
roc_auc_score(y_test,x_test['Age_binned'])

0.5723809523809524

### Conclusion
Since the value is greater than 0.5, Age does in principle have some predictive power, although it seems worse than any of the categorical variables we evaluated before.

In [39]:
# FOLLOWING THE SAME STEPS FOR 'FARE' 
x_train['fare_binned'],intervals=pd.qcut(x_train['Fare'],q=10,labels=labels,retbins=True,precision=3,duplicates='drop')

x_test['fare_binned']=pd.cut(x_test['Fare'],bins=intervals,labels=labels)

x_train[['fare_binned','Fare']].head()

Unnamed: 0,fare_binned,Fare
857,Q7,26.55
52,Q9,76.7292
386,Q8,46.9
124,Q10,77.2875
578,Q5,14.4583


In [41]:
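# Counting null values
# Fare has no missing data, so the NaNs in the test set are fares that fall outside
# the bin limits learned on the training set (pd.cut leaves those unassigned).
print('Train set: {}\nTest set: {}'.format(x_train.fare_binned.isnull().sum(),
                                          x_test.fare_binned.isnull().sum()))

# Parse as 'O' dtype
x_train['fare_binned']=x_train['fare_binned'].astype('O')
x_test['fare_binned']=x_test['fare_binned'].astype('O')

x_test['fare_binned']=x_test['fare_binned'].fillna('Missing')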

x_test.head()

Train set: 0
Test set: 8


Unnamed: 0,Age,Fare,Survived,Age_binned,fare_binned
495,,14.4583,0,0.305785,Q5
648,,7.55,0,0.305785,Q1
278,7.0,29.125,0,0.568627,Q8
31,,146.5208,1,0.305785,Q10
255,29.0,15.2458,1,0.416667,Q6


In [42]:
# Creating dict with the mean values
risk_dict=x_train.groupby('fare_binned')['Survived'].mean().to_dict()

print('Dict Value :{}'.format(risk_dict))

x_train['fare_binned']=x_train.fare_binned.map(risk_dict)
x_test['fare_binned']=x_test.fare_binned.map(risk_dict)

x_train.head()

Dict Value :{'Q1': 0.12698412698412698, 'Q10': 0.7301587301587301, 'Q2': 0.3709677419354839, 'Q3': 0.14705882352941177, 'Q4': 0.26785714285714285, 'Q5': 0.3968253968253968, 'Q6': 0.47619047619047616, 'Q7': 0.49206349206349204, 'Q8': 0.3548387096774194, 'Q9': 0.5333333333333333}


Unnamed: 0,Age,Fare,Survived,Age_binned,fare_binned
857,51.0,26.55,1,0.36,0.492063
52,49.0,76.7292,1,0.36,0.533333
386,1.0,46.9,0,0.568627,0.354839
124,54.0,77.2875,0,0.36,0.730159
578,,14.4583,0,0.305785,0.396825


In [45]:
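# 'Missing' has no entry in risk_dict (the training set had no unassigned fares),
# so map() returned NaN for those test rows; replace them with 0
x_test['fare_binned']=x_test['fare_binned'].fillna(0)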

# Calculating the roc_values
roc_auc_score(y_test,x_test['fare_binned'])

0.7253869047619047

Thus, we can see that 'Fare' is a much better predictor than 'Age'.
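
Because every feature is scored with the same metric, the categorical and numerical results can be put side by side. The sketch below simply collects the ROC-AUC values reported in the cells above (rounded to six decimals) and sorts them; the variable names are chosen for this illustration.

```python
import pandas as pd

# ROC-AUC values reported in the cells above (categorical and numerical together).
roc_all = pd.Series({
    "Pclass": 0.680476, "Sex": 0.771667, "Embarked": 0.577500,
    "Cabin": 0.641637, "Age": 0.572381, "Fare": 0.725387,
})

# Rank the features from most to least predictive.
ranked = roc_all.sort_values(ascending=False)
print(ranked)

# Keep only the features that beat random guessing.
print(ranked[ranked > 0.5].index.tolist())
```

On these numbers the order is Sex, Fare, Pclass, Cabin, Embarked, Age, and all six features stay above the 0.5 threshold.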