### The procedure consists of the following steps
For categorical variables:
1. Separate the data into train and test sets.
2. Determine the mean value of the target within each label of the categorical
    variable, using the train set.
3. Use that mean target value per label as the prediction on the test set and calculate the ROC-AUC.

For each numerical variable:
1. Separate the data into train and test sets.
2. Divide the variable into 100 quantile bins (percentiles).
3. Calculate the mean target within each bin using the training set.
4. Use that mean target value per bin as the prediction on the test set and calculate the ROC-AUC.
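
The numerical steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the Titanic set; the function name `binned_mean_target_auc`, the column names, and the choice to widen the outer bin edges to infinity and to fall back to the overall train mean are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def binned_mean_target_auc(x_train, x_test, y_train, y_test, n_bins=100):
    """ROC-AUC of a per-bin mean-target prediction for one numerical variable."""
    # Quantile bin edges are learned on the training set only
    _, edges = pd.qcut(x_train, q=n_bins, retbins=True, duplicates='drop')
    # Assumption: widen the outer edges so test values outside the train range still get a bin
    edges[0], edges[-1] = -np.inf, np.inf
    train_bins = pd.cut(x_train, bins=edges, labels=False)
    test_bins = pd.cut(x_test, bins=edges, labels=False)
    # Mean target per bin, learned on the training set
    bin_means = y_train.groupby(train_bins).mean()
    # Assumption: bins without a learned mean fall back to the overall train mean
    preds = test_bins.map(bin_means).fillna(y_train.mean())
    return roc_auc_score(y_test, preds)


# Synthetic data: one informative variable and one pure-noise variable
rng = np.random.default_rng(0)
n = 3000
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (rng.random(n) < 1 / (1 + np.exp(-2 * signal))).astype(int)
df = pd.DataFrame({'signal': signal, 'noise': noise, 'target': target})

train, test = train_test_split(df, test_size=0.3, random_state=0)
auc = {col: binned_mean_target_auc(train[col], test[col], train['target'], test['target'])
       for col in ['signal', 'noise']}
print(auc)
```

The informative variable should score well above 0.5 and the noise variable close to 0.5.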

### Advantages
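1. Speed: computing means and quantiles is direct and efficient.
2. Stability with respect to scale: extreme values of continuous variables do not skew the predictions.
3. Comparability: categorical and numerical variables are scored on the same ROC-AUC scale, so they can be compared directly.
4. Accommodation of non-linearities: the per-bin means can follow a non-linear relationship between a variable and the target.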
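
To illustrate the point about non-linearities, here is a small sketch on synthetic data (an assumption, not taken from the notebook). The target is more likely at both extremes of `x`, so the raw value is a poor ranking score, while the per-bin mean target picks up the U shape. The 20 bins and the sequential train/test split are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
x = pd.Series(rng.normal(size=4000))
# U-shaped signal: the probability of a positive grows with x squared
p = 1 / (1 + np.exp(-2 * (x.to_numpy() ** 2 - 1)))
y = pd.Series((rng.random(4000) < p).astype(int))

# The rows are already in random order, so a sequential split is enough here
x_train, x_test = x.iloc[:3000], x.iloc[3000:]
y_train, y_test = y.iloc[:3000], y.iloc[3000:]

# 20 quantile bins learned on the training part, outer edges widened to infinity
_, edges = pd.qcut(x_train, q=20, retbins=True)
edges[0], edges[-1] = -np.inf, np.inf
train_bins = pd.cut(x_train, bins=edges, labels=False)
test_bins = pd.cut(x_test, bins=edges, labels=False)

# Mean target per bin from the training part, used as the test prediction
bin_means = y_train.groupby(train_bins).mean()
preds = test_bins.map(bin_means)

auc_raw = roc_auc_score(y_test, x_test)      # raw value used directly as the score
auc_binned = roc_auc_score(y_test, preds)    # per-bin mean target used as the score
print(auc_raw, auc_binned)
```

The raw score should land near 0.5, while the binned score should be clearly higher.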

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score

import warnings
warnings.filterwarnings('ignore')

In [15]:
# Load the titanic dataset
data=pd.read_csv('datasets/titanic.csv')
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [16]:
# Variable preprocessing:
# Cabin contains missing data.
# We'll replace the missing values with a new category, "Missing",
# then narrow down the different cabins by keeping only the
# first letter, which represents the deck on which the cabin was located.

data['Cabin']=data['Cabin'].fillna('Missing')
print(data['Cabin'].unique())
data['Cabin']=data['Cabin'].str[0]
data['Cabin'].unique()       # There is no 'M' deck, so 'M' can safely stand for missing deck data

['Missing' 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78'
 'D33' 'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26'
 'C110' 'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7'
 'C49' 'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87'
 'B77' 'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26'
 'C106' 'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124'
 'C91' 'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44'
 'A34' 'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20'
 'B79' 'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101'
 'C68' 'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48'
 'E58' 'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24'
 'C90' 'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3'
 'D6' 'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24'
 'C50' 'B42' 'C148']


array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [18]:
# Separate the train and test sets
# We'll only use the categorical variables and the target
print(data.info())      # string-valued categorical variables have dtype 'object'

x_train,x_test,y_train,y_test=train_test_split(data[['Pclass','Sex','Embarked','Cabin','Survived']],
                                              data['Survived'],test_size=0.3,random_state=0)
x_train.shape,x_test.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None


((623, 5), (268, 5))

## Feature Selection on Categorical Variables
First, we will work with the categorical variables.

We'll create a function that calculates the mean of Survived (the probability of survival) of the passengers
within each label of a categorical variable. Using the training set, it builds a dictionary that maps each label to a
probability of survival.

The function then replaces the labels in both the training and the testing set with those probabilities. This makes the
question a bit like "Tell me which one was your cabin, and I'll tell you your probability of survival."

For a good predictor, the ROC-AUC value must be greater than 0.5, the score expected from random guessing.
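
As a hand-checkable illustration of the idea, here is a toy example with made-up data (six training and four test passengers, not taken from the Titanic set). The column names mirror the notebook; the values are assumptions chosen so the arithmetic is easy to follow.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Toy data, invented for illustration only
train = pd.DataFrame({'Cabin': ['A', 'A', 'B', 'B', 'B', 'C'],
                      'Survived': [1, 1, 1, 0, 0, 0]})
test = pd.DataFrame({'Cabin': ['A', 'B', 'C', 'B'],
                     'Survived': [1, 0, 0, 1]})

# Label -> mean target, learned on the training set: A = 1.0, B = 1/3, C = 0.0
risk_dict = train.groupby('Cabin')['Survived'].mean().to_dict()

# Replace each test label with its training-set survival probability
preds = test['Cabin'].map(risk_dict)

# Three of the four positive/negative pairs are ranked correctly and one is tied:
# (3 + 0.5) / 4 = 0.875
auc = roc_auc_score(test['Survived'], preds)
print(risk_dict, auc)
```

An AUC of 0.875 is well above 0.5, so in this toy setting the cabin label would count as a useful predictor.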

In [24]:
# Checking manually for Pclass
data.groupby('Pclass')['Survived'].mean() # Survived is 0/1, so the mean is the fraction of survivors, i.e. the survival probability per class

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [40]:
# Understanding the data and what mean() does.
data.groupby('Pclass')['Survived'].describe()

Pclass,count,mean,std,min,25%,50%,75%,max
1,216.0,0.62963,0.484026,0.0,0.0,1.0,1.0,1.0
2,184.0,0.472826,0.500623,0.0,0.0,0.0,1.0,1.0
3,491.0,0.242363,0.428949,0.0,0.0,0.0,0.0,1.0


In [45]:
print(data.groupby('Sex')['Survived'].mean()) # Looks like very few males survived the sinking :P.
data.groupby('Sex')['Survived'].describe()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64


Sex,count,mean,std,min,25%,50%,75%,max
female,314.0,0.742038,0.438211,0.0,0.0,1.0,1.0,1.0
male,577.0,0.188908,0.391775,0.0,0.0,0.0,0.0,1.0


In [58]:
def mean_encoding(df_train,df_test):
    # Work on copies so the original frames stay intact when 'Survived' is dropped at the end.
    df_train_temp=df_train.copy()
    df_test_temp=df_test.copy()
    
    for col in ['Pclass','Sex','Cabin','Embarked']:
        # Mean of the target per label, learned from the training set only
        risk_dict=df_train.groupby(col)['Survived'].mean().to_dict()
        
        # Remap the labels to their mean target value
        # (labels missing from risk_dict, including NaN, become NaN)
        df_train_temp[col]=df_train[col].map(risk_dict)
        df_test_temp[col]=df_test[col].map(risk_dict)
        
    # Drop the target variable
    df_train_temp.drop(labels=['Survived'],inplace=True,axis=1)
    df_test_temp.drop(labels=['Survived'],inplace=True,axis=1)
        
    return df_train_temp,df_test_temp
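
One thing to keep in mind with `mean_encoding`: a label that is missing, or that appears only in the test set, has no entry in the dictionary and maps to NaN, which `roc_auc_score` may reject. A possible guard, sketched below on toy data, is to fall back to the overall training mean. The function name `mean_encode_with_fallback` and the fallback choice are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def mean_encode_with_fallback(train, test, cols, target):
    """Hypothetical variant: labels without a learned mean get the overall train mean."""
    train_enc, test_enc = train[cols].copy(), test[cols].copy()
    global_mean = train[target].mean()
    for col in cols:
        # Mean target per label, learned on the training set (NaN labels are skipped)
        risk = train.groupby(col)[target].mean()
        train_enc[col] = train[col].map(risk).fillna(global_mean)
        test_enc[col] = test[col].map(risk).fillna(global_mean)
    return train_enc, test_enc


# Toy data: 'Q' never appears in train, and one value is missing in each set
train = pd.DataFrame({'Embarked': ['S', 'S', 'C', None], 'Survived': [0, 1, 1, 0]})
test = pd.DataFrame({'Embarked': ['C', 'Q', None], 'Survived': [1, 0, 0]})

train_enc, test_enc = mean_encode_with_fallback(train, test, ['Embarked'], 'Survived')
print(test_enc)
```

Here `C` maps to its learned mean of 1.0, while `Q` and the missing value both receive the overall training mean of 0.5.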

In [59]:
# Calling the function
x_train_enc,x_test_enc=mean_encoding(x_train,x_test)
x_train_enc.shape,x_test_enc.shape

((623, 4), (268, 4))

In [61]:
x_train_enc.head()

,Pclass,Sex,Embarked,Cabin
857,0.621795,0.196078,0.341357,0.740741
52,0.621795,0.753488,0.564815,0.692308
386,0.241791,0.196078,0.341357,0.303609
124,0.621795,0.196078,0.341357,0.692308
578,0.241791,0.753488,0.564815,0.303609


In [71]:
# Now we'll score each encoded column of the test set
# against the true y_test values to calculate the ROC-AUC
roc_values=[]
for col in x_test_enc.columns:
    roc_values.append(roc_auc_score(y_test,x_test_enc[col].values))

In [72]:
# Converting into series
roc_values=pd.Series(roc_values)
roc_values.index=x_test_enc.columns
roc_values

Pclass      0.680476
Sex         0.771667
Embarked    0.577500
Cabin       0.641637
dtype: float64

Hence, all four features have a ROC-AUC above 0.5 on the test set, so each carries some predictive signal. Sex is the strongest predictor (0.77) and Embarked the weakest (0.58).
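
To turn the scores into a selection, one option is to rank them and keep the features above a cut-off. The sketch below reuses the ROC-AUC values printed above; the 0.5 threshold is an assumption and could be raised to be more selective.

```python
import pandas as pd

# ROC-AUC values copied from the output above
roc_values = pd.Series({'Pclass': 0.680476, 'Sex': 0.771667,
                        'Embarked': 0.577500, 'Cabin': 0.641637})

threshold = 0.5  # assumed cut-off: anything above random guessing

# Rank the features from strongest to weakest
ranked = roc_values.sort_values(ascending=False)
# Keep only the features scoring above the threshold
selected = ranked[ranked > threshold].index.tolist()

print(ranked)
print(selected)
```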