# ONE HOT ENCODING

One hot encoding replaces each categorical variable with a set of boolean variables, also called dummies. Each dummy takes the value 0 or 1 to indicate whether a given category is present in an observation.
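
As a minimal illustration of this definition, the sketch below encodes a made-up `colour` variable. The variable and its values are hypothetical and are not part of the Titanic data.

```python
import pandas as pd

# toy categorical variable (hypothetical values, for illustration only)
colour = pd.Series(['red', 'green', 'blue', 'green'], name='colour')

# one dummy per category; dtype=int gives 0/1 rather than booleans
dummies = pd.get_dummies(colour, dtype=int)

# each row has exactly one dummy set to 1: the observed category
print(pd.concat([colour, dummies], axis=1))
```

Passing `dtype=int` is only a display choice here, since recent pandas releases return boolean columns by default.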

### Advantages
- Makes no assumption about the distribution or categories of the categorical variable
- Keeps all the information of the categorical variable
- Suitable for linear models

### Limitations
- Expands the feature space
- Does not add extra information while encoding
- Many dummy variables may be identical, introducing redundant information
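
One source of redundancy can be shown on a toy example: with k dummies per variable, every row sums to 1, so any single dummy can be rebuilt from the remaining k-1. The sketch below uses a hypothetical `port` variable, not the notebook's data.

```python
import pandas as pd

# toy variable with k = 3 categories (hypothetical values)
port = pd.Series(['S', 'C', 'Q', 'S', 'C'], name='port')

k_dummies = pd.get_dummies(port, dtype=int)                   # columns C, Q, S
k_minus_1 = pd.get_dummies(port, drop_first=True, dtype=int)  # columns Q, S

# the k dummies always sum to 1 per row
row_sums = k_dummies.sum(axis=1)

# rebuild the dropped dummy "C" from the remaining k-1 columns
recovered_C = 1 - k_minus_1.sum(axis=1)

print(row_sums.unique())
print((recovered_C == k_dummies['C']).all())
```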


### Dataset:
- Titanic dataset


### Content:

1. Loading the data and train/test splitting
2. Exploring cardinality
3. One hot encoding with pandas
    - into k dummy variables
    - into k-1 dummy variables
    - get_dummies() with missing values
4. One hot encoding with Scikit-learn

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder

### 1. First Steps

In [2]:
# load titanic dataset

data = pd.read_csv('../titanic.csv',
                   usecols=['sex', 'embarked', 'cabin', 'survived'])
data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B5,S
1,1,male,C22,S
2,0,female,C22,S
3,0,male,C22,S
4,0,female,C22,S


To avoid leaking information and overfitting, every categorical encoding, just like imputation, should be learned from the training set only and then applied to the test set.

In [3]:
# separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked', 'cabin']],
    data['survived'],  
    test_size=0.3, 
    random_state=0)  

X_train.shape, X_test.shape

((916, 3), (393, 3))

### 2. Exploring the cardinality

In [4]:
X_train['sex'].unique()

array(['female', 'male'], dtype=object)

In [5]:
# "sex" has 2 labels

In [6]:
X_train['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [7]:
# "embarked" has 3 labels and missing data

In [8]:
X_train['cabin'].unique()

array([nan, 'E36', 'C68', 'E24', 'C22', 'D38', 'B50', 'A24', 'C111', 'F',
       'C6', 'C87', 'E8', 'B45', 'C93', 'D28', 'D36', 'C125', 'B35', 'T',
       'B73', 'B57', 'A26', 'A18', 'B96', 'G6', 'C78', 'C101', 'D9',
       'D33', 'C128', 'E50', 'B26', 'B69', 'E121', 'C123', 'B94', 'A34',
       'D', 'C39', 'D43', 'E31', 'B5', 'D17', 'F33', 'E44', 'D7', 'A21',
       'D34', 'A29', 'D35', 'A11', 'B51', 'D46', 'E60', 'C30', 'D26',
       'E68', 'A9', 'B71', 'D37', 'F2', 'C55', 'C89', 'C124', 'C23',
       'C126', 'E49', 'E46', 'D19', 'B58', 'C82', 'B52', 'C92', 'E45',
       'C65', 'E25', 'B3', 'D40', 'C91', 'B102', 'B61', 'A20', 'B36',
       'C7', 'B77', 'D20', 'C148', 'C105', 'E38', 'B86', 'C132', 'C86',
       'A14', 'C54', 'A5', 'B49', 'B28', 'B24', 'C2', 'F4', 'A6', 'C83',
       'B42', 'A36', 'C52', 'D56', 'C116', 'B19', 'E77', 'E101', 'B18',
       'C95', 'D15', 'E33', 'B30', 'D21', 'E10', 'C130', 'D6', 'C51',
       'D30', 'E67', 'C110', 'C103', 'C90', 'C118', 'C97', 'D47', 'E34

In [9]:
X_train['cabin'].nunique()

146

In [10]:
# capture only the first letter of the cabin
# to avoid high cardinality

X_train['cabin'] = X_train['cabin'].str[0]

In [11]:
X_train['cabin'].unique()

array([nan, 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object)

In [12]:
# "cabin" has 8 labels and missing data

In [13]:
X_test['cabin'] = X_test['cabin'].str[0]
X_test['cabin'].unique()

array([nan, 'G', 'E', 'C', 'B', 'A', 'F', 'D'], dtype=object)
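
The two outputs above do not contain the same labels. A quick way to make such differences explicit is to compare the label sets directly. The sketch below uses stand-in series whose values are copied from the two `unique()` outputs; the variable names are assumptions for illustration.

```python
import pandas as pd

# stand-ins for the cabin letters in the train and test sets
# (values copied from the unique() outputs above)
train_cabin = pd.Series([None, 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'])
test_cabin = pd.Series([None, 'G', 'E', 'C', 'B', 'A', 'F', 'D'])

train_labels = set(train_cabin.dropna().unique())
test_labels = set(test_cabin.dropna().unique())

# labels present in train but absent from test
only_in_train = sorted(train_labels - test_labels)

# labels in test that an encoder fitted on train would not know
only_in_test = sorted(test_labels - train_labels)

print(only_in_train, only_in_test)
```

With the real data, the same comparison could be run on `X_train['cabin']` and `X_test['cabin']`.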

### 3.  One hot encoding with pandas

(not suitable for use within Pipelines: `get_dummies()` does not store the categories seen in the train set, so they cannot be propagated to the test set)

#### - into k dummy variables

In [14]:
# create dummy variables with get_dummies

# for "sex"

tmp = pd.get_dummies(X_train['sex'])

tmp.head()

Unnamed: 0,female,male
501,1,0
588,1,0
402,1,0
1193,0,1
686,1,0


In [15]:
# put the dummies next to the original variable (for better visualisation)

pd.concat([X_train['sex'],
           pd.get_dummies(X_train['sex'])], axis=1).head()

Unnamed: 0,sex,female,male
501,female,1,0
588,female,1,0
402,female,1,0
1193,male,0,1
686,female,1,0


In [16]:
# same for "embarked"

tmp = pd.get_dummies(X_train['embarked'])

# for better visualisation
pd.concat([X_train['embarked'],
           pd.get_dummies(X_train['embarked'])], axis=1).head()

Unnamed: 0,embarked,C,Q,S
501,S,0,0,1
588,S,0,0,1
402,C,1,0,0
1193,Q,0,1,0
686,Q,0,1,0


In [17]:
# for "cabin"

tmp = pd.get_dummies(X_train['cabin'])

# for better visualisation
pd.concat([X_train['cabin'],
           pd.get_dummies(X_train['cabin'])], axis=1).head()

Unnamed: 0,cabin,A,B,C,D,E,F,G,T
501,,0,0,0,0,0,0,0,0
588,,0,0,0,0,0,0,0,0
402,,0,0,0,0,0,0,0,0
1193,,0,0,0,0,0,0,0,0
686,,0,0,0,0,0,0,0,0


In [18]:
# put together (train set)

tmp = pd.get_dummies(X_train)

print(tmp.shape)

tmp.head()

(916, 13)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,1,0,0,0,1,0,0,0,0,0,0,0,0
588,1,0,0,0,1,0,0,0,0,0,0,0,0
402,1,0,1,0,0,0,0,0,0,0,0,0,0
1193,0,1,0,1,0,0,0,0,0,0,0,0,0
686,1,0,0,1,0,0,0,0,0,0,0,0,0


In [19]:
# put together (test set)

tmp = pd.get_dummies(X_test)

print(tmp.shape)

tmp.head()

(393, 12)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,0,1,0,0,1,0,0,0,0,0,0,0
533,1,0,0,0,1,0,0,0,0,0,0,0
459,0,1,0,0,1,0,0,0,0,0,0,0
1150,0,1,0,0,1,0,0,0,0,0,0,0
393,0,1,0,0,1,0,0,0,0,0,0,0


The train set contains 13 dummy features, whereas the test set contains 12 features. This occurred because there was no category T in cabin in the test set.

This will cause problems when training and scoring models with scikit-learn, because predictors require the train and test sets to have the same number of columns.
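
One possible workaround, sketched here on hypothetical toy frames rather than the Titanic data, is to reindex the test dummies against the train columns. This is only one option and assumes the train columns are the reference.

```python
import pandas as pd

# toy train/test frames (hypothetical); 'T' appears only in train
train = pd.DataFrame({'cabin': ['A', 'B', 'T', 'B']})
test = pd.DataFrame({'cabin': ['B', 'A', 'A']})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# force the test set to have exactly the train columns, in the same order;
# dummies missing from test (here cabin_T) are added and filled with 0,
# and any test-only dummies would be dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```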

#### - into k-1 dummy variables

In [20]:
# create dummy variables with get_dummies(..., drop_first=True)

tmp = pd.get_dummies(X_train, drop_first=True)
print(tmp.shape)
tmp.head()

(916, 10)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,0,0,1,0,0,0,0,0,0,0
588,0,0,1,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0,0,0
1193,1,1,0,0,0,0,0,0,0,0
686,0,1,0,0,0,0,0,0,0,0


In [21]:
# altogether: test set

tmp = pd.get_dummies(X_test, drop_first=True)
print(tmp.shape)
tmp.head()

(393, 9)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,1,0,1,0,0,0,0,0,0
533,0,0,1,0,0,0,0,0,0
459,1,0,1,0,0,0,0,0,0
1150,1,0,1,0,0,0,0,0,0
393,1,0,1,0,0,0,0,0,0


In [22]:
# and again, compare the difference in column number: 10 vs 9
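
With `drop_first=True` there is a further subtlety: the dropped category is whichever comes first in each set, so train and test may drop different categories. A possible remedy, sketched below on hypothetical toy frames, is to learn the categories from the train set and apply them to both sets as a fixed categorical dtype before encoding.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# toy data (hypothetical): 'A' is missing from test, so a plain
# get_dummies(..., drop_first=True) would drop 'A' in train but 'B' in test
train = pd.DataFrame({'cabin': ['A', 'B', 'T', 'B']})
test = pd.DataFrame({'cabin': ['B', 'T', 'B']})

# learn the categories from the train set only
cabin_dtype = CategoricalDtype(categories=sorted(train['cabin'].dropna().unique()))

# apply the same fixed categories to both sets before encoding;
# labels unseen in train would become missing values in test
train_enc = pd.get_dummies(train.astype({'cabin': cabin_dtype}),
                           drop_first=True, dtype=int)
test_enc = pd.get_dummies(test.astype({'cabin': cabin_dtype}),
                          drop_first=True, dtype=int)

print(list(train_enc.columns) == list(test_enc.columns))
```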

#### - get_dummies() with missing values

In [23]:
# add an additional dummy variable to indicate missing data

pd.get_dummies(X_train, drop_first=True, dummy_na=True).head()

Unnamed: 0,sex_male,sex_nan,embarked_Q,embarked_S,embarked_nan,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T,cabin_nan
501,0,0,0,1,0,0,0,0,0,0,0,0,1
588,0,0,0,1,0,0,0,0,0,0,0,0,1
402,0,0,0,0,0,0,0,0,0,0,0,0,1
1193,1,0,1,0,0,0,0,0,0,0,0,0,1
686,0,0,1,0,0,0,0,0,0,0,0,0,1


In [24]:
pd.get_dummies(X_test, drop_first=True, dummy_na=True).head()

Unnamed: 0,sex_male,sex_nan,embarked_Q,embarked_S,embarked_nan,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_nan
1139,1,0,0,1,0,0,0,0,0,0,0,1
533,0,0,0,1,0,0,0,0,0,0,0,1
459,1,0,0,1,0,0,0,0,0,0,0,1
1150,1,0,0,1,0,0,0,0,0,0,0,1
393,1,0,0,1,0,0,0,0,0,0,0,1
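
Note that `dummy_na=True` adds a `_nan` column for every variable, including those without missing data, such as `sex_nan` above. The sketch below, on a hypothetical toy frame, shows one way to spot and drop such constant columns.

```python
import pandas as pd

# toy frame (hypothetical): 'sex' is complete, 'cabin' has a missing value
df = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'cabin': ['A', None, 'B'],
})

encoded = pd.get_dummies(df, drop_first=True, dummy_na=True, dtype=int)

# a *_nan dummy that never fires is constant and carries no information
constant_cols = [col for col in encoded.columns if encoded[col].nunique() == 1]

# drop the constant columns
encoded = encoded.drop(columns=constant_cols)

print(constant_cols)
print(list(encoded.columns))
```

In practice the constant columns would be identified on the train set, and the same columns removed from the test set.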


### 4. One hot encoding with Scikit-learn

- it creates the same number of features in the train and test sets

but:

- it returns a NumPy array instead of a pandas dataframe
- it does not return the variable names with the array, which makes variable exploration inconvenient

In [25]:
# we create and train the encoder
# (from scikit-learn 1.2 onwards, the sparse argument is named sparse_output)

encoder = OneHotEncoder(categories='auto',
                       drop='first',
                       sparse=False,
                       handle_unknown='error')

encoder.fit(X_train.fillna('Missing'))

OneHotEncoder(drop='first', sparse=False)

In [26]:
# observe the learned categories

encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Missing', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing', 'T'], dtype=object)]
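
The `handle_unknown='error'` setting used when creating the encoder matters once the test set contains a label that was absent from the train set. The sketch below uses hypothetical toy frames with an invented label `'Z'` to contrast it with `handle_unknown='ignore'`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy data (hypothetical): 'Z' appears only in the test frame
train = pd.DataFrame({'cabin': ['A', 'B', 'T']})
test = pd.DataFrame({'cabin': ['B', 'Z']})

# handle_unknown='error' rejects labels not seen during fit
strict = OneHotEncoder(handle_unknown='error').fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore' encodes unseen labels as a row of zeros instead
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(test).toarray()

print(raised)
print(encoded)
```

The sketch leaves `drop` at its default, because combining `drop` with `handle_unknown='ignore'` is not supported in older scikit-learn releases.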

In [27]:
# transform the train set

tmp = encoder.transform(X_train.fillna('Missing'))

pd.DataFrame(tmp).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [28]:
# retrieve the feature names
# (get_feature_names() was deprecated in scikit-learn 1.0
# in favour of get_feature_names_out())

encoder.get_feature_names()

array(['x0_male', 'x1_Missing', 'x1_Q', 'x1_S', 'x2_B', 'x2_C', 'x2_D',
       'x2_E', 'x2_F', 'x2_G', 'x2_Missing', 'x2_T'], dtype=object)

In [29]:
# transform the test set
tmp = encoder.transform(X_test.fillna('Missing'))

# convert it back to a pandas dataframe
tmp = pd.DataFrame(tmp)

# add the feature names
tmp.columns = encoder.get_feature_names()

tmp.head()

Unnamed: 0,x0_male,x1_Missing,x1_Q,x1_S,x2_B,x2_C,x2_D,x2_E,x2_F,x2_G,x2_Missing,x2_T
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
