## One Hot Encoding - sklearn

- OneHotEncoder
- OneHotEncoder + ColumnTransformer
- OneHotEncoder + SimpleImputer in a pipeline

We will see how to perform one hot encoding with Scikit-learn using the Titanic dataset.

For instructions on obtaining the data, check **section 2** of the course.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# sklearn
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [2]:
# load titanic dataset

usecols = ["pclass", "sibsp", "parch", "sex", "embarked", "cabin", "survived"]

data = pd.read_csv("../../titanic.csv", usecols=usecols)
data["cabin"] = data["cabin"].str[0]

data.head()

Unnamed: 0,pclass,survived,sex,sibsp,parch,cabin,embarked
0,1,1,female,0,0,B,S
1,1,1,male,1,2,C,S
2,1,0,female,1,2,C,S
3,1,0,male,1,2,C,S
4,1,0,female,1,2,C,S


### Important

As with imputation, every categorical encoding method should be fitted on the training set and then applied to the test set.

Why?

Because these methods "learn" patterns from the train data, fitting them on the train set alone avoids leaking information and overfitting. More importantly, we do not know whether future or live data will contain all the categories present in the train data, or more or fewer categories. We anticipate this uncertainty by setting up the right process from the start: we create transformers that learn the categories from the train set and use those learned categories to create the dummy variables in both the train and test sets.

In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # fraction of observations in the test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

## One hot encoding with Scikit-learn

### Advantages

- quick
- creates the same number of features in train and test set
- works within a pipeline

### Limitations

- needs the output format to be set explicitly to return a pandas dataframe
- needs an additional transformer (ColumnTransformer) to encode a subset of variables
- changes the variable names after transformation

In [4]:
# To start, we fill in missing values manually. Later on
# we add this step to a pipeline

# reassigning avoids pandas warnings about modifying a copy in place
X_train = X_train.fillna("Missing")
X_test = X_test.fillna("Missing")

In [5]:
# set up encoder

encoder = OneHotEncoder(
    categories="auto",
    drop="first",  # returns k-1 dummies; use drop=None to return k dummies
    sparse_output=False,
    handle_unknown="error",  # raise an error if transform sees an unseen category
)

encoder.set_output(transform="pandas")

In [6]:
# fit the encoder (finds categories)

encoder.fit(X_train)

In [7]:
# we observe the learned categories

encoder.categories_

[array([1, 2, 3], dtype=int64),
 array(['female', 'male'], dtype=object),
 array([0, 1, 2, 3, 4, 5, 8], dtype=int64),
 array([0, 1, 2, 3, 4, 5, 6, 9], dtype=int64),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing', 'T'], dtype=object),
 array(['C', 'Missing', 'Q', 'S'], dtype=object)]

In [8]:
# transform the data sets

X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

X_train_enc.head()

Unnamed: 0,pclass_2,pclass_3,sex_male,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_1,...,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_Missing,cabin_T,embarked_Missing,embarked_Q,embarked_S
501,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
588,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
402,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1193,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
686,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [9]:
# we can retrieve the feature names as follows:

encoder.get_feature_names_out()

array(['pclass_2', 'pclass_3', 'sex_male', 'sibsp_1', 'sibsp_2',
       'sibsp_3', 'sibsp_4', 'sibsp_5', 'sibsp_8', 'parch_1', 'parch_2',
       'parch_3', 'parch_4', 'parch_5', 'parch_6', 'parch_9', 'cabin_B',
       'cabin_C', 'cabin_D', 'cabin_E', 'cabin_F', 'cabin_G',
       'cabin_Missing', 'cabin_T', 'embarked_Missing', 'embarked_Q',
       'embarked_S'], dtype=object)

## Encoding a subset of variables

In [10]:
# set up encoder

encoder = OneHotEncoder(
    categories="auto",
    drop="first",  # returns k-1 dummies; use drop=None to return k dummies
    sparse_output=False,
    handle_unknown="error",  # raise an error if transform sees an unseen category
)

In [11]:
# select the variables to encode

ct = ColumnTransformer(
    [("encoder", encoder, ["sex", "embarked", "cabin"])], remainder="passthrough"
)

ct.set_output(transform="pandas")

In [12]:
# train encoder

ct.fit(X_train)

In [13]:
# encode

X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)

X_train_enc.head()

Unnamed: 0,encoder__sex_male,encoder__embarked_Missing,encoder__embarked_Q,encoder__embarked_S,encoder__cabin_B,encoder__cabin_C,encoder__cabin_D,encoder__cabin_E,encoder__cabin_F,encoder__cabin_G,encoder__cabin_Missing,encoder__cabin_T,remainder__pclass,remainder__sibsp,remainder__parch
501,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2,0,1
588,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2,1,1
402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2,1,0
1193,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3,0,0
686,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3,0,0


## Imputation and encoding

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

In [15]:
# check for missing data

X_train.isnull().sum()

pclass        0
sex           0
sibsp         0
parch         0
cabin       702
embarked      2
dtype: int64

In [16]:
# set up encoder
encoder = OneHotEncoder(
    categories="auto",
    drop="first",  # returns k-1 dummies; use drop=None to return k dummies
    sparse_output=False,
    handle_unknown="error",  # raise an error if transform sees an unseen category
)

In [17]:
# set up imputation and encoding in a pipeline
# we apply it only to the categorical variables (selected in the next cell)

pipe = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("ohe", encoder),
    ]
)

In [18]:
# select the variables to transform (impute + encode)

ct = ColumnTransformer(
    [("encoder", pipe, ["sex", "embarked", "cabin"])], remainder="passthrough"
)

ct.set_output(transform="pandas")

In [19]:
# fit pipeline

ct.fit(X_train)

In [20]:
# transform data

X_train_enc = ct.transform(X_train)
X_test_enc = ct.transform(X_test)

X_train_enc.head()

Unnamed: 0,encoder__sex_male,encoder__embarked_Q,encoder__embarked_S,encoder__embarked_missing,encoder__cabin_B,encoder__cabin_C,encoder__cabin_D,encoder__cabin_E,encoder__cabin_F,encoder__cabin_G,encoder__cabin_T,encoder__cabin_missing,remainder__pclass,remainder__sibsp,remainder__parch
501,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,0,1
588,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,1,1
402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2,1,0
1193,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3,0,0
686,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3,0,0


In [21]:
# train and test sets have the same number of columns after encoding

X_train_enc.shape, X_test_enc.shape

((916, 15), (393, 15))