## One Hot Encoding - Feature-engine

We will see how to perform one hot encoding with Feature-engine using the Titanic dataset.

For guidelines to obtain the data, check **section 2** of the course.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Feature-engine
from feature_engine.encoding import OneHotEncoder
from feature_engine.imputation import CategoricalImputer

In [2]:
# load titanic dataset

usecols = ["pclass", "sibsp", "parch", "sex", "embarked", "cabin", "survived"]

data = pd.read_csv("../../titanic.csv", usecols=usecols)
data["cabin"] = data["cabin"].str[0]

data.head()

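| | pclass | survived | sex | sibsp | parch | cabin | embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 0 | 0 | B | S |
| 1 | 1 | 1 | male | 1 | 2 | C | S |
| 2 | 1 | 0 | female | 1 | 2 | C | S |
| 3 | 1 | 0 | male | 1 | 2 | C | S |
| 4 | 1 | 0 | female | 1 | 2 | C | S |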


### Important

Just like imputation, all methods of categorical encoding should be fitted on the training set only, and then applied to the test set.

Why? 

Because these methods "learn" patterns from the train data, fitting them on the full dataset would leak information and encourage overfitting. More importantly, we don't know whether future or live data will contain all the categories present in the train data, or whether there will be more or fewer categories. We want to anticipate this uncertainty by setting up the right process from the start: we create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both the train and test sets.
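The idea of learning the categories on the train set and reusing them on the test set can be sketched with plain pandas. This is only an illustration on hypothetical toy data, not how Feature-engine is implemented; the `encode` helper and the toy frames are assumptions made for the example.

```python
import pandas as pd

# Toy data (hypothetical): "T" appears only in the test set, "B" only in train
train = pd.DataFrame({"cabin": ["A", "B", "C", "A"]})
test = pd.DataFrame({"cabin": ["C", "T", "A"]})

# "fit": learn the categories from the train set only
learned = list(train["cabin"].unique())


def encode(df, column, categories):
    """Create one dummy per learned category (hypothetical helper).

    Categories not seen during "fit" get zeros in every dummy column.
    """
    out = df.drop(columns=[column])
    for cat in categories:
        out[f"{column}_{cat}"] = (df[column] == cat).astype(int)
    return out


# "transform": the same learned categories are applied to both sets
train_enc = encode(train, "cabin", learned)
test_enc = encode(test, "cabin", learned)

print(list(test_enc.columns))
print(test_enc)
```

Both encoded sets end up with the same columns, and the test row with the unseen category "T" is all zeros, which is why the categories must come from the train set rather than from each set separately.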

In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

## One hot encoding with Feature-Engine

### Advantages

- quick
- returns dataframe
- returns feature names
- allows selecting the features to encode
- appends dummies to original dataset

### Limitations

- Not sure yet.

In [4]:
# To start, we fill missing values manually. Later on
# we add this step to a pipeline

X_train = X_train.fillna("Missing")
X_test = X_test.fillna("Missing")

In [5]:
# set up encoder

encoder = OneHotEncoder(
    variables=None,  # alternatively pass a list of variables
    drop_last=True,  # returns k-1 dummies; use drop_last=False to return k dummies
)
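To see what k versus k-1 dummies means, here is a small pandas-only sketch on hypothetical toy data. The `dummies` helper is an assumption made for illustration, and dropping the last category in order of appearance is a choice of this sketch rather than a statement about Feature-engine's internals.

```python
import pandas as pd

# Toy data (hypothetical): k = 3 categories
colors = pd.Series(["red", "blue", "green", "red"], name="color")
categories = list(colors.unique())  # in order of appearance


def dummies(series, cats):
    """Build one 0/1 column per category in `cats` (hypothetical helper)."""
    return pd.DataFrame(
        {f"{series.name}_{c}": (series == c).astype(int) for c in cats},
        index=series.index,
    )


k_dummies = dummies(colors, categories)       # k columns
k_minus_1 = dummies(colors, categories[:-1])  # k-1 columns, last category dropped

# With k dummies every row sums to 1, so one column is redundant
print(k_dummies.sum(axis=1).tolist())

# With k-1 dummies the dropped category is the row of all zeros
print(k_minus_1.loc[colors == "green"])
```

No information is lost with k-1 dummies, because the dropped category is identified by all remaining dummies being 0. This is the usual reason to prefer k-1 for linear models, where the redundant column would be perfectly collinear with the others.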

In [6]:
# fit the encoder (finds categories)

encoder.fit(X_train)

In [7]:
# categorical variables found automatically

encoder.variables_

['sex', 'cabin', 'embarked']

In [8]:
# we observe the learned categories

encoder.encoder_dict_

{'sex': ['female'],
 'cabin': ['Missing', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
 'embarked': ['S', 'C', 'Q']}

In [9]:
# transform the data sets

X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

X_train_enc.head()

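| | pclass | sibsp | parch | sex_female | cabin_Missing | cabin_E | cabin_C | cabin_D | cabin_B | cabin_A | cabin_F | cabin_T | embarked_S | embarked_C | embarked_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 588 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 402 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1193 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 686 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |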


In [10]:
# we can retrieve the feature names as follows:

encoder.get_feature_names_out()

['pclass',
 'sibsp',
 'parch',
 'sex_female',
 'cabin_Missing',
 'cabin_E',
 'cabin_C',
 'cabin_D',
 'cabin_B',
 'cabin_A',
 'cabin_F',
 'cabin_T',
 'embarked_S',
 'embarked_C',
 'embarked_Q']

## Imputation and encoding

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((916, 6), (393, 6))

In [12]:
# check for missing data

X_train.isnull().sum()

pclass        0
sex           0
sibsp         0
parch         0
cabin       702
embarked      2
dtype: int64

In [13]:
# set up encoder and imputation in pipeline
# we only want to impute categorical variables

pipe = Pipeline(
    [
        ("imputer", CategoricalImputer()),
        ("ohe", OneHotEncoder(drop_last=True)),
    ]
)
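The comment about imputing only the categorical variables can be illustrated with a pandas-only sketch. The toy frame and the rule "non-numeric columns are categorical" are assumptions made for this example; they mimic, but do not reproduce, what the imputer step in the pipeline does.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numerical and one categorical column, both with gaps
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0],
        "cabin": ["C", None, "E"],
    }
)

# Treat every non-numeric column as categorical (assumption of this sketch)
cat_cols = df.select_dtypes(exclude="number").columns

# Fill only the categorical columns with the label "Missing"
selective = df.copy()
selective[cat_cols] = selective[cat_cols].fillna("Missing")

print(selective)
```

The missing `cabin` value becomes the category "Missing", while the missing `age` value is left as NaN for a separate numerical imputation step to handle.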

In [14]:
# fit pipeline

pipe.fit(X_train)

In [15]:
# transform data

X_train_enc = pipe.transform(X_train)
X_test_enc = pipe.transform(X_test)

X_train_enc.head()

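| | pclass | sibsp | parch | sex_female | cabin_Missing | cabin_E | cabin_C | cabin_D | cabin_B | cabin_A | cabin_F | cabin_T | embarked_S | embarked_C | embarked_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 588 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 402 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1193 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 686 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |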
