# Ordinal encoding

In this notebook, we will encode static features with ordinal encoding and compare three implementations: Scikit-learn, Feature-engine, and Category Encoders.

We will use the online retail dataset, which we prepared in the notebook `02-create-online-retail-II-datasets.ipynb` located in the `01-Create-Datasets` folder.

In [1]:
import numpy as np
import pandas as pd

## Load data

In [2]:
df = pd.read_csv("../Datasets/online_retail_dataset_countries.csv",
                parse_dates=["week"],
                index_col="week",
                )

df.head()

week,country,quantity,revenue
2009-12-06,Belgium,143,439.1
2009-12-13,Belgium,10,8.5
2009-12-20,Belgium,0,0.0
2009-12-27,Belgium,0,0.0
2010-01-03,Belgium,0,0.0


In [3]:
# Number of countries in the dataset

df["country"].nunique()

6

## Scikit-learn

In [4]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

In [5]:
# Set up the ordinal encoder

o_enc = OrdinalEncoder()

In [6]:
# We set the encoder inside the ColumnTransformer
# to encode only the variable "country".

ct = ColumnTransformer(
    [("o_enc", o_enc, ["country"])],  # to encode only the variable country
    remainder="passthrough",  # to return all the columns in the resulting array
)

In [7]:
# We should split the data into train and
# test sets before fitting.

# We skip this step to keep the demo short.

ct.fit(df)
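
The comments above note that the encoder should be fitted on the train set only. Here is a minimal sketch of that workflow. It uses a made-up toy dataframe rather than the retail data, and assumes a time-based split with an arbitrary cut-off date. `handle_unknown="use_encoded_value"` and `unknown_value=-1` are scikit-learn options (version 0.24 or later) that map categories not seen during fit to -1:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy data standing in for the retail dataset (values are made up).
toy = pd.DataFrame(
    {
        "country": ["Belgium", "France", "Belgium", "France", "Spain"],
        "quantity": [10, 20, 30, 40, 50],
    },
    index=pd.to_datetime(
        ["2010-01-03", "2010-01-03", "2010-01-10", "2010-01-10", "2010-01-17"]
    ),
)
toy.index.name = "week"

# Split by time: rows before the (arbitrary) cut-off form the train set.
cutoff = pd.Timestamp("2010-01-17")
train, test = toy[toy.index < cutoff], toy[toy.index >= cutoff]

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ct_split = ColumnTransformer(
    [("o_enc", enc, ["country"])],
    remainder="passthrough",
)

ct_split.fit(train)               # learn the mapping from the train set only
test_t = ct_split.transform(test)  # "Spain" was not in train, so it becomes -1
```

In this toy example, "Spain" appears only after the cut-off date, so it receives the code -1 in the test set.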

In [8]:
# Encode country

tmp = ct.transform(df)

# The result is a NumPy array in which the
# categories of the original variable are replaced
# by integer codes (stored as floats).

tmp

array([[0.0000000e+00, 1.4300000e+02, 4.3910000e+02],
       [0.0000000e+00, 1.0000000e+01, 8.5000000e+00],
       [0.0000000e+00, 0.0000000e+00, 0.0000000e+00],
       ...,
       [5.0000000e+00, 1.3399800e+05, 2.1074176e+05],
       [5.0000000e+00, 1.2304100e+05, 2.2021399e+05],
       [5.0000000e+00, 2.0428100e+05, 3.7294626e+05]])

In [9]:
# Recreate the dataframe

df_t = pd.DataFrame(tmp, columns=ct.get_feature_names_out())

df_t.head()

,o_enc__country,remainder__quantity,remainder__revenue
0,0.0,143.0,439.1
1,0.0,10.0,8.5
2,0.0,0.0,0.0
3,0.0,0.0,0.0
4,0.0,0.0,0.0


Note that the variables that were not encoded are placed to the right of the encoded variable, with the prefix "remainder".

## Feature-engine

In [10]:
from feature_engine.encoding import OrdinalEncoder

In [11]:
# Set up the ordinal encoder

o_enc = OrdinalEncoder(
    encoding_method="arbitrary",  # assigns integers arbitrarily
)

In [12]:
# We should split the data into train and
# test sets before fitting.

# We skip this step to keep the demo short.

o_enc.fit(df)

In [13]:
# Feature-engine's encoder finds categorical variables
# by default

o_enc.variables_

['country']

In [14]:
# We can also see the integers assigned to each country

o_enc.encoder_dict_

{'country': {'Belgium': 0,
  'EIRE': 1,
  'France': 2,
  'Germany': 3,
  'Spain': 4,
  'United Kingdom': 5}}
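
To make the idea of an arbitrary mapping concrete, here is a rough pandas-only sketch. It assumes integers are handed out in order of first appearance, starting at 0. It illustrates the idea and is not Feature-engine's actual code, and the toy series is made up:

```python
import pandas as pd

# Toy categorical variable (values are made up).
s = pd.Series(["Belgium", "Belgium", "EIRE", "France", "EIRE"], name="country")

# Assign integers in order of first appearance, starting at 0.
mapping = {category: i for i, category in enumerate(s.unique())}

# Replace each category with its integer.
encoded = s.map(mapping)
```

Because the retail dataset is sorted by country, a mapping built this way would also come out in alphabetical order, as in the dictionary above.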

In [15]:
# Feature-engine's encoder replaces the categories
# with integers and, by default, returns a dataframe that
# contains both the encoded and the untouched variables.

df_t = o_enc.transform(df)

df_t.head()

week,country,quantity,revenue
2009-12-06,0,143,439.1
2009-12-13,0,10,8.5
2009-12-20,0,0,0.0
2009-12-27,0,0,0.0
2010-01-03,0,0,0.0


## Category Encoders

In [16]:
from category_encoders.ordinal import OrdinalEncoder

In [17]:
# Set up the ordinal encoder

o_enc = OrdinalEncoder()

In [18]:
# We should split the data into train and
# test sets before fitting.

# We skip this step to keep the demo short.

o_enc.fit(df)

In the output of the previous cell, we can see the integers assigned to each category.

In [19]:
# Category Encoders finds categorical variables
# by default

o_enc.cols

['country']

In [20]:
# We can retrieve the integer mappings like this

o_enc.mapping

[{'col': 'country',
  'mapping': Belgium           1
  EIRE              2
  France            3
  Germany           4
  Spain             5
  United Kingdom    6
  NaN              -2
  dtype: int64,
  'data_type': dtype('O')}]
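
Unlike the other two encoders, this mapping starts at 1 and reserves -2 for missing values. With its default settings, Category Encoders also encodes categories that were not seen during fit as -1. The pandas-only sketch below mimics that convention on made-up data. It is an illustration under those assumptions, not the library's implementation:

```python
import numpy as np
import pandas as pd

# Categories seen during "fit" (toy values).
train = pd.Series(["Belgium", "EIRE", "France"])

# Integers start at 1, in order of first appearance.
mapping = {category: i for i, category in enumerate(train.unique(), start=1)}

# New data with a known category, an unseen category, and a missing value.
new = pd.Series(["France", "Italy", np.nan])

# Anything not in the mapping becomes -1 ...
codes = new.map(mapping).fillna(-1).astype(int)
# ... and missing values are then overwritten with -2.
codes[new.isna()] = -2
```

Reserving negative codes keeps unseen and missing values clearly separate from the codes of the categories learned during fit.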

In [21]:
# The Category Encoders encoder replaces the categories
# with integers and, by default, returns a dataframe that
# contains both the encoded and the untouched variables.

df_t = o_enc.transform(df)

df_t.head()

week,country,quantity,revenue
2009-12-06,1,143,439.1
2009-12-13,1,10,8.5
2009-12-20,1,0,0.0
2009-12-27,1,0,0.0
2010-01-03,1,0,0.0
