# Ordinal Encoding in Scikit-Learn

## Defining some toy data

- We start by defining a small toy dataset with one numerical and one categorical column:

In [None]:
import pandas as pd

In [None]:
feature_1 = [
    1.1, 2.1, 3.1, 4.2,
    5.1, 6.1, 7.1, 8.1,
    1.2, 2.1, 3.1, 4.1
]

feature_2 = [
    'b', 'b', 'b', 'b',
    'a', 'a', 'a', 'a',
    'c', 'c', 'c', 'c'
]

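# Building the frame from a list of rows and transposing it yields
# object-dtype columns, so the numerical column is cast back to float
df = pd.DataFrame([feature_1, feature_2]).T
df.columns = ['numerical', 'categorical']
df['numerical'] = df['numerical'].astype(float)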

df

## Ordinal Encoding

- Usually, we use one-hot encoding if we have categorical data without ordering information, so-called nominal data.
- An example of such data is blood type (A, B, AB, or O).

- Ordinal encoding is typically used if we have categorical data with ordering information.
- One example of such data is T-shirt sizes (XS, S, M, L, or XL).
- Now, assume that the "categorical" column above has ordered categories; we can use the `OrdinalEncoder` to encode it:

In [None]:
data = df['categorical'].values.reshape(-1, 1)
data

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ode = OrdinalEncoder(
    categories=[['a', 'b', 'c']]
)

ode.fit_transform(data)

- Notice that the categories were passed in alphabetical order, so the ordinal encoder maps `a: 0 < b: 1 < c: 2`. The default setting, `categories='auto'`, would produce the same mapping here, because it sorts the unique values it finds in the data.

- If we want a different ordering assumption, such as `b: 0 < a: 1 < c: 2`, we can specify the category order via the `categories` parameter as follows:

In [None]:
ode = OrdinalEncoder(
    categories=[['b', 'a', 'c']]
)

ode.fit_transform(data)
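
- To check which ordering an encoder actually uses, we can inspect its fitted `categories_` attribute and map codes back to labels with `inverse_transform`. The following is a minimal, self-contained sketch; the variable names (`labels`, `auto_ode`, `custom_ode`) are chosen here for illustration only, and the array simply rebuilds the "categorical" column from the toy data:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Rebuild the "categorical" toy column as a single-column 2D array
labels = np.array(list('bbbbaaaacccc'), dtype=object).reshape(-1, 1)

# Default categories='auto': categories are inferred from the data and sorted
auto_ode = OrdinalEncoder()
auto_codes = auto_ode.fit_transform(labels)

# Explicit ordering: the position in the list determines the integer code
custom_ode = OrdinalEncoder(categories=[['b', 'a', 'c']])
custom_codes = custom_ode.fit_transform(labels)

# categories_ holds one array per input column, listed in encoding order
print(auto_ode.categories_)
print(custom_ode.categories_)

# inverse_transform maps the integer codes back to the original labels
recovered = custom_ode.inverse_transform(custom_codes)
print(recovered.ravel())
```

- Both encoders return float arrays; only the mapping from label to code differs.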

## Using the `OrdinalEncoder` when other columns are present

- Below is an example using a `ColumnTransformer` to transform only specific columns via the `OrdinalEncoder` when multiple columns are present.
- For instance, considering the toy dataset at the top, assume we only want to transform the "categorical" column but not the "numerical" column:

In [None]:
import sklearn
from sklearn.compose import ColumnTransformer


ode = OrdinalEncoder()

# Positional index of the column to encode (the "categorical" column)
categorical_features = [1]

col_transformer = ColumnTransformer(
    transformers=[
        ('cat', ode, categorical_features)
    ],
    remainder='passthrough'
)

col_transformer.fit(df)
X_t = col_transformer.transform(df)

- Note that there are a few extra workarounds, such as the `FloatTransformer()`, which are explained [here](sklearn-onehot-encoding-mixedtype-df.ipynb).

In [None]:
X_t
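
- The `ColumnTransformer` places the transformed columns first and appends the passthrough columns afterwards, so the column order of the output can differ from the input. As a hedged sketch of how this could be handled, the example below selects the column by name instead of by position, passes an explicit category order, and rebuilds a data frame with matching column labels; the names `toy`, `named_transformer`, and `toy_encoded` are illustrative and not part of the original notebook:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Same toy data as at the top, built directly from a dict
toy = pd.DataFrame({
    'numerical': [1.1, 2.1, 3.1, 4.2, 5.1, 6.1,
                  7.1, 8.1, 1.2, 2.1, 3.1, 4.1],
    'categorical': list('bbbbaaaacccc'),
})

named_transformer = ColumnTransformer(
    transformers=[
        # Select the column by name and encode with b: 0 < a: 1 < c: 2
        ('cat', OrdinalEncoder(categories=[['b', 'a', 'c']]), ['categorical']),
    ],
    remainder='passthrough',
)

X_named = named_transformer.fit_transform(toy)

# Encoded column comes first, the untouched "numerical" column follows
toy_encoded = pd.DataFrame(X_named, columns=['categorical', 'numerical'])
print(toy_encoded.head())
```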

In [None]:
%load_ext watermark
%watermark --iversions