## Ordinal Encoding - Category encoders

Ordinal encoding consists of replacing the categories with integers from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable.

The numbers are assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models. It is also suitable for tree-based machine learning algorithms.
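To make the idea concrete, here is a minimal sketch in plain Python. The helper name `fit_ordinal_mapping` is hypothetical, and numbering the categories from 1 in order of first appearance is only one possible arbitrary assignment, not necessarily what a given library does.

```python
def fit_ordinal_mapping(values):
    """Assign 1..n to the distinct categories, in order of first appearance.

    Hypothetical helper written for illustration only.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


# a handful of category labels taken from the Neighborhood variable
neighborhoods = ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"]

mapping = fit_ordinal_mapping(neighborhoods)
encoded = [mapping[value] for value in neighborhoods]

print(mapping)
print(encoded)
```

The variable keeps a single column: each label is simply swapped for its integer.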


### Advantages

- Straightforward to implement
- Does not expand the feature space


### Limitations

- Does not capture any information about the category labels
- Not suitable for linear models

Ordinal encoding is better suited to non-linear methods, which can navigate the arbitrarily assigned integers to try to find patterns that relate them to the target.
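The sketch below illustrates why the integers carry no meaning of their own: the same categories, seen in a different row order, receive different codes. The helper `assign_codes` is hypothetical, and "order of first appearance" stands in for whatever arbitrary rule an encoder uses.

```python
def assign_codes(values):
    """Hypothetical helper: number categories from 1 in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping) + 1)
    return mapping


rows = ["VinylSd", "MetalSd", "Wd Sdng"]

codes_a = assign_codes(rows)                  # original row order
codes_b = assign_codes(list(reversed(rows)))  # same rows, reversed order

print(codes_a)
print(codes_b)
```

A linear model would read `VinylSd` as the smallest value under the first assignment and the largest under the second, although the data are the same. A tree-based model only needs to split on the codes, so it is less sensitive to the order in which they were assigned.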


## In this demo:

We will see how to perform ordinal encoding with Category encoders using the House Prices dataset.

For instructions on obtaining the dataset, please visit **section 2** of the course.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

from category_encoders.ordinal import OrdinalEncoder

In [2]:
# load dataset

data = pd.read_csv(
    "../../houseprice.csv",
    usecols=["Neighborhood", "Exterior1st", "Exterior2nd", "SalePrice"],
)

data.head()

|   | Neighborhood | Exterior1st | Exterior2nd | SalePrice |
|---|---|---|---|---|
| 0 | CollgCr | VinylSd | VinylSd | 208500 |
| 1 | Veenker | MetalSd | MetalSd | 181500 |
| 2 | CollgCr | VinylSd | VinylSd | 223500 |
| 3 | Crawfor | Wd Sdng | Wd Shng | 140000 |
| 4 | NoRidge | VinylSd | VinylSd | 250000 |


### Important

We learn which integer to assign to each category from the train set only, and then apply those same mappings to the test set.
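As a rough sketch of that rule in plain Python: the mapping is built from the train values only and then reused, unchanged, on the test values. The toy lists and the `UNSEEN` sentinel of -1 are assumptions made for this illustration; how a real encoder treats categories it did not see during fitting depends on its configuration.

```python
UNSEEN = -1  # sentinel chosen for this sketch, for categories absent from the train set

train = ["CollgCr", "Veenker", "Crawfor", "CollgCr"]
test = ["Veenker", "NoRidge", "CollgCr"]

# learn the mapping from the train set only
mapping = {}
for category in train:
    if category not in mapping:
        mapping[category] = len(mapping) + 1

# apply the same mapping to both sets
train_encoded = [mapping[category] for category in train]
test_encoded = [mapping.get(category, UNSEEN) for category in test]

print(train_encoded)
print(test_encoded)
```

`NoRidge` never appears in the toy train set, so it receives the sentinel rather than a new integer. Building a second mapping from the test set would give the same label different codes in the two sets.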

In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[["Neighborhood", "Exterior1st", "Exterior2nd"]],  # predictors
    data["SalePrice"],  # target
    test_size=0.3,  # fraction of observations in the test set
    random_state=0,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((1022, 3), (438, 3))

## Ordinal Encoding with Category encoders

In [4]:
ordinal_enc = OrdinalEncoder(
    cols=["Neighborhood", "Exterior1st", "Exterior2nd"],
)

ordinal_enc.fit(X_train)

In [5]:
# in the mapping we can observe the numbers
# assigned to each category for all the indicated variables

ordinal_enc.mapping

[{'col': 'Neighborhood',
  'mapping': CollgCr     1
  ClearCr     2
  BrkSide     3
  Edwards     4
  SWISU       5
  Sawyer      6
  Crawfor     7
  NAmes       8
  Mitchel     9
  Timber     10
  Gilbert    11
  Somerst    12
  MeadowV    13
  OldTown    14
  BrDale     15
  NWAmes     16
  NridgHt    17
  SawyerW    18
  NoRidge    19
  IDOTRR     20
  NPkVill    21
  StoneBr    22
  Blmngtn    23
  Veenker    24
  Blueste    25
  NaN        -2
  dtype: int64,
  'data_type': dtype('O')},
 {'col': 'Exterior1st',
  'mapping': VinylSd     1
  Wd Sdng     2
  WdShing     3
  HdBoard     4
  MetalSd     5
  AsphShn     6
  BrkFace     7
  Plywood     8
  CemntBd     9
  Stucco     10
  BrkComm    11
  AsbShng    12
  ImStucc    13
  CBlock     14
  Stone      15
  NaN        -2
  dtype: int64,
  'data_type': dtype('O')},
 {'col': 'Exterior2nd',
  'mapping': VinylSd     1
  Wd Sdng     2
  Plywood     3
  Wd Shng     4
  HdBoard     5
  MetalSd     6
  AsphShn     7
  CmentBd     8
  BrkF

In [6]:
# this is the list of variables that the encoder will transform

ordinal_enc.cols

['Neighborhood', 'Exterior1st', 'Exterior2nd']

In [7]:
X_train = ordinal_enc.transform(X_train)
X_test = ordinal_enc.transform(X_test)

# let's explore the result
X_train.head()

|   | Neighborhood | Exterior1st | Exterior2nd |
|---|---|---|---|
| 64 | 1 | 1 | 1 |
| 682 | 2 | 2 | 2 |
| 960 | 3 | 2 | 3 |
| 1384 | 4 | 3 | 4 |
| 1100 | 5 | 2 | 2 |


**Note**

If the argument `cols` is left as `None`, the encoder will automatically identify and encode all the categorical variables in the dataset. Is that not sweet?