# Feature Engineering - Cardinality in Machine Learning

Cardinality refers to the number of possible values that a feature can assume. For example, the variable “US State” has 50 possible values. A binary feature, by contrast, can assume only one of two values (0 or 1).

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable gender the categories or labels are male and female, whereas in the variable city the labels can be London, Manchester, Brighton and so on.

Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, but a variable like city or postcode can contain a huge number of different labels.

The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality.
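As a minimal sketch of how cardinality can be counted with pandas, consider the toy dataframe below. The data and column names are made up for illustration and are not part of the Titanic dataset. The sketch also shows that missing values can be counted as a label or ignored, depending on how we ask.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "gender": ["male", "female", "female", "male", "female"],
        "city": ["London", "Manchester", "Brighton", "London", np.nan],
    }
)

# nunique() ignores missing values by default
cardinality = toy.nunique()

# dropna=False counts NaN as one additional label,
# which matches what len(series.unique()) reports
cardinality_with_nan = toy.nunique(dropna=False)

print(cardinality)
print(cardinality_with_nan)
```

Here gender has low cardinality, and the count for city changes by one depending on whether the missing value is treated as a label.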

#### Are multiple labels in a categorical variable a problem?
High cardinality may pose the following problems:

- Variables with too many labels tend to dominate over those with only a few labels, particularly in tree-based algorithms.
- A large number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fitting.
- Some of the labels may be present only in the training set and not in the test set, so machine learning algorithms may over-fit to the training set.
- Conversely, some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.

In particular, tree methods can be biased towards variables with lots of labels (variables with high cardinality). Thus, their performance may be affected by high cardinality.

Below we will see the effect of high cardinality of variables on the performance of different machine learning algorithms and how a quick fix to reduce the number of labels, without any sort of data insight, helps to boost the performance.

#### In this Blog:
We will:

- Learn how to quantify cardinality
- See examples of high and low cardinality variables
- Understand the effect of cardinality while preparing train and test sets
- See the effect of cardinality on Machine Learning Model performance

We will use the Titanic dataset.

In [None]:
# to perform basic array operations
import numpy as np

# to read the dataset into a dataframe and perform operations on it
import pandas as pd

# to build machine learning models
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression

# to evaluate the models
from sklearn.metrics import roc_auc_score

# to separate data into train and test
from sklearn.model_selection import train_test_split

Now we will read the Titanic dataset using read_csv(). The head() method shows the first 5 rows of the dataframe. The categorical variables in this dataset are Name, Sex, Ticket, Cabin and Embarked.

Note: Ticket and Cabin contain both letters and numbers, so they could be treated as Mixed Variables. For this demonstration, we will treat them as categorical variables.

In [None]:
data = pd.read_csv("data/titanic.csv")
data.head()

Let’s inspect the cardinality of each categorical variable in the dataset.

In [None]:
print("Number of categories in the variable Name: {}".format(len(data.Name.unique())))

print("Number of categories in the variable Sex: {}".format(len(data.Sex.unique())))

print(
    "Number of categories in the variable Ticket: {}".format(len(data.Ticket.unique()))
)

print("Number of categories in the variable Cabin: {}".format(len(data.Cabin.unique())))

print(
    "Number of categories in the variable Embarked: {}".format(
        len(data.Embarked.unique())
    )
)

print("Total number of passengers in the Titanic: {}".format(len(data)))

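While the variable Sex contains only 2 categories and Embarked contains 4 (low cardinality), the variables Ticket, Name and Cabin, as expected, contain a huge number of different labels (high cardinality). Note that unique() returns missing values (NaN) as a label of their own, so these counts include NaN wherever a variable has missing data.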

To demonstrate the effect of high cardinality in train and test sets and machine learning performance, we will work with the variable Cabin. We will create a new variable with reduced cardinality.

We will begin by exploring the values in the variable Cabin. As we saw in the previous cell, there are 148 unique values. We will display these values using unique().

In [None]:
data.Cabin.unique()

Now we will reduce the cardinality of the variable. To do so, instead of using the entire cabin value, we will retain only the first letter in Cabin_reduced.

Rationale: The first letter indicates the deck on which the cabin was located, and is therefore an indication of both social class status and proximity to the surface of the Titanic. Both are known to affect the probability of survival.

In [None]:
# let's capture the first letter of Cabin
data["Cabin_reduced"] = data["Cabin"].astype(str).str[0]

data[["Cabin", "Cabin_reduced"]].head()

Now let’s check the cardinality of Cabin_reduced. We reduced the number of different labels from 148 to 9.

In [None]:
print("Number of categories in the variable Cabin: {}".format(len(data.Cabin.unique())))

print(
    "Number of categories in the variable Cabin_reduced: {}".format(
        len(data.Cabin_reduced.unique())
    )
)
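Depending on the pandas version, astype(str) may turn missing cabins into the string "nan", so that they end up under the label "n". As a hedged alternative, the sketch below keeps missing values under an explicit label. The cabin values are made up, and the label name "Missing" is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up cabin values, including a missing one
cabins = pd.Series(["C85", "E46", np.nan, "C123", "B57 B59"], name="Cabin")

# .str[0] leaves NaN untouched, so the missing value can be labelled explicitly
deck = cabins.str[0].fillna("Missing")

print(deck.tolist())
print(deck.nunique())
```

Either way the cardinality drops to the number of decks plus one label for missing cabins. The only difference is how readable that extra label is.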

Now we will split the data into training and testing sets with the help of train_test_split(). use_cols contains the variables of the feature space, i.e. the variables that provide the information necessary for prediction. Survived contains the values to be predicted. Setting test_size=0.3 keeps 30% of the data for testing and uses the remaining 70% to train the model. random_state controls the shuffling applied to the data before the split.

In [None]:
use_cols = ["Cabin", "Cabin_reduced", "Sex"]

X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols], data["Survived"], test_size=0.3, random_state=0
)

X_train.shape, X_test.shape

As you can see from the previous cell, the training set contains 623 rows and the test set contains 268 rows.
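To see where these numbers come from, here is a small sketch that splits 891 placeholder rows (623 + 268) with the same test_size and random_state. The array of row indices is a stand-in for the real dataframe, used only to illustrate the split sizes and the role of random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 891 placeholder rows, the same number of rows as train + test above
rows = np.arange(891)

train_a, test_a = train_test_split(rows, test_size=0.3, random_state=0)
train_b, test_b = train_test_split(rows, test_size=0.3, random_state=0)

# sizes of the two partitions
print(len(train_a), len(test_a))

# the same random_state yields the same partition
print(np.array_equal(test_a, test_b))
```

Because 30% of 891 is not a whole number, the test size is rounded up and the training set receives the remaining rows.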

#### High cardinality leads to uneven distribution of categories in train and test sets
When a variable has high cardinality, some categories often land only in the training set, or only in the testing set. If present only in the training set, they may lead to over-fitting. If present only in the testing set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.

We will find the number of labels in Cabin which are present only in the training set and are not present in the test dataset.

In [None]:
unique_to_train_set = [
    x for x in X_train.Cabin.unique() if x not in X_test.Cabin.unique()
]

len(unique_to_train_set)

There are 100 cabin labels that are present only in the training set, and not in the testing set. Similarly, we will compute the number of labels present only in the test set and not in the training set.

In [None]:
unique_to_test_set = [
    x for x in X_test.Cabin.unique() if x not in X_train.Cabin.unique()
]

len(unique_to_test_set)

This problem can be overcome by reducing the cardinality of the variable. Let’s find out the number of labels present only in the training set for Cabin with reduced cardinality, i.e. Cabin_reduced.

In [None]:
unique_to_train_set = [
    x
    for x in X_train["Cabin_reduced"].unique()
    if x not in X_test["Cabin_reduced"].unique()
]

len(unique_to_train_set)

Now we will find the number of labels present only in the test set for Cabin with reduced cardinality i.e. Cabin_reduced.

In [None]:
unique_to_test_set = [
    x
    for x in X_test["Cabin_reduced"].unique()
    if x not in X_train["Cabin_reduced"].unique()
]

len(unique_to_test_set)

Observe how, after reducing the cardinality, there is now only 1 label in the training set that is not present in the test set, and no labels in the test set that are not present in the training set.
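The same comparison can also be written with Python sets, which some readers may find easier to follow. The sketch below uses made-up cabin values rather than the Titanic data. Missing values are dropped before comparing, so the counts may differ from the list-comprehension version when NaN is present.

```python
import pandas as pd

# Made-up train and test columns for illustration
train_cabin = pd.Series(["C85", "E46", "C123", "B57", "E46"])
test_cabin = pd.Series(["C85", "E33", "B57"])

train_labels = set(train_cabin.dropna())
test_labels = set(test_cabin.dropna())

# labels that appear on one side of the split only
only_in_train = train_labels - test_labels
only_in_test = test_labels - train_labels

print(sorted(only_in_train))
print(sorted(only_in_test))

# same comparison after keeping only the first letter
only_in_test_reduced = set(test_cabin.str[0]) - set(train_cabin.str[0])
print(sorted(only_in_test_reduced))
```

In this toy example the cabin E33 is unknown to the training data, but its deck letter E is not, so the reduced variable has no test-only labels.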

#### Effect of cardinality on Machine Learning Model Performance
In order to evaluate the effect of categorical variables in machine learning models, we will quickly replace the categories by numbers. We will re-map Cabin to numbers so we can use it to train ML models.

Note: This is neither the only nor the best way to encode categorical variables into numbers.

Here itertools is just used to display the first 100 elements in the newly created dictionary.

In [None]:
import itertools

cabin_dict = {k: i for i, k in enumerate(X_train["Cabin"].unique(), 0)}
print(dict(itertools.islice(cabin_dict.items(), 100)))

Now we will replace the labels in Cabin using the dictionary cabin_dict created above. The numerical values will be stored in Cabin_mapped.

In [None]:
X_train.loc[:, "Cabin_mapped"] = X_train.loc[:, "Cabin"].map(cabin_dict)
X_test.loc[:, "Cabin_mapped"] = X_test.loc[:, "Cabin"].map(cabin_dict)

X_train[["Cabin_mapped", "Cabin"]].head(10)

We can see that NaN takes the value 2 in the new variable, E17 takes the value 0, D33 takes the value 1, and so on. Now we will replace the letters in the Cabin_reduced variable with numbers following the same procedure as above.

In [None]:
# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train["Cabin_reduced"].unique(), 0)}

# replace labels by numbers with dictionary
X_train.loc[:, "Cabin_reduced"] = X_train.loc[:, "Cabin_reduced"].map(cabin_dict)
X_test.loc[:, "Cabin_reduced"] = X_test.loc[:, "Cabin_reduced"].map(cabin_dict)

X_train[["Cabin_reduced", "Cabin"]].head(20)

We see now that D33 and D26 correspond to the same number, 1, because we are capturing only the first letter. They both start with D.

Now we will map the categorical variable Sex to numbers.

In [None]:
X_train.loc[:, "Sex"] = X_train.loc[:, "Sex"].map({"male": 0, "female": 1})
X_test.loc[:, "Sex"] = X_test.loc[:, "Sex"].map({"male": 0, "female": 1})

X_train.Sex.head()

Next we will check if there are any missing values in these variables in the training as well as testing dataset.

In [None]:
X_train[["Cabin_mapped", "Cabin_reduced", "Sex"]].isnull().sum()

In [None]:
X_test[["Cabin_mapped", "Cabin_reduced", "Sex"]].isnull().sum()

In the test set, there are now 30 missing values for the high-cardinality variable Cabin_mapped. These were introduced while encoding the categories into numbers.

#### Why?

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, we will fill those missing values with 0.
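The mechanism can be reproduced on a tiny made-up example. The sketch below first shows how a dictionary built from the training labels leaves an unseen test label as NaN. It then shows one possible alternative, which is an assumption on our part and not necessarily the approach taken later: scikit-learn's OrdinalEncoder (version 0.24 or later) can reserve a dedicated code, here -1, for unseen labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up labels; "D33" appears only in the test data
train = pd.DataFrame({"Cabin": ["C85", "E46", "B57"]})
test = pd.DataFrame({"Cabin": ["E46", "D33"]})

# Dictionary built from the training labels only
mapping = {label: code for code, label in enumerate(train["Cabin"].unique())}
mapped_test = test["Cabin"].map(mapping)

# the unseen label has no entry in the dictionary, so it becomes NaN
print(mapped_test.isnull().sum())

# Assumed alternative: reserve -1 for labels not seen during fit
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[["Cabin"]])
encoded_test = encoder.transform(test[["Cabin"]])
print(encoded_test.ravel())
```

Unlike filling with 0, a reserved code does not collide with a label that was seen during training, though tree models will still treat it as just another number.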

Let’s check the number of different categories in the encoded variables.

In [None]:
len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())

From here we can conclude that from the original 148 cabins in the dataset, only 121 are present in the training set. We also see how we reduced the number of different categories to just 9 in our previous step.

Let’s go ahead and evaluate the effect of labels on machine learning algorithms.

### Random Forests
We will build the model on data with high cardinality for cabin and then predict using that model.

In [None]:
# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[["Cabin_mapped", "Sex"]], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[["Cabin_mapped", "Sex"]])
pred_test = rf.predict_proba(X_test[["Cabin_mapped", "Sex"]].fillna(0))

print("Train set")
print("Random Forests roc-auc: {}".format(roc_auc_score(y_train, pred_train[:, 1])))
print("Test set")
print("Random Forests roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1])))

We observe that the performance of the Random Forests on the training set is quite superior to its performance in the test set. This indicates that the model is over-fitting, which means that it does a great job at predicting the outcome on the dataset it was trained on, but it lacks the power to generalise the prediction for unseen data.

Now we will build the model on data with low cardinality for cabin and then predict using that model.

In [None]:
# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[["Cabin_reduced", "Sex"]], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[["Cabin_reduced", "Sex"]])
pred_test = rf.predict_proba(X_test[["Cabin_reduced", "Sex"]])

print("Train set")
print("Random Forests roc-auc: {}".format(roc_auc_score(y_train, pred_train[:, 1])))
print("Test set")
print("Random Forests roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1])))

We can see now that the Random Forests no longer over-fit to the training set. In addition, the model is much better at generalising the predictions.

Note: We can overcome the effect of high cardinality by adjusting the hyper-parameters of the random forests. That goes beyond the scope of this blog. Here, I want to show you that, given the same model with identical hyper-parameters, high cardinality may cause the model to over-fit.
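For readers curious about what such an adjustment might look like, here is a rough sketch on synthetic data, not on the Titanic dataset. The feature names, the 80% signal strength and the choice of min_samples_leaf=20 are all assumptions made for illustration, and the printed scores will depend on the data and the library version.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic data: one informative binary feature and one high-cardinality
# feature (a distinct integer per row) that carries no signal
signal = rng.integers(0, 2, size=n)
noise_id = rng.permutation(n)
y = np.where(rng.random(n) < 0.8, signal, 1 - signal)
X = np.column_stack([noise_id, signal])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare the default forest with one that requires larger leaves
results = {}
for name, params in {
    "default": {},
    "min_samples_leaf=20": {"min_samples_leaf": 20},
}.items():
    model = RandomForestClassifier(n_estimators=100, random_state=39, **params)
    model.fit(X_tr, y_tr)
    auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    results[name] = (auc_train, auc_test)
    print(name, round(auc_train, 3), round(auc_test, 3))
```

Requiring more samples per leaf limits how finely the trees can split on the uninformative high-cardinality feature, which is one way to narrow the gap between train and test scores.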

### AdaBoost
We will build the model on data with high cardinality for cabin and then predict using that model.

In [None]:
# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[["Cabin_mapped", "Sex"]], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[["Cabin_mapped", "Sex"]])
pred_test = ada.predict_proba(X_test[["Cabin_mapped", "Sex"]].fillna(0))

print("Train set")
print("Adaboost roc-auc: {}".format(roc_auc_score(y_train, pred_train[:, 1])))
print("Test set")
print("Adaboost roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1])))

Now we will build the model on data with low cardinality for cabin and then predict using that model.

In [None]:
# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[["Cabin_reduced", "Sex"]], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[["Cabin_reduced", "Sex"]])
pred_test = ada.predict_proba(X_test[["Cabin_reduced", "Sex"]].fillna(0))

print("Train set")
print("Adaboost roc-auc: {}".format(roc_auc_score(y_train, pred_train[:, 1])))
print("Test set")
print("Adaboost roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1])))

The AdaBoost model trained on the variable with high cardinality is also over-fitting to the training set, whereas the AdaBoost model trained on the low-cardinality variable is not over-fitting and therefore does a better job of generalising the predictions.

In addition, building an AdaBoost model on a variable with fewer categories in Cabin a) is simpler and b) is more robust: should a new cabin appear in the test set, taking just its first letter will usually yield a label that the ML model knows how to handle, because it was seen during training.
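A tiny sketch of point b), using hypothetical values: the sets of known labels and the cabin "C999" are invented for illustration.

```python
# Deck letters assumed to have been seen during training (illustrative)
known_decks = {"A", "B", "C", "D", "E"}
# Full cabin labels assumed to have been seen during training (illustrative)
known_cabins = {"C85", "E46", "B57"}

new_cabin = "C999"  # hypothetical cabin that never appeared in training

# the full label is unknown, but its first letter is a known deck
seen_as_full_label = new_cabin in known_cabins
seen_as_deck = new_cabin[0] in known_decks

print(seen_as_full_label, seen_as_deck)
```

The full label would have been encoded as a missing value, while the reduced label maps to a category the model has already learned from.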

### Logistic Regression
We will build the model on data with high cardinality for cabin and then predict using that model.

In [None]:
# call the model
logit = LogisticRegression(random_state=44, solver="lbfgs")

# train the model
logit.fit(X_train[["Cabin_mapped", "Sex"]], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[["Cabin_mapped", "Sex"]])
pred_test = logit.predict_proba(X_test[["Cabin_mapped", "Sex"]].fillna(0))

print("Train set")
print(
    "Logistic regression roc-auc: {}".format(roc_auc_score(y_train, pred_train[:, 1]))
)
print("Test set")
print("Logistic regression roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1])))

Now we will build the model on data with low cardinality for cabin and then predict using that model.

In [None]:
# call the model
logit = LogisticRegression(random_state=44, solver="lbfgs")

# train the model
logit.fit(X_train[["Cabin_reduced", "Sex"]], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[["Cabin_reduced", "Sex"]])
pred_test = logit.predict_proba(X_test[["Cabin_reduced", "Sex"]].fillna(0))

print("Train set")
print(
    "Logistic regression roc-auc: {}".format(roc_auc_score(y_train, pred_train[:, 1]))
)
print("Test set")
print("Logistic regression roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1])))

We can draw the same conclusion for Logistic Regression: reducing the cardinality improves the performance and generalisation of the algorithm.

### Gradient Boosted Classifier
We will build the model on data with high cardinality for cabin and then predict using that model.

In [None]:
# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[["Cabin_mapped", "Sex"]], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[["Cabin_mapped", "Sex"]])
pred_test = gbc.predict_proba(X_test[["Cabin_mapped", "Sex"]].fillna(0))

print("Train set")
print(
    "Gradient Boosted Trees roc-auc: {}".format(
        roc_auc_score(y_train, pred_train[:, 1])
    )
)
print("Test set")
print(
    "Gradient Boosted Trees roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1]))
)

Now we will build the model on data with low cardinality for cabin and then predict using that model.

In [None]:
# model built on data with fewer categories in the Cabin variable

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[["Cabin_reduced", "Sex"]], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[["Cabin_reduced", "Sex"]])
pred_test = gbc.predict_proba(X_test[["Cabin_reduced", "Sex"]].fillna(0))

print("Train set")
print(
    "Gradient Boosted Trees roc-auc: {}".format(
        roc_auc_score(y_train, pred_train[:, 1])
    )
)
print("Test set")
print(
    "Gradient Boosted Trees roc-auc: {}".format(roc_auc_score(y_test, pred_test[:, 1]))
)

We can see that, in this example, all four algorithms perform and generalise better when the cardinality of the variable is low.
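Since every model above follows the same fit, predict and score pattern, the repeated cells could be folded into one helper. The sketch below is one possible way to do that. The function name train_test_auc and the synthetic stand-in data are assumptions for illustration; with the real data, the X_train, X_test, y_train and y_test objects from this post would be passed in instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def train_test_auc(model, X_train, X_test, y_train, y_test, features):
    """Fit the model on the given features and return (train AUC, test AUC)."""
    model.fit(X_train[features], y_train)
    pred_train = model.predict_proba(X_train[features])[:, 1]
    # unseen labels were encoded as NaN, so fill them as in the cells above
    pred_test = model.predict_proba(X_test[features].fillna(0))[:, 1]
    return roc_auc_score(y_train, pred_train), roc_auc_score(y_test, pred_test)


# Tiny synthetic stand-in for the encoded Titanic columns
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    {
        "Cabin_reduced": rng.integers(0, 9, size=200),
        "Sex": rng.integers(0, 2, size=200),
    }
)
target = pd.Series(np.where(rng.random(200) < 0.8, frame["Sex"], 1 - frame["Sex"]))

X_tr, X_te = frame.iloc[:140], frame.iloc[140:]
y_tr, y_te = target.iloc[:140], target.iloc[140:]

auc_train, auc_test = train_test_auc(
    LogisticRegression(solver="lbfgs"), X_tr, X_te, y_tr, y_te, ["Cabin_reduced", "Sex"]
)
print(round(auc_train, 3), round(auc_test, 3))
```

Calling the helper once with ["Cabin_mapped", "Sex"] and once with ["Cabin_reduced", "Sex"] for each model would reproduce the comparisons in this post with far less repeated code.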