# Cardinality

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable _gender_ the categories are male and female, whereas in the variable _city_ the labels could be London, Manchester, Brighton, and so on.

Categorical variables can contain different numbers of categories. The variable "gender" contains only 2 labels, but a variable like "city" or "postcode" can contain a huge number of labels.

The number of different labels is known as cardinality. A high number of labels within a variable is known as __high cardinality__.


## Is high cardinality a problem?

High cardinality poses the following challenges: 

- Variables with too many labels tend to dominate those with only a few labels, particularly in **decision tree-based** algorithms.

- High cardinality may introduce noise.

- Some of the labels may only be present in the training data set and not in the test set, so machine learning algorithms may over-fit to the training set.

- Some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.

**Algorithms based on decision trees can be biased towards variables with high cardinality**.
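
A quick way to get a feel for the first two challenges is to count how many labels are backed by a single observation. The sketch below uses a small made-up `city` column; the values are purely illustrative and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data: a few frequent cities and several one-off labels.
city = pd.Series(
    ["London", "London", "London", "Manchester", "Manchester",
     "Brighton", "Leeds", "York", "Bath"],
    name="city",
)

# Cardinality: the number of distinct labels.
cardinality = city.nunique()

# Labels that appear only once carry very little information on their own.
counts = city.value_counts()
rare_labels = sorted(counts[counts == 1].index)

print(cardinality)   # 6
print(rare_labels)   # ['Bath', 'Brighton', 'Leeds', 'York']
```

Here 4 of the 6 labels appear only once, so a model has a single example from which to learn anything about each of them.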

Below is a demo about the effect of high cardinality on the performance of various machine learning algorithms.

## In this Demo:

- Learn how to quantify cardinality.
- See examples of high and low cardinality variables.
- Understand the effect of cardinality in train and test sets.
- Evaluate the effect of cardinality on machine learning model performance.

We will use the Titanic dataset.

- To download the dataset, please refer to the **Datasets** lecture in **Section 2** of the course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# The machine learning models.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# To evaluate the models.
from sklearn.metrics import roc_auc_score

# To separate data into train and test.
from sklearn.model_selection import train_test_split

In [2]:
# let's load the titanic dataset.

data = pd.read_csv('../titanic.csv')

data.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 | S | 11.0 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |


The categorical variables are Name, Sex, Ticket, Cabin and Embarked.

**Note** that Ticket and Cabin contain both letters and numbers, so they could be treated as Mixed Variables. In this demo, I will treat them as categorical.

In [3]:
# Let's inspect the cardinality: the number
# of different labels.

print('Number of categories in the variable Name: {}'.format(
    len(data.name.unique())))

print('Number of categories in the variable Gender: {}'.format(
    len(data.sex.unique())))

print('Number of categories in the variable Ticket: {}'.format(
    len(data.ticket.unique())))

print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))

print('Number of categories in the variable Embarked: {}'.format(
    len(data.embarked.unique())))

print('Total number of passengers in the Titanic: {}'.format(len(data)))

Number of categories in the variable Name: 1307
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 929
Number of categories in the variable Cabin: 182
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 1309


While the variable Sex contains only 2 categories and the variable Embarked 4 (low cardinality), the variables Ticket, Name, and Cabin, as expected, contain a huge number of different labels (high cardinality).
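
The cell above counts labels with `len(series.unique())`, which treats a missing value as one more label. As a compact alternative, here is a minimal sketch using `nunique()` on a made-up frame (the `toy` data is an assumption for illustration, not the Titanic file), showing how the `dropna` argument changes the count.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic columns.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "cabin": ["B5", "C22", np.nan, "C22"],
    "embarked": ["S", "S", "C", np.nan],
})

# nunique() ignores missing values by default ...
without_nan = toy.nunique()

# ... while dropna=False counts NaN as one more label,
# which matches len(series.unique()).
with_nan = toy.nunique(dropna=False)

# cabin has 2 labels without NaN and 3 when NaN is counted.
print(without_nan)
print(with_nan)
```

Which of the two counts is more useful depends on whether missing values will later be encoded as a category of their own.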

To demonstrate the effect of high cardinality on train and test sets and on machine learning performance, I will work with the variable cabin. I will create a new variable with reduced cardinality.

In [4]:
# let's explore the values of Cabin.

# We know from the previous cell that there are 182
# different cabins, therefore the variable
# is highly cardinal.

data.cabin.unique()

array(['B5', 'C22', 'E12', 'D7', 'A36', 'C101', nan, 'C62', 'B35', 'A23',
       'B58', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99', 'C52', 'T',
       'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4', 'E40',
       'B38', 'E24', 'B51', 'B96', 'C46', 'E31', 'E8', 'B61', 'B77', 'A9',
       'C89', 'A14', 'E58', 'E49', 'E52', 'E45', 'B22', 'B26', 'C85',
       'E17', 'B71', 'B20', 'A34', 'C86', 'A16', 'A20', 'A18', 'C54',
       'C45', 'D20', 'A29', 'C95', 'E25', 'C111', 'C23', 'E36', 'D34',
       'D40', 'B39', 'B41', 'B102', 'C123', 'E63', 'C130', 'B86', 'C92',
       'A5', 'C51', 'B42', 'C91', 'C125', 'D10', 'B82', 'E50', 'D33',
       'C83', 'B94', 'D49', 'D45', 'B69', 'B11', 'E46', 'C39', 'B18',
       'D11', 'C93', 'B28', 'C49', 'B52', 'E60', 'C132', 'B37', 'D21',
       'D19', 'C124', 'D17', 'B101', 'D28', 'D6', 'D9', 'B80', 'C106',
       'B79', 'C47', 'D30', 'C90', 'E38', 'C78', 'C30', 'C118', 'D36',
       'D48', 'D47', 'C105', 'B36', 'B30', 'D43', 'B24', 'C2', 'C65',


Let's reduce the cardinality of the variable. How? Instead of using the entire value (letter + number), I will only use the first letter.

***Rationale***: the first letter indicates the deck on which the cabin was located, which reflects both social class and how close the cabin was to the Titanic's surface. Both are known to influence the probability of survival.

In [5]:
# Let's capture the first letter of cabin.

data['Cabin_reduced'] = data['cabin'].astype(str).str[0]

data[['cabin', 'Cabin_reduced']].head()

| | cabin | Cabin_reduced |
|---|---|---|
| 0 | B5 | B |
| 1 | C22 | C |
| 2 | C22 | C |
| 3 | C22 | C |
| 4 | C22 | C |


In [6]:
print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))

print('Number of categories in the variable Cabin reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

Number of categories in the variable Cabin: 182
Number of categories in the variable Cabin reduced: 9


We reduced the number of different labels from 182 to 9.
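
Because the cabin values are cast to strings before the first character is taken, passengers without a recorded cabin end up under a label of their own rather than as missing values. A minimal sketch of an alternative that names that label explicitly is shown below; the toy series and the `"Missing"` placeholder are my assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including one missing entry.
cabin = pd.Series(["B5", "C22", np.nan, "E12", "T"], name="cabin")

# Give missing cabins an explicit placeholder, then keep the first character.
# "Missing" is an illustrative choice, so its first letter "M" marks
# passengers without a recorded cabin.
cabin_reduced = cabin.fillna("Missing").str[0]

print(cabin_reduced.tolist())  # ['B', 'C', 'M', 'E', 'T']
```

Either way, the reduced variable has no missing values and keeps "no cabin recorded" as a separate category.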

In [7]:
# Let's separate the data into training and testing sets.

use_cols = ['cabin', 'Cabin_reduced', 'sex']

# This function is from Scikit-learn.
X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols], 
    data['survived'],  
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 3), (393, 3))

## Uneven distribution of categories

When a variable is highly cardinal, some categories appear only in the training set, and others only in the testing set. If present only in the training set, they may cause over-fitting. If present only in the testing set, the machine learning model will not know how to handle them, as they were not seen during training.

In [8]:
# Labels present only in the training set:

unique_to_train_set = [
    x for x in X_train.cabin.unique() if x not in X_test.cabin.unique()
]

len(unique_to_train_set)

113

There are 113 Cabins only present in the training set.

In [9]:
# Labels present only in the test set.

unique_to_test_set = [
    x for x in X_test.cabin.unique() if x not in X_train.cabin.unique()
]

len(unique_to_test_set)

36

Variables with high cardinality have categories present either only in the training set, or only in the testing set. This will cause problems at the time of training (over-fitting) and scoring of new data (how will the model deal with unseen categories?).
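
The comparison between the labels of the two sets can also be written with set differences. The sketch below uses two made-up columns (assumptions for illustration) and, unlike the list comprehensions above, drops missing values first so that NaN is never compared as a label.

```python
import pandas as pd

# Hypothetical train and test columns.
train_cabin = pd.Series(["B5", "C22", "E12", None, "C22"])
test_cabin = pd.Series(["C22", "D7", None])

# Drop missing values first so NaN is not compared as a label.
train_labels = set(train_cabin.dropna())
test_labels = set(test_cabin.dropna())

# Labels seen in one set but not in the other.
only_in_train = sorted(train_labels - test_labels)
only_in_test = sorted(test_labels - train_labels)

print(only_in_train)  # ['B5', 'E12']
print(only_in_test)   # ['D7']
```

Set lookups also avoid scanning the whole array of unique values once per label, which helps when the variable has thousands of categories.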

This problem can be mitigated by reducing the cardinality of the variable. Let's do that.

In [10]:
# Labels present only in the training set
# for Cabin with reduced cardinality.

unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]

len(unique_to_train_set)

1

In [11]:
# Labels present only in the test set
# for Cabin with reduced cardinality.

unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]

len(unique_to_test_set)

0

By reducing the cardinality, there is now only 1 label in the training set that is not present in the test set, and no label in the test set that is missing from the training set.

## The impact of cardinality on the performance of machine learning models

In order to evaluate the effect of categorical variables in machine learning models, I will quickly replace the categories with numbers.

In [12]:
# Let's re-map Cabin into numbers so we can use it to train ML models

# I will replace each cabin by a number
# to quickly demonstrate the effect of
# labels on machine learning algorithms.

##############
# Note: this is neither the only nor the best
# way to encode categorical variables into numbers.
# There is more on encoding techniques in the section
# "Encoding categorical variables".
##############

cabin_dict = {k: i for i, k in enumerate(X_train.cabin.unique(), 0)}
cabin_dict

{nan: 0,
 'E36': 1,
 'C68': 2,
 'E24': 3,
 'C22': 4,
 'D38': 5,
 'B50': 6,
 'A24': 7,
 'C111': 8,
 'F': 9,
 'C6': 10,
 'C87': 11,
 'E8': 12,
 'B45': 13,
 'C93': 14,
 'D28': 15,
 'D36': 16,
 'C125': 17,
 'B35': 18,
 'T': 19,
 'B73': 20,
 'B57': 21,
 'A26': 22,
 'A18': 23,
 'B96': 24,
 'G6': 25,
 'C78': 26,
 'C101': 27,
 'D9': 28,
 'D33': 29,
 'C128': 30,
 'E50': 31,
 'B26': 32,
 'B69': 33,
 'E121': 34,
 'C123': 35,
 'B94': 36,
 'A34': 37,
 'D': 38,
 'C39': 39,
 'D43': 40,
 'E31': 41,
 'B5': 42,
 'D17': 43,
 'F33': 44,
 'E44': 45,
 'D7': 46,
 'A21': 47,
 'D34': 48,
 'A29': 49,
 'D35': 50,
 'A11': 51,
 'B51': 52,
 'D46': 53,
 'E60': 54,
 'C30': 55,
 'D26': 56,
 'E68': 57,
 'A9': 58,
 'B71': 59,
 'D37': 60,
 'F2': 61,
 'C55': 62,
 'C89': 63,
 'C124': 64,
 'C23': 65,
 'C126': 66,
 'E49': 67,
 'E46': 68,
 'D19': 69,
 'B58': 70,
 'C82': 71,
 'B52': 72,
 'C92': 73,
 'E45': 74,
 'C65': 75,
 'E25': 76,
 'B3': 77,
 'D40': 78,
 'C91': 79,
 'B102': 80,
 'B61': 81,
 'A20': 82,
 'B36': 83,
 'C7': 84,

In [13]:
# Replace the labels in Cabin with the dictionary
# we just created.

X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'cabin'].map(cabin_dict)

X_train[['Cabin_mapped', 'cabin']].head(10)

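| | Cabin_mapped | cabin |
|---|---|---|
| 501 | 0 | NaN |
| 588 | 0 | NaN |
| 402 | 0 | NaN |
| 1193 | 0 | NaN |
| 686 | 0 | NaN |
| 971 | 0 | NaN |
| 117 | 1 | E36 |
| 540 | 0 | NaN |
| 294 | 2 | C68 |
| 261 | 3 | E24 |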


Note how NaN takes the value 0, E36 takes the value 1, C68 takes the value 2, and so on.

In [14]:
# Now I will replace the letters in the reduced cabin variable
# using the same procedure.

# Create replacement dictionary.
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# Replace labels by numbers using dictionary.
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(
    cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)

X_train[['Cabin_reduced', 'cabin']].head(20)



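| | Cabin_reduced | cabin |
|---|---|---|
| 501 | 0 | NaN |
| 588 | 0 | NaN |
| 402 | 0 | NaN |
| 1193 | 0 | NaN |
| 686 | 0 | NaN |
| 971 | 0 | NaN |
| 117 | 1 | E36 |
| 540 | 0 | NaN |
| 294 | 2 | C68 |
| 261 | 1 | E24 |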


We see now that E36 and E24 take the same number, 1, because we are capturing only the letter. They both start with E.

In [15]:
# Re-map the categorical variable Sex into numbers.

X_train.loc[:, 'sex'] = X_train.loc[:, 'sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'sex'] = X_test.loc[:, 'sex'].map({'male': 0, 'female': 1})

X_train.sex.head()



501     1
588     1
402     1
1193    0
686     1
Name: sex, dtype: int64

In [16]:
# Check if there are missing values in these variables.

X_train[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     0
Cabin_reduced    0
sex              0
dtype: int64

In [17]:
X_test[['Cabin_mapped', 'Cabin_reduced', 'sex']].isnull().sum()

Cabin_mapped     41
Cabin_reduced     0
sex               0
dtype: int64

In the test set, there are now 41 missing values for the highly cardinal variable. These were introduced when encoding the categories into numbers. 

How? 

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, I will fill in those missing values with 0.
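
The mechanism can be reproduced on a tiny example. This is a minimal sketch with made-up cabin labels (assumptions for illustration); it builds the dictionary from the training labels only, as in the cells above, and shows what `map()` returns for a label it has never seen.

```python
import pandas as pd

# Hypothetical train and test columns.
train_cabin = pd.Series(["B5", "C22", "E12", "C22"])
test_cabin = pd.Series(["C22", "D7", "B5"])

# Build the encoding from the training labels only.
mapping = {label: code for code, label in enumerate(train_cabin.unique())}

# "D7" was never seen in training, so map() has no code for it
# and returns NaN for that row.
test_encoded = test_cabin.map(mapping)
n_unseen = int(test_encoded.isnull().sum())

# Stop-gap used in this demo: send unseen labels to 0.
test_filled = test_encoded.fillna(0).astype(int)

print(n_unseen)              # 1
print(test_filled.tolist())  # [1, 0, 0]
```

Note that after filling with 0 the unseen label shares a code with an existing one. In the notebook, 0 is the code assigned to NaN, so unseen cabins end up being treated like missing cabins.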

In [18]:
# Let's check the number of different 
# categories in the encoded variables.

len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())

(147, 9)

From the above we note immediately that from the original 182 cabins in the dataset, only 147 are present in the training set. We also see how we reduced the number of different categories to just 9 in our previous step.

Let's go ahead and evaluate the effect of cardinality in machine learning algorithms.

## Random Forests

In [19]:
# Model trained with data with high cardinality.

# The model.
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# Train the model.
rf.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.853790650048556
Test set
Random Forests roc-auc: 0.7691361097284443


The Random Forests perform considerably better on the training set than on the test set (ROC-AUC of 0.85 vs 0.77). This indicates that the model is over-fitting: it does a great job of predicting the outcome on the dataset it was trained on, but it lacks the power to generalise the prediction to unseen data.
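
To see the same train/test gap in isolation, here is a sketch on synthetic data, which is an assumption of mine and not part of the Titanic demo. The single feature takes a different value in every row, an extreme case of high cardinality, and the target is pure noise, so any score above chance on the training set comes from memorising rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: one feature with a different value per row (maximal
# cardinality) and a binary target that is unrelated to the feature.
n_rows = 400
X = rng.permutation(n_rows).reshape(-1, 1)
y = rng.integers(0, 2, size=n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Default hyper-parameters, so the trees are grown until they fit the rows.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print('train roc-auc: {:.2f}'.format(auc_train))
print('test roc-auc:  {:.2f}'.format(auc_test))
print('gap:           {:.2f}'.format(auc_train - auc_test))
```

The exact figures depend on the library versions and seeds, but the training score should sit well above the test score, which should stay close to chance.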

In [20]:
# Model trained with data with low cardinality.

# The model.
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# Train the model.
rf.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8163420365403872
Test set
Random Forests roc-auc: 0.8017670482827277


Note that the Random Forests no longer over-fit to the training set: the train and test ROC-AUC are now close (0.82 vs 0.80), whereas the previous model scored 0.85 on the train set and 0.77 on the test set. The model is much better at generalising the predictions.

**Note that we could probably overcome the effect of high cardinality by adjusting the hyper-parameters of the random forests. That goes beyond the scope of this course. Here, I want to show you that, given a model with identical hyper-parameters, high cardinality may cause the model to over-fit.**

## AdaBoost

In [21]:
# Model trained with data with high cardinality.

# The model.
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# Train the model.
ada.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# Make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
AdaBoost roc-auc: 0.8296861713101102
Test set
AdaBoost roc-auc: 0.7604391350035948


In [22]:
# Model trained with data with fewer categories.

# The model.
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# Train the model.
ada.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))

print('Train set')
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
AdaBoost roc-auc: 0.8161256723642566
Test set
AdaBoost roc-auc: 0.8001078480172557


Similarly, AdaBoost trained with the high cardinality variable over-fits to the train set, whereas AdaBoost trained with the low cardinality variable does not.

In addition, training AdaBoost with fewer categories in Cabin a) returns a simpler model and b) makes the model more robust to new data: if a different cabin appears in the test set, taking just its first letter most likely yields a value that the model saw during training, so it will know how to handle it.

## Logistic Regression

In [23]:
# Model trained with data with high cardinality.

# The model.
logit = LogisticRegression(random_state=44, solver='lbfgs')

# Train the model.
logit.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8133909298124677
Test set
Logistic regression roc-auc: 0.7750815773463858


In [24]:
# Model trained with data with fewer categories.

# The model.
logit = LogisticRegression(random_state=44, solver='lbfgs')

# Train the model.
logit.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# Make predictions on train and test set.
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8123468468695123
Test set
Logistic regression roc-auc: 0.8008268347989602


We can draw the same conclusion for Logistic Regression: reducing the cardinality improves the performance on the test set (ROC-AUC of 0.80 vs 0.78) and narrows the gap between the train and test sets.

## Gradient Boosted Classifier

In [25]:
# Model trained with data with high cardinality.

# The model.
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# Train the model.
gbc.fit(X_train[['Cabin_mapped', 'sex']], y_train)

# Make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.862631390919749
Test set
Gradient Boosted Trees roc-auc: 0.7733117637298823


In [26]:
# Model trained with data with fewer categories.

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_reduced', 'sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.816719415917359
Test set
Gradient Boosted Trees roc-auc: 0.8015181682429069


Gradient Boosted trees over-fit to the training set when using a variable with high cardinality (ROC-AUC of 0.86 on the train set vs 0.77 on the test set). This was expected, as tree-based methods tend to be biased towards variables with plenty of categories.

**That is all for this demonstration. I hope you enjoyed the notebook, and I'll see you in the next one.**