## Cardinality

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable _gender_ the categories or labels are male and female, whereas in the variable _city_ the labels can be London, Manchester, Brighton and so on.

Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, whereas a variable like city or postcode can contain a huge number of different labels.

The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as __high cardinality__.
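As a minimal sketch of how cardinality can be quantified, the snippet below counts the distinct labels in each column of a small, made-up DataFrame. The column names and values are illustrative assumptions, not the dataset used later in this demo.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "gender": ["male", "female", "female", "male", "female"],
    "city": ["London", "Manchester", "Brighton", "London", "Leeds"],
})

# nunique() returns the number of distinct labels in each column
cardinality = toy.nunique()
print(cardinality)
```

Here `gender` has a cardinality of 2 and `city` a cardinality of 4.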


### Are multiple labels in a categorical variable a problem?

High cardinality may pose the following problems: 

- Variables with too many labels tend to dominate over those with only a few labels, particularly in **tree-based** algorithms.

- A large number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fitting.

- Some labels may be present only in the training set and not in the test set, so machine learning algorithms may over-fit to the training set.

- Conversely, some labels may appear only in the test set, leaving the machine learning algorithm unable to perform a calculation on the new (unseen) observations.


In particular, **tree methods can be biased towards variables with lots of labels** (variables with high cardinality). Thus, their performance may be affected by high cardinality.
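The last problem in the list, labels that appear only in the test set, can be illustrated with a small sketch. It assumes a very simple integer encoding learned from the training data; the data and the `codes` mapping are made up for illustration and are not part of this demo's pipeline.

```python
import pandas as pd

# toy train / test columns, invented for illustration only
train_city = pd.Series(["London", "Manchester", "London", "Brighton"])
test_city = pd.Series(["London", "Leeds"])  # "Leeds" never appears in train

# learn an integer code for each label seen in the training data
codes = {label: i for i, label in enumerate(sorted(train_city.unique()))}

# labels without a learned code cannot be encoded: map() returns NaN for them
encoded_test = test_city.map(codes)
print(encoded_test)

# labels that occur in the test data but were never seen during training
unseen = sorted(set(test_city) - set(train_city))
print("Labels unseen during training:", unseen)
```

The unseen label ends up as a missing value after encoding, which most models cannot score without further handling.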

Below, I will show the effect of high cardinality of variables on the performance of different machine learning algorithms, and how a quick fix to reduce the number of labels, without any sort of data insight, already helps to boost performance.

## In this Demo:

We will:

- Learn how to quantify cardinality
- See examples of high and low cardinality variables
- Understand the effect of cardinality when preparing train and test sets
- Visualise the effect of cardinality on Machine Learning Model performance

We will use the Titanic dataset.

- To download the dataset, please refer to the **Datasets** lecture in **Section 1** of the course.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

# to separate data in train and test splits
from sklearn.model_selection import train_test_split

In [2]:
# We will again load the titanic dataset
data = pd.read_csv("D:/ML-Datasets/Kaggle-Datasets/titanic.csv")
data.head()

| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 | S | 11.0 | | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 | S | | | Montreal, PQ / Chesterville, ON |


The categorical variables in this dataset are Name, Sex, Ticket, Cabin and Embarked.

---------------
**Note** that Ticket and Cabin contain both letters and numbers, so they could be treated as Mixed Variables. For this demonstration, I will treat them as categorical.

In [3]:
# let's inspect the cardinality, that is, the number of different
# labels, of each categorical variable
# (note: unique() also counts NaN, i.e. missing values, as a label)

print('Number of categories in the variable Name: {}'.format(
    len(data.name.unique())))

print('Number of categories in the variable Gender: {}'.format(
    len(data.sex.unique())))

print('Number of categories in the variable Ticket: {}'.format(
    len(data.ticket.unique())))

print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))

print('Number of categories in the variable Embarked: {}'.format(
    len(data.embarked.unique())))

print('Total number of passengers in the dataset: {}'.format(len(data)))

Number of categories in the variable Name: 1307
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 929
Number of categories in the variable Cabin: 182
Number of categories in the variable Embarked: 4
Total number of passengers in the dataset: 1309


While the variable Sex contains only 2 categories and Embarked 4 (low cardinality), the variables Ticket, Name and Cabin, as expected, contain a huge number of different labels (high cardinality).
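One detail worth keeping in mind when reading these counts: `unique()` treats a missing value as a label of its own, whereas `nunique()` leaves it out by default. The following sketch, on a made-up column rather than the Titanic data, shows the difference.

```python
import numpy as np
import pandas as pd

# toy column with a missing value, invented for illustration only
embarked = pd.Series(["S", "C", "Q", "S", np.nan])

n_with_nan = len(embarked.unique())          # NaN is included as a value
n_without_nan = embarked.nunique()           # NaN is dropped by default
n_explicit = embarked.nunique(dropna=False)  # NaN is counted explicitly

print(n_with_nan, n_without_nan, n_explicit)
```

Which count is appropriate depends on whether missing values should be regarded as a category in their own right.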

To demonstrate the effect of high cardinality in train and test sets and machine learning performance, I will work with the variable Cabin. I will create a new variable with reduced cardinality.

In [4]:
# let's explore the values / categories of Cabin

# we know from the previous cell that there are 182
# different cabins, therefore the variable
# is highly cardinal

data.cabin.unique()

array(['B5', 'C22', 'E12', 'D7', 'A36', 'C101', nan, 'C62', 'B35', 'A23',
       'B58', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C99', 'C52', 'T',
       'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B4', 'E40',
       'B38', 'E24', 'B51', 'B96', 'C46', 'E31', 'E8', 'B61', 'B77', 'A9',
       'C89', 'A14', 'E58', 'E49', 'E52', 'E45', 'B22', 'B26', 'C85',
       'E17', 'B71', 'B20', 'A34', 'C86', 'A16', 'A20', 'A18', 'C54',
       'C45', 'D20', 'A29', 'C95', 'E25', 'C111', 'C23', 'E36', 'D34',
       'D40', 'B39', 'B41', 'B102', 'C123', 'E63', 'C130', 'B86', 'C92',
       'A5', 'C51', 'B42', 'C91', 'C125', 'D10', 'B82', 'E50', 'D33',
       'C83', 'B94', 'D49', 'D45', 'B69', 'B11', 'E46', 'C39', 'B18',
       'D11', 'C93', 'B28', 'C49', 'B52', 'E60', 'C132', 'B37', 'D21',
       'D19', 'C124', 'D17', 'B101', 'D28', 'D6', 'D9', 'B80', 'C106',
       'B79', 'C47', 'D30', 'C90', 'E38', 'C78', 'C30', 'C118', 'D36',
       'D48', 'D47', 'C105', 'B36', 'B30', 'D43', 'B24', 'C2', 'C65',


Let's now reduce the cardinality of the variable. How? Instead of using the entire **cabin** value, I will capture only the first letter.

***Rationale***: the first letter indicates the deck on which the cabin was located, and is therefore a proxy for both social class and proximity to the upper decks of the Titanic. Both are known to affect the probability of survival.

In [5]:
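# let's capture the first letter of the cabin
# (note: astype(str) turns missing values into the string 'nan' here,
# so passengers without a cabin end up with the label 'n')
data['Cabin_reduced'] = data['cabin'].astype(str).str[0]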

data[['cabin', 'Cabin_reduced']].head()

| | cabin | Cabin_reduced |
|---|---|---|
| 0 | B5 | B |
| 1 | C22 | C |
| 2 | C22 | C |
| 3 | C22 | C |
| 4 | C22 | C |


In [6]:
print('Number of categories in the variable Cabin: {}'.format(
    len(data.cabin.unique())))

print('Number of categories in the variable Cabin reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

Number of categories in the variable Cabin: 182
Number of categories in the variable Cabin reduced: 9


We reduced the number of different labels from 182 to 9.
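Note that the reduced count also depends on how missing cabins are handled. As a minimal sketch on made-up values, and as an assumption rather than the approach taken in this demo, missing cabins could be given an explicit label of their own. The label `"Missing"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# toy cabin values with missing entries, invented for illustration only
cabin = pd.Series(["B5", "C22", np.nan, "E12", np.nan])

# take the first letter (the deck); missing cabins stay missing at this point
deck = cabin.str[0]

# give missing cabins an explicit, easily recognisable label
deck = deck.fillna("Missing")

print(deck.tolist())
print(deck.nunique())
```

An explicit label keeps missing cabins visibly separate from the real deck letters.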

In [7]:
# let's separate into training and testing set
# in order to build machine learning models

use_cols = ['cabin', 'Cabin_reduced', 'sex']

# this function comes from scikit-learn
X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols], 
    data['survived'],  
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((916, 3), (393, 3))

### High cardinality leads to uneven distribution of categories in train and test sets

When a variable is highly cardinal, some categories often end up only in the training set, or only in the test set. Categories present only in the training set may lead to over-fitting. Categories present only in the test set are a problem because the machine learning algorithm has not seen them during training and will not know how to handle them.
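To make this concrete, here is a minimal simulation in plain Python. It compares an extreme high-cardinality column (one label per row) with a two-label column under the same random split. The column contents, the 70/30 split and the helper name `labels_only_in_test` are illustrative assumptions.

```python
import random

random.seed(0)

n_rows = 1000
# hypothetical columns: one label per row (very high cardinality) vs. 2 labels
high_card = [f"cabin_{i}" for i in range(n_rows)]
low_card = ["male" if i % 2 == 0 else "female" for i in range(n_rows)]

# random 70/30 split of the row positions
rows = list(range(n_rows))
random.shuffle(rows)
train_rows, test_rows = rows[:700], rows[700:]


def labels_only_in_test(values):
    """Return the labels that occur in the test rows but never in the train rows."""
    train_labels = {values[i] for i in train_rows}
    test_labels = {values[i] for i in test_rows}
    return test_labels - train_labels


n_unseen_high = len(labels_only_in_test(high_card))
n_unseen_low = len(labels_only_in_test(low_card))
print(n_unseen_high, n_unseen_low)
```

With one label per row, every label in the test rows is unseen during training. The two-label column has no unseen labels, because 700 training rows cannot all come from the 500 rows that share a single label.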

In [8]:
# Let's find out labels present only in the training set

unique_to_train_set = [
    x for x in X_train.cabin.unique() if x not in X_test.cabin.unique()
]

len(unique_to_train_set)

113

There are 113 Cabins only present in the training set, and not in the testing set.

In [9]:
# Let's find out the labels present only in the test set

unique_to_test_set = [
    x for x in X_test.cabin.unique() if x not in X_train.cabin.unique()
]

len(unique_to_test_set)

36

Variables with high cardinality tend to have values (i.e., categories) present in the training set, that are not present in the test set, and vice versa. This will bring problems at the time of training (due to over-fitting) and scoring of new data (how should the model deal with unseen categories?).
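The same comparison can also be written with set operations. The sketch below uses made-up values and a hypothetical helper, `label_overlap`. Unlike the list comprehensions above, it drops missing values before comparing, so its counts on the real data are not guaranteed to match the ones shown.

```python
import numpy as np
import pandas as pd


def label_overlap(train, test):
    """Summarise which non-missing labels appear in only one of two Series."""
    train_labels = set(train.dropna().unique())
    test_labels = set(test.dropna().unique())
    return {
        "only_in_train": sorted(train_labels - test_labels),
        "only_in_test": sorted(test_labels - train_labels),
        "shared": sorted(train_labels & test_labels),
    }


# toy train / test columns, invented for illustration only
train = pd.Series(["B5", "C22", "C22", np.nan, "E12"])
test = pd.Series(["C22", "D7", np.nan])

summary = label_overlap(train, test)
print(summary)
```

Building each set once also avoids recomputing the unique values for every label that is checked.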

This problem is almost overcome by reducing the cardinality of the variable. See below.

In [10]:
# Let's find out labels present only in the training set
# for Cabin with reduced cardinality

unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]

len(unique_to_train_set)

1

In [11]:
# Let's find out labels present only in the test set
# for Cabin with reduced cardinality

unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]

len(unique_to_test_set)

0