# Categoricals and Enums

In [1]:
import polars as pl

## Categorical Data
- Some string columns consist of a limited number of unique values.
- These columns store categorical data. Categorical means "consisting of categories".
- Some examples of categorical data: gender, blood type, country, subscription plan.
- By default, Polars stores each string separately in memory, even if the values are identical.

### The Categorical Data Type
- The `Categorical` data type uses a string cache behind the scenes.
- The cache is a dictionary that maps each string value to a corresponding `UInt32` integer.
- The integer is the "physical" representation, while the String is the "lexical" representation of the value.
- The cache design stores each string value in memory once. The integer maps back to the string.
- Categorical data is ideal when there is a small number of unique values in a column.

In [2]:
gym = pl.read_csv("gym_memberships.csv")
gym.head(5)

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,str,str,str,i64,bool
1,"""Silver""","""Pilates""","""Los Angeles""",20,False
2,"""Bronze""","""Spin""","""Miami""",7,True
3,"""Bronze""","""Pilates""","""Los Angeles""",9,False
4,"""Silver""","""Pilates""","""Dallas""",1,True
5,"""Gold""","""Pilates""","""Miami""",14,False


- The `estimated_size` method returns the number of bytes the `DataFrame` occupies in memory.

In [3]:
gym.estimated_size()

33678

- The `DataFrame` has 1000 rows and 6 columns.
- The `membership_tier`, `favorite_class`, and `city` columns store a small number of unique values.
- The `n_unique` method returns the number of unique values in a column.

In [4]:
gym.select(pl.all().n_unique())

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
u32,u32,u32,u32,u32,u32
1000,3,4,5,21,2


- Let's read the columns with a small number of repeating values as `Categorical` types.

In [5]:
gym = pl.read_csv(
    "gym_memberships.csv",
    schema_overrides={
        "membership_tier": pl.Categorical,
        "favorite_class": pl.Categorical,
        "city": pl.Categorical,
    },
)
gym.head(5)

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,cat,cat,cat,i64,bool
1,"""Silver""","""Pilates""","""Los Angeles""",20,False
2,"""Bronze""","""Spin""","""Miami""",7,True
3,"""Bronze""","""Pilates""","""Los Angeles""",9,False
4,"""Silver""","""Pilates""","""Dallas""",1,True
5,"""Gold""","""Pilates""","""Miami""",14,False


- The `DataFrame` occupies roughly 16.5% less memory!

In [6]:
gym.estimated_size()

28128

In [7]:
28128 / 33678

0.8352039907357919
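
- The ratio above is the fraction of memory that remains. To express it as a saving, subtract it from 1. The `percent_saved` helper below is a hypothetical convenience function, not part of Polars.

```python
baseline = 33678     # bytes with plain string columns
categorical = 28128  # bytes with Categorical columns


def percent_saved(before, after):
    # Fraction of memory removed, expressed as a percentage.
    return (1 - after / before) * 100


saved = percent_saved(baseline, categorical)
print(f"{saved:.1f}% less memory")  # 16.5% less memory
```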

### Further Reading
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#creating-a-categorical-series
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.estimated_size.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.n_unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html
- https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Categorical.html

## The cat Namespace
- Polars nests categorical methods under the `cat` attribute/namespace.
- The `cat.get_categories` method returns a column with the distinct category values.

In [8]:
gym = pl.read_csv(
    "gym_memberships.csv",
    schema_overrides={
        "membership_tier": pl.Categorical,
        "favorite_class": pl.Categorical,
        "city": pl.Categorical,
    },
)

gym.head(3)

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,cat,cat,cat,i64,bool
1,"""Silver""","""Pilates""","""Los Angeles""",20,False
2,"""Bronze""","""Spin""","""Miami""",7,True
3,"""Bronze""","""Pilates""","""Los Angeles""",9,False


- Polars uses a _single_ global string cache to store all categorical values.
- To prove this fact, we can target a sample categorical column and invoke the `cat.get_categories` method.
- The column stores the unique values across _all_ 3 categorical columns (`membership_tier`, `favorite_class`, and `city`).
- It's a confusing detail that the Polars team promises to address in future updates. Enums solve this problem.

In [9]:
gym.select(pl.col("city").cat.get_categories().alias("values"))

values
str
"""Silver"""
"""Gold"""
"""Yoga"""
"""New York"""
"""Bronze"""
…
"""Miami"""
"""Chicago"""
"""HIIT"""
"""Spin"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cat.get_categories.html

## Enums
- Enums are a similar data type to categoricals. They are ideal when the unique values are known in advance.
- The Polars docs recommend using enums when possible. The `Categorical` type incurs a performance cost in comparison.
- Instantiate an `Enum` with the `pl.Enum` constructor. Pass a list of the distinct values.
- When converting a column into an enum column, Polars raises an error if a value isn't found within the enum.
- The `Categorical` type will not raise this error because the values are not known in advance.

In [10]:
pl.read_csv("gym_memberships.csv").estimated_size()

33678

In [11]:
pl.read_csv(
    "gym_memberships.csv",
    schema_overrides={
        "membership_tier": pl.Categorical,
        "favorite_class": pl.Categorical,
        "city": pl.Categorical,
    },
).estimated_size()

28128

In [12]:
tiers_enum = pl.Enum(["Bronze", "Silver", "Gold"])
cities_enum = pl.Enum(["Miami", "Chicago", "Los Angeles", "New York", "Dallas"])
classes_enum = pl.Enum(["HIIT", "Yoga", "Spin", "Pilates"])

In [13]:
gym = pl.read_csv(
    "gym_memberships.csv",
    schema_overrides={
        "membership_tier": tiers_enum,
        "favorite_class": classes_enum,
        "city": cities_enum,
    },
)

gym.head(1)

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,enum,enum,enum,i64,bool
1,"""Silver""","""Pilates""","""Los Angeles""",20,False


In [14]:
gym.estimated_size()

19128

- The enum type enables a reduction of about 43% in memory!

In [15]:
19128 / 33678

0.567967218955995

- Each enum keeps its own set of categories, independent of the other enums (a separate cache per enum).
- Under the hood, Polars models an enum as an optimized categorical.
- We still use the `cat` namespace to access the categorical/enum methods.

In [16]:
gym.select(pl.col("membership_tier").cat.get_categories())

membership_tier
str
"""Bronze"""
"""Silver"""
"""Gold"""


In [17]:
gym.select(pl.col("favorite_class").cat.get_categories())

favorite_class
str
"""HIIT"""
"""Yoga"""
"""Spin"""
"""Pilates"""


In [18]:
gym.select(pl.col("city").cat.get_categories())

city
str
"""Miami"""
"""Chicago"""
"""Los Angeles"""
"""New York"""
"""Dallas"""


### Further Reading
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#creating-an-enum
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cat.get_categories.html

## Enums, Categorical, and Sorting
- There are some interesting nuances when we opt into storing categorical/enum values.

In [19]:
tiers_enum = pl.Enum(["Bronze", "Silver", "Gold"])
cities_enum = pl.Enum(["Miami", "Chicago", "Los Angeles", "New York", "Dallas"])
classes_enum = pl.Enum(["HIIT", "Yoga", "Spin", "Pilates"])

gym = pl.read_csv(
    "gym_memberships.csv",
    schema_overrides={
        "membership_tier": tiers_enum,
        "favorite_class": classes_enum,
        "city": pl.Categorical,
    },
)
gym.head(1)

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,enum,enum,cat,i64,bool
1,"""Silver""","""Pilates""","""Los Angeles""",20,False


- The order of variants in the enum declaration determines the sort order of the enum column.
- For example, we declared our tier order to be `"Bronze"`, then `"Silver"`, then `"Gold"`.
- Polars will sort a column by the variant declaration order, not alphabetical (lexical) order.
- We can use this sort logic to our advantage in domains where the sort order is not alphabetical.
- For example, we may _want_ to rank `"Bronze"` earlier than `"Silver"` in sort order.

In [20]:
gym.sort("membership_tier")

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,enum,enum,cat,i64,bool
2,"""Bronze""","""Spin""","""Miami""",7,true
3,"""Bronze""","""Pilates""","""Los Angeles""",9,false
6,"""Bronze""","""HIIT""","""New York""",8,false
7,"""Bronze""","""Pilates""","""New York""",11,false
8,"""Bronze""","""HIIT""","""Los Angeles""",2,false
…,…,…,…,…,…
983,"""Gold""","""HIIT""","""Los Angeles""",14,false
984,"""Gold""","""Spin""","""Miami""",8,false
987,"""Gold""","""Pilates""","""Miami""",9,false
988,"""Gold""","""Pilates""","""New York""",16,true



- The variant order also determines the result of comparison operators such as `>` and `<`.
- The next example targets rows with a membership tier greater than or equal to `"Silver"`.
- The rows include membership tier values of `"Silver"` and `"Gold"`.

In [21]:
gym.filter(pl.col("membership_tier") >= "Silver")

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,enum,enum,cat,i64,bool
1,"""Silver""","""Pilates""","""Los Angeles""",20,false
4,"""Silver""","""Pilates""","""Dallas""",1,true
5,"""Gold""","""Pilates""","""Miami""",14,false
9,"""Silver""","""Yoga""","""Chicago""",14,false
11,"""Silver""","""Spin""","""Dallas""",7,true
…,…,…,…,…,…
990,"""Silver""","""Pilates""","""New York""",12,true
991,"""Silver""","""Yoga""","""Dallas""",18,false
992,"""Silver""","""Yoga""","""Miami""",3,true
996,"""Silver""","""Spin""","""Los Angeles""",17,false


- With categoricals, Polars uses lexical (alphabetical) sorting.
- The reason is that Polars doesn't know the definitive ordering of the values.
- The categorical's values are built up as Polars encounters them.

In [22]:
gym.sort("city")

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,enum,enum,cat,i64,bool
9,"""Silver""","""Yoga""","""Chicago""",14,false
19,"""Bronze""","""Spin""","""Chicago""",19,false
32,"""Bronze""","""HIIT""","""Chicago""",9,true
39,"""Gold""","""Spin""","""Chicago""",13,true
40,"""Gold""","""Yoga""","""Chicago""",1,false
…,…,…,…,…,…
977,"""Gold""","""HIIT""","""New York""",9,true
988,"""Gold""","""Pilates""","""New York""",16,true
990,"""Silver""","""Pilates""","""New York""",12,true
995,"""Bronze""","""Spin""","""New York""",11,false


- Comparisons inside a filter are alphabetical as well.

In [23]:
gym.filter(pl.col("city") > "Los Angeles")

member_id,membership_tier,favorite_class,city,visits_last_month,is_active
i64,enum,enum,cat,i64,bool
2,"""Bronze""","""Spin""","""Miami""",7,true
5,"""Gold""","""Pilates""","""Miami""",14,false
6,"""Bronze""","""HIIT""","""New York""",8,false
7,"""Bronze""","""Pilates""","""New York""",11,false
14,"""Bronze""","""Yoga""","""New York""",11,true
…,…,…,…,…,…
992,"""Silver""","""Yoga""","""Miami""",3,true
994,"""Bronze""","""Pilates""","""Miami""",16,false
995,"""Bronze""","""Spin""","""New York""",11,false
997,"""Gold""","""HIIT""","""Miami""",20,false


### Further Reading
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#category-ordering-and-comparison
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#lexical-comparison-with-strings
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sort.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html