# Categoricals and Enums

## Categorical Data
- Some string columns consist of a limited number of unique values.
- These columns store categorical data. Categorical means "consisting of categories".
- Some examples of categorical data: gender, blood type, country, subscription plan.
- By default, Polars stores each string separately in memory, even if the values are identical.

### The Categorical Data Type
- The `Categorical` data type uses a string cache behind the scenes.
- The cache is a dictionary that maps each string value to a corresponding `UInt32` integer.
- The integer is the "physical" representation, while the String is the "lexical" representation of the value.
- The cache design stores each string value in memory once. The integer maps back to the string.
- Categorical data is ideal when there is a small number of unique values in a column.

- The `estimated_size` method returns the number of bytes the `DataFrame` occupies in memory.

- The `DataFrame` has 1000 rows and 6 columns.
- The `membership_tier`, `favorite_class`, and `city` columns store a small number of unique values.
- The `n_unique` method returns the number of unique values in a column.

- Let's cast the columns with a small number of repeating values to the `Categorical` type.

- The `DataFrame` occupies 17% less memory!

### Further Reading
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#creating-a-categorical-series
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.estimated_size.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.n_unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html
- https://docs.pola.rs/api/python/stable/reference/api/polars.datatypes.Categorical.html

## The cat Namespace
- Polars nests categorical methods under the `cat` attribute/namespace.
- The `cat.get_categories` method returns a column with the distinct category values.

- Polars uses a _single_ global string cache to store all categorical values.
- To prove this fact, we can target a sample categorical column and invoke the `cat.get_categories` method.
- The column stores the unique values across _all_ 3 categorical columns (`membership_tier`, `favorite_class`, and `city`).
- It's a confusing detail that the Polars team promises to address in future updates. Enums solve this problem.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cat.get_categories.html

## Enums
- Enums are a similar data type to categoricals. They are ideal when the unique values are known in advance.
- The Polars docs recommend using enums when possible. The `Categorical` type incurs a performance cost in comparison.
- Instantiate an `Enum` with the `pl.Enum` constructor. Pass a list of the distinct values.
- When converting a column into an enum column, Polars will raise an error if a value isn't found within the enum.
- The `Categorical` type will not raise this error because the values are not known in advance.

- The enum type enables a 44% reduction in memory!

- Each enum keeps its own set of categories, independent of any other column (a separate cache).
- Under the hood, Polars models an enum as an optimized categorical.
- We still use the `cat` namespace to access the categorical/enum methods.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#creating-an-enum
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cat.get_categories.html

## Enums, Categorical, and Sorting
- There are some interesting nuances when we opt into storing categorical/enum values.

- The order of variants in the enum declaration determines the sort order of the enum column.
- For example, we declared our tier order to be `"Bronze"`, then `"Silver"`, then `"Gold"`.
- Polars will sort a column by the variant declaration order, not alphabetical (lexical) order.
- We can use this sort logic to our advantage in domains where the sort order is not alphabetical.
- For example, we may _want_ to rank `"Bronze"` earlier than `"Silver"` in sort order.


- The variant order also determines the result of comparison operators such as `>` and `<`.
- The next example targets rows with a membership tier greater than or equal to `"Silver"`.
- The rows include membership tier values of `"Silver"` and `"Gold"`.

- With categoricals, Polars uses lexical (alphabetical) sorting.
- This is because Polars doesn't know the definitive ordering of the values.
- The categorical's values are built up as Polars encounters them.

- Comparisons inside a filter are alphabetical as well.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#category-ordering-and-comparison
- https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/#lexical-comparison-with-strings
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sort.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html