## String and categorical dtypes
By the end of this lecture you will be able to:
- convert from string to categorical dtype
- get the underlying integer values
- sort categorical data

When a string column has many repeated values, it is often faster and less memory-intensive to cast the strings to the `pl.Categorical` dtype. The categorical dtype works in some surprising ways, however. In this lecture we go through the fundamentals of how Polars works with the categorical dtype.

In [None]:
import polars as pl

## Categorical dtype
The `pl.Categorical` dtype is useful when you have a string column with many repeated values.

The `pl.Categorical` dtype replaces the strings with a unique mapping from each string to an integer.
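
To make the idea of a string-to-integer mapping concrete, here is a conceptual sketch in plain Python. It is not how Polars implements categoricals internally; the `encode` function is a hypothetical helper, and we assume the integers are handed out in order of first appearance.

```python
# Conceptual sketch only: a hypothetical helper, not the Polars implementation
def encode(values):
    """Map each distinct string to an integer code in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            # A new string gets the next unused integer
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return mapping, codes


mapping, codes = encode(["cat", "dog", "rabbit", "cat"])
print(mapping)  # {'cat': 0, 'dog': 1, 'rabbit': 2}
print(codes)    # [0, 1, 2, 0]
```

Each distinct string is stored once in the mapping and the column itself only holds the integers, which is why repeated values become cheap.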

We first create a simple `DataFrame` with a string column

In [None]:
df = (
    pl.DataFrame(
        {
            "text":["cat","dog","rabbit","cat"]
        }
    )
)

We convert from string to categorical with `cast`

In [None]:
(
    df
    .with_columns(
        pl.col("text").cast(pl.Categorical).alias("text_cat")
    )
)

There is no difference in the printed appearance of values in a `pl.Categorical` column and the original string column.

### Physical representation of categoricals

In Polars the integer part of the categorical mapping is referred to as the **"physical"** representation.

We can see the underlying integer values with the `to_physical` expression

In [None]:
(
    df
    .with_columns(
        pl.col("text").cast(pl.Categorical).alias("text_cat")
    )
    .with_columns(
        pl.col("text_cat").to_physical().alias("cat_physical")
    )
)

The integer representation is set by the order of first occurrence in the column.

The dtype for the categorical encoding is `pl.UInt32` - unsigned 32-bit integers.

Polars can accommodate over 4 billion unique string mappings with `pl.UInt32` integers.

## Sorting categoricals

As categoricals have both a `lexical` (string) representation and an integer representation there are two ways to sort a categorical column.

To illustrate this we create a `DataFrame` with:
- some string values in the first column
- their position in the `values` column to keep track of where they started
- a categorical column and
- a physical column

In [None]:
df_physical = (
    pl.DataFrame(
            {"strings": ["c","b","a","c"], "values": [0, 1, 2, 3]}
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical).alias("cats")
    )
    .with_columns(
        pl.col("cats").to_physical().alias("physical")
    )
)
df_physical

If we sort this `DataFrame` on the `cats` column we see that the `"c"` values come first rather than `"a"`! 

**By default Polars sorts categoricals by the `physical` representation and not the string representation**

In [None]:
df_physical.sort("cats")

We can change the ordering convention to sort by the lexical (string) representation. We do this by passing the `ordering` argument to `pl.Categorical`. If we already have a categorical column with the default physical ordering, we can cast the column to a lexical ordering.

In [None]:
df_lexical = (
    df_physical
    .with_columns(
        pl.col("cats").cast(pl.Categorical(ordering="lexical")),
    )
)

In [None]:
df_lexical.sort("cats")

We could also set the lexical ordering from the outset when we first create the categorical column

In [None]:
(
    pl.DataFrame(
            {"strings": ["c","b","a","c"], "values": [0, 1, 2, 3]}
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical("lexical")).alias("cats")
    )
    .sort("cats")
)

## Operations on categoricals
Some operations raise an exception on categorical columns, even when they work on string columns.

You can see this behaviour by uncommenting the following cell

In [None]:
# (
#     df_lexical
#     .select(
#         pl.all().max()
#     )
# )

### Integers as categoricals?
We might have an integer column that we consider to be a categorical column. However, only a string column can be converted to `pl.Categorical` in Polars.

If we want to cast an integer column to categorical we first cast it to string dtype.


### Saving categoricals

If we save a `DataFrame` with a categorical column to:
- a Parquet file then the categorical dtype is preserved when we read it back into a `DataFrame`
- a CSV file then the categorical column is cast to string

## Exercises

In the exercises you will develop your understanding of:
- casting a string column to categorical
- accessing the physical values
- sorting by a categorical column in alphabetical order

### Exercise 1
We have the following `DataFrame` of animals and their sizes

In [None]:
df_animal_sizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
)


Cast the `size` column to categorical and call it `size_cats`

In [None]:
df_animal_sizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    <blank>
)

Add a column with the physical values of the categoricals

Sort the `DataFrame` by `size_cats` in alphabetical order

### Exercise 2
Create a `DataFrame` with the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(50)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Get the estimated size of `spotify_df` in megabytes

In [None]:
(
    spotify_df
    <blank>
)

Create a new Spotify `DataFrame` where we:
- cast any suitable columns to categorical
- cast any numerical columns to the smallest possible precision

See the following cell if you want a hint for a calculation to determine suitable columns to cast to categorical

In [None]:
# Hint 
# We can count the number of unique entries in a column with .unique().count()
(
    spotify_df
    .select(
        pl.col("title").unique().count()
    )
)

In [None]:
(
    spotify_df
    .select(
        <blank>
    )
)

Create the new `DataFrame` with a smaller size in memory

In [None]:
new_spotify_df = (
    spotify_df
    <blank>
)
new_spotify_df.head(3)

Get the estimated size of `new_spotify_df` in megabytes

In [None]:
(
    new_spotify_df
    <blank>
)

Find all rows where the artist is Taylor Swift

In [None]:
(
    new_spotify_df
    <blank>
    .head(3)
)

In the solutions we finish off with a performance comparison of a group by operation on a string column versus a categorical column

## Solutions

### Solution to Exercise 1 

Cast the `size` column to categorical and call it `size_cats`

In [None]:
df_animal_sizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    .with_columns(
        pl.col("size").cast(pl.Categorical).alias("size_cats")
    )
)
df_animal_sizes

Add a column with the physical values of the categoricals

In [None]:
df_animal_sizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    .with_columns(
        pl.col("size").cast(pl.Categorical).alias("size_cats")
    )
    .with_columns(
        pl.col("size_cats").to_physical().alias("physical"),
    )
    .sort("size_cats")
)
df_animal_sizes

Sort the `DataFrame` by `size_cats` in alphabetical order

In [None]:
df_animal_sizes = (
    pl.DataFrame(
        {
            "animals":["dog","cat","mouse","giraffe"],
            "size": ["medium","medium","small","big"]
        }
    )
    .with_columns(
        pl.col("size").cast(pl.Categorical("lexical")).alias("size_cats")
    )
    .with_columns(
        pl.col("size_cats").to_physical().alias("physical"),
    )
    .sort("size_cats")
)
df_animal_sizes

### Solution to Exercise 2
Create a `DataFrame` with the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(50)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Get the estimated size of `spotify_df` in megabytes

In [None]:
(
    spotify_df
    .estimated_size(unit="mb")
)

Create a new Spotify `DataFrame` where we:
- cast any suitable columns to categorical
- cast any numerical columns to the smallest appropriate precision

See the following cell if you want a hint for a calculation to determine suitable columns

In [None]:
# Hint 
# We can count the number of unique entries in a column with .unique().count()
(
    spotify_df
    .select(
        pl.col("title").unique().count()
    )
)

The count of unique values in the string columns

In [None]:
(
    spotify_df
    .select(
        pl.col(pl.Utf8).unique().count()
    )
)

Suitable columns have a string dtype and a small number of unique values. Here all of the string columns have a relatively small number of unique values, so we cast them all to categorical.

In [None]:
new_spotify_df = (
    spotify_df
    .with_columns(
        pl.col(pl.Utf8).cast(pl.Categorical),
        pl.col(pl.NUMERIC_DTYPES).shrink_dtype()
    )
)
new_spotify_df.head(3)

Get the estimated size of `new_spotify_df` in megabytes

In [None]:
(
    new_spotify_df
    .estimated_size(unit="mb")
)

Find all rows where the artist is Taylor Swift

In [None]:
(
    new_spotify_df
    .filter(pl.col("artist") == "Taylor Swift")
    .head(3)
)

Here we make a performance comparison of a group by operation on a string column versus a categorical column

In [None]:
%%timeit -n1 -r3
(
    spotify_df
    .group_by("title")
    .agg(
        pl.col("streams").sum()
    )
)

In [None]:
%%timeit -n1 -r3
(
    new_spotify_df
    .group_by("title")
    .agg(
        pl.col("streams").sum()
    )
)