## Categoricals and the string cache
By the end of this lecture you will be able to:
- coordinate categorical mappings across `DataFrames` with the string cache
- filter a categorical column

We introduce the string cache here. In Section 6, on Joins and Concats, we will see that the string cache is useful when combining `DataFrames` with categorical columns.

In [None]:
import polars as pl

We create a `DataFrame` and add a categorical column called `cats`

In [None]:
df = (
    pl.DataFrame(
        {
            "strings": ["c","b","a","c"], 
            "values": [1, 2, 3, 4]
        }
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical).alias("cats")
    )
)
df

## Filtering a categorical column
We filter a categorical column for equality in the same way as for a string column

In [None]:
(
    df
    .filter(
        cats = "b"
    )
)

We can also filter a categorical column with `is_in` (note: in earlier versions of Polars this raised an exception)

In [None]:
(
    df
    .filter(
        pl.col("cats").is_in(["b"])
    )
)

## Categoricals from different `DataFrames`
When we combine `DataFrames` that have categoricals, Polars needs to ensure that both `DataFrames` use the same mapping from strings to integers.

To illustrate this we create a new `DataFrame` called `df_right` that has a different mapping of strings to integers from `df` above

In [None]:
df_right = (
    pl.DataFrame(
        {
            "strings": ["a","b"], 
            "values": [10, 20]
        }
    )
    .with_columns(
        pl.col("strings").cast(pl.Categorical).alias("cats")
    )
)
df_right

If we join `df` and `df_right` on the categorical column then the operation works but Polars also raises a warning

In [None]:
(
    df
    .join(
        df_right,
        on = "cats",
        how="left"
    )
)

### Why do we get a warning?

We get a warning because when we do this operation Polars:
1. checks if `df` and `df_right` have compatible mappings from strings to integers
2. if they do not, re-encodes the `df_right` mapping from strings to integers

With a large `DataFrame` that has many distinct categories this re-encoding could be expensive, and this is why we get the warning.

## Combining categoricals with the `StringCache`
We can instead use a `StringCache` to ensure that different `DataFrames` have the same categorical mapping.

The `StringCache` object:
- stores the categorical mapping
- ensures that all `DataFrames` use the same mapping. 

We can use the `StringCache`:
- inside a context manager or
- by enabling it globally.

We see both below.

### Using the `StringCache` inside a context-manager

A context-manager is a Python construct that runs set-up code when a `with` block starts and clean-up code when the block ends.

Everything inside the code block beginning with `with` is in the same context.
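
As a rough illustration of the general mechanism (this is not how Polars implements `StringCache`), here is a toy context-manager built with the standard library. The names `toy_cache` and `events` are hypothetical and exist only for this sketch.

```python
from contextlib import contextmanager

events = []  # records the order in which things happen


@contextmanager
def toy_cache():
    events.append("enter")  # set-up: runs when the with block starts
    try:
        yield
    finally:
        events.append("exit")  # clean-up: runs when the block ends, even on error


with toy_cache():
    events.append("work")  # the body of the with block

print(events)  # ['enter', 'work', 'exit']
```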

For the string cache,
```python
with pl.StringCache():
```
ensures that everything that happens in the following code block uses the same categorical mappings. In this example it ensures that the `cats` columns of `df` and `df_right` share the same mapping from strings to integers

In [None]:
with pl.StringCache():
    # Create the left dataframe
    df = (
        pl.DataFrame(
            {"strings": ["c","b","a","c"], "values": [1, 2, 3, 4]}
        )
        .with_columns(
            pl.col("strings").cast(pl.Categorical).alias("cats")
        )
    )
    # Create the right dataframe
    df_right = (
        pl.DataFrame(
            {
                "strings": ["a","b"], 
                "values": [10, 20]
            }
        )
        .with_columns(
            pl.col("strings").cast(pl.Categorical).alias("cats")
        )
    )
    # Join the dataframes
    df_joined = (
        df
        .join(
            df_right,
            on = "cats",
            how="left"
        )
    )
df_joined

In this case we do not get the warning.

At the end of the `with` block the `StringCache` is deleted but both `DataFrames` still have the mapping internally.

### Enabling the `StringCache`
We can also enable the `StringCache` for a whole session. Be aware that this can have effects beyond this script or notebook. In fact, I've commented it out here because when I run my test suite with `pytest` this command changes the outputs in other notebooks!

In [None]:
# pl.enable_string_cache()

When we use `pl.enable_string_cache()` Polars enables a `StringCache` that is used by all categorical columns until:
- the end of the session or
- you call `pl.disable_string_cache()`

You can see whether a string cache is enabled with 

In [None]:
pl.using_string_cache()

### Context-manager or enable the string cache?
Enabling the string cache globally is easier than using `pl.StringCache` in a context-manager.

However, I recommend using the context-manager approach as:
- it makes the use of the string cache explicit in the code
- it avoids errors that can arise from setting global values

### Use cases for `pl.StringCache`

We need the string cache whenever different objects with a categorical dtype are involved. For example when:
- joining `DataFrames` with categorical dtypes
- concatenating `DataFrames` with categorical dtypes
- creating a `DataFrame` with categorical dtype from multiple files

We will see examples of these in later Sections of the course.

## Exercises
In the exercises you will develop your understanding of:
- filtering a categorical column
- using the string cache

### Exercise 1
Create a `DataFrame` from the Titanic dataset and cast the `Pclass` column to categorical.

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Continue by casting the `Embarked` column to categorical in the same `with_columns` call.

Filter the `Pclass` column for third class passengers

Add a filter on the `Embarked` column for passengers who embarked in either Southampton (`S`) or Queenstown (`Q`)

### Exercise 2
We want to filter the Spotify `DataFrame` to find all tracks by either Taylor Swift or Ed Sheeran.

First we create the path to the CSV

In [None]:
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"

Enable the string cache

- Create the `DataFrame`
- Cast the `artist` column to categorical
- Filter for the tracks by the artists mentioned above

In [None]:
(
    pl.read_csv(spotify_csv,try_parse_dates=True)
    <blank>
)

Then disable the string cache

## Solutions

### Solution to Exercise 1

Cast the `Pclass` column to categorical

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical)
    )
    .head()
)


Cast the `Embarked` column to categorical

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
        pl.col("Embarked").cast(pl.Categorical)
    )
    .head(3)
)


Filter the `Pclass` column for third class passengers

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
        pl.col("Embarked").cast(pl.Categorical)
    )
    .filter(pl.col("Pclass")=="3")
    .head(3)
)


In addition, filter the `Embarked` column for passengers who embarked in Southampton (`S`) or Queenstown (`Q`)

In [None]:
df = (
    pl.read_csv(csv_file)
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8).cast(pl.Categorical),
        pl.col("Embarked").cast(pl.Categorical)
    )
    .filter(pl.col("Pclass")=="3")
    .filter(pl.col("Embarked").is_in(["S","Q"]))   
)
df.head(3)

### Solution to Exercise 2
We want to filter the Spotify `DataFrame` to find all tracks by either Taylor Swift or Ed Sheeran.

Enable the string cache (**not** using `with pl.StringCache`)

In [None]:
pl.enable_string_cache()

In [None]:
pl.Config.set_fmt_str_lengths(50)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"

- Create the `DataFrame`
- Cast the `artist` column to categorical
- Filter for the tracks by the artists mentioned above

In [None]:
(
    pl.read_csv(spotify_csv,try_parse_dates=True)
    .with_columns(
        pl.col("artist").cast(pl.Categorical)
    )
    .filter(
        pl.col("artist").is_in(["Taylor Swift","Ed Sheeran"])
    )
    .head()
)

Then disable the string cache

In [None]:
pl.disable_string_cache()