## Join on string and categorical columns
By the end of this lecture you will be able to:
- join on string columns
- join on categorical columns
- do fast-track joins on string columns using categoricals

I recommend that you do the lectures on "String and categorical dtypes" and "Categoricals and the string cache" in Section 3 before doing this lecture.

In [None]:
import polars as pl
import numpy as np
np.random.seed(0)

## Joins on string dtype

We first create a short array with some integers

In [None]:
integer_array = np.array([3,3,1,2])
integer_array

For the left `DataFrame` we convert each of the integers to an `id` string that starts with `"id"`. We keep the integers in the `values` column

In [None]:
df_left = (
    pl.DataFrame(
        {
            "id":[f"id{i}" for i in integer_array],
            "values":integer_array
        }
    )
)
df_left

We then create the right `DataFrame` that has metadata about each `id`

In [None]:
df_right = (
    pl.DataFrame(
        {
            "id":[f"id{i}" for i in range(1,4)],
            "metadata":[i for i in range(1,4)]
        }
    )
)
df_right

When the `id` column is a string dtype we can join these `DataFrames` in the standard way 

In [None]:
(
    df_left.join(df_right,on="id")
)

Polars cannot use the fast-track algorithm for joining string columns as the algorithm works on integers.

To use the fast-track algorithm the string column must be cast to categorical dtype

## Joins on categorical dtype
We cast the `id` column to categorical dtype

In [None]:
df_left = (
    pl.DataFrame(
        {
            "id":[f"id{i}" for i in integer_array],
            "values":integer_array
        }
    )
    .with_columns(
        pl.col("id").cast(pl.Categorical)
    )
)
df_left

And we cast the `id` column to categorical for the right `DataFrame`

In [None]:
df_right = (
    pl.DataFrame(
        {
            "id":[f"id{i}" for i in range(1,4)],
            "metadata":[i for i in range(1,4)]
        }
    )
    .with_columns(
        pl.col("id").cast(pl.Categorical)
    )
)
df_right

If we try to join them on the categorical column we get a warning

In [None]:
(
    df_left.join(df_right,on="id")
)

We get a warning because we didn't cast to categorical for both `DataFrames` inside a `StringCache`.

As we are not inside a `StringCache` Polars can't be sure if the left and right `DataFrames` use the same mapping from strings to integers and so does a re-mapping. This re-mapping can be expensive for large `DataFrames`
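
The need for re-mapping can be illustrated with a simplified model in plain Python. The `encode` helper below is hypothetical and assumes that codes are handed out in order of first appearance. It is a sketch of the idea, not the Polars implementation.

```python
def encode(values, mapping=None):
    """Hypothetical helper: give each string an integer code in order of first appearance."""
    mapping = {} if mapping is None else mapping
    codes = [mapping.setdefault(value, len(mapping)) for value in values]
    return codes, mapping

left_ids = ["id3", "id3", "id1", "id2"]
right_ids = ["id1", "id2", "id3"]

# Independent mappings: the same string gets a different code on each side
left_codes, left_map = encode(left_ids)
right_codes, right_map = encode(right_ids)
print(left_map)   # {'id3': 0, 'id1': 1, 'id2': 2}
print(right_map)  # {'id1': 0, 'id2': 1, 'id3': 2}

# Re-mapping: translate the right-hand codes into the left-hand mapping
right_strings = {code: value for value, code in right_map.items()}
right_remapped = [left_map[right_strings[code]] for code in right_codes]
print(right_remapped)  # [1, 2, 0]

# A shared mapping (the role played by the StringCache) makes re-mapping unnecessary
shared = {}
left_shared, _ = encode(left_ids, shared)
right_shared, _ = encode(right_ids, shared)
print(right_shared)  # [1, 2, 0]
```

With independent mappings the integer codes cannot be compared directly, so every code on one side has to be translated first. With a shared mapping the codes already agree.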

We try casting to categorical again inside a `StringCache`

In [None]:
with pl.StringCache():
    df_left = (
        pl.DataFrame(
            {
                "id":[f"id{i}" for i in integer_array],
                "values":integer_array
            }
        )
        .with_columns(
            pl.col("id").cast(pl.Categorical)
        )
    )
    
    df_right = (
        pl.DataFrame(
            {
                "id":[f"id{i}" for i in range(1,4)],
                "metadata":[i for i in range(1,4)]
            }
        )
        .with_columns(
            pl.col("id").cast(pl.Categorical)
        )
    )

We can now join the `DataFrames` in the standard way

In [None]:
(
    df_left.join(df_right,on="id")
)

We can also do the `join` or any other operations inside the `StringCache` block. 

## Fast-track joins
We can do fast-track joins on **sorted** categorical columns as these are integer columns under the hood.

**Key point**: the categorical join columns must be sorted based on their `physical` integer representation and not their `lexical` alphanumeric representation.

To illustrate this we create `df_left` and `df_right` each with a `physical` integer column

In [None]:
with pl.StringCache():
    df_left = (
        pl.DataFrame(
            {
                "id":[f"id{i}" for i in integer_array],
                "values":integer_array
            }
        )
        .with_columns(
            pl.col("id").cast(pl.Categorical)
        )
        .with_columns(
            pl.col("id").to_physical().alias("physical_left")
        )
    )
    df_right = (
        pl.DataFrame(
            {
                "id":[f"id{i}" for i in range(1,4)],
                "metadata":[i for i in range(1,4)]
            }
        )
        .with_columns(
            pl.col("id").cast(pl.Categorical)
        )
        .with_columns(
            pl.col("id").to_physical().alias("physical_right")
        )
    )

We inspect the new left and right `DataFrames` with the `physical` column

In [None]:
df_left

In [None]:
df_right

From the `physical` columns we can see that:
- `df_left` *looks* unsorted (from the alphabetic values in `id`) but is actually sorted (from the integer values in `physical_left`) while 
- `df_right` *looks* sorted but is actually unsorted!

If we inspect the `flags` we see that Polars doesn't think either is sorted

In [None]:
print(df_left["id"].flags)
print(df_right["id"].flags)

We can use `set_sorted` to tell Polars that `df_left` is sorted.

We need to sort `df_right` by `id`. Recall that by default when we sort a categorical column we sort by the `physical` integer representation.

We create new `DataFrames` here to avoid confusion if cells in this notebook are executed out of order

In [None]:
df_left_sorted = (
    df_left
    .with_columns(
        pl.col("id").set_sorted()
    )
)
df_right_sorted = (
    df_right
    .sort("id")
)
df_right_sorted

We can now join the sorted `DataFrames` and Polars will use the fast-track algorithm

In [None]:
(
    df_left_sorted.join(df_right_sorted,on="id")
)

## Are fast-track joins worthwhile?
A fast-track join may or may not speed up your overall query - you have to check the performance for your data. Factors that affect the performance include:
- size of the `DataFrames` and
- cardinality of the join column as fast-track is more worthwhile with high cardinality - see exercise 2

## Getting fast-track joins on categoricals right

To get fast-track joins right ensure that the categorical column is sorted in both the left and right `DataFrames`.


In the example above the left `DataFrame` was sorted but the right `DataFrame` was not. This will not be the case in general, so check each `DataFrame` before relying on it.

We can check if the join column is sorted by calling `is_sorted` on the `id` column as a `Series`

In [None]:
df_left["id"].is_sorted()

We check that the `id` column for the right `DataFrame` is not sorted

In [None]:
df_right["id"].is_sorted()

So in this case we can call `set_sorted` on the `id` column of the left `DataFrame` but we would have to sort the `id` column of the right `DataFrame` to ensure both `DataFrames` are sorted.

## Exercises
In the exercises you will develop your understanding of:
- joining on categorical columns
- joining on string columns
- doing fast-track joins on categorical columns

## Exercise 1
The CITES and ISO CSV files are here 

In [None]:
cites_csv_file = "../data/cites_extract.csv"
iso_csv_file = "../data/countries_extract.csv"

We want to join the ISO data for importers and exporters.

- create `DataFrames` from the CITES trade data and ISO country data in the CSVs above
- cast the join columns to categorical

Join the ISO data for both importers and exporters

Q: Could Polars do a fast-track join on `Importer` in `df_CITES` if `set_sorted` was used?

Hint: Add a `physical` column to `df_CITES`

Do a fast-track join with ISO data on the `Importer` and `Exporter` columns (combine and modify your code from the first and second parts of this exercise)

## Exercise 2

We compare the performance of sorted and unsorted joins on strings and categoricals. 

We create the left `DataFrame` with length `N` and random `id` strings in this function

In [None]:
N = 100_000
# cardinality is number of unique values
cardinality = N // 2
def createLeftDataFrame(N:int,cardinality:int):
    """
    Create the left dataframe with columns:
    id - random strings of the form idX where X is between 0 and cardinality - 1
    values - the integer X value
    physical - the physical integers underlying the categorical id column
    """
    # create the random integer array
    integer_array = np.random.randint(0,cardinality,N)
    return (
        pl.DataFrame(
            {
                "id":[f"id{i}" for i in integer_array],
                "values":integer_array
            }
        )
        .with_columns(
            pl.col("id").cast(pl.Categorical)
        )
        .with_columns(
            pl.col("id").to_physical().alias("physical")
        )
    )
df_left = createLeftDataFrame(N = N,cardinality=cardinality)
df_left.head()

We create the right `DataFrame` with metadata about each `id` in this function

In [None]:
def createRightDataFrame(N:int,cardinality:int):
    """
    Create the right dataframe with columns:
    id - the string ids covering the same range as the left dataframe
    meta - a metadata column that has the integer number from the id
    physical - the physical integers underlying the categorical id column
    """
    return (
        pl.DataFrame(
            {
                "id":[f"id{i}" for i in range(cardinality)],
                "meta":[i for i in range(cardinality)]
            }
        )
        .with_columns(
            pl.col("id").cast(pl.Categorical)
        )
        .with_columns(
            pl.col("id").to_physical().alias("physical")
        )
    )
df_right = createRightDataFrame(N = N,cardinality=cardinality)
df_right.head(3)

Create `df_left` and `df_right` inside a `StringCache`

In [None]:
N = 10_000_000
cardinality = 10
<blank>

Time how long it takes to join on unsorted categorical columns

In [None]:
%%timeit -n1 -r1 
(
    <blank>
)

Sort the categorical columns in new `DataFrames`

In [None]:
df_left_sorted = <blank>
df_right_sorted = <blank>

Time how long it takes to join on sorted categorical columns (see the discussion on the results in the solutions)

Cast the categorical columns to strings in new `DataFrames`

In [None]:
df_left_string = (df_left.<blank>)
df_right_string = (df_right.<blank>)

Time how long it takes to join on string columns

Do these comparisons again with higher cardinality e.g. `cardinality > 1000` and lower cardinality e.g. `cardinality = 10`

## Solutions

### Solution to Exercise 1

We want to join the ISO data for importers and exporters.

In a single query:
- create `DataFrames` from the CITES trade data and ISO country data in the following CSVs
- cast the relevant columns to categorical

In [None]:
cites_csv_file = "../data/cites_extract.csv"
iso_csv_file = "../data/countries_extract.csv"

In [None]:
with pl.StringCache():
    df_CITES = (
        pl.read_csv(cites_csv_file)
        .with_columns(
            pl.col("Importer").cast(pl.Categorical),
            pl.col("Exporter").cast(pl.Categorical),
        )
    )
    df_ISO = (
        pl.read_csv(iso_csv_file)
        .with_columns(
                pl.col("alpha-2").cast(pl.Categorical)
        )
    )

Join the ISO data for importers and exporters

In [None]:
with pl.StringCache():
    df_CITES = (
        pl.read_csv(cites_csv_file)
        .with_columns(
            pl.col("Importer").cast(pl.Categorical),
            pl.col("Exporter").cast(pl.Categorical),
        )
    )
    df_ISO = (
        pl.read_csv(iso_csv_file)
        .with_columns(
                pl.col("alpha-2").cast(pl.Categorical)
        )
    )
    
(
    df_CITES
        .join(df_ISO,left_on="Importer",right_on="alpha-2", how="left")
        .rename({"name":"name_importer","region":"region_importer"})
        .join(df_ISO,left_on="Exporter",right_on="alpha-2", how="left")
        .rename({"name":"name_exporter","region":"region_exporter"})
)

Q: Could Polars do a fast-track join with `df_CITES` on `Importer` if `set_sorted` was used?

Note: calling `is_sorted` on this column now raises an exception, so the call below is commented out. I'm looking into it.

In [None]:
# df_CITES["Importer"].is_sorted()

No, the column is not sorted

Do a fast-track join on the `Importer` and `Exporter` columns (copy your code from above)

See:
- the sorting on `df_ISO`
- the two sort calls on `df_CITES` in the join query

In [None]:
with pl.StringCache():
    df_CITES = (
        pl.read_csv(cites_csv_file)
        .with_columns(
            pl.col("Importer").cast(pl.Categorical),
            pl.col("Exporter").cast(pl.Categorical),
        )
    )
    df_ISO = (
        pl.read_csv(iso_csv_file)
        .with_columns(
                pl.col("alpha-2").cast(pl.Categorical)
        )
        ### Sorting on df_ISO!
        .sort("alpha-2")
    )
(
    df_CITES
        .sort("Importer")
        .join(df_ISO,left_on="Importer",right_on="alpha-2", how="left")
        .rename({"name":"name_importer","region":"region_importer"})
        .sort("Exporter")
        .join(df_ISO,left_on="Exporter",right_on="alpha-2", how="left")
        .rename({"name":"name_exporter","region":"region_exporter"})
)

### Solution to Exercise 2

Create `df_left` and `df_right` inside a `StringCache`

In [None]:
N = 10_000_000
cardinality = 10000
with pl.StringCache():
    df_left = createLeftDataFrame(N = N,cardinality=cardinality)
    df_right = createRightDataFrame(N = N,cardinality=cardinality)

Time how long it takes to join on unsorted categorical columns

In [None]:
df_left.head()

In [None]:
%%timeit -n1 -r3
(
    df_left.join(df_right,on="id")
)

Sort the categorical columns

In [None]:
df_left_sorted = df_left.sort("id")
df_right_sorted = df_right.sort("id")

Time how long it takes to join on sorted categorical columns

In [None]:
%%timeit -n1 -r3
(
    df_left_sorted.join(df_right_sorted,on="id")
)

I get a **small** speed-up with the sorted data when cardinality is high. However, the relative performance varies from run to run and between versions of Polars. Check this for your own data to see if you get a useful speed-up.

Cast the categorical columns to strings and compare how long it takes to join on string columns 

In [None]:
df_left_string = df_left.with_columns(pl.col("id").cast(pl.Utf8))
df_right_string = df_right.with_columns(pl.col("id").cast(pl.Utf8))

In [None]:
%%timeit -n1 -r1
(
    df_left_string.join(df_right_string,on="id")
)

So the string join is indeed much slower.

The `id` column has relatively high cardinality here because we set `cardinality = 10000` when calling `createLeftDataFrame` and `createRightDataFrame`.

Do these comparisons again with higher and lower cardinality.