# Concatenation
- A join merges two `DataFrames` based on matching key values.
- In comparison, concatenation refers to stacking/merging `DataFrames` together.
- We _glue_ two `DataFrames` together in a specified direction (vertical, horizontal, diagonal, aligned).

In [1]:
import polars as pl

## Vertical Concatenation
- Vertical concatenation combines two `DataFrames` vertically.
- Polars adds one `DataFrame` to the end of another, resulting in a taller `DataFrame`.

In [2]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019


In [3]:
new_paintings = pl.read_csv("museum/new_paintings.csv")
new_paintings

art_id,title,artist,year
i64,str,str,i64
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


- The `pl.concat` function concatenates two `DataFrames` together.
- The `how` parameter sets the orientation/direction of the merge.
- The default argument for `how` is `vertical`.
- `old_paintings` and `new_paintings` have the same columns, simplifying the concatenation.

In [4]:
pl.concat([old_paintings, new_paintings], how="vertical")

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#concatenating-dataframes
- https://docs.pola.rs/user-guide/transformations/concatenation/#vertical-concatenation-getting-longer
- https://docs.pola.rs/api/python/stable/reference/api/polars.concat.html

## Horizontal Concatenation
- Horizontal concatenation merges two `DataFrames` in a horizontal direction (the `DataFrame` becomes wider). 
- Polars concatenates the second `DataFrame` to the right side of the first `DataFrame`.

In [5]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
new_paintings = pl.read_csv("museum/new_paintings.csv")
paintings = pl.concat([old_paintings, new_paintings], how="vertical")
paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


In [6]:
artwork_metadata = pl.read_csv("museum/artwork_metadata.csv")
artwork_metadata

location,on_display
str,bool
"""Room A""",True
"""Room B""",True
"""Room C""",False
"""Room C""",True
"""Sculpture Hall""",True
"""Sculpture Hall""",False


- Horizontal concatenation glues the second `DataFrame`'s columns to the _right_ of the first `DataFrame`'s columns.

In [7]:
pl.concat([paintings, artwork_metadata.head(4)], how="horizontal")

art_id,title,artist,year,location,on_display
i64,str,str,i64,str,bool
101,"""Sunset Over Water""","""Clara Vu""",2020,"""Room A""",True
102,"""Abstract Thoughts""","""M. Lenz""",2019,"""Room B""",True
103,"""Golden Horizon""","""R. Miro""",2026,"""Room C""",False
104,"""Color Theory""","""N. Okada""",2026,"""Room C""",True


- If the right `DataFrame` has more rows, the column values for the left `DataFrame` will be `null`.

In [8]:
pl.concat([paintings, artwork_metadata], how="horizontal")

art_id,title,artist,year,location,on_display
i64,str,str,i64,str,bool
101,"""Sunset Over Water""","""Clara Vu""",2020,"""Room A""",True
102,"""Abstract Thoughts""","""M. Lenz""",2019,"""Room B""",True
103,"""Golden Horizon""","""R. Miro""",2026,"""Room C""",False
104,"""Color Theory""","""N. Okada""",2026,"""Room C""",True
,,,,"""Sculpture Hall""",True
,,,,"""Sculpture Hall""",False


- If the left `DataFrame` has more rows, the column values for the right `DataFrame` will be `null`.

In [9]:
pl.concat([paintings, artwork_metadata.head(2)], how="horizontal")

art_id,title,artist,year,location,on_display
i64,str,str,i64,str,bool
101,"""Sunset Over Water""","""Clara Vu""",2020,"""Room A""",True
102,"""Abstract Thoughts""","""M. Lenz""",2019,"""Room B""",True
103,"""Golden Horizon""","""R. Miro""",2026,,
104,"""Color Theory""","""N. Okada""",2026,,


### Further Reading
- https://docs.pola.rs/user-guide/transformations/concatenation/#horizontal-concatenation-getting-wider
- https://docs.pola.rs/api/python/stable/reference/api/polars.concat.html

## Diagonal Concatenation
- Diagonal concatenation merges `DataFrames` with different shapes (different columns and/or rows).
- Diagonal concatenation leads to a `DataFrame` that may be taller and/or wider.

In [10]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
new_paintings = pl.read_csv("museum/new_paintings.csv")
paintings = pl.concat([old_paintings, new_paintings], how="vertical")
paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


- Say we have a `future_paintings` dataset with an extra `category` column.
- We want to concatenate `future_paintings` to the end of the other datasets.

In [11]:
future_paintings = pl.read_csv("museum/future_paintings.csv")
future_paintings

art_id,title,artist,year,category
i64,str,str,i64,str
105,"""Reflections in Glass""","""Y. Kim""",2028,"""experimental"""
106,"""Beneath the Echo""","""T. Alvarez""",2028,"""surrealism"""


- A plain `vertical` strategy will not work because of the mismatch in columns.
- `paintings` has 4 columns while `future_paintings` has 5 columns.

In [12]:
# pl.concat([paintings, future_paintings], how="vertical")

- A `diagonal` strategy combines the `DataFrames` in both directions at once (vertical and horizontal).
- A `diagonal` strategy does _not_ join. It simply adds to both the bottom and the right.
- Polars will concatenate rows for shared columns (`art_id`, `title`, `artist`, and `year`) beneath one another.
- Polars will concatenate distinct columns (`category`) to the right of the other columns.
- Rows from `paintings` that do not have a `category` will have a `null` (missing) value.

In [13]:
pl.concat([paintings, future_paintings], how="diagonal")

art_id,title,artist,year,category
i64,str,str,i64,str
101,"""Sunset Over Water""","""Clara Vu""",2020,
102,"""Abstract Thoughts""","""M. Lenz""",2019,
103,"""Golden Horizon""","""R. Miro""",2026,
104,"""Color Theory""","""N. Okada""",2026,
105,"""Reflections in Glass""","""Y. Kim""",2028,"""experimental"""
106,"""Beneath the Echo""","""T. Alvarez""",2028,"""surrealism"""


### Further Reading
- https://docs.pola.rs/user-guide/transformations/concatenation/#diagonal-concatenation-getting-longer-wider-and-nullier
- https://docs.pola.rs/api/python/stable/reference/api/polars.concat.html

## Align Concatenation
- Align concatenation joins rows together based on shared column values.
- When there are no matches, Polars fills the missing cells with `null`.

In [14]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019


In [15]:
expanded_artwork_metadata = pl.read_csv("museum/expanded_artwork_metadata.csv")
expanded_artwork_metadata

art_id,location,on_display
i64,str,bool
101,"""Room A""",True
102,"""Room B""",True
103,"""Room C""",False
104,"""Room C""",True


- A diagonal concatenation appends the rows from `expanded_artwork_metadata` to both the bottom and right of `old_paintings`.
- Rows from `old_paintings` have `null` for the `location` and `on_display` columns.
- Rows from `expanded_artwork_metadata` have `null` for `title`, `artist`, and `year` columns.
- Shared `art_id` rows (101, 102) are not merged together.

In [16]:
pl.concat([old_paintings, expanded_artwork_metadata], how="diagonal")

art_id,title,artist,year,location,on_display
i64,str,str,i64,str,bool
101,"""Sunset Over Water""","""Clara Vu""",2020,,
102,"""Abstract Thoughts""","""M. Lenz""",2019,,
101,,,,"""Room A""",True
102,,,,"""Room B""",True
103,,,,"""Room C""",False
104,,,,"""Room C""",True


- We want to align the rows that match by `art_id`, merging their information (`title`, `artist`, `year`, `location`, `on_display`).
- For the remaining rows, we want the regular diagonal behavior (add the rows to the bottom and to the right of `old_paintings`).

In [17]:
pl.concat([old_paintings, expanded_artwork_metadata], how="align")

art_id,title,artist,year,location,on_display
i64,str,str,i64,str,bool
101,"""Sunset Over Water""","""Clara Vu""",2020,"""Room A""",True
102,"""Abstract Thoughts""","""M. Lenz""",2019,"""Room B""",True
103,,,,"""Room C""",False
104,,,,"""Room C""",True


- If there are mismatches between shared column values, Polars will concatenate rather than join the rows.

In [18]:
pl.concat(
    [
        old_paintings,
        expanded_artwork_metadata.with_columns(pl.lit("Great Painting").alias("title")),
    ],
    how="align",
)

art_id,title,artist,year,location,on_display
i64,str,str,i64,str,bool
101,"""Great Painting""",,,"""Room A""",True
101,"""Sunset Over Water""","""Clara Vu""",2020,,
102,"""Abstract Thoughts""","""M. Lenz""",2019,,
102,"""Great Painting""",,,"""Room B""",True
103,"""Great Painting""",,,"""Room C""",False
104,"""Great Painting""",,,"""Room C""",True


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/api/polars.concat.html

## Relaxed Concatenations
- Relaxed concatenation is a less strict form of concatenation that coerces columns to their supertypes.
- A supertype is a data type that can represent the values across both columns.

In [19]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019


- Let's say our `new_paintings` collection arrives with an `art_id` column of floats and a `year` column of strings.

In [20]:
new_paintings = pl.read_csv(
    "museum/new_paintings.csv",
    schema_overrides={"art_id": pl.Float32, "year": pl.String},
)
new_paintings

art_id,title,artist,year
f32,str,str,str
103.0,"""Golden Horizon""","""R. Miro""","""2026"""
104.0,"""Color Theory""","""N. Okada""","""2026"""


- Vertical concatenation with `pl.concat` fails due to a mismatch of column types.
- Polars observes that `art_id` in `new_paintings` is of type `Float32` while `art_id` in `old_paintings` is `Int64`.

In [21]:
# pl.concat([old_paintings, new_paintings], how="vertical")

- The `_relaxed` category of concatenation options coerces columns with mismatching data types into supertypes.
- For example, Polars will coerce `i64` and `f32` into `f64`, the type that supports both values.
- Similarly, Polars will coerce `i64` and `str` into `str`. A string can represent a number, a number cannot represent a string.

In [22]:
pl.concat([old_paintings, new_paintings], how="vertical_relaxed")

art_id,title,artist,year
f64,str,str,str
101.0,"""Sunset Over Water""","""Clara Vu""","""2020"""
102.0,"""Abstract Thoughts""","""M. Lenz""","""2019"""
103.0,"""Golden Horizon""","""R. Miro""","""2026"""
104.0,"""Color Theory""","""N. Okada""","""2026"""


- There is no `horizontal_relaxed` strategy. Horizontal concatenation adds new columns to the right of the first `DataFrame`, so there can be no mismatches between column types.
- There is a complementary `diagonal_relaxed` strategy. Polars will coerce columns with mismatched types into a supertype.
- In the following example, `i64` (`old_paintings.art_id`), `f32` (`new_paintings.art_id`), and `f64` (`future_paintings.art_id`) are coerced into `f64`.

In [23]:
future_paintings = pl.read_csv(
    "museum/future_paintings.csv", schema_overrides={"art_id": pl.Float64}
)
future_paintings

art_id,title,artist,year,category
f64,str,str,i64,str
105.0,"""Reflections in Glass""","""Y. Kim""",2028,"""experimental"""
106.0,"""Beneath the Echo""","""T. Alvarez""",2028,"""surrealism"""


In [24]:
pl.concat([old_paintings, new_paintings, future_paintings], how="diagonal_relaxed")

art_id,title,artist,year,category
f64,str,str,str,str
101.0,"""Sunset Over Water""","""Clara Vu""","""2020""",
102.0,"""Abstract Thoughts""","""M. Lenz""","""2019""",
103.0,"""Golden Horizon""","""R. Miro""","""2026""",
104.0,"""Color Theory""","""N. Okada""","""2026""",
105.0,"""Reflections in Glass""","""Y. Kim""","""2028""","""experimental"""
106.0,"""Beneath the Echo""","""T. Alvarez""","""2028""","""surrealism"""


### Further Reading
- https://docs.pola.rs/api/python/stable/reference/api/polars.concat.html

## Rechunking
- Polars _may_ store `DataFrame` columns in different locations in memory. We call these locations "chunks".
- Think of a chunk like a "warehouse" in the real world. It's a location in memory.
- Storing items closer together in memory makes it faster for the CPU to process all the values.
- As an analogy, it's faster for a worker to locate items if they're all stored in the same warehouse.
- **Rechunking** is the process of merging multiple chunks together so data is stored contiguously in memory.
- Rechunking requires an upfront cost (Polars must copy data) but improves performance in future operations.

### Rechunking and Concatenation
- When we concatenate two `DataFrames` with `pl.concat`, Polars does not rechunk the new data by default.
- For example, `df1.my_column` will be in one location in memory while `df2.my_column` will be in another location.
- Skipping rechunking makes concatenation faster...
- ...but column-based operations on the new `DataFrame` may be slower.
- To speed up column operations in the new `DataFrame`, we have to opt in to pay the cost of rechunking.

In [25]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019


In [26]:
new_paintings = pl.read_csv("museum/new_paintings.csv")
new_paintings

art_id,title,artist,year
i64,str,str,i64
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


- The `n_chunks` method returns the number of memory chunks that Polars is using to store the data.
- Pass `"all"` to see the number of chunks for all columns.
- The output may be different depending on Polars versions, operating system, hardware, etc.
- `[2, 2, 2, 2]` means each of the 4 columns is split into 2 chunks.
- In this case, Polars is using a total of 8 memory chunks for the `DataFrame`.

In [27]:
old_paintings.n_chunks("all")

[2, 2, 2, 2]

In [28]:
new_paintings.n_chunks("all")

[1, 1, 1, 1]

- By default, Polars will not rechunk data when concatenating.
- Pass the `rechunk` parameter a value of `True` to force the rechunking.
- Polars will move the data for a single column to be contiguous in memory.

In [29]:
pl.concat([old_paintings, new_paintings], how="vertical").n_chunks("all")
pl.concat([old_paintings, new_paintings], how="vertical", rechunk=False).n_chunks("all")

[3, 3, 3, 3]

In [30]:
pl.concat([old_paintings, new_paintings], how="vertical", rechunk=True).n_chunks("all")

[1, 1, 1, 1]

- If the concatenation is already complete, the `DataFrame` supports a `rechunk` method.

In [31]:
all_paintings = pl.concat([old_paintings, new_paintings])

In [32]:
all_paintings = all_paintings.rechunk()

In [33]:
all_paintings.n_chunks("all")

[1, 1, 1, 1]

### Further Reading
- https://docs.pola.rs/user-guide/transformations/concatenation/#rechunking
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.n_chunks.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rechunk.html

## The vstack Method
- The `concat` function accepts a list of all `DataFrames` to concatenate.
- We may not have the complete list of `DataFrames` available all at once.
- For example, we may iterate over a directory of files in a `for` loop and open CSV files one by one.
- The `vstack` and `extend` methods are two alternative options for concatenation.
- `vstack` and `extend` are both methods on an existing `DataFrame`. Each call operates on exactly 2 `DataFrames`.

In [34]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
new_paintings = pl.read_csv("museum/new_paintings.csv")
pl.concat([old_paintings, new_paintings])
old_paintings.vstack(new_paintings)

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


In [35]:
df = pl.DataFrame(
    schema={
        "art_id": pl.Int64,
        "title": pl.String,
        "artist": pl.String,
        "year": pl.Int64,
    }
)
df

art_id,title,artist,year
i64,str,str,i64


- Rechunking is beneficial for faster queries but is a costly operation. Polars has to copy data around memory to a smaller number of chunks.
- When performing multiple concatenations, we thus do not want to rechunk on _every concatenation_.
- Rather, we want to concatenate all the data (even if it contains many chunks), then rechunk _once_ before we start our query operations.
- The `vstack` method does not rechunk by default. It just adds the chunks of the `DataFrames` together. Think of it like storing items across two warehouses.
- The `vstack` method returns a new `DataFrame`. Neither `DataFrame` is mutated.

In [36]:
for i in range(100):
    df = df.vstack(pl.read_csv("museum/old_paintings.csv"))

- The advantage of `vstack` for concatenation is that Polars does _not_ rechunk.
- The concatenation process is much faster _but_ subsequent queries will be slower.
- On my system, Polars stores each column's contents in 200 chunks of memory.
- We concatenated 100 `DataFrames`, each occupying 2 chunks of memory.

In [37]:
df.n_chunks("all")

[200, 200, 200, 200]

- Let's try filtering this 200-row `DataFrame`.
- The `%%timeit` magic runs a cell multiple times to calculate the average runtime.
- The `μs` unit is a microsecond (one millionth of a second).

In [38]:
%%timeit

df.filter(pl.col("art_id") == 102)

550 μs ± 28.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


- But that's OK. We sped up the concatenation by not rechunking on each loop.
- Now that we have the concatenated `DataFrame`, we can perform a single `rechunk`.
- Once the `DataFrame` is rechunked, the filter operation is roughly 8x faster (550 μs vs. 68.6 μs).
- These are small `DataFrames` but consider the impact on much larger data sets.

In [39]:
rechunked = df.rechunk()

In [40]:
%%timeit

rechunked.filter(pl.col("art_id") == 102)

68.6 μs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [41]:
70 / 596

0.1174496644295302

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.vstack.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rechunk.html

## The extend Method
- The `extend` method concatenates one `DataFrame` to the end of another. It copies the data from the second `DataFrame`.
- The `extend` method is a rare example of a Polars method that mutates the existing `DataFrame` rather than returning a new one.
- The `extend` method returns the `DataFrame` for convenience, but Polars still mutates the `DataFrame` that the method is invoked upon.
- The `extend` method may rechunk when it is invoked. That's why it's not ideal in a loop.
- The `extend` method is ideal when you want to concatenate _once_ and are not concerned about rechunking.

In [42]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
new_paintings = pl.read_csv("museum/new_paintings.csv")
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019


In [43]:
old_paintings.extend(new_paintings)

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


In [44]:
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019
103,"""Golden Horizon""","""R. Miro""",2026
104,"""Color Theory""","""N. Okada""",2026


In [45]:
df = pl.DataFrame(
    schema={
        "art_id": pl.Int64,
        "title": pl.String,
        "artist": pl.String,
        "year": pl.Int64,
    }
)
df

art_id,title,artist,year
i64,str,str,i64


- With the `extend` method, Polars _may_ rechunk on any call.
- The movement of data around memory slows down the speed of the `extend` operation.
- This concatenation may thus be slower than `vstack`.

In [46]:
for i in range(100):
    df.extend(pl.read_csv("museum/old_paintings.csv"))

- And the result may not even be fully rechunked at the end!

In [47]:
df.n_chunks("all")

[1, 200, 200, 1]

In [48]:
%%timeit

df.filter(pl.col("title") == "Abstract Thoughts")

194 μs ± 5.84 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [49]:
rechunked = df.rechunk()

In [50]:
%%timeit

rechunked.filter(pl.col("title") == "Abstract Thoughts")

69.5 μs ± 1.97 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Polars Docs:
- Prefer `vstack` over `extend` when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single `DataFrame`. In the latter case, finish the sequence of `vstack` operations with a `rechunk`.
- Prefer `extend` over `vstack` when you want to do a query after a single append. For instance, during online operations where you add `n` rows and rerun a query.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.extend.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.vstack.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rechunk.html

## The hstack Method
- The `hstack` method is the equivalent of `vstack` but in the horizontal direction.
- The `hstack` method concatenates the second `DataFrame` to the right of the first `DataFrame`.
- The `hstack` method requires the same height for both `DataFrames`.

In [51]:
old_paintings = pl.read_csv("museum/old_paintings.csv")
old_paintings

art_id,title,artist,year
i64,str,str,i64
101,"""Sunset Over Water""","""Clara Vu""",2020
102,"""Abstract Thoughts""","""M. Lenz""",2019


In [52]:
artwork_metadata = pl.read_csv("museum/artwork_metadata.csv").head(2)
artwork_metadata

location,on_display
str,bool
"""Room A""",True
"""Room B""",True


In [53]:
old_paintings.hstack(artwork_metadata).n_chunks("all")

[2, 2, 2, 2, 1, 1]

In [54]:
old_paintings.hstack(artwork_metadata).rechunk().n_chunks("all")

[1, 1, 1, 1, 1, 1]

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.hstack.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rechunk.html