## Concatenation
By the end of this lecture you will be able to:
- vertically concatenate `DataFrames`
- handle inconsistent dtypes in a vertical concat
- horizontally concatenate `DataFrames`
- diagonally concatenate `DataFrames`


In [None]:
import polars as pl

We create a first `DataFrame` with fake trade records from 2000

In [None]:
df_2000 = pl.DataFrame(
    [
        {"year":2000,"exporter":"India","importer":"USA","quantity":0},
        {"year":2000,"exporter":"India","importer":"USA","quantity":1},
    ]
)
df_2000

We now create a second `DataFrame` with trade records from 2001

In [None]:
df_2001 = pl.DataFrame(
    [
        {"year":2001,"exporter":"India","importer":"USA","quantity":2},
        {"year":2001,"exporter":"India","importer":"USA","quantity":3},
    ]
)
df_2001

## Combining `DataFrames` vertically
If we have data in two different `DataFrames` that we combine as a new `DataFrame` we can manage the data in memory in three different ways:
- keeping the data in the original two locations in memory and referencing the new `DataFrame` to these original locations
- copying the data to a single location in memory and referencing the new `DataFrame` to this single location
- appending the data from the second `DataFrame` to the location of the first `DataFrame`

We cover three methods for vertically combining `DataFrames`: `df.vstack`, `df.extend` and `pl.concat`. The output of each method is the same from a user perspective but differs in terms of where the data sits in memory under the hood

### `vstack`

We combine the 2000 and 2001 `DataFrames` into a single `DataFrame` with the `vstack` method

In [None]:
(
    df_2000
    .vstack(
        df_2001
    )
)

A `vstack`:
- keeps the data from both `DataFrames` in their original locations in memory and points the new `DataFrame` to those locations

### Rechunk
A `vstack` is computationally very cheap (as no data is copied). However, subsequent operations (e.g. `group_by`) are slower than they would be if the data had been *rechunked* (i.e. copied from the original two chunks to a new single location in memory).


We can manually cause two `DataFrames` linked by `vstack` to be copied to a single location in memory with `rechunk`

In [None]:
(
    df_2000
    .vstack(
        df_2001
    )
    .rechunk()
)

We see below that the `pl.concat` function is a way of applying `vstack` and `rechunk` to a list of `DataFrames`.

### Extend
We can append one `DataFrame` to another with `extend`

In [None]:
(
    df_2000
    .extend(
        df_2001
    )
)

An `extend`:
- copies the data from the second `DataFrame` (`df_2001`) and appends it to the data of the first `DataFrame` (`df_2000`)
- modifies the first `DataFrame` (`df_2000`) *in-place*

We can see that `df_2000` has been modified in-place as it now has both years of data

In [None]:
df_2000

Before continuing we re-assign `df_2000` back to its original value to reduce confusion if cells are executed out-of-order!

In [None]:
df_2000 = pl.DataFrame(
    [
        {"year":2000,"exporter":"India","importer":"USA","quantity":0},
        {"year":2000,"exporter":"India","importer":"USA","quantity":1},
    ]
)
df_2000

### Use case of `vstack`, `rechunk` and `extend`
- If you are combining `DataFrames` to do more transformations/groupbys/joins etc it is normally best to use `vstack` and `rechunk` so that all the data is together in memory. In practice it is simpler to use `pl.concat` to do this as we see below
- If you want to combine two `DataFrames` but do not want to do more operations on them (e.g. you just want to check their length or perhaps write to a file) you should use `vstack`
- If you want to add a small `DataFrame` to a large `DataFrame` use `extend` as it only copies the data in the small `DataFrame`

### Vertically concatenating `DataFrames`

Above we saw how to vertically combine two `DataFrames`. More generally, we can combine a `list` of `DataFrames` with `pl.concat`. For clarity, we set the `how="vertical"` argument explicitly this time although it is the default argument

In [None]:
(
    pl.concat(
        [df_2000,df_2001],
        how="vertical"
    )
)

When we do `pl.concat` Polars:
- does a series of `vstacks` to combine the list of `DataFrames`
- then does a `rechunk` to gather all the data together in memory

We can stop Polars from doing the `rechunk` by passing the `rechunk=False` argument

In [None]:
df_vertical = (
    pl.concat(
        [
            df_2000,
            df_2001
        ],
        rechunk=False
    )
)
df_vertical

### Handling different dtypes in vertical concatenation

Polars expects the column names and dtypes to match when doing vertical concatenation.

To illustrate some approaches for handling differences in types we create an alternative `df_2001` where the `quantity` column is 64-bit float instead of 64-bit integer

In [None]:
df_2001_float = (
    df_2001
    .with_columns(
        pl.col("quantity").cast(pl.Float64)
    )
)
df_2001_float

When the dtypes do not match we may have to manage this by doing an explicit `cast` of the column types.

In this example we cast the `quantity` column back to `pl.Int64`

In [None]:
(
    pl.concat(
        [
            df_2000,
            df_2001_float.with_columns(
                pl.col("quantity").cast(pl.Int64)
            )
        ],
    )
)

However, Polars also has a way of managing certain differences by casting to a "supertype". For example, the supertype of `pl.Float64` and `pl.Int64` is `pl.Float64`.

We can do a vertical concatenation using supertypes where necessary by specifying the `how` method as `vertical_relaxed` instead of `vertical`

In [None]:
(
    pl.concat(
        [
            df_2000,
            df_2001_float
        ],
        how="vertical_relaxed"
    )
)


## Horizontal concatenation
We can horizontally concatenate `DataFrames` that have:
- the same number of rows and
- different column names

For horizontal concatenation we create another `DataFrame` that has more details about each of the trades in 2000

In [None]:
df_2000_details = pl.DataFrame(
    [
        {"item":"Clothes","value":10},
        {"item":"Machinery","value":100},
    ]
)
df_2000_details

### `hstack`

We can combine two `DataFrames` horizontally with `hstack`

In [None]:
(
    df_2000
    .hstack(
        df_2000_details
    )
)

This operation is *not* in-place unless we pass `in_place=True`.

We can also pass a `list` of `Series` inside `hstack`

In [None]:
(
    df_2000
    .hstack(
        [
            df_2000_details["item"],
            df_2000_details["value"]
        ]
    )
)


### Horizontal concatenation
We can also use `pl.concat` for horizontal concatenation

In [None]:
(
    pl.concat(
        [
            df_2000,
            df_2000_details
        ]
        ,
        how="horizontal"
    )
)

## Diagonal concatenation

We are now looking at new trade records for 2000 and 2001 between China and the USA.

In 2000 the schema of the trade records is the same as we saw above with:
- `year`
- `exporter`
- `importer` and
- `quantity`

However, in 2001 the schema also includes:
- `item` and 
- `value`

In [None]:
df_2000 = pl.DataFrame(
    [
        {"year":2000,"exporter":"China","importer":"USA","quantity":0},
        {"year":2000,"exporter":"China","importer":"USA","quantity":1},
    ]
)
df_2001 = pl.DataFrame(
    [
        {"year":2001,"exporter":"China","importer":"USA","quantity":2,"item":"Clothes","value":10},
        {"year":2001,"exporter":"China","importer":"USA","quantity":3,"item":"Machinery","value":100},
    ]
)

We want to combine these records into a single `DataFrame`. As the column names are not the same we cannot do a vertical concatenation.

Instead we can do a diagonal concatenation

In [None]:
(
    pl.concat(
        [
            df_2000,
            df_2001
        ],
        how="diagonal"
    )
)

This diagonal concatenation is a vertical concatenation for the column names that match but with `null` values where a column is not present in one of the `DataFrames`.

We can also do `diagonal_relaxed` where we use supertypes for columns where necessary and possible

Diagonal concatenation can be a quick way to work with multiple CSVs or other files where:
- the columns are not the same in all files
- the order of the columns is not the same in all files


## Exercises

### Exercise 1


You are given the following data from the sales of a bike shop. 

In [None]:
sales_2000 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales_2000

In [None]:
sales_2001 = [
    {"make":"Giant","model":"Roam","quantity":100.0},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales_2001

Combine the 2000 and 2001 data into a single `DataFrame`

Now add a third year of data to the `DataFrame`

In [None]:
sales_2002 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]
sales_2002

### Exercise 2
In the lecture on quantiles in the Statistics section we learned how to calculate quantiles.

In this exercise we will combine multiple quantiles into a single `DataFrame`.

As a reminder, this is how we calculate a single quantile on the floating point columns

In [None]:
csv_file = "../data/titanic.csv"
df = pl.read_csv(csv_file)
q = 0.25
(
    df
    .select(
        pl.col(pl.Float64).quantile(q)
        )
)

We want to produce a `DataFrame` that has:
- the 0.25, 0.5 and 0.75 quantiles of the floating point columns on separate rows
- a column called `percentiles` to show the quantile for each row

Create this `DataFrame` using vertical concatenation.

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [None]:
csv_file = "../data/titanic.csv"
df = pl.read_csv(csv_file)
quantiles = [0.25,0.5,0.75]
dfList = []
<blank>

Repeat this operation but this time on each iteration add a column called `percentiles` that captures the quantile on that iteration.

Concatenate the outputs

## Solutions


### Solution to Exercise 1

In [None]:
sales_2000 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales_2001 = [
    {"make":"Giant","model":"Roam","quantity":100.0},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]

sales_2000_df = pl.DataFrame(sales_2000)
sales_2001_df = pl.DataFrame(sales_2001)

Combine the full set of data into a single `DataFrame`

In [None]:
(
    pl.concat(
        [
            sales_2000_df,
            sales_2001_df
        ],
        how="vertical_relaxed"
    )
)

In [None]:
sales_2002 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]
sales_2002_df = pl.DataFrame(sales_2002)
sales_2002_df

In [None]:
(
    pl.concat(
        [
            sales_2000_df,
            sales_2001_df,
            sales_2002_df
        ],
        how="diagonal_relaxed"
    )
)

### Solution to Exercise 2

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [None]:
csv_file = "../data/titanic.csv"
df = pl.read_csv(csv_file)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
    )

Repeat this operation but this time on each iteration add a column called `percentiles` that captures the quantile on that iteration.

In [None]:
csv_file = "../data/titanic.csv"
df = pl.read_csv(csv_file)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
        .with_columns(
            pl.lit(q).alias("percentiles")
        )
    )

Concatenate the outputs

In [None]:
csv_file = "../data/titanic.csv"
df = pl.read_csv(csv_file)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
        .with_columns(
            pl.lit(q).alias("percentiles")
        )
    )
pl.concat(dfList)