## Concatenation
By the end of this lecture you will be able to:
- vertically concatenate a list of `DataFrames`
- horizontally concatenate a list of `DataFrames`
- diagonally concatenate a list of `DataFrames`


In [1]:
import polars as pl

We create a first `DataFrame` with fake trade records from 2020.

In [2]:
df2020 = pl.DataFrame(
    [
        {"year":2020,"exporter":"India","importer":"USA","quantity":0},
        {"year":2020,"exporter":"India","importer":"USA","quantity":1},
    ]
)
df2020

year,exporter,importer,quantity
i64,str,str,i64
2020,"""India""","""USA""",0
2020,"""India""","""USA""",1


We now create a second `DataFrame` with fake trade records from 2021.

In [3]:
df2021 = pl.DataFrame(
    [
        {"year":2021,"exporter":"India","importer":"USA","quantity":2},
        {"year":2021,"exporter":"India","importer":"USA","quantity":3},
    ]
)
df2021

year,exporter,importer,quantity
i64,str,str,i64
2021,"""India""","""USA""",2
2021,"""India""","""USA""",3


## Vertical concatenation

We combine the 2020 and 2021 `DataFrames` into a single `DataFrame` with `pl.concat`

In [4]:
dfVertical = (
    pl.concat(
        [df2020,df2021]
    )
)
dfVertical

year,exporter,importer,quantity
i64,str,str,i64
2020,"""India""","""USA""",0
2020,"""India""","""USA""",1
2021,"""India""","""USA""",2
2021,"""India""","""USA""",3


Vertical concatenation fails when:
- the `DataFrames` do not have the same column names, or
- a column has a different dtype in each `DataFrame`.

## Horizontal concatenation
We create another `DataFrame` that has more details about each of the trades in 2020

In [5]:
df2020Details = pl.DataFrame(
    [
        {"item":"Clothes","value":10},
        {"item":"Machinery","value":100},
    ]
)
df2020Details

item,value
str,i64
"""Clothes""",10
"""Machinery""",100


We combine these details with the original records using a horizontal concatenation.

In [6]:
dfHorizontal = pl.concat(
    [
        df2020,df2020Details
    ]
    ,how="horizontal"
)
dfHorizontal

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2020,"""India""","""USA""",0,"""Clothes""",10
2020,"""India""","""USA""",1,"""Machinery""",100


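Horizontal concatenation fails when:
- the `DataFrames` have overlapping column names, or
- the `DataFrames` have a different number of rows (more recent Polars releases may instead pad the shorter `DataFrame` with `null` values).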

## Diagonal concatenation

We are now looking at new fake trade records for 2020 and 2021 between China and the USA.

In 2020 the schema of the trade records is the same as we saw above with:
- `year`
- `exporter`
- `importer` and
- `quantity`

In 2021 the schema changed and also includes:
- `item` and 
- `value`

In [7]:
df2020 = pl.DataFrame(
    [
        {"year":2020,"exporter":"China","importer":"USA","quantity":0},
        {"year":2020,"exporter":"China","importer":"USA","quantity":1},
    ]
)
df2021 = pl.DataFrame(
    [
        {"year":2021,"exporter":"China","importer":"USA","quantity":2,"item":"Clothes","value":10},
        {"year":2021,"exporter":"China","importer":"USA","quantity":3,"item":"Machinery","value":100},
    ]
)


We want to combine these records into a single `DataFrame`. As the column names are not the same we cannot do a vertical concatenation.

Instead we do a diagonal concatenation.

In [8]:
dfTrades2020 = pl.DataFrame(
    [
        {"year":2020,"exporter":"China","importer":"USA"},
        {"year":2020,"exporter":"China","importer":"USA"},
    ]
)
# New 2021 schema adds a value column
dfTrades2021 = pl.DataFrame(
    [
        {"year":2021,"exporter":"China","importer":"USA","value":10},
        {"year":2021,"exporter":"China","importer":"USA","value":100},
        {"year":2021,"exporter":"China","importer":"USA","value":1000},
    ]
)
pl.concat([dfTrades2020,dfTrades2021],how="diagonal")

year,exporter,importer,value
i64,str,str,i64
2020,"""China""","""USA""",
2020,"""China""","""USA""",
2021,"""China""","""USA""",10.0
2021,"""China""","""USA""",100.0
2021,"""China""","""USA""",1000.0


A diagonal concatenation stacks the rows vertically for the columns that appear in every `DataFrame`. Where a `DataFrame` lacks a column, its rows are filled with `null` values in that column.

## Exercises

## Exercise 1: Horizontal concatenation
We split the Titanic dataset into `dfLeft` and `dfRight`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
dfLeft = df.select(df.columns[:6])
dfRight = df.select(df.columns[5:])

Do a horizontal concatenation of `dfLeft` and `dfRight`

In [None]:
df = pl.concat(<blank>)

## Exercise 2

You are given the following data from the sales of a bike shop. 

In [None]:
sales2020 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales2021 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]

Combine the full set of data into a single `DataFrame`

In [None]:
<blank>

Combine only the columns that appear in both years into a single `DataFrame`

## Exercise 3
In the lecture on quantiles in the Statistics section we learned how to calculate quantiles.

In this exercise we will combine multiple quantiles into a single `DataFrame`.

As a reminder, this is how we calculate a single quantile on the floating point columns

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
q = 0.25
(
    df
    .select(
            pl.col(pl.Float64).quantile(q)
        )
)

We want to produce a `DataFrame` that has:
- the 0.25, 0.5 and 0.75 quantiles of the floating point columns on separate rows
- a column called `percentiles` to show the quantile for each row

Create this `DataFrame` using vertical concatenation.

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
<blank>

Repeat this operation but this time on each iteration add a column called `percentiles` that captures the quantile on that iteration.

Concatenate the outputs

## Solutions

## Solution to Exercise 1

Do a horizontal concatenation of `dfLeft` and `dfRight`. The `Age` column appears in both, so we drop it from `dfRight` first.

In [None]:
df = pl.concat([dfLeft,dfRight.drop("Age")],how="horizontal")

## Solution to Exercise 2

In [None]:
sales2020 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales2021 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]
dfSales2020 = pl.DataFrame(sales2020)
dfSales2021 = pl.DataFrame(sales2021)

Combine the full set of data into a single `DataFrame`

In [None]:
pl.concat([dfSales2020,dfSales2021],how="diagonal")

Combine the data with overlapping columns into a single `DataFrame`

In [None]:
pl.concat(
    [dfSales2020,dfSales2021.select(["make","model","quantity"])
    ])

## Solution to Exercise 3

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
    )


Repeat this operation but this time on each iteration add a column called `percentiles` that captures the quantile on that iteration.

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
        # with_columns replaces the older with_column method
        .with_columns(
            pl.lit(q).alias("percentiles")
        )
    )


Concatenate the outputs

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
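for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
        # with_columns replaces the older with_column method
        .with_columns(
            pl.lit(q).alias("percentiles")
        )
    )
# Each element has the same schema, so a vertical concatenation works
pl.concat(dfList)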
