# Concatenating Tables with Set-Like Operations

One of the two ways of combining two tables is to stack one table on top of the other. When stacking tables, we need to decide:

1. Whether to combine columns by position or by name (and, if by name, what to do with mismatches).
2. Which rows to keep. Here we take some guidance from the SQL set operators.

## Three Types of Operations

* **Union:** Keeps rows from either table.
* **Intersection:** Keeps only rows that appear in both tables.
* **Set Difference/Except:** Keeps rows from the left table *except* those in the right table.
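
The three operations can be sketched in plain Python by treating each row as a tuple and each table as a set of rows. The two tiny tables below are made up for illustration and use only two columns; this shows the (distinct) set semantics, not how any particular library implements them.

```python
# Each table is a list of rows; each row is a (Salesperson, Compact) tuple.
may = [("Ann", 22), ("Bob", 20)]
apr = [("Ann", 22), ("Bob", 19)]

# Union: rows found in either table (duplicates collapse in a set).
union = set(may) | set(apr)

# Intersection: rows found in both tables.
intersection = set(may) & set(apr)

# Set difference / except: rows in the left table that are not in the right.
difference = set(may) - set(apr)

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```

Only Ann's row is identical in both tables, so it is the only row in the intersection and the only row missing from the difference.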

## Set Operations in Action 

<img src="./img/table_verbs_set.gif" width=800>

## All Operations Match by Position

All operations:

* Match columns by position
* Require the same number and types of columns

## Distinct Versus All

* **UNION/INTERSECT/SET DIFFERENCE** are **DISTINCT**
    * Only keeps distinct rows, removing duplicates.
* **UNION ALL/INTERSECT ALL/SET DIFFERENCE ALL**
    * Keeps duplicate rows
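
To make the difference concrete, the `ALL` variants can be modeled as multiset (bag) operations. The sketch below uses `collections.Counter` from the standard library on made-up single-value "rows". It assumes the usual SQL semantics, where `INTERSECT ALL` keeps the smaller of the two counts and the `ALL` form of set difference subtracts counts.

```python
from collections import Counter

# Single-value "rows" with duplicates, purely for illustration.
left = ["x", "x", "x", "y"]
right = ["x", "x", "z"]

# UNION ALL: plain concatenation, every copy is kept.
union_all = left + right

# INTERSECT ALL: each row appears min(count_left, count_right) times.
intersect_all = list((Counter(left) & Counter(right)).elements())

# SET DIFFERENCE ALL: each row appears count_left - count_right times (never below 0).
difference_all = list((Counter(left) - Counter(right)).elements())

# DISTINCT versions: the same operations, with duplicates removed.
union_distinct = set(left) | set(right)
intersect_distinct = set(left) & set(right)
difference_distinct = set(left) - set(right)
```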

In [5]:
import polars as pl

In [3]:
sales_may = pl.read_csv('./data/auto_sales_may.csv')
sales_may

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
0,"""Ann""",22,18,15,12
1,"""Bob""",20,14,6,24
2,"""Yolanda""",19,10,28,17
3,"""Xerxes""",11,27,17,9


In [5]:
sales_apr = pl.read_csv('./data/auto_sales_apr.csv')
sales_apr

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
0,"""Ann""",22,18,15,12
1,"""Bob""",19,12,17,20
2,"""Yolanda""",19,8,32,15
3,"""Xerxes""",12,23,18,9


## Unions with `polars`

* Use `vstack` to perform a union on 2 tables.
* Use `pl.concat` to perform a union of 3+ tables.

In [6]:
sales_may.vstack(sales_apr)

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
0,"""Ann""",22,18,15,12
1,"""Bob""",20,14,6,24
2,"""Yolanda""",19,10,28,17
3,"""Xerxes""",11,27,17,9
0,"""Ann""",22,18,15,12
1,"""Bob""",19,12,17,20
2,"""Yolanda""",19,8,32,15
3,"""Xerxes""",12,23,18,9


## `df.vstack` is NOT distinct

You need to perform a `unique` after the union to get distinct rows.

In [7]:
sales_may.vstack(sales_apr).unique()

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
0,"""Ann""",22,18,15,12
1,"""Bob""",20,14,6,24
2,"""Yolanda""",19,10,28,17
3,"""Xerxes""",11,27,17,9
1,"""Bob""",19,12,17,20
2,"""Yolanda""",19,8,32,15
3,"""Xerxes""",12,23,18,9


In [10]:
sales_may.vstack(sales_apr).unique(keep='last')

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck
i64,str,i64,i64,i64,i64
1,"""Bob""",20,14,6,24
2,"""Yolanda""",19,10,28,17
3,"""Xerxes""",11,27,17,9
0,"""Ann""",22,18,15,12
1,"""Bob""",19,12,17,20
2,"""Yolanda""",19,8,32,15
3,"""Xerxes""",12,23,18,9


## Columns are stacked by column location/order!

In [13]:
df1 = pl.DataFrame({l:[l + str(i) for i in range(3)] for l in ['a', 'b']})
df1

a,b
str,str
"""a0""","""b0"""
"""a1""","""b1"""
"""a2""","""b2"""


In [14]:
df2 = pl.DataFrame({l:[l + str(i) for i in range(3)] for l in ['b', 'a']})
df2

b,a
str,str
"""b0""","""a0"""
"""b1""","""a1"""
"""b2""","""a2"""


In [15]:
df1.vstack(df2)

SchemaError: cannot vstack: because column names in the two DataFrames do not match for left.name='a' != right.name='b'

In [16]:
df1.vstack(df2.select(['a', 'b']))

a,b
str,str
"""a0""","""b0"""
"""a1""","""b1"""
"""a2""","""b2"""
"""a0""","""b0"""
"""a1""","""b1"""
"""a2""","""b2"""


## Adding a month column

Another way to keep both of Ann's sales rows is adding a month column (which we should probably do anyway).

In [19]:
pl.Config.with_columns_kwargs = True

# pl.lit marks 'May' and 'April' as literal values rather than column names
(sales_may
 .with_columns(month = pl.lit('May'))
 .vstack(sales_apr
         .with_columns(month = pl.lit('April'))
        )
)

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck,month
i64,str,i64,i64,i64,i64,str
0,"""Ann""",22,18,15,12,"""May"""
1,"""Bob""",20,14,6,24,"""May"""
2,"""Yolanda""",19,10,28,17,"""May"""
3,"""Xerxes""",11,27,17,9,"""May"""
0,"""Ann""",22,18,15,12,"""April"""
1,"""Bob""",19,12,17,20,"""April"""
2,"""Yolanda""",19,8,32,15,"""April"""
3,"""Xerxes""",12,23,18,9,"""April"""


## No `INTERSECT` or `DIFFERENCE` in `polars`

As of Fall 2022, `polars` lacks both of these set operations.

## Combining multiple files using concatenate

The function `pl.concat` stacks any number of data frames (for example, one per file) that share the same columns.

In [23]:
df1 = pl.DataFrame({"a": [1], "b": [3]})
df2 = pl.DataFrame({"a": [2], "b": [4]})
pl.concat([df1, df2])

a,b
i64,i64
1,3
2,4


In [29]:
dfs = [pl.DataFrame({"a": [i + j], "b": [i + j +1]}) for i in range(2) for j in range(i, i + 2)]
dfs

[shape: (1, 2)
 ┌─────┬─────┐
 │ a   ┆ b   │
 │ --- ┆ --- │
 │ i64 ┆ i64 │
 ╞═════╪═════╡
 │ 0   ┆ 1   │
 └─────┴─────┘,
 shape: (1, 2)
 ┌─────┬─────┐
 │ a   ┆ b   │
 │ --- ┆ --- │
 │ i64 ┆ i64 │
 ╞═════╪═════╡
 │ 1   ┆ 2   │
 └─────┴─────┘,
 shape: (1, 2)
 ┌─────┬─────┐
 │ a   ┆ b   │
 │ --- ┆ --- │
 │ i64 ┆ i64 │
 ╞═════╪═════╡
 │ 2   ┆ 3   │
 └─────┴─────┘,
 shape: (1, 2)
 ┌─────┬─────┐
 │ a   ┆ b   │
 │ --- ┆ --- │
 │ i64 ┆ i64 │
 ╞═════╪═════╡
 │ 3   ┆ 4   │
 └─────┴─────┘]

In [30]:
pl.concat(dfs)

a,b
i64,i64
0,1
1,2
2,3
3,4


## <font color="red"> Exercise 2.8.1</font>

In the data folder, you will find 6 files that contain samples of 100,000 rows from the Uber data for the months apr14-sep14.  Perform the following tasks:

1. Read the April and May data frames.
2. Add a month column to each data frame.
3. Use `df.vstack` to combine these data frames into one combined `df`.
4. Use `pl.concat` to combine these data frames into one combined `df`.

In [46]:
# Your code here