# Concatenating Tables with Set-Like Operations

One of the two ways of combining two tables is to stack one on top of the other.  When stacking two tables, we need to decide

1. Whether to combine columns by position or by name (and, if combining by name, what to do with mismatches).
2. Which rows to keep.  For this, we will take some guidance from SQL clauses.

In [1]:
import pandas as pd
from dfply import *

## Three Types of Operations

* **Union:** Keeps rows from either table.
* **Intersection:** Keeps only rows common to both tables.
* **Set Difference/Except:** Keeps rows from the left table *except* those in the right table.
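The three operations behave like the operators on Python's built-in `set`. The sketch below treats each row as a tuple; the two-column rows are made-up values for illustration, not the full sales tables.

```python
# Each row is a tuple so that it can be stored in a Python set.
# These (salesperson, compact) pairs are illustrative values only.
may = {("Ann", 22), ("Bob", 19), ("Yolanda", 19)}
apr = {("Ann", 22), ("Bob", 20), ("Yolanda", 19)}

union = may | apr          # rows found in either table
intersection = may & apr   # rows found in both tables
difference = may - apr     # rows in may that are not in apr

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```

Because a `set` cannot hold duplicates, this corresponds to the distinct versions of the operations.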

## Set Operations in Action 

<img src="./img/table_verbs_set.gif" width=800>

## All Operations Match by Position

All operations

* Match columns by position
* Require same number/type of columns
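Positional matching is how the SQL operations work, but `pd.concat` aligns columns by name instead. The following sketch, using two small made-up tables, contrasts the two behaviors; overwriting the column labels is one possible way to imitate positional matching in `pandas`, not the only one.

```python
import pandas as pd

left = pd.DataFrame({"Salesperson": ["Ann"], "Compact": [22], "Sedan": [18]})
# Same kind of data, but the last two columns are in a different order.
right = pd.DataFrame({"Salesperson": ["Bob"], "Sedan": [12], "Compact": [19]})

# pd.concat aligns columns by name, so Bob's values stay under their labels.
by_name = pd.concat([left, right], ignore_index=True)

# To imitate positional matching, overwrite the right table's labels first.
right_positional = right.copy()
right_positional.columns = left.columns
by_position = pd.concat([left, right_positional], ignore_index=True)

print(by_name)
print(by_position)
```

With positional matching Bob's sedan count ends up in the `Compact` column, which is why the column order must agree before stacking by position.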

## Distinct Versus All

* **UNION/INTERSECT/SET DIFFERENCE** are **DISTINCT**
    * Keep only distinct rows, removing duplicates.
* **UNION ALL/INTERSECT ALL/SET DIFFERENCE ALL**
    * Keep duplicate rows.

**Note:** `pyspark` also includes `unionByName`, which matches columns by name and doesn't require them to be in the same order.
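The distinct/all distinction can be sketched in plain `pandas`: stacking with `pd.concat` keeps every row, and `drop_duplicates` then removes repeated rows. The two small tables are made-up slices of the sales data.

```python
import pandas as pd

may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 19]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Compact": [22, 20]})

# UNION ALL: keep every row, including Ann's repeated row.
union_all_rows = pd.concat([may, apr], ignore_index=True)

# UNION (distinct): drop rows that repeat an earlier row exactly.
union_distinct_rows = union_all_rows.drop_duplicates().reset_index(drop=True)

print(len(union_all_rows), len(union_distinct_rows))
```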

## Example - Auto Sales in `pandas`

In [2]:
sales_may = pd.read_csv('./data/auto_sales_may.csv')
sales_may

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9


In [3]:
sales_apr = pd.read_csv('./data/auto_sales_apr.csv')
sales_apr

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9


# Concatenating Tables with Set-Like Operations in `dfply`

Now let's look at combining tables with `union`, `intersect`, and `set_diff` in `dfply`.

## Unions with `dfply`

Use `left_table >> union(right_table)`

In [4]:
sales_may >> union(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9


## `dfply.union` is distinct

Since Ann has the same sales in both months, her row is included only once.  Note that we can use `keep='first'` or `keep='last'` to determine which of the duplicate rows is kept.

In [5]:
sales_may >> union(sales_apr, keep='last')

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
0,Ann,22,18,15,12
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9
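The `keep` argument has the same name and values as the one in `pandas`' `drop_duplicates`. Assuming `dfply.union` behaves like a concatenation followed by `drop_duplicates` (an assumption, not confirmed here), the effect can be sketched with labeled rows; the index labels are hypothetical and only there to show which copy survives.

```python
import pandas as pd

may = pd.DataFrame(
    {"Salesperson": ["Ann", "Bob"], "Compact": [22, 19]},
    index=["may_0", "may_1"],  # hypothetical labels marking the source table
)
apr = pd.DataFrame(
    {"Salesperson": ["Ann", "Bob"], "Compact": [22, 20]},
    index=["apr_0", "apr_1"],
)

stacked = pd.concat([may, apr])

# drop_duplicates compares column values only, so the index labels
# do not prevent Ann's two rows from counting as duplicates.
kept_first = stacked.drop_duplicates(keep="first")
kept_last = stacked.drop_duplicates(keep="last")

print(kept_first.index.tolist())
print(kept_last.index.tolist())
```

With `keep="first"` Ann's May row survives; with `keep="last"` her April row survives, matching the row order in the output above.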


## Making `union_all`

We can use `pd.concat` to perform a `UNION ALL`.

In [6]:
pd.concat([sales_apr, sales_may], ignore_index=True)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,20,14,6,24
2,Yolanda,19,10,28,17
3,Xerxes,11,27,17,9
4,Ann,22,18,15,12
5,Bob,19,12,17,20
6,Yolanda,19,8,32,15
7,Xerxes,12,23,18,9


## Making a `dfply.union_all`

In [7]:
@dfpipe
def union_all(left_df, right_df, ignore_index=True):
    return pd.concat([left_df, right_df], ignore_index=ignore_index)

In [8]:
sales_may >> union_all(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
4,Ann,22,18,15,12
5,Bob,20,14,6,24
6,Yolanda,19,10,28,17
7,Xerxes,11,27,17,9


## Adding a month column

Another way to keep both of Ann's sales rows is adding a month column (which we should probably do anyway).

In [9]:
sales_may >> mutate(month = 'May') >> union(sales_apr >> mutate(month = 'April'))

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck,month
0,Ann,22,18,15,12,May
1,Bob,19,12,17,20,May
2,Yolanda,19,8,32,15,May
3,Xerxes,12,23,18,9,May
0,Ann,22,18,15,12,April
1,Bob,20,14,6,24,April
2,Yolanda,19,10,28,17,April
3,Xerxes,11,27,17,9,April


## Finding common rows with `dfply.intersect`

In [10]:
sales_may >> intersect(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,Ann,22,18,15,12
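For comparison, a similar result can be sketched in plain `pandas` with an inner merge on all shared columns. This is an illustrative equivalent on made-up slices of the sales tables, not necessarily how `dfply.intersect` is implemented.

```python
import pandas as pd

may = pd.DataFrame(
    {"Salesperson": ["Ann", "Bob"], "Compact": [22, 19], "Sedan": [18, 12]}
)
apr = pd.DataFrame(
    {"Salesperson": ["Ann", "Bob"], "Compact": [22, 20], "Sedan": [18, 14]}
)

# With no `on` argument, merge joins on every column the tables share,
# so only rows that agree in all columns survive the inner join.
common = may.merge(apr, how="inner").drop_duplicates()

print(common)
```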


## Finding rows unique to the left table

Use `left_table >> dfply.set_diff(right_table)`

In [11]:
sales_may >> set_diff(sales_apr)

Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,Bob,19,12,17,20
2,Yolanda,19,8,32,15
3,Xerxes,12,23,18,9
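A set difference can also be sketched in plain `pandas` as an anti-join: a left merge with `indicator=True`, followed by a filter on the `_merge` column. This is one possible equivalent on made-up slices of the sales tables, not necessarily what `dfply.set_diff` does internally.

```python
import pandas as pd

may = pd.DataFrame(
    {"Salesperson": ["Ann", "Bob"], "Compact": [22, 19], "Sedan": [18, 12]}
)
apr = pd.DataFrame(
    {"Salesperson": ["Ann", "Bob"], "Compact": [22, 20], "Sedan": [18, 14]}
)

# indicator=True adds a `_merge` column recording where each row was found.
flagged = may.merge(apr.drop_duplicates(), how="left", indicator=True)

# Keep rows that appear only in the left table, then drop the helper column.
only_in_may = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

print(only_in_may)
```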


## <font color="red"> Exercise 1 </font>

In the data folder, you will find 6 files, each containing a sample of 100,000 rows from the Uber data for one of the months apr14-sep14.  Perform the following tasks:

1. Use `glob` to get all 6 file paths.
2. Use a regular expression to create a `lambda` function that pulls the month from each file path.
3. Read the 6 data frames into a `dict` with keys equal to the month name and values containing the corresponding data frame.
4. Write a helper function that adds a month column to each data frame.  Use a dictionary comprehension to apply this helper to each `df`.
5. Use the accumulator pattern and `dfply.union` to combine these 6 data frames into one combined `df`.
6. Inspect the head and shape of the resulting `df`.

In [12]:
from glob import glob
files = glob('./data/uber-raw-data-*14-sample.csv')
files

['./data/uber-raw-data-apr14-sample.csv',
 './data/uber-raw-data-aug14-sample.csv',
 './data/uber-raw-data-jul14-sample.csv',
 './data/uber-raw-data-jun14-sample.csv',
 './data/uber-raw-data-may14-sample.csv',
 './data/uber-raw-data-sep14-sample.csv']

In [13]:
import re
FILE_NAME_RE = re.compile(r'^\./data/uber-raw-data-([a-z]*)14-sample\.csv$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]
file_names(files)

['apr', 'aug', 'jul', 'jun', 'may', 'sep']

In [14]:
dfs = {name:pd.read_csv(path) for name, path in zip(file_names(files), files)}
dfs['apr'].head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/18/2014 21:38:00,40.7359,-73.9852,B02682
1,4/23/2014 15:19:00,40.7642,-73.9543,B02598
2,4/10/2014 7:15:00,40.7138,-74.0103,B02598
3,4/11/2014 15:23:00,40.7847,-73.9698,B02682
4,4/7/2014 17:26:00,40.646,-73.7767,B02598


In [15]:
def add_month(mo):
    return dfs[mo] >> mutate(month = mo)

In [16]:
add_month('may').head()

Unnamed: 0,Date/Time,Lat,Lon,Base,month
0,5/31/2014 18:57:00,40.766,-73.9714,B02682,may
1,5/13/2014 21:19:00,40.7598,-73.9782,B02598,may
2,5/21/2014 18:19:00,40.7254,-73.9979,B02598,may
3,5/20/2014 16:40:00,40.6246,-73.9676,B02682,may
4,5/22/2014 7:31:00,40.7374,-73.9965,B02598,may


In [17]:
addingmonth = {f:add_month(f) for f in file_names(files)}
addingmonth['may'].head()

Unnamed: 0,Date/Time,Lat,Lon,Base,month
0,5/31/2014 18:57:00,40.766,-73.9714,B02682,may
1,5/13/2014 21:19:00,40.7598,-73.9782,B02598,may
2,5/21/2014 18:19:00,40.7254,-73.9979,B02598,may
3,5/20/2014 16:40:00,40.6246,-73.9676,B02682,may
4,5/22/2014 7:31:00,40.7374,-73.9965,B02598,may


In [18]:
addingmonth['apr'] >> union(addingmonth['may']) >> head

Unnamed: 0,Date/Time,Lat,Lon,Base,month
0,4/18/2014 21:38:00,40.7359,-73.9852,B02682,apr
1,4/23/2014 15:19:00,40.7642,-73.9543,B02598,apr
2,4/10/2014 7:15:00,40.7138,-74.0103,B02598,apr
3,4/11/2014 15:23:00,40.7847,-73.9698,B02682,apr
4,4/7/2014 17:26:00,40.646,-73.7767,B02598,apr


In [24]:
col_names = ['Date/Time', 'Lat', 'Lon', 'Base', 'month']
df = pd.DataFrame(columns=col_names)
for d in addingmonth.values():
    df = df >> union_all(d)

In [25]:
df.head()

Unnamed: 0,Date/Time,Lat,Lon,Base,month
0,4/18/2014 21:38:00,40.7359,-73.9852,B02682,apr
1,4/23/2014 15:19:00,40.7642,-73.9543,B02598,apr
2,4/10/2014 7:15:00,40.7138,-74.0103,B02598,apr
3,4/11/2014 15:23:00,40.7847,-73.9698,B02682,apr
4,4/7/2014 17:26:00,40.646,-73.7767,B02598,apr


In [26]:
df.shape

(600000, 5)
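As an alternative to the accumulator loop, the same stacking can be sketched with a single `pd.concat` call over a list comprehension. The small `dfs` dictionary below is a hypothetical stand-in for the six monthly frames read from disk.

```python
import pandas as pd

# Hypothetical stand-ins for the monthly data frames.
dfs = {
    "apr": pd.DataFrame({"Base": ["B02682", "B02598"]}),
    "may": pd.DataFrame({"Base": ["B02598"]}),
}

# Add the month column to each frame, then stack them all in one call.
combined = pd.concat(
    [frame.assign(month=name) for name, frame in dfs.items()],
    ignore_index=True,
)

print(combined.shape)
```

Concatenating once avoids copying the growing result on every pass through the loop, which can matter when there are many frames.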

## Up Next

Stuff