### Library Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

### Template

In [2]:
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.11  - Unionizing Multiple Dataframes")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path

In [3]:
pets = spark.read.csv(path, header=True)
pets.toPandas()

|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 |  |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 |  |
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |


### Unionizing Multiple Dataframes

There are a couple of situations where you would want to perform a union transformation.

**Case 1: Collecting Data from Various Sources**

When you're collecting data from multiple sources, at some point in your Spark application you will need to reconcile the different sources into the same format and work with a single source of truth. This requires you to `union` the different datasets together.

**Case 2: Performing Different Transformations on your Dataset**

Sometimes you would like to perform separate transformations on different parts of your data based on your task. This involves breaking up your dataset into different parts and working on them individually. At some point you will want to stitch them back together, which is again a `union` operation.
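The split-and-stitch workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed pattern: `split_transform_union` and its parameters are hypothetical names, `condition` is assumed to be a SQL expression string, and the two transforms are assumed to return dataframes with the same set of columns.

```python
def split_transform_union(df, condition, when_true, when_false):
    """Split `df` on a SQL condition, transform each part, and union them back.

    Hypothetical helper: `when_true` and `when_false` are functions that take
    a dataframe and return a dataframe with the same set of columns.
    """
    # Rows where the condition is true.
    matched = df.where(condition)
    # Everything else, including rows where the condition evaluates to NULL.
    rest = df.where("NOT ({0}) OR ({0}) IS NULL".format(condition))

    matched = when_true(matched)
    rest = when_false(rest)

    # union() matches columns by position, so re-select the second part
    # in the column order of the first before stitching them together.
    return matched.union(rest.select(matched.columns))
```

With the `pets` dataframe loaded above, a call might look like `split_transform_union(pets, "color IS NULL", lambda part: part.fillna({"color": "unknown"}), lambda part: part)`, which fills in the missing colors on one part and leaves the other untouched.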

### Case 1 - `union()` (the Wrong Way)

In [4]:
pets_2 = pets.select(
    'breed_id',
    'id',
    'age',
    'color',
    'birthday',
    'nickname'
)

(
    pets
    .union(pets_2)
    .where(F.col('id').isin(1,2))
    .toPandas()
)

|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 |  |
| 2 | 1 | 1 | 5 | brown | 2014-11-22 12:30:31 | King |
| 3 | 1 | 3 | 15 |  | 2016-11-22 10:05:10 | Chewie |
| 4 | 2 | 3 | 17 | white | 2018-11-22 10:05:10 | Maple |


### Case 1 - Another Wrong Way

In [5]:
pets_3 = pets.select(
    '*',
    '*'
)

pets_3.show()

(
    pets
    .union(pets_3)
    .where(F.col('id').isin(1,2))
    .toPandas()
)

+---+--------+--------+-------------------+---+-----+---+--------+--------+-------------------+---+-----+
| id|breed_id|nickname|           birthday|age|color| id|breed_id|nickname|           birthday|age|color|
+---+--------+--------+-------------------+---+-----+---+--------+--------+-------------------+---+-----+
|  1|       1|    King|2014-11-22 12:30:31|  5|brown|  1|       1|    King|2014-11-22 12:30:31|  5|brown|
|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|  2|       3|   Argus|2016-11-22 10:05:10| 10| null|
|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null|  3|       1|  Chewie|2016-11-22 10:05:10| 15| null|
|  3|       2|   Maple|2018-11-22 10:05:10| 17|white|  3|       2|   Maple|2018-11-22 10:05:10| 17|white|
+---+--------+--------+-------------------+---+-----+---+--------+--------+-------------------+---+-----+



AnalysisException: u"Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 12 columns;;\n'Union\n:- Relation[id#10,breed_id#11,nickname#12,birthday#13,age#14,color#15] csv\n+- Project [id#10, breed_id#11, nickname#12, birthday#13, age#14, color#15, id#10, breed_id#11, nickname#12, birthday#13, age#14, color#15]\n   +- Relation[id#10,breed_id#11,nickname#12,birthday#13,age#14,color#15] csv\n"

**What Happened?**

This actually worked out quite nicely; I had forgotten this was the case. **Spark will only allow you to union `df`s that have exactly the same number of columns and compatible column datatypes, and it matches the columns by position, not by name.**

**The first wrong way (`pets_2`)**

Because we read the csv file without inferring a schema, every column is a string, so Spark was able to union the 2 dataframes. The results don't make sense at all, though: the columns don't match up.

**Another wrong way (`pets_3`)**

We created a new dataframe with twice the number of columns and tried to union it with the original `df`. Spark threw an error, as it doesn't know what to do when the number of columns doesn't match up.
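One way to catch a silent mismatch like the `pets_2` union before it happens is to compare the column lists position by position, since that is how `union` lines columns up. The helper below is only a sketch: `positional_mismatches` is a hypothetical name, and it works on plain lists such as `df.columns`, so it does not need a Spark session.

```python
def positional_mismatches(left_columns, right_columns):
    """Return (position, left_name, right_name) for every position that differs.

    Hypothetical helper. Raises ValueError when the column counts differ,
    which is the case Spark itself rejects with an AnalysisException.
    """
    if len(left_columns) != len(right_columns):
        raise ValueError(
            "column counts differ: {} vs {}".format(
                len(left_columns), len(right_columns)
            )
        )
    return [
        (position, left, right)
        for position, (left, right) in enumerate(zip(left_columns, right_columns))
        if left != right
    ]


# Column order of `pets` and `pets_2` from the examples above.
pets_columns = ["id", "breed_id", "nickname", "birthday", "age", "color"]
pets_2_columns = ["breed_id", "id", "age", "color", "birthday", "nickname"]

for mismatch in positional_mismatches(pets_columns, pets_2_columns):
    print(mismatch)
```

In the notebook you would pass `pets.columns` and `pets_2.columns` directly. An empty result means the two dataframes line up by name at every position.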

### Case 2 - `union()` (the Right Way)

In [6]:
(
    pets
    .union(pets_2.select(pets.columns))
    .union(pets_3.select(pets.columns))
    .toPandas()
)

|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 |  |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 |  |
| 3 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
| 4 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 5 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 |  |
| 6 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 |  |
| 7 | 3 | 2 | Maple | 2018-11-22 10:05:10 | 17 | white |
| 8 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 9 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 |  |


**What Happened?**

The columns match perfectly! How? **For each new `df` that you would like to union with the original `df`, `select` the columns of the original `df` during the union.** This:
1. Guarantees the ordering of the columns, as a `select` returns the columns in the order in which they are listed.
2. Guarantees that only the columns of the original `df` are selected; from the previous sections, we know that `select` returns only the specified columns.
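When there are more than two dataframes, the same pattern can be folded into a small helper. This is a sketch rather than an established API: `union_all` is a hypothetical name, and it assumes every dataframe contains at least the columns of the first one.

```python
from functools import reduce


def union_all(first, *others):
    """Union `others` onto `first`, aligning each to the columns of `first`.

    Hypothetical helper; assumes every dataframe in `others` has at least
    the columns of `first`.
    """
    columns = first.columns
    return reduce(
        # select() puts the columns in the order of `first` and drops extras.
        lambda combined, other: combined.union(other.select(columns)),
        others,
        first,
    )
```

Under those assumptions, `union_all(pets, pets_2, pets_3)` would build the same chain of `union` calls as the cell above.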

### Summary

* Always be careful when you are `union`ing `df`s together.
* When you `union` `df`s together you should ensure:
    1. The number of columns is the same.
    2. The columns are exactly the same.
    3. The columns are in the same order.
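
The checklist above can also be enforced in code before the union runs. The following is a minimal sketch under stated assumptions: `checked_union` is a hypothetical helper, and it treats two dataframes as compatible when they have the same column names, in any order.

```python
def checked_union(left, right):
    """Union two dataframes after checking the three conditions listed above.

    Hypothetical helper: fails fast on a column mismatch instead of letting
    union() silently pair up unrelated columns.
    """
    left_columns = list(left.columns)
    right_columns = list(right.columns)

    # 1. The number of columns is the same.
    if len(left_columns) != len(right_columns):
        raise ValueError(
            "column counts differ: {} vs {}".format(
                len(left_columns), len(right_columns)
            )
        )

    # 2. The column names are the same (sorting ignores order only).
    if sorted(left_columns) != sorted(right_columns):
        raise ValueError(
            "column names differ: {} vs {}".format(left_columns, right_columns)
        )

    # 3. The columns are in the same order: re-select `right` to match `left`.
    return left.union(right.select(left_columns))
```

On Spark 2.3 or later, the built-in `DataFrame.unionByName` resolves columns by name rather than by position, and may be a simpler choice when the column names already match.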