## union() vs unionAll() in PySpark


In PySpark, the `union()` and `unionAll()` transformations append the rows of one DataFrame to another. Both DataFrames must have the same number of columns with compatible types, and the columns are matched by position, not by name.


### Older Versions of PySpark (before 2.0)


**unionAll:**
- This was the only DataFrame method for combining two DataFrames. It kept all rows, even if they were the same in both DataFrames.

**union:**
- The DataFrame API had no `union` method before 2.0; it was added in 2.0. Removing duplicates has always required an explicit `distinct()` (or `dropDuplicates()`) after combining. A deduplicating `UNION` exists only in Spark SQL.

### Newer Versions of PySpark (from 2.0 onwards)

**union:**
- In PySpark 2.0 and later, `union` keeps every row, including duplicates, exactly as `unionAll` does. To remove duplicates, call `distinct()` on the result.

**unionAll:** 
- `unionAll` was marked deprecated in Spark 2.0 in favor of `union`. It was never removed, and in Spark 3.x it is still available and simply calls `union`.

**Example Using union (New Version)**

In newer versions (PySpark 2.0 and above), union keeps duplicates, so it behaves the same as unionAll from older versions.

In [0]:
# sample dataframe
data1 = [("Rohish", 26), ("Smit", 25), ("Rajesh", 27)]
data2 = [("Rohish", 26), ("Melody", 25)]

schema=["Name", "age"]

df1 = spark.createDataFrame(data1, schema)
df2 = spark.createDataFrame(data2, schema)

df1.show()
df2.show()

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
+------+---+

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|Melody| 25|
+------+---+



In [0]:
# unionAll and union both give the same result
df1.unionAll(df2).show()

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
|Rohish| 26|
|Melody| 25|
+------+---+



In [0]:
# unionAll and union both give the same result
df1.union(df2).show()

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
|Rohish| 26|
|Melody| 25|
+------+---+



### union and unionAll in Spark SQL vs PySpark DataFrames

**PySpark DataFrame:**
- On a DataFrame, `union` and `unionAll` behave identically: both keep all rows, including duplicates.

**Spark SQL**:
- `union` returns only distinct rows, while `union all` returns all rows (including duplicates).

In [0]:
# create a temp view for each DataFrame
df1.createOrReplaceTempView("view_1")
df2.createOrReplaceTempView("view_2")

In [0]:
spark.sql("""select * from view_1""").show()
spark.sql("""select * from view_2""").show()

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
+------+---+

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|Melody| 25|
+------+---+



In [0]:
# union: combine view_1 with itself (removes duplicates)
spark.sql("""
          select * from view_1
          union
          select * from view_1
          """).show()

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
+------+---+



In [0]:
# union all: combine view_1 with itself (keeps duplicates)
spark.sql("""
          select * from view_1
          union all
          select * from view_1
          """).show()

+------+---+
|  Name|age|
+------+---+
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
|Rohish| 26|
|  Smit| 25|
|Rajesh| 27|
+------+---+



## unionByName() in PySpark

In PySpark, `unionByName()` combines two DataFrames by matching column names rather than column positions.

It is useful when the DataFrames have the same columns in a different order or, with `allowMissingColumns=True`, when their sets of columns differ.

#### Key Points about unionByName:

**Column Name-Based:**
- It combines DataFrames by matching columns with the same names. 
- If the columns do not match in order, unionByName() will still work as long as the column names are the same.

**Handling Missing Columns:**
- If one DataFrame has columns that the other DataFrame does not have, `unionByName()` can fill the missing columns with `null` in the resulting DataFrame, but only when `allowMissingColumns=True` is passed.

**allowMissingColumns:**
- By default (`allowMissingColumns=False`), PySpark raises an `AnalysisException` if the two DataFrames do not have the same set of columns.
- Setting `allowMissingColumns=True` (available from Spark 3.1) fills the missing columns with `null` instead of raising an error.

**Example 1:** Basic Usage of unionByName()

In [0]:
# Sample DataFrames with different column orders
data1 = [("Rohish", 27), ("Melody", 25)]
data2 = [(35, "Chetan")]

columns1 = ["Name", "Age"]
columns2 = ["Age", "Name"]

df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)

In [0]:
# Perform unionByName, which combines by column names
df1.unionByName(df2).show()

+------+---+
|  Name|Age|
+------+---+
|Rohish| 27|
|Melody| 25|
|Chetan| 35|
+------+---+



Here, even though the columns in `df1` and `df2` are in different orders, `unionByName` successfully combines them by matching the column names (`Name` and `Age`).

**Example 2:** Using allowMissingColumns=True

In [0]:
# Sample DataFrames with different columns
data1 = [("Rohish", 27), ("Melody", 25)]
data2 = [("Chetan", 35, "M"), ("Rajesh", 27, "M")]

columns1 = ["Name", "Age"]
columns2 = ["Name", "Age", "Gender"]

df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)

In [0]:
# Perform unionByName with allowMissingColumns=True
df1.unionByName(df2, allowMissingColumns=True).show()

+------+---+------+
|  Name|Age|Gender|
+------+---+------+
|Rohish| 27|  null|
|Melody| 25|  null|
|Chetan| 35|     M|
|Rajesh| 27|     M|
+------+---+------+



In this example, the `df1` DataFrame does not have the Gender column. Since `allowMissingColumns=True`, the missing column is added with `null` values for those rows where the column is missing.

**Example 3:** Using allowMissingColumns=False (default)

If `allowMissingColumns=False` (the default), PySpark raises an error if the DataFrames have different sets of columns.

In [0]:
# Perform unionByName without allowing missing columns (default behavior)
df1.unionByName(df2, allowMissingColumns=False)

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-1950943183691311>:2
      1 # Perform unionByName without allowing missing columns (default behavior)
----> 2 df1.unionByName(df2, allowMissingColumns=False)

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     46 start = time.perf_counter()
     47 try:
---> 48     res = func(*args, **

### When to Use Which?

**Use union:**
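- When the DataFrames have the same schema (same column names, order, and types).
- When you want to combine rows from both DataFrames and the columns are already aligned correctly.
- Warning: `union` matches columns by position. If the columns are in a different order, values can silently end up under the wrong column, or the call can fail when the column types are incompatible.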

**Use unionByName:**
- When the column names are the same but the order differs between the DataFrames.
- When the DataFrames have different columns and you want to combine them anyway, filling the missing columns with `null` (requires `allowMissingColumns=True`).
- When you're working with DataFrames that are similar but may have missing or additional columns.