# PySpark Union vs Union All

## Key Differences

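- `union()`: Combines two DataFrames and keeps every row, including duplicates (the behavior of SQL `UNION ALL`). Columns are matched by position, not by name.

- `unionAll()`: Behaves identically to `union()` and also keeps duplicates. It was deprecated in Spark 2.0 in favor of `union()`; since Spark 3.0 it remains available as an alias.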

In [10]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('UnionExample').getOrCreate()

In [11]:
# ("Bob", 2) appears in both datasets so the duplicate handling is visible
data1 = [("Alice", 1), ("Bob", 2)]
data2 = [("Bob", 2), ("Cathy", 3), ("David", 4)]
columns = ["Name", "Id"]

df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)

df1.show()
df2.show()

+-----+---+
| Name| Id|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+

+-----+---+
| Name| Id|
+-----+---+
|  Bob|  2|
|Cathy|  3|
|David|  4|
+-----+---+



In [12]:
# union() appends the rows of df2 to df1 and keeps duplicates
df_union = df1.union(df2)
df_union.show()

+-----+---+
| Name| Id|
+-----+---+
|Alice|  1|
|  Bob|  2|
|  Bob|  2|
|Cathy|  3|
|David|  4|
+-----+---+



###### `union()` combines the DataFrames but does NOT remove duplicates: `Bob` appears twice in the output above. If you want unique rows, call `distinct()` on the result.


In [13]:
df_union_distinct = df_union.distinct()
df_union_distinct.show()

+-----+---+
| Name| Id|
+-----+---+
|Alice|  1|
|  Bob|  2|
|Cathy|  3|
|David|  4|
+-----+---+



## Conclusion
- Use `union()` to merge DataFrames. It keeps duplicate rows, so chain `.distinct()` when you need unique rows.
- `unionAll()` returns the same result as `union()`. It was deprecated in Spark 2.0, so `union()` is the recommended call in new code.