# Union and unionAll in PySpark

* **Purpose:** Both union and unionAll are used to combine two DataFrames into a single DataFrame.

* **DataFrame Compatibility:** The two DataFrames must have the same number of columns, with compatible data types in each position. Columns are matched by position, not by name, so both DataFrames should list their columns in the same order.

### union()

* **Functionality:** Combines two DataFrames and retains all rows, including duplicates.
* **Behavior:** The union() method does not deduplicate rows, so the result can contain duplicates. It behaves like UNION ALL in SQL, not like UNION.

### unionAll()

* **Functionality:** Combines two DataFrames and retains all rows, including duplicates.
* **Behavior:** In Spark 2.0 and later, unionAll() is an alias for union(). It performs the same operation and does not eliminate duplicate rows, like UNION ALL in SQL.

```

# union() keeps every row from both DataFrames, including duplicates
unioned_df = df1.union(df2)
# unionAll() is an alias for union() and returns the same rows
unionAll_df = df1.unionAll(df2)

```



In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *  # includes regexp_replace and col

# Reuse the active Spark session, or create one if none exists
spark = SparkSession.builder.getOrCreate()


In [10]:
data1 = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
data2 = [("David", 40), ("Eve", 45), ("Alice", 25)]
columns = ["name", "age"]

df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)
df1.show()
df2.show()

# union() keeps all rows, including duplicates
unioned_df = df1.union(df2)
print("unioned_df (No duplicates removed):")
display(unioned_df.count())
unioned_df.show()

# unionAll() is an alias for union(), so it also keeps all rows
unionAll_df = df1.unionAll(df2)
print("unionAll_df (duplicates retained):")
display(unionAll_df.count())
unionAll_df.show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+

+-----+---+
| name|age|
+-----+---+
|David| 40|
|  Eve| 45|
|Alice| 25|
+-----+---+

unioned_df (No duplicates removed):


6

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
|    Eve| 45|
|  Alice| 25|
+-------+---+

unionAll_df (duplicates retained):


6

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
|  David| 40|
|    Eve| 45|
|  Alice| 25|
+-------+---+



### Remove duplicate rows and create a new DataFrame

In [12]:
# dropDuplicates() with no arguments compares entire rows
unioned_df = unioned_df.dropDuplicates()
print("unioned_df : (duplicates removed):")
display(unioned_df.count())
unioned_df.show()

# distinct() is equivalent to dropDuplicates() with no arguments
unionAll_df = unionAll_df.distinct()
print("unionAll_df : (duplicates removed):")
display(unionAll_df.count())
unionAll_df.show()


unioned_df : (duplicates removed):


5

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|Charlie| 35|
|    Bob| 30|
|  David| 40|
|    Eve| 45|
+-------+---+

unionAll_df : (duplicates removed):


5

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|Charlie| 35|
|    Bob| 30|
|  David| 40|
|    Eve| 45|
+-------+---+

