In [0]:
# Sample data for df1
data1 = [
    (1, "Apple"),
    (2, "Banana"),
    (3, "Orange"),
    (3, "Orange")  # duplicate row
]

# Sample data for df2
data2 = [
    (3, "Orange"),
    (4, "Grapes")
]

# Define schema columns
columns = ["id", "fruit"]

# Create DataFrames
df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)

print("DF1:")
df1.display()

print("DF2:")
df2.display()


In [0]:
# Using exceptAll to find differences
result = df1.exceptAll(df2)

print("Rows in DF1 but not in DF2 (including duplicates):")
result.display()

### Rules & Requirements
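Same schema → The two DataFrames must have the same number of columns, with compatible data types in the same order. Columns are matched by position, not by name, and the result takes its column names from the first DataFrame.

Case sensitivity → String values are compared case-sensitively by default ("Apple" ≠ "apple").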

Duplicate handling → If a row occurs more times in DF1 than in DF2, exceptAll() keeps the surplus copies and drops the rest.

Example: DF1 has (3, Orange) twice and DF2 has it once, so the result keeps it once.

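Ordering → Output order is not guaranteed and may vary between runs; sort the result explicitly if a stable order matters.

Immutability → exceptAll() returns a new DataFrame and leaves both input DataFrames unchanged.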

### When to Use exceptAll()
When you need set difference but want to preserve duplicate rows.

For comparing transactional data where the number of occurrences matters.

To find extra records in one DataFrame compared to another.

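### How is it Different from subtract()?

PySpark has no `except()` method, because `except` is a reserved word in Python. The distinct variant is called `subtract()` in PySpark and `except()` in the Scala and Java APIs.

| Feature                                  | `subtract()`                 | `exceptAll()` |
| ---------------------------------------- | ---------------------------- | ------------- |
| Removes duplicates                       | ✅ Yes                        | ❌ No          |
| Preserves duplicate rows                 | ❌ No                         | ✅ Yes         |
| Requires matching column count and types | ✅ Yes                        | ✅ Yes         |
| SQL equivalent                           | `EXCEPT` (`EXCEPT DISTINCT`) | `EXCEPT ALL`  |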
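
To make the difference in the table concrete, here is a minimal plain-Python sketch that models the two behaviors on the same sample rows used for df1 and df2. It does not call Spark; `rows1`, `rows2`, `distinct_diff`, and `all_diff` are illustrative names, and `set` / `collections.Counter` stand in for the distinct and multiset semantics.

```python
from collections import Counter

# Same sample rows as df1 and df2, written as plain Python tuples.
rows1 = [(1, "Apple"), (2, "Banana"), (3, "Orange"), (3, "Orange")]
rows2 = [(3, "Orange"), (4, "Grapes")]

# Distinct difference (SQL EXCEPT): a row is dropped entirely if it appears
# in rows2 at all, and the surviving rows are de-duplicated.
distinct_diff = sorted(set(rows1) - set(rows2))

# Multiset difference (SQL EXCEPT ALL): occurrence counts are subtracted,
# so one of the two (3, "Orange") rows survives.
all_diff = sorted((Counter(rows1) - Counter(rows2)).elements())

print(distinct_diff)  # [(1, 'Apple'), (2, 'Banana')]
print(all_diff)       # [(1, 'Apple'), (2, 'Banana'), (3, 'Orange')]
```

The only row that differs between the two results is (3, "Orange"): the distinct variant removes it because it exists in the second input, while the multiset variant keeps the one surplus copy.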


### How It Works Internally
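Rows are compared by value, column by column. The steps below describe the logical behavior; Spark runs it as a distributed plan rather than a literal row-by-row loop.

For each row in DF1:

If a matching row is still available in DF2, the pair cancels out: one occurrence is removed, and that DF2 row cannot match again.

If no matching row is left in DF2, the DF1 row stays in the result.

The net effect: a row that appears m times in DF1 and n times in DF2 appears max(m - n, 0) times in the output.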
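
The per-row rule above can be sketched in plain Python. This is only a model of the semantics, assuming rows are hashable tuples; it is not how Spark implements the operation, and `except_all` is a hypothetical helper name.

```python
from collections import Counter

def except_all(left_rows, right_rows):
    """Plain-Python model of EXCEPT ALL semantics (not Spark's implementation)."""
    # How many unmatched occurrences of each row are still available on the right.
    remaining = Counter(right_rows)
    result = []
    for row in left_rows:
        if remaining[row] > 0:
            # A match exists: cancel exactly one occurrence and drop this row.
            remaining[row] -= 1
        else:
            # No match left on the right: the row stays in the result.
            result.append(row)
    return result

three_oranges = [(3, "Orange"), (3, "Orange"), (3, "Orange")]
print(except_all(three_oranges, [(3, "Orange")]))
# [(3, 'Orange'), (3, 'Orange')]
```

With three copies on the left and one on the right, exactly one copy is cancelled and two remain.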

### Real-World Use Cases
Data reconciliation — find extra rows in one dataset vs another while keeping track of how many extra copies exist.

Log analysis — detect extra events in one log vs a baseline.

Inventory checks — find overstocked items that don’t match an expected stock list.

ETL validation — compare source and target data where duplicates are important.
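
For reconciliation and ETL validation, one possible pattern is to run exceptAll() in both directions, since a single call only reports surplus rows on one side. The sketch below assumes both arguments are PySpark DataFrames with compatible columns; `reconcile`, `source_df`, and `target_df` are hypothetical names, not part of the Spark API.

```python
def reconcile(source_df, target_df):
    """Compare two PySpark DataFrames as multisets of rows.

    Returns (is_match, missing_in_target, extra_in_target).
    `reconcile` is a hypothetical helper, not a Spark function.
    """
    # Rows, or surplus copies of rows, that exist only in the source.
    missing_in_target = source_df.exceptAll(target_df)
    # Rows, or surplus copies of rows, that exist only in the target.
    extra_in_target = target_df.exceptAll(source_df)
    # The inputs hold the same rows with the same multiplicities
    # only if both differences are empty.
    is_match = missing_in_target.count() == 0 and extra_in_target.count() == 0
    return is_match, missing_in_target, extra_in_target
```

Calling `reconcile(df1, df2)` on the sample DataFrames would report a mismatch: the first difference holds (1, Apple), (2, Banana), and one (3, Orange), and the second holds (4, Grapes). Each `count()` triggers a Spark job, so on large inputs it may be worth caching the two difference DataFrames before inspecting them.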