In PySpark, subtract() returns the rows of one DataFrame (or RDD) that do not appear in another. For DataFrames, the result is also de-duplicated, which makes it equivalent to EXCEPT DISTINCT in SQL.
It is commonly used to filter out the rows that two DataFrames have in common.

I'll explain with examples step by step.

### 1. Syntax
`df1.subtract(df2)`


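Returns the distinct rows in df1 that are not in df2.

Both DataFrames must have the same number of columns with compatible data types. Columns are matched by position, not by name, and the result uses the column names of df1.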

### 2. Sample Data

Let's create two sample DataFrames:

In [1]:
# Sample DataFrames
data1 = [
    (1, "A", 100),
    (2, "B", 200),
    (3, "C", 300),
    (4, "D", 400)
]

data2 = [
    (3, "C", 300),
    (4, "D", 400),
    (5, "E", 500)
]

columns = ["id", "name", "salary"]

df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)

print("DataFrame 1")
display(df1)
print("DataFrame 2")
display(df2)


### 3. Subtract Example – Get Rows in df1 but NOT in df2

In [2]:
df1_diff = df1.subtract(df2)

print("Rows present in df1 but NOT in df2")
display(df1_diff)

Result: the rows (1, "A", 100) and (2, "B", 200). Row order is not guaranteed.

### 4. Subtract Example – Get Rows in df2 but NOT in df1

In [3]:
df2_diff = df2.subtract(df1)

print("Rows present in df2 but NOT in df1")
display(df2_diff)

Result: the single row (5, "E", 500).

### 5. Two-Way Comparison (Symmetric Difference)

To get the rows that appear in exactly one of the two DataFrames (the symmetric difference), union the two one-way differences:

In [4]:
df_diff = df1.subtract(df2).union(df2.subtract(df1))

print("Rows that are different between df1 and df2")
display(df_diff)

Result: the rows (1, "A", 100), (2, "B", 200), and (5, "E", 500). Row order is not guaranteed.

### 6. Filtering One DataFrame Using subtract()

The same call works as a filter: it keeps only the rows of df1 that do not appear in df2 (the same operation as in section 3):

In [5]:
filtered_df = df1.subtract(df2)
display(filtered_df)



### 7. Important Notes

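Column count and types must match → otherwise, subtract() raises an AnalysisException. Column names are not compared; columns are matched by position.

Works row-wise → it compares all columns, not just one, and treats null values in the same position as equal.

Removes duplicates → the result contains distinct rows only; use exceptAll() to keep duplicates.

For column-specific comparisons, use join() instead.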

### 8. Alternative: Using Left Anti Join (Recommended for Large Data)

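subtract() compares entire rows and de-duplicates the result, which can be expensive for large datasets.
A common alternative is a left anti join, which keeps the rows of df1 that have no match in df2 on the join columns. It can be cheaper because it skips the de-duplication step.

For the sample data, the join below produces the same result as df1.subtract(df2). The two are not equivalent in general: a left anti join keeps duplicate rows from df1 and never matches null values in the join columns, whereas subtract() removes duplicates and treats nulls as equal.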

In [7]:
display(df1.join(df2, on=["id", "name", "salary"], how="left_anti"))


### Summary Table

| Use Case                       | Method                                       |
| ------------------------------ | -------------------------------------------- |
| Rows in `df1` but not in `df2` | `df1.subtract(df2)`                          |
| Rows in `df2` but not in `df1` | `df2.subtract(df1)`                          |
| Rows in only one of the two    | `df1.subtract(df2).union(df2.subtract(df1))` |
| Large data efficiency          | Use `left_anti` join                         |
