<a href="https://colab.research.google.com/github/ramayer/google-colab-examples/blob/main/Spark_delta_io_Unexpected_behavior_in_MERGE_INTO_WHEN_MATCHED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Delta Lake's MERGE INTO statement in Spark appears to unnecessarily rewrite parquet files.

* Delta.io's "MERGE INTO" statement can sometimes rewrite hundreds of files even though its own metrics show that only a single row changed.
* Expected behaviour - a MERGE INTO statement that touches only one row should modify at most 3 files in a delta table (a new parquet file containing that row is added; the previous parquet file that contained the row is removed; and the transaction log is updated).
* Observed behaviour - in the example below, only a single row is updated, yet hundreds of files are written.
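
The expected behaviour can be illustrated with a toy model in plain Python. This is only a sketch: the file names, the 25-rows-per-file layout, and the helper `files_to_rewrite` are hypothetical and do not reflect how Delta actually lays out or tracks data.

```python
def files_to_rewrite(files, changed_ids):
    """Return the names of files that hold at least one changed row id.

    `files` maps a (hypothetical) parquet file name to the set of row ids
    stored in that file.
    """
    changed = set(changed_ids)
    return sorted(name for name, ids in files.items() if ids & changed)


# Toy layout: 4 files of 25 rows each (illustrative only).
files = {
    f"part-{i:05d}.parquet": set(range(i * 25, (i + 1) * 25))
    for i in range(4)
}

# Only row id 1 changes, so only the file holding it needs a replacement.
touched = files_to_rewrite(files, changed_ids=[1])
print(touched)  # ['part-00000.parquet']
```

In this model a single changed row leads to one file being replaced, which is the "at most 3 files" expectation once the removed file and the transaction log entry are counted.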


## Install & initialize Spark and its dependencies

In [1]:
!apt-get -qq install -y openjdk-8-jdk-headless > /tmp/apt-get.out
!(wget -q --show-progress -nc https://mirrors.ocf.berkeley.edu/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz)
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
try:
  import pyspark, findspark, delta
except ImportError:
  # Install the Python packages only when they are not already available.
  %pip install -q --upgrade pyspark findspark delta




In [3]:
import findspark
import pyspark
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

MAX_MEMORY = "8g"
findspark.init()
from pyspark.sql import SparkSession

# Build a session with the Delta Lake package, SQL extension, and catalog enabled.
spark = (SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.executor.memory", MAX_MEMORY)
    .config("spark.driver.memory", MAX_MEMORY)
    .enableHiveSupport()
    .getOrCreate()
    )

spark

## Create Test Delta Tables

* one for the merge source, and 
* one for the merge destination

In [4]:
src_df = spark.range(1000   ).selectExpr("id","'data for ' || id as data")
dst_df = spark.range(100_000).selectExpr("id","'data for ' || id as data")

src_df.write.format('delta').mode('overwrite').saveAsTable("tmp_merge_src")
dst_df.write.format('delta').mode('overwrite').saveAsTable("tmp_merge_dst")


In [5]:
print("Merge Source")
src_df.show(3)
print("Merge Destination")
dst_df.show(3)

Merge Source
+---+----------+
| id|      data|
+---+----------+
|  0|data for 0|
|  1|data for 1|
|  2|data for 2|
+---+----------+
only showing top 3 rows

Merge Destination
+---+----------+
| id|      data|
+---+----------+
|  0|data for 0|
|  1|data for 1|
|  2|data for 2|
+---+----------+
only showing top 3 rows



In [6]:
# debug function to show table history
import json
import pyspark.sql.functions as psf
def show_history():
    hist = spark.sql("describe history tmp_merge_dst").selectExpr(
       "timestamp","operation","operationmetrics", "operationparameters", 
    ).sort(psf.col('timestamp').desc())
    print(json.dumps(hist.take(1),default=str,indent=1))

## This MERGE INTO statement updates only a single row, but unnecessarily writes over 100 parquet files.

* As the history below shows, the merge correctly reports that only a single row was modified (numTargetRowsUpdated is 1). However, it copies 49999 unchanged rows and adds 200 files.
* Given the "AND s.data <> d.data" clause, it should have been able to tell that most of the data was unaffected.


In [7]:
spark.sql("""
  UPDATE tmp_merge_src SET data='only this row should be updated, but it seems many are rewritten' where id = 1
""")

spark.sql("""
MERGE INTO tmp_merge_dst as d
     USING tmp_merge_src as s
     ON (s.id = d.id)
     WHEN MATCHED AND s.data <> d.data THEN UPDATE SET *
     WHEN NOT MATCHED THEN INSERT *
""")
show_history()

[
 [
  "2021-11-19 05:35:49",
  "MERGE",
  {
   "numOutputRows": "50000",
   "numTargetRowsInserted": "0",
   "numTargetRowsUpdated": "1",
   "numTargetFilesAdded": "200",
   "numTargetFilesRemoved": "1",
   "numTargetRowsDeleted": "0",
   "scanTimeMs": "4682",
   "numSourceRows": "1000",
   "executionTimeMs": "16412",
   "numTargetRowsCopied": "49999",
   "rewriteTimeMs": "11724"
  },
  {
   "matchedPredicates": "[{\"predicate\":\"(NOT (s.`data` = d.`data`))\",\"actionType\":\"update\"}]",
   "predicate": "(s.`id` = d.`id`)",
   "notMatchedPredicates": "[{\"actionType\":\"insert\"}]"
  }
 ]
]
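
To make the mismatch easier to read, the metrics above can be condensed into a few integers. This is a minimal sketch: `summarize_merge` is a hypothetical helper, and the dictionary simply repeats values from the history output shown above (Delta reports them as strings).

```python
def summarize_merge(metrics):
    """Condense MERGE operationMetrics (string values) into a few integers."""
    rows_changed = (
        int(metrics["numTargetRowsUpdated"])
        + int(metrics["numTargetRowsInserted"])
        + int(metrics["numTargetRowsDeleted"])
    )
    return {
        "rows_changed": rows_changed,
        "rows_copied": int(metrics["numTargetRowsCopied"]),
        "files_added": int(metrics["numTargetFilesAdded"]),
        "files_removed": int(metrics["numTargetFilesRemoved"]),
    }


# Values copied from the history output above.
metrics = {
    "numTargetRowsInserted": "0",
    "numTargetRowsUpdated": "1",
    "numTargetRowsDeleted": "0",
    "numTargetRowsCopied": "49999",
    "numTargetFilesAdded": "200",
    "numTargetFilesRemoved": "1",
}

summary = summarize_merge(metrics)
print(summary)
```

One row changed, 49999 rows were copied unchanged, and one removed file was replaced by 200 new ones. The count of 200 matches Spark's default `spark.sql.shuffle.partitions` value, which may be where it comes from; that is an inference, not something the history output states.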


## This workaround produces the expected results.

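* By also filtering in the USING clause instead of relying only on the WHEN MATCHED clause, the merge sees a single source row, removes 1 file, adds only 2 files (plus the transaction log entry), and copies 38 rows instead of 49999.
* Ideally the filter in the WHEN MATCHED clause alone would have the same effect.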
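
The effect of the USING-clause filter can be sketched in plain Python as an anti-join: keep only the source rows that have no identical (id, data) row in the destination. The helper `rows_needing_merge` is hypothetical, the rows are modelled as tuples, and SQL NULL comparison semantics are ignored here.

```python
def rows_needing_merge(src_rows, dst_rows):
    """Keep source rows with no identical (id, data) row in the destination.

    A source row is dropped only when the destination already holds the
    same id with the same data, like a NOT EXISTS filter on both columns.
    """
    existing = set(dst_rows)
    return [row for row in src_rows if row not in existing]


# Same shape as the test tables: 1000 source rows, 100000 destination rows.
src = [(i, f"data for {i}") for i in range(1000)]
dst = [(i, f"data for {i}") for i in range(100_000)]
src[1] = (1, "only this row should be updated")

to_merge = rows_needing_merge(src, dst)
print(to_merge)  # [(1, 'only this row should be updated')]
```

Only one source row survives the filter, so the merge has a single row to join against the destination.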


In [None]:
spark.sql("""
  UPDATE tmp_merge_src SET data='only this row should be updated' where id = 1
""")

spark.sql("""
MERGE INTO tmp_merge_dst as d
     USING (select * from tmp_merge_src as n where not exists (select 1 from tmp_merge_dst as o where n.id=o.id and n.data = o.data)) as s
     ON (s.id = d.id)
     WHEN MATCHED AND s.data <> d.data THEN UPDATE SET *
     WHEN NOT MATCHED THEN INSERT *
""")
show_history()

[
 [
  "2021-11-18 03:07:56.899000",
  "MERGE",
  {
   "numOutputRows": "39",
   "numTargetRowsInserted": "0",
   "numTargetRowsUpdated": "1",
   "numTargetFilesAdded": "2",
   "numTargetFilesRemoved": "1",
   "numTargetRowsDeleted": "0",
   "scanTimeMs": "874",
   "numSourceRows": "1",
   "executionTimeMs": "1749",
   "numTargetRowsCopied": "38",
   "rewriteTimeMs": "873"
  },
  {
   "matchedPredicates": "[{\"predicate\":\"(NOT (s.`data` = d.`data`))\",\"actionType\":\"update\"}]",
   "predicate": "(s.`id` = d.`id`)",
   "notMatchedPredicates": "[{\"actionType\":\"insert\"}]"
  }
 ]
]
