### Import Libraries

In [1]:
import pyspark
from delta import configure_spark_with_delta_pip
from pyspark.sql.functions import initcap

### Create Spark Session with Delta

In [2]:
#  Create a spark session with Delta
builder = pyspark.sql.SparkSession.builder.appName("DeltaTutorial") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# Create the Spark session with the Delta Lake packages added
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

:: loading settings :: url = jar:file:/Library/Python/3.9/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/sahilnagpal/.ivy2/cache
The jars for the packages stored in: /Users/sahilnagpal/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-624ab538-c9f0-4c80-838f-91accc027dd4;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.2.0 in central
	found io.delta#delta-storage;3.2.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 240ms :: artifacts dl 9ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.2.0 from central in [default]
	io.delta#delta-storage;3.2.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     | 

In [3]:
spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

DataFrame[]

### Reading the data

In [6]:
df = spark.read.parquet("/Users/sahilnagpal/Desktop/wordsToSpeak/delta_lake/dataset/invoices_201_99457.parquet")
df.show(5,truncate=False)

+-----------+----------+------+---+--------+--------+------+--------------+------------+-----------------+-------------+
|customer_id|invoice_no|gender|age|category|quantity|price |payment_method|invoice_date|shopping_mall    |_rescued_data|
+-----------+----------+------+---+--------+--------+------+--------------+------------+-----------------+-------------+
|201        |I885979   |Female|26 |Clothing|3       |900.24|Debit Card    |2021-07-04  |Metrocity        |NULL         |
|202        |I810217   |Female|51 |Clothing|3       |900.24|Cash          |2022-01-14  |Metrocity        |NULL         |
|203        |I499170   |Female|38 |Toys    |1       |35.84 |Credit Card   |2022-02-20  |Kanyon           |NULL         |
|204        |I792963   |Female|59 |Clothing|5       |1500.4|Debit Card    |2022-06-18  |Emaar Square Mall|NULL         |
|205        |I311151   |Female|39 |Souvenir|3       |35.19 |Credit Card   |2022-04-27  |Mall of Istanbul |NULL         |
+-----------+----------+------+---+--------+--------+------+--------------+------------+-----------------+-------------+

In [7]:
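# Write explicitly as Delta so that DeltaTable.forName can load this table in the Optimize step
df.repartition(5).write.format("delta").mode("overwrite").partitionBy("category").saveAsTable("demo_db.optimize_ex1")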

In [9]:
spark.conf.get("spark.sql.warehouse.dir")

'file:/Users/sahilnagpal/Desktop/wordsToSpeak/delta_lake/coding/spark-warehouse'
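Because the warehouse directory is a local path, the number of data files per partition can be checked directly on disk. The helper below is a minimal sketch, not part of the original notebook: the name `summarize_parquet_files` is hypothetical, and it assumes the table folder uses the usual `category=<value>` partition sub-directories.

```python
import os
from collections import defaultdict


def summarize_parquet_files(table_dir):
    """Return {partition_dir: (file_count, total_bytes)} for .parquet files under table_dir."""
    summary = defaultdict(lambda: [0, 0])
    for root, dirs, files in os.walk(table_dir):
        # Skip the Delta transaction log: it holds commit and checkpoint files, not table data.
        dirs[:] = [d for d in dirs if d != "_delta_log"]
        for name in files:
            # Checksum side files end in ".crc", so they are ignored here.
            if name.endswith(".parquet"):
                partition = os.path.relpath(root, table_dir)
                summary[partition][0] += 1
                summary[partition][1] += os.path.getsize(os.path.join(root, name))
    return {partition: tuple(stats) for partition, stats in summary.items()}
```

With the default warehouse layout the table folder would typically sit at `<warehouse>/demo_db.db/optimize_ex1`. For a Delta table, files that compaction has replaced stay on disk until `VACUUM` removes them, so the on-disk count can be higher than the number of files in the current table version.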

### Create Delta Table

In [23]:
df.repartition(5).write.format("delta").mode("overwrite").partitionBy("category").saveAsTable("demo_db.optimize_ex2")

                                                                                

### Before Optimize

In [21]:
%%time
df_ex1 = spark.read.table("demo_db.optimize_ex1")
df_out = df_ex1.where(df_ex1.category == "Clothing").collect()

                                                                                

CPU times: user 240 ms, sys: 36.6 ms, total: 276 ms
Wall time: 2.3 s


### Optimize the Table

In [24]:
from delta.tables import DeltaTable
table = DeltaTable.forName(spark, "demo_db.optimize_ex1")
table.optimize().executeCompaction()

                                                                                

DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,clusteringStats:struct<inputZCubeFiles:struct<numFiles:bigint,size:bigint>,inputOtherFiles:struct<numFiles:bigint,size:bigint>,inputNumZCubes:bigint,mergedFiles:struct<numFiles:bigint,size:bigint>,numOutputZCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterPar

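In [15]:
# VACUUM rejects retention periods shorter than the default 7 days unless this safety check is disabled
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

DataFrame[key: string, value: string]

In [27]:
# A retention of 0 hours deletes every file not referenced by the current table version.
# This removes the ability to time travel to earlier versions, so use it only on demo tables.
table.vacuum(0)

Deleted 40 files and directories in a total of 9 directories.


DataFrame[]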

### After Optimize

In [30]:
%%time
df_ex1 = spark.read.table("demo_db.optimize_ex1")
df_out = df_ex1.where(df_ex1.category == "Clothing").collect()

CPU times: user 297 ms, sys: 18.2 ms, total: 315 ms
Wall time: 735 ms
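A single `%%time` reading can include one-off costs such as JVM warm-up or cold file-system caches, so the before and after numbers are easier to compare when each query is run several times. The helper below is a hedged sketch; `median_wall_time` is a hypothetical name, not something the notebook defines.

```python
import statistics
import time


def median_wall_time(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Usage inside the notebook (assumes the `spark` session created earlier):
# median_wall_time(
#     lambda: spark.read.table("demo_db.optimize_ex1")
#     .where("category = 'Clothing'")
#     .collect()
# )
```

The median is used rather than the mean so that one slow first run does not dominate the result.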


In [22]:
spark.sql("drop table demo_db.optimize_ex2")

DataFrame[]

### Small File Problem in Delta Lake (Spark)

#### What is the Small File Problem?
The small file problem occurs when a Delta table contains a large number of small Parquet files instead of fewer large ones.
This leads to **performance degradation** in reading, writing, and optimizing data due to excessive metadata handling and I/O operations.

---

#### Root Causes

#### 1. **Frequent Micro-Batch Writes**
- **Description:** Streaming jobs or micro-batch pipelines write small amounts of data per batch.
- **Impact:** Generates many small files, one for each micro-batch.
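To get a feel for how quickly micro-batches add up, the arithmetic can be sketched as follows. The function name and the example trigger intervals are illustrative assumptions, and the sketch assumes every trigger commits data.

```python
def files_per_day(trigger_interval_seconds, files_per_batch=1):
    """Files added in 24 hours if every trigger commits `files_per_batch` files."""
    batches_per_day = 86_400 // trigger_interval_seconds
    return batches_per_day * files_per_batch


print(files_per_day(10))       # 8640
print(files_per_day(60, 4))    # 5760
print(files_per_day(3600, 4))  # 96
```

Lengthening the trigger interval from one minute to one hour cuts the daily file count by a factor of 60 in this model, at the cost of higher latency.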

---

#### 2. **High Partition Count**
- **Description:** Over-partitioning by multiple columns or by high-cardinality columns (e.g., `user_id`).
- **Impact:** Creates many small partitions and small files inside each partition folder.
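A rough upper bound on the files produced by one partitioned write follows from the fact that each write task emits a separate file for every partition value it holds rows for. The sketch below works through that bound; the function name and the partition counts are assumptions chosen for illustration.

```python
def max_files_per_write(num_write_tasks, num_partition_values):
    """Upper bound on files from one write: every task may hold rows for every partition value."""
    return num_write_tasks * num_partition_values


# repartition(5) as in the notebook's write; 8 partition values is an assumed figure.
print(max_files_per_write(5, 8))         # 40
# 200 write tasks partitioned by a column with 10,000 distinct values (e.g. user_id).
print(max_files_per_write(200, 10_000))  # 2000000
```

Repartitioning by the partition column before the write brings the count closer to one file per partition value, because all rows for a value then land in the same task.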

---

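#### 3. **Low `spark.sql.files.maxRecordsPerFile` or `spark.sql.files.maxPartitionBytes` Settings**
- **Description:** `spark.sql.files.maxRecordsPerFile` caps the number of records written to a single file. `spark.sql.files.maxPartitionBytes` caps the bytes packed into one input partition at read time, so it affects output only indirectly: more input partitions mean more tasks, and each task writes its own files.
- **Impact:** If either is set too low, Spark produces more, smaller files.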
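The effect of `spark.sql.files.maxRecordsPerFile` on a single write task can be worked out with a ceiling division. This is a sketch under the assumption that the task writes into one partition directory; the function name is hypothetical, and a value of 0 (the Spark default) is treated as "no limit".

```python
import math


def files_for_task(rows_in_task, max_records_per_file):
    """Files one task writes into one partition directory under a per-file record cap."""
    if rows_in_task == 0:
        return 0
    if max_records_per_file <= 0:
        # 0 or a negative value means the cap is disabled: all rows go into one file.
        return 1
    return math.ceil(rows_in_task / max_records_per_file)


print(files_for_task(1_000_000, 0))        # 1
print(files_for_task(1_000_000, 250_000))  # 4
print(files_for_task(1_000_000, 1_000))    # 1000
```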

---

#### 4. **Append-Only Writes without Compaction**
- **Description:** Continuous `append` operations without periodic `OPTIMIZE`.
- **Impact:** Every append adds new files and never merges existing ones, so small files accumulate over time, especially on tables that receive frequent appends.

---

#### 5. **Frequent Upserts (MERGE INTO)**
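- **Description:** A Delta `MERGE` rewrites each data file that contains a matched row, and the rewritten output can be split across several smaller files in every partition it touches.
- **Impact:** Repeated merges leave the table fragmented unless they are followed by compaction.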

---

#### 6. **Concurrent Writes**
- **Description:** Multiple writers (jobs) writing to the same table simultaneously.
- **Impact:** Each writer may create separate small files for the same partition.

---

#### 7. **Checkpointing in Streaming Pipelines**
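- **Description:** Streaming queries commit output to the Delta table on every trigger, with the checkpoint recording progress between micro-batches.
- **Impact:** The shorter the trigger interval, the more commits there are, and each commit that carries data adds at least one new file.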

---

### Why is it a Problem?
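- Slower queries: every file adds listing, open, and Parquet footer-read overhead before any rows are scanned.
- Larger Delta log: each data file is tracked as a separate `add` action, so reconstructing the table state takes longer.
- Longer `OPTIMIZE` and `VACUUM` runs, since both have to enumerate more files.
- Higher storage costs: more object-store requests, and small files tend to compress less effectively.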

---

### Solutions
- Use `OPTIMIZE` to compact small files into larger ones.
- Adjust partitioning strategy to reduce over-partitioning.
- Tune `spark.sql.files.maxRecordsPerFile` and related configs.
- Batch data before writing (reduce write frequency).
- Schedule periodic compaction for streaming tables.
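The first and last items can be combined into a small helper that a scheduled job calls. This is a minimal sketch, assuming the `delta-spark` package used earlier in this notebook; `compact_table` is a hypothetical name, and the optional filter is expected to reference partition columns only.

```python
def compact_table(spark, table_name, partition_filter=None):
    """Run OPTIMIZE compaction on a Delta table, optionally limited to some partitions."""
    # Imported here so the helper can be defined before the Delta session is configured.
    from delta.tables import DeltaTable

    builder = DeltaTable.forName(spark, table_name).optimize()
    if partition_filter:
        # Restrict compaction to matching partitions, e.g. "category = 'Clothing'".
        builder = builder.where(partition_filter)
    # Returns a DataFrame of metrics such as numFilesAdded and numFilesRemoved.
    return builder.executeCompaction()


# Usage (assumes the `spark` session and the table created earlier in this notebook):
# compact_table(spark, "demo_db.optimize_ex1", "category = 'Clothing'")
```

Limiting compaction to recently written partitions keeps each scheduled run short, since partitions that are already compacted are not rescanned.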

