
In [None]:
!pip install delta-spark==2.0.0

from pyspark.sql import SparkSession

# Build a Spark session with the Delta Lake package, SQL extension, and catalog enabled.
spark = SparkSession.builder \
    .appName("SparkDeltaMergeSchema") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.0.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

Collecting delta-spark==2.0.0
  Downloading delta_spark-2.0.0-py3-none-any.whl.metadata (1.9 kB)
Collecting pyspark<3.3.0,>=3.2.0 (from delta-spark==2.0.0)
  Downloading pyspark-3.2.4.tar.gz (281.5 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5 (from pyspark<3.3.0,>=3.2.0->delta-spark==2.0.0)
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl.metadata (1.5 kB)
Downloading delta_spark-2.0.0-py3-none-any.whl (20 kB)
Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.4-py2.py3-none-any.whl size=282040920 sha256=ff4f21312f8279234a3719354a399aac0e

In [None]:
!rm -rf /content/sample_data/merge_schema

In [None]:
from delta.tables import DeltaTable

# Create an empty Delta table named 'employee' at a fixed storage path.
DeltaTable.create(spark) \
    .tableName('employee') \
    .addColumn('empid', 'INT') \
    .addColumn('empname', 'STRING') \
    .addColumn('gender', 'STRING') \
    .addColumn('dept', 'STRING') \
    .addColumn('salary', 'INT') \
    .property('description', 'Table for mergeSchema Demo') \
    .location('/content/sample_data/merge_schema') \
    .execute()


<delta.tables.DeltaTable at 0x7ebf9f4d7910>

In [None]:
spark.sql('select * from employee').show()

+-----+-------+------+----+------+
|empid|empname|gender|dept|salary|
+-----+-------+------+----+------+
+-----+-------+------+----+------+



In [None]:
spark.sql('insert into employee values(1,"Sagar","M","Billing",30000);')

DataFrame[]

In [None]:
spark.sql('select * from employee').show()

+-----+-------+------+-------+------+
|empid|empname|gender|   dept|salary|
+-----+-------+------+-------+------+
|    1|  Sagar|     M|Billing| 30000|
+-----+-------+------+-------+------+



In [None]:
import pandas as pd

# Provided data
employee_data = [{'empid': 2, 'empname': 'Revanth', 'gender': 'M', 'dept': 'HR', 'salary': 25000, 'additional_col': 'test'}]

# Create a pandas DataFrame
df_pandas = pd.DataFrame(employee_data)

# Display the DataFrame
print("Pandas DataFrame created:")
display(df_pandas)

# Save the DataFrame to a CSV file
df_pandas.to_csv("employee_data.csv", index=False)

print("\nDataFrame saved to employee_data.csv")

Pandas DataFrame created:


   empid  empname gender dept  salary additional_col
0      2  Revanth      M   HR   25000           test



DataFrame saved to employee_data.csv


In [None]:
# Read the CSV file into a Spark DataFrame
spark_df = spark.read.csv("employee_data.csv", header=True, inferSchema=True)

# Write the Spark DataFrame to a Delta Lake table
delta_table_path = "/content/sample_data/merge_schema"
spark_df.write.format("delta").mode("append").save(delta_table_path)

AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: 196ed77e-85eb-4f0d-8a12-63f89cc0177e).
To enable schema migration using DataFrameWriter or DataStreamWriter, please set:
'.option("mergeSchema", "true")'.
For other operations, set the session configuration
spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation
specific to the operation for details.

Table schema:
root
-- empid: integer (nullable = true)
-- empname: string (nullable = true)
-- gender: string (nullable = true)
-- dept: string (nullable = true)
-- salary: integer (nullable = true)


Data schema:
root
-- empid: integer (nullable = true)
-- empname: string (nullable = true)
-- gender: string (nullable = true)
-- dept: string (nullable = true)
-- salary: integer (nullable = true)
-- additional_col: string (nullable = true)

         

As the error message shows, appending the CSV data to the existing table fails with a schema mismatch: the DataFrame contains a column, additional_col, that the table does not have. To get past this, set the "mergeSchema" option to "true" when writing. Delta Lake then adds the new column to the table schema and appends the data, so the pipeline does not break. Rows that were written before the schema change show null for the newly added column.
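
Before relying on "mergeSchema", it can help to see exactly which columns differ between the table and the incoming data. The following is a minimal sketch, not part of the original notebook: `schema_diff` is a hypothetical helper that compares two lists of (name, type) pairs, the same shape a Spark DataFrame's `dtypes` attribute returns. The column lists are copied by hand from the "Table schema" and "Data schema" sections of the error message, and matching names case-insensitively is an assumption made here for illustration.

```python
def schema_diff(table_dtypes, data_dtypes):
    """Compare two lists of (column_name, type_name) pairs.

    Returns two lists: columns found only in the incoming data, and
    columns found in both whose type names differ.
    """
    # Assumption: column names are matched case-insensitively.
    table_types = {name.lower(): dtype for name, dtype in table_dtypes}
    added, type_changed = [], []
    for name, dtype in data_dtypes:
        key = name.lower()
        if key not in table_types:
            added.append((name, dtype))
        elif table_types[key] != dtype:
            type_changed.append((name, table_types[key], dtype))
    return added, type_changed


# Copied from the "Table schema" and "Data schema" in the error message.
table_dtypes = [("empid", "int"), ("empname", "string"), ("gender", "string"),
                ("dept", "string"), ("salary", "int")]
data_dtypes = table_dtypes + [("additional_col", "string")]

added, type_changed = schema_diff(table_dtypes, data_dtypes)
print("added:", added)                # added: [('additional_col', 'string')]
print("type changed:", type_changed)  # type changed: []
```

In the notebook itself, `spark.table('employee').dtypes` and `spark_df.dtypes` could be passed in place of the hand-written lists. Type differences are reported separately because adding a column and changing an existing column's type are different kinds of schema change.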

In [None]:
spark_df.write.format("delta").option("mergeSchema","true").mode("append").save(delta_table_path)

In [None]:
spark.sql('select * from employee').show()

+-----+-------+------+-------+------+--------------+
|empid|empname|gender|   dept|salary|additional_col|
+-----+-------+------+-------+------+--------------+
|    2|Revanth|     M|     HR| 25000|          test|
|    1|  Sagar|     M|Billing| 30000|          null|
+-----+-------+------+-------+------+--------------+
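
The error message shown earlier also names a session-level setting, spark.databricks.delta.schema.autoMerge.enabled, as an alternative to passing the option on each write. A minimal sketch of setting it is below. `set_auto_merge` is a hypothetical wrapper introduced only for illustration; it assumes the object passed in exposes `conf.set(key, value)` the way a SparkSession does.

```python
# Configuration key quoted from the AnalysisException message.
AUTO_MERGE_KEY = "spark.databricks.delta.schema.autoMerge.enabled"


def set_auto_merge(session, enabled=True):
    """Set the Delta schema auto-merge flag on the given session.

    `session` is assumed to behave like a SparkSession, that is,
    to expose `session.conf.set(key, value)`.
    """
    value = "true" if enabled else "false"
    session.conf.set(AUTO_MERGE_KEY, value)
    return value


# In this notebook the call would be:
#     set_auto_merge(spark)
```

A session-wide flag affects every write made through that session, so the per-write "mergeSchema" option used above is the narrower choice when only one load is expected to bring new columns.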



In [None]:
spark.table('employee').show()

+-----+-------+------+-------+------+--------------+
|empid|empname|gender|   dept|salary|additional_col|
+-----+-------+------+-------+------+--------------+
|    2|Revanth|     M|     HR| 25000|          test|
|    1|  Sagar|     M|Billing| 30000|          null|
+-----+-------+------+-------+------+--------------+



**END OF CODE**