# Soumil Nitin Shah 
Bachelor in Electronic Engineering |
Master's in Electrical Engineering |
Master's in Computer Engineering

* Website: http://soumilshah.com/
* GitHub: https://github.com/soumilshah1995
* LinkedIn: https://www.linkedin.com/in/shah-soumil/
* Blog: https://soumilshah1995.blogspot.com/
* YouTube: https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
* Facebook Page: https://www.facebook.com/soumilshah1995/
* Email: shahsoumil519@gmail.com
* Projects: https://soumilshah.herokuapp.com/project

* I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive experience developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach Data Science, Machine Learning, Elasticsearch, and AWS. I work as Data Team Lead at JobTarget, where I spend most of my time developing an ingestion framework and building microservices and scalable architecture on AWS. I have worked with large volumes of data, including building data lakes (1.2T) and optimizing data lake queries through partitioning and the right choice of file format and compression. I have also developed a streaming application that ingests real-time data streams into Elasticsearch via Kinesis and Firehose.

# Learn About HUDI Soft Deletes



In [221]:
try:
    import os
    import sys
    import uuid
    from functools import reduce  # used later to fold column transformations

    import pyspark
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession
    # Import only the functions we use; a wildcard import would shadow
    # Python built-ins such as filter, sum, and max.
    from pyspark.sql.functions import col, asc, desc, lit

    import boto3
    from faker import Faker
    print("All modules are loaded .....")

except Exception as e:
    print("Some modules are missing {} ".format(e))


All modules are loaded .....


In [222]:
SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()

In [223]:
spark

# Inserting some data into Hudi Tables 

In [224]:
db_name = "hudidb"
table_name = "hudi_table"

recordkey = 'emp_id'  # field that uniquely identifies a record
precombine = 'ts'     # field used to pick the latest version when incoming records share a key

path = f"file:///C:/tmp/{db_name}/{table_name}"  # local Windows path; adjust for your environment
method = 'upsert'
table_type = "COPY_ON_WRITE"  # COPY_ON_WRITE | MERGE_ON_READ

hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': table_type,
    'hoodie.datasource.write.recordkey.field': recordkey,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': method,
    'hoodie.datasource.write.precombine.field': precombine,
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

data_items = [
    (11, "This is APPEND", "Sales", "RJ", 81000, 30, 23000, 827307999),
    (12, "This is APPEND", "Engineering", "RJ", 79000, 53, 15000, 1627694678),
]
columns = ["emp_id", "employee_name", "department", "state", "salary", "age", "bonus", "ts"]

spark_df = spark.createDataFrame(data=data_items, schema=columns)

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)

df = spark. \
      read. \
      format("hudi"). \
      load(path)

df.select(["_hoodie_file_name", "emp_id", "employee_name"]).show(truncate=False)

+--------------------------------------------------------------------------+------+--------------+
|_hoodie_file_name                                                         |emp_id|employee_name |
+--------------------------------------------------------------------------+------+--------------+
|1c288720-ab7a-43c1-8482-17b64a0d9a94-0_0-717-853_20230117175856050.parquet|12    |This is APPEND|
|1c288720-ab7a-43c1-8482-17b64a0d9a94-0_0-717-853_20230117175856050.parquet|11    |This is APPEND|
+--------------------------------------------------------------------------+------+--------------+
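
The write options used above can be collected in a small helper so that the same settings are reused for the initial insert and for the later soft-delete write. The sketch below is only an illustration: `build_hudi_options` and `VALID_OPERATIONS` are hypothetical names, not part of Hudi, and the set of operations it accepts is a small subset chosen for this example.

```python
# Hypothetical helper (not part of Hudi): rebuilds the write options shown above.
# The accepted operations are a small subset picked for illustration.
VALID_OPERATIONS = {"insert", "upsert", "bulk_insert", "delete"}


def build_hudi_options(table_name, record_key, precombine,
                       operation="upsert", parallelism=2):
    """Return a dict of Hudi write options for a Spark DataFrame writer."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.upsert.shuffle.parallelism": parallelism,
        "hoodie.insert.shuffle.parallelism": parallelism,
    }


options = build_hudi_options("hudi_table", "emp_id", "ts")
print(options["hoodie.datasource.write.operation"])  # upsert
```

The resulting dict can be passed to the writer in the same way as `hudi_options`, that is, with `.options(**options)`.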



# Performing Soft Deletes

##### Step 1: Create Snapshot  

In [225]:
spark. \
      read. \
      format("hudi"). \
      load(path). \
      createOrReplaceTempView("hudi_snapshot")

##### Step 2: Write a SQL query that selects the records to soft delete

In [226]:
soft_delete_ds = spark.sql("SELECT * FROM hudi_snapshot WHERE emp_id = 11")

soft_delete_ds.select(["_hoodie_file_name", "emp_id", "employee_name"]).show(truncate=False)

+--------------------------------------------------------------------------+------+--------------+
|_hoodie_file_name                                                         |emp_id|employee_name |
+--------------------------------------------------------------------------+------+--------------+
|1c288720-ab7a-43c1-8482-17b64a0d9a94-0_0-717-853_20230117175856050.parquet|11    |This is APPEND|
+--------------------------------------------------------------------------+------+--------------+
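
When more than one record has to be soft deleted, the selection query can be generated from a list of keys. This is a minimal sketch that assumes integer record keys, as with `emp_id` in the sample data; `build_soft_delete_query` is a hypothetical helper name introduced here for illustration.

```python
def build_soft_delete_query(view_name, key_column, keys):
    """Build a SELECT that picks the rows to soft delete by integer key."""
    if not keys:
        raise ValueError("at least one key is required")
    # int() raises on non-numeric input, so only integer literals reach the SQL text.
    key_list = ", ".join(str(int(k)) for k in keys)
    return f"SELECT * FROM {view_name} WHERE {key_column} IN ({key_list})"


query = build_soft_delete_query("hudi_snapshot", "emp_id", [11])
print(query)  # SELECT * FROM hudi_snapshot WHERE emp_id IN (11)
```

The generated string can then be handed to `spark.sql(query)` in place of the hand-written query.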



In [227]:
from functools import reduce
from pyspark.sql.functions import lit

# Hudi metadata columns; they are regenerated on write, so drop them first.
meta_columns = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name"
]
# Keep the record key and the precombine field; every other column is nulled.
excluded_columns = meta_columns + ["ts", "emp_id"]

# (name, dataType) pairs for the columns to nullify.
nullify_columns = [(f.name, f.dataType) for f in soft_delete_ds.schema.fields
                   if f.name not in excluded_columns]
# Replace each remaining column with a typed NULL so the schema stays unchanged.
soft_delete_df = reduce(
    lambda df, c: df.withColumn(c[0], lit(None).cast(c[1])),
    nullify_columns,
    soft_delete_ds.drop(*meta_columns),
)
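
The effect of the cell above on a single record can be pictured without Spark. The sketch below is a plain-Python model of the same idea, not the Spark implementation: the record key and the precombine field are kept, and every other field becomes `None`. The function name `soft_delete_row` is hypothetical.

```python
def soft_delete_row(row, keep_columns):
    """Return a copy of row with every column outside keep_columns set to None."""
    return {name: (value if name in keep_columns else None)
            for name, value in row.items()}


# Employee 11 from the sample data, written as a plain dict.
employee = {
    "emp_id": 11, "employee_name": "This is APPEND", "department": "Sales",
    "state": "RJ", "salary": 81000, "age": 30, "bonus": 23000, "ts": 827307999,
}
tombstone = soft_delete_row(employee, keep_columns={"emp_id", "ts"})
print(tombstone)
```

The returned dict has the same keys as the input, which mirrors why the Spark version casts each null to the original column type: the table schema must not change.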


In [228]:
soft_delete_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)
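
Writing the nulled rows back with the `upsert` operation replaces the stored record that has the same record key, which is what turns the write into a soft delete. The following is a deliberately simplified plain-Python model of that behaviour, under the assumption that an incoming row always overwrites the stored row with the same key. It ignores precombine ordering, file groups, and payload classes, so it should not be read as Hudi's actual merge logic.

```python
def upsert(table, incoming, key):
    """Simplified model: an incoming row replaces the stored row with the same key."""
    merged = {row[key]: row for row in table}
    for row in incoming:
        merged[row[key]] = row  # same key -> overwrite, new key -> insert
    return list(merged.values())


table = [
    {"emp_id": 11, "department": "Sales", "ts": 827307999},
    {"emp_id": 12, "department": "Engineering", "ts": 1627694678},
]
soft_deleted = [{"emp_id": 11, "department": None, "ts": 827307999}]

result = upsert(table, soft_deleted, key="emp_id")
print(len(result))  # 2
```

The row count stays at two: unlike a hard delete, the key `11` is still present, only its non-key fields are empty.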


# Read Again from the Data Lake

In [229]:
df = spark. \
      read. \
      format("hudi"). \
      load(path)

df.toPandas()

| | _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | emp_id | employee_name | department | state | salary | age | bonus | ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20230117175856050 | 20230117175856050_0_0 | 12 | | 1c288720-ab7a-43c1-8482-17b64a0d9a94-0_0-753-8... | 12 | This is APPEND | Engineering | RJ | 79000.0 | 53.0 | 15000.0 | 1627694678 |
| 1 | 20230117180008210 | 20230117180008210_0_1 | 11 | | 1c288720-ab7a-43c1-8482-17b64a0d9a94-0_0-753-8... | 11 | | | | | | | 827307999 |
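
In the result above, record `11` is still present but only `emp_id` and `ts` carry values. A check like the one below could be used to list which keys have been soft deleted. It is a plain-Python sketch over dict rows, and `find_soft_deleted_keys` is a hypothetical helper name.

```python
def find_soft_deleted_keys(rows, key_column, keep_columns):
    """Return keys of rows whose columns outside keep_columns are all None."""
    deleted = []
    for row in rows:
        others = [v for name, v in row.items() if name not in keep_columns]
        if others and all(v is None for v in others):
            deleted.append(row[key_column])
    return deleted


rows = [
    {"emp_id": 12, "employee_name": "This is APPEND",
     "department": "Engineering", "ts": 1627694678},
    {"emp_id": 11, "employee_name": None, "department": None, "ts": 827307999},
]
print(find_soft_deleted_keys(rows, "emp_id", {"emp_id", "ts"}))  # [11]
```

With Spark, the rows could be obtained with `[r.asDict() for r in df.drop(*meta_columns).collect()]`. The Hudi metadata columns must be dropped first (or added to `keep_columns`), because they are never null.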
