## Soumil Nitin Shah 
Bachelor in Electronic Engineering |
Masters in Electrical Engineering | 
Master in Computer Engineering |

* Website : http://soumilshah.com/
* Github: https://github.com/soumilshah1995
* Linkedin: https://www.linkedin.com/in/shah-soumil/
* Blog: https://soumilshah1995.blogspot.com/
* Youtube : https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
* Facebook Page : https://www.facebook.com/soumilshah1995/
* Email : shahsoumil519@gmail.com
* projects : https://soumilshah.herokuapp.com/project

* I earned a Bachelor of Science in Electronic Engineering and a double master's in Electrical and Computer Engineering. I have extensive experience developing scalable, high-performance software applications in Python. I run a YouTube channel where I teach Data Science, Machine Learning, Elasticsearch, and AWS. I work as Data Team Lead at Jobtarget, where I spend most of my time developing an ingestion framework and building microservices and scalable architecture on AWS. I have worked with large volumes of data, including building data lakes (1.2T) and optimizing data lake queries through partitioning and the right choice of file format and compression. I have also developed streaming applications that ingest real-time data via Kinesis and Firehose into Elasticsearch.

#### Goal
###### The goal of this lab is to teach you the fundamental concepts of HUDI

## Step 1: 
##### Define Imports 

In [1]:
try:
    import os
    import sys
    import uuid

    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    # The wildcard import already covers col, asc, desc, to_timestamp, to_date, when, etc.
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    from datetime import datetime
    from functools import reduce
    from faker import Faker
except ImportError as e:
    # Fail loudly: the later cells cannot run without these modules.
    raise ImportError(f"Missing dependency: {e}") from e
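
Before running the imports, it can help to check which third-party packages are actually installed. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of the original lab.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Third-party packages this lab imports.
required = ["pyspark", "faker"]
missing = missing_modules(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
```

If the list is non-empty, installing the named packages (for example with pip) should let the import cell run.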

# Step 2:
#### Create Spark Instance 

In [2]:
SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()

In [3]:
spark
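
The `--packages` string above hard-codes three versions: Spark (3.3), Scala (2.12), and Hudi (0.12.1). A small sketch of how that string could be assembled from its parts is shown below. The function name `hudi_submit_args` is hypothetical, and the naming pattern is copied from the string used in this lab, so other version combinations are an assumption and may not be published.

```python
def hudi_submit_args(spark_minor="3.3", scala="2.12", hudi="0.12.1"):
    """Build the PYSPARK_SUBMIT_ARGS value for a Hudi Spark bundle.

    Defaults reproduce the coordinates used in this lab; other
    combinations are not guaranteed to exist.
    """
    coordinate = f"org.apache.hudi:hudi-spark{spark_minor}-bundle_{scala}:{hudi}"
    return f"--packages {coordinate} pyspark-shell"


print(hudi_submit_args())
```

The Spark version in the bundle name should match the installed `pyspark` version, which you can inspect with `pyspark.__version__`.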

# Step 3: 
#### Define Hudi Settings for this project

In [5]:
db_name = "hudidb"
table_name = "hudi_table"

recordkey = 'uuid'
precombine = 'precomb'

path = f"file:///C:/tmp/{db_name}/{table_name}"

method = 'upsert'
table_type = "COPY_ON_WRITE"  # COPY_ON_WRITE | MERGE_ON_READ
partition_field = "partition"


hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.recordkey.field': recordkey,
    'hoodie.datasource.write.table.name': table_name,
    # Pass the table type explicitly; COPY_ON_WRITE is also the default
    'hoodie.datasource.write.table.type': table_type,
    'hoodie.datasource.write.operation': method,
    'hoodie.datasource.write.precombine.field': precombine,
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.write.partitionpath.field': partition_field,
}
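
The settings above boil down to three roles: the record key identifies a record, the precombine field decides which of several records with the same key wins, and the partition path field decides where the record is stored. As a sketch, the same dictionary can be produced by a small helper so the roles are visible as parameters. The function name `build_hudi_options` and its defaults are hypothetical; the option keys are the ones used in this lab.

```python
def build_hudi_options(table_name, recordkey, precombine, partition_field,
                       operation="upsert", parallelism=2):
    """Return Hudi write options mirroring the dictionary used in this lab."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.name": table_name,
        # Identifies a record within the table
        "hoodie.datasource.write.recordkey.field": recordkey,
        # Picks the winner among records that share a record key
        "hoodie.datasource.write.precombine.field": precombine,
        # Column whose value determines the partition path
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": operation,
        "hoodie.upsert.shuffle.parallelism": parallelism,
        "hoodie.insert.shuffle.parallelism": parallelism,
    }


options = build_hudi_options("hudi_table", "uuid", "precomb", "partition")
for key, value in options.items():
    print(f"{key} = {value}")
```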

# Step 4:
#### Let's create our Hudi data lake, insert records, and learn about the precomb key

In [6]:
data_items = [
    (1, "This is APPEND 1",  111, 1),
    (2, "This is APPEND 2",  222, 2),
]

columns = ["uuid", "message", "precomb", "partition"]

In [7]:
spark_df = spark.createDataFrame(data=data_items, schema=columns)

In [8]:
spark_df.show()

+----+----------------+-------+---------+
|uuid|         message|precomb|partition|
+----+----------------+-------+---------+
|   1|This is APPEND 1|    111|        1|
|   2|This is APPEND 2|    222|        2|
+----+----------------+-------+---------+



# Step 5:
##### Write the data into the HUDI table

In [None]:
spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)

In [34]:
df = spark. \
      read. \
      format("hudi"). \
      load(path)

In [35]:
df.select(["uuid", "precomb", "message", "_hoodie_commit_time", "_hoodie_commit_seqno"]).show(truncate=False)

+----+-------+----------------+-------------------+---------------------+
|uuid|precomb|message         |_hoodie_commit_time|_hoodie_commit_seqno |
+----+-------+----------------+-------------------+---------------------+
|1   |111    |This is APPEND 1|20230108134654761  |20230108134654761_0_0|
|2   |222    |This is APPEND 2|20230108134654761  |20230108134654761_0_1|
+----+-------+----------------+-------------------+---------------------+
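
The `_hoodie_commit_time` values in the output look like timestamps packed into digits. The sketch below parses them under the assumption that the format is `yyyyMMddHHmmss` followed by three millisecond digits, which matches the 17-digit values shown here; other Hudi versions may use a different width. The helper name `parse_commit_time` is hypothetical.

```python
from datetime import datetime, timedelta


def parse_commit_time(instant):
    """Parse a 17-digit commit time (assumed yyyyMMddHHmmssSSS) into a datetime."""
    if len(instant) != 17 or not instant.isdigit():
        raise ValueError(f"Unexpected commit time format: {instant!r}")
    base = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    return base + timedelta(milliseconds=int(instant[14:]))


first_commit = parse_commit_time("20230108134654761")
print(first_commit.isoformat(timespec="milliseconds"))  # 2023-01-08T13:46:54.761
```

Comparing parsed commit times across rows is a quick way to see which records were touched by a later write.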



# Step 6:
#### Understand the concepts

# Case 1: 
#### Same Precomb and ID 

In [36]:
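# Include the partition column so the upsert targets the partition that already holds uuid 1
data_items = [
    (1, "This is UPDATE 1",  111, 1),
    (1, "This is UPDATE 2",  111, 1),
]

columns = ["uuid", "message", "precomb", "partition"]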
spark_df = spark.createDataFrame(data=data_items, schema=columns)

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)

df = spark. \
      read. \
      format("hudi"). \
      load(path)

df.select(["uuid", "precomb", "message", "_hoodie_commit_time", "_hoodie_commit_seqno"]).show(truncate=False)


+----+-------+----------------+-------------------+---------------------+
|uuid|precomb|message         |_hoodie_commit_time|_hoodie_commit_seqno |
+----+-------+----------------+-------------------+---------------------+
|1   |111    |This is UPDATE 2|20230108134802216  |20230108134802216_0_0|
|2   |222    |This is APPEND 2|20230108134654761  |20230108134654761_0_1|
+----+-------+----------------+-------------------+---------------------+



# Case 2: 
### Same ID different precomb 

In [37]:
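# Include the partition column so the upsert targets the partition that already holds uuid 1
data_items = [
    (1, "This is UPDATE 1 ** ",  111, 1),
    (1, "This is UPDATE 2 **",  112, 1),
]

columns = ["uuid", "message", "precomb", "partition"]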
spark_df = spark.createDataFrame(data=data_items, schema=columns)

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)

df = spark. \
      read. \
      format("hudi"). \
      load(path)

df.select(["uuid", "precomb", "message", "_hoodie_commit_time", "_hoodie_commit_seqno"]).show(truncate=False)


+----+-------+-------------------+-------------------+---------------------+
|uuid|precomb|message            |_hoodie_commit_time|_hoodie_commit_seqno |
+----+-------+-------------------+-------------------+---------------------+
|1   |112    |This is UPDATE 2 **|20230108134924305  |20230108134924305_0_0|
|2   |222    |This is APPEND 2   |20230108134654761  |20230108134654761_0_1|
+----+-------+-------------------+-------------------+---------------------+



# Case 3: 
### Same ID, different precomb, switching the order of the records

In [38]:
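# Include the partition column so the upsert targets the partition that already holds uuid 1
data_items = [
    (1, "This is UPDATE 1 ## ",  114, 1),
    (1, "This is UPDATE 2 ## ",  115, 1),
]

columns = ["uuid", "message", "precomb", "partition"]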
spark_df = spark.createDataFrame(data=data_items, schema=columns)

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)

df = spark. \
      read. \
      format("hudi"). \
      load(path)

df.select(["uuid", "precomb", "message", "_hoodie_commit_time", "_hoodie_commit_seqno"]).show(truncate=False)


+----+-------+--------------------+-------------------+---------------------+
|uuid|precomb|message             |_hoodie_commit_time|_hoodie_commit_seqno |
+----+-------+--------------------+-------------------+---------------------+
|1   |115    |This is UPDATE 2 ## |20230108135029978  |20230108135029978_0_0|
|2   |222    |This is APPEND 2    |20230108134654761  |20230108134654761_0_1|
+----+-------+--------------------+-------------------+---------------------+



In [39]:
# Same ID again, but the record with the higher precomb value now comes first
data_items = [
    (1, "This is UPDATE 1 @ ",  115, 1),
    (1, "This is UPDATE 2 @ ",  112, 1),
]

columns = ["uuid", "message", "precomb", "partition"]
spark_df = spark.createDataFrame(data=data_items, schema=columns)

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)

df = spark. \
      read. \
      format("hudi"). \
      load(path)

df.select(["uuid", "precomb", "message", "_hoodie_commit_time", "_hoodie_commit_seqno"]).show(truncate=False)


+----+-------+-------------------+-------------------+---------------------+
|uuid|precomb|message            |_hoodie_commit_time|_hoodie_commit_seqno |
+----+-------+-------------------+-------------------+---------------------+
|1   |115    |This is UPDATE 1 @ |20230108135121430  |20230108135121430_0_0|
|2   |222    |This is APPEND 2   |20230108134654761  |20230108134654761_0_1|
+----+-------+-------------------+-------------------+---------------------+



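### Conclusion
* When records in a batch share the same ID and the same precomb key, HUDI keeps the most recent item (Case 1 kept "This is UPDATE 2").
* When records share the same ID but have different precomb keys, HUDI keeps the item with the highest precomb key, regardless of the order in which the records appear (Case 2 and Case 3).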
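
The two rules above can be modelled in a few lines of plain Python. This is only a simplified sketch of the behaviour observed in the three cases, not Hudi's actual implementation: it covers deduplication inside a single incoming batch and does not model how the winner is merged with the row already stored in the table. The function name `precombine_batch` is hypothetical, and the tie rule (the later record wins) is an assumption taken from the Case 1 output.

```python
def precombine_batch(batch):
    """Keep one record per key from a list of (key, message, precomb) tuples.

    The record with the highest precomb value wins; on a tie the later
    record in the batch wins (assumption based on the Case 1 output).
    """
    winners = {}
    for key, message, ordering in batch:
        current = winners.get(key)
        # ">=" replaces the current winner on ties, so later records win
        if current is None or ordering >= current[2]:
            winners[key] = (key, message, ordering)
    return list(winners.values())


case_1 = [(1, "This is UPDATE 1", 111), (1, "This is UPDATE 2", 111)]
case_2 = [(1, "This is UPDATE 1 ** ", 111), (1, "This is UPDATE 2 **", 112)]
case_3 = [(1, "This is UPDATE 1 @ ", 115), (1, "This is UPDATE 2 @ ", 112)]

for name, batch in [("Case 1", case_1), ("Case 2", case_2), ("Case 3", case_3)]:
    print(name, "->", precombine_batch(batch))
```

In this model the winners are "This is UPDATE 2", "This is UPDATE 2 **", and "This is UPDATE 1 @ ", matching the messages in the tables above.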