# Working with Apache Hudi Deltastreamer

The HoodieDeltaStreamer utility, part of `hudi-utilities-bundle`, provides a way to ingest data from sources such as DFS or Kafka.

In this notebook, you will learn to use the DeltaStreamer utility to bulk insert data into a Hudi dataset as a Copy on Write (CoW) table and then perform a batch upsert.

We will run queries in Amazon Athena and hudi-cli to verify that the table and the subsequent updates are incorporated into our data lake on Amazon S3.

Let's get started!

## Generate Data

### Install Python Faker 

In [None]:
!pip install Faker

### Fake Profile Generator

The fake profile generator uses Python's [Faker](https://faker.readthedocs.io/en/master/index.html) library. Let's define a function that generates a given number of random person profiles.

In [None]:
import os
import json
import random
import boto3
import io
from io import StringIO
from faker import Faker
from faker.providers import date_time, credit_card
from json import dumps


# Initialize the Faker library and the S3 resource
fake = Faker() 
fake.add_provider(date_time)
fake.add_provider(credit_card)

s3 = boto3.resource('s3')

# Write the fake profile data to an S3 bucket
# Replace with your own bucket
s3_bucket = "mrworkshop-youraccountID-dayone"
s3_load_prefix = 'hudi-ds/inputdata/'
s3_update_prefix = 'hudi-ds/updates/'

# Number of records in each file and number of files
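# Adjust as needed. The commented-out values produce much larger files (about 40MB each);
# the active values below produce files of roughly 3.7MB each.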
#num_records = 150000
#num_files = 50

num_records = 10000
num_files = 15

def generate_bulk_data():
    '''
    Generates bulk profile data
    '''
    # Generate number of files equivalent to num_files
    for i in range (num_files):
        fake_profile_data = fake_profile_generator(num_records, fake)
        fakeIO = StringIO()
        filename = 'profile_' + str(i + 1) + '.json'
        s3key = s3_load_prefix + filename 
        fakeIO.write(str(''.join(dumps_lines(fake_profile_data))))
        s3object = s3.Object(s3_bucket, s3key)
        s3object.put(Body=(bytes(fakeIO.getvalue().encode('UTF-8'))))
        fakeIO.close()

def generate_updates():
    '''
    Generates updates for the profiles
    '''
    # Update the records in every input file except the last one
    # (profile_1.json through profile_<num_files - 1>.json)
    random_file_list = []
    for i in range (1, num_files):
        random_file_list.append('profile_' + str(i) + '.json')
    for f in random_file_list:
        #print(f)
        s3key = s3_load_prefix + f
        obj = s3.Object(s3_bucket, s3key)
        profile_data = obj.get()['Body'].read().decode('utf-8')
        #s3_profile_list = json.loads(profile_data)
        stringIO_data = io.StringIO(profile_data)
        data = stringIO_data.readlines()
        # Parse each line as one JSON record
        json_data = list(map(json.loads, data))
        fakeIO = StringIO()
        s3key = s3_update_prefix + f
        fake_profile_data = []
        for rec in json_data:
            # Let's generate a new address
            #print ("old address: " + rec['street_address'])
            rec['street_address'] = fake.address()
            #print ("new address: " + rec['street_address'])
            fake_profile_data.append(rec)       
        fakeIO.write(str(''.join(dumps_lines(fake_profile_data))))
        s3object = s3.Object(s3_bucket, s3key)
        s3object.put(Body=(bytes(fakeIO.getvalue().encode('UTF-8'))))
        fakeIO.close()

def fake_profile_generator(length, fake, new_address=""):
    """
    Generates fake profiles
    """
    for x in range (length):       
        yield {'Name': fake.name(),
               'phone': fake.phone_number(),
               'job': fake.job(),
               'company': fake.company(),
               'ssn': fake.ssn(),
               'street_address': (new_address if new_address else fake.address()),
               'dob': (fake.date_of_birth(tzinfo=None, minimum_age=21, maximum_age=105).isoformat()),
               'email': fake.email(),
               'ts': (fake.date_time_between(start_date='-10y', end_date='now', tzinfo=None).isoformat()),
               'credit_card': fake.credit_card_number(),
               'record_id': fake.pyint(),
               'id': fake.uuid4()}
        
def dumps_lines(objs):
    for obj in objs:
        yield json.dumps(obj, separators=(',',':')) + '\n'   

### Start the data generator

The following cell starts the fake data generator. It produces `num_files` files in newline-delimited JSON format, each containing `num_records` records, and writes them to the specified S3 bucket.

In [None]:
generate_bulk_data()
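
The file layout can also be checked locally without touching S3. The following is a minimal sketch that reuses the `dumps_lines` helper from above on two hand-written records (the sample values are hypothetical stand-ins for Faker output). It shows that each record lands on exactly one line, even when a field such as `street_address` contains an embedded newline, because `json.dumps` escapes it.

```python
import json

def dumps_lines(objs):
    """Serialize each record as compact JSON on its own line (same helper as above)."""
    for obj in objs:
        yield json.dumps(obj, separators=(',', ':')) + '\n'

# Hypothetical sample records standing in for Faker output
sample_records = [
    {'id': 'a1', 'street_address': '1 Example St\nSpringfield, XX 00000', 'ts': '2020-01-01T00:00:00'},
    {'id': 'b2', 'street_address': '2 Example Ave\nShelbyville, XX 00001', 'ts': '2021-06-15T12:30:00'},
]

# This is the string that would be uploaded as one S3 object
payload = ''.join(dumps_lines(sample_records))

# Reading it back line by line recovers the original records
round_tripped = [json.loads(line) for line in payload.splitlines()]
print(len(round_tripped), "records,", payload.count('\n'), "lines")
```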

Now let's check the generated data:

```
$ aws s3 ls s3://mrworkshop-youraccountID-dayone/hudi-ds/inputdata/
2022-03-01 06:42:36    3685908 profile_1.json
2022-03-01 06:44:18    3685807 profile_10.json
2022-03-01 06:44:30    3684892 profile_11.json
2022-03-01 06:44:42    3684254 profile_12.json
2022-03-01 06:44:53    3684155 profile_13.json
2022-03-01 06:45:05    3685178 profile_14.json
2022-03-01 06:45:16    3685062 profile_15.json
2022-03-01 06:42:47    3683295 profile_2.json
2022-03-01 06:42:58    3686567 profile_3.json
2022-03-01 06:43:10    3683613 profile_4.json
2022-03-01 06:43:21    3686654 profile_5.json
2022-03-01 06:43:32    3685491 profile_6.json
2022-03-01 06:43:44    3683970 profile_7.json
2022-03-01 06:43:55    3685578 profile_8.json
2022-03-01 06:44:06    3685117 profile_9.json
```

## Copy Hudi Libraries on the EMR Cluster and create Hive table

0. For the following steps to work, the EMR cluster must have been launched with the permissions required for **Systems Manager Session Manager**.
1. In the AWS Console, type SSM in the search box and navigate to the **AWS Systems Manager console**.
2. In the left-hand navigation pane, select **Session Manager** from the **Instances and Nodes** section.
3. Click **Start session**; you should see two EC2 instances listed.
4. Select the instance ID of the **EMR master node** and click **Start session**.
5. In the terminal, run the following to switch to the *hadoop* user and copy the Hudi libraries to HDFS:
 
```bash
sh-4.2$ sudo su hadoop
[hadoop@ip-10-0-2-73 /]$ cd
hdfs dfs -mkdir -p /apps/hudi/lib
hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar
hdfs dfs -copyFromLocal /usr/lib/spark/external/lib/spark-avro.jar /apps/hudi/lib/spark-avro.jar
hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-utilities-bundle.jar /apps/hudi/lib/hudi-utilities-bundle.jar
hdfs dfs -copyFromLocal /usr/lib/spark/jars/httpclient-4.5.9.jar /apps/hudi/lib/httpclient-4.5.9.jar
hdfs dfs -copyFromLocal /usr/lib/spark/jars/httpcore-4.4.11.jar /apps/hudi/lib/httpcore-4.4.11.jar
hdfs dfs -ls /apps/hudi/lib/
Found 5 items
-rw-r--r--   1 hadoop hadoop     774384 2021-10-11 02:51 /apps/hudi/lib/httpclient-4.5.9.jar
-rw-r--r--   1 hadoop hadoop     326874 2021-10-11 02:51 /apps/hudi/lib/httpcore-4.4.11.jar
-rw-r--r--   1 hadoop hadoop   35041795 2021-10-11 02:51 /apps/hudi/lib/hudi-spark-bundle.jar
-rw-r--r--   1 hadoop hadoop   39996793 2021-10-11 02:51 /apps/hudi/lib/hudi-utilities-bundle.jar
-rw-r--r--   1 hadoop hadoop     161984 2021-10-11 02:51 /apps/hudi/lib/spark-avro.jar
```

## Run DeltaStreamer to write a Copy on Write (COW) table

We will now run the DeltaStreamer utility as an EMR Step to write the above JSON formatted data into a Hudi dataset. To do that, we will need the following:

* A properties file on the local file system or DFS, with configurations for the Hudi client, schema provider, key generator and data source
* A schema file for the source dataset
* A schema file for the target dataset
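
The notebook does not show these files, so the following is only a sketch of what they might contain, written as a small Python script that renders both. The bucket name, file names, and schema name are hypothetical placeholders. The field list mirrors the fake profile generator above, and the types are assumptions. The property keys are the ones documented for Hudi 0.x releases; verify them against the Hudi version on your cluster. The record key `id` and the non-partitioned key generator are assumptions chosen to match the unpartitioned output shown later.

```python
import json

# Hypothetical placeholders: replace with your own bucket and file names
bucket = "my-bucket"
config_prefix = f"s3://{bucket}/hudi-ds/config"

# Avro schema sketch: field names come from fake_profile_generator above;
# the types are assumed (everything a string except record_id)
string_fields = ["Name", "phone", "job", "company", "ssn", "street_address",
                 "dob", "email", "ts", "credit_card", "id"]
schema = {
    "type": "record",
    "name": "Profiles",  # hypothetical record name
    "fields": [{"name": f, "type": "string"} for f in string_fields]
              + [{"name": "record_id", "type": "int"}],
}

# DeltaStreamer properties sketch (assumed values, Hudi 0.x key names)
properties = {
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.deltastreamer.schemaprovider.source.schema.file":
        f"{config_prefix}/source-schema-json.avsc",
    "hoodie.deltastreamer.schemaprovider.target.schema.file":
        f"{config_prefix}/target-schema-json.avsc",
    "hoodie.deltastreamer.source.dfs.root": f"s3://{bucket}/hudi-ds/inputdata",
}

properties_text = "\n".join(f"{k}={v}" for k, v in properties.items()) + "\n"
schema_text = json.dumps(schema, indent=2)

print(properties_text)
print(schema_text)
```

For the upsert run, a second properties file would point `hoodie.deltastreamer.source.dfs.root` at the `hudi-ds/updates` prefix instead.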

To run DeltaStreamer, use a command of the following form:

```
! ~/.local/bin/aws emr add-steps --cluster-id j-1GMG9EJ4Z4ZL0 --steps Type=Spark,Name="Deltastreamer COW - Bulk Insert",ActionOnFailure=CONTINUE,Args=[--jars,hdfs:///apps/hudi/lib/*.jar,--class,org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer,hdfs:///apps/hudi/lib/hudi-utilities-bundle.jar,--props,s3://my-bucket/hudi-ds/config/json-deltastreamer.properties,--table-type,COPY_ON_WRITE,--source-class,org.apache.hudi.utilities.sources.JsonDFSSource,--source-ordering-field,ts,--target-base-path,s3://my-bucket/hudi-ds-output/person-profile-out1,--target-table,person_profile_cow,--schemaprovider-class,org.apache.hudi.utilities.schema.FilebasedSchemaProvider,--op,BULK_INSERT] --region us-east-1

```


Copy the above command into a text editor and replace the following values:

1. Replace the `--cluster-id` value with the cluster ID you obtained in the previous step
2. In the `--props` value, replace `my-bucket` with your S3 bucket name
3. In the `--target-base-path` value, replace `my-bucket` with your S3 bucket name
4. After replacing the values, copy the entire command and run it in the next cell
5. If the values are replaced correctly, you should see a step ID displayed as the output
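
Editing a long one-line command by hand is error-prone, so the substitutions above can also be done in code. The following is a minimal sketch that assembles the same `--steps` value from variables; the function name `build_deltastreamer_step` and the placeholder cluster ID and bucket are illustrative assumptions, not part of the workshop.

```python
def build_deltastreamer_step(name, props_path, target_base_path, target_table, operation):
    """Return the value for `aws emr add-steps --steps` in the CLI shorthand syntax."""
    args = [
        "--jars", "hdfs:///apps/hudi/lib/*.jar",
        "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
        "hdfs:///apps/hudi/lib/hudi-utilities-bundle.jar",
        "--props", props_path,
        "--table-type", "COPY_ON_WRITE",
        "--source-class", "org.apache.hudi.utilities.sources.JsonDFSSource",
        "--source-ordering-field", "ts",
        "--target-base-path", target_base_path,
        "--target-table", target_table,
        "--schemaprovider-class", "org.apache.hudi.utilities.schema.FilebasedSchemaProvider",
        "--op", operation,
    ]
    return f'Type=Spark,Name="{name}",ActionOnFailure=CONTINUE,Args=[{",".join(args)}]'

# Placeholder values: replace with your own cluster ID and bucket name
cluster_id = "j-XXXXXXX"
bucket = "my-bucket"

steps = build_deltastreamer_step(
    name="Deltastreamer COW - Bulk Insert",
    props_path=f"s3://{bucket}/hudi-ds/config/json-deltastreamer.properties",
    target_base_path=f"s3://{bucket}/hudi-ds-output/person-profile-out1",
    target_table="person_profile_cow",
    operation="BULK_INSERT",
)

command = f"aws emr add-steps --cluster-id {cluster_id} --steps {steps} --region us-east-1"
print(command)
```

The printed command can then be pasted into a notebook cell with a leading `!`.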



In [None]:
!pip install awscli --upgrade --user

In [None]:
! ~/.local/bin/aws emr add-steps --cluster-id j-1GMG9EJ4Z4ZL0 --steps Type=Spark,Name="Deltastreamer COW - Bulk Insert",ActionOnFailure=CONTINUE,Args=[--jars,hdfs:///apps/hudi/lib/*.jar,--class,org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer,hdfs:///apps/hudi/lib/hudi-utilities-bundle.jar,--props,s3://mrworkshop-youraccountID-dayone/hudi-ds/config/json-deltastreamer.properties,--table-type,COPY_ON_WRITE,--source-class,org.apache.hudi.utilities.sources.JsonDFSSource,--source-ordering-field,ts,--target-base-path,s3://mrworkshop-youraccountID-dayone/hudi-ds-output/person-profile-out1,--target-table,person_profile_cow,--schemaprovider-class,org.apache.hudi.utilities.schema.FilebasedSchemaProvider,--op,BULK_INSERT] --region us-east-1
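
The step runs asynchronously, so the output location stays empty until the step completes. One way to wait for it is to poll the EMR `DescribeStep` API. The sketch below assumes a boto3 EMR client is passed in; the helper name `wait_for_step`, the polling interval, and the step ID shown in the usage comment are illustrative assumptions.

```python
import time

# Step states after which the step will not change any more
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}

def wait_for_step(emr_client, cluster_id, step_id,
                  poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll describe_step until the step reaches a terminal state and return that state."""
    for _ in range(max_polls):
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"Step {step_id} did not finish after {max_polls} polls")

# Usage (requires AWS credentials; IDs are placeholders):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   print(wait_for_step(emr, "j-XXXXXXX", "s-XXXXXXX"))
```

Continue to the next section only once the returned state is `COMPLETED`; for any other terminal state, check the step logs in the EMR console.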

## Query the Hudi Dataset

Now let us check the S3 path:

```
$ aws s3 ls s3://mrworkshop-youraccountID-dayone/hudi-ds-output/person-profile-out1/
                           PRE .hoodie/
2022-03-01 06:49:05          0 .hoodie_$folder$
2022-03-01 06:49:27         93 .hoodie_partition_metadata
2022-03-01 06:49:30    2488921 15aaf95c-38c1-4889-9987-cdc0e8e7f913-0_7-4-39_20220301064915.parquet
2022-03-01 06:49:31    2259709 55662f83-a4b3-4278-b5d0-4176c1146ac7-0_0-4-32_20220301064915.parquet
2022-03-01 06:49:29    2467117 5ed0070b-55a6-4743-8aab-2e97d57d28f6-0_5-4-37_20220301064915.parquet
2022-03-01 06:49:29    2231503 7d24eabb-7fe6-4e7a-b2fc-9582caef059e-0_9-4-41_20220301064915.parquet
2022-03-01 06:49:31    2383519 a07bea06-3671-45c1-90ef-038e1f60e012-0_4-4-36_20220301064915.parquet
2022-03-01 06:49:30    2165923 a787d255-cd60-46c0-a8d1-f5405a3ac5de-0_3-4-35_20220301064915.parquet
2022-03-01 06:49:30    2352220 ae944d48-0379-4974-854b-46b2f2a641af-0_6-4-38_20220301064915.parquet
2022-03-01 06:49:29    2070634 b463dd72-044e-45a5-b520-87e79c397fd9-0_2-4-34_20220301064915.parquet
2022-03-01 06:49:31    2021565 cdd10a9f-1c6a-45ae-b65c-12b24c63360a-0_8-4-40_20220301064915.parquet
2022-03-01 06:49:29    2354644 f70ef867-2e07-4120-95ae-1703201a4067-0_1-4-33_20220301064915.parquet
```

To query the Hudi dataset, you can do one of the following:

- Open another SparkMagic notebook and run queries in Spark from a SparkMagic cell
- SSH to the master node (or use SSM if you launched your cluster with SSM permissions) and run queries using Hive or Presto
- Open the Hue console on Amazon EMR and run queries
- Query using Amazon Athena or Amazon Redshift Spectrum (preferred)

Let us use Athena to query the table.

In the Athena console:

```
select * from profile_cow limit 2;

	_hoodie_commit_time	_hoodie_commit_seqno	_hoodie_record_key	_hoodie_partition_path	_hoodie_file_name	name	phone	job	company	ssn	street_address	dob	email	ts
1	20220301064915	20220301064915_6_2	95d748fc-158d-44b0-85b6-ad198e7ad2f1		15219e18-c613-405a-9f80-3ebac5e7a2a3-0_6-22-136_20220301064915.parquet	Joshua Johnson	(165)401-1609x877	Accountant, chartered	Gallegos, Patel and Perez	675-97-0588	82503 Morgan Cliff Apt. 310 South Eddie, DE 38645	1937-04-30	jacob89@example.org	2015-06-05T00:38:23
2	20220301064915	20220301064915_6_4	95d74b34-5d9d-4e75-b5aa-a70d73167cbd		15219e18-c613-405a-9f80-3ebac5e7a2a3-0_6-22-136_20220301064915.parquet	Keith Chen	576-351-8011x7651	Ecologist	Boyd-Jones	382-35-5590	76979 Robert Summit North Ashleymouth, HI 73317	1973-04-23	kflores@example.net	2020-06-14T06:17:54

```

Now, let's make a note of the street_address in the first of these two records: "82503 Morgan Cliff Apt. 310 South Eddie, DE 38645"

```

select _hoodie_commit_time, street_address from profile_cow where _hoodie_record_key='95d748fc-158d-44b0-85b6-ad198e7ad2f1';

_hoodie_commit_time	street_address
1	20220301064915	82503 Morgan Cliff Apt. 310 South Eddie, DE 38645

```

Let's now run an upsert to observe the change in the records.


## Run updates

In [None]:
generate_updates()

Check the files in the updates/ location.

```
$ aws s3 ls s3://mrworkshop-youraccountID-dayone/hudi-ds/updates/
2022-03-01 06:58:10    3686930 profile_1.json
2022-03-01 06:58:35    3686555 profile_10.json
2022-03-01 06:58:38    3686528 profile_11.json
2022-03-01 06:58:41    3684902 profile_12.json
2022-03-01 06:58:44    3683917 profile_13.json
2022-03-01 06:58:47    3685412 profile_14.json
2022-03-01 06:58:12    3683398 profile_2.json
2022-03-01 06:58:15    3686330 profile_3.json
2022-03-01 06:58:18    3685814 profile_4.json
2022-03-01 06:58:21    3686473 profile_5.json
2022-03-01 06:58:24    3687483 profile_6.json
2022-03-01 06:58:27    3684895 profile_7.json
2022-03-01 06:58:29    3685616 profile_8.json
2022-03-01 06:58:32    3683469 profile_9.json

```

## Run DeltaStreamer to apply updates

We will now run DeltaStreamer again, this time with the UPSERT operation, to apply the updates generated in the previous step. The command has the following form:

```

! ~/.local/bin/aws emr add-steps --cluster-id j-XXXXXXX --steps Type=Spark,Name="Deltastreamer Profile Upserts",ActionOnFailure=CONTINUE,Args=[--jars,hdfs:///apps/hudi/lib/*.jar,--class,org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer,hdfs:///apps/hudi/lib/hudi-utilities-bundle.jar,--props,s3://<my-bucket>/config/json-deltastreamer_upsert.properties,--table-type,COPY_ON_WRITE,--source-class,org.apache.hudi.utilities.sources.JsonDFSSource,--source-ordering-field,ts,--target-base-path,s3://<my-bucket>/hudi-ds-output/person-profile-out1,--target-table,person_profile_cow,--schemaprovider-class,org.apache.hudi.utilities.schema.FilebasedSchemaProvider,--op,UPSERT] --region us-east-1

```

In [None]:
! ~/.local/bin/aws emr add-steps --cluster-id j-1GMG9EJ4Z4ZL0 --steps Type=Spark,Name="Deltastreamer COW",ActionOnFailure=CONTINUE,Args=[--jars,hdfs:///apps/hudi/lib/*.jar,--class,org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer,hdfs:///apps/hudi/lib/hudi-utilities-bundle.jar,--props,s3://mrworkshop-youraccountID-dayone/config/json-deltastreamer_upsert.properties,--table-type,COPY_ON_WRITE,--source-class,org.apache.hudi.utilities.sources.JsonDFSSource,--source-ordering-field,ts,--target-base-path,s3://mrworkshop-youraccountID-dayone/hudi-ds-output/person-profile-out1,--target-table,person_profile_cow,--schemaprovider-class,org.apache.hudi.utilities.schema.FilebasedSchemaProvider,--op,UPSERT] --region us-east-1
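
Conceptually, an upsert matches incoming records to existing ones by record key, replaces the matches, and inserts the rest. The following pure-Python sketch models that idea with the `id` field as the key and `ts` as the ordering field used to de-duplicate the incoming batch. It is an illustration only, not Hudi's implementation: the actual merge behavior depends on the payload class configured for the table, and the function name `upsert` and the sample records are hypothetical.

```python
def upsert(table, incoming, key_field="id", ordering_field="ts"):
    """Return a new table (dict keyed by record key) with the incoming batch applied."""
    # 1. De-duplicate the batch: per key, keep the record with the largest ordering value
    latest = {}
    for rec in incoming:
        key = rec[key_field]
        if key not in latest or rec[ordering_field] >= latest[key][ordering_field]:
            latest[key] = rec
    # 2. Apply the batch: existing keys are overwritten, new keys are inserted
    merged = dict(table)
    merged.update(latest)
    return merged

# Hypothetical records; ISO-8601 timestamps compare correctly as strings
table = {
    "a": {"id": "a", "street_address": "old address", "ts": "2020-01-01T00:00:00"},
    "b": {"id": "b", "street_address": "unchanged", "ts": "2020-02-01T00:00:00"},
}
incoming = [
    {"id": "a", "street_address": "new address", "ts": "2020-01-01T00:00:00"},
    {"id": "c", "street_address": "first", "ts": "2021-01-01T00:00:00"},
    {"id": "c", "street_address": "second", "ts": "2022-01-01T00:00:00"},
]

result = upsert(table, incoming)
for key in sorted(result):
    print(key, result[key]["street_address"])
```

This mirrors what `generate_updates()` sets up: each updated record keeps its `id` and `ts` and only carries a new `street_address`.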

## Query the updated Hudi Dataset

Now let's check the S3 path of the output location. Notice the new Parquet files.

```

$ aws s3 ls s3://mrworkshop-youraccountID-dayone/hudi-ds-output/person-profile-out1/
                           PRE .hoodie/
2022-03-01 06:49:05          0 .hoodie_$folder$
2022-03-01 06:49:27         93 .hoodie_partition_metadata
2022-03-01 07:04:59    2494102 15aaf95c-38c1-4889-9987-cdc0e8e7f913-0_7-22-135_20220301070424.parquet
2022-03-01 06:49:30    2488921 15aaf95c-38c1-4889-9987-cdc0e8e7f913-0_7-4-39_20220301064915.parquet
2022-03-01 06:49:31    2259709 55662f83-a4b3-4278-b5d0-4176c1146ac7-0_0-4-32_20220301064915.parquet
2022-03-01 07:04:57    2263906 55662f83-a4b3-4278-b5d0-4176c1146ac7-0_1-22-129_20220301070424.parquet
2022-03-01 07:04:59    2470890 5ed0070b-55a6-4743-8aab-2e97d57d28f6-0_4-22-132_20220301070424.parquet
2022-03-01 06:49:29    2467117 5ed0070b-55a6-4743-8aab-2e97d57d28f6-0_5-4-37_20220301064915.parquet
2022-03-01 07:04:57    2235852 7d24eabb-7fe6-4e7a-b2fc-9582caef059e-0_9-22-137_20220301070424.parquet
2022-03-01 06:49:29    2231503 7d24eabb-7fe6-4e7a-b2fc-9582caef059e-0_9-4-41_20220301064915.parquet
2022-03-01 07:04:58    2387713 a07bea06-3671-45c1-90ef-038e1f60e012-0_3-22-131_20220301070424.parquet
2022-03-01 06:49:31    2383519 a07bea06-3671-45c1-90ef-038e1f60e012-0_4-4-36_20220301064915.parquet
2022-03-01 06:49:30    2165923 a787d255-cd60-46c0-a8d1-f5405a3ac5de-0_3-4-35_20220301064915.parquet
2022-03-01 07:04:57    2169821 a787d255-cd60-46c0-a8d1-f5405a3ac5de-0_6-22-134_20220301070424.parquet
2022-03-01 06:49:30    2352220 ae944d48-0379-4974-854b-46b2f2a641af-0_6-4-38_20220301064915.parquet
2022-03-01 07:04:58    2355854 ae944d48-0379-4974-854b-46b2f2a641af-0_8-22-136_20220301070424.parquet
2022-03-01 07:04:56    2074694 b463dd72-044e-45a5-b520-87e79c397fd9-0_2-22-130_20220301070424.parquet
2022-03-01 06:49:29    2070634 b463dd72-044e-45a5-b520-87e79c397fd9-0_2-4-34_20220301064915.parquet
2022-03-01 07:04:59    2025729 cdd10a9f-1c6a-45ae-b65c-12b24c63360a-0_5-22-133_20220301070424.parquet
2022-03-01 06:49:31    2021565 cdd10a9f-1c6a-45ae-b65c-12b24c63360a-0_8-4-40_20220301064915.parquet
2022-03-01 07:04:57    2358794 f70ef867-2e07-4120-95ae-1703201a4067-0_0-22-128_20220301070424.parquet
2022-03-01 06:49:29    2354644 f70ef867-2e07-4120-95ae-1703201a4067-0_1-4-33_20220301064915.parquet

```
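
Each base file name has the form `<fileId>_<writeToken>_<instantTime>.parquet`. In a Copy on Write table, an update rewrites the affected file group as a new file with the same file ID and a later instant time, which is why every file ID now appears twice in the listing. The sketch below parses three of the names above and picks the latest instant per file ID; the helper name `parse_base_file` is an assumption made for this illustration.

```python
def parse_base_file(name):
    """Split a Hudi base file name into (file_id, write_token, instant_time)."""
    stem = name.rsplit(".", 1)[0]
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time

# Three file names taken from the listing above
names = [
    "15aaf95c-38c1-4889-9987-cdc0e8e7f913-0_7-22-135_20220301070424.parquet",
    "15aaf95c-38c1-4889-9987-cdc0e8e7f913-0_7-4-39_20220301064915.parquet",
    "55662f83-a4b3-4278-b5d0-4176c1146ac7-0_0-4-32_20220301064915.parquet",
]

# Keep the most recent instant time seen for each file ID
latest_instant = {}
for name in names:
    file_id, _, instant_time = parse_base_file(name)
    if instant_time > latest_instant.get(file_id, ""):
        latest_instant[file_id] = instant_time

for file_id, instant_time in latest_instant.items():
    print(file_id, instant_time)
```

A snapshot query reads only the latest version of each file group; the older versions remain on S3 until Hudi's cleaner removes them.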

Let's query an upserted record. 

```
select _hoodie_commit_time, street_address from profile_cow where _hoodie_record_key='95d748fc-158d-44b0-85b6-ad198e7ad2f1';

_hoodie_commit_time    street_address
1    20220301064915    82503 Morgan Cliff Apt. 310 South Eddie, DE 38645    # Old address 
2    20220301070424    35740 Young Orchard Suite 147 South Williamport, MT 82610   # Our recent update 


```

Now let's check the commits in the Hudi CLI:

```
hudi:person_profile_cow->commits show
2022-03-01 07:11:45,875 INFO timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20220301070424__commit__COMPLETED]}
2022-03-01 07:11:45,918 INFO s3n.S3NativeFileSystem: Opening 's3://mrworkshop-youraccountID-dayone/hudi-ds-output/person-profile-out1/.hoodie/20220301070424.commit' for reading
2022-03-01 07:11:46,265 INFO s3n.S3NativeFileSystem: Opening 's3://mrworkshop-youraccountID-dayone/hudi-ds-output/person-profile-out1/.hoodie/20220301064915.commit' for reading

╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
║ CommitTime     │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
║ 20220301070424 │ 21.8 MB             │ 0                 │ 10                  │ 1                        │ 150000                │ 140000                       │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20220301064915 │ 21.7 MB             │ 10                │ 0                   │ 1                        │ 150000                │ 0                            │ 0            ║
╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝

```
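
The record counts in the two commits follow directly from the generator settings: the bulk insert wrote `num_files` files of `num_records` records each, and `generate_updates()` rewrote every file except the last one. A quick check of that arithmetic:

```python
# Generator settings used earlier in this notebook
num_records = 10000
num_files = 15

# Bulk insert: every record of every file
total_records = num_records * num_files

# Upsert: generate_updates() iterates over range(1, num_files), i.e. num_files - 1 files
updated_records = num_records * (num_files - 1)

print(total_records, updated_records)  # prints "150000 140000"
```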