[SUPPORT] Speeding up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS #59

Open
torvalds-dev bot opened this issue Jul 12, 2024 · 8 comments


torvalds-dev bot commented Jul 12, 2024

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

Is there a way to speed up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS? Currently, the Multipart Upload parts seem to be written in serial.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version :

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.


torvalds-dev bot commented Jul 12, 2024

Hello! 👋 Please wait while I search the knowledge base for an answer; this usually takes about 30 seconds


torvalds-dev bot commented Jul 12, 2024

Yes, it is possible to speed up S3 Multipart Uploads when using Hudi on AWS EMR. With EMRFS, multipart upload parts are written serially by default, which can become a write-performance bottleneck. One available workaround is to override the Hudi jars on the EMR cluster, which can improve write speeds.

To override Hudi jars in your EMR clusters and potentially enhance write speeds, follow these steps:

  1. Build Hudi Jars:

    • Clone the Hudi repository and check out version 0.7.0.
    • Build jars with the required configurations for EMR clusters, such as Spark 3.0.0 and Scala 2.12.
  2. Copy Jars to S3:

    • After building the jars, copy them to a temporary location in S3.
  3. Include Hudi Jars as part of the EMR Bootstrap Script:

    • Create a script that downloads the Hudi jars from the S3 location to each node in the EMR cluster.
    • Use the script as a part of the bootstrap-actions while launching the EMR cluster to install the jars on each node.
  4. Use the Overridden Jar in Deltastreamer:

    • When invoking DeltaStreamer, specify the location of the overridden jar in the spark-submit command.
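Steps 1-3 above can be sketched as a pair of shell fragments. The bucket names, paths, and Maven build flags below are illustrative assumptions, not values taken from this issue; adjust them to your environment.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-3: build the Hudi jars, stage them in S3, and pull
# them onto each node via an EMR bootstrap action. Bucket names and
# paths are hypothetical.
set -euo pipefail

# 1. Build Hudi jars (run once, on a build machine).
#    The Spark 3 / Scala 2.12 profile flags are assumptions for a
#    0.7.0-era build; check the Hudi build docs for your version.
git clone https://github.com/apache/hudi.git
cd hudi
git checkout release-0.7.0
mvn clean package -DskipTests -Dspark3 -Dscala-2.12

# 2. Copy the built bundle to a temporary S3 location.
aws s3 cp packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.7.0.jar \
  s3://my-tmp-bucket/hudi-jars/

# 3. Bootstrap action: save the line below as its own script
#    (e.g. bootstrap_hudi.sh) in S3, then reference it with
#    --bootstrap-actions Path=s3://my-tmp-bucket/scripts/bootstrap_hudi.sh
#    when launching the cluster, so each node downloads the jar at startup.
# aws s3 cp s3://my-tmp-bucket/hudi-jars/hudi-utilities-bundle_2.12-0.7.0.jar /usr/lib/hudi/
```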

By following these steps, you can optimize S3 Multipart Uploads using Hudi on AWS EMR, potentially enhancing write performance. (Source: Apache Hudi - FAQ Integrations)
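For step 4, a spark-submit invocation along these lines points DeltaStreamer at the overridden bundle; the jar location, table paths, and source options here are placeholders, not values from this issue.

```shell
# Hypothetical DeltaStreamer launch using the overridden Hudi bundle
# installed by the bootstrap action. Paths and table names are examples.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
  /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --target-base-path s3://my-data-bucket/hudi/my_table \
  --target-table my_table
```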


torvalds-dev bot commented Jul 12, 2024

Please configure the GitHub URL to enable the creation of GitHub issues.



torvalds-dev bot commented Jul 12, 2024

Ticket created successfully. Here is the link to the GitHub issue: #59

