[SUPPORT] Speeding up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS #59

Open
torvalds-dev bot opened this issue Jul 12, 2024 · 8 comments


torvalds-dev bot commented Jul 12, 2024

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

Is there a way to speed up S3 Multipart Uploads using Hudi on AWS EMR with EMRFS? Currently, the Multipart Upload parts seem to be written in serial.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version :

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.


torvalds-dev bot commented Jul 12, 2024

Hello! 👋 Please wait while I search the knowledge base for an answer; this usually takes about 30 seconds


torvalds-dev bot commented Jul 12, 2024

Yes, it is possible to speed up S3 Multipart Uploads when using Hudi on AWS EMR. With EMRFS, multipart upload parts are written serially by default, which can become a write-performance bottleneck. One available workaround is to override the Hudi jars on the EMR cluster, which can improve write speeds.

To override Hudi jars in your EMR clusters and potentially enhance write speeds, follow these steps:

  1. Build Hudi Jars:

    • Clone the Hudi repository and check out version 0.7.0.
    • Build jars with the required configurations for EMR clusters, such as Spark 3.0.0 and Scala 2.12.
  2. Copy Jars to S3:

    • After building the jars, copy them to a temporary location in S3.
  3. Include Hudi Jars as part of the EMR Bootstrap Script:

    • Create a script that downloads the Hudi jars from the S3 location to each node in the EMR cluster.
    • Use the script as a part of the bootstrap-actions while launching the EMR cluster to install the jars on each node.
  4. Use the Overridden Jar in Deltastreamer:

    • When invoking DeltaStreamer, specify the location of the overridden jar in the spark-submit command.
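Steps 1-3 above can be sketched as a pair of shell fragments. The bucket names, paths, and Maven build flags below are illustrative assumptions, not values taken from this issue; adjust them to your environment.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-3: build the Hudi jars, stage them in S3, and pull
# them onto each node via an EMR bootstrap action. Bucket names and
# paths are hypothetical.
set -euo pipefail

# 1. Build Hudi jars (run once, on a build machine).
#    The Spark 3 / Scala 2.12 profile flags are assumptions for a
#    0.7.0-era build; check the Hudi build docs for your version.
git clone https://github.com/apache/hudi.git
cd hudi
git checkout release-0.7.0
mvn clean package -DskipTests -Dspark3 -Dscala-2.12

# 2. Copy the built bundle to a temporary S3 location.
aws s3 cp packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.7.0.jar \
  s3://my-tmp-bucket/hudi-jars/

# 3. Bootstrap action: save the line below as its own script
#    (e.g. bootstrap_hudi.sh) in S3, then reference it with
#    --bootstrap-actions Path=s3://my-tmp-bucket/scripts/bootstrap_hudi.sh
#    when launching the cluster, so each node downloads the jar at startup.
# aws s3 cp s3://my-tmp-bucket/hudi-jars/hudi-utilities-bundle_2.12-0.7.0.jar /usr/lib/hudi/
```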

By following these steps, you can optimize S3 Multipart Uploads using Hudi on AWS EMR, potentially enhancing write performance. (Source: Apache Hudi - FAQ Integrations)
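For step 4, a spark-submit invocation along these lines points DeltaStreamer at the overridden bundle; the jar location, table paths, and source options here are placeholders, not values from this issue.

```shell
# Hypothetical DeltaStreamer launch using the overridden Hudi bundle
# installed by the bootstrap action. Paths and table names are examples.
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
  /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --target-base-path s3://my-data-bucket/hudi/my_table \
  --target-table my_table
```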


torvalds-dev bot commented Jul 12, 2024

Please configure the GitHub URL to enable the creation of GitHub issues.



torvalds-dev bot commented Jul 12, 2024

Ticket created successfully. Here is the link to the GitHub issue: #59

