S3 loader cannot access S3 across AWS regions (sink/load data to/from remote region) #283

Open
donnyding opened this issue Jun 29, 2023 · 3 comments

Comments

@donnyding

In 1.x, the KinesisConfig (inStream) and the S3Config (outStream) each have their own AWS region setting. That means you can run the s3-loader in region A, consume data from a Kinesis data stream in region A, and persist raw/enriched events to S3 in region B.
Since 2.x, region is a global setting outside the "input" and "output" sections. The code always reads the region from there, even when I configure a custom S3 endpoint.

The AWS client SDK provides an option to turn on global bucket access, but snowplow-s3-loader does not expose this setting in its pipeline configuration. Please review this and fix it.

@donnyding
Author

As a workaround, we can always force global bucket access:
client.setForceGlobalBucketAccessEnabled(true);

@jbeemster
Member

Hi @donnyding, would you be able to share a bit more about the use case you are trying to solve here, and why you would want to read from Kinesis in one region and write to S3 in a different region?

As for the feature itself, if you have the bandwidth, we are always happy to review Pull Requests!

@donnyding
Author

donnyding commented Jun 29, 2023

Hi @jbeemster,
Usage scenario:
To improve HA, we plan to set up similar environments in two AWS regions. The health check API could be used in the traffic routing policy. That means event payloads will be routed across the two regions, with no duplicated data. Ideally, the enrichment processing persists raw/enriched events to a single global S3 storage area.
That's why I consume data from a Kinesis data stream (region A) and sink data to S3 (region B).

As a workaround, I can force global bucket access through the AWS client SDK interface, but it's not a perfect solution.

It's possible to separate the region setting for the input and output sections of the configuration file, just like Snowplow-OSS does in v1.0. Alternatively, a new configuration item could be added to the output section so that users can choose to enable or disable global bucket access. Does that make sense?
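
To make the two options above concrete, here is a rough sketch of how the region could be resolved if the output section carried an optional override. Every name in it (`OutputRegionSketch`, `LoaderConfig`, `OutputConfig`, `forceGlobalBucketAccess`) is hypothetical and is not taken from the snowplow-s3-loader code base. It only illustrates the fallback logic and is not a proposed implementation.

```java
import java.util.Optional;

// Hypothetical sketch: none of these types exist in snowplow-s3-loader.
final class OutputRegionSketch {

    // Assumed shape of the "output" section: an optional region override
    // plus a flag that could be handed to the S3 client builder.
    static final class OutputConfig {
        final Optional<String> region;
        final boolean forceGlobalBucketAccess;

        OutputConfig(Optional<String> region, boolean forceGlobalBucketAccess) {
            this.region = region;
            this.forceGlobalBucketAccess = forceGlobalBucketAccess;
        }
    }

    // Assumed shape of the whole config: the 2.x-style global region
    // plus the output section.
    static final class LoaderConfig {
        final String region;
        final OutputConfig output;

        LoaderConfig(String region, OutputConfig output) {
            this.region = region;
            this.output = output;
        }
    }

    // The Kinesis consumer keeps using the global region.
    static String inputRegion(LoaderConfig config) {
        return config.region;
    }

    // The S3 client uses the output override when present,
    // otherwise it falls back to the global region (current 2.x behaviour).
    static String outputRegion(LoaderConfig config) {
        return config.output.region.orElse(config.region);
    }

    public static void main(String[] args) {
        LoaderConfig crossRegion = new LoaderConfig(
            "us-east-1", new OutputConfig(Optional.of("eu-west-1"), false));
        LoaderConfig sameRegion = new LoaderConfig(
            "us-east-1", new OutputConfig(Optional.empty(), true));

        System.out.println(inputRegion(crossRegion) + " -> " + outputRegion(crossRegion));
        System.out.println(inputRegion(sameRegion) + " -> " + outputRegion(sameRegion));
    }
}
```

With this shape, existing 2.x configurations that set only the global region would keep working unchanged, and the flag would only be passed on to the S3 client when a user opts in.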
