What happened?
Is there something special I need to do in order to use the ReadFromSnowflake IO source in Python if the staging_bucket_name is in AWS S3?
I've documented the issue I'm encountering more fully in this StackOverflow question, but the gist is that when I try to use an S3 bucket as the staging bucket, Beam throws an error about "no filesystem found for scheme s3".
The Snowflake side of things seems to be working properly, because I can see the gzipped CSVs showing up in the bucket -- i.e., there's an object at s3://my-bucket//sf_copy_csv_20250303_145430_11651ed9/run_825c058f/data_0_0_0.csv.gz -- but the expansion service doesn't seem to be able to read that object, presumably because the s3 scheme isn't registered as a filesystem.
Is there something extra I have to do in my Python code to fix this? Do I have to run the expansion service manually to pass some additional arguments?
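For reference, here's a minimal sketch of the kind of pipeline that hits this error for me. All of the connection details, the bucket, the table, and the csv_mapper below are placeholders, and as far as I can tell Beam starts the bundled Snowflake expansion service automatically since I don't pass expansion_service myself:

import apache_beam as beam
from apache_beam.io.snowflake import ReadFromSnowflake


# Placeholder mapper: Snowflake hands each row back as a list of strings.
def csv_mapper(strings_array):
    return {'id': strings_array[0], 'name': strings_array[1]}


with beam.Pipeline() as p:
    (p
     | ReadFromSnowflake(
         server_name='myaccount.snowflakecomputing.com',     # placeholder
         username='beam_user',                               # placeholder
         password='********',                                # placeholder
         schema='PUBLIC',                                    # placeholder
         database='MY_DB',                                   # placeholder
         staging_bucket_name='s3://my-bucket/',              # S3 staging bucket (trailing / required)
         storage_integration_name='my_storage_integration',  # placeholder
         csv_mapper=csv_mapper,
         table='MY_TABLE')                                   # placeholder
     | beam.Map(print))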
Additionally, I'm a bit confused by the docs, which seem to contradict each other. These docs state that I can use an S3 bucket, but the pydocs for the Snowflake module don't mention S3 at all. It seems like the pydocs are correct; if so, could the site docs be updated so things are clearer? Or, if S3 really is supported for the staging bucket, could both sets of docs be updated with instructions for how to use it?
One last thing: if I leave the trailing / off the bucket name, Beam complains and doesn't even run the pipeline, but as you can see above, keeping it ends up creating a path with a double slash: s3://my-bucket//sf_copy.... I doubt this has anything to do with the error I'm encountering, but it would be nice if it were fixed so that it's easier to find the files in S3. I'm not sure whether this also occurs for GCS.
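Just to make the double slash concrete (this is purely illustrative string handling, not Beam's actual code; the directory name is copied from the object key above):

staging_bucket_name = 's3://my-bucket/'  # trailing slash: Beam errors out without it
staged_dir = 'sf_copy_csv_20250303_145430_11651ed9/run_825c058f/'  # from the object key above

# What I'd expect the staged prefix to look like:
print(staging_bucket_name + staged_dir)        # s3://my-bucket/sf_copy_csv_.../run_.../
# What the object key actually looks like, as if an extra '/' is inserted between the two:
print(staging_bucket_name + '/' + staged_dir)  # s3://my-bucket//sf_copy_csv_.../run_.../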
I tried posting this to the user@beam.apache.org mailing list, but that requires an @apache.org email, so I filed this bug instead.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
Component: Python SDK
Component: Java SDK
Component: Go SDK
Component: Typescript SDK
Component: IO connector
Component: Beam YAML
Component: Beam examples
Component: Beam playground
Component: Beam katas
Component: Website
Component: Infrastructure
Component: Spark Runner
Component: Flink Runner
Component: Samza Runner
Component: Twister2 Runner
Component: Hazelcast Jet Runner
Component: Google Cloud Dataflow Runner
@Abacn I have done that; I'm able to read from S3 using ReadFromText (for plain CSVs) or MatchFiles & ReadMatches (for gzipped CSVs that need to be unzipped first). The issue really seems to be that the Java expansion service doesn't have the S3 filesystem registered, not something on the Python side of things.
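For example, a sanity check along these lines runs fine for me (the bucket and glob pattern are placeholders; it assumes apache-beam[aws] is installed and AWS credentials are available to the pipeline):

import apache_beam as beam
from apache_beam.io.fileio import MatchFiles, ReadMatches

# Listing the staged Snowflake files directly from Python works,
# so the s3:// filesystem is registered on the Python side.
with beam.Pipeline() as p:
    (p
     | MatchFiles('s3://my-bucket/sf_copy_csv_*/run_*/*.csv.gz')  # placeholder pattern
     | ReadMatches()
     | beam.Map(lambda readable_file: readable_file.metadata.path)
     | beam.Map(print))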