Fix HTTP external tables using pre-signed S3 URLs #183

Merged
mildbyte merged 5 commits into main from
180-cant-create-external-table-with-a-csv-file-from-a-pre-signed-s3-url
Oct 31, 2022

@mildbyte

Fix several issues that make it impossible to query Parquet/CSV/JSON files from a pre-signed S3 URL.

  • HEAD requests unsupported (a pre-signed URL is signed for a single HTTP verb): work around by issuing a one-byte GET and reading the total body size from the Content-Range header (bytes 0-0/12345678)
  • query string stripped by DataFusion (contains the actual signature): fix by urlencoding the whole LOCATION so that it looks like it's a self-contained path without a query string (decode in the HTTP object store before performing the request)
  • query string causes the URL to have no extension suffix at the end, making DataFusion skip it altogether: fix by disabling suffix filtering when using the HTTP object store

DataFusion's `LOCATION` seems to drop the query string in HTTP URLs. To work
around that, URI-encode the whole string and decode it on the other end, in the
HTTP object store, before performing the request.

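
The encode/decode round trip can be sketched as follows. This is a minimal illustration in Python, not the code from this PR: the helper names `encode_location` and `decode_location` are hypothetical, and leaving the scheme unencoded is an assumption made for the sake of the example.

```python
from urllib.parse import quote, unquote


def encode_location(url: str) -> str:
    """Percent-encode everything after the scheme so no literal '?' remains.

    Assumption: the scheme is left as-is so the location is still
    recognisable as an HTTP(S) URL.
    """
    scheme, rest = url.split("://", 1)
    # safe="" also encodes '/', '?', '=' and '&', so the result reads as a
    # single opaque path segment with no query string.
    return f"{scheme}://{quote(rest, safe='')}"


def decode_location(location: str) -> str:
    """Reverse encode_location just before the HTTP request is made."""
    scheme, rest = location.split("://", 1)
    return f"{scheme}://{unquote(rest)}"
```

Because `%` itself is encoded as `%25`, a URL that already contains percent-escapes (as signatures often do) should survive the round trip unchanged.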
Pre-signed S3 URLs are signed for a single HTTP verb, so a given URL works with
either a `HEAD` or a `GET` request, but not both. As a workaround, if we get a
403 in response to a `HEAD`, try sending a `GET` request for a one-byte range
(`Range: bytes=0-0`). The response should have a `Content-Range` header showing
the total length of the resource (e.g. `bytes 0-0/12345678`), which we report
back to DataFusion.

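
A rough sketch of that fallback, in Python for illustration only. The `send` callable is a hypothetical stand-in for whatever HTTP client is in use (it takes a method and request headers and returns a status code and response headers), and header lookup is case-sensitive here purely to keep the example short.

```python
import re
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical response shape: (status code, response headers).
Response = Tuple[int, Mapping[str, str]]

# Matches e.g. "bytes 0-0/12345678"; an unknown total ("/*") does not match.
_CONTENT_RANGE = re.compile(r"^bytes (?:\d+-\d+|\*)/(\d+)$")


def total_length_from_content_range(value: str) -> Optional[int]:
    """Return the total resource length from a Content-Range value, if known."""
    match = _CONTENT_RANGE.match(value.strip())
    return int(match.group(1)) if match else None


def object_size(send: Callable[[str, Mapping[str, str]], Response]) -> int:
    """Try HEAD first; on 403 fall back to a one-byte ranged GET."""
    status, headers = send("HEAD", {})
    if status == 200:
        return int(headers["Content-Length"])
    if status == 403:
        # The URL may be signed for GET only: ask for the first byte and
        # read the total size from Content-Range instead.
        status, headers = send("GET", {"Range": "bytes=0-0"})
        if status == 206:
            total = total_length_from_content_range(headers.get("Content-Range", ""))
            if total is not None:
                return total
    raise RuntimeError(f"could not determine object size (HTTP {status})")
```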
If our HTTP URL has a query string, it won't end with an actual file extension.
DataFusion's `ListingTable` filters by the extension by default (useful when
scanning through a directory where some files aren't part of the table). To fix
this, disable this feature when using HTTP (since we can't scan through a
directory in HTTP anyway).
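
To illustrate why the suffix check fails and what disabling it amounts to, here is a small Python sketch. It is not DataFusion's API: `matches_suffix` and `extension_filter_for` are hypothetical helpers, and representing "filtering disabled" as an empty suffix is an assumption of this example.

```python
def matches_suffix(path: str, file_extension: str) -> bool:
    """Suffix-style filter: keep a path only if it ends with the extension.

    An empty extension matches every path, which effectively disables
    the filter.
    """
    return path.endswith(file_extension)


def extension_filter_for(location: str, default_extension: str) -> str:
    """Pick the suffix to filter on for a given table location.

    For HTTP(S) locations there is no directory to scan, so filtering is
    switched off by returning an empty suffix.
    """
    if location.startswith(("http://", "https://")):
        return ""
    return default_extension
```

With a location such as `https://bucket.example/data.csv?X-Amz-Signature=abc`, a plain `.csv` suffix check rejects the file because the string ends with the signature, while the empty suffix accepts it.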

Also extend the e2e HTTP test to have a query string in it (ignored by the mock,
but exercises this workaround).

mildbyte linked an issue on Oct 31, 2022 that may be closed by this pull request
mildbyte merged commit b2241d2 into main on Oct 31, 2022
mildbyte deleted the 180-cant-create-external-table-with-a-csv-file-from-a-pre-signed-s3-url branch on October 31, 2022 14:22

Development

Successfully merging this pull request may close these issues.

Can't CREATE EXTERNAL TABLE with a CSV file from a pre-signed S3 URL
