Support remote/httpfs URLs in the from field #567

@alexkreidler

DuckDB's httpfs extension can read Parquet, CSV, JSON, and other files from HTTP servers or cloud object storage. For Parquet, the query engine uses HTTP range reads together with the file's built-in statistics to push down queries and limit the amount of data transferred over the network. This lets DuckDB run queries quickly even over files that might be too large to load into DuckDB-WASM's memory.

When I tried this spec in Mosaic Playground:

{
    "plot": [
        {
            "mark": "lineY",
            "data": {
                "from": "read_parquet('https://f005.backblazeb2.com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30.parquet')"
            },
            "x": "file",
            "y": "Close"
        }
    ],
    "width": 680,
    "height": 200
}

Mosaic created this query:

DESCRIBE SELECT "Date" AS "col0", "Close" AS "col1" FROM "read_parquet('https://f005/"."backblazeb2"."com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30"."parquet')" AS "source"

And when I removed the read_parquet function and passed only the URL, I got:

DESCRIBE SELECT "Date" AS "col0", "Close" AS "col1" FROM "https://f005/"."backblazeb2"."com/file/alk-data/courtlistener/2024-10-27/opinion-clusters-2024-09-30"."parquet" AS "source"

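It would be great to add logic that detects https:// and http:// strings in the from field (and maybe s3:// and hf://, which the httpfs extension also supports) and passes them through to the generated SQL unchanged, instead of splitting them on dots and quoting the pieces as identifiers. The relevant code appears to be the from method: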

from(...expr) {
  const { query } = this;
  if (expr.length === 0) {
    // @ts-ignore
    return query.from;
  } else {
    const list = [];
    expr.flat().forEach(e => {
      if (e == null) {
        // do nothing
      } else if (typeof e === 'string') {
        list.push({ as: e, from: asRelation(e) });
      } else if (e instanceof Ref) {
        list.push({ as: e.table, from: e });
      } else if (isQuery(e) || isSQLExpression(e)) {
        list.push({ from: e });
      } else if (Array.isArray(e)) {
        list.push({ as: unquote(e[0]), from: asRelation(e[1]) });
      } else {
        for (const as in e) {
          list.push({ as: unquote(as), from: asRelation(e[as]) });
        }
      }
    });
    query.from = query.from.concat(list);
    return this;
  }
}
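
As a rough sketch of the detection step, something like the following could run on string inputs before they reach asRelation. The names REMOTE_SCHEMES, TABLE_FUNCTION, isRemoteSource, and remoteFromClause are hypothetical and not part of mosaic-sql, and the list of table functions is only an assumption about which ones are worth recognizing. It relies on DuckDB accepting a single-quoted file path or URL directly in a FROM clause.

```javascript
// Hypothetical helpers, not part of mosaic-sql.
// URL schemes mentioned in this issue as handled by DuckDB's httpfs extension.
const REMOTE_SCHEMES = ['http://', 'https://', 's3://', 'hf://'];

// Assumed set of DuckDB table functions to recognize at the start of the string.
const TABLE_FUNCTION = /^\s*(read_parquet|read_csv|read_json)\s*\(/i;

// True if the string starts with one of the remote schemes.
function isRemoteSource(s) {
  const t = s.trim().toLowerCase();
  return REMOTE_SCHEMES.some(scheme => t.startsWith(scheme));
}

// Returns SQL text to place after FROM, or null if the string should be
// treated as an ordinary table name and handled by the existing logic.
function remoteFromClause(s) {
  if (TABLE_FUNCTION.test(s)) {
    // Already a table function call: pass it through unchanged.
    return s.trim();
  }
  if (isRemoteSource(s)) {
    // Bare URL: emit a single-quoted SQL string literal,
    // doubling any embedded single quotes.
    return `'${s.trim().replace(/'/g, "''")}'`;
  }
  return null;
}
```

In from, the typeof e === 'string' branch could then check remoteFromClause(e) first and fall back to asRelation(e) when it returns null. Passing a table function call through verbatim means the spec author controls raw SQL, so whether that is acceptable for mosaic-spec is a separate design question.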

It would also be good to add docs/examples for mosaic-sql, vgplot, and mosaic-spec.
