feat: Big Query temp tables #11268

Open
@Wopple

Description

Is your feature request related to a problem?

The existing support for "temp" tables with the Big Query backend uploads in-memory tables to long-lived, session scoped tables that are cleaned up with a finalizer. This does not support the ETL use case of caching intermediate calculations in table-to-table queries.

Edit: the session scoped tables will be cleaned up by Big Query if the finalizers do not run.
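
For illustration, here is a minimal sketch of the existing behavior, assuming the ibis Big Query backend; the project, dataset, and table names (and the pandas frame) are placeholders:

import ibis
import pandas as pd

con = ibis.bigquery.connect(project_id="my-project", dataset_id="my_dataset")

# An in-memory table; executing an expression that references it uploads it
# to Big Query first, and a finalizer is responsible for dropping the
# uploaded table later.
mem = ibis.memtable(pd.DataFrame({"foo": [1, 2], "bar": [10.0, 20.0]}))

source = con.table("source")
expr = source.join(mem, source.foo == mem.foo)
df = con.execute(expr)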

What is the motivation behind your request?

I write multi-statement queries so that I can deduplicate computation by using temp tables. I want to be able to unit test that code. In production, the temporary storage should be cleaned up by Big Query even in the case of catastrophic failure.

Describe the solution you'd like

There are two potential solutions to this; I'll start with the ideal one.

Multi-Statement Queries

Big Query's SQL allows for submitting a single SQL string with multiple statements. A query can begin by creating a temp table, then use that temp table multiple times in the following statements. Here is a silly example that illustrates the pattern:

-- will only be calculated once
CREATE TEMP TABLE intermediate AS
SELECT foo, SUM(bar) AS total
FROM source
GROUP BY foo
;

INSERT INTO destination

WITH
counts AS (
    SELECT total, COUNT(*) AS num
    FROM intermediate
    GROUP BY total
)

SELECT *
FROM intermediate
JOIN counts ON intermediate.foo = counts.num
;

-- optional
-- saves a little storage cost
-- would otherwise be retained for 24h and then dropped by Big Query
DROP TABLE intermediate;

This would require support for multi-statement queries, which would in turn require consideration of other backends. This approach is preferred in part because it would unlock other use cases like DECLARE-style variables.
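
As a rough illustration of the ibis side, here is a hypothetical sketch: create_temp_table is not existing ibis API, and the idea that insert would compile the whole pipeline into a single multi-statement job is an assumption made only to show the shape of the feature.

import ibis

con = ibis.bigquery.connect(project_id="my-project", dataset_id="my_dataset")

source = con.table("source")
intermediate = source.group_by("foo").aggregate(total=source.bar.sum())

# Hypothetical API: materialize the expression as a TEMP TABLE so it is
# computed once and can be referenced by later statements in the same job.
cached = con.create_temp_table("intermediate", intermediate)

counts = cached.group_by("total").aggregate(num=cached.count())
result = cached.join(counts, cached.foo == counts.num)

# Hypothetical: this would compile to one SQL string containing
#   CREATE TEMP TABLE ...; INSERT INTO destination ...; DROP TABLE ...;
con.insert("destination", result)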

Session Scoped Temp Tables

Big Query also supports sessions, so this could be implemented in the Big Query backend as a separate job that creates the temp table in one query, with the following queries in the same session able to reference that table. Even if the finalizers fail, Big Query will still clean up the tables after 24h. I'm not sure how this would translate to the Pandas backend, though.
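
For reference, here is a minimal sketch of the session mechanics using the google-cloud-bigquery client, which is roughly what the backend would need to do under the hood; the SQL and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# First job: open a session and create the temp table inside it.
create_job = client.query(
    "CREATE TEMP TABLE intermediate AS SELECT foo, SUM(bar) AS total FROM my_dataset.source GROUP BY foo",
    job_config=bigquery.QueryJobConfig(create_session=True),
)
create_job.result()
session_id = create_job.session_info.session_id

# Later jobs attach to the same session and can see the temp table.
in_session = bigquery.QueryJobConfig(
    connection_properties=[bigquery.ConnectionProperty("session_id", session_id)]
)
rows = client.query("SELECT * FROM intermediate", job_config=in_session).result()

# Optional finalizer-style cleanup; if it never runs, Big Query cleans up the
# session (and its temp tables) within 24h anyway.
client.query("CALL BQ.ABORT_SESSION()", job_config=in_session).result()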

What version of ibis are you running?

10.5.0

What backend(s) are you using, if any?

Big Query, Pandas

Code of Conduct

  • I agree to follow this project's Code of Conduct
