From f0c162d6a35bf4a28062bea621e2a950e7d522ed Mon Sep 17 00:00:00 2001
From: Puskar Basu
Date: Thu, 18 Sep 2025 21:47:12 +0530
Subject: [PATCH 1/3] initial ducklake updates

---
 docs/collect/configure.md     | 36 ++++++++++++++++++-----------------
 docs/collect/manage-data.md   |  6 +++---
 docs/query/index.md           | 11 +++++++++--
 docs/reference/cli/connect.md | 15 ++++++++++-----
 docs/reference/glossary.md    |  4 ++--
 5 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/docs/collect/configure.md b/docs/collect/configure.md
index ea81c83..2042cfd 100644
--- a/docs/collect/configure.md
+++ b/docs/collect/configure.md
@@ -15,11 +15,11 @@ Tailpipe [plugins](/docs/collect/plugins) define tables for common log sources a
 
 If your logs are not in a standard format or are not currently supported by a plugin, you can create [custom tables](/docs/collect/custom-tables) to collect data from arbitrary log files and other sources.
 
-Tables are implemented as DuckDB views over the Parquet files. Tailpipe creates tables (that is, creates views in the `tailpipe.db` database) based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
+Tailpipe creates DuckLake tables based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
 
-When you run `tailpipe query` or `tailpipe connect`, Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and adds a view for the table. The view definitions will include qualifiers that implement any filter arguments that you specify (`--from`,`--to`,`--index`,`--partition`).
+When you run `tailpipe query` or `tailpipe connect` with filter arguments (`--from`, `--to`, `--index`, `--partition`), Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and adds a filtered view over the tables.
 
-You can see what tables are available with the `tailpipe plugin list` command.
+You can see what tables are available with the `tailpipe table list` command.
 
 ## Partitions
 A partition represents data gathered from a [source](/docs/collect/configure#sources). Partitions are defined [in HCL](/docs/reference/config-files/partition) and are required for [collection](/docs/collect/collect).
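+As a rough sketch, the filtered view described above might look like the following (hypothetical DDL; the actual view definitions and catalog alias are internal to Tailpipe):
+
+```sql
+-- Hypothetical sketch only: assumes a DuckLake catalog attached as
+-- `tailpipe_ducklake` and a run of `tailpipe connect --partition prod --from 2025-01-01`.
+create or replace view aws_cloudtrail_log as
+select *
+from tailpipe_ducklake.aws_cloudtrail_log
+where tp_partition = 'prod'
+  and tp_timestamp >= date '2025-01-01';
+```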
@@ -61,20 +61,22 @@ The standard partitioning/hive structure enables efficient queries that only nee
 tp_table=aws_cloudtrail_log
 └── tp_partition=prod
     └── tp_index=default
-        ├── tp_date=2024-12-31
-        │   └── data_20250106140713_740378_0.parquet
-        ├── tp_date=2025-01-01
-        │   └── data_20250106140713_740378_0.parquet
-        ├── tp_date=2025-01-02
-        │   └── snap_20250106140823_952067.parquet
-        ├── tp_date=2025-01-03
-        │   └── snap_20250106140824_011599.parquet
-        ├── tp_date=2025-01-04
-        │   └── data_20250106140752_829722_0.parquet
-        ├── tp_date=2025-01-05
-        │   └── snap_20250106140824_073116.parquet
-        └── tp_date=2025-01-06
-            └── snap_20250106140824_131637.parquet
+        └── year=2024
+            ├── month=7
+            │   ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
+            │   ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
+            │   ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
+            │   └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
+            ├── month=8
+            │   ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
+            │   ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
+            │   ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
+            │   └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
+            └── month=9
+                ├── ducklake-01995d38-7f1e-7867-b7f1-8f523d546353.parquet
+                ├── ducklake-01995d38-7f75-77ce-a0ec-5972d4d6c7ae.parquet
+                ├── ducklake-01995d38-7fd2-7365-997d-65a6ad005e83.parquet
+                └── ducklake-01995d38-80e5-7185-b15e-5ee808222b73.parquet
 ```
diff --git a/docs/collect/manage-data.md b/docs/collect/manage-data.md
index 296dd63..0f448b1 100644
--- a/docs/collect/manage-data.md
+++ b/docs/collect/manage-data.md
@@ -292,16 +292,16 @@ Plugin: hub.tailpipe.io/plugins/turbot/aws@latest
 
 ## Connecting from Other Tools
 
-You can connect to your Tailpipe database with the native DuckDB client or other tools and libraries that can connect to DuckDB. To do so, you can generate a new db file for the connection using `tailpipe connect`:
+You can connect to your Tailpipe database with the native DuckDB client or other tools and libraries that can connect to DuckDB. To do so, use `tailpipe connect` to generate a SQL script that initializes DuckDB to use the Tailpipe database:
 
 ```bash
 tailpipe connect
 ```
 
-A new DB file will be generated and returned:
+The path to a new SQL script will be returned:
 
 ```bash
 $ tailpipe connect
-/Users/jsmyth/.tailpipe/data/default/tailpipe_20250409151453.db
+/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918210704.sql
 ```
 
 If you've collected a lot of data and want to optimize your queries for a subset of it, you can pre-filter the database. You can restrict to the most recent 45 days:
diff --git a/docs/query/index.md b/docs/query/index.md
index 864f844..27bbef2 100644
--- a/docs/query/index.md
+++ b/docs/query/index.md
@@ -2,9 +2,16 @@ title: Query Tailpipe
 ---
 
-# Powered by DuckDB!
+# Powered by DuckDB + DuckLake!
 
-Tailpipe [collects](/docs/collect/collect) logs into a [DuckDB](https://duckdb.org/) database that uses [standard SQL syntax](https://duckdb.org/docs/sql/introduction.html) to query. It's easy to [get started writing queries](/docs/sql), and the [Tailpipe Hub](https://hub.tailpipe.io) provides ***hundreds of example queries*** that you can use or modify for your purposes. There are [example queries for each table](https://hub.tailpipe.io/plugins/turbot/aws/tables/aws_cloudtrail_log) in every plugin, and you can also [browse, search, and view the queries](https://hub.tailpipe.io/mods/turbot/tailpipe-mod-aws-dections/queries) in every published mod!
+Tailpipe [collects](/docs/collect/collect) logs into open Parquet files and catalogs them with [DuckLake](https://ducklake.select/), so you query everything with [standard SQL syntax](https://duckdb.org/docs/sql/introduction.html). This brings a simple "lakehouse" model: open data files, a lightweight metadata catalog, and fast local analytics.
+
+- Open formats: data is stored as Parquet on disk.
+- Cataloged: DuckLake tracks tables, columns, and partitions for efficient queries.
+- Fast by design: partition pruning and vectorized execution via DuckDB.
+- SQL-first: use familiar DuckDB syntax, functions, and tooling.
+
+It's easy to [get started writing queries](/docs/sql), and the [Tailpipe Hub](https://hub.tailpipe.io) provides ***hundreds of example queries*** that you can use or modify for your purposes. There are [example queries for each table](https://hub.tailpipe.io/plugins/turbot/aws/tables/aws_cloudtrail_log) in every plugin, and you can also [browse, search, and view the queries](https://hub.tailpipe.io/mods/turbot/tailpipe-mod-aws-dections/queries) in every published mod!
 
 ## Interactive Query Shell
diff --git a/docs/reference/cli/connect.md b/docs/reference/cli/connect.md
index b0d241c..bae37b4 100644
--- a/docs/reference/cli/connect.md
+++ b/docs/reference/cli/connect.md
@@ -4,7 +4,12 @@ title: tailpipe connect
 
 # tailpipe connect
 
-Return a connection string for a database with a schema determined by the provided parameters.
+Return the path of a SQL script that initializes DuckDB to use the Tailpipe database.
+
+The generated SQL script contains:
+- DuckDB extension installations (sqlite, ducklake)
+- Database attachment configuration
+- View definitions with optional filters
 
 ## Usage
 ```bash
@@ -32,15 +37,15 @@ tailpipe connect --from 2025-01-01
 ```
 
 ```bash
-/home/jon/.tailpipe/data/default/tailpipe_20250115140447.db
+/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
 ```
 
 > [!NOTE]
-> You can use this connection string with DuckDB to directly query the Tailpipe database.
+> You can use this SQL script with DuckDB to directly query the Tailpipe database.
 To ensure compatibility with tables that include JSON columns, make sure you’re using DuckDB version 1.1.3 or later.
 >
 > ```bash
-> duckdb /home/jon/.tailpipe/data/default/tailpipe_20241212134120.db
+> duckdb -init /Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
 > ```
 
 Connect with no filter, show output as json:
 
 ```bash
 tailpipe connect --output json
 ```
 
 ```bash
-{"database_filepath":"/Users/jonudell/.tailpipe/data/default/tailpipe_20250129204416.db"}
+{"init_script_path":"/Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204828.sql"}
 ```
diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md
index 9dfd48b..bb95781 100644
--- a/docs/reference/glossary.md
+++ b/docs/reference/glossary.md
@@ -30,7 +30,7 @@ A detection is a Tailpipe query, optionally bundled into a benchmark, that runs
 
 ## DuckDB
 
-Tailpipe uses DuckDB, an embeddable column-oriented database. DuckDB reads the Parquet files created by `tailpipe collect` and enables queries against that data.
+Tailpipe uses DuckDB for fast local analytics over Parquet data. DuckLake maintains a lightweight metadata catalog (`metadata.sqlite`) that references the Parquet files collected by Tailpipe, so you query with standard DuckDB SQL while benefiting from partition pruning and a lakehouse-style layout.
 
 ## Format
 A [format](/docs/reference/config-files/format) describes the layout of the source data so that it can be collected into a table.
@@ -40,7 +40,7 @@ A [format type](/docs/reference/config-files/format#format-types) defines the pa
 
 ## Hive
 
-A tree of Parquet files in the Tailpipe workspace (by default,`~/.tailpipe/data/default`). The `tailpipe.db` in `~/.tailpipe/data/default` (and derivatives created by `tailpipe connect`, e.g. `tailpipe_20241212152506.db`) are thin wrappers that materialize views over the Parquet data.
+A tree of Parquet files in the Tailpipe workspace (by default, `~/.tailpipe/data/default`), organized with hive-style partition keys (for example, `tp_table=.../tp_partition=.../tp_index=.../year=YYYY/month=mm`). DuckLake’s catalog (`metadata.sqlite`) points to these files to enable efficient SQL queries.
 
 ## Index

From 6ebadc009b935bc9153b2a6f59914f7f67acd392 Mon Sep 17 00:00:00 2001
From: Puskar Basu
Date: Fri, 19 Sep 2025 18:22:44 +0530
Subject: [PATCH 2/3] tp_date changes

---
 docs/query/snapshots.md  |  6 +++---
 docs/sql/index.md        | 18 +++++++++---------
 docs/sql/querying-ips.md |  2 +-
 docs/sql/tips.md         |  4 ++--
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/docs/query/snapshots.md b/docs/query/snapshots.md
index b553d24..5b65239 100644
--- a/docs/query/snapshots.md
+++ b/docs/query/snapshots.md
@@ -16,7 +16,7 @@ To upload snapshots to Turbot Pipes, you must either [log in via the `powerpipe
 
 To take a snapshot and save it to [Turbot Pipes](https://turbot.com/pipes/docs), simply add the `--snapshot` flag to your command.
 
 ```bash
-powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --snapshot
+powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --snapshot
 ```
 
 ```bash
@@ -34,13 +34,13 @@ powerpipe benchmark run cloudtrail_log_detections --share
 
 You can set a snapshot title in Turbot Pipes with the `--snapshot-title` argument.
 
 ```bash
-powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --share --snapshot-title "Recent Cloudtrail log lines"
+powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --share --snapshot-title "Recent Cloudtrail log lines"
 ```
 
 If you wish to save the snapshot to a different workspace, such as an org workspace, you can use the `--snapshot-location` argument with `--share` or `--snapshot`:
 
 ```bash
-powerpipe query run "select * from aws_cloudtrail_log order by tp_date desc limit 1000" --share --snapshot-location my-org/my-workspace
+powerpipe query run "select * from aws_cloudtrail_log order by tp_timestamp desc limit 1000" --share --snapshot-location my-org/my-workspace
 ```
diff --git a/docs/sql/index.md b/docs/sql/index.md
index d64e2df..5c3e365 100644
--- a/docs/sql/index.md
+++ b/docs/sql/index.md
@@ -27,7 +27,7 @@ You can **filter** rows where columns only have a specific value:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type
 from
@@ -41,7 +41,7 @@ or a **range** of values:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type
 from
@@ -55,7 +55,7 @@ or match a **pattern**:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -70,7 +70,7 @@ You can **filter on multiple columns**, joined by `and` or `or`:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -78,7 +78,7 @@ from
   aws_cloudtrail_log
 where
   event_name = 'UpdateTrail'
-  and tp_date > date '2024-11-06';
+  and tp_timestamp > date '2024-11-06';
 ```
 
 You can **sort** your results:
@@ -86,7 +86,7 @@ You can **sort** your results:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -101,7 +101,7 @@ You can **sort on multiple columns, ascending or descending**:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
@@ -109,7 +109,7 @@ from
   aws_cloudtrail_log
 order by
   aws_region asc,
-  tp_date desc;
+  tp_timestamp desc;
 ```
 
 You can group and use standard aggregate functions. You can **count** results:
@@ -147,7 +147,7 @@ or exclude **all but one matching row**:
 ```sql
 select distinct on (event_type)
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type,
   event_name
diff --git a/docs/sql/querying-ips.md b/docs/sql/querying-ips.md
index 0350088..d22e486 100644
--- a/docs/sql/querying-ips.md
+++ b/docs/sql/querying-ips.md
@@ -12,7 +12,7 @@ You can find requests **from a specific IP address**:
 ```sql
 select
   tp_partition,
-  tp_date,
+  tp_timestamp,
   aws_region,
   event_type
 from
diff --git a/docs/sql/tips.md b/docs/sql/tips.md
index 55a44b9..614f0e8 100644
--- a/docs/sql/tips.md
+++ b/docs/sql/tips.md
@@ -20,10 +20,10 @@ select count(*) from aws_cloudtrail_log where partition = 'prod'
 select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789
 ```
 
-*Date*. Each file contains log data for one day. You can filter to include only files for that day.
+*Timestamp*. Filter by timestamp to efficiently select only the matching files.
 
 ```sql
-select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_date = '2024-12-01'
+select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_timestamp > date '2024-12-01'
 ```
 
 The [hive directory structure](/docs/collect/configure#hive-partitioning) enables you to exclude large numbers of Parquet files.
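+
+To check that a filter is actually pruning files, you can inspect the query plan. This is a general DuckDB technique rather than a Tailpipe-specific feature, and the exact plan output varies by DuckDB version:
+
+```sql
+-- EXPLAIN shows the scan operator and the filters pushed down to it;
+-- a pruned scan reads far fewer files than a full scan.
+explain
+select count(*) from aws_cloudtrail_log where partition = 'prod' and index = 123456789 and tp_timestamp > date '2024-12-01'
+```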
From 1e26c79308a90b29d387b3cd882c6dcb91f7e172 Mon Sep 17 00:00:00 2001
From: Puskar Basu
Date: Fri, 19 Sep 2025 18:33:59 +0530
Subject: [PATCH 3/3] duckdb version

---
 docs/collect/configure.md     | 2 +-
 docs/reference/cli/connect.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/collect/configure.md b/docs/collect/configure.md
index 2042cfd..fa8aaf4 100644
--- a/docs/collect/configure.md
+++ b/docs/collect/configure.md
@@ -17,7 +17,7 @@ If your logs are not in a standard format or are not currently supported by a pl
 
 Tailpipe creates DuckLake tables based on the data and metadata that it discovers in the [workspace](#workspaces), along with the filter rules.
 
-When you run `tailpipe query` or `tailpipe connect` with filter arguments (`--from`, `--to`, `--index`, `--partition`), Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and adds a filtered view over the tables.
+When you run `tailpipe query` or `tailpipe connect` with filter arguments (`--from`, `--to`, `--index`, `--partition`), Tailpipe finds all the tables in the workspace according to the [hive directory layout](/docs/collect/configure#hive-partitioning) and filters the view of each table accordingly.
 
 You can see what tables are available with the `tailpipe table list` command.
 
diff --git a/docs/reference/cli/connect.md b/docs/reference/cli/connect.md
index bae37b4..f89a792 100644
--- a/docs/reference/cli/connect.md
+++ b/docs/reference/cli/connect.md
@@ -42,7 +42,7 @@ tailpipe connect --from 2025-01-01
 
 > [!NOTE]
 > You can use this SQL script with DuckDB to directly query the Tailpipe database.
-To ensure compatibility with tables that include JSON columns, make sure you’re using DuckDB version 1.1.3 or later.
+> To ensure compatibility with DuckLake features, make sure you’re using DuckDB version 1.4.0 or later.
 >
 > ```bash
 > duckdb -init /Users/pskrbasu/.tailpipe/data/default/tailpipe_init_20250918204456.sql
 > ```
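+
+As a quick sanity check, you can confirm the version from inside the DuckDB shell (a general DuckDB query, shown here as a suggestion):
+
+```sql
+-- Returns the DuckDB version string; confirm it is 1.4.0 or later.
+select version();
+```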