feat(_mapping): timestamp pushdown + column-hints fast path (on top of #6439)#6443
feat(_mapping): timestamp pushdown + column-hints fast path (on top of #6439)#6443congx4 wants to merge 3 commits into
Conversation
fef38b8 to
5bab546
Compare
e8a723e to
0cd3d6c
Compare
| #[serde(default)] | ||
| pub end_timestamp: Option<i64>, | ||
| #[serde(default)] | ||
| pub fields: Option<String>, |
There was a problem hiding this comment.
yes, they are. I will rename it to field_pattern to align with your code.
|
|
||
| #[test] | ||
| fn empty_query_string_yields_none() { | ||
| let params: IndexMappingQueryParams = serde_urlencoded::from_str("").unwrap(); |
There was a problem hiding this comment.
Can we use serde_qs? I believe it's already a dependency.
| .map(|index_metadata| { | ||
| let field_mappings = &index_metadata.index_config.doc_mapping.field_mappings; | ||
| let mut properties = build_properties(field_mappings); | ||
| if !filter.is_empty() { |
There was a problem hiding this comment.
Can we push down the filter to the leaves and single split list fields requests so we don't need to do this here?
There was a problem hiding this comment.
we have pushed down the filter to the leaves. https://github.com/quickwit-oss/quickwit/pull/6443/changes#diff-390322509891292c93be51dccd4522bb2b0d352e41f3be483d41e92cc4bfdc68R278
| let raw = params.fields.as_deref()?; | ||
| let tokens: Vec<String> = raw | ||
| .split(',') | ||
| .map(|s| s.trim().to_string()) |
There was a problem hiding this comment.
nit: in general, it's better to allocate after filtering, but it does not matter here.
| } | ||
|
|
||
| fn collect_declared_top_level_names(indexes_metadata: &[IndexMetadata]) -> HashSet<String> { | ||
| let mut names = HashSet::new(); |
There was a problem hiding this comment.
nit: functional way might be nicer... depending on who you ask :)
| } | ||
|
|
||
| fn collect_declared_top_level_names(indexes_metadata: &[IndexMetadata]) -> HashSet<String> { | ||
| let mut names = HashSet::new(); |
There was a problem hiding this comment.
Don't forget to use with_capacity if you know the final size in advance. It also signals to the reader that there will be no filtering or "explosion"
Two small additions to the ES-compat `_mapping(s)` endpoint that
together let downstream callers (e.g. Trino's ES connector) skip the
expensive `list_fields` scan in the common case.
Today `GET /_elastic/{index}/_mapping(s)` calls `list_fields` over every
published split. On indexes with hundreds of thousands of dynamic
fields this can take several seconds and runs into
`QW_FIELD_LIST_SIZE_LIMIT` (100k by default). This PR addresses both
pieces with no proto change:
1. Timestamp pushdown
New `?start_timestamp=…&end_timestamp=…` URL params on `_mapping(s)`,
forwarded into `ListFieldsRequest` verbatim. The metastore prunes
the candidate split set by time window before any leaf fan-out.
Unit is epoch seconds, half-open interval — matching the existing
`ListFieldsRequest` proto contract.
2. Column-hints fast path
New `?fields=…` URL param (comma-separated names). When every
requested name is a flat literal (no `*`, no `?`, no `.`) declared
in the union of the indexes' `doc_mapping`, the handler builds the
response straight from the declared mapping, filtered to those
names. No `list_fields` call, no split I/O.
Anything else (wildcards, dotted paths, names not in `doc_mapping`)
falls through to the full-mapping path: `list_fields` over the
splits in the time range, full unfiltered mapping returned — same
shape as today, just with the timestamp-pushdown optimization
applied.
Notes:
- Unknown query params are silently ignored (no `deny_unknown_fields`)
to stay compatible with standard ES clients that pass `pretty`,
`ignore_unavailable`, `allow_no_indices`, etc.
- No proto change. Stays on existing `ListFieldsRequest`.
- `IndexMappingQueryParams` parser and the new
`ElasticsearchMappingsResponse::from_doc_mapping_filtered` are
unit-tested in their respective modules.
- rename `fields` query param to `field_patterns` to mirror `ListFieldsRequest.field_patterns` - switch tests to `serde_qs` (already a workspace dep, matches the bulk-query-params test style) - move the declared-field filter out of `mappings.rs` and into the rest_handler fast path: trim `field_mappings` in place before calling `from_doc_mapping`, dropping `from_doc_mapping_filtered` entirely (dynamic fields were already filtered at the leaves via `ListFieldsRequest.field_patterns`) - nits in `parse_field_patterns`: trim/filter before allocating the owned String per token - nits in `collect_declared_top_level_names`: functional flat_map style
5bab546 to
2f719f6
Compare
- drop the fast-path declared-field `retain(...)` in rest_handler. `field_patterns` is now hint-only: it triggers the fast path (skip `list_fields`) when every pattern matches a flat declared field, and is pushed down to the leaves for dynamic-field filtering. Both fast and slow paths now return the full declared schema, matching slow- path semantics that existed before. - remove unused `serde_urlencoded` dev-dep from `quickwit-serve` and the workspace `Cargo.toml` (was already unused after switching tests to `serde_qs`). - `collect_declared_top_level_names`: switch back to a procedural form preallocated with `HashSet::with_capacity(sum)` to signal the upper bound — no filtering, no explosion.
Summary
This is PR #6436 ported on top of #6439 so the two can land together cleanly. Identical behavior to #6436; the only changes are the conflict resolutions required by #6439's renames.
If you're reviewing this PR after #6439 merges, the diff against
mainwill be exactly the PR #6436 diff. While #6439 is open, this PR's base isguilload/list-fieldsso GitHub only shows the column-hints + timestamp-pushdown changes, not #6439's churn.What's in this PR
Two small additions to the ES-compat
_mapping(s)endpoint that together let downstream callers (e.g. Trino's ES connector) skip the expensivelist_fieldsscan in the common case.Today
GET /_elastic/{index}/_mapping(s)callslist_fieldsover every published split. On indexes with hundreds of thousands of dynamic fields this can take several seconds, and over a certain threshold the leaf hitsQW_FIELD_LIST_SIZE_LIMIT(100k by default) and the request fails.Timestamp pushdown — new
?start_timestamp=…&end_timestamp=…URL params on_mapping(s), forwarded intoListFieldsRequestverbatim. The metastore prunes the candidate split set by time window before any leaf fan-out. Unit is epoch seconds, half-open interval — matching the existingListFieldsRequestproto contract.Column-hints fast path — new
?fields=…URL param (comma-separated names). When every requested name is a flat literal (no*, no?, no.) declared in the union of the indexes'doc_mapping, the handler builds the response straight from the declared mapping, filtered to those names. Nolist_fieldscall, no split I/O.Anything else (wildcards, dotted paths, names not in
doc_mapping) falls through to the full-mapping path:list_fieldsover the splits in the time range, full unfiltered mapping returned — same shape as today, just with the timestamp-pushdown optimization applied.Conflict resolutions vs #6436
Three conflicts were resolved to align with #6439's renames:
rest_handler.rs:ListFieldsRequest { fields: Vec::new(), ... }→field_patterns: Vec::new()(rename from Optimize list fields #6439). Keptparams.start_timestamp/params.end_timestampfrom feat(_mapping): timestamp pushdown + column-hints fast path #6436.mappings.rs: test'suse quickwit_proto::search::ListFieldsEntryResponse;→ListFieldsEntry(rename from Optimize list fields #6439).Cargo.toml: trivial — keptserde_with = "3.20"from Optimize list fields #6439 plus the newserde_urlencoded = "0.7"from feat(_mapping): timestamp pushdown + column-hints fast path #6436.Test plan
cargo build -p quickwit-serve— cleancargo clippy -p quickwit-serve --tests— cleancargo +nightly fmt --all -- --check— cleancargo nextest run -p quickwit-serve --lib elasticsearch_api::— 75 / 75 passTest coverage (same as #6436)
IndexMappingQueryParamsparser:start_timestamp/end_timestamp/fieldsaccepted in isolation and together; emptyfields=; unknown params silently ignored.ElasticsearchMappingsResponse::from_doc_mapping_filtered: keeps only requested names; object subtree preserved on top-level match; empty filter behaves identically tofrom_doc_mapping.collect_declared_top_level_names: unions field names across multiple indexes.parse_field_hints: empty / whitespace / commas-only / well-formed lists.🤖 Generated with Claude Code