Add Queue source with SQS implementation #5148

rdettai · 2024-06-21T13:44:22Z

Description

This PR proposes a generic implementation of a "queue" source. For now, only an implementation for AWS SQS, with its data backed by AWS S3, is exposed to users. Google Pubsub as the queue implementation and inlined data (i.e. messages containing the data itself rather than a link to the object store) will come next.

We use the shard API to provide deduplication of messages. For the current implementation, where the source data is stored on S3, deduplication is performed on the object URI.
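To illustrate the idea of deduplicating on the object URI, here is a minimal sketch that uses an in-memory map as a stand-in for the shared state. `InMemorySharedState`, `Claim` and `try_claim` are hypothetical names chosen for this example. They are not the shard API and not the types introduced by this PR.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Outcome of trying to claim a partition (here, an object URI).
#[derive(Debug, PartialEq, Eq)]
pub enum Claim {
    /// First time this URI is seen: the caller should process the object.
    Acquired,
    /// The URI is already tracked; carries the owner recorded earlier.
    AlreadyClaimed(String),
}

/// Hypothetical in-memory stand-in for the shared state (assumption:
/// the real state lives in the metastore behind the shard API).
#[derive(Default)]
pub struct InMemorySharedState {
    owners: HashMap<String, String>,
}

impl InMemorySharedState {
    /// Records `owner` for `object_uri` unless the URI is already known.
    pub fn try_claim(&mut self, object_uri: &str, owner: &str) -> Claim {
        match self.owners.entry(object_uri.to_string()) {
            Entry::Occupied(entry) => Claim::AlreadyClaimed(entry.get().clone()),
            Entry::Vacant(entry) => {
                entry.insert(owner.to_string());
                Claim::Acquired
            }
        }
    }
}
```

With this shape, a second message pointing at the same object is detected as soon as it is received, before any data is downloaded.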

High-level summary of the abstractions that are part of the generic implementation:

- Processor exposes the exact same methods as the Source trait but does not implement it directly. Instead, the concrete queue sources (e.g. SqsSource) wrap the Processor.
- A pipeline of message states:
  - RawMessage: the message as received from the queue
  - PreProcessedPayload: the message went through the minimal transformation needed to discover its partition id
  - CheckpointedMessage: the message was checked against the shared state (shard API) and is now ready to be processed
  - InProgressMessage: the message that is actively being read
- QueueSharedState is an abstraction over the shard API. By calling open_shard upon reception of the messages, we avoid costly redundant processing when receiving a duplicate message.
- QueueLocalState represents the state machine of the messages as they are processed by the indexing pipeline.
- VisibilityTaskHandle is a task that extends the message visibility when required (needs to be reworked).
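The pipeline of message states listed above can be sketched as a chain of types where each transition consumes the previous state. The type names come from this description, but every field and method below (`receipt`, `pre_process`, `checkpoint`, `start`) is an assumption made for illustration. The actual definitions in the PR may differ.

```rust
/// The message as received from the queue (fields are hypothetical).
pub struct RawMessage {
    pub receipt: String,
    pub body: String,
}

/// The message after the minimal parsing needed to find its partition id.
pub struct PreProcessedPayload {
    pub receipt: String,
    pub partition_id: String,
}

/// The message after it was checked against the shared state.
pub struct CheckpointedMessage {
    pub receipt: String,
    pub partition_id: String,
}

/// The message that is actively being read.
pub struct InProgressMessage {
    pub receipt: String,
    pub partition_id: String,
    pub bytes_read: u64,
}

impl RawMessage {
    /// Assumption: the body is simply the object URI, used as partition id.
    pub fn pre_process(self) -> Result<PreProcessedPayload, String> {
        let partition_id = self.body.trim().to_string();
        if partition_id.is_empty() {
            return Err("empty message body".to_string());
        }
        Ok(PreProcessedPayload { receipt: self.receipt, partition_id })
    }
}

impl PreProcessedPayload {
    /// `already_processed` stands in for the lookup in the shared state.
    /// Returns `None` for a duplicate, which needs no further processing.
    pub fn checkpoint(self, already_processed: bool) -> Option<CheckpointedMessage> {
        if already_processed {
            return None;
        }
        Some(CheckpointedMessage { receipt: self.receipt, partition_id: self.partition_id })
    }
}

impl CheckpointedMessage {
    /// Starts reading the message from the beginning.
    pub fn start(self) -> InProgressMessage {
        InProgressMessage { receipt: self.receipt, partition_id: self.partition_id, bytes_read: 0 }
    }
}
```

Because each method takes `self` by value, a message cannot be used in an earlier state once it has moved forward, which is one way to make invalid transitions fail at compile time.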

TODO:

TODO in subsequent PRs:

GCP Pubsub (small)
data within the queue payload (small)
shard garbage collection (medium)
improve the visibility extension logic (medium)

How was this PR tested?

This PR contains unit tests and higher-level tests that use LocalStack.

github-actions · 2024-06-21T14:14:02Z

On SSD:

Average search latency is 1.01x that of the reference (lower is better).
Ref run id: 2337, ref commit: 4ade7b5

On GCS:

Average search latency is 0.981x that of the reference (lower is better).
Ref run id: 2339, ref commit: 4ade7b5

quickwit/quickwit-indexing/src/source/queue_sources/processor.rs

fulmicoton · 2024-06-25T08:41:59Z

We need to handle transient and non-transient errors differently, e.g.:
- error in the message parsing -> non-transient
- disconnection while streaming the file -> transient
- gzip corruption -> non-transient
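One possible shape for this distinction, as a rough sketch: an error enum with an `is_transient` method that drives what happens to the message. `QueueSourceError`, `MessageAction` and `action_for` are hypothetical names. The mapping to "retry later" versus "discard" is an assumption about the desired behavior and does not describe what the PR implements.

```rust
/// Hypothetical error type covering the three cases mentioned above.
#[derive(Debug)]
pub enum QueueSourceError {
    /// The queue message could not be parsed.
    MalformedMessage(String),
    /// The connection dropped while streaming the file.
    StreamDisconnected(String),
    /// The file content is not valid gzip.
    CorruptedGzip(String),
}

impl QueueSourceError {
    /// A transient error may succeed on retry; the others never will.
    pub fn is_transient(&self) -> bool {
        matches!(self, QueueSourceError::StreamDisconnected(_))
    }
}

/// What to do with the message after a failure (assumed policy).
#[derive(Debug, PartialEq, Eq)]
pub enum MessageAction {
    /// Leave the message in the queue so that it is delivered again.
    RetryLater,
    /// Stop retrying, since the failure is deterministic.
    Discard,
}

pub fn action_for(err: &QueueSourceError) -> MessageAction {
    if err.is_transient() {
        MessageAction::RetryLater
    } else {
        MessageAction::Discard
    }
}
```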

fulmicoton · 2024-07-02T07:46:27Z

quickwit/quickwit-indexing/src/source/doc_file_reader.rs

+    ) -> anyhow::Result<BatchBuilder> {
+        let mut batch_builder = BatchBuilder::new(source_type);
+        while batch_builder.num_bytes < BATCH_NUM_BYTES_LIMIT {
+            let mut buf = String::new();


why allocate at every single loop iteration?

Copied from main

quickwit/quickwit/quickwit-indexing/src/source/file_source.rs

Line 80 in 84572ea

let mut doc_line = String::new();

I guess we could allocate one big memory region of size BATCH_NUM_BYTES_LIMIT, load data into it and slice it by row. I think it's a bit far from this PR's concern though. Do you think these allocations are costly enough for opening an issue?
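For reference, the "one big memory region sliced by row" idea could look roughly like the sketch below. It relies on `BufRead::read_line` appending to the buffer it is given. `LineBatch` and its methods are hypothetical and unrelated to the actual `BatchBuilder`; the byte limit is passed as a parameter instead of using `BATCH_NUM_BYTES_LIMIT`.

```rust
use std::io::{self, BufRead};

/// Hypothetical batch that stores all lines in one contiguous buffer.
pub struct LineBatch {
    data: String,
    /// End offset (exclusive) of each line within `data`.
    line_ends: Vec<usize>,
}

impl LineBatch {
    /// Reads lines until at least `num_bytes_limit` bytes are buffered or
    /// the reader is exhausted. Like the loop under review, the last line
    /// may push the batch slightly over the limit.
    pub fn read_from<R: BufRead>(reader: &mut R, num_bytes_limit: usize) -> io::Result<LineBatch> {
        // Single up-front allocation instead of one String per line.
        let mut data = String::with_capacity(num_bytes_limit);
        let mut line_ends = Vec::new();
        while data.len() < num_bytes_limit {
            // `read_line` appends to `data` and returns the bytes read.
            let num_bytes_read = reader.read_line(&mut data)?;
            if num_bytes_read == 0 {
                break; // end of input
            }
            line_ends.push(data.len());
        }
        Ok(LineBatch { data, line_ends })
    }

    /// Iterates over the lines as slices of the shared buffer.
    /// Each slice keeps its trailing newline, if any.
    pub fn lines(&self) -> impl Iterator<Item = &str> + '_ {
        let mut start = 0;
        self.line_ends.iter().map(move |&end| {
            let line = &self.data[start..end];
            start = end;
            line
        })
    }
}
```

The buffer can still reallocate once if the final line overshoots the limit. Whether this is worth it depends on how the lines are consumed downstream: if each document must be owned separately, the per-line allocation comes back at that point.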

rdettai self-assigned this Jun 21, 2024

rdettai changed the base branch from main to uri-file-src-params June 21, 2024 13:45

rdettai requested a review from guilload June 21, 2024 15:19

rdettai force-pushed the sqs branch 4 times, most recently from 93cce59 to 056f785 Compare June 24, 2024 14:01