Skip to content

Use dedicated partitions for each layout #2096

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 21, 2022

Conversation

tobim
Copy link
Member

@tobim tobim commented Feb 18, 2022

This makes it possible to use homogeneous partitions, so only the data from one particular event type ends up in a partition.

Requires #2120!

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

@tobim tobim requested a review from a team February 18, 2022 14:13
@mavam
Copy link
Member

mavam commented Feb 18, 2022

I'd be curious to benchmark this. If I get it right, it introduces concurrency as function of unique schemas/layouts.

@tobim tobim force-pushed the story/sc-31754/homogeneous-partition branch from d5048fd to d92ea68 Compare February 27, 2022 19:49
@tobim tobim changed the base branch from master to topic/fix-explorer-shutdown February 27, 2022 19:50
@tobim tobim force-pushed the topic/fix-explorer-shutdown branch from 0a29187 to e558403 Compare March 1, 2022 17:07
@tobim tobim force-pushed the story/sc-31754/homogeneous-partition branch from d92ea68 to f636b54 Compare March 2, 2022 13:59
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this briefly in a call earlier, and I wanted to write my thoughts down somewhere: This is mostly done, but some important adjustments are yet missing.

  • The memory usage for this is through the roof. We need to scale down the default partition size.
  • Queries currently go to all active partitions, which is just unnecessary now. We probably should register active partitions with the meta index catalog, although I'm not quite sure if that should be part of this PR. It's not critical, but definitely warrants a follow-up story that we tackle in the short- to medium-term.
  • The biggest issue in practice when using this is low-volume types never being flushed to disk, because the partition size is set globally. We need some timeout parameter after which we close an active partition, similar to how we have a both a batch-timeout and a batch-size for the table slices produced in the sources.

@mavam
Copy link
Member

mavam commented Mar 4, 2022

This makes sense. Regarding memory usage, I want to raise the concern of slicing out types from record batches that keep alive large chunks of memory while only a small fraction is ever needed. This is the detach discussion we had the other day. My intuition is that this is now amplified. I'd like to figure out a way to measure it.

@dominiklohmann
Copy link
Member

@mavam We already detach the schema in table slices, so this is not a concern.

Base automatically changed from topic/fix-explorer-shutdown to master March 6, 2022 07:01
tobim added 2 commits March 6, 2022 12:55
Since we now have a dedicated active partition for each layout,
importing suricata data doesn't push out zeek conns any more.
The solution is to import one more zeek conn event to make sure
the previous data is written to disk.
@tobim tobim force-pushed the story/sc-31754/homogeneous-partition branch from f636b54 to acff288 Compare March 6, 2022 12:11
We now have more partitions, so the output of lsvast grew larger.
@tobim
Copy link
Member Author

tobim commented Mar 10, 2022

We probably should register active partitions with the meta index catalog, although I'm not quite sure if that should be part of this PR. It's not critical, but definitely warrants a follow-up story that we tackle in the short- to medium-term.

I veto this approach. The catalog has no business with active partitions and it must stay that way. We want to be able to run the active and passive parts of the database on different nodes, and such a change would only add communication overhead.
I think the query should get sent to each active partition, but the lookup function should consult the active partition synopsis. This is completely orthogonal though and it should not be done in this PR.

@mavam
Copy link
Member

mavam commented Mar 10, 2022

This is completely orthogonal though and it should not be done in this PR.

👆 That. Exactly. Let's not couple things together can operate in a decoupled manner.

For this PR, it's completely fine to not run queries over active partitions (or "partition builders").

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the changes we discussed, @tobim. I approve of yours earlier in the PR, but would really like to see this PR get reviewed and tested by someone else. @dispanser maybe?

@dominiklohmann dominiklohmann force-pushed the story/sc-31754/homogeneous-partition branch from 00838ec to 4c35bd2 Compare March 11, 2022 16:44
@dominiklohmann dominiklohmann added feature New functionality and removed enhancement labels Mar 17, 2022
@dominiklohmann dominiklohmann force-pushed the story/sc-31754/homogeneous-partition branch from c523996 to 70614f1 Compare March 17, 2022 16:05
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobim I am approving of your changes. If you approve of mine, please go ahead with the merge.

@dominiklohmann dominiklohmann added feature New functionality and removed feature New functionality labels Mar 18, 2022
@tobim
Copy link
Member Author

tobim commented Mar 21, 2022

I approve of @dominiklohmann 's changes as well.

@tobim tobim merged commit 3482d21 into master Mar 21, 2022
@tobim tobim deleted the story/sc-31754/homogeneous-partition branch March 21, 2022 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants