
Increase the default segment size to 1 GiB #1166

Merged
2 commits merged into master from topic/increase-default-segment-size on Nov 13, 2020

Conversation

mavam
Member

@mavam mavam commented Nov 13, 2020

📔 Description

The 4x increase in segment size reduces range-map fragmentation in the archive. It will also reduce startup time for larger archives. Larger file sizes should not introduce a latency problem, since we mmap the segments.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

All at once.

@mavam mavam added the performance Improvements or regressions of performance label Nov 13, 2020
@mavam mavam requested review from a team and dominiklohmann and removed request for a team November 13, 2020 10:57
Member

@dominiklohmann left a comment


Startup and export speed seem unchanged; import speed increases with larger segments. Setting it to 2 GiB causes segfaults on my machine.

128 MiB

❯ rm -rf vast.db; repeat 10000; do gunzip -c integration/data/zeek/conn.log.gz; done | time vast --max-segment-size=128 -vquiet -N import --batch-encoding=msgpack zeek
user=1887.21s system=250.88s cpu=394% total=9:01.61

❯ hyperfine --warmup=3 --shell=zsh --min-runs=50 -- "vast -vquiet -N export -n 1 ascii '\"zeek\" in #type'"
Benchmark #1: vast -vquiet -N export -n 1 ascii '"zeek" in #type'
  Time (mean ± σ):      1.076 s ±  0.139 s    [User: 511.8 ms, System: 297.1 ms]
  Range (min … max):    1.030 s …  1.800 s    50 runs

1 GiB

❯ rm -rf vast.db; repeat 10000; do gunzip -c integration/data/zeek/conn.log.gz; done | time vast --max-segment-size=1024 -vquiet -N import --batch-encoding=msgpack zeek
user=1766.27s system=233.89s cpu=394% total=8:27.49

❯ hyperfine --warmup=3 --shell=zsh --min-runs=50 -- "vast -vquiet -N export -n 1 ascii '\"zeek\" in #type'"
Benchmark #1: vast -vquiet -N export -n 1 ascii '"zeek" in #type'
  Time (mean ± σ):      1.069 s ±  0.010 s    [User: 591.1 ms, System: 330.0 ms]
  Range (min … max):    1.053 s …  1.094 s    50 runs

2 GiB

❯ rm -rf vast.db; repeat 10000; do gunzip -c integration/data/zeek/conn.log.gz; done | time vast --max-segment-size=2048 -vquiet -N import --batch-encoding=msgpack zeek
user=507.01s system=48.39s cpu=463% total=1:59.77
zsh: segmentation fault  repeat 10000; do; gunzip -c integration/data/zeek/conn.log.gz; done

@mavam
Member Author

mavam commented Nov 13, 2020

Yeah, writing in larger strides intuitively increases throughput as well, so an additional import speedup makes sense.

A backtrace for the segfault would help a lot.

@dominiklohmann
Member

> A backtrace for the segfault would help a lot.

The maximum size of a FlatBuffers builder is 2 GiB.

The 4x increase in segment size decreases the range map fragmentation in
the archive. It also will reduce startup time for larger archives. There
should not be a latency problem with larger file sizes since we mmap the
segments.
@mavam mavam force-pushed the topic/increase-default-segment-size branch from be4a358 to d2fbcfa Compare November 13, 2020 15:47
@mavam mavam merged commit ed2ddd8 into master Nov 13, 2020
@mavam mavam deleted the topic/increase-default-segment-size branch November 13, 2020 20:34