Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shrink bloom filters when serializing partitions #1172

Merged
merged 3 commits into from Nov 20, 2020

Conversation

lava
Copy link
Member

@lava lava commented Nov 16, 2020

馃摂 Description

Shrink bloom filters when storing partitions.

Running this on the 5GiB suricata dataset and comparing against a CI build from latest master there seems to be a reduction of 10-15 MiB per partition, so a significant chunk of the 19 MiB we expect for synopsis bloom filters in total.

vast-bloomresize$ ls -lh
total 759M
-rw-r--r-- 1 benno benno 92M Nov 20 01:01 0fc5ac1b-0bfd-4912-bece-521a77338370
-rw-r--r-- 1 benno benno 96M Nov 20 01:01 20f6569e-9833-4c86-bef0-370095a7b4ac
-rw-r--r-- 1 benno benno 18M Nov 20 01:03 3ad75efb-434c-44c0-86a8-2302728e6f80
-rw-r--r-- 1 benno benno 93M Nov 20 00:59 7c12d1b4-ef11-4bcb-92ba-e0f300ccb01c
-rw-r--r-- 1 benno benno 98M Nov 20 01:00 afb6ad5d-4f1b-4534-b328-9a8ad037c204
-rw-r--r-- 1 benno benno 91M Nov 20 01:02 c065e7a5-c936-4c47-9a93-08f85ed3ab98
-rw-r--r-- 1 benno benno 91M Nov 20 01:00 ca76cbc7-3b74-40f8-b5c5-a74dc93bed5c
-rw-r--r-- 1 benno benno 95M Nov 20 01:03 e543a4c2-39c1-4e80-8721-320678b2fff5
-rw-r--r-- 1 benno benno 90M Nov 20 00:59 ea1d7114-b65d-458f-a7e4-2d1f108a48ea
-rw-r--r-- 1 benno benno 720 Nov 20 01:03 index.bin


vast-master$ ls -lh
total 906M
-rw-r--r-- 1 benno benno 109M Nov 19 17:26 1b819c77-0ed6-4f7e-b357-e192081bbbcf
-rw-r--r-- 1 benno benno 111M Nov 19 17:28 27cda834-a790-406a-833e-e979a27f776e
-rw-r--r-- 1 benno benno 110M Nov 19 17:28 2e182fe7-461d-40b8-bc80-dc5bede6120f
-rw-r--r-- 1 benno benno 106M Nov 19 17:26 42aaad1b-d490-414f-a1ba-d3eb1604d201
-rw-r--r-- 1 benno benno 108M Nov 19 17:29 6304aef2-a6ec-462b-af39-b9d07408dac7
-rw-r--r-- 1 benno benno  34M Nov 19 17:29 7019a0e7-14bb-44d0-a900-f3549ca2fba2
-rw-r--r-- 1 benno benno 107M Nov 19 17:25 7f3cd1bb-3add-416b-95d4-e1e42198f335
-rw-r--r-- 1 benno benno 112M Nov 19 17:27 8a6c9d05-fa4f-486e-a9c6-6b9552227693
-rw-r--r-- 1 benno benno 113M Nov 19 17:29 b1f06f77-459e-4487-9b90-ff950a6244d2
-rw-r--r-- 1 benno benno  720 Nov 19 17:29 index.bin


Looking at the address synopsis in particular: (for reference, all of these were exactly 1.2MiB before the changes in this PR)

$ ~/src/vast/build-release/bin/lsvast -h ea1d7114-b65d-458f-a7e4-2d1f108a48ea | grep "ip: opaque_synopsis"
    suricata.smtp.dest_ip: opaque_synopsis (569 B)
    suricata.smtp.src_ip: opaque_synopsis (313 B)
    suricata.ftp.src_ip: opaque_synopsis (377 B)
    suricata.dns.dest_ip: opaque_synopsis (48.1 KiB)
    suricata.http.src_ip: opaque_synopsis (377 B)
    suricata.fileinfo.src_ip: opaque_synopsis (28.0 KiB)
    suricata.http.dest_ip: opaque_synopsis (42.7 KiB)
    suricata.fileinfo.dest_ip: opaque_synopsis (14.8 KiB)
    suricata.flow.dest_ip: opaque_synopsis (76.9 KiB)
    suricata.flow.src_ip: opaque_synopsis (76.5 KiB)
    suricata.ftp.dest_ip: opaque_synopsis (1.5 KiB)
    suricata.dns.src_ip: opaque_synopsis (377 B)
    suricata.tls.src_ip: opaque_synopsis (377 B)
    suricata.tls.dest_ip: opaque_synopsis (9.6 KiB)

馃摑 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

馃幆 Review Instructions

Copy link
Member

@tobim tobim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure the approach taken here will lead to a complete solution, see the other comment.

libvast/vast/qualified_record_field.hpp Show resolved Hide resolved
libvast/vast/system/partition.hpp Outdated Show resolved Hide resolved
@mavam mavam added the performance Improvements or regressions of performance label Nov 17, 2020
@lava lava force-pushed the topic/bloom-filter-sizing branch 3 times, most recently from bb70f24 to d2bd1ee Compare November 17, 2020 17:55
@mavam
Copy link
Member

mavam commented Nov 17, 2020

[..] there seems to be a reduction of 10-15 MiB per partition, so a significant chunk of the 19 MiB [..]

So 50-80% reduction in size?

Copy link
Member

@tobim tobim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good already.
The one thing I'm missing is an in-process measurement of the synopsis sizes. The overall size of the meta index should probably be reported in vast status anyway.

libvast/src/system/index.cpp Outdated Show resolved Hide resolved
libvast/vast/fwd.hpp Outdated Show resolved Hide resolved
libvast/vast/address_synopsis.hpp Outdated Show resolved Hide resolved
@lava lava force-pushed the topic/bloom-filter-sizing branch 2 times, most recently from f65da53 to de33222 Compare November 19, 2020 14:59
@lava lava marked this pull request as ready for review November 19, 2020 15:57
@lava lava force-pushed the topic/bloom-filter-sizing branch 2 times, most recently from 2aad26d to 1d09499 Compare November 20, 2020 09:13
Copy link
Member

@tobim tobim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

When creating a new address synopsis structure, it is sized
for the worst-case scenario of every event in the synopsis
having a unique ip address.

When finalizing a synopsis, the exact number of unique ips
is known and the minimal size for the synopsis can be computed,
so that's what we do.
The `bloom_filter_parameters` class has no input validation,
so users have to rely on the `evaluate` function to see if a
given combination of parameters is valid.

However, this function can die with an assertion failure if some
parameters were set incorrectly, creating a catch-22 situation.

Additionally, this meant that these checks were not performed
in release builds with assertions disabled, potentially leading
to bloom filters with nonsensical settings.
@lava lava force-pushed the topic/bloom-filter-sizing branch 2 times, most recently from ea443e7 to 0211bd7 Compare November 20, 2020 15:41
Use a std::set over a std::vector + bloom_filter combination,
since the latter confused multiple independent reviewers.
This also has the benefit of allowing for stricter checks on
where and how the buffered_address_synopsis is used, and we
don't have to rely on the `type` member being annotated with
stringified bloom filter parameters.

Also, pass the actor handle of the index together with
the persist atom, as opposed to when spawning a partition
actor.
@lava lava merged commit 5078b9f into master Nov 20, 2020
@lava lava deleted the topic/bloom-filter-sizing branch November 20, 2020 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improvements or regressions of performance
Projects
None yet
3 participants