Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simdjson dependency to libvast #1246

Merged

Conversation

ngrodzitski
Copy link
Contributor

📔 Description

Restore PR with rebased epic.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

@dominiklohmann
Copy link
Member

Instead of re-opening, feel free to rebase and force-push next time. We don't mind force pushes before reviews within a PR.

@ngrodzitski
Copy link
Contributor Author

Instead of re-opening, feel free to rebase and force-push next time. We don't mind force pushes before reviews within a PR.

Ok, I get it.

@@ -55,6 +55,11 @@ in {
configureFlags = old.configureFlags ++ [ "--enable-prof" "--enable-stats" ];
});
broker = final.callPackage ./broker {inherit stdenv; python = final.python3;};
simdjson = prev.simdjson.overrideAttrs (old: {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are plans for static-lib as the default option for build, but currently simdjson uses shared-lib. So a little adjustment is required.

simdjson/simdjson#1199

@ngrodzitski
Copy link
Contributor Author

@dominiklohmann, @tobim Please, review the changes to nix* I've made.

nix/overlay.nix Outdated Show resolved Hide resolved
Co-authored-by: tobim <tobim+github@fastmail.fm>
@tobim
Copy link
Member

tobim commented Dec 18, 2020

Not bad figuring the nix stuff out so fast. It's not particularly newcomer-friendly.

dominiklohmann and others added 7 commits December 18, 2020 13:50
This is unnecessary after 5b4b052, which I accidentally pushed to master
directly. Sorry about that one.
Remove check whether config file is a regular file
Co-authored-by: Matthias Vallentin <matthias@tenzir.com>
@mavam
Copy link
Member

mavam commented Dec 18, 2020

You almost make that sound like a "feature", @tobim. 😉

Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this out locally and it works great, it's really a lot faster already than the current parser in real world usage. Really great work so far!

Code-wise I really like the approach you've taken (and commented) with the conversion from JsonType to VastType. It certainly does its job well enough, and after having toyed with the code a I bit it is the way to go over implementing a visit-based approach.

Do you need any more input or want us to look at some things in-depth specifically, e.g., do you need further help with the test issues?

@ngrodzitski
Copy link
Contributor Author

ngrodzitski commented Jan 11, 2021

@dominiklohmann Currently I'm stuck with "Node suricata alert simdjson" integration test which compares a csv export results. Except this I'm going to fresh-look the code and that is it for phase 2.

@dominiklohmann
Copy link
Member

Sounds good. I'll take an in-depth look at the integration test issue tomorrow.

@dominiklohmann
Copy link
Member

@ngrodzitski I reviewed this at length today, and discussed with @mavam what we want to do with regards to the different escaping behaviors of the old and the new json readers.

  • We consider the current JSON import behavior to be broken with its escaping (it escaped \r\n to the string \r\n, which then got converted to \r\n during export).

  • We consider the simdjson import behavior to work as expected (no escaping during import, \r\n gets converted to the string \x0D\x0A during export where necessary).

  • We agreed that it was not worth the effort to fix the legacy JSON reader, because it will be removed anyways after this work is finished.

  • We agreed that the integration test references files between vast import json --simdjson and vast import json|suricata --simdjson may differ.

  • @mavam checked that Zeek does the same form of escaping for control characters (they use lowercase 0x0d over our 0x0D, but we can change that easily if need be).

  • I verified that newlines in CSV strings are okay per the specification as long as the strings are quoted, which they are in this case.

To summarize, we can move forward with this PR. Please update the remaining reference files accordingly. You can run sh build/vast/integration_Debug.sh -u (where Debug is your build type) to update them all automatically.

Once the CI is green I would like to merge this into the epic branch, do a few fixups and write a changelog entry and merge the epic branch into master. Phase 3 and 4 can then happen in separate PRs that branch off master.

Side note: The performance looks to be greatly improved!

@ngrodzitski
Copy link
Contributor Author

ngrodzitski commented Jan 12, 2021

To summarize, we can move forward with this PR. Please update the remaining reference files accordingly. You can run sh build/vast/integration_Debug.sh -u (where Debug is your build type) to update them all automatically.

Ok, I will do. Nevertheless, I'd like to note that this new-lines tollerance in csv can lead to errors.
Consider the following case:

record1 f1 "value\nAAAAAA", f2
record2 f2 "value\nBBBBBB", f2

if \n becomes a newline, then the output will be the following:

record1 f1 "value
AAAAAA", f2
record1 f2 "value
BBBBBB", f2

To compare with reference file the output is sorted on line basis:

AAAAAA", f2
BBBBBB", f2
record1 f1 "value
record1 f2 "value

The catch is that the following wrong output (A with record2 and B with record1)

record1 f1 "value
BBBBBB", f2
record2 f2 "value
AAAAAA", f2

will be sorted and compared to reference without errors. That is unlikely to be a problem now, but it shows that compare method for under such conditions is not reliable in general.

@dominiklohmann
Copy link
Member

Ok, I will do. Nevertheless, I'd like to note that this new-lines tollerance in csv can lead to errors.

That's absolutely correct. We sort the lines in the integration test framework before comparing them, which cannot work with this.

I'll discuss adapting the CSV writer with the team in tomorrow's standup.

@mavam
Copy link
Member

mavam commented Jan 12, 2021

We sort the lines in the integration test framework before comparing them, which cannot work with this.

It's possible to deactivate the sorting, which could be an option if the output is guaranteed to be deterministic.

@mavam
Copy link
Member

mavam commented Jan 12, 2021

I verified that newlines in CSV strings are okay per the specification as long as the strings are quoted, which they are in this case.

Even though it's legal to have CR/LF unescaped, per the spec, we could decide to escape newlines, because most CSV parsers are not that fancy. This would be an option to the csv command.

As of right now, I'd go with the simplest path forward.

@dominiklohmann
Copy link
Member

dominiklohmann commented Jan 13, 2021

We'll fix the CSV issue separately from the simdjson-related PRs so you don't need to worry about it.

I'll merge this today into the epic branch, and merge the epic branch into master after doing some fixups and writing a changelog entry. Phase 3 should then be a new PR based off of master.

@ngrodzitski
Copy link
Contributor Author

I think the reason of vast: error while loading shared libraries: libsimdjson.so.4: cannot open shared object file: No such file or directory can be because of installing libsimdjson-dev for ubuntu in vast/.github/workflows/vast.yaml which I can't test.

@dominiklohmann
Copy link
Member

I'll merge this into the epic branch now and fix the remaining CI issue myself in the epic branch, it likely is a permission issue. No big deal.

@dominiklohmann dominiklohmann merged commit 9629bd8 into tenzir:epic/simdjson Jan 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants