-
-
Notifications
You must be signed in to change notification settings - Fork 96
Improve json
parser, add null
type, and various fixes
#3503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
de14e1c
to
b16cad0
Compare
series_builder
, add null
type, fix json
parser and allow empty recordsseries_builder
, add null
type, fix json
parser and other small fixes
series_builder
, add null
type, fix json
parser and other small fixesseries_builder
, add null
type, fix json
parser and other fixes
series_builder
, add null
type, fix json
parser and other fixesjson
parser, add null
type, and various fixes
json
parser, add null
type, and various fixesjson
parser, add null
type, and various fixes
a35cf10
to
ad8a307
Compare
ad8a307
to
c0ef1b0
Compare
c3f6897
to
ab9dd40
Compare
Marking this as ready for review now. I'm not 100% happy with everything (in particular there are still some small API issues remaining with regards to conversions and infallibility), but I think it's good enough for a review. I will also read review the diff myself once more and do some additional cleanup (see tasklist above). |
Great perf improvement PR: Seeing 2x improvement on import time for Suricata eve.json file, thanks! |
Edit: Solved in latest commits. Previous problem: I'm getting a crash for another dataset (I can share the eve.json file privately, I was able to reproduce it reliably). Client logs:
Node logs:
|
Edit: Solved in latest commits. Previous problem: On another eve.json file, I'm getting a stuck pipeline (reproducible as well):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback from an initial review, more coming later or tomorrow.
4f5b849
to
9de7db2
Compare
9de7db2
to
8a27325
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is so massive that I think that after addressing the remaining TODOs and writing follow-up issues we should merge it, but only shortly after today's v4.2 release. This will give us some time to catch any outstanding issues. I went through the code and except for the things you already noted I'm really happy with this. The performance looks to be really good as well.
I also managed to port my YAML format PR to the new API rather easily, and that now works as expected. I'll wait with pushing that until after this is merged so you can more freely rebase.
Please add changelog entries about the following things:
- Performance and correctness improvements of the JSON parser.
- Support for empty lists and empty records, and the implied change in behavior for
drop
andselect
.
912ff0c
to
88f76e0
Compare
I have added a counting benchmark to the PR description. We went from 15.7s to 8.5s, which is a big improvement. 🚀 Also, this pushes us very close to |
15d750b
to
089a8b1
Compare
42d1b4f
to
fd26cff
Compare
Improve reliability
This PR improves the reliability of the JSON parser by swapping out one of its core components with another implementation. Previously, the parser could crash or produce erroneous output in some situations.
Improve accuracy
The new parser tries to parse all incoming events as accurately as possible, without sticky type promotion. For example, if a field is almost always a record, but an integer in a few other events, then the new parser yields events where the types match their input (either record or integer, depending on the event). The previous approach would have produced records up to the first batch that sees the integer, and after that always strings (due to promotion rules). Assuming that null fields are considered the same as non-existing fields, the new approach should always1 yield events that match exactly they input type.
Improve performance
We will be benchmarking this PR against the status quo with this dataset and
read suricata | summarize count(.)
as our pipeline. Previously, this took around 15.7s on my machine. Now, we are done after 8.5s. That is more than 180% of the original event processing rate! Compare this tojq -s length eve.json
, which takes 7.6s. Thus, we are only off by 10%, even though we arguably do a lot more work.Add
--raw
Sometimes (depending on the input stream), the JSON parser produces many small batches due to frequent type switches. This PR also adds the
--raw
flag (initially built for testing purposes), which disables type inference for strings (which can normally be parsed as IP addresses, durations, etc.) and numbers (for which we generally useint64
ordouble
). With--raw
, we can parse documents where JSON type of all events is compatible into a single batch. However, this means that, for example, IP addresses are parsed as astring
instead of anip
.Add
null
typeAlthough the
null
value is an inhabitant of every type (due to implicit nullability), there are situations where having a dedicatednull
type is useful, for example if an expression always evaluates to null (e.g., field does not exist). It is also necessary to faithfully represent empty lists (which were previously not parsed at all), which is why this is part of this PR.Allow empty records
Previously, empty records were neither parsed nor handled "correctly" by operators. For example, dropping all fields previously meant that the whole event was dropped instead of yielding an empty event. Furthermore, allowing them is also necessary to faithfully represent nested empty records.
Miscellaneous
type_kind
as an alternative to theuint8
type indextransform_columns
, which previously transformednull
records to non-null records with null fields (see diff)Tasklist
try_atom
,atom
,try_data
,data
(and perhapstry_record
?)type::operator bool()
transform_columns()
withflatten()
(this only contains a workaround)null
Footnotes
There is an edge case where a list inside a single event contains mixed types, for example
{"foo": [{"bar": []}, {"bar": 42}]
. Because there currently are no union types, this cannot be accurately represented. The implementation contains some special logic that casts the conflicting types into strings, such that we get{"foo": [{"bar": "[]"}, {"bar": "42"}]
for our example. But this behavior is limited to a single event, unlike previously. ↩