Align output of the Zeek TSV reader with schemas #2887
I can see the value of the casting, but I don't understand the big picture. I would imagine it suffices to:
Then we're done. Am I missing something here?
Yes. With what you've suggested you get data with different schemas for the Zeek JSON and the Zeek TSV readers, which is a really bad UX when you're working with both.
Before this change, the Zeek reader already looked at the schema files we had loaded, and iff it found a matching schema attempted to adjust the schema to that. However, that was incomplete: the end result was a schema that was neither that of the data nor the one we found in the loaded schema files. Also, there is at least one edge case where unflattening based on dots does not work at all: Consider Zeek data that has the following field names (in order):
I don't see how that would happen.
Is that really the desired UX? The format is self-describing. I would expect that the first layer is just onboarding the data. Transforming it according to known schemas should happen down the line (in the processing model we are currently working towards).
That is an artificial example. Zeek doesn't create such data.
Yes, I would say so. If the user describes the schema of their data, then we should adhere to that. The Zeek reader still works in isolation—but as a user I would not want two different schemas to show up when I run Besides, how else would the Zeek reader know about the
Look at the diff of the integration test reference files. For example, you can see the field
Then think of CSV instead of Zeek, for which you cannot rely on the educated guess that nobody will ever write such data. It has all the same problems. Remember the CSV reader problems that @rdettai encountered? This is able to fix them all the same, agnostic to the readers themselves. It's a building block that while for now only applied to Zeek TSV will in the near future be useful in all the many variants of the
This should make it less likely for clang-format to split long error message strings into multiple short lines rather than breaking at an earlier function argument.
This introduces a new set of functions `vast::cast` and `vast::can_cast`, with the former operating on arrays and the latter operating on types. For now, this only supports reshuffling columns in record types (and corresponding struct arrays), but the mechanism is written in a way that is easily extensible via template specializations.
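To make the split between the two functions concrete, here is a minimal sketch of column reshuffling. It is not the VAST implementation: `field`, `record_type`, `table`, and `find_field` are hypothetical stand-ins for the real type and Arrow struct-array classes, and the signatures of `can_cast` and `cast` are assumed for illustration only. The sketch also assumes field names are unique within a record.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for a record type and a struct array.
// These are NOT the VAST types; they only illustrate column reshuffling.
struct field {
  std::string name;
  std::string type; // e.g. "time", "string", "uint64"
};
using record_type = std::vector<field>;

struct table {
  record_type schema;
  // One column of stringified values per field, in schema order.
  std::vector<std::vector<std::string>> columns;
};

// Index of the field in `r` that matches `f` by name and type, if any.
std::optional<std::size_t> find_field(const record_type& r, const field& f) {
  for (std::size_t i = 0; i < r.size(); ++i)
    if (r[i].name == f.name && r[i].type == f.type)
      return i;
  return std::nullopt;
}

// Operates on types only: `from` can be cast to `to` if both have the same
// number of fields and every field of `to` has a counterpart in `from`.
// Assumes field names are unique within a record.
bool can_cast(const record_type& from, const record_type& to) {
  if (from.size() != to.size())
    return false;
  return std::all_of(to.begin(), to.end(), [&](const field& f) {
    return find_field(from, f).has_value();
  });
}

// Operates on data: reorders the columns of `input` to match `to`.
// Returns std::nullopt if the cast is not possible.
std::optional<table> cast(const table& input, const record_type& to) {
  if (!can_cast(input.schema, to))
    return std::nullopt;
  table result{to, {}};
  result.columns.reserve(to.size());
  for (const auto& f : to) {
    const auto index = find_field(input.schema, f);
    assert(index.has_value()); // guaranteed by can_cast
    result.columns.push_back(input.columns[*index]);
  }
  return result;
}
```

Checking castability on types first means a caller can decide once per schema whether the conversion applies, and only then pay for moving data.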
This changes the Zeek TSV reader to respect available schemas from schema files in case the parsed output can be cast to a schema with an exactly matching name. In case the casting is impossible, the Zeek reader warns exactly once when switching to a new file and ignores the user-provided schema.
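The selection logic described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the reader's actual code: `schema`, `selection`, and `select_schema` are hypothetical names, a schema is reduced to a name plus ordered field names, and `can_cast` is a stand-in that only checks for a permutation of fields.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical simplification: a schema is its name plus ordered field names.
struct schema {
  std::string name;
  std::vector<std::string> fields;
};

// Result of the lookup: the schema to emit, plus a warning (empty if none).
struct selection {
  schema chosen;
  std::string warning;
};

// Stand-in for the type-level check: same fields, possibly in another order.
bool can_cast(const schema& from, const schema& to) {
  return from.fields.size() == to.fields.size()
         && std::is_permutation(from.fields.begin(), from.fields.end(),
                                to.fields.begin());
}

// Intended to run once per file, right after the header has been parsed.
// Because it runs per file and not per row, a failed cast yields exactly
// one warning for that file.
selection select_schema(const schema& parsed,
                        const std::map<std::string, schema>& loaded) {
  const auto it = loaded.find(parsed.name);
  if (it == loaded.end())
    return {parsed, ""}; // no user-provided schema with this exact name
  if (can_cast(parsed, it->second))
    return {it->second, ""}; // adopt the user-provided schema
  // Cast impossible: keep the parsed schema and report it once.
  return {parsed, "cannot cast '" + parsed.name
                    + "' to the user-provided schema; ignoring it"};
}
```

Returning the warning instead of logging it directly keeps the sketch free of any logging framework; the real reader would route it through its own logger.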
Overall, solid work here on the core side.
Change requests:
- There seems to be a bug in the output: `_path` and `_write_ts` are always `null`.
- The Zeek TSV writer is too far away from TSV output. I propose to (1) move `_path` to `#path` and drop `_write_ts` for now.
- We need documentation changes that we "upgraded" the Zeek JSON logs to be structured. This breaks tools that rely on the flattened version, e.g., RITA, ZAT, etc. While there is probably a `jq` hotfix for some use cases where the user can impose, we can't assume that. We should probably also mention that an upcoming `cast` operator (or whatever we choose) will make it possible to transition explicitly.
@mavam To address the meta comment from your review, as it's not in a review comment:
They are not present in Zeek TSV, so these fields are only set when you import Zeek JSON. This is working as expected.
I don't think the Zeek TSV writer should make implicit changes to data based on the type name. It is not restricted to writing Zeek logs.
This removes the `_path` field from the Zeek schemas, as it is only required for disambiguation in the Zeek JSON reader and should not actually be part of the data. It also moves the `_write_ts` field to the end to increase compatibility with downstream tooling that relies on the column order in Zeek tools (please use `zeek-cut`, folks).
This works for now. I think the Zeek writer is troublesome in many ways and we've reached a working solution.
The fact that we have to add `_write_ts` to all schemas feels wrong, but there's no alternative at the moment.
This took a while to get to this point, but I think it's in a way better spot than what we had before this PR now. Thanks for all the discussion, @mavam. 🙏
This changes the Zeek TSV reader to respect available schemas from schema files in case the parsed output can be cast to a schema with an exactly matching name.
In case the casting is impossible, the Zeek reader warns exactly once when switching to a new file and ignores the user-provided schema.
To achieve this, I have introduced a new set of functions `vast::cast` and `vast::can_cast`, with the former operating on arrays and the latter operating on types. For now, this only supports reshuffling columns in record types (and corresponding struct arrays), but the mechanism is written in a way that is easily extensible via template specializations.

Tasks
- Update docs: Docs already had unflattened fields

Fixes tenzir/issues#36.
To whoever is reviewing this: I'm sorry for the ginormous diff. We happen to use the Zeek TSV reader in a lot of integration tests.