Align output of the Zeek TSV reader with schemas #2887

dominiklohmann · 2023-01-26T23:33:36Z

This changes the Zeek TSV reader to respect available schemas from schema files in case the parsed output can be casted to a schema with an exactly matching name.

In case the casting is impossible, the Zeek reader warns exactly once when switching to a new file and ignores the user-provided schema.

To achieve this, I have introduced a new set of functions vast::cast and vast::can_cast, with the former operating on arrays and the latter operating on types. For now, this only supports reshuffling columns in record types (and corresponding struct arrays), but the mechanism is written in a way that is easily extensible via template specializations.

Tasks

Give feedback

Implement no-op cast operations

Implement no-op cast operations
Options
Successfully updated the issue's project

There was an error updating the issue's project
Implement cast operations on record types to reshuffle columns

Implement cast operations on record types to reshuffle columns
Options
Successfully updated the issue's project

There was an error updating the issue's project
Utilize cast operation in Zeek TSV reader

Utilize cast operation in Zeek TSV reader
Options
Successfully updated the issue's project

There was an error updating the issue's project
Implement casting for lists

Implement casting for lists
Options
Successfully updated the issue's project

There was an error updating the issue's project
Write unit tests for cast/can_cast

Write unit tests for cast/can_cast
Options
Successfully updated the issue's project

There was an error updating the issue's project
Write changelog entry

Write changelog entry
Options
Successfully updated the issue's project

There was an error updating the issue's project
~Update docs~ Docs already had unflattened fields

~~Update docs~~ Docs already had unflattened fields
Options
Successfully updated the issue's project

There was an error updating the issue's project
Options

Fixes tenzir/issues#36.

To whoever is reviewing this: I'm sorry for the ginormous diff. We happen to use the Zeek TSV reader in a lot of integration tests.

mavam · 2023-01-27T06:47:04Z

I can see the value of the casting, but I don't understand the big picture. I would imagine it suffices to:

Read the Zeek fields
Unflatten records by splitting on .
Set the type in the builder

Then we're done. Am I missing something here?

dominiklohmann · 2023-01-27T07:01:51Z

Then we're done. Am I missing something here?

Yes. With what you've suggested you get data with a different schemas for the Zeek JSON and the Zeek TSV readers, which is a really bad UX when you're working with both.

Unflatten records by splitting on .

Before this change, the Zeek reader already looked at the schema files we had loaded, and iff it found a matching schema attempted to adjust the schema to that. However, that was incomplete—the end result was a schema that was neither that of the data nor the one we found in the loaded schema files.

Also, there is at least one edge case where unflattening based on dots does not work at all: Consider Zeek data that has the following field names (in order): id.orig_h, ts, id.orig_p. What you suggested breaks in this case as you would need to re-order leaves.

mavam · 2023-01-27T07:42:04Z

Yes. With what you've suggested you get data with a different schemas for the Zeek JSON and the Zeek TSV readers, which is a really bad UX when you're working with both.

I don't see how that would happen.

[..] the Zeek reader already looked at the schema files we had loaded

Is that really the desired UX? The format is self-describing. I would expect that the first layer is just onboarding the data. Transforming it according to known schemas should happen down the line (in the processing model we are currently working towards to).

Also, there is at least one edge case where unflattening based on dots does not work at all: Consider Zeek data that has the following field names (in order): id.orig_h, ts, id.orig_p. What you suggested breaks in this case as you would need to re-order leaves.

That is an artificial example. Zeek doesn't create such data.

dominiklohmann · 2023-01-27T07:55:55Z

[..] the Zeek reader already looked at the schema files we had loaded

Is that really the desired UX? [...]

Yes, I would say so. If the user describes the schema of their data, then we should adhere to that. The Zeek reader still works in isolation—but as a user I would not want two different schemas to show up when I run vast show schemas zeek.conn.

Besides, how else would the Zeek reader know about the port and timestamp aliases, or about #index=hash attribute? Just unflattening on a dot is not good enough.

[...] different schemas for the Zeek JSON and the Zeek TSV readers [...]

I don't see how that would happen.

Look at the diff of the integration test reference files. For example, you can see the field local_resp missing before this change in the ingested zeek.conn events from Zeek TSV files, which was (1) described in the schema file and (2) available in the Zeek JSON data.

That is an artificial example. Zeek doesn't create such data.

Then think of CSV instead of Zeek, for which you cannot rely on the educated guess that nobody will ever write such data. It has all the same problems.

Remember the CSV reader problems that @rdettai encountered? This is able to fix them all the same, agnostic to the readers themselves. It's a building block that while for now only applied to Zeek TSV will in the near future be useful in all the many variants of the from and read operators.

This should make it less likely for clang-format to split long error message strings into multiple short lines over breaking at an function earlier argument.

This introduces a new set of functions `vast::cast` and `vast::can_cast`, with the former operating on arrays and the latter operating on types. For now, this only supports reshuffling columns in record types (and corresponding struct arrays), but the mechanism is written in a way that is easily extensible via template specializations.

This changes the Zeek TSV reader to respect available schemas from schema files in case the parsed output can be casted to a schema with an exactly matching name. In case the casting is impossible, the Zeek reader warns exactly once when switching to a new file and ignores the user-provided schema.

mavam

Overall, solid work here on the core side.

Change requests:

There seems to be a bug in the output, _path and _write_ts are always null
The Zeek TSV writer is too far away from TSV output. I propose to (1) move _path to #path and drop _write_ts for now.
We need documentation changes that we "upgraded" the Zeek JSON logs to be structured. This breaks tools that rely on the flattened version, e.g., RITA, ZAT, etc. While there is probably jq hotfix for some use cases where the user can impose, we can't assume that. We should probably also mention that an upcoming cast operator (or whatever we choose) will make it possible transition explicitly.

libvast/include/vast/cast.hpp

libvast/include/vast/format/zeek.hpp

vast/integration/reference/arrow-export/step_01.ref

plugins/pcap/integration/reference/pivot-with-pcap/step_08.ref

vast/integration/reference/pivot/step_06.ref

plugins/pcap/integration/reference/pivot-with-pcap/step_08.ref

plugins/web/integration/reference/export-endpoint/step_02.ref

dominiklohmann · 2023-01-28T09:32:29Z

@mavam To address the meta comment from your review, as it's not in a review comment:

There seems to be a bug in the output, _path and _write_ts are always null

They are not present in Zeek TSV, so these fields are only set when you import Zeek JSON. This is working as expected.

The Zeek TSV writer is too far away from TSV output. I propose to (1) move _path to #path and drop _write_ts for now.

I don't think the Zeek TSV writer should make implicit changes to data based on the type name. It is not restricted to writing Zeek logs.

We need documentation changes that we "upgraded" the Zeek JSON logs to be structured. This breaks tools that rely on the flattened version, e.g., RITA, ZAT, etc. While there is probably jq hotfix for some use cases where the user can impose, we can't assume that. We should probably also mention that an upcoming cast operator (or whatever we choose) will make it possible transition explicitly.

vast export json --flatten has existed for a long time now to flatten on export.
The imported data is now exactly aligned with the schema, which we have always done for Zeek JSON, but until now not for Zeek TSV. This is a bug fix, not a feature.

This removes the `_path` from the Zeek schemas as they are only required for disambiguation in the Zeek JSON reader, but should not actually be part of the data. It also moves the `_write_ts` field to the end to increase compatibility with downstream tooling that relies on the column order in Zeek tools (please use `zeek-cut`, folks).

mavam

This works for now. I think the Zeek writer is troublesome in many ways and we've reached a working solution.

The fact that we have to add _write_ts to all schemas feels wrong, but there's no alternative at the moment.

dominiklohmann · 2023-01-28T16:53:39Z

This took a while to get to this point, but I think it's in a way better spot than what we had before this PR now. Thanks for all the discussion, @mavam. 🙏

dominiklohmann added the bug Incorrect behavior label Jan 26, 2023

dominiklohmann force-pushed the topic/unflatten-zeek-tsv branch from 2885c14 to a8586ac Compare January 26, 2023 23:46

dominiklohmann changed the title ~~Align output of the Zeek ZSV reader with schemas~~ Align output of the Zeek TSV reader with schemas Jan 27, 2023

dominiklohmann added 4 commits January 27, 2023 12:00

Increase string break penalty for clang-format

6108d0a

This should make it less likely for clang-format to split long error message strings into multiple short lines over breaking at an function earlier argument.

Implement casting for lists

b2b4ec3

dominiklohmann force-pushed the topic/unflatten-zeek-tsv branch from f77e984 to b2b4ec3 Compare January 27, 2023 11:05

dominiklohmann added 2 commits January 27, 2023 22:48

Add unit tests for casting a record type

3867bc6

Document changes to the Zeek reader

9618ca5

dominiklohmann marked this pull request as ready for review January 27, 2023 21:50

dominiklohmann requested a review from a team January 27, 2023 21:51

mavam requested changes Jan 28, 2023

View reviewed changes

mavam approved these changes Jan 28, 2023

View reviewed changes

Add changelog entry for _path removal

7d9f66c

dominiklohmann enabled auto-merge January 28, 2023 16:53

dominiklohmann merged commit 0a02ff5 into master Jan 28, 2023

dominiklohmann deleted the topic/unflatten-zeek-tsv branch January 28, 2023 18:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Align output of the Zeek TSV reader with schemas #2887

Align output of the Zeek TSV reader with schemas #2887

dominiklohmann commented Jan 26, 2023 •

edited

Tasks

mavam commented Jan 27, 2023

dominiklohmann commented Jan 27, 2023

mavam commented Jan 27, 2023 •

edited

dominiklohmann commented Jan 27, 2023

mavam left a comment

dominiklohmann commented Jan 28, 2023

mavam left a comment

dominiklohmann commented Jan 28, 2023

Align output of the Zeek TSV reader with schemas #2887

Align output of the Zeek TSV reader with schemas #2887

Conversation

dominiklohmann commented Jan 26, 2023 • edited

Tasks

mavam commented Jan 27, 2023

dominiklohmann commented Jan 27, 2023

mavam commented Jan 27, 2023 • edited

dominiklohmann commented Jan 27, 2023

mavam left a comment

Choose a reason for hiding this comment

dominiklohmann commented Jan 28, 2023

mavam left a comment

Choose a reason for hiding this comment

dominiklohmann commented Jan 28, 2023

dominiklohmann commented Jan 26, 2023 •

edited

mavam commented Jan 27, 2023 •

edited