Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align output of the Zeek TSV reader with schemas #2887

Merged
merged 8 commits into from Jan 28, 2023

Conversation

dominiklohmann
Copy link
Member

@dominiklohmann dominiklohmann commented Jan 26, 2023

This changes the Zeek TSV reader to respect available schemas from schema files in case the parsed output can be casted to a schema with an exactly matching name.

In case the casting is impossible, the Zeek reader warns exactly once when switching to a new file and ignores the user-provided schema.

To achieve this, I have introduced a new set of functions vast::cast and vast::can_cast, with the former operating on arrays and the latter operating on types. For now, this only supports reshuffling columns in record types (and corresponding struct arrays), but the mechanism is written in a way that is easily extensible via template specializations.

Tasks

Edit tasklist title
Beta Give feedback Tasklist Tasks, more options

Delete tasklist

Delete tasklist block?
Are you sure? All relationships in this tasklist will be removed.
  1. Implement no-op cast operations
    Options
  2. Implement cast operations on record types to reshuffle columns
    Options
  3. Utilize cast operation in Zeek TSV reader
    Options
  4. Implement casting for lists
    Options
  5. Write unit tests for cast/can_cast
    Options
  6. Write changelog entry
    Options
  7. Update docs Docs already had unflattened fields
    Options

Fixes tenzir/issues#36.

To whoever is reviewing this: I'm sorry for the ginormous diff. We happen to use the Zeek TSV reader in a lot of integration tests.

@dominiklohmann dominiklohmann added the bug Incorrect behavior label Jan 26, 2023
@dominiklohmann dominiklohmann changed the title Align output of the Zeek ZSV reader with schemas Align output of the Zeek TSV reader with schemas Jan 27, 2023
@mavam
Copy link
Member

mavam commented Jan 27, 2023

I can see the value of the casting, but I don't understand the big picture. I would imagine it suffices to:

  1. Read the Zeek fields
  2. Unflatten records by splitting on .
  3. Set the type in the builder

Then we're done. Am I missing something here?

@dominiklohmann
Copy link
Member Author

Then we're done. Am I missing something here?

Yes. With what you've suggested you get data with a different schemas for the Zeek JSON and the Zeek TSV readers, which is a really bad UX when you're working with both.

Unflatten records by splitting on .

Before this change, the Zeek reader already looked at the schema files we had loaded, and iff it found a matching schema attempted to adjust the schema to that. However, that was incomplete—the end result was a schema that was neither that of the data nor the one we found in the loaded schema files.

Also, there is at least one edge case where unflattening based on dots does not work at all: Consider Zeek data that has the following field names (in order): id.orig_h, ts, id.orig_p. What you suggested breaks in this case as you would need to re-order leaves.

@mavam
Copy link
Member

mavam commented Jan 27, 2023

Yes. With what you've suggested you get data with a different schemas for the Zeek JSON and the Zeek TSV readers, which is a really bad UX when you're working with both.

I don't see how that would happen.

[..] the Zeek reader already looked at the schema files we had loaded

Is that really the desired UX? The format is self-describing. I would expect that the first layer is just onboarding the data. Transforming it according to known schemas should happen down the line (in the processing model we are currently working towards to).

Also, there is at least one edge case where unflattening based on dots does not work at all: Consider Zeek data that has the following field names (in order): id.orig_h, ts, id.orig_p. What you suggested breaks in this case as you would need to re-order leaves.

That is an artificial example. Zeek doesn't create such data.

@dominiklohmann
Copy link
Member Author

[..] the Zeek reader already looked at the schema files we had loaded

Is that really the desired UX? [...]

Yes, I would say so. If the user describes the schema of their data, then we should adhere to that. The Zeek reader still works in isolation—but as a user I would not want two different schemas to show up when I run vast show schemas zeek.conn.

Besides, how else would the Zeek reader know about the port and timestamp aliases, or about #index=hash attribute? Just unflattening on a dot is not good enough.

[...] different schemas for the Zeek JSON and the Zeek TSV readers [...]

I don't see how that would happen.

Look at the diff of the integration test reference files. For example, you can see the field local_resp missing before this change in the ingested zeek.conn events from Zeek TSV files, which was (1) described in the schema file and (2) available in the Zeek JSON data.

That is an artificial example. Zeek doesn't create such data.

Then think of CSV instead of Zeek, for which you cannot rely on the educated guess that nobody will ever write such data. It has all the same problems.

Remember the CSV reader problems that @rdettai encountered? This is able to fix them all the same, agnostic to the readers themselves. It's a building block that while for now only applied to Zeek TSV will in the near future be useful in all the many variants of the from and read operators.

This should make it less likely for clang-format to split long error
message strings into multiple short lines over breaking at an function
earlier argument.
This introduces a new set of functions `vast::cast` and
`vast::can_cast`, with the former operating on arrays and the latter
operating on types. For now, this only supports reshuffling columns in
record types (and corresponding struct arrays), but the mechanism is
written in a way that is easily extensible via template specializations.
This changes the Zeek TSV reader to respect available schemas from
schema files in case the parsed output can be casted to a schema with an
exactly matching name.

In case the casting is impossible, the Zeek reader warns exactly once
when switching to a new file and ignores the user-provided schema.
@dominiklohmann dominiklohmann marked this pull request as ready for review January 27, 2023 21:50
@dominiklohmann dominiklohmann requested a review from a team January 27, 2023 21:51
Copy link
Member

@mavam mavam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, solid work here on the core side.

Change requests:

  • There seems to be a bug in the output, _path and _write_ts are always null
  • The Zeek TSV writer is too far away from TSV output. I propose to (1) move _path to #path and drop _write_ts for now.
  • We need documentation changes that we "upgraded" the Zeek JSON logs to be structured. This breaks tools that rely on the flattened version, e.g., RITA, ZAT, etc. While there is probably jq hotfix for some use cases where the user can impose, we can't assume that. We should probably also mention that an upcoming cast operator (or whatever we choose) will make it possible transition explicitly.

libvast/include/vast/cast.hpp Show resolved Hide resolved
libvast/include/vast/format/zeek.hpp Show resolved Hide resolved
vast/integration/reference/arrow-export/step_01.ref Outdated Show resolved Hide resolved
vast/integration/reference/pivot/step_06.ref Outdated Show resolved Hide resolved
vast/integration/reference/pivot/step_06.ref Outdated Show resolved Hide resolved
@dominiklohmann
Copy link
Member Author

@mavam To address the meta comment from your review, as it's not in a review comment:

  • There seems to be a bug in the output, _path and _write_ts are always null

They are not present in Zeek TSV, so these fields are only set when you import Zeek JSON. This is working as expected.

  • The Zeek TSV writer is too far away from TSV output. I propose to (1) move _path to #path and drop _write_ts for now.

I don't think the Zeek TSV writer should make implicit changes to data based on the type name. It is not restricted to writing Zeek logs.

  • We need documentation changes that we "upgraded" the Zeek JSON logs to be structured. This breaks tools that rely on the flattened version, e.g., RITA, ZAT, etc. While there is probably jq hotfix for some use cases where the user can impose, we can't assume that. We should probably also mention that an upcoming cast operator (or whatever we choose) will make it possible transition explicitly.
  • vast export json --flatten has existed for a long time now to flatten on export.
  • The imported data is now exactly aligned with the schema, which we have always done for Zeek JSON, but until now not for Zeek TSV. This is a bug fix, not a feature.

This removes the `_path` from the Zeek schemas as they are only required for
disambiguation in the Zeek JSON reader, but should not actually be part of the
data. It also moves the `_write_ts` field to the end to increase compatibility
with downstream tooling that relies on the column order in Zeek tools (please
use `zeek-cut`, folks).
Copy link
Member

@mavam mavam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for now. I think the Zeek writer is troublesome in many ways and we've reached a working solution.

The fact that we have to add _write_ts to all schemas feels wrong, but there's no alternative at the moment.

@dominiklohmann
Copy link
Member Author

This took a while to get to this point, but I think it's in a way better spot than what we had before this PR now. Thanks for all the discussion, @mavam. 🙏

@dominiklohmann dominiklohmann merged commit 0a02ff5 into master Jan 28, 2023
@dominiklohmann dominiklohmann deleted the topic/unflatten-zeek-tsv branch January 28, 2023 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior
Projects
None yet
2 participants