Reapproach the log event data model and nesting strategy #704

binarylogic · 2019-08-02T20:25:25Z

We need to re-think how we're handling nested log event data. The above referenced PR introduces very complicated "unflatten" code that was concerning to the rest of the team. While it works, it is a red flag that we might be better off reapproaching the internal log event data model to store data in a nested fashion, avoiding the need to unflatten data entirely.

MarkusH · 2019-08-08T05:28:46Z

How about approaching it the other way around? Treat events as structured and provide a flatten method that converts them to a flat structure as it is the default right now?

lukesteensen · 2019-08-12T23:18:58Z

@MarkusH yes, it's starting to seem like we'd be better off directly modeling nested data. We initially avoided this to keep things simpler, but it's causing significant complexity around the edges.

binarylogic · 2019-10-21T15:31:45Z

@lukesteensen assigning this to you, for now, to create the spec for this.

ghost · 2019-11-07T15:18:58Z

Here is a an initial proposal which considers using actually nested events in the internal representation.

Current approach

Currently we have two internal data types, Value and ValueKind. The former wraps latter and includes explicit field. LogEvent is a map containing Values.

Proposed changes in Rust types

In order to avoid the need to do flattening/unflattening, we can extend ValueKind enum to support nested objects and arrays. Then Rust definitions would look like this:

// stays the same
pub struct LogEvent {
    fields: HashMap<Atom, Value>,
}

// stays the same
pub struct Value {
    value: ValueKind,
    explicit: bool,
}

// two new fields are added
pub enum ValueKind {
    Bytes(Bytes),
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Timestamp(DateTime<Utc>),
    // nested object
    Object(HashMap<Atom, ValueKind>),
    // nested array
    Array(Vec<ValueKind>)
}

Note that nested objects and and arrays contain values of ValueKind type, so they cannot be separately marked as explicit/implicit.

Proposed changes in ProtoBuf definitions

The corresponding ProtoBuf definitions to match Rust definitions from the previous section are the following:

// stays the same
message Log {
  map<string, Value> fields = 1;
}

message ValueArray {
  repeated ValueKind array = 1; // packed by default in proto3
}

message ValueMap {
  map<string, ValueKind> fields = 1;
}

message ValueKind {
  oneof kind {
    bytes raw_bytes = 1;
    google.protobuf.Timestamp timestamp = 2;
    int64 integer = 4;
    double float = 5;
    bool boolean = 6;
    ValueArray array = 7;
    ValueMap map = 8;
  }
}

message Value {
  // just placing ValueKind here might be better, but it would break
  // wire compatibility with the current version of the protocol
  oneof kind {
    bytes raw_bytes = 1;
    google.protobuf.Timestamp timestamp = 2;
    int64 integer = 4;
    double float = 5;
    bool boolean = 6;
    ValueArray array = 7;
    ValueMap map = 8;
  }
  bool explicit = 3;
}

The following schema gives backward compatibility between new and the current versions of the protocol in case if an event doesn't contain nested arrays or maps. If we want 100% compatibiltiy, we would need to do flattening/unflattening when serializing/deserializing messages to ProtoBuf.

Proposed changes in configuration

There should be no changes in configuration, as our configs already support nested fields. We use "." as separator for nested fields in config now, which should continue working.

Proposed changes in transforms

Right now, the Lua transform gets flat objects with keys looking like a.b.c. After this change, the objects in Lua should be actually nested tables. The same is true about JavaScript transform from #721.

lukesteensen · 2019-11-08T23:41:57Z

Nice! This looks like a solid start. There are a couple of other ideas I've had that might be worth exploring as well.

First, I think we can totally drop the concept of explicit and unify Value/ValueKind into a single type. We removed the only use of that data that I'm aware of when we got rid of dynamic encoding. Hopefully that will make things a bit simpler.

Second, I was wondering if it'd be worth adding a string type based on the string crate that uses the same design as Bytes. We do a lot of to_string_lossy and this could potentially save the effort or rechecking UTF-8 validity and the work of copying into a normal String.

Finally, as we discussed in slack, it may be worth re-evaluating our use of Protocol Buffers here. It adds a decent amount of complexity and we don't get a lot of benefit from it yet. I do think we should focus on the Rust side of things for now, and then potentially look into this as a followup.

ghost · 2019-11-11T14:05:23Z

First, I think we can totally drop the concept of explicit and unify Value/ValueKind into a single type. We removed the only use of that data that I'm aware of when we got rid of dynamic encoding. Hopefully that will make things a bit simpler.

I fully agree with this.

Second, I was wondering if it'd be worth adding a string type based on the string crate that uses the same design as Bytes. We do a lot of to_string_lossy and this could potentially save the effort or rechecking UTF-8 validity and the work of copying into a normal String.

Do you mean to have two field types, Bytes which correspond to raw bytes and String which would be required to be valid UTF-8? If so, I think such separation could play well with JavaScript transform, where it would be possible to use ArrayBuffer for binary arrays and String for UTF-8 strings.

Finally, as we discussed in slack, it may be worth re-evaluating our use of Protocol Buffers here. It adds a decent amount of complexity and we don't get a lot of benefit from it yet. I do think we should focus on the Rust side of things for now, and then potentially look into this as a followup.

I support the idea of keeping re-evaluation of the encoding separate from the current issue.

lukesteensen · 2019-11-11T17:34:42Z

Do you mean to have two field types, Bytes which correspond to raw bytes and String which would be required to be valid UTF-8?

Yep, exactly. Currently, we use bytes for everything, even when we know from the source that the data is valid UTF-8.

MOZGIII · 2020-02-13T03:56:41Z

This is relevant: #1532

ghost · 2020-03-02T14:19:38Z

The core changes proposed and discussed in this issue were implemented in #1836, #1838, and #1902. Because of that I think this issue can be closed.

binarylogic added type: tech debt A code change that does not add user value. domain: buffers Anything related to Vector's memory/disk buffers Core: Logs needs: requirements Needs a a list of requirements before work can be begin labels Aug 2, 2019

This was referenced Aug 4, 2019

Restore event structure before passing it to the lua runtime #706

Closed

feat(new transform): Initial javascript transform implementation #721

Closed

binarylogic removed the domain: buffers Anything related to Vector's memory/disk buffers label Sep 12, 2019

binarylogic added this to the Improve data processing milestone Sep 17, 2019

Jeffail mentioned this issue Oct 17, 2019

Configuration unit tests #232

Closed

binarylogic assigned lukesteensen Oct 21, 2019

binarylogic assigned ghost and unassigned lukesteensen Nov 4, 2019

lukesteensen mentioned this issue Nov 20, 2019

json_parser ... arrays? #1204

Closed

ghost mentioned this issue Nov 20, 2019

Periodically read content of a file using file source #1198

Open

Jeffail mentioned this issue Nov 25, 2019

feat(new source): Initial splunk_hec source implementation #1178

Closed

3 tasks

This was referenced Dec 16, 2019

Split Bytes data type into two separate types #1377

Open

Nested fields and data types #1406

Closed

ghost mentioned this issue Jan 31, 2020

enhancement!: Initial implementation of nested fields support for log events #1662

Closed

ghost mentioned this issue Feb 12, 2020

chore(testing): Add behavior tests for add_fields and remove_fields transforms #1781

Merged

binarylogic removed the needs: requirements Needs a a list of requirements before work can be begin label Feb 14, 2020

sciyoshi mentioned this issue Feb 21, 2020

enhancement(lua transform)!: respect Lua types when converting back to events #1886

Merged

ghost mentioned this issue Feb 24, 2020

chore: Replace flatten/unflatten by native nesting #1902

Merged

ghost closed this as completed Mar 2, 2020

binarylogic added domain: logs Anything related to Vector's log events domain: data model Anything related to Vector's internal data model and removed event type: log labels Aug 6, 2020

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reapproach the log event data model and nesting strategy #704

Reapproach the log event data model and nesting strategy #704

binarylogic commented Aug 2, 2019

MarkusH commented Aug 8, 2019

lukesteensen commented Aug 12, 2019

binarylogic commented Oct 21, 2019

ghost commented Nov 7, 2019

lukesteensen commented Nov 8, 2019

ghost commented Nov 11, 2019

lukesteensen commented Nov 11, 2019

MOZGIII commented Feb 13, 2020

ghost commented Mar 2, 2020

Reapproach the log event data model and nesting strategy #704

Reapproach the log event data model and nesting strategy #704

Comments

binarylogic commented Aug 2, 2019

MarkusH commented Aug 8, 2019

lukesteensen commented Aug 12, 2019

binarylogic commented Oct 21, 2019

ghost commented Nov 7, 2019

Current approach

Proposed changes in Rust types

Proposed changes in ProtoBuf definitions

Proposed changes in configuration

Proposed changes in transforms

lukesteensen commented Nov 8, 2019

ghost commented Nov 11, 2019

lukesteensen commented Nov 11, 2019

MOZGIII commented Feb 13, 2020

ghost commented Mar 2, 2020