Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reapproach the log event data model and nesting strategy #704

Closed
binarylogic opened this issue Aug 2, 2019 · 9 comments
Closed

Reapproach the log event data model and nesting strategy #704

binarylogic opened this issue Aug 2, 2019 · 9 comments
Labels
domain: data model Anything related to Vector's internal data model domain: logs Anything related to Vector's log events type: tech debt A code change that does not add user value.

Comments

@binarylogic
Copy link
Contributor

See #678 (comment)

We need to re-think how we're handling nested log event data. The above referenced PR introduces very complicated "unflatten" code that was concerning to the rest of the team. While it works, it is a red flag that we might be better off reapproaching the internal log event data model to store data in a nested fashion, avoiding the need to unflatten data entirely.

@binarylogic binarylogic added type: tech debt A code change that does not add user value. domain: buffers Anything related to Vector's memory/disk buffers Core: Logs needs: requirements Needs a a list of requirements before work can be begin labels Aug 2, 2019
@MarkusH
Copy link
Contributor

MarkusH commented Aug 8, 2019

How about approaching it the other way around? Treat events as structured and provide a flatten method that converts them to a flat structure as it is the default right now?

@lukesteensen
Copy link
Member

@MarkusH yes, it's starting to seem like we'd be better off directly modeling nested data. We initially avoided this to keep things simpler, but it's causing significant complexity around the edges.

@binarylogic binarylogic removed the domain: buffers Anything related to Vector's memory/disk buffers label Sep 12, 2019
@binarylogic binarylogic added this to the Improve data processing milestone Sep 17, 2019
@binarylogic
Copy link
Contributor Author

@lukesteensen assigning this to you, for now, to create the spec for this.

@binarylogic binarylogic assigned ghost and unassigned lukesteensen Nov 4, 2019
@ghost
Copy link

ghost commented Nov 7, 2019

Here is a an initial proposal which considers using actually nested events in the internal representation.

Current approach

Currently we have two internal data types, Value and ValueKind. The former wraps latter and includes explicit field. LogEvent is a map containing Values.

Proposed changes in Rust types

In order to avoid the need to do flattening/unflattening, we can extend ValueKind enum to support nested objects and arrays. Then Rust definitions would look like this:

// stays the same
pub struct LogEvent {
    fields: HashMap<Atom, Value>,
}

// stays the same
pub struct Value {
    value: ValueKind,
    explicit: bool,
}

// two new fields are added
pub enum ValueKind {
    Bytes(Bytes),
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Timestamp(DateTime<Utc>),
    // nested object
    Object(HashMap<Atom, ValueKind>),
    // nested array
    Array(Vec<ValueKind>)
}

Note that nested objects and and arrays contain values of ValueKind type, so they cannot be separately marked as explicit/implicit.

Proposed changes in ProtoBuf definitions

The corresponding ProtoBuf definitions to match Rust definitions from the previous section are the following:

// stays the same
message Log {
  map<string, Value> fields = 1;
}

message ValueArray {
  repeated ValueKind array = 1; // packed by default in proto3
}

message ValueMap {
  map<string, ValueKind> fields = 1;
}

message ValueKind {
  oneof kind {
    bytes raw_bytes = 1;
    google.protobuf.Timestamp timestamp = 2;
    int64 integer = 4;
    double float = 5;
    bool boolean = 6;
    ValueArray array = 7;
    ValueMap map = 8;
  }
}

message Value {
  // just placing ValueKind here might be better, but it would break
  // wire compatibility with the current version of the protocol
  oneof kind {
    bytes raw_bytes = 1;
    google.protobuf.Timestamp timestamp = 2;
    int64 integer = 4;
    double float = 5;
    bool boolean = 6;
    ValueArray array = 7;
    ValueMap map = 8;
  }
  bool explicit = 3;
}

The following schema gives backward compatibility between new and the current versions of the protocol in case if an event doesn't contain nested arrays or maps. If we want 100% compatibiltiy, we would need to do flattening/unflattening when serializing/deserializing messages to ProtoBuf.

Proposed changes in configuration

There should be no changes in configuration, as our configs already support nested fields. We use "." as separator for nested fields in config now, which should continue working.

Proposed changes in transforms

Right now, the Lua transform gets flat objects with keys looking like a.b.c. After this change, the objects in Lua should be actually nested tables. The same is true about JavaScript transform from #721.

@lukesteensen
Copy link
Member

Nice! This looks like a solid start. There are a couple of other ideas I've had that might be worth exploring as well.

First, I think we can totally drop the concept of explicit and unify Value/ValueKind into a single type. We removed the only use of that data that I'm aware of when we got rid of dynamic encoding. Hopefully that will make things a bit simpler.

Second, I was wondering if it'd be worth adding a string type based on the string crate that uses the same design as Bytes. We do a lot of to_string_lossy and this could potentially save the effort or rechecking UTF-8 validity and the work of copying into a normal String.

Finally, as we discussed in slack, it may be worth re-evaluating our use of Protocol Buffers here. It adds a decent amount of complexity and we don't get a lot of benefit from it yet. I do think we should focus on the Rust side of things for now, and then potentially look into this as a followup.

@ghost
Copy link

ghost commented Nov 11, 2019

First, I think we can totally drop the concept of explicit and unify Value/ValueKind into a single type. We removed the only use of that data that I'm aware of when we got rid of dynamic encoding. Hopefully that will make things a bit simpler.

I fully agree with this.

Second, I was wondering if it'd be worth adding a string type based on the string crate that uses the same design as Bytes. We do a lot of to_string_lossy and this could potentially save the effort or rechecking UTF-8 validity and the work of copying into a normal String.

Do you mean to have two field types, Bytes which correspond to raw bytes and String which would be required to be valid UTF-8? If so, I think such separation could play well with JavaScript transform, where it would be possible to use ArrayBuffer for binary arrays and String for UTF-8 strings.

Finally, as we discussed in slack, it may be worth re-evaluating our use of Protocol Buffers here. It adds a decent amount of complexity and we don't get a lot of benefit from it yet. I do think we should focus on the Rust side of things for now, and then potentially look into this as a followup.

I support the idea of keeping re-evaluation of the encoding separate from the current issue.

@lukesteensen
Copy link
Member

Do you mean to have two field types, Bytes which correspond to raw bytes and String which would be required to be valid UTF-8?

Yep, exactly. Currently, we use bytes for everything, even when we know from the source that the data is valid UTF-8.

@MOZGIII
Copy link
Contributor

MOZGIII commented Feb 13, 2020

This is relevant: #1532

@ghost
Copy link

ghost commented Mar 2, 2020

The core changes proposed and discussed in this issue were implemented in #1836, #1838, and #1902. Because of that I think this issue can be closed.

@ghost ghost closed this as completed Mar 2, 2020
@binarylogic binarylogic added domain: logs Anything related to Vector's log events domain: data model Anything related to Vector's internal data model and removed event type: log labels Aug 6, 2020
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: data model Anything related to Vector's internal data model domain: logs Anything related to Vector's log events type: tech debt A code change that does not add user value.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants