-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changing BroValUnion to use ZAM_record/ZAM_vector #1231
Conversation
CI is failing because the following sort of btest diff isn't getting normalized:
i.e. the filenames differ, and that's it. How do I fix that? Also, is there a way to cancel pending CI once a problem like this is spotted? |
scripts/base/files/x509/main.zeek
Outdated
@@ -43,7 +43,7 @@ export { | |||
## All extensions in the order they were raised. | |||
## This is used for caching certificates that are commonly | |||
## encountered and should not be relied on in user scripts. | |||
extensions_cache: vector of any &default=vector(); | |||
extensions_cache: table[count] of any; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that we have to change this from a vector to a table of count?
From a semantic perspective I like a vector quite a bit more here; using a table[count] seems to return this to the old use of count tables (like we used it for string indexing) that we have tried to get away from.
I admittedly have a suspicion - this probably is one of the few places of zeek where different record types are stuck into a single vector. I still like the original syntax more :)
Reading through the description again I see that you call out the heterogeneous use of vector. Just from a how it looks like in a script perspective, I still like it better - however I can see that one can argue that a vector has to be of one specific type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is due to vector-of-any's needing to be homogeneous. I agree that the original syntax is somewhat more pleasing, though I also was happy to find that making this switch required only very minor tweaks.
putting
Not to my knowledge. It will cancel automatically when you push a new commit; otherwise it doesn't really hurt that it continues running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are mostly small stylistic things - I will try to take a closer look later when I have a bit more time :)
src/Type.h
Outdated
@@ -605,6 +605,12 @@ class RecordType final : public Type { | |||
const TypeDecl* FieldDecl(int field) const; | |||
TypeDecl* FieldDecl(int field); | |||
|
|||
// Returns flags corresponding to which fields in the record | |||
// have types requiring memory management (reference counting). | |||
// Primarily used by the compiler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that for newly added functions we should add doxygen-style comments for their usage.
This also is mentioned in https://github.com/zeek/zeek/wiki/Coding-Style-and-Conventions :)
This comment also applies to the following new functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have made a doxygen commenting pass.
@@ -1366,7 +1383,9 @@ class VectorVal final : public Val, public notifier::detail::Modifiable { | |||
return At(static_cast<unsigned int>(i)).get(); | |||
} | |||
|
|||
unsigned int Size() const { return val.vector_val->size(); } | |||
ZAM_vector* RawVector() const { return val.vector_val; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RawVector seems to be missing documentation :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through and documented this stuff.
src/ZVal.cc
Outdated
using namespace zeek; | ||
|
||
|
||
bool* zeek::zval_error_addr = nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a small side-note. I think since c++17 you can specify zval_error_addr as inline in the header file, which lets you do the assignment directly in the header file. Using that you can get rid of the additional line in a .cc file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I answered this somewhere else, but if not: thanks for the tip! I've made this change.
src/ZVal.h
Outdated
// | ||
// We use this somewhat clunky coupling to enable isolating ZVal from | ||
// ZAM compiler specifics. | ||
extern bool* zval_error_addr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this will mostly be needed for the compiler branch? In the current PR, I only found two uses for it, and it never seems to be read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, that's just needed for ZAM code execution, to catch run-time errors.
// variant type, and (2) it won't allow access to the managed_val member, | ||
// which not only simplifies memory management but also is required for | ||
// sharing of ZAM frame slots. | ||
union ZAMValUnion { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we're merging this outside of ZAM into the main part of the code, I'd prefer to rename the "ZAM" types (ZAMValUnion
, ZAM_record
, etc) to something independent of ZAM. Otherwise I can see things getting confusing because the ZAM terminology seems to suggest that it's something specific to the optimizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about ZValUnion
/ Z_vector
/ Z_record
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(with Z
standing for Zeek
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, not sure. Then we'd end up with having both BroValUnion
and Z(eek)ValUnion
. While I don't have a better idea right now, maybe this is actually hinting as something else: there's a lot of overlap between new and existing types. Is there any chance we could unify that so that we end up with just single representation that covers all use cases? I assume the answer is no (otherwise you'd have done it), but just seeing the current code leaves me wondering if the structure is quite right yet. (I'm not deep enough in the code to suggest something else right now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the core problem as I see it. Memory management for aggregate values is handled by Val
reference-counting rather than direct reference-counting. For example, a RecordVal
's low-level representation in BroValUnion
is PDict<TableEntryVal>*
, and dictionaries do not support reference-counting. Thus, to share an aggregate (as often required by Zeek scripting semantics), the sharing has to be done using the RecordVal
rather than the low-level representation.
One approach would be to change all of these low-level representations to be Obj
's, though getting this right w/o introducing leaks would likely be a pain. If we made that change, then ZAMValUnion
could be expressed as some additional fields to the existing (modified) BroValUnion
rather than a separate representation.
Another approach is to get rid of BroValUnion
and hoist its different variants into the different Val
subclasses that use them. For example, RecordVal
's would have the PDict
as an explicit member variable, rather than using the common BroValUnion
to store them. The scope of this change also gives me pause, though, and it could wind up being clunky or potentially buggy for closely-related Val
subclasses like TimeVal
and DoubleVal
. If we did this then we'd still have ZAMValUnion
but it would be the only such low-level representation of values (other than those spread across different Val
subclasses).
// variant type, and (2) it won't allow access to the managed_val member, | ||
// which not only simplifies memory management but also is required for | ||
// sharing of ZAM frame slots. | ||
union ZAMValUnion { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we should switch this to a std::variant
to make it type-safe and all memory-managed. Would we then actually still need the notion of being "managed"?
Could we do (2) through a dispatcher method a returning an object pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I resist this. It means that every low-level copy of a ZAMValUnion is more expensive, and they cost twice as much to store. Plus sharing of ZAM frame slots provides a significant performance gain: some fully inlined event handlers get a 10x savings in frame size (which is also a speed savings due to initialization issues). I think the uneasiness of non-type-safety is worth it in this case for the performance benefits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite reluctant to merge new code that's neither type-safe nor automatically memory managed. It's not only "uneasiness", but also a major maintenance burden and a barrier to entry for new people. We've been working very hard to get away from that, and I want to be careful that we don't end up moving the wrong way.
Of course performance is paramount for an optimizer, so that's a perfectly valid argument. However, I don't know if it's as bad as you suggest. Honestly, I've given up on making these kinds of assumptions on performance without measuring, I've been wrong often enough about trying to guess how something will turn out. For example, expensive copies may be avoidable by moving instances; storing a heap-allocated object twice the size might not matter because of malloc alignment; and storing/accessing more information can be largely invisible if it all fits into a cache line.
In short: I don't know the answer yet either, but I wouldn't just rule out yet that a safer way do it might be feasible, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try a different tack: Are we sure we need the full ZAMValUnion at the interpreter level (vs moving it down into the optimizer-only code)? I know we need to have a version of Z_vector
/ Z_record
inside the BroValUnion
and I know those depend on BroValUnion in turn. But could there be a different way of structuring this so we extend BroValUnion
without pulling in the full BroValUnion
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been noodling on this but I'm not seeing how to make such a path work. There are multiple challenges:
- For the foreseeable future, we'll want compiled code and interpreted code able to live together (each can call the other). This requires that they share underlying data types, since otherwise the types need to be inter-converted, which is not only inefficient but more fundamentally leads to broken semantics (aggregates no longer being shallow-copied).
- There are major performance gains in representing Zeek vectors and records as
std::vector
's of some sort of low-level type, rather thanVal*
's. Doing so enables avoiding a bunch ofVal
construction/destruction. - To preserve current APIs (fundamental for the script interpreter),
ZAM_vector
andZAM_record
need to be able to returnVal*
's for their elements/fields. WhileZAMValUnion
supports doing so,BroValUnion
does not, and it lacks the hooks for adding this in a manner that preserves the shallow-copy semantics of aggregates. We could get those hooks by making all of the low-level types inBroValUnion
derived fromObj
, but per my comment above unfortunately I think this is a heavy lift :-( .
} | ||
|
||
/** | ||
* Provides direct raw access to one of the record's fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This style is something that would be nice to avoid, I saw in it a couple of places: Providing raw mutating access to internal class state. It breaks encapsulation and means a class can't enforce (and test) consistent properties on its internal state (memory mgmt being a case in point here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I can see how to make that work. Check out the change I made in this regard for ZAM_vector
, per 48585db. It allows for a few special friends to still have raw access, but no general raw access.
For ZAM_record
's SetField()
(mentioned above), we can switch to having an Assign
variant suitable for initializing empty record fields, as that's what SetField()
is used for. This makes code a bit less explicit, for example this current code:
r->SetField(1).uint_val = opcode;
would become:
r->Init(1, ZAMValUnion(opcode));
(where for efficiency we'd add a ZAMValUnion constructor
that doesn't require a type, as is almost always the case when you know the type ahead of time).
How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got about halfway through this afternoon. I'll get the rest tomorrow. Here's a first round of comments.
src/Type.cc
Outdated
@@ -1041,6 +1050,12 @@ const char* RecordType::AddFields(const type_decl_list& others, | |||
return nullptr; | |||
} | |||
|
|||
void RecordType::AddField(unsigned int field, const TypeDecl* td) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be changed to AppendField()
? That would remove the need for the assert, and the contract with the caller would be that it always goes at the end. It would also remove the need for passing the field number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a good thought. I picked up the name just because its caller is AddFields
, and the assert was to make sure I hadn't screwed anything up, but I'm okay with removing it now. I've made this change and it will be part of the next set of updates I push.
src/Type.cc
Outdated
loop_over_list(*types, i) | ||
AddField(i, (*types)[i]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If AddField
becomes AppendField
, this can change to:
loop_over_list(*types, i) | |
AddField(i, (*types)[i]); | |
for ( const auto& type : types ) | |
AppendField(type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(likewise done)
@@ -3001,7 +3022,20 @@ void RecordVal::DoneParsing() | |||
parse_time_records.clear(); | |||
} | |||
|
|||
const ValPtr& RecordVal::GetField(const char* field) const | |||
Val* RecordVal::Lookup(int field) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Val::Lookup
methods are marked as deprecated. I'm not sure that we care if this leaks or not, since it's just going to get removed in v4.1. Does the GetField
method function as it's supposed to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I was hoping that the leak is okay given it's deprecated, as I don't see any way to fix it. The GetField
method behaves appropriately (no leak).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have doubts about how acceptable it is to introduce a leak into now-deprecated code -- seems like just another shade of "we broke this API" and I'd maybe even prefer a build-time breakage due to removing it rather than a run-time misbehavior.
I haven't finished paging in all the details of the PR and about to run out of time for a while, but if the "is managed" property of ZamValUnion
were per-object rather than based on the static list of managed types, maybe it could do something like convert the unmanaged value into a managed one at the point when someone is found using the deprecated API (but don't know how much other difficulties there are with changing everything to now be like a struct { ZAMValUnion zval; bool is_managed; }
... plus maybe that's just another form of begging to go down the road of wrapping things into a more concise representation that's type/memory safe like variant
... which was an initial impression I share with Robin if there's desire to do more brainstorming toward making that wishlist-item come true).
Generally, if there's other ideas for having temporary slow-paths or ways to opt-in to breakages (compile-time options?), that could be a way to address things like this (if you find it doesn't add too much complexity by itself) -- any such workaround would only exist in master
until 4.0 is released (planned for around end of this year) and then they'd get removed along with the deprecations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a simpler change that tracks ValPtr's returned by the deprecated Lookup()
method and recovers the storage upon destruction of the RecordVal
. I tested this using the #if
code I introduced for GetField()
, which is commented out in the pushed version.
(BTW, it wouldn't work, without major headaches, to migrate the field to "managed", since that would require conditionals in all low-level access checking for its presence.)
src/Val.cc
Outdated
} | ||
|
||
vector<ValPtr>::iterator it; | ||
auto& vv = val.vector_val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing off this reference doesn't seem necessary since it's only used once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I think that's vestigial from when more was involved. Now addressed in the upcoming change I'll push.
@@ -1395,6 +1418,10 @@ class VectorVal final : public Val, public notifier::detail::Modifiable { | |||
bool Remove(unsigned int index); | |||
|
|||
protected: | |||
// For a vector with yield type void/any, concretizes it to instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I've ever heard the word "concretize" before today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh - well per https://books.google.com/ngrams/graph?content=pother%2Cconcretize&year_start=1800&year_end=2019&corpus=26&smoothing=3&direct_url=t1%3B%2Cpother%3B%2Cc0%3B.t1%3B%2Cconcretize%3B%2Cc0, it's somewhat "potheresque" (a term my wife & I use to express grumpiness at some crossword answers), but it does seem nicely apt here :-)
I thought about the fact that this change forbids "vector of any" with differing types a bit more - and I still don't really like it. For one - sticking differing types into a "vector of any" looks way nicer to me than doing the same with a "table[count] of any". Second - even if one says that this should not be allowed, since the type for a vector should be consistent - why would this not be the case for a table[count] of any too? To me that just starts feeling internally inconsistent. Without having looked at this deeply - I assume that the reason for the vector-change is that you only want to have one type for the entire vector - instead of having to track it for each element. Would it perhaps be possible to just special-case vector of any - and to track the types for each element individually in these cases? |
In reply to @0xxon:
Yes.
I'll give this a try. I do appreciate the point about consistency. Ultimately where I've been coming from is that most of the uses of |
src/ZVal.cc
Outdated
// We can't use make_intrusive directly because this | ||
// constructor is protected, sigh. | ||
v = new Val(int_val, zeek::TYPE_INT); | ||
return {AdoptRef{}, v}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's val_mgr->Int()
(and also val_mgr->Bool()
/ val_mgr->Count()
for bool and count types below) that would give back a ValPtr
to a preallocated-or-else-newly-constructed object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for flagging, my next push will include this.
Again in reply to @0xxon:
I've pushed changes to support heterogeneous vector-of-any's. In the process, I remembered why this gets messy: vector-of-any is used both as a way of managing lists of heterogeneous values (as in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments on the rest of it.
if ( sort_type_is_managed ) | ||
{ | ||
if ( ! a.managed_val ) | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return 0; | |
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly a reasonable change, but note that that's not code I've touched, it's like that in master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird. I thought I had fixed all of those a while back. I must have missed these two.
if ( ! a.managed_val ) | ||
return 0; | ||
if ( ! b.managed_val ) | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return 1; | |
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same here)
int int_result = result->CoerceToInt(); | ||
|
||
return int_result < 0; | ||
} | ||
|
||
bool indirect_sort_function(size_t a, size_t b) | ||
bool signed_sort_function (zeek::ZAMValUnion& a, zeek::ZAMValUnion& b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do all of these single-line functions need to exist anymore, or can they be turned into lambdas where you used them to call sort
? If not, are they used anywhere else or could they be marked static
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can certainly be turned into lambdas or made static, I was going with minimal changes, as the original code is non-static, non-lambdas.
ZAM_vector(zeek::VectorVal* _vv, zeek::TypePtr yt, int n = 0) | ||
: zvec(n) | ||
{ | ||
vv = _vv; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is vv for? it's a protected member that doesn't seem to be used anywhere?
(and: why are all the members protected instead of private?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re vv
, good catch - it's vestigial, as is rv
in ZAM_record
. I'll remove them if we wind up keeping these classes.
Re why protected rather than private: just because that's the style still used in large parts of the code base. Are we switching to a style of using private in cases like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re why protected rather than private: just because that's the style still used in large parts of the code base.
Ah, I guess never really realized that. I think most newer classes use private
, and protected
only for stuff that indeed needs access from derived classes. I would suggest we use that approach going forward as well in the interest of encapsulation, but point taken that that's not settled style.
Closing this out as some of the core elements will now be changed per #1281. |
Here's a branch for the script optimization work isolated to the changes to
BroValUnion
and the various repercussions. It adds two new source files,ZVal.h
andZVal.cc
, for definingZAM_vector
andZAM_record
(which in turn needZAMValUnion
). There's also (1) bug fixes to scripts that either used heterogeneous vector-of-any's, or had typing errors that happen to work for the interpreter even though they're incorrect, (2) a low-level debugging function,obj_desc
, that I've made heavy use of when debugging , and (3) a tweak to thesort()
andorder()
BiFs so that now "double" types don't require a comparison function (not needed in a minimal sense, but I was worried I'd forget about this otherwise as I have to modify those functions anyway).A couple of non-user-visible things I left out, but would make sense to do next for performance improvements: (1) converting some event engine hot-spots to use direct
ZAM_record
accesses rather thanRecordVal
accesses, and (2) a change to record initialization to be done byRecordType
(which builds up a template for how to initialize records) rather thanRecordVal
(which does a bunch of redundant work for records of type that are frequently used).There's a companion branch (topic/vern/val-change) for
zeek/zeek-docs
documenting the user-visible differences (which are quite small). I'll do a draft pull request for it, too.