changing BroValUnion to use ZAM_record/ZAM_vector #1231

vpax · 2020-10-17T01:14:53Z

Here's a branch for the script optimization work isolated to the changes to BroValUnion and the various repercussions. It adds two new source files, ZVal.h and ZVal.cc, for defining ZAM_vector and ZAM_record (which in turn need ZAMValUnion). There's also (1) bug fixes to scripts that either used heterogeneous vector-of-any's, or had typing errors that happen to work for the interpreter even though they're incorrect, (2) a low-level debugging function, obj_desc, that I've made heavy use of when debugging , and (3) a tweak to the sort() and order() BiFs so that now "double" types don't require a comparison function (not needed in a minimal sense, but I was worried I'd forget about this otherwise as I have to modify those functions anyway).

A couple of non-user-visible things I left out, but would make sense to do next for performance improvements: (1) converting some event engine hot-spots to use direct ZAM_record accesses rather than RecordVal accesses, and (2) a change to record initialization to be done by RecordType (which builds up a template for how to initialize records) rather than RecordVal (which does a bunch of redundant work for records of type that are frequently used).

There's a companion branch (topic/vern/val-change) for zeek/zeek-docs documenting the user-visible differences (which are quite small). I'll do a draft pull request for it, too.

vpax · 2020-10-17T20:42:25Z

CI is failing because the following sort of btest diff isn't getting normalized:

-expression error in /Users/vern/warehouse/zeek-bvu-change/testing/btest/.tmp/language.vector-unspecified/vector-unspecified.zeek, line 8: vector index assignment failed for invalid type 'string', value: Hi (a[1])
+expression error in /zeek/testing/btest/.tmp/language.vector-unspecified/vector-unspecified.zeek, line 8: vector index assignment failed for invalid type 'string', value: Hi (a[1])

i.e. the filenames differ, and that's it. How do I fix that? Also, is there a way to cancel pending CI once a problem like this is spotted?

0xxon · 2020-10-19T07:17:19Z

scripts/base/files/x509/main.zeek

@@ -43,7 +43,7 @@ export {
 		## All extensions in the order they were raised.
 		## This is used for caching certificates that are commonly
 		## encountered and should not be relied on in user scripts.
-		extensions_cache: vector of any &default=vector();
+		extensions_cache: table[count] of any;


Is there a reason that we have to change this from a vector to a table of count?

From a semantic perspective I like a vector quite a bit more here; using a table[count] seems to return this to the old use of count tables (like we used it for string indexing) that we have tried to get away from.

I admittedly have a suspicion - this probably is one of the few places of zeek where different record types are stuck into a single vector. I still like the original syntax more :)

Reading through the description again I see that you call out the heterogeneous use of vector. Just from a how it looks like in a script perspective, I still like it better - however I can see that one can argue that a vector has to be of one specific type.

Yes, this is due to vector-of-any's needing to be homogeneous. I agree that the original syntax is somewhat more pleasing, though I also was happy to find that making this switch required only very minor tweaks.

0xxon · 2020-10-19T07:20:31Z

i.e. the filenames differ, and that's it. How do I fix that?

putting TEST_DIFF_CANONIFIER=$SCRIPTS/diff-remove-abspath before btest-diff should fix this specific issue, I think.

Also, is there a way to cancel pending CI once a problem like this is spotted?

Not to my knowledge. It will cancel automatically when you push a new commit; otherwise it doesn't really hurt that it continues running.

0xxon

These are mostly small stylistic things - I will try to take a closer look later when I have a bit more time :)

0xxon · 2020-10-19T07:27:46Z

src/Type.h

@@ -605,6 +605,12 @@ class RecordType final : public Type {
 	const TypeDecl* FieldDecl(int field) const;
 	TypeDecl* FieldDecl(int field);

+	// Returns flags corresponding to which fields in the record
+	// have types requiring memory management (reference counting).
+	// Primarily used by the compiler.


I think that for newly added functions we should add doxygen-style comments for their usage.

This also is mentioned in https://github.com/zeek/zeek/wiki/Coding-Style-and-Conventions :)

This comment also applies to the following new functions.

Have made a doxygen commenting pass.

0xxon · 2020-10-19T07:28:24Z

src/Val.h

@@ -1366,7 +1383,9 @@ class VectorVal final : public Val, public notifier::detail::Modifiable {
 		return At(static_cast<unsigned int>(i)).get();
 		}

-	unsigned int Size() const { return val.vector_val->size(); }
+	ZAM_vector* RawVector() const	{ return val.vector_val; }


RawVector seems to be missing documentation :)

I've gone through and documented this stuff.

0xxon · 2020-10-19T07:32:13Z

src/ZVal.cc

+using namespace zeek;
+
+
+bool* zeek::zval_error_addr = nullptr;


As a small side-note. I think since c++17 you can specify zval_error_addr as inline in the header file, which lets you do the assignment directly in the header file. Using that you can get rid of the additional line in a .cc file.

I think I answered this somewhere else, but if not: thanks for the tip! I've made this change.

0xxon · 2020-10-19T07:43:38Z

src/ZVal.h

+//
+// We use this somewhat clunky coupling to enable isolating ZVal from
+// ZAM compiler specifics.
+extern bool* zval_error_addr;


I assume this will mostly be needed for the compiler branch? In the current PR, I only found two uses for it, and it never seems to be read.

Right, that's just needed for ZAM code execution, to catch run-time errors.

rsmmr · 2020-10-20T18:06:37Z

src/ZVal.h

+// variant type, and (2) it won't allow access to the managed_val member,
+// which not only simplifies memory management but also is required for
+// sharing of ZAM frame slots.
+union ZAMValUnion {


Given that we're merging this outside of ZAM into the main part of the code, I'd prefer to rename the "ZAM" types (ZAMValUnion, ZAM_record, etc) to something independent of ZAM. Otherwise I can see things getting confusing because the ZAM terminology seems to suggest that it's something specific to the optimizer.

How about ZValUnion / Z_vector / Z_record ?

(with Z standing for Zeek)

Hmm, not sure. Then we'd end up with having both BroValUnion and Z(eek)ValUnion. While I don't have a better idea right now, maybe this is actually hinting as something else: there's a lot of overlap between new and existing types. Is there any chance we could unify that so that we end up with just single representation that covers all use cases? I assume the answer is no (otherwise you'd have done it), but just seeing the current code leaves me wondering if the structure is quite right yet. (I'm not deep enough in the code to suggest something else right now)

Here's the core problem as I see it. Memory management for aggregate values is handled by Val reference-counting rather than direct reference-counting. For example, a RecordVal's low-level representation in BroValUnion is PDict<TableEntryVal>*, and dictionaries do not support reference-counting. Thus, to share an aggregate (as often required by Zeek scripting semantics), the sharing has to be done using the RecordVal rather than the low-level representation.

One approach would be to change all of these low-level representations to be Obj's, though getting this right w/o introducing leaks would likely be a pain. If we made that change, then ZAMValUnion could be expressed as some additional fields to the existing (modified) BroValUnion rather than a separate representation.

Another approach is to get rid of BroValUnion and hoist its different variants into the different Val subclasses that use them. For example, RecordVal's would have the PDict as an explicit member variable, rather than using the common BroValUnion to store them. The scope of this change also gives me pause, though, and it could wind up being clunky or potentially buggy for closely-related Val subclasses like TimeVal and DoubleVal. If we did this then we'd still have ZAMValUnion but it would be the only such low-level representation of values (other than those spread across different Val subclasses).

rsmmr · 2020-10-20T18:13:29Z

src/ZVal.h

+// variant type, and (2) it won't allow access to the managed_val member,
+// which not only simplifies memory management but also is required for
+// sharing of ZAM frame slots.
+union ZAMValUnion {


I do think we should switch this to a std::variant to make it type-safe and all memory-managed. Would we then actually still need the notion of being "managed"?

Could we do (2) through a dispatcher method a returning an object pointer?

I resist this. It means that every low-level copy of a ZAMValUnion is more expensive, and they cost twice as much to store. Plus sharing of ZAM frame slots provides a significant performance gain: some fully inlined event handlers get a 10x savings in frame size (which is also a speed savings due to initialization issues). I think the uneasiness of non-type-safety is worth it in this case for the performance benefits

I'm quite reluctant to merge new code that's neither type-safe nor automatically memory managed. It's not only "uneasiness", but also a major maintenance burden and a barrier to entry for new people. We've been working very hard to get away from that, and I want to be careful that we don't end up moving the wrong way.

Of course performance is paramount for an optimizer, so that's a perfectly valid argument. However, I don't know if it's as bad as you suggest. Honestly, I've given up on making these kinds of assumptions on performance without measuring, I've been wrong often enough about trying to guess how something will turn out. For example, expensive copies may be avoidable by moving instances; storing a heap-allocated object twice the size might not matter because of malloc alignment; and storing/accessing more information can be largely invisible if it all fits into a cache line.

In short: I don't know the answer yet either, but I wouldn't just rule out yet that a safer way do it might be feasible, too.

Let me try a different tack: Are we sure we need the full ZAMValUnion at the interpreter level (vs moving it down into the optimizer-only code)? I know we need to have a version of Z_vector / Z_record inside the BroValUnion and I know those depend on BroValUnion in turn. But could there be a different way of structuring this so we extend BroValUnion without pulling in the full BroValUnion?

I've been noodling on this but I'm not seeing how to make such a path work. There are multiple challenges:

For the foreseeable future, we'll want compiled code and interpreted code able to live together (each can call the other). This requires that they share underlying data types, since otherwise the types need to be inter-converted, which is not only inefficient but more fundamentally leads to broken semantics (aggregates no longer being shallow-copied).

There are major performance gains in representing Zeek vectors and records as std::vector's of some sort of low-level type, rather than Val*'s. Doing so enables avoiding a bunch of Val construction/destruction.

To preserve current APIs (fundamental for the script interpreter), ZAM_vector and ZAM_record need to be able to return Val*'s for their elements/fields. While ZAMValUnion supports doing so, BroValUnion does not, and it lacks the hooks for adding this in a manner that preserves the shallow-copy semantics of aggregates. We could get those hooks by making all of the low-level types in BroValUnion derived from Obj, but per my comment above unfortunately I think this is a heavy lift :-( .

rsmmr · 2020-10-22T07:55:49Z

src/ZVal.h

+		}
+
+	/**
+	 * Provides direct raw access to one of the record's fields


This style is something that would be nice to avoid, I saw in it a couple of places: Providing raw mutating access to internal class state. It breaks encapsulation and means a class can't enforce (and test) consistent properties on its internal state (memory mgmt being a case in point here).

Okay, I think I can see how to make that work. Check out the change I made in this regard for ZAM_vector, per 48585db. It allows for a few special friends to still have raw access, but no general raw access.

For ZAM_record's SetField() (mentioned above), we can switch to having an Assign variant suitable for initializing empty record fields, as that's what SetField() is used for. This makes code a bit less explicit, for example this current code:

r->SetField(1).uint_val = opcode;

would become:

r->Init(1, ZAMValUnion(opcode));

(where for efficiency we'd add a ZAMValUnion constructor that doesn't require a type, as is almost always the case when you know the type ahead of time).

How does that sound?

timwoj

I got about halfway through this afternoon. I'll get the rest tomorrow. Here's a first round of comments.

timwoj · 2020-10-22T21:24:09Z

src/Type.cc

@@ -1041,6 +1050,12 @@ const char* RecordType::AddFields(const type_decl_list& others,
 	return nullptr;
 	}

+void RecordType::AddField(unsigned int field, const TypeDecl* td)


Can this be changed to AppendField()? That would remove the need for the assert, and the contract with the caller would be that it always goes at the end. It would also remove the need for passing the field number.

Yes, that's a good thought. I picked up the name just because its caller is AddFields, and the assert was to make sure I hadn't screwed anything up, but I'm okay with removing it now. I've made this change and it will be part of the next set of updates I push.

timwoj · 2020-10-22T21:25:41Z

src/Type.cc

+	loop_over_list(*types, i)
+		AddField(i, (*types)[i]);


If AddField becomes AppendField, this can change to:

Suggested change

loop_over_list(*types, i)

AddField(i, (*types)[i]);

for ( const auto& type : types )

AppendField(type);

(likewise done)

timwoj · 2020-10-22T21:32:34Z

src/Val.cc

@@ -3001,7 +3022,20 @@ void RecordVal::DoneParsing()
 	parse_time_records.clear();
 	}

-const ValPtr& RecordVal::GetField(const char* field) const
+Val* RecordVal::Lookup(int field) const


The Val::Lookup methods are marked as deprecated. I'm not sure that we care if this leaks or not, since it's just going to get removed in v4.1. Does the GetField method function as it's supposed to?

Right, I was hoping that the leak is okay given it's deprecated, as I don't see any way to fix it. The GetField method behaves appropriately (no leak).

I have doubts about how acceptable it is to introduce a leak into now-deprecated code -- seems like just another shade of "we broke this API" and I'd maybe even prefer a build-time breakage due to removing it rather than a run-time misbehavior.

I haven't finished paging in all the details of the PR and about to run out of time for a while, but if the "is managed" property of ZamValUnion were per-object rather than based on the static list of managed types, maybe it could do something like convert the unmanaged value into a managed one at the point when someone is found using the deprecated API (but don't know how much other difficulties there are with changing everything to now be like a struct { ZAMValUnion zval; bool is_managed; }... plus maybe that's just another form of begging to go down the road of wrapping things into a more concise representation that's type/memory safe like variant... which was an initial impression I share with Robin if there's desire to do more brainstorming toward making that wishlist-item come true).

Generally, if there's other ideas for having temporary slow-paths or ways to opt-in to breakages (compile-time options?), that could be a way to address things like this (if you find it doesn't add too much complexity by itself) -- any such workaround would only exist in master until 4.0 is released (planned for around end of this year) and then they'd get removed along with the deprecations.

I've pushed a simpler change that tracks ValPtr's returned by the deprecated Lookup() method and recovers the storage upon destruction of the RecordVal. I tested this using the #if code I introduced for GetField(), which is commented out in the pushed version.

(BTW, it wouldn't work, without major headaches, to migrate the field to "managed", since that would require conditionals in all low-level access checking for its presence.)

timwoj · 2020-10-22T21:35:42Z

src/Val.cc

 		}

-	vector<ValPtr>::iterator it;
+	auto& vv = val.vector_val;


Storing off this reference doesn't seem necessary since it's only used once.

Agreed. I think that's vestigial from when more was involved. Now addressed in the upcoming change I'll push.

timwoj · 2020-10-22T21:39:18Z

src/Val.h

@@ -1395,6 +1418,10 @@ class VectorVal final : public Val, public notifier::detail::Modifiable {
 	bool Remove(unsigned int index);

 protected:
+	// For a vector with yield type void/any, concretizes it to instead


I can't say I've ever heard the word "concretize" before today.

Heh - well per https://books.google.com/ngrams/graph?content=pother%2Cconcretize&year_start=1800&year_end=2019&corpus=26&smoothing=3&direct_url=t1%3B%2Cpother%3B%2Cc0%3B.t1%3B%2Cconcretize%3B%2Cc0, it's somewhat "potheresque" (a term my wife & I use to express grumpiness at some crossword answers), but it does seem nicely apt here :-)

0xxon · 2020-10-23T13:55:40Z

I thought about the fact that this change forbids "vector of any" with differing types a bit more - and I still don't really like it.

For one - sticking differing types into a "vector of any" looks way nicer to me than doing the same with a "table[count] of any".

Second - even if one says that this should not be allowed, since the type for a vector should be consistent - why would this not be the case for a table[count] of any too? To me that just starts feeling internally inconsistent.

Without having looked at this deeply - I assume that the reason for the vector-change is that you only want to have one type for the entire vector - instead of having to track it for each element. Would it perhaps be possible to just special-case vector of any - and to track the types for each element individually in these cases?

vpax · 2020-10-23T20:17:24Z

In reply to @0xxon:

Without having looked at this deeply - I assume that the reason for the vector-change is that you only want to have one type for the entire vector - instead of having to track it for each element.

Yes.

Would it perhaps be possible to just special-case vector of any - and to track the types for each element individually in these cases?

I'll give this a try. I do appreciate the point about consistency. Ultimately where I've been coming from is that most of the uses of vector of any (at least in the base scripts) are actually trying to provide vector of T, where T is like a C++ template type. For those, enforcing homogeneity makes sense. But I agree that there's another usage where the heterogeneity is reasonable. We should though also give some thought to what templates might look like for Zeek scripting.

jsiwek · 2020-10-23T23:27:22Z

src/ZVal.cc

+		// We can't use make_intrusive directly because this
+		// constructor is protected, sigh.
+		v = new Val(int_val, zeek::TYPE_INT);
+		return {AdoptRef{}, v};


There's val_mgr->Int() (and also val_mgr->Bool() / val_mgr->Count() for bool and count types below) that would give back a ValPtr to a preallocated-or-else-newly-constructed object.

Thanks for flagging, my next push will include this.

vpax · 2020-10-24T01:32:05Z

Again in reply to @0xxon:

Would it perhaps be possible to just special-case vector of any - and to track the types for each element individually in these cases?

I'll give this a try.

I've pushed changes to support heterogeneous vector-of-any's. In the process, I remembered why this gets messy: vector-of-any is used both as a way of managing lists of heterogeneous values (as in scripts/base/files/x509/main.zeek) and to allow for generic functions, where the parameters are declared vector of any but the actual arguments are concrete vectors. For this latter case, there's no guarantee that the called function won't change the vector from homogeneous to heterogeneous. Thus, there needs to be some explicit checking upon return from the call that that hasn't happened. Anyhoo, we can kick that can down the road a tad to when we're integrating the main optimization code.

timwoj

A few more comments on the rest of it.

timwoj · 2020-10-26T21:44:18Z

src/zeek.bif

+	if ( sort_type_is_managed )
+		{
+		if ( ! a.managed_val )
+			return 0;


Suggested change

return 0;

return false;

Certainly a reasonable change, but note that that's not code I've touched, it's like that in master.

Weird. I thought I had fixed all of those a while back. I must have missed these two.

timwoj · 2020-10-26T21:44:28Z

src/zeek.bif

+		if ( ! a.managed_val )
+			return 0;
+		if ( ! b.managed_val )
+			return 1;


Suggested change

return 1;

return true;

(same here)

timwoj · 2020-10-26T21:46:49Z

src/zeek.bif

 	int int_result = result->CoerceToInt();

 	return int_result < 0;
 	}

-bool indirect_sort_function(size_t a, size_t b)
+bool signed_sort_function (zeek::ZAMValUnion& a, zeek::ZAMValUnion& b)


Do all of these single-line functions need to exist anymore, or can they be turned into lambdas where you used them to call sort? If not, are they used anywhere else or could they be marked static?

They can certainly be turned into lambdas or made static, I was going with minimal changes, as the original code is non-static, non-lambdas.

rsmmr · 2020-11-05T13:29:13Z

src/ZVal.h

+	ZAM_vector(zeek::VectorVal* _vv, zeek::TypePtr yt, int n = 0)
+	: zvec(n)
+		{
+		vv = _vv;


what is vv for? it's a protected member that doesn't seem to be used anywhere?

(and: why are all the members protected instead of private?)

Re vv, good catch - it's vestigial, as is rv in ZAM_record. I'll remove them if we wind up keeping these classes.

Re why protected rather than private: just because that's the style still used in large parts of the code base. Are we switching to a style of using private in cases like this?

Re why protected rather than private: just because that's the style still used in large parts of the code base.

Ah, I guess never really realized that. I think most newer classes use private, and protected only for stuff that indeed needs access from derived classes. I would suggest we use that approach going forward as well in the interest of encapsulation, but point taken that that's not settled style.

vpax · 2020-11-11T16:21:18Z

Closing this out as some of the core elements will now be changed per #1281.

vpax added 3 commits October 16, 2020 17:54

changing BroValUnion to use ZAM_record/ZAM_vector

e18e9a4

I have no idea why this isn't normalized out ...

b15cba6

I'm guessing this change is due to Johanna's recent tweak to dump-events

afc3bd4

vpax marked this pull request as ready for review October 17, 2020 20:14

0xxon reviewed Oct 19, 2020

View reviewed changes

vpax added 5 commits October 19, 2020 09:29

btest fix from Johanna

fa0ae34

not sure why this ever worked ...

3994e82

a bit of doxygen updates, and a clarifying comment

2782b81

a bunch of low-level cleanup and documentation

e155b9b

bug fix for records having additional fields added to them

9929f53

rsmmr reviewed Oct 20, 2020

View reviewed changes

rsmmr reviewed Oct 22, 2020

View reviewed changes

make raw vector access protected

48585db

timwoj reviewed Oct 23, 2020

View reviewed changes

vpax added 2 commits October 23, 2020 13:31

incorporating low-level changes suggested by Tim

58c0e21

remove need for low-level ZAM_vector::InitVec() method

f1ceae1

jsiwek mentioned this pull request Oct 23, 2020

documenting (ZAM_)vector semantics zeek/zeek-docs#47

Closed

jsiwek reviewed Oct 23, 2020

View reviewed changes

support for heterogeneous vector-of-any's

16dc887

vpax added 3 commits October 23, 2020 20:45

nit-cleanup of vestigial code

dd165cf

use val_mgr for constructing int/count/bool's

2cb1c14

work-around to avoid deprecated RecordVal::Lookup leak

8010dba

timwoj reviewed Oct 26, 2020

View reviewed changes

vpax mentioned this pull request Oct 27, 2020

topic/vern/ast duplicate #1255

Closed

rsmmr reviewed Nov 5, 2020

View reviewed changes

vpax closed this Nov 11, 2020

vpax deleted the topic/vern/brovalunion-change branch January 24, 2023 20:23

changing BroValUnion to use ZAM_record/ZAM_vector #1231

changing BroValUnion to use ZAM_record/ZAM_vector #1231

Conversation

vpax commented Oct 17, 2020

vpax commented Oct 17, 2020

0xxon Oct 19, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0xxon commented Oct 19, 2020

0xxon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vpax Oct 23, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timwoj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jsiwek Oct 24, 2020 • edited Loading

Choose a reason for hiding this comment

vpax Oct 25, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0xxon commented Oct 23, 2020

vpax commented Oct 23, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vpax commented Oct 24, 2020

timwoj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vpax commented Nov 11, 2020

0xxon Oct 19, 2020 •

edited

Loading

vpax Oct 23, 2020 •

edited

Loading

jsiwek Oct 24, 2020 •

edited

Loading

vpax Oct 25, 2020 •

edited

Loading