
[3.x] Provide quick access to Object ancestry #107462


Merged
merged 1 commit into from
Jun 22, 2025

Conversation

lawnjelly
Member

@lawnjelly lawnjelly commented Jun 12, 2025

Although we have increased Object::cast_to performance significantly with #103708 (4.x) and #104825 (3.x) we discussed at the time that for some high performance bottleneck scenarios we may want an even faster way of determining whether an Object is one of the key main types.

At the time I trialed a bitfield to store this info, which worked well and is likely to be one of the fastest methods, and I discussed this with @Ivorforce.

While it involves storing (and retrieving) data from the Object / Node (and thus cache effects), it avoids the overhead of the virtual function approach. A virtual call also has to read the vtable pointer from the object, so I think there is a memory read in all cases.
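To make the two approaches concrete, here is a minimal sketch of a bit test stored on the object next to an equivalent virtual query. The class layout, the `add_ancestry` helper and the bit values are assumptions for illustration only, not the engine's actual declarations.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names and bit values, for illustration only.
enum class AncestralClass : uint32_t {
	NODE = 1u << 0,
	SPATIAL = 1u << 1,
};

class Object {
	uint32_t _ancestry = 0; // one bit per key type, set during construction

protected:
	void add_ancestry(AncestralClass p_class) { _ancestry |= static_cast<uint32_t>(p_class); }

public:
	virtual ~Object() = default;

	// Bitfield approach: a load from the object plus a mask, no indirect call.
	bool has_ancestry(AncestralClass p_class) const {
		return (_ancestry & static_cast<uint32_t>(p_class)) != 0;
	}

	// Virtual approach: loads the vtable pointer, then makes an indirect call.
	virtual bool is_spatial() const { return false; }
};

class Node : public Object {
public:
	Node() { add_ancestry(AncestralClass::NODE); }
};

class Spatial : public Node {
public:
	Spatial() { add_ancestry(AncestralClass::SPATIAL); }
	bool is_spatial() const override { return true; }
};

// Both queries should agree for any object.
inline bool queries_agree(const Object &p_object) {
	return p_object.has_ancestry(AncestralClass::SPATIAL) == p_object.is_spatial();
}
```

Each constructor in the chain ORs in its own bit, so a `Spatial` ends up with both the `NODE` and `SPATIAL` bits set without any per-class lookup at query time.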

Benchmarks

In order to compare the different approaches, I benchmarked between:

  • ancestry (this PR currently)
  • virtual function (e.g. virtual bool is_spatial())
  • Object::cast_to<Spatial>

I wrote GDScript to create a node with 200,000 children, alternating Node and Spatial. The alternation and the count are there to ensure that the optimizer doesn't optimize out the benchmark (although alternating may make things easier for the branch predictor than a random order would).

Created a benchmark C++ function:

void Node::benchmark_ancestry() {
	int num_children = get_child_count();
	uint32_t num_iterations = 100;
	uint32_t count = 0;
	
	// Heat cache
	for (uint32_t iter = 0; iter < 1; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (child->test_ancestry(AncestralClass::ANCESTRAL_CLASS_SPATIAL)) {
				count++;
			}
		}
	}
	
	
	uint64_t before = OS::get_singleton()->get_ticks_usec();
	for (uint32_t iter = 0; iter < num_iterations; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (child->test_ancestry(AncestralClass::ANCESTRAL_CLASS_SPATIAL)) {
				count++;
			}
		}
	}
	uint64_t after = OS::get_singleton()->get_ticks_usec();
	uint64_t ancestry_time = after - before;
	
	
	before = OS::get_singleton()->get_ticks_usec();
	for (uint32_t iter = 0; iter < num_iterations; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (child->virtual_is_spatial()) {
				count++;
			}
		}
	}
	after = OS::get_singleton()->get_ticks_usec();
	uint64_t virtual_time = after - before;

	before = OS::get_singleton()->get_ticks_usec();
	for (uint32_t iter = 0; iter < num_iterations; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (Object::cast_to<Spatial>(child)) {
				count++;
			}
		}
	}
	after = OS::get_singleton()->get_ticks_usec();
	uint64_t cast_time = after - before;

	print_line("ancestry " + itos(ancestry_time) + ", virtual " + itos(virtual_time) + ", cast " + itos(cast_time));
	print_line("count " + itos(count));
}

Results

Godot Engine v3.7.dev.custom_build [57304c0ce] - X11 - 2 monitors - GLES2 - Mesa Intel(R) Graphics (RPL-S) - 13th Gen Intel(R) Core(TM) i7-13700T (24 threads)

200,000 node children

release
ancestry 85793, virtual 86305, cast 103659
debug
ancestry 350618, virtual 249225, cast 308729

2000 node children

release
ancestry 175, virtual 297, cast 344
debug
ancestry 2719, virtual 2804, cast 3386
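As a rough aid to reading these totals, they can be converted to an average cost per check. The sketch below assumes the figures are microseconds (they come from `get_ticks_usec()`) covering 100 iterations over all children; `ns_per_check` is a hypothetical helper, and the result includes loop and `get_child()` overhead, not just the type test.

```cpp
#include <cassert>
#include <cstdint>

// Average cost of one check in nanoseconds, given a total time in microseconds.
// Hypothetical helper for interpreting the benchmark totals.
constexpr double ns_per_check(uint64_t p_total_usec, uint64_t p_children, uint64_t p_iterations) {
	return (static_cast<double>(p_total_usec) * 1000.0) /
			static_cast<double>(p_children * p_iterations);
}

// Release, 2000 children: 175 usec over 200,000 checks is 0.875 ns per check.
static_assert(ns_per_check(175, 2000, 100) == 0.875, "unexpected per-check cost");
```

On that basis the release numbers work out to roughly 4.29 ns (ancestry), 4.32 ns (virtual) and 5.18 ns (cast) per check with 200,000 children, and about 0.88 ns, 1.49 ns and 1.72 ns with 2000 children.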

Discussion

Well, as I suspected, this is a turn-up for the books. I did question whether the cache slowdown of actually reading the data would compare to the overhead of a virtual function, and it turns out to be closer than we thought.

In release, with large numbers of children, storing the type on the object (ancestry) and virtual seem to be neck and neck, with cast_to not that much slower.

With a smaller number of children the picture changes: everything is more likely to be in cache, so the cost of reading the object is lower, and the ancestry approach clearly wins. So with large numbers of nodes, memory read speed may be the bottleneck, and with smaller numbers of nodes, the processing itself may be more of a bottleneck.

So the question at the moment is which approach is likely to be faster in the wild, and by how much. At the moment it looks like ancestry is fastest in all the release runs (the 200,000-child debug run is the exception), but possibly not by a huge margin once we have a really large number of nodes, or the cache is cold. We also have to bear in mind this might be hardware dependent, as memory read speed / cache versus virtual overhead may vary, e.g. on mobile.

Also I have realized that we need to read from the memory in both cases. Even for virtual function, we need to read the vtable pointer from the object in RAM, so there should in theory always be more work for the virtual approach (though it uses no memory on the object itself).

Power use on mobile

I also considered power use on mobile, and asked Grok:

For 200,000 objects, virtual calls could consume 2–5x more power, depending on cache hit rates and branch prediction success.

Notes

  • Only having a single usage so far to keep the PR simple, and @Ivorforce can do a follow up to convert cast_to to use Ancestry for the specific cases covered, so it will invisibly be used everywhere.
  • I'm not absolutely sure yet on using the name "ancestry", as we also tend to use this to refer to ancestors in the scene tree (parents of parents), rather than in terms of inheritance. But the principle should stand whichever name we go for, if we did go with this approach.

@lawnjelly

This comment was marked as outdated.

@lawnjelly lawnjelly force-pushed the quick_ancestry branch 2 times, most recently from 4330714 to 31023df Compare June 16, 2025 18:40
@lawnjelly
Member Author

Actually on second thoughts there's no reason why we can't agree on the implementation first and then do the call points.
I've now added this to Object instead and included Reference as a type.

We don't have an is_ref_counted() function in 3.x but I'm sure there's a call site that can be accelerated.

@lawnjelly lawnjelly changed the title [3.x] Provide quick access to Node ancestry [3.x] Provide quick access to Object ancestry Jun 16, 2025
@lawnjelly lawnjelly marked this pull request as ready for review June 17, 2025 03:31
@lawnjelly lawnjelly requested review from a team as code owners June 17, 2025 03:31
Member

@Ivorforce Ivorforce left a comment


Makes a lot of sense to me. From my side, this is ready to merge.

I think we should bring this into the core meeting before making a final decision. But I'm very confident this will be quite important at little cost in the areas where it's needed.

@lawnjelly
Member Author

lawnjelly commented Jun 19, 2025

As discussed in the core meeting, I've created some code to measure how many cast_tos are used in the engine during runtime, and report at exit (this is probably more appropriate than profiling, because the relative numbers during profiling will be more difficult to gauge).
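The counting idea can be sketched roughly as below. This is not the code from the linked branch: the `CastKind` enum, `count_cast` and `report_cast_counts` are hypothetical names, and only two of the counted types are shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical subset of the counted types.
enum CastKind {
	CAST_NODE,
	CAST_SPATIAL,
	CAST_KIND_MAX,
};

// Static storage is zero-initialized, so all counters start at 0.
static std::atomic<uint64_t> cast_counts[CAST_KIND_MAX];

// Would be called from inside cast_to for each matching target type.
inline void count_cast(CastKind p_kind) {
	cast_counts[p_kind].fetch_add(1, std::memory_order_relaxed);
}

// Prints the totals in the same "name : count" layout as the results here.
inline void report_cast_counts() {
	std::printf("nodes : %llu\n", static_cast<unsigned long long>(cast_counts[CAST_NODE].load()));
	std::printf("spatials : %llu\n", static_cast<unsigned long long>(cast_counts[CAST_SPATIAL].load()));
}
```

Registering `report_cast_counts` with `std::atexit` (from `<cstdlib>`), or calling it from the engine's shutdown path, would print the totals once at exit. Relaxed atomics are used so the counters stay correct if casts happen on several threads, at a small cost per call.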

If anyone is interested in running it, the counting debugger is here:
https://github.com/lawnjelly/godot/tree/count_cast_tos

I'll try and write some results here.

TPS demo

a few seconds of gameplay:

references : 148131
canvas_items : 8995
spatials : 408783
visual_instances : 156993
resources : 146627
nodes : 464454
controls : 8995
input_defaults : 3
node_2ds : 0
bone_2ds : 0
area_2ds : 0
skeleton_2ds : 0
animation_players : 25482
collision_object_2ds : 0
tree_items : 0
viewports : 3450
scripts : 32390
areas : 5287
cameras : 6059
collision_objects : 91437
collision_shapes : 50011
rigid_bodies : 49426
physics_bodies : 86150
static_bodies : 13078
skeletons : 8053
textures : 5177
materials : 2593
mesh_instances : 69703
geometry_instances : 146343
directional_lights : 0
omni_lights : 9285
spot_lights : 1330

This confirms that the FTI (even optimized) is a heavy user of cast_to, and Spatial and VisualInstance are currently highly used.

Jetpaca (2D game)

references : 71005
canvas_items : 637547
spatials : 0
visual_instances : 0
resources : 69713
nodes : 884949
controls : 33336
input_defaults : 3
node_2ds : 604211
bone_2ds : 0
area_2ds : 109461
skeleton_2ds : 0
animation_players : 3858
collision_object_2ds : 253551
tree_items : 0
viewports : 10681
scripts : 32235
areas : 0
cameras : 0
collision_objects : 0
collision_shapes : 0
rigid_bodies : 0
physics_bodies : 0
static_bodies : 0
skeletons : 0
textures : 2659
materials : 35
mesh_instances : 0
geometry_instances : 0
directional_lights : 0
omni_lights : 0
spot_lights : 0

As expected, there's no use of the 3D cast_to.

Editor (Jetpaca)

references : 436987
canvas_items : 6591240
spatials : 292
visual_instances : 0
resources : 435979
nodes : 7987456
controls : 3851408
input_defaults : 4
node_2ds : 2739832
bone_2ds : 0
area_2ds : 529424
skeleton_2ds : 0
animation_players : 110205
collision_object_2ds : 942535
tree_items : 2347
viewports : 12352
scripts : 20763
areas : 0
cameras : 292
collision_objects : 0
collision_shapes : 0
rigid_bodies : 0
physics_bodies : 0
static_bodies : 0
skeletons : 0
textures : 119644
materials : 71
mesh_instances : 0
geometry_instances : 0
directional_lights : 0
omni_lights : 0
spot_lights : 0

CanvasItem / Node / Control cast_to is crazy in the editor!! So this could really benefit from this PR.

In fact I wonder whether we have some dodgy loops there which are causing this; perhaps it is recursive layout for the GUI. Regardless, this should make the editor more snappy if there is a 2x speedup in these casts.

Editor (TPS demo)

references : 369328
canvas_items : 3314418
spatials : 245510
visual_instances : 91975
resources : 365110
nodes : 3808419
controls : 3314317
input_defaults : 4
node_2ds : 101
bone_2ds : 0
area_2ds : 0
skeleton_2ds : 0
animation_players : 757
collision_object_2ds : 0
tree_items : 1869
viewports : 8220
scripts : 5176
areas : 559
cameras : 589
collision_objects : 38171
collision_shapes : 88870
rigid_bodies : 1212
physics_bodies : 37612
static_bodies : 35390
skeletons : 540
textures : 115455
materials : 4907
mesh_instances : 86251
geometry_instances : 89390
directional_lights : 0
omni_lights : 770
spot_lights : 1088

I'll try and add some more types and repost data.

@lawnjelly
Member Author

lawnjelly commented Jun 19, 2025

So to summarize the results, some casts I can see that are particularly hot:

// Editor
References
CanvasItem
Nodes
Controls
Resources
Node2D


// 3D
Spatials
VisualInstances
GeometryInstances
CollisionObjects
PhysicsBodies
MeshInstances

// 2D
CollisionObject2D
Area2D
Script

I've updated the PR to include all of these, and it takes 15 bits.

As I say, going up from 4 bits is highly likely to be "free" for all intents and purposes, because of padding, and not worth skimping too much as we already store heavy data on Objects / Nodes.
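The padding argument can be checked with a small sketch. The two structs below are hypothetical stand-ins rather than the real Object layout, and bit-field layout is implementation-defined, so this only shows what mainstream compilers are expected to do.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for an object that already holds pointer-sized data.
struct WithFourBits {
	void *ptr;
	uint32_t ancestry : 4;
};

struct WithFifteenBits {
	void *ptr;
	uint32_t ancestry : 15;
};

// Both bit-fields fit in the same 32-bit storage unit, so growing from
// 4 to 15 bits is not expected to change the size of the struct.
inline bool same_size() {
	return sizeof(WithFourBits) == sizeof(WithFifteenBits);
}
```

The same reasoning suggests further types stay free until the field outgrows its 32-bit unit.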

Now it will just be left for @Ivorforce to work his magic after this PR is merged, in order to make it work with cast_to automagically. 😁

@lawnjelly lawnjelly requested a review from a team as a code owner June 19, 2025 10:05
@lawnjelly
Member Author

Just had a little try at the cast_to but ran out of time as I have to go away this long weekend:

I was guessing we were aiming for something like:

if constexpr (std::is_base_of<Reference, T>::value) {
	return p_object->has_ancestry(AncestralClass::REFERENCE);
}

I'm currently bumping C++ to 17, so we can assume we'll be able to use if constexpr.

The dependency management with headers is tricky though, as we don't want to be including reference.h, node.h etc in object.h.

I tried an experiment of detecting whether the class had been defined:

	class Node; // forward declare

	template <typename T, typename = void>
	struct is_complete : std::false_type {};

	template <typename T>
	struct is_complete<T, std::void_t<decltype(sizeof(T))>> : std::true_type {};

	template <typename T>
	static constexpr bool is_complete_v = is_complete<T>::value;

...
	template <class T>
	static T *cast_to(Object *p_object) {

		if constexpr (is_complete_v<Node>) {
			if constexpr (std::is_base_of<Node, T>::value) {
				return p_object->has_ancestry(AncestralClass::NODE);
			}
		}
	}

The idea was to only compile in the ancestry check if Node had already been included. However, it didn't work, I think because object.h is included before node.h in all cases, so at the time of object.h, Node is never defined, and the constexpr is never true.
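For comparison, here is one possible shape for such a fast path that needs only a forward declaration. It relies on `std::is_same`, which (unlike `std::is_base_of`) does not require complete types. This is a sketch under assumptions and not what the PR or any follow-up implements: the class layout, `add_ancestry`, the `dynamic_cast` fallback and the `Widget` type are all hypothetical, and it only covers an exact match on a key type.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

class Node; // forward declaration only, as object.h would have

enum class AncestralClass : uint32_t {
	NODE = 1u << 0,
};

class Object {
	uint32_t _ancestry = 0;

protected:
	void add_ancestry(AncestralClass p_class) { _ancestry |= static_cast<uint32_t>(p_class); }

public:
	virtual ~Object() = default;

	bool has_ancestry(AncestralClass p_class) const {
		return (_ancestry & static_cast<uint32_t>(p_class)) != 0;
	}

	template <class T>
	static T *cast_to(Object *p_object) {
		if (!p_object) {
			return nullptr;
		}
		if constexpr (std::is_same<T, Node>::value) {
			// Exact match on a key type: a single bit test decides the cast.
			return p_object->has_ancestry(AncestralClass::NODE) ? static_cast<T *>(p_object) : nullptr;
		} else {
			// Every other type keeps a generic (slower) path.
			return dynamic_cast<T *>(p_object);
		}
	}
};

class Node : public Object {
public:
	Node() { add_ancestry(AncestralClass::NODE); }
};

class Widget : public Node {}; // hypothetical type without its own bit
```

Note the test is `is_same`, not `is_base_of`: having the NODE bit proves an object is a Node, but not that it is any particular class derived from Node, so for derived targets the bit could only serve as an early rejection.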

Anyway will leave @Ivorforce to try and figure this out.

Member

@Ivorforce Ivorforce left a comment


That's more types than I expected, but I suppose it's free to add more until the bit field costs 32 bits. So we still have some slack even after the PR.

Looks great, ready to merge!

@lawnjelly lawnjelly merged commit c6f9190 into godotengine:3.x Jun 22, 2025
14 checks passed
@lawnjelly lawnjelly deleted the quick_ancestry branch June 22, 2025 12:10
@lawnjelly
Member Author

Thanks!
