[3.x] Provide quick access to `Object` ancestry #107462

lawnjelly · 2025-06-12T17:39:28Z

Although we have increased Object::cast_to performance significantly with #103708 (4.x) and #104825 (3.x), we discussed at the time that for some high-performance bottleneck scenarios we may want an even faster way of determining whether an Object is one of the key main types.

At the time I trialed using a bitfield to store this info and discussed it with @Ivorforce. It worked well, and is likely to be one of the fastest methods.

While it involves storing (and retrieving) data on the Object / Node (and thus cache effects), it avoids the overhead of a virtual function approach. The virtual function also requires reading the vtable pointer from the object, so I think there is a read in all cases.
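To make the bitfield idea concrete, here is a minimal sketch, not the PR's actual code. `AncestralClass` and `has_ancestry` are names used later in this thread; `ObjectSketch`, `NodeSketch`, `SpatialSketch`, `_ancestry` and `_add_ancestry` are hypothetical and only there for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for the engine's enum: one bit per key base class.
enum class AncestralClass : uint32_t {
	REFERENCE = 1u << 0,
	NODE = 1u << 1,
	SPATIAL = 1u << 2,
};

class ObjectSketch {
	uint32_t _ancestry = 0; // Bits accumulate as each constructor in the chain runs.

protected:
	void _add_ancestry(AncestralClass p_class) {
		_ancestry |= static_cast<uint32_t>(p_class);
	}

public:
	// A plain load and mask: no virtual dispatch, no walk up the class hierarchy.
	bool has_ancestry(AncestralClass p_class) const {
		return (_ancestry & static_cast<uint32_t>(p_class)) != 0;
	}
};

class NodeSketch : public ObjectSketch {
public:
	NodeSketch() { _add_ancestry(AncestralClass::NODE); }
};

class SpatialSketch : public NodeSketch {
public:
	SpatialSketch() { _add_ancestry(AncestralClass::SPATIAL); }
};
```

Constructing a `SpatialSketch` runs the `NodeSketch` constructor first, so the object ends up with both the NODE and SPATIAL bits set, while a plain `NodeSketch` only has NODE.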

Benchmarks

In order to compare the different approaches, I benchmarked between:

ancestry (this PR currently)
virtual function (e.g. virtual bool is_spatial())
Object::cast_to<Spatial>

I wrote GDScript to create a node with 200,000 children, alternating Node and Spatial. The alternation and the count are there to ensure that the optimizer doesn't optimize out the benchmark (although alternating may make things easier for the branch predictor than a random order would).

I then created a benchmark C++ function:

void Node::benchmark_ancestry() {
	int num_children = get_child_count();
	uint32_t num_iterations = 100;
	uint32_t count = 0;
	
	// Heat cache
	for (uint32_t iter = 0; iter < 1; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (child->test_ancestry(AncestralClass::ANCESTRAL_CLASS_SPATIAL)) {
				count++;
			}
		}
	}
	
	
	uint64_t before = OS::get_singleton()->get_ticks_usec();
	for (uint32_t iter = 0; iter < num_iterations; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (child->test_ancestry(AncestralClass::ANCESTRAL_CLASS_SPATIAL)) {
				count++;
			}
		}
	}
	uint64_t after = OS::get_singleton()->get_ticks_usec();
	uint64_t ancestry_time = after - before;
	
	
	before = OS::get_singleton()->get_ticks_usec();
	for (uint32_t iter = 0; iter < num_iterations; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (child->virtual_is_spatial()) {
				count++;
			}
		}
	}
	after = OS::get_singleton()->get_ticks_usec();
	uint64_t virtual_time = after - before;

	before = OS::get_singleton()->get_ticks_usec();
	for (uint32_t iter = 0; iter < num_iterations; iter++) {
		for (int n = 0; n < num_children; n++) {
			Node *child = get_child(n);
			if (Object::cast_to<Spatial>(child)) {
				count++;
			}
		}
	}
	after = OS::get_singleton()->get_ticks_usec();
	uint64_t cast_time = after - before;

	print_line("ancestry " + itos(ancestry_time) + ", virtual " + itos(virtual_time) + ", cast " + itos(cast_time));
	print_line("count " + itos(count));
}

Results

Godot Engine v3.7.dev.custom_build [57304c0ce] - X11 - 2 monitors - GLES2 - Mesa Intel(R) Graphics (RPL-S) - 13th Gen Intel(R) Core(TM) i7-13700T (24 threads)

All times are in microseconds, as reported by get_ticks_usec in the benchmark above.

200,000 node children

release: ancestry 85793, virtual 86305, cast 103659
debug: ancestry 350618, virtual 249225, cast 308729

2000 node children

release: ancestry 175, virtual 297, cast 344
debug: ancestry 2719, virtual 2804, cast 3386
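To put the totals on a per-check footing, the small helper below converts a measured time into nanoseconds per type check. It is only a sketch: `ns_per_check` is a hypothetical name, and it assumes each timed section performs children × iterations checks, with the same 100 iterations used for both child counts.

```cpp
#include <cstdint>

// Converts a total time in microseconds into nanoseconds per type check.
// Assumes each timed section performs p_children * p_iterations checks.
constexpr double ns_per_check(uint64_t p_total_usec, uint64_t p_children, uint64_t p_iterations) {
	return (static_cast<double>(p_total_usec) * 1000.0) / static_cast<double>(p_children * p_iterations);
}
```

Under those assumptions the release numbers work out to roughly 4.29 ns (ancestry), 4.32 ns (virtual) and 5.18 ns (cast) per check with 200,000 children, and roughly 0.88 ns, 1.49 ns and 1.72 ns with 2000 children. These figures include the loop and get_child overhead, not just the type check itself.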

Discussion

Well, as I suspected, this is a turn-up for the books. I did question whether the cache slowdown of actually reading the data would be comparable to the overhead of a virtual function, and it turns out to be closer than we thought.

In release, with large numbers of children, storing the type on the object (ancestry) and virtual seem to be neck and neck, with cast_to not that much slower.

With a smaller number of children the picture changes: everything is more likely to be in cache, so the cost of reading the object is lower, and the ancestry approach clearly wins. So with large numbers of nodes, memory read speed may be the bottleneck, and with smaller numbers of nodes, the processing itself may be more of a bottleneck.

So the question at the moment is which approach is likely to be faster in the wild, and by how much. At the moment it looks like ancestry is fastest in release builds in both tests (it was the slowest in the 200,000-node debug run), but possibly not by a huge margin once we have a really large number of nodes, or the cache is cold. We also have to bear in mind that this might be hardware dependent, as memory read speed / cache versus virtual overhead may vary, e.g. on mobile.

Also I have realized that we need to read from the memory in both cases. Even for virtual function, we need to read the vtable pointer from the object in RAM, so there should in theory always be more work for the virtual approach (though it uses no memory on the object itself).

Power use on mobile

I also considered power use on mobile, and asked Grok:

For 200,000 objects, virtual calls could consume 2–5x more power, depending on cache hit rates and branch prediction success.

Notes

There is only a single usage so far, to keep the PR simple. @Ivorforce can do a follow-up to convert cast_to to use ancestry for the specific cases covered, so it will invisibly be used everywhere.
I'm not absolutely sure yet about using the name "ancestry", as we also tend to use this to refer to ancestors in the scene tree (parents of parents), rather than in terms of inheritance. But the principle should stand whichever name we go for, if we do go with this approach.

scene/main/node.h

lawnjelly · 2025-06-16T18:44:14Z

Actually on second thoughts there's no reason why we can't agree on the implementation then do the call points.
I've now added to Object instead and included reference as a type.

We don't have an is_ref_counted() function in 3.x, but I'm sure there's a call site that can be accelerated.

core/object.h

core/variant.cpp

core/object.h

Ivorforce

Makes a lot of sense to me. From my side, this is ready to merge.

I think we should bring this into the core meeting before making a final decision. But I'm very confident this will be quite important at little cost in the areas where it's needed.

lawnjelly · 2025-06-19T08:26:40Z

As discussed in the core meeting, I've created some code to measure how many cast_tos are used in the engine during runtime, and report at exit (this is probably more appropriate than profiling, because the relative numbers during profiling will be more difficult to gauge).

If anyone is interested in running it, the counting debugger is here:
https://github.com/lawnjelly/godot/tree/count_cast_tos
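For readers who just want the shape of the idea, a counting scheme of this kind could look roughly like the sketch below. This is not the code in the linked branch: `CastCategory`, `CastCounter`, `record`, `get` and `report` are hypothetical names, and only three categories are shown.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical categories; the counts posted in this thread cover many more types.
enum CastCategory {
	CAST_NODE,
	CAST_SPATIAL,
	CAST_REFERENCE,
	CAST_MAX,
};

struct CastCounter {
	std::atomic<uint64_t> counts[CAST_MAX] = {};

	// Called from the instrumented cast; relaxed ordering is enough for a tally.
	void record(CastCategory p_category) {
		counts[p_category].fetch_add(1, std::memory_order_relaxed);
	}

	uint64_t get(CastCategory p_category) const {
		return counts[p_category].load(std::memory_order_relaxed);
	}

	// Intended to be called once at shutdown, printing "name : count" lines.
	void report() const {
		static const char *names[CAST_MAX] = { "nodes", "spatials", "references" };
		for (int i = 0; i < CAST_MAX; i++) {
			std::printf("%s : %llu\n", names[i],
					static_cast<unsigned long long>(get(static_cast<CastCategory>(i))));
		}
	}
};
```

Counting calls rather than sampling with a profiler gives exact call totals per type, which is what makes the relative numbers easy to compare.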

I'll try and write some results here.

TPS demo

a few seconds of gameplay:

references : 148131
canvas_items : 8995
spatials : 408783
visual_instances : 156993
resources : 146627
nodes : 464454
controls : 8995
input_defaults : 3
node_2ds : 0
bone_2ds : 0
area_2ds : 0
skeleton_2ds : 0
animation_players : 25482
collision_object_2ds : 0
tree_items : 0
viewports : 3450
scripts : 32390
areas : 5287
cameras : 6059
collision_objects : 91437
collision_shapes : 50011
rigid_bodies : 49426
physics_bodies : 86150
static_bodies : 13078
skeletons : 8053
textures : 5177
materials : 2593
mesh_instances : 69703
geometry_instances : 146343
directional_lights : 0
omni_lights : 9285
spot_lights : 1330

This confirms that the FTI (even optimized) is a heavy user of cast_to, and Spatial and VisualInstance are currently highly used.

Jetpaca (2D game)

references : 71005
canvas_items : 637547
spatials : 0
visual_instances : 0
resources : 69713
nodes : 884949
controls : 33336
input_defaults : 3
node_2ds : 604211
bone_2ds : 0
area_2ds : 109461
skeleton_2ds : 0
animation_players : 3858
collision_object_2ds : 253551
tree_items : 0
viewports : 10681
scripts : 32235
areas : 0
cameras : 0
collision_objects : 0
collision_shapes : 0
rigid_bodies : 0
physics_bodies : 0
static_bodies : 0
skeletons : 0
textures : 2659
materials : 35
mesh_instances : 0
geometry_instances : 0
directional_lights : 0
omni_lights : 0
spot_lights : 0

As expected, there's no use of the 3D cast_to.

Editor (Jetpaca)

references : 436987
canvas_items : 6591240
spatials : 292
visual_instances : 0
resources : 435979
nodes : 7987456
controls : 3851408
input_defaults : 4
node_2ds : 2739832
bone_2ds : 0
area_2ds : 529424
skeleton_2ds : 0
animation_players : 110205
collision_object_2ds : 942535
tree_items : 2347
viewports : 12352
scripts : 20763
areas : 0
cameras : 292
collision_objects : 0
collision_shapes : 0
rigid_bodies : 0
physics_bodies : 0
static_bodies : 0
skeletons : 0
textures : 119644
materials : 71
mesh_instances : 0
geometry_instances : 0
directional_lights : 0
omni_lights : 0
spot_lights : 0

CanvasItem / Node / Control cast_to is crazy in the editor!! So this could really benefit from this PR.

In fact I wonder whether we have some dodgy loops there which are causing this; perhaps it is recursive layout for the GUI. Regardless, this should make the editor snappier if there is a 2x speedup in these casts.

Editor (TPS demo)

references : 369328
canvas_items : 3314418
spatials : 245510
visual_instances : 91975
resources : 365110
nodes : 3808419
controls : 3314317
input_defaults : 4
node_2ds : 101
bone_2ds : 0
area_2ds : 0
skeleton_2ds : 0
animation_players : 757
collision_object_2ds : 0
tree_items : 1869
viewports : 8220
scripts : 5176
areas : 559
cameras : 589
collision_objects : 38171
collision_shapes : 88870
rigid_bodies : 1212
physics_bodies : 37612
static_bodies : 35390
skeletons : 540
textures : 115455
materials : 4907
mesh_instances : 86251
geometry_instances : 89390
directional_lights : 0
omni_lights : 770
spot_lights : 1088

I'll try and add some more types and repost data.

lawnjelly · 2025-06-19T09:34:29Z

So to summarize the results, some casts I can see that are particularly hot:

// Editor
References
CanvasItem
Nodes
Controls
Resources
Node2D


// 3D
Spatials
VisualInstances
GeometryInstances
CollisionObjects
PhysicsBodies
MeshInstances

// 2D
CollisionObject2D
Area2D
Script

I've updated the PR to include all of these, and it takes 15 bits.

As I say, going up from 4 bits is highly likely to be "free" for all intents and purposes, because of padding, and not worth skimping too much as we already store heavy data on Objects / Nodes.
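The padding argument can be illustrated with a small layout check. The two structs are hypothetical and much simpler than the real Object, so this only shows the general effect on typical 32-bit and 64-bit ABIs, not the actual size of any engine class.

```cpp
#include <cstdint>

// Hypothetical layouts: a pointer-sized member followed by ancestry flags of different widths.
struct WithNarrowFlags {
	void *script_instance;
	uint8_t ancestry; // Enough for the original handful of flags.
};

struct WithWideFlags {
	void *script_instance;
	uint32_t ancestry; // Room for up to 32 flags.
};

// Both structs are padded up to pointer alignment, so on typical ABIs
// the wider field lands in bytes that would otherwise be padding.
static_assert(sizeof(WithNarrowFlags) == sizeof(WithWideFlags),
		"wider flags fit in existing padding on this ABI");
```

Whether the same holds for the real Object depends on which members sit next to the field, so it is worth checking sizeof before and after on the target platforms.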

Now it will just be left for @Ivorforce to work his magic after this PR is merged, in order to make it work with cast_to automagically. 😁

lawnjelly · 2025-06-19T11:53:55Z

Just had a little try at the cast_to but ran out of time as I have to go away this long weekend:

I was guessing we were aiming for something like:

if constexpr (std::is_base_of<Reference, T>::value) {
	return p_object->has_ancestry(AncestralClass::REFERENCE);
}

I'm currently bumping C++ to 17, so we can assume we'll be able to use if constexpr.
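A self-contained sketch of that kind of dispatch is below. It is not the follow-up implementation: all the `*Sketch` classes and `_add_ancestry` are hypothetical, and `dynamic_cast` is only a stand-in for the engine's generic check. One assumption worth spelling out: the ancestry bit alone only proves the cast when T is the key class itself, so for classes derived from it the bit is used here as a cheap early-out before the slow path.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical flags and classes, standing in for the engine's own.
enum class AncestralClass : uint32_t { NODE = 1u << 0 };

class ObjectSketch {
	uint32_t _ancestry = 0;

protected:
	void _add_ancestry(AncestralClass p_class) { _ancestry |= static_cast<uint32_t>(p_class); }

public:
	virtual ~ObjectSketch() {}
	bool has_ancestry(AncestralClass p_class) const {
		return (_ancestry & static_cast<uint32_t>(p_class)) != 0;
	}

	template <class T>
	static T *cast_to(ObjectSketch *p_object);
};

class NodeSketch : public ObjectSketch {
public:
	NodeSketch() { _add_ancestry(AncestralClass::NODE); }
};

class ChildSketch : public NodeSketch {};
class ReferenceSketch : public ObjectSketch {};

// Defined after NodeSketch is complete, which is the header-ordering
// constraint that makes this awkward to place in object.h.
template <class T>
T *ObjectSketch::cast_to(ObjectSketch *p_object) {
	if (!p_object) {
		return nullptr;
	}
	if constexpr (std::is_base_of<NodeSketch, T>::value) {
		// Anything without the NODE bit cannot be a T: cheap early-out.
		if (!p_object->has_ancestry(AncestralClass::NODE)) {
			return nullptr;
		}
		if constexpr (std::is_same<NodeSketch, T>::value) {
			// For the key class itself, the bit alone proves the cast.
			return static_cast<T *>(p_object);
		}
	}
	// Slow path; dynamic_cast is only a stand-in for the engine's generic check.
	return dynamic_cast<T *>(p_object);
}
```

With this layout, `cast_to<NodeSketch>` never reaches the slow path, and `cast_to<ChildSketch>` only reaches it for objects that are at least a `NodeSketch`.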

The dependency management with headers is tricky though, as we don't want to be including reference.h, node.h etc in object.h.

I tried an experiment of detecting whether the class had been defined:

	class Node; // forward declare

	template <typename T, typename = void>
	struct is_complete : std::false_type {};

	template <typename T>
	struct is_complete<T, std::void_t<decltype(sizeof(T))>> : std::true_type {};

	template <typename T>
	static constexpr bool is_complete_v = is_complete<T>::value;

...
	template <class T>
	static T *cast_to(Object *p_object) {

		if constexpr (is_complete_v<Node>) {
			if constexpr (std::is_base_of<Node, T>::value) {
				return p_object->has_ancestry(AncestralClass::NODE);
			}
		}
	}

The idea was to only compile in the ancestry check if Node had already been included. However, it didn't work, I think because object.h is included before node.h in all cases, so at the time of object.h, Node is never defined, and the constexpr is never true.
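The behaviour of the detection trait can be checked in isolation. In the sketch below, `NeverDefined` and `Defined` are hypothetical types: the first is only forward declared in the translation unit (the situation Node is in when object.h is parsed), the second is fully defined before the trait is used.

```cpp
#include <type_traits>

struct NeverDefined; // Only forward declared in this translation unit.
struct Defined {
	int value;
};

template <typename T, typename = void>
struct is_complete : std::false_type {};

// sizeof(T) is invalid for an incomplete type, so this specialization
// drops out via SFINAE and the primary template (false) is used instead.
template <typename T>
struct is_complete<T, std::void_t<decltype(sizeof(T))>> : std::true_type {};

static_assert(!is_complete<NeverDefined>::value, "incomplete at the point of use");
static_assert(is_complete<Defined>::value, "complete at the point of use");
```

So the trait itself works, but it answers for the point where it is first evaluated. As far as I understand, a given specialization is only instantiated once per translation unit, so an answer computed while Node is still incomplete would not be revisited after node.h is included, which matches the result seen above.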

Anyway will leave @Ivorforce to try and figure this out.

Ivorforce

That's more types than I expected, but I suppose it's free to add more until the bit field costs 32 bits. So we still have some slack even after the PR.

Looks great, ready to merge!

lawnjelly · 2025-06-22T12:10:27Z

Thanks!

lawnjelly added this to the 3.7 milestone Jun 12, 2025

lawnjelly added enhancement topic:core performance labels Jun 12, 2025

Ivorforce reviewed Jun 12, 2025

lawnjelly force-pushed the quick_ancestry branch from abc9057 to cb14c3a Compare June 13, 2025 06:05

lawnjelly force-pushed the quick_ancestry branch 2 times, most recently from 4330714 to 31023df Compare June 16, 2025 18:40

lawnjelly changed the title ~~[3.x] Provide quick access to Node ancestry~~ [3.x] Provide quick access to Object ancestry Jun 16, 2025

lawnjelly force-pushed the quick_ancestry branch from 31023df to 5331f8c Compare June 16, 2025 19:00

Ivorforce reviewed Jun 16, 2025

lawnjelly force-pushed the quick_ancestry branch from 5331f8c to d3a31a8 Compare June 17, 2025 03:19

lawnjelly marked this pull request as ready for review June 17, 2025 03:31

lawnjelly requested review from a team as code owners June 17, 2025 03:31

Ivorforce reviewed Jun 18, 2025

Ivorforce added the for pr meeting label Jun 18, 2025

Provide quick access to Object ancestry

ae786bd

lawnjelly force-pushed the quick_ancestry branch from d3a31a8 to ae786bd Compare June 19, 2025 10:05

lawnjelly requested a review from a team as a code owner June 19, 2025 10:05

lawnjelly removed the for pr meeting label Jun 19, 2025

Ivorforce approved these changes Jun 20, 2025

lawnjelly merged commit c6f9190 into godotengine:3.x Jun 22, 2025
14 checks passed

lawnjelly deleted the quick_ancestry branch June 22, 2025 12:10

Ivorforce mentioned this pull request Jun 22, 2025

[3.x] Optimize Object::cast_to with ancestral classes when possible #107844

Merged

lawnjelly mentioned this pull request Jun 22, 2025

Provide quick access to Object ancestry #107868

Open
