[rmodels] DrawSphereEx() optimization #4106

Open
wants to merge 7 commits into
base: master

smalltimewizard
Contributor

smalltimewizard commented Jun 26, 2024

Removed unnecessary cos/sin usage in DrawSphereEx(). For a sphere with 16 rings and 16 slices (as with DrawSphere()), 8640 cos/sin calls pre-optimization are reduced to 72.
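
The two call counts can be checked with a little arithmetic. The sketch below assumes the original function evaluates 5 cosf/sinf calls for each of the 6 vertices it emits per ring/slice step, and that the precalculated tables hold rings + 3 and slices + 1 entries. Both are assumptions inferred from the figures quoted here, and the helper names are hypothetical.

```c
#include <assert.h>

// Trig calls in the original DrawSphereEx(), assuming 6 vertices per
// ring/slice step and 5 cosf/sinf calls per vertex (assumption)
int TrigCallsOriginal(int rings, int slices)
{
    assert(rings >= 0 && slices > 0);
    return (rings + 2)*slices*6*5;
}

// Trig calls with precalculated tables, assuming one cosf and one sinf per
// entry, with (rings + 3) ring entries and (slices + 1) slice entries (assumption)
int TrigCallsPrecalculated(int rings, int slices)
{
    assert(rings >= 0 && slices > 0);
    return 2*(rings + 3) + 2*(slices + 1);
}
```

With rings = 16 and slices = 16 these return 8640 and 72, matching the numbers above.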

Benchmark (1000 spheres):

// gcc main.c -o main.exe -I ./include -L ./lib -lraylib -lgdi32 -lwinmm

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <assert.h>

#include "raylib.h"
#include "raymath.h"

Camera3D camera = {
	.position = (Vector3){-50.0f,50.0f,-50.0f},
	.target = (Vector3){22.5f,22.5f,22.5f},
	.up = (Vector3){0,1,0},
	.fovy = 70.0f,
	.projection = CAMERA_PERSPECTIVE,
};

int main(int argc, char** argv) {
	InitWindow(1280, 720, "DrawSphere benchmark");
	
	int density = 10; // density^3 = number of rendered spheres
	float scale = 50.0f;
	float cameraspeed = 15.0f;
	
	while (!WindowShouldClose()) {
		float deltatime = GetFrameTime();
		printf("%f\n", deltatime);
		
		Vector3 camerafacing = Vector3Normalize(Vector3Subtract(camera.target, camera.position));
		
		if (IsKeyDown(KEY_W)) {
			camera.position = Vector3Add(camera.position, Vector3Scale(camerafacing, cameraspeed*deltatime));
		}
		if (IsKeyDown(KEY_S)) {
			camera.position = Vector3Add(camera.position, Vector3Scale(camerafacing, -cameraspeed*deltatime));
		}
		if (IsKeyDown(KEY_A)) {
			camera.position = Vector3Add(camera.position, Vector3Scale(Vector3Normalize(Vector3CrossProduct(camerafacing, camera.up)), -cameraspeed*deltatime));
		}
		if (IsKeyDown(KEY_D)) {
			camera.position = Vector3Add(camera.position, Vector3Scale(Vector3Normalize(Vector3CrossProduct(camerafacing, camera.up)), cameraspeed*deltatime));
		}
		
		BeginDrawing();
		ClearBackground(RAYWHITE);
			BeginMode3D(camera);
				for (int i = 0; i < density; i++) {
					for (int j = 0; j < density; j++) {
						for (int k = 0; k < density; k++) {
							Color color = (Color){255-i*20, 255-j*20, k*20, 255};
							DrawSphere((Vector3){i*scale/density,j*scale/density,k*scale/density}, 1.0f, color);
						}
					}
				}
			EndMode3D();
			DrawFPS(10,10);
		EndDrawing();
	}
	
	CloseWindow();
	
	return 0;
}

I compiled raylib (mingw-w64, static, Release) without and with the optimization. On my machine, the above benchmark gives a frametime of 75 ms without the optimization and 50 ms with it.

Precalculates sin/cos to eliminate unnecessary calls.
Off-by-one error: added 1 additional precalculated cos/sin value to each array to complete the 360-degree wraparound. Technically the value of these last elements will always be the same as the first element due to the 360-degree wraparound, but this is the simplest solution.
@smalltimewizard
Contributor Author

smalltimewizard commented Jun 26, 2024

I have another version without calloc/free, but it's not faster (than the calloc version) on my machine. If someone wants to play around with it:

// Draw sphere with extended parameters
void DrawSphereEx(Vector3 centerPos, float radius, int rings, int slices, Color color)
{
    rlPushMatrix();
        // NOTE: Transformation is applied in inverse order (scale -> translate)
        rlTranslatef(centerPos.x, centerPos.y, centerPos.z);
        rlScalef(radius, radius, radius);

        rlBegin(RL_TRIANGLES);
            rlColor4ub(color.r, color.g, color.b, color.a);

            float cosring[2];
            float sinring[2];
            float cosslice[2];
            float sinslice[2];
            // Ring angle starts at 270 degrees (bottom pole): cos = 0, sin = -1
            cosring[1] = 0.0f;
            sinring[1] = -1.0f;
            cosslice[1] = 1.0f;
            sinslice[1] = 0.0f;

            for (int i = 0; i < (rings + 2); i++)
            {
                float nextringangle = DEG2RAD*(270 + (180.0f/(rings + 1))*(i+1));
                cosring[0] = cosring[1];
                sinring[0] = sinring[1];
                cosring[1] = cosf(nextringangle);
                sinring[1] = sinf(nextringangle);

                for (int j = 0; j < slices; j++)
                {
                    float nextsliceangle = DEG2RAD*(360.0f*(j+1)/slices);
                    cosslice[0] = cosslice[1];
                    sinslice[0] = sinslice[1];
                    cosslice[1] = cosf(nextsliceangle);
                    sinslice[1] = sinf(nextsliceangle);

                    rlVertex3f(cosring[0]*sinslice[0], sinring[0], cosring[0]*cosslice[0]);
                    rlVertex3f(cosring[1]*sinslice[1], sinring[1], cosring[1]*cosslice[1]);
                    rlVertex3f(cosring[1]*sinslice[0], sinring[1], cosring[1]*cosslice[0]);

                    rlVertex3f(cosring[0]*sinslice[0], sinring[0], cosring[0]*cosslice[0]);
                    rlVertex3f(cosring[0]*sinslice[1], sinring[0], cosring[0]*cosslice[1]);
                    rlVertex3f(cosring[1]*sinslice[1], sinring[1], cosring[1]*cosslice[1]);
                }
            }

        rlEnd();
    rlPopMatrix();
}

@raysan5
Owner

raysan5 commented Jun 27, 2024

@smalltimewizard I'd prefer to avoid allocators inside the function. What is the improvement without the allocators compared to the original version?

@smalltimewizard
Contributor Author

smalltimewizard commented Jun 27, 2024

I get a 65 ms frametime from my benchmark with no allocators (for a sphere with 16 rings and 16 slices, the no-allocator algorithm makes 612 cos/sin calls).
I also have an idea for a better algorithm, but not sure when I will find time to work on it.
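
The 612 figure follows from the loop structure of the no-allocator version posted earlier in this thread: one cosf/sinf pair per ring iteration, plus one pair per slice iteration inside it. A small sketch of that count, with a hypothetical helper name:

```c
#include <assert.h>

// Trig calls in the no-allocator version: (rings + 2) ring iterations, each
// making 2 calls for the ring angle and 2 calls per slice
int TrigCallsRolling(int rings, int slices)
{
    assert(rings >= 0 && slices > 0);
    return (rings + 2)*(2 + 2*slices);
}
```

For rings = 16 and slices = 16 this returns 612.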

@raysan5
Owner

raysan5 commented Jun 28, 2024

@smalltimewizard I'm reviewing those numbers and they seem unusually big. Are we talking about milliseconds or nanoseconds?

In any case, I'm wondering whether this change is really worth it, because it adds an extra level of code complexity for newcomers and reduces readability for future maintenance, compared with the performance benefit it provides.

EDIT: I see those numbers are computed for 1000 spheres. What exactly is being measured? A for loop calling this function 1000 times? I think only the function body (or a single function call) should be measured to get more accurate numbers.

raysan5 changed the title from "Optimization to DrawSphereEx()" to "[rmodels] Optimization to DrawSphereEx()" Jun 28, 2024
raysan5 changed the title from "[rmodels] Optimization to DrawSphereEx()" to "[rmodels] DrawSphereEx() optimization" Jun 28, 2024
@smalltimewizard
Contributor Author

smalltimewizard commented Jun 29, 2024

New barebones benchmark:

// gcc main.c -o main.exe -I "./include" -L "./lib" -lraylib -lgdi32 -lwinmm

#include <stdio.h>
#include <time.h>

#include "raylib.h"
#include "raymath.h"

#define RUNS 1000

Camera3D camera = {
	.position = (Vector3){0,3,0},
	.target = (Vector3){0,0,100},
	.up = (Vector3){0,1,0},
	.fovy = 70.0f,
	.projection = CAMERA_PERSPECTIVE,
};

int main(int argc, char** argv) {
	InitWindow(1280, 720, "DrawSphere benchmark");
	
	while (!WindowShouldClose()) {
		BeginDrawing();
			ClearBackground(RAYWHITE);
			BeginMode3D(camera);
				double timerstart = (double) clock();
				for (int i = 0; i < RUNS; i++) {
					DrawSphereEx((Vector3){0,0,i}, 1.0f, 16, 16, RED);
				}
				printf("%d calls of DrawSphereEx() took %f seconds.\n", RUNS, ((double) clock() - timerstart)/CLOCKS_PER_SEC);
			EndMode3D();
			DrawFPS(10,10);
		EndDrawing();
	}
	
	CloseWindow();
	
	return 0;
}

On my machine (mingw-w64, compiled as Release, static library) for 1000 DrawSphereEx() calls:
The original algorithm completes in 0.074 seconds with 16 rings/slices and in 0.33 seconds with 32 rings/slices.
The heap allocation algorithm completes in 0.055 seconds with 16 rings/slices and in 0.21 seconds with 32 rings/slices.
The stack allocation algorithm completes in 0.063 seconds with 16 rings/slices and in 0.24 seconds with 32 rings/slices.
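
To put these totals in per-call terms, the sketch below divides a measured total by the number of runs and computes the speedup relative to a baseline. The helper names are hypothetical; the inputs are the timings listed above.

```c
#include <assert.h>

// Average time of one call in microseconds, given the total for `runs` calls
double PerCallMicroseconds(double totalSeconds, int runs)
{
    assert(runs > 0);
    return totalSeconds*1e6/runs;
}

// Speedup factor of an optimized timing relative to a baseline timing
double Speedup(double baselineSeconds, double optimizedSeconds)
{
    assert(optimizedSeconds > 0.0);
    return baselineSeconds/optimizedSeconds;
}
```

For the 16 rings/slices figures this gives about 74, 55 and 63 microseconds per call, or speedups of roughly 1.35x (heap) and 1.17x (stack). At 32 rings/slices the speedups are about 1.57x and 1.38x.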

@smalltimewizard
Contributor Author

smalltimewizard commented Jun 29, 2024

@raysan5 Rebuilt DrawSphereEx from scratch -- the new one uses stack allocation and is hopefully simple enough.
It gives 1000 spheres in 0.048 seconds and 1000 spheres of 32 rings/slices in 0.181 seconds.
(And just to flex, it makes only 4 cos/sin calls. :P)

2 participants