Multithreading and Optimization Discussion #6

Open · TheCherno opened this issue Dec 10, 2022 · 50 comments

@TheCherno (Owner)

Hey all! For those following along with the series on YouTube, I hope you've been enjoying it thus far! In the latest episode (Ep 11), we introduced multithreading into our code base by using std::for_each with the parallel execution policy. I mentioned that if the community had any other suggestions and wanted to do some testing to see whether we can multithread more efficiently, I'd open a GitHub issue for that discussion - and here we are!

I figured crowd-sourcing this would be a good idea since your mileage may vary - certain techniques might be better or worse on certain hardware and architectures. Feel free to fork this repository and implement something, and then drop it in a comment here so we can test. If your method is faster than our std::for_each method, make sure to include some profiling data and your hardware specifications.

Thanks all! ❤️

  • Cherno
@cvb941

cvb941 commented Dec 10, 2022

@TheCherno Hi, I don't know whether you already know, but you should definitely check out and play around with OpenMP (it would be a great topic for a future video too).

Although my personal experience with it is very limited, it is widely used in the industry and it makes it easy to parallelize (and also vectorize) code using just one or more #pragmas above for loops and other parts of code.

It's really surprising how magical and simple this API is, considering how verbose things in C++ usually are compared to other languages.

@scorpion-26

@TheCherno I checked out the newest commit and tried to run it on x64 Linux using clang as the compiler.
Aside from some difficulties compiling under Linux, std::for_each() by default only runs on 1 thread. Apparently, for multithreading to work with libstdc++, one needs to install Intel TBB and link against it with -ltbb (docs). With TBB installed and linked, it now runs fairly slowly (Ryzen 7 3800X, default camera angle, 1600x900 -> 40-50ms). CPU usage hovers around 90%, so not perfect.
I'll try to implement a manual Thread-Pool style approach to see if that is any better.
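For reference, a rough sketch of the call shape and a typical Linux build line is below (illustrative only; file names and flags depend on your setup, and this is not the repo's exact code):

// Sketch only: a parallel loop over the row indices, as in the series code.
// On Linux with libstdc++, the std::execution::par policy is backed by
// Intel TBB, so the build needs the library linked in, e.g.:
//   clang++ -std=c++17 -O2 Renderer.cpp ... -ltbb
#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

void RenderRows(const std::vector<uint32_t>& rows)
{
    std::for_each(std::execution::par, rows.begin(), rows.end(),
        [](uint32_t y)
        {
            // per-row work (the PerPixel loop over x) goes here
            (void)y;
        });
}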

@cvb941

cvb941 commented Dec 10, 2022

After enabling OpenMP in Project properties > C/C++ > Language, I have added the #pragma omp parallel for directive to the outer loop:

#pragma omp parallel for
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

This gets me from 65ms single-threaded to 35ms. However, the for_each implementation runs even faster, at 28ms.

The difference comes from the default static schedule, which splits the for loop into equal chunks for each CPU thread.
This makes the threads that render the sky finish faster than the ones rendering the ball.
Adding schedule(dynamic, 1) distributes the work row-by-row to the threads, making it behave similarly to for_each and
bringing the render time down to 28ms.

#pragma omp parallel for schedule(dynamic, 1)
for (int y = 0; y < m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

Code is in my fork at https://github.com/cvb941/RayTracing

@cvb941

cvb941 commented Dec 10, 2022

A speedup from 65ms to 28ms isn't much on an 8-core machine. The profiler shows a lot of time being spent in the Random::Vec3 function, where the random number generator probably synchronizes across threads, causing a bottleneck.

Just using a simple rand() in the Random::Float() function reduces the frame time from 28ms to 10ms. Theoretically, this can be reduced further with a faster random generator (returning a constant achieves 7ms).

static float Float()
{
    return static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
    // return (float)s_Distribution(s_RandomEngine) / (float)std::numeric_limits<uint32_t>::max();
}
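One caveat: rand() has implementation-defined quality and thread behaviour (and RAND_MAX may be as small as 32767), so a cheap per-thread generator is probably the more reliable way to get the same win. A minimal sketch, not the repo's code, using xorshift32 purely as an example:

#include <cstdint>

// Sketch of a per-thread replacement for Float(); xorshift32 is just an
// example of a cheap generator, not what the repo ships.
static float FastFloat()
{
    // One state per thread. A fixed seed is used here for brevity; a real
    // version should mix in the thread id so threads don't all produce the
    // same sequence.
    thread_local uint32_t s = 0x9E3779B9u;

    // Marsaglia xorshift32 step
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;

    // map the top 24 bits to [0, 1)
    return (s >> 8) * (1.0f / 16777216.0f);
}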

@leddoo

leddoo commented Dec 10, 2022

oof

i mean, do i need to say anything? :D

was pretty confused about the 8x increase in physical core count only getting a 2x speedup. (i'm getting around 3x on my i7-8750h -- 180ms -> 65ms -- which both seems high, but that's beside the point)

for this kind of simple parallelization of a compute heavy task, i'd expect a speedup of 90% * #physical-cores. (hyperthreading doesn't tend to help much, when you're compute bound. and there are always some losses, hence the 90%)

but yeah, get some thread local RNGs going :D

@HildingLinden

For fast non-cryptographically-secure PRNGs I usually go with the xoshiro family of generators, since they are robust enough for my applications and very fast. I've often implemented one myself, but a wrapper like Xoshiro-cpp seems quite practical.

@athieman

athieman commented Dec 10, 2022

Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.

Update: @leddoo already covered it.
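For anyone wanting to try the same change, it is roughly the following (a sketch based on the member names above; the exact types in Walnut may differ):

// Random.h (sketch) - the static members gain thread_local
static thread_local std::mt19937 s_RandomEngine;
static thread_local std::uniform_int_distribution<std::mt19937::result_type> s_Distribution;

// Random.cpp (sketch, inside namespace Walnut) - the definitions must repeat it
thread_local std::mt19937 Random::s_RandomEngine;
thread_local std::uniform_int_distribution<std::mt19937::result_type> Random::s_Distribution;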

@xLoge

xLoge commented Dec 11, 2022

With this code I get around 6-7 ms on my i5 12400F. I also use thread_local and xoshiro128** for the random class.

std::vector<std::future<void>> threads;
threads.reserve(m_ImageVerticalIter.size());

for (const uint32_t& y : m_ImageVerticalIter)
{
    auto async_thread = [this, y]() -> void
    {
        for (const uint32_t& x : m_ImageHorizontalIter)
        {
            glm::vec4 pixel = PerPixel(x, y);
            m_AccumulationData[x + y * m_FinalImage->GetWidth()] += pixel;

            glm::vec4 accumulatedColor = (m_AccumulationData[x + y * m_FinalImage->GetWidth()]) / (float)m_FrameIndex;

            accumulatedColor = glm::clamp(accumulatedColor, glm::vec4(0.0f), glm::vec4(1.0f));

            m_ImageData[x + y * m_FinalImage->GetWidth()] = Utils::ConvertToRGBA(accumulatedColor);
        }
    };

    threads.emplace_back(std::async(std::launch::async, async_thread));
}

#define MT 0
#if MT
std::for_each(std::execution::par, threads.begin(), threads.end(), [](const std::future<void>& thread) {
    thread.wait();
});
#else
for (const std::future<void>& thread : threads)
    thread.wait();
#endif

The xoshiro128** is around 1ms faster than std::mt19937, and running the threads with std::async is around 0.5ms faster than std::for_each.
https://github.com/xLoge/RayTracing

@LilyIsTrans

LilyIsTrans commented Dec 16, 2022

> Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.

I tried this and it worked quite well. My render times were consistently around 5-6ms, and CPU usage was well below 100%, closer to 10%. It seems that once RNG is thread local, the bottleneck is no longer the CPU, though I don't know how to verify that or figure out what the new bottleneck is.

@LeoMash

LeoMash commented Dec 24, 2022

@Rilazy The new "bottleneck" is not a bottleneck - it's vsync, which is enabled by default in the ImGui renderer. After your CPU finishes the calculation for a new frame image, it waits for the GPU vsync event. To test that, you can increase the viewport size (on my CPU the current scene renders slower than 60fps in a 4K viewport), or disable vsync in ImGui by uncommenting the IMGUI_UNLIMITED_FRAME_RATE define in Application.cpp.
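For reference, the switch being described is just this define (sketch; the surrounding code in Application.cpp is omitted):

// In Application.cpp (Walnut): uncommenting this define makes the swapchain
// present without waiting for vsync, so the frame time reflects the renderer
// rather than the monitor's refresh rate.
#define IMGUI_UNLIMITED_FRAME_RATE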

@LilyIsTrans

Interesting, thanks!

@soerenfi

So what are you guys using for profiling? I set up Tracy, but it's not really usable due to the large number of threads being created and destroyed with the current architecture.
I am going to look into creating a thread pool with the number of threads matching the available cores. Does that make sense?

@LilyIsTrans

I've just been using Visual Studio's built-in profiler when I need proper profiling, but mostly I'm just looking at the frame times and seeing what they hover around.

@soerenfi

Alright, it's done here. I added a thread pool and a queue for the rendering jobs. Every line of the image is submitted to the queue individually. Here is how it looks in Tracy:
[Tracy capture: image_2022-12-29_232710033]

Performance is worse than the std::for_each method, though. I was expecting some improvement from removing the overhead of thread creation and destruction.
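For anyone who wants to experiment with the same idea without digging through the fork, a minimal sketch of that kind of worker/queue setup is below (my own illustration, not the code linked above; a real renderer also needs a per-frame barrier to wait until all submitted rows are done, which is omitted here):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal job queue: workers are created once and reused every frame,
// which is the overhead the ad-hoc thread / std::async approaches pay per frame.
class RowPool
{
public:
    explicit RowPool(unsigned threadCount = std::thread::hardware_concurrency())
    {
        for (unsigned i = 0; i < threadCount; i++)
            m_Workers.emplace_back([this] { WorkerLoop(); });
    }

    ~RowPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Stop = true;
        }
        m_Cv.notify_all();
        for (auto& w : m_Workers)
            w.join();
    }

    void Submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            m_Jobs.push(std::move(job));
        }
        m_Cv.notify_one();
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_Mutex);
                m_Cv.wait(lock, [this] { return m_Stop || !m_Jobs.empty(); });
                if (m_Stop && m_Jobs.empty())
                    return;                     // drain remaining jobs before shutdown
                job = std::move(m_Jobs.front());
                m_Jobs.pop();
            }
            job();                              // run the row outside the lock
        }
    }

    std::vector<std::thread> m_Workers;
    std::queue<std::function<void()>> m_Jobs;
    std::mutex m_Mutex;
    std::condition_variable m_Cv;
    bool m_Stop = false;
};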

@LilyIsTrans

LilyIsTrans commented Dec 29, 2022

Have you implemented the optimization making random number generation thread-local? If not, it's possible your thread pool just isn't handling synchronization as well as the std::for_each method.

@LilyIsTrans

@soerenfi Disregard my previous comment for now; it looks like it's the same thing I ran into with vsync, since your framerate looks to be locked at pretty much exactly 60fps. What's happening is that ImGui is set up by default to wait for the next monitor refresh before it starts drawing the new image. LeoMash explained it here, along with ways to get around it, when I ran into the same thing.

@soerenfi

@Rilazy yeah, I saw that. For now I just compared CPU load between the approaches and varied the number of threads in the pool.

The profiler shows that the workers are idling some of the time, so that's probably the difference. I didn't expect to write a better scheduler than the OS's in one evening.

No point in optimising this further since the ray tracing algorithm itself is so inefficient right now. We could do the PerPixel() intersection calculations on the GPU using CUDA, though, just as an exercise.

@LilyIsTrans

LilyIsTrans commented Dec 30, 2022

Another option for moving work to the GPU would be the Vulkan ray tracing pipeline itself. I'm not super familiar with GPU programming, but reading up on the Vulkan API, it looks like we could still use implicit functions for our spheres by passing AABBs that enclose them and then doing the more expensive intersection testing in the intersection shader. From there, I think all of our existing code could be ported pretty much directly to shader code.

I'd have tested it myself already if I were familiar enough with graphics programming to have the first clue how to set up Vulkan to draw to the framebuffer of our viewport; I've only ever done the "draw a triangle" Vulkan tutorial. I've been reading the docs, but there are a lot of docs.

Of course, all of that depends on a graphics driver that supports the Vulkan ray tracing extensions.

@soerenfi

I believe that is what @TheCherno is aiming to do, and if so, I am looking forward to it very much. Since Vulkan takes care of building BVHs and (HW-)accelerating ray-triangle intersections, there is not much point in spending time to implement them on the CPU (still worth understanding them though).

Meanwhile you can check out the NVIDIA guide to VK ray tracing (it uses NVIDIA hardware-specific support libraries) and this amazing implementation of Peter Shirley's Ray Tracing in One Weekend using Vulkan ray tracing.

@LilyIsTrans

Neat! I'll have to check that out.

@soerenfi

I'll also add this simpler path tracing example that I'm about to dive into.

@mmerabet42

I stumbled onto your video, and I think there is a better way to use iterators with std::for_each.
Basically, you can write your own iterator with the appropriate operators (following the LegacyForwardIterator requirements) that simply increments an integer. This way you're not initializing a whole vector of N elements going from X to Y. Instead you have two iterators, one representing the moving index and one representing the end, and thanks to the iterator's != operator, std::for_each knows when to stop.

// Counting "iterator": incrementing it just increments the stored integer,
// so no vector of indices has to be allocated.
template <typename T>
class ForEachIterator
{
public:
  ForEachIterator(const T &p_value) : _value(p_value) {}

  ForEachIterator<T> &operator++()
  {
    ++this->_value;
    return *this;
  }

  T &operator*()
  {
    return this->_value;
  }

  bool operator!=(const ForEachIterator<T> &p_it) const
  {
    return this->_value != p_it._value;
  }

private:
  T _value;
};

And you can use it this way:

std::for_each(ForEachIterator<uint32_t>(0), ForEachIterator<uint32_t>(width), [](const uint32_t &i) {
  ...
});

@WhoseTheNerd

> With this code I get around 6-7 ms on my i5 12400F. I also use thread_local and xoshiro128** for the random class. […] The xoshiro128** is around 1ms faster than std::mt19937, and running the threads with std::async is around 0.5ms faster than std::for_each. https://github.com/xLoge/RayTracing

SHISHUA is faster than xoshiro256+x8.

@Paulo-D2000

Has anyone tried Intel oneAPI TBB for the thread parallelization?

@denizdiktas

denizdiktas commented Jan 31, 2023

Initializing a vector of random unit normals gave me another 30-40% performance improvement on top of making the random number generator thread_local, as @TheCherno described in his corresponding YouTube episode.

@denizdiktas

OK, here is what I have done since then: I added a timer that measures just the code that renders the image (basically a timer surrounding only the code controlled by #define MT), taking the average render time (accumulate the timer and divide by the total number of frames so far). By caching the normals and measuring the performance this way, my render time dropped from about 30ms to 24ms on average. There are about 1 million cached random directions, so the image quality more or less remains the same (at least visually).
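Roughly what that caching might look like (a sketch with illustrative names and table size, not my fork's exact code):

#include <cmath>
#include <cstdint>
#include <random>
#include <vector>
#include <glm/glm.hpp>

// Filled once at startup: a big table of precomputed unit directions, so the
// per-bounce cost is just an index lookup instead of fresh RNG + normalize.
static std::vector<glm::vec3> s_UnitDirections;

static void InitUnitDirections(size_t count = 1'000'000)
{
    std::mt19937 rng{ std::random_device{}() };
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    s_UnitDirections.reserve(count);
    while (s_UnitDirections.size() < count)
    {
        glm::vec3 v(dist(rng), dist(rng), dist(rng));
        float len2 = glm::dot(v, v);
        if (len2 > 0.0001f && len2 <= 1.0f)        // rejection-sample inside the unit ball
            s_UnitDirections.push_back(v / std::sqrt(len2));
    }
}

static const glm::vec3& NextUnitDirection()
{
    // Per-thread cursor so threads don't contend; seeding the start index
    // differently per thread would avoid correlated samples between threads.
    thread_local uint32_t index = 0;
    index = (index + 1u) % (uint32_t)s_UnitDirections.size();
    return s_UnitDirections[index];
}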

@denizdiktas

denizdiktas commented Feb 6, 2023

Regarding different threading and parallelization methods: I think we should first measure the performance of the scheduled rendering code inside the thread, and the overall threading method together with the rendering code. The difference then tells us how much room there is for improvement. To do this I left the code as it is and measured the time it takes for rendering only (basically, the code inside the lambda function), associating the measured time with its corresponding thread. A natural way to do this would be an associative container (std::map or std::unordered_map), but I didn't want to add extra overhead by locking that map as well, so instead I used a preallocated array of float values to keep track of the total rendering time in each thread. This eliminates the overhead of locking and unlocking a map, so it has minimal impact on the measurements.

Now why am I talking about threads here? @TheCherno uses for_each(std::execution::par, ...) anyway, right? The std implementation must be using some kind of thread pool in the background. I tried this in a separate project: I submitted a bunch of trivial jobs and looked at the unique thread IDs. It turns out that the number of actual worker threads is less than the number of submitted jobs, and if some jobs take more time than others, the number of threads tends to increase.
Applying this method to the ray tracing code, I found that there are about 8-9 threads (I have 8 cores) and there is very little difference between the time it takes just for rendering and the overall time for rendering + scheduling. Of course the average time in each thread varies, but in my case the variation was about 1ms out of 22ms. The difference between the per-thread rendering times and the overall rendering time was also only about 1-2ms (the overall time was about 23-24ms). So I am not sure this is a significant overhead when using for_each(std::execution::par).

To be honest, I first implemented my own thread pool and quickly realized there are lots of improvements that could be made to it. Then I wondered whether it was worth it for this particular rendering problem (it is always a good idea to think about potential improvements regardless of the use case, of course; I am just talking in the context of this problem). That's how I ended up with this method for finding out how much room there is for improvement. I would love to hear your thoughts, and let me know if something is not clear.
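In case it helps anyone reproduce the measurement, a hedged sketch of that per-thread bookkeeping is below (the slot assignment via an atomic counter is my own illustration here, not necessarily what my code does verbatim):

#include <atomic>
#include <chrono>
#include <cstddef>

// One accumulator slot per worker thread, claimed lazily on first use,
// so there is no map and no lock on the hot path. Padding each slot to a
// cache line would avoid false sharing; omitted for brevity.
constexpr size_t kMaxThreads = 64;
static double s_ThreadRenderTimeMs[kMaxThreads] = {};
static std::atomic<size_t> s_NextSlot{ 0 };

static size_t ThreadSlot()
{
    thread_local size_t slot = s_NextSlot.fetch_add(1) % kMaxThreads;
    return slot;
}

// Usage inside the per-row/per-tile lambda:
//   auto t0 = std::chrono::steady_clock::now();
//   ... render ...
//   s_ThreadRenderTimeMs[ThreadSlot()] +=
//       std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();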

@denizdiktas

@mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?

@LilyIsTrans

> @mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?

I'm not sure, but I think it probably has to do with it being a forward iterator as opposed to a random-access iterator. Because the iterator doesn't provide random access, it must be iterated sequentially, since knowing the index of each pixel/column technically requires knowing the index of the previous one, as computed by a mutable, non-shareable object.
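If anyone wants to experiment, here is a sketch of a random-access flavoured counting iterator (untested against the repo). The iterator_traits typedefs are the part the parallel overloads generally need to resolve, so they are worth adding either way:

#include <cstdint>
#include <iterator>

// Counting iterator that also provides the iterator_traits typedefs the
// parallel algorithm overloads look up. Sketch only: enough operators for
// for_each, not a fully conforming random-access iterator.
class IndexIterator
{
public:
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = uint32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const uint32_t*;
    using reference         = const uint32_t&;

    explicit IndexIterator(uint32_t value = 0) : m_Value(value) {}

    reference operator*() const { return m_Value; }

    IndexIterator& operator++()                        { ++m_Value; return *this; }
    IndexIterator  operator++(int)                     { IndexIterator t = *this; ++m_Value; return t; }
    IndexIterator& operator+=(difference_type n)       { m_Value += (uint32_t)n; return *this; }
    IndexIterator  operator+(difference_type n) const  { return IndexIterator(m_Value + (uint32_t)n); }
    difference_type operator-(const IndexIterator& o) const { return (difference_type)m_Value - (difference_type)o.m_Value; }

    bool operator==(const IndexIterator& o) const { return m_Value == o.m_Value; }
    bool operator!=(const IndexIterator& o) const { return m_Value != o.m_Value; }
    bool operator<(const IndexIterator& o)  const { return m_Value < o.m_Value; }

private:
    uint32_t m_Value;
};

// usage (sketch):
//   std::for_each(std::execution::par, IndexIterator(0), IndexIterator(height),
//                 [](uint32_t y) { /* per-row work */ });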

@denizdiktas

Hi @Rilazy,
OK thanks, I'll look into it. I was trying to do a tiled-rendering version and didn't want to change the usage of for_each(execution::par), so I had to implement my own tile iterators; otherwise I'd have to define a long list of tiles and for_each() would need to touch each memory location, which seems inefficient. It is also a good exercise to implement such an iterator compatible with for_each, just for its own sake 🙂

@denizdiktas

denizdiktas commented Feb 15, 2023

Another idea: why not sample the ray just over the hemisphere in tangent space? This has two benefits. First, we only need two random samples (in spherical coordinates, two angles are sufficient to determine a unit direction vector). Second, since we are sampling over the hemisphere looking outwards, all of the rays go out into the scene, while in @TheCherno's implementation directions are sampled over the whole unit sphere, so on average half of the bouncing rays will hit the surface from which they emanate and won't contribute to the final result. To measure the effectiveness of this, we should define a performance metric that measures how much time it takes for the final image to settle to its 'equilibrium' state. I put equilibrium in quotation marks because there is no true equilibrium reached in a finite amount of time, but a visually pleasing one should be sufficient. What do you think?
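For concreteness, one standard way to do that sampling (uniform over the hemisphere around the surface normal; this is the generic textbook construction, not code from the repo):

#include <algorithm>
#include <cmath>
#include <glm/glm.hpp>

// Uniform hemisphere sample around normal n, from two uniform random numbers
// u1, u2 in [0, 1): sample in tangent space, then rotate into world space.
static glm::vec3 SampleHemisphere(const glm::vec3& n, float u1, float u2)
{
    // local-space direction (z axis = normal direction)
    float z   = u1;                                        // cos(theta), uniform in [0, 1)
    float r   = std::sqrt(std::max(0.0f, 1.0f - z * z));
    float phi = 6.28318530718f * u2;                       // 2 * pi * u2
    glm::vec3 local(r * std::cos(phi), r * std::sin(phi), z);

    // orthonormal basis around n (any stable construction works)
    glm::vec3 up        = std::abs(n.z) < 0.999f ? glm::vec3(0, 0, 1) : glm::vec3(1, 0, 0);
    glm::vec3 tangent   = glm::normalize(glm::cross(up, n));
    glm::vec3 bitangent = glm::cross(n, tangent);

    return local.x * tangent + local.y * bitangent + local.z * n;
}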

@BenBE

BenBE commented Feb 15, 2023

You could define this equilibrium state as the point where the overall (absolute) change of pixel values ((old - new)^2) is pushed below a certain threshold. The metric would then be two-fold:

  1. How many iterations does it take for the threshold to be reached
  2. How much (absolute) time does it take to render that many frames

The first metric favors implementations that converge to their eigenstate quickly, while the second favors implementations that avoid overhead and thus offer high render performance. Incidentally, implementations good at 1 also have an advantage in 2, as fewer iterations have to be computed.

A third approach would divide the time taken to reach equilibrium by the number of frames needed, resulting in a time per frame. While high single-frame performance might be of interest, this is by far the least interesting aspect, as bad-looking results at high performance are like watching static being generated at random in response to a request for the Mona Lisa.
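As a concrete sketch of that convergence check (illustrative only; the threshold is left as a free parameter):

#include <cstddef>
#include <vector>
#include <glm/glm.hpp>

// Mean squared change between the previous and current accumulated frames
// (assumes both buffers have the same size). "Converged" once this drops
// below some user-chosen epsilon.
static float MeanSquaredChange(const std::vector<glm::vec4>& previous,
                               const std::vector<glm::vec4>& current)
{
    double sum = 0.0;
    for (size_t i = 0; i < current.size(); i++)
    {
        glm::vec4 d = current[i] - previous[i];
        sum += glm::dot(d, d);                   // (old - new)^2, summed over the channels
    }
    return (float)(sum / (double)current.size());
}

// Per frame (sketch): record the frame count and the wall time at the moment
//   MeanSquaredChange(lastFrame, thisFrame) < epsilon
// to get both of the metrics above.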

That said, a quick note on optimizations: one should also take into account the specifics of the hardware things run on. When working with large blocks of memory it can be good practice to give the CPU some information for prefetching, as well as, e.g., timing memory accesses to avoid DRAM refreshes, or working from cache while memory write-back presents a bottleneck. Have a look at CPU and GPU hardware effects for other types of things that might be relevant for structuring your workloads. HTH.

@denizdiktas

Hi @BenBE, thank you. I had similar things in mind. I will post here once I have initial implementations. Also, it looks like most of the discussion happens in the Discord channel, right?

@BenBE

BenBE commented Feb 20, 2023

I'm not in the Discord channel, so I can't say much about the discussion there. If there are any important new findings, I'd appreciate hearing about them here.

@denizdiktas

Btw, you can view my fork at the link below (let me know if there are any problems):
https://github.com/denizdiktas/RayTracing/

@denizdiktas

Hi everyone. I have modified the code to schedule the tasks at different granularity levels. At the beginning of the Renderer.cpp file there are some preprocessor switches that begin with MT_TASK_GRANULARITY_.
Comment in the one you want to activate and comment out the others. On my system,
MT_TASK_GRANULARITY_ROW seems to be the best performer.

Regarding the timers: the overall average rendering time and the time it takes to execute the code inside each worker thread are measured individually and printed to the console at 1-second intervals. This gives an idea of the following:

  • How much variation there is among the code run in worker-threads
  • The overhead for scheduling the tasks

I would be glad if you could try it and share your results and thoughts.
You can checkout the code at the following link:
https://github.com/denizdiktas/RayTracing/

@denizdiktas

Hi again. I made a further improvement by eliminating whole bundles (beams) of rays by testing them against each sphere in the scene. There is a new preprocessor switch called USE_TILE_BEAM_INTERSECTION_TEST; comment it in if you want to use it. However, it can only be used with tiled rendering for now. Note that I have deliberately used red to show the parts of the screen where the intersection test fails and just the sky color should be drawn. I am aware of it, so don't kill me for it please 😅

@mgradwohl

Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster?

@nickphilomath

I learned a lot from this series. I want to thank TheCherno, you're the best :)
I also learned to render triangles with this program.
Look how beautiful they are:

[screenshots: photo_2023-03-15_15-55-32, photo_2023-03-15_15-55-37]

@denizdiktas

> Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster?

You mean in Cherno's code?

@mgradwohl

> Alright, it's done here.
>
> Performance is worse than the std::for_each method though. I was expecting some improvement by removing the overhead of thread creation and destruction.

I tried the same thing expecting a speedup for the same reason and didn't get it either.

@denizdiktas

@mgradwohl I added code to measure the time for each separate thread as well as the total time, to find out how much overhead there is (although I have to admit there are some issues with it). I also implemented a thread pool (as an exercise in itself), but again there was not much improvement. It looks like after a certain point, further performance gains come mostly from algorithmic improvements.

@mgradwohl

> @mgradwohl I added code to measure the time for each separate thread as well as the total time, to find out how much overhead there is (although I have to admit there are some issues with it). I also implemented a thread pool (as an exercise in itself), but again there was not much improvement. It looks like after a certain point, further performance gains come mostly from algorithmic improvements.

Yes, I found that caching values, adding const, moving things out of loops when they didn't need to be there, changing compiler settings, etc. helped. I think one big one was GLM_FORCE_INTRINSICS.

@denizdiktas

@mgradwohl Cool. I don't think I tried compiler settings or the GLM intrinsics. Which settings did you use? Could you share those, if that's okay with you?

@mgradwohl

For release/distro I am using the following preprocessor defines:
GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED; GLM_FORCE_DEPTH_ZERO_TO_ONE

Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED and FORCE_DEPTH_ZERO_TO_ONE are for.

For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure there's any benefit).

I also enabled full link-time code generation, string pooling, /O2, /Ot, /GT, /GL, and I'm going to try /Ob2.

You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj

@mgradwohl

Just an update. I tried two other random number generators, Xoshiro256++ and std::ranlux48, and Xoshiro wins. I am at 10 bounces, 60fps, and my CPU is not pegged at all. All three are in my repo and easy to try out (look for the // PERF TEST comments). The PRNG in the original code really hits the CPU.

@LilyIsTrans

> For release/distro I am using the following preprocessor defines: GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED; GLM_FORCE_DEPTH_ZERO_TO_ONE
>
> Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED and FORCE_DEPTH_ZERO_TO_ONE are for.
>
> For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure there's any benefit).
>
> I also enabled full link-time code generation, string pooling, /O2, /Ot, /GT, /GL, and I'm going to try /Ob2.
>
> You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj

Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster?

@mgradwohl

> For release/distro I am using the following preprocessor defines: GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED; GLM_FORCE_DEPTH_ZERO_TO_ONE […]
>
> Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster?

I'll test again, my PC at work is only AVX2.

@mgradwohl

I tried AVX2 vs. AVX512, no difference. However, this may mean:

  1. The compiler does not do a good job of taking advantage of AVX512
  2. GLM does not do a good job of taking advantage of AVX512
  3. The compiler and GLM do a good job, but the code is not suited to it

@duckdoom4

> Has anyone tried Intel oneAPI TBB for the thread parallelization?

I'm also curious about this
