
Larger simulations are not deterministic #32

Open
syifan opened this issue Dec 18, 2023 · 11 comments
Labels: bug (Something isn't working)

Comments


syifan commented Dec 18, 2023

To Reproduce
MGPUSim version (commit ID): 40c4cd4

Command that recreates the problem

./fir -length=65536 -timing

Current behavior
The estimated execution time differs from run to run.

Expected behavior
The estimated execution time should be the same across runs.

syifan added the bug label on Dec 18, 2023
MaxKev1n commented Mar 6, 2024

I want to know if this bug has been fixed. I found that the results of the parallel engine also differ from those of the serial engine.


syifan commented Mar 6, 2024

The problem is still there.

If parallel simulation is used, the simulation will be non-deterministic for sure. The goal of making the simulation deterministic only applies to single-kernel serial simulation.

Also, for parallel simulations, how different are the results from the serial simulations?
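For intuition, here is a minimal, generic Go sketch (not the Akita parallel engine) of why concurrent execution is inherently order-dependent: the goroutine scheduler decides the completion order, so any state that depends on that order can vary between runs.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	// Four workers handle "events" concurrently; the order in which their
	// results are printed depends on the scheduler and may differ between runs.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Printf("worker %d handled its event\n", id)
		}(i)
	}
	wg.Wait()
}
```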

MaxKev1n commented Mar 6, 2024

Running fir with 4096 * 32 samples to filter, the parallel simulation can be about 3% slower than the serial simulation. In addition, I printed all the events and their scheduled times. In the parallel simulation, the first event of the mmu is scheduled at 0.0000000120, but in the serial simulation, it is scheduled at 0.0000000350.


MaxKev1n commented Mar 6, 2024


I wonder what the possible cause of this problem is: Go itself or MGPUSim. If I knew the likely reason, I might be able to try fixing this bug myself.


syifan commented Mar 6, 2024

Well, we cannot blame Go for this. There are definitely some language features that cause non-deterministic execution, and we should avoid those.

There is some good discussion on how to avoid non-deterministic behavior in Go in golang/go#33702. It also points to the potential sources of non-deterministic behavior.
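To make one of those sources concrete, here is a minimal, generic Go sketch (not MGPUSim code): map iteration order is deliberately randomized, so any simulation logic that depends on the order in which a map of components or ports is traversed will differ between runs.

```go
package main

import "fmt"

func main() {
	// Go randomizes map iteration order, so the order in which these
	// components are visited can change from run to run. Any logic that
	// depends on this order is therefore non-deterministic.
	components := map[string]int{"CommandProcessor": 0, "RDMA": 1, "DMA": 2, "L2": 3}
	for name := range components {
		fmt.Println(name)
	}
}
```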

One thing I am thinking about is to try to create super simple simulations. The root of the problem may be on the Akita side.

The difference between parallel and serial simulations is another problem; I have created #45 for it. For now, can you mainly use the serial simulation?

MaxKev1n commented Mar 6, 2024

I can use the serial simulation currently. Thanks for your reply.

MaxKev1n commented

Prof. Sun, I think I may have fixed the bug about larger simulations not being deterministic.

First, I recorded the scheduling and handling order of events and found that the access order of the device ports of an endpoint is not deterministic (akita/noc/networking/switching/endpoint.go: sendFlitOut(now)), which causes the Tick() functions of the different components connected to the endpoint to be executed in a random order.

[screenshots: event order in two runs]

As the figures show, the run on the left executes the RDMA first, but the run on the right executes the CommandProcessor first. This is because when the timing platform plugs the device into the endpoint (PlugInDevice(pcieSwitchID, gpu.Domain.Ports())), the result of gpu.Domain.Ports() is not deterministic.

So, I modified the code of Ports() and the bug appears to be fixed.
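For reference, the fix amounts to returning the ports in a stable order before they are plugged into the network. Below is a minimal, self-contained sketch of the idea; the port type and sortedPorts helper are stand-ins for illustration, not the actual MGPUSim/Akita code.

```go
package main

import (
	"fmt"
	"sort"
)

// port is a stand-in for the simulator's port type; only its name matters
// for ordering here.
type port struct{ name string }

// sortedPorts returns the ports in a stable, name-sorted order, so that the
// caller (e.g., the code that plugs the device into the endpoint) always
// sees the same sequence across runs.
func sortedPorts(m map[string]port) []port {
	out := make([]port, 0, len(m))
	for _, p := range m {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].name < out[j].name })
	return out
}

func main() {
	ports := map[string]port{
		"CommandProcessor": {name: "CommandProcessor"},
		"RDMA":             {name: "RDMA"},
		"DMA":              {name: "DMA"},
	}
	// Prints the ports in the same order on every run.
	for _, p := range sortedPorts(ports) {
		fmt.Println(p.name)
	}
}
```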

[screenshot: event order after the fix]


syifan commented Mar 12, 2024

@MaxKev1n Looks great! Can you start a pull request, and I can look deeper into it?


syifan commented Mar 12, 2024

BTW, there is a deterministic test script under test/deterministic. In the Python file, you can see that there is a line commented out. You can re-enable that line and see if the problem is solved. If it is not, we can at least see up to what problem size determinism holds.

MaxKev1n commented

Using your deterministic test script, I found that running fir with a single GPU does not reproduce the problem, so I ran fir with 4 GPUs and reproduced it successfully. My code eliminates the majority of the non-determinism, except for a set of metrics called CPIStack; I think CPIStack may have another problem. I also found that there can still be a small difference in the total time of fir, but I think this is acceptable.

[screenshot: test results with 4 GPUs]


syifan commented Mar 13, 2024

@MaxKev1n Thanks for the PR. I am merging it.

However, I do not think this problem is fully resolved, given the small difference. Being fully deterministic is mostly about debugging: when we find a bug, we want to rerun the program and have the bug take place at the exact same location. We will keep looking into the problem. I think we are close.
