
Larger simulations are not deterministic #32

Open
syifan opened this issue Dec 18, 2023 · 11 comments
Labels: bug (Something isn't working)

Comments


syifan commented Dec 18, 2023

To Reproduce
MGPUSim version (commit ID): 40c4cd4

Command that recreates the problem

./fir -length=65536 -timing

Current behavior
The estimated execution time differs from run to run.

Expected behavior
The estimated execution time should be the same across runs.

syifan added the bug label on Dec 18, 2023
MaxKev1n commented Mar 6, 2024

I want to know if this bug has been fixed. I found that the results of the parallel engine also differ from those of the serial engine.


syifan commented Mar 6, 2024

The problem is still there.

If parallel simulation is used, the simulation will be non-deterministic for sure. The goal of making the simulation deterministic only applies to single-kernel serial simulation.

Also, for parallel simulations, how different are the results from the serial simulations?
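For intuition, here is a minimal, generic Go sketch (not the Akita parallel engine) of why concurrent execution is inherently order-dependent: the goroutine scheduler decides the completion order, so any state that depends on that order can vary between runs.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	// Four workers handle "events" concurrently; the order in which their
	// results are printed depends on the scheduler and may differ between runs.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Printf("worker %d handled its event\n", id)
		}(i)
	}
	wg.Wait()
}
```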

MaxKev1n commented Mar 6, 2024

Running fir with 4096 * 32 samples to filter, the parallel simulation can be about 3% slower than the serial simulation. In addition, I printed all the events and their scheduled times. In the parallel simulation, the first event of the mmu is scheduled at 0.0000000120, but in the serial simulation, it is scheduled at 0.0000000350.


MaxKev1n commented Mar 6, 2024


I wonder what the possible cause of this problem is: Go itself or MGPUSim. If I knew the likely reason, I might be able to try fixing this bug myself.


syifan commented Mar 6, 2024

Well, we cannot blame Go for this. There are definitely some language features that cause non-deterministic execution, and we should avoid those.

There is some good discussion on how to avoid non-deterministic behavior in Go in golang/go#33702. It also points to the potential sources of non-deterministic behavior.
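To make one of those sources concrete, here is a minimal, generic Go sketch (not MGPUSim code): map iteration order is deliberately randomized, so any simulation logic that depends on the order in which a map of components or ports is traversed will differ between runs.

```go
package main

import "fmt"

func main() {
	// Go randomizes map iteration order, so the order in which these
	// components are visited can change from run to run. Any logic that
	// depends on this order is therefore non-deterministic.
	components := map[string]int{"CommandProcessor": 0, "RDMA": 1, "DMA": 2, "L2": 3}
	for name := range components {
		fmt.Println(name)
	}
}
```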

One thing I am thinking about is to try to create super simple simulations. The root of the problem may be on the Akita side.

The difference between parallel and serial simulations is another problem; I have created #45 for it. For now, can you mainly use the serial simulation?

MaxKev1n commented Mar 6, 2024

I can use the serial simulation currently. Thanks for your reply.

MaxKev1n commented

Prof. Sun, I think I may have fixed the bug about larger simulations not being deterministic.

First, I recorded the scheduling and handling order of events and found that the access order of the device ports of an endpoint is not deterministic (akita/noc/networking/switching/endpoint.go: sendFlitOut(now)), which causes the Tick() functions of the different components connected to the endpoint to be executed in a random order.

[screenshots: event order in two runs]

As the figures show, the run on the left executes the RDMA first, but the run on the right executes the CommandProcessor first. This is because when the timing platform plugs the device into the endpoint (PlugInDevice(pcieSwitchID, gpu.Domain.Ports())), the result of gpu.Domain.Ports() is not deterministic.

So, I modified the code of Ports() and the bug appears to be fixed.
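For reference, the fix amounts to returning the ports in a stable order before they are plugged into the network. Below is a minimal, self-contained sketch of the idea; the port type and sortedPorts helper are stand-ins for illustration, not the actual MGPUSim/Akita code.

```go
package main

import (
	"fmt"
	"sort"
)

// port is a stand-in for the simulator's port type; only its name matters
// for ordering here.
type port struct{ name string }

// sortedPorts returns the ports in a stable, name-sorted order, so that the
// caller (e.g., the code that plugs the device into the endpoint) always
// sees the same sequence across runs.
func sortedPorts(m map[string]port) []port {
	out := make([]port, 0, len(m))
	for _, p := range m {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].name < out[j].name })
	return out
}

func main() {
	ports := map[string]port{
		"CommandProcessor": {name: "CommandProcessor"},
		"RDMA":             {name: "RDMA"},
		"DMA":              {name: "DMA"},
	}
	// Prints the ports in the same order on every run.
	for _, p := range sortedPorts(ports) {
		fmt.Println(p.name)
	}
}
```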

[screenshot: event order after the fix]


syifan commented Mar 12, 2024

@MaxKev1n Looks great! Can you start a pull request, and I can look deeper into it?


syifan commented Mar 12, 2024

BTW, there is a deterministic test script under test/deterministic. In the Python file, you can see that there is a line commented out. You can re-enable that line and see if the problem is solved. If it is not, we can at least see up to what problem size determinism holds.

MaxKev1n commented

Using your deterministic test script, I found that running fir with a single GPU does not reproduce the problem, so I ran fir with 4 GPUs and reproduced it successfully. My code eliminates the majority of the non-determinism, except for a set of metrics called CPIStack; I think CPIStack may have another problem. I also found that there can still be a small difference in the total time of fir, but I think this is acceptable.

[screenshot: test results with 4 GPUs]


syifan commented Mar 13, 2024

@MaxKev1n Thanks for the PR. I am merging it.

However, I do not think this problem is fully resolved, given the small difference. Being fully deterministic is mostly about debugging: when we find a bug, we want to rerun the program and have the bug take place at the exact same location. We will keep looking into the problem. I think we are close.
