
Titan X GPU (Maxwell, CC 5.2) locks up during sort_by_key with zip_iterator and custom comparator #742

Closed
Oblomov opened this Issue Feb 4, 2016 · 10 comments


Oblomov commented Feb 4, 2016

Hello, we've been experiencing consistently reproducible hardware lock-ups (Xid 8) when running thrust::sort_by_key with a custom comparator function on Titan X GPUs (Maxwell architecture, compute capability 5.2). The problem only manifests when sorting arrays with millions of elements, multiple keys, and a custom comparator function. All of our Titan X GPUs appear to be affected, and in no case is there any sign of thermal issues, which (given the Xid value of 8) indicates that this is either a problem in the driver or a problem in the (Thrust) code. Since I'm not sure where the bug resides, I'm reporting it on “both sides”.

Sample code to reproduce the issue is non-trivial, so I've created the GitHub project titanxstall to host a sample program. The program just repeatedly sorts the arrays (and scrambles them again) forever, or until the GPU locks up. With the default settings the device usually locks up after fewer than a hundred thousand iterations, but sometimes after as few as 2K iterations. Other architectures (Fermi, Kepler) do not seem to be affected. In longer runs, the problem also manifests with as few as 1024×1024 elements.
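For reference, and without pulling in the whole project, the failing pattern boils down to something like the sketch below. All names (hash, type, id, particle, the scrambling functor) are placeholders standing in for the titanxstall code, not the code itself:

```cpp
// Minimal sketch (hypothetical names, not the actual titanxstall code) of the
// pattern that triggers the lock-up: particle indices sorted by a composite
// key held in three device arrays, through a zip_iterator and a custom
// comparator, repeated forever with the keys re-scrambled every iteration.
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <cstdio>

typedef thrust::tuple<unsigned int, unsigned int, unsigned int> sort_key;

// Lexicographic order on (hash, type, id); the id component is unique.
struct key_less
{
    __host__ __device__
    bool operator()(const sort_key &a, const sort_key &b) const
    {
        if (thrust::get<0>(a) != thrust::get<0>(b)) return thrust::get<0>(a) < thrust::get<0>(b);
        if (thrust::get<1>(a) != thrust::get<1>(b)) return thrust::get<1>(a) < thrust::get<1>(b);
        return thrust::get<2>(a) < thrust::get<2>(b);
    }
};

// Cheap integer mixer used to re-scramble the hash key every iteration.
struct scramble
{
    unsigned int seed;
    scramble(unsigned int s) : seed(s) {}
    __host__ __device__
    unsigned int operator()(unsigned int i) const
    {
        unsigned int x = i ^ seed;
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        return x ^ (x >> 16);
    }
};

int main()
{
    const size_t N = 4 * 1024 * 1024;                // a few million elements
    thrust::device_vector<unsigned int> hash(N), type(N), id(N), particle(N);
    thrust::sequence(id.begin(), id.end());          // unique ids
    thrust::sequence(particle.begin(), particle.end());

    for (unsigned int iter = 0; ; ++iter)            // run until the GPU locks up
    {
        // scramble the hash key so every iteration re-sorts "fresh" data
        thrust::transform(id.begin(), id.end(), hash.begin(), scramble(iter));

        thrust::sort_by_key(
            thrust::make_zip_iterator(thrust::make_tuple(hash.begin(), type.begin(), id.begin())),
            thrust::make_zip_iterator(thrust::make_tuple(hash.end(),   type.end(),   id.end())),
            particle.begin(),
            key_less());

        if (iter % 1000 == 0) std::printf("iteration %u\n", iter);
    }
}
```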

The test program can optionally be run with a custom caching allocator (based on the one in the Thrust examples), to verify that the problem manifests even without the continuous allocation/deallocation performed by thrust::sort_by_key.
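The caching-allocator option follows the pattern of Thrust's custom_temporary_allocation example (presumably the example in question): temporary storage is routed through an allocator attached to the execution policy. A rough sketch of that mechanism, assuming the usual cached_allocator shape from the example rather than the exact allocator used in the test program, looks like this:

```cpp
// Sketch of a caching allocator attached to the execution policy, in the
// spirit of Thrust's custom_temporary_allocation example. Illustrative only;
// the allocator used in the test program may differ in detail.
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <map>
#include <cstddef>

struct cached_allocator
{
    typedef char value_type;

    // blocks currently held in the cache, keyed by their size
    std::multimap<std::ptrdiff_t, char *> free_blocks;

    char *allocate(std::ptrdiff_t num_bytes)
    {
        std::multimap<std::ptrdiff_t, char *>::iterator it = free_blocks.find(num_bytes);
        if (it != free_blocks.end())
        {
            char *ptr = it->second;       // reuse a previously freed block
            free_blocks.erase(it);
            return ptr;
        }
        return thrust::device_malloc<char>(num_bytes).get();  // otherwise allocate a new one
    }

    void deallocate(char *ptr, size_t num_bytes)
    {
        // keep the block around for reuse instead of freeing it
        free_blocks.insert(std::make_pair((std::ptrdiff_t)num_bytes, ptr));
    }

    ~cached_allocator()
    {
        for (std::multimap<std::ptrdiff_t, char *>::iterator it = free_blocks.begin();
             it != free_blocks.end(); ++it)
            thrust::device_free(thrust::device_pointer_cast(it->second));
    }
};

int main()
{
    thrust::device_vector<int> keys(1 << 20), vals(1 << 20);
    cached_allocator alloc;

    // route sort_by_key's temporary allocations through the cache
    thrust::sort_by_key(thrust::cuda::par(alloc), keys.begin(), keys.end(), vals.begin());
    return 0;
}
```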


jaredhoberock commented Feb 4, 2016

Thanks for reporting this bug. I can reproduce it on my system with a Titan X.

After playing with your reproducer, I found that the hang does not appear to depend on the input -- simply sorting the same input over and over will eventually lead to the hang. I suspected that your custom comparison function might not define a strict weak ordering, as required by sort_by_key(); comparison functions which do not fulfill this requirement cause undefined behavior. However, I checked it with a smaller data set, and it seems to be OK. The hang also doesn't seem to depend on the value of -arch= given to nvcc: compiling with the default also produces the hang.
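For illustration (this is not the reproducer's comparator), the classic way to violate the requirement is comparing with <= instead of <, which breaks irreflexivity:

```cpp
// Illustrative only: NOT a strict weak ordering, because comp(x, x) is true
// (irreflexivity is violated). Handing something like this to sort_by_key()
// results in undefined behavior.
struct bad_compare
{
    __host__ __device__
    bool operator()(int a, int b) const { return a <= b; } // should be '<'
};
```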

Anyway, it seems like one of the kernels involved sometimes takes longer than expected and is killed by the driver in a way that brings down the entire driver. Since this only happens occasionally, and the program's behavior is otherwise correct, it leads me to believe that the bug lives somewhere outside of Thrust.

I produced a more minimal reproducer program and submitted a pull request. The program has a GPL license, so simply posting here in this issue didn't seem possible.


egaburov commented Feb 5, 2016

It also freezes on a GTX 980 Ti.


Oblomov commented Feb 5, 2016

Thanks for looking into this, @jaredhoberock. The custom comparator should be just a lexicographical order on three non-negative integral keys (hash, PART_TYPE, id), the last of which is unique, so it should satisfy the strict weak ordering criteria; still, I will try to verify it on larger datasets with the code in your link. One thing I've noticed is that the hang only seems to happen with at least 1Mi elements (1024×1024); I haven't been able to reproduce it with fewer (though maybe the runtime would have to be much longer in that case?).

Thanks also for going the extra mile of further minimizing the test case. I actually introduced the options “a posteriori” to simplify testing different configurations, but it does make sense to have something simpler to just try. I'll look into merging it upstream.


jaredhoberock commented Feb 10, 2016

Thanks @Oblomov. We've filed an issue with Nvidia's internal issue tracker.


Oblomov commented Feb 11, 2016

Thanks for that, @jaredhoberock. For what it's worth, I've tried running the strict weak ordering check with my custom comparator on 2^20 elements, but it takes a very long time (I'm guessing the check is O(N^2), if not worse?), so I had to terminate it early. I'll try again as soon as I can get a machine to run it to completion, just to be on the safe side about that aspect.


jaredhoberock commented Feb 11, 2016

The strict weak ordering code I linked to performs an exhaustive check over the input. I believe the asymptotic complexity is O(N^3), so it is only practical for modestly sized inputs. As I mentioned, I've already performed the check with a smaller input and your comparison function, and I don't believe your code is invalid.
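For reference, the idea behind such an exhaustive check can be sketched as below; this is just an illustration of why the cost is cubic, not the exact code I linked to:

```cpp
// Brute-force check that 'comp' is a strict weak ordering on a sample:
// asymmetry (which implies irreflexivity) costs O(N^2); transitivity of both
// the ordering and the induced equivalence costs O(N^3).
#include <vector>
#include <functional>
#include <cstdio>

template <typename T, typename Compare>
bool is_strict_weak_ordering(const std::vector<T> &v, Compare comp)
{
    const size_t n = v.size();

    // asymmetry: comp(a, b) and comp(b, a) must never both hold
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            if (comp(v[i], v[j]) && comp(v[j], v[i]))
                return false;

    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
            {
                // transitivity of the ordering itself
                if (comp(v[i], v[j]) && comp(v[j], v[k]) && !comp(v[i], v[k]))
                    return false;

                // transitivity of equivalence (neither element orders before the other)
                const bool eq_ij = !comp(v[i], v[j]) && !comp(v[j], v[i]);
                const bool eq_jk = !comp(v[j], v[k]) && !comp(v[k], v[j]);
                const bool eq_ik = !comp(v[i], v[k]) && !comp(v[k], v[i]);
                if (eq_ij && eq_jk && !eq_ik)
                    return false;
            }

    return true;
}

int main()
{
    std::vector<int> sample(100);
    for (size_t i = 0; i < sample.size(); ++i)
        sample[i] = (int)((i * 7919) % 31);   // duplicates on purpose
    std::printf("%s\n", is_strict_weak_ordering(sample, std::less<int>()) ? "ok" : "violated");
    return 0;
}
```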


Oblomov commented Mar 17, 2016

Hello, any news on this, or possible workarounds? On the NVIDIA DevTalk forums, someone commented that this might be an issue specific to using device pointers rather than thrust::device_vector. I can look into switching to the latter, but it might have a significant impact on our production code.


egaburov commented Mar 22, 2016

Placing the data in thrust::device_vector doesn't solve the issue. However, the freezes go away when initParticles is moved outside the while loop:

  initParticles<<<numBlocks, blockSize>>>(info, hash, partidx, numParticles);
  while(true) { ... }

I need to dig further into where the problem lies.


egaburov commented Nov 22, 2016

Closing this issue, as it was solved in CUDA 8.

egaburov closed this Nov 22, 2016


Oblomov commented Nov 24, 2016

Interesting, thanks (I've been busy and didn't have time to check). I assume we still don't know exactly what caused it? Some miscompilation maybe?
