Regression from 2.4.3 to 2.4.4 #185

Closed
filmor opened this Issue Oct 15, 2015 · 6 comments

3 participants

filmor commented Oct 15, 2015

I have the following piece of code:

import numexpr as ne
import numpy as np

x1, x2 = [np.random.rand(100000) for _ in range(2)]

prev = None
for i in range(10000):
    d = ne.evaluate("x1 - x2")
    if prev is None:
        pass
    elif np.any(np.abs(prev - d) > 1e-10):
        print("Step", i, "Sums:", np.sum(prev), "vs.", np.sum(d))
        break
    prev = d

I'm essentially taking the difference of two random vectors, and I find that after repeating this for some time the difference changes significantly.

This code runs fine with numexpr-2.4.3. I'm using Anaconda on Windows AMD64 with the newest version of NumPy (1.10; 1.9 showed the same problem), and the problem only appears when running multithreaded. Can someone else reproduce this?

It seems like a threading issue that leaves parts of the difference array filled with values very close to zero (~1e-315).
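As an aside, values around 1e-315 are subnormal (denormal) doubles: they lie below the smallest positive normal double, which is the kind of value you often see when stray or uninitialized bytes are interpreted as floating-point numbers. A minimal check:

```python
import sys

# The smallest positive *normal* double is about 2.2e-308.
print(sys.float_info.min)

# 1e-315 lies below that threshold, i.e. it is subnormal --
# a pattern consistent with reading memory that was never written.
print(1e-315 < sys.float_info.min)  # True
print(1e-315 > 0.0)                 # True: subnormal, not zero
```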

filmor commented Oct 16, 2015

According to this diff https://www.diffchecker.com/vufifls8 (left: the git version of win32/pthread.c; right: the one currently used in numexpr), the tid field of pthread_t is not filled in numexpr's pthread implementation, but it is in the one used by git. Is there a particular reason for this?

I saw that the only non-trivial change between 2.4.3 and 2.4.4 is f6462f5; could this be the reason? It seems to rely on that value being set ...

filmor commented Oct 16, 2015

Sorry, that might have been the wrong track; I didn't see that the tid is passed as an argument by init_threads. Nevertheless, doesn't the linked commit assume that tid=0 is the first thread to be started? Is that guaranteed?
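For illustration only (the actual code is C in win32/pthread.c), the pattern introduced by f6462f5 can be sketched in Python: only the worker with tid == 0 initializes the sentinels, and the barrier afterwards is what should guarantee every worker sees the result before proceeding. The names here are hypothetical stand-ins, not numexpr's actual API:

```python
import threading

NUM_THREADS = 4
barrier = threading.Barrier(NUM_THREADS)
state = {"init_sentinels_done": False}   # stand-in for gs.init_sentinels_done
observed = []

def th_worker(tid):
    # Only tid 0 initializes -- this assumes tid 0 always participates.
    if tid == 0:
        state["init_sentinels_done"] = True
    # The barrier makes the single writer safe: no worker proceeds
    # until every worker (including tid 0) has arrived.
    barrier.wait()
    observed.append(state["init_sentinels_done"])

threads = [threading.Thread(target=th_worker, args=(tid,))
           for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(observed))  # True: every worker saw the initialization
```

If the barrier itself misbehaves on some platform, a worker could run past it before tid 0 has written the flag, which would match the symptom described above.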

FrancescAlted commented Oct 21, 2015

Hmm, the commit for this is f6462f5. @legrosbuffle, any comments? You said that you were fixing a "harmless data race" reported by ThreadSanitizer, but it turns out that the fix has side effects.

filmor commented Oct 21, 2015

It does indeed look harmless, doesn't it? The threads are synchronized after this point, so the initialization is performed exactly once instead of once per thread as in the previous version. Maybe the synchronization is failing on Windows? A friend of mine tried the snippet on Gentoo and couldn't reproduce the error.

FrancescAlted commented Oct 24, 2015

I reverted the 'harmless data race' fix. If @legrosbuffle has comments, they are still welcome.

legrosbuffle commented Dec 8, 2015

Sorry, I missed the notification e-mails for this.
As you mention, there is a barrier just after setting gs.init_sentinels_done, and the thread id is always set, so some thread (but only one, since the access is not locked) should be setting gs.init_sentinels_done. There is only one caller of th_worker, at line 199, and tid=0 is the only thread that's guaranteed to exist, so I picked that one to set gs.init_sentinels_done.
I'm fine with this being reverted, as I was mainly fixing a ThreadSanitizer warning rather than a real bug; however, I'm still not fully convinced that this was the culprit. Did this get bisected?
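For contrast, the pre-f6462f5 behaviour that the revert restores can be sketched the same way (again with hypothetical names): every worker writes the same value to the flag, so the race is benign in practice even though a tool like ThreadSanitizer will report the unsynchronized concurrent writes:

```python
import threading

NUM_THREADS = 4
state = {"init_sentinels_done": False}   # stand-in for gs.init_sentinels_done

def th_worker(tid):
    # Every thread performs the same idempotent write: whichever thread
    # runs first, the flag ends up True. "Harmless" as a data race,
    # but still a race that ThreadSanitizer flags.
    state["init_sentinels_done"] = True

threads = [threading.Thread(target=th_worker, args=(tid,))
           for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state["init_sentinels_done"])  # True
```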
