
XrdCl: Fix race condition in PostMaster initialization #251

Merged: 1 commit into xrootd:master on Jul 15, 2015

Conversation

@esindril (Contributor) commented Jul 7, 2015

If several threads start doing transfers in parallel, some can fail with the error [FATAL] Initialization error, because the PostMaster object has been created but not yet initialized.

This only happens at start-up, before the PostMaster is set up. The initial idea was probably to make GetPostMaster as efficient as possible by not taking the lock in the most frequent case. But this leads to a situation where one thread creates the sPostMaster object but does not get to initialize it, while another thread sees that the object exists and tries to use it.
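
For reference, a minimal sketch of the kind of unlocked fast path described above (stand-in types and names, not the actual XrdCl source):

```cpp
#include <mutex>

// Stand-in for XrdCl's PostMaster; illustrative only.
struct PostMaster
{
  bool pInitialized = false;
  void Initialize() { pInitialized = true; }
};

static PostMaster *sPostMaster = nullptr;
static std::mutex  sMutex;

PostMaster *GetPostMaster()
{
  // Unlocked fast path: skips the mutex in the common case, but a
  // second thread can observe a non-null sPostMaster here while the
  // first thread is still between `new` and Initialize() below.
  if( !sPostMaster )
  {
    std::lock_guard<std::mutex> lock( sMutex );
    if( !sPostMaster )
    {
      sPostMaster = new PostMaster(); // pointer becomes visible...
      sPostMaster->Initialize();      // ...before initialization is done
    }
  }
  return sPostMaster; // may be non-null yet uninitialized
}
```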

@bbockelm (Contributor) commented Jul 7, 2015

This won't work on platforms without atomic writes of 64-bit pointers: other threads may read a half-written pointer from the static variable.

@esindril (Contributor, Author) commented Jul 7, 2015

That's true, but it's still better than what we have now. Unless we take the lock every time, there will be a race condition no matter what, and the penalty of taking the lock is not worth it given that this only happens briefly at start-up.

@bbockelm (Contributor) commented Jul 7, 2015

@esindril - locks aren't necessary here; we can use std::atomic to make sure the read/write is done correctly. On x86 platforms the generated assembly turns out to be the same; IIRC, the potentially problematic platforms are PowerPC and ARM (CMSSW maintains an ARM build which I try to help keep shiny).
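
To make that concrete, a minimal sketch of the atomic variant, using C++11 double-checked locking with acquire/release semantics (stand-in types, not the actual patch):

```cpp
#include <atomic>
#include <mutex>

struct PostMaster { void Initialize() {} }; // stand-in type

static std::atomic<PostMaster*> sPostMaster{ nullptr };
static std::mutex               sMutex;

PostMaster *GetPostMaster()
{
  // Acquire load: a thread that sees a non-null pointer also sees
  // everything written before the release store below, including
  // the effects of Initialize().
  PostMaster *pm = sPostMaster.load( std::memory_order_acquire );
  if( !pm )
  {
    std::lock_guard<std::mutex> lock( sMutex );
    pm = sPostMaster.load( std::memory_order_relaxed );
    if( !pm )
    {
      pm = new PostMaster();
      pm->Initialize(); // fully initialize *before* publishing
      sPostMaster.store( pm, std::memory_order_release );
    }
  }
  return pm;
}
```

On x86 the fast-path acquire load compiles to an ordinary load, which is the "same assembler" point above.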

@abh3 (Member) commented Jul 7, 2015

Perhaps a silly question, but why can't the PostMaster be initialized before we start all the other threads? And why did this start showing up now? I do agree that an atomic read/write would work here, but it's iffy whether an atomic read/write will be more efficient than getting a lock. In general, a full memory barrier has to be established around the memory access. On most (though not all) x86 architectures that is automatic for properly aligned values, which we have here; it's usually a problem on non-x86 architectures.

@esindril force-pushed the fix_postmaster branch 2 times, most recently from 1b6b853 to 6889ed1 on July 15, 2015
@esindril (Contributor, Author) commented Jul 15, 2015

I have updated this pull request. Indeed, as Lukasz noticed, I had removed the wrong check by mistake. What I actually wanted to do was remove the check on pInitialized, since it should never happen that you get a PostMaster object that has not been initialized.

And as far as the "half-written" pointer issue is concerned, the current code without this patch is no better off. The only bullet-proof solution would be a lock or atomics, which I don't believe is worth it given that this happens only while the XrdCl library is being initialized, and the lock (or any flavour of synchronisation) would be unnecessary overhead after the start-up phase.

esindril added a commit that referenced this pull request Jul 15, 2015
XrdCl: Fix race condition in PostMaster initialization
@esindril esindril merged commit ac3ac4c into xrootd:master Jul 15, 2015
@esindril esindril deleted the fix_postmaster branch July 15, 2015 14:10
@esindril esindril added the fixed label Jul 15, 2015
@esindril esindril added this to the v4.2.2 milestone Jul 15, 2015
@bbockelm (Contributor) commented Jul 15, 2015

Hi @esindril -

Just to clear up misconceptions. There is no overhead for the atomic I suggested on platforms besides ARM (and there it is pretty small compared to a lock).

Not all synchronization primitives are built the same!

Brian

@esindril (Contributor, Author) commented Jul 15, 2015

For this particular case, since we only set the variable once and then only do get operations, I totally agree; no misconception here. I will add it as an atomic so that all concerns are addressed; the target of this patch was the initialization bit. Thanks for your input @bbockelm.

@abh3 (Member) commented Jul 15, 2015

Just to add to the noise: quite true, not all atomics are built the same, but how you build them and how you use them can make a big difference. So, let's take x86. A strong atomic store is typically built as:

fence(); store(); fence()

The fence() is the problematic thing here. It all depends on the architecture: a fence requires that all loads and stores be committed before continuing, which usually implies discarding all speculative memory fetches. That can be very expensive on some architectures (and, from my understanding, even on some x86 variants). Depending on the atomic requirements, it may very well be better to simply obtain a lock.

Andy
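
To illustrate the distinction being drawn here, a small sketch comparing a sequentially consistent store (the "strong" store with full-barrier cost) against a release store (illustrative only):

```cpp
#include <atomic>

std::atomic<int> flag{ 0 };

void strong_store()
{
  // Sequentially consistent store: on x86 this compiles to a locked
  // XCHG (or MOV + MFENCE) -- the full-barrier cost described above.
  flag.store( 1, std::memory_order_seq_cst );
}

void weaker_store()
{
  // Release store: on x86 a plain MOV suffices, since x86 does not
  // reorder a store with earlier loads and stores.
  flag.store( 1, std::memory_order_release );
}
```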


@ljanyst (Contributor) commented Jul 15, 2015

This will only be triggered in places where networking is involved, so we're talking about latencies on the order of tens (or hundreds) of milliseconds. I don't think that a couple hundred nanoseconds of delay caused by a memory barrier will be much of an issue here. Alternatively, you may move the PostMaster initialization to the static initializer, where it will be done at application start-up in a single-threaded context; but this may have other consequences for people who expect to do some things in single-threaded mode before they do any data access. Your call.

@esindril __sync_lock_test_and_set is your friend.
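
A rough sketch of a lock-free publish along those lines. It uses the related full-barrier builtin __sync_val_compare_and_swap rather than __sync_lock_test_and_set itself, since the latter is documented as only an acquire barrier and, being an unconditional exchange, would let a losing thread overwrite the winner's pointer. Stand-in types, not the actual code:

```cpp
struct PostMaster { void Initialize() {} }; // stand-in type

static PostMaster *sPostMaster = 0;

PostMaster *GetPostMaster()
{
  PostMaster *pm = sPostMaster; // plain fast-path read, as discussed
  if( !pm )
  {
    pm = new PostMaster();
    pm->Initialize(); // initialize fully before publishing
    // Full barrier: publish only if no one else has; returns the
    // previous value of sPostMaster.
    PostMaster *prev =
        __sync_val_compare_and_swap( &sPostMaster, (PostMaster *)0, pm );
    if( prev ) // another thread won the race
    {
      delete pm;
      pm = prev;
    }
  }
  return pm;
}
```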

@xrootd-dev commented Jul 15, 2015

I guess my point is that atomics should not be looked at as a general solution. Sometimes they are good and sometimes there are better ways of doing it. In this particular context, my impression was that the pointer would be fetched gadzillions of times during execution but set only once. Using atomics in this case is complete overkill and, frankly, a waste of CPU time. Better to set it once in a single-threaded environment and fetch it normally afterwards.


@ljanyst (Contributor) commented Jul 15, 2015

Your call. Bear in mind that it may break apps that need a single-threaded context before doing the I/O (i.e. some forking scenarios).
