prov/sockets: Fix ofiwg#3217, bug in buffered receive path

This bug was found by testing GASNet, which has some tests that stress
messaging endpoints very rigorously. The behaviour was that tests would
sporadically hang when put under heavy load.

The root cause is in the provider's handling of previously buffered
messages (ones that came in without a matching receive buffer). When the
handler grabbed a posted buffer to place a buffered message into, it did not
mark that receive buffer as free once it was done with it. Buffers posted
with FI_MULTI_RECV therefore became unusable for later messages, and the
provider eventually ran out of receive buffers.
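
As a rough illustration of this failure mode, the C sketch below models a posted receive entry with an `is_busy` flag. The `demo_*` names, the struct layout, and the lookup logic are hypothetical stand-ins, not the provider's actual code. The only detail taken from the patch is that the flag has to be cleared once the buffered message has been placed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a posted receive entry. */
struct demo_rx_entry {
	int is_busy;       /* set while some handler is filling the buffer */
	size_t used;       /* bytes already consumed (multi-recv style) */
	size_t total_len;  /* total size of the posted buffer */
};

/* Return the first entry that is not busy and still has room for
 * msg_len bytes, marking it busy. Returns NULL if none qualifies. */
static struct demo_rx_entry *
demo_get_entry(struct demo_rx_entry *entries, size_t count, size_t msg_len)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (entries[i].is_busy)
			continue;
		if (entries[i].total_len - entries[i].used < msg_len)
			continue;
		entries[i].is_busy = 1;
		return &entries[i];
	}
	return NULL;
}

/* Account for a buffered message placed into the entry. If release is
 * zero the entry stays busy, which models the behaviour before the fix. */
static void
demo_place(struct demo_rx_entry *entry, size_t msg_len, int release)
{
	assert(entry != NULL);
	entry->used += msg_len;
	if (release)
		entry->is_busy = 0;
}
```

With `release` set to 0, a second `demo_get_entry()` call on the same buffer returns NULL even though space remains, which mirrors how a multi-recv buffer could become permanently unavailable.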

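With the root cause fixed, a second issue surfaced in calculating the
position at which to place a message in such a multi-recv buffer: the loop
compared the number of bytes already consumed against the iovec count rather
than against the length of each iovec entry. This patch fixes that as well.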
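
The position calculation can be sketched in isolation as follows. This is a minimal model, assuming a plain array of lengths in place of the provider's iovec entries; `demo_iov`, `demo_find_iov`, and `demo_find_iov_old` are hypothetical names. The "old" variant reflects one reading of the removed lines, where the consumed byte count was compared against the number of entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one iovec entry of a posted buffer. */
struct demo_iov {
	size_t len;
};

/* Skip entries that are already fully consumed. Returns the index of
 * the first entry with free space and stores the offset within it.
 * Returns count if every entry has been consumed. */
static size_t
demo_find_iov(const struct demo_iov *iov, size_t count,
	      size_t used_len, size_t *offset)
{
	size_t i;

	assert(offset != NULL);
	for (i = 0; i < count; i++) {
		if (used_len < iov[i].len)
			break;              /* free space starts in this entry */
		used_len -= iov[i].len;     /* entry fully consumed, skip it */
	}
	*offset = used_len;
	return i;
}

/* Same loop, but comparing against the entry count instead of each
 * entry's length, as the removed lines appear to have done. */
static size_t
demo_find_iov_old(const struct demo_iov *iov, size_t count,
		  size_t used_len, size_t *offset)
{
	size_t i;

	(void) iov;
	assert(offset != NULL);
	for (i = 0; i < count; i++) {
		if (used_len < count)
			break;
		used_len -= count;
	}
	*offset = used_len;
	return i;
}
```

For a single 4096-byte buffer with 100 bytes already consumed, the corrected loop selects entry 0 at offset 100, while the old comparison skips the only entry and finds no place for the message.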

Signed-off-by: Erik Paulson <erik.r.paulson@intel.com>
Erik Paulson authored and shefty committed Aug 16, 2017
1 parent e990827 commit 280fdba
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions prov/sockets/src/sock_progress.c
@@ -1420,8 +1420,11 @@ static int sock_pe_progress_buffered_rx(struct sock_rx_ctx *rx_ctx)
 	pe_entry.data_len = 0;
 	pe_entry.buf = 0L;
 	for (i = 0; i < rx_posted->rx_op.dest_iov_len && rem > 0; i++) {
-		if (used_len >= rx_posted->rx_op.dest_iov_len) {
-			used_len -= rx_posted->rx_op.dest_iov_len;
+		/* Try to find the first iovec entry where the data
+		 * has not been consumed. In the common case, there
+		 * is only one iovec, i.e. a single buffer */
+		if (used_len >= rx_posted->iov[i].iov.len) {
+			used_len -= rx_posted->iov[i].iov.len;
 			continue;
 		}

@@ -1467,6 +1470,11 @@ static int sock_pe_progress_buffered_rx(struct sock_rx_ctx *rx_ctx)
 		sock_pe_report_recv_completion(&pe_entry);
 	}

+	/* Mark that we are done processing the posted recv buff.
+	 * This allows another thread to grab it when calling
+	 * sock_rx_get_entry() */
+	rx_posted->is_busy = 0;
+
 	dlist_remove(&rx_buffered->entry);
 	sock_rx_release_entry(rx_buffered);
