Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in asconf timer event #653

Closed

Conversation

lukaszwojciechowski
Copy link

Everytime a new IP address appears as source or destination address:
when receiving INIT, INIT ACK, ASCONF or binding to a new address;
such address is appended to the addr_wq list and a SCTP_TIMER_TYPE_ADDR_WQ
is started.
The SCTP_TIMER_TYPE_ADDR_WQ timer iterates over all IPs from addr_wq,
over all endpoints and over all associations in these endpoints
and matches changes of IP address list with network structures in associations.

List of all endpoints is protected by the global mutex: ipi_ep_mtx (A).
List addr_wq is protected by the global mutex: wq_addr_mtx (B).

Everytime a timer event is run a proper mutex protecting structure
that the event concerns is locked for the time of handling an event.
All timer events except for SCTP_TIMER_TYPE_ADDR_WQ are related to an endpoint
or an association, so the proper endpoint or association locks are used.
However in case of SCTP_TIMER_TYPE_ADDR_WQ a global wq_addr_mtx is used.

It protects the addr_wq that event handler iterates over. So the timer thread
running the ADDR_WQ event locked wq_addr_mtx (B) and inside the handler
locked also ipi_ep_mtx (A) for iterating over endpoints.

Meanwhile any other thread processing INIT, INIT ACK, ASCONF
or binding to a new address locked ipi_ep_mtx (A) for making changes
in global endpoint structures and during that changes locked wq_addr_mtx (B)
for adding new IPs to the wq_addr queue.

This caused a typical deadlock, where threads used locks in following order:

  • timer thread: B, A;
  • other threads: A, B.

This patch fixes the issue by reducing the scope of wq_addr_mtx (B)
during processing of SCTP_TIMER_TYPE_ADDR_WQ event only to places which
uses addr_wq structure. This way lock wq_addr_mtx (B) is not used
when ipi_ep_mtx (A) is required.

Signed-off-by: Lukasz Wojciechowski l.wojciechow@partner.samsung.com

Everytime a new IP address appears as source or destination address:
when receiving INIT, INIT ACK, ASCONF or binding to a new address;
such address is appended to the addr_wq list and a SCTP_TIMER_TYPE_ADDR_WQ
is started.
The SCTP_TIMER_TYPE_ADDR_WQ timer iterates over all IPs from addr_wq,
over all endpoints and over all associations in these endpoints
and matches changes of IP address list with network structures in associations.

List of all endpoints is protected by the global mutex: ipi_ep_mtx (A).
List addr_wq is protected by the global mutex: wq_addr_mtx (B).

Everytime a timer event is run a proper mutex protecting structure
that the event concerns is locked for the time of handling an event.
All timer events except for SCTP_TIMER_TYPE_ADDR_WQ are related to an endpoint
or an association, so the proper endpoint or association locks are used.
However in case of SCTP_TIMER_TYPE_ADDR_WQ a global wq_addr_mtx is used.

It protects the addr_wq that event handler iterates over. So the timer thread
running the ADDR_WQ event locked wq_addr_mtx (B) and inside the handler
locked also ipi_ep_mtx (A) for iterating over endpoints.

Meanwhile any other thread processing INIT, INIT ACK, ASCONF
or binding to a new address locked ipi_ep_mtx (A) for making changes
in global endpoint structures and during that changes locked wq_addr_mtx (B)
for adding new IPs to the wq_addr queue.

This caused a typical deadlock, where threads used locks in following order:
* timer thread: B, A;
* other threads: A, B.

This patch fixes the issue by reducing the scope of wq_addr_mtx (B)
during processing of SCTP_TIMER_TYPE_ADDR_WQ event only to places which
uses addr_wq structure. This way lock wq_addr_mtx (B) is not used
when ipi_ep_mtx (A) is required.

Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>
@tuexen tuexen self-assigned this Mar 29, 2022
@tuexen tuexen added the bug label Mar 29, 2022
@tuexen
Copy link
Member

tuexen commented Apr 18, 2022

Everytime a new IP address appears as source or destination address: when receiving INIT, INIT ACK, ASCONF or binding to a new address; such address is appended to the addr_wq list and a SCTP_TIMER_TYPE_ADDR_WQ is started. The SCTP_TIMER_TYPE_ADDR_WQ timer iterates over all IPs from addr_wq, over all endpoints and over all associations in these endpoints and matches changes of IP address list with network structures in associations.

I don't think this is done when receiving an INIT, INIT ACK, or ASCONF chunk. But when a local address is added to / removed from an end point or to the stack.

List of all endpoints is protected by the global mutex: ipi_ep_mtx (A). List addr_wq is protected by the global mutex: wq_addr_mtx (B).

Everytime a timer event is run a proper mutex protecting structure that the event concerns is locked for the time of handling an event. All timer events except for SCTP_TIMER_TYPE_ADDR_WQ are related to an endpoint or an association, so the proper endpoint or association locks are used. However in case of SCTP_TIMER_TYPE_ADDR_WQ a global wq_addr_mtx is used.

It protects the addr_wq that event handler iterates over. So the timer thread running the ADDR_WQ event locked wq_addr_mtx (B) and inside the handler locked also ipi_ep_mtx (A) for iterating over endpoints.

Meanwhile any other thread processing INIT, INIT ACK, ASCONF or binding to a new address locked ipi_ep_mtx (A) for making changes in global endpoint structures and during that changes locked wq_addr_mtx (B) for adding new IPs to the wq_addr queue.

This caused a typical deadlock, where threads used locks in following order:

  • timer thread: B, A;
  • other threads: A, B.

Can you point me to a code path / call sequence, where the sequence A, B happens?
Maybe I'm missing something, but I don't see it.

This patch fixes the issue by reducing the scope of wq_addr_mtx (B) during processing of SCTP_TIMER_TYPE_ADDR_WQ event only to places which uses addr_wq structure. This way lock wq_addr_mtx (B) is not used when ipi_ep_mtx (A) is required.

Signed-off-by: Lukasz Wojciechowski l.wojciechow@partner.samsung.com

@tuexen tuexen removed the bug label Apr 18, 2022
@lukaszwojciechowski
Copy link
Author

The deadlock doesn't appear in the current version of usrsctp.

It happened on former version with:

The external thread calling usrsctp_bind was:

  • stuck on getting lock of wq_addr_mtx with SCTP_WQ_ADDR_LOCK() in usrsctplib/netinet/sctp_pcb.c:790
  • while it was owning ipi_ep_mtx acquired by SCTP_INP_INFO_WLOCK in usrsctplib/netinet/sctp_pcb.c:3788

At the same time the timer thread was processing SCTP_TIMER_TYPE_ADDR_WQ timer event by sctp_timeout_handler() and was:

  • holding wq_addr_mtx locked with SCTP_WQ_ADDR_LOCK() in usrsctplib/netinet/sctputil.c:1865
  • and trying to acquire ipi_ep_mtx with SCTP_INP_INFO_RLOCK() in usrsctplib/netinet/sctp_pcb.c:8082

but the PR does not longer apply as in the current version of usrsctp the sctp_inpcb_bind() function does not call sctp_add_addr_to_vrf() anymore.

@tuexen
Copy link
Member

tuexen commented Apr 29, 2022

Thanks for double checking and letting me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants