Fix deadlock in asconf timer event #653

lukaszwojciechowski · 2022-03-22T22:45:51Z

Everytime a new IP address appears as source or destination address:
when receiving INIT, INIT ACK, ASCONF or binding to a new address;
such address is appended to the addr_wq list and a SCTP_TIMER_TYPE_ADDR_WQ
is started.
The SCTP_TIMER_TYPE_ADDR_WQ timer iterates over all IPs from addr_wq,
over all endpoints and over all associations in these endpoints
and matches changes of IP address list with network structures in associations.

List of all endpoints is protected by the global mutex: ipi_ep_mtx (A).
List addr_wq is protected by the global mutex: wq_addr_mtx (B).

Everytime a timer event is run a proper mutex protecting structure
that the event concerns is locked for the time of handling an event.
All timer events except for SCTP_TIMER_TYPE_ADDR_WQ are related to an endpoint
or an association, so the proper endpoint or association locks are used.
However in case of SCTP_TIMER_TYPE_ADDR_WQ a global wq_addr_mtx is used.

It protects the addr_wq that event handler iterates over. So the timer thread
running the ADDR_WQ event locked wq_addr_mtx (B) and inside the handler
locked also ipi_ep_mtx (A) for iterating over endpoints.

Meanwhile any other thread processing INIT, INIT ACK, ASCONF
or binding to a new address locked ipi_ep_mtx (A) for making changes
in global endpoint structures and during that changes locked wq_addr_mtx (B)
for adding new IPs to the wq_addr queue.

This caused a typical deadlock, where threads used locks in following order:

timer thread: B, A;
other threads: A, B.

This patch fixes the issue by reducing the scope of wq_addr_mtx (B)
during processing of SCTP_TIMER_TYPE_ADDR_WQ event only to places which
uses addr_wq structure. This way lock wq_addr_mtx (B) is not used
when ipi_ep_mtx (A) is required.

Signed-off-by: Lukasz Wojciechowski l.wojciechow@partner.samsung.com

Everytime a new IP address appears as source or destination address: when receiving INIT, INIT ACK, ASCONF or binding to a new address; such address is appended to the addr_wq list and a SCTP_TIMER_TYPE_ADDR_WQ is started. The SCTP_TIMER_TYPE_ADDR_WQ timer iterates over all IPs from addr_wq, over all endpoints and over all associations in these endpoints and matches changes of IP address list with network structures in associations. List of all endpoints is protected by the global mutex: ipi_ep_mtx (A). List addr_wq is protected by the global mutex: wq_addr_mtx (B). Everytime a timer event is run a proper mutex protecting structure that the event concerns is locked for the time of handling an event. All timer events except for SCTP_TIMER_TYPE_ADDR_WQ are related to an endpoint or an association, so the proper endpoint or association locks are used. However in case of SCTP_TIMER_TYPE_ADDR_WQ a global wq_addr_mtx is used. It protects the addr_wq that event handler iterates over. So the timer thread running the ADDR_WQ event locked wq_addr_mtx (B) and inside the handler locked also ipi_ep_mtx (A) for iterating over endpoints. Meanwhile any other thread processing INIT, INIT ACK, ASCONF or binding to a new address locked ipi_ep_mtx (A) for making changes in global endpoint structures and during that changes locked wq_addr_mtx (B) for adding new IPs to the wq_addr queue. This caused a typical deadlock, where threads used locks in following order: * timer thread: B, A; * other threads: A, B. This patch fixes the issue by reducing the scope of wq_addr_mtx (B) during processing of SCTP_TIMER_TYPE_ADDR_WQ event only to places which uses addr_wq structure. This way lock wq_addr_mtx (B) is not used when ipi_ep_mtx (A) is required. Signed-off-by: Lukasz Wojciechowski <l.wojciechow@partner.samsung.com>

tuexen · 2022-04-18T17:16:36Z

Everytime a new IP address appears as source or destination address: when receiving INIT, INIT ACK, ASCONF or binding to a new address; such address is appended to the addr_wq list and a SCTP_TIMER_TYPE_ADDR_WQ is started. The SCTP_TIMER_TYPE_ADDR_WQ timer iterates over all IPs from addr_wq, over all endpoints and over all associations in these endpoints and matches changes of IP address list with network structures in associations.

I don't think this is done when receiving an INIT, INIT ACK, or ASCONF chunk. But when a local address is added to / removed from an end point or to the stack.

List of all endpoints is protected by the global mutex: ipi_ep_mtx (A). List addr_wq is protected by the global mutex: wq_addr_mtx (B).

Everytime a timer event is run a proper mutex protecting structure that the event concerns is locked for the time of handling an event. All timer events except for SCTP_TIMER_TYPE_ADDR_WQ are related to an endpoint or an association, so the proper endpoint or association locks are used. However in case of SCTP_TIMER_TYPE_ADDR_WQ a global wq_addr_mtx is used.

It protects the addr_wq that event handler iterates over. So the timer thread running the ADDR_WQ event locked wq_addr_mtx (B) and inside the handler locked also ipi_ep_mtx (A) for iterating over endpoints.

Meanwhile any other thread processing INIT, INIT ACK, ASCONF or binding to a new address locked ipi_ep_mtx (A) for making changes in global endpoint structures and during that changes locked wq_addr_mtx (B) for adding new IPs to the wq_addr queue.

This caused a typical deadlock, where threads used locks in following order:

timer thread: B, A;

other threads: A, B.

Can you point me to a code path / call sequence, where the sequence A, B happens?
Maybe I'm missing something, but I don't see it.

This patch fixes the issue by reducing the scope of wq_addr_mtx (B) during processing of SCTP_TIMER_TYPE_ADDR_WQ event only to places which uses addr_wq structure. This way lock wq_addr_mtx (B) is not used when ipi_ep_mtx (A) is required.

Signed-off-by: Lukasz Wojciechowski l.wojciechow@partner.samsung.com

lukaszwojciechowski · 2022-04-29T07:32:52Z

The deadlock doesn't appear in the current version of usrsctp.

It happened on former version with:

The external thread calling usrsctp_bind was:

stuck on getting lock of wq_addr_mtx with SCTP_WQ_ADDR_LOCK() in usrsctplib/netinet/sctp_pcb.c:790
while it was owning ipi_ep_mtx acquired by SCTP_INP_INFO_WLOCK in usrsctplib/netinet/sctp_pcb.c:3788

At the same time the timer thread was processing SCTP_TIMER_TYPE_ADDR_WQ timer event by sctp_timeout_handler() and was:

holding wq_addr_mtx locked with SCTP_WQ_ADDR_LOCK() in usrsctplib/netinet/sctputil.c:1865
and trying to acquire ipi_ep_mtx with SCTP_INP_INFO_RLOCK() in usrsctplib/netinet/sctp_pcb.c:8082

but the PR does not longer apply as in the current version of usrsctp the sctp_inpcb_bind() function does not call sctp_add_addr_to_vrf() anymore.

tuexen · 2022-04-29T10:40:27Z

Thanks for double checking and letting me know.

tuexen self-assigned this Mar 29, 2022

tuexen added the bug label Mar 29, 2022

tuexen removed the bug label Apr 18, 2022

lukaszwojciechowski closed this Apr 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix deadlock in asconf timer event #653

Fix deadlock in asconf timer event #653

lukaszwojciechowski commented Mar 22, 2022

tuexen commented Apr 18, 2022

lukaszwojciechowski commented Apr 29, 2022

tuexen commented Apr 29, 2022

Fix deadlock in asconf timer event #653

Fix deadlock in asconf timer event #653

Conversation

lukaszwojciechowski commented Mar 22, 2022

tuexen commented Apr 18, 2022

lukaszwojciechowski commented Apr 29, 2022

tuexen commented Apr 29, 2022