[core] fix session deadlock (#2290) #2300

Open
wants to merge 1 commit into
base: master

@cdevelop cdevelop commented Nov 5, 2023

@cdevelop
Author

cdevelop commented Nov 5, 2023

@seven1240 Teacher Du and I have discussed this issue in detail.

@signalwire-ci

signalwire-ci bot commented Nov 5, 2023

@seven1240
Collaborator

@andywolk cdevelop has a system that leaves 3-10 deadlocked channels per day, which could be fixed by this patch.

I reviewed the patch. The read_frame parts look fine, since the message will be processed on the next read. But I'm worried about the write parts. What if this happens on a sendonly channel that never reads, or in an app that only writes and never reads (I can't think of such an app offhand)? In that case the message would never be processed.

I think it is best to find the root cause of the deadlock rather than rely on a workaround.
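
To illustrate the concern above, here is a minimal toy model. It is not FreeSWITCH code and not the code from this patch: the `ToyChannel` class, its method names, and the "defer the message and drain it on the next read" behaviour are all assumptions made for illustration. It only shows why a deferred message is handled on a channel that reads, but stays queued on a channel that only writes.

```python
from collections import deque


class ToyChannel:
    """Toy stand-in for a session; hypothetical, not FreeSWITCH code."""

    def __init__(self):
        self.pending = deque()  # messages deferred instead of handled inline
        self.handled = []       # messages that were actually processed

    def queue_message(self, msg):
        # Assumed workaround behaviour: defer the message rather than block.
        self.pending.append(msg)

    def read_frame(self):
        # The read path drains deferred messages before returning a frame.
        while self.pending:
            self.handled.append(self.pending.popleft())
        return b"frame"

    def write_frame(self, frame):
        # The write path never drains the queue in this model.
        return len(frame)


# A channel that reads: the deferred message is handled on the next read.
duplex = ToyChannel()
duplex.queue_message("media-update")
duplex.read_frame()

# A send-only channel: it only writes, so the message stays queued.
sendonly = ToyChannel()
sendonly.queue_message("media-update")
for _ in range(3):
    sendonly.write_frame(b"frame")

print(duplex.handled, list(sendonly.pending))
```

In this model the write-only channel never processes its message, which is the scenario the question about the write parts is asking about. Whether the real write path behaves this way depends on the actual patch.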

@bferreirq

Hello, any news about this commit?

@KerryRJ

KerryRJ commented Feb 28, 2024

I have an installation where 30-40 inbound and outbound calls are getting deadlocked on a daily basis, and the call center agent connected to the deadlocked call also gets stuck until a restart or until the extension leg of the call is hung up.

#2387 and #2390 did not resolve the deadlocks or the stuck call center agent.

This fix has stopped all deadlocks, and consequently there are no more stuck call center agents.

@televoicepl

> I have an installation where 30-40 inbound and outbound calls are getting deadlocked on a daily basis, and the call center agent connected to the deadlocked call also gets stuck until a restart or until the extension leg of the call is hung up.
>
> #2387 and #2390 did not resolve the deadlocks or the stuck call center agent.
>
> This fix has stopped all deadlocks, and consequently there are no more stuck call center agents.

Same problem in my instance.
1.10.11 and 1.10.10 are affected. 1.10.9 is OK.
Random calls go dead and only a core restart helps.

@shaunjstokes
Contributor

shaunjstokes commented Apr 4, 2024

This patch has fixed all deadlocks and stuck calls in our environment.

#2387 and #2390 did not fix the issue.

@bfroemel

I can also report that this patch fixes a similar issue with stuck channels and stuck call center agents in my installation.

@boteman

boteman commented Apr 28, 2024

Why not make it pass the Unit Tests so that it can be merged? Many people seem to be experiencing this problem.

@televoicepl

@jakubkarolczyk can you check it?

@televoicepl televoicepl mentioned this pull request Apr 29, 2024
@gregoriusus

Can I bump this thread? This issue is a real problem and a lot of people are affected by it.

@boteman

boteman commented May 8, 2024

Yesterday on Office Hours I asked about this and the related tickets, which exhibit a similar problem. BKW said that the problem is much deeper than what the proposed patch solves, and that they are refactoring the code to solve everything at once. So it is a bigger effort than we thought, with no ETR.

@zooptwopointone

Before I open another ticket like this, I am curious whether the issue I am seeing could be related to this one. I am getting a lot of stuck calls in the following scenario.
An inbound call runs a bg_api command to originate another call, and both calls then join a conference. The call originated via bg_api gets stuck if it receives a re-invite. In all of the stuck calls I have, the re-invite comes immediately after the answer, when maybe 1 to 4 RTP packets have been transmitted. In some cases the re-invite suggested another codec, and in others it changed the media IP, but in all cases both call legs get stuck. Specifically, they still show up when you execute "show calls". What I have found is that the first (inbound) leg ends and shows false for uuid_exists, while the outbound leg shows true. I can run uuid_kill on the outbound leg, but it does not go away; after the kill the result is "no such channel", yet "show calls" still displays both legs.

This seems to have started, or gotten worse, after upgrading from 1.10.7, where we were getting stuck calls but at a low rate: say 10 per week then versus 100 per day now. Getting a carrier to change some behaviour reduced the re-invites and does seem to have directly reduced the number of dead calls.

Let me know if I should submit a separate report for this; I can provide more details there.
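
The symptom described above (a leg that is still listed by "show calls" although uuid_exists reports that no such channel exists) can be checked mechanically. Below is a minimal sketch, assuming the listed UUIDs and the uuid_exists answers have already been collected from the FreeSWITCH console by some other means. The function name `find_stale_calls` and the sample UUIDs are hypothetical and used only for illustration.

```python
def find_stale_calls(listed_uuids, uuid_exists):
    """Return UUIDs that are still listed although no live channel backs them.

    listed_uuids: UUIDs taken from the call listing (hypothetical input).
    uuid_exists:  callable returning True if the channel still exists.
    """
    return [uuid for uuid in listed_uuids if not uuid_exists(uuid)]


# Made-up sample data standing in for console output.
listed = ["leg-a-inbound", "leg-b-outbound", "leg-c-healthy"]
live_channels = {"leg-c-healthy"}

stale = find_stale_calls(listed, lambda uuid: uuid in live_channels)
print(stale)
```

A list like this could be attached to a bug report together with the SIP trace of the re-invite, so that stuck legs can be matched to the signalling that preceded them.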

@hhadzem
Contributor

hhadzem commented Jul 23, 2024

Is there any progress/update related to this issue?

@zrd740

zrd740 commented Sep 6, 2024

I upgraded several servers from FreeSWITCH 1.10.8 to 1.10.12. This error showed up, and it is creating dozens of stuck calls per day along with daily client complaints. 1.10.8 does not have the issue. Is there any update on this? It seems to be a problem that has been going on for a long time. FusionPBX support said the solution is to apply this patch and compile from source, and many users have reported that this resolves the problem. I wanted to use the SignalWire packages on Debian, but they do not work for us because of this issue. If we can't get a fix, I'll need to start compiling from source.

@zrd740

zrd740 commented Sep 7, 2024

We've resolved the issue by migrating away from prebuilt packages and compiling FS from source with this patch, and it works perfectly. However, I'm disappointed that after almost a year, this hasn't been included in the release despite its impact on many users.

@boteman

boteman commented Sep 9, 2024

My understanding from what the core developers have said is that this ticket and at least 3 others stem from a combination of problems deep down in the core that require extensive refactoring of the source code to correct. They found some long-standing bugs in the process and aim to correct all of them in this effort. That is why this patch works for some but not for others. I'm as anxious as you are for a permanent fix.
