Firefox constantly crashes using native wayland #7645

Open
thiloho opened this issue Jun 21, 2023 · 24 comments
Labels
bug Not working as intended

Comments

@thiloho

thiloho commented Jun 21, 2023

Sway Version:

1.8.1


Debug Log:

https://termbin.com/zwso


Configuration File:

https://termbin.com/iqa7e


Description:

Install sway (I am using NixOS), disable Xwayland in the configuration, and launch Firefox from a terminal with the command firefox. Use it for a few minutes or hours until it closes at random with the following console output:

Gdk-Message: Lost connection to Wayland compositor

I found that it happens more often when the web developer tools are in constant use.
Here is a video of me following the procedure described above and crashing Firefox after two minutes: https://github.com/swaywm/sway/assets/123883702/97d6a9ea-6ad9-48b7-8166-e096fb303951

Unfortunately I cannot find a more specific crash log: there is nothing in about:crashes, and debug mode shows no further information either.

@thiloho thiloho added the bug Not working as intended label Jun 21, 2023
@triallax

I seem to be hitting this too. I'm happy to provide more details if needed.

@the8472

the8472 commented Jul 23, 2023

That's probably FF bug 1743144 which is a somewhat fundamental wayland design issue possibly exacerbated by GTK forcefully terminating the process instead of giving it a chance to handle the error. Some compositors implement workarounds, but wlroots maintainers don't want to implement those since they consider this a bug that needs to be fixed somewhere else.
It's not clear if any project considers it their responsibility to fix this issue.

It's especially common if you have a high polling rate mouse.

A possible workaround without patching any project is to write a wrapper program that invokes firefox, looks up the child's file descriptor for the Wayland connection between firefox and sway, and adjusts the buffer sizes on that socket. This could probably be done in a few lines of Python or Rust.
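
As a rough illustration of the "adjust the buffer sizes on the socket" step only, here is a minimal Python sketch. It uses a local socket pair as a stand-in for the Wayland connection; finding the real descriptor in the Firefox child, and deciding which end of the connection actually needs the larger buffer, are not shown. The helper name `enlarge_buffers` is made up for this example. On Linux the kernel caps the request at net.core.wmem_max / net.core.rmem_max, so the granted size may be smaller than what was asked for.

```python
import socket


def enlarge_buffers(sock, size):
    """Ask the kernel for larger send/receive buffers on `sock`.

    Returns the (send, receive) sizes the kernel reports afterwards,
    which may differ from `size` because of system-wide limits.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )


if __name__ == "__main__":
    # Stand-in for the Wayland connection (assumption: a Unix stream socket).
    a, b = socket.socketpair()
    print("granted (send, recv):", enlarge_buffers(a, 1 << 20))
    a.close()
    b.close()
```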

@mnd999

mnd999 commented Aug 17, 2023

I get that on FreeBSD

@aqxa1

aqxa1 commented Aug 19, 2023

That behaviour with suspending processes is pretty annoying too (as linked in the Wayland issue), and it explains why suspending processes seemed to break with some apps but not others (it is a Wayland issue, not an XWayland one).

On the Firefox issue, I hope some kind of fix or workaround appears, because the problem is infuriating and seems to happen under heavy RAM pressure as well, not just disk I/O.

@Ranguvar

This has started to plague my desktop, crashing several windows with hundreds of tabs multiple times per day.
It's pretty rough to have to choose between Wayland Firefox and sway.

I'll look into the wrapper idea @the8472 mentioned - thank you - but I'm probably only capable of writing a bash script.
Any further hints would be extremely appreciated.

I'll see if I can reactivate my Mozilla Bugzilla and help from that side, too.

@the8472

the8472 commented Nov 29, 2023

If you're feeling adventurous/desperate you can try https://github.com/the8472/weyland-p5000

Note that this is proof-of-concept level code that may well be worse than the issue it's supposed to work around. I'm not even dog-fooding it since i3 is still my daily driver.

@Ranguvar

Ranguvar commented Nov 29, 2023

I've only tested an hour so far, but I've not been able to reproduce the crash.

Enormous thanks. I will keep testing and report any findings on FF bug 1743144.

@JonasVautherin

JonasVautherin commented Dec 4, 2023

For the record, I ended up on this issue because my Firefox was repeatedly crashing when starting up, with this very Gdk-Message.

I tried running with @the8472's proxy (weyland-p5000) and so far it seems to be working (it's been an hour).

EDIT: Firefox has been running fine for 4 hours with @the8472's proxy. If I quit Firefox and try to start it without the proxy, it crashes immediately.

@Cloudef

Cloudef commented Dec 8, 2023

> That's probably FF bug 1743144 which is a somewhat fundamental wayland design issue possibly exacerbated by GTK forcefully terminating the process instead of giving it a chance to handle the error. Some compositors implement workarounds, but wlroots maintainers don't want to implement those since they consider this a bug that needs to be fixed somewhere else. It's not clear if any project considers it their responsibility to fix this issue.
>
> It's especially common if you have a high polling rate mouse.
>
> A possible workaround without patching any project is to write a wrapper program that invokes firefox, looks up the child's file descriptor for the Wayland connection between firefox and sway, and adjusts the buffer sizes on that socket. This could probably be done in a few lines of Python or Rust.

This is kinda insane to be honest

@emersion
Member

emersion commented Dec 8, 2023

Note that the fix for this is https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188.

@lonjil

lonjil commented Dec 10, 2023

I tried @the8472's proxy, which made Firefox more stable, but I can still easily cause a crash by simply opening a lot of tabs very quickly, or selecting and trying to move many tabs at the same time.

@Ranguvar

Another data point: I haven't had a single crash with the proxy, since the day the8472 (🙏🙏) released it.

I intentionally stressed the system, turning back on Tree Style Tab. I use 6 windows with a couple hundred tabs each and thousands on the main, and was able to open hundreds rapidly with a free scrolling wheel.
Sorry that I can't offer any advice.

@lonjil

lonjil commented Dec 11, 2023

I massively increased the sysctls net.core.{r,w}mem_{default,max} and after restarting Firefox, it no longer crashes with the proxy (I'll try it without the proxy later). I did manage to crash Sway by opening FF tabs quickly though. But after starting Sway again, presumably making those changed numbers apply to it, Sway has not crashed again.
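
For anyone who wants to check what their system currently uses before changing anything, the four sysctls can be read from /proc. This is a minimal Python sketch, assuming a Linux system that exposes them under /proc/sys/net/core; the function name `read_net_core_buffers` is made up for this example. Values that cannot be read are reported as None.

```python
from pathlib import Path

# The four sysctls named above: net.core.{r,w}mem_{default,max}
SYSCTLS = ("rmem_default", "rmem_max", "wmem_default", "wmem_max")


def read_net_core_buffers(base="/proc/sys/net/core"):
    """Return {sysctl name: value in bytes, or None if unreadable}."""
    values = {}
    for name in SYSCTLS:
        key = f"net.core.{name}"
        try:
            values[key] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            values[key] = None
    return values


if __name__ == "__main__":
    for key, value in read_net_core_buffers().items():
        print(f"{key} = {value}")
```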

@Ranguvar

Interesting. I have 212992 as the value for all of those sysctls - just Arch defaults.

@lonjil

lonjil commented Dec 11, 2023

That's the number I had before changing it.

@the8472

the8472 commented Dec 11, 2023

> I tried @the8472's proxy, which made Firefox more stable, but I can still easily cause a crash by simply opening a lot of tabs very quickly, or selecting and trying to move many tabs at the same time.

> I massively increased the sysctls net.core.{r,w}mem_{default,max} and after restarting Firefox, it no longer crashes with the proxy (I'll try it without the proxy later). I did manage to crash Sway by opening FF tabs quickly though. But after starting Sway again, presumably making those changed numbers apply to it, Sway has not crashed again.

Strange. The proxy is supposed to make larger buffers unnecessary by doing all the buffering in userspace. Its job is to keep those queues empty. Increasing the default buffer sizes for all sockets in the system is quite a waste of memory.

If it still cannot keep up then I speculate that it might be for one of the following reasons:

  • your system is resource-constrained in some way that keeps multiple processes from doing work (memory pressure? swapping? few CPU cores? some high priority task grabbing all the CPU cycles?)
  • something is generating events at an even higher rate than on my system (I do have a 1 kHz mouse). Maybe an 8 kHz mouse? A pen tablet?
  • there's some subtle efficiency/latency issue in my code that makes it slower at processing events than I expect it to be.

But what I can do is bump the buffer sizes for those connections specifically, that way you don't have to do it for the whole system.
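
To make "doing all the buffering in userspace" concrete, here is a toy Python sketch of the general idea: a thread that reads from a socket as fast as it can and parks the data in an unbounded in-process queue, so the kernel queue stays short even if the real consumer is slow. It is not how weyland-p5000 is implemented, and it ignores things a real Wayland proxy has to handle, such as file descriptors passed over the socket and traffic in both directions. The names `Drainer` and `demo` are made up for this example.

```python
import socket
import threading
from collections import deque


class Drainer(threading.Thread):
    """Continuously move bytes from a socket into an unbounded queue."""

    def __init__(self, sock):
        super().__init__(daemon=True)
        self.sock = sock
        self.chunks = deque()  # userspace buffer, limited only by memory
        self.total = 0

    def run(self):
        while True:
            data = self.sock.recv(65536)
            if not data:  # peer closed the connection
                break
            self.chunks.append(data)
            self.total += len(data)


def demo(n_bytes=4 * 1024 * 1024):
    """Send n_bytes through a socket pair and return how many were buffered."""
    sender, receiver = socket.socketpair()
    drainer = Drainer(receiver)
    drainer.start()
    # Without the drainer this blocking send would stall once the
    # kernel buffer filled up.
    sender.sendall(b"x" * n_bytes)
    sender.close()
    drainer.join()
    receiver.close()
    return drainer.total


if __name__ == "__main__":
    print("buffered bytes:", demo())
```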

@lonjil

lonjil commented Dec 11, 2023

> your system is resource-constrained in some way that keeps multiple processes from doing work (memory pressure? swapping? few CPU cores? some high priority task grabbing all the CPU cycles?)

I got crashes, and actually just got a crash again, when Firefox has so much work to do that it slows down my entire system, so yes, this is the culprit. I wonder if it could be resolved by giving p5wl (and Sway) higher scheduling priority?

@the8472

the8472 commented Dec 11, 2023

That might work, though tail latencies might still go up even for higher-priority processes if the system is under heavy load. Since non-root users are limited in what thread priorities they can set, you can try lowering the priority of firefox instead: p5wl nice -n 5 firefox

And if you haven't already, upgrade to kernel 6.6, which has the EEVDF scheduler that is supposed to be better at ensuring latency fairness.
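
The same "lower the child's priority" idea can be done from a launcher script. A minimal Python sketch, assuming a POSIX system; `run_niced` is a made-up helper that raises the child's niceness before it executes, much like nice -n 5 does. The demo starts a small Python child that prints its own niceness rather than launching Firefox.

```python
import os
import subprocess
import sys


def run_niced(argv, increment=5, **popen_kwargs):
    """Start `argv` with its niceness raised by `increment` (lower priority).

    Unprivileged users may raise niceness but generally not lower it.
    """
    return subprocess.Popen(
        argv,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        **popen_kwargs,
    )


if __name__ == "__main__":
    # The child reports its own niceness; os.nice(0) leaves it unchanged.
    child = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    child.wait()
```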

@the8472

the8472 commented Dec 11, 2023

Also, can you test if setting /proc/sys/kernel/sched_autogroup_enabled to 0 makes any difference?

@lonjil

lonjil commented Dec 11, 2023

I increased the priority for Sway and p5wl, and lowered it for Firefox. Will see how it goes. I'm on 6.6, but since I finally switched to Wayland and also created a new Firefox profile about the time of that update, I don't even have a subjective idea of whether it did anything.

> Also, can you test if setting /proc/sys/kernel/sched_autogroup_enabled to 0 makes any difference?

Will do.

@rpigott
Member

rpigott commented Dec 12, 2023

Linux autogrouping only affects procs in the root cgroup, making it irrelevant on any distro using systemd. Aside from the process nice value, the "nice" values that affect firefox are those of the parent cgroups with CPU accounting enabled:

$ grep -H $ /sys/fs/cgroup/**/app-firefox*.scope/(../)#cpu.weight.nice(Od:a) | column -ts:
/sys/fs/cgroup/user.slice/cpu.weight.nice                                              0
/sys/fs/cgroup/user.slice/user-1000.slice/cpu.weight.nice                              0
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cpu.weight.nice            0
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/cpu.weight.nice  0

Firefox's nice will be balanced proportionally amongst its peers in app.slice on most distros, or simply the entire user session if app.slice is not populated. Sway is sched_rr if installed with cap_sys_nice, so its nice value cannot be modified.
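
For shells without zsh's extended globbing, roughly the same listing can be produced with a short script. This is a minimal Python sketch, assuming cgroup v2 mounted at /sys/fs/cgroup and a systemd-style app-firefox*.scope directory as in the output above; `nice_weights` is a made-up helper that walks from the scope up towards the root and collects every cpu.weight.nice it finds.

```python
from pathlib import Path


def nice_weights(cgroup_dir, root="/sys/fs/cgroup"):
    """Collect (path, value) for each cpu.weight.nice between `root`
    and `cgroup_dir`, outermost cgroup first. The root itself is skipped."""
    root = Path(root)
    current = Path(cgroup_dir)
    found = []
    while current != root and root in current.parents:
        candidate = current / "cpu.weight.nice"
        if candidate.is_file():
            found.append((str(candidate), int(candidate.read_text().strip())))
        current = current.parent
    return list(reversed(found))


if __name__ == "__main__":
    base = Path("/sys/fs/cgroup")
    for scope in base.glob("**/app-firefox*.scope"):
        for path, value in nice_weights(scope, base):
            print(f"{path}  {value}")
```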

@the8472

the8472 commented Dec 12, 2023

I see, I misunderstood the hierarchical part in the manpage and thought it would still perform a sub-group under the cgroup.

@retropc

retropc commented Dec 23, 2023

> A possible workaround without patching any project is to write a wrapper program that invokes firefox, looks up the child's file descriptor for the Wayland connection between firefox and sway, and adjusts the buffer sizes on that socket. This could probably be done in a few lines of Python or Rust.

I've done this, but in reverse: a nasty LD_PRELOAD hack into sway that overrides the wayland-server send buffer size.

https://gitlab.com/retropc/accepthack

seems to mitigate the issue

@Cloudef

Cloudef commented Jan 23, 2024

@retropc Sure ain't pretty, but I'm testing it now: Cloudef/nixos-flake@bd9e1b1. Thanks!
