Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modes array for output keeps growing #5044

Open
ammgws opened this issue Feb 26, 2020 · 20 comments
Open

modes array for output keeps growing #5044

ammgws opened this issue Feb 26, 2020 · 20 comments

Comments

@ammgws
Copy link
Contributor

ammgws commented Feb 26, 2020

Every now and again (usually after leaving the computer for several hours), I lose all ability to talk with the sway IPC, however it can work intermittently:
image

strace on the script when the above error has occurred, which seems like the sway server has gone:
asciicast

For reference this is what normal output looks like:
asciicast

So far the simplest way I have been able to reproduce this (albeit have only tried this once so maybe I just got lucky), was to have 5 terminals endlessly looping swaymsg -t get_tree and then leaving it until the next morning.

Need some pointers on what else to try in order to pinpoint what is going on.

Environment

  • sway version 1.4-46561fc9 (Feb 7 2020, branch 'makepkg'), though I have been seeing this from at least around January this year
  • haven't managed to get a useful debug log yet (Every time I have come across this issue has been after leaving the computer for a few hours and by then the log has gotten out of control)
  • haven't tried with default sway config yet (still investigating exactly how to reproduce this issue)

Possibly related to #5043

@ammgws
Copy link
Contributor Author

ammgws commented Feb 27, 2020

Have been stepping through sway and swaymsg with gdb but haven't been able to pinpoint anything yet.

Interestingly, apart from get_tree and get_outputs, all the other commands work, even though they are all using the same socket fd... :/

@ammgws
Copy link
Contributor Author

ammgws commented Feb 29, 2020

OK I finally pinpointed it down.

The server is shutting down the IPC connection because the response is too large (just over 4MB) (ipc-server:912

I dumped the response contents using gdb:
https://raw.githubusercontent.com/ammgws/sway/temporary/gdbmemdump.prettyjson

The reason the response is so large is because the modes array for one of my three monitors has tens of thousands of entries:

rg "refresh\":" gdbmemdump.prettyjson | wc -l
73481

@emersion Any pointers on how to continue debugging this issue? I have no idea why it's reporting so many different (refresh) modes.

@emersion
Copy link
Member

Eh, that's pretty interesting. It's a good thing that swaywm/wlroots#1968 has been merged, otherwise all these modes would have been sent to each client.

Can you check whether https://github.com/ascent12/drm_info reports also these modes? And can you try reading the EDID with https://git.linuxtv.org/edid-decode.git/ too?

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

I restarted my computer before I could check the output of those commands, but it just happened again so here they are:

before=output of the command from 2 days ago
after=now

drm_info -j (it cuts off at 612 lines if not in JSON format)
before clock appears 107 times
after clock appears 107 times

edid-decode
before
after

Both appear to be the same to me.

@emersion
Copy link
Member

emersion commented Mar 3, 2020

This seems like an amdgpu bug to me.

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

Starting two days ago, I ran get_outputs every 5 minutes and this was the result: https://gist.github.com/ammgws/39fde4a420f448cf4b5cec4b28d3c890

You can see the output start growing in size after the March 3 00:47 entry.
There is a pattern with the size increase, as the exact same increments (94941, 92064) appear a lot, which means its probably just repeating the same data each time.

cat swaymsg_getoutputs_382 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
32
cat swaymsg_getoutputs_383 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
280
cat swaymsg_getoutputs_384 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
1303
cat swaymsg_getoutputs_385 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
2295
cat swaymsg_getoutputs_387 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
4310
cat swaymsg_getoutputs_388 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
5302
cat swaymsg_getoutputs_389 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
6325
cat swaymsg_getoutputs_390 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
7317
cat swaymsg_getoutputs_391 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
8340
cat swaymsg_getoutputs_392 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
9332
cat swaymsg_getoutputs_394 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
11347
cat swaymsg_getoutputs_395 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
12339
cat swaymsg_getoutputs_396 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
13362
...
etc

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

@emersion Was there something in the output that made you think that?

@emersion
Copy link
Member

emersion commented Mar 3, 2020

Yes, the kernel reporting duplicate modes (see drm_info output). Please report this to amdgpu.

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

OK, I will see if I can get around to reporting it to amdgpu sometime soon.

Although I wonder what triggers sway to start increasing the modes array since the output of drm_info looks unchanged from the start, yet the increase in get_outputs appears to start at a random time.

@emersion
Copy link
Member

emersion commented Mar 3, 2020

Wait, can you dump your configuration? Do you use output … mode --custom?

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

My config is here

Not using --custom

@emersion
Copy link
Member

emersion commented Mar 3, 2020

I see. Are you sure the drm_info output doesn't grow?

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

20200303_23h26m48s

I only ran it once before and after so I am not sure if it grew in between or not.
I can try recording that next time as well?

@ammgws
Copy link
Contributor Author

ammgws commented Mar 3, 2020

I could try get a sway debug log as well, but I need something will autorotate/trim the log file while sway is running, otherwise it gets too big too soon and I don't have much HDD space at the moment.

Would you happen to have any recommendations for that?

@ammgws
Copy link
Contributor Author

ammgws commented Mar 4, 2020

This time I ran drm_info at the same time. The output of drm_size doesn't balloon, but it does fluctuate a little bit (haven't checked the actual files yet but I think the suspect monitor is dropping in and out), and when those fluctuations happen it affects sway.

https://gist.githubusercontent.com/ammgws/2794be819518e885b9f92c627c9a13a0/raw/95c5a8f7b8e25e96543cda34de8a03dc47d4ab59/sw.log

I may have a way to be able to trigger this without leaving it all day, so in that case I can provide a sway debug log as well.

@ammgws
Copy link
Contributor Author

ammgws commented Mar 4, 2020

Haven't had time to cull it down further, but this is my sway debug log (cut columns to 150 chars): https://gist.github.com/324282fc87fdc32bded8b49972bf4143

The monitor in question (HDMI-A-1) is connecting/disconnecting a lot. I guess every time the modeset is requested it keeps adding the modes to the ever-growing modes array?

rg "' .*connected" sway_cut.log | wc -l
409
rg "Requesting modeset for 'HDMI-A-1'" sway_cut.log | wc -l
202

This is the size of get_outputs at 5 min intervals:

2020-03-04 19:27 12977
2020-03-04 19:32 12977
2020-03-04 19:37 12977
2020-03-04 19:42 12977
2020-03-04 19:47 12981
2020-03-04 19:52 12981
2020-03-04 19:57 8891
2020-03-04 20:02 113647
2020-03-04 20:07 205711
2020-03-04 20:12 300652
2020-03-04 20:17 392716
2020-03-04 20:22 487657

I will have a further look tomorrow to see if there is a smoking gun in the log.

@ammgws ammgws changed the title IPC server/connection randomly dies modes array for output keeps growing Mar 16, 2020
@ammgws
Copy link
Contributor Author

ammgws commented Mar 16, 2020

sort swaymsg_getoutputs_41 | uniq | wc -l
116
sort swaymsg_getoutputs_1 | uniq | wc -l
113

sort swaymsg_getoutputs_41 | uniq -c | sort -nr | bat

1 │ 38665 {
2 │ 38662 },
3 │ 9979 "height": 480,
4 │ 9978 "refresh": 60000
5 │ 9972 "width": 720,
6 │ 8732 "width": 1280,
7 │ 8728 "refresh": 50000
8 │ 7483 "refresh": 59940
9 │ 7481 "width": 1920,
10 │ 7481 "height": 1080,
11 │ 6235 "height": 720,
12 │ 4988 "width": 640,
13 │ 3741 "height": 576,
14 │ 2496 "width": 800,
15 │ 2496 "height": 600,
16 │ 2494 "width": 1024,
17 │ 2494 "height": 768,
18 │ 1253 "width": 1440,
19 │ 1248 "height": 1024,
20 │ 1247 "width": 1680,
21 │ 1247 "refresh": 60317
22 │ 1247 "refresh": 60020
23 │ 1247 "refresh": 60004
24 │ 1247 "refresh": 59901
25 │ 1247 "refresh": 59883
26 │ 1247 "refresh": 56250
27 │ 1247 "height": 900,
28 │ 1247 "height": 800,
29 │ 1247 "height": 1050,
30 │ 1246 "refresh": 70082
31 │ 1246 "refresh": 70069
32 │ 1246 "refresh": 66667
33 │ 1246 "refresh": 59910
34 │ 1246 "height": 400,
35 │ 15 },
36 │ 13 "width": 0,
37 │ 11 "y": 0,
38 │ 11 "height": 0
39 │ 10 "x": 0,
40 │ 9 ],
41 │ 3 "window_rect": {
42 │ 3 "window": null,
43 │ 3 "urgent": false,
44 │ 3 "type": "output",
45 │ 3 "transform": "normal",

This suggests to me that the same modes are duplicated over and over again.
To be honest I'm a bit lost in the code base, but I'm trying to track down whether these duplicate entries already exist in wlr_output->modes before it's used by the IPC code, or whether the duplication happens after that.

@emersion Would checking for duplicates while building the modes_array JSON be an acceptable solution? Or is it just a bandaid fix?

@emersion
Copy link
Member

Would checking for duplicates while building the modes_array JSON be an acceptable solution? Or is it just a bandaid fix?

No, the root cause needs to be discovered.

@kliu128
Copy link

kliu128 commented Jul 3, 2020

I believe I can reproduce this issue - in my case, it appears to coincide with a noticeable cursor lag when moving between window boundaries. Debug log here: https://gist.github.com/kliu128/8125d498a470f682726bce1389f80264 I'm using one HiDPI 3000x2000 internal monitor plus two HDMI monitors connected through separate USB-C adapters, which it seems every Linux display manager has choken on (GNOME refuses to use DPMS to turn them off, KDE Wayland thinks one of them disconnects whenever it turns off) except for sway for the most part. The issue seems to manifest primarily after using DPMS to turn off the displays and then waiting a long time. It seems like this cursor lag issue has also been seen on reddit, e.g. https://www.reddit.com/r/swaywm/comments/hf9b8r/cursor_lag_when_crossing_window_boundaries/, but not sure if it's the same cause.

A temporary workaround I've found is to run swaymsg reload before turning the displays back on, as this seems to reset the modes array and make the problem go away.

@ony
Copy link

ony commented Oct 20, 2020

I can easily reproduce by unplugging and plugging back HDMI cable in my AMD GPU connected to monitor through passive DVI-D to HDMI adapter. But swaymsg reload doesn't reset array in this case and drm_info shows no duplicates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants