clay, ames: applying change to widely distributed desk is very slow #6926

Open
Fang- opened this issue Feb 26, 2024 · 5 comments

Comments

@Fang-
Member

Fang- commented Feb 26, 2024

I've been suffering this for a while now, but don't think an issue for this has ever been filed, so here we go:

Initiating software updates on ~paldev is a multi-hour affair. Running |commit to apply a change on a public desk causes it to <<sync>> for a long time, essentially self-DoS-ing.

Observations:

  • The amount of time this takes to process seems proportional to how many people have installed from the desk.
    • Running |commit for less popular apps is relatively fast, on the order of minutes to tens of minutes.
    • Running |commit for popular apps like pals and rumors is extremely slow to resolve, on the order of hours.
  • While this is happening, the serf is more or less pegging a CPU core to 100%. The VPS it's on isn't that slow/old (e2-medium in gcloud).
  • paldev has never run out of memory during a |commit, despite sometimes operating close to its memory limit. (Edit: I should not have written that. It has since happened.)

Happy to help out with additional analysis as desired, but I'm assuming we have good guesses as to where this slowness comes from.

@pkova
Collaborator

pkova commented Feb 26, 2024

My bet is on the following:

Clay creates a new ames flow for every aeon of the desk -> congestion control for all subscribers is reset -> insane amounts of ames effects to notify subscribers of the new commit.

Combine this with the fact that most subscribers will be offline, so ~paldev will retry every two minutes for tons of ships, multiplied by tons of commits, and you get 1 GB per day of event log growth.
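
To make the scaling in this hypothesis concrete, here is a rough back-of-the-envelope sketch in Python. It is not derived from the Ames or Clay source: the function names, the offline fraction, the number of pending commits, and the bytes-per-event figure are all hypothetical placeholders. Only the two-minute retry interval and the "ships multiplied by commits" shape come from the comment above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def retry_events_per_day(subscribers, offline_fraction, pending_commits,
                         retry_interval_s=120):
    """Estimate retry events per day under the hypothesis above.

    Assumes (hypothetically) that every offline subscriber has one
    unacknowledged flow per pending commit, and that each such flow
    is retried once per retry interval (two minutes in the comment).
    """
    offline_ships = subscribers * offline_fraction
    retries_per_flow = SECONDS_PER_DAY / retry_interval_s
    return offline_ships * pending_commits * retries_per_flow


def log_growth_bytes_per_day(events_per_day, bytes_per_event):
    """Convert an event count into event log growth.

    bytes_per_event is a made-up parameter, not a measured value.
    """
    return events_per_day * bytes_per_event


if __name__ == "__main__":
    # All inputs below are illustrative guesses, not measurements.
    events = retry_events_per_day(subscribers=1300,
                                  offline_fraction=0.5,
                                  pending_commits=10)
    growth = log_growth_bytes_per_day(events, bytes_per_event=200)
    print(f"retry events per day: {events:,.0f}")
    print(f"event log growth per day: {growth / 1e9:.2f} GB")
```

The point of the sketch is only that the retry count is a product of three factors, so either more subscribers or more unacknowledged commits inflates it linearly, and both together inflate it multiplicatively.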

@bacwyls
Contributor

bacwyls commented Mar 20, 2024

I would add that this is a newer issue for me. |commit on %houston with 160 subscribers took over an hour, and on %radio with 1300 subscribers it took several hours. These used to be on the order of minutes before 412. It's also worth noting that these updates were just a line of text in sys.kelvin.

@bacwyls
Contributor

bacwyls commented Mar 25, 2024

[EDIT: crash unrelated]

    data: [%spot [%sys %vane %ames %hoon 0] [3385 15] 3400 35]
leak: 0x25a3dc918 mug=0 swept=1
    size: B/24
    data: [%spot [%sys %vane %ames %hoon 0] [3397 33] 3397 57]
leak: 0x25a3dc930 mug=0 swept=1
    size: B/24
    data: [%spot [%sys %vane %ames %hoon 0] [3397 11] 3400 35]
leak: 0x25a3dc948 mug=0 swept=1
    size: B/24
    data: [%spot [%sys %vane %ames %hoon 0] [3393 11] 3400 35]
leak: 0x25a3dc960 mug=0 swept=1
    size: B/24
    data: [%spot [%sys %vane %ames %hoon 0] [3392 11] 3400 35]
leak: 0x25a3dc978 mug=0 swept=1
    size: B/32
    data: (null)
leak: 0x25a3dca38 mug=0 swept=1
    size: B/32
    data: (null)
leak: 0x25a3dca88 mug=2ed6bdb9 swept=1
    size: B/24
loom: external fault: 0x2a57f8064 (0x200000000 : 0x280000000)

Assertion '0' failed in pkg/noun/manage.c:1791

Here is the potentially related crash which has happened multiple times. Thousands of lines of printout like this.

@joemfb
Member

joemfb commented Mar 25, 2024

@bacwyls you're just running out of memory while trying to copy out to the home road (the crash handler itself is crashing; the leaks are happening there and are not persistent). Melding or running with a larger loom should get you past this. If you can't do either, you'll need to find some state to delete and then |pack to free up space.
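
As a sanity check on the out-of-memory reading, the addresses in the `loom: external fault` line of the log above can be compared with a few lines of Python. This sketch assumes the two values in parentheses are the start and end of the loom's address range, which is an interpretation of the log format rather than something stated in this thread. The helper name `parse_fault` is made up for illustration.

```python
import re

# Copied verbatim from the crash output quoted earlier in this thread.
FAULT_LINE = "loom: external fault: 0x2a57f8064 (0x200000000 : 0x280000000)"


def parse_fault(line):
    """Extract (fault, base, end) addresses from an external-fault line.

    Assumption: the parenthesised pair is the loom's start and end.
    """
    pattern = r"external fault: (0x[0-9a-f]+) \((0x[0-9a-f]+) : (0x[0-9a-f]+)\)"
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not an external fault line")
    fault, base, end = (int(group, 16) for group in match.groups())
    return fault, base, end


fault, base, end = parse_fault(FAULT_LINE)
range_gib = (end - base) / 2**30      # size of the printed range in GiB
overshoot_mib = (fault - end) / 2**20  # distance of the fault past the end

if __name__ == "__main__":
    print(f"printed range: {range_gib:.0f} GiB")
    print(f"fault inside range: {base <= fault < end}")
    print(f"fault is {overshoot_mib:.0f} MiB past the end of the range")
```

Under that assumption, the faulting address lies beyond the end of the printed range, which fits the explanation that the ship ran out of loom space rather than hitting a persistent leak.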

@bacwyls
Contributor

bacwyls commented Apr 13, 2024

Bumping this. With this issue, updating a single line of text causes the distributor ship to be inoperable for several hours and leaves a trail of performance issues after the fact. Note also that this does not only apply to very popular desks; it affects every app dev.
