
bail meme: out of loom #3645

Closed
Zaphod101010 opened this issue Oct 5, 2020 · 19 comments

@Zaphod101010 commented Oct 5, 2020

Describe the bug
Ship crashes with error

bail: meme
bailing out
pier: serf error: end of file
~hodler-pinryx:dojo>
address 0x458 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: (0), function u3m_bail, file noun/manage.c, line 677.
[1]    99384 abort      ./urbit hodler-pinryx

I used the Mac binary to run meld from: #3235 (comment)

Output of meld:

a6a7b557a95cd246430ce68809d8188d40ae28f4-darwin/urbit-worker meld ./hodler-pinryx
loom: mapped 2048MB
boot: protected loom
live: loaded: GB/1.515.700.224
boot: installed 268 jets
serf: measuring memory:
  kernel: GB/1.309.939.032
total arvo stuff: GB/1.309.939.032
  warm jet state: KB/116.388
  cold jet state: MB/145.005.616
  hank cache: B/372
  battery hash cache: B/288
  call site cache: B/100
  hot jet state: KB/126.496
total jet stuff: MB/145.249.260
  bytecode programs: MB/1.078.368
  bytecode cache: KB/107.576
total nock stuff: MB/1.185.944
  trace buffer: B/36
  memoization cache: B/288
total road stuff: B/324
total marked: GB/1.456.374.560
free lists: MB/59.300.944
sweep: GB/1.456.374.560

hash-cons arena:
  root: B/144
  atoms (445039):
    refs: MB/10.284.580
    data: MB/164.433.182
    dict: MB/27.967.368
  total: MB/202.685.130
  cells (35463450):
    refs: MB/781.763.380
    dict: GB/3.439.758.872
  total: GB/4.221.522.252
total: GB/4.424.207.526

serf: measuring memory:
  kernel: MB/968.540.908
total arvo stuff: MB/968.540.908
  warm jet state: KB/84.400
  cold jet state: MB/56.481.148
  hank cache: B/288
  battery hash cache: B/288
  hot jet state: KB/124.408
total jet stuff: MB/56.690.532
  bytecode cache: B/288
total nock stuff: B/288
  memoization cache: B/288
total road stuff: B/288
total marked: GB/1.025.232.016
free lists: KB/150.220
sweep: GB/1.025.232.016

Started ship after meld, and it crashed again:

~hodler-pinryx:dojo> allocate: reclaim: memo cache: empty

bail: meme
bailing out
pier: serf error: end of file
~hodler-pinryx:dojo>
address 0x458 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: (0), function u3m_bail, file noun/manage.c, line 677.
[1]    3797 abort      ./urbit hodler-pinryx

System (please supply the following information, if relevant):

  • OS: MacOS 10.15.7

Additional context
My pier is 15.68 GB, which seems kind of large.

@dylanirlbeck

Getting a very similar error for my fakezod:

/zod
loom: mapped 2048MB
lite: arvo formula 50147a8a
lite: core 590c9d56
lite: final state 590c9d56
loom: mapped 2048MB
boot: protected loom
live: loaded: GB/2.134.016.000
boot: installed 268 jets
---------------- playback starting ----------------
pier: replaying events 237348-237675
allocate: reclaim: memo cache: empty

bail: meme
bailing out
pier: serf error: end of file

@tylershuster
Contributor

Yeah this has become a really serious problem recently, since the graph-store update. I get it very often on my planet if I don't |pack like an army brat and my star runs pretty hot too.

@matildepark
Contributor

What's the main culprit in |mass?

@tylershuster
Contributor

tylershuster commented Oct 7, 2020

%peers-known: MB/171.312.584
%ford-marks: MB/146.440.020
%channels: GB/1.380.147.520

so...channels for me. And yet...

[two screenshots attached]

@Zaphod101010
Author

My ship seems to be getting a slightly different error now: it was bail: meme, now it's bail: oops.

$ ./urbit hodler-pinryx
~
urbit 0.10.8
boot: home is /Volumes/T5/hodler-pinryx/hodler-pinryx
loom: mapped 2048MB
lite: arvo formula 50147a8a
lite: core 590c9d56
lite: final state 590c9d56
loom: mapped 2048MB
boot: protected loom
live: loaded: MB/476.659.712
boot: installed 268 jets
---------------- playback starting ----------------
pier: replaying events 189892162-189894681
eyre: canceling ~[//http-server/0v18.kt33l/2/1]
pier: (189892662): play: done
pier: (189893163): play: done
pier: (189893664): play: done
pier: (189894165): play: done
[%missing-subscription-in-unsubscribe /channel/subscription/1601987305632-5fd758/11]
[%missing-subscription-in-unsubscribe /channel/subscription/1601987305632-5fd758/12]
pier: (189894666): play: done
pier: (189894681): play: done
---------------- playback complete ----------------
ames: live on 55750
pier: serf error: end of file

address 0x458 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: (0), function u3m_bail, file noun/manage.c, line 677.
[1]    18736 abort      ./urbit hodler-pinryx

@precompute

precompute commented Oct 11, 2020

I have the same issue. I breached a couple of months ago, and had been using my moon exclusively since. I switched to my planet yesterday (it had only been relaying connections for my moon(s)), and encountered this issue after my planet half-joined a group (I think). The name came up as ~ship/name/group (literally "ship"). I did manage to leave it via Landscape, but the error persisted.

Used the build here: #3235 (comment)

~planet:dojo> \«ames»
bail: meme
bailing out
pier: serf error: end of file
~planet:dojo>
address 0x440 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

The planet does run for a while, and pulls in messages and data (like which ships have sunk), but it eventually crashes with some variation of the above.

My planet has ~14.7 million events.

@salmun-nister

I've had a similar issue a few times this evening.

~salmun-nister:dojo> allocate: reclaim: half of 1115 entries

bail: meme
bailing out
pier: serf error: end of file
~salmun-nister:dojo>
address 0x438 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

I have no idea if it's related, but I'm also seeing quite a few messages like these in dojo.

[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/11]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/12]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/14]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/16]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/18]

@nameless-nobody

nameless-nobody commented Oct 15, 2020

I am experiencing the same issue after I half-joined and/or half-unsubscribed from some channels/groups yesterday. My ship keeps crashing a few minutes after boot.

bail: meme
bailing out
pier: serf error: end of file
~marryp-worryd:dojo> 
address 0x438 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

+trouble:

  [%home-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%kids-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%glob-hash 0v1.bn7am.9sl00.vfh1o.uvsuf.dn9b7 %done]
  [%our ship=~marryp-worryd point='655494067' life=[~ 5] rift=[~ 3]]
  [%sponsor ship=~dirdev point='2995' life=[~ 8] rift=[~ 6]]
  [%dopzod ship=~dopzod point='4608' life=[~ 3] rift=[~ 2]]
  "Compare lifes and rifts to values here:"
  "https://etherscan.io/address/azimuth.eth#readContract"
  "  life - getKeyRevisionNumber"
  "  rift - getContinuityNumber"
  ~
]

@nameless-nobody

I also got this error while in Landscape, though I'm not sure if it's related:

jv@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:461:390751
Zo@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:58173
gs@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:104413
cl@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:96961
sl@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:96886
Qs@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:93916
Qs@[native code]
https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:45557
https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:322:4087
Wi@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:45503
Ui@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:45438
A@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:114587
Kt@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:22967
Kt@[native code]

screenshot:
[attachment: Screenshot 2020-10-15 at 12 57 59]

@precompute

Update to my previous comment: #3645 (comment)

It seems the issue has been resolved for me. After running meld with the binary mentioned in the OP, I switched back to the stable binary; my planet stopped crashing, and I was able to join the group that was giving me trouble. It also OTA'd to o3a14, so maybe that played a role as well.

It's been running for over a day now with no issues.

@nameless-nobody

@t-e-r-m I am on o3a14 but still crashing — what is meld?

@salmun-nister

Booted up my moon and I'm seeing the same behavior:

address 0x438 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted

output of +trouble:

> +trouble
[ [%base-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%home-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%kids-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%glob-hash 0v1.bn7am.9sl00.vfh1o.uvsuf.dn9b7 %done]
  [%our ship=~tartux-darsyl-salmun-nister point='8749394165105624742' life=[~ 1] rift=[~ 0]]
  [%sponsor ship=~salmun-nister point='184485542' life=[~ 1] rift=[~ 0]]
  [%dopzod ship=~dopzod point='4608' life=[~ 3] rift=[~ 2]]
  "Compare lifes and rifts to values here:"
  "https://etherscan.io/address/azimuth.eth#readContract"
  "  life - getKeyRevisionNumber"
  "  rift - getContinuityNumber"
  ~
]

@precompute

@david-xlvrs see #3235 (comment)

@salmun-nister

Thanks @t-e-r-m. I tried running meld with that binary (#3235 (comment)) and it completed successfully. Unfortunately, I'm still crashing when running the urbit binary from that release, as well as the latest stable release (v0.10.8).

@joemfb
Member

joemfb commented Oct 15, 2020

Hi All,

Sorry for the delayed response. I realize that the current state of our memory management tooling has not been fully explained in any one place, so I'll attempt to do so here.

First of all, the crash output of the current release is fraught, due to a) bad error messages and b) a crash-handling bug. Both problems will be fixed in the next release (#3471).

"pier: serf error: end of file" means "the worker process unexpectedly shut down" (in a unique dialect combining urbit and unix idioms). Everything from "address X out of loom!" to "Assertion failed" results from the crash-handling bug described above (when it follows the "end of file" error, that is).

The relevant detail precedes both: bail: meme. This means we ran out of memory on the loom (the persistent memory arena in urbit's runtime, currently limited to 2GB). This is currently always handled as a fatal error -- we're working on tooling to automatically recover where possible.
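
For intuition, the loom here behaves like any fixed-size arena allocator: every allocation draws from one bounded region, and exhausting it aborts the event rather than degrading gracefully. A toy Python sketch, with invented names (`Loom`, `Meme`) for illustration only, not urbit's actual allocator:

```python
class Meme(Exception):
    """Raised when the arena is exhausted (the analogue of bail: meme)."""

class Loom:
    """Toy fixed-size arena: a bounded pool with bump allocation."""
    def __init__(self, size):
        self.size = size   # total bytes available
        self.used = 0      # bytes handed out so far

    def alloc(self, n):
        """Return the offset of a fresh n-byte block, or fail hard."""
        if self.used + n > self.size:
            raise Meme(f"need {n} bytes, only {self.size - self.used} free")
        offset = self.used
        self.used += n
        return offset

loom = Loom(2 * 1024**3)       # 2GB, like the current loom limit
loom.alloc(1024**3)            # a large allocation succeeds
try:
    loom.alloc(2 * 1024**3)    # this one cannot fit: fatal, not recoverable
except Meme as exc:
    print("bail: meme:", exc)
```

In the real runtime the arena also persists across restarts, which is why defragmentation (|pack) and deduplication (|meld) can reclaim space where a simple restart cannot.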

These errors are sometimes ephemeral, since the loom holds both the persistent state and the "workspace" for processing a given event; in that case, restarting urbit is all that's required. But they usually recur, and require further intervention.

The first thing to try upon restart is |pack, which defragments the persistent state. The next thing to run is |mass, which prints a detailed memory-usage report. If the bail:meme crashes are recurring too quickly to run those, it can help to restart urbit with the -L argument, which effectively disables networking (by only listening on localhost). (If those crashes are being triggered by inbound packets, that is.)

If the above fails, is not possible due to crash frequency, or does not reclaim enough space to move forward, there are two remaining options. Both involve as-yet unreleased features.

|meld is a global deduplicator: it reallocates all persistent state off the loom, unifying any duplicate nouns it discovers along the way, and then copies them back to the loom. This usually reclaims a significant amount of state. The dojo command itself (|meld) depends on unreleased kernel functionality (coming to you in an OTA update any minute), and the runtime implementation is similarly unreleased (coming to you ASAP in v0.10.9). In the meantime, you can download pre-release binaries (from #3235 (comment)) and run the command through the urbit-worker process.

Note that |meld itself can use quite a bit of memory, depending on the degree of duplication in your persistent state. It's not subject to the loom limits, but instead depends on the actual RAM available in your environment. If it runs out of memory, the process will crash with a message something like:

ur: hashcons: dict_grow: allocation failed, out of memory

The only way to work around this is to run it on a machine with more memory. (There are various ways this implementation could be made to use less memory, most of them involving more sophisticated hashtable implementations. Reach out if you're the kind of person who likes writing hashtables!)
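
The unification step can be illustrated with generic hash-consing: walk the tree bottom-up and intern each cell in a table, so structurally equal subtrees collapse into one shared object. This is a hedged sketch of the general technique (trees as nested Python tuples), not urbit's noun representation; the interning table is the analogue of the dict whose growth can exhaust RAM as described above.

```python
def hash_cons(tree, table):
    """Return a canonical copy of `tree`, sharing equal subtrees.

    A tree is an atom (int) or a 2-tuple (head, tail), loosely like a
    noun. `table` maps each distinct cell to its single canonical copy;
    its size is the memory overhead the deduplicator itself pays.
    """
    if isinstance(tree, int):              # atoms need no interning here
        return tree
    cell = (hash_cons(tree[0], table), hash_cons(tree[1], table))
    return table.setdefault(cell, cell)    # reuse an existing equal cell

# Two structurally equal trees built independently...
table = {}
a = hash_cons((1, (2, (3, 4))), table)
b = hash_cons((1, (2, (3, 4))), table)
assert a is b                              # ...now share a single copy
```

Duplicated state like this is why |meld often reclaims a significant fraction of the loom: many identical subtrees end up stored once instead of many times.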

Finally, there's |trim, which asks the arvo kernel to delete persistent state that it safely can (mostly caches). The dojo command itself is unreleased (also coming to you soon in an OTA update), but the kernel and runtime implementations are already present, and can be triggered through a tedious manual process. To do so, you'll need to generate a memory-pressure event (%trim) and inject it into your ship on startup.

From the dojo, run the following:

> .trim/jam [//arvo trim/0]

This will write a trim.jam file into your pier in $pier/.urb/put/. Then, shutdown your ship gracefully (with ctrl-d or |exit), and restart it as follows:

$ path/to/urbit -I your-ship/.urb/put/trim.jam your-ship/

The following output will tell you that it worked:

~
urbit 0.10.8
...
pier: injecting %trim event on //arvo
...

at which point, you can run |mass again to see how much space was reclaimed.


Obviously, this recovery process is both unnecessarily complex and tediously manual. The next kernel and runtime releases will bring some affordances, but many more are still needed. Urbit should be able to avoid many of these memory-pressure scenarios proactively, and automatically recover from many more. This remains a top development priority. Most of this will involve more sophisticated memory management and error handling in the runtime -- I've been laying the foundations for this throughout 2020. But some of it will involve limits and their enforcement inside arvo (such as #3680, or better handling of |trim in various vanes). And still more will likely require new features, like the ability to archive or delete state in an ad-hoc manner from clay or gall agents.

@salmun-nister

Thank you very much @joemfb for the detailed response. The manual |trim process seems to have resolved my crash.

@sivner-figbus

My ship is broken. I'm getting regular "out of loom" crashes.

According to |mass (see https://hatebin.com/gcjhvqkfqb) I have 1.3GB of vane-e channels:
%channels: GB/1.310.520.720
|trim followed by |pack doesn't free any significant amount of these; afterwards:
%channels: GB/1.309.383.072

I can't "cram" or "meld" the box using the pre-release binaries mentioned above (full outputs at https://hatebin.com/hmgnwsldft); both crash like this:

0x200000018 is bogus
0x200000018 is bogus
address 0x2981faddc out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

@liam-fitzgerald
Member

It's worth noting that %eyre will only free one channel per |trim event, so you may need to inject the event more than once to avoid a bail:meme

@joemfb
Member

joemfb commented Dec 17, 2020

Closing in favor of #4182.

@joemfb joemfb closed this as completed Dec 17, 2020