
bail meme: out of loom #3645

Closed
Zaphod101010 opened this issue Oct 5, 2020 · 19 comments

@Zaphod101010 commented Oct 5, 2020

Describe the bug
Ship crashes with error

bail: meme
bailing out
pier: serf error: end of file
~hodler-pinryx:dojo>
address 0x458 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: (0), function u3m_bail, file noun/manage.c, line 677.
[1]    99384 abort      ./urbit hodler-pinryx

I used the Mac binary to run meld from: #3235 (comment)

Output of meld:

a6a7b557a95cd246430ce68809d8188d40ae28f4-darwin/urbit-worker meld ./hodler-pinryx
loom: mapped 2048MB
boot: protected loom
live: loaded: GB/1.515.700.224
boot: installed 268 jets
serf: measuring memory:
  kernel: GB/1.309.939.032
total arvo stuff: GB/1.309.939.032
  warm jet state: KB/116.388
  cold jet state: MB/145.005.616
  hank cache: B/372
  battery hash cache: B/288
  call site cache: B/100
  hot jet state: KB/126.496
total jet stuff: MB/145.249.260
  bytecode programs: MB/1.078.368
  bytecode cache: KB/107.576
total nock stuff: MB/1.185.944
  trace buffer: B/36
  memoization cache: B/288
total road stuff: B/324
total marked: GB/1.456.374.560
free lists: MB/59.300.944
sweep: GB/1.456.374.560

hash-cons arena:
  root: B/144
  atoms (445039):
    refs: MB/10.284.580
    data: MB/164.433.182
    dict: MB/27.967.368
  total: MB/202.685.130
  cells (35463450):
    refs: MB/781.763.380
    dict: GB/3.439.758.872
  total: GB/4.221.522.252
total: GB/4.424.207.526

serf: measuring memory:
  kernel: MB/968.540.908
total arvo stuff: MB/968.540.908
  warm jet state: KB/84.400
  cold jet state: MB/56.481.148
  hank cache: B/288
  battery hash cache: B/288
  hot jet state: KB/124.408
total jet stuff: MB/56.690.532
  bytecode cache: B/288
total nock stuff: B/288
  memoization cache: B/288
total road stuff: B/288
total marked: GB/1.025.232.016
free lists: KB/150.220
sweep: GB/1.025.232.016

Started ship after meld, and it crashed again:

~hodler-pinryx:dojo> allocate: reclaim: memo cache: empty

bail: meme
bailing out
pier: serf error: end of file
~hodler-pinryx:dojo>
address 0x458 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: (0), function u3m_bail, file noun/manage.c, line 677.
[1]    3797 abort      ./urbit hodler-pinryx

System (please supply the following information, if relevant):

  • OS: MacOS 10.15.7

Additional context
My pier is 15.68 GB, which seems kind of large.

@dylanirlbeck

Getting a very similar error for my fakezod:

/zod
loom: mapped 2048MB
lite: arvo formula 50147a8a
lite: core 590c9d56
lite: final state 590c9d56
loom: mapped 2048MB
boot: protected loom
live: loaded: GB/2.134.016.000
boot: installed 268 jets
---------------- playback starting ----------------
pier: replaying events 237348-237675
allocate: reclaim: memo cache: empty

bail: meme
bailing out
pier: serf error: end of file

@tylershuster
Contributor

Yeah this has become a really serious problem recently, since the graph-store update. I get it very often on my planet if I don't |pack like an army brat and my star runs pretty hot too.

@matildepark
Contributor

What's the main culprit in |mass?

@tylershuster
Contributor

tylershuster commented Oct 7, 2020

%peers-known: MB/171.312.584
%ford-marks: MB/146.440.020
%channels: GB/1.380.147.520

so...channels for me. And yet...

[two screenshots attached]

@Zaphod101010
Author

My ship seems to be getting a slightly different error now: it was bail: meme, now it's bail: oops.

$ ./urbit hodler-pinryx
~
urbit 0.10.8
boot: home is /Volumes/T5/hodler-pinryx/hodler-pinryx
loom: mapped 2048MB
lite: arvo formula 50147a8a
lite: core 590c9d56
lite: final state 590c9d56
loom: mapped 2048MB
boot: protected loom
live: loaded: MB/476.659.712
boot: installed 268 jets
---------------- playback starting ----------------
pier: replaying events 189892162-189894681
eyre: canceling ~[//http-server/0v18.kt33l/2/1]
pier: (189892662): play: done
pier: (189893163): play: done
pier: (189893664): play: done
pier: (189894165): play: done
[%missing-subscription-in-unsubscribe /channel/subscription/1601987305632-5fd758/11]
[%missing-subscription-in-unsubscribe /channel/subscription/1601987305632-5fd758/12]
pier: (189894666): play: done
pier: (189894681): play: done
---------------- playback complete ----------------
ames: live on 55750
pier: serf error: end of file

address 0x458 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: (0), function u3m_bail, file noun/manage.c, line 677.
[1]    18736 abort      ./urbit hodler-pinryx

@precompute

precompute commented Oct 11, 2020

I have the same issue. I breached a couple of months ago, and had been using my moon exclusively since. I switched to my planet yesterday (it had only been relaying connections for my moon(s)), and encountered this issue after my planet half-joined a group (I think). The name came up as ~ship/name/group (literally "ship"). I did manage to leave it via Landscape, but the error persisted.

Used the build here: #3235 (comment)

~planet:dojo> \«ames»
bail: meme
bailing out
pier: serf error: end of file
~planet:dojo>
address 0x440 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

The planet does run for a while, and pulls in messages and data (like which ships have sunk), but it eventually crashes with some variation of the above.

My planet has ~14.7 million events.

@salmun-nister

I've had a similar issue a few times this evening.

~salmun-nister:dojo> allocate: reclaim: half of 1115 entries

bail: meme
bailing out
pier: serf error: end of file
~salmun-nister:dojo>
address 0x438 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

I have no idea if it's related, but I'm also seeing quite a few messages like these in dojo.

[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/11]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/12]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/14]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/16]
[%missing-subscription-in-unsubscribe /channel/subscription/1602741107933-8a4d31/18]

@nameless-nobody

nameless-nobody commented Oct 15, 2020

I am experiencing the same issue after I half-joined and/or half-unsubscribed from some channels/groups yesterday. My ship keeps crashing a few minutes after boot.

bail: meme
bailing out
pier: serf error: end of file
~marryp-worryd:dojo> 
address 0x438 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

+trouble:

  [%home-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%kids-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%glob-hash 0v1.bn7am.9sl00.vfh1o.uvsuf.dn9b7 %done]
  [%our ship=~marryp-worryd point='655494067' life=[~ 5] rift=[~ 3]]
  [%sponsor ship=~dirdev point='2995' life=[~ 8] rift=[~ 6]]
  [%dopzod ship=~dopzod point='4608' life=[~ 3] rift=[~ 2]]
  "Compare lifes and rifts to values here:"
  "https://etherscan.io/address/azimuth.eth#readContract"
  "  life - getKeyRevisionNumber"
  "  rift - getContinuityNumber"
  ~
]

@nameless-nobody

I also got this error while in Landscape, though I'm not sure if it's related:

jv@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:461:390751
Zo@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:58173
gs@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:104413
cl@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:96961
sl@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:96886
Qs@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:93916
Qs@[native code]
https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:45557
https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:322:4087
Wi@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:45503
Ui@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:45438
A@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:114587
Kt@https://marryp-worryd.arvo.network/~landscape/js/bundle/index.efcd373f796e88fce081.js:314:22967
Kt@[native code]

screenshot:
[attachment: Screenshot 2020-10-15 at 12 57 59]

@precompute

Update to my previous comment: #3645 (comment)

It seems the issue has been resolved for me. After running meld with the binary mentioned in the OP, I switched back to the stable binary; my planet stopped crashing, and I was able to join the group that was giving me trouble. It also OTA'd to o3a14, so maybe that played a role as well.

It's been running for over a day now with no issues.

@nameless-nobody

@t-e-r-m I am on o3a14 but still crashing — what is meld?

@salmun-nister

Booted up my moon and I'm seeing the same behavior:

address 0x438 out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted

output of +trouble:

> +trouble
[ [%base-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%home-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%kids-hash 0vj.en1di.tkklb.dl7ss.mskql.baju8.f3tk0.4hvbv.umcms.qm49p.o3a14]
  [%glob-hash 0v1.bn7am.9sl00.vfh1o.uvsuf.dn9b7 %done]
  [%our ship=~tartux-darsyl-salmun-nister point='8749394165105624742' life=[~ 1] rift=[~ 0]]
  [%sponsor ship=~salmun-nister point='184485542' life=[~ 1] rift=[~ 0]]
  [%dopzod ship=~dopzod point='4608' life=[~ 3] rift=[~ 2]]
  "Compare lifes and rifts to values here:"
  "https://etherscan.io/address/azimuth.eth#readContract"
  "  life - getKeyRevisionNumber"
  "  rift - getContinuityNumber"
  ~
]

@precompute

@david-xlvrs see #3235 (comment)

@salmun-nister

Thanks @t-e-r-m. I tried running meld with that binary (#3235 (comment)) and it completed successfully. Unfortunately, I'm still crashing when running the urbit binary from that release, as well as the latest stable release (v0.10.8).

@joemfb
Member

joemfb commented Oct 15, 2020

Hi All,

Sorry for the delayed response. I realize that the current state of our memory management tooling has not been fully explained in any one place, so I'll attempt to do so here.

First of all, the crash output of the current release is fraught, due to a) bad error messages and b) a crash-handling bug. Both problems will be fixed in the next release (#3471).

"pier: serf error: end of file" means "the worker process unexpectedly shut down" (in a unique dialect combining urbit and unix idioms). Everything from "address X out of loom!" to "Assertion failed" results from the crash-handling bug described above (when it follows the "end of file" error, that is).

The relevant detail precedes both: bail: meme. This means we ran out of memory on the loom (the persistent memory arena in urbit's runtime, currently limited to 2GB). This is currently always handled as a fatal error -- we're working on tooling to automatically recover where possible.
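
For intuition, the loom here behaves like any fixed-size arena allocator: every allocation draws from one bounded region, and exhausting it aborts the event rather than degrading gracefully. A toy Python sketch, with invented names (`Loom`, `Meme`) for illustration only, not urbit's actual allocator:

```python
class Meme(Exception):
    """Raised when the arena is exhausted (the analogue of bail: meme)."""

class Loom:
    """Toy fixed-size arena: a bounded pool with bump allocation."""
    def __init__(self, size):
        self.size = size   # total bytes available
        self.used = 0      # bytes handed out so far

    def alloc(self, n):
        """Return the offset of a fresh n-byte block, or fail hard."""
        if self.used + n > self.size:
            raise Meme(f"need {n} bytes, only {self.size - self.used} free")
        offset = self.used
        self.used += n
        return offset

loom = Loom(2 * 1024**3)       # 2GB, like the current loom limit
loom.alloc(1024**3)            # a large allocation succeeds
try:
    loom.alloc(2 * 1024**3)    # this one cannot fit: fatal, not recoverable
except Meme as exc:
    print("bail: meme:", exc)
```

In the real runtime the arena also persists across restarts, which is why defragmentation (|pack) and deduplication (|meld) can reclaim space where a simple restart cannot.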

These errors are sometimes ephemeral, since the loom holds both the persistent state and the "workspace" for processing a given event; in that case, restarting urbit is all that's required. But they usually recur, and require further intervention.

The first thing to try upon restart is |pack, which defragments the persistent state. The next thing to run is |mass, which prints a detailed memory-usage report. If the bail:meme crashes are recurring too quickly to run those, it can help to restart urbit with the -L argument, which effectively disables networking (by only listening on localhost). (If those crashes are being triggered by inbound packets, that is.)

If the above fails, is not possible due to crash frequency, or does not reclaim enough space to move forward, there are two remaining options. Both involve as-yet unreleased features.

|meld is a global deduplicator: it reallocates all persistent state off the loom, unifying any duplicate nouns it discovers along the way, and then copies them back to the loom. This usually reclaims a significant amount of state. The dojo command itself (|meld) depends on unreleased kernel functionality (coming to you in an OTA update any minute), and the runtime implementation is similarly unreleased (coming to you ASAP in v0.10.9). In the meantime, you can download pre-release binaries (from #3235 (comment)) and run the command through the urbit-worker process.

Note that |meld itself can use quite a bit of memory, depending on the degree of duplication in your persistent state. It's not subject to the loom limits, but instead depends on the actual RAM available in your environment. If it runs out of memory, the process will crash with a message something like:

ur: hashcons: dict_grow: allocation failed, out of memory

The only way to work around this is to run it on a machine with more memory. (There are various ways this implementation could be made to use less memory, most of them involving more sophisticated hashtable implementations. Reach out if you're the kind of person who likes writing hashtables!)
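
The unification step can be illustrated with generic hash-consing: walk the tree bottom-up and intern each cell in a table, so structurally equal subtrees collapse into one shared object. This is a hedged sketch of the general technique (trees as nested Python tuples), not urbit's noun representation; the interning table is the analogue of the dict whose growth can exhaust RAM as described above.

```python
def hash_cons(tree, table):
    """Return a canonical copy of `tree`, sharing equal subtrees.

    A tree is an atom (int) or a 2-tuple (head, tail), loosely like a
    noun. `table` maps each distinct cell to its single canonical copy;
    its size is the memory overhead the deduplicator itself pays.
    """
    if isinstance(tree, int):              # atoms need no interning here
        return tree
    cell = (hash_cons(tree[0], table), hash_cons(tree[1], table))
    return table.setdefault(cell, cell)    # reuse an existing equal cell

# Two structurally equal trees built independently...
table = {}
a = hash_cons((1, (2, (3, 4))), table)
b = hash_cons((1, (2, (3, 4))), table)
assert a is b                              # ...now share a single copy
```

Duplicated state like this is why |meld often reclaims a significant fraction of the loom: many identical subtrees end up stored once instead of many times.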

Finally, there's |trim, which asks the arvo kernel to delete persistent state that it safely can (mostly caches). The dojo command itself is unreleased (also coming to you soon in an OTA update), but the kernel and runtime implementations are already present, and can be triggered through a tedious manual process. To do so, you'll need to generate a memory-pressure event (%trim) and inject it into your ship on startup.

From the dojo, run the following:

> .trim/jam [//arvo trim/0]

This will write a trim.jam file into your pier in $pier/.urb/put/. Then, shutdown your ship gracefully (with ctrl-d or |exit), and restart it as follows:

$ path/to/urbit -I your-ship/.urb/put/trim.jam your-ship/

The following output will tell you that it worked:

~
urbit 0.10.8
...
pier: injecting %trim event on //arvo
...

at which point, you can run |mass again to see how much space was reclaimed.


Obviously, this recovery process is both unnecessarily complex and tediously manual. The next kernel and runtime releases will bring some affordances, but many more are still needed. Urbit should be able to avoid many of these memory-pressure scenarios proactively, and automatically recover from many more. This remains a top development priority. Most of this will involve more sophisticated memory management and error handling in the runtime -- I've been laying the foundations for this throughout 2020. But some of it will involve limits and their enforcement inside arvo (such as #3680, or better handling of |trim in various vanes). And still more will likely require new features, like the ability to archive or delete state in an ad-hoc manner from clay or gall agents.

@salmun-nister

Thank you very much @joemfb for the detailed response. The manual |trim process seems to have resolved my crash.

@sivner-figbus

My ship is broken. I'm getting regular "out of loom" crashes.

According to |mass (see https://hatebin.com/gcjhvqkfqb) I have 1.3GB of vane-e channels:
%channels: GB/1.310.520.720
|trim followed by |pack doesn't free any significant amount of these; afterwards:
%channels: GB/1.309.383.072

I can't "cram" or "meld" the box using the pre-release binaries mentioned above (full outputs at https://hatebin.com/hmgnwsldft); both crash like this:

0x200000018 is bogus
0x200000018 is bogus
address 0x2981faddc out of loom!
loom: [0x200000000 : 0x280000000)
Assertion '0' failed in noun/events.c:129

bail: oops
bailing out
Assertion failed: 0 (noun/manage.c: u3m_bail: 677)
Aborted (core dumped)

@liam-fitzgerald
Member

It's worth noting that %eyre will only free one channel per |trim event, so you may need to inject the event more than once to avoid a bail:meme

@joemfb
Member

joemfb commented Dec 17, 2020

Closing in favor of #4182.

@joemfb joemfb closed this as completed Dec 17, 2020