This repository was archived by the owner on May 12, 2021. It is now read-only.

@staltz
Member

@staltz staltz commented Jan 25, 2020

This is mostly a small update to ssb-conn, here are some highlights (from most relevant to least):

  • mark old and failing peers as defunct in the DB
    • there are lots of dead pubs in your typical conn.json, and those are all going to be (at some point) attempted for a connection
    • I went through most of these and realized that those with hundreds of failures were very likely to be dead pubs
    • this update will detect pubs with 200+ failures, and mark them as defunct in the conn.json
    • defunct means this peer will never be attempted for a connection again by this scheduler (maybe other schedulers that people implement could ignore this)
    • we can't just delete these peers from the conn.json, because they would be re-added to conn.json when your SSB app queries the flumelog for messages of type "pub"
    • but peers marked "defunct" have a bunch of fields deleted, which means that conn.json shrinks
    • for instance mine went from 534 KB to 189 KB
  • update scheduler: remove neverJustOne, it was hard to justify it
    • tiny update to the scheduler's behavior
    • before, when it picked a pub to connect with, it always picked two of them to maximize chances of connecting
    • that was quite an arbitrary decision, and didn't always make sense, so I removed it
  • update ssb-conn-db with self-healing conn.json
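
A minimal sketch of the defunct-marking idea from the first highlight, in TypeScript. The `PeerEntry` shape and the helper names `markIfDefunct` and `connectable` are assumptions for illustration only; the real conn.json schema and the ssb-conn scheduler code may look different.

```typescript
// Hypothetical shape of one conn.json entry (NOT the real ssb-conn-db schema).
interface PeerEntry {
  key: string;
  failure?: number; // failed attempts since the last successful connection
  defunct?: boolean;
  [field: string]: unknown; // any other stored fields (host, port, ...)
}

// "200+ failures" is the threshold named in the description above.
const DEFUNCT_FAILURE_THRESHOLD = 200;

// Returns a slimmed-down entry for peers at or past the threshold,
// otherwise returns the entry unchanged.
function markIfDefunct(entry: PeerEntry): PeerEntry {
  if ((entry.failure ?? 0) < DEFUNCT_FAILURE_THRESHOLD) return entry;
  // The entry is kept (deleting it would let "pub" messages re-add it),
  // but most fields are dropped, which is what shrinks the file.
  return { key: entry.key, defunct: true };
}

// A scheduler that honors the flag simply skips defunct peers.
function connectable(entries: PeerEntry[]): PeerEntry[] {
  return entries.filter((e) => !e.defunct);
}
```

Other schedulers could ignore the `defunct` flag by not applying a filter like `connectable`.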

@christianbundy
Contributor

LGTM, thanks for this patch!

@christianbundy christianbundy merged commit b5230c6 into master Jan 25, 2020
@christianbundy
Contributor

I'm particularly excited about defunct, that's a super welcome addition.

@cinnamon-bun
Contributor

@christianbundy @staltz

Re defunct, how long does it take to accumulate 200+ failures, and do they have to be consecutive?

I'm worried about scenarios like...

  • I was traveling and couldn't connect to the pub in my house for a month so it became defunct
  • I was offline for a month and all pubs became defunct
  • I'm offline occasionally and all pubs ended up accumulating 200+ intermittent failures over time
  • I learned about a pub from another feed. It wasn't on the global internet so I marked it defunct. But then I visited their hackerspace in person and could have connected to it, but it was already defunct

E.g. will this work well in a world of peers and pubs that are not part of the globally connected internet?

One solution could be: when we hear a feed mention a pub, make it un-defunct. Since the feed mentioned it, it's probably still alive, we just can't reach it right now.

@black-puppydog
Contributor

black-puppydog commented Feb 19, 2020

@cinnamon-bun

when we hear a feed mention a pub

I'd say that does not really apply to pubs. Many people just try following them because they don't understand the invite system, or they try an old invite and it never follows back. But I do agree that any sign of life should reactivate the pub in conn.json. Usually that would mean a message that we see from the pub.

There's also the scenario of a pub just not being on 24/7. Think solar-powered rPi pub. So setting The Right Value ™️ for this is important. I'm already quite concerned about silent fracturing of the network due to undetected communication/gossip inhibition. This has the potential to lighten the work on the client, but it's important to make sure it doesn't prevent gossip from happening.

That all being said: thank you @staltz for this. I'm particularly happy about the self-healing. It's super important to improve robustness, or else we'll have to rely on out-of-band support on this here very technical platform for helping potentially very non-techy users. 👍

@staltz
Member Author

staltz commented Feb 19, 2020

Re defunct, how long does it take to accumulate 200+ failures, and do they have to be consecutive?

The count is the number of failed-to-connect events since the last succeeded-to-connect event. ssb-conn (like ssb-gossip before) puts an exponential backoff timeout between the attempt-to-connect events. So the more failures, the longer the timeout lasts, which can be something like hours (there is a maximum timeout). And it doesn't happen "every X hours" consistently, because if there is a quicker attempt-to-connect to another pub, then we don't even try the failed one.

All this is to say that I believe "soon-defunct pubs" (e.g. failure count at 150) are attempted-to-connect every ~24 hours or so, supposing that you have the SSB app online during all those 24 hours. So in my opinion the failure count would reach 200 in about 200 days, or maybe even a whole year. I think it's reasonable to assume that if a pub couldn't be reached after hundreds of attempts in a year, then we consider it defunct.
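
To make the backoff reasoning above concrete, here is a rough sketch of a capped exponential backoff. All constants (`BASE_DELAY_MS`, `BACKOFF_FACTOR`, `MAX_DELAY_MS`) and both function names are assumptions chosen for illustration, not ssb-conn's actual settings or API.

```typescript
// Illustrative constants only, NOT the values ssb-conn really uses.
const BASE_DELAY_MS = 1000;
const BACKOFF_FACTOR = 2;
const MAX_DELAY_MS = 24 * 60 * 60 * 1000; // assumed cap of ~24 hours

// Delay before the next attempt, given the failures since the last success.
function backoffDelayMs(failures: number): number {
  const raw = BASE_DELAY_MS * Math.pow(BACKOFF_FACTOR, failures);
  return Math.min(raw, MAX_DELAY_MS);
}

// Days of accumulated waiting while the count climbs from `from` to `to`,
// assuming the app is online the whole time and every attempt fails.
function daysToReach(from: number, to: number): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  let totalMs = 0;
  for (let n = from; n < to; n++) totalMs += backoffDelayMs(n);
  return totalMs / msPerDay;
}
```

Under these assumed numbers the delay hits the cap after a few dozen failures, so most of the climb to 200 happens at roughly one attempt per day, which is in line with the estimate above.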

I was traveling and couldn't connect to the pub in my house for a month so it became defunct

If you are truly offline (don't have an ethernet or wifi network interface active), then those pubs won't even be attempted-to-connect to begin with, so their failure count would not get incremented. Even if they were attempted-to-connect, I believe that in a month the count would go up by at most 50.

I was offline for a month and all pubs became defunct

Same as above.

I'm offline occasionally and all pubs ended up accumulating 200+ intermittent failures over time

If you're offline occasionally, and if the pub succeeds-to-connect when the failure count is (say) 120, then the failure count would reset immediately back to zero.
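
As a sketch of that reset behavior: the class below counts failures since the last success. `FailureCounter` and its method names are hypothetical and only illustrate the rule; they are not taken from ssb-conn.

```typescript
// Hypothetical tracker: failed attempts since the last successful connection.
class FailureCounter {
  private failures = 0;

  // Called after a failed attempt-to-connect; returns the new count.
  recordFailure(): number {
    this.failures += 1;
    return this.failures;
  }

  // Called after a succeeded-to-connect event; the count starts over.
  recordSuccess(): void {
    this.failures = 0;
  }

  get count(): number {
    return this.failures;
  }
}
```

Because a single success clears the count, intermittent failures only add up while no connection succeeds in between.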

E.g. will this work well in a world of peers and pubs that are not part of the globally connected internet?

Yes. On the other hand, regarding network partitions in general, suppose you are in China, and because of the Great Firewall, suppose you can't connect to pubs in the US. In real life, whether a person is dead or whether they now permanently live (they are alive!) on Jupiter, doesn't really matter to you because they are out of your reach, therefore defunct.

I learned about a pub from another feed. It wasn't on the global internet so I marked it defunct. But then I visited their hackerspace in person and could have connected to it, but it was already defunct
...
One solution could be: when we hear a feed mention a pub, make it un-defunct. Since the feed mentioned it, it's probably still alive, we just can't reach it right now.

This is a really good point. If a pub is mentioned at time X, and it was marked defunct at time Y, and X > Y, then I believe we should resurrect the peer. This would require adding the timeOfDeath as a timestamp when marking them as defunct. I opened issue ssbc/ssb-conn#14 for that.
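
A small sketch of that resurrection rule, assuming a `timeOfDeath` field stored in milliseconds. The `DefunctPeer` shape and the function name `resurrectIfMentionedLater` are made up for illustration; the linked issue may end up with a different design.

```typescript
// Hypothetical record; only timeOfDeath is named in the comment above.
interface DefunctPeer {
  key: string;
  defunct: boolean;
  timeOfDeath?: number; // assumed: milliseconds since epoch, set when marked defunct
}

// Resurrect the peer only when the mention (time X) is newer than
// the moment it was marked defunct (time Y), i.e. X > Y.
function resurrectIfMentionedLater(peer: DefunctPeer, mentionTime: number): DefunctPeer {
  if (!peer.defunct || peer.timeOfDeath === undefined) return peer;
  if (mentionTime <= peer.timeOfDeath) return peer;
  return { key: peer.key, defunct: false };
}
```

Older mentions that are replicated late (X <= Y) leave the peer defunct, which is why the timestamp is needed at all.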

@staltz staltz deleted the conn15 branch February 19, 2020 13:39
@staltz
Member Author

staltz commented Feb 21, 2020

@cinnamon-bun I was so wrong about defunct 😱. Apparently my Manyverse account now doesn't connect to anything at all. I looked at my conn.json * and lots (not all) of my peers are marked defunct. In hindsight, I should have known that storing unnecessary kilobytes and unnecessarily trying some pubs is a much smaller problem than having (many) false positives when declaring a peer defunct. I think a better implementation of defunctness will be timestamp-based: only mark a peer defunct if there are hundreds of failed connections and the last successful connection was ~1 year ago. But I'm considering making a hotfix in ssb-conn to just sidestep it for now.

* That said, I also seem to have a problem with my public feed: after 1 or 2 scrolls, it doesn't load more messages, indicating that there might be a database problem (like a JS error that fails silently and doesn't cause a crash) which then kills the JS execution but doesn't kill the app. That would also explain why ssb-conn doesn't run, because no JS is running. But anyway, there is stuff to investigate and fix.
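
A sketch of what the timestamp-based rule proposed above might look like. The threshold of 200, the `PeerStats` shape, the reading of "~1 year ago" as "at least one year ago", and the choice to never declare a peer defunct when no successful connection was ever recorded are all assumptions made here, not decisions taken in ssb-conn.

```typescript
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000;
const MIN_FAILURES = 200; // assumed stand-in for "hundreds of failed connections"

// Hypothetical per-peer statistics.
interface PeerStats {
  failure: number; // failures since the last success
  lastSuccess?: number; // ms since epoch of the last successful connection
}

// Defunct only if BOTH conditions hold: many failures AND a long time
// since the last success. Either one alone is not enough.
function isDefunct(stats: PeerStats, now: number): boolean {
  if (stats.failure < MIN_FAILURES) return false;
  // Assumption: with no recorded success we stay conservative
  // and avoid a possible false positive.
  if (stats.lastSuccess === undefined) return false;
  return now - stats.lastSuccess >= ONE_YEAR_MS;
}
```

Requiring both conditions trades a larger conn.json and some wasted attempts for fewer false positives, which is the trade-off argued for above.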

@christianbundy
Contributor

FWIW I think Sami mentioned this problem where they aren't connecting to anyone anymore without using an invite or something. If you ping me when the hotfix is ready I can release a new Patchwork ASAP.

@cinnamon-bun
Contributor

@staltz Thanks for answering my worries! I apologize for posting them as a wall of questions like that, I wish I had expressed more gratitude. ❤️

Overall there's no way to know if something is truly dead or just not visible to us for a long time. I agree it makes sense to give up after a long time, I'm just worried that it's permanent. Maybe if defunct peers were still attempted once a month, or something, that would be more resilient.
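
For example, a non-permanent variant could be sketched like this. The 30-day interval, the `DefunctRecord` shape, and the `shouldAttempt` helper are hypothetical and only illustrate the "still attempted once a month" idea.

```typescript
const RETRY_INTERVAL_MS = 30 * 24 * 60 * 60 * 1000; // assumed "once a month"

// Hypothetical record for a peer in conn.json.
interface DefunctRecord {
  key: string;
  defunct: boolean;
  lastAttempt: number; // ms since epoch of the most recent connection attempt
}

// Non-defunct peers stay eligible for normal scheduling; defunct peers
// become eligible again once the retry interval has passed.
function shouldAttempt(peer: DefunctRecord, now: number): boolean {
  if (!peer.defunct) return true;
  return now - peer.lastAttempt >= RETRY_INTERVAL_MS;
}
```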
