This repository was archived by the owner on Dec 21, 2021. It is now read-only.

Conversation

@timoxley timoxley commented Sep 21, 2020

This fixes two edge-case issues with sequencing & publishing. Both result in silent/subtle failures, so they may or may not affect existing streams and simply haven't been noticed.

Problem 1: Backdated messages corrupt sequencing

Publishing back-dated (i.e. non-sequentially timestamped) messages breaks sequencing. This can happen if the user publishes old data with a custom timestamp using the same client instance + stream that they're publishing realtime data to, i.e. the same message chain.

This happens because we lose track of the largest sequence number whenever the timestamp changes for a stream id + partition combo. There's an assumption that if the timestamp changes, then the new timestamp is actually newer, but this isn't verified in the client. Backdated messages then start at sequence 0 again, regardless of the sequencing of existing messages, and the prevMessageRef of these backdated messages will point into the future.
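The flawed bookkeeping described above can be sketched roughly like this (a hypothetical illustration — `getNextMessageRef` and the `messageRefs` map here are not the actual client internals):

```javascript
// Hypothetical sketch of the flawed sequencing logic described above.
// Key: streamId + partition. Value: { timestamp, sequenceNumber } of the last message.
const messageRefs = new Map()

function getNextMessageRef(key, timestamp) {
    const prev = messageRefs.get(key)
    let sequenceNumber = 0
    if (prev && prev.timestamp === timestamp) {
        // same timestamp: bump the sequence number
        sequenceNumber = prev.sequenceNumber + 1
    }
    // BUG: if timestamp < prev.timestamp (i.e. backdated), we still reset
    // the sequence to 0 and overwrite the stored ref, corrupting the chain.
    messageRefs.set(key, { timestamp, sequenceNumber })
    return { timestamp, sequenceNumber }
}
```

Note that the timestamp comparison only checks equality, never ordering, which is exactly why a backdated timestamp silently restarts the sequence at 0.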

e.g. consider this publish sequence, where [timestamp, sequence] -> prevMessageRef,

[0, 0] -> null
[0, 1] -> [0, 0]
[0, 2] -> [0, 1]
[1, 0] -> [0, 2]

Then you publish a new message with timestamp 0, and then a new message with timestamp 1:

[0, 0] -> null (clobbered?)
[0, 0] -> [1, 0] (new)
[0, 1] -> [0, 0]
[0, 2] -> [0, 1]
[1, 0] -> [0, 2] (clobbered?)
[1, 0] -> [0, 0] (new)

Were these messages to be accepted by the network, then since it considers a timestamp + sequence number pair unique, the newer messages occurring with the same timestamp + sequence number would clobber the older messages, while also having a messed up future-dated & cyclic prevMessageRef 🤢

Thankfully this doesn't seem to happen; currently, backdated messages simply seem to disappear from the client's perspective. The backdated messages aren't sent into a stream's subscription, nor are they resent. Unfortunately, all future publishes to this stream also seem to disappear. Only messages before the first backdated message survive.

The backend seems to be responsible for the disappearing behaviour. No error messages appear in the logs on publish, though resends on bad streams do report:

WARN  (streamr:logic:node:0xde3331cA6B8B636E0b82Bf08E941F727B8927442/1 on 3538effb88a9): 
pre-condition: gap overlap in given numbers: previousNumber=1600712118432|0, number=1600712118434|0, state=(1600712118433|0, Infinity|Infinity]

Not sure of a good way around this other than blocking backdated messages on the same chain. We should probably enforce this in the protocol, e.g. the StreamMessage constructor could error on a prevMessageRef with a timestamp greater than the message's own timestamp. It would also be good if the backend sent some information to the client when such things occur. I found a partial workaround that at least keeps non-backdated publishing working (see below).
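The suggested protocol-level guard could look something like this (a sketch only — `validateMessageRefs` is a hypothetical helper, not the actual StreamMessage constructor):

```javascript
// Sketch of the suggested protocol-level check: reject a prevMessageRef
// that points into the future relative to the message's own timestamp.
class ValidationError extends Error {}

function validateMessageRefs({ timestamp, prevMessageRef }) {
    if (prevMessageRef && prevMessageRef.timestamp > timestamp) {
        throw new ValidationError(
            `prevMessageRef timestamp ${prevMessageRef.timestamp} `
            + `is newer than message timestamp ${timestamp}`
        )
    }
}
```

Failing loudly at construction time would turn the current silent data loss into an immediate, debuggable error on the publishing side.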

We could keep a single global sequence counter per stream. This would prevent duplicate timestamp + sequence number entries from occurring and keep everything sortable, but it doesn't prevent prevMessageRef from pointing into the future.
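The global counter idea might be sketched like this (hypothetical — the function and map names are illustrative):

```javascript
// Hypothetical per-stream monotonic sequence counter: the sequence number
// never resets on a timestamp change, so (timestamp, sequence) pairs stay
// unique and totally orderable even when backdated messages are published.
const counters = new Map()

function nextSequenceNumber(streamKey) {
    const next = (counters.get(streamKey) ?? -1) + 1
    counters.set(streamKey, next)
    return next
}
```

This keeps every (timestamp, sequence) pair unique, but as noted above, it does nothing to stop a backdated message's prevMessageRef from referencing a future timestamp.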

Should probably figure out and document an official way to safely publish backdated messages, e.g. publishing archived data. Sequencing should be fine so long as the backdated messages end up on a different message chain; currently this requires using a different client instance. Perhaps we could add a new method to the client like:

const newChainClient = client.createNewChain()
await newChainClient.publish(oldData)

or something else specifically for this use-case.

Partial Workaround

if (!isBackdated) {
    this.messageRefs.set(key, new MessageRef(timestamp, nextSequenceNumber))
}

I've added a partial workaround for backdated messages silently breaking sequencing: updates to the latest sequence number/timestamp are ignored if the timestamp is older than the previous message's timestamp. Without this, all messages published after the backdated message just silently disappear. With this change, backdated messages still get sent to the network and silently disappear, but at least the "properly" sequenced messages will continue to work. Not good, but better.

[0, 0] -> null
[0, 1] -> [0, 0]
[0, 2] -> [0, 1]
[1, 0] -> [0, 2]
[0, 0] -> [1, 0] (backdated, ignored)
[1, 1] -> [1, 0] (workaround allows this to work, without workaround this would be ignored)

Possible alternative workaround

Another option could be to automatically start a new chain when out-of-order publishing is detected. Using the same example as before: if after publishing a message at timestamp 1 you publish a new message at timestamp 0, and then another at 1, the client creates a new chain on the first backdated message, which prevents the backdated message from being lost:

[0, 0] -> null
[0, 1] -> [0, 0]
[0, 2] -> [0, 1]
[1, 0] -> [0, 2]
// new chain
[0, 0] -> null // backdated, starts the new chain
[1, 0] -> [0, 0]

This results in inconsistent message sequencing on a realtime subscription, but seems to produce reliable sequencing for resends. The upside of this method is that no data is lost, and the user doesn't have to explicitly opt into thinking about message chains. The downside is that it adds breaks into the chain, resulting in less reliable ordering.

Another option could be to remove the ability to supply user-provided timestamps altogether; if users need to sort, they can add their own timestamp field to the message body. This makes the timestamp explicitly refer to "publish time" rather than anything about the data's time, which is probably the intended purpose anyway.

Problem 2: Sequencing occurs after unsequenced async calls

A number of async calls are made before the message is sequenced, specifically:

  1. Fetch publisher id (loads username from server)
  2. Fetch stream partitions (loads stream info from server)

If something causes these async functions to resolve in a different order than they were issued, the sequencing will be applied out of order. Both of the above calls are cached remote calls, so in practice this should never occur, but a bug here would be very subtle: it may only occur under certain circumstances and leads to only slightly corrupt data, so it would likely go undetected.
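The hazard can be demonstrated in isolation (a contrived illustration — `publishUnqueued` is hypothetical, with the lookup delay standing in for the remote calls):

```javascript
// Illustration of the hazard: two publish() calls whose internal async
// lookups resolve out of order end up with swapped sequence numbers.
let seq = 0

async function publishUnqueued(label, lookupDelayMs) {
    // simulate the cached remote calls (publisher id, stream partitions)
    await new Promise((resolve) => setTimeout(resolve, lookupDelayMs))
    // sequencing happens only *after* the awaits above
    return { label, sequenceNumber: seq++ }
}

async function demo() {
    return Promise.all([
        publishUnqueued('first', 20), // issued first, resolves last
        publishUnqueued('second', 0), // issued second, resolves first
    ])
}
```

Here the message issued first gets the higher sequence number, which is exactly the call-order violation the per-stream queue below is meant to rule out.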

I've added a few tests which catch this type of bug going forward, and fixed the publishing code to guarantee the async calls always resolve in the same order they were issued. Publish call-order sequencing is guaranteed by pushing these async calls into a per-stream queue:

const streamId = getStreamId(streamObjectOrId)
const key = streamId
const queue = this.pending.get(key) || this.pending.set(key, pLimit(1)).get(key)
try {
    return await queue(() => (
        Promise.all([
            this.getPublisherId(),
            this.partitioner.get(streamObjectOrId, partitionKey),
        ])
    ))
} finally {
    if (!queue.activeCount && !queue.pendingCount) {
        // clean up
        this.pending.delete(key)
    }
}
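pLimit(1) effectively serializes the lookups per stream key. A minimal hand-rolled equivalent of that serialization (hypothetical, for illustration only — the real code uses the p-limit package) would be:

```javascript
// Minimal per-key serial queue, roughly the ordering guarantee pLimit(1)
// provides: each task waits for the previous task on the same key to settle.
const tails = new Map()

function enqueue(key, task) {
    const prev = tails.get(key) ?? Promise.resolve()
    const next = prev.then(task, task) // run after prev settles, even on error
    tails.set(key, next.catch(() => {})) // keep the chain alive on failure
    return next
}
```

Because tasks on the same key always start in submission order, the sequence numbers assigned after the awaited lookups also follow publish call order.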


Builds on:

@timoxley timoxley changed the title Message sequencing Message Sequencing – Guarantee sequence follows publish order & Prevent backdated messages silently breaking future publishes Sep 21, 2020
@timoxley timoxley changed the title Message Sequencing – Guarantee sequence follows publish order & Prevent backdated messages silently breaking future publishes 5. Message Sequencing – Guarantee sequence follows publish order & Prevent backdated messages silently breaking future publishes Sep 21, 2020
src/Publisher.js Outdated
// than they were issued we will end up generating the wrong sequence numbers
const streamId = getStreamId(streamObjectOrId)
const key = streamId
const queue = this.pending.get(key) || this.pending.set(key, pLimit(1)).get(key)
👍

@timoxley timoxley merged commit f4d60b1 into refactor-publish Nov 24, 2020
@timoxley timoxley deleted the message-sequence branch November 24, 2020 15:50
timoxley added a commit that referenced this pull request Nov 24, 2020
* Refactor & co-locate publish code.

* Avoid using expensive ethers.Wallet.createRandom() calls in test. e.g. 1000x calls with ethers: 14s, randomBytes: 3.5ms.

* Ensure messageCreationUtil is cleaned up after test.

* Fix non-functional MessageCreationUtil test.

* Swap out receptacle for more flexible mem/p-memoize/quick-lru.

* Convert LoginEndpoints test to async/await.

* Remove calls to ensureConnected/ensureDisconnected in test.

* 5. Message Sequencing – Guarantee sequence follows publish order & Prevent backdated messages silently breaking future publishes (#166)

* Improve authFetch logging.

* Update message sequencer to strictly enforce message order.

* Queue publishes per-stream otherwise can skip forward then back even when messages published in correct sequence.

* Add partial solution to broken backdated messages, at least doesn't break regular sequential publishes.

* Tidy up, add some comments.

* Move publish queue fn into utils.
timoxley added a commit that referenced this pull request Nov 24, 2020
* Promisify subscribe/unsubscribe.

* Fail _request call if unsubscribe/subscribe request fails to send.

* Fix missing import in resend test.

* Fix browser tests.

* Clean up unused function in test.

* 4. Refactor Publish (#164)

* Refactor & co-locate publish code.

* Avoid using expensive ethers.Wallet.createRandom() calls in test. e.g. 1000x calls with ethers: 14s, randomBytes: 3.5ms.

* Ensure messageCreationUtil is cleaned up after test.

* Fix non-functional MessageCreationUtil test.

* Swap out receptacle for more flexible mem/p-memoize/quick-lru.

* Convert LoginEndpoints test to async/await.

* Remove calls to ensureConnected/ensureDisconnected in test.

* 5. Message Sequencing – Guarantee sequence follows publish order & Prevent backdated messages silently breaking future publishes (#166)

* Improve authFetch logging.

* Update message sequencer to strictly enforce message order.

* Queue publishes per-stream otherwise can skip forward then back even when messages published in correct sequence.

* Add partial solution to broken backdated messages, at least doesn't break regular sequential publishes.

* Tidy up, add some comments.

* Move publish queue fn into utils.
@timoxley timoxley mentioned this pull request Nov 24, 2020
@timoxley timoxley mentioned this pull request Jan 20, 2021
