alecgibson
Contributor

Introduction

At the moment, when calling `getOps`, `sharedb-mongo` actually fetches
all the ops from `from` to the current version.

That means that if I have a document with 1,000 versions, and I only
ask for ops 0-10, we still fetch all 1,000 ops. This is clearly
inefficient, and on documents with large numbers of ops, this fetch can
take a long time. Anecdotally, it takes ~2s to fetch ops 0-10 of
100,000 ops.

This change has an anecdotal performance increase of ~100x, now taking
~0.01s to fetch ops 0-10 of 100,000 ops.

Background

We fetch all the ops from the current version, because we try to fetch a
"valid" chain of operations. Consider this case:

  1. I have a doc v1
  2. I submit an op v2
  3. I concurrently submit another op v2
  4. Both concurrent ops are committed to the database before the
    snapshot
  5. One of the commits successfully writes a snapshot
  6. The other snapshot is rejected, but its op is still committed
  7. I now try to get all ops. A naive find operation would return two
    ops for v2.

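In theory, this case should actually already be cleaned up
(https://github.com/share/sharedb-mongo/blob/2d579ddb80781e987707076e932cd4e01ca066ef/index.js#L189-L193),
so I can only assume that we are: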

  • guarding against cases where the op cleanup fails
  • guarding against unknown inconsistencies in ops
  • guarding against stupid consumers mucking about in the db themselves
  • being very defensive

In order to fetch a valid set of ops, we therefore currently fetch the
current snapshot, which stores a reference to the op that created it.
By looking up that op, we can then check the op that preceded that one,
and so on, forming a valid chain of ops that point to one another.

The obvious downside of this is that we need to start at the current
snapshot and work backwards, necessitating a fetch of all the ops up to
the current version.
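
The backwards walk described above can be sketched roughly as follows. This is a minimal illustration, not the code in `sharedb-mongo`: it assumes each op stores the id of the op that preceded it in an `o` field (the field named in the code comments later in this review), and it uses an in-memory array in place of a database query. The function name `getLinkedOps` and the sample ids are hypothetical.

```javascript
// Sketch only: `getLinkedOps` and the sample data are hypothetical.
// Assumes each op has an `_id`, a version `v`, and a link `o` holding
// the `_id` of the op that preceded it (null for the first op).
function getLinkedOps(allOps, headOpId) {
  const opsById = new Map(allOps.map((op) => [op._id, op]));
  const chain = [];
  let op = opsById.get(headOpId);
  // Walk backwards from the head op, following each link in turn.
  while (op) {
    chain.unshift(op);
    op = op.o === null ? undefined : opsById.get(op.o);
  }
  return chain;
}

const storedOps = [
  {_id: 'op1', v: 1, o: null},
  {_id: 'op2a', v: 2, o: 'op1'},
  {_id: 'op2b', v: 2, o: 'op1'}, // concurrent op whose snapshot write was rejected
  {_id: 'op3', v: 3, o: 'op2a'}
];

// 'op3' stands in for the op referenced by the current snapshot.
const linked = getLinkedOps(storedOps, 'op3');
console.log(linked.map((op) => op._id)); // [ 'op1', 'op2a', 'op3' ]
```

The rejected op `op2b` never appears in the result because nothing links back to it, which is why the walk has to start from an op that is known to be valid.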

Changes

This change attempts to allow us to only fetch the bare minimum number
of ops, whilst still maintaining integrity and discarding erroneous ops.

This is achieved by attempting to find the first op after our `to` op
that has a unique version. The assumption here is that any op with a
unique version is inherently valid, because it had no collisions.

For example, consider the case where I have a document with some ops:

  • v1: unique
  • v2: unique
  • v3: collision 3
  • v3: collision 3
  • v4: collision 4
  • v4: collision 4
  • v5: unique
  • v6: unique
  • ...
  • v1000: unique

If I want to fetch ops v1-v3, then we:

  • look up v4
  • find that v4 is not unique
  • look up v5
  • see that v5 is unique and therefore assumed valid
  • look backwards from v5 for a chain of valid ops

This way we don't need to fetch all the ops from v5 to the current
version.

In the case where a valid op cannot be determined, we still fall back to
fetching all ops and working backwards from the current version.
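
As a rough sketch of the lookup in the example above, assuming the ops are already sorted by version and held in memory (the real change reads from a database cursor instead). The helper name `firstUniqueVersionAfter` and its `lastWanted` parameter are hypothetical.

```javascript
// Sketch only: `firstUniqueVersionAfter` is a hypothetical helper.
// `sortedOps` must be sorted by version `v`; `lastWanted` is the last
// version the caller asked for.
function firstUniqueVersionAfter(sortedOps, lastWanted) {
  for (let i = 0; i < sortedOps.length; i++) {
    const version = sortedOps[i].v;
    if (version <= lastWanted) continue;
    const sameAsPrevious = i > 0 && sortedOps[i - 1].v === version;
    const sameAsNext = i + 1 < sortedOps.length && sortedOps[i + 1].v === version;
    // A version is unique when neither neighbour shares it.
    if (!sameAsPrevious && !sameAsNext) return sortedOps[i];
  }
  // No unique op found: the caller falls back to linking from the snapshot.
  return null;
}

// v3 and v4 each collide, as in the example above.
const exampleOps = [1, 2, 3, 3, 4, 4, 5, 6].map((v, i) => ({_id: 'op' + i, v: v}));

const anchorOp = firstUniqueVersionAfter(exampleOps, 3);
console.log(anchorOp.v); // 5
```

Once the anchor op is found, the chain is walked backwards from it rather than from the current snapshot.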

Further work

Note that this change only affects the `getOps` method, notably not
touching either `getOpsToSnapshot` or `getOpsBulk`.

@coveralls

coveralls commented Aug 15, 2018

Coverage increased (+0.4%) to 93.458% when pulling 31004ec on reedsy:get-fewer-ops into 87d76b3 on share:master.

@gkubisa
Contributor

gkubisa commented Aug 15, 2018

The idea and implementation look good to me, however, I'd add some more tests to cover all the edge cases supported by _getOpLink.

@alecgibson
Contributor Author

alecgibson commented Aug 16, 2018

@gkubisa good call - caught a bug by expanding test coverage. I've also added some spies to check that our underlying methods are indeed only attempting to fetch a subset of ops.

alecgibson force-pushed the get-fewer-ops branch 3 times, most recently from db5506c to b57b0dc (August 17, 2018 09:26)
@ericyhwang
Contributor

Nate's comments from the PR review meeting:

  • He has a concern with an edge case around Mongo's (lack of?) guarantees about cross-collection consistency and about ops potentially getting removed by TTLs
  • Putting it behind a flag that is off by default is fine for now

We'll still need to take a look at the PR itself.

The new, faster `getOps` behaviour should be fine, but there may be
unforeseen corner cases where the underlying assumptions break down (eg
if consumers have been meddling with ops or snapshots outside of
ShareDB).

In order to remain conservative, and maintain data consistency in all
mainline cases, this change hides the faster behaviour behind a flag,
which defaults to off.

To make sure `sharedb-mongo` still behaves correctly with this flag
enabled, we run the whole test suite a second time with it turned on.
@alecgibson
Contributor Author

@ericyhwang I've hidden this feature behind a flag. Please let me know when you've had a chance to look through the code itself.

README.md Outdated
performance impact on fetching ops for documents with a large number of ops.

If you need faster performance of `getOps`, you can initialise `sharedb-mongo`
with the option `fasterGetOps: true`.
Contributor Author

getOpsWithoutStrictLinking: true

Then also document what "strict" linking is (ie linking back from the current snapshot).

"The default behaviour of getOps is...,

"whereas setting this flag will do 1., 2., 3..."
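
For illustration, opting in with the renamed flag might look like the sketch below. The option name comes from this review; the connection string is a placeholder, and the commented-out line assumes `sharedb-mongo` is installed and takes its options as the second argument.

```javascript
// Sketch only: the flag defaults to off, so it has to be enabled explicitly.
const adapterOptions = {getOpsWithoutStrictLinking: true};

// With sharedb-mongo installed, the options would be passed when creating
// the adapter, for example (placeholder connection string):
//   const db = require('sharedb-mongo')('mongodb://localhost:27017/test', adapterOptions);
console.log(adapterOptions);
```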

index.js Outdated
// data in the mongo database.
this.allowAggregateQueries = options.allowAllQueries || options.allowAggregateQueries || false;

// By default, when calling fetchOps, sharedb-mongo fetchas *all* ops, even
Contributor Author

fetchas -> fetches

Contributor Author

(also doesn't fetch "all" ops - still honours "from", but ignores "to")

README.md Outdated

Setting this flag will use an alternative method that is much faster than the
default method, but may behave incorrectly in corner cases where ops or snapshots
have been manipulated outside of ShareDB (eg by setting a TTL on ops, or manually
Contributor Author

We can just throw our hands up here and say that we're not entirely sure what may cause a failure here, but because we're playing around with op integrity we should warn that Bad Things might happen by using this mode, especially if people have changed ops themselves.

We should define "behave incorrectly" and explain that we're concerned about returning ops that were not correctly linked to the canonical version. We need to guard against that because ShareDB sometimes writes more than one op (of the same version) when performing optimistic locking, so if, eg, the valid op has been deleted, then we may return an invalid op.

If we incorrectly assume an op is canonical, then we may return an invalid chain.

index.js Outdated
// To avoid this, we try to fetch the first op after 'to' which has a unique 'v', and then we
// work backwards from that op using the linked op 'o' field to get a valid chain of ops.
function getFirstOpWithUniqueVersion(cursor, ops, callback) {
if (typeof ops === 'function') {
Contributor Author

Don't bother with this requirements check. Just make the first caller pass in an empty array.

index.js Outdated
&& previousVersion !== currentVersion
&& previousVersion !== oneBeforePreviousVersion;

if (previousVersionWasUnique) {
Contributor Author

Turn this check into a function to shorten this function.

Contributor Author

Alternatively, we should pull this out into its own class, which holds only these three objects (so that we don't keep the rest of the array of ops in memory when we don't need it).
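
A minimal sketch of what such a class could look like, keeping only three versions in memory at any time. This is a hypothetical illustration rather than the class that was eventually merged; the names `VersionWindow` and `firstUniqueVersion` are made up, and the uniqueness check mirrors the conditions in the snippet above.

```javascript
// Sketch only: `VersionWindow` and `firstUniqueVersion` are hypothetical.
class VersionWindow {
  constructor() {
    this.oneBeforePrevious = undefined;
    this.previous = undefined;
    this.current = undefined;
  }

  // Slide the window along by one version. Push `undefined` once the
  // stream of ops is exhausted so that the last real version can be checked.
  push(version) {
    this.oneBeforePrevious = this.previous;
    this.previous = this.current;
    this.current = version;
  }

  // The previous version is unique if it differs from both neighbours.
  previousWasUnique() {
    const previous = this.previous;
    return typeof previous === 'number'
      && previous !== this.current
      && previous !== this.oneBeforePrevious;
  }
}

function firstUniqueVersion(versions) {
  const win = new VersionWindow();
  for (const version of versions.concat([undefined])) {
    win.push(version);
    if (win.previousWasUnique()) return win.previous;
  }
  return null;
}

console.log(firstUniqueVersion([4, 4, 5, 6])); // 5
```

Because a version can only be judged once its successor has been seen, the check always lags one step behind the most recently pushed version.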

index.js Outdated
} else {
// If there's no next op to fetch, then we now know the current version is unique so
// long as it doesn't match the previous version.
var currentVersionIsUnique = typeof currentVersion === 'number'
Contributor Author

Could also extract this into its own function for brevity.

@alecgibson
Contributor Author

@ericyhwang I've addressed the comments we discussed in the pull review meeting. Namely:

  • extract complicated unique op checking into its own, tested class
  • rename the feature flag to getOpsWithoutStrictLinking
  • update documentation to be more detailed about behaviour with the
    flag enabled and disabled

Could you please re-review?

This change addresses some review comments. Namely it:

  - extracts complicated unique op checking into its own, tested class
  - renames the feature flag to `getOpsWithoutStrictLinking`
  - updates documentation to be more detailed about behaviour with the
    flag enabled and disabled
};

OpLinkValidator.prototype._previousVersionWasUnique = function () {
return typeof this._previousVersion() === 'number'
Contributor

Minor thing, might be worth calling this._previousVersion() once and storing it.

@ericyhwang ericyhwang merged commit 1c91b46 into share:master Sep 19, 2018