Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test - "subscribed query gets simultaneous insert and remove after reconnecting" #131

Open
ericyhwang opened this issue May 3, 2022 · 4 comments · Fixed by share/sharedb#590

Comments

@ericyhwang
Copy link
Contributor

This test is occasionally flaky on CI:

https://github.com/share/sharedb-mongo/runs/6276722964?check_suite_focus=true#step:7:2548

  1) db
       client query subscribe
         subscribed query gets simultaneous insert and remove after reconnecting:
     Error: Already closed (and Mocha's done() called multiple times)
      at wrapErrorData (node_modules/sharedb/lib/client/connection.js:263:13)
      at Connection.handleMessage (node_modules/sharedb/lib/client/connection.js:191:11)
      at StreamSocket.socket.onmessage (node_modules/sharedb/lib/client/connection.js:142:18)
      at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/stream-socket.js:60:12
      at processTicksAndRejections (node:internal/process/task_queues:78:11)
@ericyhwang
Copy link
Contributor Author

A new test case failed in a test rerun today, similar issue:

  1) mongo db middleware
       beforeOverwrite
         has the expected properties on the request object:
     Error: done() invoked with non-Error: {"code":5101,"message":"Already closed"} (and Mocha's done() called multiple times)
      at /home/runner/work/sharedb-mongo/sharedb-mongo/test/test_mongo_middleware.js:99:27
      at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:45:132
      at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:49:1228
      at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:55:321
      at ShareDbMongo.getDbs (index.js:29:1175)
      at ShareDbMongo.getOpCollection (index.js:55:186)
      at ShareDbMongo._deleteOp (index.js:49:1062)
      at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:45:29
      at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:51:1087
      at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/mongodb2/lib/collection.js:968:18
      at handleCallback (node_modules/mongodb2/lib/utils.js:120:56)
      at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/mongodb2/lib/collection.js:1061:5
      at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/mongodb-core/lib/connection/pool.js:469:18
      at processTicksAndRejections (node:internal/process/task_queues:78:11)

As the error indicates, the test cases pass, then at some point shortly thereafter done() is called again with that "Already closed" error.

  db
    ...
    client query subscribe
      ...
      ✓ subscribed query gets simultaneous insert and remove after reconnecting
      1) subscribed query gets simultaneous insert and remove after reconnecting

@ericyhwang
Copy link
Contributor Author

More info from debugging. The "Already closed" error comes from the server doing a _getOpsBulk in response to the client reconnecting with message { a: 'qs', id: 1, c: 'dogs', q: {}, r: [ [Array], [Array] ] }.

That might be enough hints to set up a local repro, with artificial setTimeouts to increase the delay of certain codepaths.

Server code logs:

DEBUG - sending error to client { code: 5101, message: 'Already closed' }
Trace
    at getReplyErrorObject (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/agent.js:282:13)
    at Agent._reply (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/agent.js:305:21)
    at callback (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/agent.js:349:15)
    at finish (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/agent.js:514:21)
    at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/agent.js:649:21
    at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/backend.js:394:21
    at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/backend.js:341:21
    at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:72:205
    at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:113:665
    at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:55:321
    at ShareDbMongo.getDbs (/home/runner/work/sharedb-mongo/sharedb-mongo/index.js:29:1175)
    at ShareDbMongo.getOpCollection (/home/runner/work/sharedb-mongo/sharedb-mongo/index.js:55:186)
    at ShareDbMongo._getOpsBulk (/home/runner/work/sharedb-mongo/sharedb-mongo/index.js:113:498)
    at /home/runner/work/sharedb-mongo/sharedb-mongo/index.js:72:29
      1) subscribed query gets simultaneous insert and remove after reconnecting
    at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/mongodb4/lib/utils.js:349:28
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
Error stack -  undefined

Client code logs:

-------- DEBUG - done called with Error: Already closed
    at wrapErrorData (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/connection.js:266:13)
    at Connection.handleMessage (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/connection.js:192:11)
    at StreamSocket.socket.onmessage (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/connection.js:143:18)
    at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/stream-socket.js:60:12
    at processTicksAndRejections (node:internal/process/task_queues:78:11) {
  code: 5101,
  data: { a: 'qs', id: 1, c: 'dogs', q: {}, r: [ [Array], [Array] ] }
}
Trace
    at Query.done (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/test/client/query-subscribe.js:208:17)
    at Query.emit (node:events:513:28)
    at Query._finishResponse (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/query.js:153:17)
    at Query._handleResponse (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/query.js:109:24)
    at Query._handleSubscribe (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/query.js:103:8)
    at Connection.handleMessage (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/connection.js:209:24)
    at StreamSocket.socket.onmessage (/home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/client/connection.js:143:18)
    at /home/runner/work/sharedb-mongo/sharedb-mongo/node_modules/sharedb/lib/stream-socket.js:60:12
    at processTicksAndRejections (node:internal/process/task_queues:78:11)

@ericyhwang
Copy link
Contributor Author

The double-done Mocha error can be repro'd locally by adding a setTimeout of at least 500 ms or so before calling the final callback to _getSnapshotOpLinkBulk.

sharedb-mongo/index.js

Lines 812 to 823 in 58c8006

ShareDbMongo.prototype._getSnapshotOpLinkBulk = function(collectionName, ids, options, callback) {
var self = this;
this.getCollection(collectionName, function(err, collection) {
if (err) return callback(err);
var query = {_id: {$in: ids}};
var projection = {_o: 1, _v: 1};
var request = createRequestForMiddleware(options, collectionName);
request.query = query;
self._middleware.trigger(MiddlewareHandler.Actions.beforeSnapshotLookup, request, function(middlewareErr) {
if (middlewareErr) return callback(middlewareErr);
collection.find(query, request.findOptions).project(projection).toArray(callback);

When a client reconnects a query subscription, Agent#querySubscribe in sharedb core does two things in parallel:

  1. Calls Agent#_fetchBulkOps -> Backend#getOpsBulk to get the ops needed to catch up the individual documents' versions
    • This eventually calls sharedb-mongo's getOpsBulk, which in sequence calls _getSnapshotOpLinkBulk then _getOpsBulk.
  2. Calls Backend#querySubscribe -> QueryEmitter#queryPoll catch up the query itself, including result ordering
    • Backend#querySubscribe sets up pubsub. The presence of options.ids means it won't actually query the DB.
    • Instead, querying the Db is handled by the QueryEmitter#queryPoll call. Once done, the agent sends a query diff down to the client.

In the double-callback failure, the second one finishes quickly, sending the query diff down to the client. Once the once the insert and remove diffs are done processing on the client, the test immediately finishes, and an afterEach hook closes the test's sharedb Backend instance.

If the first one's _getSnapshotOpLinkBulk is slower, then the subsequent _getOpsBulk will try to call getDbs and fail with the "already closed" error.

sharedb-mongo/index.js

Lines 113 to 117 in 58c8006

ShareDbMongo.prototype.getDbs = function(callback) {
if (this.closed) {
var err = ShareDbMongo.alreadyClosedError();
return callback(err);
}

That error gets sent to the client code, which trips a query.on('error', done); in the test case, causing the double-done failure:
https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/test/client/query-subscribe.js#L220

alecgibson added a commit to share/sharedb that referenced this issue Feb 15, 2023
Fixes share/sharedb-mongo#131

This change attempts to fix some tests that are flaky in `sharedb-mongo`.

The flakiness can be reproduced locally by wrapping the
`Agent._querySubscribe()` [call to `_fetchBulkOps()`][1] in a long
`setTimeout()`:

```js
  if (options.fetchOps) {
    wait++;
    setTimeout(function() {
      console.log('fetch bulk ops');
      agent._fetchBulkOps(collection, options.fetchOps, finish);
    }, 1000);
  }
```

This forces us into an edge case where the subscribe query triggers and
returns the diff from a [`queryPoll()`][2], which triggers the tests'
`'insert'` handlers, finishes the test, and closes the backend, all
before the `_fetchBulkOps()` call is issued (which subsequently fails
because the database has been closed).

Handling this query subscribe actually triggers a variety of responses
to be sent to the client at different times:

 1. The actual `_querySubscribe()` [callback][3] (which ultimately
    triggers [`agent._reply()`][4] in response to the original request)
 2. Ops sent [independently][5] by [`_fetchBulkOps()`][6]
 3. The diff resulting from [`queryPoll()`][2]

In order to reduce flakiness, this change adds checks that the query's
[`'ready'` event][7] has been called, which will happen once the
resubscribe request has been replied to by the `Agent`.

This ensures we've waited for events 1 & 3 of the above list, although
we notably aren't waiting for event 2 (which is where the error is
actually coming from).

Since no ops will actually be sent to the client, I'm not sure how best
to wait for this. Hopefully waiting for the subscribe ack should be
sufficient.

[1]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/agent.js#L521-L524
[2]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/agent.js#L534
[3]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/agent.js#L511
[4]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/agent.js#L344
[5]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/agent.js#L265
[6]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/agent.js#L523
[7]: https://github.com/share/sharedb/blob/b84290f8a66ee42b3d2c75d6de780e125b99c514/lib/client/query.js#L148
@alecgibson
Copy link
Contributor

Looks like this is still happening: https://github.com/share/sharedb-mongo/actions/runs/4295623744/jobs/7486290853

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants