
ft: S3C-1398 Use retry CRR topic #278

Merged
merged 1 commit into z/1.0 from ft/S3C-1398/useFailedCRRTopic on May 29, 2018

Conversation

@bennettbuchanan commented May 2, 2018

Decouple the logic for making calls to Redis from the logic for updating object metadata. This allows use of ioredis's offline queue in cases where the connection to Redis is faulty. To that end, this PR alters the retry feature to do the following (a sketch of the flow is below):

  • Push failed entries to a kafka "retry" topic.
  • Set Redis keys during processing of "retry" kafka entry.

Depends on https://github.com/scality/Federation/pull/1514 for creating the topic.
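For a rough picture of the flow, here is a minimal sketch of the two steps. The topic name comes from this PR's config change; the stubbed producer and the handler names are illustrative, not the PR's actual classes:

    const Redis = require('ioredis');

    // Hypothetical stand-in for the Kafka producer; in backbeat this
    // would go through a BackbeatProducer on the retry topic.
    function publishToRetryTopic(entry, cb) {
        console.log('publish to backbeat-replication-retry:', entry);
        return process.nextTick(cb);
    }

    const redis = new Redis({ host: 'localhost', port: 6379 });

    // Step 1: on a FAILED replication status, publish the entry to the
    // retry topic instead of writing to Redis directly.
    function onReplicationFailure(entry, cb) {
        return publishToRetryTopic(JSON.stringify(entry), cb);
    }

    // Step 2: the retry-topic consumer sets the Redis key; ioredis's
    // offline queue can buffer the write if the connection is flaky.
    function onRetryEntry(kafkaValue, cb) {
        const { field, value } = JSON.parse(kafkaValue);
        return redis.set(field, value, cb);
    }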

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@bennettbuchanan (Author)

Noticing this CI error intermittently:

  1) BackbeatTestConsumer "before all" hook:
     Uncaught Error: KafkaConsumer is disconnected
      at Error (native)
      at KafkaConsumer.commit (node_modules/node-rdkafka/lib/kafka-consumer.js:482:32)
      at Object._processingQueue.drain (lib/BackbeatConsumer.js:202:32)
      at node_modules/async/dist/async.js:2269:19
      at node_modules/async/dist/async.js:958:16
      at node_modules/async/dist/async.js:3874:9
      at node_modules/async/dist/async.js:473:16
      at iteratorCallback (node_modules/async/dist/async.js:1050:13)
      at node_modules/async/dist/async.js:958:16
      at node_modules/async/dist/async.js:3871:13
      at Function._producer.send.err (lib/RetryProducer.js:56:20)
      at wrapper (node_modules/arsenal/lib/jsutil.js:25:30)
      at BackbeatProducer._onDeliveryReport (lib/BackbeatProducer.js:130:13)
      at onDeliveryReport (node_modules/node-rdkafka/lib/producer.js:91:12)

I'm wondering if this is a known timing issue in the CI, in particular due to this comment. @jonathan-gramain

@jonathan-gramain (Contributor)

@bennettbuchanan thanks for reporting this. Looking at it, it could be a real bug: a Kafka consumer can throw when trying to commit its offset if it's not connected to Kafka. We should guard this in a try/catch block, IMO.

On the test side, I'm not sure why the chain of callbacks ends up on the consumer side directly from the producer side, but it may be possible to fix this by serializing the creation of the BackbeatProducer after the BackbeatConsumer class has emitted the 'ready' event (so that the consumer is connected at that stage).
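A minimal sketch of both suggestions; the helper names are illustrative, but commit() throwing while disconnected matches the trace above:

    // Guard the offset commit: node-rdkafka's commit() throws if the
    // consumer has been disconnected, so catch the error rather than
    // letting it escape the processing queue's drain handler.
    function safeCommit(consumer, log) {
        try {
            consumer.commit();
        } catch (err) {
            log.error('could not commit offset', { error: err.message });
        }
    }

    // Test-side fix: create the producer only once the consumer is
    // connected, by waiting for the consumer's 'ready' event.
    function setupClients(consumer, createProducer, cb) {
        consumer.once('ready', () => createProducer(cb));
    }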

conf/config.json Outdated
@@ -78,6 +78,7 @@
},
"topic": "backbeat-replication",
"replicationStatusTopic": "backbeat-replication-status",
"replicationRetryTopic": "backbeat-replication-retry",
Contributor

I would prefer this topic to be called backbeat-replication-failures (plus the associated config items and classes), since it only catalogs entries that failed to be replicated; the retry API will be a service consuming this topic for items that may (or may not) be retried.

backends.forEach(backend => {
const { status, site } = backend;
let message;
for (let i = 0; i < backends.length; i++) {
Contributor

Instead of looping and using break, may I suggest a clearer alternative:

const backend = backends.find(b => b.status === 'FAILED' && b.site === queueEntry.getSite());
if (backend) {
    const { status, site } = backend;
    //...
    return this._retryProducer.publishRetryEntry(...);
}

this._retryConsumer = new RetryConsumer(this.kafkaConfig);
this._retryConsumer.start();

this._retryProducer = new RetryProducer(this.kafkaConfig);
Contributor

Is this producer used at all?

@bennettbuchanan (Author)

No, no it's not—I was following the metrics API design a bit too closely here. 😅

* @return {undefined}
*/
_setupRetryClients(cb) {
this._retryConsumer = new RetryConsumer(this.kafkaConfig);
Contributor

Let's try not to stretch the role of queue populator beyond its metadata ingestion scope.

The producer is fine in the replication status processor IMO because that's where the info is readily available to be published.

The consumer should be part of its own component that updates Redis, for separation of concerns.

@bennettbuchanan (Author)

I'm not sure I follow what separating the start of the consumer into its own component would look like. Should it be a backbeat "extension" with its own task? Maybe a good place is https://github.com/scality/backbeat/blob/master/extensions/replication/queueProcessor/task.js#L49. In any case, I noticed the metrics consumer was being started when opening the queue populator, so I added it here.

Contributor

The idea is to follow a philosophy where each backbeat internal component should be runnable as an independent entity. All entities that belong to the same service (e.g. replication) will go under a single Kubernetes service. The general idea, as I understand it, is to make components more manageable, pluggable on demand, and scalable independently from each other.

In this case, I think it makes sense to create a new component for consuming the retry queue and updating Redis. In that case, there should indeed be an entrypoint that starts the process (task.js or service.js), and a set of classes doing the job (e.g. in tasks/ for the processing triggered by consumption of a single Kafka entry).

I think it was an oversight (or a shortcut) to put the metrics consumer in the queue populator: it should only contain the producer, and the consumer should be a separate process.
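For illustration, a minimal sketch of what such an entrypoint might look like; the FailedCRRConsumer class and the config path are assumptions, not code from this PR:

    'use strict'; // eslint-disable-line strict

    // Hypothetical entrypoint, e.g. extensions/replication/failedCRR/task.js,
    // running the failed-CRR consumer as its own process so the
    // Redis-updating logic stays out of the queue populator.
    const werelogs = require('werelogs');
    const config = require('../../../conf/Config'); // assumed config module
    const FailedCRRConsumer = require('./FailedCRRConsumer'); // assumed class

    const log = new werelogs.Logger('Backbeat:FailedCRRTask');
    const consumer = new FailedCRRConsumer(config.kafka);

    consumer.start(); // mirrors RetryConsumer.start() in this PR
    log.info('failed CRR consumer is running');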

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

* @param {Function} cb - The callback function
* @return {undefined}
*/
publishRetryEntry(message, cb) {
Contributor

To be renamed as well, to publishFailedCRREntry.

@@ -0,0 +1,61 @@
'use strict'; // eslint-disable-line strict
Contributor

Since the producer and consumer classes are specific to the replication service, I think they'd be better located somewhere in extensions/replication/ (maybe creating an extra failedCRR/ dir there makes sense).

@@ -9,10 +9,7 @@ werelogs.configure({
});

function getRedisClient() {
const redisConfig = Object.assign({}, config.redis, {
enableOfflineQueue: false,
Collaborator

It may actually be better if this offline queue is disabled. We already have a Kafka queue for the failed list. If Redis is unavailable, the operation should be retried at a later time.
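For reference, a minimal sketch of an ioredis client with the offline queue disabled, so commands fail fast while Redis is down instead of buffering; the connection settings are placeholders:

    const Redis = require('ioredis');

    const redis = new Redis({
        host: 'localhost', // placeholder; real values come from config.redis
        port: 6379,
        // Fail commands immediately while disconnected instead of
        // queueing them; retries are then driven by the Kafka topic.
        enableOfflineQueue: false,
    });

    redis.on('error', err => console.error('redis error:', err.message));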

@bennettbuchanan (Author) May 16, 2018

Perhaps we should have some kind of exponential backoff when performing Redis operations? I guess in that case, we would just push the entry back into the queue if it failed permanently (beyond the backoff limit).

I will look into potential solutions for this.

@bennettbuchanan (Author)

This seems like a good option as we can reuse the retry method from the BackbeatTask class.

@bennettbuchanan (Author)

Okay, just pushed a new commit with that functionality; it follows the same pattern used for retrying replication tasks. When the retry gives up, we push the entry back to the Kafka topic dedicated to tracking failures.
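A minimal sketch of that pattern, using a generic exponential-backoff helper rather than the actual BackbeatTask API; the attempt limit, base delay, and requeue callback are illustrative:

    // Retry a Redis operation with exponential backoff; once the
    // backoff limit is reached, hand the entry back to the Kafka
    // failures topic via the requeue callback.
    function retryWithBackoff(operation, requeue, attempt = 0) {
        const maxAttempts = 5;    // assumed limit
        const baseDelayMs = 1000; // assumed base delay
        operation(err => {
            if (!err) {
                return undefined;
            }
            if (attempt + 1 >= maxAttempts) {
                return requeue();
            }
            const delay = baseDelayMs * Math.pow(2, attempt);
            return setTimeout(
                () => retryWithBackoff(operation, requeue, attempt + 1),
                delay);
        });
    }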

const versionId = queueEntry.getEncodedVersionId();
const { site } = backend;
const message = {
field: `${bucket}:${key}:${versionId}:${site}`,
@bennettbuchanan (Author)

Since we've moved away from the Redis hash, field should be updated to key.

@bennettbuchanan (Author)

Will update in a subsequent PR so that we can maintain functionality when merging this one.

const { site } = backend;
const message = {
field: `${bucket}:${key}:${versionId}:${site}`,
value: Buffer.from(kafkaEntry.value).toString(),
@bennettbuchanan (Author)

Since we've moved away from storing object metadata in Redis, we can eliminate this.

@bennettbuchanan (Author)

Will update in a subsequent PR so that we can maintain functionality when merging this one.

@@ -126,38 +125,32 @@ class ReplicationStatusProcessor {
}

/**
* Set the Redis hash key for each failed backend.
* Push any failed entry to the "retry" topic.
Contributor

should be to the "failed" topic

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@ironman-machine (Contributor)

CONFLICT (add/add): Merge conflict in lib/queuePopulator/QueuePopulator.js
CONFLICT (add/add): Merge conflict in extensions/replication/utils/getRedisClient.js
CONFLICT (add/add): Merge conflict in extensions/replication/replicationStatusProcessor/ReplicationStatusProcessor.js
CONFLICT (add/add): Merge conflict in extensions/replication/ReplicationConfigValidator.js
CONFLICT (add/add): Merge conflict in conf/config.json

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

log,
}, err => {
if (err && err.retryable === true) {
return this._failedCRRProducer.setupProducer(err => {
Contributor

Not sure this is the right place to initialize the producer; IMO it should be done once and for all at the init stage.

@bennettbuchanan (Author) May 17, 2018

I had reservations as well, but didn't want to use a callback in the constructor. I'll move it to a separate method and call it when creating the instance.

Contributor

You can have a setup or init async method called separately from the constructor.
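A minimal sketch of that pattern; the class and method names mirror the hunk above, while the injected factory is an assumption that keeps the sketch self-contained:

    class FailedCRRProducer {
        // createProducer is an injected async factory; in backbeat this
        // would construct the Kafka producer for the failures topic.
        constructor(kafkaConfig, createProducer) {
            this._kafkaConfig = kafkaConfig;
            this._createProducer = createProducer;
            this._producer = null;
        }

        // Async setup kept out of the constructor and called once at
        // the init stage, instead of on every failed entry.
        setupProducer(cb) {
            this._createProducer(this._kafkaConfig, (err, producer) => {
                if (err) {
                    return cb(err);
                }
                this._producer = producer;
                return cb(null);
            });
        }
    }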

@bennettbuchanan (Author)

Updated to set it up when starting the consumer.

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@ironman-machine (Contributor)

CONFLICT (add/add): Merge conflict in lib/queuePopulator/QueuePopulator.js
CONFLICT (add/add): Merge conflict in extensions/replication/utils/getRedisClient.js
CONFLICT (add/add): Merge conflict in extensions/replication/replicationStatusProcessor/ReplicationStatusProcessor.js
CONFLICT (add/add): Merge conflict in extensions/replication/ReplicationConfigValidator.js
CONFLICT (add/add): Merge conflict in conf/config.json

@ironman-machine dismissed jonathan-gramain’s stale review May 17, 2018 21:19

Do it again human slave! :point_right: :runner: (Oh and the pull request has been updated, by the way.)

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@philipyoo previously approved these changes May 17, 2018
@ironman-machine dismissed stale reviews from jonathan-gramain and philipyoo May 25, 2018 23:40

Do it again human slave! :point_right: :runner: (Oh and the pull request has been updated, by the way.)

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@ironman-machine (Contributor)

CONFLICT (add/add): Merge conflict in lib/queuePopulator/QueuePopulator.js
CONFLICT (add/add): Merge conflict in extensions/replication/replicationStatusProcessor/ReplicationStatusProcessor.js
CONFLICT (add/add): Merge conflict in extensions/replication/ReplicationConfigValidator.js
CONFLICT (add/add): Merge conflict in conf/config.json

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@jonathan-gramain merged commit bf2745b into z/1.0 May 29, 2018
@jonathan-gramain deleted the ft/S3C-1398/useFailedCRRTopic branch May 29, 2018 20:36