
ft: S3C-1398 Use retry CRR topic #278

Merged
merged 1 commit into z/1.0 from ft/S3C-1398/useFailedCRRTopic on May 29, 2018

Conversation

@bennettbuchanan commented May 2, 2018

Decouple the logic for making calls to Redis from the logic for updating object metadata. This allows use of ioredis's offline queue in cases where the connection to Redis is faulty. To that end, this PR alters the retry feature to do the following (a sketch of the flow is below):

  • Push failed entries to a kafka "retry" topic.
  • Set Redis keys during processing of "retry" kafka entry.

Depends on https://github.com/scality/Federation/pull/1514 for creating the topic.
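For a rough picture of the flow, here is a minimal sketch of the two steps. The topic name comes from this PR's config change; the stubbed producer and the handler names are illustrative, not the PR's actual classes:

    const Redis = require('ioredis');

    // Hypothetical stand-in for the Kafka producer; in backbeat this
    // would go through a BackbeatProducer on the retry topic.
    function publishToRetryTopic(entry, cb) {
        console.log('publish to backbeat-replication-retry:', entry);
        return process.nextTick(cb);
    }

    const redis = new Redis({ host: 'localhost', port: 6379 });

    // Step 1: on a FAILED replication status, publish the entry to the
    // retry topic instead of writing to Redis directly.
    function onReplicationFailure(entry, cb) {
        return publishToRetryTopic(JSON.stringify(entry), cb);
    }

    // Step 2: the retry-topic consumer sets the Redis key; ioredis's
    // offline queue can buffer the write if the connection is flaky.
    function onRetryEntry(kafkaValue, cb) {
        const { field, value } = JSON.parse(kafkaValue);
        return redis.set(field, value, cb);
    }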

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@bennettbuchanan (Author)

Noticing this CI error intermittently:

  1) BackbeatTestConsumer "before all" hook:
     Uncaught Error: KafkaConsumer is disconnected
      at Error (native)
      at KafkaConsumer.commit (node_modules/node-rdkafka/lib/kafka-consumer.js:482:32)
      at Object._processingQueue.drain (lib/BackbeatConsumer.js:202:32)
      at node_modules/async/dist/async.js:2269:19
      at node_modules/async/dist/async.js:958:16
      at node_modules/async/dist/async.js:3874:9
      at node_modules/async/dist/async.js:473:16
      at iteratorCallback (node_modules/async/dist/async.js:1050:13)
      at node_modules/async/dist/async.js:958:16
      at node_modules/async/dist/async.js:3871:13
      at Function._producer.send.err (lib/RetryProducer.js:56:20)
      at wrapper (node_modules/arsenal/lib/jsutil.js:25:30)
      at BackbeatProducer._onDeliveryReport (lib/BackbeatProducer.js:130:13)
      at onDeliveryReport (node_modules/node-rdkafka/lib/producer.js:91:12)

I'm wondering if this is a known timing issue in the CI, in particular due to this comment. @jonathan-gramain

@jonathan-gramain (Contributor)

@bennettbuchanan thanks for reporting this. Looking at it, it could be a real bug: a Kafka consumer can throw when trying to commit its offset if it's not connected to Kafka. We should guard this in a try/catch block, IMO.

On the test side, I'm not sure why the chain of callbacks ends up on the consumer side directly from the producer side, but it may be possible to fix this by serializing the creation of the BackbeatProducer after the BackbeatConsumer class has emitted the 'ready' event (so that the consumer is connected at that stage).
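A minimal sketch of both suggestions; the helper names are illustrative, but commit() throwing while disconnected matches the trace above:

    // Guard the offset commit: node-rdkafka's commit() throws if the
    // consumer has been disconnected, so catch the error rather than
    // letting it escape the processing queue's drain handler.
    function safeCommit(consumer, log) {
        try {
            consumer.commit();
        } catch (err) {
            log.error('could not commit offset', { error: err.message });
        }
    }

    // Test-side fix: create the producer only once the consumer is
    // connected, by waiting for the consumer's 'ready' event.
    function setupClients(consumer, createProducer, cb) {
        consumer.once('ready', () => createProducer(cb));
    }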

conf/config.json Outdated
@@ -78,6 +78,7 @@
},
"topic": "backbeat-replication",
"replicationStatusTopic": "backbeat-replication-status",
"replicationRetryTopic": "backbeat-replication-retry",
Contributor

I would prefer this topic to be called backbeat-replication-failures (plus the associated config items and classes), since it only catalogs entries that failed to be replicated; the retry API will be a service consuming this topic for items that may (or may not) be retried.

backends.forEach(backend => {
const { status, site } = backend;
let message;
for (let i = 0; i < backends.length; i++) {
Contributor

Instead of looping and using break, may I suggest a clearer alternative:

const backend = backends.find(b => b.status === 'FAILED' && b.site === queueEntry.getSite());
if (backend) {
    const { status, site } = backend;
    //...
    return this._retryProducer.publishRetryEntry(...);
}

this._retryConsumer = new RetryConsumer(this.kafkaConfig);
this._retryConsumer.start();

this._retryProducer = new RetryProducer(this.kafkaConfig);
Contributor

Is this producer used at all?

@bennettbuchanan (Author)

No, no it's not—I was following the metrics API design a bit too closely here. 😅

* @return {undefined}
*/
_setupRetryClients(cb) {
this._retryConsumer = new RetryConsumer(this.kafkaConfig);
Contributor

Let's try not to stretch the role of queue populator beyond its metadata ingestion scope.

The producer is fine in the replication status processor IMO because that's where the info is readily available to be published.

The consumer should be part of its own component that updates Redis, for separation of concerns.

@bennettbuchanan (Author)

I'm not sure I follow what separating the start of the consumer into its own component would look like. Should it be a backbeat "extension" with its own task? Maybe a good place is https://github.com/scality/backbeat/blob/master/extensions/replication/queueProcessor/task.js#L49. In any case, I noticed the metrics consumer was being started when opening the queue populator, so I added it here.

Contributor

The idea is to follow a philosophy where each backbeat internal component should be runnable as an independent entity. All entities that belong to the same service (e.g. replication) will go under a single Kubernetes service. The general idea, as I understand it, is to make components more manageable, pluggable on demand, and scalable independently from each other.

In this case, I think it makes sense to create a new component for consuming the retry queue and updating Redis. In that case, there should indeed be an entrypoint that starts the process (task.js or service.js), and a set of classes doing the job (e.g. in tasks/ for the processing triggered by consumption of a single Kafka entry).

I think it was an oversight (or a shortcut) to put the metrics consumer in the queue populator: it should only contain the producer, and the consumer should be a separate process.
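For illustration, a minimal sketch of what such an entrypoint might look like; the FailedCRRConsumer class and the config path are assumptions, not code from this PR:

    'use strict'; // eslint-disable-line strict

    // Hypothetical entrypoint, e.g. extensions/replication/failedCRR/task.js,
    // running the failed-CRR consumer as its own process so the
    // Redis-updating logic stays out of the queue populator.
    const werelogs = require('werelogs');
    const config = require('../../../conf/Config'); // assumed config module
    const FailedCRRConsumer = require('./FailedCRRConsumer'); // assumed class

    const log = new werelogs.Logger('Backbeat:FailedCRRTask');
    const consumer = new FailedCRRConsumer(config.kafka);

    consumer.start(); // mirrors RetryConsumer.start() in this PR
    log.info('failed CRR consumer is running');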

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

* @param {Function} cb - The callback function
* @return {undefined}
*/
publishRetryEntry(message, cb) {
Contributor

To be renamed as well, to publishFailedCRREntry.

@@ -0,0 +1,61 @@
'use strict'; // eslint-disable-line strict
Contributor

Since the producer and consumer classes are specific to the replication service, I think they'd be better located somewhere in extensions/replication/ (maybe creating an extra failedCRR/ dir there makes sense).

@@ -9,10 +9,7 @@ werelogs.configure({
});

function getRedisClient() {
const redisConfig = Object.assign({}, config.redis, {
enableOfflineQueue: false,
Collaborator

It may actually be better if this offline queue is disabled. We already have a Kafka queue for the failed list. If Redis is unavailable, the operation should be retried at a later time.
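For reference, a minimal sketch of an ioredis client with the offline queue disabled, so commands fail fast while Redis is down instead of buffering; the connection settings are placeholders:

    const Redis = require('ioredis');

    const redis = new Redis({
        host: 'localhost', // placeholder; real values come from config.redis
        port: 6379,
        // Fail commands immediately while disconnected instead of
        // queueing them; retries are then driven by the Kafka topic.
        enableOfflineQueue: false,
    });

    redis.on('error', err => console.error('redis error:', err.message));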

@bennettbuchanan (Author) May 16, 2018

Perhaps we should have some kind of exponential backoff when performing Redis operations? I guess in that case, we would just push the entry back into the queue if it failed permanently (beyond the backoff limit).

I will look into potential solutions for this.

@bennettbuchanan (Author)

This seems like a good option as we can reuse the retry method from the BackbeatTask class.

@bennettbuchanan (Author)

Okay, just pushed a new commit with that functionality; it follows the same pattern used for retrying replication tasks. When the retry gives up, we push the entry back to the Kafka topic dedicated to tracking failures.
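A minimal sketch of that pattern, using a generic exponential-backoff helper rather than the actual BackbeatTask API; the attempt limit, base delay, and requeue callback are illustrative:

    // Retry a Redis operation with exponential backoff; once the
    // backoff limit is reached, hand the entry back to the Kafka
    // failures topic via the requeue callback.
    function retryWithBackoff(operation, requeue, attempt = 0) {
        const maxAttempts = 5;    // assumed limit
        const baseDelayMs = 1000; // assumed base delay
        operation(err => {
            if (!err) {
                return undefined;
            }
            if (attempt + 1 >= maxAttempts) {
                return requeue();
            }
            const delay = baseDelayMs * Math.pow(2, attempt);
            return setTimeout(
                () => retryWithBackoff(operation, requeue, attempt + 1),
                delay);
        });
    }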

const versionId = queueEntry.getEncodedVersionId();
const { site } = backend;
const message = {
field: `${bucket}:${key}:${versionId}:${site}`,
@bennettbuchanan (Author)

Since we've moved away from the Redis hash, field should be updated to key.

@bennettbuchanan (Author)

Will update in a subsequent PR so that we can maintain functionality when merging this one.

const { site } = backend;
const message = {
field: `${bucket}:${key}:${versionId}:${site}`,
value: Buffer.from(kafkaEntry.value).toString(),
@bennettbuchanan (Author)

Since we've moved away from storing object metadata in Redis, we can eliminate this.

@bennettbuchanan (Author)

Will update in a subsequent PR so that we can maintain functionality when merging this one.

@@ -126,38 +125,32 @@ class ReplicationStatusProcessor {
}

/**
* Set the Redis hash key for each failed backend.
* Push any failed entry to the "retry" topic.
Contributor

should be to the "failed" topic

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@ironman-machine (Contributor)

CONFLICT (add/add): Merge conflict in lib/queuePopulator/QueuePopulator.js
CONFLICT (add/add): Merge conflict in extensions/replication/utils/getRedisClient.js
CONFLICT (add/add): Merge conflict in extensions/replication/replicationStatusProcessor/ReplicationStatusProcessor.js
CONFLICT (add/add): Merge conflict in extensions/replication/ReplicationConfigValidator.js
CONFLICT (add/add): Merge conflict in conf/config.json

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

log,
}, err => {
if (err && err.retryable === true) {
return this._failedCRRProducer.setupProducer(err => {
Contributor

Not sure this is the right place to initialize the producer; IMO it should be done once and for all at the init stage.

@bennettbuchanan (Author) May 17, 2018

I had reservations as well, but didn't want to use a callback in the constructor. I'll move it to a separate method and call it when creating the instance.

Contributor

You can have a setup or init async method called separately from the constructor.
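A minimal sketch of that pattern; the class and method names mirror the hunk above, while the injected factory is an assumption that keeps the sketch self-contained:

    class FailedCRRProducer {
        // createProducer is an injected async factory; in backbeat this
        // would construct the Kafka producer for the failures topic.
        constructor(kafkaConfig, createProducer) {
            this._kafkaConfig = kafkaConfig;
            this._createProducer = createProducer;
            this._producer = null;
        }

        // Async setup kept out of the constructor and called once at
        // the init stage, instead of on every failed entry.
        setupProducer(cb) {
            this._createProducer(this._kafkaConfig, (err, producer) => {
                if (err) {
                    return cb(err);
                }
                this._producer = producer;
                return cb(null);
            });
        }
    }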

@bennettbuchanan (Author)

Updated to set it up when starting the consumer.

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@ironman-machine (Contributor)

CONFLICT (add/add): Merge conflict in lib/queuePopulator/QueuePopulator.js
CONFLICT (add/add): Merge conflict in extensions/replication/utils/getRedisClient.js
CONFLICT (add/add): Merge conflict in extensions/replication/replicationStatusProcessor/ReplicationStatusProcessor.js
CONFLICT (add/add): Merge conflict in extensions/replication/ReplicationConfigValidator.js
CONFLICT (add/add): Merge conflict in conf/config.json

@ironman-machine dismissed jonathan-gramain’s stale review May 17, 2018 21:19

Do it again human slave! :point_right: :runner: (Oh and the pull request has been updated, by the way.)

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@philipyoo previously approved these changes May 17, 2018
@ironman-machine dismissed stale reviews from jonathan-gramain and philipyoo May 25, 2018 23:40

Do it again human slave! :point_right: :runner: (Oh and the pull request has been updated, by the way.)

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@ironman-machine (Contributor)

CONFLICT (add/add): Merge conflict in lib/queuePopulator/QueuePopulator.js
CONFLICT (add/add): Merge conflict in extensions/replication/replicationStatusProcessor/ReplicationStatusProcessor.js
CONFLICT (add/add): Merge conflict in extensions/replication/ReplicationConfigValidator.js
CONFLICT (add/add): Merge conflict in conf/config.json

@ironman-machine (Contributor)

PR has been updated. Reviewers, please be cautious.

@jonathan-gramain merged commit bf2745b into z/1.0 May 29, 2018
@jonathan-gramain deleted the ft/S3C-1398/useFailedCRRTopic branch May 29, 2018 20:36