ft: S3C-1398 Use retry CRR topic #278
Conversation
Noticing this CI error intermittently:
I'm wondering if this is a known timing issue in the CI, in particular due to this comment. @jonathan-gramain
@bennettbuchanan thanks for reporting this. Looking at it, it could be a real bug: a Kafka consumer could throw when trying to commit its offset if it is not connected to Kafka, so we should guard this in a try/catch block IMO. On the tests side, I'm not sure why the chain of callbacks ends up on the consumer side directly from the producer side, but it may be possible to fix by serializing the creation of the BackbeatProducer until after the BackbeatConsumer has emitted the 'ready' event (so that it will be connected at this stage).
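As a minimal sketch of the suggested guard (the method and member names are illustrative, not the actual BackbeatConsumer code):

_safeCommit(log) {
    try {
        // node-rdkafka's commit() can throw if the consumer has lost its
        // connection to Kafka, so guard it rather than crashing the process.
        this._consumer.commit();
    } catch (err) {
        log.error('error committing offset to kafka', { error: err.message });
    }
}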
conf/config.json
Outdated
@@ -78,6 +78,7 @@
    },
    "topic": "backbeat-replication",
    "replicationStatusTopic": "backbeat-replication-status",
    "replicationRetryTopic": "backbeat-replication-retry",
I would prefer this topic to be called backbeat-replication-failures (plus the associated config items and classes), since it only catalogs entries that failed to be replicated; the retry API will be a service consuming this topic for items that may (or may not) be retried.
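The config entry above would then read, for example (the key name is hypothetical):

"replicationFailedTopic": "backbeat-replication-failures",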
backends.forEach(backend => {
    const { status, site } = backend;
    let message;
    for (let i = 0; i < backends.length; i++) {
Instead of looping and using break, may I suggest a clearer alternative:

const backend = backends.find(b => b.status === 'FAILED' && b.site === queueEntry.getSite());
if (backend) {
    const { status, site } = backend;
    // ...
    return this._retryProducer.publishRetryEntry(...);
}
lib/queuePopulator/QueuePopulator.js
Outdated
this._retryConsumer = new RetryConsumer(this.kafkaConfig);
this._retryConsumer.start();

this._retryProducer = new RetryProducer(this.kafkaConfig);
Is this producer used at all?
No, no it's not—I was following the metrics API design a bit too closely here. 😅
lib/queuePopulator/QueuePopulator.js
Outdated
 * @return {undefined}
 */
_setupRetryClients(cb) {
    this._retryConsumer = new RetryConsumer(this.kafkaConfig);
Let's try not to stretch the role of the queue populator beyond its metadata ingestion scope.
The producer is fine in the replication status processor IMO, because that's where the info is readily available to be published.
The consumer should be part of its own component that updates Redis, for separation of concerns.
I'm not sure I follow what separating the start of the consumer into its own component would look like. Should it be a backbeat "extension" with its own task? Maybe a good place is https://github.com/scality/backbeat/blob/master/extensions/replication/queueProcessor/task.js#L49. In any case, I noticed the metrics consumer was being started when opening the queue populator, so I added it here.
The idea is to follow a philosophy where each backbeat internal component should be runnable as an independent entity. All entities that belong to the same service (e.g. replication) will go under a single Kubernetes service. The general idea, as I understand it, is to make components more manageable, pluggable on demand, and scalable independently from each other.
In this case, I think it makes sense to create a new component for consuming the retry queue and updating Redis. In that case, there should indeed be an entrypoint that starts the process (task.js or service.js), and a set of classes doing the job (e.g. in tasks/ for the processing triggered by consumption of a single Kafka entry).
I think it was an oversight (or a shortcut) to put the metrics consumer in the queue populator; it should only contain the producer, and the consumer should be a separate process.
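A minimal sketch of what such an entrypoint could look like (the file layout, class name, and config shape are assumptions, not the actual backbeat code):

'use strict'; // eslint-disable-line strict

const werelogs = require('werelogs');
const config = require('../../../conf/Config');
// Hypothetical component that consumes the failed-CRR topic and updates Redis.
const FailedCRRConsumer = require('./FailedCRRConsumer');

const log = new werelogs.Logger('Backbeat:FailedCRRConsumer:task');
const consumer = new FailedCRRConsumer(config.kafka, config.redis);

consumer.start(err => {
    if (err) {
        log.error('could not start failed CRR consumer', { error: err.message });
        process.exit(1);
    }
    log.info('failed CRR consumer is ready');
});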
lib/FailedCRRProducer.js
Outdated
 * @param {Function} cb - The callback function
 * @return {undefined}
 */
publishRetryEntry(message, cb) {
To be renamed as well, to publishFailedCRREntry.
lib/FailedCRRProducer.js
Outdated
@@ -0,0 +1,61 @@
'use strict'; // eslint-disable-line strict
Since the producer and consumer classes are specific to the replication service, I think they'd be better located somewhere in extensions/replication/ (maybe creating an extra failedCRR/ dir there can make sense).
@@ -9,10 +9,7 @@ werelogs.configure({
});

function getRedisClient() {
    const redisConfig = Object.assign({}, config.redis, {
        enableOfflineQueue: false,
It may actually be better if this offline queue is disabled. We already have a Kafka queue for the failed list. If Redis is unavailable, the operation should be retried at a later time.
Perhaps we should have some kind of exponential backoff when performing Redis operations? I guess in that case, we would just push the entry back into the queue if it failed permanently (beyond the backoff limit).
I will look into potential solutions for this.
This seems like a good option, as we can reuse the retry method from the BackbeatTask class.
Okay, just pushed a new commit with that functionality; it follows the same pattern used for retrying replication tasks. When the retry gives up, we push the entry back to the Kafka topic dedicated to tracking failures.
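A rough sketch of that pattern (the retry options are modeled loosely on BackbeatTask, and the method and producer names here are assumptions):

_updateRedisWithBackoff(entry, log, done) {
    this.retry({
        actionDesc: 'update failed CRR entry in Redis',
        logger: log,
        actionFunc: cb => this._redis.set(entry.key, entry.value, cb),
        shouldRetryFunc: err => err && err.retryable,
    }, err => {
        if (err) {
            // The backoff limit was reached: push the entry back to the
            // failures topic so it is not lost while Redis is unavailable.
            return this._failedCRRProducer.publishFailedCRREntry(
                JSON.stringify(entry), done);
        }
        return done();
    });
}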
const versionId = queueEntry.getEncodedVersionId();
const { site } = backend;
const message = {
    field: `${bucket}:${key}:${versionId}:${site}`,
Since moving away from a Redis hash, field should be updated to key.
Will update in a subsequent PR so that we can maintain functionality when merging this one.
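For illustration, the renamed message could then look like this (hypothetical):

const message = {
    key: `${bucket}:${key}:${versionId}:${site}`,
};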
const { site } = backend;
const message = {
    field: `${bucket}:${key}:${versionId}:${site}`,
    value: Buffer.from(kafkaEntry.value).toString(),
Since moving away from storing object metadata in Redis, we can eliminate this.
Will update in a subsequent PR so that we can maintain functionality when merging this one.
@@ -126,38 +125,32 @@ class ReplicationStatusProcessor {
}

/**
 * Set the Redis hash key for each failed backend.
 * Push any failed entry to the "retry" topic.
Should be to the "failed" topic.
    log,
}, err => {
    if (err && err.retryable === true) {
        return this._failedCRRProducer.setupProducer(err => {
Not sure this is the right place to initialize the producer; IMO it should be done once and for all at the init stage.
I had reservations as well, but didn't want to use a callback in the constructor. I'll move it to a separate method and call it when creating the instance.
You can have a setup or init async method called separately from the constructor.
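For instance (a minimal sketch; the class shape and BackbeatProducer options are assumptions):

class FailedCRRProducer {
    constructor(kafkaConfig) {
        this._kafkaConfig = kafkaConfig;
        this._producer = null;
    }

    // Async setup is kept out of the constructor so callers can wait for
    // the producer to be ready before publishing any entry.
    setupProducer(cb) {
        const producer = new BackbeatProducer({
            kafka: { hosts: this._kafkaConfig.hosts },
            topic: 'backbeat-replication-failures',
        });
        producer.once('error', cb);
        producer.once('ready', () => {
            producer.removeAllListeners('error');
            this._producer = producer;
            cb();
        });
    }
}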
Updated to set up when starting the consumer.
Decouple the logic around making calls to Redis and updating object metadata, allowing use of ioredis's offline queue in cases where the connection to Redis is faulty. To that end, this PR alters the retry feature accordingly.
Depends on https://github.com/scality/Federation/pull/1514 for creating the topic.
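For reference, ioredis enables its offline queue by default, so a client created without overriding enableOfflineQueue buffers commands while disconnected (a sketch; the config shape is an assumption):

const Redis = require('ioredis');

// With enableOfflineQueue left at its default (true), commands issued while
// the connection is down are buffered and replayed once Redis reconnects.
const redisClient = new Redis({
    host: config.redis.host,
    port: config.redis.port,
});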