The indexer currently has logic that prevents it from running when the Kafka offset is rolled back. While this seemed like a good idea at the time, in the course of operating Kafka, administrators can take actions that cause the offset to be reset. Today, when this happens the indexer becomes completely stuck, and recovery can require deleting all of the data for that indexer (recovery tasks and snapshots).
We should rework this so that rollbacks are supported. Additionally, we may consider replacing the current approach, which calculates the offset from previous snapshots plus pending recovery tasks, with a separate ZK node that is only responsible for storing the offset for the indexer. This would make resets easier and would not require the indexer to load all of the snapshots into memory on boot, allowing quicker boot times.
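As a rough illustration of the dedicated-offset idea, here is a minimal sketch in Java. Everything in it is hypothetical and not taken from the Astra codebase: the `OffsetStore` interface, the in-memory stand-in used in place of a real ZK node, the `resolveStartOffset` and `operatorReset` helpers, and the policy of falling back to the earliest retained offset when nothing is stored. It only shows the shape of the approach: boot reads a single stored value instead of replaying snapshots, a rolled-back partition is detected by comparing against Kafka's latest offset, and recovery is a single explicit write rather than deleting indexer data.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

public class IndexerOffsetSketch {

  /** Hypothetical store that holds only the indexer's offset (stand-in for one ZK node). */
  interface OffsetStore {
    Optional<Long> read();

    void write(long offset);
  }

  /** In-memory stand-in so the sketch runs without ZooKeeper. */
  static final class InMemoryOffsetStore implements OffsetStore {
    private final AtomicReference<Long> value = new AtomicReference<>();

    @Override
    public Optional<Long> read() {
      return Optional.ofNullable(value.get());
    }

    @Override
    public void write(long offset) {
      value.set(offset);
    }
  }

  /**
   * Picks the offset the indexer should start from on boot.
   * Assumption: with no stored offset we start at the earliest retained offset.
   * If the stored offset is ahead of the partition's latest offset, the partition
   * was rolled back; we refuse to guess and require an explicit reset.
   */
  static long resolveStartOffset(OffsetStore store, long earliestKafkaOffset, long latestKafkaOffset) {
    Optional<Long> stored = store.read();
    if (!stored.isPresent()) {
      return earliestKafkaOffset;
    }
    long offset = stored.get();
    if (offset > latestKafkaOffset) {
      throw new IllegalStateException(
          "Stored offset " + offset + " is ahead of Kafka latest offset " + latestKafkaOffset
              + "; an operator must reset the stored offset");
    }
    // Never start before data that Kafka has already aged out.
    return Math.max(offset, earliestKafkaOffset);
  }

  /** Operator-driven reset: one write, no snapshot or recovery-task deletion. */
  static void operatorReset(OffsetStore store, long newOffset) {
    store.write(newOffset);
  }

  public static void main(String[] args) {
    OffsetStore store = new InMemoryOffsetStore();
    store.write(500L); // offset recorded before the partition was rolled back

    try {
      resolveStartOffset(store, 0L, 120L);
    } catch (IllegalStateException e) {
      System.out.println("Indexer refused to start: " + e.getMessage());
    }

    operatorReset(store, 100L);
    System.out.println("Start offset after reset: " + resolveStartOffset(store, 0L, 120L));
  }
}
```

A real implementation would need to persist the value in ZooKeeper and decide how the stored offset interacts with existing snapshots and recovery tasks, which this sketch deliberately leaves out.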
Closing this as resolved in #1064. That PR added new manager functionality that lets operators manually reset the stored offset. Automatically resetting the offset could have unintended consequences with dual indexing, so keeping this an operator-controlled action is likely the best option for now.
Referenced code: astra/astra/src/main/java/com/slack/astra/server/RecoveryTaskCreator.java, lines 244 to 251 in 9838d38