PXP-7805 Audit Service SQS #1603
Conversation
I'm sorry to add this comment late in the game, but this PR makes me think about our coupling with AWS. I know we're already pretty tightly coupled, but were there any abstract, open-source messaging platforms we considered before SQS? Realistically we're pretty far into the AWS world already, so maybe it's not even worth considering.
gen3/bin/kube-setup-audit-service.sh
```sh
local saName
local roleName

# fence can push messages to the queue
```
(thinking out loud) I wonder if this should be in kube-setup-fence-service.sh. I could see an argument for having it here, but I also think that the audit-service only cares about the receiving policy, and the senders should be responsible for their ability to put to the queue. But then again, the queue is created here, so 🤷

Maybe it makes sense here since the audit-service "owns" this queue, but I guess I also wonder about the implementation and the order we roll services. Do we have a dependency sequence now that's not obvious? Do we have a way of making sure the apps wait to init until after the queue is created and they're able to put/get from it?
Yeah, I could move it to kube-setup-fence-service.sh. The audit SQS and policies would get created even if the audit-service is not in use though, but that's maybe not a big deal.

> the order we roll services

Right now fence rolls after audit-service.

> Do we have a way of making sure the apps wait to init until after the queue is created and they're able to put/get from it?

The SQS queue and SQS policies are set up during kube-setup, and the service container doesn't start until kube-setup completes, so I think we should be good?
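That said, if we ever wanted the services to be defensive about ordering rather than relying on kube-setup running first, a startup check along these lines could work. This is only a sketch, assuming boto3 with default AWS credentials; `wait_for_queue` and the timeout values are hypothetical, not existing fence or audit-service code:

```python
import time

import boto3


def wait_for_queue(queue_name: str, timeout: int = 300, interval: int = 10) -> str:
    """Poll until the queue exists and return its URL, or raise on timeout."""
    sqs = boto3.client("sqs")
    deadline = time.monotonic() + timeout
    while True:
        try:
            return sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
        except sqs.exceptions.QueueDoesNotExist:
            if time.monotonic() > deadline:
                raise TimeoutError(f"queue {queue_name!r} not found after {timeout}s")
            time.sleep(interval)
```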
Ah, good point. I didn't think about the fact that setup would always run first anyway. And good point about fence potentially having unused policies!
```hcl
# 5 min visibility timeout; avoid consuming the same message twice
visibility_timeout_seconds = 300
# 1209600s = 14 days (max value); time AWS will keep unread messages in the queue
message_retention_seconds = 1209600
```
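For context on how the visibility timeout plays out on the consumer side: a received message stays invisible to other consumers for 300s and is only gone for good once it's deleted. A minimal boto3 sketch, not the actual audit-service code; the queue name and `process` function are hypothetical:

```python
import boto3


def process(body: str) -> None:
    """Hypothetical stand-in for the audit-service's real log handling."""
    print(body)


sqs = boto3.client("sqs")
# "audit-service-sqs" is a hypothetical queue name
queue_url = sqs.get_queue_url(QueueName="audit-service-sqs")["QueueUrl"]

while True:
    # long-poll for up to 20s; delivered messages become invisible to other
    # consumers for the queue's 300s visibility timeout
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])  # must finish within the visibility timeout
        # deleting acknowledges the message; if we crash before this line,
        # it reappears after 300s and gets retried (at-least-once delivery)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```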
Do we have any alerting/monitoring on the queues? Even with 14 days of retention, we could potentially lose messages in the queue. If that happened on a Commons running audit-service and we didn't detect it, we could lose audit history. If there was an AWS outage or a processing backup, we could also end up with a huge backlog for the audit-service to process.
I haven't looked into that. We might be able to set up CloudWatch monitoring.
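For instance, SQS publishes an `ApproximateAgeOfOldestMessage` metric that could alarm well before the 14-day retention limit silently drops anything. A sketch assuming boto3; the alarm name, queue name, and SNS topic ARN are all hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# alarm if the oldest message sits unprocessed for more than a day,
# long before the 14-day retention limit deletes it
cloudwatch.put_metric_alarm(
    AlarmName="audit-service-sqs-backlog",  # hypothetical name
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "audit-service-sqs"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=86400,  # seconds
    ComparisonOperator="GreaterThanThreshold",
    # hypothetical SNS topic for ops notifications
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```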
I created ticket DEVOPS-52 to look into this.
Overall I think this makes sense. I do have some broad/vague questions about how we handle queues and similar services in a way that isn't AWS-specific, but I don't know enough to speak to our general direction on that.
Adding SQS also gives us one more service to monitor/watch. Not saying that's necessarily bad, but it's more operational overhead for instance admins to consider.
No, we already use SQS for multiple use cases (ssjdispatcher, DCF replication). I actually chose SQS because we already use it, and I thought we would already have code in cloud-automation ready for me to use. Turns out that wasn't the case since the existing code is not generic, but I tried to make the new code generic. To counter that, I am writing the new fence code and audit-service code in a way that'll make it easy to add support for other types of queue. But I think it's fair to assume cloud-automation can deploy using AWS.
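The shape of that seam might look something like this. This is not the actual fence or audit-service code; `PushQueue` and `SQSPushQueue` are hypothetical names, assuming boto3 on the SQS side:

```python
from abc import ABC, abstractmethod

import boto3


class PushQueue(ABC):
    """Sender-side interface; fence-like code depends on this, not on SQS."""

    @abstractmethod
    def push(self, message: str) -> None:
        ...


class SQSPushQueue(PushQueue):
    """SQS-backed implementation."""

    def __init__(self, queue_url: str):
        self._sqs = boto3.client("sqs")
        self._queue_url = queue_url

    def push(self, message: str) -> None:
        self._sqs.send_message(QueueUrl=self._queue_url, MessageBody=message)
```

Supporting another queue type later would then mean adding one more subclass, without touching the callers.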
Thanks for all your replies @paulineribeyre! That makes a lot of sense. I completely forgot SQS is already used in a place I'd seen with batch jobs 🙄
Jira Ticket: PXP-7805

Goes with uc-cdis/audit-service#2 and uc-cdis/fence#923.

Breaking change for environments that already have the audit service deployed: the new version of the audit service will require running `gen3 kube-setup-audit-service` again and adding the following to the audit service configuration file:

New Features

Deployment changes
- Run `gen3 kube-setup-audit-service` and `gen3 kube-setup-fence` again, and update the audit-service and fence configuration files