
PXP-7805 Audit Service SQS #1603

Merged (9 commits) on Jun 11, 2021

Conversation

@paulineribeyre (Contributor) commented on May 11, 2021:

Jira Ticket: PXP-7805

Goes with uc-cdis/audit-service#2 and uc-cdis/fence#923.

Breaking change for environments that already have the audit service deployed: the new version of the audit service will require running gen3 kube-setup-audit-service again and adding the following to the audit service configuration file:

PULL_FROM_QUEUE: true
QUEUE_CONFIG:
  type: aws_sqs
  sqs_url: <sqs url>
  region: <sqs region>

New Features

  • New CLI module "gen3 sqs" to manage AWS SQS queues
  • Setting up the audit service now involves the creation of an AWS SQS queue
  • Fence now uses the service account "fence-sa", which has permission to push messages to the audit SQS queue
  • The audit service now uses the service account "audit-service-sa", which has permission to read messages from the audit SQS queue (see the push/pull sketch after this list)
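
As a rough illustration of that push/pull split (a hedged sketch only, not the actual fence or audit-service code; the queue URL, region, and message fields below are placeholders standing in for the QUEUE_CONFIG values above):

# Hypothetical sketch: fence pushes audit records, the audit service pulls them.
# Queue URL, region, and message fields are placeholders, not real resources.
import json
import boto3

SQS_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/audit-sqs"  # placeholder
sqs = boto3.client("sqs", region_name="us-east-1")  # the QUEUE_CONFIG "region"

# fence side: push one audit record onto the queue
sqs.send_message(
    QueueUrl=SQS_URL,
    MessageBody=json.dumps({"category": "presigned_url", "username": "someuser"}),
)

# audit-service side (PULL_FROM_QUEUE: true): poll, process, then delete
response = sqs.receive_message(QueueUrl=SQS_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for message in response.get("Messages", []):
    record = json.loads(message["Body"])  # process the audit record here
    sqs.delete_message(QueueUrl=SQS_URL, ReceiptHandle=message["ReceiptHandle"])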

Deployment changes

  • The new version of the audit service will require running gen3 kube-setup-audit-service and gen3 kube-setup-fence again, and updating the audit-service and fence configuration files

@williamhaley (Contributor):

I'm sorry to add this comment late in the game, but this PR makes me think about our coupling with AWS. I know we're already pretty tightly coupled, but I was wondering if there are abstract open source messaging platforms we considered before SQS? Realistically I think we're pretty far into AWS world so maybe it's not even worth considering.

local saName
local roleName

# fence can push messages to the queue
Contributor:

(thinking out loud)

I wonder if this should be in kube-setup-fence-service.sh. I could see an argument for having it here, but I also think that the audit-service only cares about the receiving policy, and the senders should be responsible for their ability to put to the queue. But then again the queue is created here so 🤷

Maybe it makes sense here since the audit-service "owns" this queue, but I guess I also wonder about the implementation and the order we roll services. Do we have a dependency sequence now that's not obvious? Do we have a way of making sure the apps wait to init until after the queue is created and they're able to put/get from it?

Contributor (author):

Yeah, I could move it to kube-setup-fence-service.sh. The audit SQS and policies would then get created even if the audit-service is not in use, but that's maybe not a big deal.

> the order we roll services

Right now fence rolls after audit-service.

> Do we have a way of making sure the apps wait to init until after the queue is created and they're able to put/get from it?

The SQS and the SQS policies are set up during kube-setup, and the service container doesn't start until kube-setup completes, so I think we should be good?

Contributor:

Ah, good point. I didn't think about the fact that setup would always run first anyway. And good point about fence potentially having unused policies!

# 5 min visibility timeout; avoid consuming the same message twice
visibility_timeout_seconds = 300
# 1209600s = 14 days (max value); time AWS will keep unread messages in the queue
message_retention_seconds = 1209600
Contributor:

Do we have any alerting/monitoring on the queues? Even with 14 days retention we could potentially lose messages in the queue. If that happened on a Commons running audit-service and we didn't detect it we could lose audit history. If there was an AWS outage/processing backup we could also potentially have a huge backlog for the audit-service to process.

Contributor (author):

I haven't looked into that. We might be able to set up CloudWatch monitoring.

Contributor (author):

I created ticket DEVOPS-52 to look into this.
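
For reference, a minimal sketch of the kind of alert that ticket could cover (a hedged illustration only; the queue name, threshold, and SNS topic below are placeholders, not existing resources). An alarm on the age of the oldest message would flag a processing backlog well before the 14-day retention limit is reached:

# Hypothetical CloudWatch alarm for the audit queue; names and ARNs are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="audit-sqs-oldest-message-age",
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "audit-sqs"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=604800,  # 7 days in seconds, half of the 14-day retention
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts-topic"],  # placeholder SNS topic
)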

@williamhaley (Contributor) previously approved these changes on May 13, 2021 and left a comment:

Overall I think this makes sense. I do have some broad/vague questions about how we handle queues and similar services in a way that isn't AWS-specific, but I don't know enough to speak to our general direction on that.

Adding SQS also gives us one more service to monitor/watch. Not saying that's necessarily bad, but more operational overhead for instance admins to consider.

@paulineribeyre (Contributor, author):

@williamhaley

> our coupling with AWS. I know we're already pretty tightly coupled, but I was wondering if there are abstract open source messaging platforms we considered before SQS? Realistically I think we're pretty far into AWS world so maybe it's not even worth considering.

> Adding SQS also gives us one more service to monitor/watch. Not saying that's necessarily bad, but more operational overhead for instance admins to consider.

No, we already use SQS for multiple use cases (ssjdispatcher, DCF replication). I actually chose to use SQS because we already use it and I thought we would already have code in cloud-automation ready for me to use. Turns out that wasn't the case since the existing code is not generic, but I tried to make the new gen3 sqs module in this PR generic. I do agree with your point about AWS coupling, but we already heavily use AWS and SQS. Introducing a new messaging platform would mean more operational overhead for instance admins.

To counter that, I am writing the new fence code and audit-service code in a way that'll make it easy to add support for other types of queues. But I think it's fair to assume cloud-automation can deploy using AWS.
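
To illustrate what that kind of abstraction could look like (a hedged sketch only; the class and method names here are hypothetical and not the actual fence or audit-service code):

# Hypothetical pluggable queue backend; names are illustrative only.
from abc import ABC, abstractmethod

import boto3


class QueueBackend(ABC):
    """Minimal interface a service could code against instead of calling SQS directly."""

    @abstractmethod
    def push(self, message_body: str) -> None: ...

    @abstractmethod
    def pull(self, max_messages: int = 10) -> list: ...


class SQSBackend(QueueBackend):
    """Backend used when the QUEUE_CONFIG type is aws_sqs."""

    def __init__(self, sqs_url: str, region: str):
        self.sqs_url = sqs_url
        self.client = boto3.client("sqs", region_name=region)

    def push(self, message_body: str) -> None:
        self.client.send_message(QueueUrl=self.sqs_url, MessageBody=message_body)

    def pull(self, max_messages: int = 10) -> list:
        response = self.client.receive_message(
            QueueUrl=self.sqs_url, MaxNumberOfMessages=max_messages, WaitTimeSeconds=20
        )
        return response.get("Messages", [])

Another QUEUE_CONFIG "type" value would simply map to a different QueueBackend implementation.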

@williamhaley (Contributor):

Thanks for all your replies @paulineribeyre! That makes a lot of sense. I completely forgot SQS is already used in a place I'd seen with batch jobs 🙄

williamhaley previously approved these changes on May 18, 2021
williamhaley previously approved these changes on Jun 9, 2021
williamhaley previously approved these changes on Jun 10, 2021