Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deploy task for post move survey email #2655

Merged
merged 23 commits into from Sep 16, 2019

Conversation

@mkrump
Copy link
Contributor

commented Sep 5, 2019

Description

Supports transcom/ppp-infra#604

Add deploy task for post move survey email

References

@mkrump mkrump requested review from chrisgilmerproj, ralren, mr337 and rdhariwal Sep 5, 2019

@mkrump mkrump added the roci label Sep 5, 2019

@chrisgilmerproj
Copy link
Contributor

left a comment

Looks good, let's do it on experimental.

# TODO remove --offset-days from args before merge to default back to 15 days.
- deploy:
name: Deploy post move email survey task service
command: scripts/do-exclusively --job-name ${CIRCLE_JOB} bin/ecs-deploy-task-container --aws-account-id ${AWS_ACCOUNT_ID} --aws-region ${AWS_DEFAULT_REGION} --service app --environment ${APP_ENVIRONMENT} --repository-name app-tasks --image-tag git-${CIRCLE_SHA1} --command send-post-move-survey-email --command-args "--db-iam --db-iam-role arn:aws:iam::923914045601:role/ecs-rds-role-app-${APP_ENVIRONMENT} --db-region us-west-2 --offset-days 1"

This comment has been minimized.

Copy link
@mr337

mr337 Sep 5, 2019

Contributor

More curious than anything, where is the --offset-days coming from? Can't find it in the do-exclusively or ecs-deploy-task-container program.

This comment has been minimized.

Copy link
@mkrump

mkrump Sep 5, 2019

Author Contributor

possible that might not work as structured, but was attempting to pass a command line flag to binary.

This comment has been minimized.

Copy link
@chrisgilmerproj

chrisgilmerproj Sep 5, 2019

Contributor

It's correct, basically you're passing through an argument to the Task Definition which is running the Docker Container which itself is running the binary. It's that binary that has --offset-days.

@codecov

This comment has been minimized.

Copy link

commented Sep 5, 2019

Codecov Report

Merging #2655 into master will increase coverage by 0.1%.
The diff coverage is 23.6%.

@@           Coverage Diff            @@
##           master   #2655     +/-   ##
========================================
+ Coverage    56.1%   56.2%   +0.1%     
========================================
  Files         271     272      +1     
  Lines       12427   12350     -77     
========================================
- Hits         6971    6934     -37     
+ Misses       4760    4724     -36     
+ Partials      696     692      -4
Impacted Files Coverage Δ
pkg/notifications/notification_sender.go 0% <0%> (ø) ⬆️
pkg/notifications/move_reviewed.go 72% <66.7%> (+0.4%) ⬆️
pkg/paperwork/forms.go 48.4% <0%> (-2%) ⬇️
pkg/services/accesscode/claim_access_code.go 100% <0%> (ø)
mkrump added 2 commits Sep 6, 2019
@chrisgilmerproj
Copy link
Contributor

left a comment

The command is missing CheckEmailFlags(v) which we get from https://github.com/transcom/mymove/blob/master/pkg/cli/email.go#L29 in this section of code: https://github.com/transcom/mymove/blob/master/cmd/send_post_move_survey_email/main.go#L28-L38.

The rule is that every time we use Init<something>Flags we also need to include Check<something>Flags.

But its also unclear why the client didn't throw an error when we configured it without a domain or region. We should figure that out too.

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 6, 2019

There's no validation done here when setting up the notification sender to see if it can send email: https://github.com/transcom/mymove/blob/master/pkg/notifications/notification_sender.go#L65-L76

CheckEmailFlags is supposed to do at least a minimum of configuration checking. But would be nice if we had something similar to the DB Ping() that we use to make sure we can talk to Postgres before connecting. A good example might be aws ses get-account-sending-enabled ....

@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

But its also unclear why the client didn't throw an error when we configured it without a domain or region. We should figure that out too.

Added CheckEmailFlags(v). I think that there was no error b/c it's getting from the local environment (at least for me I see).

(base) Projects/mymove - [mk-168035083-deploy-ecs-email-task●] » env | grep -i aws_ses
AWS_SES_REGION=us-west-2
AWS_SES_DOMAIN=devlocal.dp3.us
Add startup checks
- check GetAccountSendingEnabled = true when interacting with ses
- add cli.CheckEmail to cmd/send_post_move_survey_email
@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

A good example might be aws ses get-account-sending-enabled

In ff2a310 added a call to get-account-sending-enabled. Right now an error results in a log.Fatal just wanted to confirm that is the behavior that you are expecting.

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2019

In ff2a310 added a call to get-account-sending-enabled. Right now an error results in a log.Fatal just wanted to confirm that is the behavior that you are expecting.

Yes! This is exactly what we need. If we expect to use email we need to make sure its enabled for use. Thanks for adding this.

@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

@chrisgilmerproj linda got an email over the weekend. if everything else looks ok, should be good to go then (after updating the offset). Any idea why the deploy to experimental failed though? Should I just re-run to make sure that it deploys successfully?

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2019

@chrisgilmerproj linda got an email over the weekend. if everything else looks ok, should be good to go then (after updating the offset). Any idea why the deploy to experimental failed though? Should I just re-run to make sure that it deploys successfully?

Link me to the failure and I'll look. I'm thinking we still have a little more work to do on this utility before we want to send it out to Production. In particular we ought to make sure that it has exponential backoff in case we get rate limited. I'm also worried about the tooling we have around sending/resending if there are problems.

@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 9, 2019

Wanted to clarify any additional changes needed before merge.

I'm thinking we still have a little more work to do on this utility before we want to send it out to Production. In particular we ought to make sure that it has exponential backoff in case we get rate limited. I'm also worried about the tooling we have around sending/resending if there are problems.

For the rate limiting, I thought that we'd agreed that a future story was sufficient. We probably don't even need to get that complicated. A small delay would probably suffice since the limit is 80 emails/second and it looks like it needs to be sustained, not just quickly hit https://docs.aws.amazon.com/ses/latest/DeveloperGuide/manage-sending-limits.html?icmpid=docs_ses_console.

I'm also worried about the tooling we have around sending/resending if there are problems.

I wasn't sure if looking for a change here. But I think this probably applies to all emails that are sent at the moment though, not just these.

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2019

For the rate limiting, I thought that we'd agreed that a future story was sufficient. We probably don't even need to get that complicated. A small delay would probably suffice since the limit is 80 emails/second and it looks like it needs to be sustained, not just quickly hit https://docs.aws.amazon.com/ses/latest/DeveloperGuide/manage-sending-limits.html?icmpid=docs_ses_console.

I see the confusion. I expected a second story would be fine if we finished it before going to production. I think you heard it as saying we could push to production and then revisit. I'm nervous that it will get dropped if we do that and I'd rather have it in hand ahead of time. I'm open to opinions on that though (whether it will get done in a reasonable time frame).

I wasn't sure if looking for a change here. But I think this probably applies to all emails that are sent at the moment though, not just these.

There's no real admin for managing emails. So if a bad email goes out we don't really have re-call. We don't have the ability to re-send. There's pretty much no dials or switches that infra can use or app devs can use to manage the emails. It feels a bit risky to launch without something that would let us manage the emails that get picked up by the scheduled task. If you disagree that's fine, and you're right that its a larger problem, but we are putting ourselves in a tougher situation without having something in hand.

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 9, 2019

Oh, the other thing is I was hoping to combine the fuel task and the email task into a single binary with sub-commands. Would that be in this PR?

@jim

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2019

Oh, the other thing is I was hoping to combine the fuel task and the email task into a single binary with sub-commands. Would that be in this PR?

I think that needs to be in another story.

There's no real admin for managing emails. So if a bad email goes out we don't really have re-call. We don't have the ability to re-send. There's pretty much no dials or switches that infra can use or app devs can use to manage the emails. It feels a bit risky to launch without something that would let us manage the emails that get picked up by the scheduled task. If you disagree that's fine, and you're right that its a larger problem, but we are putting ourselves in a tougher situation without having something in hand.

This is true, but we're already sending emails in other parts of the app that this is also true for. It sounds like we need to plan out some additional work to have better email infrastructure, but I don't think we should block this story on that.

The emails in question here are survey emails that are sent after the move is completed, so a delivery failure is an acceptable risk in this case.

@jim

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2019

In particular we ought to make sure that it has exponential backoff in case we get rate limited.

This is something we can iterate on going forward. We have a very small number of users in the system right now and it's unlikely we're going to hit such a rate limit in the near future.

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2019

@jim @mkrump - we can move forward on this but I don't consider this to be "done". There's clearly work that needs to be done on the email system that we're discovering now that we're modifying it and I fear it won't actually be addressed. Also the scheduled tasks (fuel, email) really need to be combined into a binary so we can start looking at doing acceptance testing before deploys as currently there is no test that validates the binaries won't fail in production.

I'll give my approval I just don't want us to never come back to these issues as they will absolutely bite us in the future.

@chrisgilmerproj
Copy link
Contributor

left a comment

My approval is conditional on the understanding that I don't consider this to be the last story to be done to finish the task work.

@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 10, 2019

I can do this.

Also the scheduled tasks (fuel, email) really need to be combined into a binary so we can start looking at doing acceptance testing before deploys as currently there is no test that validates the binaries won't fail in production.

Is there a middle ground on the other stuff? Like the concern about rate limiting? SES says our burst rate limit is 80 per 1 sec (or an email every 12.5ms). Would adding a sleep to the loop for 20 ms help alleviate that concern?

The admin for managing emails really does feel like something separate that we need to probably talk to product about (unless this is more straightforward than i'm thinking).

@chrisgilmerproj

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2019

Is there a middle ground on the other stuff? Like the concern about rate limiting? SES says our burst rate limit is 80 per 1 sec (or an email every 12.5ms). Would adding a sleep to the loop for 20 ms help alleviate that concern?

Yeah, that would be fine. Anything that keeps us out of the danger zone for rate limiting will help us. My real concern there is that the error we hit when we do get rate limited will be inscrutable so we might waste time trying to figure out what's happening.

The admin for managing emails really does feel like something separate that we need to probably talk to product about (unless this is more straightforward than i'm thinking).

I'm ok having the admin in another ticket. I just don't know what we'll do about resending because honestly I'm not even sure how we capture information on send failures to know something went wrong. It just feels like we're building a system where we have no real visibility into our success/failure just anecdotal evidence that it worked while we were looking at it. Kind of like a Schrodinger's email box, works when we watch, no idea what it does when we're not looking.

This makes me think that we need to be capturing email send success/failure stats somewhere with some kind of actionable alert. And if we have something like that we need to write it up so the on-call person knows what to do if we start failing a lot.

@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 10, 2019

@chrisgilmerproj and @jim.

I added a sleep 9db08a3 to address the concerns about possible rate limiting.

Also, I took a stab at combining under single binary c52b61c. This feels like a pretty big change to make at the end of the pr, so could I get some careful eyes on that? I especially don't want to mess up anything with existing save-fuel-price data.

Experimental was occupied at the moment, but was hoping when that frees up, that we could do a final round of manual testing and then get this merged in tomorrow (if possible).

.circleci/config.yml Outdated Show resolved Hide resolved
.circleci/config.yml Outdated Show resolved Hide resolved
@chrisgilmerproj
Copy link
Contributor

left a comment

I added a few comments which we probably would catch in deploying to experimental.

Also I'm thinking we should rename this from ecs-tasks to milmove-tasks to make it more clear that this is related to the MilMove app and not related to ECS management.

@mkrump
Copy link
Contributor Author

left a comment

I added a few comments which we probably would catch in deploying to experimental.

Also I'm thinking we should rename this from ecs-tasks to milmove-tasks to make it more clear that this is related to the MilMove app and not related to ECS management.

renamed to milmove-tasks in 1b210ab

mkrump added 3 commits Sep 11, 2019
@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 12, 2019

@jim @chrisgilmerproj - i think i covered all the requested changes. after we manually test this on experimental is this good to go?

var commandName = s[0]
if len(s) > 1 {
commandName = s[1]
}

This comment has been minimized.

Copy link
@chrisgilmerproj

chrisgilmerproj Sep 12, 2019

Contributor

nice fix!

@chrisgilmerproj
Copy link
Contributor

left a comment

Code changes look good to go for me! Deploy to experimental (when its ready) and check that both your command and the fuel price command are running. I can modify the cron job for the fuel task as well. After that I think you have one TODO item to remove (about offset) and I'm happy to approve.

@mkrump

This comment has been minimized.

Copy link
Contributor Author

commented Sep 13, 2019

reverted the experimental deploy and changed offset back to 15 days, so i think on monday this is ready to merge.

completionCommand := &cobra.Command{
Use: "completion",
Short: "Generates bash completion scripts",
Long: "To install completion scripts run:\n\nmilmove completion > /usr/local/etc/bash_completion.d/milmove",

This comment has been minimized.

Copy link
@chrisgilmerproj

chrisgilmerproj Sep 13, 2019

Contributor
Suggested change
Long: "To install completion scripts run:\n\nmilmove completion > /usr/local/etc/bash_completion.d/milmove",
Long: "To install completion scripts run:\n\nmilmove-tasks completion > /usr/local/etc/bash_completion.d/milmove-tasks",

This comment has been minimized.

Copy link
@mkrump

mkrump Sep 16, 2019

Author Contributor

👍 updated.

@chrisgilmerproj
Copy link
Contributor

left a comment

🚀 - This is great. Thanks for doing all this hard work I think it really paid off.

I left one comment for you but I'm not going to hang up on it.

@mkrump mkrump merged commit 6ccd934 into master Sep 16, 2019

15 of 16 checks passed

codecov/patch/go 23.6% of diff hit (target 55.9%)
Details
ci/circleci: acceptance_tests_experimental Your tests passed on CircleCI!
Details
ci/circleci: acceptance_tests_local Your tests passed on CircleCI!
Details
ci/circleci: acceptance_tests_staging Your tests passed on CircleCI!
Details
ci/circleci: build_app Your tests passed on CircleCI!
Details
ci/circleci: build_migrations Your tests passed on CircleCI!
Details
ci/circleci: build_tasks Your tests passed on CircleCI!
Details
ci/circleci: build_tools Your tests passed on CircleCI!
Details
ci/circleci: client_test Your tests passed on CircleCI!
Details
ci/circleci: integration_tests Your tests passed on CircleCI!
Details
ci/circleci: pre_deps_golang Your tests passed on CircleCI!
Details
ci/circleci: pre_deps_yarn Your tests passed on CircleCI!
Details
ci/circleci: pre_test Your tests passed on CircleCI!
Details
ci/circleci: server_test Your tests passed on CircleCI!
Details
ci/circleci: server_test_coverage Your tests passed on CircleCI!
Details
codecov/project/go 55.9% (+0.1%) compared to dd5d25f
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants
You can’t perform that action at this time.