S3 backup for Waiter Service logs on K8s #547

Merged
merged 1 commit into twosigma:master from DaoWen:feature/logs-s3 on Jan 15, 2019

Conversation

2 participants
@DaoWen
Contributor

DaoWen commented Jan 10, 2019

Changes proposed in this PR

  • Add a dockerized S3-compatible server to our k8s (heavy) tests
  • Modify waiter-init to optionally copy log files (stdout, stderr) to S3 at pod termination

Why are we making these changes?

The files in a pod disappear when it exits. Persisting some logs (stdout & stderr) in S3 improves our debugging story.
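
The termination-time behavior described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual script: the `WAITER_LOG_BUCKET_URL` variable name and the `upload` helper are hypothetical stand-ins.

```shell
# Hedged sketch: at pod termination, copy stdout/stderr to S3 only when
# a log bucket is configured. WAITER_LOG_BUCKET_URL and upload() are
# hypothetical stand-ins for the PR's real configuration and curl PUT.
log_bucket_url="${WAITER_LOG_BUCKET_URL:-}"

upload() {
    # stand-in for the real transfer (the PR PUTs each file with curl)
    echo "upload $1 -> $2"
}

if [ -n "$log_bucket_url" ]; then
    for f in stdout stderr; do
        upload "$f" "$log_bucket_url/$f"
    done
else
    echo "no log bucket configured; skipping upload"
fi
```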

@DaoWen DaoWen requested a review from shamsimam Jan 10, 2019

@DaoWen DaoWen added this to To do in Waiter on Kubernetes via automation Jan 10, 2019

@DaoWen DaoWen force-pushed the DaoWen:feature/logs-s3 branch from 4e4ed97 to 7323a0d Jan 10, 2019

@DaoWen

Contributor Author

DaoWen commented Jan 10, 2019

@shamsimam - FYI, I don't think you've started reviewing yet, but I just rebased this onto master and squashed the big stack of commits I had before.

# we upload the stdout and stderr, and build an index.json file
# that is also uploaded to the target directory in the S3 bucket.
cd "$waiter_sandbox_base_dir"
for i in $(seq 0 $waiter_restart_count); do

@shamsimam

shamsimam Jan 13, 2019

Contributor

Should we loop in the reverse direction if we are worried the uploads may not complete all the way?

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

Sure, we can loop backwards. I'll change that.

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

Done.
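
The reversed loop agreed on above might look like the sketch below; `waiter_restart_count=2` is an illustrative value. Iterating from the newest restart directory down to r0 presumably means the most recent logs upload first if the sync window runs out.

```shell
# Sketch of the reversed loop: iterate restart indices newest-first,
# so the most relevant logs are uploaded before any cutoff.
waiter_restart_count=2   # illustrative value
order=""
for i in $(seq "$waiter_restart_count" -1 0); do
  order="$order$i "      # record visit order: newest restart index first
done
echo "$order"
```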

# Install AWS CLI (for s3api commands) via pip
# We use this to handle S3 authentication for bucket creation
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
pip install awscli --upgrade --user

@shamsimam

shamsimam Jan 13, 2019

Contributor

Should we do this pip install before looping to wait for the S3 server to start up?

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

Yes, having this fail after the S3 server startup would be unfortunate. I'll move this.

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

Done.
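
The reordering agreed on above (pip install before the server-startup wait) can be sketched like this; `pip_install` and `check_s3_ready` are stubs standing in for the real commands, not the PR's actual helpers.

```shell
# Fail fast: run the pip install before polling for the S3 server, so a
# pip failure surfaces immediately rather than after the startup wait.
pip_install() { echo "pip install awscli --upgrade --user"; }  # stub
check_s3_ready() { [ "$1" -ge 2 ]; }   # stub: reports ready on the 3rd poll

pip_install
for attempt in 0 1 2 3 4; do
  if check_s3_ready "$attempt"; then
    echo "s3 server ready"
    break
  fi
  echo "waiting for s3 server (attempt $attempt)"
done
```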

;; We include the default log-bucket-sync-secs value in the total-sigkill-delay-secs
;; delay iff the log-bucket-url setting was given in the scheduler config.
log-bucket-sync-secs (if log-bucket-url (:log-bucket-sync-secs context) 0)
total-sigkill-delay-secs (+ pod-sigkill-delay-secs log-bucket-sync-secs)

@shamsimam

shamsimam Jan 13, 2019

Contributor

Do we need to validate if total-sigkill-delay-secs isn't too large?

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

Both pod-sigkill-delay-secs and log-bucket-sync-secs have reasonable limits, so the sum of the two should also be OK.
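
The sum stays bounded because each operand is validated separately; a quick arithmetic sketch of the conditional computation, with hypothetical values:

```shell
# log-bucket-sync-secs only contributes when log-bucket-url is set
# (and is validated to [0, 300] elsewhere). Values are hypothetical.
log_bucket_url="http://s3.example.test/waiter-logs"
configured_sync_secs=150
pod_sigkill_delay_secs=30

if [ -n "$log_bucket_url" ]; then
  log_bucket_sync_secs="$configured_sync_secs"
else
  log_bucket_sync_secs=0
fi
total_sigkill_delay_secs=$((pod_sigkill_delay_secs + log_bucket_sync_secs))
echo "$total_sigkill_delay_secs"   # → 180
```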

@@ -873,6 +907,8 @@
(re-matches #"https?" fileserver-scheme)
(utils/pos-int? (:socket-timeout http-options))
(utils/pos-int? (:conn-timeout http-options))
(and (number? log-bucket-sync-secs) (<= 0 log-bucket-sync-secs 300))

@shamsimam

shamsimam Jan 13, 2019

Contributor

Is 300 seconds too large an amount of time to wait before killing a pod?

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

I think it's too long, but some other user might not. I don't plan to set this property to 300 seconds, but I don't think we need to make the max value lower here.

logfile="r$i/$f"
# Using the -T option with curl PUTs the target file to the given URL,
# and avoids loading the full file into memory when sending the payload.
curl -s -T "$logfile" "$base_url/$logfile"

@shamsimam

shamsimam Jan 13, 2019

Contributor

How long do these uploads usually take for our integration tests?

@DaoWen

DaoWen Jan 14, 2019

Author Contributor

My experience (running locally) was that it was sub-second. I don't think it would be much slower than that in Travis.
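
A dry-run sketch of the upload loop under discussion; the bucket URL is a hypothetical placeholder, and the curl command is echoed rather than executed so the example is self-contained.

```shell
# Dry run of the per-restart upload loop: curl's -T option streams each
# file as the PUT body instead of buffering it in memory.
# base_url is a hypothetical placeholder for the pod's S3 target prefix.
base_url="http://s3.example.test/waiter-logs/service/pod"
for i in 1 0; do                 # reverse order: newest restart first
  for f in stdout stderr; do
    logfile="r$i/$f"
    echo curl -s -T "$logfile" "$base_url/$logfile"
  done
done
```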

Resolved review thread on waiter/integration/waiter/kubernetes_scheduler_integration_test.clj

@DaoWen DaoWen force-pushed the DaoWen:feature/logs-s3 branch 3 times, most recently from 5c294f4 to 3d55ac1 Jan 14, 2019

@DaoWen DaoWen force-pushed the DaoWen:feature/logs-s3 branch 3 times, most recently from 3c5a898 to a3a7635 Jan 15, 2019

@DaoWen

Contributor Author

DaoWen commented Jan 15, 2019

@shamsimam - When I was fixing the S3 log server integration test, I ended up rewriting the directory-listing logic to generate the listing directly from the S3 server's XML response. We no longer manually generate an index.json file. This feels much cleaner and more reliable.
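
The XML-based listing mentioned above presumably extracts object keys from the S3 ListBucketResult response. A minimal shell sketch against an abbreviated sample payload (the PR's actual implementation is in Clojure, and the real payload comes from the S3 server):

```shell
# Sample ListBucketResult payload (abbreviated and illustrative); the
# real response comes from a GET on the bucket.
xml='<ListBucketResult><Contents><Key>r0/stdout</Key></Contents><Contents><Key>r0/stderr</Key></Contents></ListBucketResult>'
# Pull out each <Key> element, then strip the surrounding tags.
keys=$(printf '%s\n' "$xml" | grep -o '<Key>[^<]*</Key>' | sed -e 's|<Key>||' -e 's|</Key>||')
printf '%s\n' "$keys"
```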

service-id->service-description-fn
service-description->namespace)
pod (get-in @watch-state [:service-id->pod-id->pod service-id pod-name])]
(ss/try+

@shamsimam

shamsimam Jan 15, 2019

Contributor

Nitpick: I would extract function for the two cases of getting the info from the sidecar and from S3.

(log/info "waiting for s3 logs to appear")
(is (wait-for
#(let [{:keys [body] :as logs-response} (make-request-fn log-url)]
(string/includes? body log-bucket-url))

@shamsimam

shamsimam Jan 15, 2019

Contributor

Since this is the S3 logs test, should we instead wait for the prefix to start with the S3 log url?

@shamsimam shamsimam merged commit e156a93 into twosigma:master Jan 15, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

Waiter on Kubernetes automation moved this from To do to Done Jan 15, 2019
