
chunk objects when uploading via PutObject #30

Merged
merged 4 commits into vmware-tanzu:master on Mar 17, 2020

Conversation

@skriss (Member) commented Mar 4, 2020

closes #26

I've pushed a dev image containing this change to steveheptio/velero-plugin-for-microsoft-azure:chunk-objects.

@skriss (Member, Author) commented Mar 4, 2020

@dharmab I'm not sure if you have an environment where you're comfortable testing out a patch, but I've pushed a dev image containing this change (#30 (comment)). It seemed to work well in my testing, but it would be great to get additional eyes on it.

If you do have an environment where you can test this, you can either kubectl -n velero edit deploy/velero and change the image for the plugin init container, or you can do:

velero plugin remove velero/velero-plugin-for-microsoft-azure:v1.0.0 # change the version tag if you've already upgraded to v1.0.1
velero plugin add steveheptio/velero-plugin-for-microsoft-azure:chunk-objects

Signed-off-by: Steve Kriss <krisss@vmware.com>
@dharmab commented Mar 4, 2020

@skriss I'll be able to test this within 24 hours. Actually, I just realized that the impacted cluster is still running Velero 1.1.0; we'll have it upgraded to 1.2.0 soon, though.

@dharmab commented Mar 5, 2020

I've upgraded my impacted cluster to Velero 1.2 and will test this PR there.

@dharmab commented Mar 5, 2020

I deployed the plugin and ran a test backup, which failed without logging any errors:

[kube]$ velero backup get dharmab-test-2
NAME             STATUS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
dharmab-test-2   Failed   2020-03-05 14:03:48 -0700 MST   6d        default            <none>
[kube]$ velero backup logs dharmab-test-2 | grep -v "level=info"
[kube]$

The log content is somewhat sensitive so I cannot post it publicly, but it just contained the standard info logs about the items it was backing up.

@skriss (Member, Author) commented Mar 5, 2020

Hmm. Are there any errors in the server log (e.g. kubectl -n velero logs deploy/velero | grep error)?

@dharmab commented Mar 5, 2020

I checked the server logs we shipped to our log collector and I only see "level=info" logs from the velero server.

@skriss (Member, Author) commented Mar 5, 2020

@dharmab can you find the server log entries from around the time the backup finished running? Any info from that would be helpful for tracking this down.

I'll try some more tests and see if I can figure out what's happening, too.

@skriss (Member, Author) commented Mar 5, 2020

I wonder if you're running into pod resource issues: do you have CPU/memory requests/limits set for the velero pod?

It's possible the chunks need to be smaller, since this reads 100MB at a time, which could push past the pod's memory limits.

@dharmab commented Mar 6, 2020

Pod resource limits aren't the issue; we had to raise our limits very high due to vmware-tanzu/velero#2069, which was only fixed in 1.3.0 a few days ago.

I'll see if I can create a reproducible case with a script to create a few thousand configmaps or something.

@skriss (Member, Author) commented Mar 6, 2020

OK; I did further testing and was unable to reproduce, so any additional info you can provide from the logs, etc. will be very helpful for figuring this out.

@dharmab commented Mar 6, 2020

I was able to reproduce and test this successfully with Velero 1.3. I think the mentioned memory leak in Velero 1.2 tainted the result.

Steps to reproduce:

dd if=/dev/urandom bs=1024 count=256 of=256k-padding.txt
kubectl create namespace velero-azure-issue-26
for i in {0..2048}; do kubectl -n velero-azure-issue-26 create configmap padding-$i --from-file 256k-padding.txt; done

This creates ~512MB of configmaps on a cluster (it takes a while to run if the apiserver isn't on localhost). Note that the excess random data is needed to prevent Velero from shrinking the upload below 256MiB through compression.

With the above configmaps present, plugin 1.0.0 fails to create backups while the patch works. Unfortunately there aren't any reproducible error logs in Velero when the backup fails; I'm not sure why.

@skriss skriss changed the title [WIP] chunk objects when uploading via PutObject chunk objects when uploading via PutObject Mar 6, 2020
@skriss (Member, Author) commented Mar 6, 2020

Alright, this is ready for review from the other maintainers.

For anyone following, I pushed a new image, steveheptio/velero-plugin-for-microsoft-azure:chunk-objects-final, containing the current version of the code that allows the block/chunk size to be optionally configured via the BackupStorageLocation's config.blockSizeInBytes field. It still defaults to 100MB.
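For anyone wanting to try the new field, here is a sketch of what a BackupStorageLocation using it might look like. The bucket, resource group, and storage account names below are placeholders, not values from this PR; only the `config.blockSizeInBytes` field is what this change adds.

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero-backups          # placeholder container name
  config:
    resourceGroup: my-rg            # placeholder
    storageAccount: mystorageacct   # placeholder
    blockSizeInBytes: "104857600"   # optional; defaults to 104857600 (100MB)
```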

@skriss skriss marked this pull request as ready for review March 6, 2020 22:35
@skriss skriss requested a review from carlisia March 10, 2020 21:53
@skriss (Member, Author) commented Mar 13, 2020

@carlisia @nrb @ashish-amarnath gentle reminder that this needs some review 🙏

@ashish-amarnath (Contributor) left a review comment:

LGTM!

@carlisia (Contributor) left a review comment:

Just a question, otherwise lgtm.

# for more information on block blobs.
#
# Optional (defaults to 104857600, i.e. 100MB).
blockSizeInBytes: "10485760"
Contributor:

Should the value here be "104857600"? I think 10485760 corresponds to 10MB.

Member Author:

Yeah, I put a non-default value here since the field is not required if you just want the default. I could put something totally different (e.g. 1024) so it's not confusing.

Contributor:

Ahhhhhhh. Could it be left blank?

Member Author:

eh, I'll just put the default value in. I'd like to have a sample value.

Contributor:

Yes, that's good too.

Member Author:

Updated.

Signed-off-by: Steve Kriss <krisss@vmware.com>
@carlisia (Contributor) left a review:

👍

@carlisia carlisia merged commit f524207 into vmware-tanzu:master Mar 17, 2020
@skriss skriss deleted the chunk-objects branch May 7, 2020 17:34