Retrieval Metering to Avoid Massive Charges #47
Please add a Max-Retrieval-Rate option to the restore command so we can protect ourselves from accidentally incurring massive retrieval charges. For example, if I store 100GB in Glacier as 10GB files and run the restore command with max-number-of-files=10 (a low, reasonable number), I will request retrieval of my entire 100GB vault. While it costs me only $1.00 a month to store the 100GB, running that command will cost me $179.70 for the retrieval, even if I never download the files to my computer. The bandwidth to download the 100GB would cost only $12 at $0.12/GB, while the retrieval will probably make me cry. This obviously negates the cost savings over using S3, for example.
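To make the math concrete, here is a rough sketch of the legacy retrieval-fee formula described above. The exact proration of the free allowance is my assumption, chosen because it reproduces the $179.70 figure; Amazon's actual billing may differ in detail:

```python
# Sketch of the (legacy) Glacier retrieval-fee math from the example above.
# Assumes the whole retrieval lands in one 4-hour job window and that the
# daily free allowance (5% of storage / days in month) is subtracted from
# the retrieved amount before computing the peak hourly rate.

HOURS_PER_MONTH = 720          # Amazon bills the peak rate as if sustained all month
RETRIEVAL_PRICE_PER_GB = 0.01  # $ per GB of peak hourly rate, legacy pricing
WINDOW_HOURS = 4               # a retrieval job's data is spread over ~4 hours

def retrieval_fee(gb_retrieved, gb_stored, days_in_month=30):
    """Fee for retrieving gb_retrieved in one 4-hour burst."""
    daily_free_gb = gb_stored * 0.05 / days_in_month
    peak_rate = max(gb_retrieved - daily_free_gb, 0) / WINDOW_HOURS
    return peak_rate * RETRIEVAL_PRICE_PER_GB * HOURS_PER_MONTH

print(round(retrieval_fee(100, 100), 2))  # 179.7, matching the figure above
```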
Amazon charges us at the maximum retrieval rate we incur during any month (measured over a 4-hour window) as if we had retrieved at that rate for the ENTIRE month (720 hours). Until Amazon adds a retrieval cap on their end (which is unlikely), every Glacier client capable of retrieval should implement this.
Since the journal keeps track of what files we have retrieved and when, it shouldn't be too hard for the script to do a simple calculation to see how much data has been "restored"/"retrieved" in the previous 4 hours and compare that to our Max Retrieval Rate option. If executing the restore command would violate the Max Retrieval Rate, it should fail completely. We could then keep retrying the command until doing so would not violate the rate (if we were scripting it, for example).
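The check described above could be sketched roughly like this, assuming a simplified journal represented as a list of (timestamp, size-in-GB) retrieval records rather than mtglacier's actual journal format:

```python
import time

WINDOW_SECONDS = 4 * 3600  # the 4-hour billing window from the discussion above

def would_exceed(journal, pending_gb, max_rate_gb_per_hr, now=None):
    """Return True if restoring pending_gb now would push the average
    retrieval rate over the last 4 hours above the configured cap.

    journal: iterable of (unix_timestamp, size_gb) retrieval records
    (a hypothetical simplification of the real journal file)."""
    if now is None:
        now = time.time()
    recent_gb = sum(size for ts, size in journal
                    if now - ts <= WINDOW_SECONDS)
    rate = (recent_gb + pending_gb) / (WINDOW_SECONDS / 3600)
    return rate > max_rate_gb_per_hr

# Example: 8 GB retrieved 1 hour ago; can we restore 5 more at a 3 GB/hr cap?
journal = [(time.time() - 3600, 8.0)]
print(would_exceed(journal, 5.0, 3.0))  # True: (8+5)/4 = 3.25 GB/hr > 3
```

With this shape, a scripted restore loop would simply sleep and retry whenever `would_exceed` returns True, which matches the fail-and-retry behavior proposed above.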
For example, let's say I set my Max Retrieval Rate at 3GB/hr. Not factoring in the free retrieval allowance, hitting this rate means I will be charged 3GB x $0.01 x 720 hours = $21.60 for the month, no matter how much I download for the remainder of the month, as long as I stay below that rate. I'd be much more comfortable knowing my maximum retrieval cost is $21.60, instead of my entire archive x $0.01 x 720 (less the free allowance), which for 100GB is the $179.70 I mentioned.
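The worst-case cost of a given cap follows directly from the billing rule (free allowance ignored, as in the example above):

```python
def max_monthly_cost(cap_gb_per_hr, price_per_gb=0.01, hours=720):
    """Worst-case monthly retrieval fee if the peak rate never exceeds the cap."""
    return cap_gb_per_hr * price_per_gb * hours

print(round(max_monthly_cost(3), 2))  # 21.6
```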
It would also be good to incorporate the free allowance calculation. Per the Amazon FAQ: "You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. For example, if on a given day you have 75 TB of data stored in Amazon Glacier, you can retrieve up to 128 GB of data for free that day (75 terabytes x 5% / 30 days = 128 GB, assuming it is a 30 day month). In this example, 128 GB is your daily free retrieval allowance. Each month, you are only charged a Retrieval Fee if you exceed your daily retrieval allowance." The journal should know how much data we have in the vault before we "restore", and since this looks like it's pro-rated daily, it shouldn't be hard to factor into a metering system. Perhaps an option like Max-Chargeable-Retrieval-Rate could factor this in (i.e., the maximum retrieval rate I'm willing to pay for above and beyond my free allowance rate). Also note the download bandwidth cost is in addition to the "retrieval" costs, currently $0.12/GB up to 10TB.
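The FAQ's allowance formula is simple to reproduce; a one-line sketch:

```python
def daily_free_allowance_gb(stored_gb, days_in_month=30):
    """5% of stored data per month, pro-rated daily (per the Glacier FAQ)."""
    return stored_gb * 0.05 / days_in_month

# The FAQ's own example: 75 TB stored -> 128 GB free per day
print(round(daily_free_allowance_gb(75 * 1024), 2))  # 128.0
```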
If you want to check the math yourself,
Here is an example of a client that does this:
Thank you for the feature request.
Yes, I am going to implement this feature.
Yep, that's exactly how I would implement it.
BTW, their FAQ contradicts their own answer in the forum.
Maybe, but we'd need to know the total data size for the current region, not just the vault, so the journal would not help.
So I think I'll end up implementing something like
I will keep this ticket open and post updates about this feature here when it's done. I cannot promise a deadline, though.
Meanwhile, you might find the file selection options https://github.com/vsespb/mt-aws-glacier#file-selection-options (together with --max-number-of-files) useful for
Another workaround is to split your data into archives of equal size, before backing up to Amazon Glacier.
You are right, I was not aware that the free allowance was per region. This would probably overly complicate things, as you point out, so just a maximum rate (not inclusive of the free amount) would probably be the best and cleanest approach.
However, if we use a different journal for each vault as you suggest, you would have to know the contents of every journal in order to enforce a metering system properly. This could complicate things as well: we would have to pass every journal into restore even when retrieving from just one vault, and if we forget to pass one in, we could go over our threshold. That's messy. Luckily, data in Glacier should rarely need to be restored anyway, so perhaps the answer is simply not to restore from different vaults at the same time, and if you do, it's at your own risk.
Hopefully I'll never have to restore from Glacier (because that means my raid box at home failed and/or my house burned down), but if I do, this would come in very handy.
Keep up the great work!
I actually meant the "real" rate, not Amazon's billing "rate". I.e., you specify 100Mb/hour, and that means
And if you need to control your retrieval price, you'll have to do some (rough) calculations manually before restore
Well, maybe later I'll add something more user friendly (again, that would be a bit of mess, you'll have to
Well, I suggest this because that's how Amazon works. A file belongs to one vault. And if you mix files
Is this still on the table? Without this I'll have to implement this with boto, if I ever need to retrieve my files. Implementation suggestion: