Retrieval Metering to Avoid Massive Charges #47

Open
ghost opened this Issue Oct 23, 2013 · 8 comments


@ghost
ghost commented Oct 23, 2013

Please add a Max-Retrieval-Rate option to the restore command so we can protect ourselves from accidentally incurring massive charges for retrieval. For example, if I store 100GB at Glacier in 10GB files and I run the restore command with max-number-of-files=10 (which is a low, reasonable number), I will request the retrieval of my entire 100GB vault. While it only costs me $1.00 a month to store my 100GB, running the above command will cost me $179.70 for the retrieval even if I don't end up downloading the files to my computer. The bandwidth to download the 100GB will only cost me $12 at $0.12/GB, while the retrieval will probably make me cry. This will obviously negate the cost savings over using S3 for example.

Amazon charges us at the maximum retrieval rate we incur during any month (during a 4 hour window) as if we retrieved at that rate for the ENTIRE month (720 hours). Until Amazon adds a retrieval cap on their end (which is probably unlikely), every client of Glacier capable of retrieval should implement this.

Since the journal keeps track of what files we have retrieved and when, it shouldn't be too hard for the script to do a simple calculation and see how much data has been "restored"/"retrieved" in the previous 4 hours and to compare that to our Max Retrieval Rate option. If executing the restore command would violate the Max Retrieval Rate, it should fail completely. We could then continue to try the command until doing so will not violate the rate (if we were scripting it, for example).
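The check described above could be sketched roughly as follows. This is in Python purely for illustration and is not taken from mt-aws-glacier; the `(timestamp, size_bytes)` record layout and the name `would_exceed_rate` are assumptions, not the real journal format. It also assumes the 4-hour total is averaged over the window to get an hourly rate.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)  # billing window mentioned in this thread


def would_exceed_rate(retrievals, request_bytes, max_bytes_per_hour, now):
    """Return True if retrieving request_bytes now would push the average
    hourly rate over the trailing 4-hour window above max_bytes_per_hour.

    retrievals: iterable of (timestamp, size_bytes) pairs, a hypothetical
    stand-in for what the journal records about past restores.
    """
    cutoff = now - WINDOW
    # Only retrievals inside the trailing window count against the limit.
    recent = sum(size for ts, size in retrievals if ts > cutoff)
    window_hours = WINDOW.total_seconds() / 3600
    return (recent + request_bytes) / window_hours > max_bytes_per_hour
```

A script could call this before each restore and simply retry later while it returns True.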

For example, let's say I set my Max Retrieval Rate at 3GB/Hr. Not factoring in the free retrieval allowance, this means if I hit this rate, I will be charged 3GB x 0.01 x 720 hours = $21.6 for the month, no matter how much I download for the remainder of the month as long as it's lower than this rate. I'd be much more comfortable knowing my maximum retrieval cost will be $21.6 instead of my entire archive x 0.01 x 720 (less free allowance), which in the case of 100GB, it's $179.70 as I mentioned.
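The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the pricing model as described in this thread ($0.01 per GB, peak hourly rate billed for 720 hours); the function name is made up for illustration and the free allowance is ignored.

```python
PRICE_PER_GB = 0.01   # USD per GB, the figure used in this thread
HOURS_IN_MONTH = 720  # 30-day month


def peak_rate_fee(peak_gb_per_hour):
    """Monthly retrieval fee implied by a peak hourly rate, before any free allowance."""
    return peak_gb_per_hour * PRICE_PER_GB * HOURS_IN_MONTH


print(f"${peak_rate_fee(3):.2f}")        # rate capped at 3 GB/hour -> $21.60
print(f"${peak_rate_fee(100 / 4):.2f}")  # 100 GB within one 4-hour window -> $180.00
```

The second figure is the uncapped case before the free allowance is subtracted, which is where the roughly $179.70 comes from.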

It would also be good to incorporate the free allowance calculation. Per the Amazon FAQ: "You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month. For example, if on a given day you have 75 TB of data stored in Amazon Glacier, you can retrieve up to 128 GB of data for free that day (75 terabytes x 5% / 30 days = 128 GB, assuming it is a 30 day month). In this example, 128 GB is your daily free retrieval allowance. Each month, you are only charged a Retrieval Fee if you exceed your daily retrieval allowance." The journal should know how much data we have in the vault before we "restore", and since this looks like it's pro-rated daily, it shouldn't be hard to factor this into a metering system. Perhaps an option such as Max-Chargeable-Retrieval-Rate could factor this in (i.e., the maximum retrieval rate I'm willing to pay for above and beyond my free allowance rate). Also note that the download bandwidth cost, currently $0.12/GB up to 10TB, is in addition to the "retrieval" cost.
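The FAQ example works out as below. This is a small Python sketch of the quoted formula only; the function name and the 1 TB = 1024 GB conversion are assumptions made so the numbers match the FAQ.

```python
def daily_free_allowance_gb(stored_gb, days_in_month=30, free_fraction=0.05):
    """Daily free retrieval allowance in GB, per the FAQ formula quoted above."""
    return stored_gb * free_fraction / days_in_month


stored_gb = 75 * 1024  # 75 TB expressed in GB (assuming 1 TB = 1024 GB)
print(f"{daily_free_allowance_gb(stored_gb):.0f} GB per day")  # 128 GB per day
```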

If you want to check the math yourself,
http://liangzan.net/aws-glacier-calculator/
http://timewasteblog.com/2012/08/30/amazon-glacier-date-retrieval-charge-explained-with-example/

Here is an example of a client that does this:
http://fastglacier.com/amazon-glacier-smart-retrieval.aspx

@vsespb
Owner
vsespb commented Oct 24, 2013

Hello.

Thank you for the feature request.

Yes, I am going to implement this feature.

> Since the journal keeps track of what files we have retrieved and when, it shouldn't be too hard for the script to do a simple calculation and see how much data has been "restored"/"retrieved" in the previous 4 hours and to compare that to our Max Retrieval Rate option.

Yep, that's exactly how I would implement it.

> Per the Amazon FAQ: "You can retrieve up to 5% of your average monthly storage (pro-rated daily) for free each month.

By the way, their FAQ contradicts their own answer in the forum:
> The 5% monthly free retrieval allowance is calculated independently for each region.
https://forums.aws.amazon.com/message.jspa?messageID=481895#481895

> It would also be good to incorporate the free allowance calculation.
> The journal should know how much data we have in the vault before we "restore", and since this looks like it's pro-rated daily

Maybe, but we need to know the total data size for the current region, not the vault, so the journal would not help.
A solution would be to retrieve the size of the data in each vault before the operation, but that does not look like the Unix way to me, and the size can also be outdated (by up to 24h).

So I think I will end up implementing something like --max-hour-rate=100M (at least as a first milestone).

I will keep this ticket open and post updates about this feature here when it's done. I cannot promise a deadline, though.

Meanwhile, you might find the file selection options https://github.com/vsespb/mt-aws-glacier#file-selection-options (together with --max-number-of-files) useful for controlling which files to restore.

Another workaround is to split your data into archives of equal size, before backing up to Amazon Glacier.

@ghost
ghost commented Oct 24, 2013

You are right, I was not aware that the free allowance was by Region. This would probably overly complicate things as you point out, so just a maximum rate (not inclusive of free amount) would probably be the best and cleanest approach.

However, if we use a different journal for each vault as you suggest, you would have to know the contents of every journal in order to properly enforce a metering system. This could complicate things as well and we would have to pass in every journal into restore when we are retrieving from just one vault-- if we forget to pass in a journal, we could go over our threshold-- this is messy. Luckily, data we have in Glacier should rarely be restored anyways, so perhaps the solution is not to restore from different vaults at the same time-- and if you do, use at your own risk.

Hopefully I'll never have to restore from Glacier (because that would mean my RAID box at home failed and/or my house burned down), but if I do, this would come in very handy.

Keep up the great work!

@vsespb
Owner
vsespb commented Oct 24, 2013

> so just a maximum rate (not inclusive of free amount)

I actually meant the "real" rate, not the Amazon billing "rate". I.e. you specify 100MB/hour, and that means mtglacier will retrieve at most 100MB/hour, no matter how much data you store in your vaults.

And if you need to control your retrieval price, you'll have to do some (rough) calculations manually before the restore (you can log in to the Amazon console and see the total size of all vaults for the region).

Well, maybe later I'll add something more user-friendly. Again, that would be a bit of a mess: you'd have to set up read permissions for all your vaults, and the vault size can be up to 24h out of date, so if you rely on client-side calculations you must be sure you did not delete anything from Amazon in the last 24h. And if I decide to calculate the rate in $$, I'll have to hardcode Amazon prices for all regions and keep them up to date.

> However, if we use a different journal for each vault as you suggest

Well, I suggest this because it's how Amazon works. A file belongs to one vault. If you mix files from two vaults in one journal, mtglacier will be unable to tell which file corresponds to which vault. It would be impossible to delete or retrieve a file without specifying the vault.

> Keep up the great work!

Thanks!

@donnie-darko

This would be an awesome feature, so +1

@ihartley-zz

+1. Not only an awesome feature but a great implementation. The user should be able to calculate the cost depending on max-hour-rate specified.

@tedder
Contributor
tedder commented Apr 26, 2014

calculating cost is fine, but it'd be simpler to limit bytes/time like wget does.

@m3nu
m3nu commented May 9, 2016

Is this still on the table? Without this I'll have to implement this with boto, if I ever need to retrieve my files. Implementation suggestion:

  • Set the max rate to X MB (the user needs to calculate this themselves). E.g. with 1TB of total storage, the max free rate is
    1024GB * 5% / 30 days / 24 hours = 0.0711GB per hour, or about 72.8MB/hour,
    i.e. total storage (in MB) * 0.05 / 720 = max hourly retrieval rate in MB.
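As a quick check of that formula, here is a small Python sketch. The function name is made up for illustration, and it assumes 1 TB = 1024 * 1024 MB and a 720-hour month, as in the calculation above.

```python
def max_free_hourly_rate_mb(total_storage_mb, free_fraction=0.05, hours_in_month=720):
    """Hourly retrieval rate (MB) that stays inside the 5% monthly free allowance."""
    return total_storage_mb * free_fraction / hours_in_month


one_tb_in_mb = 1024 * 1024
print(f"{max_free_hourly_rate_mb(one_tb_in_mb):.1f} MB/hour")  # 72.8 MB/hour
```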

@m3nu
m3nu commented May 24, 2016

I'm just retrieving about 15% of my files and I'm using --max-number-of-files, which is great, as long as the files are roughly the same size. How about a --max-number-of-bytes option? That would solve OP's issue and is probably easy to implement.
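A byte budget like that could be sketched as below. This is hypothetical Python, not an existing mtglacier option or its code; the `(name, size_bytes)` layout and the function name are assumptions for illustration.

```python
def select_within_byte_budget(files, max_bytes):
    """Pick files in order until adding the next one would exceed max_bytes.

    files: iterable of (name, size_bytes) pairs (hypothetical layout).
    Returns the selected names and their total size in bytes.
    """
    selected, total = [], 0
    for name, size in files:
        if total + size > max_bytes:
            break  # stop at the first file that does not fit
        selected.append(name)
        total += size
    return selected, total
```

Stopping at the first file that does not fit keeps the restore order predictable; skipping it and continuing would use more of the budget but restore files out of order.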
