
> Roman, want to field the back-of-the-envelope calculations?
> 
A single tumor/normal sample is around 400GB of raw storage
We expect an average of 5 samples / week for the next 12 months (but it's easy to scale up the calculations if needed)
Data flow would be from the Cache -> S3 (~3-6 months storage period) and in parallel a copy to Glacier.
> 
What is the storage cost per sample for the lifecycle?
What is the cost to move a sample to NCI (data out) for processing?
How long does it take to restore a sample from Glacier if we want to keep cost < 500 AUD?
Storage costs on NCI are:
> 
156\$ / TB / year on active storage (S3 equivalent) or ~\$30 per sample if kept for 6 months
73\$ / TB on (dual) tape archive or ~\$30 per sample and year
> 
Ignore compute for now, I don't have good numbers.
[8:49] 
If you absolutely want to, we need machines with 4GB memory / core. A sample takes ~48h on 128 cores or ~6400 CPU hours.
> 
[8:49] 
At \$0.0260/CPU hour that's ~$165 AUD for the processing.
> 
[8:50] 
But those numbers would only translate 1:1 if we could run bcbio on AWS which we can't. Any other runner will have different and likely better runtimes on AWS.

TB [of storage] at NCI is 150 a year. Or 75 per sample. Or 10 bucks a patient a month.

Regarding compute at NCI and its cost:


 ```Total Grant: 300.00 KSU
Total Used:  200.00 KSU
Total Avail: 100.00 KSU
Bonus Used:  41.86 KSU```


Going through ~100k Units a month right now or about 2600$ AUD.

### Many alternatives to calculate this

* https://www.cloudberrylab.com/backup/calc.aspx
* https://calculator.s3.amazonaws.com/index.html
* https://www.cloudberrylab.com/amazon-s3-pricing-explained.aspx

### Fetch S3 and EC2 pricing data from AWS

In [365]:
# Adapted to Python 3 from: https://blog.rackspace.com/experimenting-aws-price-list-api
import json, boto3, time, requests
from collections import defaultdict

AWS_SERVICES_IDX = 'https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/index.json'

AWS_EC2_URL = 'https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/ap-southeast-2/index.json'
AWS_S3_URL = 'https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonS3/current/ap-southeast-2/index.json'
AWS_GLACIER_URL = 'https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonGlacier/current/ap-southeast-2/index.json'

def read_aws_prices(service, **kwargs):
  event = defaultdict()

  if service == 's3':
    offer = json.load(open('s3.json', 'r'))
    prices = extract_s3_prices(offer, **kwargs)
  elif service == 'glacier':
    offer = json.load(open('glacier.json', 'r'))
    prices = extract_s3_prices(offer, **kwargs)
  elif service == 'ec2':
    offer = json.load(open('ec2.json', 'r'))
    prices = extract_ec2_prices(offer, **kwargs)

  return prices

def get_aws_prices(service):
  event = defaultdict()

  if service == 's3':
    event['offerCode'] = 'AmazonS3'
    offer = download_offer(event)
    prices = extract_s3_prices(offer)
  elif service == 'glacier':
    event['offerCode'] = 'AmazonGlacier'
    offer = download_offer(event)
    prices = extract_s3_prices(offer)
  elif service == 'ec2':
    event['offerCode'] = 'AmazonEC2'
    offer = download_offer(event)
    prices = extract_ec2_prices(offer)

  #upload_prices(prices)
  return prices

def download_offer(event):
  if event['offerCode'] == 'AmazonS3':
    URL = AWS_S3_URL
  elif event['offerCode'] == 'AmazonEC2':
    URL = AWS_EC2_URL
  elif event['offerCode'] == 'AmazonGlacier':
    URL = AWS_GLACIER_URL

  response = requests.get(URL)
  return json.loads(response.text)

def filter_ec2_products(products):
  filtered = []

  # Only interested in shared tenancy, linux instances
  for sku, product in products:
    a = product['attributes']
    if not ('locationType' in a and
            'location' in a and
            'tenancy' in a and
            a['tenancy'] == "Shared" and
            a['locationType'] == 'AWS Region' and
            a['operatingSystem'] == 'Linux'):
      continue

    a['sku'] = sku
    filtered.append(a)

  return filtered

def filter_s3_products(products, **kwargs):
  filtered = []

  for sku, product in products:
    a = product['attributes']
    if kwargs['usagetype'] == 'APS2-TimedStorage-ByteHrs':
        a['sku'] = sku
        filtered.append(a)
    elif not ('usagetype' in a and
            'fromLocation' in a and
            'toLocation' in a and
            #a['usagetype'] == kwargs['usagetype'] and
            a['fromLocation'] == kwargs['src'] and
            a['toLocation'] == kwargs['dst']):
            #a['fromLocationType'] == kwargs['src']):
            #a['fromLocation'] == 'Asia Pacific (Sydney)'):
            #a['toLocation'] == 'Asia Pacific (Sydney)'):
      continue

    a['sku'] = sku
    filtered.append(a)

  return filtered


def extract_ec2_prices(offer):
  terms = offer['terms']
  products = offer['products'].items()

  instances = {}
  for a in filter_ec2_products(products):
    term = list(terms['OnDemand'][a['sku']].items())[0][1]
    cost = list(term['priceDimensions'].items())[0][1]
    cost = cost['pricePerUnit']['USD']


    info = {"type" : a['instanceType'], "vcpu" : a['vcpu'], 
            "memory" : a['memory'].split(" ")[0], "cost" : cost}

    if not a['location'] in instances:
      instances[a['location']] = []

    instances[a['location']].append(info)

  return {'created': time.strftime("%c"), 'published': offer['publicationDate'], 
          'instances': instances}

def extract_s3_prices(offer, **kwargs):
  terms = offer['terms']
  products = offer['products'].items()

  info = {}
  transfers = {}
  transfers['APS2-TimedStorage-ByteHrs'] = []
  for a in filter_s3_products(products, **kwargs):
    term = list(terms['OnDemand'][a['sku']].items())[0][1]
    cost = list(term['priceDimensions'].items())[0][1]
    cost = cost['pricePerUnit']['USD']
    
    if a['usagetype'] == 'APS2-TimedStorage-ByteHrs':
        info = {"type": a["usagetype"], "cost": cost, 'from': None, 'to': None}
        transfers['APS2-TimedStorage-ByteHrs'].append(info)
    elif 'fromLocation' in a:
        info = {"type": a["usagetype"], "from": a["fromLocation"], "to": a["toLocation"], "cost": cost}
        print(transfers)
        transfers[a['fromLocation']].append(info)

    
  return transfers


# Premises

* All in USD
* Hour-level granularity for time model
* GiB-level granularity for space model

## Storage policies

In [366]:
sample_rate = 5/(24*7)Ω
storage_retention_policy = 6*30*24 # 6 months in hours

sample_size = 400

In [367]:
# S3 and Glacier cost in dollars per GB per month
#s3_egress_unit_cost = read_aws_prices("s3", usagetype='APS2-DataTransfer-Out-Bytes', src='Asia Pacific (Sydney)', dst='External')
#s3_ingress_unit_cost = read_aws_prices("s3", usagetype='APS2-DataTransfer-In-Bytes', src='External', dst='Asia Pacific (Sydney)')
s3_storage_unit_cost = read_aws_prices("s3", usagetype='APS2-TimedStorage-ByteHrs')

#glacier_unit_cost = read_aws_prices("glacier", usagetype='APS2-DataTransfer-Out-Bytes', src='Asia Pacific (Sydney)', dst='External')

#s3_egress_unit_cost
#s3_ingress_unit_cost
#glacier_unit_cost

{'APS2-TimedStorage-ByteHrs': []}


KeyError: 'Asia Pacific (Sydney)'

In [299]:
s3_storage_unit_cost

{'External': [{'cost': '0.0000000000',
   'from': 'External',
   'to': 'Asia Pacific (Sydney)',
   'type': 'APS2-DataTransfer-In-Bytes'},
  {'cost': '0.0400000000',
   'from': 'External',
   'to': 'Asia Pacific (Sydney)',
   'type': 'APS2-DataTransfer-In-ABytes-T1'},
  {'cost': '0.0800000000',
   'from': 'External',
   'to': 'Asia Pacific (Sydney)',
   'type': 'APS2-DataTransfer-In-ABytes-T2'},
  {'cost': '0.0000000000',
   'from': 'External',
   'to': 'Asia Pacific (Sydney)',
   'type': 'APS2-DataTransfer-In-ABytes'}]}

## S3 Storage

In [288]:
# Since we most probably will not get more than 128 patients during first year, we'll not exceed the first 50TB tier:
0.025*400

10.0

## S3 ingress

In [284]:
s3_ingress_unit_cost['External']

[{'cost': '0.0000000000',
  'from': 'External',
  'to': 'Asia Pacific (Sydney)',
  'type': 'APS2-DataTransfer-In-Bytes'},
 {'cost': '0.0400000000',
  'from': 'External',
  'to': 'Asia Pacific (Sydney)',
  'type': 'APS2-DataTransfer-In-ABytes-T1'},
 {'cost': '0.0800000000',
  'from': 'External',
  'to': 'Asia Pacific (Sydney)',
  'type': 'APS2-DataTransfer-In-ABytes-T2'},
 {'cost': '0.0000000000',
  'from': 'External',
  'to': 'Asia Pacific (Sydney)',
  'type': 'APS2-DataTransfer-In-ABytes'}]

In [285]:
# Cost of sending a single sample_size sample to S3
s3_ingress_unit_cost = s3_ingress_unit_cost['External'][2] # assume worst rate for S3
float(s3_ingress_unit_cost['cost']) * sample_size

32.0

In [286]:
# Weekly sample cost (times 5 samples per week) times 4 weeks a month
samples_weekly_cost = float(s3_ingress_unit_cost['cost']) * sample_size * 4 * 5
samples_weekly_cost

640.0

In [287]:
# Yearly
float(s3_ingress_unit_cost['cost']) * sample_size * 4 * 5 * 12

7680.0

## S3 egress

#### "What is the cost to move a sample to NCI (data out) for processing?... Ideally cost to transfer one patient sample out of S3. All we need for now."

In [289]:
s3_egress_unit_cost['Asia Pacific (Sydney)']

[{'cost': '0.0400000000',
  'from': 'Asia Pacific (Sydney)',
  'to': 'External',
  'type': 'APS2-DataTransfer-Out-ABytes'},
 {'cost': '0.0400000000',
  'from': 'Asia Pacific (Sydney)',
  'to': 'External',
  'type': 'APS2-DataTransfer-Out-ABytes-T1'},
 {'cost': '0.0400000000',
  'from': 'Asia Pacific (Sydney)',
  'to': 'External',
  'type': 'APS2-DataTransfer-Out-ABytes-T2'},
 {'cost': '0.1400000000',
  'from': 'Asia Pacific (Sydney)',
  'to': 'External',
  'type': 'APS2-DataTransfer-Out-Bytes'}]

In [290]:
s3_egress_unit_cost = s3_egress_unit_cost['Asia Pacific (Sydney)'][3] # Again, assume worst case scenario
float(s3_egress_unit_cost['cost']) * sample_size

56.00000000000001

In [291]:
# Weekly sample cost (times 5 samples per week) times 4 weeks a month
samples_weekly_cost = float(s3_egress_unit_cost['cost']) * sample_size * 4 * 5
samples_weekly_cost

1120.0000000000002

In [292]:
# Yearly
float(s3_egress_unit_cost['cost']) * sample_size * 4 * 5 * 12

13440.000000000004

#### "What is the storage cost per sample for the lifecycle?"

It depends. Let's first assume that we are not in a rush to retrieve the data so that we don't go Glacier "expedited mode" (expensive, urgent retrieval).

In [293]:
glacier_unit_cost['Asia Pacific (Sydney)']

[{'cost': '0.1400000000',
  'from': 'Asia Pacific (Sydney)',
  'to': 'External',
  'type': 'APS2-DataTransfer-Out-Bytes'}]

In [294]:
float(glacier_unit_cost['Asia Pacific (Sydney)'][0]['cost']) * sample_size

56.00000000000001