# Draft: Roman Science Platform Data Access in the Cloud 

***

## Table of Contents:
* [Imports](#Imports)
* [Introduction](#Introduction)
* [Enabling Cloud Access](#Enabling-Cloud-Access)
* [Accessing MAST Data](#Accessing-MAST-Data)

---

## Imports
Here we import the required packages for our data access examples. Some unique packages include:
- *astropy.io fits* for accessing FITS files
- *astroquery.mast Observations* for accessing, searching, and selecting data from other missions
- *s3fs* for streaming in data directly from the cloud
- *matplotlib.pyplot* for plotting data
- *numpy* for easily setting the plotting bounds

In [4]:
%matplotlib inline
from astropy.io import fits
from astroquery.mast import Observations
import matplotlib.pyplot as plt
import numpy as np
import s3fs

***

## Introduction
The first step of any project is to acquire the data to analyze. This notebook provides examples of how to access data from the science platform. In particular, it demonstrates how to stream data from the cloud directly into memory, which avoids downloading the data locally and using excess storage. Streaming data from the cloud is **highly recommended**. However, some use cases require the data to be downloaded locally, so an example of how to do that is shown at the end of the notebook.

Here we examine how to access data from two classes of sources:
- the STScI MAST server which hosts data for in-flight telescopes including Hubble, TESS, and JWST
- simulated Roman Space Telescope data from the Troxel suite of simulations (see [Troxel 2023](https://ui.adsabs.harvard.edu/abs/2023MNRAS.522.2801T/abstract) for more details on the simulation methods used)


### Defining terms
- Cloud computing: the practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than a local server or a personal computer
- AWS: Amazon Web Services (AWS) is the cloud computing platform provided by Amazon
- URI: a Uniform Resource Identifier (URI) is a sequence of characters that identifies a name or a unique resource on the Internet. Website URLs are a subclass of URIs.
- AWS S3: Amazon Simple Storage Service (S3) is a very scalable and inexpensive object storage service on the AWS cloud platform. The storage containers are referred to as "buckets" so we will often refer to these storage devices as "S3 buckets" or "S3 servers".
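
To make the URI and bucket terminology concrete, the following minimal sketch splits an S3 URI into its bucket and object key using only the Python standard library. The helper name `split_s3_uri` is hypothetical and is not part of `s3fs` or `astroquery`; the example URI is one of the Hubble file locations that appears later in this notebook.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key). Hypothetical helper for illustration."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The network location is the bucket name; the path without its
    # leading slash is the object key inside that bucket.
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_s3_uri("s3://stpubdata/hst/public/jbev/jbeveo010/jbeveo010_drz.fits")
print(bucket)
print(key)
```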

***

## Enabling Cloud Access
The most important step for accessing data from the cloud is to enable *astroquery* to retrieve URIs and other relevant cloud information. Even if we are working locally and plan to download the data files (not recommended for Roman data), we need to use this command to obtain the file locations.

In [5]:
Observations.enable_cloud_dataset()

INFO: Using the S3 STScI public dataset [astroquery.mast.cloud]


***

## Accessing MAST Data
In this section, we go through the steps to retrieve archived MAST data from the cloud, including how to query the archive, stream the files directly from the cloud, and download them locally.

### Querying MAST
Now we are ready to begin our query. This example is rather simple, but is quick and easy to reproduce. We will be querying Hubble ACS/WFC data in this example.

In [6]:
obs = Observations.query_criteria(obs_collection='HST',
                                        filters='F606W',
                                        instrument_name='ACS/WFC',
                                        proposal_id=['12062'],
                                        dataRights='PUBLIC')
products = Observations.get_product_list(obs)
filtered = Observations.filter_products(products,
                                        productSubGroupDescription='DRZ')
filtered

obsID,obs_collection,dataproduct_type,obs_id,description,type,dataURI,productType,productGroupDescription,productSubGroupDescription,productDocumentationURL,project,prvversion,proposal_id,productFilename,size,parent_obsid,dataRights,calib_level,filters
str8,str3,str5,str35,str64,str1,str67,str9,str28,str11,str1,str7,str20,str5,str50,int64,str8,str6,int64,str9
24832664,HST,image,jbeveo010,DADS DRZ file - Calibrated combined image ACS/WFC3/WFPC2/STIS,D,mast:HST/product/jbeveo010_drz.fits,SCIENCE,Minimum Recommended Products,DRZ,--,CALACS,DrizzlePac 3.6.2,12062,jbeveo010_drz.fits,219608640,26423318,PUBLIC,3,F606W
24832664,HST,image,jbeveo010,DADS DRZ file - Calibrated combined image ACS/WFC3/WFPC2/STIS,D,mast:HST/product/jbeveo010_drz.fits,SCIENCE,Minimum Recommended Products,DRZ,--,CALACS,DrizzlePac 3.6.2,12062,jbeveo010_drz.fits,219608640,24832664,PUBLIC,3,F606W
24832668,HST,image,jbevet010,DADS DRZ file - Calibrated combined image ACS/WFC3/WFPC2/STIS,D,mast:HST/product/jbevet010_drz.fits,SCIENCE,Minimum Recommended Products,DRZ,--,CALACS,DrizzlePac 3.6.2,12062,jbevet010_drz.fits,219608640,24832668,PUBLIC,3,F606W
24832668,HST,image,jbevet010,DADS DRZ file - Calibrated combined image ACS/WFC3/WFPC2/STIS,D,mast:HST/product/jbevet010_drz.fits,SCIENCE,Minimum Recommended Products,DRZ,--,CALACS,DrizzlePac 3.6.2,12062,jbevet010_drz.fits,219608640,26421364,PUBLIC,3,F606W


In our query, we specify that we want HST data taken with ACS/WFC in the F606W filter. We also specify the proposal ID to easily get the data of interest. Once we have the desired observations, we gather the list of products associated with those observations. We then filter the products to keep only the drizzled data products (as they have higher resolution and will look better with simple plotting), which leaves us with four filtered products.

Now that we have our desired products, we can gather the URIs for each of the files which indicate the files' locations in the MAST AWS S3 servers.

In [7]:
uris = Observations.get_cloud_uris(filtered)
uris

INFO: 2 of 4 products were duplicates. Only downloading 2 unique product(s). [astroquery.mast.observations]


['s3://stpubdata/hst/public/jbev/jbeveo010/jbeveo010_drz.fits',
 's3://stpubdata/hst/public/jbev/jbevet010/jbevet010_drz.fits']

Note that `get_cloud_uris` checks for duplicates in the provided products to reduce the data access volume. It is also important to note that `get_cloud_uris` always returns a list, so we need to select an individual URI string to access a file. Let's choose the first URI for the remainder of this notebook.
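
The duplicate handling mentioned above can be pictured with a small, self-contained sketch. It only illustrates order-preserving de-duplication applied to the `productFilename` column of the filtered table; it is not the internal implementation of `get_cloud_uris`, and the list `product_filenames` is typed in by hand here.

```python
# Filenames copied by hand from the productFilename column of the filtered table.
product_filenames = [
    "jbeveo010_drz.fits",
    "jbeveo010_drz.fits",
    "jbevet010_drz.fits",
    "jbevet010_drz.fits",
]

# dict.fromkeys keeps the first occurrence of each name and preserves order.
unique_filenames = list(dict.fromkeys(product_filenames))
n_duplicates = len(product_filenames) - len(unique_filenames)

print(f"{n_duplicates} of {len(product_filenames)} products were duplicates.")
print(unique_filenames)
```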

In [8]:
uri = uris[0]

### Streaming files directly into memory
Here, we will use `s3fs` to directly access the data stored in the AWS S3 servers. Note that we must set `anon=True` to access the files.

In [9]:
fs = s3fs.S3FileSystem(anon=True)

Because we can see that the URI points to a FITS file, we can use `astropy` to access the information in the file.

In [7]:
# Open the file in AWS: 'f' is the S3 file
with fs.open(uri, 'rb') as f:
    # Now actually read in the FITS file 
    with fits.open(f, 'readonly') as HDUlist:
        HDUlist.info()
        sci = HDUlist[1].data
type(sci)

Filename: <class 's3fs.core.S3File'>
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU     846   ()      
  1  SCI           1 ImageHDU        81   (4240, 4313)   float32   
  2  WHT           1 ImageHDU        44   (4240, 4313)   float32   
  3  CTX           1 ImageHDU        37   (4240, 4313)   int32   
  4  HDRTAB        1 BinTableHDU    593   10R x 292C   [9A, 3A, K, D, D, D, D, D, D, D, D, D, D, D, D, D, K, 3A, 9A, 7A, 18A, 4A, D, D, D, D, 3A, D, D, D, D, D, D, D, D, D, D, D, D, K, 8A, 23A, D, D, D, D, K, K, K, 8A, K, 23A, 9A, 20A, K, 4A, K, D, K, K, K, K, 23A, D, D, D, D, K, K, 3A, 3A, 4A, 4A, L, D, D, D, 3A, 1A, K, D, D, D, D, D, 4A, 4A, 12A, 12A, 23A, 8A, 23A, 10A, 10A, D, D, 3A, 3A, 23A, 4A, 8A, 7A, 23A, D, K, D, 6A, 9A, 8A, D, D, L, 9A, 18A, 3A, K, 5A, 7A, 3A, D, 13A, 8A, 4A, 3A, L, K, L, K, L, K, K, D, D, D, D, D, D, 3A, 1A, D, 23A, D, D, D, 3A, 23A, L, 1A, 3A, 6A, D, 3A, 6A, K, D, D, D, D, D, D, D, D, D, D, 23A, D, D, D, D, 3A, D

numpy.ndarray
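
The imports at the top of this notebook include `numpy` for setting plotting bounds, so a natural next step is to choose display limits for the `sci` array. The sketch below is one possible approach, assuming percentile clipping is acceptable for a quick look. The synthetic `demo` array stands in for `sci` so that the cell runs without cloud access, and the helper name `display_bounds` and the 1st/99th percentile choice are assumptions rather than part of the original workflow.

```python
import numpy as np


def display_bounds(image, lower=1.0, upper=99.0):
    """Return (vmin, vmax) from percentiles, ignoring NaN pixels. Hypothetical helper."""
    vmin, vmax = np.nanpercentile(image, [lower, upper])
    return float(vmin), float(vmax)


# Synthetic stand-in for the `sci` array: Gaussian noise with one NaN pixel.
rng = np.random.default_rng(0)
demo = rng.normal(loc=0.0, scale=1.0, size=(64, 64)).astype(np.float32)
demo[0, 0] = np.nan

vmin, vmax = display_bounds(demo)
print(vmin, vmax)
```

With the real data, the same bounds could be passed to `plt.imshow(sci, vmin=vmin, vmax=vmax, origin='lower')`.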

## Streaming ASDF Files

Roman data will eventually be available through MAST, but in this testing phase some simulated data have been placed in a separate S3 bucket. These files can be streamed in exactly the same way as the Hubble FITS file above. We can also browse the available files much as we would in a Unix terminal. A full list of commands can be found in the [`s3fs` documentation](https://s3fs.readthedocs.io/en/latest/api.html#).

In [None]:
#fs = s3fs.S3FileSystem(anon=False, key=AWS_ACCESS_KEY_ID, secret=AWS_SECRET_ACCESS_KEY, token=AWS_SESSION_TOKEN)
roman_asdf_dir_uri = 's3://rdmt-sandbox-roman-wfi-l2/romanisim/'
fs.ls(roman_asdf_dir_uri)

Here we can see all the files available from the romanisim simulations. Below we import `roman_datamodels` to read in an ASDF file as an example. Please see the ASDF data format notebook (____) for more information about accessing data within ASDF files.

In [None]:
import roman_datamodels as rdm
asdf_file_uri = roman_asdf_dir_uri + 'test.asdf'

with fs.open(asdf_file_uri, 'rb') as f:
    dm = rdm.open(f)
    
print(type(dm))
print(dm.meta)

## Accessing Simulated Roman Data (to be implemented when data is available)

Eventually, the Troxel 2023 data will be available in S3 buckets for use with the Roman Science Platform. When they are available, this section will discuss how to sort through and access those simulation files.

### Downloading Files Locally (not recommended)

Though it is **not recommended**, there may be instances where data files must be downloaded locally for certain workflows. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get)).

In [11]:
# commented out as this use case is not recommended and should only be needed in rare circumstances
#from pathlib import Path
#local_file_path = Path('data/')
#local_file_path.mkdir(parents=True, exist_ok=True)
#fs.get(uri, local_file_path)
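
When a local copy is genuinely needed, it helps to decide where each file will land before calling `fs.get`. The following sketch, using only the standard library, derives a destination path from a URI. The helper name `local_destination` is hypothetical, and the `data` directory mirrors the commented-out cell above. Nothing is downloaded here.

```python
from pathlib import Path


def local_destination(uri, local_dir="data"):
    """Return the local path a file at `uri` would be saved to. Hypothetical helper."""
    # S3 keys use forward slashes, so the filename is the text after the last one.
    filename = uri.rsplit("/", 1)[-1]
    return Path(local_dir) / filename


uri = "s3://stpubdata/hst/public/jbev/jbeveo010/jbeveo010_drz.fits"
dest = local_destination(uri)
print(dest)
```

After creating the directory as in the cell above, the resulting path could be passed as the second argument, for example `fs.get(uri, str(dest))`.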

## Additional Resources
Some additional information that may be helpful can be found at the following links:

- [`s3fs` Documentation](https://s3fs.readthedocs.io/en/latest/api.html#)
- [ASDF Data Format Notebook]
- [Troxel 2023 Paper](https://academic.oup.com/mnras/article/522/2/2801/7076879?login=true)

## About this notebook
The data streaming information in this notebook is based largely on the TIKE data-access notebook by Thomas Dutkiewicz.

**Author:** Will Schultz  
**Updated On:** 2024-05-03

***

[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/spacetelescope/notebooks/master/assets/stsci_pri_combo_mark_horizonal_white_bkgd.png" alt="Space Telescope Logo" width="200px"/> 