# Roman Science Platform Data Discovery and Access in the Cloud 


***

## Kernel Information

To run this notebook, please select the "Roman Calibration" kernel at the top right of your window.

## Imports
Here we import the required packages for our data access examples including:
- *astropy.io.fits* for accessing FITS files
- *astroquery.mast Observations* for accessing, searching, and selecting data from other missions
- *s3fs* for streaming in data directly from the cloud
- *roman_datamodels* for opening Roman ASDF files. You can find additional information on how to work with ASDF files in the Working with ASDF notebook tutorial.

In [None]:
from astropy.io import fits
from astroquery.mast import Observations
import s3fs
import roman_datamodels as rdm

***

## Introduction
This notebook is designed to provide examples of accessing data from the science platform. It demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is *HIGHLY* recommended. However, we understand that some use-cases will require downloading the data locally, so we also provide an example of how to do this at the end of the notebook.

Here we examine how to access data from two types of sources:
- The STScI MAST server which hosts data for in-flight telescopes including Hubble, TESS, and JWST
- Simulated Roman Space Telescope data hosted in storage containers on the AWS cloud


### Defining terms
- *Cloud computing*: the practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than using a local server or a personal computer.
- *AWS*: Amazon Web Services (AWS) is the cloud computing platform provided by Amazon.
- *URI*: a Uniform Resource Identifier (URI) is a sequence of characters that identifies a name or a unique resource on the Internet. URLs for websites are a subset of URIs.
- *AWS S3*: Amazon Simple Storage Service (S3) is a scalable and cost-effective object storage service on the AWS cloud platform. Storage containers within S3 are known as "buckets," so we often refer to these storage devices as "S3 buckets" or "S3 servers".
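
To make the URI and bucket terms concrete, here is a minimal sketch that splits an S3 URI into its bucket name and object key using only the Python standard library. The helper name `split_s3_uri` is hypothetical (it is not part of `s3fs` or `astroquery`), and the example URI is the simulated Roman file used later in this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The bucket is the "network location" part of the URI;
    # the key is the remaining path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


example_uri = (
    "s3://roman-sci-test-data-prod-summer-beta-test/"
    "ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/"
    "r0000101001001001001_01101_0001_WFI01_cal.asdf"
)
bucket, key = split_s3_uri(example_uri)
print("Bucket:", bucket)
print("Key:   ", key)
```

The bucket plays the role of a top-level container, while the key behaves like a file path inside it.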

***

## Accessing MAST Data
In this section, we will go through the steps to retrieve archived MAST data from the cloud, including how to query the archive, stream the files directly from the cloud, and download them locally.

### Enabling Cloud Access
The most important step for accessing data from the cloud is to enable *astroquery* to retrieve URIs and other relevant cloud information. Even if we are working locally and plan to download the data files (not recommended for Roman data), we need to use this command to obtain the file locations.

In [None]:
Observations.enable_cloud_dataset()

### Querying MAST
Now we are ready to begin our query. This example is simple, but it is quick and easy to reproduce. We will query JWST NIRCam imaging data of M83. In our query, we specify the JWST mission, the F444W filter, and the NIRCam imaging mode, and we restrict the search to a single proposal ID to isolate the data of interest. Once we have the desired observations, we gather the list of data products associated with them. We then filter the products to keep only the rate images, which still leaves us with 144 products. To reduce the number of URIs we work with, we select a single observation to continue with in this notebook.

In [None]:
obs = Observations.query_criteria(obs_collection='JWST',
                                  filters='F444W',
                                  instrument_name='NIRCAM/IMAGE',
                                  proposal_id=['1783'],
                                  dataRights='PUBLIC')
products = Observations.get_product_list(obs)
filtered = Observations.filter_products(products,
                                        productSubGroupDescription='RATE')
print('Filtered data products:\n', filtered, '\n')
single = Observations.filter_products(filtered,
                                      obsID='87766440')
print('Single data product:\n', single, '\n')

Now that we have our desired products, we can gather the URIs for each of the files which indicate their locations in the MAST AWS S3 servers.

In [None]:
uris = Observations.get_cloud_uris(single)
uris

The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list. Thus, we need to extract the individual URI strings to access the files.

In [None]:
uri = uris[0]
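
Because the result is a list, it can help to tidy it before indexing into it. The sketch below assumes that some entries may be `None` when a product has no cloud copy (this behavior may depend on the `astroquery` version, so treat it as an assumption). The helper name `usable_uris` and the sample URIs are hypothetical placeholders, not real MAST products.

```python
def usable_uris(uris):
    """Return only the entries that are actual URI strings.

    Assumption: products without a cloud copy may appear as None
    in the list (hypothetical helper for illustration).
    """
    kept = [uri for uri in uris if uri is not None]
    if not kept:
        raise ValueError("No cloud URIs were found for the selected products.")
    return kept


# Placeholder list standing in for the output of get_cloud_uris
sample_uris = [
    "s3://example-bucket/example_a_rate.fits",
    None,
    "s3://example-bucket/example_b_rate.fits",
]
valid = usable_uris(sample_uris)
print(len(valid), "usable URIs; first:", valid[0])
```

Raising an error on an empty list gives a clearer message than the `IndexError` that `uris[0]` would produce.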

### Streaming files directly into memory
Here, we will use `s3fs` to directly access the data stored in the AWS S3 servers. Note that we must set `anon=True` to access the files anonymously, without AWS credentials.

In [None]:
fs = s3fs.S3FileSystem(anon=True)

Because the URI points to a FITS file, we can use `astropy` to access the information in the file.

In [None]:
# Open the file in AWS: 'f' is the S3 file
with fs.open(uri, 'rb') as f:
    # Now actually read in the FITS file
    with fits.open(f, 'readonly') as HDUlist:
        HDUlist.info()
        sci = HDUlist[1].data
type(sci)

***

## Streaming from the Roman Science Platform S3 Bucket

Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available in a separate S3 bucket. These files can be streamed in exactly the same way as the JWST FITS file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the `s3fs` documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#).

In [None]:
asdf_dir_uri = 's3://roman-sci-test-data-prod-summer-beta-test/'
fs = s3fs.S3FileSystem()

fs.ls(asdf_dir_uri)

The `fs.ls()` command allows us to list the contents of the URI. In the above example, the `roman-sci-test-data-prod-summer-beta-test` S3 bucket contains two directories:
- `ROMANISIM` contains the simulated WFI imaging-mode Roman Space Telescope data used in this suite of notebooks
- `STIPS` contains data for the Space Telescope Image Product Simulator (STIPS) notebook (Notebook link: [stips.ipynb](../stips/stips.ipynb))

Diving into the `ROMANISIM` directory, we find three folders:
- `CATALOGS_SCRIPTS`: contains stellar and galactic catalogs used to create the simulated data stored in the other directories
- `DENSE_REGION`: contains calibrated and uncalibrated simulated data of dense stellar fields obtained with different filters for all eighteen WFI detectors. The data are separated into two directories, each with a different pointing. Filenames in these directories use the prefixes `r0000101001001001001*` and `r0000101001001001002*`, which correspond to the F158 and F129 optical elements, respectively.
- `GALAXIES`: contains one calibrated, simulated image of a galaxy field obtained using the F158 optical element.
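
The filename conventions described above can be read programmatically. The following is a minimal sketch that pulls the prefix, detector number, and product suffix out of a filename and maps the prefix to its optical element. The regular expression and the helper name `describe_filename` are assumptions based only on the example filename used in this notebook; the fields between the prefix and the detector are deliberately left uninterpreted.

```python
import re

# Prefix-to-optical-element mapping, as described in the list above
PREFIX_TO_ELEMENT = {
    "r0000101001001001001": "F158",
    "r0000101001001001002": "F129",
}

# Assumed pattern: <prefix>_<other fields>_WFI<NN>_<suffix>.asdf
FILENAME_RE = re.compile(
    r"^(?P<prefix>r\d+)_.*_WFI(?P<detector>\d{2})_(?P<suffix>[a-z]+)\.asdf$"
)


def describe_filename(filename):
    """Summarize a simulated Roman filename (hypothetical helper)."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized filename: {filename!r}")
    return {
        "optical_element": PREFIX_TO_ELEMENT.get(match["prefix"], "unknown"),
        "detector": int(match["detector"]),
        "suffix": match["suffix"],
    }


info = describe_filename("r0000101001001001001_01101_0001_WFI01_cal.asdf")
print(info)
```

A summary like this can be handy for grouping the output of `fs.ls()` by detector or optical element before opening any files.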

Below, we use `roman_datamodels` to read one of the calibrated ASDF files from the dense region as an example.

In [None]:
asdf_file_uri = asdf_dir_uri + 'ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/r0000101001001001001_01101_0001_WFI01_cal.asdf'

with fs.open(asdf_file_uri, 'rb') as f:
    dm = rdm.open(f)
    
print(type(dm))
print(dm.meta)

***

## Downloading Files Locally (not recommended)

Though it is **not recommended**, there may be instances where data files must be downloaded locally for certain specific science cases. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get)).

In [None]:
# commented out as this use case is not recommended and should only be needed in rare circumstances
# from pathlib import Path
# local_file_path = Path('data/')
# local_file_path.mkdir(parents=True, exist_ok=True)
# fs.get(uri, local_file_path)
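
If a local copy really is needed, it can help to decide the exact destination file name up front. The sketch below derives a local path from a URI using only the standard library. The helper name `local_path_for` is hypothetical, `data` is the directory name from the commented-out cell above, and the download call itself is left commented out, as in that cell.

```python
from pathlib import Path, PurePosixPath


def local_path_for(uri, dest_dir="data"):
    """Return the local path where the object at `uri` would be saved.

    Hypothetical helper for illustration; it does not download anything.
    """
    # S3 keys use forward slashes, so the last component is the file name
    filename = PurePosixPath(uri).name
    return Path(dest_dir) / filename


example_uri = (
    "s3://roman-sci-test-data-prod-summer-beta-test/"
    "ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/"
    "r0000101001001001001_01101_0001_WFI01_cal.asdf"
)
target = local_path_for(example_uri)
print(target)

# Uncomment only if a local copy is truly required:
# target.parent.mkdir(parents=True, exist_ok=True)
# fs.get(example_uri, str(target))
```

Passing an explicit file path as a string, rather than a directory, avoids any ambiguity about where the downloaded file ends up.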

***

## Additional Resources
Additional information can be found at the following links:

- [`s3fs` Documentation](https://s3fs.readthedocs.io/en/latest/api.html#)
- [Working with ASDF Notebook](../working_with_asdf/working_with_asdf.ipynb)

## About this notebook
The data streaming information in this notebook largely builds on the TIKE data-access notebook by Thomas Dutkiewicz.

**Author:** Will C. Schultz  
**Updated On:** 2024-05-14

***

[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/spacetelescope/notebooks/master/assets/stsci_pri_combo_mark_horizonal_white_bkgd.png" alt="Space Telescope Logo" width="200px"/> 