# Roman Science Platform Data Discovery and Access in the Cloud 
***

## Learning Goals

This notebook is designed to walk you through accessing data from the [Mikulski Archive for Space Telescopes (MAST)](https://archive.stsci.edu/) and retrieving simulated [Roman Wide Field Instrument (WFI)](https://roman-docs.stsci.edu/roman-instruments-home) images.

## Table of Contents

\- [0. Introduction](#intro)

\- [1. Accessing MAST Data](#mast_data)

\- [2. Streaming from the Roman Science Platform S3 Bucket](#s3_bucket)

\- [3. Downloading Files (**not recommended**)](#download)

***

<a id = intro></a>
# 0. Introduction
This notebook provides examples of accessing data from the Roman Science Platform (RSP). Because of the survey nature of the Roman Space Telescope, it will produce large volumes of data that will need to be easily and quickly accessed to perform scientific tasks like creating catalogs, difference imaging, generating light curves, etc. Downloading all the required data would burden most users by requiring excessive data storage solutions (likely `>10TB`).

This notebook demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is **HIGHLY** recommended. However, we understand that some use-cases will require downloading the data locally, so we also provide an example of how to do this at the end of the notebook.

During operations, each Roman data file will be given a Uniform Resource Identifier (URI), an analog to an online filepath that is similar to a URL, which points to where the data is hosted on the Amazon Web Services (AWS) cloud. Users will retrieve these URIs from one of several sources including MAST (see [Accessing WFI Data](https://roman-docs.stsci.edu/data-handbook-home/accessing-wfi-data) for more information) and will be able to use the URI to access the desired data from the cloud.
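To make the structure of a URI concrete, the minimal sketch below splits an S3 URI into its bucket and object key using only the Python standard library. The helper name `split_s3_uri` is hypothetical, and the example URI is the simulated Roman file used later in this notebook.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key). Hypothetical helper for illustration."""
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The bucket name sits where a hostname would be in a URL;
    # the key is the rest of the path without its leading slash.
    return parts.netloc, parts.path.lstrip("/")


example_uri = (
    "s3://roman-sci-test-data-prod-summer-beta-test/"
    "ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/"
    "r0000101001001001001_01101_0001_WFI01_cal.asdf"
)

bucket, key = split_s3_uri(example_uri)
print(f"bucket: {bucket}")
print(f"key:    {key}")
```

The bucket identifies the storage container and the key identifies the file within it, which is why a single URI string is enough to locate a file in the cloud.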

Herein we examine how to access data from two types of sources:

- The [STScI MAST server](https://archive.stsci.edu/) which hosts data for in-flight telescopes including Hubble, TESS, and JWST and will host Roman data in the future

- Simulated Roman Space Telescope data hosted in storage containers on the AWS cloud

***

### Kernel Information

To run this notebook on the Roman Science Platform, please select the "Roman Calibration" kernel at the top right of your window.

### Imports
Here we import the required packages for our data access examples including:

- `asdf` for accessing `ASDF` files

- `astropy.io fits` for accessing `FITS` files

- `astroquery.mast Observations` for accessing, searching, and selecting data from other missions

- `s3fs` for streaming in data directly from the cloud

- `roman_datamodels` for opening Roman `ASDF` files. You can find additional information on how to work with `ASDF` files in the [Working with ASDF](https://github.com/spacetelescope/roman_notebooks/tree/main/content/notebooks/working_with_asdf) notebook tutorial.


In [None]:
import asdf
from astropy.io import fits
from astroquery.mast import Observations
import s3fs
import roman_datamodels as rdm

***

<a id = mast_data></a>
# 1. Accessing MAST Data
In this section, we will go through the steps to retrieve archived MAST data from the cloud, including how to query the archive and stream the files directly from the cloud.

## Enabling Cloud Access
The most important step for accessing data from the cloud is to enable `astroquery` to retrieve URIs and other relevant cloud information. Even if we are working locally and plan to download the data files (**not recommended for Roman data**), we need to use this command to copy the file locations.

In [None]:
Observations.enable_cloud_dataset()

## Querying MAST
We are ready to begin our query. This example is rather simple, but it is quick and easy to reproduce. We will be querying HST WFC3/IR data of [M85](https://science.nasa.gov/mission/hubble/science/explore-the-night-sky/hubble-messier-catalog/messier-85/). In practice, the science platform should primarily be used for analyzing and exploring Roman data products. However, due to the smaller file sizes, HST WFC3/IR data provides a nice example. The process is identical regardless of which space telescope is used.

In our query, we specify that we want to look at HST data using the `F160W` filter and WFC3/IR. We also specify the `proposal_id` to easily get the data of interest. Once we get the desired observations, we gather the list of products that go into the observations. We then filter the products to keep only the Level 3 science images from the `CALWF3` project, which still leaves us with sixty data products.

In [None]:
# Query MAST for matching observations
obs = Observations.query_criteria(obs_collection='HST',
                                  filters='F160W',
                                  instrument_name='WFC3/IR',
                                  proposal_id=['11360'],
                                  dataRights='PUBLIC')

# Get the list of products (files)
products = Observations.get_product_list(obs)

# Filter the products
filtered = Observations.filter_products(products,
                                        calib_level=[3], 
                                        productType=['SCIENCE'], 
                                        dataproduct_type=['image'], 
                                        project=['CALWF3'])

print(f'Filtered data products:\n{filtered}\n')

# Filter for just one product
single =  Observations.filter_products(filtered,
                                       obsID='24797441')

print(f'Single data product:\n{single}\n')

Now that we have our desired products, we can gather the URIs for each of the files which indicate their locations in the MAST AWS Simple Storage Service servers.

Amazon Simple Storage Service (S3) is a scalable and cost-effective object storage service on the AWS cloud platform. Storage containers within S3 are known as "buckets," so we often refer to these storage devices as "S3 buckets" or "S3 servers".

In [None]:
uris = Observations.get_cloud_uris(filtered)
uris

The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list. Thus, we need to extract an individual URI string to access the file. Here we choose the first URI, but in practice, you would select the URI associated with the desired file.

In [None]:
uri = uris[0]
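Rather than relying on list position, you may prefer to select a URI by its filename. The following is a minimal sketch of that idea: `pick_uri` is a hypothetical helper, and the bucket and filenames in `example_uris` are made-up placeholders rather than real MAST products.

```python
def pick_uri(uris, filename):
    """Return the first URI whose final path component equals `filename`."""
    for candidate in uris:
        # Skip empty entries defensively; whether your astroquery version
        # can return them is an assumption worth checking.
        if candidate and candidate.rsplit("/", 1)[-1] == filename:
            return candidate
    raise LookupError(f"No URI ends with {filename!r}")


# Placeholder URIs for illustration only (not real files)
example_uris = [
    "s3://example-bucket/hst/example_a_drz.fits",
    None,
    "s3://example-bucket/hst/example_b_drz.fits",
]

selected = pick_uri(example_uris, "example_b_drz.fits")
print(selected)
```

With the real query results, the call would look like `pick_uri(uris, '<desired filename>')`.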

## Streaming files directly into memory
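Here, we rely on `fsspec`, which `astropy` uses under the hood when given an S3 URI, to directly access the data stored in the AWS S3 servers. Because the URI points to a `FITS` file, we can use `fits.open` to access the information in the file. The `fsspec_kwargs={"anon":True}` argument requests anonymous access, which is sufficient for public data.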

In [None]:
with fits.open(uri, 'readonly', fsspec_kwargs={"anon":True}) as HDUlist:
    HDUlist.info()
    sci = HDUlist[1].data
    
type(sci)

***
<a id = s3_bucket></a>
# 2. Streaming from the Roman Science Platform S3 Bucket

Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available in a separate S3 bucket. These files can be streamed in the exact same way as the HST `FITS` file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the [`s3fs` documentation](https://s3fs.readthedocs.io/en/latest/api.html#).

The S3 bucket containing the data is currently only open to the public on the science platform where we have managed the permissions so none need to be specified explicitly. Because of the required permissions, many of the below cells will not work on a private computer.

There are currently three different data sources within the Roman science platform. We can view them by performing a list command (`ls`) on the main science platform directory.

In [None]:
fs = s3fs.S3FileSystem()

asdf_dir_uri = 's3://roman-sci-test-data-prod-summer-beta-test/'
fs.ls(asdf_dir_uri)

The `fs.ls()` command allows us to list the contents of the URI. In the above example, the `roman-sci-test-data-prod-summer-beta-test` S3 bucket contains three directories:

- `ROMANISIM` contains the simulated WFI-imaging mode Roman Space Telescope data used in this suite of notebooks. See the [`ROMANISIM` notebook tutorial](https://github.com/spacetelescope/roman_notebooks/tree/main/content/notebooks/romanisim) in this repo for more information.

- `STIPS` contains data for the [Space Telescope Image Product Simulator (STIPS) notebook](https://github.com/spacetelescope/roman_notebooks/tree/main/content/notebooks/stips).

- `OPEN_UNIVERSE` contains data from the OpenUniverse 2024 Matched Rubin and Roman Simulation preview provided by NASA/IPAC Infrared Science Archive (IRSA) at Caltech. 
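When working with a listing programmatically, it can help to reduce each entry to its final path component. The sketch below assumes that `fs.ls()` returns plain `bucket/key` strings without the `s3://` prefix, which is the usual `s3fs` behavior when `detail=False`. The list `stand_in_listing` is written by hand from the three directories named above and is not captured output, and `entry_names` is a hypothetical helper.

```python
def entry_names(listing):
    """Return the last path component of each entry in a directory listing."""
    return [path.rstrip("/").rsplit("/", 1)[-1] for path in listing]


# Hand-written stand-in for the result of fs.ls(asdf_dir_uri)
stand_in_listing = [
    "roman-sci-test-data-prod-summer-beta-test/OPEN_UNIVERSE",
    "roman-sci-test-data-prod-summer-beta-test/ROMANISIM",
    "roman-sci-test-data-prod-summer-beta-test/STIPS",
]

names = entry_names(stand_in_listing)
print(names)
```

On the science platform, `entry_names(fs.ls(asdf_dir_uri))` would apply the same reduction to the live listing.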

In the next subsection we will explore opening data files made using Roman I-Sim, which are stored in the `ROMANISIM` S3 directory. These simulations are saved in the same file formats as observed Roman data will be and thus are useful to help develop file ingestion pipelines. Unfortunately, Roman I-Sim has not been used to extensively simulate survey data. 

In the final subsection, we will explore how to open the OpenUniverse preview data (in the `OPEN_UNIVERSE` S3 directory). The OpenUniverse collaboration has simulated extensive datasets from two core community surveys: the High Latitude Time Domain and Wide Area Surveys (HLTDS and HLWAS). Though they have only provided a preview of the full simulation suite, the quantity of data is still sufficient to start creating data pipelines to analyze Roman data.

A full description of the provided data products and simulation methodologies can be found in the two linked Monthly Notices of the Royal Astronomical Society (MNRAS) papers in [Additional Resources](#additional_res) below, and an overview is provided in [Simulated Data Products](https://github.com/spacetelescope/roman_notebooks/blob/main/markdown/simulated-data.md).

## Opening Roman I-Sim Models

Diving into the `ROMANISIM` directory, we find three folders:

- `CATALOGS_SCRIPTS`: contains stellar and galactic catalogs used to create the simulated data stored in the other directories
  
- `DENSE_REGION`: contains calibrated and uncalibrated simulated data of dense stellar fields obtained with different filters for all eighteen WFI detectors. The data are separated into two directories, each with a different pointing. Filenames in these directories use the prefixes `r0000101001001001001*` and `r0000101001001001002*`, which correspond to the use of the `F158` and `F129` optical elements respectively.
  
- `GALAXIES`: contains one calibrated, simulated image of a galaxy field obtained using the `F158` optical element.

Below, we use `roman_datamodels` to read the `ASDF` file corresponding to the dense region as an example. To simplify the workflow we are providing a URI to the sample Roman data. During operations, the data would be referenced using the URI when performing queries through MAST or other data access methods that are currently under development.

The file naming convention for Roman is quite elaborate, as each filename includes all the relevant information about the observation. Please see the [Data Levels and Products](https://roman-docs.stsci.edu/data-handbook-home/wfi-data-format/data-levels-and-products) Roman documentation page for more information on the file naming conventions.

In [None]:
asdf_file_uri =  f'{asdf_dir_uri}ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/r0000101001001001001_01101_0001_WFI01_cal.asdf'

with fs.open(asdf_file_uri, 'rb') as f:
    dm = rdm.open(f)
    
print(dm.info())
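Because the detector name is part of the filename, URIs for the other detectors in the same exposure can be built with string formatting. The sketch below assumes that every detector from `WFI01` to `WFI18` has a `_cal.asdf` file in the same directory following the same pattern as the file opened above. That assumption has not been verified here, so list the directory with `fs.ls()` before relying on it.

```python
# Directory and exposure prefix taken from the example file above
base_uri = (
    "s3://roman-sci-test-data-prod-summer-beta-test/"
    "ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/"
)
exposure = "r0000101001001001001_01101_0001"

# One URI per WFI detector, zero-padded to two digits (WFI01 ... WFI18)
detector_uris = [
    f"{base_uri}{exposure}_WFI{detector:02d}_cal.asdf"
    for detector in range(1, 19)
]

print(len(detector_uris))
print(detector_uris[0])
print(detector_uris[-1])
```

Each entry can then be opened in the same way as `asdf_file_uri` above.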

## Opening OpenUniverse Models

The subset of data that IPAC has shared is hosted in their own S3 bucket, detailed on the [OpenUniverse AWS Open Data](https://registry.opendata.aws/openuniverse2024/) website. Additionally, IPAC has created two [OpenUniverse notebooks](https://irsa.ipac.caltech.edu/docs/notebooks/) that highlight how you can interact with their image data and catalog files. In this notebook, we focus on how to access the files and leave the linked notebooks as resources for the user to explore.

The simulations are natively saved as `FITS` files and are divided by survey, optical element, and HEALPix cell. [HEALPix](https://healpix.sourceforge.io) is a commonly used way to discretize the area of a sphere uniformly. Please see [Simulated Data Products](https://github.com/spacetelescope/roman_notebooks/blob/main/markdown/simulated-data.md) for more information about the specific products provided in the Open Universe data.
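To give a sense of what uniform discretization means, the sketch below computes the number of HEALPix pixels and the area of each pixel for a given resolution parameter `nside`, using the standard relation that the sphere is divided into `12 * nside**2` equal-area pixels. The `nside` used for the OpenUniverse files is not stated in this notebook, so the value of 64 below is purely illustrative, and both function names are hypothetical.

```python
import math


def healpix_npix(nside):
    """Total number of equal-area HEALPix pixels at resolution `nside`."""
    return 12 * nside**2


def healpix_pixel_area_deg2(nside):
    """Area of a single HEALPix pixel in square degrees."""
    # Full sphere: 4*pi steradians converted to square degrees
    full_sky_deg2 = 4 * math.pi * math.degrees(1) ** 2
    return full_sky_deg2 / healpix_npix(nside)


illustrative_nside = 64  # illustrative value only, not taken from the data
print(healpix_npix(illustrative_nside))
print(round(healpix_pixel_area_deg2(illustrative_nside), 4))
```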

Below we provide an example of streaming a simulated "calibrated" image `FITS` file from their S3 bucket using an alternate way of streaming a `FITS` file. Instead of initializing our own `S3FileSystem`, we pass the credentials (anonymous credentials in this case as the data is public) to `fits.open` and allow it to create the file system. This shorthand is convenient when the URI is specifically provided, but it is impossible to explore the S3 directory structure without initializing the `S3FileSystem`.

In [None]:
s3bucket = 's3://nasa-irsa-simulations/openuniverse2024/roman/preview/RomanWAS/images/simple_model'
band = 'F184'
hpix = '15297'
sensor = 11
s3fpath = f'{s3bucket}/{band}/{hpix}/Roman_WAS_simple_model_{band}_{hpix}_{sensor}.fits.gz'

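# Use a context manager so the remote file is closed when we are done.
# HDUList.info() prints its summary directly, so no print() call is needed.
with fits.open(s3fpath, fsspec_kwargs={'anon':True}) as fits_file:
    fits_file.info()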

For convenience, we have converted all the simulated "calibrated" images from `FITS` to `ASDF` files and are hosting them on the science platform's S3 bucket. In addition to the original files' data, we have also added two new features to the `ASDF` files:

1. We unpacked the WCS information from the `FITS` metadata and created a `gwcs.WCS` object that is saved in `asdf_file['roman']['wcs']`.

2. We queried the provided source catalogs and included all point sources, galaxies, and transients within the detector's field of view, storing them in `astropy.table.Table` objects directly within the `ASDF` files.

Below is an example of accessing the `ASDF` version of the same image that we opened above as a `FITS` file:

In [None]:
s3bucket = 's3://roman-sci-test-data-prod-summer-beta-test/OPEN_UNIVERSE/WAS/simple_model'
band = 'F184'
hpix = '15297'
sensor = 11
s3fpath = s3bucket+f'/{band}/{hpix}/roman_was_{band}_{hpix}_wfi{sensor:02d}_simple.asdf'

fs = s3fs.S3FileSystem()

with fs.open(s3fpath, 'rb') as file_path:
    asdf_file = asdf.open(file_path)

print(asdf_file.info())

Notice the difference between `FITS` and `ASDF` when printing the file information. `ASDF` presents the contents in a hierarchical structure with more detail than the native `FITS` printout. Additionally, we can index the `asdf_file` object similarly to a Python dictionary to access the contents.

Below we print the pre-prepared source catalog of galaxies:

In [None]:
print(asdf_file['roman']['catalogs']['galaxies'])
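The dictionary-style indexing can also be used to walk the tree recursively. The sketch below uses a plain nested dictionary as a stand-in for the `ASDF` tree so that it runs anywhere. Only the keys `roman`, `wcs`, `catalogs`, and `galaxies` come from this notebook, while the values are placeholders and `walk` is a hypothetical helper.

```python
from collections.abc import Mapping


def walk(tree, depth=0):
    """Return indented lines describing a nested mapping."""
    lines = []
    for key, value in tree.items():
        indent = "  " * depth
        if isinstance(value, Mapping):
            # Mapping nodes are shown as "directories" and descended into
            lines.append(f"{indent}{key}/")
            lines.extend(walk(value, depth + 1))
        else:
            # Leaf nodes are shown with the name of their Python type
            lines.append(f"{indent}{key}: {type(value).__name__}")
    return lines


# Placeholder stand-in for the real ASDF tree
stand_in_tree = {
    "roman": {
        "wcs": "placeholder for the gwcs.WCS object",
        "catalogs": {"galaxies": []},
    }
}

print("\n".join(walk(stand_in_tree)))
```

Whether the nodes of a real `ASDF` tree pass the `Mapping` check may depend on the `asdf` version, so treat this as a starting point rather than a drop-in tool.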

Now that we have loaded Roman data into a datamodel, please review the [Working with ASDF](https://github.com/spacetelescope/roman_notebooks/tree/main/content/notebooks/working_with_asdf) notebook to explore how to use it.
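The two OpenUniverse URI patterns used in this section can be wrapped in small helpers, which is convenient when looping over bands, HEALPix cells, or sensors. This is a sketch that assumes the patterns shown above hold for other combinations, which is not guaranteed for every value. The function names are hypothetical.

```python
FITS_ROOT = (
    "s3://nasa-irsa-simulations/openuniverse2024/roman/preview/"
    "RomanWAS/images/simple_model"
)
ASDF_ROOT = (
    "s3://roman-sci-test-data-prod-summer-beta-test/"
    "OPEN_UNIVERSE/WAS/simple_model"
)


def was_fits_uri(band, hpix, sensor):
    """URI of the original FITS image in the IRSA bucket."""
    return (
        f"{FITS_ROOT}/{band}/{hpix}/"
        f"Roman_WAS_simple_model_{band}_{hpix}_{sensor}.fits.gz"
    )


def was_asdf_uri(band, hpix, sensor):
    """URI of the converted ASDF image in the science platform bucket."""
    # The ASDF filenames zero-pad the sensor number to two digits
    return (
        f"{ASDF_ROOT}/{band}/{hpix}/"
        f"roman_was_{band}_{hpix}_wfi{sensor:02d}_simple.asdf"
    )


print(was_fits_uri("F184", "15297", 11))
print(was_asdf_uri("F184", "15297", 11))
```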
***
<a id = download></a>
# 3. Downloading Files (**not recommended**)

It is **not recommended** for users to download Roman data products due to the large file size and the number of files that are expected from the survey nature of the mission. Instead, users are encouraged to construct and adopt workflows that utilize the file streaming services described above for the best experience.

However, there may be instances where data files must be downloaded for specific science cases. To do that, we can use the URIs and the `S3FileSystem.get` function (documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get)). Running the cell below will download the data to your personal instance of the science platform.

**NOTE**: MAST data can be downloaded on your private computer using `anon=True` in the `S3FileSystem` initialization. However, the preliminary, simulated sample of Roman data on the science platform is currently not accessible outside of the science platform.

In [None]:
# Commented out as this use case is not recommended and should only be needed in rare circumstances
# from pathlib import Path
# URI =  ## Set this to the URI string you want to download.
# local_file_path = Path('data/')
# local_file_path.mkdir(parents=True, exist_ok=True)
# fs = s3fs.S3FileSystem()
# fs.get(URI, local_file_path)
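If you do download a file, it can be useful to compute the local destination path up front so that you know exactly where the file will land. The sketch below derives that path from a URI without touching the network or the file system. `local_destination` is a hypothetical helper, and the example URI is the simulated Roman file used earlier in this notebook.

```python
from pathlib import Path


def local_destination(uri, local_dir="data"):
    """Return the path a downloaded file would have inside `local_dir`."""
    # Keep only the final component of the URI as the local filename
    filename = uri.rstrip("/").rsplit("/", 1)[-1]
    return Path(local_dir) / filename


example_uri = (
    "s3://roman-sci-test-data-prod-summer-beta-test/"
    "ROMANISIM/DENSE_REGION/R0.5_DP0.5_PA0/"
    "r0000101001001001001_01101_0001_WFI01_cal.asdf"
)

destination = local_destination(example_uri)
print(destination)
```

A call such as `fs.get(example_uri, str(destination))` would then write to that explicit path. This is one option, not the only way to use `S3FileSystem.get`.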

***
<a id = "additional_res"></a>
## Additional Resources
Additional information can be found at the following links:

- [`s3fs` Documentation](https://s3fs.readthedocs.io/en/latest/api.html#)

- [OpenUniverse AWS Open Data](https://registry.opendata.aws/openuniverse2024/)

- [OpenUniverse notebooks](https://irsa.ipac.caltech.edu/docs/notebooks/)

- [Simulated Data Products Document](../../../markdown/simulated-data.md)

- [MNRAS paper detailing Open Universe data simulation methods (Troxel et al 2021)](https://ui.adsabs.harvard.edu/abs/2021MNRAS.501.2044T/abstract)

- [MNRAS paper detailing the previewed Open Universe data (Troxel et al 2023)](https://ui.adsabs.harvard.edu/abs/2023MNRAS.522.2801T/abstract)


**For additional support, please contact the [Roman Help Desk at STScI](https://stsci.service-now.com/roman).**

## About this notebook
The data streaming information from this notebook largely builds off of the TIKE data-access notebook by Thomas Dutkiewicz.

**Author:** Will C. Schultz

**Updated On:** 2024-11-06

***

[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/spacetelescope/notebooks/master/assets/stsci_pri_combo_mark_horizonal_white_bkgd.png" alt="Space Telescope Logo" width="200px"/> 