<a id="top"></a>
# How do I work with data in the cloud?
***
This Notebook will answer some "first-time" questions about working with cloud data. We'll then cover a basic example of cloud access syntax that you can copy for your own use.

By the end of this tutorial, you will be able to:

- Describe the basic workflow for accessing data in the cloud
- Apply this cloud workflow to your own data queries

## Notebook Table of Contents
- [Introduction](#Introduction)
- [Imports and Setup](#Imports-and-Setup)
- [A Quick Query](#A-Quick-Query)
- [Loading Files Directly into Memory](#Loading-Files-Directly-into-Memory)
- [Bonus: Other Methods, Notes, and Caveats](#Bonus:-Other-methods,-Notes,-and-Caveats)
    - [AWS Command Line Interface](#AWS-command-line-interface)
    - [Integrated Methods](#Integrated-Methods)

## Introduction

### What is "the cloud"? 

In this case, "the cloud" is the AWS East Datacenters in northern Virginia. By storing a cloud copy of MAST data here, we're able to offer our data in a new, highly accessible, highly available format. Cloud-hosted data also lets users interact with our data in new ways, as we'll see in the example below.

### What datasets are available?

The [MAST Archive](https://archive.stsci.edu/) offers a cloud copy of several mission datasets, including data from TESS, HST, GALEX, and more. They are generally cataloged in full on the [MAST Public Datasets](https://registry.opendata.aws/collab/stsci/) page, with a more condensed listing available on the [Public AWS Data](https://outerspace.stsci.edu/display/MASTDOCS/Public+AWS+Data) page.

### How can I access cloud-hosted data?

There are two approaches to accessing cloud-hosted data:
1. Loading files directly into memory while working on TIKE, the Timeseries Integrated Knowledge Engine (recommended)
2. Downloading files to your local machine from the cloud-hosted copy of MAST data

Whenever possible, it's best to use the first method. The vast majority of users, with small tweaks to existing code, should be able to access data this way.

## Imports and Setup

We'll use the standard tools to open and plot a FITS file:
- `astropy.io.fits` to read in the FITS file
- `matplotlib` to create the plot
- `numpy` to automatically set brightness limits in the plot

To access the cloud data, we need:
- `astroquery.mast` to search for and select data

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from astropy.io import fits
from astroquery.mast import Observations

The most important step in this process is to enable cloud data access. Once we do, we'll be able to get cloud URIs and access files directly. If you're working locally, this same command lets you download files from the cloud copy of MAST data.

In [None]:
Observations.enable_cloud_dataset()

## A Quick Query

Now we can begin our query. This is not a particularly interesting query, but it makes for a quick, easily reproducible example. We'll look for a particular HST observation, then keep only the minimum recommended science files from that observation.

In [None]:
# You likely wouldn't search on obs_id, but it makes this example reproducible
obs = Observations.query_criteria(obs_id="ibxl50020")

# Get the products, then filter to keep science and MRP (minimum recommended products)
prod = Observations.get_product_list(obs)
filtered = Observations.filter_products(prod, mrp_only=True, productType='SCIENCE')

filtered

This particular product is a combined, ["drizzled" FITS file](https://hst-docs.stsci.edu/drizzpac/chapter-3-description-of-the-drizzle-algorithm/3-2-drizzle-concept). Without bogging down this notebook in technical details, drizzled images are generally better-resolved. This specific drizzled image combines data from four Hubble instruments, and thus will make a particularly nice looking plot.

Now that we've identified a file of interest, we need to locate its cloud URI: where the file is located on the S3 server. This is straightforward with the `get_cloud_uris` function, which allows us to pass a table of products.

In [None]:
c_uri = Observations.get_cloud_uris(filtered)
c_uri

It's important to note that our cloud URI is in a **list**. When we access the file, we'll need to pass the location as a **string**. Let's fix that now and avoid errors later.

In [None]:
c_uri = c_uri[0]
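
With a longer product table, the list of URIs gets longer too, so it can help to choose an entry more deliberately than with `[0]`. The sketch below is one possible approach using only the standard library. The helper names `first_cloud_uri` and `split_s3_uri` are hypothetical, the example URI is only meant to illustrate the `s3://bucket/key` shape, and the `None` entry reflects an assumption that a product without a cloud copy has no URI string.

```python
from urllib.parse import urlparse


def first_cloud_uri(uris):
    """Return the first non-empty URI string from a list of URIs (hypothetical helper)."""
    for uri in uris:
        if uri:  # skip None or empty entries
            return uri
    raise ValueError("no cloud URI found in the list")


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/file' into (bucket, key) (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Illustrative list only; in the notebook this would come from get_cloud_uris()
example_uris = [None, "s3://stpubdata/hst/public/ibxl/ibxl50020/ibxl50020_drz.fits"]

uri = first_cloud_uri(example_uris)
bucket, key = split_s3_uri(uri)
print(bucket)
print(key)
```

The result of `first_cloud_uri` is a plain string, which is the form `fits.open()` expects.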

## Loading Files Directly into Memory
Whether you open a file in the cloud or on your local machine, it's best practice to close the file once you're done using it. This is most easily done with Python's `with` statement, which closes the file automatically when the block ends, as we'll do below.

**Note:** Code in the `with` statement should only be used to extract data from the file. Relatively slow computations and plot generation should go outside of this statement so that the file can close in a timely manner.
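
The closing behavior described in this note can be seen with an ordinary local file. This is a minimal sketch using only the standard library, with a throwaway text file standing in for a FITS file; the file contents and variable names are made up for illustration.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2 3 4\n")
    path = tmp.name

# Inside the block: only read what we need
with open(path) as f:
    raw = f.read()

# Outside the block: the file is already closed, so slower work goes here
is_closed = f.closed
values = [int(v) for v in raw.split()]
total = sum(values)
print(is_closed, total)

os.remove(path)  # clean up the temporary file
```

The data read inside the block stays available after the file closes, which is why the plotting and computation can safely happen afterwards.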

In [None]:
# instead of passing a local file path,
# pass the cloud URI into fits.open()
with fits.open(c_uri, fsspec_kwargs={"anon": True}) as hdulist:
    hdulist.info()
    sci = hdulist[1].data

We've printed out some information about the file, and read the data from `HDU1` into a variable called `sci`. We can now use `sci` to create plots or images.

In [None]:
# Adjust limits on image: stars are bright, empty space is very dark
low = np.nanpercentile(sci, 1)
high = np.nanpercentile(sci, 99)

# Plot sci in greyscale
plt.imshow(sci, cmap='gray', vmin=low, vmax=high)

This is a neat view of (a part of) the [Horsehead Nebula](https://hubblesite.org/contents/media/images/2013/12/3165-Image.html).
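
The percentile limits used for this plot keep a few very bright pixels from washing out the rest of the image. As a rough illustration of the calculation, here is a pure-Python sketch that ignores NaN values and interpolates linearly between sorted values, which is how `np.nanpercentile` behaves by default. The function name `nan_percentile` and the pixel values are made up for this example.

```python
import math


def nan_percentile(values, q):
    """Percentile of the non-NaN values, using linear interpolation (illustrative)."""
    data = sorted(v for v in values if not math.isnan(v))
    if not data:
        raise ValueError("no non-NaN values")
    # Fractional position of the requested percentile in the sorted data
    pos = (len(data) - 1) * q / 100
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return data[lower] + (data[upper] - data[lower]) * frac


# Made-up pixel values: faint background, one very bright "star", one NaN
pixels = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0, float("nan")]

low = nan_percentile(pixels, 10)
high = nan_percentile(pixels, 90)
brightest = max(p for p in pixels if not math.isnan(p))
print(low, high, brightest)
```

Scaling the display between `low` and `high` rather than between the minimum and `brightest` spreads the available grey levels across the faint structure instead of spending them on a single bright pixel.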

Although this is an artificial example, it still demonstrates that reading data from the cloud is straightforward. With an extra line or two of code, users have access to terabytes of data, no downloads required. This is particularly useful for users with slow internet connections, limited storage, or without a personal computer.

## Bonus: Other methods, Notes, and Caveats

This section is of limited value to the average user looking to analyze astronomical data. Here, we document additional ways to access cloud data that may be of interest to developers or those curious about our cloud infrastructure.

### AWS command-line interface

Since all of the cloud files are available to you as if they were local, it is possible to browse through them. This is not a very efficient way to find observations, but the AWS command-line interface does permit exploration within Jupyter Notebooks (by prefixing the command with `!`) and from a Terminal window (File > New > Terminal).

For example, we can list the contents of the TESS data held in the public MAST S3 bucket:

In [None]:
!aws s3 ls s3://stpubdata/tess/public/ --no-sign-request

This is not a particularly insightful list, so if you wanted to explore in more detail:
* `ffi/` holds the TESS full-frame images
* `mast/` contains the TESS "cubes": [stacked FFI images](https://spacetelescope.github.io/mast_notebooks/notebooks/astrocut/making_tess_cubes_and_cutouts/making_tess_cubes_and_cutouts.html) that allow you to easily generate cutouts
* `pixel_list` contains an incomplete list of targets and their associated pixels
* `staged_cutouts/` is particularly unhelpful; it contains TESS cutouts requested by users. Filenames are randomly generated, so you will not find anything interesting by browsing.
* `tid/` contains all of the SPOC-derived target pixel files (TPFs) and light curves (LCs), sorted by [TESS filename convention](https://archive.stsci.edu/missions-and-data/tess/data-products)
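
To look inside one of these subdirectories, the same `aws s3 ls` command can be pointed at a longer prefix. The sketch below builds that command in Python; the helper names `build_ls_command` and `list_prefix` are hypothetical, and actually running the listing assumes the AWS CLI is installed and the network is reachable.

```python
import shlex
import shutil
import subprocess

BUCKET_ROOT = "s3://stpubdata/tess/public/"


def build_ls_command(subdir=""):
    """Build an anonymous `aws s3 ls` command for a prefix under the TESS directory."""
    return ["aws", "s3", "ls", BUCKET_ROOT + subdir, "--no-sign-request"]


def list_prefix(subdir=""):
    """Run the listing and return its text output (needs the AWS CLI and network access)."""
    if shutil.which("aws") is None:
        raise RuntimeError("the AWS CLI is not installed")
    result = subprocess.run(
        build_ls_command(subdir), capture_output=True, text=True, check=True
    )
    return result.stdout


# Show the command that would be run for the tid/ subdirectory
command_text = shlex.join(build_ls_command("tid/"))
print(command_text)

# To actually run it: print(list_prefix("tid/"))
```

The printed command can also be pasted into a notebook cell with a leading `!`, just like the listing above.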

***

## About this Notebook

If you have comments or questions on this notebook, please contact us through the Archive Help Desk e-mail at archive@stsci.edu. If you spot any errors, open an issue on the [tike_content repository](https://github.com/spacetelescope/tike_content).

**Author:** Thomas Dutkiewicz <br>
**Keywords:** TIKE, AWS, Cloud <br>

[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/spacetelescope/notebooks/master/assets/stsci_pri_combo_mark_horizonal_white_bkgd.png" alt="Space Telescope Logo" width="200px"/> 