<a id="top"></a>
# Generating TESS Cutouts Quickly with TIKE
***
The Timeseries Integrated Knowledge Engine (TIKE) is a cloud computing environment offered by the MAST Archive. You should already be inside this environment if you're running this notebook!

One advantage of working in the cloud is rapid access to data. There's no need to submit a query to the MAST servers in Baltimore and wait for a download to your local machine to complete. Instead, you can work with the data directly in the cloud.

To that end, this tutorial will demonstrate:

- Filtering the TESS Threshold Crossing Events (TCE) catalog for transits of interest.
- Using TESSCut to get data from a target of interest.
- Accessing cutouts more quickly and efficiently using cutout cubes.
- Optimizing our cloud query for better performance.


## Introduction
The [TESS Mission](https://archive.stsci.edu/missions-and-data/tess) was launched in 2018 to find nearby transiting exoplanets. Its large $24^\circ \times 96^\circ$ field of view allows for simultaneous observations of a large part of the sky. These Full Frame Images (FFIs) are tremendously useful for science. Although local [Target Pixel Files](https://heasarc.gsfc.nasa.gov/docs/tess/data-products.html) (AKA "postage stamps") are produced for many of the stars TESS observes, some potential targets are only available in the FFIs.

[Starting in September 2022](https://tess.mit.edu/news/tess-begins-its-second-extended-mission/), TESS takes an FFI every 200 seconds. This is a significant change from the 30-minute and 10-minute intervals in the prime mission and first extended mission. One challenge with this higher FFI frequency is the increased data volume. For example, the calibrated Sector 56 FFIs are **6.7 TB**. This exceeds the storage capabilities of most computers, in addition to being a drain on network and computing resources.

So how can you do science in the era of big data? Rather than requesting all of this data from the [MAST Archive](https://archive.stsci.edu) servers in Baltimore, you can work within a cloud environment. In a nutshell, rather than downloading the data, you can upload your code. This is where TIKE comes in. Making [TESS data available on AWS](https://registry.opendata.aws/tess/) allows for _very_ rapid access to data, regardless of your own download speeds or computer performance.

The workflow for this notebook consists of:
* [Selecting Targets of Interest](#Targets-of-Interest)
* [Cutout Method 1: TESSCut in the Cloud](#Method1)
* [Cutout Method 2: Cloud Cutout Cubes](#Method2)
* [Cutout Method 3: Cutout Cubes with Multiple Cores](#Method3)

## Imports
We'll need a few unusual imports for this notebook to function properly. They should all be pre-installed on the TESS Environment kernel in TIKE.
- `Astrocut`, which will let us work with pre-generated cutout cubes
- `nest_asyncio` to use AWS S3 within the notebook environment
- `multiprocessing`, to run our program on multiple cores. To run this within a Jupyter Notebook, we also need to import the `multi` file from within this folder; `multi` contains copies of some of the convenience functions we'll write in the following cells.

In [None]:
import asyncio
import multi as m
import multiprocessing
import nest_asyncio
import pandas as pd
import requests
import time
import warnings

from astrocut import CutoutFactory
from astropy.coordinates import SkyCoord
from astroquery.mast import Tesscut 

***

# Targets of Interest

We'll find our targets using the [TESS TCE bulk download](https://archive.stsci.edu/tess/bulk_downloads/bulk_downloads_tce.html) list. We'll limit ourselves to Sector 55 to keep the number of results low and our runtimes short. However, this technique will work on any sector!

In [None]:
# Read in the CSV as a pandas DataFrame. Skip the first six rows of the CSV, which are comments;
# header=6 tells pandas that the column names are on the seventh row
tess = pd.read_csv('tess55.csv', header=6)

# Display the first five rows of the DataFrame, to make sure the file was read correctly
tess.head()

## Define a Subset

Let's only look for transits that display a large dip in brightness. `tce_depth` is given in parts per million. Here, we're selecting truly enormous dips, at more than a million parts per million.

"Now wait a minute", you might be saying, "wouldn't that mean a star is repeatedly dimming by 100% of its brightness?". And indeed, that is exactly the criterion we are using to filter. A star which actually turned on and off like a lightbulb would be the astrophysical discovery of the century; these high values are likely due to errors in data processing. However, we might be able to "clean" some of the data and recover a more sensible result. Stay tuned for a follow-up notebook.

Note that we're going to drop duplicates from our list. Multiple entries might indicate, for example, a star system with multiple planets. For our purpose of selecting target stars, this is not relevant, so we will ignore this information (for now).

In [None]:
# Select only depths > 1 million ppm
large_depth = tess[tess['tce_depth']>1000000]

# Drop duplicate stars
large_depth = large_depth.drop_duplicates(subset = "ticid")

len(large_depth)

In [None]:
# Preview some of the results, if you like
large_depth[['ticid', 'tce_depth', 'tce_period']].head()
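
As a quick sanity check of the two filtering steps above, here is a minimal sketch on a synthetic table. The rows and values are invented for illustration; only the column names `ticid` and `tce_depth` come from the catalog used in this notebook.

```python
import pandas as pd

# Synthetic stand-in for the TCE table; these values are made up
toy = pd.DataFrame({
    "ticid": [101, 101, 202, 303],
    "tce_depth": [1_500_000.0, 2_000_000.0, 850.0, 1_200_000.0],
})

# Step 1: keep only rows with a depth above one million ppm
deep = toy[toy["tce_depth"] > 1_000_000]

# Step 2: keep one row per star (the first occurrence is retained by default)
unique_stars = deep.drop_duplicates(subset="ticid")

print(unique_stars)
```

Star 101 appears twice in the filtered rows but only once in the final table, which is the behavior we rely on when building the target list.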

# Convenience Functions
As we proceed through the notebook, it will be helpful to have coordinates associated with our targets. We'll also need an easy way to determine the camera and CCD the targets were observed on. For this, we'll create a dictionary where each target name is associated with the values we want.

Let's create a helper function to generate the dictionary.

In [None]:
def makeDict(target_table):
    '''
    Create a dictionary where the keys are TESS IDs, and the values are
    the coordinates, camera, and CCD.
    Input: Table of targets with a column named 'ticid' containing the TESS IDs
    Output: Dictionary of those TESS IDs and the associated coordinates, camera, and CCD
    '''
    # Create a "results" dictionary that stores the target name, coordinates, and camera/CCD
    nameCoordCam = dict()
    
    # Loop through the TESS IDs in the target table
    for tid in target_table['ticid']:
        # Add 'TIC' to the target name
        target_name = f"TIC{tid}"
        
        # Get the coordinates for the target
        target_crd = SkyCoord.from_name(target_name)
        
        # Get the camera/CCD used to observe the target. We still need to write this function.
        cam_ccd = getCamCCD(target_crd)
        
        # Add an entry to the dictionary to store the above information
        nameCoordCam[target_name] = [target_crd, cam_ccd]
        
    return nameCoordCam

The output of this function is a pretty handy dictionary. For any target on our list, we have now saved its coordinates and the camera/CCD used by TESS during the observation. We'll need this information when we request cutout cubes in methods 2 and 3.

The `makeDict()` function is short because we've hidden a relatively complex operation inside the `getCamCCD()` function. That function needs to find the camera and CCD used for each observation, which is non-trivial; we'll do it by querying the TESSCut API.

In [None]:
def getCamCCD(coord):
    '''
    For a given set of coordinates, return the camera/CCD used for observation.
    Input: astropy.coordinate object
    Output: string formatted like "[camera]-[CCD]"
    '''
    
    # Split the coordinate object into Ra/Dec
    Ra = coord.ra.degree
    Dec = coord.dec.degree
    
    # Our parameters are the Ra/Dec, which we want to match exactly (zero-radius)
    myparams = {"ra":Ra, "dec":Dec, "radius":"0m"}
    
    # This could be one line of code, but it's good to remember you can query the API for cutouts directly
    urlroot = "https://mast.stsci.edu/tesscut/api/v0.1"
    url = urlroot + "/sector"
    
    # Run the request, get results in JSON format
    requestData = requests.get(url = url, params = myparams)
    results = requestData.json()['results']
    
    # Our results have multiple sectors; loop through them and keep only sector 55.
    # Start with None so the function still returns if sector 55 is absent.
    combined = None
    for result in results:
        if result['sector'] == '0055':
            # Get the camera/CCD, then format them "nicely"
            camnum = result['camera']
            ccdnum = result['ccd']
            combined = f"{camnum}-{ccdnum}"
            break
    
    return combined
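
To see the sector-matching logic without touching the network, the sketch below runs the same loop over a hand-written stand-in for the JSON response. The payload is an assumption: it contains only the three keys that `getCamCCD()` reads (`sector`, `camera`, `ccd`), and the values are invented. `pick_cam_ccd` is a hypothetical helper name, not part of any library.

```python
# Hand-written stand-in for requestData.json(); values are illustrative only
sample_response = {
    "results": [
        {"sector": "0015", "camera": "2", "ccd": "4"},
        {"sector": "0055", "camera": "1", "ccd": "3"},
    ]
}

def pick_cam_ccd(results, sector="0055"):
    """Return '[camera]-[CCD]' for the requested sector, or None if it is absent."""
    for result in results:
        if result["sector"] == sector:
            return f"{result['camera']}-{result['ccd']}"
    return None

print(pick_cam_ccd(sample_response["results"]))                 # sector 55 is present
print(pick_cam_ccd(sample_response["results"], sector="0099"))  # sector 99 is not
```

Returning `None` for a missing sector makes it easy for the caller to skip targets that were not observed in the sector of interest.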

Great! We've written all the helper functions we need to create the dictionary. Let's run our function and take a look at the first entry.

**A note on optimization**: Although this solution is "quick enough" for this example, each call to the API will take roughly 0.2 seconds to complete. If you try to run this on a million objects, you'll be sitting around for 55 hours!
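
The estimate above is simple arithmetic, sketched here so you can plug in your own numbers. The 0.2 seconds per call is the rough figure quoted in this notebook, not a guaranteed response time, and `estimated_hours` is a hypothetical helper.

```python
SECONDS_PER_CALL = 0.2  # rough per-request time quoted in the text

def estimated_hours(n_targets, seconds_per_call=SECONDS_PER_CALL):
    """Estimated wall-clock hours for n_targets sequential API calls."""
    return n_targets * seconds_per_call / 3600

for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} targets: {estimated_hours(n):8.2f} hours")
```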

In [None]:
# Generate the "large depth" transits dictionary
ld_dict = makeDict(large_depth)

In [None]:
# Print only the first entry from our dictionary
for k, v in ld_dict.items():
    print(f"Key: {k}\nValue: {v}")
    break

Printing out the first key/value pair makes it clear exactly why this is useful to us. We now have a list of TESS targets associated with their coordinates and the camera/CCD on which they were observed.

Some of our queries, particularly methods 2 and 3, will be much easier to run using this dictionary.

<a id="Method1"></a>
# Method 1: TESSCut in the Cloud
We'll start by testing out the "conventional" method, which makes use of the [TESSCut](https://mast.stsci.edu/tesscut/) function. This is a convenient way to generate cutouts, and can be run on your local machine.

Let's generate cutouts for all of our targets and time how long it takes to run.

In [None]:
# Record the start time so we can see how long this takes
t0 = time.time()

# Go through our dictionary values, which contain the coordinates and cam/ccd
for v in ld_dict.values():
    # Get the coordinates, then pass them to get_cutouts()
    coord = v[0]
    hdulist = Tesscut.get_cutouts(coordinates=coord, size=10, sector=55)

# Print the time spent executing this cell
print(f"Took {time.time()-t0:.1f} seconds.")

In the TIKE environment, this takes about 40 seconds. If you were to run this request on your local machine, it would take around 380 seconds; nearly 10 times longer!

This performance improvement is extremely impressive when you consider that all we've done is run our code in TIKE. For many users, this convenient factor of ten speedup may be more than fast enough. However, for large enough queries, it might be worthwhile to continue optimizing. Let's explore some options for further improvements.

<a id="Method2"></a>
# Method 2: Cloud Cutout Cubes
A "[TESS Cutout Cube](https://astrocut.readthedocs.io/en/latest/astrocut/index.html#tess-full-frame-image-cutouts)" is an image cube which MAST has generated from the TESS FFIs for one particular sector, camera, and CCD. These files are not meant to be downloaded, as they often exceed 100GB in size. However, these cubes are a _very efficient_ way to generate cutouts; indeed, this is what TESSCut is doing behind the scenes. Let's write our own `get_cutouts` function and see if we can get the data faster.

Since the cutout cubes are specific to a camera, sector, and CCD, we need this information for each target. This is where the dictionary we built comes in handy, as it will help us match our targets to their corresponding cube files. Let's build another helper function to generate these cube file names.

In [None]:
def get_cube_fn(camera_ccd, sector=55):  # Note that the default sector is 55 to match our example.
    # In the filename, the sector is zero-padded to four digits, e.g. 0055
    cube_fn = f"s3://stpubdata/tess/public/mast/tess-s{sector:04d}-{camera_ccd}-cube.fits"
    return cube_fn
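
As a quick illustration of the naming pattern, the sketch below repeats the same f-string and prints the result for two made-up camera/CCD strings. It only demonstrates the string formatting; it does not check that a cube exists at either location.

```python
def get_cube_fn(camera_ccd, sector=55):
    # Same f-string as in the cell above: the sector is zero-padded to four digits
    return f"s3://stpubdata/tess/public/mast/tess-s{sector:04d}-{camera_ccd}-cube.fits"

# The camera/CCD strings here are illustrative inputs
print(get_cube_fn("1-3"))
print(get_cube_fn("4-2", sector=7))
```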

We're almost done! Let's create one more function that takes in our target dictionary and gets all of the cutouts.

In [None]:
def get_cutouts(target_dict):
    # We need nest_asyncio for AWS access within Jupyter
    nest_asyncio.apply()
    # Create a "cutout factory" to generate cutouts
    factory = CutoutFactory()
    for k, v in target_dict.items():
        # Get the coordinates and camera/ccd info from the dictionary
        coord = v[0]
        cam_ccd = v[1]
        print(f"Starting {k}")
        factory.cube_cut(
            cube_file=get_cube_fn(cam_ccd),    # Cube filename for this camera/CCD
            coordinates=coord,                 # Coordinates of the target
            cutout_size=10,
            target_pixel_file=f"{k}.fits",     # Name the output file after the target
        )
        print(f"Finished {k}\n")

We're now ready to run our function and generate the cutouts.

In [None]:
# We're ignoring some warnings about the size of the comment card
warnings.filterwarnings('ignore')

# Start time for this cell
t0 = time.time()

# Get the cutouts for the large-depth dictionary
get_cutouts(ld_dict)

# Print how long this cell took to run
print(f"Took {time.time()-t0:.1f} seconds.")

This usually runs in under 20 seconds; half the time of Method 1!

However, there's a way to squeeze out even more performance from this query. Note the order in which this ran; we start the request for a cutout, then we wait for the data to be returned. Wouldn't it be nice if we could request multiple cutouts at the same time?

As a matter of fact, we can do exactly that with multiprocessing.

<a id="Method3"></a>
# Method 3: Cutout Cubes with Multiple Cores

Python has a feature called the Global Interpreter Lock, or GIL. You can read [this interesting blog post](https://realpython.com/python-gil/) for more details, but in essence: the GIL prevents more than one Python thread from executing Python code at the same time within a single process. This matters in this example, because we can only request one cutout at a time. To be most efficient, every CPU core should have its own process, so they can all work on getting cutouts.

We can figure out how many processors are available with a call to the `cpu_count` function.

In [None]:
n_cores = multiprocessing.cpu_count()
n_cores

At the time of writing, TIKE offers us 4 processors. Thus we could, in theory, cut down on the run time by a factor of 4. In practice, the speedup will be slightly less than this since multiprocessing is not perfectly efficient.

We'll need to make some minor changes to our functions in order for them to run on multiple processors. To start, we need to break down our single list of targets into four lists, one for each processor. We want these lists close in length to each other, so that each processor handles roughly the same number of cutouts.

In [None]:
# Pull the list of targets out of our dictionary
targets = list(ld_dict.keys())

# Assign targets so that each CPU has a list of roughly equal length
target_lists = [targets[i::n_cores] for i in range(n_cores)]
target_lists

The lists aren't perfectly equal lengths, but they're close enough to maximize multiprocessing efficiency.
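
To make the slicing trick concrete, here is a small sketch with placeholder target names. The core count of 4 and the 13 made-up names are assumptions chosen for illustration.

```python
n_cores = 4  # assumed core count for this illustration
targets = [f"TIC{i}" for i in range(1, 14)]  # 13 placeholder names

# targets[i::n_cores] takes every n_cores-th item, starting at index i
target_lists = [targets[i::n_cores] for i in range(n_cores)]

for core, chunk in enumerate(target_lists):
    print(f"Core {core}: {len(chunk)} targets -> {chunk}")
```

With this round-robin split, list lengths never differ by more than one, and every target lands in exactly one list.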

Our next big change will be how we call the functions. Unfortunately, the multiprocessing module was not designed to be run in a Jupyter Notebook. As a workaround, we can copy the code we wrote before to a separate Python file and import it.

We are doing this because `multiprocessing` expects to see `if __name__ == '__main__':`, but this syntax does not function properly in a notebook environment. Ideally, multiprocessing code would be contained in its own Python script; see examples of proper use in the [Python multiprocessing docs](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing).
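
For reference, a standalone script following that pattern might look like the sketch below. This is not the contents of `multi`; `fake_cutout` and `worker` are hypothetical placeholders that stand in for the real cutout work, and the target names are made up.

```python
import multiprocessing

def fake_cutout(target_name):
    """Placeholder for real work; returns the filename it would write."""
    return f"{target_name}.fits"

def worker(target_list):
    # Each process handles its own share of the targets
    for name in target_list:
        print(f"Finished {fake_cutout(name)}")

def main():
    target_lists = [["TIC1", "TIC3"], ["TIC2", "TIC4"]]  # hypothetical targets
    processes = [multiprocessing.Process(target=worker, args=(chunk,))
                 for chunk in target_lists]
    for p in processes:
        p.start()
    for p in processes:
        p.join()  # wait for every worker to finish

# The guard keeps child processes from re-running main() when they import this file
if __name__ == "__main__":
    main()
```

Saved as a `.py` file and run from the command line, the guard ensures that only the parent process launches workers.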

We're now ready to run our query with multiple processors. Watch the output of the below cell and note that the queries do not execute linearly.

In [None]:
# Import our functions, and ignore warnings
import multi as m
warnings.simplefilter('ignore')

# Time this method
t0 = time.time()

# Create a list to hold CPU processes
processes = []

# Go through the lists of targets, and start one process for each
for target_list in target_lists:
    # The "target" of multiprocessing is our cutout function.
    # The arguments are our target list and target dictionary
    a_process = multiprocessing.Process(target = m.get_cutouts
                                        ,args = [target_list, ld_dict])
    a_process.start()
    processes.append(a_process)

# Wait for every process to finish; joining also stops them from becoming "zombies"
for a_process in processes:
    a_process.join()

# Print the time taken to run this cell
print(f"Took {time.time()-t0:.1f}s")

With multiple processors, it takes under 7 seconds to get the cutouts. This is about three times faster than the previous method! 

**A note on runtime, CPUs, and processors:** We might have expected a time closer to 5 seconds: that's the Method #2 result of 20 seconds divided by 4. However, our "4 cores" actually consist of 2 physical cores and 2 virtual cores. Since we need physical silicon to perform calculations, we cannot run four processes simultaneously; there are only two physical cores available. With that said, it's still worthwhile to use the virtual cores: the CPU will use [multi-threading](https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)) to save an additional second or two.
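
If you want to check the logical versus physical core count yourself, the sketch below is one way to do it. `os.cpu_count()` reports logical cores; the physical count relies on the third-party `psutil` package, which is an assumption here since it may not be installed in your environment.

```python
import os

logical = os.cpu_count()  # logical (virtual) cores; may be None if undetermined

try:
    import psutil  # optional third-party package
    physical = psutil.cpu_count(logical=False)  # may be None on some platforms
except ImportError:
    physical = None

print(f"Logical cores:  {logical if logical is not None else 'unknown'}")
print(f"Physical cores: {physical if physical is not None else 'unknown'}")
```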

# Summary and Key Takeaways
We've gone over several methods for generating cutouts from FFIs. Let's compare the performance for each method:

Method | Time to Run (s) | Time per Cutout (s)
:--- | :---: | :---:
Local Machine, TESSCut | 380 | 29.2
TIKE, TESSCut | 40 | 3.1
TIKE, Cutout Cubes | < 20 | < 1.5
TIKE, Multiprocessing & Cubes | < 7 | < 0.5

It is worth pointing out that our best time is more than 50x faster than doing a "standard" TESSCut on a local machine. The multiprocessing method is the trickiest to implement, but can be worthwhile if you're generating a large number of cutouts.
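
The speedups implied by the table can be recomputed with a few lines of arithmetic. The times are copied from the table; two of them are upper bounds ("< 20" and "< 7"), so the corresponding speedups are lower bounds.

```python
LOCAL_TESSCUT = 380  # seconds, local machine with TESSCut (from the table)

times = {
    "TIKE, TESSCut": 40,
    "TIKE, Cutout Cubes": 20,             # upper bound ("< 20")
    "TIKE, Multiprocessing & Cubes": 7,   # upper bound ("< 7")
}

# Speedup of each method relative to the local TESSCut run
speedups = {name: LOCAL_TESSCUT / t for name, t in times.items()}

for name, s in speedups.items():
    print(f"{name}: at least {s:.1f}x faster than local TESSCut")
```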

Even with no changes to the code, using the TIKE platform results in a significant speed improvement. This is true for many packages, including `astroquery`, `lightkurve`, `astrocut`, and others.

## Additional Resources

- [TESS Archive Manual](https://outerspace.stsci.edu/display/TESS/TESS+Archive+Manual) for more information about the TESS data products stored in MAST
- [TESSCut Home](https://mast.stsci.edu/tesscut/), for additional reading about the TESSCut software

## About this Notebook
For support, please contact the Archive HelpDesk at archive@stsci.edu, or through the [MAST HelpDesk Portal](https://stsci.service-now.com/mast).

**Author:** Thomas Dutkiewicz <br>
**Keywords:** TIKE, AWS Cloud, Multiprocessing, Optimization <br>
**Last Updated:** Nov 2022 <br>
**Next Review:** Apr 2023
***
[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/spacetelescope/notebooks/master/assets/stsci_pri_combo_mark_horizonal_white_bkgd.png" alt="Space Telescope Logo" width="200px"/> 