# Notebook X: performing large searches for exoplanet (candidates) with TIKE

## Learning objectives

[td: be more specific with our goals, now that we have the end written. important to note cloud access (rather than download)]
- rapidly download large amounts of TESS data
- work with the large TESS data collections

## Introduction
Studying exoplanet demographics (the frequency of different types of exoplanets) lets us tackle some of the biggest questions in exoplanet science: How do planets form? How do they evolve? Are there any exoplanets that are like the Earth? 

TIKE is an excellent resource for computing occurrence rates. Other resources explain TIKE in more detail; here we summarize its main benefits for this use case:

- your computations are performed "close" to MAST data, meaning that it's quicker to access lightcurves.
- you don't have to download data to your own computer, saving *storage* too.
- you don't have to use your own computational resources.
- in most cases, the scientific computing environments are already set up. you don't have to spend time wrangling with environments.


[todo: increase or decrease the amount that TIKE is talked up?]

[td: this is an ok amount of talking up]


In this notebook, we'll take one of the first major steps to calculating occurrence rates: searching a large selection of lightcurves for exoplanets. Note: our abridged version may produce spurious planetary candidates! While we favor a concise narrative and rapid notebook execution time here, we'll note when applicable where a more robust approach should be taken for, e.g., a scientific publication.

# 1. Import packages, set constants

Before diving into planet-finding, we will import some necessary packages.

- `astropy` contains a number of utility functions for working with astronomical data.
- `astroquery` lets us easily query the astronomical databases that contain the TESS lightcurves.
- `numpy` is used for array manipulation.
- `matplotlib.pyplot` is used to display images and plot datasets.
- `tqdm` is a lightweight progress bar that we can use to track how long our calculations will take.
- `pandas` lets us interact easily with CSV files that we'll be downloading.
- `os` and `concurrent` let us make informed choices when multithreading.

In [None]:
from astropy.io import fits
import astropy.units as u
import astropy.constants as const
from astropy.timeseries import BoxLeastSquares


from astroquery.mast import Observations
from astroquery.mast import Catalogs
from astroquery.exceptions import InvalidQueryError


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm


import os
from concurrent.futures import ThreadPoolExecutor, as_completed

We will also *install* one package into our environment: the `batman` package. This tool will let us easily model exoplanet transits during our planet searches.

**Note**: this installation will not persist after restarting TIKE. See the [installation instructions](~/references/tike_content-ref/markdown/software-installed.md) to do a persistent installation.

In [None]:
!pip install batman-package

With our package installed, we can safely import it.

In [None]:
import batman

We'll now set a few physical constants and conversion factors so that we can use them later on.

In [None]:
G = const.G.si.value
days_to_seconds = (u.day).to(u.s)
r_earth_to_meters = (u.R_earth).to(u.m)
m_sun_kg = (u.M_sun).to(u.kg)
r_sun_m = (u.R_sun).to(u.m)

We'll also make sure that we retrieve cloud files from AWS, making our data access stage much faster.

In [None]:
# Important: ensure files are retrieved from AWS 
Observations.enable_cloud_dataset(provider='AWS')

# 2. Find stars

Generally, the first step to finding planets is figuring out which host stars you want to search for orbiting planets. In this study, we want to find the occurrence of hot Jupiters around Sun-like stars. We'll make some cuts so that we're looking for bright, high-gravity, solar-mass stars:

- *bright*: TESS was designed to look for transits around bright, nearby stars. These stars' TESS data will have high signal-to-noise ratios, so we have better chances of finding planetary signals in them.
- *high-gravity*: particularly low-gravity stars tend to be evolved off the stellar main sequence, and therefore not very Sun-like.
- *solar-mass*: solar-mass stars on the stellar main sequence tend to be quite Sun-like.
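
To put the surface-gravity cut in context, here is a minimal sketch that computes log g (in cgs units, the convention used for `logg`) for the Sun and for a hypothetical evolved star of the same mass but twice the radius. The constants are hard-coded approximate values and the helper name `logg_cgs` is our own, not part of any catalog API.

```python
import math

# Approximate physical constants in cgs units (assumed values for illustration)
G_CGS = 6.674e-8      # gravitational constant, cm^3 g^-1 s^-2
M_SUN_G = 1.989e33    # solar mass, g
R_SUN_CM = 6.957e10   # solar radius, cm

def logg_cgs(mass_msun, radius_rsun):
    """Return log10 of the surface gravity (cm/s^2) for a star of the given mass and radius."""
    gravity = G_CGS * mass_msun * M_SUN_G / (radius_rsun * R_SUN_CM) ** 2
    return math.log10(gravity)

solar_logg = logg_cgs(1.0, 1.0)      # the Sun
evolved_logg = logg_cgs(1.0, 2.0)    # same mass, twice the radius (hypothetical evolved star)

print(f"Sun: logg = {solar_logg:.2f}")
print(f"Evolved star: logg = {evolved_logg:.2f}")
```

The evolved star falls below a `logg` threshold of 4.1, which is why a lower bound on `logg` helps keep the sample close to the main sequence.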

To find these stars, we'll query the TESS Candidate Target List (CTL) — a list of stars pre-selected by the TESS team to be likely good targets for transit detection (e.g., bright, low flux contamination).

In [None]:
# this cell will take ~30 seconds to run
catalog_data = Catalogs.query_criteria(catalog="Ctl",
                                       Tmag=[0, 10.5],   # remember, lower magnitudes are brighter than higher ones!
                                       logg=[4.1, 400],  # solar logg ~ 4.4
                                       mass=[0.8, 1.05]) # these masses are in solar masses

Let's take a look at this catalog data.

In [None]:
catalog_data

This is a pretty long table, totaling 53056 stars, with many columns of information about each star.

To get a better sense of the data products we'll be working with, let's access a single lightcurve associated with one of these stars.

In [None]:
# access a single star from this catalog to inspect the data type we're working with.
TESS_table = Observations.query_criteria(target_name=catalog_data['ID'][2]
                                         , obs_collection="TESS"
                                         , dataproduct_type='timeseries'
                                         ) 
# get data products
data_products = Observations.get_product_list(TESS_table) 

# filter for light curve data only
lc_prod = Observations.filter_products(data_products, productType="SCIENCE"
                                       , productSubGroupDescription="LC")

# Get the cloud uri for the first of these files
lc_uri = Observations.get_cloud_uris(lc_prod)[0]

#open the lc file
lc_fits = fits.open(lc_uri, use_fsspec=True, fsspec_kwargs={"anon": True})

Great! This file is now loaded as `lc_fits`. Now, let's examine the columns available to us in this lightcurve.

In [None]:
lc = lc_fits[1].data
lc.columns

There are two flux columns: `SAP_FLUX` (Simple Aperture Photometry) and `PDCSAP_FLUX` (Pre-search Data Conditioning SAP). The SAP flux is closer to the raw TESS data. The PDCSAP flux is derived from the SAP flux, but it's been cleaned of longer-term systematic trends.

Let's plot both types of flux to get a sense of their differences.

In [None]:
sapflux = lc['SAP_FLUX']
pdcflux = lc['PDCSAP_FLUX']
time_lc = lc['TIME']

fig = plt.figure(figsize = (11,4))

fig.add_subplot(211)
plt.plot(time_lc, sapflux,'.', label = 'SAP', color = "gold")
plt.legend(loc = 'lower left')
plt.ylabel("FLUX (e-/s)")

fig.add_subplot(212)
plt.plot(time_lc, pdcflux,'.', label = 'PDC', color = "red")
plt.legend(loc = 'lower left')
plt.ylabel("FLUX (e-/s)")
plt.xlabel('TIME  (BJD-2457000)')

This definitely looks like a TESS lightcurve! The PDC data, as expected, are smoother and seem more evenly distributed about a median value.
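
When comparing lightcurves from different stars, it is often convenient to divide each one by its median so that the flux is expressed relative to 1. The sketch below is a pure-Python illustration with made-up flux values; the helper name `normalize_by_median` is our own. It skips NaN entries when computing the median, since the notebook also has to mask NaN flux values before running the planet search.

```python
import math
import statistics

def normalize_by_median(flux):
    """Divide a flux series by the median of its finite values; NaNs stay NaN."""
    finite_values = [f for f in flux if not math.isnan(f)]
    median_flux = statistics.median(finite_values)
    return [f / median_flux for f in flux]

# Made-up flux values (e-/s) with one missing cadence
example_flux = [1000.0, 1010.0, float("nan"), 990.0, 1000.0]
normalized_flux = normalize_by_median(example_flux)
print(normalized_flux)
```

After normalization, a 1% transit dip shows up as a drop from 1.00 to 0.99 regardless of how bright the star is.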

# Next: aggregate all the lightcurves!

Now we know how to access a single lightcurve. To calculate our occurrence rates, we'll need to download lightcurves from many stars. 

The neatest way to do this is to write a function that accesses TESS lightcurves, then loop over that function for all the stars we want to query. Along the way, we'll incorporate a few tricks to speed things up.

Instead of, e.g., saving the data as files, it's more efficient to keep all of this data in RAM as we go. A neat way to do this is to define a class for our data, fetch the data with a method of that class, then assign the data to an attribute of that class. This object-oriented approach avoids extra input/output (I/O) time and maintains clean code design.

In [None]:
class LightcurveData:
    def __init__(self, catalog_id):
        """
        Initializes the instance.

        Inputs
        ------
            :catalog_id: TESS Input Catalog (TIC) ID.
        """
        self.catalog_id = catalog_id # assign catalog ID to the object

    def _set_missing(self):
        """
        Marks this target as having no usable lightcurve.
        """
        self.pdcflux = np.nan
        self.pdcflux_err = np.nan
        self.time_lc = np.nan

    def fetch_and_save_data(self):
        """
        Accesses TESS data for this target and stores it on the instance.
        """
        TESS_table = Observations.query_criteria(target_name=self.catalog_id
                                                 , obs_collection="TESS"
                                                 , dataproduct_type='timeseries'
                                                 ) 
        try:
            data_products = Observations.get_product_list(TESS_table) 
        except InvalidQueryError:
            # no observations were found for this target
            self._set_missing()
            return

        # keep only the science lightcurve ("LC") products
        lc_prod = Observations.filter_products(data_products
                                               , productType="SCIENCE"
                                               , productSubGroupDescription="LC")
        if len(lc_prod) == 0:
            # observations exist, but none of them is a lightcurve file
            self._set_missing()
            return

        # get the cloud URI for the first lightcurve file
        lc_uri = Observations.get_cloud_uris(lc_prod[:1])[0]

        # open the lc file
        lc_fits = fits.open(lc_uri, use_fsspec=True, fsspec_kwargs={"anon": True})

        # assign the data to this instance 
        lc = lc_fits[1].data
        self.pdcflux = lc['PDCSAP_FLUX'] #PDCSAP flux column
        self.pdcflux_err = lc['PDCSAP_FLUX_ERR'] # flux error column
        self.time_lc = lc['TIME'] #time column

Let's test this method out on a single star.

In [None]:
lightcurvedata = LightcurveData(catalog_data['ID'][2])
lightcurvedata.fetch_and_save_data()

Let's check that the method worked by inspecting two of the attributes it should have set.

In [None]:
lightcurvedata.time_lc, lightcurvedata.pdcflux_err

Great! Our data has been downloaded and assigned as an attribute to an object.

Now, we want to *parallelize* our computation. What this means is we want to access multiple lightcurve files simultaneously.

We'll do this with Python's `concurrent.futures` module. Multithreading is a very rich and complex subject, but for our purposes we can just think of it as splitting up our machine into different sections that each perform a separate task. Because our operation is largely limited by the time spent waiting on data access rather than by computation, running several requests at once in separate threads is a good fit.

[td: link something here, talk about the deets a little bit more. maybe mention that you typically get 4 cores in TIKE?]

In [None]:
# number of CPU cores available on this machine; a useful reference point when choosing max_workers
os.cpu_count() 
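
To see why threads help when a task spends most of its time waiting, here is a small self-contained sketch. The function `fake_fetch` is a hypothetical stand-in for a data request: it just sleeps briefly and returns a placeholder string, so no MAST access is involved. The timings it prints will vary from machine to machine.

```python
from time import perf_counter, sleep
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(target_id, delay=0.02):
    """Hypothetical stand-in for a network request: wait, then return (id, data)."""
    sleep(delay)
    return target_id, f"lightcurve-for-{target_id}"

target_ids = list(range(20))

# One request at a time
start = perf_counter()
serial_results = dict(fake_fetch(target_id) for target_id in target_ids)
serial_seconds = perf_counter() - start

# Up to five requests in flight at once
start = perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fake_fetch, target_id) for target_id in target_ids]
    # results arrive in completion order, so we key them by ID
    threaded_results = dict(future.result() for future in as_completed(futures))
threaded_seconds = perf_counter() - start

print(f"serial: {serial_seconds:.2f} s, threaded: {threaded_seconds:.2f} s")
```

Because each task returns its own ID alongside its data, the results can be matched back to their targets even though the threads finish in an unpredictable order.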


Now for a short function that will create an object for a catalog ID, download the data, and return the *object*.

We'll have this function return both the TIC ID and the object itself so that it can be sorted even if we work through our catalog asynchronously.

In [None]:
def return_object(catalog_id):
    lightcurvedata = LightcurveData(catalog_id)
    lightcurvedata.fetch_and_save_data()
    return catalog_id, lightcurvedata

In [None]:
# Use ThreadPoolExecutor to download files in parallel
total_tasks = len(catalog_data['ID'][::100])
progress_bar = tqdm(total=total_tasks, position=0, leave=True)

max_workers = 5  # Adjust this based on your system's capabilities

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(return_object, catalog_id) for catalog_id in catalog_data['ID'][::100]]
    for future in as_completed(futures):
        progress_bar.update(1)
    

# now store all the data in a dictionary keyed by TIC ID
lightcurve_objects = {}
for future in futures:
    catalog_id, lightcurvedata = future.result()
    lightcurve_objects[catalog_id] = lightcurvedata

We can access each object now from the `lightcurve_objects` dictionary. Let's check!

In [None]:
lightcurve_objects[catalog_data['ID'][0]].time_lc

Nice! This is our array of observation times for this target star.

# Next: search the lightcurves for planets.
We now have our processed data products. The next step is to search them for planets!

One of the standard ways to do this is with the [Box-Least Squares algorithm](https://arxiv.org/abs/astro-ph/0206099). This approach takes advantage of the facts that 1) planetary transits are expected to be periodic, and 2) transits are more or less shaped like boxes (excluding effects like limb-darkening). Loosely, the algorithm works similarly to the Lomb-Scargle periodogram, with the key difference that the signal being tested is a box car, instead of a sine wave.
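
To build intuition for the idea, here is a deliberately simplified, pure-Python box search run on synthetic data. It is a toy sketch rather than the algorithm as implemented in `astropy`: for each trial period it folds the lightcurve, slides a box of fixed width across phase, and scores each box by its mean depth times the square root of the number of points inside it. The function name `toy_box_search` and all of the synthetic values are our own assumptions.

```python
import math
import random

def toy_box_search(times, fluxes, trial_periods, duration, n_phase_steps=50):
    """Return the (period, score) of the trial period with the strongest box-shaped dip."""
    mean_flux = sum(fluxes) / len(fluxes)
    best_period, best_score = None, -math.inf
    for period in trial_periods:
        phases = [t % period for t in times]  # fold the lightcurve on this period
        for step in range(n_phase_steps):
            box_start = step * period / n_phase_steps
            in_box = [f for p, f in zip(phases, fluxes)
                      if (p - box_start) % period < duration]
            if not in_box:
                continue
            depth = mean_flux - sum(in_box) / len(in_box)
            score = depth * math.sqrt(len(in_box))  # rough signal-to-noise proxy
            if score > best_score:
                best_period, best_score = period, score
    return best_period, best_score

# Synthetic lightcurve: 27 days of data with a 1% deep, 0.1 day long transit every 3 days
random.seed(0)
true_period, true_duration, true_depth = 3.0, 0.1, 0.01
times = [0.02 * k for k in range(1350)]
fluxes = [1.0
          - (true_depth if (t - 1.0) % true_period < true_duration else 0.0)
          + random.gauss(0.0, 0.001)
          for t in times]

trial_periods = [2.0 + 0.1 * k for k in range(21)]  # 2.0 to 4.0 days
best_period, best_score = toy_box_search(times, fluxes, trial_periods, true_duration)
print("Recovered period (days):", best_period)
```

The folded dips only line up when the trial period matches the true one, which is what makes the score peak there.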

Let's load in a light curve and test out the algorithm.

In [None]:
time, flux, flux_err  = lightcurve_objects[catalog_data['ID'][0]].time_lc,  \
                        lightcurve_objects[catalog_data['ID'][0]].pdcflux,  \
                        lightcurve_objects[catalog_data['ID'][0]].pdcflux_err

time = time[~np.isnan(flux)]
flux_err = flux_err[~np.isnan(flux)]
flux = flux[~np.isnan(flux)]
model = BoxLeastSquares(time, flux, flux_err)

For this example, we'll search for planets with periods between 0.5 and 13 days. The lower limit is physical: planets can only have periods that are so short before their orbits become unstable. The upper limit is statistical: by an argument similar in spirit to [Nyquist sampling](https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem), you can only measure the period of a signal that repeats at least twice within your dataset, so the longest detectable period is about *half* your observational baseline. Because TESS sectors are generally 27 days long, we'll set our upper limit to 13 days. We'll test 1000 trial periods for the purposes of this exploratory notebook.

In [None]:
lower_limit = 0.5 #days
upper_limit = 13 # days
n_periods_tested = 1000
periods = np.linspace(lower_limit, upper_limit, n_periods_tested)  # in units of days.
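
The reasoning behind the upper period limit can be sketched with a little arithmetic. The helpers below are our own illustrative names, and they assume a continuous baseline with no data gaps, which real TESS sectors do not strictly satisfy.

```python
def max_period_for_n_transits(baseline_days, n_transits=2):
    """Longest period for which n_transits are guaranteed to fit in a continuous baseline."""
    return baseline_days / n_transits

def guaranteed_transits(baseline_days, period_days):
    """Minimum number of transits that fall within a continuous baseline."""
    return int(baseline_days // period_days)

sector_length = 27  # days, the approximate length of a TESS sector
longest_period = max_period_for_n_transits(sector_length)
transits_at_upper_limit = guaranteed_transits(sector_length, 13)
transits_for_hot_jupiter = guaranteed_transits(sector_length, 3)

print("Longest period with two guaranteed transits (days):", longest_period)
print("Guaranteed transits at P = 13 days:", transits_at_upper_limit)
print("Guaranteed transits at P = 3 days:", transits_for_hot_jupiter)
```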

To search for the planets, we will need to specify a transit duration. At first, this may seem like a daunting task — will we have to provide another axis to our search?

To keep things computationally feasible for this notebook, let's appeal to the original goal: finding hot Jupiters orbiting Sun-like stars. Because we're searching for a specific planet type (with a certain orbital period range) orbiting a certain type of star, there should be an expected transit duration for our search. Let's calculate what that duration is.

Assuming a circular orbit, the transit duration is (Seager & Mallén-Ornelas 2003):

$$T_{\rm dur} =\frac{P}{\pi} \arcsin\bigg{(}\frac{\sqrt{1 - (a/R_{\star})^2 \cos^2(i)}}{(a/R_{\star})\sin(i)}\bigg{)},$$

where $P$ is the orbital period, $a$ is the semimajor axis, $R_{\star}$ is the stellar radius, and $i$ is the orbital inclination. This form neglects the planet's radius relative to the star's.

What do we know? We have rough values for the masses of the planet and the star, and the orbital period of a hot Jupiter is about three days. From Kepler's third law we can then calculate the semimajor axis.

Let's code this up!

In [None]:
def calc_a_from_period(period, planet_mass, stellar_mass):
    """

    Given the period, uses Kepler's Third Law to calculate the semimajor axis in meters.

    Inputs
    -----
        :period: orbital period of planet (s)
        :planet_mass: mass of planet (kg)
        :stellar_mass: mass of star (kg)

    Outputs
    ------
        :a: semimajor axis(meters)
    """
    total_mass = planet_mass + stellar_mass
    a_cubed = (const.G.si.value * total_mass * period**2) / (4 * np.pi**2)
    a = a_cubed**(1/3)
    return a
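
As a quick sanity check on Kepler's third law, the standalone sketch below recovers roughly 1 au for a one-year orbit around a solar-mass star, and then evaluates a three-day orbit. The constants are hard-coded approximate values and the function name `semimajor_axis_m` is our own. The planet's mass is neglected here because it is small compared with the star's.

```python
import math

# Approximate SI constants (assumed values for illustration)
G_SI = 6.674e-11     # gravitational constant, m^3 kg^-1 s^-2
M_SUN_KG = 1.989e30  # solar mass, kg
AU_M = 1.496e11      # astronomical unit, m
SECONDS_PER_DAY = 86400.0

def semimajor_axis_m(period_s, total_mass_kg):
    """Kepler's third law: a^3 = G * M * P^2 / (4 * pi^2)."""
    return (G_SI * total_mass_kg * period_s ** 2 / (4 * math.pi ** 2)) ** (1 / 3)

earth_a_au = semimajor_axis_m(365.25 * SECONDS_PER_DAY, M_SUN_KG) / AU_M
hot_jupiter_a_au = semimajor_axis_m(3 * SECONDS_PER_DAY, M_SUN_KG) / AU_M

print(f"1 year orbit: a = {earth_a_au:.3f} au")
print(f"3 day orbit:  a = {hot_jupiter_a_au:.4f} au")
```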

In [None]:
def calc_t_dur(P, a, rstar, i):
    """
    Calculates the transit duration, in seconds, of a planet.

    Inputs
    ------
        :P: the orbital period of the planet, in seconds.
        :a: the semimajor axis of the planet, in meters.
        :rstar: the radius of the star, in meters.
        :i: the inclination of the planet, in radians. an inclination of 0 is face-on (non-transiting); an inclination of pi/2 is edge-on (transiting across the center of the star).
    """
    a_r = a / rstar  # scaled semimajor axis, a/R_star
    return (P / np.pi) * np.arcsin(np.sqrt(1 - np.square(a_r * np.cos(i))) / (a_r * np.sin(i)))
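
To see how the duration depends on geometry, here is a standalone sketch written in terms of the scaled semimajor axis $a/R_{\star}$ and the impact parameter $b = (a/R_{\star})\cos(i)$. It uses the small-planet, circular-orbit form $\sin(\pi T_{\rm dur}/P) = \sqrt{1 - b^2}\,/\,[(a/R_{\star})\sin(i)]$. The value $a/R_{\star} = 8.8$ is an assumed, illustrative number for a three-day orbit around a Sun-like star, and the function name is our own.

```python
import math

def transit_duration_days(period_days, a_over_rstar, inclination_rad):
    """Transit duration for a circular orbit, neglecting the planet's radius."""
    impact_parameter = a_over_rstar * math.cos(inclination_rad)
    if impact_parameter >= 1.0:
        return 0.0  # the planet's path misses the stellar disk
    chord = math.sqrt(1.0 - impact_parameter ** 2)
    return (period_days / math.pi) * math.asin(
        chord / (a_over_rstar * math.sin(inclination_rad)))

period_days = 3.0
a_over_rstar = 8.8  # assumed illustrative value

central_hours = 24 * transit_duration_days(period_days, a_over_rstar, math.pi / 2)
grazing_inclination = math.acos(0.9 / a_over_rstar)  # impact parameter of 0.9
grazing_hours = 24 * transit_duration_days(period_days, a_over_rstar, grazing_inclination)
missing_hours = 24 * transit_duration_days(period_days, a_over_rstar, math.radians(80))

print(f"central transit: {central_hours:.2f} h")
print(f"b = 0.9 transit: {grazing_hours:.2f} h")
print(f"i = 80 degrees:  {missing_hours:.2f} h")
```

A transit across the center of the star is the longest case; the duration shrinks as the impact parameter grows and drops to zero once the planet no longer crosses the disk.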
    

[td: there's a lot of code cells here with no narrative. I know you just explained this above; can you find a way to better intersperse the details between code cells?]

In [None]:
period = (3 * u.day).si.value # 3 day orbital period, expressed in seconds
planet_mass = (1 * u.M_jup).si.value # jupiter mass
stellar_mass = (1 * u.M_sun).si.value # solar mass
stellar_radius = (1 * u.R_sun).si.value # solar radius
inclination  = np.pi/2 # transiting across the stellar equator

a = calc_a_from_period(period, planet_mass, stellar_mass)
t_dur = calc_t_dur(period, a, stellar_radius, inclination)

(t_dur * u.s).to(u.hour), (t_dur * u.s).to(u.day)

In [None]:
(t_dur * u.s).to(u.day).value

Great! Now let's express this transit duration in days and find our planets.

In [None]:
t_dur = (t_dur * u.s).to(u.day).value
t_dur

In [None]:
results = model.power(periods, t_dur)  # The second argument is the duration of the transit (in days)

We can access the best-fitting period from our planet search with the cell below.

In [None]:
best_period = results.period[np.argmax(results.power)]
print("Best-fit period:", best_period)

Now let's plot up the signal-to-noise of the tested planetary signals at all tested periods, paying special attention to the best-fitting period.

In [None]:
plt.plot(results.period, results.depth_snr)
plt.axvline(best_period, color='black', linestyle='--', label='Best-fitting period')
plt.xlabel('Period (days)')
plt.ylabel('Signal S/N')
plt.legend(frameon=False)

[td: comment about this graph: what's awesome? is that what we expected to see?]
Awesome! Now let's wrap this code in a function and apply it to all the selected stars.

In [None]:
def do_bls(time, flux, flux_err):
    """
    Runs box least squares on a lightcurve and computes statistics for the strongest peak.
    """
    model = BoxLeastSquares(time, flux, flux_err)
    results = model.power(periods, t_dur)  # The second argument is the duration of the transit (in days)
    max_power = np.argmax(results.power)  # index of the strongest peak
    stats = model.compute_stats(results.period[max_power],
                                results.duration[max_power],
                                results.transit_time[max_power])

    # todo: check even and odd
    return results, stats

[td: again, lots of code without much narrative in these cells and the following ones. a little bit of commentary goes a long way for newbies]

In [None]:
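def do_inversion_test(time, flux, flux_err):
    """
    Flips the lightcurve about its mean (so dips become bumps) and reruns the box least squares search.
    """
    inverted_flux = -1 * flux + 2 * np.mean(flux)
    results, stats = do_bls(time, inverted_flux, flux_err)
    return results, stats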
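
The idea behind the inversion test is that a transit is a one-sided dip, while stellar variability tends to push the flux both up and down. Reflecting the lightcurve about its mean turns dips into bumps, so a genuine transit should no longer look like a transit after inversion. The pure-Python sketch below illustrates the reflection on a made-up ten-point series; the helper name `invert_about_mean` is our own.

```python
def invert_about_mean(flux):
    """Reflect a flux series about its mean, so dips become bumps."""
    mean_flux = sum(flux) / len(flux)
    return [2 * mean_flux - f for f in flux]

# Made-up normalized lightcurve: flat, with a 1% dip in the last two points
transit_like = [1.0] * 8 + [0.99] * 2
inverted = invert_about_mean(transit_like)

original_mean = sum(transit_like) / len(transit_like)
inverted_mean = sum(inverted) / len(inverted)

print("original:", transit_like)
print("inverted:", [round(f, 3) for f in inverted])
```

The mean is unchanged by the reflection, but the lowest points of the original series are now the highest ones.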

In [None]:
def find_planet(ticid):
    """
    This is our main function to find a planet. We basically perform the box-least squares algorithm, 
    then do a simple inversion test.

    Inputs
    ------
        :ticid: TIC ID of a star stored in `lightcurve_objects`.

    Outputs
    -------
        True if the star shows a candidate planet signal, False otherwise.
    """
    lightcurve_object = lightcurve_objects[ticid]
    time, flux, flux_err  = lightcurve_object.time_lc, lightcurve_object.pdcflux, lightcurve_object.pdcflux_err

    # time is just a nan if the download did not proceed correctly.
    if isinstance(time, float):
        return False

    time = time[~np.isnan(flux)]
    flux_err = flux_err[~np.isnan(flux)]
    flux = flux[~np.isnan(flux)]
    model = BoxLeastSquares(time, flux, flux_err)
    results = model.power(periods, t_dur)  # The second argument is the duration of the transit (in days)
    max_power = np.argmax(results.power)
    stats = model.compute_stats(results.period[max_power],
                                results.duration[max_power],
                                results.transit_time[max_power])

    # require at least two transits and a strong signal
    if len(stats['transit_times']) >= 2 and results.depth_snr[max_power] >= 10:

        # test for variability: if we invert the lightcurve and still find a strong "transit", not good.
        inverted_results, inverted_stats = do_inversion_test(time, flux, flux_err)
        inverted_max_power = np.argmax(inverted_results.power)
        if len(inverted_stats['transit_times']) < 2 or inverted_results.depth_snr[inverted_max_power] < 10:
            # we found a planet candidate then!
            return True

    # todo: The harmonic_delta_log_likelihood is the difference in log likelihood between a sinusoidal model and the transit model. If harmonic_delta_log_likelihood is greater than zero, the sinusoidal model is preferred.

    return False

In [None]:
found_planets = {}
for catalog_id in tqdm(catalog_data['ID'][::100], position=0):
    found_planets[catalog_id] = find_planet(catalog_id)

Great! Let's quickly see how many planets our simple pipeline found.

In [None]:
n_found_planets = np.sum(list(found_planets.values()))
n_found_planets

The simplest way to calculate occurrence rates is to divide the number of planets found by the number of stars searched. Let's go ahead and try that out.

In [None]:
n_found_planets/len(found_planets)

This number is the same order of magnitude as the expected hot Jupiter occurrence rate (about 0.6% around G stars; [Beleznay & Kunimoto 2022](https://academic.oup.com/mnras/article/516/1/75/6654884)). Nice!

But this nearly correct answer is a lucky coincidence: we must have a lot of false positives! More advanced planet-finding and occurrence rate pipelines take multiple additional factors into consideration:

- some of the planet signals are astrophysical (e.g., eclipsing binary) or instrumental (e.g., the Kepler "rolling bands") false positives.
- we are not sensitive to planets orbiting in all configurations via the transit method (i.e., most planets do not transit).
- our pipeline is not equally sensitive to all signal sizes, e.g. it is less sensitive to smaller planets around bigger stars.
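
Two of these considerations can be sketched with simple arithmetic. The example below uses hypothetical counts (not results from this notebook) to show a rough binomial uncertainty on the naive rate and a correction for the geometric transit probability, which is approximately $R_{\star}/a$ for a circular orbit. The value $a/R_{\star} = 8.8$ is an assumed, illustrative number for a three-day orbit around a Sun-like star. This is a minimal sketch under those assumptions, not a substitute for a full occurrence rate analysis, and it ignores false positives entirely.

```python
import math

def naive_rate(n_detected, n_stars):
    """Detected planets divided by stars searched."""
    return n_detected / n_stars

def binomial_standard_error(n_detected, n_stars):
    """Normal-approximation uncertainty on a rate; crude when counts are small."""
    rate = n_detected / n_stars
    return math.sqrt(rate * (1.0 - rate) / n_stars)

def transit_probability(a_over_rstar):
    """Approximate chance that a randomly oriented circular orbit transits."""
    return min(1.0, 1.0 / a_over_rstar)

def geometry_corrected_rate(n_detected, n_stars, a_over_rstar):
    """Scale the naive rate up to account for planets that never transit."""
    return n_detected / (n_stars * transit_probability(a_over_rstar))

# Hypothetical counts, for illustration only
n_detected, n_stars = 3, 531
a_over_rstar = 8.8  # assumed illustrative value

rate = naive_rate(n_detected, n_stars)
rate_error = binomial_standard_error(n_detected, n_stars)
corrected_rate = geometry_corrected_rate(n_detected, n_stars, a_over_rstar)

print(f"naive rate: {100 * rate:.2f}% +/- {100 * rate_error:.2f}%")
print(f"transit probability: {transit_probability(a_over_rstar):.3f}")
print(f"geometry-corrected rate: {100 * corrected_rate:.2f}%")
```

With so few detections the uncertainty is comparable to the rate itself, and the geometric correction alone changes the answer by nearly an order of magnitude.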

[aaand end the notebook! couple ways to extend: add error bars, one line of code, a little statistics? add more testing to the planet detection pipeline?]

[td: let's plot some of the light curves and discuss what we see. that will help ground the statement "we must have a lot of false positives]

## About this notebook


**Author:** Arjun Savel  
**Last updated:** Oct 2024

***
[Top of Page](#top)
<img style="float: right;" src="https://raw.githubusercontent.com/spacetelescope/notebooks/master/assets/stsci_pri_combo_mark_horizonal_white_bkgd.png" alt="Space Telescope Logo" width="200px"/> 