# Read the DESC truth tables in parquet format

**A large container must be used for this notebook or the kernel will crash.**

**Contact authors:** Jeff Carlin and Melissa Graham <br>
**Container size:** large <br>
**Last verified to run:** 2023-02-15 <br>
**Version:** w_2023_07

## 1.0. Introduction

Jim Chiang has put additional <a href="https://parquet.apache.org/">parquet</a>-format truth tables in `/project` for DP0 delegates:
 - `/project/jchiang/Run2.2i/truth/` contains
   - `SNe/truth_sn_summary_v1-0-0.parquet` (46M)
   - `SNe/truth_sn_variability_v1-0-0.parquet` (247M)
   - `stars/truth_star_summary_v1-0-0.parquet` (211M)
   - `stars/truth_star_variability_v1-0-0.parquet` (5.3G)

These truth tables have been publicly released by the DESC, but they are not part of DP0.2. They contain more detailed information about the simulated supernovae (SNe) and stars that were injected into the DP0.2 dataset. 

> **Warning: these truth tables have not been, and will not be, cross-matched to the DP0.2 DiaObject table nor available via the TAP service like other DP0.2 catalogs.** 

Attempting to use TAP or SQL on these parquet files will fail.
In particular, note that the tutorial notebook "08_Truth_Tables.ipynb" uses the `id_truth_type` column to match between DP0.2 truth catalogs, and that this column is *unique* to the DP0.2 truth catalogs.
It is *not available* in these auxiliary truth data files that come directly from DESC.

This notebook demonstrates the other methods available for users to cross-match their DP0.2 objects of interest with these truth tables.

The **schema** for these tables can be found in the DESC's DC2 Data Release note (<a href="https://arxiv.org/pdf/2101.04855.pdf">arXiv:2101.04855</a>),
in tables B.3 (truth star summary), B.4 (truth SN summary), B.6 (truth star variability), and B.7 (truth SN variability).

> **Warning: the truth star variability file, at 5.3 GB, is too large to be read in full; attempting to do so will crash the kernel.**

For the large file of truth star variability, users have the option of the `pyarrow` or `dask` packages for retrieving variability data for a single star, and then converting it into a `pandas` dataframe.
As demonstrated in Section 2, it takes roughly 10 to 20 seconds to retrieve the full simulated light curve for a single star.
Use the truth star variability file with care to avoid crashing the kernel.

The three files that are <250 MB are small enough to be read in full.
The `pandas` package can be used to read the entire table into a dataframe, as demonstrated below.
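
One way to guard against accidentally loading the largest file is to check a file's size on disk before deciding how to read it. The sketch below is illustrative only: the helper name `small_enough_to_read_in_full` and the 250 MB threshold are assumptions based on the file sizes listed above, and on-disk size is only a rough proxy for in-memory size because parquet files are compressed.

```python
import os
import tempfile

# Hypothetical threshold, based on the "<250 MB" files listed above.
MAX_FULL_READ_BYTES = 250 * 1024**2


def small_enough_to_read_in_full(path, max_bytes=MAX_FULL_READ_BYTES):
    """Return True if the file at `path` is no larger than `max_bytes` on disk."""
    return os.path.getsize(path) <= max_bytes


# Demonstrate with a small temporary file standing in for a parquet file.
with tempfile.NamedTemporaryFile(suffix='.parquet', delete=False) as tmp:
    tmp.write(b'\0' * 1024)
    demo_path = tmp.name

demo_small = small_enough_to_read_in_full(demo_path)
demo_large = small_enough_to_read_in_full(demo_path, max_bytes=512)
print(demo_small, demo_large)
os.remove(demo_path)
```

Applied to the truth star variability file, a check like this would be expected to return `False`, which points to the filtered reads demonstrated in Section 2.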

Rerunning cells multiple times might cause a kernel error in this notebook, given the potentially large data volumes.
If this happens, go to the "Kernel" menu item and choose "Restart kernel and clear all outputs" and try again.
If issues persist, exit the RSP and log back in with a large container.

### 1.1. Import packages

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import time, gc
import pandas as pd
import pyarrow.parquet as pq
import dask.dataframe as dd

## 2.0. Stars

Define the file names of the star truth data.

In [6]:
pfnm_star_sum = '/project/jchiang/Run2.2i/truth/stars/truth_star_summary_v1-0-0.parquet'
pfnm_star_var = '/project/jchiang/Run2.2i/truth/stars/truth_star_variability_v1-0-0.parquet'

### 2.1. Read the full summary table with `pandas`

The star summary file, at 211 MB, is small enough to be read in entirely with `pandas`, as done below.

However, attempting to use `pd.read_parquet` with the star variability file (5.3 GB) will crash the kernel.

In [7]:
result_star_sum = pd.read_parquet(pfnm_star_sum)

**Option** to view the star summary table.

In [8]:
# result_star_sum

### 2.2. Identify a single true star of interest

Use `numpy.unique` to figure out how many different variable star types there are.

In [9]:
unique_models, counts_models = np.unique(result_star_sum['model'], return_counts=True)
for u in range(len(unique_models)):
    print(u, unique_models[u], counts_models[u])

0 MLT 2361402
1 None 126709
2 applyRRly 211
3 kplr 1744538


As described in the DESC DC2 Data Release Note, these types include:
1. `applyRRly`: periodic variables (RR Lyrae and Cepheids)
2. `MLT`: non-periodic transients/variables such as microlensing events, flaring M-dwarfs, cataclysmic variables, etc.
3. `kplr`: stars with no definitive variability class, whose variability is modeled after Kepler lightcurves
4. `None`: non-variable stars

Use `tx` to index all 211 of the RR Lyrae stars. Print the `id`, `ra`, and `dec` of a random RR Lyrae.

In [10]:
tx = np.where(result_star_sum['model'][:] == 'applyRRly')[0]
ri = np.random.choice(tx, size=1)
print(result_star_sum['id'][ri[0]], \
      result_star_sum['ra'][ri[0]], \
      result_star_sum['dec'][ri[0]])
del tx, ri

696465 54.4463796 -39.1503306
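
Because `np.random.choice` is unseeded in the cell above, the star that is printed changes from run to run. A minimal sketch of a reproducible selection is given below. It uses small synthetic arrays as stand-ins for the `id` and `model` columns, and the seed value is arbitrary.

```python
import numpy as np

# Synthetic stand-ins for the 'id' and 'model' columns of the summary table.
ids = np.array([101, 102, 103, 104, 105, 106])
models = np.array(['MLT', 'applyRRly', 'kplr', 'applyRRly', 'None', 'applyRRly'])

# Indices of all rows with the periodic-variable model.
rr_index = np.flatnonzero(models == 'applyRRly')

# A seeded generator returns the same "random" row on every run.
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
pick = rng.choice(rr_index)
print(ids[pick], models[pick])

# Re-creating the generator with the same seed reproduces the choice.
pick_again = np.random.default_rng(seed=42).choice(rr_index)
```

With a fixed seed, anyone rerunning the notebook would be looking at the same example star.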


If all we had was an RA and Dec that we got by, for example, identifying a `DiaObject` that we thought might be an RR Lyrae, we could find the `id` in the star summary table.

For this example, use RA = 72.5850633 and Dec = -44.6386746.

In [11]:
my_star_ra = 72.5850633
my_star_dec = -44.6386746

In [12]:
# Box cut of +/- 2 arcseconds in each coordinate. The RA difference is not
# scaled by cos(Dec), so the box is narrower than 2 arcseconds on the sky in RA.
tx = np.where((np.abs(result_star_sum['ra'] - my_star_ra) < 2.0/3600.)
              & (np.abs(result_star_sum['dec'] - my_star_dec) < 2.0/3600.))[0]
if len(tx) == 1:
    print('Unique match identified within 2 arcseconds.')
    print(result_star_sum['id'][tx[0]],
          result_star_sum['ra'][tx[0]],
          result_star_sum['dec'][tx[0]])
else:
    print('Number of matches: ', len(tx))
del tx

Unique match identified within 2 arcseconds.
836896 72.5850633 -44.6386746
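
The match above compares RA and Dec differences directly in degrees, which is a box cut rather than a true angular separation. At Dec = -44.6 degrees, a given RA difference corresponds to only about 0.71 times that angle on the sky. The following sketch shows one possible alternative using the haversine formula with `numpy`. The function name `angular_separation_arcsec` and the synthetic coordinate arrays are illustrative assumptions.

```python
import numpy as np


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; all inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which is numerically stable at small separations.
    sin_ddec = np.sin((dec2 - dec1) / 2.0)
    sin_dra = np.sin((ra2 - ra1) / 2.0)
    a = sin_ddec**2 + np.cos(dec1) * np.cos(dec2) * sin_dra**2
    return np.degrees(2.0 * np.arcsin(np.sqrt(a))) * 3600.0


# Synthetic stand-ins for the 'ra' and 'dec' columns (degrees).
cat_ra = np.array([72.5850633, 72.5857633, 54.4463796])
cat_dec = np.array([-44.6386746, -44.6386746, -39.1503306])

target_ra, target_dec = 72.5850633, -44.6386746
sep = angular_separation_arcsec(cat_ra, cat_dec, target_ra, target_dec)
matches = np.flatnonzero(sep < 2.0)
print(matches, sep[matches])
```

In this synthetic example the second source is offset by about 2.5 arcseconds in the RA coordinate, so the box cut would reject it, yet its separation on the sky is under 2 arcseconds.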


Clean up.

In [13]:
del result_star_sum, my_star_ra, my_star_dec
gc.collect()

996

### 2.3. Use `pyarrow` to retrieve the true light curve

Use `pyarrow` to retrieve the true variability (true light curve) for this one RR Lyrae of interest, with `id` = 836896.

Read the parquet table and only retrieve rows where `id` = 836896. This takes about 10 seconds.

In [None]:
result = pq.read_table(pfnm_star_var, use_threads=False,
                         filters=[('id', '==', 836896)])

Convert the result to a `pandas` dataframe, `df`. This takes <1 second.

In [3]:
df = result.to_pandas()

**Option** to show the dataframe.

In [2]:
df

Plot the g, r, and i-band `delta_flux` values.

In [None]:
gx = df['bandpass'][:] == 'g'
rx = df['bandpass'][:] == 'r'
ix = df['bandpass'][:] == 'i'
plt.plot(df['MJD'][gx], df['delta_flux'][gx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkgreen')
plt.plot(df['MJD'][rx], df['delta_flux'][rx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkorange')
plt.plot(df['MJD'][ix], df['delta_flux'][ix], \
         'o', ms=3, mew=0, alpha=0.3, color='black')
plt.xlabel('MJD (days)')
plt.ylabel(r'delta$\_$flux (nJy)')
plt.show()
del gx, rx, ix

In [None]:
del result, df
gc.collect()
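
The same three-mask pattern recurs in every plotting cell of this notebook. One possible way to reduce the repetition is to loop over a band-to-color mapping, sketched below with `numpy` on synthetic data. The helper name `split_by_band` and the `BAND_COLORS` dictionary are assumptions introduced for illustration; the colors match those used in the plots.

```python
import numpy as np

# Band-to-color mapping matching the plots in this notebook.
BAND_COLORS = {'g': 'darkgreen', 'r': 'darkorange', 'i': 'black'}


def split_by_band(mjd, bandpass, delta_flux, bands=BAND_COLORS):
    """Return {band: (mjd, delta_flux)} for each requested band."""
    mjd = np.asarray(mjd)
    bandpass = np.asarray(bandpass)
    delta_flux = np.asarray(delta_flux)
    per_band = {}
    for band in bands:
        mask = bandpass == band
        per_band[band] = (mjd[mask], delta_flux[mask])
    return per_band


# Synthetic stand-in for the 'MJD', 'bandpass', and 'delta_flux' columns.
demo_mjd = [59580.1, 59581.2, 59582.3, 59583.4, 59584.5]
demo_band = ['g', 'r', 'g', 'i', 'z']
demo_flux = [12.5, -3.0, 4.2, 8.1, 0.3]

curves = split_by_band(demo_mjd, demo_band, demo_flux)
for band, (t, f) in curves.items():
    print(band, t, f)
    # In the notebook, each band could then be drawn with, e.g.:
    # plt.plot(t, f, 'o', ms=3, mew=0, alpha=0.3, color=BAND_COLORS[band])
```

With a dataframe, the call would be `split_by_band(df['MJD'], df['bandpass'], df['delta_flux'])`.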

### 2.4. Use `dask` to retrieve the true light curve

Use `dask` to retrieve the true variability (true light curve) for this one RR Lyrae of interest, with `id` = 836896.

**Option** to read the parquet file and view what `dd.read_parquet` returns. Note that it is the structure of the dataframe, NOT a dataframe filled with values.

In [None]:
# result = dd.read_parquet(pfnm_star_var)

In [None]:
# result

In [None]:
# del result

Read the parquet table and only retrieve rows where `id` = 836896. This takes <1 second because `dask` evaluates lazily: no data are read until `compute` is called.

In [None]:
result = dd.read_parquet(pfnm_star_var, filters = [('id', '==', 836896)])

Convert the result into a `pandas` dataframe. This takes up to 15 seconds.

In [None]:
df = result.compute()
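
The run times quoted in this notebook are approximate and will vary with system load. A small timing helper, sketched below with the standard library only, is one way to measure them directly. The `timer` name and its `results` argument are illustrative assumptions, and `time.sleep` stands in for the real call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label, results=None):
    """Print, and optionally record, the wall-clock time of a code block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f'{label}: {elapsed:.2f} s')


timings = {}
with timer('sleep demo', results=timings):
    time.sleep(0.05)  # stand-in for a call such as result.compute()
```

In the notebook, wrapping `df = result.compute()` in `with timer('dask compute'):` would report how long the read actually took.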

**Option** to show the `pandas` dataframe.

In [None]:
# df

Plot the g, r, and i-band `delta_flux` values.

In [None]:
gx = df['bandpass'][:] == 'g'
rx = df['bandpass'][:] == 'r'
ix = df['bandpass'][:] == 'i'
plt.plot(df['MJD'][gx], df['delta_flux'][gx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkgreen')
plt.plot(df['MJD'][rx], df['delta_flux'][rx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkorange')
plt.plot(df['MJD'][ix], df['delta_flux'][ix], \
         'o', ms=3, mew=0, alpha=0.3, color='black')
plt.xlabel('MJD (days)')
plt.ylabel(r'delta$\_$flux (nJy)')
plt.show()
del gx, rx, ix

In [None]:
del result, df
gc.collect()

## 3.0. Supernovae

Define the file names of the SN truth data.

In [None]:
pfnm_sn_sum = '/project/jchiang/Run2.2i/truth/SNe/truth_sn_summary_v1-0-0.parquet'
pfnm_sn_var = '/project/jchiang/Run2.2i/truth/SNe/truth_sn_variability_v1-0-0.parquet'

### 3.1. Read the full parquet files

Read the full SN summary table.

In [None]:
result_sn_sum = pd.read_parquet(pfnm_sn_sum)

**Option** to show the SN summary table contents.

In [None]:
# result_sn_sum

Read the full SN variability table.

In [None]:
result_sn_var = pd.read_parquet(pfnm_sn_var)

**Option** to show the SN variability table contents.

In [None]:
# result_sn_var

Plot the g, r, and i-band `delta_flux` values (the light curve) for the true SN with `id` = 10816000752662.

In [None]:
%%time
gx = (result_sn_var['id'][:] == 10816000752662) & (result_sn_var['bandpass'][:] == 'g')
rx = (result_sn_var['id'][:] == 10816000752662) & (result_sn_var['bandpass'][:] == 'r')
ix = (result_sn_var['id'][:] == 10816000752662) & (result_sn_var['bandpass'][:] == 'i')
plt.plot(result_sn_var['MJD'][gx], result_sn_var['delta_flux'][gx], \
         'o', ms=13, mew=0, alpha=0.3, color='darkgreen')
plt.plot(result_sn_var['MJD'][rx], result_sn_var['delta_flux'][rx], \
         'o', ms=13, mew=0, alpha=0.3, color='darkorange')
plt.plot(result_sn_var['MJD'][ix], result_sn_var['delta_flux'][ix], \
         'o', ms=13, mew=0, alpha=0.3, color='black')
plt.xlabel('MJD (days)')
plt.ylabel(r'delta$\_$flux (nJy)')
plt.show()
del gx, rx, ix

Clean up.

In [None]:
del result_sn_sum, result_sn_var
gc.collect()

### 3.2. Use `dask` to retrieve a true SN light curve

Just because we _can_ read in the full SN variability parquet file doesn't mean we should, if all we want is the light curve for one SN of interest.

In this case, the "SN of interest" is chosen to be the SN with `id` = 10816000752662. 

See Section 2.2 for a demonstration of how to do a simple spatial cross-match if the RA and Dec of the object of interest are known instead of the `id`.

The following uses `dask`, as done in Section 2.4 for a variable star light curve, but users could use `pyarrow` in the same way as Section 2.3 if preferred.

In [None]:
result = dd.read_parquet(pfnm_sn_var, filters = [('id', '==', 10816000752662)])

In [None]:
df = result.compute()

**Option** to view the dataframe.

In [None]:
# df

Plot the g, r, and i-band `delta_flux` values (the light curve) for the true SN with `id` = 10816000752662.

In [None]:
gx = df['bandpass'][:] == 'g'
rx = df['bandpass'][:] == 'r'
ix = df['bandpass'][:] == 'i'
plt.plot(df['MJD'][gx], df['delta_flux'][gx], \
         'o', ms=13, mew=0, alpha=0.3, color='darkgreen')
plt.plot(df['MJD'][rx], df['delta_flux'][rx], \
         'o', ms=13, mew=0, alpha=0.3, color='darkorange')
plt.plot(df['MJD'][ix], df['delta_flux'][ix], \
         'o', ms=13, mew=0, alpha=0.3, color='black')
plt.xlabel('MJD (days)')
plt.ylabel(r'delta$\_$flux (nJy)')
plt.show()
del gx, rx, ix