# Extended DESC truth tables in parquet format

Contact authors: Jeff Carlin and Melissa Graham

Last verified to run on 2022-10-27 with Weekly 40.

Container size: large

## 1.0. Introduction

Jim Chiang has put additional truth tables in `/project` for DP0 delegates:
 - `/project/jchiang/Run2.2i/truth/` contains
   - `SNe/truth_sn_summary_v1-0-0.parquet` (46M)
   - `SNe/truth_sn_variability_v1-0-0.parquet` (247M)
   - `stars/truth_star_summary_v1-0-0.parquet` (211M)
   - `stars/truth_star_variability_v1-0-0.parquet` (5.3G)

This notebook demonstrates how to retrieve data from these files.

> **Warning:** the truth star variability file, at 5.3 G, is too large to be read in full. As demonstrated in Section 2, it takes about 15-20 seconds to retrieve the full simulated light curve for a single star. 
Use the `truth_star_variability` file with care to **avoid crashing the kernel**.

For files small enough to be read in full, use the `pandas` package to read each one into a dataframe.

For the large file of truth star variability, users have the option of using `pyarrow` (Section 2.2) or `dask` (Section 2.3) for retrieving variability data for a single star, and converting it into a `pandas` dataframe.

> **Notice:** these truth tables have not been (and will not be) cross-matched to the DP0.2 DiaObject table, but this notebook demonstrates how users can cross-match their objects of interest with these truth tables.

### 1.1. Import packages.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import time, gc
import pandas as pd
import pyarrow.parquet as pq
import dask.dataframe as dd

## 2.0. Stars

In [None]:
pfnm_star_sum = '/project/jchiang/Run2.2i/truth/stars/truth_star_summary_v1-0-0.parquet'
pfnm_star_var = '/project/jchiang/Run2.2i/truth/stars/truth_star_variability_v1-0-0.parquet'

The star summary file, at 211 M, is small enough to be read in entirely with `pandas`.

Attempting to use `pd.read_parquet` with the star variability file (5.3 G) will crash the kernel.

In [None]:
%%time
result_star_sum = pd.read_parquet(pfnm_star_sum)

**Option** to view the star summary table.

In [None]:
# result_star_sum

### 2.1. Identify a single true star of interest

Use `numpy.unique` to figure out how many different variable star types there are.

In [None]:
unique_models, counts_models = np.unique(result_star_sum['model'], return_counts=True)
for u in range(len(unique_models)):
    print(u, unique_models[u], counts_models[u])

Use `tx` to index all 211 of the RR Lyrae stars. Print the `id`, `ra`, and `dec` of a random RR Lyrae.

In [None]:
tx = np.where(result_star_sum['model'][:] == 'applyRRly')[0]
ri = np.random.choice(tx, size=1)
print(result_star_sum['id'][ri[0]], \
      result_star_sum['ra'][ri[0]], \
      result_star_sum['dec'][ri[0]])
del tx, ri

If all we had was an RA and Dec that we got by, for example, identifying a `DiaObject` that we thought might be an RR Lyrae, we could find the `id` in the star summary table.

For this example, use RA = 72.5850633 and Dec = -44.6386746.

In [None]:
my_star_ra = 72.5850633
my_star_dec = -44.6386746

In [None]:
tx = np.where((np.abs(result_star_sum['ra'] - my_star_ra) < 2.0/3600.)
              & (np.abs(result_star_sum['dec'] - my_star_dec) < 2.0/3600.))[0]
if len(tx) == 1:
    print('Unique match identified within 2 arcseconds.')
    print(result_star_sum['id'][tx[0]],
          result_star_sum['ra'][tx[0]],
          result_star_sum['dec'][tx[0]])
else:
    print('Number of matches: ', len(tx))
del tx
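The match above is a box in RA and Dec coordinates, so at Dec near -44.6 degrees a 2 arcsecond offset in the RA coordinate corresponds to only about 1.4 arcseconds on the sky. As a hedged alternative, the sketch below matches on true angular separation using the haversine formula. The helper names `angular_separation_arcsec` and `match_within` are hypothetical, and the three catalog positions are made-up offsets from the example coordinates, not rows from the truth table.

```python
import numpy as np


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine) in arcseconds; inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = np.sin((dec2 - dec1) / 2.0)
    sin_dra = np.sin((ra2 - ra1) / 2.0)
    a = sin_ddec**2 + np.cos(dec1) * np.cos(dec2) * sin_dra**2
    return np.degrees(2.0 * np.arcsin(np.sqrt(a))) * 3600.0


def match_within(cat_ra, cat_dec, ra0, dec0, radius_arcsec=2.0):
    """Return indices of catalog rows within radius_arcsec of (ra0, dec0)."""
    sep = angular_separation_arcsec(np.asarray(cat_ra), np.asarray(cat_dec),
                                    ra0, dec0)
    return np.where(sep < radius_arcsec)[0], sep


my_star_ra = 72.5850633
my_star_dec = -44.6386746

# Made-up catalog: 1" offset in Dec, 2.5" offset in the RA coordinate, far away.
cat_ra = np.array([my_star_ra, my_star_ra + 2.5 / 3600.0, my_star_ra + 0.01])
cat_dec = np.array([my_star_dec + 1.0 / 3600.0, my_star_dec, my_star_dec])

idx, sep = match_within(cat_ra, cat_dec, my_star_ra, my_star_dec)
print(idx, sep)
```

The second made-up row falls outside the 2 arcsecond coordinate box in RA but inside a 2 arcsecond radius on the sky, which illustrates how the two definitions differ away from the equator.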

Clean up.

In [None]:
del result_star_sum, my_star_ra, my_star_dec
gc.collect()

### 2.2. Use `pyarrow` to retrieve the true light curve

Use `pyarrow` to retrieve the true variability (true light curve) for this one RR Lyrae of interest, with `id` = 836896.

Read the parquet table and only retrieve rows where `id` = 836896. This takes about 10 seconds.

In [None]:
%%time
result = pq.read_table(pfnm_star_var, use_threads=True,
                         filters=[('id', '==', 836896)])

Convert the result to a `pandas` dataframe, `df`. This takes <1 second.

In [None]:
%%time
df = result.to_pandas()

**Option** to show the dataframe.

In [None]:
# df

Plot the g, r, and i-band `delta_flux` values.

In [None]:
gx = df['bandpass'][:] == 'g'
rx = df['bandpass'][:] == 'r'
ix = df['bandpass'][:] == 'i'
plt.plot(df['MJD'][gx], df['delta_flux'][gx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkgreen')
plt.plot(df['MJD'][rx], df['delta_flux'][rx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkorange')
plt.plot(df['MJD'][ix], df['delta_flux'][ix], \
         'o', ms=3, mew=0, alpha=0.3, color='black')
plt.show()
del gx, rx, ix
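The per-band masks used for plotting can also be built in a loop, which makes it easy to add bands or to summarize each band before plotting. A minimal sketch, assuming a dataframe with the same `MJD`, `bandpass`, and `delta_flux` columns; `split_by_band` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd


def split_by_band(df, bands=("g", "r", "i")):
    """Return {band: rows for that band, sorted by MJD}."""
    return {b: df.loc[df["bandpass"] == b].sort_values("MJD") for b in bands}


# Made-up light curve standing in for the dataframe retrieved above.
demo = pd.DataFrame({
    "MJD": [59582.0, 59580.0, 59581.0, 59580.5, 59583.0],
    "bandpass": ["g", "g", "r", "i", "i"],
    "delta_flux": [5.0, -2.0, 1.5, 0.3, -0.8],
})

by_band = split_by_band(demo)
for band, sub in by_band.items():
    print(band, len(sub), sub["delta_flux"].min(), sub["delta_flux"].max())
```

Each entry of `by_band` can be passed directly to `plt.plot(sub['MJD'], sub['delta_flux'], ...)` with a per-band color.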

In [None]:
del result, df
gc.collect()

### 2.3. Use `dask` to retrieve the true light curve

Use `dask` to retrieve the true variability (true light curve) for this one RR Lyrae of interest, with `id` = 836896.

**Option** to read the parquet file and view what `dd.read_parquet` returns. Note that it is the structure of the dataframe, NOT a dataframe filled with values.

In [None]:
# %%time
# result = dd.read_parquet(pfnm_star_var)
# result
# del result

Read the parquet table and only retrieve rows where `id` = 836896. This takes <1 second.

In [None]:
%%time
result = dd.read_parquet(pfnm_star_var, filters = [('id', '==', 836896)])

Convert the result into a `pandas` dataframe. This takes up to 15 seconds.

In [None]:
%%time
df = result.compute()

**Option** to show the `pandas` dataframe.

In [None]:
# df

Plot the g, r, and i-band `delta_flux` values.

In [None]:
gx = df['bandpass'][:] == 'g'
rx = df['bandpass'][:] == 'r'
ix = df['bandpass'][:] == 'i'
plt.plot(df['MJD'][gx], df['delta_flux'][gx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkgreen')
plt.plot(df['MJD'][rx], df['delta_flux'][rx], \
         'o', ms=3, mew=0, alpha=0.3, color='darkorange')
plt.plot(df['MJD'][ix], df['delta_flux'][ix], \
         'o', ms=3, mew=0, alpha=0.3, color='black')
plt.show()
del gx, rx, ix

In [None]:
del result, df
gc.collect()

## 3.0. Supernovae

In [3]:
pfnm_sn_sum = '/project/jchiang/Run2.2i/truth/SNe/truth_sn_summary_v1-0-0.parquet'
pfnm_sn_var = '/project/jchiang/Run2.2i/truth/SNe/truth_sn_variability_v1-0-0.parquet'

In [4]:
%%time
result_sn_sum = pd.read_parquet(pfnm_sn_sum)

CPU times: user 254 ms, sys: 218 ms, total: 472 ms
Wall time: 475 ms


In [5]:
result_sn_sum

Unnamed: 0,id_string,host_galaxy,ra,dec,redshift,c,mB,t0,x0,x1,id,av,rv,max_flux_u,max_flux_g,max_flux_r,max_flux_i,max_flux_z,max_flux_y
0,MS_10199_0,10562502111,66.115587,-40.866055,0.077278,0.035201,18.516489,60772.368515,6.173744e-04,1.477374,10816002161686,0.096224,3.1,,,,,6543.662109,70800.187500
1,MS_10199_1,10562500822,66.662435,-42.042877,0.073256,-0.086665,17.966773,63060.297448,1.027672e-03,0.576378,10816000841750,0.081817,3.1,,17351.597656,,70927.117188,55145.105469,94845.875000
2,MS_10199_2,10562500735,65.212146,-41.416473,0.068469,0.037553,19.078334,62832.166508,3.703208e-04,-0.177457,10816000752662,0.071639,3.1,2589.770752,73588.171875,39960.710938,49958.871094,35685.750000,47779.726562
3,MS_10199_3,10562502246,65.469824,-41.109646,0.080189,0.076715,19.192019,61400.008116,3.348096e-04,-1.344205,10816002299926,0.108552,3.1,10282.806641,25551.062500,72298.148438,55859.316406,12957.909180,42383.917969
4,MS_10199_5,10562504918,65.986211,-42.086708,0.115499,-0.012653,19.388848,61342.657248,2.775546e-04,0.474204,10816005036054,0.084987,3.1,1207.371338,30067.052734,25256.710938,46585.750000,31519.847656,32113.683594
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
405288,MS_9047_4204,2814669465,65.563507,-28.369243,0.970664,0.012134,24.870431,60775.352207,1.777808e-06,0.970683,2882221532182,0.102293,3.1,,,1.423070,22.788200,737.987366,657.338623
405289,MS_9047_4205,2814681850,64.084658,-27.958133,0.977860,-0.148739,24.273878,62470.479889,3.075572e-06,1.038007,2882234214422,0.129386,3.1,53.433254,41.071884,658.171875,1246.436157,1186.183594,985.751160
405290,MS_9047_4206,2814685005,65.058137,-28.552109,0.962130,0.117223,25.409439,61506.364087,1.081487e-06,0.965478,2882237445142,0.165537,3.1,0.000757,,94.811592,63.658932,40.061329,409.666016
405291,MS_9047_4207,2814668118,64.484798,-27.860924,0.974147,0.147051,25.588282,62863.183383,9.187455e-07,0.442576,2882220152854,0.136652,3.1,0.008117,2.380490,81.333008,261.303101,345.730713,320.325409


In [6]:
%%time
result_sn_var = pd.read_parquet(pfnm_sn_var)

CPU times: user 3.12 s, sys: 2.31 s, total: 5.43 s
Wall time: 4.27 s


In [7]:
result_sn_var

Unnamed: 0,id_string,obsHistID,MJD,bandpass,delta_flux,id
0,MS_10199_0,796369,60753.011084,z,1761.107666,10816002161686
1,MS_10199_0,796420,60753.039172,z,1808.408447,10816002161686
2,MS_10199_0,798153,60754.996432,z,6543.662109,10816002161686
3,MS_10199_0,799057,60755.999051,y,3875.850098,10816002161686
4,MS_10199_0,809099,60767.990715,y,70788.859375,10816002161686
...,...,...,...,...,...,...
27998609,MS_9047_4208,913914,60910.406923,r,4.555153,2882219213846
27998610,MS_9047_4208,914716,60911.313434,u,0.000000,2882219213846
27998611,MS_9047_4208,914720,60911.315222,u,0.000000,2882219213846
27998612,MS_9047_4208,914721,60911.315668,u,0.000000,2882219213846


In [8]:
del result_sn_sum, result_sn_var
gc.collect()

0
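Because both supernova tables fit in memory, a light curve for one supernova can be selected with ordinary `pandas` indexing on the shared `id` column. The following sketch assumes the column names shown in the outputs above (`id`, `t0`, `MJD`, `bandpass`, `delta_flux`); `sn_light_curve` is a hypothetical helper, and the two small dataframes hold made-up values rather than rows from the truth tables.

```python
import pandas as pd

# Made-up stand-ins for result_sn_sum and result_sn_var.
sn_sum = pd.DataFrame({
    "id": [11, 12, 13],
    "t0": [60772.4, 63060.3, 61400.0],
    "redshift": [0.077, 0.073, 0.080],
})
sn_var = pd.DataFrame({
    "id": [11, 11, 12, 11, 13],
    "MJD": [60755.0, 60753.0, 63050.0, 60768.0, 61390.0],
    "bandpass": ["z", "z", "g", "y", "r"],
    "delta_flux": [6543.7, 1761.1, 120.0, 70788.9, 55.0],
})


def sn_light_curve(var_table, sn_id):
    """Rows of the variability table for one supernova, sorted by MJD."""
    rows = var_table.loc[var_table["id"] == sn_id]
    return rows.sort_values("MJD").reset_index(drop=True)


# Pick supernovae whose peak epoch t0 falls in a chosen MJD window.
in_window = sn_sum.loc[(sn_sum["t0"] > 60770) & (sn_sum["t0"] < 60780), "id"]
lc = sn_light_curve(sn_var, in_window.iloc[0])
print(lc)
```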

## GCRCatalogs -- couldn't get it set up

Also not sure the db files are GCRCatalog-accessible...

In [None]:
import GCRCatalogs
from GCRCatalogs.helpers.tract_catalogs import tract_filter, sample_filter
from GCRCatalogs import GCRQuery

In [None]:
GCRCatalogs.get_root_dir()

In [None]:
GCRCatalogs.get_public_catalog_names()

In [None]:
# obj_cat = GCRCatalogs.load_catalog("truth_sn_summary_v1-0-0.db")

## Spark -- couldn't get it set up

`Spark`, and in particular `pyspark`, can be used to apply SQL queries directly to parquet tables.
 - https://spark.apache.org/docs/latest/sql-getting-started.html
 - https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
 
However, while `pyspark` is easy enough to `pip install`, it also requires a Java runtime and additional configuration in order to run in a notebook. E.g.,
 - https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook/
 - https://opensource.com/article/18/11/pyspark-jupyter-notebook
 
The Java installation seemed too much to expect of users.

The error encountered was that `JAVA_HOME` was undefined.

In [None]:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('Ops').getOrCreate()
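Since the failure was an undefined `JAVA_HOME`, a quick check of the environment can be made before trying to start a `SparkSession`. This is a small sketch using only the standard library; `java_status` is a hypothetical helper, and a missing `JAVA_HOME` does not by itself prove that Spark would fail in every setup.

```python
import os
import shutil


def java_status():
    """Report whether JAVA_HOME is set and whether `java` is on the PATH."""
    return {
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "java_on_path": shutil.which("java"),
    }


status = java_status()
if status["JAVA_HOME"] is None and status["java_on_path"] is None:
    print("No Java runtime found; starting a SparkSession would likely fail.")
else:
    print("Java appears to be available:", status)
```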

## Try with pandas -- varstar parquet too big

The SN table is small enough to read in whole.

Jeff showed in his notebook that reading even one column of the variable star table takes 15 min.

https://github.com/rubin-dp0/cet-dev/blob/main/JLC_slagheap/dc2_truth_parquet_exploration.ipynb

In [None]:
import pandas as pd

In [None]:
# The `t0` column used below appears in the SN summary table, so read that file here.
pfnm = '/project/jchiang/Run2.2i/truth/SNe/truth_sn_summary_v1-0-0.parquet'
# pfnm = '/project/jchiang/Run2.2i/truth/stars/truth_star_summary_v1-0-0.parquet'

In [None]:
result = pd.read_parquet(pfnm)

In [None]:
result

In [None]:
tx = np.where((result['t0'] > 60770) & (result['t0'] < 60780))[0]

In [None]:
print(len(tx))

In [None]:
del result, tx

## Try with pyarrow -- varstar parquet too big

In [None]:
import pyarrow.parquet as pq

In [None]:
result = pq.read_table(pfnm, columns=['ra', 'dec']).to_pandas()

In [None]:
result

In [None]:
del result

The above works fine because the table is small. The same approach is not feasible for the star variability file.

In [None]:
pfnm = '/project/jchiang/Run2.2i/truth/stars/truth_star_variability_v1-0-0.parquet'

In [None]:
%%time
result = pq.read_table(pfnm, columns=['id']).to_pandas()