# Explore how to read the new DESC truth tables in parquet format

Jim Chiang put the enhanced truth tables in `/project` for us:
 - `/project/jchiang/Run2.2i/truth/` contains
   - `SNe/truth_sn_summary_v1-0-0.parquet`
   - `SNe/truth_sn_variability_v1-0-0.parquet`
   - `stars/truth_star_summary_v1-0-0.parquet`
   - `stars/truth_star_variability_v1-0-0.parquet`



Set up.

In [1]:
import numpy as np
import time

## GCRCatalogs -- couldn't get it set up

Also not sure the db files are GCRCatalog-accessible...

In [2]:
import GCRCatalogs
from GCRCatalogs.helpers.tract_catalogs import tract_filter, sample_filter
from GCRCatalogs import GCRQuery

In [4]:
GCRCatalogs.get_root_dir()

'/project/jchiang/Run2.2i/truth/'

In [5]:
GCRCatalogs.get_public_catalog_names()

['desc_cosmodc2',
 'desc_dc2_run2.2i_dr6_object',
 'desc_dc2_run2.2i_dr6_object_with_truth_match',
 'desc_dc2_run2.2i_dr6_truth',
 'desc_dc2_run2.2i_truth_galaxy_summary',
 'desc_dc2_run2.2i_truth_sn_summary',
 'desc_dc2_run2.2i_truth_sn_variability',
 'desc_dc2_run2.2i_truth_star_summary',
 'desc_dc2_run2.2i_truth_star_variability']

In [None]:
# load_catalog() expects a registered catalog name (as listed above), not a file name.
# obj_cat = GCRCatalogs.load_catalog("desc_dc2_run2.2i_truth_sn_summary")

## Spark -- couldn't get it set up

`Spark`, and in particular `pyspark`, can be used to apply SQL queries directly to parquet tables.
 - https://spark.apache.org/docs/latest/sql-getting-started.html
 - https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
 
However, while `pyspark` is easy enough to `pip install`, it also needs a Java runtime and some additional configuration before it will run in a notebook. E.g.,
 - https://sparkbyexamples.com/pyspark/install-pyspark-in-anaconda-jupyter-notebook/
 - https://opensource.com/article/18/11/pyspark-jupyter-notebook
 
It was the Java installation that seemed too much to expect of users.

The error was that `JAVA_HOME` was undefined.

In [None]:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName('Ops').getOrCreate()

## Try with pandas

The SN table is small enough to read in whole.

Jeff showed in his notebook that reading even one column of the variable star table takes 15 min.

In [None]:
import pandas as pd

In [None]:
pfnm = '/project/jchiang/Run2.2i/truth/SNe/truth_sn_summary_v1-0-0.parquet'

In [None]:
result = pd.read_parquet(pfnm)

In [None]:
result

In [None]:
tx = np.where((result['t0'] > 60770) & (result['t0'] < 60780))[0]

In [None]:
print(len(tx))

In [None]:
del result, tx

## Try with pyarrow

In [None]:
import pyarrow.parquet as pq

In [None]:
result = pq.read_table(pfnm, columns=['ra', 'dec']).to_pandas()

In [None]:
result

In [None]:
del result

The above works fine because the SN table is short. The same call on the much larger star variability table, even for a single column, is not practical.

In [None]:
pfnm = '/project/jchiang/Run2.2i/truth/stars/truth_star_variability_v1-0-0.parquet'

In [None]:
%%time
result = pq.read_table(pfnm, columns=['id']).to_pandas()