# Classification for Stars vs. Non-Stellar Objects 

#### In this notebook, we explore the RefCat data in order to analyze it and extract training data sets with **labels of stars**.

#### Some Issues:
- Imbalanced partitions, because the Parquet files come from HEALPix-indexed astropy tables. We need to save them again as a single new PySpark DataFrame (this is done in this notebook).
- Huge data size. Many in-memory operations run into trouble, so only disk-to-disk operations work reliably.

## Import Basic Packages 

In [1]:
import numpy as np
import pandas as pd
import glob
import sys
import h5py
#from netCDF4 import Dataset
from datetime import datetime
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree

import pyarrow as pa
import pyarrow.parquet as pq

from functools import reduce
import operator
import gc

# Show up to 300 rows and up to 200 characters per column
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_colwidth', 200)

In [2]:
import os

from astropy.table import Table
from matplotlib.ticker import MultipleLocator

from astropy.utils.exceptions import AstropyWarning
import warnings
warnings.simplefilter('ignore', category=AstropyWarning)

In [3]:
# plot settings
#plt.rc('font', family='serif') 
#plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

## PySpark Session

In [4]:
%%time
# PySpark packages
from pyspark import SparkContext   
from pyspark.sql import SparkSession

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W


spark = SparkSession.builder \
    .master("yarn") \
    .appName("spark-shell") \
    .config("spark.driver.maxResultSize", "32g") \
    .config("spark.driver.memory", "32g") \
    .config("spark.executor.memory", "14g") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.instances", "100") \
    .config("spark.sql.hive.filesourcePartitionFileCacheSize", "2097152000") \
    .getOrCreate()


sc = spark.sparkContext
sc.setCheckpointDir("hdfs://spark00:54310/tmp/checkpoints")

spark.conf.set("spark.sql.debug.maxToStringFields", 500)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

CPU times: user 13.5 ms, sys: 6.17 ms, total: 19.7 ms
Wall time: 27.9 s


- Acquiring resources from the YARN cluster takes time (about 28 s of wall time here).
- **High-memory configuration** for handling RefCat
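
As a rough check on what the settings above request from YARN, the arithmetic can be sketched as follows. This is only an estimate under one assumption: `spark.executor.memoryOverhead` is left at Spark's default of max(384 MiB, 10% of executor memory), since the session does not set it. YARN may also round each container up to its allocation increment, which is not modeled here. The helper name `container_gib` is hypothetical.

```python
# Rough YARN memory footprint implied by the executor settings above.
# Assumption: spark.executor.memoryOverhead keeps its default,
# max(384 MiB, 10% of executor memory); the notebook does not set it.

EXECUTOR_MEMORY_GIB = 14    # spark.executor.memory
EXECUTOR_INSTANCES = 100    # spark.executor.instances


def container_gib(heap_gib, overhead_factor=0.10, min_overhead_mib=384):
    """Heap plus default memory overhead for one executor container, in GiB."""
    overhead_gib = max(min_overhead_mib / 1024, heap_gib * overhead_factor)
    return heap_gib + overhead_gib


per_executor = container_gib(EXECUTOR_MEMORY_GIB)
total_executors = per_executor * EXECUTOR_INSTANCES

print(f"per executor container: {per_executor:.1f} GiB")
print(f"all executors:          {total_executors:.0f} GiB")
```

Under that assumption the 100 executors together ask for roughly 1.5 TiB of cluster memory, which is why acquiring the resources is not instantaneous.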

## Reading Reference Catalog Files

- Read the converted Spark DataFrame stored in the **Parquet format**

In [5]:
hdfsheader = 'hdfs://spark00:54310'
workpath = '/user/shong/work/sedfit/spherex/data/temp/'
datapath = '/user/shong/data/spherex/star-classification/input-ref-cat/'

#### Check the catalog files

In [6]:
outlist = !hadoop fs -ls {hdfsheader+datapath}

In [7]:
len(outlist)

12289

In [8]:
outlist[0]

'Found 12288 items'

In [9]:
outlist[12288]

'drwxr-xr-x   - shong supergroup          0 2024-05-11 14:45 hdfs://spark00:54310/user/shong/data/spherex/star-classification/input-ref-cat/Gaia_DR3.LS.PS1DR1.CatWISE.AllWISE.2MASS_NSIDE32_012287.parquet.snappy'
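
The listing has 12289 lines because the first line is the `Found 12288 items` header, followed by one line per catalog entry. A minimal sketch of separating the two is shown below. The function name `parse_hdfs_ls` is hypothetical, and the code assumes the usual eight-column `hadoop fs -ls` layout seen in the output above.

```python
def parse_hdfs_ls(lines):
    """Return (reported_count, paths) from `hadoop fs -ls` output lines."""
    reported = None
    paths = []
    for line in lines:
        # Columns: permissions, replication, owner, group, size, date, time, path.
        # Splitting at most 7 times keeps a path containing spaces in one piece.
        fields = line.split(None, 7)
        if not fields:
            continue
        if fields[0] == "Found" and fields[-1] == "items":
            reported = int(fields[1])
        elif len(fields) == 8:
            paths.append(fields[7])
    return reported, paths


# Two lines copied from the listing above: the header and the last entry.
sample = [
    "Found 12288 items",
    "drwxr-xr-x   - shong supergroup          0 2024-05-11 14:45 "
    "hdfs://spark00:54310/user/shong/data/spherex/star-classification/"
    "input-ref-cat/Gaia_DR3.LS.PS1DR1.CatWISE.AllWISE.2MASS_NSIDE32_012287.parquet.snappy",
]
reported, paths = parse_hdfs_ls(sample)
print(reported, len(paths))
```

In the notebook, `parse_hdfs_ls(outlist)` could be used the same way to check that the number of parsed paths matches the reported count.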

#### Read *only* one catalog file

> The full catalog is too big for in-memory calculations. This is rare, but it is the case here. Hence, to check the schema and display a few rows, we use this one-file sample and avoid loading all the data into memory.

In [10]:
# Parquet files store their own schema, so the CSV-style "header" option is not needed
onedf = spark.read \
.parquet(hdfsheader+datapath+'Gaia_DR3.LS.PS1DR1.CatWISE.AllWISE.2MASS_NSIDE32_012287.parquet.snappy')

In [11]:
onedf.printSchema()

root
 |-- SPHERExRefID: long (nullable = true)
 |-- Gaia_DR3_source_id: long (nullable = true)
 |-- LegacySurvey_uid: long (nullable = true)
 |-- PS1_DR1_StackObject_objID: long (nullable = true)
 |-- CatWISE_source_id: string (nullable = true)
 |-- AllWISE_designation: string (nullable = true)
 |-- 2MASS_designation: string (nullable = true)
 |-- ra: double (nullable = true)
 |-- dec: double (nullable = true)
 |-- ra_error: double (nullable = true)
 |-- dec_error: double (nullable = true)
 |-- coord_src: long (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: double (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: double (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: double (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- astrometric_params_solved: short (nullable = true)
 |-- CatWISE_PMRA: double (nullable = true)
 |-- CatWISE_PMDec: double (nullable = true)
 |-- CatWISE_sigPMRA: double (nullab

In [12]:
%%time
onedf.limit(4).toPandas().T

CPU times: user 10.8 ms, sys: 6.62 ms, total: 17.4 ms
Wall time: 3.39 s


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| SPHERExRefID | 1789529881403457536 | 1789529881403457538 | 1789529881403457539 | 1789529881403457540 |
| Gaia_DR3_source_id | -9999 | -9999 | -9999 | -9999 |
| LegacySurvey_uid | 1374849096224393 | 1374849096224300 | 1374849096224207 | 1374849096224425 |
| PS1_DR1_StackObject_objID | -9999 | -9999 | -9999 | -9999 |
| CatWISE_source_id | | | | |
| AllWISE_designation | | | | |
| 2MASS_designation | | | | |
| ra | 314.849397 | 314.846476 | 314.844225 | 314.850308 |
| dec | -1.993202 | -1.99189 | -1.991788 | -1.990858 |
| ra_error | 0.000026 | 0.000034 | 0.000013 | 0.00001 |


In [13]:
len(onedf.columns)

101

> The total number of data columns is **101**. 

#### Now read all reference catalog files into a single DataFrame

In [14]:
%%time
# "recursiveFileLookup" is needed because each catalog entry is itself a directory
rawdf = spark.read.option("recursiveFileLookup","true").parquet(hdfsheader+datapath)

CPU times: user 2.74 ms, sys: 0 ns, total: 2.74 ms
Wall time: 7.47 s


In [15]:
rawdf = rawdf.repartition(2000)

> Our current Parquet files are direct conversions from HEALPix astropy tables, which produced badly imbalanced partitions. We need to repartition the data and save it again.
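
One way to quantify the imbalance mentioned above is to compare the largest partition with the mean partition size. The sketch below is illustrative only: the helper names `skew_ratio` and `rows_per_partition` are hypothetical, and the row counts in the example are made up rather than measured on RefCat. The Spark helper uses `spark_partition_id()` from `pyspark.sql.functions`; note that it triggers a full scan, and that empty partitions do not appear in its result.

```python
def skew_ratio(row_counts):
    """Largest partition size divided by the mean partition size (1.0 = balanced)."""
    counts = list(row_counts)
    if not counts or sum(counts) == 0:
        raise ValueError("need at least one non-empty partition")
    mean = sum(counts) / len(counts)
    return max(counts) / mean


def rows_per_partition(df):
    """Row count of each non-empty partition of a PySpark DataFrame (runs a Spark job)."""
    # Imported lazily so that skew_ratio stays usable without a Spark installation.
    import pyspark.sql.functions as F

    counted = df.groupBy(F.spark_partition_id().alias("pid")).count()
    return [row["count"] for row in counted.collect()]


# Made-up counts: one oversized partition versus an even split of the same rows.
before = skew_ratio([10, 10, 10, 70])
after = skew_ratio([25, 25, 25, 25])
print(before, after)
```

On the cluster, `skew_ratio(rows_per_partition(rawdf))` could be evaluated before and after `repartition(2000)`; a value close to 1 indicates evenly sized partitions.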

In [16]:
rawdf.cache()

DataFrame[SPHERExRefID: bigint, Gaia_DR3_source_id: bigint, LegacySurvey_uid: bigint, PS1_DR1_StackObject_objID: bigint, CatWISE_source_id: string, AllWISE_designation: string, 2MASS_designation: string, ra: double, dec: double, ra_error: double, dec_error: double, coord_src: bigint, pmra: double, pmra_error: double, pmdec: double, pmdec_error: double, parallax: double, parallax_error: double, ref_epoch: double, astrometric_params_solved: smallint, CatWISE_PMRA: double, CatWISE_PMDec: double, CatWISE_sigPMRA: double, CatWISE_sigPMDec: double, Gaia_G: double, Gaia_BP: double, Gaia_RP: double, Gaia_G_error: double, Gaia_BP_error: double, Gaia_RP_error: double, LS_g: double, LS_r: double, LS_z: double, LS_g_error: double, LS_r_error: double, LS_z_error: double, PS1_g: double, PS1_r: double, PS1_i: double, PS1_z: double, PS1_y: double, PS1_g_error: double, PS1_r_error: double, PS1_i_error: double, PS1_z_error: double, PS1_y_error: double, 2MASS_J: double, 2MASS_H: double, 2MASS_Ks: double,

### Save the repartitioned DataFrame

In [17]:
outdatapath = '/user/shong/data/spherex/star-classification/reduced-data/'

In [18]:
hdfsheader+outdatapath

'hdfs://spark00:54310/user/shong/data/spherex/star-classification/reduced-data/'

In [19]:
hdfsheader+outdatapath+'RefCat-Repartitioned.parquet.snappy'

'hdfs://spark00:54310/user/shong/data/spherex/star-classification/reduced-data/RefCat-Repartitioned.parquet.snappy'

In [20]:
%%time
# Write Parquet explicitly instead of relying on the default output format of save()
rawdf.write.option("compression", "snappy").mode("overwrite") \
    .parquet(hdfsheader+outdatapath+'RefCat-Repartitioned.parquet.snappy')

CPU times: user 739 ms, sys: 573 ms, total: 1.31 s
Wall time: 5h 37min 59s
