# Classification for Stars vs. Non-Stellar Objects 


### In this notebook, we compare the `prob_star` distributions of the full Gaia catalog and the train sample, using these columns:

- gaia `source_id`
- gaia `ra` and `dec`
- three gaia classification columns: <br> 
`classprob_dsc_combmod_star` `classprob_dsc_combmod_galaxy` `classprob_dsc_combmod_quasar`


## Import Basic Packages 

In [1]:
import numpy as np
import pandas as pd
import glob
import sys
import h5py
#from netCDF4 import Dataset
from datetime import datetime
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree

import pyarrow as pa
import pyarrow.parquet as pq

from functools import reduce
import operator
import gc

# Show up to 300 rows and up to 200 characters per column
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_colwidth', 200)

In [2]:
import os

from astropy.table import Table
from matplotlib.ticker import MultipleLocator

from astropy.utils.exceptions import AstropyWarning
import warnings
warnings.simplefilter('ignore', category=AstropyWarning)

In [3]:
# plot settings
#plt.rc('font', family='serif') 
#plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

## PySpark Session

In [4]:
%%time
# PySpark packages
from pyspark import SparkContext   
from pyspark.sql import SparkSession

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W


spark = SparkSession.builder \
    .master("yarn") \
    .appName("spark-shell") \
    .config("spark.driver.maxResultSize", "32g") \
    .config("spark.driver.memory", "32g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.instances", "200") \
    .config("spark.sql.hive.filesourcePartitionFileCacheSize", "2097152000") \
    .getOrCreate()



sc = spark.sparkContext
sc.setCheckpointDir("hdfs://spark00:54310/tmp/checkpoints")

spark.conf.set("spark.sql.debug.maxToStringFields", 500)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

CPU times: user 15.2 ms, sys: 6.38 ms, total: 21.6 ms
Wall time: 32.6 s


> Most of the wall time here is spent waiting for the YARN cluster to allocate the requested resources.
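
For a rough sense of why allocation takes a while, the request in the session config can be tallied by hand. This is a back-of-the-envelope sketch, not output from the cluster. The overhead rule (the larger of 384 MiB or 10% of executor memory) is Spark's documented default for `spark.executor.memoryOverhead` and is assumed here to be unchanged on this cluster.

```python
# Values copied from the SparkSession config above.
executor_instances = 200
executor_memory_gib = 7
executor_cores = 1

# Assumption: default overhead = max(384 MiB, 10% of executor memory).
overhead_gib = max(384 / 1024, 0.10 * executor_memory_gib)

total_cores = executor_instances * executor_cores
total_heap_gib = executor_instances * executor_memory_gib
total_container_gib = executor_instances * (executor_memory_gib + overhead_gib)

print(f"executor cores requested : {total_cores}")
print(f"executor heap requested  : {total_heap_gib} GiB")
print(f"heap + assumed overhead  : {total_container_gib:.0f} GiB")
```

The 32 GB driver is requested on top of this.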

## Reading Train-Labeled Sample

In [5]:
!pwd

/home/shong/work/deeplearnings/star-classification/notebook


#### Train Sample

In [6]:
datapath = '/user/shong/data/spherex/star-classification/reduced-data/'
hdfsheader = 'hdfs://spark00:54310'
localdatapath= '/home/shong/work/deeplearnings/star-classification/data/'

In [7]:
hdfsheader+datapath+'RefCat-Train-Label.parquet.snappy'

'hdfs://spark00:54310/user/shong/data/spherex/star-classification/reduced-data/RefCat-Train-Label.parquet.snappy'

In [8]:
%%time
# Parquet files carry their own schema, so no CSV-style "header" option is needed
tdf = spark.read.parquet(hdfsheader+datapath+'RefCat-Train-Label.parquet.snappy')

CPU times: user 1.42 ms, sys: 763 µs, total: 2.19 ms
Wall time: 1.86 s


In [9]:
tdf.printSchema()

root
 |-- Gaia_DR3_source_id: long (nullable = true)
 |-- SPHERExRefID: long (nullable = true)
 |-- LegacySurvey_uid: long (nullable = true)
 |-- PS1_DR1_StackObject_objID: long (nullable = true)
 |-- CatWISE_source_id: string (nullable = true)
 |-- AllWISE_designation: string (nullable = true)
 |-- 2MASS_designation: string (nullable = true)
 |-- ra: double (nullable = true)
 |-- dec: double (nullable = true)
 |-- ra_error: double (nullable = true)
 |-- dec_error: double (nullable = true)
 |-- coord_src: long (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: double (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: double (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: double (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- astrometric_params_solved: short (nullable = true)
 |-- CatWISE_PMRA: double (nullable = true)
 |-- CatWISE_PMDec: double (nullable = true)
 |-- CatWISE_sigPMRA: double (nullab

#### This catalog still contains some Null labels

In [10]:
traindf = tdf.dropna()
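
Note that `dropna()` with no arguments drops a row if *any* column is null, not only the label columns. If the intent is to remove only the sources without Gaia labels, restricting the check with `subset` is more explicit. A minimal sketch, assuming the three `gaia_classprob_dsc_combmod_*` columns used later in this notebook are the labels of interest; `drop_unlabeled` is a hypothetical helper name.

```python
# Label columns as they appear in the train sample.
LABEL_COLS = [
    "gaia_classprob_dsc_combmod_star",
    "gaia_classprob_dsc_combmod_galaxy",
    "gaia_classprob_dsc_combmod_quasar",
]

def drop_unlabeled(sdf, label_cols=LABEL_COLS):
    """Drop rows where any label column is null; other columns may stay null.

    `sdf` is expected to be a Spark DataFrame (hypothetical helper).
    """
    return sdf.dropna(how="any", subset=list(label_cols))

# Usage in this notebook would look like:
# traindf = drop_unlabeled(tdf)
```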

In [11]:
traindf.cache()

DataFrame[Gaia_DR3_source_id: bigint, SPHERExRefID: bigint, LegacySurvey_uid: bigint, PS1_DR1_StackObject_objID: bigint, CatWISE_source_id: string, AllWISE_designation: string, 2MASS_designation: string, ra: double, dec: double, ra_error: double, dec_error: double, coord_src: bigint, pmra: double, pmra_error: double, pmdec: double, pmdec_error: double, parallax: double, parallax_error: double, ref_epoch: double, astrometric_params_solved: smallint, CatWISE_PMRA: double, CatWISE_PMDec: double, CatWISE_sigPMRA: double, CatWISE_sigPMDec: double, Gaia_G: double, Gaia_BP: double, Gaia_RP: double, Gaia_G_error: double, Gaia_BP_error: double, Gaia_RP_error: double, LS_g: double, LS_r: double, LS_z: double, LS_g_error: double, LS_r_error: double, LS_z_error: double, PS1_g: double, PS1_r: double, PS1_i: double, PS1_z: double, PS1_y: double, PS1_g_error: double, PS1_r_error: double, PS1_i_error: double, PS1_z_error: double, PS1_y_error: double, 2MASS_J: double, 2MASS_H: double, 2MASS_Ks: double,

In [12]:
%%time
traindf.select(['ra','dec','gaia_classprob_dsc_combmod_star', \
                'gaia_classprob_dsc_combmod_galaxy','gaia_classprob_dsc_combmod_quasar']).describe().toPandas().T

CPU times: user 10.9 ms, sys: 3.03 ms, total: 13.9 ms
Wall time: 20.3 s


| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| ra | 97499546 | 197.88838525600755 | 91.75315953189548 | 1.6327128351173464E-6 | 359.9999806544865 |
| dec | 97499546 | 15.051231911555828 | 27.717222646361073 | -31.382798082634444 | 84.77074678823259 |
| gaia_classprob_dsc_combmod_star | 97499546 | 0.9731148012396852 | 0.15775595388122993 | 0.0 | 1.0 |
| gaia_classprob_dsc_combmod_galaxy | 97499546 | 0.013732832528726079 | 0.11413580298246234 | 0.0 | 1.0 |
| gaia_classprob_dsc_combmod_quasar | 97499546 | 0.011232709380161953 | 0.10144233487227104 | 0.0 | 1.0 |


#### Sanity Check of ra and dec between RefCat and GaiaDR3

In [21]:
traindf.select(['Gaia_DR3_source_id','ra','gaia_ra','dec','gaia_dec']).show(20,truncate=False)

+------------------+------------------+------------------+-------------------+-------------------+
|Gaia_DR3_source_id|ra                |gaia_ra           |dec                |gaia_dec           |
+------------------+------------------+------------------+-------------------+-------------------+
|7284264691456     |45.06227669197522 |45.06227669197522 |0.21600441945474147|0.21600441945474147|
|14263587225600    |45.13475778741175 |45.13475778741175 |0.3215546124504204 |0.3215546124504204 |
|72778221634816    |44.751786988531386|44.751786988531386|0.41324883067097495|0.41324883067097495|
|74594992236672    |44.72045881857565 |44.72045881857565 |0.4959731380559752 |0.4959731380559752 |
|127788162720128   |44.75591061613937 |44.75591061613937 |0.8696145483248345 |0.8696145483248345 |
|139878495490560   |44.97738333517409 |44.97738333517409 |1.133967304337373  |1.133967304337373  |
|158638912323712   |45.51570962353016 |45.51570962353016 |0.7889201098230643 |0.7889201098230643 |
|169462229
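
Printing 20 rows shows agreement only for those rows. The check can be made quantitative on a sample. The sketch below is a plain-Python helper (the name `max_coord_offset` is hypothetical) that takes rows in the same column order as the `select` above and returns the largest absolute difference in each coordinate.

```python
def max_coord_offset(rows):
    """Return (max |ra - gaia_ra|, max |dec - gaia_dec|) in degrees.

    Each row is (source_id, ra, gaia_ra, dec, gaia_dec). Spark Row objects
    are tuples, so the result of .collect() can be passed in directly.
    The plain difference ignores the 0/360 wrap in ra, which is adequate
    when the two values are expected to be identical.
    """
    max_dra = 0.0
    max_ddec = 0.0
    for _, ra, gaia_ra, dec, gaia_dec in rows:
        max_dra = max(max_dra, abs(ra - gaia_ra))
        max_ddec = max(max_ddec, abs(dec - gaia_dec))
    return max_dra, max_ddec

# First two rows copied from the output above.
sample = [
    (7284264691456, 45.06227669197522, 45.06227669197522,
     0.21600441945474147, 0.21600441945474147),
    (14263587225600, 45.13475778741175, 45.13475778741175,
     0.3215546124504204, 0.3215546124504204),
]
print(max_coord_offset(sample))  # (0.0, 0.0)
```

In the notebook, `rows` could come from something like `traindf.select([...]).limit(100000).collect()`; the sample size is an arbitrary choice.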

#### GaiaDR3

In [13]:
filepath = "hdfs://spark00:54310/common/data/external-catalogs/parquet/gaia-dr3/original/"

In [14]:
%%time
# "recursiveFileLookup" picks up Parquet files in nested subdirectories
gaiadf = spark.read.option("recursiveFileLookup","true").parquet(filepath)

CPU times: user 1.3 ms, sys: 1.04 ms, total: 2.34 ms
Wall time: 1.42 s


In [15]:
gaiadf.printSchema()

root
 |-- solution_id: long (nullable = true)
 |-- designation: string (nullable = true)
 |-- source_id: long (nullable = true)
 |-- random_index: long (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- ra: double (nullable = true)
 |-- ra_error: float (nullable = true)
 |-- dec: double (nullable = true)
 |-- dec_error: float (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: float (nullable = true)
 |-- parallax_over_error: float (nullable = true)
 |-- pm: float (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: float (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: float (nullable = true)
 |-- ra_dec_corr: float (nullable = true)
 |-- ra_parallax_corr: float (nullable = true)
 |-- ra_pmra_corr: float (nullable = true)
 |-- ra_pmdec_corr: float (nullable = true)
 |-- dec_parallax_corr: float (nullable = true)
 |-- dec_pmra_corr: float (nullable = true)
 |-- dec_pmdec_corr: float (nullable = true)
 |--

In [16]:
# We only need these columns
mycols = ['source_id','ra','dec','classprob_dsc_combmod_star',
          'classprob_dsc_combmod_galaxy','classprob_dsc_combmod_quasar']

In [17]:
gdf = gaiadf.select(mycols)

In [18]:
gdf.cache()

DataFrame[source_id: bigint, ra: double, dec: double, classprob_dsc_combmod_star: float, classprob_dsc_combmod_galaxy: float, classprob_dsc_combmod_quasar: float]

In [19]:
%%time
gdf.describe().toPandas().transpose()

CPU times: user 11 ms, sys: 0 ns, total: 11 ms
Wall time: 53 s


| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| source_id | 1811193722 | 4.3522287618984335E18 | 1.6398716237281344E18 | 4295806720 | 6917528997577384320 |
| ra | 1811193722 | 229.0929743315936 | 77.77397548484502 | 3.4096239126626443E-7 | 359.999999939548 |
| dec | 1811193722 | -18.396119551038435 | 36.52167578231099 | -89.99287859590359 | 89.99005196682685 |
| classprob_dsc_combmod_star | 1590266307 | 0.993084235549476 | 0.07693965045668184 | 0.0 | 1.0 |
| classprob_dsc_combmod_galaxy | 1590266307 | 0.0023364383268872618 | 0.046092171151267174 | 0.0 | 1.0 |
| classprob_dsc_combmod_quasar | 1590266307 | 0.0036410749663895025 | 0.05416124999863435 | 0.0 | 1.0 |


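> Cleaned! The Gaia frame now holds only the columns we need, and they get new names (a `gaia_` prefix, with `Gaia_DR3_source_id` as the key) for the `join` operation.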
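
The cell that produced the renamed frame is not shown in this export. Judging from the later join key (`Gaia_DR3_source_id`) and the joined column names (`gaia_ra`, `gaia_dec`, `gaia_classprob_dsc_combmod_star`), it presumably renamed `source_id` and prefixed the remaining columns with `gaia_`. The sketch below reconstructs that step under this assumption; `gaia_rename_map` and `rename_for_join` are hypothetical helper names, not the author's code.

```python
mycols = ['source_id', 'ra', 'dec', 'classprob_dsc_combmod_star',
          'classprob_dsc_combmod_galaxy', 'classprob_dsc_combmod_quasar']

def gaia_rename_map(columns, key='source_id',
                    new_key='Gaia_DR3_source_id', prefix='gaia_'):
    """Map each Gaia column to its join-ready name (assumed convention)."""
    return {c: (new_key if c == key else prefix + c) for c in columns}

def rename_for_join(sdf, mapping):
    """Apply the mapping to a Spark DataFrame, one column at a time."""
    for old, new in mapping.items():
        sdf = sdf.withColumnRenamed(old, new)
    return sdf

mapping = gaia_rename_map(mycols)
for old, new in mapping.items():
    print(f"{old} -> {new}")

# In the notebook this would be applied as:
# df = rename_for_join(gdf, mapping)
```

Prefixing `ra` and `dec` keeps them from colliding with the RefCat `ra` and `dec` columns after the join.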

## Histograms for Gaia Labels

In [22]:
#tdf.printSchema()

In [23]:
minimalcols = ['Gaia_DR3_source_id','ra','dec']

In [24]:
%%time
tdf.select(minimalcols).describe().toPandas().transpose()

CPU times: user 1.62 ms, sys: 5.98 ms, total: 7.61 ms
Wall time: 14.3 s


| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| Gaia_DR3_source_id | 97529175 | 3.0082135512171448E18 | 1.78561572298214912E18 | 4295806720 | 6917528997577384320 |
| ra | 97529175 | 197.88749584731966 | 91.75502774654322 | 1.6327128351173464E-6 | 359.9999806544865 |
| dec | 97529175 | 15.04879472835068 | 27.716062779013008 | -31.382798082634444 | 84.77074678823259 |


In [25]:
#tdf.columns

#### Join two dataframes 

In [26]:
# `df` is the Gaia frame with renamed columns; the cell that defines it is not shown here
newdf = tdf.join(df, on="Gaia_DR3_source_id", how="left")

In [27]:
newdf.printSchema()

root
 |-- Gaia_DR3_source_id: long (nullable = true)
 |-- SPHERExRefID: long (nullable = true)
 |-- LegacySurvey_uid: long (nullable = true)
 |-- PS1_DR1_StackObject_objID: long (nullable = true)
 |-- CatWISE_source_id: string (nullable = true)
 |-- AllWISE_designation: string (nullable = true)
 |-- 2MASS_designation: string (nullable = true)
 |-- ra: double (nullable = true)
 |-- dec: double (nullable = true)
 |-- ra_error: double (nullable = true)
 |-- dec_error: double (nullable = true)
 |-- coord_src: long (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: double (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: double (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: double (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- astrometric_params_solved: short (nullable = true)
 |-- CatWISE_PMRA: double (nullable = true)
 |-- CatWISE_PMDec: double (nullable = true)
 |-- CatWISE_sigPMRA: double (nullab

In [28]:
somecols = ['Gaia_DR3_source_id','ra','dec','gaia_ra','gaia_dec','gaia_classprob_dsc_combmod_star']

In [29]:
%%time
newdf.select(somecols).show(3,truncate=True)

+-------------------+------------------+------------------+------------------+------------------+-------------------------------+
| Gaia_DR3_source_id|                ra|               dec|           gaia_ra|          gaia_dec|gaia_classprob_dsc_combmod_star|
+-------------------+------------------+------------------+------------------+------------------+-------------------------------+
| 867821339277466112|116.07343452366969|24.563358610389322|116.07343452366969|24.563358610389322|                      0.9998576|
| 954701273473568896| 103.0496785696089| 47.28799557615813| 103.0496785696089| 47.28799557615813|                       0.999983|
|1014611222529846912| 131.0006819972484| 48.07785687605532| 131.0006819972484| 48.07785687605532|                      0.9999857|
+-------------------+------------------+------------------+------------------+------------------+-------------------------------+
only showing top 3 rows

CPU times: user 3.97 ms, sys: 0 ns, total: 3.97 ms
Wall time: 20.

In [31]:
%%time
newdf.select(['gaia_classprob_dsc_combmod_star','gaia_classprob_dsc_combmod_galaxy', \
              'gaia_classprob_dsc_combmod_quasar']).describe().toPandas()

CPU times: user 16.6 ms, sys: 7.83 ms, total: 24.4 ms
Wall time: 4min 40s


| | summary | gaia_classprob_dsc_combmod_star | gaia_classprob_dsc_combmod_galaxy | gaia_classprob_dsc_combmod_quasar |
|---|---|---|---|---|
| 0 | count | 97499546.0 | 97499546.0 | 97499546.0 |
| 1 | mean | 0.9731148012396756 | 0.0137328325287686 | 0.0112327093801519 |
| 2 | stddev | 0.1577559538812301 | 0.1141358029824622 | 0.101442334872271 |
| 3 | min | 0.0 | 0.0 | 0.0 |
| 4 | max | 1.0 | 1.0 | 1.0 |


> `tdf` has 97529175 rows, but only 97499546 of them have non-null Gaia labels in `newdf`, so 29629 sources are missing (Null) labels.
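
`describe()` reports the non-null count per column, so the gap can be read off the printed counts directly. A small worked calculation using only the numbers shown in this notebook, for both the train sample and the full Gaia catalog:

```python
# Train sample: row count in tdf vs. non-null label count in newdf.
n_train_rows = 97_529_175
n_train_labeled = 97_499_546

# Full Gaia DR3: source_id count vs. non-null classprob_dsc_combmod_* count.
n_gaia_rows = 1_811_193_722
n_gaia_labeled = 1_590_266_307

n_train_missing = n_train_rows - n_train_labeled
n_gaia_missing = n_gaia_rows - n_gaia_labeled

print(f"train rows without labels: {n_train_missing} "
      f"({n_train_missing / n_train_rows:.4%})")
print(f"Gaia rows without labels : {n_gaia_missing} "
      f"({n_gaia_missing / n_gaia_rows:.2%})")
```

The unlabeled fraction is far smaller in the train sample than in the full catalog.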

### Save this new train sample with gaia labels 

In [33]:
hdfsheader+datapath+'RefCat-Train-Label.parquet.snappy'

'hdfs://spark00:54310/user/shong/data/spherex/star-classification/reduced-data/RefCat-Train-Label.parquet.snappy'

In [34]:
%%time
newdf.write.option("compression", "snappy").mode("overwrite") \
    .parquet(hdfsheader+datapath+'RefCat-Train-Label.parquet.snappy')

CPU times: user 51.4 ms, sys: 29 ms, total: 80.4 ms
Wall time: 20min 4s
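
The write above targets the same path that `tdf` was read from at the top of the notebook, so a failed or partial write could leave no intact copy of the input. One hedged alternative is to write to a new, date-stamped path and switch over once the result has been checked. The sketch below only builds such a path; `versioned_path` and the tag format are illustrative assumptions, not part of the original workflow.

```python
from datetime import datetime

hdfsheader = 'hdfs://spark00:54310'
datapath = '/user/shong/data/spherex/star-classification/reduced-data/'

def versioned_path(filename, when=None):
    """Insert a YYYYMMDD tag before the first dot of `filename`."""
    when = when or datetime.now()
    stem, dot, rest = filename.partition('.')
    return f"{hdfsheader}{datapath}{stem}-{when:%Y%m%d}{dot}{rest}"

outpath = versioned_path('RefCat-Train-Label.parquet.snappy')
print(outpath)

# The write would then become, for example:
# newdf.write.option("compression", "snappy").mode("overwrite").parquet(outpath)
```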
