# Classification for Stars vs. Non-Stellar Objects 


### In this notebook, we extract Gaia DR3 labels and add them to our train sample.

The following Gaia DR3 columns are extracted:

- Gaia `source_id`
- Gaia `ra` and `dec`
- three Gaia classification columns: <br>
`classprob_dsc_combmod_star` `classprob_dsc_combmod_galaxy` `classprob_dsc_combmod_quasar`
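
As a rough illustration of how these three probabilities could later be turned into a star vs. non-stellar label, the sketch below picks the class with the highest probability. The helper names `hard_label` and `is_star`, the tie-breaking order, and the example values are assumptions made for illustration; the notebook itself only stores the probabilities.

```python
# Hypothetical helpers: turn the three Gaia DSC probabilities into one label.
CLASSES = ("star", "galaxy", "quasar")

def hard_label(p_star, p_galaxy, p_quasar):
    """Return the class with the largest probability, or None if any is missing."""
    probs = (p_star, p_galaxy, p_quasar)
    if any(p is None for p in probs):
        return None  # sources without a classification stay unlabeled
    # max() returns the first maximum, so ties resolve in CLASSES order (an assumption)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best]

def is_star(p_star, p_galaxy, p_quasar):
    """Binary star vs. non-stellar flag derived from hard_label."""
    label = hard_label(p_star, p_galaxy, p_quasar)
    return None if label is None else label == "star"

# Made-up example values
print(hard_label(0.97, 0.02, 0.01))
print(is_star(0.02, 0.95, 0.03))
```

Other choices, such as a threshold on the star probability alone, would be equally reasonable depending on how pure the star sample needs to be.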


## Import Basic Packages 

In [1]:
import numpy as np
import pandas as pd
import glob
import sys
import h5py
#from netCDF4 import Dataset
from datetime import datetime
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree

import pyarrow as pa
import pyarrow.parquet as pq

from functools import reduce
import operator
import gc

# Show up to 300 rows and up to 200 characters per column
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_colwidth', 200)

In [2]:
import os

from astropy.table import Table
from matplotlib.ticker import MultipleLocator

from astropy.utils.exceptions import AstropyWarning
import warnings
warnings.simplefilter('ignore', category=AstropyWarning)

In [3]:
# plot settings
#plt.rc('font', family='serif') 
#plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

## PySpark Session

In [4]:
%%time
# PySpark packages
from pyspark import SparkContext   
from pyspark.sql import SparkSession

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W


spark = SparkSession.builder \
    .master("yarn") \
    .appName("spark-shell") \
    .config("spark.driver.maxResultSize", "32g") \
    .config("spark.driver.memory", "32g") \
    .config("spark.executor.memory", "7g") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.instances", "200") \
    .config("spark.sql.hive.filesourcePartitionFileCacheSize", "2097152000") \
    .getOrCreate()



sc = spark.sparkContext
sc.setCheckpointDir("hdfs://spark00:54310/tmp/checkpoints")

spark.conf.set("spark.sql.debug.maxToStringFields", 500)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

CPU times: user 3.95 ms, sys: 17.4 ms, total: 21.4 ms
Wall time: 32.5 s


> Starting the session takes a while (about 30 s of wall time here) because it has to acquire resources from the YARN cluster.
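
Part of the wait is simply the size of the request. A quick back-of-the-envelope calculation from the session settings above, counting executor memory only (any per-container overhead that YARN adds is not included):

```python
# Values copied from the SparkSession config in this notebook
executor_instances = 200
executor_cores = 1
executor_memory_gb = 7

total_cores = executor_instances * executor_cores
# Executor memory only; container overhead is not counted here
total_executor_memory_gb = executor_instances * executor_memory_gb

print(f"requested cores: {total_cores}")
print(f"requested executor memory: {total_executor_memory_gb} GB")
```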

## Reading Train Sample and GaiaDR3 Labels

In [5]:
!pwd

/home/shong/work/deeplearnings/star-classification/notebook


#### Train Sample

In [6]:
datapath = '/user/shong/data/spherex/star-classification/reduced-data/'
hdfsheader = 'hdfs://spark00:54310'
localdatapath= '/home/shong/work/deeplearnings/star-classification/data/'

In [7]:
hdfsheader+datapath+'RefCat-Train.parquet.snappy'

'hdfs://spark00:54310/user/shong/data/spherex/star-classification/reduced-data/RefCat-Train.parquet.snappy'

In [8]:
%%time
# Parquet files carry their own schema, so no "header" option is needed
tdf = spark.read.parquet(hdfsheader+datapath+'RefCat-Train.parquet.snappy')

CPU times: user 772 µs, sys: 989 µs, total: 1.76 ms
Wall time: 2.28 s


In [9]:
tdf.printSchema()

root
 |-- SPHERExRefID: long (nullable = true)
 |-- Gaia_DR3_source_id: long (nullable = true)
 |-- LegacySurvey_uid: long (nullable = true)
 |-- PS1_DR1_StackObject_objID: long (nullable = true)
 |-- CatWISE_source_id: string (nullable = true)
 |-- AllWISE_designation: string (nullable = true)
 |-- 2MASS_designation: string (nullable = true)
 |-- ra: double (nullable = true)
 |-- dec: double (nullable = true)
 |-- ra_error: double (nullable = true)
 |-- dec_error: double (nullable = true)
 |-- coord_src: long (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: double (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: double (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: double (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- astrometric_params_solved: short (nullable = true)
 |-- CatWISE_PMRA: double (nullable = true)
 |-- CatWISE_PMDec: double (nullable = true)
 |-- CatWISE_sigPMRA: double (nullab

#### GaiaDR3

In [10]:
filepath = "hdfs://spark00:54310/common/data/external-catalogs/parquet/gaia-dr3/original/"

In [11]:
%%time
# Read all parquet files under filepath, including subdirectories
gaiadf = spark.read.option("recursiveFileLookup","true").parquet(filepath)

CPU times: user 517 µs, sys: 1.06 ms, total: 1.57 ms
Wall time: 3.73 s


In [12]:
gaiadf.printSchema()

root
 |-- solution_id: long (nullable = true)
 |-- designation: string (nullable = true)
 |-- source_id: long (nullable = true)
 |-- random_index: long (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- ra: double (nullable = true)
 |-- ra_error: float (nullable = true)
 |-- dec: double (nullable = true)
 |-- dec_error: float (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: float (nullable = true)
 |-- parallax_over_error: float (nullable = true)
 |-- pm: float (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: float (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: float (nullable = true)
 |-- ra_dec_corr: float (nullable = true)
 |-- ra_parallax_corr: float (nullable = true)
 |-- ra_pmra_corr: float (nullable = true)
 |-- ra_pmdec_corr: float (nullable = true)
 |-- dec_parallax_corr: float (nullable = true)
 |-- dec_pmra_corr: float (nullable = true)
 |-- dec_pmdec_corr: float (nullable = true)
 |--

In [13]:
# We only need these columns
mycols = ['source_id','ra','dec','classprob_dsc_combmod_star',
          'classprob_dsc_combmod_galaxy','classprob_dsc_combmod_quasar']

In [14]:
gdf = gaiadf.select(mycols)

In [15]:
%%time
gdf.describe().toPandas().transpose()

CPU times: user 4.51 ms, sys: 9.75 ms, total: 14.3 ms
Wall time: 47.3 s


| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| source_id | 1811193722 | 4.3522287618984305E18 | 1.63987162372813824E18 | 4295806720 | 6917528997577384320 |
| ra | 1811193722 | 229.09297433159446 | 77.773975484845 | 3.4096239126626443E-7 | 359.999999939548 |
| dec | 1811193722 | -18.396119551038502 | 36.521675782310965 | -89.99287859590359 | 89.99005196682685 |
| classprob_dsc_combmod_star | 1590266307 | 0.9930842355494748 | 0.07693965045668165 | 0.0 | 1.0 |
| classprob_dsc_combmod_galaxy | 1590266307 | 0.0023364383268872635 | 0.04609217115126699 | 0.0 | 1.0 |
| classprob_dsc_combmod_quasar | 1590266307 | 0.0036410749663895047 | 0.0541612499986343 | 0.0 | 1.0 |


> **Class-less** objects (no classification probabilities): $1811193722 - 1590266307 = 220927415$
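
The number of sources without a classification follows directly from the two counts in the `describe()` output. A quick check in plain Python, which also gives the fraction of the catalog affected (the variable names are only for this sketch):

```python
n_total = 1_811_193_722       # count of source_id in the describe() output
n_classified = 1_590_266_307  # count of the classprob_dsc_combmod_* columns

n_classless = n_total - n_classified
frac_classless = n_classless / n_total

print(n_classless)
print(f"{frac_classless:.1%}")
```

Roughly one in eight Gaia DR3 sources therefore has no class probabilities and is dropped in the next step.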

In [16]:
df = gdf.dropna()

In [17]:
df.printSchema()

root
 |-- source_id: long (nullable = true)
 |-- ra: double (nullable = true)
 |-- dec: double (nullable = true)
 |-- classprob_dsc_combmod_star: float (nullable = true)
 |-- classprob_dsc_combmod_galaxy: float (nullable = true)
 |-- classprob_dsc_combmod_quasar: float (nullable = true)



In [18]:
# Dictionary mapping old column names to new column names
rename_mappings = {
    "source_id": "Gaia_DR3_source_id",
    "ra": "gaia_ra",
    "dec": "gaia_dec",
    "classprob_dsc_combmod_star": "gaia_classprob_dsc_combmod_star",
    "classprob_dsc_combmod_galaxy": "gaia_classprob_dsc_combmod_galaxy",
    "classprob_dsc_combmod_quasar": "gaia_classprob_dsc_combmod_quasar",
}

In [19]:
# Apply renaming for each column
for old_name, new_name in rename_mappings.items():
    #print(old_name)
    #print(new_name)
    df = df.withColumnRenamed(old_name, new_name)

In [20]:
df.cache()

DataFrame[Gaia_DR3_source_id: bigint, gaia_ra: double, gaia_dec: double, gaia_classprob_dsc_combmod_star: float, gaia_classprob_dsc_combmod_galaxy: float, gaia_classprob_dsc_combmod_quasar: float]

In [21]:
%%time
df.describe().toPandas().transpose()

CPU times: user 6.18 ms, sys: 5.73 ms, total: 11.9 ms
Wall time: 42.3 s


| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| Gaia_DR3_source_id | 1590266307 | 4.323867992575383E18 | 1.69846808056080717E18 | 4295806720 | 6917528997577384320 |
| gaia_ra | 1590266307 | 225.7475618571117 | 80.47346490956434 | 3.4096239126626443E-7 | 359.999999939548 |
| gaia_dec | 1590266307 | -17.09411541236342 | 37.68630391712936 | -89.99287859590359 | 89.99005196682685 |
| gaia_classprob_dsc_combmod_star | 1590266307 | 0.9930842355494731 | 0.07693965045668162 | 0.0 | 1.0 |
| gaia_classprob_dsc_combmod_galaxy | 1590266307 | 0.00233643832688727 | 0.046092171151267215 | 0.0 | 1.0 |
| gaia_classprob_dsc_combmod_quasar | 1590266307 | 0.0036410749663895116 | 0.0541612499986343 | 0.0 | 1.0 |


> Cleaned: rows with missing values are dropped, and the Gaia columns have new names for the `join` operation.
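
The renaming matters because the train sample already has its own `ra` and `dec` columns. The sketch below, in plain Python, checks which names would clash and confirms that after renaming only the join key is shared. The `train_cols` list is just a subset of the printed schema, and `gaia_name` is a hypothetical rule that reproduces the mapping used above.

```python
# Columns selected from Gaia DR3 in this notebook
gaia_cols = ["source_id", "ra", "dec",
             "classprob_dsc_combmod_star",
             "classprob_dsc_combmod_galaxy",
             "classprob_dsc_combmod_quasar"]

# A few train-sample columns (subset of the printed schema)
train_cols = ["SPHERExRefID", "Gaia_DR3_source_id", "ra", "dec",
              "pmra", "pmdec", "parallax", "ref_epoch"]

# Names that would be ambiguous after a join if left unrenamed
clashes = sorted(set(gaia_cols) & set(train_cols))

def gaia_name(col):
    """Hypothetical rule reproducing the notebook's rename mapping."""
    return "Gaia_DR3_source_id" if col == "source_id" else "gaia_" + col

mapping = {c: gaia_name(c) for c in gaia_cols}

# After renaming, the only shared name should be the join key
shared_after = set(mapping.values()) & set(train_cols)

print(clashes)
print(shared_after)
```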

## CrossMatch `tdf` and `df`  

In [22]:
#tdf.printSchema()

In [23]:
minimalcols = ['Gaia_DR3_source_id','ra','dec']

In [24]:
%%time
tdf.select(minimalcols).describe().toPandas().transpose()

CPU times: user 1.62 ms, sys: 5.98 ms, total: 7.61 ms
Wall time: 14.3 s


| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| Gaia_DR3_source_id | 97529175 | 3.0082135512171448E18 | 1.78561572298214912E18 | 4295806720 | 6917528997577384320 |
| ra | 97529175 | 197.88749584731966 | 91.75502774654322 | 1.6327128351173464E-6 | 359.9999806544865 |
| dec | 97529175 | 15.04879472835068 | 27.716062779013008 | -31.382798082634444 | 84.77074678823259 |


In [25]:
#tdf.columns

#### Join two dataframes 

In [26]:
newdf = tdf.join(df, on="Gaia_DR3_source_id", how="left" )

In [27]:
newdf.printSchema()

root
 |-- Gaia_DR3_source_id: long (nullable = true)
 |-- SPHERExRefID: long (nullable = true)
 |-- LegacySurvey_uid: long (nullable = true)
 |-- PS1_DR1_StackObject_objID: long (nullable = true)
 |-- CatWISE_source_id: string (nullable = true)
 |-- AllWISE_designation: string (nullable = true)
 |-- 2MASS_designation: string (nullable = true)
 |-- ra: double (nullable = true)
 |-- dec: double (nullable = true)
 |-- ra_error: double (nullable = true)
 |-- dec_error: double (nullable = true)
 |-- coord_src: long (nullable = true)
 |-- pmra: double (nullable = true)
 |-- pmra_error: double (nullable = true)
 |-- pmdec: double (nullable = true)
 |-- pmdec_error: double (nullable = true)
 |-- parallax: double (nullable = true)
 |-- parallax_error: double (nullable = true)
 |-- ref_epoch: double (nullable = true)
 |-- astrometric_params_solved: short (nullable = true)
 |-- CatWISE_PMRA: double (nullable = true)
 |-- CatWISE_PMDec: double (nullable = true)
 |-- CatWISE_sigPMRA: double (nullab

In [28]:
somecols = ['Gaia_DR3_source_id','ra','dec','gaia_ra','gaia_dec','gaia_classprob_dsc_combmod_star']

In [29]:
%%time
newdf.select(somecols).show(3,truncate=True)

+-------------------+------------------+------------------+------------------+------------------+-------------------------------+
| Gaia_DR3_source_id|                ra|               dec|           gaia_ra|          gaia_dec|gaia_classprob_dsc_combmod_star|
+-------------------+------------------+------------------+------------------+------------------+-------------------------------+
| 867821339277466112|116.07343452366969|24.563358610389322|116.07343452366969|24.563358610389322|                      0.9998576|
| 954701273473568896| 103.0496785696089| 47.28799557615813| 103.0496785696089| 47.28799557615813|                       0.999983|
|1014611222529846912| 131.0006819972484| 48.07785687605532| 131.0006819972484| 48.07785687605532|                      0.9999857|
+-------------------+------------------+------------------+------------------+------------------+-------------------------------+
only showing top 3 rows

CPU times: user 3.97 ms, sys: 0 ns, total: 3.97 ms
Wall time: 20.

In [31]:
%%time
newdf.select(['gaia_classprob_dsc_combmod_star','gaia_classprob_dsc_combmod_galaxy', \
              'gaia_classprob_dsc_combmod_quasar']).describe().toPandas()

CPU times: user 16.6 ms, sys: 7.83 ms, total: 24.4 ms
Wall time: 4min 40s


| summary | gaia_classprob_dsc_combmod_star | gaia_classprob_dsc_combmod_galaxy | gaia_classprob_dsc_combmod_quasar |
|---|---|---|---|
| count | 97499546.0 | 97499546.0 | 97499546.0 |
| mean | 0.9731148012396756 | 0.0137328325287686 | 0.0112327093801519 |
| stddev | 0.1577559538812301 | 0.1141358029824622 | 0.101442334872271 |
| min | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 |


> `tdf` has 97529175 rows, but only 97499546 rows of `newdf` have non-null labels, so 29629 rows have missing (null) labels.
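
To see why a left join keeps every train-sample row while leaving some labels null, here is a toy version in plain Python (no Spark). All IDs and probabilities are made up, and `left_join` is a hypothetical helper that only mimics the behavior for a unique right-hand key.

```python
# Toy stand-ins for tdf and df; all IDs and probabilities are made up.
train_rows = [
    {"Gaia_DR3_source_id": 101, "ra": 10.0},
    {"Gaia_DR3_source_id": 102, "ra": 20.0},
    {"Gaia_DR3_source_id": 103, "ra": 30.0},
]
labels_by_id = {
    101: {"gaia_classprob_dsc_combmod_star": 0.99},
    103: {"gaia_classprob_dsc_combmod_star": 0.05},
}

def left_join(rows, lookup, key, label_cols):
    """Keep every row; fill label columns with None when the key has no match."""
    joined = []
    for row in rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{c: match.get(c) for c in label_cols}})
    return joined

label_cols = ["gaia_classprob_dsc_combmod_star"]
joined = left_join(train_rows, labels_by_id, "Gaia_DR3_source_id", label_cols)

n_rows = len(joined)
n_labeled = sum(r[label_cols[0]] is not None for r in joined)
print(n_rows, n_labeled, n_rows - n_labeled)

# Same arithmetic for the real counts reported in this notebook
n_missing_real = 97_529_175 - 97_499_546
print(n_missing_real)
```

The unmatched rows are most likely sources that were removed by `dropna()` because Gaia DR3 provides no class probabilities for them, although this notebook does not verify that directly.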

### Save the new train sample with Gaia labels

In [33]:
hdfsheader+datapath+'RefCat-Train-Label.parquet.snappy'

'hdfs://spark00:54310/user/shong/data/spherex/star-classification/reduced-data/RefCat-Train-Label.parquet.snappy'

In [34]:
%%time
# Write explicitly as snappy-compressed parquet, replacing any existing output
newdf.write.option("compression", "snappy").mode("overwrite") \
    .parquet(hdfsheader+datapath+'RefCat-Train-Label.parquet.snappy')

CPU times: user 51.4 ms, sys: 29 ms, total: 80.4 ms
Wall time: 20min 4s
