# Cross-Referencing Datasets

This notebook cross-references the HSC SSP catalogue of objects with spectroscopic redshift estimates against a catalogue of objects that have been classified as stars, QSOs, galaxies, or unknown.

We will use these cross-referenced datasets as the basis for validating our Masked Image Modelling approach to developing meaningful embeddings of HSC images.

By creating `.csv` files with the RA, Dec, and redshift measurements of each object, we can then use this information to index into the HSC image data and create datasets of 64$\times$64 cutouts around each object.
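As a rough illustration of the cutout step, the sketch below slices a fixed-size window out of a 2D image array around a pixel position. It assumes each object's RA and Dec have already been converted to pixel coordinates (for example with the image's WCS). The function name `extract_cutout` and its arguments are hypothetical and are not part of this repository.

```python
import numpy as np

def extract_cutout(image, x_center, y_center, size=64):
    """Return a (size, size) cutout centred on a pixel position.

    Hypothetical helper for illustration only. `image` is indexed as
    image[y, x]. Returns None if the cutout would cross the image edge.
    """
    half = size // 2
    x0 = int(round(x_center)) - half
    y0 = int(round(y_center)) - half
    x1, y1 = x0 + size, y0 + size
    # Reject objects that are too close to the image boundary
    if x0 < 0 or y0 < 0 or y1 > image.shape[0] or x1 > image.shape[1]:
        return None
    return image[y0:y1, x0:x1]

# Toy image with 200 rows (y) and 300 columns (x)
toy_image = np.arange(200 * 300).reshape(200, 300)
cutout = extract_cutout(toy_image, x_center=150, y_center=100)
edge_cutout = extract_cutout(toy_image, x_center=10, y_center=100)
print(cutout.shape, edge_cutout)
```

Returning `None` for objects near the edge keeps every cutout in the dataset the same shape. Padding would be another option, depending on how the downstream model handles missing pixels.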

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
import h5py
import os
import sys
from scipy.spatial import cKDTree
import time

sys.path.append('../utils/')
from analysis_fns import normalize_images, display_images

### Load the classification data.

`cspec: {0:unknown, 1:star, 2:galaxy, 3:qso}`

Objects with no classification appear with a missing (`NaN`) `cspec` value in the table below.

In [2]:
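# Load the redshift catalogue, which also holds the `cspec` class label of each object
class_labels = pd.read_parquet('/arc/projects/unions/catalogues/redshifts/redshifts-2024-05-07.parquet')
# Map each class name to the `cspec` value used to select it
class_indices = {'unkown':np.nan, 'star':1, 'galaxy':2, 'qso':3}
class_labels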

| | ra | dec | cspec | zspec | zspec_err |
|---|---|---|---|---|---|
| 0 | 24.837864 | 31.987288 | 2.0 | 0.373586 | 0.0 |
| 1 | 24.834875 | 32.031507 | 2.0 | 0.783066 | 0.0 |
| 2 | 24.291261 | 31.900061 | 1.0 | 0.000045 | 0.0 |
| 3 | 24.347372 | 31.805734 | 2.0 | 1.071963 | 0.0 |
| 4 | 24.270095 | 31.874742 | 2.0 | 0.812852 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 55225810 | 320.101000 | -63.718700 | NaN | 0.021837 | 0.0 |
| 55225811 | 13.091700 | 31.442220 | NaN | 0.015593 | 0.0 |
| 55225812 | 118.893000 | 29.173060 | NaN | 0.021234 | 0.0 |
| 55225813 | 133.058000 | 1.460280 | NaN | 0.204486 | 0.0 |


### Remove duplicates in the catalogue.

In [3]:
def deg_to_cartesian(ra, dec):
    # Convert RA and DEC to radians for spatial indexing
    ra = np.radians(ra)
    dec = np.radians(dec)
    # Convert to Cartesian coordinates
    return np.cos(ra) * np.cos(dec), np.sin(ra) * np.cos(dec), np.sin(dec)

def create_kdtree(ra, dec):
    '''Function to create a KDTree for efficient spatial searches.'''
    # Convert to Cartesian coordinates for KDTree
    x, y, z = deg_to_cartesian(ra, dec)
    coords = np.vstack((x, y, z)).T
    return cKDTree(coords)

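tolerance = 1/3600  # Matching tolerance in degrees (1 arcsecond)
# Convert the tolerance to radians. The KDTree measures straight-line (chord)
# distance on the unit sphere, which is approximately equal to the angular
# separation in radians for angles this small.
tolerance_rad = np.radians(tolerance)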
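The tolerance above is passed to the KDTree as a Euclidean radius between unit vectors, not as an angle. The following sketch checks numerically that the two agree at the 1 arcsecond scale. The two sky positions are arbitrary illustrative values, and `deg_to_cartesian` is repeated so the block runs on its own.

```python
import numpy as np

def deg_to_cartesian(ra, dec):
    # Same conversion as in the cell above: degrees -> unit vector
    ra, dec = np.radians(ra), np.radians(dec)
    return np.cos(ra) * np.cos(dec), np.sin(ra) * np.cos(dec), np.sin(dec)

# Two illustrative points separated by exactly 1 arcsecond in Dec
p1 = np.array(deg_to_cartesian(24.0, 31.0))
p2 = np.array(deg_to_cartesian(24.0, 31.0 + 1/3600))

chord = np.linalg.norm(p1 - p2)   # Euclidean distance seen by the KDTree
angle = np.radians(1/3600)        # True angular separation in radians

rel_diff = abs(chord - angle) / angle
print(f'chord = {chord:.6e} rad, angle = {angle:.6e} rad')
print(f'relative difference = {rel_diff:.2e}')
```

Analytically the chord length is $2\sin(\theta/2)$, so the relative error is of order $\theta^2/24$, which is negligible for a matching radius of 1 arcsecond.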

In [4]:
# Create HSC KDTree to remove duplicates
hsc_kdtree = create_kdtree(class_labels['ra'].values, 
                           class_labels['dec'].values)

# Collect RA and Dec of HSC SSP data and 
# convert to Cartesian for search
X, Y, Z = deg_to_cartesian(class_labels['ra'].values, class_labels['dec'].values)

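# Remove duplicates. Each point always matches itself, so fewer than two
# matches means no other object lies within the tolerance. Note that every
# member of a close group is dropped, not only the extra copies.
good_indices = []
for i, (x,y,z) in enumerate(zip(X,Y,Z)):
    matches = hsc_kdtree.query_ball_point([x, y, z], r=tolerance_rad)
    if len(matches)<2:
        good_indices.append(i)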

print(f'Removed {(len(class_labels)-len(good_indices))} duplicates.')
class_labels = class_labels.iloc[good_indices]

Removed 2624141 duplicates.
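Looping over tens of millions of points in Python is slow. One possible alternative, shown here as a sketch and not as the method used in this notebook, is to ask the tree for the two nearest neighbours of every point in a single vectorised call and keep only the points whose second neighbour lies beyond the tolerance. The helper name `isolated_mask` and the toy coordinates are assumptions for illustration. The selection criterion is intended to match the `len(matches)<2` test above, and the catalogue must contain at least two objects.

```python
import numpy as np
from scipy.spatial import cKDTree

def isolated_mask(ra, dec, tol_deg=1/3600):
    """Boolean mask of objects with no other object within tol_deg.

    Hypothetical helper for illustration; ra and dec are in degrees.
    """
    ra, dec = np.radians(ra), np.radians(dec)
    coords = np.column_stack((np.cos(ra) * np.cos(dec),
                              np.sin(ra) * np.cos(dec),
                              np.sin(dec)))
    tree = cKDTree(coords)
    # With k=2 the first neighbour is the point itself (distance 0) and
    # the second is the closest other object in the catalogue
    dist, _ = tree.query(coords, k=2)
    return dist[:, 1] > np.radians(tol_deg)

# Toy catalogue: the first two objects are 0.5 arcsec apart
ra = np.array([10.0, 10.0, 50.0, 120.0])
dec = np.array([5.0, 5.0 + 0.5/3600, -20.0, 30.0])

mask = isolated_mask(ra, dec)
print(mask)
print(f'Removed {len(mask) - mask.sum()} duplicates.')
```

The resulting mask could then be applied with `class_labels[mask]`. Keeping one representative per group instead of dropping all members would need a different approach, for example grouping the pairs returned by `cKDTree.query_pairs`.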


### Create class .csv files

In [5]:
# Select the objects of each class and write their coordinates and redshifts to a .csv file
for class_name in ['unkown','star','galaxy','qso']:
    class_index = class_indices[class_name]
    
    # Positional indices of the rows in this class. Note: `== np.nan` is never
    # True, so the 'unkown' class (mapped to NaN) matches no rows here.
    matching_indices = np.where(class_labels['cspec']==class_index)[0]

    print(f'Found {len(matching_indices)} objects with the {class_name} class')
    # Write the DataFrame to a CSV file, including only the specified columns
    class_labels.iloc[matching_indices].to_csv(f'../data/HSC_{class_name}.csv',
                                               columns=['ra','dec','zspec','zspec_err'], index=False)

Found 0 objects with the unkown class
Found 24184966 objects with the star class
Found 24541761 objects with the galaxy class
Found 3421700 objects with the qso class
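The count of 0 for the unknown class follows from how `NaN` behaves: an equality comparison against `np.nan` is always False, so missing `cspec` values are never selected. The sketch below reproduces this on a small toy DataFrame (the values are made up for illustration) and shows `Series.isna()` as one way to select such rows.

```python
import numpy as np
import pandas as pd

# Toy catalogue: two classified objects and two with a missing class
toy = pd.DataFrame({'ra': [1.0, 2.0, 3.0, 4.0],
                    'cspec': [1.0, 2.0, np.nan, np.nan]})

# Equality with NaN is always False, so this selects nothing
n_equal = int((toy['cspec'] == np.nan).sum())

# isna() explicitly tests for missing values
n_missing = int(toy['cspec'].isna().sum())

print(f'== np.nan matches {n_equal} rows, isna() matches {n_missing} rows')

# Rows with a missing class label
unknown_rows = toy[toy['cspec'].isna()]
print(unknown_rows)
```

Applying the same `isna()` mask to `class_labels['cspec']` would be one way to populate the unknown-class file, assuming unclassified objects are stored as `NaN` and not as 0.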
