# Merge All Cluster Catalogs for K-DRIFT Survey

Before writing any code, describe the basic tasks you will carry out in this Jupyter notebook.

- We will make a new **cluster catalog** by merging all available open cluster catalogs
- `pandas` and `parquet` will be used


## Basic Packages

In [1]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pyarrow as pa
import pyarrow.parquet as pq

import gc

# plot settings
plt.rc('font', family='serif') 
plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

## Explore the open catalogs 

### Looking around directories and files 

In [2]:
%pwd

'/home/shong/work/kdrift/survey-design/notebook'

In [3]:
%ls ../data/

[0m[01;34mcluster-catalog[0m/  [01;34mvoid-catalog[0m/


In [4]:
%ls ../data/cluster-catalog/

Abell_all.dat      SPLUS.dat              eRASS_cluster.dat    upcluster_sz.dat
CLUMPR_DESI.dat    cluster_DESunWISE.dat  eRASS_cosmology.dat
DESI_clusters.dat  column_info.txt        eRASS_primary.dat
DESI_xray_sz.dat   eFEDS.dat              redmapper.dat


In [5]:
%ls -alFh ../data/cluster-catalog/

total 342M
drwxrwxr-x 3 shong shong 4.0K  4월  9 15:28 ./
drwxrwxr-x 4 shong shong 4.0K  4월  9 14:44 ../
drwxrwxr-x 2 shong shong 4.0K  4월  9 15:28 .ipynb_checkpoints/
-rw-rw-r-- 1 shong shong 318K  4월  9 14:43 Abell_all.dat
-rw-rw-r-- 1 shong shong 140M  4월  9 14:44 CLUMPR_DESI.dat
-rw-rw-r-- 1 shong shong 167M  4월  9 14:44 DESI_clusters.dat
-rw-rw-r-- 1 shong shong 245K  4월  9 14:43 DESI_xray_sz.dat
-rw-rw-r-- 1 shong shong 595K  4월  9 14:43 SPLUS.dat
-rw-rw-r-- 1 shong shong  17M  4월  9 14:43 cluster_DESunWISE.dat
-rw-rw-r-- 1 shong shong  44K  4월  9 14:43 column_info.txt
-rw-rw-r-- 1 shong shong 208K  4월  9 14:43 eFEDS.dat
-rw-rw-r-- 1 shong shong 4.7M  4월  9 14:43 eRASS_cluster.dat
-rw-rw-r-- 1 shong shong 1.3M  4월  9 14:43 eRASS_cosmology.dat
-rw-rw-r-- 1 shong shong  11M  4월  9 14:43 eRASS_primary.dat
-rw-rw-r-- 1 shong shong 503K  4월  9 14:43 redmapper.dat
-rw-rw-r-- 1 shong shong 162K  4월  9 14:43 upcluster_sz.dat


In [6]:
datadir = '/home/shong/work/kdrift/survey-design/data/cluster-catalog/'

In [7]:
catlist = !ls ../data/cluster-catalog/

In [8]:
catlist

['Abell_all.dat',
 'CLUMPR_DESI.dat',
 'DESI_clusters.dat',
 'DESI_xray_sz.dat',
 'SPLUS.dat',
 'cluster_DESunWISE.dat',
 'column_info.txt',
 'eFEDS.dat',
 'eRASS_cluster.dat',
 'eRASS_cosmology.dat',
 'eRASS_primary.dat',
 'redmapper.dat',
 'upcluster_sz.dat']

In [9]:
numcatlist = len(catlist)
print(numcatlist)

13
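Note that the count of 13 includes `column_info.txt`, which looks like a description file rather than a catalog. A minimal sketch for keeping only the catalogs, assuming every catalog file ends in `.dat` (the name `catalog_files` is made up for this example):

```python
# Listing copied from the `catlist` output above
catlist = [
    'Abell_all.dat', 'CLUMPR_DESI.dat', 'DESI_clusters.dat',
    'DESI_xray_sz.dat', 'SPLUS.dat', 'cluster_DESunWISE.dat',
    'column_info.txt', 'eFEDS.dat', 'eRASS_cluster.dat',
    'eRASS_cosmology.dat', 'eRASS_primary.dat', 'redmapper.dat',
    'upcluster_sz.dat',
]

# Assumption: catalogs are exactly the files with a `.dat` extension
catalog_files = [name for name in catlist if name.endswith('.dat')]
print(len(catalog_files))
```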


### Explore `upcluster_sz.dat`

In [10]:
catlist[-1]

'upcluster_sz.dat'

In [11]:
!head {datadir+catlist[-1]}

	id	name	SNR	RAdeg	DEdeg	posErr	z	f_z	NSpec	YR500	vali	vali_flag	Msz	catname
0	1	PSZ2G000.04+45.13	6.75319	229.1905	-1.0172	4.1073	0.1198	spec	-	5.481591	20	V	3.96	upcluster
1	2	PSZ2G000.13+78.04	9.25669	203.5587	20.256	2.0562	0.171	spec	-	4.360847	20	V	5.12	upcluster
2	3	PSZ2G000.40-41.86	9.70428	316.0845	-41.3542	2.4274	0.1651	spec	-	4.507689	21	V	5.3	upcluster
3	4	PSZ2G000.77-35.69	6.58179	307.9728	-40.5987	2.3434	0.3416	spec	-	1.606287	21	V	6.33	upcluster
4	5	PSZ2G002.04-22.15	5.12563	291.3596	-36.5179	5.0208	-1.0	-	-	1.927779	-1	C	0.0	upcluster
5	6	PSZ2G002.08-68.28	4.7504	349.6324	-36.3326	5.4273	0.14	spec	-	3.12731	20	V	2.84	upcluster
6	7	PSZ2G002.42+69.64	4.62169	210.9933	15.6884	2.4286	0.1802	spec	3	2.68569	60	V	3.51	upcluster
7	8	PSZ2G002.77-56.16	9.19606	334.6595	-38.8794	2.2796	0.1411	spec	-	4.047786	21	V	4.41	upcluster
8	9	PSZ2G002.82+39.23	8.07308	235.0152	-3.2851	2.4288	0.1533	spec	-	9.700967	21	V	5.74	upcluster


- If the file were a plain `csv`, we could simply read the catalog with `pd.read_csv()`. 
- For this case, though, we extract the schema (column names) from the first line and parse the data contents row by row. 
- So... know *data science* and **be cool**! 
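As an aside, if the file turns out to be tab-delimited (an assumption based on the `head` output above, not verified here), `pd.read_csv()` may still be able to read it directly. The sketch below uses a small in-memory sample, not the real file, and the name `df_alt` is made up for this example:

```python
import io
import pandas as pd

# Two rows copied from the `head` output, truncated to three columns.
# Assumption: fields are separated by single tabs and the header line
# starts with a tab (an unnamed index column).
sample = (
    "\tid\tname\tSNR\n"
    "0\t1\tPSZ2G000.04+45.13\t6.75319\n"
    "1\t2\tPSZ2G000.13+78.04\t9.25669\n"
)

# `index_col=0` uses the unnamed leading column as the row index
df_alt = pd.read_csv(io.StringIO(sample), sep="\t", index_col=0)
print(df_alt.dtypes)
```

With this route pandas infers numeric types on its own, so it is worth comparing against the manual parsing used in this notebook.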

### Now make a new pandas dataframe from `upcluster_sz.dat`

#### Read the first line for defining `schema` 

In [12]:
filename = datadir+catlist[-1]
print(filename)

/home/shong/work/kdrift/survey-design/data/cluster-catalog/upcluster_sz.dat


In [13]:
with open(filename, "r") as file:
    schema_line = file.readline().strip()
    data_line = file.readline().strip()

In [14]:
schema = schema_line.split()

In [15]:
schema

['id',
 'name',
 'SNR',
 'RAdeg',
 'DEdeg',
 'posErr',
 'z',
 'f_z',
 'NSpec',
 'YR500',
 'vali',
 'vali_flag',
 'Msz',
 'catname']

In [16]:
firstrow = data_line.split()

In [17]:
firstrow

['0',
 '1',
 'PSZ2G000.04+45.13',
 '6.75319',
 '229.1905',
 '-1.0172',
 '4.1073',
 '0.1198',
 'spec',
 '-',
 '5.481591',
 '20',
 'V',
 '3.96',
 'upcluster']

In [18]:
[len(schema), len(firstrow)]

[14, 15]

> Okay, each data row has one more item than the schema: the leading row index. Let's ignore that first item.

#### Make a pandas dataframe 

In [19]:
# Initialize an empty list to store data
data = []

In [20]:
# Read the rest of the lines (excluding schema)
with open(filename, "r") as file:
    # Skip the first line (schema)
    next(file)
    
    for line in file:
        line_data = line.strip().split()
        # Append the processed data to the list
        data.append(line_data[1:])

In [21]:
# Create the DataFrame from the list
df = pd.DataFrame(data, columns=schema)  # Use schema for column names
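Since the same steps will be repeated for every catalog, they could be wrapped in one helper. This is only a sketch: `read_catalog` is a hypothetical name, and it assumes every file has the same layout as `upcluster_sz.dat` (a header line, then rows with a leading index item).

```python
import io
import pandas as pd

def read_catalog(handle):
    """Parse a whitespace-delimited catalog whose rows start with an index item."""
    # First line holds the column names
    schema = handle.readline().split()
    # Remaining non-empty lines hold the data; drop the leading index item
    rows = [line.split()[1:] for line in handle if line.strip()]
    return pd.DataFrame(rows, columns=schema)

# Small in-memory stand-in for a real catalog file
sample = io.StringIO(
    "\tid\tname\tSNR\n"
    "0\t1\tPSZ2G000.04+45.13\t6.75319\n"
    "1\t2\tPSZ2G000.13+78.04\t9.25669\n"
)
df_demo = read_catalog(sample)
print(df_demo)
```

One caveat: `split()` with no argument splits on any whitespace, so a field that contains a space (for example a cluster name with a space in it) would be broken into two items. For such a file, splitting on the tab character instead may be safer.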

In [22]:
df.head()

Unnamed: 0,id,name,SNR,RAdeg,DEdeg,posErr,z,f_z,NSpec,YR500,vali,vali_flag,Msz,catname
0,1,PSZ2G000.04+45.13,6.75319,229.1905,-1.0172,4.1073,0.1198,spec,-,5.481591,20,V,3.96,upcluster
1,2,PSZ2G000.13+78.04,9.25669,203.5587,20.256,2.0562,0.171,spec,-,4.360847,20,V,5.12,upcluster
2,3,PSZ2G000.40-41.86,9.70428,316.0845,-41.3542,2.4274,0.1651,spec,-,4.507689,21,V,5.3,upcluster
3,4,PSZ2G000.77-35.69,6.58179,307.9728,-40.5987,2.3434,0.3416,spec,-,1.606287,21,V,6.33,upcluster
4,5,PSZ2G002.04-22.15,5.12563,291.3596,-36.5179,5.0208,-1.0,-,-,1.927779,-1,C,0.0,upcluster


In [23]:
df.tail()

Unnamed: 0,id,name,SNR,RAdeg,DEdeg,posErr,z,f_z,NSpec,YR500,vali,vali_flag,Msz,catname
1648,1649,PSZ2G358.94-70.57,6.0534,352.7803,-36.5498,3.1576,0.0957,spec,-,4.050002,21,V,2.52,upcluster
1649,1650,PSZ2G358.98-67.26,9.88427,348.9266,-37.7791,1.7243,0.1786,spec,-,4.346875,21,V,5.22,upcluster
1650,1651,PSZ2G359.07-32.12,7.64275,303.0189,-41.4869,2.4288,0.1496,spec,-,3.354362,21,V,4.72,upcluster
1651,1652,PSZ2G359.60-08.72,4.74027,275.0918,-33.5317,2.428,-1.0,-,-,2.50723,-1,C,0.0,upcluster
1652,1653,PSZ2G359.67-07.23,6.71094,273.5496,-32.791,1.2551,-1.0,-,-,1.404587,-1,C,0.0,upcluster


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1653 entries, 0 to 1652
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         1653 non-null   object
 1   name       1653 non-null   object
 2   SNR        1653 non-null   object
 3   RAdeg      1653 non-null   object
 4   DEdeg      1653 non-null   object
 5   posErr     1653 non-null   object
 6   z          1653 non-null   object
 7   f_z        1653 non-null   object
 8   NSpec      1653 non-null   object
 9   YR500      1653 non-null   object
 10  vali       1653 non-null   object
 11  vali_flag  1653 non-null   object
 12  Msz        1653 non-null   object
 13  catname    1653 non-null   object
dtypes: object(14)
memory usage: 180.9+ KB


- Note that every column has dtype `object`, because each value was read in as text. Several of them should be numeric (`int64` or `float64`).

#### Explore the pandas dataframe

In [25]:
df['NSpec'].unique()

array(['-', '3', '2', '34', '6', '9', '1', '25', '10', '15', '17', '38',
       '61', '8', '31', '54', '33', '22', '21', '37', '4', '30', '23',
       '20', '46', '13', '12', '16', '5', '52', '14', '44', '47', '53',
       '11', '32', '39', '27', '18', '29', '41', '7', '19', '28', '45',
       '26'], dtype=object)

In [26]:
df['vali_flag'].unique()

array(['V', 'C', 'S', 'N'], dtype=object)

In [27]:
df['vali'].unique()

array(['20', '21', '-1', '60', '64', '13', '16', '50', '62', '30', '-50',
       '54', '25', '53', '11', '10', '51', '52', '14', '15', '24', '63',
       '12', '22', '61', '23', '55'], dtype=object)

#### Set numeric types for certain columns 

In [28]:
numeric_cols = ['id','SNR','RAdeg','DEdeg','posErr','z','YR500','vali','Msz']

In [29]:
# Convert columns to numeric using apply and pd.to_numeric
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1653 entries, 0 to 1652
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         1653 non-null   int64  
 1   name       1653 non-null   object 
 2   SNR        1653 non-null   float64
 3   RAdeg      1653 non-null   float64
 4   DEdeg      1653 non-null   float64
 5   posErr     1653 non-null   float64
 6   z          1653 non-null   float64
 7   f_z        1653 non-null   object 
 8   NSpec      1653 non-null   object 
 9   YR500      1653 non-null   float64
 10  vali       1653 non-null   int64  
 11  vali_flag  1653 non-null   object 
 12  Msz        1653 non-null   float64
 13  catname    1653 non-null   object 
dtypes: float64(7), int64(2), object(5)
memory usage: 180.9+ KB
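`NSpec` was left out of `numeric_cols` because it contains `'-'` entries. One possible way to convert it as well is sketched below, assuming `'-'` simply means "no value": `errors='coerce'` turns anything unparsable into `NaN`, and the nullable `Int64` dtype keeps the rest as integers.

```python
import pandas as pd

# A few values taken from the `NSpec` unique list above
nspec = pd.Series(['-', '3', '2', '34'])

# '-' cannot be parsed, so it becomes NaN (the column is float64 at this point)
nspec_float = pd.to_numeric(nspec, errors='coerce')

# Nullable integer dtype: missing entries become <NA>, the rest stay integers
nspec_int = nspec_float.astype('Int64')
print(nspec_int)
```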


In [31]:
df.head()

Unnamed: 0,id,name,SNR,RAdeg,DEdeg,posErr,z,f_z,NSpec,YR500,vali,vali_flag,Msz,catname
0,1,PSZ2G000.04+45.13,6.75319,229.1905,-1.0172,4.1073,0.1198,spec,-,5.481591,20,V,3.96,upcluster
1,2,PSZ2G000.13+78.04,9.25669,203.5587,20.256,2.0562,0.171,spec,-,4.360847,20,V,5.12,upcluster
2,3,PSZ2G000.40-41.86,9.70428,316.0845,-41.3542,2.4274,0.1651,spec,-,4.507689,21,V,5.3,upcluster
3,4,PSZ2G000.77-35.69,6.58179,307.9728,-40.5987,2.3434,0.3416,spec,-,1.606287,21,V,6.33,upcluster
4,5,PSZ2G002.04-22.15,5.12563,291.3596,-36.5179,5.0208,-1.0,-,-,1.927779,-1,C,0.0,upcluster


In [32]:
df.tail()

Unnamed: 0,id,name,SNR,RAdeg,DEdeg,posErr,z,f_z,NSpec,YR500,vali,vali_flag,Msz,catname
1648,1649,PSZ2G358.94-70.57,6.0534,352.7803,-36.5498,3.1576,0.0957,spec,-,4.050002,21,V,2.52,upcluster
1649,1650,PSZ2G358.98-67.26,9.88427,348.9266,-37.7791,1.7243,0.1786,spec,-,4.346875,21,V,5.22,upcluster
1650,1651,PSZ2G359.07-32.12,7.64275,303.0189,-41.4869,2.4288,0.1496,spec,-,3.354362,21,V,4.72,upcluster
1651,1652,PSZ2G359.60-08.72,4.74027,275.0918,-33.5317,2.428,-1.0,-,-,2.50723,-1,C,0.0,upcluster
1652,1653,PSZ2G359.67-07.23,6.71094,273.5496,-32.791,1.2551,-1.0,-,-,1.404587,-1,C,0.0,upcluster
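In the rows above, `z` is `-1.0` wherever `f_z` is `'-'`, so `-1.0` looks like a placeholder for "no redshift". That is an assumption, not something confirmed by `column_info.txt` here. If it holds, a sketch like the following would mark those entries as missing before the catalogs are merged:

```python
import pandas as pd

# Three (z, f_z) pairs copied from the head() output above
demo = pd.DataFrame({
    'z': [0.1198, -1.0, 0.14],
    'f_z': ['spec', '-', 'spec'],
})

# Assumption: a negative redshift is a "missing" marker.
# `where` keeps values where the condition holds and puts NaN elsewhere.
demo['z'] = demo['z'].where(demo['z'] >= 0)
print(demo)
```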


#### Convert object columns to string columns (optional)

In [33]:
# Select object columns
object_cols = df.select_dtypes(include=['object']).columns

In [34]:
# Convert object columns to string type
df[object_cols] = df[object_cols].astype("string")

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1653 entries, 0 to 1652
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         1653 non-null   int64  
 1   name       1653 non-null   string 
 2   SNR        1653 non-null   float64
 3   RAdeg      1653 non-null   float64
 4   DEdeg      1653 non-null   float64
 5   posErr     1653 non-null   float64
 6   z          1653 non-null   float64
 7   f_z        1653 non-null   string 
 8   NSpec      1653 non-null   string 
 9   YR500      1653 non-null   float64
 10  vali       1653 non-null   int64  
 11  vali_flag  1653 non-null   string 
 12  Msz        1653 non-null   float64
 13  catname    1653 non-null   string 
dtypes: float64(7), int64(2), string(5)
memory usage: 180.9 KB


> That's all of my feedback! Learn more about the power of `pandas`, `python`, and `jupyter notebook`. Good luck!

## New Homework 

- Can we automatically determine the data type of each column from the first `data_line`? 
- After some discussion with Gemini and ChatGPT, I came up with the following:

In [36]:
def infer_dtype(data):
    """
    Infers the data type of each element in a list.
    
    Args:
        data: A list of elements.
        
    Returns:
        A list of data types corresponding to each element in the input list.
    """
    data_types = []
    for item in data:
        try:
            # Try int first: every integer string also parses as a float
            int(item)
            data_types.append(int)
        except ValueError:
            try:
                # Not an int, so try converting to float
                float(item)
                data_types.append(float)
            except ValueError:
                # Otherwise, it's a string
                data_types.append(str)
    return data_types

In [37]:
# Example usage
data = ['0', '1', 'PSZ2G000.04+45.13', '6.75319', '229.1905', '-1.0172', '4.1073', '0.1198', 'spec', '-', '5.481591', '20', 'V', '3.96', 'upcluster']
data_types = infer_dtype(data)

In [38]:
# Print the data types
print(data_types)

[<class 'int'>, <class 'int'>, <class 'str'>, <class 'float'>, <class 'float'>, <class 'float'>, <class 'float'>, <class 'float'>, <class 'str'>, <class 'str'>, <class 'float'>, <class 'int'>, <class 'str'>, <class 'float'>, <class 'str'>]


In [39]:
data_types

[int,
 int,
 str,
 float,
 float,
 float,
 float,
 float,
 str,
 str,
 float,
 int,
 str,
 float,
 str]
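One limitation is worth keeping in mind: the first row may not be representative. In `upcluster_sz.dat` the first row has `NSpec` equal to `'-'`, so it is inferred as `str` even though later rows hold integers. The sketch below applies first-row inference to a tiny demo table; `infer_type`, `first_row_types`, and `df_demo` are made-up names for this example.

```python
import pandas as pd

def infer_type(token):
    """Return int, float, or str for a single text token."""
    for caster in (int, float):
        try:
            caster(token)
            return caster
        except ValueError:
            continue
    return str

# Two rows taken from the catalog, truncated to four columns
schema = ['id', 'name', 'SNR', 'NSpec']
rows = [
    ['1', 'PSZ2G000.04+45.13', '6.75319', '-'],
    ['7', 'PSZ2G002.42+69.64', '4.62169', '3'],
]

# Infer one type per column from the first row only
first_row_types = dict(zip(schema, map(infer_type, rows[0])))

# Convert only the columns that the first row marked as numeric
df_demo = pd.DataFrame(rows, columns=schema)
numeric_cols = [col for col, typ in first_row_types.items() if typ is not str]
df_demo[numeric_cols] = df_demo[numeric_cols].apply(pd.to_numeric, errors='coerce')
print(first_row_types)
print(df_demo.dtypes)
```

Here `NSpec` stays as text because of the `'-'` in the first row. Checking several rows, or a whole column, before deciding on a type would be more robust.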