# A1 Train the generative synthetic cluster model

Owner: **Tamas Norbert Varga** @vargatn


This notebook describes how to set up and train the cluster model.


## Objectives:

    1 Load and collate galaxy catalogs 
    
    2 Set up cluster line-of-sight emulation script (build KDE models)
    
    3 INTENSIVE: Draw samples from the proposal galaxy catalog and calculate survival scores for rejection sampling.

## Setup

This notebook relies on:


    * synthetic package & dependencies
    
    * DC2 data files hosted separately
    

## Output

    * samples drawn from the KDE models
    * scores of the samples drawn from the KDE models; the score is P(sample)

The output files are placed in the `./data/` folder.
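
The scores listed above are meant to be used for rejection sampling. As a rough illustration of that idea (a generic sketch, not the `synthetic` package's implementation), each proposal can be kept with probability `score / max(score)`. The function name `rejection_mask` and the toy target below are assumptions made for this example only.

```python
import numpy as np


def rejection_mask(scores, rng):
    """Return a boolean mask of accepted proposals.

    Each proposal is kept with probability score / max(score).
    """
    scores = np.asarray(scores, dtype=float)
    accept_prob = scores / scores.max()
    return rng.uniform(size=scores.size) < accept_prob


# Toy check: flat proposals on [-5, 5], target density is a unit Gaussian.
rng = np.random.default_rng(1)
proposals = rng.uniform(-5.0, 5.0, size=200_000)
# target density divided by the (constant) proposal density, up to a factor
scores = np.exp(-0.5 * proposals**2)
accepted = proposals[rejection_mask(scores, rng)]
print(accepted.size, accepted.mean(), accepted.std())
```

In this toy case the accepted points follow the Gaussian target, so their standard deviation comes out close to one. How the package combines the individual KDE scores into a final survival score is not shown here.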


**The calculations in this notebook are time consuming and are intended for an OpenMP HPC environment** (not NERSC). These cells are currently commented out.
 

In [34]:
import fitsio as fio
import numpy as np
import pandas as pd

import copy
import sys
import glob
import os
import matplotlib.pyplot as plt
%matplotlib inline

import healpy as hp
import matplotlib as mpl
import subprocess as sp

import scipy.interpolate as interpolate
import pickle

import multiprocessing as mp


# this package is installed from https://github.com/vargatn/skysampler/tree/lsst-dev  
# import skysampler_lsst.emulator as emulator
# import skysampler_lsst.utils as utils
# from skysampler_lsst.reader import result_reader

import synthetic.tools as tools
import synthetic.emulator.emulator as emulator
import synthetic.emulator.indexer as indexer
import synthetic.emulator.reader as reader

***Important:*** set the `nprocess` value to the number of CPU cores you are willing or allowed to use for this notebook.

Please be considerate of your local server, e.g. don't run heavy calculations on public login nodes.

***Note: this notebook is computationally heavy to execute; consider restricting the number of galaxies you want to metacalibrate to reduce runtime when testing.***
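
One way to avoid oversubscribing a shared machine is to clamp the requested process count to the cores actually available to the current process. The helpers `available_cores` and `choose_nprocess` below are hypothetical names introduced for this sketch; they are not part of the `synthetic` package.

```python
import os


def available_cores():
    """Number of CPU cores this process may use."""
    # sched_getaffinity respects scheduler/cgroup pinning, but is not available on every OS
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def choose_nprocess(requested, reserve=0):
    """Clamp `requested` to the usable cores, keeping `reserve` cores free."""
    usable = max(1, available_cores() - reserve)
    return max(1, min(requested, usable))


nprocess_safe = choose_nprocess(4)
print(nprocess_safe)
```

The result could then be used in place of a hard-coded `nprocess` value.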


In [36]:
nprocess = 4
# nprocess = 160

nsamples = 400 # number of proposal samples to draw
# nsamples = 1600000 # number of proposal samples to draw
nrepeats = 1 # number of times the full run is repeated
# nrepeats = 4 # number of times the full run is repeated

## Setting up the file structure

The data files for this example calculation are prepackaged and should be downloaded from the link provided in the [data access](DATA.md) instructions:

    1 dc2-alpha_concentric_sample-v01_test-03.tar.gz
    2 dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixels.h5
    3 dc2_cluster_sim_cutouts/clust_dc2-sim-LOS_v1.h5
    
These should be downloaded and placed in a file structure such that

    /root/
    |----/resamples/ 
    |----/dc2-alpha_concentric_sample-v01_test-03.tar.gz
    |----/dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixels.h5
    |----/dc2_cluster_sim_cutouts/clust_dc2-sim-LOS_v1.h5
    
From within the root folder, extract the .tar.gz file using the command

    tar xzf dc2-alpha_concentric_sample-v01_test-03.tar.gz -C  resamples --strip-components 1    
    
This should yield a file structure as below
 
    /root/
    |----/resamples/ 
    |--------------/dc2-alpha_concentric_sample-v01_test-03_run0_1846435878_rbin0.p
    |--------------/dc2-alpha_concentric_sample-v01_test-03_run0_1846435878_rbin0_samples.fits
    |--------------/dc2-alpha_concentric_sample-v01_test-03_run0_1846435878_rbin0_scores.fits
            .
            .
            .
    |--------------/dc2-alpha_concentric_sample-v01_test-03_run3_664487101_rbin3.p
    |--------------/dc2-alpha_concentric_sample-v01_test-03_run3_664487101_rbin3_samples.fits
    |--------------/dc2-alpha_concentric_sample-v01_test-03_run3_664487101_rbin3_scores.fits            
    |----/dc2-alpha_concentric_sample-v01_test-03.tar.gz
    |----/dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixels.h5
    |----/dc2_cluster_sim_cutouts/clust_dc2-sim-LOS_v1.h5
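
Before starting a long run it can help to verify that the layout above is in place. The following is a minimal sketch using only the standard library; `check_layout` is a hypothetical helper, and the demo builds a throw-away directory rather than touching real data.

```python
import tempfile
from pathlib import Path

# the two HDF5 files named in the file structure above
EXPECTED_FILES = [
    "dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixels.h5",
    "dc2_cluster_sim_cutouts/clust_dc2-sim-LOS_v1.h5",
]


def check_layout(root):
    """Return (missing files, number of *_samples.fits files in resamples/)."""
    root = Path(root)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    resamples = root / "resamples"
    n_samples = len(list(resamples.glob("*_samples.fits"))) if resamples.is_dir() else 0
    return missing, n_samples


# Demo on a temporary directory with one file deliberately left out
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "resamples").mkdir()
    (root / "dc2_cluster_sim_cutouts").mkdir()
    (root / "dc2_cluster_sim_cutouts" / "cosmoDC2_v1.1.4_refpixels.h5").touch()
    (root / "resamples" / "dc2-alpha_concentric_sample-v01_test-03_run0_1846435878_rbin0_samples.fits").touch()
    missing, n_samples = check_layout(root)

print("missing:", missing)
print("sample files found:", n_samples)
```

Calling `check_layout` on the real root folder should report an empty `missing` list once both files are in place.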

Paths for data preparation

In [37]:
# # cutout paths string
# cutout_fname_base = "/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2_cluster_sim_cutouts/clust-{}_dc2-sim-cutout.h5"
# # cutout output name
# cutout_oname = "/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2_cluster_sim_cutouts/clust_dc2-sim-LOS_v1.h5"

# # reference pixel filename input
# fname_refpixel = "/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixel-{}.h5"
# # reference pixel filename output
# oname_refpixel = "/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixels.h5"



Path for resampling calculations

In [96]:
# paths for calculations
root_path = "/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/"
deep_data_path = root_path + "dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_refpixels.h5"
wide_data_path = root_path + "dc2_cluster_sim_cutouts/clust_dc2-sim-LOS_v1.h5"

In [98]:
tag_root = "dc2-example" # this is what the current output files will be saved as

# Data Preparation

## Concatenate cluster catalog cutouts

We concatenate the galaxy catalog cutouts around clusters. These are initially saved in a separate file for each cluster; for further processing they are saved again in a single file.

In [99]:
# redmapper catalog paths
# redmapper_path = "/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2_cluster_sim_cutouts/cosmoDC2_v1.1.4_redmapper_v0.7.5_clust.h5"
# clusters = pd.read_hdf(redmapper_path, key="data")

In [100]:
# ii = (clusters["richness"] > 30) & (clusters["richness"] < 60) & (clusters["redshift"] > 0.3) & (clusters["redshift"] < 0.35)
# print(ii.sum())
# clusters[ii].to_hdf("./data/refdata/dc2_redmapper_example_sel.h5", key="data")

Note that above we restrict the redshift range to 0.3 - 0.35. This is done because the emulated clusters will be representative of the ensemble properties of the selection. A narrow redshift range minimizes the intrinsic spread in apparent photometric properties.

There are 41 clusters in this selection.


In [101]:
clusters = pd.read_hdf("./data/refdata/dc2_redmapper_example_sel.h5", key="data")

In [102]:
# table = []
# for i, cid in enumerate(clusters["cluster_id"]):
#     print(i, cid)
#     fname = cutout_fname_base.format(cid)
#     tab = pd.read_hdf(fname, key="data")
#     tab["cluster_id"] = cid
#     tab = tab[tab["R"] < 16 ]
#     table.append(tab)
# table = pd.concat(table)
# table.to_hdf(cutout_oname, key="data")

## Reference field data

Since we saved all galaxies from three randomly selected healpix pixels, we now concatenate them for further processing.

In [103]:
# pixels = [8786, 8791, 9937]
# refpixel = []
# for pix in pixels:
#     print(pix)
#     tmp = pd.read_hdf(fname_refpixel.format(pix), key="data")
#     refpixel.append(tmp)
# refpixel = pd.concat(refpixel)

In [104]:
# refpixel["R"] = np.sqrt(np.random.uniform(0, 16**2., size=len(refpixel)))

We assign a mock uniform radial profile to the reference field galaxies. If available, this can be replaced by the radial profile around random points (such as redmapper randoms). That approach is a bit more advanced and captures the unevenness and edges of the survey footprint.
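
To see why the mock radius is drawn as the square root of a uniform variate: if $u \sim U(0, R_{\max}^2)$ and $R = \sqrt{u}$, then $p(R) \propto R$, which is a constant number of galaxies per unit area. The sketch below checks this numerically by counting points in equal-area annuli. The sample size and number of annuli are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
r_max = 16.0
n_gal = 200_000

# same recipe as for the reference field: R = sqrt(U(0, r_max^2))
R = np.sqrt(rng.uniform(0.0, r_max**2, size=n_gal))

# annuli of equal area have edges at r_max * sqrt(k / n_bins)
n_bins = 8
edges = r_max * np.sqrt(np.arange(n_bins + 1) / n_bins)
counts, _ = np.histogram(R, bins=edges)
frac = counts / n_gal
print(frac)
```

Each annulus holds close to `1 / n_bins` of the points, as expected for a flat surface density.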

In [105]:
# refpixel.to_hdf(oname_refpixel, key="data")

# Train KDE model and calculate survival scores

In [106]:
NREPEATS = nrepeats # number of times the full run is repeated
NSAMPLES = nsamples # number of proposal samples to draw
NCHUNKS = nprocess # number of CPU cores to use
bandwidth=0.1 # Gaussian KDE bandwidth in the eigen-feature space after applying PCA

# data paths, the code will create a subfolder within root_path based on tag_root

# The number of galaxies scales with the surface area element; to avoid modeling very unbalanced PDFs,
# a series of nested concentric segments are modeled consecutively and later stitched together
LOGR_DRAW_RMINS = np.array([-3, -0.5, 0., 0.5])
LOGR_DRAW_RMAXS = np.array([-0.5, 0., 0.5, 1.2])
LOGR_CAT_RMAXS = [0., 0.5, 1.1, 1.2]

# feature aliases and definitions from the deep / reference dataset for comparison with the wide dataset
deep_c_settings = {
    "columns": [
        ("MAG_I", "mag_i"),
        ("COLOR_G_R", ("mag_g", "mag_r", "-")),
        ("COLOR_R_I", ("mag_r", "mag_i", "-")),
    ],
    # one entry per column above
    "logs": [False, False, False],
    "limits": [(17, 22.5), (-1, 3), (-1, 3)],
}

# feature aliases and definitions for all features we want to model and inherit from the deep / reference fields
deep_smc_settings = {
    "columns": [
        ("GABS", ("ellipticity_1_true", "ellipticity_2_true", "SQSUM")),
        ("SIZE", "size_true"),
        ("MAG_I", "mag_i"),
        ("COLOR_G_R", ("mag_g", "mag_r", "-")),
        ("COLOR_R_I", ("mag_r", "mag_i", "-")),
        ("COLOR_I_Z", ("mag_i", "mag_z", "-")),
        ("STELLAR_MASS", "stellar_mass"),
        ("HALO_MASS", "halo_mass")
    ],
    "logs": [False, True, False, False, False, False, True, True],
    "limits": [(0., 1.), (-1, 5), (17, 25), (-1, 3), (-1, 3), (-1, 3), (10**3, 10**13), (10**9, 10**16)],
}

# feature aliases and definitions from wide dataset
wide_cr_settings = {
    "columns": [
        ("MAG_I", "mag_i"),
        ("COLOR_G_R", ("mag_g", "mag_r", "-")),
        ("COLOR_R_I", ("mag_r", "mag_i", "-")),
        ("LOGR", "R"),
    ],
    "logs": [False, False, False, True],
    "limits": [(17, 22.5), (-1, 3), (-1, 3), (1e-3, 16.), ],
}

# the radial profile around clusters from the wide dataset
wide_r_settings = {
    "columns": [
        ("MAG_I", "mag_i"),
        ("LOGR", "R"),
    ],
    "logs": [False, True,],
    "limits": [(17, 22.5), (1e-3, 16.),],
}
# features to use for rejection sampling
columns = {
    "cols_dc": ["COLOR_G_R", "COLOR_R_I",],
    "cols_wr": ["LOGR",],
    "cols_wcr": ["COLOR_G_R", "COLOR_R_I", "LOGR",],
}
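
To make the structure of the settings dictionaries above more concrete, here is a minimal sketch of how such a specification could be turned into feature arrays. This is not the `synthetic` package's implementation: `build_features` is a hypothetical helper, and two points are assumptions read off the dictionaries rather than confirmed behavior. First, `"SQSUM"` is taken to mean the square root of the sum of squares. Second, limits are applied in linear units before the optional `log10`, as suggested by limits such as `(1e-3, 16.)` for `LOGR`.

```python
import numpy as np


def build_features(data, settings):
    """Build {alias: array} from a dict of columns and a settings dict (hypothetical helper)."""
    feats = {}
    for alias, spec in settings["columns"]:
        if isinstance(spec, str):
            # plain column, only renamed
            feats[alias] = np.asarray(data[spec], dtype=float)
        else:
            col_a, col_b, op = spec
            x = np.asarray(data[col_a], dtype=float)
            y = np.asarray(data[col_b], dtype=float)
            if op == "-":
                feats[alias] = x - y
            elif op == "SQSUM":  # assumed meaning: sqrt(x^2 + y^2)
                feats[alias] = np.sqrt(x**2 + y**2)
            else:
                raise ValueError("unknown operation: {}".format(op))

    # keep only rows inside all limits (limits given in linear units)
    keep = np.ones(len(next(iter(feats.values()))), dtype=bool)
    for (alias, _), (lo, hi) in zip(settings["columns"], settings["limits"]):
        keep &= (feats[alias] > lo) & (feats[alias] < hi)

    # take log10 of the flagged features after the cut
    out = {}
    for (alias, _), use_log in zip(settings["columns"], settings["logs"]):
        vals = feats[alias][keep]
        out[alias] = np.log10(vals) if use_log else vals
    return out


toy = {
    "mag_g": [21.0, 23.0, 20.0],
    "mag_r": [20.0, 22.5, 19.5],
    "mag_i": [19.5, 24.0, 19.0],
    "R": [1.0, 2.0, 100.0],
}
toy_settings = {
    "columns": [("MAG_I", "mag_i"), ("COLOR_G_R", ("mag_g", "mag_r", "-")), ("LOGR", "R")],
    "logs": [False, False, True],
    "limits": [(17, 22.5), (-1, 3), (1e-3, 16.)],
}
features = build_features(toy, toy_settings)
print(features)
```

In the toy table only the first row survives: the second is too faint in `mag_i` and the third lies outside the radial limit.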

The script below carries out most of the heavy-lifting calculations:

1) loads the data

2) constructs the features from the above dictionaries

3) transforms the features into their eigenspace and builds a KDE

4) draws NSAMPLES proposal points from the features in deep_smc_settings

5) scores each proposal point based on the KDE models of the other features (scores are transformed according to the PCA Jacobian of each feature space)

6) saves the samples, scores, and Jacobian for each draw

This section is commented out, as it takes roughly a few hundred CPU hours to run and is not optimized for NERSC job managers. It was run on a local computing resource at LMU Munich.
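
The eigenspace transform, proposal draw, and scoring steps can be illustrated with a small NumPy-only sketch. It is a generic stand-in, not the code in `synthetic.emulator`: the class name `EigenSpaceKDE` is hypothetical, and the sketch assumes that features are rotated onto the PCA axes and rescaled to unit variance, so that a single isotropic bandwidth applies and the Jacobian term is minus the sum of the log standard deviations.

```python
import numpy as np


class EigenSpaceKDE:
    """Toy Gaussian KDE evaluated in a standardized PCA eigenspace."""

    def __init__(self, data, bandwidth=0.1):
        data = np.asarray(data, dtype=float)
        self.mean = data.mean(axis=0)
        cov = np.atleast_2d(np.cov(data, rowvar=False))
        evals, self.evecs = np.linalg.eigh(cov)
        self.scale = np.sqrt(evals)
        self.bandwidth = bandwidth
        self.points = self.transform(data)
        # log |det dz/dx|; the rotation has unit determinant, only the rescaling contributes
        self.log_jacobian = -np.sum(np.log(self.scale))

    def transform(self, x):
        """Rotate onto the eigenvectors and rescale to unit variance."""
        return (np.asarray(x, dtype=float) - self.mean) @ self.evecs / self.scale

    def inverse_transform(self, z):
        return (z * self.scale) @ self.evecs.T + self.mean

    def draw(self, nsamples, rng):
        """Draw from the KDE: pick data points and add Gaussian kernel noise."""
        idx = rng.integers(0, len(self.points), size=nsamples)
        noise = rng.normal(0.0, self.bandwidth, size=(nsamples, self.points.shape[1]))
        return self.inverse_transform(self.points[idx] + noise)

    def log_score(self, x):
        """Log density in the original feature space."""
        z = self.transform(x)
        npts, ndim = self.points.shape
        dist2 = ((z[:, None, :] - self.points[None, :, :]) ** 2).sum(axis=-1)
        log_kernel = -0.5 * dist2 / self.bandwidth**2
        # log-sum-exp over the data points, stabilized by the largest term
        peak = log_kernel.max(axis=1)
        log_sum = peak + np.log(np.exp(log_kernel - peak[:, None]).sum(axis=1))
        log_norm = np.log(npts) + ndim * np.log(self.bandwidth * np.sqrt(2.0 * np.pi))
        return log_sum - log_norm + self.log_jacobian


rng = np.random.default_rng(0)
data = rng.multivariate_normal([20.0, 1.0], [[1.0, 0.3], [0.3, 0.2]], size=2000)
kde = EigenSpaceKDE(data, bandwidth=0.1)
samples = kde.draw(5, rng)
log_scores = kde.log_score(samples)
print(samples.shape, log_scores)
```

The pairwise-distance array in `log_score` grows as (queries x data points), so for catalog-sized inputs the evaluation would need to be chunked or parallelized, which is presumably what the `nchunks` argument in the cell below is for.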


In [None]:
print("started reading")
refpixel = pd.read_hdf(deep_data_path, key="data")
table = pd.read_hdf(wide_data_path, key="data")

print("creating output folder")
root_path = root_path + tag_root + "/"
print(root_path)
if not os.path.isdir(root_path):
    os.mkdir(root_path)

nrbins = len(LOGR_DRAW_RMINS)
print("NRBINS:", nrbins)

for nrep in np.arange(NREPEATS):
    tag = tag_root + "_run" + str(nrep)
    print("running repeat", nrep, "out of", NREPEATS)
    print(tag)

    master_seed = np.random.randint(0, np.iinfo(np.int32).max, 1)[0]
    rng = np.random.RandomState(seed=master_seed)
    seeds = rng.randint(0, np.iinfo(np.int32).max, nrbins * 5)

    i = 0
    print("starting concentric shell resampling")
    for i in np.arange(nrbins):
        print("rbin", i)
        outname = root_path + "/" + tag + "_{:1d}".format(master_seed) + "_rbin{:d}".format(i)
        print(outname)

        # loading random data
        # deepcopy is needed: a shallow .copy() shares the "limits" list and would modify the original dict
        # five seeds are reserved per radial bin, hence the 5 * i offset (avoids reusing a seed across bins)
        tmp_wide_r_settings = copy.deepcopy(wide_r_settings)
        tmp_wide_r_settings["limits"][-1] = (10**-3, 10**LOGR_CAT_RMAXS[i])
        _wide_r_settings_rands = emulator.construct_deep_container(refpixel, tmp_wide_r_settings, seed=seeds[5 * i + 0], drop="MAG_I")

        tmp_wide_cr_settings = copy.deepcopy(wide_cr_settings)
        tmp_wide_cr_settings["limits"][-1] = (10**-3, 10**LOGR_CAT_RMAXS[i])
        _wide_cr_settings_rands = emulator.construct_deep_container(refpixel, tmp_wide_cr_settings, seed=seeds[5 * i + 1], drop="MAG_I")

        # loading deep catalogs
        _deep_c_settings = emulator.construct_deep_container(refpixel, deep_c_settings, seed=seeds[5 * i + 2], drop="MAG_I")
        _deep_smc_settings = emulator.construct_deep_container(refpixel, deep_smc_settings, seed=seeds[5 * i + 3])

        # loading cluster data
        tmp_wide_cr_settings = copy.deepcopy(wide_cr_settings)
        tmp_wide_cr_settings["limits"][-1] = (10**-3, 10**LOGR_CAT_RMAXS[i])
        _wide_cr_settings_clust = emulator.construct_deep_container(table, tmp_wide_cr_settings, seed=seeds[5 * i + 4], drop="MAG_I")

        infodicts, samples = emulator.make_classifier_infodicts(_wide_cr_settings_clust, _wide_r_settings_rands,
                                                                _wide_cr_settings_rands,
                                                                _deep_c_settings, _deep_smc_settings,
                                                                columns, nsamples=NSAMPLES, nchunks=NCHUNKS,
                                                                bandwidth=bandwidth,
                                                                rmin=LOGR_DRAW_RMINS[i],
                                                                rmax=LOGR_DRAW_RMAXS[i])

        fname = outname + "_samples.fits"
        print(fname)
        fio.write(fname, samples.to_records(), clobber=True)
        master_dict = {
            "columns": infodicts[0]["columns"],
            "bandwidth": infodicts[0]["bandwidth"],
            "deep_c_settings": deep_c_settings,
            "deep_smc_settings": deep_smc_settings,
            "wide_r_settings": tmp_wide_r_settings,
            "wide_cr_settings": tmp_wide_cr_settings,
            "rmin": infodicts[0]["rmin"],
            "rmax": infodicts[0]["rmax"],
        }
        # close the pickle file explicitly once it is written
        with open(outname + ".p", "wb") as pfile:
            pickle.dump(master_dict, pfile)
        print("calculating scores")
        result = emulator.run_scores2(infodicts)
        print("finished calculating scores")
        fname = outname + "_scores.fits"
        print(fname)
#         fio.write(fname, result.to_records(), clobber=True)


started reading
creating output folder
/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2-example/
NRBINS: 4
running repeat 0 out of 1
dc2-example_run0
starting concentric shell resampling
rbin 0
/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2-example//dc2-example_run0_1909520610_rbin0
(865, 1)
(865,)
(865, 3)
(865,)
(220873, 2)
(220873,)
(1513572, 8)
(1513572,)
(1481, 3)
(1481,)
/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2-example//dc2-example_run0_1909520610_rbin0_samples.fits
calculating scores
finished calculating scores
/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2-example//dc2-example_run0_1909520610_rbin0_scores.fits
rbin 1
/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2-example//dc2-example_run0_1909520610_rbin1
(8597, 1)
(8597,)
(8597, 3)
(8597,)
(220873, 2)
(220873,)
(1513572, 8)
(1513572,)
(9421, 3)
(9421,)
/e/ocean1/users/vargatn/LSST/DC2_1.1.4/clusters_v01/dc2-example//dc2-example_run0_1909520610_rbin1_samples.fits
calculating scor