# RAIL SOMPZ Informer

**Authors:** Sam Schmidt, Justin Myles

**Last Successfully Run:** April 30, 2024

This notebook demonstrates the training of the "deep" and "wide" Self-Organizing Maps (SOMs) used by `rail_sompz`.  `rail_sompz` is a port of the Dark Energy Survey (DES) SOM-based tomographic redshift bin software.

Like other RAIL estimators and summarizers, `rail_sompz` consists of an informer stage and an estimator stage, in this case `SOMPZInformer` and `SOMPZEstimator`.  `SOMPZInformer` takes in both the "deep" data (usually taken over a smaller area than our "wide" data, and usually including additional photometric bands) and "wide" data and trains a pair of SOMs that will be used by the estimator stage.  
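
To give a rough feel for what "training a SOM" involves, here is a minimal sketch of a single generic Kohonen update step on a toy map. The function names, the error-scaled distance, and the Gaussian neighborhood are illustrative assumptions and are not taken from the `rail_sompz` implementation.

```python
import numpy as np


def best_matching_unit(weights, x, x_err):
    """Return the (row, col) of the cell whose weight vector is closest to x.

    The distance is an error-scaled squared difference (chi-square-like).
    This metric is an illustrative assumption, not rail_sompz's definition.
    """
    chi2 = np.sum(((weights - x) / x_err) ** 2, axis=-1)
    return np.unravel_index(np.argmin(chi2), chi2.shape)


def som_update(weights, x, x_err, learning_rate=0.5, sigma=1.0):
    """One generic Kohonen update: pull cells near the best match toward x."""
    n_rows, n_cols, _ = weights.shape
    bmu_row, bmu_col = best_matching_unit(weights, x, x_err)
    rows, cols = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    grid_dist2 = (rows - bmu_row) ** 2 + (cols - bmu_col) ** 2
    # Gaussian neighborhood on the map grid: 1 at the best match, falling off with distance
    neighborhood = np.exp(-grid_dist2 / (2.0 * sigma**2))
    return weights + learning_rate * neighborhood[..., None] * (x - weights)


# Toy 4x4 map with 3 features per cell and a single training object
rng = np.random.default_rng(0)
weights = rng.uniform(0.0, 1.0, size=(4, 4, 3))
x = np.array([0.2, 0.5, 0.9])
x_err = np.array([0.05, 0.05, 0.05])

bmu = best_matching_unit(weights, x, x_err)
new_weights = som_update(weights, x, x_err)
```

Repeating this step over many objects, while shrinking `learning_rate` and `sigma`, is what gradually organizes neighboring cells to hold similar galaxies.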

There are a number of **configuration parameters** that the stage uses to control aspects of the SOM training:
- redshift_col: the name of the redshift column
- deep_groupname: the hdf5_groupname for deep data
- wide_groupname: the hdf5_groupname for wide data
- inputs_deep: the list of the names of columns to be used as inputs for deep data
- input_errs_deep: the list of the names of columns containing errors on inputs for deep data
- inputs_wide: the list of the names of columns to be used as inputs for wide data
- input_errs_wide: the list of the names of columns containing errors on inputs for wide data
- zero_points_deep: the list of zero points for converting mags to fluxes for deep data, if needed
- zero_points_wide: the list of zero points for converting mags to fluxes for wide data, if needed
- som_shape_deep: a tuple defining the shape of the deep SOM, must be a 2-element tuple, e.g. `(32, 32)`
- som_shape_wide: a tuple defining the shape of the wide SOM, must be a 2-element tuple, e.g. `(25, 25)`
- som_minerror_deep: the floor value placed on observational error on each feature in the deep SOM
- som_minerror_wide: the floor value placed on observational error on each feature in the wide SOM
- som_wrap_deep: boolean flag to set whether the deep SOM has periodic boundary conditions
- som_wrap_wide: boolean flag to set whether the wide SOM has periodic boundary conditions
- som_take_log_deep: boolean flag to set whether to take the log of inputs (i.e. for fluxes) for the deep SOM
- som_take_log_wide: boolean flag to set whether to take the log of inputs (i.e. for fluxes) for the wide SOM
- convert_to_flux_deep: boolean flag for whether to convert input columns to fluxes for deep data, set to `True` if inputs are mags and to `False` if inputs are already fluxes
- convert_to_flux_wide: boolean flag for whether to convert input columns to fluxes for wide data
- set_threshold_deep: boolean flag for whether to replace values below a threshold with a set number
- thresh_val_deep: threshold value for set_threshold for deep data
- set_threshold_wide: boolean flag for whether to replace values below a threshold with a set number
- thresh_val_wide: threshold value for set_threshold for wide data

We will set several of these values in our example; any values not explicitly set will revert to their defaults.
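
As an illustration of what the `som_wrap_*` flags mean, the sketch below compares the distance between two cells on a flat map and on a map with periodic boundaries. The helper names are hypothetical, and this is only a sketch of the idea, not the distance function used inside `rail_sompz`.

```python
def grid_offset(a, b, size, wrap):
    """Separation between indices a and b along one SOM axis of length `size`."""
    d = abs(a - b)
    if wrap:
        # With periodic boundaries the shorter way around the axis counts
        d = min(d, size - d)
    return d


def cell_distance(cell_a, cell_b, shape, wrap=False):
    """Euclidean distance between two SOM cells, measured on the map grid."""
    d_row = grid_offset(cell_a[0], cell_b[0], shape[0], wrap)
    d_col = grid_offset(cell_a[1], cell_b[1], shape[1], wrap)
    return (d_row**2 + d_col**2) ** 0.5


shape = (32, 32)
flat = cell_distance((0, 0), (31, 0), shape, wrap=False)      # opposite edges, far apart
periodic = cell_distance((0, 0), (31, 0), shape, wrap=True)   # opposite edges become neighbors
print(flat, periodic)
```

With wrapping turned on, cells on opposite edges of the map are treated as adjacent, so no cell sits at a boundary.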

Let's start by importing a few packages, including `SOMPZInformer` and setting up the RAIL DataStore:

References

A. Campos et al. (DES Collaboration) - Enhancing weak lensing redshift distribution characterization by optimizing the Dark Energy Survey Self-Organizing Map Photo-z method (in preparation)

C. Sánchez, M. Raveri, A. Alarcon, G. Bernstein - Propagating sample variance uncertainties in redshift calibration: simulations, theory, and application to the COSMOS2015 data

R. Buchs, et al. - Phenotypic redshifts with self-organizing maps: A novel method to characterize redshift distributions of source galaxies for weak lensing

J. Myles, A. Alarcon, et al. (DES Collaboration) - Dark Energy Survey Year 3 results: redshift calibration of the weak lensing source galaxies

In [None]:
# usual imports
import datetime
import os
import numpy as np
#from rail.core.utils import RAILDIR
import matplotlib.pyplot as plt

In [None]:
import h5py

In [None]:
import tables_io
import rail

In [None]:
rail.__path__

In [None]:
from rail.estimation.algos.sompz import SOMPZInformer

In [None]:
from rail.core.data import TableHandle
from rail.core.stage import RailStage

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

Next, let's read in some test data. We'll use some small datasets drawn from the [Cardinal simulations](https://chunhaoto.com/cardinalsim/), into which we have incorporated expected 10-year depth photometric uncertainties via the photerr-based error models in RAIL.

For the "deep" SOM, the data file is named `balrog_data_subcatalog.h5` and includes the LSST `ugriz` bands as well as the VISTA `YJHKs` bands, for **nine** total bands.  The extra near-infrared information in the VISTA bands will be crucial in mapping out the color-redshift relation for our deep sample.  There are TODO galaxies in this file, cut to include only galaxies with TODO.


For the "wide" SOM, the data file is named `wide_data_subsample.hdf5`, and we will use only the `griz` bands in the analysis.  There are TODO galaxies in this file.

The data are included in a subdirectory of this directory, `examples/datafiles/` (TODO), and we can read them directly into the Data Store:

In [None]:
#from rail.core.utils import find_rail_file
#datadir = '/global/cfs/projectdirs/des/jmyles/sompz_desc/'
datestr = '2024-06-24'
datadir = f'/pscratch/sd/j/jmyles/sompz_buzzard/{datestr}/'
outdir = os.path.join(datadir, 'run-2024-06-28')
os.makedirs(outdir, exist_ok=True)  # create the output directory if it does not exist
trainFileDeep = os.path.join(datadir, 'balrog_data_subcatalog.h5') #'./datafiles/romandesc_deep_data_3700.hdf5'
trainFileWide = os.path.join(datadir, 'wide_data_subsample.hdf5') # wide_data.h5 #'./datafiles/romandesc_wide_data_5000.hdf5'
model_file = os.path.join(outdir, f"DEMO_CARDINAL_model_{datestr}.pkl") # model storing SOMs, to be created further down in the notebook

In [None]:
deep_data = DS.read_file("input_deep_data", TableHandle, trainFileDeep)
wide_data = DS.read_file("input_wide_data", TableHandle, trainFileWide)
#wide_data = TableHandle("input_wide_data", path=trainFileWide)
#wide_data_chunk = next(wide_data_handle.iterator(groupname='catalog', chunk_size=10_000_000))

Let's take a look at what the names of the columns are in the deep file:

In [None]:
#print('\n'.join(sorted(deep_data()['key'].keys())))
deep_data.data['key'].head()

The deep file contains the Rubin bands and their errors, with names like `TRUEMAG_lsst_u` and `TRUEMAG_ERR_lsst_u`, and the VISTA NIR bands `YJHKs` and their errors.  We will use just the magnitude (or flux) quantities and not the colors when constructing our example SOM.  For our "deep" SOM we will use all nine of `ugriz` and `YJHKs`, while for the "wide" SOM we will use only `griz`.  Let's set up some lists with the column names that will be used in our configs.  The SOM also requires a zero point if we are going to convert to flux (which we are, for the deep data), so we will supply zero points of 30.0 for all deep bands in this demo:

In [None]:
bands_deep = ['lsst_u', 'lsst_g', 'lsst_r', 'lsst_i', 'lsst_z', 
              'VISTA_Filters_at80K_forETC_Y', 'VISTA_Filters_at80K_forETC_J', 'VISTA_Filters_at80K_forETC_H', 'VISTA_Filters_at80K_forETC_Ks',]
bands_wide = ['G','R','I','Z',] # 'U', 'Y','J','H','K'

deepbands = []
deeperrs = []
zeropts = []
for band in bands_deep:
    deepbands.append(f'TRUEMAG_{band}')
    deeperrs.append(f'TRUEMAG_ERR_{band}')
    zeropts.append(30.)

widebands = []
wideerrs = []  
for band in bands_wide: #[:6]:
    widebands.append(f'FLUX_{band}')
    wideerrs.append(f'FLUX_ERR_{band}')
    
refband_deep=deepbands[3]
refband_wide=widebands[2]

In [None]:
print(deepbands)

In [None]:
print(widebands)
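
For reference, the zero points enter through the standard magnitude-to-flux relation. The sketch below shows that relation with first-order error propagation; the function name is hypothetical, and whether `rail_sompz` propagates errors in exactly this way is an assumption.

```python
import math


def mag_to_flux(mag, mag_err, zero_point=30.0):
    """Convert a magnitude and its error to a flux and flux error.

    Uses flux = 10**(-0.4 * (mag - zero_point)) and the first-order
    propagation flux_err = flux * mag_err * ln(10) / 2.5.
    """
    flux = 10.0 ** (-0.4 * (mag - zero_point))
    flux_err = flux * mag_err * math.log(10.0) / 2.5
    return flux, flux_err


# A 25th-magnitude object with a 0.1 mag error and the demo zero point of 30
flux, flux_err = mag_to_flux(25.0, 0.1, zero_point=30.0)
print(flux, flux_err)
```

An object five magnitudes brighter than the zero point therefore has a flux of about 100 in these units, and a 0.1 mag error corresponds to roughly a 9% flux error.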

Next, let's make a dictionary of the parameters that we'll feed into the informer for the deep and wide SOMs, including the non-default names of the input columns (`inputs_deep`) and their errors (`input_errs_deep`), along with their wide counterparts.  We'll also feed in a list of zero points (`zero_points_deep`).  Our deep inputs are magnitudes, so we set `convert_to_flux_deep` to `True`; `convert_to_flux_wide` is `False` because the wide-field catalog already stores fluxes.  We will also apply a threshold cut to the deep data by setting `set_threshold_deep` to `True`, with the threshold value `thresh_val_deep=1.e-5`.

We can set the shape of the SOMs or let them take their default values.  Let's leave the "deep" SOM with its default size of `(32, 32)` by not supplying a value, and set the "wide" SOM size explicitly with `som_shape_wide=(32,32)`.  We also set the error floor for the wide SOM with `som_minerror_wide=0.005`.  If your input data is flux-like (ours is only for the wide-field catalog) and you want it to look more magnitude-like, you can set `som_take_log_wide` to `True` to take the log of the data before creating the SOM.  We will set this to `False`, as we want to work in flux space.  Finally, `som_wrap_wide` sets whether or not to use periodic boundaries in the SOM; we will set this to `False` for the wide SOM.

In [None]:
som_params = dict(inputs_deep=deepbands, input_errs_deep=deeperrs,
                  zero_points_deep=zeropts, 
                  inputs_wide=widebands, input_errs_wide=wideerrs,
                  convert_to_flux_deep=True, convert_to_flux_wide=False, 
                  set_threshold_deep=True, thresh_val_deep=1.e-5, 
                  som_shape_wide=(32,32), som_minerror_wide=0.005,
                  som_take_log_wide=False, som_wrap_wide=False)
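
To make the threshold and error-floor options more concrete, here is a minimal sketch of what such operations could look like on a flux array. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that values below the threshold are replaced by the threshold itself, and that the error floor is an absolute value.

```python
import numpy as np


def apply_threshold(values, thresh_val):
    """Replace entries below thresh_val with thresh_val (assumed replacement value)."""
    return np.where(values < thresh_val, thresh_val, values)


def apply_error_floor(errors, min_error):
    """Raise any error smaller than min_error up to min_error (assumed absolute floor)."""
    return np.maximum(errors, min_error)


# Toy fluxes, including a tiny and a negative value that would break a later log
flux = np.array([3.2, 1.0e-7, -0.4, 12.0])
flux_err = np.array([0.2, 0.001, 0.05, 0.0])

clipped_flux = apply_threshold(flux, thresh_val=1.0e-5)
floored_err = apply_error_floor(flux_err, min_error=0.005)
print(clipped_flux)
print(floored_err)
```

A floor on the errors keeps any single very precise measurement from dominating an error-weighted distance, and a threshold on the fluxes avoids non-positive values.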

If you have used other RAIL packages you may have seen `hdf5_groupname` as a parameter that specifies the HDF5 group where your input data lives.  SOMPZ has the equivalent `deep_groupname` and `wide_groupname` config parameters.  For our example data we will set `deep_groupname` to `key` and `wide_groupname` to `""` to reflect the HDF5 structure of these catalogs.

We will also supply the `model` config parameter, which sets the name of the pickle file that will hold our output model, consisting of the two SOMs and a set of configuration parameters.  This model will be used by the estimation stage later in the demo.

Now, run the informer:

In [None]:
som_inform = SOMPZInformer.make_stage(name="cardinal_som_informer", 
                                      deep_groupname="key", 
                                      wide_groupname="",#, False, #"catalog",
                                      model=model_file, 
                                      **som_params)

In [None]:
#%%time
print(f'{datetime.datetime.now()} begin informing')
som_inform.inform(deep_data, wide_data)
print(f'{datetime.datetime.now()} done informing')

Training can take a while for large samples.  When it finishes, it should have created a file matching `DEMO_CARDINAL_model*.pkl`.  Let's look at the results by reading in the model we just wrote out and seeing what it contains.  It should be a dictionary containing two SOMs, `deep_som` and `wide_som`, along with the `deep_columns`, `wide_columns`, `deep_err_columns`, and `wide_err_columns` lists.  We store these column names because they define the ordering of the columns and errors, which must be the same for the data that we pass to the estimation stage.

In [None]:
import pickle

In [None]:
with open(model_file, "rb") as f:
    model = pickle.load(f)

In [None]:
model
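
Because the column ordering matters, it can be worth checking it before passing new data to the estimator. The sketch below uses a stand-in dictionary with the key names described above (plain strings replace the real SOM objects), and `columns_match` is a hypothetical helper, not part of `rail_sompz`.

```python
import pickle

# Stand-in for the real model: strings take the place of the actual SOM objects
toy_model = {
    "deep_som": "<deep SOM object>",
    "wide_som": "<wide SOM object>",
    "deep_columns": ["TRUEMAG_lsst_u", "TRUEMAG_lsst_g"],
    "deep_err_columns": ["TRUEMAG_ERR_lsst_u", "TRUEMAG_ERR_lsst_g"],
    "wide_columns": ["FLUX_G", "FLUX_R"],
    "wide_err_columns": ["FLUX_ERR_G", "FLUX_ERR_R"],
}


def columns_match(model, inputs, input_errs, prefix):
    """True if inputs and errors are in the same order as those stored in the model."""
    return (
        list(inputs) == list(model[f"{prefix}_columns"])
        and list(input_errs) == list(model[f"{prefix}_err_columns"])
    )


# Round-trip through pickle, as happens when the model file is written and read back
restored = pickle.loads(pickle.dumps(toy_model))

same_order = columns_match(restored, ["FLUX_G", "FLUX_R"], ["FLUX_ERR_G", "FLUX_ERR_R"], "wide")
swapped = columns_match(restored, ["FLUX_R", "FLUX_G"], ["FLUX_ERR_R", "FLUX_ERR_G"], "wide")
print(same_order, swapped)
```

The same column names in a different order fail the check, which is the case that would otherwise silently feed the wrong band into each SOM feature.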

There are some handy plotting functions available in the `rail.estimation.algos.som` module that let us visualize our SOM for some basic checks.  Let's first plot the cells colored by their mean properties, i.e. the mean values of g-i, i-y, u-g, and the i-band magnitude for each cell.  First, the deep SOM, using `somDomainColorsnok`, which plots only quantities that can be computed from `ugrizy`:

In [None]:
import rail.estimation.algos.som as SOMFUNCS

In [None]:
outfile = os.path.join(outdir, 'som_colors.png')
SOMFUNCS.somDomainColorsnok(model['deep_som'])
fig = plt.gcf()
fig.savefig(outfile,)
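
Since this SOM is trained in flux space, a color such as g-i corresponds to a flux ratio. A minimal sketch of that conversion follows, assuming both bands share the same zero point (so it cancels). The function name and the toy numbers are illustrative and are not taken from the plotting code.

```python
import numpy as np


def color_from_flux(flux_a, flux_b):
    """Color m_a - m_b from two fluxes that share a common zero point."""
    return -2.5 * np.log10(flux_a / flux_b)


# Toy per-cell mean fluxes in g and i for two cells
cell_flux_g = np.array([100.0, 50.0])
cell_flux_i = np.array([1000.0, 50.0])

g_minus_i = color_from_flux(cell_flux_g, cell_flux_i)
print(g_minus_i)
```

A cell that is ten times brighter in i than in g has g-i = 2.5 (red), while equal fluxes give a color of zero.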

And now, for the deep SOM using `somDomainColors`, which shows i-K in the upper right (in actuality i-F; the names are currently hardcoded):

In [None]:
outfile = os.path.join(outdir, 'som_colors_2.png')
SOMFUNCS.somDomainColors(model['deep_som'])
fig = plt.gcf()
fig.savefig(outfile,)

For comparison, here are the `somDomainColorsnok` plots for the wide SOM:

In [None]:
#SOMFUNCS.somDomainColorsnok(model['wide_som'])

And, as a final visualization, here are the locations of the mean colors of each SOM cell in i-K vs g-i color space, color-coded by the mean i-band magnitude of the SOM cell:

In [None]:
outfile = os.path.join(outdir, 'deep_som_color_color.png')
SOMFUNCS.somPlot2d(model['deep_som'])
fig = plt.gcf()
fig.savefig(outfile,)

This looks promising: our SOMs show coherent patterns in color and magnitude, as they should, and will enable us to map color space to redshift via the occupation of training galaxies in the SOMs.  In a separate notebook, `rail_sompz_estimation_demo.ipynb`, we will run the estimator stage and produce tomographic bin estimates for a test set of objects.
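
As a preview of how cell occupation connects to redshift, the sketch below computes the number of training galaxies and their mean redshift in each cell of a toy map. The cell assignments and redshifts are made-up numbers, and the helper is hypothetical; the actual estimator is covered in the estimation notebook.

```python
import numpy as np


def mean_redshift_per_cell(cell_index, redshift, n_cells):
    """Occupation count and mean redshift for each SOM cell (NaN if empty)."""
    counts = np.bincount(cell_index, minlength=n_cells)
    z_sum = np.bincount(cell_index, weights=redshift, minlength=n_cells)
    mean_z = np.full(n_cells, np.nan)
    occupied = counts > 0
    mean_z[occupied] = z_sum[occupied] / counts[occupied]
    return counts, mean_z


# Six toy galaxies assigned to a flattened map with four cells
cell_index = np.array([0, 0, 2, 3, 3, 3])
redshift = np.array([0.2, 0.4, 1.0, 0.5, 0.7, 0.9])

counts, mean_z = mean_redshift_per_cell(cell_index, redshift, n_cells=4)
print(counts)
print(mean_z)
```

Cells with no training galaxies are left as NaN here, which makes gaps in the redshift coverage of the map easy to spot.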