# RAIL SOMPZ Estimator Demo

Authors: Sam Schmidt, Justin Myles

Last successfully run: April 17th, 2025

This demo notebook follows the informer demo for the `rail_sompz` method, `rail_sompz_inform_demo.ipynb`, and uses the model files `DEMO_romandesc_model_deep.pkl` and `DEMO_romandesc_model_wide.pkl` that are created in that notebook.  So, you will need to run that notebook and train those two model SOMs before you run this demo, which shows how to run the estimate stage and produce tomographic bin estimates.

The algorithm works by determining weights for a spectroscopic dataset using a larger "deep" dataset, which is in turn related to a (usually much larger) "wide" dataset.  See [Buchs, Davis et al. 2019](https://arxiv.org/abs/1901.05005), [Myles, Alarcon et al. 2021](https://arxiv.org/pdf/2012.08566) and references in [Campos et al. 2023](https://github.com/AndresaCampos/sompz_y6) for more details on the method.

The full process entails multiple steps.  A recurring one is identifying the "best" cell in a SOM for each galaxy in the spectroscopic, deep/balrog, and wide data.  We must also compute weights for the mappings before finally assembling the tomographic bin estimates.  This notebook will go through the multiple stages necessary to construct the final N(z) estimates.


We'll start with our usual imports:

In [None]:
import os
import sys
import numpy as np
from rail.core.utils import RAILDIR
from rail.core import common_params
import tables_io
import matplotlib.pyplot as plt

import pandas as pd
import astropy.io.fits as fits

In [None]:
tables_io.__file__

In [None]:
#from rail.estimation.algos.sompz import SOMPZInformer
from rail.estimation.algos.sompz import SOMPZEstimatorWide, SOMPZEstimatorDeep
from rail.estimation.algos.sompz import SOMPZPzc, SOMPZPzchat, SOMPZPc_chat
from rail.estimation.algos.sompz import SOMPZTomobin, SOMPZnz

The SOMPZ method usually leverages a "deep" dataset with extra bands (often in the near-infrared), where the extended wavelength coverage enables a magnitudes/colors -> redshift mapping with fewer degeneracies than when using optical colors alone.  For this demo, we will use data from the Rubin-Roman simulation [Citation needed!], which contains simulated photometry for both the Rubin optical `ugrizy` bands and the Roman `JHFK` bands.  We have included a command-line tool in RAIL that will grab several data files that we will use in this demo.  If you ran the informer demo, they are already in place and you can ignore the following cell.  If you moved or deleted files, or just copied the model from the informer stage and still need the data, then uncomment the lines in the cell below to download the data files, untar them, and move them to the appropriate location.

In [None]:
# !curl -O https://portal.nersc.gov/cfs/lsst/PZ/roman_desc_demo_data.tar.gz
# !mkdir DEMODATA
# !tar -xzvf roman_desc_demo_data.tar.gz
# !mv romandesc*.hdf5 DEMODATA/

Now, let's load the three files that we will use into memory.  The "spec" file contains the galaxies with spectroscopic redshifts; these are usually a subset of the "deep" data (and that is the case here).  The "deep" data contains both optical and NIR bands, in this case `ugrizyJHF`, and the "wide" data contains only `ugrizy` photometry.  The code will determine the cell occupation of the spec sample, determine weights via the deep sample, and attempt to create tomographic bin estimates for the sample based on SOM cell occupation.


There are two sets of files included in the Rubin-Roman download, one a factor of 20 larger than the other.  For a quick demo, use the file names for `specfile`, `deepfile`, and `widefile` as-is below.  For a more robust estimate with more training and estimation data, switch to the larger files by swapping which file names are commented out below:

In [None]:
from rail.core.data import TableHandle
from rail.core.stage import RailStage

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

In [None]:
#
## Larger files to use if you want slightly more robust demo (will take longer to run)
#specfile = "./DEMODATA/romandesc_spec_data_37k_noinf.hdf5"
#deepfile = "./DEMODATA/romandesc_deep_data_75k_noinf.hdf5"
#widefile = "./DEMODATA/romandesc_wide_data_100k_noinf.hdf5"
## smaller files for a quick demo, swap which lines are commented if you don't mind some extra run time
specfile = "./DEMODATA/romandesc_spec_data_18c_noinf.hdf5"
deepfile = "./DEMODATA/romandesc_deep_data_37c_noinf.hdf5"
widefile = "./DEMODATA/romandesc_wide_data_50c_noinf.hdf5"

spec_data = DS.read_file("spec_data", TableHandle, specfile)
balrog_data = DS.read_file("deep_data", TableHandle, deepfile)
wide_data = DS.read_file("wide_data", TableHandle, widefile)

We need to set up several parameters used by the estimate stages, namely the names of the inputs (for both deep and wide), the names of the input errors (again for both deep and wide), and the zero points.  In our dataset, the bands are simply called e.g. `u` and `J`, and the errors `u_err` and `J_err`.  For the "deep" SOM we will use both optical and NIR bands; for the wide data we will only use `ugrizy`:

In [None]:
bands = ['u','g','r','i','z','y','J','H', 'F']
#bands = ['u','g','r','i','z','y']

deepbands = []
deeperrs = []
zeropts = []
widezeropts = []
for band in bands:
    deepbands.append(f'{band}')
    deeperrs.append(f'{band}_err')
    zeropts.append(30.)

widebands = []
wideerrs = []  
for band in bands[:6]:
    widebands.append(f'{band}')
    wideerrs.append(f'{band}_err')
    widezeropts.append(30.)

In [None]:
print(widebands)

The full SOMPZ process involves multiple stages: in order to construct an N(z) estimate we must:
1) Find the best cell mapping for all of the deep/balrog galaxies into the deep SOM using stage `SOMPZEstimatorDeep`
2) Find the best cell mapping for all of the deep/balrog galaxies into the wide SOM using stage `SOMPZEstimatorWide`
3) Find the best cell mapping for all of the spectroscopic galaxies into the deep SOM using stage `SOMPZEstimatorDeep`

4) Use these cell assignments to compute the pz_c redshift histograms in deep SOM cells using stage `SOMPZPzc`. These distributions are redshift pdfs for individual deep SOM cells. 
5) Compute the 'transfer function' using stage `SOMPZPc_chat`. The 'transfer function' is the set of weights relating deep to wide photometry. These weights set the relative importance of p(z) from deep SOM cells for each corresponding wide SOM cell. These are traditionally made by injecting galaxies into images with Balrog.
6) Find the best cell mapping for all of the wide-field galaxies into the wide SOM using stage `SOMPZEstimatorWide`
7) Compute more weights using stage `SOMPZPzchat`. These weights represent the normalized occupation fraction of each wide SOM cell relative to the full sample.
8) Find the best cell mapping for all of the spectroscopic galaxies into the wide SOM using stage `SOMPZEstimatorWide`
9) Define a tomographic bin mapping using stage `SOMPZTomobin`
10) Assemble the final tomographic bin estimates with stage `SOMPZnz`

Note the repeated use of `SOMPZEstimatorDeep` and `SOMPZEstimatorWide` on multiple datasets.  We will have to be careful to define aliases so that `ceci` knows which datasets to use as inputs for these stages.
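
As a schematic summary of how the products of these stages fit together, write $c$ for a deep SOM cell, $\hat{c}$ for a wide SOM cell, and $\hat{b}$ for a tomographic bin.  In the notation of Buchs et al. 2019 and Myles et al. 2021, and omitting selection terms for brevity, the redshift distribution of a bin is assembled roughly as follows.  This is a sketch of the method, not necessarily the exact expression implemented in `rail_sompz`:

$$
p(z \mid \hat{b}) \approx \sum_{\hat{c} \in \hat{b}} \sum_{c} p(z \mid c)\, p(c \mid \hat{c})\, p(\hat{c})
$$

Here $p(z \mid c)$ corresponds to the pz_c product of step 4, $p(c \mid \hat{c})$ to the transfer function of step 5, $p(\hat{c})$ to the wide-cell occupation fraction built from the step 6 assignments (the weights described in step 7), and the set of wide cells $\hat{c} \in \hat{b}$ to the tomographic bin mapping of step 9.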

We will begin with step 1) above, setting up a stage to compute the cell assignments for the deep/balrog data using the deep SOM.  First, let's set up dictionaries of common parameters that will be used for the deep and wide SOMs.

There are many configuration parameters that control the behavior of the estimate stage; these are described below.  Any values not specified will take on their default values as set in the parameter config located in the class:

`hdf5_groupname`: hdf5_groupname for data<br>
`redshift_col`: column name for true redshift in specz sample<br>
`inputs`: list of the names of columns to be used as inputs for the data<br>
`input_errs`: list of the names of columns containing errors on inputs for the data<br>
`zero_points`: zero points for converting mags to fluxes for the data, if needed<br>
`som_shape`: shape for the som, must be a 2-element list<br>
`som_minerror`: floor placed on observational error on each feature in the som<br>
`som_wrap`: flag to set whether the SOM has periodic boundary conditions<br>
`som_take_log`: flag to set whether to take log of inputs (i.e. for fluxes) for the som<br>
`convert_to_flux`: flag for whether to convert input columns to fluxes for the input data, set to true if inputs are mags and to False if inputs are already fluxes<br>
`set_threshold`: flag for whether to replace values below a threshold with a set number<br>
`thresh_val`: threshold value for set_threshold for the input data<br>
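
To make `zero_points`, `convert_to_flux`, `set_threshold`, and `thresh_val` more concrete, here is a minimal sketch of a magnitude-to-flux conversion followed by a floor.  The helper names `mags_to_fluxes` and `apply_floor` are hypothetical, and replacing sub-threshold values with the threshold itself is an assumption; the actual handling inside `rail_sompz` may differ:

```python
import numpy as np


def mags_to_fluxes(mags, zero_point=30.0):
    """Convert magnitudes to fluxes via f = 10**(-0.4 * (m - zero_point))."""
    return 10.0 ** (-0.4 * (np.asarray(mags, dtype=float) - zero_point))


def apply_floor(values, thresh_val=1.0e-5):
    """Replace values below thresh_val with thresh_val (assumed behavior)."""
    values = np.asarray(values, dtype=float)
    return np.where(values < thresh_val, thresh_val, values)


# 99.0 mimics a non-detection flag that would otherwise give a vanishing flux
mags = np.array([22.5, 25.0, 30.0, 99.0])
fluxes = apply_floor(mags_to_fluxes(mags, zero_point=30.0))
print(fluxes)
```

With a zero point of 30, a magnitude of 30 maps to a flux of 1, and every 2.5 magnitudes brighter multiplies the flux by 10.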


In [None]:
deep_som_params = dict(inputs=deepbands, 
                       input_errs=deeperrs,
                       hdf5_groupname="",
                       zero_points=zeropts,
                       som_shape=[32,32], # now a list instead of a tuple!
                       som_minerror=0.01,
                       som_take_log=False,
                       convert_to_flux=True,
                       set_threshold=True,
                       thresh_val=1.e-5,
                       thresh_val_err=1.e-5)

wide_som_params = dict(inputs=widebands, 
                       input_errs=wideerrs,
                       hdf5_groupname="",
                       zero_points=widezeropts,
                       som_shape=[25,25], # now a list instead of a tuple!
                       som_minerror=0.005,
                       som_take_log=False,
                       convert_to_flux=True,
                       set_threshold=True,
                       thresh_val=1.e-5,
                       thresh_val_err=1.e-5)

Now, let's set up the first stage and run it:

In [None]:
som_estimate_deepdeep = SOMPZEstimatorDeep.make_stage(name="som_deepdeep_estimator", 
                                                      model="DEMO_romandesc_model_deep.pkl", 
                                                      assignment = "TESTDEMO_deepdata_deep_assign.hdf5",
                                                      aliases=dict(data="input_deep_data"),
                                                      data=deepfile,
                                                      **deep_som_params)

In [None]:
%%time
#som_est_deep.estimate(deep_data)
som_estimate_deepdeep.run()
som_estimate_deepdeep.finalize()

This should create a file, `TESTDEMO_deepdata_deep_assign.hdf5`, which will contain the cell assignments (the `som_shape` is carried in the file as well).
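
Conceptually, a "cell assignment" is the index of the SOM cell whose weight vector best matches a galaxy's features.  The toy sketch below uses an error-weighted chi-squared distance with an error floor (compare `som_minerror`).  The random `som_weights`, the function name `best_cell`, and the exact distance are assumptions for illustration, not the trained model or the code path used by `SOMPZEstimatorDeep`:

```python
import numpy as np

rng = np.random.default_rng(0)

som_shape = (4, 4)   # toy SOM, much smaller than the 32x32 deep SOM above
n_features = 3
# Hypothetical "trained" SOM weights: one feature vector per cell
som_weights = rng.normal(size=(som_shape[0] * som_shape[1], n_features))


def best_cell(features, errors, weights, min_error=0.01):
    """Return, for each galaxy, the flat index of the minimum-chi^2 cell."""
    err = np.maximum(errors, min_error)  # floor on the observational errors
    # chi2[i, j] = sum_k ((x_ik - w_jk) / err_ik)**2
    diff = features[:, None, :] - weights[None, :, :]
    chi2 = np.sum((diff / err[:, None, :]) ** 2, axis=-1)
    return np.argmin(chi2, axis=1)


# Galaxies placed exactly on cells 5 and 12 should map back to those cells
galaxies = som_weights[[5, 12]]
errors = np.full_like(galaxies, 0.1)
cells = best_cell(galaxies, errors, som_weights)
rows, cols = np.unravel_index(cells, som_shape)
print(cells, rows, cols)
```

The flat index can be converted to a (row, column) position on the SOM grid with `np.unravel_index`, as in the last lines.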

Now, we can proceed to steps 2) and 3) to make cell assignments for the deep/balrog data to the wide SOM, and for the spec data to the deep SOM:

In [None]:
%%time
som_estimate_deepwide = SOMPZEstimatorWide.make_stage(name="som_deepwide_estimator", 
                                           model="DEMO_romandesc_model_wide.pkl", 
                                           assignment = "TESTDEMO_deepdata_wide_assign.hdf5",
                                           aliases=dict(data="input_deep_data"),
                                           data=deepfile,
                                           **wide_som_params)

#som_estimate_deepwide.estimate(deep_data)
som_estimate_deepwide.run()
som_estimate_deepwide.finalize()

In [None]:
som_estimate_deepspec = SOMPZEstimatorDeep.make_stage(name="som_deepspec_estimator", 
                                           model="DEMO_romandesc_model_deep.pkl", 
                                           aliases=dict(assignment="cell_deep_spec_data", data="input_spec_data"),
                                           data=specfile,
                                           **deep_som_params)

#som_estimate_deepspec.estimate(spec_data)
som_estimate_deepspec.run()
som_estimate_deepspec.finalize()

Next, we will set up the `SOMPZPzc` stage to compute the pz_c weights.  This stage takes several input parameters:<br>
`inputs`: the list of the names of columns to be used as inputs<br>
`bin_edges`: the list of edges of tomo bins<br>
`zbins_min`: minimum redshift for output grid<br>
`zbins_max`: maximum redshift for output grid<br>
`zbins_dz`: delta z for defining output grid<br>
`deep_groupname`: the hdf5_groupname for the deep data<br>
`redshift_col`: column name for true redshift in specz sample<br>

Also, as we have multiple cell assignment files and data files in the DataStore, note that we are setting up the aliases for the expected inputs `cell_deep_spec_data` and `spec_data` so that the stage uses the appropriate inputs.  As for where these names come from: we set up `som_estimate_deepspec` as an instance of `SOMPZEstimatorDeep` and assigned it a name with `name="som_deepspec_estimator"`.  The outputs of `SOMPZEstimatorDeep` and `SOMPZEstimatorWide` are both given the name `assignment` (see the definition of the output in the parent class `SOMPZEstimatorBase` here: https://github.com/LSSTDESC/rail_sompz/blob/3e3a73a4579ef2fd0282087e6cd6d73827f5be35/src/rail/estimation/algos/sompz.py#L1333), which is prepended to the name of the stage, and thus the output in the DataStore is stored as "assignment_som_deepspec_estimator".  Similar patterns are used to determine the names of other inputs and outputs.
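
The naming pattern just described can be written down as a one-line helper.  The function `datastore_key` is hypothetical and exists only to illustrate the pattern; it is not part of RAIL or `ceci`:

```python
def datastore_key(output_tag: str, stage_name: str) -> str:
    """Hypothetical helper: the output tag is prepended to the stage name."""
    return f"{output_tag}_{stage_name}"


key = datastore_key("assignment", "som_deepspec_estimator")
print(key)
```

The same pattern accounts for the other alias values used in this notebook, such as `pz_c_som_pzc_stage` and `pc_chat_pcchat_stage`.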

In [None]:
pzcstage = SOMPZPzc.make_stage(name="som_pzc_stage", 
                               redshift_col="redshift",
                               bin_edges=[0.0,0.5,1.0,2.0,3.0],
                               zbins_min=0.0,
                               zbins_max=3.2,
                               zbins_dz=0.02,
                               deep_groupname="",
                               pz_c="TESTDEMO_pz_c.hdf5",
                               aliases=dict(cell_deep_spec_data="assignment_som_deepspec_estimator", spec_data="input_spec_data"),
                               )

In [None]:
%%time
#pzcstage.estimate(spec_data, cell_deep_spec_data)
pzcstage.run()
pzcstage.finalize()
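
As a rough picture of what the pz_c product represents, the sketch below builds a normalized redshift histogram for each deep SOM cell from a toy spectroscopic sample.  The random data, the grid, and the array layout are assumptions for illustration; the file written by `SOMPZPzc` may be organized differently (for example, with an additional tomographic bin dimension):

```python
import numpy as np

rng = np.random.default_rng(1)
n_deep_cells = 4
z_edges = np.linspace(0.0, 3.0, 31)  # toy redshift grid

# Toy spectroscopic sample: a redshift and a deep-SOM cell index per galaxy
spec_z = rng.uniform(0.0, 3.0, size=500)
spec_cell = rng.integers(0, n_deep_cells, size=500)

# pz_c[c, k]: normalized redshift histogram of the spec galaxies in deep cell c
pz_c = np.zeros((n_deep_cells, len(z_edges) - 1))
for c in range(n_deep_cells):
    counts, _ = np.histogram(spec_z[spec_cell == c], bins=z_edges)
    if counts.sum() > 0:  # leave empty cells as all zeros
        pz_c[c] = counts / counts.sum()

print(pz_c.shape)
```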

Next, we can estimate the Pc_chat weights.  The only inputs that this stage takes are the deep/balrog assignments for the deep SOM, and the deep/balrog data assignments to the wide SOM.  We will specify these as aliases again to ensure that the code is grabbing the correct data:

In [None]:
pcchatstage = SOMPZPc_chat.make_stage(name="pcchat_stage",
                                      aliases=dict(cell_deep_balrog_data="assignment_som_deepdeep_estimator",
                                                   cell_wide_balrog_data="assignment_som_deepwide_estimator"),
                                     )

In [None]:
%%time
#pcchatstage.estimate(cell_deep_deep_data, cell_deep_wide_data)
pcchatstage.run()
pcchatstage.finalize()
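
The transfer function can be pictured as a matrix that counts how often a deep/balrog galaxy lands in deep cell `c` and wide cell `chat`, normalized per wide cell.  The sketch below uses random toy assignments, and the choice to normalize over deep cells for each wide cell is an assumption about the convention, not a statement of what `SOMPZPc_chat` does internally:

```python
import numpy as np

rng = np.random.default_rng(2)
n_deep_cells, n_wide_cells = 6, 4

# Toy deep/balrog sample: each galaxy has a deep cell c and a wide cell chat
cell_deep = rng.integers(0, n_deep_cells, size=2000)
cell_wide = rng.integers(0, n_wide_cells, size=2000)

# Joint occupation counts N(c, chat); np.add.at accumulates repeated indices
counts = np.zeros((n_deep_cells, n_wide_cells))
np.add.at(counts, (cell_deep, cell_wide), 1.0)

# Normalize over deep cells for each wide cell to get p(c | chat);
# wide cells with no galaxies are left as all zeros
col_sums = counts.sum(axis=0, keepdims=True)
pc_chat = np.divide(counts, col_sums, out=np.zeros_like(counts), where=col_sums > 0)

print(pc_chat.sum(axis=0))
```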

Next, we need to compute more cell assignments, this time for the wide data.  We will start with the wide cell assignments using the wide SOM:

In [None]:
%%time
som_estimate_widewide = SOMPZEstimatorWide.make_stage(name="som_widewide_estimator", 
                                           model="DEMO_romandesc_model_wide.pkl", 
                                           assignment = "TESTDEMO_widedata_wide_assign.hdf5",
                                           aliases=dict(assignment="cell_wide_wide_data", data="input_wide_data"),
                                           data=widefile,
                                           **wide_som_params)

#som_estimate_widewide.estimate(wide_data)
som_estimate_widewide.run()
som_estimate_widewide.finalize()

With these cell assignments, we can now compute the pz_chat weights with `SOMPZPzchat`.  This stage takes as inputs some of the same parameters as previous stages, namely `inputs`, `bin_edges`, `zbins_min`, `zbins_max`, `zbins_dz`, and `redshift_col`.

It also must read in multiple input data files from previous stages, which we can specify in the aliases dictionary.  It requires the spectroscopic data file, the spec data deep SOM cell assignments, the wide data wide SOM cell assignments, the pz_c weights, and the pc_chat weights.

In [None]:
%%time
estimate_pzchat = SOMPZPzchat.make_stage(name="sompz_pzchat", 
                                         bin_edges=[0.2, 0.6, 1.2, 1.8, 2.5],
                                         zbins_min=0.0,
                                         zbins_max=3.0,
                                         zbins_dz=0.025,
                                         redshift_col="redshift",
                                         aliases=dict(spec_data='input_spec_data',
                                                      cell_deep_spec_data='assignment_som_deepspec_estimator',
                                                      cell_wide_wide_data='assignment_som_widewide_estimator',
                                                      pz_c='pz_c_som_pzc_stage',
                                                      pc_chat='pc_chat_pcchat_stage',
                                                     ),
                                         )
#estimate_pzchat.estimate(spec_data, cell_deep_spec_data, cell_wide_wide_data, pzchat_data, pcchat_data)
estimate_pzchat.run()
estimate_pzchat.finalize()
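
The "normalized occupation fraction of each wide SOM cell" mentioned in step 7 can be illustrated with a toy calculation.  The assignments below are random stand-ins, and how `SOMPZPzchat` combines such fractions with the pz_c and pc_chat inputs internally is not shown here:

```python
import numpy as np

rng = np.random.default_rng(3)
n_wide_cells = 5
cell_wide_wide = rng.integers(0, n_wide_cells, size=1000)  # toy assignments

# p(chat): fraction of the wide sample that falls in each wide SOM cell
p_chat = np.bincount(cell_wide_wide, minlength=n_wide_cells) / cell_wide_wide.size

print(p_chat)
```

Passing `minlength` keeps the output at one entry per cell even if the highest-numbered cells happen to be empty.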

One last set of cell assignments is needed in order to create tomographic bins: the spectroscopic data assigned to the wide SOM.

In [None]:
%%time
som_estimate_widespec = SOMPZEstimatorWide.make_stage(name="som_widespec_estimator", 
                                           model="DEMO_romandesc_model_wide.pkl", 
                                           assignment = "TESTDEMO_specdata_wide_assign.hdf5",
                                           aliases=dict(assignment="cell_wide_spec_data", data="input_spec_data"),
                                           **wide_som_params)

#som_estimate_widespec.estimate(spec_data)
som_estimate_widespec.run()
som_estimate_widespec.finalize()

Now, the penultimate stage, `SOMPZTomobin`, takes the `inputs`, `bin_edges`, `zbins_min`, `zbins_max`, `zbins_dz`, and `redshift_col` configuration parameters, and requires the spectroscopic data and the cell assignments of the spec data to both the wide and deep SOMs:

In [None]:
estimate_tomobin = SOMPZTomobin.make_stage(name="sompz_tomobin",
                                           bin_edges=[0.2, 0.6, 1.2, 1.8, 2.5],
                                           zbins_min=0.0,
                                           zbins_max=3.0,
                                           zbins_dz=0.025,
                                           wide_som_size=625,
                                           deep_som_size=1024,
                                           redshift_col="redshift",
                                           tomo_bins_wide="TESTDEMO_tomo_bins_wide.hdf5",
                                           aliases=dict(spec_data='input_spec_data',
                                                        cell_deep_spec_data='assignment_som_deepspec_estimator',
                                                        cell_wide_spec_data='assignment_som_widespec_estimator'),
                                          )

In [None]:
%%time
#estimate_tomobin.estimate(spec_data, cell_deep_spec_data, cell_wide_spec_data)
estimate_tomobin.run()
estimate_tomobin.finalize()
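
One way to picture a tomographic bin mapping is to send each wide SOM cell to the bin that holds the plurality of its spectroscopic galaxies.  The sketch below implements that rule on toy data.  The plurality rule itself is an assumption made for illustration, and `SOMPZTomobin` may use a different criterion:

```python
import numpy as np

rng = np.random.default_rng(4)
bin_edges = np.array([0.2, 0.6, 1.2, 1.8, 2.5])
n_bins = len(bin_edges) - 1
n_wide_cells = 3

# Toy spec sample built so that each wide cell has a clear preferred bin
spec_z = np.concatenate([
    rng.uniform(0.25, 0.55, 40),  # galaxies in wide cell 0, all in bin 0
    rng.uniform(1.25, 1.75, 40),  # galaxies in wide cell 1, all in bin 2
    rng.uniform(1.85, 2.45, 40),  # galaxies in wide cell 2, all in bin 3
])
spec_wide_cell = np.repeat(np.arange(n_wide_cells), 40)

# Bin index of every spec galaxy; -1 or n_bins means outside the bin edges
gal_bin = np.digitize(spec_z, bin_edges) - 1

# Assign each wide cell to the bin holding the plurality of its spec galaxies;
# cells without any in-range spec galaxies keep the placeholder value -1
cell_to_bin = np.full(n_wide_cells, -1)
for chat in range(n_wide_cells):
    in_cell = gal_bin[spec_wide_cell == chat]
    in_cell = in_cell[(in_cell >= 0) & (in_cell < n_bins)]
    if in_cell.size:
        cell_to_bin[chat] = np.bincount(in_cell, minlength=n_bins).argmax()

print(cell_to_bin)
```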

Our final stage, `SOMPZnz`, actually outputs the tomographic bin estimates.  It takes the same `inputs`, `bin_edges`, `zbins_min`, `zbins_max`, `zbins_dz`, and `redshift_col` configuration parameters, and requires as inputs the spectroscopic data, the cell assignments of the spec data to the deep SOM, the cell assignments of the wide data to the wide SOM, the tomographic bin assignments from `SOMPZTomobin`, and the pc_chat weights from `SOMPZPc_chat`.  The final output is called `nz` in the stage; we will set that to write out to the file `TESTDEMO_FINAL_NZ.hdf5`.

In [None]:
sompznz_estimate = SOMPZnz.make_stage(name="sompz_nz",
                                      bin_edges=[0.2, 0.6, 1.2, 1.8, 2.5],
                                      zbins_min=0.0,
                                      zbins_max=3.0,
                                      zbins_dz=0.025,
                                      redshift_col="redshift",
                                      aliases=dict(spec_data='input_spec_data',
                                                   cell_deep_spec_data='assignment_som_deepspec_estimator',
                                                   cell_wide_wide_data='assignment_som_widewide_estimator',
                                                   tomo_bins_wide='tomo_bins_wide_sompz_tomobin',
                                                   pc_chat='pc_chat_pcchat_stage'),
                                      nz="TESTDEMO_FINAL_NZ.hdf5")

In [None]:
%%time
#sompznz_estimate.estimate(spec_data, cell_deep_spec_data, cell_wide_wide_data, tomo_bins_wide, pcchat_data)
sompznz_estimate.run()
sompznz_estimate.finalize()
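
To see how the ingredients combine, the toy sketch below assembles an N(z) per tomographic bin from randomly generated stand-ins for p(z|c), p(c|chat), p(chat), and a cell-to-bin mapping.  All array names, shapes, and the normalization are assumptions for illustration; this is not the code path of `SOMPZnz`:

```python
import numpy as np

rng = np.random.default_rng(5)
n_deep, n_wide, n_z, n_bins = 6, 4, 10, 2

# Toy ingredients, each normalized as a probability
pz_c = rng.random((n_deep, n_z))
pz_c /= pz_c.sum(axis=1, keepdims=True)        # p(z | c): rows sum to 1
pc_chat = rng.random((n_deep, n_wide))
pc_chat /= pc_chat.sum(axis=0, keepdims=True)  # p(c | chat): columns sum to 1
p_chat = rng.random(n_wide)
p_chat /= p_chat.sum()                         # p(chat)
cell_to_bin = np.array([0, 0, 1, 1])           # wide cell -> tomographic bin

# p(z | chat) = sum_c p(z | c) p(c | chat), with shape (n_wide, n_z)
pz_chat = pc_chat.T @ pz_c

# N(z) per bin: occupation-weighted sum over the wide cells in that bin
nz = np.zeros((n_bins, n_z))
for b in range(n_bins):
    members = cell_to_bin == b
    nz[b] = p_chat[members] @ pz_chat[members]
    nz[b] /= nz[b].sum()                       # normalize each bin to unit sum

print(nz.shape)
```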

In this example we specified five tomographic bin edges, [0.2, 0.6, 1.2, 1.8, 2.5], so we should have four tomographic bins.
These four tomographic bin estimates are stored in the output file that we assigned to the `SOMPZnz` stage, "TESTDEMO_FINAL_NZ.hdf5".  Let's read in that file and display our tomographic bin estimates, along with the bin edges that we set:

In [None]:
import qp

In [None]:
ens = qp.read("TESTDEMO_FINAL_NZ.hdf5")

In [None]:
binedges = [0.2, 0.6, 1.2, 1.8, 2.5]
fig, axs = plt.subplots(1,1, figsize=(10,6))
cols=['r','purple','b','orange']
for i, col in enumerate(cols):
    ens[i].plot_native(axes=axs, color=col)
    axs.axvline(binedges[i+1], color=col, ls='--', lw=0.9)
axs.set_xlabel("redshift", fontsize=14)
axs.set_ylabel("N(z)", fontsize=14)
axs.set_xlim(0,3.25)

Looks very good, particularly for such small datasets as were used in this example!  Results should look better with the larger files, which enable a more well-defined SOM mapping from color to redshift.  There is a nice separation in our tomographic bins, without many bumps outside of the bin due to degeneracies.  The addition of the near-infrared bands can break many of the degeneracies where the Lyman and Balmer breaks are confused for each other.  This demonstrates the power of this technique, and how using NIR (or any other additional band information) can help us in determining our redshift distributions.
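
If you want to put numbers on the "separation" of the bins, one simple option is the fraction of each bin's N(z) that lies inside its nominal edges, together with the mean redshift per bin.  The sketch below computes both on toy Gaussian N(z) curves, which are stand-ins and NOT the output of this demo; the same arithmetic could be applied to the real estimates once they are evaluated on a redshift grid:

```python
import numpy as np

bin_edges = np.array([0.2, 0.6, 1.2, 1.8, 2.5])
z_edges = np.linspace(0.0, 3.0, 121)  # grid with dz = 0.025
z_mid = 0.5 * (z_edges[:-1] + z_edges[1:])

# Toy N(z): one Gaussian per bin, centered in the bin, each with unit sum
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
nz = np.exp(-0.5 * ((z_mid[None, :] - centers[:, None]) / 0.1) ** 2)
nz /= nz.sum(axis=1, keepdims=True)

# Fraction of each bin's N(z) inside its nominal edges, and the mean redshift
inside = (z_mid[None, :] >= bin_edges[:-1, None]) & (z_mid[None, :] < bin_edges[1:, None])
frac_inside = (nz * inside).sum(axis=1)
mean_z = (nz * z_mid[None, :]).sum(axis=1)

print(frac_inside, mean_z)
```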