# Making ceci yaml files for SOMPZ

Author: Sam Schmidt<br>
Last Successfully run: May 6, 2025<br>

This notebook demonstrates how to use the RAIL pipelines infrastructure to quickly set up the yaml files needed to run SOMPZ from the command line.  We'll start with some imports:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import qp
import os
import tables_io
from rail.core import common_params
from rail.pipelines.estimation.estimate_sompz import EstimateSomPZPipeline
from rail.pipelines.estimation.inform_sompz import InformSomPZPipeline

Next, we will grab the same data used in the `rail_sompz_inform_demo-romandesc.ipynb` notebook.  If you already ran that notebook, the data should already exist in the `DEMODATA` subdirectory and you can skip the cell below.  If you still need the data, uncomment the lines in the cell below to download and unpack it:

In [None]:
#!curl -O https://portal.nersc.gov/cfs/lsst/PZ/roman_desc_demo_data.tar.gz
#!mkdir -p DEMODATA
#!tar -xzvf roman_desc_demo_data.tar.gz
#!mv romandesc*.hdf5 DEMODATA/

This data uses a different `hdf5_groupname` than the default "photometry" value, so first we can redefine `hdf5_groupname` in the `COMMON_PARAMS` that are shared by all of RAIL:

In [None]:
common_params.set_param_defaults(hdf5_groupname="")

We need to specify the locations of our input data files.  We'll use the small files that we just downloaded from NERSC, which contain 1850 spec objects, 3700 deep objects, and 5000 wide objects drawn from the Roman-Rubin sims:

In [None]:
specfile = "DEMODATA/romandesc_spec_data_18c_noinf.hdf5"
deepfile = "DEMODATA/romandesc_deep_data_37c_noinf.hdf5"
widefile = "DEMODATA/romandesc_wide_data_50c_noinf.hdf5"

`InformSomPZPipeline` is initialized with a dictionary that maps `input_deep_data` and `input_wide_data` to the deep and wide filenames.  We create that dictionary, create an instance of the pipeline, and then pass the dictionary to `initialize`:

In [None]:
test_input_dict = dict(input_deep_data=deepfile, input_wide_data=widefile)

In [None]:
inform_pipe = InformSomPZPipeline()

In [None]:
inform_pipe.initialize(
    test_input_dict, 
    dict(
        output_dir=".",
        log_dir=".",
        resume=True,
    ),
    None,
)

That's all we need in order to create the yaml files! We can write the pair of yaml files to disk with `inform_pipe.save`:

In [None]:
inform_pipe.save("inform_pipeline.yml")
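To confirm that both files were written, a minimal sketch like the one below can list them. It assumes the second file is saved next to the pipeline file with `_config` appended to the name (here `inform_pipeline_config.yml`); the helper `saved_pipeline_files` is hypothetical and not part of RAIL or ceci.

```python
from pathlib import Path


def saved_pipeline_files(pipeline_yaml):
    """Return the pipeline yaml and its companion config yaml, if they exist.

    Hypothetical helper for illustration; not part of RAIL or ceci.
    """
    pipeline_path = Path(pipeline_yaml)
    # Assumption: the stage configuration is written alongside the pipeline
    # file, with "_config" appended to the file stem.
    config_path = pipeline_path.with_name(
        pipeline_path.stem + "_config" + pipeline_path.suffix
    )
    return [p for p in (pipeline_path, config_path) if p.exists()]


# Print each file that was found, with its size in bytes
for path in saved_pipeline_files("inform_pipeline.yml"):
    print(f"{path} ({path.stat().st_size} bytes)")
```

If nothing is printed, the files were not found in the current working directory.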

We can then run this pipeline from the command-line with the cell below:

In [None]:
!ceci inform_pipeline.yml

This should run our two stages that create the wide and deep SOMs and generate two pickle files, `model_som_informer_wide.pkl` and `model_som_informer_deep.pkl`.

These two pickle files are inputs to the `EstimateSomPZPipeline`.  We can set up an instance of that pipeline and initialize it just as we did the inform pipeline.  In this case, we need to specify the following inputs in a dictionary: `wide_model` and `deep_model` (the wide and deep SOM pickle files created just now), and the filenames for the spec, deep, and wide data, `input_spec_data`, `input_deep_data`, and `input_wide_data`:

In [None]:
estimate_pipeline = EstimateSomPZPipeline()

In [None]:
estimate_dict = {
    "wide_model": "model_som_informer_wide.pkl",
    "deep_model": "model_som_informer_deep.pkl",
    "input_spec_data": specfile,
    "input_deep_data": deepfile,
    "input_wide_data": widefile,
}
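Since every value in this dictionary is a path on disk, it can be worth checking that each file exists before initializing the pipeline. The sketch below is one way to do that; `missing_files` is a hypothetical helper, and the paths are simply the ones used in this notebook, relative to the current working directory.

```python
import os


def missing_files(inputs):
    """Return the keys of `inputs` whose paths are not existing files.

    Hypothetical helper for illustration; not part of RAIL.
    """
    return [key for key, path in inputs.items() if not os.path.isfile(path)]


# Same keys and paths as the dictionary defined in this notebook
estimate_inputs = {
    "wide_model": "model_som_informer_wide.pkl",
    "deep_model": "model_som_informer_deep.pkl",
    "input_spec_data": "DEMODATA/romandesc_spec_data_18c_noinf.hdf5",
    "input_deep_data": "DEMODATA/romandesc_deep_data_37c_noinf.hdf5",
    "input_wide_data": "DEMODATA/romandesc_wide_data_50c_noinf.hdf5",
}

missing = missing_files(estimate_inputs)
if missing:
    print("Missing input files for:", ", ".join(missing))
else:
    print("All estimate pipeline inputs were found.")
```

A missing `wide_model` or `deep_model` usually means the inform pipeline has not been run yet, or was run in a different directory.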

We initialize the pipeline in the same way as before, and again write out the yaml files with `estimate_pipeline.save`:

In [None]:
estimate_pipeline.initialize(
    estimate_dict, 
    dict(
        output_dir=".",
        log_dir=".",
        resume=True,
    ),
    None,
)

In [None]:
estimate_pipeline.save("estimate_pipeline.yml")

Let's run this via the command line:

In [None]:
!ceci estimate_pipeline.yml
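Outside a notebook, where the `!` shell escape is not available, the same command can be launched from a Python script. The following is a minimal sketch using the standard library; `run_ceci` and its `build_only` argument are hypothetical names introduced here (`build_only` is not a ceci option), and the sketch assumes the `ceci` executable is on your `PATH`.

```python
import shutil
import subprocess


def run_ceci(pipeline_yaml, executable="ceci", build_only=False):
    """Run `ceci <pipeline_yaml>` and return the command that was used.

    Hypothetical helper for illustration. With build_only=True the command
    is returned without being executed.
    """
    command = [executable, str(pipeline_yaml)]
    if build_only:
        return command
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"could not find '{executable}' on PATH")
    # check=True raises CalledProcessError if the pipeline exits with an error
    subprocess.run(command, check=True)
    return command


# Show the command without running it
print(run_ceci("estimate_pipeline.yml", build_only=True))
```

Calling `run_ceci("estimate_pipeline.yml")` without `build_only` would execute the pipeline and raise an exception if it fails, which is convenient in batch scripts.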

This pipeline runs multiple stages, and should create multiple intermediate files.  You can read a bit more about them in the `rail_sompz_estimate_demo_romandesc.ipynb` notebook, but we will simply look at the final output, the tomographic bins stored in `nz_som_nz.hdf5`.  This dataset should be a qp ensemble with four tomographic redshift bins.  Let's plot the four bins:

In [None]:
nzfile = "nz_som_nz.hdf5"
ens = qp.read(nzfile)

In [None]:
ens.npdf

In [None]:
fig, axs = plt.subplots(1, 1, figsize=(10,8))
for i in range(4):
    ens.plot_native(key=i, axes=axs)
axs.set_xlabel("redshift", fontsize=14)
axs.set_ylabel("N(z)", fontsize=14);

We see four fairly well-defined bins, as expected for the small samples used in this demo.