# TPZ: Trees for Photo-Z's

Author: Sam Schmidt <br>
Last successfully run: March 24, 2025 <br>

TPZ is one of the codes implemented in the MLZ (Machine Learning PhotoZ) package by Matias Carrasco Kind. Some documentation for the algorithm is included on Matias' website for the package:
http://matias-ck.com/mlz/
However, the code is no longer actively maintained, and Matias' original code was written for Python 2.  This code is based on a fork by Erfan Nourbakhsh for a DESC project, which is itself a fork that updated the code to be Python 3 compatible:
Erfan's fork: https://github.com/enourbakhsh/MLZ

This RAIL-wrapped version of the code does not include the SOM-based MLZ code or the implementation of BPZ; it only includes an implementation of the decision-tree-based code.  

Initially, we have only implemented the regression-tree version of the code, though the classification-tree method may be re-implemented at a future time.  Furthermore, the original code had options for out-of-bag (oob) error estimates and variational importance sampling. Those have not been included initially, but will hopefully be added as options in the near future.

For a quick summary of how the code operates: given a set of galaxy observables (usually magnitudes, and optional uncertainties), TPZ builds a set of decision trees where it splits a training set by some of the included parameters in a way that best differentiates the parameter of interest (in our case redshift).  It then performs repeated splits on parameters in each leaf branch of each tree that best differentiate the remaining data, thus building up a decision tree.  It creates multiple trees in two ways: 1) by creating N bootstrap realizations of the initial training set; 2) if uncertainties are provided (e.g. magnitude uncertainties), by creating M alternative training set realizations, adding Gaussian scatter to the training quantities galaxy-by-galaxy (Note: if an error is not supplied, it assumes a very small error of 0.00005 for the Gaussian sigma).  Thus, it trains a total of N x M trees for its model, e.g. if you tell TPZ that you want 5 random realizations and 4 trees, it will create 5 random datasets and bootstrap each of those 4 times to train a total of 20 trees.  
To create a photo-z estimate, it then lets each test galaxy plinko down through the decision tree, and adds the redshifts of the training galaxies in the terminal leaf node to a histogram, building up the final PDF by looking at all N x M trees.
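
The bookkeeping behind the N x M tree count can be sketched with plain numpy. This is a minimal illustration on a toy catalog, not the RAIL or MLZ implementation: the array names and sizes are made up for the example, and for simplicity every realization here receives scatter.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set (hypothetical): 100 galaxies, 2 magnitudes each,
# with a per-galaxy 1-sigma uncertainty on every magnitude.
n_gal = 100
mags = rng.uniform(18.0, 25.0, size=(n_gal, 2))
mag_errs = np.full((n_gal, 2), 0.05)
redshift = rng.uniform(0.0, 3.0, size=n_gal)

nrandom = 5  # number of scattered realizations
ntrees = 4   # bootstrap resamples drawn from each realization

training_sets = []
for _ in range(nrandom):
    # Gaussian scatter, galaxy-by-galaxy, using each galaxy's own sigma.
    scattered = mags + rng.normal(size=mags.shape) * mag_errs
    for _ in range(ntrees):
        # Bootstrap: draw n_gal row indices with replacement.
        idx = rng.integers(0, n_gal, size=n_gal)
        training_sets.append((scattered[idx], redshift[idx]))

# One tree would be trained per entry: nrandom * ntrees in total.
print(len(training_sets))
```

Each entry of `training_sets` stands in for the data used to train one tree, so the list length is the size of the forest.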

## Running TPZ
We'll start with a few basic imports, including the import of `TPZliteInformer` from RAIL, along with the `RAILDIR` path that will help us grab some basic test data:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os
import tables_io

In [None]:
from rail.estimation.algos.tpz_lite import TPZliteInformer

In [None]:
from rail.utils.path_utils import RAILDIR

In [None]:
RAILDIR

A small set of ~10,000 training galaxies is included with the base rail repo. Below we will point to that data and read it into the Data Store, where it is stored as an ordered dictionary of magnitudes, magnitude uncertainties, and redshift (under an hdf5 group called `photometry`):

In [None]:
datafile = os.path.join(RAILDIR,"rail/examples_data/testdata/test_dc2_training_9816.hdf5")

In [None]:
import rail
import qp
from rail.core.data import TableHandle
from rail.core.stage import RailStage

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

In [None]:
training_data = DS.read_file("training_data", TableHandle, datafile)


Next, we will create a dictionary with the configuration parameters that control TPZ's behavior, which we will feed to the `make_stage` method of `TPZliteInformer` to set up our training of the trees.  There are several configuration parameters available for `TPZliteInformer`.  A number of these are "shared parameters", including:
`zmin`, `zmax`, `nzbins`, `mag_limits`, and `redshift_col`.

Two shared parameters play an important role in TPZ:<br>
- `bands` is a list that contains the names of the columns that will be used as features in the tree.  While we use the name `bands` to be consistent with other parts of RAIL, note that other meaningful quantities, e.g. size, shape, concentration, surface brightness, etc..., could also be used.
-  `err_bands` contains a list of the column names that contain the 1-sigma uncertainties for the quantities listed in `bands`.  As TPZ creates mock data by sampling from the uncertainty distributions when making its forest of trees, these quantities are necessary for proper functioning of the code.

Additionally, the configuration parameter `err_dict` must be a dictionary whose keys are the columns that will be used for prediction and whose values are the error columns associated with them, e.g. `err_dict["mag_u_lsst"] = "mag_err_u_lsst"`.  This dictionary is used by the code that generates random realizations of the galaxies by adding Gaussian scatter: it tells that bit of code which column contains the errors for each attribute that will have Gaussian scatter added. <br>
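
Because `bands`, `err_bands`, and `err_dict` must agree with each other and with the columns in the input table, a quick consistency check before training can catch mismatches early. The helper below is hypothetical (`check_err_dict` is not part of RAIL) and only sketches the idea using plain Python containers:

```python
def check_err_dict(columns, bands, err_dict):
    """Hypothetical helper: list mismatches between bands, err_dict, and table columns."""
    problems = []
    for band in bands:
        if band not in err_dict:
            problems.append(f"{band}: no entry in err_dict")
            continue
        err_col = err_dict[band]
        # A value of None means "this attribute has no error column".
        if err_col is not None and err_col not in columns:
            problems.append(f"{band}: error column {err_col} not in table")
    return problems


# Toy inputs: the g-band error column is deliberately missing from the table.
columns = ["mag_u_lsst", "mag_err_u_lsst", "mag_g_lsst", "redshift"]
bands = ["mag_u_lsst", "mag_g_lsst"]
err_dict = {
    "mag_u_lsst": "mag_err_u_lsst",
    "mag_g_lsst": "mag_err_g_lsst",
    "redshift": None,
}

problems = check_err_dict(columns, bands, err_dict)
print(problems)
```

An empty list means the three pieces of configuration are mutually consistent for the given table.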

The other configuration parameters for TPZ are:
- `seed` (int): the random seed used by numpy for this stage <br>
- `nrandom` (int): the number of random training catalogs with Gaussian scatter to create. <br>
- `ntrees` (int): the number of bootstrap samples for a given random catalog to create. <br>
REMINDER: the total number of trees trained will be `nrandom` * `ntrees`, and if `nrandom` is set to 1, then no random catalogs are created, only the original training sample is used.<br>
- `minleaf` (int): the minimum number of galaxies in a terminal leaf. <br>
- `natt` (int): the number of attributes to split. <br>
- `sigmafactor` (float): Gaussian smoothing with kernel Sigma1*Resolution. <br>
- `rmsfactor` (float): RMS for zconf calculation. <br>
- `tree_strategy` (string): see paragraph below.<br>

`rail_tpz` uses a parameter, `tree_strategy`, that sets which specific algorithm is used to construct the trees, and this choice can have a few important effects on the results.  The original TPZ code contains bespoke decision tree code custom written for the algorithm to perform the recursive data splits.  While it is functional, it can be somewhat slow.  As an alternative, we have implemented a method that instead uses scikit-learn's `DecisionTreeRegressor`, which can make training of the random forest informer more than 10,000 times faster than native.  The specifics of how the decisions on where tree splits occur are made differ slightly between the "native" and "sklearn" methods, though the resulting photo-z predictions are qualitatively similar.

There is a notable difference in how the two methods handle the PDF construction that will affect results. Both methods look at the input galaxy and use the decision tree to find the galaxies that are most similar to that input galaxy, splitting in the tree until they reach the "terminal leaf" where the last split occurs.  There are a small number of galaxies in this terminal leaf.  The "native" method takes this small number of galaxies and makes a histogram of their redshifts, and combines the N x M tree histograms to construct the final PDF.  The "sklearn" method, on the other hand, takes the mean of the small number of galaxies in the terminal leaf and returns a single float, so the final PDF estimate will be a histogram of single values from each of the N x M trees, rather than the (number in leaf node) x N x M entries of the "native" representation.

While results should mostly be qualitatively similar, the fact that TPZ uses bootstrap sampling when constructing the different trees means that some specz values can be repeated in some trees if they are drawn multiple times in the bootstrap.  In areas of photometric space with sparse coverage of spectroscopic galaxies, this can result in discrete values appearing multiple times in the histogram of neighbors in the PDF.  This can manifest as repeated values of the mode, for example, often seen at high redshift.  The "sklearn" strategy of averaging over the terminal leaf can somewhat mitigate this effect, as the discrete values are slightly smoothed by the averaging over the terminal leaf sample.

If you re-run this example notebook and switch the `tree_strategy` from "sklearn" to "native", you will likely see some discrete mode values in either a histogram of the zmode or a plot of mode vs true redshift.  One method is not generally better than the other; it is simply a feature that users should be aware of, as it can impact a specific science case, particularly if point estimates are going to be employed.
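
The difference between the two PDF constructions can be illustrated with a toy example. The leaf contents and binning below are made up for illustration and do not come from the TPZ code; the point is only how many values each strategy feeds into the histogram.

```python
import numpy as np

# Hypothetical terminal-leaf contents: the training redshifts in the leaf that
# one test galaxy lands in, for each of 3 trees.
leaves = [
    np.array([0.43, 0.45, 0.47, 0.45]),  # 0.45 appears twice, e.g. a bootstrap repeat
    np.array([0.44, 0.52]),
    np.array([0.45, 0.45, 1.23]),
]

bin_edges = np.linspace(0.0, 3.0, 31)  # 30 bins of width 0.1

# "native"-style: every training redshift in every leaf enters the histogram.
native_vals = np.concatenate(leaves)
native_hist, _ = np.histogram(native_vals, bins=bin_edges)

# "sklearn"-style: each tree contributes a single number, its leaf mean.
sklearn_vals = np.array([leaf.mean() for leaf in leaves])
sklearn_hist, _ = np.histogram(sklearn_vals, bins=bin_edges)

print("native entries: ", native_hist.sum())
print("sklearn entries:", sklearn_hist.sum())
```

In this toy case the repeated value 0.45 enters the "native" histogram several times, while the "sklearn"-style averaging replaces each leaf by one blended value.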


We need to specify the attributes that TPZ will use to create its trees. We do this via a list passed to the `bands` parameter.  While the default list would work, we'll create and use it explicitly in this example.  Redshift, the parameter that we are trying to predict, should not be included in the attribute list (but needs to be included in the data file so the trees can be trained to split on it).

TPZ generates additional "random" realizations of a training set by adding Gaussian scatter in attributes, with sigma values taken from a different column in the input file.  The corresponding uncertainty columns for each attribute are stored as a dictionary with the name of the attribute column as the key and the name of the uncertainty column as the value; this configuration parameter is `err_dict`.  While the default values set by `tpz_lite` would work, we'll create the necessary dictionary explicitly and use it for illustration.  As mentioned above, using "sklearn" for the `tree_strategy` is much faster, so we will use that option in this demo.

In [None]:
bands = ["u", "g", "r", "i", "z", "y"]
new_err_dict = {}
attribute_list = []
error_list = []
for band in bands:
    attribute_list.append(f"mag_{band}_lsst")
    error_list.append(f"mag_err_{band}_lsst")
    new_err_dict[f"mag_{band}_lsst"] = f"mag_err_{band}_lsst"
# redshift is also an attribute used in the training, but it does not have an
# associated error, so its entry in the err_dict should be set to None
new_err_dict["redshift"] = None

print(new_err_dict)

In [None]:
tpz_dict = dict(zmin=0.0, 
                zmax=3.0, 
                nzbins=301,
                bands=attribute_list,
                err_bands=error_list,
                hdf5_groupname='photometry',
                err_dict=new_err_dict,
                nrandom=3, 
                ntrees=5,
                #tree_strategy='native')  # uncomment this line and comment out the line below to switch to using "native" trees 
                tree_strategy='sklearn')
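
As a quick sanity check on the `zmin`, `zmax`, and `nzbins` values chosen above, the sketch below computes the implied redshift resolution. It assumes the PDFs are evaluated on an evenly spaced grid that includes both endpoints; that is an assumption made for this illustration, not something stated in this notebook.

```python
import numpy as np

zmin, zmax, nzbins = 0.0, 3.0, 301

# Assumption: an evenly spaced grid including both zmin and zmax.
zgrid = np.linspace(zmin, zmax, nzbins)
resolution = zgrid[1] - zgrid[0]

print(len(zgrid), round(resolution, 6))
```

Under that assumption, 301 points between 0 and 3 correspond to a grid spacing of 0.01 in redshift.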

Now, let's create our stage and run `inform`.  We specified `nrandom = 3` and `ntrees = 5`, so we will get 15 trained trees that constitute our model.  As a rough guide to how long training should take: for our 10k training galaxy sample, this takes about 0.5 seconds for "sklearn", or about 90 seconds using "native", on my Mac desktop:

In [None]:
pz_train = TPZliteInformer.make_stage(name='inform_TPZ', model='demo_tpz.pkl', **tpz_dict)

In [None]:
%%time
pz_train.inform(training_data)

## Running the Estimate stage

The model was created successfully. We now need to read in our test data, which consists of ~20,000 galaxies drawn from the same cosmoDC2 simulated sample that was used to create our training sample:

In [None]:
from rail.estimation.algos.tpz_lite import TPZliteEstimator

In [None]:
testfile = os.path.join(RAILDIR,"rail/examples_data/testdata/test_dc2_validation_9816.hdf5")

In [None]:
test_data = DS.read_file("test_data", TableHandle, testfile)


We can now set up our `TPZliteEstimator` stage to actually estimate our redshift PDFs.  There is only one configuration parameter for the stage: <br>
- `test_err_dict` (dict): this is a dictionary just like `err_dict` as described for `TPZliteInformer`, i.e. a dictionary with the attributes as keys and the associated errors as values. <br>

The other parameters from the inform stage are carried within the model so that we do not accidentally use conflicting values for them.  We do need to supply the name of the model file to use. This can be done either directly as the file name or, as we do in the cell below, with the `get_handle` method from our inform stage:

In [None]:
test_dict = dict(hdf5_groupname='photometry')

In [None]:
test_runner = TPZliteEstimator.make_stage(name="test_tpz", output="TPZ_demo_output.hdf5",
                                          model=pz_train.get_handle('model'), **test_dict)

Now let's run the code:

In [None]:
%%time
results = test_runner.estimate(test_data)

This took about 6.5 seconds on my Mac desktop, not the fastest photo-z code, but not unreasonable for 20,000 galaxies.  

## Plotting point estimates and an example PDF

Now let's make a few diagnostic plots.  TPZ does calculate the PDF mode for each galaxy and stores this as ancillary data, so we can plot a point estimate vs the true redshift:

In [None]:
sz = test_data()['photometry']['redshift']
zmode = results().ancil['zmode']

In [None]:
plt.figure(figsize=(8,8))
plt.scatter(sz,zmode, s=2,c='k')
plt.plot([0,3],[0,3],'r--')
plt.xlabel("redshift", fontsize=15)
plt.ylabel("TPZ mode", fontsize=15)
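
A scatter plot like the one above is often summarized with a few numbers. The sketch below computes three commonly used photo-z point-estimate statistics on the scaled residual (z_est - z_true) / (1 + z_true): the median bias, the normalized median absolute deviation, and the fraction of outliers beyond a cut. The function name and the 0.15 cut are choices made for this illustration, and the input here is synthetic data rather than the TPZ output.

```python
import numpy as np


def point_estimate_metrics(z_true, z_est, outlier_cut=0.15):
    """Median bias, sigma_NMAD, and outlier fraction of (z_est - z_true) / (1 + z_true)."""
    z_true = np.asarray(z_true, dtype=float)
    z_est = np.asarray(z_est, dtype=float)
    dz = (z_est - z_true) / (1.0 + z_true)
    bias = np.median(dz)
    # 1.4826 scales the median absolute deviation to match sigma for a Gaussian.
    sigma_nmad = 1.4826 * np.median(np.abs(dz - bias))
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return bias, sigma_nmad, outlier_frac


# Synthetic stand-ins for the true redshifts and point estimates (not TPZ results).
rng = np.random.default_rng(0)
z_true = rng.uniform(0.0, 3.0, size=5000)
z_est = z_true + 0.03 * (1.0 + z_true) * rng.normal(size=z_true.size)

bias, sigma_nmad, outlier_frac = point_estimate_metrics(z_true, z_est)
print(f"bias={bias:.4f}  sigma_NMAD={sigma_nmad:.4f}  outlier fraction={outlier_frac:.4f}")
```

With the arrays from the cells above, the call would be `point_estimate_metrics(sz, zmode)`, assuming both are one-dimensional arrays of the same length.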

Not bad, a handful of outliers, no obvious biases.  Let's also plot an individual redshift PDF:

In [None]:
which=5355
fig, axs = plt.subplots()
results().plot_native(key=which,axes=axs, label=f"PDF for galaxy {which}")
axs.axvline(sz[which],c='r',ls='--', label="true redshift")
plt.legend(loc='upper right', fontsize=12)
axs.set_xlabel("redshift")

You can experiment by changing the integer value of `which` above to see some of the different PDF shapes, though in general you will see peaks corresponding to the values in the terminal leaves of the trees with Gaussian scatter added on top.  For well constrained areas of parameter space, all will have similar redshifts and result in a nice unimodal peak; for others there will be multiple redshift bumps.
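
To see how a handful of leaf redshifts can turn into a smooth, possibly multi-peaked PDF, here is a minimal numpy sketch: histogram the values on a redshift grid, then convolve with a Gaussian kernel. It assumes the kernel width is `sigmafactor` times the grid resolution, which is one reading of the parameter description above; the leaf values and the `sigmafactor` of 3.0 are made up, and this is not the TPZ implementation.

```python
import numpy as np

zgrid = np.linspace(0.0, 3.0, 301)
resolution = zgrid[1] - zgrid[0]

# Hypothetical leaf redshifts collected for one galaxy across all trees.
leaf_z = np.array([0.41, 0.42, 0.42, 0.44, 0.45, 1.31, 1.33])

# Histogram with one bin centered on each grid point.
edges = np.concatenate([zgrid - resolution / 2, [zgrid[-1] + resolution / 2]])
hist, _ = np.histogram(leaf_z, bins=edges)

# Gaussian kernel with sigma = sigmafactor * resolution (assumed meaning),
# expressed in units of grid steps and truncated at 4 sigma.
sigmafactor = 3.0
half_width = int(np.ceil(4 * sigmafactor))
offsets = np.arange(-half_width, half_width + 1)
kernel = np.exp(-0.5 * (offsets / sigmafactor) ** 2)
kernel /= kernel.sum()

# Smooth, then normalize so the PDF integrates to 1 on the grid.
pdf = np.convolve(hist, kernel, mode="same")
pdf /= pdf.sum() * resolution

zmode = zgrid[np.argmax(pdf)]
print(f"mode of the smoothed PDF: {zmode:.2f}")
```

The two clusters of leaf values produce two bumps, with the mode falling in the more populated low-redshift cluster.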