# The Pipeline 

If you have a galaxy catalog (either of `Parametric` origin or from a simulation), an [`EmissionModel`](../emission_models/emission_models.rst), and a set of [instruments](../instrumentation/instrument_example.ipynb) you want observables for, you can easily write a pipeline to generate the observations you want using the Synthesizer UI. However, let's say you have a new catalog you want to run the same analysis on, or a whole different set of instruments you want to use. You could modify your old pipeline or write a whole new one, but that's a lot of work and boilerplate.

This is where the `Pipeline` shines. Instead of having to write a pipeline yourself, you can use the `Pipeline` class, a high-level interface that allows you to easily generate observations for a given catalog, emission model, and set of instruments. All you need to do is load your galaxies, set up the ``Pipeline`` object, and run the observable methods you want to include. Possible observables include:

- Spectra.
- Emission Lines.
- Photometry.
- Images (with or without PSF convolution/noise).
- Spectral data cubes (IFUs) [WIP].
- Instrument specific spectroscopy [WIP].

The ``Pipeline`` will generate all the requested observations for all (compatible) instruments and galaxies, before writing them out to a standardised HDF5 format.

As a bonus, the abstraction into the `Pipeline` class allows for easy parallelization of the analysis, not only over local threads but also distributed over MPI.

In the following sections we will show how to instantiate and use a ``Pipeline`` object to generate observations for a given catalog, emission model, and set of instruments.

## Setting up a ``Pipeline`` object

Before we instantiate a ``Pipeline`` we need to define its "dependencies". These are an emission model, a set of instruments, and, importantly, some galaxies to observe.

### Defining an emission model

The ``EmissionModel`` defines the emissions we'll generate, including the origin and any reprocessing the emission undergoes. For more details see the ``EmissionModel`` [docs](../emission_models/emission_models.rst). 

For demonstration, we'll use a simple premade ``IntrinsicEmission`` model which defines the intrinsic stellar emission (i.e. stellar emission without any ISM dust reprocessing).

In [None]:
from synthesizer.emission_models import IntrinsicEmission
from synthesizer.grid import Grid

# Get the grid
grid_dir = "../../../tests/test_grid/"
grid_name = "test_grid"
grid = Grid(grid_name, grid_dir=grid_dir)

model = IntrinsicEmission(grid, fesc=0.1)
model.set_per_particle(True)  # we want per particle emissions

### Defining the instruments

We don't need any instruments if all we want is spectra at the resolution of the ``Grid`` or emission lines. However, to get anything more sophisticated we need ``Instruments`` that define the technical specifications of the observations we want to generate. For a full breakdown see the instrumentation [docs](../instrumentation/instrument_example.ipynb).

Here we'll define a simple set of instruments including a subset of NIRCam filters (capable of imaging with a 0.1 kpc resolution) and a set of UVJ top hat filters (only capable of photometry).

In [None]:
import numpy as np
from unyt import angstrom, kpc

from synthesizer.instruments import UVJ, FilterCollection, Instrument

# Get the filters
lam = np.linspace(10**3, 10**5, 1000) * angstrom
webb_filters = FilterCollection(
    filter_codes=[
        f"JWST/NIRCam.{f}"
        for f in ["F090W", "F150W", "F200W", "F277W", "F356W", "F444W"]
    ],
    new_lam=lam,
)
uvj_filters = UVJ(new_lam=lam)

# Instantiate the instruments
webb_inst = Instrument("JWST", filters=webb_filters, resolution=0.1 * kpc)
uvj_inst = Instrument("UVJ", filters=uvj_filters)
instruments = webb_inst + uvj_inst

print(instruments)

### Loading galaxies

You can load galaxies however you want but for this example we'll load some CAMELS galaxies using the `load_data` module.

In [None]:
import numpy as np

from synthesizer.load_data.load_camels import load_CAMELS_IllustrisTNG

# Create galaxy object
galaxies = load_CAMELS_IllustrisTNG(
    "../../../tests/data/",
    snap_name="camels_snap.hdf5",
    group_name="camels_subhalo.hdf5",
    physical=True,
)

### Instantiating the ``Pipeline`` object

Now we have all the ingredients we need to instantiate a ``Pipeline`` object. All we need to do is pass the emission model and instruments into the ``Pipeline`` object alongside the number of threads we want to use during the analysis (in this notebook we'll only use 1 for such a small handful of galaxies).

In [None]:
from synthesizer.survey import Pipeline

survey = Pipeline(
    emission_model=model,
    instruments=instruments,
    nthreads=1,
    verbose=1,
)

Notice that we got a log out of the ``Pipeline`` object detailing the basic setup. The ``Pipeline`` will automatically output logging information to the console, but this can be suppressed by passing ``verbose=0``, which limits the outputs to saying hello, goodbye, and any errors that occur.

## Adding analysis functions

We could just run the analysis now and get whatever predefined outputs we want. However, we can also add our own analysis functions to the ``Pipeline`` object. These functions will be run on each galaxy in the catalog and can be used to generate any additional outputs we want. Importantly, these functions will be run **after** all other analysis has finished so they can make use of any outputs generated by the ``Pipeline`` object.

Below we'll define an analysis function to compute the stellar half mass radius of each galaxy. Any extra analysis functions must obey the following rules:

- It must calculate the "result" for a single galaxy at a time.
- The function's first argument must be the galaxy to calculate for.
- It must return an array of values or a scalar, such that ``np.array(<list of results>)`` is a valid operation. In other words, the results once combined for all galaxies should be an array of shape ``(n_galaxies, <result shape>)``.
- It can take any number of additional arguments and keyword arguments, but **beware**, adding large objects to the function signature will slow down the threadpools due to the need to serialise and deserialise these objects.
- It can use results of previously added functions if these attached anything to the galaxy itself (i.e. functions will be run in the order they are added).

In [None]:
def get_stellar_half_mass_radius(gal):
    """
    Compute the stellar half mass radius.

    Args:
        gal (Galaxy):
            The galaxy to compute the half mass radius of.
    """
    return gal.stars.get_half_mass_radius()

To add this to the ``Pipeline`` we need to pass it along with a string defining the key under which the results will be stored in the HDF5 file.

In [None]:
survey.add_analysis_func(
    get_stellar_half_mass_radius, result_key="Stars/HalfMassRadius"
)

This can also be done with simple ``lambda`` functions to include galaxy attributes in the output. For instance, the redshift.

In [None]:
survey.add_analysis_func(lambda gal: gal.redshift, result_key="Redshift")
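
The shape rule for analysis functions can be checked outside the pipeline before a function is added. The sketch below is only an illustration: `FakeGalaxy`, `total_mass`, and `mass_percentiles` are hypothetical stand-ins (not Synthesizer objects), and the stacking step simply emulates the ``np.array(<list of results>)`` operation described above.

```python
import numpy as np


class FakeGalaxy:
    """Hypothetical stand-in for a galaxy; not a Synthesizer class."""

    def __init__(self, redshift, masses):
        self.redshift = redshift
        self.masses = np.asarray(masses, dtype=float)


def total_mass(gal):
    """Return a scalar result for a single galaxy."""
    return gal.masses.sum()


def mass_percentiles(gal):
    """Return a fixed-length array result for a single galaxy."""
    return np.percentile(gal.masses, [16, 50, 84])


fake_galaxies = [
    FakeGalaxy(5.0, [1.0, 2.0, 3.0]),
    FakeGalaxy(6.0, [4.0, 5.0]),
    FakeGalaxy(7.0, [6.0]),
]

# Emulate collecting one result per galaxy and stacking them
scalar_results = np.array([total_mass(g) for g in fake_galaxies])
array_results = np.array([mass_percentiles(g) for g in fake_galaxies])

print(scalar_results.shape)  # (n_galaxies,)
print(array_results.shape)  # (n_galaxies, 3)
```

Both functions return the same shape for every galaxy, so the combined results form a regular array. A function returning one value per particle would produce a different length for each galaxy and would not stack in this way.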

## Running the pipeline

To run the pipeline we just need to attach our galaxies and then call the various observable generation methods. This approach allows you to explicitly control which observables you want to generate with a single line of code for each.

### Loading the galaxies

First we'll attach the galaxies. We don't need it here but you can also pass an array of indexes for labelling the galaxies if you want, otherwise they'll be assigned from 0-N, where N is the number of galaxies.

In [None]:
survey.add_galaxies(galaxies)

### Generating the observables

Now that we have the galaxies, we can generate their observables by calling the various observable generation methods on the ``Pipeline`` object. These will automatically use the number of threads we defined when we instantiated the ``Pipeline`` object; if this was 1, everything will be done in serial.

The observable methods must be called in a particular order. For instance, you can't generate photometry without first generating the spectra. Each method checks that its dependencies have been satisfied and raises an error if they have not.
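
To make the ordering requirement concrete, here is a toy sketch of how a method can check its prerequisites before running. It is not the ``Pipeline`` implementation: `ToyPipeline`, its method names, and the choice of `RuntimeError` are all assumptions made for illustration, and only the spectra-before-photometry dependency is taken from the text above.

```python
class ToyPipeline:
    """Hypothetical illustration of dependency checking between steps."""

    # Each step maps to the steps that must already have run
    _requires = {
        "spectra": (),
        "photometry": ("spectra",),
    }

    def __init__(self):
        self._done = set()

    def _run(self, step):
        """Run a step only if all of its prerequisites have completed."""
        missing = [d for d in self._requires[step] if d not in self._done]
        if missing:
            raise RuntimeError(
                f"Cannot run '{step}' before: {', '.join(missing)}"
            )
        self._done.add(step)

    def get_spectra(self):
        self._run("spectra")

    def get_photometry(self):
        self._run("photometry")


toy = ToyPipeline()
message = None
try:
    toy.get_photometry()  # spectra have not been generated yet
except RuntimeError as err:
    message = str(err)
    print(message)

toy.get_spectra()
toy.get_photometry()  # succeeds now that spectra exist
```

Calling the methods in the order shown in the rest of this section avoids hitting such an error in the first place.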

We'll start with the spectra. If we want fluxes, we'll need to pass an ``astropy.cosmology`` object.

In [None]:
from astropy.cosmology import Planck18 as cosmo

survey.get_spectra(cosmo=cosmo)

Next we'll generate the emission lines. We can choose exactly which emission lines to generate by passing their line IDs. Here we'll just generate all lines offered by the ``Grid``.

In [None]:
survey.get_lines(line_ids=grid.available_lines)

Next, the photometry. This requires no extra inputs, but we have separate methods for luminosities and fluxes (with the latter requiring that an ``astropy.cosmology`` object was passed when the spectra were generated).

In [None]:
survey.get_photometry_luminosities()
survey.get_photometry_fluxes()

Finally, we'll generate the images. Again, these are split into luminosity and flux flavours. Here we define our field of view and pass that into each method. We are also doing "smoothed" imaging, where each particle is smoothed over its SPH kernel. For this style of image generation we need to pass the kernel array, which we'll extract here.

Had we defined instruments with PSFs and/or noise these methods would automatically generate images with these effects/contributions included.

In [None]:
from synthesizer.kernel_functions import Kernel

# Get the SPH kernel
sph_kernel = Kernel()
kernel = sph_kernel.get_kernel()

survey.get_images_luminosity(fov=50 * kpc, kernel=kernel)
survey.get_images_flux(fov=50 * kpc, kernel=kernel)

## Writing out the data

Finally, we write out the data to an HDF5 file. This file will contain all the observables we generated, as well as any additional analysis we ran. The file is structured to mirror the structure of Synthesizer objects, with each galaxy being a group, each component being a subgroup, and each observable being a dataset (or a set of subgroups with the observables as datasets at their leaves in the case of a dictionary attribute).

To write out the data we just pass the path of the output file to the ``write`` method.

Note that we are passing ``verbose=0`` to silence the dataset timings for these docs. Otherwise, we would get timings for the writing of individual datasets. In the wild these timings are useful but here they'd just bloat the demo.

In [None]:
survey.write("output.hdf5", verbose=0)
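
One way to explore a file like this without knowing its layout in advance is to walk it with `h5py` and list every dataset. The sketch below builds a tiny stand-in file so that it runs on its own; the dataset names in that stand-in are placeholders borrowed from the result keys used earlier and are not a statement of where the ``Pipeline`` actually places them.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny stand-in file so the example is self-contained. The group
# and dataset names are placeholders, not the Pipeline's real layout.
path = os.path.join(tempfile.mkdtemp(), "standin.hdf5")
with h5py.File(path, "w") as hdf:
    hdf.create_dataset("Stars/HalfMassRadius", data=np.ones(3))
    hdf.create_dataset("Redshift", data=np.full(3, 6.0))

# Collect the path and shape of every dataset in the file
found = {}


def record(name, obj):
    """Store the shape of each dataset encountered during the walk."""
    if isinstance(obj, h5py.Dataset):
        found[name] = obj.shape


with h5py.File(path, "r") as hdf:
    hdf.visititems(record)

for name, shape in sorted(found.items()):
    print(name, shape)
```

Pointing `path` at ``output.hdf5`` instead of the stand-in file lists the datasets the ``Pipeline`` wrote.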

## Putting it all together

Here is what the pipeline would look like without all the descriptive fluff...

In [None]:
survey = Pipeline(model, instruments)
survey.add_analysis_func(
    get_stellar_half_mass_radius, result_key="Stars/HalfMassRadius"
)
survey.add_analysis_func(lambda gal: gal.redshift, result_key="Redshift")
survey.add_galaxies(galaxies)
survey.get_spectra(cosmo=cosmo)
survey.get_lines(line_ids=grid.available_lines)
survey.get_photometry_luminosities()
survey.get_photometry_fluxes()
survey.get_images_luminosity(fov=50 * kpc, kernel=kernel)
survey.get_images_flux(fov=50 * kpc, kernel=kernel)
survey.write("output.hdf5", verbose=0)

## Hybrid parallelism with MPI

Above we demonstrated how to run a pipeline using only local shared memory parallelism. We can also use `mpi4py` to not only use the shared memory parallelism but also distribute the analysis across multiple nodes (hence "hybrid parallelism"). 

To make use of MPI we only need to make a couple of changes to running the pipeline. The first is simply that we need to pass the ``comm`` object to the ``Pipeline`` object when we instantiate it.

```python
from mpi4py import MPI

# Same arguments as the serial example, plus the MPI communicator
survey = Pipeline(
    emission_model=model,
    instruments=instruments,
    nthreads=4,
    verbose=1,
    comm=MPI.COMM_WORLD,
)
```

Note that ``verbose=1`` will mean only rank 0 will output logging information. If you want all ranks to output logging information you should set ``verbose=2``.

The only other thing we need to do is partition the galaxies **before** we load them.
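
The idea behind partitioning is that each rank only loads and processes its own share of the catalog. As a minimal sketch of that idea, assuming a simple split into contiguous blocks (the ``Pipeline`` may divide the work differently, and `partition_indices` is a hypothetical helper, not part of Synthesizer):

```python
def partition_indices(n_galaxies, size):
    """Split range(n_galaxies) into `size` contiguous, near-equal blocks."""
    base, extra = divmod(n_galaxies, size)
    blocks = []
    start = 0
    for rank in range(size):
        # The first `extra` ranks take one additional galaxy
        count = base + (1 if rank < extra else 0)
        blocks.append(list(range(start, start + count)))
        start += count
    return blocks


# Ten galaxies shared between four ranks
blocks = partition_indices(n_galaxies=10, size=4)
for rank, block in enumerate(blocks):
    print(rank, block)
```

Each rank would then load only the galaxies whose indices fall in its own block, which is why the partition has to happen before loading.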

Below we will spoof an MPI enabled ``Pipeline`` object to demonstrate this (we can't actually run MPI in a notebook).

In [None]:
# Make a survey to demo partitioning
survey = Pipeline(
    emission_model=model,
    instruments=instruments,
    nthreads=4,
    verbose=1,
)

# Fake the MPI ranks (you can ignore this)
survey.using_mpi = True
survey.rank = 0
survey.size = 4