# Simulate Object Photo-z

Contact: Melissa Graham <br>
Last verified to run 2022-10-22 with LSST Science Pipelines version w40. <br>

## The CMNN Photo-z Estimator

The CMNN PZ Estimator is a toy estimator that is used primarily to evaluate LSST observing strategies. 

A full description of the Color-Matched Nearest-Neighbors (CMNN) Photometric Redshift Estimator can be found in the following journal articles:
 * <a href="https://ui.adsabs.harvard.edu/abs/2018AJ....155....1G/abstract">Photometric Redshifts with the LSST: Evaluating Survey Observing Strategies</a> (Graham et al. 2018) 
 * <a href="https://ui.adsabs.harvard.edu/abs/2020AJ....159..258G/abstract">Photometric Redshifts with the LSST. II. The Impact of Near-infrared and Near-ultraviolet Photometry</a> (Graham et al. 2020)

A full-featured version of the CMNN PZ Estimator can also be found on GitHub: https://github.com/dirac-institute/CMNN_Photoz_Estimator

## WARNINGS

**This notebook uses a *very simple* version of the CMNN PZ Estimator** with a leave-one-out analysis.

This simplified version of the CMNN PZ Estimator:
 - does not handle sparse regions of color-redshift space as well as the full-featured version
 - does not have the capability to apply priors in magnitude or color
 - will not reproduce the photo-z quality demonstrated in the above papers
 - is not scalable to estimating photo-z for millions of DP0.2 Objects
 - should not be used for any scientific studies

But it does make for a useful learning tool!

## Set up

In [None]:
import time
import numpy as np
import pandas
pandas.set_option('display.max_rows', 200)

import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import chi2

import datetime

In [None]:
import warnings
from astropy.units import UnitsWarning
warnings.simplefilter("ignore", category=UnitsWarning)

In [None]:
from lsst.rsp import get_tap_service, retrieve_query
service = get_tap_service()

## 1.0. Use the TAP service to retrieve true galaxies matched to Objects

Query constraints:
 * `truth_type` = 1 (true galaxies only)
 * 0.05 < true `redshift` < 2.0
 * matched to a detected `Object`
 * measured apparent i-band magnitude < 25.2

Spatial constraints: trial and error showed that a search radius of 0.7 degrees retrieves about 200,000 true galaxies.

This query can take about ten minutes to run.
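
The `scisql_nanojanskyToAbMag` and `scisql_nanojanskyToAbMagSigma` functions in the query below convert fluxes in nanojanskys to AB magnitudes and magnitude errors on the database side. As a rough sketch of the equivalent arithmetic, assuming the standard AB zero point of 31.4 mag for fluxes in nJy and first-order error propagation (the helper names are hypothetical and not part of any LSST package, and returning NaN for non-positive fluxes is a choice made for this sketch):

```python
import numpy as np

AB_ZEROPOINT_NJY = 31.4  # AB magnitude of a 1 nJy source (3631 Jy is 0 mag)


def njy_to_abmag(flux_njy):
    """Convert fluxes in nJy to AB magnitudes (NaN where flux <= 0)."""
    flux = np.atleast_1d(np.asarray(flux_njy, dtype=float))
    mag = np.full(flux.shape, np.nan)
    positive = flux > 0
    mag[positive] = -2.5 * np.log10(flux[positive]) + AB_ZEROPOINT_NJY
    return mag


def njy_to_abmag_err(flux_njy, flux_err_njy):
    """First-order magnitude error: (2.5 / ln 10) * flux_err / flux."""
    flux = np.atleast_1d(np.asarray(flux_njy, dtype=float))
    err = np.atleast_1d(np.asarray(flux_err_njy, dtype=float))
    mag_err = np.full(flux.shape, np.nan)
    positive = flux > 0
    mag_err[positive] = (2.5 / np.log(10.0)) * err[positive] / flux[positive]
    return mag_err


# made-up fluxes: 100 nJy corresponds to 26.4 mag with this zero point
print(njy_to_abmag([100.0, -1.0]))
print(njy_to_abmag_err([1000.0], [10.0]))
```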

In [None]:
query = "SELECT mt.id_truth_type AS mt_id_truth_type, "\
        "mt.match_objectId AS mt_match_objectId, "\
        "ts.truth_type AS ts_truth_type, "\
        "ts.redshift AS ts_redshift, "\
        "scisql_nanojanskyToAbMag(ts.flux_u) AS ts_mag_u, "\
        "scisql_nanojanskyToAbMag(ts.flux_g) AS ts_mag_g, "\
        "scisql_nanojanskyToAbMag(ts.flux_r) AS ts_mag_r, "\
        "scisql_nanojanskyToAbMag(ts.flux_i) AS ts_mag_i, "\
        "scisql_nanojanskyToAbMag(ts.flux_z) AS ts_mag_z, "\
        "scisql_nanojanskyToAbMag(ts.flux_y) AS ts_mag_y, "\
        "scisql_nanojanskyToAbMag(obj.u_cModelFlux) AS obj_cModelMag_u, "\
        "scisql_nanojanskyToAbMag(obj.g_cModelFlux) AS obj_cModelMag_g, "\
        "scisql_nanojanskyToAbMag(obj.r_cModelFlux) AS obj_cModelMag_r, "\
        "scisql_nanojanskyToAbMag(obj.i_cModelFlux) AS obj_cModelMag_i, "\
        "scisql_nanojanskyToAbMag(obj.z_cModelFlux) AS obj_cModelMag_z, "\
        "scisql_nanojanskyToAbMag(obj.y_cModelFlux) AS obj_cModelMag_y, "\
        "scisql_nanojanskyToAbMagSigma(obj.u_cModelFlux,obj.u_cModelFluxErr) AS obj_cModelMagErr_u, "\
        "scisql_nanojanskyToAbMagSigma(obj.g_cModelFlux,obj.g_cModelFluxErr) AS obj_cModelMagErr_g, "\
        "scisql_nanojanskyToAbMagSigma(obj.r_cModelFlux,obj.r_cModelFluxErr) AS obj_cModelMagErr_r, "\
        "scisql_nanojanskyToAbMagSigma(obj.i_cModelFlux,obj.i_cModelFluxErr) AS obj_cModelMagErr_i, "\
        "scisql_nanojanskyToAbMagSigma(obj.z_cModelFlux,obj.z_cModelFluxErr) AS obj_cModelMagErr_z, "\
        "scisql_nanojanskyToAbMagSigma(obj.y_cModelFlux,obj.y_cModelFluxErr) AS obj_cModelMagErr_y "\
        "FROM dp02_dc2_catalogs.MatchesTruth AS mt "\
        "JOIN dp02_dc2_catalogs.TruthSummary AS ts ON mt.id_truth_type = ts.id_truth_type "\
        "JOIN dp02_dc2_catalogs.Object AS obj ON mt.match_objectId = obj.objectId "\
        "WHERE CONTAINS(POINT('ICRS', ts.ra, ts.dec), CIRCLE('ICRS', 62.0, -37.0, 0.7)) = 1 "\
        "AND ts.truth_type = 1 "\
        "AND ts.redshift > 0.05 "\
        "AND ts.redshift < 2.00 "\
        "AND scisql_nanojanskyToAbMag(obj.i_cModelFlux) < 25.2"
print(query)

In [None]:
%%time
job = service.submit_job(query)
print('Job URL is', job.url)
print('Job phase is', job.phase)
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
print('Job phase is', job.phase)

In [None]:
# job.raise_if_error()

In [None]:
results = job.fetch_result().to_table()

In [None]:
# results

### 1.1 Use numpy arrays

NumPy arrays have proved faster than pandas data frames in the past, though this may depend on the architecture and the number of objects.

Whether NumPy is optimal for this application is unconfirmed, but this demo uses it.

In [None]:
data_id = np.asarray(results['mt_match_objectId'], dtype='int')

In [None]:
# true ("spec") redshifts
data_tz = np.asarray(results['ts_redshift'], dtype='float')

# true ("spec") magnitudes
data_tm = np.transpose(np.asarray((results['ts_mag_u'],results['ts_mag_g'],\
                                   results['ts_mag_r'],results['ts_mag_i'],\
                                   results['ts_mag_z'],results['ts_mag_y']),\
                                  dtype='float' ) )

# object apparent magnitudes
data_om = np.transpose(np.asarray((results['obj_cModelMag_u'],results['obj_cModelMag_g'],\
                                   results['obj_cModelMag_r'],results['obj_cModelMag_i'],\
                                   results['obj_cModelMag_z'],results['obj_cModelMag_y']),\
                                  dtype='float' ) )

# object apparent magnitude errors
data_ome = np.transpose(np.asarray((results['obj_cModelMagErr_u'],results['obj_cModelMagErr_g'],\
                                    results['obj_cModelMagErr_r'],results['obj_cModelMagErr_i'],\
                                    results['obj_cModelMagErr_z'],results['obj_cModelMagErr_y']),\
                                   dtype='float' ) )

In [None]:
# true ("spec") and object colors and color errors
data_tc = np.zeros( (len(data_om),5), dtype='float' )
data_oc = np.zeros( (len(data_om),5), dtype='float' )
data_oce = np.zeros( (len(data_om),5), dtype='float' )

data_tc[:,0] = data_tm[:,0] - data_tm[:,1]
data_tc[:,1] = data_tm[:,1] - data_tm[:,2]
data_tc[:,2] = data_tm[:,2] - data_tm[:,3]
data_tc[:,3] = data_tm[:,3] - data_tm[:,4]
data_tc[:,4] = data_tm[:,4] - data_tm[:,5]

data_oc[:,0] = data_om[:,0] - data_om[:,1]
data_oc[:,1] = data_om[:,1] - data_om[:,2]
data_oc[:,2] = data_om[:,2] - data_om[:,3]
data_oc[:,3] = data_om[:,3] - data_om[:,4]
data_oc[:,4] = data_om[:,4] - data_om[:,5]

data_oce[:,0] = np.sqrt( data_ome[:,0]**2 + data_ome[:,1]**2 )
data_oce[:,1] = np.sqrt( data_ome[:,1]**2 + data_ome[:,2]**2 )
data_oce[:,2] = np.sqrt( data_ome[:,2]**2 + data_ome[:,3]**2 )
data_oce[:,3] = np.sqrt( data_ome[:,3]**2 + data_ome[:,4]**2 )
data_oce[:,4] = np.sqrt( data_ome[:,4]**2 + data_ome[:,5]**2 )
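
The column-by-column assignments above can also be written with array slicing. A minimal sketch on a small made-up magnitude array (the values are illustrative only and are not DP0.2 data; the variable names are hypothetical):

```python
import numpy as np

# hypothetical magnitudes and errors for 2 galaxies in (u, g, r, i, z, y)
mags = np.array([[25.0, 24.2, 23.5, 23.1, 22.9, 22.8],
                 [24.0, 23.9, 23.6, np.nan, 23.0, 22.7]])
mag_errs = np.array([[0.30, 0.10, 0.05, 0.05, 0.08, 0.20],
                     [0.20, 0.10, 0.06, np.nan, 0.09, 0.25]])

# adjacent-band colors: u-g, g-r, r-i, i-z, z-y
colors = mags[:, :-1] - mags[:, 1:]

# color errors: magnitude errors of the two bands added in quadrature
color_errs = np.hypot(mag_errs[:, :-1], mag_errs[:, 1:])

print(colors.shape)
print(colors)
```

A missing magnitude (NaN) propagates into both colors that use that band, which is why the second galaxy here has only three usable colors.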

### 1.2 Plot color vs. redshift

To see what the galaxy properties are like, plot the true colors against the true redshift.

Note that the redshift distribution shows structure.
This is cosmic variance, an effect of using a small area.

In [None]:
fig, axs = plt.subplots(2,3, figsize=(10, 5))
fig.suptitle('true galaxy color vs. true redshift')
axs[0,0].plot(data_tz, data_tc[:,0], 'o', ms=2, mew=0, alpha=0.01, color='darkviolet')
axs[0,0].set_ylabel('u-g')
axs[0,0].set_ylim([-1,2])
axs[0,1].plot(data_tz, data_tc[:,1], 'o', ms=2, mew=0, alpha=0.01, color='darkgreen')
axs[0,1].set_ylabel('g-r')
axs[0,1].set_ylim([-1,2])
axs[0,2].plot(data_tz, data_tc[:,2], 'o', ms=2, mew=0, alpha=0.01, color='darkorange')
axs[0,2].set_ylabel('r-i')
axs[0,2].set_ylim([-1,2])
axs[1,0].plot(data_tz, data_tc[:,3], 'o', ms=2, mew=0, alpha=0.01, color='firebrick')
axs[1,0].set_ylabel('i-z')
axs[1,0].set_xlabel('redshift')
axs[1,0].set_ylim([-1,2])
axs[1,1].plot(data_tz, data_tc[:,4], 'o', ms=2, mew=0, alpha=0.01, color='saddlebrown')
axs[1,1].set_ylabel('z-y')
axs[1,1].set_xlabel('redshift')
axs[1,1].set_ylim([-1,2])
axs[1,2].hist(data_tz, color='grey', bins=100)
axs[1,2].set_xlabel('redshift')
plt.show()

<br>

## 2.0 Estimate photo-z

This notebook uses a leave-one-out method: for each galaxy that we retrieved (i.e., each "test" galaxy), we use *all the other galaxies* and their true redshifts as the "training" set.

For each test galaxy, the estimator identifies a color-matched nearest-neighbors (CMNN) subset of training galaxies.

This process starts by calculating the Mahalanobis distance in color-space between the test galaxy and all training galaxies:

$D_M = \sum_{i=1}^{N_{\rm color}} \frac{( c_{{\rm test},i} - c_{{\rm train},i} )^2}{ (\delta c_{{\rm test},i})^2}$

where 
 - $c_{{\rm train},i}$ is the $i$-th (true/spec) color of the training-set galaxy (`data_tc`),
 - $c_{{\rm test},i}$ is the $i$-th (Object/observed) color of the test-set galaxy (`data_oc`),
 - $\delta c_{{\rm test},i}$ is the uncertainty in that color of the test galaxy (`data_oce`), and
 - $N_{\rm color}$ is the number of colors measured for both the test- and training-set galaxy.
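
As an illustration of this distance, here is a minimal sketch for one test galaxy and three training galaxies. All values and variable names are made up for the example. The count of usable colors here is taken as the number of finite terms in the sum, which is an assumption of this sketch.

```python
import numpy as np

# hypothetical colors: one test galaxy, three training galaxies, five colors
test_color = np.array([0.8, 0.7, 0.4, 0.2, 0.1])
test_color_err = np.array([0.2, 0.1, 0.1, 0.1, 0.2])
train_colors = np.array([
    [0.8, 0.7, 0.4, 0.2, 0.1],     # identical to the test galaxy
    [1.0, 0.8, 0.4, 0.2, 0.1],     # offset in u-g and g-r
    [0.8, 0.7, np.nan, 0.2, 0.5],  # r-i missing, offset in z-y
])

# broadcasting: the (5,) test arrays are compared with every (3, 5) row
terms = (test_color - train_colors) ** 2 / test_color_err ** 2

# sum over colors, skipping any color that is missing (NaN)
d_m = np.nansum(terms, axis=1)
n_color = np.sum(np.isfinite(terms), axis=1)

print(d_m)
print(n_color)
```

The three distances come out at approximately 0, 2, and 4, with 5, 5, and 4 colors in common.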

A threshold value is then applied to all training-set galaxies to identify the CMNN subset (i.e., those which are "well-matched" in color).

This threshold value is defined by the percent point function (PPF).
For example, with $N_{\rm color}=5$ degrees of freedom and a PPF of 68%, 68% of all training galaxies consistent with the test galaxy will have $D_M < 5.86$.
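
The quoted threshold can be checked numerically. The sketch below assumes that the color errors are Gaussian and independent, in which case $D_M$ for a truly matching galaxy is a sum of five squared unit-Gaussian deviates (a chi-squared variable with five degrees of freedom). The seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, for repeatability
n_color = 5
n_draws = 200_000

# each row: n_color unit-Gaussian color residuals, squared and summed
d_m = np.sum(rng.standard_normal((n_draws, n_color)) ** 2, axis=1)

# fraction of simulated matches that fall below the threshold
fraction_below = np.mean(d_m < 5.86)
print(round(float(fraction_below), 3))
```

The printed fraction should be close to 0.68, up to sampling noise.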

A training galaxy is then selected randomly from the CMNN subset.
Its redshift is used as the test-set galaxy's photometric redshift.
The standard deviation in redshifts of all CMNN subset training galaxies is used as the uncertainty in the photo-_z_ estimate.
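
The selection step can be sketched as a small function. This is an illustration under simplifying assumptions, not the notebook's implementation: the function name and inputs are hypothetical, and it omits the minimum-number-of-colors requirement and the exclusion of the test galaxy itself.

```python
import numpy as np


def pick_photoz(d_m, train_z, threshold, rng):
    """Return (photo-z, uncertainty) from the CMNN subset, or (nan, nan)."""
    subset = np.where(d_m <= threshold)[0]
    if len(subset) == 0:
        # no training galaxy is well matched in color
        return float('nan'), float('nan')
    chosen = rng.choice(subset)  # one random member of the subset
    return float(train_z[chosen]), float(np.std(train_z[subset]))


rng = np.random.default_rng(0)                # arbitrary seed
d_m = np.array([0.5, 2.0, 4.0, 9.0])          # hypothetical distances
train_z = np.array([0.50, 0.54, 0.46, 1.30])  # hypothetical true redshifts

photoz, photoz_err = pick_photoz(d_m, train_z, threshold=5.86, rng=rng)
print(photoz, photoz_err)
```

With these inputs the first three training galaxies form the CMNN subset, so the photo-z is one of their redshifts and the uncertainty is the standard deviation of all three.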

<br>

### 2.1 Set the tunable parameters

This simple version of the CMNN estimator takes just two tunable parameters:

(1) The percent point function (`cmnn_ppf`), as described above. The default is 0.68.

(2) The minimum number of colors (`cmnn_minNclr`) that a training-set galaxy must have in common with the test galaxy. The default is 5 (i.e., all five colors). This parameter could be lowered if magnitude cuts are applied, leaving some galaxies undetected in some bands.

In [None]:
cmnn_ppf = 0.68 
cmnn_minNclr = 5

We build a lookup table of thresholds and use it because calling `chi2.ppf` repeatedly is slow. As described above, the threshold values are based on the desired percent point function (PPF).

In [None]:
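cmnn_thresh_table = np.zeros(6, dtype='float')
for d in range(1, 6):
    cmnn_thresh_table[d] = chi2.ppf(cmnn_ppf, d)
# zero degrees of freedom (no colors in common): the threshold stays 0.0,
# and the estimator later rejects any training galaxy with a zero threshold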

for d in range(6):
    print('degrees of freedom = %1i, threshold = %5.3f' % (d, np.round(cmnn_thresh_table[d],3)))
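
Because the table is indexed by the number of degrees of freedom, a whole array of per-galaxy values can be mapped to thresholds in one step with NumPy integer-array indexing, as an alternative to looping over each possible value. A minimal sketch with placeholder numbers (the table values here are not `chi2.ppf` output, and the variable names are hypothetical):

```python
import numpy as np

# hypothetical threshold table indexed by degrees of freedom, 0 through 5
thresh_table = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# hypothetical degrees of freedom for five training galaxies
dof = np.array([5, 3, 5, 0, 4])

# integer-array indexing picks thresh_table[d] for every d in dof
thresholds = thresh_table[dof]
print(thresholds)
```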

<br>

### 2.2 Estimate the photo-z

Make arrays to hold the photo-z and its uncertainty for all of the galaxies, initialized to -1.

In [None]:
data_pz = np.zeros(len(data_tz), dtype='float') - 1.0
data_pze = np.zeros(len(data_tz), dtype='float') - 1.0

Set `Ncalc` to be how many test-set galaxies you want photo-z estimates for.

**WARNING:** It takes about 30 min to do 100,000 test galaxies (about 1.5 min to do 5,000 test galaxies).

In [None]:
Ncalc = 5000

Choose a random set of galaxies as the test set.

In [None]:
rx = np.random.choice(len(data_tz), Ncalc, replace=False)
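
The draw above changes on every run. If a repeatable test set is wanted, one option is a seeded NumPy `Generator`. This is a sketch of that alternative, not what the notebook does; the seed and sizes are arbitrary.

```python
import numpy as np

n_galaxies = 1000  # hypothetical catalog size
n_test = 10        # hypothetical number of test galaxies

# two generators with the same seed produce the same draw
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
test_a = rng_a.choice(n_galaxies, size=n_test, replace=False)
test_b = rng_b.choice(n_galaxies, size=n_test, replace=False)

print(np.array_equal(test_a, test_b))
```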

Calculate photo-z and uncertainties using the CMNN method.

In [None]:
%%time

t1 = datetime.datetime.now()

for i,r in enumerate(rx):
    if (i > 0) and (i in (100, 1000, Ncalc-1000)):
        t2 = datetime.datetime.now()
        # print the elapsed time and the estimated time remaining
        print(i, t2-t1, ((t2-t1)/float(i))*float(Ncalc-i), ' remaining' )

    # calculate DM and DOF (the number of colors that are finite for both galaxies)
    DM  = np.nansum((data_oc[r,:] - data_tc[:,:])**2 / data_oce[r,:]**2, axis=1, dtype='float')
    DOF = np.sum(np.isfinite(data_oc[r,:]) & np.isfinite(data_tc[:,:]), axis=1, dtype='int')
    
    # calculate the thresholds
    data_th = np.zeros(len(DOF), dtype='float')
    for d in range(6):
        tx = np.where(DOF == d)[0]
        data_th[tx] = cmnn_thresh_table[d]
        del tx
    
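    # leave-one-out: give the test galaxy itself a DM above any threshold
    # so that it cannot be selected as its own training galaxy
    DM[r] = 99.9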

    # identify the CMNN subset of training-set galaxies:
    # those for which the DM is less than the threshold
    ix = np.where((DOF >= cmnn_minNclr) & (data_th > 0.00010) & \
                  (DM > 0.00010) & (DM <= data_th))[0]
    
    if len(ix) > 0:
        # choose a random training-set galaxy from the CMNN subset
        rix = np.random.choice(ix, size=1, replace=False)[0]
        data_pz[r] = data_tz[rix]
        data_pze[r] = np.std(data_tz[ix])
        del rix
    else:
        data_pz[r] = float('nan')
        data_pze[r] = float('nan')
        
    del DM, DOF, data_th, ix

#### 2.2.1 Quick check of success rate

In [None]:
tx = np.where( np.isnan(data_pz) )[0]
print( len(tx), ' galaxies did not get a pz estimate' )
del tx

tx = np.where( data_pz > 0.0 )[0]
print( len(tx), ' galaxies did get a pz estimate' )
del tx

<br>

### 2.3 Plot the photo-z results

#### 2.3.1 Plot the photometric *versus* the true redshifts.

In [None]:
tx = np.where( data_pz > 0.0 )[0]

fig = plt.figure(figsize=(4,4))
plt.plot( [0.0,2.0], [0.0,2.0], ls='solid', lw=1, color='firebrick')
plt.plot( data_tz[tx], data_pz[tx], 'o', ms=5, mew=0, alpha=0.1, color='grey' )
plt.xlabel('true redshift')
plt.ylabel('photometric redshift')
plt.show()

del tx

#### 2.3.2 Plot the photo-z uncertainty *versus* the photo-z accuracy.

The photo-z accuracy is the absolute value of the difference between the true and photometric redshifts.

Recall that the photo-z uncertainty is the standard deviation of the true redshifts of the training-set galaxies in the CMNN subset, as described in Section 2. Galaxies with an uncertainty of zero have only one training-set galaxy in their CMNN subset, since the standard deviation of a single value is zero. The full CMNN PZ Estimator treats such galaxies more robustly (see Section 3).
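
A small sketch of both quantities with made-up redshifts shows where the zero uncertainties come from (all values and names are hypothetical):

```python
import numpy as np

# hypothetical true and photometric redshifts for two test galaxies
true_z = np.array([0.50, 1.20])
photo_z = np.array([0.54, 1.05])

# photo-z accuracy: absolute difference between true and photometric redshift
accuracy = np.abs(true_z - photo_z)

# photo-z uncertainty: standard deviation of the CMNN subset redshifts
single_neighbor = np.array([0.54])       # subset with one training galaxy
several = np.array([1.00, 1.05, 1.10])   # subset with three training galaxies

print(accuracy)
print(np.std(single_neighbor), np.std(several))
```

The single-member subset gives an uncertainty of exactly zero regardless of how accurate the estimate is.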

In [None]:
tx = np.where( data_pz > 0.0 )[0]

fig = plt.figure(figsize=(4,4))
plt.plot( [0.0,1.0], [0.0,1.0], ls='solid', lw=1, color='firebrick')
plt.plot( np.abs(data_tz[tx]-data_pz[tx]), data_pze[tx],\
         'o', ms=5, mew=0, alpha=0.1, color='grey' )
plt.xlabel('photo-z accuracy')
plt.ylabel('photo-z uncertainty')
plt.xlim([-0.05,1.0])
plt.ylim([-0.05,1.0])
plt.show()

del tx

<br>

## 3.0 Future Work

(1) Generate a *separate* training set of ~200000 galaxies, and then apply it to a *separate* test set of many more galaxies.

(2) Install the full CMNN PZ Estimator as a package, and demonstrate how to use it. This might take some modification. The full package features more parameters, and modules for statistical analysis and plotting. The additional parameters include magnitude and color priors, alternatives to randomly selecting from the CMNN subset, and more robust treatment of test-set galaxies with few training-set galaxies in the CMNN subset. 

(3) Demonstrate whether, and by how much, photo-z are worse when estimated from non-cModel fluxes.

(4) Allow the test set to include stars mis-identified as extended objects. What are their photo-z like?
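
For item (1), one possible starting point is to split the retrieved galaxies into disjoint training and test index sets. The sketch below is only an assumption about how that might be done, with hypothetical sizes and an arbitrary seed.

```python
import numpy as np

n_galaxies = 20  # hypothetical number of retrieved galaxies
n_train = 15     # hypothetical training-set size

rng = np.random.default_rng(7)  # arbitrary seed
shuffled = rng.permutation(n_galaxies)

# the first n_train shuffled indices train, the remainder are tested
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

print(len(train_idx), len(test_idx))
```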