# Simulate Object Photo-z

Contact: Melissa Graham <br>
Last verified to run Fri Mar 11, with LSST Science Pipelines version Weekly 2021_49.

## The CMNN Photo-z Estimator

A full description of the Color-Matched Nearest-Neighbors (CMNN) Photometric Redshift Estimator can be found in the following journal articles:
 * <a href="https://ui.adsabs.harvard.edu/abs/2018AJ....155....1G/abstract">Photometric Redshifts with the LSST: Evaluating Survey Observing Strategies</a> (Graham et al. 2018) 
 * <a href="https://ui.adsabs.harvard.edu/abs/2020AJ....159..258G/abstract">Photometric Redshifts with the LSST. II. The Impact of Near-infrared and Near-ultraviolet Photometry</a> (Graham et al. 2020)

The CMNN PZ Estimator can also be found on GitHub: https://github.com/dirac-institute/CMNN_Photoz_Estimator

**This notebook uses a *very simple* version of the CMNN PZ Estimator**, and this particular demo is not directly scalable to estimating photo-z for millions of objects (left for future work, see Section 3).

This notebook is only useful for estimating rudimentary photo-z for a small number of galaxies.

## Set up

In [None]:
import numpy as np
import pandas
pandas.set_option('display.max_rows', 200)

import matplotlib.pyplot as plt
%matplotlib inline

from scipy.stats import chi2

import datetime

In [None]:
import warnings
from astropy.units import UnitsWarning
warnings.simplefilter("ignore", category=UnitsWarning)

In [None]:
from lsst.rsp import get_tap_service, retrieve_query
service = get_tap_service()

<br>

## 1.0 Retrieve galaxies from the objects/truth_match catalogs

Selecting objects from a small arbitrary region near the center of the DC2 simulation.

**WARNING: In this simplistic demonstration, ALL of the selected galaxies are used as "training-set" galaxies. Selecting many more than 100000 galaxies is NOT recommended in THIS notebook because it will make the PZ estimation step much slower.** This notebook uses a simple leave-one-out method to estimate photo-z for a single set of galaxies, as described in Section 2. It is entirely possible to use the CMNN PZ Estimator to generate photo-z for more galaxies by generating *separate* test and training sets, but this is left to future work, as described in Section 3.

Only select objects that are clean, extended, and bright (i <= 25 mag).

Only select objects with a truth_match that is good, and is a galaxy (truth_type=1) with redshift > 0.05.

In [None]:
query = "SELECT obj.objectId, obj.ra, obj.dec, "\
        "obj.mag_u_cModel, obj.mag_g_cModel, obj.mag_r_cModel, "\
        "obj.mag_i_cModel, obj.mag_z_cModel, obj.mag_y_cModel, "\
        "obj.magerr_u_cModel, obj.magerr_g_cModel, obj.magerr_r_cModel, "\
        "obj.magerr_i_cModel, obj.magerr_z_cModel, obj.magerr_y_cModel, "\
        "obj.clean, obj.extendedness, "\
        "truth.id, truth.match_objectId, truth.is_good_match, "\
        "truth.truth_type, truth.redshift "\
        "FROM dp01_dc2_catalogs.object as obj "\
        "JOIN dp01_dc2_catalogs.truth_match as truth "\
        "ON truth.match_objectId = obj.objectId "\
        "WHERE CONTAINS(POINT('ICRS', obj.ra, obj.dec), "\
        "CIRCLE('ICRS', 62.0, -37.0, 0.5)) = 1 "\
        "AND obj.clean = 1 "\
        "AND obj.extendedness > 0 "\
        "AND obj.mag_i_cModel <= 25.0 "\
        "AND truth.match_objectid >= 0 "\
        "AND truth.is_good_match = 1 "\
        "AND truth.truth_type = 1 "\
        "AND truth.redshift > 0.05 "
print(query)

In [None]:
%%time
results = service.search(query)
print('Query returned %s matched objects.' % len(results))

### 1.1 Use numpy arrays

NumPy arrays have proved quicker than pandas data frames in the past, but this may depend on the architecture and the number of objects. Whether NumPy is optimal for this application is unconfirmed, but this demo uses it.

In [None]:
# galaxy true redshifts
data_tz = np.asarray( results['redshift'], dtype='float' )

# galaxy apparent magnitudes
data_m = np.transpose( np.asarray( (results['mag_u_cModel'],results['mag_g_cModel'],\
                                    results['mag_r_cModel'],results['mag_i_cModel'],\
                                    results['mag_z_cModel'],results['mag_y_cModel']),\
                                  dtype='float' ) )

# galaxy apparent magnitude errors
data_me = np.transpose( np.asarray( (results['magerr_u_cModel'],results['magerr_g_cModel'],\
                                     results['magerr_r_cModel'],results['magerr_i_cModel'],\
                                     results['magerr_z_cModel'],results['magerr_y_cModel']),\
                                  dtype='float' ) )

In [None]:
# galaxy colors and color errors
data_c = np.zeros( (len(data_m),5), dtype='float' )
data_ce = np.zeros( (len(data_m),5), dtype='float' )

data_c[:,0] = data_m[:,0] - data_m[:,1]
data_c[:,1] = data_m[:,1] - data_m[:,2]
data_c[:,2] = data_m[:,2] - data_m[:,3]
data_c[:,3] = data_m[:,3] - data_m[:,4]
data_c[:,4] = data_m[:,4] - data_m[:,5]

data_ce[:,0] = np.sqrt( data_me[:,0]**2 + data_me[:,1]**2 )
data_ce[:,1] = np.sqrt( data_me[:,1]**2 + data_me[:,2]**2 )
data_ce[:,2] = np.sqrt( data_me[:,2]**2 + data_me[:,3]**2 )
data_ce[:,3] = np.sqrt( data_me[:,3]**2 + data_me[:,4]**2 )
data_ce[:,4] = np.sqrt( data_me[:,4]**2 + data_me[:,5]**2 )
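
The adjacent-band colors and their errors can also be computed with array slicing instead of writing out each column. The following is a minimal sketch using made-up magnitudes; `mags` and `magerrs` are illustrative stand-ins for `data_m` and `data_me`, not values from the catalog.

```python
import numpy as np

# Made-up magnitudes and errors for two galaxies in six bands (u, g, r, i, z, y).
mags = np.array([[25.1, 24.6, 24.0, 23.7, 23.5, 23.4],
                 [26.0, 25.2, 24.4, 24.1, 23.9, 23.8]])
magerrs = np.array([[0.20, 0.05, 0.03, 0.03, 0.05, 0.10],
                    [0.40, 0.08, 0.04, 0.04, 0.07, 0.15]])

# Adjacent-band colors: u-g, g-r, r-i, i-z, z-y.
colors = mags[:, :-1] - mags[:, 1:]

# Color errors: the two magnitude errors added in quadrature.
color_errs = np.hypot(magerrs[:, :-1], magerrs[:, 1:])

print(colors.shape, color_errs.shape)
```

Slicing gives the same result as the column-by-column assignments above and extends to a different number of bands without changes.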

In [None]:
del results

### 1.2 Plot color vs. redshift

Uncomment the code below to see what the galaxy properties are like.

In [None]:
# fig, axs = plt.subplots(2,3, figsize=(14, 7))
# fig.suptitle('galaxy color vs. redshift')
# axs[0,0].plot(data_tz, data_c[:,0], 'o', ms=2, mew=0, alpha=0.01, color='darkviolet')
# axs[0,0].set_ylabel('u-g')
# axs[0,0].set_ylim([-1,2])
# axs[0,1].plot(data_tz, data_c[:,1], 'o', ms=2, mew=0, alpha=0.01, color='darkgreen')
# axs[0,1].set_ylabel('g-r')
# axs[0,1].set_ylim([-1,2])
# axs[0,2].plot(data_tz, data_c[:,2], 'o', ms=2, mew=0, alpha=0.01, color='darkorange')
# axs[0,2].set_ylabel('r-i')
# axs[0,2].set_ylim([-1,2])
# axs[1,0].plot(data_tz, data_c[:,3], 'o', ms=2, mew=0, alpha=0.01, color='firebrick')
# axs[1,0].set_ylabel('i-z')
# axs[1,0].set_xlabel('redshift')
# axs[1,0].set_ylim([-1,2])
# axs[1,1].plot(data_tz, data_c[:,4], 'o', ms=2, mew=0, alpha=0.01, color='saddlebrown')
# axs[1,1].set_ylabel('z-y')
# axs[1,1].set_xlabel('redshift')
# axs[1,1].set_ylim([-1,2])
# axs[1,2].hist(data_tz, color='grey')
# axs[1,2].set_xlabel('redshift')
# fig.show()

<br>

## 2.0 Estimate photo-z

This notebook uses a leave-one-out method: for each galaxy that we retrieved (i.e., each test galaxy), we use *all the other galaxies* and their true redshifts as the training set.

For each test galaxy, the estimator identifies a color-matched subset of training galaxies by calculating the Mahalanobis distance in color-space between the test galaxy and all training galaxies:

$D_M = \sum_{1}^{N_{\rm color}} \frac{( c_{\rm test} - c_{\rm train} )^2}{ (\delta c_{\rm test})^2}$

where $c$ is the color of the test- or training-set galaxy, $\delta c_{\rm test}$ is the uncertainty in the test galaxy's color, and $N_{\rm color}$ is the number of colors measured for both the test- and training-set galaxy. 

A threshold value is then applied to all training-set galaxies to identify those which are well-matched in color: this is called the **CMNN subset** of training galaxies.

This threshold value is defined by the percent point function (PPF): e.g., for $N_{\rm color}=5$ degrees of freedom and a PPF of $68\%$, $68\%$ of all training galaxies consistent with the test galaxy will have $D_M < 5.86$.
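
As a quick check of the quoted threshold, the value can be computed directly with `scipy.stats.chi2.ppf`, the same function used to build the lookup table in Section 2.1. This is only a sketch, and the variable names `ppf` and `n_color` are illustrative.

```python
from scipy.stats import chi2

ppf = 0.68      # fraction of consistent training galaxies to keep
n_color = 5     # degrees of freedom (number of colors in common)

# Chi-squared value below which a fraction `ppf` of the distribution lies.
threshold = chi2.ppf(ppf, n_color)
print(f"D_M threshold for N_color={n_color}, PPF={ppf}: {threshold:.2f}")
```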

A training galaxy is then selected randomly from this subset of color-matched nearest-neighbors, and its redshift is used as the test-set galaxy's photometric redshift.

The standard deviation in redshifts of this subset of training galaxies is used as the uncertainty in the photo-_z_ estimate.
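
The steps above can be sketched for a single test galaxy as follows. This is a toy illustration, not the CMNN package itself: the training set is synthetic, the color-redshift relation is invented and not physical, and all variable names are assumptions. It also omits the leave-one-out step and the minimum-number-of-colors check used later in this notebook.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)

# Synthetic (made-up) training set: 1000 galaxies, 5 colors, known redshifts.
n_train, n_color = 1000, 5
train_z = rng.uniform(0.05, 3.0, n_train)
# Toy color-redshift relation with Gaussian scatter; not a physical model.
train_c = 0.5 * train_z[:, None] + rng.normal(0.0, 0.05, (n_train, n_color))

# One synthetic test galaxy with colors and color uncertainties.
test_z_true = 1.0
test_ce = np.full(n_color, 0.05)
test_c = 0.5 * test_z_true + rng.normal(0.0, 0.05, n_color)

# Distance in color space, weighted by the test galaxy's color errors.
dm = np.sum((test_c - train_c) ** 2 / test_ce ** 2, axis=1)

# Threshold from the chi-squared percent point function.
threshold = chi2.ppf(0.68, n_color)
subset = np.where(dm <= threshold)[0]

if len(subset) > 0:
    # Photo-z: redshift of a randomly chosen member of the CMNN subset.
    photoz = train_z[rng.choice(subset)]
    # Uncertainty: standard deviation of the subset's redshifts.
    photoz_err = np.std(train_z[subset])
else:
    # No color-matched training galaxies: no estimate.
    photoz, photoz_err = np.nan, np.nan

print(len(subset), photoz, photoz_err)
```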

<br>

### 2.1 Set the tunable parameters

This simple version of the CMNN estimator takes just two tunable parameters:

(1) The percent point function (`cmnn_ppf`), as described above. The default is 0.68.

(2) The minimum number of colors (`cmnn_minNclr`) that a training-set galaxy must have in common with the test galaxy. The default is 5 (i.e., all five colors). This parameter could be lowered if magnitude cuts are applied, leaving some galaxies undetected in some bands.

In [None]:
cmnn_ppf = 0.68 
cmnn_minNclr = 5

We make and use a thresholds lookup table because chi2.ppf is slow. As described above, the threshold values are based on the desired percent point function (PPF).

In [None]:
cmnn_thresh_table = np.zeros( 6, dtype='float' )
for d in range(6):
    cmnn_thresh_table[d] = chi2.ppf(cmnn_ppf,d)
cmnn_thresh_table[0] = float(0.0000)

for d in range(6):
    print('degrees of freedom, threshold = ',d,cmnn_thresh_table[d])

<br>

### 2.2 Estimate the photo-z

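Make arrays to hold the photo-z and the photo-z uncertainty for all of the galaxies. Both are initialized to -1, which marks galaxies that have not yet been processed.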

In [None]:
data_pz = np.zeros( len(data_m), dtype='float' ) - 1.0
data_pze = np.zeros( len(data_m), dtype='float' ) - 1.0

<br>

**WARNING:** It takes ~30 minutes to do ~100000 galaxies (~1.5 minutes to do ~5000).

As a test, we can use `Ncalc` to only estimate photo-z for a subset of the galaxies.

In [None]:
# Ncalc = len(data_c)
Ncalc = 5000

In [None]:
%%time

t1 = datetime.datetime.now()

for i in range( Ncalc ):
    if (i > 0) and (i in (100, 1000, Ncalc-1000)):
        t2 = datetime.datetime.now()
        # elapsed time so far, and estimated time remaining at the current rate
        print(i, t2-t1, ((t2-t1)/float(i))*float(Ncalc-i), ' remaining' )
        
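    # Mahalanobis distance in color space between test galaxy i and every galaxy
    DM  = np.nansum( ( data_c[i,:] - data_c[:,:] )**2 / data_ce[i,:]**2, axis=1, dtype='float' )
    # degrees of freedom: number of colors that are finite for both galaxies
    DOF = np.sum( np.isfinite(data_c[i,:]) & np.isfinite(data_c[:,:]), axis=1, dtype='int' )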
    
    data_th = np.zeros( len(data_c), dtype='float' )
    for d in range(6):
        tx = np.where( DOF == d )[0]
        data_th[tx] = cmnn_thresh_table[ d ]
        del tx
    
    # reset the Mahalanobis distance for this 'test' galaxy to be very large
    # this will "leave out" the current 'test' galaxy from the 'training set'
    DM[i] = 99.9
        
    index = np.where( \
    ( DOF >= cmnn_minNclr ) & \
    ( data_th > 0.00010 ) & \
    ( DM > 0.00010 ) & \
    ( DM <= data_th ) )[0]
    
    if len(index) > 0:
        rival = np.random.choice( index, size=1, replace=False )[0]
        data_pz[i] = data_tz[rival]
        data_pze[i] = np.std( data_tz[index] )
        del rival
    else:
        data_pz[i] = float('nan')
        data_pze[i] = float('nan')
        
    del index, data_th, DOF, DM

#### 2.2.1 Quick check of success rate

In [None]:
tx = np.where( np.isnan(data_pz) )[0]
print( len(tx), ' galaxies did not get a pz estimate' )
del tx

tx = np.where( data_pz > 0.0 )[0]
print( len(tx), ' galaxies did get a pz estimate' )
del tx

<br>

### 2.3 Plot the photo-z results

#### 2.3.1 Plot the photometric *versus* the true redshifts.

In [None]:
tx = np.where( data_pz > 0.0 )[0]

fig = plt.figure(figsize=(4,4))
plt.plot( [0.0,3.0], [0.0,3.0], ls='solid', lw=1, color='firebrick')
plt.plot( data_tz[tx], data_pz[tx], 'o', ms=5, mew=0, alpha=0.1, color='grey' )
plt.xlabel('true redshift')
plt.ylabel('photometric redshift')
plt.show()

del tx

#### 2.3.2 Plot the photo-z uncertainty *versus* the photo-z accuracy.

The photo-z accuracy is the absolute value of the difference between the true and photometric redshifts.

Recall that the photo-z uncertainty is the standard deviation of the true redshifts of the training-set galaxies in the CMNN subset, as described in Section 2. Many galaxies have an uncertainty of exactly zero because their CMNN subset contains only one training-set galaxy, and the standard deviation of a single value is zero. The full CMNN PZ Estimator treats such galaxies better (see Section 3).
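
The zero-uncertainty case is easy to reproduce. In this small sketch, `single_subset_z` is a hypothetical CMNN subset holding one made-up training redshift:

```python
import numpy as np

# Redshift of a hypothetical CMNN subset containing a single training galaxy.
single_subset_z = np.array([0.73])

# With only one value there is no scatter, so the standard deviation is zero.
single_subset_err = np.std(single_subset_z)
print(single_subset_err)
```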

In [None]:
tx = np.where( data_pz > 0.0 )[0]

fig = plt.figure(figsize=(10,4))
plt.plot( np.abs(data_tz[tx]-data_pz[tx]), data_pze[tx],\
         'o', ms=5, mew=0, alpha=0.1, color='grey' )
plt.xlabel('photo-z accuracy')
plt.ylabel('photo-z uncertainty')
plt.show()

del tx

<br>

## 3.0 Future Work

(1) Generate a *separate* training set of ~200000 galaxies, and then apply it to a *separate* test set of many more galaxies. 

(2) Install the full CMNN PZ Estimator as a package, and demonstrate how to use it. This might take some modification. The full package features more parameters, and modules for statistical analysis and plotting. The additional parameters include magnitude and color priors, alternatives to randomly selecting from the CMNN subset, and more robust treatment of test-set galaxies with few training-set galaxies in the CMNN subset. 
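
One possible starting point for item (1) is to split the retrieved galaxies into disjoint training and test sets by shuffling their indices. This is a sketch under stated assumptions, not part of the CMNN package: `n_gal` stands in for `len(data_c)`, and the training-set size `n_train` is a small illustrative value rather than the ~200000 suggested above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the number of retrieved galaxies (len(data_c) in this notebook).
n_gal = 1000
# Hypothetical training-set size for this toy example.
n_train = 200

# Shuffle the galaxy indices, then split them into two disjoint sets.
shuffled = rng.permutation(n_gal)
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

print(len(train_idx), len(test_idx))
```

With separate sets, each test galaxy would be compared only against the training rows (for example, `data_c[train_idx]`), so the leave-one-out reset of the test galaxy's own distance would no longer be needed.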