<img align="left" src="figs/logos/logo-IJCLab_v1.png" height="40" style="padding: 10px"> 
<b>PhotoZ estimation with scikit-learn machine learning</b> <br>
Last verified to run on 2022-03-15 with LSST Science Pipelines release w_2021_49 <br>
Contact author: Sylvie Dagoret-Campagne (DP0 Delegate) <br>
Target audience: DP0 delegates <br>

**Credit:** Originally developed by Sylvie Dagoret-Campagne in the framework provided by Rubin DP0.1 (reference DP0.1 tutorials)

Acknowledgement: Melissa Graham, Leanne Guy, Alex Drlica-Wagner, Keith Bechtol, Grzegorz Madejski, Louise Edwards, and many others.

## Learning Objectives: compare the performance of PhotoZ estimators built with simple machine learning algorithms from scikit-learn.

Three typical scikit-learn regressors are evaluated and compared with one another. Finally, they are compared to the CMNN Photo-Z estimator (introduced in a previous notebook).
No optimisation is performed in this notebook; that will be the subject of a complementary notebook.

This notebook comes in two parts:

- Part 1 : Selection of the dataset, including discussion on photometric detection
- Part 2 : Comparison of Ridge, RandomForest, Gradient Boosting estimators and CMNN




**Note:**
- All plots are made with HoloViews.
- **It is best to select the maximum number of CPUs.**

### Imports

In [None]:
# Import general python packages
import numpy as np
import re
import pandas as pd
import pickle
from pandas.testing import assert_frame_equal
import os
import errno
import shutil
import getpass
import datetime
# Import the Rubin TAP service utilities
from lsst.rsp import get_tap_service, retrieve_query

# LSST Science Pipelines (Stack) packages
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay
import lsst.geom as geom
import lsst.afw.coord as afwCoord
afwDisplay.setDefaultBackend('matplotlib')

#
from lsst import skymap

# Astropy
from astropy import units as u
from astropy.coordinates import SkyCoord
from astropy.units.quantity import Quantity
from astropy.visualization import (MinMaxInterval, SqrtStretch,ZScaleInterval,PercentileInterval,
                                   ImageNormalize,imshow_norm)
from astropy.visualization.stretch import SinhStretch, LinearStretch,AsinhStretch,LogStretch


# Bokeh for interactive visualization
import bokeh
from bokeh.io import output_file, output_notebook, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, CDSView, GroupFilter, HoverTool
from bokeh.plotting import figure
from bokeh.transform import factor_cmap

import holoviews as hv
from holoviews import streams, opts
from holoviews.operation.datashader import rasterize
from holoviews.operation.datashader import datashade, dynspread
from holoviews.plotting.util import process_cmap

import datashader as dsh


# Set the maximum number of rows to display from pandas
pd.set_option('display.max_rows', 20)


# Set the holoviews plotting library to be bokeh
# You will see the holoviews + bokeh icons displayed when the library is loaded successfully
#hv.extension('bokeh')
hv.extension('bokeh', 'matplotlib')

# Display bokeh plots inline in the notebook
output_notebook()

In [None]:
# What versions of bokeh, holoviews and datashader are we working with?
# This is important when referring to online documentation as
# APIs can change between versions.
print("Bokeh version: " + bokeh.__version__)
print("Holoviews version: " + hv.__version__)
print("Datashader version: " + dsh.__version__)

In [None]:
#  What version of the Stack are we using?
! echo $IMAGE_DESCRIPTION
! eups list -s | grep lsst_distrib

In [None]:
# allow for matplotlib to create inline plots in our notebook
%matplotlib inline
import matplotlib.pyplot as plt      # imports matplotlib.pyplot as plt
from matplotlib.colors import Normalize

import warnings                      # imports the warnings library
import gc                            # imports python's garbage collector

# Ignore warnings
from astropy.units import UnitsWarning
warnings.simplefilter("ignore", category=UnitsWarning)

In [None]:
# Set up some plotting defaults:

params = {'axes.labelsize': 28,
          'font.size': 24,
          'legend.fontsize': 14,
          'xtick.major.width': 3,
          'xtick.minor.width': 2,
          'xtick.major.size': 12,
          'xtick.minor.size': 6,
          'xtick.direction': 'in',
          'xtick.top': True,
          'lines.linewidth': 3,
          'axes.linewidth': 3,
          'axes.labelweight': 3,
          'axes.titleweight': 3,
          'ytick.major.width': 3,
          'ytick.minor.width': 2,
          'ytick.major.size': 12,
          'ytick.minor.size': 6,
          'ytick.direction': 'in',
          'ytick.right': True,
          'figure.figsize': [18, 10],
          'figure.facecolor': 'White'
          }

plt.rcParams.update(params)

# Utility functions

## Tools to explain the expected detected magnitudes distributions

- taken from https://github.com/ixkael/Photoz-tools

$$
p(m) = m^\alpha \exp\left( - \left(m/m_{max}\right)^\beta \right)
$$

In [None]:
def p_mag(imag_grid,maglim):
    """   
    
    Model of magnitude distribution in photometric survey
    
    p_mag(imag_grid,maglim)
    from https://github.com/ixkael/Photoz-tools
    
    input args:
     - imag_grid : magnitude
     - maglim limit of magnitude
     
     return the probability of magnitude distribution
     
     THIS IS THE MODEL THAT MUST BE USED
    
    """

    # some parameters for prob(imagnitude)
    alpha = 15.0 
    beta = 2
    off=1

    # prob(imagnitude) distribution
    p_imag = imag_grid**alpha*np.exp(-(imag_grid/(maglim-off))**beta)
    p_imag /= p_imag.sum()
    return p_imag


In [None]:
# i-magnitude error distribution as a function of the magnitude limit, as in Rykoff et al.
def imag_err(m, mlim):
    """
    from https://github.com/ixkael/Photoz-tools
    """
    
    a, b = 4.56, 1
    k = 1
    sigmadet = 5
    teff = np.exp(a + b * (mlim - 21.))
    F = 10**(-0.4*(m-22.5))
    Flim = 10**(-0.4*(mlim-22.5))
    Fnoise = (Flim/sigmadet)**2 * k * teff - Flim
    return 2.5/np.log(10) * np.sqrt( (1 + Fnoise/F) / (F*k*teff))

In [None]:
def det_prob(imag_grid,maglim):
    """
    det_prob(imag_grid,maglim)
    
    Give detection probability of a magnitude
    from https://github.com/ixkael/Photoz-tools
    
    input arg:
    - imag_grid : magnitude grid
    - maglim limit of magnitude
    
    return histogram of magnitude probability
    
    THIS IS THE MODEL THAT MUST BE USED
    
    """
    pp_mag=p_mag(imag_grid,maglim)
    
    detprob = 1*pp_mag 
    ind = (imag_grid >= maglim - 0.4)
    #detprob[ind] *= ( 1 - scipy.special.erf((imag_grid[ind]-maglim+0.4)/0.4) )
    # Gaussian roll-off (sigma = 0.2 mag) of the detection probability for magnitudes fainter than maglim - 0.4
    detprob[ind] *= np.exp( -0.5*((imag_grid[ind]-maglim+0.4)/0.2)**2)
    detprob /= detprob.sum() * (imag_grid[1]-imag_grid[0])
    return detprob



# Configurations and initialisation

## HoloViews Configuration

In [None]:
HV_CURVE_SINGLE_WIDTH  = 400
HV_CURVE_SINGLE_HEIGHT = 350
HV_CURVE_MULTI_WIDTH  = 300
HV_CURVE_MULTI_HEIGHT = 300
HV_CURVE_MULTI_FRAME_WIDTH = 300
HV_CURVE_MULTI_COLS   = 3

In [None]:
NBINS_HISTO = 50

In [None]:
HV_HISTO_SINGLE_WIDTH  = 600
HV_HISTO_SINGLE_HEIGHT = 600
HV_HISTO_MULTI_WIDTH  = 300
HV_HISTO_MULTI_HEIGHT = 300
HV_HISTO_MULTI_FRAME_WIDTH = 300
HV_HISTO_MULTI_COLS   = 3

In [None]:
HV_IMAGE_SINGLE_WIDTH  = 400
HV_IMAGE_SINGLE_HEIGHT = 400
HV_IMAGE_SINGLE_FRAME_WIDTH = 600
HV_IMAGE_MULTI_WIDTH  = 400
HV_IMAGE_MULTI_HEIGHT = 400
HV_IMAGE_MULTI_FRAME_WIDTH = 300
HV_IMAGE_MULTI_COLS   = 3

## Notebook Configuration

#### Setup paths

In [None]:
# username
myusername=getpass.getuser()

In [None]:
# temporary folders if necessary
NBDIR       = 'photoz_part1'                           # relative path for this notebook output
TMPTOPDIR   = "/scratch"                               # always write some output in /scratch, never in user HOME 
TMPUSERDIR  = os.path.join(TMPTOPDIR,myusername)       # defines the path of user outputs in /scratch 
TMPNBDIR    = os.path.join(TMPUSERDIR,NBDIR)           # output path for this particular notebook

In [None]:
# create user temporary directory
if not os.path.isdir(TMPUSERDIR):
    try:
        os.mkdir(TMPUSERDIR)
    except OSError as err:
        # Catch only filesystem errors and keep the original cause in the traceback
        raise OSError(f"Can't create destination directory {TMPUSERDIR}!") from err

In [None]:
# create this notebook temporary directory
if not os.path.isdir(TMPNBDIR):
    try:
        os.mkdir(TMPNBDIR)
    except OSError as err:
        # Catch only filesystem errors and keep the original cause in the traceback
        raise OSError(f"Can't create destination directory {TMPNBDIR}!") from err

#### Define steering flags and parameters

The query output can be saved to a file on request.
As configured below, the query output is written to disk and read back from disk when the file exists; only the final cleanup is disabled.
To speed up a demo, the presenter may keep these flags set to True.


In [None]:
FLAG_WRITE_DATAFRAMEONDISK  = True                     # If True, the query output is saved on disk
FLAG_READ_DATAFRAMEFROMDISK = True                     # If True, the query output is read from disk when the file exists
FLAG_CLEAN_DATAONDISK       = False                    # If True, the query outputs saved on disk are removed at the end of the notebook

###  Retrieve source from catalog

#### Coadd magnitude cutoffs for 10 years

In [None]:
UMAX,GMAX,RMAX,IMAX,ZMAX,YMAX = 26.1, 27.4, 27.5, 26.8, 26.1,24.9

### Selection function in catalog

In [None]:
# Define a function to build a query passing a coordinate and a search radius
def getQueryCircle(c: SkyCoord, r: Quantity) -> str:
    query = "SELECT obj.ra, obj.dec, obj.objectId, obj.extendedness, "\
            "obj.mag_u_cModel, obj.mag_g_cModel, obj.mag_r_cModel, "\
            "obj.mag_i_cModel, obj.mag_z_cModel, obj.mag_y_cModel, "\
            "obj.magerr_u_cModel, obj.magerr_g_cModel, obj.magerr_r_cModel, "\
            "obj.magerr_i_cModel, obj.magerr_z_cModel, obj.magerr_y_cModel, "\
            "truth.truth_type, truth.redshift, truth.match_objectId " \
            "FROM dp01_dc2_catalogs.object as obj " \
            "JOIN dp01_dc2_catalogs.truth_match as truth " \
            "ON truth.match_objectId = obj.objectId " \
            "WHERE CONTAINS(POINT('ICRS', obj.ra, obj.dec),"\
            "CIRCLE('ICRS', " + str(c.ra.value) + ", " + str(c.dec.value) + ", " \
            + str(r.to(u.deg).value) + " )) = 1 " \
            "AND obj.good = 1 "  \
            "AND truth.match_objectid >= 0 " \
            "AND truth.is_good_match = 1 " \
            "AND truth.truth_type = 1 " \
            "AND obj.mag_u_cModel - 5*obj.magerr_u_cModel < " +str(UMAX) +" "\
            "AND obj.mag_g_cModel - 5*obj.magerr_g_cModel < " +str(GMAX) +" "\
            "AND obj.mag_r_cModel - 5*obj.magerr_r_cModel < " +str(RMAX) +" "\
            "AND obj.mag_i_cModel - 5*obj.magerr_i_cModel < " +str(IMAX) +" "\
            "AND obj.mag_z_cModel - 5*obj.magerr_z_cModel < " +str(ZMAX) +" "\
            "AND obj.mag_y_cModel - 5*obj.magerr_y_cModel < " +str(YMAX)
    return query

## Configuration and initialisation of the Rubin TAP Service client

In [None]:
# Get an instance of the TAP service
service = get_tap_service()
assert service is not None
assert service.baseurl == "https://data.lsst.cloud/api/tap"

In [None]:
# Define a reference position on the sky and a radius for a circular search
c1 = SkyCoord(ra=62.0*u.degree, dec=-40.*u.degree, frame='icrs')
size = 1.0 * u.deg

In [None]:
query = getQueryCircle(c1, size)

In [None]:
query

In [None]:
filename_result=f'cat_photozpart1_result.pkl'
fullfilename_result=os.path.join(TMPNBDIR,filename_result)

# Selection flags

- put boolean flags here to avoid execution of some sections

In [None]:
# Show plots on redshift distribution
FLAG_SHOW_TRUE_REDSHIFT_DISTRIB = False

In [None]:
# Show plots to check the photometry selected for photoz
# For a pure demo on photoZ, this section can be skipped
FLAG_SHOW_PHOTOMETRY_DETECTION = False

# START HERE

- To speed up the database query, especially if you run this notebook in the context of a DP0 demo public session,
you can copy my data file **cat_photozpart1_result.pkl** from **/scratch/sylvielsstfr/photoz_part1/cat_photozpart1_result.pkl** to your path **/scratch/yourusername/photoz_part1**

- check the variable **FLAG_READ_DATAFRAMEFROMDISK=True**
- your username is given by **yourusername=getpass.getuser()**
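
The copy step described above can be sketched with the standard library. This is a minimal sketch, not part of the original workflow: the helper name `copy_cached_catalog` is hypothetical, and it assumes the source file is readable from your session.

```python
import os
import shutil


def copy_cached_catalog(src_file, dst_dir):
    """Copy a cached query result into dst_dir unless it is already there.

    Hypothetical helper: returns the destination path, or None when the
    source file does not exist (the TAP query then has to be run instead).
    """
    if not os.path.isfile(src_file):
        return None
    # Create the destination directory (and its parents) if needed
    os.makedirs(dst_dir, exist_ok=True)
    dst_file = os.path.join(dst_dir, os.path.basename(src_file))
    # Do not overwrite a catalog that is already in place
    if not os.path.exists(dst_file):
        shutil.copy2(src_file, dst_file)
    return dst_file
```

In this notebook the call would look like `copy_cached_catalog("/scratch/sylvielsstfr/photoz_part1/cat_photozpart1_result.pkl", TMPNBDIR)`, after which the read-from-disk branch picks up the file.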

## Read input data

In [None]:
if FLAG_READ_DATAFRAMEFROMDISK and os.path.exists(fullfilename_result):
    # Reuse the catalog saved by a previous run
    sql_result = pd.read_pickle(fullfilename_result)
else:
    # Create and submit the job. This step does not run the query yet
    job = service.submit_job(query)
    # Get the job URL
    print('Job URL is', job.url)

    # Get the job phase. It will be pending as we have not yet started the job
    print('Job phase is', job.phase)
    
    # Run the job. You will see that the cell completes executing,
    # even though the query is still running
    job.run()
    
    # Use this to tell python to wait for the job to finish if
    # you don't want to run anything else while waiting
    # The cell will continue executing until the job is finished
    job.wait(phases=['COMPLETED', 'ERROR'])
    print('Job phase is', job.phase)
    
    # A useful function to raise an exception if there was a problem with the query
    job.raise_if_error()
    
    # Once the job completes successfully, you can fetch the results
    async_data = job.fetch_result()
    
    sql_result = async_data.to_table().to_pandas()
    
    # Save only a freshly queried catalog; a catalog read from disk is not rewritten
    if FLAG_WRITE_DATAFRAMEONDISK:
        sql_result.to_pickle(fullfilename_result)
    

In [None]:
sql_result.head()

In [None]:
data = sql_result

In [None]:
# for shorter names
data.rename(columns={"mag_u_cModel": "mag_u", "mag_g_cModel": "mag_g","mag_r_cModel": "mag_r",
                     "mag_i_cModel": "mag_i", "mag_z_cModel": "mag_z","mag_y_cModel": "mag_y",
                     "magerr_u_cModel": "magerr_u", "magerr_g_cModel": "magerr_g","magerr_r_cModel": "magerr_r",
                     "magerr_i_cModel": "magerr_i", "magerr_z_cModel": "magerr_z","magerr_y_cModel": "magerr_y",
                    },inplace=True)

In [None]:
# output directory where the extracted catalog is temporarily saved
! ls -l $TMPNBDIR

In [None]:
# map truth_type
data['truth_type']=data['truth_type'].map({1: 'galaxy', 2: 'star', 3: 'SNe'})

#### drop objects that are not galaxies

In [None]:
#drop objects that are not galaxies
data.drop(data.loc[data['truth_type'] != 'galaxy' ].index, inplace=True)

In [None]:
# drop NA
data = data.dropna()

In [None]:
len(data)

### Add colors

In [None]:
pd.options.mode.chained_assignment = None  # default='warn'
data["umg"]=data["mag_u"]- data["mag_g"]
data["gmr"]=data["mag_g"]- data["mag_r"]
data["rmi"]=data["mag_r"]- data["mag_i"]
data["imz"]=data["mag_i"]- data["mag_z"]
data["zmy"]=data["mag_z"]- data["mag_y"]

# Check input data

## Redshifts distribution

In [None]:
if FLAG_SHOW_TRUE_REDSHIFT_DISTRIB:
    # np.histogram returns (counts, bin_edges), in that order
    count, z_bin = np.histogram(data.redshift, bins=NBINS_HISTO)
    z_distribution = hv.Histogram((z_bin, count)).opts(title="redshift distribution",color='darkblue', 
    xlabel='redshift', fontscale=1.2,
    height=HV_HISTO_SINGLE_HEIGHT-100, width=HV_HISTO_SINGLE_WIDTH,tools=['hover'])
    
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(z_distribution)

## Magnitudes distribution

### Principle of photodetection

Study of a simple model to find the expected magnitude distribution including photodetection bias:

- blue curve : the true magnitude distribution
- green curve : the detected magnitude distribution
- red : detection threshold


In [None]:
mymag_grid = np.linspace(17,30,100)
mymaglim = 27
mymag0 = 17
mymag1 = 30
prob_mag_tab = p_mag(mymag_grid,mymaglim)
dens_mag_tab = prob_mag_tab/np.sum(prob_mag_tab)/(mymag_grid[1]-mymag_grid[0])
det_prob_tab =  det_prob(mymag_grid,mymaglim)

In [None]:
curve_opts = dict(
                xaxis="bottom", 
                padding = 0.01, fontsize={'title': '8pt'},
                height=HV_CURVE_SINGLE_HEIGHT, width=HV_CURVE_SINGLE_WIDTH+100,tools=['hover']
               )  

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    curve_magdist = hv.Curve(zip(mymag_grid,dens_mag_tab),label="true mag").opts(**curve_opts).opts(color="blue") 
    curve_maglim = hv.VLine(mymaglim,label="detection threshold").opts(color="red")
    curve_prob = hv.Curve(zip(mymag_grid,det_prob_tab),label="detected mag").opts(**curve_opts).opts(color="green") 

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout = (curve_magdist * curve_prob * curve_maglim).opts(legend_position='top_left',xlabel="magnitude",ylabel="probability",title="magnitude density distribution")
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(layout)

### Magnitude distribution before cutoff at detection threshold

- The catalog query keeps sources that lie within $5\sigma$ of the cutoff (`mag - 5*magerr < cutoff` in each band), so the sample extends slightly fainter than the cutoff. This allows the user to refine the cutoff applied in each band.
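
The preselection applied in the ADQL query can be reproduced on arrays as follows. This is a minimal sketch with toy values that are not taken from DC2; the function name `within_nsigma_of_cutoff` is hypothetical.

```python
import numpy as np


def within_nsigma_of_cutoff(mag, magerr, cutoff, nsigma=5.0):
    """Return True where mag - nsigma * magerr < cutoff (same test as in the query)."""
    mag = np.asarray(mag, dtype=float)
    magerr = np.asarray(magerr, dtype=float)
    return mag - nsigma * magerr < cutoff


# Toy r-band magnitudes and errors, tested against the r-band cutoff of 27.5
mag_r = np.array([25.0, 27.8, 28.5])
magerr_r = np.array([0.02, 0.10, 0.10])
keep = within_nsigma_of_cutoff(mag_r, magerr_r, cutoff=27.5)
print(keep)
```

The second toy source is fainter than the cutoff but is kept because it lies within $5\sigma$ of it; the third is rejected.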

In [None]:
# Detection cutoff definition
maglim_u = hv.VLine(UMAX).opts(color="magenta")
maglim_g = hv.VLine(GMAX).opts(color="magenta")
maglim_r = hv.VLine(RMAX).opts(color="magenta")
maglim_i = hv.VLine(IMAX).opts(color="magenta")
maglim_z = hv.VLine(ZMAX).opts(color="magenta")
maglim_y = hv.VLine(YMAX).opts(color="magenta")

In [None]:
histo_opts = dict(
                xaxis="bottom", 
                padding = 0.01, fontsize={'title': '8pt'},
                height=HV_HISTO_MULTI_HEIGHT, width=HV_HISTO_MULTI_WIDTH,tools=['hover']
               )  

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    # np.histogram returns (counts, bin_edges); hv.Histogram takes an (edges, counts) tuple
    count, mag_bin = np.histogram(data.mag_u, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magu = hv.Histogram((mag_bin, count)).opts(title="mag U",xlabel="U",color='blue').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_g, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magg = hv.Histogram((mag_bin, count)).opts(title="mag G",xlabel="G",color='green').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_r, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magr = hv.Histogram((mag_bin, count)).opts(title="mag R",xlabel="R",color='red').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_i, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magi = hv.Histogram((mag_bin, count)).opts(title="mag I",xlabel="I",color='orange').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_z, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magz = hv.Histogram((mag_bin, count)).opts(title="mag Z",xlabel="Z",color='grey').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_y, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magy = hv.Histogram((mag_bin, count)).opts(title="mag Y",xlabel="Y",color='black').opts(**histo_opts)

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout1 = h_magu * maglim_u + h_magg * maglim_g + h_magr * maglim_r + h_magi *  maglim_i + h_magz *  maglim_z + h_magy *  maglim_y
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(layout1.cols(HV_HISTO_MULTI_COLS))

### Apply Photometric detection

- The magnitudes in DC2 correspond to average magnitudes with their photometric errors. At the telescope, a detection corresponds to one random realisation drawn from this distribution.
The cutoff is therefore applied to the detected magnitudes, whereas the photo-z algorithms are evaluated on the average magnitudes in the catalog.

#### randomisation of magnitude with photometric errors

- A Gaussian distribution should be applied to fluxes, not to magnitudes. Here the magnitudes themselves are assumed to be Gaussian distributed as well; this assumption could be checked or corrected.
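
The caveat above can be illustrated by drawing the noise in flux space instead. This is a sketch of an alternative, not the method used in this notebook: the function name `photodet_mag_flux` is hypothetical, the flux zero point is arbitrary because it cancels out, and the flux error uses first-order error propagation, $\sigma_F = F\,\sigma_m \ln(10)/2.5$.

```python
import numpy as np


def photodet_mag_flux(mag, errmag, rng=None):
    """Draw detected magnitudes by adding Gaussian noise to the flux.

    Returns a 1-D array; entries are NaN where the sampled flux is not
    positive, i.e. where the magnitude is undefined.
    """
    rng = np.random.default_rng() if rng is None else rng
    mag = np.atleast_1d(np.asarray(mag, dtype=float))
    errmag = np.atleast_1d(np.asarray(errmag, dtype=float))
    # Flux in arbitrary units and its first-order Gaussian error
    flux = 10.0 ** (-0.4 * mag)
    flux_err = flux * errmag * np.log(10.0) / 2.5
    flux_det = rng.normal(loc=flux, scale=flux_err)
    # Convert back to magnitudes only where the sampled flux is positive
    mag_det = np.full(flux_det.shape, np.nan)
    positive = flux_det > 0
    mag_det[positive] = -2.5 * np.log10(flux_det[positive])
    return mag_det
```

Because the function is vectorised, a whole column can be randomised in one call, for example `photodet_mag_flux(data['mag_u'], data['magerr_u'])`. For small errors the result is close to the magnitude-space draw; the two differ mainly for faint, low signal-to-noise sources.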

In [None]:
def photodet_mag(mag,errmag):
    return np.random.normal(loc=mag, scale=errmag)

In [None]:
# np.random.normal broadcasts over arrays, so each band is randomised in one vectorised call
# instead of a slow row-by-row apply
for band in "ugrizy":
    data[f'mag_{band}_det'] = photodet_mag(data[f'mag_{band}'], data[f'magerr_{band}'])

#### Selection on detected magnitudes

In [None]:
def photodet_select(mu,mg,mr,mi,mz,my):
    return (mu>17) and (mu < UMAX) and (mg < GMAX) and (mr < RMAX) and (mi < IMAX) and (mz < ZMAX) and (my < YMAX)

In [None]:
data['selected'] = data.apply(lambda x:  photodet_select(x['mag_u_det'], x['mag_g_det'], x['mag_r_det'], x['mag_i_det'],x['mag_z_det'], x['mag_y_det'] ), axis=1)

In [None]:
data = data[data["selected"]]

In [None]:
data = data.drop('selected', axis=1)

In [None]:
data = data.dropna()

### Distribution of average magnitudes after photometric detection

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    # np.histogram returns (counts, bin_edges); hv.Histogram takes an (edges, counts) tuple
    count, mag_bin = np.histogram(data.mag_u, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magu = hv.Histogram((mag_bin, count)).opts(title="mag U",xlabel="U",color='blue').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_g, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magg = hv.Histogram((mag_bin, count)).opts(title="mag G",xlabel="G",color='green').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_r, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magr = hv.Histogram((mag_bin, count)).opts(title="mag R",xlabel="R",color='red').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_i, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magi = hv.Histogram((mag_bin, count)).opts(title="mag I",xlabel="I",color='orange').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_z, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magz = hv.Histogram((mag_bin, count)).opts(title="mag Z",xlabel="Z",color='grey').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_y, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magy = hv.Histogram((mag_bin, count)).opts(title="mag Y",xlabel="Y",color='black').opts(**histo_opts)

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout2 = h_magu * maglim_u + h_magg * maglim_g + h_magr * maglim_r + h_magi *  maglim_i + h_magz *  maglim_z + h_magy *  maglim_y
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(layout2.cols(HV_HISTO_MULTI_COLS))

### Distribution of detected magnitudes after photometric detection

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    # np.histogram returns (counts, bin_edges); hv.Histogram takes an (edges, counts) tuple
    count, mag_bin = np.histogram(data.mag_u_det, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magud = hv.Histogram((mag_bin, count)).opts(title="mag U",xlabel="U",color='blue').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_g_det, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_maggd = hv.Histogram((mag_bin, count)).opts(title="mag G",xlabel="G",color='green').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_r_det, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magrd = hv.Histogram((mag_bin, count)).opts(title="mag R",xlabel="R",color='red').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_i_det, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magid = hv.Histogram((mag_bin, count)).opts(title="mag I",xlabel="I",color='orange').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_z_det, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magzd = hv.Histogram((mag_bin, count)).opts(title="mag Z",xlabel="Z",color='grey').opts(**histo_opts)

    count, mag_bin = np.histogram(data.mag_y_det, bins=NBINS_HISTO,range=(mymag0,mymag1))
    h_magyd = hv.Histogram((mag_bin, count)).opts(title="mag Y",xlabel="Y",color='black').opts(**histo_opts)

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout3 = h_magud * maglim_u + h_maggd * maglim_g + h_magrd * maglim_r + h_magid *  maglim_i + h_magzd *  maglim_z + h_magyd *  maglim_y
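    # These panels are histograms, so use the histogram column count;
    # display explicitly because expressions inside an `if` block are not rendered
    display(layout3.cols(HV_HISTO_MULTI_COLS))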

### Detected magnitude vs true magnitude

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    huudet, xhuz, yhuz=np.histogram2d(data.mag_u,data.mag_u_det,bins=(50, 50),range=[[16,28],[16,28]])
    hggdet, xhgz, yhgz=np.histogram2d(data.mag_g,data.mag_g_det,bins=(50, 50),range=[[16,28],[16,28]])
    hrrdet, xhrz, yhrz=np.histogram2d(data.mag_r,data.mag_r_det,bins=(50, 50),range=[[16,28],[16,28]])
    hiidet, xhiz, yhiz=np.histogram2d(data.mag_i,data.mag_i_det,bins=(50, 50),range=[[16,28],[16,28]])
    hzzdet, xhzz, yhzz=np.histogram2d(data.mag_z,data.mag_z_det,bins=(50, 50),range=[[16,28],[16,28]])
    hyydet, xhyz, yhyz=np.histogram2d(data.mag_y,data.mag_y_det,bins=(50, 50),range=[[16,28],[16,28]])

In [None]:
img_opts = dict(
                #height=600, width=700, 
                xaxis="bottom", 
                padding = 0.01, fontsize={'title': '8pt'},
                #colorbar=True, toolbar='right', show_grid=True,
                #aspect='equal',
                frame_width= HV_HISTO_MULTI_FRAME_WIDTH,
                show_grid=True ,
                #ylabel="mag",
                tools=['hover','undo','redo','zoom_in','zoom_out'],
                #tools=[myhover,'crosshair'],
               )  

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    fhuudet=np.flipud(huudet.T)
    fhggdet=np.flipud(hggdet.T)
    fhrrdet=np.flipud(hrrdet.T)
    fhiidet=np.flipud(hiidet.T)
    fhzzdet=np.flipud(hzzdet.T)
    fhyydet=np.flipud(hyydet.T)
    img_uud=hv.Image(fhuudet, bounds=(16,16,28,28) ).opts(cmap="Blues",title="magU det vs magU true ",xlabel="U (mag)",ylabel="U (mag)").opts(**img_opts)
    img_ggd=hv.Image(fhggdet, bounds=(16,16,28,28) ).opts(cmap="Greens",title="magG det vs magG true",xlabel="G (mag)",ylabel="G (mag)").opts(**img_opts)
    img_rrd=hv.Image(fhrrdet, bounds=(16,16,28,28) ).opts(cmap="Reds",title="magR det vs magR true",xlabel="R (mag)",ylabel="R (mag)").opts(**img_opts)
    img_iid=hv.Image(fhiidet, bounds=(16,16,28,28) ).opts(cmap="Oranges",title="magI det vs magI true",xlabel="I (mag)",ylabel="I (mag)").opts(**img_opts)
    img_zzd=hv.Image(fhzzdet, bounds=(16,16,28,28) ).opts(cmap="Greys",title="magZ det vs magZ true",xlabel="Z (mag)",ylabel="Z (mag)").opts(**img_opts)
    img_yyd=hv.Image(fhyydet, bounds=(16,16,28,28) ).opts(cmap="GnBu",title="magY det vs magY true",xlabel="Y (mag)",ylabel="Y (mag)").opts(**img_opts)

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout = img_uud + img_ggd + img_rrd + img_iid + img_zzd + img_yyd
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(layout.cols(HV_IMAGE_MULTI_COLS))

### Average magnitude vs redshift

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    huz, xhuz, yhuz=np.histogram2d(data.redshift,data.mag_u,bins=(50, 50),range=[[0,3],[16,26]])
    hgz, xhgz, yhgz=np.histogram2d(data.redshift,data.mag_g,bins=(50, 50),range=[[0,3],[16,26]])
    hrz, xhrz, yhrz=np.histogram2d(data.redshift,data.mag_r,bins=(50, 50),range=[[0,3],[16,26]])
    hiz, xhiz, yhiz=np.histogram2d(data.redshift,data.mag_i,bins=(50, 50),range=[[0,3],[16,26]])
    hzz, xhzz, yhzz=np.histogram2d(data.redshift,data.mag_z,bins=(50, 50),range=[[0,3],[16,26]])
    hyz, xhyz, yhyz=np.histogram2d(data.redshift,data.mag_y,bins=(50, 50),range=[[0,3],[16,26]])

In [None]:
img_opts = dict(
                #height=600, width=700, 
                xaxis="bottom", 
                padding = 0.01, fontsize={'title': '8pt'},
                #colorbar=True, toolbar='right', show_grid=True,
                #aspect='equal',
                frame_width= HV_HISTO_MULTI_FRAME_WIDTH,
                show_grid=True ,
                xlabel="redshift",
                #ylabel="mag",
                tools=['hover','undo','redo','zoom_in','zoom_out'],
                #tools=[myhover,'crosshair'],
               )  

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    fhuz=np.flipud(huz.T)
    fhgz=np.flipud(hgz.T)
    fhrz=np.flipud(hrz.T)
    fhiz=np.flipud(hiz.T)
    fhzz=np.flipud(hzz.T)
    fhyz=np.flipud(hyz.T)
    img_uz=hv.Image(fhuz, bounds=(0,16,3,26) ).opts(cmap="Blues",title="magU vs redshift",ylabel="U (mag)").opts(**img_opts)
    img_gz=hv.Image(fhgz, bounds=(0,16,3,26) ).opts(cmap="Greens",title="magG vs redshift",ylabel="G (mag)").opts(**img_opts)
    img_rz=hv.Image(fhrz, bounds=(0,16,3,26) ).opts(cmap="Reds",title="magR vs redshift",ylabel="R (mag)").opts(**img_opts)
    img_iz=hv.Image(fhiz, bounds=(0,16,3,26) ).opts(cmap="Oranges",title="magI vs redshift",ylabel="I (mag)").opts(**img_opts)
    img_zz=hv.Image(fhzz, bounds=(0,16,3,26) ).opts(cmap="Greys",title="magZ vs redshift",ylabel="Z (mag)").opts(**img_opts)
    img_yz=hv.Image(fhyz, bounds=(0,16,3,26) ).opts(cmap="GnBu",title="magY vs redshift",ylabel="Y (mag)").opts(**img_opts)

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout = img_uz + img_gz + img_rz + img_iz + img_zz + img_yz
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(layout.cols(HV_IMAGE_MULTI_COLS))

### Average Color vs redshift

In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    humgz, xhumgz, yhumgz=np.histogram2d(data.redshift,data.umg,bins=(50, 50),range=[[0,3],[-0.5,2.]])
    hgmrz, xhgmrz, yhgmrz=np.histogram2d(data.redshift,data.gmr,bins=(50, 50),range=[[0,3],[-0.5,2.]])
    hrmiz, xhrmiz, yhrmiz=np.histogram2d(data.redshift,data.rmi,bins=(50, 50),range=[[0,3],[-0.5,2.]])
    himzz, xhimzz, yhimzz=np.histogram2d(data.redshift,data.imz,bins=(50, 50),range=[[0,3],[-0.5,2.]])
    hzmyz, xhzmyz, yhzmyz=np.histogram2d(data.redshift,data.zmy,bins=(50, 50),range=[[0,3],[-0.5,2.]])

    fhumgz=np.flipud(humgz.T)
    fhgmrz=np.flipud(hgmrz.T)
    fhrmiz=np.flipud(hrmiz.T)
    fhimzz=np.flipud(himzz.T)
    fhzmyz=np.flipud(hzmyz.T)

    img_umgz=hv.Image(fhumgz, bounds=(0,-0.5,3,2.) ).opts(cmap="Blues",title="U - G vs redshift",ylabel="U-G (mag)").opts(**img_opts)
    img_gmrz=hv.Image(fhgmrz, bounds=(0,-0.5,3,2.) ).opts(cmap="Greens",title="G - R vs redshift",ylabel="G-R (mag)").opts(**img_opts)
    img_rmiz=hv.Image(fhrmiz, bounds=(0,-0.5,3,2.) ).opts(cmap="Reds",title="R - I vs redshift",ylabel="R-I (mag)").opts(**img_opts)
    img_imzz=hv.Image(fhimzz, bounds=(0,-0.5,3,2.) ).opts(cmap="Oranges",title="I - Z vs redshift",ylabel="I-Z (mag)").opts(**img_opts)
    img_zmyz=hv.Image(fhzmyz, bounds=(0,-0.5,3,2.) ).opts(cmap="Greys",title="Z - Y vs redshift",ylabel="Z-Y (mag)").opts(**img_opts)


In [None]:
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    layout = img_umgz + img_gmrz + img_rmiz + img_imzz + img_zmyz 
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(layout.cols(HV_IMAGE_MULTI_COLS))

Because photo-z's rely on the flux integrated in broad filters, they are mostly sensitive to broad, dramatic features of the SED such as spectral breaks. The location of these breaks therefore provides a lot of information for photo-z's.
- The Balmer and 4000 angstrom breaks are in the wavelength range of the LSST filters up to a redshift of ~1.4, and locating these breaks with the LSST filters provides good leverage for photo-z's (see e.g. Kalmbach et al. 2020 and Malz et al. 2021).
- While the Balmer break leaves the LSST filters around z=1.4, the Lyman break does not enter the wavelength range of the LSST filters until about z=2.5. This gap in redshift coverage contributes to the degradation of photo-z's at high redshift.
- In high-redshift galaxies, photo-z estimators might confuse the Lyman break for the Balmer break. This contributes to the physical degeneracies that make photo-z's hard.
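
The way these breaks move through the filters follows from $\lambda_{obs} = \lambda_{rest}(1+z)$. The sketch below tabulates the observed wavelength of each break at a few redshifts. The rest-frame wavelengths are standard approximate values and are not taken from this notebook, and the dictionary and function names are hypothetical.

```python
# Approximate rest-frame wavelengths in angstroms (assumed standard values)
REST_WAVELENGTH_ANGSTROM = {
    "Lyman limit": 912.0,
    "Balmer break": 3646.0,
    "4000 A break": 4000.0,
}


def observed_wavelength(rest_wavelength, z):
    """Observed wavelength of a rest-frame feature at redshift z."""
    return rest_wavelength * (1.0 + z)


for z in (0.0, 1.4, 2.5):
    shifted = {name: round(observed_wavelength(w, z))
               for name, w in REST_WAVELENGTH_ANGSTROM.items()}
    print(f"z = {z}: {shifted}")
```

At z = 1.4 the 4000 angstrom break is observed near 9600 angstroms, and at z = 2.5 the Lyman limit is observed near 3200 angstroms. Comparing these values with the filter passbands indicates where each break can be located.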

In [None]:
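from PIL import Image
if FLAG_SHOW_PHOTOMETRY_DETECTION: 
    img= Image.open("figs/sdss/plot_sdss_filters_2.png")
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
    img = img.resize((500, 400), Image.LANCZOS)
    # An expression inside an `if` block is not rendered automatically; display it explicitly
    display(img)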

# Machine Learning for Photo-Z estimation

## Requirements from the LSST Science Book:
(https://www.lsst.org/sites/default/files/docs/sciencebook/SB_3.pdf)

Photometric redshifts for LSST will be applied and calibrated over the redshift range $0 < z < 4$
for galaxies to $r  \simeq 27.5$. 
For the majority of science cases, such as weak lensing and BAO, a subset
of galaxies with $i < 25.3$ will be used. For this high S/N gold standard subset over the
redshift interval, $0 < z < 3$, the photometric redshift requirements are:

- The root-mean-square scatter in photometric redshifts, $ \sigma_z/(1+z)$, must be smaller than 0.05, with a goal of 0.02.
- The fraction of $3\sigma $  outliers at all redshifts must be below 10%.
- The bias in $e_z = (z_{photo}-z_{spec})/(1+z_{spec})$ must be below 0.003 (or 0.01 for combined analyses of weak lensing and baryon acoustic oscillations).
- The uncertainty in  $\sigma_z/(1+z)$ must also be known to similar accuracy.
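
As an illustration, the first three requirements can be checked on any pair of true and estimated redshifts. The sketch below is only a toy under stated assumptions: the function name `check_requirements`, the synthetic catalog, and the choice of the median for the bias are hypothetical, and the $3\sigma$ outlier limit is taken as three times the required scatter of 0.05.

```python
import numpy as np

def check_requirements(z_spec, z_phot, sigma_req=0.05, outlier_req=0.10, bias_req=0.003):
    """Compare simple photo-z statistics with the Science Book thresholds."""
    e_z = (z_phot - z_spec) / (1.0 + z_spec)
    rms = np.sqrt(np.mean(e_z ** 2))                       # RMS scatter of e_z
    outlier_frac = np.mean(np.abs(e_z) > 3.0 * sigma_req)  # fraction beyond 3 sigma
    bias = np.median(e_z)                                  # median offset (assumption)
    return {
        "rms": rms, "rms_ok": rms < sigma_req,
        "outlier_frac": outlier_frac, "outliers_ok": outlier_frac < outlier_req,
        "bias": bias, "bias_ok": abs(bias) < bias_req,
    }

# Synthetic catalog (assumption): Gaussian scatter of 0.02 in e_z, no outliers
rng = np.random.default_rng(0)
z_spec_demo = rng.uniform(0.0, 3.0, 10000)
z_phot_demo = z_spec_demo + 0.02 * (1.0 + z_spec_demo) * rng.normal(size=10000)

result = check_requirements(z_spec_demo, z_phot_demo)
print(result)
```

With real estimates, `z_spec_demo` and `z_phot_demo` would be replaced by the test-sample redshifts and the regressor predictions.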



### Other definitions

- **The photo-z accuracy is the absolute value of the difference between the true and photometric redshifts**.

- **The photo-z uncertainty is the standard deviation of the true redshifts** of the training-set galaxies in the CMNN subset.

## Utility functions


- from DE School IV, University of Oxford, July 18, 2016 :  **Jeff Newman - photometric redshifts for LSST**

The tools for PhotoZ evaluation are given in the notebook and are also described in the LSST Science Book (Performance chapter).

### Performance evaluation lines

In [None]:
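# A function that we will call a lot: it draws the reference lines of the zphot/zspec plot,
# i.e. the bisector z_phot = z_spec and the outlier boundaries z_phot = z_spec +/- slope*(1 + z_spec)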
def plot_lines(zmin=0,zmax=3,zstep=0.05,slope=0.15):
    
    x = np.arange(zmin,zmax,zstep)
    outlier_upper = x + slope*(1+x)
    outlier_lower = x - slope*(1+x)

    curv_bisect=hv.Curve(zip(x,x)).opts(color="red") 
    curv_outupper=hv.Curve(zip(x,outlier_upper)).opts(color="red",line_dash='dashed') 
    curv_outlower=hv.Curve(zip(x,outlier_lower)).opts(color="red",line_dash='dashed') 
    
    layout = curv_bisect * curv_outupper * curv_outlower
    return layout
    
    

In [None]:
plot_lines()

### Statistic lines

In [None]:
def get_stats(z_spec,z_phot,slope=0.15):
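    """
    Compute and print the photo-z quality statistics.

    input : 
       - z_spec : spectroscopic redshift or true redshift
       - z_phot : photo-z redshift
       - slope : slope of the lines defining the outliers, 3 x sigma_z with sigma_z = 5%, so slope = 3 x 0.05 = 0.15 
    output : nmad, std_result, bias, eta, stats_txt
    """
    
    # outliers are the galaxies with |delta z|/(1+z) above slope
    mask = np.abs((z_phot - z_spec)/(1 + z_spec)) > slope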
    
    # Standard Deviation of the predicted redshifts compared to the data:
    #-----------------------------------------------------------------
    std_result = np.std((z_phot - z_spec)/(1 + z_spec), ddof=1)
    print('Standard Deviation: %6.4f' % std_result)
    

    # Normalized MAD (Median Absolute Deviation):
    #------------------------------------------
    nmad = 1.48 * np.median(np.abs((z_phot - z_spec)/(1 + z_spec)))
    print('Normalized MAD: %6.4f' % nmad)

    # Fraction of |delta-z| > slope*(1+z) outliers (printed as a percentage):
    #-------------------------------------------
    eta = np.sum(np.abs((z_phot - z_spec)/(1 + z_spec)) > slope)/len(z_spec)
    print('Delta z >%4.2f(1+z) outliers: %6.3f percent' % (slope, 100.*eta))
    
    # Median offset (normalized by (1+z)), i.e., bias:
    #-----------------------------------------------
    bias = np.median(((z_phot - z_spec)/(1 + z_spec)))
    sigbias=std_result/np.sqrt(0.64*len(z_phot))
    print('Median offset: %6.3f +/- %6.3f' % (bias,sigbias))
    
    
     # overlay statistics with titles left-aligned and numbers right-aligned
    stats_txt = '\n'.join([
        'NMAD  = {:0.2f}'.format(nmad),
        'STDEV = {:0.2f}'.format(std_result),
        'BIAS  = {:0.2f}'.format(bias),
        'ETA   = {:0.2f}'.format(eta)
    ])
    
    
    return nmad,std_result,bias,eta,stats_txt
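
The reason for reporting both the standard deviation and the NMAD can be illustrated on synthetic data. The sketch below is a toy under stated assumptions: the catalog (a Gaussian core of width 0.02 plus 5% of catastrophic outliers) is invented for the illustration. It also compares the simplified NMAD used in `get_stats` with the usual definition, which subtracts the median before taking absolute values and uses the factor 1.4826.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
z_spec_toy = rng.uniform(0.0, 3.0, n)

# Normalized residuals: Gaussian core of width 0.02 ...
e_z = rng.normal(0.0, 0.02, n)
# ... plus 5% of catastrophic outliers spread uniformly in [-0.5, 0.5]
bad = rng.random(n) < 0.05
e_z[bad] = rng.uniform(-0.5, 0.5, bad.sum())
z_phot_toy = z_spec_toy + e_z * (1.0 + z_spec_toy)

dz = (z_phot_toy - z_spec_toy) / (1.0 + z_spec_toy)
std = np.std(dz, ddof=1)                                         # inflated by the outliers
nmad_simple = 1.48 * np.median(np.abs(dz))                       # form used in get_stats
nmad_centered = 1.4826 * np.median(np.abs(dz - np.median(dz)))   # usual definition

print(f"std = {std:.4f}, NMAD (simple) = {nmad_simple:.4f}, NMAD (centered) = {nmad_centered:.4f}")
```

The NMAD stays close to the width of the core while the standard deviation is dominated by the outliers; the two NMAD variants agree as long as the median offset is small.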
    

##  The CMNN Photo-z Estimator

In [None]:
import datetime

In [None]:
from scipy.stats import chi2

In [None]:
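def CMNNestimator(df,Ncalc = 5000):
    """
    Simplified Color-Matched Nearest-Neighbors (CMNN) leave-one-out photo-z estimator.

    input :
       - df : DataFrame with the columns 'redshift', 'mag_<band>' and 'magerr_<band>' for the bands u, g, r, i, z, y
       - Ncalc : number of galaxies (the first Ncalc rows of df) for which a photo-z is computed
    output : true redshifts, photo-z and photo-z uncertainties (-1 where not computed, NaN where no match was found)
    """
    # never loop past the end of the catalog
    Ncalc = min(Ncalc, len(df))
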
    # galaxy true redshifts
    data_tz = np.asarray(df['redshift'], dtype='float' )

    # galaxy apparent magnitudes
    data_m = np.transpose( np.asarray( (df['mag_u'],df['mag_g'],\
                                    df['mag_r'],df['mag_i'],\
                                    df['mag_z'],df['mag_y']),\
                                  dtype='float' ) )

    # galaxy apparent magnitude errors
    data_me = np.transpose( np.asarray( (df['magerr_u'],df['magerr_g'],\
                                     df['magerr_r'],df['magerr_i'],\
                                     df['magerr_z'],df['magerr_y']),\
                                  dtype='float' ) )
    
    # galaxy colors and color errors
    data_c = np.zeros( (len(data_m),5), dtype='float' )
    data_ce = np.zeros( (len(data_m),5), dtype='float' )

    data_c[:,0] = data_m[:,0] - data_m[:,1]
    data_c[:,1] = data_m[:,1] - data_m[:,2]
    data_c[:,2] = data_m[:,2] - data_m[:,3]
    data_c[:,3] = data_m[:,3] - data_m[:,4]
    data_c[:,4] = data_m[:,4] - data_m[:,5]

    data_ce[:,0] = np.sqrt( data_me[:,0]**2 + data_me[:,1]**2 )
    data_ce[:,1] = np.sqrt( data_me[:,1]**2 + data_me[:,2]**2 )
    data_ce[:,2] = np.sqrt( data_me[:,2]**2 + data_me[:,3]**2 )
    data_ce[:,3] = np.sqrt( data_me[:,3]**2 + data_me[:,4]**2 )
    data_ce[:,4] = np.sqrt( data_me[:,4]**2 + data_me[:,5]**2 )
    
    
    cmnn_ppf = 0.68 
    cmnn_minNclr = 5
    
    
    cmnn_thresh_table = np.zeros( 6, dtype='float' )
    for d in range(6):
        cmnn_thresh_table[d] = chi2.ppf(cmnn_ppf,d)
    cmnn_thresh_table[0] = float(0.0000)

    for d in range(6):
        print('degrees of freedom, threshold = ',d,cmnn_thresh_table[d])
        
        
    data_pz = np.zeros( len(data_m), dtype='float' ) - 1.0
    data_pze = np.zeros( len(data_m), dtype='float' ) - 1.0
    
    
    
    
    t1 = datetime.datetime.now()

    for i in range( Ncalc ):
        # progress report; i > 0 avoids a division by zero when Ncalc = 1000
        if (i > 0) and ((i == 100) or (i == 1000) or (i == Ncalc-1000)):
            t2 = datetime.datetime.now()
            print(i, t2-t1, ((t2-t1)/float(i))*(float(Ncalc-i)), ' remaining' )
        
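        # Mahalanobis distance in color space between galaxy i and every galaxy,
        # weighted by the color errors of galaxy i
        DM  = np.nansum( ( data_c[i,:] - data_c[:,:] )**2 / data_ce[i,:]**2, axis=1, dtype='float' )
        # degrees of freedom: each ratio is 1 when both colors are measured and NaN otherwise,
        # so the sum counts the colors usable for each pair
        DOF = np.nansum( ( data_c[i,:]**2 + data_c[:,:]**2 + 1.0 ) / ( data_c[i,:]**2 + data_c[:,:]**2 + 1.0 ),axis=1, dtype='int' )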
    
        data_th = np.zeros( len(data_c), dtype='float' )
        for d in range(6):
            tx = np.where( DOF == d )[0]
            data_th[tx] = cmnn_thresh_table[ d ]
            del tx
    
        # reset the Mahalanobis distance for this 'test' galaxy to be very large
        # this will "leave out" the current 'test' galaxy from the 'training set'
        DM[i] = 99.9
        
        index = np.where( \
        ( DOF >= cmnn_minNclr ) & \
        ( data_th > 0.00010 ) & \
        ( DM > 0.00010 ) & \
        ( DM <= data_th ) )[0]
    
        if len(index) > 0:
            rival = np.random.choice( index, size=1, replace=False )[0]
            data_pz[i] = data_tz[rival]
            data_pze[i] = np.std( data_tz[index] )
            del rival
        else:
            data_pz[i] = float('nan')
            data_pze[i] = float('nan')
        
        del index, data_th, DOF, DM
        
        
        
    tx = np.where( np.isnan(data_pz) )[0]
    print( len(tx), ' galaxies did not get a pz estimate' )
    del tx

    tx = np.where( data_pz > 0.0 )[0]
    print( len(tx), ' galaxies did get a pz estimate' )
    del tx
    
    
    return data_tz, data_pz,data_pze
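
The core of the loop above is the color-matching step for a single test galaxy. It can be sketched in isolation as follows. This is a toy version under stated assumptions, not the full CMNN estimator: the helper `cmnn_subset` and the synthetic catalog are hypothetical, and all five colors are assumed to be measured, so the number of degrees of freedom is fixed.

```python
import numpy as np
from scipy.stats import chi2

def cmnn_subset(colors, color_errs, i, ppf=0.68):
    """Toy color-matching step: indices of the galaxies matched to galaxy i."""
    # squared color differences weighted by the color errors of galaxy i
    dm = np.sum((colors[i] - colors) ** 2 / color_errs[i] ** 2, axis=1)
    # all colors are assumed measured, so dof = number of colors
    threshold = chi2.ppf(ppf, colors.shape[1])
    dm[i] = np.inf  # leave the test galaxy out of its own subset
    return np.where(dm <= threshold)[0], threshold

rng = np.random.default_rng(1)
true_z = rng.uniform(0.0, 3.0, 2000)
# hypothetical toy catalog: 5 colors varying smoothly with redshift, plus noise
colors = np.outer(true_z, [1.0, 0.8, 0.5, 0.3, 0.2]) + rng.normal(0.0, 0.05, (2000, 5))
color_errs = np.full_like(colors, 0.05)

idx, threshold = cmnn_subset(colors, color_errs, i=0)
if len(idx) > 0:
    z_phot = true_z[rng.choice(idx)]  # photo-z: redshift of one random match
    z_err = np.std(true_z[idx])       # uncertainty: scatter of the matched redshifts
else:
    z_phot, z_err = np.nan, np.nan
print(len(idx), threshold, z_phot, z_err)
```

When the subset contains a single galaxy, `np.std` returns zero, which explains why some galaxies end up with a null uncertainty.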
    


## START ML here

### Prepare target and features

We want to estimate the performance of the photo-z estimator itself, not the total performance including intrinsic redshift fluctuations. Therefore only the average magnitudes are used: the detected magnitudes are dropped. 

In [None]:
target = data["redshift"]

In [None]:
features = data[["mag_u","mag_g","mag_r","mag_i","mag_z","mag_y"]]

- Total number of samples to split in training, validation and test dataset

In [None]:
Ntot = len(target)
Ntot

### Split in training / test set

- speed of the notebook must be tuned with the training sample size

#### number of samples to be used in training

- depending on the required speed of the demo 

In [None]:
Ntrain = 10000
Ntest = Ntot-Ntrain

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Test fraction
test_sample_size_fraction=Ntest/Ntot
test_sample_size_fraction

In [None]:
# adapt the training dataset size according to the required running time 
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_sample_size_fraction, random_state=0)

In [None]:
X_train.shape

In [None]:
X_test.shape

**Note**
- because the model fit (training) may be long, we should limit the training dataset size for this demo.
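
As a side note, `train_test_split` also accepts an integer `train_size`, which fixes the number of training galaxies directly instead of going through a test fraction. The sketch below uses synthetic arrays (an assumption, standing in for `features` and `target`) to show the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 6))        # stand-in for the 6 magnitudes
y_demo = rng.uniform(0.0, 3.0, size=1000)  # stand-in for the redshifts

n_train = 200
# with an integer train_size, the test set is the complement of the training set
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=n_train, random_state=0)
print(X_tr.shape, X_te.shape)
```

Using an integer avoids any rounding of the sample sizes that a floating-point fraction may introduce.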

## Regressors definitions

### Regularized Linear model

- Instead of using LinearRegression, we start with the regularized Ridge regressor, whose alpha parameter sets the regularization strength.
- The features of a linear model should always be normalized.
- To account for non-linearities, the model can be expanded as a polynomial of the features.

Scikit-learn offers an easy way to define pipelines of tasks:
- PolynomialFeatures() extends the feature dataset with the powers (and cross products) of those features up to a given degree,
- StandardScaler() preprocesses the features to normalize them,
- Ridge is the regularized version of LinearRegression.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

ridge_regressor = make_pipeline(PolynomialFeatures(degree=5), StandardScaler(),Ridge(alpha=0.05))
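
To get a feeling for the size of the model defined above, the number of columns produced by the polynomial expansion can be counted. With $n$ input features and degree $d$, and the constant column included by default, the expansion has $\binom{n+d}{d}$ columns. The sketch below checks this on random data (an assumption, any array with 6 columns would do).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_six = np.random.default_rng(0).normal(size=(10, 6))  # 6 features, like the 6 magnitudes
poly = PolynomialFeatures(degree=5)
n_out = poly.fit_transform(X_six).shape[1]

# number of monomials of total degree <= 5 in 6 variables
print(n_out, comb(6 + 5, 5))
```

The Ridge regressor therefore fits several hundred coefficients, which is one reason why the regularization matters here.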

### RandomForest regressor

- The RandomForest regressor is an ensemble regressor of the bagging type (Bootstrap and Aggregation).  

- It combines the regressions of multiple decision tree regressors fitted in parallel on bootstrapped samples of the training sample.

- Each individual decision tree is deep, meaning that it individually overfits its bootstrapped sample. 

- The aggregation of the overfitting parallel decision tree models reduces the overfitting.
- In a Random Forest, the features considered at each decision tree node are drawn randomly. This reduces the correlation between the errors of the various trees.
- Thanks to these characteristics, RandomForest is expected to be one of the best non-linear regressors on tabular datasets.


- RandomForest includes a number of hyper-parameters.
- We use the hyper-parameters chosen by Jeff Newman for the DE School at Oxford in 2016.

In [None]:
from sklearn.ensemble import RandomForestRegressor
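# max_features='auto' was removed in scikit-learn 1.3; for a regressor it meant "use all features", i.e. max_features=1.0
randomforest_regressor = RandomForestRegressor(n_estimators = 50, max_depth = 30, max_features = 1.0)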
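
The aggregation step can be made explicit: the prediction of a fitted RandomForestRegressor is the average of the predictions of its individual trees, which are available in the `estimators_` attribute. The sketch below uses a small synthetic regression problem (an assumption) and also computes the tree-to-tree spread, which gives a rough idea of the dispersion of the ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_rf = rng.normal(size=(300, 6))
y_rf = X_rf[:, 0] - 0.5 * X_rf[:, 1] + rng.normal(scale=0.1, size=300)

rf_demo = RandomForestRegressor(n_estimators=20, max_depth=10, random_state=0).fit(X_rf, y_rf)

# one row of predictions per tree, for the first 5 samples
per_tree = np.stack([tree.predict(X_rf[:5]) for tree in rf_demo.estimators_])
mean_pred = per_tree.mean(axis=0)  # aggregated prediction
spread = per_tree.std(axis=0)      # tree-to-tree dispersion

print(np.allclose(rf_demo.predict(X_rf[:5]), mean_pred))
```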

### Gradient Boosting Regressor

The INRIA MOOC (2022) on scikit-learn (Machine learning in Python with scikit-learn: https://lms.fun-mooc.fr/courses/course-v1:inria+41026+session02/info) recommends the histogram-binned version of GradientBoostingRegressor, which is expected to offer a good balance between underfitting and overfitting.

- Boosting regressors fit shallow (individually underfitting) decision trees sequentially, each new tree correcting the errors of the previous ones. Among them, Gradient Boosting regressors are expected to avoid overfitting, and the Histogram Gradient Boosting regressor is expected to run faster.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import HistGradientBoostingRegressor
discretizer = KBinsDiscretizer(n_bins=64, encode="ordinal", strategy="quantile")
histogram_gradient_boosting_regressor = make_pipeline(discretizer, HistGradientBoostingRegressor(max_iter=30))

## Evaluation metrics

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from sklearn.metrics import make_scorer
scoring = {'r2': make_scorer(r2_score),'mae': make_scorer(mean_absolute_error),'mse': make_scorer(mean_squared_error)}
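
When these metrics are called directly, scikit-learn expects the arguments in the order `(y_true, y_pred)`. The mean absolute error and the mean squared error are symmetric, but the R2 score is not, because it is normalized by the variance of the true values. The toy numbers below are chosen only to make the difference visible.

```python
from sklearn.metrics import mean_absolute_error, r2_score

y_true_demo = [1.0, 2.0, 3.0]
y_hat_demo = [1.0, 2.0, 4.0]

r2_ok = r2_score(y_true_demo, y_hat_demo)       # correct order: (y_true, y_pred)
r2_swapped = r2_score(y_hat_demo, y_true_demo)  # swapped arguments give another value
mae_ok = mean_absolute_error(y_true_demo, y_hat_demo)
mae_swapped = mean_absolute_error(y_hat_demo, y_true_demo)

print(r2_ok, r2_swapped, mae_ok, mae_swapped)
```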

## Use of cross-validation

- We use cross-validation to select a sub-sample of galaxies from the complete training sample and train the model on this subset.
- This sub-sampling is repeated several times (n_splits=5).

The interest of this repeated sub-sampling is to obtain a set of similar but slightly different fitted models, from which we can derive several predictions for a test sample, and thus an average predicted value and its variation (or a PDF).

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
from sklearn.model_selection import ShuffleSplit

In [None]:
# We don't know if the galaxies are ordered by redshift or come in random order. Thus we shuffle the training dataset randomly before splitting. 
cv = ShuffleSplit(n_splits=5, test_size=.80, random_state=0)

## Ridge model

- The cross_validate function fits n_splits models on n_splits random subsamples.
- The sample is randomized beforehand.
- The evaluation metrics are given by the scoring argument (the INRIA MOOC uses scoring="neg_mean_absolute_error").
- The n_splits fitted models are returned (to be able to make n_splits predictions for a single test sample).

### Cross validation

In [None]:
%%time
t1 = datetime.datetime.now()
cv_results = cross_validate(ridge_regressor,X_train,y_train,cv=cv,scoring=scoring,return_estimator=True)
t2 = datetime.datetime.now()
deltat = (t2-t1).total_seconds() 
print(f"Ridge CV : elapsed time {deltat:.2f} sec")

In [None]:
df_cv_results = pd.DataFrame(cv_results)
df_cv_results

### estimation

In [None]:
# Choose one of the n_splits models; the predictions of all estimators could also be calculated (average and rms)
y_pred = cv_results["estimator"][0].predict(X_test)
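
The average and rms over the n_splits fitted models mentioned above can be obtained by stacking the predictions of all returned estimators. The sketch below is self-contained and runs on synthetic data with a plain Ridge model (both are assumptions made for the illustration); in the notebook, the same pattern would apply to `cv_results["estimator"]` and `X_test`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_validate

rng = np.random.default_rng(0)
X_cv = rng.normal(size=(500, 6))
y_cv = X_cv @ np.array([0.5, -0.2, 0.1, 0.3, 0.0, -0.4]) + rng.normal(scale=0.1, size=500)
X_new = rng.normal(size=(10, 6))  # stand-in for a test sample

cv_demo = ShuffleSplit(n_splits=5, test_size=0.8, random_state=0)
res = cross_validate(Ridge(alpha=0.05), X_cv, y_cv, cv=cv_demo, return_estimator=True)

# one row of predictions per fitted model
preds = np.stack([est.predict(X_new) for est in res["estimator"]])
y_mean = preds.mean(axis=0)  # average prediction per galaxy
y_rms = preds.std(axis=0)    # model-to-model dispersion per galaxy

print(preds.shape, y_mean[:3], y_rms[:3])
```

The dispersion obtained this way only reflects the sensitivity of the fit to the training subsample; it is not a full photo-z uncertainty.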

### Performances

In [None]:
coords = zip(y_test,y_pred)
points = hv.Points(coords).opts(tools=['box_select', 'lasso_select'])

In [None]:
nmad,std_result,bias,eta,stats_txt1 = get_stats(y_test.values,y_pred)

In [None]:
# Create a holoviews object to hold and plot data
# Create the linked streams instance
boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

# Apply the datashader
p1 = dynspread(datashade(points, cmap="Viridis"))
p1 = p1.opts(width=HV_HISTO_SINGLE_WIDTH, height=HV_HISTO_SINGLE_HEIGHT,
    padding=0.05, show_grid=True,
    xlim=(0, 3), ylim=(0, 3.0),
    xlabel="z-spec", ylabel="z-phot",title="Ridge Regressor")

In [None]:
p1 * plot_lines() * hv.Text(0.5, 2.5, stats_txt1)

### performance metrics

In [None]:
msg_r2   = f"R2 score : \t\t {df_cv_results['test_r2'].mean():.3f} +/-  {df_cv_results['test_r2'].std():.3f}"
msg_mae  = f"MAE mean absolute error : \t {df_cv_results['test_mae'].mean():.3f} +/-  {df_cv_results['test_mae'].std():.3f}"
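# take the square root of each split's MSE first: sqrt(std(MSE)) is not the spread of the RMSE
rmse_cv  = np.sqrt(df_cv_results['test_mse'])
msg_rmsq = f"Root MSE error : \t\t {np.sqrt(df_cv_results['test_mse'].mean()):.3f} +/-  {rmse_cv.std():.3f}"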

In [None]:
print(msg_r2)
print(msg_mae)
print(msg_rmsq)

## Random Forest

- take the hyper-parameter of the DE school

### Select smaller training sample

- faster model fit

In [None]:
Ntrain = 8000
Ntest = Ntot-Ntrain
test_sample_size_fraction=Ntest/Ntot
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_sample_size_fraction, random_state=0)

### training

In [None]:
#cv_results = cross_validate(randomforest_regressor ,X_train,y_train,cv=cv,scoring=scoring,return_estimator=True)

In [None]:
%%time
# We simply use the fit method, not the cross_validate to accelerate the demo 
t1 = datetime.datetime.now()
randomforest_regressor.fit(X_train,y_train)
t2 = datetime.datetime.now()
deltat = (t2-t1).total_seconds() 
print(f"RandomForest : elapsed time {deltat:.2f} sec")

### Estimate

In [None]:
y_pred =  randomforest_regressor.predict(X_test)

### Performances

In [None]:
nmad,std_result,bias,eta,stats_txt2= get_stats(y_test.values,y_pred)

In [None]:
coords = zip(y_test,y_pred)
points = hv.Points(coords).opts(tools=['box_select', 'lasso_select'])

In [None]:
boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

# Apply the datashader
p2 = dynspread(datashade(points, cmap="Viridis"))
p2 = p2.opts(width=HV_HISTO_SINGLE_WIDTH, height=HV_HISTO_SINGLE_HEIGHT,
    padding=0.05, show_grid=True,
    xlim=(0, 3), ylim=(0, 3.0),
    xlabel="z-spec", ylabel="z-phot",title="Random Forest Regressor")

In [None]:
p2 * plot_lines() *  hv.Text(0.5, 2.5, stats_txt2)

#### Performance metrics in scikit learn

In [None]:
# scikit-learn metrics take (y_true, y_pred); the order matters for r2_score
r2  = r2_score(y_test,y_pred)
mae = mean_absolute_error(y_test,y_pred)
mse = mean_squared_error(y_test,y_pred)

In [None]:
msg_r2   = f"R2 score : \t\t {r2:.3f}"
msg_mae  = f"MAE mean absolute error : \t {mae:.3f}"
msg_rmsq = f"Root MSE error : \t\t {np.sqrt(mse):.3f}"

In [None]:
print(msg_r2)
print(msg_mae)
print(msg_rmsq)

## Histogram Gradient Boosting regressor

- No particular optimisation of hyper parameters performed here 

### Select smaller training sample

- faster model fit

In [None]:
Ntrain = 8000
Ntest = Ntot-Ntrain
test_sample_size_fraction=Ntest/Ntot
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=test_sample_size_fraction, random_state=0)

### training (model fit)

In [None]:
%%time
# We simply use the fit method, not the cross_validate to accelerate the demo 
t1 = datetime.datetime.now()
histogram_gradient_boosting_regressor.fit(X_train,y_train) 
t2 = datetime.datetime.now()
deltat = (t2-t1).total_seconds() 
print(f"Histogram Gradient Boosting : elapsed time {deltat:.2f} sec")

### Estimation

In [None]:
y_pred =  histogram_gradient_boosting_regressor.predict(X_test)

### Performances

In [None]:
coords = zip(y_test,y_pred)
points = hv.Points(coords).opts(tools=['box_select', 'lasso_select'])

In [None]:
nmad,std_result,bias,eta,stats_txt3 = get_stats(y_test.values,y_pred)

In [None]:
boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

# Apply the datashader
p3 = dynspread(datashade(points, cmap="Viridis"))
p3 = p3.opts(width=HV_HISTO_SINGLE_WIDTH, height=HV_HISTO_SINGLE_HEIGHT,
    padding=0.05, show_grid=True,
    xlim=(0, 3), ylim=(0, 3.0),
    xlabel="z-spec", ylabel="z-phot",title="Histogram Gradient Boosting Regressor")

In [None]:
p3 * plot_lines() *  hv.Text(0.5, 2.5, stats_txt3)

#### Performance metrics in scikit learn

In [None]:
# scikit-learn metrics take (y_true, y_pred); the order matters for r2_score
r2  = r2_score(y_test,y_pred)
mae = mean_absolute_error(y_test,y_pred)
mse = mean_squared_error(y_test,y_pred)

In [None]:
msg_r2   = f"R2 score : \t\t {r2:.3f}"
msg_mae  = f"MAE mean absolute error : \t {mae:.3f}"
msg_rmsq = f"Root MSE error : \t\t {np.sqrt(mse):.3f}"

In [None]:
print(msg_r2)
print(msg_mae)
print(msg_rmsq)

## The CMNN Photo-z Estimator


We want to compare the above ML algorithms with the CMNN leave-one-out estimator proposed by Melissa Graham in another DP0 contributed notebook.

A full description of the Color-Matched Nearest-Neighbors (CMNN) Photometric Redshift Estimator can be found in the following journal articles:
 * <a href="https://ui.adsabs.harvard.edu/abs/2018AJ....155....1G/abstract">Photometric Redshifts with the LSST: Evaluating Survey Observing Strategies</a> (Graham et al. 2018) 
 * <a href="https://ui.adsabs.harvard.edu/abs/2020AJ....159..258G/abstract">Photometric Redshifts with the LSST. II. The Impact of Near-infrared and Near-ultraviolet Photometry</a> (Graham et al. 2020)

The CMNN PZ Estimator can also be found on GitHub: https://github.com/dirac-institute/CMNN_Photoz_Estimator

In [None]:
cmnn_tz, cmnn_pz, cmnn_pze = CMNNestimator(data,Ncalc = 3000)

In [None]:
tx = np.where(cmnn_pz > 0.0 )[0]

cmnn_tz_sel=cmnn_tz[tx]
cmnn_pz_sel=cmnn_pz[tx] 
cmnn_pze_sel =  cmnn_pze[tx]

In [None]:
coords = zip(cmnn_tz_sel,cmnn_pz_sel)
points = hv.Points(coords).opts(tools=['box_select', 'lasso_select'])

In [None]:
nmad,std_result,bias,eta,stats_txt4 = get_stats(cmnn_tz_sel,cmnn_pz_sel)

In [None]:
boundsxy = (0, 0, 0, 0)
box = streams.BoundsXY(source=points, bounds=boundsxy)
bounds = hv.DynamicMap(lambda bounds: hv.Bounds(bounds), streams=[box])

# Apply the datashader
p4 = dynspread(datashade(points, cmap="Viridis"))
p4 = p4.opts(width=HV_HISTO_SINGLE_WIDTH, height=HV_HISTO_SINGLE_HEIGHT,
    padding=0.05, show_grid=True,
    xlim=(0, 3), ylim=(0, 3.0),
    xlabel="z-spec", ylabel="z-phot",title="CMNN estimator")

In [None]:
p4 * plot_lines() *  hv.Text(0.5, 2.5, stats_txt4)

**The photo-z accuracy is the absolute value of the difference between the true and photometric redshifts**.

Recall that **the photo-z uncertainty is the standard deviation of the true redshifts**  of the training-set galaxies in the CMNN subset. The fact that a bunch of galaxies have an uncertainty of zero means there are galaxies with only 1 training-set galaxy in their CMNN subset. The full CMNN PZ Estimator treats such galaxies better.

### Should we use uncertainty-accuracy normalized by (1+z) or not ?

In [None]:
# accuracy = |z_true - z_phot| (not the bias returned by get_stats)
accuracy = np.abs(cmnn_tz_sel-cmnn_pz_sel)
err= cmnn_pze_sel
coords1 = zip(accuracy,err)
points1 = hv.Points(coords1).opts(tools=['box_select', 'lasso_select'])

In [None]:
# accuracy and uncertainty normalized by (1 + z_true)
accuracy_norm = np.abs(cmnn_tz_sel-cmnn_pz_sel)/(1+cmnn_tz_sel)
err_norm= cmnn_pze_sel/(1+cmnn_tz_sel)
coords2 = zip(accuracy_norm,err_norm)
points2 = hv.Points(coords2).opts(tools=['box_select', 'lasso_select'])

In [None]:
# Apply the datashader
p5 = dynspread(datashade(points1, cmap="Viridis"))
p5 = p5.opts(width=HV_HISTO_SINGLE_WIDTH//2, height=HV_HISTO_SINGLE_HEIGHT//2,
    padding=0.05, show_grid=True,xlim=(0,1),ylim=(0,1),
    xlabel="accuracy", ylabel="uncertainty",title="CMNN estimator : uncertainty vs accuracy ")

In [None]:
# Apply the datashader
p6 = dynspread(datashade(points2, cmap="Viridis"))
p6 = p6.opts(width=HV_HISTO_SINGLE_WIDTH//2, height=HV_HISTO_SINGLE_HEIGHT//2,
    padding=0.05, show_grid=True,xlim=(0,1),ylim=(0,1),
    xlabel="accuracy", ylabel="uncertainty",title="CMNN estimator : norm-uncertainty vs norm-accuracy ")

In [None]:
p5 + p6

# Future work : Optimisation of hyper parameters

- TBD later in another notebook

In [None]:
from sklearn.model_selection import GridSearchCV

### 6. Clean the output directory 

In [None]:
if FLAG_CLEAN_DATAONDISK:
    if os.path.isdir(TMPNBDIR):
        try:
            shutil.rmtree(TMPNBDIR)
        except OSError as e:
            print("Error: %s : %s" % (TMPNBDIR, e.strerror)) 