In [1]:
%run ../talktools.py

# An Overview of "The Cannon"

Astro 128/256 (UC Berkeley, 2024)

From [Ness et al. (2015)](https://ui.adsabs.harvard.edu/abs/2015ApJ...808...16N/abstract)

## Motivation

* A few of your stars have good labels from somewhere
* How can you quickly and efficiently transfer them to stars without labels?
* Why do this?
 * No good models at wavelengths of interest
 * Two surveys on the same "system"
 * Stars at a variety of SNRs
 * Model-based fitting of each star prohibitively expensive (e.g., too many stars)
* Two conceptual steps
 * Training step -- use high quality data to train and validate your model
 * Test step -- assume that the model created in the training step applies to all spectra in survey and apply the model
 * Together these two steps result in "label transfer"

## Step 1: Select a Training Set (Sec 2.2)

* The set of reference objects is critical, as the label transfer to the survey objects can only be as good as the quality of the reference label set.

* 542 stars from 19 clusters 
 * Clusters have an advantage over field stars: all stars have the same age, [Fe/H], distance, ...
* Use known labels from APOGEE/ASPCAP pipeline ($T_{eff}$, log g, [$\alpha$/Fe], [C/M], [N/M], micro-turbulence)
* One goal is to place all stellar labels (true and inferred) on a common scale.

__Figure 1 from Ness et al.:__ ASPCAP-corrected training set for The Cannon:
<img src="figs/ness2015_fig1.jpg" width=700 height=700></img>

### Modifying the Training Set

* ASPCAP only includes stars with SNR>70 and log g < 3.5
 * No dwarf stars
* Pleiades is the only cluster with dwarf stars (small training set may make it hard to assign dwarf star labels)
* For clusters, stellar parameters should lie on single isochrone
* "Isochrone-corrected labels": modify all training stars so their log g values agree with a single Padova isochrone based on literature values of cluster age and metallicity

__Figure 2 from Ness et al.:__ Isochrone-corrected training set for The Cannon:
<img src="figs/ness2015_fig2.jpg" width=700 height=700></img>

## Step 2: Continuum Normalization (Sec 2.3, 5.3)
* In theory, continuum is defined as pixels that are not affected by any absorption or emission lines
* In practice, very hard to find "pure" continuum in almost any spectrum
* Good practice would be to find pixels that are not affected by changing any features in the model

### In The Cannon:
* First, define a "pseudo-continuum" by fitting a polynomial to the upper 90th percentile of the spectrum, as determined by a running median over 50A windows. This effectively smooths out the spectrum.
 * Effective, but S/N dependent
* Second, using the above as an initialization, run The Cannon on training set to find continuum pixels (Sec 5.3).
 * That is, using The Cannon, they find pixels that are minimally affected by changing stellar labels.
 * They find very small S/N dependence in continuum determination this way
* Fit 2nd order [Chebyshev polynomials](http://mathworld.wolfram.com/ChebyshevPolynomialoftheFirstKind.html) to continuum pixels in hand for each of the 3 chips (15150-15800A, 15890-16430A, 16490-16950A)
 * Polynomials poorly constrained at boundaries
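
The final normalization step above can be sketched with `np.polynomial.Chebyshev.fit`. This is a minimal illustration on one synthetic chip, not the pipeline from the paper: the helper name `normalize_chip`, the synthetic spectrum, and the continuum mask are all assumptions made for the example.

```python
import numpy as np

def normalize_chip(wave, flux, cont_mask, deg=2):
    """Divide one chip's flux by a low-order Chebyshev continuum fit."""
    # Fit only the pixels flagged as continuum.
    cheb = np.polynomial.Chebyshev.fit(wave[cont_mask], flux[cont_mask], deg)
    continuum = cheb(wave)
    return flux / continuum, continuum

# Synthetic chip (assumed): gently sloping continuum with one absorption line.
wave = np.linspace(15150.0, 15800.0, 500)
true_cont = 1.0 + 2e-4 * (wave - wave.mean())
line = 1.0 - 0.4 * np.exp(-0.5 * ((wave - 15500.0) / 2.0) ** 2)
flux = true_cont * line

# Hypothetical continuum mask: every pixel more than 20 A from the line center.
cont_mask = np.abs(wave - 15500.0) > 20.0
norm_flux, continuum = normalize_chip(wave, flux, cont_mask)
```

In a real spectrum the mask would come from the continuum-pixel search rather than from knowing where the lines are, and the fit would be repeated separately for each of the three chips.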
 
__Figure 3 from Ness et al.:__ Example Continuum Normalized Spectra:
<img src="figs/ness2015_fig3.jpg" width=700 height=700></img>


In [2]:
import numpy as np
np.polynomial.Chebyshev.fit?

Signature:
np.polynomial.Chebyshev.fit(
    x,
    y,
    deg,
    domain=None,
    rcond=None,
    full=False,
    w=None,
    window=None,
    symbol='x',
)
Docstring:
Least squares fit to data.

Return a series instance that is the least squares fit to the data
`y` sampled at `x`. The domain of the returned instance can be
specified and 

## Step 3: Training the Generative Model (Sec 3)

### Underlying assumptions:
 * Continuum-normalized spectra of stars with identical labels look nearly identical at every pixel
  * Really an approximation since number of labels used is not exhaustive
 * Expected flux at every pixel changes continuously as a function of the labels
* Model is generative: produces a PDF for the flux at every pixel/wavelength.

### Define variables:
* Number of reference stars: $N_{ref}$, indexed by $n$
 * Each has continuum-normalized flux measurement $f_{n \lambda}$ at each wavelength $\lambda$
* Each of the training spectra (of index $n$) has $K$ labels $\ell_{nk}$
 * Assume labels are perfectly known.
* label vector: $\ell_n$
* For any star, $n$ , at any pixel $\lambda$, the flux $f_{n \lambda}$ can be described as a smooth function of the star's labels: $\ell(T_{eff}, \log g, [Fe/H], \cdots)$

### Uncertainties
* Observational uncertainties at each pixel: $\sigma_{n \lambda}$
* Second noise term to account for possible systematics and other variations per pixel: $s_{\lambda}$
 * Presumed to be Gaussian

### Write down model:

* Predict the flux $f_{n \lambda}$ at every pixel, given label and coefficient vectors: $\ell_n$ and $\theta_\lambda$
* Likelihood: $\rm{ln} \ p (f_n\ | \ \ell_n, \theta) = \sum_{\lambda=1}^{L} \rm{ln} \ p(f_{n \lambda} \ |\ \ell_n, \theta_{\lambda}, s^2_{\lambda})$

* Single pixel likelihood (up to an additive constant): $\rm{ln}\ p(f_{n \lambda} \ |\ \ell_n, \theta_{\lambda}, s^2_{\lambda}) = - \frac{1}{2} \frac{[f_{n \lambda} - \theta_{\lambda}^{T} \cdot \ell_{n}]^2}{\sigma_{n \lambda}^2 + s_\lambda^2} - \frac{1}{2} \rm{ln}\ (\sigma_{n \lambda}^2 + s_\lambda^2)$

* $\ell^T \equiv \{1, T_{eff}, \log g, [Fe/H], T_{eff}^2, T_{eff} \log g, \cdots, [Fe/H]^2\}$
 * This vector contains all products of the stellar labels up to second order
 * Simplest model would be linear model (Eqn 5 in Ness et al.)
 * This is the quadratic model (Eqn 6 in Ness et al.)

* $\theta^T \equiv \{\theta_{\lambda}, s_{\lambda}^2 \}_{\lambda=1}^{L}$
 * This vector is a vector of coefficients
 * Every pixel (i.e., every wavelength) has a set of coefficients

* Some interpretation:
 * $\theta_{\lambda 0}$: baseline spectrum
 * $\theta_{\lambda k}$: first order coefficients are first derivatives of the spectrum with respect to the linear labels
 * $\theta_{\lambda k k^\prime}$: second order coefficients are second derivatives of the spectrum with respect to the quadratic labels
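
To make the model concrete, here is a minimal sketch of the quadratic label vector and the single-pixel Gaussian log-likelihood. The function names are hypothetical, the additive constant is dropped, and the labels are assumed to be offset and rescaled already so that all terms have comparable size.

```python
import numpy as np

def quadratic_label_vector(labels):
    """Return [1, l_1..l_K, all products l_j * l_k with j <= k]."""
    labels = np.asarray(labels, dtype=float)
    j, k = np.triu_indices(labels.size)
    return np.concatenate(([1.0], labels, labels[j] * labels[k]))

def ln_p_pixel(flux, sigma, labels, theta, s2):
    """Single-pixel Gaussian log-likelihood, up to an additive constant."""
    var = sigma ** 2 + s2                      # observational + intrinsic variance
    resid = flux - theta @ quadratic_label_vector(labels)
    return -0.5 * resid ** 2 / var - 0.5 * np.log(var)

# Three labels give 1 + 3 + 6 = 10 coefficients per pixel.
example_vector = quadratic_label_vector([0.1, 0.2, 0.3])
```

With three labels every pixel carries ten coefficients plus one scatter term, which is why the total number of fitted values grows so quickly with the number of pixels.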
 
### Training the model

* For training set $f_{n \lambda}$ and $\ell_n$ are known
* Thus, can solve for $\theta_{\lambda}$ and $s_{\lambda}$
* Could use MCMC, but would be very slow ($\sim$80,000 values of $\theta_{\lambda}$)
* Use non-linear optimization (non-linear because of quadratic labels and noise $s_\lambda^2$)
* $\theta_{\lambda}, s_{\lambda} \leftarrow \underset{\theta_{\lambda}, s_{\lambda}}{\rm{argmax}} \sum_{n=1}^{N_{ref}} \ \rm{ln} \ p(f_{n \lambda} \ | \ \theta_{\lambda}, \ell_{n}, s_{\lambda}^2)$
* Consider all pixels in survey, one pixel at a time
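
One way to sketch the per-pixel training step: at a fixed value of $s_\lambda^2$ the fit for $\theta_\lambda$ is a weighted linear least-squares problem, so a simple option is to scan a grid of $s_\lambda^2$ values and keep the one with the highest likelihood. The grid search, the function name `train_pixel`, and the synthetic data are assumptions for illustration, not necessarily the optimizer used in the paper.

```python
import numpy as np

def train_pixel(flux, sigma, design, s2_grid):
    """Fit theta and s^2 for one pixel.

    flux, sigma : (N,) training fluxes and uncertainties at this pixel
    design      : (N, P) matrix whose rows are the label vectors
    s2_grid     : candidate values for the intrinsic variance s^2
    """
    best = None
    for s2 in s2_grid:
        w = 1.0 / (sigma ** 2 + s2)            # inverse total variance
        sw = np.sqrt(w)
        # At fixed s^2 this is weighted linear least squares in theta.
        theta, *_ = np.linalg.lstsq(design * sw[:, None], flux * sw, rcond=None)
        resid = flux - design @ theta
        lnp = np.sum(-0.5 * w * resid ** 2 + 0.5 * np.log(w))
        if best is None or lnp > best[0]:
            best = (lnp, theta, s2)
    return best[1], best[2]

# Synthetic training data for one pixel (linear label vector for brevity).
rng = np.random.default_rng(0)
n_ref = 200
labels = rng.normal(size=(n_ref, 3))
design = np.column_stack([np.ones(n_ref), labels])
theta_true = np.array([1.0, -0.05, 0.02, 0.10])
sigma = np.full(n_ref, 0.01)
s_true = 0.02
noise = rng.normal(size=n_ref) * np.sqrt(sigma ** 2 + s_true ** 2)
flux = design @ theta_true + noise

s2_grid = np.linspace(0.0, 0.05, 101) ** 2
theta_hat, s2_hat = train_pixel(flux, sigma, design, s2_grid)
```

Because each pixel is fit independently, the loop over pixels is trivially parallel.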


### A very simple conceptual example

* Suppose we only had two labels: $T_{\rm eff}$ and [Fe/H] 
* A quadratic model for the flux at one pixel would be:
 - $f$ = $\theta_0$ + $\theta_1$ $T_{\rm eff}$ + $\theta_2$ [Fe/H] + $\theta_3$ $T_{\rm eff}$[Fe/H] + $\theta_4$ $T_{\rm eff}^2$ + $\theta_5$ [Fe/H]$^2$
 - repeat procedure summing over all pixels
 - use the training data to find values for $\theta_k$ (i.e., you have labels, solve for $\theta_k$)
 - for validation/real data: you know each $\theta_k$ and you want to assign a label for each star
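
The two-label example can be sketched in a few lines for a single pixel. Everything here is synthetic and assumed for illustration: the coefficient values, the noise level, and the rescaling of $T_{\rm eff}$ to an offset in units of 1000 K so that all terms are of order unity. The fit ignores per-star uncertainties and the scatter term.

```python
import numpy as np

rng = np.random.default_rng(1)

def label_vector(teff, feh):
    """Quadratic label vector in the same term order as the model above."""
    return np.array([1.0, teff, feh, teff * feh, teff ** 2, feh ** 2])

# Hypothetical training labels (teff is a rescaled offset, not Kelvin).
n_train = 100
teff = rng.uniform(-1.0, 1.0, n_train)
feh = rng.uniform(-1.0, 0.5, n_train)
theta_true = np.array([0.9, 0.05, -0.10, 0.02, -0.01, 0.03])

# Training step at one pixel: labels known, solve for the six coefficients.
design = np.array([label_vector(t, z) for t, z in zip(teff, feh)])
flux = design @ theta_true + rng.normal(scale=0.002, size=n_train)
theta_hat, *_ = np.linalg.lstsq(design, flux, rcond=None)
```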



## Step 4: Labeling Survey Spectra (Sec 4)

* "Test data": $M$ unlabeled spectra, indexed by $m$
 * continuum-normalized flux $f_{m \lambda}$ at each wavelength $\lambda$
 * associated uncertainty: $\sigma_{m \lambda}$

* Assume spectral model coefficients from Step 3 and solve for labels of test data

* $\{\ell_{m k}\} \leftarrow \underset{\{\ell_{m k}\}}{\rm{argmax}} \sum_{\lambda=1}^{L} \ \rm{ln} \ p(f_{m \lambda} \ | \ \theta_{\lambda}, \ell_{m}, s_{\lambda}^2)$

* Follow the same optimization procedure as above, only solving for different variables
* Optimize over the labels of each survey object, considering all pixels of that object at once; the spectral model coefficients and scatter stay fixed at their training values
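
A sketch of the test step for one star, assuming the quadratic model and fixed coefficients. Since the variance does not depend on the labels, maximizing the likelihood is a weighted non-linear least-squares problem; the Gauss-Newton iteration below is one simple way to solve it and is an assumption of this example, as are the function names and the synthetic coefficients.

```python
import numpy as np

def label_vector_and_jacobian(labels):
    """Quadratic label vector and its derivative with respect to the labels."""
    K = labels.size
    j, k = np.triu_indices(K)
    vec = np.concatenate(([1.0], labels, labels[j] * labels[k]))
    eye = np.eye(K)
    # d(l_j * l_k)/d l_i = delta_ij * l_k + delta_ik * l_j
    dquad = eye[j] * labels[k, None] + eye[k] * labels[j, None]
    jac = np.vstack([np.zeros((1, K)), eye, dquad])
    return vec, jac

def infer_labels(flux, sigma, theta, s2, start, n_iter=20):
    """Gauss-Newton fit of one star's labels; theta has shape (L, P)."""
    labels = np.array(start, dtype=float)
    sw = 1.0 / np.sqrt(sigma ** 2 + s2)        # per-pixel weights
    for _ in range(n_iter):
        vec, jac = label_vector_and_jacobian(labels)
        resid = flux - theta @ vec
        J = theta @ jac                        # (L, K) model derivatives
        step, *_ = np.linalg.lstsq(J * sw[:, None], resid * sw, rcond=None)
        labels = labels + step
    return labels

# Synthetic coefficients for 300 pixels and 3 (rescaled) labels.
rng = np.random.default_rng(2)
n_pix, n_labels = 300, 3
n_terms = 1 + n_labels + n_labels * (n_labels + 1) // 2
theta = np.empty((n_pix, n_terms))
theta[:, 0] = 1.0
theta[:, 1:1 + n_labels] = rng.normal(scale=0.1, size=(n_pix, n_labels))
theta[:, 1 + n_labels:] = rng.normal(scale=0.02, size=(n_pix, n_terms - 1 - n_labels))
s2 = np.full(n_pix, 1e-5)
sigma = np.full(n_pix, 0.005)

true_labels = np.array([0.3, -0.2, 0.1])
vec, _ = label_vector_and_jacobian(true_labels)
flux = theta @ vec + rng.normal(size=n_pix) * np.sqrt(sigma ** 2 + s2)
labels_hat = infer_labels(flux, sigma, theta, s2, start=np.zeros(n_labels))
```

A fixed iteration count keeps the sketch short; a practical version would stop on a convergence criterion and may need several starting points.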


## Step 5: Results and Validation

### Model Selection (Sec 5.1)

* What about only using a linear-in-labels model?  Turns out to be too inflexible.
 * Shows large and systematic deviations when run on objects with known labels
 * Next simplest is the quadratic model.
 * Is a quadratic model good enough? Beyond scope of paper.
 
### Take-one-out Validation (Sec 5.2)

* Use N-1 reference stars to train model
* Run model on Nth star to predict its labels
* Repeat $N_{ref}$ times
* Only consider 3 labels ($T_{eff}, \log g, [Fe/H]$)
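
The take-one-out loop can be sketched as follows. For brevity this uses a linear-in-labels model with unweighted fits, so both the training and test steps reduce to linear least squares; that simplification, the helper names, and the synthetic data are assumptions of the example, and a quadratic model would slot into `train` and `infer` in the same way.

```python
import numpy as np

def train(fluxes, labels):
    """Per-pixel linear-in-labels fit; returns theta with shape (L, K + 1)."""
    design = np.column_stack([np.ones(len(labels)), labels])
    theta_T, *_ = np.linalg.lstsq(design, fluxes, rcond=None)
    return theta_T.T

def infer(flux, theta):
    """Solve for one star's labels given fixed coefficients."""
    labels, *_ = np.linalg.lstsq(theta[:, 1:], flux - theta[:, 0], rcond=None)
    return labels

def take_one_out(fluxes, labels):
    """Predict each reference star's labels from a model trained on the rest."""
    n_ref = len(labels)
    predicted = np.empty_like(labels)
    for i in range(n_ref):
        keep = np.arange(n_ref) != i
        theta = train(fluxes[keep], labels[keep])
        predicted[i] = infer(fluxes[i], theta)
    return predicted

# Synthetic reference set: 40 stars, 100 pixels, 3 rescaled labels.
rng = np.random.default_rng(3)
n_ref, n_pix, n_labels = 40, 100, 3
labels = rng.normal(size=(n_ref, n_labels))
theta_true = np.column_stack(
    [np.ones(n_pix), rng.normal(scale=0.1, size=(n_pix, n_labels))]
)
fluxes = np.column_stack([np.ones(n_ref), labels]) @ theta_true.T
fluxes += rng.normal(scale=0.005, size=fluxes.shape)

predicted = take_one_out(fluxes, labels)
scatter = np.std(predicted - labels, axis=0)
```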

__Figure 4 from Ness et al.:__ Take-one-out Validation:
<img src="figs/ness2015_fig4.jpg" width=700 height=700></img>

* The Cannon generally has smaller scatter than ASPCAP
* They proceed to discuss trends and outliers in above plot.

### Assessment of model performance (Sec 5.2)

* Examine coefficients and scatter for example spectral regions A & B

__Figure 5 from Ness et al.:__ Detailed look at one star:
<img src="figs/ness2015_fig5.jpg" width=700 height=700></img>

"Figure 5. First-order coefficients and scatter across the sample regions of the spectra from Figure 3, (A) and (B). Top panel: the baseline spectra representing the first coefficient from the set of reference spectra; middle panel: the next three coefficients (${\theta }_{1}$, ${\theta }_{2}$, ${\theta }_{3}$), which correspond to the labels (${T}_{\mathrm{eff}},\mathrm{log}\;g,[\mathrm{Fe}/{\rm{H}}]$ ); bottom panel: the scatter of the fit with a tenfold expanded vertical scale. The red, blue, and green areas in the top panel encompass the wavelength regions with the 5% highest (absolute value) coefficients for the $[\mathrm{Fe}/{\rm{H}}]$, $\mathrm{log}\;g$ and ${T}_{\mathrm{eff}}$ labels, respectively. The ${T}_{\mathrm{eff}}$ coefficient has been multiplied by a factor of 1000 simply to show this coefficient on a similar scale to the other coefficients. This indicates where the flux in these spectrum is particularly sensitive to the labels. Note that the $[\mathrm{Fe}/{\rm{H}}]$ label is dominant in the contribution level and from the top panel it is clear that there is significant covariance between the labels and there are only a few regions of $\mathrm{log}\;g$ sensitivity. The filled dots in the baseline spectrum in the top panel indicate the wavelengths at which the dependencies on all labels are weak, which we operatively identify as continuum pixels."

* There are very few regions where the flux is a function of only one of the labels, and pixels are typically co-variant (that is, the same pixel will have a higher flux at both lower ${T}_{\mathrm{eff}}$ and higher $[\mathrm{Fe}/{\rm{H}}]$). This simply reflects well-known co-variances between, for example, temperature and $[\mathrm{Fe}/{\rm{H}}]$. The strongest $\mathrm{log}\;g$ dependence is typically associated with weak lines including the wings of the feature and the $[\mathrm{Fe}/{\rm{H}}]$ label, with strong lines, particularly the depth of the line.

* The scatter is small and this indicates that our model is a good representation of the data. However, the scatter is highest where the most information in the spectra are contained. This implies that either our quadratic-in-labels spectral model is still somewhat too restricted, or that the labels of our training data set are imperfect or incomplete (for example, lacking $[\alpha /\mathrm{Fe}]$ as a label), or a combination of these effects.
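
The idea of identifying continuum pixels from the fitted coefficients (pixels where the dependence on all labels is weak) can be sketched as a simple threshold cut. The function name, the use of only the linear coefficients, and all threshold values below are hypothetical placeholders, not the numbers adopted in the paper.

```python
import numpy as np

def find_continuum_pixels(theta, n_labels, base_tol, coeff_tol):
    """Flag pixels with baseline near 1 and small linear coefficients.

    theta     : (L, P) coefficients; column 0 is the baseline spectrum,
                columns 1..K are the first-order coefficients
    coeff_tol : (K,) threshold per label
    """
    base_ok = np.abs(theta[:, 0] - 1.0) < base_tol
    linear = np.abs(theta[:, 1:1 + n_labels])
    coeff_ok = np.all(linear < np.asarray(coeff_tol), axis=1)
    return base_ok & coeff_ok

# Three toy pixels with made-up coefficients.
theta = np.array([
    [1.000, 1e-6, 0.001, 0.002],   # flat and insensitive: continuum
    [0.850, 1e-6, 0.001, 0.002],   # inside an absorption line
    [0.995, 1e-6, 0.001, 0.150],   # strongly sensitive to the third label
])
mask = find_continuum_pixels(
    theta, n_labels=3, base_tol=0.02, coeff_tol=[1e-5, 0.005, 0.01]
)
```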


### Model-Data comparison (Sec 5.2)

__Figure 6 from Ness et al.:__ Models + Scatter (cyan), data (black) for 4 stars:
<img src="figs/ness2015_fig6.jpg" width=1500 height=1500></img>




## Step 6: Application to APOGEE (Sec 5.4)

__Figure 7 from Ness et al.:__ Application to APOGEE:
<img src="figs/ness2015_fig7.jpg" width=700 height=700></img>

Figure 7. ASPCAP DR10 vs. The Cannon for six different fields including in the disk, bulge, and halo. The number of stars for each subfigure is 211 (4431), 207 (4384), 217 (4399), 210 (4309), 198 (4311), 319 (4255). Each panel lists the mean difference between the labels (bias), the scatter between the labels (rms), and the formal uncertainty returned by The Cannon (precision).

* There are weak trends; at low ${T}_{\mathrm{eff}}$ ~ 3700 K, we find temperatures about 100 K cooler than APOGEE and at low $\mathrm{log}\;g$ we find ~0.15 dex larger $\mathrm{log}\;g$ than APOGEE. At the lowest metallicities $[\mathrm{Fe}/{\rm{H}}]\;\lt \;-2.0$, we typically report higher metallicities on the order of 0.05 to 0.3 dex.
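
The bias and scatter quoted in each panel can be computed from two label tables in a couple of lines. Here the scatter is taken as the standard deviation of the differences, which is an assumption about how the rms is defined; the label values are invented for the example.

```python
import numpy as np

def compare_labels(cannon, reference):
    """Mean difference (bias) and scatter of the differences, per label."""
    diff = np.asarray(cannon, dtype=float) - np.asarray(reference, dtype=float)
    bias = diff.mean(axis=0)
    rms = diff.std(axis=0)     # scatter about the mean difference
    return bias, rms

# Columns: T_eff [K], log g, [Fe/H]; made-up values for three stars.
cannon = np.array([[4000.0, 1.5, -0.5], [4500.0, 2.5, 0.0], [5000.0, 3.0, 0.2]])
aspcap = np.array([[4010.0, 1.4, -0.5], [4490.0, 2.4, -0.1], [5030.0, 3.1, 0.2]])
bias, rms = compare_labels(cannon, aspcap)
```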

__Figure 8 from Ness et al.:__ Assessment of Biases:
<img src="figs/ness2015_fig8.jpg" width=700 height=700></img>

Figure 8. Difference between the labels (${T}_{\mathrm{eff}}$, $\mathrm{log}\;g$, and $[\mathrm{Fe}/{\rm{H}}]$) derived by The Cannon and their ASPCAP DR10 values for all the 1400 stars shown in Figure 7. The error bars are dominated by those quoted by ASPCAP. There are systematic offsets at the coolest temperatures.

### Comparison to isochrones

__Figure 9 from Ness et al.:__ Comparison to expectations from isochrones:
<img src="figs/ness2015_fig9.jpg" width=700 height=700></img>

Figure 9. Labels for the ~35,000 stars from DR10 derived by The Cannon based on ASPCAP-corrected labels for the set of reference objects. The set of panels on the left shows ${T}_{\mathrm{eff}}$–$\mathrm{log}\;g$ in four metallicity bins. There are ~19,000, 13,000, 1600, and 1000 stars in the most metal-rich to metal-poor metallicity bins, respectively. The isochrones plotted are 10 Gyr Padova isochrones at the metallicities marked in the upper left hand corners of each sub-panel. The panel on the right shows all stars colored in $[\mathrm{Fe}/{\rm{H}}]$ on the four isochrones. Note that the $\mathrm{log}\;g$ distribution at low $\mathrm{log}\;g$ is narrow and offset from the giant branch. Reference objects are shown as open circles.

### Scatter in the labels

* Rerun model with 20 bootstrap realizations of training set to quantify formal scatter in labels
* Scatter largest in regions outside training set
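
The bootstrap procedure can be sketched as: resample the reference set with replacement, retrain, relabel the same test spectrum, and take the standard deviation of the returned labels. As a simplifying assumption this uses a linear-in-labels model with unweighted fits and synthetic data; the helper names are hypothetical.

```python
import numpy as np

def train(fluxes, labels):
    """Per-pixel linear-in-labels fit; returns theta with shape (L, K + 1)."""
    design = np.column_stack([np.ones(len(labels)), labels])
    theta_T, *_ = np.linalg.lstsq(design, fluxes, rcond=None)
    return theta_T.T

def infer(flux, theta):
    """Solve for one star's labels given fixed coefficients."""
    labels, *_ = np.linalg.lstsq(theta[:, 1:], flux - theta[:, 0], rcond=None)
    return labels

def bootstrap_label_scatter(train_flux, train_labels, test_flux, n_boot=20, seed=0):
    """Standard deviation of inferred labels over bootstrap training sets."""
    rng = np.random.default_rng(seed)
    n_ref = len(train_labels)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_ref, n_ref)    # resample with replacement
        theta = train(train_flux[idx], train_labels[idx])
        draws.append(infer(test_flux, theta))
    return np.std(draws, axis=0)

# Synthetic reference set and one test spectrum.
rng = np.random.default_rng(4)
n_ref, n_pix, n_labels = 40, 100, 3
labels = rng.normal(size=(n_ref, n_labels))
theta_true = np.column_stack(
    [np.ones(n_pix), rng.normal(scale=0.1, size=(n_pix, n_labels))]
)
fluxes = np.column_stack([np.ones(n_ref), labels]) @ theta_true.T
fluxes += rng.normal(scale=0.005, size=fluxes.shape)

test_labels = np.array([0.5, -0.3, 0.2])
test_flux = theta_true @ np.concatenate(([1.0], test_labels))
test_flux += rng.normal(scale=0.005, size=n_pix)
scatter = bootstrap_label_scatter(fluxes, labels, test_flux)
```

This scatter only reflects sensitivity to the choice of training stars; it says nothing about systematic errors in the training labels themselves.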


__Figure 11 from Ness et al.:__ Formal Scatter in Labels:
<img src="figs/ness2015_fig11.jpg" width=700 height=700></img>

Figure 11. Standard deviation in the labels returned in ${T}_{\mathrm{eff}}$, $\mathrm{log}\;g$ and $[\mathrm{Fe}/{\rm{H}}]$, shown in the ${T}_{\mathrm{eff}}$−$\mathrm{log}\;g$ plane, normalized by the optimization error on each measurement, for 20 bootstrapping tests of the training set. The representative sample of ~670 stars shown here has been drawn from an equal sampling of a grid spaced by 100 K in ${T}_{\mathrm{eff}}$, 0.25 dex in $\mathrm{log}\;g$ and 0.25 dex in $[\mathrm{Fe}/{\rm{H}}]$ from the labels returned using the model trained on the isochrone-corrected reference objects. The location of the reference objects is shown in the gray shaded regions in the panel. Note the narrow region of reference objects also on the main sequence. The highest scatter in the labels is seen for regions where the labels are extrapolated. These figures are shown for the isochrone-corrected labels discussed in Section 2.4.



### How well does extrapolation work?

__Figure 12 from Ness et al.:__ Label Differences in Regions of Extrapolation:
<img src="figs/ness2015_fig12.jpg" width=700 height=700></img>


Figure 12. Difference in labels between The Cannon and ASPCAP indicating the regions of extrapolation where the difference in the labels extends beyond the estimated errors of the methods, due to the limited sampling of the reference objects which does not fully cover the label-space of the survey. The ASPCAP-corrected training labels were used to generate the model applied at the test step on the DR10 data.


