# Mathematical Framework

## Bayesian Formulation of Light Curve Inversion
The inversion of rotational light curves into surface maps can be formulated as a Bayesian linear regression problem. For each wavelength bin $\lambda$, we denote the observed light curve as
$$
D = \{t_m, f_m\}_{m=1}^N,
$$
where $f_m$ is the normalized flux measured at time $t_m$. We collect these measurements into a flux vector $\boldsymbol{f} \in \mathbb{R}^N$ and model it as a linear combination of spherical-harmonic basis light curves:
$$
\boldsymbol{f} = \mathbf{A}\,\boldsymbol{w} + \boldsymbol{\varepsilon},
$$
where $\mathbf{A} =\{\phi_d(t_m)\}_{d=1}^D \in \mathbb{R}^{N\times D}$ is the design matrix whose columns are the light curves of the spherical-harmonic basis functions at the given viewing angle, $\boldsymbol{w}$ is the vector of spherical-harmonic coefficients, and $\boldsymbol{\varepsilon}$ represents observational noise. The design matrices for different viewing angles are related by a rotation of the spherical-harmonic basis (Luger 2019). The framework therefore applies to any viewing angle, provided the correctly rotated design matrix $\mathbf{A}$ is used.

We assume the noise is independent and identically distributed Gaussian with a diagonal covariance matrix $\Sigma_n = \sigma^2 \mathbf{I}_N$:
$$
\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0},\, \sigma^2 \mathbf{I}_N).
$$

In the absence of prior information on the maps, the maximum-likelihood estimator (MLE) of this linear regression problem yields the surface map that most closely reproduces the observed light curve (Luger 2021).

### Null Space of the Light Curve Inversion Problem

It is well known that many spherical harmonics produce no imprint on a rotational light curve.  
A simple example is a dipole pattern that is antisymmetric about the equator when viewed equator-on: its contributions cancel upon integration over the visible hemisphere.  
We refer to such harmonics as belonging to the *null space* of the light-curve operator.

More generally, for rotational light curves, all harmonics with odd $l > 1$ and $m < 0$ are in the null space, independent of the viewing geometry.  
For an equator-on viewing geometry (inclination $\approx 90^\circ$), harmonics with even $l$ and odd $m>0$ also lie in the null space; this degeneracy is partially broken when the object is not viewed exactly equator-on.

If a harmonic $h$ lies in the null space, its associated basis light curve vanishes:
$$
\phi_h(t_m) = 0 \quad \forall m.
$$
Equivalently, the corresponding column of the design matrix $\mathbf{A}$ is identically zero, so $\mathbf{A}$ is not full rank.  
The system is therefore underdetermined, and the usual MLE solution
$$
\hat{\boldsymbol{w}} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \boldsymbol{f}
$$
is not well-defined.  
A standard remedy is to adopt *regularized least squares*, in which one minimizes, for a regularization strength $\lambda > 0$ (not to be confused with the wavelength bin),
$$
\|\boldsymbol{f} - \mathbf{A}\boldsymbol{w}\|^2 + \lambda \|\boldsymbol{w}\|^2.
$$
This is equivalent to replacing $\mathbf{A}^\top \mathbf{A}$ with $\mathbf{A}^\top \mathbf{A} + \lambda \mathbf{I}_D$, which is now non-singular, and yields the solution
$$
\hat{\boldsymbol{w}} = (\mathbf{A}^\top \mathbf{A} + \lambda \mathbf{I}_D)^{-1} \mathbf{A}^\top \boldsymbol{f}.
$$
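In a Bayesian interpretation, this is identical to the posterior mode under a zero-mean Gaussian prior on the weights with covariance $(\sigma^2/\lambda)\,\mathbf{I}_D$, i.e. penalizing large coefficients. The prior covariance reduces to $\lambda^{-1}\mathbf{I}_D$ only when the flux is expressed in units of the noise standard deviation ($\sigma = 1$).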
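
As a minimal numerical sketch of this point, the following NumPy snippet builds a toy design matrix with one identically zero column, confirms that $\mathbf{A}^\top\mathbf{A}$ is rank-deficient, and evaluates the regularized solution. The basis functions, noise level, seed, and variable names are illustrative assumptions and are not the spherical-harmonic basis used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix (hypothetical): N = 50 epochs, D = 4 basis light curves.
N, D = 50, 4
t = np.linspace(0.0, 1.0, N)
A = np.column_stack([
    np.ones(N),
    np.cos(2 * np.pi * t),
    np.sin(2 * np.pi * t),
    np.zeros(N),  # stand-in for a null-space harmonic: no imprint on the data
])

# Simulated light curve; the null-space coefficient is deliberately non-zero.
w_true = np.array([1.0, 0.3, -0.2, 0.5])
sigma = 0.01
f = A @ w_true + sigma * rng.normal(size=N)

# A^T A is singular because of the zero column.
rank = np.linalg.matrix_rank(A.T @ A)

# Regularized least squares: (A^T A + lam * I)^{-1} A^T f.
lam = 1e-3
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(D), A.T @ f)

print("rank of A^T A:", rank, "out of", D)
print("regularized solution:", w_ridge)
```

In this toy case the first three coefficients are recovered closely, while the fourth comes back as zero (to numerical precision) even though the simulated value was 0.5: the data carry no information about it, so the regularizer alone decides its value.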

As emphasized by Luger (2019), the rotational light-curve inversion problem is highly underdetermined.  
An MLE (or weakly regularized) solution typically overfits the data, while any regularizer of the form above sets the coefficients of all null-space harmonics to zero, because the data carry no information about them.  
However, one can always add any linear combination of null-space harmonics to $\hat{\boldsymbol{w}}$ without altering the light curve, implying a large family of surface maps that fit the data equally well.  
Both the inferred coefficients and their associated uncertainties therefore depend sensitively on the assumed regularization.
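
A short worked step makes this explicit. It uses only the definitions above; the symbols $\boldsymbol{n}$ and $c$ are introduced here for illustration. Let $\boldsymbol{n}$ be any coefficient vector supported on null-space harmonics, so that $\mathbf{A}\boldsymbol{n} = \boldsymbol{0}$. The regularized solution can be written as $\hat{\boldsymbol{w}} = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top + \lambda\mathbf{I}_N)^{-1}\boldsymbol{f}$, which lies in the row space of $\mathbf{A}$ and is therefore orthogonal to $\boldsymbol{n}$. For any scalar $c$,
$$
\mathbf{A}\,(\hat{\boldsymbol{w}} + c\,\boldsymbol{n})
= \mathbf{A}\hat{\boldsymbol{w}} + c\,\mathbf{A}\boldsymbol{n}
= \mathbf{A}\hat{\boldsymbol{w}},
\qquad
\|\hat{\boldsymbol{w}} + c\,\boldsymbol{n}\|^2
= \|\hat{\boldsymbol{w}}\|^2 + c^2\,\|\boldsymbol{n}\|^2 .
$$
The data misfit is blind to $c$, and only the penalty term selects $c = 0$. In Bayesian terms, the posterior along $\boldsymbol{n}$ is simply the prior along $\boldsymbol{n}$.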

This can be made explicit by Bayes' theorem:
$$
p(\boldsymbol{w} \mid D)
= \frac{p(D \mid \boldsymbol{w})\,p(\boldsymbol{w})}{p(D)},
$$
which shows that the posterior surface map depends on both the likelihood (data) and the prior (regularization).  

Our goal in this work is to identify distinct surface regions and their emergent spectra that are characteristic of the mechanisms driving the observed variability.  
Because of the strong degeneracies in the mapping problem, it is essential to propagate uncertainties in the spherical-harmonic coefficients before drawing physical conclusions.  
A Bayesian linear regression framework provides a natural extension of regularized least squares, replacing a single “best-fit” map with a full posterior distribution over maps.

### Prior on the Spherical Harmonic Coefficients

MacKay (1992) provides a general recipe for Bayesian linear regression with Gaussian priors.  
The recipe introduces two hyperparameters, $\alpha$ and $\beta$, where $\alpha$ is the prior precision of the weights and $\beta$ is the noise precision.  
The expressions summarized in Equations (A2)–(A9) above give the posterior mean and covariance for a given choice of $(\alpha,\beta)$, as well as fixed-point update rules that maximize the marginal likelihood (the *evidence*) with respect to these hyperparameters.

The resulting $(\alpha,\beta)$ can be interpreted as the regularization strengths that yield the most predictive model in an Occam’s-razor sense.  
In the context of light-curve inversion, a larger $\alpha$ corresponds to a stronger prior preference for small-amplitude maps (e.g. smoother surfaces or smaller variance for higher spherical-harmonic degrees $\ell$), while a larger $\beta$ corresponds to a stronger belief in the accuracy of the data.  
In practice, one may choose to fix $\beta$ using the measurement uncertainties from the data-reduction pipeline and optimize only $\alpha$, or vice versa; the framework remains the same with straightforward modifications.

### Optimizing the Hyperparameters

In the Bayesian linear model introduced above, two sets of hyperparameters control the relative weights of the prior and likelihood terms: 
the prior precisions $\boldsymbol{\alpha} = \{\alpha_i\}$, which generalize the single precision $\alpha$ by assigning a separate precision to each weight $w_i$, and the noise precision $\beta = \sigma^{-2}$, the inverse variance of the observational noise. 
These quantities determine the strength of regularization and hence the effective complexity of the inferred model.

#### Evidence maximization

Following the approach of MacKay (1992) and Tipping (2001), the optimal hyperparameters are obtained by maximizing the marginal likelihood (or *evidence*)
$$
p(\boldsymbol{f} \mid \mathbf{A}, \boldsymbol{\alpha}, \beta)
= \int p(\boldsymbol{f}\mid \mathbf{A},\boldsymbol{w},\beta)\,p(\boldsymbol{w}\mid \boldsymbol{\alpha})\,d\boldsymbol{w},
$$
which, for the Gaussian model, has the closed form
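$$
\log p(\boldsymbol{f}\mid \mathbf{A}, \boldsymbol{\alpha}, \beta)
= -\frac{1}{2}\!\left[
\beta \|\boldsymbol{f}-\mathbf{A}\boldsymbol{\mu}\|^2
+ \boldsymbol{\mu}^\top \mathrm{diag}(\boldsymbol{\alpha})\,\boldsymbol{\mu}
- \sum_i \log\alpha_i
- N\log\beta
- \log|\mathbf{\Sigma}|
+ N\log(2\pi)
\right],
$$
where $\mathbf{\Sigma} = \left(\mathrm{diag}(\boldsymbol{\alpha}) + \beta\,\mathbf{A}^\top\mathbf{A}\right)^{-1}$ and $\boldsymbol{\mu} = \beta\,\mathbf{\Sigma}\,\mathbf{A}^\top\boldsymbol{f}$ are the posterior covariance and mean of the weights. The covariance of the data, $\mathbf{C} = \beta^{-1}\mathbf{I}_N + \mathbf{A}\,\mathrm{diag}(\boldsymbol{\alpha})^{-1}\mathbf{A}^\top$, enters through the identities $\log|\mathbf{C}| = -N\log\beta - \sum_i \log\alpha_i - \log|\mathbf{\Sigma}|$ and $\boldsymbol{f}^\top\mathbf{C}^{-1}\boldsymbol{f} = \beta \|\boldsymbol{f}-\mathbf{A}\boldsymbol{\mu}\|^2 + \boldsymbol{\mu}^\top \mathrm{diag}(\boldsymbol{\alpha})\,\boldsymbol{\mu}$.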

Since the posterior mean and covariance themselves depend on $\alpha_i$ and $\beta$, the stationarity conditions cannot be solved in closed form.  
Instead, fixed-point updates are obtained by differentiation and rearrangement. In practice, iterating these updates usually converges within a modest number of steps to a (possibly local) maximum of the evidence.

Differentiating the log evidence with respect to $\alpha_i$ and setting the derivative to zero yields:
$$
\alpha_i^{\text{new}} = \frac{\gamma_i}{\mu_i^2},
$$
where $\mu_i$ is the $i$-th posterior mean weight and 
$$
\gamma_i = 1 - \alpha_i \Sigma_{ii}
$$
is the *degree of relevance* of the $i$-th basis vector, with $\Sigma_{ii}$ the $i$-th diagonal element of the posterior covariance matrix of the weights.
The quantity $\gamma_i \in [0,1]$ measures how well $w_i$ is determined by the data and can be read as the contribution of $w_i$ to the “effective number of parameters”:  
if $\alpha_i$ is large, the prior strongly suppresses $w_i$ and $\gamma_i \approx 0$;  
if $\alpha_i$ is small, the data determine $w_i$ well and $\gamma_i \approx 1$.

Differentiating the log evidence with respect to $\beta$ gives:
$$
\frac{1}{\beta^{\text{new}}} = (\sigma^2)^{\text{new}}
= \frac{\|\boldsymbol{f} - \mathbf{A}\boldsymbol{\mu}\|^2}
{N - \gamma},
$$
where $\gamma = \sum_i \gamma_i$ is the total effective number of parameters. This update sets the estimated noise variance to the residual sum of squares divided by the effective number of degrees of freedom, $N - \gamma$.
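
The two updates can be iterated as in the sketch below. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `ard_fixed_point`, the initial values, the clipping bounds `alpha_min` and `alpha_max`, and the convergence test are all choices made here. In particular, a coefficient whose update is indeterminate ($0/0$, as happens for a null-space column) is treated as pruned by setting its precision to `alpha_max`.

```python
import numpy as np


def ard_fixed_point(A, f, n_iter=200, alpha_min=1e-10, alpha_max=1e10, tol=1e-8):
    """Iterate the per-coefficient alpha_i and beta fixed-point updates."""
    N, D = A.shape
    AtA = A.T @ A
    Atf = A.T @ f
    alpha = np.ones(D)        # initial prior precisions (assumed)
    beta = 1.0 / np.var(f)    # initial noise precision (assumed)

    def posterior(alpha, beta):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * AtA)
        mu = beta * Sigma @ Atf
        return mu, Sigma

    for _ in range(n_iter):
        mu, Sigma = posterior(alpha, beta)
        # gamma_i = 1 - alpha_i * Sigma_ii, clipped against round-off.
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 0.0, 1.0)

        # alpha_i <- gamma_i / mu_i^2; indeterminate or infinite values
        # are mapped to alpha_max, i.e. the coefficient is pruned.
        with np.errstate(divide="ignore", invalid="ignore"):
            alpha_new = gamma / mu**2
        alpha_new = np.where(np.isfinite(alpha_new), alpha_new, alpha_max)
        alpha_new = np.clip(alpha_new, alpha_min, alpha_max)

        # 1 / beta <- ||f - A mu||^2 / (N - sum_i gamma_i).
        beta_new = (N - gamma.sum()) / np.sum((f - A @ mu) ** 2)

        converged = (
            np.allclose(np.log(alpha_new), np.log(alpha), atol=tol)
            and np.isclose(beta_new, beta, rtol=tol)
        )
        alpha, beta = alpha_new, beta_new
        if converged:
            break

    mu, Sigma = posterior(alpha, beta)
    return mu, Sigma, alpha, beta
```

Coefficients that the data do not support end up with large $\alpha_i$ and posterior means close to zero, while well-constrained coefficients keep small $\alpha_i$. The clipping keeps the matrix inversion well conditioned once some precisions start to diverge.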

### Posterior Distribution

More generally, suppose we adopt a Gaussian prior with mean $\boldsymbol{\mu}_0$ and covariance $\mathbf{\Sigma}_0$,
$$
\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{\mu}_0, \mathbf{\Sigma}_0),
$$
and assume Gaussian noise with covariance $\mathbf{\Sigma}_n$ for the data.  
Bayes' theorem gives the posterior
$$
p(\boldsymbol{w} \mid \boldsymbol{f}, \mathbf{A}) \propto 
p(\boldsymbol{f} \mid \boldsymbol{w}, \mathbf{A})\,p(\boldsymbol{w}),
$$
which is again Gaussian with
$$
\mathbf{\Sigma}_{\text{post}} = \left(\mathbf{A}^\top \mathbf{\Sigma}_n^{-1} \mathbf{A} + \mathbf{\Sigma}_0^{-1}\right)^{-1},
$$
$$
\boldsymbol{\mu}_{\text{post}} = 
\boldsymbol{\mu}_0 +
\mathbf{\Sigma}_{\text{post}}\,\mathbf{A}^\top \mathbf{\Sigma}_n^{-1}\!\left(\boldsymbol{f} - \mathbf{A}\boldsymbol{\mu}_0\right).
$$
In our context, the posterior mean $\boldsymbol{\mu}_{\text{post}}$ defines our best estimate of the surface map, while $\mathbf{\Sigma}_{\text{post}}$ quantifies the uncertainties of, and the covariances between, the different spherical-harmonic modes.
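
The two posterior expressions translate directly into a few lines of NumPy. The sketch below is an illustration only; the function name `gaussian_posterior` and its argument names are assumptions made here, and dense matrix inverses are used for readability rather than efficiency.

```python
import numpy as np


def gaussian_posterior(A, f, mu0, Sigma0, Sigma_n):
    """Posterior mean and covariance of w for f = A w + noise.

    Prior: w ~ N(mu0, Sigma0); noise ~ N(0, Sigma_n).
    """
    # Sigma_n^{-1} A, computed with a linear solve instead of an explicit inverse.
    Sn_inv_A = np.linalg.solve(Sigma_n, A)

    # Posterior precision: A^T Sigma_n^{-1} A + Sigma_0^{-1}.
    precision = A.T @ Sn_inv_A + np.linalg.inv(Sigma0)
    Sigma_post = np.linalg.inv(precision)

    # Posterior mean: mu0 + Sigma_post A^T Sigma_n^{-1} (f - A mu0).
    # Sn_inv_A.T equals A^T Sigma_n^{-1} because Sigma_n is symmetric.
    mu_post = mu0 + Sigma_post @ (Sn_inv_A.T @ (f - A @ mu0))
    return mu_post, Sigma_post
```

With `mu0 = 0`, `Sigma0 = I / alpha`, and `Sigma_n = I / beta`, this reduces to the isotropic case with a single prior precision $\alpha$ and noise precision $\beta$.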

### Evidence
The marginal likelihood (or evidence) is obtained by integrating over the weights:
$$
p(\boldsymbol{f} \mid \mathbf{A})
= \mathcal{N}\!\left(\boldsymbol{f} \mid \mathbf{A}\boldsymbol{\mu}_0,\,\mathbf{C}\right),
\qquad
\mathbf{C} = \mathbf{\Sigma}_n + \mathbf{A}\mathbf{\Sigma}_0 \mathbf{A}^\top.
$$
The corresponding log-evidence is
$$
\log p(\boldsymbol{f} \mid \mathbf{A})
= -\frac{1}{2}
\left[
(\boldsymbol{f} - \mathbf{A}\boldsymbol{\mu}_0)^\top \mathbf{C}^{-1}(\boldsymbol{f} - \mathbf{A}\boldsymbol{\mu}_0)
+ \log|\mathbf{C}| + N\log(2\pi)
\right].
$$
Although not obvious at first sight, this expression is equivalent to the log evidence given in the evidence-maximization subsection when $\boldsymbol{\mu}_0 = \boldsymbol{0}$, $\mathbf{\Sigma}_0 = \mathrm{diag}(\boldsymbol{\alpha})^{-1}$ (which reduces to $\alpha^{-1} \mathbf{I}_D$ for a single shared precision), and $\mathbf{\Sigma}_n = \beta^{-1} \mathbf{I}_N$. After solving for the optimal values of $\alpha$ and $\beta$, we can calculate the evidence for model comparison. In our application, we use it to identify the maximum spherical-harmonic degree $l_{\max}$ that should be included in the fit. This framework guards against overfitting features that are not robustly constrained by the light curves, while propagating uncertainties to the surface maps and spatial spectra.
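
A minimal sketch of the log-evidence calculation in its data-space form follows. The function name `log_evidence` and the use of a Cholesky factorization (chosen here for numerical stability when evaluating $\log|\mathbf{C}|$ and the quadratic form) are assumptions of this example rather than a description of the code used in this work.

```python
import numpy as np


def log_evidence(A, f, mu0, Sigma0, Sigma_n):
    """Log marginal likelihood log N(f | A mu0, C), C = Sigma_n + A Sigma0 A^T."""
    N = f.size
    C = Sigma_n + A @ Sigma0 @ A.T
    r = f - A @ mu0

    # Cholesky factor C = L L^T (requires C to be positive definite).
    L = np.linalg.cholesky(C)

    # Solve L z = r, so that z^T z = r^T C^{-1} r.
    z = np.linalg.solve(L, r)

    # log|C| = 2 * sum(log(diag(L))).
    log_det_C = 2.0 * np.sum(np.log(np.diag(L)))

    return -0.5 * (z @ z + log_det_C + N * np.log(2.0 * np.pi))
```

For model comparison, one would evaluate this quantity once per candidate $l_{\max}$, each time with the design matrix truncated to the corresponding columns and with the hyperparameters re-optimized, and compare the resulting values.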

### From Spherical-Harmonic Coefficients to Surface Maps
Having obtained the posterior distribution of the spherical-harmonic weights, we define a grid on the sphere onto which the surface maps are projected. Let $\mathbf{I} \in \mathbb{R}^{L\times D}$ denote the intensity mapping matrix (written without a dimension subscript, to distinguish it from the identity matrices $\mathbf{I}_N$ and $\mathbf{I}_D$), whose $d$-th column is the $d$-th spherical harmonic $Y^l_m$ evaluated at the grid points $\boldsymbol{r}_\ell = (\text{latitude}, \text{longitude})$, $\ell = 1, \dots, L$, where $L$ is the total number of grid points. Then,
$$
\boldsymbol{I} = \mathbf{I}\,\boldsymbol{w}
$$
is the surface intensity vector on the grid.

Recall the posterior distribution of $\boldsymbol{w}$:
$$
\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{\mu}_{\text{post}}, \mathbf{\Sigma}_{\text{post}}).
$$
The surface intensity $\boldsymbol{I}$ is a linear transformation of $\boldsymbol{w}$, and therefore follows a Gaussian distribution:
$$
\boldsymbol{I} \sim \mathcal{N}\!\left(\mathbf{I}\boldsymbol{\mu}_{\text{post}}, \,\mathbf{I}\,\mathbf{\Sigma}_{\text{post}}\,\mathbf{I}^\top\right).    
$$

We thus obtain a surface intensity map with fully propagated uncertainties for a particular wavelength bin $\lambda$. The wavelength dependence has been suppressed in the discussion above by omitting the subscript $\lambda$. Repeating the procedure for every wavelength bin yields a surface-spectra matrix $\mathbf{F} = \{\pi \boldsymbol{I}_\lambda\} \in \mathbb{R}^{L \times M}$, where $M$ is the number of wavelength bins and $\pi$ is the geometric factor that converts intensity to flux density.
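
The projection for a single wavelength bin can be sketched as follows. The function name `project_to_map` is hypothetical, and the intensity mapping matrix is called `Y` in the code purely to avoid confusion with an identity matrix. Only the per-pixel standard deviation is computed, so the full $L \times L$ map covariance is never formed.

```python
import numpy as np


def project_to_map(Y, mu_post, Sigma_post):
    """Posterior mean and per-pixel standard deviation of the surface map.

    Y          : (L, D) spherical harmonics evaluated on the grid
    mu_post    : (D,)   posterior mean of the coefficients
    Sigma_post : (D, D) posterior covariance of the coefficients
    """
    mean_map = Y @ mu_post

    # diag(Y Sigma_post Y^T), evaluated row by row without the L x L matrix.
    var_map = np.einsum("ld,de,le->l", Y, Sigma_post, Y)

    # Guard against tiny negative values caused by round-off.
    std_map = np.sqrt(np.maximum(var_map, 0.0))
    return mean_map, std_map
```

Stacking `np.pi * mean_map` over the wavelength bins, column by column, gives the mean of the surface-spectra matrix $\mathbf{F}$. Correlations between pixels are not captured by `std_map` and would require the off-diagonal terms of the map covariance or posterior samples of $\boldsymbol{w}$.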

### Summary
In the following, we summarize the basic elements of this framework: the choice of prior, the resulting posterior mean and covariance of the coefficients, and the fixed-point updates for the regularization hyperparameters that control the effective model complexity.
$$
\text{Model:}\quad
\boldsymbol{f} = \mathbf{A}\boldsymbol{w} + \boldsymbol{\varepsilon}, \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\boldsymbol{0}, \beta^{-1} \mathbf{I}_N\right)
$$
$$
\text{Prior:}\quad
p(\boldsymbol{w}\mid\alpha) = \mathcal{N}\!\left(\boldsymbol{w}\mid\boldsymbol{0}, \alpha^{-1} \mathbf{I}_D\right)
$$
$$
\text{Posterior:}\quad
p(\boldsymbol{w}\mid\boldsymbol{f},\mathbf{A},\alpha,\beta) = \mathcal{N}\!\left(\boldsymbol{w}\mid \boldsymbol{m}_N, \mathbf{S}_N\right)
$$
$$
\mathbf{S}_N = \left(\alpha \mathbf{I}_D + \beta \mathbf{A}^\top \mathbf{A}\right)^{-1}
$$
$$
\boldsymbol{m}_N = \beta \mathbf{S}_N \mathbf{A}^\top \boldsymbol{f}
$$
$$
\text{Effective number of parameters:}\quad
\gamma = \sum_{i=1}^{D}  \gamma_i = \sum_{i=1}^{D} \frac{\lambda_i}{\alpha + \lambda_i}
      \;=\; D - \alpha\,\mathrm{tr}\!\left(\mathbf{S}_N\right),
\quad \{\lambda_i\} = \text{eigenvalues of } \beta \mathbf{A}^\top \mathbf{A}
$$
$$
\text{Fixed-point updates:}\quad
\alpha_{\text{new}} = \frac{\gamma}{\boldsymbol{m}_N^\top \boldsymbol{m}_N}
$$
$$
\beta_{\text{new}}  = \frac{N - \gamma}{\left\|\boldsymbol{f} - \mathbf{A}\boldsymbol{m}_N\right\|^2}
$$
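
The summary equations can be turned into a short iteration, sketched below for the isotropic prior with a single $\alpha$. The function name `evidence_fixed_point`, the starting values, the iteration cap, and the stopping rule are assumptions made for illustration; this is not a description of the code used in this work.

```python
import numpy as np


def evidence_fixed_point(A, f, alpha=1.0, beta=1.0, n_iter=500, rtol=1e-8):
    """Iterate the alpha and beta fixed-point updates for an isotropic prior."""
    N, D = A.shape
    AtA = A.T @ A
    Atf = A.T @ f
    # Eigenvalues of A^T A; those of beta * A^T A follow by scaling.
    # Clip tiny negative values that round-off can produce for rank-deficient A.
    eig = np.clip(np.linalg.eigvalsh(AtA), 0.0, None)

    def posterior(alpha, beta):
        S_N = np.linalg.inv(alpha * np.eye(D) + beta * AtA)
        m_N = beta * S_N @ Atf
        return m_N, S_N

    def effective_parameters(alpha, beta):
        lam = beta * eig
        return np.sum(lam / (alpha + lam))

    for _ in range(n_iter):
        m_N, _ = posterior(alpha, beta)
        gamma = effective_parameters(alpha, beta)
        alpha_new = gamma / (m_N @ m_N)
        beta_new = (N - gamma) / np.sum((f - A @ m_N) ** 2)
        converged = (
            np.isclose(alpha_new, alpha, rtol=rtol)
            and np.isclose(beta_new, beta, rtol=rtol)
        )
        alpha, beta = alpha_new, beta_new
        if converged:
            break

    # Report posterior and gamma at the final hyperparameters.
    m_N, S_N = posterior(alpha, beta)
    gamma = effective_parameters(alpha, beta)
    return m_N, S_N, alpha, beta, gamma
```

If $\beta$ is fixed from the pipeline uncertainties instead, the `beta_new` line can simply be dropped and only $\alpha$ iterated. The returned `gamma` can be checked against $D - \alpha\,\mathrm{tr}(\mathbf{S}_N)$ as a consistency test.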