# Error propagation and least-squares fitting

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/planck_TT.png" width=500px />

## PHYS 2600: Scientific Computing

## Lecture 23

In [None]:
%matplotlib inline
# %pip install -q gvar lsqfit # uncomment this line if you're running in Colab
import numpy as np
import matplotlib.pyplot as plt
import gvar as gv
import lsqfit


## Handling error bars with `gvar`

Today, we'll cover two Python modules that are essential for working with real data: `gvar` for "Gaussian random variables", i.e. numbers with error bars, and `lsqfit` for least-squares fits.  These tools can save you time and headaches in the physics lab (whether in a class or in a real research lab!)

Suppose you measure the length `L` and height `h` of an inclined plane, to find the angle.  You measure in cm, with an uncertainty of 1 mm.  To input your measurements into `gvar` (short name `gv`), you write them like you would in a lab notebook:

In [None]:
L = gv.gvar('4.0(1)') # = 4.0 +- 0.1
h = gv.gvar('2.0(1)') # = 2.0 +- 0.1
print(L,h)

L = gv.gvar(4.0, 0.1) # = 4.0 +- 0.1
h = gv.gvar(2.0, 0.1) # = 2.0 +- 0.1
print(L,h)


`number(error)` is short-hand for "`number` has an uncertainty of `error`".  This means we think of the quantity as being represented by a normal distribution with mean `number` and standard deviation `error`.  I won't review this formalism, but this is what "Gaussian variable" means.  The central limit theorem justifies a lot of this treatment!

One of the uses of Gaussian variables is __error propagation__ when we take functions of them.  I'll remind you of the general (first-order) formula for two variables: if $z = f(x,y)$, then

$$
\sigma_z = \sqrt{\left( \frac{\partial f}{\partial x}\right)^2 \sigma_x^2 + \left( \frac{\partial f}{\partial y} \right)^2 \sigma_y^2 + 2 \left( \frac{\partial f}{\partial x} \right) \left( \frac{\partial f}{\partial y} \right) \sigma_x \sigma_y \rho_{xy}}
$$

where $\sigma_x$ and $\sigma_y$ are the error bars on $x$ and $y$ respectively, and $\rho_{xy}$ is the __correlation coefficient__ between $x$ and $y$ (zero if they are independent.)

One of the most powerful features of `gvar` is that _it handles error propagation automatically!_  For example,
we can get the angle of our ramp as a function of `L` and `h`:

In [None]:
theta = np.arctan2(L,h)
print(theta, theta * 180 / np.pi)
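
As a cross-check on the automatic propagation, the two-variable formula can be applied by hand to the same ramp angle.  This is a minimal sketch using only the standard `math` module; the variable names (`L_mean`, `dtheta_dL`, and so on) are illustrative and not part of `gvar`, and the inputs are assumed to be uncorrelated ($\rho = 0$).

```python
import math

# Measurements from above: central values and uncertainties (cm)
L_mean, L_err = 4.0, 0.1
h_mean, h_err = 2.0, 0.1

# Same convention as the gvar cell: theta = arctan(L / h)
theta_mean = math.atan2(L_mean, h_mean)

# Partial derivatives of arctan(L / h) with respect to L and h
denom = L_mean**2 + h_mean**2
dtheta_dL = h_mean / denom
dtheta_dh = -L_mean / denom

# Propagation formula with the correlation term set to zero
theta_err = math.sqrt((dtheta_dL * L_err) ** 2 + (dtheta_dh * h_err) ** 2)

print(f"theta = {theta_mean:.3f} +- {theta_err:.3f} rad")
```

The hand calculation gives about 1.107 with an uncertainty of about 0.022 rad, which is the value written as `1.107(22)` later in this section.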

Our original inputs `h` and `L` are uncorrelated by default.  But since `theta` was produced _from_ `L` and `h`, it does have non-zero correlation with both quantities.  We can see the correlation with `gv.corr`:

In [None]:
print(gv.corr(L,h))
print(gv.corr(theta,h))

A powerful feature of `gvar` is that it __automatically remembers and propagates all correlations__!  This is important to get things right, and the difference (vs. ignoring correlations) is often large.

For example, suppose we want to reconstruct the original $L = h \tan \theta = 4.0(1)$.  Look at what happens with a new `gvar` (uncorrelated) versus using `h` and `theta` with correlations:

In [None]:
print(h*np.tan(gv.gvar('1.107(22)')))
print(h*np.tan(theta))

Including the correlations gives us back the original, correct error value for `L`!  (Of course, correlations only change the _error_ propagation, so the mean value is the same either way.)
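
The same comparison can be reproduced from the general propagation formula, this time keeping the $\rho$ term.  The sketch below works to first order and does not use `gvar`; all names are illustrative.  The correlation between `theta` and `h` is derived here from the slope $\partial\theta/\partial h$, under the assumption that `L` and `h` themselves are independent.

```python
import math

L0, h0 = 4.0, 2.0          # central values
sig_L, sig_h = 0.1, 0.1    # independent uncertainties

# theta = arctan(L / h): first-order sensitivities to the inputs
denom = L0**2 + h0**2
dth_dL = h0 / denom
dth_dh = -L0 / denom
sig_th = math.hypot(dth_dL * sig_L, dth_dh * sig_h)

# theta depends on h, so cov(theta, h) = (dtheta/dh) * sig_h**2
rho_th_h = dth_dh * sig_h**2 / (sig_th * sig_h)

# Reconstruct L = h * tan(theta) and propagate the errors
theta0 = math.atan2(L0, h0)
df_dh = math.tan(theta0)
df_dth = h0 / math.cos(theta0) ** 2

var_uncorr = (df_dh * sig_h) ** 2 + (df_dth * sig_th) ** 2
var_corr = var_uncorr + 2 * df_dh * df_dth * sig_h * sig_th * rho_th_h

sig_uncorr = math.sqrt(var_uncorr)
sig_corr = math.sqrt(var_corr)

print(f"rho(theta, h) = {rho_th_h:.3f}")
print(f"ignoring correlation:  sigma_L = {sig_uncorr:.3f}")
print(f"including correlation: sigma_L = {sig_corr:.3f}")
```

The negative correlation between `theta` and `h` cancels most of the variance, which is why the correlated result returns to the original 0.1 while the uncorrelated one is three times larger.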

Sometimes measurements come with a known error, but often the error is inferred from statistics: we take a number of repeated measurements of some random process, and then compute the standard deviation $\sigma$ to get an error bar.  This can be done automatically by `gv.dataset.avg_data()`:

In [None]:
some_data = np.random.normal(1.3,0.4,size=100)
print(some_data[:10])
gv.dataset.avg_data(some_data)

Notice that the width of the `gvar` is _smaller_ than the width 0.4 I started with, by a factor of about $\sqrt{N} = 10$.  (`avg_data` computes the standard error of the mean, not the spread of the individual samples.)
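
The $\sqrt{N}$ scaling can be checked with plain NumPy.  This is a sketch, not a reproduction of what `avg_data` does internally: the seed, the sample size, and the `ddof=1` normalization are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2600)   # fixed seed so the run is repeatable
N = 100
samples = rng.normal(1.3, 0.4, size=N)

mean = samples.mean()
spread = samples.std(ddof=1)             # width of the sample distribution
std_err = spread / np.sqrt(N)            # uncertainty on the mean

print(f"mean = {mean:.3f}, spread = {spread:.3f}, standard error = {std_err:.3f}")
```

The spread stays near the input width no matter how many samples are taken, while the standard error keeps shrinking as `N` grows.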

A powerful feature of `gv.dataset.avg_data()` is that it can also compute __correlations__ from raw data, if you have synchronized measurements!
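
To see why the measurements have to be synchronized, here is a small NumPy sketch on synthetic data (the two quantities `x` and `y` are made up for illustration, and `np.corrcoef` stands in for whatever `avg_data` does internally).  Pairing each `x` with the `y` from the same trial recovers the correlation; scrambling the pairing loses it.

```python
import numpy as np

rng = np.random.default_rng(seed=23)
N = 1000

# Two quantities recorded together on each trial; y shares a fluctuation with x
x = rng.normal(0.0, 1.0, size=N)
y = x + rng.normal(0.0, 1.0, size=N)

# Correlation estimated from the paired (synchronized) samples
rho = np.corrcoef(x, y)[0, 1]

# Same numbers, but with the trial-by-trial pairing destroyed
rho_shuffled = np.corrcoef(x, rng.permutation(y))[0, 1]

print(f"paired:   rho = {rho:.2f}")
print(f"shuffled: rho = {rho_shuffled:.2f}")
```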

## Least-squares fitting

Least-squares fitting is a very common tool in physics for (1) determining what model describes some data, and (2) inferring model parameters (physical constants).  I'll assume you've seen this before, so I'll just remind you of some key concepts.

The starting point of least-squares fitting is defining the _chi-squared function_,

$$
\chi^2 = \sum_i \left(\frac{y_i - f(x_i, \mathbf{a})}{\sigma_i} \right)^2
$$

where $\mathbf{a}$ is a vector of model parameters.  The __minimum possible__ $\chi^2_{\star}$ is the "best-fit" $\chi^2$ statistic, with model parameters ${\mathbf{a}}_{\star}$ (the "best fit parameters.")  Standard numerical minimization methods work to find the best fit.
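
As a concrete instance, here is a minimal NumPy sketch of the $\chi^2$ function and its minimum for a straight-line model.  The data are synthetic, and the helper names `model` and `chi2` are illustrative.  Because this model is linear in its parameters, the minimum can be found directly with weighted linear least squares; a general non-linear model would need an iterative minimizer instead.

```python
import numpy as np

def model(x, a):
    """Straight line f(x, a) = a[0] + a[1] * x."""
    return a[0] + a[1] * x

def chi2(a, x, y, sigma):
    """Sum of squared residuals, each weighted by its error bar."""
    return np.sum(((y - model(x, a)) / sigma) ** 2)

# Synthetic data (illustrative values only)
rng = np.random.default_rng(seed=1)
a_true = np.array([1.5, 0.8])
x = np.linspace(0.0, 5.0, 20)
sigma = np.full_like(x, 0.3)
y = model(x, a_true) + rng.normal(0.0, sigma)

# Dividing each row by sigma turns chi2 into an ordinary least-squares problem
design = np.column_stack([np.ones_like(x), x]) / sigma[:, None]
a_best, *_ = np.linalg.lstsq(design, y / sigma, rcond=None)

print("best-fit parameters:", a_best)
print("chi2 at best fit:       ", chi2(a_best, x, y, sigma))
print("chi2 at true parameters:", chi2(a_true, x, y, sigma))
```

The best-fit $\chi^2$ is never larger than the value at the true parameters, since the fit is free to absorb part of the noise.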

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/anscombes_quartet.png" width=400px style="float:right;" />

Although $\chi^2$ is useful, because it's a _global summary statistic_, looking at $\chi^2$ alone can obscure local features in your data or model that you might not expect!

The most famous pathological example of this effect is [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), shown to the right (plot from the `seaborn` [Python module docs](https://seaborn.pydata.org/examples/anscombes_quartet.html).)  Assuming the error bars on all points are the same, a linear fit to all four data sets will yield _essentially the same fit parameters and $\chi^2$ statistic_ (they agree to about two decimal places)!



(If that's not convincing yet, we get exactly the same linear fit from [the Datasaurus shown below](https://dabblingwithdata.wordpress.com/2017/05/03/the-datasaurus-a-monstrous-anscombe-for-the-21st-century/).)
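
The Anscombe claim is easy to check numerically.  The sketch below fits a straight line to each of the four published data sets with `numpy.polyfit` and prints the residual sum of squares (RSS), which is proportional to $\chi^2$ when every point has the same error bar.  The dictionary layout and variable names are illustrative.

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973); sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)             # degree-1 polynomial fit
    rss = np.sum((y - (intercept + slope * x)) ** 2)   # residual sum of squares
    fits[name] = (slope, intercept, rss)
    print(f"{name:>3}: slope = {slope:.3f}, intercept = {intercept:.3f}, RSS = {rss:.2f}")
```

The four rows come out nearly indistinguishable, even though the scatter plots look nothing alike.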

<img src="https://raw.githubusercontent.com/wlough/CU-Phys2600-Fall2025/main/lectures/img/datasaurus-main.png" width=300px style="float:left;margin:10px;"/>

The moral: quick and easy statistical tests are no replacement for looking at your data!  __Plot your data and model against each other!__  Plotting _residuals_ (data minus model) can be a good way to make deviations stand out by eye.

To evaluate goodness of fit, we can compute the __p-value__ from the expected chi-squared distribution.  Most fitting packages will do this for you.  A good __rule of thumb__ is to compute $\chi^2 / N_{\rm dof}$, the "reduced chi-square", where $N_{\rm dof}$ is the number of data points minus the number of fit parameters.  A good model will yield $\chi^2 / N_{\rm dof} \lesssim 1$, no matter what $N_{\rm dof}$ is.
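
To make the two quantities concrete, here is a small sketch with made-up numbers (12 data points, 2 fit parameters, best-fit $\chi^2 = 12.3$).  The p-value is estimated by Monte Carlo, drawing from the chi-squared distribution with NumPy; this is a stand-in chosen to keep the example dependency-light, and in practice a closed-form survival function (for example from `scipy.stats`) would normally be used.  The function names are illustrative.

```python
import numpy as np

def reduced_chi2(chi2_value, n_points, n_params):
    """Return chi2 / N_dof with N_dof = n_points - n_params."""
    return chi2_value / (n_points - n_params)

def p_value_mc(chi2_value, dof, n_sim=200_000, seed=0):
    """Monte Carlo estimate of P(chi2 >= chi2_value) for `dof` degrees of freedom."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(dof, size=n_sim)
    return np.mean(draws >= chi2_value)

# Illustrative numbers only
chi2_star, n_points, n_params = 12.3, 12, 2
dof = n_points - n_params

red = reduced_chi2(chi2_star, n_points, n_params)
p = p_value_mc(chi2_star, dof)

print(f"chi2/dof = {red:.2f}")
print(f"p-value  ~ {p:.3f}")
```

The p-value is the probability that a correct model would produce a $\chi^2$ at least this large by chance, so a very small value signals a poor fit.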


There are multiple options available in Python for doing fits to data - minimizing $\chi^2$, reporting best-fit parameters, etc.  For example, `scipy` contains `scipy.optimize.curve_fit` for non-linear least squares, `numpy` has `numpy.polyfit`, and the scikit-learn module (focused on Machine Learning) also has a number of routines for model regression.

However, most of these tools are not specialized to the most common form of model fitting in physics: fitting _arbitrary_ (not necessarily linear!) functions to data _with errors_.  (In modern versions of SciPy, `curve_fit` can handle this problem.)
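
For comparison, here is a minimal sketch of such a fit with `scipy.optimize.curve_fit`, using its `sigma` and `absolute_sigma` arguments to supply the error bars.  The exponential model, the synthetic data, and the starting guess `p0` are all illustrative choices.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, amplitude, rate):
    """Exponential decay, a model that is non-linear in `rate`."""
    return amplitude * np.exp(-rate * t)

# Synthetic data with known error bars (illustrative values only)
rng = np.random.default_rng(seed=7)
t = np.linspace(0.0, 4.0, 25)
y_err = np.full_like(t, 0.05)
y = decay(t, 2.0, 0.7) + rng.normal(0.0, y_err)

# sigma supplies the error bars; absolute_sigma=True treats them as real
# uncertainties rather than as relative weights
popt, pcov = curve_fit(decay, t, y, p0=[1.0, 1.0], sigma=y_err, absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))   # one-sigma parameter uncertainties

print(f"amplitude = {popt[0]:.3f} +- {perr[0]:.3f}")
print(f"rate      = {popt[1]:.3f} +- {perr[1]:.3f}")
```

Note that the best-fit values and their uncertainties come back as separate arrays, so any further error propagation (including correlations from `pcov`) has to be done by hand.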

For this class, I've chosen to teach `lsqfit`: it is from the same author as `gvar`, so it automatically uses `gvar`s to deal with errors and error propagation.  It has the best "ergonomics" for physics problems, in my experience.  But as a warning: it is much less well known than the equivalent `scipy` tools!

## Tutorial 23

Let's start the tutorial!