# Data & Error Analysis - *Accounting for Data Uncertainty*

#### ASU Online Physical Chemistry Laboratory




*Jeff Yarger, School of Molecular Sciences, Arizona State University.*

jyarger@proton.me


# Introduction

In science, the word “error” does not carry the usual connotations of “mistake” or “blunder”.  Error in a scientific measurement means the inevitable ***uncertainty*** involved in all measurements.  As such, errors are not mistakes; you cannot avoid them even when being very careful.  The best you can hope to do is minimize errors and to have a reliable estimate of error in any measurement or observation.

When discussing errors in your experimental data or observations, it is customary to distinguish between accuracy and precision, which are governed by systematic and random errors, respectively.  Systematic errors make results differ from the “true” values with reproducible discrepancies.  Therefore, the accuracy of an experiment generally depends on how well we can control or compensate for systematic errors.  Common examples include (i) failure to establish a proper instrument calibration, (ii) improper graduation or alignment of an instrument scale, (iii) drift in an instrumental property, (iv) incomplete material transfer or leakage, and (v) a faulty approximation or theory.  The precision of an experiment depends on how well we can overcome random errors.  These are the fluctuations in observations that yield results that differ from experiment to experiment, and they often require repeated measurements to yield precise results.  Accuracy versus precision is further discussed in introductory chemistry and analytical chemistry.  In experimental physical chemistry, the focus is typically on random errors, and we will assume systematic errors are minimal.

An important skill in a scientific laboratory is the ability to estimate error or uncertainties in a measurement.  Consider for example that you want to measure the volume of a liquid using a graduated cylinder, and find the volume to be 292.5 mL. This number is subject to uncertainty: It is unthinkable that you could know the volume to be exactly 292.5000 mL rather than 292.5001 mL. The difference between the two is 0.1 $\mu$L, which would make a tiny drop hardly visible by the human eye. It is clear that you cannot measure with such precision using a graduated cylinder, as shown in figure 1.  




|       |          |
|-      |-         |
|![Figure 1](https://github.com/CHM343/Images/blob/main/Graduated_Cylinder_Example_Image.jpg?raw=true)| **Figure 1** - Picture of a graduated cylinder <br>for measuring the volume of a liquid. <br> The graduation marks on the glass cylinder <br>are calibrated for the volume in milliliters. <br> The meniscus of the liquid is between <br> the ticks for 290 mL and 295 mL.  <br>Therefore, we can estimate the volume <br>to be  half way between these graduation marks <br> with an estimated error associated with our ability <br>to estimate the volume, 292.5 mL $\pm$ 2.5 mL.|


Let's suppose the graduated cylinder is graduated every five milliliters. You will have to estimate where the top of the liquid lies relative to the markings of the cylinder, and this necessity causes some uncertainty in the measurement. No physical quantity (a length, time, temperature, etc.) can be measured with complete certainty.

# Estimating Error

How can we evaluate the magnitude of uncertainty in a measurement?  Such evaluation can be fairly complicated, and we will just cover some basic principles.  However, this is a very important skill in physical chemistry and is often used because making multiple measurements is not a viable option in many experiments and scientific procedures.  Let's consider first the graduated cylinder shown in figure 1.  You could reasonably decide that the volume is 290 mL $\leq$ V $\leq$ 295 mL.  We will write this statement as:

\begin{equation}
V = 292.5 \pm 2.5 \: mL
\end{equation}
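The conversion from a bracketing interval to a value with an uncertainty can be sketched in a few lines of Python. This is only an illustration; the function name `reading_from_interval` is hypothetical and simply takes the two graduation marks that bracket the meniscus.

```python
def reading_from_interval(lower, upper):
    """Return (best estimate, uncertainty) for a reading that is only
    known to lie between two adjacent graduation marks."""
    best = (lower + upper) / 2     # midpoint of the interval
    delta = (upper - lower) / 2    # half the spacing between the marks
    return best, delta

# Graduated cylinder example: meniscus between the 290 mL and 295 mL marks
volume, delta_volume = reading_from_interval(290.0, 295.0)
print(f"V = ({volume:.1f} +/- {delta_volume:.1f}) mL")
```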

Note that the estimated error depends on the separation in the markings on the cylinder. This is a 'safe' estimation.  Often, you can estimate a quantity with more accuracy than half the separation between markings.  A traditional thermometer would be measured in a similar fashion.  However, it is more common in modern experimental chemistry to use thermocouples with digital readouts to determine the temperature.  


|       |
|-      |
|![Figure 2](https://github.com/CHM343/Images/blob/main/Digi_Sense_Thermocouple_Display.jpg?raw=true?)|
| **Figure 2** - Picture of a Digi-Sense digital display thermocouple <br> from Cole-Parmer.  The display shows the temperature in <br> degrees Fahrenheit ($^o$F). |



Consider now the thermocouple with a digital display shown in figure 2.  The display reads 199.8$^o$F.  How can we estimate the uncertainty?  The safest way to do this is by referring to the documentation provided with the instrument by the provider.  For example, the manual for this particular thermometer says “Accuracy is $\pm$0.06$^o$F from -40 to +99.99 $^o$F ($\pm$ 0.03 $^o$C from -40 to +99.99$^o$C); ±0.1$^o$F from 100 to 257 $^o$F ($\pm$ 0.1$^o$C from 100 to 125$^o$C); $\pm$0.2$^o$F from 257 to 302 $^o$F ($\pm$ 0.2$^o$C from 125 to 150$^o$C)." Thus, we would report this value as

\begin{equation}
T = 199.8 \pm 0.1 ^oF
\end{equation}
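When many readings have to be processed, the accuracy bands quoted from the manual can be stored in a small lookup. The sketch below is a minimal example, assuming the Fahrenheit bands exactly as quoted above; the names `ACCURACY_BANDS_F` and `accuracy_fahrenheit` are hypothetical.

```python
# Accuracy bands in degrees Fahrenheit, transcribed from the manual quote:
# (lower limit, upper limit, accuracy)
ACCURACY_BANDS_F = [
    (-40.0, 99.99, 0.06),
    (100.0, 257.0, 0.1),
    (257.0, 302.0, 0.2),
]

def accuracy_fahrenheit(reading):
    """Return the quoted accuracy for a displayed reading in deg F."""
    for low, high, accuracy in ACCURACY_BANDS_F:
        if low <= reading <= high:
            return accuracy   # first matching band wins
    raise ValueError(f"{reading} F is outside the specified ranges")

reading = 199.8
print(f"T = ({reading:.1f} +/- {accuracy_fahrenheit(reading):.1f}) F")
```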


If the manual is not available, the uncertainty is commonly estimated by observing the fluctuations in the temperature or by making several separate measurements of the temperature using the same thermocouple and digital display.  Estimating error from digital readouts and reporting numbers to a reasonable number of significant figures is a very important skill and keeps us from providing misleading results.  For example, if we report the temperature as 99.999999$^o$F, this implies that we have the precision and accuracy to determine temperature to within one one-millionth of one degree.  Given the specification of the above thermocouple, and of most common thermocouples, it is more appropriate to write the temperature as 99.9$^o$F, no matter how many digits are shown on the display (similar to a scientific calculator).  Lastly, what we are asking for here is that every experimental number be listed with its estimated or determined uncertainty.

In general, the result of any measurement of a quantity x should be stated as

\begin{equation}
x = x_{best} \pm \delta x
\end{equation}

where $x_{best}$ is the best estimate for the quantity concerned and $\delta x$ is called the uncertainty, or error, in the measurement of $x$. We'll discuss in a moment how to get the best estimate.



# Reporting Uncertainty

Since the quantity $\delta x$ is an estimate of an uncertainty, it should obviously not be stated with too much precision. In the example of the graduated cylinder above, it would be absurd to state the result as V = 292.5 $\pm$ 2.52356 mL.  Uncertainties are most often reported with one or two significant figures; the choice depends somewhat on the field and the personal taste of the researcher.  Once the uncertainty in the measurement has been estimated, one must also consider the number of significant figures in the measured value.  A statement like V = 292.5256712 $\pm$ 2.5 mL is obviously ridiculous. The uncertainty of 2.5 mL means that the digit 2 before the decimal point might really be as small as 0 or as large as 5. Clearly the digits 2, 5, 6, 7, … that follow the first decimal place have no significance at all and should be rounded off. The last significant figure in any stated answer should be in the same decimal position as the last significant figure of the uncertainty. In the last case, this means V = 292.5 $\pm$ 2.5 mL.

If the measured number is so small or large that it calls for scientific notation, then it is simpler and clearer to put the answer and uncertainty in the same form (see example below).

In summary:

1.   The value and the uncertainty should be reported to the same number of decimal places: (75.63 $\pm$ 0.06) kg, **not** (75.6347 $\pm$ 0.06) kg or (75.63 $\pm$ 0.0629) kg.
2.   The value and the uncertainty should have the same power of 10 if you are using scientific notation. One good format is (4.73 $\pm$ 0.08) x $10^{-5}$ J.  You make readers do extra work with (4.732 x $10^{-5}$ $\pm$ 8.5 x $10^{-7}$) J.
3.   There is no generally accepted convention for the number of significant figures to use in reporting the uncertainty. Some authors report one, some use two. Give your result to the same number of decimal places as you give its uncertainty. If your calculator gave 752.2083 torr $\pm$ 0.0143 torr, you should report (752.208 $\pm$ 0.014) torr.
4.   The units should be clear. In these examples we have used parentheses to make it clear that the unit applies to both the number and its error, but it's also acceptable to place the unit explicitly on both value and uncertainty: 752.208 torr $\pm$ 0.014 torr.
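The rounding rules in this list can be automated. The following is a minimal sketch, not a general-purpose formatter: the function name `format_with_uncertainty` and its `sig_figs` argument are hypothetical, and edge cases (for example an uncertainty such as 0.096 that rounds up to the next decade) are not handled.

```python
import math

def format_with_uncertainty(value, uncertainty, sig_figs=1):
    """Round the uncertainty to sig_figs significant figures and the value
    to the same decimal place, then return them as a single string."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the leading digit of the uncertainty
    exponent = math.floor(math.log10(uncertainty))
    decimals = sig_figs - 1 - exponent
    rounded_value = round(value, decimals)
    rounded_uncertainty = round(uncertainty, decimals)
    places = max(decimals, 0)   # number of digits printed after the decimal point
    return f"({rounded_value:.{places}f} +/- {rounded_uncertainty:.{places}f})"

print(format_with_uncertainty(752.2083, 0.0143, sig_figs=2), "torr")
print(format_with_uncertainty(75.6347, 0.0629, sig_figs=1), "kg")
```

Choosing `sig_figs=1` or `sig_figs=2` corresponds to the two conventions mentioned in item 3.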



# The Mean & Standard Deviation

Suppose we need to measure some quantity $x$, and have identified and minimized all sources of systematic error.  Since all remaining sources of uncertainty are random, we should be able to detect them by repeating the measurement several times.  We might, for example, measure a temperature five times, and find the following results:

|  |  |  |  |  |
|:---:| :---: | :---: | :---: | :---: |
| 53.4 $^o$C | 53.2 $^o$C | 54.0 $^o$C | 53.9 $^o$C | 53.7 $^o$C |

It seems reasonable that the best estimate of the temperature is the average or mean of the five values we found. In general:

\begin{equation}
x_{best} = \bar{x} = \frac{x_1 + x_2 + ... + x_N}{N}
\end{equation}

where $N$ is the number of measurements and the bar on top of the quantity $x$ means “average” or “mean”.

In the example above, $T_{best}$ = 53.64 $^o$C. We'll discuss in a moment how to determine the appropriate number of significant figures we should use to report this value. The standard deviation is an estimate of the average uncertainty of the measurements $x_1, x_2, ..., x_N$. Given that the mean, $\bar{x}$, is our best estimate of the quantity $x$, it is natural to consider the differences $d_i = x_i - \bar{x}$. If these differences are very small, then the measurements are all close together and very precise. If the differences are large, then the measurements are widely scattered and the result will not be very precise. For the example above:

|     i &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;   | 1 | 2 | 3 | 4 | 5  |
|:------------|---|---|---|---|---|
| $T_i$ | 53.4 $^o$C | 53.2 $^o$C | 54.0 $^o$C | 53.9 $^o$C | 53.7 $^o$C |
|  $d_i = T_i - \bar{T}$ | -0.24 $^o$C | -0.44 $^o$C | 0.36 $^o$C | 0.26 $^o$C | 0.06 $^o$C |


To estimate the uncertainty of the measurements we might naturally try averaging the deviations $d_i$. However, the average of the deviations is zero (try taking the average of the third row in the table above). Some values of $d_i$ are positive and some negative, so the average of the deviations is not a useful way to characterize the reliability of the measurements. The best way to avoid this is to square all the deviations, which creates a set of non-negative numbers. The square root of the average of the squares of the deviations is called the standard deviation, and is denoted $\sigma _x$.
<br>

\begin{equation}
\sigma _x = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i-\bar{x})^2}
\end{equation}
<br>
<br>

|     i &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;   | 1 | 2 | 3 | 4 | 5  | &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;   |
|:------------|---|---|---|---|---|---|
| $T_i$ | 53.4 $^o$C | 53.2 $^o$C | 54.0 $^o$C | 53.9 $^o$C | 53.7 $^o$C |   |
|  $d_i = T_i - \bar{T}$ | -0.24 $^o$C | -0.44 $^o$C | 0.36 $^o$C | 0.26$^o$C | 0.06 $^o$C | $\sum_{i=1}^{5} d_i = 0$   |
|  $d_i^2 = (T_i - \bar{T})^2$ | 0.0576 $^o$C$^2$ | 0.1936 $^o$C$^2$ | 0.1296 $^o$C$^2$ | 0.0676 $^o$C$^2$ | 0.0036 $^o$C$^2$ | $\sum_{i=1}^{5} d_i^2 = 0.452$    |

<br>
<br>

 \begin{equation}
\sigma _x = \sqrt{  \frac{1}{N} \sum_{i=1}^{N} (x_i-\bar{x})^2 }  =  \sqrt{ \frac{0.452}{5}} = 0.30066..
\end{equation}
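The table and the standard deviation can be reproduced step by step with plain Python. This sketch uses only the five temperatures listed above; the variable names are illustrative.

```python
import math

temps = [53.4, 53.2, 54.0, 53.9, 53.7]   # deg C, from the table above
n = len(temps)

mean = sum(temps) / n
deviations = [t - mean for t in temps]    # d_i = T_i - mean
squared = [d ** 2 for d in deviations]    # d_i^2
sum_sq = sum(squared)
sigma = math.sqrt(sum_sq / n)             # population standard deviation (divide by N)

for i, (t, d, d2) in enumerate(zip(temps, deviations, squared), start=1):
    print(f"{i}: T = {t:.1f}  d = {d:+.2f}  d^2 = {d2:.4f}")
print(f"sum of d_i   = {sum(deviations):.2f}")
print(f"sum of d_i^2 = {sum_sq:.3f}")
print(f"sigma        = {sigma:.5f}")
```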

<br><br>

If you choose to report errors with one significant figure, you would write: $\sigma _x = 0.3$ , so each measured temperature would be reported as:

|  |  |  |  |  |
|:---:| :---: | :---: | :---: | :---: |
| (53.4 $\pm$ 0.3) $^o$C | (53.2 $\pm$ 0.3) $^o$C | (54.0 $\pm$ 0.3) $^o$C | (53.9 $\pm$ 0.3) $^o$C | (53.7 $\pm$ 0.3) $^o$C |

NOTE that the standard deviation describes the uncertainty of each of the five individual measurements. However, the uncertainty of the mean is lower as we'll see in the next section. Before we continue, you should be aware of an alternative definition of the standard deviation:

<br>
\begin{equation}
\sigma _x = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_i-\bar{x})^2}
\end{equation}
<br>

The two definitions give a similar result when N is large. The distinction is important in statistics, but it is beyond the scope of this lesson.  FYI, these are often referred to as population (N) vs sample (N-1) standard deviation.
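To see how much the two definitions differ for a small data set, the Python standard library provides both: `statistics.pstdev` divides by N and `statistics.stdev` divides by N-1. The sketch below applies them to the five temperatures used in this section.

```python
import statistics

temps = [53.4, 53.2, 54.0, 53.9, 53.7]   # deg C

sigma_population = statistics.pstdev(temps)   # divides by N
sigma_sample = statistics.stdev(temps)        # divides by N - 1

print(f"population (N):   {sigma_population:.4f}")
print(f"sample     (N-1): {sigma_sample:.4f}")
```

With only five points the two values differ by roughly ten percent; the gap shrinks as N grows.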


For many cases where multiple measurements are made, you will want to find the standard deviation of the mean value. If $x_1, x_2, ..., x_N$ are the results of $N$ measurements of the same quantity $x$, the best estimate for the quantity $x$ is the mean. The standard deviation characterizes the average uncertainty of the separate measurements $x_1, x_2, ..., x_N$. However, $\bar{x}$ is more reliable than any individual measurement, so it is expected that its uncertainty is smaller than the corresponding uncertainty for the individual measurements. The uncertainty in $\bar{x}$ turns out to be:

<br>
\begin{equation}
\sigma _{\bar{x}} = \frac{\sigma _x}{ \sqrt{N}}
\end{equation}
<br>

Going back to our previous example:

|     i &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;   | 1 | 2 | 3 | 4 | 5  |
|:------------|---|---|---|---|---|
| $T_i$ | 53.4 $^o$C | 53.2 $^o$C | 54.0 $^o$C | 53.9 $^o$C | 53.7 $^o$C |

<br>
<br>
\begin{equation}
\bar{x} = \frac{1}{N}  \sum_{i=1}^{N} x_i = 53.64
\end{equation}
<br>

\begin{equation}
\sigma _x = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i-\bar{x})^2} = \sqrt{ \frac{0.452}{5}} = 0.30066..
\end{equation}

\begin{equation}
\sigma _{\bar{x}} = \frac{\sigma _x}{ \sqrt{N}} = \frac{0.30066}{ \sqrt{5}} = 0.13446..
\end{equation}
<br>
<br>

Thus, you would report the mean value as T = (53.6 $\pm$ 0.1)$^o$C or T = (53.64 $\pm$ 0.13)$^o$C depending on whether you choose to report the error with one or two significant figures.
In summary, the standard deviation ($\sigma _x$) represents the uncertainty of each individual measurement, while the standard deviation of the mean ($\sigma _x$/$\sqrt{N}$) is a measure of the uncertainty of the mean value.

#### **TECHNICAL NOTES (Colab)**
In Google Colab, the square root over an equation with a complex summation (e.g. the standard deviation equation) is displayed with a strange stack of three vertical boxes that SHOULD NOT be there, while the same LaTeX equation displays properly in most other LaTeX or markdown viewers.  Hence, an alternative is to use a free LaTeX equation viewer and display the downloaded image instead of the incorrectly displayed equation in Colab.  For example, below is the standard deviation equation (exact same code as in this markdown cell) displayed at an [online latex editor](https://latexeditor.lagrida.com/)

<br>
<center><img src="https://github.com/CHM343/Images/blob/main/Std_Dev_Equation_lagrida_latex_editor.png?raw=true" width="200" alt="Standard deviation equation" /></center>



## Calculating Mean and Standard Deviation in Python

Using the example values above, the following cell shows how the **numpy** python library can be used to easily determine the mean and standard deviation.

In [1]:
# Load the numpy library
import numpy as np

# Make a list of the Temperature values
Temps = [53.4, 53.2, 54.0, 53.9, 53.7]

# determine mean value with numpy
mean_Temps_numpy = np.mean(Temps)

# determine standard deviation (population - ddof=0) with numpy.  Note if you want sample standard deviation (n-1), just set ddof=1.
std_p_Temps_numpy = np.std(Temps, ddof=0)

# determine mean value uncertainty
mean_uncert_Temps_numpy = np.std(Temps, ddof=0)/np.sqrt(len(Temps))

# Print mean and standard deviation values determined from numpy
print(f"Mean")
print(f"------")
print(mean_Temps_numpy)

print(f" ")

print(f"Standard Deviation")
print(f"------------------")
print(std_p_Temps_numpy)

print(f" ")
print(f" ")

print(f"Report Mean value +/- uncertainty in mean")
print(f"-----------------------------------------")
print(f"({mean_Temps_numpy:.2f} +/- {mean_uncert_Temps_numpy:.2f}) \u00b0C")



Mean
------
53.64
 
Standard Deviation
------------------
0.3006659275674574
 
 
Report Mean value +/- uncertainty in mean
-----------------------------------------
(53.64 +/- 0.13) °C


Using the example values above, the following cell shows how the **pandas** python library can be used to easily determine the mean and standard deviation.

In [2]:
# Load pandas and numpy
import numpy as np
import pandas as pd

# Make a list of the Temperature values
Temps = [53.4, 53.2, 54.0, 53.9, 53.7]

# Make a pandas dataframe of the list of temperatures, and give it a column heading of - Temp (C)
df_Temps = pd.DataFrame(Temps, columns=['Temp (C)'])

# determine mean value with pandas
mean_Temps_pandas = df_Temps.mean()

# determine standard deviation (population - ddof=0) with pandas.  Note that pandas defaults to the sample standard deviation (ddof=1), so ddof=0 must be set explicitly here.
std_p_Temps_pandas = df_Temps.std(ddof=0)

# determine mean value uncertainty
mean_uncert_Temps_pandas = df_Temps.std(ddof=0)/np.sqrt(len(Temps))

# Print mean and standard deviation values determined from pandas
print(f"Mean")
print(f"------")
print(mean_Temps_pandas)

print(f" ")

print(f"Standard Deviation")
print(f"------------------")
print(std_p_Temps_pandas)



Mean
------
Temp (C)    53.64
dtype: float64
 
Standard Deviation
------------------
Temp (C)    0.300666
dtype: float64


# Rejection of Data

Every now and then, you will come across an experimental result that simply does not seem to make sense. For example, let's say you perform a sixth measurement of temperature and obtain

|     i | 1 | 2 | 3 | 4 | 5  |  6 |
|:------------|---|---|---|---|---|---|
| $T_i$ | 53.4 $^o$C | 53.2 $^o$C | 54.0 $^o$C | 53.9 $^o$C | 53.7 $^o$C | 54.9 $^o$C    |


The last measurement seems a bit off, and you may be tempted to throw it out of the set on aesthetic grounds alone. However, you must never throw out a result from a data set unless you have a statistical or chemical reason to do so. First, you should use your common sense: did something unusual happen during that measurement? Sometimes one realizes that a particular determination wasn't performed properly, and it would make sense to "flag it" in your notebook and eventually discard it if it is off. Statisticians have devised many rejection tests for the detection of non-random errors. We will describe only one, the $Q$ test, which works well in cases where $3 < N < 10$. The quantity $Q$ is defined by

\begin{equation}
Q = \frac{| \text{suspect value} - \text{value closest to it} |}{\text{highest value} - \text{lowest value}}
\end{equation}

In our example above,

\begin{equation}
Q = \frac{54.9 - 54.0}{54.9 - 53.2} = 0.53
\end{equation}

We now compare this value of $Q$ with the critical value $Q_c$ corresponding to the number of observations and the chosen confidence level, for example the 90% confidence level values in the Dixon Q test table ([wiki link - Table](https://en.wikipedia.org/wiki/Dixon%27s_Q_test)).  If $Q$ is equal to or larger than $Q_c$, the suspect measurement can be rejected. In our case, $N = 6$ and $Q_c = 0.56$, so $Q = 0.53 < Q_c$. Thus, the suspect measurement **cannot** be rejected.
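The $Q$ test is easy to script. The sketch below is a minimal example: the function name `q_test` is hypothetical, and the 90% critical values in `Q_CRIT_90` are the commonly tabulated ones for $N$ = 3 to 10, entered here by hand, so they should be checked against the table linked above before being relied on.

```python
# Commonly tabulated 90% critical values for Dixon's Q test (N = 3..10).
# Entered by hand for illustration; verify against a published table.
Q_CRIT_90 = {3: 0.941, 4: 0.765, 5: 0.642, 6: 0.560,
             7: 0.507, 8: 0.468, 9: 0.437, 10: 0.412}

def q_test(values, suspect):
    """Return (Q, Q_c, reject) for one suspect value in a small data set."""
    data_range = max(values) - min(values)
    others = list(values)
    others.remove(suspect)                       # all values except the suspect one
    gap = min(abs(suspect - v) for v in others)  # distance to the nearest neighbour
    q = gap / data_range
    q_crit = Q_CRIT_90[len(values)]
    return q, q_crit, q >= q_crit

temps = [53.4, 53.2, 54.0, 53.9, 53.7, 54.9]
q, q_crit, reject = q_test(temps, suspect=54.9)
print(f"Q = {q:.2f}, Q_c = {q_crit:.2f}, reject = {reject}")
```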





# Propagation of Error

Most physical quantities cannot be measured in a single direct measurement. Instead, they are calculated from quantities measured in different steps. For example, to measure the density of a substance one would measure its volume, its mass, and then calculate the density ($\rho$) from the definition:

<br>
\begin{equation}
\rho = \frac{mass}{Volume} =  \frac{m}{V}
\end{equation}
<br>


To calculate the uncertainty in the density, one must first determine the uncertainties in the measured quantities, $m$ and $V$, and then **propagate** them through the calculation.

Suppose that $x, ..., z$ are measured with uncertainties $\delta x, ..., \delta z$, and the measured values are used to compute a function $q(x, ..., z)$. If the uncertainties in $x, ..., z$ are independent and random, then the uncertainty in $q$ is

<br>
\begin{equation}
\delta q = \sqrt{ \left( \frac{\partial q}{\partial x} \delta x \right)^2 + ... + \left( \frac{\partial q}{\partial z} \delta z \right)^2 }
\end{equation}
<br>

In this equation, the term $\partial q / \partial x$ represents the partial derivative of $q$ with respect to $x$. If you need help calculating partial derivatives, please see Appendix II or a general calculus book. You will need to be able to calculate partial derivatives in order to understand the remainder of this section.
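When the partial derivatives are tedious to work out by hand, they can be approximated numerically. The sketch below is one possible approach, assuming independent, random uncertainties as stated above; the helper name `propagate` and the central-difference step size are arbitrary choices, and the product $q = xy$ with the values shown is a made-up test case.

```python
import math

def propagate(func, values, uncertainties, rel_step=1e-6):
    """Estimate the uncertainty of func(*values) by combining, in quadrature,
    numerically estimated partial derivatives times each uncertainty."""
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, uncertainties)):
        h = abs(v) * rel_step or rel_step      # small step; fall back if v == 0
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        partial = (func(*up) - func(*down)) / (2 * h)   # central difference
        total += (partial * dv) ** 2
    return math.sqrt(total)

# Hypothetical example: q = x * y with x = 2.0 +/- 0.1 and y = 3.0 +/- 0.2
dq = propagate(lambda x, y: x * y, [2.0, 3.0], [0.1, 0.2])
print(f"delta q = {dq:.3f}")
```

For this product the analytic result is $\sqrt{(y \, \delta x)^2 + (x \, \delta y)^2} = 0.5$, which the numerical estimate should match closely.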

For example, let's assume your goal is to calculate the density of a substance. You determine its mass $m \pm \delta m$ and its volume $V \pm \delta V$. You would calculate the density as $\rho = m/V$. But what about the uncertainty in the density?

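Is $\delta \rho = \delta m / \delta V$? It is not: uncertainties do not combine in the same way as the quantities themselves.

Using the equation for error propagation with $q = \rho$ we have: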

<br>
\begin{equation}
 \frac{\partial \rho}{\partial m} = \frac{1}{V}
\end{equation}
<br>
\begin{equation}
 \frac{\partial \rho}{\partial V} = -\frac{m}{V^2}
\end{equation}
<br>
<br>
so,
\begin{equation}
\delta \rho = \sqrt{ \left( \frac{1}{V} \delta m \right)^2  + \left( - \frac{m}{V^2} \delta V  \right)^2 }
\end{equation}
<br>


For instance, if m = (32.52 $\pm$ 0.15) g and V = (35.17 $\pm$ 0.52) mL

<br>
\begin{equation}
\rho =  \frac{m}{V} = \frac{32.52 g}{35.17 mL} = 0.92465 \frac{g}{mL}
\end{equation}
<br>

and

<br>
\begin{equation}
\delta \rho = \sqrt{ \left( \frac{1}{35.17 \: mL} 0.15 \: g \right)^2  + \left( - \frac{32.52 \: g}{(35.17)^2 \: mL^2} 0.52 \: mL  \right)^2 } = 0.014321 \frac{g}{mL}
\end{equation}
<br>

If we choose to report our error with two significant figures, we would report the calculated density as:

<br>
\begin{equation}
\rho =  (0.925 \pm 0.014) \frac{g}{mL}
\end{equation}
<br>

Note that we used the same number of decimal places for the value and error.
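The density example can be checked with a few lines of Python that evaluate the two partial derivatives directly. This is a sketch using the values of $m$ and $V$ given above; the function name `density_with_uncertainty` is hypothetical.

```python
import math

def density_with_uncertainty(m, dm, V, dV):
    """Return (rho, delta_rho) for rho = m / V with independent uncertainties."""
    rho = m / V
    d_rho_dm = 1 / V           # partial derivative of rho with respect to m
    d_rho_dV = -m / V ** 2     # partial derivative of rho with respect to V
    delta_rho = math.sqrt((d_rho_dm * dm) ** 2 + (d_rho_dV * dV) ** 2)
    return rho, delta_rho

rho, delta_rho = density_with_uncertainty(m=32.52, dm=0.15, V=35.17, dV=0.52)
print(f"rho = ({rho:.3f} +/- {delta_rho:.3f}) g/mL")
```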


# Least-Squares Fitting

## (A) Data fitting to a line

Often, an experiment involves measuring several values of two different physical variables in order to investigate the mathematical relationship between the two. For instance, one might measure the volume of a mole of gas at 1 atm as a function of temperature (in $^o$C) and see if the volumes and temperatures are connected by the expected ideal gas equation of state (EoS):

\begin{equation}
PV = nRT
\end{equation}


This relationship is linear, that is, if we plot $V$ as a function of $T$ we expect a straight line. Let's consider a general case where two physical variables $y$ and $x$ are connected by a linear relation of the form

\begin{equation}
y = mx + b
\end{equation}

where $m$ and $b$ are constants. A plot of $y$ against $x$ should give a straight line with slope $m$ and y-intercept $b$.

Going back to the example above, imagine we make a series of measurements of volume and temperature. Since we know that the two variables are linearly related, we can ask ourselves which is the straight line $V = mT + b$ that best fits the measurements. The analytical method of finding the best straight line to fit a series of experimental points is called linear regression, or least-squares fit.

In this method, the sum of the squares of the residuals is minimized. Residuals are defined as the difference between the predicted and observed values. This concept is better understood graphically comparing the three graphs shown below. All three plots have the same data points, but different straight lines that were drawn according to the equations shown in each plot. Which of the three straight lines would you consider a better fit to the experimental data?


In the least-squares method, we first define the residuals as the difference between the experimental value and the y-value predicted by the straight line. Residuals are shown as dotted lines in the plots in Figure 3. Note that residuals can be positive or negative, depending on whether the experimental point is above or below the straight line. If we sum the residuals, we might get a small number just because the positive values cancel the negative values, so a small sum of residuals is not necessarily a good indicator of a good fit. To work around this problem, we calculate the sum of the squares of the residuals: we square all residuals first (to ensure that all values are positive), and then sum these values. A small sum of squares indicates a good fit, while a large value indicates a poor fit. In the least-squares method, the best straight line is calculated by minimizing the sum of the squares of the residuals. The plot on the left shows the result of the least-squares analysis for this data set, while the other two plots show straight lines that give a larger sum of squares (and thus represent poorer fits). One could imagine changing the slope and intercept by trial and error until the lowest sum of squares is obtained, but in fact there is an analytical solution to this problem. That is, there are formulas that give the best intercept and slope for a given set of experimental data.
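The analytical solution for an unweighted straight-line fit can be sketched in plain Python. The formulas used are the standard closed-form expressions for the slope and intercept; the data set below is invented for illustration and is not the data shown in the figures, and the function names are hypothetical.

```python
def linear_least_squares(x, y):
    """Closed-form least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    delta = n * sxx - sx ** 2
    slope = (n * sxy - sx * sy) / delta
    intercept = (sxx * sy - sx * sxy) / delta
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared differences between observed and predicted y-values."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Invented volume (L) versus temperature (deg C) data, for illustration only
temperature = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
volume = [22.5, 24.0, 25.6, 27.4, 28.9, 30.7]

m, b = linear_least_squares(temperature, volume)
best = sum_squared_residuals(temperature, volume, m, b)
steeper = sum_squared_residuals(temperature, volume, 1.1 * m, b)
shifted = sum_squared_residuals(temperature, volume, m, b + 0.5)

print(f"best fit: m = {m:.4f} L/C, b = {b:.2f} L, sum of squares = {best:.4f}")
print(f"10% steeper slope:         sum of squares = {steeper:.4f}")
print(f"intercept shifted by 0.5:  sum of squares = {shifted:.4f}")
```

Any other choice of slope or intercept gives a larger sum of squares than the least-squares line, which is what the comparison at the end illustrates.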

The two graphs in Fig. 4 show two different data sets that give the same linear fit. That is, the y-intercept and slope obtained by performing linear regression are the same in both cases. However, there is a big difference between the two data sets: if we change the slope or intercept by a small amount, the sum of the squares changes more dramatically in the example on the right. The two graphs on the top row represent the best fit, which is the same for both sets of data. If we change the slope and intercept (bottom plots), the fit gets much worse for the data set on the right than it does for the data set on the left. In other words, both the intercept and slope determined by the least-squares procedure have more uncertainty for the data set on the left, and are determined more precisely for the data set on the right. This is not surprising given that the points on the right are less scattered than the points on the left.


Thus, it is important to calculate the uncertainties in both the slope and intercept, and report them together with their values. The least-squares fit of the data set on the left gives a value of the slope m = (0.082 $\pm$ 0.029) L/$^o$C and a value of the intercept b = (22.4 $\pm$ 1.6) L. The corresponding values for the data set on the right are: m = (0.0819 $\pm$ 0.0081) L/$^o$C and b = (22.41 $\pm$ 0.43) L. Note that since the error is smaller for the data set on the right, we report the slope and intercept with more significant figures.
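The uncertainties in the slope and intercept can be computed from the scatter of the points about the fitted line. The sketch below uses the standard textbook expressions for an unweighted fit in which all $y$ values are assumed to have the same uncertainty; it is not necessarily the procedure used to obtain the values quoted above, and both data sets are invented so that they share a trend but differ in scatter.

```python
import math

def linear_fit_with_uncertainties(x, y):
    """Return slope, intercept and their standard uncertainties for an
    unweighted least-squares fit of y = m*x + b."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    delta = n * sxx - sx ** 2
    slope = (n * sxy - sx * sy) / delta
    intercept = (sxx * sy - sx * sxy) / delta
    # Scatter of the points about the line (N - 2 degrees of freedom)
    ss_res = sum((yi - slope * xi - intercept) ** 2 for xi, yi in zip(x, y))
    sigma_y = math.sqrt(ss_res / (n - 2))
    sigma_slope = sigma_y * math.sqrt(n / delta)
    sigma_intercept = sigma_y * math.sqrt(sxx / delta)
    return slope, intercept, sigma_slope, sigma_intercept

# Invented data: same underlying line, small versus large scatter
x = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
pattern = [1, -1, 1, -1, 1, -1]
tight = [22.4 + 0.082 * xi + 0.1 * p for xi, p in zip(x, pattern)]
scattered = [22.4 + 0.082 * xi + 0.5 * p for xi, p in zip(x, pattern)]

for label, y in (("tight", tight), ("scattered", scattered)):
    m, b, sm, sb = linear_fit_with_uncertainties(x, y)
    print(f"{label:9s}: m = {m:.4f} +/- {sm:.4f}, b = {b:.2f} +/- {sb:.2f}")
```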


## (B) Data fitting to non-linear curves

We can extend the previous discussion to cases where the relationship between the variables is not linear. For example, consider an experiment where the concentration of a reactant is followed as a function of time. Let's assume that we know that for that particular reaction the concentration is expected to decay exponentially as:


\begin{equation}
C = C_o e^{-kt}
\end{equation}

Here, $C_o$ is the initial concentration of reactant, $t$ is the time, and the parameter $k$ is known as the rate constant. Since we know the initial concentration of reactant we used in the experiment, we can plot $C/C_o$ as a function of time and obtain the value of $k$ through a least-squares analysis.

Figure 6 represents an example of experimental data obtained in such an experiment. The line in red represents the equation $C/C_o = e^{-kt}$ with $k = 0.404 \: min^{-1}$. But how do we know this is the best fit? Would the fit be better or worse for $k = 0.420 \: min^{-1}$? To address this question, we can calculate the sum of the squares of the residuals for both cases and select the value of $k$ that gives the lowest sum. The residuals are again defined as the vertical distance between the experimental point and the red line, and we take the squares to ensure that we deal with positive values. This procedure is shown on graph 6B, which shows that the lowest value is obtained for $k = 0.404 \: min^{-1}$. Thus, the red curve in figure 6A is the best fit for this set of experimental data.
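The scan over $k$ described above can be sketched as a simple grid search. The data here are invented (an exponential decay with $k = 0.4 \: min^{-1}$ plus small made-up offsets), not the data plotted in Figure 6, and the grid limits and spacing are arbitrary choices.

```python
import math

# Invented C/C0 data: exp(-0.4 t) plus small made-up offsets
times = [0, 1, 2, 3, 4, 5, 6, 7, 8]   # min
offsets = [0.0, 0.010, -0.008, 0.005, -0.004, 0.003, -0.002, 0.001, 0.0]
ratios = [math.exp(-0.4 * t) + e for t, e in zip(times, offsets)]

def sum_squared_residuals(k):
    """Sum of squared residuals between the data and exp(-k t)."""
    return sum((r - math.exp(-k * t)) ** 2 for t, r in zip(times, ratios))

# Try k from 0.300 to 0.500 min^-1 in steps of 0.001 and keep the best one
candidates = [0.300 + 0.001 * i for i in range(201)]
best_k = min(candidates, key=sum_squared_residuals)

print(f"best k = {best_k:.3f} min^-1")
print(f"sum of squares at best k:   {sum_squared_residuals(best_k):.6f}")
print(f"sum of squares at k = 0.42: {sum_squared_residuals(0.42):.6f}")
```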


Finding the best value of k was relatively easy because one can easily calculate the sum of the squares of the residuals for many different values of $k$ and identify the value for which this sum is lowest. This would not work for more complicated cases that involve more than one parameter. For instance, imagine that you study a more complex reaction for which $C/C_o$ decays as the sum of two exponentials:

\begin{equation}
\frac{C}{C_o} = \alpha e^{-k_1t} + (1- \alpha) e^{-k_2t}
\end{equation}

In this case, we need to determine the best values of $\alpha$, $k_1$ and $k_2$, so performing a manual minimization of the sum of the squares of the residuals becomes a very complicated task.

Nonlinear regression is a general technique to fit data to any equation that defines $Y$ as a function of $X$ and one or more parameters. It finds the values of those parameters that generate the curve that comes closest to the data (that is, the parameters that minimize the sum of the squares of the vertical distances between the data points and curve). In contrast to the linear regression case, it is not possible to directly derive an equation to compute the best-fit values from the data. Instead nonlinear regression requires a computationally intensive, iterative approach. Luckily, there are many software programs that do this type of analysis. Choose a program that gives the statistical uncertainties as well, so you can report the best parameters and their corresponding errors.
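One widely used option in Python is `scipy.optimize.curve_fit`, which returns both the best-fit parameters and their covariance matrix. The sketch below fits the two-exponential model to synthetic data; the "true" parameters, noise level, time grid and initial guesses `p0` are all assumptions made for illustration, and a real data set may need different starting values to converge.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexponential(t, alpha, k1, k2):
    """C/C0 as the sum of two exponential decays."""
    return alpha * np.exp(-k1 * t) + (1 - alpha) * np.exp(-k2 * t)

# Synthetic data generated from assumed parameters plus a little noise
rng = np.random.default_rng(0)
t = np.linspace(0.0, 30.0, 61)              # min
true_params = (0.6, 1.0, 0.1)               # alpha, k1, k2 (assumed)
y = biexponential(t, *true_params) + rng.normal(0.0, 0.002, t.size)

# Iterative fit starting from rough initial guesses
popt, pcov = curve_fit(biexponential, t, y, p0=(0.5, 0.8, 0.2))
perr = np.sqrt(np.diag(pcov))               # one-sigma parameter uncertainties

for name, value, err in zip(("alpha", "k1", "k2"), popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

The square roots of the diagonal elements of the covariance matrix are the statistical uncertainties that should be reported alongside the best-fit parameters.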












# Updates - Coming Soon

The Least-Squares Fitting section is not complete.  Updates will be coming soon.


