# Homework 1: Data analysis with Python

#### Deadline: Wednesday 24 September 2025 (by 7pm)

#### Credits: 25 points -> 20 credits

## Name: 

- When you finish, please submit your notebook via the link provided.

- This homework is individual. Please include your name in the notebook.

- Within a **single python notebook**, solve the following problems:

## Problem 1. (7 points) Maxwell–Boltzmann distribution

For a system containing a large number of identical non-interacting, non-relativistic classical particles in thermodynamic equilibrium, the fraction of the particles within an infinitesimal element of the three-dimensional velocity space $d^3v$, centered on a velocity vector of magnitude $v$, is given by

$$f(v)\:d^3v = \biggl[\frac{m}{2 \pi k_bT}\biggr]^\frac{3}{2} \, \exp\left(-\frac{mv^2}{2k_bT}\right)\:d^3v$$

where $m$ is the particle mass, $k_b$ is the Boltzmann constant, $T$ is the thermodynamic temperature, and $f(v)$ is the (Maxwell–Boltzmann) probability distribution function (PDF), normalized so that $\int f(v) \, d^3v$ over all velocities is unity.

Recognising the symmetry of $f(v)$, one can integrate over solid angle and write a probability distribution of speeds as the function:

$$f(v) = \biggl[\frac{m}{2 \pi k_bT}\biggr]^\frac{3}{2} \, 4\pi v^2 \exp\left(-\frac{mv^2}{2k_bT}\right)$$

This probability density function gives the probability, per unit speed, of finding the particle with a speed near $v$. This equation is simply the Maxwell–Boltzmann distribution.
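
As a minimal sketch of the speed distribution above, assuming NumPy and SI units throughout: the function name `maxwell_boltzmann_pdf` and the constant name `K_B` are illustrative choices, not something the assignment prescribes. The last lines check numerically that the PDF integrates to roughly one.

```python
import numpy as np

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def maxwell_boltzmann_pdf(mass, temperature, speed):
    """Return the Maxwell-Boltzmann speed PDF f(v) in s/m.

    mass in kg, temperature in K, speed in m/s (scalar or array).
    """
    speed = np.asarray(speed, dtype=float)
    prefactor = (mass / (2.0 * np.pi * K_B * temperature)) ** 1.5
    exponent = -mass * speed**2 / (2.0 * K_B * temperature)
    return prefactor * 4.0 * np.pi * speed**2 * np.exp(exponent)


# Sanity check: the PDF should integrate to ~1 over a wide enough speed range.
helium_mass = 4.003 * 1.66053906660e-27  # kg, m = m_u * A_r
v = np.linspace(0.0, 4000.0, 4001)  # m/s
pdf = maxwell_boltzmann_pdf(helium_mass, 40.0, v)
area = np.sum(pdf) * (v[1] - v[0])  # simple Riemann sum
print(f"Integral of f(v) dv = {area:.4f}")
```

A normalisation check like this is a cheap way to catch a wrong exponent or a missing factor of $4\pi v^2$ before plotting.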

### Coding tasks:

(a) Create a Python function that takes the particle mass, the thermodynamic temperature, and the speed $v$ as inputs, and returns the (Maxwell–Boltzmann) probability distribution function, $f(v)$.

(b) Define a Python dictionary with the symbols of 7 noble gases as "keys" and their respective atomic masses in SI units ($m=m_u\,A_r$) as values. Here, $m_u$ is the atomic mass constant and $A_r$ is the standard atomic weight. For each atom, we have:

- Helium ($\rm He$) has $A_r(\rm He)=4.003$

- Neon ($\rm Ne$) has $A_r(\rm Ne)=20.180$

- Argon ($\rm Ar$) has $A_r(\rm Ar)=39.948$

- Krypton ($\rm Kr$) has $A_r(\rm Kr)=83.798$

- Xenon ($\rm Xe$) has $A_r(\rm Xe)=131.293$

- Radon ($\rm Rn$) has $A_r(\rm Rn)=222.000$

- Oganesson ($\rm Og$) has $A_r(\rm Og)=294.000$
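
One possible way to build such a dictionary is sketched below. Only two of the seven gases are shown for brevity, and the names `M_U`, `atomic_weights`, and `masses_kg` are illustrative assumptions.

```python
M_U = 1.66053906660e-27  # atomic mass constant in kg (CODATA 2018)

# Standard atomic weights A_r; only two entries are shown here,
# the remaining gases from the list follow the same pattern.
atomic_weights = {"He": 4.003, "Ne": 20.180}

# Convert each A_r to a mass in kg: m = m_u * A_r
masses_kg = {symbol: M_U * a_r for symbol, a_r in atomic_weights.items()}

for symbol, mass in masses_kg.items():
    print(f"{symbol}: {mass:.4e} kg")
```

Keeping the dimensionless $A_r$ values and the conversion to kilograms separate makes it easy to check the inputs against the list above.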

(c) Generate a 1D velocity vector covering the range from $0$ to $4000\,\rm \frac{m}{s}$, and fix the temperature of the gas at $40\,\rm K$.

(d) Using the values of atomic masses from the dictionary in point (b), and the fixed velocity vector and temperature defined in point (c), call the function created in point (a) to obtain a set of arrays with the Maxwell–Boltzmann distributions, $f(v)$, for each noble gas. (**Hint:** a for loop can help access the dictionary elements).

(e) Use matplotlib to make a single high-quality, labeled plot of the velocity distributions of all 7 noble gases for the set temperature. The plot should have $f(v)$ on the Y axis and $v$ on the X axis.

(f) Repeat steps (c,d,e) for 3 more gas temperatures ($120\,\rm K$, $298\,\rm K$, and $550\,\rm K$), and report all the results in a single 4-panel high-quality labeled figure. 
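
The loop-and-plot structure of steps (c) to (f) could look roughly like the sketch below. It assumes matplotlib's `plt.subplots` for the 4-panel layout, uses a two-gas subset, and introduces a hypothetical helper `speed_pdf`; figure size, labels, and styling are placeholders to adapt.

```python
import numpy as np
import matplotlib.pyplot as plt

K_B = 1.380649e-23  # Boltzmann constant in J/K
M_U = 1.66053906660e-27  # atomic mass constant in kg


def speed_pdf(mass, temperature, speed):
    """Maxwell-Boltzmann speed PDF (same formula as in the problem statement)."""
    a = mass / (2.0 * K_B * temperature)
    return (a / np.pi) ** 1.5 * 4.0 * np.pi * speed**2 * np.exp(-a * speed**2)


masses = {"He": 4.003 * M_U, "Ne": 20.180 * M_U}  # subset, for brevity
temperatures = [40.0, 120.0, 298.0, 550.0]  # K
v = np.linspace(0.0, 4000.0, 2000)  # m/s

# One panel per temperature, one curve per gas.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
for ax, temp in zip(axes.flat, temperatures):
    for symbol, mass in masses.items():
        ax.plot(v, speed_pdf(mass, temp, v), label=symbol)
    ax.set_title(f"T = {temp:.0f} K")
    ax.set_xlabel(r"$v$ [m/s]")
    ax.set_ylabel(r"$f(v)$ [s/m]")
    ax.legend()
fig.tight_layout()
```

Sharing the X axis across panels makes the shift of the distributions with temperature easier to compare by eye.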

### Analysis questions:

(g) What happens with the velocity distributions when we increase the temperature of the gas? Why?

## Problem 2. (7 points) Exploring Statistics and Confidence Levels

Imagine that you are provided with two datasets (in the following data files), where experimenters took $N=10^4$ measurements of two different quantities:

Sample $1$: https://github.com/wbandabarragan/physics-teaching-data/blob/main/1D-data/sample1_hres.dat

Sample $2$: https://github.com/wbandabarragan/physics-teaching-data/blob/main/1D-data/sample2_hres.dat

The goal of this exercise is to determine how the data points in each sample are distributed, to visualise confidence intervals, and to quantify p-values after carrying out regressions.

### Coding tasks:

(a) Create python functions to load and visualise the datasets. Make histograms of each dataset and examine the shape of the distributions.

(b) Based on the histogram plots, propose, write down, and fit candidate statistical distributions. Report the results in plots containing the histograms and the best-fit functions.

(c) Calculate the Cumulative Distribution Function (CDF) and perform goodness-of-fit analyses using the Kolmogorov-Smirnov (KS) test. Determine and report the KS statistic and the resulting p-values.
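
A hedged sketch of a one-sample KS test with `scipy.stats.kstest` follows. The data are synthetic Gaussian numbers, not the assignment samples, and the Gaussian candidate is only an assumption for illustration; the real samples may call for a different distribution.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one of the samples (NOT the assignment data).
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Fit a candidate distribution; here a Gaussian is assumed.
mu, sigma = stats.norm.fit(data)

# One-sample KS test of the data against the fitted Gaussian CDF.
result = stats.kstest(data, "norm", args=(mu, sigma))
print(f"KS statistic = {result.statistic:.4f}, p-value = {result.pvalue:.3f}")
```

Because the distribution parameters are estimated from the same data being tested, the p-value tends to be somewhat optimistic, which is worth keeping in mind when interpreting it.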

(d) Calculate the mean of each dataset. Then, compute confidence intervals corresponding to $1\sigma$, $2\sigma$, and $3\sigma$. Report the $1\sigma$ errors (upper and lower) for each dataset explicitly.
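
One possible convention, sketched below on synthetic data, is to define the $n\sigma$ interval from percentiles that enclose the same central probability as a Gaussian $n\sigma$ interval (about 68.27%, 95.45%, and 99.73%). This also works for asymmetric distributions, where the upper and lower errors differ. The variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in data (NOT the assignment samples).
rng = np.random.default_rng(seed=2)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

mean = np.mean(data)

# Central probability content of the 1, 2, and 3 sigma Gaussian intervals.
coverage = {1: 0.682689, 2: 0.954500, 3: 0.997300}

intervals = {}
for n_sigma, prob in coverage.items():
    tail = 100.0 * (1.0 - prob) / 2.0  # percent left in each tail
    lower, upper = np.percentile(data, [tail, 100.0 - tail])
    intervals[n_sigma] = (lower, upper)

lower_1s, upper_1s = intervals[1]
err_minus = mean - lower_1s  # lower 1-sigma error
err_plus = upper_1s - mean   # upper 1-sigma error
print(f"mean = {mean:.3f} (+{err_plus:.3f} / -{err_minus:.3f})")
```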

(e) Plot the histograms overlaid with the best-fit probability density functions. Add shaded regions to illustrate the $1\sigma$, $2\sigma$, and $3\sigma$ confidence intervals.


### Analysis questions:

(f) What do the KS and p-values computed in (c) indicate for each data sample?

(g) Comment on the differences in the shapes of the confidence intervals between the two datasets and how this relates to uncertainties.

## Problem 3. (11 points) Data I/O and regression: $\rm CO_2$ emission

This problem consists of reading, displaying, analysing, and modelling $\rm CO_2$ concentration data from a monitoring station located in Kennaook/Cape Grim, Australia.

The data file for this exercise was downloaded from the CSIRO site:

https://www.csiro.au/en/research/natural-environment/atmosphere/Latest-greenhouse-gas-data

For this problem, you will use the following CSV file:

https://github.com/wbandabarragan/physics-teaching-data/blob/main/1D-data/CapeGrim_CO2_data_2025.csv


### Coding tasks:

#### Basic I/O:

(a) Inspect the structure of the file. Then, create a Python function that uses **pandas** to read in the **CapeGrim_CO2_data_2025.csv** file, skips the header, places the data into a pandas dataframe, and returns 3 columns as arrays, namely: "DATE", "CO2(ppm)", "SD(ppm)". The first one has the data collection times, the second one contains $\rm CO_2$ concentrations in $\rm ppm$, and the third one has the respective instrumental standard deviations, also in $\rm ppm$.
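
The reading step might be sketched as follows. The miniature in-memory file, its two header lines, and its numbers are hypothetical placeholders; the real header length and any extra columns have to be found by inspecting the actual CSV. The function name `read_co2` and its `n_header_lines` argument are also assumptions.

```python
import io
import pandas as pd

# Hypothetical miniature file: the real header length and extra columns
# must be checked by inspecting the actual CSV.
fake_csv = io.StringIO(
    "Some descriptive header line\n"
    "Another descriptive header line\n"
    "DATE,CO2(ppm),SD(ppm)\n"
    "2020.0417,410.10,0.12\n"
    "2020.1250,410.35,0.10\n"
    "2020.2083,410.52,0.11\n"
)


def read_co2(source, n_header_lines):
    """Read the CSV, skip the header, and return three NumPy arrays."""
    df = pd.read_csv(source, skiprows=n_header_lines)
    date = df["DATE"].to_numpy()
    co2 = df["CO2(ppm)"].to_numpy()
    sd = df["SD(ppm)"].to_numpy()
    return date, co2, sd


date, co2, sd = read_co2(fake_csv, n_header_lines=2)
```

If the file is read directly from GitHub rather than from a local copy, note that a `blob` page URL serves an HTML page; the corresponding raw-file URL is usually what `pd.read_csv` needs.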

(b) Now, you will rescale the uncertainties to more realistic values by multiplying the instrumental standard deviation by $10$, so $\sigma=10\,\sigma_{\rm instrumental}$. The instrumental uncertainties are very small, and there are unaccounted-for errors, such as systematic errors. A factor of $\sim 10$ is a fine estimate for this data set.

(c) Make a high-quality, labeled plot of the $\rm CO_2$ concentrations (in $\rm ppm$) versus time, including the y-error bars equal to the uncertainties calculated in (b).

(d) How monotonic and linear is the relation between the two variables?

#### Regressions and goodness-of-fit tests:

(e) Now, you will carry out a regression on the data, assuming a simple linear regression model (**"Model 1"**) with 2 free parameters. Carry out a $\chi^2$ minimisation, and report the resulting equation and figure. The figure should have a high-quality, labeled plot of the original data with error bars and the resulting best-fit line.
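
For a straight line, a $\chi^2$ minimisation can be sketched with `np.polyfit` and weights $w = 1/\sigma$, as shown below on synthetic data (not the Cape Grim measurements). This is one option among several; `scipy.optimize.curve_fit` with its `sigma` argument is a common alternative.

```python
import numpy as np

# Synthetic stand-in data (NOT the Cape Grim measurements).
rng = np.random.default_rng(seed=3)
x = np.linspace(0.0, 10.0, 50)
sigma = np.full_like(x, 0.5)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

# np.polyfit minimises sum((w * (y - model))**2), so w = 1/sigma
# turns the least-squares problem into a chi^2 minimisation.
slope, intercept = np.polyfit(x, y, deg=1, w=1.0 / sigma)

chi2 = np.sum(((y - (slope * x + intercept)) / sigma) ** 2)
print(f"y = {slope:.3f} x + {intercept:.3f}, chi2 = {chi2:.1f}")
```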

(f) To perform a goodness-of-fit test, create a python function that calculates the global $\chi^2$ statistic, the number of degrees of freedom, and the reduced $\chi^2$ statistic. 
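
A minimal sketch of such a function, using $\chi^2 = \sum_i \left[(y_i - y_{{\rm model},i})/\sigma_i\right]^2$ and $\nu = N - n_{\rm params}$ degrees of freedom, is given below. The function and argument names are illustrative, and the numbers in the worked example are made up.

```python
import numpy as np


def chi2_statistics(y_obs, y_model, sigma, n_params):
    """Return (chi2, degrees of freedom, reduced chi2)."""
    y_obs, y_model, sigma = map(np.asarray, (y_obs, y_model, sigma))
    chi2 = np.sum(((y_obs - y_model) / sigma) ** 2)
    dof = y_obs.size - n_params
    return chi2, dof, chi2 / dof


# Tiny worked example with made-up numbers.
chi2, dof, chi2_red = chi2_statistics(
    y_obs=[1.0, 2.0, 3.0, 4.0],
    y_model=[1.5, 2.0, 2.5, 4.0],
    sigma=[0.5, 0.5, 0.5, 0.5],
    n_params=2,
)
# chi2 = 2.0, dof = 2, reduced chi2 = 1.0
```

A reduced $\chi^2$ close to one suggests that the model and the assumed uncertainties are mutually consistent; values far above or below one point to a poor model or to misestimated errors.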

(g) Next, you can propose your own model (**"Model 2"**) for the data. Motivate and write down your model. Then, repeat steps (e) and (f) for your model and report the results, including the reduced $\chi^2$ statistic. **Hint:** Keep the model simple, with at most 2 free parameters, and avoid overfitting.

#### Confidence intervals:

(h) Finally, you will construct **2D confidence intervals** using the $\chi^2$ statistic evaluated in the vicinity of the best-fit values for the free parameters of Models 1 and 2. Since both models have two free parameters, you should create Python functions that compute and return the 2D $\chi^2$ surface of each model over the respective parameter space. **Hints:** to create a 2D parameter space as a meshgrid, you should first create vectors with numbers around the best-fit values found in your regressions. 1D plots showing $\chi^2$ as a function of each parameter may help identify suitable ranges.
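
For a linear model $y = a\,x + b$, the $\chi^2$ surface on a meshgrid could be computed as in the sketch below. The data are synthetic, and the function name `chi2_surface_linear` and the grid ranges are assumptions chosen for this example. The last line uses the standard $\Delta\chi^2$ thresholds for two jointly estimated parameters (2.30, 6.18, and 11.83 for about 68.27%, 95.45%, and 99.73%).

```python
import numpy as np


def chi2_surface_linear(x, y, sigma, slopes, intercepts):
    """Evaluate chi^2 of y = a*x + b on a grid of (a, b) values."""
    a_grid, b_grid = np.meshgrid(slopes, intercepts)
    # Broadcast: model has shape (n_intercepts, n_slopes, n_data).
    model = a_grid[..., None] * x + b_grid[..., None]
    chi2 = np.sum(((y - model) / sigma) ** 2, axis=-1)
    return a_grid, b_grid, chi2


# Synthetic stand-in data (NOT the Cape Grim measurements).
rng = np.random.default_rng(seed=4)
x = np.linspace(0.0, 10.0, 40)
sigma = np.full_like(x, 0.5)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma)

# Parameter vectors centred roughly on the expected best-fit values.
slopes = np.linspace(1.8, 2.2, 201)
intercepts = np.linspace(0.0, 2.0, 201)
a_grid, b_grid, chi2 = chi2_surface_linear(x, y, sigma, slopes, intercepts)

# Grid point with the smallest chi^2.
i_min = np.unravel_index(np.argmin(chi2), chi2.shape)
a_best, b_best = a_grid[i_min], b_grid[i_min]

# Delta chi^2 thresholds for two jointly estimated parameters.
levels = chi2.min() + np.array([2.30, 6.18, 11.83])
```

The arrays `a_grid`, `b_grid`, and `chi2` have matching 2D shapes, so they can be passed directly to matplotlib's `plot_surface` or `contour`, with `levels` setting the contour values.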

(i) Make a high-quality, labeled, multi-panel figure showing the 2D $\chi^2$ surface of each model: first in 3D, and then in projection with a marker showing the minimum and superimposed contours indicating at least three confidence levels. **Hint:** the maps should have one of the free parameters on the X axis, the other free parameter on the Y axis, and the $\chi^2$ surface on the Z axis.

### Analysis questions:

(j) Compare the $\chi^2$ statistic of models 1 and 2. Which model is better? Model 1 or model 2? Why?

(k) Briefly comment on the results in (i), highlighting the differences between the confidence intervals of both models. What do these confidence intervals tell us?