# Histogramming and fitting

The aim of this exercise is to try something that is very commonly done: taking some data, making a histogram, and then fitting it. You can work from this notebook, since notebooks aren't just for a bunch of code but can also contain text, headers, etc.

## Initialization

There are some things that we are going to need for sure: numpy, matplotlib, scipy. Best put them all at the start of your notebook. Write the imports in the following cell:

In [None]:
import ...

Neat little trick: you can stow away your imports in a separate notebook and run it from this one. This is done with one of IPython's 'magic' commands (no joke); %run is a line magic, since it acts on a single line. Try it!

In [None]:
%run 'def.ipynb'

## Reading in data

There is a text file in this folder called 'data.txt'. Read it in with your favorite package. If this costs you more than one line, you're being inefficient.
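
As a hedged illustration of a one-line read, here is a minimal sketch using np.loadtxt. It assumes the file holds one number per line, which may not match the real 'data.txt'; the temporary file and the name tmp_path are made up so that the sketch runs on its own.

```python
import os
import tempfile

import numpy as np

# Write a small stand-in file so this sketch is self-contained.
# In the exercise you would point np.loadtxt at 'data.txt' instead.
rng = np.random.default_rng(0)
tmp_path = os.path.join(tempfile.mkdtemp(), "example_data.txt")
np.savetxt(tmp_path, rng.normal(0.5, 0.3, size=1000))

# The actual read is a single line.
data = np.loadtxt(tmp_path)
print(data.shape)  # (1000,)
```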

## Plot a histogram

Now use matplotlib to plot a histogram. Try the following arguments (for example): range=(-1, 2), bins=100, histtype='step', density=True (older matplotlib versions called this last one normed=True, which has since been removed). See what the arguments do and find your best-looking plot.

In [None]:
plt.hist(...)
...
plt.show()

## Guessing the values

Needless to say, the distribution looks Gaussian. A good starting point for a fit is often just to plot the fit function on top of the histogram. Define a fit function and try a few values until you find some good starting values.

In [None]:
def gauss(x, A, mu, sigma):
    return ...

In [None]:
plt.hist(...) # Here comes the histogram
plt.plot(...) # Here comes the fit plot
...
plt.show()

## Extracting the data

In Python, you don't fit the histogram directly; you first take the bin contents out of it and then fit those. Extract the data from the histogram with the following lines:

In [None]:
counts, bin_edges, _ = plt.hist(...)

If you don't want to *show* the histogram, use np.histogram instead. It has the same syntax, but doesn't take all the plot style arguments.

In [None]:
counts, bin_edges = np.histogram(...)

By default, the bin *edges* are extracted. For a fit, we want to have the bin *centers*. Note that the list of centers is one element shorter than the list of edges, so it has the same length as the counts. Extract the bin centers.

In [None]:
bin_centers = ....
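
One possible way to get the centers, sketched on synthetic data (the Gaussian sample below is a stand-in, not the exercise data): average each bin's left and right edge.

```python
import numpy as np

# Synthetic stand-in for the exercise data (an assumption of this sketch).
rng = np.random.default_rng(1)
data = rng.normal(0.5, 0.3, size=5000)

counts, bin_edges = np.histogram(data, bins=100, range=(-1, 2))

# Midpoint of each bin: average of its left and right edge.
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

print(len(bin_edges), len(bin_centers), len(counts))  # 101 100 100
```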

As a check, plot the counts versus the bin_centers and see that it resembles the histogram. If you want, you can pass drawstyle='steps' or drawstyle='steps-mid' for a more histogrammy look.

In [None]:
plt.plot(...)

## Fitting the extracted histogram

Next, fit your Gaussian function with curve_fit:

In [None]:
popt, pcov = curve_fit(... , ..., ..., p0=[..., ..., ...])

For documentation on curve_fit, execute the following cell:

In [None]:
curve_fit?

After executing the fit cell, you get the optimized parameter values out in popt:

In [None]:
print(popt)
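
To show how the pieces fit together, here is a minimal sketch of an unweighted fit on synthetic data. The Gaussian form of gauss and the way the starting values p0 are chosen are just one reasonable option, not the only solution to the exercise.

```python
import numpy as np
from scipy.optimize import curve_fit


def gauss(x, A, mu, sigma):
    # Unnormalized Gaussian with amplitude A.
    return A * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Synthetic stand-in for the exercise data (an assumption of this sketch).
rng = np.random.default_rng(2)
data = rng.normal(0.5, 0.3, size=10000)

counts, bin_edges = np.histogram(data, bins=60, range=(-1, 2))
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

# Rough starting values: peak height, peak position, a guess for the width.
p0 = [counts.max(), bin_centers[np.argmax(counts)], 0.5]
popt, pcov = curve_fit(gauss, bin_centers, counts, p0=p0)
print(popt)
```

Since the histogram is not normalized here, the fitted A is a number of counts and changes with the bin width, while mu and sigma should not.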

## The uncertainties

These are the optimized values. But wait! Did we put the errors in somewhere? No, so curve_fit just treats every bin as equally uncertain (an unweighted least-squares fit). In fact, we do know the bin-by-bin errors, since we're doing a counting experiment (yay particle physics!). The error on the number of counts is just the square root of the counts, in the limit of large N. This only holds for raw counts, so make sure the histogram you extracted is not normalized. Let's extract the error on the counts.

In [None]:
counts_err = np.sqrt(counts)

If you have zero counts somewhere, the errors will be zero there too and your fit will fail. Check if there are zeros in your error array.

In [None]:
...

There may or may not be zeros in there. Write a function to replace all zeros in an array with ones. Can you do a one-liner again?

In [None]:
def replace_zeros(arr):
    ...
    return arr

In [None]:
counts_err = replace_zeros(counts_err)
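
One possible one-liner uses np.where. Note that this sketch returns a new array instead of modifying arr in place, which is a design choice rather than a requirement of the exercise.

```python
import numpy as np


def replace_zeros(arr):
    # Where arr is zero put 1.0, elsewhere keep the original value.
    return np.where(arr == 0, 1.0, arr)


counts = np.array([0, 4, 9, 0, 16])
counts_err = replace_zeros(np.sqrt(counts))
print(counts_err)  # [1. 2. 3. 1. 4.]
```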

Now, refit and check the influence on the result.

In [None]:
popt0, pcov0 = curve_fit(... , ..., ..., p0=[..., ..., ...])
popt1, pcov1 = curve_fit(... , ..., ..., p0=[..., ..., ...], sigma=counts_err)

In [None]:
print('No   errs: ', popt0)
print('With errs: ', popt1)

Scipy also returns the covariance matrix of the fit parameters, pcov. If the parameters are not correlated (check this for any serious analysis, but assume it here), the square roots of the diagonal elements are the parameter errors:

In [None]:
perr = np.sqrt(np.diag(pcov1))
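
A minimal sketch of the weighted fit on synthetic data, for comparison. Passing absolute_sigma=True is a choice made in this sketch: it tells curve_fit to treat counts_err as real uncertainties instead of relative weights, so that pcov is not rescaled by the fit quality.

```python
import numpy as np
from scipy.optimize import curve_fit


def gauss(x, A, mu, sigma):
    return A * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Synthetic stand-in for the exercise data (an assumption of this sketch).
rng = np.random.default_rng(3)
data = rng.normal(0.5, 0.3, size=10000)

counts, bin_edges = np.histogram(data, bins=60, range=(-1, 2))
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

# Poisson errors on raw counts, with zeros replaced by ones.
counts_err = np.sqrt(counts)
counts_err[counts_err == 0] = 1.0

p0 = [counts.max(), bin_centers[np.argmax(counts)], 0.5]
popt, pcov = curve_fit(gauss, bin_centers, counts, p0=p0,
                       sigma=counts_err, absolute_sigma=True)

# Parameter errors, ignoring correlations between the parameters.
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip(["A", "mu", "sigma"], popt, perr):
    print(f"{name}: {value:.4f} +- {err:.4f}")
```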

Hey, time to print your result!

In [None]:
print('==== My awesome fit! =======')
print('A:     %.4f +- %.4f' % (popt1[0], perr[0]))
print('mu:    %.4f +- %.4f' % (popt1[1], perr[1]))
print('sigma: %.4f +- %.4f' % (popt1[2], perr[2]))

## Bins, fit range, and your result...

Well, wasn't that easy? But wait, *you* put in the number of bins and the histogram range. Doesn't that influence your result? Well, maybe. Let's try two methods. First, let's vary the fit range and number of bins and see what it does to the results. After that, we can try something a little more advanced: unbinned fits!

Write a function that does all you've just done: extract the histogram, bin it, get the data, fit it, return the parameters. Be sure to put comments in, because this will be a longer function.

In [None]:
def fit_it(arr, hist_range, hist_bins):
    # Make a histogram and extract the data
    ...
    
    # Get the bin centers
    ...
    
    # Get the errors
    ...
    
    # Replace with ones if zeros are found
    ...
    
    # Fit it
    ...
    
    # Optional: plot the fitted gaussian
    ...
    
    return popt

Next, vary the number of bins and see what happens.

In [None]:
try_bins = ... # List or array of the number of bins you'd like to try
popts = np.array([fit_it(...) for hist_bins in try_bins])
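
For orientation, a compact sketch of such a scan on synthetic data. The helper name fit_hist, the starting values and the list of bin numbers are all made up for this illustration; your own fit_it may well look different.

```python
import numpy as np
from scipy.optimize import curve_fit


def gauss(x, A, mu, sigma):
    return A * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def fit_hist(arr, hist_range, hist_bins):
    # Histogram the data and compute the bin centers.
    counts, bin_edges = np.histogram(arr, bins=hist_bins, range=hist_range)
    bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    # Poisson errors, with ones where the bin is empty.
    counts_err = np.where(counts == 0, 1.0, np.sqrt(counts))
    # Starting values taken from the data themselves.
    p0 = [counts.max(), np.mean(arr), np.std(arr)]
    popt, _ = curve_fit(gauss, bin_centers, counts, p0=p0, sigma=counts_err)
    return popt


# Synthetic stand-in for the exercise data (an assumption of this sketch).
rng = np.random.default_rng(4)
data = rng.normal(0.5, 0.3, size=10000)

try_bins = [20, 40, 80, 160]
popts = np.array([fit_hist(data, (-1, 2), n) for n in try_bins])
print(popts[:, 1:])  # mu and sigma for each binning
```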

Check what happens to the parameters.

In [None]:
plt.scatter(try_bins, popts[:, 0])
plt.ylabel('A')
plt.xlabel('Nbins')
plt.show()

plt.scatter(try_bins, popts[:, 1])
plt.ylabel('mu')
plt.xlabel('Nbins')
plt.show()

plt.scatter(try_bins, popts[:, 2])
plt.ylabel('sigma')
plt.xlabel('Nbins')
plt.show()

Now do the same thing for the range.

Based on what you have just seen, do you think the uncertainties you got before are OK?

In [None]:
print(perr)

## Unbinned fits

Unbinned fits are a little more tricky: you don't get the nice curve_fit routine, but on the plus side: no dependence on the binning, no stupid unphysical zero errors and you can impress everyone by saying 'unbinned fit'. Let's go!

Fitting is the process of minimizing the negative log likelihood. Wow. For this, we define the probability density function: in our case, just the Gaussian again, but normalized. Define it here.

In [None]:
def gauss_norm(x, mu, sigma):
    return ...

Here is the log likelihood of each individual data point. The total log likelihood is the sum over all points:

In [None]:
def loglikelihood(arr, mu, sigma):
    return np.log(gauss_norm(arr, mu, sigma))

And this is what we wish to minimize:

In [None]:
def neglog(mu, sigma):
    return - ...

Check the behavior: if we fix sigma to (approximately) the right value and vary mu, what does it do?

In [None]:
sigma_fix = ...
mu_guesses = ...
values = [... for mu in mu_guesses]

In [None]:
plt.scatter(mu_guesses, values)
...

Now use a minimizing routine (scipy.optimize.minimize, for example, or iminuit if you have it installed) to find the best-fit values.
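
A minimal sketch with scipy.optimize.minimize on synthetic data. Two things differ from the cells above and are assumptions of this sketch: neglog takes a single sequence params, because minimize passes all parameters in one array, and the Nelder-Mead method and starting point are just one workable choice.

```python
import numpy as np
from scipy.optimize import minimize


def gauss_norm(x, mu, sigma):
    # Gaussian probability density, normalized to unit area.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Synthetic stand-in for the exercise data (an assumption of this sketch).
rng = np.random.default_rng(5)
data = rng.normal(0.5, 0.3, size=5000)


def neglog(params):
    # Negative log likelihood: minus the sum of the per-point log densities.
    mu, sigma = params
    return -np.sum(np.log(gauss_norm(data, mu, sigma)))


result = minimize(neglog, x0=[0.3, 0.5], method="Nelder-Mead")
mu_fit, sigma_fit = result.x
print(mu_fit, sigma_fit)
```

For a pure Gaussian the maximum-likelihood answer is known analytically: mu_fit should match np.mean(data) and sigma_fit should match np.std(data), which makes for a handy cross-check.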

## Challenge yourself!

If you haven't had enough yet, try:
  * Limiting the data to see what happens at lower statistics.
  * Reading in the data with uniform background, and modifying the fit function to take it into account.