## Interlude: What is a model anyway? <a class="anchor" id="third-bullet"></a>

In order to compare different models, we need a way to quantify them objectively. One approach is to measure their "quality", by which we mean how well a model can "explain" a given dataset. A model, in its broadest sense, is a mathematical function that captures the statistics within a given dataset*. To make this statement more precise, we can consider the space of all possible candidate functions and define a (conditional) probability distribution over this space as

$$
p(\text{model}|\text{data})
$$

where $\text{data}$ is the dataset we are trying to model. Maximizing this probability, as a function of models, is equivalent to finding the best fit for the data. Unfortunately, this probability distribution is usually not directly accessible. Luckily, we can use [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) to rewrite this expression as

$$
p(\text{model}|\text{data}) = \frac{p(\text{data}|\text{model})p(\text{model})}{p(\text{data})}
$$
 
As is [often done](https://stats.stackexchange.com/questions/85465/theoretically-why-do-we-not-need-to-compute-a-marginal-distribution-constant-fo), let's ignore the denominator, which does not depend on the model, and focus on $p(\text{model})$ (the *prior*) and $p(\text{data}|\text{model})$ (the *likelihood*). The prior, in some sense, captures our knowledge about the data before we look at it. In deep learning tasks, this often manifests through the choice of model architecture. Let's say that we are trying to model images and we choose to use convolutional networks. Then $p(\text{model})$ is zero for every model that is not a convolutional neural network (or cannot be described by one). However, within convolutional neural networks, we still need to find the optimal weights and model parameters. If we have no prior knowledge about these parameters, we can assume that their values are all equally likely**. In this case, $p(\text{model})$ would be a uniform distribution over the space of all possible convolutional neural networks. This is all just to say that it's reasonable to treat $p(\text{model}|\text{data})$ as simply proportional to the likelihood

$$
p(\text{model}|\text{data}) \propto p(\text{data}|\text{model})
$$
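
To make the proportionality concrete, here is a minimal sketch on a toy problem that is unrelated to the street-name data. It assumes three hypothetical candidate "models" for a coin (each one just a fixed probability of heads) and a short made-up sequence of flips. All names and numbers are illustrative assumptions. With a uniform prior, the model with the highest posterior is the same one that has the highest likelihood.

```python
# Toy illustration: every name and number here is made up for this sketch.
candidate_models = {"fair": 0.5, "biased_heads": 0.8, "biased_tails": 0.2}
data = "HHTHHHTH"  # hypothetical observations (H = heads, T = tails)


def likelihood(p_heads, flips):
    """p(data | model) for independent coin flips."""
    result = 1.0
    for flip in flips:
        result *= p_heads if flip == "H" else 1.0 - p_heads
    return result


# Uniform prior: every candidate model is equally likely before seeing data.
prior = {name: 1.0 / len(candidate_models) for name in candidate_models}
likelihoods = {name: likelihood(p, data) for name, p in candidate_models.items()}

# Bayes' theorem: p(model | data) = p(data | model) * p(model) / p(data)
evidence = sum(likelihoods[name] * prior[name] for name in candidate_models)
posterior = {
    name: likelihoods[name] * prior[name] / evidence for name in candidate_models
}

best_by_likelihood = max(likelihoods, key=likelihoods.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_likelihood, best_by_posterior)
```

Because the prior and the denominator are the same constants for every candidate, they rescale all posteriors by the same factor and cannot change which model comes out on top.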

So we are able to maximize the left-hand side (which we don't have direct access to) by maximizing the likelihood. But note that the likelihood, i.e. "given a specific model, what is the probability of $\text{data}$", is something we already computed!
For example,

$$
p(\text{"Aargauerstrasse"}|\text{model}) = p(\$, \text{A}) \cdot p(\text{A}, \text{a}) \cdot p(\text{a}, \text{r}) \cdots p(\text{e}, \$)
$$

where the terms on the right-hand side are the entries of our lookup table!
If we compute this, we get the value shown in the cells below.

*Todo: note on overfitting <-> if we push the log loss lower and lower, we are overfitting, because we are maximizing the likelihood of the training data and not of the true global distribution.

**In practice this is not quite true, but let's just roll with the argument.

In [None]:
with open("../data/streets_zh.txt") as f:  # the context manager closes the file for us
    names = [street.rstrip() for street in f]  # .rstrip() strips trailing whitespace, including the newline '\n'


In [28]:
likelihood = 1
# Walk over consecutive character pairs (bigrams) of the first street name
for first, second in zip(names[0], names[0][1:]):
    print(f"p({first}, {second}) = ", freq_per_char[first][second])
    likelihood *= freq_per_char[first][second]
print("Likelihood for 'Aargauerstrasse': ", likelihood)

p($, A) =  0.055776892430278883
p(A, a) =  0.00847457627118644
p(a, r) =  0.04895104895104895
p(r, g) =  0.05328376703841388
p(g, a) =  0.15490375802016498
p(a, u) =  0.047639860139860137
p(u, e) =  0.05530973451327434
p(e, r) =  0.16194644696189495
p(r, s) =  0.0912845931433292
p(s, t) =  0.28368964688926257
p(t, r) =  0.5757261410788381
p(r, a) =  0.49111937216026436
p(a, s) =  0.5721153846153846
p(s, s) =  0.3264472736007687
p(s, e) =  0.3281287533029066
p(e, $) =  0.3303295571575695
Likelihood for 'Aargauerstrasse':  1.2080002980894415e-14
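
The cell above relies on the lookup table `freq_per_char` built earlier in the notebook. For readers who want to try this step in isolation, the following is a minimal sketch on a tiny made-up list of names. The helpers `build_bigram_probs` and `word_likelihood` are hypothetical names introduced here, and the sketch assumes that each row of the table is normalized over the characters that follow a given character.

```python
from collections import Counter, defaultdict


def build_bigram_probs(words):
    """Return table[first][second] = estimated probability that `second` follows `first`."""
    counts = defaultdict(Counter)
    for word in words:
        for first, second in zip(word, word[1:]):
            counts[first][second] += 1
    return {
        first: {second: n / sum(followers.values()) for second, n in followers.items()}
        for first, followers in counts.items()
    }


def word_likelihood(word, table):
    """Multiply the bigram probabilities along the word."""
    result = 1.0
    for first, second in zip(word, word[1:]):
        result *= table[first][second]
    return result


# Hypothetical toy data, wrapped in '$' start/end markers as in the text.
toy_names = ["$ab$", "$abb$", "$ba$"]
toy_table = build_bigram_probs(toy_names)
print(word_likelihood("$ab$", toy_table))
```

For this toy data the product works out to (2/3) · (2/3) · (1/2) = 2/9. Note that a bigram that never occurs in the data is missing from the table, so this sketch would raise a `KeyError` for such a word.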


Because we are multiplying many probabilities together, the total likelihood per word will be very small, which can cause numerical issues such as floating-point underflow. To handle this, let us look not at the likelihood itself but at its logarithm:

In [56]:
from math import log
log(likelihood)

-32.04722495564126
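
The numerical issue mentioned above can be sketched in a few lines. The per-bigram probability and the number of factors below are hypothetical values chosen only to trigger the effect. The running product leaves the range of double-precision floats, while the sum of logarithms remains an ordinary number.

```python
from math import log

p = 0.05  # hypothetical per-bigram probability
n = 300   # hypothetical number of bigrams being multiplied

product = 1.0
log_sum = 0.0
for _ in range(n):
    product *= p       # shrinks towards zero with every factor
    log_sum += log(p)  # grows linearly in magnitude, no underflow

print(product)  # the product has underflowed to 0.0
print(log_sum)  # still a perfectly usable number
```

Once the product has collapsed to zero, all information about the individual factors is lost, whereas the log-sum can still be compared across models.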

The logarithm is a monotonically increasing function, which means that when we maximize the log-likelihood we are also maximizing the likelihood. In the next section we are going to use optimization methods. For historical reasons, optimization methods are cast in the language of *minimizing* functions, so let's use the *negative log-likelihood* as our metric for model quality:

In [32]:
-log(likelihood)

32.04722495564126
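
Written out for a single word, the logarithm turns the product of table entries into a sum. The notation $c_0, c_1, \dots, c_n$ for the characters of the word (including the start and end markers) is introduced here only for illustration:

$$
-\log p(\text{word}|\text{model}) = -\log \prod_{i=0}^{n-1} p(c_i, c_{i+1}) = -\sum_{i=0}^{n-1} \log p(c_i, c_{i+1})
$$

Since every table entry lies between 0 and 1, each term $-\log p(c_i, c_{i+1})$ is non-negative. The negative log-likelihood is therefore zero only for a model that assigns probability 1 to every bigram in the word, and it grows as the model finds the word less likely.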

To simplify things further, let us look at the average negative log-likelihood not per word (as the lengths of the street names vary) but per bigram. For the full dataset, we then arrive at

In [None]:
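total = 0
n_pairs = sum(count_pairs.values())  # total number of bigrams (count_pairs is built earlier)
for street in names:
    for first, second in zip(street, street[1:]):
        total += -log(freq_per_char[first][second])  # negative log-probability of this bigram
print(total / n_pairs)  # average negative log-likelihood per bigram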

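
As a self-contained illustration of the same metric, here is a minimal sketch that uses a small hand-written lookup table instead of `freq_per_char`. The table, the toy names, and the helper `average_nll` are all hypothetical and only meant to show the mechanics.

```python
from math import log

# Hypothetical lookup table: toy_table[first][second] = p(second follows first).
toy_table = {
    "$": {"a": 0.5, "b": 0.5},
    "a": {"b": 0.5, "$": 0.5},
    "b": {"a": 0.25, "$": 0.75},
}
# Hypothetical toy data, wrapped in '$' start/end markers.
toy_names = ["$ab$", "$ba$", "$a$"]


def average_nll(words, table):
    """Average negative log-likelihood per bigram over all words."""
    total = 0.0
    n_pairs = 0
    for word in words:
        for first, second in zip(word, word[1:]):
            total += -log(table[first][second])
            n_pairs += 1  # count bigrams while iterating
    return total / n_pairs


print(average_nll(toy_names, toy_table))
```

Counting the bigrams inside the loop keeps the sketch independent of a separately maintained counter such as `count_pairs`. Both approaches give the same result as long as the counter covers exactly the bigrams being summed over.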