# A review of statistics
___

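From the definition of conditional probability, the joint probability can be factored in two ways:  

$ P(X,Y) = P(X|Y) * P(Y) = P(Y|X) * P(X)$

Equating the two factorizations:

$ P(X|Y) * P(Y) = P(Y|X) * P(X)$

Dividing both sides by $P(Y)$ (assuming $P(Y) > 0$):

$ P(X|Y) = \frac{P(Y|X) * P(X)}{P(Y)}$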

___
## Bayes' Law

This equation is known as Bayes' law. It is the starting point for an entire field of study called Bayesian statistics, but here we will focus on this core principle.  

Here, $P(X)$ is called the **prior distribution** and $P(X|Y)$ the **posterior distribution**. You can think of the prior as what we originally believed the distribution of $X$ to be, and the posterior as our updated belief after seeing $Y$.

Consider a real-life example: you are arriving in New York for the first time, coming out of the airport. Searching for a taxi, you remember that in every movie you've ever seen, taxis in New York are always yellow. This is your **prior** distribution: all taxis in New York are yellow. However, soon after leaving the airport, you see a black taxi. You must now update your belief about taxis in New York. Therefore, you arrive at a new distribution: perhaps not **all** taxis are yellow. This is your **posterior** after seeing a black taxi.
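
The taxi story can be turned into a small numerical update. The sketch below is only an illustration: the three hypotheses (the fraction of taxis that are yellow) and their prior probabilities are made-up numbers, and `bayes_update` is a hypothetical helper name, not something from a library.

```python
def bayes_update(prior, likelihood):
    """Return P(H|Y) for every hypothesis H, given P(H) and P(Y|H)."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(Y), the normalizing constant
    return {h: u / evidence for h, u in unnormalized.items()}


# Hypotheses (assumed for illustration): the fraction of taxis that are yellow.
# The prior puts most of its mass on "almost all taxis are yellow".
prior = {0.99: 0.90, 0.90: 0.08, 0.50: 0.02}

# Likelihood of the observation "I saw a black taxi" under each hypothesis.
likelihood_black = {h: 1.0 - h for h in prior}

posterior = bayes_update(prior, likelihood_black)
for h in prior:
    print(f"yellow fraction {h:.2f}: prior {prior[h]:.2f} -> posterior {posterior[h]:.3f}")
```

With these numbers, a single black taxi lowers the belief in "almost all taxis are yellow" from 0.90 to about one third, while the other two hypotheses gain probability.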

___

___
## Machine learning is statistics

What does this have to do with machine learning, I hear you ask. Well, training a model is a random process, so for a given dataset we have a probability distribution over the values of our parameters!  
Let's call this probability distribution $p(\theta|X)$, that is, the joint probability distribution over our model parameters $\theta$ given a set of training data $X$. Note that $X$ now denotes the data, so it plays the role that $Y$ played in Bayes' law above.  

This gives us:
$ p(\theta|X) = \frac{p(X|\theta) * p(\theta)}{p(X)}$
  
Let us consider the base case, where we make no assumptions about our parameters. Our prior $p(\theta)$ is therefore constant. Since $p(X)$ does not depend on $\theta$ either, $p(\theta|X)$ is proportional to $p(X|\theta)$ and will be high exactly where $p(X|\theta)$ is high. That is, the parameters that best explain our data are the most likely to emerge.   
Take a moment to make sure you understand why this is.
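
A quick numerical check may help here. The sketch below uses a toy problem that is not part of this lesson's model: a coin with unknown heads probability `theta` and hypothetical data of 7 heads in 10 flips. With a constant prior, the posterior peaks at the same `theta` as the likelihood.

```python
# Hypothetical data: 7 heads in 10 flips. theta is the probability of heads.
heads, flips = 7, 10
thetas = [i / 100 for i in range(1, 100)]  # grid of candidate parameter values


def likelihood(theta):
    """p(X | theta) for the observed flips (up to a constant factor)."""
    return theta**heads * (1 - theta) ** (flips - heads)


flat_prior = [1.0 for _ in thetas]  # "no assumptions": constant p(theta)

unnormalized = [likelihood(t) * p for t, p in zip(thetas, flat_prior)]
evidence = sum(unnormalized)  # p(X), identical for every theta
posterior = [u / evidence for u in unnormalized]

theta_best_likelihood = max(thetas, key=likelihood)
theta_best_posterior = thetas[posterior.index(max(posterior))]
print(theta_best_likelihood, theta_best_posterior)
```

Dividing by the evidence rescales every grid point by the same amount, which is why the location of the maximum does not move.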

___

___
## Priors push our results in different directions

Since our prior represents our initial beliefs, we can use it to represent our assumptions about properties of our parameters $\theta$.
For instance, if we assume that our weights should be larger in the first layer than in the second (for whatever reason), we can represent this as a larger probability density for weight matrices that satisfy this belief.
This changes the prior $p(\theta)$ but not the likelihood $p(X|\theta)$, and $p(X)$ remains a normalizing constant that does not depend on $\theta$. Can you see how this should change our posterior?

By using a prior, we bias our training algorithm so that it is more likely to converge to some parameters than to others. It increases the probability that parameters that agree with our assumptions will emerge, and decreases the probability of those that don't.  
In short, it **manipulates our results to behave in a certain way**.
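
To see this pull in numbers, here is a minimal sketch on a toy coin-flip problem (hypothetical data of 7 heads in 10 flips, not this lesson's model). The prior shape `theta**10 * (1 - theta)**10`, which expresses a belief that the coin is roughly fair, is an assumption chosen purely for illustration.

```python
# Hypothetical data: 7 heads in 10 flips. theta is the probability of heads.
heads, flips = 7, 10
thetas = [i / 100 for i in range(1, 100)]  # grid of candidate parameter values


def likelihood(theta):
    """p(X | theta) for the observed flips (up to a constant factor)."""
    return theta**heads * (1 - theta) ** (flips - heads)


def fair_coin_prior(theta):
    """Assumed prior belief: theta is probably close to 0.5 (unnormalized)."""
    return theta**10 * (1 - theta) ** 10


def posterior_mode(prior_fn):
    """Grid point with the highest posterior density under the given prior."""
    unnormalized = [likelihood(t) * prior_fn(t) for t in thetas]
    return thetas[unnormalized.index(max(unnormalized))]


mode_flat = posterior_mode(lambda theta: 1.0)  # constant prior
mode_informed = posterior_mode(fair_coin_prior)  # prior favoring a fair coin
print(mode_flat, mode_informed)
```

The likelihood is the same in both calls. Only the prior differs, and it moves the most probable `theta` away from the data's own best estimate and toward 0.5.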


### Exercise 1:
What beliefs about the parameters could you imagine being reasonable for a neural network (perhaps for a specific application)? If you can think of one, how could it be represented as a prior?

___

___
## Working together with the algorithm

Why bother with a prior in the first place though? Isn't it easier to just let the training algorithm find what's best without our input?  

Not for the training algorithm! 
By giving it a prior, you essentially are helping the training algorithm by telling it where to look!  

Think of it this way: the training algorithm is essentially a *search* algorithm. Its job is to look for something: the best values for $\theta$.
Imagine you lost your phone. You can either assume nothing and search every corner of the house, or remember the places you've been and assume that most likely it's somewhere nearby.  
Which is easier?  

A prior can do the same for our optimization process: it tells it where to start looking for the best values.

___

___
## Priors can be manipulated

In machine learning, priors also have the function of telling the optimization process what properties we want from our results.  
By increasing the probability of parameters that give desirable results, we can bias our algorithm to choose those over other alternatives!

For instance, by using a prior that gives high probability to solutions that generalize well, we can decrease the probability of ending up with an overfitted model!
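
One well-known instance of such a prior can be written out explicitly. The derivation below is a sketch under an assumption not made elsewhere in this lesson: a zero-mean Gaussian prior on the parameters, with a hypothetical strength $\lambda > 0$.

Taking the logarithm of Bayes' law and dropping $\log p(X)$, which does not depend on $\theta$:

$ \arg\max_\theta p(\theta|X) = \arg\max_\theta \left[ \log p(X|\theta) + \log p(\theta) \right]$

With the assumed prior $p(\theta) \propto \exp\left(-\frac{\lambda}{2} \lVert \theta \rVert^2\right)$, this becomes:

$ \arg\max_\theta p(\theta|X) = \arg\min_\theta \left[ -\log p(X|\theta) + \frac{\lambda}{2} \lVert \theta \rVert^2 \right]$

The first term is the usual negative log-likelihood loss and the second is an L2 penalty on the weights, so under this assumption a belief that "weights should be small" shows up as a regularization term in the loss.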

Let's look at a concrete example. 
The initial values of our model are where our algorithm begins searching for optimal parameters. Because we update in small steps, values near those starting points are more probable early in training.   
The longer we train, the further our parameters can move from where they started. We can think of this as our model gradually giving more importance to what it learns from the data than to the prior, so the effect of this prior becomes weaker the longer we train!
While the specific reason for this decay is beyond the scope of this lesson, note that not all priors behave like this.
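
The drift away from the starting point is easy to observe on a toy problem. The sketch below is an illustration only: it runs plain gradient descent on a hypothetical one-parameter loss `(theta - theta_opt)**2`, with made-up values for the learning rate, the initial value, and the optimum.

```python
def distance_from_init(steps, lr=0.01, theta_init=0.0, theta_opt=5.0):
    """Run gradient descent on the toy loss (theta - theta_opt)**2 and
    return how far theta has moved from its initial value."""
    theta = theta_init
    for _ in range(steps):
        grad = 2 * (theta - theta_opt)  # derivative of the toy loss
        theta -= lr * grad  # one small update step
    return abs(theta - theta_init)


step_counts = (1, 10, 100, 1000)
distances = [distance_from_init(n) for n in step_counts]
for n, d in zip(step_counts, distances):
    print(f"{n:5d} steps: moved {d:.3f} away from the initial value")
```

After only a few steps the parameter is still close to where it started, so the initialization dominates. After many steps it sits near the optimum defined by the loss, and the starting point no longer matters much.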


### Exercise 2:
Can you think of other concrete ways that we might insert our beliefs into our model or our training process?  

___

___

## Priors in neural networks


### Exercise 3:
Please try to modify the following model so that it gives better results on average.
Whatever modifications you make, please justify them, if possible, in terms of Bayesian priors.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as vF
import numpy as np
import torchvision
import os
from pathlib import Path
import glob
import PIL.Image as Image

In [2]:
class Dataset_MNIST(torch.utils.data.Dataset):
    def __init__(self, root, classes, mode="train", transform=None, balance=(0.7, 0.15, 0.15), each_data_num=10000000):
        
        self.transform = transform
        self.images = []
        self.labels = []

        images = {} 
        labels = {}
        
        for cl in classes:
            path_list = glob.glob(root + f"{cl}/*")
            path_list.sort()
            path_list = path_list[:each_data_num]
            train_num = int(balance[0]*len(path_list))
            val_num = int(balance[1]*len(path_list))
            test_num = int(balance[2]*len(path_list))
            if mode=="train":
                path_list = path_list[:train_num]
            elif mode=="val":
                path_list = path_list[train_num:train_num+val_num]
            elif mode=="test":
                # slice from an explicit start index so that test_num == 0 yields an empty list
                path_list = path_list[len(path_list) - test_num:]
            images[str(cl)] = path_list
            labels[str(cl)] = [cl]*len(path_list)
            
        # combine them together
        for cl in classes:
            for image, label in zip(images[str(cl)], labels[str(cl)]):
                self.images.append(image)
                self.labels.append(label)

    def __getitem__(self, index):
        
        image = self.images[index]
        label = self.labels[index]
        
        with open(image, 'rb') as f:
            image = Image.open(f)
            image = image.convert("L")
        
        if self.transform is not None:
            image = self.transform(image)
            
        return image, label
    
    def __len__(self):
        return len(self.images)

In [3]:
MAX_EPOCH = 100
LR = 0.01
TRIALS = 10

In [4]:
class BasicModel(nn.Module):
    def __init__(self):
        super(BasicModel, self).__init__()
        self.c1 = nn.Conv2d(1, 9, (3,3), padding=(1,1))
        self.p1 = nn.MaxPool2d(2, stride=2)
        self.c2 = nn.Conv2d(9, 16, (3,3), padding=(1,1))
        self.p2 = nn.MaxPool2d(2, stride=2)
        self.l1 = nn.Linear(7*7*16, 32)
        self.l2 = nn.Linear(32, 10)
        
    def forward(self, x):
        h = F.relu(self.c1(x))
        h = self.p1(h)
        h = F.relu(self.c2(h))
        h = self.p2(h)
        h = h.view(-1, 7*7*16)
        h = F.relu(self.l1(h))
        h = F.relu(self.l2(h))
        y = F.softmax(h, dim=1)
        return y

In [5]:
dataset_train = Dataset_MNIST('./smalldataset/', mode='train', classes=[5,6,8], transform=torchvision.transforms.ToTensor(), balance=[0.8,0,0.2])
dataset_test = Dataset_MNIST('./smalldataset/', mode='test', classes=[5,6,8], transform=torchvision.transforms.ToTensor(), balance=[0.8,0,0.2])
dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=8, shuffle=True)

In [6]:
def calc_train_acc():
    c = 0 
    w = 0
    for x, y in dataset_train:
        if model(x[None,...].cuda()).argmax()==y:
            c += 1
        else:
            w += 1
#     print('Train accuracy: {}'.format(c/(c+w)))
    return c/(c+w)

In [7]:
def calc_test_acc():
    c = 0 
    w = 0
    for x, y in dataset_test:
        if model(x[None,...].cuda()).argmax()==y:
            c += 1
        else:
            w += 1
#     print('Test accuracy: {}'.format(c/(c+w)))
    return c/(c+w)

In [8]:
losses = []
train_acc = []
test_acc = []
for trial in range(TRIALS):
    criterion = nn.CrossEntropyLoss().cuda()
    model = BasicModel().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: 0.999**epoch)
    
    weights = []
    for child in model.children():
        try:
            weights.append(child.weight)
            weights.append(child.bias)
        except AttributeError:  # pooling layers have no weight or bias
            pass
    
    best_loss = 1000
    for epoch in range(MAX_EPOCH):
        for x, y in dataloader_train:
            # zero the gradient buffers
            optimizer.zero_grad()
            # forward propagation
            prediction = model(x.cuda())

            loss = criterion(prediction, y.cuda())

            # gradient calculation
            loss.backward()
            # parameter update
            optimizer.step()

        print('Trial: {}, Epoch: {}, Loss: {}'.format(trial, epoch, loss), end='\r')
        scheduler.step()
        
        if loss<best_loss:
            best_loss = loss
            
    losses.append(loss.item())  # store a plain float instead of a tensor attached to the graph
    train_acc.append(calc_train_acc())
    test_acc.append(calc_test_acc())
    print()
        

Trial: 0, Epoch: 99, Loss: 2.3355238437652594
Trial: 1, Epoch: 99, Loss: 1.8161097764968872
Trial: 2, Epoch: 99, Loss: 1.4613323211669922
Trial: 3, Epoch: 99, Loss: 1.4616003036499023
Trial: 4, Epoch: 99, Loss: 1.4611873626708984
Trial: 5, Epoch: 99, Loss: 1.7110190391540527
Trial: 6, Epoch: 99, Loss: 1.5867010354995728
Trial: 7, Epoch: 99, Loss: 1.4611637592315674
Trial: 8, Epoch: 99, Loss: 1.8344522714614868
Trial: 9, Epoch: 99, Loss: 2.0857512950897217


In [9]:
print('Mean Loss: {:.4}, Mean Train Acc: {:.4}, Mean Test Acc: {:.4}'.format(
                                                        sum(losses)/TRIALS,
                                                        sum(train_acc)/TRIALS,
                                                        sum(test_acc)/TRIALS,
                                                        ))

Mean Loss: 1.721, Mean Train Acc: 0.7639, Mean Test Acc: 0.6889
