# Homework 4 - Basic Bayes
You have seen a demonstration of Bayesian inference in section. Your homework will explore simple variations of it to solidify your understanding of priors, likelihoods, and posteriors.

# 1 Bayes Rule Introduction and Warm Up (2 pts)

Recall Bayes' rule: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
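
As a quick illustration of the formula, here is a minimal sketch using made-up numbers that are unrelated to the scenario below. The function name `bayes_rule` and the rain/cloud probabilities are assumptions chosen purely for illustration.

```python
def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(A|B) computed from P(B|A), P(A), and P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: A = "it rains today", B = "the morning is cloudy"
p_cloudy_given_rain = 0.8
p_rain = 0.1
p_cloudy = 0.4

p_rain_given_cloudy = bayes_rule(p_cloudy_given_rain, p_rain, p_cloudy)
print(round(p_rain_given_cloudy, 3))
```

Note that the evidence term $P(B)$ goes in the denominator; swapping it with the prior $P(A)$ is a common mistake.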


**Scenario:**

Imagine you are a dietician for toddlers. You are trying to recommend to parents which fruit they should buy their kids. You have a client 'Tommy' who likes bananas. You also know, from your many years as a toddler dietician, the following statistics:
- If a child likes apples, there is a 95% chance that they also like bananas.
- 50% of children like apples.
- 75% of children like bananas.

**Tasks**

- [1 pt] Use Bayes' theorem to calculate the probability that Tommy likes apples
- [1 pt] Print the probability

In [None]:
# TODO Translate probabilities
p_likes_bananas_given_apples = None
p_likes_apples = None
p_likes_bananas = None

# TODO Bayes theorem implementation 


# TODO Print



# 2 Bayesian Inference with Synthetic Data (17 pts)

In this section, you will simulate a dataset and update a prior distribution using Bayes' Rule.

Your friend Steve is a really sleazy guy. He wants to make a bet with you on the outcome of a coin being flipped. The only catch is that he wants to bet that his lucky "totally fair" coin will land on heads (a binomial success). Obviously you don't believe him and decide to run an experiment to understand the true fairness of the coin. 

You decide to flip the coin 100 times to get a good understanding of the bias of the coin. Additionally, knowing that Steve is completely unpredictable, you have a prior belief that every possible bias towards heads is equally likely (i.e. your prior is a uniform distribution across the values 0 to 1).

### 2.1 Setup

**Tasks**
- [1 pts] Set the `true_bias` to .7 and `n_flips` to 100
- [2 pts] Use `np.random.binomial` to simulate the 100 flips. Save them in `flips`

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# TODO: Coin flip experiment setup
true_bias = None
n_flips = None
flips = None 

### 2.2 Initialize Prior

For our own sake we are going to assume that for our prior distribution all values **UP TO 2 decimal places** within 0-1 (inclusive) are equally likely. 

That means that .01, .79, .54, etc. are all equally likely.

**Tasks**
- [1 pts] Create a list of all possible bias values and store it in `bias_values`
- [2 pts] Create the prior (uniform) distribution across all possible coin biases
    - Save this in the variable `prior`
    - You should output a NumPy array of length 101
    - Hint: You can create an array of ones and divide them all by the length of the array
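
The hint above can be sketched on a deliberately coarser grid. The names `grid_demo` and `uniform_demo` and the 11-point grid are assumptions for illustration only; the task itself asks for 101 values.

```python
import numpy as np

# Hypothetical coarse grid: 0.0, 0.1, ..., 1.0 (11 values, not the 101 asked for above)
grid_demo = np.linspace(0, 1, 11)

# Uniform distribution: every grid point gets the same probability mass
uniform_demo = np.ones(len(grid_demo)) / len(grid_demo)

print(grid_demo)
print(uniform_demo.sum())  # should be 1 up to floating-point rounding
```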

In [2]:
# TODO: Initialize prior (uniform distribution over 101 possible biases)
bias_values = None
prior = None

### 2.3 Function to update prior

**Tasks**

- [3 pts] Fill in the missing code in the function below to update the prior distribution with more data
    - The posterior returned should be an array of length 101

In [3]:
# TODO: Function to update prior based on observed flip
def update_posterior(prior, flip, bias_values):
    likelihood = bias_values if flip == 1 else (1 - bias_values)

    return posterior

### 2.4 Updating prior

**Tasks**
- [2 pts] For each flip in `flips` update the `posterior` using the function defined above


In [4]:
# TODO: Perform Bayesian updating


### 2.5 Plot

**Task**
- [2 pts] On **ONE** plot, show the posterior distribution after 5, 10, 20, and 100 flips
- [1 pts] Include appropriate axis titles and legend

In [None]:
# TODO: Plot posterior distribution after 5, 10, 20 and 100 flips


### 2.6 Discussion

[1 pt] How does the prior influence the model's posterior in early iterations (with little data) versus later iterations (with more data)?

**Ans** Meaningful discussion about how the prior becomes less important as we gather more data

[2 pt] Should you take the bet with Steve? Why or why not? Include reasoning drawn from the plot in 2.5.

**Ans** No. The plot of the distribution shows us that we are confident that the coin is biased to land on heads 70% of the time. This means that Steve is likely to win the bet. 


## 3 Intro: Poisson Likelihood + Gamma Prior -> Gamma Posterior

### Setup
Assume $X_1, X_2, \dots, X_n$ are independent and identically distributed (i.i.d.) Poisson random variables. So,

$$X_i \sim \text{Pois}(\lambda) \text{ for all } i.$$

You can imagine $x_i$ as counting the number of telephone calls in day $i$, which follows a Poisson distribution, where $\lambda$ is the (unknown) average number of phone calls a day.

**Goal:** We want to conduct Bayesian inference on the data $x_1, \dots, x_n$ in order to infer the unknown parameter $\lambda$.

### a. Likelihood $p(x | \lambda)$
The pmf of Poisson is 

$$p(x | \lambda) = \frac{\lambda^{x}e^{-\lambda}}{x!}.$$

Therefore, by the i.i.d. assumption, the joint likelihood of all $n$ data points is the product of the individual pmfs, which simplifies to
$$p(x_1, \dots, x_n | \lambda) = \frac{\lambda^{x_1 + \dots  + x_n} e^{-n\lambda}}{x_1! \dots x_n!}.$$
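
To see that the simplified form really equals the product of the individual pmfs, here is a small numerical check. The values of `lam_demo` and `x_demo` are made up for this check and are not part of the assignment data.

```python
import math
import numpy as np
from scipy import stats

# Hypothetical rate and data, chosen only for this check
lam_demo = 3.5
x_demo = np.array([2, 4, 3, 5])
n_demo = len(x_demo)

# Left-hand side: product of the individual Poisson pmfs
product_of_pmfs = np.prod(stats.poisson.pmf(x_demo, mu=lam_demo))

# Right-hand side: the simplified closed form
factorial_product = np.prod([math.factorial(int(k)) for k in x_demo])
closed_form = lam_demo ** x_demo.sum() * np.exp(-n_demo * lam_demo) / factorial_product

print(product_of_pmfs, closed_form)
```

The two printed numbers should agree up to floating-point rounding.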

### b. Prior $p(\lambda)$
Remember, we don't know what $\lambda$ is, so we will treat it as a random variable $\Lambda$ (this is capital letter for $\lambda$). Magically, if we let $\Lambda$ follow a Gamma distribution, we get a nice posterior, so we will do just that. Now, the pdf of gamma(shape=$\alpha$, rate=$\beta$) is

$$p(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda},$$

where $\Gamma(\cdot)$ is the gamma function. When you choose your prior, **choose any $\alpha >0, \beta>0$ that is suitable for the prior knowledge you have about the data**. See the section notes if you are unfamiliar with it.

_Do not be intimidated by this crazy formula! It will be very friendly to us at the end of the calculation. Question: doesn't the gamma density look kind of similar to the Poisson pmf? This may give us a sense that the posterior will be nice! :)_

### c. Posterior $p(\lambda | x)$
We're almost done. 

Recall: The formula of the posterior distribution is
$$
p(\lambda | x) = \frac{p(x|\lambda)p(\lambda)}{\int_{\lambda} \ p(x|\lambda)p(\lambda) \ d\lambda} = \frac{\text{likelihood} \cdot \text{prior}}{\text{normalizing constant}}.
$$
We will skip the algebra and just tell you that 
$$
p(\lambda | x) = \frac{\lambda^{{\color{red}{x_1 + \dots + x_n + \alpha}} - 1} e^{-({\color{red}{n + \beta}})\lambda}}{\text{normalizing constant}}.
$$
The normalizing constant is not very important, the **MAIN TAKEAWAY IS THAT THE POSTERIOR IS ALSO DISTRIBUTED GAMMA! :)** In fact, the posterior is 
$$
\text{gamma}(x_1 + \dots + x_n + \alpha, \ n + \beta).
$$
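
Since the algebra is skipped, a numerical sanity check may help. The sketch below multiplies likelihood by prior on a grid of $\lambda$ values, normalizes numerically, and compares the result with the closed-form gamma posterior. The data `x_demo` and the prior parameters `alpha_demo`, `beta_demo` are hypothetical values picked only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data and prior parameters, chosen only for illustration
x_demo = np.array([2, 4, 3, 5])
alpha_demo, beta_demo = 2.0, 1.0

# Grid over lambda (starting slightly above 0)
lam_grid = np.linspace(0.01, 15, 3000)
step = lam_grid[1] - lam_grid[0]

# Unnormalized posterior = likelihood * prior, evaluated on the grid
likelihood = np.array([np.prod(stats.poisson.pmf(x_demo, mu=lam)) for lam in lam_grid])
prior_pdf = stats.gamma.pdf(lam_grid, a=alpha_demo, scale=1 / beta_demo)
unnormalized = likelihood * prior_pdf

# Normalize numerically so the grid approximation integrates to about 1
grid_posterior = unnormalized / (unnormalized.sum() * step)

# Closed-form posterior: gamma(sum(x) + alpha, n + beta), with scale = 1 / rate
closed_posterior = stats.gamma.pdf(
    lam_grid, a=x_demo.sum() + alpha_demo, scale=1 / (len(x_demo) + beta_demo)
)

# Largest pointwise gap between the two curves; should be very small
print(np.max(np.abs(grid_posterior - closed_posterior)))
```

The normalizing constant never has to be worked out by hand here: dividing by the numerical integral plays that role.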

### Summary
- Likelihood $p(x | \lambda) \sim \text{Pois}(\lambda)$
- Prior $p(\lambda) \sim \text{gamma}(\alpha, \beta)$
- Posterior $p(\lambda | x)\sim \text{gamma}(x_1 + \dots + x_n + \alpha, \ n + \beta)$

With this, you are ready to tackle the next problem.

# 3 Peaked Prior (13 pts)
Run the following code cell.

Let the full data ($n=30$) be
$$
x = [ 8,  8,  7, 11, 10,  6,  7, 11,  5, 12,  8,  7,  8,  8, 11,  4,  3,
        9,  9,  4,  7,  7,  9, 12,  8,  9, 10,  9,  8,  8]
$$
Define:
- ```x```, the original 30 data points
- ```x_short```, only the first 3 data points

In [None]:
import scipy.stats as stats

x = [ 8,  8,  7, 11, 10,  6,  7, 11,  5, 12,  8,  7,  8,  8, 11,  4,  3,  9,  9,  4,  7,  7,  9, 12,  8,  9, 10,  9,  8,  8]
n = len(x)
print('n:', n, '\nmean:', np.mean(x))
x_short = x[:3]
n_short = len(x_short)
print('x_short:', x_short)
print('n_short:', n_short, '\nmean x_short:', np.round(np.mean(x_short),2))

## 3.1 Prior

Recall the story about inferring the average number of phone calls. First, we want to create a (peaked) prior that reflects our belief about what the data (number of phone calls) is.     

Suppose Mrs. Morgan said, "From my experience and memory, I think the average number of phone calls every day is 4. Most of the time (like 95% of the time), it's between 2 and 6 calls every day."


**Task:**
1. [2 pt] Define appropriate ```alpha_prior1```, ```beta_prior1``` to obtain a suitable prior gamma($\alpha, \beta$) with appropriate mean and variance. 

    (Hint: for gamma($\alpha, \beta$)
    - mean = $\frac{\alpha}{\beta}$ and variance = $\frac{\alpha}{\beta^2}$. 
    - It may also be easier to define $\beta$ before defining $\alpha$.)
2. [1 pt] Compute ```prior1```, which is the pdf of gamma($\alpha, \beta$) along the $\lambda$-axis `lamb`. 
    - Make sure you read the scipy documentation correctly and input the correct arguments (rate and scale are reciprocals of each other!).
3. [1 pt] Plot the density of the prior.
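
The rate-versus-scale point is easy to get wrong, so here is a minimal sketch of how a gamma(shape, rate) distribution can be passed to `scipy.stats.gamma`, which is parameterized by shape `a` and `scale`. The values of `alpha_demo` and `beta_demo` are arbitrary placeholders, not the answer to the task.

```python
import numpy as np
from scipy import stats

# Hypothetical shape and rate, not the values asked for in the task
alpha_demo, beta_demo = 9.0, 3.0

# scipy's gamma takes shape `a` and `scale`; scale = 1 / rate
dist_demo = stats.gamma(a=alpha_demo, scale=1 / beta_demo)

print(dist_demo.mean(), alpha_demo / beta_demo)       # mean = alpha / beta
print(dist_demo.var(), alpha_demo / beta_demo ** 2)   # variance = alpha / beta^2

# Density evaluated along a lambda axis
lam_demo = np.linspace(0, 10, 500)
pdf_demo = dist_demo.pdf(lam_demo)
```

Comparing the distribution's mean and variance against the hint formulas is a cheap way to confirm the arguments were passed correctly.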

In [None]:
# TODO alpha and beta
beta_prior1 = None
alpha_prior1 = None

# TODO prior1
lamb = np.linspace(0, 15, 1000)
prior1 = None

# TODO plot


## 3.2 Posterior parameters
Let the posterior parameters be
- ```alpha_post1```, the posterior shape for full data ```x```.
- ```beta_post1```, the posterior rate for full data ```x```.
- ```alpha_post1_short```, the posterior shape for short data ```x_short```.
- ```beta_post1_short```, the posterior rate for short data ```x_short```.

**Task:**

[2 pt] Define ```alpha_post1```, ```beta_post1```, ```alpha_post1_short```, and ```beta_post1_short``` using the formula in the introduction.

In [9]:
# TODO alpha, beta
alpha_post1 = None
beta_post1 = None

alpha_post1_short = None
beta_post1_short = None


## 3.3 Posterior plot

**Task:**

1. [2 pt] Define ```posterior1``` and ```posterior1_short```, the respective pdfs of the posteriors. 
    - Use the same horizontal axis ```lamb``` from previous parts.
2. [1 pt] Plot the three densities (prior1, posterior1, posterior1_short) on the same figure.

In [None]:
# TODO posterior densities
posterior1 = None
posterior1_short = None

# TODO plot


## 3.4 Peaked MAP
**Task:**

1. [1 pt] Compute the MAP estimator for both posteriors, storing them as ```lamb_MAP1``` and ```lamb_MAP1_short```. 

2. [1 pt] Print both these values, rounded to 3 decimal places.
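
As a reminder of what the MAP estimator means, the sketch below finds the mode of a gamma density in two ways: by taking the argmax over a grid, and with the standard closed form $(\alpha - 1)/\beta$, which holds for $\alpha \geq 1$. The parameters `alpha_demo` and `beta_demo` are hypothetical and unrelated to the posteriors in this problem.

```python
import numpy as np
from scipy import stats

# Hypothetical shape and rate, chosen only for illustration
alpha_demo, beta_demo = 9.0, 3.0

# Evaluate the density on a fine grid and take the location of its peak
lam_demo = np.linspace(0, 10, 10001)
pdf_demo = stats.gamma.pdf(lam_demo, a=alpha_demo, scale=1 / beta_demo)
map_grid = lam_demo[np.argmax(pdf_demo)]

# Closed-form mode of gamma(shape=alpha, rate=beta), valid for alpha >= 1
map_closed = (alpha_demo - 1) / beta_demo

print(round(map_grid, 3), round(map_closed, 3))
```

The grid answer is only as precise as the grid spacing, so the two values may differ slightly in the last decimal place.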

In [1]:
# TODO MAP
lamb_MAP1 = None
lamb_MAP1_short = None

# TODO print


## 3.5 Discuss MAP
[2 pt] What do you observe about the MAP estimator for the full data and short data? Which is "closer" to the prior? Give an explanation for your observations.

**Ans:** 