## Problem 1

In this homework problem, you are going to use `tensorflow-probability` to deal with an unfair die, i.e., a die that has a different probability of landing with each of the faces 1 to 6 facing up, instead of the equal probability $p=1/6$ of a fair die.

You are provided with 5000 data entries for this die. Each entry is a length-6 vector with one element equal to 1 and the rest equal to 0. For example, $[0,1,0,0,0,0]$ means the "2" face of the die landed facing up. This is also the form of the data generated by `tfp.distributions.Bernoulli` when you feed multiple probabilities to it.

You are going to estimate the 6 probabilities describing this unfair die (one per face), $\tilde{p} = [p_1,p_2,p_3,p_4,p_5,p_6]$. Keep in mind that they sum to 1.

### Answer:
An alternate way of formulating this question is to treat the result of each of the $n$ throws of the die as an independent random variable $R_i$ with $R_i\sim \text{Multinomial}(1,\vec{p})$ for $\vec{p}=(p_1, p_2, p_3,\dots, p_6)$. Then define $X_k$ to be the total number of rolls with result $k$, so that $(X_1,\dots,X_6)\sim \text{Multinomial}(n,\vec{p})$. It can be shown that the MLE for $p_k$ is given by $\hat{p}_k=x_k/n$. Calculating below...
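
For reference, the MLE quoted above can be obtained with a Lagrange multiplier that enforces $\sum_k p_k = 1$. The multiplier $\lambda$ is notation introduced here for the sketch and does not appear in the problem statement.

$$
\begin{aligned}
\log L(\vec{p}) &= \text{const} + \sum_{k=1}^{6} x_k \log p_k, \\
\frac{\partial}{\partial p_k}\left[\log L(\vec{p}) - \lambda\left(\sum_{j=1}^{6} p_j - 1\right)\right] &= \frac{x_k}{p_k} - \lambda = 0 \;\Rightarrow\; p_k = \frac{x_k}{\lambda}, \\
\sum_{k=1}^{6} p_k = 1 &\;\Rightarrow\; \lambda = \sum_{k=1}^{6} x_k = n \;\Rightarrow\; \hat{p}_k = \frac{x_k}{n}.
\end{aligned}
$$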

In [1]:
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np
import matplotlib.pyplot as plt
tfd = tfp.distributions
tfb = tfp.bijectors

In [2]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [3]:
# Get the roll data, the roll sums (x_k's), and the number of rolls (n)
roll_data = tf.constant(np.loadtxt('unfair_dice.txt'))
roll_sums = tf.reduce_sum(roll_data, axis=0)
num_rolls = roll_data.shape[0]

# Calculate the MLE
p_hat_vec = tf.divide(roll_sums, num_rolls)

# Renormalize so the estimates sum to exactly 1, guarding against floating-point rounding error
p_hat_vec = tf.divide(p_hat_vec, tf.reduce_sum(p_hat_vec))
tf.print('MLE of p vector:', p_hat_vec)

MLE of p vector: [0.049 0.099 0.146 0.4486 0.0508 0.2066]


### (2) Use MAP to estimate $\tilde{p}$. Select three different prior distributions (if the distribution is parametrized, select three sufficiently different parameter settings). Using the 5000 samples, compare which prior gives the best estimate and why.

[Hint: If the optimization takes too long, try running a fixed number of steps instead of setting a criterion on the gradient. Check the remaining gradient and determine whether to increase the number of steps.]

### Answer:
An alternate way to obtain a point estimate for $\vec{p}$ is MAP, which is closer to the Bayesian approach to point estimation: it multiplies the likelihood by a prior distribution $P(\vec{p})$, allowing us to bake in prior beliefs about $\vec{p}$. This is Bayesian because we now allow ourselves to talk about $\vec{p}$ in terms of probabilities, something that is not allowed under the frequentist perspective.

Let's play the role of a casino that has noticed a suspicious number of 1's and 4's being rolled at a specific craps table. We suspect that some of the dice at the table are each weighted to land on one of these values more often than the others. To investigate, we have analyzed hours of casino floor film to get 5000 roll results of one particular die (giving our data above), and now we want to investigate whether this die is weighted to 1, weighted to 4, or unweighted. Let's make the prior for $\vec{p}$ a Dirichlet distribution, and use three different $\vec{\alpha}$ parameterizations to reflect our three prior beliefs. We will let all elements of $\vec{\alpha}$ equal 1 for the first parameterization (a flat prior with no preference for any face, standing in for the belief that the die is not loaded), we will let $\alpha_1=5$ with all others being 1 for the second parameterization (prior belief that the die is loaded to land on 1), and we will let $\alpha_4=5$ with all others being 1 for the third parameterization (prior belief that the die is loaded to land on 4).

Now we want to calculate the MAP estimate for each of these parameterizations. Rather than deriving the argmax of the resulting log-posterior with respect to $\vec{p}$ analytically, we will instead use gradient ascent to find the maximum. Doing this below...
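
Before running the optimization, note that a closed form is available as a cross-check: the Dirichlet prior is conjugate to the multinomial likelihood, so the posterior is Dirichlet($\vec{\alpha}+\vec{x}$), and its mode is $(x_k+\alpha_k-1)/(n+\sum_j\alpha_j-6)$ whenever every $x_k+\alpha_k>1$. The sketch below evaluates this in plain Python. The counts are reconstructed from the MLE printed above multiplied by 5000 (an assumption, since the raw data file is not shown), and `dirichlet_map` is a hypothetical helper name.

```python
# Per-face counts, reconstructed as (printed MLE) * 5000. This is an assumption,
# since unfair_dice.txt itself is not shown here.
counts = [245, 495, 730, 2243, 254, 1033]

def dirichlet_map(counts, alpha):
    """Mode of the Dirichlet(alpha + counts) posterior (hypothetical helper).

    Valid when every alpha_k + x_k > 1, so the mode lies inside the simplex.
    """
    numer = [x + a - 1.0 for x, a in zip(counts, alpha)]
    total = sum(numer)
    return [v / total for v in numer]

# The three priors described in the text above
priors = {
    "flat": [1, 1, 1, 1, 1, 1],
    "favors face 1": [5, 1, 1, 1, 1, 1],
    "favors face 4": [1, 1, 1, 5, 1, 1],
}

map_estimates = {name: dirichlet_map(counts, alpha) for name, alpha in priors.items()}
for name, est in map_estimates.items():
    print(name, [round(p, 6) for p in est])
```

Under these assumed counts, the flat prior reproduces the MLE exactly, and the other two priors shift only the favored face noticeably (for example, $249/5004$ for face 1 under the second prior), which can be compared against the output of the optimization cell.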

In [35]:
# Create the three Dirichlet concentration vectors to try:
# flat, favoring face 1 (index 0), and favoring face 4 (index 3)
param_0 = np.ones(6)
param_1 = np.copy(param_0)
param_4 = np.copy(param_0)
param_1[0] = 5
param_4[3] = 5

param_0 = tf.constant(param_0)
param_1 = tf.constant(param_1)
param_4 = tf.constant(param_4)

param_list = [param_0, param_1, param_4]
map_est_list = []

for param in param_list:
    
    LR = 0.001
    
    # Create some values to keep track of
    logit_p_vec_est = tf.constant(np.ones(6))
    abs_grad = np.ones(6)
    grad_list = []
    
    # Define the dirichlet distribution with the proper params
    prior_dis = tfd.Dirichlet(concentration=param)
    
    while True:
        with tf.GradientTape() as tape:
            tape.watch(logit_p_vec_est)
            
            # Define the likelihood distribution
            lh_dis = tfd.Multinomial(total_count=num_rolls, logits=logit_p_vec_est)
            
            # Define the MAP objective (log-likelihood plus log-prior), which we maximize.
            # We work with the per-face counts rather than the raw rolls: the order of the
            # rolls does not matter, so the counts carry the same information about p.
            map_objective = (lh_dis.log_prob(value=roll_sums) +
                             prior_dis.log_prob(value=tf.nn.softmax(logit_p_vec_est)))
        
        # Take the gradient outside the tape context so the gradient computation is not recorded
        grad = tape.gradient(map_objective, logit_p_vec_est)
        abs_grad = np.abs(grad.numpy())
        
        # Gradient ascent step
        logit_p_vec_est += LR*grad
        grad_list.append(grad.numpy())
        
        # Check for convergence
        if np.all(abs_grad<0.0001):
            break

    
    p_vec_est = tf.nn.softmax(logit_p_vec_est)
    #print("logit_p_vec_est", logit_p_vec_est)
    tf.print("p_vec_est for", param,":", p_vec_est)
    map_est_list.append(p_vec_est)

p_vec_est for [1 1 1 1 1 1] : [0.04900001447993637 0.098999997805673187 0.14599999856912141 0.4485999989867353 0.050799991327776496 0.2065999988307573]
p_vec_est for [5 1 1 1 1 1] : [0.049760202669150433 0.0989208608014212 0.14588329176145176 0.44824140589677658 0.050759388085767444 0.20643485078543256]
p_vec_est for [1 1 1 5 1 1] : [0.048960845801685589 0.098920861117763617 0.14588329193697647 0.44904076636928542 0.050759383821592272 0.20643485095269665]


In [22]:
grad_list

[array([-588.33333333, -338.33333333, -103.33333333, 1409.66666667,
        -579.33333333,  199.66666667]),
 array([-100.05299292,   51.94318699,  169.57355218, -301.52178576,
         -94.17248652,  274.23052604]),
 array([-91.4162696 ,  -7.87693125,  14.50412351, 214.84762412,
        -87.45974572, -42.59880106]),
 array([-39.82368154,  32.14869555,  56.54763504, -89.42884981,
        -36.23976517,  76.79596593]),
 array([-33.74091882,   8.23053199,   4.25334143,  70.85013648,
        -31.06114202, -18.53194907]),
 array([-17.69827905,  16.5816821 ,  19.53572791, -29.84387234,
        -15.37564872,  26.8003901 ]),
 array([-14.4494229 ,   6.02024451,   1.70345778,  25.36370401,
        -12.66285236,  -5.97513104]),
 array([ -8.30114528,   7.735034  ,   7.38362398, -10.02176302,
         -6.8091732 ,  10.01342352]),
 array([-6.55412142,  3.27318171,  1.02288965,  9.36392234, -5.39709095,
        -1.70878133]),
 array([-3.98626575,  3.48611255,  2.97641374, -3.30895751, -3.04641306,
   

In [30]:
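# Note: this is True whenever any component is below 0.01, negative components included;
# compare np.abs(grad_list[50]) instead to test gradient magnitudes
np.any(grad_list[50]<0.01)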

True

In [20]:
p_vec_est = tf.constant((1/6)*np.ones(6))
lh_dis = tfd.Multinomial(total_count=num_rolls, probs=p_vec_est, validate_args=True)

lh_dis.log_prob(roll_sums)

<tf.Tensor: shape=(), dtype=float64, numpy=-1506.1565747316326>

In [17]:
prior_dis = tfd.Dirichlet(concentration=tf.constant(np.ones(6)), validate_args=True)
prior_dis.log_prob(p_vec_est)

InvalidArgumentError: Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'sample last-dimension must sum to `1`'
b'x and y not equal to tolerance rtol = tf.Tensor(2.220446049250313e-15, shape=(), dtype=float64), atol = tf.Tensor(2.220446049250313e-15, shape=(), dtype=float64)'
b'x (shape=() dtype=float64) = '
1.0
b'y (shape=() dtype=float64) = '
60000000.0

In [11]:
test = [0.5,0,0,0,0.5,0]

In [12]:
prior_dis.prob(test)

<tf.Tensor: shape=(), dtype=float64, numpy=119.99999999999997>

### (3) Using Monte-Carlo sampling, generate posterior samples and estimate the 6 probabilities again.

*Your solution here*
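
One possible approach, sketched under assumptions rather than offered as the intended solution: with a Dirichlet prior the posterior is Dirichlet($\vec{\alpha}+\vec{x}$) by conjugacy, so posterior samples can be drawn directly and averaged. The code uses only the Python standard library (a Dirichlet draw is a vector of normalized Gamma draws). The counts are reconstructed from the printed MLE times 5000, and `sample_dirichlet`, the seed, and the sample size are illustrative choices.

```python
import random

# Per-face counts, reconstructed as (printed MLE) * 5000. This is an assumption.
counts = [245, 495, 730, 2243, 254, 1033]
alpha_prior = [1.0] * 6                      # flat Dirichlet prior
alpha_post = [a + x for a, x in zip(alpha_prior, counts)]

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma(alpha_k, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)                       # fixed seed for reproducibility
num_samples = 2000
samples = [sample_dirichlet(alpha_post, rng) for _ in range(num_samples)]

# Monte Carlo estimate of the posterior mean of each face probability
post_mean = [sum(s[k] for s in samples) / num_samples for k in range(6)]

# Exact posterior mean of Dirichlet(alpha_post), for comparison
exact_mean = [a / sum(alpha_post) for a in alpha_post]

print("MC posterior mean:   ", [round(p, 4) for p in post_mean])
print("Exact posterior mean:", [round(p, 4) for p in exact_mean])
```

Direct sampling works here only because the prior is conjugate. For a non-conjugate prior, an MCMC kernel from `tfp.mcmc` would be the more general route.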

In [22]:
import tensorflow_probability as tfp
tfd = tfp.distributions

# Create a single six-dimensional Dirichlet with a flat (all-ones) concentration.
# I.e., batch_shape=[], event_shape=[6].
alpha = [1, 1, 1, 1, 1, 1]
prior_dist = tfd.Dirichlet(alpha)



In [24]:
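# Integer logits trigger the NotFoundError shown in this cell's output: Softmax has no
# integer kernel, so a float array (e.g. np.ones(6)) is needed here
logit_p_vec_est=np.array([1,1,1,1,1,1])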
prior_dis.log_prob(value=tf.nn.softmax(logit_p_vec_est))

NotFoundError: Could not find valid device for node.
Node:{{node Softmax}}
All kernels registered for op Softmax :
  device='XLA_GPU'; T in [DT_FLOAT, DT_DOUBLE, DT_BFLOAT16, DT_HALF]
  device='XLA_CPU'; T in [DT_FLOAT, DT_DOUBLE, DT_BFLOAT16, DT_HALF]
  device='XLA_CPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_BFLOAT16, DT_HALF]
  device='XLA_GPU_JIT'; T in [DT_FLOAT, DT_DOUBLE, DT_BFLOAT16, DT_HALF]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_HALF]
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]
 [Op:Softmax]

In [23]:
x = tf.constant([10, .1, .1, .1, .1, .1]) 
prior_dist.prob(x)

<tf.Tensor: shape=(), dtype=float32, numpy=120.00001>

In [43]:
import tensorflow_probability as tfp
import tensorflow as tf
tfd = tfp.distributions

# Create a single trivariate Dirichlet with concentration [.2, .1, .1].
# I.e., batch_shape=[], event_shape=[3].
alpha = [.2, .1, .1]
dist = tfd.Dirichlet(alpha)

In [46]:
x = tf.constant([.2, 10000, 1000]) 
dist.prob(x)

<tf.Tensor: shape=(), dtype=float32, numpy=9.696046e-09>