# NEU502a (Spring 2018)
## Problem Set #5: Reinforcement Learning

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.io import loadmat
from scipy.optimize import minimize
sns.set_style('white')
sns.set_context('notebook', font_scale=1.5)
%matplotlib inline

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Define useful functions.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

def inv_logit(arr):
    '''Elementwise inverse logit (logistic) function.'''
    return 1 / (1 + np.exp(-arr))

def phi_approx(arr):
    '''Elementwise fast approximation of the cumulative unit normal. 
    For details, see Bowling et al. (2009). "A logistic approximation 
    to the cumulative normal distribution."'''
    return inv_logit(0.07056 * arr ** 3 + 1.5976 * arr)

def softmax(arr, beta=1):
    '''Softmax over arr with inverse temperature beta (NaNs ignored in the normalizer).'''
    return np.exp(beta * arr) / np.nansum( np.exp( beta * arr ) )
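
As a quick sanity check on the helpers above, the sketch below compares `phi_approx` with SciPy's exact normal CDF (`scipy.stats.norm.cdf`) on a grid. It also shows what the transforms used later in the fitting cells imply for an all-zeros starting point. The grid range and the names `max_abs_err`, `eta0`, and `beta0` are illustrative choices, not part of the original problem set.

```python
import numpy as np
from scipy.stats import norm

def inv_logit(arr):
    '''Elementwise inverse logit (logistic) function.'''
    return 1 / (1 + np.exp(-arr))

def phi_approx(arr):
    '''Logistic approximation of the cumulative unit normal.'''
    return inv_logit(0.07056 * arr ** 3 + 1.5976 * arr)

## Compare against the exact CDF on an (arbitrary) grid.
x = np.linspace(-4, 4, 801)
max_abs_err = np.max(np.abs(phi_approx(x) - norm.cdf(x)))
print('max abs error = %0.5f' % max_abs_err)

## An unconstrained value of 0 maps to the midpoint of each range.
eta0 = phi_approx(0.0)         # learning rate in (0, 1)
beta0 = phi_approx(0.0) * 10   # inverse temperature in (0, 10)
print('eta0 = %0.1f, beta0 = %0.1f' % (eta0, beta0))
```

Because the optimizer works on the unconstrained scale, any real-valued input maps back into a valid learning rate or inverse temperature.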

# README
I'm not 100% confident in these answers so far given ambiguities in Sutton & Barto, ch. 6/12. 

Also everything desperately needs to be rewritten and organized.

## Part 1
### The Setup

We have three states, $\{S_1, S_2, S_3\}$, and two actions, $\{A_1, A_2\}$, where the state-action pairs at the first state have a unique transition structure: $A_1 \mid S_1 \rightarrow S_2$ and $A_2 \mid S_1 \rightarrow S_3$. 

The challenge is to learn the value of each state-action pair so as to maximize reward over time. We will assign a new set of variables, $Q$ values, to represent these estimates: 

|       | $A_1$     | $A_2$     |
|:-----:|:---------:|:---------:|
| $S_1$ | $Q_{1,1}$ | $Q_{1,2}$ |
| $S_2$ | $Q_{2,1}$ | $Q_{2,2}$ |
| $S_3$ | $Q_{3,1}$ | $Q_{3,2}$ |

where the notation denotes *[state, action]*. Note that these values will update over time with each additional task trial completed. 
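
A minimal sketch of how this table can be held in code, assuming zero-based indexing (so `Q[0, 1]` stands for $Q_{1,2}$) and the same softmax choice rule defined in the helper cell. The numeric values and the choice of `beta=2.0` are made up purely for illustration.

```python
import numpy as np

def softmax(arr, beta=1):
    '''Softmax with inverse temperature beta.'''
    return np.exp(beta * arr) / np.nansum(np.exp(beta * arr))

## Rows are states (S1, S2, S3); columns are actions (A1, A2).
n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))

## With all Q-values equal, both actions are equally likely.
p_uniform = softmax(Q[0], beta=2.0)

## Illustrative (made-up) values: A1 looks better than A2 at S1.
Q[0] = [1.0, 0.0]
p_biased = softmax(Q[0], beta=2.0)
print(p_uniform, p_biased)
```

Larger values of `beta` push the choice probabilities further toward the action with the higher $Q$-value.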

### Model-Free Learning with SARSA-TD(0)

The general TD(0) learning rule for a trial $t$:

$$ \delta =  R_t + \gamma Q(s',a') - Q(s,a) $$

$$ Q(s,a) = Q(s,a) + \eta \delta $$

where $Q(s,a)$ is the value of the current state-action pair; $Q(s',a')$ is the value of the next state-action pair; $\eta$ is the learning rate; and $\gamma$ is a discounting factor. In the following, we will set the discounting parameter $\gamma=1$ because the two-step task does not allow subjects to choose between rewards at different delays.

1) At the final, or terminal, states we know that there is no next state-action pair. Thus, the formula simplifies to:

$$ \delta = R - Q(s,a) $$

$$ Q(s,a) = Q(s,a) + \eta \delta $$

2) At the initial state we know the agent does not experience reward, and we have set $\gamma=1$. Thus the formula simplifies to:
    
$$ \delta = Q(s',a') - Q(s,a) $$
    
$$ Q(s,a) = Q(s,a) + \eta \delta $$
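
The two special cases can be sketched as a single-trial update. This is an illustration under stated assumptions, not the fitted model: the function name `td0_trial` is hypothetical, states and actions are zero-indexed, and the second-stage state is taken to be `a1 + 1` (the deterministic mapping described in the setup).

```python
import numpy as np

def td0_trial(Q, a1, a2, r, eta, gamma=1.0):
    '''Apply one trial of SARSA-TD(0) updates and return a new Q table.'''
    Q = Q.copy()
    s, s_prime = 0, a1 + 1   # assumed mapping: action 0 -> state 1, action 1 -> state 2

    ## Initial state: no reward, so delta uses only the next Q-value.
    delta1 = gamma * Q[s_prime, a2] - Q[s, a1]
    Q[s, a1] += eta * delta1

    ## Terminal state: no next state-action pair, so delta uses only the reward.
    delta2 = r - Q[s_prime, a2]
    Q[s_prime, a2] += eta * delta2
    return Q

Q0 = np.zeros((3, 2))
Q1 = td0_trial(Q0, a1=0, a2=1, r=1, eta=0.5)   # first rewarded trial
Q2 = td0_trial(Q1, a1=0, a2=1, r=1, eta=0.5)   # same choices again
print(Q1[1, 1], Q2[0, 0], Q2[1, 1])
```

Note that the first-stage value only starts to move on the second trial, once the second-stage value it bootstraps from is nonzero.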

---

### Model-Free Learning with SARSA-TD($\lambda$)

If we extend the model with eligibility traces, then for each state-action pair we have (with $\gamma=1$):

$$ e(s,a)_t = \begin{cases} \lambda \ e(s,a)_{t-1} + 1 & \text{if } (s,a) = (s_t,a_t) \\ \lambda \ e(s,a)_{t-1} & \text{otherwise} \end{cases} $$

where $e(s,a)_{t=0}=0$. At the terminal state we have:

$$ \delta = R - Q(s,a) $$

and each $Q$-value is updated such that:

$$ Q(s,a) = Q(s,a) + \eta \delta e(s,a) $$
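
One reading of this update, matching what `sarsa_ngl` does in code, is sketched below for a single trial. It assumes traces are reset at the start of every trial, that only the terminal prediction error is applied, and that the second-stage state is `a1 + 1`; the function name `sarsa_lambda_trial` is hypothetical.

```python
import numpy as np

def sarsa_lambda_trial(Q, a1, a2, r, eta, lambd):
    '''Apply one trial of the terminal-delta SARSA-TD(lambda) update.'''
    Q = Q.copy()
    s, s_prime = 0, a1 + 1   # assumed mapping from first-stage action to state

    ## Build eligibility traces step by step (gamma = 1).
    E = np.zeros_like(Q)
    E[s, a1] += 1            # visit the first-stage pair
    E *= lambd               # decay all traces on the transition
    E[s_prime, a2] += 1      # visit the second-stage pair

    ## Terminal prediction error, credited to every eligible pair.
    delta = r - Q[s_prime, a2]
    Q += eta * delta * E
    return Q

Q_new = sarsa_lambda_trial(np.zeros((3, 2)), a1=1, a2=0, r=1, eta=0.5, lambd=0.5)
print(Q_new)
```

With `lambd=0` the first-stage value receives no credit from the reward, and with `lambd=1` it receives the same update as the second-stage value.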

In [2]:
def sarsa_ngl(params, Y, R, n_states=3, n_actions=2):
    '''Negative log-likelihood function of SARSA-TD(lambda) model.
    
    Parameters
    ----------
    params : list
      SARSA model parameters, (eta, lambda, beta)
    Y : 2d array
      Choices of participant (see notes).
    R : 1d array
      Reward earned.
    n_states : int
      Total number of unique states in task.
    n_actions : int
      Total number of unique actions in task.
    
    Returns
    -------
    log_lik : scalar
      Negative log-likelihood of data given parameters.
      
    Notes
    -----
    The choice data should be a 2d matrix of size [N,M]
    where N is the number of trials and M is the number
    of states visited per trial. Matrix should be pythonic
    such that first state is s=0.
    '''
    
    ## Extract parameters.
    eta, lambd, beta = params
    
    ## Transform into proper units.
    eta = phi_approx(eta)
    lambd = phi_approx(lambd)
    beta = phi_approx(beta) * 10
    
    ## Initialize Q-values.
    Q = np.zeros((n_states, n_actions))
    
    log_lik = 0
    for i in np.arange(R.size):
                  
        ## Compute likelihood of choice at first state (s).
        s = 0
        theta = softmax(Q[s], beta)
        log_lik += np.log(theta[Y[i,0]])
        
        ## Compute likelihood of choice at second state (s_prime).
        s_prime = Y[i,0] + 1
        theta = softmax(Q[s_prime], beta)
        log_lik += np.log(theta[Y[i,1]])
        
        ## Compute reward prediction error (delta).
        delta = R[i] - Q[s_prime, Y[i,1]]
        
        ## Compute eligibility traces.
        E = np.zeros_like(Q)
        E[s,Y[i,0]] = lambd
        E[s_prime,Y[i,1]] = 1
        
        ## Update Q-values.
        Q += eta * delta * E
        
    return -log_lik

### Model-Based Learning

1) At the final, or terminal, states we know that there is no next state-action pair. Thus, the formula simplifies to:

$$ \delta = R - Q(s,a) $$

$$ Q(s,a) = Q(s,a) + \eta \delta $$

2) At the initial state we have:

$$ Q(s_1,a_1) = p(s_2 \mid s_1, a_1) \max_{a \in \{a_1, a_2\}} Q(s_2,a) + p(s_3 \mid s_1, a_1) \max_{a \in \{a_1, a_2\}} Q(s_3,a) $$

$$ Q(s_1,a_2) = p(s_2 \mid s_1, a_2) \max_{a \in \{a_1, a_2\}} Q(s_2,a) + p(s_3 \mid s_1, a_2) \max_{a \in \{a_1, a_2\}} Q(s_3,a) $$
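
Both first-stage equations can be written as one matrix-vector product. The sketch below assumes a transition matrix `T` with the 0.7/0.3 probabilities that `mb_ngl` hard-codes; the names `T` and `mb_first_stage` and the example $Q$-values are hypothetical.

```python
import numpy as np

## Assumed transition probabilities: rows are first-stage actions (a1, a2),
## columns are second-stage states (s2, s3).
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

def mb_first_stage(Q, T):
    '''Return model-based first-stage values [Q(s1,a1), Q(s1,a2)].'''
    v = Q[1:].max(axis=1)   # best available value at s2 and at s3
    return T @ v            # expectation over second-stage states

## Made-up second-stage values for illustration.
Q = np.array([[0.0, 0.0],
              [1.0, 0.5],
              [0.2, 0.0]])
Q[0] = mb_first_stage(Q, T)
print(Q[0])
```

Unlike the model-free update, the first-stage values here are recomputed from scratch every trial rather than nudged by a learning rate.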

In [3]:
def mb_ngl(params, Y, R, n_states=3, n_actions=2):
    '''Negative log-likelihood function of model-based learning.
    
    Parameters
    ----------
    params : list
      Model parameters (eta, beta).
    Y : 2d array
      Choices of participant (see notes).
    R : 1d array
      Reward earned.
    n_states : int
      Total number of unique states in task.
    n_actions : int
      Total number of unique actions in task.
    
    Returns
    -------
    log_lik : scalar
      Negative log-likelihood of data given parameters.
      
    Notes
    -----
    The choice data should be a 2d matrix of size [N,M]
    where N is the number of trials and M is the number
    of states visited per trial. Matrix should be pythonic
    such that first state is s=0.
    '''
    
    ## Extract parameters.
    eta, beta = params
    
    ## Transform into proper units.
    eta = phi_approx(eta)
    beta = phi_approx(beta) * 10
    
    ## Initialize Q-values.
    Q = np.zeros((n_states, n_actions))
    
    log_lik = 0
    for i in np.arange(R.size):
                  
        ## Compute likelihood of choice at first state (s).
        s = 0
        theta = softmax(Q[s], beta)
        log_lik += np.log(theta[Y[i,0]])
        
        ## Compute likelihood of choice at second state (s_prime).
        s_prime = Y[i,0] + 1
        theta = softmax(Q[s_prime], beta)
        log_lik += np.log(theta[Y[i,1]])
        
        ## Compute reward prediction error (delta).
        delta = R[i] - Q[s_prime, Y[i,1]]        
        
        ## Update Q-values.
        Q[s_prime, Y[i,1]] += eta * delta                    # Terminal state
        Q[0, 0] = 0.7 * np.max(Q[1]) + 0.3 * np.max(Q[2])    # s1, a1
        Q[0, 1] = 0.3 * np.max(Q[1]) + 0.7 * np.max(Q[2])    # s1, a2
        
    return -log_lik

## Part 2
### Load and prepare data

In [4]:
## Load matlab file (Paula's data).
mat = loadmat('Subj100_2018-4-20_11-1-37.mat')
# mat = loadmat('data.mat')

## Assemble choice data.
Y = np.array([mat['choice1'].squeeze(), 
              mat['choice2'].squeeze()]).T
Y -= 1

## Assemble reward data.
R = mat['money'].squeeze()

## Remove trials with missing data.
indices = np.invert(np.any(Y<0, axis=-1))
Y = Y[indices]
R = R[indices]
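
The preparation steps above can be tried on toy arrays without the data file. Everything in this sketch is hypothetical: the values are invented, 0 is assumed to mark a missed trial in the 1-indexed raw choices, and the unsigned dtype is only there to show why a cast to a signed integer type before subtracting can be a useful precaution.

```python
import numpy as np

## Hypothetical raw choices, 1-indexed, with 0 assumed to mark a missed trial.
choice1 = np.array([1, 2, 0, 2, 1], dtype=np.uint8)
choice2 = np.array([2, 1, 1, 0, 1], dtype=np.uint8)
money = np.array([1, 0, 0, 0, 1])

## Cast to a signed integer type before subtracting, so that 0 becomes -1
## instead of wrapping around in an unsigned type.
Y = np.array([choice1, choice2]).T.astype(int) - 1
R = money

## Keep only trials where both choices were recorded.
indices = ~np.any(Y < 0, axis=-1)
Y, R = Y[indices], R[indices]
print(Y.tolist(), R.tolist())
```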

### Model-Free Learning with SARSA-TD($\lambda$)

In [5]:
## Define initial parameters.
x0 = np.zeros(3, dtype=float)

## Minimize negative log-likelihood.
fit = minimize(sarsa_ngl, x0, args=(Y,R))
print('Convergence = %s' %fit.success)

## Extract parameters.
eta, lambd, beta = phi_approx(fit.x)
beta *= 10
print('Estimated Parameters')
print('eta = %0.3f' %eta)
print('lambda = %0.3f' %lambd)
print('beta = %0.3f' %beta)

Convergence = True
Estimated Parameters
eta = 0.244
lambda = 0.494
beta = 1.896


### Model-Based Learning

In [6]:
## Define initial parameters.
x0 = np.zeros(2, dtype=float)

## Minimize negative log-likelihood.
fit = minimize(mb_ngl, x0, args=(Y,R))
print('Convergence = %s' %fit.success)

## Extract parameters.
eta, beta = phi_approx(fit.x)
beta *= 10
print('Estimated Parameters')
print('eta = %0.3f' %eta)
print('beta = %0.3f' %beta)

Convergence = False
Estimated Parameters
eta = 0.257
beta = 1.939


## Appendix: Stan Implementations
### Model-Free Learning with SARSA-TD($\lambda$)

In [7]:
try: 
    import pystan
except ModuleNotFoundError: 
    pass
else:

    ## Compile Model.
    StanMF = pystan.StanModel(file='stan_models/sarsa.stan')
    
    ## Fit model.
    data = dict(N=R.size, Y=Y+1, R=R)
    fit = StanMF.sampling(data=data, chains=4, iter=1250, 
                          warmup=1000, seed=47404, n_jobs=4)
    
    print(fit)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_3e098205698f4bfaf54ae1d60e6439d9 NOW.


Inference for Stan model: anon_model_3e098205698f4bfaf54ae1d60e6439d9.
4 chains, each with iter=1250; warmup=1000; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=1000.

            mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
eta_pr     -0.39    0.03   0.52  -1.19  -0.75  -0.47  -0.08   0.82    352    1.0
lambda_pr   0.32    0.04   0.72  -1.03  -0.16   0.25   0.74   2.06    372    1.0
beta_pr    -1.08    0.02   0.29  -1.66  -1.27  -1.08  -0.87  -0.51    293    1.0
eta         0.36  9.3e-3   0.18   0.12   0.23   0.32   0.47   0.79    358    1.0
lambda       0.6    0.01   0.22   0.15   0.44    0.6   0.77   0.98    393    1.0
beta        1.51    0.04   0.66   0.48   1.02   1.41   1.92   3.05    284    1.0
lp__      -272.9    0.07   1.33 -276.5 -273.4 -272.6 -272.0 -271.5    338    1.0

Samples were drawn using NUTS at Sat Apr 21 19:34:41 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

### Model-Based Learning

In [10]:
try: 
    import pystan
except ModuleNotFoundError: 
    pass
else:

    ## Compile Model.
    StanMB = pystan.StanModel(file='stan_models/mb.stan')
    
    ## Fit model.
    data = dict(N=R.size, Y=Y+1, R=R)
    fit = StanMB.sampling(data=data, chains=4, iter=1250, 
                          warmup=1000, seed=47404, n_jobs=4)
    
    print(fit)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_cab21fa26a61287d8736a1f9c7bdd386 NOW.


Inference for Stan model: anon_model_cab21fa26a61287d8736a1f9c7bdd386.
4 chains, each with iter=1250; warmup=1000; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=1000.

          mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
eta_pr   -0.46    0.02   0.42  -1.11  -0.72  -0.53  -0.28   0.64    351    1.0
beta_pr  -1.01    0.02   0.31  -1.62  -1.17  -0.98  -0.81   -0.5    249   1.01
eta       0.33  7.8e-3   0.15   0.13   0.23    0.3   0.39   0.74    354    1.0
beta      1.67    0.04   0.67   0.52   1.21   1.64    2.1   3.07    291   1.01
lp__    -274.5     0.1   1.35 -277.7 -274.9 -274.1 -273.6 -273.3    179   1.01

Samples were drawn using NUTS at Sat Apr 21 19:35:48 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
