# TD with Emphasis

According to the definitive source, the algorithm can be described (for accumulating traces) via:

\begin{align}
\delta_t &= R_{t+1} + \gamma_{t+1}\theta_{t}^{\top} \phi_{t+1} - \theta_{t}^{\top}\phi_{t} 
\\
F_{t} &= \rho_{t-1}\gamma_{t} F_{t-1} + I_{t}
\\
M_{t} &= \lambda_{t}I_{t} + (1 - \lambda_{t})F_{t} 
\\
e_{t} &= \gamma_{t} \lambda_{t} e_{t-1} + \alpha_{t} M_{t} \phi_{t}
\\
\theta_{t+1} &= \theta_{t} + \delta_{t} e_{t}
\end{align}

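If we hold the parameters constant across time steps, and assume that $\gamma_{T} = 0$ and $\phi_{T} = \vec{0}$ in the terminal state, we can write it more simply as follows (with a constant $\alpha$, moving the step size from the trace into the weight update leaves the updates unchanged):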

\begin{align}
\delta_t &= R_{t+1} + \gamma \theta_{t}^{\top} \phi_{t+1} - \theta_{t}^{\top}\phi_{t} 
\\
F_{t} &= \rho_{t-1}\gamma F_{t-1} + I_{t}
\\
M_{t} &= \lambda I_{t} + (1 - \lambda)F_{t} 
\\
e_{t} &= \gamma \lambda e_{t-1} + M_{t} \phi_{t}
\\
\theta_{t+1} &= \theta_{t} + \alpha \delta_{t} e_{t}
\end{align}
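As a rough illustration, the simplified update above can be sketched as a single step function. This is a minimal sketch, not a reference implementation: the function name `etd_step`, its argument names, and the default parameter values are all assumptions made for this example, and the follow-on trace is assumed to start from `F = 0` before the first step.

```python
import numpy as np

def etd_step(theta, e, F, phi, phi_next, reward, interest, rho_prev,
             gamma=1.0, lam=0.0, alpha=0.1):
    """One step of the simplified emphatic TD update (accumulating traces).

    All names here are hypothetical; they mirror the symbols in the equations.
    Returns the updated (theta, e, F).
    """
    # TD error: delta_t = R_{t+1} + gamma * theta^T phi_{t+1} - theta^T phi_t
    delta = reward + gamma * theta @ phi_next - theta @ phi
    # Follow-on trace: F_t = rho_{t-1} * gamma * F_{t-1} + I_t
    F = rho_prev * gamma * F + interest
    # Emphasis: M_t = lambda * I_t + (1 - lambda) * F_t
    M = lam * interest + (1.0 - lam) * F
    # Eligibility trace: e_t = gamma * lambda * e_{t-1} + M_t * phi_t
    e = gamma * lam * e + M * phi
    # Weight update: theta_{t+1} = theta_t + alpha * delta_t * e_t
    theta = theta + alpha * delta * e
    return theta, e, F

# Usage: a single transition from a one-hot state into the terminal state.
theta, e, F = etd_step(
    theta=np.zeros(2), e=np.zeros(2), F=0.0,
    phi=np.array([1.0, 0.0]), phi_next=np.zeros(2),  # phi_T = 0 at termination
    reward=1.0, interest=1.0, rho_prev=1.0,
)
```

Because $\phi_{T} = \vec{0}$, the bootstrap term vanishes on the final transition, which is how the constant-$\gamma$ form handles termination.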


# General

## Particular Values

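For $\gamma = 1$ (and $\rho = 1$, i.e. on-policy), the follow-on trace reduces to a running sum of the interest: $F_{t} = \sum_{k=1}^{t} I_{k}$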

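In this case, we have to start worrying about precision: $F_{t}$ grows without bound as the episode lengthens, so in a sufficiently long episode the updates for late states can dwarf those for early states, to the point where early states barely matter at all.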
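To make the growth concrete, here is a small sketch that iterates the follow-on recursion directly. The helper name `followon_trace` and its arguments are assumptions for illustration; it takes $\rho = 1$ unless a list of importance ratios is supplied, and starts from $F = 0$ before the first step.

```python
def followon_trace(interests, gamma=1.0, rhos=None):
    """Return [F_0, F_1, ...] for F_t = rho_{t-1} * gamma * F_{t-1} + I_t.

    Hypothetical helper: rhos defaults to 1 everywhere (on-policy).
    """
    F, out = 0.0, []
    for t, interest in enumerate(interests):
        # rho_{t-1} is irrelevant at t = 0 because the previous F is zero
        rho_prev = 1.0 if (rhos is None or t == 0) else rhos[t - 1]
        F = rho_prev * gamma * F + interest
        out.append(F)
    return out

# Constant interest, undiscounted: F grows linearly with the time step.
uniform = followon_trace([1.0] * 5)

# Interest only at the first step, undiscounted: F stays at 1 forever.
start_only = followon_trace([1.0, 0.0, 0.0, 0.0, 0.0])

# Constant interest, discounted: F stays below 1 / (1 - gamma).
discounted = followon_trace([1.0] * 200, gamma=0.9)
```

With constant interest and $\gamma = 1$ the trace is just the step count, whereas any $\gamma < 1$ keeps it bounded by $1 / (1 - \gamma)$.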

## Least Squares Methods

LSTD and ELSTD both get the feature weights via 

$$\theta = A^{-1} b$$

where

$$A_{t} = A_{0} + \sum_{k=1}^{t} z_{k} (\phi_{k} - \gamma_{k+1} \phi_{k+1})^{\top}$$

$$b_{t}  = b_0 + \sum_{k=1}^{t} z_{k} R_{k+1}$$

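The main difference comes from how the trace $z_{k}$ is updated: ELSTD scales the feature vector by the emphasis $M_{k}$ before adding it to the trace, whereas LSTD adds $\phi_{k}$ unweighted.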
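The following sketch accumulates $A$ and $b$ over one episode and solves for $\theta$, switching between an unweighted trace and an emphasis-weighted one. It is only an illustration under several assumptions: the function name `least_squares_td` is hypothetical, the setting is on-policy ($\rho = 1$) with constant $\gamma$ and $\lambda$, $A_{0}$ is taken to be a small multiple of the identity so the solve is well posed, $b_{0} = 0$, and indexing starts at 0 rather than 1.

```python
import numpy as np

def least_squares_td(phis, rewards, interests=None, gamma=1.0, lam=0.0, reg=1e-6):
    """Solve theta = A^{-1} b from a single episode.

    phis:      array of shape (T + 1, n); the last row is the terminal phi (zeros)
    rewards:   length-T sequence, rewards[k] is the reward after step k
    interests: None for a plain LSTD trace, else per-step interest (emphatic trace)
    reg:       assumed choice A_0 = reg * I (not taken from the text)
    """
    n = phis.shape[1]
    A = reg * np.eye(n)
    b = np.zeros(n)
    z = np.zeros(n)
    F = 0.0
    for k in range(len(rewards)):
        if interests is None:
            M = 1.0                       # unweighted trace
        else:
            F = gamma * F + interests[k]  # follow-on trace with rho = 1
            M = lam * interests[k] + (1.0 - lam) * F
        z = gamma * lam * z + M * phis[k]
        A += np.outer(z, phis[k] - gamma * phis[k + 1])
        b += z * rewards[k]
    return np.linalg.solve(A, b)

# Usage: a tabular two-state chain s0 -> s1 -> terminal with rewards 1 and 2.
phis = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
rewards = [1.0, 2.0]
theta_lstd = least_squares_td(phis, rewards)
theta_elstd = least_squares_td(phis, rewards, interests=[1.0, 1.0])
```

In this tabular example both variants recover values close to 3 and 2 for the two states: the emphasis rescales each state's row of $A$ and its entry of $b$ by the same factor, so it drops out of the solution.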


## When Should We Expect Emphasis To Be Better?

In the tabular case, both algorithms should perform about equally well, provided that the emphasis $M_{t}$ is nonzero during the episode (more precisely, so long as it is nonzero whenever $\phi_{t}$ is nonzero).

We can further say that there is no advantage to using emphasis when there is no state aliasing during the episodes under consideration, because this reduces to the tabular case as well.

However, we *will* expect emphatic algorithms to perform better according to our interest-weighted error measure when there is state aliasing.
The emphasis placed on states with more interest (and their successors for $\lambda > 0$) will weight them more highly, making their approximation more accurate than their counterparts which might have a similar feature vector but less interest.

## What should we set interest to be? (Undiscounted)

In the undiscounted case, we probably want to estimate the value (the expected return) of the start state most accurately.

As such, having an interest that is nonzero at any other time presents a bit of an issue, because with $\gamma = 1$ the follow-on trace is the sum of the interest up to that point in time, and $M_{t}$ varies linearly with $F_{t}$, so later states can end up with more emphasis than the start state.

What about having interest be allocated to *only* the start state?

This too presents a problem, because trajectories that repeatedly visit the start state will be weighted more highly, and I am not sure how to interpret that: it is neither quite Every-Visit MC nor First-Visit MC.

So it's probably best to set the interest to `1` at the first time step of each episode (that is, only on the initial visit to the start state) and to `0` at all other times.

## What should we set lambda to be?

In the case where $\lambda = 0$, we have essentially TD(0) for the undiscounted case with interest in the first state only: on-policy, $F_{t} = 1$ throughout the episode, so $M_{t} = 1$ at every step.

In the case where $\lambda = 1$ with start-state interest, we are essentially performing First-Visit MC on the initial state only.

In the case where $\lambda = 1$ with interest on the first visit to each state, we are performing the First-Visit MC version of TD.
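The $\lambda = 1$, start-state-interest case can be checked numerically. The sketch below uses made-up random features and rewards (purely illustrative data), holds $\theta$ fixed over the episode (an offline, accumulate-then-apply assumption, since online updates would perturb the equality), and compares the summed emphatic updates against a Monte Carlo update on the initial state alone.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 6, 3
# Random features for T steps, followed by the zero terminal feature vector.
phis = np.vstack([rng.normal(size=(T, n)), np.zeros((1, n))])
rewards = rng.normal(size=T)
theta = rng.normal(size=n)          # held fixed during the episode
gamma, lam, alpha = 1.0, 1.0, 0.1

interests = np.zeros(T)
interests[0] = 1.0                  # interest only at the first time step

e = np.zeros(n)
F = 0.0
total_update = np.zeros(n)
for t in range(T):
    delta = rewards[t] + gamma * theta @ phis[t + 1] - theta @ phis[t]
    F = gamma * F + interests[t]                      # rho = 1 (on-policy)
    M = lam * interests[t] + (1.0 - lam) * F
    e = gamma * lam * e + M * phis[t]
    total_update += alpha * delta * e

# Monte Carlo update applied to the initial state only.
G0 = sum(gamma**t * rewards[t] for t in range(T))
mc_update = alpha * (G0 - theta @ phis[0]) * phis[0]

print(np.allclose(total_update, mc_update))
```

With $\lambda = 1$ the emphasis is $M_{t} = I_{t}$, so the trace stays at $\phi_{0}$ for the whole episode and the TD errors telescope to $G_{0} - \theta^{\top}\phi_{0}$.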