# Comparison vs State of the Art
Let's look at the mechanics of how my solutions compare to the state of the art on the [OpenAI Gym CartPole Leaderboard](https://github.com/openai/gym/wiki/Leaderboard).

## Ben Harris
[Ben Harris Solution](https://github.com/Ben-C-Harris/Reinforcement-Learning-Pole-Balance/blob/master/kerasPoleBalance.py)

Key differences between [my solution](https://github.com/unoti/ai-gym-experiments/blob/master/cartpole_lab/deeprico.py) [as it stands now](https://github.com/unoti/ai-gym-experiments/commit/ddd7b4e2f9f52ac346534cba3e0e44fb970c4393#diff-658aac99e3937373c82819a6d3ffa7a6) and Ben's:

 | Issue | Ben | Mine |
 |:-------|-----|------|
 | Batch size | 10 | 20 |
 | Model | 24, 24, 2 outputs | 64, 64, 2 outputs |
 | Q Update | see below | Maybe different? |
 | Reward at terminal | -1 | 0, and maybe different |
 | Future reward | `np.amax(..)` | `np.max(..)` |
 | Adam parameters | defaults | beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0 |
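
To put the Model row in perspective, here is a rough parameter count for the two layouts. This is a sketch that assumes plain fully connected layers with biases and a four-component CartPole observation as input; `dense_param_count` is a hypothetical helper, not something from either repository.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

OBSERVATION_SIZE = 4  # assumption: CartPole observations have four components

ben_params = dense_param_count([OBSERVATION_SIZE, 24, 24, 2])
my_params = dense_param_count([OBSERVATION_SIZE, 64, 64, 2])
print(ben_params, my_params)  # → 770 4610
```

Under those assumptions my network has about six times as many trainable parameters as Ben's.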
 
 Is np.amax the same as np.max in this context?
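
NumPy documents `np.amax` as an alias of `np.max`, so the only thing left to check is the indexing: Ben takes the max of the first row, I take the max of the whole array. A quick check, assuming `predict` returns an array of shape `(1, 2)` for a single observation (the values here are made up):

```python
import numpy as np

# Hypothetical Q-values for one state, shaped (1, 2) as a Keras-style
# predict() call on a single observation would typically return.
q_values = np.array([[0.25, 1.5]])

ben_style = np.amax(q_values[0])  # Ben: max over the first (only) row
my_style = np.max(q_values)       # mine: max over the whole array

assert ben_style == my_style == 1.5
```

With a single-row prediction the two expressions give the same number. They would only diverge if `predict` were called on a batch with more than one row.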

Pretty much everything else is the same:
 * **Memory size** We're both using 1,000,000
 * **Hyperparameters** Learning rate=0.001, gamma=0.95, epsilon_start=1, epsilon_min=0.01, epsilon_decay=0.995 on each step
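
To get a feel for the exploration schedule, here is a small sketch of how long epsilon takes to reach its floor. It assumes the decay is multiplicative and clamped at `epsilon_min` on every step; `steps_to_min_epsilon` is a hypothetical helper written for this note.

```python
import math

def steps_to_min_epsilon(start=1.0, minimum=0.01, decay=0.995):
    """Count multiplicative decay steps until epsilon first reaches the floor."""
    epsilon, steps = start, 0
    while epsilon > minimum:
        epsilon = max(minimum, epsilon * decay)  # decay, but never below the floor
        steps += 1
    return steps

steps = steps_to_min_epsilon()
# Closed form for comparison: smallest n with start * decay**n <= minimum
closed_form = math.ceil(math.log(0.01 / 1.0) / math.log(0.995))
print(steps, closed_form)  # → 919 919
```

So under that assumption both agents stop exploring almost entirely after roughly 920 learning steps, which is only a handful of episodes once the pole stays up for a while.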
 

Things to investigate:
 * **Unreliable random starts.** Is the solution highly vulnerable to getting a bad random seed for the initial network weights?  Do you need to just get lucky?
   * I could determine this by downloading their solution and seeing how reliably it converges to "solved" and comparing that to mine
   * So run theirs a few times and record the results carefully, and repeat the same thing for mine.
 * **np.amax** vs **np.max**. I've researched this before and determined it's the same thing, but it's worth another look.
 * **Model**.  My model size is different from Ben's.  The architecture is the same, I think, but the number of nodes per layer is very different.  Could this be the difference?  Unfortunately there's no real way to check whether the model is underfitting or overfitting, unless I could take a fully-trained champion model and compare mean squared error on theirs versus mine or something...
 * **Q Update**.  Ben's Q update function is different from mine.
 * **Reward at Terminal**.  Ben's function manually tweaks the rewards to -1 somewhere, and mine does not.
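
For the random-start question, the comparison could be organized along these lines. This is only a sketch of the bookkeeping: `run_trials`, `summarize`, and `fake_train` are hypothetical names, and `fake_train` is a stand-in that returns made-up numbers where a real training run (Ben's or mine) would go.

```python
import random
import statistics

def run_trials(train_fn, seeds):
    """Call train_fn(seed) once per seed; each result is episodes-to-solve or None."""
    return {seed: train_fn(seed) for seed in seeds}

def summarize(results):
    """Report how often training solved the task and how long it typically took."""
    solved = [n for n in results.values() if n is not None]
    return {
        "runs": len(results),
        "solved": len(solved),
        "solve_rate": len(solved) / len(results),
        "median_episodes": statistics.median(solved) if solved else None,
    }

def fake_train(seed):
    # Placeholder only: pretends about 70% of seeds solve the task.
    rng = random.Random(seed)
    return rng.randint(100, 400) if rng.random() < 0.7 else None

results = run_trials(fake_train, seeds=range(10))
summary = summarize(results)
print(summary)
```

Running the same harness against both solutions with the same list of seeds would give two solve rates that can be compared directly.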

I'll investigate all of these issues, starting with the Q Update function.

### Q Update Comparison
The Q update step is different in Ben's. This issue has been gnawing at me for a while now, so I'd like to get to the bottom of it.  Both Ben's update function and my own appear to differ from the SARSA update function described in the [Sutton and Barto textbook](http://incompleteideas.net/book/RLbook2018.pdf).  I've tried implementing the one in the book precisely and couldn't make it work exactly as written.

Here's the SARSA update formula from *Sutton and Barto* on p. 130:

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] $$

 * 𝛼: learning rate 0-1
 * 𝛾: future reward discount rate 0-1



At a glance, my update function and Ben's look like they may be algebraically equivalent, but I'm not sure.  Let's work it through and see how Ben's compares to mine, and how both of them compare to official SARSA.

#### Ben's Q Update Function

```python
import numpy as np

def ben_q_update(self, state, action, reward, next_state, terminal):
    q_update = reward  # Updated quality score
    if not terminal:
        # gamma = discount factor, applied to the max predicted quality of the next state
        q_update = reward + gamma * np.amax(self.model.predict(next_state)[0])
    q_values = self.model.predict(state)
    q_values[0][action] = q_update
    self.model.fit(state, q_values)
```

#### My Q Update function

```python
def rico_q_update(self, prev_state, prev_action, reward, state, done):
    rewards_all = self.model.predict(prev_state)
    if done:
        future_rewards = 0
    else:
        future_rewards = np.max(self.model.predict(state))
    target_reward = reward + self.gamma * future_rewards
    rewards_all[prev_action] = target_reward
    self.model.train(prev_state, rewards_all)
```

#### Transform to make similar

```python
def ben_q_update(self, state, action, reward, next_state, terminal):
    if terminal:
        q_update = reward  # Updated quality score
    else:
        # gamma = discount factor, applied to the max predicted quality of the next state
        q_update = reward + gamma * np.amax(self.model.predict(next_state)[0])
    q_values = self.model.predict(state)
    q_values[0][action] = q_update
    self.model.fit(state, q_values)

def rico_q_update(self, state, action, reward, next_state, done):
    if done:
        future_rewards = 0
    else:
        future_rewards = np.max(self.model.predict(next_state))
    q_values = self.model.predict(state)
    q_update = reward + self.gamma * future_rewards
    q_values[action] = q_update
    self.model.train(state, q_values)
```

They look pretty similar so far, except for what happens in terminal state.  Let's rewrite Rico's to calculate q_update entirely within the if statement.

```python
def rico_q_update(self, state, action, reward, next_state, done):
    if done:
        q_update = reward
    else:
        q_update = reward + self.gamma * np.max(self.model.predict(next_state))
    q_values = self.model.predict(state)
    q_values[action] = q_update
    self.model.train(state, q_values)
```

That looks exactly the same as Ben's, apart from the name of the training method and how the prediction array is indexed.  For now I'll conclude that the two update functions compute the same thing.  I like the wording of my original function better, because it more intuitively describes the spirit of what we're doing with the update.
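
As a sanity check on that conclusion, the target computation from each function can be pulled out and run against a stub. This is a sketch, not either author's real code: `StubModel`, `ben_target`, and `rico_target` are hypothetical names, and the stub simply returns fixed Q-values of shape `(1, 2)`.

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for the network: always predicts the same Q-values."""
    def __init__(self, q_values):
        self.q_values = np.array([q_values], dtype=float)  # shape (1, n_actions)

    def predict(self, state):
        return self.q_values.copy()

def ben_target(model, reward, next_state, terminal, gamma):
    # Target as computed in Ben's version
    q_update = reward
    if not terminal:
        q_update = reward + gamma * np.amax(model.predict(next_state)[0])
    return q_update

def rico_target(model, reward, next_state, done, gamma):
    # Target as computed in my original version
    if done:
        future_rewards = 0
    else:
        future_rewards = np.max(model.predict(next_state))
    return reward + gamma * future_rewards

model = StubModel([0.5, 2.0])
for done in (False, True):
    ben = ben_target(model, 1.0, None, done, gamma=0.95)
    mine = rico_target(model, 1.0, None, done, gamma=0.95)
    assert np.isclose(ben, mine)
```

Both terminal and non-terminal transitions produce matching targets here, which is consistent with the algebra above.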

But note there's another difference in how the updates work which we'll investigate next.

### Reward at Terminal

Ben has some code that tweaks the reward value to -1 when the episode is over.  Mine doesn't.

Ben has this little ditty in the main episode loop:

```python
def main_loop():
    # ...
    reward = reward if not terminal else -reward
```

If we translate that into code that normal people would rather read, it looks like this:

```python
def main_loop():
    if terminal:
        reward = -reward
```

This comes before the part where the step is inserted into replay memory.
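
To see what this does to the training target, here is a minimal sketch. It assumes the environment hands out a reward of 1 on every step, including the last one, which is how CartPole is documented to behave. `shape_reward` and `td_target` are hypothetical helper names, and the `max_next_q` value is made up.

```python
def shape_reward(reward, terminal):
    """Ben's tweak: flip the sign of the reward on the terminal step."""
    return -reward if terminal else reward

def td_target(reward, max_next_q, terminal, gamma=0.95):
    """Update target: terminal steps get no bootstrapped future reward."""
    return reward if terminal else reward + gamma * max_next_q

env_reward = 1.0  # assumption: the environment pays +1 per step

# Terminal step stored as-is
plain = td_target(env_reward, max_next_q=10.0, terminal=True)
# Terminal step stored with the flipped reward
flipped = td_target(shape_reward(env_reward, True), max_next_q=10.0, terminal=True)

print(plain, flipped)  # → 1.0 -1.0
```

Because the sign flip happens before the step goes into replay memory, every replayed terminal transition trains toward -1 instead of +1, so failing is explicitly penalized rather than merely cutting off future reward.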