
Why are you using SARSA instead of Q-Learning? #94

Closed

laz8 opened this issue Mar 19, 2020 · 1 comment

Comments


laz8 commented Mar 19, 2020

You are doing Q-Learning:

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done, info = env.step(action)

target[i][action[i]] = reward[i] + self.discount_factor * (

But isn't that SARSA?

                a = np.argmax(target_next[i])
                target[i][action[i]] = reward[i] + self.discount_factor * (target_val[i][a])

Is that a mistake or is that a valid approach? I'm new to RL...
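
For reference, a minimal tabular sketch of the two update rules I'm comparing; the names (q, gamma, next_action) are placeholders and not taken from this repo:

    import numpy as np

    def q_learning_target(q, reward, next_state, gamma):
        # Q-learning (off-policy): bootstrap with the greedy action value,
        # i.e. the max over the next state's action values.
        return reward + gamma * np.max(q[next_state])

    def sarsa_target(q, reward, next_state, next_action, gamma):
        # SARSA (on-policy): bootstrap with the action the behaviour policy
        # actually takes in the next state.
        return reward + gamma * q[next_state, next_action]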


laz8 commented May 31, 2020

Closed; I was confused by the different versions of a DDQN.

It is explained here:

What makes this network a Double DQN?

The Bellman equation used to calculate the Q values that update the online network is:

value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

The Bellman equation used to calculate the Q value updates in the original (vanilla) DQN[1] is:

value = reward + discount_factor * max(target_network.predict(next_state))

The difference is that, using the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action to take, whereas the first equation uses the online network for SELECTING the action and the target network for EVALUATING it. Selection here means choosing which action to take, and evaluation means getting the projected Q value for that action. This form of the Bellman equation is what makes this agent a Double DQN rather than a plain DQN; it was introduced in [2].

https://medium.com/@leosimmons/double-dqn-implementation-to-solve-openai-gyms-cartpole-v-0-df554cd0614d
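
As a rough sketch of those two equations in code (assuming Keras-style online_network and target_network models with a batched .predict(); the names are hypothetical, and the terminal-state case is ignored for brevity):

    import numpy as np

    def vanilla_dqn_target(reward, next_state, gamma, target_network):
        # Vanilla DQN: the target network both SELECTS and EVALUATES
        # the next action (max over its own Q estimates).
        q_next = target_network.predict(next_state[np.newaxis])[0]
        return reward + gamma * np.max(q_next)

    def double_dqn_target(reward, next_state, gamma, online_network, target_network):
        # Double DQN: the online network SELECTS the next action (argmax),
        # and the target network EVALUATES that action's Q value.
        a = np.argmax(online_network.predict(next_state[np.newaxis])[0])
        q_next = target_network.predict(next_state[np.newaxis])[0]
        return reward + gamma * q_next[a]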

The names also confused me: everything is called a target, and a lot of things were renamed, which makes the code harder to understand.

But it seems to be correct.

laz8 closed this as completed May 31, 2020