In general, a policy π is a probability distribution over actions. If there is only one optimal action in a state, the optimal policy must deterministically choose that action. If there are multiple optimal actions, the optimal policy can be any distribution over them. I added a note to the slide saying that when multiple actions maximize q_*(s, a), the policy can stochastically choose any of them.
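To make this concrete, here is a minimal sketch, assuming a tabular setting where the values q_*(s, a) for one state are given as a plain list. The function name `optimal_policy` and the tolerance `tol` are hypothetical and do not come from the slides. It builds one possible optimal policy by spreading the probability uniformly over the actions with the highest action value; any other distribution over those same actions would be equally optimal.

```python
def optimal_policy(q_values, tol=1e-12):
    """Return one optimal policy for a single state as a probability
    distribution over actions: uniform over the actions whose q_*(s, a)
    is the largest (ties detected up to `tol`)."""
    best = max(q_values)
    # Mark every action whose value ties with the best one.
    is_optimal = [abs(q - best) <= tol for q in q_values]
    count = sum(is_optimal)
    # Uniform over the optimal actions, zero everywhere else.
    return [1.0 / count if opt else 0.0 for opt in is_optimal]


# A single optimal action gives a deterministic policy.
print(optimal_policy([1.0, 3.0, 2.0]))  # [0.0, 1.0, 0.0]

# Two tied optimal actions: the probability may be split between them.
print(optimal_policy([3.0, 3.0, 2.0]))  # [0.5, 0.5, 0.0]
```

With a unique best action the returned distribution is one-hot, which is why a policy that "returns just the best action" is a special case of a policy that returns a distribution.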
Is pi*(s) a probability distribution, or does it return just the best action? Slide number 5 says that the policy pi computes a probability distribution...