For a detailed discussion, visit: https://sridhartee.blogspot.in/2016/11/policy-gradient-methods.html
We design and test three policy gradient methods in this repository:

- Monte Carlo Policy Gradient: the baseline used is the average of the rewards obtained; using no baseline results in high variance (see the first sketch below).
- Actor-Critic Method: uses a softmax policy for the actor and a Q-learning critic for value-function estimation (second sketch below).
- Numerical Gradient Estimation: perturb the parameters and estimate the gradient by least-squares regression, (X'X)^-1 X'y (third sketch below). Change num_rollouts to set the number of training examples the gradient is estimated from. Note that the actual number of runs is number of episodes * num_rollouts.
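The first sketch illustrates the Monte Carlo policy gradient (REINFORCE) update with the average-of-rewards baseline described above. The two-armed bandit environment, learning rate, and episode count are illustrative assumptions, not the repository's actual code:

```python
# Minimal sketch: REINFORCE with an average-reward baseline on a toy
# 2-armed bandit. Environment and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)          # one preference per action (softmax policy)
alpha = 0.1                  # learning rate (assumed value)
reward_history = []          # running record for the average-reward baseline

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

for episode in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    # Arm 1 pays more on average; the agent should learn to prefer it.
    reward = rng.normal(1.0 if action == 1 else 0.0, 1.0)

    reward_history.append(reward)
    baseline = np.mean(reward_history)   # baseline = average of rewards so far

    # grad log pi(a|theta) for a softmax over action preferences
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # REINFORCE update; subtracting the baseline reduces variance.
    # With baseline = 0 the same update works but is much noisier.
    theta += alpha * (reward - baseline) * grad_log_pi

print("learned action probabilities:", softmax(theta))
```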
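The second sketch shows one plausible reading of the actor-critic method: a tabular softmax actor updated with the critic's Q(s, a) as the policy gradient signal, and a critic trained by standard Q-learning. The chain MDP, update form, and hyperparameters are assumptions for illustration:

```python
# Minimal sketch: softmax actor + Q-learning critic on a hypothetical
# 5-state chain MDP (reach the right end for +1 reward).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2                # actions: 0 = left, 1 = right
theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
Q = np.zeros((n_states, n_actions))       # critic: Q-value estimates
alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.95

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

for episode in range(3000):
    s = 0
    for t in range(50):
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0

        # Critic update: standard Q-learning target
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha_critic * (target - Q[s, a])

        # Actor update: policy gradient using the critic's Q(s, a)
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * Q[s, a] * grad_log_pi

        s = s_next
        if done:
            break

print("preferred action per state (1 = right):", theta.argmax(axis=1))
```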
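The third sketch shows numerical gradient estimation as described above: perturb the parameters, measure the change in return over num_rollouts perturbations, and solve the normal equations (X'X)^-1 X'y for the gradient. The evaluate_policy stand-in and all constants are assumptions; in the repository the reward would come from actual environment rollouts:

```python
# Minimal sketch: estimate the policy gradient by regressing reward
# changes on random parameter perturbations.
import numpy as np

rng = np.random.default_rng(0)
dim = 3                      # number of policy parameters (assumed)
theta = np.zeros(dim)
sigma, alpha = 0.1, 0.2      # perturbation scale and step size (assumed)
num_rollouts = 20            # training examples per gradient estimate

def evaluate_policy(params):
    # Hypothetical stand-in for a rollout's total reward; a real version
    # would run episodes. The optimum here is params = [1, -1, 0.5].
    optimum = np.array([1.0, -1.0, 0.5])
    return -np.sum((params - optimum) ** 2) + rng.normal(0, 0.01)

for episode in range(200):
    X = rng.normal(0.0, sigma, size=(num_rollouts, dim))   # perturbations
    base = evaluate_policy(theta)
    # y: change in reward for each perturbed parameter vector
    y = np.array([evaluate_policy(theta + dx) for dx in X]) - base
    # Least-squares gradient estimate: g = (X'X)^-1 X'y
    g = np.linalg.solve(X.T @ X, X.T @ y)
    theta += alpha * g

print("estimated optimum:", np.round(theta, 2))
```

Note how the total number of policy evaluations here is episodes * num_rollouts (plus one unperturbed run per episode), matching the run count noted in the list above.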