To simulate the MBPO algorithm in the LQR setting (i.e. linear dynamics and quadratic cost).
MBPO optimizes a policy under a learned model, collects data under the updated policy, and uses that data to train a new model.
- True env: LQR (linear dynamics with quadratic cost)
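A minimal sketch of what I mean by the true env, assuming discrete-time dynamics x' = A x + B u (plus small process noise) and stage cost x^T Q x + u^T R u; the class name and the example matrices in the comment are just illustrative, not taken from the actual scripts:

```python
import numpy as np

class LQREnv:
    """Toy LQR environment: x' = A x + B u + w,  stage cost = x'Qx + u'Ru."""
    def __init__(self, A, B, Q, R, noise_std=0.01, seed=0):
        self.A, self.B, self.Q, self.R = A, B, Q, R
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)
        self.x = None

    def reset(self):
        self.x = self.rng.standard_normal(self.A.shape[0])
        return self.x

    def step(self, u):
        cost = self.x @ self.Q @ self.x + u @ self.R @ u      # cost of the (x, u) pair
        self.x = (self.A @ self.x + self.B @ u
                  + self.noise_std * self.rng.standard_normal(self.A.shape[0]))
        return self.x, cost

# e.g. env = LQREnv(np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([[0.0], [0.1]]),
#                   np.eye(2), 0.1 * np.eye(1))
```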
- Model is parametrised by A, B matrices (no neural network).
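Since the model is just (A, B), fitting it from real transitions reduces to a least-squares regression; a sketch (the function name is mine, not from the code):

```python
import numpy as np

def fit_linear_model(X, U, X_next):
    """Least-squares fit of x' ~ A_hat x + B_hat u.

    X, U, X_next are (N, n), (N, m), (N, n) arrays of states, actions and
    next states collected from the real env."""
    n = X.shape[1]
    Z = np.hstack([X, U])                           # regressors [x, u]
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # solves Z @ W ~ X_next
    A_hat, B_hat = W.T[:, :n], W.T[:, n:]           # W.T = [A_hat | B_hat]
    return A_hat, B_hat
```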
- Note that we need a policy gradient algorithm that operates on off-policy data, so I chose to go with an "off-policy policy gradient" (i.e. importance-sampling based).
- Only linear policies are considered (i.e. u = K*x).
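For a given K, evaluating the policy just means rolling out u = K @ x and accumulating the quadratic cost; a small helper I use for illustration (it assumes the LQREnv sketch above, and the name is mine):

```python
import numpy as np

def rollout_linear_policy(env, K, horizon=100):
    """Roll out u = K @ x for `horizon` steps; return states, actions, total cost."""
    xs, us, total_cost = [], [], 0.0
    x = env.reset()
    for _ in range(horizon):
        u = K @ x
        xs.append(x)
        us.append(u)
        x, cost = env.step(u)
        total_cost += cost
    return np.array(xs), np.array(us), total_cost
```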
- Problems encountered:
- Problem with the structure of the policy distribution: here it is a Dirac-delta kind of distribution.
- More clearly, we are dealing with deterministic policies.
- Solution:
- Go forward with the deterministic policy directly (one approach is based on the deterministic policy gradient paper by David Silver et al., ICML 2014).
- Take a bypass (use a Gaussian policy with its mean parametrised by a linear function of the state).
- Policy gradient in the linear policy setting is discussed in detail here: (https://arxiv.org/pdf/2011.10300.pdf, 2021). In the paper they discuss the following settings:
- given model (known-parameter setting)
- we can query the env for, say, m trajectories for a set of "policies" (BUT we don't have the luxury of using trajectories that originated from some other policies)
But this setting won't help us !!
Note:
Their comparison may help us !!
- Modelling the env (in the original MBPO): instead of a single model, an ensemble of models is considered (they cite a paper for this).
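In our linear analogue, the same idea would simply be several (A_hat, B_hat) pairs fit on bootstrap resamples of the real transitions; this is my own sketch of the idea, not how the original (neural-network) ensemble is implemented:

```python
import numpy as np

def fit_linear_ensemble(X, U, X_next, n_models=5, seed=0):
    """Fit n_models (A_hat, B_hat) pairs on bootstrap resamples of the transitions."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    ensemble = []
    for _ in range(n_models):
        idx = rng.integers(0, N, size=N)            # resample with replacement
        Z = np.hstack([X[idx], U[idx]])
        W, *_ = np.linalg.lstsq(Z, X_next[idx], rcond=None)
        ensemble.append((W.T[:, :n], W.T[:, n:]))
    return ensemble
```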
- True env: OpenAI Gym envs are used.
- Updating the policy: SAC is used for policy optimization.
- While debugging the original code, the following was noticed:
Since we are interested in how the real-vs-fake data ratio affects training, I looked at the variable named real_ratio that they maintain; inside the program they use this variable to adjust some of the other training parameters, but I didn't get why they are doing so. Additionally, they use this variable in some edge-case detection.
- In a blog post written by the first author of MBPO, he mentions the Dyna algorithm (Sutton).
- The following passage is taken from that blog:
"An important detail in many machine learning success stories is a means of artificially increasing the size of a training set. It is difficult to define a manual data augmentation procedure for policy optimization, but we can view a predictive model analogously as a learned method of generating synthetic data. The original proposal of such a combination comes from the Dyna algorithm by Sutton, which alternates between model learning, data generation under a model, and policy learning using the model data. "
- Done with the implementation of the MBPO algorithm for the linear setting.
- Tested the model-update section and the real and fake data generation; things are working fine.
- But there is a problem with the gradient update rule (policy gradient is used).
- We need to work in the following settings:
- Linear policy (Dirac delta distribution)
- Off-policy setting (note that there is an importance sampling term)
- To handle the importance sampling term, I initially planned to use the following strategy: use a Gaussian policy with mean equal to the K@x term and a fixed covariance matrix. The derivation and the expression for the gradient term are in the "main.ipynb" file.
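For reference, my understanding of that strategy (not copied from main.ipynb): with pi_K(u|x) = N(u; K x, Sigma) and Sigma fixed, the score is grad_K log pi_K(u|x) = Sigma^{-1} (u - K x) x^T, and the per-step importance ratio between the current policy K and the behaviour policy K_b is pi_K(u|x) / pi_{K_b}(u|x). A sketch (function names are mine):

```python
import numpy as np

def gaussian_logprob(u, x, K, Sigma):
    """log N(u; K @ x, Sigma)."""
    d = u - K @ x
    m = u.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet + m * np.log(2 * np.pi))

def grad_logprob_wrt_K(u, x, K, Sigma):
    """Score function: d/dK log N(u; K @ x, Sigma) = Sigma^{-1} (u - K @ x) x^T."""
    return np.linalg.solve(Sigma, u - K @ x)[:, None] * x[None, :]

def importance_weight(u, x, K_target, K_behaviour, Sigma):
    """Per-step ratio pi_K(u|x) / pi_Kb(u|x) appearing in the off-policy gradient."""
    return np.exp(gaussian_logprob(u, x, K_target, Sigma)
                  - gaussian_logprob(u, x, K_behaviour, Sigma))
```

The off-policy gradient estimate is then, roughly, a sum over stored transitions of (importance weight) x (cost-to-go) x (the score term above), which feeds a gradient-descent step on the cost.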
- As a sanity check, I used the following test.
- In short, MBPO proposes the following approach for finding the optimal policy: use fake data for performing the policy gradient instead of real data.
- Use the real data to construct a cost function (which is a function of the policy),
- use gradient descent to do the minimisation step,
- but then we need a lot of real data (trajectories).
- MBPO instead: use the real data to construct a model,
- generate fake data (trajectories) using this model,
- do the policy gradient step using this fake data.
- MBPO uses the fake data to update the policy, but instead I will use the real data to update the policy. It should gradually converge to the optimal policy (which we know in the LQR case). But it is observed that the "update rule" obtained with the "Gaussian hack" is not converging to the optimal policy. Here "Gaussian hack" refers to using a Gaussian policy to prevent things from exploding numerically. I have used comments in the script to denote what is going on for each term. (A minimal version of the optimal-gain baseline for this check is sketched below.)
- The problem that I'm facing now: for this particular setting ("linear policy" and "off-policy trajectories"), I need to get an expression for the gradient to carry out the policy update rule.
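For the convergence check itself, the optimal gain in the LQR case is available in closed form from the discrete algebraic Riccati equation, so the learned K can be compared against it directly; a sketch (with the u = K @ x convention used in these notes, so the optimal gain carries a minus sign):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def optimal_lqr_gain(A, B, Q, R):
    """Optimal gain for x' = A x + B u with cost sum x'Qx + u'Ru, so that u = K_opt @ x."""
    P = solve_discrete_are(A, B, Q, R)
    K_opt = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K_opt

# sanity check: np.linalg.norm(K_learned - K_opt) should shrink as the policy updates proceed
```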

