DRL-using-PyTorch

PyTorch implementations of Deep Reinforcement Learning algorithms

All the following DQN variants are derived from the DQNwithNoisyNet folder and already include DDQN, prioritized experience replay, and a fixed target network.
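
As a rough, illustrative sketch (not code taken from this repository), the double-DQN target with a fixed target network can be computed as below, assuming online_net and target_net map a batch of states to a batch of Q-values:

```python
import torch

@torch.no_grad()
def double_dqn_target(online_net, target_net, next_states, rewards, dones, gamma=0.99):
    """Double-DQN target: the online network picks the greedy action,
    the periodically-synced target network evaluates it."""
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # action selection
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # action evaluation
    return rewards + gamma * (1.0 - dones.float()) * next_q
```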

DQN with NoisyNet:

https://github.com/LilTwo/DRL-using-PyTorch/tree/master/DQN_NoisyNet
Reference: https://arxiv.org/pdf/1706.10295.pdf

NoisyNets add randomness to the parameters of the network.
With noisy layers present, the network is able to learn a domain-specific exploration strategy,
rather than using epsilon-greedy and adjusting epsilon manually during training.
In my experience, a NoisyNet usually needs a smaller learning rate than a normal network to work well,
and it is very sensitive to the parameters' initial values.
In the MountainCar environment, there is some chance that the car never reaches the top in the first episode; I'm not sure whether this is due to a mistake in my implementation.

In the original paper, the authors suggest that the sum of the "sigma" parameters can be viewed as a measure of the layer's stochasticity.
This has been implemented in the "randomness" method of the "NoisyLinear" class, with one modification: each "sigma" is normalized by its corresponding "mu" before the summation.
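
Below is a minimal sketch of such a noisy linear layer with a randomness measure, following the factorised-noise formulation of the paper; the initialization constants and exact noise handling are assumptions and may differ from the DQN_NoisyNet code in this repository:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable Gaussian noise on weights and biases
    (factorised noise, after arXiv:1706.10295). Sketch only."""

    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # learnable means (mu) and noise scales (sigma)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.empty(out_features, in_features))
        self.mu_b = nn.Parameter(torch.empty(out_features))
        self.sigma_b = nn.Parameter(torch.empty(out_features))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        nn.init.uniform_(self.mu_b, -bound, bound)
        nn.init.constant_(self.sigma_w, sigma0 * bound)
        nn.init.constant_(self.sigma_b, sigma0 * bound)

    @staticmethod
    def _f(x):
        # factorised-noise transform f(x) = sign(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        weight = self.mu_w + self.sigma_w * eps_out.outer(eps_in)
        bias = self.mu_b + self.sigma_b * eps_out
        return F.linear(x, weight, bias)

    def randomness(self):
        # stochasticity of the layer: sum of |sigma|, each normalized by its |mu|
        return (self.sigma_w.abs() / (self.mu_w.abs() + 1e-8)).sum().item()
```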

DQN from Demonstrations (DQfD)

https://github.com/LilTwo/DRL-using-PyTorch/tree/master/DQNfromDemo
Reference: https://arxiv.org/pdf/1704.03732.pdf

If expert demonstrations produced by a human or another well-trained agent are available, one may expect these data to speed up training by saving the time spent on random exploration in a large state/action space.
DQfD provides a method to leverage demonstration data by pre-training the model solely on the demonstration data before it starts to interact with the environment.
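
In the paper, pre-training minimizes the usual TD loss plus a large-margin supervised loss that pushes the expert's action above all other actions. A hedged sketch of that margin term (function name and margin value are illustrative, not taken from this repository):

```python
import torch

def margin_loss(q_values, expert_actions, margin=0.8):
    """Large-margin classification loss from the DQfD paper:
    J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) = margin if a != a_E and 0 otherwise."""
    penalties = torch.full_like(q_values, margin)             # (batch, n_actions)
    penalties.scatter_(1, expert_actions.unsqueeze(1), 0.0)   # zero margin at the expert action
    q_expert = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + penalties).max(dim=1).values - q_expert).mean()
```

During the pre-training phase this term is combined with the TD loss on demonstration batches only; afterwards, self-generated and demonstration transitions are mixed in the replay buffer.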

Hindsight Experience Replay (HER)

Code will be uploaded soon.
Reference: https://papers.nips.cc/paper/7090-hindsight-experience-replay.pdf

Since model-free RL algorithms like DQN know nothing about the environment, they usually need a lot of exploration at the beginning to find out what is good or bad, especially when dealing with sparse rewards.
In the first few epochs of training, an agent is likely to receive no positive reward during an entire episode. HER makes good use of these trajectories by storing each trajectory in the replay buffer again, but with different goals that are achieved by some states in the trajectory.
So you can be sure that some transitions with positive reward are stored in the replay buffer after every episode.
The key for HER to work is that these goals should be correlated in a reasonable way, so that learning to behave well on one of them also helps the agent behave well on the others; for this reason I have reservations about the authors' claim that using HER requires less domain knowledge than defining a shaped reward.
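
A minimal sketch of the relabeling step, using the "final" strategy (replay the episode with the goal achieved by its last state); the tuple layout and the compute_reward function are assumptions for illustration, not a real interface of this repository:

```python
def her_relabel(trajectory, compute_reward):
    """Relabel an episode with the goal its final state actually achieved,
    so the replayed copy contains at least one rewarded transition.
    trajectory: list of (state, action, next_state, achieved_goal) tuples.
    compute_reward(achieved_goal, goal): environment-specific reward function."""
    new_goal = trajectory[-1][3]  # goal achieved at the end of the episode
    relabeled = []
    for state, action, next_state, achieved in trajectory:
        reward = compute_reward(achieved, new_goal)
        relabeled.append((state, action, next_state, new_goal, reward))
    return relabeled
```

Both the original transitions (with the true goal) and the relabeled copies are stored, so the buffer contains rewarded transitions after every episode.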
