Karpathy has released JavaScript versions of reinforcement learning and RNN/LSTM demos. In the character-generation RNN demo, Paul Graham's essays are encoded into the RNN's weights.
http://it.sohu.com/20161202/n474728555.shtml
- workshop
- Zhihu roundup
- Benchmarking Deep Reinforcement Learning for Continuous Control
- Deep Reinforcement Learning in Large Discrete Action Spaces
The Atari game playing of Mnih et al. ("Playing Atari with Deep Reinforcement Learning") is a good illustration of what deep reinforcement learning can do: a convolutional neural network models the action-value function Q(s,a).
DQN is covered in both Oxford's lecture 12 and UCL's lecture 6.
There are two main lines, policy gradients (PG) and Q-learning, each with a well-known example: Atari games use Q-learning with function approximation, while AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS).
Karpathy's article gives a detailed introduction to policy gradients and uses PG to learn an Atari game.
- Stochastic policy gradients: the agent uses REINFORCE and LSTMs to learn the actor policy and a value-function baseline; in Karpathy's article, this is the probability of UP/DOWN.
- Deterministic Policy Gradients
A DQN example
Karpathy says in the article that more people now favor policy gradients over Q-learning, because PG is end-to-end and, when well tuned, outperforms Q-learning. A minimal sketch follows.
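As a rough illustration (not Karpathy's actual code), here is a minimal REINFORCE sketch assuming a Pong-like binary UP/DOWN action, a logistic policy over a flattened observation, and a generic `env` following gym's reset/step interface of that era:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discount_rewards(rewards, gamma=0.99):
    """Discounted returns G_t = sum_k gamma^k * r_{t+k}."""
    out, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def episode_gradient(env, w):
    """Run one episode and return the REINFORCE gradient for weights w."""
    xs, ys, rs = [], [], []
    x, done = env.reset(), False
    while not done:
        p_up = sigmoid(np.dot(w, x))              # probability of action UP
        action = 1 if np.random.rand() < p_up else 0
        xs.append(x)
        ys.append(action - p_up)                  # d log p(action) / d logit
        x, r, done, _ = env.step(action)
        rs.append(r)
    returns = discount_rewards(np.array(rs))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return np.dot(np.array(ys) * returns, np.array(xs))  # policy gradient

# Training loop would then do gradient ascent: w += learning_rate * gradient
```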
Take cart-pole as an example and build a controller. The system state is (theta, w, x, v). The critic tracks the value V(theta, w, x, v); the actor computes u = u(theta, w, x, v) + rn, where rn is exploration noise, and the force F = Fmax(u) is applied to the environment. By seeking states with large V(theta, w, x, v), the controller drives the pole to the upright state, where the reward is maximal. However, V(theta, w, x, v) is unknown, so function approximation is needed to obtain it. This requires:
- a critic to estimate V(theta, w, x, v)
- an actor (a sketch of the full loop follows this list)
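A rough sketch of this actor-critic loop, assuming linear approximators for both V and u; the environment's `apply` method, the learning rates, and the noise scale are all illustrative assumptions, not a tested controller:

```python
import numpy as np

# State s = (theta, w, x, v); linear critic V(s) = w_c . s,
# linear actor u(s) = w_a . s + noise.
def actor_critic_step(w_c, w_a, s, env, gamma=0.99,
                      a_c=0.1, a_a=0.01, sigma=0.1, F_max=10.0):
    noise = sigma * np.random.randn()        # exploration term "rn"
    u = np.dot(w_a, s) + noise               # actor output
    F = F_max * np.tanh(u)                   # bounded force on the cart
    s_next, r, done = env.apply(F)           # hypothetical environment API
    # TD(0) error: how much better/worse the outcome was than V predicted
    delta = r + gamma * np.dot(w_c, s_next) * (not done) - np.dot(w_c, s)
    w_c += a_c * delta * s                   # critic: move V toward the target
    w_a += a_a * delta * noise * s           # actor: reinforce noise that helped
    return w_c, w_a, s_next, done
```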
Application in dialogue systems: http://www.maluuba.com/blog/2016/11/23/deep-reinforcement-learning-in-dialogue-systems
Policy Networks with Two-Stage Training for Dialogue Systems
For a TensorFlow-based DQN, see
https://github.com/dennybritz/reinforcement-learning
- Papers: "Playing Atari with Deep Reinforcement Learning" and "Human-level control through deep reinforcement learning"
These papers note that when a nonlinear function approximator such as a neural network is used to represent the Q function, RL is often unstable or even divergent; DQN uses experience replay and fixed Q-targets to improve stability. See RL (Sutton), p. 385.
- motivation
While these methods might have produced results comparable to DQN's, they would have been more complicated to implement and would have significantly increased the time needed for learning. Another motivation for using Q-learning was that DQN used the experience replay method. Mnih modified the basic Q-learning procedure in three ways, as follows:
- experience replay
This method stores the agent's experience at each time step in a replay memory that is accessed to perform the weight updates.
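A compact sketch of these two stabilizers, experience replay and fixed Q-targets; `q_net` and `target_net` are placeholder approximators (assumed to expose predict/train methods), not the paper's implementation:

```python
import random
from collections import deque

import numpy as np

replay = deque(maxlen=100000)                  # replay memory

def remember(s, a, r, s_next, done):
    """Store the agent's experience at each time step."""
    replay.append((s, a, r, s_next, done))

def replay_update(q_net, target_net, batch_size=32, gamma=0.99):
    """Sample uniformly from replay memory and update Q-network weights."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)   # breaks sample correlation
    for s, a, r, s_next, done in batch:
        target = q_net.predict(s)
        # fixed Q-target: bootstrap from the frozen target network
        target[a] = r if done else r + gamma * np.max(target_net.predict(s_next))
        q_net.train(s, target)
    # periodically copy q_net's weights into target_net
```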
References:
- Oxford reinforcement learning lecture
- David Silver's RL class
- Udacity RL class
- Berkeley deep RL course: the most comprehensive one, covering policy gradients, action-value approximation, and more.
- http://www.wildml.com/
See the devsisters code: git clone DQN-tensorflow. It depends on:
- Python
- gym
- tqdm
- OpenCV2
- TensorFlow
- Installing OpenCV2: following the usual procedure you would git clone opencv into /home/crawler/tensorflow/dqn/gym/DQN-tensorflow/opencv, but that is tedious; apt-get install python-opencv installs it directly, then test with import cv2.
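A quick sanity check after installing, assuming the apt package puts cv2 on the default Python path:

```python
import cv2
print(cv2.__version__)  # succeeds only if the bindings are importable
```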
- Installing gym: pip install tqdm gym. OpenAI Gym is a toolkit for reinforcement learning algorithms; it makes no assumptions about the agent's structure and is compatible with TensorFlow and Theano. It has two parts: 1. the gym open-source library; 2. the OpenAI Gym service.
Under the /home/crawler/tensorflow/dqn/gym path:
- case.py, on 39
- case1.py: a cart-pole example using the actor-critic approach. Actor-critic is a policy-gradient method, well introduced in the paper "A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients". In this example, cart-pole is the environment; the main purpose of the gym library is to provide a collection of environments, which you can list with the following commands:
```python
from gym import envs
print(envs.registry.all())  # list all registered environment ids
```
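Beyond listing environments, a minimal interaction loop (using the CartPole-v0 id and the 4-tuple step API of this gym version) looks like:

```python
import gym

env = gym.make('CartPole-v0')       # cart-pole environment from the registry
obs = env.reset()
done, total = False, 0.0
while not done:
    action = env.action_space.sample()           # random policy for illustration
    obs, reward, done, info = env.step(action)   # gym's step API (4-tuple)
    total += reward
print('episode reward:', total)
```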
Under the ~/tensorflow/dqn/gym/DQN-tensorflow path, run main.py to test DQN.
https://zhuanlan.zhihu.com/p/21477488?refer=intelligentunit
Two approaches:
- Jekyll mechanism: username.github.io; when this address is visited, Jekyll renders the master branch of the username.github.io repository under the user username.
- Ruan's approach: the gh-pages branch of username/blog. cxwangyi.github.com. Ruan's article records the role of each file and directory in detail, but the jekyll new command can generate these files and directories directly.
To serve the site locally via localhost, without hooking up GitHub:
~ $ gem install jekyll bundler
~ $ jekyll new myblog
~ $ cd myblog
Create a file named Gemfile with the following content:
source 'https://rubygems.org'
gem 'github-pages', group: :jekyll_plugins
~/myblog $ bundle install
~/myblog $ bundle exec jekyll serve
# => Now browse to http://localhost:4000
- Set up the .io site following https://pages.github.com/
- Jekyll is a static site generator. First install Ruby with sudo apt install ruby, then install Jekyll following https://jekyllrb.com/docs/quickstart/
Advantages of Jekyll:
- You can write Markdown instead of HTML
- You can add a Jekyll theme
https://www.zhihu.com/question/28123816
http://www.wildml.com/2016/04/deep-learning-for-chatbots-part-1-introduction/