**Hi!**

I was rejected from [DLSS/RLSS](https://mila.quebec/en/cours/deep-learning-summer-school-2017/) this year, but I decided not to stress about it, watch all the lectures and write a summary of them. I understand that a summer school is not only about the lectures, but that is all I have. Going through the lectures and writing them up will still be useful for me. It might also be useful for some of you. Let's start!

Sometimes I wrote down my impressions of the lectures or related thoughts. Sometimes the notes are copy-pasted from the slides, since I find this useful and it is nice to have everything in one place.

The sections go in their natural order as in the schedule. You can find the slides [here](https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides/) and the videos [here](http://videolectures.net/deeplearning2017_montreal/).

If you see anything wrong or misunderstood, please email [me](mailto:vitaliykurin@gmail.com) or find me on [twitter](https://twitter.com/y0b1byte).

## Reinforcement Learning: basic concepts, *Joelle Pineau*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137927/deeplearning2017_pineau_reinforcement_learning_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_pineau_reinforcement_learning/)*

The best introduction to RL I have seen so far. Highly recommended. Even if you already know some of the material, it is useful to get a more or less complete picture of the basics.

In a short introduction, Joelle Pineau mentions the recent applications of RL found in [RLDM 2017](http://rldm.org/rldm2017/) submissions:

* robotics
* video games
* conversational systems
* medical intervention (x_X)
* algorithm improvement
* improvisational theater
* autonomous driving
* prosthetic arm control
* financial trading
* query completion

As you can see, the list is really large. And this is only after looking through half of the accepted papers.

The lecture proceeds with the question "When should I apply RL to my problem?"
The answer is a combination of the following prerequisites:

* your data comes in the form of trajectories (forget about the i.i.d. assumption);
* your system needs some kind of intervention;
* you need feedback in order to understand whether you are doing well or not;
* your problem requires both learning and planning.

The lecturer mentions that RL is somewhat similar to Supervised Learning, but not completely.
There are some challenges, both practical and technical:

* you need some environment to operate in (one of the reasons for the slow progress of RL in the past)
* you need to learn and plan at the same time using correlated samples
* your data distribution changes over the course of learning (the actions you take bring the agent to different states, which makes learning harder)

Then Joelle gives the formal representation of an RL problem as a Markov Decision Process (MDP), defined as a tuple $\langle S,A,T(s,a,s'), R(s,a), \mu(s) \rangle$, where $S$ is the set of states, $A$ is the set of actions, $T(s,a,s')$ is the transition function giving the distribution over the next states $s'$ given the current state $s$ and the action taken $a$, $R(s,a)$ is the reward function, and $\mu(s)$ is the distribution over initial states.
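To make the tuple concrete, a tabular MDP can be stored as a few arrays. This is only a sketch: the two-state, two-action numbers and the names `T`, `R`, `mu` and `step` are made up for illustration and do not come from the lecture.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (all numbers are made up).
n_states, n_actions = 2, 2

# T[s, a, s'] = probability of landing in s' after taking action a in state s.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
# R[s, a] = immediate reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
# mu[s] = probability of starting an episode in state s.
mu = np.array([1.0, 0.0])

def step(rng, s, a):
    """Sample the next state and return it together with the reward."""
    s_next = rng.choice(n_states, p=T[s, a])
    return int(s_next), float(R[s, a])

rng = np.random.default_rng(0)
s0 = int(rng.choice(n_states, p=mu))   # initial state drawn from mu
s1, r0 = step(rng, s0, 1)              # one interaction with the environment
```

Each row `T[s, a]` has to sum to one, since it is a distribution over next states.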

The M in MDP stands for 'Markov', i.e. the process satisfies the Markov assumption: the future is independent of the past given the present. What does it mean? Our next state depends only on the current one (and the action taken); we do not need to know the whole history of states in order to predict the future.
And the definition of the *state* according to Joelle is the following: a *state* is a sufficient amount of information about the world in order to predict the future.
Sometimes in real life the assumption does not hold, that's true. But RL still uses it in order to reduce the complexity.

What is the goal of RL? We want to maximize the reward we get from our interaction with the environment.
We have two options here: either the task at hand is episodic (e.g. a game episode ends when you win or lose) or continuing (e.g. balancing).

Usually, the future reward flow is discounted by the coefficient $\gamma \in [0,1)$ (usually close to 1).
$\gamma$ trades off the preference for immediate reward against the preference for future reward.
It is often said that we discount the reward flow for psychological reasons: humans prefer immediate reward to future reward.
But, as Joelle mentions, discounting is first of all mathematically convenient.
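One way to see the convenience: with $\gamma < 1$ and rewards bounded by $r_{max}$, the discounted sum can never exceed $r_{max}/(1-\gamma)$, even for an infinitely long trajectory. For a finite list of rewards the discounted return can be computed as in the sketch below; `discounted_return` is a helper name chosen for this example.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    g = 0.0
    # Walk backwards so that each step is one multiplication and one addition.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
g_half = discounted_return(rewards, 0.5)   # 1 + 0.5 + 0.25
g_one = discounted_return(rewards, 1.0)    # no discounting: the plain sum
```

With $\gamma = 0.5$ the later rewards count for less, which is the preference for immediate reward described above.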

We now go to one of the most important definitions in RL -- the policy.
A policy $\pi$ is a function that, given a state, returns a distribution over actions: $\pi(s,a) = p(a_t = a | s_t = s)$.
And to solve an RL problem is to find a policy which maximizes the expected future reward flow: $\arg\max_{\pi} E_{\pi} [r_0 + r_1 + ... + r_T | s_0]$.
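In the tabular case a stochastic policy is just a table with one row per state, each row being a distribution over actions. A minimal sketch, where the array `pi` and the helper `sample_action` are names invented for this example:

```python
import numpy as np

# pi[s, a] = probability of taking action a in state s (made-up numbers).
pi = np.array([
    [0.5, 0.5],   # state 0: both actions are equally likely
    [1.0, 0.0],   # state 1: always action 0 (a deterministic row)
])

def sample_action(pi, s, rng):
    """Draw an action for state s according to the policy table."""
    return int(rng.choice(pi.shape[1], p=pi[s]))

rng = np.random.default_rng(0)
a0 = sample_action(pi, 0, rng)   # random: either 0 or 1
a1 = sample_action(pi, 1, rng)   # the row is deterministic, so this is 0
```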

The value function of a state is the expected return of a policy starting from this particular state: $V_{\pi}(s) = E_{\pi} [r_t + r_{t+1} + ... + r_T | s_t = s]$.
I don't get why, but all the definitions are given without discounting.
Maybe it does not matter here since, when we take expectations later, we will be able to keep the gammas outside of the expectations, but I'm not sure. In Sutton & Barto's book, all the definitions and derivations are given for the discounted return.

In order not to mix up all the terms and definitions, the lecturer gives the following slide:

* Reward is a one-step numerical feedback
* Return is the sum of rewards along the agent's trajectory
* Value is the expected sum of rewards over the agent's trajectory
* Utility is the numerical function representing preferences (in RL *return* $\equiv$ *utility*)

Ok, let's go to the policy evaluation. What is the value of a policy? It is just the expected immediate reward plus the expected future reward: 
$V_{\pi}(s) = E_{\pi}[r_t + r_{t+1} + ... + r_T | s_t = s] = E_{\pi}[r_t | s_t = s] + E_{\pi}[r_{t+1} + ... + r_{T} | s_t = s]$.

Let's rewrite the expectations now: 

$V_{\pi}(s) = \sum_{a \in A}\pi(s,a)R(s,a) + E_{\pi}[r_{t+1} + ... + r_T | s_t = s]$ 

$V_{\pi}(s) = \sum_{a \in A}\pi(s,a)R(s,a) + \sum_{a \in A}\pi(s,a)\sum_{s' \in S}T(s,a,s')E_{\pi}[r_{t+1} + ... + r_T | s_{t+1} = s']$ 

And, looking at the definition of the value function, we can see that the last expectation on the right-hand side is just the value function of the state $s'$:


$V_{\pi}(s) = \sum_{a \in A}\pi(s,a)R(s,a) + \sum_{a \in A}\pi(s,a)\sum_{s' \in S}T(s,a,s')V_{\pi}(s')$ 


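From here we can see that the equation is recursive: the value of a state is written in terms of the values of its successor states, so it can be solved with dynamic programming.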

The lecturer uses the formulas with discounting now:

$V_{\pi}(s) = \sum_{a \in A}\pi(s,a)[R(s,a) + \gamma \sum_{s' \in S}T(s,a,s')V_{\pi}(s')]$

We also have to write the equation for the state-action value function $Q$, the function returning the value of a state given that we take a particular action first and then follow the policy $\pi$:

$Q_{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S}\big[T(s,a,s')\sum_{a' \in A}[\pi(s',a')Q_{\pi}(s',a')]\big]$

The last two formulas are the two forms of Bellman's equation.

We can rewrite the first one in the matrix form $V_{\pi} = R_{\pi} + \gamma T_{\pi}V_{\pi}$.
It has the unique solution $V_{\pi} = (I - \gamma T_{\pi})^{-1}R_{\pi}$.
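For a small MDP the matrix form can be solved directly. The sketch below does this with NumPy for a made-up two-state MDP and a uniformly random policy; all the numbers and the array names `T`, `R`, `pi` are assumptions for illustration. It also checks the link between the two forms of the Bellman equation, $V_{\pi}(s) = \sum_{a}\pi(s,a)Q_{\pi}(s,a)$.

```python
import numpy as np

gamma = 0.9
# Made-up toy MDP: T[s, a, s'] are transition probabilities, R[s, a] rewards.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
# pi[s, a] = probability of action a in state s (uniformly random here).
pi = np.full((2, 2), 0.5)

# Average over actions to get the policy's reward vector and transition matrix.
R_pi = (pi * R).sum(axis=1)               # shape (S,)
T_pi = np.einsum("sa,sat->st", pi, T)     # shape (S, S)

# Solve (I - gamma * T_pi) V = R_pi instead of forming the inverse explicitly.
V = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)

# State-action values follow from V with one more backup.
Q = R + gamma * T @ V                     # shape (S, A)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical choice, not something the lecture prescribes.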

Let's now assume that we have a fixed policy. How can we evaluate it?
Let's somehow initialize the value function $V_{\pi}$ (with zeroes, for instance).
On each iteration, we update the value function for each state:

$V_{k+1}(s) \leftarrow R(s,\pi(s)) + \gamma \sum_{s' \in S}T(s,\pi(s), s')V_k(s')$, where $k$ is the index of the iteration.

We repeat this until the value function stops changing or the change between two iterations falls below some threshold.
There is the derivation of convergence in the slides (slide #31), but I will not write it here. I will just say that it uses the fact that $\gamma < 1$ to show that
the norm of the difference between the current approximation and the true value function contracts to zero.
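The iterative procedure fits in a few lines. This is a sketch for a deterministic policy stored as an array of action indices; the function name `evaluate_policy`, the tolerance and the toy MDP are assumptions made for the example.

```python
import numpy as np

def evaluate_policy(T, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Iterative policy evaluation for a deterministic policy (policy[s] = action)."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    R_pi = R[idx, policy]        # R(s, pi(s)), shape (S,)
    T_pi = T[idx, policy]        # T(s, pi(s), s'), shape (S, S)
    V = np.zeros(n_states)       # initialize with zeroes
    for _ in range(max_iters):
        V_new = R_pi + gamma * T_pi @ V
        # Stop once the largest change over all states is below the threshold.
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Made-up toy MDP with 2 states and 2 actions.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
policy = np.array([1, 0])
V = evaluate_policy(T, R, policy, gamma=0.9)
```

Up to the tolerance, the result agrees with the closed-form solution $(I - \gamma T_{\pi})^{-1}R_{\pi}$ for the same policy.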

We move from the fixed policy to finding the best (optimal) policy.
The optimal value function is the highest return we can get from the state: $V^{*}(s) = \max_{\pi}V_{\pi}(s)$.
A policy that achieves $V^{*}$ is called an *optimal policy* $\pi^*$.
For each MDP there is a unique optimal value function. **BUT** the optimal policy is not necessarily unique.

Having a solution to an MDP means having either the optimal value function $V^*$ or an optimal policy $\pi^*$.
We say this because, given either one of them, we can derive the other.

We have already looked at the policy evaluation algorithm for a fixed policy. But how do we find the best policy? There are two related algorithms: policy iteration and value iteration.

Policy iteration goes as follows:

* initialize a policy somehow (a random one is fine)
* repeat
  * compute $V_{\pi}$ using the policy evaluation algorithm
  * compute $\pi'$ that is greedy with respect to $V_{\pi}$
* terminate when $\pi = \pi'$, otherwise set $\pi \leftarrow \pi'$ and continue
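The steps above can be sketched as follows. The evaluation step here solves the linear system exactly instead of iterating, which is one possible choice; the function name `policy_iteration` and the toy MDP are made up for the example.

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Tabular policy iteration; returns a deterministic policy and its value."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)     # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * T_pi V exactly.
        T_pi = T[idx, policy]
        R_pi = R[idx, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * T @ V                  # shape (S, A)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Made-up toy MDP with 2 states and 2 actions.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
gamma = 0.9
policy, V = policy_iteration(T, R, gamma)
```

At termination the returned `V` satisfies the Bellman optimality equation, i.e. no single action change improves it.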


And value iteration:

* initialize the value function $V_0(s)$
* on each iteration do the update $V_{k+1}(s) = \max_{a \in A}(R(s,a) + \gamma \sum_{s' \in S}T(s,a,s')V_k(s'))$
* stop when the change of the value function over one iteration is below some threshold
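A minimal sketch of these three steps, again on a made-up toy MDP; `value_iteration`, the tolerance and the iteration cap are names and values chosen for the example.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-10, max_iters=10_000):
    """Tabular value iteration; returns the value function and a greedy policy."""
    V = np.zeros(T.shape[0])                       # V_0(s) = 0
    for _ in range(max_iters):
        # Bellman optimality backup for all states at once.
        V_new = np.max(R + gamma * T @ V, axis=1)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    # Read off a policy that is greedy with respect to the final values.
    policy = np.argmax(R + gamma * T @ V, axis=1)
    return V, policy

# Made-up toy MDP with 2 states and 2 actions.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])
gamma = 0.9
V, policy = value_iteration(T, R, gamma)
```

Note that there is no separate evaluation step: the maximization over actions is folded into every backup.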


The complexities for the algorithms are the following ($S$ is the state space size, $A$ is the action space size):

* policy evaluation: $O(S^3)$
* policy iteration: $O(S^3 + S^2A)$ per iteration
* value iteration: $O(S^2A)$ per iteration

There is an example in the slides which I will not reproduce here, but it is worth going through if you feel confused by any of the above.

We can see from the complexities that the algorithms get less and less feasible as our state-action space grows. However, we can try not to update all the states in value iteration, but only the important ones.
Moreover, we can do asynchronous updates: generate trajectories through the MDP and update the states only when they appear on a trajectory. In policy iteration we are not forced to do one policy update after each policy evaluation. We can combine the updates and evaluations in any combination we find appropriate.
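As an illustration of the asynchronous idea, the sketch below backs up only the state the agent is currently in while it wanders through a made-up toy MDP. Picking actions uniformly at random is an assumption made here so that every state keeps being visited; it is not a recommendation from the lecture.

```python
import numpy as np

gamma = 0.9
# Made-up toy MDP with 2 states and 2 actions.
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
])
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])

rng = np.random.default_rng(0)
V = np.zeros(2)
s = 0
for _ in range(5000):
    # Back up only the current state, in place.
    V[s] = np.max(R[s] + gamma * T[s] @ V)
    # Move on with a random action and a sampled transition.
    a = rng.integers(2)
    s = rng.choice(2, p=T[s, a])
```

On this tiny problem the in-place updates end up at the same fixed point as the full sweeps of value iteration.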

For those who do not want to read the slides and watch the lectures anymore, but want to do hardcore research, Joelle has a 'challenges' slide. I will also put them in a list:

* Designing the problem domain
  * state representation
  * action choice
  * cost/reward signal
* acquiring data for training
  * exploration/exploitation
  * high cost actions
  * time-delayed cost/reward signal
* function approximation
* validation/confidence measures

The lecture proceeds with describing on-line learning, which can be of two types, according to the lecturer:

* Monte-Carlo estimate, when we use the empirical return $U(s_t)$ as a target estimate for the actual value function: $V(s_t) \leftarrow V(s_t) + \alpha(U(s_t) - V(s_t))$
* Temporal-Difference (TD) learning: $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] \ \forall t = 0,1,2,...$
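Both updates can be tried on a single recorded episode. In this sketch `rewards[t]` is the reward received after leaving `states[t]` (the $r_{t+1}$ of the TD formula), state 2 is a terminal state whose value stays at zero, and the function names and numbers are made up for the example.

```python
import numpy as np

def mc_update(V, states, rewards, gamma, alpha):
    """Every-visit Monte-Carlo: move V(s_t) towards the observed return U(s_t)."""
    V = V.copy()
    U = 0.0
    # Go backwards through the episode so that U accumulates the return.
    for s, r in zip(reversed(states), reversed(rewards)):
        U = r + gamma * U
        V[s] += alpha * (U - V[s])
    return V

def td0_update(V, states, rewards, next_states, gamma, alpha):
    """TD(0): move V(s_t) towards the bootstrapped target r + gamma * V(s_{t+1})."""
    V = V.copy()
    for s, r, s_next in zip(states, rewards, next_states):
        target = r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])
    return V

# One made-up episode: 0 -> 1 -> 2 (terminal), with rewards 1 and 2.
states, rewards, next_states = [0, 1], [1.0, 2.0], [1, 2]
V0 = np.zeros(3)
V_mc = mc_update(V0, states, rewards, gamma=0.5, alpha=0.5)
V_td = td0_update(V0, states, rewards, next_states, gamma=0.5, alpha=0.5)
```

The difference shows up in state 0: Monte-Carlo uses the full return of the episode, while TD only sees the first reward plus the (still zero) estimate of the next state.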

As Joelle mentions, online learning is highly unstable, and a lot of recent RL research focused on improving the stability of the online learning algorithms.

Up to now, the lecture assumed that we are in a tabular setup, where the value function and the policy can be represented as a large table. But in the real world this approach does not scale, since the state spaces are far too large for a table.
We need to use function approximation.
Linear functions have been used for a long time.
Recently, using neural nets as function approximators has become very popular, and we can use all the Deep Learning progress within RL, e.g. memory augmented models [1].
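To give an idea of what replacing the table looks like, here is a sketch of the standard semi-gradient TD(0) update for a linear value function $V(s) = w^\top \phi(s)$. This particular update is not written out in these notes, and the feature map `phi` and the function name are assumptions for the example.

```python
import numpy as np

def td0_linear_step(w, phi_s, r, phi_next, gamma, alpha):
    """One semi-gradient TD(0) step for the linear model V(s) = w . phi(s)."""
    td_error = r + gamma * w @ phi_next - w @ phi_s
    # The gradient of V(s) with respect to w is just the feature vector phi(s).
    return w + alpha * td_error * phi_s

# With one-hot features this reduces to the tabular TD rule.
phi = np.eye(3)
w = np.zeros(3)
w = td0_linear_step(w, phi[0], r=1.0, phi_next=phi[1], gamma=0.5, alpha=0.5)
```

With richer features, one update changes the estimate for every state that shares features with the visited one, which is where generalization comes from.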

The lecture continues with describing the on-policy/off-policy dichotomy in RL. 
As we mentioned earlier, each policy change leads to data distribution change, so, when we evaluate several policies within the same batch, we need a large batch of data and a policy which adequately covers all (s,a) combinations. 
One of the solutions to the problem is to use importance sampling, attaching different weights to data collected from different policies.
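A sketch of one common variant, ordinary per-trajectory importance sampling: each trajectory collected by a behaviour policy is reweighted by the product of probability ratios between the target policy and the behaviour policy. The policies, trajectories and returns below are made up for illustration.

```python
import numpy as np

def trajectory_weight(states, actions, target_pi, behaviour_pi):
    """Product over the trajectory of target / behaviour action probabilities."""
    ratios = target_pi[states, actions] / behaviour_pi[states, actions]
    return float(np.prod(ratios))

# The data was collected by a uniformly random behaviour policy ...
behaviour_pi = np.full((2, 2), 0.5)
# ... but we want to evaluate a deterministic target policy.
target_pi = np.array([[1.0, 0.0], [0.0, 1.0]])

# Two made-up trajectories visiting states 0 and 1.
w_match = trajectory_weight([0, 1], [0, 1], target_pi, behaviour_pi)
w_mismatch = trajectory_weight([0, 1], [1, 1], target_pi, behaviour_pi)

# Off-policy estimate of the target policy's return from the weighted returns.
returns = np.array([3.0, 5.0])
weights = np.array([w_match, w_mismatch])
estimate = float(np.mean(weights * returns))
```

A trajectory the target policy would never produce gets weight zero, and the weights of the remaining ones can grow quickly with trajectory length, which makes such estimates noisy.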

The exploration/exploitation dilemma goes next.
You've definitely heard about it.
When you have some policy, you might follow it and get your deserved return, or you can try something new to achieve, possibly, more.
Though researchers have been trying to solve the problem for a long time, it is still far from being solved.
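A very simple and widely used heuristic for this trade-off is $\epsilon$-greedy action selection. It is not discussed in these notes, so take the sketch below only as an illustration of the dilemma; the function name and the numbers are made up.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore at random, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: current best action

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.3])
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)   # always exploits
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)   # always explores
```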

At the end, Joelle mentions the two approaches to RL: model-based and model-free. The first tries to learn a model of the environment first and do the planning later. The second, which has been more successful recently, tries to learn a policy directly from the data coming from the environment. As for me, I find the model-based approach very cool, but it is harder to learn. It has not been so hot recently, but the research is going on and, I hope, we will see great results in the near future.

There was also an interesting question from the audience about choosing the discounting coefficient $\gamma$.
Joelle says that she used to think that choosing gamma is the problem of whoever chooses the domain and creates the environment.
But the community is moving more and more towards the view that $\gamma$ is a hyperparameter and that maybe we should discount more aggressively at the beginning of training, when our estimates are too noisy.
There were no literature pointers in the lecture, but I wrote an email, and Joelle sent me the link [2].

The lecture is over. If you want to know more, either continue reading or find more awesome resources [here](https://github.com/aikorea/awesome-rl).
As for me, I want to add that the lecture is great not only because it summarizes the basics of RL, but also because it gives some intuitions which help to understand the concepts better.

## Policy Search for RL, *Pieter Abbeel*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137919/deeplearning2017_abbeel_policy_search_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_abbeel_policy_search/)*


**TBD**


## TD Learning, *Richard Sutton*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137922/deeplearning2017_sutton_td_learning_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_sutton_td_learning/)*


**TBD**


## Deep Reinforcement Learning, *Hado van Hasselt*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137918/deeplearning2017_van_hasselt_deep_reinforcement_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_van_hasselt_deep_reinforcement/)*


**TBD**

## Deep Control, *Nando de Freitas*

*No slides yet* |*[Video](http://videolectures.net/deeplearning2017_de_freitas_deep_control/)*


**TBD**



## Theory of RL, *Csaba Szepesvári*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137923/deeplearning2017_szepesvari_theory_of_rl_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_szepesvari_theory_of_rl/)*


**TBD**


## Reinforcement Learning, *Satinder Singh*

*[Slides](http://videolectures.net/site/normal_dl/tag=1129741/deeplearning2017_singh_reinforcement_learning_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_singh_reinforcement_learning/)*


**TBD**


## Safe RL, *Philip Thomas*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137917/deeplearning2017_thomas_safe_rl_01.pdf)* |*[Video](http://videolectures.net/deeplearning2017_thomas_safe_rl/)*


**TBD**


## Applications of bandits and recommendation systems, *Nicolas Le Roux*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137926/deeplearning2017_le_roux_recommendation_system_01.pdf)* |*[Video](http://videolectures.net/site/normal_dl/tag=1137926/deeplearning2017_le_roux_recommendation_system_01.pdf)*


**TBD**


## Cooperative Visual Dialogue with Deep RL, *Dhruv Batra & Devi Parikh*

*[Slides](http://videolectures.net/site/normal_dl/tag=1137915/deeplearning2017_parikh_batra_deep_rl.pdf)* |*[Video](http://videolectures.net/deeplearning2017_parikh_batra_deep_rl/)*


**TBD**

## References

[1] Khan, Arbaaz, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Vijay Kumar, and Daniel D. Lee. "Memory Augmented Control Networks." arXiv preprint arXiv:1709.05706 (2017), [link](https://arxiv.org/abs/1709.05706).

[2] Jiang, Nan, Alex Kulesza, Satinder Singh, and Richard Lewis. "The dependence of effective planning horizon on model accuracy." In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181-1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015, [link](http://www-personal.umich.edu/~rickl/pubs/jiang-et-al-2015-aamas.pdf).