<a href="https://colab.research.google.com/github/shengy90/reinforcement-learning-an-introduction/blob/master/Chapter_1_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1.1 Reinforcement Learning

**What is Reinforcement Learning?**
- The learner (or agent) is not told which actions to take; instead it must discover, through trial and error, which actions yield the most reward.
- In the most interesting and challenging cases, actions affect not only the immediate reward but also future rewards.
- These two characteristics, *trial-and-error search* and *delayed reward*, are the two most distinguishing features of Reinforcement Learning.

**Reinforcement Learning is different to other branches of machine learning in the following sense:**
- unlike *supervised learning*, the agent isn't given a labelled dataset from which to extract patterns and make predictions
- unlike *unsupervised learning*, it's also not about finding hidden structure within unlabelled datasets
- reinforcement learning is about *maximising a reward signal* and finding out the best course of action to perform to achieve that

# 1.2 Examples of Reinforcement Learning Problems

1. **A game of chess** - the player needs to plan and anticipate future moves, and also assess the desirability of particular positions and moves.
2. **Newborn calf learning to walk** - a newborn calf struggles to stand on its feet, but could be running within a few hours of being born.

These examples all share the following features:

- all involve *interaction* between an active *decision-making agent (the learner)* and its *environment* 
- the agent seeks to achieve a *goal* despite *uncertainty* about its environment 
- the agent's actions are permitted to affect the future state of the environment
- choosing correctly requires taking into account the indirect and delayed consequences of actions, and thus may require foresight or planning
- the effects of actions cannot be fully predicted, so the agent must constantly monitor its environment and respond to any feedback signals
- the agent can, however, use its experience to improve its performance over time
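
The shared structure above can be sketched as a simple interaction loop. This is only an illustration, not something from the book: `CorridorEnv` and `run_episode` are hypothetical names, the corridor task is made up, and the agent here just acts at random rather than learning.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot leave the corridor
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # feedback signal from the environment
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Let the agent interact with the environment until the goal or a step limit."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)           # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
print(run_episode(CorridorEnv(), lambda state: random.choice([-1, 1])))
```

Swapping the random `choose_action` for something that improves with experience is, loosely speaking, what the rest of these notes are about.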

# 1.3 Elements of Reinforcement Learning

**There are 4 main subelements of a reinforcement learning (RL) system:**
1. a `policy`
2. a `reward signal`
3. a `value function`
4. (optionally) a `model` of the environment

**`Policy` 👮‍♀️:** defines the learner's way of behaving. A policy is a mapping from perceived states of the environment to the actions to be taken within those states. 
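
As a rough sketch of "a mapping from states to actions", a policy can be as simple as a lookup table. The state and action names below are made up for illustration, and `act` and `sample_action` are hypothetical helpers. The second table shows a stochastic policy, where each state maps to probabilities over actions instead of a single action.

```python
import random

# Deterministic policy: a lookup table from perceived state to action.
# These state and action names are invented for the example.
policy = {
    "low_battery": "recharge",
    "room_dirty": "clean",
    "room_clean": "wait",
}


def act(state):
    """Return the action the policy prescribes for this state."""
    return policy[state]


# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "room_dirty": {"clean": 0.9, "wait": 0.1},
}


def sample_action(state, rng=random):
    """Draw an action according to the probabilities listed for this state."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


print(act("room_dirty"))
print(sample_action("room_dirty"))
```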

**`Reward signal` 🤑:** defines the goal of an RL problem. On each time step, the environment sends the agent a single number (the reward). The agent's sole objective is to *maximise the total reward it receives over the long run*. 

**`Value Function` ⚖️:** whilst a reward signal indicates what is good right now, the `value function` specifies what is good *in the long run*. TL;DR, the `value` of a state is the total reward the agent can expect to accumulate starting from that state, with *future rewards discounted back to the present*. 

> `Rewards` are in a sense *primary* whereas `values`, as predictions of rewards, are secondary. Without reward there can be no value. The only purpose of estimating `values` is to achieve more `reward`. Nevertheless, we're most concerned with values when making and evaluating decisions, because we care about getting the greatest rewards over the long run. **Unfortunately, values are hard to determine; they must be estimated and re-estimated from the sequences of observations an agent makes over its lifetime**.
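
To make the difference between reward and value a bit more concrete, the sketch below adds up a sequence of rewards with later rewards discounted. The discount factor `gamma`, the helper name `discounted_return` and the reward numbers are all assumptions for illustration; in a real problem the future rewards are unknown, which is exactly why values have to be estimated.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each weighted by gamma ** (number of steps into the future)."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after)
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# No immediate reward, but a large reward two steps later
patient_path = [0.0, 0.0, 10.0]
# A small immediate reward and nothing afterwards
greedy_path = [1.0, 0.0, 0.0]

print(discounted_return(patient_path))  # roughly 8.1
print(discounted_return(greedy_path))   # 1.0
```

The first path starts with a lower reward but ends up with a higher long-run total, which is the sense in which a state can have low reward and high value.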

**`Model` 🗿:** a model is something that mimics (or predicts) the behaviour of the environment. RL methods span a spectrum from *model-based* methods, which use a model to plan ahead, to *model-free* methods, which are explicitly trial-and-error learners. 



# 1.5 An Extended Example: Tic-Tac-Toe

**We're all familiar with the game of Tic-Tac-Toe. How might we solve this problem?**

- **Classical *minimax* solution from game theory:** not appropriate here, because it assumes a particular way of playing by the opponent, which is too prescriptive.
- **Dynamic programming:** where we iteratively evaluate all possible future actions to arrive at an optimal solution. But this requires a complete specification of how the opponent will respond, which is information we often do not have. 
- **Learning from experience:** where we play many games against the opponent to learn their behaviour to a certain level of confidence, then try to compute an optimal solution.
- **Evolutionary method:** search the space of possible policies (rules that say which move to make for every possible configuration of Xs and Os) for one with a high probability of winning, where that probability is estimated by playing against the opponent *a lot of times*.

This is not an exhaustive list, just an illustration of some possible approaches to the problem. **However (as this is an RL book), we'll focus on an approach that makes use of a `value function`:**
1. Set up a table of numbers, one for each possible state of the game. 
2. Each number is the *latest estimate* of the probability of winning from that state - an estimate of the state's `value`.
3. Set the initial value of all states to 0.5 (an arbitrary guess, representing a 50% chance of winning). Set every state with 3 Xs in a row to 1 (we have already won), and every state with 3 Os in a row to 0 (we have already lost).
4. Play *many* games against the opponent. 
5. At every turn, we look at the states that would result from each of our possible moves and look up their current values. 
6. Most of the time, we move *greedily* (selecting the move with highest value). But occasionally, we take a random move (called an *exploratory* move) - it could let us discover new states we've never seen before. 
7. After each greedy move, we update the value of the earlier state to bring it closer to the value of the state that follows, so that the values become more accurate estimates of the probabilities of winning:
$$V(S_{t}) \leftarrow V(S_{t}) + \alpha \Big[V(S_{t+1}) - V(S_{t})\Big],$$
where $S_{t}$ is the state before the greedy move, $S_{t+1}$ is the state after it, and $\alpha$ (a small positive fraction) is the learning rate.
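
The steps above can be sketched in a few dozen lines of Python. This is a minimal sketch under several assumptions, not the book's implementation: all function names are hypothetical, we play X and move first, the opponent plays uniformly at random, `alpha=0.1` and `epsilon=0.1` are arbitrary choices, table entries are created lazily on first visit, and a full board with no winner is scored 0 because it can no longer be won. The states tracked are the positions right after our own moves, plus an extra update towards the final position when the opponent wins.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def is_over(board):
    return winner(board) is not None or " " not in board


def initial_value(board):
    """Steps 2-3: 1 for a win, 0 for a loss, 0.5 otherwise."""
    if winner(board) == "X":
        return 1.0
    if is_over(board):  # lost, or a full board (assumption: draws score 0)
        return 0.0
    return 0.5


def value(values, board):
    """Look up a state's value, creating the table entry on first visit."""
    return values.setdefault(board, initial_value(board))


def place(board, index, mark):
    return board[:index] + (mark,) + board[index + 1:]


def choose_move(values, board, epsilon, rng):
    """Steps 5-6: mostly greedy, occasionally exploratory."""
    candidates = [place(board, i, "X") for i in range(9) if board[i] == " "]
    if rng.random() < epsilon:
        return rng.choice(candidates), False
    return max(candidates, key=lambda b: value(values, b)), True


def td_update(values, state, next_state, alpha):
    """Step 7: move V(state) a fraction alpha towards V(next_state)."""
    v, v_next = value(values, state), value(values, next_state)
    values[state] = v + alpha * (v_next - v)


def play_game(values, alpha=0.1, epsilon=0.1, rng=random):
    """Play one game as X against a random O player, learning as we go."""
    board, previous = (" ",) * 9, None
    while True:
        board, greedy = choose_move(values, board, epsilon, rng)
        if previous is not None and greedy:
            td_update(values, previous, board, alpha)
        previous = board
        if is_over(board):
            return winner(board)
        empty = [i for i in range(9) if board[i] == " "]
        board = place(board, rng.choice(empty), "O")
        if is_over(board):  # only possible here if O has just won
            td_update(values, previous, board, alpha)
            return winner(board)


values = {}
rng = random.Random(0)
results = [play_game(values, rng=rng) for _ in range(5000)]
print("wins in the last 1000 games:", results[-1000:].count("X"))
```

Note that `play_game` returns `None` for a draw, and that no update is made after an exploratory move, matching step 7.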

**The key difference between evolutionary methods vs value-function methods:**
- evolutionary methods rely on playing the game many, many times to obtain unbiased estimates of the probability of winning for each candidate policy. Only the resulting 'trained model' is used in production, and it doesn't take into account, or respond to, what happens during a game. 
- value function methods, however, allow individual states to be evaluated, so they can take into account information that becomes available during play, including situations never seen before. 
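
To illustrate the first bullet, here is a rough sketch of how an evolutionary-style method might score a policy: play many games and count wins, ignoring everything that happens during each game. The names `estimate_win_rate`, `policy_a` and `policy_b` are hypothetical, and the "games" are coin flips with a hidden win probability standing in for real Tic-Tac-Toe games.

```python
import random


def estimate_win_rate(play_game, n_games=1000):
    """Score a fixed policy by the fraction of games it wins.

    Only the final outcome of each game is counted; nothing that happens
    during a game feeds back into the score.
    """
    wins = sum(1 for _ in range(n_games) if play_game())
    return wins / n_games


rng = random.Random(0)

# Stand-ins for real policies: each "policy" simply wins with a hidden probability.
hidden_win_probability = {"policy_a": 0.2, "policy_b": 0.8}

scores = {
    name: estimate_win_rate(lambda p=p: rng.random() < p)
    for name, p in hidden_win_probability.items()
}
best_policy = max(scores, key=scores.get)
print(best_policy, scores)
```

By contrast, the value-function update shown earlier in this section changes the estimate of an individual state after a single move, without waiting for the game to end.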