# Reinforcement Learning

## Introduction
A reinforcement learning system consists of four elements:
- **Policy**: Defines how the agent behaves.  
A mapping from environment state to action.


- **Reward Signal**: Defines the goal of the reinforcement learning problem.  
At each step the agent receives a single number, the *reward*. The agent's goal is to maximize the total reward it receives over its lifetime.  
The reward depends on the agent's last action and on the state in which it was taken.


- **Value Function**: The expected reward in the long run.  
Whereas the reward signal says what is good or bad immediately, the value function estimates the cumulative reward expected in the long run.  
This matters because some actions look suboptimal in the short term but provide enormous value in the long run.  
  - **Examples**:
    - Sacrificing a piece in chess to gain a better position.
    - Training: it is hard right now, but yields great results later.
 
 
- **Model** (Optional): A model of the environment. It lets the agent infer how the environment will behave in response to its actions.  
Methods are either *model-based* or *model-free*.  
Model-free methods learn by trial and error, whereas model-based methods rely on planning.
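
The gap between immediate reward and long-run value can be illustrated with a small sketch. It sums the rewards along a single hypothetical trajectory, whereas a value function would estimate the *expected* sum. The function name `discounted_return`, the discount factor `gamma`, and the reward numbers are assumptions made for illustration and do not come from the notes above.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum the rewards, weighting the reward at step k by gamma**k."""
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    return total


# Hypothetical reward sequences for two plans (illustrative numbers only).
grab_now = [1.0, 0.0, 0.0]    # small reward right away, nothing afterwards
sacrifice = [-1.0, 0.0, 5.0]  # immediate loss, larger reward later

print("grab now :", discounted_return(grab_now))
print("sacrifice:", discounted_return(sacrifice))
```

Judged by the first reward alone, `grab_now` looks better. Judged by the cumulative reward, `sacrifice` wins, which is the situation the chess example describes.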



In most cases **we cannot compute the optimal solution**. There are several ways to tackle reinforcement learning: as an *optimization* problem, in which estimating the *value function* is the most important part,  
or with an *evolutionary* method, such as a genetic algorithm, that searches directly for a static policy.

- Model-free algorithms are usually a good fit when the environment is hard to model, because they do not require any knowledge of the environment to learn or evolve.

-----

**Exercise 1.1**: *Self-Play* - Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?

**Answer**: 
- A random opponent exposes our agent to more states than a purposeful player does, because it makes "bad" moves that a purposeful player would not.
- When both sides learn, learning should be faster, since we collect twice as many examples in the same time.
- Even with a purposeful opponent we still expect to see all or most of the possible actions on the board.  
  This is because the opponent also balances exploration and exploitation and will sometimes try random moves.

Combining these observations, we would expect the algorithm to end up learning the same policy (at least for a game like tic-tac-toe).

-----

**Exercise 1.2**: *Symmetries* - Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

**Answer**: 
- The tic-tac-toe board is square, so we can rotate or mirror the board to map different positions onto one another.

If we use a "normalizing" function that maps all rotated and mirrored versions of a position to a single representative, we can shrink the state space by a factor of up to 8 (4 rotations, each with or without a mirror).
This means we will:
- **Accelerate the learning process** by roughly this factor.
- **Reduce memory requirements**.
- **Reduce the look-ahead needed**.
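
A "normalizing" function of this kind could look like the sketch below. It assumes the board is stored as a row-major tuple of 9 cells holding `"X"`, `"O"`, or `"."`. The representation and the names `rotate`, `mirror`, `symmetries`, and `canonical` are illustrative assumptions, not something the notes prescribe.

```python
def rotate(board):
    """Rotate a 3x3 board (row-major tuple of 9 cells) 90 degrees clockwise."""
    return tuple(board[3 * (2 - c) + r] for r in range(3) for c in range(3))


def mirror(board):
    """Mirror a 3x3 board left to right."""
    return tuple(board[3 * r + (2 - c)] for r in range(3) for c in range(3))


def symmetries(board):
    """Return the 8 variants: 4 rotations, each with and without a mirror."""
    variants = []
    current = tuple(board)
    for _ in range(4):
        variants.append(current)
        variants.append(mirror(current))
        current = rotate(current)
    return variants


def canonical(board):
    """Pick one representative per symmetry class (the smallest tuple)."""
    return min(symmetries(board))


def single_x(index):
    """Hypothetical helper: an empty board with one X at the given cell."""
    return tuple("X" if k == index else "." for k in range(9))


# All four corner openings collapse to the same canonical position.
print(len({canonical(single_x(i)) for i in (0, 2, 6, 8)}))
```

A value table keyed by `canonical(board)` instead of `board` then stores one entry per symmetry class, so an update to one position is shared by all of its rotated and mirrored versions.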

-----

**Exercise 1.3**: *Greedy Play* - Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?

**Answer**:
- It depends on the game.
- A possible problem for a greedy player in chess, given these two options:
  - Option 1: Capture the opponent's queen this turn and then be mated.
  - Option 2: Give up your queen this turn and mate your opponent on the next turn.

  A greedy player will choose option 1, while a player with a more forward-looking value function would choose option 2, which is clearly better.
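
In the exercise's wording, a greedy player always plays the move it currently rates best, while a nongreedy player sometimes tries something else. The contrast can be sketched as below. The function `select_move`, the exploration rate `epsilon`, and the value estimates are hypothetical and chosen only for illustration.

```python
import random


def select_move(move_values, epsilon, rng):
    """Pick a move from a dict {move: estimated value}.

    With probability epsilon pick a uniformly random move (explore),
    otherwise pick the move with the highest estimate (greedy).
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(move_values))
    return max(move_values, key=move_values.get)


rng = random.Random(0)
estimates = {"a": 0.5, "b": 0.6, "c": 0.4}  # hypothetical value estimates

# epsilon = 0 is the greedy player: it never deviates from the best-rated move.
greedy_picks = {select_move(estimates, 0.0, rng) for _ in range(100)}
# epsilon > 0 is a nongreedy player: it occasionally tries other moves.
exploring_picks = {select_move(estimates, 0.3, rng) for _ in range(100)}

print("greedy   :", sorted(greedy_picks))
print("exploring:", sorted(exploring_picks))
```

With `epsilon=0.0` only the best-rated move is ever chosen, so a move whose value is currently underestimated is never tried and its estimate never gets a chance to improve.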
 
-----

**Exercise 1.4**: *Learning from Exploration* - Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

**Answer**:
- In this variation we learn from both exploitation and exploration steps *(as opposed to exploitation steps only)*.
  - Normally, exploration steps do not update their parent state.  
    The update rule is $V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]$. If we are currently at state $S_t$:
    - **Without taking exploration into account** *(normal)*: the state is only updated when we choose the move with the highest estimated winning probability.  
      This means we evaluate the *current* position based on the *optimal* moves we can make in the future.
      Since we evaluate the *perceived optimal* move at each turn, we expect $V(S_t)$ to reflect the *optimal value* of the branch.
    - **Taking exploration into account**: the state is updated whether we choose the optimal move or a random one.  
      This means we evaluate the *current* position based on *any* move we might make in the future.
      Since that move may be *random*, we can no longer expect $V(S_t)$ to reflect the *optimal value* of the branch.
      - If an optimal move was selected, the value is updated as normal.
      - If an under-performing move was selected, it lowers the value of the current state, **even though the state can actually lead to better results**.  
        This can cause our algorithm to under-perform by selecting moves that are not optimal.
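
The two variants can be compared with a minimal sketch of the temporal-difference update $V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]$. The function names, the `learn_from_exploration` flag, the state labels, and the numbers are assumptions made for illustration.

```python
def td_update(values, state, next_state, alpha):
    """Move V(state) a fraction alpha of the way toward V(next_state)."""
    values[state] += alpha * (values[next_state] - values[state])


def play_step(values, state, next_state, alpha, exploratory, learn_from_exploration):
    """Apply the update, unless the move was exploratory and such moves are skipped."""
    if exploratory and not learn_from_exploration:
        return
    td_update(values, state, next_state, alpha)


# Hypothetical estimates: from state "s" the best move leads to "good",
# while an exploratory move happens to lead to "bad".
skip = {"s": 0.5, "good": 0.9, "bad": 0.1}
learn = dict(skip)

play_step(skip, "s", "bad", 0.5, exploratory=True, learn_from_exploration=False)
play_step(learn, "s", "bad", 0.5, exploratory=True, learn_from_exploration=True)

print("V(s), exploratory moves skipped:", skip["s"])
print("V(s), exploratory moves learned:", learn["s"])
```

When exploratory moves are skipped, `V(s)` stays at its previous estimate. When they are learned from, `V(s)` is pulled toward the value of the poor successor, so it reflects the policy that actually explores rather than the best available line of play.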

-----
 

# Tic Tac Toe
## Environment & Agent