# The RL Framework: The Problem

## The setting

Agent

Reward - State - Action (these need to be defined to frame the problem)

Environment

## Playing Chess

The reward is only delivered at the end of the game, and, let’s say, is +1 if you win, and -1 if you lose.

This is an episodic task, where an episode finishes when the game ends. The idea is that by playing the game many times, or by interacting with the environment in many episodes, you can learn to play chess better and better.

If you lose a game (and get a reward of -1 at the end of the episode), it’s unclear when exactly you went wrong: maybe you played so badly that every move was horrible, or maybe you played beautifully for the majority of the game and then made only a small mistake at the end.

When the reward signal is largely uninformative in this way, we say that the task suffers from the problem of sparse rewards.
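
To make the sparsity concrete, here is a minimal sketch of the reward sequence for one game. The function name `chess_episode_rewards` is hypothetical, and the reward of 0 on every non-final move is an assumption (the notes only say that the reward is delivered at the end).

```python
def chess_episode_rewards(num_moves: int, won: bool) -> list[float]:
    """Reward sequence for one game: zero on every move except the last.

    Assumption: intermediate moves yield a reward of 0.
    """
    if num_moves < 1:
        raise ValueError("an episode needs at least one move")
    final_reward = 1.0 if won else -1.0
    return [0.0] * (num_moves - 1) + [final_reward]


rewards = chess_episode_rewards(num_moves=40, won=False)
# Time steps at which the agent receives any nonzero signal.
nonzero_steps = [t for t, r in enumerate(rewards) if r != 0.0]
print(nonzero_steps)
```

Only the last step carries information, so the reward sequence alone cannot tell the agent which of the earlier moves caused the loss.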

## Episodic or Continuing?

Remember:

    A task is an instance of the reinforcement learning (RL) problem.
    Continuing tasks are tasks that continue forever, without end.
    Episodic tasks are tasks with a well-defined starting and ending point.
        In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
        Episodic tasks come to an end whenever the agent reaches a terminal state.

## The Reward Hypothesis

![maxreward.png](attachment:maxreward.png)

![goalAndreward.png](attachment:goalAndreward.png)

Example: for an agent that must learn how to walk and how to plan, the state includes the position and velocity of all of the joints, plus measurements of the ground, such as its slope and any obstacles.

Maximise expected cumulative reward.

![whataretherewards.png](attachment:whataretherewards.png)


## Cumulative Reward

![cumulativeReward.png](attachment:cumulativeReward.png)

## Discounted Return

The present is more valuable than the future.

![discountedreturn.png](attachment:discountedreturn.png)

A common choice for the discount rate is 0.9.
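
As a rough illustration, the discounted return can be computed from a finite list of rewards by working backwards through it. The function name `discounted_return` and the sample rewards are assumptions made for this sketch, not part of the original notes.

```python
def discounted_return(rewards, gamma):
    """Compute R_1 + gamma * R_2 + gamma^2 * R_3 + ... for a finite reward list."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Work backwards: the return from a step is its reward plus
    # gamma times the return from the following step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


sample_rewards = [0.0, 0.0, 1.0]  # hypothetical: reward arrives on the third step
print(discounted_return(sample_rewards, gamma=0.9))
print(discounted_return(sample_rewards, gamma=1.0))
```

With a discount rate of 0.9, a reward that arrives two steps later is scaled by 0.9 squared, so it counts for less than the same reward received immediately.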


## Markov Decision Process


In general, the state space S is the set of all nonterminal states. 

In episodic tasks, we use S+ to refer to the set of all states, including terminal states. 

The action space A is the set of possible actions available to the agent.

In the event that there are some states where only a subset of the actions are available, we can also use A(s) to refer to the set of actions available in state s∈S.
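
One simple way to represent state-dependent action sets is a dictionary from each state to its available actions. The sketch below assumes a recycling-robot-style example with two battery levels; the state and action names are illustrative assumptions and may not match the figure exactly.

```python
# Assumed state space: the robot's battery level.
STATES = {"high", "low"}

# Assumed action space: everything the robot can ever do.
ACTIONS = {"search", "wait", "recharge"}

# A(s): recharging is only offered when the battery is low.
AVAILABLE_ACTIONS = {
    "high": {"search", "wait"},
    "low": {"search", "wait", "recharge"},
}


def actions_in(state):
    """Return A(s), the set of actions available in `state`."""
    return AVAILABLE_ACTIONS[state]


for s in sorted(STATES):
    print(s, sorted(actions_in(s)))
```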

![canHuntingRobotStates.png](attachment:canHuntingRobotStates.png)

![MDPprocess.png](attachment:MDPprocess.png)


## The Setting, Revisited

![summarySetting.png](attachment:summarySetting.png)

    The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.
    At each time step, the agent receives the environment's state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded appropriately to the state) and a new state.
    All agents have the goal to maximize expected cumulative reward, or the expected sum of rewards attained over all time steps.
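
The interaction described above can be sketched as a loop. `CorridorEnv`, its `reset`/`step` methods, and the policies below are hypothetical names invented for this illustration; they are not taken from any particular library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: start in cell 0, the episode ends at cell `length`."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is +1 (right) or -1 (left); the agent cannot move below cell 0.
        self.position = max(0, self.position + action)
        done = self.position == self.length  # terminal state reached
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the (undiscounted) sum of rewards."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent picks an action for the state
        state, reward, done = env.step(action)  # one step later: reward and new state
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    return random.choice([-1, +1])


print(run_episode(CorridorEnv(), random_policy))
```

The `max_steps` cap is a practical safeguard for this sketch, so that a policy which never reaches the terminal state still returns.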

Episodic vs. Continuing Tasks

    A task is an instance of the reinforcement learning (RL) problem.
    Continuing tasks are tasks that continue forever, without end.
    Episodic tasks are tasks with a well-defined starting and ending point.
        In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
        Episodic tasks come to an end whenever the agent reaches a terminal state.

The Reward Hypothesis

    Reward Hypothesis: All goals can be framed as the maximization of (expected) cumulative reward.

Goals and Rewards

    (Please see Part 1 and Part 2 to review an example of how to specify the reward signal in a real-world problem.)

Cumulative Reward

    The return at time step $t$ is $G_t := R_{t+1} + R_{t+2} + R_{t+3} + \ldots$
    The agent selects actions with the goal of maximizing expected (discounted) return. (Note: discounting is covered in the next concept.)

Discounted Return

    The discounted return at time step $t$ is $G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$.
    The discount rate $\gamma$ is something that you set, to refine the goal that you have for the agent.
        It must satisfy $0 \leq \gamma \leq 1$.
        If $\gamma=0$, the agent only cares about the most immediate reward.
        If $\gamma=1$, the return is not discounted.
        For larger values of $\gamma$, the agent cares more about the distant future. Smaller values of $\gamma$ result in more extreme discounting, where, in the most extreme case, the agent only cares about the most immediate reward.
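
As a worked illustration (an added example, not from the original notes), suppose the task never ends and every reward equals 1. For $0 \leq \gamma < 1$ the discounted return is then a geometric series:

$$G_t = 1 + \gamma + \gamma^2 + \ldots = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$$

So $\gamma = 0$ gives $G_t = 1$ (only the immediate reward counts), $\gamma = 0.9$ gives $G_t = 10$, and $\gamma = 0.99$ gives $G_t = 100$. With $\gamma = 1$ the sum diverges in this setting, which is one reason discounting is useful for continuing tasks.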

MDPs and One-Step Dynamics

    The state space $\mathcal{S}$ is the set of all (nonterminal) states.
    In episodic tasks, we use $\mathcal{S}^+$ to refer to the set of all states, including terminal states.
    The action space $\mathcal{A}$ is the set of possible actions. (Alternatively, $\mathcal{A}(s)$ refers to the set of possible actions available in state $s \in \mathcal{S}$.)
    (Please see Part 2 to review how to specify the reward signal in the recycling robot example.)
    The one-step dynamics of the environment determine how the environment decides the state and reward at every time step. The dynamics can be defined by specifying $p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_{t} = s, A_{t}=a)$ for each possible $s', r, s, \text{and } a$.
    A (finite) Markov Decision Process (MDP) is defined by:
        a (finite) set of states $\mathcal{S}$ (or $\mathcal{S}^+$, in the case of an episodic task)
        a (finite) set of actions $\mathcal{A}$
        a set of rewards $\mathcal{R}$
        the one-step dynamics of the environment
        the discount rate $\gamma \in [0,1]$
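
A finite MDP's one-step dynamics can be stored as a lookup table. The sketch below assumes a recycling-robot-style MDP; the state names, action names, rewards, and probabilities are illustrative values chosen for this example, not figures from the notes.

```python
import random

# DYNAMICS[(s, a)] lists (next_state, reward, probability) triples,
# i.e. the nonzero entries of p(s', r | s, a). All numbers are assumed.
DYNAMICS = {
    ("high", "search"): [("high", 4.0, 0.7), ("low", 4.0, 0.3)],
    ("high", "wait"): [("high", 1.0, 1.0)],
    ("low", "search"): [("low", 4.0, 0.6), ("high", -3.0, 0.4)],
    ("low", "wait"): [("low", 1.0, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}


def p(next_state, reward, state, action):
    """Return p(s', r | s, a); outcomes not listed have probability 0."""
    return sum(
        prob
        for s2, r, prob in DYNAMICS[(state, action)]
        if s2 == next_state and r == reward
    )


def check_dynamics(dynamics):
    """For every (s, a), the outcome probabilities must sum to 1."""
    for outcomes in dynamics.values():
        total = sum(prob for _, _, prob in outcomes)
        assert abs(total - 1.0) < 1e-9


def sample_step(state, action):
    """Sample (next_state, reward) according to the one-step dynamics."""
    outcomes = DYNAMICS[(state, action)]
    weights = [prob for _, _, prob in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


check_dynamics(DYNAMICS)
print(p("low", 4.0, "high", "search"))
print(sample_step("low", "recharge"))
```

Because the next state and reward depend only on the current state and action, this table (together with the discount rate) is all that is needed to simulate the environment.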

