<a href="https://colab.research.google.com/github/rahiakela/reinforcement-learning-research-and-practice/blob/main/grokking-deep-reinforcement-Learning/02-mathematical-foundations/01_mathematical_foundations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Setup

In [None]:
!pip install gym

In [None]:
!pip show gym

Name: gym
Version: 0.25.2
Summary: Gym: A universal API for reinforcement learning environments
Home-page: https://www.gymlibrary.ml/
Author: Gym Community
Author-email: jkterry@umd.edu
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: cloudpickle, gym-notices, numpy
Required-by: dopamine-rl


In [None]:
import sys

sys.path.append("/usr/local/lib/python3.10/dist-packages")

In [None]:
from gym import envs

for env in envs.registry.keys():
  print(env)

ALE/Tetris-v5
ALE/Tetris-ram-v5
Adventure-v0
AdventureDeterministic-v0
AdventureNoFrameskip-v0
Adventure-v4
AdventureDeterministic-v4
AdventureNoFrameskip-v4
Adventure-ram-v0
Adventure-ramDeterministic-v0
Adventure-ramNoFrameskip-v0
Adventure-ram-v4
Adventure-ramDeterministic-v4
Adventure-ramNoFrameskip-v4
AirRaid-v0
AirRaidDeterministic-v0
AirRaidNoFrameskip-v0
AirRaid-v4
AirRaidDeterministic-v4
AirRaidNoFrameskip-v4
AirRaid-ram-v0
AirRaid-ramDeterministic-v0
AirRaid-ramNoFrameskip-v0
AirRaid-ram-v4
AirRaid-ramDeterministic-v4
AirRaid-ramNoFrameskip-v4
Alien-v0
AlienDeterministic-v0
AlienNoFrameskip-v0
Alien-v4
AlienDeterministic-v4
AlienNoFrameskip-v4
Alien-ram-v0
Alien-ramDeterministic-v0
Alien-ramNoFrameskip-v0
Alien-ram-v4
Alien-ramDeterministic-v4
Alien-ramNoFrameskip-v4
Amidar-v0
AmidarDeterministic-v0
AmidarNoFrameskip-v0
Amidar-v4
AmidarDeterministic-v4
AmidarNoFrameskip-v4
Amidar-ram-v0
Amidar-ramDeterministic-v0
Amidar-ramNoFrameskip-v0
Amidar-ram-v4
Amidar-ramDeterministic-

##Bandit walk environment

The bandit walk (BW) is a simple grid-world (GW) environment. GWs are grids of any size and are a common type of environment for studying RL algorithms. A GW can have any model (transition and reward functions) you can think of and can make any set of actions available.

![](https://github.com/rahiakela/reinforcement-learning-research-and-practice/blob/main/grokking-deep-reinforcement-Learning/images/bw.png?raw=1)

A graphical representation of the BW environment would look like the following.

![](https://github.com/rahiakela/reinforcement-learning-research-and-practice/blob/main/grokking-deep-reinforcement-Learning/images/bw-graph.png?raw=1)


Bandit Walk:
* deterministic environment (100% action success)
* 1 non-terminal state, 2 terminal states
* the only reward is at the right-most cell in the "walk"
* episodic environment, the agent terminates at the left- or right-most cell (after 1 action selection -- any action)
* agent starts in state 1 (middle of the walk) T-1-T
* actions left (0) or right (1)




In [None]:
P = {
    # The outer dictionary keys are the states
    0: {
        # The inner dictionary keys are the actions
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True)]
    },
    1: {
        # Each transition tuple has four values: the probability of the transition,
        # the next state, the reward, and a flag indicating whether the next state is terminal
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 2, 1.0, True)]
    },
    2: {
        0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]
    }
}
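To make the dictionary layout concrete, the following is a minimal sketch that walks the BW MDP, prints every transition, and checks that the probabilities for each state-action pair sum to 1. The helper name `check_mdp` is hypothetical and is not part of gym.

```python
# Bandit walk MDP in the same layout as the cell above:
# P[state][action] -> list of (probability, next_state, reward, done)
P = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def check_mdp(P, tol=1e-9):
    """Hypothetical helper: True if, for every (state, action) pair, the
    probabilities sum to 1 and every next state is a key of P."""
    for actions in P.values():
        for transitions in actions.values():
            total = sum(prob for prob, _, _, _ in transitions)
            if abs(total - 1.0) > tol:
                return False
            if any(next_state not in P for _, next_state, _, _ in transitions):
                return False
    return True


# Print every transition in a readable form
for state, actions in P.items():
    for action, transitions in actions.items():
        for prob, next_state, reward, done in transitions:
            print(f"s={state} a={action} -> s'={next_state} "
                  f"p={prob} r={reward} done={done}")

print("valid:", check_mdp(P))
```

Because every state-action pair in the BW has a single transition with probability 1.0, the check passes trivially; it becomes more useful once transitions are stochastic.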

In [None]:
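# You can also load the MDP this way, assuming gym is imported and an environment
# with the ID "BanditWalk-v0" has been registered (it is not part of core gym)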
# P = gym.make("BanditWalk-v0").env.P
# P

##Bandit slippery walk environment

Okay, so how about we make this environment stochastic?

Let's say the surface of the walk is slippery and each action has a 20% chance of sending the agent backwards. I call this environment the bandit slippery walk (BSW).

The layout is the same as in the BW, but the transition function is different: 80% of the time the agent moves to the intended cell, and 20% of the time it moves in the opposite direction.

How do we know that the action effects are stochastic?

The graphical and table representations can help us with that.

![](https://github.com/rahiakela/reinforcement-learning-research-and-practice/blob/main/grokking-deep-reinforcement-Learning/images/bsw-graph.png?raw=1)

Representing environments as MDPs is a surprisingly powerful and straightforward approach to modeling complex sequential decision-making problems under uncertainty.

In [None]:
P = {
    # Look at the terminal states. States 0 and 2 are terminal
    0: {
        # The inner dictionary keys are the actions
        0: [(1.0, 0, 0.0, True)],
        1: [(1.0, 0, 0.0, True)]
    },
    1: {
        # This is how you build stochastic transitions. This is state 1, action 0.
        0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)]
    },
    2: {
        0: [(1.0, 2, 0.0, True)],
        1: [(1.0, 2, 0.0, True)]
    }
}
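With stochastic transitions, the immediate reward of an action is an expectation over its possible outcomes. As a rough sketch, the expected one-step reward for a state-action pair is the probability-weighted sum of the rewards in its transition list. The function name `expected_reward` is an illustrative assumption and not part of gym.

```python
# Bandit slippery walk MDP in the same layout as the cell above:
# P[state][action] -> list of (probability, next_state, reward, done)
P = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {
        0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)],
    },
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def expected_reward(P, state, action):
    """Hypothetical helper: probability-weighted sum of the rewards
    over all transitions of one (state, action) pair."""
    return sum(prob * reward for prob, _, reward, _ in P[state][action])


# State 1 is the only non-terminal state, so compare its two actions
for action in P[1]:
    print(f"action {action}: expected reward = {expected_reward(P, 1, action):.1f}")
# action 0: expected reward = 0.2
# action 1: expected reward = 0.8
```

Moving left still earns 0.2 on average because of the 20% chance of slipping to the right-most cell, while moving right earns 0.8.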

In [None]:
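# You can also load the MDP this way, assuming gym is imported and an environment
# with the ID "BanditSlipperyWalk-v0" has been registered (it is not part of core gym)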
# P = gym.make("BanditSlipperyWalk-v0").env.P
# P
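Besides reading the graph or the table, the question of whether action effects are stochastic can also be answered directly from the dictionary: an MDP is stochastic if some state-action pair has more than one distinct outcome with non-zero probability. The sketch below also samples transitions according to their probabilities. The helpers `is_stochastic` and `sample_transition` are hypothetical names, not gym functions.

```python
import random

# P[state][action] -> list of (probability, next_state, reward, done)
bandit_walk = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
bandit_slippery_walk = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
    1: {
        0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
        1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)],
    },
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}


def is_stochastic(P):
    """Hypothetical helper: True if any (state, action) pair has more than
    one distinct outcome with non-zero probability."""
    for actions in P.values():
        for transitions in actions.values():
            outcomes = {(ns, r, d) for prob, ns, r, d in transitions if prob > 0}
            if len(outcomes) > 1:
                return True
    return False


def sample_transition(P, state, action, rng):
    """Hypothetical helper: draw one (next_state, reward, done) outcome
    according to the transition probabilities."""
    transitions = P[state][action]
    weights = [prob for prob, _, _, _ in transitions]
    _, next_state, reward, done = rng.choices(transitions, weights=weights, k=1)[0]
    return next_state, reward, done


print("BW stochastic: ", is_stochastic(bandit_walk))
print("BSW stochastic:", is_stochastic(bandit_slippery_walk))

# Estimate how often action 1 (right) from state 1 reaches state 2
rng = random.Random(0)
n_samples = 10_000
hits = sum(
    sample_transition(bandit_slippery_walk, 1, 1, rng)[0] == 2
    for _ in range(n_samples)
)
print(f"fraction reaching state 2: {hits / n_samples:.3f}")
```

The sampled fraction should land close to 0.8, matching the probability written in the transition tuple.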

##Frozen lake environment