# Chapter 17 Reinforcement Learning

## Chapter 17.1 Markov Decision Process (MDP)

* The reinforcement learning problem is typically modeled using Markov Decision Processes. A Markov decision process (MDP) is defined by a tuple of four entities $(\mathcal{S}, \mathcal{A}, T, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T$ is the transition function that encodes the transition probabilities of the MDP, and $r$ is the immediate reward obtained by taking an action at a particular state.
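
To make the four-entity tuple concrete, here is a minimal sketch of a two-state, two-action MDP written as plain Python data. Every state, action, probability, and reward below is made up for illustration, and the `(probability, next_state, reward)` layout is only an assumption chosen to resemble the per-transition tuples indexed in the Value Iteration code later in this chapter. The helpers `is_valid_mdp` and `expected_reward` are hypothetical names, not part of `d2l` or Gym.

```python
# A toy MDP; all numbers are illustrative assumptions, not FrozenLake values.
states = [0, 1]   # state space
actions = [0, 1]  # action space

# Transition function and reward together:
# mdp[(s, a)] is a list of (probability, next_state, reward) tuples.
mdp = {
    (0, 0): [(1.0, 0, 0.0)],                 # stay in state 0, no reward
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],  # try to move to state 1
    (1, 0): [(1.0, 0, 0.0)],                 # move back to state 0
    (1, 1): [(1.0, 1, 0.5)],                 # stay in state 1, small reward
}


def is_valid_mdp(mdp, states, actions, tol=1e-9):
    """Check that the outgoing probabilities of every (s, a) pair sum to 1."""
    for s in states:
        for a in actions:
            total = sum(p for p, _, _ in mdp[(s, a)])
            if abs(total - 1.0) > tol:
                return False
    return True


def expected_reward(mdp, s, a):
    """Immediate reward of (s, a), averaged over the possible next states."""
    return sum(p * r for p, _, r in mdp[(s, a)])


print(is_valid_mdp(mdp, states, actions))
print(expected_reward(mdp, 0, 1))
```

Storing transitions per `(state, action)` pair keeps the transition function and the reward in one lookup, which is convenient for the dynamic-programming loops used later.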

## Chapter 17.2 Value Iteration

In [1]:
%matplotlib inline
import random
import numpy as np
from d2l import torch as d2l

* Gym is a standard API for reinforcement learning, and a diverse collection of reference environments.

In [3]:
!pip install gym

Collecting gym
  Downloading gym-0.26.2.tar.gz (721 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting gym-notices>=0.0.4 (from gym)
  Obtaining dependency information for gym-notices>=0.0.4 from https://files.pythonhosted.org/packages/25/26/d786c6bec30fe6110fd3d22c9a273a2a0e56c0b73b93e25ea1af5a53243b/gym_notices-0.0.8-py3-none-any.whl.metadata
  Downloading gym_notices-0.0.8-py3-none-any.whl.metadata (1.0 kB)
Downloading gym_notices-0.0.8-py3-none-any.whl (3.0 kB)
Building wheels for collected packages: gym
  Building wheel for gym (pyproject.toml) ... done
  Created wheel for gym: filename=gym-0.26.2-py3-none-any.whl size=827620 sha256=dcd98d7c66e535c46c35ed7f9d04d75f69cdd173e83e1abea2b15bb6c6352ff9
  S

In [4]:
seed = 0
gamma = 0.95
num_iters = 10
random.seed(seed)
np.random.seed(seed)

env_info = d2l.make_env('FrozenLake-v1', seed=seed)

AttributeError: 'FrozenLakeEnv' object has no attribute 'seed'

In [7]:
def value_iteration(env_info, gamma, num_iters):
    
    env_desc = env_info['desc']
    prob_idx = env_info['trans_prob_idx']
    nextstate_idx = env_info['nextstate_idx']
    reward_idx = env_info['reward_idx']
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']
    mdp = env_info['mdp']
    
    V = np.zeros((num_iters + 1, num_states))
    Q = np.zeros((num_iters + 1, num_states, num_actions))
    pi = np.zeros((num_iters + 1, num_states))
    
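    for k in range(1, num_iters + 1):
        for s in range(num_states):
            for a in range(num_actions):
                # Expected one-step return of action a in state s, summed
                # over all possible transitions out of (s, a)
                for pxrds in mdp[(s, a)]:
                    pr = pxrds[prob_idx]  # Transition probability
                    nextstate = pxrds[nextstate_idx]  # Next state
                    reward = pxrds[reward_idx]  # Immediate reward
                    Q[k, s, a] += pr * (reward + gamma * V[k-1, nextstate])

            # Record the best value and the greedy action for state s
            V[k, s] = np.max(Q[k, s, :])
            pi[k, s] = np.argmax(Q[k, s, :])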
            
    d2l.show_value_function_progress(env_desc, V[:-1], pi[:-1])
    
value_iteration(env_info=env_info, gamma=gamma, num_iters=num_iters)

NameError: name 'env_info' is not defined

* The main idea behind the Value Iteration algorithm is to use dynamic programming to find the optimal expected return obtainable from a given state. Note that implementing Value Iteration requires complete knowledge of the Markov decision process (MDP), in particular its transition and reward functions.
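
Because the `d2l.make_env` call fails in this notebook, the following is a minimal, self-contained sketch of the same backup on a hand-written toy MDP, so the idea can be checked without Gym. The MDP, its rewards, and the function name `toy_value_iteration` are assumptions made for illustration; only the update `pr * (reward + gamma * V[nextstate])` followed by a max over actions mirrors the `value_iteration` cell above.

```python
import numpy as np

# Toy deterministic MDP: mdp[(s, a)] = list of (probability, next_state, reward).
# All numbers are illustrative assumptions.
mdp = {
    (0, 0): [(1.0, 0, 0.0)],  # stay in state 0
    (0, 1): [(1.0, 1, 1.0)],  # move to state 1, reward 1
    (1, 0): [(1.0, 0, 0.0)],  # move back to state 0
    (1, 1): [(1.0, 1, 2.0)],  # stay in state 1, reward 2
}
num_states, num_actions, gamma = 2, 2, 0.5


def toy_value_iteration(mdp, num_states, num_actions, gamma, num_iters):
    """Repeat the Bellman optimality backup num_iters (>= 1) times."""
    V = np.zeros(num_states)
    for _ in range(num_iters):
        Q = np.zeros((num_states, num_actions))
        for s in range(num_states):
            for a in range(num_actions):
                for pr, nextstate, reward in mdp[(s, a)]:
                    Q[s, a] += pr * (reward + gamma * V[nextstate])
        V = Q.max(axis=1)       # best achievable value per state
    return V, Q.argmax(axis=1)  # values and greedy policy


V, pi = toy_value_iteration(mdp, num_states, num_actions, gamma, num_iters=50)
print(V, pi)
```

For this toy problem the fixed point can be worked out by hand: staying in state 1 forever is worth 2 / (1 - 0.5) = 4, and moving there from state 0 is worth 1 + 0.5 * 4 = 3, so the iterates should approach those values with action 1 greedy in both states.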

## Chapter 17.3 Q Learning

In [8]:
%matplotlib inline
import random
import numpy as np
from d2l import torch as d2l

seed = 0  # Random number generator seed
gamma = 0.95  # Discount factor
num_iters = 256  # Number of iterations
alpha   = 0.9  # Learning rate
epsilon = 0.9  # Epsilon in the epsilon-greedy algorithm
random.seed(seed)  # Set the random seed
np.random.seed(seed)

# Now set up the environment
env_info = d2l.make_env('FrozenLake-v1', seed=seed)

AttributeError: 'FrozenLakeEnv' object has no attribute 'seed'

In [9]:
def e_greedy(env, Q, s, epsilon):
    if random.random() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q[s, :])

In [11]:
def q_learning(env_info, gamma, num_iters, alpha, epsilon):
    
    env_desc = env_info['desc']
    env = env_info['env']
    num_states = env_info['num_states']
    num_actions = env_info['num_actions']
    
    Q = np.zeros((num_states, num_actions))
    V = np.zeros((num_iters + 1, num_states))
    pi = np.zeros((num_iters + 1, num_states))
    
    for k in range(1, num_iters + 1):
        state, done = env.reset(), False
        
        while not done:
            action = e_greedy(env, Q, state, epsilon)
            next_state, reward, done, _ = env.step(action)
            
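            # Target: reward plus the discounted best Q-value at the next state
            y = reward + gamma * np.max(Q[next_state, :])
            # Move Q[state, action] toward the target with step size alpha
            Q[state, action] = Q[state, action] + alpha * (y - Q[state, action])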
            
            state = next_state
            
        for s in range(num_states):
            V[k, s] = np.max(Q[s, :])
            pi[k, s] = np.argmax(Q[s, :])
            
    d2l.show_Q_function_progress(env_desc, V[:-1], pi[:-1])
    
q_learning(env_info=env_info, gamma=gamma, num_iters=num_iters, alpha=alpha, epsilon=epsilon)

NameError: name 'env_info' is not defined

* Q-learning is one of the most fundamental reinforcement-learning algorithms. It has been at the epicenter of the recent success of reinforcement learning, most notably in learning to play video games (Mnih et al., 2013). 
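
Unlike Value Iteration, the update in `q_learning` above only needs sampled transitions, not the transition and reward functions. The sketch below illustrates this on a tiny corridor environment that does not depend on Gym. `ChainEnv` and `toy_q_learning` are hypothetical names, and the environment, its reward, and the hyperparameters are assumptions chosen for illustration; only the epsilon-greedy action choice and the update rule follow the cells above.

```python
import random
import numpy as np


class ChainEnv:
    """Hypothetical 4-state corridor: start at 0, reward 1 on reaching state 3."""
    num_states, num_actions = 4, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clip so the agent cannot leave the corridor
        self.state = min(max(self.state + move, 0), self.num_states - 1)
        done = self.state == self.num_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def toy_q_learning(env, gamma, alpha, epsilon, num_episodes, seed=0):
    rng = random.Random(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit
            if rng.random() < epsilon:
                action = rng.randrange(env.num_actions)
            else:
                action = int(np.argmax(Q[state, :]))
            next_state, reward, done = env.step(action)
            # Same update as in q_learning; Q at the terminal state stays 0
            y = reward + gamma * np.max(Q[next_state, :])
            Q[state, action] += alpha * (y - Q[state, action])
            state = next_state
    return Q


Q = toy_q_learning(ChainEnv(), gamma=0.9, alpha=0.5, epsilon=0.5, num_episodes=500)
print(np.argmax(Q[:3], axis=1))  # greedy action in the non-terminal states
```

Since this corridor is deterministic, the learned values for moving right should approach 1, 0.9, and 0.81 at distances of one, two, and three steps from the goal, so the greedy policy moves right in every non-terminal state.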

In [12]:
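# 3.12: the code can't run because the d2l.make_env call fails with
# "AttributeError: 'FrozenLakeEnv' object has no attribute 'seed'",
# likely because the installed gym 0.26.2 no longer provides Env.seed(); needs fixing.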