# Reinforcement Learning with Pytorch

* Stastics
  * [CartPole statistics](#CartPole-statistics)
  * [FrozenLake statistics](#FrozenLake-statistics)

* Examples:
  * [rl01_CartPoleRandom.py](#rl01_CartPoleRandom.py)
    * [rl01_CartPoleRandom-code02.py](#rl01_CartPoleRandom-code02.py)
  * [rl02_FrozenLakeRandom.py](#rl02_FrozenLakeRandom.py)
    * [rl02_FrozenLakeRandom-code02.py](#rl02_FrozenLakeRandom-code02.py)
  * [rl03_CartPoleVideo.py](#rl03_CartPoleVideo.py)
  * [rl04_FrozenLakeStochasticDeterministic.py](#rl04_FrozenLakeStochasticDeterministic.py)
    * [Stochastic 環境 & random action](#Stochastic-環境-&-random-action)
    * [Deterministic 環境 & random action](#Deterministic-環境-&-random-action)
  * [rl05_FrozenLakeDeterministicBellman.py](#rl05_FrozenLakeDeterministicBellman.py)
  * [rl06_FrozenLakeStochastic.py](#rl06_FrozenLakeStochastic.py)
    * [rl06-FrozenLake-0.4.0.py](#rl06-FrozenLake-0.4.0.py)
  * [rl07_FrozenLakeStochasticQLearning.py](#rl07_FrozenLakeStochasticQLearning.py)
  * [rl08_egreedy.py](#rl08_egreedy.py)
    * [rl08-b-egreedy-decay.py](#rl08-b-egreedy-decay.py)
  * [rl09_bonus_value_iteration.py](#rl09_bonus_value_iteration.py)
  * [rl10_homework.py](#rl10_homework.py)
  * [ rl11_NN_review.py](#rl11_NN_review.py)
    * [rl11_NN-review-0.4.0.py](#rl11_NN-review-0.4.0.py)
  * [rl12_CartPoleRandomNew.py](#rl12_CartPoleRandomNew.py)
  * [rl13_egreedy_tool.py](#rl13_egreedy_tool.py)
  * [rl14_CartPole-NN.py](#rl14_CartPole-NN.py)
  * [rl15_CartPole-NN-log.py](#rl15_CartPole-NN-log.py)
  * [rl16_CartPole-NN-2layer.py](#rl16_CartPole-NN-2layer.py)
  * [rl17_CartPole-Challenge.py](#rl17_CartPole-Challenge.py)
  * [rl18_CartPole-ExperienceReplay.py](#rl18_CartPole-ExperienceReplay.py)
  * [rl19_CartPole-targetnet.py](#rl19_CartPole-targetnet.py)
  * []()

## Statistics

### CartPole statistics

* [CartPole](https://gym.openai.com/envs/CartPole-v0/) 環境
* 比較各種演算法的結果

|演算法|程式碼|Average number of steps|Average reward|Average reward (last 100 episodes)|Solved after N episodes|
|:---|:---|:---|:---|:---|:---|
|Random moves|[`rl01_CartPoleRandom.py`](#rl01_CartPoleRandom.py)|22.30|
|Random moves|[`rl12_CartPoleRandomNew.py`](#rl12_CartPoleRandomNew.py)||21.89|22.24|
|NN - basic|[`rl14_CartPole-NN.py`](#rl14_CartPole-NN.py)||19.63|31.89|
|NN - 2 layers|[`rl16_CartPole-NN-2layer.py`](#rl16_CartPole-NN-2layer.py)||152.56|188.79|
|NN - 2 layers|[`rl17_CartPole-Challenge.py`](#rl17_CartPole-Challenge.py) 調整參數||173.81|200.00|
|NN - 2 layers - Experience replay|[`rl18_CartPole-ExperienceReplay.py`](#rl18_CartPole-ExperienceReplay.py)||192.04|199.79|
|NN - target net + tuning|[`rl19_CartPole-targetnet.py`](#rl19_CartPole-targetnet.py) 的第二組參數||187.46|199.37|
|NN - Double DQN|[`rl20_CartPole-DoubleDQN.py`](#rl20_CartPole-DoubleDQN.py) 的第一組參數: 快||187.33|176.00|130|
|NN - Double DQN|[`rl20_CartPole-DoubleDQN.py`](#rl20_CartPole-DoubleDQN.py) 的第二組參數: stable||103.97|200.00|382|
|NN - Dueling DQN + tuning|quick-win||193.59|200.00|117|


### FrozenLake statistics

* 接下來要用 [Frozen Lake](https://gym.openai.com/envs/FrozenLake-v0/) 的例子來說明與比較不同的 algorithms 的結果
* 比較各種演算法的結果

|演算法|程式碼|Percent of episodes finished successfully|Percent of episodes finished successfully (last 100 episodes)|Average number of steps|Average number of steps (last 100 episodes)|註解|
|:---:|:---|:---|:---|:---|:---|:---|
|Random|修改過的 [`rl02_FrozenLakeRandom.py`](#rl02_FrozenLakeRandom.py)|0.016|0.001|7.59|0.71|* stochastic 環境 & random action<br />* 只有少部分的 episode 順利結束|

1. Random
  * 用修改過的 [`rl02_FrozenLakeRandom.py`](#rl02_FrozenLakeRandom.py) 跑的結果
    * stochastic 環境 & random action
  * 只有少部分的 episode 順利結束
```
Percent of episodes finished successfully: 0.016
Percent of episodes finished successfully (last 100 episodes): 0.001
Average number of steps: 7.59
Average number of steps (last 100 episodes): 0.71
```
2. Bellman equation (deterministic environment)
  * 用 `rl05_FrozenLakeDeterministicBellman.py` 跑的結果
    * deterministic 環境
    * 用 Bellman equation 求 Q table: $Q(s, a) = r + \gamma \times \max_{a'} Q(s', a')$
  * 大部分的 episode 都順利結束了，表示 Bellman equation 可以得到很好的結果
```
Percentage of episodes finished successfully: 0.916
Percentage of episodes finished successfully (last 100 episodes): 1.0
Average number of steps: 6.22
Average number of steps (last 100 episodes): 6.00
```
3. $\epsilon$-greedy
  * 用 `rl08_egreedy.py` 跑的結果
    * deterministic 環境
    * 用 Bellman equation 求 Q table: $Q(s, a) = r + \gamma \times \max_{a'} Q(s', a')$
    * 固定 $epsilon$ 的數值
  * 大部分的 episode 都順利結束了，因為採用了 $\epsilon$-greedy 所以有時候會探索未知，造成結果比單純用 Bellman equation 差一些
```
Percentage of episodes finished successfully: 0.713
Percentage of episodes finished successfully (last 100 episodes): 0.85
Average number of steps: 6.53
Average number of steps (last 100 episodes): 6.35
```
4. Bellman equation (stochastic environment)
  * 用 `rl06_FrozenLakeStochastic.py` 跑的結果
    * stochastic 環境
    * 用 Bellman equation 求 Q table: $Q(s, a) = r + \gamma \times \max_{a'} Q(s', a')$
  * 只有少部分的 episode 順利結束，表示 Bellman equation 不適合用在 stochastic 環境中
```
Percentage of episodes finished successfully: 0.017
Percentage of episodes finished successfully (last 100 episodes): 0.02
Average number of steps: 7.79
Average number of steps (last 100 episodes): 7.88
```
5. Q-learning (stochastic environment)
  * 用 `rl07_FrozenLakeStochasticQLearning.py` 跑的結果
    * stochastic 環境
    * 用 Q-learning 求 Q table: $Q(s, a) = (1 - \alpha)Q(s, a) + \alpha[r + \gamma \times \max_{a'} Q(s', a')]$
  * 大概有將近一半的 episode 順利結束，表示用 Q-learning 比用 Bellman equation 能得到更多的改善
```
Percentage of episodes finished successfully: 0.405
Percentage of episodes finished successfully (last 100 episodes): 0.44
Average number of steps: 39.26
Average number of steps (last 100 episodes): 41.57
```
6. Bonus lesson
  * Value iteration for stochastic environment
  * 用 `rl09_bonus_value_iteration.py` 跑的結果
    * stochastic 環境
    * 用 value iteration 求 Q-table
  * 有超過一半的 episode 順利結束，所以 Value iteration 又比用 Q-learning 好多了
```
Percentage of episodes finished successfully: 0.729
Percentage of episodes finished successfully (last 100 episodes): 0.74
Average number of steps: 40.80
Average number of steps (last 100 episodes): 44.31
```

## Examples

### Leaner Demo

In [None]:
import WorldDemo
import threading

action = []
actions = WorldDemo.actions

def try_move(action):
    if action == actions[0]:
        WorldDemo.try_move(0, -1)
    elif action == actions[1]:
        WorldDemo.try_move(0, 1)
    elif action == actions[2]:
        WorldDemo.try_move(-1, 0)
    elif action == actions[3]:
        WorldDemo.try_move(1, 0)
    else:
        return
    
def main():
    print("START")
    print(actions)
    
t = threading.Thread(target=main)
t.daemon = True
t.start()
WorldDemo.start_game()

### rl01_CartPoleRandom.py

* 採用 random action 的方式
* 如何判斷一個 episode 結束:
  * 就是要麻走完了 step 限制的 1000 步，要麻就是 CartPole 遊戲玩到死掉了

In [None]:
import gym

env = gym.make("CartPole-v1") # 建立環境，這邊用 gym 中已經設定好的環境
num_episodes = 1000 # 跑 1000 個 episodes

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    for step in range(100): # 每一個 episode 最多走 100 步
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action
        env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中
        if done:
            break

### rl01_CartPoleRandom-code02.py
* 修改上面的程式碼，當結束的時候把總共走了幾個步驟記錄下來，然後畫圖。
* 跑完的結果
```
Episode finised after 13 steps
Average number of steps: 22.17
```

In [None]:
import gym
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1") # 建立環境，這邊用 gym 中已經設定好的環境

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    
    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action

        print(new_state) # 每走了一步之後變成新的狀態
        print(info)

        env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        if done:
            steps_total.append(step) # 把每個 episode 共走了幾步記錄下來
            print("Episode finised after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
plt.plot(steps_total) # 把每個 episode 共走了幾步畫出來
plt.show()

### rl02_FrozenLakeRandom.py

* 也是使用 random action 的方式
* 但是不知道為什麼輸出和標示的 action 不同...
  * 官網的說明: the ice is slippery, so you won't always move in the direction you intend.
  * 所以真正的移動方向和 action 的方向可能不一樣 $\Rightarrow$ 這是 stochastic 環境
* 程式碼基本上和 [`rl01_CartPoleRandom.py`](#rl01_CartPoleRandom.py) 一樣，只是用不同的環境
  * 用 `FrozenLake-v0` 環境
  * 把 `rl01_CartPoleRandom.py` 中的 `print(new_state)` 和 `print(info)` 刪掉
  * 加入 `time.sleep(0.4)`
* 跑完的結果
```
Average number of steps: 7.69
```

In [None]:
import gym
import time
import matplotlib.pyplot as plt

env = gym.make("FrozenLake-v0") # 建立環境，這邊用 gym 中已經設定好的環境

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    
    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action
        
        time.sleep(0.4) # 每個 step 相隔 0.4 秒

        env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        if done:
            steps_total.append(step) # 把每個 episode 共走了幾步記錄下來
            print("Episode finished after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
plt.plot(steps_total) # 把每個 episode 共走了幾步畫出來
plt.show()

### rl02_FrozenLakeRandom-code02.py
* 修改上面的程式碼，讓結果可以和其他的比較
  * 要比較的有
    * random move (就是下面這個程式)
    * [deterministic](#Deterministic-環境-&-random-action)
    * [stochastic](#Stochastic-環境-&-random-action)
  * 加入 `rewards_total` 來記錄每個 episode 的 reward
  * 把要輸出的資訊和圖表做些修改
  * 跑完的結果
```
Percent of episodes finished successfully: 0.019
Percent of episodes finished successfully (last 100 episodes): 0.0
Average number of steps: 7.64
Average number of steps (last 100 episodes): 7.84
```
  * 因為是 random walk, 所以只有很小一部分的 episodes 有完成

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

env = gym.make("FrozenLake-v0") # 建立環境，這邊用 gym 中已經設定好的環境

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步
rewards_total = [] # 用來記錄每個 episode 的 reward

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    
    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action
        
        # time.sleep(0.4) # 每個 step 相隔 0.4 秒

        # env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        if done:
            steps_total.append(step) # 把每個 episode 共走了幾步記錄下來
            rewards_total.append(reward)
            print("Episode finised after %i steps" % step)
            break

print("Percent of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percent of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))
print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

### rl03_CartPoleVideo.py
* 這個例子是用來說明如何記錄 moves (i.e. agent behaves)
* 要有安裝 ffmpeg
  * On OS X, you can install ffmpeg via `brew install ffmpeg`.
  * On most Ubuntu variants, `sudo apt-get install ffmpeg` should do it. 
    * On Ubuntu 14.04, however, you'll need to install avconv with `sudo apt-get install libav-tools`.
* 修改 [`rl01_CartPoleRandom.py`](#rl01_CartPoleRandom) 的第二支程式，加入下面兩行
```python
videosDir = "./RLvideos/" # 指定哪邊存放 videos
env = gym.wrappers.Monitor(env, videosDir) # 要把結果錄下來，就要這一行
```
* 跑完的結果
```
Average number of steps: 21.36
```

In [None]:
import gym
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1") # 建立環境，這邊用 gym 中已經設定好的環境

videosDir = "./RLvideos/" # 指定哪邊存放 videos
env = gym.wrappers.Monitor(env, videosDir) # 要把結果錄下來，就要這一行

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    
    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action

        env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        if done:
            steps_total.append(step) # 把每個 episode 共走了幾步記錄下來
            print("Episode finised after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
plt.plot(steps_total) # 把每個 episode 共走了幾步畫出來
plt.show()

* 列出 site packages 存放的位置

In [3]:
import site
site.getsitepackages()

['/usr/local/anaconda3/lib/python3.8/site-packages']

### rl04_FrozenLakeStochasticDeterministic.py

* 原本的 [`rl02_FrozenLakeRandom.py`](#rl02_FrozenLakeRandom.py) 就是 stochastic 的範例
  * stochastic 就是說想要執行的動作與真正執行得動作不一定是一樣的，只有一定的機率會去執行想要執行的動作
  * 例如下圖，原本在 S 想要採取 action = Right 但是真正執行這個 action 的機率只有 0.33
```
  (Right)
SFFF
FHFH
FFFH
HFFG
0
{'prob': 0.3333333333333333}
```

### Stochastic 環境 & random action
* 程式碼基本上就是和 [`rl02_FrozenLakeRandom.py`](#rl02_FrozenLakeRandom.py) 一模一樣，只是加入了兩行輸出
```python
print(new_state) # 每走了一步之後變成新的狀態
print(info) # info 可以顯示真正執行想要的動作的機率
```
* 跑完的結果
```
Average number of steps: 7.49
```

In [None]:
import gym
import time
import matplotlib.pyplot as plt

env = gym.make("FrozenLake-v0") # 建立環境，這邊用 gym 中已經設定好的環境

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態

    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action
        
        time.sleep(0.4) # 每個 step 相隔 0.4 秒

        env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        print(new_state) # 每走了一步之後變成新的狀態
        print(info) # info 可以顯示真正執行想要的動作的機率

        if done:
            steps_total.append(step) # 把每個 episode 共走了幾步記錄下來
            print("Episode finished after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
plt.plot(steps_total) # 把每個 episode 共走了幾步畫出來
plt.show()

### Deterministic 環境 & random action
* Deterministic 的話要對程式做修改
  * Deterministic 就是說想要執行的動作，和真正執行的動作，是一樣的
  * 例如下圖，本來在 S 想要執行 action = Down，因為真正執行這個 action 的機率是 1，所以結束的狀態是 F
```
  (Down)
SFFF
FHFH
FFFH
HFFG
4
{'prob': 1.0}
```
* Deterministic 的程式碼和 Stochastic 的程式碼差別在使用的環境不一樣
  * Deterministic 用下面的環境設定
```python
# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
#     max_episode_steps=100,
#     reward_threshold=0.78, # optimum = .8196
)
env = gym.make("FrozenLakeNotSlippery-v0") # Deterministic 要用不同的環境
```
* 跑完的結果
```
Average number of steps: 7.92
```

In [None]:
import gym
import time
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
#     max_episode_steps=100,
#     reward_threshold=0.78, # optimum = .8196
)

env = gym.make("FrozenLakeNotSlippery-v0") # Deterministic 要用不同的環境

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態

    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action
        
        time.sleep(0.4) # 每個 step 相隔 0.4 秒

        env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        print(new_state) # 每走了一步之後變成新的狀態
        print(info) # info 可以顯示真正執行想要的動作的機率

        if done:
            steps_total.append(step) # 把每個 episode 共走了幾步記錄下來
            print("Episode finished after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
plt.plot(steps_total) # 把每個 episode 共走了幾步畫出來
plt.show()

## Rewards
* Agent 要把 total future rewards 最大化
  * Deterministic: $R_{t} = r_{t} + r_{t+1} + r_{t+2} + \cdots + r_{n}$
  * Stochastic: $R_{t} = r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-t} r_{n}$
    * 當對未來有不確定性的時候，rewards 要引入 discount factor $\gamma$，使得越遠的未來的 reward 越不重要
    * $0 < \gamma < 1$
      * $\gamma=0$: only immediately reward
      * $\gamma=1$: deterministic
      * 通常用 $\gamma=0.99$ or 0.9 之類的數值

## Markov Decision Process
* $V(s) := \sum_{s'} P_{\pi(s)}(s, s')\Big(R_{\pi(s)}(s, s') + \gamma V(s')\Big)$
* $\pi(s) := argmax_{a} \Big\{\sum_{s'} P_{a}(s, s')\Big(R_{a}(s, s') + \gamma V(s')\Big)\Big\}$
* $V_{i+1}(s) := \max_{a}\Big\{\sum_{s'}P_{a}(s, s')\Big(R_{a}(s, s') + \gamma V_{i}(s')\Big)\Big\}$
  * stochastic environment 中要把所有可能的結果都列入考慮，所以要把每個可能的結果的幾率加入公式裡面
* $Q(s, a) = \sum_{s'} P_{a}(s, s')\Big(R_{a}(s, s') + \gamma V(s')\Big)$

## Solution
* $Q$-value function: $Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0}\gamma^{t} r_{t}\Big| s_{0}=s, a_{0}=a, \pi\Big]$
  * $\pi$ 是 policy, $\mathbb{E}$ 是期望值
  * $Q$-value function 可以想成是某一種 quality function，用來判斷 state-action pair 的好壞
    * 在某個 policy $\pi$ 的情況下，某個 state 執行某個 action 的期望值
* 最佳化的 $Q$-value: $Q^{*}(s, a) = \max_{\pi}\mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t} r_{t} \Big| s_{0}=s, a_{0}=a, \pi\Big]$
* $Q^{*}(s, a) = \mathbb{E}_{s'\sim \epsilon}\Big[r + \gamma \max_{a'} Q^{*}(s', a')\Big|s, a\Big]$

## Bellman equation
* $Q(s, a) = r + \gamma \max_{a'}Q(s', a')$

* 印出 action space 和 observation space

In [4]:
import gym
env = gym.make("FrozenLake-v0")

In [5]:
env.observation_space

Discrete(16)

In [6]:
env.action_space

Discrete(4)

In [7]:
env.observation_space.n

16

In [8]:
env.action_space.n

4

* Tabular method 就是建立一個大大的表格，每個 row 代表一個 state 每個 column 代表一個 action，每一格的數值就是 $Q$-value

In [9]:
import torch

number_of_states = env.observation_space.n
number_of_actions = env.action_space.n

Q = torch.zeros([number_of_states, number_of_actions])

In [10]:
Q

tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

因為環境有 16 個 state 和 4 個 action，所以 Q table 就有 16 rows x 4 columns

In [11]:
a = torch.zeros(1, 4)
a # 是一個 1x4 的 tensor

tensor([[0., 0., 0., 0.]])

In [12]:
a[0] # 是 size 1 的 tensor

tensor([0., 0., 0., 0.])

In [13]:
a[0][0], a[0][1] # 取得數值

(tensor(0.), tensor(0.))

In [14]:
a[0][1] = 0.2 # 修改數值
a[0][2] = 0.6
a

tensor([[0.0000, 0.2000, 0.6000, 0.0000]])

In [15]:
torch.max(a, 1)
# 第二個參數指明怎麼找最大值
# 1: find the max value in the row
# 0: find the max value in the column

torch.return_types.max(
values=tensor([0.6000]),
indices=tensor([2]))

In [16]:
torch.max(a, 1)[0] # get the max value in tensor form

tensor([0.6000])

In [17]:
torch.max(a, 1)[1] # get index of the max value，傳回的是 tensor

tensor([2])

In [18]:
torch.max(a, 1)[0][0] # get the max value in value form

tensor(0.6000)

In [19]:
torch.max(a, 1)[1][0] # 沿著 row 方向上，數值最大的那個的 index

tensor(2)

In [20]:
env.action_space.sample() # 隨機的 action 所以每次的結果都不同

1

In [21]:
env.action_space.sample() # 隨機的 action 所以每次的結果都不同

2

In [22]:
b = torch.zeros(1, 4)
b

tensor([[0., 0., 0., 0.]])

In [23]:
torch.max(b, 1) # 當 tensor 中的元素都一樣的時候，torch.max() 傳回最左邊的元素

torch.return_types.max(
values=tensor([0.]),
indices=tensor([0]))

In [24]:
torch.max(b, 1) # 當 tensor 中的元素都一樣的時候，torch.max() 傳回最左邊的元素
# 不管跑幾次都是一樣只傳回最左邊的元素，所以失去了隨機性

torch.return_types.max(
values=tensor([0.]),
indices=tensor([0]))

In [25]:
torch.randn(1, 4) # 產生 1x4 的隨機數值 tensor

tensor([[-0.1938,  0.8813,  1.4464, -0.8650]])

In [26]:
torch.randn(1, 4) # 產生 1x4 的隨機數值 tensor

tensor([[0.4518, 0.0544, 0.6908, 0.7725]])

In [27]:
torch.randn(1, 4)/1000 # 產生 1x4 的隨機數值 tensor，除以 1000 讓這個隨機數值變得很小

tensor([[5.3475e-05, 1.6591e-08, 1.2601e-04, 8.4253e-04]])

In [28]:
(torch.randn(1, 4)/1000)[0]

tensor([-0.0014, -0.0012,  0.0008, -0.0022])

In [29]:
(torch.randn(1, 4)/1000)[0][0]

tensor(-0.0003)

In [30]:
Q[0] # Q table 的內容都是 0

tensor([0., 0., 0., 0.])

In [31]:
Q[0] + torch.randn(1, 4)/1000 # Q table 的內容是很接近 0 的一個很小的數

tensor([[-0.0007, -0.0002,  0.0002,  0.0006]])

### rl05_FrozenLakeDeterministicBellman.py
* 用 deterministic 環境
* 用 Bellman equation 求 $Q$-table
  * $Q(s, a) = r + \gamma \max_{a'}Q(s', a')$

## Algorithm for deterministic environment
```
initial Q[num_states, num_actions]
observe initial state s
repeat until terminated
    select and perform action a
    observe reward r and new state s'
    Q(s, a) = r + gamma * max Q(s', a')
    s = s'
```

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

env = gym.make("FrozenLakeNotSlippery-v0")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

gamma = 1 # deterministic

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            print("Episode finised after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
plt.plot(steps_total)
plt.show()

修改上面的程式碼，印出更多的資訊

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

env = gym.make("FrozenLakeNotSlippery-v0")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

gamma = 1 # deterministic

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            print("Episode finised after %i steps" % step)
            break

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.plot(steps_total)
plt.show()

增加畫 reward 的圖的部分

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

env = gym.make("FrozenLakeNotSlippery-v0")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

gamma = 1 # deterministic

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            print("Episode finised after %i steps" % step)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.plot(steps_total)
plt.show()

plt.plot(rewards_total)
plt.show()

最後修改畫圖的部分，用 bar plot 來畫得好看一點

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

env = gym.make("FrozenLakeNotSlippery-v0")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

gamma = 1 # deterministic

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]
print("Initial Q-table:\n", Q)

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            print("Episode finised after %i steps" % step)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

### rl06_FrozenLakeStochastic.py
* 用 stochastic 環境
* 用 Bellman equation 求 Q-table
  * $Q(s, a) = r + \gamma \max_{a'}Q(s', a')$

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

env = gym.make("FrozenLake-v0")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

gamma = 1 # deterministic

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            print("Episode finised after %i steps" % step)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

# plt.plot(steps_total)
# plt.show()

# plt.plot(rewards_total)
# plt.show()

### rl06-FrozenLake-0.4.0.py

In [None]:
import gym
import time
import torch

import matplotlib.pyplot as plt

env = gym.make("FrozenLake-v0")

plt.style.use("ggplot")

number_of_states = env.observation_space.n
number_of_actions = env.action_space.n

gamma = 1

Q = torch.zeros([number_of_states, number_of_actions])

num_episodes = 1000

steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset()
    
    step = 0
    # for step in range(100)
    while True:
        step += 1
        # action = env.action_space.sample()
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1][0]
        new_state, reward, done, info = env.step(action.item())
        Q[state, action] = reward + gamma * torch.max(Q[new_state])
        state = new_state
        # time.sleep(0.4)
        # env.render()
        # print(new_state)
        # print(info)
        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            print("Episode finished after %i steps" % step)
            
print(Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

## Temporal difference
* The different between the current observation and the previous observation
* $[r + \gamma \times \max_{a'}Q(s', a')] - [Q(s, a)]$
  * current observation: $r + \gamma \times \max_{a'}Q(s', a')$
  * previous observation: $Q(s, a)$
  
## Q-learning
* Q-learning is off policy temporal difference control algorithm
* $Q_{t}(s, a) = Q_{t-1}(s, a) + \alpha \times TD$
  * $\alpha$: learning rate
* $Q(s, a) = Q(s, a) + \alpha [r + \gamma \times \max_{a'}Q(s', a') - Q(s, a)]$
* $Q(s, a) = (1 - \alpha)Q(s, a) + \alpha[r + \gamma \times \max_{a'}Q(s', a')]$
  * $\alpha=0$: we only care about the experience
  * $\alpha=1$: we don't care about the experience, we only care about the current observation
  
## Algorithm for stochastic environment
```
initial Q[num_states, num_actions]
observe initial state s
repeat until terminated
    select and perform action a
    observe reward r and new state s'
    Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
    s = s'
```

* Deterministic 和 stochastic environments 的 algorithm 唯一的差別在於如何計算 $Q(s, a)$
  * deterministic: $Q(s, a) = r + \gamma \times \max_{a'} Q(s', a')$
  * stochastic: $Q(s, a) = (1 - \alpha)Q(s, a) + \alpha [ r + \gamma \times \max_{a'}Q(s', a')]$

### rl07_FrozenLakeStochasticQLearning.py
* 用 stochastic 環境
* 用 Q-learning 來計算 Q-table
  * Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

env = gym.make("FrozenLake-v0")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]
print("Initial Q-table:\n", Q)

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            print("Episode finised after %i steps" % step)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

## Exploitation vs Exploration

* 在 best decision 和 more information 之間取得一個平衡
  * best decision 就是用 knowledge about the environment 來做的
* $\epsilon$-greedy: 設定一個 $\epsilon$ 數值，然後把機率和這個數值比較
  * $a = 1 - \epsilon$
  * $a = \epsilon$

In [32]:
torch.rand(1)

tensor([0.2946])

In [33]:
torch.rand(1)

tensor([0.7380])

In [34]:
torch.rand(1)[0]

tensor(0.5711)

In [35]:
torch.rand(1)[0]

tensor(0.1910)

### rl08_egreedy.py
* 用 deterministic 環境
* 用 Bellman equation 求 Q table
  * $Q(s, a) = r + \gamma \times \max_{a'}Q(s', a')$
* 用 $\epsilon$-greedy 來做 exploitation and exploration
  * 固定 $\epsilon$ 的數值

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

env = gym.make("FrozenLakeNotSlippery-v0")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# gamma = 1 # deterministic
gamma = 0.9
egreedy = 0.1 # epsilon greedy = 0.1 表示有十分之一的機會會採取 random action

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]
print("Initial Q-table:\n", Q)

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_for_egreedy = torch.rand(1)[0] # 產生一個介於 0 和 1 之間的隨機數，然後和 egreedy 比較
        if random_for_egreedy > egreedy:
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
            random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
            action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        else:
            action = env.action_space.sample()
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            print("Episode finised after %i steps" % step)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

# plt.plot(steps_total)
# plt.show()

# plt.plot(rewards_total)
# plt.show()

* 上面的例子是固定 $\epsilon$ 的值，但是也可以在訓練模型時用會改變的 $\epsilon$
  * 因為一開始我們對環境的認識並不多，所以 action 會是比較 random 的 (Exploitation)
  * 隨著我們對環境越加了解後，我們就更該利用我們知道的 (Exploration)

### rl08-b-egreedy-decay.py
* 用 deterministic 環境
* 用 Bellman equation 求 Q table
  * $Q(s, a) = r + \gamma \times \max_{a'}Q(s', a')$
* 用 $\epsilon$-greedy 來做 exploitation and exploration
  * $\epsilon$ 會改變

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# 要 deterministic 的話要加入這一部分
from gym.envs.registration import register
register(
    id="FrozenLakeNotSlippery-v0",
    entry_point="gym.envs.toy_text:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

env = gym.make("FrozenLakeNotSlippery-v0")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# gamma = 1 # deterministic
gamma = 0.9

egreedy = 0.7 # 一開始的 epsilon 數值會很大，然後會漸漸變小
egreedy_final = 0.1
egreedy_decay = 0.999


Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]
print("Initial Q-table:\n", Q)

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []
egreedy_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > egreedy:
        # action = env.action_space.sample() # 從 action space 中隨機選擇 action
        # action = torch.max(Q[state], 1)[1][0] # 沿著 row 方向找最大值，[1] 是第一個元素表示 index 結果是 tensor 所以要用 [0] 取得數值
        # 上面的 action 公式雖然正確，可是因為一開始的 Q table 都是 0 所以上面的公式只會選出第一個 action，因此失去的 exploration
            random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
            action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        else:
            action = env.action_space.sample()
        if egreedy > egreedy_final:
            egreedy *= egreedy_decay
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        Q[state, action] = reward + gamma * torch.max(Q[new_state]) # 6. Q(s, a) = r + gamma * max Q(s', a')
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(reward)
            egreedy_total.append(egreedy)
            print("Episode finised after %i steps" % step)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Egreedy value")
plt.bar(torch.arange(len(egreedy_total)), egreedy_total, alpha=0.6, color="blue")
plt.show()

## Value iteration
* Policy consists of clear instructions to Agent how to behave in our env
* Offline planning: agent 在選擇 action 之前，就對環境很了解，知道選擇了哪一個 ation 之後會有怎樣的結果
  * Agent 知道全部的 actions 執行後的結果
* 目標是建立一個表格，有每個 state 和對應的 value
  * 之前的方法是 agent 對環境不了解，藉由 exploration 慢慢地建立表格，並藉由學習把數值更新變得更好
    * 先建立表格，再藉由探索以學習慢慢更新，得到好結果
  * 現在的方法是對所有可能的 state 做很多次迭代，以得到最好的結果，這樣就能知道選擇哪個 action 比較合適
    * 先建立表格，直接用迭代，來得到最可能的好結果

In [40]:
import gym

env = gym.make("FrozenLake-v0")

In [41]:
env.env.P # 輸出每個 state 允許的動作，以及執行該動作的機率

{0: {0: [(0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 4, 0.0, False)],
  1: [(0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 4, 0.0, False),
   (0.3333333333333333, 1, 0.0, False)],
  2: [(0.3333333333333333, 4, 0.0, False),
   (0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False)],
  3: [(0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 0, 0.0, False)]},
 1: {0: [(0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 5, 0.0, True)],
  1: [(0.3333333333333333, 0, 0.0, False),
   (0.3333333333333333, 5, 0.0, True),
   (0.3333333333333333, 2, 0.0, False)],
  2: [(0.3333333333333333, 5, 0.0, True),
   (0.3333333333333333, 2, 0.0, False),
   (0.3333333333333333, 1, 0.0, False)],
  3: [(0.3333333333333333, 2, 0.0, False),
   (0.3333333333333333, 1, 0.0, False),
   (0.3333333333333333, 0, 0.0, False)]},
 2:

In [37]:
env.env.P[8] # 在 state 8 有 4 種 moves

{0: [(0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 8, 0.0, False),
  (0.3333333333333333, 12, 0.0, True)],
 1: [(0.3333333333333333, 8, 0.0, False),
  (0.3333333333333333, 12, 0.0, True),
  (0.3333333333333333, 9, 0.0, False)],
 2: [(0.3333333333333333, 12, 0.0, True),
  (0.3333333333333333, 9, 0.0, False),
  (0.3333333333333333, 4, 0.0, False)],
 3: [(0.3333333333333333, 9, 0.0, False),
  (0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 8, 0.0, False)]}

In [38]:
env.env.P[8][1] # 在 state 8 選擇動作 1 的時候，有三種可能的結果
# 每個 tuple 表示每種結果的 (機率，動作結束後會在哪個 state，reward，是否結束 episode)

[(0.3333333333333333, 8, 0.0, False),
 (0.3333333333333333, 12, 0.0, True),
 (0.3333333333333333, 9, 0.0, False)]

In [43]:
V = torch.zeros([number_of_states])
V

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

### rl09_bonus_value_iteration.py
* 用 Stochastic 環境
* 用 Value iteration

In [None]:
import gym
import torch
import matplotlib.pyplot as plt

# stochastic
env = gym.make("FrozenLake-v0")

number_of_states = env.observation_space.n
number_of_actions = env.action_space.n

# V value - size 16 (as number of fields[states] in our lake)
V = torch.zeros([number_of_states])

gamma = 0.9

rewards_total = []
steps_total = []

# this is common function used in value_iteration and bulid_policy
# It gets through all possible moves from defined state
# and it returns best possible move (its value and index)
def next_step_evaluation(state, Vvalues):
    Vtemp = torch.zeros(number_of_actions)
    
    for action_possible in range(number_of_actions): # 把某個狀態下所有可能的 action 都做一遍
        for prob, new_state, reward, _ in env.env.P[state][action_possible]:
            Vtemp[action_possible] += prob * (reward + gamma * Vvalues[new_state])
            
    max_value, indice = torch.max(Vtemp, 0)
    return max_value, indice

# VALUE ITERATION
# this will build V values table from scratch
# will go through all possible states
def value_iteration():
    Qvalues = torch.zeros(number_of_states)
    # this is value based on experiment
    # after that many iterations values don't change significantly any more
    # it can be done in better way - with some kind of evaluation of our values
    # but this is simplified version which works also well in this example
    max_iterations = 1500
    
    for _ in range(max_iterations):
        # for each step we search for best possible move
        for state in range(number_of_states):
            max_value, _ = next_step_evaluation(state, Qvalues)
            # Qvalues[state] = max_value[0] # 舊版的 pytorch
            Qvalues[state] = max_value
            
    return Qvalues

# BUILD POLICY
# Now having V table - we can use it to build policy
# policy means clear instructions which are best moves from each single state
# So having V values table ready - we can easily understand which move is the best
# in each step
# so we are able to build clear instructions for our agent
# telling him which move he should choose in every state
def build_policy(Vvalues):
    Vpolicy = torch.zeros(number_of_states)
    
    for state in range(number_of_states):
        _, indice = next_step_evaluation(state, Vvalues)
        # Vpolicy[state] = indice[0] # 舊版的 pytorch
        Vpolicy[state] = indice
        
    return Vpolicy

# 2 main steps to build policy for our agent
V = value_iteration() # 用 value iteration 建立一個 V 表格
Vpolicy = build_policy(V) # 由 V 表格來查看哪個 action 是最好的選擇

# import sys
# print("V:", V)
# print("Vpolicy:", Vpolicy)
# sys.exit()

# main loop for our target
num_episodes = 1000

for i_episode in range(num_episodes):
    state = env.reset()
    
    step = 0
    while True:
        step += 1
        
        # action = Vpolicy[state]
        action = int(Vpolicy[state]) # Vpolicy[state] 傳回的是 float 但是 action 要用 int
        
        new_state, reward, done, info = env.step(action)
        
        state = new_state
        
        if done:
            rewards_total.append(reward)
            steps_total.append(step)
            break
            
print(V)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()


### rl10_homework.py

* 雖然 taxi-v2 是 deterministic 的，但是這邊試著用 stochastic 的 algorithm 來算 Q 值，試看看這樣的結果有沒有比較好
* Stochastic: $Q(s, a) = (1 - \alpha)Q(s, a) + \alpha [ r + \gamma \times \max_{a'}Q(s', a')]$

In [46]:
import gym
env = gym.make("Taxi-v3")
number_of_states = env.observation_space.n
number_of_actions = env.action_space.n

In [47]:
number_of_states

500

In [48]:
number_of_actions # 上, 下, 左, 右, pick up, drop off

6

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

改善畫圖的部分，只畫第 200 步以後的

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total[200:])), rewards_total[200:], alpha=0.6, color="green")
plt.show()

* Stochastic: $Q(s, a) = (1 - \alpha)Q(s, a) + \alpha [ r + \gamma \times \max_{a'}Q(s', a')]$
* 試試看不同的 learning rate
  * 用 learning rate $\alpha = 0$ (只注重 experience，不管 current observation)
```
Percentage of episodes finished successfully: -771.102
Percentage of episodes finished successfully (last 100 episodes): -769.36
Average number of steps: 196.34
Average number of steps (last 100 episodes): 194.71
```
  * 用 learning rate $\alpha = 1$ (不管過去的 experience，只注重 current observation)
```
Percentage of episodes finished successfully: -13.093
Percentage of episodes finished successfully (last 100 episodes): 7.54
Average number of steps: 26.43
Average number of steps (last 100 episodes): 13.46
```
  * 用 learning rate $\alpha = 0.5$
```
Percentage of episodes finished successfully: -25.829
Percentage of episodes finished successfully (last 100 episodes): 7.87
Average number of steps: 34.40
Average number of steps (last 100 episodes): 13.13
```
* 試試看不同的 gamma
  * 用 $\gamma = 0$ (只在乎 current reward)
    * 有學習但是學不好，亂 drop off 乘客
```
Percentage of episodes finished successfully: -194.968
Percentage of episodes finished successfully (last 100 episodes): -190.13
Average number of steps: 190.55
Average number of steps (last 100 episodes): 192.23
```
  * 用 $\gamma = 1$ (future reward 也考慮)
```
Percentage of episodes finished successfully: -19.19
Percentage of episodes finished successfully (last 100 episodes): 7.4
Average number of steps: 29.08
Average number of steps (last 100 episodes): 13.60
```
* 用 $\epsilon$ greedy 的方法
  * $\gamma = 0.95$, learning rate $\alpha = 0.9$, $\epsilon = 0.1$
```
Percentage of episodes finished successfully: -24.683
Percentage of episodes finished successfully (last 100 episodes): 3.01
Average number of steps: 30.00
Average number of steps (last 100 episodes): 14.84
```
  * $\gamma = 0.95$, learning rate $\alpha = 0.9$, $\epsilon = 0$
```
Percentage of episodes finished successfully: -15.745
Percentage of episodes finished successfully (last 100 episodes): 7.85
Average number of steps: 28.49
Average number of steps (last 100 episodes): 13.15
```
  * $\gamma = 0.95$, learning rate $\alpha = 0.9$, $\epsilon = 1$
    * 沒學任何東西，只是用 random action 的方式
```
Percentage of episodes finished successfully: -766.486
Percentage of episodes finished successfully (last 100 episodes): -776.62
Average number of steps: 196.47
Average number of steps (last 100 episodes): 198.58
```

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 1 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.5 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 1 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        # 4. select and perform action a
        random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
        action = torch.max(random_values, 1)[1].tolist()[0]
        # 因此要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡
egreedy = 0.1

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        
        random_for_egreedy = torch.rand(1)[0]
        # 4. select and perform action a
        if random_for_egreedy > egreedy:
            random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
            action = torch.max(random_values, 1)[1].tolist()[0]
            # 要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        else:
            action = env.action_space.sample()        
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total[200:])), rewards_total[200:], alpha=0.6, color="green")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡
egreedy = 0

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        
        random_for_egreedy = torch.rand(1)[0]
        # 4. select and perform action a
        if random_for_egreedy > egreedy:
            random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
            action = torch.max(random_values, 1)[1].tolist()[0]
            # 要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        else:
            action = env.action_space.sample()        
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total[200:])), rewards_total[200:], alpha=0.6, color="green")
plt.show()

In [None]:
import gym
import time
import torch
import matplotlib.pyplot as plt

# env = gym.make("Taxi-v2") # 舊版本
env = gym.make("Taxi-v3")

plt.style.use("ggplot")

number_of_states = env.observation_space.n # 用 .n 得到 space 中有幾個數 
number_of_actions = env.action_space.n

# 要調整這兩個數值，以得到最合適的
gamma = 0.95 # 因為不是用 deterministic environment 所以 gamma 要比 1 小
learning_rate = 0.9 # learning rate 是要實驗才知道哪個數值合適，要在現在的觀察結果與過去的經驗值之間取得平衡
egreedy = 1

Q = torch.zeros([number_of_states, number_of_actions]) # 1. initial Q[num_state, num_action]

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = []
rewards_total = []

for i_episode in range(num_episodes):
    state = env.reset() # 2. observe initial state s

    step = 0
    score = 0
    while True: # 3. repeat until terminated
        step += 1
        
        random_for_egreedy = torch.rand(1)[0]
        # 4. select and perform action a
        if random_for_egreedy > egreedy:
            random_values = Q[state] + torch.rand(1, number_of_actions) / 1000
            action = torch.max(random_values, 1)[1].tolist()[0]
            # 要加入一點很小的 randomness 使得有 exploration 且又不影響到 Q value
        else:
            action = env.action_space.sample()        
        
        new_state, reward, done, info = env.step(action) # 等號右邊是 4. select and perform action a 等號左邊是 5. observe reward r and new state s'
        
        score += reward
        
        # 6. Q(s, a) = (1 - alpha)Q(s, a) + alpha[r + gamma * max Q(s', a')]
        # 這裡的 Q 值計算，和 rl06_FrozenLakeStochastic.py 不同
        Q[state, action] = (1 - learning_rate) * Q[state, action] \
                           + learning_rate * (reward + gamma * torch.max(Q[new_state]))
        state = new_state # 7. s = s'
        
        # time.sleep(0.4) 
        # env.render() # 把圖案畫出來
        
        # print(new_state) # 每走了一步之後變成新的狀態
        # print(info)

        if done:
            steps_total.append(step)
            rewards_total.append(score)
            print("Episode finised after %i steps" % step)
            print(score)
            break

print("Final Q-table:\n", Q)

print("Percentage of episodes finished successfully: {0}".format(sum(rewards_total)/num_episodes))
print("Percentage of episodes finished successfully (last 100 episodes): {0}".format(sum(rewards_total[-100:])/100))

print("Average number of steps: %.2f" % (sum(steps_total)/num_episodes))
print("Average number of steps (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Steps / Episode length")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="red")
plt.show()

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(rewards_total[200:])), rewards_total[200:], alpha=0.6, color="green")
plt.show()

## Neural Network
* [http://neuralnetworksanddeeplearning.com/](http://neuralnetworksanddeeplearning.com/)
* Deep learning: when the hidden layers $\ge$ 2 can be called as deep learning
* Activation functions:
  * ReLU: DeepMind
  * Tanh: OpenAI
  * The activation function is used to determine if our neural network be actived or not
* Loss functions:
  * MSE
  * Huber Loss: DeepMind
    * The Huber loss is quadratic for small values and linear for large values.
      * To penalize our network class for huge mistakes comparing to MSE.
  * The loss function is used to see how good are we doing or how far away from the expected value we are
    * The difference between the predicted value and the actual outputs.
* Optimizer:
  * Adman
  * RMSProp
  * The optimizer is used to minize our loss.

### `torch.squeeze()` and `torch.unsqueeze()`
* `torch.squeeze()` 把 tensor 中 dim=1 的部分移除
  * 如果有指明哪一個 dim 要移除，若該 dim = 1 就只會移除該 dim，若該 dim != 1 就什麼都不做
* `torch.unsqueeze()` 在指定的 dim 部分加入 dim=1 

In [1]:
import torch

y = torch.zeros(1, 2, 1, 2)
y # size=1x2x1x2

tensor([[[[0., 0.]],

         [[0., 0.]]]])

In [2]:
torch.squeeze(y) # size=2x2

tensor([[0., 0.],
        [0., 0.]])

In [3]:
torch.squeeze(y, 0) # get rid of the first dim: size=2x1x2

tensor([[[0., 0.]],

        [[0., 0.]]])

In [4]:
torch.squeeze(y, 1) # get rid of the 2nd dim, but the 2nd dim != 1, so the size=1x2x1x2 is unchanged

tensor([[[[0., 0.]],

         [[0., 0.]]]])

In [6]:
z = torch.Tensor([1, 3, 6])
z # size=3

tensor([1., 3., 6.])

In [7]:
torch.unsqueeze(z, 0) # size=1x3

tensor([[1., 3., 6.]])

In [8]:
torch.unsqueeze(z, 1) # size=3x1

tensor([[1.],
        [3.],
        [6.]])

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable

# if GPU is to be used
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

目標是要用 NN 來猜 linear function: $y = W \cdot x + b$

In [2]:
W = 2
b = 0.2

In [4]:
x = torch.arange(100)
print(x.size())
x # size=100

torch.Size([100])


tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
        72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
        90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [5]:
x = Variable(torch.arange(100)) # pytorch 要求變數要放在 Variable()
print(x.size())
x # size=100

torch.Size([100])


tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
        54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
        72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
        90, 91, 92, 93, 94, 95, 96, 97, 98, 99])

In [10]:
x = Variable(torch.arange(100).unsqueeze(1)) # pytorch 要求變數有特定的格式
print(x.size())
print(x.type())
x # size=100x1

torch.Size([100, 1])
torch.LongTensor


tensor([[ 0],
        [ 1],
        [ 2],
        [ 3],
        [ 4],
        [ 5],
        [ 6],
        [ 7],
        [ 8],
        [ 9],
        [10],
        [11],
        [12],
        [13],
        [14],
        [15],
        [16],
        [17],
        [18],
        [19],
        [20],
        [21],
        [22],
        [23],
        [24],
        [25],
        [26],
        [27],
        [28],
        [29],
        [30],
        [31],
        [32],
        [33],
        [34],
        [35],
        [36],
        [37],
        [38],
        [39],
        [40],
        [41],
        [42],
        [43],
        [44],
        [45],
        [46],
        [47],
        [48],
        [49],
        [50],
        [51],
        [52],
        [53],
        [54],
        [55],
        [56],
        [57],
        [58],
        [59],
        [60],
        [61],
        [62],
        [63],
        [64],
        [65],
        [66],
        [67],
        [68],
        [69],
        [70],
      

### rl11_NN_review.py

In [17]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable

# if GPU is to be used
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

# NN 不知道這兩個數值，要藉由學習猜出來
W = 2
b = 0.2

x = Variable(torch.arange(100).unsqueeze(1))

y = W * x + b

###### PARAMS ######
learning_rate = 0.01
num_episodes = 1000

class NeuralNetwork(nn.Module): # 繼承自 nn.Module 所以可以用 pythorch 提供的 methods
    def __init__(self):
        super(NeuralNetwork, self).__init__() # 要呼叫 parent class
        self.linear1 = nn.Linear(1, 1) # 使用 linear model 且 input dim=1, output dim=1
        
    def forward(self, x): # 收到 input variable x 然後輸出 output variable
        output = self.linear1(x)
        return output
    
mynn = NeuralNetwork()

if use_cuda:
    mynn.cuda() # 如果有 GPU 就用 cuda
    
loss_func = nn. MSELoss()
# loss_func = nn.SmoothL1Loss() # Huber loss

optimizer = optim.Adam(params=mynn.parameters(), lr=learning_rate)
# optimizer = optim.RMSprop(params=mynn.parameterseters(), lr=learning_rate) # 用 RMSprop 當 loss function

for i_episode in range(num_episodes):
    predicted_value = mynn(x.type(torch.FloatTensor)) # RuntimeError: expected scalar type Float but found Long
    loss = loss_func(predicted_value, y)
    
    optimizer.zero_grad()
    loss.backward() # 計算 gradient
    optimizer.step() # update model parameters
    
    if i_episode % 50 == 0:
        print("Episode %i, loss %.4f" % (i_episode, loss.data[0]))
        
plt.figure(figsize=(12, 5))
plt.plot(x.data.numpy(), y.data.numpy(), alpha=0.6, color="green") # 不能直接放 pytorch tensor 給 matplotlib 所以要加上 .data.numpy()
plt.plot(x.data.numpy(), predicted_value.data.numpy(), alpha=0.6, color="red")
plt.show()

IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

### rl11_NN-review-0.4.0.py

In [19]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt

from torch.autograd import Variable

# if gpu is to be used
use_cuda = torch.cuda.is_available()

device = torch.device("cuda:0" if use_cuda else "cpu")

W = 2
b = 0.3

x = torch.arange(100).to(device).unsqueeze(1)

y = W * x + b

###### PARAMS ######
learning_rate = 0.01
num_episodes = 1000

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(1, 1)
        
    def forward(self, x):
        output = self.linear1(x)
        return output
    
mynn = NeuralNetwork().to(device)

loss_func = nn.MSELoss()
# loss_func = nn.SmoothL1Loss()

optimizer = optim.Adam(params=mynn.parameters(), lr=learning_rate)
# optimizer = optim.RMSprop(params=mynn.parameters(), lr=learning_rate)

for i_episode in range(num_episodes):
    predicted_value = mynn(x)
    
    loss = loss_func(predicted_value, y)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if i_episode % 50 == 0:
        print("Episode %i, loss %.4f" % (i_episode, loss.item()))
        
plt.figure(figsize=(12, 5))
plt.plot(x.cpu().numpy(), y.cpu().numpy(), alpha=0.6, color="green") # numpy only work on cpu so we have to call cpu()
plt.plot(x.cpu().numpy(), predicted_value.detach().cpu().numpy(), alpha=0.6, color="blue")

if use_cuda:
    plt.savefig("graph.png")
else:
    plt.show()

RuntimeError: expected scalar type Float but found Long

### rl12_CartPoleRandomNew.py

In [None]:
import gym
import torch
import random
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1") # 建立環境，這邊用 gym 中已經設定好的環境

# 設定 random seed
seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

num_episodes = 1000 # 跑 1000 個 episodes
steps_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    
    step = 0 # 每個 episode 一開始的步數要先歸零
    while True:
        step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action

#         print(new_state) # 每走了一步之後變成新的狀態
#         print(info)

#         env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        if done:
            steps_total.append(step) # 每個 episode 共走了幾部記錄下來
            print("Episode finised after %i steps" % step)
            break

# 在 CartPole 中每一個 step 就得到一 reward 所以兩者是等價的，因此只需要一張圖就好
print("Average reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

### rl13_egreedy_tool.py

In [None]:
import torch
import matplotlib.pyplot as plt
import math
import gym

env = gym.make("CartPole-v0")
num_episodes = 150

egreedy = 0.7
egreedy_final = 0.1
egreedy_decay = 500

egreedy_prev = egreedy
egreedy_prev_final = egreedy_final
egreedy_prev_decay = 0.999

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

egreedy_total = []
egreedy_prev_total = []

steps_total = 0

for i_episode in range(num_episodes):
    state = env.reset()
    while True:
        steps_total += 1
        epsilon = calculate_epsilon(steps_total)
        egreedy_total.append(epsilon)
        
        action = env.action_space.sample()
        
        new_state, reward, done, info = env.step(action)
        
        state = new_state
        if egreedy_prev > egreedy_prev_final:
            egreedy_prev *= egreedy_prev_decay
            egreedy_prev_total.append(egreedy_prev)
        else:
            egreedy_prev_total.append(egreedy_prev)
        
        if done:
            break
            
plt.figure(figsize=(12, 5))
plt.title("Egreedy value")
plt.bar(torch.arange(len(egreedy_total)), egreedy_total, alpha=0.6, color="blue")

plt.figure(figsize=(12, 5))
plt.title("Egreedy 2 value")
plt.bar(torch.arange(len(egreedy_prev_total)), egreedy_prev_total, alpha=0.6, color="green")
plt.show()

In [None]:
import torch
import matplotlib.pyplot as plt
import math
import gym

env = gym.make("CartPole-v0")
num_episodes = 150

egreedy = 0.9
egreedy_final = 0.02
egreedy_decay = 500

egreedy_prev = egreedy
egreedy_prev_final = egreedy_final
egreedy_prev_decay = 0.999

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

egreedy_total = []
egreedy_prev_total = []

steps_total = 0

for i_episode in range(num_episodes):
    state = env.reset()
    while True:
        steps_total += 1
        epsilon = calculate_epsilon(steps_total)
        egreedy_total.append(epsilon)
        
        action = env.action_space.sample()
        
        new_state, reward, done, info = env.step(action)
        
        state = new_state
        if egreedy_prev > egreedy_prev_final:
            egreedy_prev *= egreedy_prev_decay
            egreedy_prev_total.append(egreedy_prev)
        else:
            egreedy_prev_total.append(egreedy_prev)
        
        if done:
            break
            
plt.figure(figsize=(12, 5))
plt.title("Egreedy value")
plt.bar(torch.arange(len(egreedy_total)), egreedy_total, alpha=0.6, color="blue")

plt.figure(figsize=(12, 5))
plt.title("Egreedy 2 value")
plt.bar(torch.arange(len(egreedy_prev_total)), egreedy_prev_total, alpha=0.6, color="green")
plt.show()

In [18]:
import torch
torch.__version__

'1.9.0'

### rl14_CartPole-NN.py

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.01
# num_episodes = 1000
# gamma = 0.99
num_episodes = 2000
gamma = 0.86

egreedy = 0.9
egreedy_final = 0.02
egreedy_decay = 500
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, number_of_output)
        
    def forward(self, x):
        output = self.linear1(x)
        return output

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self, state, action, new_state, reward, done):
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        
        reward = Tensor([reward]).to(device) # reward 是 scalar 所以要改成 Tensor
        if done:
            target_value = reward
        else:
            new_state_values = self.nn(new_state).detach()
            max_new_state_values = torch.max(new_state_values)
            target_value = reward + gamma * max_new_state_values
        
        predicted_value = self.nn(state)[action]
        loss = self.loss_func(predicted_value, target_value)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0

for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            print("Episode finished after %i steps" % step)
            break
            
print("Average reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

### rl15_CartPole-NN-log.py

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.01
# num_episodes = 1000
# gamma = 0.99
num_episodes = 2000
gamma = 0.86

egreedy = 0.9
egreedy_final = 0.02
egreedy_decay = 500

report_interval = 10
score_to_solve = 195
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, number_of_output)
        
    def forward(self, x):
        output = self.linear1(x)
        return output

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self, state, action, new_state, reward, done):
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        
        reward = Tensor([reward]).to(device) # reward 是 scalar 所以要改成 Tensor
        if done:
            target_value = reward
        else:
            new_state_values = self.nn(new_state).detach()
            max_new_state_values = torch.max(new_state_values)
            target_value = reward + gamma * max_new_state_values
        
        predicted_value = self.nn(state)[action]
        loss = self.loss_func(predicted_value, target_value)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("Average reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

### rl16_CartPole-NN-2layer.py
* 嘗試用兩層 hidden layers 看看結果是否有變好

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.01
num_episodes = 500
gamma = 0.99

hidden_layer = 64

egreedy = 0.9
egreedy_final = 0.02
egreedy_decay = 500

report_interval = 10
score_to_solve = 195
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self, state, action, new_state, reward, done):
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        
        reward = Tensor([reward]).to(device) # reward 是 scalar 所以要改成 Tensor
        if done:
            target_value = reward
        else:
            new_state_values = self.nn(new_state).detach()
            max_new_state_values = torch.max(new_state_values)
            target_value = reward + gamma * max_new_state_values
        
        predicted_value = self.nn(state)[action]
        loss = self.loss_func(predicted_value, target_value)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

### rl17_CartPole-Challenge.py
* 程式碼和 [`rl16_CartPole-NN-2layer.py`](#rl16_CartPole-NN-2layer.py) 一模一樣，只是改變了 parameters 來看對結果有什麼影響

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.02
num_episodes = 500
gamma = 1

hidden_layer = 64

egreedy = 0.9
egreedy_final = 0
egreedy_decay = 500

report_interval = 10
score_to_solve = 195
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self, state, action, new_state, reward, done):
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        
        reward = Tensor([reward]).to(device) # reward 是 scalar 所以要改成 Tensor
        if done:
            target_value = reward
        else:
            new_state_values = self.nn(new_state).detach()
            max_new_state_values = torch.max(new_state_values)
            target_value = reward + gamma * max_new_state_values
        
        predicted_value = self.nn(state)[action]
        loss = self.loss_func(predicted_value, target_value)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

## Deep learning 有兩個問題
* highly correlated data
  * data are strongly linked together
  * RL 的 sequence of observations 之間會有 correlation
* non-stationary distributions
  * Non-stationary 表示 not time independent, the value changes over time and they are not constant
  * $Q(s, a) = r + \gamma \max_{a'}Q(s', a')$
    * 用先前的 $Q(s', a')$ 來計算現在的 $Q(s, a)$
      * 用下一個可能的 action 來計算 $Q(s', a')$ 的時候，全部可能的 action 都要拿來算 $Q(s', a')$，然後選用結果是最大的 $Q$-value 那個
    * 可是用 neural network  來近似 $Q$-value 的時候，target $Q$-value 就會隨著時間而改變
      * Neural network 一直在猜可能的 current state 會導致計算出來的 $Q(s', a')$ 每次都改變一點點，所以 $Q(s, a)$ 就跟著改變
      
      
* 引入 neural network 來求 action-value ($Q$) function (是非線性方程) $\rightarrow$ 導致 RL 會 unstable 或是 diverge
  * Observations 之間有 correlations 的話，對 $Q$ 更新一點點就可能使 policy 改變很多而造成 data 的分佈改變
  * action-values $Q$ 和 target-values $r + \gamma\max_{a'}Q(s', a')$ 之間也有 correlation
* 因為有 correlation 就會造成 RL unstablize 所以要想辦法移除 correlation
  * 用 experience replay: 藉由 randomizes over data 來消除 correlation
  * 用 iterative update 來調整 $Q$ 使其朝 target-value 接近: 週期性的對 $Q$ 更新以降低和 target 之間的 correlation


## Experience Reply
* Experience replay randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors
* experience reply 就是採用 random shuffle input states 的方式來解決前面提到的兩種情況
  * highly correlated data
  * non-stationary distributions
* experience reply 藉由 random sampling 可以移除 correlation in sequence
  * 要知道 reply memory 要用多大，要有多少 transition 要存到 buffer 中
    * 要是太大則 data 無法 fit into memory 要是太小則只能記錄一小部分的 transition
  * 要設定 batch size 所以才知道要把多少 sample pull from buffer
  
  
## Algorithm: DQN with experience reply
```
initialize replay memory D to capacity N
initialize action-value function Q with random weights
for episode = 1, M do
    initialize sequence s_{1} = {x_1} and preprocessed sequenced \phi_{1} = \phi(s_{1})
    for t = 1, T do
        with probability \epsilon select a random action a_{t}
        otherwise select a_{t} = max_{a} Q*(\phi(s_{t}, a; \theta)
        execute action a_{t} in emulator and observe reward r_{t} and image x_{t+1}
        set s_{t+1} = s_{t}, a_{t}, x_{t+1} and preprocess \phi_{t+1} = \phi(s_{t+1})
        store transition (\phi_{t}, a_{t}, r_{t}, \phi_{t+1}) in D
        sample random minibatch of transitions (\phi_{j}, a_{j}, r_{j}, \phi_{j+1}) from D
        set y_{j} = r_{j} for terminal \phi_{j+1} or y_{j} = r_{j} + \gamma max_{a'}Q(\phi_{j+1}, a';\theta) for non-terminal \phi_{j+1}
        perform a gradient descent step on (y - Q(\phi_{j}, a_{j}, \theta))^2 according equation 3
    end for
end for
```

In [1]:
position = 0
capacity = 4
position = (position + 1) % capacity

In [2]:
position

1

In [3]:
position = (position + 1) % capacity
position

2

In [4]:
position = (position + 1) % capacity
position

3

In [5]:
position = (position + 1) % capacity
position

0

In [6]:
x = [[1, 2, 3], [4, 5, 6]]
x

[[1, 2, 3], [4, 5, 6]]

In [7]:
zip(x)

<zip at 0x7f80d2fcfb40>

In [8]:
list(zip(x))

[([1, 2, 3],), ([4, 5, 6],)]

In [9]:
list(zip(*x)) # 有 * 就是把相對應的元素放在同一組 tuple 中

[(1, 4), (2, 5), (3, 6)]

In [11]:
y = torch.Tensor([1, 0, 0])
y

tensor([1., 0., 0.])

In [12]:
1 - y

tensor([0., 1., 1.])

### rl18_CartPole-ExperienceReplay.py

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.02
num_episodes = 500
gamma = 1

hidden_layer = 64

replay_memory_size = 50000
# batch_size = 32
batch_size = 3

egreedy = 0.9
egreedy_final = 0
egreedy_decay = 500

report_interval = 10
score_to_solve = 195
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        
        new_state_values = self.nn(new_state).detach()
        max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

## DQN
* DQN: Deep $Q$-network


## Target Net
* Use an iterative update that adjusts the action-values Q toward target values that are only periodically updated, thereby reducing correlations with the target.
* $L_{i}(\theta_{i}) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \Big[ \Big( r + \gamma max_{a'}Q(s', a';\theta_{i}^{-}) - Q(s, a; \theta_{i})\Big) \Big]$
  * target value: $r + \gamma\max_{a'}Q(s', a'; \theta_{i}^{-})$
    * 要 stablize NN performance 的話，就不要常常改動 target 的數值
    * 用 Target net 來更新 $Q(s', a'; \theta_{i}^{-})$
      * 在一定的 number of steps 內要維持 fixed target value
  * predicted value: $Q(s, a; \theta_{i})$
    * 用 learning net 來更新 $Q(s, a; \theta_{i})$
  * 用 learning NN 來更新 target net
    * 就是更新 weight 的值
    * 其實就是每隔一定的 number of steps 把 learning nn 拷貝到 target net


* 當我們有 fix data to learn from 和 fix target to compare to 的時候，這樣會更類似 supervise learning

* How often we control to update target net?
  * 用 hyperparameters 來調
    * too high: 沒有新的資訊輸入造成 target 幾乎沒變 $\Rightarrow$ doesn't learning anything
    * too low: target 常常改變 non-stationary
  * 建議每隔 10 $\sim$ 100 或 $\sim$ 1000 步來更新
  * 會對結果有很大的影響，所以要先決定是要快得到結果，還是要得到穩定的結果


## Error Clipping
* Clip the error term from the update $r + \gamma \max_{a'}Q(s', a';\theta_{i}^{-}) - Q(s, a; \theta_{i})$ to be between  -1 and 1
  * Absolute value loss function $|x|$ 的導數是
    * -1, $x \lt 0$
    * 1, $x \gt 0$
* 用 error clipping 把誤差限制在 -1 ~ +1 之間，可以增進 stability
* python implemention
```python
loss.backward()

for param in self.agent.parameters():
    param.grad.data.clamp_(-1, 1)
    
self.optimizer.step()
```

### rl19_CartPole-targetnet.py

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.01
num_episodes = 500
gamma = 1

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 500 # 每 500 steps 更新一次 target network

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        
        # new_state_values = self.nn(new_state).detach()
        new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
        max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

* 調整參數看結果有沒有更好
  * 第一組參數是能得到 stable 的結果
  * 第二組參數是 solve as quick as possible

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.001
num_episodes = 500
gamma = 0.999

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 500 # 每 500 steps 更新一次 target network

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = True
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        
        # new_state_values = self.nn(new_state).detach()
        new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
        max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.01
num_episodes = 500
gamma = 1

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 100 # 每 500 steps 更新一次 target network

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        
        # new_state_values = self.nn(new_state).detach()
        new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
        max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

## Double DQN

* 某些情況下 $Q$-learning 會對 action-value 造成 overestimate
  * 因為每次都選最大 $Q$-value 的 action，但是這個可能不是最好的 action
  
* $Y_{t}^{DoubleDQN} = R_{t+1} + \gamma Q(S_{t+1}, \argmax_{a} Q(S_{t+1}, a; \theta_{t}), \theta_{t}^{-})$
  * 用兩個 NN 來求 $Q$-value
    * 一個 NN (learning net) 用來找出 best action: $\argmax_{a} Q(S_{t+1}, a; \theta_{t})$
    * 另一個 NN (target net) 用來求出 $Q$-value: $Q(S_{t+1}, a_{best}, \theta_{t}^{-})$
    

### rl20_CartPole-DoubleDQN.py

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.01
num_episodes = 500
gamma = 1

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 100 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN 找出最佳的 action
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach() # 用 target NN 來求出 Q
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.001
num_episodes = 500
gamma = 0.9999

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 500 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        self.linear2 = nn.Linear(hidden_layer, number_of_output)
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        output2 = self.linear2(output1)
        
        return output2

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach()
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

## Dueling DQN

* 只是做些簡單的修改就有可能大大地改變 efficiency & 結果的 stability
* Dueling DQN
  * For model free RL
  * 有兩個 estimators:
    * 一個用來計算 state value function
    * 一個用來計算 state dependent action advantage function
      * 有些時候某些 action 就比較重要，有些時候 action 就不重要，用哪個都無所謂。
* Advantage function: $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$
  * 就是把 $Q$-value 拆成 $V$ value 和 $A$ value 兩部份分開計算，最後再把兩者結合計算 $Q$-value
    * $V$: how good it is to be in a defined state
    * $A$: 告訴我們 importance of each action
* $Q$-function: $Q(s, a; \theta, \alpha, \beta) = V(s;\theta, \beta) + \Big(A(s, a;\theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s, a';\theta, \alpha)\Big)$
  * $\frac{1}{|\mathcal{A}|}\sum_{a'}A(s, a';\theta, \alpha)\Big)$: mean value of advantage
    * $|\mathcal{A}|$: Number of samples
* Dueling DQN 對 actions 越多的環境就越有效
* Dueling DQN 只需要把一層 hidden layer 改成兩個部分，一個用來算 $V$-value 另一個用來算 advantage
  * value 和 advantage 使用的輸入都是前一層的輸出
  * value 只有一個 output
  * Advangate 有 number of actions 個 outputs
  * 要把 value 和 advantage 的輸出做 fully connection 然後變成輸出
  

### rl20_CartPole-DuelingDQN.py
* 跑完的結果
```
Average reward: 128.41
Average reward (last 100 episodes): 195.77
Solved after 316 episodes
```

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.001
num_episodes = 500
gamma = 0.9999

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 500 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        # self.linear2 = nn.Linear(hidden_layer, number_of_output)
        # 第二層改為 value 和 advantage
        self.advantage = nn.Linear(hidden_layer, number_of_output) # advantage 有 number of actions 個輸出
        self.value = nn.Linear(hidden_layer, 1) # value 只有一個輸出
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        # output2 = self.linear2(output1)
        output_advantage = self.advantage(output1,)
        output_value = self.value(output1,)
        
        # 要把 value 和 advantage 的輸出結合起來
        output_final = output_value + output_advantage - output_advantage.mean()
        
        return output_final

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach()
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

* Have a stable result
  * 目標是 last 100 episodes 沒有損失任何的點數
    * 就是說最後的 100 episodes 的 rewards 都是 200
  * 把 update_target_frequency 改成 1500
* 跑完的結果
```
Average reward: 149.26
Average reward (last 100 episodes): 198.26
Solved after 253 episodes
```

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.001
num_episodes = 500
gamma = 0.9999

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 1500 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        # self.linear2 = nn.Linear(hidden_layer, number_of_output)
        # 第二層改為 value 和 advantage
        self.advantage = nn.Linear(hidden_layer, number_of_output) # advantage 有 number of actions 個輸出
        self.value = nn.Linear(hidden_layer, 1) # value 只有一個輸出
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        # output2 = self.linear2(output1)
        output_advantage = self.advantage(output1,)
        output_value = self.value(output1,)
        
        # 要把 value 和 advantage 的輸出結合起來
        output_final = output_value + output_advantage - output_advantage.mean()
        
        return output_final

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach()
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

* Solve the problem as soon as possible
  * 目標是要 beat DoubleDQN 的結果
  * 把 gamma 改成 1，把 update_target_frequency 設成 50
  * 把 value 和 advantage 各多加一層，所以整個模型裡有三層
* 跑完的結果
```

```

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import matplotlib.pyplot as plt

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor
env = gym.make("CartPole-v0")

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.001
num_episodes = 500
gamma = 1

hidden_layer = 64

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 50 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 1
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        # self.linear2 = nn.Linear(hidden_layer, number_of_output)
        # 第二層改為 value 和 advantage
        self.advantage = nn.Linear(hidden_layer, hidden_layer)
        self.advantage2 = nn.Linear(hidden_layer, number_of_output) # advantage 有 number of actions 個輸出
        
        self.value = nn.Linear(hidden_layer, hidden_layer)
        self.value2 = nn.Linear(hidden_layer, 1) # value 只有一個輸出
        
        self.activation = nn.Tanh()
        # self.activation = nn.ReLU()
        
    def forward(self, x):
        output1 = self.linear1(x)
        output1 = self.activation(output1)
        # output2 = self.linear2(output1)
        output_advantage = self.advantage(output1)
        output_advantage = self.activation(output_advantage)
        output_advantage = self.advantage2(output_advantage)
        
        output_value = self.value(output1)
        output_value = self.activation(output_value)
        output_value = self.value2(output_value)
        
        # 要把 value 和 advantage 的輸出結合起來
        output_final = output_value + output_advantage - output_advantage.mean()
        
        return output_final

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
        self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
                action = torch.max(action_from_nn, 0)[1]
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
        
        # 要把全部的 data 轉成 tensor
        state = Tensor(state).to(device)
        new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach()
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
        if self.update_target_counter % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
        self.update_target_counter += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

steps_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    step = 0
    # for step in range(100):
    while True:
        step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        if done:
            steps_total.append(step)
            mean_reward_100 = sum(steps_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0):
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(steps_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(steps_total)/len(steps_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(steps_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(steps_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.bar(torch.arange(len(steps_total)), steps_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

Gym environment 位在

## Stanford CS231 course CNN
http://cs231n.stanford.edu/

### rl23_PongRandom.py

* 修改 rl12_CartPoleRandomNew.py
* 使用 [atari_wrappers.py](https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers.py)

In [None]:
import gym
import torch
import random
import matplotlib.pyplot as plt
from atari_wrappers import make_atari, wrap_deepmind

# env = gym.make("CartPole-v1") # 建立環境，這邊用 gym 中已經設定好的環境
env_id = "PongNoFrameskip-v4"
env = make_atari(env_id)
env = wrap_deepmind(env)

# 設定 random seed
seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

num_episodes = 5 # 跑 1000 個 episodes
rewards_total = [] # 用來儲存每個 episode 走了幾步

for i_episode in range(num_episodes):
    state = env.reset() # 最初要先 reset 狀態
    
    score = 0 # 每個 episode 一開始的步數要先歸零
    while True:
#         step += 1
        action = env.action_space.sample() # 從 action space 中隨機選擇 action
        new_state, reward, done, info = env.step(action) # 告訴環境 agent 用了哪個 action
        score += reward

#         print(new_state) # 每走了一步之後變成新的狀態
#         print(info)

#         env.render() # 把圖案畫出來，會畫在新的 window 中，不會畫在 jupyter notebook 中

        if done:
            rewards_total.append(score) # 每個 episode 共走了幾部記錄下來
            print("Episode finised. Rewards: %i" % score)
            break

# 在 CartPole 中每一個 step 就得到一 reward 所以兩者是等價的，因此只需要一張圖就好
print("Average reward: %.2f" % (sum(rewards_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(rewards_total[-100:])/100))

plt.figure(figsize=(12, 5))
plt.title("Rewards")
plt.plot(rewards_total, alpha=0.6, color="green")
plt.show()

# 要把環境關掉
env.close()
env.env.close()

#### Save and load model
* https://pytorch.org/tutorials/beginner/saving_loading_models.html
* https://pytorch.org/tutorials/beginner/basics/saveloadrun_tutorial.html

## Flatten tensor

[[0, 1], [2, 3]] $\rightarrow$ [0, 1, 2, 3]

In [1]:
import torch

y = torch.zeros(1, 7, 7, 64)
y.shape # 這個是 4-dim

torch.Size([1, 7, 7, 64])

In [2]:
z = y.view(y.size(0), -1)
z

tensor([[0., 0., 0.,  ..., 0., 0., 0.]])

In [3]:
z.shape # 變成 2-dim

torch.Size([1, 3136])

In [4]:
7*7*64

3136

In [5]:
y = torch.randn(1, 84, 84)
y.shape

torch.Size([1, 84, 84])

In [None]:
nn.Conv2d(in_channels=1, out_channels=32, kernel_size=8, stride=4)(y)
# Conv2d 預計輸入是 4-dim tensor 但是 y 是 3-dim 所以會有 error

In [None]:
z = torch.randn(1, 1, 84, 84)
z.shape

### rl24_PongVideoOutput.py

* Copy rl22_CartPole-Dueling-Tune-Fast.py
* DeepMind's CNN

|Layer|Input|Filter Size|Stride|# of filters|Activation|Output|
|:---|:---|:---|:---|:---|:---|:---|
|conv1|84x84x4|8|4|32|ReLU|20x20x32|
|conv2|20x20x32|4|2|64|ReLU|9x9x64|
|conv3|9x9x64|3|1|64|ReLU|7x7x64|
|fc1|7x7x64|||512|ReLU|512|
|fc2|512||||Linear|# of actions|

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import os.path
import matplotlib.pyplot as plt
from atari_wrappers import make_atari, wrap_deepmind

plt.style.use("ggplot")

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor

# env = gym.make("CartPole-v0")
env_id = "PongNoFrameskip-v4"
env = make_atari(env_id)
env = wrap_deepmind(env)

# 把結果存成 video
directory = "./PongVideos/"
env = gym.wrappers.Monitor(env, directory, video_callable=lambda episode_id: episode_id % 20 == 0)

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.001
num_episodes = 500
gamma = 1

hidden_layer = 512

replay_memory_size = 50000
batch_size = 32

update_target_frequency = 50 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 1
egreedy_final = 0.01
egreedy_decay = 500

report_interval = 10
score_to_solve = 195

clip_error = False
normalize_image = True

file2save = "pong_save.pth"
save_model_frequency = 10000 # 每 1 萬個 frame 就存檔一次
resume_previous_training = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

def load_model():
    return torch.load(file2save)

def save_model(model):
    torch.save(model.state_dict(), file2save)

def preprocess_frame(frame):
    frame = frame.transpose((2, 0, 1))
    frame = torch.from_numpy(frame)
    frame = frame.to(device, dtype=torch.float32)
    frame = frame.unsqueeze(1)
    return frame
    
def plot_results():
    plt.figure(figsize=(12, 5))
    plt.title("Rewards")
    plt.plot(rewards_total, alpha=0.6, color="red")
#     plt.show()
    plt.savefig("Pong-results.png")
    plt.close()
    
class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
#         self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        # self.linear2 = nn.Linear(hidden_layer, number_of_output)
        # 改用 CNN: 有三層 Conv layers 和兩層 FC
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernal_size=8, stride=4) # input 是 1 frames
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernal_size=4, stride=2) # input 是前一級的 output
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernal_size=3, stride=1)
        
        # 第二層改為 value 和 advantage
        self.advantage1 = nn.Linear(7*7*64, hidden_layer)
        self.advantage2 = nn.Linear(hidden_layer, number_of_output) # advantage 有 number of actions 個輸出
        
        self.value1 = nn.Linear(7*7*64, hidden_layer)
        self.value2 = nn.Linear(hidden_layer, 1) # value 只有一個輸出
        
#         self.activation = nn.Tanh()
        self.activation = nn.ReLU()
        
    def forward(self, x):
        if normalize_image:
            x = x / 255
#         output = self.linear1(x)
#         output1 = self.activation(output1)
        output_conf = self.conv1(x)
        output_conf = self.activation(output_conf)
        output_conf = self.conv2(output_conf)
        output_conf = self.activation(output_conf)
        output_conf = self.conv3(output_conf)
        output_conf = self.activation(output_conf)
        
        # flatten: 把 conv 的輸出轉成 2d tensor
        output_conf = output_conf.view(output_conf.size(0), -1)
        
        # output2 = self.linear2(output1)
        output_advantage = self.advantage1(output_conf)
        output_advantage = self.activation(output_advantage)
        output_advantage = self.advantage2(output_advantage)
        
        output_value = self.value1(output_conf)
        output_value = self.activation(output_value)
        output_value = self.value2(output_value)
        
        # 要把 value 和 advantage 的輸出結合起來
        output_final = output_value + output_advantage - output_advantage.mean()
        
        return output_final

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
#         self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        self.number_of_frames = 0
        
        if resume_previous_training and os.path.exists(file2save):
            print("Loading previously saved model ...")
            self.nn.load_state_dict(load_model())
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = preprocess_frame(state)
#                 state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
#                 print("\n\n\nACTION NN")
#                 print(action_from_nn)
                action = torch.max(action_from_nn, 1)[1] # 第二個參數從 0 改成 1
#                 print("\n\nACTION")
#                 print(action)
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
#         print("\n\nREPORT")
#         print(type(state))
        # 這時候的 state 是一個 batch 有 32 個 frames 而不是 single image
        state = [preprocess_frame(frame) for frame in state] # 要把 batch 中的每個 frame 都轉換
#         print("\n\nPREPROCESSED")
#         print(type(state))
        state = torch.cat(state) # 要把 python list 接成一個 pytorch tensor
#         print("\n\nCONCATANTE")
#         print(type(state))
#         print(state.shape)
#         print("\n\n\n")
        
        new_state = [preprocess_frame(frame) for frame in new_state]
        new_state = torch.cat(new_state)
        
        # 要把全部的 data 轉成 tensor
#         state = Tensor(state).to(device)
#         new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach()
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
#         if self.update_target_counter % update_target_frequency == 0:
        if self.number_of_frames % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
#         if self.update_target_counter % save_model_frequency == 0:
        if self.number_of_frames % save_model_frequency == 0:
            save_model(self.nn)
            
#         self.update_target_counter += 1
        self.number_of_frames += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

rewards_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    score = 0
    # for step in range(100):
    while True:
#         step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        score += reward
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        
#         print("IMAGE")
#         print(state)
#         print("\n\nDATA TYPE")
#         print(type(state))
#         print(state.dtype)
#         print("\n\nSHAPE")
#         print(state.shape)
#         print("\n\n\n")
#         print("TRANSPOSE")
#         # 要把 numpy array 轉成 pytorch 能讀的格式
#         state = state.transpose((2, 0, 1)) # (2, 0, 1) 表示 new order of dims 
#         # numpy array 中每個 col 的 index (0, 1, 2) 放到 pytorch 後要變成 (2, 0, 1)
#         print(state.shape)
#         print("\n\n\n")
        
        if done:
            rewards_total.append(score)
            mean_reward_100 = sum(rewards_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0 and i_episode > 0): # 第一個 episode 時 data 量不夠，所以要 i_episode > 0
                plot_results()
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(rewards_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(rewards_total)/len(rewards_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(rewards_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(rewards_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
# plt.figure(figsize=(12, 5))
# plt.title("Rewards")
# plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
# plt.show()

# 要把環境關掉
env.close()
env.env.close()

* 目標是在 100 分鐘內解出來
  * 怎樣算解出來？ 最後 100 個 episodes 的平均點數要 18

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

import gym
import random
import math
import time
import os.path
import matplotlib.pyplot as plt
from atari_wrappers import make_atari, wrap_deepmind

plt.style.use("ggplot")

# if gpu is to be used
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
Tensor = torch.Tensor
LongTensor = torch.LongTensor

# env = gym.make("CartPole-v0")
env_id = "PongNoFrameskip-v4"
env = make_atari(env_id)
env = wrap_deepmind(env)

# 把結果存成 video
directory = "./PongVideos/"
env = gym.wrappers.Monitor(env, directory, video_callable=lambda episode_id: episode_id % 20 == 0)

seed_value = 23
env.seed(seed_value)
torch.manual_seed(seed_value)
random.seed(seed_value)

###### PARAMS ######
learning_rate = 0.0001
num_episodes = 500
gamma = 0.99

hidden_layer = 512

replay_memory_size = 100000
batch_size = 32

update_target_frequency = 5000 # 每 500 steps 更新一次 target network

double_dqn = True

egreedy = 0.9
egreedy_final = 0.01
egreedy_decay = 10000

report_interval = 10
score_to_solve = 18

clip_error = True
normalize_image = True

file2save = "pong_save.pth"
save_model_frequency = 10000 # 每 1 萬個 frame 就存檔一次
resume_previous_training = False
####################

number_of_inputs = env.observation_space.shape[0]
number_of_output = env.action_space.n

def calculate_epsilon(steps_done):
    epsilon = egreedy_final + (egreedy - egreedy_final) * \
              math.exp(-1. * steps_done / egreedy_decay)
    return epsilon

def load_model():
    return torch.load(file2save)

def save_model(model):
    torch.save(model.state_dict(), file2save)

def preprocess_frame(frame):
    frame = frame.transpose((2, 0, 1))
    frame = torch.from_numpy(frame)
    frame = frame.to(device, dtype=torch.float32)
    frame = frame.unsqueeze(1)
    return frame
    
def plot_results():
    plt.figure(figsize=(12, 5))
    plt.title("Rewards")
    plt.plot(rewards_total, alpha=0.6, color="red")
#     plt.show()
    plt.savefig("Pong-results.png")
    plt.close()
    
class ExperienceReplay(object):
    def __init__(self, capacity):
        self.capacity = capacity # memory 的大小
        self.memory = []
        self.position = 0 # 幫忙追蹤放到 memory 中的 entry
        
    def push(self, state, action, new_state, reward, done):
        transition = (state, action, new_state, reward, done) # 把所有資訊放到一個 transition 中，然後把 transition 餵給 memory
        
        if self.position >= len(self.memory):
            self.memory.append(transition) # 在 memory 中加入新的 transition
        else:
            self.memory[self.position] = transition # overwrite 原有的 memory
            
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size): # batch_size 是只有多少 entries 要做 sampling
        return zip(*random.sample(self.memory, batch_size)) # random.sample(從哪裡取數，要取幾個數)
        
    def __len__(self):
        return len(self.memory)

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
#         self.linear1 = nn.Linear(number_of_inputs, hidden_layer)
        # self.linear2 = nn.Linear(hidden_layer, number_of_output)
        # 改用 CNN: 有三層 Conv layers 和兩層 FC
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernal_size=8, stride=4) # input 是 1 frames
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernal_size=4, stride=2) # input 是前一級的 output
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=64, kernal_size=3, stride=1)
        
        # 第二層改為 value 和 advantage
        self.advantage1 = nn.Linear(7*7*64, hidden_layer)
        self.advantage2 = nn.Linear(hidden_layer, number_of_output) # advantage 有 number of actions 個輸出
        
        self.value1 = nn.Linear(7*7*64, hidden_layer)
        self.value2 = nn.Linear(hidden_layer, 1) # value 只有一個輸出
        
#         self.activation = nn.Tanh()
        self.activation = nn.ReLU()
        
    def forward(self, x):
        if normalize_image:
            x = x / 255
#         output = self.linear1(x)
#         output1 = self.activation(output1)
        output_conf = self.conv1(x)
        output_conf = self.activation(output_conf)
        output_conf = self.conv2(output_conf)
        output_conf = self.activation(output_conf)
        output_conf = self.conv3(output_conf)
        output_conf = self.activation(output_conf)
        
        # flatten: 把 conv 的輸出轉成 2d tensor
        output_conf = output_conf.view(output_conf.size(0), -1)
        
        # output2 = self.linear2(output1)
        output_advantage = self.advantage1(output_conf)
        output_advantage = self.activation(output_advantage)
        output_advantage = self.advantage2(output_advantage)
        
        output_value = self.value1(output_conf)
        output_value = self.activation(output_value)
        output_value = self.value2(output_value)
        
        # 要把 value 和 advantage 的輸出結合起來
        output_final = output_value + output_advantage - output_advantage.mean()
        
        return output_final

class QNet_Agent(object):
    def __init__(self):
        self.nn = NeuralNetwork().to(device)
        self.target_nn = NeuralNetwork().to(device)
        self.loss_func = nn.MSELoss()
        # self.loss_func = nn.SmoothL1Loss()

        self.optimizer = optim.Adam(params=self.nn.parameters(), lr=learning_rate)
        # self.optimizer = optim.RMSprop(params=self.nn.parameters(), lr=learning_rate)
        
#         self.update_target_counter = 0 # 用來追蹤什麼時候需要更新 target net
        self.number_of_frames = 0
        
        if resume_previous_training and os.path.exists(file2save):
            print("Loading previously saved model ...")
            self.nn.load_state_dict(load_model())
        
    def select_action(self, state, epsilon):
        random_for_egreedy = torch.rand(1)[0]
        if random_for_egreedy > epsilon:
            with torch.no_grad():
                state = preprocess_frame(state)
#                 state = Tensor(state).to(device)
                action_from_nn = self.nn(state)
#                 print("\n\n\nACTION NN")
#                 print(action_from_nn)
                action = torch.max(action_from_nn, 1)[1] # 第二個參數從 0 改成 1
#                 print("\n\nACTION")
#                 print(action)
                action = action.item()
        else:
            action = env.action_space.sample()
            
        return action
    
    def optimize(self):
        # 要取出 barch_size 這麼多個 samples, 那 memory 中必須要有 batch_size 這麼多個 samples 才行
        if (len(memory) < batch_size):
            return # 當 memory 比 batch_size 小的時候就直接傳回
        state, action, new_state, reward, done = memory.sample(batch_size)
#         print("\n\nREPORT")
#         print(type(state))
        # 這時候的 state 是一個 batch 有 32 個 frames 而不是 single image
        state = [preprocess_frame(frame) for frame in state] # 要把 batch 中的每個 frame 都轉換
#         print("\n\nPREPROCESSED")
#         print(type(state))
        state = torch.cat(state) # 要把 python list 接成一個 pytorch tensor
#         print("\n\nCONCATANTE")
#         print(type(state))
#         print(state.shape)
#         print("\n\n\n")
        
        new_state = [preprocess_frame(frame) for frame in new_state]
        new_state = torch.cat(new_state)
        
        # 要把全部的 data 轉成 tensor
#         state = Tensor(state).to(device)
#         new_state = Tensor(new_state).to(device)
        # 這時候 reward, action, done 都是 batch 而不是單一的數值
        reward = Tensor(reward).to(device) # reward 是 scalar 所以要改成 Tensor
        action = LongTensor(action).to(device)
        done = Tensor(done).to(device)
        
        if double_dqn:
            new_state_indexes = self.nn(new_state).detach() # 用 learning NN
            max_new_state_indexes = torch.max(new_state_indexes, 1)[1] # 用 [1] 取出 index
            
            new_state_values = self.target_nn(new_state).detach()
            max_new_state_values = new_state_values.gather(1, max_new_state_indexes.unsqueeze(1)).squeeze(1)
        else:
            # new_state_values = self.nn(new_state).detach()
            new_state_values = self.target_nn(new_state).detach() # 改成用 target network 來計算 new state
            max_new_state_values = torch.max(new_state_values, 1)[0] # 1 表示從每個 row 中選出最大值, [0] 是取出最大值的數值
        target_value = reward + (1 - done) * gamma * max_new_state_values # 當 done = 1 的時候，1-done=0 所以只會剩下 reward

        # print(new_state_values)
        # print(max_new_state_values)
        
        # predicted_value = self.nn(state)[action]
        # print(self.nn(state).size())
        # print(action.unsqueeze(1).size())
        # print(self.nn(state))
        # print(action.unsqueeze(1))
        # print(self.nn(state).gather(1, action.unsqueeze(1))) # gather 沿著 row 方向照 action 指示的值取出來 state
        predicted_value = self.nn(state).gather(1, action.unsqueeze(1)).squeeze(1)
        # print(predicted_value.squeeze(1).size())
        # print(target_value.size())
        loss = self.loss_func(predicted_value, target_value) # predicted_value 和 target_value 要有相同的 dimension
        self.optimizer.zero_grad()
        loss.backward()
        
        if clip_error:
            for param in self.nn.parameters():
                param.grad.data.clamp_(-1, 1)
        self.optimizer.step()
        
#         if self.update_target_counter % update_target_frequency == 0:
        if self.number_of_frames % update_target_frequency == 0:
            self.target_nn.load_state_dict(self.nn.state_dict())
        
#         if self.update_target_counter % save_model_frequency == 0:
        if self.number_of_frames % save_model_frequency == 0:
            save_model(self.nn)
            
#         self.update_target_counter += 1
        self.number_of_frames += 1
        # Q[state, action] = reward + gamma * torch.max(Q[new_state])
        
memory = ExperienceReplay(replay_memory_size)
qnet_agent = QNet_Agent()

rewards_total = []
frames_total = 0
solved_after = 0
solved = False
start_time = time.time()
for i_episode in range(num_episodes):
    state = env.reset()
    score = 0
    # for step in range(100):
    while True:
#         step += 1
        frames_total += 1
        epsilon = calculate_epsilon(frames_total)
        # action = env.action_space.sample()
        action = qnet_agent.select_action(state, epsilon)
        new_state, reward, done, info = env.step(action)
        
        score += reward
        
        memory.push(state, action, new_state, reward, done) # 把 experience 收集到 memory 中
        qnet_agent.optimize()
        # qnet_agent.optimize(state, action, new_state, reward, done)
        state = new_state
        
#         print("IMAGE")
#         print(state)
#         print("\n\nDATA TYPE")
#         print(type(state))
#         print(state.dtype)
#         print("\n\nSHAPE")
#         print(state.shape)
#         print("\n\n\n")
#         print("TRANSPOSE")
#         # 要把 numpy array 轉成 pytorch 能讀的格式
#         state = state.transpose((2, 0, 1)) # (2, 0, 1) 表示 new order of dims 
#         # numpy array 中每個 col 的 index (0, 1, 2) 放到 pytorch 後要變成 (2, 0, 1)
#         print(state.shape)
#         print("\n\n\n")
        
        if done:
            rewards_total.append(score)
            mean_reward_100 = sum(rewards_total[-100:])/100
            if (mean_reward_100 > score_to_solve and solved == False):
                print("SOLVED! After %i episodes" % i_episode)
                solved_after = i_episode
                solved = True
                
            if (i_episode % report_interval == 0 and i_episode > 0): # 第一個 episode 時 data 量不夠，所以要 i_episode > 0
                plot_results()
                print("\n**** Episode %i ***\
                      \nAv.reward: [last %i]: %.2f, [last 100]: %.2f, [all]: %.2f,\
                      \nepsilon: %.2f, frames_total: %i"
                      % 
                      (i_episode,
                       report_interval,
                       sum(rewards_total[-report_interval:])/report_interval,
                       mean_reward_100,
                       sum(rewards_total)/len(rewards_total),
                       epsilon,
                       frames_total
                      )
                     )
                elapsed_time = time.time() - start_time
                print("Elapsed time:", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
            break
            
print("\n\n\n\nAverage reward: %.2f" % (sum(rewards_total)/num_episodes))
print("Average reward (last 100 episodes): %.2f" % (sum(rewards_total[-100:])/100))
if solved:
    print("Solved after %i episodes" % solved_after)
# plt.figure(figsize=(12, 5))
# plt.title("Rewards")
# plt.bar(torch.arange(len(rewards_total)), rewards_total, alpha=0.6, color="green")
# plt.show()

# 要把環境關掉
env.close()
env.env.close()