<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_01_ai_gym.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 12: Reinforcement Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).
* Modified by uramoon@kw.ac.kr

# Module 12 Video Material

* **Part 12.1: Introduction to the OpenAI Gym** [[Video]](https://www.youtube.com/watch?v=_KbUxgyisjM&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_01_ai_gym.ipynb)
* Part 12.2: Introduction to Q-Learning [[Video]](https://www.youtube.com/watch?v=A3sYFcJY3lA&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_02_qlearningreinforcement.ipynb)
* Part 12.3: Keras Q-Learning in the OpenAI Gym [[Video]](https://www.youtube.com/watch?v=qy1SJmsRhvM&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_03_keras_reinforce.ipynb)
* Part 12.4: Atari Games with Keras Neural Networks [[Video]](https://www.youtube.com/watch?v=co0SwPWoZh0&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_04_atari.ipynb)
* Part 12.5: Application of Reinforcement Learning [[Video]](https://www.youtube.com/watch?v=1jQPP3RfwMI&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_12_05_apply_rl.ipynb)


# Part 12.1: Introduction to the OpenAI Gymnasium

Gymnasium is an API for reinforcement learning; it originated as OpenAI's Gym library.

[OpenAI Gym](https://gym.openai.com/) aims to provide an easy-to-setup general-intelligence benchmark with various environments. The goal is to standardize how environments are defined in AI research publications to make published research more easily reproducible. The project claims to provide the user with a simple interface. As of June 2017, developers can only use Gym with Python. 

OpenAI Gym is installed on your local machine with pip. There are a few significant limitations to be aware of:

* OpenAI Gym Atari only **directly** supports Linux and Macintosh
* OpenAI Gym Atari can be used with Windows; however, it requires a particular [installation procedure](https://towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30)
* OpenAI Gym cannot directly render animated games in Google CoLab.

Because OpenAI Gym requires a graphics display, an embedded video is the only way to display Gym in Google CoLab. The presentation of OpenAI Gym game animations in Google CoLab is discussed later in this module.

## OpenAI Gym Leaderboard

OpenAI Gym has a leaderboard, similar to Kaggle; however, it is much more informal. The user's local machine performs all scoring, so the leaderboard is strictly an "honor system." The leaderboard is maintained in the following GitHub repository:

* [OpenAI Gym Leaderboard](https://github.com/openai/gym/wiki/Leaderboard)

You must provide a write-up with sufficient instructions to reproduce your result if you submit a score. A video of your results is suggested but not required.

## Looking at Gym Environments

Gym provides environments in which reinforcement learning can be carried out.
The centerpiece of Gym is the environment, which defines the "game" in which your reinforcement algorithm will compete. An environment does not need to be a game; however, it describes the following game-like features:
* **action space**: the list of actions that can be taken at each step
* **observation space**: the state that can currently be observed
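As a rough illustration of what an action space offers, the sketch below defines a hypothetical `ToyDiscrete` class. It is not part of Gym; it only mimics the two operations most often used on a discrete space, namely sampling a random action and checking whether a value is a valid action. The class name and the `seed` argument are assumptions made for this example.

```python
import random


class ToyDiscrete:
    """Hypothetical stand-in for a discrete space with actions 0 .. n-1."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        # Pick one of the n valid actions uniformly at random
        return self._rng.randrange(self.n)

    def contains(self, x):
        # A valid action is an integer in the range [0, n)
        return isinstance(x, int) and 0 <= x < self.n


action_space = ToyDiscrete(3, seed=0)
samples = [action_space.sample() for _ in range(5)]
print(samples)
```

Every sampled value is guaranteed to satisfy `contains`, which is the property a random agent relies on.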

Before we begin to look at Gym, it is essential to understand some of the terminology used by this library.

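* **Agent** - The machine learning program that takes an action at each step; the next state depends on the action taken.
* **Episode** - A sequence of consecutive steps. The episode ends when the agent fails or reaches the predefined maximum number of steps.
* **Render** - Gym can draw, frame by frame, what happens during an episode.
* **Reward** - The agent can receive a reward for its actions, either at each step or when the episode ends.
* **Non-deterministic** - In some environments, rewards are given probabilistically, for example a lottery.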
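To make these terms concrete, here is a minimal sketch of an agent interacting with an environment for one episode. `CountdownEnv` is a hypothetical toy environment written for this example, not a Gym environment; it only borrows the `reset`/`step` pattern and the four-value return used in the cells later in this notebook.

```python
class CountdownEnv:
    """Hypothetical toy environment: drive an integer counter down to zero."""

    def __init__(self, start=3, max_steps=10):
        self.start = start
        self.max_steps = max_steps

    def reset(self):
        self.counter = self.start
        self.steps = 0
        return self.counter  # initial observation

    def step(self, action):
        self.steps += 1
        if action == 1:  # action 1 decrements the counter, action 0 does nothing
            self.counter -= 1
        reached_goal = self.counter == 0
        # The episode ends at the goal or when the step limit is reached
        done = reached_goal or self.steps >= self.max_steps
        reward = 0.0 if reached_goal else -1.0  # -1 for every step short of the goal
        return self.counter, reward, done, {}


env = CountdownEnv()
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    action = 1  # a trivial agent that always decrements
    observation, reward, done, info = env.step(action)
    total_reward += reward

print(env.steps, total_reward)
```

The episode here lasts three steps; an agent that chose action 0 every time would instead run into the step limit and collect a lower total reward.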

It is important to note that many Gym environments report that they are not non-deterministic even though they use random numbers to process actions. According to the Gym GitHub issue tracker, the non-deterministic property means that an environment behaves randomly even when it is given a consistent seed value. A program can use the `seed` method of an environment to seed that environment's random number generator.
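The idea behind seeding can be sketched with Python's standard `random` module rather than Gym itself. The `rollout` helper below is hypothetical; it simply shows that a generator seeded with the same value reproduces the same sequence of "random" choices, which is why an environment that uses random numbers can still be treated as deterministic.

```python
import random


def rollout(seed, n=5):
    """Return n pseudo-random actions drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randrange(3) for _ in range(n)]


same_a = rollout(seed=42)
same_b = rollout(seed=42)
print(same_a == same_b)  # the two sequences are identical
```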

The Gym library allows us to query some of these attributes from environments. I created the following function to query gym environments.


In [1]:
# The current version has problems on Colab, so install an older version
!pip install gym==0.15.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gym==0.15.3
  Downloading gym-0.15.3.tar.gz (1.6 MB)
  Preparing metadata (setup.py) ... done
Collecting pyglet<=1.3.2,>=1.2.0 (from gym==0.15.3)
  Downloading pyglet-1.3.2-py2.py3-none-any.whl (1.0 MB)
Collecting cloudpickle~=1.2.0 (from gym==0.15.3)
  Downloading cloudpickle-1.2.2-py2.py3-none-any.whl (25 kB)
Building wheels for collected packages: gym
  Building wheel for gym (setup.py) ... done
  Created wheel for gym: filename=gym-0.15.3-py3-none-any.whl size=1644944 sha256=ade9c35eceab646c5eefdb8109a4bab3fe36e00debc60d2b65ed10fab1976dfa
  Stored in directory: /root/.cache/pip/wheels/dc/4c/1c/25048a3f2e8

In [2]:
import gym

# Print information about the environment registered under `name`
def query_environment(name):
    env = gym.make(name)
    spec = gym.spec(name)
    print(f"Action Space: {env.action_space}")
    print(f"Observation Space: {env.observation_space}")
    print(f"Max Episode Steps: {spec.max_episode_steps}")
    print(f"Nondeterministic: {spec.nondeterministic}")
    print(f"Reward Range: {env.reward_range}")
    print(f"Reward Threshold: {spec.reward_threshold}")


## Exploring the MountainCar Environment
We will look at the **MountainCar-v0** environment, which challenges an underpowered car to escape the valley between two mountains. The following code describes the Mountain Car environment.

<img src='https://gymnasium.farama.org/_images/mountain_car.gif' width="360">

In [3]:
query_environment("MountainCar-v0")

Action Space: Discrete(3)
Observation Space: Box(2,)
Max Episode Steps: 200
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: -110.0


### TODO: Answer the MountainCar Questions
Hint: https://gymnasium.farama.org/environments/classic_control/mountain_car/


In [4]:
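# Q: What actions can be taken at each step in MountainCar-v0? Describe what each action means.
# A: 0: accelerate left, 1: do not accelerate, 2: accelerate right

# Q: The observation consists of two real numbers. What does each one represent?
# A: First number: position, second number: velocity

# Q: The goal of reinforcement learning is to collect as much reward as possible.
# In this environment a reward of -1 is given at every step the car is not at the goal. How should the agent act?
# A: It should push back and forth to reach the goal as quickly as possible.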
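The "push back and forth" answer can be sketched without Gym. The update rule and constants below (force 0.001, gravity 0.0025, velocity limit 0.07, position range -1.2 to 0.6, goal at 0.5) are assumed from the public MountainCar documentation linked in the hint, and the fixed start position of -0.5 is an assumption made so the run is reproducible. The policy simply accelerates in the direction the car is already moving.

```python
import math


def mountain_car_step(position, velocity, action):
    """One step of the mountain car update (constants assumed from the public docs)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    position = max(-1.2, min(0.6, position))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # the car stops when it hits the left wall
    return position, velocity


position, velocity = -0.5, 0.0  # assumed start near the valley floor
steps = 0
reached = False
while steps < 200 and not reached:
    # Accelerate in the current direction of travel to build momentum
    action = 2 if velocity >= 0 else 0
    position, velocity = mountain_car_step(position, velocity, action)
    steps += 1
    reached = position >= 0.5

print(reached, steps)
```

Always pushing right (action 2) would stall partway up the right slope, because the engine force is smaller than gravity there; swinging left first is what lets the car gain enough energy.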

## Exploring the CartPole Environment
<img src="https://gymnasium.farama.org/_images/cart_pole.gif" width="360">

In [5]:
query_environment("CartPole-v1")

Action Space: Discrete(2)
Observation Space: Box(4,)
Max Episode Steps: 500
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: 475.0


## TODO: Answer the CartPole Questions
Hint: https://gymnasium.farama.org/environments/classic_control/cart_pole/




In [6]:
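# Q: What actions can be taken at each step in CartPole-v1? Describe what each action means.
# A: 0: push the cart to the left, 1: push the cart to the right

# Q: The observation consists of four real numbers. What does each one represent?
# A: First number: cart position, second number: cart velocity, third number: pole angle, fourth number: pole angular velocity

# Q: What are the three episode end conditions listed in the documentation?
# A: First condition: Pole Angle is greater than ±12°, second condition: Cart Position is greater than ±2.4, third condition: Episode length is greater than 500

# Q: The goal of reinforcement learning is to collect as much reward as possible.
# In this environment a reward of +1 is given at every step. How should the agent act?
# A: It should keep the pole balanced for as long as possible.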
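The three end conditions from the answer above can be written as a small check. This is a sketch, not Gym's own code: it assumes the pole angle in the observation is expressed in radians, and it treats the length limit as reached once the step count equals 500. The function name `episode_over` is hypothetical.

```python
import math

ANGLE_LIMIT = math.radians(12)  # ±12 degrees, expressed in radians
POSITION_LIMIT = 2.4
MAX_STEPS = 500


def episode_over(observation, step_count):
    """Return True if any of the three end conditions is met."""
    cart_position, _cart_velocity, pole_angle, _pole_velocity = observation
    return (
        abs(pole_angle) > ANGLE_LIMIT
        or abs(cart_position) > POSITION_LIMIT
        or step_count >= MAX_STEPS
    )


print(episode_over([0.0, 0.0, 0.05, 0.0], 10))  # small tilt, cart centered
print(episode_over([0.0, 0.0, 0.25, 0.0], 10))  # pole tilted past the limit
print(episode_over([2.5, 0.0, 0.0, 0.0], 10))   # cart past the edge of the track
```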

## Downloading the Atari ROM Files

Note: If you see a warning above, you can safely ignore it; it is a relatively minor bug in OpenAI Gym.

Atari games, like Breakout, can use an observation space that is either equal to the size of the Atari screen (210x160) or even use the RAM of the Atari (128 bytes) to determine the state of the game. Yes, that's bytes, not kilobytes!

In [7]:
!wget http://www.atarimania.com/roms/Roms.rar
!unrar x -o+ /content/Roms.rar >/dev/null
!pip install atari_py
!python -m atari_py.import_roms /content/ROMS >/dev/null

--2023-05-29 13:35:49--  http://www.atarimania.com/roms/Roms.rar
Resolving www.atarimania.com (www.atarimania.com)... 195.154.81.199
Connecting to www.atarimania.com (www.atarimania.com)|195.154.81.199|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19612325 (19M) [application/x-rar-compressed]
Saving to: ‘Roms.rar’


2023-05-29 13:35:54 (3.63 MB/s) - ‘Roms.rar’ saved [19612325/19612325]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting atari_py
  Downloading atari-py-0.2.9.tar.gz (540 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: atari_py
  Building wheel for atari_py (setup.py) ... done
  Created wheel for atari_py: filename=atari_py-0.2.9-cp310-cp310-linux_x86_64.whl size=2856237 sha256=ca5179795a76c8e6a6ab1124a

## Two Versions of the Breakout Environment
<img src="https://gymnasium.farama.org/_images/breakout.gif" width="160"><br>
https://gymnasium.farama.org/environments/atari/breakout/

In [8]:
# Environment that observes the same 210 x 160, 3-channel color image a human player sees (each pixel value is 0 to 255)
query_environment("Breakout-v0")

Action Space: Discrete(4)
Observation Space: Box(210, 160, 3)
Max Episode Steps: 10000
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: None


In [9]:
# Environment that observes the 128 bytes of Atari console memory (each byte is 0 to 255)
query_environment("Breakout-ram-v0")

Action Space: Discrete(4)
Observation Space: Box(128,)
Max Episode Steps: 10000
Nondeterministic: False
Reward Range: (-inf, inf)
Reward Threshold: None
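A quick calculation shows how different the two observation spaces reported above are in size. The snippet only multiplies out the shapes printed by the two queries; the variable names are chosen for this example.

```python
screen_values = 210 * 160 * 3  # height x width x color channels, from Box(210, 160, 3)
ram_values = 128               # from Box(128,)

print(screen_values)               # number of values in one screen observation
print(screen_values / ram_values)  # how many times larger the screen observation is
```

The screen observation holds 100,800 values, which is 787.5 times as many as the RAM observation.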


## Render OpenAI Gym Environments from CoLab

Let's render it!

It is possible to visualize the game your agent is playing, even on CoLab. This section provides information on generating a video in CoLab that shows you an episode of the game your agent is playing. I based this video process on suggestions found [here](https://colab.research.google.com/drive/1flu31ulJlgiRL1dnN2ir8wGh9p7Zij2t).

Begin by installing **pyvirtualdisplay** and **python-opengl**.

In [10]:
# You do not need to understand this cell.
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

Next, we install the needed requirements to display an Atari game.

In [11]:
# You do not need to understand this cell.
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting setuptools
  Downloading setuptools-67.8.0-py3-none-any.whl (1.1 MB)
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 67.7.2
    Uninstalling setuptools-67.7.2:
      Successfully uninstalled setuptools-67.7.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
Successfully installed setuptools-67.8.0


Next, we define the functions used to show the video by adding it to the CoLab notebook.

In [12]:
# You do not need to understand this cell.
import gym
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of a gym environment
and to display the recording.
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
    mp4list = glob.glob('video/*.mp4')
    if len(mp4list) > 0:
        mp4 = mp4list[0]
        with open(mp4, 'rb') as f:  # read-only binary mode; the file is closed automatically
            video = f.read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

def wrap_env(env):
    env = Monitor(env, './video', force=True)    
    return env
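The embedding step inside `show_video` can be looked at in isolation. The sketch below uses only the standard `base64` module; `video_tag` is a hypothetical helper written for this example, and the bytes passed to it are a placeholder rather than a real MP4 file. It shows how raw bytes become a data URI that an HTML `<video>` element can play without a separate file.

```python
import base64


def video_tag(video_bytes):
    """Build an HTML <video> tag with the bytes embedded as a base64 data URI."""
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return (
        '<video controls style="height: 400px;">'
        f'<source src="data:video/mp4;base64,{encoded}" type="video/mp4" />'
        "</video>"
    )


html = video_tag(b"not a real video")  # placeholder bytes, for illustration only
print(html[:80])
```

Embedding the data this way is convenient in a notebook, at the cost of making the notebook larger by roughly a third of the video size.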

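### Rendering a Random Agent in Atlantis

In the game Atlantis, the player can take one of the following actions at each step.
0. Do nothing
1. Fire the center gun
2. Fire the right gun
3. Fire the left gun

We will render the play of an agent that takes a random action at every step.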

In [13]:
env = wrap_env(gym.make("Atlantis-v0"))
observation = env.reset()

while True:
    # At every step, do the following

    # Render the current frame
    env.render()

    # Choose an action
    action = env.action_space.sample()  # chosen at random

    # Apply the action; returns the new observation, reward, done flag, and extra info
    observation, reward, done, info = env.step(action)

    if done:
        break

env.close()
show_video()

## TODO: Build a Random Agent for Kung Fu Master
https://gymnasium.farama.org/environments/atari/kung_fu_master/<br>
Build an agent that takes a random action at every step and render its play.

In [14]:
env = wrap_env(gym.make("KungFuMaster-v4"))
observation = env.reset()

while True:
    # At every step, do the following

    # Render the current frame
    env.render()

    # Choose an action
    action = env.action_space.sample()  # chosen at random

    # Apply the action; returns the new observation, reward, done flag, and extra info
    observation, reward, done, info = env.step(action)

    if done:
        break

env.close()
show_video()

## TODO: Build Your Own CartPole Agent
https://gymnasium.farama.org/environments/classic_control/cart_pole/
<br>
Using simple hand-written rules, build an agent that survives for 500 steps.

If the episode ends before 500 steps, refer to your answers to the CartPole questions above and <br>improve the agent so that it survives longer.


In [15]:
env = wrap_env(gym.make("CartPole-v1"))
observation = env.reset()

i = 0
done = False
while not done:
    i += 1

    # Rule-based agent: start from the pole angle, then correct for angular velocity
    if observation[2] < 0:
        action = 0  # pole leans left, so push left
        if observation[3] > 1.7:
            action = 1
    else:
        action = 1  # pole leans right, so push right
        if observation[3] < -1.7:
            action = 0

    # Near the edges of the track, override the action to steer the cart back toward the center
    if observation[0] > 1.8:
        action = 0
        if observation[1] < -1.8:
            action = 1
    elif observation[0] < -1.8:
        action = 1
        if observation[1] > 1.8:
            action = 0

    # Apply the action; returns the new observation, reward, done flag, and extra info
    observation, reward, done, info = env.step(action)
    print(f"Step {i}: Previous action={action}, Observation={observation}, Reward={reward}")

    env.render()

    if done:
        break

env.close()
show_video()

Step 1: Previous action=0, Observation=[ 0.02873665 -0.16137117 -0.04078138  0.2611271 ], Reward=1.0
Step 2: Previous action=0, Observation=[ 0.02550922 -0.35588795 -0.03555884  0.54067329], Reward=1.0
Step 3: Previous action=0, Observation=[ 0.01839147 -0.55049251 -0.02474537  0.8219435 ], Reward=1.0
Step 4: Previous action=0, Observation=[ 0.00738162 -0.74526731 -0.0083065   1.10674185], Reward=1.0
Step 5: Previous action=0, Observation=[-0.00752373 -0.94027908  0.01382834  1.39680733], Reward=1.0
Step 6: Previous action=1, Observation=[-0.02632931 -0.74533183  0.04176448  1.10847981], Reward=1.0
Step 7: Previous action=1, Observation=[-0.04123595 -0.55078291  0.06393408  0.8291861 ], Reward=1.0
Step 8: Previous action=1, Observation=[-0.05225161 -0.35659051  0.0805178   0.55727645], Reward=1.0
Step 9: Previous action=1, Observation=[-0.05938342 -0.16268574  0.09166333  0.29100827], Reward=1.0
Step 10: Previous action=1, Observation=[-0.06263713  0.03101779  0.0974835   0.02858407], 
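The rules in the cell above can also be read as a single function of named observation fields, which makes them easier to test against the printed log. This is a restatement of the same thresholds, not a different policy; the function name `choose_action` is chosen for this example.

```python
def choose_action(observation):
    """Rule-based CartPole policy using the same thresholds as the cell above."""
    cart_position, cart_velocity, pole_angle, pole_velocity = observation

    # Push toward the side the pole is leaning, unless it is already swinging back fast
    if pole_angle < 0:
        action = 1 if pole_velocity > 1.7 else 0
    else:
        action = 0 if pole_velocity < -1.7 else 1

    # Near the track edges, steer the cart back toward the center
    if cart_position > 1.8:
        action = 1 if cart_velocity < -1.8 else 0
    elif cart_position < -1.8:
        action = 0 if cart_velocity > 1.8 else 1

    return action


# Observations taken from steps 1 and 5 of the log above
print(choose_action([0.02873665, -0.16137117, -0.04078138, 0.2611271]))
print(choose_action([-0.00752373, -0.94027908, 0.01382834, 1.39680733]))
```

The two results match the actions logged at steps 2 and 6, which were chosen from those observations.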

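A person can write this code using the meaning of each element of the Observation and the laws of physics. With machine learning, however, the computer starts with no background knowledge and learns how each element of the Observation changes in response to each Action.

In typical Atari games, a deep learning model analyzes the image and selects a desirable action for the current state.<br>
Good work.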