## MDP

markov decision procsses

목표: MDP를 소개하고, 어떻게 파이썬에서 코딩하는지 알아보는 것

MDP모델은 외부 환경과 연속적인 상호작용을 해야 함 -> 다음의 것들이 필요
- 상태 space
- action에 대한 집합
- 전이 함수: t에서의 s,a가 주어질 때, t+1에서의 s'을 확률로 표현하는 것 (여기서는 물리학 법칙에 따라 결정)
- reward함수: t에서 결정되는 reward(upright = 1, fallen over = 0)
- 할인율: $γ$ (=1)

$argmax_{\pi} \sum^{T}_{t=1} \gamma R_t(\pi)$ 를 구해야 함.

In [2]:
import gym
import numpy as np

In [3]:
env = gym.make("CartPole-v0")
print(env)

<TimeLimit<CartPoleEnv<CartPole-v0>>>


In [4]:
state = env.reset()
print(state)

[-0.00022455 -0.00535896  0.03023214 -0.03310407]


In [5]:
# action을 통해 다음 단계로 진행하기
action = 0 
state, reward, done, info = env.step(action)

print(f"state: {state}")  # action을 통해 얻는 새로운 상태
print(f"reward: {reward}") # 받은 보상
print(f"done: {done}") # 끝났는지 여부
print(f"info: {info}") # 추가 정보들

state: [-0.00033173 -0.20090112  0.02957006  0.26896203]
reward: 1.0
done: False
info: {}


In [6]:
# 무작위 작업을 수행하고 리워드 합계를 반환
def random_rollout(env):
    # initialize
    state = env.reset()
    done = False
    cumulative_reward = 0

    while not done:
        # 랜덤으로 행동을 선택
        action = np.random.choice([0, 1])
        # 환경과 상호작용하여 정보 획득
        state, reward, done, info = env.step(action)
        # 리워드 더하기
        cumulative_reward += reward

    return cumulative_reward

reward = random_rollout(env)
print(reward)
reward = random_rollout(env)
print(reward)

23.0
18.0


In [7]:
# env, policy를 받아서 랜덤한 방식 말고, 정책에 따라 행동을 결정하도록 코딩하기
def rollout_policy(env, policy):
    state = env.reset()
    done = False
    cumulative_reward = 0

    # ===== your code =====
    while not done:
        action = sample_policy(state)
        state, reward, done, info = env.step(action)
        cumulative_reward += reward
    # ===== your code =====

    return cumulative_reward

def sample_policy(state):
    if state[0] < 0:
        return 0
    else:
        return 1

In [8]:
reward = np.mean([rollout_policy(env, sample_policy) for _ in range(100)])
print(reward)

9.35


## PPO(proximal policy optimization)

1. 병렬적으로 많은 rollout들을 수행 
2. 합친 후 SGD를 수행 (목적함수를 최대화하는 정책 찾기)
3. 다시 rollout들에게 새로운 정책 가중치를 부여

In [9]:
import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [10]:
ray.init(num_cpus=3, log_to_driver=False)

E0219 04:09:07.638053276    4611 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0219 04:09:07.673935961    4611 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0219 04:09:07.701426697    4611 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies


{'node_ip_address': '10.182.0.2',
 'raylet_ip_address': '10.182.0.2',
 'redis_address': '10.182.0.2:6379',
 'object_store_address': '/tmp/ray/session_2022-02-19_04-09-06_484024_4611/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2022-02-19_04-09-06_484024_4611/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2022-02-19_04-09-06_484024_4611',
 'metrics_export_port': 63406,
 'gcs_address': '10.182.0.2:43235',
 'node_id': '867f023c67c1ddcd664e88740bb11b6186b4191b6a3562c736ab004c'}

In [11]:
config = DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["num_sgd_iter"] = 5
config["sgd_minibatch_size"] = 256
# 히든 레이어의 사이즈 조정
config["model"]["fcnet_hiddens"] = [100, 100]
# 셀이 재실행 될 때 리소스 부족을 막을 수 있도록 함
config['num_cpus_per_worker'] = 0

agent = PPOTrainer(config, "CartPole-v0")

2022-02-19 04:09:09,103	INFO trainer.py:2054 -- Your framework setting is 'tf', meaning you are using static-graph mode. Set framework='tf2' to enable eager execution with tf2.x. You may also then want to set eager_tracing=True in order to reach similar execution speed as with static-graph mode.
2022-02-19 04:09:09,105	INFO ppo.py:249 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
2022-02-19 04:09:09,106	INFO trainer.py:790 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


In [12]:
# episode len mean 값을 보기
# 최대는 200. for문을 30번정도 하면 나올듯?
for i in range(2):
    result = agent.train()
    print(pretty_print(result))



agent_timesteps_total: 4000
custom_metrics: {}
date: 2022-02-19_04-09-19
done: false
episode_len_mean: 21.315508021390375
episode_media: {}
episode_reward_max: 59.0
episode_reward_mean: 21.315508021390375
episode_reward_min: 9.0
episodes_this_iter: 187
episodes_total: 187
experiment_id: 6887ed9c42794e0da2c54b82105b0716
hostname: instance-1
info:
  learner:
    default_policy:
      custom_metrics: {}
      learner_stats:
        cur_kl_coeff: 0.20000000298023224
        cur_lr: 4.999999873689376e-05
        entropy: 0.6914097666740417
        entropy_coeff: 0.0
        kl: 0.0018208251567557454
        model: {}
        policy_loss: -0.015460265800356865
        total_loss: 214.84207153320312
        vf_explained_var: -8.742014870222192e-06
        vf_loss: 214.85714721679688
  num_agent_steps_sampled: 4000
  num_agent_steps_trained: 4000
  num_steps_sampled: 4000
  num_steps_trained: 4000
  num_steps_trained_this_iter: 4000
iterations_since_restore: 1
node_ip: 10.182.0.2
num_healthy_w

In [13]:
checkpoint_path = agent.save()
print(checkpoint_path)



/home/dojm.ex5/ray_results/PPOTrainer_CartPole-v0_2022-02-19_04-09-09vfljfcxx/checkpoint_000002/checkpoint-2


In [14]:
trained_config = config.copy()

test_agent = PPOTrainer(trained_config, "CartPole-v0")
test_agent.restore(checkpoint_path)

2022-02-19 04:09:29,385	INFO trainable.py:472 -- Restored on 10.182.0.2 from checkpoint: /home/dojm.ex5/ray_results/PPOTrainer_CartPole-v0_2022-02-19_04-09-09vfljfcxx/checkpoint_000002/checkpoint-2
2022-02-19 04:09:29,387	INFO trainable.py:480 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': 8000, '_time_total': 8.599364757537842, '_episodes_total': 342}


In [15]:
# 불러온 데이터가 잘 작동하는지 테스트하기
env = gym.make("CartPole-v0")
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, info = env.step(action)
    cumulative_reward += reward
    
print(cumulative_reward)



86.0


In [16]:
# !tensorboard --logdir=~/ray_Results --host=0.0.0.0

## custom environment & reward shaping

아래 두 가지에 포커스

1. 어떻게 MDP추상화를 만들 수 있을지
2. 너의 에이전트를 더 효율적으로 만들기 위해 커스텀 환경에 따른 보상을 어떻게 설정할지

In [36]:
import gym
from gym import spaces
import numpy as np

# git clone을 해서 할 경우
# !git clone https://github.com/ray-project/tutorial
# from tutorial.rllib_exercises import test_exercises

# 그냥 따로 저장한 경우
import test_exercises

import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

#### Different Spaces

action, observation space의 dimension을 정하는 것은 RL을 공식화 할 때 가장 먼저 해야할 일 (-> gym에서 이러한 것들을 제공)

In [37]:
discrete = spaces.Discrete(10)
print([discrete.sample() for i in range(4)])

[5, 9, 1, 9]


In [38]:
action_space_map = {
    "discrete_10": spaces.Discrete(10),
    "box_1": spaces.Box(0, 1, shape=(1,)),
    "box_3x1": spaces.Box(-2, 2, shape=(3, 1))
}

action_space_jumble = {
    "discrete_10": 2,
    "box_1": [0.1],
    "box_3x1": [[0], [0], [0]]
}


for space_id, state in action_space_jumble.items():
    print(action_space_map[space_id].contains(state))
    print((space_id, state))
    assert action_space_map[space_id].contains(state), (
        "Looks like {} to {} is matched incorrectly.".format(space_id, state))
    
print("Success!")

True
('discrete_10', 2)
True
('box_1', [0.1])
True
('box_3x1', [[0], [0], [0]])
Success!


  logger.warn("Casting input x to numpy array.")


#### setting custom env with rewards

n-chain 환경 에서 다음 두가지로 동작
1. forward: chain따라 움직이지만 리워드는 0
2. backword: 처음으로 돌아가는데 작은 리워드 반환
    
-> 그런데 체인의 마지막에서는 큰 리워드 발생(지속적으로 작은 보상 대신 forward를 선택해야 함)

목적: 이러한 환경 구성하기
    
1. ChainEnv._setup_spaces
    - observation space: 0 ~ n-1
    - action space: 0 or 1
2. reward function
    - action == 1일 때 return self.small_reward
    - action == 0이고 self.state < self.n - 1일 때 return 0
    - action == 0이고 self.state == self.n-1일 때 return self.large_reward 

In [39]:
class ChainEnv(gym.Env):
    def __init__(self, env_config={}):
        env_config = env_config
        self.n = env_config.get("n", 20)
        self.small_reward = env_config.get("small", 2)
        self.large_reward = env_config.get("large", 10)
        self.state = 0
        self._horizon = self.n
        self._counter = 0
        self._setup_spaces()
        
    def _setup_spaces(self):
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Discrete(self.n)
        
    def step(self, action):
        assert self.action_space.contains(action)
        
        if action == 1:
            reward = self.small_reward
            self.state = 0
        elif self.state < self.n - 1:
            reward = 0
            self.state += 1
        else:
            reward = self.large_reward
            
        self._counter += 1
        done = self._counter >= self._horizon
        return self.state, reward, done, {}
    
    def reset(self):
        self.state = 0
        self._counter = 0
        return self.state
    
test_exercises.test_chain_env_spaces(ChainEnv)
test_exercises.test_chain_env_reward(ChainEnv)

Testing if spaces have been setup correctly...
Success! You've setup the spaces correctly.
Testing if reward has been setup correctly...
Success! You've setup the rewards correctly.


In [40]:
trainer_config = DEFAULT_CONFIG.copy()
trainer_config["num_workers"] = 3
trainer_config["train_batch_size"] = 400
trainer_config["sgd_minibatch_size"] = 64
trainer_config["num_sgd_iter"] = 10

trainer = PPOTrainer(trainer_config, ChainEnv)
for i in range(20):
    print(f"Training iteration {i}")
    trainer.train()

E0219 11:58:17.608685267   11687 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0219 11:58:17.640864842   11687 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
E0219 11:58:17.663634791   11687 fork_posix.cc:70]           Fork support is only compatible with the epoll1 and poll polling strategies
2022-02-19 11:58:26,699	INFO trainable.py:125 -- Trainable.setup took 10.248 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.


Training iteration 0




Training iteration 1
Training iteration 2
Training iteration 3
Training iteration 4
Training iteration 5
Training iteration 6
Training iteration 7
Training iteration 8
Training iteration 9
Training iteration 10
Training iteration 11
Training iteration 12
Training iteration 13
Training iteration 14
Training iteration 15
Training iteration 16
Training iteration 17
Training iteration 18
Training iteration 19


In [54]:
env = ChainEnv()
state = env.reset()
done = False
max_state = -1
cumulative_reward = 0

while not done:
    action = trainer.compute_action(state)
    state, reward, done, info = env.step(action)
    max_state = max(max_state, state)
    cumulative_reward = reward
    
print(cumulative_reward)
print(max_state, env.n)

2
1 20


#### shaping reward

위에서 마지막까지 가면 보상이 많아도, forward를 택하는 경우가 없음 -> 애초에 마지막까지 못가니까...?

따라서 ShapedChainEnv.step을 수정하여 적절한 보상을 주어야 함

In [56]:
class ShapedChainEnv(ChainEnv):
    def step(self, action):
        assert self.action_space.contains(action)
        if action == 1:
            reward = -1
            self.state = 0
        elif self.state < self.n - 1:
            reward = -1
            self.state += 1
        else:
            reward = -1
        
        self._counter += 1
        done = self._counter >= self._horizon
        
        return self.state, reward, done, {}
    
test_exercises.test_chain_env_behavior(ShapedChainEnv)

Testing if behavior has been changed...
Success! Behavior of environment is correct.


In [None]:
trainer = PPOTrainer(trainer_config, ShapedChainEnv)

for i in range(20):
    print(f"training iteration: {i}")
    trainer.train()

In [62]:
env = ShapedChainEnv()
max_states = []

for i in range(5):
    state = env.reset()
    done = False
    max_state = -1
    
    while not done:
        action = trainer.compute_action(state)
        state, reward, done, info = env.step(action)
        max_state = max(max_state, state)
        cumulative_reward += reward
        
    max_states.append(max_state)

print("Cumulative reward you've received is: {}!".format(cumulative_reward))
print("Max state you've visited is: {}. This is out of {} states.".format(np.mean(max_states), env.n))

Cumulative reward you've received is: -498!
Max state you've visited is: 4.2. This is out of 20 states.


In [None]:
## Online learning with DQN