### - Solving FrozenLake with the cross-entropy method
- Increase the number of episodes per batch
    - CartPole used 16
    - FrozenLake needs at least 100
    
    
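- Apply a discount factor to the reward
    - Every successful FrozenLake episode earns the same reward (1.0), so a discount factor is used to make the total reward depend on episode length and to add variety among episodes
    - 0.9 or 0.95
    - Shorter episodes end up with a higher reward than longer ones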
    
    
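- Keep elite episodes for longer
    - CartPole: sample episodes from the environment, train on the best ones, then discard them
    - FrozenLake: successful episodes are much rarer, so they are reused for training over several iterations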
    
    
- Lower the learning rate
    - Gives the network time to average over more training samples
    
    
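- Train for longer
    - Successful episodes are rare and action outcomes are random, so the network has a hard time working out the best action in a given situation
    - Reaching 50% successful episodes takes about 5,000 training iterations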
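
To make the discounting point concrete, here is a minimal sketch assuming a reward of 1.0 for reaching the goal and a discount factor of 0.9. `discounted_return` is a hypothetical helper name used only for illustration. It applies the same `reward * GAMMA ** length` rule as the notebook code.

```python
GAMMA = 0.9  # discount factor, same value as in the notes above

def discounted_return(total_reward, n_steps, gamma=GAMMA):
    """Scale an episode's total reward by gamma ** episode_length."""
    return total_reward * (gamma ** n_steps)

# Three successful episodes of different lengths: the shortest scores highest.
for n_steps in (8, 10, 20):
    print(n_steps, round(discounted_return(1.0, n_steps), 3))
```

A successful 8-step episode scores about 0.430 and a 10-step one about 0.349, which is why values like these appear as `reward_bound` in the training output.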

#### - Code changes
- Modify the filter_batch() function
    - Compute discounted rewards
    - Keep the elite episodes
    
    
- Training loop
    - Store the current elite episodes and pass them back to filter_batch() on the next training iteration
    
    
- Learning rate reduced by a factor of 10 (lr=0.001)


- BATCH_SIZE = 100 
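
The elite carry-over described above can be sketched without any RL libraries. This is an illustration under assumptions, not the notebook's code: the simplified `Episode` tuple, `keep_elite`, `MAX_ELITE` and the fixed bound of 0.0 are hypothetical stand-ins. The notebook derives the bound with `np.percentile` instead.

```python
from collections import namedtuple

# Simplified episode record: total reward and episode length only.
Episode = namedtuple("Episode", ["reward", "n_steps"])
GAMMA = 0.9
MAX_ELITE = 500  # cap on retained elites, mirroring full_batch[-500:]

def keep_elite(episodes, bound):
    """Return episodes whose discounted reward is strictly above bound."""
    return [e for e in episodes if e.reward * GAMMA ** e.n_steps > bound]

elite = []
fresh_batches = [
    [Episode(0.0, 12), Episode(1.0, 8), Episode(0.0, 5)],
    [Episode(0.0, 7), Episode(0.0, 9), Episode(1.0, 10)],
]
for batch in fresh_batches:
    # Previously kept elites compete with the new batch on every iteration.
    elite = keep_elite(elite + batch, bound=0.0)[-MAX_ELITE:]
    print(len(elite))
```

The successful episode from the first batch is still in the training set after the second iteration, which is the point of keeping elites around when successes are rare.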


In [1]:
import random
import gym
import gym.spaces
from collections import namedtuple
import numpy as np
from tensorboardX import SummaryWriter
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
HIDDEN_SIZE = 128
BATCH_SIZE = 100 
PERCENTILE = 30
GAMMA = 0.9  #discount factor

In [3]:
class DiscreteOneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(DiscreteOneHotWrapper, self).__init__(env)
        assert isinstance(env.observation_space, gym.spaces.Discrete)
        
        self.observation_space = gym.spaces.Box(0.0, 1.0, (env.observation_space.n, ), dtype=np.float32)
        
    
    def observation(self, observation):
        res = np.copy(self.observation_space.low)
        res[observation] = 1.0
        
        return res
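
As a dependency-free illustration of what the wrapper produces, the sketch below one-hot encodes a discrete state index. `one_hot` is a hypothetical helper, and the size of 16 assumes the default 4x4 FrozenLake map.

```python
def one_hot(index, size):
    """Return a list of floats with 1.0 at `index` and 0.0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# State 5 on a 4x4 grid (16 states) becomes a 16-element vector.
print(one_hot(5, 16))
```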

In [4]:
class Net(nn.Module):
    def __init__(self, obs_size, hidden_size, n_actions):
        super(Net, self).__init__()
        
        self.net = nn.Sequential(nn.Linear(obs_size, hidden_size), 
                                 nn.ReLU(),
                                 nn.Linear(hidden_size, n_actions))
        
    def forward(self, x):
        return self.net(x)

In [5]:
Episode = namedtuple('Episode', field_names=['reward', 'steps'])
EpisodeStep = namedtuple('EpisodeStep', field_names=['observation', 'action'])

In [6]:
def iterate_batches(env, net, batch_size):
    batch = []
    episode_reward = 0.0
    episode_steps = []
    obs = env.reset()
    
    sm = nn.Softmax(dim=1)
    
    while True :
        obs_v = torch.FloatTensor([obs])
        act_probs_v = sm(net(obs_v))
        act_probs = act_probs_v.data.numpy()[0]
        action = np.random.choice(len(act_probs), p=act_probs)
        
        next_obs, reward, is_done, _ = env.step(action)
        
        #env.render()
        
        episode_reward += reward 
        episode_steps.append(EpisodeStep(observation=obs, action=action))
        
        
        if is_done :
            batch.append(Episode(reward=episode_reward, steps=episode_steps))
            
            episode_reward = 0.0
            episode_steps = []
            next_obs = env.reset()
            
            if len(batch) == batch_size :
                yield batch
                batch = []
                
        obs = next_obs
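
The action-selection step in `iterate_batches` turns the network's raw scores into probabilities with softmax and then samples from them. A minimal plain-Python sketch of that step follows. `softmax` and `sample_action` are hypothetical helpers, and `random.choices` stands in for `np.random.choice`.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng=random):
    """Draw an action index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # equal scores give a uniform policy
print(softmax(logits))
print(sample_action(logits, rng))
```

Sampling rather than taking the argmax is what provides exploration here: an untrained network picks each of the four actions about equally often.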

In [7]:
def filter_batch(batch, percentile):
    """Keep episodes whose discounted reward is above the percentile bound."""
    # Discount each episode's total reward by its length: shorter successes score higher.
    disc_rewards = [s.reward * (GAMMA ** len(s.steps)) for s in batch]
    
    reward_bound = np.percentile(disc_rewards, percentile)
    
    train_obs = []
    train_act = []
    elite_batch = []
    
    for example, discounted_reward in zip(batch, disc_rewards):
        # Strict comparison: with a bound of 0.0, only successful episodes pass.
        if discounted_reward > reward_bound:
            train_obs.extend(step.observation for step in example.steps)
            train_act.extend(step.action for step in example.steps)
            elite_batch.append(example)
            
    return elite_batch, train_obs, train_act, reward_bound
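
The training output often shows `reward_bound=0.000`. The sketch below suggests why, using a plain-Python percentile with linear interpolation (the rule NumPy uses by default). `percentile` is a hypothetical re-implementation for illustration only, and the reward list is made-up example data.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

# Eight failed episodes and two successes (8 and 10 steps long, GAMMA = 0.9).
disc_rewards = [0.0] * 8 + [0.9 ** 10, 0.9 ** 8]
bound = percentile(disc_rewards, 30)
elite = [r for r in disc_rewards if r > bound]
print(bound, len(elite))
```

While failures make up well over 30% of the combined set, the 30th percentile stays at 0.0, and the strict `>` comparison keeps exactly the successful episodes.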

In [8]:
if __name__ == "__main__":
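    random.seed(12345)
    # Actions are sampled with np.random.choice, so NumPy's generator is seeded as well
    np.random.seed(12345)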
    
    env = DiscreteOneHotWrapper(gym.make("FrozenLake-v0"))
    #env = gym.wrappers.Monitor(env, directory="mon", force=True)
    
    obs_size = env.observation_space.shape[0]
    n_actions = env.action_space.n 
    
    net = Net(obs_size, HIDDEN_SIZE, n_actions)
    
    objective = nn.CrossEntropyLoss()
    
    optimizer = optim.Adam(params=net.parameters(), lr=0.001)
    
    writer = SummaryWriter(comment="-frozenlake-tweaked")
    
    
    # Elite episodes carried over from previous iterations
    full_batch = []
    
    for iter_no, batch in enumerate(iterate_batches(env, net, BATCH_SIZE)):
        reward_mean = float(np.mean([s.reward for s in batch]))
        
        # Filter the new batch together with the retained elites
        full_batch, obs, acts, reward_bound = filter_batch(full_batch + batch, PERCENTILE)
        
        if not full_batch :
            continue
            
        obs_v = torch.FloatTensor(obs)
        acts_v = torch.LongTensor(acts)
        
        # Cap the retained elite set at the 500 most recent episodes
        full_batch = full_batch[-500:]
        
        optimizer.zero_grad()
        
        action_scores_v = net(obs_v)
        loss_v = objective(action_scores_v, acts_v)
        loss_v.backward()
        
        optimizer.step()
        
        print("%d: loss=%.3f, reward_mean=%.3f, reward_bound=%.3f, batch=%d" % (
               iter_no, loss_v.item(), reward_mean, reward_bound, len(full_batch)))
        writer.add_scalar("loss", loss_v.item(), iter_no)
        writer.add_scalar("reward_mean", reward_mean, iter_no)
        writer.add_scalar("reward_bound", reward_bound, iter_no)
        
        if reward_mean > 0.8:
            print("Solved!")
            break
            
    writer.close()

0: loss=1.386, reward_mean=0.030, reward_bound=0.000, batch=3
1: loss=1.378, reward_mean=0.010, reward_bound=0.000, batch=4
2: loss=1.375, reward_mean=0.000, reward_bound=0.000, batch=4
3: loss=1.378, reward_mean=0.030, reward_bound=0.000, batch=7
4: loss=1.377, reward_mean=0.020, reward_bound=0.000, batch=9
5: loss=1.376, reward_mean=0.000, reward_bound=0.000, batch=9
6: loss=1.374, reward_mean=0.020, reward_bound=0.000, batch=11
7: loss=1.374, reward_mean=0.010, reward_bound=0.000, batch=12
8: loss=1.372, reward_mean=0.000, reward_bound=0.000, batch=12
9: loss=1.372, reward_mean=0.010, reward_bound=0.000, batch=13
10: loss=1.374, reward_mean=0.020, reward_bound=0.000, batch=15
11: loss=1.366, reward_mean=0.020, reward_bound=0.000, batch=17
12: loss=1.365, reward_mean=0.020, reward_bound=0.000, batch=19
13: loss=1.369, reward_mean=0.030, reward_bound=0.000, batch=22
14: loss=1.369, reward_mean=0.020, reward_bound=0.000, batch=24
15: loss=1.363, reward_mean=0.020, reward_bound=0.000, b

127: loss=1.222, reward_mean=0.030, reward_bound=0.000, batch=222
128: loss=1.222, reward_mean=0.020, reward_bound=0.000, batch=224
129: loss=1.222, reward_mean=0.030, reward_bound=0.042, batch=227
130: loss=1.223, reward_mean=0.030, reward_bound=0.129, batch=229
131: loss=1.221, reward_mean=0.030, reward_bound=0.157, batch=230
132: loss=1.222, reward_mean=0.040, reward_bound=0.167, batch=230
133: loss=1.222, reward_mean=0.040, reward_bound=0.229, batch=230
134: loss=1.210, reward_mean=0.100, reward_bound=0.254, batch=216
135: loss=1.208, reward_mean=0.070, reward_bound=0.207, batch=221
136: loss=1.207, reward_mean=0.050, reward_bound=0.098, batch=224
137: loss=1.206, reward_mean=0.060, reward_bound=0.226, batch=227
138: loss=1.206, reward_mean=0.050, reward_bound=0.254, batch=225
139: loss=1.202, reward_mean=0.040, reward_bound=0.282, batch=191
140: loss=1.202, reward_mean=0.040, reward_bound=0.000, batch=195
141: loss=1.201, reward_mean=0.040, reward_bound=0.000, batch=199
142: loss=

252: loss=1.074, reward_mean=0.070, reward_bound=0.314, batch=227
253: loss=1.075, reward_mean=0.040, reward_bound=0.309, batch=229
254: loss=1.074, reward_mean=0.040, reward_bound=0.221, batch=230
255: loss=1.074, reward_mean=0.070, reward_bound=0.349, batch=230
256: loss=1.076, reward_mean=0.070, reward_bound=0.387, batch=226
257: loss=1.074, reward_mean=0.040, reward_bound=0.335, batch=228
258: loss=1.073, reward_mean=0.050, reward_bound=0.293, batch=229
259: loss=1.072, reward_mean=0.130, reward_bound=0.405, batch=230
260: loss=1.073, reward_mean=0.040, reward_bound=0.406, batch=231
261: loss=1.059, reward_mean=0.080, reward_bound=0.430, batch=137
262: loss=1.065, reward_mean=0.070, reward_bound=0.000, batch=144
263: loss=1.065, reward_mean=0.050, reward_bound=0.000, batch=149
264: loss=1.072, reward_mean=0.070, reward_bound=0.000, batch=156
265: loss=1.075, reward_mean=0.060, reward_bound=0.000, batch=162
266: loss=1.074, reward_mean=0.040, reward_bound=0.000, batch=166
267: loss=

377: loss=1.021, reward_mean=0.090, reward_bound=0.380, batch=229
378: loss=1.022, reward_mean=0.060, reward_bound=0.387, batch=227
379: loss=1.022, reward_mean=0.040, reward_bound=0.342, batch=229
380: loss=1.025, reward_mean=0.060, reward_bound=0.364, batch=230
381: loss=1.024, reward_mean=0.050, reward_bound=0.365, batch=231
382: loss=1.025, reward_mean=0.120, reward_bound=0.387, batch=230
383: loss=1.025, reward_mean=0.080, reward_bound=0.430, batch=221
384: loss=1.025, reward_mean=0.070, reward_bound=0.282, batch=223
385: loss=1.024, reward_mean=0.060, reward_bound=0.290, batch=226
386: loss=1.026, reward_mean=0.070, reward_bound=0.331, batch=228
387: loss=1.023, reward_mean=0.090, reward_bound=0.349, batch=227
388: loss=1.025, reward_mean=0.050, reward_bound=0.206, batch=228
389: loss=1.025, reward_mean=0.060, reward_bound=0.387, batch=228
390: loss=1.025, reward_mean=0.060, reward_bound=0.430, batch=225
391: loss=1.030, reward_mean=0.040, reward_bound=0.175, batch=227
392: loss=

502: loss=0.952, reward_mean=0.040, reward_bound=0.282, batch=228
503: loss=0.953, reward_mean=0.040, reward_bound=0.349, batch=221
504: loss=0.950, reward_mean=0.060, reward_bound=0.135, batch=224
505: loss=0.948, reward_mean=0.090, reward_bound=0.252, batch=227
506: loss=0.947, reward_mean=0.060, reward_bound=0.282, batch=228
507: loss=0.946, reward_mean=0.060, reward_bound=0.314, batch=228
508: loss=0.950, reward_mean=0.050, reward_bound=0.353, batch=229
509: loss=0.951, reward_mean=0.060, reward_bound=0.387, batch=213
510: loss=0.957, reward_mean=0.080, reward_bound=0.160, batch=219
511: loss=0.953, reward_mean=0.040, reward_bound=0.067, batch=223
512: loss=0.949, reward_mean=0.110, reward_bound=0.282, batch=225
513: loss=0.950, reward_mean=0.060, reward_bound=0.314, batch=226
514: loss=0.945, reward_mean=0.110, reward_bound=0.349, batch=227
515: loss=0.945, reward_mean=0.080, reward_bound=0.387, batch=224
516: loss=0.942, reward_mean=0.040, reward_bound=0.241, batch=227
517: loss=

627: loss=0.941, reward_mean=0.140, reward_bound=0.387, batch=221
628: loss=0.940, reward_mean=0.080, reward_bound=0.314, batch=224
629: loss=0.942, reward_mean=0.070, reward_bound=0.349, batch=226
630: loss=0.942, reward_mean=0.080, reward_bound=0.368, batch=228
631: loss=0.941, reward_mean=0.040, reward_bound=0.237, batch=229
632: loss=0.942, reward_mean=0.040, reward_bound=0.314, batch=229
633: loss=0.942, reward_mean=0.070, reward_bound=0.364, batch=230
634: loss=0.939, reward_mean=0.080, reward_bound=0.387, batch=229
635: loss=0.939, reward_mean=0.060, reward_bound=0.225, batch=230
636: loss=0.937, reward_mean=0.060, reward_bound=0.338, batch=231
637: loss=0.939, reward_mean=0.080, reward_bound=0.387, batch=231
638: loss=0.939, reward_mean=0.060, reward_bound=0.387, batch=231
639: loss=0.936, reward_mean=0.100, reward_bound=0.430, batch=215
640: loss=0.937, reward_mean=0.060, reward_bound=0.216, batch=220
641: loss=0.935, reward_mean=0.090, reward_bound=0.282, batch=223
642: loss=

753: loss=0.931, reward_mean=0.100, reward_bound=0.000, batch=121
754: loss=0.932, reward_mean=0.060, reward_bound=0.000, batch=127
755: loss=0.937, reward_mean=0.050, reward_bound=0.000, batch=132
756: loss=0.937, reward_mean=0.100, reward_bound=0.000, batch=142
757: loss=0.932, reward_mean=0.120, reward_bound=0.000, batch=154
758: loss=0.931, reward_mean=0.060, reward_bound=0.000, batch=160
759: loss=0.928, reward_mean=0.130, reward_bound=0.000, batch=173
760: loss=0.927, reward_mean=0.060, reward_bound=0.000, batch=179
761: loss=0.924, reward_mean=0.070, reward_bound=0.000, batch=186
762: loss=0.924, reward_mean=0.090, reward_bound=0.000, batch=195
763: loss=0.928, reward_mean=0.090, reward_bound=0.000, batch=204
764: loss=0.926, reward_mean=0.110, reward_bound=0.058, batch=213
765: loss=0.925, reward_mean=0.110, reward_bound=0.072, batch=218
766: loss=0.928, reward_mean=0.060, reward_bound=0.080, batch=221
767: loss=0.931, reward_mean=0.110, reward_bound=0.098, batch=224
768: loss=

878: loss=0.842, reward_mean=0.090, reward_bound=0.000, batch=210
879: loss=0.851, reward_mean=0.090, reward_bound=0.041, batch=217
880: loss=0.856, reward_mean=0.090, reward_bound=0.105, batch=222
881: loss=0.850, reward_mean=0.070, reward_bound=0.122, batch=221
882: loss=0.850, reward_mean=0.070, reward_bound=0.135, batch=224
883: loss=0.856, reward_mean=0.110, reward_bound=0.150, batch=226
884: loss=0.855, reward_mean=0.150, reward_bound=0.206, batch=221
885: loss=0.854, reward_mean=0.170, reward_bound=0.229, batch=222
886: loss=0.852, reward_mean=0.090, reward_bound=0.220, batch=225
887: loss=0.853, reward_mean=0.130, reward_bound=0.254, batch=219
888: loss=0.853, reward_mean=0.080, reward_bound=0.265, batch=223
889: loss=0.856, reward_mean=0.120, reward_bound=0.282, batch=211
890: loss=0.852, reward_mean=0.070, reward_bound=0.080, batch=217
891: loss=0.856, reward_mean=0.150, reward_bound=0.277, batch=222
892: loss=0.853, reward_mean=0.100, reward_bound=0.282, batch=223
893: loss=

1003: loss=0.800, reward_mean=0.070, reward_bound=0.011, batch=219
1004: loss=0.802, reward_mean=0.070, reward_bound=0.038, batch=223
1005: loss=0.811, reward_mean=0.130, reward_bound=0.058, batch=225
1006: loss=0.817, reward_mean=0.090, reward_bound=0.080, batch=226
1007: loss=0.821, reward_mean=0.090, reward_bound=0.104, batch=228
1008: loss=0.832, reward_mean=0.120, reward_bound=0.150, batch=223
1009: loss=0.829, reward_mean=0.110, reward_bound=0.167, batch=222
1010: loss=0.826, reward_mean=0.150, reward_bound=0.185, batch=223
1011: loss=0.825, reward_mean=0.030, reward_bound=0.010, batch=226
1012: loss=0.825, reward_mean=0.030, reward_bound=0.057, batch=228
1013: loss=0.830, reward_mean=0.110, reward_bound=0.206, batch=227
1014: loss=0.834, reward_mean=0.140, reward_bound=0.229, batch=217
1015: loss=0.836, reward_mean=0.140, reward_bound=0.254, batch=212
1016: loss=0.837, reward_mean=0.120, reward_bound=0.179, batch=218
1017: loss=0.837, reward_mean=0.140, reward_bound=0.282, batch

1126: loss=0.804, reward_mean=0.110, reward_bound=0.387, batch=223
1127: loss=0.802, reward_mean=0.100, reward_bound=0.314, batch=225
1128: loss=0.804, reward_mean=0.150, reward_bound=0.430, batch=223
1129: loss=0.800, reward_mean=0.130, reward_bound=0.282, batch=225
1130: loss=0.797, reward_mean=0.100, reward_bound=0.289, batch=227
1131: loss=0.798, reward_mean=0.060, reward_bound=0.308, batch=229
1132: loss=0.802, reward_mean=0.130, reward_bound=0.349, batch=229
1133: loss=0.802, reward_mean=0.070, reward_bound=0.405, batch=230
1134: loss=0.802, reward_mean=0.110, reward_bound=0.430, batch=227
1135: loss=0.802, reward_mean=0.080, reward_bound=0.387, batch=228
1136: loss=0.802, reward_mean=0.100, reward_bound=0.234, batch=229
1137: loss=0.800, reward_mean=0.070, reward_bound=0.309, batch=230
1138: loss=0.799, reward_mean=0.090, reward_bound=0.376, batch=231
1139: loss=0.800, reward_mean=0.110, reward_bound=0.430, batch=230
1140: loss=0.799, reward_mean=0.070, reward_bound=0.304, batch

1249: loss=0.807, reward_mean=0.190, reward_bound=0.349, batch=220
1250: loss=0.808, reward_mean=0.110, reward_bound=0.247, batch=224
1251: loss=0.811, reward_mean=0.090, reward_bound=0.314, batch=226
1252: loss=0.808, reward_mean=0.060, reward_bound=0.331, batch=228
1253: loss=0.802, reward_mean=0.120, reward_bound=0.387, batch=220
1254: loss=0.799, reward_mean=0.080, reward_bound=0.356, batch=224
1255: loss=0.797, reward_mean=0.120, reward_bound=0.311, batch=227
1256: loss=0.799, reward_mean=0.100, reward_bound=0.314, batch=227
1257: loss=0.797, reward_mean=0.130, reward_bound=0.342, batch=229
1258: loss=0.797, reward_mean=0.060, reward_bound=0.349, batch=229
1259: loss=0.796, reward_mean=0.110, reward_bound=0.364, batch=230
1260: loss=0.803, reward_mean=0.110, reward_bound=0.387, batch=225
1261: loss=0.803, reward_mean=0.050, reward_bound=0.253, batch=227
1262: loss=0.802, reward_mean=0.080, reward_bound=0.335, batch=229
1263: loss=0.799, reward_mean=0.080, reward_bound=0.364, batch

1373: loss=0.744, reward_mean=0.200, reward_bound=0.206, batch=218
1374: loss=0.744, reward_mean=0.080, reward_bound=0.158, batch=222
1375: loss=0.747, reward_mean=0.110, reward_bound=0.254, batch=220
1376: loss=0.747, reward_mean=0.160, reward_bound=0.282, batch=217
1377: loss=0.744, reward_mean=0.160, reward_bound=0.240, batch=222
1378: loss=0.745, reward_mean=0.160, reward_bound=0.292, batch=225
1379: loss=0.750, reward_mean=0.110, reward_bound=0.314, batch=215
1380: loss=0.749, reward_mean=0.140, reward_bound=0.240, batch=220
1381: loss=0.748, reward_mean=0.090, reward_bound=0.254, batch=223
1382: loss=0.747, reward_mean=0.090, reward_bound=0.261, batch=226
1383: loss=0.750, reward_mean=0.130, reward_bound=0.282, batch=225
1384: loss=0.750, reward_mean=0.120, reward_bound=0.321, batch=227
1385: loss=0.750, reward_mean=0.090, reward_bound=0.342, batch=229
1386: loss=0.746, reward_mean=0.110, reward_bound=0.349, batch=181
1387: loss=0.733, reward_mean=0.170, reward_bound=0.080, batch

1496: loss=0.761, reward_mean=0.170, reward_bound=0.311, batch=227
1497: loss=0.760, reward_mean=0.180, reward_bound=0.314, batch=228
1498: loss=0.773, reward_mean=0.160, reward_bound=0.349, batch=206
1499: loss=0.765, reward_mean=0.100, reward_bound=0.124, batch=214
1500: loss=0.766, reward_mean=0.120, reward_bound=0.185, batch=219
1501: loss=0.764, reward_mean=0.140, reward_bound=0.194, batch=223
1502: loss=0.762, reward_mean=0.190, reward_bound=0.206, batch=225
1503: loss=0.772, reward_mean=0.160, reward_bound=0.254, batch=225
1504: loss=0.772, reward_mean=0.100, reward_bound=0.282, batch=225
1505: loss=0.771, reward_mean=0.160, reward_bound=0.314, batch=224
1506: loss=0.773, reward_mean=0.200, reward_bound=0.349, batch=225
1507: loss=0.772, reward_mean=0.080, reward_bound=0.289, batch=227
1508: loss=0.772, reward_mean=0.110, reward_bound=0.314, batch=228
1509: loss=0.771, reward_mean=0.120, reward_bound=0.289, batch=229
1510: loss=0.771, reward_mean=0.090, reward_bound=0.364, batch

1619: loss=0.664, reward_mean=0.170, reward_bound=0.349, batch=222
1620: loss=0.666, reward_mean=0.200, reward_bound=0.324, batch=225
1621: loss=0.666, reward_mean=0.100, reward_bound=0.282, batch=225
1622: loss=0.667, reward_mean=0.150, reward_bound=0.321, batch=227
1623: loss=0.684, reward_mean=0.190, reward_bound=0.387, batch=179
1624: loss=0.659, reward_mean=0.130, reward_bound=0.000, batch=192
1625: loss=0.658, reward_mean=0.200, reward_bound=0.038, batch=203
1626: loss=0.674, reward_mean=0.220, reward_bound=0.117, batch=212
1627: loss=0.660, reward_mean=0.150, reward_bound=0.155, batch=218
1628: loss=0.668, reward_mean=0.180, reward_bound=0.229, batch=221
1629: loss=0.665, reward_mean=0.090, reward_bound=0.254, batch=221
1630: loss=0.671, reward_mean=0.140, reward_bound=0.282, batch=221
1631: loss=0.669, reward_mean=0.150, reward_bound=0.314, batch=217
1632: loss=0.670, reward_mean=0.170, reward_bound=0.342, batch=222
1633: loss=0.666, reward_mean=0.160, reward_bound=0.236, batch

1742: loss=0.674, reward_mean=0.130, reward_bound=0.112, batch=221
1743: loss=0.674, reward_mean=0.120, reward_bound=0.206, batch=223
1744: loss=0.678, reward_mean=0.250, reward_bound=0.282, batch=224
1745: loss=0.677, reward_mean=0.140, reward_bound=0.311, batch=227
1746: loss=0.680, reward_mean=0.180, reward_bound=0.314, batch=222
1747: loss=0.681, reward_mean=0.180, reward_bound=0.302, batch=225
1748: loss=0.684, reward_mean=0.150, reward_bound=0.349, batch=220
1749: loss=0.681, reward_mean=0.180, reward_bound=0.349, batch=223
1750: loss=0.680, reward_mean=0.220, reward_bound=0.387, batch=221
1751: loss=0.683, reward_mean=0.190, reward_bound=0.349, batch=224
1752: loss=0.681, reward_mean=0.190, reward_bound=0.387, batch=226
1753: loss=0.678, reward_mean=0.150, reward_bound=0.256, batch=228
1754: loss=0.679, reward_mean=0.190, reward_bound=0.349, batch=228
1755: loss=0.690, reward_mean=0.180, reward_bound=0.430, batch=197
1756: loss=0.679, reward_mean=0.160, reward_bound=0.147, batch

1865: loss=0.691, reward_mean=0.160, reward_bound=0.314, batch=221
1866: loss=0.686, reward_mean=0.200, reward_bound=0.349, batch=221
1867: loss=0.686, reward_mean=0.210, reward_bound=0.349, batch=224
1868: loss=0.691, reward_mean=0.190, reward_bound=0.387, batch=220
1869: loss=0.683, reward_mean=0.190, reward_bound=0.304, batch=224
1870: loss=0.691, reward_mean=0.140, reward_bound=0.308, batch=227
1871: loss=0.694, reward_mean=0.160, reward_bound=0.342, batch=229
1872: loss=0.695, reward_mean=0.160, reward_bound=0.349, batch=229
1873: loss=0.696, reward_mean=0.120, reward_bound=0.387, batch=227
1874: loss=0.690, reward_mean=0.150, reward_bound=0.430, batch=220
1875: loss=0.688, reward_mean=0.160, reward_bound=0.296, batch=224
1876: loss=0.686, reward_mean=0.140, reward_bound=0.345, batch=227
1877: loss=0.686, reward_mean=0.190, reward_bound=0.380, batch=229
1878: loss=0.684, reward_mean=0.150, reward_bound=0.387, batch=226
1879: loss=0.681, reward_mean=0.190, reward_bound=0.368, batch

1989: loss=0.546, reward_mean=0.290, reward_bound=0.098, batch=204
1990: loss=0.540, reward_mean=0.290, reward_bound=0.150, batch=211
1991: loss=0.531, reward_mean=0.250, reward_bound=0.185, batch=216
1992: loss=0.541, reward_mean=0.240, reward_bound=0.206, batch=218
1993: loss=0.539, reward_mean=0.250, reward_bound=0.229, batch=214
1994: loss=0.545, reward_mean=0.200, reward_bound=0.229, batch=219
1995: loss=0.542, reward_mean=0.230, reward_bound=0.254, batch=215
1996: loss=0.537, reward_mean=0.300, reward_bound=0.282, batch=215
1997: loss=0.533, reward_mean=0.210, reward_bound=0.206, batch=219
1998: loss=0.533, reward_mean=0.170, reward_bound=0.182, batch=223
1999: loss=0.536, reward_mean=0.220, reward_bound=0.282, batch=224
2000: loss=0.538, reward_mean=0.220, reward_bound=0.280, batch=227
2001: loss=0.536, reward_mean=0.200, reward_bound=0.314, batch=218
2002: loss=0.536, reward_mean=0.250, reward_bound=0.257, batch=222
2003: loss=0.538, reward_mean=0.220, reward_bound=0.324, batch

KeyboardInterrupt: 

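## - Results
<img src = "./image/4/t-3.png">

- Performance improves over the previous FrozenLake attempt
- Progress stalls at roughly 55% successful episodes
- This can be addressed with techniques such as regularization (for example, an entropy term added to the loss), which will be covered later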


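#### - The effect of "slipperiness"
- Each action is carried out as intended only 33% of the time. Otherwise it is replaced by an action rotated 90 degrees
    - For example, "up" moves up with 33% probability, left with 33% probability, and right with 33% probability
- Slipperiness can be switched on or off when the environment is created
- The non-slippery version is solved about 100 times faster than the slippery one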
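
The slip rule can be simulated in a few lines. This is a sketch under assumptions: it uses FrozenLake's action order (LEFT, DOWN, RIGHT, UP) and gives the intended action and its two perpendicular neighbours equal probability. `slippery_outcomes` and `slip` are hypothetical helpers. In the gym versions I am aware of, the environment itself exposes this through the `is_slippery` constructor argument of `FrozenLakeEnv`.

```python
import random
from collections import Counter

ACTIONS = ["LEFT", "DOWN", "RIGHT", "UP"]  # FrozenLake's action indices 0..3

def slippery_outcomes(action):
    """Intended action plus its two 90-degree neighbours."""
    return [(action - 1) % 4, action, (action + 1) % 4]

def slip(action, rng):
    """Pick the move actually executed, each outcome equally likely."""
    return rng.choice(slippery_outcomes(action))

rng = random.Random(0)
up = ACTIONS.index("UP")
counts = Counter(ACTIONS[slip(up, rng)] for _ in range(3000))
print(counts)
```

Intending "UP" yields UP, LEFT and RIGHT about a third of the time each and never DOWN, so even a perfect policy cannot reliably avoid holes.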

### - Conclusion
#### Cross-entropy method
- Sample episodes using the current policy
- Minimize the negative log-likelihood between the most successful samples and our policy
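
As a small worked example of the objective, the sketch below computes the mean negative log-likelihood of elite actions under a policy, which is what `nn.CrossEntropyLoss` evaluates after its internal softmax. `nll` and the probability tables are hypothetical illustration data.

```python
import math

def nll(action_probs, actions):
    """Mean negative log-likelihood of the chosen actions under the policy."""
    total = sum(math.log(p[a]) for p, a in zip(action_probs, actions))
    return -total / len(actions)

elite_actions = [2, 1, 2]  # actions taken in three elite steps

# Uniform policy over 4 actions: loss is ln(4).
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
print(round(nll(uniform, elite_actions), 3))  # 1.386

# Policy that puts 0.7 on each elite action: the loss drops.
sharper = [[0.1, 0.1, 0.7, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
print(round(nll(sharper, elite_actions), 3))  # 0.357
```

The uniform case gives ln(4), about 1.386, which matches the loss printed at iteration 0 of the training output, when the untrained network is close to uniform.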