## Considerations in Q-learning

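Exploitation and exploration strategies for finding a better solution
* e-greedy: at each step, instead of always taking the best-known action, take a random action with probability e
* decaying e-greedy: because the Q-function improves as episodes progress, apply e-greedy while gradually reducing e
* random noise: add random noise to the Q-values whenever an action is chosen, so that the action with the highest Q-value is not always taken
* decaying random noise: because the Q-function improves as episodes progress, apply random noise while gradually reducing its magnitude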
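The strategies above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the names `epsilon_for_episode` and `select_action`, and the `1 / (1 + decay * episode)` schedule, are assumptions chosen for the example.

```python
import random


def epsilon_for_episode(episode, decay=0.01):
    """Hypothetical schedule: epsilon starts at 1 and shrinks as episodes progress."""
    return 1.0 / (1.0 + decay * episode)


def select_action(q_row, epsilon, noise_amp=0.0, rng=random):
    """Pick an action index from one row of a Q-table.

    With probability epsilon, explore uniformly at random. Otherwise take the
    argmax of the Q-values after adding uniform noise in [0, noise_amp).
    """
    n_actions = len(q_row)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    noisy = [q + rng.random() * noise_amp for q in q_row]
    return max(range(n_actions), key=lambda a: noisy[a])


# With epsilon=0 and no noise the choice is purely greedy.
greedy_action = select_action([0.1, 0.9, 0.3], epsilon=0.0)
```

Passing a smaller `noise_amp` in later episodes gives the decaying random noise variant in the same way that `epsilon_for_episode` gives decaying e-greedy.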

Discount strategy for finding a more stable and better solution
* discounted reward: weight rewards in the near future more heavily than rewards in the distant future
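As a small illustration of discounting, the sketch below computes the discounted sum of a reward sequence. The function name `discounted_return` is an assumption made for this example.

```python
def discounted_return(rewards, discount):
    """Return sum(rewards[t] * discount**t), accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + discount * total
    return total


# The same reward of 1 is worth less the later it arrives.
immediate = discounted_return([1, 0, 0], discount=0.9)
delayed = discounted_return([0, 0, 1], discount=0.9)
```

Here `immediate` is 1.0, while `delayed` is about 0.81 because the reward is discounted twice.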

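Q-learning is proven to converge whenever the environment is deterministic and the observation space is finite, provided every state-action pair continues to be visited and the discount factor is below 1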
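For reference, the deterministic update can be written as follows. The notation is an assumption for this note: $s$ and $a$ are the current state and action, $r$ the reward received, $s'$ the next state, and $\gamma$ the discount factor.

```latex
Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')
```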

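Strategy for finding a solution in a non-deterministic environment
* learning rate: move Q only a fraction of the way toward each new target, which makes learning robust to noisy transitions and rewards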
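A minimal sketch of an update with a learning rate, using a list-of-lists Q-table. The function name `q_update` and its argument names are assumptions for this example.

```python
def q_update(q, state, action, reward, next_state, lr, discount):
    """Blend the old estimate with the new target; lr controls the blend."""
    target = reward + discount * max(q[next_state])
    q[state][action] = (1 - lr) * q[state][action] + lr * target
    return q[state][action]


# Two states, two actions; state 1 already has some learned values.
q = [[0.0, 0.0], [1.0, 2.0]]
first = q_update(q, state=0, action=1, reward=1.0, next_state=1, lr=0.5, discount=0.5)
second = q_update(q, state=0, action=1, reward=1.0, next_state=1, lr=0.5, discount=0.5)
```

The target here is 1.0 + 0.5 * 2.0 = 2.0, so the estimate moves halfway toward it on each call: 1.0 after the first update and 1.5 after the second. With `lr=1` the old estimate is discarded entirely.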

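In a non-deterministic environment, Q-learning is proven to converge when a learning rate is used, provided the learning rate decays appropriately over time and every state-action pair continues to be visited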
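The conditions on the learning rate usually cited for this result are shown below, where $\alpha_t$ (notation assumed for this note) is the learning rate used on the $t$-th update of a given state-action pair.

```latex
\sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^2 < \infty
```

A schedule such as $\alpha_t = 1/t$ satisfies both. A constant learning rate does not satisfy the second condition, so using one trades the formal guarantee for simplicity.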

In [1]:
import gymnasium
import numpy as np
import ray.tune
import ray.tune.search.optuna
import ray.tune.schedulers.pb2
import ray.air.integrations.wandb
import os

In [2]:
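class QLearning(ray.tune.Trainable):
    """Tabular Q-learning on FrozenLake; each call to step() runs one episode."""

    def setup(self, config):
        # Exploration hyperparameters: e = 1 / (step * e_weight + e_bias).
        self.e_weight = config['e_weight']
        self.e_bias = config['e_bias']
        # Amplitude of the uniform noise added to Q-values before the argmax.
        self.noise_amp = config['noise_amp']
        self.discount = config['discount']
        self.lr = config['lr']
        self.env = gymnasium.make('FrozenLake-v1')
        # Q-table with one row per state and one column per action.
        self.q = np.zeros((self.env.observation_space.n, self.env.action_space.n))
        self.rewards = []

    def step(self):
        u, info = self.env.reset()
        reward = 0
        step = 0
        while True:
            # Decaying e-greedy: e shrinks with the step count inside the episode.
            # The small floor avoids division by zero when e_bias is 0 at step 0.
            e = 1 / max(step * self.e_weight + self.e_bias, 1e-8)
            if np.random.random() < e:
                # Explore: uniformly random action.
                action = self.env.action_space.sample()
            else:
                # Exploit: argmax over Q-values perturbed by random noise.
                action = np.argmax(self.q[u, :] + np.random.random(self.env.action_space.n) * self.noise_amp)
            v, r, terminated, truncated, info = self.env.step(action)
            reward += r
            step += 1
            # Move Q(u, action) a fraction lr toward the discounted target.
            self.q[u, action] = (1 - self.lr) * self.q[u, action] + self.lr * (r + self.discount * np.max(self.q[v, :]))
            if terminated or truncated:
                break
            # Continue the episode from the new state.
            u = v
        self.rewards.append(reward)
        # Score is the cumulative reward over all episodes run so far.
        return {'score': np.sum(self.rewards)}

    def save_checkpoint(self, tmp_checkpoint_dir):
        # Persist the Q-table as a pickled array.
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "q")
        self.q.dump(checkpoint_path)
        return tmp_checkpoint_dir

    def load_checkpoint(self, tmp_checkpoint_dir):
        checkpoint_path = os.path.join(tmp_checkpoint_dir, "q")
        self.q = np.load(checkpoint_path, allow_pickle=True)

    def cleanup(self):
        self.env.close()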

In [3]:
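tuner = ray.tune.Tuner(
    QLearning,
    tune_config=ray.tune.TuneConfig(
        # -1 keeps launching new trials until the experiment is stopped.
        num_samples=-1,
        # PB2 picks hyperparameters within the bounds below.
        scheduler=ray.tune.schedulers.pb2.PB2(
            time_attr='time_total_s',
            metric='score',
            mode='max',
            # Measured in seconds of time_total_s. In the run logged below, trials
            # finish in about 0.5 s, so this interval is likely never reached.
            perturbation_interval=5.0,
            hyperparam_bounds={
                'e_weight': [0, 2],
                'e_bias': [0, 500],
                'noise_amp': [0, 1],
                'discount': [0, 1],
                'lr': [0, 1],
            }
        )
    ),
    run_config=ray.air.RunConfig(
        callbacks=[
            ray.air.integrations.wandb.WandbLoggerCallback(project='QLearning'),
        ],
        # Each trial stops after 500 iterations, i.e. 500 episodes.
        stop={
            'training_iteration': 500,
        },
        # Checkpoint every 5 iterations and keep the 3 best by score.
        checkpoint_config=ray.air.CheckpointConfig(
            num_to_keep=3,
            checkpoint_score_attribute='score',
            checkpoint_score_order='max',
            checkpoint_frequency=5,
            checkpoint_at_end=True,
        ),
    ),
)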

In [4]:
results = tuner.fit()

2023-07-25 09:56:31,430	INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-07-25 09:56:33,100	INFO tune.py:226 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.


* Current time: 2023-07-25 09:58:34
* Running for: 00:01:59.04
* Memory: 7.3/7.7 GiB

| Trial name | status | loc | discount | e_bias | e_weight | lr | noise_amp | iter | total time (s) | score |
|---|---|---|---|---|---|---|---|---|---|---|
| QLearning_19293_00008 | RUNNING | 172.26.215.93:84690 | 0.342404 | 490.272 | 0.0939159 | 0.612227 | 0.892059 | 185.0 | 0.529877 | 5.0 |
| QLearning_19293_00009 | RUNNING | 172.26.215.93:85091 | 0.448611 | 357.19 | 1.29258 | 0.303155 | 0.295359 | 1.0 | 0.00365186 | 0.0 |
| QLearning_19293_00010 | RUNNING | 172.26.215.93:85093 | 0.91863 | 61.5705 | 1.27952 | 0.713519 | 0.983653 | 7.0 | 0.0104692 | 0.0 |
| QLearning_19293_00011 | RUNNING | 172.26.215.93:85096 | 0.533566 | 216.329 | 0.29919 | 0.638748 | 0.248776 | 2.0 | 0.00630784 | 0.0 |
| QLearning_19293_00013 | RUNNING | 172.26.215.93:85127 | 0.890601 | 471.822 | 1.96318 | 0.312419 | 0.245912 | | | |
| QLearning_19293_00014 | RUNNING | 172.26.215.93:85128 | 0.900362 | 137.849 | 0.293181 | 0.17519 | 0.250292 | 5.0 | 0.0184 | 0.0 |
| QLearning_19293_00015 | RUNNING | 172.26.215.93:85461 | 0.737518 | 280.687 | 1.04735 | 0.562984 | 0.277247 | 2.0 | 0.00400329 | 0.0 |
| QLearning_19293_00012 | PENDING | | 0.62004 | 210.363 | 0.665454 | 0.307021 | 0.193716 | | | |
| QLearning_19293_00016 | PENDING | | 0.406881 | 136.438 | 1.05413 | 0.635878 | 0.0627093 | | | |
| QLearning_19293_00017 | PENDING | | 0.384536 | 4.17258 | 1.75333 | 0.66518 | 0.66244 | | | |


2023-07-25 09:56:35,068	INFO wandb.py:320 -- Already logged into W&B.


| Trial name | date | done | hostname | iterations_since_restore | node_ip | pid | score | time_since_restore | time_this_iter_s | time_total_s | timestamp | training_iteration | trial_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QLearning_19293_00000 | 2023-07-25_09-58-04 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82461 | 7 | 0.455641 | 0.000976562 | 0.455641 | 1690246684 | 500 | 19293_00000 |
| QLearning_19293_00001 | 2023-07-25_09-58-08 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82462 | 4 | 0.455195 | 0.0010097 | 0.455195 | 1690246688 | 500 | 19293_00001 |
| QLearning_19293_00002 | 2023-07-25_09-58-12 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82463 | 5 | 0.486526 | 0.000966072 | 0.486526 | 1690246692 | 500 | 19293_00002 |
| QLearning_19293_00003 | 2023-07-25_09-58-10 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82464 | 10 | 0.545912 | 0.00057888 | 0.545912 | 1690246690 | 500 | 19293_00003 |
| QLearning_19293_00004 | 2023-07-25_09-58-09 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82465 | 8 | 0.494533 | 0.000878811 | 0.494533 | 1690246689 | 500 | 19293_00004 |
| QLearning_19293_00005 | 2023-07-25_09-58-00 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82466 | 3 | 0.446376 | 0.000694275 | 0.446376 | 1690246680 | 500 | 19293_00005 |
| QLearning_19293_00006 | 2023-07-25_09-58-04 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82467 | 12 | 0.53046 | 0.00168705 | 0.53046 | 1690246684 | 500 | 19293_00006 |
| QLearning_19293_00007 | 2023-07-25_09-57-59 | True | DESKTOP-0P789CI | 500 | 172.26.215.93 | 82468 | 7 | 0.438363 | 0.000495911 | 0.438363 | 1690246679 | 500 | 19293_00007 |
| QLearning_19293_00008 | 2023-07-25_09-58-34 | False | DESKTOP-0P789CI | 186 | 172.26.215.93 | 84690 | 5 | 0.531027 | 0.00115037 | 0.531027 | 1690246714 | 186 | 19293_00008 |
| QLearning_19293_00009 | 2023-07-25_09-58-24 | False | DESKTOP-0P789CI | 1 | 172.26.215.93 | 85091 | 0 | 0.00365186 | 0.00365186 | 0.00365186 | 1690246704 | 1 | 19293_00009 |


[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: Currently logged in as: seokjin. Use `wandb login --relogin` to force relogin
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: wandb version 0.15.6 is available!  To upgrade, please run:
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb:  $ pip install wandb --upgrade
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: Tracking run with wandb version 0.15.4
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: Run data is saved locally in /home/seokj/ray_results/QLearning_2023-07-25_09-56-28/QLearning_19293_00002_2_2023-07-25_09-56-35/wandb/run-20230725_095651-19293_00002
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: Run `wandb offline` to turn off syncing.
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: Syncing run QLearning_19293_00002
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: ⭐️ View project at https://wandb.ai/seokjin/QLearning
[2m[36m(_WandbLoggingActor pid=82811)[0m wandb: 🚀 View run at https://wandb.ai/seokjin/