# Scripts for Evaluation

In this notebook we show how to evaluate the trained agents.

- Use the flag `-t` to specify the model to be loaded.

## 1. LunarLander with Bomb

**a) Evaluate baseline agent**

`python3 eval_lunarT.py -t sac_lunarT_baseline_plus4_seed0 --noBomb`

>Expected evaluation results (100 episodes per seed)
>
>Seed: 0 
>- Average Reward (n=99): 281.51(+-17.32), Min=244.33, Max=314.37
>- Stats: n_success: 100, n_hits: 0
>- Results saved in: ./eval_out/res_LunarLanderContinuous-v2_sac_lunarT_plus4_Nov18_seed0
>- Computation time (minutes):  0.6
>
>Seed: 1
>- Average Reward (n=99): 276.81(+-16.62), Min=232.63, Max=310.76
>- Stats: n_success: 100, n_hits: 0
>- Results saved in: ./eval_out/res_LunarLanderContinuous-v2_sac_lunarT_plus4_Nov18_seed1
>- Computation time (minutes):  0.7
>
>Seed: 2
>- Average Reward (n=99): 280.77(+-17.53), Min=240.49, Max=319.12
>- Stats: n_success: 100, n_hits: 0
>- Results saved in: ./eval_out/res_LunarLanderContinuous-v2_sac_lunarT_plus4_Nov18_seed2
>- Computation time (minutes):  0.6
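
The expected results above cover seeds 0 to 2, while the command shows only seed 0. Below is a minimal Python sketch that generates one command per seed. It assumes the models for seeds 1 and 2 follow the same naming pattern as seed 0, which the source does not state. The helper `build_eval_command` is hypothetical and not part of the repository.

```python
import shlex


def build_eval_command(model_name, script="eval_lunarT.py", extra_flags=()):
    """Return the argument list for one evaluation run (hypothetical helper)."""
    return ["python3", script, "-t", model_name, *extra_flags]


# Assumption: seeds 1 and 2 are saved under the same naming pattern as seed 0.
commands = [
    build_eval_command(f"sac_lunarT_baseline_plus4_seed{seed}", extra_flags=("--noBomb",))
    for seed in range(3)
]

for cmd in commands:
    # Print the shell command; to actually run it, pass `cmd` to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so model names need no manual quoting if the list is later passed to `subprocess.run`.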

**b) Evaluate baseline agent (bomb-unaware lander)**

`python3 eval_lunarT.py -t sac_lunarT_baseline_plus4_seed0`

>Expected evaluation results (100 episodes)
>- Average Reward (n=99): 228.34(+-91.27), Min=22.75, Max=311.69
>- Stats: n_success: 74, n_hits: 3, n_bombs=22
>- Computation time (minutes):  4.8
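
To compare a run against the expected numbers, the summary lines can be parsed into a dictionary. This is a minimal sketch. The regular expressions are inferred only from the lines shown in this document, so they are an assumption and may need adjusting if the script prints a different format. `parse_summary` is a hypothetical helper.

```python
import re

# Patterns inferred from the summary lines shown above (an assumption).
REWARD_RE = re.compile(
    r"Average Reward \(n=(?P<n>\d+)\):\s*(?P<mean>-?[\d.]+)\(\+-(?P<std>[\d.]+)\),"
    r"\s*Min=(?P<min>-?[\d.]+),\s*Max=(?P<max>-?[\d.]+)"
)
# The stats line mixes "name: value" and "name=value", so accept both.
STATS_RE = re.compile(r"(n_\w+)\s*[:=]\s*(\d+)")


def parse_summary(text):
    """Extract reward statistics and outcome counts from the printed summary."""
    reward = REWARD_RE.search(text)
    if reward is None:
        raise ValueError("no 'Average Reward' line found")
    result = {key: float(value) for key, value in reward.groupdict().items()}
    result["n"] = int(result["n"])
    result.update({name: int(value) for name, value in STATS_RE.findall(text)})
    return result


sample = """\
- Average Reward (n=99): 228.34(+-91.27), Min=22.75, Max=311.69
- Stats: n_success: 74, n_hits: 3, n_bombs=22
"""
summary = parse_summary(sample)
print(summary)
```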

**c) Evaluate the best SAC and SAC-I (circle reward shaping)**

- SAC best agent:

`python3 eval_lunarT.py --env LunarLanderContinuous-v2 -t sac_lunarTB_shapedC_seed4`

- SAC-I best agent:

`python3 eval_lunarT.py --env LunarLanderContinuous-v2 -t saci_lunarTB_shapedC_seed4`

>Expected evaluation results
>
>SAC agent:
>- Average Reward (n=99): 258.09(+-70.80), Min=-95.76, Max=319.30
>- Stats: n_success: 93, n_hits: 1, n_bombs=0
>- Computation time (minutes):  7.7
>
>SAC-I agent:
>- Average Reward (n=99): 252.97(+-77.70), Min=-83.75, Max=319.68
>- Stats: n_success: 94, n_hits: 0, n_bombs=0
>- Computation time (minutes):  7.8
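
The two agents can be put side by side with a few lines of Python. This sketch only rearranges the numbers listed above. It is a reading aid under that assumption, not a statistical test, and the variable names are illustrative.

```python
# Numbers copied from the expected results listed above.
results = {
    "SAC": {"mean": 258.09, "std": 70.80, "n_success": 93, "n_hits": 1, "n_bombs": 0},
    "SAC-I": {"mean": 252.97, "std": 77.70, "n_success": 94, "n_hits": 0, "n_bombs": 0},
}

# Difference in average reward, and whether it is smaller than the
# smaller of the two reported standard deviations.
mean_gap = round(results["SAC"]["mean"] - results["SAC-I"]["mean"], 2)
smaller_std = min(r["std"] for r in results.values())
within_one_std = abs(mean_gap) < smaller_std

for name, r in results.items():
    print(
        f"{name:6s} reward={r['mean']:.2f} (+-{r['std']:.2f}) "
        f"success={r['n_success']} hits={r['n_hits']} bombs={r['n_bombs']}"
    )
print(f"mean gap = {mean_gap}, within one std: {within_one_std}")
```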

**d) Evaluate the best SAC-I agent (conservative shaping)**

`python3 eval_lunarT.py --env LunarLanderContinuous-v2 -t saci_lunarTB_shapedVVHA_seed5`

>Expected evaluation results
>
>- Average Reward (n=99): 268.44(+-39.45), Min=92.34, Max=314.33
>- Stats: n_success: 97, n_hits: 0, n_bombs=3
>- Computation time (minutes):  5.8

## 2. BipedalWalkerHardcore-v3

**a) Evaluate the SAC agent trained in the BipedalWalker-v3 (easy version)**

`python3 eval_bipedal.py -t sac_bipedal_baseline_seed0 --env BipedalWalker-v3`

>This agent runs well and is the baseline from which the retrained agents in the hardcore version start.

**b) Evaluate SAC-I agents trained from scratch**

Agent that walks with a legs-up strategy: `python3 eval_bipedal.py -t saci_bipedalH_from0_seed0`

Agent that walks with a legs-down strategy: `python3 eval_bipedal.py -t saci_bipedalH_from0_seed1`
>These agents are cool, but their training is not very stable.

**c) Evaluate the best SAC retrained from baseline agent**

`python3 eval_bipedal.py -t sac_bipedalH_retrained_seed0`
> Legs-up strategy

**d) Evaluate the best SAC-I retrained from baseline agent**

`python3 eval_bipedal.py -t saci_bipedalH_retrained_seed8`
> Legs-up strategy. Average evaluation reward over 10 episodes: about 278.

## (EXTRA)

**e) Evaluate the best SAC-I agents retrained from baseline agent (with extra bonus reward)**

- Evaluate: `python3 eval_bipedal.py -t saci_bipedalH_retrained_bonus_seed1`
> Seems to have an energy-efficient walk (scores about 305 on Go trials) but falls more often.
- Evaluate: `python3 eval_bipedal.py -t saci_bipedalH_retrained_bonus_seed4`
> Less energy efficient (scores about 292 on Go trials) but falls less often.
- To evaluate on Go trials, add the flag `--go`:

    `python3 eval_bipedal.py -t saci_bipedalH_retrained_bonus_seed1 --go`

    `python3 eval_bipedal.py -t saci_bipedalH_retrained_bonus_seed4 --go` 

**f) Evaluate the SAC agents retrained from baseline agent (with extra bonus reward)**

`python3 eval_bipedal.py -t sac_bipedalH_retrained_bonus_seed3`
> Does not work. Interestingly, this agent learns to exploit getting stuck.