<a href="https://colab.research.google.com/github/MatchLab-Imperial/deep-learning-course/blob/master/08_RL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coursework

## Task 1: On-policy vs. Off-policy
Use the code given below to run the training loop, which trains the agent for 200 episodes. The provided agent uses Q-learning, an off-policy approach. Your first task is to change it to SARSA, an on-policy approach. For both Q-learning and SARSA, test two different policies: $\epsilon$-greedy and Softmax. $\epsilon$-greedy is defined in the tutorial and already implemented in the given agent. The Softmax policy samples the next action from the probability distribution given by $\mathrm{Softmax}(Q(s, a))$. We provide a NumPy `softmax` function that normalizes the Q-values into a probability distribution, which you can then sample from. As with RNNs, the softmax function takes a temperature value. We set a default that works, but you can tweak it if another value performs better; if you do, report the new value.
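
As a rough illustration of the Softmax policy described above, the sketch below samples an action from temperature-scaled Q-values. It re-declares the provided `softmax` so the snippet runs on its own; `sample_softmax_action` and the example Q-values are hypothetical and not part of the given agent. Because the provided function adds a small constant to the denominator, the probabilities sum to slightly less than 1, so the sketch renormalizes them before sampling.

```python
import numpy as np

def softmax(x, temperature=0.025):
    """Same computation as the provided helper; x has shape (batch, actions)."""
    x = x - np.expand_dims(np.max(x, 1), 1)
    x = x / temperature
    e_x = np.exp(x)
    return e_x / (np.expand_dims(e_x.sum(1), -1) + 1e-5)

def sample_softmax_action(q_values, temperature=0.025, rng=None):
    """Sample one action index from Softmax(Q(s, .)). Hypothetical helper."""
    if rng is None:
        rng = np.random.default_rng()
    q_values = np.asarray(q_values, dtype=np.float64).reshape(1, -1)
    probs = softmax(q_values, temperature)[0]
    probs = probs / probs.sum()  # renormalize: the 1e-5 term leaves the sum just below 1
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
q = [0.50, 0.52]  # made-up Q-values for the two CartPole actions
samples = [sample_softmax_action(q, rng=rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=2)
print(counts / 1000)  # action 1 is sampled more often, but action 0 is still explored
```

Unlike $\epsilon$-greedy, the amount of exploration here depends on how close the Q-values are to each other relative to the temperature.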

You will need to modify the `act` and `replay` methods of `DQNAgent` to implement the different approaches. Results may differ from run to run because of random initialization.
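
To make the off-policy versus on-policy distinction concrete, here is a minimal NumPy sketch of the two bootstrap targets on a made-up batch. It is not the `DQNAgent` code: the function names and arrays are hypothetical, and it assumes the next action $a'$ chosen by the behavior policy is available for each transition (for example, stored alongside it in the replay memory).

```python
import numpy as np

GAMMA = 0.95  # same discount factor as the provided agent

def q_learning_target(reward, next_q, done, gamma=GAMMA):
    """Off-policy target: bootstrap from the best next action, max_a' Q(s', a')."""
    return reward + gamma * next_q.max(axis=1) * (1.0 - done)

def sarsa_target(reward, next_q, next_action, done, gamma=GAMMA):
    """On-policy target: bootstrap from the action a' the policy actually chose in s'."""
    chosen = next_q[np.arange(len(next_action)), next_action]
    return reward + gamma * chosen * (1.0 - done)

# Toy batch of three transitions with two actions (all numbers are made up).
reward      = np.array([1.0, 1.0, 1.0])
next_q      = np.array([[2.0, 4.0], [3.0, 1.0], [5.0, 5.0]])  # Q(s', .)
next_action = np.array([0, 0, 1])        # a' chosen by the behavior policy
done        = np.array([0.0, 0.0, 1.0])  # the last transition ends the episode

q_targets = q_learning_target(reward, next_q, done)
s_targets = sarsa_target(reward, next_q, next_action, done)
print(q_targets)  # approx. 4.8, 3.85, 1.0
print(s_targets)  # approx. 2.9, 3.85, 1.0
```

The two targets differ only in the first transition, where the policy's chosen action was not the greedy one.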

**Report**
* Plot the average reward for the last 50 episodes vs. number of training episodes (train for 200 episodes) for the four agents trained: Q-learning and SARSA with both $\epsilon$-greedy policy and Softmax policy.

* In the Appendix, describe the modifications you made to `DQNAgent` to implement the different agents. Do not include your code; a short explanation of the key modifications is enough.

* In addition to the average reward plot, include **at least one secondary plot** that illustrates the behavioral differences between the four agents. For example, you might consider:

  ---

  **$\epsilon$-greedy vs. Softmax**

  * **Fraction of greedy actions**  
  Proportion of times in an episode the agent chooses the action with the highest estimated Q-value.  
  Shows how quickly exploration gives way to exploitation, and is useful when comparing $\epsilon$-greedy and Softmax.  

  * **Average policy entropy**  
  A measure of how “spread out” the action probabilities are per-step in an episode (compute an episode average).  
  High entropy means the agent is exploring more uniformly; low entropy means it is being more deterministic.  

  ---

  **Q-Learning vs. SARSA**

  * **Variance of returns**  
  Plot the rolling variance of the total episode reward across episodes (e.g., window = 20).  
  Highlights stability differences between Q-Learning and SARSA.  

  * **Average Temporal-difference (TD) error magnitude**  
  Measures how different the agent’s current Q-value estimate is from the updated target it just computed.  
  Captures how big the agent’s “surprise” is at each step.  

    - For Q-learning (off-policy):  
      $\delta = \left| r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right|$

    - For SARSA (on-policy):  
      $\delta = \left| r + \gamma Q(s', a') - Q(s, a) \right|$, where $a'$ is the next action chosen by the policy.


    Large TD errors mean the agent is still making big adjustments (its predictions are far off).  
    Smaller TD errors suggest the estimates are stabilizing.  

  ---

  These are suggestions — you are welcome to choose other metrics if you find them insightful, as long as you clearly explain how your plot highlights differences between the agents.
  
  To make trends easier to see, consider smoothing your plots with a **moving average** across episodes (e.g. a 20-episode window).
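
The suggested metrics can be computed from quantities the agent already has at each step. The sketch below is one possible way to do it in NumPy. All function names are hypothetical, the inputs are made-up numbers, and the entropy computation assumes you have the per-step action probabilities (for $\epsilon$-greedy, these follow from $\epsilon$ and the greedy action).

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    probs = np.asarray(probs, dtype=np.float64)
    nz = probs[probs > 0]  # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum())

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities implied by epsilon-greedy (ties ignored)."""
    q_values = np.asarray(q_values, dtype=np.float64)
    probs = np.full(len(q_values), epsilon / len(q_values))
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def greedy_fraction(actions, greedy_actions):
    """Fraction of steps where the chosen action was the argmax of Q."""
    return float(np.mean(np.asarray(actions) == np.asarray(greedy_actions)))

def td_error_magnitude(q_sa, reward, next_value, done, gamma=0.95):
    """|r + gamma * next_value - Q(s, a)|, where next_value is max_a' Q(s', a')
    for Q-learning or Q(s', a') for SARSA."""
    return np.abs(reward + gamma * next_value * (1.0 - done) - q_sa)

def rolling(values, window=20, stat=np.mean):
    """Apply `stat` over a trailing window; early entries use the episodes seen so far."""
    values = np.asarray(values, dtype=np.float64)
    return np.array([stat(values[max(0, i - window + 1): i + 1])
                     for i in range(len(values))])

# Made-up example values.
print(policy_entropy(epsilon_greedy_probs([0.1, 0.3], epsilon=1.0)))  # log(2): uniform over 2 actions
print(greedy_fraction([0, 1, 1, 0], [1, 1, 1, 0]))                    # 0.75
episode_rewards = [10, 12, 30, 25, 60, 80]
smoothed = rolling(episode_rewards, window=3)               # moving average
variance = rolling(episode_rewards, window=3, stat=np.var)  # rolling variance
```

Per-step values such as entropy and TD error would be averaged within each episode first, then smoothed across episodes with `rolling`.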



In [None]:
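import collections
import random

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


def softmax(x, temperature=0.025):
    """Compute softmax values for each set of scores in x (shape: batch x actions)."""
    x = x - np.expand_dims(np.max(x, 1), 1)  # subtract the row max for numerical stability
    x = x / temperature                      # lower temperature -> sharper distribution
    e_x = np.exp(x)
    # the 1e-5 guards against division by zero, so each row sums to slightly less than 1
    return e_x / (np.expand_dims(e_x.sum(1), -1) + 1e-5)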

class DQNAgent:
    def __init__(self, state_size, action_size, device=None):
        self.state_size   = state_size
        self.action_size  = action_size
        self.memory       = collections.deque(maxlen=20000)
        self.gamma        = 0.95
        self.epsilon      = 1.0
        self.epsilon_min  = 0.01
        self.epsilon_decay= 0.995
        self.learning_rate= 0.001

        self.device = device or torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model  = self._build_model().to(self.device)
        self.opt    = optim.Adam(self.model.parameters(), lr=self.learning_rate)
        self.loss_fn= nn.MSELoss()

    def _build_model(self):
        net = nn.Sequential(
            nn.Linear(self.state_size, 24),
            nn.ReLU(),
            nn.Linear(24, 48),
            nn.ReLU(),
            nn.Linear(48, self.action_size)  # Q(s,·)
        )
        return net

    # ---- replay buffer ----
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    # ---- ε-greedy & exploit ----
    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        self.model.eval()
        with torch.no_grad():
            s = torch.as_tensor(state, dtype=torch.float32, device=self.device)
            q = self.model(s)  # (1, A)
            return int(torch.argmax(q, dim=1).item())

    def exploit(self, state):
        self.model.eval()
        with torch.no_grad():
            s = torch.as_tensor(state, dtype=torch.float32, device=self.device)
            q = self.model(s)
            return int(torch.argmax(q, dim=1).item())

    # ---- Q-learning update via replay ----
    def replay(self, batch_size):
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)

        state_b      = torch.as_tensor(np.vstack([m[0] for m in minibatch]),
                                       dtype=torch.float32, device=self.device)
        action_b     = torch.as_tensor([m[1] for m in minibatch],
                                       dtype=torch.long, device=self.device)
        reward_b     = torch.as_tensor([m[2] for m in minibatch],
                                       dtype=torch.float32, device=self.device)
        next_state_b = torch.as_tensor(np.vstack([m[3] for m in minibatch]),
                                       dtype=torch.float32, device=self.device)
        done_b       = torch.as_tensor([m[4] for m in minibatch],
                                       dtype=torch.float32, device=self.device)

        # target = r + γ * max_a' Q(next_state, a'); if done -> r
        self.model.eval()
        with torch.no_grad():
            next_q_max = self.model(next_state_b).max(dim=1).values
            target_scalar = reward_b + self.gamma * next_q_max * (1.0 - done_b)

            # target_full: start from current Q(s,·), overwrite chosen action with target_scalar
            target_full = self.model(state_b).clone()
            target_full[torch.arange(batch_size, device=self.device), action_b] = target_scalar

        # fit to target_full
        self.model.train()
        pred_q = self.model(state_b)
        loss = self.loss_fn(pred_q, target_full)

        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

        # ε decay
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    # ---- save/load weights ----
    def save(self, path):
        torch.save(self.model.state_dict(), path)

    def load(self, path):
        state = torch.load(path, map_location=self.device)
        self.model.load_state_dict(state)
        self.model.to(self.device)
        self.model.eval()

In [None]:
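import gymnasium as gym  # assumption: Gymnasium; `import gym` (>= 0.26) exposes the same 5-tuple step() API

EPISODES = 200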
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
agent = DQNAgent(state_size, action_size)
batch_size = 32
episode_reward_list = collections.deque(maxlen=50)

for e in range(EPISODES):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    total_reward = 0
    for t in range(200):
        action = agent.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            break
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
    episode_reward_list.append(total_reward)
    episode_reward_avg = np.array(episode_reward_list).mean()
    print("episode: {}/{}, score: {}, e: {:.2}, last 50 ep. avg. rew.: {:.2f}"
                .format(e, EPISODES, total_reward, agent.epsilon, episode_reward_avg))
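
For the report plot, the running 50-episode average printed above has to be kept for every episode rather than overwritten. The following is a minimal sketch, assuming you collect each agent's total episode rewards in a list. `trailing_average` and `plot_curves` are hypothetical helpers, the reward values are made up, and matplotlib is only one possible plotting choice.

```python
import collections
import numpy as np

def trailing_average(episode_rewards, window=50):
    """Average of the last `window` rewards after each episode, mirroring the
    deque(maxlen=50) used in the training loop."""
    recent = collections.deque(maxlen=window)
    averages = []
    for r in episode_rewards:
        recent.append(r)
        averages.append(float(np.mean(list(recent))))
    return averages

def plot_curves(curves, window=50):
    """Plot one averaged curve per agent; `curves` maps a label to a reward list."""
    import matplotlib.pyplot as plt  # imported lazily so trailing_average works without it
    for label, rewards in curves.items():
        plt.plot(trailing_average(rewards, window), label=label)
    plt.xlabel("Training episode")
    plt.ylabel(f"Average reward (last {window} episodes)")
    plt.legend()
    plt.show()

# Made-up rewards for illustration only.
demo = [10, 20, 30, 40]
print(trailing_average(demo, window=2))  # [10.0, 15.0, 25.0, 35.0]
```

Passing a dictionary with one reward list per agent to `plot_curves` would draw all four curves on the same axes.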