<a href="https://colab.research.google.com/github/vijaygwu/classideas/blob/main/PPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Summary**:
- The agent (PPO) uses two neural networks: an actor (policy) and a critic (value function).
- It interacts with an environment, collects experiences, and then trains its policy and value function based on these experiences.
- PPO clips the policy update so that a single step cannot move the policy far from the one that collected the data, which tends to make learning more stable.
- The algorithm also leverages Generalized Advantage Estimation (GAE) for a better estimate of advantages, and entropy regularization to encourage exploration.
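
For reference, the quantities named in the summary can be written out as follows. This is a sketch in common PPO notation, not notation taken from this notebook: $\rho_t$ is the probability ratio (`ratio` in the code), $m_t$ is the continuation mask (0 on the last step of an episode, 1 otherwise), $\lambda$ is the GAE parameter (called `tau` in the code), $\epsilon$ is `clip_epsilon`, and $c_H$ is `entropy_coef`.

$$
\begin{aligned}
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \\
L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right] \\
\delta_t &= r_t + \gamma\, m_t\, V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \delta_t + \gamma \lambda\, m_t\, \hat{A}_{t+1} \\
\hat{R}_t &= \hat{A}_t + V(s_t) \\
\text{loss}(\theta) &= -L^{\text{CLIP}}(\theta) + \tfrac{1}{2}\,\mathbb{E}_t\left[\left(\hat{R}_t - V_\theta(s_t)\right)^2\right] - c_H\,\mathbb{E}_t\left[H\left(\pi_\theta(\cdot \mid s_t)\right)\right]
\end{aligned}
$$

In the code, the advantages are additionally normalized to zero mean and unit standard deviation before they enter $L^{\text{CLIP}}$.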

The key advantage of PPO over other policy gradient methods is its balance between ease of implementation, sample efficiency, and training stability. This code implements the core of PPO in a modular way, which makes it easier to understand and extend.
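
To make the clipping idea concrete, here is a minimal per-sample sketch in plain Python. The function name `clipped_surrogate` is made up for this illustration; the notebook's implementation works on batched PyTorch tensors instead.

```python
def clipped_surrogate(ratio, advantage, clip_epsilon=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: raising the action's probability stops paying off past 1 + eps.
print(clipped_surrogate(1.5, advantage=1.0))
# Negative advantage: lowering the probability stops paying off below 1 - eps.
print(clipped_surrogate(0.5, advantage=-1.0))
# A move in the harmful direction is not clipped, so it is still penalized in full.
print(clipped_surrogate(1.5, advantage=-1.0))
```

Because the objective is flat once the ratio leaves the clipping range in the favorable direction, the gradient gives no incentive to push the policy further in a single update.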

**Implementation Details**

The code is a Python implementation of the Proximal Policy Optimization (PPO) algorithm using the PyTorch library.

1. **Neural Network Architecture (ActorCritic)**:
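    - **Actor**: Represents the policy of the agent. Given a state, it outputs logits that become action probabilities (for discrete action spaces) or action means (for continuous action spaces, where a learned, state-independent log standard deviation completes the Gaussian).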
    - **Critic**: Represents the value function of the agent. Given a state, it estimates its value.

2. **PPO Algorithm Class (PPO)**:
   - Initializes various parameters and the neural network (ActorCritic).
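   - `compute_gae`: Computes return targets with Generalized Advantage Estimation (GAE) in a backward pass over the collected steps. Each target is the estimated advantage plus the stored value estimate. The `tau` parameter (usually written λ) trades bias against variance in the advantage estimate.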
   - `ppo_step`: Performs one update step for the PPO algorithm. It updates the policy using the PPO clipping objective and also updates the value function.
   - `train`: Given collected experience (memory), this method divides the experience into mini-batches and updates the policy and value function using the `ppo_step` method.

3. **Memory Class (Memory)**:
   - A simple class to store the experiences (state, action, value, log-probability of the action, reward, and mask indicating episode end) collected by the agent during its interaction with the environment.
   - `append`: Appends an experience to the memory.
   - `clear`: Clears the memory.

4. **Environment Setup**:
   - This part of the code requires the `gym` library. The setup includes:
      - Creating the environment.
      - Instantiating the PPO agent and memory.
      - Looping to collect experiences with the current policy, store them in memory, and train the agent on them.
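
The backward pass that `compute_gae` performs can be tried on a tiny hand-made rollout. The sketch below is a plain-Python illustration under assumed toy numbers; the name `gae_returns` and the `lam` argument (the notebook calls it `tau`) are chosen for this example only.

```python
def gae_returns(rewards, values, masks, next_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) computed backwards over one rollout.

    masks[t] is 1 while the episode continues after step t and 0 when it ends.
    """
    values = list(values) + [next_value]
    gae = 0.0
    advantages = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        # One-step TD error; the mask removes the bootstrap term at episode end.
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        gae = delta + gamma * lam * masks[t] * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
masks = [1, 1, 0]  # the episode ends after the third step

# lam = 1 recovers the Monte Carlo return; lam = 0 gives one-step TD targets.
_, mc_returns = gae_returns(rewards, values, masks, next_value=0.0, gamma=1.0, lam=1.0)
_, td_returns = gae_returns(rewards, values, masks, next_value=0.0, gamma=1.0, lam=0.0)
print(mc_returns)  # [3.0, 2.0, 1.0]
print(td_returns)  # [1.5, 1.5, 1.0]
```

Values of `lam` between 0 and 1 interpolate between these two extremes, which is the bias/variance trade-off GAE is meant to expose.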



In [11]:
import torch
import torch.nn as nn
import torch.optim as optim

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim, continuous=False):
        super(ActorCritic, self).__init__()
        self.continuous = continuous

        self.actor = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )

        self.critic = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1)
        )

        if continuous:
            self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        value = self.critic(state)
        if self.continuous:
            mean = self.actor(state)
            std = torch.exp(self.log_std).unsqueeze(0).expand_as(mean)
            action_dist = torch.distributions.Normal(mean, std)
        else:
            # Passing logits lets Categorical apply a numerically stable softmax internally.
            action_dist = torch.distributions.Categorical(logits=self.actor(state))
        return action_dist, value

class PPO:
    def __init__(self, state_dim, action_dim, continuous=False, lr=3e-4, gamma=0.99, tau=0.95, clip_epsilon=0.2, entropy_coef=0.01):
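        self.gamma = gamma                # discount factor
        self.tau = tau                    # GAE lambda (bias/variance trade-off)
        self.clip_epsilon = clip_epsilon  # half-width of the PPO clipping range
        self.entropy_coef = entropy_coef  # weight of the entropy bonus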

        self.policy = ActorCritic(state_dim, action_dim, continuous)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)

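    def compute_gae(self, next_value, rewards, masks, values):
        """Return GAE-based value targets (advantage + value) for each stored step."""
        values = values + [next_value]
        gae = 0
        returns = []
        # Walk backwards so each step can reuse the advantage of the step after it.
        for step in reversed(range(len(rewards))):
            # TD error; the mask drops the bootstrap term when the episode ended at this step.
            delta = rewards[step] + self.gamma * values[step + 1] * masks[step] - values[step]
            gae = delta + self.gamma * self.tau * masks[step] * gae
            returns.insert(0, gae + values[step])
        return returns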

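    def ppo_step(self, states, actions, log_probs, returns, advantages):
        action_dist, values = self.policy(states)
        # The critic outputs shape (batch, 1); drop the last dimension so that
        # `returns - values` is element-wise instead of broadcasting to (batch, batch).
        values = values.squeeze(-1)
        entropy = action_dist.entropy().mean()
        new_log_probs = action_dist.log_prob(actions)

        # Probability ratio between the updated policy and the policy that collected the data.
        ratio = (new_log_probs - log_probs).exp()
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - self.clip_epsilon, 1.0 + self.clip_epsilon) * advantages

        # Clipped surrogate objective, negated because the optimizer minimizes.
        actor_loss = -torch.min(surr1, surr2).mean()
        critic_loss = 0.5 * (returns - values).pow(2).mean()

        # The entropy bonus is subtracted so that higher entropy lowers the loss.
        loss = actor_loss + critic_loss - self.entropy_coef * entropy

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()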

    def train(self, memory, ppo_epochs=4, mini_batch_size=64):
        states = torch.tensor(memory.states, dtype=torch.float32)
        actions = torch.tensor(memory.actions)
        old_log_probs = torch.tensor(memory.log_probs)

        # Bootstrap value for the step after the last stored transition. Memory does not
        # keep the final next_state, so the last stored state stands in for it; the value
        # only matters when the rollout was cut off (last mask is True).
        with torch.no_grad():
            _, next_value = self.policy(states[-1])
        returns = self.compute_gae(next_value.item(), memory.rewards, memory.masks, memory.values)

        returns_tensor = torch.tensor(returns, dtype=torch.float32)
        values_tensor = torch.tensor(memory.values, dtype=torch.float32)
        advantages = returns_tensor - values_tensor
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(ppo_epochs):
            for index in range(0, len(states), mini_batch_size):
                sampled_states = states[index:index+mini_batch_size]
                sampled_actions = actions[index:index+mini_batch_size]
                sampled_old_log_probs = old_log_probs[index:index+mini_batch_size]
                sampled_returns = returns_tensor[index:index+mini_batch_size]
                sampled_advantages = advantages[index:index+mini_batch_size]

                self.ppo_step(sampled_states, sampled_actions, sampled_old_log_probs, sampled_returns, sampled_advantages)

class Memory:
    def __init__(self):
        self.states = []
        self.actions = []
        self.values = []
        self.log_probs = []
        self.rewards = []
        self.masks = []

    def append(self, state, action, value, log_prob, reward, mask):
        self.states.append(state)
        self.actions.append(action)
        self.values.append(value.item())
        self.log_probs.append(log_prob.item())
        self.rewards.append(reward)
        self.masks.append(mask)

    def clear(self):
        self.states.clear()
        self.actions.clear()
        self.values.clear()
        self.log_probs.clear()
        self.rewards.clear()
        self.masks.clear()
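
The `train` method rescales the advantages to zero mean and unit standard deviation before the update. A minimal plain-Python sketch of that step, with a made-up function name and made-up input values, looks like this:

```python
import statistics

def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(advantages)
    # Sample standard deviation (n - 1 denominator), like torch.Tensor.std() by default.
    std = statistics.stdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]

normalized = normalize_advantages([2.5, 1.5, 0.5, -1.0])
print([round(a, 3) for a in normalized])
```

Normalizing keeps the scale of the policy gradient roughly constant across rollouts, regardless of the reward magnitude. The small `eps` avoids a division by zero when all advantages are equal.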


This example uses the `CartPole-v1` environment from OpenAI Gym. In this code:

- We first define a function, `collect_experience_and_train`, that collects experiences and trains the agent.
- Within the function, we loop for a given number of episodes.
- For each episode, we iterate for a set number of timesteps or until the episode ends.
- At each timestep, we use the policy to sample an action, take the action in the environment, and store the experience in memory.
- After the episode is over, we use the experiences collected in memory to train the PPO agent.
- After training, we clear the memory to prepare for the next episode.
- Finally, we demonstrate how to use the function with the `CartPole-v1` environment.

**In Detail**

Imports:

    import gym

We use the `gym` library, which provides a suite of environments for reinforcement learning. This example uses the `CartPole-v1` environment.

Function Definition - `collect_experience_and_train`:

    def collect_experience_and_train(agent, memory, env, num_episodes, max_timesteps=1000):

This function collects experiences and trains the agent.

Parameters:
- `agent`: The PPO agent being trained.
- `memory`: The memory object that stores experiences.
- `env`: The environment the agent interacts with.
- `num_episodes`: Number of episodes to train for.
- `max_timesteps`: Maximum number of timesteps in each episode.

Episode Loop:

    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0

We loop through the specified number of episodes. For each episode, we reset the environment to get the initial state and set the episode reward to zero.

Timestep Loop:

    for t in range(max_timesteps):

Within each episode, a second loop iterates through timesteps. It runs for at most `max_timesteps` steps and ends earlier if the episode terminates (e.g., the pole falls).

Policy Evaluation & Action Selection:

    state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        action_dist, value = agent.policy(state_tensor)
        action = action_dist.sample().item()
        log_prob = action_dist.log_prob(torch.tensor([action]))

- We convert the current state to a PyTorch tensor and add a batch dimension.
- We pass the state through the policy network to get the action distribution and the value estimate.
- We sample an action from the action distribution and compute its log probability.
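
What sampling an action and computing its log probability amount to can be sketched without PyTorch. The helper names and the logits below are hypothetical; in the notebook, `torch.distributions.Categorical` does this work.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng=random):
    """Sample an action index and return it with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

random.seed(0)
action, log_prob = sample_action([0.2, -0.1])  # hypothetical logits for CartPole's two actions
print(action, log_prob)
```

The stored log probability is what later lets the update form the ratio between the new and old policy for the same action.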

Environment Step:

    next_state, reward, done, _ = env.step(action)

We take the selected action in the environment. This returns the next state, the reward for the action, and a flag `done` indicating whether the episode ended (e.g., the pole fell).
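
The four-value return shown here is the classic Gym step API. Gym 0.26 and later, as well as its successor Gymnasium, return five values from `step()` (`terminated` and `truncated` replace `done`), and `reset()` returns an `(observation, info)` pair. A small adapter along the following lines can bridge the two; `unpack_step` is a hypothetical helper written for this illustration, not part of either library.

```python
def unpack_step(step_result):
    """Normalize the tuple returned by env.step() to (next_state, reward, done).

    Hypothetical helper: accepts the classic 4-tuple
    (obs, reward, done, info) and the newer 5-tuple
    (obs, reward, terminated, truncated, info).
    """
    if len(step_result) == 5:
        next_state, reward, terminated, truncated, _info = step_result
        return next_state, reward, terminated or truncated
    next_state, reward, done, _info = step_result
    return next_state, reward, done

# Stand-in tuples instead of a real environment, so the sketch runs without gym installed.
print(unpack_step(([0.0, 0.1], 1.0, False, {})))
print(unpack_step(([0.0, 0.1], 1.0, False, True, {})))
```

Folding `truncated` into `done` is the simplest choice. It treats a time-limit cutoff like a real termination, which discards the bootstrap value for that last step.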

Store Experience:

    memory.append(state, action, value, log_prob, reward, not done)

We store the experience (state, action, value, log probability, reward, and continuation flag) in memory. The continuation flag `not done` is `False` only on the last step of an episode.

Update State and Reward:

    episode_reward += reward
    state = next_state

We add the new reward to the cumulative episode reward and set the current state to the next state.




In [12]:
import gym

def collect_experience_and_train(agent, memory, env, num_episodes, max_timesteps=1000):
    for episode in range(num_episodes):
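        # Classic Gym API (gym < 0.26): reset() returns only the observation.
        state = env.reset()
        episode_reward = 0
        for t in range(max_timesteps):
            # Add a batch dimension and query the policy without tracking gradients.
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0)
            with torch.no_grad():
                action_dist, value = agent.policy(state_tensor)
                action = action_dist.sample().item()
                log_prob = action_dist.log_prob(torch.tensor([action]))
            # Classic Gym API: step() returns (next_state, reward, done, info).
            next_state, reward, done, _ = env.step(action)

            # The mask is True while the episode continues and False on its last step.
            memory.append(state, action, value, log_prob, reward, not done)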

            episode_reward += reward
            state = next_state

            if done:
                break

        print(f"Episode {episode + 1} Reward: {episode_reward}")

        # After collecting experiences for an episode, train the agent
        agent.train(memory)

        # Clear the memory after training
        memory.clear()

env_name = "CartPole-v1"
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

agent = PPO(state_dim, action_dim, continuous=False)
memory = Memory()

collect_experience_and_train(agent, memory, env, num_episodes=100)


Episode 1 Reward: 20.0
Episode 2 Reward: 20.0
Episode 3 Reward: 15.0
Episode 4 Reward: 48.0
Episode 5 Reward: 17.0
Episode 6 Reward: 37.0
Episode 7 Reward: 26.0
Episode 8 Reward: 20.0
Episode 9 Reward: 16.0
Episode 10 Reward: 43.0
Episode 11 Reward: 13.0
Episode 12 Reward: 13.0
Episode 13 Reward: 16.0
Episode 14 Reward: 29.0
Episode 15 Reward: 30.0
Episode 16 Reward: 26.0
Episode 17 Reward: 10.0
Episode 18 Reward: 32.0
Episode 19 Reward: 36.0
Episode 20 Reward: 24.0
Episode 21 Reward: 42.0
Episode 22 Reward: 17.0
Episode 23 Reward: 15.0
Episode 24 Reward: 16.0
Episode 25 Reward: 40.0
Episode 26 Reward: 18.0
Episode 27 Reward: 26.0
Episode 28 Reward: 18.0
Episode 29 Reward: 14.0
Episode 30 Reward: 13.0
Episode 31 Reward: 17.0
Episode 32 Reward: 10.0
Episode 33 Reward: 17.0
Episode 34 Reward: 20.0
Episode 35 Reward: 13.0
Episode 36 Reward: 17.0
Episode 37 Reward: 44.0
Episode 38 Reward: 40.0
Episode 39 Reward: 59.0
Episode 40 Reward: 25.0
Episode 41 Reward: 14.0
Episode 42 Reward: 13.0
E