
In [10]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

Policy Network
* The policy network maps a state to a probability distribution over actions. It uses one hidden layer of 128 ReLU units followed by a softmax output.

In [11]:
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.fc(state)
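
As a quick check of what the network returns, the sketch below rebuilds the same layer layout inline and samples one action from its output. The sizes `state_dim = 4` and `action_dim = 2` are the CartPole-v1 dimensions and are assumed here for illustration; the all-zeros state is a hypothetical observation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Same layer layout as PolicyNetwork, rebuilt inline so the snippet runs on its own.
state_dim, action_dim = 4, 2  # CartPole-v1 sizes, assumed here for illustration
policy = nn.Sequential(
    nn.Linear(state_dim, 128),
    nn.ReLU(),
    nn.Linear(128, action_dim),
    nn.Softmax(dim=-1),
)

state = torch.zeros(state_dim)  # hypothetical observation
action_probs = policy(state)    # shape (action_dim,), entries sum to 1

dist = Categorical(action_probs)
action = dist.sample()            # integer action index
log_prob = dist.log_prob(action)  # log pi(action | state), used later in the loss

print(action_probs.shape, action.item(), log_prob.item())
```

Because the weights are randomly initialized, the sampled action and its log-probability differ from run to run; only the shape and the sum-to-one property are fixed.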

Policy Gradient Agent
* Selects actions using the policy network.
* Records rewards and log probabilities of actions.
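* Updates the policy at the end of each episode with a REINFORCE-style policy gradient step: each action's log probability is weighted by the normalized discounted return (discount factor 0.99) that followed it.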
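
The update can be written out as follows. This is a sketch of the standard REINFORCE loss as the agent code appears to implement it; the symbols $G_t$, $\hat{G}_t$, $\mu_G$, $\sigma_G$, and $\pi_\theta$ are notation introduced here, not names from the notebook. For an episode with rewards $r_0, \dots, r_{T-1}$ and $\gamma = 0.99$:

$$
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k, \qquad
\hat{G}_t = \frac{G_t - \mu_G}{\sigma_G + 10^{-8}}, \qquad
L(\theta) = -\sum_{t=0}^{T-1} \hat{G}_t \, \log \pi_\theta(a_t \mid s_t)
$$

Here $\mu_G$ and $\sigma_G$ are the mean and standard deviation of the returns within the episode. Minimizing $L(\theta)$ raises the probability of actions followed by above-average returns and lowers it for the rest.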

In [12]:
class PolicyGradientAgent:
    def __init__(self, state_dim, action_dim, lr=0.01, gamma=0.99):
        self.policy_net = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.gamma = gamma  # Discount factor for future rewards
        self.episode_rewards = []
        self.episode_log_probs = []

    def select_action(self, state):
        # Sample an action from the current policy and store its log probability
        state = torch.tensor(state, dtype=torch.float32)
        action_probs = self.policy_net(state)
        dist = Categorical(action_probs)
        action = dist.sample()
        self.episode_log_probs.append(dist.log_prob(action))
        return action.item()

    def record_reward(self, reward):
        self.episode_rewards.append(reward)

    def update_policy(self):
        # Compute discounted returns by walking the episode backwards
        R = 0.0
        discounted_rewards = []
        for reward in reversed(self.episode_rewards):
            R = reward + self.gamma * R
            discounted_rewards.append(R)
        discounted_rewards.reverse()

        discounted_rewards = torch.tensor(discounted_rewards, dtype=torch.float32)
        # Normalize returns to reduce variance; skip one-step episodes, where std() is NaN
        if len(discounted_rewards) > 1:
            discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-8)

        # Policy gradient loss: negative log probability weighted by the normalized return
        log_probs = torch.stack(self.episode_log_probs)
        policy_loss = -(log_probs * discounted_rewards).sum()

        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()

        # Clear the buffers for the next episode
        self.episode_rewards = []
        self.episode_log_probs = []
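
To make the return computation in `update_policy` concrete, here is a small worked example in plain Python. The helper name `discounted_returns` is hypothetical and the three-step reward list is made up for illustration; the recursion is the same backward pass the agent uses.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
print([round(g, 4) for g in returns])  # [2.9701, 1.99, 1.0]
```

Earlier steps get larger returns because they are credited with every reward that follows them, each discounted by one more factor of `gamma`.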

The cell below trains the policy gradient agent on the CartPole-v1 environment.
* In CartPole, the agent must keep a pole balanced on a cart by pushing the cart left or right.
* The policy is updated after every episode from the rewards collected during that episode. Training stops early once the episode reward stays at or above the target (195) for a set number of consecutive episodes (10); otherwise it ends after 1000 episodes.
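
The stopping rule can be sketched in isolation as a streak counter. The function name `first_solved_episode` and the reward sequence are hypothetical and used only to illustrate the logic; the defaults mirror the target and patience values used in the training cell.

```python
def first_solved_episode(episode_rewards, target_reward=195, patience=10):
    """Return the 1-based episode at which training would stop, or None."""
    consecutive_wins = 0
    for episode, reward in enumerate(episode_rewards, start=1):
        if reward >= target_reward:
            consecutive_wins += 1
            if consecutive_wins >= patience:
                return episode
        else:
            consecutive_wins = 0  # any miss resets the streak
    return None

# Hypothetical reward sequence, not taken from the training run
rewards = [200, 200, 50, 200, 200, 200]
print(first_solved_episode(rewards, target_reward=195, patience=3))  # 6
```

A single episode below the target resets the count, so the rule requires an unbroken run of good episodes rather than a good average.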

In [13]:
if __name__ == "__main__":
    import gym

    env = gym.make('CartPole-v1')
    agent = PolicyGradientAgent(state_dim=env.observation_space.shape[0], action_dim=env.action_space.n)

    target_reward = 195
    patience = 10  # Number of episodes to observe consistent performance

    consecutive_wins = 0
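    for episode in range(1000):
        # Assumes the classic Gym API (gym < 0.26), where reset() returns only the observation
        state = env.reset()
        episode_reward = 0

        for t in range(1, 10000):
            action = agent.select_action(state)
            # Classic Gym API: step() returns (observation, reward, done, info)
            next_state, reward, done, _ = env.step(action)
            agent.record_reward(reward)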

            state = next_state
            episode_reward += reward

            if done:
                break

        agent.update_policy()
        print(f"Episode {episode + 1}: Total Reward = {episode_reward}")

        # Early stopping
        if episode_reward >= target_reward:
            consecutive_wins += 1
            if consecutive_wins >= patience:
                print(f"Environment solved in {episode + 1} episodes!")
                break
        else:
            consecutive_wins = 0

    env.close()

Episode 1: Total Reward = 30.0
Episode 2: Total Reward = 41.0
Episode 3: Total Reward = 12.0
Episode 4: Total Reward = 18.0
Episode 5: Total Reward = 33.0
Episode 6: Total Reward = 10.0
Episode 7: Total Reward = 14.0
Episode 8: Total Reward = 10.0
Episode 9: Total Reward = 9.0
Episode 10: Total Reward = 11.0
Episode 11: Total Reward = 10.0
Episode 12: Total Reward = 10.0
Episode 13: Total Reward = 10.0
Episode 14: Total Reward = 13.0
Episode 15: Total Reward = 12.0
Episode 16: Total Reward = 12.0
Episode 17: Total Reward = 10.0
Episode 18: Total Reward = 12.0
Episode 19: Total Reward = 11.0
Episode 20: Total Reward = 12.0
Episode 21: Total Reward = 11.0
Episode 22: Total Reward = 12.0
Episode 23: Total Reward = 10.0
Episode 24: Total Reward = 22.0
Episode 25: Total Reward = 10.0
Episode 26: Total Reward = 10.0
Episode 27: Total Reward = 20.0
Episode 28: Total Reward = 13.0
Episode 29: Total Reward = 27.0
Episode 30: Total Reward = 19.0
Episode 31: Total Reward = 26.0
Episode 32: Total 

* The training run shows that the policy gradient agent solved the environment in 146 episodes, meaning it reached the stopping criterion of 10 consecutive episodes with a total reward of at least 195.
* Around Episode 74, performance begins to improve markedly, with episode rewards exceeding 100. This suggests the agent has started to find actions that keep the pole balanced for longer.
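
Per-episode rewards are noisy, so a trend like this is easier to see after smoothing. The sketch below is one possible way to do that with a simple moving average; the helper `moving_average` and the window size of 5 are assumptions made here, not part of the original notebook. The input is the first ten episode rewards from the output above.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# First ten episode rewards from the output above
rewards = [30.0, 41.0, 12.0, 18.0, 33.0, 10.0, 14.0, 10.0, 9.0, 11.0]
smoothed = moving_average(rewards, window=5)
print(smoothed)  # [26.8, 22.8, 17.4, 17.0, 15.2, 10.8]
```

Applied to the full reward history, the same helper would show where the smoothed curve starts to climb.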