
Markov Decision Process (MDP)

This notebook solves a Markov Decision Process (MDP) using policy iteration, a fundamental algorithm in reinforcement learning (RL). The goal is to find the optimal policy and the state-value function for a given MDP, which models a decision-making problem in a stochastic environment.
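The two steps that policy iteration alternates between can be written out as below. The notation is an assumption made here for illustration: $P(s' \mid s, a)$ is the transition probability, $R(s, a)$ the reward, $\gamma$ the discount factor, and $\pi(s)$ the action the current policy picks in state $s$.

$$
V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s)) + \gamma V^{\pi}(s') \right]
$$

$$
\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a) + \gamma V^{\pi}(s') \right]
$$

The first equation is policy evaluation and the second is policy improvement. The process stops once $\pi' = \pi$ for every state.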

In [None]:
import numpy as np

In [None]:
class MDP:
  def __init__(self, states, actions, rewards, transition_probs, discount_factor=0.9):
    """Store the states, actions, rewards, transition probabilities, and discount factor.
       Random policy initialization: each state s is assigned a random action from the actions list."""
    self.states = states # list of states
    self.actions = actions # list of actions
    self.rewards = rewards # dict: rewards[s][a] is the reward for taking action a in state s
    self.transition_probs = transition_probs # dict: transition_probs[s][a][s'] is the probability of moving from state s to state s' under action a
    self.discount_factor = discount_factor # discount factor (gamma) for future rewards
    self.policy = {s: np.random.choice(actions) for s in states} # random initial policy

  def policy_evaluation(self, theta=1e-6):
    # Goal: compute the state-value function V(s) for the current policy.
    V = {s: 0 for s in self.states} # initialize state values with 0

    while True:
      delta = 0
      for s in self.states:
        v = V[s]
        a = self.policy[s]

        # Bellman expectation equation
        V[s] = sum(self.transition_probs[s][a][s_prime] * (self.rewards[s][a] + self.discount_factor * V[s_prime])
        for s_prime in self.states)

        delta = max(delta, abs(v - V[s])) # Check for convergence
      if delta < theta:
        break
    return V

  def policy_improvement(self, V):
    # Goal: improve the current policy by choosing, in each state, the action that maximizes the expected value.

    policy_stable = True
    for s in self.states:
      old_action = self.policy[s]

      # Find the action that maximizes expected value
      self.policy[s] = max(self.actions, key=lambda a:sum(
          self.transition_probs[s][a][s_prime] * (self.rewards[s][a] + self.discount_factor * V[s_prime])
          for s_prime in self.states))

      if old_action != self.policy[s]:
        policy_stable = False
    return policy_stable

  def policy_iteration(self):
    # Policy iteration: alternate between policy evaluation and policy improvement until the policy is stable (optimal)
    while True:
      V = self.policy_evaluation()
      policy_stable = self.policy_improvement(V)
      if policy_stable:
        return self.policy, V


MDP Components

* States (states): A list of possible states in the environment (s1, s2, s3).
* Actions (actions): A list of actions available in each state (a1, a2).
* Rewards (rewards): A dictionary where rewards[s][a] gives the reward for taking action a in state s.
* Transition Probabilities (transition_probs): A dictionary where transition_probs[s][a][s'] is the probability of moving to state s' from state s after taking action a.
* Discount Factor (discount_factor): Denoted as gamma (γ), it balances immediate and future rewards.
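
For the transition probabilities to define a valid MDP, the probabilities for each state-action pair should sum to 1. The sketch below shows one way to check this before running policy iteration. The helper name `check_transition_probs` and the toy two-state data are hypothetical and not part of the original notebook.

```python
import math

def check_transition_probs(states, actions, transition_probs, tol=1e-9):
    """Return the (s, a) pairs whose outgoing probabilities do not sum to 1."""
    bad = []
    for s in states:
        for a in actions:
            # Missing next states are treated as probability 0
            total = sum(transition_probs[s][a].get(s_next, 0.0) for s_next in states)
            if not math.isclose(total, 1.0, abs_tol=tol):
                bad.append((s, a))
    return bad

# Toy two-state example (illustrative values, not the notebook's MDP)
toy_states = ['x', 'y']
toy_actions = ['go']
toy_probs = {
    'x': {'go': {'x': 0.25, 'y': 0.75}},  # sums to 1.0: valid
    'y': {'go': {'x': 0.5, 'y': 0.25}},   # sums to 0.75: invalid
}

invalid_pairs = check_transition_probs(toy_states, toy_actions, toy_probs)
print(invalid_pairs)
```

An empty list means every row is a proper probability distribution. Any pair that is returned points at the entry to fix.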

In [None]:
states = ['s1', 's2', 's3']
actions = ['a1', 'a2']
rewards = {
    's1': {'a1': 5, 'a2': 10},
    's2': {'a1': -1, 'a2': 2},
    's3': {'a1': 0, 'a2': 0}
}
transition_probs = {
    's1': {'a1': {'s1': 0.7, 's2': 0.3, 's3': 0.0}, 'a2': {'s1': 0.4, 's2': 0.6, 's3': 0.0}},
    's2': {'a1': {'s1': 0.1, 's2': 0.6, 's3': 0.3}, 'a2': {'s1': 0.0, 's2': 0.9, 's3': 0.1}},
    's3': {'a1': {'s1': 0.0, 's2': 0.0, 's3': 1.0}, 'a2': {'s1': 0.0, 's2': 0.0, 's3': 1.0}}
}

# Create an MDP instance and run policy iteration
mdp = MDP(states, actions, rewards, transition_probs, discount_factor=0.9)
optimal_policy, optimal_value = mdp.policy_iteration()

print("Optimal Policy:", optimal_policy)
print("Optimal State Values:", optimal_value)


Optimal Policy: {'s1': 'a2', 's2': 'a2', 's3': 'a1'}
Optimal State Values: {'s1': 24.506574930441033, 's2': 10.526312442034197, 's3': 0.0}
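
The printed values can be cross-checked without iterating. For a fixed policy the Bellman expectation equation is linear, so V solves (I - γP)V = R, where P and R are the transition matrix and reward vector under that policy. The sketch below is an independent check rather than part of the original notebook; the names `P_pi`, `R_pi`, and `V_exact` are introduced here for illustration.

```python
import numpy as np

states = ['s1', 's2', 's3']
gamma = 0.9
policy = {'s1': 'a2', 's2': 'a2', 's3': 'a1'}  # policy reported in the output above
rewards = {
    's1': {'a1': 5, 'a2': 10},
    's2': {'a1': -1, 'a2': 2},
    's3': {'a1': 0, 'a2': 0},
}
transition_probs = {
    's1': {'a1': {'s1': 0.7, 's2': 0.3, 's3': 0.0}, 'a2': {'s1': 0.4, 's2': 0.6, 's3': 0.0}},
    's2': {'a1': {'s1': 0.1, 's2': 0.6, 's3': 0.3}, 'a2': {'s1': 0.0, 's2': 0.9, 's3': 0.1}},
    's3': {'a1': {'s1': 0.0, 's2': 0.0, 's3': 1.0}, 'a2': {'s1': 0.0, 's2': 0.0, 's3': 1.0}},
}

# Transition matrix and reward vector under the fixed policy
P_pi = np.array([[transition_probs[s][policy[s]][s_next] for s_next in states]
                 for s in states])
R_pi = np.array([rewards[s][policy[s]] for s in states], dtype=float)

# Solve (I - gamma * P_pi) V = R_pi directly
V_exact = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_pi)
for s, v in zip(states, V_exact):
    print(s, round(float(v), 4))
```

The direct solution should agree with the iterative result up to the convergence threshold theta used in policy_evaluation.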


The optimal action for the agent in each state is as follows:
* In state s1, the best action is a2.
* In state s2, the best action is a2.
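* In state s3, the best action is a1. Because s3 is absorbing with zero reward, a2 is equally good there; a1 is simply the first action in the list.
* Following these actions in each state maximizes the agent's long-term (discounted) reward.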
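
To see why these actions are chosen, the expected value of each action can be compared state by state, using the same expression that policy_improvement maximizes. The sketch below is illustrative: the function `q_value` and the table `q_table` are names introduced here, and V is rounded from the printed output.

```python
gamma = 0.9
states = ['s1', 's2', 's3']
actions = ['a1', 'a2']
rewards = {
    's1': {'a1': 5, 'a2': 10},
    's2': {'a1': -1, 'a2': 2},
    's3': {'a1': 0, 'a2': 0},
}
transition_probs = {
    's1': {'a1': {'s1': 0.7, 's2': 0.3, 's3': 0.0}, 'a2': {'s1': 0.4, 's2': 0.6, 's3': 0.0}},
    's2': {'a1': {'s1': 0.1, 's2': 0.6, 's3': 0.3}, 'a2': {'s1': 0.0, 's2': 0.9, 's3': 0.1}},
    's3': {'a1': {'s1': 0.0, 's2': 0.0, 's3': 1.0}, 'a2': {'s1': 0.0, 's2': 0.0, 's3': 1.0}},
}
# State values rounded from the printed output
V = {'s1': 24.5066, 's2': 10.5263, 's3': 0.0}

def q_value(s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(transition_probs[s][a][s_next] * (rewards[s][a] + gamma * V[s_next])
               for s_next in states)

q_table = {s: {a: round(q_value(s, a), 2) for a in actions} for s in states}
for s, row in q_table.items():
    print(s, row)
```

In s1 and s2 the value of a2 is clearly higher (roughly 24.51 versus 23.28 in s1, and 10.53 versus 6.89 in s2). In s3 both actions come out at 0, so the choice between them is a tie.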

Optimal State Values

These are the state values V(s): the expected total discounted reward (with γ = 0.9) the agent can achieve starting from each state and following the optimal policy.
* For state s1, the expected total reward is 24.51