
Environment and TD(0) Implementation
* This notebook implements the TD(0) algorithm for a simple environment in which an agent moves through a 1D grid of five states.
* The agent starts in a random non-terminal state (2, 3, or 4), and each action moves it one step left or right. States 1 and 5 are terminal.

In [2]:
import numpy as np
import random

* States 1 and 5 are terminal states.
* A reward of +1 is given for reaching state 5, and -1 for reaching state 1.
* Rewards for non-terminal transitions are 0.

In [3]:
class GridEnvironment:
    def __init__(self, num_states=5):
        self.num_states = num_states
        self.state = random.randint(2, num_states - 1)  # Start in a random non-terminal state
        self.terminal_states = [1, num_states]

    def reset(self):
        self.state = random.randint(2, self.num_states - 1)  # Reset to a random non-terminal state
        return self.state

    def step(self, action):
        """
        Move the agent based on the action.
        Action -1: Move left, Action +1: Move right.
        Returns (next_state, reward, done)
        """
        self.state += action
        if self.state in self.terminal_states:
            reward = 1 if self.state == self.num_states else -1
            done = True
        else:
            reward = 0
            done = False
        return self.state, reward, done
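
To make the transition and reward rules concrete, the sketch below rolls out a single episode under a uniformly random policy and prints each transition. The helper `random_walk_episode` is a hypothetical name introduced here for illustration. It mirrors the rules of `GridEnvironment.step` without depending on the class, and the fixed start state and seed are assumptions chosen only to make the trace repeatable.

```python
import random


def random_walk_episode(start, num_states=5, rng=None):
    """Roll out one episode of the 1D walk under a uniformly random policy.

    Hypothetical helper for illustration only. It follows the same rules as
    GridEnvironment.step: +1 on reaching the last state, -1 on reaching
    state 1, and 0 otherwise.
    Returns a list of (state, action, reward, next_state) tuples.
    """
    rng = rng or random.Random()
    state = start
    transitions = []
    while state not in (1, num_states):
        action = rng.choice([-1, 1])  # -1: move left, +1: move right
        next_state = state + action
        if next_state == num_states:
            reward = 1
        elif next_state == 1:
            reward = -1
        else:
            reward = 0
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


# Assumed start state and seed, chosen only for a repeatable trace.
episode = random_walk_episode(start=3, rng=random.Random(0))
for s, a, r, s_next in episode:
    print(f"s={s} a={a:+d} r={r:+d} s'={s_next}")
```

Every transition except the last carries a reward of 0, so the only learning signal in this environment comes from the final step of each episode.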

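After each step, the value estimate of the state the agent just left is updated incrementally using the TD(0) rule:
V(s) ← V(s) + α[r + γV(s′) − V(s)]
Here s is the current state, s′ is the next state, r is the reward received on the transition, α is the learning rate, and γ is the discount factor.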
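
As a small worked example of this rule, the sketch below applies two updates by hand, assuming all values start at 0 and using α = 0.1 and γ = 0.9 (the settings used later in this notebook). The function name `td0_update` is an illustrative assumption, not part of the notebook's code.

```python
def td0_update(v_s, reward, v_next, alpha, gamma):
    """Return the updated V(s) after a single TD(0) step (illustrative helper)."""
    td_target = reward + gamma * v_next  # bootstrapped estimate of the return
    td_error = td_target - v_s           # how far the current estimate is off
    return v_s + alpha * td_error


alpha, gamma = 0.1, 0.9

# Transition 4 -> 5 (terminal, reward +1), with V(4) = V(5) = 0 initially:
# V(4) <- 0 + 0.1 * (1 + 0.9 * 0 - 0) = 0.1
new_v4 = td0_update(v_s=0.0, reward=1, v_next=0.0, alpha=alpha, gamma=gamma)

# Later transition 3 -> 4 (reward 0), now bootstrapping from V(4) = 0.1:
# V(3) <- 0 + 0.1 * (0 + 0.9 * 0.1 - 0) = 0.009
new_v3 = td0_update(v_s=0.0, reward=0, v_next=new_v4, alpha=alpha, gamma=gamma)

print(f"V(4) = {new_v4:.3f}, V(3) = {new_v3:.3f}")
```

The second update shows how reward information propagates backward: state 3 never receives a reward directly, but its value moves because it bootstraps from the estimate of state 4.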

In [5]:
def td_zero(env, num_episodes, alpha, gamma):
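    # Index 0 is unused so that V[s] lines up with the 1-based state labels.
    # Terminal states are never updated, so their values stay at 0.
    V = np.zeros(env.num_states + 1)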

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            action = random.choice([-1, 1])  # Randomly choose to move left or right
            next_state, reward, done = env.step(action)
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])

            state = next_state  # Transition to next state

    return V

In [6]:
if __name__ == "__main__":
    env = GridEnvironment()
    num_episodes = 1000
    alpha = 0.1  # Learning rate
    gamma = 0.9  # Discount factor

    value_function = td_zero(env, num_episodes, alpha, gamma)
    print("Learned Value Function:")
    for state in range(1, env.num_states + 1):
        print(f"State {state}: {value_function[state]:.2f}")

Learned Value Function:
State 1: 0.00
State 2: -0.37
State 3: 0.01
State 4: 0.55
State 5: 0.00


* States 1 and 5 are terminal, so they are never updated and their values stay at the initial 0.
* State 2 has a negative value (-0.37) because it is adjacent to state 1 (reward -1) and far from the goal (state 5).
* State 3 has a value near zero (0.01): it is equidistant from both terminal states, so under the random policy the positive and negative outcomes roughly cancel.
* State 4 has a relatively high positive value (0.55) because it is adjacent to state 5 (reward +1).
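
As a sanity check on these estimates, the values under the uniformly random policy can also be computed exactly by solving the Bellman equations V = R + γPV for the three non-terminal states. The sketch below does this with NumPy for γ = 0.9. The names `P`, `R`, and `exact_values` are illustrative assumptions and do not appear in the code above.

```python
import numpy as np

gamma = 0.9

# Transition probabilities among the non-terminal states 2, 3, 4 under a
# uniformly random policy. Moves into terminal states are left out because
# terminal values are 0.
P = np.array([
    [0.0, 0.5, 0.0],  # from state 2: right to 3 (left leads to terminal 1)
    [0.5, 0.0, 0.5],  # from state 3: left to 2, right to 4
    [0.0, 0.5, 0.0],  # from state 4: left to 3 (right leads to terminal 5)
])

# Expected immediate reward from each non-terminal state:
# state 2 reaches state 1 (-1) half the time, state 4 reaches state 5 (+1)
# half the time.
R = np.array([-0.5, 0.0, 0.5])

# Solve (I - gamma * P) V = R for the exact state values.
exact_values = np.linalg.solve(np.eye(3) - gamma * P, R)

for state, value in zip((2, 3, 4), exact_values):
    print(f"State {state}: exact value {value:+.2f}")
```

By symmetry the exact values are -0.5, 0, and 0.5 for states 2, 3, and 4. The learned estimates (-0.37, 0.01, 0.55) sit near these targets; with a constant learning rate of 0.1, TD(0) estimates keep fluctuating around the true values rather than settling on them exactly.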