# Reinforcement Learning

## Summary
- This notebook demonstrates tabular Q-learning on a 5x5 grid world.
- The agent starts in the top-left corner of the grid and must reach the goal in the bottom-right corner.
- Reaching the goal yields a reward of +1; every other move costs -0.1, including moves into a wall that leave the agent in place.
- The Q-table is updated over 1000 episodes to learn a shortest path from start to goal.
- After training, the agent follows the greedy policy read from the Q-table to navigate the grid.
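
The reward scheme fixes the best possible episode return, which is a useful reference when reading the training log. A minimal sketch of that arithmetic, assuming an open grid with only up/down/left/right moves and a start that differs from the goal (the helper name `optimal_return` is illustrative and not part of the notebook):

```python
def optimal_return(start, goal, step_penalty=-0.1, goal_reward=1.0):
    """Return (number of moves, total reward) for a shortest path on an open grid."""
    # With four-directional moves and no obstacles, the shortest path length
    # is the Manhattan distance between start and goal.
    moves = abs(goal[0] - start[0]) + abs(goal[1] - start[1])
    # Every move except the last is penalized; the last one earns the goal reward.
    total = (moves - 1) * step_penalty + goal_reward
    return moves, total


moves, total = optimal_return((0, 0), (4, 4))
print(moves, round(total, 2))  # 8 0.3
```

An episode total close to 0.3 therefore indicates that the agent took one of the 8-move shortest paths.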

## Import Libraries

In [6]:
import numpy as np
import random
import time

## Define the environment
### (a simple 5x5 grid)

In [12]:
class GridWorld:
    def __init__(self):
        self.grid_size = 5
        self.state = (0, 0)  # Start in the top-left corner
        self.goal = (4, 4)   # Goal is in the bottom-right corner
        
    def reset(self):
        # Reset the environment to the starting state (top-left corner)
        self.state = (0, 0)
        return self.state
    
    def step(self, action):
        # Take an action and move to the next state
        # Action can be: 0 (up), 1 (down), 2 (left), 3 (right)
        row, col = self.state
        
        if action == 0:   # Move up
            row = max(0, row - 1)
        elif action == 1: # Move down
            row = min(self.grid_size - 1, row + 1)
        elif action == 2: # Move left
            col = max(0, col - 1)
        elif action == 3: # Move right
            col = min(self.grid_size - 1, col + 1)
        
        # Update the current state
        self.state = (row, col)
        
        # Define reward structure: +1 for reaching the goal, -0.1 for every other step
        reward = 1 if self.state == self.goal else -0.1
        done = self.state == self.goal
        
        return self.state, reward, done

    def get_num_states(self):
        # Return the total number of possible states in the grid
        return self.grid_size * self.grid_size
    
    def get_num_actions(self):
        # Return the number of possible actions (up, down, left, right)
        return 4

    def render(self):
        # Draw the grid to show the agent's current position
        grid = [['_' for _ in range(self.grid_size)] for _ in range(self.grid_size)]
        row, col = self.state
        grid[row][col] = 'A'  # Mark the agent's position
        grid[self.goal[0]][self.goal[1]] = 'G'  # Mark the goal position
        
        for line in grid:
            print(' '.join(line))
        print('\n')
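
The boundary handling in `step` can be checked in isolation. The sketch below re-implements only the clamping logic as a standalone function (`clamped_move` is a hypothetical helper, not a method of `GridWorld`) to show that a move into a wall leaves the agent where it is:

```python
def clamped_move(state, action, grid_size=5):
    """Apply action 0 (up), 1 (down), 2 (left), 3 (right) and stay inside the grid."""
    row, col = state
    if action == 0:
        row = max(0, row - 1)
    elif action == 1:
        row = min(grid_size - 1, row + 1)
    elif action == 2:
        col = max(0, col - 1)
    elif action == 3:
        col = min(grid_size - 1, col + 1)
    return (row, col)


print(clamped_move((0, 0), 0))  # (0, 0): moving up from the top row is a no-op
print(clamped_move((0, 0), 3))  # (0, 1): moving right from the corner works
print(clamped_move((4, 3), 1))  # (4, 3): moving down from the bottom row is a no-op
```

In the environment above such a blocked move still returns a reward of -0.1, so bumping into walls is penalized like any other non-goal step.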

        


## Initialize the environment

In [13]:
env = GridWorld()

## Reinforcement Learning

In [14]:
# Parameters for Q-learning
alpha = 0.1          # Learning rate - how much we update the Q-values on each step
gamma = 0.9          # Discount factor - how much we consider future rewards
epsilon = 0.1        # Exploration rate - probability of choosing a random action (exploration)
num_episodes = 1000  # Number of episodes - how many times we run through the environment

# Initialize Q-table
num_states = env.get_num_states()
num_actions = env.get_num_actions()
Q_table = np.zeros((num_states, num_actions))  # Q-table with states as rows and actions as columns

def state_to_index(state, grid_size):
    # Convert a (row, col) state into a single index for the Q-table
    return state[0] * grid_size + state[1]

# Q-learning loop - training the agent
for episode in range(num_episodes):
    state = env.reset()  # Reset the environment for each episode
    done = False
    total_reward = 0  # Track total reward for the episode
    
    if episode % 100 == 0:
        print(f"Episode: {episode}")
        env.render()  # Render the grid to show the agent's current position
    
    while not done:
        state_index = state_to_index(state, env.grid_size)
        
        # Choose action using epsilon-greedy strategy
        # With probability epsilon, choose a random action (exploration)
        # Otherwise, choose the best known action (exploitation)
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, num_actions - 1)  # Explore
        else:
            action = np.argmax(Q_table[state_index])     # Exploit
        
        # Take action and observe the result
        next_state, reward, done = env.step(action)
        next_state_index = state_to_index(next_state, env.grid_size)
        total_reward += reward  # Accumulate reward for the episode
        
        # Update Q-value using the Q-learning update rule
        # Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a))
        best_next_value = np.max(Q_table[next_state_index])  # Value (not index) of the best action in s'
        td_target = reward + gamma * best_next_value
        Q_table[state_index, action] += alpha * (td_target - Q_table[state_index, action])
        
        # Move to the next state
        state = next_state
    
    # Print the total reward for every 100th episode
    if episode % 100 == 0:
        print(f"Total Reward after Episode {episode}: {total_reward}\n")

# # Demonstration of the trained agent's performance
# state = env.reset()  # Start from the initial state
# done = False
# print("\nTrained agent's path:")
# while not done:
#     env.render()  # Render the grid to show the agent's current position
#     time.sleep(0.1)  # Add a small delay to visualize the path
#     state_index = state_to_index(state, env.grid_size)
#     # Choose the best action based on the trained Q-table
#     action = np.argmax(Q_table[state_index])
#     state, _, done = env.step(action)
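
To make the update rule concrete, here is a single update worked by hand. This is only a sketch: it uses the same `alpha` and `gamma` as the training cell, `q_update` is an illustrative helper, and the Q-values passed in are chosen for the example rather than taken from a real run.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_q_next  # bootstrapped estimate of the return
    td_error = td_target - q_sa              # how far the current estimate is off
    return q_sa + alpha * td_error


# First step of training: the table is all zeros and the move costs -0.1.
print(round(q_update(0.0, -0.1, 0.0), 4))  # -0.01

# A move that lands on the goal from an untrained entry: the reward is +1.
print(round(q_update(0.0, 1.0, 0.0), 4))   # 0.1
```

Because an episode ends as soon as the goal is reached, the goal state's row in the Q-table is never updated, so `max_q_next` stays 0 for transitions into the goal.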



Episode: 0
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 0: -9.999999999999977

Episode: 100
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 100: 0.10000000000000009

Episode: 200
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 200: 0.20000000000000007

Episode: 300
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 300: 0.30000000000000004

Episode: 400
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 400: 0.30000000000000004

Episode: 500
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 500: 0.10000000000000009

Episode: 600
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 600: 0.10000000000000009

Episode: 700
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


Total Reward after Episode 700: 0.10000000000000009

Episode: 800
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G
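
Since every non-goal move costs 0.1 and the final move earns 1, the logged totals can be converted back into episode lengths. A small sketch of that conversion (the helper `steps_from_total_reward` is hypothetical and assumes the episode ended at the goal, which the training loop guarantees):

```python
def steps_from_total_reward(total, step_penalty=-0.1, goal_reward=1.0):
    """Recover the number of moves in an episode that ended at the goal."""
    # total = (steps - 1) * step_penalty + goal_reward, solved for steps.
    # round() absorbs the floating-point noise visible in the log.
    return round((total - goal_reward) / step_penalty) + 1


print(steps_from_total_reward(-9.999999999999977))   # 111
print(steps_from_total_reward(0.10000000000000009))  # 10
print(steps_from_total_reward(0.30000000000000004))  # 8
```

Read this way, episode 0 took 111 moves, while the totals of 0.3 correspond to the 8-move shortest path. Totals of 0.2 and 0.1 mean one or two extra moves, which is plausible while `epsilon = 0.1` exploration is still active during training.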

## Test the trained agent

In [16]:
# Test the trained agent: run a few greedy episodes (no exploration)
# and count how many of them reach the goal
num_test_episodes = 2
successful_episodes = 0

for test_episode in range(num_test_episodes):
    state = env.reset()
    done = False
    steps = 0
    print(f"\nTest Episode {test_episode + 1}:")
    
    while not done and steps < 50:  # Limit the number of steps to avoid infinite loops
        env.render()  # Render the grid to show the agent's current position
        time.sleep(0.1)  # Add a small delay to visualize the path
        state_index = state_to_index(state, env.grid_size)
        action = np.argmax(Q_table[state_index])  # Use the trained policy
        state, _, done = env.step(action)
        steps += 1
    
    if done:
        successful_episodes += 1

print(f"\nThe trained agent successfully reached the goal in {successful_episodes}/{num_test_episodes} test episodes.")


Test Episode 1:
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


_ A _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ A _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ A _ _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ _ A _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ A _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ A _ G


_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ A G



Test Episode 2:
A _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


_ A _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ A _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ A _ _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ _ A _ _
_ _ _ _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ A _ _
_ _ _ _ G


_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ A _ G


_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ A G



The trained agent successfully reached the goal in 2/2 test episodes.
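
The step limit in the test loop matters whenever the greedy policy does not lead to the goal. `np.argmax` returns the first index among tied maxima, so an all-zero Q-row always selects action 0 (up). The sketch below mimics that with a pure-Python stand-in (`first_argmax` and `greedy_rollout` are hypothetical helpers, and the Q-table is a plain list of lists rather than the NumPy array used above) and shows an untrained table getting stuck in the start corner:

```python
def first_argmax(values):
    """Index of the first maximum, matching how np.argmax breaks ties."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_rollout(q_table, grid_size=5, max_steps=50):
    """Follow the greedy policy from (0, 0); return (reached_goal, steps_taken)."""
    row, col = 0, 0
    goal = (grid_size - 1, grid_size - 1)
    for steps in range(1, max_steps + 1):
        action = first_argmax(q_table[row * grid_size + col])
        if action == 0:    # up
            row = max(0, row - 1)
        elif action == 1:  # down
            row = min(grid_size - 1, row + 1)
        elif action == 2:  # left
            col = max(0, col - 1)
        else:              # right
            col = min(grid_size - 1, col + 1)
        if (row, col) == goal:
            return True, steps
    return False, max_steps


# An untrained table: every action ties at 0, so the agent keeps moving up.
untrained = [[0.0] * 4 for _ in range(25)]
print(greedy_rollout(untrained))  # (False, 50)
```

Without the cap, such a rollout would never terminate, which is why the test loop stops after 50 steps and counts the episode as unsuccessful.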
