Q-learning is a reinforcement learning algorithm in which an agent (a robot, a software program, etc.) learns how to act by interacting with an environment: it tries different actions, observes the outcomes, and gradually favors the actions that lead toward a specific goal.

Because the agent learns through trial and error, updating its estimates from the environment's feedback, it does not need a model of the environment in advance. Q-learning is widely used in robotics, game playing, and other AI applications.

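Key Concepts:

Q-value (Q(s, a)): An estimate of the expected cumulative discounted reward for taking action a in state s and acting optimally afterwards. Q-values are the scores the agent uses to compare actions in a given state.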

Agent: The entity that makes decisions (e.g., a robot).

Environment: The world in which the agent operates (e.g., a grid, a maze).

State (s): A specific situation or configuration of the environment (e.g., the agent's position in the grid).

Action (a): Any possible move the agent can make (e.g., move left, right, up, down).

Reward (r): The positive or negative feedback the environment provides after an action (e.g., +10 for reaching the goal, -1 for each step taken).

Step 1: Define the Environment

In [1]:
import numpy as np

In [2]:
# define the environment
n_states = 16  # Number of states in the grid world
n_actions = 4  # Number of possible actions (up, down, left, right)
goal_state = 15  # Goal state

# Initialize Q-table with zeros
Q_table = np.zeros((n_states, n_actions))
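
The cell above only fixes the number of states. As a minimal sketch of how the 16 state indices could map onto a grid, assume a 4x4 layout numbered row by row. The layout and the helper names `state_to_cell` and `cell_to_state` are illustrative assumptions, not part of the original notebook:

```python
GRID_SIZE = 4  # assumed: 16 states arranged as a 4x4 grid, numbered row by row

def state_to_cell(state):
    """Convert a state index (0..15) into a (row, col) pair."""
    return divmod(state, GRID_SIZE)

def cell_to_state(row, col):
    """Convert a (row, col) pair back into a state index."""
    return row * GRID_SIZE + col

print(state_to_cell(15))    # (3, 3): the goal state sits in the bottom-right corner
print(cell_to_state(1, 2))  # 6
```

Under this assumption, the goal state 15 is the bottom-right cell and state 0 is the top-left cell.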

Step 2: Set Hyperparameters

In [3]:
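# Define hyperparameters
learning_rate = 0.85     # How far each update moves the old estimate toward the new target
discount_factor = 0.96   # Weight given to future rewards relative to immediate ones
exploration_prob = 0.2   # Probability of taking a random action (epsilon in epsilon-greedy)
epochs = 1000            # Number of training episodes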
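
To get a feel for what these values do, here is a small sketch of a single update on a hand-made example. The helper name `q_update` and the sample numbers are assumptions chosen for illustration; the formula is the same update rule applied in Step 3.

```python
def q_update(q_old, reward, max_q_next, learning_rate=0.85, discount_factor=0.96):
    """Return the Q-value after one Q-learning update."""
    # Target: immediate reward plus the discounted best estimate for the next state
    td_target = reward + discount_factor * max_q_next
    # Move the old estimate a fraction `learning_rate` of the way toward the target
    return q_old + learning_rate * (td_target - q_old)

# Hypothetical transition into a terminal goal: reward 1, nothing left to gain afterwards
first = q_update(q_old=0.0, reward=1, max_q_next=0.0)
second = q_update(q_old=first, reward=1, max_q_next=0.0)
print(round(first, 4), round(second, 4))
```

With a learning rate of 0.85, the first update covers 85% of the gap to the target and the second covers 85% of what remains, so the estimate approaches 1 within a few visits.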

Step 3: Implement the Q-Learning Algorithm

Methods for Determining Q-Values

Two ideas underlie how Q-values are computed:

Temporal Difference: Updates the estimate for the current state and action using the difference between that estimate and a target built from the observed reward and the estimated value of the next state.

Bellman’s Equation: A recursive formula introduced by Richard Bellman in 1957 that expresses the value of a state in a Markov Decision Process (MDP) in terms of the immediate reward and the discounted value of the states that follow. It underlies Q-learning and optimal decision-making more generally.
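
In standard notation, with α as the learning rate and γ as the discount factor (these symbols are the conventional ones, not taken from the original text), the Bellman optimality equation for Q-values and the temporal-difference update built on it can be written as:

```latex
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]

Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
```

The second line is the update the code cell below applies, with `learning_rate` playing the role of α and `discount_factor` the role of γ.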

In [4]:
# Q-learning algorithm
for epoch in range(epochs):
    current_state = np.random.randint(0, n_states)  # Start from a random state

    while current_state != goal_state:
        # Choose action with epsilon-greedy strategy
        if np.random.rand() < exploration_prob:
            action = np.random.randint(0, n_actions)  # Explore
        else:
            action = np.argmax(Q_table[current_state])  # Exploit

        # Simulate the environment. For simplicity, the chosen action is ignored
        # and the agent always advances to the next state index.
        next_state = (current_state + 1) % n_states

        # Define a simple reward function (1 if the goal state is reached, 0 otherwise)
        reward = 1 if next_state == goal_state else 0

        # Update Q-value using the Q-learning update rule
        td_target = reward + discount_factor * np.max(Q_table[next_state])
        Q_table[current_state, action] += learning_rate * (
            td_target - Q_table[current_state, action])

        current_state = next_state  # Move to the next state
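
In the loop above, the chosen action never influences `next_state`, so every action in a state ends up with nearly the same Q-value. A minimal sketch of an action-dependent transition follows, assuming a 4x4 grid numbered row by row and the action order up, down, left, right. The function name `step` and these conventions are assumptions for illustration, not the original notebook's environment:

```python
GRID_SIZE = 4    # assumed 4x4 layout for the 16 states
GOAL_STATE = 15  # same goal index as in Step 1

# Assumed action encoding: (row change, column change) for each action index
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    # Clamp to the grid so a move into a wall leaves the agent where it is
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    next_state = row * GRID_SIZE + col
    reward = 1 if next_state == GOAL_STATE else 0
    return next_state, reward

print(step(0, 3))   # (1, 0): moving right from the top-left corner
print(step(14, 3))  # (15, 1): moving right into the goal
print(step(0, 0))   # (0, 0): moving up from the top row hits the wall
```

One way to use it would be to replace the `next_state` and `reward` lines in the loop with `next_state, reward = step(current_state, int(action))`, so that the Q-values for different actions can actually diverge.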

Step 4: Output the Learned Q-Table

A Q-table is the lookup table at the heart of Q-learning. It has one row per state and one column per action (here 16 rows and 4 columns), and each entry is the current estimate of how much future reward the agent can expect from taking that action in that state. Think of it as a big table of scores that guides the agent toward the most beneficial move in each state.

In [5]:
# After training, the Q-table represents the learned Q-values
print("Learned Q-table:")
print(Q_table)

Learned Q-table:
[[0.53789767 0.56467331 0.55196816 0.        ]
 [0.58820137 0.58820122 0.58820136 0.58818795]
 [0.61270976 0.61270973 0.6127096  0.6127096 ]
 [0.63823933 0.63823933 0.63823931 0.63823931]
 [0.66483264 0.66483264 0.66483264 0.66483264]
 [0.692534   0.692534   0.692534   0.692534  ]
 [0.72138956 0.72138958 0.72138958 0.72138958]
 [0.75144748 0.75144748 0.75144748 0.75144748]
 [0.78275779 0.78275779 0.78275779 0.78275779]
 [0.8153727  0.8153727  0.8153727  0.8153727 ]
 [0.84934656 0.84934656 0.84934656 0.84934656]
 [0.884736   0.884736   0.884736   0.884736  ]
 [0.9216     0.9216     0.9216     0.9216    ]
 [0.96       0.96       0.96       0.96      ]
 [1.         1.         1.         1.        ]
 [0.         0.         0.         0.        ]]
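
The printed values follow the pattern one would expect from the discount factor: a state that is k steps before the transition into the goal approaches 0.96^k (1 for state 14, 0.96 for state 13, 0.9216 for state 12, and so on), and the goal row stays at zero because each episode ends there. A common next step is to read a greedy policy off the table by taking the best action in each row. The sketch below uses a small hand-made table rather than the one above, and the variable names and action order are illustrative assumptions:

```python
import numpy as np

# Hypothetical Q-table with 3 states and 4 actions, made up for illustration
example_q = np.array([
    [0.1, 0.5, 0.2, 0.0],
    [0.9, 0.3, 0.3, 0.1],
    [0.0, 0.0, 0.0, 0.7],
])
action_names = ["up", "down", "left", "right"]  # assumed action order

# Greedy policy: index of the highest-valued action in each row
greedy_actions = np.argmax(example_q, axis=1)
# State value under the greedy policy: the highest Q-value in each row
state_values = np.max(example_q, axis=1)

for state, action in enumerate(greedy_actions):
    print(state, action_names[action], state_values[state])
```

For the table learned above, each row is nearly constant because the simulated transition ignores the action, so the greedy action carries little information; the policy becomes meaningful only once actions affect the next state.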
