## Step 1: Initialize Parameters and Grid
We define the Q-learning hyperparameters (learning rate, discount factor, exploration rate `epsilon`, and number of episodes) and set up the grid environment. Each cell has a reward, and the available actions are `up`, `down`, `left`, and `right`.
- Install dependencies: `pip install pandas`
- Package versions: numpy-1.26.4, pandas-2.2.3

In [1]:
import random

In [2]:
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.5
episodes = 200
record_frequency = 50  # Record the Q-table every 50 episodes
grid_size = 3  # 3x3 grid

# Define rewards for each cell ID
cell_rewards = {1: 0, 2: -1, 3: 0, 4: 0, 5: -1, 6: 0, 7: 0, 8: 0, 9: 1}
actions = ['up', 'down', 'left', 'right']
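
The learning rate and discount factor enter through the tabular Q-learning update, which moves `Q(s, a)` a fraction `learning_rate` of the way toward `reward + discount_factor * max Q(s', ·)`. As a rough sketch of that arithmetic, the helper `q_update` below (a hypothetical name, not part of the notebook) applies the update twice to a transition into the goal cell, using the same 0.1 and 0.9 values as above:

```python
def q_update(q_sa, reward, max_future_q, learning_rate=0.1, discount_factor=0.9):
    """Return the new Q-value after one tabular Q-learning step."""
    td_target = reward + discount_factor * max_future_q
    return q_sa + learning_rate * (td_target - q_sa)

# First visit to a transition into the goal cell (reward 1); all Q-values are still 0.
first = q_update(q_sa=0.0, reward=1, max_future_q=0.0)
# Repeating the same transition moves the estimate closer to 1.
second = q_update(q_sa=first, reward=1, max_future_q=0.0)

print(round(first, 4), round(second, 4))  # 0.1 0.19
```

Each repetition closes 10% of the remaining gap to the target, which is why Q-values approach their limits gradually over many episodes.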

## Step 2: Define Dynamic Transition Logic
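Here, we compute transitions on the fly from the current cell ID and the chosen action. Cells are numbered 1 to 9 in rows of `grid_size`, so `up` adds `grid_size` to the cell ID and `down` subtracts it; a move that would leave the grid returns `None`.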

In [3]:
def get_transition(cell, action):
    """Return the cell reached from `cell` via `action`, or None if the move leaves the grid."""
    if action == 'up':
        return cell + grid_size if cell <= grid_size * (grid_size - 1) else None
    elif action == 'down':
        return cell - grid_size if cell > grid_size else None
    elif action == 'left':
        return cell - 1 if (cell - 1) % grid_size != 0 else None
    elif action == 'right':
        return cell + 1 if cell % grid_size != 0 else None
    return None  # Unknown action
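
To make the boundary checks concrete, the following sketch re-implements the same rules in a self-contained helper and prints the moves available from a corner, the centre, and the opposite corner. The name `moves_from` and the constant `GRID_SIZE` are assumptions introduced for illustration; the logic is meant to mirror `get_transition`, not replace it.

```python
GRID_SIZE = 3  # same 3x3 grid as in the notebook

def moves_from(cell, grid_size=GRID_SIZE):
    """Map each valid action to its destination cell (off-grid moves are omitted)."""
    moves = {}
    if cell <= grid_size * (grid_size - 1):  # not in the highest-numbered row
        moves['up'] = cell + grid_size
    if cell > grid_size:                     # not in the lowest-numbered row
        moves['down'] = cell - grid_size
    if (cell - 1) % grid_size != 0:          # not in the first column
        moves['left'] = cell - 1
    if cell % grid_size != 0:                # not in the last column
        moves['right'] = cell + 1
    return moves

for cell in (1, 5, 9):
    print(cell, moves_from(cell))
```

Cell 1 can only move `up` (to 4) or `right` (to 2), while cell 9 can only move `down` (to 6) or `left` (to 8).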

## Step 3: Initialize Q-table
Create a dictionary to store Q-values for all cell-action pairs and prepare storage for intermediate results.

In [4]:
q_table = {cell: {action: 0 for action in actions} for cell in range(1, grid_size * grid_size + 1)}
intermediate_results = {}  # Maps (cell, action) to the Q-values recorded during training
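
Because the table is a nested dictionary, a Q-value is read as `q_table[cell][action]`, and a greedy action is the key with the largest value. The sketch below rebuilds a table of the same shape and illustrates one detail worth knowing: while all values are tied at 0, Python's `max` returns the first key it encounters. The 0.5 assigned here is an illustrative value, not a training result.

```python
actions = ['up', 'down', 'left', 'right']
q_table = {cell: {a: 0 for a in actions} for cell in range(1, 10)}

# All values tie at 0, so max() returns the first key in insertion order.
best_before = max(q_table[1], key=q_table[1].get)
print(best_before)  # up

q_table[1]['right'] = 0.5  # illustrative value only
best_after = max(q_table[1], key=q_table[1].get)
print(best_after)  # right
```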

Display the initial Q-table.

In [5]:
print("Initial Q-Table:")
print(f"{'Cell':<10}{'Action':<10}{'Q-Value':<10}")
print('-' * 30)
for cell, actions_dict in q_table.items():
    for action, q_value in actions_dict.items():
        print(f"{cell:<10}{action:<10}{q_value:<10.4f}")
print("\n")

Initial Q-Table:
Cell      Action    Q-Value   
------------------------------
1         up        0.0000    
1         down      0.0000    
1         left      0.0000    
1         right     0.0000    
2         up        0.0000    
2         down      0.0000    
2         left      0.0000    
2         right     0.0000    
3         up        0.0000    
3         down      0.0000    
3         left      0.0000    
3         right     0.0000    
4         up        0.0000    
4         down      0.0000    
4         left      0.0000    
4         right     0.0000    
5         up        0.0000    
5         down      0.0000    
5         left      0.0000    
5         right     0.0000    
6         up        0.0000    
6         down      0.0000    
6         left      0.0000    
6         right     0.0000    
7         up        0.0000    
7         down      0.0000    
7         left      0.0000    
7         right     0.0000    
8         up        0.0000    
8         down      0.0000    
8         left      0.0000    
8         right     0.0000    
9         up        0.0000    
9         down      0.0000    
9         left      0.0000    
9         right     0.0000    

## Step 4: Q-learning Simulation
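Train the agent for `episodes` episodes with an epsilon-greedy policy, and record the Q-table every `record_frequency` episodes. Each episode starts at cell 1 and ends when the agent enters a cell with a nonzero reward (the penalty cells 2 and 5, or the goal cell 9).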

In [6]:
intermediate_results = {}

for episode in range(episodes):
    state = 1  # Start at cell 1
    # Keep moving until the agent enters a cell with a nonzero reward
    while state not in cell_rewards or cell_rewards[state] == 0:
        # Keep only the actions that stay on the grid
        valid_actions = {a: get_transition(state, a) for a in actions if get_transition(state, a) is not None}
        # Epsilon-greedy: explore with probability epsilon, otherwise take the best-known action
        action = (random.choice(list(valid_actions.keys()))
                  if random.uniform(0, 1) < epsilon
                  else max(valid_actions, key=lambda a: q_table[state][a]))
        new_state = valid_actions[action]
        reward = cell_rewards[new_state]
        # Q-learning update toward reward + discounted best Q-value of the next state
        max_future_q = max(q_table[new_state].values(), default=0)
        q_table[state][action] += learning_rate * (reward + discount_factor * max_future_q - q_table[state][action])
        state = new_state

    # Store intermediate Q-tables at the specified frequency
    if (episode + 1) % record_frequency == 0:
        for cell, actions_dict in q_table.items():
            for action, q_value in actions_dict.items():
                # Initialize a row for (cell, action) if it does not exist yet
                if (cell, action) not in intermediate_results:
                    intermediate_results[(cell, action)] = {}

                # Store the Q-value for the current episode
                intermediate_results[(cell, action)][f"Q-Value_{episode + 1}"] = round(q_value, 4)
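
The action choice inside the loop is an epsilon-greedy rule restricted to moves that stay on the grid. As a minimal sketch, the same rule can be isolated in a helper; `choose_action` and its `rng` parameter are hypothetical names added here so the behaviour can be checked with a fixed seed, and the Q-values are made up for illustration.

```python
import random

def choose_action(q_values, valid_actions, epsilon, rng=random):
    """Epsilon-greedy choice restricted to the actions that stay on the grid."""
    if rng.uniform(0, 1) < epsilon:
        return rng.choice(list(valid_actions))            # explore
    return max(valid_actions, key=lambda a: q_values[a])  # exploit

# Illustrative Q-values for one cell (not taken from a training run).
q_values = {'up': 0.7, 'down': 0.0, 'left': 0.0, 'right': -1.0}
valid = ['up', 'right']

greedy_pick = choose_action(q_values, valid, epsilon=0.0)
print(greedy_pick)  # up

rng = random.Random(0)  # fixed seed for repeatability
picks = [choose_action(q_values, valid, epsilon=0.5, rng=rng) for _ in range(1000)]
right_share = picks.count('right') / len(picks)
print(right_share)
```

With `epsilon = 0.5` and two valid actions, the worse action is still picked on roughly a quarter of the steps, because half of the steps explore and exploration chooses uniformly between the two moves.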

## Step 5: Display Intermediate Q-Tables
Use pandas to display the consolidated intermediate Q-tables in a tabular format.
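
As a small illustration of the reshaping step, the sketch below applies the same pandas calls to a toy dictionary with the same shape as `intermediate_results`. The values are made up, and `toy` is a hypothetical name. Tuple keys become a two-level index, which `reset_index()` turns into two ordinary columns that can then be renamed.

```python
import pandas as pd

# Toy data shaped like intermediate_results (values are made up).
toy = {
    (1, 'up'):    {'Q-Value_50': 0.25, 'Q-Value_100': 0.5},
    (1, 'right'): {'Q-Value_50': -0.75, 'Q-Value_100': -1.0},
}

# One row per (cell, action) key, one column per recorded episode.
df = pd.DataFrame.from_dict(toy, orient='index').reset_index()
df.columns = ['Cell', 'Action'] + list(df.columns[2:])
print(df.to_string(index=False))
```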

In [7]:
import pandas as pd

# Flatten the dictionary into a DataFrame with one row per (cell, action) pair
df = pd.DataFrame.from_dict(intermediate_results, orient="index").reset_index()
df.columns = ["Cell", "Action"] + list(df.columns[2:])
df = df.fillna(0)  # Replace any missing values with 0

# Display the DataFrame
print(df.to_string(index=False))

 Cell Action  Q-Value_50  Q-Value_100  Q-Value_150  Q-Value_200
    1     up      0.2758       0.5760       0.7072       0.7264
    1   down      0.0000       0.0000       0.0000       0.0000
    1   left      0.0000       0.0000       0.0000       0.0000
    1  right     -0.8649      -0.9836      -0.9966      -0.9993
    2     up      0.0000       0.0000       0.0000       0.0000
    2   down      0.0000       0.0000       0.0000       0.0000
    2   left      0.0000       0.0000       0.0000       0.0000
    2  right      0.0000       0.0000       0.0000       0.0000
    3     up      0.0000       0.0000       0.0000       0.0000
    3   down      0.0000       0.0000       0.0000       0.0000
    3   left      0.0000       0.0000       0.0000       0.0000
    3  right      0.0000       0.0000       0.0000       0.0000
    4     up      0.4207       0.7055       0.7966       0.8084
    4   down      0.0756       0.1969       0.4455       0.5412
    4   left      0.0000       0.0000   