# Pseudocode
Q-learning is a popular model-free reinforcement learning algorithm that learns the optimal Q-value function for an environment without requiring a model of the environment's dynamics. The Q-value function estimates the expected cumulative reward that can be obtained by taking a particular action in a particular state and following the optimal policy thereafter.
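
For reference, the optimal Q-value function described above is usually written as the Bellman optimality equation. The symbols are notational assumptions chosen to match the update rule used in this section: $r$ is the immediate reward, $s'$ the next state, $a'$ a candidate next action, and $\gamma$ the discount factor.

$$
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]
$$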

Here is the tabular Q-learning algorithm in pseudocode:

```
1. Initialize Q(s, a) arbitrarily for all state-action pairs
2. Repeat for each episode, until convergence (i.e., until the Q-values no longer change appreciably):
   a. Observe the initial state, s
   b. Repeat until the episode terminates:
      i. Choose an action, a, using an exploration strategy (e.g., epsilon-greedy)
      ii. Take action a and observe the resulting reward, r, and the new state, s'
      iii. Update the Q-value estimate using the formula:
           Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
           where alpha is the learning rate and gamma is the discount factor
      iv. Set the current state to the new state: s <- s'
```
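
As a worked instance of the update in step iii, the sketch below applies it once by hand. The 2-state Q-table, the transition, and the values of `alpha` and `gamma` are made up for illustration and are not taken from the example later in this section.

```python
import numpy as np

# Hypothetical 2-state, 2-action Q-table, used only for illustration.
q = np.array([[0.0, 0.0],
              [0.5, 1.0]])
alpha, gamma = 0.5, 0.9            # assumed learning rate and discount factor

s, a, r, s_next = 0, 1, 1.0, 1     # one observed transition (s, a, r, s')

td_target = r + gamma * np.max(q[s_next])   # 1.0 + 0.9 * 1.0 = 1.9
td_error = td_target - q[s, a]              # 1.9 - 0.0 = 1.9
q[s, a] += alpha * td_error                 # 0.0 + 0.5 * 1.9 = 0.95

print(q[s, a])
```

Only the entry for the visited pair `(s, a)` changes; every other entry of the table is left as it was.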

Q-learning gradually learns the optimal Q-value function by repeatedly moving each estimate Q(s, a) toward the observed reward plus the discounted maximum estimated Q-value of the next state. The exploration strategy balances exploration (trying actions whose value is still uncertain) against exploitation (taking the action currently estimated to be best). The learning rate, alpha, determines how far each update moves the estimate toward the new target: with alpha = 0 new information is ignored, and with alpha = 1 the old estimate is overwritten. The discount factor, gamma, weights future rewards, so that for gamma < 1 immediate rewards are worth more than delayed ones.
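
To make the role of the learning rate concrete, the following sketch repeats the update against a fixed target and records the estimate after each step. The helper `repeated_updates` and its parameters are hypothetical names introduced here, and holding the target fixed is a simplification: in Q-learning the target itself moves as the table changes.

```python
def repeated_updates(alpha, target=1.0, steps=3, q=0.0):
    """Apply q <- q + alpha * (target - q) `steps` times and return each value."""
    history = []
    for _ in range(steps):
        q += alpha * (target - q)
        history.append(q)
    return history

slow = repeated_updates(alpha=0.1)   # approximately [0.1, 0.19, 0.271]
fast = repeated_updates(alpha=0.5)   # [0.5, 0.75, 0.875]
print(slow, fast)
```

Each update shrinks the remaining gap to the target by a factor of `1 - alpha`, so a larger learning rate closes the gap in fewer steps but also reacts more strongly to any single noisy observation.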

```python
# Example environment: a deterministic chain with 6 states and 2 actions.
# State 5 is terminal; the only positive reward (+1) is for moving from state 4 to state 5.
class ExampleEnvironment:
    # TRANSITIONS[state][action] -> (next_state, reward, done)
    TRANSITIONS = {
        0: {0: (1, 0, False), 1: (0, -1, False)},
        1: {0: (0, -1, False), 1: (2, 0, False)},
        2: {0: (3, 0, False), 1: (1, 0, False)},
        3: {0: (2, 0, False), 1: (4, 0, False)},
        4: {0: (5, 1, True), 1: (3, 0, False)},
        5: {0: (5, 0, True), 1: (5, 0, True)},
    }

    def __init__(self):
        self.num_states = 6
        self.num_actions = 2
        self.state = 0

    def reset(self):
        # Start every episode in state 0
        self.state = 0
        return self.state

    def step(self, action):
        # Look up the deterministic outcome of taking `action` in the current state
        self.state, reward, done = self.TRANSITIONS[self.state][int(action)]
        return self.state, reward, done, {}
```


```python
import numpy as np

class QLearningAgent:
    def __init__(self, alpha, discount, epsilon, num_states, num_actions):
        self.alpha = alpha  # learning rate
        self.discount = discount  # discount factor
        self.epsilon = epsilon  # exploration rate
        self.num_states = num_states
        self.num_actions = num_actions
        self.q_table = np.zeros((num_states, num_actions))  # Q(s, a), initialized to 0

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.uniform(0, 1) < self.epsilon:
            # Choose a random action
            action = np.random.choice(self.num_actions)
        else:
            # Choose the action with the highest Q-value (ties go to the lowest index)
            action = np.argmax(self.q_table[state])
        return action

    def learn(self, state, action, reward, next_state):
        # TD target: reward plus the discounted value of the best action in next_state
        td_target = reward + self.discount * np.max(self.q_table[next_state])
        # Move Q(state, action) a fraction alpha of the way toward the target
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error
```
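
The action-selection rule in `choose_action` can be tried in isolation. The sketch below is a standalone variant, not the class method itself: `epsilon_greedy`, `q_row`, and the seeded `rng` are names assumed for illustration, and it uses NumPy's `Generator` API rather than the global `np.random` functions so that the run is repeatable.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index given the Q-values of the current state (q_row)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform random action
    return int(np.argmax(q_row))               # exploit: best-known action

rng = np.random.default_rng(0)                 # seeded for repeatability
q_row = np.array([0.2, 0.7])                   # hypothetical Q-values for one state

# With epsilon = 0.1 and two actions, about 95% of choices should be the greedy
# action (index 1): 90% from exploitation plus half of the 10% random picks.
choices = [epsilon_greedy(q_row, 0.1, rng) for _ in range(10_000)]
greedy_fraction = choices.count(1) / len(choices)
print(greedy_fraction)
```

Setting `epsilon` to 0 makes the rule purely greedy, and setting it to 1 makes it purely random.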


```python
if __name__ == "__main__":
    env = ExampleEnvironment()
    agent = QLearningAgent(alpha=0.5, discount=0.99, epsilon=0.1,
                           num_states=env.num_states, num_actions=env.num_actions)
    num_episodes = 1000
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.learn(state, action, reward, next_state)
            state = next_state

    # print(agent.q_table)
```


```python
import pandas as pd

# Label the learned Q-table: one row per state, one column per action
column_names = ['action_0', 'action_1']
row_names = ['State_0', 'State_1', 'State_2', 'State_3', 'State_4', 'State_5']
df = pd.DataFrame(agent.q_table, columns=column_names, index=row_names)
df
```

| | action_0 | action_1 |
|---|---|---|
| State_0 | 0.960596 | -0.04901 |
| State_1 | -0.04901 | 0.970299 |
| State_2 | 0.9801 | 0.960596 |
| State_3 | 0.970299 | 0.99 |
| State_4 | 1.0 | 0.9801 |
| State_5 | 0.0 | 0.0 |
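
One way to read the learned values is to extract the greedy policy and compare the best value in each state with what the discount factor predicts. The sketch below copies the Q-values reported above into a literal array so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

# Q-values copied from the output above (rows: states 0-5, columns: actions 0-1).
q_table = np.array([
    [0.960596, -0.04901],
    [-0.04901, 0.970299],
    [0.9801, 0.960596],
    [0.970299, 0.99],
    [1.0, 0.9801],
    [0.0, 0.0],
])

# Greedy policy: the highest-valued action in each state.
greedy_policy = np.argmax(q_table, axis=1)
print(greedy_policy)  # [0 1 0 1 0 0]

# On the greedy path the +1 reward arrives k steps later, so the value is gamma ** k.
gamma = 0.99
greedy_values = q_table.max(axis=1)[:5]        # states 0-4 (state 5 is terminal)
expected = gamma ** np.arange(4, -1, -1)       # gamma^4, gamma^3, ..., gamma^0
print(np.allclose(greedy_values, expected, atol=1e-6))  # True
```

The policy alternates actions 0 and 1 because that is the sequence that walks the chain from state 0 to state 5. The entry for state 5 is a tie at zero and carries no meaning, since no action is ever chosen from the terminal state.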
