## Where deep learning comes into play

<img src="QDL.png" alt="image" width="300"> <img src="Improved-QDL-arch.png" alt="image" width="300">


We want a *supervised learning model* that takes a state and an action and outputs the value of that pair.
By computing the values of all actions in a state, we can then choose the best action for that state.
But how does the model train? It uses a training set built from the Bellman equation.

* The agent takes actions, either the one with the highest q_value or a random one (according to epsilon), and each experience is stored in the replay buffer
* We calculate the target q_value with the Bellman equation
* We use these target q_values as the training set to fit the model's new estimates, q_new (replay method)
* Repeat
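
The first step above, choosing between the best-known action and a random one, could be sketched as follows. This is a minimal illustration, not the notebook's own code: the function name `epsilon_greedy` and the example q_values are assumptions made up for the sketch.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: highest q_value

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.5, 0.2])  # hypothetical model output for one state

# With epsilon = 0 the choice is always greedy, so this selects action 1.
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))
```

A larger epsilon makes the agent explore more; it is usually lowered as training progresses.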

## Simple flow-chart


<img src="QDL.png" alt="image" width="300"> <img src="Improved-QDL-arch.png" alt="image" width="300">



## Calculating the target q_value

* Does the batch_size equal one episode?
* Does the target q_value reset each episode?


In [186]:
# Assuming we already have the replay buffer, this is how to calculate the target q_value using the Bellman equation.
# This creates the dataset for the model to train on.
class QLearningAgent:
    def __init__(self, num_states, num_actions, gamma=0.99):
        self.num_states = num_states
        self.num_actions = num_actions
        self.gamma = gamma
        # Initialize Q-values as a dictionary with zeros
        self.Q_values = {state: [0] * num_actions for state in range(num_states)}

    def update_q_value(self, state, action, reward, next_state):
        max_next_q_value = max(self.Q_values[next_state])
        self.Q_values[state][action] = reward + self.gamma * max_next_q_value

    def display_q_values(self):
        print("Q-values:")
        for state, q_values in self.Q_values.items():
            print(f"State {state}: {q_values}")

# Define the environment parameters
num_states = 3
num_actions = 2

# Initialize the Q-learning agent
agent = QLearningAgent(num_states, num_actions)

# Define a list of experiences (state, action, reward, next_state).
# This is the "replay buffer", treated here as a single episode.
experiences = [
    (0, 0, 1, 1),  # Experience 1
    (1, 1, 2, 0),  # Experience 2
    (2, 0, 3, 1),  # Experience 3
    (1, 0, 4, 2),  # Experience 4
    (0, 1, 5, 2),  # Experience 5 (episode termination)
]

# Update Q-values for each experience
for i, experience in enumerate(experiences):
    state, action, reward, next_state = experience
    print(f"step number {i}, {agent.Q_values} ")
    agent.update_q_value(state, action, reward, next_state)

# Display Q-values after all experiences
agent.display_q_values()


step number 0, {0: [0, 0], 1: [0, 0], 2: [0, 0]} 
step number 1, {0: [1.0, 0], 1: [0, 0], 2: [0, 0]} 
step number 2, {0: [1.0, 0], 1: [0, 2.99], 2: [0, 0]} 
step number 3, {0: [1.0, 0], 1: [0, 2.99], 2: [5.960100000000001, 0]} 
step number 4, {0: [1.0, 0], 1: [9.900499, 2.99], 2: [5.960100000000001, 0]} 
Q-values:
State 0: [1.0, 10.900499]
State 1: [9.900499, 2.99]
State 2: [5.960100000000001, 0]
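
The run above also bootstraps from the next state on Experience 5, even though it is marked as the episode termination. A common convention is to use only the reward as the target for a terminal transition. The sketch below shows that variant for a whole batch at once; the function name `bellman_targets`, the `dones` flag, and the example numbers are assumptions added for illustration and are not part of the cell above.

```python
import numpy as np

def bellman_targets(rewards, next_q_values, dones, gamma=0.99):
    """Target q_value for each transition in a batch.

    rewards:       shape (batch,)
    next_q_values: shape (batch, num_actions), the q_values of each next_state
    dones:         shape (batch,), True where the episode ended on that step
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    max_next = next_q_values.max(axis=1)
    # Terminal transitions get the reward only; the rest add the discounted max.
    return np.where(dones, rewards, rewards + gamma * max_next)

# Hypothetical batch of two transitions; the second one ends the episode.
targets = bellman_targets(
    rewards=[1.0, 5.0],
    next_q_values=[[0.0, 2.0], [3.0, 1.0]],
    dones=[False, True],
)
print(targets)  # first target is 1 + 0.99 * 2, second is just the reward 5
```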


Storing Experiences
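
One way to store experiences is a fixed-size buffer that drops the oldest entries and hands back random minibatches. The sketch below is a minimal version built on the standard library; the class name `ReplayBuffer`, its methods, and the `done` field are assumptions for illustration. In the usual DQN setup, the batch size is a fixed number of transitions drawn at random from the buffer, so it does not depend on episode length.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw distinct transitions uniformly at random, regardless of episode.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Hypothetical usage: five transitions pushed into a buffer of capacity 3.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1, done=False)

print(len(buffer))        # 3, because the two oldest transitions were dropped
batch = buffer.sample(2)  # two random transitions from the remaining three
```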

## DQL Model

In [198]:
import numpy as np
import gym

EPSILON = 0.9

class GridWorldEnvironment:
    def __init__(self):
        self.env = gym.make('FrozenLake-v1')
        self.num_states = self.env.observation_space.n
        self.state_size = self.env.observation_space.n
        self.num_actions = self.env.action_space.n
        print(f"Game contains {self.num_states} states and {self.num_actions} actions")
        self.Q_table = np.zeros((self.num_states, self.num_actions))
        print(f"shape of the qtest {self.Q_table.shape}")
        self.epsilon = EPSILON

    def train(self, num_episodes, learning_rate, discount_factor):
        if self.epsilon > 0.1:
            self.epsilon -= 0.1
        for episode in range(num_episodes):
            state = self.env.reset()[0]
            done = False

            while not done:
                if np.random.uniform(0, 1) < self.epsilon:
                    action = self.env.action_space.sample()  # Explore
                else:
                    action = np.argmax(self.Q_table[state])  # Exploit

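                # step() returns (obs, reward, terminated, truncated, info)
                next_state, reward, terminated, truncated, _ = self.env.step(action)
                done = terminated or truncated

                # Move the old estimate toward the Bellman target by learning_rate
                target = reward + discount_factor * np.max(self.Q_table[next_state])
                self.Q_table[state][action] += learning_rate * (
                        target - self.Q_table[state][action])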

                state = next_state

    def test(self, max_steps=100):
        state = self.env.reset()[0]
        total_reward = 0
        for _ in range(max_steps):
            action = np.argmax(self.Q_table[state])
            # print(f"action taken {action}")
            state, reward, terminated, truncated, _ = self.env.step(action)
            done = terminated or truncated
            total_reward += reward
            # self.env.render()
            if done:
                break
        return total_reward

def main():
    # Hyperparameters
    NUM_EPISODES = 1000
    LEARNING_RATE = 0.1
    DISCOUNT_FACTOR = 0.9

    # Create environment
    env = GridWorldEnvironment()

    # Train the agent
    env.train(NUM_EPISODES, LEARNING_RATE, DISCOUNT_FACTOR)

    # Print the final Q-table
    # print(env.Q_table)
    
    # Test the agent
    total_reward = env.test()
    print("Total reward:", total_reward)

if __name__ == "__main__":
    main()


Game contains 16 states and 4 actions
shape of the qtest (16, 4)
Total reward: 0.0
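
The cell above keeps its q_values in a table. The model described at the start of these notes would replace that table with a function fitted to Bellman targets by regression. The sketch below uses a single linear layer over one-hot states as a stand-in for a neural network, so it stays in plain NumPy. The class `LinearQModel`, its methods, and the example transition are assumptions for illustration, not the notebook's implementation.

```python
import numpy as np

class LinearQModel:
    """Q(s, .) = one_hot(s) @ W, a stand-in for a neural network."""

    def __init__(self, num_states, num_actions, learning_rate=0.1):
        self.num_states = num_states
        self.learning_rate = learning_rate
        self.W = np.zeros((num_states, num_actions))

    def one_hot(self, states):
        return np.eye(self.num_states)[np.asarray(states)]

    def predict(self, states):
        """Return the q_values of every action, shape (batch, num_actions)."""
        return self.one_hot(states) @ self.W

    def train_step(self, states, actions, targets):
        """One gradient step on the squared error for the actions taken."""
        actions = np.asarray(actions)
        targets = np.asarray(targets, dtype=float)
        x = self.one_hot(states)
        q = x @ self.W
        rows = np.arange(len(actions))
        # Only the q_value of the action actually taken contributes to the loss.
        error = np.zeros_like(q)
        error[rows, actions] = q[rows, actions] - targets
        self.W -= self.learning_rate * (x.T @ error) / len(actions)
        return float(np.mean(error[rows, actions] ** 2))

# Hypothetical training set: state 0, action 0, Bellman target 1.0.
model = LinearQModel(num_states=3, num_actions=2)
for _ in range(200):
    loss = model.train_step(states=[0], actions=[0], targets=[1.0])

print(model.predict([0]))  # Q(0, 0) approaches the target; Q(0, 1) stays at 0
```

With one-hot inputs this linear model behaves like the table updated with a learning rate; a real network would change only `predict` and `train_step`.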
