Copied from the blog post at https://keon.io/deep-q-learning/, with thanks to the author.


![Alt text](https://keon.io/images/deep-q-learning/deep-q-learning.png)

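The training target for an action is the immediate reward plus the discounted maximum value predicted for the next state:

> target = reward + gamma * np.amax(model.predict(next_state))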
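
As a worked illustration of the target above, the sketch below plugs in made-up numbers. The helper name q_target and all values are assumptions for illustration, and a plain array stands in for the output of model.predict(next_state). The terminal-state branch follows the done check used in the agent's replay() method further down.

```python
import numpy as np

def q_target(reward, next_q_values, gamma=0.95, done=False):
    """Return the training target for the action that was taken.

    next_q_values stands in for model.predict(next_state)[0]: one
    estimated value per action in the next state (hypothetical numbers).
    """
    if done:
        # No future reward after a terminal state.
        return float(reward)
    # Immediate reward plus the discounted best value of the next state.
    return float(reward + gamma * np.amax(next_q_values))

next_q = np.array([0.5, 2.0])       # made-up predictions for two actions
target = q_target(1.0, next_q)      # 1.0 + 0.95 * 2.0
terminal_target = q_target(-10.0, next_q, done=True)
print(target, terminal_target)
```

Only the best next-state value enters the target; the value of the other action (0.5 here) is ignored.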


## **Remember**




> One of the challenges for DQN is that the neural network used in the algorithm tends to forget previous experiences as it overwrites them with new ones. So we need a list of previous experiences and observations to re-train the model with. We will call this list of experiences memory and use a remember() function to append the state, action, reward, next state, and done flag to it.



In our example, the memory list has the form:


> memory = [(state, action, reward, next_state, done)...]




The remember() function simply stores the state, action, reward, next state, and done flag in the memory, as shown below:



>     def remember(self, state, action, reward, next_state, done):
>         self.memory.append((state, action, reward, next_state, done))
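
To see how a bounded memory behaves, here is a minimal sketch, assuming the memory is a collections.deque with a maxlen, as in the agent class further down. The capacity of 3 and the integer placeholder states are made up so that eviction of the oldest experience is easy to observe.

```python
from collections import deque

# A tiny capacity (3) so eviction is easy to see; the agent uses 2000.
memory = deque(maxlen=3)

def remember(state, action, reward, next_state, done):
    # Store one transition as a tuple, in the form described in the text.
    memory.append((state, action, reward, next_state, done))

# Placeholder integer states stand in for real observations.
for step in range(5):
    remember(step, 0, 1.0, step + 1, False)

print(len(memory))                 # never exceeds the capacity
print([m[0] for m in memory])      # states of the surviving transitions
```

Once the deque is full, each new append drops the oldest transition, so the agent always trains on its most recent experiences.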



## **Replay**

The method that trains the neural net with experiences from the memory is called replay(). First, we sample some experiences from the memory and call them a minibatch.


> minibatch = random.sample(self.memory, batch_size)
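
A small sketch of the sampling step, assuming the memory is filled with placeholder transitions (the integers below are not real CartPole observations). It illustrates two properties of random.sample(): it draws without replacement, and it raises ValueError when asked for more items than the memory holds, which is why the training loop waits until the memory is larger than batch_size before calling replay().

```python
import random
from collections import deque

random.seed(0)  # fixed seed only to make the sketch repeatable

memory = deque(maxlen=2000)
for i in range(100):
    # (state, action, reward, next_state, done) with placeholder values
    memory.append((i, i % 2, 1.0, i + 1, False))

batch_size = 32
minibatch = random.sample(memory, batch_size)

# Sampling is without replacement, so each transition appears at most once.
unique_states = {state for state, _, _, _, _ in minibatch}

# Asking for more samples than the memory holds raises ValueError.
try:
    random.sample(memory, len(memory) + 1)
    oversample_failed = False
except ValueError:
    oversample_failed = True

print(len(minibatch), len(unique_states), oversample_failed)
```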




## **How The Agent Decides to Act**

At first, our agent selects its action randomly with a certain probability, called the ‘exploration rate’ or ‘epsilon’. This is because early on it is better for the agent to try all kinds of things before it starts to see patterns. When it is not acting randomly, the agent predicts the reward value of each action based on the current state and picks the action with the highest predicted value. np.argmax() returns the index of the largest element of act_values[0], which holds one predicted value per action (two in CartPole).
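
The decision rule described above can be sketched on its own, without a trained model. The function name choose_action and the numbers in act_values are made up for illustration; act_values has shape (1, 2) to imitate the output of model.predict() for a single CartPole state.

```python
import random
import numpy as np

def choose_action(act_values, epsilon, action_size=2):
    """Epsilon-greedy choice over one row of predicted action values."""
    if np.random.rand() <= epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(action_size)
    # Exploit: index of the largest predicted value.
    return int(np.argmax(act_values[0]))

# Shape (1, 2): a batch of one state with made-up values for two actions.
act_values = np.array([[0.2, 0.7]])

greedy = choose_action(act_values, epsilon=0.0)   # effectively always exploits
explore = choose_action(act_values, epsilon=1.0)  # always explores
print(greedy, explore)
```

Note that np.argmax() yields the position of the best value (the action to take), not the value itself.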

In [0]:
import random
import gym
import numpy as np
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

In [0]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        # learning_rate replaces the deprecated lr argument in newer Keras releases
        model.compile(loss='mse',
                      optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

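    def act(self, state):
        # Explore: with probability epsilon, take a random action.
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        # Exploit: take the action with the highest predicted value.
        act_values = self.model.predict(state)
        return np.argmax(act_values[0])  # index of the best action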

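    def replay(self, batch_size):
        # Train on a random sample of stored experiences.
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # Terminal transitions have no future reward.
            target = reward
            if not done:
                # Reward plus the discounted best predicted value of next_state.
                target = (reward + self.gamma *
                          np.amax(self.model.predict(next_state)[0]))
            # Keep the current predictions, overwriting only the action taken.
            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        # Decay exploration once per replay call until it reaches epsilon_min.
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay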

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
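
Because replay() multiplies epsilon by epsilon_decay once per call, the number of calls needed to reach epsilon_min can be counted directly. The loop below is a standalone sketch that reuses the constants from the agent's constructor; replay_calls is a name introduced here for illustration.

```python
epsilon = 1.0          # initial exploration rate
epsilon_min = 0.01
epsilon_decay = 0.995

replay_calls = 0
while epsilon > epsilon_min:
    # Same update that replay() applies once per call.
    epsilon *= epsilon_decay
    replay_calls += 1

print(replay_calls, epsilon)
```

Since the training loop further down calls replay() on almost every environment step once the memory is large enough, exploration fades after on the order of nine hundred steps rather than nine hundred episodes, which matches how quickly e drops in the printed log.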

In [0]:
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n


In [0]:
agent = DQNAgent(state_size, action_size)
# agent.load("./save/cartpole-dqn.h5")
done = False
batch_size = 32

In [0]:
EPISODES = 1000
for e in range(EPISODES):
    # Assumes the older gym API: reset() returns only the observation
    # and step() returns four values.
    state = env.reset()
    state = np.reshape(state, [1, state_size])  # add a batch dimension
    for time in range(500):
        # env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        # Penalize the step that ends the episode.
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, EPISODES, time, agent.epsilon))
            break
        # Train only once the memory can supply a full minibatch.
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
    # if e % 10 == 0:
    #     agent.save("./save/cartpole-dqn.h5")

episode: 0/1000, score: 13, e: 1.0
episode: 1/1000, score: 10, e: 1.0
episode: 2/1000, score: 21, e: 0.93
episode: 3/1000, score: 21, e: 0.84
episode: 4/1000, score: 19, e: 0.76
episode: 5/1000, score: 21, e: 0.69
episode: 6/1000, score: 16, e: 0.63
episode: 7/1000, score: 14, e: 0.59
episode: 8/1000, score: 18, e: 0.54
episode: 9/1000, score: 21, e: 0.49
episode: 10/1000, score: 11, e: 0.46
episode: 11/1000, score: 15, e: 0.43
episode: 12/1000, score: 14, e: 0.4
episode: 13/1000, score: 10, e: 0.38
episode: 14/1000, score: 12, e: 0.36
episode: 15/1000, score: 9, e: 0.34
episode: 16/1000, score: 13, e: 0.32
episode: 17/1000, score: 10, e: 0.3
episode: 18/1000, score: 9, e: 0.29
episode: 19/1000, score: 7, e: 0.28
episode: 20/1000, score: 17, e: 0.26
episode: 21/1000, score: 9, e: 0.25
episode: 22/1000, score: 9, e: 0.23
episode: 23/1000, score: 9, e: 0.22
episode: 24/1000, score: 10, e: 0.21
episode: 25/1000, score: 8, e: 0.21
episode: 26/1000, score: 9, e: 0.2
episode: 27/1000

KeyboardInterrupt: ignored