# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [3]:
# original code
# Solved in 414 runs, 514 total runs.

import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()



In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 13
Scores: (min: 13, avg: 13, max: 13)

Run: 2, exploration: 0.9416228069143757, score: 19
Scores: (min: 13, avg: 16, max: 19)

Run: 3, exploration: 0.8475428503023453, score: 22
Scores: (min: 13, avg: 18, max: 22)

Run: 4, exploration: 0.778312557068642, score: 18
Scores: (min: 13, avg: 18, max: 22)

Run: 5, exploration: 0.736559652908221, score: 12
Scores: (min: 12, avg: 16.8, max: 22)

Run: 6, exploration: 0.6662995813682115, score: 21
Scores: (min: 12, avg: 17.5, max: 22)

Run: 7, exploration: 0.5732736268885887, score: 31
Scores: (min: 12, avg: 19.428571428571427, max: 31)

Run: 8, exploration: 0.5398075216808175, score: 13
Scores: (min: 12, avg: 18.625, max: 31)

Run: 9, exploration: 0.500708706245853, score: 16
Scores: (min: 12, avg: 18.333333333333332, max: 31)

Run: 10, exploration: 0.47622912292284103, score: 11
Scores: (min: 11, avg: 17.6, max: 31)

Run: 11, exploration: 0.4484282034609769, score: 13
Scores: (min: 11, avg: 17.181818181818183,

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.



In [None]:
# modifying exploration 
# exploration factor = combination of EXPLORATION_MAX, EXPLORATION_MIN, and EXPLORATION_DECAY
# Solved in 38 runs, 138 total runs.
# EXPLORATION_MAX = .5  # decreased from 1.0
# EXPLORATION_MIN = .01  # unchanged from 0.01
# EXPLORATION_DECAY = .999  # increased from 0.995

import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  

ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = .5  # decreased from 1.0
EXPLORATION_MIN = .01  # unchanged from 0.01
EXPLORATION_DECAY = .999  # increased from 0.995 (slower decay)

  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()


In [None]:
cartpole()

End of output copied from a separate file

Run: 201, exploration: 0.01, score: 217
Scores: (min: 29, avg: 193.42, max: 500)

Run: 202, exploration: 0.01, score: 89
Scores: (min: 29, avg: 191.77, max: 500)

Run: 203, exploration: 0.01, score: 346
Scores: (min: 29, avg: 193.13, max: 500)

Run: 204, exploration: 0.01, score: 157
Scores: (min: 29, avg: 193.17, max: 500)

Run: 205, exploration: 0.01, score: 402
Scores: (min: 29, avg: 195.64, max: 500)

Solved in 105 runs, 205 total runs.

In [None]:
# Solved in 105 runs, 205 total runs.
# modifying discount factor (GAMMA) from .95 to .92
# discount factor GAMMA = .92  # decreased from .95; I tried .99, but it was too high

import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger 

# modifying discount factor (GAMMA) from .95 to .92
ENV_NAME = "CartPole-v1"  
  
GAMMA = .92  # decreased from .95; I tried .99, but it was too high
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()


In [None]:
cartpole()

End of output copied from a separate file

Run: 134, exploration: 0.01, score: 166
Scores: (min: 10, avg: 190.08, max: 465)

Run: 135, exploration: 0.01, score: 158
Scores: (min: 10, avg: 191.03, max: 465)

Run: 136, exploration: 0.01, score: 223
Scores: (min: 10, avg: 192.3, max: 465)

Run: 137, exploration: 0.01, score: 251
Scores: (min: 10, avg: 194.15, max: 465)

Run: 138, exploration: 0.01, score: 155
Scores: (min: 10, avg: 195.27, max: 465)

Solved in 38 runs, 138 total runs.

In [3]:
# Solved in 914 runs, 1014 total runs.
# modifying learning rate from .001 to .004
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers.legacy import Adam  
  
from scores.score_logger import ScoreLogger 

# modifying learning rate from .001 to .004
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.004 # increased from .001
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()


In [4]:
cartpole()

Run: 1, exploration: 1.0, score: 12
Scores: (min: 12, avg: 12, max: 12)

Run: 2, exploration: 1.0, score: 8
Scores: (min: 8, avg: 10, max: 12)



Run: 3, exploration: 0.8955869907338783, score: 23
Scores: (min: 8, avg: 14.333333333333334, max: 23)

Run: 4, exploration: 0.6696478204705644, score: 59
Scores: (min: 8, avg: 25.5, max: 59)

Run: 5, exploration: 0.6337242817644086, score: 12
Scores: (min: 8, avg: 22.8, max: 59)

Run: 6, exploration: 0.5848838636585911, score: 17
Scores: (min: 8, avg: 21.833333333333332, max: 59)

Run: 7, exploration: 0.5425201222922789, score: 16
Scores: (min: 8, avg: 21, max: 59)

Run: 8, exploration: 0.5134164023722473, score: 12
Scores: (min: 8, avg: 19.875, max: 59)

Run: 9, exploration: 0.4858739637363176, score: 12
Scores: (min: 8, avg: 19, max: 59)

Run: 10, exploration: 0.45522245551230495, score: 14
Scores: (min: 8, avg: 18.5, max: 59)

Run: 11, exploration: 0.42437208406280985, score: 15
Scores: (min: 8, avg: 18.181818181818183, max: 59)

Run: 12, exploration: 0.40565285250151817, score: 10
Scores: (min: 8, avg: 17.5, max: 59)

Run: 13, exploration: 0.3858205374665315, score: 11
Scores: (min

KeyboardInterrupt: 

End of output copied from a separate file

Run: 1010, exploration: 0.01, score: 85
Scores: (min: 11, avg: 189.3, max: 500)

Run: 1011, exploration: 0.01, score: 250
Scores: (min: 11, avg: 190.83, max: 500)

Run: 1012, exploration: 0.01, score: 237
Scores: (min: 11, avg: 193.01, max: 500)

Run: 1013, exploration: 0.01, score: 205
Scores: (min: 11, avg: 194.18, max: 500)

Run: 1014, exploration: 0.01, score: 367
Scores: (min: 11, avg: 197.68, max: 500)

**SUMMARY INFO AND VALUES USED**

Values for the original code
**Solved in 414 runs, 514 total runs.**

GAMMA = 0.95  
LEARNING_RATE = 0.001  
MEMORY_SIZE = 1000000
BATCH_SIZE = 20  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  

Values for modifying exploration 
Exploration factor = combination of EXPLORATION_MAX, EXPLORATION_MIN, and EXPLORATION_DECAY
**Solved in 38 runs, 138 total runs.**

GAMMA = 0.95  
LEARNING_RATE = 0.001  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
**EXPLORATION_MAX = .5 # decreased from 1.0**
EXPLORATION_MIN = .01 
**EXPLORATION_DECAY = .999  # increased from 0.995**

Values for modifying discount factor 
Discount factor GAMMA = .92 # decreased from .95; I tried .99, but it was too high
**Solved in 105 runs, 205 total runs.**

**GAMMA = .92 # decreased from .95; I tried .99, but it was too high**
LEARNING_RATE = 0.001  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  

Values for modifying learning rate
LEARNING_RATE = 0.004 # increased from .001
**Solved in 914 runs, 1014 total runs.**

GAMMA = 0.95  
**LEARNING_RATE = 0.004  # increased from .001**
MEMORY_SIZE = 1000000
BATCH_SIZE = 20  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
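
The exploration settings listed above differ mainly in how long the agent keeps exploring. As a rough check, the sketch below repeats the decay rule from experience_replay() (multiply by EXPLORATION_DECAY, then clamp at EXPLORATION_MIN) and counts the updates needed to reach the floor. The helper name `updates_until_min` is made up for this illustration and is not part of the notebook.

```python
def updates_until_min(start, minimum, decay):
    """Count replay updates until the exploration rate hits its floor.

    Mirrors the two lines at the end of experience_replay():
    rate *= decay, then rate = max(minimum, rate).
    """
    rate = start
    updates = 0
    while rate > minimum:
        rate = max(minimum, rate * decay)
        updates += 1
    return updates


# Original settings: start at 1.0, decay by 0.995 per update.
original = updates_until_min(1.0, 0.01, 0.995)
# Modified settings: start at 0.5, decay by 0.999 per update.
modified = updates_until_min(0.5, 0.01, 0.999)

print("original settings:", original, "updates")
print("modified settings:", modified, "updates")
```

Under these assumptions the modified schedule needs roughly four times as many updates to reach the floor, even though it starts from a lower rate, because multiplying by 0.999 shrinks the rate much more slowly than multiplying by 0.995.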


The cartpole problem, also known as the inverted pendulum problem, is solved in this notebook with the deep Q-learning algorithm, which learns to keep the pole balanced. The problem is essentially solved by keeping the pivot point under the pole's center of mass (Surma, 2021). Through reinforcement learning, the agent improves its performance over time by storing information about the states it has visited and using it to determine the best course of action.

Explain how reinforcement learning concepts apply to the cartpole problem.
•	What is the goal of the agent in this case?
•	What are the various state values?
•	What are the possible actions that can be performed?
•	What reinforcement algorithm is used for this problem?

In this code, the goal of the agent is to keep a pole balanced on top of a cart by moving the cart left or right based on the current state. The state consists of four values: the cart position, the cart velocity, the pole angle, and the pole angular velocity. The two possible actions are moving the cart left and moving it right. The reinforcement learning algorithm used in this code is the deep Q-network (DQN), a form of Q-learning in which a neural network estimates the Q-values.
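
To make the state, actions, and action selection concrete, here is a minimal sketch of the epsilon-greedy rule that act() applies. The state values and Q-value estimates are hypothetical, the function name `choose_action` is invented for this example, and the action encoding (0 for left, 1 for right) follows Gym's CartPole documentation rather than anything stated in the notebook.

```python
import random

# The four observation values, in the order CartPole reports them.
STATE_FIELDS = ("cart position", "cart velocity", "pole angle", "pole angular velocity")
# Assumed action encoding: 0 pushes the cart left, 1 pushes it right.
ACTIONS = {0: "push left", 1: "push right"}


def choose_action(q_values, exploration_rate, rng=random):
    """Epsilon-greedy choice: explore with probability exploration_rate,
    otherwise pick the action with the largest estimated Q-value."""
    if rng.random() < exploration_rate:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical state and Q-value estimates, for illustration only.
state = (0.02, -0.15, 0.03, 0.20)
q_values = [0.4, 1.1]

for name, value in zip(STATE_FIELDS, state):
    print(name, "=", value)

# With exploration switched off, the choice is purely greedy.
action = choose_action(q_values, exploration_rate=0.0)
print("greedy action:", action, "->", ACTIONS[action])
```

With an exploration rate of 1.0 the same function would ignore the Q-values and pick either action at random, which is how every run starts under the original settings.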

Analyze how experience replay is applied to the cartpole problem.
•	How does experience replay work in this algorithm?
•	What is the effect of introducing a discount factor for calculating the future rewards?

The remember() method stores each experience (the state, action, reward, next state, and whether the run ended) in a memory buffer. The experience_replay() method then draws a random batch of BATCH_SIZE stored experiences and uses each one to update the network. Introducing a discount factor causes the agent to consider long-term rewards as well as short-term rewards. The discount factor balances the importance of short-term versus long-term rewards and can encourage strategizing. A high discount factor places more emphasis on long-term rewards, while a low discount factor favors short-term rewards.
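
The sketch below illustrates both ideas without the neural network: a bounded memory that is sampled at random, and the target value computed with the discount factor. It is a simplified stand-in for the notebook's code, using string labels for states, a batch size of 3 instead of 20, and made-up Q-values; `q_target` and `discounted_return` are helper names introduced only for this example.

```python
import random
from collections import deque

GAMMA = 0.95      # discount factor from the original cell
BATCH_SIZE = 3    # small batch for this illustration (the notebook uses 20)

memory = deque(maxlen=1000)  # oldest experiences are dropped when full


def remember(state, action, reward, next_state, done):
    """Store one transition, as remember() does in the notebook."""
    memory.append((state, action, reward, next_state, done))


def q_target(reward, next_q_values, done, gamma=GAMMA):
    """Target used in experience_replay(): the reward alone for a terminal
    step, otherwise reward + gamma * max over the next state's Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))


# Hypothetical transitions; states are just labels here.
for step in range(5):
    remember("s%d" % step, step % 2, 1.0, "s%d" % (step + 1), False)
remember("s5", 0, -1.0, "s6", True)  # terminal step gets reward -1, as in cartpole()

batch = random.sample(list(memory), BATCH_SIZE)
print("sampled", len(batch), "of", len(memory), "stored transitions")

print("non-terminal target:", q_target(1.0, [2.0, 3.0], done=False))
print("terminal target:", q_target(-1.0, [2.0, 3.0], done=True))

# A higher gamma gives distant rewards more weight.
for gamma in (0.92, 0.95, 0.99):
    print(gamma, round(discounted_return([1.0] * 50, gamma), 2))
```

Sampling at random rather than replaying the most recent steps in order means each update mixes experiences from different runs, and the last loop shows that the same stream of rewards is worth more in total as the discount factor approaches 1.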

Analyze how neural networks are used in deep Q-learning.
•	Explain the neural network architecture that is used in the cartpole problem.
•	How does the neural network make the Q-learning algorithm more efficient?
•	What difference do you see in the algorithm performance when you increase or decrease the learning rate?

The neural network consists of an input layer that receives the four state values, two hidden layers of 24 units each with ReLU activation, and a linear output layer with one unit per action. The neural network makes the Q-learning algorithm more efficient by estimating Q-values for states it has not yet seen, instead of storing a separate value for every state-action pair, and its estimates improve as it trains on stored experiences. Increasing the learning rate makes the model take larger adjustments at each update. While I expected that moderately increasing the learning rate from .001 to .004 would allow the algorithm to solve the problem more quickly, I found that it drastically increased the number of runs needed to solve the problem (914 runs, compared with 414 for the original code).
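
As a small sanity check on the architecture and the learning rate, the sketch below counts the trainable parameters implied by the layer sizes in DQNSolver and shows how the learning rate scales a single update. The update shown is plain gradient descent on a toy loss, which is a simplification: the notebook uses the Adam optimizer, which rescales each step. The helper names `dense_params` and `sgd_step` are invented for this example.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a fully connected layer: one weight per
    input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer sizes from DQNSolver: 4 observations -> 24 -> 24 -> 2 actions.
layer_sizes = [4, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print("parameters per layer:", per_layer)
print("total parameters:", sum(per_layer))


def sgd_step(weight, gradient, learning_rate):
    """One plain gradient-descent update (a simplification: the notebook
    uses the Adam optimizer, which rescales each step)."""
    return weight - learning_rate * gradient


# Toy loss (w - 3)**2 with gradient 2 * (w - 3), starting from w = 0.
w = 0.0
grad = 2 * (w - 3)
for lr in (0.001, 0.004):
    print("learning rate", lr, "moves w to", sgd_step(w, grad, lr))
```

The three layers hold 120, 600, and 50 parameters, 770 in total, so the network is very small. In the toy update, the step taken at .004 is four times the step taken at .001, which illustrates why a higher learning rate can overshoot and make training less stable rather than faster.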
 
References

Surma, G. (2021, October 13). Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning). Medium. https://gsurma.medium.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288
