# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [3]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.90  
LEARNING_RATE = 0.0025 
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.02  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
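    def experience_replay(self):
        # Wait until memory holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        # Sample past transitions at random.
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Target for the action taken: the reward alone on a terminal step,
            # otherwise the reward plus the discounted best Q-value of the next state.
            q_update = reward
            if not terminal:
                q_update = reward + GAMMA * np.amax(self.model.predict(state_next)[0])
            # Replace only the taken action's Q-value, then fit the model toward it.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay the exploration rate, but never below EXPLORATION_MIN.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)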
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            # Negate the reward on the step that ends the episode.
            reward = reward if not terminal else -reward
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [2]:
# Gamma: 0.99 - Default
# Learning Rate: 0.001 - Default
# Exploration: 0.01 - Default
cartpole()

Run: 1, exploration: 0.995, score: 21
Scores: (min: 21, avg: 21, max: 21)

Run: 2, exploration: 0.9000874278732445, score: 21
Scores: (min: 21, avg: 21, max: 21)

Run: 3, exploration: 0.7901049725470279, score: 27
Scores: (min: 21, avg: 23, max: 27)

Run: 4, exploration: 0.7514768435208588, score: 11
Scores: (min: 11, avg: 20, max: 27)

Run: 5, exploration: 0.6465587967553006, score: 31
Scores: (min: 11, avg: 22.2, max: 31)

Run: 6, exploration: 0.567555222460375, score: 27
Scores: (min: 11, avg: 23, max: 31)

Run: 7, exploration: 0.531750826943791, score: 14
Scores: (min: 11, avg: 21.714285714285715, max: 31)

Run: 8, exploration: 0.4907693883854626, score: 17
Scores: (min: 11, avg: 21.125, max: 31)

Run: 9, exploration: 0.46677573701590436, score: 11
Scores: (min: 11, avg: 20, max: 31)

Run: 10, exploration: 0.4439551321314536, score: 11
Scores: (min: 11, avg: 19.1, max: 31)

Run: 11, exploration: 0.40974000909221303, score: 17
Scores: (min: 11, avg: 18.90909090909091, max: 31)

Run:

NameError: name 'exit' is not defined

In [3]:
# Gamma: 0.99 - Increase from 0.95
# Learning Rate: 0.01 - Increase from 0.001
cartpole()

Run: 1, exploration: 1.0, score: 20
Scores: (min: 20, avg: 20, max: 20)

Run: 2, exploration: 0.918316468354365, score: 18
Scores: (min: 18, avg: 19, max: 20)

Run: 3, exploration: 0.8307187014821328, score: 21
Scores: (min: 18, avg: 19.666666666666668, max: 21)

Run: 4, exploration: 0.7666961448653229, score: 17
Scores: (min: 17, avg: 19, max: 21)

Run: 5, exploration: 0.7328768546436799, score: 10
Scores: (min: 10, avg: 17.2, max: 21)

Run: 6, exploration: 0.6900935609921609, score: 13
Scores: (min: 10, avg: 16.5, max: 21)

Run: 7, exploration: 0.6465587967553006, score: 14
Scores: (min: 10, avg: 16.142857142857142, max: 21)

Run: 8, exploration: 0.567555222460375, score: 27
Scores: (min: 10, avg: 17.5, max: 27)

Run: 9, exploration: 0.531750826943791, score: 14
Scores: (min: 10, avg: 17.11111111111111, max: 27)

Run: 10, exploration: 0.5134164023722473, score: 8
Scores: (min: 8, avg: 16.2, max: 27)

Run: 11, exploration: 0.4883155414435353, score: 11
Scores: (min: 8, avg: 15.7272727

NameError: name 'exit' is not defined

In [4]:
# Gamma: 0.90 - Decrease from 0.99
# Learning Rate: 0.001 - Return to default
# Min Exploration: 0.02 - Increase from 0.01
cartpole()

Run: 1, exploration: 1.0, score: 20
Scores: (min: 20, avg: 20, max: 20)

Run: 2, exploration: 0.7822236754458713, score: 50
Scores: (min: 20, avg: 35, max: 50)

Run: 3, exploration: 0.6935613678313175, score: 25
Scores: (min: 20, avg: 31.666666666666668, max: 50)

Run: 4, exploration: 0.6369088258938781, score: 18
Scores: (min: 18, avg: 28.25, max: 50)

Run: 5, exploration: 0.6088145090359074, score: 10
Scores: (min: 10, avg: 24.6, max: 50)

Run: 6, exploration: 0.5761543988830038, score: 12
Scores: (min: 10, avg: 22.5, max: 50)

Run: 7, exploration: 0.5507399854171277, score: 10
Scores: (min: 10, avg: 20.714285714285715, max: 50)

Run: 8, exploration: 0.5185893309484582, score: 13
Scores: (min: 10, avg: 19.75, max: 50)

Run: 9, exploration: 0.49571413690105054, score: 10
Scores: (min: 10, avg: 18.666666666666668, max: 50)

Run: 10, exploration: 0.4738479773082268, score: 10
Scores: (min: 10, avg: 17.8, max: 50)

Run: 11, exploration: 0.4506816115185697, score: 11
Scores: (min: 10, avg

NameError: name 'exit' is not defined

In [5]:
# Gamma: 0.90 - No change from previous test
# Learning Rate: 0.0025 - Increase from 0.001
# Min Exploration: 0.02 - No change from previous test
cartpole()

Run: 1, exploration: 1.0, score: 17
Scores: (min: 17, avg: 17, max: 17)

Run: 2, exploration: 0.8822202429488013, score: 28
Scores: (min: 17, avg: 22.5, max: 28)

Run: 3, exploration: 0.8307187014821328, score: 13
Scores: (min: 13, avg: 19.333333333333332, max: 28)

Run: 4, exploration: 0.7666961448653229, score: 17
Scores: (min: 13, avg: 18.75, max: 28)

Run: 5, exploration: 0.6935613678313175, score: 21
Scores: (min: 13, avg: 19.2, max: 28)

Run: 6, exploration: 0.6629680834613705, score: 10
Scores: (min: 10, avg: 17.666666666666668, max: 28)

Run: 7, exploration: 0.6274028820538087, score: 12
Scores: (min: 10, avg: 16.857142857142858, max: 28)

Run: 8, exploration: 0.5937455908197752, score: 12
Scores: (min: 10, avg: 16.25, max: 28)

Run: 9, exploration: 0.5507399854171277, score: 16
Scores: (min: 10, avg: 16.22222222222222, max: 28)

Run: 10, exploration: 0.5238143793828016, score: 11
Scores: (min: 10, avg: 15.7, max: 28)

Run: 11, exploration: 0.4907693883854626, score: 14
Scores:

KeyboardInterrupt: 

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

### **Explain how reinforcement learning concepts apply to the cartpole problem.**
  * **What is the goal of the agent in this case?**  
    The agent's goal in the cartpole problem is to keep the pole balanced upright on top of the cart.  More formally, the agent tries to learn a policy that maximizes its cumulative reward.  In this code the agent receives a reward for every step the episode continues, and the reward on the final step of an episode is negated, so actions that keep the pole up longer earn more reward.
  * **What are the various state values?**  
    There are four state values:
    1.  Cart position
    2.  Cart velocity
    3.  Pole angle
    4.  Pole angular velocity (described in older Gym documentation as the pole's velocity at the tip)
  * **What are the possible actions that can be performed?**  
    The agent can perform one of two actions: push the cart to the left (action 0) or push the cart to the right (action 1).
  * **What reinforcement algorithm is used for this problem?**  
    The reinforcement learning algorithm used in this problem is the Deep Q-Network (DQN) algorithm.
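
The state values and actions listed above can be sketched as plain Python data. This is a minimal illustration, not part of the assignment code: the field names, the `describe` helper, and the example numbers are all made up here, and the field order is assumed to follow the order documented for Gym's CartPole observation.

```python
# Assumed index layout of a CartPole observation (names are illustrative).
STATE_FIELDS = ("cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity")

# The two discrete actions: the integers are what the agent passes to env.step().
ACTIONS = {0: "push_left", 1: "push_right"}


def describe(observation):
    """Pair each number in a 4-element observation with its field name."""
    values = list(observation)
    # Accept the nested (1, 4) shape that the notebook produces with np.reshape.
    if len(values) == 1 and isinstance(values[0], (list, tuple)):
        values = list(values[0])
    if len(values) != len(STATE_FIELDS):
        raise ValueError("expected 4 state values, got %d" % len(values))
    return dict(zip(STATE_FIELDS, values))


# Hypothetical observation values, chosen only for illustration.
example = [[0.02, -0.31, 0.04, 0.45]]
print(describe(example))
print(ACTIONS[1])
```

Labeling the values this way makes it easier to read a printed state while debugging, since the notebook itself only passes the raw four-element array to the network.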
    
### **Analyze how experience replay is applied to the cartpole problem.**
  * **How does experience replay work in this algorithm?**  
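    Experience replay stores each of the agent's experiences (state, action, reward, next state, and whether the episode ended) in memory.  During training, batches of 20 experiences (BATCH_SIZE) are sampled at random from memory and used to train the model.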
  * **What is the effect of introducing a discount factor for calculating the future rewards?**  
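    The discount factor (GAMMA in the code) makes future rewards worth less than the current reward.  For a step that does not end the episode, the target value is the reward plus GAMMA times the highest predicted Q-value for the next state, and training reduces the gap between the network's prediction and that target.  A lower discount factor gives the predicted future value less weight in the target, so the agent favors near-term rewards.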
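
The replay and discounting steps can be sketched without Keras. This is a simplified illustration under stated assumptions: the transitions are made up, the list `next_q_values` stands in for the network's prediction for the next state (`self.model.predict(state_next)[0]` in the notebook), and `q_target` and `discounted_return` are hypothetical helper names.

```python
import random
from collections import deque

GAMMA = 0.90     # discount factor, same value as the notebook's GAMMA constant
BATCH_SIZE = 3   # small value for illustration (the notebook uses 20)


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Training target for the action taken in one stored transition."""
    if terminal:
        return reward  # no future reward after the episode ends
    return reward + gamma * max(next_q_values)


def discounted_return(rewards, gamma=GAMMA):
    """Total reward where the reward k steps ahead is weighted by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Fill a bounded memory with made-up transitions:
# (state id, action, reward, stand-in next-state Q-values, terminal flag)
memory = deque(maxlen=1000)
for step in range(10):
    terminal = step == 9
    reward = -1.0 if terminal else 1.0
    memory.append((step, step % 2, reward, [0.5 * step, 1.0], terminal))

# Sample a random batch, as experience_replay() does, and compute each target.
random.seed(0)
batch = random.sample(memory, BATCH_SIZE)
for state, action, reward, next_q_values, terminal in batch:
    print(state, action, q_target(reward, next_q_values, terminal))

# Three future rewards of 1.0 are worth less than 3.0 once discounted.
print(discounted_return([1.0, 1.0, 1.0]))
```

Sampling at random, rather than replaying the most recent steps in order, means consecutive and highly correlated steps are less likely to end up in the same training batch.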
    
### **Analyze how neural networks are used in deep Q-learning.**
  * **Explain the neural network architecture that is used in the cartpole problem.**  
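    The network used in the cartpole problem has three dense layers: two hidden layers of 24 units each with ReLU activation, and a linear output layer with one unit per action.  It is compiled with mean squared error loss and the Adam optimizer.  Given a state as input, it approximates the expected reward (Q-value) of each action from that state.  The network in the cartpole problem learns very differently from a standard neural network: instead of a fixed set of labeled examples, it uses states and past experiences along with rewards to calculate the target values it is trained on.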
  * **How does the neural network make the Q-learning algorithm more efficient?**  
    Experience replay is what makes the Q-learning algorithm efficient here.  The network is trained on small batches of past experiences, which then inform its future actions.  The reward system also improves performance by rewarding the agent for good decisions, which in our case means increasing the amount of time the pole stays in play.
  * **What difference do you see in the algorithm performance when you increase or decrease the learning rate?**  
    The most notable difference I noticed was that when I changed the learning rate from 0.001 to 0.0025, the agent did not find a solution after more than 300 runs.  It also became clear that when the score for a run was lower than the previous score, the impact on the average was not as evident as it was in previous tests.  Overall, it took considerably longer to reach a solution when the learning rate was increased.
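
The shape of the network described in this section can be traced with a small forward pass in plain Python. This is only a sketch of the layer sizes and activations defined in `DQNSolver`: the weights are random rather than trained, the initialization range is arbitrary, the example state is made up, and helper names such as `dense` and `count_parameters` are hypothetical.

```python
import random

# Observation size, two hidden layers, and one output per action.
LAYER_SIZES = [4, 24, 24, 2]


def relu(x):
    return max(0.0, x)


def linear(x):
    return x


def init_layer(n_in, n_out, rng):
    """Random weights (n_in rows, n_out columns) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(inputs . weights + biases)."""
    return [
        activation(sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j])
        for j in range(len(biases))
    ]


def q_values(state, layers):
    """Run a state through ReLU, ReLU, then a linear output layer."""
    out = state
    for (weights, biases), activation in zip(layers, [relu, relu, linear]):
        out = dense(out, weights, biases, activation)
    return out


def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]

state = [0.02, -0.31, 0.04, 0.45]      # made-up observation
q = q_values(state, layers)
greedy_action = q.index(max(q))        # same idea as np.argmax(q_values[0]) in act()
print(len(q), count_parameters(layers))  # prints: 2 770
```

The parameter count follows from the layer sizes: (4 × 24 + 24) + (24 × 24 + 24) + (24 × 2 + 2) = 770, which is small enough that each training step is cheap.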
