# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [2]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
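ENV_NAME = "CartPole-v1"

GAMMA = 0.95  # discount factor applied to future rewards
LEARNING_RATE = 0.001  # Adam step size

MEMORY_SIZE = 1000000  # maximum number of transitions kept for replay
BATCH_SIZE = 20  # transitions sampled per replay step

EXPLORATION_MAX = 1.0  # initial exploration rate (fully random actions)
EXPLORATION_MIN = 0.01  # floor for the exploration rate
EXPLORATION_DECAY = 0.995  # multiplicative decay applied after each replay step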
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):
        # Epsilon-greedy: explore with probability exploration_rate, otherwise exploit.
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space)
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])
  
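    def experience_replay(self):
        # Wait until the memory holds at least one full batch.
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, terminal in batch:
            # Target is the reward alone for terminal transitions,
            # otherwise reward plus the discounted best next-state Q-value.
            q_update = reward
            if not terminal:
                q_update = reward + GAMMA * np.amax(self.model.predict(state_next)[0])
            # Overwrite only the taken action's Q-value, then fit on that target.
            q_values = self.model.predict(state)
            q_values[0][action] = q_update
            self.model.fit(state, q_values, verbose=0)
        # Decay the exploration rate after each replay step, never below EXPLORATION_MIN.
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)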
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            # env.render()
            action = dqn_solver.act(state)
            state_next, reward, terminal, info = env.step(action)
            # Penalize the transition that ends the episode.
            reward = reward if not terminal else -reward
            state_next = np.reshape(state_next, [1, observation_space])
            dqn_solver.remember(state, action, reward, state_next, terminal)
            state = state_next
            if terminal:
                print(f"Run: {run}, exploration: {dqn_solver.exploration_rate}, score: {step}")
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [3]:
cartpole()

Run: 1, exploration: 1.0, score: 14
Scores: (min: 14, avg: 14, max: 14)

Run: 2, exploration: 0.9046104802746175, score: 26
Scores: (min: 14, avg: 20, max: 26)

Run: 3, exploration: 0.810157377815473, score: 23
Scores: (min: 14, avg: 21, max: 26)

Run: 4, exploration: 0.6401093727576664, score: 48
Scores: (min: 14, avg: 27.75, max: 48)

Run: 5, exploration: 0.547986285490042, score: 32
Scores: (min: 14, avg: 28.6, max: 48)

Run: 6, exploration: 0.3063705780533402, score: 117
Scores: (min: 14, avg: 43.333333333333336, max: 117)

Run: 7, exploration: 0.2730095965279488, score: 24
Scores: (min: 14, avg: 40.57142857142857, max: 117)

Run: 8, exploration: 0.2494556624678441, score: 19
Scores: (min: 14, avg: 37.875, max: 117)

Run: 9, exploration: 0.22793384675362674, score: 19
Scores: (min: 14, avg: 35.77777777777778, max: 117)

Run: 10, exploration: 0.192217783647157, score: 35
Scores: (min: 14, avg: 35.7, max: 117)

Run: 11, exploration: 0.17388222158237718, score: 21
Scores: (min: 14, av

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

# Across repeated executions of the notebook, the number of runs needed to reach a solution ranged from 34 in an early execution to 333 in the latest one. The results vary considerably from one execution to the next.

# The goal of the agent is to learn, from the reward signal, how to keep the pole upright for as long as possible.

# The state values are:
# Cart position (x)
# Cart velocity (x_dot)
# Pole angle (theta)
# Pole angular velocity (theta_dot)

# The agent can take two actions: push the cart to the left or push it to the right.

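# The algorithm used is Q-learning, here in its deep variant (the DQNSolver class), where a neural network estimates the Q-values. Q-learning is a reinforcement learning method in which the agent improves its action-value estimates over time from the rewards it receives. (Kurniawan et al., 2022)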
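
To make the update concrete, here is a minimal sketch of a one-step Q-learning update in tabular form. It is not the notebook's implementation, which fits a neural network instead of a table. The step size `ALPHA`, the helper name `q_learning_update`, and the state labels `"s0"` and `"s1"` are assumptions made up for illustration; only `GAMMA = 0.95` comes from the notebook.

```python
from collections import defaultdict

GAMMA = 0.95  # discount factor, same value as in the notebook code
ALPHA = 0.1   # step size for the tabular update (assumed for illustration)


def q_learning_update(q_table, state, action, reward, next_state, done, n_actions=2):
    """Move Q(state, action) toward the one-step bootstrapped target."""
    target = reward
    if not done:
        # Bootstrap from the best action value available in the next state.
        target += GAMMA * max(q_table[(next_state, a)] for a in range(n_actions))
    td_error = target - q_table[(state, action)]
    q_table[(state, action)] += ALPHA * td_error
    return q_table[(state, action)]


q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0
# One hypothetical transition between two made-up discrete states.
updated = q_learning_update(q_table, "s0", 1, 1.0, "s1", done=False)
```

The notebook computes the same kind of target (reward plus the discounted best next-state value) inside `experience_replay`, then trains the network toward it rather than editing a table entry.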

# In this Q-learning example, experience replay stores the agent's experiences (state, action, reward, next state, and whether the episode ended) in a replay buffer. During training, random batches of experiences are sampled from the buffer to update the Q-network, which reduces the problem of correlated updates and increases data efficiency. (Zhang et al., 2021)
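
The buffer mechanics can be sketched on their own, without the network. The class below is a hypothetical stand-alone version of what `DQNSolver.remember` and the sampling line in `experience_replay` do; the class name `ReplayBuffer` and the toy transitions are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement breaks up consecutive, correlated steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=5)
for step in range(8):
    # Toy transitions: integer "states" stand in for real observations.
    buffer.remember(step, 0, 1.0, step + 1, False)
batch = buffer.sample(3)
```

With a capacity of 5 and 8 stored transitions, only the 5 most recent remain, so every sampled transition comes from steps 3 through 7.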

# In the CartPole task, the discount factor (gamma) determines the importance of future rewards in the agent's decision-making. A higher gamma value makes the agent prioritize long-term rewards, while a lower value makes it focus more on short-term rewards. The notebook sets GAMMA = 0.95. (Westenbroek et al., 2022)
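
The effect of gamma can be illustrated by computing the discounted return of the same reward sequence under two values. This is a small sketch, assuming a hypothetical episode of 50 steps with a reward of 1.0 per step; the helper name `discounted_return` and the comparison value 0.50 are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0] * 50  # hypothetical episode: +1 reward on each of 50 steps
high = discounted_return(rewards, 0.95)  # the notebook's GAMMA
low = discounted_return(rewards, 0.50)   # a much more short-sighted agent
```

For a constant reward of 1, the return can never exceed 1 / (1 - gamma), which is 20 for gamma = 0.95 and 2 for gamma = 0.50, so the lower value effectively ignores rewards more than a few steps away.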

# The neural network architecture for the CartPole application consists of:
# Input layer with four neurons (corresponding to the four state values)
# Two hidden layers of 24 neurons each with ReLU activation (as defined in the notebook code)
# Output layer with two neurons and linear activation (one Q-value for each of the two possible actions)
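
As a rough check on the size of this network, the sketch below counts the trainable parameters of fully connected layers with the sizes used in the notebook's `DQNSolver` (4 inputs, two hidden layers of 24, 2 outputs). The helper name `dense_params` is an assumption for illustration, and the count assumes each layer has a bias term.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Layer widths from the notebook: 4 state values -> 24 -> 24 -> 2 actions.
layer_sizes = [4, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
```

The layers contribute 120, 600, and 50 parameters, 770 in total, which is small enough to train on a CPU.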
# Traditional distributed deep reinforcement learning (RL) commonly relies on exchanging the experience replay memory (RM) of each agent. (Cha et al., 2020)

# How does the neural network make the Q-learning algorithm more efficient?
# Neural networks make Q-learning more efficient by enabling function approximation. Instead of storing a separate Q-value for every state, the network generalizes Q-values across similar states, which matters because CartPole's state values are continuous.
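
Generalization can be shown with an approximator much simpler than the notebook's network. The sketch below uses a linear model, which is an assumption made for illustration and not what the notebook trains; the state vectors, target, and learning rate are made up. One update on a visited state also shifts the estimate for a nearby state that was never visited, which a lookup table would not do.

```python
def q_linear(weights, state):
    """Q-value estimate as a weighted sum of the state values."""
    return sum(w * s for w, s in zip(weights, state))


def sgd_update(weights, state, target, lr):
    """One gradient step on squared error between estimate and target."""
    error = target - q_linear(weights, state)
    return [w + lr * error * s for w, s in zip(weights, state)]


weights = [0.0, 0.0, 0.0, 0.0]
visited = [0.10, 0.00, 0.02, 0.00]  # hypothetical observed state
nearby = [0.11, 0.00, 0.02, 0.00]   # similar state that is never trained on

before = q_linear(weights, nearby)
weights = sgd_update(weights, visited, target=1.0, lr=0.5)
after = q_linear(weights, nearby)
```

After the single update, the estimate for `nearby` has moved toward the target even though only `visited` was used for training.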

# A higher learning rate can lead to faster convergence but may cause instability or overshooting.
# A lower learning rate makes learning more stable but slower.
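
The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2, whose minimum is at w = 0. This is an assumption chosen for illustration: the notebook uses the Adam optimizer on a neural network, and the three learning rates here are made up and unrelated to its LEARNING_RATE of 0.001.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 starting from w, using the gradient 2 * w."""
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w


slow = gradient_descent(0.01)     # stable, but still far from 0 after 20 steps
fast = gradient_descent(0.4)      # converges quickly toward 0
unstable = gradient_descent(1.1)  # each step overshoots and the error grows
```

Each step multiplies w by (1 - 2 * lr), so the iterate shrinks slowly for a small rate, shrinks quickly for a moderate one, and flips sign while growing once the rate exceeds 1.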


# References
# Kurniawan, B., Vamplew, P., Papasimeon, M., Dazeley, R., & Foale, C. (2022). Discrete-to-deep reinforcement learning methods. Neural Computing & Applications, 34(3), 1713–1733. https://doi-org.ezproxy.snhu.edu/10.1007/s00521-021-06270-6

# Zhang, C., Song, Q., & Meng, Z. (2021). Minibatch Recursive Least Squares Q-Learning. Computational Intelligence & Neuroscience, 1–9. https://doi-org.ezproxy.snhu.edu/10.1155/2021/5370281

# Westenbroek, T., Castaneda, F., Agrawal, A., Sastry, S., & Sreenath, K. (2022). Lyapunov Design for Robust and Efficient Robotic Reinforcement Learning.

# Cha, H., Park, J., Kim, H., Bennis, M., & Kim, S. (2020). Proxy Experience Replay: Federated Distillation for Distributed Reinforcement Learning. IEEE Intelligent Systems, 35(4), 94–101. https://doi-org.ezproxy.snhu.edu/10.1109/MIS.2020.2994942
