# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [8]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
  
from scores.score_logger import ScoreLogger  
  
ENV_NAME = "CartPole-v1"  
  
GAMMA = 0.95  
LEARNING_RATE = 0.001  
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print ("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))  
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



In [9]:
cartpole()

Run: 1, exploration: 1.0, score: 15
Scores: (min: 15, avg: 15, max: 15)

Run: 2, exploration: 0.9655206468094844, score: 12
Scores: (min: 12, avg: 13.5, max: 15)

Run: 3, exploration: 0.9091562615825302, score: 13
Scores: (min: 12, avg: 13.333333333333334, max: 15)

Run: 4, exploration: 0.8348931673187264, score: 18
Scores: (min: 12, avg: 14.5, max: 18)

Run: 5, exploration: 0.7628626641409962, score: 19
Scores: (min: 12, avg: 15.4, max: 19)

Run: 6, exploration: 0.7219385759785162, score: 12
Scores: (min: 12, avg: 14.833333333333334, max: 19)

Run: 7, exploration: 0.6596532430440636, score: 19
Scores: (min: 12, avg: 15.428571428571429, max: 19)

Run: 8, exploration: 0.6211445383053219, score: 13
Scores: (min: 12, avg: 15.125, max: 19)

Run: 9, exploration: 0.5238143793828016, score: 35
Scores: (min: 12, avg: 17.333333333333332, max: 35)

Run: 10, exploration: 0.5032248303978422, score: 9
Scores: (min: 9, avg: 16.5, max: 35)

Run: 11, exploration: 0.47622912292284103, score: 12
Scores:

NameError: name 'exit' is not defined

Note: If the code is running properly, you should begin to see output appearing above this code block. It will take several minutes, so it is recommended that you let this code run in the background while completing other work. When the code has finished, it will print output saying, "Solved in _ runs, _ total runs."

You may see an error about not having an exit command. This error does not affect the program's functionality and results from the steps taken to convert the code from Python 2.x to Python 3. Please disregard this error.

1. Explain how reinforcement learning concepts apply to the cartpole problem.
Goal of the Agent:
The primary goal of the agent in the CartPole problem is to balance a pole on a cart by applying forces to the left or right. The agent must learn to keep the pole upright for as long as possible, maximizing the duration of each episode.

State Values:
The state of the CartPole environment is represented by a vector of four values: cart position, cart velocity, pole angle, and pole velocity at the tip. These values describe the current configuration of the cart and pole, and the agent uses them to decide its next action.

Possible Actions:
-Apply force to the left
-Apply force to the right

These actions influence the cart's movement and consequently affect the pole's stability.
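
The notebook's `act` method picks between these two actions with an epsilon-greedy rule. The sketch below reproduces that rule in plain Python, with a hand-written list standing in for the network's output. The function name `choose_action` and the sample Q-values are illustrative assumptions, not part of the notebook.

```python
import random

def choose_action(q_values, exploration_rate):
    """Epsilon-greedy choice over a list of Q-values (one per action)."""
    if random.random() < exploration_rate:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# In CartPole, action 0 pushes the cart left and action 1 pushes it right.
q_values = [0.2, 0.7]  # hypothetical Q-value estimates for one state
greedy_action = choose_action(q_values, exploration_rate=0.0)  # never explores
random_action = choose_action(q_values, exploration_rate=1.0)  # always explores
```

With an exploration rate of 0 the choice is always the higher-valued action; with a rate of 1 it is always random, which is how the notebook starts before the rate decays.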

Reinforcement Algorithm:
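The algorithm used for this problem is Deep Q-Learning, implemented as a Deep Q-Network (DQN). DQN combines Q-Learning with a neural network that approximates the Q-values, which lets it handle state spaces that are continuous or high-dimensional, such as CartPole's four real-valued state variables.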

2. Experience Replay in CartPole
Function of Experience Replay:
Experience replay stores the agent's experiences (state, action, reward, next state, and done flag) in a replay memory; in this notebook the memory is a deque that holds up to 1,000,000 transitions. During training, a random batch of 20 transitions is drawn from that memory and used to update the Q-values.

Mechanism in the Algorithm:
In the DQN algorithm, experience replay helps in breaking the correlation between consecutive experiences. By sampling randomly from the replay memory, the agent can learn from a diverse set of past experiences, leading to more stable and efficient learning.
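
A minimal sketch of such a replay memory follows, assuming the same `deque` plus `random.sample` approach the notebook uses. The class name `ReplayMemory`, the tiny capacity, and the integer stand-ins for states are illustrative assumptions chosen to keep the example short.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def remember(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement mixes old and recent transitions,
        # so consecutive (correlated) steps rarely end up in the same batch.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=5)
for t in range(8):
    # Integers stand in for the real four-value state vectors.
    memory.remember(state=t, action=t % 2, reward=1.0, next_state=t + 1, done=False)

batch = memory.sample(3)
```

After eight insertions into a memory of capacity five, only the five most recent transitions remain, and each sampled batch is a random subset of them.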

Effect of Discount Factor:
The discount factor (symbolized by gamma) in reinforcement learning balances the importance of immediate rewards against future rewards. A higher discount factor places more emphasis on future rewards, encouraging the agent to consider long-term benefits. Conversely, a lower discount factor focuses more on immediate rewards. In this algorithm, the discount factor enters the Q-value target during training: for a non-terminal step, the target is the reward plus gamma (0.95 in this notebook) times the largest predicted Q-value of the next state.
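
The effect of gamma can be illustrated with a small worked example. The sketch below compares the discounted return of ten rewards of 1 under two values of gamma, and computes a one-step target in the same form as the notebook's `experience_replay`. The helper names and the sample Q-values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def q_target(reward, next_q_values, gamma, terminal):
    """One-step Q-learning target: bootstrap from the next state unless terminal."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

rewards = [1.0] * 10
high_gamma_return = discounted_return(rewards, gamma=0.95)  # future rewards count heavily
low_gamma_return = discounted_return(rewards, gamma=0.5)    # future rewards fade quickly

# Hypothetical Q-value estimates for the next state.
target = q_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.95, terminal=False)
terminal_target = q_target(reward=-1.0, next_q_values=[0.5, 2.0], gamma=0.95, terminal=True)
```

With gamma at 0.95 the ten-step return is roughly four times larger than with gamma at 0.5, because later rewards keep most of their weight. On a terminal step the target is the reward alone, since there is no next state to bootstrap from.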

3. Neural Networks in Deep Q-Learning
Neural Network Architecture:
The neural network used in the CartPole problem takes the four state values as input, passes them through two hidden layers with 24 neurons each, and ends in an output layer with one neuron per possible action; in this case there are only two possible actions. The hidden layers use ReLU activation functions and the output layer uses a linear activation function, so each output is an unbounded Q-value estimate for one action.
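
As a quick check on the size of this network, the sketch below counts its trainable parameters from the layer sizes in the notebook's `DQNSolver.__init__`, taking CartPole's observation as four values and its action count as two. The helper name `dense_params` is an illustrative assumption.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: one weight per input-unit pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

# 4 state values -> 24 hidden units -> 24 hidden units -> 2 Q-values
layer_sizes = [4, 24, 24, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [120, 600, 50]
print(total_params)  # 770
```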

Efficiency in Q-Learning:
The neural network approximates the Q-values for each action given a state, allowing the DQN algorithm to handle large and continuous state spaces. This makes Q-Learning more efficient and applicable to complex problems where traditional Q-Learning would be impractical due to the high-dimensional state space.
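
To make the comparison with tabular Q-Learning concrete, the rough calculation below counts how many table entries a discretized Q-table would need for CartPole's four state values and two actions, next to the parameter count of the 4-24-24-2 network. The bin counts (10 and 100 per state value) are arbitrary assumptions for illustration; the notebook does not discretize the state.

```python
def table_entries(bins_per_value, n_state_values=4, n_actions=2):
    """Q-table size when each continuous state value is split into equal bins."""
    return bins_per_value ** n_state_values * n_actions

# Weights and biases of the three dense layers (4 -> 24 -> 24 -> 2).
network_params = (4 * 24 + 24) + (24 * 24 + 24) + (24 * 2 + 2)

coarse_table = table_entries(10)   # 10 bins per state value
fine_table = table_entries(100)    # 100 bins per state value

print(network_params)  # 770
print(coarse_table)    # 20000
print(fine_table)      # 200000000
```

The table grows with the fourth power of the resolution, while the network's size stays fixed regardless of how finely the state is measured.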

Impact of Learning Rate:
The learning rate, symbolized by alpha, determines the step size of each weight update during gradient-based optimization. In this notebook it is set to 0.001 and passed to the Adam optimizer.

-Increasing the learning rate can speed up the learning process but may lead to instability and divergence if too high.
-Decreasing the learning rate results in more stable learning but can slow down the convergence.
-Empirical tuning is needed to find a learning rate that is neither too high nor too low.
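
These three cases can be seen on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2, whose minimum is at x = 0, with three learning rates. This is a simplified stand-in and an assumption for illustration only: the notebook trains a neural network with the Adam optimizer, not plain gradient descent on a quadratic.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(x) = x**2 by repeatedly stepping against the gradient 2*x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x
        x -= learning_rate * gradient
    return x

too_low = gradient_descent(0.001)   # barely moves away from the start in 50 steps
balanced = gradient_descent(0.1)    # ends very close to the minimum at 0
too_high = gradient_descent(1.1)    # overshoots more on every step and diverges
```

Each step multiplies x by (1 - 2 * learning_rate), so the iterate shrinks toward 0 only when that factor has magnitude below 1, and shrinks slowly when the factor is close to 1.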

References
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.