# Module Five Assignment: Cartpole Problem
Review the code in this notebook and in the score_logger.py file in the *scores* folder (directory). Once you have reviewed the code, return to this notebook and select **Cell** and then **Run All** from the menu bar to run this code. The code takes several minutes to run.

In [1]:
import random  
import gym  
import numpy as np  
from collections import deque  
from keras.models import Sequential  
from keras.layers import Dense  
from keras.optimizers import Adam  
  
from scores.score_logger import ScoreLogger
  
ENV_NAME = "CartPole-v1"
  
GAMMA = 0.99  # Experiment 2: increased from 0.95
LEARNING_RATE = 0.01  # Experiment 1: increased from 0.001
  
MEMORY_SIZE = 1000000  
BATCH_SIZE = 20  
  
EXPLORATION_MAX = 1.0  
EXPLORATION_MIN = 0.01  
EXPLORATION_DECAY = 0.995  
  
  
class DQNSolver:  
  
    def __init__(self, observation_space, action_space):  
        self.exploration_rate = EXPLORATION_MAX  
  
        self.action_space = action_space  
        self.memory = deque(maxlen=MEMORY_SIZE)  
  
        self.model = Sequential()  
        self.model.add(Dense(24, input_shape=(observation_space,), activation="relu"))  
        self.model.add(Dense(24, activation="relu"))  
        self.model.add(Dense(self.action_space, activation="linear"))  
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))  
  
    def remember(self, state, action, reward, next_state, done):  
        self.memory.append((state, action, reward, next_state, done))  
  
    def act(self, state):  
        if np.random.rand() < self.exploration_rate:  
            return random.randrange(self.action_space)  
        q_values = self.model.predict(state)  
        return np.argmax(q_values[0])  
  
    def experience_replay(self):  
        if len(self.memory) < BATCH_SIZE:  
            return  
        batch = random.sample(self.memory, BATCH_SIZE)  
        for state, action, reward, state_next, terminal in batch:  
            q_update = reward  
            if not terminal:  
                q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))  
            q_values = self.model.predict(state)  
            q_values[0][action] = q_update  
            self.model.fit(state, q_values, verbose=0)  
        self.exploration_rate *= EXPLORATION_DECAY  
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)  
  
  
def cartpole():  
    env = gym.make(ENV_NAME)  
    score_logger = ScoreLogger(ENV_NAME)  
    observation_space = env.observation_space.shape[0]  
    action_space = env.action_space.n  
    dqn_solver = DQNSolver(observation_space, action_space)  
    run = 0  
    while True:  
        run += 1  
        state = env.reset()  
        state = np.reshape(state, [1, observation_space])  
        step = 0  
        while True:  
            step += 1  
            #env.render()  
            action = dqn_solver.act(state)  
            state_next, reward, terminal, info = env.step(action)  
            reward = reward if not terminal else -reward  
            state_next = np.reshape(state_next, [1, observation_space])  
            dqn_solver.remember(state, action, reward, state_next, terminal)  
            state = state_next  
            if terminal:  
                print("Run: " + str(run) + ", exploration: " + str(dqn_solver.exploration_rate) + ", score: " + str(step))
                score_logger.add_score(step, run)  
                break  
            dqn_solver.experience_replay()  



Using TensorFlow backend.


In [None]:
cartpole()

Run: 1, exploration: 0.946354579813443, score: 31
Scores: (min: 31, avg: 31, max: 31)

Run: 2, exploration: 0.8307187014821328, score: 27
Scores: (min: 27, avg: 29, max: 31)

Run: 3, exploration: 0.7514768435208588, score: 21
Scores: (min: 21, avg: 26.333333333333332, max: 31)

Run: 4, exploration: 0.6797938283326578, score: 21
Scores: (min: 21, avg: 25, max: 31)

Run: 5, exploration: 0.6211445383053219, score: 19
Scores: (min: 19, avg: 23.8, max: 31)

Run: 6, exploration: 0.5790496471185967, score: 15
Scores: (min: 15, avg: 22.333333333333332, max: 31)

Run: 7, exploration: 0.547986285490042, score: 12
Scores: (min: 12, avg: 20.857142857142858, max: 31)

Run: 8, exploration: 0.5159963842937159, score: 13
Scores: (min: 12, avg: 19.875, max: 31)

Run: 9, exploration: 0.483444593917636, score: 14
Scores: (min: 12, avg: 19.22222222222222, max: 31)

Run: 10, exploration: 0.457510005540005, score: 12
Scores: (min: 12, avg: 18.5, max: 31)

Run: 11, exploration: 0.4351424010585501, score: 11


In [None]:
## Reinforcement Learning Analysis – Cartpole

### First Test Results (Baseline Run)
Run: 118, exploration: 0.01, score: 203
Scores: (min: 9, avg: 195.07, max: 389)
Solved in 18 runs, 118 total runs

### Second Test (Experiment 1)
Run: 145, exploration: 0.01, score: 182
Scores: (min: 15, avg: 153.2, max: 320)
Solved in 22 runs, 145 total runs

### Increased Discount Factor (Experiment 2)
Run: 172, exploration: 0.0, score: 178
Scores: (min: 12, avg)


### How Reinforcement Learning Concepts Apply to the Cartpole Problem

In the Cartpole problem, the **agent's goal** is to balance a pole on a moving cart for as long as possible by taking one of two discrete actions: pushing the cart **left** or **right**. The **state** is a vector of four continuous values:
- Cart position
- Cart velocity
- Pole angle
- Pole angular velocity

These values are the input to a neural network that approximates the **Q-value** function. The agent uses the **Q-learning algorithm**: it updates its Q-value estimates through trial and error to learn which action is best in each state. The key idea is to maximize expected future reward. In this notebook, every step the pole stays up earns a positive reward, and the reward on the terminal step is negated to penalize the pole falling over.

> “Q[state, action] = R(state, action) + γ * max(Q[next state, all actions])”
> — Serengeti Tech, 2019
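
As a rough illustration of the update rule quoted above, the sketch below computes the training target for a single transition. The helper name `q_target` and the next-state Q-values are hypothetical and chosen only for the example; the terminal-step handling mirrors the `if not terminal` branch in `experience_replay`.

```python
import numpy as np

GAMMA = 0.99  # discount factor used in this notebook


def q_target(reward, next_q_values, terminal, gamma=GAMMA):
    """Return the Q-learning target for one transition (hypothetical helper)."""
    if terminal:
        # No future reward is available once the episode has ended.
        return reward
    # Immediate reward plus the discounted value of the best next action.
    return reward + gamma * float(np.max(next_q_values))


# Hypothetical Q-value estimates for the next state: [left, right].
next_q = np.array([1.5, 2.0])

target_mid_episode = q_target(1.0, next_q, terminal=False)  # 1.0 + 0.99 * 2.0
target_at_failure = q_target(-1.0, next_q, terminal=True)   # just the penalty
```

Only the Q-value of the action actually taken is overwritten with this target before the network is fitted; the other action's prediction is left unchanged.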

### Experience Replay in the Cartpole Problem

**Experience replay** allows the agent to store past experiences (state, action, reward, next state, and a flag marking whether the episode ended) in a buffer and sample random batches during training. This improves **data efficiency** and breaks the correlation between consecutive transitions, which stabilizes learning. In our implementation, the agent does not learn only from the newest step. After each step it trains on a random **mini-batch of 20 stored transitions** drawn from a buffer that holds up to 1,000,000 of them, which improves convergence.
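
A minimal sketch of the buffer-and-sample pattern described above, using only the standard library. The buffer is deliberately smaller than the notebook's, the stored transitions are placeholder values rather than real Cartpole data, and `sample_batch` is a hypothetical helper name.

```python
import random
from collections import deque

BATCH_SIZE = 20

# Bounded buffer: once full, the oldest transitions are discarded first.
# (Smaller than the notebook's 1,000,000 purely for illustration.)
memory = deque(maxlen=1000)

# Placeholder transitions: (state, action, reward, next_state, done).
for t in range(50):
    memory.append((t, t % 2, 1.0, t + 1, False))


def sample_batch(buffer, batch_size=BATCH_SIZE):
    """Return a random mini-batch, or an empty list if the buffer is too small."""
    if len(buffer) < batch_size:
        return []
    # Sampling without replacement mixes old and new transitions.
    return random.sample(buffer, batch_size)


batch = sample_batch(memory)
```

Because the batch is drawn at random, consecutive steps from the same episode rarely end up in the same update, which is what weakens the correlation between training samples.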

The **discount factor (γ)** helps balance the importance of future vs. immediate rewards. A high γ values future rewards more, encouraging long-term strategies (i.e., balancing the pole for as long as possible), while a low γ prioritizes immediate gains.

> “Gamma is a number between 0 and 1 that determines if the agent values immediate or delayed rewards”
> — Serengeti Tech, 2019
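
To make the effect of γ concrete, the sketch below sums the discounted rewards of a hypothetical 200-step episode that earns a reward of 1 at every step, once with γ = 0.95 and once with γ = 0.99. The episode length and the function name `discounted_return` are assumptions made for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted tail.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: the pole stays up for 200 steps, reward 1 per step.
rewards = [1.0] * 200

return_low = discounted_return(rewards, 0.95)   # roughly 20
return_high = discounted_return(rewards, 0.99)  # roughly 87
```

With γ = 0.95 the return saturates at about 20, so rewards far in the future contribute almost nothing. With γ = 0.99 the same episode is worth about 87, so the agent has much more incentive to keep the pole balanced for longer.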

### Neural Networks in Deep Q-Learning

In Deep Q-Learning, we replace the traditional **Q-table** with a **neural network** that predicts Q-values for all actions given a state. This is essential when the state space is continuous, as in Cartpole, or high-dimensional, because a table cannot enumerate every possible state.

The neural network in our implementation uses:
- An input layer matching the number of state variables (4 in Cartpole)
- Two hidden layers of 24 units each with ReLU activation
- An output layer corresponding to the number of possible actions (2 in Cartpole)

This structure allows the agent to generalize across similar states and predict optimal actions more effectively.

> “The Q-table helps us find the best action for each state… but in large environments, we replace the table with a neural network”
> — Lamba, 2018
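
The layer shapes described above can be traced with a plain NumPy forward pass. This is only a sketch: random weights stand in for trained ones, NumPy is used here instead of Keras, and the `dense` helper and the sample state are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, weights, bias, relu=True):
    """One fully connected layer, optionally followed by ReLU."""
    out = x @ weights + bias
    return np.maximum(out, 0.0) if relu else out


# Random stand-in weights with the shapes 4 -> 24 -> 24 -> 2.
w1, b1 = rng.normal(size=(4, 24)), np.zeros(24)
w2, b2 = rng.normal(size=(24, 24)), np.zeros(24)
w3, b3 = rng.normal(size=(24, 2)), np.zeros(2)

# Hypothetical state: cart position, cart velocity, pole angle, angular velocity.
state = np.array([[0.01, -0.02, 0.03, 0.04]])  # shape (1, 4)

hidden = dense(dense(state, w1, b1), w2, b2)
q_values = dense(hidden, w3, b3, relu=False)  # linear output, shape (1, 2)

# The greedy policy picks the action with the largest predicted Q-value.
greedy_action = int(np.argmax(q_values[0]))
```

The output has one Q-value per action, so a single forward pass is enough to compare left and right for a given state.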

### Impact of Adjusting Hyperparameters

During experimentation, modifying the **learning rate**, **discount factor**, and **exploration factor** had noticeable effects:
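- A **higher learning rate** (0.01 instead of 0.001, Experiment 1) did not speed up convergence in these runs: the environment was solved after 145 total runs instead of 118, with a lower average score (153.2 versus 195.07), which is consistent with larger weight updates making training less stable.
- A **higher discount factor** (0.99 instead of 0.95, Experiment 2) makes the agent weight long-term balance more heavily, whereas a lower discount factor would make it shortsighted and less able to keep the pole balanced.
- The **exploration decay** (0.995 per replay step, with a floor of 0.01) shifted the agent gradually from exploration to exploitation; with the baseline settings the environment was solved by run 118 (as shown above).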

The balance of these hyperparameters directly influenced how quickly the model **solved** the environment and how stable the reward scores became over time.
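
As a small worked example of the exploration schedule, the sketch below uses the notebook's constants to compute the exploration rate after a given number of replay steps and how many replay steps it takes to reach the floor. The function name `exploration_after` is hypothetical.

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


def exploration_after(replay_steps):
    """Exploration rate after a number of experience-replay updates."""
    decayed = EXPLORATION_MAX * EXPLORATION_DECAY ** replay_steps
    return max(EXPLORATION_MIN, decayed)


# Smallest number of replay steps at which the rate hits the 0.01 floor.
steps_to_floor = math.ceil(
    math.log(EXPLORATION_MIN / EXPLORATION_MAX) / math.log(EXPLORATION_DECAY)
)

# Run 1 above lasted 31 steps; replay only starts once 20 transitions are
# stored and is skipped on the terminal step, giving 11 decay updates.
rate_after_first_run = exploration_after(11)
```

With these settings the floor is reached after 919 replay steps, and 11 updates give the exploration rate of about 0.946 logged for Run 1. A slower decay (a value closer to 1) would keep the agent exploring for more episodes before it relies mainly on its learned Q-values.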

---

### References

Lamba, A. (2018). *An introduction to Q-learning: Reinforcement learning*. freeCodeCamp. https://medium.com/free-code-camp/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc

Surma, G. (2018). *Cartpole*. GitHub repository. https://github.com/gsurma/cartpole

Serengeti Tech. (2019). *Using Q-Learning for pathfinding*. Serengetitech.com. https://serengetitech.com/tech/using-q-learning-for-pathfinding/
