# Mountain Car

Q-learning agent that plays Mountain Car using a Q-table. Based on the tutorial by sentdex [here](https://www.youtube.com/watch?v=yMk_XtIEzH8&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7). The code is updated from the tutorial to comply with changes in the latest version of OpenAI Gym.

## Q Learning
Q-Learning is a type of reinforcement learning algorithm that aims to learn the quality of actions, denoted as Q-values, which indicate the potential long-term rewards of taking an action in a given state. It is an off-policy algorithm, meaning it learns the value of the optimal policy regardless of the (possibly exploratory) policy the agent follows while collecting experience.

#### Key Concepts

1.	__State (S)__: Represents the current situation or configuration in which the agent finds itself.
2.	__Action (A)__: Represents the set of all possible moves or decisions the agent can make.
3.	__Reward (R)__: The immediate feedback received after taking an action in a state.
4.	__Q-Value (Q)__: Represents the expected future rewards for an action taken in a given state, considering all possible future states.

#### Q-Learning Formula

The Q-value update rule is given by:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

Where:

- $Q(s, a)$: Current Q-value for state $s$ and action $a$.
- $\alpha$: Learning rate $(0 < \alpha \le 1)$, determines how much new information overrides the old information.
- $r$: Immediate reward after taking action $a$ in state $s$.
- $\gamma$: Discount factor $(0 \le \gamma < 1)$, determines the importance of future rewards.
- $\max_{a'} Q(s', a')$: Maximum Q-value for the next state $s'$ across all possible actions $a'$.
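
As a worked instance of the update rule above, the sketch below applies a single update with illustrative numbers. The function name `q_update` and the sample values are assumptions made for this example and are not part of the notebook's code.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.95):
    """Return the updated Q(s, a) after one Q-learning step."""
    # TD target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * max_next_q
    # Move the old estimate a fraction alpha of the way toward the target
    return q_sa + alpha * (td_target - q_sa)

# Illustrative values (assumed): Q(s, a) = -1.0, r = -1.0, max_a' Q(s', a') = -0.5
new_q = q_update(q_sa=-1.0, reward=-1.0, max_next_q=-0.5)
print(new_q)
```

Here the TD target is $-1 + 0.95 \times (-0.5) = -1.475$, so the estimate moves one tenth of the way from $-1$ toward it, giving approximately $-1.0475$.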

#### Steps in Q-Learning Algorithm

1.	__Initialize__: Initialize the Q-table with arbitrary values (commonly zeros). The Q-table has dimensions |S| x |A|, where |S| is the number of states and |A| is the number of actions.
2.	__Loop__: For each episode:
  - Initialize the state $s$.
  - For each step in the episode:
    - Choose an action $a$ using an epsilon-greedy strategy.
    - Take the action $a$, observe the reward $r$ and the next state $s'$.
    - Update the Q-value using the Q-learning formula.
    - Transition to the next state $s'$.
3. __Repeat__: Continue the loop until the policy converges or a stopping criterion is met.
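
The steps above can be sketched on a toy problem that needs no external libraries. The five-state corridor, the function name `train_chain`, and all hyperparameter values below are assumptions chosen for illustration; they are unrelated to the Mountain Car settings used later. At a terminal transition the sketch uses the reward alone as the target, which is a common convention rather than the only possible one.

```python
import random

def train_chain(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 5-state corridor (goal = state 4)."""
    rng = random.Random(seed)
    n_states, n_actions, goal = 5, 2, 4      # actions: 0 = left, 1 = right
    # Step 1: initialize the Q-table (|S| x |A|) with zeros
    q = [[0.0] * n_actions for _ in range(n_states)]

    # Step 2: loop over episodes
    for _ in range(episodes):
        s = 0                                # initialize the state
        for _ in range(100):                 # cap the episode length
            # Choose an action with an epsilon-greedy strategy
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: q[s][i])
            # Take the action, observe the reward and the next state
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = -1.0
            done = s_next == goal
            # Update the Q-value; no bootstrapping from a terminal state
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            # Transition to the next state
            s = s_next
            if done:
                break
    return q

q = train_chain()
# Greedy action in each non-terminal state (1 means "move right")
greedy = [max(range(2), key=lambda i: q[s][i]) for s in range(4)]
print(greedy)
```

Because every step costs -1, the learned greedy policy should head straight for the goal, mirroring the incentive structure of Mountain Car on a much smaller state space.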


## Note on the MountainCar Environment in OpenAI Gym

The MountainCar environment is a classic reinforcement learning problem provided by OpenAI Gym. It is a simple environment that is often used for testing and demonstrating reinforcement learning algorithms.

#### Environment Description

- __Goal__: The goal is to drive an underpowered car up a steep hill. The car is positioned in a valley and lacks sufficient power to simply accelerate up the hill. Therefore, the agent must learn to build momentum by driving back and forth.
- __State Space__: The state space is continuous and consists of two variables:
    - __Position__: The position of the car on the x-axis. It ranges from -1.2 to 0.6.
	- __Velocity__: The velocity of the car, ranging from -0.07 to 0.07.
- __Action Space__: The action space is discrete with three possible actions, encoded as the integers 0 to 2:
	- `0`: Accelerate to the left.
	- `1`: Do nothing.
	- `2`: Accelerate to the right.
- __Rewards__: The agent receives a reward of -1 for each time step until it reaches the goal. The objective is to reach the goal state as quickly as possible to minimize the cumulative negative reward.
- __Goal State__: The goal is reached when the car’s position is equal to or greater than 0.5. Upon reaching the goal, the episode terminates.
- __Episode Termination__: An episode terminates when:
	- The car reaches the goal position (0.5).
	- The episode length reaches 200 time steps.
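
To see why momentum is needed, the following sketch re-implements the car's dynamics in plain Python and compares two hand-written policies. The constants (push force 0.001, gravity term 0.0025) are taken from the classic formulation of the problem and are assumed to match the Gym implementation; `step` and `run` are hypothetical helpers written for this example, not Gym functions.

```python
import math

def step(position, velocity, action):
    """One transition; action is 0 (left), 1 (no push) or 2 (right)."""
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))    # velocity bounds
    position += velocity
    position = max(-1.2, min(0.6, position))      # position bounds
    if position == -1.2 and velocity < 0:         # the left wall stops the car
        velocity = 0.0
    return position, velocity

def run(policy, position=-0.5, max_steps=200):
    """Return the number of steps needed to reach the goal, or None if truncated."""
    velocity = 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= 0.5:
            return t
    return None

# Policy 1: always accelerate to the right
always_right = run(lambda p, v: 2)
# Policy 2: push in the direction the car is already moving (builds momentum)
follow_velocity = run(lambda p, v: 2 if v >= 0 else 0)
print(always_right, follow_velocity)
```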

In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
import os
import gym
import pickle
import numpy as np


In [3]:
env = gym.make("MountainCar-v0", render_mode="human")

In [4]:
# Render every SHOW_EVRY-th episode (optional)
SHOW_EVRY = 500

# Discretize each observation dimension into 20 bins
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)

# Step size for Q-value updates (alpha in the update rule)
LEARNING_RATE = 0.1
# How strongly future rewards are valued (gamma in the update rule)
DISCOUNT = 0.95
# Number of training episodes
EPISODES = 5000



## Epsilon 
Epsilon ($\epsilon$) is a parameter used in the epsilon-greedy strategy, which balances exploration and exploitation during the learning process. The strategy defines the behavior of the agent using the following rule:
    
- With probability ε, the agent chooses a random action (exploration).
- With probability 1 - ε, the agent chooses the action with the highest Q-value for the current state (exploitation).

Balancing these two is known as the exploration-exploitation trade-off. If the agent only exploits, it might get stuck in suboptimal policies because it never tries unexplored actions that could lead to higher rewards. If it only explores, it might not leverage its knowledge effectively to achieve high rewards.

$\epsilon$ is not constant and is decayed over time to shift the balance from exploration to exploitation as the agent learns more about the environment. This can be done using a decay function, such as:

- Linear Decay: Decrease ε by a fixed amount after each episode. (Implemented here)
- Exponential Decay: Decrease ε exponentially after each episode.

In this particular example (Mountain Car), the benefit might not be obvious, but exploration helps our agent keep trying new approaches even after it has learned one that 'works', which might lead to a more efficient solution.
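
The rule and the two decay schedules described above can be sketched as follows. The helper names (`epsilon_greedy`, `linear_decay`, `exponential_decay`) and the `rate` and `floor` parameters are assumptions introduced for this example; the notebook itself only uses the linear variant.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                        # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploitation

def linear_decay(eps, step_size, floor=0.0):
    """Subtract a fixed amount per episode, never going below floor."""
    return max(floor, eps - step_size)

def exponential_decay(eps, rate=0.999, floor=0.0):
    """Multiply by a constant factor per episode, never going below floor."""
    return max(floor, eps * rate)

# With epsilon = 0 the choice is purely greedy: index of the largest Q-value
action = epsilon_greedy([-1.0, -0.2, -0.7], epsilon=0.0)
print(action)
```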

In [5]:
# When to start epsilon decaying 
START_EPSILON_DECAYING = 1

# When to stop epsilon decaying 
END_EPSILON_DECAYING = EPISODES // 2

# Now let's calculate the actual epsilon decay value
epsilon = 1
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)
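
As a quick check of the schedule defined above, this standalone sketch replays the same decay arithmetic without training and records epsilon at a few episodes. The `trace` dictionary and the chosen checkpoints are illustrative assumptions; the decay condition is the one used in the training loop.

```python
EPISODES = 5000
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2

epsilon = 1.0
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

trace = {}
for episode in range(EPISODES):
    if episode in (0, 1250, 2500, 4999):
        trace[episode] = epsilon      # value in effect during this episode
    # Same decay condition as in the training loop
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

for episode, value in trace.items():
    print(f"episode {episode:4d}: epsilon = {value:.4f}")
```

Epsilon is roughly 0.5 at episode 1250 and essentially zero from episode 2500 on. Because the decay range includes both endpoints, the final value ends marginally below zero; this is harmless, since `np.random.random() > epsilon` is then always true and the agent acts greedily.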

## Convert continuous state to discrete state

This function takes a continuous state from the environment and maps it to a discrete state. This is necessary in tabular Q-Learning, where states must be represented in a finite and manageable way. Without discretization, the number of distinct states would be effectively unbounded, making a Q-table impractical.

In [6]:
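# Width of one bin along each observation dimension
discrete_obs_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

def get_discrete_state(state):
    """Map a continuous observation to a tuple of bin indices."""
    discrete_state = (state - env.observation_space.low) / discrete_obs_win_size
    # Clip so that an observation exactly at the upper bound stays inside the table
    discrete_state = np.clip(discrete_state.astype(np.int32), 0, np.array(DISCRETE_OS_SIZE) - 1)
    return tuple(discrete_state)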
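
For a concrete feel of the mapping, here is a plain-Python sketch of the same binning with the bounds written out by hand. The bounds are those listed in the environment description; `LOW`, `HIGH`, `BINS`, `WIN`, and `to_bins` are hypothetical names for this example, and the clamp to the last bin is an addition made here so that values on the upper bound remain valid indices.

```python
LOW = (-1.2, -0.07)    # lower bounds: (position, velocity)
HIGH = (0.6, 0.07)     # upper bounds: (position, velocity)
BINS = (20, 20)        # number of bins per dimension

# Width of one bin in each dimension: about 0.09 for position, 0.007 for velocity
WIN = tuple((h - l) / b for l, h, b in zip(LOW, HIGH, BINS))

def to_bins(state):
    """Map (position, velocity) to a pair of bin indices."""
    return tuple(
        min(b - 1, int((x - l) / w))   # clamp to the last valid bin
        for x, l, w, b in zip(state, LOW, WIN, BINS)
    )

example = to_bins((-0.5, 0.01))
print(example)
```

A car at position -0.5 sits 0.7 above the lower bound, which is 7.78 bin widths, so it lands in bin 7; a velocity of 0.01 is 0.08 above its lower bound, or 11.43 bin widths, so it lands in bin 11.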

## Load saved Q-table if it exists

In [7]:
# File name for storing the Q-table
q_table_file = "mountain_car_q_table.pkl"

# Check if the Q-table file exists
if os.path.isfile(q_table_file):
    # If the file exists, load the Q-table from the file
    print("Found a stored Q-table, reading that")
    with open(q_table_file, "rb") as fp:  # Unpickling
        q_table = pickle.load(fp)
else:
    # If the file does not exist, initialize a new Q-table with random values
    q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))

Found a stored Q-table, reading that


## Reset the environment to initial state

In [8]:
# Reset the environment to an initial state
obs, _ = env.reset()

# Convert the continuous observation to a discrete state
ds = get_discrete_state(obs)


In [9]:
for episode in range(EPISODES):
    # Determine render mode for visualization
    if episode % SHOW_EVRY == 0:
        render_mode = "human"
    else:
        render_mode = None
    
    # Create environment instance with the specified render mode
    env = gym.make("MountainCar-v0", render_mode=render_mode)
    
    # Initialize state
    init_state, _ = env.reset()
    # Convert continuous state to discrete state
    discrete_state = get_discrete_state(init_state)
    done = False
    
    while not done:
        # Choose action based on epsilon-greedy policy
        if np.random.random() > epsilon:
            # Get action from Q table (exploitation)
            action = np.argmax(q_table[discrete_state])
        else:
            # Get random action (exploration)
            action = np.random.randint(0, env.action_space.n)

        # Take the chosen action and observe the result
        obs, reward, terminated, truncated, info = env.step(action)
        # Convert new continuous state to discrete state
        new_discrete_state = get_discrete_state(obs)
        done = (terminated or truncated)
        
        if not done:
            # Get the maximum Q value for the new discrete state
            max_future_q = np.max(q_table[new_discrete_state])
            # Get the current Q value for the current state-action pair
            current_q = q_table[discrete_state + (action, )]
            # Calculate new Q value: blend the old estimate with the target r + gamma * max Q(s', a')
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            # Update Q table with new Q value
            q_table[discrete_state + (action,)] = new_q
        elif obs[0] >= env.goal_position:
            # If goal is reached, set Q value to 0
            q_table[discrete_state + (action,)] = 0
            print(f"Made it on episode {episode}")
        
        # Update discrete state
        discrete_state = new_discrete_state
    
    # Decay epsilon if within the decaying range
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

    # Close this episode's environment so that instances do not accumulate
    env.close()

Made it on episode 1428
Made it on episode 1441
Made it on episode 1464
Made it on episode 1465
Made it on episode 1467
Made it on episode 1472
Made it on episode 1505
Made it on episode 1513
Made it on episode 1514
Made it on episode 1517
Made it on episode 1518
Made it on episode 1520
Made it on episode 1528
Made it on episode 1543
Made it on episode 1549
Made it on episode 1583
Made it on episode 1584
Made it on episode 1591
Made it on episode 1593
Made it on episode 1596
Made it on episode 1603
Made it on episode 1606
Made it on episode 1609
Made it on episode 1611
Made it on episode 1612
Made it on episode 1614
Made it on episode 1615
Made it on episode 1621
Made it on episode 1622
Made it on episode 1623
Made it on episode 1625
Made it on episode 1627
Made it on episode 1633
Made it on episode 1634
Made it on episode 1635
Made it on episode 1642
Made it on episode 1649
Made it on episode 1650
Made it on episode 1651
Made it on episode 1653
Made it on episode 1656
[output truncated]

## Save the Q-Table

In [10]:
with open(q_table_file, "wb") as fp:   #Pickling
    pickle.dump(q_table, fp)

## Let’s run the game a few times using the learned Q-Table

In [11]:
print("Test runs with learned Q-Table .... ")
for i in range(10):
    env = gym.make("MountainCar-v0", render_mode='human')
    init_state, _ = env.reset()
    discrete_state = get_discrete_state(init_state)

    done = False
    while not done:
        action = np.argmax(q_table[discrete_state])
        obs, reward, terminated, truncated, info = env.step(action)
        new_discrete_state = get_discrete_state(obs)
        done = (terminated or truncated)
        # No learning during test runs: only report whether the episode ended at the goal
        if done and obs[0] >= env.goal_position:
            print(f"Made it on iteration {i}")
        discrete_state = new_discrete_state
    env.close()

Test runs with learned Q-Table .... 
Made it on iteration 0
Made it on iteration 1
Made it on iteration 2
Made it on iteration 3
Made it on iteration 4
Made it on iteration 5
Made it on iteration 6
Made it on iteration 7
Made it on iteration 8
Made it on iteration 9

