<a href="https://colab.research.google.com/github/wolfsinem/GYM-reinforcementLearning/blob/main/notebooks/week1/Reinforcement_Learning_project_Q_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Learning project - Q-Learning




### Preparation

Some dependencies need to be installed for the code to work. Furthermore, we will define some methods which allow us to show the OpenAI Gym renderings in this (headless) Google Colab environment.

You only have to run these and don't need to change any of the code.

In [1]:
# Install dependencies
"""Note: if you are running this code on your own machine, you probably don't need all of these.
   Start with 'pip install gym' and install more packages if you run into errors."""
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg cmake > /dev/null 2>&1

!pip install gym pyvirtualdisplay > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install colabgymrender

Requirement already up-to-date: setuptools in /usr/local/lib/python3.7/dist-packages (56.2.0)


### Imports for helper functions

In [2]:
import base64
import io
import math
from pathlib import Path

import gym
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
from colabgymrender.recorder import Recorder
from google.colab import drive
from gym.wrappers import Monitor
from IPython import display as ipythondisplay
from IPython.display import HTML
from pyvirtualdisplay import Display

In [3]:
# Mount your Google Drive. By doing so, you can store any output, models, videos, and images persistently.
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [4]:
# Create a directory to store the data for this lab. Feel free to change this.
data_path = Path('/content/gdrive/My Drive/Colab Notebooks/HU_RL/part1')
data_path.mkdir(parents=True, exist_ok=True)
video_path = data_path / 'video'

In [5]:
# Define helper functions to visually show what the models are doing.
%matplotlib inline

gym.logger.set_level(gym.logger.ERROR)

display = Display(visible=0, size=(1400, 900))
display.start()

def show_video():
    # Display the stored video file
    # Credits: https://star-ai.github.io/Rendering-OpenAi-Gym-in-Colaboratory/
    mp4list = sorted(data_path.glob('video/*.mp4'))
    if len(mp4list) > 0:
        mp4 = mp4list[-1]  # last file by name
        video = mp4.read_bytes()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
            </video>'''.format(encoded.decode('ascii'))))
    else: 
        print('Could not find video')


def record_episode(idx):
    # This determines which episodes to record.
    # Since the video rendering in the OpenAI Gym is a bit buggy, we simply override it and decide
    # whether or not to render inside of our training loop.
    return True

    
def video_env(env):
    # Wraps the environment to write its output to a video file
    env = Monitor(env, video_path, video_callable=record_episode, force=True)
    return env


<pyvirtualdisplay.display.Display at 0x7ff65c442b10>

### Test the environment

In [65]:
"""We will use a basic OpenAI Gym example: CartPole-v0.
In this example, we will try to balance a pole on a cart.
This is similar to kids (and.. grown-ups) trying to balance sticks on their hands.

Check out the OpenAI Gym documentation to learn more: https://gym.openai.com/docs/"""

# Create the desired environment
env = gym.make("CartPole-v0")

# Wrap the environment, to make sure we get to see a fancy video
env = video_env(env)

# Before you can use a Gym environment, it needs to be reset.
state = env.reset()

# Perform random actions until we drop the stick. Just as an example.
done = False
while not done:
    env.render()
    # The action_space contains all possible actions we can take.
    random_action = env.action_space.sample() 

    # After each action, we end up in a new state and receive a reward.
    # When we drop the pole (more than 12 degrees), or balance it long enough (200 steps),
    # or drive off the screen, done is set to True.
    state, reward, done, info = env.step(random_action)

# Show the results!
env.close()
# show_video()


In [66]:
# Neat, it did something (randomly)! 

# In order to train the system, we will try to predict the reward that a certain action yields given the state of the system.
# But what is the state anyway?

# In this environment, the state represents the cart's position and velocity, and the pole's angle and velocity.

# Let's check out the current state
print(f'State array: {state}')
print(f'Cart position: {state[0]} (range: [-4.8, 4.8])')
print(f'Cart velocity: {state[1]} (range: [-inf, inf])')
print(f'Pole angle: {state[2]} (range: [-0.418, 0.418])')
print(f'Pole velocity: {state[3]} (range [-inf, inf])')

# You can find out the minimum and maximum possible observation values using:
print('Low observation space:', env.observation_space.low)
print('High observation space:', env.observation_space.high)

State array: [-0.06595657 -1.0341383   0.24015033  1.9491046 ]
Cart position: -0.06595657322407936 (range: [-4.8, 4.8])
Cart velocity: -1.034138304951036 (range: [-inf, inf])
Pole angle: 0.24015032993331345 (range: [-0.418, 0.418])
Pole velocity: 1.9491046034023012 (range [-inf, inf])
Low observation space: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
High observation space: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


### Implement Q-Learning

Implement Q-Learning and find suitable parameters to reach the maximum reward of 200 per episode.
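
For reference, tabular Q-learning nudges one table entry toward the received reward plus the discounted best value of the next state: Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', ·) − Q(s, a)). The sketch below applies that rule to a made-up 2×2 table. The function name `q_update` and the toy table are illustrative assumptions, not part of the assignment code.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q_table[next_state])            # value of the greedy action in next_state
    td_target = reward + gamma * best_next          # what the entry should move toward
    td_error = td_target - q_table[state][action]   # how far off the current estimate is
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Toy table: 2 states x 2 actions, all zeros (hypothetical, not the CartPole table).
toy_q = [[0.0, 0.0], [0.0, 0.0]]
first = q_update(toy_q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=1.0)
second = q_update(toy_q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=1.0)
print(first, second)
```

Each call moves the entry halfway (α = 0.5) toward the target of 1.0, so repeated updates converge on the target instead of jumping to it.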

### Define parameters - Fill in the dots

In [67]:
num_episodes = 1000
num_steps = 200

num_episodes_between_status_update = 500
num_episodes_between_videos = 5000

n_actions = env.action_space.n
n_states = env.observation_space.shape[0]

min_alpha = 0.1       # minimum learning rate
min_epsilon = 0.1     # minimum exploration rate
gamma = 1             # discount factor

### Q-Table creation

In [68]:
# Initialize the Q-table with zeros: one entry per (discretized state, action) pair
buckets = (1,1,6,12)

upper_bounds = [env.observation_space.high[0], 0.5, env.observation_space.high[2], math.radians(50)]
lower_bounds = [env.observation_space.low[0], -0.5, env.observation_space.low[2], -math.radians(50)]

Q = np.zeros(buckets + (n_actions,)) 
print('Initial Q table:', Q.shape)

Initial Q table: (1, 1, 6, 12, 2)
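
To see what the bucket counts `(1, 1, 6, 12)` mean in practice, here is a small standalone sketch that maps the state printed earlier to its table index. It uses the same scale-round-clamp rule as the notebook's `discretize_state`. The helper name `to_bucket` and the hard-coded bounds are assumptions made so the example runs without a Gym environment.

```python
import math

# Bounds as used in this notebook: position and angle limits copied from the printed
# observation space, velocities clipped by hand because the environment reports them as unbounded.
upper = [4.8, 0.5, 0.41887903, math.radians(50)]
lower = [-4.8, -0.5, -0.41887903, -math.radians(50)]
bucket_counts = (1, 1, 6, 12)


def to_bucket(observation):
    """Scale each value to [0, 1], round to the nearest bucket index, then clamp."""
    indices = []
    for value, low, high, n in zip(observation, lower, upper, bucket_counts):
        ratio = (value - low) / (high - low)
        index = int(round((n - 1) * ratio))
        indices.append(min(n - 1, max(0, index)))  # out-of-range values go to the edge buckets
    return tuple(indices)


# The final state printed earlier in the notebook.
sample = (-0.06595657, -1.0341383, 0.24015033, 1.9491046)
bucket = to_bucket(sample)
print(bucket)
```

With a single bucket for cart position and cart velocity, those two dimensions always map to index 0. In effect the table only distinguishes states by pole angle and pole angular velocity.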


### Train


In [69]:
# functions
def discretize_state(state):
    """
    A Q-table cannot practically handle infinite states, so limit the state space by
    discretizing the state into buckets.
    """
    ratios = [(state[i] + abs(lower_bounds[i])) / (upper_bounds[i] - lower_bounds[i]) for i in range(len(state))]
    discrete_state = [int(round((buckets[i] - 1) * ratios[i])) for i in range(len(state))]
    discrete_state = [min(buckets[i] - 1, max(0, discrete_state[i])) for i in range(len(state))]
    return tuple(discrete_state)

def take_action(state, epsilon):
    """
    Take an action to either explore or exploit based on epsilon
    """
    if (np.random.random() < epsilon):
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state])
    return action

def estimated_max_for_next_state(new_state):
    """
    What's the best expected Q-value for the next state?
    """
    return np.max(Q[new_state])

def new_q_value(state, action, reward, new_state, alpha):
    """
    Calculate the new Q-value
    """
    Q[state][action] += alpha * (reward + gamma * np.max(Q[new_state]) - Q[state][action])
  
def get_epsilon(episode):
    """
    Decrease the exploration rate at each episode
    """
    return max(min_epsilon, min(1, 1.0 - math.log10((episode + 1) / 25)))

def get_alpha(episode):
    """
    Decrease the learning rate at each episode
    """
    return max(min_alpha, min(1.0, 1.0 - math.log10((episode + 1) / 25)))
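
To get a feel for the decay schedule used by `get_epsilon` and `get_alpha`, the following sketch evaluates the same formula at a few episode numbers. The helper name `decayed_rate` and its `floor` and `divisor` parameters are introduced here for illustration; the defaults 0.1 and 25 mirror the values in this notebook.

```python
import math


def decayed_rate(episode, floor=0.1, divisor=25):
    """Same formula as get_epsilon/get_alpha: 1.0 early on, then a logarithmic decay down to the floor."""
    return max(floor, min(1.0, 1.0 - math.log10((episode + 1) / divisor)))


for episode in (0, 24, 99, 249, 999):
    print(f'episode {episode + 1:4d}: rate = {decayed_rate(episode):.3f}')
```

The rate stays at 1.0 for the first 25 episodes and reaches the 0.1 floor at roughly episode 200, so in a 1000-episode run most of the training happens at the minimum rates.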

In [70]:
# Time to train the system
rewards = [] 

for episode in range(num_episodes):
    current_state = env.reset() # Don't forget to reset the environment between episodes
    current_state = discretize_state(current_state)

    alpha = get_alpha(episode)
    epsilon = get_epsilon(episode)

    reward_sum = 0

    for t in range(num_steps):
        action = take_action(current_state, epsilon)
        new_state, reward, done, _ = env.step(action)
        new_state = discretize_state(new_state)

        new_q_value(current_state, action, reward, new_state, alpha)
        current_state = new_state

        reward_sum += reward

        # at the end of the episode
        if done:
            print(f'Total reward at episode {episode + 1}: {reward_sum}')
            break

    rewards.append(reward_sum)
print(f'Average reward over {episode + 1} episodes: {round(sum(rewards)/len(rewards))}')
print(f'Reward of 200 appeared {rewards.count(200)} times in {episode + 1} episodes')

Total reward at episode 1: 52.0
Total reward at episode 2: 28.0
Total reward at episode 3: 30.0
Total reward at episode 4: 28.0
Total reward at episode 5: 12.0
Total reward at episode 6: 9.0
Total reward at episode 7: 24.0
Total reward at episode 8: 31.0
Total reward at episode 9: 36.0
Total reward at episode 10: 21.0
Total reward at episode 11: 9.0
Total reward at episode 12: 48.0
Total reward at episode 13: 24.0
Total reward at episode 14: 14.0
Total reward at episode 15: 67.0
Total reward at episode 16: 47.0
Total reward at episode 17: 27.0
Total reward at episode 18: 39.0
Total reward at episode 19: 11.0
Total reward at episode 20: 14.0
Total reward at episode 21: 34.0
Total reward at episode 22: 22.0
Total reward at episode 23: 13.0
Total reward at episode 24: 29.0
Total reward at episode 25: 28.0
Total reward at episode 26: 20.0
Total reward at episode 27: 16.0
Total reward at episode 28: 15.0
Total reward at episode 29: 70.0
Total reward at episode 30: 12.0
Total reward at episo

### MountainCar

Now apply the things you've learned to the MountainCar problem. Please note that the observation space differs from the previous problem, so before you start training, you need to learn more about this new environment.

Here is some code to help you get started:

In [71]:
# # Create the desired environment
# env = gym.make("MountainCar-v0")

# # Wrap the environment, to make sure we get to see a fancy video
# env = video_env(env)

# # Before you can use a Gym environment, it needs to be reset.
# state = env.reset()

# # Perform random actions until the episode ends. Just as an example.
# done = False
# while not done:
   
#     # Explore and take actions
#     pass

#     # Remove the line below when you have created an implementation you want to test.
#     done = True

# # Show the results!
# env.close()
# show_video()
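
As one possible starting point, the sketch below sets up a state discretizer and an empty Q-table for MountainCar-v0. The observation bounds and action count are the documented values for that environment, hard-coded so the example runs without Gym. The 20-by-20 bucket resolution and the name `discretize` are assumptions for illustration. Note that this version uses equal-width bins (floor) rather than the round-to-nearest rule used for CartPole above.

```python
# MountainCar-v0 facts (hard-coded from the Gym documentation):
# observation = [position, velocity]; actions = push left, no push, push right.
LOWER = (-1.2, -0.07)
UPPER = (0.6, 0.07)
N_ACTIONS = 3

# Hypothetical resolution: 20 bins per dimension. Tune this like any other hyperparameter.
BUCKETS = (20, 20)


def discretize(observation):
    """Map a continuous observation to a tuple of equal-width bin indices."""
    indices = []
    for value, low, high, n_bins in zip(observation, LOWER, UPPER, BUCKETS):
        ratio = (value - low) / (high - low)             # 0.0 at the lower bound, 1.0 at the upper bound
        index = int(ratio * n_bins)                      # floor into one of n_bins equal-width bins
        indices.append(min(n_bins - 1, max(0, index)))   # clamp values on or outside the bounds
    return tuple(indices)


# Q-table as nested lists: one list of action values per (position_bin, velocity_bin) pair.
q_table = [[[0.0] * N_ACTIONS for _ in range(BUCKETS[1])] for _ in range(BUCKETS[0])]

start = discretize((-0.5, 0.0))   # a typical start: near the valley floor, at rest
print(start, len(q_table), len(q_table[0]), len(q_table[0][0]))
```

Because MountainCar-v0 gives a reward of -1 on every step, a zero-initialized table is optimistic about untried actions, which may help exploration early on. Unlike CartPole, both observation dimensions are bounded, so no hand-picked velocity limits are needed.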