Reinforcement learning is an area of machine learning in which an agent interacts with an environment described by a set of states and a set of actions. When the agent takes an action in a state, it transitions to another state and receives a positive or negative reward that reflects the quality of that action. The goal of reinforcement learning is to maximize the total reward collected over time. The problem is formalized as a Markov decision process.

Maximization of total rewards:
- The agent takes some action, transitioning to a new state and receiving a reward for that action
- The expected return is the reward just received plus the discounted maximum reward attainable from future states
    - A discount factor, gamma (between 0 and 1), controls how much future rewards are valued relative to immediate ones
- From here, different algorithms learn different things:
    - Policy-based methods directly learn the best action to take in a given state
    - Value-based methods learn to predict the expected return of a state or state-action pair
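
To make the role of gamma concrete, here is a minimal sketch of a discounted return computed for a made-up episode. The function name `discounted_return` and the reward list are assumptions for illustration; they are not part of the notebook's agent.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t], accumulated back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: no reward for five steps, then 1.0 on the final step.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
for gamma in (0.5, 0.9, 0.99):
    print(gamma, discounted_return(rewards, gamma))
```

The smaller gamma is, the less the final reward is worth when viewed from the first step, so a low gamma makes the agent short-sighted.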

General summary of Q-learning:
Q-learning is a value-based method. It assigns values (known as q-values) to state-action pairs. Actions with higher q-values are expected to lead to better long-term rewards than those with lower q-values.
- The agent updates the q-value of each visited pair using an update rule derived from the Bellman equation:
    - Q(s, a) = Q(s, a) + learning_rate * [reward + gamma * max_a' Q(s', a') - Q(s, a)], where s' is the next state and the maximum is taken over the actions a' available in s'
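
The update rule above can be sketched as a single step on a tiny table. The helper `q_update` and the 3-state, 2-action table are assumptions made up for this example. Only the arithmetic follows the formula.

```python
import numpy as np

def q_update(q, state, action, reward, next_state, lr, gamma):
    """Apply one tabular Q-learning update in place and return the new q-value."""
    td_target = reward + gamma * np.max(q[next_state])  # reward + gamma * max_a' Q(s', a')
    q[state, action] += lr * (td_target - q[state, action])
    return q[state, action]

# Hypothetical table: 3 states, 2 actions, all zeros except one known-good action.
q = np.zeros((3, 2))
q[2] = [0.0, 1.0]

# Taking action 0 in state 1 leads to state 2 with no immediate reward.
new_value = q_update(q, state=1, action=0, reward=0.0, next_state=2, lr=0.1, gamma=0.99)
print(new_value)
```

Even though the step itself paid nothing, Q(1, 0) moves toward gamma times the best value of the next state. This is how reward information propagates backwards through the table.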

In [33]:
import gymnasium as gym
import numpy as np  

In this environment, I have not disabled the slippery effect of the ice. So when the agent takes an action, there is a chance it will end up in a different state than intended.
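
As a rough illustration of what "slippery" means, the sketch below mimics the default behaviour described in the Gymnasium documentation for FrozenLake: the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The function `slippery_action` is a stand-in written for this note, not a Gymnasium API. Passing `is_slippery=False` to `gym.make` turns the effect off.

```python
import random

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # FrozenLake action encoding

def slippery_action(intended, rng):
    """Sample the direction actually moved: intended or one of its two perpendiculars."""
    return rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])

rng = random.Random(0)
counts = {LEFT: 0, DOWN: 0, RIGHT: 0, UP: 0}
for _ in range(3000):
    counts[slippery_action(RIGHT, rng)] += 1
print(counts)  # RIGHT, DOWN and UP each occur about a third of the time; LEFT never
```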

In [34]:
env = gym.make('FrozenLake-v1')

In [35]:
env.observation_space

Discrete(16)

In [36]:
env.observation_space.sample()

np.int64(11)

Note that this environment has only 16 states, so there is no need for a neural network. An array is enough to store the values.

In [37]:
env.action_space

Discrete(4)

In [38]:
env.action_space.sample()

np.int64(1)

In [39]:
env.action_space.sample()

np.int64(3)

In [40]:
obs, _ = env.reset()
obs

0

In [41]:
next_state, reward, terminated, truncated, _ = env.step(env.action_space.sample())

In [42]:
print(next_state)
print(reward)
print(terminated)
print(truncated)

0
0.0
False
False


Q-learning algorithm

Now is a good time to discuss exploration vs. exploitation:

- In reinforcement learning, exploration means the agent visits different states and tries different actions. By exploring, it learns which actions lead to good rewards in which states.
- Exploitation means using what the agent has learned from its previous exploration. Instead of taking actions just to see what happens and what reward results, the agent takes the best action it has learned so far. If training went well, exploiting the learned values solves the problem.
- Thus, there has to be enough exploration before exploitation can work. If exploration is too short, the agent has not had enough time to learn which actions are good in which states, resulting in poor performance.
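
One way to get a feel for how long exploration lasts is to print an epsilon schedule. The sketch below assumes epsilon starts at 1, is multiplied by 0.995 after every episode, and is never allowed to drop below 0.1, which are the values the agent in this notebook uses. The helper name `epsilon_schedule` is made up for this example.

```python
def epsilon_schedule(num_episodes, start=1.0, decay=0.995, floor=0.1):
    """Return the epsilon used in each episode under multiplicative decay with a floor."""
    epsilons = []
    epsilon = start
    for _ in range(num_episodes):
        epsilons.append(epsilon)
        epsilon = max(floor, epsilon * decay)
    return epsilons

schedule = epsilon_schedule(1000)
# Index of the first episode that runs with epsilon at its floor.
first_floor = next(i for i, e in enumerate(schedule) if e == 0.1)
print(f'epsilon reaches its floor at episode {first_floor + 1}')
print(f'epsilon in the final episode: {schedule[-1]}')
```

A slower decay (closer to 1) keeps the agent exploring for more episodes. A faster one switches to exploitation sooner, at the risk of committing to poorly estimated values.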

In [43]:
class Agent:
    def __init__(self, env: gym.Env, lr: float, gamma: float):
        self.env = env
        self.values = np.zeros((env.observation_space.n, env.action_space.n))  # Q-table: 16 states x 4 actions, one q-value per state-action pair.
        self.epsilon = 1
        self.lr = lr
        self.gamma = gamma

    def select_action(self, state: int):  # Epsilon-greedy: explore with probability epsilon, which decays over training toward exploitation.
        if np.random.rand() < self.epsilon:
            return np.random.randint(0, self.env.action_space.n) # Return random action (explore)
        else:
            return np.argmax(self.values[state]) # Return best action learned (exploit)

    def learn(self, num_episodes: int):
        for i in range(num_episodes):
            total_reward = 0
            state, _ = self.env.reset()
            terminated, truncated = False, False

            while not (terminated or truncated):
                action = self.select_action(state)
                next_state, reward, terminated, truncated, _ = self.env.step(action)

                self.values[state][action] = \
                    self.values[state][action] + self.lr * (reward + self.gamma * np.max(self.values[next_state]) - self.values[state][action])

                state = next_state
                total_reward += reward
            
            print(f'finished episode {i + 1}, received total reward {total_reward}')
            # decay epsilon
            self.epsilon = max(0.1, self.epsilon * 0.995) 
            # never want epsilon to go to 0, that would stop exploration completely

    def evaluate(self, env, num_episodes: int):
        rewards = []
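        self.epsilon = 0  # act greedily: no random exploration during evaluation
        self.env = env  # evaluate on the supplied environment (e.g. one with rendering)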
        
        for i in range(num_episodes):
            total_reward = 0
            state, _ = self.env.reset()
            terminated, truncated = False, False

            while not (terminated or truncated):
                action = self.select_action(state)
                next_state, reward, terminated, truncated, _ = self.env.step(action)
                state = next_state
                total_reward += reward
            
            print(f'finished episode {i + 1}, received total reward {total_reward}')
            rewards.append(total_reward)
        
        print('finished evaluation with average reward:', sum(rewards) / len(rewards))
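
A quick way to inspect what a trained table has learned is to print the greedy action for every state as a grid. This is only a sketch: `greedy_policy_grid` is a helper made up for this note, and the random table below stands in for `agent.values`. The arrow mapping assumes FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up) and the default 4 x 4 map.

```python
import numpy as np

ARROWS = np.array(['<', 'v', '>', '^'])  # index = FrozenLake action id

def greedy_policy_grid(q_values, side=4):
    """Return a side x side grid of arrows showing the argmax action of each state."""
    best_actions = np.argmax(q_values, axis=1)
    return ARROWS[best_actions].reshape(side, side)

# Hypothetical Q-table standing in for agent.values (16 states x 4 actions).
rng = np.random.default_rng(0)
q_values = rng.random((16, 4))
for row in greedy_policy_grid(q_values):
    print(' '.join(row))
```

States whose q-values are all zero, such as holes and the goal (which are never updated), show up as `<` because `np.argmax` returns the first index on ties.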


In [44]:
agent = Agent(env, 0.1, 0.99)

In [45]:
agent.learn(1000)

finished episode 1, received total reward 0.0
finished episode 2, received total reward 0.0
finished episode 3, received total reward 0.0
finished episode 4, received total reward 0.0
finished episode 5, received total reward 0.0
finished episode 6, received total reward 0.0
finished episode 7, received total reward 0.0
finished episode 8, received total reward 0.0
finished episode 9, received total reward 0.0
finished episode 10, received total reward 0.0
finished episode 11, received total reward 0.0
finished episode 12, received total reward 0.0
finished episode 13, received total reward 0.0
finished episode 14, received total reward 0.0
finished episode 15, received total reward 0.0
finished episode 16, received total reward 0.0
finished episode 17, received total reward 0.0
finished episode 18, received total reward 0.0
finished episode 19, received total reward 0.0
finished episode 20, received total reward 0.0
finished episode 21, received total reward 0.0
finished episode 22, r

In [46]:
eval_env = gym.make('FrozenLake-v1', render_mode='human')
agent.evaluate(eval_env, 5)

finished episode 1, received total reward 0.0
finished episode 2, received total reward 1.0
finished episode 3, received total reward 1.0
finished episode 4, received total reward 0.0
finished episode 5, received total reward 0.0
finished evaluation with average reward: 0.4


In [47]:
eval_env.close()

In [48]:
eval_env = gym.make('FrozenLake-v1')
agent.evaluate(eval_env, 100)   

finished episode 1, received total reward 0.0
finished episode 2, received total reward 0.0
finished episode 3, received total reward 1.0
finished episode 4, received total reward 0.0
finished episode 5, received total reward 1.0
finished episode 6, received total reward 1.0
finished episode 7, received total reward 0.0
finished episode 8, received total reward 0.0
finished episode 9, received total reward 1.0
finished episode 10, received total reward 1.0
finished episode 11, received total reward 1.0
finished episode 12, received total reward 0.0
finished episode 13, received total reward 1.0
finished episode 14, received total reward 1.0
finished episode 15, received total reward 0.0
finished episode 16, received total reward 0.0
finished episode 17, received total reward 0.0
finished episode 18, received total reward 0.0
finished episode 19, received total reward 0.0
finished episode 20, received total reward 1.0
finished episode 21, received total reward 1.0
finished episode 22, r