# Theory



## Reinforcement Learning

- There is an agent that explores some space.
- As it explores, the agent learns from the feedback it receives for each action.
- An action that leads toward the desired end result earns a reward; a bad action is either ignored or penalized.
- Examples: Pac-Man, Cat & Mouse game

### Q-Learning
Q-learning is one implementation of reinforcement learning.
It works with:
- A set of environmental states \( s \)
- A set of possible actions in those states \( a \)
- A value for each state/action pair \( Q \)

Start off with all \( Q \) values at 0, then explore the space. If something bad happens after a given state/action, reduce its \( Q \); if something good happens, increase its \( Q \).
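
One common way to make "increase" and "reduce" precise is the standard tabular Q-learning update, which is also the form used by the training code later in this notebook. The symbols \( \alpha \) (learning rate), \( \gamma \) (discount factor), \( r \) (reward) and \( s' \) (next state) are notation assumed here; they correspond to `learning_rate`, `discount_factor`, `reward` and `next_state` in the code.

\[
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)
\]

For instance, with \( \alpha = 0.1 \), \( \gamma = 0.6 \), all \( Q \) values still at 0 and a reward of \( r = -1 \), the update gives \( Q(s, a) = 0.9 \cdot 0 + 0.1 \cdot (-1 + 0.6 \cdot 0) = -0.1 \).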

### The exploration problem
How do we efficiently explore all of the possible states?
#### The simple approach:
Always choose the action with the highest \( Q \) for the current state. If there is a tie, choose at random. The drawback is that the agent can get stuck repeating the first actions that happened to pay off.
#### Better way: introduce an \( \epsilon \) term
- Draw a random number; if it is less than \( \epsilon \), don't follow the highest \( Q \), but choose an action at random.
- That way, exploration never totally stops.
- Choosing a good value for \( \epsilon \) can be tricky.
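
The two strategies above can be sketched in a few lines. This is a minimal, standard-library-only illustration; `choose_action` and its parameters are hypothetical names, and `q_values` stands for the list of \( Q \) values of the current state, one per action.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, otherwise exploit."""
    # Explore: with probability epsilon, ignore Q and pick any action
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest Q, breaking ties at random
    best = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(best_actions)
```

With `epsilon=0` this reduces to the simple approach (always the highest \( Q \), ties broken at random); with `epsilon=1` every action is chosen at random.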
#### Markov decision processes
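- A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
- Basically the same idea as above, but more formal.
- States are still \( s \) and \( s' \); the probability of moving from \( s \) to \( s' \) under action \( a \) is \( P_a(s, s') \), and the reward for that transition is given by a reward function \( R_a(s, s') \), which is what the \( Q \) values are learned from.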
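
To make the notation concrete, here is a minimal sketch of a made-up two-state MDP stored as a plain dictionary. The state names, action names, probabilities and rewards are all hypothetical; only the layout, where each state/action pair maps to a list of `(probability, next_state, reward)` entries, mirrors \( P_a(s, s') \) and \( R_a(s, s') \).

```python
# Hypothetical MDP: P[state][action] -> list of (probability, next_state, reward)
P = {
    "home": {
        "walk": [(0.9, "work", -1.0), (0.1, "home", -1.0)],
        "stay": [(1.0, "home", 0.0)],
    },
    "work": {
        "walk": [(1.0, "home", -1.0)],
        "stay": [(1.0, "work", 2.0)],
    },
}

def expected_reward(P, state, action):
    """Average the reward over every possible outcome of taking `action` in `state`."""
    return sum(prob * reward for prob, _next_state, reward in P[state][action])

def is_valid(P):
    """Check that the outcome probabilities of every state/action pair sum to 1."""
    return all(
        abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
        for actions in P.values()
        for outcomes in actions.values()
    )
```

The Taxi environment used in the code below exposes a table with a similar shape, with one extra flag per entry that marks the end of an episode.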

#### Dynamic Programming
- From my background as a competitive programmer, this is basically a method that solves large problems (like computing Fibonacci numbers) by reusing the solutions to smaller subproblems that have already been computed and stored in memory.
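
As a minimal sketch of that idea, the Fibonacci example can be written with memoization, so each subproblem is computed once and then read back from memory. Using `functools.lru_cache` as the store is a choice made here for brevity; an explicit list or dictionary would work just as well.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    """Return the n-th Fibonacci number, caching every smaller result."""
    if n < 2:
        return n
    # Both calls hit the cache after the first time each value is computed
    return fib(n - 1) + fib(n - 2)
```

Without the cache, `fib(50)` would repeat the same subproblems billions of times; with it, each value from 0 to 50 is computed only once.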


# Code

In [1]:
import gym
import random

# Set a seed so the random results are reproducible
random.seed(1234)

# With gym 0.25.2 I did not run into the problem reported in the card,
# but this version's days are numbered.
# Since upgrading would change parts of the course code, I preferred to leave it as is
streets = gym.make('Taxi-v3').env
streets.reset()



468

In [2]:
# Encode a specific initial state (taxi row, taxi column, passenger location, destination)
initial_state = streets.encode(2, 3, 2, 0)

# Set the environment's state to the encoded initial state
# (the render below still matches the state returned by reset(), so this assignment
# may only reach a wrapper; streets.unwrapped.s is a possible alternative)
streets.s = initial_state
test = streets.render(mode='ansi')
print(f'{test}')

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+






In [3]:
# Transition table for the initial state: action -> [(probability, next state, reward, done)]
streets.P[initial_state]

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}
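
Each entry in the output above is a list of `(probability, next_state, reward, done)` tuples, keyed by action. The state numbers can be unpacked back into `(taxi row, taxi column, passenger location, destination)`. The sketch below assumes the encoding Taxi-v3 uses for its 5x5 grid with 5 passenger locations and 4 destinations; `encode_state` and `decode_state` are hypothetical helper names, not part of gym.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into one integer (assumed Taxi-v3 layout)."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_location
    state = state * 4 + destination
    return state

def decode_state(state):
    """Reverse of encode_state: peel the components off from the right."""
    state, destination = divmod(state, 4)
    state, passenger_location = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_location, destination
```

Under that assumption, `encode_state(2, 3, 2, 0)` gives 268, and the state 368 reached by action 0 decodes to the taxi one row further down, with passenger and destination unchanged.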

In [6]:
import numpy as np

# Q-table: one row per state, one column per action, initialized to zeros
q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

# Learning parameters used in the Q-learning update formula
learning_rate = 0.1
discount_factor = 0.6
exploration = 0.1

# Number of training epochs (one taxi run each)
epochs = 10000

# Loop over every epoch
for taxi_run in range(epochs):
    state = streets.reset()
    done = False

    # Learning loop: stops only when the task is complete (passenger picked up and dropped off)
    while not done:
        # Random draw that decides between the highest Q and a random action
        random_value = random.uniform(0, 1)
        if (random_value < exploration):
            action = streets.action_space.sample() # Explore a random action
        else:
            action = np.argmax(q_table[state]) # Use the action with the highest q-value

        # Apply the chosen action
        next_state, reward, done, info = streets.step(action)

        # This is where the algorithm actually learns, updating the q_table with the update equation
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q

        state = next_state



In [7]:
# Import the libraries
from IPython.display import clear_output
from time import sleep

# Loop over 10 trips
for tripnum in range(1, 11):
    # Start a new trip
    state = streets.reset()

    # Control variables
    done = False
    trip_length = 0

    # Loop for a single trip (capped at 25 steps)
    while not done and trip_length < 25:
        # Pick the highest Q in the already-trained table for this state
        action = np.argmax(q_table[state])
        # Apply the action
        next_state, reward, done, info = streets.step(action)
        # Clear the display
        clear_output(wait=True)
        # Print the trip
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render(mode='ansi'))
        # Sleep so we can see what is happening
        sleep(.5)
        # Advance to the next state
        state = next_state
        trip_length += 1

    sleep(2)


Trip number 10 Step 12
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)



# Activity

In [95]:
import time

def training(q_table, learning_rate, discount_factor, exploration, epochs):
    # Trains q_table in place; relies on the global `streets` environment
    start = time.time()
    # Loop over every epoch
    for taxi_run in range(epochs):
        state = streets.reset()
        done = False

        # Learning loop: stops only when the task is complete (passenger picked up and dropped off)
        while not done:
            # Random draw that decides between the highest Q and a random action
            random_value = random.uniform(0, 1)
            if random_value < exploration:
                action = streets.action_space.sample()  # Explore a random action
            else:
                action = np.argmax(q_table[state])  # Use the action with the highest q-value

            # Apply the chosen action
            next_state, reward, done, info = streets.step(action)

            # This is where the algorithm actually learns, updating the q_table with the update equation
            prev_q = q_table[state, action]
            next_max_q = np.max(q_table[next_state])
            new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
            q_table[state, action] = new_q

            state = next_state
    end = time.time()
    print(f'Tempo de treino: {end-start}')  # "Training time", in seconds



In [94]:
import time

# Function to test what was learned by measuring the mean wall-clock time of n trips
# (the result depends a lot on the state that reset() happens to pick, but that's how it is)
def test(q_table, n):
    t = []
    # Loop over n trips
    for tripnum in range(n):
        # Start a new trip
        state = streets.reset()

        # Control variables
        done = False
        trip_length = 0

        # Loop for a single trip
        start = time.time()
        while not done:
            # I think 100 steps is still generous
            if trip_length >= 100:
                print('Modelo falho')  # "Failed model"
                return 0
            # Pick the highest Q in the already-trained table for this state
            action = np.argmax(q_table[state])
            # Apply the action
            next_state, reward, done, info = streets.step(action)
            state = next_state
            trip_length += 1

        end = time.time()
        t.append(end - start)
    print(f'tempo médio: {np.mean(t)}')  # "Mean time", in seconds

In [69]:
# Control: the q_table trained in the Code section
test(q_table, 100)

tempo médio: 0.0005383780508330374


In [96]:
# Test with only 50 epochs
q_table1 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.1
df = 0.6
ex = 0.1
ep = 50

training(q_table1, lr, df, ex, ep)
test(q_table1, 100)

Tempo de treino: 1.2981035709381104
Modelo falho


0

In [97]:
# Increasing the learning rate
q_t2 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.3
df = 0.6
ex = 0.1
ep = 10000

training(q_t2, lr, df, ex, ep)
test(q_t2, 100)

Tempo de treino: 9.899224758148193
tempo médio: 0.00027807071955517087


In [98]:
# Increasing the learning rate and decreasing the epochs
q_t3 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.3
df = 0.6
ex = 0.1
ep = 20

training(q_t3, lr, df, ex, ep)
# Note: this evaluates q_t2 from the previous cell; pass q_t3 to evaluate the table trained here
test(q_t2, 100)

Tempo de treino: 0.3913130760192871
tempo médio: 0.0004013596159039122


In [99]:
# Decreasing the discount factor
q_t4 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.1
df = 0.2
ex = 0.1
ep = 10000

training(q_t4, lr, df, ex, ep)
test(q_t4, 100)

Tempo de treino: 16.718200206756592
Modelo falho


0

In [102]:
# Decreasing the discount factor and increasing the learning rate
q_t5 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.3
df = 0.4
ex = 0.1
ep = 10000

training(q_t5, lr, df, ex, ep)
test(q_t5, 100)

Tempo de treino: 8.982004165649414
tempo médio: 0.000612923593232126


In [104]:
# Increasing the discount factor
q_t6 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.1
df = 0.8
ex = 0.1
ep = 10000

training(q_t6, lr, df, ex, ep)
test(q_t6, 100)

Tempo de treino: 11.211408138275146
tempo médio: 0.00046297516485657356


In [106]:
# Increasing the discount factor and the learning rate
q_t7 = np.zeros([streets.observation_space.n, streets.action_space.n])

lr = 0.3
df = 0.8
ex = 0.1
ep = 10000

training(q_t7, lr, df, ex, ep)
test(q_t7, 100)

Tempo de treino: 10.036751747131348
Modelo falho


0