### Aprendizado por Reforço

O aprendizado por reforço (Reinforcement Learning - RL) é uma área do aprendizado de máquina onde um agente aprende a tomar decisões através de interações com um ambiente. O objetivo do agente é maximizar uma recompensa cumulativa ao longo do tempo, tomando ações que levem a estados mais recompensadores.

#### Conceitos Principais

- **Agente**: O tomador de decisões.
- **Ambiente**: Tudo com o que o agente interage.
- **Ação (A)**: As escolhas disponíveis para o agente.
- **Estado (S)**: A situação atual do ambiente.
- **Recompensa (R)**: O feedback recebido pelo agente após tomar uma ação.

#### Processo de Aprendizado

O agente observa o estado atual do ambiente, toma uma ação com base em uma política (que pode ser uma função ou uma tabela), e então recebe uma recompensa e observa o novo estado do ambiente. Este ciclo se repete, permitindo que o agente aprenda quais ações são mais benéficas ao longo do tempo.

#### Valores-Q (Q-values)

Os valores-Q são fundamentais no aprendizado por reforço. Eles representam a "qualidade" de uma ação em um determinado estado, ou seja, a recompensa esperada que o agente pode obter ao tomar aquela ação a partir daquele estado.

A fórmula de atualização do valor-Q é dada por:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$

Onde:
- $ Q(s, a) $: Valor-Q atual para o estado s  e ação  a.
- $\alpha$: Taxa de aprendizado.
- $r$: Recompensa recebida após tomar a ação \( a \).
- $\gamma$: Fator de desconto, que equilibra recompensas imediatas e futuras.
- $\max_{a'} Q(s', a')$: Valor-Q máximo para o próximo estado $s'$.

A atualização dos valores-Q é feita iterativamente até que o agente aprenda uma política ótima, ou seja, uma maneira de escolher ações que maximiza a recompensa acumulada.


# Primeiro Exemplo

Referência: https://www.pymc.io/projects/examples/en/latest/case_studies/reinforcement_learning.html


In [None]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pytensor
import pytensor.tensor as pt
import scipy

from matplotlib.lines import Line2D

In [None]:
def generate_data(rng, alpha, beta, n=100, p_r=None):
    if p_r is None:
        p_r = [0.4, 0.6]
    actions = np.zeros(n, dtype="int")
    rewards = np.zeros(n, dtype="int")
    Qs = np.zeros((n, 2))

    # Initialize Q table
    Q = np.array([0.5, 0.5])
    for i in range(n):
        # Apply the Softmax transformation
        exp_Q = np.exp(beta * Q)
        prob_a = exp_Q / np.sum(exp_Q)

        # Simulate choice and reward
        a = rng.choice([0, 1], p=prob_a)
        r = rng.random() < p_r[a]

        # Update Q table
        Q[a] = Q[a] + alpha * (r - Q[a])

        # Store values
        actions[i] = a
        rewards[i] = r
        Qs[i] = Q.copy()

    return actions, rewards, Qs

In [None]:
true_alpha = 0.5
true_beta = 5
n = 150
rng = np.random.default_rng(42)
actions, rewards, Qs = generate_data(rng, true_alpha, true_beta, n)
actions, rewards, Qs

(array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
        1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]),
 array([1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1,
        0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
        1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1

In [None]:
def update_Q(action, reward, Qs, alpha):
    Qs = pt.set_subtensor(Qs[action], Qs[action] + alpha * (reward - Qs[action]))
    return Qs

In [None]:
def rprobs(alpha, beta, actions, rewards):
    rewards = pt.as_tensor_variable(rewards, dtype="int32")
    actions = pt.as_tensor_variable(actions, dtype="int32")

    # Compute the Qs values
    Qs = 0.5 * pt.ones((2,), dtype="float64")
    Qs, updates = pytensor.scan(
        fn=update_Q, sequences=[actions, rewards], outputs_info=[Qs], non_sequences=[alpha]
    )

    # Apply the sotfmax transformation
    Qs = Qs[:-1] * beta
    logp_actions = Qs - pt.logsumexp(Qs, axis=1, keepdims=True)

    # Return the probabilities for the right action, in the original scale
    return pt.exp(logp_actions[:, 1])

In [None]:
with pm.Model() as m:
    alpha = pm.Beta("alpha", alpha=1, beta=1)
    beta = pm.HalfNormal("beta", sigma=10)

    action_probs = rprobs(alpha, beta, actions, rewards)
    like = pm.Bernoulli("likehood", p=action_probs, observed=actions[1:])

    trace = pm.sample()

In [None]:
pm.summary(trace)

Unnamed: 0,mean,sd,hdi_3%,hdi_97%,mcse_mean,mcse_sd,ess_bulk,ess_tail,r_hat
alpha,0.549,0.081,0.399,0.705,0.003,0.002,1030.0,766.0,1.0
beta,4.942,0.893,3.132,6.52,0.026,0.018,1182.0,1242.0,1.0


In [None]:
with m:
  trace_post = pm.sample_posterior_predictive(trace)

In [None]:
trace_post

# Segundo Exemplo

Otimização de portfólio.

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np
import pytensor.tensor as pt
import pytensor
import pymc as pm

# Fetch historical data for three tickers
tickers = ['AAPL', 'SBUX', 'MSFT']
data = yf.download(tickers, start='2020-01-01', end='2023-01-01')['Adj Close']

# Calculate daily returns
returns = data.pct_change().dropna()


[*********************100%%**********************]  3 of 3 completed


In [2]:
returns

Ticker,AAPL,MSFT,SBUX
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2020-01-03,-0.009722,-0.012452,-0.005820
2020-01-06,0.007968,0.002585,-0.007880
2020-01-07,-0.004703,-0.009118,-0.003063
2020-01-08,0.016086,0.015928,0.011609
2020-01-09,0.021241,0.012493,0.018564
...,...,...,...
2022-12-23,-0.002798,0.002267,0.005217
2022-12-27,-0.013878,-0.007414,0.009464
2022-12-28,-0.030685,-0.010255,-0.006048
2022-12-29,0.028324,0.027630,0.011866


In [28]:
actions = np.random.choice(len(tickers), size=len(returns))
stocks = {0:"AAPL",1:"SBUX",2:"MSFT"}
rewards = [returns[stocks[0]][i] for i, action in enumerate(actions)]  # Observed rewards based on actions

len(rewards)

755

In [29]:
def update_Q(action, reward, Q, alpha):
    Q_updated = Q.copy()
    Q_updated = pt.set_subtensor(Q_updated[action], Q[action] + alpha * (reward - Q[action]))
    return Q_updated

def rprobs(alpha, beta, actions, rewards):
    rewards = pt.as_tensor_variable(rewards, dtype="float64")
    actions = pt.as_tensor_variable(actions, dtype="int32")

    Qs = 0.5 * pt.ones((len(tickers),), dtype="float64")
    Qs, updates = pytensor.scan(
        fn=update_Q, sequences=[actions, rewards], outputs_info=[Qs], non_sequences=[alpha]
    )

    Qs = Qs * beta
    logp_actions = Qs - pt.logsumexp(Qs, axis=1, keepdims=True)

    return pt.exp(logp_actions)


In [30]:
with pm.Model() as model_b:
    alpha = pm.Beta("alpha", alpha=1, beta=1)
    beta = pm.HalfNormal("beta", sigma=10)

    action_probs = rprobs(alpha, beta, actions, rewards)

    likelihood = pm.Categorical("likelihood", p=action_probs, observed=actions)

In [31]:
with model_b:
    trace_b = pm.sample()

In [32]:
pm.summary(trace_b)

Unnamed: 0,mean,sd,hdi_3%,hdi_97%,mcse_mean,mcse_sd,ess_bulk,ess_tail,r_hat
alpha,0.504,0.282,0.0,0.925,0.011,0.008,620.0,534.0,1.0
beta,1.019,0.931,0.001,2.726,0.035,0.027,660.0,679.0,1.01


In [33]:
with model_b:
  trace_post_b = pm.sample_posterior_predictive(trace_b)

In [10]:
trace_post_b

In [11]:
trace_b

In [34]:
alpha_est = np.mean(trace_b.posterior['alpha'])
beta_est = np.mean(trace_b.posterior['beta'])

In [35]:
beta_est.values

array(1.01943096)

In [36]:
Qs = np.zeros(len(tickers))
weights = []

for i in range(len(returns)):
    exp_Q = np.exp(beta_est.values * Qs)
    prob_a = exp_Q / np.sum(exp_Q)
    action = np.random.choice(len(tickers), p=prob_a)

    reward = returns[stocks[action]][i]
    Qs[action] = Qs[action] + alpha_est * (reward - Qs[action])

    weights.append(prob_a)

weights = np.array(weights)
average_weights = weights.mean(axis=0)

portfolio = pd.DataFrame({
    'Ticker': tickers,
    'Weight': average_weights
})

print(portfolio)


  Ticker    Weight
0   AAPL  0.333291
1   SBUX  0.333281
2   MSFT  0.333428


In [37]:
portfolio_returns = (returns * average_weights).sum(axis=1)

cumulative_returns = (1 + portfolio_returns).cumprod() - 1

print(cumulative_returns)


Date
2020-01-03   -0.009331
2020-01-06   -0.008449
2020-01-07   -0.014029
2020-01-08    0.000308
2020-01-09    0.017746
                ...   
2022-12-23    0.535036
2022-12-27    0.528987
2022-12-28    0.505040
2022-12-29    0.539062
2022-12-30    0.534864
Length: 755, dtype: float64


In [38]:
import numpy as np
import pandas as pd

# Initialize Q-values for each ticker
Qs = np.zeros(len(tickers))
weights = []

# Set initial action (ticker) to the first one
current_action = 0

for i in range(len(returns)):
    exp_Q = np.exp(beta_est.values * Qs)
    prob_a = exp_Q / np.sum(exp_Q)

    reward = returns[stocks[current_action]][i]

    Qs[current_action] = Qs[current_action] + alpha_est * (reward - Qs[current_action])

    weights.append(prob_a)

    if reward < 0:
        current_action = np.argmax(Qs)

weights = np.array(weights)
average_weights = weights.mean(axis=0)

portfolio = pd.DataFrame({
    'Ticker': tickers,
    'Weight': average_weights
})

print(portfolio)


  Ticker    Weight
0   AAPL  0.345391
1   SBUX  0.324933
2   MSFT  0.329676


In [39]:
portfolio_returns = (returns * average_weights).sum(axis=1)

cumulative_returns = (1 + portfolio_returns).cumprod() - 1

print(cumulative_returns)


Date
2020-01-03   -0.009322
2020-01-06   -0.008338
2020-01-07   -0.013888
2020-01-08    0.000469
2020-01-09    0.017993
                ...   
2022-12-23    0.539490
2022-12-27    0.533205
2022-12-28    0.508789
2022-12-29    0.542998
2022-12-30    0.538932
Length: 755, dtype: float64


In [47]:
import numpy as np
import pandas as pd

epsilon = 0.2
Qs = np.zeros(len(tickers))
weights = []

current_action = 0

for i in range(len(returns)):
    exp_Q = np.exp(beta_est.values * Qs)
    prob_a = exp_Q / np.sum(exp_Q)

    reward = returns[stocks[current_action]][i]

    Qs[current_action] = Qs[current_action] + alpha_est * (reward - Qs[current_action])

    weights.append(prob_a)

    if reward < 0:
        if np.random.rand() < epsilon:
          current_action = 2
        else:
          current_action = np.argmax(Qs)

weights = np.array(weights)
average_weights = weights.mean(axis=0)

portfolio = pd.DataFrame({
    'Ticker': tickers,
    'Weight': average_weights
})

print(portfolio)


  Ticker    Weight
0   AAPL  0.330897
1   SBUX  0.328580
2   MSFT  0.340523


In [48]:
portfolio_returns = (returns * average_weights).sum(axis=1)

cumulative_returns = (1 + portfolio_returns).cumprod() - 1

print(cumulative_returns)

Date
2020-01-03   -0.009290
2020-01-06   -0.008495
2020-01-07   -0.014043
2020-01-08    0.000263
2020-01-09    0.017723
                ...   
2022-12-23    0.531800
2022-12-27    0.525971
2022-12-28    0.502192
2022-12-29    0.535979
2022-12-30    0.531753
Length: 755, dtype: float64
