# Meta-RL

Meta-reinforcement learning (Meta-RL) enables RL agents to adapt quickly to new tasks with minimal additional experience. Traditional RL algorithms typically require extensive interaction with the environment to master each new task. In contrast, Meta-RL trains models to adapt rapidly by leveraging past experience from multiple related tasks.
Meta-RL agents adapt rapidly to new scenarios or task variations, require fewer environment interactions when learning a new task, and perform better across a diverse set of tasks or environments.

## Background

Meta-RL introduces two levels of learning:
- **Inner loop**: The agent interacts with the environment to learn a task. This is the standard RL learning process.
- **Outer loop**: The agent learns how to learn new tasks quickly. This is the meta-learning process.

Important concepts in Meta-RL:
- **Task distribution**: The distribution of tasks that the agent will encounter during training and testing.
- **Adaptation steps**: The number of gradient updates applied for a new task.
- **Meta-policy**: The policy learned in the outer loop, which provides the parameters (or, in MAML, the initialization $\theta$) from which the inner-loop policy is obtained.
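
As a rough illustration of a task distribution, the sketch below samples task parameters for a family of pendulum tasks that differ only in gravity. The function name `sample_task_params`, the range 8 to 12, and the choice of gravity as the varying quantity are assumptions made for this example.

```python
import random

def sample_task_params(rng, g_low=8.0, g_high=12.0):
    """Draw one task from a hypothetical p(T): a pendulum with random gravity."""
    return {"g": rng.uniform(g_low, g_high)}

rng = random.Random(0)  # seeded so the sampled tasks are reproducible
train_tasks = [sample_task_params(rng) for _ in range(5)]
test_task = sample_task_params(rng)  # a new task from the same distribution

for params in train_tasks:
    print(f"training task with g = {params['g']:.2f}")
print(f"held-out task with g = {test_task['g']:.2f}")
```

With Gymnasium, each parameter set could be turned into an environment via `gym.make("Pendulum-v1", g=params["g"])`, since `Pendulum-v1` accepts a `g` keyword argument.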

## Theory

Meta-RL involves training an agent on multiple tasks drawn from a task distribution $p(\mathcal{T})$. Each task $\mathcal{T}_i$ is represented by an MDP $(S, A, P_i, r_i, \gamma)$, where $S$ and $A$ are the state and action spaces, $P_i$ is the transition dynamics, $r_i$ is the reward function, and $\gamma$ is the discount factor. The agent is trained to adapt to new tasks quickly by leveraging past experience from related tasks.
Meta-RL aims to learn policy parameters $\theta$ that can quickly adapt to a new task $\mathcal{T}_{new}$. The objective therefore generalizes to:
$$\theta^* = \arg\max_{\theta} \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} [V_{\mathcal{T}_i}(\theta')]$$
where $V_{\mathcal{T}_i}(\theta')$ is the expected return on task $\mathcal{T}_i$ under the adapted parameters $\theta'$, which are obtained by adapting $\theta$ on a small amount of task-specific data.
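
The objective can be evaluated by hand on a toy problem. The sketch below assumes scalar parameters, a hypothetical task family whose value is the negative quadratic loss $-(\theta - c_i)^2$, and a single gradient step as the adaptation rule; none of these choices come from the notebook itself. It searches a grid for the $\theta$ with the best value after adaptation.

```python
def task_loss(theta, c):
    """Toy task: the optimum of task i sits at its center c."""
    return (theta - c) ** 2

def adapt(theta, c, alpha):
    """One gradient step on the task loss, giving the adapted theta'."""
    grad = 2.0 * (theta - c)
    return theta - alpha * grad

def meta_objective(theta, centers, alpha):
    """Average post-adaptation value, with value defined as negative loss."""
    values = [-task_loss(adapt(theta, c, alpha), c) for c in centers]
    return sum(values) / len(values)

centers = [-1.0, 0.0, 4.0]  # three hypothetical tasks
alpha = 0.25
candidates = [i / 10 for i in range(-20, 51)]  # grid over theta in [-2, 5]
best = max(candidates, key=lambda t: meta_objective(t, centers, alpha))
print(best)  # → 1.0
```

For this symmetric toy family the best initialization is the mean of the task centers: it is not optimal for any single task, but it is the point from which one adaptation step helps most on average.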

## Mathematical Formulation

MAML (Model-Agnostic Meta-Learning) is a popular Meta-RL approach with a clear mathematical structure:
- Inner-loop adaptation (task-specific update):
Given initial parameters $\theta$, adapt using a small number $K$ of gradient steps. For a single step ($K = 1$):
$$\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta)$$
Here, $\mathcal{L}_{\mathcal{T}_i}(\theta)$ is the task-specific loss (for RL, the negative expected return $-V_{\mathcal{T}_i}(\theta)$), and $\alpha$ is the inner-loop step size. For $K > 1$, the same update is applied repeatedly, starting from $\theta$.
- Outer-loop optimization (meta-learning):
Update the original parameters $\theta$ using the loss of the adapted parameters $\theta'_i$:
$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(\theta'_i)$$
where $\beta$ is the meta-learning rate.
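
The two updates can be sketched on a toy regression problem. This is a minimal illustration under stated assumptions, not a full Meta-RL implementation: the task family (linear functions with slopes 1, 2, and 3), the fixed support and query inputs, and the names `task_loss`, `x_support`, and `x_query` are all hypothetical. The key detail is `create_graph=True`, which keeps the inner step differentiable so that the outer gradient is taken through $\theta'_i$.

```python
import torch

def task_loss(w, x, y):
    """Mean squared error of the linear model y_hat = w * x."""
    return ((w * x - y) ** 2).mean()

w = torch.zeros(1, requires_grad=True)  # meta-parameters theta
alpha, beta = 0.1, 0.05                 # inner and outer step sizes
slopes = [1.0, 2.0, 3.0]                # hypothetical task family
x_support = torch.linspace(-1.0, 1.0, 10)
x_query = torch.linspace(-2.0, 2.0, 10)

for step in range(100):
    meta_loss = 0.0
    for slope in slopes:
        # Inner loop: one gradient step on the support data of this task.
        support_loss = task_loss(w, x_support, slope * x_support)
        (grad,) = torch.autograd.grad(support_loss, w, create_graph=True)
        w_adapted = w - alpha * grad
        # The outer objective is the query loss of the adapted parameters.
        meta_loss = meta_loss + task_loss(w_adapted, x_query, slope * x_query)
    # Outer loop: differentiate through the inner step with respect to w.
    (meta_grad,) = torch.autograd.grad(meta_loss, w)
    with torch.no_grad():
        w -= beta * meta_grad

print(f"meta-learned initialization: {w.item():.3f}")
```

Because every task here shares the same inputs, the meta-learned initialization settles at the mean slope, the point from which a single inner step reduces the query loss equally well for all three tasks.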

## Implementation

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, act_dim),
            nn.Tanh()  # Bound the action output to [-1, 1]
        )

    def forward(self, x):
        return self.net(x)

In [11]:
from torch.func import functional_call  # Requires PyTorch >= 2.0

class MAMLAgent:
    def __init__(self, obs_dim, act_dim, alpha=0.01, beta=0.001):
        self.policy = Policy(obs_dim, act_dim).to(device)
        self.alpha = alpha  # Inner loop learning rate
        self.beta = beta    # Outer loop learning rate
        self.meta_optimizer = optim.Adam(self.policy.parameters(), lr=self.beta)
        self.loss_fn = nn.MSELoss()

    def adapt(self, support_set):
        # One differentiable inner-loop step: theta' = theta - alpha * grad.
        # The adapted parameters stay in the autograd graph of self.policy,
        # so the meta-gradient can flow back to theta.
        params = dict(self.policy.named_parameters())
        preds = functional_call(self.policy, params, (support_set['obs'],))
        loss = self.loss_fn(preds, support_set['act'])
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        return {name: p - self.alpha * g
                for (name, p), g in zip(params.items(), grads)}

    def meta_update(self, tasks):
        meta_loss = 0.0
        for task in tasks:
            adapted_params = self.adapt(task['support'])
            query_obs, query_acts = task['query']['obs'], task['query']['act']
            # Evaluate the adapted parameters on the query set
            query_preds = functional_call(self.policy, adapted_params, (query_obs,))
            meta_loss = meta_loss + self.loss_fn(query_preds, query_acts)

        meta_loss = meta_loss / len(tasks)
        self.meta_optimizer.zero_grad()
        meta_loss.backward()
        self.meta_optimizer.step()
        return meta_loss.item()


In [12]:
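# Tasks are meant to be variations of the Pendulum environment.
# As written, every sampled task is the default Pendulum-v1; passing a different
# `g` (gravity) to gym.make would yield tasks that actually differ.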
def sample_task():
    env = gym.make("Pendulum-v1")
    return env

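def collect_data(env, policy, episodes=1):
    # Roll out `policy` and record (observation, action) pairs.
    # The targets are the policy's own actions, so the MSE of that same policy
    # against them is close to zero, which is why the meta loss printed below
    # stays at 0.0000.
    obs_list, act_list = [], []
    obs, _ = env.reset()
    for _ in range(episodes * 200):  # Pendulum-v1 episodes are truncated after 200 steps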
        obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0).to(device)
        act = policy(obs_tensor).detach().cpu().numpy()[0]
        next_obs, _, terminated, truncated, _ = env.step(act)
        obs_list.append(obs)
        act_list.append(act)
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()
    return {
        'obs': torch.tensor(np.array(obs_list), dtype=torch.float32).to(device),
        'act': torch.tensor(np.array(act_list), dtype=torch.float32).to(device)
    }

agent = MAMLAgent(obs_dim=3, act_dim=1)

meta_epochs = 100
tasks_per_epoch = 5

for epoch in range(meta_epochs):
    tasks = []
    for _ in range(tasks_per_epoch):
        env = sample_task()
        support_set = collect_data(env, agent.policy, episodes=1)
        query_set = collect_data(env, agent.policy, episodes=1)
        tasks.append({'support': support_set, 'query': query_set})

    meta_loss = agent.meta_update(tasks)
    print(f"Epoch {epoch}: Meta Loss {meta_loss:.4f}")


Epoch 0: Meta Loss 0.0000
Epoch 1: Meta Loss 0.0000
Epoch 2: Meta Loss 0.0000
Epoch 3: Meta Loss 0.0000
Epoch 4: Meta Loss 0.0000
Epoch 5: Meta Loss 0.0000
Epoch 6: Meta Loss 0.0000
Epoch 7: Meta Loss 0.0000
Epoch 8: Meta Loss 0.0000
Epoch 9: Meta Loss 0.0000
Epoch 10: Meta Loss 0.0000
Epoch 11: Meta Loss 0.0000
Epoch 12: Meta Loss 0.0000
Epoch 13: Meta Loss 0.0000
Epoch 14: Meta Loss 0.0000
Epoch 15: Meta Loss 0.0000
Epoch 16: Meta Loss 0.0000
Epoch 17: Meta Loss 0.0000
Epoch 18: Meta Loss 0.0000
Epoch 19: Meta Loss 0.0000
Epoch 20: Meta Loss 0.0000
Epoch 21: Meta Loss 0.0000
Epoch 22: Meta Loss 0.0000
Epoch 23: Meta Loss 0.0000
Epoch 24: Meta Loss 0.0000
Epoch 25: Meta Loss 0.0000
Epoch 26: Meta Loss 0.0000
Epoch 27: Meta Loss 0.0000
Epoch 28: Meta Loss 0.0000
Epoch 29: Meta Loss 0.0000
Epoch 30: Meta Loss 0.0000
Epoch 31: Meta Loss 0.0000
Epoch 32: Meta Loss 0.0000
Epoch 33: Meta Loss 0.0000
Epoch 34: Meta Loss 0.0000
Epoch 35: Meta Loss 0.0000
Epoch 36: Meta Loss 0.0000
Epoch 37: M

## Next Steps

### $RL^2$ (Fast RL via Slow RL)

RL$^2$ uses a recurrent neural network (RNN) to perform meta-learning by encoding experience in the hidden state of the policy. This allows rapid adaptation based on recent experience without explicit gradient updates at test time.
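
To make the role of the hidden state concrete, the following sketch runs one trial consisting of several episodes of the same task and resets the hidden state only at the start of the trial. It is an illustration under assumptions: `TinyRecurrentPolicy`, `run_trial`, the constant observation, and the toy reward are hypothetical, and the extra `done` input is included here so the network can see episode boundaries.

```python
import torch
import torch.nn as nn

class TinyRecurrentPolicy(nn.Module):
    """Hypothetical stand-in policy: LSTM over (obs, prev_action, prev_reward, done)."""
    def __init__(self, obs_dim, act_dim, hidden_size=16):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(obs_dim + act_dim + 2, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, act_dim)

    def forward(self, obs, prev_act, prev_reward, done, hidden):
        x = torch.cat([obs, prev_act, prev_reward, done], dim=-1).unsqueeze(1)
        out, hidden = self.lstm(x, hidden)
        return torch.tanh(self.actor(out.squeeze(1))), hidden

    def init_hidden(self, batch_size=1):
        z = torch.zeros(1, batch_size, self.hidden_size)
        return (z, z.clone())

def run_trial(policy, episodes=3, steps=5, target=0.5):
    """Run several episodes of one toy task, carrying the hidden state across them."""
    hidden = policy.init_hidden()  # reset once per trial (per task), not per episode
    prev_act = torch.zeros(1, 1)
    prev_rew = torch.zeros(1, 1)
    done = torch.zeros(1, 1)
    episode_returns = []
    with torch.no_grad():  # adaptation happens in the hidden state, not in the weights
        for _ in range(episodes):
            obs = torch.zeros(1, 1)  # toy task: constant observation
            total = 0.0
            for t in range(steps):
                act, hidden = policy(obs, prev_act, prev_rew, done, hidden)
                reward = -(act - target).abs()  # toy reward: stay close to a hidden target
                done = torch.full((1, 1), float(t == steps - 1))
                prev_act, prev_rew = act, reward
                total += reward.item()
            episode_returns.append(total)
    return episode_returns, hidden

torch.manual_seed(0)
policy = TinyRecurrentPolicy(obs_dim=1, act_dim=1)
returns, hidden = run_trial(policy)
print(len(returns), hidden[0].shape)
```

An untrained policy will not improve across episodes; the point of the sketch is only the data flow, in which rewards from earlier episodes reach later decisions through `hidden`.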

In [21]:
class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_size=128):
        super().__init__()
        self.hidden_size = hidden_size
        # Input at each step: observation, previous action, previous reward (scalar)
        self.lstm = nn.LSTM(obs_dim + act_dim + 1, hidden_size, batch_first=True)
        self.actor = nn.Linear(hidden_size, act_dim)

    def forward(self, obs, prev_act, prev_reward, hidden_state):
        x = torch.cat([obs, prev_act, prev_reward], dim=-1).unsqueeze(1)  # Shape: (batch, seq=1, features)
        output, hidden_state = self.lstm(x, hidden_state)
        action = torch.tanh(self.actor(output.squeeze(1)))  # Mean action in [-1, 1]
        return action, hidden_state

    def init_hidden(self, batch_size=1):
        # Zero (h, c) states sized to match the LSTM
        return (torch.zeros(1, batch_size, self.hidden_size).to(device),
                torch.zeros(1, batch_size, self.hidden_size).to(device))

In [24]:
def run_episode(env, policy, optimizer=None, train=True):
    obs, _ = env.reset()
    hidden_state = policy.init_hidden()
    prev_action = torch.zeros(1, env.action_space.shape[0]).to(device)
    prev_reward = torch.zeros(1, 1).to(device)

    total_reward = 0
    log_probs = []  # To accumulate differentiable loss terms
    rewards = []

    for step in range(200):
        obs_tensor = torch.tensor(obs, dtype=torch.float32).unsqueeze(0).to(device)
        action, hidden_state = policy(obs_tensor, prev_action, prev_reward, hidden_state)

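        # Continuous action: Gaussian exploration noise around the policy output.
        # sample() (not rsample()) keeps the sampled action out of the graph, so the
        # score-function gradient of log_prob with respect to the mean is non-zero.
        action_dist = torch.distributions.Normal(action, 0.1)
        sampled_action = action_dist.sample()
        log_prob = action_dist.log_prob(sampled_action).sum(dim=-1)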
        
        next_obs, reward, terminated, truncated, _ = env.step(sampled_action.detach().cpu().numpy()[0])
        done = terminated or truncated

        if train:
            log_probs.append(log_prob)
            rewards.append(torch.tensor([reward], dtype=torch.float32).to(device))

        prev_action = sampled_action.detach()
        prev_reward = torch.tensor([[reward]], dtype=torch.float32).to(device)
        obs = next_obs
        total_reward += reward

        if done:
            break

    # Compute loss at episode end (REINFORCE-like update)
    if train and optimizer:
        returns = []
        R = 0
        gamma = 0.99
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)

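        returns = torch.cat(returns).detach()  # Shape: (T,)
        # cat gives shape (T,), matching returns; stack would give (T, 1)
        # and the product below would broadcast to (T, T)
        log_probs = torch.cat(log_probs)

        loss = -(log_probs * returns).mean()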

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return total_reward
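
The return computation at the end of `run_episode` can be checked in isolation. The helper below is a minimal standalone sketch of the same backward recursion on plain floats; the name `discounted_returns` is introduced here for illustration only.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):  # walk backwards so each step reuses G_{t+1}
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # → [1.75, 1.5, 1.0]
```

Appending and reversing once avoids the repeated `insert(0, ...)` calls, which shift the whole list on every step.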


In [26]:
env_names = ["Pendulum-v1", "MountainCarContinuous-v0"]

env_dims = {
    "Pendulum-v1": {"obs_dim": 3, "act_dim": 1},
    "MountainCarContinuous-v0": {"obs_dim": 2, "act_dim": 1}
}

policies = {}
optimizers = {}

# Initialize one policy per environment (the observation sizes differ,
# so the weights are not shared across tasks here)
for env_name in env_names:
    dims = env_dims[env_name]
    policy = RecurrentPolicy(obs_dim=dims["obs_dim"], act_dim=dims["act_dim"]).to(device)
    optimizer = optim.Adam(policy.parameters(), lr=1e-3)
    policies[env_name] = policy
    optimizers[env_name] = optimizer

meta_epochs = 100

for epoch in range(meta_epochs):
    rewards = []
    for env_name in env_names:
        env = gym.make(env_name)
        policy = policies[env_name]
        optimizer = optimizers[env_name]

        ep_reward = run_episode(env, policy, optimizer=optimizer, train=True)
        rewards.append(ep_reward)

    avg_reward = np.mean(rewards)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Average Reward: {avg_reward:.2f}")



Epoch 0, Average Reward: -758.08
Epoch 10, Average Reward: -774.72
Epoch 20, Average Reward: -766.77
Epoch 30, Average Reward: -776.45
Epoch 40, Average Reward: -543.27
Epoch 50, Average Reward: -672.09
Epoch 60, Average Reward: -776.11
Epoch 70, Average Reward: -912.63
Epoch 80, Average Reward: -719.13
Epoch 90, Average Reward: -933.36


In [28]:
env_test = gym.make("Pendulum-v1")
policy_test = policies["Pendulum-v1"]  # Select the policy trained on this environment
test_rewards = []

for episode in range(10):
    ep_reward = run_episode(env_test, policy_test, train=False)
    test_rewards.append(ep_reward)
    print(f"Test Episode {episode}, Reward: {ep_reward:.2f}")

print(f"Average Test Reward: {np.mean(test_rewards):.2f}")

Test Episode 0, Reward: -1089.51
Test Episode 1, Reward: -1109.47
Test Episode 2, Reward: -1065.80
Test Episode 3, Reward: -1071.88
Test Episode 4, Reward: -1898.60
Test Episode 5, Reward: -1595.76
Test Episode 6, Reward: -1276.89
Test Episode 7, Reward: -1332.22
Test Episode 8, Reward: -1661.24
Test Episode 9, Reward: -1896.60
Average Test Reward: -1399.80
