# **Chapter 18: From Theory to Implementation**

## 1. Introduction to Reinforcement Learning

**Key Components**:
- Agent: The entity that takes actions in the environment and learns to make decisions
- Environment: Where the agent operates; it responds to the agent's actions with feedback
- State (s): Current situation
- Action (a): Decision taken by the agent
- Reward (r): Feedback signal from the environment, based on the action taken, that the agent uses to learn
- Policy (π): Strategy for acting in states

**The RL Loop**:
1. Agent observes the current state
2. Selects an action according to its policy
3. Receives a reward and the new state
4. Updates its policy
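
The four steps above can be sketched in a few lines of plain Python. The environment below (`TwoStateEnv`) and the function `run_loop` are hypothetical names made up for illustration; they are not part of gym or any other library, and the hyperparameter values are arbitrary assumptions.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with states 0 and 1.

    Taking action 1 while in state 1 pays reward 1; everything else pays 0.
    Each action moves the agent to the state with the same index.
    """

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = action
        return self.state, reward


def run_loop(n_steps=2000, epsilon=0.3, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    env = TwoStateEnv()
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    total_reward = 0.0

    state = env.reset()  # 1. observe the state
    for _ in range(n_steps):
        # 2. select an action with the current (epsilon-greedy) policy
        if rng.random() < epsilon:
            action = rng.randrange(2)
        else:
            action = max(range(2), key=lambda a: q[state][a])

        # 3. receive the reward and the new state
        next_state, reward = env.step(action)

        # 4. update the policy, here indirectly through its Q-value estimates
        td_target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (td_target - q[state][action])

        state = next_state
        total_reward += reward
    return q, total_reward


q, total = run_loop()
print(q, total)
```

After enough steps the estimate for action 1 in state 1 should dominate, because that is the only rewarded transition in this toy setup.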

## 2. Core RL Algorithms

### 2.1 Q-Learning
Off-policy temporal-difference (TD) learning that estimates the optimal Q-values:
\[
Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]
\]

- $\alpha$: Learning rate
- $\gamma$: Discount factor
- $\epsilon$: Exploration rate (used by the ε-greedy action selection in the code, not by the update rule itself)
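
To make the update rule concrete, here is a single update worked through with made-up numbers. The helper name `q_update` and the input values are assumptions chosen only for illustration.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.99):
    """Return the updated Q(s, a) after one Q-learning step."""
    td_target = reward + gamma * max_next_q   # r + gamma * max_a' Q(s', a')
    td_error = td_target - q_sa               # how far the estimate is off
    return q_sa + alpha * td_error


# Q(s,a) = 0.5, r = 1, max_a' Q(s',a') = 2:
# target = 1 + 0.99 * 2 = 2.98, error = 2.48, new value = 0.5 + 0.1 * 2.48
new_q = q_update(q_sa=0.5, reward=1.0, max_next_q=2.0)
print(round(new_q, 3))  # 0.748
```

The estimate moves only a fraction $\alpha$ of the way toward the target, which is what keeps the updates stable when rewards are noisy.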

In [1]:
# Import the required libraries
import numpy as np
import gym

# Initialize the environment
env = gym.make('FrozenLake-v1', is_slippery=False)
n_states = env.observation_space.n
n_actions = env.action_space.n

# Initialize the Q-table
Q = np.zeros((n_states, n_actions))

# Hyperparameters
alpha = 0.1
gamma = 0.99
epsilon = 0.1
n_episodes = 1000

# Helper so that reset works across gym versions
def safe_reset(env):
    result = env.reset()
    return result[0] if isinstance(result, tuple) else result

# Helper so that step works across gym versions
def safe_step(env, action):
    result = env.step(action)
    if len(result) == 5:
        next_state, reward, terminated, truncated, _ = result
        done = terminated or truncated
    elif len(result) == 4:
        next_state, reward, done, _ = result
    else:
        raise ValueError("Unrecognized result format from env.step().")
    return next_state, reward, done


# Q-learning implementation
for episode in range(n_episodes):
    state = safe_reset(env)
    done = False

    while not done:
        # Epsilon-greedy action selection
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = np.argmax(Q[state])  # exploit

        # Take the action and observe the transition
        next_state, reward, done = safe_step(env, action)

        # Update the Q-table
        Q[state, action] = Q[state, action] + alpha * (
            reward + gamma * np.max(Q[next_state]) - Q[state, action]
        )

        state = next_state

# Print the Q-table after training
print("Q-table after training:")
print(Q)


Q-table after training:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
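
An all-zero table means no reward was ever propagated, which is what happens if the agent never reaches the goal. One likely cause is tie-breaking: `np.argmax` returns the first maximal index, so on an all-zero row the greedy choice is always action 0 (LEFT in FrozenLake), and only the occasional random action moves the agent elsewhere. The sketch below shows one possible remedy, breaking ties at random. The function name `epsilon_greedy` is a hypothetical helper, not something from gym or NumPy.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of a Q-table.

    Ties among the highest-valued actions are broken at random,
    unlike argmax, which always returns the first maximal index.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))          # explore
    best = max(q_row)
    best_actions = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(best_actions)               # exploit, random tie-break


# On an untrained (all-zero) row, every action can now be selected
rng = random.Random(0)
picked = {epsilon_greedy([0.0, 0.0, 0.0, 0.0], epsilon=0.0, rng=rng)
          for _ in range(200)}
print(sorted(picked))
```

In the training loop this would stand in for the `np.argmax(Q[state])` branch, for example by passing `list(Q[state])` as `q_row`. Other remedies, such as a larger initial `epsilon` or optimistic initial Q-values, are equally reasonable.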


### 2.2 Deep Q-Networks (DQN)
Uses neural networks to approximate Q-values. Key innovations:
- Experience replay
- Target network
- Frame stacking

In [None]:
# Import the libraries needed for the DQN agent
import random
from collections import deque

import numpy as np
import tensorflow as tf

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        # Epsilon-greedy exploration parameters
        self.epsilon = 1.0   # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()
        self.target_model = self._build_model()

    def _build_model(self):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu'),
            tf.keras.layers.Dense(24, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='linear')
        ])
        # `lr` is deprecated in recent Keras versions; use `learning_rate`
        model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Explore with probability epsilon, otherwise exploit the current Q-estimates
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model.predict(state, verbose=0)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = self.model.predict(state)
            if done:
                target[0][action] = reward
            else:
                t = self.target_model.predict(next_state)
                target[0][action] = reward + self.gamma * np.amax(t[0])
            self.model.fit(state, target, epochs=1, verbose=0)
        # Decay epsilon after each replay so that exploration shrinks over time
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

## 3. Policy Gradient Methods

Directly optimize the policy $\pi_\theta(a|s)$. The REINFORCE algorithm uses the gradient estimate:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]
\]

where $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted return from time step $t$.
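
The return $G_t$ can be computed for every step of an episode in a single backward pass. The sketch below uses a hypothetical helper `discounted_returns` and a made-up three-step episode with $\gamma = 0.5$ so that the arithmetic is easy to check by hand.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):   # walk backwards from the last step
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


# G_2 = 1, G_1 = 1 + 0.5 * 1 = 1.5, G_0 = 1 + 0.5 * 1.5 = 1.75
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier actions receive larger returns here because they are credited with everything that follows them, discounted by $\gamma$ per step.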

In [None]:
class PGAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.95
        self.learning_rate = 0.001
        self.states = []
        self.actions = []
        self.rewards = []
        self.model = self._build_model()

    def _build_model(self):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(24, input_dim=self.state_size, activation='relu'),
            tf.keras.layers.Dense(24, activation='relu'),
            tf.keras.layers.Dense(self.action_size, activation='softmax')
        ])
        # `lr` is deprecated in recent Keras versions; use `learning_rate`
        model.compile(loss='categorical_crossentropy',
                    optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate))
        return model

    def remember(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def act(self, state):
        state = np.reshape(state, [1, self.state_size])
        prob_weights = self.model.predict(state, verbose=0)
        # Sample an action from the policy's probability distribution
        action = np.random.choice(self.action_size, p=prob_weights[0])
        return action

    def train(self):
        discounted_rewards = []
        Gt = 0
        for reward in reversed(self.rewards):
            Gt = reward + self.gamma * Gt
            discounted_rewards.insert(0, Gt)

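        # Normalize returns; the small constant guards against division by zero
        # when all returns in the episode are equal
        discounted_rewards = np.array(discounted_rewards)
        discounted_rewards = (discounted_rewards - np.mean(discounted_rewards)) / (np.std(discounted_rewards) + 1e-8)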

        # Convert actions to one-hot vectors
        actions = tf.keras.utils.to_categorical(self.actions, num_classes=self.action_size)

        # Update policy
        self.model.train_on_batch(np.vstack(self.states), actions, sample_weight=discounted_rewards)

        # Clear memory
        self.states, self.actions, self.rewards = [], [], []

## 4. Proximal Policy Optimization (PPO)

A modern policy gradient algorithm with:
- Clipped surrogate objective
- Multiple epochs per rollout
- Advantage normalization

In [None]:
class PPOMemory:
    def __init__(self, batch_size):
        self.states = []
        self.actions = []
        self.probs = []
        self.vals = []
        self.rewards = []
        self.dones = []
        self.batch_size = batch_size

    def store(self, state, action, prob, val, reward, done):
        self.states.append(state)
        self.actions.append(action)
        self.probs.append(prob)
        self.vals.append(val)
        self.rewards.append(reward)
        self.dones.append(done)

    def clear(self):
        self.states = []
        self.actions = []
        self.probs = []
        self.vals = []
        self.rewards = []
        self.dones = []

class PPOAgent:
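    def __init__(self, state_dim, action_dim, action_std_init=0.6, batch_size=64):
        self.policy = self._build_network(state_dim, action_dim)
        self.policy_old = self._build_network(state_dim, action_dim)
        # PPOMemory requires a batch size; the default of 64 is an assumed placeholder
        self.memory = PPOMemory(batch_size)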

    def _build_network(self, state_dim, action_dim):
        # Simplified network architecture
        inputs = tf.keras.layers.Input(shape=(state_dim,))
        x = tf.keras.layers.Dense(64, activation='tanh')(inputs)
        x = tf.keras.layers.Dense(64, activation='tanh')(x)
        mean = tf.keras.layers.Dense(action_dim, activation='tanh')(x)
        std = tf.keras.layers.Dense(action_dim, activation='softplus')(x)
        return tf.keras.Model(inputs, [mean, std])

    def update(self):
        # PPO update logic would go here
        # Includes: advantages calculation, clipping, multiple epochs
        pass
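
The `update` method above is left as a stub. As a rough illustration of two of the ingredients it would need, the sketch below computes the clipped surrogate objective for a single sample and normalizes a batch of advantages. The function names and the clip range of 0.2 are assumptions for illustration; this is plain Python, not the TensorFlow code a full implementation would use.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s) for the sampled action.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale advantages to zero mean and (roughly) unit variance."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]


# A good action (A > 0) gains nothing from pushing the ratio past 1 + eps
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A bad action (A < 0) is still penalized with at least the clipped ratio
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
print(normalize_advantages([1.0, 2.0, 3.0]))
```

The clipping removes the incentive to move the new policy far from the old one within a single update, which is what makes several epochs over the same rollout safe.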

## 5. Practical Considerations

### 5.1 Reward Shaping
- Design rewards to guide learning
- Balance sparse vs dense rewards
- Avoid reward hacking

### 5.2 Exploration Strategies
- ε-greedy
- Boltzmann exploration
- Noisy networks
- Intrinsic curiosity
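
As an example of one alternative to ε-greedy, Boltzmann exploration samples actions with probability proportional to $\exp(Q(s,a)/\tau)$, so better actions are preferred without ruling out the rest. The sketch below is a minimal version under assumed names (`boltzmann_probs`, `boltzmann_action`) and an assumed temperature parameter `temperature`.

```python
import math
import random


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; lower temperature means greedier behavior."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


print(boltzmann_probs([1.0, 2.0, 3.0], temperature=1.0))
print(boltzmann_probs([1.0, 2.0, 3.0], temperature=0.1))
print(boltzmann_action([1.0, 2.0, 3.0], temperature=1.0))
```

With a high temperature the distribution approaches uniform, and with a low temperature it approaches the greedy choice, so the temperature plays a role similar to ε.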

## 6. Exercises

1. Implement Double DQN to reduce overestimation bias
2. Add prioritized experience replay to the DQN
3. Train a policy gradient agent on CartPole-v1
4. Compare performance of different exploration strategies
5. Visualize learned value functions and policies

## 6. Key Takeaways

- **Value-based methods** (Q-learning, DQN) learn optimal value functions
- **Policy-based methods** (REINFORCE, PPO) directly optimize policies
- **Model-based RL** learns environment dynamics for planning
- **Exploration vs Exploitation** must be carefully balanced
- **Reward design** critically impacts learning success

## 7. Implementing RL with TF-Agents

TensorFlow Agents provides production-grade RL implementations:

In [None]:
!pip uninstall -y keras keras-nightly keras-Preprocessing keras-vis
!pip install tensorflow==2.14
!pip install tf-agents==0.17.0
!pip install tensorflow-probability==0.23.0


In [None]:
import tensorflow as tf
from tf_agents.environments import suite_gym
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common  # provides element_wise_squared_loss used below


# ========================================================
# CREATE THE ENVIRONMENT
# ========================================================
env_name = 'CartPole-v1'

train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

train_env = TFPyEnvironment(train_py_env)
eval_env = TFPyEnvironment(eval_py_env)

# ========================================================
# CREATE THE Q-NETWORK
# ========================================================
fc_layer_params = (100,)  # more layers can be added here
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params
)

# ========================================================
# INITIALIZE THE AGENT
# ========================================================
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter
)

agent.initialize()

# ========================================================
# CREATE THE REPLAY BUFFER
# ========================================================
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=100000
)

print("✅ TF-Agents DQN agent initialized successfully.")


## 8. Training Loop Example

Complete training procedure with TF-Agents:

In [None]:
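from tf_agents.trajectories import trajectory


def train_agent(agent, env, replay_buffer, num_iterations=20000,
                initial_collect_steps=100):
    # initial_collect_steps is an assumed warm-up length, not a tuned value

    def collect_step(policy):
        # Take one environment step and store the transition in the buffer
        time_step = env.current_time_step()
        action_step = policy.action(time_step)
        next_time_step = env.step(action_step.action)
        traj = trajectory.from_transition(time_step, action_step, next_time_step)
        replay_buffer.add_batch(traj)

    # Seed the buffer so that two-step samples exist before training starts
    for _ in range(initial_collect_steps):
        collect_step(agent.collect_policy)

    # Create the dataset and its iterator once, outside the training loop
    dataset = replay_buffer.as_dataset(
        num_parallel_calls=3,
        sample_batch_size=64,
        num_steps=2).prefetch(3)
    iterator = iter(dataset)

    # Training loop
    for iteration in range(num_iterations):
        # Collect experience with the exploratory collect policy
        collect_step(agent.collect_policy)

        # Train on sampled experience
        experience, _ = next(iterator)
        train_loss = agent.train(experience).loss

        if iteration % 1000 == 0:
            print(f'Iteration {iteration}: Loss = {train_loss}')

    return agent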

## 9. Advanced RL Techniques

### 9.1 Imitation Learning
- Learn from expert demonstrations
- Behavioral cloning
- Dataset Aggregation (DAgger)

### 9.2 Multi-Agent RL
- Competitive/cooperative environments
- Markov Games framework
- Centralized training with decentralized execution

### 9.3 Hierarchical RL
- Temporal abstraction with options
- Meta-controllers and sub-policies
- Feudal networks

## 10. Debugging RL Systems

Common issues and solutions:

| Problem | Possible Causes | Solutions |
|---------|-----------------|-----------|
| No learning | Low exploration, bad rewards | Adjust ε, reshape rewards |
| Unstable training | High learning rate, small buffer | Reduce LR, increase buffer |
| Poor final performance | Limited capacity, local optima | Larger network, better exploration |
| High variance | Small batches, no target network | Increase batch size, add target net |

## 11. Final Summary

- **Tabular methods** work well for small state spaces
- **Deep RL** scales to complex environments
- **Policy gradients** handle continuous actions naturally
- **TF-Agents** provides production-ready implementations
- **Careful experimentation** is key to successful RL applications