# Policy Gradient Reinforcement Learning for Marketing Campaign Optimization

In this notebook, you'll learn how to apply policy gradient reinforcement learning to optimize marketing budget allocation across multiple channels. You'll implement a simulated marketing environment, train an RL agent that maximizes campaign ROI, and analyze how the learned policy adapts to changing market conditions.

## 1. Reinforcement Learning Concept Refresher

### The RL Loop
- **Agent** interacts with an **Environment** through **Actions**
- Environment returns **State** and **Reward**
- Agent aims to maximize cumulative reward over time

### Policy Gradients
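- **Policy**: A function π_θ(a|s), parameterized by θ, that outputs a probability distribution over actions a given state s
- **Objective**: Maximize the expected return J(θ) = E[∑ₜ rₜ] over trajectories sampled from the policy
- **Update rule**: θ ← θ + α∇J(θ) (gradient ascent with learning rate α)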
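
To make the update rule concrete, here is a minimal sketch on a toy problem that is not part of the marketing environment: a one-dimensional Gaussian policy whose only parameter is its mean `mu` (unit standard deviation), with an assumed reward of `-(a - 3)^2`. The names `toy_reward` and `train_toy_policy`, the learning rate, and the batch size are illustrative choices.

```python
import random

def toy_reward(action):
    """Hypothetical reward, highest when the action equals 3."""
    return -(action - 3.0) ** 2

def train_toy_policy(steps=300, batch_size=64, lr=0.05, seed=0):
    """Gradient ascent on J(mu) for a policy a ~ N(mu, 1)."""
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(steps):
        actions = [rng.gauss(mu, 1.0) for _ in range(batch_size)]
        rewards = [toy_reward(a) for a in actions]
        baseline = sum(rewards) / batch_size  # batch-mean baseline to reduce variance
        # For a unit-variance Gaussian, d/dmu log pi(a) = (a - mu)
        grad = sum((a - mu) * (r - baseline)
                   for a, r in zip(actions, rewards)) / batch_size
        mu += lr * grad  # theta <- theta + alpha * grad J(theta)
    return mu

learned_mu = train_toy_policy()
print(f"Learned mean: {learned_mu:.2f}")
```

The learned mean should end up close to 3, the action that maximizes the assumed reward.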

### Why Policy Gradients for Marketing?
- Naturally handles **continuous action spaces** (budget allocations)
- Works well with **stochastic environments** (uncertain market responses)
- Can learn **complex allocation strategies** without manual rules
- Adapts to **delayed rewards** (customer journey often spans days/weeks)

## 2. Environment Setup

Let's start by importing the necessary libraries and setting up our simulated marketing environment.

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal
import matplotlib.pyplot as plt
from collections import deque
import random

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

### Marketing Campaign Environment

We'll implement a marketing environment with the following characteristics:
- 3 marketing channels: Email, Social Media, and Search Ads
- Daily budget allocation decisions
- Stochastic returns with diminishing returns
- Delayed rewards (conversions happen over time)
- Shifting market conditions (channel effectiveness drifts over time)
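
The diminishing-returns behavior can be previewed in isolation. The sketch below assumes the response curve used in the environment cell that follows, `effectiveness * spend * exp(-spend / saturation)`, and plugs in the Search channel's values (0.7 and 400). `expected_return` is a helper name introduced here for illustration.

```python
import math

def expected_return(spend, effectiveness, saturation):
    """Expected (noise-free) return for one channel at a given daily spend."""
    return effectiveness * spend * math.exp(-spend / saturation)

# Search channel values from the environment: 0.7 return per dollar, saturation at 400
spends = range(0, 1001, 100)
curve = {s: expected_return(s, effectiveness=0.7, saturation=400) for s in spends}

for s, r in curve.items():
    print(f"spend={s:4d}  expected_return={r:6.1f}")

best_spend = max(curve, key=curve.get)
print(f"Return peaks at spend={best_spend}")
```

Because the curve peaks where spend equals the saturation point and then falls, putting the full 1000-unit budget into one channel returns less than spending at that channel's saturation point.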

In [2]:
class MarketingEnvironment:
    def __init__(self, num_channels=3, max_budget=1000, episode_length=30):
        self.num_channels = num_channels
        self.max_budget = max_budget
        self.episode_length = episode_length
        
        # Channel characteristics
        # [Email, Social, Search]
        self.base_effectiveness = np.array([0.3, 0.5, 0.7])  # Base return per dollar
        self.saturation_points = np.array([300, 500, 400])   # Saturation points (diminishing returns)
        self.volatility = np.array([0.1, 0.2, 0.15])         # Daily effectiveness volatility
        
        # Conversion delay model (days until conversion)
        self.delay_probs = np.array([0.5, 0.3, 0.15, 0.05])  # Probability of conversion after 0,1,2,3 days
        
        # State variables
        self.day = 0
        self.current_effectiveness = self.base_effectiveness.copy()
        self.pending_conversions = deque(
            [np.zeros(self.num_channels) for _ in self.delay_probs],
            maxlen=len(self.delay_probs)
        )
        
        # Tracking metrics
        self.total_reward = 0
        self.channel_spend = np.zeros(self.num_channels)
        self.channel_revenue = np.zeros(self.num_channels)
    
    def reset(self):
        """Reset the environment for a new episode"""
        self.day = 0
        self.current_effectiveness = self.base_effectiveness.copy()
        self.pending_conversions = deque(
            [np.zeros(self.num_channels) for _ in self.delay_probs],
            maxlen=len(self.delay_probs)
        )
        self.total_reward = 0
        self.channel_spend = np.zeros(self.num_channels)
        self.channel_revenue = np.zeros(self.num_channels)
        
        # Initial state: [day/episode_length, effectiveness_ch1, effectiveness_ch2, effectiveness_ch3,
        #                recent_spend_ch1, recent_spend_ch2, recent_spend_ch3,
        #                pending_conversions_0, pending_conversions_1, ...]
        state = self._get_state()
        return state
    
    def step(self, action):
        """Take a step in the environment given an action
        
        Args:
            action: numpy array of shape (num_channels,) with budget allocations
                    Values should be between 0 and 1, will be scaled to max_budget
        
        Returns:
            next_state: Current state representation
            reward: Reward from this step
            done: Whether episode is finished
            info: Additional information dictionary
        """
        # Ensure the action is valid: clip each entry to [0, 1]
        action = np.clip(action, 0, 1)
        # Normalize so the allocations sum to 1; an all-zero action spends nothing
        if np.sum(action) > 0:
            action = action / np.sum(action)
        
        # Scale to max budget
        budget_allocation = action * self.max_budget
        
        # Update spend tracking
        self.channel_spend += budget_allocation
        
        # Calculate immediate returns based on channel effectiveness and diminishing returns
        channel_returns = []
        for i in range(self.num_channels):
            # Apply diminishing returns formula: return = effectiveness * spend * exp(-spend/saturation)
            spend = budget_allocation[i]
            effectiveness = self.current_effectiveness[i]
            saturation = self.saturation_points[i]
            
            # Calculate expected return before random variation
            expected_return = effectiveness * spend * np.exp(-spend / saturation)
            
            # Add stochasticity
            actual_return = expected_return * (1 + np.random.normal(0, 0.2))
            channel_returns.append(max(0, actual_return))
        
        channel_returns = np.array(channel_returns)
        
        # Distribute returns across delayed conversion timeline
        for day_offset, prob in enumerate(self.delay_probs):
            self.pending_conversions[day_offset] += channel_returns * prob
        
        # Collect immediate reward (conversions from current and previous days)
        immediate_reward = np.sum(self.pending_conversions[0])
        
        # Update channel revenue tracking
        self.channel_revenue += self.pending_conversions[0]
        
        # Update total reward
        self.total_reward += immediate_reward
        
        # Rotate pending conversions queue
        self.pending_conversions.append(np.zeros(self.num_channels))
        
        # Update effectiveness (market dynamics)
        self._update_market_dynamics()
        
        # Advance to next day
        self.day += 1
        done = self.day >= self.episode_length
        
        # Get new state
        next_state = self._get_state()
        
        # Calculate ROI for info
        total_spend = np.sum(budget_allocation)
        roi = (immediate_reward - total_spend) / total_spend if total_spend > 0 else 0
        
        info = {
            'spend': budget_allocation,
            'returns': channel_returns,
            'immediate_reward': immediate_reward,
            'roi': roi,
            'effectiveness': self.current_effectiveness.copy()
        }
        
        return next_state, immediate_reward, done, info
    
    def _update_market_dynamics(self):
        """Update channel effectiveness based on market dynamics"""
        # Random walk with mean reversion
        for i in range(self.num_channels):
            # Mean reversion factor (pulls back toward base effectiveness)
            mean_reversion = 0.1 * (self.base_effectiveness[i] - self.current_effectiveness[i])
            
            # Random variation based on volatility
            random_change = np.random.normal(0, self.volatility[i])
            
            # Update effectiveness
            self.current_effectiveness[i] += mean_reversion + random_change
            
            # Ensure effectiveness stays positive
            self.current_effectiveness[i] = max(0.1, self.current_effectiveness[i])
    
    def _get_state(self):
        """Return current state representation"""
        # Normalized day
        normalized_day = self.day / self.episode_length
        
        # Channel effectiveness (normalized)
        normalized_effectiveness = self.current_effectiveness / np.max(self.base_effectiveness)
        
        # Average daily spend share per channel so far
        # (cumulative spend divided by the total budget available to date)
        normalized_recent_spend = np.zeros(self.num_channels)
        if self.day > 0:
            normalized_recent_spend = self.channel_spend / (self.day * self.max_budget)
        
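        # Flatten pending conversions: total revenue due on each upcoming day.
        # After the queue rotates in step(), the last slot is always empty, so drop it.
        flattened_pending = np.array([np.sum(conv) for conv in list(self.pending_conversions)[:-1]]) / self.max_budget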
        
        # Combine all state components
        state = np.concatenate([
            [normalized_day],
            normalized_effectiveness,
            normalized_recent_spend,
            flattened_pending
        ])
        
        return state
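
To see how the delayed-conversion queue behaves, the following standalone sketch replays the same `deque` mechanics for a single channel. It assumes one day of spend that generates 100 units of eventual revenue, followed by days with no spend. `simulate_delays` is an illustrative helper, not part of the class above.

```python
from collections import deque

def simulate_delays(daily_returns, delay_probs=(0.5, 0.3, 0.15, 0.05)):
    """Return the revenue collected each day under the delay model."""
    pending = deque([0.0] * len(delay_probs), maxlen=len(delay_probs))
    collected = []
    for todays_return in daily_returns:
        # Spread today's return over the current day and the following days
        for offset, prob in enumerate(delay_probs):
            pending[offset] += todays_return * prob
        collected.append(pending[0])  # revenue that converts today
        pending.append(0.0)           # maxlen drops slot 0 and shifts the rest left
    return collected

collected = simulate_delays([100.0, 0.0, 0.0, 0.0, 0.0])
print([round(c, 2) for c in collected])  # [50.0, 30.0, 15.0, 5.0, 0.0]
```

The revenue from a single day of spend arrives over four days, which is why the reward on any given day mixes the effects of several earlier actions.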

Let's test our environment with a simple run:

In [4]:
env = MarketingEnvironment()
state = env.reset()

print(f"State shape: {state.shape}")
print(f"Initial state: {state}")

# Try a random action
action = np.random.rand(3)
next_state, reward, done, info = env.step(action)

print(f"\nAction taken: {action}")
print(f"Scaled budget allocation: {info['spend']}")
print(f"Channel returns: {info['returns']}")
print(f"Immediate reward: {reward}")
print(f"ROI: {info['roi']:.2f}")
print(f"Next state: {next_state}")

## 3. Baseline Agents

Before implementing our policy gradient agent, let's establish some baselines:

In [None]:
def run_episode(env, policy_fn, render=False):
    """Run a full episode using the provided policy function"""
    state = env.reset()
    done = False
    total_reward = 0
    episode_data = []
    
    while not done:
        action = policy_fn(state)
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        
        if render:
            print(f"Day {env.day}, Action: {action}, Reward: {reward:.2f}, Effectiveness: {info['effectiveness']}")
        
        episode_data.append({
            'day': env.day,
            'action': info['spend'] / env.max_budget,  # normalized allocation actually applied by the environment
            'reward': reward,
            'roi': info['roi'],
            'effectiveness': info['effectiveness'].copy()
        })
        
        state = next_state
    
    return total_reward, episode_data

### 3.1 Random Agent

A random agent that allocates budget randomly across channels.

In [None]:
def random_policy(state):
    """Random allocation policy"""
    action = np.random.rand(3)
    return action

# Run multiple episodes with random policy
n_episodes = 10
random_rewards = []
random_data = []

env = MarketingEnvironment()

for i in range(n_episodes):
    reward, data = run_episode(env, random_policy)
    random_rewards.append(reward)
    random_data.append(data)
    
print(f"Random Policy - Avg Total Reward: {np.mean(random_rewards):.2f}")

### 3.2 Heuristic Agent

A heuristic agent that allocates budget proportionally to the current effectiveness of each channel.

In [None]:
class HeuristicPolicy:
    def __init__(self, env):
        self.env = env
        self.state_size = 10  # Day, 3 effectiveness values, 3 spend shares, 3 pending-conversion slots
    
    def __call__(self, state):
        # Extract effectiveness values from state
        effectiveness = state[1:4]  # Indices 1-3 contain channel effectiveness
        
        # Allocate budget proportionally to effectiveness
        if np.sum(effectiveness) > 0:
            allocation = effectiveness / np.sum(effectiveness)
        else:
            allocation = np.ones(3) / 3  # Equal allocation if all effectiveness is zero
        
        return allocation

# Run episodes with heuristic policy
env = MarketingEnvironment()
heuristic_policy = HeuristicPolicy(env)

heuristic_rewards = []
heuristic_data = []

for i in range(n_episodes):
    reward, data = run_episode(env, heuristic_policy)
    heuristic_rewards.append(reward)
    heuristic_data.append(data)
    
print(f"Heuristic Policy - Avg Total Reward: {np.mean(heuristic_rewards):.2f}")

Let's visualize the performance of our baseline agents:

In [None]:
def plot_rewards(rewards, title):
    """Plot rewards across episodes"""
    plt.figure(figsize=(10, 5))
    plt.plot(rewards)
    plt.title(title)
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.grid(True)
    plt.show()

def plot_cumulative_rewards(data_list, labels):
    """Plot cumulative rewards over time for multiple agents"""
    plt.figure(figsize=(12, 6))
    
    # Calculate mean cumulative rewards for each day across episodes
    for i, data in enumerate(data_list):
        # Average across episodes
        daily_rewards = np.zeros(30)  # Assuming 30-day episodes
        for episode_data in data:
            for entry in episode_data:
                daily_rewards[entry['day']-1] += entry['reward'] / len(data)
        
        # Calculate cumulative rewards
        cumulative_rewards = np.cumsum(daily_rewards)
        
        plt.plot(range(1, 31), cumulative_rewards, label=labels[i], linewidth=2)
    
    plt.title('Cumulative Reward by Day')
    plt.xlabel('Day')
    plt.ylabel('Cumulative Reward')
    plt.legend()
    plt.grid(True)
    plt.show()

# Plot baseline results
plot_rewards(random_rewards, 'Random Policy - Rewards per Episode')
plot_rewards(heuristic_rewards, 'Heuristic Policy - Rewards per Episode')
plot_cumulative_rewards([random_data, heuristic_data], ['Random', 'Heuristic'])

## 4. Policy Gradient Implementation

Now we'll implement a policy gradient algorithm to learn an optimal marketing budget allocation strategy.

### 4.1 Policy Network Architecture

We'll use a simple neural network to represent our policy. The network will take the state as input and output parameters for a multivariate normal distribution over actions (budget allocations).

For the marketing budget allocation problem:
- Input: State representation (day, channel effectiveness, recent spending, pending conversions)
- Output: Mean values for each channel's budget allocation (the standard deviation is a state-independent parameter learned alongside the network weights)

```
             ┌───────────┐
             │  State s  │
             └─────┬─────┘
                   │
                   ▼
        ┌─────────────────────┐
        │ Dense Layer (64)    │
        │ ReLU Activation     │
        └─────────┬───────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Dense Layer (32)    │
        │ ReLU Activation     │
        └─────────┬───────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Output Layer        │
        │ 3 Means (Sigmoid)   │
        └─────────┬───────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ Normal Distribution │
        │ N(μ, σ²)            │
        └─────────┬───────────┘
                  │
                  ▼
             ┌──────────┐
             │ Action a │
             └──────────┘
```

### 4.2 Policy Gradient Loss Derivation

The core of policy gradient methods is the policy gradient theorem, which gives us a way to compute the gradient of expected return J(θ) with respect to policy parameters θ:

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta}[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t]$$

Where:
- $\tau$ is a trajectory (sequence of states, actions, rewards)
- $\pi_\theta(a|s)$ is the policy (probability of taking action $a$ in state $s$)
- $R_t$ is the return from time $t$ to the end of the episode (the code below uses the discounted sum $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$)
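
As a small worked example of the return $R_t$, the sketch below computes reward-to-go values for a three-step episode, with an optional discount factor `gamma` (`gamma = 1` recovers the plain sum). The rewards and the discount of 0.5 are made-up numbers chosen so the arithmetic is easy to follow, and `reward_to_go` is an illustrative name.

```python
def reward_to_go(rewards, gamma=1.0):
    """Return R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

print(reward_to_go([1.0, 2.0, 3.0]))             # [6.0, 5.0, 3.0]
print(reward_to_go([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

Each action is credited only with rewards that arrive at or after its own time step, which is what the per-step weight $R_t$ in the gradient expresses.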

In practice, we use sample trajectories and compute the loss as:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \log \pi_\theta(a_t^i|s_t^i) \cdot R_t^i$$

For our continuous action space (budget allocations), we'll use a multivariate normal distribution for our policy. For simplicity, we'll use a diagonal covariance (independent dimensions) whose standard deviations do not depend on the state; they are stored as a trainable `log_std` parameter.
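
With a diagonal covariance, the log-probability that enters the loss factorizes over channels. Writing $\mu_{\theta,i}(s)$ for the network's mean output and $\sigma_i$ for the standard deviation of channel $i$ (notation introduced here for illustration), the standard Gaussian log-density is:

$$\log \pi_\theta(a|s) = -\sum_{i=1}^{3} \left[ \frac{(a_i - \mu_{\theta,i}(s))^2}{2\sigma_i^2} + \log \sigma_i + \frac{1}{2}\log(2\pi) \right]$$

Its gradient with respect to the mean is $(a_i - \mu_{\theta,i}(s)) / \sigma_i^2$, so the update moves the mean toward a sampled action when that action is paired with a positive return weight and away from it when the weight is negative.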

In [None]:
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=64):
        super(PolicyNetwork, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Linear(hidden_size // 2, action_dim),
            nn.Sigmoid()  # Output in [0,1] range for budget allocation
        )
        
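        # Log standard deviations, one per action dimension.
        # They are state-independent but trainable (nn.Parameter), so the optimizer
        # can adjust the exploration noise; the initial std is exp(-0.5), about 0.61.
        self.log_std = nn.Parameter(torch.ones(action_dim) * -0.5)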
        
    def forward(self, state):
        # Convert state to tensor if it's not already
        if not isinstance(state, torch.Tensor):
            state = torch.FloatTensor(state)
        
        # Get mean actions
        action_means = self.network(state)
        
        # Create action distribution
        std = torch.exp(self.log_std)
        return action_means, std
    
    def sample_action(self, state):
        """Sample an action from the policy distribution"""
        mean, std = self.forward(state)
        
        # Create normal distribution
        normal = Normal(mean, std)
        
        # Sample action
        action = normal.sample()
        
        # Calculate log probability
        log_prob = normal.log_prob(action).sum(dim=-1)
        
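        # Clip action to valid range [0, 1]. The log-probability above belongs to the
        # unclipped sample, so it is only approximate whenever clipping changes the action.
        action = torch.clamp(action, 0, 1)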
        
        return action.detach().numpy(), log_prob.detach()
    
    def evaluate_action(self, state, action):
        """Calculate log probability of an action"""
        mean, std = self.forward(state)
        
        # Create normal distribution
        normal = Normal(mean, std)
        
        # Calculate log probability
        log_prob = normal.log_prob(action).sum(dim=-1)
        
        # Calculate entropy for exploration encouragement
        entropy = normal.entropy().sum(dim=-1)
        
        return log_prob, entropy

### 4.3 Training Loop

Now we'll implement the REINFORCE algorithm (vanilla policy gradient) to train our policy network.

In [None]:
def collect_trajectory(env, policy):
    """Collect one trajectory (episode) using the policy"""
    state = env.reset()
    states, actions, rewards, log_probs = [], [], [], []
    done = False
    episode_reward = 0
    
    while not done:
        # Convert state to tensor
        state_tensor = torch.FloatTensor(state)
        
        # Sample action from policy
        action, log_prob = policy.sample_action(state_tensor)
        
        # Take action in environment
        next_state, reward, done, _ = env.step(action)
        
        # Store experience
        states.append(state_tensor)
        actions.append(torch.FloatTensor(action))
        rewards.append(reward)
        log_probs.append(log_prob)
        
        # Update state and episode reward
        state = next_state
        episode_reward += reward
    
    return states, actions, rewards, log_probs, episode_reward

def compute_returns(rewards, gamma=0.99):
    """Compute discounted returns"""
    returns = []
    R = 0
    
    # Calculate returns from the end of the episode
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    
    # Normalize returns for stability (float32 to match the policy network's dtype)
    returns = torch.tensor(returns, dtype=torch.float32)
    if len(returns) > 1:
        returns = (returns - returns.mean()) / (returns.std() + 1e-5)
    
    return returns

def train_policy_gradient(env, policy, optimizer, num_episodes=200, gamma=0.99, print_every=10, entropy_weight=0.01):
    """Train policy using policy gradient"""
    episode_rewards = []
    average_rewards = []
    
    for episode in range(num_episodes):
        # Collect trajectory
        states, actions, rewards, log_probs, episode_reward = collect_trajectory(env, policy)
        episode_rewards.append(episode_reward)
        
        # Compute returns
        returns = compute_returns(rewards, gamma)
        
        # Convert to tensors
        states = torch.stack(states)
        actions = torch.stack(actions)
        
        # Compute loss
        # We need to recompute log probabilities for gradient calculation
        log_probs, entropy = policy.evaluate_action(states, actions)
        
        # Policy gradient loss
        policy_loss = -(log_probs * returns).mean()
        
        # Add entropy bonus to encourage exploration
        loss = policy_loss - entropy_weight * entropy.mean()
        
        # Update policy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track progress
        if episode % print_every == 0:
            avg_reward = np.mean(episode_rewards[-print_every:])
            average_rewards.append(avg_reward)
            print(f"Episode {episode}, Average Reward: {avg_reward:.2f}")
    
    return episode_rewards, average_rewards

### 4.4 Train the Policy Gradient Agent

In [None]:
# Create environment
env = MarketingEnvironment()

# Initialize policy network
state_dim = len(env.reset())
action_dim = env.num_channels
policy = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=0.01)

# Train policy
episode_rewards, average_rewards = train_policy_gradient(
    env, policy, optimizer, 
    num_episodes=100,  # Reduced for notebook runtime
    print_every=10
)

### 4.5 Evaluate and Compare Performance

In [None]:
# Create policy function that uses trained policy network
def pg_policy(state):
    action, _ = policy.sample_action(state)
    return action

# Run episodes with trained policy
pg_rewards = []
pg_data = []

for i in range(n_episodes):
    reward, data = run_episode(env, pg_policy)
    pg_rewards.append(reward)
    pg_data.append(data)
    
print(f"Policy Gradient - Avg Total Reward: {np.mean(pg_rewards):.2f}")

# Plot training progress
plt.figure(figsize=(12, 6))
plt.plot(episode_rewards)
plt.plot(range(0, len(episode_rewards), 10), average_rewards, 'r--')
plt.title('Policy Gradient Training Progress')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.legend(['Episode Reward', 'Average Reward (10 episodes)'])
plt.grid(True)
plt.show()

# Compare all agents
plot_cumulative_rewards([random_data, heuristic_data, pg_data], ['Random', 'Heuristic', 'Policy Gradient'])

Let's visualize how our trained policy allocates budget across channels compared to the baseline policies:

In [None]:
# Run a single episode with each policy and plot the budget allocations
def plot_budget_allocations(policies, policy_names):
    plt.figure(figsize=(15, 10))
    
    for i, (policy_fn, name) in enumerate(zip(policies, policy_names)):
        # Run episode
        env = MarketingEnvironment()
        _, data = run_episode(env, policy_fn)
        
        # Extract allocations
        days = [entry['day'] for entry in data]
        ch1_allocations = [entry['action'][0] for entry in data]
        ch2_allocations = [entry['action'][1] for entry in data]
        ch3_allocations = [entry['action'][2] for entry in data]
        
        # Extract effectiveness for reference
        ch1_effectiveness = [entry['effectiveness'][0] for entry in data]
        ch2_effectiveness = [entry['effectiveness'][1] for entry in data]
        ch3_effectiveness = [entry['effectiveness'][2] for entry in data]
        
        # Plot allocations
        plt.subplot(len(policies), 2, 2*i+1)
        plt.stackplot(days, ch1_allocations, ch2_allocations, ch3_allocations, 
                      labels=['Email', 'Social', 'Search'],
                      alpha=0.7)
        plt.title(f'{name} - Budget Allocation')
        plt.xlabel('Day')
        plt.ylabel('Allocation Proportion')
        plt.legend(loc='upper left')
        plt.grid(True)
        
        # Plot channel effectiveness
        plt.subplot(len(policies), 2, 2*i+2)
        plt.plot(days, ch1_effectiveness, 'g-', label='Email')
        plt.plot(days, ch2_effectiveness, 'b-', label='Social')
        plt.plot(days, ch3_effectiveness, 'r-', label='Search')
        plt.title(f'{name} - Channel Effectiveness')
        plt.xlabel('Day')
        plt.ylabel('Effectiveness')
        plt.legend(loc='upper left')
        plt.grid(True)
    
    plt.tight_layout()
    plt.show()

# Plot allocations
plot_budget_allocations(
    [random_policy, heuristic_policy, pg_policy], 
    ['Random Policy', 'Heuristic Policy', 'Policy Gradient']
)

## 5. Interpretation & Marketing Insights

Now that we've trained our policy gradient agent, let's analyze what marketing insights we can derive from its learned budget allocation strategy.

The cell below runs one episode with the trained policy and plots its allocations, channel effectiveness, daily reward, and ROI so we can inspect its behavior:

In [None]:
# Run a detailed analysis of a single episode with our trained policy
env = MarketingEnvironment()
_, detailed_data = run_episode(env, pg_policy, render=False)

# Extract data for analysis
days = [entry['day'] for entry in detailed_data]
allocations = np.array([entry['action'] for entry in detailed_data])
effectiveness = np.array([entry['effectiveness'] for entry in detailed_data])
rewards = [entry['reward'] for entry in detailed_data]
rois = [entry['roi'] for entry in detailed_data]

# Plot various metrics
plt.figure(figsize=(18, 12))

# Plot allocations vs. effectiveness
plt.subplot(2, 2, 1)
plt.stackplot(days, allocations[:, 0], allocations[:, 1], allocations[:, 2],
             labels=['Email', 'Social', 'Search'], alpha=0.7)
plt.title('Budget Allocation Strategy')
plt.xlabel('Day')
plt.ylabel('Allocation Proportion')
plt.legend(loc='upper left')
plt.grid(True)

plt.subplot(2, 2, 2)
plt.plot(days, effectiveness[:, 0], 'g-', label='Email')
plt.plot(days, effectiveness[:, 1], 'b-', label='Social')
plt.plot(days, effectiveness[:, 2], 'r-', label='Search')
plt.title('Channel Effectiveness')
plt.xlabel('Day')
plt.ylabel('Effectiveness')
plt.legend(loc='upper left')
plt.grid(True)

# Plot daily rewards and ROI
plt.subplot(2, 2, 3)
plt.plot(days, rewards, 'b-')
plt.title('Daily Reward')
plt.xlabel('Day')
plt.ylabel('Reward')
plt.grid(True)

plt.subplot(2, 2, 4)
plt.plot(days, rois, 'g-')
plt.title('Daily ROI')
plt.xlabel('Day')
plt.ylabel('ROI')
plt.grid(True)

plt.tight_layout()
plt.show()

# Calculate correlation between effectiveness and allocation
ch1_corr = np.corrcoef(effectiveness[:, 0], allocations[:, 0])[0, 1]
ch2_corr = np.corrcoef(effectiveness[:, 1], allocations[:, 1])[0, 1]
ch3_corr = np.corrcoef(effectiveness[:, 2], allocations[:, 2])[0, 1]

print(f"Correlation between effectiveness and allocation:")
print(f"Email: {ch1_corr:.2f}")
print(f"Social: {ch2_corr:.2f}")
print(f"Search: {ch3_corr:.2f}")

### Key Marketing Insights

From the trained policy's behavior, we can look for several marketing-relevant patterns. Treat these as hypotheses to check against the plots and correlations above, not as guaranteed outcomes of a short training run:

1. **Dynamic Budget Reallocation**: Channel effectiveness is part of the state, so the agent can shift budget as effectiveness drifts. The heuristic baseline also reacts to effectiveness, but only through a fixed proportional rule, and the random baseline ignores it entirely.

2. **Exploitation vs. Exploration**: Because the policy is stochastic, the agent keeps sampling allocations around its preferred split, which lets it continue probing channels it currently underuses while returns remain noisy.

3. **Response to Diminishing Returns**: Each channel's return peaks near its saturation point, so a well-trained agent should spread budget across channels instead of pouring everything into the single most effective one.

4. **Anticipation of Delayed Conversions**: Pending conversions are part of the state and the return sums rewards over future days, so the policy is optimized for cumulative reward and not only for same-day revenue.

5. **Channel Synergies**: The simulated channels respond independently, so this environment cannot teach cross-channel effects (for example, email engagement lifting later social media performance). Capturing such synergies would require adding channel interactions to the environment.

For marketers, this suggests that:
- Budget allocation should be reviewed and adjusted more frequently than typical monthly or quarterly cycles
- Channel performance should be evaluated in the context of the full customer journey, not just immediate returns
- Testing low-allocation channels periodically can reveal changing effectiveness patterns
- Adaptive strategies can outperform static allocation plans when channel effectiveness drifts, although gains seen in a simulation need to be validated on real campaign data

## 6. Exercises

### Exercise 1: Vary Reward Delay and Noise

Modify the environment to test how different reward delay patterns and noise levels affect the learning stability of the policy gradient agent.

In [None]:
# Modify the delay_probs and volatility parameters in the MarketingEnvironment class
# Train new policies with these modified environments, then compare performance

# Example:
# 1. Create an environment with longer delay
# 2. Create an environment with higher volatility
# 3. Train policies on each and compare learning curves and final performance
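
One possible starting point, offered as a sketch and not as the intended solution: build a longer delay distribution with geometrically decaying weights. `make_delay_probs` and its `decay` parameter are hypothetical helpers introduced here.

```python
def make_delay_probs(num_days, decay=0.7):
    """Geometrically decaying conversion-delay probabilities that sum to 1."""
    if num_days < 1 or not 0 < decay <= 1:
        raise ValueError("num_days must be >= 1 and decay must be in (0, 1]")
    weights = [decay ** day for day in range(num_days)]
    total = sum(weights)
    return [w / total for w in weights]

long_delay = make_delay_probs(7)
print([round(p, 3) for p in long_delay])
```

Assigning `np.array(long_delay)` to an environment's `delay_probs` attribute and then calling `reset()` should rebuild the pending-conversions queue at the new length. The state vector grows with it, so `state_dim` has to be recomputed before creating the policy network.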



### Exercise 2: Add a Spend Cap Constraint

Modify the policy network to respect a total daily budget cap while still optimizing allocation across channels.

In [None]:
# Modify the PolicyNetwork class to ensure the sum of allocations respects a budget cap
# Hint: You can use a softmax output layer instead of sigmoid to ensure allocations sum to 1,
# or you can normalize the output in the action sampling method
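
As a hedged illustration of the softmax hint, the standalone sketch below maps unconstrained scores to allocations that are non-negative and sum to 1, then scales them by a daily cap. The function names and the cap of 1000 are assumptions for illustration; inside `PolicyNetwork` the equivalent would be a softmax over the final layer's outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def allocate_budget(scores, daily_cap=1000.0):
    """Turn raw per-channel scores into spends that total daily_cap (up to rounding)."""
    return [daily_cap * share for share in softmax(scores)]

spends = allocate_budget([0.2, 1.0, 1.5])
print([round(s, 1) for s in spends], round(sum(spends), 6))
```

Because the constraint is built into the output mapping, no sampled score vector can exceed the cap, whereas clipping and renormalizing after sampling only enforces it after the fact.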



### Exercise 3: Alternative Reward Metrics

Modify the environment to use customer lifetime value (LTV) or another marketing metric as the reward signal instead of immediate revenue.

In [None]:
# Modify the MarketingEnvironment class to include a customer LTV model
# For example, you could add a customer state that tracks repeated interactions
# or implement a simplified LTV calculation based on first purchase and retention probability
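
A simplified LTV calculation along these lines might look like the sketch below. It assumes a customer repeats the same purchase each period with a constant retention probability, which gives a geometric series. `simple_ltv` and its parameters are hypothetical and not part of the environment.

```python
def simple_ltv(first_purchase, retention, discount=1.0):
    """Expected lifetime value under constant per-period retention.

    LTV = first_purchase * sum_{k>=0} (retention * discount)^k
        = first_purchase / (1 - retention * discount)
    """
    rate = retention * discount
    if not 0 <= rate < 1:
        raise ValueError("retention * discount must be in [0, 1)")
    return first_purchase / (1.0 - rate)

print(simple_ltv(100.0, retention=0.5))  # 200.0
print(simple_ltv(100.0, retention=0.0))  # 100.0
```

In the environment, the reward could then be each day's conversions weighted by such an LTV estimate instead of immediate revenue.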



## 7. Further Reading & Citations

### Advanced Reinforcement Learning for Marketing

- **Actor-Critic Methods**: Combines value-based and policy-based methods for greater stability and sample efficiency
- **Offline RL**: Learn from historical marketing data without active experimentation
- **Multi-Objective RL**: Optimize for multiple marketing KPIs simultaneously (e.g., revenue, customer acquisition cost, retention)
- **Contextual Bandits**: Simpler special case of RL focused on immediate reward optimization

### Key Papers & Resources

1. Silver, D., et al. (2014). ["Deterministic Policy Gradient Algorithms"](http://proceedings.mlr.press/v32/silver14.pdf)
2. Schulman, J., et al. (2017). ["Proximal Policy Optimization Algorithms"](https://arxiv.org/abs/1707.06347)
3. Theocharous, G., et al. (2015). ["Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees"](https://www.ijcai.org/Proceedings/15/Papers/081.pdf)
4. Ie, E., et al. (2019). ["SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets"](https://arxiv.org/abs/1905.12767)
5. Gauci, J., et al. (2018). ["Horizon: Facebook's Open Source Applied Reinforcement Learning Platform"](https://ai.facebook.com/blog/reagent-applying-reinforcement-learning-to-real-world-problems/) (the platform was later renamed ReAgent)

### Books
1. Sutton, R. S., & Barto, A. G. (2018). ["Reinforcement Learning: An Introduction"](http://incompleteideas.net/book/the-book-2nd.html)

### Online Courses
1. ["Practical Reinforcement Learning"](https://www.coursera.org/learn/practical-rl) (Coursera)
2. ["Deep Reinforcement Learning"](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) (Udacity)

## Next Steps

- **Actor-Critic Implementation**: Extend this notebook with an actor-critic approach for more stable learning
- **Offline RL with Historical Data**: Adapt the algorithms to learn from historical marketing campaign data
- **A/B Testing Integration**: Design a system that combines RL with traditional A/B testing for safer deployment
- **Multi-Channel Attribution**: Incorporate attribution modeling to better distribute credit for conversions
- **Bayesian RL**: Add uncertainty estimates to guide exploration and provide confidence intervals on budget decisions
- **User Segmentation**: Extend the environment to handle different user segments with varying response patterns