# Deep Q-Network (DQN) for Portfolio Management

This notebook demonstrates how to use Deep Reinforcement Learning for portfolio allocation decisions.

## Key Concepts
- **DQN (Deep Q-Network)**: Neural network that approximates Q-values over a continuous state space with a discrete set of actions
- **Experience Replay**: Breaks correlations in training data for stable learning
- **Target Network**: Stabilizes Q-learning with a slowly-updated target
- **Portfolio Environment**: State = market features, Action = one of four discrete allocation rules, each mapped to portfolio weights
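
As a rough illustration of what the Q-network is trained to match, the sketch below computes a one-step temporal-difference target in plain Python. The function name `td_target` and the sample numbers are illustrative assumptions, not part of the notebook's code.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target: r + gamma * max_a' Q_target(s', a').

    next_q_values: target-network Q-value estimates for the next state.
    done: True if the episode ended, in which case there is no bootstrap term.
    """
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap


# Hypothetical transition: reward 0.5, four actions available in the next state.
print(td_target(0.5, [1.0, 2.0, 0.5, 1.5], done=False))  # 0.5 + 0.99 * 2.0
print(td_target(0.5, [1.0, 2.0, 0.5, 1.5], done=True))   # terminal: reward only
```

The agent in section 3 uses the Double DQN variant of this target, which picks the next action with the policy network rather than taking the max over the target network's estimates.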

## Requirements
```bash
pip install torch numpy pandas matplotlib
```

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import deque, namedtuple
import random
import warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)
np.random.seed(42)
random.seed(42)  # Python's RNG drives epsilon-greedy exploration and replay sampling

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Portfolio Trading Environment
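
To make the cost model concrete, here is a small worked example of how turnover translates into a transaction cost, mirroring the arithmetic in `PortfolioEnv.step` below. The helper name `rebalance_cost` and the weights are made up for illustration.

```python
def rebalance_cost(old_weights, new_weights, portfolio_value, transaction_cost=0.001):
    """Cost of moving from old_weights to new_weights.

    Turnover is the sum of absolute weight changes; the cost is
    turnover * cost per dollar traded * current portfolio value.
    """
    turnover = sum(abs(n - o) for o, n in zip(old_weights, new_weights))
    return turnover, turnover * transaction_cost * portfolio_value


# Illustrative move from equal weight to a tilted allocation on a $100 portfolio.
turnover, cost = rebalance_cost([0.25, 0.25, 0.25, 0.25], [0.40, 0.10, 0.30, 0.20], 100.0)
print(f"turnover={turnover:.2f}, cost=${cost:.4f}")
```

Repeating the same action leaves the cost at zero only when the resulting weights are unchanged; the momentum and low-vol tilts recompute their weights every day, so they incur some cost on each step.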

In [None]:
class PortfolioEnv:
    """Trading environment for portfolio allocation"""
    
    def __init__(self, returns, lookback=20, transaction_cost=0.001):
        """
        Args:
            returns: DataFrame of asset returns (n_days x n_assets)
            lookback: Number of days for state features
            transaction_cost: Cost per dollar traded
        """
        self.returns = returns.values
        self.n_assets = returns.shape[1]
        self.lookback = lookback
        self.transaction_cost = transaction_cost
        
        # Action space: discrete allocation choices
        # 0: Equal weight, 1: Momentum tilt, 2: Low-vol tilt, 3: Cash heavy
        self.n_actions = 4
        
        # State: 3 features per asset (mean return, volatility, momentum)
        self.state_dim = self.n_assets * 3 + 2  # + portfolio value, investment level
        
        self.reset()
    
    def _get_allocation(self, action):
        """Convert discrete action to portfolio weights"""
        if action == 0:  # Equal weight
            weights = np.ones(self.n_assets) / self.n_assets
        elif action == 1:  # Momentum tilt (overweight positive momentum)
            momentum = self.returns[self.t-self.lookback:self.t].mean(axis=0)
            weights = np.maximum(momentum, 0)
            weights = weights / (weights.sum() + 1e-8)
        elif action == 2:  # Low-vol tilt
            vol = self.returns[self.t-self.lookback:self.t].std(axis=0)
            weights = 1 / (vol + 1e-8)
            weights = weights / weights.sum()
        else:  # Cash heavy (25% invested in total, split equally; remainder in cash)
            weights = np.ones(self.n_assets) * 0.25 / self.n_assets
        
        return weights
    
    def _get_state(self):
        """Compute state features"""
        recent_returns = self.returns[self.t-self.lookback:self.t]
        
        # Features for each asset
        avg_returns = recent_returns.mean(axis=0)
        volatilities = recent_returns.std(axis=0)
        momentum = recent_returns.sum(axis=0)  # Cumulative return
        
        # Normalize features
        avg_returns = avg_returns / (np.abs(avg_returns).max() + 1e-8)
        volatilities = volatilities / (volatilities.max() + 1e-8)
        momentum = momentum / (np.abs(momentum).max() + 1e-8)
        
        state = np.concatenate([
            avg_returns,
            volatilities,
            momentum,
            [self.portfolio_value / 100 - 1],  # Normalized portfolio value
            [self.current_weights.sum()]  # Investment level
        ])
        
        return state.astype(np.float32)
    
    def reset(self):
        """Reset environment"""
        self.t = self.lookback
        self.portfolio_value = 100.0
        self.current_weights = np.ones(self.n_assets) / self.n_assets
        return self._get_state()
    
    def step(self, action):
        """Take action and return next state, reward, done"""
        # Get new target weights
        new_weights = self._get_allocation(action)
        
        # Transaction cost
        turnover = np.abs(new_weights - self.current_weights).sum()
        cost = turnover * self.transaction_cost * self.portfolio_value
        
        # Apply returns
        daily_return = (new_weights * self.returns[self.t]).sum()
        self.portfolio_value *= (1 + daily_return)
        self.portfolio_value -= cost
        
        # Update state
        self.current_weights = new_weights
        self.t += 1
        
        # Reward: net daily return in percent (return minus proportional cost), scaled for learning
        reward = (daily_return - turnover * self.transaction_cost) * 100
        
        # Done if end of data
        done = self.t >= len(self.returns) - 1
        
        return self._get_state(), reward, done, {'portfolio_value': self.portfolio_value}

# Generate synthetic returns data
def generate_market_data(n_days=1000, n_assets=4):
    """Generate synthetic multi-asset returns with realistic properties"""
    # Base volatilities
    vols = np.array([0.15, 0.20, 0.25, 0.12])[:n_assets] / np.sqrt(252)
    
    # Correlation matrix
    corr = np.array([
        [1.0, 0.6, 0.4, -0.2],
        [0.6, 1.0, 0.5, -0.1],
        [0.4, 0.5, 1.0, 0.0],
        [-0.2, -0.1, 0.0, 1.0]
    ])[:n_assets, :n_assets]
    
    # Cholesky decomposition for correlated returns
    L = np.linalg.cholesky(corr)
    
    # Generate returns with volatility clustering
    returns = []
    vol_state = vols.copy()
    
    for _ in range(n_days):
        # Mean-reverting noisy volatility (a rough stand-in for GARCH-style clustering)
        vol_state = 0.9 * vol_state + 0.1 * vols * (1 + np.random.randn(n_assets) * 0.5)
        vol_state = np.clip(vol_state, vols * 0.5, vols * 2)
        
        # Correlated returns
        z = np.random.randn(n_assets)
        daily_ret = L @ z * vol_state
        
        # Add small drift
        drift = np.array([0.08, 0.10, 0.06, 0.03])[:n_assets] / 252
        daily_ret += drift
        
        returns.append(daily_ret)
    
    df = pd.DataFrame(returns, columns=['Stock', 'Growth', 'SmallCap', 'Bonds'][:n_assets])
    return df

# Create environment
returns_df = generate_market_data(1500, 4)
env = PortfolioEnv(returns_df, lookback=20, transaction_cost=0.001)

print(f"Returns shape: {returns_df.shape}")
print(f"State dimension: {env.state_dim}")
print(f"Number of actions: {env.n_actions}")
print(f"\nReturn statistics:")
print((returns_df * 252).describe())

## 2. DQN Architecture
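
The network below uses a dueling head. As a minimal sketch of that aggregation step, the plain-Python function here combines a state value and per-action advantages the same way the `forward` method does; `dueling_q` and the numbers are illustrative assumptions.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Subtracting the mean advantage pins down the decomposition:
    the resulting Q-values always average to exactly `value`.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# Illustrative numbers: V(s) = 2.0 and advantages for four actions.
q = dueling_q(2.0, [1.0, -1.0, 0.5, -0.5])
print(q)

# Shifting every advantage by a constant leaves the Q-values unchanged.
print(dueling_q(2.0, [11.0, 9.0, 10.5, 9.5]))
```

Without the mean subtraction, a constant could be moved freely between the value and advantage streams without changing the output, which tends to make the two streams harder to learn.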

In [None]:
class DQN(nn.Module):
    """Deep Q-Network for portfolio allocation"""
    def __init__(self, state_dim, n_actions, hidden_dims=(128, 64)):
        super().__init__()
        
        layers = []
        prev_dim = state_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.ReLU(),
                nn.LayerNorm(h_dim)
            ])
            prev_dim = h_dim
        
        # Dueling architecture: separate value and advantage streams
        self.feature_net = nn.Sequential(*layers)
        self.value_stream = nn.Linear(hidden_dims[-1], 1)
        self.advantage_stream = nn.Linear(hidden_dims[-1], n_actions)
    
    def forward(self, x):
        features = self.feature_net(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        
        # Combine: Q(s,a) = V(s) + (A(s,a) - mean(A))
        q_values = value + advantage - advantage.mean(dim=1, keepdim=True)
        return q_values

# Experience replay buffer
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state', 'done'])

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, *args):
        self.buffer.append(Transition(*args))
    
    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        batch = Transition(*zip(*transitions))
        return batch
    
    def __len__(self):
        return len(self.buffer)

# Initialize networks
policy_net = DQN(env.state_dim, env.n_actions).to(device)
target_net = DQN(env.state_dim, env.n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

print(f"DQN parameters: {sum(p.numel() for p in policy_net.parameters()):,}")

## 3. DQN Agent
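
The agent below explores with an exponentially decaying epsilon. The standalone sketch here reproduces that schedule so its shape can be inspected without running the agent; `epsilon_at` is a hypothetical helper name, and the defaults copy the agent's constructor arguments.

```python
import math


def epsilon_at(step, eps_start=1.0, eps_end=0.01, decay=500):
    """Exploration rate after `step` action selections (exponential decay toward eps_end)."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / decay)


for step in (0, 500, 2000, 10000):
    print(f"step {step:>5}: epsilon = {epsilon_at(step):.3f}")
```

With `epsilon_decay=500` and episodes of roughly 1,480 steps (1,500 days of data minus the 20-day lookback), epsilon drops to about 0.06 by the end of the first episode, so most random exploration happens very early in training. A larger decay constant may be worth trying if the agent settles on one action too quickly.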

In [None]:
class DQNAgent:
    def __init__(self, policy_net, target_net, n_actions, 
                 lr=1e-3, gamma=0.99, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=500):
        self.policy_net = policy_net
        self.target_net = target_net
        self.n_actions = n_actions
        
        self.optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
        self.gamma = gamma
        
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.steps_done = 0
        
        self.replay_buffer = ReplayBuffer(10000)
    
    def select_action(self, state, training=True):
        """Epsilon-greedy action selection"""
        epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
                  np.exp(-self.steps_done / self.epsilon_decay)
        if training:
            self.steps_done += 1  # Only training steps advance the decay schedule
        
        if training and random.random() < epsilon:
            return random.randrange(self.n_actions)
        else:
            with torch.no_grad():
                state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(device)
                q_values = self.policy_net(state_tensor)
                return q_values.argmax().item()
    
    def train_step(self, batch_size=64):
        """Single training step"""
        if len(self.replay_buffer) < batch_size:
            return None
        
        batch = self.replay_buffer.sample(batch_size)
        
        states = torch.tensor(np.array(batch.state), dtype=torch.float32).to(device)
        actions = torch.tensor(batch.action, dtype=torch.long).unsqueeze(1).to(device)
        rewards = torch.tensor(batch.reward, dtype=torch.float32).unsqueeze(1).to(device)
        next_states = torch.tensor(np.array(batch.next_state), dtype=torch.float32).to(device)
        dones = torch.tensor(batch.done, dtype=torch.float32).unsqueeze(1).to(device)
        
        # Current Q values
        current_q = self.policy_net(states).gather(1, actions)
        
        # Double DQN: use policy net for action selection, target net for evaluation
        with torch.no_grad():
            next_actions = self.policy_net(next_states).argmax(1, keepdim=True)
            next_q = self.target_net(next_states).gather(1, next_actions)
            target_q = rewards + (1 - dones) * self.gamma * next_q
        
        # Huber loss
        loss = F.smooth_l1_loss(current_q, target_q)
        
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), 1.0)
        self.optimizer.step()
        
        return loss.item()
    
    def update_target(self, tau=0.005):
        """Soft update target network"""
        for target_param, policy_param in zip(self.target_net.parameters(), 
                                              self.policy_net.parameters()):
            target_param.data.copy_(tau * policy_param.data + (1 - tau) * target_param.data)

# Initialize agent
agent = DQNAgent(policy_net, target_net, env.n_actions)
print("Agent initialized")

## 4. Training Loop
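
The loop below soft-updates the target network after every environment step. As a rough sketch of how quickly such a target tracks the policy network, the toy example here applies the same Polyak averaging rule to a single scalar "parameter"; `soft_update` and the starting values are illustrative assumptions.

```python
def soft_update(target, policy, tau=0.005):
    """Polyak averaging: move each target parameter a fraction tau toward the policy."""
    return [tau * p + (1 - tau) * t for t, p in zip(target, policy)]


# Toy setup: target starts at 0 while the policy parameter stays fixed at 1.
target, policy = [0.0], [1.0]
for _ in range(1000):
    target = soft_update(target, policy)
print(f"target after 1000 updates: {target[0]:.4f}")
```

After `n` updates against a fixed policy the remaining gap is `(1 - tau) ** n`, so with `tau=0.005` the target closes about half the gap in roughly 140 steps.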

In [None]:
def train_dqn(env, agent, n_episodes=300):
    """Train DQN agent, soft-updating the target network after every step"""
    episode_rewards = []
    episode_values = []
    losses = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action = agent.select_action(state)
            next_state, reward, done, info = env.step(action)
            
            agent.replay_buffer.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            
            loss = agent.train_step()
            if loss is not None:
                losses.append(loss)
            
            agent.update_target()
        
        episode_rewards.append(total_reward)
        episode_values.append(info['portfolio_value'])
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            avg_value = np.mean(episode_values[-50:])
            print(f"Episode {episode+1}: Avg Reward = {avg_reward:.2f}, "
                  f"Avg Portfolio Value = ${avg_value:.2f}")
    
    return episode_rewards, episode_values, losses

# Train
print("Training DQN agent...")
rewards, values, losses = train_dqn(env, agent, n_episodes=300)

# Plot training progress
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Smoothed rewards
window = 20
smoothed_rewards = pd.Series(rewards).rolling(window).mean()
axes[0].plot(rewards, alpha=0.3, label='Raw')
axes[0].plot(smoothed_rewards, label=f'{window}-episode MA')
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Total Reward')
axes[0].set_title('Episode Rewards')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Portfolio values
smoothed_values = pd.Series(values).rolling(window).mean()
axes[1].plot(values, alpha=0.3)
axes[1].plot(smoothed_values)
axes[1].axhline(y=100, color='gray', linestyle='--', label='Initial')
axes[1].set_xlabel('Episode')
axes[1].set_ylabel('Final Portfolio Value ($)')
axes[1].set_title('Portfolio Performance')
axes[1].grid(True, alpha=0.3)

# Training loss
if losses:
    smoothed_losses = pd.Series(losses).rolling(100).mean()
    axes[2].plot(losses, alpha=0.1)
    axes[2].plot(smoothed_losses)
    axes[2].set_xlabel('Training Step')
    axes[2].set_ylabel('Loss')
    axes[2].set_title('Training Loss')
    axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Evaluate Trained Agent

In [None]:
def evaluate_agent(env, agent, n_episodes=10):
    """Evaluate trained agent without exploration"""
    results = []
    
    for episode in range(n_episodes):
        state = env.reset()
        done = False
        values = [100.0]
        actions_taken = []
        
        while not done:
            action = agent.select_action(state, training=False)
            actions_taken.append(action)
            state, _, done, info = env.step(action)
            values.append(info['portfolio_value'])
        
        results.append({
            'values': values,
            'actions': actions_taken,
            'final_value': values[-1],
            'total_return': (values[-1] / 100 - 1) * 100
        })
    
    return results

# Evaluate
eval_results = evaluate_agent(env, agent, n_episodes=10)

print("\nEvaluation Results:")
print(f"{'Episode':<10} {'Final Value':<15} {'Total Return':<15}")
print("-" * 40)
for i, res in enumerate(eval_results):
    print(f"{i+1:<10} ${res['final_value']:<14.2f} {res['total_return']:.2f}%")

avg_return = np.mean([r['total_return'] for r in eval_results])
print(f"\nAverage Return: {avg_return:.2f}%")

## 6. Compare with Baselines
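
The comparison below reports a Sharpe ratio and a maximum drawdown for each strategy. To show what those two numbers measure, here is a small worked example on a made-up value path, written in plain Python; the helper names are hypothetical and the risk-free rate is assumed to be zero.

```python
import math


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, 1 - v / peak)
    return worst


def annualized_sharpe(values, periods_per_year=252):
    """Mean over standard deviation of simple returns, annualized."""
    rets = [b / a - 1 for a, b in zip(values, values[1:])]
    mean = sum(rets) / len(rets)
    std = math.sqrt(sum((r - mean) ** 2 for r in rets) / len(rets))  # population std, as np.std
    return mean / std * math.sqrt(periods_per_year)


path = [100.0, 110.0, 99.0, 120.0]  # made-up value path
print(f"max drawdown: {max_drawdown(path):.1%}")  # fall from 110 to 99
print(f"Sharpe: {annualized_sharpe(path):.2f}")
```

The drawdown is measured against the running peak (110), not the starting value, which is why the path above shows a 10% drawdown even though it never falls far below 100.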

In [None]:
def run_baseline(env, strategy='equal_weight'):
    """Run baseline strategy"""
    state = env.reset()
    done = False
    values = [100.0]
    
    while not done:
        if strategy == 'equal_weight':
            action = 0
        elif strategy == 'momentum':
            action = 1
        elif strategy == 'low_vol':
            action = 2
        else:
            action = 3  # Conservative
        
        state, _, done, info = env.step(action)
        values.append(info['portfolio_value'])
    
    return values

# Run baselines
equal_weight_values = run_baseline(env, 'equal_weight')
momentum_values = run_baseline(env, 'momentum')
low_vol_values = run_baseline(env, 'low_vol')

# DQN curve: the greedy policy on this deterministic environment produces identical
# evaluation episodes, so the first one is representative
dqn_values = eval_results[0]['values']

# Plot comparison
plt.figure(figsize=(12, 6))
plt.plot(equal_weight_values, label=f'Equal Weight ({(equal_weight_values[-1]/100-1)*100:.1f}%)', alpha=0.8)
plt.plot(momentum_values, label=f'Momentum ({(momentum_values[-1]/100-1)*100:.1f}%)', alpha=0.8)
plt.plot(low_vol_values, label=f'Low Vol ({(low_vol_values[-1]/100-1)*100:.1f}%)', alpha=0.8)
plt.plot(dqn_values, label=f'DQN Agent ({(dqn_values[-1]/100-1)*100:.1f}%)', linewidth=2)

plt.xlabel('Trading Day')
plt.ylabel('Portfolio Value ($)')
plt.title('DQN vs Baseline Strategies')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Performance metrics
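def calc_metrics(values):
    """Return total return (%), annualized Sharpe (zero risk-free rate), and max drawdown (%)"""
    values = np.asarray(values, dtype=float)
    returns = np.diff(values) / values[:-1]
    total_return = (values[-1] / values[0] - 1) * 100
    sharpe = np.mean(returns) / (np.std(returns) + 1e-12) * np.sqrt(252)  # Guard against zero std
    max_dd = np.max(1 - values / np.maximum.accumulate(values)) * 100
    return total_return, sharpe, max_dd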

print("\n" + "="*60)
print("Performance Metrics")
print("="*60)
print(f"{'Strategy':<15} {'Total Return':<15} {'Sharpe':<15} {'Max DD':<15}")
print("-"*60)

for name, vals in [('Equal Weight', equal_weight_values), 
                    ('Momentum', momentum_values),
                    ('Low Vol', low_vol_values),
                    ('DQN Agent', dqn_values)]:
    ret, sharpe, dd = calc_metrics(vals)
    ret_str, dd_str = f"{ret:.2f}%", f"{dd:.2f}%"  # Attach % before padding the column
    print(f"{name:<15} {ret_str:<15} {sharpe:<15.2f} {dd_str:<15}")

## 7. Analyze Agent's Policy

In [None]:
# Analyze action distribution
action_names = ['Equal Weight', 'Momentum', 'Low Vol', 'Conservative']
all_actions = []
for res in eval_results:
    all_actions.extend(res['actions'])

action_counts = np.bincount(all_actions, minlength=4)
action_pcts = action_counts / len(all_actions) * 100

plt.figure(figsize=(10, 5))
plt.bar(action_names, action_pcts, color=['blue', 'green', 'orange', 'red'])
plt.ylabel('Frequency (%)')
plt.title('DQN Agent Action Distribution')
for i, (name, pct) in enumerate(zip(action_names, action_pcts)):
    plt.text(i, pct + 1, f'{pct:.1f}%', ha='center')
plt.grid(True, alpha=0.3, axis='y')
plt.show()

print("\nAction Distribution:")
for name, pct in zip(action_names, action_pcts):
    print(f"  {name}: {pct:.1f}%")

## 8. Q-Value Analysis

In [None]:
# Analyze Q-values across different market states
def analyze_q_values(env, agent, n_samples=100):
    """Collect Q-values for states visited along a random-action trajectory"""
    q_values_list = []
    states_list = []
    
    state = env.reset()
    for _ in range(n_samples):
        action = random.randrange(env.n_actions)
        next_state, _, done, _ = env.step(action)
        
        if done:
            state = env.reset()
            continue
        
        with torch.no_grad():
            state_tensor = torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(device)
            q_values = agent.policy_net(state_tensor).cpu().numpy()[0]
        
        q_values_list.append(q_values)
        states_list.append(state)
        state = next_state
    
    return np.array(q_values_list), np.array(states_list)

q_values, states = analyze_q_values(env, agent, 500)

# Plot Q-value distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Q-value distribution by action
axes[0].boxplot([q_values[:, i] for i in range(4)], labels=action_names)
axes[0].set_ylabel('Q-Value')
axes[0].set_title('Q-Value Distribution by Action')
axes[0].grid(True, alpha=0.3)

# Q-values vs market volatility (using one of the state features)
vol_feature = states[:, env.n_assets:2*env.n_assets].mean(axis=1)  # Avg volatility
best_actions = q_values.argmax(axis=1)

for action in range(4):
    mask = best_actions == action
    if mask.sum() > 0:
        axes[1].scatter(vol_feature[mask], q_values[mask, action], 
                       alpha=0.5, label=action_names[action], s=20)

axes[1].set_xlabel('Average Volatility (normalized)')
axes[1].set_ylabel('Q-Value')
axes[1].set_title('Q-Values vs Market Volatility')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated:

1. **Portfolio Environment**: Custom gym-style environment for trading
2. **DQN with Dueling Architecture**: Separate value and advantage streams
3. **Double DQN**: Reduces overestimation bias in Q-learning
4. **Experience Replay**: Breaks correlations for stable training
5. **Policy Analysis**: Understanding what the agent learned
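
To illustrate the overestimation point in item 3, the sketch below compares a standard DQN target with a Double DQN target on one made-up next state. The function names and Q-value estimates are illustrative assumptions; the Double DQN rule matches the one used in `train_step`.

```python
def vanilla_target(reward, target_q_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(target_q_next)


def double_dqn_target(reward, policy_q_next, target_q_next, gamma=0.99):
    """Double DQN: the policy network selects the action, the target network evaluates it."""
    best_action = max(range(len(policy_q_next)), key=policy_q_next.__getitem__)
    return reward + gamma * target_q_next[best_action]


# Made-up estimates for one non-terminal next state with four actions.
policy_q = [1.0, 1.8, 1.2, 0.9]  # policy net prefers action 1
target_q = [1.1, 1.5, 2.4, 0.8]  # target net has a (possibly noisy) spike on action 2
print(vanilla_target(0.0, target_q))               # bootstraps from the spike: 0.99 * 2.4
print(double_dqn_target(0.0, policy_q, target_q))  # bootstraps from action 1: 0.99 * 1.5
```

Because the Double DQN target evaluates an action chosen by a different network, it can never exceed the max-based target for the same estimates, and a noisy spike in one network is less likely to be carried into the update.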

### Key Insights:
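- DQN can learn an allocation policy over a small set of discrete strategies
- Whether the agent switches strategies with market conditions can be checked from its action distribution and Q-values (sections 7 and 8)
- Transaction costs penalize turnover, which discourages frequent switching between allocation rules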

### Extensions to Try:
- Continuous action spaces with DDPG/SAC
- Add more market features (sentiment, macro indicators)
- Use real historical data
- Implement PPO for more stable learning