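## 5.3 Trading Strategy
In the previous two sections we studied MinuteBarEnv, the most simplified version of the reinforcement learning environment. Next, we look at how to use DQN (Deep Q-Network) to develop a stock trading strategy.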

### 5.3.1. DQN Algorithm Implementation
As analyzed in the earlier chapters, DQN consists of a Worker NN and a Target NN. We start with the implementation of the Worker NN.

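#### 5.3.1.1. $\epsilon$-Greedy Policy
With probability $1-\epsilon$ we take the action chosen by the policy, and with probability $\epsilon$ we take a random action:
$$
a = \begin{cases} \arg\max_{a' \in \mathcal{A}(s)} q_{\pi}(s, a'), & p = 1-\epsilon \\
\text{a random action from } \mathcal{A}(s), & p = \epsilon
\end{cases}
$$
The code implementation is shown below: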
```python
import unittest

import numpy as np

# ActionSelector, ArgmaxActionSelector and AppConfig are defined elsewhere in the project.


class EpsilonGreedyActionSelector(ActionSelector):
    def __init__(self, epsilon=0.05, selector=None):
        self.epsilon = epsilon
        # Fall back to the greedy (argmax) selector when none is supplied
        self.selector = selector if selector is not None else ArgmaxActionSelector()
        print('epsilon: {0}; selector: {1};'.format(self.epsilon, self.selector))

    def __call__(self, scores):
        assert isinstance(scores, np.ndarray)
        # NOTE: hard-coded for this demo run; it overrides the value passed to __init__.
        # Remove this line to honor the configured epsilon.
        self.epsilon = 0.7
        batch_size, n_actions = scores.shape
        print('batch_size={0}; n_actions={1};'.format(batch_size, n_actions))
        # Greedy action for every row of the batch
        actions = self.selector(scores)
        print('actions: {0};'.format(actions))
        # True for the rows that explore, i.e. get a random action instead
        mask = np.random.random(size=batch_size) < self.epsilon
        print('mask: {0};'.format(mask))
        rand_actions = np.random.choice(n_actions, mask.sum())
        print('rand_actions: {0};'.format(rand_actions))
        # Debug print; indexing mask[2] assumes batch_size >= 3
        print('mask type: {0}; {1};'.format(type(mask), mask[2]))
        actions[mask] = rand_actions
        print('final actions: {0};'.format(actions))
        return actions


class TEpsilonGreedyActionSelector(unittest.TestCase):
    def test_usage(self):
        selector = EpsilonGreedyActionSelector(AppConfig.EPS_START)
        # Three samples (rows), five candidate actions (columns)
        scores = np.array([
            [0.3, 0.2, 0.5, 0.1, 0.4],
            [0.11, 0.52, 0.33, 0.65, 0.27],
            [0.98, 0.32, 0.99, 0.15, 0.57]
        ])
        action = selector(scores)
        # One action index per row of scores
        self.assertEqual(action.shape, (3,))
        print('action: {0};'.format(action))
```
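The output of one run is shown below. The log reports `epsilon: 1.0` from the constructor, while the mask is drawn with the hard-coded value 0.7 set inside `__call__`: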
```
ionSelector.test_usage
test_usage (uts.biz.drlt.rll.actions.t_epsilon_greedy_action_selector.TEpsilonGreedyActionSelector) ... epsilon: 1.0; selector: <biz.drlt.rll.actions.ArgmaxActionSelector object at 0x000001CFF8481C70>;
batch_size=3; n_actions=5;
actions: [2 3 2];
mask: [False  True  True];
rand_actions: [3 0];
mask type: <class 'numpy.ndarray'>; True;
final actions: [2 3 0];
action: [2 3 0];
```
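
The selector above can also be condensed into a print-free function, which makes the two limiting cases easy to check. This is a minimal sketch rather than the project's implementation: the function name `epsilon_greedy` and the `rng` parameter are assumptions introduced here for illustration.

```python
import numpy as np

def epsilon_greedy(scores, epsilon, rng=None):
    """Pick one action per row of `scores` (hypothetical helper, not part of the project).

    With probability 1 - epsilon the greedy (argmax) action is kept;
    with probability epsilon it is replaced by a uniformly random action.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch_size, n_actions = scores.shape
    actions = np.argmax(scores, axis=1)        # greedy choice per row
    mask = rng.random(batch_size) < epsilon    # rows that explore
    actions[mask] = rng.integers(0, n_actions, size=int(mask.sum()))
    return actions

scores = np.array([
    [0.3, 0.2, 0.5, 0.1, 0.4],
    [0.11, 0.52, 0.33, 0.65, 0.27],
    [0.98, 0.32, 0.99, 0.15, 0.57],
])
greedy = epsilon_greedy(scores, epsilon=0.0)   # never explores
explore = epsilon_greedy(scores, epsilon=1.0, rng=np.random.default_rng(0))  # always explores
print(greedy)   # [2 3 2], the same greedy actions as the `actions` line in the log
```

With `epsilon=0.0` the mask is always empty, so the result is the pure argmax; with `epsilon=1.0` every row is replaced by a random action index in `[0, n_actions)`.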

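#### 5.3.1.2. Worker NN
The Worker NN produces the Q-function values and is the network that is trained. After every $N$ parameter updates, its parameters are copied to the Target NN. The state-action value function (Q function) can be written as: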
$$
\begin{aligned}
q_{\pi}(s, a_{,1}) &= r_{,1} + v_{\pi}(s_{,1}') \mid S_{t}=s_{,1},\ A_{t}=a_{,1} \\
q_{\pi}(s, a_{,2}) &= r_{,2} + v_{\pi}(s_{,2}') \mid S_{t}=s_{,2},\ A_{t}=a_{,2} \\
&\ \ \vdots \\
q_{\pi}(s, a_{,K}) &= r_{,K} + v_{\pi}(s_{,K}') \mid S_{t}=s_{,K},\ A_{t}=a_{,K}
\end{aligned}
$$
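
To make the two ideas above concrete (the $r + v_{\pi}(s')$ form of the Q values, and copying the Worker NN parameters to the Target NN every $N$ updates), here is a toy sketch. Everything in it is an assumption for illustration: `ToyQNet` is a hypothetical linear stand-in for the real networks, `sync_every` plays the role of $N$, $v_{\pi}(s')$ is approximated greedily as $\max_{a'} q(s', a')$ taken from the target copy, and no discount factor is applied because the equations above do not include one.

```python
import copy
import numpy as np

class ToyQNet:
    """Hypothetical linear Q-function: q(s, .) = s @ W (stand-in for a real NN)."""
    def __init__(self, n_features, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_features, n_actions))

    def q_values(self, states):
        return states @ self.W

def q_targets(target_net, rewards, next_states):
    # q(s, a_k) = r_k + v(s'_k), with v(s') approximated by max_a' q_target(s', a')
    return rewards + target_net.q_values(next_states).max(axis=1)

def train_step(worker, target_net, states, actions, rewards, next_states, lr=0.01):
    """One gradient step on the squared error between q_worker(s, a) and the target."""
    y = q_targets(target_net, rewards, next_states)
    rows = np.arange(len(actions))
    td_error = worker.q_values(states)[rows, actions] - y
    # grad[a] accumulates td_error * s over the samples that took action a
    grad = np.zeros((worker.W.shape[1], worker.W.shape[0]))
    np.add.at(grad, actions, td_error[:, None] * states)
    worker.W -= lr * grad.T / len(actions)

worker = ToyQNet(n_features=4, n_actions=3)
target = copy.deepcopy(worker)      # Target NN starts as a copy of the Worker NN
sync_every = 5                      # plays the role of N in the text
rng = np.random.default_rng(1)      # random data stands in for real transitions
sync_count = 0
for step in range(1, 21):
    s, s_next = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
    a, r = rng.integers(0, 3, size=8), rng.normal(size=8)
    train_step(worker, target, s, a, r, s_next)
    if step % sync_every == 0:
        target.W = worker.W.copy()  # copy Worker parameters to the Target
        sync_count += 1
```

Between syncs only `worker.W` changes, so the targets computed from `target` stay fixed for `sync_every` steps; this is the reason for keeping two sets of parameters.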