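## 5.3 Trading Strategy
In the previous two sections we studied MinuteBarEnv, a maximally simplified reinforcement learning environment. Next, we look at how to use DQN (Deep Q-Network) to develop a stock trading strategy.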

### 5.3.1. DQN Algorithm Implementation
As analyzed in the earlier chapters, DQN consists of a Worker NN and a Target NN. We start with the implementation of the Worker NN.

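#### 5.3.1.1. $\epsilon$-Greedy Policy
With probability $1-\epsilon$ we take the action chosen by the policy, and with probability $\epsilon$ we take a random action:
$$
a = \begin{cases}
\arg\max_{a' \in \mathcal{A}(s)} q_{\pi}(s, a'), & \text{with probability } 1-\epsilon \\
\text{a random action from } \mathcal{A}(s), & \text{with probability } \epsilon
\end{cases}
$$
The implementation is shown below: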

In [None]:
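# Listing 1
import unittest

import numpy as np

# ActionSelector, ArgmaxActionSelector and AppConfig are defined elsewhere in this project.


class EpsilonGreedyActionSelector(ActionSelector):
    def __init__(self, epsilon=0.05, selector=None):
        self.epsilon = epsilon
        # Fall back to greedy (argmax) selection when no base selector is given.
        self.selector = selector if selector is not None else ArgmaxActionSelector()
        print('epsilon: {0}; selector: {1};'.format(self.epsilon, self.selector))

    def __call__(self, scores):
        assert isinstance(scores, np.ndarray)
        # Demo only: force epsilon to 0.7 so that the sample output shows both branches.
        # Remove this line for real training, otherwise it overrides any epsilon schedule.
        self.epsilon = 0.7
        batch_size, n_actions = scores.shape
        print('batch_size={0}; n_actions={1};'.format(batch_size, n_actions))
        # Actions proposed by the policy (greedy by default).
        actions = self.selector(scores)
        print('actions: {0};'.format(actions))
        # True for every sample that should take a random action instead.
        mask = np.random.random(size=batch_size) < self.epsilon
        print('mask: {0};'.format(mask))
        # One uniformly random action for each masked sample.
        rand_actions = np.random.choice(n_actions, int(mask.sum()))
        print('rand_actions: {0};'.format(rand_actions))
        # Debug print; indexing mask[2] assumes batch_size >= 3.
        print('mask type: {0}; {1};'.format(type(mask), mask[2]))
        # Overwrite the policy's action wherever the mask is True.
        actions[mask] = rand_actions
        print('final actions: {0};'.format(actions))
        return actions


class TEpsilonGreedyActionSelector(unittest.TestCase):
    def test_usage(self):
        selector = EpsilonGreedyActionSelector(AppConfig.EPS_START)
        scores = np.array([
            [0.3, 0.2, 0.5, 0.1, 0.4],
            [0.11, 0.52, 0.33, 0.65, 0.27],
            [0.98, 0.32, 0.99, 0.15, 0.57]
        ])
        action = selector(scores)
        print('action: {0};'.format(action))
        # One valid action index per sample.
        self.assertEqual(action.shape, (3,))
        self.assertTrue(np.all((action >= 0) & (action < 5)))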

The output is as follows:

In [None]:
# Listing 2
```
ionSelector.test_usage
test_usage (uts.biz.drlt.rll.actions.t_epsilon_greedy_action_selector.TEpsilonGreedyActionSelector) 
... epsilon: 1.0; selector: <biz.drlt.rll.actions.ArgmaxActionSelector object at 0x000001CFF8481C70>;
batch_size=3; n_actions=5;
actions: [2 3 2];
mask: [False  True  True];
rand_actions: [3 0];
mask type: <class 'numpy.ndarray'>; True;
final actions: [2 3 0];
action: [2 3 0];
```

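In the `__call__` method of the code above (Listing 1), the statement `self.epsilon = 0.7` hard-codes $\epsilon$, so each sample takes a random action with probability 70% and the policy's action with probability 30%. The constructor still prints `epsilon: 1.0`, because the override only happens later, inside `__call__`. In the output (Listing 2), the `actions:` line shows the actions chosen by the policy, `[2 3 2]`. The `mask:` line is drawn with the 70% probability and marks the 2nd and 3rd samples for random actions. The `rand_actions:` line lists the randomly drawn actions for those two samples. The `final actions:` line is the result that is returned: the 1st sample keeps the policy's action, while the 2nd and 3rd samples take the random ones.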
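
The masking step described above can also be tried in isolation. The following is a minimal sketch, not the project's class: the function name `epsilon_greedy`, the explicit `rng` argument, and the seed are assumptions made for illustration, and `np.argmax` stands in for the policy's selector.

```python
import numpy as np


def epsilon_greedy(scores, epsilon, rng):
    """Pick one action per row of `scores` with an epsilon-greedy rule.

    Hypothetical helper for illustration; it is not part of the project code.
    """
    batch_size, n_actions = scores.shape
    # Greedy choice: index of the highest score in each row.
    actions = np.argmax(scores, axis=1)
    # True where a sample should explore instead of exploit.
    mask = rng.random(batch_size) < epsilon
    # Draw one uniformly random action for every masked sample.
    actions[mask] = rng.integers(n_actions, size=int(mask.sum()))
    return actions, mask


scores = np.array([
    [0.3, 0.2, 0.5, 0.1, 0.4],
    [0.11, 0.52, 0.33, 0.65, 0.27],
    [0.98, 0.32, 0.99, 0.15, 0.57],
])
rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

# With epsilon = 0 the mask is always False, so the result is the pure argmax.
greedy, _ = epsilon_greedy(scores, epsilon=0.0, rng=rng)
print(greedy)  # [2 3 2]

# With epsilon = 0.7 some samples are replaced by random actions.
mixed, mask = epsilon_greedy(scores, epsilon=0.7, rng=rng)
print(mask, mixed)
```

Wherever `mask` is `False`, `mixed` agrees with `greedy`; wherever it is `True`, the entry is a uniformly drawn action index, which may happen to coincide with the greedy one.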

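#### 5.3.1.2. $\epsilon$ Decay Schedule
At the start of training the policy performs poorly and we know nothing about the environment, so we rely on randomly chosen actions to explore it. The EpsilonTracker class is designed to meet this need: it starts with a large $\epsilon$ and lowers it linearly as training progresses, down to a fixed floor.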
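
Read off the `frame` method of `EpsilonTracker`, the schedule can be written as a single formula. The symbols $t$ (the frame index) and $N$ (the value of `eps_frames`) are notation introduced here for convenience:

$$
\epsilon_t = \max\left(\epsilon_{\text{final}},\ \epsilon_{\text{start}} - \frac{t}{N}\right)
$$

For example, with $\epsilon_{\text{start}} = 1.0$, $\epsilon_{\text{final}} = 0.05$ and $N = 100$, frame $t = 7$ gives $\epsilon_7 = \max(0.05,\ 1.0 - 0.07) = 0.93$, up to floating-point rounding.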

In [None]:
import unittest
from typing import Union


class EpsilonTracker:
    """
    Updates epsilon according to a linear schedule
    """
    def __init__(self, selector: EpsilonGreedyActionSelector,
                 eps_start: Union[int, float],
                 eps_final: Union[int, float],
                 eps_frames: int):
        self.selector = selector
        self.eps_start = eps_start
        self.eps_final = eps_final
        self.eps_frames = eps_frames
        # Initialise the selector's epsilon to eps_start.
        self.frame(0)

    def frame(self, frame: int):
        # Lower epsilon by 1 / eps_frames per frame, but never below eps_final.
        eps = self.eps_start - frame / self.eps_frames
        self.selector.epsilon = max(self.eps_final, eps)


class TEpsilonTracker(unittest.TestCase):
    def test_exp(self):
        selector = EpsilonGreedyActionSelector()
        et = EpsilonTracker(selector=selector, eps_start=1.0, eps_final=0.05, eps_frames=100)
        for i in range(100):
            et.frame(i)
            print('{0}: epsilon={1};'.format(i, et.selector.epsilon))
        # By the last frame the schedule has reached its floor.
        self.assertEqual(et.selector.epsilon, 0.05)
            
# Output
0: epsilon=1.0;
1: epsilon=0.99;
2: epsilon=0.98;
3: epsilon=0.97;
4: epsilon=0.96;
5: epsilon=0.95;
6: epsilon=0.94;
7: epsilon=0.9299999999999999;
8: epsilon=0.92;
......
95: epsilon=0.050000000000000044;
96: epsilon=0.05;
97: epsilon=0.05;
98: epsilon=0.05;
99: epsilon=0.05;
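
The output reaches the floor at about frame 95 rather than frame 100, because the rule lowers $\epsilon$ by `1 / eps_frames` per frame regardless of how far apart `eps_start` and `eps_final` are. The sketch below restates the same rule as a pure function and computes where the floor is reached; the names `linear_epsilon` and `frames_until_floor` are hypothetical and introduced only for this illustration.

```python
def linear_epsilon(frame, eps_start=1.0, eps_final=0.05, eps_frames=100):
    """Same rule as EpsilonTracker.frame, written as a pure function."""
    return max(eps_final, eps_start - frame / eps_frames)


def frames_until_floor(eps_start, eps_final, eps_frames):
    """Approximate frame index from which epsilon stays at eps_final."""
    return (eps_start - eps_final) * eps_frames


print(linear_epsilon(0))    # 1.0
print(linear_epsilon(50))   # 0.5
print(linear_epsilon(500))  # 0.05

# Close to 95 for the settings used in the test above (floating-point rounding applies).
print(frames_until_floor(1.0, 0.05, 100))
# Close to 45: a smaller starting epsilon reaches the floor sooner with the same eps_frames.
print(frames_until_floor(0.5, 0.05, 100))
```

Under this rule, `eps_frames` is the number of frames needed to lower $\epsilon$ by a full 1.0, not the number of frames needed to get from `eps_start` to `eps_final`.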
