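## 5.3 Trading Strategy
In the previous two sections we studied MinuteBarEnv, the most simplified version of our reinforcement learning environment. We now look at how to use DQN (Deep Q-Network) to develop a stock trading strategy.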

### 5.3.1. DQN Algorithm Implementation
As analyzed in the earlier chapters, DQN consists of a Worker NN and a Target NN. We start with the implementation of the Worker NN.

#### 5.3.1.1. $\epsilon$-Greedy Policy
With probability $1-\epsilon$ we take the action selected by the policy, and with probability $\epsilon$ we take a random action, as shown below:
$$
a = \begin{cases} \arg \max_{a' \in \mathcal{A}(s)} q_{\pi}(s, a'), & \text{with probability } 1-\epsilon \\
\text{a random action in } \mathcal{A}(s), & \text{with probability } \epsilon
\end{cases}
$$
The code is implemented as follows:

In [None]:
# Table 1
class EpsilonGreedyActionSelector(ActionSelector):
    def __init__(self, epsilon=0.05, selector=None):
        self.epsilon = epsilon
        self.selector = selector if selector is not None else ArgmaxActionSelector()
        print('epsilon: {0}; selector: {1};'.format(self.epsilon, self.selector))

    def __call__(self, scores):
        assert isinstance(scores, np.ndarray)
        self.epsilon = 0.7  # demo only: overrides the constructor value and any EpsilonTracker update
        batch_size, n_actions = scores.shape
        print('batch_size={0}; n_actions={1};'.format(batch_size, n_actions))
        actions = self.selector(scores)
        print('actions: {0};'.format(actions))
        mask = np.random.random(size=batch_size) < self.epsilon
        print('mask: {0};'.format(mask))
        rand_actions = np.random.choice(n_actions, sum(mask))
        print('rand_actions: {0};'.format(rand_actions))
        print('mask type: {0}; {1};'.format(type(mask), mask[2]))
        actions[mask] = rand_actions
        print('final actions: {0};'.format(actions))
        return actions
    
class TEpsilonGreedyActionSelector(unittest.TestCase):
    def test_usage(self):
        selector = EpsilonGreedyActionSelector(AppConfig.EPS_START)
        self.assertTrue(1>0)
        scores = np.array([
            [0.3, 0.2, 0.5, 0.1, 0.4],
            [0.11, 0.52, 0.33, 0.65, 0.27],
            [0.98, 0.32, 0.99, 0.15, 0.57]
        ])
        action = selector(scores)
        print('action: {0};'.format(action))    

The output is as follows:

In [None]:
# Table 2
```
ionSelector.test_usage
test_usage (uts.biz.drlt.rll.actions.t_epsilon_greedy_action_selector.TEpsilonGreedyActionSelector) 
... epsilon: 1.0; selector: <biz.drlt.rll.actions.ArgmaxActionSelector object at 0x000001CFF8481C70>;
batch_size=3; n_actions=5;
actions: [2 3 2];
mask: [False  True  True];
rand_actions: [3 0];
mask type: <class 'numpy.ndarray'>; True;
final actions: [2 3 0];
action: [2 3 0];
```

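In line 9 of Table 1 (`self.epsilon = 0.7`) we set a 70% probability of taking a randomly selected action and a 30% probability of taking the action selected by the policy. Because this assignment happens inside `__call__`, it overrides the `epsilon: 1.0` printed by the constructor in Table 2. The actions selected by the policy are shown in line 7 of Table 2 (`actions: [2 3 2]`). Using the 70% probability we generate the mask in line 8 of Table 2, which indicates that samples 2 and 3 use random actions. Line 9 of Table 2 lists the randomly selected actions. In line 11 (`final actions: [2 3 0]`) only sample 1 keeps the action selected by the policy, while samples 2 and 3 take the random actions; for sample 2 the random draw, 3, happens to equal the policy's choice. This is the result we return.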
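
The selection logic can also be checked in isolation. The following is a minimal sketch, not the class above: the function name `epsilon_greedy`, the use of `np.random.default_rng`, and the fixed seed are assumptions made for illustration. It shows that $\epsilon=0$ reduces to a plain argmax and that, over many samples, the share of overridden actions approaches $\epsilon$.

```python
import numpy as np


def epsilon_greedy(scores, epsilon, rng):
    """Pick argmax actions, then replace a random subset with random actions."""
    batch_size, n_actions = scores.shape
    actions = np.argmax(scores, axis=1)
    # Each sample is overridden independently with probability epsilon
    mask = rng.random(batch_size) < epsilon
    actions[mask] = rng.integers(0, n_actions, size=int(mask.sum()))
    return actions, mask


rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
scores = np.array([
    [0.3, 0.2, 0.5, 0.1, 0.4],
    [0.11, 0.52, 0.33, 0.65, 0.27],
    [0.98, 0.32, 0.99, 0.15, 0.57],
])

greedy, _ = epsilon_greedy(scores, epsilon=0.0, rng=rng)
print(greedy)  # epsilon=0 always returns the argmax: [2 3 2]

# Over many samples the override rate should be close to epsilon
big = rng.random((100_000, 5))
_, mask = epsilon_greedy(big, epsilon=0.7, rng=rng)
print(round(float(mask.mean()), 2))
```

Passing the random generator in explicitly keeps the function deterministic under test, which the `np.random` global state used in Table 1 does not.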

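#### 5.3.1.2. $\epsilon$ Decay Schedule
At the start of training the policy performs poorly and we know nothing about the environment, so we rely on randomly selected actions to explore the environment's behavior. As training progresses, $\epsilon$ is reduced linearly from eps_start until it reaches eps_final. The EpsilonTracker class is designed to meet this requirement.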

In [None]:
from typing import Union


class EpsilonTracker:
    """
    Updates epsilon according to linear schedule
    """
    def __init__(self, selector: EpsilonGreedyActionSelector,
                 eps_start: Union[int, float],
                 eps_final: Union[int, float],
                 eps_frames: int):
        self.selector = selector
        self.eps_start = eps_start
        self.eps_final = eps_final
        self.eps_frames = eps_frames
        self.frame(0)

    def frame(self, frame: int):
        eps = self.eps_start - frame / self.eps_frames
        self.selector.epsilon = max(self.eps_final, eps)
        
class TEpsilonTracker(unittest.TestCase):
    def test_exp(self):
        selector = EpsilonGreedyActionSelector()
        et = EpsilonTracker(selector=selector, eps_start=1.0, eps_final=0.05, eps_frames=100)
        for i in range(100):
            et.frame(i)
            print('{0}: epsilon={1};'.format(i, et.selector.epsilon))
            
# Output
0: epsilon=1.0;
1: epsilon=0.99;
2: epsilon=0.98;
3: epsilon=0.97;
4: epsilon=0.96;
5: epsilon=0.95;
6: epsilon=0.94;
7: epsilon=0.9299999999999999;
8: epsilon=0.92;
......
95: epsilon=0.050000000000000044;
96: epsilon=0.05;
97: epsilon=0.05;
98: epsilon=0.05;
99: epsilon=0.05;
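
One detail of this schedule is easy to miss: the slope is fixed at `1 / eps_frames`, so the floor is reached after roughly `(eps_start - eps_final) * eps_frames` frames rather than after `eps_frames` frames. The sketch below restates the same rule as a plain function to make this visible; `linear_epsilon`, `frames_to_floor`, and the tolerance `tol` are hypothetical helpers for illustration, not part of the class above.

```python
def linear_epsilon(frame, eps_start, eps_final, eps_frames):
    """Same rule as EpsilonTracker.frame: subtract frame / eps_frames, then clamp."""
    return max(eps_final, eps_start - frame / eps_frames)


def frames_to_floor(eps_start, eps_final, eps_frames, tol=1e-9):
    """Count frames until epsilon is within tol of eps_final (tol absorbs float error)."""
    frame = 0
    while linear_epsilon(frame, eps_start, eps_final, eps_frames) > eps_final + tol:
        frame += 1
    return frame


print(frames_to_floor(1.0, 0.05, 100))  # 95, matching the output above
print(frames_to_floor(0.5, 0.05, 100))  # 45: a lower eps_start reaches the floor sooner
```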


#### 5.3.1.3. ExperienceSourceFirstLast
This class derives from ExperienceSource. It stores and processes only the first and last states of each trajectory piece. The constructors of the two classes are:

In [None]:
class ExperienceSource:
    """
    Simple n-step experience source using single or multiple environments

    Every experience contains n list of Experience entries
    """
    def __init__(self, env, agent, steps_count=2, steps_delta=1, vectorized=False):
        """
        Create simple experience source
        :param env: environment or list of environments to be used
        :param agent: callable to convert batch of states into actions to take
        :param steps_count: count of steps to track for every experience chain
        :param steps_delta: how many steps to do between experience items
        :param vectorized: support of vectorized envs from OpenAI universe
        """
        assert isinstance(env, (gym.Env, list, tuple))
        assert isinstance(agent, BaseAgent)
        assert isinstance(steps_count, int)
        assert steps_count >= 1
        assert isinstance(vectorized, bool)
        if isinstance(env, (list, tuple)):
            self.pool = env
        else:
            self.pool = [env]
        self.agent = agent
        self.steps_count = steps_count
        self.steps_delta = steps_delta
        self.total_rewards = []
        self.total_steps = []
        self.vectorized = vectorized
        
class ExperienceSourceFirstLast(ExperienceSource):
    """
    This is a wrapper around ExperienceSource to prevent storing full trajectory in replay buffer when we need
    only first and last states. For every trajectory piece it calculates discounted reward and emits only first
    and last states and action taken in the first state.

    If we have partial trajectory at the end of episode, last_state will be None
    """
    def __init__(self, env, agent, gamma, steps_count=1, steps_delta=1, vectorized=False):
        assert isinstance(gamma, float)
        super(ExperienceSourceFirstLast, self).__init__(env, agent, steps_count+1, steps_delta, vectorized=vectorized)
        self.gamma = gamma
        self.steps = steps_count
        
# Calling code
class TExperienceSourceFirstLast(unittest.TestCase):
    def test_exp(self):
        
        #
        device = torch.device("cuda:0")
        year = 2016
        stock_data = BarData.load_year_data(year)
        env = MinuteBarEnv(
                stock_data, bars_count=AppConfig.BARS_COUNT)
        env = gym.wrappers.TimeLimit(env, max_episode_steps=1000)
        net = SimpleFFDQN(env.observation_space.shape[0],
                                env.action_space.n).to(device)
        selector = rll.actions.EpsilonGreedyActionSelector(AppConfig.EPS_START)
        agent = DQNAgent(net, selector, device=device)

        
        exp_source = rll.experience.ExperienceSourceFirstLast(
            env, agent, AppConfig.GAMMA, steps_count=AppConfig.REWARD_STEPS)
        src_itr = iter(exp_source)
        v1 = next(src_itr)
        #print('v1: {0}; {1};'.format(type(v1), v1))
        
# Output
'''
self.pool: <class 'list'>; [<TimeLimit<MinuteBarEnv<StocksEnv-v0>>>];
self.agent: <class 'biz.drlt.rll.agent.DQNAgent'>; <biz.drlt.rll.agent.DQNAgent object at 0x00000138E3622E50>;
self.steps_count: <class 'int'>; 3;
self.steps_delta: <class 'int'>; 1;
self.total_rewards: <class 'list'>; [];
self.total_steps: <class 'list'>; [];
self.vectorized: <class 'bool'>; False;
self.gamma: <class 'float'>; 0.99;
self.steps: <class 'int'>; 2;
'''
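
The output shows that the base class tracks `steps_count + 1 = 3` entries while `self.steps` stays at 2: the extra entry is only needed to supply the state reached after the last rewarded step. The sketch below illustrates how a trajectory piece could be collapsed into a single transition, as the docstring describes. It is an illustration under stated assumptions, not the class's actual code: the `Step` and `FirstLast` containers, their field names, and the `collapse` function are hypothetical.

```python
from collections import namedtuple

# Hypothetical containers for this sketch; the field names are assumptions
Step = namedtuple("Step", ["state", "action", "reward", "done"])
FirstLast = namedtuple("FirstLast", ["state", "action", "reward", "last_state"])


def collapse(piece, gamma, steps):
    """Reduce a trajectory piece of up to steps + 1 entries to one transition."""
    if piece[-1].done and len(piece) <= steps:
        # Partial piece at the end of an episode: no last state is available
        last_state, used = None, piece
    else:
        # The extra final entry only supplies the state reached after `steps` steps
        last_state, used = piece[-1].state, piece[:-1]
    total = 0.0
    for step in reversed(used):
        # Accumulate the discounted reward from the last step backwards
        total = step.reward + gamma * total
    return FirstLast(piece[0].state, piece[0].action, total, last_state)


full = [Step("s0", 1, 1.0, False), Step("s1", 0, 2.0, False), Step("s2", 2, 5.0, False)]
print(collapse(full, gamma=0.5, steps=2))   # reward = 1.0 + 0.5 * 2.0 = 2.0, last_state = 's2'

short = [Step("s0", 1, 1.0, False), Step("s1", 0, 4.0, True)]
print(collapse(short, gamma=0.5, steps=2))  # reward = 1.0 + 0.5 * 4.0 = 3.0, last_state = None
```

Storing only the first state, the action, the accumulated reward, and the last state keeps the replay buffer small while still carrying everything an n-step update needs.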

We first look at the `__iter__` method of the base class of ExperienceSourceFirstLast:

In [None]:
        while True:
            actions = [None] * len(states)
            states_input = []
            states_indices = []
            for idx, state in enumerate(states):
                if state is None:
                    actions[idx] = self.pool[0].action_space.sample()  # assume that all envs are from the same family
                else:
                    states_input.append(state)
                    states_indices.append(idx)
            if states_input:
                states_actions, new_agent_states = self.agent(states_input, agent_states)
                for idx, action in enumerate(states_actions):
                    g_idx = states_indices[idx]
                    actions[g_idx] = action
                    agent_states[g_idx] = new_agent_states[idx]
            grouped_actions = _group_list(actions, env_lens)

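The code can be read as follows:
* Line 2: `actions` is initialized to `[None]`, one slot per environment;
* Line 5: `states` holds one state per environment. With a single environment it contains one element, a vector of 42 values (10 days of trading data plus the position flag and the return), so this loop runs only once;
* Line 6: `state` is not None, so lines 9 and 10 are executed;
* Line 11: `states_input` is not empty, so the body in lines 12$\sim$16 is executed;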
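
The index bookkeeping in this loop is easier to follow with more than one slot. The sketch below reproduces only the routing idea: empty slots get a random action, the remaining states go to the agent in one batch, and the results are written back to their original positions. The function `choose_actions` and the stand-in agent are hypothetical and greatly simplified (agent states and environment grouping are left out).

```python
def choose_actions(states, agent, sample_random_action):
    """Fill one action per slot: random for empty slots, agent output for the rest."""
    actions = [None] * len(states)
    states_input, states_indices = [], []
    for idx, state in enumerate(states):
        if state is None:
            actions[idx] = sample_random_action()
        else:
            states_input.append(state)
            states_indices.append(idx)
    if states_input:
        # One batched agent call, then scatter the results back by original index
        for idx, action in zip(states_indices, agent(states_input)):
            actions[idx] = action
    return actions


# Hypothetical stand-ins: the "agent" returns the length of each state,
# and the random sampler always returns -1 so the routing is easy to see
def toy_agent(batch):
    return [len(s) for s in batch]


print(choose_actions([None, [0.1, 0.2], None, [0.3]], toy_agent, lambda: -1))
# [-1, 2, -1, 1]
```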

We first look at the call to DQNAgent. From the initialization code `agent = DQNAgent(net, selector, device=device)` we have:
* dqn_model is SimpleFFDQN;
* selector is the EpsilonGreedyActionSelector described earlier;
* device is cuda:0;
* preprocessor=None;

Let us look at the call to DQNAgent's `__call__` method made in line 12:

In [None]:
    @torch.no_grad()
    def __call__(self, states, agent_states=None):
        if agent_states is None:
            agent_states = [None] * len(states)
        if self.preprocessor is not None:
            states = self.preprocessor(states)
            if torch.is_tensor(states):
                states = states.to(self.device)
        q_v = self.dqn_model(states)
        q = q_v.data.cpu().numpy()
        actions = self.action_selector(q)
        return actions, agent_states

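The code can be read as follows:
* Lines 3 and 4: since agent_states=[None] is not None, the assignment is not executed;
* Lines 5$\sim$8: since self.preprocessor=None, these lines are not executed;
* Line 9: computes the values $q(s,a)$ (open question: how SimpleFFDQN computes these values is not yet understood);
* Line 11: following the EpsilonGreedyActionSelector policy, either a random action is taken with probability $\epsilon$ or the action with the largest $q(s,a)$ is taken;

Note: agent_states does not appear to be used here.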