<table>
    <tr>
        <td>
            <img src='./text_images/nvidia.png' width="200" height="450">
        </td>
        <td> & </td>
        <td>
            <img src='./text_images/udacity.png' width="350" height="450">
        </td>
    </tr>
</table>

# Deep Reinforcement Learning for Optimal Execution of Portfolio Transactions


# Introduction

This notebook demonstrates how to use Deep Reinforcement Learning (DRL) to optimize the execution of large portfolio transactions. We begin with a brief review of reinforcement learning and actor-critic methods. Then, you will use an actor-critic method to generate optimal trading strategies that maximize profit when liquidating a block of shares.

# Actor-Critic Methods

In reinforcement learning, an agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that maximizes its expected long-term rewards.

<br>
<figure>
  <img src = "./text_images/RL.png" width = 80% style = "border: thin silver solid; padding: 10px">
      <figcaption style = "text-align: center; font-style: italic">Fig 1. - Reinforcement Learning.</figcaption>
</figure> 
<br>

There are several types of RL algorithms, and they can be divided into three groups:

- **Critic-Only**: Critic-Only methods, also known as Value-Based methods, first find the optimal value function and then derive an optimal policy from it.


- **Actor-Only**: Actor-Only methods, also known as Policy-Based methods, search directly for the optimal policy in policy space. This is typically done by using a parameterized family of policies to which optimization procedures can be applied directly.


- **Actor-Critic**: Actor-Critic methods combine the advantages of actor-only and critic-only methods. Here, the critic learns the value function and uses it to determine how the actor's policy parameters should be changed. The actor brings the advantage of computing continuous actions without the need for optimization procedures on a value function, while the critic supplies the actor with knowledge of its performance. Actor-critic methods usually have good convergence properties, in contrast to critic-only methods. The **Deep Deterministic Policy Gradients (DDPG)** algorithm is one example of an actor-critic method.

<br>
<figure>
  <img src = "./text_images/Actor-Critic.png" width = 80% style = "border: thin silver solid; padding: 10px">
      <figcaption style = "text-align: center; font-style: italic">Fig 2. - Actor-Critic Reinforcement Learning.</figcaption>
</figure> 
<br>

In this notebook, we will use DDPG to determine the optimal execution of portfolio transactions. In other words, we will use the DDPG algorithm to solve the optimal liquidation problem. Before we can apply DDPG, however, we first need to formulate the optimal liquidation problem so that it can be solved using reinforcement learning. The next section shows how to do this.

# Modeling Optimal Execution as a Reinforcement Learning Problem

As we learned in the previous lessons, the optimal liquidation problem is a minimization problem, *i.e.* we need to find the trading list that minimizes the implementation shortfall. In order to solve this problem through reinforcement learning, we need to restate the optimal liquidation problem in terms of **States**, **Actions**, and **Rewards**. Let's start by defining our States.

### States

The optimal liquidation problem entails that we sell all our shares within a given time frame. Therefore, our state vector must contain some information about the time remaining or, equivalently, the number of trades remaining. We will use the latter, together with the following features, to define the state vector at time $t_k$:


$$
[r_{k-5},\, r_{k-4},\, r_{k-3},\, r_{k-2},\, r_{k-1},\, r_{k},\, m_{k},\, i_{k}]
$$

where:

- $r_{k} = \log\left(\frac{\tilde{S}_k}{\tilde{S}_{k-1}}\right)$ is the log-return at time $t_k$.


- $m_{k} = \frac{N_k}{N}$ is the number of trades remaining at time $t_k$, normalized by the total number of trades.


- $i_{k} = \frac{x_k}{X}$ is the number of shares remaining at time $t_k$, normalized by the total number of shares.


The log-returns capture information about stock prices before time $t_k$, which can be used to detect possible price trends. The number of trades and shares remaining allow the agent to learn to sell all the shares within the given time frame. Note that in real-world trading scenarios, this state vector can hold many more variables.
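
To make the state definition concrete, here is a minimal sketch of how such a vector could be assembled from a price history. The function name `build_state` and its arguments are hypothetical and are not taken from the **syntheticChrissAlmgren.py** module; the sample prices are made up.

```python
import math

def build_state(prices, trades_remaining, total_trades, shares_remaining, total_shares):
    """Return [r_{k-5}, ..., r_k, m_k, i_k] from the most recent prices.

    Hypothetical helper: the notebook's environment may build its state differently.
    """
    if len(prices) < 7:
        raise ValueError("need at least 7 prices to form 6 log-returns")
    recent = prices[-7:]
    # Six log-returns r_j = log(S_j / S_{j-1}) over the last seven prices
    log_returns = [math.log(recent[j] / recent[j - 1]) for j in range(1, 7)]
    m_k = trades_remaining / total_trades    # normalized trades remaining
    i_k = shares_remaining / total_shares    # normalized shares remaining
    return log_returns + [m_k, i_k]

# Made-up price history, 45 of 60 trades left, 750,000 of 1,000,000 shares left
state = build_state([50.0, 50.1, 49.9, 50.0, 50.2, 50.1, 50.3], 45, 60, 750_000, 1_000_000)
print(len(state), state[-2:])
```

The first six entries are the log-returns and the last two are the normalized counters, so the vector has eight components regardless of the scale of the portfolio.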

### Actions

Since the optimal liquidation problem only requires us to sell stocks, it is reasonable to define the action $a_k$ to be the number of shares to sell at time $t_{k}$. However, if we start with millions of shares, interpreting the action directly as the number of shares to sell at each time step can lead to convergence problems, because the agent will need to produce actions with very high values. Instead, we will interpret the action $a_k$ as a **percentage**. In this case, the actions produced by the agent only need to be between 0 and 1. Using this interpretation, we can determine the number of shares to sell at each time step using:


$$
n_k = a_k \times x_k
$$

where $x_k$ is the number of shares remaining at time $t_k$.
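
As a small illustration of this rule, the sketch below converts a sequence of fractional actions into share counts. The helper `shares_to_sell` is hypothetical, and clamping the action to $[0, 1]$ is an assumption made here for safety rather than something the notebook specifies.

```python
def shares_to_sell(action, shares_remaining):
    """Compute n_k = a_k * x_k, with a_k clamped to [0, 1] (assumed safeguard)."""
    fraction = min(max(float(action), 0.0), 1.0)
    return fraction * shares_remaining

remaining = 1_000_000
sold_per_step = []
for a_k in (0.10, 0.10, 0.50):   # made-up actions
    n_k = shares_to_sell(a_k, remaining)
    sold_per_step.append(n_k)
    remaining -= n_k
    print(f"a_k={a_k:.2f}  n_k={n_k:,.0f}  remaining={remaining:,.0f}")
```

Because each action is applied to the shares still held, the same action value sells fewer shares as the inventory shrinks.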

### Rewards

Defining the rewards is trickier than defining states and actions, since the original problem is a minimization problem. One option is to use the difference between two consecutive utility functions. Remember that the utility function is given by:

$$
U(x) = E(x) + \lambda V(x)
$$

After each time step, we compute the utility using the equations for $E(x)$ and $V(x)$ from the Almgren and Chriss model for the remaining time and inventory, while holding the parameter $\lambda$ constant. Denoting the optimal trading trajectory computed at time $t$ as $x^*_t$, we define the reward as:

$$
R_{t} = {{U_t(x^*_t) - U_{t+1}(x^*_{t+1})}\over{U_t(x^*_t)}}
$$

where we have normalized the difference to make the actor-critic model easier to train.
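
The reward formula can be sketched in a few lines. The functions below are hypothetical helpers: they take $E$ and $V$ as given inputs rather than computing them from the Almgren and Chriss equations, and the numbers in the example are invented purely to show the arithmetic.

```python
def utility(expected_shortfall, variance, risk_aversion):
    """U = E + lambda * V for a given trajectory."""
    return expected_shortfall + risk_aversion * variance

def normalized_reward(utility_now, utility_next):
    """R_t = (U_t - U_{t+1}) / U_t."""
    return (utility_now - utility_next) / utility_now

# Invented values of E and V at two consecutive steps, with lambda = 1e-6
u_t = utility(1.0e6, 2.0e11, 1e-6)      # 1.0e6 + 0.2e6
u_next = utility(0.9e6, 1.8e11, 1e-6)   # 0.9e6 + 0.18e6
reward = normalized_reward(u_t, u_next)
print(f"U_t={u_t:,.0f}  U_t+1={u_next:,.0f}  reward={reward:.3f}")
```

A positive reward means the utility, which we want to minimize, decreased between the two steps; dividing by $U_t$ keeps the reward on a scale that does not depend on the size of the portfolio.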

# Simulation Environment

In order to train our DDPG algorithm we will use a very simple simulated trading environment. This environment simulates stock prices that follow a discrete arithmetic random walk, with permanent and temporary market impact functions that are linear in the rate of trading, just as in the Almgren and Chriss model. This simple trading environment serves as a starting point for creating more complex trading environments. You are encouraged to extend it by adding more complexity to simulate real-world trading dynamics, such as order books, network latencies, and trading fees.
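
To give a feel for these dynamics, here is a minimal sketch of one step of an arithmetic random walk with linear permanent and temporary impact, following the standard Almgren and Chriss formulation. It is not the code in **syntheticChrissAlmgren.py**; the function `simulate_step` is hypothetical, $\sigma$ is assumed to be an absolute (dollar) daily volatility, and the parameter values mirror the defaults described in this section.

```python
import random

def simulate_step(price, shares_sold, sigma, tau, gamma, eta, epsilon, rng):
    """Advance the price by one step and return (next_price, execution_price)."""
    rate = shares_sold / tau                       # trading rate v = n_k / tau
    sign = (shares_sold > 0) - (shares_sold < 0)   # sgn(n_k)
    # Temporary impact only affects the price received for this trade
    temporary_impact = epsilon * sign + eta * rate
    execution_price = price - temporary_impact
    # Permanent impact shifts the price path itself
    noise = sigma * (tau ** 0.5) * rng.gauss(0.0, 1.0)
    next_price = price + noise - tau * gamma * rate
    return next_price, execution_price

# Illustrative run: sell equal slices of one million shares over 60 days
rng = random.Random(0)
S0, X, N, tau = 50.0, 1_000_000, 60, 1.0
price, proceeds = S0, 0.0
for _ in range(N):
    n_k = X / N
    price, exec_price = simulate_step(price, n_k, sigma=0.3795, tau=tau,
                                      gamma=2.5e-7, eta=2.5e-6,
                                      epsilon=1 / 16, rng=rng)
    proceeds += n_k * exec_price

# Implementation shortfall: initial portfolio value minus what was captured
shortfall = X * S0 - proceeds
print(f"Implementation shortfall: ${shortfall:,.2f}")
```

Setting `sigma=0.0` removes the random term, which makes it easy to check the two impact terms by hand.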

The simulated environment is contained in the **syntheticChrissAlmgren.py** module. You are encouraged to take a look at it and modify its parameters as you wish. Let's take a look at the default parameters of our simulation environment. We have set the initial stock price to $S_0 = 50$ and the total number of shares to sell to one million. This gives an initial portfolio value of $\$50$ million. We have also set the trader's risk aversion to $\lambda = 10^{-6}$.

The stock price will have 12% annual volatility, a [bid-ask spread](https://www.investopedia.com/terms/b/bid-askspread.asp) of 1/8, and an average daily trading volume of 5 million shares. Assuming there are 250 trading days in a year, this gives a daily volatility in stock price of $0.12 / \sqrt{250} \approx 0.8\%$. We will use a liquidation time of $T = 60$ days and set the number of trades to $N = 60$. This means that $\tau=\frac{T}{N} = 1$, so we will be making one trade per day.

For the temporary cost function we will set the fixed cost of selling to be 1/2 of the bid-ask spread, $\epsilon = 1/16$. We will set $\eta$ such that for each one percent of the daily volume we trade, we incur a price impact equal to the bid-ask spread. For example, trading at a rate of $5\%$ of the daily trading volume incurs a one-time cost on each trade of 5/8. Under this assumption we have $\eta =(1/8)/(0.01 \times 5 \times 10^6) = 2.5 \times 10^{-6}$.

For the permanent costs, a common rule of thumb is that price effects become significant when we sell $10\%$ of the daily volume. If we suppose that "significant" means that the price depression is one bid-ask spread, and that the effect is linear for smaller and larger trading rates, then we have $\gamma = (1/8)/(0.1 \times 5 \times 10^6) = 2.5 \times 10^{-7}$.
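
The arithmetic in the last three paragraphs can be reproduced with a short script. The variable names are chosen here for illustration and are not the ones used in **utils.py** or **syntheticChrissAlmgren.py**. The formula for the single-step variance is an assumption (dollar daily volatility squared, times $\tau$); it is included because it reproduces the value 0.144 reported for the default parameters.

```python
import math

annual_volatility = 0.12
trading_days = 250
bid_ask_spread = 1 / 8
daily_volume = 5e6
starting_price = 50.0
liquidation_days = 60
num_trades = 60

daily_volatility = annual_volatility / math.sqrt(trading_days)   # about 0.8%
tau = liquidation_days / num_trades                              # one trade per day
epsilon = bid_ask_spread / 2                                     # fixed cost of selling
eta = bid_ask_spread / (0.01 * daily_volume)                     # temporary impact
gamma = bid_ask_spread / (0.1 * daily_volume)                    # permanent impact
# Assumed definition: variance of the dollar price change over one step
single_step_variance = (daily_volatility * starting_price) ** 2 * tau

print(f"daily volatility     = {daily_volatility:.4%}")
print(f"tau                  = {tau}")
print(f"epsilon              = {epsilon}")
print(f"eta                  = {eta:.2e}")
print(f"gamma                = {gamma:.2e}")
print(f"single-step variance = {single_step_variance:.3f}")
```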

The tables below summarize the default parameters of the simulation environment.

In [1]:
import utils

# Get the default financial and AC Model parameters
financial_params, ac_params = utils.get_env_param()

In [2]:
financial_params
# Display the financial parameters: annual volatility, bid-ask spread,
# daily volatility, and daily trading volume

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Annual Volatility | 12% | Bid-Ask Spread | 0.125 |
| Daily Volatility | 0.8% | Daily Trading Volume | 5000000.0 |


In [3]:
ac_params
# Display the AC model parameters: total number of shares to sell, fixed cost
# of selling per share, starting price per share, trader's risk aversion,
# price impact for each 1% of daily volume traded, permanent impact constant,
# number of days to sell all the shares, single-step variance, number of
# trades, and time interval between trades

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Total Number of Shares to Sell | 1000000 | Fixed Cost of Selling per Share | \$0.062 |
| Starting Price per Share | \$50.00 | Trader's Risk Aversion | 1e-06 |
| Price Impact for Each 1% of Daily Volume Traded | \$2.5e-06 | Permanent Impact Constant | 2.5e-07 |
| Number of Days to Sell All the Shares | 60 | Single Step Variance | 0.144 |
| Number of Trades | 60 | Time Interval between trades | 1.0 |


# Reinforcement Learning

In the code below we use DDPG to find a policy that generates optimal trading trajectories, i.e. trajectories that minimize implementation shortfall and can be benchmarked against the Almgren and Chriss model. We will implement a typical reinforcement learning workflow to train the actor and critic using the simulation environment. We feed the states observed from our simulator to an agent. The agent first predicts an action using the actor model and performs the action in the environment. Then, the environment returns the reward and the new state. This process continues for the given number of episodes. To get accurate results, you should run the code for at least 10,000 episodes.

In [None]:
import numpy as np

import syntheticChrissAlmgren as sca
from ddpg_agent import Agent

from collections import deque

# Create simulation environment
env = sca.MarketEnvironment()

# Initialize feed-forward DNNs for the Actor and Critic models
agent = Agent(state_size=env.observation_space_dimension(), action_size=env.action_space_dimension(), random_seed=0)

# Set the liquidation time (in days)
lqt = 60

# Set the number of trades
n_trades = 60

# Set trader's risk aversion
tr = 1e-6

# Set the number of episodes to run the simulation
episodes = 10000

# Shortfall of every episode, plus a rolling window of the last 100 episodes
shortfall_hist = np.array([])
shortfall_deque = deque(maxlen=100)

for episode in range(episodes):
    # Reset the environment
    cur_state = env.reset(seed=episode, liquid_time=lqt, num_trades=n_trades, lamb=tr)

    # Set the environment to make transactions
    env.start_transactions()

    for i in range(n_trades + 1):

        # Predict the best action for the current state (with exploration noise)
        action = agent.act(cur_state, add_noise=True)

        # Perform the action and receive the new state, reward, and info
        new_state, reward, done, info = env.step(action)

        # Store current state, action, reward, and new state in the experience replay
        agent.step(cur_state, action, reward, new_state, done)

        # Roll over to the new state
        cur_state = new_state

        # Record the implementation shortfall once the episode has finished
        if info.done:
            shortfall_hist = np.append(shortfall_hist, info.implementation_shortfall)
            shortfall_deque.append(info.implementation_shortfall)
            break

    # Print the average shortfall over the last 100 episodes
    if (episode + 1) % 100 == 0:
        print('\rEpisode [{}/{}]\tAverage Shortfall: ${:,.2f}'.format(episode + 1, episodes, np.mean(shortfall_deque)))

print('\nAverage Implementation Shortfall: ${:,.2f} \n'.format(np.mean(shortfall_hist)))

Episode [100/10000]	Average Shortfall: $2,276,780.07
Episode [200/10000]	Average Shortfall: $2,562,254.63
Episode [300/10000]	Average Shortfall: $2,562,500.00
Episode [400/10000]	Average Shortfall: $2,562,500.00
Episode [500/10000]	Average Shortfall: $2,562,500.00
Episode [600/10000]	Average Shortfall: $2,562,500.00
Episode [700/10000]	Average Shortfall: $2,562,500.00
Episode [800/10000]	Average Shortfall: $2,562,500.00
Episode [900/10000]	Average Shortfall: $2,562,500.00
Episode [1000/10000]	Average Shortfall: $2,562,500.00
Episode [1100/10000]	Average Shortfall: $2,562,500.00
Episode [1200/10000]	Average Shortfall: $2,562,500.00
Episode [1300/10000]	Average Shortfall: $2,562,500.00
Episode [1400/10000]	Average Shortfall: $2,562,500.00
Episode [1500/10000]	Average Shortfall: $2,562,500.00
Episode [1600/10000]	Average Shortfall: $2,562,500.00
Episode [1700/10000]	Average Shortfall: $2,562,500.00
Episode [1800/10000]	Average Shortfall: $2,562,500.00
Episode [1900/10000]	Average Shortfal

# Todo

The above code should provide you with a starting framework for incorporating more complex dynamics into our model. Here are a few things you can try out:

- Incorporate your own reward function in the simulation environment to see if you can achieve an expected shortfall that is better (lower) than that produced by the Almgren and Chriss model.


- Experiment with rewarding the agent at every step versus only giving a reward at the end.


- Use more realistic price dynamics, such as geometric Brownian motion (GBM). The equations used to model GBM can be found in section 3b of this [paper](https://ro.uow.edu.au/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1705&context=aabfj).


- Try different functions for the action. You can change the values of the actions produced by the agent by using different functions. You can choose your function depending on the interpretation you give to the action. For example, you could set the action to be a function of the trading rate.


- Add more complex dynamics to the environment. Try incorporating trading fees, for example. This can be done by adding an extra term to the fixed cost of selling, $\epsilon$.
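
As a possible starting point for the geometric Brownian motion suggestion in the list above, here is a minimal sketch of a GBM price path. It uses the standard exact log-normal discretization, which may differ in notation from the equations in the linked paper. The functions `gbm_step` and `gbm_path` are hypothetical, the drift is assumed to be zero, and $\sigma$ is a relative (not dollar) daily volatility. Market impact is not included.

```python
import math
import random

def gbm_step(price, mu, sigma, tau, rng):
    """One exact GBM step: S_{k+1} = S_k * exp((mu - sigma^2 / 2) * tau + sigma * sqrt(tau) * z)."""
    z = rng.gauss(0.0, 1.0)
    return price * math.exp((mu - 0.5 * sigma ** 2) * tau + sigma * math.sqrt(tau) * z)

def gbm_path(start_price, mu, sigma, tau, steps, seed=0):
    """Simulate a price path of `steps` GBM steps starting from `start_price`."""
    rng = random.Random(seed)
    path = [start_price]
    for _ in range(steps):
        path.append(gbm_step(path[-1], mu, sigma, tau, rng))
    return path

# 60 daily steps with 12% annual volatility and zero drift (assumed)
daily_sigma = 0.12 / math.sqrt(250)
path = gbm_path(50.0, mu=0.0, sigma=daily_sigma, tau=1.0, steps=60)
print(f"start={path[0]:.2f}  end={path[-1]:.2f}  min={min(path):.2f}  max={max(path):.2f}")
```

Unlike the arithmetic random walk, prices generated this way can never become negative, and the size of each move scales with the current price level.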