## Basic Market Microstructure

Trading in cryptocurrency markets (and most financial markets) happens in what's called a continuous double auction with an open order book on an exchange. That's just a fancy way of saying that there are buyers and sellers that get matched so that they can trade with each other. The exchange is responsible for the matching. There are dozens of exchanges, and each may carry slightly different products (such as Bitcoin or Ethereum versus the U.S. Dollar). Interface-wise, and in terms of the data they provide, they all look pretty much the same.



### Price chart


The current price is the price of the most recent trade. It varies depending on whether that trade was a buy or a sell. The price chart is typically displayed as a candlestick chart that shows the Open/Start (O), High (H), Low (L), and Close/End (C) prices for a given time window. The bars below the price chart show the Volume (V), which is the total volume of all trades that happened in that period. The volume is important because it gives you a sense of the *liquidity* of the market. If you want to buy \$100,000 worth of Bitcoin, but there is nobody willing to sell, the market is *illiquid*. You simply can't buy. A high trade volume indicates that many people are willing to transact, which means that you are likely to be able to buy or sell when you want to do so. Generally speaking, the more money you want to invest, the more trade volume you want. Volume also indicates the "quality" of a price trend. High volume means you can rely on the price movement more than if there was low volume. High volume often (but not always, as in the case of market manipulation) reflects the consensus of a large number of market participants.



### Trade History (Right)

The right side shows a history of all recent trades. Each trade has a size, price, timestamp, and direction (buy or sell). **A trade is a match between two parties, a taker and a maker.**



### Order Book (Left)

* __Asks__: People willing to sell.
* __Bids__: People willing to buy.
* Best ask: the lowest price someone is willing to sell at.
* Best bid: the highest price someone is willing to buy at.

The left side shows the order book, which contains information about who is willing to buy and sell at what price. The order book is made up of two sides: Asks (also called offers) and Bids. *Asks* are people willing to sell, and *bids* are people willing to buy. By definition, the **best ask**, the lowest price that someone is willing to sell at, is larger than the **best bid**, the highest price that someone is willing to buy at. If this were not the case, a trade between these two parties would have already happened. **The difference between the best ask and the best bid is called the spread**.
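As a small illustration of these definitions, the sketch below computes the best ask, best bid, and spread from a handful of levels. The prices and sizes are made up for the example and do not come from a real book.

```python
from decimal import Decimal

# Hypothetical (price, size) levels; the numbers are invented for illustration.
asks = [(Decimal("10000.03"), Decimal("1.0")), (Decimal("10000.04"), Decimal("2.0"))]
bids = [(Decimal("10000.00"), Decimal("3.0")), (Decimal("9999.50"), Decimal("0.5"))]

best_ask = min(price for price, _ in asks)  # lowest price a seller accepts
best_bid = max(price for price, _ in bids)  # highest price a buyer offers
spread = best_ask - best_bid

print(best_ask, best_bid, spread)  # 10000.03 10000.00 0.03
```

`Decimal` is used instead of `float` so that prices such as 10000.03 are represented exactly.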


Each level of the order book has a price and a volume. For example, a volume of 2.0 at a price level of \$10,000 means that you can buy up to 2 BTC at a price of \$10,000 per BTC. If you want to buy more, you need to pay a higher price for the amount that exceeds 2 BTC. The volume at each level is aggregated, which means that you don't know how many people, or orders, those 2 BTC consist of. There could be one person selling 2 BTC, or there could be 100 people selling 0.02 BTC each (some exchanges provide this order-level information, but most don't).

<img src ="stock.png" >

#### Example

So what happens when you send an order to buy 3 BTC? You would be buying (rounded) 0.08 BTC at \$12,551.00, 0.01 BTC at \$12,551.60, and 2.91 BTC at \$12,552.00. On GDAX, you would also be paying a 0.3\% taker fee, for a total of about $1.003 \times (0.08 \times 12551 + 0.01 \times 12551.6 + 2.91 \times 12552) = \$37,768.88$ and an average price per BTC of $37768.88 / 3 = \$12,589.63$. It's important to note that what you are actually paying is much higher than \$12,551.00, which was the current price! The 0.3\% fee is extremely high.
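The arithmetic above can be reproduced with a minimal sketch that walks the ask side of the book from the best price upward. The function name `market_buy` and the 5.00 BTC size at the \$12,552.00 level are assumptions for illustration; the text only tells us that at least 2.91 BTC is available there.

```python
from decimal import Decimal

def market_buy(asks, quantity, taker_fee):
    """Walk the ask side from the best price upward until `quantity` is filled.

    asks: list of (price, size) tuples sorted by ascending price.
    Returns (fills, total cost including fee, average price per unit).
    """
    remaining = quantity
    fills = []
    notional = Decimal("0")
    for price, size in asks:
        if remaining <= 0:
            break
        take = min(size, remaining)
        fills.append((price, take))
        notional += price * take
        remaining -= take
    if remaining > 0:
        raise ValueError("not enough liquidity in the book to fill the order")
    total = notional * (1 + taker_fee)
    return fills, total, total / quantity

asks = [
    (Decimal("12551.00"), Decimal("0.08")),
    (Decimal("12551.60"), Decimal("0.01")),
    (Decimal("12552.00"), Decimal("5.00")),  # assumed size; only 2.91 is needed
]
fills, total, avg_price = market_buy(asks, Decimal("3"), Decimal("0.003"))
print(round(total, 2), round(avg_price, 2))  # 37768.88 12589.63
```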

Also note that your buy order has consumed all the volume that was available at the \$12,551.00 and \$12,551.60 levels. Thus, the order book will "move up", and the best ask will become \$12,552.00. The current price will also become \$12,552.00, because that is where the last trade happened. Selling works analogously, just that you are now operating on the bid side of the order book, and potentially moving the order book (and price) down. In other words, by placing buy and sell orders, you are removing volume from the order book. If your orders are large enough, you may shift the order book by several levels. In fact, if you placed a very large order for a few million dollars, you would shift the order book and price significantly.



How do orders get into the order book? That's the difference between market and limit orders. In the above example, you've issued a market order, which basically means "Buy/Sell X amount of BTC at the best price possible right now". If you are not careful about what's in the order book, you could end up paying significantly more than the current price shows. For example, imagine that most of the lower levels in the order book only had a volume of 0.001 BTC available. Most of your buy volume would then get matched at a much higher, more expensive, price level. If you submit a *limit* order, also called a passive order, you specify the quantity and price you're willing to buy or sell at. The order will be placed into the book, and you can cancel it as long as it has not been matched. For example, let's assume the Bitcoin price is at \$10,000, but you want to sell at \$10,010. You may place a limit order. At first, nothing happens. If the price keeps moving down, your order will just sit there, do nothing, and will never be matched. You can cancel it anytime. **However, if the price moves up, your order will at some point become the best price in the book, and the next person submitting a market order for a sufficient quantity will match it.**



Market orders take liquidity from the market. By matching with orders from the order book, you are taking away the option to trade from other people: there's less volume left! That's also why market orders, or market takers, often need to pay higher fees than market makers, who put orders into the book. Limit orders provide liquidity because they give others the option to trade. At the same time, limit orders guarantee that you will not pay more than the price specified in the limit order. However, you don't know when, or if, someone will match your order. You are also giving the market information about what you believe the price should be. This can also be used to manipulate the other participants in the market, who may act a certain way based on the orders you are executing or putting into the book. Because they provide the option to trade and give away information, market makers typically pay lower fees than market takers. Some exchanges also provide stop orders, which are only submitted once the price reaches a trigger level that you specify.

# Data

The main reason I am using cryptocurrencies in this post is that the data is public, free, and easy to obtain. Most exchanges have streaming APIs that allow you to receive market updates in real time. We'll use GDAX as an example again, but the data for other exchanges looks very similar. Let's go over the basic types of events you would use to build a Machine Learning model.


## Trade 

A new trade has happened. Each trade has a timestamp, a unique ID assigned by the exchange, a price, a size, and a side, as discussed above. If you wanted to plot the price graph of an asset, you would simply plot the price of all trades. If you wanted to plot the candlestick chart, you would window the trade events for a certain period, such as five minutes, and then plot the windows.

```json
{
    "time": "2014-11-07T22:19:28.578544Z",
    "trade_id": 74,
    "price": "10.00000000",
    "size": "0.01000000",
    "side": "buy"
}
```
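The windowing described above can be sketched as follows. The helper names `parse_time` and `to_candles` are hypothetical, the second and third trades are made up, and the timestamp format is assumed to match the sample event exactly.

```python
from datetime import datetime, timezone

def parse_time(value):
    # Timestamps are assumed to look like "2014-11-07T22:19:28.578544Z" (UTC).
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def to_candles(trades, window_seconds=300):
    """Aggregate trade events into OHLCV candles keyed by window start (epoch seconds)."""
    candles = {}
    for trade in sorted(trades, key=lambda t: parse_time(t["time"])):
        ts = int(parse_time(trade["time"]).timestamp())
        start = ts - ts % window_seconds  # start of the window this trade falls into
        price, size = float(trade["price"]), float(trade["size"])
        candle = candles.get(start)
        if candle is None:
            # First trade in the window sets open, high, low, and close.
            candles[start] = {"open": price, "high": price, "low": price,
                              "close": price, "volume": size}
        else:
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # the latest trade so far is the close
            candle["volume"] += size
    return candles

trades = [  # first entry is the sample event; the other two are invented
    {"time": "2014-11-07T22:19:28.578544Z", "trade_id": 74, "price": "10.00000000", "size": "0.01000000", "side": "buy"},
    {"time": "2014-11-07T22:19:45.000000Z", "trade_id": 75, "price": "10.50000000", "size": "0.02000000", "side": "sell"},
    {"time": "2014-11-07T22:21:10.000000Z", "trade_id": 76, "price": "10.20000000", "size": "0.03000000", "side": "buy"},
]
candles = to_candles(trades)
for start in sorted(candles):
    print(start, candles[start])
```

With five-minute windows, the first two trades fall into one candle and the third trade opens the next one.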

## BookUpdate 

One or more levels in the order book were updated. Each level is made up of the side (Buy = Bid, Sell = Ask), the price/level, and the new quantity at that level. Note that these are changes, or deltas, and you must construct the full order book yourself by merging them.

```json
{
    "type": "l2update",
    "product_id": "BTC-USD",
    "changes": [
        ["buy", "10000.00", "3"],
        ["sell", "10000.03", "1"],
        ["sell", "10000.04", "2"],
        ["sell", "10000.07", "0"]
    ]
}
```
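Merging such deltas into a local book could look roughly like the sketch below. The book layout (two dictionaries keyed by price) and the function name `apply_l2update` are assumptions, as is the rule that a quantity of `"0"` means the level has disappeared; the starting book is made up.

```python
from decimal import Decimal

def apply_l2update(book, message):
    """Merge one l2update message into `book` in place.

    book: {"bids": {price: size}, "asks": {price: size}} with Decimal keys and values.
    """
    for side, price, size in message["changes"]:
        levels = book["bids"] if side == "buy" else book["asks"]
        price, size = Decimal(price), Decimal(size)
        if size == 0:
            levels.pop(price, None)  # assumed convention: zero size removes the level
        else:
            levels[price] = size     # the new quantity replaces the old one
    return book

book = {  # hypothetical starting state
    "bids": {Decimal("10000.00"): Decimal("2")},
    "asks": {Decimal("10000.03"): Decimal("4"), Decimal("10000.07"): Decimal("1")},
}
update = {
    "type": "l2update",
    "product_id": "BTC-USD",
    "changes": [
        ["buy", "10000.00", "3"],
        ["sell", "10000.03", "1"],
        ["sell", "10000.04", "2"],
        ["sell", "10000.07", "0"],
    ],
}
apply_l2update(book, update)
best_bid = max(book["bids"])
best_ask = min(book["asks"])
print(best_bid, best_ask)  # 10000.00 10000.03
```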


## BookSnapshot

Similar to BookUpdate, but a snapshot of the complete order book. Because the full order book can be very large, it is faster and more efficient to use the BookUpdate events instead. However, having an occasional snapshot can be useful.

```json
{
    "type": "snapshot",
    "product_id": "BTC-EUR",
    "bids": [["10000.00", "2"]],
    "asks": [["10000.02", "3"]]
}
```

# A few Trading Strategy Metrics

When developing trading algorithms, what do you optimize for? The obvious answer is profit, but that's not the whole story. You need to compare your trading strategy to baselines, and compare its risk and volatility to other investments. Here are a few of the most basic metrics that traders use. I won't go into detail here, so feel free to follow the links for more information.

## Net PnL (Net Profit and Loss)

Simply how much money an algorithm makes (positive) or loses (negative) over some period of time, minus the trading fees.

## Alpha and Beta

**Alpha** defines how much better, in terms of profit, your strategy is when compared to an alternative, relatively risk-free, investment, like a government bond. Even if your strategy is profitable, you could be better off investing in a risk-free alternative.

**Beta** is closely related, and tells you how volatile your strategy is compared to the market. For example, a beta of 0.5 means that your investment moves \$1 when the market moves \$2.
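One common way to estimate beta, which the text does not spell out, is the covariance between strategy and market returns divided by the variance of the market returns. The sketch below assumes that definition and uses made-up return series.

```python
def beta(strategy_returns, market_returns):
    """Estimate beta as covariance(strategy, market) / variance(market)."""
    n = len(market_returns)
    mean_s = sum(strategy_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(strategy_returns, market_returns)) / n
    var = sum((m - mean_m) ** 2 for m in market_returns) / n
    return cov / var

market = [0.02, -0.01, 0.03, -0.02]   # invented per-period market returns
strategy = [r / 2 for r in market]    # a strategy that moves half as much
print(beta(strategy, market))         # close to 0.5
```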



## Sharpe Ratio

The Sharpe Ratio measures the excess return per unit of risk you are taking. It's basically your return in excess of a risk-free investment, divided by the standard deviation of your returns. Thus, the higher the better. It takes into account both the volatility of your strategy and an alternative risk-free investment.
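A minimal sketch of that calculation, assuming per-period returns and a constant per-period risk-free rate (both made up here), and without the annualization factor that is often applied in practice:

```python
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Mean excess return per period divided by the sample standard deviation."""
    excess = [r - risk_free_rate for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

returns = [0.01, 0.03, -0.02, 0.02, 0.01]  # invented per-period returns
print(sharpe_ratio(returns, risk_free_rate=0.001))
```

Raising `risk_free_rate` lowers the ratio, which reflects the idea that a strategy has to beat the risk-free alternative before its risk is worth taking.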


## Maximum Drawdown

The Maximum Drawdown is the largest drop from a peak to a subsequent trough, another measure of risk. For example, a maximum drawdown of 50% means that you lose 50% of your capital at some point. You then need to make a 100% return to get back to your original amount of capital. Clearly, a lower maximum drawdown is better.
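The sketch below computes the maximum drawdown of an equity curve by tracking the running peak, and also shows why a 50% drawdown needs a 100% gain to recover. The equity values are made up.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, expressed as a fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that peak
    return worst

equity = [100, 120, 60, 90, 130, 110]  # invented account values over time
mdd = max_drawdown(equity)             # 0.5, the drop from 120 to 60
required_recovery = mdd / (1 - mdd)    # 1.0, i.e. a 100% gain to reach the old peak
print(mdd, required_recovery)
```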


## Value at Risk (VaR)


Value at Risk is a risk metric that quantifies how much capital you may lose over a given time frame with some probability, assuming normal market conditions. For example, a 1-day 5% VaR of 10% means that there is a 5% chance that you may lose more than 10% of an investment within a day.
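There are several ways to estimate VaR. The sketch below uses historical simulation, which is just one of them and not something the text prescribes: it sorts past returns and reads off the loss at the chosen quantile. The function name and the 20 daily returns are invented.

```python
def historical_var(returns, level=0.05):
    """Historical-simulation VaR: the loss exceeded in roughly `level` of past periods."""
    ordered = sorted(returns)          # worst returns first
    index = int(level * len(ordered))  # position of the quantile, no interpolation
    return -ordered[index]

daily_returns = [  # 20 invented daily returns
    0.01, -0.02, 0.03, -0.10, 0.02, 0.00, -0.01, 0.04, -0.15, 0.01,
    0.02, -0.03, 0.01, 0.00, 0.02, -0.02, 0.03, 0.01, -0.01, 0.02,
]
var_5pct = historical_var(daily_returns, level=0.05)
print(var_5pct)  # 0.1
```

Here 1 of the 20 days (5%) saw a loss larger than 10%, which matches the "1-day 5% VaR of 10%" reading above.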

# Supervised Learning

Before looking at the problem from a Reinforcement Learning perspective, let's understand how we would go about creating a profitable trading strategy using a supervised learning approach. Then we will see what's problematic about this, and why we may want to use Reinforcement Learning techniques.

The most obvious approach we can take is price prediction. If we can predict that the market will move up, we can buy now and sell once the market has moved. Or, equivalently, if we predict that the market goes down, we can go short (selling an asset we have borrowed) and then buy it back once the market has moved. However, there are a few problems with this.

First of all, what price do we actually predict? As we've seen above, there is not a "single" price we are buying at. The final price we pay depends on the volume available at different levels of the order book, and the fees we need to pay. A naive thing to do is to predict the mid price, which is the mid point between best bid and best ask. That's what most researchers do. However, this is just a theoretical price, not something we can actually execute orders at, and could differ significantly from the real price we're paying.


The next question is time scale. Do we predict the price of the next trade? The price at the next second? Minute? Hour? Day? Intuitively, the further into the future we want to predict, the more uncertainty there is, and the more difficult the prediction becomes.

Let's look at an example. Let's assume the BTC price is \$10,000 and we can accurately predict that the "price" moves up from \$10,000 to \$10,050 in the next minute. So, does that mean you can make \$50 of profit by buying and selling? Let's understand why it doesn't.

* We buy when the best ask is \$10,000. Most likely we will not be able to get all of our 1.0 BTC filled at that price because the order book does not have the required volume. We may be forced to buy 0.5 BTC at \$10,000 and 0.5 BTC at \$10,010, for an average price of \$10,005. On GDAX, we also pay a 0.3% taker fee, which corresponds to roughly \$30.
* The price is now \$10,050, as predicted. We place the sell order. Because the market moves very fast, by the time the order is delivered over the network the price has slipped already. Let's say it's now at \$10,045. Similar to above, we most likely cannot sell all of our 1.0 BTC at that price. Perhaps we are forced to sell 0.5 BTC at \$10,045 and 0.5 BTC at \$10,040, for an average price of \$10,042.50. Then we pay another 0.3% taker fee, which corresponds to roughly \$30.


So how much money have we made? -10,005 - 30 - 30 + 10,042.5 = -22.5. Instead of making \$50, we have lost \$22.50, even though we have accurately predicted a large price movement over the next minute! In the above example there were three reasons for this: no liquidity at the best order book level, network latencies, and fees, none of which the supervised model could take into account.
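The same round trip can be written out as a short calculation. The fills are the ones from the example; the only difference is that the fee is computed exactly at 0.3% instead of being rounded to \$30 per side, so the loss comes out slightly larger than \$22.50.

```python
TAKER_FEE = 0.003  # 0.3% taker fee, as in the example

buy_fills = [(10_000.0, 0.5), (10_010.0, 0.5)]   # (price, size) paid when buying
sell_fills = [(10_045.0, 0.5), (10_040.0, 0.5)]  # (price, size) received when selling

def notional(fills):
    """Total value of a list of (price, size) fills."""
    return sum(price * size for price, size in fills)

cost = notional(buy_fills)            # 10,005.00
proceeds = notional(sell_fills)       # 10,042.50
fees = TAKER_FEE * (cost + proceeds)  # about 60.14 for both legs
pnl = proceeds - cost - fees          # about -22.64
naive_pnl = 10_050.0 - 10_000.0       # the 50 we hoped for

print(round(pnl, 2), naive_pnl)
```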



What is the lesson here? In order to make money from a simple price prediction strategy, we must predict relatively large price movements over longer periods of time, or be very smart about our fees and order management. And that's a difficult prediction problem. We could have saved on the fees by using limit orders instead of market orders, but then we would have no guarantee of our orders being matched, and we would need to build a complex system for order management and cancellation.

But there's another problem with supervised learning: It does not imply a policy. In the above example we bought because we predicted that the price moves up, and it actually moved up. Everything went according to plan. But what if the price had moved down? Would you have sold? Kept the position and waited? What if the price had moved up just a little bit and then moved down again? What if we had been uncertain about the prediction, for example 65% up and 35% down? Would you still have bought? How do you choose the threshold to place an order?

Thus, you need more than just a price prediction model (unless your model is extremely accurate and robust). You also need a *rule-based policy* that takes as input your price prediction and decides what to actually do: place an order, do nothing, cancel an order, and so on. How do we come up with such a policy? How do we optimize the policy parameters and decision thresholds? The answer to this is not obvious, and many people use simple heuristics or human intuition.
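To make the idea of a rule-based policy concrete, here is a deliberately simple, hypothetical sketch. The function name, the action labels, and the 0.65 / 0.35 thresholds are illustrative assumptions; picking those thresholds well is exactly the open question raised above.

```python
def threshold_policy(prob_up, position, buy_threshold=0.65, sell_threshold=0.35):
    """Map a predicted probability of an up-move to an action.

    prob_up: model output, the predicted probability that the price moves up.
    position: current holdings (0 means we hold nothing).
    """
    if position == 0 and prob_up >= buy_threshold:
        return "buy"    # confident enough in an up-move to open a position
    if position > 0 and prob_up <= sell_threshold:
        return "sell"   # confident enough in a down-move to close it
    return "hold"       # otherwise do nothing

print(threshold_policy(0.70, position=0))  # buy
print(threshold_policy(0.50, position=0))  # hold
print(threshold_policy(0.30, position=1))  # sell
```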




# A typical Strategy Development Workflow

Luckily, there are solutions to many of the above problems. The bad news is that these solutions are not very effective. Let's look at a typical workflow for trading strategy development.

1. **Data Analysis**: You perform exploratory data analysis to find trading opportunities. You may look at various charts, calculate data statistics, and so on. The output of this step is an "idea" for a trading strategy that should be validated.

2. **Supervised Model Training**: If necessary, you may train one or more supervised learning models to predict quantities of interest that are necessary for the strategy to work. For example, price prediction, quantity prediction, etc.

3. **Policy Development**: You then come up with a rule-based policy that determines what actions to take based on the current state of the market and the outputs of the supervised models. Note that this policy may also have parameters, such as decision thresholds, that need to be optimized. This optimization is done later.

4. **Strategy Backtesting**: You use a simulator to test an initial version of the strategy against a set of historical data. The simulator can take things such as order book liquidity, network latencies, fees, etc. into account. If the strategy performs reasonably well in backtesting, you can move on and do parameter optimization.

5. **Parameter Optimization**: You can now perform a search, for example a grid search, over possible values of strategy parameters like thresholds or coefficients, again using the simulator and a set of historical data. Here, overfitting to historical data is a big risk, and you must be careful about using proper validation and test sets.

6. **Simulation & Paper Trading**: Before the strategy goes live, simulation is done on new market data, in real time. That's called paper trading and helps prevent overfitting. Only if the strategy is successful in paper trading is it deployed in a live environment.

7. **Live Trading**: The strategy is now running live on an exchange.


This workflow has several weaknesses:

1. Iteration cycles are slow. Steps 1-3 are largely based on intuition, and you don't know if your strategy works until the optimization in steps 4-5 is done, possibly forcing you to start from scratch. In fact, every step comes with the risk of failing and forcing you to start from scratch.
2. Simulation comes too late. You don't explicitly take into account environmental factors such as **latencies, fees, and liquidity** until step 4. Shouldn't these things directly inform your strategy development or the parameters of your model?
3. Policies are developed independently from supervised models even though they interact closely. Supervised predictions are an input to the policy. Wouldn't it make sense to jointly optimize them?
4. Policies are simple. They are limited to what humans can come up with.
5. Parameter optimization is inefficient. For example, let's assume you are optimizing for a combination of profit and risk, and you want to find parameters that give you a high **Sharpe Ratio**. Instead of using an efficient gradient-based approach, you are doing an inefficient grid search and hoping that you will find something good (while not overfitting).

# Deep Reinforcement Learning for Trading 

Remember that the traditional Reinforcement Learning problem can be formulated as a Markov Decision Process (MDP). We have an agent acting in an environment. At each time step $t$ the agent receives as input the current state $S_t$, takes an action $A_t$, and receives a reward $R_{t+1}$ and the next state $S_{t+1}$. The agent chooses the action based on some policy $\pi$: $A_t = \pi(S_t)$. It's our goal to find a policy that maximizes the cumulative reward $\sum_t R_t$ over some finite or infinite time horizon.
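The interaction loop described above can be sketched with a toy example. Everything here is hypothetical: `ToyEnv` is a made-up random-walk "market", the two actions and the reward definition are placeholders, and `policy` is a fixed rule rather than anything learned. The point is only to show the shape of the state, action, reward cycle.

```python
import random

class ToyEnv:
    """A made-up environment: the state is a price that follows a random walk.

    Actions: 0 = stay flat, 1 = hold one unit. The reward is the price change
    while holding a unit, and 0 otherwise.
    """
    def __init__(self, steps=10, seed=0):
        self.rng = random.Random(seed)
        self.steps = steps

    def reset(self):
        self.t = 0
        self.price = 100.0
        return self.price  # initial state S_0

    def step(self, action):
        change = self.rng.choice([-1.0, 1.0])
        self.price += change
        self.t += 1
        reward = change if action == 1 else 0.0  # R_{t+1}
        done = self.t >= self.steps
        return self.price, reward, done          # S_{t+1}, R_{t+1}, end of episode

def policy(state):
    # pi(S_t): a placeholder rule that holds a unit while the price is below 100.
    return 1 if state < 100.0 else 0

env = ToyEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = policy(state)                  # A_t = pi(S_t)
    state, reward, done = env.step(action)  # environment returns S_{t+1} and R_{t+1}
    total_reward += reward                  # cumulative reward over the episode
print(total_reward)
```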