## Strategy Idea 2 : "Cointegration - pairs trading"

__Section 0: Setup__ Importing packages/reading in data etc.

__Section 1 : Idea__ 

- __1.1__ Strategy idea

- __1.2__ Origin of idea. Context/Reasoning for strategy to work e.g. use in financial markets?

__Section 2 : Exploration__

- __2.1__ Exploratory Data Analysis. e.g plots of price/volumes that could show strategy working, how much potential.

- __2.2__ Define some 'strategy metrics'. Metrics that you can use to gauge if this strategy will work, i.e. the number of price points above a certain profitable threshold. Metrics could show how often there is an opportunity to make a trade and how much 'value' is in an opportunity, e.g. how large is the price swing?


__Section 3 : Strategy testing__

- __3.1__ Testing strategy on previous data. 

- __3.2__ State any assumptions made by testing.

- __3.3__ Model refinements. How could strategy be optimised? Careful : is this backfitting/overfitting - what measures taken to negate this e.g. bootstrapping?

- __3.4__ Assessing strategy. P/L on data sample? ROI? variance in results? longest losing run?

__Section 4 : Practical requirements__

- __4.1__ Identify if this edge is ‘realisable’? What methods will you apply to extract this value? e.g. applying a hedge function


- __4.2__ Is it possible to quantify the potential profit from the strategy? Consideration : How long will it take to obtain this? How 'risky' is it? e.g. if something did go wrong, how much do we lose? 

- __4.3__ Strategy limitations. The factors that could prevent strategy working e.g. practical considerations e.g. reacting quick enough to market updates, volume behind a price, size of bankroll needed


__Section 5: Potential limitations__

- __5.1__ What is our 'competition' - if not quantifiable, do we suspect people are doing the same thing? 

- __5.2__ So what's our edge? Identify ways of finding this edge in future? e.g what features are there? Are they predictive? Is there a certain 'market/runner' profile?





Notes (to do)
* Use lay prices as well (currently only back prices are used, but two lay bets are made per pairs trade).

### Section 0 : Setup

In [54]:
# importing packages
from pathlib import Path, PurePath 

import pandas as pd
import numpy as np
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

import utils

In [55]:
# reading in data
project_dir = Path.cwd().parents[2]
data_dir = project_dir / 'data' / 'processed' / 'api' / 'advanced' / 'adv_data.csv'
df = pd.read_csv(data_dir, index_col = 0)
print(df.shape)
df.head()

(13073, 307)


| | SelectionId | MarketId | Venue | Distance | RaceType | BSP | NoRunners | BS:T-60 | BS:T-59 | BS:T-58 | ... | LS:T+5 | LS:T+6 | LS:T+7 | LS:T+8 | LS:T+9 | LS:T+10 | LS:T+11 | LS:T+12 | LS:T+13 | LS:T+14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11986132 | 1.169028 | Huntingdon | 20.0 | Chase | 8.33 | 9 | 16.43 | 24.51 | 26.57 | ... | 10.08 | 11.15 | 5.44 | 7.09 | 14.16 | 19.53 | 3.12 | 3.31 | 0.68 | 0.68 |
| 1 | 16800725 | 1.169028 | Huntingdon | 20.0 | Chase | 3.68 | 9 | 15.43 | 25.74 | 57.82 | ... | 29.87 | 221.22 | 43.23 | 43.1 | 13.53 | 26.15 | 13.6 | 74.3 | 419.52 | 23082.1 |
| 2 | 20968322 | 1.169028 | Huntingdon | 20.0 | Chase | 14.96 | 9 | 9.87 | 9.25 | 9.15 | ... | 37.32 | 6.83 | 4.85 | 11.23 | 16.0 | 5.68 | 40.25 | 12.51 | 10.42 | 13.17 |
| 3 | 22023486 | 1.169028 | Huntingdon | 20.0 | Chase | 4.25 | 9 | 84.38 | 64.49 | 58.01 | ... | 11.67 | 2.02 | 2.02 | 2.02 | 2.02 | 2.02 | 2.02 | 2.02 | 2.02 | 2.02 |
| 4 | 24496216 | 1.169028 | Huntingdon | 20.0 | Chase | 6.6 | 9 | 10.64 | 10.11 | 7.91 | ... | 34.27 | 54.72 | 11.85 | 17.99 | 48.21 | 17.28 | 38.29 | 6.96 | 4.37 | 4.37 |


### Section 1 : Idea

__1.1 Idea__

Prices in a market adjust such that the sum of the implied probabilities of all horses is approximately equal to one. If one horse's price drifts, another's (or several others') should be backed in.
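
As a quick check of this claim, the sketch below converts one snapshot of decimal back prices into implied probabilities and sums them. The prices are the first row of back prices from the sample race displayed in Section 2.2; the variable names are illustrative only.

```python
import numpy as np

# One snapshot of decimal back prices, one entry per horse
# (first row of the sample race shown in Section 2.2).
snapshot_prices = np.array([15.57, 11.5, 27.18, 6.8, 22.0, 3.09, 14.5, 6.84, 13.0, 42.0])

implied_prob = 1.0 / snapshot_prices   # implied probability of each horse
book_total = implied_prob.sum()        # should sit close to 1

print(f"Sum of implied probabilities: {book_total:.3f}")
```

A total slightly above 1 is the book's overround on the back side. The point for pairs trading is only that the total is roughly fixed, so a drift in one price has to be offset elsewhere in the market.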

One traditional strategy considering this price behaviour is 'pairs trading': *"A pairs trade or pair trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement. This strategy is categorized as a statistical arbitrage and convergence trading strategy."*

__1.2  Reasoning__

Why is there an edge here?
- .

### Section 2 : Pairs trading

__2.1__ **- Idea**

[Bebbington, PA (2017)](https://discovery.ucl.ac.uk/id/eprint/1563501/) looks at pairs trading in horse racing markets. The following outlines their method for analysing this strategy. In 2.2, each step will be attempted.

* The 'signals' are the best back or lay price available at a given timestamp.
* Statistical methods are used to analyse horses' pricing data for comparison, in particular to calculate a hedge ratio and for stake weighting. In the paper, non-overlapping windows of data (for example, price observations 1-5, 6-10, 11-15) make up the time series, and trades are made at the end of each window. This simulates an algorithm reacting to live data. This example studies price movement over the first 30 price points and makes bets in the remaining 30 periods.
* A z-score transformation of the log of the decimal odds is used to standardise prices. This makes the relative directional movement in different prices comparable by accounting for their respective variances. 
* Pairs are discovered by analysing the sum of squared distances between two horses' prices throughout time. Those that move the least relative to each other are the best candidates for pairs.
* Once pairs are identified, the 'spread' between their prices (on average, or at the end of each window) is compared to a minimum size requirement for a bet to be made, $\phi$.
* The 'hedging ratio' is found using an OLS regression of the price of one of the horses on the price of the other. Since the two prices form a pair but will have different variances and absolute values, their movement relative to each other must be considered to make the strategy 'cost neutral'. It is also used to define the stake size. 
* The final observed spread indicates on which horse a 'back-to-lay' hedge must be made (the one expected to be backed in) and on which a 'lay-to-back' hedge must be made (the one expected to drift). This spread is compared to an interval [?], likely a confidence interval of past spreads or simply the interval of observed spreads. If the spread is greater or smaller than usual, the bets are placed. 
* In the paper it appears that both sides of the hedge bet are made at the same point in time.

__2.2__ **- Setup**

**Data**

The following example will be set up with a random race and will identify tradeable pairs (or that there are none). Three DataFrames are created: (1) the unchanged race sample DataFrame with one row per horse and data going along in columns, (2) a back prices DataFrame with one column per horse and prices going through time in rows, (3) the same for lay prices. This analysis looks at prices before the race begins.

There are 60 price data points for each horse, finishing at the beginning of the race.

Variables:
* $BP_{t}^{i}$ is back price for horse i at time t.
* $LP_{t}^{i}$ is lay price for horse i at time t.

In [56]:
# defining variables
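# match on the 'XX:' prefix (columns look like 'BS:T-60') so that
# 'BSP' is not picked up as a back-size column
back_prices = [col for col in df.columns if col.startswith('BP:')]
back_sizes = [col for col in df.columns if col.startswith('BS:')]
lay_prices = [col for col in df.columns if col.startswith('LP:')]
lay_sizes = [col for col in df.columns if col.startswith('LS:')]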

In [57]:
#runner_info = ['SelectionId', 'MarketId', 'Venue', 'Distance', 'RaceType', 'BSP', 'NoRunners']

sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()]

bp_df = sample_df[['SelectionId'] + back_prices].copy()
new_cols = bp_df.columns.str.replace(r"[BP:T+]", "", regex=True) # explicit regex: strip the 'BP:T' prefix and any '+'
bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
bp_t_df = bp_df.T.copy()
bp_t_df.columns = ["h" + str(column) for column in bp_t_df.iloc[0]]
bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
bp_t_df.reset_index(drop=True, inplace=True)

lp_df = sample_df[['SelectionId'] + lay_prices].copy()
new_cols = lp_df.columns.str.replace(r"[LP:T+]", "", regex=True) # explicit regex: strip the 'LP:T' prefix and any '+'
lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
lp_t_df = lp_df.T.copy()
lp_t_df.columns = ["h" + str(column) for column in lp_t_df.iloc[0]]
lp_t_df = lp_t_df.iloc[1:-15]
lp_t_df.reset_index(drop=True, inplace=True)

# bsp_df = plot_df[['BSP']].copy()
# bsp_df['min_bp'] = bsp_df['BSP'].apply(lambda x: round(utils.back_hedge_min_bp(x, 0.05), 2))
# bsp_df['max_lp'] = bsp_df['BSP'].apply(lambda x: round(utils.lay_hedge_max_lp(x, 0.05), 2))    

bp_t_df.head()

| | h10417645.0 | h10485530.0 | h11042718.0 | h13244753.0 | h15455501.0 | h20568186.0 | h21064500.0 | h2533268.0 | h415344.0 | h7211500.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.57 | 11.5 | 27.18 | 6.8 | 22.0 | 3.09 | 14.5 | 6.84 | 13.0 | 42.0 |
| 1 | 16.0 | 11.5 | 28.89 | 6.81 | 21.74 | 3.03 | 14.82 | 6.6 | 13.18 | 41.57 |
| 2 | 16.23 | 11.5 | 28.0 | 6.92 | 21.5 | 3.04 | 15.5 | 6.6 | 13.0 | 40.0 |
| 3 | 16.5 | 11.5 | 28.0 | 6.89 | 22.0 | 3.05 | 16.04 | 6.58 | 12.87 | 41.82 |
| 4 | 16.5 | 11.9 | 28.0 | 6.8 | 22.0 | 3.03 | 15.86 | 6.6 | 13.18 | 44.0 |


__2.3__ **- Z-score transformation**

Bebbington standardises prices by taking the natural logarithm of each price and then standardising it with a Z-score transformation. 

Taking $P_{t}^{i} = \ln(BP_{t}^{i})$, this means finding 

#### $P_{t}^{'(i)} = \frac{P_{t}^{i}-\overline{P}^{i}}{\sigma^{(i)}}$

where $\overline{P}^{i}$ and $\sigma^{(i)}$ are the mean and standard deviation of the horse's log price over the time series.

The Z-score transformation expresses each observation as its distance from the horse's own mean, measured in units of the horse's own standard deviation. This means that variations are comparable between horses.

The following will look at the first 30 observations.

In [58]:
z_bp_df = bp_t_df.copy()
z_bp_df = z_bp_df[:30] #first 30 observations
z_bp_df = np.log(z_bp_df)

for column in z_bp_df.columns:
    mean = z_bp_df[column].mean()
    sd = np.std(z_bp_df[column], ddof = 1)
    z_bp_df[column] = z_bp_df[column].apply(lambda x: (x - mean) / sd)
    
z_bp_df.head()

| | h10417645.0 | h10485530.0 | h11042718.0 | h13244753.0 | h15455501.0 | h20568186.0 | h21064500.0 | h2533268.0 | h415344.0 | h7211500.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.773707 | -1.17647 | -1.671947 | -0.256552 | 0.21214 | 1.957295 | -2.816352 | 1.887244 | -1.592685 | -1.527041 |
| 1 | -1.563813 | -1.17647 | -0.913185 | -0.193632 | -0.08214 | 0.993198 | -2.358699 | 1.315435 | -1.192795 | -1.844057 |
| 2 | -1.453849 | -1.17647 | -1.302315 | 0.492446 | -0.356925 | 1.155199 | -1.41814 | 1.315435 | -1.592685 | -3.030048 |
| 3 | -1.326731 | -1.17647 | -1.302315 | 0.306421 | 0.21214 | 1.316668 | -0.700167 | 1.26685 | -1.884953 | -1.659348 |
| 4 | -1.326731 | 0.790337 | -1.302315 | -0.256552 | 0.21214 | 0.993198 | -0.93677 | 1.315435 | -1.192795 | -0.093967 |
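
The column-by-column loop in the cell above can also be written in vectorised form. The following is a minimal sketch, assuming a DataFrame with one column per horse and strictly positive prices. The function name `zscore_log_prices` and the toy data are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def zscore_log_prices(prices: pd.DataFrame) -> pd.DataFrame:
    """Standardise log prices column by column (one column per horse)."""
    log_prices = np.log(prices.astype(float))
    std = log_prices.std(ddof=1)        # sample standard deviation per horse
    std = std.replace(0.0, np.nan)      # constant price -> NaN z-scores rather than inf
    return (log_prices - log_prices.mean()) / std

# Toy prices for two hypothetical horses.
toy = pd.DataFrame({"hA": [3.0, 3.1, 3.2, 3.05], "hB": [10.0, 9.5, 9.8, 10.4]})
z_toy = zscore_log_prices(toy)
print(z_toy)
```

Replacing a zero standard deviation with NaN is a design choice made here so that a horse whose price never moves is flagged rather than producing a division error.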


__2.4__ **- Sum of squared distances**

Following Bebbington, to select pairs, we create a matrix (DataFrame, in this case) of the sum of squared distances between pairs of horses throughout the time series.

$\Theta _{ij} = \left\{\begin{matrix}
\sum_{t=1}^{M}(P_{t}^{'(i)} - P_{t}^{'(j)})^{2}, & i\neq j\\ 
0, & i=j
\end{matrix}\right.$

In [59]:
ids = [column for column in z_bp_df.columns]

matrix = z_bp_df.iloc[0:0].copy()

matrix.insert(0, "horse", np.array(ids))
matrix = matrix.set_index("horse", drop = False)
del matrix["horse"]

for column in matrix.columns:
    for row in matrix.index:
        if column == row:
            matrix.loc[row, column] = np.nan
        else: 
            matrix.loc[row, column] = ((z_bp_df[row] - z_bp_df[column]) ** 2).sum() or np.nan
            
for x in range(len(ids)):
    for y in range(x, len(ids)):
        matrix.iloc[x, y] = np.nan
        
matrix

| horse | h10417645.0 | h10485530.0 | h11042718.0 | h13244753.0 | h15455501.0 | h20568186.0 | h21064500.0 | h2533268.0 | h415344.0 | h7211500.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| h10417645.0 | | | | | | | | | | |
| h10485530.0 | 49.338189 | | | | | | | | | |
| h11042718.0 | 9.069529 | 50.104673 | | | | | | | | |
| h13244753.0 | 43.574657 | 79.982354 | 45.312769 | | | | | | | |
| h15455501.0 | 59.171365 | 80.34417 | 69.015291 | 22.438882 | | | | | | |
| h20568186.0 | 99.284447 | 53.475556 | 98.741577 | 95.888779 | 79.762012 | | | | | |
| h21064500.0 | 8.156533 | 42.855107 | 12.258066 | 44.414288 | 62.680348 | 100.147067 | | | | |
| h2533268.0 | 112.279896 | 75.532826 | 103.858446 | 59.233086 | 47.272426 | 28.493127 | 105.629582 | | | |
| h415344.0 | 15.490315 | 32.512443 | 22.337836 | 76.839332 | 81.922897 | 76.333392 | 17.734701 | 106.993038 | | |
| h7211500.0 | 26.238732 | 27.836946 | 29.919123 | 80.930127 | 77.199871 | 72.924363 | 21.485852 | 94.440337 | 12.777213 | |


In [60]:
horse_x = matrix.min(axis=1).idxmin()
horse_y = matrix.min().idxmin()
sss = matrix.min().min()

print(f"Pair found: horse {horse_x} and horse {horse_y} with sum of squared spreads equal to {sss}.")

Pair found: horse h21064500.0 and horse h10417645.0 with sum of squared spreads equal to 8.156532863288882.
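
The distance matrix and the pair selection can be computed in one step with NumPy broadcasting. This is a sketch rather than a drop-in replacement: `closest_pair` is a hypothetical helper, and it assumes the standardised prices are numeric with one column per horse.

```python
import numpy as np
import pandas as pd

def closest_pair(z: pd.DataFrame):
    """Return (row_horse, column_horse, ssd) for the pair with the smallest
    sum of squared distances between standardised price paths."""
    values = z.to_numpy(dtype=float)                 # shape (time, horses)
    diff = values[:, :, None] - values[:, None, :]   # pairwise differences at each time step
    ssd = (diff ** 2).sum(axis=0)                    # shape (horses, horses)
    ssd[np.triu_indices_from(ssd)] = np.inf          # mask the diagonal and duplicate pairs
    i, j = np.unravel_index(np.nanargmin(ssd), ssd.shape)
    return z.columns[i], z.columns[j], ssd[i, j]

# Toy standardised paths: 'a' and 'b' move together, 'c' does not.
z_demo = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [0.0, 1.0, 2.1], "c": [5.0, 5.0, 5.0]})
pair = closest_pair(z_demo)
print(pair)
```

Only the lower triangle is searched, which mirrors the masking of the upper triangle in the loop-based cell. Pairs whose distance is NaN (for example because of missing prices) are skipped by `np.nanargmin`.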


__2.5__ **- Regression of prices of horse Y on horse X**

With $x_{t} = \left \{P_{1}^{X}, P_{2}^{X}, \ldots , P_{M}^{X} \right \}$ and $y_{t} = \left \{P_{1}^{Y}, P_{2}^{Y}, \ldots , P_{M}^{Y} \right \}$, where $M$ is the final time period in the window, we carry out the OLS regression of $y_{t}$ on $x_{t}$. The estimate of $\beta$ is the hedging ratio, giving the relative holding of $x_{t}$ for a cost-neutral hedge position. The regression cells in this section also include an intercept $\alpha$:

$y_{t} = \alpha + \beta x_{t} + \varepsilon_{t}$

Using these estimates and the end-of-window observations at time $M$, we get the spread at the end of the window:

$\varepsilon_{M} = y_{M} - \hat{\alpha} - \hat{\beta} x_{M}$.

If $\varepsilon_{M}$ lies outside an interval of past spread values, such that a hedge bet would be profitable if the spread returned to its mean, bets can be made.
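
The interval itself is not pinned down in this notebook. One possible reading, sketched below, is a band of plus or minus a fixed number of residual standard deviations around zero. The helper name `final_spread_signal`, the default of 2 standard deviations and the toy data are assumptions for illustration, not Bebbington's specification.

```python
import numpy as np

def final_spread_signal(log_y, log_x, n_sd=2.0):
    """Fit log_y = alpha + beta * log_x by OLS and compare the final residual
    with a band of +/- n_sd residual standard deviations (an assumed rule)."""
    log_y = np.asarray(log_y, dtype=float)
    log_x = np.asarray(log_x, dtype=float)
    design = np.column_stack([np.ones_like(log_x), log_x])
    (alpha, beta), *_ = np.linalg.lstsq(design, log_y, rcond=None)
    residuals = log_y - alpha - beta * log_x
    band = n_sd * residuals.std(ddof=2)   # two estimated parameters
    final = residuals[-1]
    if final > band:
        side = "above"    # Y looks high relative to X
    elif final < -band:
        side = "below"    # Y looks low relative to X
    else:
        side = "inside"   # no trade
    return beta, final, band, side

# Toy window: an exact linear relationship with a jump in Y at the last step.
x_demo = np.linspace(1.0, 2.0, 30)
y_demo = 0.5 + 2.0 * x_demo
y_demo[-1] += 0.5
beta_demo, final_demo, band_demo, side_demo = final_spread_signal(y_demo, x_demo)
print(beta_demo, final_demo, band_demo, side_demo)
```

Because the final observation is also used to fit the regression, it pulls the fitted line towards itself and shrinks its own residual. Fitting on the first $M-1$ points and scoring the last one out of sample would be a stricter variant.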

In [61]:
#regression setup
reg_df = bp_t_df[[horse_y, horse_x]][:30].copy() #non-standardised prices
reg_df = np.log(reg_df) 
reg_df['const'] = 1

In [62]:
#regression fit and results
reg = sm.OLS(endog=reg_df[horse_y], exog=reg_df[['const', horse_x]], missing='drop')

results = reg.fit()

#print(results.summary())

constant = results.params['const'] # label-based access; positional indexing of a labelled Series is deprecated
beta = results.params[horse_x]
print(f"\nHedge ratio beta = {beta}.")


Hedge ratio beta = 2.3384848866347667.


In [63]:
#estimated final period (T=30) spread
spread = reg_df[horse_y].iloc[29].item() - constant - beta * reg_df[horse_x].iloc[29].item()

print(f"Final period estimated spread (in log prices) epsilon = {spread}.")

if spread > 0:
    print("Positive spread: horse Y has drifted from the mean, horse X has been backed in. If mean reversion occurs horse Y will be backed in and horse X will drift. Back-to-lay hedge Y and lay-to-back hedge X.")
else:
    print("Negative spread: horse X has drifted from the mean, horse Y has been backed in. If mean reversion occurs horse X will be backed in and horse Y will drift. Back-to-lay hedge X and lay-to-back hedge Y.")

Final period estimated spread (in log prices) epsilon = -0.07856992079340497.
Negative spread: horse X has drifted from the mean, horse Y has been backed in. If mean reversion occurs horse X will be backed in and horse Y will drift. Back-to-lay hedge X and lay-to-back hedge Y.


__2.6__ **- Estimated residual vs. 'threshold'**

[Return to this later]

__2.7.1__ **- Pairs trade**

Bebbington describes the following steps for the pairs trade:

If the final estimated residual $\varepsilon_{M}$ is:
* Less than the lower bound of the threshold: back horse Y with £1, lay horse X with £$\beta$.
* Greater than the upper bound of the threshold: lay horse Y with £1 and back horse X with £$\beta$.

Bebbington describes using a self-imposed delay of 5 seconds to emulate the delays in placing a bet on the Betfair Exchange. Since each time step in this dataset before the beginning of the race is 2 minutes, 1 step will be used. The prices at time 31 (index = 30) will be used.

This is followed in the cell below. Taken alone, this strategy doesn't make sense: the bettor remains exposed to the result of the race unless each position is closed out with an opposing bet on each horse after some period of time.

In [64]:
# #a threshold of zero is going to be used initially

# def payout(bp, bs, lp, ls, c = 0):
#     loss_side = - bs + ls * (1 - c) 
#     win_side = (bp - 1) * bs * (1 - c) - (lp - 1) * ls
#     return win_side, loss_side

# if spread < 0:
#     bp = bp_t_df[horse_y].iloc[30]
#     lp = bp_t_df[horse_x].iloc[30]
#     win_side, loss_side = payout(bp, 1, lp, beta)
#     print(f"Win side = {win_side}. Loss side = {loss_side}.")
# else:
#     bp = bp_t_df[horse_x].iloc[30]
#     lp = bp_t_df[horse_y].iloc[30]
#     win_side, loss_side = payout(bp, beta, lp, 1)
#     print(f"Win side = {win_side}. Loss side = {loss_side}.")

__2.7.2__ **- Pairs trade 2**

In a pairs trade with two financial assets (say, company shares) the objective is to go long on the share whose price is expected to increase, while shorting the share whose price is expected to decrease. In both cases the positions would be closed out: the asset is sold or bought back (in the case of the short-sale) once there is a gain or possibly after a set interval. In the prior example, the closing out of each position should be done via an opposite back/lay.

If the final estimated residual $\varepsilon_{M}$ is:
* Less than the lower bound of the threshold: back horse Y with £1, lay horse X with £$\beta$. **After k periods, lay horse Y and back horse X with the optimal stakes given the prevailing prices, as defined in utils.py.**
* Greater than the upper bound of the threshold: lay horse Y with £1 and back horse X with £$\beta$. **After k periods, back horse Y and lay horse X with the optimal stakes.**
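
The contents of `utils.py` are not shown here, so the following is only a sketch of what closing out a position with an 'optimal' stake could look like. It uses the stake rule $LS = \frac{BS \times BP}{LP}$ discussed in Section 3.2 and ignores commission. Both function names are hypothetical.

```python
def close_back_with_lay(back_price, back_stake, lay_price):
    """Back first, lay later. Returns (lay_stake, profit_if_wins, profit_if_loses)."""
    lay_stake = back_stake * back_price / lay_price
    profit_if_wins = (back_price - 1) * back_stake - (lay_price - 1) * lay_stake
    profit_if_loses = lay_stake - back_stake
    return lay_stake, profit_if_wins, profit_if_loses

def close_lay_with_back(lay_price, lay_stake, back_price):
    """Lay first, back later. Returns (back_stake, profit_if_wins, profit_if_loses)."""
    back_stake = lay_stake * lay_price / back_price
    profit_if_wins = (back_price - 1) * back_stake - (lay_price - 1) * lay_stake
    profit_if_loses = lay_stake - back_stake
    return back_stake, profit_if_wins, profit_if_loses

# Back £1 at 5.0, then lay at 4.0 once the horse has been backed in.
back_to_lay = close_back_with_lay(5.0, 1.0, 4.0)
# Lay £1 at 4.0, then back at 5.0 once the horse has drifted.
lay_to_back = close_lay_with_back(4.0, 1.0, 5.0)
print(back_to_lay, lay_to_back)
```

With these stakes the profit is the same whether the horse wins or loses, which is what makes each leg neutral to the race outcome. If the price moves the wrong way, the same formulas lock in an equal loss on both outcomes.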

In [65]:
k = 35

if spread < 0:
    #back to lay Y
    bp_y = bp_t_df[horse_y].iloc[30]
    lp_y = bp_t_df[horse_y].iloc[k]
    
    win_side_y, loss_side_y = utils.payout(bp_y, 1, lp_y, '?')
    print(f"Win side = {win_side_y}. Loss side = {loss_side_y}.")
    
    #lay to back X
    lp_x = bp_t_df[horse_x].iloc[30]
    bp_x = bp_t_df[horse_x].iloc[k]
    
    win_side_x, loss_side_x = utils.payout(bp_x, '?', lp_x, beta)
    print(f"Win side = {win_side_x}. Loss side = {loss_side_x}.")
        
else:
    #lay to back Y
    lp_y = bp_t_df[horse_y].iloc[30]
    bp_y = bp_t_df[horse_y].iloc[k]
    
    win_side_y, loss_side_y = utils.payout(bp_y, '?', lp_y, 1)
    print(f"Win side = {win_side_y}. Loss side = {loss_side_y}.")
    
    #back to lay X
    bp_x = bp_t_df[horse_x].iloc[30]
    lp_x = bp_t_df[horse_x].iloc[k]
    
    win_side_x, loss_side_x = utils.payout(bp_x, beta, lp_x, '?')
    print(f"Win side = {win_side_x}. Loss side = {loss_side_x}.")

Win side = 0.030000000000001137. Loss side = -0.010000000000000009.
Win side = 0.0412673803523802. Loss side = 0.041267380352378424.


__2.9__ **- Monte Carlo simulation**

Below is a Monte Carlo simulation of the full process outlined above, repeated 1,000 times with profits aggregated.

In [66]:
# def payout(bp, bs, lp, ls, c = 0):
#     loss_side = - bs + ls * (1 - c) 
#     win_side = (bp - 1) * bs * (1 - c) - (lp - 1) * ls
#     return win_side, loss_side

# returns = 0

# for n in range(1000):
#     sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()].copy()

#     bp_df = sample_df[['SelectionId'] + back_prices].copy()
#     new_cols = bp_df.columns.str.replace("[BP:T]", "").str.replace("[+]", "")
#     bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
#     bp_t_df = bp_df.T.copy()
#     bp_t_df.columns = ["h" + str(column) for column in bp_t_df.iloc[0]]
#     bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
#     #bp_t_df.astype(int)
#     bp_t_df.reset_index(drop=True, inplace=True)

#     lp_df = sample_df[['SelectionId'] + lay_prices].copy()
#     new_cols = lp_df.columns.str.replace("[LP:T]", "").str.replace("[+]", "")
#     lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
#     lp_t_df = lp_df.T.copy()
#     lp_t_df.columns = ["h" + str(column) for column in lp_t_df.iloc[0]]
#     lp_t_df = lp_t_df.iloc[1:-15]
#     lp_t_df.reset_index(drop=True, inplace=True)
    
#     # bsp_df = plot_df[['BSP']].copy()
#     # bsp_df['min_bp'] = bsp_df['BSP'].apply(lambda x: round(utils.back_hedge_min_bp(x, 0.05), 2))
#     # bsp_df['max_lp'] = bsp_df['BSP'].apply(lambda x: round(utils.lay_hedge_max_lp(x, 0.05), 2)) 

#     z_bp_df = bp_t_df.copy()
#     z_bp_df = z_bp_df[:30] #first 30 observations
#     z_bp_df = np.log(z_bp_df)

#     # skip this race if any horse has a constant price (standard deviation of 0)
#     oops = 0
#     for column in z_bp_df.columns:
#         mean = z_bp_df[column].mean()
#         standard_deviation = np.std(z_bp_df[column], ddof = 1)
#         if type(standard_deviation) == pd.core.series.Series:
#             standard_deviation = standard_deviation.min()
#             z_bp_df[column] = z_bp_df[column].apply(lambda x: (x - mean) / standard_deviation)  
#         elif standard_deviation == 0: 
#             oops += 1
#         else:
#             z_bp_df[column] = z_bp_df[column].apply(lambda x: (x - mean) / standard_deviation)          
#     if oops >= 1:
#         continue  # move on to the next sampled race instead of ending the simulation

#      # calculating sum of squared spreads

#     ids = [column for column in z_bp_df.columns]

#     matrix = z_bp_df.iloc[0:0].copy()

#     matrix.insert(0, "horse", np.array(ids))
#     matrix = matrix.set_index("horse", drop = False)
#     del matrix["horse"]

#     for column in matrix.columns:
#         for row in matrix.index:
#             if column == row:
#                 matrix.loc[row, column] = np.nan
#             else: 
#                 matrix.loc[row, column] = ((z_bp_df[row] - z_bp_df[column]) ** 2).sum() or np.nan

#     for x in range(len(ids)):
#         for y in range(x, len(ids)):
#             matrix.iloc[x, y] = np.nan

#     horse_x = matrix.min(axis=1).idxmin()
#     horse_y = matrix.min().idxmin()
#     sss = matrix.min().min()

#     #regression setup
#     reg_df = bp_t_df[[horse_y, horse_x]][:30].copy() #non-standardised prices
#     reg_df = np.log(reg_df) 
#     reg_df['const'] = 1
#     reg_df['spread'] = reg_df[horse_y] - reg_df[horse_x]

#     #regression fit and results
#     reg = sm.OLS(endog=reg_df[horse_y], exog=reg_df[['const', horse_x]], missing='drop')

#     results = reg.fit()

#     constant = results.params[0]
#     beta = results.params[1]

#     spread = reg_df[horse_y].iloc[29].item() - constant - beta * reg_df[horse_x].iloc[29].item()

#     k = 50

#     if spread < 0:
#         #back to lay Y
#         bp_y = bp_t_df[horse_y].iloc[30]
#         lp_y = bp_t_df[horse_y].iloc[k]

#         win_side_y, loss_side_y = utils.payout(bp_y, 1, lp_y, '?')

#         #lay to back X
#         lp_x = bp_t_df[horse_x].iloc[30]
#         bp_x = bp_t_df[horse_x].iloc[k]

#         win_side_x, loss_side_x = utils.payout(bp_x, '?', lp_x, beta)

#     else:
#         #lay to back Y
#         lp_y = bp_t_df[horse_y].iloc[30]
#         bp_y = bp_t_df[horse_y].iloc[k]

#         win_side_y, loss_side_y = utils.payout(bp_y, '?', lp_y, 1)

#         #back to lay X
#         bp_x = bp_t_df[horse_x].iloc[30]
#         lp_x = bp_t_df[horse_x].iloc[k]

#         win_side_x, loss_side_x = utils.payout(bp_x, beta, lp_x, '?')
        
#     returns += win_side_y + win_side_x + loss_side_y + loss_side_x
    
# print(returns)

### Further Research 

__3.0__ **- [Herlemont (2004)](http://docs.finance.free.fr/DOCS/Yats/cointegration-en%5B1%5D.pdf) paper**

Herlemont describes in detail the econometrics of pairs trading for financial market assets. The following partly follows his commentary with some additional clarifications and discussion relating to horse racing.

**3.1 - Testing for mean reversion**

The aim is to identify odds that move together and whose spread is mean reverting. For the purposes of horse racing pairs, mean reversion is essential. Our objective is to capture prices whose spread has (temporarily) deviated from its mean. If this can be found, bets can be made to take advantage of the possible reversion.

A stochastic process $y_{t}$ that is weakly stationary has the following properties for all $t$:

* $E[y_{t}] = \mu < \infty$
* $var(y_{t}) = \gamma_{0} < \infty$
* $cov(y_{t}, y_{t-j}) = \gamma_{j} < \infty, j = 1, 2, 3 ...$

(constant mean, constant variance, covariance between two observations depends only on the distance in time between them)

A weakly stationary $I(0)$ series:
* Fluctuates around its mean with a finite variance that does not depend upon time.
* Is mean-reverting: it has a tendency to return to its mean.
* Has limited memory; the effect of a shock dies out. Autocorrelations die out (fairly) rapidly.

With two horses' odds, $A_{t}$ and $B_{t}$, we look at $y_{t} = \log \frac{A_{t}}{B_{t}} = \log A_{t} - \log B_{t}$. This is once again the spread between the prices of the two horses, defined slightly differently. We want to find a pair which has a weakly stationary spread. We are interested in the ($AR(1)$) process 

$y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$,

or the log odds ratio over time. If this is weakly stationary, it would suggest a mean reverting process. 

The three previous conditions must hold, together with the stability condition $|\theta|<1$. This rules out a random walk ($\theta = 1$), explosive behaviour ($|\theta| > 1$) and an erratic, non-decaying oscillation between positive and negative deviations ($\theta = -1$).
______

A Dickey-Fuller test can be carried out on the log ratio of the prices to test whether the process is weakly stationary. We carry out the regression

$\Delta y_{t} = \mu + \omega y_{t-1} + \varepsilon_{t}$

where $\omega = \theta - 1$. Under the null hypothesis $\omega = 0$, the 'true' relationship is $\Delta y_{t} = \mu + \varepsilon_{t} \Leftrightarrow y_{t} = \mu + y_{t-1} + \varepsilon_{t}$, a random walk with drift $\mu$.

If we can reject the null hypothesis in favour of $\omega < 0$, there is evidence that the price ratio is weakly stationary and thereby mean-reverting.

A Dickey-Fuller test is required for each possible pair of horses in a race, or $\frac{n(n-1)}{2}$ regressions, where $n$ is the number of horses.
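
A minimal sketch of this screening step is given below, using plain NumPy so that the regression is visible. The helper names, the synthetic data and the cut-off of -2.86 (roughly the large-sample 5% critical value for the constant-only Dickey-Fuller test) are assumptions for illustration. With only 30 to 60 observations per race the appropriate critical value is more negative, and `statsmodels.tsa.stattools.adfuller` would give proper p-values.

```python
from itertools import combinations

import numpy as np

def dickey_fuller_stat(y):
    """t-statistic on omega in: delta_y[t] = mu + omega * y[t-1] + e[t]."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    design = np.column_stack([np.ones(len(dy)), y[:-1]])
    coef, *_ = np.linalg.lstsq(design, dy, rcond=None)
    resid = dy - design @ coef
    sigma2 = resid @ resid / (len(dy) - 2)           # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)  # OLS covariance matrix
    return coef[1] / np.sqrt(cov[1, 1])

def screen_pairs(log_prices, critical_value=-2.86):
    """Test the log price ratio of every pair; log_prices maps name -> array."""
    results = {}
    for a, b in combinations(log_prices, 2):         # n(n-1)/2 pairs
        spread = np.asarray(log_prices[a]) - np.asarray(log_prices[b])
        stat = dickey_fuller_stat(spread)
        results[(a, b)] = (stat, stat < critical_value)
    return results

# Synthetic log prices: hA and hB share a random-walk component, hC does not.
rng = np.random.default_rng(0)
common = np.cumsum(rng.normal(scale=0.05, size=200))
demo = {
    "hA": common + rng.normal(scale=0.01, size=200),
    "hB": common + rng.normal(scale=0.01, size=200),
    "hC": np.cumsum(rng.normal(scale=0.05, size=200)),
}
results = screen_pairs(demo)
for pair_names, (stat, is_stationary) in results.items():
    print(pair_names, round(stat, 2), is_stationary)
```

Running many tests per race raises the chance of a false positive, so a stricter cut-off or a multiple-testing correction may be worth considering.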

While we are interested in the stochastic process $y_{t}$, we do not need to carry out the regression of $y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$ for the purpose of finding pairs. This relationship between a pair of odds itself is not important to quantify. We are only interested in the features of the process. 
____

*In the previous analysis, the test for whether two odds formed a pair was to find the pair with the smallest sum of squared differences over time in the standardised prices. That method allows at most one pair to be found per race, and the validity of that pair is not confirmed statistically. Rather, the pair's feasibility for a trade is tested afterwards, based on profitability. I have more confidence in the approach in this section.*

**3.2 - Screening pairs**

Herlemont describes rules that make market neutrality more achievable in pairs trading. The idea is to pick stocks with very similar characteristics, such as the same industry and similar market betas, with the intention of minimising asymmetric shocks that hit the price of one stock and not the other. For example, suppose the share on which you are long belongs to a business heavily dependent on oil, while the other share does not. A surge in oil prices that dampens the profitability of your long share will likely see its price fall, ruining the pairs trade. In the case of shares, the simplest solution is to pick shares in similar industries with similar market betas (or with similar idiosyncratic risks).

For horses, the external factors influencing prices (news about runners, changing weather conditions, etc.) will usually have asymmetric effects. This may be avoidable by picking horses with similar fundamental characteristics. However, this is very complicated. My hope is that the pair-finding mechanism picks horses where this is already the case, because the market reacts the same way to news for these horse pairs.

We cannot follow a beta-based approach because there are no 'market-wide fluctuations' of the same sort. However, the sum of the implied probabilities of all horses in the market book is approximately equal to 1. Therefore, for a given change in the implied probability of one horse, the sum of the changes in the implied probabilities of all the remaining horses is approximately the negative of that change. Writing $O_{i}$ for the implied probability of horse $i$ and $N_{h}$ for the number of horses:

$\Delta O_{i} = - \sum_{j = 1, j \neq i}^{N_{h}} \Delta O_{j} $

There is therefore interdependence between all prices across the market. It's possible that this will cause an endogeneity problem in regressions between separate horses, as the changes in the dependent variable necessarily impact the explanatory variable. However, the impact is likely to be very small, and will be smaller the greater the number of horses. 

*In Bebbington's analysis, he describes that betting £1 on one of the horses and £$\beta$ on the other creates a market neutral bet. This is incorrect, and it appears that he has misunderstood hedging in this context. In that analysis, $\beta = \frac{y_{t}}{x_{t}}$, and therefore he is simply considering the ratio of the prices of the horses, the same ratio considered when determining the optimal stake for two given prices in a hedge. It is correct that on a single horse this creates a market neutral bet; however, neutrality in horse racing means neutral to the outcome of the race. Any bet neutral to the race outcome is by definition neutral to the market. When betting on separate horses, the bets on each horse must be made neutral separately. Additionally, the use of $\beta$ in staking is unnecessary. Consider the case where £$BS$ has been bet on horse A at price $BP$. Now, horse A is priced at $LP$. The optimal stake to lay at $LP$ is £$LS = \frac{BS \times BP}{LP}$. In the aforementioned regression, $BS = 1$, hence $\beta = \frac{y_{t} \times 1}{x_{t}}$ is the optimal stake only for bets of £1; otherwise it would be $S \times \beta$. More importantly, using the estimated $\beta$ to find an approximation of the optimal stake makes no sense when the optimal stake can simply be found with the aforementioned equation.*

**3.3 - Trading rules**

Timing rules must be added. 

Herlemont's basic rule is "to open a position when the ratio of two share prices hits the 2 rolling standard deviation [difference from the 130-day rolling mean] and close it when the ratio returns to the mean."

To avoid opening a position on stocks that are deviating from the mean and are going to deviate further, Herlemont describes that "the position is not opened when the ratio breaks the two-standard-deviations limit for the first time, but rather when it crosses it to revert to the mean again."

This can be achieved with the horse odds, of course in far smaller time scales. The current dataset is in 5-minute intervals for the three hours before a race; this should likely be expanded.

Stop losses should be included and trade length should also be limited.

Rules:
1. Trade on pairs whose spread is reapproaching the mean from a deviated position
2. Stop loss at x% of the initial position
3. Don't hold open pairs trades for longer than x hours. 
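
Rule 1 can be sketched as follows: standardise the price ratio against its rolling mean and standard deviation, then flag the time steps where the z-score crosses back inside the band after having been outside it. The function names, the window length and the band of 2 are illustrative assumptions, and the stop loss and time limit are not modelled.

```python
import pandas as pd

def rolling_zscore(ratio: pd.Series, window: int) -> pd.Series:
    """Z-score of the ratio against its rolling mean and standard deviation
    (the window includes the current observation)."""
    rolling = ratio.rolling(window)
    return (ratio - rolling.mean()) / rolling.std()

def reentry_signals(z: pd.Series, band: float = 2.0) -> pd.Series:
    """-1 when z falls back below +band (ratio expected to fall),
    +1 when z rises back above -band (ratio expected to rise), else 0."""
    prev = z.shift(1)
    signal = pd.Series(0, index=z.index)
    signal[(prev >= band) & (z < band)] = -1
    signal[(prev <= -band) & (z > -band)] = 1
    return signal

# Hand-made z-scores: one excursion above the band, then one below it.
z_demo = pd.Series([0.0, 1.0, 2.5, 3.0, 1.5, 0.0, -2.2, -1.0])
signals = reentry_signals(z_demo)
print(signals.tolist())
```

In practice `rolling_zscore` would be applied to the log price ratio of a screened pair first, and its output passed to `reentry_signals`.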

It should be possible to quantify the average length of time required for a mean reversion and therefore the maximum logical time to hold open a position by looking at past data.
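
One way to quantify the reversion time, assuming the spread follows the AR(1) process from Section 3.1, is the half-life $h = \frac{\ln 0.5}{\ln \theta}$: the number of time steps after which the expected deviation from the mean has halved. The sketch below estimates $\theta$ by OLS. The function name and the noise-free toy series are illustrative only.

```python
import numpy as np

def ar1_half_life(y):
    """Estimate theta in y[t] = c + theta * y[t-1] + e[t] by OLS and return
    (theta, half_life in time steps). The half-life is infinite unless 0 < theta < 1."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, theta), *_ = np.linalg.lstsq(design, y[1:], rcond=None)
    if not 0 < theta < 1:
        return theta, np.inf
    return theta, np.log(0.5) / np.log(theta)

# Noise-free toy series with c = 1 and theta = 0.5, starting away from its mean of 2.
series = [10.0]
for _ in range(20):
    series.append(1.0 + 0.5 * series[-1])

theta_hat, half_life = ar1_half_life(series)
print(theta_hat, half_life)
```

A maximum holding period could then be set as a small multiple of the typical half-life across past pairs, which is a judgment call rather than something the data dictates.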

**3.4 - Other tests**

It should be checked that the regression of one price on another is not spurious (this applies to the regression in 2.5). If it is spurious, $\beta$ is statistically meaningless and it makes no sense to use it.