## Strategy Idea 2 : "Comovement - pairs trading"

__Section 0: Setup__ Importing packages/reading in data etc.

__Section 1 : Idea__ 

- __1.1__ Strategy idea

- __1.2__ Origin of idea. Context/Reasoning for strategy to work e.g. use in financial markets?

__Section 2 : Exploration__

- __2.1__ Exploratory Data Analysis, e.g. plots of prices/volumes that could show the strategy working and how much potential it has.

- __2.2__ Define some 'strategy metrics': metrics that can be used to gauge whether this strategy will work, e.g. the number of price points above a certain profitable threshold. Metrics could show how often there is an opportunity to make a trade and how much 'value' is in an opportunity, e.g. how large is the price swing?


__Section 3 : Strategy testing__

- __3.1__ Testing strategy on previous data. 

- __3.2__ State any assumptions made by testing.

- __3.3__ Model refinements. How could strategy be optimised? Careful : is this backfitting/overfitting - what measures taken to negate this e.g. bootstrapping?

- __3.4__ Assessing strategy. P/L on data sample? ROI? variance in results? longest losing run?

__Section 4 : Practical requirements__

- __4.1__ Identify if this edge is ‘realisable’? What methods will you apply to extract this value? e.g. applying a hedge function


- __4.2__ Is it possible to quantify the potential profit from the strategy? Consideration : How long will it take to obtain this? How 'risky' is it? e.g. if something did go wrong, how much do we lose? 

- __4.3__ Strategy limitations. The factors that could prevent strategy working e.g. practical considerations e.g. reacting quick enough to market updates, volume behind a price, size of bankroll needed


__Section 5: Potential limitations__

- __5.1__ What is our 'competition' - if not quantifiable, do we suspect people are doing the same thing? 

- __5.2__ So what's our edge? Identify ways of finding this edge in future? e.g what features are there? Are they predictive? Is there a certain 'market/runner' profile?





### Section 0 : Setup

In [188]:
# importing packages
from pathlib import Path, PurePath 

import pandas as pd
import numpy as np
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

import utils

In [189]:
# reading in data
project_dir = Path.cwd().parents[2]
data_dir = project_dir / 'data' / 'processed' / 'api' / 'advanced' / 'adv_data.csv'
df = pd.read_csv(data_dir, index_col = 0)
print(df.shape)
df.head()

(13073, 307)


Unnamed: 0,SelectionId,MarketId,Venue,Distance,RaceType,BSP,NoRunners,BS:T-60,BS:T-59,BS:T-58,...,LS:T+5,LS:T+6,LS:T+7,LS:T+8,LS:T+9,LS:T+10,LS:T+11,LS:T+12,LS:T+13,LS:T+14
0,11986132,1.169028,Huntingdon,20.0,Chase,8.33,9,16.43,24.51,26.57,...,10.08,11.15,5.44,7.09,14.16,19.53,3.12,3.31,0.68,0.68
1,16800725,1.169028,Huntingdon,20.0,Chase,3.68,9,15.43,25.74,57.82,...,29.87,221.22,43.23,43.1,13.53,26.15,13.6,74.3,419.52,23082.1
2,20968322,1.169028,Huntingdon,20.0,Chase,14.96,9,9.87,9.25,9.15,...,37.32,6.83,4.85,11.23,16.0,5.68,40.25,12.51,10.42,13.17
3,22023486,1.169028,Huntingdon,20.0,Chase,4.25,9,84.38,64.49,58.01,...,11.67,2.02,2.02,2.02,2.02,2.02,2.02,2.02,2.02,2.02
4,24496216,1.169028,Huntingdon,20.0,Chase,6.6,9,10.64,10.11,7.91,...,34.27,54.72,11.85,17.99,48.21,17.28,38.29,6.96,4.37,4.37


### Section 1 : Idea

__1.1 Idea__

Prices in a market adjust such that the sum of the implied probabilities of all horses (the reciprocals of their decimal odds) is approximately equal to one. If one horse's price drifts, another horse's price, or several others' prices, should be backed in.

One traditional strategy considering this price behaviour is 'pairs trading': *"A pairs trade or pair trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement. This strategy is categorized as a statistical arbitrage and convergence trading strategy."*
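To make the "implied probabilities sum to roughly one" claim concrete, here is a minimal sketch. The prices and runner names are hypothetical illustrative values, not taken from the data set.

```python
# Hypothetical decimal back prices for a four-runner race (illustrative only).
prices = {"horse_a": 2.5, "horse_b": 4.0, "horse_c": 5.0, "horse_d": 6.0}

# The implied probability of each runner is the reciprocal of its decimal price.
implied = {name: 1.0 / price for name, price in prices.items()}

# The "book" is the sum of implied probabilities across all runners.
book = sum(implied.values())
print(f"book = {book:.4f}")

# If horse_a drifts from 2.5 to 3.0 and the book is to stay the same size,
# the other runners must absorb this much implied probability in total,
# which means at least one of their prices has to shorten.
freed = 1.0 / 2.5 - 1.0 / 3.0
print(f"implied probability to be absorbed by other runners = {freed:.4f}")
```

This is the mechanical link that a pairs strategy relies on: a move in one runner's price has to be offset somewhere else in the same market.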

__1.2  Reasoning__

Why is there an edge here?
- .

### Section 2 : Pairs trading

__2.1__ **- Idea**

[Bebbington, PA (2017)](https://discovery.ucl.ac.uk/id/eprint/1563501/) looks at pairs trading in horse racing markets. The following outlines their method for analysing this strategy. In 2.2, each step will be attempted.

* The 'signals' are the best back or lay price available at a given timestamp.
* Statistical methods are used to analyse horses' pricing data for comparison, in particular to calculate a hedge ratio and for stake weighting. In the paper, the time series is made up of non-overlapping windows of data (for example, price observations 1-5, 6-10, 11-15), and trades are made at the end of each window. This simulates an algorithm reacting to live data. This example studies movement over the first 30 price points and makes bets in the remaining 30 periods.
* A z-score transformation of the log of the decimal odds is used to standardise prices. This makes the relative directional movement in different prices comparable by accounting for their respective variances. 
* Pairs are discovered by analysing the sum of squared distances between two horses' prices throughout time. Those that move the least relative to each other are the best candidates for pairs.
* Once pairs are identified, the 'spread' between their prices (on average, or at the end of each window) is compared to a minimum size requirement for a bet to be made, $\phi$.
* The 'hedging ratio' is found using an OLS regression of the price of one of the horses on the price of the other. Since the two prices form a pair but will have different variances and absolute values, their movement relative to each other must be considered to make the strategy 'cost neutral'. The ratio is also used to define the stake size. 
* The final observed spread indicates on which horse a 'back-to-lay' hedge must be made (the one expected to be backed in) and on which a 'lay-to-back' hedge must be made (the one expected to drift). This spread is compared to an interval [?], likely a confidence interval of past spreads or simply the interval of observed spreads. If the spread is greater or smaller than usual, the bets are placed. 
* In the paper it appears that both sides of the hedge bet are made at the same point in time.
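The non-overlapping windows described in the list above (observations 1-5, 6-10, 11-15) can be sketched as follows. The helper name `non_overlapping_windows` is hypothetical, and dropping an incomplete trailing window is an assumption made here rather than something stated in the paper.

```python
import numpy as np

def non_overlapping_windows(series, window):
    """Split a 1-D sequence into consecutive, non-overlapping windows.

    Trailing observations that do not fill a whole window are dropped
    (an assumption of this sketch).
    """
    values = np.asarray(series)
    n_windows = len(values) // window
    return values[: n_windows * window].reshape(n_windows, window)

# Observation numbers 1..15 stand in for a price series.
obs = np.arange(1, 16)
windows = non_overlapping_windows(obs, window=5)
print(windows)

# A trade decision would be taken at the last observation of each window.
window_ends = windows[:, -1]
print(window_ends)
```

Because no observation appears in two windows, each decision uses only data that would already have been seen at that point in a live market.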

__2.2__ **- Setup**

**Data**

The following example will be set up with a random race and will identify tradeable pairs (or that there are none). Three DataFrames are created: (1) the unchanged race sample DataFrame with one row per horse and data going along in columns, (2) a back prices DataFrame with one column per horse and prices going through time in rows, (3) the same for lay prices. This analysis looks at prices before the race begins.

There are 60 price data points for each horse, finishing at the beginning of the race.

Variables:
* $BP_{t}^{i}$ is back price for horse i at time t.
* $LP_{t}^{i}$ is lay price for horse i at time t.

In [207]:
# defining variables
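# match on the column prefix so that 'BSP' is not picked up as a back-size column
back_prices = [col for col in df.columns if col.startswith('BP:')]
back_sizes = [col for col in df.columns if col.startswith('BS:')]
lay_prices = [col for col in df.columns if col.startswith('LP:')]
lay_sizes = [col for col in df.columns if col.startswith('LS:')]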

#runner_info = ['SelectionId', 'MarketId', 'Venue', 'Distance', 'RaceType', 'BSP', 'NoRunners']

sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()] 

bp_df = sample_df[['SelectionId'] + back_prices].copy()
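# the patterns are regex character classes, so regex=True must be explicit in recent pandas
new_cols = bp_df.columns.str.replace("[BP:T]", "", regex=True).str.replace("[+]", "", regex=True)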
bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
bp_t_df = bp_df.T.copy()
bp_t_df.columns = ["h" + str(column) for column in bp_t_df.iloc[0]]
bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
bp_t_df.reset_index(drop=True, inplace=True)

lp_df = sample_df[['SelectionId'] + lay_prices].copy()
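# the patterns are regex character classes, so regex=True must be explicit in recent pandas
new_cols = lp_df.columns.str.replace("[LP:T]", "", regex=True).str.replace("[+]", "", regex=True)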
lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
lp_t_df = lp_df.T.copy()
lp_t_df.columns = ["h" + str(column) for column in lp_t_df.iloc[0]]
lp_t_df = lp_t_df.iloc[1:-15]
lp_t_df.reset_index(drop=True, inplace=True)

# bsp_df = plot_df[['BSP']].copy()
# bsp_df['min_bp'] = bsp_df['BSP'].apply(lambda x: round(utils.back_hedge_min_bp(x, 0.05), 2))
# bsp_df['max_lp'] = bsp_df['BSP'].apply(lambda x: round(utils.lay_hedge_max_lp(x, 0.05), 2))    

bp_t_df.head()

Unnamed: 0,h24258423.0,h24317226.0,h25105923.0,h27188367.0,h27431314.0,h27632374.0,h27632375.0,h27632376.0,h307936.0,h4873068.0,h891221.0
0,8.39,11.0,27.17,6.6,155.33,22.65,55.0,4.5,4.8,11.66,21.0
1,8.38,11.0,28.57,6.6,154.67,21.95,55.0,4.55,4.78,12.0,21.0
2,8.35,11.0,29.23,6.72,155.0,21.85,55.0,4.7,4.8,12.0,22.0
3,8.21,11.0,30.09,6.74,153.33,23.0,61.0,4.7,4.81,12.09,22.0
4,8.4,11.0,32.0,6.6,151.54,23.0,65.0,4.8,4.8,12.51,22.0


__2.3__ **- Z-score transformation**

Bebbington standardises prices by taking the natural logarithm of each price and then standardising it with a Z-score transformation. 

Taking $P_{t}^{i} = \ln(BP_{t}^{i})$, this means finding 

#### $P_{t}^{'(i)} = \frac{P_{t}^{i}-\overline{P}^{i}}{\sigma^{(i)}}$

where $\overline{P}^{i}$ and $\sigma^{(i)}$ are the mean and standard deviation of horse $i$'s log price over the time series.

The Z-score transformation expresses each observation as a number of standard deviations from that horse's own sample mean. This makes variations comparable between horses whose prices differ in level and volatility.
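As a small check on the formula, the sketch below applies the same log-then-z-score transformation column-wise to a toy DataFrame. The prices and the names `toy_prices`, `h1` and `h2` are hypothetical. Note that pandas' `std()` uses `ddof=1` by default, i.e. the sample standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical back prices for two runners over five timestamps.
toy_prices = pd.DataFrame({
    "h1": [4.0, 4.2, 4.1, 4.4, 4.3],
    "h2": [20.0, 22.0, 21.0, 25.0, 24.0],
})

# Natural log of the decimal prices.
log_prices = np.log(toy_prices)

# Column-wise z-score: subtract each column's mean, divide by its sample s.d.
z_scores = (log_prices - log_prices.mean()) / log_prices.std()
print(z_scores.round(3))

# After the transformation every column has mean 0 and sample s.d. 1,
# which is what makes the two runners' movements comparable.
print(z_scores.mean().round(6), z_scores.std().round(6))
```

The vectorised form avoids the per-column loop, but the arithmetic is the same.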

The following will look at the first 30 observations.

In [208]:
z_bp_df = bp_t_df.copy()
z_bp_df = z_bp_df[:30] #first 30 observations
z_bp_df = np.log(z_bp_df)

for column in z_bp_df.columns:
    mean = z_bp_df[column].mean()
    sd = np.std(z_bp_df[column], ddof = 1)
    z_bp_df[column] = z_bp_df[column].apply(lambda x: (x - mean) / sd)
    
z_bp_df

Unnamed: 0,h24258423.0,h24317226.0,h25105923.0,h27188367.0,h27431314.0,h27632374.0,h27632375.0,h27632376.0,h307936.0,h4873068.0,h891221.0
0,0.259505,0.398284,-2.807325,-0.417476,-0.553476,1.42933,-1.253329,0.781824,-1.058465,-1.80642,0.312009
1,0.202046,0.398284,-1.507341,-0.417476,-0.576611,0.868995,-1.253329,0.902248,-1.15068,-1.584869,0.312009
2,0.029254,0.398284,-0.916431,-0.059515,-0.565031,0.787492,-1.253329,1.255735,-1.058465,-1.584869,1.252829
3,-0.785403,0.398284,-0.166169,-0.000477,-0.623888,1.703036,-0.840812,1.255735,-1.012502,-1.527274,1.252829
4,0.316897,0.398284,1.426166,-0.417476,-0.687691,1.703036,-0.587768,1.485181,-1.058465,-1.264044,1.252829
5,0.316897,0.398284,1.426166,-0.75137,-1.118049,1.703036,-0.744028,1.485181,-1.058465,-1.009507,-0.323864
6,0.316897,0.398284,1.426166,-1.028794,-1.118049,1.703036,-0.906667,1.485181,-1.058465,-0.676982,-0.674722
7,0.545781,0.398284,1.377608,-1.247278,-1.118049,1.492262,-0.844079,1.485181,-1.058465,-0.676982,-1.11428
8,0.716733,0.398284,1.04333,-2.310934,-0.964987,0.062255,-0.587768,1.068636,-0.92086,-0.563626,-1.031775
9,0.659816,0.398284,1.288347,-2.310934,-1.230193,0.079262,-0.587768,1.021355,-0.603078,-0.396656,-0.705081


__2.4__ **- Sum of squared distances**

Following Bebbington, to select pairs, we create a matrix (DataFrame, in this case) of the sum of squared distances between pairs of horses throughout the time series.

$\Theta _{ij} = \left\{\begin{matrix}
\sum_{t=1}^{M}(P_{t}^{'(i)} - P_{t}^{'(j)})^{2}, & i\neq j\\ 
0, & i=j
\end{matrix}\right.$

In [209]:
ids = [column for column in z_bp_df.columns]

matrix = z_bp_df.iloc[0:0].copy()

matrix.insert(0, "horse", np.array(ids))
matrix = matrix.set_index("horse", drop = False)
del matrix["horse"]

for column in matrix.columns:
    for row in matrix.index:
        if column == row:
            matrix.loc[row, column] = np.nan
        else: 
            matrix.loc[row, column] = ((z_bp_df[row] - z_bp_df[column]) ** 2).sum() or np.nan
            
for x in range(len(ids)):
    for y in range(x, len(ids)):
        matrix.iloc[x, y] = np.nan
        
matrix

horse,h24258423.0,h24317226.0,h25105923.0,h27188367.0,h27431314.0,h27632374.0,h27632375.0,h27632376.0,h307936.0,h4873068.0,h891221.0
h24258423.0,,,,,,,,,,,
h24317226.0,56.263269,,,,,,,,,,
h25105923.0,36.650312,52.17693,,,,,,,,,
h27188367.0,90.581173,60.290753,90.076189,,,,,,,,
h27431314.0,82.898226,92.704608,81.929209,29.154914,,,,,,,
h27632374.0,41.294465,34.722962,44.203078,83.237158,102.83338,,,,,,
h27632375.0,63.887154,89.532533,63.857701,39.956689,9.490341,101.756136,,,,,
h27632376.0,45.33926,33.215646,35.049601,95.408522,104.354975,7.895305,104.779381,,,,
h307936.0,72.450253,83.284411,71.832632,33.463362,7.351403,107.163739,3.422524,108.224649,,,
h4873068.0,62.229249,83.788321,61.106487,33.505657,15.073342,106.150892,6.499771,110.486924,7.00384,,


In [210]:
horse_x = matrix.min(axis=1).idxmin()
horse_y = matrix.min().idxmin()
sss = matrix.min().min()

print(f"Pair found: horse {horse_x} and horse {horse_y} with sum of squared spreads equal to {sss}.")

Pair found: horse h307936.0 and horse h27632375.0 with sum of squared spreads equal to 3.4225237224683864.
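The nested loops above can also be written with NumPy broadcasting. The following is a minimal sketch on toy data. The function name `closest_pair` is hypothetical, and it assumes the input has no missing values. Unlike the loop version, it does not treat a distance of exactly zero as missing.

```python
import numpy as np
import pandas as pd

def closest_pair(z_df):
    """Return (row_name, col_name, distance) for the pair of columns
    with the smallest sum of squared distances."""
    values = z_df.to_numpy(dtype=float)               # shape (T, n)
    diffs = values[:, :, None] - values[:, None, :]   # shape (T, n, n)
    ssd = (diffs ** 2).sum(axis=0)                    # shape (n, n)
    # Mask the diagonal and upper triangle so each pair is considered once.
    ssd[np.triu_indices_from(ssd)] = np.inf
    i, j = np.unravel_index(np.argmin(ssd), ssd.shape)
    return z_df.columns[i], z_df.columns[j], ssd[i, j]

# Toy standardised series: 'a' and 'b' move together, 'c' moves against them.
toy = pd.DataFrame({
    "a": [0.0, 1.0, 2.0],
    "b": [0.0, 1.0, 2.5],
    "c": [2.0, 0.0, -1.0],
})
row_name, col_name, distance = closest_pair(toy)
print(row_name, col_name, distance)
```

The row/column convention matches the lower-triangular matrix above: the first name returned is the row label and the second is the column label.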


__2.5__ **- Regression of prices of horse Y on horse X**

With $x = \left \{P_{1}^{X}, P_{2}^{X}, \ldots , P_{M}^{X} \right \}$ and $y = \left \{P_{1}^{Y}, P_{2}^{Y}, \ldots , P_{M}^{Y} \right \}$ the log prices of horses X and Y, where $M$ is the final time period in the window, we carry out the OLS regression of $y_{t}$ on $x_{t}$. The estimate of $\beta$ is the hedging ratio, giving the relative holding of $x_{t}$ for a cost-neutral hedge position.

$y_{t} = \alpha + \beta x_{t} + \varepsilon_{t}$

The intercept $\alpha$ is included because the regression below is fitted with a constant term. Using the estimates and the end-of-window observations at time $M$, we get the spread at the end of the window.

$\varepsilon_{M} = y_{M} - \hat{\alpha} - \hat{\beta} x_{M}$.

If $\varepsilon_{M}$ lies outside an interval of past spread values, and the deviation is large enough that a hedge bet would be profitable should the spread revert to its mean, bets can be made.
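The sketch below shows the regression and the interval check end to end on synthetic data. It includes an intercept, as the notebook's statsmodels fit does. The data are simulated, and the band of `k` sample standard deviations of the residuals (with `k = 2.0`) is a hypothetical stand-in for the interval, which the text above leaves unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log prices for two runners (illustrative, not from the data set).
M = 30
x = np.log(5.0) + np.cumsum(rng.normal(0.0, 0.01, M))
y = 1.0 + 0.8 * x + rng.normal(0.0, 0.01, M)

# OLS of y on a constant and x, solved by least squares.
design = np.column_stack([np.ones(M), x])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(design, y, rcond=None)

# Residuals are the fitted 'spread'; the last one is the end-of-window spread.
residuals = y - alpha_hat - beta_hat * x
eps_M = residuals[-1]

# Hypothetical trading band: k sample standard deviations of the residuals.
# k = 2.0 is an arbitrary illustrative choice, not a value from the paper.
k = 2.0
band = k * residuals.std(ddof=1)
signal = abs(eps_M) > band

print(f"beta_hat = {beta_hat:.4f}, eps_M = {eps_M:.4f}, band = {band:.4f}")
print("trade" if signal else "no trade")
```

One caveat of this simple band: the final residual is itself part of the sample used to fit the line and to estimate the band, so a stricter test would estimate both on earlier observations only.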

In [223]:
#regression setup
reg_df = bp_t_df[[horse_y, horse_x]][:30].copy() #non-standardised prices
reg_df_nonlog = reg_df.copy() #for the sake of visualisation (not used)
reg_df = np.log(reg_df) 
reg_df['const'] = 1
reg_df['spread'] = reg_df[horse_y] - reg_df[horse_x]
reg_df.tail(10)

Unnamed: 0,h27632375.0,h307936.0,const,spread
20,4.468778,1.656321,1,2.812456
21,4.553877,1.65058,1,2.903297
22,4.582413,1.660131,1,2.922282
23,4.60517,1.686399,1,2.918771
24,4.60517,1.690096,1,2.915074
25,4.639572,1.699279,1,2.940293
26,4.757891,1.669592,1,3.088299
27,4.787492,1.678964,1,3.108528
28,4.705016,1.665818,1,3.039197
29,4.678607,1.671473,1,3.007133


In [224]:
#regression fit and results
reg = sm.OLS(endog=reg_df[horse_y], exog=reg_df[['const', horse_x]], missing='drop')

results = reg.fit()

print(results.summary())

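# index the fitted parameters by label; positional indexing of a labelled Series is deprecated
constant = results.params['const']
beta = results.params[horse_x]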
print(f"\nHedge ratio beta = {beta}.")

                            OLS Regression Results                            
Dep. Variable:            h27632375.0   R-squared:                       0.885
Model:                            OLS   Adj. R-squared:                  0.881
Method:                 Least Squares   F-statistic:                     216.5
Date:                Wed, 17 Jun 2020   Prob (F-statistic):           1.06e-14
Time:                        20:00:01   Log-Likelihood:                 31.913
No. Observations:                  30   AIC:                            -59.83
Df Residuals:                      28   BIC:                            -57.02
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -4.1104      0.573     -7.169      0.0

In [230]:
#estimated final period (T=30) spread
spread = reg_df[horse_y].iloc[29].item() - constant - beta * reg_df[horse_x].iloc[29].item()

print(f"Final period estimated spread (in log prices) epsilon = {spread}.")

if spread > 0:
    print("Positive spread: horse Y has drifted from the mean, horse X has been backed in. If mean reversion occurs horse Y will be backed in and horse X will drift. Back-to-lay hedge Y and lay-to-back hedge X.")
else:
    print("Negative spread: horse X has drifted from the mean, horse Y has been backed in. If mean reversion occurs horse X will be backed in and horse Y will drift. Back-to-lay hedge X and lay-to-back hedge Y.")

Final period estimated spread (in log prices) epsilon = 0.07015191008192367.
Positive spread: horse Y has drifted from the mean, horse X has been backed in. If mean reversion occurs horse Y will be backed in and horse X will drift. Back-to-lay hedge Y and lay-to-back hedge X.
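The arithmetic behind a back-to-lay hedge can be sketched as below, ignoring commission. The function name `back_to_lay_profit` is hypothetical and is not the helper in `utils`. It uses the standard equalising lay stake, back stake × back price / lay price.

```python
def back_to_lay_profit(back_price, lay_price, back_stake=1.0):
    """Back first, lay later, with the lay stake chosen so that the profit
    is the same whether or not the horse wins. Commission is ignored."""
    lay_stake = back_stake * back_price / lay_price
    # Horse wins: collect on the back bet, pay out on the lay bet.
    profit_if_win = back_stake * (back_price - 1) - lay_stake * (lay_price - 1)
    # Horse loses: lose the back stake, keep the lay stake.
    profit_if_lose = lay_stake - back_stake
    return lay_stake, profit_if_win, profit_if_lose

# Back at 10.0, then lay at 8.0 after the horse has been backed in.
lay_stake, win, lose = back_to_lay_profit(10.0, 8.0)
print(lay_stake, win, lose)
```

The hedge is only profitable if the lay price ends up below the back price, which is why it is placed on the horse expected to be backed in. A lay-to-back hedge is the mirror image, placed on the horse expected to drift.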
