## Strategy Idea 2 : "Comovement - pairs trading"

__Section 0: Setup__ Importing packages/reading in data etc.

__Section 1 : Idea__ 

- __1.1__ Strategy idea

- __1.2__ Origin of idea. Context/Reasoning for strategy to work e.g. use in financial markets?

__Section 2 : Exploration__

- __2.1__ Exploratory Data Analysis, e.g. plots of prices/volumes that could show the strategy working and how much potential it has.

- __2.2__ Define some 'strategy metrics': metrics that can be used to gauge whether this strategy will work, e.g. the number of price points above a certain profitable threshold. Metrics could show how often there is an opportunity to make a trade and how much 'value' is in an opportunity, e.g. how large the price swing is.


__Section 3 : Strategy testing__

- __3.1__ Testing strategy on previous data. 

- __3.2__ State any assumptions made by testing.

- __3.3__ Model refinements. How could the strategy be optimised? Careful: is this backfitting/overfitting? What measures are taken to guard against this, e.g. bootstrapping?

- __3.4__ Assessing strategy. P/L on data sample? ROI? variance in results? longest losing run?

__Section 4 : Practical requirements__

- __4.1__ Identify if this edge is ‘realisable’? What methods will you apply to extract this value? e.g. applying a hedge function


- __4.2__ Is it possible to quantify the potential profit from the strategy? Consideration : How long will it take to obtain this? How 'risky' is it? e.g. if something did go wrong, how much do we lose? 

- __4.3__ Strategy limitations. The factors that could prevent the strategy working, e.g. practical considerations such as reacting quickly enough to market updates, the volume behind a price, or the size of bankroll needed.


__Section 5: Potential limitations__

- __5.1__ What is our 'competition' - if not quantifiable, do we suspect people are doing the same thing? 

- __5.2__ So what's our edge? How can we identify this edge in future, e.g. what features are there? Are they predictive? Is there a certain 'market/runner' profile?





### Section 0 : Setup

In [5]:
# importing packages
from pathlib import Path, PurePath 

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import utils

In [6]:
# reading in data
project_dir = Path.cwd().parents[2]
data_dir = project_dir / 'data' / 'processed' / 'api' / 'advanced' / 'adv_data.csv'
df = pd.read_csv(data_dir, index_col = 0)
print(df.shape)
df.head()

(13073, 307)


Unnamed: 0,SelectionId,MarketId,Venue,Distance,RaceType,BSP,NoRunners,BS:T-60,BS:T-59,BS:T-58,...,LS:T+5,LS:T+6,LS:T+7,LS:T+8,LS:T+9,LS:T+10,LS:T+11,LS:T+12,LS:T+13,LS:T+14
0,11986132,1.169028,Huntingdon,20.0,Chase,8.33,9,16.43,24.51,26.57,...,10.08,11.15,5.44,7.09,14.16,19.53,3.12,3.31,0.68,0.68
1,16800725,1.169028,Huntingdon,20.0,Chase,3.68,9,15.43,25.74,57.82,...,29.87,221.22,43.23,43.1,13.53,26.15,13.6,74.3,419.52,23082.1
2,20968322,1.169028,Huntingdon,20.0,Chase,14.96,9,9.87,9.25,9.15,...,37.32,6.83,4.85,11.23,16.0,5.68,40.25,12.51,10.42,13.17
3,22023486,1.169028,Huntingdon,20.0,Chase,4.25,9,84.38,64.49,58.01,...,11.67,2.02,2.02,2.02,2.02,2.02,2.02,2.02,2.02,2.02
4,24496216,1.169028,Huntingdon,20.0,Chase,6.6,9,10.64,10.11,7.91,...,34.27,54.72,11.85,17.99,48.21,17.28,38.29,6.96,4.37,4.37


### Section 1 : Idea

__1.1 Idea__

Prices in a market adjust so that the implied probabilities of all horses (the reciprocals of their decimal odds) sum to approximately one. If one horse's price drifts, one or more of the others should be backed in.
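
As a rough illustration of that adjustment, the implied probability of a runner is the reciprocal of its decimal odds. The sketch below uses a made-up three-runner market (the odds and the helper name `implied_probabilities` are hypothetical, not taken from the data set) to show how much probability is freed up when one runner drifts.

```python
import numpy as np

def implied_probabilities(decimal_odds):
    """Convert decimal odds to implied probabilities (1 / odds)."""
    odds = np.asarray(decimal_odds, dtype=float)
    return 1.0 / odds

# Hypothetical three-runner market, not taken from the data set
odds_before = [2.0, 4.0, 4.0]
book_before = implied_probabilities(odds_before).sum()  # 0.5 + 0.25 + 0.25

# Runner 0 drifts from 2.0 to 2.5 while the others have not yet moved
odds_after = [2.5, 4.0, 4.0]
book_after = implied_probabilities(odds_after).sum()    # 0.4 + 0.25 + 0.25

# Probability that the other runners would need to absorb (by shortening)
# for the book to return to its previous total
gap = book_before - book_after
print(f"book before: {book_before:.2f}, after: {book_after:.2f}, gap: {gap:.2f}")
```

In a real market the book total is usually not exactly one, so the gap is only an approximate guide to how far the other prices might move.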

One traditional strategy considering this price behaviour is 'pairs trading': *"A pairs trade or pair trading is a market neutral trading strategy enabling traders to profit from virtually any market conditions: uptrend, downtrend, or sideways movement. This strategy is categorized as a statistical arbitrage and convergence trading strategy."*

__1.2  Reasoning__

Why is there an edge here?
- .

### Section 2 : Exploration

__2.1__ Sample covariance

To explore this phenomenon, an example covariance matrix is computed below for a randomly sampled race.

In [7]:
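# pick one random race and transpose it so that each horse becomes a column
sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()] 

transpose_df = sample_df.T.copy()
# after transposing, the first row holds the SelectionId values; use them as column names
transpose_df.columns = ["h" + str(column) for column in transpose_df.iloc[0]]
# drop the seven runner-info rows (SelectionId ... NoRunners), keeping the time series
# note: the remaining rows hold back/lay sizes as well as back/lay prices
transpose_df = transpose_df.iloc[7:]
transpose_df.reset_index(drop=True, inplace=True)
transpose_df = transpose_df.astype(float)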

transpose_df.head()

Unnamed: 0,h11244806,h12090523,h12192145,h12845992,h13306150,h13957361,h18705419,h19450775,h5542918,h8487860,h8971553
0,39.14,9.06,72.66,8.46,24.79,4.81,7.27,8.82,8.32,10.64,8.63
1,18.99,9.56,56.09,8.85,18.22,5.7,5.79,8.33,9.9,2.91,9.77
2,12.62,3.23,71.7,9.74,15.48,4.26,7.71,10.25,10.25,4.21,6.77
3,18.92,6.27,32.39,10.08,12.68,4.06,8.2,11.94,11.63,23.68,8.74
4,19.32,7.59,39.45,9.74,12.16,1.4,5.71,13.23,12.65,22.08,8.58


In [8]:
transpose_df.cov()

Unnamed: 0,h11244806,h12090523,h12192145,h12845992,h13306150,h13957361,h18705419,h19450775,h5542918,h8487860,h8971553
h11244806,5057.596068,3042.614034,5220.20667,3978.791071,2982.875886,2305.75529,4389.948645,3687.103195,1968.640336,3787.984341,2457.141164
h12090523,3042.614034,18317.61266,-4326.68698,5618.97872,13682.43209,11435.612581,6782.301107,8997.05838,16847.349969,6883.382656,17789.712905
h12192145,5220.20667,-4326.68698,739633.152741,-2208.945746,-2202.694772,-11429.279037,119.705006,-4098.499676,-8684.285836,-2009.19051,-7529.805195
h12845992,3978.791071,5618.97872,-2208.945746,6235.580012,4539.2054,6826.755501,6305.753799,6572.609879,4745.586164,5965.035214,5660.239183
h13306150,2982.875886,13682.43209,-2202.694772,4539.2054,11952.527178,8948.579922,5554.804698,7239.828497,12531.551305,5547.805481,13455.474538
h13957361,2305.75529,11435.612581,-11429.279037,6826.755501,8948.579922,18090.650796,7995.92933,11692.748348,10867.699546,8774.62351,13134.742557
h18705419,4389.948645,6782.301107,119.705006,6305.753799,5554.804698,7995.92933,7338.64229,7965.35882,5352.951808,7002.657954,6553.675429
h19450775,3687.103195,8997.05838,-4098.499676,6572.609879,7239.828497,11692.748348,7965.35882,10151.350927,8028.59561,8203.302551,9221.890312
h5542918,1968.640336,16847.349969,-8684.285836,4745.586164,12531.551305,10867.699546,5352.951808,8028.59561,27540.778324,5626.284885,16801.267716
h8487860,3787.984341,6883.382656,-2009.19051,5965.035214,5547.805481,8774.62351,7002.657954,8203.302551,5626.284885,7200.873796,6903.607076


In [9]:
transpose_df.corr()

Unnamed: 0,h11244806,h12090523,h12192145,h12845992,h13306150,h13957361,h18705419,h19450775,h5542918,h8487860,h8971553
h11244806,1.0,0.316112,0.085351,0.708501,0.383648,0.241054,0.720576,0.514578,0.166804,0.627688,0.255388
h12090523,0.316112,1.0,-0.037172,0.525756,0.924696,0.6282,0.584972,0.659788,0.750083,0.599342,0.971575
h12192145,0.085351,-0.037172,1.0,-0.032527,-0.023427,-0.098806,0.001625,-0.047299,-0.060847,-0.027531,-0.064717
h12845992,0.708501,0.525756,-0.032527,1.0,0.525789,0.64276,0.932161,0.826109,0.362129,0.890189,0.529832
h13306150,0.383648,0.924696,-0.023427,0.525789,1.0,0.608551,0.593104,0.657259,0.690696,0.597997,0.909727
h13957361,0.241054,0.6282,-0.098806,0.64276,0.608551,1.0,0.693959,0.862835,0.486881,0.768792,0.721833
h18705419,0.720576,0.584972,0.001625,0.932161,0.593104,0.693959,1.0,0.922859,0.376528,0.963302,0.565482
h19450775,0.514578,0.659788,-0.047299,0.826109,0.657259,0.862835,0.922859,1.0,0.480164,0.959476,0.67655
h5542918,0.166804,0.750083,-0.060847,0.362129,0.690696,0.486881,0.376528,0.480164,1.0,0.399522,0.748335
h8487860,0.627688,0.599342,-0.027531,0.890189,0.597997,0.768792,0.963302,0.959476,0.399522,1.0,0.601347


### Section 3 : Pairs trading

__3.1__ **- Idea**

[Bebbington, PA (2017)](https://discovery.ucl.ac.uk/id/eprint/1563501/) looks at pairs trading in horse racing markets. The following outlines their method for analysing this strategy. In 3.2, each step will be attempted.

* The 'signals' are the best match or lay price available at a given timestamp.
* Statistical methods are used to analyse horses' pricing data for comparison, in particular to calculate a hedge ratio and for stake weighting. In the paper, the time series is split into non-overlapping windows of data (for example, price observations 1-5, 6-10, 11-15), and trades are made at the end of each window. This simulates an algorithm reacting to live data. The example here studies price movement over the first 30 price points and makes bets in the remaining 30 periods.
* A z-score transformation of the log of the decimal odds is used to standardise prices. This makes the relative directional movement in different prices comparable by accounting for their respective variances. 
* Pairs are discovered by analysing the sum of squared distances between two horses' prices throughout time. Those that move the least relative to each other are the best candidates for pairs.
* Once pairs are identified, the 'spread' between their prices (on average, or at the end of each window) is compared to a minimum size requirement for a bet to be made, $\phi$.
* The 'hedging ratio' is found using an OLS regression of the price of one horse on the price of the other. Although the two horses form a pair, their prices have different variances and absolute values, so their movement relative to each other must be considered to make the strategy 'cost neutral'. The hedging ratio is also used to define the stake size. 
* The final observed spread indicates on which horse a 'back-to-lay' hedge must be made (the one expected to be backed in) and on which a 'lay-to-back' hedge must be made (the one expected to drift). This spread is compared to an interval [?], likely a confidence interval of past spreads or simply the interval of observed spreads. If the spread is larger or smaller than usual, the bets are placed. 
* In the paper it appears that both sides of the hedge bet are made at the same point in time.
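
The hedging-ratio and spread steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Bebbington's implementation: the two price series are synthetic, the spread is taken to be the OLS residual, and the threshold `phi` is set arbitrarily to two standard deviations of the earlier spreads. `np.polyfit` with `deg=1` is used as a plain OLS fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic standardised log-prices for two hypothetical horses (not real data):
# horse B follows horse A with slope 0.8 plus a little noise.
n_obs = 30
price_a = rng.normal(size=n_obs).cumsum()
price_b = 0.8 * price_a + rng.normal(scale=0.05, size=n_obs)

# OLS regression of B on A: price_b ~ intercept + hedge_ratio * price_a
hedge_ratio, intercept = np.polyfit(price_a, price_b, deg=1)

# Spread (an assumption here): residual of B relative to its fitted value from A
spread = price_b - (intercept + hedge_ratio * price_a)

# Compare the last observed spread with a minimum size phi (arbitrary choice)
phi = 2.0 * spread[:-1].std(ddof=1)
last_spread = spread[-1]
signal = abs(last_spread) > phi

print(f"hedge ratio: {hedge_ratio:.3f}, last spread: {last_spread:.3f}, trade: {signal}")
```

With windowed data, the same calculation would be repeated at the end of each window using only the observations seen so far.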

__3.2__ **- Setup**

**Data**

The following example will be set up with a random race and will identify tradeable pairs (or that there are none). Three DataFrames are created: (1) the unchanged race sample DataFrame with one row per horse and data going along in columns, (2) a back prices DataFrame with one column per horse and prices going through time in rows, (3) the same for lay prices. This analysis looks at prices before the race begins.

There are 60 price data points for each horse, finishing at the beginning of the race.

Variables:
* $BP_{t}^{i}$ is back price for horse i at time t.
* $LP_{t}^{i}$ is lay price for horse i at time t.

In [10]:
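# defining variables
# match on the column prefix: a plain substring test for 'BS' would also pick up 'BSP'
back_prices = [col for col in df.columns if col.startswith('BP:')]
back_sizes = [col for col in df.columns if col.startswith('BS:')]
lay_prices = [col for col in df.columns if col.startswith('LP:')]
lay_sizes = [col for col in df.columns if col.startswith('LS:')]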

#runner_info = ['SelectionId', 'MarketId', 'Venue', 'Distance', 'RaceType', 'BSP', 'NoRunners']

sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()] 

bp_df = sample_df[['SelectionId'] + back_prices].copy()
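# strip the 'BP:T' prefix and '+' sign literally (regex=False works across pandas versions)
new_cols = bp_df.columns.str.replace("BP:T", "", regex=False).str.replace("+", "", regex=False)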
bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
bp_t_df = bp_df.T.copy()
bp_t_df.columns = ["h" + str(column) for column in bp_t_df.iloc[0]]
bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
bp_t_df.reset_index(drop=True, inplace=True)

lp_df = sample_df[['SelectionId'] + lay_prices].copy()
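# strip the 'LP:T' prefix and '+' sign literally (regex=False works across pandas versions)
new_cols = lp_df.columns.str.replace("LP:T", "", regex=False).str.replace("+", "", regex=False)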
lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
lp_t_df = lp_df.T.copy()
lp_t_df.columns = ["h" + str(column) for column in lp_t_df.iloc[0]]
lp_t_df = lp_t_df.iloc[1:-15]
lp_t_df.reset_index(drop=True, inplace=True)

# bsp_df = plot_df[['BSP']].copy()
# bsp_df['min_bp'] = bsp_df['BSP'].apply(lambda x: round(utils.back_hedge_min_bp(x, 0.05), 2))
# bsp_df['max_lp'] = bsp_df['BSP'].apply(lambda x: round(utils.lay_hedge_max_lp(x, 0.05), 2))    

bp_t_df.head()

Unnamed: 0,h13164965.0,h13384279.0,h16408223.0,h19492505.0,h20730419.0,h21007359.0,h21063957.0,h21178932.0,h23419840.0,h26548208.0,h26817714.0,h26830880.0,h27358554.0,h27377022.0,h6710054.0,h9036.0
0,4.5,1000.0,190.0,321.58,335.71,26.0,170.0,340.0,136.3,106.67,1000.0,6.32,14.5,2.64,11.22,30.19
1,4.5,1000.0,190.0,340.0,340.0,26.0,170.0,340.0,136.67,110.0,1000.0,6.4,14.5,2.64,11.09,29.8
2,4.59,1000.0,190.0,336.67,340.0,26.0,170.0,340.0,130.37,110.0,1000.0,6.4,14.5,2.64,11.0,30.0
3,4.6,1000.0,188.44,302.63,340.0,26.0,170.0,365.0,130.0,110.0,1000.0,6.4,14.5,2.64,11.0,30.0
4,4.59,1000.0,178.48,321.11,340.0,26.0,170.0,370.0,140.0,110.0,1000.0,6.4,14.5,2.61,11.4,31.51


__3.3__ **- Z-score transformation**

Bebbington standardises prices by taking the natural logarithm of each price and then standardising it with a Z-score transformation. 

Taking $P_{t}^{i} = \ln(BP_{t}^{i})$, this means finding 

#### $P_{t}^{'(i)} = \frac{P_{t}^{i}-\overline{P}^{i}}{\sigma^{(i)}}$

where $\overline{P}^{i}$ and $\sigma^{(i)}$ are the mean and standard deviation of the horse's log price over the time series.

The Z-score transformation expresses each data point as its distance from the horse's own sample mean, measured in sample standard deviations. This makes variations comparable between horses.



In [11]:
z_bp_df = bp_t_df.copy()
z_bp_df = np.log(z_bp_df)

for column in z_bp_df.columns:
    mean = z_bp_df[column].mean()
    sd = np.std(z_bp_df[column], ddof = 1)
    z_bp_df[column] = z_bp_df[column].apply(lambda x: (x - mean) / sd)
    
z_bp_df.head()

Unnamed: 0,h13164965.0,h13384279.0,h16408223.0,h19492505.0,h20730419.0,h21007359.0,h21063957.0,h21178932.0,h23419840.0,h26548208.0,h26817714.0,h26830880.0,h27358554.0,h27377022.0,h6710054.0,h9036.0
0,-0.206173,0.991632,0.793671,0.658674,-0.909595,-1.029158,-0.789898,-2.12327,0.637992,-0.383406,0.991632,-0.736271,-0.900389,0.953048,0.220815,-0.626978
1,-0.206173,0.991632,0.793671,0.804439,-0.79734,-1.029158,-0.789898,-2.12327,0.650291,-0.20195,0.991632,-0.647315,-0.900389,0.953048,0.122808,-0.827097
2,0.164583,0.991632,0.793671,0.778682,-0.79734,-1.029158,-0.789898,-2.12327,0.43618,-0.20195,0.991632,-0.647315,-0.900389,0.953048,0.054282,-0.724147
3,0.205328,0.991632,0.679596,0.499729,-0.79734,-1.029158,-0.789898,-1.767773,0.423285,-0.20195,0.991632,-0.647315,-0.900389,0.953048,0.054282,-0.724147
4,0.164583,0.991632,-0.071775,0.654846,-0.79734,-1.029158,-0.789898,-1.699603,0.75951,-0.20195,0.991632,-0.647315,-0.900389,0.646144,0.354658,0.031669
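
In the output above, `h13384279.0` and `h26817714.0` show 0.991632 in every displayed row. That value equals $\sqrt{59/60}$, which is what a constant series of 60 points produces if rounding leaves a tiny non-zero deviation from its mean. This suggests that these two columns (priced at 1000.0 throughout the back-price sample shown earlier) never move, so their z-scores are numerical noise rather than signal. A minimal sketch of one way to guard against this, assuming it is acceptable to exclude such horses from pair selection; the helper name `standardise_log_prices` and the tolerance `tol` are hypothetical:

```python
import numpy as np
import pandas as pd

def standardise_log_prices(prices: pd.DataFrame, tol: float = 1e-9) -> pd.DataFrame:
    """Z-score the log of each column, dropping (near-)constant columns first."""
    log_prices = np.log(prices.astype(float))
    # A column whose log-price range is below tol has no movement to standardise
    moving = (log_prices.max() - log_prices.min()) > tol
    log_prices = log_prices.loc[:, moving]
    return (log_prices - log_prices.mean()) / log_prices.std(ddof=1)

# Toy example (made-up prices): 'h2' never trades away from 1000.0
toy = pd.DataFrame({
    "h1": [4.5, 4.5, 4.59, 4.6, 4.59],
    "h2": [1000.0] * 5,
    "h3": [26.0, 25.0, 27.0, 26.5, 26.0],
})
z_toy = standardise_log_prices(toy)
print(z_toy.columns.tolist())
```

Whether a horse sitting at the maximum price should be dropped or handled some other way is a modelling choice that the source does not settle.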


__3.4__ **- Sum of squared distances**

Following Bebbington, to select pairs, we create a matrix (DataFrame, in this case) of the sum of squared distances between pairs of horses throughout the time series.

$\Theta _{ij} = \left\{\begin{matrix}
\sum_{t=1}^{M}(P_{t}^{'(i)} - P_{t}^{'(j)})^{2}, & i\neq j\\ 
0, & i=j
\end{matrix}\right.$

where $M$ is the number of price observations in the time series.

In [49]:
ids = list(z_bp_df.columns)

# square DataFrame of pairwise sums of squared distances; the diagonal stays NaN
matrix = pd.DataFrame(np.nan, index=ids, columns=ids)
matrix.index.name = "horse"

for column in ids:
    for row in ids:
        if column != row:
            matrix.loc[row, column] = ((z_bp_df[row] - z_bp_df[column]) ** 2).sum()
            
matrix

horse,h13164965.0,h13384279.0,h16408223.0,h19492505.0,h20730419.0,h21007359.0,h21063957.0,h21178932.0,h23419840.0,h26548208.0,h26817714.0,h26830880.0,h27358554.0,h27377022.0,h6710054.0,h9036.0
h13164965.0,,118.0,186.054717,225.966194,69.688701,105.61413,98.197439,65.680578,216.815496,210.350942,118.0,13.976972,89.681604,219.311252,207.548634,166.787182
h13384279.0,118.0,,118.0,118.0,118.0,118.0,118.0,118.0,118.0,118.0,0.0,118.0,118.0,118.0,118.0,118.0
h16408223.0,186.054717,118.0,,28.77983,155.831046,185.049484,104.869148,172.461334,44.790488,36.390599,118.0,209.827229,195.334682,53.028703,19.862208,88.724881
h19492505.0,225.966194,118.0,28.77983,,164.352338,156.871026,141.701169,191.638816,9.795516,21.798538,118.0,232.627253,172.357308,14.376697,16.129639,73.369829
h20730419.0,69.688701,118.0,155.831046,164.352338,,124.155087,86.683951,61.372575,158.910127,139.068443,118.0,74.941826,133.560082,176.070873,135.977224,86.443878
h21007359.0,105.61413,118.0,185.049484,156.871026,124.155087,,139.692028,74.483285,132.471815,158.777052,118.0,67.559398,17.073532,143.343345,190.213649,158.498435
h21063957.0,98.197439,118.0,104.869148,141.701169,86.683951,139.692028,,54.530783,155.964662,117.911017,118.0,91.773095,132.561161,171.285565,109.36316,120.544633
h21178932.0,65.680578,118.0,172.461334,191.638816,61.372575,74.483285,54.530783,,188.816475,151.288292,118.0,43.395196,80.386067,204.299164,163.401807,125.677425
h23419840.0,216.815496,118.0,44.790488,9.795516,158.910127,132.471815,155.964662,188.816475,,28.625268,118.0,219.481621,156.893333,18.166999,36.723759,92.337916
h26548208.0,210.350942,118.0,36.390599,21.798538,139.068443,158.777052,117.911017,151.288292,28.625268,,118.0,213.008607,179.626832,42.302701,19.078388,83.001373
