## Strategy Idea 2: Cointegration - pairs trading - version 2

Notes (to do)
* Use lay prices as well (currently only back prices are used, even though two lay bets are made per pairs trade)

### Section 0 : Setup

In [7]:
# importing packages
from pathlib import Path, PurePath 

import pandas as pd
import numpy as np
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

import itertools

import utils

In [8]:
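def payout(bp, bs, lp, ls, c = 0):
    """Return (profit if the horse wins, profit if it loses) for a back bet hedged by a lay bet.

    bp, bs: back price and back stake; lp, ls: lay price and lay stake; c: commission rate.
    Pass '?' as ls or bs to have that stake set so that both outcomes pay the same.
    """
    if ls == '?':
        ls = lay_hedge_stake(bp, bs, lp, c)
    elif bs == '?':
        bs = bet_hedge_stake(lp, ls, bp, c)
    loss_side = - bs + ls * (1 - c) # horse loses: back stake lost, lay stake won
    win_side = (bp - 1) * bs * (1 - c) - (lp - 1) * ls # horse wins: back bet pays out, lay liability lost
    return win_side, loss_side 

def lay_hedge_stake(bp, bs, lp, c):
    # lay stake that equalises win_side and loss_side in payout()
    # the denominator is (lp - c) so that this agrees with bet_hedge_stake() when c != 0
    return (((bp - 1) * bs * (1 - c)) + bs) / (lp - c)

def bet_hedge_stake(lp, ls, bp, c):
    # back stake that equalises win_side and loss_side in payout()
    return ls * (lp - c) / (bp * (1 - c) + c)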

In [19]:
# reading in data
project_dir = Path.cwd().parents[2]
data_dir = project_dir / 'data' / 'processed' / 'api' / 'advanced' / 'adv_data.csv'
df = pd.read_csv(data_dir, header = 1, low_memory = False, index_col = 0)
print(df.shape)

# defining variables
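# matching on the column prefix keeps the 'BSP' column out of back_sizes
back_prices = [col for col in df.columns if col.startswith('BP:')]
back_sizes = [col for col in df.columns if col.startswith('BS:')]
lay_prices = [col for col in df.columns if col.startswith('LP:')]
lay_sizes = [col for col in df.columns if col.startswith('LS:')]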

df.head(2)

(12906, 307)


Unnamed: 0,SelectionId,MarketId,Venue,Distance,RaceType,BSP,NoRunners,BS:T-60,BS:T-59,BS:T-58,...,LS:T+5,LS:T+6,LS:T+7,LS:T+8,LS:T+9,LS:T+10,LS:T+11,LS:T+12,LS:T+13,LS:T+14
0,11688029.0,1.166898,Southwell,8.0,Flat,9.2,7.0,4.15,5.98,6.86,...,4.76,7.7,3.07,41.07,8.05,3.74,1.85,7.05,3.89,0.41
1,13331255.0,1.166898,Southwell,8.0,Flat,4.3,7.0,41.5,64.89,38.54,...,16.44,7.38,18.12,5.44,4.09,15.5,3.82,66.43,192.93,136.06


### Alternative approach to pairs trading

__2.0__ **- [Herlemont (2004)](http://docs.finance.free.fr/DOCS/Yats/cointegration-en%5B1%5D.pdf) paper**

Herlemont describes in detail the econometrics of pairs trading for financial market assets. The following partly follows his commentary with some additional clarifications and discussion relating to horse racing.

**2.1 - Testing for mean reversion**

The aim is to identify odds that move together and whose spread is mean reverting. For the purposes of horse racing pairs, mean reversion is essential. Our objective is to capture prices whose spread has (temporarily) deviated from its mean. If this can be found, bets can be made to take advantage of the possible reversion.

A stochastic process $y_{t}$ that is weakly stationary has the following properties for all $t$:

* $E[y_{t}] = \mu < \infty$
* $var(y_{t}) = \gamma_{0} < \infty$
* $cov(y_{t}, y_{t-j}) = \gamma_{j} < \infty, j = 1, 2, 3 ...$

(constant mean, constant variance, covariance between two observations depends only on the distance in time between them)

A weakly stationary $I(0)$ series:
* Fluctuates around its mean with a finite variance that does not depend upon time.
* Is mean-reverting: it has a tendency to return to its mean.
* Has limited memory; the effect of a shock dies out. Autocorrelations die out (fairly) rapidly.

With two horses' odds, $A_{t}$ and $B_{t}$, we look at $y_{t} = \log \frac{A_{t}}{B_{t}} = \log A_{t} - \log B_{t}$. This is again the spread between the prices of the two horses, defined slightly differently (as a log ratio). We want to find a pair whose spread is weakly stationary. We are interested in the $AR(1)$ process 

$y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$,

or the log odds ratio over time. If this is weakly stationary, it would suggest a mean reverting process. 

The three previous conditions must hold, together with the stability condition $|\theta| < 1$. This rules out a random walk ($\theta = 1$), an explosive process ($|\theta| > 1$) and an erratic positive-to-negative alternation that never dies out ($\theta \leq -1$).
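
To see what the stability condition means in practice, the following is a minimal sketch (not part of the analysis in this notebook) that simulates the $AR(1)$ process above using only the standard library. The function name `simulate_ar1` and all parameter values are illustrative assumptions.

```python
import random

def simulate_ar1(c, theta, n, sigma=1.0, y0=0.0, seed=0):
    """Simulate y_t = c + theta * y_{t-1} + eps_t with eps_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    y = [y0]
    for _ in range(n):
        y.append(c + theta * y[-1] + rng.gauss(0.0, sigma))
    return y

# Hypothetical parameters: a stationary process (|theta| < 1) and a random walk (theta = 1)
stationary = simulate_ar1(c=1.0, theta=0.5, n=5000)
random_walk = simulate_ar1(c=0.0, theta=1.0, n=5000)

# For |theta| < 1 the process fluctuates around its long-run mean c / (1 - theta)
long_run_mean = 1.0 / (1.0 - 0.5)
sample_mean = sum(stationary) / len(stationary)
print(f"long-run mean = {long_run_mean}, sample mean = {sample_mean:.3f}")

# The random walk has no fixed level to return to, so it typically covers a far wider range
print(f"stationary range:  {max(stationary) - min(stationary):.2f}")
print(f"random walk range: {max(random_walk) - min(random_walk):.2f}")
```

Both paths are driven by the same simulated shocks (same seed), so the difference in behaviour comes from $\theta$ alone.
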
______

A Dickey-Fuller test can be carried out on the log ratio of the prices to test whether the process is weakly stationary. We carry out the regression

$\Delta y_{t} = \mu + \omega y_{t-1} + \varepsilon_{t}$

where $\omega = \theta - 1$. Under the null hypothesis $\omega = 0$, the 'true' relationship is $\Delta y_{t} = \mu + \varepsilon_{t} \Leftrightarrow y_{t} = \mu + y_{t-1} + \varepsilon_{t}$, a random walk with drift $\mu$.

If we can reject the null hypothesis in favour of $\omega < 0$, we treat the log price ratio as weakly stationary and thereby mean-reverting.

A Dickey-Fuller test is required for each possible pair of horses in a race, or $\frac{n(n-1)}{2}$ regressions, where $n$ is the number of horses.
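
As a rough illustration of the procedure described above, the sketch below runs the Dickey-Fuller regression by hand (ordinary least squares with a constant and one regressor) on every pair of a small set of made-up log price series. The helper names `dickey_fuller_stat` and `fake_log_prices`, the horse labels and the simulated data are assumptions for illustration only. The resulting statistic has to be compared with Dickey-Fuller critical values, not with the usual t tables.

```python
import itertools
import math
import random

def dickey_fuller_stat(y):
    """OLS of dy_t on a constant and y_{t-1}; returns (omega, t-statistic of omega)."""
    x = y[:-1]                                   # lagged level y_{t-1}
    d = [b - a for a, b in zip(y[:-1], y[1:])]   # first difference dy_t
    n = len(d)
    x_bar, d_bar = sum(x) / n, sum(d) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxd = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
    omega = sxd / sxx
    mu = d_bar - omega * x_bar
    rss = sum((di - mu - omega * xi) ** 2 for xi, di in zip(x, d))
    se = math.sqrt(rss / (n - 2) / sxx)          # standard error of omega
    return omega, omega / se

def fake_log_prices(n, seed):
    """Made-up log price path: a pure random walk (illustration only)."""
    rng = random.Random(seed)
    path = [2.0]
    for _ in range(n - 1):
        path.append(path[-1] + rng.gauss(0.0, 0.02))
    return path

horses = {f"h{i}": fake_log_prices(30, seed=i) for i in range(1, 5)}
pairs = list(itertools.combinations(horses, 2))  # n(n-1)/2 pairs
print(len(pairs), "pairs for", len(horses), "horses")

for a, b in pairs:
    spread = [pa - pb for pa, pb in zip(horses[a], horses[b])]
    omega, t_stat = dickey_fuller_stat(spread)
    print(f"{a}_{b}: omega = {omega:.3f}, DF statistic = {t_stat:.2f}")
```

With four horses there are six regressions; the number grows quadratically with the field size.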

While we are interested in the stochastic process $y_{t}$, we do not need to carry out the regression of $y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$ for the purpose of finding pairs. This relationship between a pair of odds itself is not important to quantify. We are only interested in the features of the process. 
____

*In the previous analysis, the test for whether two odds formed a pair was to find the pair with the smallest sum of absolute differences over time in the standardised prices. That method allowed at most one pair to be found per race, and the validity of that pair was not confirmed statistically. Rather, the pair's feasibility for a trade was tested afterwards, based on profitability. I have more confidence in the approach in this section.*
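
For comparison, the earlier distance-based screen mentioned above can be sketched roughly as follows. This is a reconstruction from the one-sentence description only, not the original code: the helper names, the z-score form of standardisation and the example prices are all assumptions.

```python
import itertools
import statistics

def standardise(prices):
    """Z-score a price series (assumed form of 'standardised prices')."""
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)
    return [(p - mean) / sd for p in prices]

def closest_pair(price_table):
    """Return the pair with the smallest sum of absolute differences, and that sum."""
    z = {name: standardise(series) for name, series in price_table.items()}
    best_pair, best_sad = None, float("inf")
    for a, b in itertools.combinations(z, 2):
        sad = sum(abs(za - zb) for za, zb in zip(z[a], z[b]))
        if sad < best_sad:
            best_pair, best_sad = (a, b), sad
    return best_pair, best_sad

# Made-up back prices for three hypothetical horses
price_table = {
    "h1": [4.0, 4.2, 4.4, 4.3, 4.6],
    "h2": [8.0, 8.4, 8.8, 8.6, 9.2],   # moves exactly like h1, at twice the price
    "h3": [3.0, 2.9, 2.7, 2.8, 2.5],   # drifts the other way
}
pair, sad = closest_pair(price_table)
print(pair, round(sad, 6))
```

By construction this screen always returns exactly one pair per race, whether or not that pair's spread is mean-reverting.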

**2.2 - Screening pairs**

Herlemont describes rules that make market neutrality more achievable in pairs trading. The idea is to pick stocks with very similar characteristics, such as the same industry and similar market betas, with the intention of minimising asymmetric shocks that hit the price of one stock and not the other. For example, suppose the share on which you are long belongs to a business heavily dependent on oil, while the other share does not. A surge in oil prices that dampens the profitability of your long share will likely see its price fall, ruining the pairs trade. In the case of shares, the simplest solution is to pick shares in similar industries with similar market betas (or with similar idiosyncratic risks).

For horses, the external factors influencing prices (news about runners, changing weather conditions, etc.) will usually have asymmetric effects. This may be avoidable by picking horses with similar fundamental characteristics. However, this is very complicated. My hope is that the pair finding mechanism picks horses where this is already the case, because the market reacts the same way to news for these horse pairs.

We cannot follow a beta-based approach because there are no 'market-wide fluctuations' of the same sort. However, the implied probabilities of all horses in the market book sum to approximately 1. Therefore, for a given change in the implied probability of one horse, the sum of the changes in the implied probabilities of all the remaining horses is approximately the negative of that change. Writing $O_{i}$ for the odds of horse $i$, so that its implied probability is $1 / O_{i}$:

$\Delta \left( \frac{1}{O_{i}} \right) \approx - \sum_{j = 1, j \neq i}^{N_{h}} \Delta \left( \frac{1}{O_{j}} \right)$

There is therefore interdependence between all prices across the market. It's possible that this will cause an endogeneity problem in regressions between separate horses, as the changes in the dependent variable necessarily impact the explanatory variable. However, the impact is likely to be very small, and will be smaller the greater the number of horses. 
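
A small numerical sketch of this adding-up constraint, using made-up decimal odds for a three-runner race. The proportional rescaling of the other runners is an assumption made purely so that the book still sums to 1; real markets need not adjust this way, and real books usually sum to slightly more than 1.

```python
def implied_probabilities(odds):
    """Implied probability of each runner from decimal odds."""
    return [1.0 / o for o in odds]

# Hypothetical three-runner book that sums exactly to 1
odds_before = [2.0, 4.0, 4.0]
p_before = implied_probabilities(odds_before)

# Runner 0 shortens from 2.0 to 1.6; rescale the others so the book still sums to 1
p0_after = 1.0 / 1.6
scale = (1.0 - p0_after) / sum(p_before[1:])
p_after = [p0_after] + [p * scale for p in p_before[1:]]
odds_after = [1.0 / p for p in p_after]

change_0 = p_after[0] - p_before[0]
change_others = sum(pa - pb for pa, pb in zip(p_after[1:], p_before[1:]))

print("book sum before:", sum(p_before), "after:", sum(p_after))
print("change for runner 0:", change_0)
print("summed change for the others:", change_others)
print("new odds:", [round(o, 3) for o in odds_after])
```

The gain in implied probability for runner 0 is matched by an equal total loss spread across the other runners, which is the interdependence described above.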

*In Bebbington's analysis, he describes that betting £1 on one of the horses and £$\beta$ on the other creates a market neutral bet. This is incorrect, and it appears that he has misunderstood hedging in this context. In that analysis, $\beta = \frac{y_{t}}{x_{t}}$, so he is simply considering the ratio of the prices of the horses, the same ratio used when determining the optimal stake for two given prices in a hedge. It is correct that on a single horse this creates a market neutral bet; however, neutrality in horse racing means neutral to the outcome of the race. Any bet neutral to the race outcome is by definition neutral to the market. When betting on separate horses, the bets on each horse must be made neutral separately. Additionally, the use of $\beta$ in staking is unnecessary. Consider the case where £$BS$ has been bet on horse A at price $BP$, and horse A is now priced at $LP$. The optimal stake to lay at $LP$ is £$LS = \frac{BS \cdot BP}{LP}$. In the aforementioned regression, $BS = 1$, hence $\beta = \frac{y_{t} \cdot 1}{x_{t}}$ is the optimal stake only for bets of £1; otherwise it would be $S \cdot \beta$. More importantly, using the estimated $\beta$ to approximate the optimal stake makes no sense when you can simply compute the optimal stake with the aforementioned equation.*
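
As a worked check of the stake formula above, ignoring commission: the sketch below computes $LS = BS \cdot BP / LP$ for made-up prices and confirms that the profit is the same whether horse A wins or loses. The function names and the example trade are illustrative assumptions, separate from the `payout` helper defined in the setup.

```python
def lay_stake_to_hedge(back_stake, back_price, lay_price):
    """Lay stake that equalises profit across outcomes (zero commission)."""
    return back_stake * back_price / lay_price

def hedged_profit(back_stake, back_price, lay_price):
    lay_stake = lay_stake_to_hedge(back_stake, back_price, lay_price)
    if_wins = (back_price - 1) * back_stake - (lay_price - 1) * lay_stake
    if_loses = -back_stake + lay_stake
    return lay_stake, if_wins, if_loses

# Hypothetical trade: back £10 at 5.0, then lay after the price has shortened to 4.0
lay_stake, if_wins, if_loses = hedged_profit(10.0, 5.0, 4.0)
print(lay_stake, if_wins, if_loses)  # 12.5 2.5 2.5
```

The stake depends only on the two prices for the same horse and the original stake; no regression coefficient is involved.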

**2.3 - Trading rules**

Timing rules must be added. 

Herlemont's basic rule is "to open a position when the ratio of two share prices hits the 2 rolling standard deviation [difference from the 130-day rolling mean] and close it when the ratio returns to the mean."

To avoid opening a position on stocks that are deviating from the mean and are going to deviate further, Herlemont describes that "the position is not opened when the ratio breaks the two-standard-deviations limit for the first time, but rather when it crosses it to revert to the mean again."

This can be achieved with horse odds, though on far smaller time scales. The current dataset is in 5-minute intervals for the three hours before a race; this should likely be expanded.
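
Herlemont's entry rule can be sketched as follows. This is a simplified illustration that uses a fixed mean and standard deviation rather than rolling estimates; the function name, the direction labels and the example spread values are assumptions.

```python
def reentry_signals(spread, mean, sd, n_sd=2.0):
    """Indices where the spread crosses back inside the n_sd band, with trade direction.

    'short_spread' means the spread came back from above the band (expect it to fall);
    'long_spread' means it came back from below (expect it to rise).
    """
    signals = []
    for t in range(1, len(spread)):
        prev_dev = spread[t - 1] - mean
        curr_dev = spread[t] - mean
        was_outside = abs(prev_dev) > n_sd * sd
        is_inside = abs(curr_dev) <= n_sd * sd
        if was_outside and is_inside:
            side = "short_spread" if prev_dev > 0 else "long_spread"
            signals.append((t, side))
    return signals

# Made-up spread with mean 0 and standard deviation 1
spread = [0.0, 1.0, 2.5, 3.0, 2.2, 1.8, 0.5, -2.1, -1.9]
print(reentry_signals(spread, mean=0.0, sd=1.0))
# [(5, 'short_spread'), (8, 'long_spread')]
```

Note that the first signal fires at index 5, when the spread comes back inside the band, and not at index 2, where the band is first broken.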

Stop losses should be included and trade length should also be limited.

Rules:
1. Trade on pairs whose spread is reapproaching the mean from a deviated position
2. Stop loss at x% of the initial position
3. Don't hold open pairs trades for longer than x hours. 

It should be possible to quantify the average length of time required for a mean reversion and therefore the maximum logical time to hold open a position by looking at past data.
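
One possible way to put a number on this, offered as an assumption rather than something taken from Herlemont or from this dataset, is the half-life of an $AR(1)$ process: the number of periods after which a deviation from the mean is expected to have halved. With $\theta = 1 + \omega$ from the Dickey-Fuller regression, the half-life is $\ln(0.5) / \ln(\theta)$ for $0 < \theta < 1$. The function name and the example coefficient are hypothetical.

```python
import math

def half_life(omega):
    """Expected number of periods for a deviation to halve, given the DF coefficient omega."""
    theta = 1.0 + omega
    if not 0.0 < theta < 1.0:
        raise ValueError("half-life is only defined here for 0 < theta < 1")
    return math.log(0.5) / math.log(theta)

# Hypothetical coefficient: omega = -0.2, i.e. theta = 0.8
print(round(half_life(-0.2), 2))
```

A maximum holding time could then be set at some multiple of the estimated half-life, with the multiple chosen by looking at past data.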

**2.4 - Other tests and considerations**

1. It should be ensured that the regression of one price on another is not spurious (as with the regression in 2.5). If it is, $\beta$ could be statistically meaningless, in which case it makes no sense to use it.
2. I will also test whether $y_{t}$, modelled as $y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$, is $I(1)$, or difference stationary. If we can rule this out, this gives more confidence in the weak stationarity of the spread over time.
3. I will look out for pairs whose implied $\theta = 1 + \omega$ is close to 1 (that is, $\omega$ close to 0) yet which pass the DF test. They will have many features of a random walk, so the pairs exercise might be meaningless.
4. Structural breaks (in this case, large instantaneous jumps in the spread) may make series that are stationary on either side of the break appear non-stationary. This is hard to account for in testing. 

In [21]:
#new sample (one randomly chosen market); drop_duplicates is chained so that no slice of df is modified in place
sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()].drop_duplicates()

bp_df = sample_df[['SelectionId'] + back_prices].copy()
new_cols = bp_df.columns.str.replace("BP:T", "", regex=False).str.replace("+", "", regex=False) #literal replacement, independent of the pandas default for regex
bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
bp_t_df = bp_df.T.copy()
bp_t_df.columns = ["h" + str(int(column)) for column in bp_t_df.iloc[0]]
bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
bp_t_df.reset_index(drop=True, inplace=True)

lp_df = sample_df[['SelectionId'] + lay_prices].copy()
new_cols = lp_df.columns.str.replace("LP:T", "", regex=False).str.replace("+", "", regex=False) #literal replacement, independent of the pandas default for regex
lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
lp_t_df = lp_df.T.copy()
lp_t_df.columns = ["h" + str(int(column)) for column in lp_t_df.iloc[0]] #rename columns to horse ids
lp_t_df = lp_t_df.iloc[1:-15] #remove horse ids, remove inplay data
lp_t_df.reset_index(drop=True, inplace=True)

#using non-standardised log price data
log_bp = np.log(bp_t_df[:30]).copy()
log_bp.head()



Unnamed: 0,h10715610,h12111722,h17574768,h19823,h21339426,h21822169,h23188356
0,2.090629,2.318458,2.944439,1.916923,2.089392,1.20896,1.745716
1,2.068128,2.313525,2.944439,1.896119,2.104134,1.238374,1.73871
2,2.071913,2.326302,2.944439,1.880991,2.104134,1.247032,1.729884
3,2.051556,2.309561,2.904713,1.88707,2.104134,1.255616,1.740466
4,2.028148,2.338917,2.862201,1.88707,2.104134,1.255616,1.73871


In [64]:
#create dataframe where each column is log(horse a's prices) - log(horse b's prices): one column for each of the n(n-1)/2 possible pairs

#use itertools to find all possible combination pairs
combos = list(itertools.combinations(log_bp.columns, 2))

#creating dataframe, with columns named "<horse a>_<horse b>"
dickey_fuller_df = pd.DataFrame({pair[0] + "_" + pair[1]: log_bp[pair[0]] - log_bp[pair[1]] for pair in combos})

dickey_fuller_df['const'] = 1

dickey_fuller_df.head()

Unnamed: 0,h11238576_h15836858,h11238576_h18227448,h11238576_h18889993,h11238576_h24635004,h11238576_h6561475,h15836858_h18227448,h15836858_h18889993,h15836858_h24635004,h15836858_h6561475,h18227448_h18889993,h18227448_h24635004,h18227448_h6561475,h18889993_h24635004,h18889993_h6561475,h24635004_h6561475,const
0,0.444868,-2.430418,-2.302585,0.228842,-0.818979,-2.875286,-2.747453,-0.216026,-1.263846,0.127833,2.65926,1.61144,2.531427,1.483607,-1.04782,1
1,0.426372,-2.441847,-2.314014,0.217413,-0.831409,-2.868219,-2.740386,-0.208959,-1.257781,0.127833,2.65926,1.610438,2.531427,1.482605,-1.048822,1
2,0.404295,-2.460409,-2.332576,0.195998,-0.850971,-2.864704,-2.736871,-0.208297,-1.255266,0.127833,2.656407,1.609438,2.528574,1.481605,-1.046969,1
3,0.377294,-2.476938,-2.357479,0.179468,-0.867501,-2.854233,-2.734773,-0.197826,-1.244795,0.119459,2.656407,1.609438,2.536948,1.489978,-1.046969,1
4,0.366931,-2.476938,-2.36377,0.185183,-0.851371,-2.84387,-2.730701,-0.181749,-1.218303,0.113169,2.662121,1.625567,2.548953,1.512399,-1.036554,1


In [65]:
#dickey fuller test on each column

#regression fit and results in vertical dataframe format (column for pair id, column for dickey fuller test result)
#note: 'critical_value' holds the DF test statistic (the t-value on the lagged level), which is then compared with the tabulated critical values
dickey_fuller_results = {'pair' : [], 'coef' : [], 'critical_value' : []}

for column in dickey_fuller_df:
    if column == 'const':
        continue
    #regress the first difference on a constant and the lagged level
    reg = sm.OLS(endog = dickey_fuller_df[column].diff(), exog = dickey_fuller_df[['const', column]].shift(1), missing = 'drop')
    results = reg.fit()
    dickey_fuller_results['pair'].append(column)
    dickey_fuller_results['coef'].append(results.params.iloc[1]) #omega, the coefficient on the lagged level
    dickey_fuller_results['critical_value'].append(results.tvalues.iloc[1])

dickey_fuller_results_df = pd.DataFrame(dickey_fuller_results)

dickey_fuller_results_df

#compare to -3.58 from the MacKinnon tables for 1% significance level, -2.93 for 5% significance level

Unnamed: 0,pair,coef,critical_value
0,h11238576_h15836858,-0.131551,-2.892398
1,h11238576_h18227448,-0.113261,-1.519496
2,h11238576_h18889993,-0.134122,-2.038246
3,h11238576_h24635004,-0.024029,-0.653254
4,h11238576_h6561475,-0.206256,-1.636178
5,h15836858_h18227448,-0.143266,-1.376951
6,h15836858_h18889993,-0.275742,-2.050797
7,h15836858_h24635004,-0.041601,-0.604386
8,h15836858_h6561475,-0.082666,-1.712969
9,h18227448_h18889993,-0.17377,-1.602845


In [66]:
#to work out what the direction of the trade should be
#the spreads are always defined as horse_a - horse_b, so if the spread > average, btl horse_a and ltb horse_b. if below average, vice versa.


def bet(idx_open, time_open):
    #note: pair_df, pair_spread_mean, horse_a and horse_b are read from the enclosing (global) scope
    spread_open = pair_df['spread'].iloc[idx_open] #spread at the time the trade is opened

    if spread_open > pair_spread_mean:
        #back to lay A (short)
        bp_a = pair_df[horse_a + "_bp"].iloc[idx_open]
        lp_a = pair_df[horse_a + "_lp"].iloc[idx_open + time_open]

        win_side_a, loss_side_a = payout(bp_a, 1, lp_a, '?')
        
        #lay to back B (long)
        lp_b = pair_df[horse_b + "_lp"].iloc[idx_open]
        bp_b = pair_df[horse_b + "_bp"].iloc[idx_open + time_open]

        win_side_b, loss_side_b = payout(bp_b, '?', lp_b, 1)

        return win_side_a, win_side_b

    elif spread_open < pair_spread_mean:
        #lay to back A (long)
        lp_a = pair_df[horse_a + "_lp"].iloc[idx_open]
        bp_a = pair_df[horse_a + "_bp"].iloc[idx_open + time_open]

        win_side_a, loss_side_a = payout(bp_a, '?', lp_a, 1)

        #back to lay B (short)
        bp_b = pair_df[horse_b + "_bp"].iloc[idx_open]
        lp_b = pair_df[horse_b + "_lp"].iloc[idx_open + time_open]

        win_side_b, loss_side_b = payout(bp_b, 1, lp_b, '?')

        return win_side_a, win_side_b

The code below finds all possible pairs, finds where they have drifted 2sd from the mean and bets sequentially on all possible opportunities across all pairs.

In [67]:
#any pairs with a test statistic below -3.58 have the null hypothesis rejected at the 1% significance level
if dickey_fuller_results_df['critical_value'].min() < - 3.58:
    
    pairs_df = dickey_fuller_results_df.loc[dickey_fuller_results_df['critical_value'] < - 3.58].copy() #all possible pairs
    print(f"{len(pairs_df.index)} pair(s) found\n")
    
    for id_id in pairs_df['pair']:
        pair_index = pairs_df.index[pairs_df['pair'] == id_id].item() #row index of this pair in pairs_df
        pair_cv = pairs_df['critical_value'].loc[pair_index]
        pair_ids = pairs_df['pair'].loc[pair_index]
        pair_coef = pairs_df['coef'].loc[pair_index]

        horse_a = pair_ids.split("_", 1)[0]
        horse_b = pair_ids.split("_", 1)[1]
        pair_df = bp_t_df[[horse_a, horse_b]]
        pair_df = pd.concat([pair_df, lp_t_df[[horse_a, horse_b]]], axis=1)
        pair_df.columns = [horse_a + "_bp", horse_b + "_bp", horse_a + "_lp", horse_b + "_lp"]

        #the following are all defined only in terms of BP
        pair_df['spread'] = pair_df[horse_a + "_bp"] - pair_df[horse_b + "_bp"]

        pair_spread_sd = np.std(pair_df['spread'][0:29], ddof = 1)
        pair_spread_mean = pair_df['spread'][0:29].mean()

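        #note: this compares |spread| with |mean|, so only deviations away from zero are flagged; abs(spread - mean) would flag deviations in both directions
        pair_df['deviation_2sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 2 * pair_spread_sd, True, False)
        pair_df['deviation_1sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > pair_spread_sd, True, False)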

        print("Pair row index: " + str(pair_index), ", Pair ids: " + str(pair_ids), ", Pair DF test statistic: " + str(pair_cv), ", Pair omega: " + str(pair_coef))
        print("Pair average spread: " + str(pair_spread_mean), ", Pair spread standard deviation: " + str(pair_spread_sd) + "\n")

        open_trade_df = pair_df[pair_df['deviation_2sd'] == True].loc[30:] #so that we only consider data after the first 30 periods
        
        k = 5    
        
        while (len(open_trade_df.index) > 0) and ((open_trade_df.index[0] + k) < 59): #while len(open_trade_df.index) > k + 1:
            print(open_trade_df.index[0])
            open_trade_idx = open_trade_df.index[0] 
            win_side_a, win_side_b = bet(open_trade_idx, k)
            open_trade_df = open_trade_df.loc[open_trade_idx + k + 1:]
            
            print(f"Horse A payoff = {win_side_a}.")
            print(f"Horse B payoff = {win_side_b}.")
    
        else: print("No more trades.\n")
    
else: print("No pairs.")

No pairs.


**Monte Carlo simulation**

n repetitions of the above with profit summed over all trades.

Trading rules:
* Open trades when the spread is between 2 and 4 standard deviations from the mean
* Close trades k 2-minute periods later
* Only consider races where 5 or fewer pairs are found
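
The first and third rules can be sketched as a small filter. Here the deviation is measured as the absolute distance of the spread from its mean, in standard deviations, which is an assumption about what the rule intends. The function name, its parameters and the example values are illustrative.

```python
def should_open(spread, mean, sd, pairs_in_race, lower=2.0, upper=4.0, max_pairs=5):
    """Apply the entry band and the pairs-per-race limit to one observation."""
    if pairs_in_race > max_pairs:
        return False                      # too many 'stationary' pairs found in this race
    z = abs(spread - mean) / sd           # distance from the mean in standard deviations
    return lower < z < upper

# Made-up values: mean spread 1.0 with standard deviation 0.5
print(should_open(2.5, 1.0, 0.5, pairs_in_race=3))   # z = 3.0, inside the band
print(should_open(1.5, 1.0, 0.5, pairs_in_race=3))   # z = 1.0, too close to the mean
print(should_open(3.5, 1.0, 0.5, pairs_in_race=3))   # z = 5.0, deviated too far
print(should_open(2.5, 1.0, 0.5, pairs_in_race=6))   # too many pairs in the race
```

Only the first call returns `True`; the other three are each rejected by one of the rules.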

In [112]:
#setup
n = 500 #number of iterations
k = 10 #number of 2-minute periods a trade is kept open (i.e. time expected for mean reversion to occur). this is passed to the bet() function below

#results variables
profit = 0
num_pairs = 0
pairs_traded = 0
num_trades_total = 0

results_dict = {'pair' : [], 'mean_spread' : [], 'final_spread' : [], 'pair_cv' : [], 'pair_coef' : [], 'pairs_in_race' : [],
                'num_trades' : [], 'profitable_trades' : [], 'losing_trades' : [], 'pc_trades_prof' : [],
                'pair_profit' : [], 'pair_profits_list' : []}

for i in range(n):
            
    if (i + 1) % 100 == 0:
        print(f"{i+1} of {n} iterations completed.")
    
    #new sample (one randomly chosen market); drop_duplicates is chained so that no slice of df is modified in place
    sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()].drop_duplicates()

    bp_df = sample_df[['SelectionId'] + back_prices].copy()
    new_cols = bp_df.columns.str.replace("BP:T", "", regex=False).str.replace("+", "", regex=False) #literal replacement, independent of the pandas default for regex
    bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
    bp_t_df = bp_df.T.copy()
    bp_t_df.columns = ["h" + str(int(column)) for column in bp_t_df.iloc[0]]
    bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
    bp_t_df.reset_index(drop=True, inplace=True)

    lp_df = sample_df[['SelectionId'] + lay_prices].copy()
    new_cols = lp_df.columns.str.replace("LP:T", "", regex=False).str.replace("+", "", regex=False) #literal replacement, independent of the pandas default for regex
    lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
    lp_t_df = lp_df.T.copy()
    lp_t_df.columns = ["h" + str(int(column)) for column in lp_t_df.iloc[0]] #rename columns to horse ids
    lp_t_df = lp_t_df.iloc[1:-15] #remove horse ids, remove inplay data
    lp_t_df.reset_index(drop=True, inplace=True)

    #using non-standardised log price data
    log_bp = np.log(bp_t_df[:30]).copy()

    #create dataframe where each column is log(horse a's prices) - log(horse b's prices): one column for each of the n(n-1)/2 possible pairs
    #use itertools to find all possible combination pairs
    combos = list(itertools.combinations(log_bp.columns, 2))

    #one column per pair, named "<horse a>_<horse b>"
    dickey_fuller_df = pd.DataFrame({pair[0] + "_" + pair[1]: log_bp[pair[0]] - log_bp[pair[1]] for pair in combos})
    dickey_fuller_df['const'] = 1

    #dickey fuller test on each column
    dickey_fuller_results = {'pair' : [], 'coef' : [], 'critical_value' : []}

    for column in dickey_fuller_df:
        if column == 'const':
            continue
        reg = sm.OLS(endog = dickey_fuller_df[column].diff(), exog = dickey_fuller_df[['const', column]].shift(1), missing = 'drop')
        results = reg.fit()
        dickey_fuller_results['pair'].append(column)
        dickey_fuller_results['coef'].append(results.params.iloc[1]) #omega, the coefficient on the lagged level
        dickey_fuller_results['critical_value'].append(results.tvalues.iloc[1]) #DF test statistic

    dickey_fuller_results_df = pd.DataFrame(dickey_fuller_results)


    #continue if there is at least one 'stationary' pair
    if dickey_fuller_results_df['critical_value'].min() < - 3.58:
    
        pairs_df = dickey_fuller_results_df.loc[dickey_fuller_results_df['critical_value'] < - 3.58].copy() #all possible pairs
        
        if len(pairs_df.index) < 6:  #TRADING RULE: REJECT RACES WHERE MORE THAN 5 PAIRS ARE FOUND
            for id_id in pairs_df['pair']:

                #pair analysis
                pair_index = pairs_df.index[pairs_df['pair'] == id_id].item() #row index of this pair in pairs_df
                pair_cv = pairs_df['critical_value'].loc[pair_index]
                pair_ids = pairs_df['pair'].loc[pair_index]
                pair_coef = pairs_df['coef'].loc[pair_index]

                horse_a = pair_ids.split("_", 1)[0]
                horse_b = pair_ids.split("_", 1)[1]
                pair_df = bp_t_df[[horse_a, horse_b]]
                pair_df = pd.concat([pair_df, lp_t_df[[horse_a, horse_b]]], axis=1)
                pair_df.columns = [horse_a + "_bp", horse_b + "_bp", horse_a + "_lp", horse_b + "_lp"]

                #the following are all defined only in terms of BP
                pair_df['spread'] = pair_df[horse_a + "_bp"] - pair_df[horse_b + "_bp"]

                pair_spread_sd = np.std(pair_df['spread'][0:29], ddof = 1)
                pair_spread_mean = pair_df['spread'][0:29].mean()

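                #note: this compares |spread| with |mean|, so only deviations away from zero are flagged; abs(spread - mean) would flag deviations in both directions
                pair_df['deviation_2sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 2 * pair_spread_sd, True, False)
                pair_df['deviation_4sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 4 * pair_spread_sd, True, False)            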

                #ALL TRADES ARE MADE IN THE CODE BELOW
                open_trade_df = pair_df[pair_df['deviation_2sd'] == True].loc[30:].drop_duplicates() #only data after the first 30 periods

                if len(open_trade_df.index) > 0:
                    pairs_traded += 1

                num_trades_pair = 0
                profitable_trades_pair = 0
                losing_trades_pair = 0
                pair_profit = 0
                pair_profits_list = []

                #if there are indices at which to make trades and trades can be completed, cycle through them
                #TRADING RULE: IGNORE HORSES WHO ARE TRADING WITH SPREAD OF 4 SD OR GREATER THAN MEAN (to avoid horses who have deviated too much)    
                while (len(open_trade_df.index) > 0) and ((open_trade_df.index[0] + k) < 59) and (open_trade_df['deviation_4sd'].loc[open_trade_df.index[0]] == False):
                    open_trade_idx = open_trade_df.index[0] 

                    win_side_a, win_side_b = bet(open_trade_idx, k)
                    
                    #aggregate stats
                    profit += win_side_a + win_side_b
                    
                    num_trades_total += 1

                    #removes traded line from open_trade_df
                    #+ 1 period gap between trades on a given pair. at this point one could add in a block to trading if a loss occurred in the previous trade
                    #edit the line below to alter the repetition of trades
                    open_trade_df = open_trade_df.loc[open_trade_idx + k + 1:] 

                    #pair stats
                    num_trades_pair += 1

                    if (win_side_a + win_side_b) > 0:
                        profitable_trades_pair += 1
                    else:
                        losing_trades_pair += 1
                        
                    pair_profit += win_side_a + win_side_b
                    pair_profits_list.append(round(win_side_a + win_side_b,2))
                    

                #stats
                num_pairs += 1
                results_dict['pair'].append(id_id)
                results_dict['mean_spread'].append(pair_spread_mean)
                results_dict['final_spread'].append(pair_df['spread'].loc[59])
                results_dict['pair_cv'].append(pair_cv)
                results_dict['pair_coef'].append(pair_coef)
                results_dict['pairs_in_race'].append(len(pairs_df.index))
                results_dict['num_trades'].append(num_trades_pair) 
                results_dict['profitable_trades'].append(profitable_trades_pair) 
                results_dict['losing_trades'].append(losing_trades_pair) 
                #percentage of profitable trades (0 if no trades were made on this pair)
                pc_trades_prof = (profitable_trades_pair / num_trades_pair) * 100 if num_trades_pair > 0 else 0
                results_dict['pc_trades_prof'].append(pc_trades_prof)
                results_dict['pair_profit'].append(pair_profit) 
                results_dict['pair_profits_list'].append(pair_profits_list)
                #average profit
                #number of horses in race
                
    
    else: continue #move on to next iteration if there are no stationary series  

results_df = pd.DataFrame(results_dict)
profitable_trades_total = results_df['profitable_trades'].sum()
        
print(f"Profit over {n} random race markets = {profit}. {num_pairs} pairs found, {pairs_traded} pairs traded and {num_trades_total} pairs trades made. {profitable_trades_total} of {num_trades_total} were profitable.")



100 of 500 iterations completed.
200 of 500 iterations completed.
300 of 500 iterations completed.
400 of 500 iterations completed.
500 of 500 iterations completed.
Profit over 500 random race markets = -23.411116867011128. 308 pairs found, 253 pairs traded and 185 pairs trades made. 54 of 185 were profitable.


In [113]:
results_2_df = results_df[results_df['num_trades'] > 0].copy()
results_2_df.to_csv(data_dir.parents[0] / 'pairs_trade_results.csv', index = False, header=True)