## Strategy Idea 2 : "Cointegration - pairs trading - version 2"

Notes (to do)
* Work out whether the hedge profit formula can give a minimum deviated price from which a return of the spread to its mean would be a profitable bet.
* Check whether there are pairs whose spread deviates from and returns to its mean, but where the return is not a convergence of prices and is instead a move of one horse, or of both horses in the same direction, to different extents.
* To account for the above, edit the bet function so that it identifies the short and long positions by each horse's deviation from its own mean rather than by the spread's deviation from its mean.
* Add another percentage-profit hedge column where the final price is the deviation from the mean spread multiplied by the weight of each horse.

Long term:
* Use predictive method to decide when to bet

Observations that may be useful:
* The mean spread between two horses is the same as the spread of the mean prices of each horse
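The observation above follows from the linearity of the mean when the spread is defined as a price difference. A minimal check in plain Python, using made-up prices (the values below are illustrative and not taken from the dataset):

```python
from statistics import mean

# Hypothetical mid prices for two horses over five time steps (illustrative only)
prices_a = [5.3, 5.25, 5.2, 5.1, 5.04]
prices_b = [30.0, 30.8, 30.0, 31.0, 30.0]

# Spread defined as the price difference at each time step
spread = [a - b for a, b in zip(prices_a, prices_b)]

mean_of_spread = mean(spread)
spread_of_means = mean(prices_a) - mean(prices_b)

# The two agree up to floating-point rounding because the mean is linear
print(abs(mean_of_spread - spread_of_means) < 1e-9)
```

The same identity holds for the log spread if the means are taken over log prices. It does not hold in general for a ratio of raw prices.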

### Section 0 : Setup

In [45]:
# importing packages
from pathlib import Path, PurePath 

import pandas as pd
import numpy as np
import statsmodels.api as sm

import matplotlib.pyplot as plt
import seaborn as sns

import itertools

import utils

In [46]:
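def payout(bp, bs, lp, ls, c = 0):
    # bp / bs: back price and back stake; lp / ls: lay price and lay stake; c: commission rate
    # Pass '?' for the unknown stake (ls or bs); it is solved for with the hedge stake functions below
    # Returns the profit if the horse wins, together with the solved stake
    if ls == '?':
        ls = lay_hedge_stake(bp, bs, lp, c)
        payoff = (bp - 1) * bs * (1 - c) - (lp - 1) * ls
        return payoff, ls
    elif bs == '?':
        bs = bet_hedge_stake(lp, ls, bp, c)
        payoff = (bp - 1) * bs * (1 - c) - (lp - 1) * ls
        return payoff, bs 
    raise ValueError("Exactly one of bs or ls must be '?'") # previously fell through and returned None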

def lay_hedge_stake(bp, bs, lp, c):
    return (((bp - 1) * bs * (1 - c)) + bs) / (lp)

def bet_hedge_stake(lp, ls, bp, c):
    return ls * (lp - c) / (bp * (1 - c) + c)

In [47]:
# reading in data
project_dir = Path.cwd().parents[2]
data_dir = project_dir / 'data' / 'processed' / 'api' / 'advanced' / 'adv_data.csv'
df = pd.read_csv(data_dir, header = 1, low_memory = False, index_col = 0)
print(df.shape)

# defining variables
back_prices = [col for col in df.columns if 'BP' in col]
back_sizes = [col for col in df.columns if 'BS' in col]
lay_prices = [col for col in df.columns if 'LP' in col]
lay_sizes = [col for col in df.columns if 'LS' in col]

df.head(2)

(12906, 307)


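| | SelectionId | MarketId | Venue | Distance | RaceType | BSP | NoRunners | BS:T-60 | BS:T-59 | BS:T-58 | ... | LS:T+5 | LS:T+6 | LS:T+7 | LS:T+8 | LS:T+9 | LS:T+10 | LS:T+11 | LS:T+12 | LS:T+13 | LS:T+14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11688029.0 | 1.166898 | Southwell | 8.0 | Flat | 9.2 | 7.0 | 4.15 | 5.98 | 6.86 | ... | 4.76 | 7.7 | 3.07 | 41.07 | 8.05 | 3.74 | 1.85 | 7.05 | 3.89 | 0.41 |
| 1 | 13331255.0 | 1.166898 | Southwell | 8.0 | Flat | 4.3 | 7.0 | 41.5 | 64.89 | 38.54 | ... | 16.44 | 7.38 | 18.12 | 5.44 | 4.09 | 15.5 | 3.82 | 66.43 | 192.93 | 136.06 |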


### Alternative approach to pairs trading

__2.0__ **- [Herlemont (2004)](http://docs.finance.free.fr/DOCS/Yats/cointegration-en%5B1%5D.pdf) paper**

Herlemont describes in detail the econometrics of pairs trading for financial market assets. The following partly follows his commentary with some additional clarifications and discussion relating to horse racing.

**2.1 - Testing for mean reversion**

The aim is to identify odds that move together and whose spread is mean reverting. For the purposes of horse racing pairs, mean reversion is essential. Our objective is to capture prices whose spread has (temporarily) deviated from its mean. If this can be found, bets can be made to take advantage of the possible reversion.

A stochastic process $y_{t}$ that is weakly stationary has the following properties for all $t$:

* $E[y_{t}] = \mu < \infty$
* $var(y_{t}) = \gamma_{0} < \infty$
* $cov(y_{t}, y_{t-j}) = \gamma_{j} < \infty, j = 1, 2, 3 ...$

(constant mean, constant variance, covariance between two observations depends only on the distance in time between them)

A weakly stationary $I(0)$ series:
* Fluctuates around its mean with a finite variance that does not depend upon time.
* Is mean-reverting: it has tendency to return to its mean.
* Has limited memory; the effect of a shock dies out. Autocorrelations die out (fairly) rapidly.

With two horses' odds, $A_{t}$ and $B_{t}$, we look at $y_{t} = \log \frac{A_{t}}{B_{t}} = \log A_{t} - \log B_{t}$. This is once again the spread between the prices of the two horses, defined slightly differently. We want to find a pair which has a weakly stationary spread. We are interested in the $AR(1)$ process 

$y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$,

or the log odds ratio over time. If this is weakly stationary, it would suggest a mean reverting process. 

The three previous conditions must hold, together with the stability condition $|\theta|<1$. The stability condition rules out, for example, a random walk ($\theta = 1$) and an erratic positive-to-negative oscillation that never dies out ($\theta \le -1$).
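To make the role of $\theta$ concrete, here is a small simulation sketch in plain Python. The function names, the starting value of 0, the Gaussian shocks and the parameter values are all assumptions chosen for illustration; none of it comes from the race data.

```python
import random
from statistics import pvariance

def simulate_ar1(c, theta, n, sigma=1.0, seed=0):
    """Simulate y_t = c + theta * y_{t-1} + eps_t with Gaussian shocks."""
    rng = random.Random(seed)
    y = [0.0]  # arbitrary starting value (an assumption of this sketch)
    for _ in range(n - 1):
        y.append(c + theta * y[-1] + rng.gauss(0.0, sigma))
    return y

def half_variances(series):
    """Population variance of the first and second half of a series."""
    half = len(series) // 2
    return pvariance(series[:half]), pvariance(series[half:])

stationary = simulate_ar1(c=0.0, theta=0.5, n=2000)   # |theta| < 1
random_walk = simulate_ar1(c=0.0, theta=1.0, n=2000)  # theta = 1

print("theta = 0.5:", half_variances(stationary))
print("theta = 1.0:", half_variances(random_walk))
```

For $\theta = 0.5$ the two half-sample variances should be of similar size, since the theoretical variance $\sigma^{2} / (1 - \theta^{2})$ does not depend on time. For $\theta = 1$ the variance of $y_{t}$ grows with $t$, so the two halves will generally look quite different.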
______

A Dickey-Fuller stationarity test can be carried out on the log ratio of the prices to test whether a process is weakly stationary. If we carry out the regression:

$\Delta y_{t} = \mu + \omega y_{t-1} + \varepsilon_{t}$

the null hypothesis $\omega = 0$ (where $\omega = \theta - 1$) corresponds to the 'true' relationship being $\Delta y_{t} = \mu + \varepsilon_{t} \Leftrightarrow y_{t} = \mu + y_{t-1} + \varepsilon_{t}$, i.e. a random walk with drift $\mu$.

If we can reject the null hypothesis in favour of $\omega < 0$, we treat the log price ratio as weakly stationary and thereby mean-reverting.
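As a rough sketch of the regression behind the test, the snippet below estimates $\omega$ and its t-statistic by ordinary least squares in plain Python. The helper name `dickey_fuller_stat` and the simulated input series are assumptions made for illustration. The statistic has to be compared with Dickey-Fuller critical values rather than standard t tables.

```python
import math
import random

def dickey_fuller_stat(y):
    """OLS of (y_t - y_{t-1}) on a constant and y_{t-1}; returns (omega, t-statistic)."""
    x = y[:-1]                                   # lagged level y_{t-1}
    d = [b - a for a, b in zip(y[:-1], y[1:])]   # first difference
    n = len(x)
    x_bar = sum(x) / n
    d_bar = sum(d) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
    omega = sxy / sxx                            # slope on the lagged level
    mu = d_bar - omega * x_bar                   # intercept
    ssr = sum((di - mu - omega * xi) ** 2 for xi, di in zip(x, d))
    se_omega = math.sqrt(ssr / (n - 2) / sxx)    # standard error of the slope
    return omega, omega / se_omega

# Simulated mean-reverting series with theta = 0.3 (illustrative, not race data)
rng = random.Random(1)
series = [0.0]
for _ in range(199):
    series.append(0.3 * series[-1] + rng.gauss(0.0, 1.0))

omega, t_stat = dickey_fuller_stat(series)
print(omega, t_stat)
```

With $\theta = 0.3$ the estimate of $\omega$ should come out near $\theta - 1 = -0.7$, with a strongly negative t-statistic.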

A Dickey-Fuller test is required for each possible pair of horses in a race, or $\frac{n(n-1)}{2}$ regressions, where $n$ is the number of horses.
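The pair count can be checked directly with `itertools.combinations`, which is also what the notebook's test function uses to enumerate pairs. The horse identifiers below are hypothetical.

```python
import itertools

# Hypothetical identifiers for a seven-runner race
horses = ["h1", "h2", "h3", "h4", "h5", "h6", "h7"]

pairs = list(itertools.combinations(horses, 2))
n = len(horses)

# Both numbers are 21 for a seven-runner race
print(len(pairs), n * (n - 1) // 2)
```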

While we are interested in the stochastic process $y_{t}$, we do not need to carry out the regression of $y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$ for the purpose of finding pairs. This relationship between a pair of odds itself is not important to quantify. We are only interested in the features of the process. 
____

*In the previous analysis, the test for whether two odds formed a pair was to find the pair with the smallest sum of absolute differences over time in the standardised prices. That method would allow a maximum of 1 pair to be found per race, and the validity of that pair would not be confirmed statistically. Rather, the pair's feasibility for a trade would be tested afterwards based on profitability. I have more confidence in the approach in this section.*

**2.2 - Screening pairs**

Herlemont describes rules to make market neutrality more achievable in pairs trading. The idea is to pick stocks with very similar characteristics, such as the same industry and similar market betas, with the intention of minimising asymmetric shocks that hit the price of one stock and not the other. For example, suppose the share on which you are long is a business heavily dependent on oil, while the other share is not. A surge in oil prices that dampens the profitability of your long share will likely see its price fall, ruining the pairs trade. For shares, the simplest solution is to pick shares in similar industries with similar market betas (or with similar idiosyncratic risks).

For horses, the external factors influencing prices (news about runners, changing weather conditions, etc.) will usually have asymmetric effects. This may be avoidable by picking horses with similar fundamental characteristics. However, this is very complicated. My hope is that the pair-finding mechanism picks horses where this is already the case, because the market reacts the same way to news for these horse pairs.

We cannot follow a beta-based approach because there are no 'market-wide fluctuations' of the same sort. However, the implied probabilities of all horses in the market book sum to approximately 1. Therefore, for a given change in the implied probability of one horse, the sum of the changes in the implied probabilities of all the remaining horses is approximately the negative of that change (writing $O_{i}$ for the implied probability of horse $i$ and $N_{h}$ for the number of horses):

$\Delta O_{i} = - \sum_{j = 1, j \neq i}^{N_{h}} \Delta O_{j} $

There is therefore interdependence between all prices across the market. It's possible that this will cause an endogeneity problem in regressions between separate horses, as the changes in the dependent variable necessarily impact the explanatory variable. However, the impact is likely to be very small, and will be smaller the greater the number of horses. 
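A small numerical sketch of this interdependence, assuming an idealised three-runner book whose implied probabilities sum to exactly 1 (real books only do so approximately). The probabilities and variable names are made up for illustration.

```python
# Idealised implied probabilities before and after a move in horse 0 (illustrative)
prob_before = [0.50, 0.25, 0.25]
prob_after = [0.60, 0.20, 0.20]

# Decimal odds are the reciprocal of the implied probability
odds_before = [1 / p for p in prob_before]
odds_after = [1 / p for p in prob_after]

# Change in implied probability for each horse
changes = [after - before for before, after in zip(prob_before, prob_after)]

change_horse_0 = changes[0]
change_rest = sum(changes[1:])

# The change in horse 0 is offset by the combined change in the others
print(change_horse_0, change_rest)
```

The offset is exact here only because both books sum to 1. With a book that sums to slightly more or less than 1, the relationship holds approximately.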

*In Bebbington's analysis, he states that betting £1 on one of the horses and £$\beta$ on the other creates a market-neutral bet. This is incorrect, and it appears that he has misunderstood hedging in this context. In that analysis, $\beta = \frac{y_{t}}{x_{t}}$, so he is simply considering the ratio of the prices of the horses, the same ratio used when determining the optimal stake for two given prices in a hedge. It is correct that on a single horse this creates a market-neutral bet; however, neutrality in horse racing means neutral to the outcome of the race. Any bet neutral to the race outcome is by definition neutral to the market. When betting on separate horses, the bets on each horse must be made neutral separately. Additionally, the use of $\beta$ in staking is unnecessary. Consider the case where £$BS$ has been bet on horse A at price $BP$. Now, horse A is priced at $LP$. The optimal stake to lay at $LP$ is £$LS = \frac{BS \times BP}{LP}$. In the aforementioned regression, $BS = 1$, hence $\beta = \frac{y_{t} \times 1}{x_{t}}$ is the optimal stake only for bets of £1; otherwise it would be $S \times \beta$. More importantly, using the estimated $\beta$ to find an approximation of the optimal stake makes no sense when you can simply find the optimal stake with the aforementioned equation.*
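A worked example of the hedge stake formula $LS = \frac{BS \times BP}{LP}$, assuming zero commission. The function name and the prices are hypothetical and chosen so the arithmetic is easy to follow.

```python
def lay_stake_for_equal_profit(back_price, back_stake, lay_price):
    """Lay stake that equalises profit across outcomes (zero commission assumed)."""
    return back_stake * back_price / lay_price

bp, bs, lp = 6.0, 10.0, 5.0  # backed at 6.0 for £10, price has since shortened to 5.0
ls = lay_stake_for_equal_profit(bp, bs, lp)

# If the horse wins: collect on the back bet, pay out on the lay bet
profit_if_wins = (bp - 1) * bs - (lp - 1) * ls
# If the horse loses: keep the lay stake, lose the back stake
profit_if_loses = ls - bs

print(ls, profit_if_wins, profit_if_loses)
```

Here the lay stake is £12 and the profit is £2 whether the horse wins or loses, so the position no longer depends on the race outcome.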

**2.3 - Trading rules**

Timing rules must be added. 

Herlemont's basic rule is "to open a position when the ratio of two share prices hits the 2 rolling standard deviation [difference from the 130-day rolling mean] and close it when the ratio returns to the mean."

To avoid opening a position on stocks that are deviating from the mean and are going to deviate further, Herlemont describes that "the position is not opened when the ratio breaks the two-standard-deviations limit for the first time, but rather when it crosses it to revert to the mean again."
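A minimal sketch of this re-entry rule in plain Python. It assumes a fixed mean and standard deviation taken from some earlier formation window, whereas Herlemont uses rolling values. The function name, the series and the band width `k` are illustrative.

```python
def reentry_signals(series, mean, sd, k=2.0):
    """Indices where the series moves from outside the k-sd band back inside it."""
    signals = []
    for t in range(1, len(series)):
        was_outside = abs(series[t - 1] - mean) > k * sd
        is_inside = abs(series[t] - mean) <= k * sd
        if was_outside and is_inside:
            signals.append(t)
    return signals

# Illustrative spread series with mean 0 and standard deviation 1
spread = [0.0, 1.0, 2.5, 3.0, 1.8, 0.5, -2.2, -1.0]

print(reentry_signals(spread, mean=0.0, sd=1.0))  # [4, 7]
```

No position is opened at index 2, where the band is first broken, nor at index 3, where the spread moves further away. The signal fires at index 4, when the spread crosses back inside the band.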

This can be achieved with the horse odds, of course on far smaller time scales. The current dataset is in 5-minute intervals for the three hours before a race; this should likely be expanded.

Stop losses should be included and trade length should also be limited.

Rules:
1. Trade on pairs whose spread is reapproaching the mean from a deviated position
2. Stop loss at x% of the initial position
3. Don't hold open pairs trades for longer than x hours. 

It should be possible to quantify the average length of time required for a mean reversion and therefore the maximum logical time to hold open a position by looking at past data.
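One possible way to put a number on reversion time, offered here as an assumption rather than something taken from Herlemont or from the data: if the spread follows the $AR(1)$ process above with $0 < \theta < 1$, the expected deviation from the mean decays as $\theta^{k}$ after $k$ periods, which gives a half-life of $-\ln 2 / \ln \theta$ periods. The function name is hypothetical.

```python
import math

def ar1_half_life(theta):
    """Periods for the expected deviation of an AR(1) process to halve."""
    if not 0.0 < theta < 1.0:
        raise ValueError("half-life is only defined here for 0 < theta < 1")
    return -math.log(2.0) / math.log(theta)

# Faster reversion for smaller theta, slower as theta approaches 1
for theta in (0.5, 0.8, 0.95):
    print(theta, round(ar1_half_life(theta), 2))
```

A maximum holding period could then be set as some multiple of the estimated half-life. Measuring observed reversion times directly from past races would be an alternative that does not rely on the $AR(1)$ assumption.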

**2.4 - Other tests and considerations**

1. It should be ensured that the regression results of one price on another are not spurious (as with the regression in 2.5). $\beta$ could be statistically meaningless if it is, meaning that it makes no sense to use it.
2. I will also test whether $y_{t} = c + \theta y_{t-1} + \varepsilon_{t}$ is $I(1)$, or difference stationary. If we can rule this out, this gives more confidence in the 'weak-stationarity' of the spread over time.
3. I will look out for pairs that pass the DF test yet have $\omega$ close to 0 (equivalently, $\theta = 1 + \omega$ close to 1). They will have many features of a random walk, so the pairs exercise might be meaningless.
4. Structural breaks (in this case, large instantaneous jumps in the spread) may make series that are stationary on either side of the break appear non-stationary. This is hard to account for in testing. 

In [48]:
# New Sample Function

def sample_dataframe():
    sample_df = df[df['MarketId'] == df['MarketId'].sample(1).item()]
    sample_df = sample_df.drop_duplicates() # reassign rather than modify a slice in place (avoids SettingWithCopyWarning)

    bp_df = sample_df[['SelectionId'] + back_prices].copy()
    new_cols = bp_df.columns.str.replace("[BP:T]", "", regex=True).str.replace("[+]", "", regex=True) # patterns are regex character classes, so regex=True is set explicitly
    bp_df.rename(columns = dict(zip(bp_df.columns, new_cols)), inplace = True)
    bp_t_df = bp_df.T.copy()
    bp_t_df.columns = ["h" + str(int(column)) for column in bp_t_df.iloc[0]]
    bp_t_df = bp_t_df.iloc[1:-15] # using the 60 pre-off price data points
    bp_t_df.reset_index(drop=True, inplace=True)

    lp_df = sample_df[['SelectionId'] + lay_prices].copy()
    new_cols = lp_df.columns.str.replace("[LP:T]", "", regex=True).str.replace("[+]", "", regex=True) # patterns are regex character classes, so regex=True is set explicitly
    lp_df.rename(columns = dict(zip(lp_df.columns, new_cols)), inplace = True)
    lp_t_df = lp_df.T.copy()
    lp_t_df.columns = ["h" + str(int(column)) for column in lp_t_df.iloc[0]] #rename columns to horse ids
    lp_t_df = lp_t_df.iloc[1:-15] #remove horse ids, remove inplay data
    lp_t_df.reset_index(drop=True, inplace=True)

    #taking mid point df
    mid_df = bp_t_df.add(lp_t_df, fill_value=0) / 2
    
    #using log price data <-- This is where the decision to take only the first 30 time periods for analysis is made
    log_mid = np.log(mid_df[:30]).copy()
    
    return bp_t_df, lp_t_df, mid_df, log_mid

In [49]:
# Full Dickey Fuller setup and test; will work for any dataframe where the horses to pair up each have their own column and the heading is some horse identifier

def dickey_fuller_test(log_horse_prices): #log prices are required since stationarity relates to relative movements of two horses, not absolute movements
    
    # Create a dataframe where each column is log(horse a's prices) - log(horse b's prices). one new column for all n(n-1)/2 possible pairs
    combos = list(itertools.combinations(log_horse_prices.columns, 2))

    # Create a dataframe for the Dickey Fuller test where the data in each column is log(A/B), the prices of each horse in the possible pair
    for pair in combos:
        if pair == combos[0]:
            new_series = log_horse_prices[pair[0]] - log_horse_prices[pair[1]]
            dickey_fuller_df = pd.DataFrame(new_series)
        else:
            new_series = log_horse_prices[pair[0]] - log_horse_prices[pair[1]]
            dickey_fuller_df = pd.concat([dickey_fuller_df, new_series], axis=1)

    dickey_fuller_df.columns = [pair[0] + "_" + pair[1] for pair in combos] 
    dickey_fuller_df['const'] = 1

    # Performing the Dickey Fuller test on each column and returning the results in dickey_fuller_results_df. The results df gives the pair identifier, the estimated coefficient and the test statistic (stored under the key 'critical_value')
    dickey_fuller_results = {'pair' : [], 'coef' : [], 'critical_value' : []}

    for column in dickey_fuller_df:
        if column == 'const':
            continue
        reg = sm.OLS(endog = dickey_fuller_df[column].diff(), exog = dickey_fuller_df[['const', column]].shift(1), missing = 'drop')
        results = reg.fit()
        dickey_fuller_results['pair'].append(column)
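        dickey_fuller_results['coef'].append(results.params.iloc[1]) # coefficient on the lagged level (positional access via .iloc)
        dickey_fuller_results['critical_value'].append(results.tvalues.iloc[1]) # t-statistic on the lagged level, i.e. the DF test statistic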

    dickey_fuller_results_df = pd.DataFrame(dickey_fuller_results)
    
    return dickey_fuller_results_df

In [50]:
# Grab the viable pairs from the Dickey Fuller test results table. Main objective to define 'pairs_df'

def race_pairs(results_df, significance_level = 0.01):

    if significance_level == 0.01: #note - these critical values are for a T-dimension of 50. They shrink in magnitude as T grows
        critical_value = - 3.58
        
    elif significance_level == 0.05:
        critical_value = - 2.93
        
    else:
        raise ValueError("Please input significance level as 0.01 or 0.05") # previously printed a message and then failed on an undefined critical_value
        
    if results_df['critical_value'].min() < critical_value:
        pairs_df = results_df.loc[results_df['critical_value'] < critical_value].copy() #all possible pairs
        return pairs_df 

In [51]:
# # Betting function determines based on the spread and deviation which sides to back and lay

# def bet(idx_open, time_open):
#     if (pair_df['spread'].iloc[open_trade_idx] > pair_spread_mean) and (pair_spread_mean > 0) or (pair_df['spread'].iloc[open_trade_idx] > pair_spread_mean) and (pair_spread_mean < 0):
#         #back to lay A (short)
#         bp_a = pair_df[horse_a + "_bp"].iloc[idx_open]
#         lp_a = pair_df[horse_a + "_lp"].iloc[idx_open + time_open]

#         win_side_a, loss_side_a = payout(bp_a, 1, lp_a, '?')

#         #lay to back X (long)
#         lp_b = pair_df[horse_b + "_lp"].iloc[idx_open]
#         bp_b = pair_df[horse_b + "_bp"].iloc[idx_open + time_open]

#         win_side_b, loss_side_b = payout(bp_b, '?', lp_b, 1)

#         return win_side_a, win_side_b

#     elif (pair_df['spread'].iloc[open_trade_idx] < pair_spread_mean) and (pair_spread_mean < 0) or (pair_df['spread'].iloc[open_trade_idx] < pair_spread_mean) and (pair_spread_mean > 0): 
#         #lay to back A (long)
#         lp_a = pair_df[horse_a + "_lp"].iloc[idx_open]
#         bp_a = pair_df[horse_a + "_bp"].iloc[idx_open + time_open]

#         win_side_a, loss_side_a = payout(bp_a, '?', lp_a, 1)

#         #back to lay A (short)
#         bp_b = pair_df[horse_b + "_bp"].iloc[idx_open]
#         lp_b = pair_df[horse_b + "_lp"].iloc[idx_open + time_open]

#         win_side_b, loss_side_b = payout(bp_b, 1, lp_b, '?')

#         return win_side_a, win_side_b

In [52]:
# Betting function determines based on the spread and deviation which sides to back and lay
# This function is used as a filter for betting

def bet_prof_pc(bp_a, lp_a, bp_b, lp_b, spread, weight_a, weight_b):
    
    #weighted stakes
    stake_a_o = weight_a * stake
    stake_b_o = weight_b * stake     
    
    if (spread > pair_spread_mean) and (pair_spread_mean > 0) or (spread > pair_spread_mean) and (pair_spread_mean < 0):
        #back to lay A (short)
        payoff_a, stake_a_c = payout(bp_a, stake_a_o, horse_a_mean_lp, '?')

        #lay to back B (long)
        payoff_b, stake_b_c = payout(bp_b, '?', horse_b_mean_bp, stake_b_o)

        prof_pc = 100 * (payoff_a + payoff_b) / (stake_a_o + stake_b_o + stake_a_c + stake_b_c)
        return prof_pc

    elif (spread < pair_spread_mean) and (pair_spread_mean < 0) or (spread < pair_spread_mean) and (pair_spread_mean > 0): 
        #lay to back A (long)
        payoff_a, stake_a_c = payout(horse_a_mean_bp, '?', lp_a, stake_a_o)

        #back to lay B (short)
        payoff_b, stake_b_c = payout(bp_b, stake_b_o, horse_b_mean_lp, '?')

        prof_pc = 100 * (payoff_a + payoff_b) / (stake_a_o + stake_b_o + stake_a_c + stake_b_c)
        return prof_pc

In [53]:
# Betting function determines based on the spread and deviation which sides to back and lay
# This function is used for pairs trade payoffs

def bet(open_idx, close_idx, stake_a, stake_b):
    if (pair_df['spread'].iloc[open_idx] > pair_spread_mean) and (pair_spread_mean > 0) or (pair_df['spread'].iloc[open_idx] > pair_spread_mean) and (pair_spread_mean < 0):
        #back to lay A (short)
        bp_a = pair_df[horse_a + "_bp"].iloc[open_idx]
        lp_a = pair_df[horse_a + "_lp"].iloc[close_idx]

        payoff_a, stake_a_c = payout(bp_a, stake_a, lp_a, '?')

        #lay to back B (long)
        lp_b = pair_df[horse_b + "_lp"].iloc[open_idx]
        bp_b = pair_df[horse_b + "_bp"].iloc[close_idx]

        payoff_b, stake_b_c = payout(bp_b, '?', lp_b, stake_b)

        return payoff_a, payoff_b, stake_a_c, stake_b_c

    elif (pair_df['spread'].iloc[open_idx] < pair_spread_mean) and (pair_spread_mean < 0) or (pair_df['spread'].iloc[open_idx] < pair_spread_mean) and (pair_spread_mean > 0): 
        #lay to back A (long)
        lp_a = pair_df[horse_a + "_lp"].iloc[open_idx]
        bp_a = pair_df[horse_a + "_bp"].iloc[close_idx]

        payoff_a, stake_a_c = payout(bp_a, '?', lp_a, stake_a)

        #back to lay B (short)
        bp_b = pair_df[horse_b + "_bp"].iloc[open_idx]
        lp_b = pair_df[horse_b + "_lp"].iloc[close_idx]

        payoff_b, stake_b_c = payout(bp_b, stake_b, lp_b, '?')

        return payoff_a, payoff_b, stake_a_c, stake_b_c

**Monte Carlo simulation**

n repetitions of the above with profit summed over all trades.

What is going on below? (Ignoring ## stats code and text printouts)

1. The number of iterations to simulate, n, is defined

2. The amount of time a trade is kept open, k, is defined

3. `sample_dataframe()` is used to grab a new random race
    
    Within this function, the lay prices, back prices, mid point prices and log prices (used in the Dickey Fuller [DF] tests) are defined
    This would be the place to alter for manipulations made in the price data. For example, changing which log prices are used in the DF test.
    
4. `dickey_fuller_test()` is used to perform a DF test and create a dataframe with pair identifiers and test results. It would probably make sense to move the pair identifier code out of this function

5. `race_pairs()` is used to filter the DF tests for only those where there looks to be cointegration at the 1% or 5% significance level

6. Before the trading strategy code, the iteration is aborted if the race pairs dataframe is empty (no pairs in that race)

The trading strategy code

7. Firstly, races where more than 5 pairs are found are rejected. This is an arbitrary rule based on the suspicion that in some races far too many horses erroneously look like pairs. **Considering changing this**

8. Then, iterating through each potential pair in the given race, a dataframe of time series of each horse's prices and the pair's spread (in actual prices) is created.

9. The strategy part of the code creates a column in the pairs dataframe with a value equal to 1 when a trade should be opened, 0.5 when it is open and -1 when it should be closed, 0 otherwise. It then sets up a dictionary of bet open and close indices to be used to grab the prices from the pair dataframe when doing the bets

10. The bets part of the code creates a weighted stake given the horses' deviations from their mean prices at the opening index and uses the `bet()` function to calculate the profit from opening and closing bets at those indices.
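The signal column described in step 9 can be sketched in plain Python as follows. This is a simplified stand-in for the pandas `diff()` and `np.where` logic in the notebook, not a copy of it. The function name and the example indicator are hypothetical, and the sketch assumes no trade is open before the first observation.

```python
def trade_signals(indicator):
    """Map a 0/1 deviation indicator to 1 (open), 0.5 (held open), -1 (close), 0 (flat)."""
    signals = []
    previous = 0  # assumption: no trade is open before the first observation
    for current in indicator:
        if current == 1 and previous == 0:
            signals.append(1.0)    # indicator switches on: open a trade
        elif current == 1 and previous == 1:
            signals.append(0.5)    # indicator stays on: trade remains open
        elif current == 0 and previous == 1:
            signals.append(-1.0)   # indicator switches off: close the trade
        else:
            signals.append(0.0)    # nothing open
        previous = current
    return signals

indicator = [0, 1, 1, 1, 0, 0, 1, 0]
signals = trade_signals(indicator)

# Dictionary of open and close indices, as used when placing the bets
open_close = {
    "open": [i for i, s in enumerate(signals) if s == 1.0],
    "close": [i for i, s in enumerate(signals) if s == -1.0],
}
print(signals)
print(open_close)  # {'open': [1, 6], 'close': [4, 7]}
```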


Trading rules:
###

In [58]:
#CLEAN VERSION WITH LESS STATS

# Stats
profit = 0
num_pairs = 0
pairs_traded = 0
profitable_trades = 0
num_trades_total = 0
#

# Beginning of the Monte Carlo solution

n = 100 #number of iterations
stake = 1 #total stake on opening side of bets

for i in range(n):
    if (i + 1) % 100 == 0:
        print(f"{i+1} of {n} iterations completed.")
    
    bp_t_df, lp_t_df, mid_df, log_mid = sample_dataframe()

    dickey_fuller_results_df = dickey_fuller_test(log_mid)
    
    pairs_df = race_pairs(dickey_fuller_results_df, 0.01) # returns None when no pair passes the test
    if not isinstance(pairs_df, pd.DataFrame): #i.e. if there are no pairs, skip to the next iteration
        continue
    num_pairs += len(pairs_df.index)

    
    # The trading strategy code    
    if len(pairs_df.index) < 6:  #TRADING RULE: REJECT RACES WHERE MORE THAN 5 PAIRS ARE FOUND
        for id_id in pairs_df['pair']:

            # PAIR DATAFRAME SETUP
            
            # Grabbing identifying details for the given pair
            pair_index = pairs_df.index[pairs_df['pair'] == id_id].item()
            pair_ids = pairs_df['pair'].loc[pair_index]
            pair_coef = pairs_df['coef'].loc[pair_index]
            pair_cv = pairs_df['critical_value'].loc[pair_index]
            horse_a = pair_ids.split("_", 1)[0]
            horse_b = pair_ids.split("_", 1)[1]
            # Creating the prices dataframe for the given pair
            pair_df = bp_t_df[[horse_a, horse_b]] #prices dataframe
            pair_df = pd.concat([pair_df, lp_t_df[[horse_a, horse_b]]], axis=1) #with lay prices as well
            pair_df.columns = [horse_a + "_bp", horse_b + "_bp", horse_a + "_lp", horse_b + "_lp"]
            pair_df['spread'] = mid_df[horse_a] - mid_df[horse_b]

            # Creating price filters for trades and key variables
            pair_spread_sd = np.std(pair_df['spread'][0:29], ddof = 1)
            pair_spread_mean = pair_df['spread'][0:29].mean()
            
            horse_a_mean_bp = pair_df[horse_a + "_bp"][0:29].mean()
            horse_a_mean_lp = pair_df[horse_a + "_lp"][0:29].mean()
            horse_a_mean_mid = (horse_a_mean_bp + horse_a_mean_lp) / 2
            horse_b_mean_bp = pair_df[horse_b + "_bp"][0:29].mean()
            horse_b_mean_lp = pair_df[horse_b + "_lp"][0:29].mean()
            horse_b_mean_mid = (horse_b_mean_bp + horse_b_mean_lp) / 2
            
        
            # TRADING STRATEGY SETUP
            # Trade indicators     
            
            # Standard deviations indicator for opening and closing bets - size of deviations from mean. Bets are opened when this becomes True and close when it stops being True
            pair_df['deviation_1sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 1 * pair_spread_sd, 1, 0)

            # The 3 / 3.5 / 4 standard deviation threshold is a bit arbitrary, but I need some threshold for whether the deviations are a bit too big and it looks like the series has lost its pair characteristics
            pair_df['spread_too_big'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 3.5 * pair_spread_sd, 1, 0) 

            # Abort pair if price of one horse on average is over some threshold
            if (horse_a_mean_mid > 50) or (horse_b_mean_mid > 50):
                continue
                
            # Close a bet if the deviation gets too big = dont bet on a pair if too many values are too far from the mean spread
            # Can't think of a better way to do this without messing up my open close setup
            if sum(pair_df['spread_too_big']) > 20:
                continue
             
            # Trade open and close points
            
            
            # STRATEGY 2: BETTING BASED ON HEDGE PROFITS TO MEAN PRICES
            # Percentage deviation of each horse from its mean price for stake weighting
            pair_df['a_deviation_from_mean'] = 100 * (mid_df[horse_a] - (horse_a_mean_mid)) / horse_a_mean_mid
            pair_df['b_deviation_from_mean'] = 100 * (mid_df[horse_b] - (horse_b_mean_mid)) / horse_b_mean_mid
            pair_df['weight_a'] = abs(pair_df['a_deviation_from_mean']) / (abs(pair_df['a_deviation_from_mean']) + abs(pair_df['b_deviation_from_mean']))
            pair_df['weight_b'] = abs(pair_df['b_deviation_from_mean']) / (abs(pair_df['a_deviation_from_mean']) + abs(pair_df['b_deviation_from_mean']))   
            
            # Profit to mean indicator for opening bets. Then create a variable equal to 1 (else 0) if a hedge starting at the price and closing at the mean would give an X% margin
            pair_df['profit_pc_hedge_to_mean'] = pair_df.apply(lambda x: bet_prof_pc(x[horse_a + "_bp"], x[horse_a + "_lp"], x[horse_b + "_bp"], x[horse_b + "_lp"], x['spread'], x['weight_a'], x['weight_b']), axis=1)
            
            # Create open and close dictionary, opening bets where the prior column passes 1.5, and closing out the next time its less than 0.5, greater than 10 or the end of the df
            openidx = 0
            closeidx = 0
            open_close_dict_2 = {'open' : [], 'close' : []}
            idx0 = 30
            stoploss = 0
            breakiteration = 0
            while 30 <= idx0 < 60 and stoploss == 0 and breakiteration == 0:
                try: openidx = pair_df[idx0:][pair_df['profit_pc_hedge_to_mean'][idx0:] > 1.5].index[0] #find first value above 1.5
                except: 
                    breakiteration = 1
                try:
                    try: 
                        closeidx = pair_df[openidx + 1:][pair_df['profit_pc_hedge_to_mean'][openidx + 1:] > 10].index[0] #stop loss
                        stoploss = 1 # break while loop
                    except: closeidx = pair_df[openidx + 1:][pair_df['profit_pc_hedge_to_mean'][openidx + 1:] < 0.5].index[0] #find the index of the first value below 0.5 after the first above X
                except: closeidx = 59 #if there is no close index before 59, set it equal to 59. code won't get here unless there was an open index
                if openidx != 0:
                    open_close_dict_2['open'].append(openidx)
                    open_close_dict_2['close'].append(closeidx)
                    idx0 = closeidx + 1 #redo, starting from the index after the last close  
            if breakiteration == 1:
                continue #go on to next pair if no trading points are found
            
            # STRATEGY 1: BETTING BASED ON CURRENT SPREAD DEVIATIONS FROM MEAN
            # Gives = 1 to open a bet, and -1 to close a bet for the True/False setup - opening upon change to True, closing on change to False
            pair_df['open_close_bets'] = pair_df['deviation_1sd'].diff() 
            # Gives rows where bets are open value of 0.5
            pair_df['open_close_bets'] = np.where((pair_df['deviation_1sd'] == 1) & (pair_df['open_close_bets'] == 0), 0.5, pair_df['open_close_bets'])
            # Open at 30 if a bet should be ongoing
            pair_df['open_close_bets'] = np.where((pair_df.index == 30) & (pair_df['open_close_bets'] == 0.5), 1, pair_df['open_close_bets'])            
            # Close at 59 if the last bet doesnt close before
            pair_df['open_close_bets'] = np.where((pair_df.index == len(pair_df.index) - 1) & (pair_df['open_close_bets'] == 0.5), -1, pair_df['open_close_bets'])
            # Collect open and close indices. There is definitely a better way to do this if I can get the pair_df indices in the list rather than grabbing indices from a new np list
            open_bets_idx = list(np.where(pair_df['open_close_bets'][30:] == 1)[0] + 30)
            close_bets_idx = list(np.where(pair_df['open_close_bets'][30:] == -1)[0] + 30)
            # Create dictionary of pair open and close indices
            open_close_dict = {'open' : open_bets_idx, 'close' : close_bets_idx}
            # Dont do bets if there are no indices to open or close
            if (len(open_bets_idx) == 0) or (len(close_bets_idx) == 0):
                continue
            
            
            # BETS            

            # Change dictionary depending on strategy
            for o, c in zip(open_close_dict_2['open'], open_close_dict_2['close']):
                # Stakes for A and B weighted based on deviation to mean
                stake_a_o = pair_df['weight_a'].iloc[o] * stake
                stake_b_o = pair_df['weight_b'].iloc[o] * stake                
                
                # Add profits, return stakes on the other side of the bets (for rate of return)
                win_side_a, win_side_b, stake_a_c, stake_b_c = bet(o, c, stake_a_o, stake_b_o)
                
                # Stats
                profit += win_side_a + win_side_b
                print(win_side_a + win_side_b)
                if (win_side_a + win_side_b) > 0:
                    profitable_trades += 1
            
            # Stats
            num_trades_total += len(open_close_dict_2['open'])
            pairs_traded += 1
                

print(f"Profit over {n} random race markets = {profit}. {num_pairs} pairs found, {pairs_traded} pairs traded and {num_trades_total} pairs trades made. {profitable_trades} of {num_trades_total} were profitable.")



-0.12529859119192
0.10627685888414629
-0.1490650398667005
-0.004405552277750913
-0.035032918126945145
0.00120656571265787
-0.014826485971612069
-0.0293356344032194
0.019775915001836175
0.054844942213502
-0.03494851860294479
0.05812046280337624
0.038120393261587004
0.031656740206758904
-0.20059377702114256
0.10608994758863854
0.0033658449825459957
0.16418419914421767
-0.1575551644461315
0.0031410158896454377
-0.19658619602555216
-0.1426373801272654
-0.12873743667502224
-0.08948686282143425
-0.19182155113518898
-0.15913595129048685
-0.1770047481033039
-0.09273918274051196
-0.2289582381493045
0.20888336306472732
0.19105778020173236
-0.09732573507969433
100 of 100 iterations completed.
Profit over 100 random race markets = -1.2687709351007594. 95 pairs found, 28 pairs traded and 32 pairs trades made. 13 of 32 were profitable.
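
The per-trade figures printed above can be condensed into a few headline statistics. The following is a minimal sketch rather than part of the original notebook: `summarise_trades` is a hypothetical helper, and the list simply copies the 32 per-trade profits from the output above.

```python
def summarise_trades(trade_profits):
    """Summarise a list of per-trade profits (hypothetical helper, not in the notebook)."""
    n_trades = len(trade_profits)
    n_profitable = sum(1 for p in trade_profits if p > 0)
    total = sum(trade_profits)
    return {
        'n_trades': n_trades,
        'n_profitable': n_profitable,
        'win_rate': n_profitable / n_trades if n_trades else 0.0,
        'total_profit': total,
        'mean_profit': total / n_trades if n_trades else 0.0,
    }

# Per-trade profits copied from the printed output of the run above
trade_profits = [
    -0.12529859119192, 0.10627685888414629, -0.1490650398667005, -0.004405552277750913,
    -0.035032918126945145, 0.00120656571265787, -0.014826485971612069, -0.0293356344032194,
    0.019775915001836175, 0.054844942213502, -0.03494851860294479, 0.05812046280337624,
    0.038120393261587004, 0.031656740206758904, -0.20059377702114256, 0.10608994758863854,
    0.0033658449825459957, 0.16418419914421767, -0.1575551644461315, 0.0031410158896454377,
    -0.19658619602555216, -0.1426373801272654, -0.12873743667502224, -0.08948686282143425,
    -0.19182155113518898, -0.15913595129048685, -0.1770047481033039, -0.09273918274051196,
    -0.2289582381493045, 0.20888336306472732, 0.19105778020173236, -0.09732573507969433,
]

summary = summarise_trades(trade_profits)
print(summary)
```

The count of profitable trades and the total should agree with the summary line printed by the run (13 of 32 profitable), which is a quick check that no trade was dropped from the statistics.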


In [39]:
pair_df[30:]

Unnamed: 0,h25377532_bp,h26578373_bp,h25377532_lp,h26578373_lp,spread,deviation_1sd,spread_too_big,a_deviation_from_mean,b_deviation_from_mean,weight_a,weight_b,profit_pc_hedge_to_mean,open_close_bets
30,5.3,30.0,5.4,32.0,-25.65,0,0,1.954986,-15.367127,0.112861,0.887139,-7.036252,0.0
31,5.25,30.0,5.35,32.0,-25.7,0,0,1.002136,-15.367127,0.061221,0.938779,-7.444372,0.0
32,5.2,30.0,5.3,32.0,-25.75,0,0,0.049285,-15.367127,0.003197,0.996803,-7.841257,0.0
33,5.2,30.0,5.3,32.0,-25.75,0,0,0.049285,-15.367127,0.003197,0.996803,-7.841257,0.0
34,5.2,30.8,5.3,33.05,-26.675,0,0,0.049285,-12.841791,0.003823,0.996177,-6.532584,0.0
35,5.1,30.0,5.2,32.1,-25.9,0,0,-1.856415,-15.230622,0.108645,0.891355,-7.27171,0.0
36,5.1,30.0,5.2,32.0,-25.85,0,0,-1.856415,-15.367127,0.107784,0.892216,-7.276433,0.0
37,5.04,30.0,5.14,32.0,-25.91,0,0,-2.999836,-15.367127,0.163328,0.836672,-7.062821,0.0
38,5.1,30.0,5.2,32.0,-25.85,0,0,-1.856415,-15.367127,0.107784,0.892216,-7.276433,0.0
39,5.1,30.0,5.2,32.0,-25.85,0,0,-1.856415,-15.367127,0.107784,0.892216,-7.276433,0.0
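
The `weight_a` and `weight_b` columns in the table above appear to be each horse's share of the total absolute deviation from its mean. The sketch below reproduces two rows under that assumption; `deviation_weights` is a hypothetical name and the formula is inferred from the printed values, not taken from the original weighting code.

```python
import pandas as pd

def deviation_weights(a_dev, b_dev):
    """Split a stake in proportion to each horse's absolute deviation from its mean.

    Assumed formula: weight_a = |a_dev| / (|a_dev| + |b_dev|), and likewise for b.
    Returns NaN where both deviations are zero.
    """
    abs_a, abs_b = a_dev.abs(), b_dev.abs()
    total = abs_a + abs_b
    return abs_a / total, abs_b / total

# Rows 30 and 35 of the table above
rows = pd.DataFrame(
    {'a_deviation_from_mean': [1.954986, -1.856415],
     'b_deviation_from_mean': [-15.367127, -15.230622]},
    index=[30, 35],
)

weight_a, weight_b = deviation_weights(rows['a_deviation_from_mean'],
                                       rows['b_deviation_from_mean'])
print(pd.DataFrame({'weight_a': weight_a, 'weight_b': weight_b}).round(6))
```

Because the weights sum to one, multiplying each by `stake` splits a fixed total stake between the two horses.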


In [40]:
openidx = 0
closeidx = 0
open_close_dict_2 = {'open' : [], 'close' : []}

idx0 = 30
stoploss = 0
while 30 <= idx0 < 60 and stoploss == 0:

    # Open: first period from idx0 onwards where profit_pc_hedge_to_mean is above 1.5
    try:
        openidx = pair_df[idx0:][pair_df['profit_pc_hedge_to_mean'][idx0:] > 1.5].index[0]
    except IndexError:
        break # no further open signal; stop here so the previous trade is not recorded twice

    # Close: the stop loss (value above 10) is checked first and ends the loop,
    # otherwise close at the first period after the open where the value falls below 0.5
    try:
        try:
            closeidx = pair_df[openidx + 1:][pair_df['profit_pc_hedge_to_mean'][openidx + 1:] > 10].index[0] #stop loss
            stoploss = 1 # break while loop
        except IndexError:
            closeidx = pair_df[openidx + 1:][pair_df['profit_pc_hedge_to_mean'][openidx + 1:] < 0.5].index[0]
    except IndexError:
        closeidx = 59 # if there is no close index before 59, close at 59

    open_close_dict_2['open'].append(openidx)
    open_close_dict_2['close'].append(closeidx)
    idx0 = closeidx + 1 # redo, starting from the index after the last close
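
The open/close search in the cell above can also be written as a reusable function. This is a sketch under stated assumptions, not the notebook's method: `find_trades` and its parameter names are hypothetical, the series is assumed to have a default integer index, and the exit is taken as whichever of the take-profit or stop-loss levels is reached first in time, whereas the cell above checks the stop loss first.

```python
import pandas as pd

def find_trades(signal, start=30, end=59, open_above=1.5, close_below=0.5, stop_above=10.0):
    """Return {'open': [...], 'close': [...]} index labels for each trade (hypothetical helper)."""
    trades = {'open': [], 'close': []}
    idx = start
    while idx <= end:
        window = signal.loc[idx:end]
        opens = window[window > open_above]
        if opens.empty:
            break  # no further open signal
        open_idx = opens.index[0]

        # First period after the open that hits either exit level; force a close at `end` otherwise
        after = signal.loc[open_idx + 1:end]
        exits = after[(after < close_below) | (after > stop_above)]
        close_idx = exits.index[0] if not exits.empty else end

        trades['open'].append(open_idx)
        trades['close'].append(close_idx)

        if signal.loc[close_idx] > stop_above:
            break  # stop loss hit: no more trades on this pair
        idx = close_idx + 1
    return trades

# Synthetic signal, for illustration only (not data from the notebook)
signal = pd.Series(0.0, index=range(60))
signal.loc[[35, 36, 37, 40, 41, 50]] = [2.0, 1.0, 0.3, 1.8, 12.0, 2.0]

trades = find_trades(signal)
print(trades)
```

In the synthetic example the second trade is closed by the stop loss at index 41, so the later open signal at index 50 is ignored.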

In [41]:
print(open_close_dict_2)

{'open': [51], 'close': [53]}


In [42]:
pair_df

Unnamed: 0,h25377532_bp,h26578373_bp,h25377532_lp,h26578373_lp,spread,deviation_1sd,spread_too_big,a_deviation_from_mean,b_deviation_from_mean,weight_a,weight_b,profit_pc_hedge_to_mean,open_close_bets
0,5.2,56.75,5.3,70.5,-58.375,1,1,0.049285,73.702147,0.000668,0.999332,19.60317,
1,5.2,42.0,5.3,49.95,-40.725,1,0,0.049285,25.516011,0.001928,0.998072,4.808043,0.5
2,5.2,40.0,5.3,42.3,-35.9,0,0,0.049285,12.343314,0.003977,0.996023,2.369341,-1.0
3,5.2,39.85,5.3,43.3,-36.325,0,0,0.049285,13.503603,0.003637,0.996363,2.183374,0.0
4,5.2,38.2,5.3,43.7,-35.7,0,0,0.049285,11.797295,0.00416,0.99584,0.076323,0.0
5,5.2,38.0,5.3,43.35,-35.425,0,0,0.049285,11.04652,0.004442,0.995558,-0.185284,0.0
6,5.2,38.0,5.3,42.0,-34.75,0,0,0.049285,9.203707,0.005326,0.994674,-0.186016,0.0
7,5.2,38.0,5.3,40.05,-33.775,0,0,0.049285,6.541867,0.007478,0.992522,-0.187794,0.0
8,5.2,37.25,5.3,40.0,-33.375,0,0,0.049285,5.44983,0.008962,0.991038,-1.176625,0.0
9,5.2,36.85,5.3,39.75,-33.05,0,0,0.049285,4.56255,0.010687,0.989313,-1.710076,0.0
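
The `spread` column in the table above is consistent with the difference between the two horses' mid prices, taking the mid price as the average of the back and lay prices. The sketch below checks this on the first two rows; `mid_spread` is a hypothetical helper and the mid-price definition is an assumption inferred from the printed values.

```python
import pandas as pd

def mid_spread(df, horse_a, horse_b):
    """Spread between the mid prices of two horses (assumes mid = (back + lay) / 2)."""
    mid_a = (df[horse_a + '_bp'] + df[horse_a + '_lp']) / 2
    mid_b = (df[horse_b + '_bp'] + df[horse_b + '_lp']) / 2
    return mid_a - mid_b

# Rows 0 and 1 of the table above
prices = pd.DataFrame({
    'h25377532_bp': [5.2, 5.2],
    'h26578373_bp': [56.75, 42.0],
    'h25377532_lp': [5.3, 5.3],
    'h26578373_lp': [70.5, 49.95],
})

spread = mid_spread(prices, 'h25377532', 'h26578373')
print(spread)
```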


In [43]:
bp_t_df

Unnamed: 0,h10839473,h19774051,h21214401,h22867125,h7234999,h9277430
0,3.8,13.5,7.8,4.39,8.8,4.83
1,3.8,13.5,7.8,4.35,8.8,4.9
2,3.81,13.5,7.87,4.37,8.8,4.83
3,3.8,13.5,8.0,4.4,8.8,4.8
4,3.8,13.39,8.0,4.43,8.86,4.72
5,3.7,13.5,8.0,4.5,9.0,4.76
6,3.65,13.5,8.0,4.5,9.0,4.87
7,3.65,13.5,8.0,4.5,9.0,4.8
8,3.65,13.5,8.0,4.5,9.28,4.8
9,3.59,13.5,8.0,4.49,9.4,4.86


In [44]:
# #VERSION WITH A LOAD MORE STATS COLLECTED

# ########## stats
# profit = 0
# num_pairs = 0
# pairs_traded = 0
# num_trades_total = 0
# results_dict = {'pair' : [], 'mean_spread' : [], 'final_spread' : [], 'pair_cv' : [], 'pair_coef' : [], 'pairs_in_race' : [],
#                 'num_trades' : [], 'profitable_trades' : [], 'losing_trades' : [], 'pc_trades_prof' : [], 'pair_profit' : [], 'pair_profits_list' : []}
# ##########

# # Beginning of the Monte Carlo solution

# n = 200 #number of iterations
# k = 5 #number of 2-minute periods trade is kept open (i.e. time expected for mean reversion to occur). this is used in the bet() function below

# for i in range(n):
#     if (i + 1) % 100 == 0:
#         print(f"{i+1} of {n} iterations completed.")
    
#     bp_t_df, lp_t_df, mid_df, log_mid = sample_dataframe()

#     dickey_fuller_results_df = dickey_fuller_test(log_mid)
    
#     pairs_df = 0
#     pairs_df = race_pairs(dickey_fuller_results_df, 0.01)
#     if type(pairs_df) != pd.DataFrame: #i.e. if there are no pairs, reset to next iteration
#         continue
    
#     # The trading strategy code    
#     if len(pairs_df.index) < 6:  #TRADING RULE: REJECT RACES WHERE MORE THAN 5 PAIRS ARE FOUND
#         for id_id in pairs_df['pair']:

#             # Grabbing identifying details for the given pair
#             pair_index = pairs_df.index[pairs_df['pair'] == id_id].item()
#             pair_ids = pairs_df['pair'].loc[pair_index]
#             pair_coef = pairs_df['coef'].loc[pair_index]
#             pair_cv = pairs_df['critical_value'].loc[pair_index]
#             horse_a = pair_ids.split("_", 1)[0]
#             horse_b = pair_ids.split("_", 1)[1]
#             # Creating the prices dataframe for the given pair
#             pair_df = bp_t_df[[horse_a, horse_b]] #prices dataframe
#             pair_df = pd.concat([pair_df, lp_t_df[[horse_a, horse_b]]], axis=1) #with lay prices as well
#             pair_df.columns = [horse_a + "_bp", horse_b + "_bp", horse_a + "_lp", horse_b + "_lp"]
#             pair_df['spread'] = mid_df[horse_a] - mid_df[horse_b]

#             #Filtering criteria
#             pair_spread_sd = np.std(pair_df['spread'][0:29], ddof = 1)
#             pair_spread_mean = pair_df['spread'][0:29].mean()
#             pair_df['deviation_2sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 2 * pair_spread_sd, True, False)
#             pair_df['deviation_4sd'] = np.where(abs(pair_df['spread']) - abs(pair_spread_mean) > 4 * pair_spread_sd, True, False)            

#             #Create a dataframe of the rows where a trade could be opened
#             #If there is sufficient deviation anywhere, make the trades
#             open_trade_df = pair_df[pair_df['deviation_2sd'] == True].loc[30:] #only data after the first 30 periods
#             open_trade_df.drop_duplicates(inplace=True)

#             ########## stats
#             if len(open_trade_df.index) > 0:
#                 pairs_traded += 1
#             num_trades_pair = 0
#             profitable_trades_pair = 0
#             losing_trades_pair = 0
#             pair_profit = 0
#             pair_profits_list = []
#             ##########

#             #if there are indexes at which to make trades and trades can be completed, cycle through them
#             #TRADING RULE: IGNORE HORSES WHOSE SPREAD IS 4 SD OR MORE FROM THE MEAN (to avoid horses who have deviated too much)
#             while (len(open_trade_df.index) > 0) and ((open_trade_df.index[0] + k) < 59) and (open_trade_df['deviation_4sd'].loc[open_trade_df.index[0]] == False):
                
                
#                 open_trade_idx = open_trade_df.index[0]
#                 win_side_a, win_side_b = bet(open_trade_idx, k)

#                 #removes the traded rows from open_trade_df, leaving a 1-period gap between trades on a given pair; edit this to alter how often trades repeat
#                 open_trade_df = open_trade_df.loc[open_trade_idx + k + 1:] 
                

#                 ########## stats
#                 profit += win_side_a + win_side_b
#                 num_trades_total += 1
#                 num_trades_pair += 1
#                 if (win_side_a + win_side_b) > 0:
#                     profitable_trades_pair += 1
#                 else:
#                     losing_trades_pair += 1
#                 pair_profit += win_side_a + win_side_b
#                 pair_profits_list.append(round(win_side_a + win_side_b,2))
#                 ##########    

#             ########## stats
#             num_pairs += 1
#             results_dict['pair'].append(id_id)
#             results_dict['mean_spread'].append(pair_spread_mean)
#             results_dict['final_spread'].append(pair_df['spread'].loc[59])
#             results_dict['pair_cv'].append(pair_cv)
#             results_dict['pair_coef'].append(pair_coef)
#             results_dict['pairs_in_race'].append(len(pairs_df.index))
#             results_dict['num_trades'].append(num_trades_pair) 
#             results_dict['profitable_trades'].append(profitable_trades_pair) 
#             results_dict['losing_trades'].append(losing_trades_pair) 
#             try: results_dict['pc_trades_prof'].append((profitable_trades_pair / num_trades_pair) * 100)
#             except: results_dict['pc_trades_prof'].append(0)
#             results_dict['pair_profit'].append(pair_profit) 
#             results_dict['pair_profits_list'].append(pair_profits_list)
#             #average profit
#             #number of horses in race
#             ##########
    
#     #else: continue #move on to next race day if there are no stationary series  

# results_df = pd.DataFrame(results_dict)
# results_df = results_df[results_df['num_trades'] > 0].copy()
# #results_df.to_csv(data_dir.parents[0] / 'pairs_trade_results.csv', index = False, header=True)

# profitable_trades_total = results_df['profitable_trades'].sum()       
# print(f"Profit over {n} random race markets = {profit}. {num_pairs} pairs found, {pairs_traded} pairs traded and {num_trades_total} pairs trades made. {profitable_trades_total} of {num_trades_total} were profitable.")
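
The deviation filter in the commented-out cell compares `abs(spread)` with `abs(mean)`, so a spread that moves toward zero is never flagged. The sketch below shows one alternative that measures the distance from the mean directly. It is an assumption about what the filter might be intended to do, not the rule the notebook uses; `deviation_flags`, `n_train` and `k` are hypothetical names, and the 30-period training window differs from the `[0:29]` slice in the cell.

```python
import pandas as pd

def deviation_flags(spread, n_train=30, k=2.0):
    """Flag periods where the spread is more than k training-sample SDs from the training mean."""
    train = spread.iloc[:n_train]
    mean = train.mean()
    sd = train.std(ddof=1)  # sample standard deviation, as in the cell above
    return (spread - mean).abs() > k * sd

# Synthetic spread, for illustration only: 30 training periods around -26, then three test periods
spread = pd.Series([-27.0, -25.0] * 15 + [-26.5, -30.0, -22.0])

flags = deviation_flags(spread)
print(flags.iloc[30:])
```

In this example the last value (-22.0) is flagged because it is about four units above the mean, whereas the magnitude-based comparison would not flag it since its absolute value is smaller than that of the mean.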