This notebook is a first step toward exploring whether it's possible to predict horse races more accurately than the betting markets. It includes feature exploration, feature engineering, a basic XGBoost model, and a betting strategy, and it calculates my profit or loss.

This notebook doesn't rigorously test whether the strategy systematically makes money, but my sense is that it probably doesn't. The results change pretty dramatically when I change the train-test split. This could be because of a bug, or because the model is close to random: I have very few features, and I'm not using the most interesting ones.

Hopefully somebody can use this starting point and extend it. I'm happy to answer any questions if you don't understand anything I've done.

Ideas for improvements (ordered by priority):

 - Include additional features: most importantly form.
 - Create a betting strategy where you don't bet on every race, only on those where there's a big discrepancy between your predictions and the odds.
 - Set up a cross-validation framework.
 - Look at feature importance and partial dependence plots to make sure the model is behaving properly.
 - I'm treating this as a binary prediction problem (predicting the probability that each horse will win). This throws away information. There are probably better ways to set up the problem.
 - Possibly include a model that also predicts place.
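
The second idea in the list (betting only when the prediction and the odds disagree substantially) could look roughly like the sketch below. It is not part of the original notebook: the function name `select_value_bets` and the `min_edge` threshold are hypothetical, and it assumes a frame that already has the `adjProb` and `odds` columns built later in this notebook, with `odds` as decimal odds.

```python
import pandas as pd

def select_value_bets(df, min_edge=0.05):
    """Keep only runners whose model probability beats the market-implied
    probability (1 / decimal odds) by at least `min_edge`.

    `min_edge` is a hypothetical tuning parameter, not a recommended value.
    """
    implied_prob = 1.0 / df['odds']
    edge = df['adjProb'] - implied_prob
    return df[edge >= min_edge]

# Toy data for illustration only (not taken from the notebook's dataset).
toy = pd.DataFrame({
    'adjProb': [0.30, 0.12, 0.10],
    'odds':    [5.0, 10.0, 8.0],
})
bets = select_value_bets(toy, min_edge=0.05)
print(bets)
```

A higher `min_edge` means fewer bets, so the threshold would itself need to be chosen on held-out races rather than on the test set.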


In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np

#useful for displaying wide data frames
pd.set_option('display.max_columns', 50)

In [2]:
df = pd.read_csv('input/modified_data.csv')

In [4]:
df.columns

Index(['id_x', 'collected_at', 'market_id', 'position', 'place_paid', 'margin',
       'horse_id', 'trainer_id', 'rider_id', 'handicap_weight', 'number',
       'barrier', 'blinkers', 'emergency', 'form_rating_one',
       'form_rating_two', 'form_rating_three', 'last_five_starts',
       'favourite_odds_win', 'favourite_odds_place', 'favourite_pool_win',
       'favourite_pool_place', 'tip_one_win', 'tip_one_place', 'tip_two_win',
       'tip_two_place', 'tip_three_win', 'tip_three_place', 'tip_four_win',
       'tip_four_place', 'tip_five_win', 'tip_five_place', 'tip_six_win',
       'tip_six_place', 'tip_seven_win', 'tip_seven_place', 'tip_eight_win',
       'tip_eight_place', 'tip_nine_win', 'tip_nine_place', 'id_y', 'timezone',
       'venue_id', 'race_number', 'distance', 'condition_id', 'weather_id',
       'total_pool_one_win', 'total_pool_one_place', 'total_pool_two_win',
       'total_pool_two_place', 'total_pool_three_win',
       'total_pool_three_place', 'overall_starts', 

In [6]:
numeric_features = ['position',
                    'market_id',
                    'barrier',
                    'handicap_weight', 
                    'place_pct', 
                    'win_pct',
                    'track_starts', 
                    'track_wins']
categorical_features = ['rider_id']
#take a copy so that adding columns later does not modify a slice of df
df_features = df[numeric_features].copy()

In [13]:
df_features.head()

   position  market_id  barrier  handicap_weight  place_pct  win_pct  track_starts  track_wins
0       NaN          1       10             58.5       0.00      0.0             0           0
1       NaN          1       11             56.5       0.00      0.0             0           0
2       NaN          1        5             56.5       0.25      0.0             1           0
3       NaN          1       12             56.5       0.00      0.0             0           0
4       NaN          1        8             58.5       1.00      0.0             0           0


#convert to factors
for feature in categorical_features:
    df[feature] = df[feature].astype(str)
    df[feature] = df[feature].replace('nan','0') #have to do this because of a weird random forest bug
encoded_features = pd.get_dummies(df[categorical_features], columns=categorical_features)
df_features = pd.merge(df_features, encoded_features, left_index=True, right_index=True, how='inner')

In [14]:
#turn the target variable into a binary feature: did or did not win
df_features['win'] = (df_features['position'] == 1).astype(int)

#del df_runners_and_market, encoded_features, df_features['position']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [15]:
bloom = pd.read_csv('features_bloom.csv')

In [16]:
df_features = bloom

# Split between training and test
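This is a random split by race (market_id): 70% of races go to training and the rest to test, so all runners in a race stay on the same side of the split.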

In [28]:
def runRaces(est=None, n_trials=5, xgboost=False):
    payouts = np.array([])
    bestModel = None
    bestPayout = -np.inf
    bestPredictions = None
    train_columns = None  #only fitModel returns the training columns
    for i in range(n_trials):
        print('trial {}'.format(i))
        df_train, df_test = train_test_split()
        if xgboost:
            df_test, model = fitXgboost(df_train, df_test)
            predictions = df_test['predictions'].values
        else:
            df_test, model, train_columns, predictions = fitModel(df_train, df_test, est)
        df_test = getOdds(df_test)
        df_test = getAdjustedProbs(df_test)
        payout = placeBets(df_test)
        payouts = np.append(payouts, payout)
        #keep the model and predictions from the most profitable trial
        #(a scikit-learn estimator is refit in place each trial, so bestModel then reflects the last fit)
        if payout > bestPayout:
            bestPayout = payout
            bestModel = model
            bestPredictions = predictions
        
    print('mean payout: {}, std payout: {}'.format(np.mean(payouts), np.std(payouts)))
    return payouts, bestModel, train_columns, bestPredictions

def train_test_split():
    #split by race so that all runners in a race land in the same set (70% of races for training)
    market_ids = df_features['market_id'].unique()
    training_races = np.random.choice(market_ids, size=int(round(0.7*len(market_ids))), replace=False)
    is_train = df_features['market_id'].isin(training_races)
    #copy so that adding columns later does not trigger SettingWithCopyWarning
    df_train = df_features[is_train].copy()
    df_test = df_features[~is_train].copy()
    return (df_train, df_test)
    
#del df_features
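
Because the results vary a lot between random splits, one option is to replace repeated random splits with K-fold cross-validation grouped by race. The following is a minimal sketch, not the notebook's method: it assumes scikit-learn is installed, uses its `GroupKFold` splitter with `market_id` as the group, and the helper name `race_level_folds` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

def race_level_folds(df, n_splits=5):
    """Yield (train, test) frames where no race is split across the two."""
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(df, groups=df['market_id']):
        yield df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

# Toy frame for illustration only: 6 races with 3 runners each.
toy = pd.DataFrame({
    'market_id': np.repeat(np.arange(6), 3),
    'barrier': np.arange(18),
})
for fold, (train, test) in enumerate(race_level_folds(toy, n_splits=3)):
    shared = set(train['market_id']) & set(test['market_id'])
    print(fold, sorted(test['market_id'].unique()), 'shared races:', len(shared))
```

Each race appears in exactly one test fold, so the per-fold returns cover the whole dataset once instead of depending on which races a single random draw happened to pick.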

In [19]:
def fitXgboost(df_train, df_test):
    param = {'objective':'binary:logistic' }
    dtrain = xgb.DMatrix(df_train.drop(['win', 'position', 'market_id'], axis=1).values, label=df_train['win'])
    dtest = xgb.DMatrix(df_test.drop(['win', 'position', 'market_id'], axis=1).values)
    model = xgb.train(param, dtrain)
    predictions = model.predict(dtest)
    df_test['predictions'] = predictions
    df_test = df_test[['predictions','win','market_id']]
    return df_test, model

In [23]:
def fitModel(df_train, df_test, est):
    drop_cols = ['win','position','market_id']
    X_train = df_train.drop(drop_cols, axis=1)
    X_test = df_test.drop(drop_cols, axis=1)
    est.fit(X_train, df_train['win'])
    
    if hasattr(est, 'predict_proba'):
        #column 1 is the probability of the positive class (win)
        predictions = est.predict_proba(X_test)[:,1]
    else:
        #estimators without predict_proba typically return a 1-D array from predict
        predictions = est.predict(X_test)
    
    df_test = df_test.copy()
    df_test['predictions'] = predictions
    df_test = df_test[['predictions','win','market_id']]
    return df_test, est, X_train.columns, predictions

#standalone XGBoost run on a single random split
df_train, df_test = train_test_split()
feature_cols = df_train.columns.drop(['win','position','market_id'])
gbm = xgb.XGBClassifier(objective='binary:logistic').fit(df_train[feature_cols], df_train['win'])
#column 1 of predict_proba is the probability of the positive class (win)
predictions = gbm.predict_proba(df_test[feature_cols])[:,1]
df_test['predictions'] = predictions
df_test = df_test[['predictions','win','market_id']]
#del df_train

# Compare with betting markets

In [24]:
def getOdds(df_test):
    df_odds = pd.read_csv("input/odds.csv")
    df_odds = df_odds[df_odds['runner_id'].isin(df_test.index)]

    #I take the mean odds for the horse rather than the odds 1 hour before or 10 mins before. You may want to revisit this.
    average_win_odds = df_odds.groupby(['runner_id'])['odds_one_win'].mean()

    #delete when odds are 0 because there is no market for this horse
    average_win_odds[average_win_odds == 0] = np.nan
    df_test['odds'] = average_win_odds
    df_test = df_test.dropna(subset=['odds'])
    #given that I predict multiple winners, there's leakage if I don't shuffle the test set (winning horse appears first and I put money on the first horse I predict to win)
    df_test = df_test.iloc[np.random.permutation(len(df_test))]
    return df_test

In [25]:
def getAdjustedProbs(df_test):
    marketIdTotalProb = pd.DataFrame(df_test.groupby('market_id').predictions.sum())
    marketIdTotalProb.columns = ['totalPredForMarket']
    df_test = pd.merge(df_test, marketIdTotalProb, left_on='market_id', right_index=True)
    df_test['adjProb'] = df_test.predictions / df_test.totalPredForMarket
    return df_test
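
The model scores each horse independently, so the raw predictions within a race need not sum to 1 (in the run output further down, one race sums to about 0.507). `getAdjustedProbs` rescales them so that they do. As an illustrative alternative, the same normalisation can be sketched with a grouped `transform` instead of a merge; the toy values below are made up.

```python
import pandas as pd

# Toy predictions for two races (illustrative values only).
toy = pd.DataFrame({
    'market_id':   [1, 1, 1, 2, 2],
    'predictions': [0.10, 0.20, 0.10, 0.30, 0.10],
})

# Divide each prediction by the sum of predictions in its own race.
race_totals = toy.groupby('market_id')['predictions'].transform('sum')
toy['adjProb'] = toy['predictions'] / race_totals

print(toy)
print(toy.groupby('market_id')['adjProb'].sum())
```

`transform('sum')` returns a series aligned with the original rows, which avoids building and merging a separate per-race totals frame.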

In [48]:
def placeBets(df):
    investment = 0  #one unit staked per bet
    payout = 0
    bets = 0
    wins = 0
    
    for index, row in df.iterrows():
        # bet when our probability of a win exceeds the market's implied probability (1/odds)
        if row['adjProb'] > (1/row['odds']):
            investment +=1
            bets +=1
            if (row['win']):
                payout += row['odds']
                wins +=1
                
    print('Total bets: {}'.format(bets))
    print('Total wins: {}'.format(wins))
    if investment == 0:
        #no bets were placed, so report 0 rather than divide by zero
        return 0.0
    investment_return = round((payout - investment)/investment*100,2)
    return investment_return
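
The betting rule compares `adjProb` with `1/odds` because, with decimal odds, a one-unit bet pays back `odds` on a win and nothing otherwise. If `prob` were the true win probability, the expected profit per unit staked would be `prob * odds - 1`, which is positive exactly when `prob > 1/odds`. A small sketch of that arithmetic (the helper name `expected_profit` is hypothetical and the numbers are made up):

```python
def expected_profit(prob, odds, stake=1.0):
    """Expected profit of a win bet at decimal odds,
    if `prob` were the true probability of winning."""
    return stake * (prob * odds - 1.0)

# Market-implied probability at odds of 5.0 is 1/5 = 0.20.
print(expected_profit(0.25, 5.0))  # model more optimistic than the market
print(expected_profit(0.15, 5.0))  # model less optimistic than the market
```

This only holds if the model's probabilities are well calibrated; with a near-random model, a positive estimate here says little about real returns.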
            

In [44]:
#select the horse I picked as most likely to win
def getProfit(df_test):
    df_profit = df_test.loc[df_test.groupby("market_id")["predictions"].idxmax()]
    print(df_profit.head())
    bets = 0
    investment = 0
    payout = 0
    wins = 0
    for index, row in df_profit.iterrows():
        investment +=1
        bets +=1
        if (row['win']):
            payout += row['odds']
            wins +=1
    investment_return = round((payout - investment)/investment*100,2)
    return investment_return


In [45]:
#print("This algorithm and betting system will generate a " + str(investment_return) + "% return\n")
#print("Note: you can't read much from a single run. Best to setup a cross validation framework and look at the return over many runs")

In [46]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
payouts, bestModel, train_cols, predictions = runRaces(lr, n_trials=2)

trial 0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


       predictions  win  market_id   odds  totalPredForMarket   adjProb
22595     0.086406    0       1899   6.01            0.506965  0.170437
22590     0.068582    0       1899  16.75            0.506965  0.135279
22588     0.085067    1       1899   9.15            0.506965  0.167796
22593     0.083562    0       1899   8.64            0.506965  0.164828
22591     0.095595    0       1899  15.05            0.506965  0.188564
(19572, 6)
Total bets: 11466
Wins: 916, Bets: 11466
trial 1
       predictions  win  market_id    odds  totalPredForMarket   adjProb
57960     0.076035    0       5180  121.10             1.00686  0.075517
57956     0.101312    0       5180    9.65             1.00686  0.100622
57964     0.079798    0       5180    8.32             1.00686  0.079254
57965     0.091068    0       5180   73.10             1.00686  0.090447
57954     0.077138    1       5180   64.30             1.00686  0.076612
(19554, 6)
Total bets: 11585
Wins: 931, Bets: 11585
mean payout: 228.1