This notebook is a first step toward exploring whether it's possible to predict horse races more accurately than the betting markets. It includes feature exploration, feature engineering, a basic XGBoost model, a betting strategy, and a calculation of my profit or loss.

This notebook doesn't rigorously test whether the strategy systematically makes money, but my sense is that it probably doesn't. The results change pretty dramatically when I change the train-test split. This could be because of a bug, or because the model is close to random: I have very few features, and I'm not using the most interesting ones.

Hopefully somebody can use this starting point and extend it. I'm happy to answer any questions if you don't understand anything I've done.

Ideas for improvements (ordered by priority):

 - Include additional features: most importantly form.
 - Create a betting strategy where you don't bet on every race, but only on those where there's a big discrepancy between your predictions and the odds.
 - Set up a cross-validation framework.
 - Look at feature importance and partial dependence plots to make sure the model is behaving properly.
 - I'm treating this as a binary prediction problem (predicting the probability that each horse will win). This throws away information. There are probably better ways to set up the problem.
 - Possibly include a model that also predicts place.
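
The second idea in the list (bet only when the model disagrees strongly with the market) could be sketched roughly as follows. This is not part of the notebook: `expected_value`, `should_bet` and the `min_edge` threshold are hypothetical names, and the odds are assumed to be decimal odds (total payout per unit staked, stake included), which matches how `placeBets` further down pays out `row['odds']` on a win.

```python
def expected_value(win_prob, decimal_odds):
    """Expected profit per unit staked: win_prob * decimal_odds - 1."""
    return win_prob * decimal_odds - 1.0


def should_bet(win_prob, decimal_odds, min_edge=0.10):
    """Bet only when the expected profit per unit stake exceeds min_edge.

    min_edge=0 corresponds to the rule winProb > 1/odds; the default of 0.10
    is an arbitrary illustrative threshold, not a tuned value.
    """
    return expected_value(win_prob, decimal_odds) > min_edge


# Hypothetical horse: the model says 25%, the market offers decimal odds of 5.0.
print(expected_value(0.25, 5.0))
print(should_bet(0.25, 5.0))
# A smaller disagreement with the market is skipped at the default threshold.
print(should_bet(0.21, 5.0))
```

Raising `min_edge` trades fewer bets for a larger required disagreement with the market; whether that helps here would have to be checked on held-out races.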


In [7]:
import pandas as pd
import xgboost as xgb
import numpy as np
%load_ext autoreload
%autoreload 2
#useful for displaying wide data frames
pd.set_option('display.max_columns', 50)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
#load the data into Pandas dataframes
df_market = pd.read_csv("input/markets.csv")
df_runners = pd.read_csv("input/runners.csv",dtype={'barrier': np.int16,'handicap_weight': np.float16})

#for my simple model, I'm ignoring other columns. I recommend starting with form if you're looking to add features
#df_odds = pd.read_csv("../input/odds.csv")
#df_form = pd.read_csv("../input/forms.csv")
#df_condition = pd.read_csv("../input/conditions.csv")
#df_weather = pd.read_csv("../input/weather.csv")
#df_rider = pd.read_csv("../input/riders.csv")
#df_horse = pd.read_csv("../input/horses.csv")
#df_horse_sex = pd.read_csv("../input/horse_sexes.csv")

In [10]:
##merge the runners and markets data frames
df_runners_and_market = pd.merge(df_runners,df_market,left_on='market_id',right_on='id',how='outer')
df_runners_and_market.index = df_runners_and_market['id_x'] 


In [11]:
numeric_features = ['position','market_id','barrier','handicap_weight']
categorical_features = ['rider_id']

#convert to factors
for feature in categorical_features:
    df_runners_and_market[feature] = df_runners_and_market[feature].astype(str)
    df_runners_and_market[feature] = df_runners_and_market[feature].replace('nan','0') #have to do this because of a weird random forest bug

#start from the numeric columns (only needs to run once, so it sits outside the loop)
df_features = df_runners_and_market[numeric_features]

for feature in categorical_features:
    encoded_features = pd.get_dummies(df_runners_and_market[feature])
    encoded_features.columns = feature + encoded_features.columns
    df_features = pd.merge(df_features,encoded_features,left_index=True,right_index=True,how='inner') 

#turn the target variable into a binary feature: did or did not win
df_features['win'] = False
df_features.loc[df_features['position'] == 1,'win'] = True

#del df_runners_and_market, encoded_features, df_features['position']

#Split between training and test
Doing a random split

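#columns that must not be used as model inputs: the target, the finishing position and the race id
non_feature_columns = ['win','position','market_id']
gbm = xgb.XGBClassifier(objective='binary:logistic').fit(df_train.drop(non_feature_columns,axis=1), df_train['win'])
#predict_proba columns follow the sorted classes [False, True], so column 1 is the probability of a win
predictions = gbm.predict_proba(df_test.drop(non_feature_columns,axis=1))[:,1]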
df_test['predictions'] = predictions
df_test = df_test[['predictions','win','market_id']]
#del df_train
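
Because each horse is scored independently, the predicted win probabilities within one race need not sum to 1. One possible post-processing step, shown here as a dependency-free sketch rather than something the notebook does, is to rescale the predictions within each race. The function name `normalise_by_race` and the example numbers are made up; in the notebook the same idea could be applied with a group-by on `market_id`.

```python
from collections import defaultdict


def normalise_by_race(market_ids, win_probs):
    """Rescale win probabilities so they sum to 1 within each race."""
    totals = defaultdict(float)
    for market_id, prob in zip(market_ids, win_probs):
        totals[market_id] += prob
    # Races whose raw probabilities sum to 0 are left at 0 to avoid dividing by zero.
    return [prob / totals[market_id] if totals[market_id] > 0 else 0.0
            for market_id, prob in zip(market_ids, win_probs)]


# Two made-up races: race 1 has three runners, race 2 has two.
market_ids = [1, 1, 1, 2, 2]
raw_probs = [0.2, 0.1, 0.1, 0.3, 0.3]
normalised = normalise_by_race(market_ids, raw_probs)
print(normalised)
```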

#Compare with betting markets

In [77]:
all_odds = pd.read_csv("input/odds.csv")


def getOdds(df):
    #use the data frame passed in, not the global df_test
    df_odds = all_odds[all_odds['runner_id'].isin(df.index)]

    #I take the mean odds for the horse rather than the odds 1 hour before or 10 mins before. You may want to revisit this.
    average_win_odds = df_odds.groupby(['runner_id'])['odds_one_win'].mean()

    #delete when odds are 0 because there is no market for this horse
    average_win_odds[average_win_odds == 0] = np.nan
    df['odds'] = average_win_odds
    df = df.dropna(subset=['odds'])
    #given that I predict multiple winners, there's leakage if I don't shuffle the test set (winning horse appears first and I put money on the first horse I predict to win)
    df = df.iloc[np.random.permutation(len(df))]
    return df


def train_test_split():
    training_races = np.random.choice(df_features['market_id'].unique(),size=int(round(0.7*len(df_features['market_id'].unique()),0)),replace=False)
    df_train = df_features[df_features['market_id'].isin(training_races)]
    df_test = df_features[~df_features['market_id'].isin(training_races)]
    return (df_train, df_test)
    

def getUniformProb(df):
    #baseline: every horse in a race gets the same win probability, 1/(number of runners)
    df = df.copy()
    df['winProb'] = 1/df.groupby('market_id')['market_id'].transform('count')
    return df


def placeBets(df):
    investment = 0
    payout = 0
    bets = 0
    wins = 0
    for index, row in df.iterrows():
        # bet when our predicted win probability exceeds the probability implied by the odds (1/odds)
        if row['winProb'] > (1/row['odds']):
            investment +=1
            bets +=1
            if (row['win']):
                payout += row['odds']
                wins +=1
                
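    #percentage return on the total stake; 0.0 if no bets were placed (avoids dividing by zero)
    investment_return = round((payout - investment)/investment*100,2) if investment > 0 else 0.0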
    print('Total bets: {}'.format(bets))
    print('Total wins: {}'.format(wins))
    return investment_return


def runRaces(n_trials=1):
    payouts = np.array([])
    for i in range(n_trials):
        print('trial {}'.format(i))
        _, df_test = train_test_split()
        df_test = getUniformProb(df_test)
        df_test = getOdds(df_test)
        payout = placeBets(df_test)
        payouts = np.append(payouts, payout)
        
    print('mean payout: {}, std payout: {}'.format(np.mean(payouts), np.std(payouts)))
    return df_test 
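
`placeBets` compares `winProb` with `1/odds`, the win probability implied by decimal odds. The sketch below, with made-up odds and hypothetical function names, illustrates why that is a demanding benchmark: the implied probabilities in a bookmaker's race typically sum to more than 1, and the excess (the overround) is the bookmaker's margin. Dividing by the total is only one simple way to recover probabilities that sum to 1; it is an assumption here, not the notebook's method.

```python
def implied_probabilities(decimal_odds):
    """Win probability implied by each decimal price: 1 / odds."""
    return [1.0 / odds for odds in decimal_odds]


def overround(decimal_odds):
    """Amount by which the implied probabilities exceed 1."""
    return sum(implied_probabilities(decimal_odds)) - 1.0


def fair_probabilities(decimal_odds):
    """Implied probabilities rescaled proportionally so they sum to 1."""
    implied = implied_probabilities(decimal_odds)
    total = sum(implied)
    return [p / total for p in implied]


# Made-up four-horse race: 0.5 + 0.25 + 0.2 + 0.1 = 1.05 in implied probability.
race_odds = [2.0, 4.0, 5.0, 10.0]
print(overround(race_odds))
print(fair_probabilities(race_odds))
```

A model has to beat the market's estimate by more than this margin before the `winProb > 1/odds` rule is profitable on average.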

In [69]:
class UniformProbEst():
    
    def __init__(self):
        pass

    def fit(self, X, y):
        pass
  
    def predict_proba(self, df):
        #uniform prob for each race
        winProbs = df.groupby('market_id').apply(lambda x: 1/x.shape[0]) 
        w = pd.DataFrame(winProbs).rename(columns={0:'winProb'})    
        #merge to match the right prob with each horse
        d = df.merge(w, left_on='market_id', right_index=True)
        z = np.zeros(d.shape[0])
        #stack with a leading column of zeros to fit the return
        #signature of sklearn predict_proba
        return np.hstack((z[:, np.newaxis], d.winProb.values[:, np.newaxis]))
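
Combined with the `winProb > 1/odds` betting rule, a uniform estimate of `1/n` for an `n`-horse race means a bet is placed exactly when the decimal odds exceed the field size, so this baseline only ever backs longer-priced horses. A small sketch with made-up odds (the function name `uniform_bets` is hypothetical):

```python
def uniform_bets(decimal_odds):
    """For one race, flag the horses a uniform 1/n estimate would bet on.

    The rule 1/n > 1/odds is equivalent to odds > n.
    """
    n = len(decimal_odds)
    return [1.0 / n > 1.0 / odds for odds in decimal_odds]


# Made-up four-horse race: only the horses priced above 4.0 are backed.
race_odds = [2.0, 4.0, 5.0, 10.0]
print(uniform_bets(race_odds))
```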
       

In [19]:
allOdds = pd.read_csv("input/odds.csv")


In [74]:
from runRaces import runRaces

uniformProbs = UniformProbEst()
p = runRaces(df_features, uniformProbs, allOdds)

trial 0
0
(26362, 1193)
--
(26362, 1194)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df_test['winProb'] = predictions


Total bets: 11127
Total wins: 340
mean payout: -28.4, std payout: 0.0
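
The printed figures can be unpacked with a little arithmetic. The sketch below assumes that the imported `runRaces` reports the same percentage return as `placeBets` above and that every bet is a one-unit stake; under those assumptions the payout and average winning price are approximations, since the reported return is rounded. The standard deviation of 0.0 simply reflects that only one trial was run.

```python
# Figures copied from the output above.
total_bets = 11127
total_wins = 340
return_pct = -28.4

# Fraction of bets that won.
strike_rate = total_wins / total_bets

# With one unit staked per bet, return_pct = (payout - stake) / stake * 100.
approx_payout = total_bets * (1 + return_pct / 100)

# Average decimal odds of the winning bets, given that payout.
avg_winning_odds = approx_payout / total_wins

print('strike rate: {:.2%}'.format(strike_rate))
print('approximate payout: {:.0f} units'.format(approx_payout))
print('average winning odds: {:.1f}'.format(avg_winning_odds))
```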


In [15]:
%reload_ext autoreload