This notebook is a first step to exploring whether it's possible to predict horses races more accurately than the betting markets. It includes feature exploration, feature engineering, a basic XGBoost model, a betting strategy and calculates my profit or loss. 

This notebook doesn't rigorously test whether the strategy systematically makes money, but my sense is that it probably doesn't. The results change pretty dramatically when I change the test-train split. This could be because of a bug, or could be because the model is ~random because I have very few features (and I'm not using the most interesting features. 

Hopefully somebody can use this starting point and extend it. I'm happy to answer any questions if you don't understand anything I've done.

Ideas for improvements (ordered by priority):

 - Include additional features: most importantly form.
 - Create a betting strategy where you don't bet on every race but only those where there's a big discrepancy between your predictions and the odds 
 - Setup a cross validation framework. 
 - Look at feature importance and partial plots to make sure the model is behaving properly.
 - I'm treating this as a binary prediction problem (predicting the probability that each horse will win). This throws away information. There are probably better ways to setup the problem. 
 - Possibly include a model that also predicts place.


In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np

#useful for displaying wide data frames
pd.set_option('display.max_columns', 50)

'''h = "ste"
def getLastTwenty(x):
    x = x.replace('x', "")
    x = x.replace('f', "9")
    arr = [int(ch) for ch in x]
    avg = np.sum(arr) / len(arr)
    return avg

forms.iloc[0:2, :]['last_twenty_starts'].apply(getLastTwenty)
''' 

In [2]:
#load the data into Pandas dataframes
df_market = pd.read_csv("input/markets.csv")
df_runners = pd.read_csv("input/runners.csv",dtype={'barrier': np.int16,'handicap_weight': np.float16})

#for my simple model, I'm ignoring other columns. I recommend starting with form if you're looking to add features
#df_odds = pd.read_csv("../input/odds.csv")
#df_form = pd.read_csv("../input/forms.csv")
#df_condition = pd.read_csv("../input/conditions.csv")
#df_weather = ("../input/weather.csv")
#df_rider = ("../input/riders.csv")
#df_horse = ("../input/horses.csv")
#df_horse_sex = ("../input/horse_sexes.csv")

# Inital Exploration
Looking at the data and some basic relationships

In [3]:
#look at the first fives rows of the market table
df_market[0:5]

Unnamed: 0,id,timezone,venue_id,race_number,distance,condition_id,weather_id,total_pool_one_win,total_pool_one_place,total_pool_two_win,total_pool_two_place,total_pool_three_win,total_pool_three_place
0,1,2016-06-26 19:00:00,1,3,1200,1.0,1.0,29718.08,,11564.5,5373.0,23464.8,5373.0
1,2,2016-06-26 19:05:00,2,2,1200,2.0,2.0,16169.99,,12624.0,3681.0,9001.42,3681.0
2,3,2016-06-26 19:30:00,1,4,1400,1.0,1.0,12282.57,,13233.0,5816.0,24191.1,5816.0
3,4,2016-06-26 19:40:00,2,3,1400,2.0,2.0,13000.05,,14416.5,6568.5,11542.2,6568.5
4,5,2016-06-26 20:00:00,1,5,1600,1.0,1.0,16194.21,,16076.5,6875.5,28934.6,6875.5


In [4]:
#look at the first fives rows of the runners table
df_runners[0:5]

Unnamed: 0,id,collected_at,market_id,position,place_paid,margin,horse_id,trainer_id,rider_id,handicap_weight,number,barrier,blinkers,emergency,form_rating_one,form_rating_two,form_rating_three,last_five_starts,favourite_odds_win,favourite_odds_place,favourite_pool_win,favourite_pool_place,tip_one_win,tip_one_place,tip_two_win,tip_two_place,tip_three_win,tip_three_place,tip_four_win,tip_four_place,tip_five_win,tip_five_place,tip_six_win,tip_six_place,tip_seven_win,tip_seven_place,tip_eight_win,tip_eight_place,tip_nine_win,tip_nine_place
0,4,2016-06-26 18:54:31.800293,1,,1,,4,4.0,4.0,58.5,4,10,f,f,82.0,82.0,14.0,x80x2,f,f,f,f,f,,f,t,f,t,t,t,t,t,f,t,f,t,f,t,t,t
1,10,2016-06-26 18:54:31.974395,1,,0,,10,4.0,10.0,56.5,10,11,f,f,100.0,100.0,18.0,22x35,f,f,f,f,f,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f
2,5,2016-06-26 18:54:31.835329,1,,0,,5,5.0,5.0,56.5,5,5,f,f,76.0,76.0,0.0,f7,f,f,f,f,f,,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,f
3,6,2016-06-26 18:54:31.873492,1,,0,,6,6.0,6.0,56.5,6,12,f,f,85.0,85.0,2.0,f6462,f,f,f,f,f,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f
4,1,2016-06-26 18:54:31.746854,1,,0,,1,1.0,1.0,58.5,1,8,f,f,89.0,89.0,0.0,34x0x,f,f,f,f,f,,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f


## Importance of Barrier
Horses that draw barriers 1-6 win more often. Horses that draw 16 or worse rarely win 

#explore the barriers feature: does it look like it impacts chances of victory?
winners_by_barrier = df_runners[df_runners['position'] == 1][['id','barrier']].groupby('barrier').agg(['count'])
barrier_count = df_runners[['id','barrier']].groupby('barrier').agg(['count'])
pct_winner_by_barrier = winners_by_barrier/barrier_count[barrier_count.index.isin(winners_by_barrier.index)]
ax = pct_winner_by_barrier.plot(kind='bar')
ax.set_ylabel("Win Percentage")

#this notebook pushes up against memory limits. So I'm aggressive with garbage collection.
del winners_by_barrier, barrier_count, pct_winner_by_barrier

# Handicap
Heavier horses win more often, suggesting that weights aren't a sufficient handicap for better horses

#explore weight: does it looks like it has an impact?
winners_by_weight = df_runners[df_runners['position'] == 1][['id','handicap_weight']].groupby('handicap_weight').agg(['count'])
winners_by_weight = winners_by_weight[winners_by_weight > 30].dropna()
weight_count = df_runners[['id','handicap_weight']].groupby('handicap_weight').agg(['count'])
pct_winners_by_weight = winners_by_weight/weight_count[weight_count.index.isin(winners_by_weight.index)]
ax = pct_winners_by_weight.plot(kind='bar')
ax.set_ylabel("Win Percentage")
del winners_by_weight, weight_count, pct_winners_by_weight

#Rider Quality
The best riders win ~three times as often as the worst 

#explore weight: does it looks like it has an impact?
winners_by_rider = df_runners[df_runners['position'] == 1][['id','rider_id']].groupby('rider_id').agg(['count'])
#only inclide riders who have more than 10 races
winners_by_rider =winners_by_rider[winners_by_rider > 10].dropna()
rider_count = df_runners[['id','rider_id']].groupby('rider_id').agg(['count'])
rider_count = rider_count[rider_count.index.isin(winners_by_rider.index)]
pct_winners_by_rider = winners_by_rider/rider_count
pct_winners_by_rider.columns = ['Win_Percentage']
pct_winners_by_rider = pct_winners_by_rider.sort_values(by='Win_Percentage',ascending=False)
ax = pct_winners_by_rider.plot(kind='bar')
ax.set_ylabel("Win Percentage")
del winners_by_rider, rider_count, pct_winners_by_rider

# Create Feature Matrix
The exploration above suggests that barrier, weight and rider are valuable features for predicting winners. I've included all those features.

In [5]:
##merge the runners and markets data frames
df_runners_and_market = pd.merge(df_runners, df_market,left_on='market_id',right_on='id',how='inner')
df_runners_and_market.index = df_runners_and_market['id_x'] 


In [6]:
forms = pd.read_csv('input/forms.csv')

In [7]:
df_runners_and_market = df_runners_and_market.join(forms[['overall_starts','overall_wins', 'overall_places', 'track_starts', 'track_wins']], on='horse_id')

In [8]:
#payouts, bestModel = runRaces(lr, n_trials=5)

NameError: name 'runRaces' is not defined

In [11]:
numeric_features = ['position',
                    'market_id',
                    'barrier',
                    'handicap_weight',  
                    'track_starts', 
                    'track_wins']
categorical_features = ['rider_id']
df_features = df_runners_and_market[numeric_features]

#convert to factors
for feature in categorical_features:
    df_runners_and_market[feature] = df_runners_and_market[feature].astype(str)
    df_runners_and_market[feature] = df_runners_and_market[feature].replace('nan','0') #have to do this because of a weird random forest bug


encoded_features = pd.get_dummies(df_runners_and_market[categorical_features], columns=categorical_features)

df_features = pd.merge(df_features, encoded_features, left_index=True, right_index=True, how='inner')

In [12]:
#turn the target variable into a binary feature: did or did not win
df_features['win'] = 0
df_features.loc[df_features['position'] == 1,'win'] = 1

#del df_runners_and_market, encoded_features, df_features['position']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [13]:
df_features.to_csv('features_bloom.csv',)

In [14]:
bloom = pd.read_csv('features_bloom.csv')

# Split between training and test
Doing a random split

In [16]:
bloom.shape

(82639, 8)

In [36]:
def runRaces(est=None, n_trials=5, xgboost=False):
    payouts = np.array([])
    bestModel = None
    bestPayout = -100000
    for i in range(n_trials):
        print('trial {}'.format(i))
        df_train, df_test = train_test_split()
        if xgboost:
            df_test, model = fitXgboost(df_train, df_test)
        else:
            df_test, model = fitModel(df_train, df_test, est)
        df_test = getOdds(df_test)
        df_test = getAdjustedProbs(df_test)
        payout = getProfit(df_test)
        payouts = np.append(payouts, payout)
        if payout > bestPayout:
            bestPayout = payout
            bestModel = model
    print('mean payout: {}, std payout: {}'.format(np.mean(payouts), np.std(payouts)))
    return payouts, bestModel       

In [27]:
def train_test_split():
    training_races = np.random.choice(df_features['market_id'].unique(),size=int(round(0.7*len(df_features['market_id'].unique()),0)),replace=False)
    df_train = df_features[df_features['market_id'].isin(training_races)]
    df_test = df_features[~df_features['market_id'].isin(training_races)]
    return (df_train, df_test)
    
#del df_features

In [42]:
def fitXgboost(df_train, df_test):
    param = {'objective':'binary:logistic' }
    dtrain = xgb.DMatrix(df_train.drop(['win', 'position', 'market_id'], axis=1).values, label=df_train['win'])
    dtest = xgb.DMatrix(df_test.drop(['win', 'position', 'market_id'], axis=1).values)
    model = xgb.train(param, dtrain)
    predictions = model.predict(dtest)
    df_test['predictions'] = predictions
    df_test = df_test[['predictions','win','market_id']]
    return df_test, model

In [28]:
def fitModel(df_train, df_test, est):
    est.fit(df_train.drop(df_train[['win','position','market_id']],axis=1), df_train['win'])
    
    if est.predict_proba:
        predictions = est.predict_proba(df_test.drop(df_test[['win','position','market_id']],axis=1))[:,0]
    else:
        predictions = est.predict(df_test.drop(df_test[['win','position','market_id']],axis=1))[:,0]
    
    df_test['predictions'] = predictions
    df_test = df_test[['predictions','win','market_id']]
    return df_test, est

gbm = xgb.XGBClassifier(objective='binary:logistic').fit(df_train.drop(df_train[['win','position','market_id']],axis=1)
, df_train['win'])
predictions = gbm.predict_proba(df_test.drop(df_test[['win','position','market_id']],axis=1))[:,0]
df_test['predictions'] = predictions
df_test = df_test[['predictions','win','market_id']]
#del df_train

#Compare with betting markets

In [29]:
def getOdds(df_test):
    df_odds = pd.read_csv("input/odds.csv")
    df_odds = df_odds[df_odds['runner_id'].isin(df_test.index)]

    #I take the mean odds for the horse rather than the odds 1 hour before or 10 mins before. You may want to revisit this.
    average_win_odds = df_odds.groupby(['runner_id'])['odds_one_win'].mean()

    #delete when odds are 0 because there is no market for this horse
    average_win_odds[average_win_odds == 0] = np.nan
    df_test['odds'] = average_win_odds
    df_test = df_test.dropna(subset=['odds'])
    #given that I predict multiple winners, there's leakage if I don't shuffle the test set (winning horse appears first and I put money on the first horse I predict to win)
    df_test = df_test.iloc[np.random.permutation(len(df_test))]
    return df_test

In [30]:
def getAdjustedProbs(df_test):
    marketIdTotalProb = pd.DataFrame(df_test.groupby('market_id').predictions.sum())
    marketIdTotalProb.columns = ['totalPredForMarket']
    df_test = pd.merge(df_test, marketIdTotalProb, left_on='market_id', right_index=True)
    df_test['adjProb'] = df_test.predictions / df_test.totalPredForMarket
    return df_test


In [31]:
#select the horse I picked as most likely to win
def getProfit(df_test):
    df_profit = df_test.loc[df_test.groupby("market_id")["predictions"].idxmax()]
    df_profit.head()
    investment = 0
    payout = 0
    for index, row in df_profit.iterrows():
        investment +=1
    
        if (row['win']):
            payout += row['odds']

    investment_return = round((payout - investment)/investment*100,2)
    return investment_return


In [33]:
#print("This algorithm and betting system will generate a " + str(investment_return) + "% return\n")
#print("Note: you can't read much from a single run. Best to setup a cross validation framework and look at the return over many runs")

In [34]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
payouts, bestModel = runRaces(lr, n_trials=5)

trial 0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


trial 1
trial 2
trial 3
trial 4
mean payout: -20.018, std payout: 12.142754876880286


In [41]:
payouts, bestModel = runRaces(est=None, n_trials=2, xgboost=True)

trial 0
[ 0.09469943  0.02599738  0.10701256  0.07757477  0.11427253  0.08143684
  0.07757477  0.1119375   0.09469943  0.09468859]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


NameError: name 'est' is not defined