This notebook is a first step toward exploring whether it's possible to predict horse races more accurately than the betting markets. It covers feature exploration, feature engineering, a basic XGBoost model and a betting strategy, and it calculates the resulting profit or loss.

This notebook doesn't rigorously test whether the strategy systematically makes money, but my sense is that it probably doesn't. The results change pretty dramatically when I change the train-test split. This could be because of a bug, or because the model is close to random: I have very few features, and I'm not using the most interesting ones.

Hopefully somebody can use this starting point and extend it. I'm happy to answer any questions if you don't understand anything I've done.

Ideas for improvements (ordered by priority):

 - Include additional features: most importantly form.
 - Create a betting strategy where you don't bet on every race, only on those where there's a big discrepancy between your predictions and the odds.
 - Set up a cross-validation framework.
 - Look at feature importance and partial dependence plots to make sure the model is behaving properly.
 - I'm treating this as a binary prediction problem (predicting the probability that each horse will win). This throws away information, such as the finishing positions of the other horses. There are probably better ways to set up the problem.
 - Possibly include a model that also predicts place.
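
The second idea in the list, betting only where the model and the market disagree, could be sketched roughly as follows. This is a minimal sketch and not part of the pipeline in this notebook: `selectValueBets`, `flatStakeReturn` and the `min_edge` threshold are hypothetical names, the odds are assumed to be decimal odds (payout per unit staked, stake included), and the demo numbers are made up. It reuses the `adjProb`, `odds`, `win` and `market_id` columns that the functions later in the notebook produce.

```python
import pandas as pd

def selectValueBets(df_test, min_edge=0.1):
    """Keep only runners whose expected return per unit staked exceeds min_edge."""
    df = df_test.copy()
    # expected return of a 1-unit bet at decimal odds: p * odds - 1
    df['edge'] = df['adjProb'] * df['odds'] - 1
    return df[df['edge'] > min_edge]

def flatStakeReturn(df_bets):
    """Percentage return from staking 1 unit on every selected runner."""
    if len(df_bets) == 0:
        return 0.0
    payout = df_bets.loc[df_bets['win'] == 1, 'odds'].sum()
    return round((payout - len(df_bets)) / len(df_bets) * 100, 2)

# made-up example: two races, five runners
demo = pd.DataFrame({
    'market_id': [1, 1, 1, 2, 2],
    'adjProb':   [0.5, 0.3, 0.2, 0.6, 0.4],
    'odds':      [1.8, 4.0, 6.0, 1.5, 3.0],
    'win':       [1, 0, 0, 0, 1],
})
bets = selectValueBets(demo, min_edge=0.1)
print(bets[['market_id', 'adjProb', 'odds', 'edge']])
print(flatStakeReturn(bets))
```

The threshold is a design choice: a higher `min_edge` means fewer bets, each needing a larger disagreement with the market. It only helps if the model's probabilities are well calibrated.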


In [40]:
import pandas as pd
import xgboost as xgb
import numpy as np

#useful for displaying wide data frames
pd.set_option('display.max_columns', 50)

In [41]:
df = pd.read_csv('input/modWithComp.csv')

In [42]:
numeric_features = ['position',
                    'market_id',
                    'barrier',
                    'handicap_weight', 
                    'place_pct', 
                    'win_pct',
                    'comp_win_pct',
                    'comp_place_pct',
                    'track_starts', 
                    'track_wins']
categorical_features = ['rider_id']
df_features = df[numeric_features]

#convert to factors
for feature in categorical_features:
    df[feature] = df[feature].astype(str)
    df[feature] = df[feature].replace('nan','0') #have to do this because of a weird random forest bug
#one-hot encode the categorical columns of df (the frame loaded above)
encoded_features = pd.get_dummies(df[categorical_features], columns=categorical_features)
df_features = pd.merge(df_features, encoded_features, left_index=True, right_index=True, how='inner')

In [43]:
#turn the target variable into a binary feature: did or did not win
df_features['win'] = 0
df_features.loc[df_features['position'] == 1,'win'] = 1

#del df_runners_and_market, encoded_features, df_features['position']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


# Split between training and test
Doing a random split

In [67]:
def runRaces(est=None, n_trials=5, xgboost=False):
    payouts = np.array([])
    bestModel = None
    bestPayout = -np.inf
    bestPredictions = None
    train_columns = None #only set by fitModel; stays None when xgboost=True
    for i in range(n_trials):
        print('trial {}'.format(i))
        df_train, df_test = train_test_split()
        if xgboost:
            df_test, model = fitXgboost(df_train, df_test)
            predictions = df_test['predictions'].values
        else:
            df_test, model, train_columns, predictions = fitModel(df_train, df_test, est)
        df_test = getOdds(df_test)
        df_test = getAdjustedProbs(df_test)
        payout = getProfit(df_test)
        payouts = np.append(payouts, payout)
        if payout > bestPayout:
            #remember the model and predictions from the most profitable trial
            #note: fitModel refits the same estimator object every trial, so
            #bestModel holds the latest fit unless the estimator is copied first
            bestPayout = payout
            bestModel = model
            bestPredictions = predictions

    print('mean payout: {}, std payout: {}'.format(np.mean(payouts), np.std(payouts)))
    return payouts, bestModel, train_columns, bestPredictions

In [45]:
def train_test_split():
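    #split by race (market_id) so that all runners in a race end up on the same side
    races = df_features['market_id'].unique()
    training_races = np.random.choice(races, size=int(round(0.7*len(races))), replace=False)
    #.copy() so that adding columns later doesn't trigger SettingWithCopyWarning
    df_train = df_features[df_features['market_id'].isin(training_races)].copy()
    df_test = df_features[~df_features['market_id'].isin(training_races)].copy()
    return (df_train, df_test)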
    
#del df_features
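
A single random 70/30 split gives a noisy estimate of the return. One way to set up the cross-validation mentioned in the ideas list is to rotate every race through the test set exactly once. The following is a minimal sketch under that assumption: `raceFolds` and its `n_splits` and `seed` parameters are hypothetical names, and the demo frame is made up. scikit-learn's `GroupKFold` offers similar group-aware splitting.

```python
import numpy as np
import pandas as pd

def raceFolds(df_features, n_splits=5, seed=0):
    """Yield (df_train, df_test) pairs; every race lands in exactly one test fold."""
    rng = np.random.RandomState(seed)
    # shuffle the race ids, then cut them into n_splits roughly equal groups
    races = rng.permutation(df_features['market_id'].unique())
    for test_races in np.array_split(races, n_splits):
        is_test = df_features['market_id'].isin(test_races)
        yield df_features[~is_test].copy(), df_features[is_test].copy()

# made-up data: 10 races with 4 runners each
demo = pd.DataFrame({'market_id': np.repeat(np.arange(10), 4),
                     'barrier': np.tile(np.arange(1, 5), 10)})
folds = list(raceFolds(demo, n_splits=5))
for df_train, df_test in folds:
    # no race is split between the training and test sets
    assert set(df_train['market_id']).isdisjoint(df_test['market_id'])
    print(len(df_train), len(df_test))
```

Each `(df_train, df_test)` pair could then be passed through the same fit, odds and profit steps used for the random split, giving one return per fold.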

In [46]:
def fitXgboost(df_train, df_test):
    param = {'objective':'binary:logistic' }
    dtrain = xgb.DMatrix(df_train.drop(['win', 'position', 'market_id'], axis=1).values, label=df_train['win'])
    dtest = xgb.DMatrix(df_test.drop(['win', 'position', 'market_id'], axis=1).values)
    model = xgb.train(param, dtrain)
    predictions = model.predict(dtest)
    df_test['predictions'] = predictions
    df_test = df_test[['predictions','win','market_id']]
    return df_test, model

In [66]:
def fitModel(df_train, df_test, est):
    #columns that are targets or identifiers rather than model inputs
    non_features = ['win','position','market_id']
    X_train = df_train.drop(non_features, axis=1)
    X_test = df_test.drop(non_features, axis=1)
    est.fit(X_train, df_train['win'])

    if hasattr(est, 'predict_proba'):
        #column 1 is the probability of the positive class (win == 1)
        predictions = est.predict_proba(X_test)[:,1]
        print(predictions)
    else:
        #estimators without predict_proba return a 1-D array
        predictions = est.predict(X_test)

    df_test = df_test.copy() #avoid writing into a slice of df_features
    df_test['predictions'] = predictions
    df_test = df_test[['predictions','win','market_id']]
    return df_test, est, X_train.columns, predictions

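#scratch cell: assumes df_train and df_test already exist, e.g. from train_test_split()
non_features = ['win','position','market_id']
gbm = xgb.XGBClassifier(objective='binary:logistic').fit(df_train.drop(non_features, axis=1), df_train['win'])
#column 1 of predict_proba is the probability of win == 1 (column 0 is the probability of losing)
predictions = gbm.predict_proba(df_test.drop(non_features, axis=1))[:,1]
df_test = df_test.copy()
df_test['predictions'] = predictions
df_test = df_test[['predictions','win','market_id']]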
#del df_train

# Compare with betting markets

In [48]:
def getOdds(df_test):
    df_odds = pd.read_csv("input/odds.csv")
    df_odds = df_odds[df_odds['runner_id'].isin(df_test.index)]

    #I take the mean odds for the horse rather than the odds 1 hour before or 10 mins before. You may want to revisit this.
    average_win_odds = df_odds.groupby(['runner_id'])['odds_one_win'].mean()

    #delete when odds are 0 because there is no market for this horse
    average_win_odds[average_win_odds == 0] = np.nan
    df_test['odds'] = average_win_odds
    df_test = df_test.dropna(subset=['odds'])
    #given that I predict multiple winners, there's leakage if I don't shuffle the test set (winning horse appears first and I put money on the first horse I predict to win)
    df_test = df_test.iloc[np.random.permutation(len(df_test))]
    return df_test
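
To compare the model's probabilities with the market, the odds first have to be turned into probabilities. The sketch below assumes the `odds` column holds decimal odds, so the implied probability is `1 / odds`. Bookmaker margins usually make these sum to more than 1 within a race (the overround), so they are rescaled per race. `addImpliedProbs`, `impliedProb`, `marketProb` and `overround` are hypothetical names and the demo odds are made up.

```python
import pandas as pd

def addImpliedProbs(df_test):
    """Add market-implied win probabilities, rescaled to sum to 1 within each race."""
    df = df_test.copy()
    # raw implied probability from decimal odds
    df['impliedProb'] = 1.0 / df['odds']
    # sum of implied probabilities per race; above 1 when the market has a margin
    df['overround'] = df.groupby('market_id')['impliedProb'].transform('sum')
    df['marketProb'] = df['impliedProb'] / df['overround']
    return df

# made-up example: one race with three runners
demo = pd.DataFrame({'market_id': [1, 1, 1], 'odds': [2.0, 3.2, 4.0]})
out = addImpliedProbs(demo)
print(out)
```

`marketProb` can then be set against the model's per-race probabilities to see where the two disagree.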

In [49]:
def getAdjustedProbs(df_test):
    marketIdTotalProb = pd.DataFrame(df_test.groupby('market_id').predictions.sum())
    marketIdTotalProb.columns = ['totalPredForMarket']
    df_test = pd.merge(df_test, marketIdTotalProb, left_on='market_id', right_index=True)
    df_test['adjProb'] = df_test.predictions / df_test.totalPredForMarket
    return df_test


In [50]:
#bet one unit on the horse I picked as most likely to win in each race
def getProfit(df_test):
    df_profit = df_test.loc[df_test.groupby("market_id")["predictions"].idxmax()]
    #one unit staked per race
    investment = len(df_profit)
    #a winning bet pays out the odds; a losing bet pays nothing
    payout = df_profit.loc[df_profit['win'] == 1, 'odds'].sum()

    #percentage return on the total amount staked
    investment_return = round((payout - investment)/investment*100,2)
    return investment_return
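
A return figure is easier to interpret next to a baseline. One simple baseline is to always back the market favourite, meaning the runner with the lowest odds in each race, with the same flat one-unit stake. This is a minimal sketch: `getFavouriteProfit` is a hypothetical helper that is not used elsewhere in the notebook, the odds are assumed to be decimal, and the demo data is made up.

```python
import pandas as pd

def getFavouriteProfit(df_test):
    """Percentage return from a 1-unit bet on the shortest-priced runner in each race."""
    df_fav = df_test.loc[df_test.groupby('market_id')['odds'].idxmin()]
    investment = len(df_fav)
    payout = df_fav.loc[df_fav['win'] == 1, 'odds'].sum()
    return round((payout - investment) / investment * 100, 2)

# made-up example: three races, two runners each
demo = pd.DataFrame({
    'market_id': [1, 1, 2, 2, 3, 3],
    'odds':      [2.0, 4.0, 3.0, 1.5, 1.8, 5.0],
    'win':       [1, 0, 1, 0, 1, 0],
})
# favourites win races 1 and 3: payout 2.0 + 1.8 = 3.8 on 3 units staked
favourite_return = getFavouriteProfit(demo)
print(favourite_return)
```

If the model's return is not consistently above this baseline across many splits, it is probably not adding information beyond what the odds already contain.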


In [51]:
#print("This algorithm and betting system will generate a " + str(investment_return) + "% return\n")
#print("Note: you can't read much from a single run. Best to set up a cross-validation framework and look at the return over many runs")

In [69]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
payouts, bestModel, train_cols, predictions = runRaces(lr, n_trials=5)

trial 0
[ 0.055532    0.05491216  0.06267821 ...,  0.04804684  0.15398422
  0.07419078]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


trial 1
[ 0.05524596  0.05660352  0.06443668 ...,  0.0645494   0.05183351
  0.05783013]
trial 2
[ 0.06649384  0.08770885  0.07098709 ...,  0.05006389  0.14351333
  0.07710616]
trial 3
[ 0.06550047  0.05667057  0.10609112 ...,  0.06863343  0.0624719
  0.07459865]
trial 4
[ 0.0680906   0.08036962  0.07260524 ...,  0.06709356  0.07677473
  0.07006478]
mean payout: 191.88600000000002, std payout: 18.829539134030867


In [70]:
pd.Series(bestModel.coef_[0],index=train_cols )

barrier           -0.069054
handicap_weight    0.062837
place_pct         -0.050190
win_pct            0.011367
comp_win_pct      -0.014890
comp_place_pct     0.028023
track_starts       0.005798
track_wins         0.004792
dtype: float64

In [65]:
print(df_features[df_features.win == 1].win_pct.mean())
print(df_features[df_features.win == 0].win_pct.mean())

0.11064414060769343
0.1121880974178413
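
The two means above are almost identical, which suggests that `win_pct` on its own barely separates winners from non-winners in this data. The same check can be run for several features at once. This is a rough sketch: `featureMeansByOutcome` is a hypothetical helper and the demo frame is made up. With the real data it would be called on `df_features` with the feature columns of interest.

```python
import pandas as pd

def featureMeansByOutcome(df_features, feature_cols):
    """Mean of each feature for non-winners (win == 0) and winners (win == 1)."""
    # rows: features, columns: the two values of 'win'
    return df_features.groupby('win')[feature_cols].mean().T

# made-up example
demo = pd.DataFrame({
    'win':     [1, 0, 0, 1, 0, 0],
    'win_pct': [0.2, 0.1, 0.1, 0.3, 0.05, 0.15],
    'barrier': [1, 5, 7, 3, 9, 4],
})
summary = featureMeansByOutcome(demo, ['win_pct', 'barrier'])
print(summary)
```

Features whose two columns are nearly equal are unlikely to help a linear model much, although they may still matter through interactions in a tree-based model.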
