# Australian Thoroughbred Horseracing Analysis

This is th

## The datasets

There are five datasets we are interested in.

All of these datasets were sourced from the following websites:

* racing.com (races, horses, jockeys and trainers)
* racenet.com.au (sectional times)

All coding to create these datasets was done by the author.

The datasets cover thoroughbred races at the following courses:

* Caulfield (Melbourne)
* Flemington (Melbourne)
* Rosehill Gardens (Sydney)
* Royal Randwick (Sydney)

These courses were chosen because they are the four biggest metropolitan courses in Australia and host the greatest number of Group and Listed races. Races at these courses have the greatest prizemoney, tend to attract a lot of public interest and have large wagering pools.

We extracted all races from the period 1 January 2022 to 5 August 2023, a total of 1301 races. Due to some technical restrictions, we were only able to gather sectional data for the first 3 races at each of these meets.

## Load required libraries

In [3]:
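# Assert minimum versions 
import sys 
assert sys.version_info >= (3, 5) 
import sklearn 
# Compare the numeric (major, minor) parts; comparing version strings directly is lexicographic
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20) 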
 
# Import packages and modules that will be used 
import numpy as np 
import pandas as pd
from sklearn import linear_model
from sklearn import metrics 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import SGDClassifier 
from sklearn.metrics import accuracy_score 

# Import and configure matplotlib 
%matplotlib inline  
import matplotlib as mpl  
import matplotlib.pyplot as plt 
mpl.rc('figure', dpi=120) # set good resolution

# Set a seed for reproducibility
import random
random.seed(42)
# numpy needs a random seed, too
np.random.seed(42)

import datetime
from datetime import date

## Load datasets

### Races

In [4]:
df_races = pd.read_csv('/Users/phillipmonk/research_paper/horse_code/data/race_data.csv')
df_races.head()

Unnamed: 0,race_course,race_no,datetime,distance,class,track_cond,track_rail,race_time,prize_money,position,...,trainer,jockey,weight,prize,800m,400m,margin,sp,s_tab_win,s_tab_place
0,Flemington,4,2022-09-10 14:20:00,1400,Listed,7,True Entire Circuit,84.38,175000,1,...,S.O'Dea & M.Hoysted,D.M.Lane,58.5,106000,1.0,1.0,0.0,7.5,6.9,2.4
1,Flemington,4,2022-09-10 14:20:00,1400,Listed,7,True Entire Circuit,84.38,175000,2,...,E.Jusufovic,M.J.Dee,54.0,31500,10.0,9.0,2.25,4.0,0.0,1.9
2,Flemington,4,2022-09-10 14:20:00,1400,Listed,7,True Entire Circuit,84.38,175000,3,...,J.R.McArdle,J.Mott,54.0,15750,6.0,3.0,2.45,6.5,0.0,2.2
3,Flemington,4,2022-09-10 14:20:00,1400,Listed,7,True Entire Circuit,84.38,175000,4,...,M.Price & M.Kent (Jnr),B.J.Melham,58.5,7875,5.0,7.0,3.95,9.5,0.0,0.0
4,Flemington,4,2022-09-10 14:20:00,1400,Listed,7,True Entire Circuit,84.38,175000,5,...,S.R.Fliedner,H.T.Coffey,55.0,4375,11.0,11.0,4.25,26.0,0.0,0.0


In [5]:
df_races.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11881 entries, 0 to 11880
Data columns (total 22 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   race_course  11881 non-null  object 
 1   race_no      11881 non-null  int64  
 2   datetime     11881 non-null  object 
 3   distance     11881 non-null  int64  
 4   class        11881 non-null  object 
 5   track_cond   11881 non-null  int64  
 6   track_rail   11881 non-null  object 
 7   race_time    11881 non-null  float64
 8   prize_money  11881 non-null  int64  
 9   position     11881 non-null  int64  
 10  horse_name   11881 non-null  object 
 11  barrier      11881 non-null  int64  
 12  trainer      11881 non-null  object 
 13  jockey       11881 non-null  object 
 14  weight       11881 non-null  float64
 15  prize        11881 non-null  int64  
 16  800m         11880 non-null  float64
 17  400m         11880 non-null  float64
 18  margin       11880 non-null  float64
 19  sp  

This dataframe contains data of each runner in each race.

#### Fields

##### race_course
The racecourse (Caulfield, Flemington, Rosehill Gardens or Royal Randwick).

##### race_no
The individual race number for the meet.

##### datetime
The date and time of the race.

##### distance  
The distance of the race in metres. Sprints are usually around a kilometre or a bit longer; long-distance races are 2 kilometres or longer (the Melbourne Cup is 3200 metres).

##### class 
The class of the race. Group 1 races are the highest class, followed by Group 2, Group 3 and then Listed races.

##### track_cond  
The rating of the track. A rating of 1 means firm and fast, 10 means a heavy bog.

##### track_rail
The position of the rail on the course.

##### race_time
The time of the winning horse, in seconds.

##### prize_money
The total amount of prize money for the race.

##### position 
The finishing position of the horse in the race.

##### horse_name
The name of the horse. Horse names should be unique and can therefore be used as an identifier.

##### barrier
The starting barrier for the race. Barrier 1 is closest to the rail. The closer the barrier is to the rail, the shorter the distance to the finishing post, assuming that the race includes some bends. A wider barrier might carry an advantage if the track is heavy and has been churned up close to the rail.

##### trainer
The trainer of the horse. Usually the same trainer will train a horse for an extended length of time, though it's possible the owners will decide to change trainers.

##### jockey
The jockey riding the horse in this race. A horse is more likely to change jockeys than trainers from one race to the next.

##### weight
The weight carried by the horse, in kilograms. Thoroughbred horseracing in Australia is handicapped by weight; i.e., different horses carry different weights to make the races closer and more interesting. If the jockey weighs less than the required weight, they carry extra weight in the saddle.

##### prize
The individual prize money won by this horse in this race.

##### 800m
The position of the horse with 800 metres to go.

##### 400m
The position of the horse with 400 metres to go.

##### margin
The distance in lengths the horse finished behind the winner. A value of 0 means the horse won.

##### sp
The bookmaker's starting price (SP) for the horse, expressed as the payout for a winning $1 wager. E.g., a figure of 10 means that a $1 wager paid out $10 ($9 profit).

##### s_tab_win
The TAB payout for the horse winning the race.

##### s_tab_place
The TAB payout for the horse finishing in the top 3 places.

In [6]:
df_races.describe()

Unnamed: 0,race_no,distance,track_cond,race_time,prize_money,position,barrier,weight,prize,800m,400m,margin,sp,s_tab_win,s_tab_place
count,11881.0,11881.0,11881.0,11881.0,11881.0,11881.0,11881.0,11881.0,11881.0,11880.0,11880.0,11880.0,11881.0,11881.0,11881.0
mean,5.348035,1480.599276,5.040064,89.485173,330969.5,6.396347,6.41882,56.535603,26023.99,6.411785,6.407407,4.787513,22.42012,0.684538,0.697838
std,2.727153,392.864655,1.689115,26.426646,870906.0,3.75702,3.775117,2.287362,137707.9,3.767928,3.764314,6.07917,27.758774,3.595505,1.658653
min,1.0,1000.0,3.0,56.06,50000.0,1.0,1.0,49.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,3.0,1200.0,4.0,70.02,130000.0,3.0,3.0,54.5,2600.0,3.0,3.0,1.56,7.0,0.0,0.0
50%,5.0,1400.0,4.0,83.81,150000.0,6.0,6.0,56.5,4000.0,6.0,6.0,3.55,13.0,0.0,0.0
75%,8.0,1600.0,6.0,98.21,200000.0,9.0,9.0,58.0,13500.0,9.0,9.0,6.15,26.0,0.0,0.0
max,10.0,3200.0,10.0,206.91,15000000.0,21.0,22.0,65.0,6749000.0,21.0,21.0,144.49,301.0,95.9,25.0


#### Observations

There are a total of 11,881 data rows across 1301 races, one row for each horse in each race.
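The race count can be cross-checked from the runner-level rows. The following is a minimal sketch, assuming that a race is uniquely identified by its course, start time and race number; `count_races` and the toy frame are illustrative names, not part of the original notebook.

```python
import pandas as pd

def count_races(df: pd.DataFrame) -> int:
    """Count distinct races (assumed key: course, start time, race number)."""
    return len(df.drop_duplicates(subset=["race_course", "datetime", "race_no"]))

# Toy stand-in for df_races: two runners in one race, one runner in another.
toy_races = pd.DataFrame({
    "race_course": ["Flemington", "Flemington", "Caulfield"],
    "datetime": ["2022-09-10 14:20:00", "2022-09-10 14:20:00", "2022-01-08 12:30:00"],
    "race_no": [4, 4, 2],
    "position": [1, 2, 1],
})

n_races = count_races(toy_races)
runners_per_race = len(toy_races) / n_races
print(n_races, runners_per_race)
```

Applied to `df_races`, the same function should return 1301 if the assumed key really does identify a race.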

There's a maximum of 10 races per meet. Race distances range from 1000 metres to 3200 metres. The track condition is never lower than 3 (Good) but can go as high as 10 (Very Heavy). Because these are high-profile tracks they tend to be well-maintained (good amounts of grass and moisture).

The prize money for an entire race ranges from $50,000 to $15,000,000, with a mean of $330,969.50.

The maximum number of finishers is 21 (note some races could have more runners that were scratched or failed to finish).

The minimum weight carried was 49kg, the maximum 65kg. The mean weight is about 56.5kg.

Individual prize money can range from zero (smaller races only tend to pay the first few finishers) to $6.749 million for winning the Melbourne Cup.

The position in running at the 800m/400m marks ranges from 1 to 21, in line with the maximum finishing position.

The mean margin behind the winner is about 4.79 lengths.

The SP (bookie) price is the price offered when the race started. I have zeroed the values when nothing was recorded. The mean SP is $22.42 and the median is $13.00, so in general most horses are long shots. The highest odds offered were $301.00.
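Because missing prices are stored as zero, they pull summary statistics down and cannot be inverted. The sketch below treats zeros as missing and converts a decimal price into an implied win probability (1 divided by the price); `implied_probability` is a hypothetical helper, and the toy values are illustrative only.

```python
import pandas as pd

def implied_probability(decimal_odds: pd.Series) -> pd.Series:
    """Convert decimal odds to implied probabilities, treating 0 as missing."""
    odds = decimal_odds.where(decimal_odds > 0)  # non-positive prices become NaN
    return 1.0 / odds

# A price of 10 (a $1 wager pays $10) implies a 10% chance of winning.
toy_sp = pd.Series([10.0, 4.0, 0.0], name="sp")
toy_prob = implied_probability(toy_sp)
mean_recorded_sp = toy_sp.where(toy_sp > 0).mean()  # mean over recorded prices only
print(toy_prob.tolist(), mean_recorded_sp)
```

Implied probabilities across a field usually sum to more than 1, because the bookmaker's margin is built into the prices.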

The TAB win and place columns hold the actual payouts, which are recorded only for horses that won or placed; for all other runners the value is zero. The highest payouts were $95.90 for a win and $25 for a place.

### Horses

The historic data for horses is sourced from pages similar to this: https://www.racing.com/horses/stageman

Note, we excluded trial results from the study (these are coloured grey on the website, and notably offer no prize money).

In [7]:
df_horses = pd.read_csv('/Users/phillipmonk/research_paper/horse_code/data/horse_data.csv')
df_horses.head()

Unnamed: 0,horse_name,foaled,colour,sire,dam,sex,group_1_wins,starts,firsts,seconds,...,date,dist,cond,weight,800m,400m,margin,rating,odds,odds_source
0,A Good Yarn,18/09/2013,Brown or Black,O'lonhro,Little Thread,Gelding,0,46,6,8,...,31/07/23,1100,9.0,61.5,2.0,1.0,10.2,61,20.0,SP
1,A Good Yarn,18/09/2013,Brown or Black,O'lonhro,Little Thread,Gelding,0,46,6,8,...,02/01/23,1125,3.0,54.0,7.0,5.0,6.7,61,11.0,SP
2,A Good Yarn,18/09/2013,Brown or Black,O'lonhro,Little Thread,Gelding,0,46,6,8,...,16/12/22,1109,3.0,61.5,6.0,6.0,0.8,61,6.5,SP
3,A Good Yarn,18/09/2013,Brown or Black,O'lonhro,Little Thread,Gelding,0,46,6,8,...,08/12/22,1100,3.0,58.5,4.0,5.0,1.0,61,9.5,SP
4,A Good Yarn,18/09/2013,Brown or Black,O'lonhro,Little Thread,Gelding,0,46,6,8,...,27/11/22,1000,4.0,56.5,11.0,10.0,5.6,63,15.0,SP


In [8]:
df_horses.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50257 entries, 0 to 50256
Data columns (total 40 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   horse_name    50257 non-null  object 
 1   foaled        50257 non-null  object 
 2   colour        50257 non-null  object 
 3   sire          50257 non-null  object 
 4   dam           50257 non-null  object 
 5   sex           50257 non-null  object 
 6   group_1_wins  50257 non-null  int64  
 7   starts        50257 non-null  int64  
 8   firsts        50257 non-null  int64  
 9   seconds       50257 non-null  int64  
 10  thirds        50257 non-null  int64  
 11  prize_money   50242 non-null  float64
 12  season        50257 non-null  object 
 13  grp_listed    50257 non-null  object 
 14  first_up      50257 non-null  object 
 15  second_up     50257 non-null  object 
 16  third_up      50257 non-null  object 
 17  firm          50257 non-null  object 
 18  good          50257 non-nu

This dataframe contains the form (history) for each horse. Note that this includes races at tracks other than the four in scope.

#### Fields

##### horse_name
The name of the horse. Horse names should be unique and can therefore be used as an identifier.

##### foaled
Date of birth.

##### colour
The colour of the horse.

##### sire
The horse's father.

##### dam
The horse's mother.

##### sex
The horse's sex.

##### group_1_wins
Total number of group 1 wins.

##### starts
Races started.

##### firsts
Number of races won.

##### seconds 
Number of races that the horse finished second.

##### thirds
Number of races that the horse finished third.

##### prize_money
The total amount of prize money the horse has won.

##### season
The performance of the horse over the season (starts, wins, seconds, thirds).

##### grp_listed
The performance of the horse over group and listed races (starts, wins, seconds, thirds).

##### first_up
The performance of the horse in the first race after a spell (starts, wins, seconds, thirds).

##### second_up
The performance of the horse in the second race after a spell (starts, wins, seconds, thirds).

##### third_up
The performance of the horse in the third race after a spell (starts, wins, seconds, thirds).

##### firm
The performance of the horse on firm tracks (starts, wins, seconds, thirds), i.e., the track condition is 1 or 2.

##### good
The performance of the horse on good tracks (starts, wins, seconds, thirds), i.e., the track condition is 3 or 4.

##### soft
The performance of the horse on soft tracks (starts, wins, seconds, thirds), i.e., the track condition is 5, 6 or 7.

##### heavy
The performance of the horse on heavy tracks (starts, wins, seconds, thirds), i.e., the track condition is 8, 9 or 10.

##### jumps
The performance of the horse in steeplechases. None of these races are in scope.

##### synth
The performance of the horse on synthetic tracks. None of these races are in scope.

##### position
The finishing position of the horse in each individual race.

##### num_runners
The number of runners in each individual race.

##### trainer
The trainer for the horse for that race.

##### prize
The total amount of prize money for the race.

##### _class
The class of the race - Group 1, Group 2, et cetera.

##### jockey
The jockey for the horse for that race.

##### track
The track and race number for the race.

##### date
The date of the race meet.

##### dist
The distance of the race.

##### cond
The condition of the track for the race.

##### weight
The weight the horse carried for the race.

##### 800m
The position of the horse with 800 metres to go in the race.

##### 400m
The position of the horse with 400 metres to go in the race.

##### margin
The distance behind the winner in lengths the horse finished in the race.

##### rating
The rating of the horse during the race.

##### odds
The odds for a win for the horse at the beginning of the race.

##### odds_source
The source of the odds for the race, typically SP (bookmaker) or TAB.

In [9]:
df_horses.describe()

Unnamed: 0,group_1_wins,starts,firsts,seconds,thirds,prize_money,position,num_runners,prize,dist,cond,weight,800m,400m,margin,odds
count,50257.0,50257.0,50257.0,50257.0,50257.0,50242.0,50257.0,50257.0,50257.0,50257.0,49187.0,50257.0,47014.0,47318.0,49950.0,50250.0
mean,0.046203,23.484828,4.160992,3.255666,2.778081,355947.3,4.835903,10.661918,161530.7,1457.104025,5.173643,57.242846,5.658506,5.507925,3.962741,15.363363
std,0.343294,12.864327,2.343639,2.439559,2.251207,779619.3,3.388646,2.774304,541063.8,401.028369,1.850005,2.320796,3.422342,3.401808,5.103854,25.951027
min,0.0,1.0,0.0,0.0,0.0,1350.0,1.0,3.0,4664.0,800.0,2.0,49.0,1.0,1.0,0.0,1.08
25%,0.0,14.0,3.0,1.0,1.0,105475.0,2.0,9.0,35000.0,1200.0,4.0,56.0,3.0,3.0,1.2,4.6
50%,0.0,21.0,4.0,3.0,2.0,188178.0,4.0,10.0,55000.0,1400.0,4.0,57.0,5.0,5.0,2.8,8.5
75%,0.0,30.0,5.0,5.0,4.0,347050.0,7.0,12.0,146250.0,1600.0,6.0,59.0,8.0,8.0,5.1,17.0
max,9.0,100.0,22.0,18.0,18.0,20765010.0,26.0,30.0,15000000.0,4600.0,10.0,76.0,22.0,22.0,165.2,1313.0


In [10]:
df_horses['horse_name'].nunique()

3272

#### Observations

There are 50,257 rows of horse observations for 3272 horses. The reason this number is higher than the observations in the Races data frame is that the data is historic for the horse, and includes races before 2022 and at other racetracks. Usually a horse would start its career at smaller tracks and would only 'graduate' to the courses in scope after having proven itself.
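The sample rows above show `foaled` as a day-first string with a four-digit year and `date` as a day-first string with a two-digit year. Separating pre-2022 form from in-period form therefore needs the dates parsed first. This is a minimal sketch, assuming those two formats hold for every row; the toy frame reuses one row from the sample output.

```python
import pandas as pd

toy_form = pd.DataFrame({
    "horse_name": ["A Good Yarn"],
    "foaled": ["18/09/2013"],
    "date": ["31/07/23"],
})

# Assumed formats: dd/mm/YYYY for 'foaled', dd/mm/yy for 'date'.
toy_form["foaled"] = pd.to_datetime(toy_form["foaled"], format="%d/%m/%Y")
toy_form["date"] = pd.to_datetime(toy_form["date"], format="%d/%m/%y")

# Age of the horse on race day, and a flag for form that predates the study window.
toy_form["age_years"] = (toy_form["date"] - toy_form["foaled"]).dt.days / 365.25
toy_form["pre_2022"] = toy_form["date"] < pd.Timestamp("2022-01-01")
print(toy_form[["horse_name", "age_years", "pre_2022"]])
```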

There are only a handful of Group 1 races every year, so Group 1 wins are rare. The highest number of wins is 9. Almost all horses have zero.

Horses have a mean of about 23 starts. This reflects that most horses have already had some experience racing, with on average 4 wins, 3 seconds and 3 thirds, and about $356,000 in winnings. The winningest horse has over $20 million in career winnings.
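One caveat: career fields such as `starts` and `firsts` are repeated on every row of a horse's form, so `describe()` weights horses with long histories more heavily. A per-horse view can be sketched by keeping one row per horse before summarising. The toy frame below is illustrative, not real data.

```python
import pandas as pd

# Horse "A" has three form rows, horse "B" has one; career fields repeat per row.
toy_form = pd.DataFrame({
    "horse_name": ["A", "A", "A", "B"],
    "starts": [3, 3, 3, 1],
    "firsts": [2, 2, 2, 0],
})

row_weighted_mean = toy_form["starts"].mean()           # every form row counts once
per_horse = toy_form.drop_duplicates(subset="horse_name")
per_horse_mean = per_horse["starts"].mean()             # every horse counts once
print(row_weighted_mean, per_horse_mean)
```

On the real data, the per-horse mean of `starts` would be expected to come out below the row-weighted figure of about 23, since horses with many starts contribute many rows.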

On average a race has about 10-11 runners, with a maximum of 30.

The 'prize' column is the total prize pool for that race. The median value is $55,000; the mean is quite a bit higher at $161,531.

Races range from an 800m sprint to a whopping 4600m long-distance event. Most races are around 1400 metres. Track conditions range from 2 to 10, with most being a 4 to 6. Weight carried ranges from a svelte 49kg to a husky 76.
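The firm/good/soft/heavy bands given in the field descriptions (1-2, 3-4, 5-7 and 8-10) can be derived from the numeric rating. This is a minimal sketch using `pd.cut`; the band labels follow the field list, while the variable names are illustrative.

```python
import pandas as pd

# Right-inclusive bins: (0, 2] firm, (2, 4] good, (4, 7] soft, (7, 10] heavy.
BAND_EDGES = [0, 2, 4, 7, 10]
BAND_LABELS = ["firm", "good", "soft", "heavy"]

toy_cond = pd.Series([1, 3, 4, 5, 7, 8, 10], name="cond")
toy_bands = pd.cut(toy_cond, bins=BAND_EDGES, labels=BAND_LABELS)
print(toy_bands.astype(str).tolist())
```

Missing ratings stay missing after the cut, so the rows without a recorded `cond` would need separate handling.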

The mean positions at the 800m/400m marks are a little lower here than in the Races data frame (about 5.5 to 5.7 here as opposed to 6.4). This is probably because these horses tended to finish closer to the front in their early careers, racing against lower-quality opposition. Similarly, the mean margin here is 3.96 lengths as opposed to 4.79 in the Races data frame.

The odds quoted are generally the starting price (SP) offered by bookmakers at the start of the race.



### Jockeys

Jockey data is sourced from pages similar to this:

https://www.racing.com/jockeys/kerrin-mcevoy/stats

There is a wealth of data available on those pages; to keep things manageable, we only look at the career summary and the class statistics.

In [27]:
df_jockeys = pd.read_csv('/Users/phillipmonk/research_paper/horse_code/data/jockey_data.csv')
df_jockeys.head()

Unnamed: 0,jockey_name,career_wins,group_1_wins,prize_money,win_pct,recent_win_pct,group_1_win_pct,group_1_place_pct,group_2_win_pct,group_2_place_pct,group_3_win_pct,group_3_place_pct,listed_win_pct,listed_place_pct,other_win_pct,other_place_pct
0,Aaron Bullock,836,0,18266124,19.0,24.0,0.0,0.0,,,0.0,0.0,0.0,15.0,19,45
1,Adam Hyeronimus,535,1,20898420,13.0,22.0,3.0,13.0,2.0,9.0,6.0,17.0,13.0,33.0,13,36
2,Adam McCabe,208,0,4751110,9.0,6.0,,,,,0.0,0.0,7.0,7.0,9,29
3,Alana Kelly,214,0,10565677,7.0,4.0,0.0,0.0,50.0,50.0,0.0,12.0,4.0,8.0,7,25
4,Alexandra Bryan,95,0,1985574,8.0,4.0,,,,,,,,,8,21


In [28]:
df_jockeys.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   jockey_name        250 non-null    object 
 1   career_wins        250 non-null    int64  
 2   group_1_wins       250 non-null    int64  
 3   prize_money        250 non-null    int64  
 4   win_pct            249 non-null    float64
 5   recent_win_pct     241 non-null    float64
 6   group_1_win_pct    167 non-null    float64
 7   group_1_place_pct  167 non-null    float64
 8   group_2_win_pct    172 non-null    float64
 9   group_2_place_pct  172 non-null    float64
 10  group_3_win_pct    198 non-null    float64
 11  group_3_place_pct  198 non-null    float64
 12  listed_win_pct     219 non-null    float64
 13  listed_place_pct   219 non-null    float64
 14  other_win_pct      250 non-null    int64  
 15  other_place_pct    250 non-null    int64  
dtypes: float64(10), int64(5), 

The jockeys data frame contains summary statistical information about jockeys. All of these jockeys were involved in at least one race in scope, but the summary data is across their entire career.

#### Fields

##### jockey_name
The name of the jockey.

##### career_wins
Total number of career wins for the jockey.

##### group_1_wins
The total number of group 1 wins for the jockey.

##### prize_money
The total amount of prize money won by the jockey.

##### win_pct
The total win percentage for the jockey.

##### recent_win_pct
The win percentage of the jockey over the last 50 races.

##### group_1_win_pct
The win percentage of the jockey in Group 1 races.

##### group_1_place_pct
The place percentage of the jockey in Group 1 races.

##### group_2_win_pct
The win percentage of the jockey in Group 2 races.

##### group_2_place_pct
The place percentage of the jockey in Group 2 races.

##### group_3_win_pct
The win percentage of the jockey in Group 3 races.

##### group_3_place_pct
The place percentage of the jockey in Group 3 races.

##### listed_win_pct
The win percentage of the jockey in Listed races.

##### listed_place_pct
The place percentage of the jockey in Listed races.

##### other_win_pct
The win percentage of the jockey in other races.

##### other_place_pct
The place percentage of the jockey in other races.

In [29]:
df_jockeys.describe()

Unnamed: 0,career_wins,group_1_wins,prize_money,win_pct,recent_win_pct,group_1_win_pct,group_1_place_pct,group_2_win_pct,group_2_place_pct,group_3_win_pct,group_3_place_pct,listed_win_pct,listed_place_pct,other_win_pct,other_place_pct
count,250.0,250.0,250.0,249.0,241.0,167.0,167.0,172.0,172.0,198.0,198.0,219.0,219.0,250.0,250.0
mean,660.464,6.432,28451880.0,11.052209,10.80083,5.586826,16.97006,7.127907,20.604651,6.924242,22.767677,7.442922,22.680365,11.184,32.856
std,676.108806,18.202492,41379010.0,4.620397,6.853965,10.128241,17.935435,11.13584,18.784577,7.80988,16.955797,8.659204,16.281867,5.03862,9.259493
min,0.0,0.0,800.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,161.25,0.0,4792883.0,8.0,6.0,0.0,0.0,0.0,3.0,0.0,13.0,0.0,13.0,8.0,28.0
50%,421.0,0.0,12336680.0,10.0,10.0,3.0,15.0,5.0,19.0,5.0,22.0,7.0,23.0,10.0,31.0
75%,1029.0,2.0,34198850.0,13.0,14.0,7.0,23.5,9.0,29.0,10.0,29.0,10.0,30.0,13.0,37.0
max,3152.0,125.0,247809300.0,37.0,40.0,100.0,100.0,100.0,100.0,50.0,100.0,60.0,100.0,42.0,76.0


In [35]:
df_jockeys.sort_values(by=['group_1_wins'],ascending=False).head(10)

Unnamed: 0,jockey_name,career_wins,group_1_wins,prize_money,win_pct,recent_win_pct,group_1_win_pct,group_1_place_pct,group_2_win_pct,group_2_place_pct,group_3_win_pct,group_3_place_pct,listed_win_pct,listed_place_pct,other_win_pct,other_place_pct
68,Damien Oliver,2788,125,247809336,16.0,8.0,11.0,34.0,14.0,38.0,13.0,34.0,15.0,39.0,17,43
85,Frankie Dettori,385,105,98599310,33.0,40.0,32.0,58.0,36.0,67.0,33.0,67.0,25.0,62.0,36,69
99,Hugh Bowman,2358,102,234158146,16.0,8.0,15.0,36.0,15.0,39.0,13.0,37.0,12.0,37.0,17,44
90,Glen Boss,1905,90,183107631,13.0,,11.0,29.0,9.0,29.0,9.0,30.0,11.0,31.0,14,36
212,Ryan Moore,467,76,110720971,32.0,38.0,21.0,57.0,32.0,66.0,32.0,70.0,39.0,73.0,36,72
190,Opie Bosson,674,76,39532409,25.0,24.0,25.0,42.0,21.0,50.0,19.0,50.0,22.0,50.0,27,58
107,James McDonald,1437,75,198956894,21.0,30.0,15.0,41.0,21.0,47.0,20.0,46.0,17.0,41.0,22,51
140,Kerrin McEvoy,1844,72,215875646,12.0,10.0,9.0,28.0,10.0,34.0,12.0,33.0,11.0,33.0,13,36
66,Craig Williams,2090,70,215348589,13.0,16.0,9.0,27.0,12.0,32.0,12.0,33.0,12.0,35.0,13,37
184,Nash Rawiller,2418,67,155762171,18.0,26.0,13.0,35.0,17.0,39.0,16.0,38.0,15.0,38.0,18,45


#### Observations

There are only 250 jockeys in scope. Each line in the data frame is a summary record for one jockey. Most jockeys ride in multiple races at a meet, so the number of distinct jockeys is much lower than the number of runners.

This is career data so it includes results stretching back decades, in some cases.
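Linking these summary records back to individual runners is not straightforward, because the Races data frame abbreviates jockey names (for example `D.M.Lane`) while this data frame uses full names. One possible approach, sketched here under the assumption that first initial plus surname is distinctive enough, is to reduce both forms to a common key. `name_key` is a hypothetical helper; it will collide for jockeys who share an initial and surname, and it assumes a single-word surname.

```python
import re

def name_key(name: str) -> str:
    """Reduce 'D.M.Lane' or 'Damian Lane' to a shared key such as 'd lane'."""
    parts = [p for p in re.split(r"[.\s]+", name.strip()) if p]
    first_initial = parts[0][0]
    surname = parts[-1]
    return f"{first_initial} {surname}".lower()

# Abbreviated and full forms of the same (illustrative) names map to one key.
print(name_key("D.M.Lane"), "|", name_key("Damian Lane"))
print(name_key("K.McEvoy"), "|", name_key("Kerrin McEvoy"))
```

Any keys that match more than one jockey would need to be resolved by hand before merging.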

The prize_money column is the total money won by the jockey over their career. With a mean of $28 million, it is clear that jockeys have much longer careers than horses and many more opportunities to race. Even the median jockey in the study has over $12 million in prize money.

Win percentages tend to hover around 10%. Group 1 races are much harder to win or place in; Group 2, Group 3 and Listed races are somewhat less so. This is probably due to Group 1 fields having more runners and the 'plum' rides tending to go to the same jockeys. 'Other' races report higher success rates.
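When comparing classes, it helps to look at how many jockeys have a value at all: only 167 of the 250 jockeys have a Group 1 win percentage, and the maximum of 100% suggests that some percentages rest on very few rides (the number of rides is not in this data frame, so that is an inference). A minimal sketch of a per-class summary that reports the non-null count alongside the mean and median, on illustrative toy data:

```python
import numpy as np
import pandas as pd

toy_jockeys = pd.DataFrame({
    "jockey_name": ["J1", "J2", "J3"],
    "group_1_win_pct": [np.nan, 100.0, 10.0],  # NaN: no value recorded for this class
    "other_win_pct": [8, 12, 10],
})

# Select every win-percentage column (on the real data this also picks up
# win_pct and recent_win_pct) and summarise it.
win_cols = [c for c in toy_jockeys.columns if c.endswith("_win_pct")]
win_summary = toy_jockeys[win_cols].agg(["count", "mean", "median"]).T
print(win_summary)
```

The median is less sensitive than the mean to a handful of 0% or 100% values.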

### Trainers

In [30]:
df_trainers = pd.read_csv('/Users/phillipmonk/research_paper/horse_code/data/trainer_data.csv')
df_trainers.head()

Unnamed: 0,trainer_name,career_wins,group_1_wins,prize_money,win_pct,place_pct,recent_win_pct,group_1_win_pct,group_1_place_pct,group_2_win_pct,group_2_place_pct,group_3_win_pct,group_3_place_pct,listed_win_pct,listed_place_pct,other_win_pct,other_place_pct
0,Aaron Clarke,122,0,2118767,8.0,27,6.0,,,,,0.0,0.0,,,8.0,27.0
1,Aaron Laing,14,0,553190,6.0,25,10.0,,,0.0,0.0,0.0,0.0,,,6.0,25.0
2,Aaron Purcell,456,3,17286146,9.0,31,14.0,15.0,30.0,20.0,30.0,5.0,26.0,7.0,26.0,9.0,31.0
3,Adam Campton,49,0,1976714,19.0,45,12.0,,,0.0,100.0,,,0.0,33.0,19.0,44.0
4,Adam Chambers,48,0,1091446,16.0,39,22.0,,,,,,,0.0,0.0,16.0,39.0


In [31]:
df_trainers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541 entries, 0 to 540
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trainer_name       541 non-null    object 
 1   career_wins        541 non-null    int64  
 2   group_1_wins       541 non-null    int64  
 3   prize_money        541 non-null    int64  
 4   win_pct            539 non-null    float64
 5   place_pct          541 non-null    int64  
 6   recent_win_pct     528 non-null    float64
 7   group_1_win_pct    296 non-null    float64
 8   group_1_place_pct  296 non-null    float64
 9   group_2_win_pct    331 non-null    float64
 10  group_2_place_pct  331 non-null    float64
 11  group_3_win_pct    390 non-null    float64
 12  group_3_place_pct  390 non-null    float64
 13  listed_win_pct     435 non-null    float64
 14  listed_place_pct   435 non-null    float64
 15  other_win_pct      539 non-null    float64
 16  other_place_pct    539 non

The trainers data frame contains summary statistical information about trainers. All of these trainers were involved in at least one race in scope, but the summary data is across their entire career.

#### Fields

##### trainer_name
The name of the trainer.

##### career_wins
Total number of career wins for the trainer.

##### group_1_wins
The total number of group 1 wins for the trainer.

##### prize_money
The total amount of prize money won by the trainer.

##### win_pct
The total win percentage for the trainer.

##### place_pct
The total place percentage for the trainer.

##### recent_win_pct
The win percentage of the trainer over the last 50 races.

##### group_1_win_pct
The win percentage of the trainer in Group 1 races.

##### group_1_place_pct
The place percentage of the trainer in Group 1 races.

##### group_2_win_pct
The win percentage of the trainer in Group 2 races.

##### group_2_place_pct
The place percentage of the trainer in Group 2 races.

##### group_3_win_pct
The win percentage of the trainer in Group 3 races.

##### group_3_place_pct
The place percentage of the trainer in Group 3 races.

##### listed_win_pct
The win percentage of the trainer in Listed races.

##### listed_place_pct
The place percentage of the trainer in Listed races.

##### other_win_pct
The win percentage of the trainer in other races.

##### other_place_pct
The place percentage of the trainer in other races.

In [32]:
df_trainers.describe()

Unnamed: 0,career_wins,group_1_wins,prize_money,win_pct,place_pct,recent_win_pct,group_1_win_pct,group_1_place_pct,group_2_win_pct,group_2_place_pct,group_3_win_pct,group_3_place_pct,listed_win_pct,listed_place_pct,other_win_pct,other_place_pct
count,541.0,541.0,541.0,539.0,541.0,528.0,296.0,296.0,331.0,331.0,390.0,390.0,435.0,435.0,539.0,539.0
mean,290.340111,2.417745,11770060.0,10.727273,32.07024,11.054924,4.402027,15.422297,7.350453,21.752266,5.315385,19.648718,6.422989,22.110345,11.096475,32.784787
std,507.563186,11.394897,32267930.0,4.173048,7.676483,6.093112,8.603736,17.732113,13.232575,21.477662,7.116765,19.349503,8.167905,19.439301,4.898028,8.94147
min,0.0,0.0,18445.0,1.0,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0
25%,34.0,0.0,1091446.0,8.0,28.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,8.0,28.0
50%,123.0,0.0,3017801.0,10.0,32.0,10.0,0.0,14.0,2.0,20.0,2.0,20.0,5.0,22.0,10.0,32.0
75%,322.0,0.0,9139064.0,13.0,36.0,14.0,7.0,25.0,11.0,33.0,9.0,30.0,10.0,31.0,13.0,37.0
max,4567.0,150.0,436460900.0,32.0,59.0,38.0,100.0,100.0,100.0,100.0,50.0,100.0,60.0,100.0,37.0,87.0


#### Observations

There are 541 trainers in scope. Each line in the data frame is a summary record. A trainer may have runners in multiple races at a meeting, and sometimes multiple runners in the same race.

This is career data so it includes results stretching back decades, in some cases.

The prize_money column is the total money won by the trainer over their career. The mean is almost $12 million and the median is $3 million. There are more trainers than jockeys in the study. Logistically, it may be harder for a trainer to work with as many horses as a jockey can, as each trainer needs to invest a certain amount of time overseeing each horse they train.

Win percentages tend to hover around 10%. Group 1 races are much harder to win or place in; Group 2, Group 3 and Listed races are somewhat less so. This is probably due to Group 1 fields having more runners and the best runners tending to be concentrated with the same trainers. 'Other' races report higher success rates.

### Sectionals

In [17]:
df_sectionals = pd.read_csv('/Users/phillipmonk/research_paper/horse_code/data/sectional_data.csv')
df_sectionals.head()

Unnamed: 0,race_course,race_no,date,horse_name,r1_l8,r2_l8,r3_l8,r4_l8,r5_l8,r1_l6,...,r1_l4,r2_l4,r3_l4,r4_l4,r5_l4,r1_l2,r2_l2,r3_l2,r4_l2,r5_l2
0,caulfield,2,2022-01-08,Blandford Lad,-1.07,-,-2.1,-,-2.88,-1.27,...,-0.1,-,-0.92,-,-1.47,0.18,-,-0.36,-,-0.24
1,caulfield,2,2022-01-08,The Awesome Son,-,-0.92,-2.23,-0.61,-,-,...,-,-0.61,-1.62,0.38,-,-,0.04,-0.57,0.34,-
2,caulfield,2,2022-01-08,Brilliant Venture,-0.75,0.18,-1.99,-1.46,-,-0.95,...,0.22,1.13,-0.92,-0.15,-,0.34,0.9,-0.24,0.41,-
3,caulfield,2,2022-01-08,Coup De Tonnerre,-1.09,-0.23,-,-1.05,1.24,-0.58,...,0.28,0.05,-,0.52,0.44,0.56,0.21,-,0.45,0.4
4,caulfield,2,2022-01-08,Ruru,-1.19,-1.89,-1.3,-2.14,-0.18,-1.39,...,-0.18,-0.92,-0.18,-1.27,0.57,0.02,0.12,0.15,-0.21,0.69


In [18]:
df_sectionals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4008 entries, 0 to 4007
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   race_course  4008 non-null   object
 1   race_no      4008 non-null   int64 
 2   date         4008 non-null   object
 3   horse_name   4008 non-null   object
 4   r1_l8        4008 non-null   object
 5   r2_l8        4008 non-null   object
 6   r3_l8        4008 non-null   object
 7   r4_l8        4008 non-null   object
 8   r5_l8        4008 non-null   object
 9   r1_l6        4008 non-null   object
 10  r2_l6        4008 non-null   object
 11  r3_l6        4008 non-null   object
 12  r4_l6        4008 non-null   object
 13  r5_l6        4008 non-null   object
 14  r1_l4        4008 non-null   object
 15  r2_l4        4008 non-null   object
 16  r3_l4        4008 non-null   object
 17  r4_l4        4008 non-null   object
 18  r5_l4        4008 non-null   object
 19  r1_l2        4008 non-null 

The sectionals data frame contains historic sectional data for each horse in the nominated race. Up to 5 races' worth of sectional data is recorded for each horse. Each of the fields from 'r1_l8' to 'r5_l2' gives the number of lengths by which the horse ran faster or slower than the benchmark over the closing stage of a previous race.

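The field format is 'rx_ly', where x is the historical race (1 is the most recent, 5 is the fifth-most recent) and y is the distance from the finish in hundreds of metres (8, 6, 4 or 2, i.e., the final 800m, 600m, 400m or 200m).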

According to racenet.com.au, sectionals metrics compare a runner’s final 800m, 600m and 200m splits to the standard benchmark in lengths. The standard benchmark compares races of similar length and track condition, irrespective of class and location. Negative result = faster than the benchmark. Positive result = slower than the benchmark.
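Since the race index and the distance are both encoded in the column name, the 20 sectional columns can be reshaped into one row per horse, previous race and distance. The following is a rough sketch, assuming every sectional column matches the `rx_ly` pattern with single digits; `long_df` and the toy frame are illustrative names.

```python
import pandas as pd

toy_sectionals = pd.DataFrame({
    "horse_name": ["Blandford Lad"],
    "r1_l8": ["-1.07"], "r2_l8": ["-"],
    "r1_l2": ["0.18"], "r2_l2": ["-"],
})

# Wide to long: one row per (horse, sectional column).
long_df = toy_sectionals.melt(id_vars="horse_name", var_name="field", value_name="lengths")

# Decode 'rx_ly' into the race index x and the distance y (in hundreds of metres).
parsed = long_df["field"].str.extract(r"^r(?P<races_back>\d)_l(?P<hundreds>\d)$").astype(int)
long_df["races_back"] = parsed["races_back"]
long_df["metres"] = parsed["hundreds"] * 100
print(long_df)
```

In this shape, a negative `lengths` value (faster than the benchmark) can be compared across distances or across a horse's recent races with ordinary group-by operations.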

#### Fields

##### race_course
The course at which the race was run.

##### race_no
The number of the race in the meet.

##### date
The date of the meet.

##### horse_name
The name of the horse.

##### r1_l8
The sectionals 800m value for the most recent race in the horse's history.

##### r2_l8
The sectionals 800m value for the second-most recent race in the horse's history.

##### r3_l8
The sectionals 800m value for the third-most recent race in the horse's history.

##### r4_l8
The sectionals 800m value for the fourth-most recent race in the horse's history.

##### r5_l8
The sectionals 800m value for the fifth-most recent race in the horse's history.

##### r1_l6
The sectionals 600m value for the most recent race in the horse's history.

##### r2_l6
The sectionals 600m value for the second-most recent race in the horse's history.

##### r3_l6
The sectionals 600m value for the third-most recent race in the horse's history.

##### r4_l6
The sectionals 600m value for the fourth-most recent race in the horse's history.

##### r5_l6
The sectionals 600m value for the fifth-most recent race in the horse's history.

##### r1_l4
The sectionals 400m value for the most recent race in the horse's history.

##### r2_l4
The sectionals 400m value for the second-most recent race in the horse's history.

##### r3_l4
The sectionals 400m value for the third-most recent race in the horse's history.

##### r4_l4
The sectionals 400m value for the fourth-most recent race in the horse's history.

##### r5_l4
The sectionals 400m value for the fifth-most recent race in the horse's history.

##### r1_l2
The sectionals 200m value for the most recent race in the horse's history.

##### r2_l2
The sectionals 200m value for the second-most recent race in the horse's history.

##### r3_l2
The sectionals 200m value for the third-most recent race in the horse's history.

##### r4_l2
The sectionals 200m value for the fourth-most recent race in the horse's history.

##### r5_l2
The sectionals 200m value for the fifth-most recent race in the horse's history.


In [19]:
df_sectionals.describe()

Unnamed: 0,race_no
count,4008.0
mean,2.067615
std,0.808409
min,1.0
25%,1.0
50%,2.0
75%,3.0
max,3.0
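`describe()` only reports `race_no` because the sectional columns are stored as strings, with `-` standing in for a missing value (see the `info()` and `head()` output above). A minimal sketch of converting them to numbers first, assuming `-` is the only non-numeric token; note that `errors="coerce"` would also silently turn any other unparseable text into NaN.

```python
import re
import pandas as pd

toy_sectionals = pd.DataFrame({
    "horse_name": ["Blandford Lad", "Ruru"],
    "r1_l8": ["-1.07", "-1.19"],
    "r2_l8": ["-", "-1.89"],
})

# Pick out the rx_ly columns and convert them; "-" becomes NaN.
sect_cols = [c for c in toy_sectionals.columns if re.fullmatch(r"r\d_l\d", c)]
toy_sectionals[sect_cols] = toy_sectionals[sect_cols].apply(pd.to_numeric, errors="coerce")

# Mean 800m sectional over the races available for each horse (NaN is skipped).
mean_l8 = toy_sectionals[["r1_l8", "r2_l8"]].mean(axis=1)
print(toy_sectionals.describe())
```

After the conversion, `describe()` covers the sectional columns as well as `race_no`.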


In [20]:
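# NOTE: `df_jockey` is never defined in this notebook as shown, hence the NameError below.
# It presumably needs to be built first, analogously to `df_trainer` in the next cell.
pd.DataFrame(df_jockey).to_csv('/Users/phillipmonk/research_paper/horse_code/data/jockeys.csv')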

NameError: name 'df_jockey' is not defined

In [None]:
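# NOTE: `df` is not defined in this notebook as shown. It is assumed to be a form data frame
# with a 'date' column of datetime.date values and a 'track' column of track codes.
in_scope = (df['date'] >= datetime.date(2022, 1, 1)) & (df['track'].isin(['CAUL', 'FLEM', 'RAND', 'RHIL']))
# Unique trainers with at least one runner at the four courses since 1 January 2022
df_trainer = df[in_scope]['trainer'].unique()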

df_trainer

In [None]:
pd.DataFrame(df_trainer).to_csv('/Users/phillipmonk/research_paper/horse_code/data/trainers.csv')