When modelling or handicapping a horse race, how important is the weight carried by each horse?

Using data from horse races in Hong Kong, we aim to investigate this and see if any other factors might need to be taken into consideration.

## Firstly, load the data

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set some Pandas options
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 20)

# Read the race data
df_races = pd.read_csv('../input/races.csv', parse_dates=['date']).set_index('race_id')

# To save some time with feature calculations, just use a subset of 1000 handicap races
df_races = df_races[(df_races['race_class'] >= 1) & (df_races['race_class'] <= 5)].iloc[1000:2000]
df_races.head()

In [None]:
# We'll also need to get the horse run data for the above races
df_runs = pd.read_csv('../input/runs.csv')
# Take a copy so that columns added later don't trigger SettingWithCopyWarning
df_runs = df_runs[df_runs['race_id'].isin(df_races.index)].copy()
df_runs.head()

## Initial Analysis

Let's take a quick look at the weight carried and how this compares with the strike rate.

In [None]:
weight_vs_wins = df_runs.groupby('actual_weight')['won'].mean()
weight_vs_wins.plot();

Apart from the likely outlier at 105lbs, this would seem to confirm our belief that winning horses tend to carry more weight. However, if we just leave it at that, we will have neglected several important factors, such as:

 - the competition between horses in a race &mdash; _how does one horse's weight compare with others in the same race?_
 - the weight of the horse + jockey (declared_weight) &mdash; _if the horse is already huge, will he notice a few more pounds?_
 - the distance of the race &mdash; _carrying 120lbs over 1000 metres is usually a lot easier than carrying 120lbs over 2000 metres._

In most handicap races, the weight carried should be between 113 and 133lbs, according to the current handicapping policy described on the [HKJC website][1].
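
Because the strike-rate plot gives every weight value the same visual prominence, it may help to count how many runs sit behind each point and to flag weights that fall outside the 113 to 133lb band. The following is a minimal sketch on a small made-up frame that reuses the `actual_weight` and `won` column names from `runs.csv`; the `summarise_by_weight` helper and its `low`/`high` defaults are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def summarise_by_weight(runs, low=113, high=133):
    """Strike rate and run count per carried weight, flagging weights outside [low, high]."""
    summary = runs.groupby('actual_weight')['won'].agg(strike_rate='mean', runs='count')
    summary['in_band'] = (summary.index >= low) & (summary.index <= high)
    return summary

# Made-up example data, for illustration only
toy_runs = pd.DataFrame({
    'actual_weight': [105, 120, 120, 126, 126, 126, 133, 133],
    'won':           [1,   0,   1,   0,   0,   1,   1,   0],
})
weight_summary = summarise_by_weight(toy_runs)
print(weight_summary)
```

A weight with a very small `runs` count and `in_band` set to `False` is a reasonable candidate for being treated as an outlier rather than as evidence of a trend.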

However, not all races are the same, and one of the features of handicap races is that horses are grouped into classes, depending on their performance history. In this context, the term "Handicap" means a race in which the weights to be carried by the horses are adjusted by the handicapper for the intended purpose of equalising their chances of winning. (As an aside, there is also another category of races in addition to handicaps, known as "Weight-for-Age", which means any race for which weights are allotted according to the age and sex of the horse; but we will only consider handicap races here).

[1]: http://www.hkjc.com/english/racinginfo/handicap_policy.asp

## Does race class make a difference?

In [None]:
# Look up each run's race class from df_races (which is indexed by race_id)
df_runs['race_class'] = df_runs['race_id'].map(df_races['race_class'])
df_class_runs = df_runs.groupby('race_class')
df_class_runs['actual_weight'].mean()

From the table above, we can see that horses running in the higher race classes tend to carry slightly more weight. So let's do a little feature engineering and see what happens if we compare weight carried with the mean weight carried by all horses in the same race.

In [None]:
# Add an 'actual_weight_mean' field to each run: the mean weight carried in that run's race
df_runs['actual_weight_mean'] = df_runs.groupby('race_id')['actual_weight'].transform('mean')

# Add an 'actual_weight_var' field: this horse's weight relative to the race mean
# (a difference in lbs, not a statistical variance, despite the name)
df_runs['actual_weight_var'] = df_runs['actual_weight'] - df_runs['actual_weight_mean']

In [None]:
# Plot the difference from the race mean weight against strike rate
weight_var_vs_wins = df_runs.groupby('actual_weight_var')['won'].mean()
weight_var_vs_wins.plot();

Not exactly clear, but ignoring the numerous outliers, it does look like there may be a _slightly_ higher chance of winning for horses carrying a heavier weight than others in the same race.
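
One way to make a plot like this easier to read is to group the relative weights into a few bins before taking the strike rate, so that each point is backed by more runs. The sketch below uses synthetic data and hypothetical 5lb-wide bin edges; only the `actual_weight_var` and `won` column names come from the analysis above, and the random values say nothing about real race results.

```python
import numpy as np
import pandas as pd

# Synthetic differences from the race mean weight (lbs) and made-up win flags
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'actual_weight_var': rng.uniform(-10, 10, size=500).round(1),
    'won': rng.integers(0, 2, size=500),
})

# Hypothetical bin edges, 5 lbs wide; include_lowest keeps the value -10.0
bin_edges = [-10, -5, 0, 5, 10]
toy['var_bin'] = pd.cut(toy['actual_weight_var'], bins=bin_edges, include_lowest=True)

# Strike rate and number of runs per bin
binned = toy.groupby('var_bin', observed=False)['won'].agg(strike_rate='mean', runs='count')
print(binned)
```

Applied to `df_runs`, the same grouping would give one strike rate per bin instead of one per distinct weight difference, at the cost of choosing the bin edges by hand.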

## What about the weight of the horse?

Can bigger horses pull more weight? Most thoroughbreds are equine athletes, so if they ever are on the heavy side, it's unlikely to be due to laziness or overeating.

Let's see if there is any relationship between the ratio of added weight to declared weight (weight of the horse & jockey) and the strike rate.

In [None]:
df_runs['weight_ratio'] = df_runs['actual_weight'] / df_runs['declared_weight']
weight_ratio_vs_wins = df_runs.groupby('weight_ratio')['won'].mean()
weight_ratio_vs_wins.plot();

So the answer here is no: there does not seem to be any relationship. This is likely due to the way in which weights are assigned to horses, which is based at least in part on their estimated strength.
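
A visual impression like this can be backed up with a single summary number. The sketch below computes a Spearman rank correlation between the weight ratio and the win flag, written as a Pearson correlation of the ranks so that it needs nothing beyond pandas. The six rows are invented for illustration, so the printed value means nothing in itself; on the real `df_runs`, a value near zero would be consistent with the reading above.

```python
import pandas as pd

# Made-up runs, for illustration only
toy = pd.DataFrame({
    'actual_weight':   [120, 126, 133, 118, 130, 124],
    'declared_weight': [1050, 1100, 1000, 1150, 1080, 1120],
    'won':             [0, 1, 0, 0, 1, 0],
})
toy['weight_ratio'] = toy['actual_weight'] / toy['declared_weight']

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rank_corr = toy['weight_ratio'].rank().corr(toy['won'].rank())
print(round(rank_corr, 3))
```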

## Does the race distance matter?

As mentioned above, there may be some value in considering the weight carried along with the distance of the race. A weight carried over a shorter distance should, in theory, have a lower impact on the horse's performance than if it was carried over a longer distance.

So let's see.

In [None]:
# Look up each run's race distance from df_races (which is indexed by race_id)
df_runs['distance'] = df_runs['race_id'].map(df_races['distance'])
df_runs['weight_distance'] = df_runs['actual_weight'] * df_runs['distance']
weight_distance_vs_wins = df_runs.groupby('weight_distance')['won'].mean()
weight_distance_vs_wins.plot();

Possibly, but only slightly.
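
Multiplying weight by distance folds two factors into one number, so a light weight over a long trip can land close to a heavy weight over a short one. An alternative worth trying is to keep the two apart and tabulate the strike rate by distance and weight band. The sketch below does this on invented data; the 125lb split and the `weight_band` column are hypothetical choices made only for the example.

```python
import numpy as np
import pandas as pd

# Made-up runs, for illustration only
toy = pd.DataFrame({
    'distance':      [1000, 1000, 1200, 1200, 1650, 1650, 2000, 2000],
    'actual_weight': [118,  131,  120,  129,  117,  132,  121,  130],
    'won':           [0,    1,    1,    0,    0,    1,    0,    0],
})

# Hypothetical split into lighter and heavier weights at 125 lbs
toy['weight_band'] = np.where(toy['actual_weight'] >= 125, 'heavy', 'light')

# Strike rate for each (distance, weight band) combination
rate_table = toy.pivot_table(index='distance', columns='weight_band',
                             values='won', aggfunc='mean')
print(rate_table)
```

If weight mattered more over longer trips, the gap between the two columns would be expected to change as the distance increases.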

## Conclusion

Weight carried is an important handicapping factor, but due to the way in which weights are assigned to each horse in a race, its usefulness as a factor determining the likely outcome of a race is somewhat less than may be expected.

For this reason, it is necessary to examine weight carried along with other factors, such as the weight of the horse and the distance over which the weight is carried. But even then, there are many other factors, such as past performance, which will likely contribute more insight to a computerised race betting system than weight does.