For this experiment, we will first explore the data and look for possible correlations between various parameters and the average time a horse takes around the track. Based on these correlations we will then form a hypothesis, e.g. that a certain breed of horse from a certain country performs markedly better than the others. Finally, we will run a hypothesis test to check whether the data support our hypothesis.

P.S.
1. For the average time, we will consider only the first 3 sections, as their times are present for all the runs.
2. We will accept a hypothesis only if the observed result would have less than a 5% chance of occurring if the hypothesis were false (i.e. a significance level of 0.05).
3. We will use a one-tailed test, as we are only interested in whether the average time is significantly lower.
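
To make notes 2 and 3 concrete, here is a minimal sketch of the rejection threshold they imply, assuming the test statistic is a standard normal z-score. The names `alpha` and `z_critical` are illustrative and do not appear elsewhere in this notebook.

```python
from statistics import NormalDist

alpha = 0.05  # significance level from note 2

# One-tailed (lower-tail) critical value: the hypothesis that the average
# time is lower is accepted only when the z-score falls below this value.
z_critical = NormalDist().inv_cdf(alpha)
print(round(z_critical, 3))
```

Under these assumptions the threshold is about -1.645, so a sample has to sit more than roughly 1.645 standard errors below the population mean to count as significant.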

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
runs = pd.read_csv("../input/runs.csv")
runs.head(10)

Age might be an important factor in a horse's speed. So let's calculate the average run times and the number of horses in each age category, and then choose a single age for our experiment, since age can be a lurking variable.

In [None]:
grp_age = runs[['horse_age', 'time1', 'time2', 'time3']].groupby('horse_age').mean()
grp_age['mean_time'] = (grp_age['time1'] + grp_age['time2'] + grp_age['time3'])/3
grp_age[['mean_time']].plot.bar()

In [None]:
uni_horse_age = runs[['horse_age', 'horse_id']].groupby('horse_age').horse_id.nunique()
uni_horse_age.plot.bar()

Choosing horses of age 3 is the better option, because it allows for more possible combinations and also gives us a larger sample size for the hypothesis test.

In [None]:
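# Keep only 3-year-old horses and the columns used below.
# .copy() avoids a SettingWithCopyWarning when mean_time is added later.
runs = runs[runs.horse_age == 3][['horse_country', 'horse_type', 'horse_gear', 'declared_weight', 'actual_weight',
                                  'time1', 'time2', 'time3', 'horse_age']].copy()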

In [None]:
runs['mean_time'] = (runs['time1'] + runs['time2'] + runs['time3']) / 3
sns.heatmap(runs[['mean_time', 'declared_weight', 'actual_weight']].corr())

In [None]:
runs[['mean_time', 'declared_weight', 'actual_weight']].corr()

An interesting thing to observe here is that declared_weight and mean_time are negatively correlated. This means that as the declared weight goes up, the average time tends to go down.

However, the correlations are not strong enough to suggest anything.

In [None]:
# Candidate grouping columns: 'horse_country', 'horse_type', 'horse_gear'
grp_country = runs[['horse_country','horse_type', 'mean_time']].dropna().groupby(['horse_country','horse_type']).mean()
grp_country[['mean_time']].sort_values('mean_time').head(15).plot.bar()

In [None]:
# Let's see how many observation points we have
runs[(runs.horse_country == 'SAF') & (runs.horse_type == 'Mare')]

It seems that Mares from SAF are the fastest ones. Let's see how statistically significant our observation is.
## Hypothesis: If all the horses are of type 'Mare' from 'SAF', then the average time around the track will decrease
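
Stated formally, as a sketch: assume $\mu_0$ and $\sigma$ are the mean and standard deviation of mean_time over all 3-year-old runs, and $\mu$ is the true mean time of Mares from SAF. These symbols are introduced here for illustration only.

$$H_0: \mu = \mu_0 \qquad H_1: \mu < \mu_0$$

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Here $\bar{x}$ is the mean time of the $n$ observed SAF Mare runs. With a one-tailed test at the 5% level, $H_0$ is rejected when $P(Z \le z) < 0.05$.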

In [None]:
runs.mean_time.describe()

In [None]:
runs.mean_time.std()

In [None]:
runs[(runs.horse_country == 'SAF') & (runs.horse_type == 'Mare')].mean_time.describe()

The population parameters are 

 - AVG : 22.634  
 - SD: 3.745

The sample values are

 - Sample Size : 15
 - Sample mean : 21.078
 - Standard Error : 0.967
 - Z-value : -1.61 (approx)
 - p-value : 0.0537
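
The sample values above can be reproduced with a short calculation. This is a minimal sketch that assumes a one-tailed z-test with the population standard deviation treated as known. The variable names are illustrative, and the numbers are copied from the lists above.

```python
from math import sqrt
from statistics import NormalDist

pop_mean = 22.634     # AVG of mean_time over all 3-year-old runs
pop_sd = 3.745        # SD of mean_time over all 3-year-old runs
n = 15                # number of SAF Mare runs
sample_mean = 21.078  # mean_time of the SAF Mare runs

# Standard error of the sample mean
standard_error = pop_sd / sqrt(n)

# z-score of the sample mean relative to the population mean
z_value = (sample_mean - pop_mean) / standard_error

# Lower-tail p-value: probability of a z-score at least this low under H0
p_value = NormalDist().cdf(z_value)

print(f"SE = {standard_error:.3f}, z = {z_value:.2f}, p = {p_value:.4f}")
```

The p-value comes out at roughly 0.054, in line with the figure listed above up to rounding of the z-value.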

Even though the result is not significant at the 5% level, we did end up pretty close (the p-value missed the threshold by 0.37 percentage points). So we can still say there is a good chance that betting on a Mare from South Africa (I assume SAF is South Africa) will lead to good returns.