# Sabrina del Rosal
### Capstone Project: Sprint 1
# Racehorses Risk of Injury Predictions 

#### Introduction #### 

Horse racing is a flourishing industry filled with fans who come to enjoy the races, place bets, or take part directly by owning racehorses. However, these horses undergo intense physical activity, and their performance can decline over time because of age, injury, and other factors. It is difficult to know exactly when and how a horse's chances of winning will begin to drop. A well-performing predictive model could give trainers valuable insights and help them avoid overworking their horses, which can cause performance decline and illness.

##### Big Idea #####
 
A machine learning model could use historical race data, biometric records of individual racehorses, track intensity and/or conditions, and jockey statistics to find patterns associated with stress or risk of injury. By highlighting early indicators of fatigue or poor performance, the model could become a crucial tool for racehorse trainers. The approach is similar to the injury-prevention models I have seen used for basketball and football players. Eventually it could help trainers judge when a horse needs a break from racing in order to prevent a serious injury that could lead to retirement.

First, I will build a model that predicts finishing positions, to see whether there are trends in the number of races a horse can feasibly do well in. Then biometric data will come into play, to see whether height, weight, age, and pedigree play significant roles in winning. We can also look at track and weather conditions to investigate further. If needed, I plan to start small by building a model on only the most recent year of data.
    

### Downloading Data ###

In [67]:
# import any necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [68]:
# show all dataframe columns
pd.set_option('display.max_columns', None)

# set matplotlib global settings eg. figsize
plt.rcParams['figure.figsize'] = (8.0, 6.0)

In [69]:
# import data
pre_racehorse_df = pd.read_csv("../data/UKRacehorse.csv")
horses_df = pd.read_csv("../data/horses_2000.csv")
races_df = pd.read_csv("../data/races_2000.csv")

In [70]:
# PRERACEHORSE DATA CHECK
pre_racehorse_df.head()

# this dataframe is not important to our model but it is good to keep referring to.

Unnamed: 0,course,countryCode,marketTime,title,runners,condition,prize,rclass,horseName,trainerName,jockeyName,RPRc,TRc,OR,weightSt,weightLb,age,decimalPrice
0,Limerick,0,2020-09-11 12:45:00+01:00,Irish Stallion Farms EBF Fillies Maiden (Plus ...,14,Yielding To Soft,13717.5,,All Down To Rosie,Conor O'Dwyer,Kevin Manning,,,,9,2,2,50.0
1,Limerick,0,2020-09-11 12:45:00+01:00,Irish Stallion Farms EBF Fillies Maiden (Plus ...,14,Yielding To Soft,13717.5,,Colfer Kay,K J Condon,W J Lee,79.0,70.0,,9,2,2,6.037778
2,Limerick,0,2020-09-11 12:45:00+01:00,Irish Stallion Farms EBF Fillies Maiden (Plus ...,14,Yielding To Soft,13717.5,,Dha Leath,Garvan Donnelly,J M Sheridan,,,,9,2,2,49.666667
3,Limerick,0,2020-09-11 12:45:00+01:00,Irish Stallion Farms EBF Fillies Maiden (Plus ...,14,Yielding To Soft,13717.5,,Ellabella,Andrew McNamara,Colin Keane,,,,9,2,2,17.944444
4,Limerick,0,2020-09-11 12:45:00+01:00,Irish Stallion Farms EBF Fillies Maiden (Plus ...,14,Yielding To Soft,13717.5,,Fermoy,Mrs John Harrington,Tom Madden,73.0,58.0,,9,2,2,17.594737


In [71]:
# HORSES DATASET

horses_df.head()

# this data set will need cleaning so that we can then merge it to our races dataframe -- merged df is what will be most important in our model

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,dist,weightSt,weightLb,overWeight,outHandicap,headGear,RPR,TR,OR,father,mother,gfather,runners,margin,weight,res_win,res_place
0,270318,Peggy Barry,7.0,14.0,0.090909,0,Paul Nolan,John Cullen,1,,,11,2,,,,105.0,,,Montelimar,Winterville,Rusticaro,15,1.617469,70,1.0,1
1,270318,Avondale Illusion,8.0,1.0,0.333333,1,W J Burke,Mr P Fenton,2,8.0,,12,0,,,,102.0,,,Satco,Tattered Illusion,Our Mirage,15,1.617469,76,0.0,1
2,270318,Chermesina,6.0,8.0,0.058824,0,Timothy Doyle,Tom Doyle,3,8.0,16.0,11,6,,,,89.0,,,Be My Native,Annabrook Lass,Laurence O,15,1.617469,72,0.0,1
3,270318,Banogue Lass,6.0,7.0,0.222222,0,Augustine Leahy,Mr R Flavin,4,0.75,16.75,11,2,,,,88.0,,,Good Thyne,Moorsville,Phardante,15,1.617469,70,0.0,0
4,270318,Marico,7.0,3.0,0.047619,0,W Power,Mr J G Sheehan,5,1.0,17.75,11,7,,,,92.0,,,Lord Americo,Gilt Course,Crash Course,15,1.617469,73,0.0,0


In [72]:
horses_df["outHandicap"].unique()

array([nan,  1.,  3.,  5.,  9.,  7.,  6., 10.,  4.,  2.,  8., 31., 20.,
       13., 11., 22., 16., 12., 15., 21., 18., 32., 19., 14., 17., 29.,
       25., 40., 60., 26., 44., 27., 23., 24., 34., 42., 28., 39., 30.,
       37., 72., 36., 47., 35., 46., 43., 41.])

In [73]:
# checking for nulls 

horses_df.isnull().sum()

# columns with nulls that we will need to look at cleaning: 'age', 'saddle', 'trainerName', 'jockeyName', 'positionL', 'dist', 'overWeight', 'outHandicap', 'headGear', 'RPR', 'TR', 'OR', 'father', 'mother', 'gfather'

rid                  0
horseName            0
age                  7
saddle              65
decimalPrice         0
isFav                0
trainerName         75
jockeyName           1
position             0
positionL        18028
dist             27154
weightSt             0
weightLb             0
overWeight      102300
outHandicap     101391
headGear         87966
RPR              16381
TR               47591
OR               36843
father              34
mother              77
gfather            235
runners              0
margin               0
weight               0
res_win              0
res_place            0
dtype: int64
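
Raw null counts are easier to compare as a share of rows. A minimal sketch on a made-up toy frame (`toy_df` and its values are illustrative, not the real `horses_df`):

```python
import pandas as pd

# Toy frame standing in for horses_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "age": [7.0, 8.0, None, 6.0],
    "overWeight": [None, None, None, 2.0],
    "position": [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing values per column;
# multiply by 100 and sort so the emptiest columns come first
null_share = toy_df.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(null_share)
```

Columns that are almost entirely empty are usually better candidates for dropping than for imputing.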

In [74]:
# RACES DATASET

races_df.head()

# will be important as well for our model -- merging to horses dataset

Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
0,270318,Tramore (IRE),03:45,00/01/01,Radley Engineering I.N.H. Flat Race,,,6yo+,2m,Soft To Heavy,,[],281.3,,3218.0,IE,12,0
1,344106,Tramore (IRE),12:45,00/01/01,Mean Fiddler Handicap Chase,,0-102,5yo+,2m6f,Soft To Heavy,15 fences,[],364.6,,4424.0,IE,12,0
2,234338,Tramore (IRE),03:15,00/01/01,Kent Brothers Handicap Hurdle,,0-95,4yo+,2m,Soft To Heavy,10 hurdles,[],279.5,,3218.0,IE,12,0
3,262922,Tramore (IRE),01:15,00/01/01,T.J.Carroll Chase,,,5yo+,2m4f,Soft To Heavy,13 fences,[],332.7,,4022.0,IE,12,0
4,31042,Tramore (IRE),02:15,00/01/01,David Flynn Construction Maiden Hurdle,,,5yo,2m4f,Soft To Heavy,11 hurdles,[],296.5,,4022.0,IE,12,0


In [75]:
# checking for nulls

races_df.isnull().sum()

# look into cleaning 'rclass' 'band' 'condition' 'hurdles' 'prize'

rid               0
course            0
time              0
date              0
title             0
rclass         2167
band           5161
ages              0
distance          0
condition         1
hurdles        5856
prizes            0
winningTime       0
prize          1922
metric            0
countryCode       0
ncond             0
class             0
dtype: int64

## Cleaning Data ##

### Cleaning Data Action Plan:

horses_df cleaning:
1. drop unnecessary columns
2. normalize numerical values
3. fill missing values

races_df cleaning:
1. drop unnecessary columns
2. normalize numerical values
3. convert categorical variables

### horses_df cleaning: copy dataframe


In [76]:
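# keep an untouched copy of the original dataframe, then clean horses_df
# (copy() returns a new dataframe, so the result has to be assigned to a name)

horses_raw_df = horses_df.copy()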

horses_df.head(2)

Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,dist,weightSt,weightLb,overWeight,outHandicap,headGear,RPR,TR,OR,father,mother,gfather,runners,margin,weight,res_win,res_place
0,270318,Peggy Barry,7.0,14.0,0.090909,0,Paul Nolan,John Cullen,1,,,11,2,,,,105.0,,,Montelimar,Winterville,Rusticaro,15,1.617469,70,1.0,1
1,270318,Avondale Illusion,8.0,1.0,0.333333,1,W J Burke,Mr P Fenton,2,8.0,,12,0,,,,102.0,,,Satco,Tattered Illusion,Our Mirage,15,1.617469,76,0.0,1


#### horses_df cleaning : dropping columns

columns to drop:
- positionL (we will use dist, as we are focusing on the end result: wins)
- weightSt & weightLb (the same measurement as weight (kg), just in different units, all used for handicapping)
- overWeight (not necessary for our models)
- outHandicap (not necessary for our models)
- father, mother, gfather (for now we can exclude these)
- margin (betting related; prices)
- trainerName (not important)
- jockeyName (not important)
- headGear (not important)
- isFav (betting related)
- decimalPrice (betting related)
- runners (knowing how many other horses are racing does not affect an individual horse's risk of injury)
- RPR (the RP rating is opinion based, so we will prioritize the official rating)

column to keep:
- **horseName** (originally marked as not needed for our prediction models, but we actually need it as a unique ID and to track race frequency)
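
Keeping `horseName` makes it possible to track how often each horse races. A minimal sketch of that idea, assuming `horseName` uniquely identifies a horse and that a race `date` is available for each run (it comes from `races_df` after merging); the toy names, dates, and new column names are illustrative only:

```python
import pandas as pd

# Toy runs: one row per horse per race (made-up values)
toy_runs = pd.DataFrame({
    "horseName": ["Horse A", "Horse B", "Horse A", "Horse A", "Horse B"],
    "date": pd.to_datetime(
        ["2000-01-01", "2000-01-03", "2000-01-15", "2000-02-01", "2000-03-03"]
    ),
})

# Order each horse's runs chronologically
toy_runs = toy_runs.sort_values(["horseName", "date"])

# Running count of races per horse (1 = first recorded race)
toy_runs["race_number"] = toy_runs.groupby("horseName").cumcount() + 1

# Rest period in days since the horse's previous race (NaN for the first race)
toy_runs["days_since_last_race"] = (
    toy_runs.groupby("horseName")["date"].diff().dt.days
)
print(toy_runs)
```

Two different horses can share a name across years, so this grouping key is only as reliable as that assumption.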


In [77]:
# drop unnecessary columns first

horses_df = horses_df.drop(columns = ['positionL' , 'weightSt' , 'weightLb' , 'overWeight' , 'outHandicap' , 'father' , 'mother' , 'gfather' , 
                                      'margin' , 'trainerName' , 'jockeyName' , 'headGear' , 'isFav' , 'decimalPrice' , 'runners' , 'RPR'])

In [78]:
# sanity check

horses_df.head(2)

Unnamed: 0,rid,horseName,age,saddle,position,dist,TR,OR,weight,res_win,res_place
0,270318,Peggy Barry,7.0,14.0,1,,,,70,1.0,1
1,270318,Avondale Illusion,8.0,1.0,2,,,,76,0.0,1


#### horses_df cleaning: normalize numerical columns

1. change 'age' into an integer
2. 'dist' should be numerical data
3. OR values should all be integers
4. fix position 40 to be 0 (did not finish)

In [79]:
# convert age into an integer and fill in nulls with median (this makes sense as racehorses tend to be of similar age)

horses_df['age'] = horses_df['age'].fillna(horses_df['age'].median()).astype(int)

In [80]:
# A horse earns its official rating when it has won a race or placed in the top six on three separate occasions
# so if OR is NaN we will replace with 0 as it has not won a race or placed in top six

horses_df['OR'] = horses_df['OR'].fillna(0).astype(int)

In [81]:
# a position of 40 means the horse did not finish; we recode it as 0 so we can see where potential injuries occurred

horses_df['position'] = horses_df['position'].replace(40, 0) 
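
Recoding 40 as 0 puts "did not finish" numerically ahead of first place, which a model could misread as the best result. One possible safeguard, sketched on a toy frame, is to record an explicit flag before recoding (`did_not_finish` and `DNF_CODE` are hypothetical names introduced here):

```python
import pandas as pd

# Toy finishing positions; 40 is the "did not finish" code described above
toy_df = pd.DataFrame({"position": [1, 2, 40, 5, 40]})

DNF_CODE = 40  # hypothetical constant name for the did-not-finish code

# Record the flag first, while the original code is still present
toy_df["did_not_finish"] = toy_df["position"].eq(DNF_CODE)

# Then apply the same recoding used in the notebook
toy_df["position"] = toy_df["position"].replace(DNF_CODE, 0)
print(toy_df)
```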

In [82]:
# check for missing values

horses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105500 entries, 0 to 105499
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   rid        105500 non-null  int64  
 1   horseName  105500 non-null  object 
 2   age        105500 non-null  int64  
 3   saddle     105435 non-null  float64
 4   position   105500 non-null  int64  
 5   dist       78346 non-null   object 
 6   TR         57909 non-null   float64
 7   OR         105500 non-null  int64  
 8   weight     105500 non-null  int64  
 9   res_win    105500 non-null  float64
 10  res_place  105500 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 8.9+ MB


#### horses_df cleaning: fill missing values

1. saddle
2. dist
3. TR (topspeed)

In [83]:
# saddle: we can either drop the null rows or fill them with the mode -- not sure which will be better for predicting. I will drop the null rows for now, as imputing a starting position might skew the data.

horses_df = horses_df.dropna(subset=['saddle'])

In [84]:
# not sure what to do with dist yet (how far a horse finished behind the winner, in horse lengths), so drop it for now

horses_df = horses_df.drop(columns=['dist'])
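
If `dist` is revisited later, one option is to coerce it to numbers instead of dropping it. This is only a sketch: `dist` is stored as `object`, and the exact non-numeric entries in the real column are not shown here, so the `"n/a"` value below is a made-up placeholder.

```python
import pandas as pd

# Toy stand-in for the 'dist' column (strings, a null, and a made-up bad value)
toy_dist = pd.Series(["16.0", "16.75", None, "n/a"], name="dist")

# errors="coerce" turns anything that is not a number into NaN
dist_num = pd.to_numeric(toy_dist, errors="coerce")

# Entries that were present but could not be parsed deserve a manual look
unparsed = toy_dist.notna() & dist_num.isna()
print(dist_num)
print("unparsed entries:", int(unparsed.sum()))
```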

In [85]:
# TR (topspeed): fill nulls with the mean, as it is usually normally distributed **may switch to the median later**

horses_df['TR'] = horses_df['TR'].fillna(horses_df['TR'].mean()).astype(int)
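
Roughly 45% of TR values are null (47,591 of 105,500 rows in the null counts above), so the choice of fill value affects nearly half the column. A minimal sketch of how mean and median could be compared before deciding, on made-up values (`TR_missing` is a hypothetical column name):

```python
import pandas as pd

# Toy TR values with nulls and one high outlier (made-up numbers)
toy_tr = pd.DataFrame({"TR": [60.0, 62.0, 65.0, None, 70.0, 120.0, None]})

# A large gap between mean and median, or a skew far from 0,
# suggests the median is the safer fill value
summary = {
    "mean": toy_tr["TR"].mean(),
    "median": toy_tr["TR"].median(),
    "skew": toy_tr["TR"].skew(),
}
print(summary)

# Remember which rows were imputed, then fill with the median
toy_tr["TR_missing"] = toy_tr["TR"].isna()
toy_tr["TR"] = toy_tr["TR"].fillna(toy_tr["TR"].median())
```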

In [86]:
# sanity check 

horses_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 105435 entries, 0 to 105499
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   rid        105435 non-null  int64  
 1   horseName  105435 non-null  object 
 2   age        105435 non-null  int64  
 3   saddle     105435 non-null  float64
 4   position   105435 non-null  int64  
 5   TR         105435 non-null  int64  
 6   OR         105435 non-null  int64  
 7   weight     105435 non-null  int64  
 8   res_win    105435 non-null  float64
 9   res_place  105435 non-null  int64  
dtypes: float64(2), int64(7), object(1)
memory usage: 8.8+ MB


### races_df cleaning : copy dataframe

In [87]:
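# keep an untouched copy of the original dataframe, then clean races_df
# (copy() returns a new dataframe, so the result has to be assigned to a name)

races_raw_df = races_df.copy()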

races_df.head(2)

Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
0,270318,Tramore (IRE),03:45,00/01/01,Radley Engineering I.N.H. Flat Race,,,6yo+,2m,Soft To Heavy,,[],281.3,,3218.0,IE,12,0
1,344106,Tramore (IRE),12:45,00/01/01,Mean Fiddler Handicap Chase,,0-102,5yo+,2m6f,Soft To Heavy,15 fences,[],364.6,,4424.0,IE,12,0


In [88]:
# creating a small dataframe to see each condition alongside its condition code

# unique matched pairs of 'ncond' and 'condition'
unique_conditions = races_df[['ncond', 'condition']].drop_duplicates()

# sort it 
unique_conditions = unique_conditions.sort_values(by='ncond')

unique_conditions

Unnamed: 0,ncond,condition
9,0,Standard
4482,0,
107,1,Good
577,2,Good To Firm
2168,3,Very Soft
1655,4,Good To Yielding
7,5,Soft
495,6,Yielding
457,7,Fast
911,8,Firm


#### races_df cleaning : dropping columns

columns to drop: 

- time (exact time of day isn't necessary)
- title (not needed)
- rclass (class is its numerical version)
- band (class gives the overall interpretation)
- prizes (bet related)
- prize (bet related)
- countryCode (already in UK)
- course (not necessary because we have the track conditions, etc.)
- ages (we have the age of horses in horses_df)
- condition (for the model we need a numerical value, which is ncond)
- distance (because we have 'metric', which is the distance in meters)

In [89]:
# drop unnecessary columns first

# ncond is the numerical version of condition for models
# class is numerical version of rclass for models

races_df = races_df.drop(columns = ['time' , 'title' , 'rclass' , 'band' , 'prizes' , 'prize' , 'countryCode' , 'course' , 'ages' , 'condition' , 'distance'])


#### races_df cleaning: normalize numerical columns

1. fix date (datetime format)
2. change hurdles and fences into two separate columns to make them numerical


In [90]:
# fixing datetime column

races_df['date'] = pd.to_datetime(races_df['date'], format='%y/%m/%d')
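
The raw dates use a two-digit year, so it is worth confirming how `%y` is interpreted. A quick check on toy strings in the same `yy/mm/dd` layout as the data:

```python
import pandas as pd

# Toy date strings in the same yy/mm/dd layout as races_df['date']
toy_dates = pd.Series(["00/01/01", "00/12/31"])

# %y is a two-digit year; '00' is read as the year 2000
parsed = pd.to_datetime(toy_dates, format="%y/%m/%d")
print(parsed)
```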

In [91]:
# fix hurdles to numerical values: split it into hurdles column and fences column and only grab number -- unknown will be 0

# extract numbers for 'fences' and 'hurdles' separately
races_df['fences'] = races_df['hurdles'].str.extract(r'(\d+)\s*fences').fillna(0).astype(int)
races_df['hurdles'] = races_df['hurdles'].str.extract(r'(\d+)\s*hurdles').fillna(0).astype(int)
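
The order of the two lines above matters: `fences` has to be extracted before `hurdles` is overwritten. The same extraction can be sketched on a toy Series to see what each pattern captures; this variant uses `expand=False` and `pd.to_numeric`, a slightly different route from the cell above rather than a replacement for it:

```python
import pandas as pd

# Toy values in the same style as the raw 'hurdles' column
toy_hurdles = pd.Series(["15 fences", "10 hurdles", None, "11 hurdles"])

# expand=False returns a Series; rows that do not match the pattern become NaN
fences_raw = toy_hurdles.str.extract(r"(\d+)\s*fences", expand=False)
hurdles_raw = toy_hurdles.str.extract(r"(\d+)\s*hurdles", expand=False)

# Convert the captured digits to numbers and treat "no match" as 0
fences = pd.to_numeric(fences_raw).fillna(0).astype(int)
hurdles = pd.to_numeric(hurdles_raw).fillna(0).astype(int)
print(pd.DataFrame({"fences": fences, "hurdles": hurdles}))
```

With this encoding a 0 means either a flat race or an unknown obstacle count, so the two cases cannot be told apart afterwards.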

In [92]:
# sanity check

races_df.info()

# no more missing values!!! looks clean

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9585 entries, 0 to 9584
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   rid          9585 non-null   int64         
 1   date         9585 non-null   datetime64[ns]
 2   hurdles      9585 non-null   int64         
 3   winningTime  9585 non-null   float64       
 4   metric       9585 non-null   float64       
 5   ncond        9585 non-null   int64         
 6   class        9585 non-null   int64         
 7   fences       9585 non-null   int64         
dtypes: datetime64[ns](1), float64(2), int64(5)
memory usage: 599.2 KB


### merging datasets 

In [93]:
# merging the datasets

races_horses_df = pd.merge(horses_df, races_df, on='rid', how='left')
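
A left merge silently keeps horse rows whose `rid` has no matching race and would duplicate rows if `rid` were repeated in `races_df`. A minimal sketch of how `validate` and `indicator` can guard against both, on toy frames with made-up values:

```python
import pandas as pd

# Toy frames: several runners per race, one row per race
toy_horses = pd.DataFrame({"rid": [1, 1, 2, 3], "horseName": ["A", "B", "C", "D"]})
toy_races = pd.DataFrame({"rid": [1, 2], "metric": [3218.0, 4424.0]})

merged = pd.merge(
    toy_horses,
    toy_races,
    on="rid",
    how="left",
    validate="many_to_one",  # raises if 'rid' is duplicated in toy_races
    indicator=True,          # adds a '_merge' column showing where each row matched
)

# Runners whose race was not found on the right-hand side
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```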

In [94]:
# sanity check

races_horses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105435 entries, 0 to 105434
Data columns (total 17 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   rid          105435 non-null  int64         
 1   horseName    105435 non-null  object        
 2   age          105435 non-null  int64         
 3   saddle       105435 non-null  float64       
 4   position     105435 non-null  int64         
 5   TR           105435 non-null  int64         
 6   OR           105435 non-null  int64         
 7   weight       105435 non-null  int64         
 8   res_win      105435 non-null  float64       
 9   res_place    105435 non-null  int64         
 10  date         105435 non-null  datetime64[ns]
 11  hurdles      105435 non-null  int64         
 12  winningTime  105435 non-null  float64       
 13  metric       105435 non-null  float64       
 14  ncond        105435 non-null  int64         
 15  class        105435 non-null  int6

In [95]:
# datetimes caused problems when saving to CSV, so store the date as a formatted string

races_horses_df['date'] = races_horses_df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')

In [96]:
# save data

races_horses_df.to_csv("races_horses_df.csv", index=False)
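
CSV files do not store dtypes, so the date column comes back as plain text unless it is parsed on load. A minimal sketch of the round trip for the EDA notebook, using an in-memory buffer instead of a real file (the toy row is illustrative):

```python
import io

import pandas as pd

# Toy frame with a datetime column, standing in for races_horses_df
toy_df = pd.DataFrame({"rid": [270318], "date": pd.to_datetime(["2000-01-01"])})

# Write to an in-memory buffer (a file path works the same way)
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the column back to datetime when loading
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```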


Now it's time for EDA! Let's go to a new notebook for this....