### Finding time intervals by distance for API data

- The duration of a race is not known in advance, although it is fairly predictable from the race distance.
- For in-play strategies it may be important to act at a given point during a race.
- The strategy needs to generalise to races with varying race times.
- Therefore a consistent approach has been applied to divide races into 'equal thirds'.

In [1]:
# packages
from sqlalchemy import create_engine
import pymysql
import pandas as pd
import numpy as np
import re 
import json
from pathlib import Path, PurePath
import pprint as pp

# configs
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
%matplotlib inline

In [2]:
# loading in sql login credentials

project_dir = Path.cwd().parents[1]
logins_dir = project_dir / 'sql_logins.json'

with open(logins_dir) as f:
    login_dict =  json.load(f)

In [10]:
db_connection_str = f"mysql+pymysql://{login_dict['UID']}:{login_dict['PWD']}@localhost/{login_dict['DB']}"
db_connection = create_engine(db_connection_str)

data = pd.read_sql('''
                 SELECT
                  race_id,
                  distance_yards,
                  race_type,
                  winning_time_secs
                 FROM
                  historic_races
                 ORDER BY
                  race_id
                ''',
                con=db_connection)
# db_connection.dispose()  # a SQLAlchemy Engine is released with dispose(), not close()
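
If the password contains characters such as `@` or `/`, a connection URL built with a plain f-string can be parsed incorrectly. The following is a minimal sketch, using only the standard library, of escaping the credentials first. `build_connection_str` is a hypothetical helper and the credentials are made-up placeholders; only the `UID`, `PWD` and `DB` keys come from the notebook.

```python
from urllib.parse import quote_plus

def build_connection_str(login_dict, host="localhost"):
    """Build a mysql+pymysql URL with the user name and password URL-escaped."""
    user = quote_plus(login_dict["UID"])
    password = quote_plus(login_dict["PWD"])
    return f"mysql+pymysql://{user}:{password}@{host}/{login_dict['DB']}"

# Placeholder credentials for illustration only (not real logins)
example_logins = {"UID": "racing_user", "PWD": "p@ss/word", "DB": "racing"}
example_url = build_connection_str(example_logins)
print(example_url)
```

The resulting string could then be passed to `create_engine` in place of the f-string.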

In [67]:
df = data.copy() # temp : to save having to run query (remove to save memory)
print("Races :", len(df.index))

Races : 204787


In [68]:
df.head()

| | race_id | distance_yards | race_type | winning_time_secs |
|---|---|---|---|---|
| 0 | -1 | 1320.0 | Flat | 0.0 |
| 1 | 11426 | 3520.0 | Hurdle | 272.7 |
| 2 | 11427 | 4400.0 | Hurdle | 342.5 |
| 3 | 11428 | 3520.0 | Hurdle | 295.3 |
| 4 | 11429 | 4400.0 | Chase | 326.1 |


In [69]:
# keep plausible winning times only (drops placeholder rows such as 0.0 secs)
df = df.loc[(df['winning_time_secs'] > 1) & (df['winning_time_secs'] < 1200)].copy()

In [72]:
# 220 yards = 1 furlong; round to the nearest whole furlong
df['distance_furlongs'] = (df['distance_yards'] / 220).round().astype(int)
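
One detail of this conversion is worth checking by hand. Python's built-in `round`, like the NumPy rounding that pandas uses, rounds exact halves to the nearest even number, so half-furlong distances do not always round up. A small standard-library sketch of this; `yards_to_furlongs` is a hypothetical helper, not part of the notebook:

```python
YARDS_PER_FURLONG = 220

def yards_to_furlongs(distance_yards):
    """Round a distance in yards to the nearest whole furlong."""
    return int(round(distance_yards / YARDS_PER_FURLONG))

whole = yards_to_furlongs(1320)      # exactly 6 furlongs
half_low = yards_to_furlongs(1210)   # 5.5 furlongs, rounds to the even value 6
half_high = yards_to_furlongs(1430)  # 6.5 furlongs, also rounds to the even value 6
print(whole, half_low, half_high)
```

Races at 5.5 and 6.5 furlongs therefore both land in the 6 furlong group under this rounding.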

In [73]:
df = df.loc[df['race_type'] != 'Point to Point'].copy()  # copy so later column assignments do not act on a view

In [74]:
df['race_type'].value_counts()

Flat                  79065
Hurdle                44795
All Weather Flat      40191
Chase                 30978
National Hunt Flat     8055
Name: race_type, dtype: int64

In [75]:
race_type_dict = {'National Hunt Flat': 'NHF',
                  'Flat' : 'Flat',
                  'All Weather Flat': 'Flat',
                  'Chase' : 'Chase',
                  'Hurdle' : 'Hurdle'}
df['race_type'] = df['race_type'].map(race_type_dict)
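
`Series.map` with a dictionary returns `NaN` for any value that is not a key, and `value_counts()` drops `NaN` by default, so an unexpected race type would vanish silently. A short sketch of checking for this on toy data; the `'Unknown Type'` label is invented for illustration:

```python
import pandas as pd

race_type_dict = {'National Hunt Flat': 'NHF',
                  'Flat': 'Flat',
                  'All Weather Flat': 'Flat',
                  'Chase': 'Chase',
                  'Hurdle': 'Hurdle'}

# Toy series; 'Unknown Type' is an invented label that is not in the mapping
race_types = pd.Series(['Flat', 'All Weather Flat', 'Chase', 'Unknown Type'])
mapped = race_types.map(race_type_dict)

# Count labels that fell through the mapping and became NaN
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```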

In [76]:
df['race_type'].value_counts()

Flat      119256
Hurdle     44795
Chase      30978
NHF         8055
Name: race_type, dtype: int64

A consistent way of splitting race times was found using the 5th percentile (the 0.05 quantile, q5) of winning times.

__Why do it this way?__

So that races which finish quickly still fill the t_3 bin, the cut-offs were chosen as follows:

- (q5 of race times by distance and race type) = t_3

- ((q5 of race times by distance and race type) / 3) * 2 = t_2

- (q5 of race times by distance and race type) / 3 = t_1

Race times can then be categorised in future by looking up the distance and race type and applying the given number of seconds to create t_1, t_2 and t_3.
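
The arithmetic can be sketched as below. `thirds_from_q5` is a hypothetical helper that follows the rounding used in the notebook's own code cell, where t_2 is twice the rounded t_1. The inputs are the already-rounded t_3 values from the results table, not raw q5 values.

```python
def thirds_from_q5(q5_secs):
    """Return (t_1, t_2, t_3) in whole seconds from a 5th-percentile time."""
    t_3 = round(q5_secs)
    t_1 = round(t_3 / 3)
    t_2 = t_1 * 2  # twice the rounded t_1, so it can differ by 1s from round(2 * t_3 / 3)
    return t_1, t_2, t_3

flat_5f = thirds_from_q5(58.0)     # 5 furlong Flat row
hurdle_8f = thirds_from_q5(224.0)  # 8 furlong Hurdle row
print(flat_5f, hurdle_8f)
```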

__NOTE:__

- t_1 = times < t_1 secs
- t_2 = t_1 <= times <= t_2 secs
- t_3 = t_2 < times <= t_3 secs
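
A minimal sketch of labelling an elapsed time with these bins. It assumes the third bin covers times after t_2 up to and including the t_3 cut-off, and that anything later is left unlabelled; both are assumptions rather than something the notebook states. `time_bin` is a hypothetical helper.

```python
def time_bin(elapsed_secs, t_1, t_2, t_3):
    """Label an elapsed race time as 't_1', 't_2' or 't_3' (None past t_3)."""
    if elapsed_secs < t_1:
        return 't_1'
    if elapsed_secs <= t_2:
        return 't_2'
    if elapsed_secs <= t_3:
        return 't_3'
    return None  # assumption: times beyond the t_3 cut-off are left unlabelled

# Cut-offs for a 5 furlong Flat race: t_1=19, t_2=38, t_3=58
labels = [time_bin(s, 19, 38, 58) for s in (10, 19, 38, 39, 58, 60)]
print(labels)
```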


In [77]:
# 5th percentile of winning time per (distance, race type) is the t_3 cut-off
df_times = df.groupby(['distance_furlongs', 'race_type'])['winning_time_secs'].quantile(0.05).reset_index()
df_times = df_times.rename(columns={'winning_time_secs': 't_3'})
df_times['t_3'] = df_times['t_3'].round()
df_times['t_1'] = (df_times['t_3'] / 3).round()
df_times['t_2'] = df_times['t_1'] * 2  # twice the rounded t_1
df_times = df_times[['distance_furlongs', 'race_type', 't_1', 't_2', 't_3']]  # rearrange columns

In [78]:
df_times

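| | distance_furlongs | race_type | t_1 | t_2 | t_3 |
|---|---|---|---|---|---|
| 0 | 5 | Flat | 19.0 | 38.0 | 58.0 |
| 1 | 6 | Chase | 98.0 | 196.0 | 293.0 |
| 2 | 6 | Flat | 23.0 | 46.0 | 70.0 |
| 3 | 6 | Hurdle | 93.0 | 186.0 | 279.0 |
| 4 | 7 | Flat | 28.0 | 56.0 | 83.0 |
| 5 | 7 | Hurdle | 98.0 | 196.0 | 295.0 |
| 6 | 8 | Chase | 96.0 | 192.0 | 287.0 |
| 7 | 8 | Flat | 32.0 | 64.0 | 96.0 |
| 8 | 8 | Hurdle | 75.0 | 150.0 | 224.0 |
| 9 | 8 | NHF | 73.0 | 146.0 | 219.0 |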


In [98]:
df_times.to_csv('inplay_bins.csv')
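
Because `to_csv` writes the index by default, the saved file has an unnamed first column. A rough sketch of reading the file back into a lookup keyed by distance and race type, using only the standard library; `load_bins` is a hypothetical helper, and an in-memory string with two rows from the table stands in for the real file:

```python
import csv
import io

# Stand-in for open('inplay_bins.csv'); the first column is the unnamed index
sample_csv = io.StringIO(
    ",distance_furlongs,race_type,t_1,t_2,t_3\n"
    "0,5,Flat,19.0,38.0,58.0\n"
    "1,6,Chase,98.0,196.0,293.0\n"
)

def load_bins(file_obj):
    """Return {(distance_furlongs, race_type): (t_1, t_2, t_3)} from the CSV."""
    bins = {}
    for row in csv.DictReader(file_obj):
        key = (int(row["distance_furlongs"]), row["race_type"])
        bins[key] = tuple(float(row[c]) for c in ("t_1", "t_2", "t_3"))
    return bins

bins = load_bins(sample_csv)
print(bins[(5, "Flat")])
```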

In [93]:
# mydict = df_times.set_index(['distance_furlongs', 'race_type']).to_dict('index')

In [95]:
# pp.pprint(mydict)

In [96]:
# import json
# NOTE: tuple keys must be converted to strings before json.dump will accept them
# with open('result.json', 'w') as fp:
#     json.dump(mydict, fp)
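
A dictionary keyed by `(distance_furlongs, race_type)` tuples cannot be passed to `json.dump` directly, because JSON object keys must be strings. One possible workaround, sketched here on two rows from the table, is to nest by race type and then by distance; the nesting layout is an assumption, not the notebook's chosen format.

```python
import json

# Two rows in the shape produced by set_index([...]).to_dict('index')
mydict = {(5, 'Flat'): {'t_1': 19.0, 't_2': 38.0, 't_3': 58.0},
          (6, 'Chase'): {'t_1': 98.0, 't_2': 196.0, 't_3': 293.0}}

# Nest by race type, then by distance; JSON object keys must be strings
nested = {}
for (distance, race_type), cutoffs in mydict.items():
    nested.setdefault(race_type, {})[str(distance)] = cutoffs

encoded = json.dumps(nested, sort_keys=True)
decoded = json.loads(encoded)
print(decoded['Flat']['5']['t_3'])
```

When reading the file back, the distance keys are strings (`'5'`, not `5`), so lookups need to convert accordingly.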