### Finding time intervals by distance for API data

- The duration of a race is unknown (yet somewhat predictable by distance). 
- For inplay strategies it may be important to act at some given time point during a race. 
- This strategy will be generalised to races with varying race times. 
- Therefore, a consistent approach has been applied to divide races into 'equal thirds'.

In [1]:
# packages
from sqlalchemy import create_engine
import pymysql
import pandas as pd
import numpy as np
import re 
import json
from pathlib import Path, PurePath
import pprint as pp

# configs
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
%matplotlib inline

In [2]:
# loading in sql login credentials

project_dir = Path.cwd().parents[1]
logins_dir = project_dir / 'sql_logins.json'

with open(logins_dir) as f:
    login_dict =  json.load(f)

In [8]:
db_connection_str = f"mysql+pymysql://{login_dict['UID']}:{login_dict['PWD']}@localhost/{login_dict['DB']}"
db_connection = create_engine(db_connection_str)

data = pd.read_sql('''
                 SELECT
                  race_id,
                  distance_yards,
                  winning_time_secs
                 FROM
                  historic_races
                 ORDER BY
                  race_id
                ''',
                con=db_connection)
# db_connection.dispose()  # an Engine is released with dispose(); it has no close()

In [9]:
df = data.copy() # temp : to save having to run query (remove to save memory)
print("Races :", len(df.index))

Races : 204787


In [10]:
df.head()

Unnamed: 0,race_id,distance_yards,winning_time_secs
0,-1,1320.0,0.0
1,11426,3520.0,272.7
2,11427,4400.0,342.5
3,11428,3520.0,295.3
4,11429,4400.0,326.1


In [11]:
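# keep rows with a plausible winning time (more than 1 sec, under 20 mins);
# .copy() so that adding columns later does not act on a slice of the original frame
df = df.loc[(df['winning_time_secs'] > 1) & (df['winning_time_secs'] < 1200)].copy()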

In [12]:
df['distance_furlongs'] = round(df['distance_yards'] / 220)

A consistent way of splitting race times was found using the 5th percentile (0.05 quantile) of winning times at each distance.

__Why do it this way?__

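To ensure every bin gets filled: if t_3 were set too late, races that finish quickly would never reach it. Setting t_3 at the 5th percentile means roughly 95% of races at a given distance last at least that long. The cut-offs were therefore chosen by: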

- (q5 of race times by distance) = t_3

- ((q5 of race times by distance) / 3) * 2 = t_2

- (q5 of race times by distance) / 3 = t_1
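As a worked check of the arithmetic above, here is a minimal sketch that derives the three cut-offs from a single 5th-percentile value. The function name `cutoffs_from_q5` is hypothetical. The rounding order is an assumption taken from the notebook's own calculation cell: t_1 is rounded first and t_2 is twice that rounded value, so t_2 can differ slightly from an exact two thirds of t_3.

```python
def cutoffs_from_q5(q5_secs):
    """Return (t_1, t_2, t_3) in whole seconds from a 5th-percentile race time."""
    t_3 = round(q5_secs)   # final cut-off: the rounded 5th percentile
    t_1 = round(t_3 / 3)   # first cut-off: one third of t_3, rounded
    t_2 = t_1 * 2          # second cut-off: twice the rounded t_1
    return t_1, t_2, t_3


# A 5th-percentile time of 58 secs gives the 5-furlong cut-offs printed by the notebook
print(cutoffs_from_q5(58.0))  # (19, 38, 58)
```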

The race times can then be categorised in future by looking up distance and applying the given number of seconds to create t_1, t_2 & t_3. 

__NOTE:__

- segment t_1 : time < t_1 secs
- segment t_2 : t_1 <= time <= t_2
- segment t_3 : time >= t_3
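The lookup and the rules in the note can be sketched as a small helper. This is an illustration only: `CUTOFFS` holds just the 5 and 6 furlong values from the notebook's printed output, and `race_segment` is a hypothetical name. The rules as written do not cover times that fall strictly between t_2 and t_3, so the sketch returns `None` there rather than guessing which segment was intended.

```python
# Excerpt of the lookup table: furlongs -> cut-offs in seconds
CUTOFFS = {
    5.0: {"t_1": 19.0, "t_2": 38.0, "t_3": 58.0},
    6.0: {"t_1": 23.0, "t_2": 46.0, "t_3": 70.0},
}


def race_segment(distance_yards, elapsed_secs, cutoffs=CUTOFFS):
    """Label an elapsed in-race time as 't_1', 't_2', 't_3' or None."""
    furlongs = float(round(distance_yards / 220))  # 220 yards per furlong
    c = cutoffs[furlongs]
    if elapsed_secs < c["t_1"]:
        return "t_1"
    if elapsed_secs <= c["t_2"]:
        return "t_2"
    if elapsed_secs >= c["t_3"]:
        return "t_3"
    return None  # between t_2 and t_3: not covered by the stated rules


print(race_segment(1100, 10))  # 5 furlongs, 10 secs in
```

A distance that is missing from the table raises `KeyError`, which is probably the safer behaviour for an in-play strategy than silently falling back to a default.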


In [13]:
# t_3 : 5th percentile of winning time for each furlong distance
df_times = df.groupby(['distance_furlongs'])['winning_time_secs'].quantile(0.05).reset_index()
df_times.rename(columns = {'winning_time_secs' : 't_3'}, inplace = True)
df_times['t_3'] = round(df_times['t_3'])
# t_1 : one third of t_3 (rounded); t_2 : twice the rounded t_1
df_times['t_1'] = round(df_times['t_3'] / 3)
df_times['t_2'] = round(df_times['t_3'] / 3) * 2
df_times = df_times[['distance_furlongs', 't_1', 't_2', 't_3']] # rearrange columns

In [14]:
d = df_times.set_index('distance_furlongs').to_dict('index')
pp.pprint(d)

{5.0: {'t_1': 19.0, 't_2': 38.0, 't_3': 58.0},
 6.0: {'t_1': 23.0, 't_2': 46.0, 't_3': 70.0},
 7.0: {'t_1': 28.0, 't_2': 56.0, 't_3': 83.0},
 8.0: {'t_1': 32.0, 't_2': 64.0, 't_3': 96.0},
 9.0: {'t_1': 36.0, 't_2': 72.0, 't_3': 108.0},
 10.0: {'t_1': 41.0, 't_2': 82.0, 't_3': 124.0},
 11.0: {'t_1': 45.0, 't_2': 90.0, 't_3': 134.0},
 12.0: {'t_1': 50.0, 't_2': 100.0, 't_3': 150.0},
 13.0: {'t_1': 54.0, 't_2': 108.0, 't_3': 163.0},
 14.0: {'t_1': 60.0, 't_2': 120.0, 't_3': 179.0},
 15.0: {'t_1': 62.0, 't_2': 124.0, 't_3': 185.0},
 16.0: {'t_1': 71.0, 't_2': 142.0, 't_3': 213.0},
 17.0: {'t_1': 76.0, 't_2': 152.0, 't_3': 227.0},
 18.0: {'t_1': 82.0, 't_2': 164.0, 't_3': 245.0},
 19.0: {'t_1': 88.0, 't_2': 176.0, 't_3': 264.0},
 20.0: {'t_1': 94.0, 't_2': 188.0, 't_3': 283.0},
 21.0: {'t_1': 100.0, 't_2': 200.0, 't_3': 300.0},
 22.0: {'t_1': 103.0, 't_2': 206.0, 't_3': 308.0},
 23.0: {'t_1': 111.0, 't_2': 222.0, 't_3': 332.0},
 24.0: {'t_1': 115.0, 't_2': 230.0, 't_3': 346.0},
 25.0: {'t_1

In [15]:
# save the lookup (json is already imported above); JSON stores the float keys as strings, e.g. "5.0"
with open('result.json', 'w') as fp:
    json.dump(d, fp)
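Because JSON object keys are always strings, the furlong keys come back as `"5.0"` rather than `5.0` when the file is read again. A minimal sketch of restoring them follows. The helper name `cutoffs_from_json` is hypothetical, and the example round-trips through a string instead of `result.json` so that it runs on its own.

```python
import json


def cutoffs_from_json(text):
    """Parse the saved lookup and turn the string keys back into float furlongs."""
    raw = json.loads(text)
    return {float(k): v for k, v in raw.items()}


# Round trip of a single entry taken from the printed lookup
saved = json.dumps({5.0: {"t_1": 19.0, "t_2": 38.0, "t_3": 58.0}})
restored = cutoffs_from_json(saved)
print(restored[5.0]["t_3"])  # 58.0
```

For the saved file, the same conversion would be applied to the result of `json.load(fp)`.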