__Setup :__ importing the required packages and configuring notebook display options

In [1]:
# packages
from sqlalchemy import create_engine
import pymysql
import pandas as pd
import numpy as np
import re 
import json
import git


# configs
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
%matplotlib inline

## Analysing how fundamental / technical features influence price changes

__Aim__

To understand whether 'pre-race' factors (predominantly fundamental) appear to influence price changes (BSP to in-play).

__Sections__
- Create 'natural' & 'engineered' features (bins & one-hot)
- Create target (exclude winners -> convert to probabilities -> bsp - ip_min) 
- Measure the differences in price changes per variable (using ANOVA?)
- Identify those with the 'biggest influence'
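
The target outlined in the sections above can be sketched on toy data. This is only an illustration: it assumes implied probability is 1 / decimal price (the notebook does not state the conversion), keeps the `bsp - ip_min` ordering from the list, and uses made-up prices with the column names from the query further down.

```python
import pandas as pd

# Toy rows (hypothetical values) using column names from the Smartform query
toy = pd.DataFrame({
    "bsp": [4.0, 10.0, 2.0],
    "inplay_min": [2.0, 10.0, 1.01],
    "win": [0, 0, 1],
})

# 1. exclude winners
losers = toy[toy["win"] == 0].copy()

# 2. convert decimal prices to implied probabilities (assumed: 1 / price)
losers["bsp_prob"] = 1 / losers["bsp"]
losers["ip_min_prob"] = 1 / losers["inplay_min"]

# 3. target: BSP probability minus probability at the in-play minimum price
losers["target"] = losers["bsp_prob"] - losers["ip_min_prob"]
```

With this sign convention a negative target means the price shortened in-play, and zero means the runner never traded below BSP.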

### 1. Reading in data
Connecting to the MySQL database where the Smartform tables are stored and reading the query result into a pandas DataFrame.

In [2]:
# loading in sql login credentials
repo = git.Repo('.', search_parent_directories=True) # finds root dir of git repo
logins_dir = str(repo.working_tree_dir) + "/sql_logins.txt" 

with open(logins_dir) as f:
    login_dict =  json.load(f)

In [None]:
db_connection_str = f"mysql+pymysql://{login_dict['UID']}:{login_dict['PWD']}@localhost/{login_dict['DB']}"
db_connection = create_engine(db_connection_str)

data = pd.read_sql('''
                 SELECT
                  race_id,
                  course,
                  race_type,
                  going,
                  handicap,
                  maiden,
                  num_runners,
                  distance_yards,
                  added_money,
                  runner_id,
                  distance_travelled,
                  form_figures,
                  gender,
                  age,
                  bred,
                  in_race_comment,
                  owner_id,
                  trainer_id,
                  jockey_id,
                  position_in_betting,
                  days_since_ran,
                  weight_pounds,
                  finish_position,
                  amended_position,
                  tack_hood,
                  tack_visor,
                  tack_blinkers,
                  tack_eye_shield,
                  tack_cheek_piece,
                  tack_pacifiers,
                  tack_tongue_strap,
                  bf_race_id,
                  bf_runner_id,
                  bsp,
                  inplay_min,
                  win
                 FROM
                  historic_races
                  JOIN historic_runners USING (race_id)
                  JOIN historic_betfair_win_prices ON race_id = sf_race_id
                  AND runner_id = sf_runner_id
                WHERE
                  (
                    CAST(historic_races.meeting_date AS Datetime) BETWEEN '2016-10-01'
                    AND '2020-01-01'
                  )
                ORDER BY
                  race_id,
                  runner_id
                ''',
                con=db_connection)
print('No. Rows : ', len(data.index))
# db_connection.dispose() # an Engine is released with dispose(), it has no close()
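
One thing that may be worth guarding against when the connection string is built with an f-string: a password containing characters such as `@`, `:` or `/` breaks URL parsing unless it is escaped. A minimal sketch using the standard library, with made-up credentials in the same shape as `sql_logins.txt` (the values and the database name are hypothetical):

```python
from urllib.parse import quote_plus

# Hypothetical credentials with the same keys as the login file
login_dict = {"UID": "analyst", "PWD": "p@ss:word/1", "DB": "smartform"}

# escape the parts of the URL that may contain reserved characters
user = quote_plus(login_dict["UID"])
password = quote_plus(login_dict["PWD"])

db_connection_str = f"mysql+pymysql://{user}:{password}@localhost/{login_dict['DB']}"
```

The escaped string can then be passed to `create_engine` exactly as in the cell above.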

In [142]:
df = data.copy() # temp : to save having to run query (remove to save memory)

### 2. Data Processing

#### 2.0 Correct finish position 
Deriving the 'true' finishing position: where a stewards' enquiry (or similar) amended the result, the amended position replaces the original finish position.

In [143]:
df['final_position'] = np.where(df['amended_position'].notnull(), df['amended_position'], df['finish_position'])
df.drop(['finish_position', 'amended_position'], axis = 1, inplace = True)
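
For reference, the same "amended if present, otherwise original" rule can also be written with `fillna`. A small sketch on made-up positions, not a change to the cell above:

```python
import numpy as np
import pandas as pd

# Toy result: the first two runners swapped places after an enquiry
toy = pd.DataFrame({
    "finish_position": [1, 2, 3],
    "amended_position": [2, 1, np.nan],
})

# take the amended position where it exists, else the original one
toy["final_position"] = toy["amended_position"].fillna(toy["finish_position"])
```

Because `amended_position` contains NaN, the result is a float column here.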

#### 2.1 Dropping missing values

In [144]:
prev_rows = len(df.index)
df.dropna(inplace=True)
print('Rows Removed: ', prev_rows - len(df.index), '\nRows Remaining : ', len(df.index))

Rows Removed:  60963 
Rows Remaining :  333675


In [145]:
# come back to improve this with some assumptions / conditions?
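
One possible refinement, sketched under assumptions: a blanket `dropna` removes a row if any column is missing, including columns the price analysis may not need. Inspecting the share of missing values per column and dropping only on a chosen subset keeps more rows. The `required` list below is a hypothetical choice, and the toy values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bsp": [3.5, np.nan, 8.0],
    "inplay_min": [2.0, 4.0, 6.0],
    "in_race_comment": ["led", "held up", None],
})

# share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# drop rows only when a column the analysis relies on is missing
required = ["bsp", "inplay_min"]  # hypothetical choice of essential columns
kept = toy.dropna(subset=required)
```

In the toy frame the third row survives even though its comment is missing, whereas `toy.dropna()` would discard it.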

#### 2.2 Remove winners 
Creating a sample not affected by winners, which cause a fat right tail in price decreases. A small but acceptable fraction of the sample is removed. (Winners could be analysed separately.)

In [146]:
prev_rows = len(df.index)
df = df[df['win'] == 0]
print('Rows Removed: ', prev_rows - len(df.index), '\nRows Remaining : ', len(df.index))

Rows Removed:  38631 
Rows Remaining :  295044


#### 2.3 Form transformation

Capturing how form may influence price movements. Attempt to capture effects like 'fitness', 'momentum' & __'consistency'__.

- Pre-process form (0's & letters to 9's). Currently '-' & '/' are left out of the sums, but they still count towards the form length.
- Taking previous 3 runs as single features
- Taking length of form as a feature
- Summing all form as an aggregate 
- Average form as form_sum / form_len
- Comparing averages to give some indication of improvement / decline of performance
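
The steps above can be sketched as a plain function on a single form string. This is an illustration rather than the notebook's implementation: it assumes each digit is one run, skips '-' and '/' entirely, and divides by the number of runs instead of the raw string length, so its averages can differ from the cells below. The form string is a toy example.

```python
import re

def form_features(form, recent_n=3):
    """Summarise a form string such as '12-P30' (hypothetical helper)."""
    cleaned = re.sub(r"[A-Z0]", "9", form)                # letters and zeros count as 9
    runs = [int(d) for d in re.findall(r"\d", cleaned)]   # '-' and '/' are skipped
    recent = runs[-recent_n:]                             # most recent runs are last
    avg_all = round(sum(runs) / len(runs), 2) if runs else None
    avg_rec = round(sum(recent) / len(recent), 2) if recent else None
    return {
        "form_len": len(runs),
        "form_sum_all": sum(runs),
        "form_avg_all": avg_all,
        "form_avg_rec": avg_rec,
        # lower figures are better, so a lower recent average counts as improvement
        "form_avg_imp": int(avg_rec < avg_all) if runs else 0,
    }

features = form_features("12-P30")
```

A function like this could be applied row-wise with `df['form_figures'].apply(form_features)`, at the cost of being slower than the vectorised string methods.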

In [148]:
df['form_figures'] = df['form_figures'].str.replace(r'[A-Z0]', '9', regex=True) # converting letters/zeros to '9' (regex=True is required on pandas >= 2.0)

In [149]:
df['form_3'] = df['form_figures'].str[-3] # pos in 3rd last race (NaN if fewer than 3 characters)
df['form_2'] = df['form_figures'].str[-2] # pos in 2nd last race (NaN if fewer than 2 characters)
df['form_1'] = df['form_figures'].str[-1] # pos in last race
df['form_len'] = df['form_figures'].str.len().astype(int) # length of form figures

df['form_rec'] = df['form_figures'].str[-3:] # pos in last 3 races - 'recent form'
df['form_rec_len'] = df['form_rec'].str.len().astype(int) # length of recent

In [151]:
# aggregate on all available form
df['form_int_list'] = df['form_figures'].apply(lambda x: re.findall(r'\d+', x)) # list of non-sep ints 
df['form_ints_list'] = df['form_int_list'].apply(lambda x: [sum(int(c) for c in str(num)) for num in x]) # sep ints
df['form_sum_all'] = df['form_ints_list'].apply(lambda x: sum(x)) # sum of all ints
df['form_avg_all'] = round(df['form_sum_all'] / df['form_len'], 2) # mean of ints
# df.drop(['form_figures_num', 'form_int_list', 'form_ints_list'], axis = 1, inplace = True)

In [152]:
# aggregate on last 3 runs ?
df['form_int_list'] = df['form_rec'].apply(lambda x: re.findall(r'\d+', x)) # list of non-sep ints 
df['form_ints_list'] = df['form_int_list'].apply(lambda x: [sum(int(c) for c in str(num)) for num in x]) # sep ints
df['form_sum_rec'] = df['form_ints_list'].apply(lambda x: sum(x)) # sum of all ints
df['form_avg_rec'] = round(df['form_sum_rec'] / df['form_rec_len'], 2) # mean of ints

In [158]:
# if 'average form' has improved
df['form_avg_imp'] = np.where(df['form_avg_rec'] < df['form_avg_all'], 1, 0) 

In [None]:
# perhaps add in some rule over form length here? i.e. not include just 1 run?
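
A possible form-length rule, sketched on toy values. Because 'recent form' is the last 3 characters, a horse with 3 or fewer form characters has identical recent and overall averages, so the improvement flag can never be 1 for it. The sketch below leaves the flag missing in that case instead of 0. The threshold `MIN_RUNS` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

MIN_RUNS = 4  # hypothetical threshold: more form than the 'recent' window of 3

toy = pd.DataFrame({
    "form_len": [1, 6, 5],
    "form_avg_all": [3.0, 5.0, 2.0],
    "form_avg_rec": [3.0, 2.0, 4.0],
})

improved = toy["form_avg_rec"] < toy["form_avg_all"]
enough_form = toy["form_len"] >= MIN_RUNS

# flag is 1/0 when there is enough form, NaN (unknown) otherwise
toy["form_avg_imp"] = np.where(enough_form, improved.astype(float), np.nan)
```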

In [161]:
# checks
# df[['form_figures', 'form_figures_num', 'form_ints_list', 'form_sum_all', 'form_len', 'form_avg']].tail(50)
# df[['form_figures', 'form_int_list', 'form_sum_all', 'form_avg_all', 'form_sum_rec','form_avg_rec', 'form_avg_imp']].tail(50)

In [87]:
# dropping temp form vars here
# df.drop(['form_figures', 'form_'], axis = 1, inplace = True) # removing temporary form variables
         # drop all form variables simultaneously?

#### 2.4 Headgear Transformation
Each headgear type is a separate variable in the database (1/0 for worn or not). These are summed into a single variable because the specific type matters less than whether headgear was worn at all, for example for the first time: this 'typically' brings about improvement, and headgear is often worn by 'front runners', who we know often experience a price decrease in-play.
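
A 'first-time headgear' flag could be derived along these lines. This is a sketch under assumptions: rows are already in date order within each runner (the query above does not select a date column, so one would need to be added), and "first time" can only mean first time within the sample window. The toy history is made up.

```python
import pandas as pd

# Toy history: one row per run, in date order within each runner
toy = pd.DataFrame({
    "runner_id": [1, 1, 1, 2, 2],
    "headgear": [0, 1, 1, 2, 0],   # summed headgear indicators per run
})

wearing = (toy["headgear"] > 0).astype(int)

# number of earlier runs by the same runner in which any headgear was worn
prior_wears = wearing.groupby(toy["runner_id"]).cumsum() - wearing

# first time: wearing now, never worn before within the sample
toy["headgear_first_time"] = ((wearing == 1) & (prior_wears == 0)).astype(int)
```

Runner 2's first row is flagged only because no earlier runs are visible, which is the main caveat of computing this inside a date-limited sample.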

In [None]:
# summing the tack_* columns returned by the query ('tack_eye_cover' is not selected there)
tack_cols = ['tack_hood', 'tack_visor', 'tack_blinkers', 'tack_eye_shield', 'tack_cheek_piece',
             'tack_pacifiers', 'tack_tongue_strap']
df['headgear'] = df[tack_cols].sum(axis=1)
df.drop(columns=tack_cols, inplace=True)

In [None]:
#### Encoding 

In [None]:
going_dict = {'Standard': 0, 'Firm': 1, 'Good to Firm': 2, 'Good': 3, 'Good - Yielding': 4, 'Yielding' : 5,
              'Yielding - Soft': 6, 'Good to Soft': 7, 'Soft': 8, 'Soft - Heavy': 9, 'Heavy': 10}
df['going'] = df['going'].map(going_dict)
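
Note that `Series.map` with a dict silently turns any label missing from the dict into NaN. A quick check for unmapped labels, sketched with a shortened copy of the mapping and made-up going descriptions:

```python
import pandas as pd

going_dict = {"Firm": 1, "Good": 3, "Soft": 8}  # shortened copy for the example
toy = pd.Series(["Good", "Soft", "Good To Firm", "Firm"], name="going")

encoded = toy.map(going_dict)

# labels that had no entry in the dict (here a capitalisation mismatch)
unmapped = sorted(toy[encoded.isna()].unique())
```

Running the same check on `df['going']` before overwriting the column would show whether any categories in the data fall outside `going_dict`.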

In [None]:
# further data processing : 
# headgear encoding
# going encoding
# previous race vars

# prev race missing val treatment (Ensure this doesn't affect features)
# prev race feature creation


In [None]:
# prev vars
prev_vars = ['form', 'handicap', 'maiden', 'final_position'] # 'finish_position' was replaced by 'final_position' in 2.0

In [None]:
# difference vars
dif_vars = ['handicap', ]
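
Previous-race and difference variables could be built with a grouped shift. A minimal sketch, assuming a column that orders each runner's races; `run_order` below is a hypothetical stand-in for a race date, which the query above does not currently select.

```python
import pandas as pd

toy = pd.DataFrame({
    "runner_id": [1, 1, 1, 2, 2],
    "run_order": [1, 2, 3, 1, 2],   # hypothetical stand-in for a race date
    "handicap": [0, 1, 1, 1, 0],
})

toy = toy.sort_values(["runner_id", "run_order"])

# value of the variable in the runner's previous race (NaN for a first run)
toy["prev_handicap"] = toy.groupby("runner_id")["handicap"].shift(1)

# change between the previous race and this one
toy["dif_handicap"] = toy["handicap"] - toy["prev_handicap"]
```

The NaN on each runner's first visible run is where the missing-value treatment mentioned above would apply.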

In [None]:
# variable lists
one_hots = ['course', 'race_type', 'going', 'gender', 'age', 'bred', 'owner_id', 'trainer_id',
           'jockey_id', 'form_3', 'form_2', 'form_1'] # add in engineered vars
bins = ['num_runners', 'distance_yards', 'added_money', 'distance_travelled', 'days_since_ran', 'weight_pounds',
        'early_traded', 'total_traded'] # add in engineered vars
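
How the two lists might be used, sketched on toy data: equal-frequency binning with `pd.qcut` for the numeric variables and `pd.get_dummies` for the categorical ones. The number of bins, the labels and the race type values are all assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "race_type": ["Flat", "Hurdle", "Flat", "Chase"],   # made-up categories
    "num_runners": [6, 12, 9, 16],
})

# equal-frequency bins: each bin holds roughly the same number of rows
toy["num_runners_bin"] = pd.qcut(toy["num_runners"], q=2, labels=["small", "large"])

# one indicator column per category
dummies = pd.get_dummies(toy, columns=["race_type"], prefix="race_type")
```

High-cardinality columns such as `owner_id` or `jockey_id` would produce very wide frames this way, so they may need grouping of rare levels first.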

In [None]:
# combos
# course x jockey
# course x trainer
# race_type x jockey
# race_type x trainer
# len_form x aggregate
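
One simple way to build such a combination is to join the two columns into a single categorical key. A sketch with made-up courses and ids; the `_x_` naming is just a convention chosen for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "course": ["Ascot", "Ascot", "York"],
    "jockey_id": [101, 202, 101],
})

# one label per (course, jockey) pair
toy["course_x_jockey"] = toy["course"] + "_" + toy["jockey_id"].astype(str)
```

The resulting column can then be treated like any other categorical feature.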


In [None]:
# - data processing 
# - create features (prev races, nlp)
# - create combinations
# - create targets (av_price -> bsp, bsp -> ip_min (w/o winners?))
# - before inspecting 'statistical factors' reduce features that only have a small sample e.g. sire/jockey combo?
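
The last point, reducing levels with only a small sample, could be handled by pooling rare levels into a single bucket before any statistical comparison. A sketch with a hypothetical threshold `MIN_SAMPLE` and made-up combination labels:

```python
import pandas as pd

MIN_SAMPLE = 2  # hypothetical minimum number of rows per level

toy = pd.DataFrame({
    "course_x_jockey": ["Ascot_101", "Ascot_101", "York_101", "Ascot_202", "Ascot_101"],
})

# number of rows sharing each row's level
counts = toy["course_x_jockey"].map(toy["course_x_jockey"].value_counts())

# keep well-populated levels, pool the rest into 'other'
toy["course_x_jockey"] = toy["course_x_jockey"].where(counts >= MIN_SAMPLE, "other")
```

Pooling keeps the rows in the sample, whereas filtering them out would shrink it; which is preferable depends on the test being run.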
