# Fox River AI Racing
We will follow a methodical approach: start with the questions we want answered, choose the models best suited to answering them, and then prepare the data accordingly.

## Predictive Questions:

Based on the objective, here are a few specific questions that align with horse racing data and could be valuable for building predictive models:
1. Performance Prediction Based on Class Changes: How likely is a horse to perform well (win or place) when its trainer and jockey move it up or down a class?
> Prioritization: High. Trainer/jockey combinations are typical indicators of a horse's performance. A move up in class can signal positive sentiment from the team, while a step down might indicate a lack of confidence or a feeling that the horse needs to run in a less competitive field for a while.

2. Track-Specific Performance: Can certain trainer-jockey combinations predictably improve or diminish a horse’s performance on specific surfaces, such as dirt or turf?
> Prioritization: High. Coupled with trainer/jockey statistics on given tracks and surfaces, this can be a strong indicator.

3. Impact of Historical Performance Metrics: How do past performance metrics (speed, class rating, distance, and surface) influence a horse’s predicted outcome?
> Prioritization: Very High. Ultimately, it's about the horse and its capabilities.
 
4. Outcome Prediction Based on Race Attributes: How do variables like race type (e.g., stakes, allowance), purse, distance, and surface affect a horse’s probability of winning or placing?
> Prioritization: Medium. Combined with other factors, these attributes could be strong indicators. There will likely be other indicators, such as a high winning percentage at a given class, that would prompt the horse being placed there to compete.

5. Real-Time Speed and Position Impact: How does a horse’s speed and position at each sectional point relate to its final performance in a race?
> Prioritization: Very High. This can yield insights into how a horse’s sectional performance might predict its finishing position, which could enhance live betting models.

6.	Positional Advantage at Specific Distances: Do horses that maintain a lead or favorable position at specific sectional points (e.g., 600 meters) tend to perform better overall?
> Prioritization: High. This is critical for understanding strategic points within a race, especially in conjunction with track and distance data.

7.	Consistency of Stride and Pace: Does a horse’s stride frequency and speed consistency through sectional points correlate with higher performance outcomes?
> Prioritization: Medium-High. Examining stride frequency consistency could provide a distinct performance metric for fitness or endurance, complementing speed and class information.

8.	Influence of Race Pace on Performance: How does the average race pace (derived from the sectional data) influence a horse’s likelihood of winning or placing?
> Prioritization: Medium-High. This could help determine the ideal pace for specific horses, which may reveal strategies for early or late acceleration.

9.	Impact of Track Conditions on GPS Performance Metrics: How do track conditions, like “wet” or “firm,” affect horses’ sectional performances and overall outcomes?
> Prioritization: Medium. This would help assess if certain horses are more adaptable to specific track conditions, complementing your track-specific analysis.

10.	Trainer-Jockey Performance in Real-Time Race Dynamics: Do specific trainer-jockey combinations show consistent strategies in GPS-related metrics, such as speed increases or deceleration at key sectional points?
> Prioritization: Medium. Analyzing trainer-jockey strategies could indicate tactical patterns, which can be useful for predicting outcomes based on track and horse attributes.

11.	Predicting Optimal Bet Types: Given a horse’s real-time GPS data and historical sectional data, which bet type (e.g., win, place, exacta) has the highest predicted success rate?
> Prioritization: Medium. This is valuable for bettors and adds a layer of predictive insight beyond simply the horse’s finishing place.

12.	Early Race Sectional Points as Predictors: Can early sectional points in a race (e.g., first 400 meters) reliably predict the final outcome?
> Prioritization: Medium-Low. While potentially useful, this may vary more depending on distance and horse-specific attributes and is worth exploring as a secondary analysis.


## Model Recommendations and Key Attributes

With the prioritized questions in mind, here’s an outline of models, data attributes, and custom metrics that could be beneficial:

1. Performance Prediction Based on Class Changes

> Model Recommendations:

	•	Logistic Regression: Good for binary outcomes (win/place vs. loss).
	•	Random Forest or Gradient Boosting: Capture non-linear relationships, which are useful for interactions between class changes and other variables.
	•	XGBoost: Often performs well with tabular data, especially for binary or multi-class classification.

> Key Data Attributes:

	•	class rating: Measure of the race’s competitive level.
	•	todays_cls: Indicates the current class level.
	•	historical class changes: For each horse, this is a record of up/down movements over time.
	•	trainer and jockey win rates: Historical data on trainer and jockey success rates for class transitions.
	•	horse performance metrics: Past results, such as speed, average speed, and distance performance.

> Custom Metrics:

    •	Class Change Rate: Ratio of class changes (up or down) for a specific horse.
	•	Success Rate by Class: Percentage of positive outcomes (e.g., win/place) for each class change type for a specific horse.
	•	Trainer-Jockey Class Shift Success: Win/place rate when trainer-jockey combinations have adjusted a horse’s class.
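
The first two custom metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the function name `class_change_metrics` and the input layout are hypothetical, a higher class rating is assumed to mean a tougher race, and "win or place" is assumed to mean finishing first or second.

```python
from collections import defaultdict

def class_change_metrics(races):
    """races: chronological list of (class_rating, finish_position) for one horse.

    Returns (class_change_rate, success_rate_by_direction).
    Assumption: a higher class_rating means a more competitive race.
    """
    moves = 0
    outcomes = defaultdict(lambda: [0, 0])  # direction -> [win/place count, starts]
    for (prev_cls, _), (cls, finish) in zip(races, races[1:]):
        if cls > prev_cls:
            direction = "up"
        elif cls < prev_cls:
            direction = "down"
        else:
            direction = "same"
        if direction != "same":
            moves += 1
        outcomes[direction][1] += 1
        if finish <= 2:  # assumption: win or place means 1st or 2nd
            outcomes[direction][0] += 1
    transitions = len(races) - 1
    change_rate = moves / transitions if transitions > 0 else 0.0
    success = {d: hits / starts for d, (hits, starts) in outcomes.items()}
    return change_rate, success

# Hypothetical history: (class rating, finish position), oldest race first
history = [(80, 3), (85, 1), (85, 4), (78, 2), (82, 5)]
rate, success = class_change_metrics(history)
print(rate)     # 0.75
print(success)  # {'up': 0.5, 'same': 0.0, 'down': 1.0}
```

The first race has no predecessor, so it only serves as the baseline for the second race's class move.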

2. Track-Specific Performance on Specific Surfaces (Dirt vs. Turf)

> Model Recommendations:

    •	Random Forest or Decision Trees: Good for handling categorical variables like surfaces and tracks.
	•	XGBoost or LightGBM: Capture complex interactions and are effective with tabular data, especially in classifying win/place/loss probabilities.
	•	Survival Analysis (e.g., Cox Proportional Hazards): If you want to model the likelihood of finishing in a certain position based on surfaces.
	
> Key Data Attributes:

	•	surface: Dirt, turf, all-weather, etc.
	•	course_cd: Unique code representing the track location.
	•	track history: Historical records of horse performance on specific tracks and surfaces.
	•	jockey and trainer track-specific win rates: Success rate on different tracks and surfaces.

> Custom Metrics:

	•	Track-Surface Win Rate: Win/place rate for each horse across specific tracks and surfaces.
	•	Trainer-Jockey Surface Compatibility: Success rates for trainer-jockey combinations on each surface type.
	•	Surface Speed Rating: Average speed rating for each surface type for each horse.
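
A Track-Surface Win Rate could be computed along the following lines. This is a sketch under assumptions: the dictionary keys `horse` and `finish` are hypothetical field names (only `course_cd` and `surface` appear in the source data), and the `min_starts` cutoff is an illustrative guard against rates built on very few starts.

```python
from collections import defaultdict

def win_rate_by_track_surface(starts, min_starts=1):
    """starts: iterable of dicts with keys horse, course_cd, surface, finish.

    Returns {(horse, course_cd, surface): win rate}, keeping only
    combinations with at least `min_starts` starts.
    """
    tally = defaultdict(lambda: [0, 0])  # key -> [wins, starts]
    for s in starts:
        key = (s["horse"], s["course_cd"], s["surface"])
        tally[key][1] += 1
        if s["finish"] == 1:
            tally[key][0] += 1
    return {
        key: wins / n
        for key, (wins, n) in tally.items()
        if n >= min_starts
    }

# Hypothetical starts for two horses
sample = [
    {"horse": "A", "course_cd": "TGP", "surface": "D", "finish": 1},
    {"horse": "A", "course_cd": "TGP", "surface": "D", "finish": 4},
    {"horse": "A", "course_cd": "TGP", "surface": "T", "finish": 1},
    {"horse": "B", "course_cd": "LRL", "surface": "D", "finish": 2},
]
print(win_rate_by_track_surface(sample, min_starts=2))
```

The same grouping works for the Trainer-Jockey Surface Compatibility metric if the key is changed to the trainer, jockey, and surface.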

3. Impact of Historical Performance Metrics

> Model Recommendations:

    •	Time Series Models (ARIMA, SARIMA): Useful if you have sequential data and want to forecast based on historical trends.
	•	Recurrent Neural Networks (LSTM/GRU): For sequential data analysis, to predict outcomes based on historical patterns.
	•	Gradient Boosting (XGBoost): Effective for handling complex, non-linear relationships and feature interactions in historical metrics.
	
> Key Data Attributes:

    •	speed figures: Speed metrics across distances and surfaces.
	•	class rating: Historical class ratings per race.
	•	distance and surface: Analyze how different surfaces and distances affect outcomes.
	•	jockey/trainer performance history: Jockey and trainer win/place rates over time.

> Custom Metrics:

    •	Weighted Speed Index: Weight recent performance more heavily to account for improvement or decline.
	•	Consistency Rating: Variance in performance metrics (e.g., speed) across recent races.
	•	Class Stability: A horse’s tendency to remain within the same class over time.
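
One possible reading of the Weighted Speed Index and Consistency Rating is shown below. The exponential decay scheme and the default `decay=0.8` are assumptions chosen for illustration; the source only says that recent performance should weigh more heavily.

```python
import statistics

def weighted_speed_index(speed_figures, decay=0.8):
    """Weighted mean of speed figures, ordered most recent first.

    The i-th most recent race gets weight decay**i, so decay < 1
    emphasises recent form (an assumed weighting scheme).
    """
    if not speed_figures:
        raise ValueError("at least one speed figure is required")
    weights = [decay ** i for i in range(len(speed_figures))]
    weighted_sum = sum(w * s for w, s in zip(weights, speed_figures))
    return weighted_sum / sum(weights)

def consistency_rating(speed_figures):
    """Population variance of the speed figures; lower means more consistent."""
    return statistics.pvariance(speed_figures)

# Hypothetical speed figures, most recent race first
recent_form = [90.0, 80.0]
print(weighted_speed_index(recent_form, decay=0.5))
print(consistency_rating(recent_form))  # 25.0
```

A decay closer to 1 treats all races almost equally, while a smaller value makes the index track the latest race more closely.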

4. Outcome Prediction Based on Race Attributes

> Model Recommendations:

    •	Logistic Regression: Useful for simple, interpretable models to predict winning probability based on categorical race features.
	•	Gradient Boosting Machines (GBM): Effective for capturing complex interactions between race attributes and outcomes.
	•	Naive Bayes: Simplistic, but can work well with categorical data, like race type and bet options.
	
> Key Data Attributes:

    •	race type: Stakes, allowance, claiming, etc.
	•	purse: Monetary reward associated with the race, often a proxy for competition level.
	•	betting options: Information about betting options, such as exacta or trifecta, can sometimes indicate the competitive nature of the race.
	•	distance and surface: Provides context for how horses tend to perform in specific conditions.

> Custom Metrics:

    •	Race Competition Index: Weighted index based on purse, race type, and number of competitors.
	•	Surface-Distance Adaptability: Success rate of horses at specific distances and surfaces.
	•	Betting Volume Indicator: If available, can be used as a feature to capture how betting odds relate to outcomes.

5. Real-Time Speed and Position Impact

> Model Recommendations:

    •	Sequential or Recurrent Neural Networks (RNNs, LSTM/GRU): These models are ideal for processing time-dependent, sequential GPS data.
	•	Convolutional Neural Networks (CNNs): Useful for detecting spatial patterns, particularly if speed and position data are mapped into grids.
	•	Gradient Boosting Models (GBM, XGBoost): For simpler, non-sequential models to estimate probabilities based on instantaneous positions and speeds.
	
> Key Data Attributes:

    •	GPS data: Includes real-time speed, stride frequency, and distance to finish.
	•	sectionals: Position and speed at different sections of the race.
	•	final position: Finishing position, used as the outcome variable for training.

> Custom Metrics:

    •	Sectional Consistency Score: Measures the consistency of a horse’s speed across sections.
	•	Acceleration Rate: Rate of speed increase or decrease over time.
	•	Peak Speed Position: Section of the race where the horse reached its maximum speed.
	•	Position Advantage Index: Position advantage based on average length behind the leader.
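
The sectional metrics above might be derived as in the sketch below. It assumes each section has already been reduced to an average speed and an elapsed time; the function name and the choice of the coefficient of variation as the consistency score are illustrative assumptions, and the acceleration figure is only a rough section-to-section approximation.

```python
import statistics

def sectional_metrics(section_speeds, section_times):
    """section_speeds: average speed in each section, in race order.
    section_times: seconds taken to run each section.

    Returns (consistency, accelerations, peak_section) where
    consistency is the coefficient of variation of the speeds
    (lower is steadier), accelerations is the change in average speed
    divided by the time of the later section, and peak_section is the
    zero-based index of the fastest section.
    """
    if len(section_speeds) != len(section_times) or not section_speeds:
        raise ValueError("speeds and times must be non-empty and equal length")
    mean_speed = statistics.fmean(section_speeds)
    consistency = statistics.pstdev(section_speeds) / mean_speed
    accelerations = [
        (v1 - v0) / t1
        for v0, v1, t1 in zip(section_speeds, section_speeds[1:], section_times[1:])
    ]
    peak_section = max(range(len(section_speeds)), key=section_speeds.__getitem__)
    return consistency, accelerations, peak_section

# Hypothetical three-section race
consistency, accel, peak = sectional_metrics([16.0, 18.0, 17.0], [12.5, 11.0, 12.0])
print(peak)  # 1
```

The Position Advantage Index is left out here because it needs the positions of every runner in the race, not just one horse's sectionals.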

## Recommended Priority and Next Steps

	1.	Impact of Historical Performance Metrics: Start with basic machine learning models like logistic regression or XGBoost to predict performance based on historical metrics. The required data is already in a clean, structured format, making it suitable for testing multiple models quickly.

    2.	Real-Time Speed and Position Impact: Given the TPD data’s richness, this is ideal for exploring advanced techniques like LSTM or CNN models if you have sequential data and sufficient computational resources.

    3.	Track-Specific Performance on Dirt vs. Turf: Trainer and jockey performance on specific tracks can provide straightforward insights using XGBoost or Decision Trees.

    4.	Performance Prediction Based on Class Changes: Use logistic regression or gradient boosting to analyze the probability of a successful performance when moving up or down a class.

    5.	Outcome Prediction Based on Race Attributes: Useful as a supporting analysis, especially for understanding how general race conditions affect the likelihood of placing or winning.

This sequence provides a roadmap to test models incrementally. Each question offers potential insights, but starting with high-impact, feasible models allows you to build and refine an ensemble gradually.


## Data Analysis

Historical data has been obtained from Equibase (EQB) and Total Performance Data (TPD). Equibase's data comes from its subsidiary, TrackMaster, and Equibase is the authoritative source for horse racing data in the US. TPD has proprietary ownership of GPS data that tracks horses via a GPS device in each horse's saddle cloth.

EQB data is provided in XML format, and TPD uses JSON primarily and KML for their routes data. KML (Keyhole Markup Language) is an XML-based file format used to represent geographic data and is commonly associated with applications like Google Earth and Google Maps.

The data was modeled using a combination of XML Spy and Erwin. The idea is to integrate the data sets so the GPS data can be used to augment data from EQB. 

The first step in analyzing the data is to follow the procedures outlined below:

Fox River AI Racing: Initial Data Analysis Workflow

1. Environment Setup

	•	Database Connection: Connect your Jupyter Notebook to your PostgreSQL database using SQLAlchemy or a similar library. Import essential libraries such as pandas, sqlalchemy, psycopg2, and geoalchemy2 (for handling PostGIS data).
	•	Load Extensions: Ensure that postgis and other necessary extensions are loaded in PostgreSQL. You’ll need spatial functions for handling GPS and spatial data.



```python
import logging
import os
import pandas as pd
from sqlalchemy import create_engine
import geopandas as gpd
from datetime import datetime
import configparser
from src.data_ingestion.ingestion_utils import (
    get_db_connection, update_tracking, load_processed_files
)
from src.data_ingestion.eqb_ppData import process_pluspro_data
from src.data_ingestion.eqb_resultsCharts import process_resultscharts_data
from src.data_ingestion.tpd_datasets import (
    process_tpd_sectionals_data,
    process_tpd_gpsdata_data
)

# Load the configuration file
config = configparser.ConfigParser()
config.read('/home/exx/myCode/horse-racing/FoxRiverAIRacing/config.ini')

# Set up logging for consistent logging behavior in the notebook
logging.basicConfig(level=logging.INFO)

# Retrieve database credentials from the [database] section of the config file
db_host = config['database']['host']
db_port = config['database']['port']
db_name = config['database']['dbname']
db_user = config['database']['user']

# Establish a connection using the project's get_db_connection helper
conn = get_db_connection(config)

# Create the SQLAlchemy engine used by pandas.
# No password appears in the URL, so authentication must come from
# elsewhere (for example a ~/.pgpass file or the PGPASSWORD variable).
engine = create_engine(f'postgresql+psycopg2://{db_user}@{db_host}:{db_port}/{db_name}')
```

2. Data Quality and Integrity Checks

	•	Initial Exploration: Start by exploring the database schema and inspecting sample rows for each table. Here are some sample queries:

```sql
SELECT * FROM race_results LIMIT 10;
SELECT * FROM runners LIMIT 10;
SELECT * FROM gpsData LIMIT 10;
```


	•	Data Consistency: Check for missing or duplicate entries in primary fields, such as:
		•	course_cd, race_date, race_number, and post_time in the race_results table.
		•	course_cd, race_date, saddle_cloth_number, and post_time in the gpsData table.
	•	GPS Data Quality: Ensure GPS data coordinates (longitude, latitude) are within reasonable bounds. Look for any NULL values or extreme outliers.
	•	Foreign Key Validations: Confirm that course_cd, race_date, and race_number fields align between race_results and runners tables.
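
The consistency and GPS checks above can be prototyped on rows already loaded into memory before they are turned into SQL. The helpers below are a hypothetical sketch: the function names are not part of the project, and the coordinate test only enforces the global longitude and latitude limits, so a tighter bounding box per track would catch more outliers.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return {key tuple: count} for keys that occur more than once."""
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

def missing_key_rows(rows, key_fields):
    """Return rows where any key field is missing or None."""
    return [r for r in rows if any(r.get(f) is None for f in key_fields)]

def bad_coordinates(rows):
    """Return rows whose longitude/latitude is NULL or outside valid ranges."""
    bad = []
    for r in rows:
        lon, lat = r.get("longitude"), r.get("latitude")
        if lon is None or lat is None:
            bad.append(r)
        elif not (-180 <= lon <= 180 and -90 <= lat <= 90):
            bad.append(r)
    return bad

# Hypothetical rows illustrating one duplicate key and one NULL key field
race_rows = [
    {"course_cd": "TGP", "race_date": "2023-04-22", "race_number": 7},
    {"course_cd": "TGP", "race_date": "2023-04-22", "race_number": 7},
    {"course_cd": "LRL", "race_date": "2022-08-07", "race_number": None},
]
key = ("course_cd", "race_date", "race_number")
print(duplicate_keys(race_rows, key))
print(missing_key_rows(race_rows, key))
```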



```python
# Fetch the race-level attributes for a single race (LRL, 2022-08-07, race 1)
query = """
    SELECT todays_cls, distance, dist_unit, surface, stkorclm, purse,
           claimant, raceord, partim, dist_disp, stk_clm_md
    FROM racedata
    WHERE course_cd = 'LRL'
      AND race_date = '2022-08-07'
      AND race_number = 1
    LIMIT 10;
"""
df = pd.read_sql_query(query, engine)

# Display the DataFrame
df.head()
```

| | todays_cls | distance | dist_unit | surface | stkorclm | purse | claimant | raceord | partim | dist_disp | stk_clm_md |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.0 | 600.0 | F | D | CL | 35000.0 | 25000.0 | 1.0 | 111.35 | 6F | CLM |


```python
# Fetch a sample of rows from the runners table
query = "SELECT * FROM runners LIMIT 10;"
df = pd.read_sql_query(query, engine)

# Display the DataFrame
df.head()
```

| | saddle_cloth_number | course_cd | race_date | race_number | country | axciskey | post_position | todays_cls | owner_name | turf_mud_mark | ... | lst_salena | lst_salepr | lst_saleda | claimprice | avgspd | avgcls | apprweight | jock_key | train_key | post_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | TGP | 2023-04-22 | 7 | USA | 049056050052056056061062 | 3 | 92 | Daniel L Walters | M | ... | | 0.0 | 1970-01-01 | 0.0 | 85.0 | 0.0 | 0.0 | 85995 | 945912 | 15:43:00 |
| 1 | 4 | TGP | 2023-04-22 | 7 | USA | 049058050053052058055062 | 4 | 92 | Echo Papa Racing Corp | | ... | OBS YRG 2 YR & HRA SALE | 19000.0 | 2020-10-13 | 0.0 | 74.0 | 0.0 | 0.0 | 86192 | 162152 | 15:43:00 |
| 2 | 5 | TGP | 2023-04-22 | 7 | USA | 049057050052055056060062 | 5 | 92 | Steve Budhoo | M | ... | | 0.0 | 1970-01-01 | 0.0 | 84.0 | 0.0 | 0.0 | 156219 | 233911 | 15:43:00 |
| 3 | 6 | TGP | 2023-04-22 | 7 | USA | 049057050051060061059060 | 6 | 92 | Miracles International Trading Inc | MT | ... | OBS SPR 2YO 2020 | 17000.0 | 2020-06-09 | 0.0 | 81.0 | 0.0 | 0.0 | 129141 | 964704 | 15:43:00 |
| 4 | 7 | TGP | 2023-04-22 | 7 | USA | 049058050051052054054061 | 7 | 92 | Clap Embroidery | M | ... | | 0.0 | 1970-01-01 | 16000.0 | 80.0 | 0.0 | 0.0 | 162748 | 969418 | 15:43:00 |


```python
# Fetch a sample of rows from the gpspoint table
query = "SELECT * FROM gpspoint LIMIT 10"
df = pd.read_sql_query(query, engine)

# Display the DataFrame
df.head()
```

| | time_stamp | saddle_cloth_number | longitude | latitude | speed | progress | stride_frequency | location | course_cd | race_date | post_time | race_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-04-09 17:43:03.200 | 3 | -80.137641 | 25.98153 | 16.22 | 388.1 | 2.0 | 0101000020E610000078C59D1CCF0854C0D18B248B45FB... | TGP | 2023-04-09 | 13:38:00 | 3 |
| 1 | 2023-05-19 17:10:45.200 | 2 | -79.605812 | 43.711793 | 0.08 | 1106.4 | | 0101000020E610000030C26F9EC5E653C08681F80A1CDB... | TWO | 2023-05-19 | 13:10:00 | 1 |
| 2 | 2024-04-21 22:22:46.200 | 13 | -96.988121 | 32.773169 | 0.28 | 1609.3 | | 0101000020E61000000AC09A5E3D3F58C03A652431F762... | TLS | 2024-04-21 | 17:19:00 | 9 |
| 3 | 2024-04-21 22:22:47.200 | 10 | -96.988168 | 32.773176 | 0.3 | 1609.3 | | 0101000020E610000069DC40263E3F58C08187B36CF762... | TLS | 2024-04-21 | 17:19:00 | 9 |
| 4 | 2024-04-21 22:22:47.200 | 11 | -96.98815 | 32.773169 | 0.62 | 1609.3 | | 0101000020E6100000B7627FD93D3F58C064E4D132F762... | TLS | 2024-04-21 | 17:19:00 | 9 |


3. Establish Core Metrics

Start by defining key metrics for both racing performance and GPS data. These metrics will provide the foundation for understanding race performance.
	•	Performance Metrics (EQB Data):
		•	Average speed (avgspd in runners).
		•	Turf/Mud rating.
		•	Final position and pace at different stages.
		•	Split timings (if available in race_results).
		•	Jockey and trainer statistics.
	•	GPS Metrics (TPD Data):
		•	Speed over Time: Calculate speed differences at intervals using the gpsData table.
		•	Distance to Finish Line: Track distance changes over time to infer acceleration and deceleration points.
		•	Course Navigation: Map each horse’s GPS route using PostGIS, examining how they maneuver around the track.
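
For quick sanity checks outside the database, the distance covered along a GPS route can be approximated in plain Python with the haversine formula. This is a spherical approximation intended as a cross-check, not a replacement for the PostGIS calculations; the function names are hypothetical and the points are assumed to be (longitude, latitude) pairs in time order.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def path_length_m(points):
    """Total length in metres of a route given as (longitude, latitude) pairs."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))

# Two consecutive points taken from the gpspoint sample (TLS, race 9)
route = [(-96.988121, 32.773169), (-96.988168, 32.773176)]
print(path_length_m(route))
```

Dividing the distance between consecutive points by the difference in their timestamps gives an independent speed estimate that can be compared with the recorded speed column.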



4. Integrate EQB and TPD Data

	•	Route Alignment: Verify that GPS data aligns with EQB records based on course_cd, race_date, post_time, and race_number.
	•	Join GPS Data to Race Results: Create a view or query that merges gpsData with race_results for horses that match based on identifiers.
	•	Identify Unique Races: Extract distinct races from EQB and TPD data and identify missing matches.
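
Identifying unique races and missing matches comes down to comparing composite keys from the two sources. The sketch below uses Python sets on rows already fetched from each source; `match_report` and `race_keys` are hypothetical names. If `post_time` is stored differently in the two feeds (for example local time versus UTC), it may need to be normalised or dropped from the key first.

```python
# Key fields named in the Route Alignment step
RACE_KEY = ("course_cd", "race_date", "post_time", "race_number")

def race_keys(rows, key_fields=RACE_KEY):
    """Return the set of distinct race keys present in the rows."""
    return {tuple(r[f] for f in key_fields) for r in rows}

def match_report(eqb_rows, tpd_rows, key_fields=RACE_KEY):
    """Compare race keys from EQB and TPD rows.

    Returns a dict with the keys found in both sources and the keys
    found in only one of them.
    """
    eqb = race_keys(eqb_rows, key_fields)
    tpd = race_keys(tpd_rows, key_fields)
    return {
        "matched": eqb & tpd,
        "eqb_only": eqb - tpd,
        "tpd_only": tpd - eqb,
    }

# Hypothetical rows: one race present in both sources, one in EQB only
eqb_rows = [
    {"course_cd": "TGP", "race_date": "2023-04-22", "post_time": "15:43:00", "race_number": 7},
    {"course_cd": "LRL", "race_date": "2022-08-07", "post_time": "12:40:00", "race_number": 1},
]
tpd_rows = [
    {"course_cd": "TGP", "race_date": "2023-04-22", "post_time": "15:43:00", "race_number": 7},
    {"course_cd": "TGP", "race_date": "2023-04-22", "post_time": "15:43:00", "race_number": 7},
]
report = match_report(eqb_rows, tpd_rows)
print(len(report["matched"]), len(report["eqb_only"]), len(report["tpd_only"]))  # 1 1 0
```

Because the TPD table holds many GPS points per race, collapsing rows to a set of keys first keeps the comparison at race level.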



5. Perform Basic Exploratory Data Analysis (EDA)

	•	Race Outcome Analysis: Explore correlations between starting position, jockey/trainer, turf/mud conditions, and final outcome.
	•	Speed Trends: Use GPS data to plot speed over the course of a race for different horses. Look for acceleration and deceleration patterns.
	•	Spatial Analysis: Using PostGIS, visualize and map the race paths to analyze spatial trends.
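
As a starting point for the Race Outcome Analysis, a correlation between two numeric columns (for instance post position and finishing position) can be computed without any extra libraries. The function below is a plain Pearson correlation written out for clarity; because finishing positions are ordinal, applying it to ranks (which gives a Spearman-style measure when there are no ties) may be the more appropriate choice.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)

# Hypothetical post positions and finishing positions for one race
post_positions = [1, 2, 3, 4, 5]
finish_positions = [2, 1, 4, 3, 5]
print(pearson(post_positions, finish_positions))
```

A value near zero across many races would suggest that post position alone carries little signal, while categorical factors such as jockey or trainer need a grouped comparison instead of a correlation.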



6. Data Enrichment and Feature Engineering

	•	Create Derived Features: Based on initial findings, create new features such as average speed per segment, acceleration rates, and distance traveled in different track sections.
	•	Additional Geographic Features: Calculate spatial features like curvature, straightaway performance, and positioning relative to competitors.
	•	Temporal Aggregation: Aggregate data to capture seasonality and form trends for horses, jockeys, and trainers.
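
Average speed per segment can be derived by bucketing GPS samples on the `progress` column. The sketch assumes `progress` is the distance remaining to the finish in metres, which the sample rows suggest but the source does not state; the 200 m segment length and the function name are likewise illustrative choices.

```python
from collections import defaultdict

def segment_average_speed(samples, segment_m=200.0):
    """samples: iterable of (progress_m, speed) pairs for one horse in one race.

    Returns {segment index: mean speed}. Under the assumption that
    progress is distance to the finish, segment 0 is the final
    `segment_m` metres of the race.
    """
    buckets = defaultdict(list)
    for progress, speed in samples:
        if progress is None or speed is None:
            continue  # skip incomplete GPS samples
        buckets[int(progress // segment_m)].append(speed)
    return {seg: sum(v) / len(v) for seg, v in sorted(buckets.items())}

# Hypothetical samples: (progress in metres, speed)
samples = [(390.0, 16.0), (250.0, 18.0), (150.0, 17.0), (50.0, 15.0), (None, 1.0)]
print(segment_average_speed(samples))  # {0: 16.0, 1: 17.0}
```

Differences between consecutive segment averages then give a coarse acceleration feature per track section.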



7. Save and Document Metrics

	•	Database Views: Create database views for core metrics and common queries, simplifying future access and processing.
	•	Documentation: Document all queries, derived features, and assumptions. This will be essential for reproducibility and explaining model decisions.


8. Prepare Data for Machine Learning

	•	Feature Table Creation: Aggregate all relevant metrics into a single table for each race or horse, ensuring clean and well-structured data for model input.
	•	Normalization and Scaling: Apply scaling to GPS metrics and standardize EQB metrics where appropriate.
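
The two transformations mentioned above can be illustrated as follows. This is a minimal sketch with hypothetical function names; the important design point is that the mean, standard deviation, minimum, and maximum are fitted on the training data only and then reused for validation and test data, so that no information leaks from held-out races.

```python
import statistics

def fit_standardizer(train_values):
    """Return (mean, standard deviation) fitted on training values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def standardize(values, mean, sd):
    """Z-score values using previously fitted parameters."""
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

def fit_min_max(train_values):
    """Return (minimum, maximum) fitted on training values."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Scale values to the [0, 1] range defined by the fitted bounds."""
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical training column of GPS speeds
train_speeds = [10.0, 20.0, 30.0]
lo, hi = fit_min_max(train_speeds)
print(min_max_scale(train_speeds, lo, hi))  # [0.0, 0.5, 1.0]
```

Values from new races can fall outside the fitted range, in which case min-max scaling produces numbers below 0 or above 1; tree-based models such as XGBoost are largely insensitive to this, while neural networks may need clipping.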

