# Performance Prediction Based on Class Changes

## Model Recommendations: 

1. Logistic Regression: Good for binary outcomes (win/place vs. loss).
2. Random Forest or Gradient Boosting: Capture non-linear relationships, which is useful for modeling interactions between class changes and other variables.
3. XGBoost: A gradient boosting implementation that often performs well on tabular data, for both binary and multi-class classification.

## Key Data Attributes:

1. class rating: Measure of the race’s competitive level.
2. todays_cls: Indicates the current class level.
3. historical class changes: For each horse, this is a record of up/down movements over time.
4. trainer and jockey win rates: Historical data on trainer and jockey success rates for class transitions.
5. horse performance metrics: Past results, such as speed, average speed, and distance performance.

## Custom Metrics:

1. Class Change Rate: Share of a specific horse’s starts in which it ran at a different class level (up or down) than in its previous start.
2. Success Rate by Class: Percentage of positive outcomes (e.g., win/place) for each class change type for a specific horse.
3. Trainer-Jockey Class Shift Success: Win/place rate when trainer-jockey combinations have adjusted a horse’s class.
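A minimal sketch of the first two metrics, assuming a per-start table with hypothetical columns `horse_id`, `race_date`, `class_rating`, and `finish_pos` (none of these names are confirmed against the actual schema). A "positive outcome" is taken here to mean a top-4 finish, which is also an assumption.

```python
import pandas as pd

# Hypothetical per-start history; column names and values are illustrative only.
starts = pd.DataFrame({
    "horse_id":     ["A", "A", "A", "A", "B", "B", "B"],
    "race_date":    pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01",
                                    "2024-01-15", "2024-02-15", "2024-03-15"]),
    "class_rating": [80, 85, 85, 78, 70, 70, 75],
    "finish_pos":   [3, 6, 2, 1, 5, 4, 7],
})
starts = starts.sort_values(["horse_id", "race_date"]).reset_index(drop=True)

# Class movement relative to the same horse's previous start.
delta = starts.groupby("horse_id")["class_rating"].diff()
starts["class_change"] = "same"
starts.loc[delta > 0, "class_change"] = "up"
starts.loc[delta < 0, "class_change"] = "down"
starts.loc[delta.isna(), "class_change"] = "first_start"

# Assumed definition of a positive outcome: finishing in the top 4.
starts["positive"] = (starts["finish_pos"] <= 4).astype(int)

# Only starts that have a previous start can count as a class change.
has_prev = starts[starts["class_change"] != "first_start"]

# Class Change Rate: share of those starts run at a different class level.
class_change_rate = (
    has_prev["class_change"].isin(["up", "down"])
    .groupby(has_prev["horse_id"])
    .mean()
)

# Success Rate by Class: positive-outcome rate per horse and change type.
success_by_change = has_prev.groupby(["horse_id", "class_change"])["positive"].mean()

print(class_change_rate)
print(success_by_change)
```

These values summarize a horse's whole history. When they are used as model features, they would need to be recomputed for each race from earlier starts only.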

**NOTE:** The TPD dataset covers only a limited number of courses. Therefore I will create two sets of models:

> Tailor Models Based on Data Source

	•	For models relying on detailed timing, sectional, and GPS data (like pace or stride analysis), restrict training to the shared tracks.
	•	For models that don’t need TPD-specific attributes, use the full set of EQB tracks, but consider separating the training datasets into “shared” and “EQB-only” groups to compare results.


## Step 1: Define Your Prediction Goal Clearly

	•	Objective: Predict the probability of each horse finishing in the top 4 positions in upcoming races.
	•	Target Variable: A binary indicator for each horse (1 if the horse finishes in the top 4, 0 otherwise) for historical data.
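As a small sketch of the target definition, assuming a results table with a hypothetical `official_fin` column holding the finishing position (missing for horses that did not finish):

```python
import pandas as pd

# Hypothetical results for one historical race; None marks a non-finisher.
results = pd.DataFrame({
    "race_id":      [1, 1, 1, 1, 1, 1],
    "horse_id":     ["A", "B", "C", "D", "E", "F"],
    "official_fin": [1, 2, 3, 4, 5, None],
})

# 1 if the horse finished in the top 4, 0 otherwise.
# A missing position compares as False, so non-finishers are labelled 0 (an assumption).
results["top4"] = (results["official_fin"] <= 4).astype(int)

# Base rate of the positive class, useful as a sanity check before modelling.
base_rate = results["top4"].mean()
print(results[["horse_id", "top4"]], base_rate)
```

Whether non-finishers should be labelled 0 or dropped from training is a design choice this sketch does not settle.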

## Step 2: Gather Pre-Race Data Only

Collect data that is available before the race starts. This includes:
	1.	Horse Performance Metrics:
	•	Historical Speed Ratings: Average speed ratings from past races.
	•	Gate Splits and Sectional Times: From previous races.
	•	Finish Positions: Placings in recent races.
	2.	Horse Profile Information:
	•	Age and Sex: Basic demographics.
	•	Weight: Racing weight from past performances.
	•	Breeding Information: Sire and dam performance statistics.
	3.	Jockey and Trainer Data:
	•	Win/Loss Records: Historical performance metrics.
	•	Jockey-Trainer Combinations: Success rates when they team up.
	4.	Race Details:
	•	Race Class: The level of competition.
	•	Distance: Length of the race.
	•	Surface Type: Turf, dirt, synthetic.
	•	Track Conditions: Weather forecasts, track ratings.
	5.	Workout History:
	•	Recent Workouts: Times and frequencies.
	•	Training Patterns: Indications of fitness and readiness.
	6.	Betting Market Data:
	•	Morning Line Odds: Early odds estimate set by the track’s oddsmaker before betting opens.
	•	Public Sentiment: Media reports, expert picks.

## Step 3: Prepare Your Dataset

	1.	Create a Feature Matrix:
	•	Each row represents a horse in a specific historical race.
	•	Columns are the features listed above.
	2.	Ensure Data Consistency:
	•	Data Types: Convert categorical variables to numerical using encoding techniques.
	•	Missing Values: Impute or remove missing data appropriately.
	3.	Feature Engineering:
	•	Form Indicators: Recent performance trends (e.g., improvement over last three races).
	•	Performance Ratios: Speed rating relative to the average for that class.
	•	Experience Metrics: Number of races run, experience at the distance or surface.
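The feature-engineering items above can be sketched in pandas as follows. The column names (`horse_id`, `race_date`, `speed_rating`) are hypothetical. Each feature is shifted by one start so that a row only sees races the horse had already run.

```python
import pandas as pd

# Hypothetical past-performance rows, one per horse per race.
hist = pd.DataFrame({
    "horse_id":     ["A", "A", "A", "A", "B", "B", "B"],
    "race_date":    pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01",
                                    "2024-01-10", "2024-02-10", "2024-03-10"]),
    "speed_rating": [70, 74, 78, 80, 60, 66, 63],
})
hist = hist.sort_values(["horse_id", "race_date"]).reset_index(drop=True)

ratings = hist.groupby("horse_id")["speed_rating"]

# Experience metric: number of starts before this race.
hist["prior_starts"] = hist.groupby("horse_id").cumcount()

# Form indicator: mean of up to three previous ratings (shift(1) excludes the current race).
hist["avg_spd_last3"] = ratings.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

# Trend: most recent rating minus the one before it (positive means improving).
hist["spd_trend"] = ratings.transform(lambda s: s.shift(1) - s.shift(2))

print(hist)
```

A horse's first start has no history, so its form features are missing and need the imputation policy chosen under "Missing Values".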

## Step 4: Avoid Data Leakage

	1.	Exclude Post-Race Data:
	•	Do not include any data generated during or after the race you’re predicting (e.g., in-race GPS data, finishing positions for the race in question).
	2.	Temporal Separation:
	•	Split your data chronologically to prevent future information from influencing past predictions.
	•	Training Set: Use data from races up to a certain cutoff date.
	•	Validation/Test Set: Use data from races after the cutoff date.
	3.	Simulate Real-World Prediction Conditions:
	•	When evaluating your model, ensure that you’re only using information that would have been available before the race.
	4.	Feature Selection Discipline:
	•	Be cautious with features that could inadvertently include future information (e.g., final or closing betting odds, which are not known until betting closes and may reflect late or insider information).
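One common source of leakage is an aggregate computed over the whole dataset. The sketch below contrasts a leaky career win rate with a version built only from earlier rides; `jockey_id` and `won` are hypothetical column names.

```python
import pandas as pd

# Hypothetical ride history for a single jockey.
rides = pd.DataFrame({
    "race_date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15",
                                 "2024-01-22", "2024-01-29"]),
    "jockey_id": ["J1"] * 5,
    "won":       [1, 0, 1, 0, 0],
})
rides = rides.sort_values(["jockey_id", "race_date"]).reset_index(drop=True)

wins = rides.groupby("jockey_id")["won"]

# Leaky: the mean over all rides includes the current race and every later one.
rides["jockey_win_rate_leaky"] = wins.transform("mean")

# Safe: wins and rides strictly before the current row.
prior_wins = wins.cumsum() - rides["won"]
prior_rides = rides.groupby("jockey_id").cumcount()
rides["jockey_win_rate_prior"] = (prior_wins / prior_rides).where(prior_rides > 0)

print(rides)
```

When a jockey rides several races on the same day, the row order within that day decides what counts as "prior", so a real pipeline would also sort by post time or aggregate by day.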

## Step 5: Split Your Data Appropriately

	1.	Training, Validation, and Test Sets:
	•	Training Set: Historical data up to a specific date.
	•	Validation Set: The next set of races to fine-tune your model.
	•	Test Set: The most recent races to evaluate final model performance.
	2.	Cross-Validation:
	•	Use Time-Series Cross-Validation methods like Rolling Window Validation to respect the temporal order of races.
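A minimal sketch of the three-way chronological split, assuming a `race_date` column. The cutoff dates are arbitrary placeholders.

```python
import pandas as pd

# Hypothetical data: one row per month, standing in for race-day records.
races = pd.DataFrame({
    "race_date": pd.date_range("2023-01-01", periods=12, freq="MS"),
    "top4":      [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1],
})

# Placeholder cutoffs; in practice these depend on how much history is available.
train_end = pd.Timestamp("2023-08-31")
valid_end = pd.Timestamp("2023-10-31")

train = races[races["race_date"] <= train_end]
valid = races[(races["race_date"] > train_end) & (races["race_date"] <= valid_end)]
test = races[races["race_date"] > valid_end]

print(len(train), len(valid), len(test))
```

Splitting on the date rather than on row position keeps every runner from the same race day in the same set.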

## Step 6: Choose Essential Features

Focus on features that have strong predictive power and are available pre-race:
	1.	Performance Metrics:
	•	Average Speed Rating: From past races, adjusted for class and conditions.
	•	Consistency Indicators: Standard deviation of speed ratings.
	2.	Jockey/Trainer Statistics:
	•	Win Percentage: Overall and recent form.
	•	Track-Specific Performance: Success rates at the current track.
	3.	Horse Condition Indicators:
	•	Days Since Last Race: Indicators of rest or potential overtraining.
	•	Recent Workout Times: Fast workouts could indicate readiness.
	4.	Race Conditions:
	•	Class Changes: Moving up or down in class can impact performance.
	•	Distance Suitability: Historical performance at similar distances.
	5.	Breeding Information:
	•	Sire/Dam Performance: Success rates over certain distances or surfaces.

## Step 7: Feature Engineering and Transformation

	1.	Normalize Numerical Features:
	•	Scale features like speed ratings to have a mean of zero and a standard deviation of one.
	2.	Encode Categorical Variables:
	•	Use label encoding or one-hot encoding for variables like Track Surface or Race Class.
	3.	Interaction Terms:
	•	Create features that capture interactions (e.g., Jockey Win Rate × Horse Speed Rating).
	4.	Historical Trends:
	•	Include features that capture improvement or decline over time.
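The transformations above might look like the following in plain pandas. Column names are hypothetical. The scaling statistics and category levels are taken from the training data only and then reused for new races, so nothing from later data influences the transform.

```python
import pandas as pd

# Hypothetical training rows and one new race entry.
train = pd.DataFrame({
    "speed_rating":    [70.0, 80.0, 90.0],
    "jockey_win_rate": [0.10, 0.20, 0.30],
    "surface":         ["dirt", "turf", "dirt"],
})
new = pd.DataFrame({
    "speed_rating":    [85.0],
    "jockey_win_rate": [0.25],
    "surface":         ["synthetic"],  # level never seen in training
})

# Fit scaling statistics and category levels on the training set only.
mu = train["speed_rating"].mean()
sigma = train["speed_rating"].std(ddof=0)
surface_levels = sorted(train["surface"].unique())

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    # Standardize: zero mean and unit standard deviation on the training data.
    out["speed_z"] = (df["speed_rating"] - mu) / sigma
    # Interaction term: jockey win rate x standardized speed rating.
    out["jockey_x_speed"] = df["jockey_win_rate"] * out["speed_z"]
    # One-hot encode with fixed levels; unseen levels become all-zero rows.
    surface = df["surface"].astype(pd.CategoricalDtype(categories=surface_levels))
    dummies = pd.get_dummies(surface, prefix="surface", dtype=int)
    return pd.concat([out, dummies], axis=1)

X_train = transform(train)
X_new = transform(new)
print(X_train)
print(X_new)
```

Tree-based models such as XGBoost do not need the scaling step, while logistic regression usually benefits from it.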

## Step 8: Model Selection

	1.	Choose Appropriate Algorithms:
	•	Gradient Boosting Machines (e.g., XGBoost, LightGBM): Good for handling tabular data with both numerical and categorical variables.
	•	Random Forests: Useful as a benchmark model.
	•	Logistic Regression: For baseline comparisons.
	2.	Set Up Evaluation Metrics:
	•	Use metrics appropriate for classification and ranking, such as Accuracy, Precision, Recall, and AUC-ROC.
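The listed metrics are available in scikit-learn. The sketch below uses made-up labels and probabilities, and a 0.5 threshold chosen only for illustration. Log loss is included as an extra (an addition to the list above) because the stated objective is a probability rather than a hard label.

```python
from sklearn.metrics import (
    accuracy_score, log_loss, precision_score, recall_score, roc_auc_score
)

# Made-up outcomes (1 = top-4 finish) and predicted probabilities.
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.3, 0.1, 0.7, 0.8]

# Threshold-based metrics need hard labels; 0.5 is an arbitrary cut-off.
y_pred = [int(p >= 0.5) for p in y_prob]

scores = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "auc_roc":   roc_auc_score(y_true, y_prob),   # ranking quality, threshold-free
    "log_loss":  log_loss(y_true, y_prob),        # probability quality
}
print(scores)
```

Because only about four horses per race are positives, accuracy alone can look good for a model that ranks poorly, which is why the threshold-free metrics are worth tracking alongside it.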

## Step 9: Train and Validate Your Model

	1.	Training:
	•	Fit your model on the training set using the selected features.
	2.	Hyperparameter Tuning:
	•	Use the validation set to adjust model parameters for optimal performance.
	3.	Cross-Validation:
	•	Implement time-aware cross-validation to ensure robustness.
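A hedged sketch of training and validation-based tuning, using the logistic regression baseline on synthetic data. The features, the candidate values for `C`, and the choice of log loss as the selection metric are all assumptions. The rows are treated as already sorted by race date, so the first block is the training period and the rest is the validation period.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic stand-in for a feature matrix ordered by race date.
rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Chronological split: earlier rows train, later rows validate.
X_train, y_train = X[:400], y[:400]
X_valid, y_valid = X[400:], y[400:]

best_C, best_loss, best_model = None, np.inf, None
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate regularization strengths
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    loss = log_loss(y_valid, model.predict_proba(X_valid)[:, 1])
    if loss < best_loss:
        best_C, best_loss, best_model = C, loss, model

print(best_C, round(best_loss, 4))
```

The same loop applies to a gradient boosting model by swapping the estimator and the parameter grid. The test set stays untouched until the final evaluation in Step 10.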

## Step 10: Test and Evaluate Your Model

	1.	Final Evaluation:
	•	Assess your model on the test set to estimate real-world performance.
	2.	Check for Overfitting:
	•	Ensure that your model generalizes well to unseen data.
	3.	Analyze Feature Importance:
	•	Identify which features contribute most to predictions.

## Step 11: Prepare for Deployment

	1.	Set Up Prediction Pipeline:
	•	Automate data collection and feature engineering steps for new races.
	2.	Monitoring:
	•	Continuously monitor model performance and update as necessary.
	3.	Ethical Considerations:
	•	Ensure responsible use of predictions, especially if sharing with others.

## Step 12: Continual Learning and Improvement

	1.	Feedback Loop:
	•	Incorporate results from recent races to retrain and improve your model.
	2.	Stay Updated with Domain Knowledge:
	•	Keep abreast of changes in horse racing (e.g., new jockeys, changes in track conditions).
	3.	Experimentation:
	•	Try additional features or different modeling techniques to enhance performance.

# Note (CRITICAL): Avoid Cross-Contamination

	1.	No Future Data in Features:
	•	Ensure that features do not inadvertently include information from future races (e.g., cumulative statistics that include future performances).
	2.	Isolation of Target Variable:
	•	The target variable (e.g., whether the horse finished in the top 4) should only correspond to the race being predicted and not be influenced by future results.
	3.	Implement Time-Aware Cross-Validation:
	•	Use scikit-learn’s TimeSeriesSplit or similar methods that respect temporal order.
	•	Avoid random shuffling of data, which can mix future information into the training set.
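A sketch of time-aware cross-validation with scikit-learn’s `TimeSeriesSplit`. The splitter works on row positions, so this example splits the sorted unique race dates and maps them back to runner rows, which keeps all runners from one race day in the same fold. That mapping is a design choice of this sketch, and `race_date` is an assumed column name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical runner rows: 8 race days, 3 runners per day.
race_days = pd.date_range("2024-01-01", periods=8, freq="7D")
runners = pd.DataFrame({"race_date": race_days.repeat(3)})

# Split over unique dates, not rows, so a race day is never divided.
unique_dates = pd.DatetimeIndex(runners["race_date"].unique()).sort_values()
tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_pos, test_pos in tscv.split(np.arange(len(unique_dates))):
    train_mask = runners["race_date"].isin(unique_dates[train_pos])
    test_mask = runners["race_date"].isin(unique_dates[test_pos])
    folds.append((runners.index[train_mask], runners.index[test_mask]))

for train_idx, test_idx in folds:
    print(len(train_idx), "train rows ->", len(test_idx), "test rows")
```

Each fold trains on an expanding window of earlier race days and tests on the days that follow. Passing `max_train_size` to `TimeSeriesSplit` gives a fixed-length rolling window.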

In [6]:
# Environment setup

import logging
import os
import pandas as pd
from sqlalchemy import create_engine
import geopandas as gpd
from datetime import datetime
import configparser
from src.data_ingestion.ingestion_utils import (
    get_db_connection, update_tracking, load_processed_files
)
from src.data_ingestion.eqb_ppData import process_pluspro_data
from src.data_ingestion.eqb_resultsCharts import process_resultscharts_data
from src.data_ingestion.tpd_datasets import (
    process_tpd_sectionals_data,
    process_tpd_gpsdata_data
)

# Load the configuration file
config = configparser.ConfigParser()
config.read('/home/exx/myCode/horse-racing/FoxRiverAIRacing/config.ini')

# Set up logging for consistent logging behavior in Notebook
logging.basicConfig(level=logging.INFO)

# Retrieve database credentials from config file
db_host = config['database']['host']
db_port = config['database']['port']
db_name = config['database']['dbname']  # Corrected from 'name' to 'dbname'
db_user = config['database']['user']

# Establish connection using get_db_connection
conn = get_db_connection(config)

# Create the SQLAlchemy engine
engine = create_engine(f'postgresql+psycopg2://{db_user}@{db_host}:{db_port}/{db_name}')


### Value of partim (par time)

	•	Adjust Horse Times: Compare each horse’s actual time to the par time, creating a new metric such as “time relative to par” or “par-adjusted time.”
	•	Feature Engineering: Use par time to engineer features like “exceeds par by X seconds” or “performs within Y% of par,” which could contribute valuable insights for class change and win probability models.
	•	Combine with Class Code: Analyzing par time alongside stkorclm (Today’s Class Code) may also reveal more about class-specific performance trends, especially when predicting success rates for horses switching between classes.
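The par-time features described above could be derived as follows. This is a sketch that assumes `partim` and a hypothetical `final_time` column are both in seconds, and that the rows are past races, so the result is usable as a pre-race feature for later starts.

```python
import pandas as pd

# Hypothetical past performances; units are assumed to be seconds.
past = pd.DataFrame({
    "horse_id":   ["A", "B", "C"],
    "final_time": [70.2, 71.5, 69.8],
    "partim":     [70.0, 70.0, 70.0],
})

# Time relative to par: negative means faster than par.
past["secs_vs_par"] = past["final_time"] - past["partim"]
past["pct_vs_par"] = 100.0 * past["secs_vs_par"] / past["partim"]

# Indicator features; the 1% band is an arbitrary illustration of "within Y% of par".
past["beat_par"] = (past["secs_vs_par"] < 0).astype(int)
past["within_1pct_of_par"] = (past["pct_vs_par"].abs() <= 1.0).astype(int)

print(past)
```

Averaging `secs_vs_par` over a horse's recent starts, split by stkorclm, would be one way to connect these features to the class-change analysis.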

In [None]:
query = """
WITH racedata_classes AS (
    -- Class and speed attributes per runner (not yet used by the final SELECT).
    -- Both tables expose todays_cls, so each copy gets its own alias.
    SELECT rd.todays_cls AS race_todays_cls,
           r.todays_cls AS runner_todays_cls,
           r.avgcls, r.avgspd, rs.ave_cl_sd, rs.avg_spd_sd, rs.hi_spd_sd
    FROM racedata rd
    JOIN runners r ON rd.course_cd = r.course_cd
        AND rd.race_date = r.race_date
        AND rd.post_time = r.post_time
        AND rd.race_number = r.race_number
    JOIN runners_stats rs ON r.course_cd = rs.course_cd
        AND r.race_date = rs.race_date
        AND r.post_time = rs.post_time
        AND r.race_number = rs.race_number
        AND r.saddle_cloth_number = rs.saddle_cloth_number
),
trainer_win_rates AS (
    -- Placeholder: the table `races` and its columns are not confirmed against the schema.
    SELECT trainer_id, AVG(win) AS win_rate
    FROM races
    -- WHERE <your conditions here>
    GROUP BY trainer_id
),
horse_performance_metrics AS (
    -- Placeholder: assumes `performance_data` carries trainer_id alongside horse_id.
    SELECT horse_id, trainer_id, AVG(speed) AS avg_speed, AVG(class_rating) AS avg_class_rating
    FROM performance_data
    -- WHERE <your conditions here>
    GROUP BY horse_id, trainer_id
)
SELECT hpm.horse_id, t.trainer_id, hpm.avg_speed, hpm.avg_class_rating, t.win_rate
FROM trainer_win_rates t
-- The earlier draft joined trainer_id to horse_id, which compares unrelated keys.
JOIN horse_performance_metrics hpm ON t.trainer_id = hpm.trainer_id
"""

# Execute the query and load it into a DataFrame
df = pd.read_sql_query(query, engine)

# Display the DataFrame
df.head()