# Performance Prediction Based on Class Changes

## Model Recommendations: 

1. Logistic Regression: Good for binary outcomes (win/place vs. loss).
2. Random Forest or Gradient Boosting: Capture non-linear relationships, which is useful for modeling interactions between class changes and other variables.
3. XGBoost: A gradient boosting implementation that often performs well on tabular data, for both binary and multi-class classification.
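As a rough illustration of how these candidates could be compared, the sketch below fits three classifiers on synthetic data with scikit-learn. The data is a stand-in generated by `make_classification`, not real race data, and scikit-learn's `GradientBoostingClassifier` is used in place of XGBoost so the example has only one dependency. The function name `compare_models` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def compare_models(random_state=0):
    """Fit three candidate classifiers on synthetic data and return test accuracy."""
    # Synthetic binary-outcome data (assumption: stands in for win/place vs. loss).
    X, y = make_classification(
        n_samples=500, n_features=8, n_informative=4, random_state=random_state
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
        "gradient_boosting": GradientBoostingClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # score() reports mean accuracy on the held-out split.
        scores[name] = model.score(X_test, y_test)
    return scores


if __name__ == "__main__":
    for name, acc in compare_models().items():
        print(f"{name}: accuracy={acc:.3f}")
```

Accuracy on synthetic data says nothing about real race outcomes; the point is only to confirm that all three models run through the same fit/score loop.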

## Key Data Attributes:

1. class rating: Measure of the race’s competitive level.
2. todays_cls: Indicates the current class level.
3. historical class changes: For each horse, this is a record of up/down movements over time.
4. trainer and jockey win rates: Historical data on trainer and jockey success rates for class transitions.
5. horse performance metrics: Past results, such as speed, average speed, and distance performance.
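One way the historical class changes could be derived from per-start records is sketched below. The record keys `horse_id` and `race_date` are hypothetical, and the sketch assumes that a larger `todays_cls` value means a tougher class; flip the comparison if the scale runs the other way.

```python
from collections import defaultdict


def label_class_changes(starts):
    """Label each start 'first', 'up', 'down', or 'same' versus the horse's prior start.

    `starts` is an iterable of dicts with keys horse_id, race_date (ISO string),
    and todays_cls. Assumption: a higher todays_cls means a tougher class.
    """
    by_horse = defaultdict(list)
    for s in starts:
        by_horse[s["horse_id"]].append(s)

    labeled = []
    for rows in by_horse.values():
        # ISO date strings sort chronologically.
        rows.sort(key=lambda r: r["race_date"])
        prev_cls = None
        for r in rows:
            if prev_cls is None:
                change = "first"
            elif r["todays_cls"] > prev_cls:
                change = "up"
            elif r["todays_cls"] < prev_cls:
                change = "down"
            else:
                change = "same"
            # Copy the record so the caller's dicts are not modified.
            labeled.append({**r, "class_change": change})
            prev_cls = r["todays_cls"]
    return labeled
```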

## Custom Metrics:

1. Class Change Rate: Fraction of a specific horse’s starts in which it moved up or down in class relative to its previous start.
2. Success Rate by Class: Percentage of positive outcomes (e.g., win/place) for each class change type for a specific horse.
3. Trainer-Jockey Class Shift Success: Win/place rate when trainer-jockey combinations have adjusted a horse’s class.
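The three metrics above could be computed along the following lines. This is a sketch under stated assumptions: each start already carries a class-change label (`"up"`, `"down"`, `"same"`, or `"first"` for a horse's first recorded start) and a boolean success flag for win/place. The function names and input shapes are hypothetical.

```python
from collections import Counter


def class_change_rate(changes):
    """Fraction of a horse's comparable starts that were a move up or down.

    Starts labeled 'first' have no prior start to compare against and are skipped.
    """
    comparable = [c for c in changes if c in ("up", "down", "same")]
    if not comparable:
        return 0.0
    moved = sum(1 for c in comparable if c != "same")
    return moved / len(comparable)


def success_rate_by_change(records):
    """Win/place rate per class-change type from (class_change, is_success) pairs."""
    totals, hits = Counter(), Counter()
    for change, ok in records:
        totals[change] += 1
        hits[change] += int(ok)
    return {change: hits[change] / totals[change] for change in totals}


def trainer_jockey_shift_success(records):
    """Win/place rate per (trainer, jockey) pair, counting only class-change starts.

    `records` holds (trainer, jockey, class_change, is_success) tuples.
    """
    totals, hits = Counter(), Counter()
    for trainer, jockey, change, ok in records:
        if change in ("up", "down"):
            totals[(trainer, jockey)] += 1
            hits[(trainer, jockey)] += int(ok)
    return {pair: hits[pair] / totals[pair] for pair in totals}
```

Rates built from only a handful of starts will be noisy, so a minimum-sample threshold may be worth adding before these are used as model features.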

# Notes -- delete later:

To test your system, especially with GPUs, I recommend starting with a model that balances computational demand and efficiency while taking advantage of parallel processing on GPUs. Here are three models to consider, listed in order of increasing complexity and computational intensity:

1. Random Forest with Scikit-Learn

	•	Why: Random Forest is easy to set up and provides a good performance baseline. It’s also highly parallelizable, which can utilize multiple CPU cores efficiently.
	•	GPU Compatibility: While Scikit-Learn itself doesn’t natively support GPU acceleration, RAPIDS (from NVIDIA) has a library called cuML that includes a GPU-accelerated version of Random Forest.
	•	Implementation:
		•	Start with Scikit-Learn’s RandomForestClassifier to verify data pipelines and performance on CPU.
		•	Optionally, try cuML’s RandomForestClassifier to utilize GPU(s) on your RTX A6000 cards.

2. XGBoost with GPU Support

	•	Why: XGBoost is widely used for structured/tabular data and has excellent GPU support built in. It handles complex, non-linear data well and is optimized for performance, especially with large datasets.
	•	GPU Compatibility: XGBoost can leverage your GPUs for faster training.
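	•	Implementation:
		•	Set tree_method='gpu_hist' in XGBClassifier to enable GPU acceleration. XGBoost 2.0 and later deprecate this value in favor of tree_method='hist' together with device='cuda'.
		•	This is a good choice for benchmarking GPU performance and testing your system’s stability under heavy loads.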
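Because the GPU setting changed between XGBoost releases, a small helper can pick the parameters from the installed version. The sketch below is one possible approach; the helper name `xgb_gpu_params` is hypothetical, and it only builds a parameter dict, so it runs without a GPU or XGBoost installed.

```python
def xgb_gpu_params(xgb_version):
    """Return GPU-related XGBoost parameters for the given version string."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ selects the device separately from the tree method.
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases used a dedicated GPU tree method.
    return {"tree_method": "gpu_hist"}


# Intended usage (requires xgboost and a CUDA-capable GPU):
#   import xgboost
#   model = xgboost.XGBClassifier(**xgb_gpu_params(xgboost.__version__))
```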

3. LightGBM with GPU Support

	•	Why: LightGBM is similar to XGBoost but typically faster and more memory-efficient, especially on large datasets with high-dimensional data. It’s well-suited for imbalanced datasets and allows for efficient handling of categorical data.
	•	GPU Compatibility: LightGBM can use your GPUs for accelerated training.
	•	Implementation:
		•	Use device='gpu' in LGBMClassifier to leverage GPU acceleration.
		•	LightGBM’s memory efficiency can help gauge your system’s ability to handle large, complex datasets.

Suggested Approach to Testing

	1.	Data Preparation: Use a subset of your data initially to ensure models are set up correctly and refine your feature engineering process.
	2.	Hyperparameter Tuning: Keep parameters minimal for initial tests, then gradually increase complexity, exploring tuning options.
	3.	Benchmark: Start with CPU tests for Random Forest, then move to GPU-accelerated XGBoost or LightGBM.
	4.	System Monitoring: Track GPU and CPU usage, memory consumption, and training times.
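For the benchmarking and monitoring steps, a minimal sketch using only the standard library might look like this. It times a block of code and, if `nvidia-smi` is on the PATH, reads GPU utilization and memory use from it. The names `timed` and `gpu_snapshot` are hypothetical, and values are kept as strings because `nvidia-smi` can report fields as unavailable.

```python
import shutil
import subprocess
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock seconds for the enclosed block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def gpu_snapshot():
    """Return per-GPU utilization (%) and memory used (MiB), or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    snapshot = []
    for line in out.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        if len(parts) == 2:
            snapshot.append({"util_pct": parts[0], "mem_used_mib": parts[1]})
    return snapshot


if __name__ == "__main__":
    results = {}
    with timed("dummy_training", results):
        sum(i * i for i in range(1_000_000))  # placeholder for model.fit(...)
    print(results, gpu_snapshot())
```

A single snapshot taken after training will miss peak usage; polling `gpu_snapshot()` from a background thread during the fit would give a fuller picture.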

Given your goal of performance testing, starting with XGBoost using GPU acceleration would likely provide the best balance of computational load and insight into your system’s capabilities.


In [2]:
# Environment setup
import logging
import os
import pandas as pd
from sqlalchemy import create_engine
import geopandas as gpd
from datetime import datetime
import configparser
from src.data_ingestion.ingestion_utils import (
    get_db_connection, update_tracking, load_processed_files
)
from src.data_ingestion.eqb_ppData import process_pluspro_data
from src.data_ingestion.eqb_resultsCharts import process_resultscharts_data
from src.data_ingestion.tpd_datasets import (
    process_tpd_sectionals_data,
    process_tpd_gpsdata_data
)

# Load the configuration file
config = configparser.ConfigParser()
config.read('/home/exx/myCode/horse-racing/FoxRiverAIRacing/config.ini')

# Set up logging for consistent logging behavior in Notebook
logging.basicConfig(level=logging.INFO)

# Retrieve database credentials from config file
db_host = config['database']['host']
db_port = config['database']['port']
db_name = config['database']['dbname']  # Corrected from 'name' to 'dbname'
db_user = config['database']['user']

# Establish connection using get_db_connection
conn = get_db_connection(config)

# Create the SQLAlchemy engine
engine = create_engine(f'postgresql+psycopg2://{db_user}@{db_host}:{db_port}/{db_name}')

ModuleNotFoundError: No module named 'ingestion_utils'
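The traceback names the bare module `ingestion_utils`, while the cell imports it as `src.data_ingestion.ingestion_utils`. That suggests one of the imported project modules refers to it without the package prefix, though this is an inference and the source does not confirm it. One possible workaround, sketched below, is to put both the project root and the package directory on `sys.path` before the imports run. The helper name `add_import_paths` and the default `src/data_ingestion` subdirectory are assumptions based on the import lines above.

```python
import sys
from pathlib import Path


def add_import_paths(project_root, package_subdir="src/data_ingestion"):
    """Prepend the project root and a package subdirectory to sys.path if missing.

    Returns the list of entries that were actually added.
    """
    root = Path(project_root).resolve()
    added = []
    for path in (root / package_subdir, root):
        entry = str(path)
        if entry not in sys.path:
            sys.path.insert(0, entry)
            added.append(entry)
    return added
```

Calling this with the project directory at the top of the notebook, before the `src.data_ingestion` imports, would let both the prefixed and the bare import styles resolve. Changing the offending module to use the full package path is the cleaner long-term fix.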