# Proposed Ensemble Models

Given the constraints and objectives, I recommend considering the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: CNNs learn hierarchies of local patterns and are effective at recognizing them in sequences, potentially identifying features like sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.
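
To make the 1D-CNN idea from Model 2 concrete, here is a minimal plain-Python sketch of one filter sliding across the time dimension of a speed sequence. The sample speeds, the hand-picked kernel, and the helper name `conv1d_valid` are illustrative assumptions; a real 1D-CNN learns its kernels during training.

```python
from typing import List, Sequence


def conv1d_valid(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Slide `kernel` across `signal` (no padding, no kernel flip, as deep-learning
    frameworks do) and return the dot product at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical speed readings (m/s), one per GPS sample; note the jump between index 3 and 4.
speeds = [16.0, 16.1, 16.0, 16.1, 18.5, 18.6, 18.5]

# A hand-picked difference kernel: responds strongly where speed changes between samples.
diff_kernel = [-1.0, 1.0]

response = conv1d_valid(speeds, diff_kernel)

# Position of the strongest response, i.e. the most abrupt change in speed.
peak_index = max(range(len(response)), key=lambda i: abs(response[i]))
print(peak_index, response[peak_index])
```

The filter's response peaks where the speed jumps, which is the kind of local pattern a trained convolutional layer could pick up on its own.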

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	Similar to LSTMs but with a simpler architecture, GRUs can be more efficient and may perform better on certain datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.


### Validate GPU Setup

In [6]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/home/exx/anaconda3/envs/mamba_env/envs/tf_310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/exx/anaconda3/envs/mamba_env/envs/tf_310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/exx/anaconda3/envs/mamba_env/envs/tf_310/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/home/exx/anaconda3/envs/mamba_env/envs/tf_310/lib/python3.10/site-pa

AttributeError: _ARRAY_API not found

SystemError: initialization of _pywrap_checkpoint_reader raised unreported exception

In [7]:
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs Available:", len(physical_devices))

NameError: name 'tf' is not defined

In [8]:
for device in physical_devices:
    print(device)

NameError: name 'physical_devices' is not defined
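
The three failures above all stem from the first one: TensorFlow was compiled against NumPy 1.x but the environment has NumPy 2.2.0, so `tf` never gets defined. A small pre-flight check, sketched below with the standard library only, could surface this before the import is attempted. The helper names are hypothetical, and the suggested remedy simply repeats the error message's own advice.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '2.2.0'."""
    return int(version_string.split(".")[0])


def installed_numpy_major() -> Optional[int]:
    """Look up the installed NumPy major version without importing NumPy itself."""
    try:
        return major_version(version("numpy"))
    except PackageNotFoundError:
        return None


numpy_major = installed_numpy_major()
if numpy_major is not None and numpy_major >= 2:
    print("NumPy 2.x detected; a TensorFlow build compiled against NumPy 1.x may fail to import.")
    print("Per the error message, consider installing 'numpy<2' in this environment.")
```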

#### Make sure JAVA_HOME is set to version 11 (source .zshrc in the same folder where you start Jupyter)

In [9]:
!echo $JAVA_HOME
!java --version

/usr/lib/jvm/java-11-openjdk
openjdk 11.0.25 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.25.0.9-1) (build 11.0.25+9-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.25.0.9-1) (build 11.0.25+9-LTS, mixed mode, sharing)


# The LSTM Network on Raw GPS Data

Initially, I wanted to merge the GPS data with the Sectionals data, but the two are indexed differently (GPS by timestamp, Sectionals by gate_name), which made it difficult to align them into sequences, something that Long Short-Term Memory (LSTM) models require. I therefore decided on an ensemble approach. Additional models will incorporate Equibase data as well, but for the time being the focus is on Total Performance GPS data.

In [2]:
# Environment Setup

import os
import logging
import configparser
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min as spark_min, sum as spark_sum, from_utc_timestamp, expr
from pyspark.sql.window import Window

# Load configuration file
config = configparser.ConfigParser()
config.read('/home/exx/myCode/horse-racing/FoxRiverAIRacing/config.ini')

# Database credentials from config
db_host = config['database']['host']
db_port = config['database']['port']
db_name = config['database']['dbname']
db_user = config['database']['user']
db_password = os.getenv("DB_PASSWORD")  # Read from the environment only; no hard-coded fallback

# Validate database password
if not db_password:
    raise ValueError("Database password is missing. Set it in the DB_PASSWORD environment variable.")

# JDBC URL and properties
jdbc_url = f"jdbc:postgresql://{db_host}:{db_port}/{db_name}"
jdbc_properties = {
    "user": db_user,
    "password": db_password,
    "driver": "org.postgresql.Driver"
}

# Path to JDBC driver
jdbc_driver_path = "/home/exx/myCode/horse-racing/FoxRiverAIRacing/jdbc/postgresql-42.7.4.jar"

# Configure logging
log_file = "/home/exx/myCode/horse-racing/FoxRiverAIRacing/logs/SparkPy_load.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()
    ]
)
logging.info("Environment setup initialized.")

# Initialize Spark session
def initialize_spark():
    spark = SparkSession.builder \
        .appName("Horse Racing Data Processing") \
        .config("spark.driver.extraClassPath", jdbc_driver_path) \
        .config("spark.executor.extraClassPath", jdbc_driver_path) \
        .config("spark.driver.memory", "64g") \
        .config("spark.executor.memory", "32g") \
        .config("spark.executor.memoryOverhead", "8g") \
        .config("spark.sql.debug.maxToStringFields", "1000") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") \
        .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    logging.info("Spark session created successfully.")
    return spark

# Initialize Spark
spark = initialize_spark()

2024-12-25 21:59:15,192 - INFO - Environment setup initialized.


AttributeError: module 'numpy' has no attribute 'ndarray'

In [8]:
# Data Loading and Transformation

from pyspark.sql.functions import col, min as spark_min, sum as spark_sum, unix_timestamp, to_timestamp, expr
from pyspark.sql.window import Window

# Define SQL queries without trailing semicolons
queries = {
    "results": """
        SELECT vre.course_cd, vre.race_date, vre.race_number, vre.program_num AS saddle_cloth_number, vre.post_pos,
               h.horse_id, vre.official_fin, vre.finish_time, vre.speed_rating, vr.todays_cls,
               vr.previous_surface, vr.previous_class, vr.net_sentiment
        FROM v_results_entries vre
        JOIN v_runners vr 
            ON vre.course_cd = vr.course_cd
            AND vre.race_date = vr.race_date
            AND vre.race_number = vr.race_number
            AND vre.program_num = vr.saddle_cloth_number
        JOIN horse h 
            ON vre.axciskey = h.axciskey
        WHERE vre.breed = 'TB'
        GROUP BY vre.course_cd, vre.race_date, vre.race_number, vre.program_num, vre.post_pos,
                 h.horse_id, vre.official_fin, vre.finish_time, vre.speed_rating, vr.todays_cls,
                 vr.previous_surface, vr.previous_class, vr.net_sentiment
    """,
    "sectionals": """
        SELECT course_cd, race_date, race_number, saddle_cloth_number, gate_name, 
               gate_numeric, length_to_finish, sectional_time, running_time, 
               distance_back, distance_ran, number_of_strides
        FROM v_sectionals
    """,
    "gpspoint": """
        SELECT course_cd, race_date, race_number, saddle_cloth_number, time_stamp, 
               longitude, latitude, speed, progress, stride_frequency, post_time, location
        FROM v_gpspoint
    """
}

# Path to save Parquet files
parquet_dir = "/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/"
os.makedirs(parquet_dir, exist_ok=True)

# Load data directly from PostgreSQL into Spark DataFrames and save as Parquet
dfs = {}
for name, query in queries.items():
    logging.info(f"Loading {name} data from PostgreSQL...")
    try:
        df = spark.read.jdbc(url=jdbc_url, table=f"({query}) AS subquery", properties=jdbc_properties)
        output_path = os.path.join(parquet_dir, f"{name}.parquet")
        logging.info(f"Saving {name} DataFrame to Parquet at {output_path}...")
        df.write.mode("overwrite").parquet(output_path)
        dfs[name] = df
        logging.info(f"{name} data loaded and saved successfully.")
    except Exception as e:
        logging.error(f"Error loading {name} data: {e}")
        raise

# Reload Parquet files into Spark DataFrames for processing
logging.info("Reloading Parquet files into Spark DataFrames for transformation...")
results_df = spark.read.parquet(os.path.join(parquet_dir, "results.parquet"))
sectionals_df = spark.read.parquet(os.path.join(parquet_dir, "sectionals.parquet"))
gps_df = spark.read.parquet(os.path.join(parquet_dir, "gpspoint.parquet"))
logging.info("Parquet files reloaded successfully.")



2024-12-06 20:45:54,482 - INFO - Loading results data from PostgreSQL...
2024-12-06 20:45:55,637 - INFO - Saving results DataFrame to Parquet at /home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/results.parquet...
2024-12-06 20:45:58,430 - INFO - results data loaded and saved successfully.    
2024-12-06 20:45:58,431 - INFO - Loading sectionals data from PostgreSQL...
2024-12-06 20:45:58,451 - INFO - Saving sectionals DataFrame to Parquet at /home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/sectionals.parquet...
2024-12-06 20:46:07,526 - INFO - sectionals data loaded and saved successfully. 
2024-12-06 20:46:07,527 - INFO - Loading gpspoint data from PostgreSQL...
2024-12-06 20:46:07,547 - INFO - Saving gpspoint DataFrame to Parquet at /home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/gpspoint.parquet...
2024-12-06 20:47:41,554 - INFO - gpspoint data loaded and saved successfully.   
2024-12-06 20:47:41,554 - INFO - Reloading Parquet files into Spark DataF

In [43]:
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType
from datetime import timedelta

# Define the UDF to add seconds (including fractional seconds) to a timestamp
def add_seconds(ts, seconds):
    if ts is None or seconds is None:
        return None
    return ts + timedelta(seconds=seconds)

# Register the UDF
add_seconds_udf = udf(add_seconds, TimestampType())

In [97]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    unix_timestamp,
    expr,
    min as spark_min,
    sum as spark_sum,
    date_format
)
from pyspark.sql.window import Window
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf
from datetime import timedelta

# Clear all cached data
spark.catalog.clearCache()

# Reload the DataFrames from Parquet files
gps_df = spark.read.parquet("/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/gpspoint.parquet")
sectionals_df = spark.read.parquet("/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/sectionals.parquet")

# Convert time_stamp to timestamp type
gps_df = gps_df.withColumn("time_stamp", col("time_stamp").cast("timestamp"))

# Print a sample of the time_stamp column to check for millisecond precision
print("Sample of gps_df time_stamp column:")
gps_df.select("time_stamp").show(10, truncate=False)


Sample of gps_df time_stamp column:
+---------------------+
|time_stamp           |
+---------------------+
|2023-10-15 16:37:05.2|
|2023-12-10 17:23:03.2|
|2023-12-10 17:23:04.2|
|2023-12-10 17:23:05.2|
|2023-12-10 17:23:06.2|
|2023-12-10 17:23:08.2|
|2023-10-15 16:37:06.2|
|2023-12-10 17:23:10.2|
|2023-12-10 17:23:11.2|
|2023-12-10 17:23:12.2|
+---------------------+
only showing top 10 rows



In [98]:
# Step 1: Calculate the earliest 'time_stamp' for each horse in each race
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]

first_time_df = gps_df.groupBy(*race_id_cols).agg(
    spark_min("time_stamp").alias("earliest_time_stamp")
)

# Step 2: Join 'first_time_df' with 'sectionals_df' to associate each sectional with that horse's first GPS timestamp
sectionals_df = sectionals_df.join(
    first_time_df,
    on=race_id_cols,
    how="left"
)

# Step 3: Sort 'sectionals_df' by 'gate_numeric' to ensure correct order of gates
sectionals_df = sectionals_df.orderBy(*race_id_cols, "gate_numeric")

# Step 4: Define the window specification for cumulative sum
window_spec = Window.partitionBy(*race_id_cols).orderBy("gate_numeric").rowsBetween(Window.unboundedPreceding, 0)

# Step 5: Compute cumulative sum of 'sectional_time' for each horse in each race
sectionals_df = sectionals_df.withColumn(
    "cumulative_sectional_time",
    spark_sum("sectional_time").over(window_spec)
)

In [99]:
# Step 6: Define the UDF to add seconds (including fractional seconds) to a timestamp
def add_seconds(ts, seconds):
    if ts is None or seconds is None:
        return None
    return ts + timedelta(seconds=seconds)

# Register the UDF
add_seconds_udf = udf(add_seconds, TimestampType())

# Step 7: Create 'sec_time_stamp' by adding 'cumulative_sectional_time' to 'earliest_time_stamp' using the UDF
sectionals_df = sectionals_df.withColumn(
    "sec_time_stamp",
    add_seconds_udf(col("earliest_time_stamp"), col("cumulative_sectional_time"))
)

# Step 8: Drop intermediate columns if no longer needed
sectionals_df = sectionals_df.drop("earliest_time_stamp", "cumulative_sectional_time")

# Show a sample of the results
print("Sample of sectionals_df with sec_time_stamp:")
sectionals_df.select(
    "course_cd",
    "race_date",
    "race_number",
    "saddle_cloth_number",
    "gate_numeric",
    "gate_name",
    "sectional_time",
    "sec_time_stamp"
).show(10, truncate=False)


Sample of sectionals_df with sec_time_stamp:


                                                                                

+---------+----------+-----------+-------------------+------------+---------+--------------+----------------------+
|course_cd|race_date |race_number|saddle_cloth_number|gate_numeric|gate_name|sectional_time|sec_time_stamp        |
+---------+----------+-----------+-------------------+------------+---------+--------------+----------------------+
|AQU      |2023-01-05|1          |6                  |0.5         |0.5f     |6.89          |2023-01-05 17:53:06.09|
|AQU      |2023-01-05|1          |6                  |1.0         |1f       |6.66          |2023-01-05 17:53:12.75|
|AQU      |2023-01-05|1          |6                  |1.5         |1.5f     |7.25          |2023-01-05 17:53:20   |
|AQU      |2023-01-05|1          |6                  |2.0         |2f       |6.15          |2023-01-05 17:53:26.15|
|AQU      |2023-01-05|1          |6                  |2.5         |2.5f     |7.07          |2023-01-05 17:53:33.22|
|AQU      |2023-01-05|1          |6                  |3.0         |3f   
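
Steps 4 to 7 boil down to a running total of sectional times added to the first GPS timestamp. The plain-Python sketch below illustrates that arithmetic for the first three gates shown above. The start time is an assumption inferred from the sample output (17:53:06.09 minus 6.89 s), and `sectional_timestamps` is a hypothetical helper, not a function from the notebook.

```python
from datetime import datetime, timedelta
from itertools import accumulate
from typing import List


def sectional_timestamps(earliest: datetime, sectional_times: List[float]) -> List[datetime]:
    """Add the running total of sectional times (in seconds) to the first GPS timestamp."""
    return [earliest + timedelta(seconds=total) for total in accumulate(sectional_times)]


# Start time inferred from the sample output above; an assumption, not a stored value.
earliest = datetime(2023, 1, 5, 17, 52, 59, 200000)

stamps = sectional_timestamps(earliest, [6.89, 6.66, 7.25])
for stamp in stamps:
    print(stamp.isoformat(sep=" ", timespec="milliseconds"))
```

Under that assumed start time, the three results line up with the first three `sec_time_stamp` values in the sample table.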

In [100]:
# Now, proceed with the join to create matched_df

from pyspark.sql.functions import abs as spark_abs  # aliased to avoid shadowing Python's built-in abs

# Step 9: Convert 'time_stamp' and 'sec_time_stamp' to milliseconds since epoch to preserve sub-second precision
gps_with_ms = gps_df.withColumn(
    "time_stamp_ms",
    (col("time_stamp").cast("double") * 1000).cast("long")
)

sectionals_with_ms = sectionals_df.withColumn(
    "sec_time_stamp_ms",
    (col("sec_time_stamp").cast("double") * 1000).cast("long")
)

# Step 10: Define the join condition with time window (±500 milliseconds)
join_condition = (
    (gps_with_ms.course_cd == sectionals_with_ms.course_cd) &
    (gps_with_ms.race_date == sectionals_with_ms.race_date) &
    (gps_with_ms.race_number == sectionals_with_ms.race_number) &
    (gps_with_ms.saddle_cloth_number == sectionals_with_ms.saddle_cloth_number) &
    (spark_abs(gps_with_ms.time_stamp_ms - sectionals_with_ms.sec_time_stamp_ms) <= 500)
)

# Step 11: Perform the left join based on the join condition
matched_df = gps_with_ms.join(
    sectionals_with_ms,
    on=join_condition,
    how="left"
).select(
    gps_with_ms["*"],
    sectionals_with_ms["sec_time_stamp"],
    sectionals_with_ms["gate_numeric"],
    sectionals_with_ms["gate_name"],
    sectionals_with_ms["sectional_time"]
)

In [101]:
# Step 12: Verify the matched records
print("Sample of matched_df (All gps_df records with sectional data where available):")
matched_df.select(
    *race_id_cols,
    "time_stamp",
    "sec_time_stamp",
    "gate_numeric",
    "gate_name",
    "sectional_time"
).show(10, truncate=False)


Sample of matched_df (All gps_df records with sectional data where available):




+---------+----------+-----------+-------------------+---------------------+----------------------+------------+---------+--------------+
|course_cd|race_date |race_number|saddle_cloth_number|time_stamp           |sec_time_stamp        |gate_numeric|gate_name|sectional_time|
+---------+----------+-----------+-------------------+---------------------+----------------------+------------+---------+--------------+
|ELP      |2023-07-30|1          |3                  |2023-07-30 16:48:47.5|null                  |null        |null     |null          |
|ELP      |2023-07-30|1          |3                  |2023-07-30 16:48:48.5|null                  |null        |null     |null          |
|LRL      |2023-11-24|9          |1                  |2023-11-24 21:25:51.2|null                  |null        |null     |null          |
|LRL      |2023-11-24|9          |1                  |2023-11-24 21:25:52.2|null                  |null        |null     |null          |
|TTP      |2022-12-09|3          |
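
The time-window condition in the join can be hard to picture, so here is a plain-Python sketch of the same rule for a single horse in a single race: a GPS row matches a sectional when their timestamps differ by at most 500 ms. The sample values and the helper name `gps_matches` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List

TOLERANCE_MS = 500  # same window as the join condition


def gps_matches(gps_ms: List[int], sectional_ms: int, tolerance_ms: int = TOLERANCE_MS) -> List[int]:
    """Return every GPS timestamp within +/- tolerance_ms of a sectional timestamp.

    `gps_ms` must be sorted ascending (milliseconds for one horse in one race).
    """
    lo = bisect_left(gps_ms, sectional_ms - tolerance_ms)
    hi = bisect_right(gps_ms, sectional_ms + tolerance_ms)
    return gps_ms[lo:hi]


# Hypothetical 1 Hz GPS samples (ms since the first sample, for readability).
gps = [200, 1200, 2200, 3200]

one_match = gps_matches(gps, 1450)    # only 1200 falls inside the window
two_matches = gps_matches(gps, 1700)  # 1200 and 2200 are both exactly 500 ms away
no_match = gps_matches(gps, 4000)     # nearest sample (3200) is 800 ms away
print(one_match, two_matches, no_match)
```

The second case shows how an inclusive ±500 ms window over 1 Hz data can return two GPS rows for one sectional, which is the situation a deduplication pass has to resolve.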



In [212]:
# Step 13: Show the total number of rows in matched_df, matched_df_unique, gps_df, and sectionals_df
matched_count = matched_df.count()
print(f"Total number of rows in matched_df: {matched_count}")

matched_count_unique = matched_df_unique.count()
print(f"Total number of rows in matched_df_unique: {matched_count_unique}")

gps_count = gps_df.count()
print(f"Total number of rows in gps_df: {gps_count}")

sectionals_count = sectionals_df.count()
print(f"Total number of rows in sectionals_df: {sectionals_count}")


                                                                                

Total number of rows in matched_df: 36633397




Total number of rows in matched_df_unique: 5086658
Total number of rows in gps_df: 36633397
Total number of rows in sectionals_df: 4742485




In [206]:
matched_df.printSchema()

root
 |-- course_cd: string (nullable = true)
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- saddle_cloth_number: string (nullable = true)
 |-- time_stamp: timestamp (nullable = true)
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- speed: double (nullable = true)
 |-- progress: double (nullable = true)
 |-- stride_frequency: double (nullable = true)
 |-- post_time: timestamp (nullable = true)
 |-- location: string (nullable = true)
 |-- time_stamp_ms: long (nullable = true)
 |-- sec_time_stamp: timestamp (nullable = true)
 |-- sec_time_stamp_ms: long (nullable = true)
 |-- gate_numeric: double (nullable = true)
 |-- gate_name: string (nullable = true)
 |-- sectional_time: double (nullable = true)



In [205]:
sectionals_df.printSchema()

root
 |-- course_cd: string (nullable = true)
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- saddle_cloth_number: string (nullable = true)
 |-- gate_name: string (nullable = true)
 |-- gate_numeric: double (nullable = true)
 |-- length_to_finish: double (nullable = true)
 |-- sectional_time: double (nullable = true)
 |-- running_time: double (nullable = true)
 |-- distance_back: double (nullable = true)
 |-- distance_ran: double (nullable = true)
 |-- number_of_strides: double (nullable = true)
 |-- sec_time_stamp: timestamp (nullable = true)



In [200]:
# Step 14: Show the number of unmatched sectional records

# Extract matched sectional records
matched_sectionals = matched_df.select(
    *race_id_cols,
    "sec_time_stamp"
).distinct()

# Perform a left anti join to find sectionals not present in matched_sectionals
unmatched_sectionals_df = sectionals_df.join(
    matched_sectionals,
    on=race_id_cols + ["sec_time_stamp"],
    how="leftanti"
)

# Count the number of unmatched sectional records
unmatched_sectionals_count = unmatched_sectionals_df.count()
print(f"Number of unmatched sectional records: {unmatched_sectionals_count}")

# Stop the Spark session
# spark.stop()



Number of unmatched sectional records: 1045




### Deduplication Process
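
The cell that builds `matched_df_unique` is not shown in this section. As a hedged illustration only, one way to deduplicate is to keep, for each sectional record, the single GPS row closest in time, breaking ties in favor of the earlier GPS row. The sketch below shows that rule in plain Python; the `Match` type, the tie-break, and the sample rows are all assumptions rather than the notebook's actual method.

```python
from typing import Dict, Iterable, NamedTuple, Tuple

SectionalKey = Tuple[str, str, int, str, float]  # race id columns + gate_numeric


class Match(NamedTuple):
    sectional_key: SectionalKey
    time_stamp_ms: int      # GPS timestamp (ms)
    sec_time_stamp_ms: int  # sectional timestamp (ms)


def rank(match: Match) -> Tuple[int, int]:
    """Sort key: smallest time gap first, earlier GPS row on ties."""
    return (abs(match.time_stamp_ms - match.sec_time_stamp_ms), match.time_stamp_ms)


def keep_closest(matches: Iterable[Match]) -> Dict[SectionalKey, Match]:
    """Keep exactly one GPS row per sectional record."""
    best: Dict[SectionalKey, Match] = {}
    for match in matches:
        current = best.get(match.sectional_key)
        if current is None or rank(match) < rank(current):
            best[match.sectional_key] = match
    return best


# Hypothetical matched rows (timestamps shortened for readability).
gate_half = ("AQU", "2023-01-05", 1, "6", 0.5)
gate_one = ("AQU", "2023-01-05", 1, "6", 1.0)
matches = [
    Match(gate_half, 1200, 1700),  # 500 ms before the sectional
    Match(gate_half, 2200, 1700),  # 500 ms after: a tie, so the earlier row wins
    Match(gate_one, 3200, 3700),   # 500 ms away
    Match(gate_one, 3900, 3700),   # 200 ms away: closest
]
unique = keep_closest(matches)
```

In Spark, the same idea is commonly expressed with `row_number()` over a window partitioned by the sectional key and ordered by the absolute time difference, keeping only rank 1.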

### Recheck for Duplicates

In [108]:
from pyspark.sql.functions import count, col, collect_list
from pyspark.sql import functions as F

# Define race identifier columns
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]


# Group by sectional identifiers and collect GPS time_stamps
duplicate_matches = matched_df_unique.groupBy(*race_id_cols, "sec_time_stamp") \
    .agg(
        collect_list("time_stamp").alias("gps_time_stamps"),
        count("*").alias("gps_match_count")
    ) \
    .filter(col("gps_match_count") > 1)

# Count the number of sectional records with duplicates
num_duplicate_sectionals = duplicate_matches.count()
print(f"Number of sectional records with duplicate GPS matches: {num_duplicate_sectionals}")




Number of sectional records with duplicate GPS matches: 0


                                                                                

# Merge Route and Results with GPS and Sectionals

## Step 1: Load and Prepare the Routes Data

The Routes table includes route layouts and the coordinates of the running and winning lines. 

We’ll need to:

1.	Load the Routes data into a Spark DataFrame.

2.	Join it with the GPS data using the race identifiers (course_cd, race_date, race_number).

3.	Calculate deviations from the “ideal” route.

In [150]:
# Load the Routes data
# PRIMARY KEY (course_cd, line_type, line_name)
routes_query = """
    SELECT course_cd, track_name, line_type, line_name, coordinates, 
           created_at
    FROM routes
""" 
    
routes_df = spark.read.jdbc(url=jdbc_url, table=f"({routes_query}) AS subquery", properties=jdbc_properties)

# Save to Parquet for reuse
routes_df.write.mode("overwrite").parquet(os.path.join(parquet_dir, "routes.parquet"))

# Reload if needed
routes_df = spark.read.parquet(os.path.join(parquet_dir, "routes.parquet"))

In [183]:
routes_df.select("course_cd", "track_name", "line_type", "line_name").show(40, truncate=False)

+---------+---------------------------+------------+----------------------+
|course_cd|track_name                 |line_type   |line_name             |
+---------+---------------------------+------------+----------------------+
|KEE      |KEENELAND                  |WINNING_LINE|WINNING_LINE          |
|KEE      |KEENELAND                  |RUNNING_LINE|RUNNING_LINE 5 30FT   |
|TGG      |GOLDEN GATE FIELDS         |WINNING_LINE|WINNING_LINE          |
|TGG      |GOLDEN GATE FIELDS         |RUNNING_LINE|RUNNING_LINE_3        |
|TOP      |OAKLAWN PARK               |WINNING_LINE|WINNING_LINE ALTERNATE|
|TAM      |TAMPA BAY DOWNS            |WINNING_LINE|WINNING_LINE          |
|TLS      |LONE STAR PARK             |WINNING_LINE|WINNING_LINE          |
|TCD      |CHURCHILL DOWNS            |WINNING_LINE|WINNING_LINE          |
|TCD      |CHURCHILL DOWNS            |RUNNING_LINE|RUNNING_LINE LANE 4   |
|MVR      |MAHONING VALLEY RACE COURSE|WINNING_LINE|WINNING_LINE          |
|MVR      |M

In [176]:
saved_df = spark.read.parquet("/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/parsed_routes_df.parquet")

# Route Metrics

## 1. Route Efficiency

Route Efficiency measures the degree to which a horse follows the optimal racing path defined by the track’s coordinates. It provides a ratio of the actual distance a horse runs versus the optimal distance prescribed by the track layout.

> •	Formula:

> $$\text{Route Efficiency} = \frac{\text{Distance Ran}}{\text{Optimal Route Length}}$$

### Interpretation:

> •	Route Efficiency = 1: Perfect adherence to the optimal path.

> •	Route Efficiency < 1: Runs shorter than the optimal path (possibly cutting corners).

> •	Route Efficiency > 1: Runs longer than the optimal path (possibly veering off track).
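
As a worked example of the formula, the sketch below measures a route as the sum of haversine distances between consecutive points and then forms the ratio. The three route points and the distance ran are hypothetical values chosen for illustration, and the function names are not taken from the notebook.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def polyline_length_m(points: List[Tuple[float, float]]) -> float:
    """Sum the haversine length of each consecutive (lat, lon) segment."""
    return sum(haversine_m(*p, *q) for p, q in zip(points, points[1:]))


def route_efficiency(distance_ran_m: float, optimal_length_m: float) -> float:
    """Distance Ran / Optimal Route Length."""
    return distance_ran_m / optimal_length_m


# Hypothetical three-point route; not taken from the routes table.
route = [(38.0460, -84.6080), (38.0465, -84.6080), (38.0470, -84.6080)]
optimal = polyline_length_m(route)

# A horse that covers 3% more ground than the optimal line.
efficiency = route_efficiency(distance_ran_m=optimal * 1.03, optimal_length_m=optimal)
print(round(optimal, 1), round(efficiency, 2))
```

A value above 1, as here, means the horse covered more ground than the reference line.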


In [196]:
from pyspark.sql.functions import col, udf, explode, lit, sum as spark_sum
from pyspark.sql.types import DoubleType
import math

# Haversine function to calculate distances
def haversine(lat1, lon1, lat2, lon2):
    if None in (lat1, lon1, lat2, lon2):
        return 0.0
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2
    c = 2 * math.asin(math.sqrt(a))
    r = 6371000  # Radius of Earth in meters
    return c * r

haversine_udf = udf(haversine, DoubleType())

# Filter valid routes.
# Note: `coordinates` holds hex-encoded EWKB beginning with "010200" in all 38 rows of routes_df
# (see the counts further down), so this filter drops every row and the outputs below are empty.
routes_valid = routes_df.filter(~col("coordinates").contains("010200"))  # Exclude invalid rows

# Explode coordinates with posexplode so each point keeps its index within its own line.
# (row_number() ordered by created_at cannot preserve point order, because every point
# exploded from one row shares the same created_at.)
from pyspark.sql.functions import posexplode

line_id_cols = ["course_cd", "line_type", "line_name"]  # primary key of the routes table

routes_ordered = routes_valid.select(
    *line_id_cols,
    posexplode("parsed_coordinates").alias("coord_order", "coord")
).select(*line_id_cols, "coord_order", "coord.latitude", "coord.longitude")

# Create pairs of consecutive points within the same line
routes_shifted = routes_ordered.withColumnRenamed("latitude", "lat1") \
    .withColumnRenamed("longitude", "lon1")

routes_consecutive = routes_ordered.alias("current") \
    .join(routes_shifted.alias("previous"),
          (col("current.course_cd") == col("previous.course_cd")) &
          (col("current.line_type") == col("previous.line_type")) &
          (col("current.line_name") == col("previous.line_name")) &
          (col("current.coord_order") == col("previous.coord_order") + 1),
          "left") \
    .select(
        col("current.course_cd"),
        col("previous.lat1"),
        col("previous.lon1"),
        col("current.latitude").alias("lat2"),
        col("current.longitude").alias("lon2")
    )

# Calculate segment distances
routes_with_distance = routes_consecutive.withColumn(
    "segment_distance",
    haversine_udf(col("lat1"), col("lon1"), col("lat2"), col("lon2"))
)

# Aggregate the total optimal route length per course
optimal_route_length = routes_with_distance.groupBy("course_cd").agg(
    spark_sum("segment_distance").alias("optimal_route_length_meters")
)

optimal_route_length.show()

# Calculate Route Efficiency
# Join with matched_df_unique to calculate Route Efficiency for each horse
matched_with_routes = matched_df_unique.join(
    optimal_route_length,
    on="course_cd",
    how="inner"
)

# Add route efficiency column
matched_with_efficiency = matched_with_routes.withColumn(
    "route_efficiency",
    col("progress") / col("optimal_route_length_meters")
)

matched_with_efficiency.select(
    "course_cd", "race_date", "race_number", "saddle_cloth_number",
    "progress", "optimal_route_length_meters", "route_efficiency"
).show()

+---------+---------------------------+
|course_cd|optimal_route_length_meters|
+---------+---------------------------+
+---------+---------------------------+

+---------+---------+-----------+-------------------+--------+---------------------------+----------------+
|course_cd|race_date|race_number|saddle_cloth_number|progress|optimal_route_length_meters|route_efficiency|
+---------+---------+-----------+-------------------+--------+---------------------------+----------------+
+---------+---------+-----------+-------------------+--------+---------------------------+----------------+



In [188]:
# Count entries that are not WKT text (hex-encoded EWKB values start with "010200" and fail WKT parsing)
invalid_count = routes_df.filter(routes_df.coordinates.contains("010200")).count()
print(f"Number of invalid WKT records: {invalid_count}")


Number of invalid WKT records: 38


In [197]:
print(f"matched_df_unique record count: {matched_df_unique.count()}")
print(f"routes_df record count: {routes_df.count()}")

matched_df_unique.show(5, truncate=False)
routes_df.show(5, truncate=False)

matched_df_unique record count: 4741439
routes_df record count: 38
+---------+----------+-----------+-------------------+---------------------+-----------+----------+-----+--------+----------------+-------------------+--------------------------------------------------+-------------+----------------------+-----------------+------------+---------+--------------+
|course_cd|race_date |race_number|saddle_cloth_number|time_stamp           |longitude  |latitude  |speed|progress|stride_frequency|post_time          |location                                          |time_stamp_ms|sec_time_stamp        |sec_time_stamp_ms|gate_numeric|gate_name|sectional_time|
+---------+----------+-----------+-------------------+---------------------+-----------+----------+-----+--------+----------------+-------------------+--------------------------------------------------+-------------+----------------------+-----------------+------------+---------+--------------+
|AQU      |2023-01-07|6          |5          

Failed to parse WKT: 0102000020E61000000200000073A9728AF92655C027187B6C1C064340BE82D7F5F02655C0BA6BC4BF26064340, Error: ParseException: Unknown type: '0102000020E61000000200000073A9728AF92655C027187B6C1C064340BE82D7F5F02655C0BA6BC4BF26064340'
Failed to parse WKT: 0102000020E6100000800100002682823EFD2655C089E0EAB3CA0543406EAAD1E9FC2655C08A4EC1FFC905434093BE2A40FA2655C00E40D784C4054340408BC53AF82655C0B88A8043C00543406C91F636F62655C029A33813BC054340A0607046F42655C028DA950BB805434091D615F7F22655C097CA4250B5054340136069BFF22655C01409FFD0B405434082A64685F22655C0CA108454B405434058E7BB48F22655C0205FF0DAB30543405EBF9E52F02655C0D91935E5AF054340FEE1C1E7EE2655C03DB56815AD054340C7B08D33EE2655C0941ED5C2AB054340DC3D2685ED2655C0EFD7BD68AA054340FD76B5D1EC2655C0D2F37010A905434061864817EC2655C0E1CB8DC1A7054340EC2B7BC4EA2655C025220766A50543400C74942FE92655C0173A16BBA2054340E7E07226E92655C0C771B3ACA20543401A821D48E82655C01A844B63A1054340ABDF7764E72655C012D51823A0054340262D8F7CE62655C0ED62500D9F054340B62146
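
The strings in these messages are not WKT text. Their leading bytes (`01` for little-endian, `02000020` for a LineString with the SRID flag set, `E6100000` for SRID 4326) match hex-encoded EWKB, the binary form PostGIS returns for geometry columns. The sketch below decodes such a value with the standard library only; it assumes 2D LineStrings, and `parse_ewkb_linestring` is a hypothetical helper, not the notebook's parser.

```python
import struct
from typing import List, Tuple

SRID_FLAG = 0x20000000   # EWKB flag: an SRID follows the geometry type
ZM_FLAGS = 0xC0000000    # EWKB flags for Z and M dimensions (not handled here)
LINESTRING = 2           # WKB geometry type code for LineString


def parse_ewkb_linestring(hex_str: str) -> Tuple[int, List[Tuple[float, float]]]:
    """Decode a hex-encoded 2D EWKB LineString into (srid, [(longitude, latitude), ...])."""
    data = bytes.fromhex(hex_str)
    order = "<" if data[0] == 1 else ">"  # byte 0: 1 = little-endian, 0 = big-endian
    (type_code,) = struct.unpack_from(order + "I", data, 1)
    if type_code & ZM_FLAGS:
        raise ValueError("Only 2D geometries are handled in this sketch")
    if (type_code & 0xFF) != LINESTRING:
        raise ValueError(f"Not a LineString (type code {type_code:#x})")
    offset = 5
    srid = 0
    if type_code & SRID_FLAG:
        (srid,) = struct.unpack_from(order + "I", data, offset)
        offset += 4
    (n_points,) = struct.unpack_from(order + "I", data, offset)
    offset += 4
    # Each point is two 8-byte doubles: x (longitude) then y (latitude).
    coords = [
        struct.unpack_from(order + "dd", data, offset + 16 * i)
        for i in range(n_points)
    ]
    return srid, coords


# First value from the "Failed to parse WKT" messages above.
sample = ("0102000020E61000000200000073A9728AF92655C027187B6C1C064340"
          "BE82D7F5F02655C0BA6BC4BF26064340")
srid, coords = parse_ewkb_linestring(sample)
print(srid, coords)
```

If Shapely is available, `shapely.wkb.loads(hex_str, hex=True)` is likely the more robust route, since it handles every geometry type rather than only the case sketched here.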

# Cache matched_df_unique

In [116]:
matched_df_unique.cache()

DataFrame[course_cd: string, race_date: date, race_number: int, saddle_cloth_number: string, time_stamp: timestamp, longitude: double, latitude: double, speed: double, progress: double, stride_frequency: double, post_time: timestamp, location: string, time_stamp_ms: bigint, sec_time_stamp: timestamp, sec_time_stamp_ms: bigint, gate_numeric: double, gate_name: string, sectional_time: double]

In [168]:
matched_df_unique.write.mode("overwrite").parquet("/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/matched_df_unique.parquet")

                                                                                

In [189]:
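from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from shapely.wkt import loads
from shapely.errors import WKTReadingError

def validate_wkt(wkt_str):
    """Return True if wkt_str parses as WKT, False otherwise."""
    if wkt_str is None:
        return False
    try:
        return loads(wkt_str) is not None
    except WKTReadingError:
        return False

# Wrap the validator as a Spark UDF and keep only rows whose coordinates parse
validate_wkt_udf = udf(validate_wkt, BooleanType())
valid_routes_df = routes_df.filter(validate_wkt_udf(routes_df.coordinates))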

  from shapely.errors import WKTReadingError


In [195]:
routes_df.printSchema()

root
 |-- course_cd: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- line_type: string (nullable = true)
 |-- line_name: string (nullable = true)
 |-- coordinates: string (nullable = true)
 |-- created_at: timestamp (nullable = true)
 |-- parsed_coordinates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- longitude: double (nullable = true)
 |    |    |-- latitude: double (nullable = true)



# Data Preparation

## Sample data initially

Random row sampling, as attempted earlier, does not work for time series because it removes points from inside each horse's sequence. Filtering by date keeps whole races intact, so a smaller date-filtered sample should work fine.

In [None]:
# Filter data for a specific date range or course
#df_filtered = merged_df[merged_df['race_date'] >= '2024-01-01']

In [None]:
#df_filtered.shape

In [None]:
#print(df_filtered.isnull().sum())

## Check for missing data

In [111]:
# Check for missing values
from pyspark.sql.functions import col, when, count

# List of all columns in the DataFrame
columns = matched_df_unique.columns

# Create a list of expressions to count nulls in each column
null_counts = [count(when(col(c).isNull(), c)).alias(c) for c in columns]

# Apply the expressions to the DataFrame
matched_df_unique.select(null_counts).show(truncate=False)



+---------+---------+-----------+-------------------+----------+---------+--------+-----+--------+----------------+---------+--------+-------------+--------------+-----------------+------------+---------+--------------+
|course_cd|race_date|race_number|saddle_cloth_number|time_stamp|longitude|latitude|speed|progress|stride_frequency|post_time|location|time_stamp_ms|sec_time_stamp|sec_time_stamp_ms|gate_numeric|gate_name|sectional_time|
+---------+---------+-----------+-------------------+----------+---------+--------+-----+--------+----------------+---------+--------+-------------+--------------+-----------------+------------+---------+--------------+
|0        |0        |0          |0                  |0         |0        |0       |0    |0       |397773          |0        |0       |0            |0             |0                |205559      |0        |0             |
+---------+---------+-----------+-------------------+----------+---------+--------+-----+--------+----------------+---------+--------+-------------+--------------+-----------------+------------+---------+--------------+



In [None]:
#import seaborn as sns
#import matplotlib.pyplot as plt

# Heatmap of missing values (on a small sample)
#sns.heatmap(df.isnull(), cbar=False)
#plt.show()

## Imputation for Stride Frequency and number_of_strides


#### Group-Based Imputation: Impute based on groups, such as per horse.

In [None]:
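# Impute within each horse's race; saddle_cloth_number alone repeats across races
df['stride_frequency'] = df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['stride_frequency'].transform(lambda x: x.fillna(x.median()))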

#### gate_numeric remains the same within a group until changed:

In [None]:
df.dropna(subset=['gate_numeric'], inplace=True)

#### Interpolation: distance_back changes over time

In [None]:
df.dropna(subset=['distance_back'], inplace=True)

#### Group-Based Imputation for number_of_strides

In [None]:
df.dropna(subset=['number_of_strides'], inplace=True)
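
The headings above describe forward filling `gate_numeric` and interpolating `distance_back`, while the cells simply drop rows with missing values. As a hedged alternative, the sketch below fills gaps within each horse's race using pandas; the toy values are made up for illustration, and whether filling is preferable to dropping depends on how the gaps arise in the real data.

```python
import numpy as np
import pandas as pd

group_columns = ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']

# Toy data: two horses in one race, with gaps (hypothetical values)
toy = pd.DataFrame({
    'course_cd': ['AAA'] * 6,
    'race_date': ['2024-01-01'] * 6,
    'race_number': [1] * 6,
    'saddle_cloth_number': ['1', '1', '1', '2', '2', '2'],
    'gate_numeric': [1.0, np.nan, 2.0, np.nan, 1.0, np.nan],
    'distance_back': [0.0, np.nan, 2.0, 4.0, np.nan, 6.0],
})

# Forward fill: carry the last seen gate forward within each horse's race
toy['gate_numeric'] = toy.groupby(group_columns)['gate_numeric'].ffill()

# Linear interpolation between known values within each horse's race
toy['distance_back'] = toy.groupby(group_columns)['distance_back'].transform(
    lambda s: s.interpolate(method='linear')
)

print(toy[['saddle_cloth_number', 'gate_numeric', 'distance_back']])
```

Rows that are still missing after the forward fill (a gap at the very start of a sequence, as for horse 2 here) can then be dropped with `dropna` as in the cells above.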

## Choose Features

In [None]:
df.shape

In [None]:
print(df.isnull().sum())

In [None]:
feature_columns = [
    'speed',
    'progress',
    'stride_frequency',
    'number_of_strides',
    'post_pos',
    'gate_numeric',
    'length_to_finish',
    'sectional_time',
    'running_time',
    'distance_back',
    'distance_ran'
]

## Feature Engineering -- calculate additional features

In [None]:
df.sort_values(by=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp'], inplace=True)
df['acceleration'] = df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['speed'].diff() / df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['time_stamp'].diff().dt.total_seconds()

In [None]:
import numpy as np
df['acceleration'] = df['acceleration'].replace([np.inf, -np.inf], np.nan)
df['acceleration'] = df['acceleration'].fillna(0)

In [None]:
feature_columns.append('acceleration')

## Scale Features

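Standardizing each feature to zero mean and unit variance keeps inputs such as speed and progress on comparable scales, which generally helps neural network training converge.

In [None]:
# Note: Fit the scaler after sequences are created and split, using the training set only, to avoid data leakage.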

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

## Create Sequences for LSTM

a. Group Data

Group the data to create sequences for each horse in each race.

In [None]:
group_columns = ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
groups = df.groupby(group_columns)  # was df_sampled, which is not defined in this section

##  Create Sequences and Labels

In [None]:
sequences = []
labels = []

for name, group in groups:
    # Ensure group is sorted by time
    group = group.sort_values('time_stamp')

    # Extract features
    features = group[feature_columns].values

    # Append the sequence
    sequences.append(features)

    # Extract label (official finishing position)
    label = group['official_fin'].iloc[0]  # Assuming it's the same for all entries in the group
    labels.append(label)

## Determine max_seq_length and num_features

In [None]:
# Note: Alternatively, set a fixed max_seq_length to limit memory usage.
max_seq_length = max(len(seq) for seq in sequences)
num_features = len(feature_columns)
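
The note about a fixed `max_seq_length` can be sketched as follows: pick a length that covers most sequences (here a percentile of the observed lengths) and truncate or zero-pad to it. This is an illustrative NumPy version under assumed toy data; the helper names `choose_max_len` and `pad_or_truncate` are hypothetical, and the percentile is a tuning choice rather than a recommendation.

```python
import numpy as np

def choose_max_len(lengths, percentile=95):
    """Return the sequence length at the given percentile of observed lengths."""
    return int(np.percentile(lengths, percentile))

def pad_or_truncate(seq, max_len):
    """Keep the first max_len timesteps and zero-pad at the end ('post' padding)."""
    seq = np.asarray(seq, dtype='float32')[:max_len]
    out = np.zeros((max_len, seq.shape[1]), dtype='float32')
    out[:len(seq)] = seq
    return out

# Toy sequences with 2 features and lengths 3, 5, and 40 (hypothetical)
toy_sequences = [np.ones((n, 2)) for n in (3, 5, 40)]
fixed_len = choose_max_len([len(s) for s in toy_sequences], percentile=50)
batch = np.stack([pad_or_truncate(s, fixed_len) for s in toy_sequences])
print(fixed_len, batch.shape)
```

`pad_sequences` can do the same in one call through its `maxlen` and `truncating` arguments; note that `truncating` defaults to `'pre'`, which drops the start of a long sequence rather than the end.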


In [None]:
print(max_seq_length)
print(num_features)

## Pad Sequences

In [None]:
import tensorflow as tf
# from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=max_seq_length, padding='post', dtype='float32'
)

## Convert Labels

Adjust labels to start from 0 if they start from 1.

In [None]:
labels = np.array(labels).astype(int) - 1
num_classes = labels.max() + 1

## Scale Features

Now, scale the features. Be cautious to fit the scaler only on the training data to prevent data leakage.

Flatten sequences for scaling:

In [None]:
num_samples = padded_sequences.shape[0]
X_flat = padded_sequences.reshape(-1, num_features)

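## Fit scaler on the flattened data:

In [None]:
# Caution: this fits the scaler on all samples, including those that later land in
# the validation and test sets. Use it for exploration only; the split section
# below fits the scaler on the training data alone.
X_scaled_flat = scaler.fit_transform(X_flat)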

## Reshape back to original shape:

In [None]:
X_scaled = X_scaled_flat.reshape(num_samples, max_seq_length, num_features)

# Split Data into Training, Validation, and Test Sets

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assume sequences and labels have been created and padded_sequences is available

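# Convert labels to 0-based classes. The shift is guarded so that re-running it
# after the earlier "Convert Labels" cell does not subtract 1 a second time.
labels = np.array(labels).astype(int)
if labels.min() == 1:
    labels = labels - 1
num_classes = labels.max() + 1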

# Split data
X_temp, X_test, y_temp, y_test = train_test_split(
    padded_sequences, labels, test_size=0.1, random_state=42
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1, random_state=42
)

# Check shapes
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)
print("y_test shape:", y_test.shape)

# Scale features
scaler = StandardScaler()

# Flatten training data and fit scaler
num_samples_train = X_train.shape[0]
X_train_flat = X_train.reshape(-1, num_features)
X_train_scaled_flat = scaler.fit_transform(X_train_flat)
X_train_scaled = X_train_scaled_flat.reshape(num_samples_train, max_seq_length, num_features)

# Scale validation data
num_samples_val = X_val.shape[0]
X_val_flat = X_val.reshape(-1, num_features)
X_val_scaled_flat = scaler.transform(X_val_flat)
X_val_scaled = X_val_scaled_flat.reshape(num_samples_val, max_seq_length, num_features)

# Scale test data
num_samples_test = X_test.shape[0]
X_test_flat = X_test.reshape(-1, num_features)
X_test_scaled_flat = scaler.transform(X_test_flat)
X_test_scaled = X_test_scaled_flat.reshape(num_samples_test, max_seq_length, num_features)

Ensure that X_train, X_val, X_test, y_train, y_val, and y_test are correctly shaped.
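
One caveat with the scaling cell above: `StandardScaler` also transforms the zero-padded timesteps, so they no longer equal 0.0 and a `Masking(mask_value=0.0)` layer will not skip them; the padding also distorts the fitted mean and variance. The sketch below is one possible workaround in plain NumPy, assuming a timestep counts as padding when all of its features are exactly 0. The helper names are hypothetical.

```python
import numpy as np

def fit_masked_scaler(X_train):
    """Mean and std per feature, computed over non-padded timesteps only."""
    real = np.any(X_train != 0.0, axis=-1)      # (samples, timesteps)
    values = X_train[real]                      # (n_real_timesteps, features)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0.0] = 1.0                       # guard against constant features
    return mean, std

def apply_masked_scaler(X, mean, std):
    """Standardize real timesteps and leave padded timesteps at exactly 0."""
    real = np.any(X != 0.0, axis=-1, keepdims=True)
    return np.where(real, (X - mean) / std, 0.0).astype('float32')

# Toy batch: 2 sequences, 3 timesteps, 2 features; the last step of the first is padding
X_toy = np.array([[[1., 10.], [3., 30.], [0., 0.]],
                  [[5., 50.], [7., 70.], [8., 80.]]], dtype='float32')
mean, std = fit_masked_scaler(X_toy)
X_toy_scaled = apply_masked_scaler(X_toy, mean, std)
print(mean, X_toy_scaled[0, 2])
```

In the notebook, `fit_masked_scaler` would be called on `X_train` only, and `apply_masked_scaler` on each of `X_train`, `X_val`, and `X_test`. A real timestep whose features all equal the training mean would scale to zeros and be masked as well; that is unlikely with continuous GPS features but worth knowing.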

In [None]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)
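
Because `train_test_split` shuffles individual sequences, horses from the same race can end up on both sides of the split, and their finishing positions are not independent. A hedged alternative is to split by race so that every race falls entirely into one set. The sketch below assumes one race identifier per sequence (for example, built from `course_cd`, `race_date`, and `race_number` while the sequences are created); `split_by_race` is a hypothetical helper.

```python
import numpy as np

def split_by_race(race_ids, test_fraction=0.1, seed=42):
    """Return boolean masks (train, test) such that no race appears in both."""
    race_ids = np.asarray(race_ids)
    unique_races = np.unique(race_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_races)
    n_test = max(1, int(round(len(unique_races) * test_fraction)))
    test_mask = np.isin(race_ids, unique_races[:n_test])
    return ~test_mask, test_mask

# Toy example: 10 races with 3 horses each (hypothetical race identifiers)
race_ids = np.repeat([f"race_{i}" for i in range(10)], 3)
train_mask, test_mask = split_by_race(race_ids, test_fraction=0.2)
print(train_mask.sum(), test_mask.sum())
```

The masks index the padded array directly, e.g. `padded_sequences[train_mask]`. scikit-learn's `GroupShuffleSplit` implements the same idea if a library routine is preferred.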

# Prepare Data for Model Training

## Training the LSTM Model

### Build the Model

This model stacks three bidirectional LSTM layers and combines dropout (0.5), L2 kernel regularization (1e-4), and layer normalization to reduce overfitting.

In [None]:
import tensorflow as tf

model_lstm = tf.keras.Sequential([
    tf.keras.Input(shape=(max_seq_length, num_features)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        256, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),    
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        128, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])


#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.LSTM(128),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#988/988 ━━━━━━━━━━━━━━━━━━━━ 7s 7ms/step - accuracy: 0.3606 - loss: 1.6184
#Test Loss: 1.6182985305786133, Test Accuracy: 0.36063656210899353

### Compile the Model

RMSprop is often a good choice for RNNs.

>	•	The learning rate of 0.001 is a typical starting point.

>   •	Recommendation: experiment with different learning rates (e.g., 0.0005, 0.0001) if needed.

>   •	Alternatively, try the Adam optimizer and compare results.

In [None]:
# experimenting with different learning rates (e.g., 0.0005, 0.0001) to see if it affects convergence.

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)

model_lstm.compile(
    optimizer=optimizer,   # 'adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'] #,tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)



### Train the Model

## Hyperparameter Tuning

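> Learning Rate Scheduler and Early Stopping

> * Learning rate scheduler: `ReduceLROnPlateau` halves the learning rate after 2 epochs without improvement in `val_loss`, down to a minimum of 1e-6.

> * Early stopping: training stops after 5 epochs without improvement in `val_loss`, and the best weights are restored.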



In [None]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2, min_lr=1e-6
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

In [None]:
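# Train on the scaled arrays from the split section; the unscaled X_train was used before.
# Note: StandardScaler moves padded zeros away from 0.0, so Masking(mask_value=0.0)
# only skips padding if those timesteps are reset to zero after scaling.
history = model_lstm.fit(
    X_train_scaled, y_train,
    epochs=50,
    batch_size=128,  # 64,
    validation_data=(X_val_scaled, y_val),
    callbacks=[
        lr_scheduler,
        early_stopping,
        model_checkpoint
    ]
)

### Evaluate the Model

In [None]:
test_loss, test_accuracy = model_lstm.evaluate(X_test_scaled, y_test)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")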

## Plot Training and Validation Loss and Accuracy:

In [None]:
import matplotlib.pyplot as plt

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.show()

### Check for Imbalance

In [None]:
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

In [None]:
plt.bar(unique, counts)
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()
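
If the counts above turn out to be uneven (higher finishing positions only occur in large fields, so they are likely rarer), one common remedy is to weight classes inversely to their frequency. The sketch below computes such weights with NumPy on made-up labels; `balanced_class_weights` is a hypothetical helper, and whether weighting helps here is something to verify on the validation set.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels: class 0 appears 6 times, class 1 three times, class 2 once
y_toy = np.array([0] * 6 + [1] * 3 + [2])
class_weight = balanced_class_weights(y_toy)
print(class_weight)
```

The resulting dictionary can be passed to Keras through the `class_weight` argument of `model.fit`.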

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
corr_matrix = df[['speed', 'progress', 'stride_frequency', 'longitude', 'latitude', 'post_pos', 'official_fin']].corr()

# Plot heatmap
sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Define your variables
# Caution: these placeholder values overwrite max_seq_length, num_features,
# and num_classes computed from the data earlier in the notebook.
max_seq_length = 120  # Replace with your maximum sequence length
num_features = 5      # Replace with the actual number of features in your data
num_classes = 12      # Replace with the actual number of classes

# Build your model (this also replaces the model_lstm defined above)
model_lstm = tf.keras.Sequential()
model_lstm.add(tf.keras.Input(shape=(max_seq_length, num_features)))
model_lstm.add(tf.keras.layers.Masking(mask_value=0.))
model_lstm.add(tf.keras.layers.LSTM(128))
model_lstm.add(tf.keras.layers.Dense(num_classes, activation='softmax'))

model_lstm.summary()


In [None]:
# Load data into dataframe:

import pandas as pd

In [None]:
# Training

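# Uses the scaled arrays from the split section for consistency with the scaling step
history_lstm = model_lstm.fit(
    X_train_scaled, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_val_scaled, y_val),
    callbacks=[early_stopping]
)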

# Combining the Models

To create an ensemble, you can combine the predictions of these models in several ways:
	1.	Averaging Probabilities:
	•	Obtain probability distributions over finishing positions from each model.
	•	Average the probabilities across models to get the final prediction.
	2.	Weighted Averaging:
	•	Assign weights to each model based on validation performance.
	•	Compute a weighted average of the probabilities.
	3.	Stacking (Meta-Learner):
	•	Use the predictions from the individual models as input features to a meta-model (e.g., a logistic regression or another neural network).
	•	The meta-model learns how to best combine the individual predictions.
	4.	Voting (for Classification):
	•	If treating the problem as classification into discrete positions, use majority voting among the models.
	•	Not as suitable if you need probability distributions.
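
The averaging, weighted averaging, and voting options above can be sketched in a few lines of NumPy. The probability arrays and weights below are made up for illustration; in practice each array would come from `model.predict` on the same samples, and the weights from validation performance. The helper names are hypothetical.

```python
import numpy as np

def average_probs(prob_list, weights=None):
    """(Weighted) average of per-model arrays of shape (samples, classes)."""
    stacked = np.stack(prob_list)               # (models, samples, classes)
    if weights is None:
        return stacked.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                             # normalize so rows still sum to 1
    return np.tensordot(w, stacked, axes=1)

def majority_vote(prob_list):
    """Most frequent predicted class across models (ties go to the lowest class)."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list])   # (models, samples)
    n_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)

# Hypothetical outputs of three models for 2 horses and 3 finishing positions
p_lstm = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_cnn = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p_transformer = np.array([[0.2, 0.7, 0.1], [0.3, 0.5, 0.2]])
probs = [p_lstm, p_cnn, p_transformer]

avg = average_probs(probs)
weighted = average_probs(probs, weights=[0.5, 0.3, 0.2])  # assumed weights
voted = majority_vote(probs)
print(avg.argmax(axis=1), weighted.argmax(axis=1), voted)
```

Averaging keeps a full probability distribution per horse, which is why it fits the stated goal better than voting, which returns only a class index.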

Implementation Steps

1. Data Preparation

	•	Sequences:
	•	Use the raw GPS data (gpspoint) to create sequences for each horse in each race.
	•	Ensure that sequences are properly sorted by time_stamp.
	•	Features:
	•	Include raw features such as speed, progress, stride_frequency.
	•	Avoid hand-engineering features like acceleration to adhere to your objective.
	•	Labels:
	•	Use official_fin from results_entries as the target variable.
	•	Since you want probabilities for each finishing position, consider encoding official_fin as categorical labels.
