# Proposed Ensemble Models

Given the constraints and objectives, I recommend considering the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: 1D-CNNs learn local temporal patterns and, when layers are stacked, patterns at progressively longer time scales. This makes them effective at picking up features such as sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.
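To make the 1D-CNN idea in Model 2 concrete, here is a minimal pure-Python sketch of one convolutional filter sliding across the time dimension. The kernel `[-1.0, 1.0]` and the speed values are illustrative assumptions, not learned weights or real GPS data; a trained 1D-CNN would learn many such kernels on its own.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` across `signal` with no padding (cross-correlation,
    which is what deep-learning libraries usually call convolution)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical speed samples for one horse, one value per GPS tick
speeds = [16.0, 16.1, 16.0, 17.5, 17.6, 17.5]

# A hand-picked difference kernel: responds strongly to sudden changes
diff_kernel = [-1.0, 1.0]

responses = conv1d_valid(speeds, diff_kernel)

# Index of the strongest response, i.e. the sharpest change between ticks
peak_index = max(range(len(responses)), key=lambda i: abs(responses[i]))
print(peak_index, responses[peak_index])
```

The strongest response lands on the tick where the speed jumps, which is the kind of local pattern the filter is meant to detect.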

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	Similar to LSTMs but with a simpler architecture (two gates instead of three), GRUs have fewer parameters, typically train faster, and may perform comparably or better on some datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.
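As a rough illustration of the causal convolutions a TCN relies on, the sketch below computes each output from the current and earlier time steps only. The kernel weights, dilation, and input values are assumptions chosen for readability, and residual connections are left out to keep the example short.

```python
def causal_dilated_conv(signal, kernel, dilation=1):
    """Output at time t uses signal[t], signal[t - d], signal[t - 2d], ...
    Positions before the start of the sequence count as zero (left padding)."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for j, weight in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:
                total += weight * signal[idx]
        out.append(total)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv(x, kernel=[0.5, 0.5], dilation=2)

# Causality check: changing the last input must not change earlier outputs
x_changed = x[:-1] + [100.0]
y_changed = causal_dilated_conv(x_changed, kernel=[0.5, 0.5], dilation=2)
print(y, y[:-1] == y_changed[:-1])
```

Stacking such layers with growing dilation is what lets a TCN reach far back in the sequence without looking ahead.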


### Validate GPU Setup

In [3]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.17.0


In [4]:
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs Available:", len(physical_devices))

Num GPUs Available: 2


In [5]:
for device in physical_devices:
    print(device)

PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')


#### Make sure JAVA_HOME points to Java 11: source .zshrc in the same folder from which you start Jupyter

In [6]:
!echo $JAVA_HOME
!java --version

/usr/lib/jvm/java-11-openjdk
openjdk 11.0.25 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.25.0.9-1) (build 11.0.25+9-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.25.0.9-1) (build 11.0.25+9-LTS, mixed mode, sharing)


# The LSTM Network on Raw GPS Data

Initially I wanted to merge the GPS data with the Sectionals data, but GPS points are keyed by time_stamp while sectionals are keyed by gate_name, and the differing intervals made it difficult to align the two into the sequences that Long Short-Term Memory (LSTM) models require. I therefore decided on an ensemble approach. Additional models will incorporate Equibase data as well, but for the time being the focus is on the Total Performance GPS data.

In [7]:
# Environment Setup

import os
import logging
import configparser
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min as spark_min, sum as spark_sum, from_utc_timestamp, expr
from pyspark.sql.window import Window

# Load configuration file
config = configparser.ConfigParser()
config.read('/home/exx/myCode/horse-racing/FoxRiverAIRacing/config.ini')

# Database credentials from config
db_host = config['database']['host']
db_port = config['database']['port']
db_name = config['database']['dbname']
db_user = config['database']['user']
db_password = os.getenv("DB_PASSWORD")  # Read from the environment; no hard-coded fallback, so the check below can fire

# Validate database password
if not db_password:
    raise ValueError("Database password is missing. Set it in the DB_PASSWORD environment variable.")

# JDBC URL and properties
jdbc_url = f"jdbc:postgresql://{db_host}:{db_port}/{db_name}"
jdbc_properties = {
    "user": db_user,
    "password": db_password,
    "driver": "org.postgresql.Driver"
}

# Path to JDBC driver
jdbc_driver_path = "/home/exx/myCode/horse-racing/FoxRiverAIRacing/jdbc/postgresql-42.7.4.jar"

# Configure logging
log_file = "/home/exx/myCode/horse-racing/FoxRiverAIRacing/logs/SparkPy_load.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler()
    ]
)
logging.info("Environment setup initialized.")

# Initialize Spark session
def initialize_spark():
    spark = SparkSession.builder \
        .appName("Horse Racing Data Processing") \
        .config("spark.driver.extraClassPath", jdbc_driver_path) \
        .config("spark.executor.extraClassPath", jdbc_driver_path) \
        .config("spark.driver.memory", "64g") \
        .config("spark.executor.memory", "32g") \
        .config("spark.executor.memoryOverhead", "8g") \
        .config("spark.sql.debug.maxToStringFields", "1000") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY") \
        .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    logging.info("Spark session created successfully.")
    return spark

# Initialize Spark
spark = initialize_spark()

2024-12-06 20:45:51,531 - INFO - Environment setup initialized.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/06 20:45:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2024-12-06 20:45:53,440 - INFO - Spark session created successfully.


In [None]:
# Data Loading and Transformation

from pyspark.sql.functions import col, min as spark_min, sum as spark_sum, unix_timestamp, to_timestamp, expr
from pyspark.sql.window import Window

# Define SQL queries without trailing semicolons
queries = {
    "results": """
        SELECT vre.course_cd, vre.race_date, vre.race_number, vre.program_num AS saddle_cloth_number, vre.post_pos,
               h.horse_id, vre.official_fin, vre.finish_time, vre.speed_rating, vr.todays_cls,
               vr.previous_surface, vr.previous_class, vr.net_sentiment
        FROM v_results_entries vre
        JOIN v_runners vr 
            ON vre.course_cd = vr.course_cd
            AND vre.race_date = vr.race_date
            AND vre.race_number = vr.race_number
            AND vre.program_num = vr.saddle_cloth_number
        JOIN horse h 
            ON vre.axciskey = h.axciskey
        WHERE vre.breed = 'TB'
        GROUP BY vre.course_cd, vre.race_date, vre.race_number, vre.program_num, vre.post_pos,
                 h.horse_id, vre.official_fin, vre.finish_time, vre.speed_rating, vr.todays_cls,
                 vr.previous_surface, vr.previous_class, vr.net_sentiment
    """,
    "sectionals": """
        SELECT course_cd, race_date, race_number, saddle_cloth_number, gate_name, 
               gate_numeric, length_to_finish, sectional_time, running_time, 
               distance_back, distance_ran, number_of_strides
        FROM v_sectionals
    """,
    "gpspoint": """
        SELECT course_cd, race_date, race_number, saddle_cloth_number, time_stamp, 
               longitude, latitude, speed, progress, stride_frequency, post_time, location
        FROM v_gpspoint
    """
}

# Path to save Parquet files
parquet_dir = "/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/"
os.makedirs(parquet_dir, exist_ok=True)

# Load data directly from PostgreSQL into Spark DataFrames and save as Parquet
dfs = {}
for name, query in queries.items():
    logging.info(f"Loading {name} data from PostgreSQL...")
    try:
        df = spark.read.jdbc(url=jdbc_url, table=f"({query}) AS subquery", properties=jdbc_properties)
        output_path = os.path.join(parquet_dir, f"{name}.parquet")
        logging.info(f"Saving {name} DataFrame to Parquet at {output_path}...")
        df.write.mode("overwrite").parquet(output_path)
        dfs[name] = df
        logging.info(f"{name} data loaded and saved successfully.")
    except Exception as e:
        logging.error(f"Error loading {name} data: {e}")
        raise

# Reload Parquet files into Spark DataFrames for processing
logging.info("Reloading Parquet files into Spark DataFrames for transformation...")
results_df = spark.read.parquet(os.path.join(parquet_dir, "results.parquet"))
sectionals_df = spark.read.parquet(os.path.join(parquet_dir, "sectionals.parquet"))
gps_df = spark.read.parquet(os.path.join(parquet_dir, "gpspoint.parquet"))
logging.info("Parquet files reloaded successfully.")



2024-12-06 20:45:54,482 - INFO - Loading results data from PostgreSQL...
2024-12-06 20:45:55,637 - INFO - Saving results DataFrame to Parquet at /home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/results.parquet...
2024-12-06 20:45:58,430 - INFO - results data loaded and saved successfully.    
2024-12-06 20:45:58,431 - INFO - Loading sectionals data from PostgreSQL...
2024-12-06 20:45:58,451 - INFO - Saving sectionals DataFrame to Parquet at /home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/sectionals.parquet...
2024-12-06 20:46:07,526 - INFO - sectionals data loaded and saved successfully. 
2024-12-06 20:46:07,527 - INFO - Loading gpspoint data from PostgreSQL...
2024-12-06 20:46:07,547 - INFO - Saving gpspoint DataFrame to Parquet at /home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/gpspoint.parquet...

In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType
from datetime import timedelta

# Define the UDF to add seconds (including fractional seconds) to a timestamp
def add_seconds(ts, seconds):
    if ts is None or seconds is None:
        return None
    return ts + timedelta(seconds=seconds)

# Register the UDF
add_seconds_udf = udf(add_seconds, TimestampType())

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    unix_timestamp,
    expr,
    min as spark_min,
    sum as spark_sum,
    date_format
)
from pyspark.sql.window import Window
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import udf
from datetime import timedelta

# Initialize Spark session
spark = SparkSession.builder \
    .appName("HorseRacingDataProcessing") \
    .getOrCreate()

# Clear all cached data
spark.catalog.clearCache()

# Reload the DataFrames from Parquet files
gps_df = spark.read.parquet("/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/gpspoint.parquet")
sectionals_df = spark.read.parquet("/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/sectionals.parquet")

# Convert time_stamp to timestamp type
gps_df = gps_df.withColumn("time_stamp", col("time_stamp").cast("timestamp"))

# Print a sample of the time_stamp column to check for millisecond precision
print("Sample of gps_df time_stamp column:")
gps_df.select("time_stamp").show(10, truncate=False)


In [None]:
# Step 1: Calculate the earliest 'time_stamp' for each race
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]

first_time_df = gps_df.groupBy(*race_id_cols).agg(
    spark_min("time_stamp").alias("earliest_time_stamp")
)

# Step 2: Join 'first_time_df' with 'sectionals_df' to associate each sectional with the race's start time
sectionals_df = sectionals_df.join(
    first_time_df,
    on=race_id_cols,
    how="left"
)

# Step 3: Sort 'sectionals_df' by 'gate_numeric' to ensure correct order of gates
sectionals_df = sectionals_df.orderBy(*race_id_cols, "gate_numeric")

# Step 4: Define the window specification for cumulative sum
window_spec = Window.partitionBy(*race_id_cols).orderBy("gate_numeric").rowsBetween(Window.unboundedPreceding, 0)

# Step 5: Compute cumulative sum of 'sectional_time' for each race
sectionals_df = sectionals_df.withColumn(
    "cumulative_sectional_time",
    spark_sum("sectional_time").over(window_spec)
)

In [None]:
# Step 6: Define the UDF to add seconds (including fractional seconds) to a timestamp
def add_seconds(ts, seconds):
    if ts is None or seconds is None:
        return None
    return ts + timedelta(seconds=seconds)

# Register the UDF
add_seconds_udf = udf(add_seconds, TimestampType())

# Step 7: Create 'sec_time_stamp' by adding 'cumulative_sectional_time' to 'earliest_time_stamp' using the UDF
sectionals_df = sectionals_df.withColumn(
    "sec_time_stamp",
    add_seconds_udf(col("earliest_time_stamp"), col("cumulative_sectional_time"))
)

# Step 8: Drop intermediate columns if no longer needed
sectionals_df = sectionals_df.drop("earliest_time_stamp", "cumulative_sectional_time")

# Show a sample of the results
print("Sample of sectionals_df with sec_time_stamp:")
sectionals_df.select(
    "course_cd",
    "race_date",
    "race_number",
    "saddle_cloth_number",
    "gate_numeric",
    "gate_name",
    "sectional_time",
    "sec_time_stamp"
).show(10, truncate=False)

# Now, proceed with the join to create matched_df

from pyspark.sql.functions import abs

In [None]:
# Step 9: Convert 'time_stamp' and 'sec_time_stamp' to milliseconds since epoch to preserve sub-second precision
gps_with_ms = gps_df.withColumn(
    "time_stamp_ms",
    (col("time_stamp").cast("double") * 1000).cast("long")
)

sectionals_with_ms = sectionals_df.withColumn(
    "sec_time_stamp_ms",
    (col("sec_time_stamp").cast("double") * 1000).cast("long")
)

# Step 10: Define the join condition with time window (±1000 milliseconds)
join_condition = (
    (gps_with_ms.course_cd == sectionals_with_ms.course_cd) &
    (gps_with_ms.race_date == sectionals_with_ms.race_date) &
    (gps_with_ms.race_number == sectionals_with_ms.race_number) &
    (gps_with_ms.saddle_cloth_number == sectionals_with_ms.saddle_cloth_number) &
    (abs(gps_with_ms.time_stamp_ms - sectionals_with_ms.sec_time_stamp_ms) <= 1000)
)

# Step 11: Perform the left join based on the join condition
matched_df = gps_with_ms.join(
    sectionals_with_ms,
    on=join_condition,
    how="left"
).select(
    gps_with_ms["*"],
    sectionals_with_ms["sec_time_stamp"],
    sectionals_with_ms["gate_numeric"],
    sectionals_with_ms["gate_name"],
    sectionals_with_ms["sectional_time"]
)

In [None]:
# Step 12: Verify the matched records
print("Sample of matched_df (All gps_df records with sectional data where available):")
matched_df.select(
    *race_id_cols,
    "time_stamp",
    "sec_time_stamp",
    "gate_numeric",
    "gate_name",
    "sectional_time"
).show(10, truncate=False)

# Step 13: Show the total number of rows in matched_df, gps_df, and sectionals_df
matched_count = matched_df.count()
print(f"Total number of rows in matched_df: {matched_count}")

gps_count = gps_df.count()
print(f"Total number of rows in gps_df: {gps_count}")

sectionals_count = sectionals_df.count()
print(f"Total number of rows in sectionals_df: {sectionals_count}")


In [None]:
# Step 14: Show the number of unmatched sectional records

# Extract matched sectional records
matched_sectionals = matched_df.select(
    *race_id_cols,
    "sec_time_stamp"
).distinct()

# Perform a left anti join to find sectionals not present in matched_sectionals
unmatched_sectionals_df = sectionals_df.join(
    matched_sectionals,
    on=race_id_cols + ["sec_time_stamp"],
    how="leftanti"
)

# Count the number of unmatched sectional records
unmatched_sectionals_count = unmatched_sectionals_df.count()
print(f"Number of unmatched sectional records: {unmatched_sectionals_count}")

# Stop the Spark session
spark.stop()

In [None]:
sectionals_df.columns

In [None]:

# Step 1c: Compute Time Zone Offset (delta_time_seconds) as the difference between 'first_time_stamp' and 'post_time'
# Assuming 'post_time' is in local time and 'first_time_stamp' is in UTC
# delta_time_seconds = first_time_stamp (UTC) - post_time (local) => time_zone_offset in seconds

logging.info("Computing time zone offset (delta_time_seconds) for each race...")
time_zone_offset_df = first_time_with_post_df.withColumn(
    "delta_time_seconds",
    (unix_timestamp("first_time_stamp") - unix_timestamp("post_time"))
)

# Step 1d: Join 'time_zone_offset_df' back to 'gps_df' and 'sectionals_df'
logging.info("Joining 'time_zone_offset_df' back to 'gps_df' and 'sectionals_df'...")
gps_df = gps_df.join(
    time_zone_offset_df.select(
        "course_cd", "race_date", "race_number", "saddle_cloth_number", "delta_time_seconds"
    ),
    on=["course_cd", "race_date", "race_number", "saddle_cloth_number"],
    how="left"
)

sectionals_df = sectionals_df.join(
    time_zone_offset_df.select(
        "course_cd", "race_date", "race_number", "saddle_cloth_number", "delta_time_seconds"
    ),
    on=["course_cd", "race_date", "race_number", "saddle_cloth_number"],
    how="left"
)

# Step 2: Compute 'time_stamp_local' by adjusting 'time_stamp' with 'delta_time_seconds'
# 'time_stamp_local' = 'time_stamp' - 'delta_time_seconds'

logging.info("Computing 'time_stamp_local' by adjusting 'time_stamp' with 'delta_time_seconds'...")
gps_df = gps_df.withColumn(
    "time_stamp_local",
    # Cast to double instead of unix_timestamp() so fractional seconds survive the shift
    expr("cast((cast(time_stamp as double) - delta_time_seconds) as timestamp)")
)

# Step 3: Calculate the minimum 'time_stamp_local' for each race
logging.info("Calculating minimum 'time_stamp_local' for each race...")
min_time_df = gps_df.groupBy("course_cd", "race_date", "race_number", "saddle_cloth_number") \
    .agg(spark_min("time_stamp_local").alias("min_time_stamp_local"))

# Step 4: Join 'min_time_df' with 'sectionals_df' to associate each sectional with the race's start time
logging.info("Joining 'min_time_df' with 'sectionals_df' to associate each sectional with the race's start time...")
sectionals_df = sectionals_df.join(
    min_time_df,
    on=["course_cd", "race_date", "race_number", "saddle_cloth_number"],
    how="left"
)

# Step 5: Sort 'sectionals_df' within each race by 'gate_numeric' to maintain sectional order
# (Assuming 'gate_numeric' determines the sequence of sectionals)

# Step 6: Compute cumulative 'sectional_time' within each race to determine when each sectional occurs
logging.info("Computing cumulative 'sectional_time' for each race...")
window_spec = Window.partitionBy(
    "course_cd", "race_date", "race_number", "saddle_cloth_number"
).orderBy("gate_numeric") \
 .rowsBetween(Window.unboundedPreceding, Window.currentRow)

sectionals_df = sectionals_df.withColumn(
    "sectional_time_cumulative",
    spark_sum("sectional_time").over(window_spec)
)

# Step 7: Create 'sec_time_stamp' by adding cumulative 'sectional_time' to 'min_time_stamp_local'
# Assuming 'sectional_time_cumulative' is in seconds

logging.info("Creating 'sec_time_stamp' by adding cumulative 'sectional_time' to 'min_time_stamp_local'...")
sectionals_df = sectionals_df.withColumn(
    "sec_time_stamp",
    # Cast to double instead of unix_timestamp() so the start time keeps its fractional seconds
    expr("cast((cast(min_time_stamp_local as double) + sectional_time_cumulative) as timestamp)")
)

# Step 8: Drop intermediate columns if not needed
sectionals_df = sectionals_df.drop("min_time_stamp_local", "sectional_time_cumulative", "delta_time_seconds")

logging.info("'sec_time_stamp' created successfully.")



In [None]:

# Optional: Display sample data to verify 'sec_time_stamp'
logging.info("Displaying sample 'sec_time_stamp' values:")
sectionals_df.select(
    "course_cd", "race_date", "race_number", "saddle_cloth_number",
    "gate_numeric", "sectional_time", "sec_time_stamp"
).show(30, truncate=False)

### Explanation of the Corrected Steps

1.	Calculate Time Zone Offset (delta_time_seconds):

>    •	Objective: Determine the difference between the earliest GPS time_stamp (in UTC) and the post_time (scheduled start time in local time) for each race.

>    •	Implementation:

>    •	Step 1a: Group gps_df by race identifiers and find the earliest time_stamp (first_time_stamp) for each race.

>    •	Step 1b: Join first_time_df with gps_df to retrieve the corresponding post_time for each race.

>    •	Step 1c: Compute delta_time_seconds as the difference between first_time_stamp and post_time in seconds using unix_timestamp.

2.	Adjust time_stamp to time_stamp_local:

>    •	Objective: Convert GPS time_stamp from UTC to local time using the computed delta_time_seconds.

>    •	Implementation: Subtract delta_time_seconds from time_stamp to obtain time_stamp_local.
	
3.	Compute sec_time_stamp:

>    •	Objective: Assign a local timestamp to each sectional by adding cumulative sectional_time to the race’s start time (min_time_stamp_local).

>    •	Implementation:
	
>    •	Step 3a: Group gps_df by race identifiers and compute the minimum time_stamp_local (min_time_stamp_local) for each race.

>    •	Step 3b: Join min_time_df with sectionals_df to associate each sectional with the race’s start time.

>    •	Step 3c: Define a window specification to sort sectionals_df by gate_numeric within each race.

>    •	Step 3d: Compute the cumulative sum of sectional_time (sectional_time_cumulative) ordered by gate_numeric.

>    •	Step 3e: Create sec_time_stamp by adding sectional_time_cumulative to min_time_stamp_local.

4.	Cleanup:

>    •	Objective: Remove intermediate columns that are no longer needed to maintain a clean DataFrame.

>    •	Implementation: Drop columns such as min_time_stamp_local, sectional_time_cumulative, and delta_time_seconds.

5.	Validation:

>    •	Objective: Ensure that sec_time_stamp has been correctly computed.

>    •	Implementation: Display a sample of the sec_time_stamp values to verify correctness.
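Steps 1a through 2 above can be sketched in plain Python on a toy example, without Spark. The race key `("TRK", "2024-01-01", 1, "3")` and all timestamps are hypothetical values invented for illustration; only the column names come from the notebook.

```python
from datetime import datetime, timedelta

race_key = ("TRK", "2024-01-01", 1, "3")  # hypothetical race identifiers

# Hypothetical GPS rows: (race key, time_stamp in UTC, post_time in local time)
gps_rows = [
    (race_key, datetime(2024, 1, 1, 17, 30, 2), datetime(2024, 1, 1, 12, 30, 0)),
    (race_key, datetime(2024, 1, 1, 17, 30, 0), datetime(2024, 1, 1, 12, 30, 0)),
    (race_key, datetime(2024, 1, 1, 17, 30, 1), datetime(2024, 1, 1, 12, 30, 0)),
]

# Steps 1a/1b: earliest time_stamp per race key, together with its post_time
first_seen = {}
for key, ts, post_time in gps_rows:
    if key not in first_seen or ts < first_seen[key][0]:
        first_seen[key] = (ts, post_time)

# Step 1c: delta_time_seconds = first_time_stamp - post_time
delta_time_seconds = {
    key: (first_ts - post_time).total_seconds()
    for key, (first_ts, post_time) in first_seen.items()
}

# Step 2: time_stamp_local = time_stamp - delta_time_seconds
time_stamp_local = [
    ts - timedelta(seconds=delta_time_seconds[key]) for key, ts, _ in gps_rows
]

print(delta_time_seconds[race_key], min(time_stamp_local))
```

Note that with this construction the earliest local timestamp of each race coincides with its post_time, so the delta also absorbs any gap between the scheduled start and the first GPS fix.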

### Recap of What Was Done

	1.	Grouped and Sorted DataFrames:
	•	Both gps_df and sectionals_df were grouped by race identifiers (course_cd, race_date, race_number, saddle_cloth_number).
	•	gps_df was sorted by time_stamp to identify the earliest timestamp per race.
	2.	Calculated Time Zone Offset (delta_time_seconds):
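	•	Computed the difference between the earliest GPS time_stamp (in UTC) and post_time (scheduled race start time in local time) for each race.
	•	This difference is used as the time zone offset in seconds. It also absorbs any gap between the scheduled post_time and the first GPS fix, so it is an approximation rather than an exact offset.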
	3.	Adjusted time_stamp to time_stamp_local:
	•	Subtracted delta_time_seconds from each time_stamp to obtain time_stamp_local, representing the local time.
	4.	Computed sec_time_stamp:
	•	Calculated the cumulative sum of sectional_time ordered by gate_numeric within each race.
	•	Added this cumulative sectional_time to the race’s start time (min_time_stamp_local) to assign a precise local timestamp (sec_time_stamp) to each sectional.
	5.	Cleaned Up DataFrame:
	•	Removed intermediate columns to maintain a focused and clean sectionals_df.
	6.	Validation:
	•	Displayed a sample of sec_time_stamp values to ensure correctness.



### Next Steps: Merging the Datasets

#### Implementation Steps

We’ll implement this strategy using Spark’s DataFrame operations and window functions.

1. Prepare the Sectionals DataFrame: ensure that each sectional has a unique identifier, which simplifies the joins and window functions that follow.

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

# Add a unique identifier to each sectional
sectionals_df = sectionals_df.withColumn("sectional_id", monotonically_increasing_id())

### Join Sectionals with GPS Points Within ±1 Second

We’ll perform a range join where GPS points are within ±1 second of the sectional’s sec_time_stamp.
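The logic of such a range join can be illustrated in plain Python before running it at scale in Spark. This is a minimal sketch under assumed toy data: the dictionary layout, the `race_key` field, and every timestamp are hypothetical, and the nested loop stands in for the join that Spark would distribute.

```python
from datetime import datetime, timedelta

def range_join(sectionals, gps_points, tolerance=timedelta(seconds=1)):
    """Pair each sectional with every GPS point of the same race whose
    time_stamp lies within +/- tolerance of the sectional's sec_time_stamp."""
    pairs = []
    for sec in sectionals:
        for gps in gps_points:
            if gps["race_key"] != sec["race_key"]:
                continue
            if abs(gps["time_stamp"] - sec["sec_time_stamp"]) <= tolerance:
                pairs.append((sec["sectional_id"], gps["time_stamp"]))
    return pairs

base = datetime(2024, 1, 1, 12, 30, 0)
race_a = ("TRK", "2024-01-01", 1, "3")  # hypothetical race identifiers
race_b = ("TRK", "2024-01-01", 2, "3")

sectionals = [
    {"sectional_id": 7, "race_key": race_a,
     "sec_time_stamp": base + timedelta(seconds=12.4)},
]
gps_points = [
    {"race_key": race_a, "time_stamp": base + timedelta(seconds=11)},
    {"race_key": race_a, "time_stamp": base + timedelta(seconds=12)},
    {"race_key": race_a, "time_stamp": base + timedelta(seconds=13)},
    {"race_key": race_b, "time_stamp": base + timedelta(seconds=12.4)},
]

matches = range_join(sectionals, gps_points)
print(matches)
```

A single sectional can match several GPS points inside the window, which is why a later step has to pick the closest one.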

In [None]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType
from datetime import timedelta

# Define the UDF to subtract hours from a timestamp
def subtract_hours(ts, dh):
    if ts is None or dh is None:
        return None
    return ts - timedelta(hours=dh)

# Register the UDF
subtract_hours_udf = udf(subtract_hours, TimestampType())

In [None]:
# Step 3: Subtract 'delta_hours' hours from 'time_stamp' to get 'time_stamp_local' using the UDF
gps_df = gps_df.withColumn(
    "time_stamp_local",
    subtract_hours_udf(col("time_stamp"), col("delta_hours"))
).drop("delta_seconds", "delta_hours")

# Optional: Select relevant columns to verify the transformation
gps_df.select("time_stamp", "post_time", "time_stamp_local").show(10, truncate=False)

### Removed all time stamps other than time_stamp which is now converted to local time

In [None]:
from pyspark.sql.functions import col

# Step 5: Set 'time_stamp' to 'time_stamp_local' and remove 'post_time' and 'time_stamp_local'
gps_df = gps_df.withColumn("time_stamp", col("time_stamp_local")) \
               .drop("post_time", "time_stamp_local")

# Optional: Verify the transformation
gps_df.select("time_stamp").show(10, truncate=False)

### Group and Sort gps_df and sectionals_df

In [None]:
from pyspark.sql.functions import col

# Define race identifier columns
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]

# Step 1: Group and Sort `gps_df` by race and `time_stamp`
gps_sorted_df = gps_df.orderBy(
    *[col(col_name).asc() for col_name in race_id_cols],
    col("time_stamp").asc()
)

# Optional: Verify the sorting
gps_sorted_df.select(*race_id_cols, "time_stamp").show(10, truncate=False)

In [None]:
# Step 2: Group and Sort `sectionals_df` by race and `gate_numeric`
sectionals_sorted_df = sectionals_df.orderBy(
    *[col(col_name).asc() for col_name in race_id_cols],
    col("gate_numeric").asc()
)

# Optional: Verify the sorting
sectionals_sorted_df.select(*race_id_cols, "gate_numeric").show(10, truncate=False)

1.	Create the sec_time_stamp column in sectionals_df.

2.	For each race, take the first time_stamp from gps_df, add the first sectional_time to it, and populate the sec_time_stamp column.

3.	Continue adding sectional_time to the previous sec_time_stamp for each subsequent row within the race.

4.	Apply this process for every race in the dataset.
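For a single race, the four steps above amount to a running sum added to a base timestamp. Here is a minimal sketch in plain Python; the start time and the sectional times are made-up values, assumed to be in seconds and already ordered by gate_numeric.

```python
from datetime import datetime, timedelta
from itertools import accumulate

# Hypothetical earliest GPS time_stamp for one race
first_gps_time = datetime(2024, 1, 1, 12, 30, 0)

# Hypothetical sectional_time values in seconds, ordered by gate_numeric
sectional_times = [6.5, 6.25, 6.0]

# Running total of sectional_time within the race
cumulative = list(accumulate(sectional_times))

# sec_time_stamp = first GPS time_stamp + cumulative sectional_time
sec_time_stamps = [first_gps_time + timedelta(seconds=s) for s in cumulative]

for total, stamp in zip(cumulative, sec_time_stamps):
    print(total, stamp.isoformat())
```

Because `timedelta` accepts fractional seconds, the sub-second part of each sectional_time is preserved in the resulting timestamps.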


### Change precision of time stamps

In [None]:
# define a UDF that takes a timestamp and a number of seconds
from pyspark.sql.functions import col, udf
from pyspark.sql.types import TimestampType
from datetime import timedelta

# Define the UDF to add seconds (including fractional seconds) to a timestamp
def add_seconds(ts, seconds):
    if ts is None or seconds is None:
        return None
    return ts + timedelta(seconds=seconds)

# Register the UDF
add_seconds_udf = udf(add_seconds, TimestampType())

In [None]:
# Step 5 and 6: Compute 'sec_time_stamp' using the UDF and preserve sub-second precision
sectionals_final_df = sectionals_with_cum_sum_df.withColumn(
    "sec_time_stamp",
    add_seconds_udf(col("base_time"), col("cumulative_sectional_time"))
).drop("base_time", "cumulative_sectional_time", "sec_time_stamp_unix")

# Step 7: Verify the result by displaying sample data
sectionals_final_df.select(
    *race_id_cols, "gate_numeric", "sectional_time", "sec_time_stamp"
).show(10, truncate=False)

### Merge on time

In [None]:
from pyspark.sql.functions import col

# Define race identifier columns
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]

# Step 1.1: Sort `gps_df` by race identifiers and `time_stamp` in ascending order
gps_sorted_df = gps_df.orderBy(
    *[col(column).asc() for column in race_id_cols],
    col("time_stamp").asc()
)

# Step 1.2: Sort `sectionals_df` by race identifiers and `sec_time_stamp` in ascending order
sectionals_sorted_df = sectionals_df.orderBy(
    *[col(column).asc() for column in race_id_cols],
    col("sec_time_stamp").asc()
)

# Optional: Verify the sorting by displaying the first few rows of each DataFrame
print("Sorted GPS DataFrame:")
gps_sorted_df.select(*race_id_cols, "time_stamp").show(5, truncate=False)

print("Sorted Sectionals DataFrame:")
sectionals_sorted_df.select(*race_id_cols, "sec_time_stamp").show(5, truncate=False)

In [None]:
from pyspark.sql.functions import abs, col

# Define race identifier columns
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]

# Step 2.1: Convert 'time_stamp' and 'sec_time_stamp' to milliseconds since epoch to preserve sub-second precision
gps_with_ms = gps_sorted_df.withColumn(
    "time_stamp_ms",
    (col("time_stamp").cast("double") * 1000).cast("long")
)

sectionals_with_ms = sectionals_sorted_df.withColumn(
    "sec_time_stamp_ms",
    (col("sec_time_stamp").cast("double") * 1000).cast("long")
)

# Step 2.2: Define the join condition with time window (±1000 milliseconds)
join_condition = (
    (gps_with_ms.course_cd == sectionals_with_ms.course_cd) &
    (gps_with_ms.race_date == sectionals_with_ms.race_date) &
    (gps_with_ms.race_number == sectionals_with_ms.race_number) &
    (gps_with_ms.saddle_cloth_number == sectionals_with_ms.saddle_cloth_number) &
    (abs(gps_with_ms.time_stamp_ms - sectionals_with_ms.sec_time_stamp_ms) <= 1000)
)

# Step 2.3: Perform the left join based on the join condition
matched_df = gps_with_ms.join(
    sectionals_with_ms,
    on=join_condition,
    how="left"
).select(
    gps_with_ms["*"],
    sectionals_with_ms["sec_time_stamp"],
    sectionals_with_ms["gate_numeric"],
    sectionals_with_ms["sectional_time"]
)



In [None]:

# Optional: Verify the matched records
print("Sample of matched_df (All gps_df records with sectional data where available):")
matched_df.select(
    *race_id_cols,
    "time_stamp",
    "sec_time_stamp",
    "gate_numeric",
    "sectional_time"
).show(10, truncate=False)

In [None]:
# Query 1: Show the total number of rows in matched_df, gps_df, and sectionals_df

# Count of matched_df
matched_count = matched_df.count()
print(f"Total number of rows in matched_df: {matched_count}")

# Count of gps_df
gps_count = gps_df.count()
print(f"Total number of rows in gps_df: {gps_count}")

# Count of sectionals_df
sectionals_count = sectionals_df.count()
print(f"Total number of rows in sectionals_df: {sectionals_count}")

In [None]:
# Query 2: Show the number of unmatched sectional records

from pyspark.sql.functions import col

# Step 2.1: Extract distinct sectional records from matched_df
matched_sectionals = matched_df.select(
    *race_id_cols,
    "sec_time_stamp"
).distinct()

# Step 2.2: Perform a left anti join to find sectionals in sectionals_df not present in matched_sectionals
unmatched_sectionals_df = sectionals_sorted_df.join(
    matched_sectionals,
    on=race_id_cols + ["sec_time_stamp"],
    how="leftanti"
)

# Step 2.3: Count the number of unmatched sectional records
unmatched_sectionals_count = unmatched_sectionals_df.count()
print(f"Number of unmatched sectional records: {unmatched_sectionals_count}")

#### Explanation:

> •	Aliases: sec for sectionals_df and gps for gps_df to simplify column references.

> •	Join Conditions:

> •	Race Identifiers: Ensure that sectionals and GPS points belong to the same race.

> •	Time Window: GPS time_stamp_local must be within ±1 second of sec_time_stamp.

#### Calculate Absolute Time Difference

Compute the absolute difference between sec_time_stamp and time_stamp_local.

In [None]:
from pyspark.sql.functions import abs as spark_abs, col

# Calculate absolute time difference in seconds.
# Casting to double keeps fractional seconds; unix_timestamp() would truncate to whole seconds.
joined_df = joined_df.withColumn(
    "time_diff",
    spark_abs(col("sec.sec_time_stamp").cast("double") - col("gps.time_stamp_local").cast("double"))
)

#### Assign the Closest GPS Point to Each Sectional

Use window functions to rank GPS points based on time_diff and select the top-ranked GPS point for each sectional.

In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Define window specification to rank GPS points within each sectional
window_spec = Window.partitionBy("sec.sectional_id").orderBy("time_diff")

# Assign row numbers based on time difference
joined_df = joined_df.withColumn(
    "rank",
    row_number().over(window_spec)
)

# Filter to keep only the closest GPS point (rank = 1)
matched_sectionals_df = joined_df.filter(col("rank") == 1).select(
    "sec.*",
    "gps.time_stamp_local",
    "gps.longitude",
    "gps.latitude",
    "gps.speed",
    "gps.progress",
    "gps.stride_frequency",
    "gps.post_time",
    "gps.location",
    "gps.time_stamp",
    "time_diff"
)

#### Explanation:

> •	Window Specification: For each sectional_id, rank GPS points based on the smallest time_diff.

> •	Row Number: Assigns a unique row number starting from 1 within each window.
	
> •	Filtering: Retain only the GPS point with rank = 1, i.e., the closest GPS point.
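
To sanity-check the Spark result on a handful of rows, the same "closest point within ±1 second" rule can be sketched in plain Python. This is a minimal illustration, not the pipeline itself: the function name `closest_gps_point` and the sample timestamps are hypothetical, and it assumes the GPS timestamps for one horse in one race are already sorted.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def closest_gps_point(sec_time, gps_times, tolerance=timedelta(seconds=1)):
    """Return the index of the GPS timestamp nearest to sec_time,
    or None if no GPS point lies within the tolerance window.

    gps_times must be sorted in ascending order.
    """
    if not gps_times:
        return None
    pos = bisect_left(gps_times, sec_time)
    # Only the neighbours on either side of the insertion point can be closest
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(gps_times)]
    best = min(candidates, key=lambda i: abs(gps_times[i] - sec_time))
    return best if abs(gps_times[best] - sec_time) <= tolerance else None


# Hypothetical timestamps for one horse in one race
base = datetime(2024, 1, 1, 13, 0, 0)
gps_times = [base + timedelta(seconds=s) for s in (0.0, 1.0, 2.0, 3.0)]

matched = closest_gps_point(base + timedelta(seconds=1.4), gps_times)
unmatched = closest_gps_point(base + timedelta(seconds=10), gps_times)
print(matched, unmatched)
```

A sectional that returns `None` here corresponds to a row that would be missing from `matched_sectionals_df`.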

#### Handle Sectionals Without Matching GPS Points

Identify sectionals that did not find any matching GPS point within the ±1-second window.

In [None]:
# Keep every sectional column so unmatched rows can be inspected later
all_sectionals_df = sectionals_df

# Find matched sectionals
matched_sectionals_ids = matched_sectionals_df.select("sectional_id")

# Identify unmatched sectionals
unmatched_sectionals_df = all_sectionals_df.join(
    matched_sectionals_ids,
    on="sectional_id",
    how="left_anti"
)

# Optionally, log the number of unmatched sectionals
unmatched_count = unmatched_sectionals_df.count()
logging.warning(f"There are {unmatched_count} sectionals without a matching GPS point within ±1 second.")

#### Explanation:

> •	Left Anti Join: Retrieves sectionals that do not have a corresponding entry in matched_sectionals_df.

> •	Logging: Alerts you to the number of sectionals without a match, enabling further investigation if necessary.

#### Finalize the Matched Sectionals DataFrame

Clean up the matched_sectionals_df to include only necessary columns and make it ready for further processing.

In [None]:
# Select and rename columns as needed
final_matched_sectionals_df = matched_sectionals_df.select(
    "sectional_id",
    "course_cd",
    "race_date",
    "race_number",
    "saddle_cloth_number",
    "gate_name",
    "gate_numeric",
    "length_to_finish",
    "sectional_time",
    "running_time",
    "distance_back",
    "distance_ran",
    "number_of_strides",
    "sec_time_stamp",
    "time_stamp_local",    # GPS local time
    "longitude",
    "latitude",
    "speed",
    "progress",
    "stride_frequency",
    "post_time",
    "location",
    "time_stamp",          # Original GPS timestamp (UTC)
    "time_diff"
)

# Display sample matched sectionals
logging.info("Displaying sample matched sectionals with GPS data:")
final_matched_sectionals_df.show(10, truncate=False)

#### Unmatched Sectionals

In [None]:
# Display sample unmatched sectionals
if unmatched_count > 0:
    logging.warning("Displaying sample unmatched sectionals:")
    unmatched_sectionals_df.select("sectional_id", "course_cd", "race_date", "race_number", "saddle_cloth_number", "gate_name", "gate_numeric", "sectional_time", "sec_time_stamp").show(10, truncate=False)
else:
    logging.info("All sectionals have been successfully matched with GPS points.")

### 3. Merge with gps_df Based on Time Alignment

To integrate GPS data, you’ll need to align GPS points with the sectional times. This typically involves matching GPS timestamps with the corresponding sec_time_stamp to determine which sectional a GPS point belongs to.

Approach: Join on Race Identifiers, Then Align with Window Functions

Spark can evaluate range (non-equi) join conditions, but a join with no equality key cannot use a hash or sort-merge strategy and tends to be slow on large inputs. The method below therefore joins on the race identifiers first and then uses window functions to assign each GPS point to a sec_time_stamp.

In [None]:
from pyspark.sql.functions import abs as spark_abs

# Assuming gps_df has 'time_stamp_local' and sectionals_df has 'sec_time_stamp'

# First, join GPS points with sectional times based on race identifiers
logging.info("Joining 'merged_results_sectionals' with 'gps_df' based on race identifiers...")

joined_df = merged_results_sectionals.join(
    gps_df,
    on=["course_cd", "race_date", "race_number", "saddle_cloth_number"],
    how="inner"
)

logging.info("Joining complete. Now aligning GPS points with sectional times...")

# Define a window partitioned by horse within a race and ordered by sec_time_stamp
window_spec = Window.partitionBy(
    "course_cd", "race_date", "race_number", "saddle_cloth_number"
).orderBy("sec_time_stamp")

# Add a column that holds the closest sec_time_stamp for each GPS point
from pyspark.sql.functions import lag, lead

# For each GPS point, find the previous and next sec_time_stamp
joined_df = joined_df.withColumn(
    "prev_sec_time_stamp",
    lag("sec_time_stamp").over(window_spec)
).withColumn(
    "next_sec_time_stamp",
    lead("sec_time_stamp").over(window_spec)
)

# Define logic to assign GPS points to the nearest sectional
from pyspark.sql.functions import when

# Assign GPS points to the nearest sectional based on time
joined_df = joined_df.withColumn(
    "assigned_sec_time_stamp",
    when(
        (col("time_stamp_local") >= col("prev_sec_time_stamp")) & (col("time_stamp_local") < col("sec_time_stamp")),
        col("sec_time_stamp")
    ).otherwise(col("next_sec_time_stamp"))
)

# Clean up intermediate columns
joined_df = joined_df.drop("prev_sec_time_stamp", "next_sec_time_stamp")

logging.info("GPS points aligned with sectional times successfully.")
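
The `when` condition above reads as "a GPS point belongs to the sectional whose time stamp is the first one strictly after it". Assuming that is the intended rule, a minimal NumPy sketch makes the boundary behaviour explicit. The function name `assign_to_sectional` and the sample values are hypothetical, and timestamps are taken as epoch seconds.

```python
import numpy as np


def assign_to_sectional(gps_ts, sec_ts):
    """Map each GPS timestamp to the index of the first sectional timestamp
    that is strictly later, i.e. the sectional interval it falls in.

    Returns -1 for GPS points at or after the last sectional timestamp.
    Both inputs are epoch seconds; sec_ts must be sorted ascending.
    """
    gps_ts = np.asarray(gps_ts, dtype=float)
    sec_ts = np.asarray(sec_ts, dtype=float)
    # side="right" gives i such that sec_ts[i - 1] <= t < sec_ts[i]
    idx = np.searchsorted(sec_ts, gps_ts, side="right")
    idx[idx == len(sec_ts)] = -1  # flag points past the final sectional
    return idx


sec_ts = [10.0, 20.0, 30.0]                    # hypothetical sectional times
gps_ts = [5.0, 10.0, 19.9, 25.0, 30.0, 31.0]   # hypothetical GPS times
assignment = assign_to_sectional(gps_ts, sec_ts)
print(assignment.tolist())  # [0, 1, 1, 2, -1, -1]
```

Points flagged with -1 fall outside the range of sectional times and need an explicit decision, such as assigning them to the last sectional or setting them aside for review.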

### 4. Validate the Merged Data

It’s crucial to ensure that the merging process has correctly aligned the GPS points with the sectional times. Here are some steps to validate:

In [None]:
# Display sample data
logging.info("Displaying sample of the merged data...")
joined_df.select(
    "course_cd", "race_date", "race_number", "saddle_cloth_number",
    "official_fin", "finish_time", "speed_rating", "sec_time_stamp",
    "time_stamp_local", "longitude", "latitude", "speed",
    "assigned_sec_time_stamp", "gate_numeric"
).show(30, truncate=False)

# Check for any nulls in 'assigned_sec_time_stamp'
logging.info("Checking for nulls in 'assigned_sec_time_stamp'...")
null_count = joined_df.filter(col("assigned_sec_time_stamp").isNull()).count()
if null_count > 0:
    logging.warning(f"There are {null_count} GPS points without an assigned sectional time stamp.")
else:
    logging.info("All GPS points have been successfully assigned to sectional time stamps.")

In [None]:
# Function to display the shape of a Spark DataFrame
def display_shape(df, name):
    count = df.count()
    columns = len(df.columns)
    print(f"{name} DataFrame Shape: ({count}, {columns})")

# Display shapes of all DataFrames
display_shape(results_df, "results_df")
display_shape(sectionals_df, "sectionals_df")
display_shape(gps_df, "gps_df")

In [None]:
display_shape(joined_df, "joined_df")

### 5. Save the Final Merged DataFrame

Once validated, save the final merged DataFrame for downstream analysis or modeling.

In [None]:
# Save the merged DataFrame
final_output_path = os.path.join(parquet_dir, "final_merged_data.parquet")
logging.info(f"Saving final merged DataFrame to Parquet at {final_output_path}...")
joined_df.write.mode("overwrite").parquet(final_output_path)
logging.info("Final merged data saved successfully.")

### 6. Optional: Optimize for Performance

Depending on the size of your datasets, you might want to optimize the performance of your Spark jobs:
	•	Caching: If you need to reuse certain DataFrames multiple times, consider caching them.

In [None]:
merged_results_sectionals.cache()
gps_df.cache()

	•	Partitioning: Repartition your DataFrames on the join keys so that rows with the same keys land in the same partition before the join.

In [None]:
join_keys = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]
merged_results_sectionals = merged_results_sectionals.repartition(*join_keys)
gps_df = gps_df.repartition(*join_keys)

	•	Broadcast Joins: For smaller DataFrames, you can use broadcast joins to speed up the join process.

In [None]:
from pyspark.sql.functions import broadcast

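# Broadcast only a DataFrame that is small enough to fit in executor memory.
# gps_df is usually the largest input, so broadcast it only after it has been filtered down.
merged_df = merged_results_sectionals.join(
    broadcast(gps_df),
    on=["course_cd", "race_date", "race_number", "saddle_cloth_number"],
    how="inner"
)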

### 7. Summary of the Merging Process

	1.	Initial Join: Merge results_df with sectionals_df based on race identifiers.
	2.	GPS Alignment: Assign each GPS point to the nearest sec_time_stamp using window functions.
	3.	Validation: Ensure that all GPS points are correctly aligned without missing assignments.
	4.	Final Save: Persist the merged and enriched DataFrame for further use.

### 8. Final Recommendations

	•	Monitor Performance: Use Spark’s UI (http://<driver-node>:4040) to monitor job execution and identify any performance bottlenecks.
	•	Handle Edge Cases: Ensure that GPS points falling outside the range of sectional times are handled appropriately, possibly by assigning them to the nearest sectional or flagging them for review.
	•	Documentation: Keep detailed logs and documentation of each step to facilitate debugging and future maintenance.

Congratulations!

You’ve successfully set up a scalable and efficient data pipeline using PySpark, PostgreSQL, SQL, and Parquet. This architecture is well-suited for handling large datasets and performing complex transformations necessary for your horse racing analytics and modeling tasks.

# Data Preparation

## Sample data initially

A random row-level sample breaks up the time series for each horse, as an earlier attempt showed. Filtering on a date range keeps every sequence intact while still reducing the data volume.

In [None]:
# Filter data for a specific date range or course
#df_filtered = merged_df[merged_df['race_date'] >= '2024-01-01']

In [None]:
#df_filtered.shape

In [None]:
#print(df_filtered.isnull().sum())

## Check for missing data

In [None]:
# Check for missing values
print(df.isnull().sum())

In [None]:
#import seaborn as sns
#import matplotlib.pyplot as plt

# Heatmap of missing values (on a small sample)
#sns.heatmap(df.isnull(), cbar=False)
#plt.show()

## Imputation for Stride Frequency and number_of_strides


#### Group-Based Imputation: Impute based on groups, such as per horse.

In [None]:
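# saddle_cloth_number identifies a horse only within a race, so group on the full race key
race_horse_keys = ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
df['stride_frequency'] = df.groupby(race_horse_keys)['stride_frequency'].transform(lambda x: x.fillna(x.median()))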

#### gate_numeric remains the same within a group until it changes; rows where it is missing are dropped for now:

In [None]:
df.dropna(subset=['gate_numeric'], inplace=True)

#### Interpolation: distance_back changes over time; rows where it is missing are dropped for now

In [None]:
df.dropna(subset=['distance_back'], inplace=True)

#### Group-Based Imputation for number_of_strides; rows where it is missing are dropped for now

In [None]:
df.dropna(subset=['number_of_strides'], inplace=True)
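
The headings above mention forward-filling and interpolation, while the cells simply drop the affected rows. If dropping removes too much data, a per-horse fill is one alternative. The sketch below uses a hypothetical toy frame and groups only by `saddle_cloth_number` for brevity; on the real data the grouping would presumably use the full race key (course, date, race number, saddle cloth).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two horses in one race
toy = pd.DataFrame({
    "saddle_cloth_number": ["1", "1", "1", "2", "2", "2"],
    "gate_numeric": [1.0, np.nan, 2.0, np.nan, 1.0, np.nan],
    "distance_back": [0.0, np.nan, 2.0, 1.0, np.nan, 3.0],
})

# gate_numeric holds its value until the next gate: forward-fill within each horse
toy["gate_numeric"] = toy.groupby("saddle_cloth_number")["gate_numeric"].ffill()

# distance_back varies continuously: interpolate linearly within each horse
toy["distance_back"] = toy.groupby("saddle_cloth_number")["distance_back"].transform(
    lambda s: s.interpolate(method="linear")
)

# Rows that are still missing (no earlier value to carry forward) are dropped
toy = toy.dropna(subset=["gate_numeric", "distance_back"])
print(toy)
```

Only the first row of horse 2 is dropped here, because there is no earlier gate value to carry forward.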

## Choose Features

In [None]:
df.shape

In [None]:
print(df.isnull().sum())

In [None]:
feature_columns = [
    'speed',
    'progress',
    'stride_frequency',
    'number_of_strides',
    'post_pos',
    'gate_numeric',
    'length_to_finish',
    'sectional_time',
    'running_time',
    'distance_back',
    'distance_ran'
]

## Feature Engineering -- calculate additional features

In [None]:
df.sort_values(by=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp'], inplace=True)
df['acceleration'] = df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['speed'].diff() / df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['time_stamp'].diff().dt.total_seconds()

In [None]:
import numpy as np
df['acceleration'] = df['acceleration'].replace([np.inf, -np.inf], np.nan)
df['acceleration'] = df['acceleration'].fillna(0)

In [None]:
feature_columns.append('acceleration')

## Scale Features

Scaling helps in training neural networks.

In [None]:
# Note: Scaling should be done after sequences are created to avoid data leakage.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

## Create Sequences for LSTM

a. Group Data

Group the data to create sequences for each horse in each race.

In [None]:
group_columns = ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
# Group the same DataFrame that received the imputation and the acceleration feature
groups = df.groupby(group_columns)

## Create Sequences and Labels

In [None]:
sequences = []
labels = []

for name, group in groups:
    # Ensure group is sorted by time
    group = group.sort_values('time_stamp')

    # Extract features
    features = group[feature_columns].values

    # Append the sequence
    sequences.append(features)

    # Extract label (official finishing position)
    label = group['official_fin'].iloc[0]  # Assuming it's the same for all entries in the group
    labels.append(label)

## Determine max_seq_length and num_features

In [None]:
# Note: Alternatively, set a fixed max_seq_length to limit memory usage.
max_seq_length = max(len(seq) for seq in sequences)
num_features = len(feature_columns)


In [None]:
print(max_seq_length)
print(num_features)

## Pad Sequences

In [None]:
import tensorflow as tf
# from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=max_seq_length, padding='post', dtype='float32'
)

## Convert Labels

Adjust labels to start from 0 if they start from 1.

In [None]:
labels = np.array(labels).astype(int) - 1
num_classes = labels.max() + 1

## Scale Features

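Now scale the features. Fit the scaler only on the training data to prevent data leakage. The three cells below fit it on the full dataset as a quick check; the split-aware version follows in the next section.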

Flatten sequences for scaling:

In [None]:
num_samples = padded_sequences.shape[0]
X_flat = padded_sequences.reshape(-1, num_features)

## Fit scaler on the flattened data:

In [None]:
X_scaled_flat = scaler.fit_transform(X_flat)

## Reshape back to original shape:

In [None]:
X_scaled = X_scaled_flat.reshape(num_samples, max_seq_length, num_features)

# Split Data into Training, Validation, and Test Sets

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assume sequences and labels have been created and padded_sequences is available

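# Convert labels to zero-based integers.
# The guard keeps this cell safe to run after the "Convert Labels" cell above;
# without it, 1 would be subtracted twice.
labels = np.array(labels).astype(int)
if labels.min() >= 1:
    labels = labels - 1
num_classes = labels.max() + 1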

# Split data
X_temp, X_test, y_temp, y_test = train_test_split(
    padded_sequences, labels, test_size=0.1, random_state=42
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1, random_state=42
)

# Check shapes
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)
print("y_test shape:", y_test.shape)

# Scale features
scaler = StandardScaler()

# Flatten training data and fit scaler
num_samples_train = X_train.shape[0]
X_train_flat = X_train.reshape(-1, num_features)
X_train_scaled_flat = scaler.fit_transform(X_train_flat)
X_train_scaled = X_train_scaled_flat.reshape(num_samples_train, max_seq_length, num_features)

# Scale validation data
num_samples_val = X_val.shape[0]
X_val_flat = X_val.reshape(-1, num_features)
X_val_scaled_flat = scaler.transform(X_val_flat)
X_val_scaled = X_val_scaled_flat.reshape(num_samples_val, max_seq_length, num_features)

# Scale test data
num_samples_test = X_test.shape[0]
X_test_flat = X_test.reshape(-1, num_features)
X_test_scaled_flat = scaler.transform(X_test_flat)
X_test_scaled = X_test_scaled_flat.reshape(num_samples_test, max_seq_length, num_features)
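
One caveat with scaling padded sequences: the zero padding is included when the scaler is fitted, and after scaling the padded rows are no longer exactly 0.0, so a `Masking(mask_value=0.0)` layer would stop skipping them. A minimal NumPy sketch of a padding-aware alternative is shown below. The helper names `fit_masked_scaler` and `apply_masked_scaler` are hypothetical, and the sketch assumes that a timestep is padding exactly when all of its features equal the pad value.

```python
import numpy as np


def fit_masked_scaler(X_train, pad_value=0.0):
    """Compute per-feature mean and std from real (non-padded) timesteps only."""
    real = np.any(X_train != pad_value, axis=-1)   # (samples, timesteps)
    values = X_train[real]                          # (n_real_timesteps, features)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std


def apply_masked_scaler(X, mean, std, pad_value=0.0):
    """Standardise real timesteps and reset padded timesteps to pad_value."""
    real = np.any(X != pad_value, axis=-1)
    X_scaled = (X - mean) / std
    X_scaled[~real] = pad_value
    return X_scaled


# Hypothetical batch: 2 sequences, 3 timesteps, 2 features, post-padded with zeros
X_demo = np.array([
    [[1.0, 10.0], [2.0, 40.0], [0.0, 0.0]],
    [[6.0, 40.0], [0.0, 0.0], [0.0, 0.0]],
])
mean, std = fit_masked_scaler(X_demo)
X_demo_scaled = apply_masked_scaler(X_demo, mean, std)
print(mean)              # [ 3. 30.]
print(X_demo_scaled[1])  # only the first row is non-zero
```

The statistics would be fitted on the training split and then applied unchanged to the validation and test splits. A real timestep whose features all equal the training mean would scale to zeros and be masked as well; this is unlikely but worth keeping in mind.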

Ensure that X_train, X_val, X_test, y_train, y_val, and y_test are correctly shaped.

In [None]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

# Prepare Data for Model Training

## Training the LSTM Model

### Build the Model

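This model stacks three bidirectional LSTM layers (256, 128, and 64 units). Each is followed by layer normalization and dropout (rate 0.5), and each LSTM kernel carries L2 regularization (1e-4) to reduce overfitting.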

In [None]:
import tensorflow as tf

model_lstm = tf.keras.Sequential([
    tf.keras.Input(shape=(max_seq_length, num_features)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        256, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),    
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        128, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])


#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.LSTM(128),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#988/988 ━━━━━━━━━━━━━━━━━━━━ 7s 7ms/step - accuracy: 0.3606 - loss: 1.6184
#Test Loss: 1.6182985305786133, Test Accuracy: 0.36063656210899353

### Compile the Model

RMSprop is often a good choice for RNNs.

> •	The learning rate of 0.001 is a typical starting point.

> •	Recommendation: experiment with different learning rates (e.g., 0.0005, 0.0001) if needed. Alternatively, try the Adam optimizer and compare results.

In [None]:
# experimenting with different learning rates (e.g., 0.0005, 0.0001) to see if it affects convergence.

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)

model_lstm.compile(
    optimizer=optimizer,   # 'adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'] #,tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)



### Train the Model

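#### Hyperparameter Tuning

> •	Learning Rate Scheduler: ReduceLROnPlateau halves the learning rate when val_loss has not improved for 2 epochs, down to a floor of 1e-6.

> •	Early Stopping: training stops when val_loss has not improved for 5 epochs, and the best weights are restored.

> •	Model Checkpoint: the model with the lowest val_loss is saved to best_model.keras.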



In [None]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2, min_lr=1e-6
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)
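
To get a feel for how `factor=0.5` and `patience=2` interact, the plateau rule can be simulated in plain Python. This is a simplified sketch rather than the Keras implementation: it ignores the `min_delta` and `cooldown` options, and the function name `simulate_reduce_lr` and the loss values are hypothetical.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate in effect after each epoch under a simplified
    reduce-on-plateau rule (no min_delta, no cooldown)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau reached: shrink the learning rate, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses over six epochs
lr_history = simulate_reduce_lr([1.0, 0.9, 0.95, 0.93, 0.92, 0.94])
print(lr_history)
```

In this run the rate is halved after the fourth epoch and again after the sixth, because the best loss of 0.9 is never beaten.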

In [None]:
history = model_lstm.fit(
    X_train, y_train,
    epochs=50,  
    batch_size=128,  # 64,
    validation_data=(X_val, y_val),
    callbacks=[
        lr_scheduler, 
        early_stopping,
        model_checkpoint
    ]
)

### Evaluate the Model

In [None]:
test_loss, test_accuracy = model_lstm.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")

## Plot Training and Validation Loss and Accuracy:

In [None]:
import matplotlib.pyplot as plt

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.show()

### Check for Imbalance

In [None]:
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

In [None]:
plt.bar(unique, counts)
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()
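
If the counts turn out to be clearly uneven, one common option is to weight each class inversely to its frequency. The sketch below uses the usual "balanced" formula, n_samples / (n_classes * count). The function name `balanced_class_weights` and the example labels are hypothetical, and whether weighting helps on this dataset is untested.

```python
import numpy as np


def balanced_class_weights(y):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Hypothetical zero-based labels: class 0 is twice as frequent as classes 1 and 2
y_example = np.array([0, 0, 0, 0, 1, 1, 2, 2])
class_weights = balanced_class_weights(y_example)
print(class_weights)
```

The resulting dictionary can be passed to Keras through the `class_weight` argument of `model.fit`.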

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
corr_matrix = df[['speed', 'progress', 'stride_frequency', 'longitude', 'latitude', 'post_pos', 'official_fin']].corr()

# Plot heatmap
sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Define your variables
max_seq_length = 120  # Replace with your maximum sequence length
num_features = 5      # Replace with the actual number of features in your data
num_classes = 12      # Replace with the actual number of classes

# Build your model
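# Passing input_shape to a layer is deprecated in Keras 3 (bundled with TensorFlow 2.16+),
# so the input shape is declared with an explicit Input object instead.
model_lstm = tf.keras.Sequential()
model_lstm.add(tf.keras.Input(shape=(max_seq_length, num_features)))
model_lstm.add(tf.keras.layers.Masking(mask_value=0.0))
model_lstm.add(tf.keras.layers.LSTM(128))
model_lstm.add(tf.keras.layers.Dense(num_classes, activation='softmax'))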

model_lstm.summary()

In [None]:
# Load data into dataframe:

import pandas as pd

In [None]:
# Training

history_lstm = model_lstm.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping]
)

# Combining the Models

To create an ensemble, you can combine the predictions of these models in several ways:
	1.	Averaging Probabilities:
		•	Obtain probability distributions over finishing positions from each model.
		•	Average the probabilities across models to get the final prediction.
	2.	Weighted Averaging:
		•	Assign weights to each model based on validation performance.
		•	Compute a weighted average of the probabilities.
	3.	Stacking (Meta-Learner):
		•	Use the predictions from the individual models as input features to a meta-model (e.g., a logistic regression or another neural network).
		•	The meta-model learns how to best combine the individual predictions.
	4.	Voting (for Classification):
		•	If treating the problem as classification into discrete positions, use majority voting among the models.
		•	Not as suitable if you need probability distributions.
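
The averaging and voting options can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: each model is assumed to output an array of shape (n_samples, n_classes), and the function names, probabilities, and weights below are hypothetical. Stacking is not shown because it needs a trained meta-model.

```python
import numpy as np


def average_probs(prob_list, weights=None):
    """Combine per-model probability arrays of shape (n_samples, n_classes).

    With weights=None this is a plain average; otherwise the weights are
    normalised to sum to 1 and a weighted average is returned.
    """
    stacked = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    if weights is None:
        return stacked.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, stacked, axes=1)


def majority_vote(prob_list):
    """Return the most frequently predicted class per sample (ties go to the lowest class)."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=0)  # (n_models, n_samples)
    n_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)


# Hypothetical outputs of three models for 2 samples and 3 finishing positions
p_lstm = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p_cnn = np.array([[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]])
p_transformer = np.array([[0.2, 0.5, 0.3], [0.1, 0.2, 0.7]])
models = [p_lstm, p_cnn, p_transformer]

avg = average_probs(models)
weighted = average_probs(models, weights=[0.5, 0.3, 0.2])
voted = majority_vote(models)
print(avg.argmax(axis=1), voted)
```

Both averaging variants return rows that still sum to 1, so they can be read as probabilities over finishing positions. Majority voting returns class labels only.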

Implementation Steps

1. Data Preparation

	•	Sequences:
		•	Use the raw GPS data (gpspoint) to create sequences for each horse in each race.
		•	Ensure that sequences are properly sorted by time_stamp.
	•	Features:
		•	Include raw features such as speed, progress, stride_frequency.
		•	Avoid hand-engineering features like acceleration to adhere to your objective.
	•	Labels:
		•	Use official_fin from results_entries as the target variable.
		•	Since you want probabilities for each finishing position, consider encoding official_fin as categorical labels.
