# Proposed Ensemble Models

Given the constraints and objectives, I recommend considering the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: Stacked 1D convolutions build up a hierarchy of local temporal patterns, which makes them effective at picking out short-lived features in a sequence, such as sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	Similar to LSTMs but with a simpler architecture, GRUs can be more efficient and may perform better on certain datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.
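
To make the "causal convolution" idea behind the TCN concrete, here is a minimal NumPy sketch. It is not a trainable layer and not code from this project; the function name `causal_conv1d` and the toy speed values are illustrative assumptions. The point it shows is that the output at step t depends only on inputs at t, t-d, t-2d, and so on, never on future samples.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1D convolution: y[t] = sum_k kernel[k] * x[t - k * dilation]."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    # Left-pad with zeros so no output ever looks at a future sample
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(kernel):
        start = pad - k * dilation
        y += w * xp[start:start + len(x)]
    return y

speed = np.array([10.0, 12.0, 15.0, 15.0, 14.0])  # toy speed samples

# Kernel [1, -1] gives the change from the previous step
diff1 = causal_conv1d(speed, [1.0, -1.0])
# dilation=2 compares each sample with the one two steps back
diff2 = causal_conv1d(speed, [1.0, -1.0], dilation=2)
print(diff1)
print(diff2)
```

A real TCN stacks many such layers with growing dilation and adds residual connections, which is how it reaches long-term dependencies.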


### Validate GPU Setup

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

In [None]:
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs Available:", len(physical_devices))

In [None]:
for device in physical_devices:
    print(device)

# The LSTM Network on Raw GPS Data

Initially I wanted to merge the GPS data with Sectionals, but the two are indexed differently (time_stamp for GPS, gate_name for Sectionals), which made it difficult to align the data in sequences, something that Long Short-Term Memory (LSTM) models require. Therefore, it was decided to go with an ensemble approach. There will be additional models that incorporate Equibase data as well, but for the time being the focus will be on Total Performance GPS data.

In [2]:
# Environment setup

import logging
import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text
import geopandas as gpd
from datetime import datetime
import configparser
from src.data_ingestion.ingestion_utils import (
    get_db_connection, update_tracking, load_processed_files
)
import traceback

# Load the configuration file
config = configparser.ConfigParser()
config.read('/home/exx/myCode/horse-racing/FoxRiverAIRacing/config.ini')

# Set up logging for consistent logging behavior in Notebook
logging.basicConfig(level=logging.INFO)

# Retrieve database credentials from config file
db_host = config['database']['host']
db_port = config['database']['port']
db_name = config['database']['dbname']  # Corrected from 'name' to 'dbname'
db_user = config['database']['user']

# Establish connection using get_db_connection
conn = get_db_connection(config)

# Create the SQLAlchemy engine
engine = create_engine(f'postgresql+psycopg2://{db_user}@{db_host}:{db_port}/{db_name}')

In [None]:
query_results = """
   SELECT course_cd, race_date, race_number, program_num, post_pos, axciskey, official_fin
    FROM v_results_entries
    WHERE breed = 'TB';
"""

query_sectionals = """    
    
SELECT course_cd, race_date, race_number, saddle_cloth_number, gate_name, gate_numeric, 
    length_to_finish,sectional_time, running_time, distance_back, distance_ran, 
    number_of_strides, post_time 
FROM v_sectionals;
"""

query_gpspoint = """
select course_cd, race_date, race_number, saddle_cloth_number, time_stamp, 
longitude, latitude, speed, progress, stride_frequency, post_time, location
from v_gpspoint;
"""


query_routes = """
select course_cd, track_name, line_type, line_name, coordinates
from routes;
"""



In [None]:
# Execute the query and load data into a DataFrame
gps_df = pd.read_sql_query(query_gpspoint, engine, parse_dates=['time_stamp'])


In [None]:
gps_df.dtypes
print(gps_df.shape)

In [None]:
sectionals_df = pd.read_sql_query(query_sectionals, engine)


In [None]:
sectionals_df.dtypes
print(sectionals_df.shape)

In [None]:
results_df = pd.read_sql_query(query_results, engine)


In [None]:
results_df.dtypes

In [None]:
print(results_df.shape)

In [None]:
routes_df = pd.read_sql_query(query_routes, engine)

In [None]:
routes_df.dtypes

In [None]:
routes_df.shape

In [None]:
gps_df['race_date'] = pd.to_datetime(gps_df['race_date'])
sectionals_df['race_date'] = pd.to_datetime(sectionals_df['race_date'])
results_df['race_date'] = pd.to_datetime(results_df['race_date'])

In [None]:
gps_df.to_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/gps.parquet', index=False)


In [None]:
sectionals_df.to_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/sectionals.parquet', index=False)

In [None]:
results_df.to_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/results.parquet', index=False)

In [None]:
routes_df.to_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/routes.parquet', index=False)

# Start here if data is current in parquet

In [83]:
# Load the Parquet file into a DataFrame
gps_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/gps.parquet')
print(gps_df.shape)

(36590003, 12)


In [84]:
# Keep races on or after 2024-01-01
gps_df = gps_df[gps_df['race_date'] >= '2024-01-01']
print(gps_df.shape)

(13079926, 12)


In [85]:
# Load the Parquet file into a DataFrame
sectionals_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/sectionals.parquet')
print(sectionals_df.shape)

(4737778, 13)


In [86]:
# Keep races on or after 2024-01-01
sectionals_df = sectionals_df[sectionals_df['race_date'] >= '2024-01-01']
print(sectionals_df.shape)

(1682918, 13)


In [87]:
# Load the Parquet file into a DataFrame
results_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/results.parquet')
print(results_df.shape)

(387680, 7)


In [88]:
# Keep races on or after 2024-01-01
results_df = results_df[results_df['race_date'] >= '2024-01-01']
print(results_df.shape)

(115181, 7)


In [89]:
# Load the Parquet file into a DataFrame
routes_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/training/routes.parquet')
print(routes_df.shape)

(125, 5)


## Mapping GPS to Sectionals

### Objective

>	•	Goal: Merge the gpspoint data with the sectionals data to incorporate gate-specific features into the sequences used for the LSTM model.

>    •	Challenge: Since the sectionals data does not have time_stamp, we need an alternative method to align the data.

### Proposed Solution: Use running_time to Estimate time_stamp in sectionals Data

Why Use running_time?

>	•	running_time in the sectionals data represents the total time from the official start of the race to when the horse passed each gate.

>    •	We can use running_time to estimate the time_stamp of gate events by adding it to the race start time.
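
As a minimal sketch of that idea, assuming the race start time is known, each gate's estimated timestamp is the start time plus running_time in seconds. The `race_start` value and the toy running_time figures below are made up for illustration.

```python
import pandas as pd

# Hypothetical race start and toy cumulative running times (seconds)
race_start = pd.Timestamp("2024-01-01 12:52:06.200")
toy_sectionals = pd.DataFrame({
    "gate_name": ["0.5f", "1f", "1.5f"],
    "running_time": [6.5, 12.75, 19.0],
})

# running_time is cumulative from the start, so no cumulative sum is needed here
toy_sectionals["gate_time_stamp"] = race_start + pd.to_timedelta(
    toy_sectionals["running_time"], unit="s"
)
print(toy_sectionals)
```

The approach only works if the chosen start time is on the same clock as the GPS time_stamp, which is the difficulty the rest of this section deals with.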


### Implementation Steps

1. Ensure gpspoint Data is Properly Formatted

>	•	Convert time_stamp to datetime:

### Action: Convert Data Types and Standardize Column Names

In [90]:
import pandas as pd

# Convert 'race_date' to datetime in all DataFrames
gps_df['race_date'] = pd.to_datetime(gps_df['race_date'])
sectionals_df['race_date'] = pd.to_datetime(sectionals_df['race_date'])
results_df['race_date'] = pd.to_datetime(results_df['race_date'])

# Ensure 'race_number' is int in all DataFrames
gps_df['race_number'] = gps_df['race_number'].astype(int)
sectionals_df['race_number'] = sectionals_df['race_number'].astype(int)
results_df['race_number'] = results_df['race_number'].astype(int)

# Standardize 'saddle_cloth_number' and 'program_num' to strings and strip whitespace
gps_df['saddle_cloth_number'] = gps_df['saddle_cloth_number'].astype(str).str.strip()
sectionals_df['saddle_cloth_number'] = sectionals_df['saddle_cloth_number'].astype(str).str.strip()
results_df['program_num'] = results_df['program_num'].astype(str).str.strip()

# Rename sectionals_df post_time
sectionals_df = sectionals_df.rename(columns={'post_time': 'post_time_sectional'})

# Rename 'program_num' to 'saddle_cloth_number' in results_df for consistency
results_df.rename(columns={'program_num': 'saddle_cloth_number'}, inplace=True)

# Confirm data types
print("gps_df.dtypes:")
print(gps_df.dtypes)
print("\nsectionals_df.dtypes:")
print(sectionals_df.dtypes)
print("\nresults_df.dtypes:")
print(results_df.dtypes)
print("\nroutes_df.dtypes")
print(routes_df.dtypes)

gps_df.dtypes:
course_cd                      object
race_date              datetime64[ns]
race_number                     int64
saddle_cloth_number            object
time_stamp             datetime64[ns]
longitude                     float64
latitude                      float64
speed                         float64
progress                      float64
stride_frequency              float64
post_time              datetime64[ns]
location                       object
dtype: object

sectionals_df.dtypes:
course_cd                      object
race_date              datetime64[ns]
race_number                     int64
saddle_cloth_number            object
gate_name                      object
gate_numeric                  float64
length_to_finish              float64
sectional_time                float64
running_time                  float64
distance_back                 float64
distance_ran                  float64
number_of_strides             float64
post_time_sectional    datetime64[ns]

### Overview of solution

#### The problem:

I do not believe the merge is a pure cartesian join, but the two tables do not share a full primary key. The GPS data is keyed on course_cd, race_date, race_number, saddle_cloth_number, and time_stamp, and each horse has at least 120 timestamps per race. The sectionals data is keyed on course_cd, race_date, race_number, saddle_cloth_number, and gate_name (which we converted to gate_numeric so that it sorts correctly). Joining on the four shared columns alone multiplies the rows: each of a horse's 120 or more GPS rows pairs with every one of that horse's sectional rows (up to about 20). So time has to be taken into account: use the GPS time_stamp, and compute an equivalent timestamp for each sectional gate_numeric. For example, when a horse breaks at the start of the race, it reaches the first gate (typically 0.5f) after that gate's sectional_time, and that moment should match one of the GPS timestamps, or at least fall within a second or two of one. The next gate (1f) is reached once its sectional_time is added on, and so on. This is why gate_numeric cannot be ignored in a time-series analysis. Other alignments look possible, such as matching on distance (length_to_finish), but they are not worked out here.

I think there was a misunderstanding: post_time cannot be used to sync with the GPS time_stamp. The main reason is that most, if not all, of the post_time values differ significantly from the GPS time_stamp values (apparently all of them, since there are no matches). In examining the data, the first GPS time_stamp in a sorted race begins 2-4 minutes after post_time, far enough that the race would probably be over. None of the times I saw were within 1 second of post_time.

Therefore, it was decided to sort the two frames on time_stamp and gate_numeric respectively, add the sectional_time to the minimum GPS time_stamp, and call the result the first sectional timestamp, which is then mapped to the nearest GPS time_stamp, hopefully within one second. Not every GPS timestamp will have a match, since there are approximately 100 of them for each race (at minimum) but only 13 gate_name timestamps for a 6.5f race. So each horse has many GPS timestamps and 20 or fewer gate_name timestamps, and the merge is a left join from GPS to sectionals.
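
The row multiplication described above is easy to reproduce on toy data. The sketch below uses made-up rows for a single horse (120 GPS rows, 13 sectional rows); the frame names `toy_gps` and `toy_sec` are illustrative only.

```python
import pandas as pd

keys = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]
horse = {"course_cd": "AQU", "race_date": "2024-01-01",
         "race_number": 1, "saddle_cloth_number": "1"}

# Toy data: 120 GPS rows and 13 sectional rows for the same horse
toy_gps = pd.DataFrame([{**horse, "gps_row": i} for i in range(120)])
toy_sec = pd.DataFrame([{**horse, "gate_numeric": 0.5 * (g + 1)} for g in range(13)])

# Joining on the shared keys alone pairs every GPS row with every sectional row
exploded = toy_gps.merge(toy_sec, on=keys, how="left")
print(len(toy_gps), len(toy_sec), len(exploded))
```

A time-aware join avoids this because each GPS row can match at most one gate.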

### Overview of the Solution

The steps that need to be followed are:

> 1. The only good use case for post_time, in my opinion, is converting the gps_df time_stamp to local time: assuming post_time is local, take the difference in whole hours and subtract it from the gps_df time_stamp (after making both columns timezone-naive).

> 2. Once the time_stamp is correct, group both gps_df and sectionals_df on course_cd, race_date, race_number, and saddle_cloth_number, then sort by time_stamp and gate_numeric respectively.

> 3. Take the minimum gps_df time_stamp for each horse in each race and add the sectional_time of the first gate_name (0.5f). This is the sectional timestamp for the first sectionals_df row. Keep adding each gate's sectional_time in turn until gate_numeric = 9999, which is the finish line.

> 4. With the sectional timestamp populated, merge gps_df and sectionals_df.


### Step 1: Prepare the Dataframes and Standardize Times

In [95]:
gps_df['time_stamp'].head()

889   2024-10-19 16:12:24.200
890   2024-10-19 16:12:24.200
891   2024-10-19 16:12:25.200
892   2024-10-19 16:12:25.200
893   2024-10-19 16:12:25.200
Name: time_stamp, dtype: datetime64[ns]

In [92]:
import pandas as pd
import numpy as np

# Step 1: Adjust GPS timestamps to local time using post_time

# Ensure datetime columns are in datetime format
gps_df['time_stamp'] = pd.to_datetime(gps_df['time_stamp'])
sectionals_df['post_time_sectional'] = pd.to_datetime(sectionals_df['post_time_sectional'])


In [94]:
# Remove timezone information from GPS timestamps (make them timezone-naive)
gps_df['time_stamp'] = gps_df['time_stamp'].dt.tz_localize(None)

In [96]:
# Standardize data types for merge keys
gps_df['race_number'] = gps_df['race_number'].astype(int)
sectionals_df['race_number'] = sectionals_df['race_number'].astype(int)

gps_df['saddle_cloth_number'] = gps_df['saddle_cloth_number'].astype(str).str.strip()
sectionals_df['saddle_cloth_number'] = sectionals_df['saddle_cloth_number'].astype(str).str.strip()

gps_df['course_cd'] = gps_df['course_cd'].astype(str).str.upper().str.strip()
sectionals_df['course_cd'] = sectionals_df['course_cd'].astype(str).str.upper().str.strip()


In [99]:
# Merge gps_df and sectionals_df to get 'post_time_sectional' into gps_df
gps_df = pd.merge(
    gps_df,
    sectionals_df[['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'post_time_sectional']].drop_duplicates(),
    on=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'],
    how='left'
)


In [101]:
# Compute time difference between GPS time_stamp and post_time_sectional
gps_df['time_diff_hours'] = (gps_df['time_stamp'] - gps_df['post_time_sectional']).dt.total_seconds() / 3600.0


In [102]:
# Round time difference to nearest integer to get timezone offset
gps_df['timezone_offset_hours'] = gps_df['time_diff_hours'].round().astype(int)

unique_time_zone_deltas_from_ZULU = gps_df['timezone_offset_hours'].unique()

print(unique_time_zone_deltas_from_ZULU)

[4 5 6 7 8 3]
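
Because the offset is rounded row by row, it may be worth checking that every row of a race ends up with the same offset. The sketch below shows one way to do that on toy data; the timestamps are made up, and grouping by course_cd and race_number alone is a simplification for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "course_cd": ["AQU", "AQU", "TSA", "TSA"],
    "race_number": [1, 1, 1, 1],
    "time_stamp": pd.to_datetime([
        "2024-01-01 17:52:06", "2024-01-01 17:52:07",
        "2024-01-01 20:01:16", "2024-01-01 20:01:17",
    ]),
    "post_time_sectional": pd.to_datetime([
        "2024-01-01 12:50:00", "2024-01-01 12:50:00",
        "2024-01-01 12:00:00", "2024-01-01 12:00:00",
    ]),
})

# Same computation as above: hour difference, rounded to the nearest integer
diff_hours = (toy["time_stamp"] - toy["post_time_sectional"]).dt.total_seconds() / 3600.0
toy["timezone_offset_hours"] = diff_hours.round().astype(int)

# A race whose differences straddle a half-hour boundary would show 2 offsets here
offsets_per_race = toy.groupby(["course_cd", "race_number"])["timezone_offset_hours"].nunique()
inconsistent = offsets_per_race[offsets_per_race > 1]
print(inconsistent)
```

An empty result means each race was shifted by a single offset.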


In [110]:
# Adjust GPS time_stamp to local time
# gps_df['time_stamp_local'] = gps_df['time_stamp'] - pd.to_timedelta(gps_df['timezone_offset_hours'], unit='h')
gps_df['time_stamp'] = gps_df['time_stamp'] - pd.to_timedelta(gps_df['timezone_offset_hours'], unit='h')

In [121]:
# Update gps_df with adjusted time_stamp_local and timezone_offset_hours
# gps_df[['time_stamp', 'post_time', 'timezone_offset_hours']]
sectionals_df.dtypes

course_cd                      object
race_date              datetime64[ns]
race_number                     int64
saddle_cloth_number            object
gate_name                      object
gate_numeric                  float64
length_to_finish              float64
sectional_time                float64
running_time                  float64
distance_back                 float64
distance_ran                  float64
number_of_strides             float64
post_time_sectional    datetime64[ns]
dtype: object

### Step 2: Sort Dataframes

Sort both DataFrames by the grouping keys, then by time_stamp (gps_df) and gate_numeric (sectionals_df), so that the cumulative sum in Step 3 and the nearest-time merge in Step 4 see the rows in order.


In [178]:
# Sort gps_df by the grouping keys, then by time_stamp (already adjusted to local time)
gps_df = gps_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp'])

In [179]:
# Sort sectionals_df by gate_numeric
sectionals_df = sectionals_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'gate_numeric'])


### Step 3: Calculate sec_time_stamp from the minimum GPS time_stamp and sectional_time


In [124]:
# Get the minimum GPS time_stamp per horse (group)
min_gps_time = gps_df.groupby(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'])['time_stamp'].min().reset_index()


In [126]:
# Merge min_gps_time into sectionals_df
sectionals_df = pd.merge(
    sectionals_df,
    min_gps_time,
    on=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'],
    how='left'
)

In [131]:
sectionals_df.dtypes

course_cd                      object
race_date              datetime64[ns]
race_number                     int64
saddle_cloth_number            object
gate_name                      object
gate_numeric                  float64
length_to_finish              float64
sectional_time                float64
running_time                  float64
distance_back                 float64
distance_ran                  float64
number_of_strides             float64
post_time_sectional    datetime64[ns]
time_stamp             datetime64[ns]
dtype: object

In [134]:
import pandas as pd

# Ensure the sectional_time is in seconds (float)
sectionals_df['sectional_time'] = sectionals_df['sectional_time'].astype(float)

# Create the sec_time_stamp column by cumulatively adding sectional_time to the first time_stamp
sectionals_df = sectionals_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'gate_numeric'])

# Group by primary identifiers and compute cumulative time
sectionals_df['sec_time_stamp'] = sectionals_df.groupby(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'])['sectional_time'].cumsum()

# Add the cumulative seconds to the minimum time_stamp per group
min_timestamps = sectionals_df.groupby(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'])['time_stamp'].transform('min')
sectionals_df['sec_time_stamp'] = min_timestamps + pd.to_timedelta(sectionals_df['sec_time_stamp'], unit='s')


  course_cd race_date   race_number saddle_cloth_number gate_name        time_stamp            sec_time_stamp    
0     AQU   2024-01-01       1                1            0.5f   2024-01-01 12:52:06.200 2024-01-01 12:52:12.630
1     AQU   2024-01-01       1                1              1f   2024-01-01 12:52:06.200 2024-01-01 12:52:18.860
2     AQU   2024-01-01       1                1            1.5f   2024-01-01 12:52:06.200 2024-01-01 12:52:25.150
3     AQU   2024-01-01       1                1              2f   2024-01-01 12:52:06.200 2024-01-01 12:52:31.410
4     AQU   2024-01-01       1                1            2.5f   2024-01-01 12:52:06.200 2024-01-01 12:52:37.590


### Explanation:

>   1.	Cast sectional_time to float: The values are assumed to already be in seconds, which is what the timedelta conversion expects.

>   2.	Sort by relevant columns: This ensures the data is ordered correctly for cumulative calculations.

>   3.	Compute cumulative time (cumsum): Adds sectional_time cumulatively within each group of race identifiers.

>   4.	Add cumulative time to the first time_stamp in each group: Uses the minimum time_stamp for each group as the starting point.

>   5.	Create sec_time_stamp: Populates the new column with the computed timestamps.
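
The same steps can be checked on a tiny frame. In the sketch below the start time is the first time_stamp shown in the output above, and the three sectional_time values are back-derived from the differences between the displayed sec_time_stamp values, so treat them as illustrative rather than as raw data.

```python
import pandas as pd

keys = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]
toy = pd.DataFrame({
    "course_cd": "AQU",
    "race_date": pd.Timestamp("2024-01-01"),
    "race_number": 1,
    "saddle_cloth_number": "1",
    "gate_numeric": [0.5, 1.0, 1.5],
    "sectional_time": [6.43, 6.23, 6.29],  # back-derived, illustrative
    "time_stamp": pd.Timestamp("2024-01-01 12:52:06.200"),  # first GPS time_stamp
})

toy = toy.sort_values(keys + ["gate_numeric"])

# Seconds elapsed since the start, per horse
elapsed = toy.groupby(keys)["sectional_time"].cumsum()
# Starting point per horse
start = toy.groupby(keys)["time_stamp"].transform("min")

toy["sec_time_stamp"] = start + pd.to_timedelta(elapsed, unit="s")
print(toy[["gate_numeric", "sectional_time", "sec_time_stamp"]])
```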


In [154]:
# Set pandas display options for maximum space
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)     # Set max rows to display
pd.set_option('display.width', 1000)      # Increase display width
pd.set_option('display.colheader_justify', 'center')  # Center-align headers


In [165]:
sectionals_df.dtypes

course_cd                      object
race_date              datetime64[ns]
race_number                     int64
saddle_cloth_number            object
gate_name                      object
gate_numeric                  float64
length_to_finish              float64
sectional_time                float64
running_time                  float64
distance_back                 float64
distance_ran                  float64
number_of_strides             float64
sec_time_stamp         datetime64[ns]
dtype: object

In [168]:
gps_df.dtypes

course_cd                      object
race_date              datetime64[ns]
race_number                     int64
saddle_cloth_number            object
time_stamp             datetime64[ns]
longitude                     float64
latitude                      float64
speed                         float64
progress                      float64
stride_frequency              float64
location                       object
dtype: object

In [149]:
# Check the results
#sectionals_df[['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'gate_name', 
#               'sectional_time', 'time_stamp', 'sec_time_stamp',
#              'length_to_finish', 'running_time', 'distance_back']].head(30)

### Step 4: Merge gps_df and sectionals_df on matching keys and time

In [169]:
gps_df.shape #[['course_cd', 'race_date', 'race_number', 'saddle_cloth_number' ]]

(13079926, 11)

In [170]:
sectionals_df.shape

(1682918, 13)

In [186]:
# Sort gps_df by the necessary keys, ensuring time_stamp is sorted
gps_df = gps_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp'])

In [191]:
# Sort sectionals_df by the necessary keys, ensuring sec_time_stamp is sorted
sectionals_df = sectionals_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'sec_time_stamp'])

In [196]:
# Check sorting within groups for gps_df
sorted_correctly = gps_df.groupby(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'])['time_stamp'].apply(
    lambda x: x.is_monotonic_increasing
)

# Display groups that are not sorted correctly
print("Groups with sorting issues in gps_df:")
print(sorted_correctly[~sorted_correctly])

# Check sorting within groups for sectionals_df
sorted_correctly_sectionals = sectionals_df.groupby(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'])['sec_time_stamp'].apply(
    lambda x: x.is_monotonic_increasing
)

print("Groups with sorting issues in sectionals_df:")
print(sorted_correctly_sectionals[~sorted_correctly_sectionals])

Groups with sorting issues in gps_df:
Series([], Name: time_stamp, dtype: bool)
Groups with sorting issues in sectionals_df:
Series([], Name: sec_time_stamp, dtype: bool)


In [198]:
gps_df_sorted = gps_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp']).reset_index(drop=True)
sectionals_df_sorted = sectionals_df.sort_values(['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'sec_time_stamp']).reset_index(drop=True)

In [227]:

# Check for missing values
print("Missing values in gps_analysis_df:")
print(gps_analysis_df.isnull().sum())

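# Handle missing values (example: forward fill).
# DataFrame.ffill replaces the deprecated fillna(method='ffill').
# Note: this fills down the whole frame, so values can carry across horse/race boundaries.
gps_analysis_df.ffill(inplace=True)

# Re-check missing values after the forward fill
print("Missing values in gps_analysis_df:")
print(gps_analysis_df.isnull().sum())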


Missing values in gps_analysis_df:
course_cd                    0
race_date                    0
race_number                  0
saddle_cloth_number          0
time_stamp                   0
longitude                    0
latitude                     0
speed                        0
progress                     0
stride_frequency       1609048
location                     0
gate_name              9621680
gate_numeric           9745648
length_to_finish       9621680
sectional_time         9621680
running_time           9621680
distance_back          9621821
distance_ran           9621680
number_of_strides      9646685
sec_time_stamp         9621680
dtype: int64




Missing values in gps_analysis_df:
course_cd               0
race_date               0
race_number             0
saddle_cloth_number     0
time_stamp              0
longitude               0
latitude                0
speed                   0
progress                0
stride_frequency       65
location                0
gate_name              42
gate_numeric           42
length_to_finish       42
sectional_time         42
running_time           42
distance_back          42
distance_ran           42
number_of_strides      42
sec_time_stamp         42
dtype: int64
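
A frame-wide forward fill can carry the last gate of one horse into the first rows of the next horse. A hedged alternative, shown here on made-up data, is to forward fill within each horse-and-race group; it assumes the rows are already sorted by group and time, as they are earlier in this notebook.

```python
import numpy as np
import pandas as pd

keys = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]
toy = pd.DataFrame({
    "course_cd": "TSA",
    "race_date": pd.Timestamp("2024-01-01"),
    "race_number": 1,
    "saddle_cloth_number": ["1", "1", "1", "3", "3", "3"],
    "gate_numeric": [0.5, np.nan, 1.0, np.nan, 0.5, np.nan],
})

# Plain ffill: the leading NaN of horse 3 inherits 1.0 from horse 1
plain = toy["gate_numeric"].ffill()

# Group-wise ffill: values only carry forward within the same horse and race
grouped = toy.groupby(keys)["gate_numeric"].ffill()

print(pd.DataFrame({"plain": plain, "grouped": grouped}))
```

With the group-wise version, rows before a horse's first gate stay NaN and have to be handled explicitly.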


In [224]:
# Check for any remaining null values in time_stamp
print("Null values in gps_df['time_stamp']:", gps_df['time_stamp'].isnull().sum())

# Check for any remaining null values in sec_time_stamp
print("Null values in sectionals_df['sec_time_stamp']:", sectionals_df['sec_time_stamp'].isnull().sum())

# Check for any duplicate time_stamp values
print("Duplicate time_stamp values in gps_df:", gps_df.duplicated(subset=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp']).sum())

# Check for any duplicate sec_time_stamp values
print("Duplicate sec_time_stamp values in sectionals_df:", sectionals_df.duplicated(subset=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'sec_time_stamp']).sum())

Null values in gps_df['time_stamp']: 0
Null values in sectionals_df['sec_time_stamp']: 0
Duplicate time_stamp values in gps_df: 0
Duplicate sec_time_stamp values in sectionals_df: 0


In [217]:
# Define a tolerance for time difference (1 second)
tolerance = pd.Timedelta(seconds=1)

# Perform the merge
gps_analysis_df = pd.merge_asof(
    gps_df.sort_values('time_stamp'),
    sectionals_df.sort_values('sec_time_stamp'),
    left_on='time_stamp',
    right_on='sec_time_stamp',
    by=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number'],
    tolerance=tolerance,
    direction='nearest'  # Nearest match within tolerance
)



### Explanation of the Code:

1. Sorting:

>    •	Both gps_df and sectionals_df need to be sorted by their respective timestamps (time_stamp for gps_df and sec_time_stamp for sectionals_df) for the merge_asof function to work properly.

2.	Keys for Matching:

>    •	Besides matching timestamps, we are merging on the primary keys: course_cd, race_date, race_number, and saddle_cloth_number.
	
3.	Tolerance:

>    •	pd.Timedelta(seconds=1) ensures that the nearest sec_time_stamp within ±1 second of the time_stamp is included in the merge.

4.	Merge Direction:

>    •	Using direction='nearest' ensures the closest match is chosen. You can use 'backward' or 'forward' instead if you want only earlier or later matches, respectively.

5.	Left Join:

>    •	By default, merge_asof performs a left join, ensuring all rows from gps_df are preserved.

This will create gps_analysis_df, containing all rows from gps_df and the nearest matching rows from sectionals_df. Empty cells (NaN) will appear in the columns from sectionals_df where no match is found.
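
To see the matching behaviour in isolation, here is a small sketch on made-up rows for one horse. One consequence worth checking: with GPS samples about one second apart and a ±1 second tolerance, a single gate can match two GPS rows (the one just before and the one just after). The last two lines show one possible way to keep only the closest GPS row per gate; whether that is wanted is a modelling choice this notebook does not settle.

```python
import pandas as pd

keys = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]
base = {"course_cd": "AQU", "race_date": pd.Timestamp("2024-01-01"),
        "race_number": 1, "saddle_cloth_number": "1"}

toy_gps = pd.DataFrame({
    **base,
    "time_stamp": pd.to_datetime([
        "2024-01-01 12:52:10.200", "2024-01-01 12:52:11.200",
        "2024-01-01 12:52:12.200", "2024-01-01 12:52:13.200",
        "2024-01-01 12:52:14.200",
    ]),
})
toy_sec = pd.DataFrame({
    **base,
    "gate_name": ["0.5f"],
    "sec_time_stamp": pd.to_datetime(["2024-01-01 12:52:12.630"]),
})

merged = pd.merge_asof(
    toy_gps.sort_values("time_stamp"),
    toy_sec.sort_values("sec_time_stamp"),
    left_on="time_stamp",
    right_on="sec_time_stamp",
    by=keys,
    tolerance=pd.Timedelta(seconds=1),
    direction="nearest",
)
print(merged[["time_stamp", "gate_name", "sec_time_stamp"]])

# Optional: keep only the GPS row closest to each gate
merged["gap"] = (merged["time_stamp"] - merged["sec_time_stamp"]).abs()
closest = (
    merged.dropna(subset=["sec_time_stamp"])
    .sort_values("gap")
    .drop_duplicates(subset=keys + ["gate_name"])
)
```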


In [159]:
gps_analysis_df.dtypes

course_cd                        object
race_date                datetime64[ns]
race_number                       int64
saddle_cloth_number              object
time_stamp_x             datetime64[ns]
longitude                       float64
latitude                        float64
speed                           float64
progress                        float64
stride_frequency                float64
post_time                datetime64[ns]
location                         object
timezone_offset_hours             int64
gate_name                        object
gate_numeric                    float64
length_to_finish                float64
sectional_time                  float64
running_time                    float64
distance_back                   float64
distance_ran                    float64
number_of_strides               float64
post_time_sectional      datetime64[ns]
time_stamp_y             datetime64[ns]
sec_time_stamp           datetime64[ns]
dtype: object

In [160]:
gps_analysis_df = gps_analysis_df.sort_values(
    by=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp_x', 'gate_numeric']
).reset_index(drop=True)

In [161]:
gps_analysis_df.shape

(13079926, 24)

In [228]:
# Inspect the merged dataframe
gps_analysis_df.head(100)

Unnamed: 0,course_cd,race_date,race_number,saddle_cloth_number,time_stamp,longitude,latitude,speed,progress,stride_frequency,location,gate_name,gate_numeric,length_to_finish,sectional_time,running_time,distance_back,distance_ran,number_of_strides,sec_time_stamp
0,TSA,2024-01-01,1,1,2024-01-01 12:01:16.200,-118.040042,34.140403,0.07,1207.0,,0101000020E61000007BFE0F0B90825DC0A49F15BDF811...,,,,,,,,,NaT
1,TSA,2024-01-01,1,3,2024-01-01 12:01:16.200,-118.040027,34.140429,0.24,1207.0,,0101000020E6100000D4BC3ECE8F825DC027367A90F911...,,,,,,,,,NaT
2,TSA,2024-01-01,1,7,2024-01-01 12:01:16.200,-118.040011,34.140451,0.12,1207.0,,0101000020E61000005DFF09898F825DC08C570F4EFA11...,,,,,,,,,NaT
3,TSA,2024-01-01,1,4,2024-01-01 12:01:17.200,-118.040019,34.140426,0.06,1207.0,,0101000020E6100000694E03AB8F825DC04B822678F911...,,,,,,,,,NaT
4,TSA,2024-01-01,1,8,2024-01-01 12:01:17.200,-118.04,34.140461,1.52,1207.0,,0101000020E610000099767B5A8F825DC032B0E99CFA11...,,,,,,,,,NaT
5,TSA,2024-01-01,1,6,2024-01-01 12:01:17.200,-118.040015,34.140444,0.04,1207.0,,0101000020E6100000BC35559C8F825DC086F6FB0FFA11...,,,,,,,,,NaT
6,TSA,2024-01-01,1,5,2024-01-01 12:01:17.200,-118.040016,34.140433,0.1,1207.0,,0101000020E61000006F53F2A08F825DC0682508B2F911...,,,,,,,,,NaT
7,TSA,2024-01-01,1,7,2024-01-01 12:01:17.200,-118.040011,34.140452,0.08,1207.0,,0101000020E6100000F2BEE0898F825DC074154152FA11...,,,,,,,,,NaT
8,TSA,2024-01-01,1,3,2024-01-01 12:01:17.200,-118.040027,34.140429,0.13,1207.0,,0101000020E6100000AB3D91CC8F825DC039735996F911...,,,,,,,,,NaT
9,TSA,2024-01-01,1,1,2024-01-01 12:01:17.200,-118.040042,34.140403,0.11,1207.0,,0101000020E6100000049DFF0D90825DC0E66091BAF811...,,,,,,,,,NaT


# Data Preparation

## Sample data initially

A random row sample, as was first attempted, does not work for time series because it breaks up each horse's sequence. Taking a smaller sample by filtering on date keeps whole races intact and should work fine.
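
If a random subset is still wanted, one option (an assumption on my part, not something this notebook does) is to sample whole races rather than rows, so that every kept race retains its full time series. The toy frame below is made up for illustration.

```python
import pandas as pd

race_keys = ["course_cd", "race_date", "race_number"]
toy = pd.DataFrame({
    "course_cd": ["AQU"] * 4 + ["TSA"] * 4,
    "race_date": pd.to_datetime(["2024-01-01"] * 8),
    "race_number": [1, 1, 2, 2, 1, 1, 2, 2],
    "speed": [10.0, 11.0, 9.5, 10.5, 12.0, 12.5, 8.0, 8.5],
})

# Sample races, not rows
races = toy[race_keys].drop_duplicates()
picked = races.sample(n=2, random_state=0)

# Keep every row that belongs to a picked race
subset = toy.merge(picked, on=race_keys, how="inner")
print(subset)
```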

In [None]:
# Filter data for a specific date range or course
#df_filtered = merged_df[merged_df['race_date'] >= '2024-01-01']

In [None]:
#df_filtered.shape

In [None]:
#print(df_filtered.isnull().sum())

## Check for missing data

In [None]:
# Check for missing values
print(df.isnull().sum())

In [None]:
#import seaborn as sns
#import matplotlib.pyplot as plt

# Heatmap of missing values (on a small sample)
#sns.heatmap(df.isnull(), cbar=False)
#plt.show()

## Imputation for Stride Frequency and number_of_strides


#### Group-Based Imputation: Impute based on groups, such as per horse.

In [None]:
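# Group per horse per race; saddle_cloth_number alone would pool different horses across races
df['stride_frequency'] = df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['stride_frequency'].transform(lambda x: x.fillna(x.median()))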

#### gate_numeric stays the same within a group until it changes; rows where it is missing are dropped here:

In [None]:
df.dropna(subset=['gate_numeric'], inplace=True)

#### distance_back changes over time and could be interpolated; rows where it is missing are dropped here:

In [None]:
df.dropna(subset=['distance_back'], inplace=True)

#### number_of_strides could use group-based imputation; rows where it is missing are dropped here:

In [None]:
df.dropna(subset=['number_of_strides'], inplace=True)
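
The headings in this section mention carrying `gate_numeric` forward and interpolating `distance_back`, while the cells drop incomplete rows. As a rough sketch of the group-wise alternative, assuming the column names used in this notebook, the helper below (`impute_per_runner` is a hypothetical name, and the toy frame is made-up data) fills gaps within each horse-race group:

```python
import numpy as np
import pandas as pd

# Columns that identify one horse in one race (the same keys used elsewhere in this notebook).
GROUP_COLUMNS = ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']


def impute_per_runner(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill gaps within each horse-race group instead of dropping rows."""
    out = frame.sort_values(GROUP_COLUMNS + ['time_stamp']).copy()

    # gate_numeric is piecewise constant: carry the last known value forward,
    # then backward to cover any leading gaps.
    out['gate_numeric'] = out.groupby(GROUP_COLUMNS, sort=False)['gate_numeric'].transform(
        lambda s: s.ffill().bfill()
    )

    # distance_back changes gradually: interpolate linearly inside each group.
    out['distance_back'] = out.groupby(GROUP_COLUMNS, sort=False)['distance_back'].transform(
        lambda s: s.interpolate(limit_direction='both')
    )
    return out


# Toy frame with one horse in one race (illustrative values only).
toy = pd.DataFrame({
    'course_cd': ['TSA'] * 4,
    'race_date': ['2024-01-01'] * 4,
    'race_number': [1] * 4,
    'saddle_cloth_number': ['1'] * 4,
    'time_stamp': pd.to_datetime([
        '2024-01-01 12:00:00', '2024-01-01 12:00:01',
        '2024-01-01 12:00:02', '2024-01-01 12:00:03',
    ]),
    'gate_numeric': [0.0, np.nan, 1.0, np.nan],
    'distance_back': [2.0, np.nan, 4.0, 5.0],
})
toy_imputed = impute_per_runner(toy)
print(toy_imputed[['gate_numeric', 'distance_back']])
```

Groups in which a column is missing everywhere stay missing after this step, so a final `dropna` or a global fallback may still be needed.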

## Choose Features

In [None]:
df.shape

In [None]:
print(df.isnull().sum())

In [None]:
feature_columns = [
    'speed',
    'progress',
    'stride_frequency',
    'number_of_strides',
    'post_pos',
    'gate_numeric',
    'length_to_finish',
    'sectional_time',
    'running_time',
    'distance_back',
    'distance_ran'
]

## Feature Engineering -- calculate additional features

In [None]:
df.sort_values(by=['course_cd', 'race_date', 'race_number', 'saddle_cloth_number', 'time_stamp'], inplace=True)
df['acceleration'] = df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['speed'].diff() / df.groupby(
    ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
)['time_stamp'].diff().dt.total_seconds()

In [None]:
import numpy as np
df['acceleration'] = df['acceleration'].replace([np.inf, -np.inf], np.nan)
df['acceleration'] = df['acceleration'].fillna(0)

In [None]:
feature_columns.append('acceleration')

## Scale Features

Scaling helps in training neural networks.

In [None]:
# Note: Fit the scaler after the train/validation/test split, on the training data only, to avoid data leakage.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

## Create Sequences for LSTM

a. Group Data

Group the data to create sequences for each horse in each race.

In [None]:
group_columns = ['course_cd', 'race_date', 'race_number', 'saddle_cloth_number']
# Use df, the frame that was imputed and given the 'acceleration' column above.
groups = df.groupby(group_columns)

##  Create Sequences and Labels

In [None]:
sequences = []
labels = []

for name, group in groups:
    # Ensure group is sorted by time
    group = group.sort_values('time_stamp')

    # Extract features
    features = group[feature_columns].values

    # Append the sequence
    sequences.append(features)

    # Extract label (official finishing position)
    label = group['official_fin'].iloc[0]  # Assuming it's the same for all entries in the group
    labels.append(label)

## Determine max_seq_length and num_features

In [None]:
# Note: Alternatively, set a fixed max_seq_length to limit memory usage.
max_seq_length = max(len(seq) for seq in sequences)
num_features = len(feature_columns)


In [None]:
print(max_seq_length)
print(num_features)
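
The note about a fixed `max_seq_length` can be sketched as follows. This is only an illustration on toy sequences: the 95th percentile cutoff and the helper name `pad_or_truncate` are assumptions, not choices made in this notebook.

```python
import numpy as np


def pad_or_truncate(sequences, max_len, num_features):
    """Hypothetical helper: post-pad with zeros and keep only the first max_len steps."""
    batch = np.zeros((len(sequences), max_len, num_features), dtype='float32')
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype='float32')[:max_len]
        batch[i, :len(seq)] = seq
    return batch


# Toy sequences of different lengths with 2 features each (stand-ins for the GPS sequences).
toy_sequences = [np.ones((n, 2)) for n in (3, 5, 40)]
lengths = np.array([len(seq) for seq in toy_sequences])

# Cap the length at a high percentile so one very long race does not set the tensor size.
capped_length = int(np.percentile(lengths, 95))
toy_padded = pad_or_truncate(toy_sequences, capped_length, num_features=2)
print(capped_length, toy_padded.shape)
```

With Keras `pad_sequences`, passing `maxlen=capped_length` together with `truncating='post'` should give the same behavior; the default `truncating='pre'` drops the start of long sequences instead. Which end to drop is a modeling decision.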

## Pad Sequences

In [None]:
import tensorflow as tf
# from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=max_seq_length, padding='post', dtype='float32'
)

## Convert Labels

Adjust labels to start from 0 if they start from 1.

In [None]:
labels = np.array(labels).astype(int) - 1
num_classes = labels.max() + 1

## Scale Features

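The next cells show the flatten, scale, and reshape mechanics on the full array. For modeling, fit the scaler only on the training data to prevent data leakage, as the split section later in this notebook does.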

Flatten sequences for scaling:

In [None]:
num_samples = padded_sequences.shape[0]
X_flat = padded_sequences.reshape(-1, num_features)

## Fit scaler on the flattened data:

In [None]:
X_scaled_flat = scaler.fit_transform(X_flat)

## Reshape back to original shape:

In [None]:
X_scaled = X_scaled_flat.reshape(num_samples, max_seq_length, num_features)
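
One side effect of scaling padded arrays is worth keeping in mind: the zero padding counts toward the mean and standard deviation, and after scaling the padded rows are no longer exactly 0, so a `Masking(mask_value=0.0)` layer would not skip them. A minimal sketch of a padding-aware alternative, assuming (as `Masking` does) that an all-zero time step is padding; the function names and toy values are hypothetical:

```python
import numpy as np


def fit_masked_scaler(X_train):
    """Compute per-feature mean/std from real (non-padded) time steps only."""
    real_steps = X_train[(X_train != 0).any(axis=-1)]   # shape: (n_real_steps, num_features)
    mean = real_steps.mean(axis=0)
    std = real_steps.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std


def apply_masked_scaler(X, mean, std):
    """Standardize real time steps and leave all-zero padding rows at exactly 0."""
    is_real = (X != 0).any(axis=-1, keepdims=True)
    return np.where(is_real, (X - mean) / std, 0.0).astype('float32')


# Toy batch: 2 sequences, 3 time steps, 2 features.
# The last step of the first sequence is padding.
toy_X = np.array([[[1.0, 10.0], [3.0, 20.0], [0.0, 0.0]],
                  [[5.0, 60.0], [7.0, 70.0], [9.0, 90.0]]], dtype='float32')
toy_mean, toy_std = fit_masked_scaler(toy_X)
toy_scaled = apply_masked_scaler(toy_X, toy_mean, toy_std)
print(toy_scaled)
```

The statistics would be fitted on the training split and then applied unchanged to the validation and test splits.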

# Split Data into Training, Validation, and Test Sets

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumes padded_sequences, labels, and num_classes were created in the cells above.
# The labels were already shifted to start at 0 there, so the conversion is not
# repeated here; subtracting 1 a second time would produce a -1 class.

# Split data
X_temp, X_test, y_temp, y_test = train_test_split(
    padded_sequences, labels, test_size=0.1, random_state=42
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.1, random_state=42
)

# Check shapes
print("X_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)
print("y_test shape:", y_test.shape)

# Scale features
scaler = StandardScaler()

# Flatten training data and fit scaler
num_samples_train = X_train.shape[0]
X_train_flat = X_train.reshape(-1, num_features)
X_train_scaled_flat = scaler.fit_transform(X_train_flat)
X_train_scaled = X_train_scaled_flat.reshape(num_samples_train, max_seq_length, num_features)

# Scale validation data
num_samples_val = X_val.shape[0]
X_val_flat = X_val.reshape(-1, num_features)
X_val_scaled_flat = scaler.transform(X_val_flat)
X_val_scaled = X_val_scaled_flat.reshape(num_samples_val, max_seq_length, num_features)

# Scale test data
num_samples_test = X_test.shape[0]
X_test_flat = X_test.reshape(-1, num_features)
X_test_scaled_flat = scaler.transform(X_test_flat)
X_test_scaled = X_test_scaled_flat.reshape(num_samples_test, max_seq_length, num_features)

Confirm that X_train, X_val, and X_test have shape (samples, max_seq_length, num_features) and that y_train, y_val, and y_test have shape (samples,).

In [None]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)
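
One caveat with the random `train_test_split` used in this section: each sample is one horse in one race, so horses from the same race can land on both sides of the split. A possible alternative is to split by race. The sketch below assumes a `race_ids` array with one race identifier per sequence (for example, built from the first three elements of `name` in the sequence loop); `split_by_race` is a hypothetical helper, and scikit-learn's `GroupShuffleSplit` offers the same idea.

```python
import numpy as np


def split_by_race(race_ids, test_fraction=0.1, seed=42):
    """Hypothetical helper: return boolean train/test masks that keep each race on one side."""
    race_ids = np.asarray(race_ids)
    unique_races = np.unique(race_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_races)
    n_test = max(1, int(round(test_fraction * len(unique_races))))
    test_races = unique_races[:n_test]
    test_mask = np.isin(race_ids, test_races)
    return ~test_mask, test_mask


# Toy example: 10 races with 6 horses each, one id per sequence.
toy_race_ids = np.repeat(np.arange(10), 6)
train_mask, test_mask = split_by_race(toy_race_ids, test_fraction=0.2)
print(train_mask.sum(), test_mask.sum())
```

The masks can index the padded array and the labels directly, e.g. `padded_sequences[train_mask]` and `labels[train_mask]`.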

# Prepare Data for Model Training

## Training the LSTM Model

### Build the Model

This model stacks three bidirectional LSTM layers and combines dropout, L2 regularization, and layer normalization to limit overfitting.

In [None]:
import tensorflow as tf

model_lstm = tf.keras.Sequential([
    tf.keras.Input(shape=(max_seq_length, num_features)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        256, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),    
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        128, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, kernel_regularizer=tf.keras.regularizers.l2(1e-4))),
    tf.keras.layers.LayerNormalization(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])


#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.LSTM(128),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#model_lstm = tf.keras.Sequential([
#    tf.keras.Input(shape=(max_seq_length, num_features)),
#    tf.keras.layers.Masking(mask_value=0.0),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
#    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
#    tf.keras.layers.Dense(num_classes, activation='softmax')
#])

#988/988 ━━━━━━━━━━━━━━━━━━━━ 7s 7ms/step - accuracy: 0.3606 - loss: 1.6184
#Test Loss: 1.6182985305786133, Test Accuracy: 0.36063656210899353

### Compile the Model

RMSprop is often a good choice for RNNs.

>•	The learning rate of 0.001 is a typical starting point.

>•	Recommendation: experiment with different learning rates (e.g., 0.0005, 0.0001) if needed.

>•	Alternatively, try the Adam optimizer and compare results.

In [None]:
# Experiment with different learning rates (e.g., 0.0005, 0.0001) to see whether convergence changes.

optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)

model_lstm.compile(
    optimizer=optimizer,   # 'adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'] #,tf.keras.metrics.Precision(), tf.keras.metrics.Recall()]
)



### Train the Model

## Hyperparameter Tuning

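> Callbacks used during training:

> * Learning rate scheduler: halves the learning rate when val_loss has not improved for 2 epochs.

> * Early stopping: stops after 5 epochs without val_loss improvement and restores the best weights.

> * Model checkpoint: saves the best model so far to best_model.keras.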



In [None]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2, min_lr=1e-6
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

In [None]:
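# Train on the scaled arrays computed after the split. StandardScaler shifts the zero
# padding away from 0, so reset padded time steps to 0 first; otherwise
# Masking(mask_value=0.0) cannot skip them.
for X_raw, X_scaled_part in ((X_train, X_train_scaled), (X_val, X_val_scaled), (X_test, X_test_scaled)):
    X_scaled_part[(X_raw == 0).all(axis=-1)] = 0.0

history = model_lstm.fit(
    X_train_scaled, y_train,
    epochs=50,  
    batch_size=128,  # 64,
    validation_data=(X_val_scaled, y_val),
    callbacks=[
        lr_scheduler, 
        early_stopping,
        model_checkpoint
    ]
)

### Evaluate the Model

In [None]:
# Evaluate on the test set scaled with the same training-set statistics.
test_loss, test_accuracy = model_lstm.evaluate(X_test_scaled, y_test)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")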

## Plot Training and Validation Loss and Accuracy:

In [None]:
import matplotlib.pyplot as plt

# Plot loss
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.show()

# Plot accuracy
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.show()

### Check for Imbalance

In [None]:
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

In [None]:
plt.bar(unique, counts)
plt.xlabel('Class')
plt.ylabel('Frequency')
plt.show()
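
If the counts above turn out to be uneven, one common response is to weight classes inversely to their frequency. The sketch below computes the usual "balanced" weights with NumPy on toy labels; `balanced_class_weights` is a hypothetical helper, and whether weighting helps on this dataset is not something the notebook has tested.

```python
import numpy as np


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Toy labels: class 0 is three times as frequent as class 1.
toy_y = np.array([0, 0, 0, 1])
toy_weights = balanced_class_weights(toy_y)
print(toy_weights)
```

The resulting dictionary has the form Keras expects for the `class_weight` argument of `model.fit`.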

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate correlation matrix
corr_matrix = df[['speed', 'progress', 'stride_frequency', 'longitude', 'latitude', 'post_pos', 'official_fin']].corr()

# Plot heatmap
sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Placeholder values for a standalone smoke test. Running this cell overwrites the
# max_seq_length, num_features, num_classes, and model_lstm defined earlier in the notebook.
max_seq_length = 120  # Replace with your maximum sequence length
num_features = 5      # Replace with the actual number of features in your data
num_classes = 12      # Replace with the actual number of classes

# Build a minimal model (Input layer instead of the deprecated input_shape argument)
model_lstm = tf.keras.Sequential([
    tf.keras.Input(shape=(max_seq_length, num_features)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])

model_lstm.summary()

In [None]:
# Load data into dataframe:

import pandas as pd

In [None]:
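# Training
# The model was rebuilt above, so it must be compiled again before fitting.
# The placeholder shapes above must also match the data being passed in.
model_lstm.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history_lstm = model_lstm.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stopping]
)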

# Combining the Models

To create an ensemble, you can combine the predictions of these models in several ways:
	1.	Averaging Probabilities:
	•	Obtain probability distributions over finishing positions from each model.
	•	Average the probabilities across models to get the final prediction.
	2.	Weighted Averaging:
	•	Assign weights to each model based on validation performance.
	•	Compute a weighted average of the probabilities.
	3.	Stacking (Meta-Learner):
	•	Use the predictions from the individual models as input features to a meta-model (e.g., a logistic regression or another neural network).
	•	The meta-model learns how to best combine the individual predictions.
	4.	Voting (for Classification):
	•	If treating the problem as classification into discrete positions, use majority voting among the models.
	•	Not as suitable if you need probability distributions.
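
The averaging, weighted averaging, and voting options listed above can be sketched with NumPy. This assumes each model's `predict` output is an array of shape `(n_samples, n_classes)`; the function names, toy probabilities, and weights are illustrative only.

```python
import numpy as np


def average_probabilities(prob_list, weights=None):
    """Combine per-model class probabilities by a plain or weighted mean."""
    stacked = np.stack(prob_list, axis=0)             # (n_models, n_samples, n_classes)
    if weights is None:
        return stacked.mean(axis=0)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # normalize so rows still sum to 1
    return np.tensordot(weights, stacked, axes=1)     # (n_samples, n_classes)


def majority_vote(prob_list):
    """Each model votes for its top class; ties go to the lowest class index."""
    votes = np.stack([p.argmax(axis=1) for p in prob_list], axis=1)   # (n_samples, n_models)
    n_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=n_classes)
    return counts.argmax(axis=1)


# Toy outputs from three models for two samples and three classes.
p_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
p_b = np.array([[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]])
p_c = np.array([[0.2, 0.5, 0.3], [0.1, 0.3, 0.6]])

avg = average_probabilities([p_a, p_b, p_c])
weighted = average_probabilities([p_a, p_b, p_c], weights=[0.5, 0.3, 0.2])
voted = majority_vote([p_a, p_b, p_c])
print(avg.argmax(axis=1), weighted.argmax(axis=1), voted)
```

Averaging keeps a full probability distribution per horse, which voting discards. In practice the weights would come from validation performance rather than being fixed by hand.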

Implementation Steps

1. Data Preparation

	•	Sequences:
	•	Use the raw GPS data (gpspoint) to create sequences for each horse in each race.
	•	Ensure that sequences are properly sorted by time_stamp.
	•	Features:
	•	Include raw features such as speed, progress, stride_frequency.
	•	Avoid hand-engineering features like acceleration to adhere to your objective.
	•	Labels:
	•	Use official_fin from results_entries as the target variable.
	•	Since you want probabilities for each finishing position, consider encoding official_fin as categorical labels.
