# Proposed Ensemble Models

Given the constraints and objectives, I will consider the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: 1D convolutions detect local temporal patterns and, when stacked, combine them into longer-range features, potentially identifying events like sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.
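To make the self-attention idea in Model 3 concrete, here is a minimal single-head sketch in NumPy. It is illustrative only: the projection matrices are random rather than learned, the key dimension `d_k = 16` is a hypothetical choice, and positional encodings are omitted. The input shape (5 steps, 135 features) mirrors the sequences used later in this notebook.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x has shape (time_steps, features); returns (context, attention_weights).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (time_steps, time_steps)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 135))                      # one sequence: 5 steps, 135 features
d_k = 16                                           # hypothetical projection size
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(135, d_k)) for _ in range(3))

context, attn = self_attention(x, w_q, w_k, w_v)
print(context.shape, attn.shape)
```

Each row of `attn` sums to 1 and shows how strongly one time step attends to every other step.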

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	Similar to LSTMs but with a simpler architecture, GRUs can be more efficient and may perform better on certain datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.
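One simple way to combine the models listed above is soft voting, i.e. averaging their predicted probabilities. The sketch below assumes each model outputs one win probability per sample. The function name `soft_vote` and the toy probabilities are hypothetical, and the notebook does not state which combination rule will actually be used.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Weighted average of per-model probabilities (uniform weights by default)."""
    probs = np.stack([np.asarray(p, dtype=float).ravel() for p in prob_list])
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * probs).sum(axis=0) / weights.sum()

# Toy probabilities standing in for LSTM, 1D-CNN, and Transformer outputs
p_lstm = np.array([0.2, 0.9, 0.4])
p_cnn = np.array([0.4, 0.7, 0.6])
p_transformer = np.array([0.3, 0.8, 0.2])

ensemble = soft_vote([p_lstm, p_cnn, p_transformer])
weighted = soft_vote([p_lstm, p_cnn, p_transformer], weights=[2, 1, 1])
print(ensemble, weighted)
```

Weights could be set from validation performance; uniform weights are a reasonable baseline.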


## Load Parquet Train, Test, and Validation (VAL) Data:

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet

In [2]:
#spark.stop()

In [11]:
# Set Environment
import logging  # needed by logging.error in the except block below
import os
import pyspark.sql.functions as F
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, size, when, count
from src.data_preprocessing.data_prep1.data_utils import (
    save_parquet, gather_statistics, initialize_environment,
    load_config, initialize_logging, initialize_spark, 
    identify_and_impute_outliers, identify_and_remove_outliers, process_merged_results_sectionals,
    identify_missing_and_outliers
)

try:
    spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()
    # input("Press Enter to continue...")
except Exception as e:
    print(f"An error occurred during initialization: {e}")
    logging.error(f"An error occurred during initialization: {e}")

2024-12-28 00:50:02,537 - INFO - Environment setup initialized.


Spark session created successfully.


In [12]:
train_sequences_path = os.path.join(parquet_dir, "train_sequences.parquet")
val_sequences_path = os.path.join(parquet_dir, "val_sequences.parquet")
test_sequences_path = os.path.join(parquet_dir, "test_sequences.parquet")
train_sequences = spark.read.parquet(train_sequences_path)
val_sequences = spark.read.parquet(val_sequences_path)
test_sequences = spark.read.parquet(test_sequences_path)

In [13]:
train_sequences.printSchema()

root
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- horse_id: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- gate_index: integer (nullable = true)
 |-- course_cd_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- equip_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- surface_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- trk_cond_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- weather_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- med_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- stk_clm_md_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- turf_mud_mark_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- race_type_ohe: array (nullable = true)
 |    |-- element: d

In [14]:
# Convert to Pandas DataFrame
train_sequences_pd = train_sequences.toPandas()
val_sequences_pd = val_sequences.toPandas()
test_sequences_pd = test_sequences.toPandas()

horse_ids_train = train_sequences_pd["horse_id"].values  # Extract horse_id for training
horse_ids_val = val_sequences_pd["horse_id"].values  # Extract horse_id for validation
horse_ids_test = test_sequences_pd["horse_id"].values  # Extract horse_id for testing
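The LSTM cell later in this notebook feeds `horse_id` to a Keras `Embedding` layer, which expects integer indices in `[0, input_dim)`. Raw IDs may not satisfy that, so one option is to map them to contiguous indices first. The sketch below is an assumption about how this could be done, not part of the original pipeline: `build_id_index` and `encode_ids` are hypothetical helpers, index 0 is reserved for horses unseen in training, and the toy IDs are made up.

```python
import numpy as np

def build_id_index(train_ids):
    """Map each distinct training ID to 1..n; index 0 is reserved for unseen IDs."""
    unique_ids = np.unique(train_ids)
    return {int(h): i + 1 for i, h in enumerate(unique_ids)}

def encode_ids(ids, id_index):
    """Translate raw IDs to embedding indices, sending unknown IDs to 0."""
    return np.array([id_index.get(int(h), 0) for h in ids], dtype=np.int32)

# Toy IDs for illustration only
toy_train = np.array([90210, 1207, 90210, 55555])
toy_val = np.array([1207, 77777])       # 77777 never appears in training

index = build_id_index(toy_train)
train_idx = encode_ids(toy_train, index)
val_idx = encode_ids(toy_val, index)
embedding_input_dim = len(index) + 1    # +1 for the reserved "unseen" slot
print(train_idx, val_idx, embedding_input_dim)
```

With this mapping, `embedding_input_dim` would replace a hard-coded horse count, and validation or test horses that never raced in the training set share a single embedding row.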

In [15]:
# show() prints the table itself and returns None, so it should not be wrapped in print()
train_sequences.select(F.size("past_races_sequence")).distinct().show()

+-------------------------+
|size(past_races_sequence)|
+-------------------------+
|                        5|
+-------------------------+


In [16]:
label_distribution = train_sequences.groupBy("label").count().collect()
print(label_distribution)

[Row(label=1, count=580), Row(label=0, count=3708)]


In [17]:
train_sequences_pd["past_races_sequence"].head()

0    [([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0...
1    [([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
2    [([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
3    [([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
4    [([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
Name: past_races_sequence, dtype: object

In [18]:
import numpy as np

def flatten_sequence(sequence):
    """
    Flattens a sequence of race data into a single NumPy array.
    Ensures uniform array shapes for each time step in the sequence.
    """
    ohe_length = 43 + 26 + 10 + 2 + 8 + 7 + 5 + 4 + 4 + 14  # Total OHE length
    aggregator_length = len(aggregator_cols)  # Length of aggregator columns

    flattened_sequence = []
    for step in sequence:
        # Extract `ohe_flat` or default to zero array
        ohe_flat = step["ohe_flat"] if "ohe_flat" in step else [0.0] * ohe_length

        # Ensure `ohe_flat` has the correct length
        if len(ohe_flat) != ohe_length:
            ohe_flat = [0.0] * ohe_length

        # Extract aggregator values or default to -999.0
        aggregator_values = [step[agg] if agg in step else -999.0 for agg in aggregator_cols]

        # Concatenate `ohe_flat` and aggregator values
        flattened_step = np.array(ohe_flat + aggregator_values)

        # Verify the length of the flattened step
        if len(flattened_step) != (ohe_length + aggregator_length):
            raise ValueError(f"Flattened step has inconsistent length: {len(flattened_step)}")

        flattened_sequence.append(flattened_step)

    return np.array(flattened_sequence)

aggregator_cols = [
    "avg_speed_agg", "max_speed_agg", "final_speed_agg", "avg_accel_agg", 
    "fatigue_agg", "sectional_time_agg", "running_time_agg", "distance_back_agg", 
    "distance_ran_agg", "strides_agg", "max_speed_overall", "min_speed_overall"
]

X_train = np.array([flatten_sequence(seq) for seq in train_sequences_pd["past_races_sequence"]])
y_train = train_sequences_pd["label"].values

In [19]:
X_train = np.array([flatten_sequence(seq) for seq in train_sequences_pd["past_races_sequence"]])
y_train = train_sequences_pd["label"].values

X_val = np.array([flatten_sequence(seq) for seq in val_sequences_pd["past_races_sequence"]])
y_val = val_sequences_pd["label"].values

X_test = np.array([flatten_sequence(seq) for seq in test_sequences_pd["past_races_sequence"]])
y_test = test_sequences_pd["label"].values


In [40]:
y_val

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,

In [20]:
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

X_train shape: (4288, 5, 135)
y_train shape: (4288,)
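Because `flatten_sequence` fills missing aggregator values with `-999.0`, it can be worth checking how often that sentinel appears before training, since large negative placeholders can dominate unscaled inputs. The helper below is a hypothetical sketch, demonstrated on a toy array rather than the real `X_train`.

```python
import numpy as np

def sentinel_fraction(X, sentinel=-999.0):
    """Per-feature fraction of (sample, step) entries equal to the sentinel."""
    return (X == sentinel).mean(axis=(0, 1))

# Toy tensor shaped (samples, time_steps, features)
X_toy = np.zeros((4, 5, 3))
X_toy[0, :, 2] = -999.0     # feature 2 missing for every step of sample 0
X_toy[1, 0, 2] = -999.0     # and for one step of sample 1

frac = sentinel_fraction(X_toy)
print(frac)
```

On the real data, `sentinel_fraction(X_train)` would return 135 values, one per feature.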


In [21]:
for i, seq in enumerate(train_sequences_pd["past_races_sequence"][:5]):
    print(f"Sequence {i}:")
    for j, step in enumerate(seq):
        print(f"  Step {j}: {step}")

Sequence 0:
  Step 0: Row(ohe_flat=[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], avg_speed_agg=14.191215986394557, max_speed_agg=18.94333333333333, final_speed_agg=11.540000000000001, avg_accel_agg=0.11617676967429082, fatigue_agg=0.39081471054020767, sectional_time_agg=6.239285714285715, running_time_agg=52.2375, distance_back_agg=1.4000000000000001, distance_ran_agg=101.12142857142855, strides_agg=14.142857142857142, max_speed_overall=18.94333333333333, min_speed_overall=0.21250000000000002)
  Step 1: Row(ohe_flat=[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0

In [22]:
# Label targets
print(np.unique(y_train))

[0 1]


In [23]:
label_dist = train_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 3708|
|    1|  580|
+-----+-----+



In [24]:
label_dist = val_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0|  749|
|    1|   98|
+-----+-----+



In [36]:
label_dist = test_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0|  699|
|    1|   89|
+-----+-----+



In [26]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}, horse_ids_train shape: {horse_ids_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}, horse_ids_val shape: {horse_ids_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}, horse_ids_test shape: {horse_ids_test.shape}")

X_train shape: (4288, 5, 135), y_train shape: (4288,), horse_ids_train shape: (4288,)
X_val shape: (847, 5, 135), y_val shape: (847,), horse_ids_val shape: (847,)
X_test shape: (788, 5, 135), y_test shape: (788,), horse_ids_test shape: (788,)


In [37]:
print(f"X_train: {X_train.shape}, horse_ids_train: {horse_ids_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, horse_ids_val: {horse_ids_val.shape}, y_val: {y_val.shape}")

X_train: (4288, 5, 135), horse_ids_train: (4288,), y_train: (4288,)
X_val: (847, 5, 135), horse_ids_val: (847,), y_val: (847,)


In [53]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Concatenate, Dropout, Flatten

# Shapes taken from X_train.shape == (4288, 5, 135)
time_steps = 5          # past races per sequence
features   = 135        # features per time step
# Embedding indices must lie in [0, num_horses). 4288 is the number of training
# rows, not a verified ID range; raw horse_id values at or above it must first
# be mapped to contiguous indices.
num_horses = 4288

# 1. Define Input Layers
input_features = Input(shape=(time_steps, features), name='input_features')
input_horse_id = Input(shape=(1,), name='input_horse_id')

# 2. Embedding for horse_id
embedding_dim = 32
embedding = Embedding(input_dim=num_horses, output_dim=embedding_dim)(input_horse_id)
embedding = Flatten()(embedding)

# 3. Stacked LSTM layers
#    - First LSTM: 128 hidden units, returns full sequence
#    - Second LSTM: 64 hidden units, returns full sequence
#    - Third LSTM: 32 hidden units, returns final state only
x = LSTM(128, return_sequences=True)(input_features)
x = Dropout(0.1)(x)

# Chain from the previous layer's output (the original passed input_features again)
x = LSTM(64, return_sequences=True)(x)
x = Dropout(0.1)(x)

x = LSTM(32, return_sequences=False)(x)
x = Dropout(0.1)(x)

# 4. Concatenate the final LSTM output with the horse embedding
concat = Concatenate()([x, embedding])

# 5. Dense layers (you can also increase these if you like)
dense = Dense(128, activation='relu')(concat)
dense = Dropout(0.3)(dense)

# Output layer for binary classification (label 1 vs. label 0)
output = Dense(1, activation='sigmoid')(dense)

# 6. Build and compile the model
model_lstm = Model(inputs=[input_features, input_horse_id], outputs=output)
# Assumption: the original cell never defined `optimizer`; Adam at its default
# learning rate is used here as a placeholder.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
# Compile the model with binary_crossentropy
model_lstm.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy']
)
model_lstm.summary()

In [55]:
y_train_binary = (y_train == 1).astype(int)  # Convert to binary labels
y_val_binary = (y_val == 1).astype(int)

In [30]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

In [41]:
unique, counts = np.unique(y_train, return_counts=True)
train_counts = dict(zip(unique, counts))

print(train_counts)
print(unique)
print(counts)

{0: 3708, 1: 580}
[0 1]
[3708  580]


In [45]:
y_train_binary = (y_train == 1).astype(int)  # 1 for first place, 0 otherwise
y_val_binary = (y_val == 1).astype(int)

In [46]:
import numpy as np

# Label distribution
unique, counts = np.unique(y_train_binary, return_counts=True)
train_counts = dict(zip(unique, counts))

print("Label Counts:", train_counts)

# Total samples and number of unique classes
total_samples = np.sum(counts)
num_classes = len(unique)

# Calculate class weights
class_weight = {label: (total_samples / (num_classes * count)) for label, count in train_counts.items()}

print("Class Weights:", class_weight)
# Earlier 5-class variant, kept commented out for reference:
# train_counts = np.array([580, 621, 604, 594, 1889])
# total = train_counts.sum()  # 4288
# n_classes = len(train_counts)  # 5

# class_weight = {}
# for i, count_i in enumerate(train_counts):
#     class_weight[i] = float(total) / (n_classes * count_i)

print(class_weight)

Label Counts: {0: 3708, 1: 580}
Class Weights: {0: 0.5782092772384034, 1: 3.696551724137931}
{0: 0.5782092772384034, 1: 3.696551724137931}
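The weights printed above follow the usual "balanced" formula, where $N$ is the total number of training samples, $K$ the number of classes, and $n_c$ the number of samples in class $c$. Worked through with the label counts from this notebook:

```latex
w_c = \frac{N}{K \, n_c}, \qquad N = 3708 + 580 = 4288, \quad K = 2

w_0 = \frac{4288}{2 \times 3708} = \frac{4288}{7416} \approx 0.5782
\qquad
w_1 = \frac{4288}{2 \times 580} = \frac{4288}{1160} \approx 3.6966
```

Each positive sample therefore counts about $3708 / 580 \approx 6.4$ times as much as a negative sample in the loss.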


In [47]:
print(X_train.shape)

(4288, 5, 135)


In [56]:
# Train the model

history = model_lstm.fit(
    [X_train, horse_ids_train],
    y_train_binary,
    epochs=50,  
    batch_size=8,  # 64,
    validation_data=([X_val, horse_ids_val], y_val_binary),
    callbacks=[
        lr_scheduler, 
        early_stopping,
        model_checkpoint
    ],
    class_weight=class_weight,
    verbose=1
)



In [None]:
val_loss, val_accuracy = model_lstm.evaluate([X_val, horse_ids_val], y_val_binary)
print(f'Validation Loss: {val_loss}, Validation Accuracy: {val_accuracy}')
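With roughly 12% positives in the validation set (98 of 847), accuracy alone can look good for a model that never predicts a win, so precision and recall on the positive class are worth reporting as well. The sketch below is a NumPy-only illustration: `binary_report` is a hypothetical helper, the 0.5 threshold is an assumption, and the toy arrays stand in for `y_val_binary` and `model_lstm.predict(...)`.

```python
import numpy as np

def binary_report(y_true, y_prob, threshold=0.5):
    """Confusion counts plus precision and recall for the positive class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = (np.asarray(y_prob).ravel() >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

# Toy labels and sigmoid outputs shaped (n, 1), like a Keras prediction
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
y_prob_toy = np.array([[0.1], [0.7], [0.8], [0.4], [0.2], [0.9]])

report = binary_report(y_true_toy, y_prob_toy)
print(report)
```

Lowering the threshold trades precision for recall, which may matter more than raw accuracy when winners are rare.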

In [None]:
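preds = model_lstm.predict([X_test, horse_ids_test])
# The sigmoid output has shape (n, 1), so argmax would always return 0; threshold instead
print((preds.ravel() > 0.5).astype(int)[:50])  # First 50 predictions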

In [None]:
# Save the trained model
#model.save('/path/to/save/model.h5')

In [None]:
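# Look the layer up by type; auto-generated names such as "embedding_6" change between runs
next(l for l in model_lstm.layers if isinstance(l, tf.keras.layers.Embedding)).get_weights()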