# Proposed Ensemble Models

Given the constraints and objectives, I will consider the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: 1D convolutions detect local patterns along the time axis and, when stacked, build hierarchies of such patterns. This can help identify features like sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.
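
As a rough illustration of the self-attention mechanism mentioned for Model 3, the sketch below implements single-head scaled dot-product attention in NumPy. The function name `self_attention`, the random projection matrices, and the 8-feature toy input are assumptions made for this example; they are not part of the notebook's pipeline.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (time_steps, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns (output, weights); weights[t, s] is how much step t attends to step s.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over source steps
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 time steps, 8 illustrative features
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
attn_out, attn_weights = self_attention(x, w_q, w_k, w_v)
print(attn_out.shape, attn_weights.sum(axis=-1))
```

Each row of `attn_weights` sums to 1, so every time step's output is a weighted average over all steps in the sequence, which is what lets the model reach distant steps directly.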

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	Similar to LSTMs but with a simpler architecture, GRUs can be more efficient and may perform better on certain datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.
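
The causal convolutions mentioned for the TCN can be sketched in plain NumPy. This is a minimal illustration only: `causal_conv1d` is a hypothetical helper, it handles a single channel, and it omits the residual connections a full TCN would use.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1D convolution: out[t] depends only on x[t], x[t-d], x[t-2d], ...

    x: (time_steps,); kernel: (k,), with kernel[0] applied to the current step.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])  # left-pad so no future step is read
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            out[t] += kernel[i] * padded[pad + t - i * dilation]
    return out

speed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
diff_1 = causal_conv1d(speed, kernel=np.array([1.0, -1.0]))              # x[t] - x[t-1]
diff_2 = causal_conv1d(speed, kernel=np.array([1.0, -1.0]), dilation=2)  # x[t] - x[t-2]
print(diff_1, diff_2)
```

Increasing the dilation widens the receptive field without adding parameters, which is how stacked layers reach long-term dependencies.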


## Load Parquet Train, Test, and Validation (VAL) Data:

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet

In [2]:
# Set Environment
import logging
import os
import pyspark.sql.functions as F
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, size, when, count
from src.data_preprocessing.data_prep1.data_utils import (
    save_parquet, gather_statistics, initialize_environment,
    load_config, initialize_logging, initialize_spark, 
    identify_and_impute_outliers, identify_and_remove_outliers, process_merged_results_sectionals,
    identify_missing_and_outliers
)

try:
    spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()
    # input("Press Enter to continue...")
except Exception as e:
    print(f"An error occurred during initialization: {e}")
    logging.error(f"An error occurred during initialization: {e}")

2024-12-27 23:32:13,845 - INFO - Environment setup initialized.


Spark session created successfully.


In [3]:
train_sequences_path = os.path.join(parquet_dir, "train_sequences.parquet")
val_sequences_path = os.path.join(parquet_dir, "val_sequences.parquet")
test_sequences_path = os.path.join(parquet_dir, "test_sequences.parquet")
train_sequences = spark.read.parquet(train_sequences_path)
val_sequences = spark.read.parquet(val_sequences_path)
test_sequences = spark.read.parquet(test_sequences_path)

In [4]:
train_sequences.printSchema()

root
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- horse_id: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- gate_index: integer (nullable = true)
 |-- course_cd_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- equip_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- surface_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- trk_cond_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- weather_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- med_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- stk_clm_md_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- turf_mud_mark_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- race_type_ohe: array (nullable = true)
 |    |-- element: d

In [5]:
# Convert to Pandas DataFrame
train_sequences_pd = train_sequences.toPandas()
val_sequences_pd = val_sequences.toPandas()
test_sequences_pd = test_sequences.toPandas()

horse_ids_train = train_sequences_pd["horse_id"].values  # Extract horse_id for training
horse_ids_val = val_sequences_pd["horse_id"].values  # Extract horse_id for validation
horse_ids_test = test_sequences_pd["horse_id"].values  # Extract horse_id for testing

                                                                                

In [6]:
# show() prints the table itself and returns None, so it is not wrapped in print()
train_sequences.select(F.size("past_races_sequence")).distinct().show()

+-------------------------+
|size(past_races_sequence)|
+-------------------------+
|                        5|
+-------------------------+


In [7]:
label_distribution = train_sequences.groupBy("label").count().collect()
print(label_distribution)

[Row(label=1, count=621), Row(label=3, count=594), Row(label=4, count=1889), Row(label=2, count=604), Row(label=0, count=580)]


In [8]:
train_sequences_pd["past_races_sequence"].head()

0    [([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
1    [([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0...
2    [([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0...
3    [([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0...
4    [([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0...
Name: past_races_sequence, dtype: object

In [9]:
import numpy as np

# Per-race aggregate features appended after the one-hot block at each time step.
aggregator_cols = [
    "avg_speed_agg", "max_speed_agg", "final_speed_agg", "avg_accel_agg",
    "fatigue_agg", "sectional_time_agg", "running_time_agg", "distance_back_agg",
    "distance_ran_agg", "strides_agg", "max_speed_overall", "min_speed_overall"
]

# Combined width of the concatenated one-hot encodings.
OHE_LENGTH = 43 + 26 + 10 + 2 + 8 + 7 + 5 + 4 + 4 + 14  # = 123

def flatten_sequence(sequence):
    """
    Flattens a sequence of race data into a 2-D NumPy array of shape
    (time_steps, OHE_LENGTH + len(aggregator_cols)).
    Ensures uniform array shapes for each time step in the sequence.
    """
    step_length = OHE_LENGTH + len(aggregator_cols)

    flattened_sequence = []
    for step in sequence:
        # Extract `ohe_flat` or default to a zero vector
        ohe_flat = list(step["ohe_flat"]) if "ohe_flat" in step else [0.0] * OHE_LENGTH

        # NOTE: an `ohe_flat` of unexpected length is silently replaced by zeros
        if len(ohe_flat) != OHE_LENGTH:
            ohe_flat = [0.0] * OHE_LENGTH

        # Extract aggregator values or default to the sentinel -999.0
        aggregator_values = [step[agg] if agg in step else -999.0 for agg in aggregator_cols]

        # Concatenate `ohe_flat` and aggregator values
        flattened_step = np.array(ohe_flat + aggregator_values)

        # Verify the length of the flattened step
        if len(flattened_step) != step_length:
            raise ValueError(f"Flattened step has inconsistent length: {len(flattened_step)}")

        flattened_sequence.append(flattened_step)

    return np.array(flattened_sequence)

X_train = np.array([flatten_sequence(seq) for seq in train_sequences_pd["past_races_sequence"]])
y_train = train_sequences_pd["label"].values
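
Because `flatten_sequence` substitutes zeros whenever `ohe_flat` is missing or its length differs from the expected 123, it may be worth counting how often that fallback fires before training. The sketch below is one possible check. `count_ohe_fallbacks` is a hypothetical helper, and the toy input uses plain dicts as stand-ins for the Spark `Row` objects in `past_races_sequence`.

```python
OHE_LENGTH = 123  # 43 + 26 + 10 + 2 + 8 + 7 + 5 + 4 + 4 + 14, as in flatten_sequence

def count_ohe_fallbacks(sequences, ohe_length=OHE_LENGTH):
    """Count time steps whose `ohe_flat` is missing or has an unexpected length."""
    total = missing = wrong_length = 0
    lengths_seen = set()
    for seq in sequences:
        for step in seq:
            total += 1
            if "ohe_flat" not in step:
                missing += 1
                continue
            n = len(step["ohe_flat"])
            lengths_seen.add(n)
            if n != ohe_length:
                wrong_length += 1
    return {
        "total": total,
        "missing": missing,
        "wrong_length": wrong_length,
        "lengths_seen": sorted(lengths_seen),
    }

# Toy stand-in data (assumption): two sequences, one step with a short `ohe_flat`,
# one step with no `ohe_flat` at all.
toy_sequences = [
    [{"ohe_flat": [0.0] * 123}, {"ohe_flat": [0.0] * 100}],
    [{"avg_speed_agg": 14.2}],
]
report = count_ohe_fallbacks(toy_sequences)
print(report)
```

On the real data this would be called as `count_ohe_fallbacks(train_sequences_pd["past_races_sequence"])`. A non-zero `wrong_length` would mean the one-hot block is being zeroed for those steps.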

In [10]:
X_train = np.array([flatten_sequence(seq) for seq in train_sequences_pd["past_races_sequence"]])
y_train = train_sequences_pd["label"].values

X_val = np.array([flatten_sequence(seq) for seq in val_sequences_pd["past_races_sequence"]])
y_val = val_sequences_pd["label"].values

X_test = np.array([flatten_sequence(seq) for seq in test_sequences_pd["past_races_sequence"]])
y_test = test_sequences_pd["label"].values


In [11]:
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

X_train shape: (4288, 5, 135)
y_train shape: (4288,)
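
The 135 features per step mix 0/1 one-hot entries with unscaled aggregates (for example a `running_time_agg` of about 61.8) and the `-999.0` sentinel. Recurrent layers are usually easier to train on standardized inputs, so a sketch of one way to scale while ignoring the sentinel follows. `fit_scaler` and `apply_scaler` are hypothetical helpers, and mapping missing values to 0 after scaling is an assumption, not the notebook's method.

```python
import numpy as np

SENTINEL = -999.0  # missing-value marker used by flatten_sequence

def fit_scaler(X, sentinel=SENTINEL):
    """Per-feature mean/std over (samples, time_steps), ignoring sentinel entries."""
    flat = X.reshape(-1, X.shape[-1]).astype(float)
    valid = flat != sentinel
    n = np.maximum(valid.sum(axis=0), 1)
    mean = np.where(valid, flat, 0.0).sum(axis=0) / n
    var = np.where(valid, (flat - mean) ** 2, 0.0).sum(axis=0) / n
    std = np.sqrt(var)
    std[std == 0] = 1.0  # constant features are left unscaled
    return mean, std

def apply_scaler(X, mean, std, sentinel=SENTINEL):
    """Standardize valid entries; sentinel entries become 0 (the scaled feature mean)."""
    X = X.astype(float)
    return np.where(X == sentinel, 0.0, (X - mean) / std)

# Toy data: 2 samples, 2 time steps, 2 features, one missing value.
X_demo = np.array([[[10.0, 1.0], [20.0, SENTINEL]],
                   [[30.0, 3.0], [40.0, 5.0]]])
mean, std = fit_scaler(X_demo)
X_demo_scaled = apply_scaler(X_demo, mean, std)
print(mean, std)
print(X_demo_scaled)
```

The statistics would be fitted on `X_train` only and then applied unchanged to `X_val` and `X_test`, so that no information leaks from the held-out sets.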


In [12]:
for i, seq in enumerate(train_sequences_pd["past_races_sequence"][:5]):
    print(f"Sequence {i}:")
    for j, step in enumerate(seq):
        print(f"  Step {j}: {step}")

Sequence 0:
  Step 0: Row(ohe_flat=[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], avg_speed_agg=14.236616815476191, max_speed_agg=15.92266666666667, final_speed_agg=13.114374999999999, avg_accel_agg=-0.016854211262208464, fatigue_agg=0.17637068748953297, sectional_time_agg=14.278749999999999, running_time_agg=61.815, distance_back_agg=4.0, distance_ran_agg=201.27499999999998, strides_agg=21.025, max_speed_overall=15.92266666666667, min_speed_overall=13.114374999999999)
  Step 1: Row(ohe_flat=[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.

In [13]:
 # Label targets
print(np.unique(y_train))

[0 1 2 3 4]


In [14]:
label_dist = train_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0|  580|
|    1|  621|
|    2|  604|
|    3|  594|
|    4| 1889|
+-----+-----+



In [15]:
label_dist = val_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0|   98|
|    1|  105|
|    2|  131|
|    3|  128|
|    4|  385|
+-----+-----+



In [16]:
label_dist = test_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0|   89|
|    1|   97|
|    2|  107|
|    3|  113|
|    4|  382|
+-----+-----+



In [17]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}, horse_ids_train shape: {horse_ids_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}, horse_ids_val shape: {horse_ids_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}, horse_ids_test shape: {horse_ids_test.shape}")

X_train shape: (4288, 5, 135), y_train shape: (4288,), horse_ids_train shape: (4288,)
X_val shape: (847, 5, 135), y_val shape: (847,), horse_ids_val shape: (847,)
X_test shape: (788, 5, 135), y_test shape: (788,), horse_ids_test shape: (788,)


In [18]:
print(f"X_train: {X_train.shape}, horse_ids_train: {horse_ids_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, horse_ids_val: {horse_ids_val.shape}, y_val: {y_val.shape}")

X_train: (4288, 5, 135), horse_ids_train: (4288,), y_train: (4288,)
X_val: (847, 5, 135), horse_ids_val: (847,), y_val: (847,)


In [20]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, GRU, Dense, Embedding, Concatenate, Dropout, Flatten

# Define the input shapes
time_steps = X_train.shape[1]
features = X_train.shape[2]
num_horses = len(np.unique(horse_ids_train))

# 1) Input layers
input_features = Input(shape=(time_steps, features), name='input_features')
input_horse_id = Input(shape=(1,), name='input_horse_id')

# 2) Embedding for horse_id
# NOTE: Embedding expects integer indices in [0, input_dim). Raw horse_id values
# are only valid here if they already fall in [0, num_horses).
embedding = Embedding(input_dim=num_horses, output_dim=32)(input_horse_id)
embedding = Flatten()(embedding)

# 3) GRU layers (replaces LSTM layers)
gru_out = GRU(128, return_sequences=True)(input_features)
gru_out = Dropout(0.2)(gru_out)
gru_out = GRU(64, return_sequences=False)(gru_out)
gru_out = Dropout(0.2)(gru_out)

# 4) Concatenate GRU output with embedding
concat = Concatenate()([gru_out, embedding])

# 5) Dense layers + output (5 classes, labels 0-4)
dense_out = Dense(64, activation='relu')(concat)
dense_out = Dropout(0.2)(dense_out)
output = Dense(5, activation='softmax')(dense_out)

# 6) Build and compile model
model_gru = Model(inputs=[input_features, input_horse_id], outputs=output)
model_gru.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model_gru.summary()
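
The `Embedding` layer above is sized with `input_dim=num_horses`, so it expects indices in `[0, num_horses)`, while `horse_id` values are arbitrary database integers and validation or test horses may never appear in training. One possible way to handle both is to remap IDs to contiguous indices and reserve index 0 for unseen horses. The helpers `build_id_index` and `encode_ids` and the toy IDs are assumptions made for this sketch.

```python
import numpy as np

def build_id_index(train_ids):
    """Map each training horse_id to 1..N; index 0 is reserved for unseen horses."""
    unique_ids = np.unique(train_ids)
    return {int(h): i + 1 for i, h in enumerate(unique_ids)}

def encode_ids(ids, id_index):
    """Translate raw IDs to embedding indices, sending unknown IDs to 0."""
    return np.array([id_index.get(int(h), 0) for h in ids], dtype="int32")

# Toy IDs (assumption) standing in for horse_ids_train / horse_ids_val.
toy_train_ids = np.array([90417, 1203, 90417, 55])
toy_val_ids = np.array([1203, 777])  # 777 never appears in training

id_index = build_id_index(toy_train_ids)
train_idx = encode_ids(toy_train_ids, id_index)
val_idx = encode_ids(toy_val_ids, id_index)
embedding_input_dim = len(id_index) + 1  # +1 for the unseen-horse slot

print(train_idx, val_idx, embedding_input_dim)
```

With this scheme the layer would be built with `input_dim=embedding_input_dim`, and the encoded arrays would replace the raw `horse_ids_*` arrays in `fit`, `evaluate`, and `predict`.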

In [21]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, min_lr=1e-5
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

In [22]:
unique, counts = np.unique(y_train, return_counts=True)
train_counts = dict(zip(unique, counts))

print(train_counts)
print(unique)
print(counts)

{0: 580, 1: 621, 2: 604, 3: 594, 4: 1889}
[0 1 2 3 4]
[ 580  621  604  594 1889]


In [23]:
import numpy as np

# Train label counts for classes 0-4, matching the distribution printed above
train_counts = np.array([580, 621, 604, 594, 1889])
total = train_counts.sum()  # 4288
n_classes = len(train_counts)  # 5

# Balanced weighting: weight_i = total / (n_classes * count_i)
class_weight = {}
for i, count_i in enumerate(train_counts):
    class_weight[i] = float(total) / (n_classes * count_i)

print(class_weight)

{0: 1.4786206896551723, 1: 1.3809983896940419, 2: 1.4198675496688742, 3: 1.4437710437710438, 4: 0.453996823716252}
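
For reference, the weighting used in the cell above can be written as a formula, with $N$ the number of training samples, $K$ the number of classes, and $n_i$ the count of class $i$ (symbols introduced here for illustration):

```latex
w_i = \frac{N}{K \, n_i},
\qquad
w_0 = \frac{4288}{5 \times 580} \approx 1.4786,
\qquad
w_4 = \frac{4288}{5 \times 1889} \approx 0.4540
```

A class with exactly $N/K$ samples gets weight 1, so the over-represented class 4 is down-weighted and the other four classes are up-weighted.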


In [29]:
# Train the model

history = model_gru.fit(
    [X_train, horse_ids_train],
    y_train,
    epochs=50,  
    batch_size=4,  # 64,
    validation_data=([X_val, horse_ids_val], y_val),
    callbacks=[
        lr_scheduler, 
        early_stopping,
        model_checkpoint
    ],
    class_weight=class_weight,
    verbose=1
)


Epoch 1/50
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.2018 - loss: 1.6274 - val_accuracy: 0.1535 - val_loss: 1.6100 - learning_rate: 2.5000e-04
Epoch 2/50
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.1794 - loss: 1.6103 - val_accuracy: 0.1476 - val_loss: 1.6153 - learning_rate: 2.5000e-04
Epoch 3/50
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.2383 - loss: 1.5823 - val_accuracy: 0.1535 - val_loss: 1.6201 - learning_rate: 2.5000e-04
Epoch 4/50
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.1824 - loss: 1.5915 - val_accuracy: 0.1240 - val_loss: 1.6151 - learning_rate: 2.5000e-04
Epoch 5/50
1072/1072 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.1807 - loss: 1.6291 - val_accuracy: 0.1240 - val_loss: 1.6217 - learning_rate: 2.5000e-04
Epoch 6/50
1072/1072 ━━━━━━

In [30]:
val_loss, val_accuracy = model_gru.evaluate([X_val, horse_ids_val], y_val)
print(f'Validation Loss: {val_loss}, Validation Accuracy: {val_accuracy}')

27/27 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.1144 - loss: 1.6077
Validation Loss: 1.6077064275741577, Validation Accuracy: 0.12160566449165344
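
To put these numbers in context, it helps to compare them with two trivial baselines: a model that outputs a uniform distribution over the 5 classes, and a model that always predicts the most frequent class. The sketch below computes both from the validation label counts printed earlier; the variable names are chosen for this example.

```python
import math

# Validation label counts, copied from the val_sequences table above.
val_counts = {0: 98, 1: 105, 2: 131, 3: 128, 4: 385}
n_val = sum(val_counts.values())

# Cross-entropy of a uniform prediction over K classes is ln(K).
uniform_loss = math.log(len(val_counts))

# Accuracy from always predicting the most frequent class.
majority_acc = max(val_counts.values()) / n_val

print(f"n_val={n_val}, uniform_loss={uniform_loss:.4f}, majority_acc={majority_acc:.4f}")
```

The reported validation loss of about 1.6077 is close to the uniform-prediction value of ln 5 ≈ 1.6094, and the validation accuracy of about 0.12 is well below the majority-class rate of 385/847 ≈ 0.45. Together these suggest the network has learned little from its inputs so far.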


In [31]:
preds = model_gru.predict([X_test, horse_ids_test])
print(np.argmax(preds, axis=-1)[:50])  # First 50 predictions

25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
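
The first 50 predictions are all class 0, which hints that the model may be collapsing onto a single class. A fuller check is to count predictions per class and build a confusion matrix. The sketch below does this in NumPy on toy arrays; `confusion_matrix` here is a hypothetical local helper, and the toy labels stand in for `y_test` and `np.argmax(preds, axis=-1)`.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Toy stand-ins (assumption): a model that predicts class 0 for every sample.
toy_y_true = np.array([0, 1, 2, 3, 4, 4, 4])
toy_y_pred = np.zeros(7, dtype=int)

pred_counts = np.bincount(toy_y_pred, minlength=5)  # predictions per class
cm = confusion_matrix(toy_y_true, toy_y_pred, n_classes=5)

print(pred_counts)
print(cm)
```

If `pred_counts` on the real test set has a single non-zero entry, the softmax output is effectively constant and the inputs, scaling, or learning rate deserve a closer look before the architecture does.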


In [32]:
# Save the trained model
# model_gru.save('/path/to/save/model.h5')

In [33]:
model_gru.get_layer("embedding_1").get_weights()

[array([[ 0.04286112, -0.01373911,  0.0260236 , ...,  0.0472402 ,
         -0.01445367,  0.03107635],
        [-0.01513082,  0.03691511,  0.00313271, ..., -0.0067361 ,
         -0.01546048, -0.0210182 ],
        [-0.01055573,  0.02396407,  0.02193829, ...,  0.04986105,
         -0.04454018,  0.016414  ],
        ...,
        [ 0.04376504,  0.01242807, -0.03852003, ...,  0.02041975,
          0.02567029, -0.03338864],
        [-0.03575704,  0.0302016 ,  0.01271212, ..., -0.04342083,
          0.04992411, -0.0016344 ],
        [ 0.04707389, -0.00486392, -0.01071117, ..., -0.02900936,
         -0.03296497, -0.04964702]], dtype=float32)]