# Proposed Ensemble Models

Given the constraints and objectives, I will consider the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: 1D convolutions capture local temporal patterns and, when stacked, build them into higher-level features, potentially identifying events like sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.
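
One common way to combine models like these into an ensemble is soft voting: average each model's predicted class probabilities and take the argmax. The sketch below assumes every model outputs a softmax over the same set of classes. The helper name `soft_vote`, the optional weights, and the toy probabilities are illustrative assumptions, not something this notebook specifies.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Average per-model class probabilities and return (avg_probs, predicted_labels).

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    weights:   optional per-model weights; defaults to a uniform average.
    """
    probs = np.stack(prob_list, axis=0)           # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()             # keep each averaged row a valid distribution
    avg = np.tensordot(weights, probs, axes=1)    # (n_samples, n_classes)
    return avg, avg.argmax(axis=1)

# Toy example: three hypothetical models, two samples, three classes
p_lstm = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_cnn = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
p_tfm = np.array([[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]])

avg_probs, labels = soft_vote([p_lstm, p_cnn, p_tfm])
print(labels)
```

Passing `weights` (for example, each model's validation accuracy) turns this into a weighted average. Whether that helps here would need to be checked on the validation split.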

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	Similar to LSTMs but with a simpler architecture, GRUs can be more efficient and may perform better on certain datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.
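
To make "causal convolution" concrete, here is a minimal NumPy sketch of a single-channel causal dilated convolution, the basic building block of a TCN. It is not a full TCN (there are no residual connections or stacked layers), and the function name `causal_conv1d` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal dilated 1D convolution over a single-channel sequence.

    The output at time t depends only on x[t], x[t - d], x[t - 2d], ...
    Left zero-padding keeps the output the same length as the input.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros_like(x)
    for t in range(len(x)):
        for j in range(k):
            # kernel[j] weights the input j * dilation steps in the past
            out[t] += kernel[j] * padded[pad + t - j * dilation]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = causal_conv1d(x, kernel=[0.5, 0.5], dilation=2)
print(y)
```

Because each output looks only backwards in time, changing a later input never alters an earlier output. Stacking such layers with growing dilation is how a TCN widens its receptive field.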


## Load Parquet Train, Test, and Validation (VAL) Data:

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet

In [None]:
#spark.stop()

In [2]:
# Set Environment
import logging
import os
import pyspark.sql.functions as F
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, size, when, count
from src.data_preprocessing.data_prep1.data_utils import (
    save_parquet, gather_statistics, initialize_environment,
    load_config, initialize_logging, initialize_spark, 
    identify_and_impute_outliers, identify_and_remove_outliers, process_merged_results_sectionals,
    identify_missing_and_outliers
)

try:
    spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()
    # input("Press Enter to continue...")
except Exception as e:
    print(f"An error occurred during initialization: {e}")
    logging.error(f"An error occurred during initialization: {e}")

2024-12-30 15:06:03,543 - INFO - Environment setup initialized.


Spark session created successfully.


In [3]:
train_sequences = os.path.join(parquet_dir, "train_sequences.parquet")
train_sequences = spark.read.parquet(train_sequences)
test_sequences = os.path.join(parquet_dir, "test_sequences.parquet")
test_sequences = spark.read.parquet(test_sequences)
val_sequences = os.path.join(parquet_dir, "val_sequences.parquet")
val_sequences = spark.read.parquet(val_sequences)

In [4]:
train_sequences.printSchema()

root
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- horse_id: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- gate_index: integer (nullable = true)
 |-- course_cd_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- equip_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- surface_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- trk_cond_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- weather_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- med_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- stk_clm_md_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- turf_mud_mark_ohe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- race_type_ohe: array (nullable = true)
 |    |-- element: d

In [None]:
# test_sequences.printSchema()

In [5]:
# val_sequences.printSchema()

In [6]:
# If your label column is named "label":
train_sequences.groupBy("label").count().orderBy("count", ascending=False).show()

# This shows how many examples you have in each label category.
# If it's a multi-class problem (e.g., finishing position 0,1,2,...),
# you'll see how many rows for each class. 

+-----+-----+
|label|count|
+-----+-----+
|    2| 5984|
|    3| 5955|
|    1| 5840|
|    4| 5738|
|    0| 5662|
|    7| 5065|
|    5| 4928|
|    6| 3640|
+-----+-----+



In [7]:
import pyspark.sql.functions as F

label_counts = (
    train_sequences.groupBy("label")
      .agg(F.count("*").alias("count"))
)

total_count = train_sequences.count()

label_distribution = (
    label_counts
    .withColumn("percentage", (F.col("count") / total_count) * 100)
    .orderBy(F.desc("count"))
)

label_distribution.show()

+-----+-----+------------------+
|label|count|        percentage|
+-----+-----+------------------+
|    2| 5984| 13.97738951695786|
|    3| 5955|13.909651499579557|
|    1| 5840|13.641035223769038|
|    4| 5738|13.402784266093617|
|    0| 5662|13.225263944688406|
|    7| 5065|  11.8307951041764|
|    5| 4928|11.510791366906476|
|    6| 3640| 8.502289077828646|
+-----+-----+------------------+
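
The distribution above also fixes the accuracy a model has to beat. The short sketch below uses the counts copied from the table to compute two naive baselines: always predicting the largest class, and guessing uniformly at random. The variable names are illustrative.

```python
# Training label counts copied from the table above
label_counts = {0: 5662, 1: 5840, 2: 5984, 3: 5955, 4: 5738, 5: 4928, 6: 3640, 7: 5065}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the largest class
majority_baseline = label_counts[majority_label] / total
# Expected accuracy of guessing uniformly at random among the classes
uniform_baseline = 1 / len(label_counts)

print(f"total={total}, majority label={majority_label}")
print(f"majority baseline={majority_baseline:.4f}, uniform baseline={uniform_baseline:.4f}")
```

With eight fairly balanced classes, both baselines sit near 12 to 14 percent, so validation accuracy should be read against that floor rather than against 50 percent.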



In [None]:
# Distribution of course codes
# train_sequences.groupBy("course_cd_ohe").count().orderBy(F.desc("count")).show(50, truncate=False)

In [9]:
# train_sequences.groupBy("race_type_ohe").count().orderBy(F.desc("count")).show(50, truncate=False)

In [10]:
train_sequences.agg(
    F.countDistinct("course_cd_ohe").alias("distinct_tracks"),
    F.countDistinct("race_type_ohe").alias("distinct_race_types")
).show()

+---------------+-------------------+
|distinct_tracks|distinct_race_types|
+---------------+-------------------+
|             26|                 13|
+---------------+-------------------+



In [11]:
train_sequences.groupBy(F.year("race_date").alias("year"), F.month("race_date").alias("month"))\
  .count()\
  .orderBy("year", "month")\
  .show(48, truncate=False)

+----+-----+-----+
|year|month|count|
+----+-----+-----+
|2022|3    |80   |
|2022|4    |576  |
|2022|5    |1051 |
|2022|6    |1266 |
|2022|7    |1606 |
|2022|8    |1942 |
|2022|9    |2469 |
|2022|10   |2683 |
|2022|11   |3096 |
|2022|12   |3261 |
|2023|1    |2945 |
|2023|2    |3289 |
|2023|3    |3976 |
|2023|4    |4335 |
|2023|5    |5090 |
|2023|6    |5147 |
+----+-----+-----+



In [12]:
# Convert to Pandas DataFrame
train_sequences_pd = train_sequences.toPandas()
val_sequences_pd = val_sequences.toPandas()
test_sequences_pd = test_sequences.toPandas()

horse_ids_train = train_sequences_pd["horse_id"].values  # Extract horse_id for training
horse_ids_val = val_sequences_pd["horse_id"].values  # Extract horse_id for validation
horse_ids_test = test_sequences_pd["horse_id"].values  # Extract horse_id for testing

                                                                                

In [13]:
# show() prints the table itself and returns None, so it is not wrapped in print()
train_sequences.select(F.size("past_races_sequence")).distinct().show()

+-------------------------+
|size(past_races_sequence)|
+-------------------------+
|                       10|
+-------------------------+


In [14]:
label_distribution = train_sequences.groupBy("label").count().collect()
print(label_distribution)

[Row(label=1, count=5840), Row(label=6, count=3640), Row(label=3, count=5955), Row(label=5, count=4928), Row(label=4, count=5738), Row(label=7, count=5065), Row(label=2, count=5984), Row(label=0, count=5662)]


In [15]:
train_sequences_pd["past_races_sequence"].head()

0    [([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
1    [([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
2    [([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
3    [([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0...
4    [([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0...
Name: past_races_sequence, dtype: object

In [16]:
import numpy as np

# Aggregated per-race features appended after the one-hot block.
# Defined before flatten_sequence, which reads this list at call time.
aggregator_cols = [
    "avg_speed_agg", "max_speed_agg", "final_speed_agg", "avg_accel_agg", 
    "fatigue_agg", "sectional_time_agg", "running_time_agg", "distance_back_agg", 
    "distance_ran_agg", "strides_agg", "max_speed_overall", "min_speed_overall"
]

def flatten_sequence(sequence):
    """
    Flattens a sequence of race data into a single NumPy array.
    Ensures uniform array shapes for each time step in the sequence.
    """
    ohe_length = 43 + 26 + 10 + 2 + 8 + 7 + 5 + 4 + 4 + 14  # Total OHE length (123)
    aggregator_length = len(aggregator_cols)  # Length of aggregator columns (12)

    flattened_sequence = []
    for step in sequence:
        # Extract `ohe_flat` or default to zero array
        ohe_flat = step["ohe_flat"] if "ohe_flat" in step else [0.0] * ohe_length

        # Ensure `ohe_flat` has the correct length
        if len(ohe_flat) != ohe_length:
            ohe_flat = [0.0] * ohe_length

        # Extract aggregator values or default to -999.0
        aggregator_values = [step[agg] if agg in step else -999.0 for agg in aggregator_cols]

        # Concatenate `ohe_flat` and aggregator values
        flattened_step = np.array(ohe_flat + aggregator_values)

        # Verify the length of the flattened step
        if len(flattened_step) != (ohe_length + aggregator_length):
            raise ValueError(f"Flattened step has inconsistent length: {len(flattened_step)}")

        flattened_sequence.append(flattened_step)

    return np.array(flattened_sequence)

In [17]:
X_train = np.array([flatten_sequence(seq) for seq in train_sequences_pd["past_races_sequence"]])
y_train = train_sequences_pd["label"].values

X_val = np.array([flatten_sequence(seq) for seq in val_sequences_pd["past_races_sequence"]])
y_val = val_sequences_pd["label"].values

X_test = np.array([flatten_sequence(seq) for seq in test_sequences_pd["past_races_sequence"]])
y_test = test_sequences_pd["label"].values


In [18]:
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

X_train shape: (42812, 10, 135)
y_train shape: (42812,)
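
`flatten_sequence` fills missing aggregator values with the sentinel `-999.0`. Fed to a network unscaled, a value that large can swamp the real features, so it is worth checking how often it occurs. The sketch below measures the missing fraction and swaps the sentinel for a neutral fill value. The helper name `mask_sentinel` and the choice of `0.0` as the fill are assumptions; whether zero-filling suits this data would need to be validated.

```python
import numpy as np

SENTINEL = -999.0  # same placeholder value used for missing aggregator fields

def mask_sentinel(X, sentinel=SENTINEL, fill_value=0.0):
    """Return (X_clean, missing_mask, missing_fraction) for a (samples, steps, features) array."""
    X = np.asarray(X, dtype=float)
    missing = X == sentinel                      # True wherever the placeholder appears
    X_clean = np.where(missing, fill_value, X)   # replace placeholders, keep real values
    return X_clean, missing, float(missing.mean())

# Toy batch: 1 sample, 2 time steps, 3 features
X_toy = np.array([[[1.0, -999.0, 3.0],
                   [-999.0, -999.0, 6.0]]])
X_clean, missing, frac = mask_sentinel(X_toy)
print(frac)
```

The returned boolean mask could also be concatenated to the features as extra indicator channels, so the model can tell "missing" apart from a genuine zero.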


In [19]:
 # Label targets
print(np.unique(y_train))

[0 1 2 3 4 5 6 7]


In [20]:
label_dist = train_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 5662|
|    1| 5840|
|    2| 5984|
|    3| 5955|
|    4| 5738|
|    5| 4928|
|    6| 3640|
|    7| 5065|
+-----+-----+



In [21]:
label_dist = val_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 1473|
|    1| 1591|
|    2| 1528|
|    3| 1536|
|    4| 1437|
|    5| 1261|
|    6|  940|
|    7| 1487|
+-----+-----+



In [22]:
label_dist = test_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 1055|
|    1| 1115|
|    2| 1130|
|    3| 1107|
|    4| 1073|
|    5|  943|
|    6|  821|
|    7| 1224|
+-----+-----+



In [23]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}, horse_ids_train shape: {horse_ids_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}, horse_ids_val shape: {horse_ids_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}, horse_ids_test shape: {horse_ids_test.shape}")

X_train shape: (42812, 10, 135), y_train shape: (42812,), horse_ids_train shape: (42812,)
X_val shape: (11253, 10, 135), y_val shape: (11253,), horse_ids_val shape: (11253,)
X_test shape: (8468, 10, 135), y_test shape: (8468,), horse_ids_test shape: (8468,)


In [24]:
print(f"X_train: {X_train.shape}, horse_ids_train: {horse_ids_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, horse_ids_val: {horse_ids_val.shape}, y_val: {y_val.shape}")

X_train: (42812, 10, 135), horse_ids_train: (42812,), y_train: (42812,)
X_val: (11253, 10, 135), horse_ids_val: (11253,), y_val: (11253,)
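
A Keras `Embedding` layer expects integer indices in the range `[0, input_dim)`. Raw `horse_id` values are database identifiers and are not guaranteed to fall in that range, and the validation and test sets may contain horses never seen in training. The sketch below shows one way to remap IDs to contiguous indices with a reserved slot for unseen horses. The helper names `build_id_index` and `encode_ids` and the toy IDs are hypothetical.

```python
import numpy as np

def build_id_index(train_ids):
    """Map each distinct training ID to an index 1..N; index 0 is reserved for unseen IDs."""
    return {int(h): i + 1 for i, h in enumerate(np.unique(train_ids))}

def encode_ids(ids, id_index):
    """Encode raw IDs as embedding indices, sending unseen IDs to 0."""
    return np.array([id_index.get(int(h), 0) for h in ids], dtype=np.int32)

# Toy IDs standing in for horse_ids_train / horse_ids_val
toy_train = np.array([90210, 17, 4521, 17])
toy_val = np.array([4521, 99999])

id_index = build_id_index(toy_train)
train_idx = encode_ids(toy_train, id_index)
val_idx = encode_ids(toy_val, id_index)
embedding_input_dim = len(id_index) + 1  # +1 for the reserved "unknown" slot

print(train_idx, val_idx, embedding_input_dim)
```

With this mapping, the encoded arrays would be passed to the model in place of the raw IDs, and `embedding_input_dim` would be used as the `input_dim` of the embedding layer.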


In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Concatenate, Dropout, Flatten

# Define the input shapes
time_steps = X_train.shape[1]  # Number of time steps in sequences
features = X_train.shape[2]    # Number of features per time step
num_horses = len(np.unique(horse_ids_train))  # Number of unique horse IDs

# 1. Input layers
input_features = Input(shape=(time_steps, features), name='input_features')
input_horse_id = Input(shape=(1,), name='input_horse_id')

# 2. Embedding layer for horse_id
# NOTE: Embedding expects integer indices in [0, input_dim). If horse_id values
# are raw identifiers rather than contiguous indices, remap them before training.
embedding = Embedding(input_dim=num_horses, output_dim=32)(input_horse_id)
embedding = Flatten()(embedding)  # Flatten embedding

# 3. LSTM layers for sequential data
x = LSTM(256, return_sequences=True)(input_features)
x = Dropout(0.2)(x)
x = LSTM(128, return_sequences=True)(x)
x = Dropout(0.2)(x)
x = LSTM(64, return_sequences=True)(x)
x = Dropout(0.2)(x)
x = LSTM(32, return_sequences=False)(x)
x = Dropout(0.2)(x)

# 4. Concatenate the final LSTM output with the horse embedding
concat = Concatenate()([x, embedding])

# 5. Dense layers
dense_out = Dense(64, activation='relu')(concat)
dense_out = Dropout(0.2)(dense_out)

# 6. Final output layer for 8-class classification
num_classes = 8
output = Dense(num_classes, activation='softmax')(dense_out)

# Define the model
model_lstm = Model(inputs=[input_features, input_horse_id], outputs=output)

# Compile with sparse_categorical_crossentropy, which works with integer labels
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model_lstm.compile(
    optimizer=optimizer,
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model_lstm.summary()

In [None]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)
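
To see how far `ReduceLROnPlateau` can lower the learning rate with `factor=0.5` and `min_lr=1e-6`, the arithmetic can be sketched in plain Python. This assumes a starting rate of `1e-3` (Adam's default) and that every plateau triggers a reduction. The helper `plateau_schedule` is illustrative and not part of Keras.

```python
def plateau_schedule(initial_lr=1e-3, factor=0.5, min_lr=1e-6):
    """List the learning rates visited if every plateau triggers a reduction."""
    rates = [initial_lr]
    while rates[-1] > min_lr:
        # Each reduction multiplies by `factor`, but never goes below `min_lr`
        rates.append(max(rates[-1] * factor, min_lr))
    return rates

rates = plateau_schedule()
print(f"{len(rates) - 1} reductions, final lr = {rates[-1]}")
```

In practice the floor is unlikely to be reached: a reduction needs 5 stalled epochs, while early stopping with `patience=10` ends training after a long enough stall in `val_loss`.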

In [None]:
unique, counts = np.unique(y_train, return_counts=True)
train_counts = dict(zip(unique, counts))

print(train_counts)
print(unique)
print(counts)

In [None]:
import numpy as np

# Label distribution
unique, counts = np.unique(y_train, return_counts=True)
train_counts = dict(zip(unique, counts))

print("Label Counts:", train_counts)

# Total samples and number of unique classes
total_samples = np.sum(counts)
num_classes = len(unique)

# Calculate class weights as total_samples / (num_classes * count),
# so that rarer classes receive proportionally larger weights.
# Keys and values are cast to plain Python types for use with Keras.
class_weight = {
    int(label): float(total_samples / (num_classes * count))
    for label, count in train_counts.items()
}

print("Class Weights:", class_weight)
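
Applying the weight formula `total_samples / (num_classes * count)` to the training counts reported earlier in this notebook gives a feel for the magnitudes involved. The sketch below is a worked example in plain Python, with the counts copied from the earlier `groupBy("label")` output.

```python
# Training label counts as reported by the earlier groupBy("label") output
train_counts = {0: 5662, 1: 5840, 2: 5984, 3: 5955, 4: 5738, 5: 4928, 6: 3640, 7: 5065}

total_samples = sum(train_counts.values())
num_classes = len(train_counts)

class_weight = {
    label: total_samples / (num_classes * count)
    for label, count in train_counts.items()
}

for label in sorted(class_weight):
    print(label, round(class_weight[label], 3))

# Sanity check: with this formula, sum(count * weight) equals the number of samples
weighted_total = sum(train_counts[k] * class_weight[k] for k in train_counts)
print(round(weighted_total))
```

The weights stay within roughly 0.89 to 1.47, which reflects how mild the imbalance is. Only label 6 is noticeably under-represented.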

In [None]:
print(X_train.shape)

In [None]:
# Train the model

history = model_lstm.fit(
    [X_train, horse_ids_train],
    y_train,
    epochs=50,  
    batch_size=64,  # 64,
    validation_data=([X_val, horse_ids_val], y_val),
    callbacks=[
        lr_scheduler, 
        early_stopping,
        model_checkpoint
    ],
    #class_weight=class_weight,
    verbose=1
)


In [None]:
val_loss, val_accuracy = model_lstm.evaluate([X_val, horse_ids_val], y_val)
print(f'Validation Loss: {val_loss}, Validation Accuracy: {val_accuracy}')

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

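# Predict on validation data: the model outputs 8 softmax probabilities per sample,
# so the predicted class is the argmax over the class axis
val_probs = model_lstm.predict([X_val, horse_ids_val])
val_preds = np.argmax(val_probs, axis=1)

# Generate a classification report (one row per label 0-7)
print(classification_report(y_val, val_preds))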

# Display a confusion matrix
print(confusion_matrix(y_val, val_preds))
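
For a multi-class softmax output, class labels come from the argmax over the class axis. A 0.5 threshold applies only to binary sigmoid outputs. The sketch below decodes toy probabilities into labels and also computes top-k accuracy, which can be informative when adjacent classes are easy to confuse. The toy values and the choice of `k = 3` are illustrative assumptions.

```python
import numpy as np

# Toy softmax outputs for 3 samples over 8 classes (each row sums to 1)
probs = np.array([
    [0.05, 0.40, 0.10, 0.10, 0.10, 0.10, 0.10, 0.05],
    [0.30, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05, 0.25],
    [0.02, 0.03, 0.05, 0.10, 0.10, 0.10, 0.60, 0.00],
])
y_true = np.array([1, 7, 6])

# Most probable class per row
pred = probs.argmax(axis=1)
accuracy = float((pred == y_true).mean())

# Top-k accuracy: the true label is among the k most probable classes
k = 3
top_k = np.argsort(probs, axis=1)[:, -k:]
top_k_accuracy = float((top_k == y_true[:, None]).any(axis=1).mean())

print(pred, accuracy, top_k_accuracy)
```

The second sample is misclassified by argmax, but its true label is the second most probable class, so it still counts as a hit under top-3 accuracy.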

In [None]:
# Save the trained model
#model_lstm.save('/path/to/save/model.h5')

In [None]:
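# Look up the Embedding layer by type: auto-generated names such as "embedding_6"
# change every time the model is rebuilt in the same session
embedding_layer = next(
    layer for layer in model_lstm.layers
    if isinstance(layer, tf.keras.layers.Embedding)
)
embedding_layer.get_weights()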