# Proposed Ensemble Models

Given the constraints and objectives, I will consider the following models for the ensemble:
	
1. Model 1: LSTM Network on Raw GPS Data
   - Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).
   - Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.
   - Advantage: LSTMs are well-suited to time-series data and can learn complex temporal dynamics without hand-engineered features such as acceleration.

2. Model 2: 1D Convolutional Neural Network (1D-CNN)
   - Input Data: The same raw GPS sequences as in Model 1.
   - Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.
   - Advantage: Stacked 1D convolutions build a hierarchy of local temporal patterns and are effective at recognizing short motifs in sequences, such as sudden changes in speed or stride frequency.

3. Model 3: Transformer-based Model
   - Input Data: Raw GPS sequences and possibly sectionals data.
   - Architecture: A Transformer model that uses self-attention to weigh the importance of different parts of the sequence.
   - Advantage: Transformers can model long-range dependencies and focus on the parts of the sequence most relevant to the prediction.
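
To make the self-attention idea behind Model 3 concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention over one sequence. The projection matrices `w_q`, `w_k`, `w_v`, the dimensions, and the random inputs are illustrative assumptions and are not taken from the project code.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for one sequence.

    x: (time_steps, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended output and the (time_steps, time_steps) weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over time steps
    return weights @ v, weights

rng = np.random.default_rng(0)
time_steps, d_model, d_head = 6, 4, 3              # hypothetical sizes
x = rng.normal(size=(time_steps, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

attended, attn_weights = self_attention(x, w_q, w_k, w_v)
print(attended.shape, attn_weights.shape)
```

Each row of `attn_weights` sums to 1 and describes how strongly one time step attends to every other time step, which is what lets the model relate distant parts of a race.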

## Additional Models (Optional):

4. Model 4: Gated Recurrent Unit (GRU) Network
   - Similar to LSTMs but with a simpler architecture, GRUs can be more efficient and may perform better on certain datasets.

5. Model 5: Temporal Convolutional Network (TCN)
   - TCNs are designed for sequential data and can capture long-term dependencies using causal convolutions and residual connections.
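
The causal convolution that TCNs rely on can be illustrated with a small NumPy sketch. This covers only the causal, dilated convolution step (no residual connections or learned weights); the `speed` values and the difference kernel are hypothetical.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """Causal 1D convolution: y[t] uses only x[t], x[t-d], x[t-2d], ...

    x: (time_steps,) signal; kernel: (k,) weights, with kernel[0] applied to x[t].
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])  # left-pad only: no future leakage
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            y[t] += kernel[i] * x_padded[pad + t - i * dilation]
    return y

speed = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # hypothetical speed trace
diff_kernel = np.array([1.0, -1.0])            # y[t] = x[t] - x[t-d]

step_1 = causal_dilated_conv1d(speed, diff_kernel, dilation=1)
step_2 = causal_dilated_conv1d(speed, diff_kernel, dilation=2)
print(step_1.tolist())  # [1.0, 1.0, 2.0, 4.0, 8.0]
print(step_2.tolist())  # [1.0, 2.0, 3.0, 6.0, 12.0]
```

Increasing the dilation widens the look-back window without adding parameters, which is how stacked TCN layers reach long-term dependencies.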


## Load Parquet Train, Test, and Validation (VAL) Data:

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet

In [None]:
#spark.stop()

In [2]:
# Set Environment
import logging
import os
import pyspark.sql.functions as F
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, size, when, count
from src.data_preprocessing.data_prep1.data_utils import (
    save_parquet, gather_statistics, initialize_environment,
    load_config, initialize_logging, initialize_spark, 
    identify_and_impute_outliers, identify_and_remove_outliers, process_merged_results_sectionals,
    identify_missing_and_outliers
)

try:
    spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()
    # input("Press Enter to continue...")
except Exception as e:
    print(f"An error occurred during initialization: {e}")
    logging.error(f"An error occurred during initialization: {e}")

2024-12-28 22:17:21,367 - INFO - Environment setup initialized.


Spark session created successfully.


In [3]:
gpspoint = os.path.join(parquet_dir, "gpspoint.parquet")
gpspoint = spark.read.parquet(gpspoint)
sectionals = os.path.join(parquet_dir, "sectionals.parquet")
sectionals = spark.read.parquet(sectionals)
results = os.path.join(parquet_dir, "results.parquet")
results = spark.read.parquet(results)

In [4]:
gpspoint.printSchema()

root
 |-- course_cd: string (nullable = true)
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- saddle_cloth_number: string (nullable = true)
 |-- time_stamp: timestamp (nullable = true)
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- speed: double (nullable = true)
 |-- progress: double (nullable = true)
 |-- stride_frequency: double (nullable = true)
 |-- post_time: timestamp (nullable = true)
 |-- location: string (nullable = true)



In [5]:
sectionals.printSchema()

root
 |-- course_cd: string (nullable = true)
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- saddle_cloth_number: string (nullable = true)
 |-- gate_name: string (nullable = true)
 |-- gate_numeric: double (nullable = true)
 |-- length_to_finish: double (nullable = true)
 |-- sectional_time: double (nullable = true)
 |-- running_time: double (nullable = true)
 |-- distance_back: double (nullable = true)
 |-- distance_ran: double (nullable = true)
 |-- number_of_strides: double (nullable = true)
 |-- post_time: timestamp (nullable = true)



In [6]:
results.printSchema()

root
 |-- course_cd: string (nullable = true)
 |-- race_date: date (nullable = true)
 |-- race_number: integer (nullable = true)
 |-- saddle_cloth_number: string (nullable = true)
 |-- horse_id: integer (nullable = true)
 |-- horse_name: string (nullable = true)
 |-- official_fin: integer (nullable = true)
 |-- purse: integer (nullable = true)
 |-- wps_pool: decimal(10,2) (nullable = true)
 |-- weight: decimal(10,2) (nullable = true)
 |-- date_of_birth: date (nullable = true)
 |-- sex: string (nullable = true)
 |-- start_position: long (nullable = true)
 |-- equip: string (nullable = true)
 |-- claimprice: double (nullable = true)
 |-- surface: string (nullable = true)
 |-- surface_type_description: string (nullable = true)
 |-- trk_cond: string (nullable = true)
 |-- trk_cond_desc: string (nullable = true)
 |-- weather: string (nullable = true)
 |-- distance: decimal(10,2) (nullable = true)
 |-- dist_unit: string (nullable = true)
 |-- power: decimal(10,2) (nullable = true)
 |-- med: 

In [7]:
results.count()

391094

In [8]:
sectionals.count()

4473707

In [9]:
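# An inner join returns one row per matching sectionals row (one per gate),
# so this count can exceed results.count().
matched = results.join(sectionals, ["course_cd","race_date","race_number","saddle_cloth_number"], "inner").count()
total_results = results.count()
print("Matched:", matched, "out of total results:", total_results)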

Matched: 1120045 out of total results: 391094


In [10]:
# Which (course_cd, race_date, race_number, SC) are in results but not in sectionals?
unmatched = results.join(
    sectionals, 
    ["course_cd","race_date","race_number","saddle_cloth_number"], 
    "left_anti"  # rows from results NOT in sectionals
)
print(unmatched.count())



310997




In [11]:
matched_results_only = results.join(
    sectionals, 
    ["course_cd","race_date","race_number","saddle_cloth_number"], 
    "left_semi"
)
matched_results_only_count = matched_results_only.count()
print(matched_results_only_count)



80097


                                                                                

In [12]:
391094 - 310997

80097
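
The counts above are consistent once join semantics are taken into account: the inner join fans out to one row per sectional gate (roughly 14 per matched runner here), while the semi and anti joins partition the `results` rows. A toy pure-Python sketch of the same three counts, using hypothetical keys and assuming each key appears once in `results`:

```python
# Toy rows keyed by (course_cd, race_number, saddle_cloth_number); values are hypothetical.
results_keys = [("AQU", 1, "1"), ("AQU", 1, "2"), ("BEL", 3, "5")]
sectionals_keys = [
    ("AQU", 1, "1"), ("AQU", 1, "1"), ("AQU", 1, "1"),  # three gates for one runner
    ("AQU", 1, "2"), ("AQU", 1, "2"),                   # two gates for another
]
sectional_key_set = set(sectionals_keys)

# Inner join: one output row per (result, matching sectional) pair.
inner_count = sum(sectionals_keys.count(k) for k in results_keys)
# Left semi join: results rows with at least one match, each counted once.
semi_count = sum(1 for k in results_keys if k in sectional_key_set)
# Left anti join: results rows with no match at all.
anti_count = sum(1 for k in results_keys if k not in sectional_key_set)

print(inner_count, semi_count, anti_count)  # 5 2 1
assert semi_count + anti_count == len(results_keys)
```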

In [20]:
# Select the course_cd column and get distinct rows
distinct_tracks_df = matched_results_only.select("course_cd").distinct()

# Display them on the console
distinct_tracks_df.show(50)

# If you want them as a Python list, you can do:
unique_tracks = [row["course_cd"] for row in distinct_tracks_df.collect()]
print(unique_tracks)

                                                                                

+---------+
|course_cd|
+---------+
|      TWO|
|      MTH|
|      MVR|
|      TSA|
|      CBY|
|      SAR|
|      TTP|
|      TLS|
|      PEN|
|      TGP|
|      AQU|
|      CNL|
|      LAD|
|      KEE|
|      LRL|
|      DMR|
|      CTD|
|      TAM|
|      CLS|
|      ELP|
|      PIM|
|      IND|
|      TCD|
|      HOU|
|      BEL|
|      TOP|
|      ASD|
|      TGG|
+---------+

['TWO', 'MTH', 'MVR', 'TSA', 'CBY', 'SAR', 'TTP', 'TLS', 'PEN', 'TGP', 'AQU', 'CNL', 'LAD', 'KEE', 'LRL', 'DMR', 'CTD', 'TAM', 'CLS', 'ELP', 'PIM', 'IND', 'TCD', 'HOU', 'BEL', 'TOP', 'ASD', 'TGG']


In [None]:
unmatched

In [21]:
# Select the course_cd column and get distinct rows
distinct_tracks_df = unmatched.select("course_cd").distinct()

# Display them on the console
distinct_tracks_df.show(50)

# If you want them as a Python list, you can do:
unique_tracks = [row["course_cd"] for row in distinct_tracks_df.collect()]
print(unique_tracks)

                                                                                

+---------+
|course_cd|
+---------+
|      TWO|
|      MTH|
|      MVR|
|      TSA|
|      CBY|
|      SAR|
|      TTP|
|      TLS|
|      PEN|
|      TGP|
|      AQU|
|      CNL|
|      LAD|
|      KEE|
|      LRL|
|      TKD|
|      DMR|
|      CTD|
|      TAM|
|      CLS|
|      MED|
|      ELP|
|      PIM|
|      IND|
|      TCD|
|      HOU|
|      BEL|
|      TOP|
|      ASD|
|      TGG|
+---------+

['TWO', 'MTH', 'MVR', 'TSA', 'CBY', 'SAR', 'TTP', 'TLS', 'PEN', 'TGP', 'AQU', 'CNL', 'LAD', 'KEE', 'LRL', 'TKD', 'DMR', 'CTD', 'TAM', 'CLS', 'MED', 'ELP', 'PIM', 'IND', 'TCD', 'HOU', 'BEL', 'TOP', 'ASD', 'TGG']


In [None]:
train_sequences_path = os.path.join(parquet_dir, "train_sequences.parquet")
val_sequences_path = os.path.join(parquet_dir, "val_sequences.parquet")
test_sequences_path = os.path.join(parquet_dir, "test_sequences.parquet")
train_sequences = spark.read.parquet(train_sequences_path)
val_sequences = spark.read.parquet(val_sequences_path)
test_sequences = spark.read.parquet(test_sequences_path)

In [None]:
train_sequences.printSchema()

In [None]:
# Convert to Pandas DataFrame
train_sequences_pd = train_sequences.toPandas()
val_sequences_pd = val_sequences.toPandas()
test_sequences_pd = test_sequences.toPandas()

horse_ids_train = train_sequences_pd["horse_id"].values  # Extract horse_id for training
horse_ids_val = val_sequences_pd["horse_id"].values  # Extract horse_id for validation
horse_ids_test = test_sequences_pd["horse_id"].values  # Extract horse_id for testing

In [None]:
# show() prints the table itself and returns None, so it should not be wrapped in print()
train_sequences.select(F.size("past_races_sequence")).distinct().show()

In [None]:
label_distribution = train_sequences.groupBy("label").count().collect()
print(label_distribution)

In [None]:
train_sequences_pd["past_races_sequence"].head()

In [None]:
import numpy as np

# Per-race aggregate features appended after the one-hot block, in a fixed order
aggregator_cols = [
    "avg_speed_agg", "max_speed_agg", "final_speed_agg", "avg_accel_agg", 
    "fatigue_agg", "sectional_time_agg", "running_time_agg", "distance_back_agg", 
    "distance_ran_agg", "strides_agg", "max_speed_overall", "min_speed_overall"
]

def flatten_sequence(sequence):
    """
    Flattens a sequence of race data into a 2D NumPy array of shape
    (time_steps, ohe_length + len(aggregator_cols)).
    Ensures uniform array shapes for each time step in the sequence.
    """
    ohe_length = 43 + 26 + 10 + 2 + 8 + 7 + 5 + 4 + 4 + 14  # Total OHE length (123)
    aggregator_length = len(aggregator_cols)  # Length of aggregator columns

    flattened_sequence = []
    for step in sequence:
        # Extract `ohe_flat`, falling back to zeros if it is missing or has the wrong length
        ohe_flat = step["ohe_flat"] if "ohe_flat" in step else None
        if ohe_flat is None or len(ohe_flat) != ohe_length:
            ohe_flat = [0.0] * ohe_length

        # Extract aggregator values; missing or null values default to -999.0
        aggregator_values = [
            step[agg] if agg in step and step[agg] is not None else -999.0
            for agg in aggregator_cols
        ]

        # Concatenate as lists so this also works when `ohe_flat` arrives as a NumPy array
        flattened_step = np.array(list(ohe_flat) + aggregator_values)

        # Verify the length of the flattened step
        if len(flattened_step) != (ohe_length + aggregator_length):
            raise ValueError(f"Flattened step has inconsistent length: {len(flattened_step)}")

        flattened_sequence.append(flattened_step)

    return np.array(flattened_sequence)

In [None]:
X_train = np.array([flatten_sequence(seq) for seq in train_sequences_pd["past_races_sequence"]])
y_train = train_sequences_pd["label"].values

X_val = np.array([flatten_sequence(seq) for seq in val_sequences_pd["past_races_sequence"]])
y_val = val_sequences_pd["label"].values

X_test = np.array([flatten_sequence(seq) for seq in test_sequences_pd["past_races_sequence"]])
y_test = test_sequences_pd["label"].values
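
Stacking the flattened sequences with `np.array` only yields a 3D array when every horse has the same number of past races. If the sequences turn out to be ragged, one option is to pad or truncate them first. The sketch below is a hypothetical helper (`pad_or_truncate` is not part of the project) and assumes each sequence is ordered oldest race first.

```python
import numpy as np

def pad_or_truncate(seq, time_steps, pad_value=0.0):
    """Return a (time_steps, n_features) array from a (t, n_features) array.

    Keeps the most recent `time_steps` rows and left-pads shorter sequences.
    """
    seq = np.asarray(seq, dtype=np.float32)
    seq = seq[-time_steps:]                      # keep the latest races
    n_missing = time_steps - seq.shape[0]
    if n_missing > 0:
        padding = np.full((n_missing, seq.shape[1]), pad_value, dtype=np.float32)
        seq = np.vstack([padding, seq])
    return seq

# Hypothetical ragged input: 2, 5, and 3 past races with 4 features each.
ragged = [np.ones((2, 4)), np.ones((5, 4)), np.ones((3, 4))]
batch = np.stack([pad_or_truncate(s, time_steps=3) for s in ragged])
print(batch.shape)  # (3, 3, 4)
```

Left-padding keeps the most recent race in the final time step, which is the position the last LSTM state summarizes.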


In [None]:
y_val

In [None]:
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")

In [None]:
for i, seq in enumerate(train_sequences_pd["past_races_sequence"][:5]):
    print(f"Sequence {i}:")
    for j, step in enumerate(seq):
        print(f"  Step {j}: {step}")

In [None]:
# Label targets
print(np.unique(y_train))

In [None]:
label_dist = train_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

In [None]:
label_dist = val_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

In [None]:
label_dist = test_sequences.groupBy("label").count().orderBy("label")
label_dist.show()

In [None]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}, horse_ids_train shape: {horse_ids_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}, horse_ids_val shape: {horse_ids_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}, horse_ids_test shape: {horse_ids_test.shape}")

In [None]:
print(f"X_train: {X_train.shape}, horse_ids_train: {horse_ids_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, horse_ids_val: {horse_ids_val.shape}, y_val: {y_val.shape}")

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Concatenate, Dropout, Flatten

# Define the input shapes
time_steps = X_train.shape[1]  # Number of time steps in sequences
features = X_train.shape[2]    # Number of features per time step
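# Embedding indices must lie in [0, input_dim). Raw horse IDs are not guaranteed to be
# contiguous, so size the table from the largest ID in any split, not the unique count.
num_horses = int(max(horse_ids_train.max(), horse_ids_val.max(), horse_ids_test.max())) + 1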

# 1. Input layers
input_features = Input(shape=(time_steps, features), name='input_features')
input_horse_id = Input(shape=(1,), name='input_horse_id')

# 2. Embedding layer for horse_id
embedding = Embedding(input_dim=num_horses, output_dim=32)(input_horse_id)
embedding = Flatten()(embedding)  # Flatten embedding

# 3. LSTM layers for sequential data
x = LSTM(128, return_sequences=True)(input_features)
x = Dropout(0.2)(x)
x = LSTM(64, return_sequences=False)(x)
x = Dropout(0.2)(x)

# 4. Concatenate the final LSTM output with the horse embedding
concat = Concatenate()([x, embedding])

# 5. Dense layers
dense_out = Dense(64, activation='relu')(concat)
dense_out = Dropout(0.2)(dense_out)

# 6. Final output layer for binary classification
output = Dense(1, activation='sigmoid')(dense_out)  # Binary classification

# Define the model
model_lstm = Model(inputs=[input_features, input_horse_id], outputs=output)

# Compile the model with binary_crossentropy
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model_lstm.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Display the model summary
model_lstm.summary()
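
The embedding layer above is indexed directly by `horse_id`, so every ID passed to the model must be a valid row of the embedding table. One common way to keep the table small and to handle horses that appear only in validation or test data is to remap raw IDs to contiguous indices with a reserved "unknown" slot. The helpers below are a hypothetical sketch, not part of the project code, and the IDs are made up.

```python
import numpy as np

def build_id_index(train_ids):
    """Map each distinct training ID to 1..N; index 0 is reserved for unseen IDs."""
    return {hid: i + 1 for i, hid in enumerate(np.unique(train_ids).tolist())}

def encode_ids(ids, id_index):
    """Convert raw IDs to embedding indices, sending unknown IDs to 0."""
    return np.array([id_index.get(h, 0) for h in np.asarray(ids).tolist()], dtype=np.int32)

raw_train = np.array([90210, 1234, 90210, 77])  # hypothetical raw horse IDs
raw_val = np.array([1234, 555])                 # 555 never appears in training

id_index = build_id_index(raw_train)
train_idx = encode_ids(raw_train, id_index)
val_idx = encode_ids(raw_val, id_index)
embedding_input_dim = len(id_index) + 1         # +1 for the reserved index 0

print(train_idx.tolist(), val_idx.tolist(), embedding_input_dim)  # [3, 2, 3, 1] [2, 0] 4
```

With this mapping, `embedding_input_dim` would be passed as `input_dim` and the encoded arrays would replace the raw ID arrays in `fit`, `evaluate`, and `predict`.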

In [None]:
y_train_binary = (y_train == 1).astype(int)  # Convert to binary labels
y_val_binary = (y_val == 1).astype(int)

In [None]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6
)

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

In [None]:
unique, counts = np.unique(y_train, return_counts=True)
train_counts = dict(zip(unique, counts))

print(train_counts)
print(unique)
print(counts)

In [None]:
import numpy as np

# Label distribution
unique, counts = np.unique(y_train_binary, return_counts=True)
train_counts = dict(zip(unique, counts))

print("Label Counts:", train_counts)

# Total samples and number of unique classes
total_samples = np.sum(counts)
num_classes = len(unique)

# Balanced class weights: weight_c = total_samples / (num_classes * count_c),
# so rarer classes receive proportionally larger weights.
class_weight = {
    int(label): float(total_samples / (num_classes * count))
    for label, count in train_counts.items()
}

print("Class Weights:", class_weight)
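
As a quick check on the weighting formula, the following sketch uses hypothetical label counts (not the project's actual distribution) and shows that, after weighting, each class contributes the same total weight to the loss.

```python
# Hypothetical binary label counts: 900 "not first", 100 "first".
toy_counts = {0: 900, 1: 100}
toy_total = sum(toy_counts.values())
toy_num_classes = len(toy_counts)

# Same formula as above: weight_c = total / (num_classes * count_c)
toy_weights = {label: toy_total / (toy_num_classes * n) for label, n in toy_counts.items()}

# Effective contribution of each class = weight * number of samples
toy_contribution = {label: toy_weights[label] * toy_counts[label] for label in toy_counts}

print(toy_weights)
print(toy_contribution)
```

Both classes end up with a contribution of `toy_total / toy_num_classes`, which is why the minority class is not drowned out during training.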

In [None]:
print(X_train.shape)

In [None]:
# Train the model

history = model_lstm.fit(
    [X_train, horse_ids_train],
    y_train_binary,
    epochs=50,  
    batch_size=8,  # 64,
    validation_data=([X_val, horse_ids_val], y_val_binary),
    callbacks=[
        lr_scheduler, 
        early_stopping,
        model_checkpoint
    ],
    class_weight=class_weight,
    verbose=1
)


In [None]:
val_loss, val_accuracy = model_lstm.evaluate([X_val, horse_ids_val], y_val_binary)
print(f'Validation Loss: {val_loss}, Validation Accuracy: {val_accuracy}')

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict on validation data
val_preds = (model_lstm.predict([X_val, horse_ids_val]) > 0.5).astype(int)

# Generate a classification report
print(classification_report(y_val_binary, val_preds, target_names=["Not 1st", "1st"]))

# Display a confusion matrix
print(confusion_matrix(y_val_binary, val_preds))

In [None]:
# Save the trained model
# model_lstm.save('/path/to/save/model.h5')

In [None]:
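# Auto-generated layer names such as "embedding_6" change between runs, so look the layer up by type
embedding_layer = next(
    layer for layer in model_lstm.layers
    if isinstance(layer, tf.keras.layers.Embedding)
)
embedding_layer.get_weights()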