# Proposed Ensemble Models

Given the constraints and objectives, I will consider the following models for the ensemble:
	
    1.	Model 1: LSTM Network on Raw GPS Data
    
>•	Input Data: Sequences of raw GPS data (speed, progress, stride_frequency, etc.).

>•	Architecture: An LSTM network designed to capture temporal dependencies and patterns in the sequential data.

>•	Advantage: LSTMs are well-suited for time-series data and can learn complex temporal dynamics without the need for hand-engineered features like acceleration.

    2.	Model 2: 1D Convolutional Neural Network (1D-CNN)
	
>•	Input Data: The same raw GPS sequences as in Model 1.

>•	Architecture: A 1D-CNN that applies convolutional filters across the time dimension to detect local patterns.

>•	Advantage: Stacked 1D convolutions build a hierarchy of local temporal patterns, so the model can pick up short-lived events such as sudden changes in speed or stride frequency.

    3.	Model 3: Transformer-based Model
	
>•	Input Data: Raw GPS sequences and possibly sectionals data.

>•	Architecture: A Transformer model that uses self-attention mechanisms to weigh the importance of different parts of the sequence.

>•	Advantage: Transformers can model long-range dependencies and focus on the most relevant parts of the sequence for prediction.
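A minimal Keras sketch of the three candidates described above may make the comparison more concrete. The builder names (`build_lstm`, `build_cnn`, `build_transformer`), the layer sizes, and the single sigmoid output (which assumes a binary label) are illustrative assumptions, not settings taken from this notebook. The Transformer variant is deliberately reduced to one encoder block and omits positional encoding.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_lstm(n_features: int) -> tf.keras.Model:
    """Model 1 sketch: recurrent encoder over the raw GPS sequence."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, n_features)),  # variable-length sequences
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])


def build_cnn(n_features: int) -> tf.keras.Model:
    """Model 2 sketch: convolutional filters slide along the time axis."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, n_features)),
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.GlobalMaxPooling1D(),  # collapse the time axis
        layers.Dense(1, activation="sigmoid"),
    ])


def build_transformer(n_features: int, d_model: int = 32, num_heads: int = 4) -> tf.keras.Model:
    """Model 3 sketch: one self-attention encoder block (no positional encoding)."""
    inputs = tf.keras.Input(shape=(None, n_features))
    x = layers.Dense(d_model)(inputs)  # project features to the model width
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))
    ff = layers.Dense(d_model, activation="relu")(x)
    x = layers.LayerNormalization()(layers.Add()([x, ff]))
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```

All three builders accept input of shape `(batch, timesteps, n_features)` and return one probability per sequence, so their outputs can be combined directly in an ensemble.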

## Additional Models (Optional):

    4.	Model 4: Gated Recurrent Unit (GRU) Network

>•	A GRU uses two gates (update and reset) instead of the LSTM's three and has no separate cell state, so it has fewer parameters, trains faster, and can match or outperform an LSTM on some datasets.

    5.	Model 5: Temporal Convolutional Network (TCN)

>•	TCNs are designed for sequential data and can capture long-term dependencies using dilated causal convolutions and residual connections.
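To illustrate what "causal" means for Model 5, here is a small pure-Python sketch of a dilated causal convolution and a residual block on a single-channel sequence. The function names and the toy kernel are hypothetical; a real TCN would use learned multi-channel filters (for example `Conv1D(padding="causal", dilation_rate=...)` in Keras).

```python
def causal_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0.

    The output at time t depends only on the current and earlier inputs.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the sequence start contribute zero
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x, kernel, dilation=1):
    """Add the input back to the ReLU-activated convolution output."""
    conv = causal_conv1d(x, kernel, dilation)
    return [xi + max(0.0, ci) for xi, ci in zip(x, conv)]


speed = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(speed, [1.0, 1.0], dilation=1))  # [1.0, 3.0, 5.0, 7.0]
print(causal_conv1d(speed, [1.0, 1.0], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
```

Increasing the dilation widens the span of past time steps each output can see without adding weights, which is how stacked layers reach long-range context.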


In [19]:
!conda list

# packages in environment at /home/exx/anaconda3/envs/mamba_env/envs/tf_310:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
anyio                     4.6.2.post1        pyhd8ed1ab_0    conda-forge
argon2-cffi               23.1.0             pyhd8ed1ab_0    conda-forge
argon2-cffi-bindings      21.2.0          py310ha75aee5_5    conda-forge
arrow                     1.3.0              pyhd8ed1ab_0    conda-forge
asttokens                 2.4.1              pyhd8ed1ab_0    conda-forge
async-lru                 2.0.4              pyhd8ed1ab_0    conda-forge
attrs                     24.2.0             pyh71513ae_0    conda-forge
aws-c-auth                0.8.0                h56a2c13_4    conda-forge
aws-c-cal                 0.8.0                hd3f4568_0    conda-forge
aws-c-common              0.9.31     

In [9]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
import tensorflow as tf

In [10]:
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]


In [11]:
physical_devices = tf.config.list_physical_devices('GPU')
print("Num GPUs Available:", len(physical_devices))

Num GPUs Available: 2


In [12]:
for device in physical_devices:
    print(device)

PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')


In [13]:
!echo $JAVA_HOME
!java --version

/usr/lib/jvm/java-11-openjdk
openjdk 11.0.25 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.25.0.9-1) (build 11.0.25+9-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.25.0.9-1) (build 11.0.25+9-LTS, mixed mode, sharing)


## Load Parquet Train, Test, and Validation (VAL) Data:

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet

/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet

In [25]:
import pyarrow.parquet as pq
import pandas as pd

In [26]:
import pandas as pd
import tensorflow as tf
#from tensorflow.keras.models import Sequential
#from tensorflow.keras.layers import LSTM, Dense
#from sklearn.model_selection import train_test_split

In [34]:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Load the data from Parquet files using PyArrow
train_table = pq.read_table('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet')
test_table = pq.read_table('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet')
validation_table = pq.read_table('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet')

# Inspect the schema of the PyArrow table
print(train_table.schema)

# Convert each column individually and handle complex data types
def convert_table_to_dataframe(table):
    columns = {}
    for column_name in table.column_names:
        column = table[column_name]
        try:
            columns[column_name] = column.to_pandas()
        except TypeError as e:
            print(f"Error converting column {column_name}: {e}")
            # Fall back to plain Python objects for nested (struct/list) columns.
            # A pyarrow DataType never equals the bare strings 'struct' or 'list',
            # so test the type with the pyarrow.types predicates instead.
            if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
                columns[column_name] = column.to_pylist()
            else:
                raise
    return pd.DataFrame(columns)

train_df = convert_table_to_dataframe(train_table)
test_df = convert_table_to_dataframe(test_table)
validation_df = convert_table_to_dataframe(validation_table)

# Check the schema of the DataFrames
print(train_df.head())
print(test_df.head())
print(validation_df.head())

race_date: date32[day]
race_number: int32
gate_index: int32
horse_id: int32
label: int32 not null
course_cd_ohe: struct<type: int8 not null, size: int32, indices: list<element: int32 not null>, values: list<element: double not null>>
  child 0, type: int8 not null
  child 1, size: int32
  child 2, indices: list<element: int32 not null>
      child 0, element: int32 not null
  child 3, values: list<element: double not null>
      child 0, element: double not null
equip_ohe: struct<type: int8 not null, size: int32, indices: list<element: int32 not null>, values: list<element: double not null>>
  child 0, type: int8 not null
  child 1, size: int32
  child 2, indices: list<element: int32 not null>
      child 0, element: int32 not null
  child 3, values: list<element: double not null>
      child 0, element: double not null
surface_ohe: struct<type: int8 not null, size: int32, indices: list<element: int32 not null>, values: list<element: double not null>>
  child 0, type: int8 not null
  c

TypeError: Cannot convert numpy.ndarray to numpy.ndarray
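The `*_ohe` columns in the schema printed above are structs with `type`, `size`, `indices`, and `values` fields. That layout matches how Spark ML serializes vectors, where `type` 0 marks a sparse vector and `type` 1 a dense one. Assuming these columns were indeed written by a Spark one-hot encoder (the notebook does not say so), a row obtained via `to_pylist()` could be expanded into a dense list roughly as follows. The helper name `vector_struct_to_dense` is hypothetical.

```python
def vector_struct_to_dense(vec):
    """Expand one Spark-style vector struct (a dict) into a dense list of floats.

    Assumes the Spark ML convention: type == 0 is sparse (size, indices, values),
    type == 1 is dense (values only).
    """
    if vec["type"] == 1:
        return [float(v) for v in vec["values"]]
    dense = [0.0] * vec["size"]
    for index, value in zip(vec["indices"], vec["values"]):
        dense[index] = float(value)
    return dense


# Example: a one-hot vector of length 4 with the third category set
print(vector_struct_to_dense({"type": 0, "size": 4, "indices": [2], "values": [1.0]}))
# [0.0, 0.0, 1.0, 0.0]
```

Applied per row, this yields fixed-width numeric features that can be concatenated with the other model inputs.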

In [27]:
# Load the data from Parquet files
train_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/train_sequences.parquet')
test_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/test_sequences.parquet')
validation_df = pd.read_parquet('/home/exx/myCode/horse-racing/FoxRiverAIRacing/data/parquet/val_sequences.parquet')

# Preprocess the data
# Assuming 'features' holds one sequence per row, stored as a nested list of shape
# (timesteps, n_features), and 'label' is the binary target column
import numpy as np

def to_sequences(df):
    # Nested Parquet lists load as object arrays of arrays; stack each into a 2-D float32 array
    return [np.stack(seq).astype('float32') for seq in df['features']]

X_train, y_train = to_sequences(train_df), train_df['label'].tolist()
X_test, y_test = to_sequences(test_df), test_df['label'].tolist()
X_val, y_val = to_sequences(validation_df), validation_df['label'].tolist()
n_features = X_train[0].shape[1]  # width of one time step

# Convert the data into TensorFlow datasets
def create_dataset(X, y):
    # Sequences may differ in length; to_tensor() pads the shorter ones with zeros
    X = tf.ragged.constant(X).to_tensor()
    y = tf.constant(y)
    return tf.data.Dataset.from_tensor_slices((X, y))

train_dataset = create_dataset(X_train, y_train).batch(32)
test_dataset = create_dataset(X_test, y_test).batch(32)
val_dataset = create_dataset(X_val, y_val).batch(32)

# Define the LSTM model
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(None, n_features)),  # variable-length sequences
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')  # Adjust the output layer based on your problem (e.g., regression or classification)
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_dataset, validation_data=val_dataset, epochs=10)

# Evaluate the model
loss, accuracy = model.evaluate(test_dataset)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

TypeError: Cannot convert numpy.ndarray to numpy.ndarray
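The same `TypeError` appears with both the PyArrow and the pandas loading paths. One possible cause, and this is an assumption rather than something the traceback confirms, is that pandas or PyArrow was built against a different NumPy than the one active in this environment. A quick first check is to print the installed versions side by side. The helper name `installed_versions` is hypothetical; it uses only the standard library.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "pandas", "pyarrow", "tensorflow")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:12s} {version or 'not installed'}")
```

If the versions look mismatched, restarting the kernel after reinstalling the affected packages from a single channel is a reasonable next step to try.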