# Ensemble Model Preparation

A lot of work has gone into compiling the current dataset. I have merged the gps_df, sectionals_df and results_df. I have limited the amount of Equibase data I am using just to keep the focus on the TPD GPS data, and to do some feature engineering.  However, there are some good metrics from the Equibase data that are just basic measures that could be obtained from any racebook sheet. 

## Get Started

1. Going to load the parquet DataFrame from disk and do some imputation, one-hot encoding, string indexing, and scaling. The run it through XBBoost to see how it's looking. At this point I will do the integration of route data, and add the GPS aggregations. I just want to see what I can minimally do and how its working before I go down the wrong path. If the XGBoost doesn't do any better than the LSTM, at least I won't have wasted any more time on it. 

### Model Additional Requirements

#### Logistic Regression:
> Ensure features are scaled (e.g., StandardScaler) and that categorical variables are one-hot encoded.

#### Random Forest	
> Scaling is unnecessary, and categorical variables should be one-hot encoded.

#### XGBoost/LightGBM	
> **Scaling is unnecessary, and categorical variables should be one-hot encoded.**

#### Support Vector Machines (SVM)	
>Requires scaling and one-hot encoding.

#### k-Nearest Neighbors	
>Requires scaling and one-hot encoding.

#### Multi-Layer Perceptron (MLP)
> Requires scaling and one-hot encoding.

#### CatBoost	
> No need for one-hot encoding; you can specify categorical columns directly using CatBoost’s cat_features parameter.


### Load master_results_df.parquet file

In [17]:
# Setup Environment

import os
import logging
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import set_config

import pyspark.sql.functions as F
import xgboost as xgb
from sklearn import set_config
from pyspark.sql.functions import (col, count, row_number, abs, unix_timestamp, mean, 
                                   when, lit, min as F_min, max as F_max , 
                                   row_number, mean as F_mean, countDistinct, last, first, when)
import configparser
from pyspark.sql import SparkSession
from src.data_preprocessing.data_prep1.sql_queries import sql_queries
from pyspark.sql.window import Window
from pyspark.sql import DataFrame, Window
from src.data_preprocessing.data_prep1.data_utils import (save_parquet, gather_statistics, 
                initialize_environment, load_config, initialize_logging, initialize_spark, 
                identify_and_impute_outliers, 
                identify_and_remove_outliers, identify_missing_and_outliers)
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

# Set global references to None
spark = None
master_results_df = None

In [3]:

spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()


2024-12-31 14:19:04,392 - INFO - Environment setup initialized.


Spark session created successfully.


In [4]:
df = spark.read.parquet(os.path.join(parquet_dir, "results_only_clean.parquet"))


In [5]:
df.count()

777100

# Switching to Pandas

In [6]:
# Convert Spark DataFrame -> Pandas DataFrame

rf_df = df.toPandas()
df = None
# Quick info about the DataFrame
#print(df.info())
#print(df.head(5))

                                                                                

In [7]:
# Suppose the finish position is in a column named 'official_fin'
# Create a binary label: 1 = first place, 0 = others
#df["label"] = (df["official_fin"] == 1).astype(int)

# Check distribution of the label
#print(df["label"].value_counts())

# Define a function to map official_fin to label
def map_official_fin_to_label(official_fin):
    if official_fin == 1:
        return 0  # Win
    elif official_fin == 2:
        return 1  # Place
    elif official_fin == 3:
        return 2  # Show
    elif official_fin == 4:
        return 3  # Fourth
    elif official_fin == 5:
        return 4  # Fifth
    elif official_fin == 6:
        return 5  # Sixth
    elif official_fin == 7:
        return 6  # Seventh
    else:
        return 7  # Outside top-7

# Apply the function to create the label column
rf_df['label'] = rf_df['official_fin'].apply(map_official_fin_to_label)

# Check the DataFrame
#print(df)

In [8]:
# 4a) Identify columns with high missingness
missing_summary = rf_df.isna().sum().sort_values(ascending=False)
#print("Missing Value Summary:\n", missing_summary)

In [9]:
# print("Descriptive Stats:\n", df.describe())


In [10]:
# # Quick correlation matrix
# corr_matrix = df.corr(numeric_only=True)
# print("Correlation Matrix:\n", corr_matrix["label"].sort_values(ascending=False))

# # Possibly visualize
# import seaborn as sns
# import matplotlib.pyplot as plt

# plt.figure(figsize=(8, 6))
# sns.heatmap(corr_matrix, cmap="coolwarm")
# plt.title("Correlation Heatmap")
# plt.show()

In [11]:
# # Histograms for numeric columns
# df.hist(bins=30, figsize=(15,10))
# plt.tight_layout()
# plt.show()

# OHE

> Required for Random Forest

In [12]:
# 6b) Convert categorical columns to numeric dummies
categorical_cols = ["course_cd", 
                    "equip", 
                    "surface", 
                    "trk_cond", 
                    "weather", 
                    "med", 
                    "stk_clm_md", 
                    "turf_mud_mark", 
                    "race_type"]

# Perform one-hot encoding on the categorical columns
df_encoded = pd.get_dummies(rf_df, columns=categorical_cols, drop_first=True)

In [13]:
metadata = rf_df[["race_date", "race_number", "horse_id"]]  # Replace with your metadata columns

feature_cols = ["morn_odds", "net_sentiment", "power", "avg_spd_sd", "hi_spd_sd",
                "avgspd", "ave_cl_sd", "cond_win", "cond_place", "all_win", "all_place",
                "cond_earnings", "all_earnings", "weight", "cond_show", "all_show", 
                "cond_starts", "all_starts", "class_rating", "age_at_race_day", "distance",
                "horse_id", "claimprice", "wps_pool", "cond_fourth", "all_fourth", 
                "purse", "pstyerl", "race_number", "start_position"]

# Separate the label from the features
y = df_encoded["label"].values

X = rf_df[feature_cols].copy()

# Combine the encoded features with the numeric features
X = df_encoded[feature_cols].copy()

# Separate the date columns
# date_cols = ['race_date']
# X = X.drop(columns=date_cols)

# Train-test split
X_train, X_test, y_train, y_test, metadata_train, metadata_test = train_test_split(
    X, y, metadata, test_size=0.20, random_state=42, stratify=y
)


# Scaling

>Unnecessary for Random Forest

In [14]:
# Optional: Scale numeric features
# # Not necessary for Random Forest
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# Check the scaled features
# print(X_train_scaled)
# print(X_test_scaled)

# XGBoost

In [19]:
set_config(display="text")  # Switch to text-based display

xgb_model = xgb.XGBClassifier(
    objective="multi:softmax",  # Multi-class classification
    num_class=8,               # Number of classes (0 to 7)
    max_depth=6,               # Tree depth
    learning_rate=0.1,         # Learning rate
    n_estimators=100,          # Number of trees
    eval_metric="mlogloss",    # Log loss for multi-class
    early_stopping_rounds=10   # Specify early stopping rounds here
)

In [21]:
# Train the model
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=True  # Use verbose for training progress
)


[0]	validation_0-mlogloss:2.06199	validation_1-mlogloss:2.06233
[1]	validation_0-mlogloss:2.04695	validation_1-mlogloss:2.04759
[2]	validation_0-mlogloss:2.03382	validation_1-mlogloss:2.03479
[3]	validation_0-mlogloss:2.02233	validation_1-mlogloss:2.02360
[4]	validation_0-mlogloss:2.01206	validation_1-mlogloss:2.01363
[5]	validation_0-mlogloss:2.00295	validation_1-mlogloss:2.00482
[6]	validation_0-mlogloss:1.99479	validation_1-mlogloss:1.99699
[7]	validation_0-mlogloss:1.98747	validation_1-mlogloss:1.98996
[8]	validation_0-mlogloss:1.98084	validation_1-mlogloss:1.98360
[9]	validation_0-mlogloss:1.97486	validation_1-mlogloss:1.97792
[10]	validation_0-mlogloss:1.96942	validation_1-mlogloss:1.97277
[11]	validation_0-mlogloss:1.96444	validation_1-mlogloss:1.96807
[12]	validation_0-mlogloss:1.95992	validation_1-mlogloss:1.96384
[13]	validation_0-mlogloss:1.95579	validation_1-mlogloss:1.95998
[14]	validation_0-mlogloss:1.95200	validation_1-mlogloss:1.95647
[15]	validation_0-mlogloss:1.94855	

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=10,
              enable_categorical=False, eval_metric='mlogloss',
              feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=None,
              num_class=8, num_parallel_tree=None, objective='multi:softmax', ...)

In [None]:
# from sklearn.utils.class_weight import compute_class_weight

# # Assuming y_train contains the labels
# class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
# class_weights_dict = {i: class_weights[i] for i in range(len(class_weights))}

# Use 'balanced' which automatically computes the class weights based on training data.

In [22]:
# Predict on test set
y_pred = xgb_model.predict(X_test)

# Evaluate
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[10995  3068  2111  1205  1207   547     8  2014]
 [ 8152  3238  2674  1656  1836   836    16  2720]
 [ 6184  2977  2877  2054  2455  1184    21  3353]
 [ 4604  2544  2670  2280  3121  1688    28  4086]
 [ 3433  2061  2374  2265  3347  1966    44  4933]
 [ 2353  1499  1779  1730  2906  2162    45  5525]
 [ 1422   897  1034   975  1745  1621    50  5669]
 [ 1345   831   972   780  1409  1474    29 12336]]

Classification Report:
              precision    recall  f1-score   support

           0       0.29      0.52      0.37     21155
           1       0.19      0.15      0.17     21128
           2       0.17      0.14      0.15     21105
           3       0.18      0.11      0.13     21021
           4       0.19      0.16      0.17     20423
           5       0.19      0.12      0.15     17999
           6       0.21      0.00      0.01     13413
           7       0.30      0.64      0.41     19176

    accuracy                           0.24    155420
   macr

In [23]:
importances = xgb_model.feature_importances_
for col, imp in zip(feature_cols, importances):
    print(f"{col}: {imp:.4f}")

morn_odds: 0.4517
net_sentiment: 0.0453
power: 0.0157
avg_spd_sd: 0.0172
hi_spd_sd: 0.0184
avgspd: 0.0175
ave_cl_sd: 0.0118
cond_win: 0.0154
cond_place: 0.0095
all_win: 0.0197
all_place: 0.0129
cond_earnings: 0.0124
all_earnings: 0.0170
weight: 0.0101
cond_show: 0.0100
all_show: 0.0153
cond_starts: 0.0121
all_starts: 0.0161
class_rating: 0.0171
age_at_race_day: 0.0117
distance: 0.0117
horse_id: 0.0104
claimprice: 0.0133
wps_pool: 0.0274
cond_fourth: 0.0115
all_fourth: 0.0141
purse: 0.0180
pstyerl: 0.0195
race_number: 0.0447
start_position: 0.0726


# Predicting Probabilities for Ranking

In [21]:
# Step 1: Predict Probabilities
predicted_probs = rf_model.predict_proba(X_test_scaled)[:, 1]  # Probability for winning



In [22]:
# Step 2: Combine Metadata with Predictions
ranked_df = metadata_test.copy()
ranked_df["predicted_probability"] = predicted_probs

# Step 3: Rank Horses
ranked_df["rank"] = (
    ranked_df.groupby(["race_date", "race_number"])["predicted_probability"]
    .rank(method="first", ascending=False)
)

# Step 4: Sort Ranked DataFrame
ranked_df = ranked_df.sort_values(by=["race_date", "race_number", "rank"])

# Step 5: Save or Display Results
print(ranked_df.head(20))  # Display top 20 ranked horses
ranked_df.to_csv("ranked_horses.csv", index=False)  # Save to CSV if needed

         race_date  race_number  horse_id  predicted_probability  rank
443968  2022-01-01            1    257862                  0.190   1.0
403657  2022-01-01            1     96356                  0.189   2.0
557148  2022-01-01            1     14941                  0.188   3.0
143143  2022-01-01            1    294126                  0.187   4.0
732523  2022-01-01            1    113458                  0.148   5.0
241696  2022-01-01            1      9903                  0.143   6.0
732524  2022-01-01            1     35560                  0.085   7.0
732528  2022-01-01            1    181319                  0.084   8.0
403659  2022-01-01            1     96358                  0.084   9.0
200286  2022-01-01            1    269764                  0.084  10.0
443966  2022-01-01            1     10298                  0.078  11.0
200285  2022-01-01            1     62596                  0.078  12.0
424227  2022-01-01            1    163465                  0.077  13.0
443969

In [23]:
import pandas as pd
import numpy as np

# Assuming X_test corresponds to the feature matrix for your test set
# And you have additional columns `race_date`, `race_number`, `horse_id` in the original data.

# Step 1: Predict Probabilities
proba = rf_model.predict_proba(X_test_scaled)  # Shape: (n_samples, 2)
predicted_probs = proba[:, 1]  # Probability for class 1 (winning)

# Step 2: Create a DataFrame for `X_test` with metadata
# Replace with actual metadata (e.g., from original test data before splitting)
metadata = pd.DataFrame({
    "race_date": race_date,  # Replace with actual race dates
    "race_number": race_number,  # Replace with actual race numbers
    "horse_id": horse_id,  # Replace with actual horse IDs
})

# Combine predicted probabilities with metadata
ranked_df = metadata.copy()
ranked_df["predicted_probability"] = predicted_probs

# Step 3: Rank Horses
# Group by race_date and race_number, then rank by predicted_probability
ranked_df["rank"] = (
    ranked_df.groupby(["race_date", "race_number"])["predicted_probability"]
    .rank(method="first", ascending=False)
)

# Step 4: Sort Ranked DataFrame
ranked_df = ranked_df.sort_values(by=["race_date", "race_number", "rank"])

# Step 5: Save or Display Results
print(ranked_df.head(20))  # Display top 20 ranked horses
ranked_df.to_csv("ranked_horses.csv", index=False)  # Save to CSV if needed



NameError: name 'race_date' is not defined

In [None]:
ranked_df = metadata_test.copy()  # Metadata includes race_date, race_number, horse_id
ranked_df["predicted_probability"] = predicted_probs
ranked_df["actual_label"] = y_test  # Optional: for evaluation purposes

## Predict via Ranking using Test Race from Dataset

You can test the model on a race in your dataset by excluding the target variable (official_fin) and making predictions as if it were a new race. Here’s how you can proceed:

Step 1: Select a Race for Testing

Extract a specific race from your dataset based on race_date and race_number.

race_to_test = df[
    (df["race_date"] == "2023-05-15") & (df["race_number"] == 5)
].drop(columns=["official_fin"])  # Drop the target variable

Step 2: Prepare the Features

Ensure that the extracted race has the same feature processing (scaling, encoding) as was done during training.

# Extract features
X_race = race_to_test[feature_cols]  # Ensure `feature_cols` matches the training feature set

# Scale features
X_race_scaled = scaler.transform(X_race)  # Use the scaler fitted during training

Step 3: Predict Probabilities

Use the model to predict probabilities for this specific race.

# Predict probabilities
race_probs = rf_model.predict_proba(X_race_scaled)

# Attach probabilities back to the metadata
race_to_test["predicted_probability"] = race_probs[:, 1]  # Assuming class 1 is 'winning'

# Rank horses by predicted probability
race_to_test["rank"] = race_to_test["predicted_probability"].rank(ascending=False)
race_to_test = race_to_test.sort_values(by="rank")

Step 4: Compare to Actual Results

If you still have the actual official_fin values in a backup, compare the model’s ranking against the real results.

# Add actual finish positions for comparison (if available)
actual_results = df[
    (df["race_date"] == "2023-05-15") & (df["race_number"] == 5)
][["horse_id", "official_fin"]]

race_to_test = race_to_test.merge(actual_results, on="horse_id", how="left")

# Display results
print(race_to_test[["horse_id", "predicted_probability", "rank", "official_fin"]])

Testing Without horse_id, race_date, or race_number

If you remove horse_id, race_date, or race_number, the model should still be able to make predictions if those columns are not part of the feature set. However, you won’t be able to group or rank the predictions by race because race_date and race_number are critical for distinguishing horses in the same race.

To simulate a future race:
	1.	Select a race that the model hasn’t seen during training (e.g., from a holdout set).
	2.	Remove any identifying metadata (e.g., horse_id, race_date, race_number).
	3.	Process the features the same way as during training.
	4.	Use the model to predict probabilities for the horses in that race.
	5.	Rank the predictions by probability.

Testing with a Future Race

For a future race:
	1.	Collect horse features for that race (e.g., from Equibase).
	2.	Process the data the same way as the training data.
	3.	Use the model to make predictions.
	4.	Rank horses by predicted probabilities.
	5.	Wait for the race results and compare the model’s rankings to the actual outcomes.

This approach ensures the model is evaluated on unseen data, simulating a real-world scenario.

## Hyperparameter

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "max_features": ["auto", "sqrt", 0.5],
    "class_weight": [None, "balanced"]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",    # or "f1", "balanced_accuracy", etc.
    cv=3,                  # 3-fold cross-validation
    n_jobs=-1             # use all CPU cores
)

grid_search.fit(X_train, y_train)
print("Best Params:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Then evaluate on the test set:
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
print("Final Test Accuracy:", (y_pred_best == y_test).mean())