# Logistic Regression Data Preparation

A lot of work has gone into compiling the current dataset. I have merged gps_df, sectionals_df, and results_df. I have limited the amount of Equibase data I am using, to keep the focus on the TPD GPS data and leave room for feature engineering. However, the Equibase data does contain some good basic metrics that could be obtained from any racebook sheet.

## Get Started

1. Load the parquet DataFrame from disk and do some imputation, one-hot encoding, string indexing, and scaling. Then run it through XGBoost to see how it's looking. At that point I will integrate the route data and add the GPS aggregations. I just want to see what I can do minimally, and how it's working, before I go down the wrong path. If XGBoost doesn't do any better than the LSTM, at least I won't have wasted any more time on it.
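
The imputation, one-hot encoding, and scaling steps described above could be bundled into a single scikit-learn preprocessing object. The following is only a sketch on a hypothetical toy frame: the column names echo ones used later in this notebook, but the values, the median/most-frequent imputation strategies, and the `preprocess` name are assumptions, not the processing actually applied here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "morn_odds": [3.5, np.nan, 8.0, 12.0],
    "weight": [120.0, 118.0, np.nan, 122.0],
    "surface": ["D", "T", "D", np.nan],
})

numeric_cols = ["morn_odds", "weight"]
categorical_cols = ["surface"]

# Numeric columns: fill gaps with the median, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0,
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)
```

Fitting the transformer on the training split only, and reusing it on the test split, keeps test-set statistics out of the imputation and scaling.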

### Load results_only_clean.parquet file

In [1]:
# Setup Environment

import os
import logging
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pyspark.sql.functions as F
import xgboost as xgb
from sklearn import set_config
from pyspark.sql.functions import (col, count, row_number, abs, unix_timestamp, mean,
                                   when, lit, min as F_min, max as F_max,
                                   mean as F_mean, countDistinct, last, first)
import configparser
from pyspark.sql import SparkSession
from src.data_preprocessing.data_prep1.sql_queries import sql_queries
from pyspark.sql import DataFrame, Window
from src.data_preprocessing.data_prep1.data_utils import (save_parquet, gather_statistics, 
                initialize_environment, load_config, initialize_logging, initialize_spark, 
                identify_and_impute_outliers, 
                identify_and_remove_outliers, identify_missing_and_outliers)
from pyspark.ml.functions import vector_to_array

# Set global references to None
spark = None
master_results_df = None

In [3]:

spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()


2024-12-31 11:41:38,296 - INFO - Environment setup initialized.


Spark session created successfully.


In [4]:
df = spark.read.parquet(os.path.join(parquet_dir, "results_only_clean.parquet"))


In [5]:
df.count()

777100

# Switching to Pandas

In [6]:
# Convert Spark DataFrame -> Pandas DataFrame

rf_df = df.toPandas()
df = None
# Quick info about the DataFrame
#print(df.info())
#print(df.head(5))

                                                                                

In [7]:
# Suppose the finish position is in a column named 'official_fin'
# Create a binary label: 1 = first place, 0 = others
#df["label"] = (df["official_fin"] == 1).astype(int)

# Check distribution of the label
#print(df["label"].value_counts())

# Define a function to map official_fin to label
def map_official_fin_to_label(official_fin):
    """Map finish positions 1-7 to labels 0-6 (0 = win); anything else maps to 7."""
    if official_fin in (1, 2, 3, 4, 5, 6, 7):
        return int(official_fin) - 1
    return 7  # Outside top-7

# Apply the function to create the label column
rf_df['label'] = rf_df['official_fin'].apply(map_official_fin_to_label)

# Check the DataFrame
#print(df)

In [8]:
# 4a) Identify columns with high missingness
missing_summary = rf_df.isna().sum().sort_values(ascending=False)
#print("Missing Value Summary:\n", missing_summary)
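
A per-column missing fraction is often easier to act on than raw counts. The sketch below uses a hypothetical toy frame and an assumed 50% cutoff (`MAX_MISSING`); neither comes from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "power": [80.0, np.nan, 75.0, 90.0],
    "claimprice": [np.nan, np.nan, np.nan, 5000.0],
    "weight": [120.0, 118.0, 119.0, 122.0],
})

# Fraction of missing values per column, highest first
missing_frac = toy.isna().mean().sort_values(ascending=False)

MAX_MISSING = 0.5  # assumed cutoff; tune for the real data
too_sparse = missing_frac[missing_frac > MAX_MISSING].index.tolist()
print(too_sparse)
```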

In [9]:
# print("Descriptive Stats:\n", df.describe())


In [10]:
# # Quick correlation matrix
# corr_matrix = df.corr(numeric_only=True)
# print("Correlation Matrix:\n", corr_matrix["label"].sort_values(ascending=False))

# # Possibly visualize
# import seaborn as sns
# import matplotlib.pyplot as plt

# plt.figure(figsize=(8, 6))
# sns.heatmap(corr_matrix, cmap="coolwarm")
# plt.title("Correlation Heatmap")
# plt.show()

In [11]:
# # Histograms for numeric columns
# df.hist(bins=30, figsize=(15,10))
# plt.tight_layout()
# plt.show()

In [14]:
# Identifier columns, split alongside X and y so predictions can be tied back to each horse
metadata = rf_df[["race_date", "race_number", "horse_id"]]

feature_cols = ["morn_odds", "net_sentiment", "power", "avg_spd_sd", "hi_spd_sd",
                "avgspd", "ave_cl_sd", "cond_win", "cond_place", "all_win", "all_place",
                "cond_earnings", "all_earnings", "weight", "cond_show", "all_show", 
                "cond_starts", "all_starts", "class_rating", "age_at_race_day", "distance",
                "horse_id", "claimprice", "wps_pool", "cond_fourth", "all_fourth", 
                "purse", "pstyerl", "race_number", "start_position"]

# 6b) Convert categorical columns to numeric dummies
categorical_cols = ["course_cd", 
                    "equip", 
                    "surface", 
                    "trk_cond", 
                    "weather", 
                    "med", 
                    "stk_clm_md", 
                    "turf_mud_mark", 
                    "race_type"]

# Perform one-hot encoding on the categorical columns
df_encoded = pd.get_dummies(rf_df, columns=categorical_cols, drop_first=True)

# Separate the label from the features
y = df_encoded["label"].values

# Select the model features. feature_cols lists only columns that existed before
# encoding, so the one-hot dummy columns created above are not part of X.
X = df_encoded[feature_cols].copy()

# Separate the date columns
# date_cols = ['race_date']
# X = X.drop(columns=date_cols)

# Train-test split
X_train, X_test, y_train, y_test, metadata_train, metadata_test = train_test_split(
    X, y, metadata, test_size=0.20, random_state=42, stratify=y
)
# Optional: Scale numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Check the scaled features
print(X_train_scaled)
print(X_test_scaled)

[[ 2.2897043   0.53291582  0.02192181 ... -0.55929365 -0.81725626
  -1.38242958]
 [ 0.93588108  0.67407432 -1.47097668 ... -1.08219468 -0.44065956
  -1.01882963]
 [-0.73036289 -1.16098625 -0.0087242  ...  0.56695473  0.68913051
   0.79917007]
 ...
 [ 0.62346033  0.39175731 -0.3195623  ... -0.51907049  1.0657272
  -0.29162975]
 [-1.14692388 -0.17287671 -0.07877222 ... -1.68554203  1.44232389
   0.07197019]
 [ 1.76900306  0.67407432  0.25395589 ...  1.04963261 -1.57044964
  -1.01882963]]
[[-0.52208239 -1.58446177  1.2214942  ...  0.28539264  1.0657272
  -0.65522969]
 [-1.14692388  0.53291582 -0.09628423 ... -1.68554203 -0.06406287
  -0.65522969]
 [-0.20966165  0.53291582  0.24519988 ...  0.28539264 -0.06406287
   0.79917007]
 ...
 [-1.04278363 -0.45519372  0.84498608 ...  0.48650842  0.68913051
   0.79917007]
 [ 0.41517984 -1.86677878  0.66548802 ...  0.80829367  1.0657272
   1.88996989]
 [-1.04278363 -0.17287671 -1.52351269 ...  1.49208733  0.68913051
   0.79917007]]
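
The cell above builds df_encoded but then selects only feature_cols, so none of the one-hot columns reach the model. If the intent is to include them, one possible approach is to collect the column names that `pd.get_dummies` added and append them to the feature list. This is a sketch on a hypothetical toy frame; `numeric_features`, `dummy_cols`, and `X_with_dummies` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "morn_odds": [3.5, 5.0, 8.0],
    "surface": ["D", "T", "D"],
    "trk_cond": ["FT", "FM", "SY"],
})
numeric_features = ["morn_odds"]
categorical = ["surface", "trk_cond"]

encoded = pd.get_dummies(toy, columns=categorical, drop_first=True)

# Columns present after encoding but not before are the new dummy columns
dummy_cols = [c for c in encoded.columns if c not in toy.columns]

# Feature matrix with both the numeric features and the dummies
X_with_dummies = encoded[numeric_features + dummy_cols]
print(list(X_with_dummies.columns))
```

If the feature matrix changes this way, any later code that pairs feature names with importances should use the new column list rather than feature_cols.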


# Random Forest

In [15]:
# Get the distribution of the labels
label_distribution = rf_df['label'].value_counts()
print(label_distribution)

label
0    105776
1    105638
2    105525
3    105107
4    102116
7     95879
5     89994
6     67065
Name: count, dtype: int64


In [None]:
# from sklearn.utils.class_weight import compute_class_weight

# # Assuming y_train contains the labels
# class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
# class_weights_dict = {i: class_weights[i] for i in range(len(class_weights))}

# Use 'balanced' which automatically computes the class weights based on training data.
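
For reference, the "balanced" setting weights each class by n_samples / (n_classes * class_count). The sketch below checks that formula against `compute_class_weight` on a small made-up label vector (`y_toy` is not data from this notebook).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: class 0 appears twice as often as classes 1 and 2
y_toy = np.array([0, 0, 0, 0, 1, 1, 2, 2])
classes = np.unique(y_toy)

# Manual version of the "balanced" heuristic
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# scikit-learn's version
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(dict(zip(classes.tolist(), from_sklearn.round(3).tolist())))
```

Rarer classes get larger weights, so with the label counts shown above, label 6 would be weighted most heavily.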

In [None]:
# Initialize the RF classifier
rf_model = RandomForestClassifier(
    n_estimators=1000,        # number of trees
    max_depth=None,           # grow trees fully; tune to limit depth
    random_state=42,          # for reproducibility
    class_weight="balanced"   # weight classes inversely to their frequency
)

# Fit the model
rf_model.fit(X_train, y_train)

In [17]:
# Predict on test set
y_pred = rf_model.predict(X_test)

# Evaluate
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[ 9924  3457  2273  1768  1313   591    77  1752]
 [ 7427  3426  2669  2351  1906   883   141  2325]
 [ 5670  3180  2522  2873  2499  1229   196  2936]
 [ 4269  2661  2646  2880  3139  1665   250  3511]
 [ 3209  2236  2243  2888  3164  2045   320  4318]
 [ 2205  1621  1690  2307  3040  1865   395  4876]
 [ 1311   957   963  1329  1946  1497   265  5145]
 [ 1258   888   941  1204  1619  1390   330 11546]]

Classification Report:
              precision    recall  f1-score   support

           0       0.28      0.47      0.35     21155
           1       0.19      0.16      0.17     21128
           2       0.16      0.12      0.14     21105
           3       0.16      0.14      0.15     21021
           4       0.17      0.15      0.16     20423
           5       0.17      0.10      0.13     17999
           6       0.13      0.02      0.03     13413
           7       0.32      0.60      0.42     19176

    accuracy                           0.23    155420
   macr
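
The summary metrics can be recomputed directly from the confusion matrix printed above (rows are true labels, columns are predicted labels). The sketch below also computes a "within one label" rate, which is an added illustration rather than a metric reported by this notebook; note that label 7 pools every finish outside the top seven, so a one-label gap there is not exactly one finishing position.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([
    [9924, 3457, 2273, 1768, 1313,  591,  77,  1752],
    [7427, 3426, 2669, 2351, 1906,  883, 141,  2325],
    [5670, 3180, 2522, 2873, 2499, 1229, 196,  2936],
    [4269, 2661, 2646, 2880, 3139, 1665, 250,  3511],
    [3209, 2236, 2243, 2888, 3164, 2045, 320,  4318],
    [2205, 1621, 1690, 2307, 3040, 1865, 395,  4876],
    [1311,  957,  963, 1329, 1946, 1497, 265,  5145],
    [1258,  888,  941, 1204, 1619, 1390, 330, 11546],
])

total = cm.sum()

# Exact-label accuracy: correct predictions sit on the main diagonal
accuracy = np.trace(cm) / total

# Share of predictions at most one label away from the truth
within_one = sum(np.trace(cm, offset=k) for k in (-1, 0, 1)) / total

# Accuracy of always predicting the most common true label in the test set
majority_baseline = cm.sum(axis=1).max() / total

print(f"accuracy={accuracy:.3f} within_one={within_one:.3f} baseline={majority_baseline:.3f}")
```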

In [None]:
importances = rf_model.feature_importances_
# Use `feature` rather than `col` so the pyspark `col` import is not shadowed
for feature, imp in zip(feature_cols, importances):
    print(f"{feature}: {imp:.4f}")
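
Sorting the importances makes the strongest features easier to spot. A minimal sketch on synthetic data, where `signal` and `noise` are made-up features and `toy_model` stands in for rf_model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on "signal"
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_toy = (X_toy["signal"] > 0).astype(int)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_toy, y_toy)

# Pair each importance with its column name and sort, highest first
ranked_importances = pd.Series(
    toy_model.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(ranked_importances)
```

With the real model, the same pattern would use `rf_model.feature_importances_` indexed by `X_train.columns`.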

# Predicting Probabilities for Ranking

In [None]:
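# Step 1: Predict Probabilities
# The model was fit on the unscaled X_train, so predict on the unscaled X_test.
# Label 0 is the win class (see map_official_fin_to_label), so take that column.
win_col = list(rf_model.classes_).index(0)
predicted_probs = rf_model.predict_proba(X_test)[:, win_col]  # Probability of winning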

In [None]:
# Step 2: Combine Metadata with Predictions
ranked_df = metadata_test.copy()
ranked_df["predicted_probability"] = predicted_probs

# Step 3: Rank Horses
ranked_df["rank"] = (
    ranked_df.groupby(["race_date", "race_number"])["predicted_probability"]
    .rank(method="first", ascending=False)
)

# Step 4: Sort Ranked DataFrame
ranked_df = ranked_df.sort_values(by=["race_date", "race_number", "rank"])

# Step 5: Save or Display Results
print(ranked_df.head(20))  # Display top 20 ranked horses
ranked_df.to_csv("ranked_horses.csv", index=False)  # Save to CSV if needed
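
One caveat with ranking inside the test set: train_test_split above splits rows at random, so horses from the same race can land on both sides of the split and a test "race" may contain only part of the field. A possible alternative is to split by race so each race stays whole. The sketch below uses `GroupShuffleSplit` on a hypothetical toy frame, and also rescales win probabilities to sum to 1 within each race; `win_prob` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: four races of three horses each
toy = pd.DataFrame({
    "race_date": ["2023-05-15"] * 6 + ["2023-05-16"] * 6,
    "race_number": [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
    "horse_id": list(range(12)),
    "win_prob": [0.2, 0.1, 0.1, 0.3, 0.3, 0.2, 0.05, 0.1, 0.1, 0.4, 0.1, 0.3],
})

# One group key per race keeps every horse in a race on the same side of the split
race_key = toy["race_date"] + "_" + toy["race_number"].astype(str)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=race_key))

test_rows = toy.iloc[test_idx].copy()

# Rescale so the win probabilities within each race sum to 1
race_totals = test_rows.groupby(["race_date", "race_number"])["win_prob"].transform("sum")
test_rows["win_prob_norm"] = test_rows["win_prob"] / race_totals
print(test_rows)
```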

In [None]:
import pandas as pd
import numpy as np

# X_test is the test feature matrix; metadata_test holds the matching
# race_date, race_number, and horse_id rows returned by train_test_split.

# Step 1: Predict Probabilities
proba = rf_model.predict_proba(X_test)  # Shape: (n_samples, 8), one column per label
predicted_probs = proba[:, list(rf_model.classes_).index(0)]  # Probability for label 0 (win)

# Step 2: Reuse the metadata that was split alongside X_test
metadata = metadata_test[["race_date", "race_number", "horse_id"]]

# Combine predicted probabilities with metadata
ranked_df = metadata.copy()
ranked_df["predicted_probability"] = predicted_probs

# Step 3: Rank Horses
# Group by race_date and race_number, then rank by predicted_probability
ranked_df["rank"] = (
    ranked_df.groupby(["race_date", "race_number"])["predicted_probability"]
    .rank(method="first", ascending=False)
)

# Step 4: Sort Ranked DataFrame
ranked_df = ranked_df.sort_values(by=["race_date", "race_number", "rank"])

# Step 5: Save or Display Results
print(ranked_df.head(20))  # Display top 20 ranked horses
ranked_df.to_csv("ranked_horses.csv", index=False)  # Save to CSV if needed

In [None]:
ranked_df = metadata_test.copy()  # Metadata includes race_date, race_number, horse_id
ranked_df["predicted_probability"] = predicted_probs
ranked_df["actual_label"] = y_test  # Optional: for evaluation purposes

## Predict via Ranking using Test Race from Dataset

You can test the model on a race in your dataset by excluding the target variable (official_fin) and making predictions as if it were a new race. Here’s how you can proceed:

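### Step 1: Select a Race for Testing

Extract a specific race from your dataset based on race_date and race_number.

# rf_df is the pandas DataFrame; df was reset to None after the conversion
race_to_test = rf_df[
    (rf_df["race_date"] == "2023-05-15") & (rf_df["race_number"] == 5)
].drop(columns=["official_fin", "label"])  # Drop the target variable and the label derived from it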

### Step 2: Prepare the Features

Ensure that the extracted race gets the same feature processing as the data the model was fit on. The random forest above was fit on the unscaled X_train, so the features are passed through unscaled; apply scaler.transform only for a model trained on X_train_scaled.

# Extract features
X_race = race_to_test[feature_cols]  # Ensure `feature_cols` matches the training feature set

### Step 3: Predict Probabilities

Use the model to predict probabilities for this specific race.

# Predict probabilities
race_probs = rf_model.predict_proba(X_race)

# Attach probabilities back to the metadata
win_col = list(rf_model.classes_).index(0)  # Label 0 is the win class
race_to_test["predicted_probability"] = race_probs[:, win_col]

# Rank horses by predicted probability
race_to_test["rank"] = race_to_test["predicted_probability"].rank(ascending=False)
race_to_test = race_to_test.sort_values(by="rank")

### Step 4: Compare to Actual Results

Compare the model’s ranking against the real results, which are still available in rf_df.

# Add actual finish positions for comparison
actual_results = rf_df[
    (rf_df["race_date"] == "2023-05-15") & (rf_df["race_number"] == 5)
][["horse_id", "official_fin"]]

race_to_test = race_to_test.merge(actual_results, on="horse_id", how="left")

# Display results
print(race_to_test[["horse_id", "predicted_probability", "rank", "official_fin"]])
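
Once predicted ranks and actual finishes sit side by side, two quick checks are whether the top pick won and how closely the two orderings agree. The sketch below uses a hypothetical five-horse result (`toy_race`) and computes a Spearman-style rank correlation as the Pearson correlation of the ranks, which needs only pandas.

```python
import pandas as pd

# Hypothetical race result (values invented for illustration)
toy_race = pd.DataFrame({
    "horse_id": [11, 12, 13, 14, 15],
    "rank": [1.0, 2.0, 3.0, 4.0, 5.0],   # model's predicted order
    "official_fin": [2, 1, 3, 5, 4],     # actual finish order
})

# Did the horse the model ranked first actually win?
top_pick = toy_race.loc[toy_race["rank"].idxmin()]
top_pick_won = bool(top_pick["official_fin"] == 1)

# Rank correlation: Pearson correlation computed on the ranks of both columns
rank_corr = toy_race["rank"].rank().corr(toy_race["official_fin"].rank())

print(top_pick_won, round(rank_corr, 3))
```

A single race says little on its own; averaging these checks over many held-out races gives a more stable picture.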

### Testing Without horse_id, race_date, or race_number

The model can still make predictions without horse_id, race_date, or race_number, provided those columns are not part of the feature set. In this notebook, feature_cols includes horse_id and race_number, so they would first have to be removed from feature_cols and the model retrained. Without race_date and race_number you also cannot group or rank the predictions by race, because those columns identify which horses ran together.

To simulate a future race:

1. Select a race that the model hasn’t seen during training (e.g., from a holdout set).
2. Remove any identifying metadata (e.g., horse_id, race_date, race_number).
3. Process the features the same way as during training.
4. Use the model to predict probabilities for the horses in that race.
5. Rank the predictions by probability.

### Testing with a Future Race

For a future race:

1. Collect horse features for that race (e.g., from Equibase).
2. Process the data the same way as the training data.
3. Use the model to make predictions.
4. Rank horses by predicted probabilities.
5. Wait for the race results and compare the model’s rankings to the actual outcomes.

This approach ensures the model is evaluated on unseen data, simulating a real-world scenario.
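
The predict-and-rank steps for a single race could be wrapped in one helper. This is a sketch under stated assumptions: `rank_race` is a hypothetical function, label 0 is taken to be the win class as in the label mapping earlier, and the toy model, odds rule, and race below are invented purely to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_race(model, race_df, feature_cols, win_label=0):
    """Return race_df with a win probability and rank per horse (1 = most likely winner)."""
    # Look up the probability column for the win class rather than assuming its position
    win_col = list(model.classes_).index(win_label)
    out = race_df.copy()
    out["predicted_probability"] = model.predict_proba(out[feature_cols])[:, win_col]
    out["rank"] = out["predicted_probability"].rank(method="first", ascending=False)
    return out.sort_values("rank")


# Toy training data: short-priced horses "win" (label 0), a made-up rule
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "morn_odds": rng.uniform(1, 30, 300),
    "weight": rng.uniform(110, 126, 300),
})
labels = (train["morn_odds"] > 5).astype(int)
toy_model = RandomForestClassifier(
    n_estimators=25, max_features=None, random_state=0
).fit(train, labels)

# Hypothetical unseen race
race = pd.DataFrame({
    "horse_id": [1, 2, 3],
    "morn_odds": [2.0, 10.0, 20.0],
    "weight": [120.0, 120.0, 120.0],
})
ranked = rank_race(toy_model, race, ["morn_odds", "weight"])
print(ranked[["horse_id", "predicted_probability", "rank"]])
```

Because the helper takes the feature list as an argument, identifier columns such as horse_id can stay in the frame for reporting without being fed to the model.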

## Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    # "auto" was removed in scikit-learn 1.3; for classifiers it was equivalent to "sqrt"
    "max_features": ["sqrt", 0.5],
    "class_weight": [None, "balanced"]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",    # or "f1_macro", "balanced_accuracy", etc. ("f1" alone is binary-only)
    cv=3,                  # 3-fold cross-validation
    n_jobs=-1             # use all CPU cores
)

grid_search.fit(X_train, y_train)
print("Best Params:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Then evaluate on the test set:
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
print("Final Test Accuracy:", (y_pred_best == y_test).mean())
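
Before launching a grid search on a dataset this size, it can help to count how many model fits it implies. The sketch below uses `ParameterGrid` on an example grid (`param_grid_example`, an illustrative grid rather than necessarily the one above) and assumes 3-fold cross-validation plus the default final refit.

```python
from sklearn.model_selection import ParameterGrid

# Example grid for illustration
param_grid_example = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", 0.5],
    "class_weight": [None, "balanced"],
}

n_candidates = len(ParameterGrid(param_grid_example))  # every combination of settings
n_folds = 3

# Each candidate is fit once per fold; GridSearchCV then refits the best one on all training data
n_fits = n_candidates * n_folds + 1
print(n_candidates, n_fits)
```

If the count is too high for the available time, `RandomizedSearchCV` with a fixed `n_iter` is one way to cap it.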