# XGBoost Data Preparation

A lot of work has gone into compiling the current dataset. I have merged gps_df, sectionals_df and results_df. I have limited the amount of Equibase data I am using, to keep the focus on the TPD GPS data and to do some feature engineering. However, the Equibase data does include some good basic metrics, the kind that could be obtained from any racebook sheet.

## Get Started

1. I am going to load the parquet DataFrame from disk and do some imputation, one-hot encoding, string indexing, and scaling, then run it through XGBoost to see how it's looking. After that I will integrate the route data and add the GPS aggregations. I just want to see what I can do with minimal effort, and how well it's working, before I go down the wrong path. If XGBoost doesn't do any better than the LSTM, at least I won't have wasted any more time on it.

### Load master_results_df.parquet file

In [17]:
# Setup Environment

import os
import logging
# Note: importing `abs` from pyspark.sql.functions shadows Python's built-in abs() in this notebook.
from pyspark.sql.functions import (col, count, row_number, abs, unix_timestamp, mean,
                                   when, lit, min as spark_min, max as spark_max,
                                   countDistinct, last, first)
import configparser
from pyspark.sql import SparkSession
from src.data_preprocessing.data_prep1.sql_queries import sql_queries
from pyspark.sql import DataFrame, Window
from src.data_preprocessing.data_prep1.data_utils import (save_parquet, gather_statistics, 
                initialize_environment, load_config, initialize_logging, initialize_spark, 
                drop_duplicates_with_tolerance, identify_and_impute_outliers, 
                identify_and_remove_outliers, identify_missing_and_outliers)
# Set global references to None
spark = None
master_results_df = None

In [18]:

spark, jdbc_url, jdbc_properties, queries, parquet_dir, log_file = initialize_environment()
master_results_df = spark.read.parquet(os.path.join(parquet_dir, "master_results_df.parquet"))

2024-12-10 22:44:04,072 - INFO - Environment setup initialized.
2024-12-10 22:44:04,075 - INFO - Spark session created successfully.


### Imputation for Time-Series like Data

In time-series-like data (which the GPS and sectionals data resemble), more sophisticated imputation methods are often desirable. Spark doesn’t provide a built-in linear interpolation or regression-based imputation function, but you can approximate these methods with a combination of window functions and conditional logic, or with Pandas UDFs if you need more complex logic.
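
As a point of reference for what linear interpolation would compute, here is a minimal pure-Python sketch (not Spark code). The function name `linear_interpolate` and the sample values are made up for illustration. It assumes the times are sorted and strictly increasing, and it leaves leading and trailing gaps untouched because they have a known neighbor on only one side.

```python
from typing import List, Optional


def linear_interpolate(times: List[float],
                       values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior None values by time-weighted linear interpolation."""
    result = list(values)
    # Indices of the rows that already have a value.
    known = [i for i, v in enumerate(values) if v is not None]
    # Walk over consecutive pairs of known points and fill the gap between them.
    for left, right in zip(known, known[1:]):
        t0, t1 = times[left], times[right]
        v0, v1 = values[left], values[right]
        for i in range(left + 1, right):
            weight = (times[i] - t0) / (t1 - t0)  # 0 at the left point, 1 at the right
            result[i] = v0 + weight * (v1 - v0)
    return result


# Hypothetical sample: two missing readings between t=0 and t=4.
times = [0.0, 1.0, 2.0, 4.0]
stride_frequency = [2.0, None, None, 4.0]
filled = linear_interpolate(times, stride_frequency)
print(filled)  # [2.0, 2.5, 3.0, 4.0]
```

Doing the same in Spark would require, for each row, the previous and next known values together with their timestamps, which is the kind of lookup window functions can provide.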

#### Approaches

Below is a more complete and refined version of the code tried earlier. It implements the forward/backward fill logic entirely in Spark using window functions, without having to resort to Pandas. The approach is:

1. Sort by time within each race/horse partition.
2. Compute a forward fill by looking up the last non-null value encountered so far.
3. Compute a backward fill by sorting each partition in reverse time order and again using last(...) with ignorenulls=True.
4. Join the forward and backward fills together, or handle them in one go if you prefer to cache and re-order.
5. Finally, impute the missing stride_frequency values by taking the average of the forward and backward fills.

Note: In the example below, we use a temporary DataFrame for the backward fill results and then join them back to avoid complexity. Another approach is to re-apply the window with reverse ordering and store the result, but you’d need to ensure that the ordering and partitioning keys are identical.
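
To make the steps above concrete, here is a minimal pure-Python sketch of the same logic on a single, already time-sorted series. It is only an illustration: the function name `fill_with_neighbor_average` and the sample numbers are hypothetical, and it is not the Spark implementation.

```python
from typing import List, Optional


def fill_with_neighbor_average(values: List[Optional[float]]) -> List[Optional[float]]:
    """Impute None entries from the nearest earlier and later known values."""
    n = len(values)
    forward: List[Optional[float]] = [None] * n
    backward: List[Optional[float]] = [None] * n

    # Forward fill: carry the last known value forward in time.
    last_seen = None
    for i in range(n):
        if values[i] is not None:
            last_seen = values[i]
        forward[i] = last_seen

    # Backward fill: carry the next known value backward in time.
    next_seen = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            next_seen = values[i]
        backward[i] = next_seen

    imputed: List[Optional[float]] = []
    for original, fwd, bwd in zip(values, forward, backward):
        if original is not None:
            imputed.append(original)           # keep observed values
        elif fwd is None:
            imputed.append(bwd)                # leading gap: only a later value exists
        elif bwd is None:
            imputed.append(fwd)                # trailing gap: only an earlier value exists
        else:
            imputed.append((fwd + bwd) / 2.0)  # interior gap: average both sides
    return imputed


# Hypothetical sample with a leading, an interior, and a trailing gap.
series = [None, 2.0, None, None, 3.0, None]
imputed = fill_with_neighbor_average(series)
print(imputed)  # [2.0, 2.0, 2.5, 2.5, 3.0, 3.0]
```

Note that every row inside an interior gap receives the same value, regardless of how close it is in time to either neighbor.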


In [None]:
race_id_cols = ["course_cd", "race_date", "race_number", "saddle_cloth_number"]

In [None]:


def forward_backward_fill_impute(df: DataFrame, group_cols, time_col: str, value_col: str) -> DataFrame:
    """
    Perform forward and backward fill imputation on `value_col` within each group defined by `group_cols`,
    then impute missing values by taking the average of forward and backward fill values.

    Parameters:
    df (DataFrame): Input DataFrame
    group_cols (list of str): Columns that define the partition (e.g. ["course_cd", "race_date", "race_number", "saddle_cloth_number"])
    time_col (str): The timestamp or time ordering column
    value_col (str): The column to impute missing values for

    Returns:
    DataFrame: DataFrame with an imputed column named `value_col+"_imputed"` 
    """
    # Create a window for forward fill
    forward_window = Window.partitionBy(*group_cols).orderBy(time_col)

    # Forward fill
    df_fwd = df.withColumn("forward_fill_value", last(value_col, ignorenulls=True).over(forward_window))
    
    # Create a window for backward fill (reverse order)
    backward_window = Window.partitionBy(*group_cols).orderBy(col(time_col).desc())

    # Backward fill
    # We'll do this in a separate DataFrame and then join, to avoid complexity
    df_bwd = df.withColumn("backward_fill_value", last(value_col, ignorenulls=True).over(backward_window)) \
               .select(*group_cols, time_col, "backward_fill_value")

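    # Join forward and backward fills together on the shared key columns.
    # Joining by column name (rather than df_fwd[c] == df_bwd[c]) avoids ambiguous
    # column references in this self-join and keeps a single copy of each key column.
    # Assumes (group_cols, time_col) uniquely identifies a row; duplicate keys would multiply rows.
    df_joined = df_fwd.join(df_bwd, on=[*group_cols, time_col], how="inner")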

    # Impute missing values of `value_col`:
    #   - both fills available: use their average
    #   - only one available: use that one (null + x is null in Spark, so this is handled explicitly)
    #   - neither available (the whole group is null): the value stays null
    df_imputed = df_joined.withColumn(
        value_col + "_imputed",
        when(col(value_col).isNull(),
             (when(col("forward_fill_value").isNull(), col("backward_fill_value"))
              .otherwise(when(col("backward_fill_value").isNull(), col("forward_fill_value"))
                         .otherwise((col("forward_fill_value") + col("backward_fill_value")) / 2.0))))
        ).otherwise(col(value_col))
    )

    # Drop intermediate columns
    df_final = df_imputed.drop("forward_fill_value", "backward_fill_value")

    return df_final
