# Horse Race Data Pipeline Notebook

This notebook documents our Horse Race Data Pipeline. It explains what the pipeline does, how it works, and suggests future improvements. The notebook also demonstrates key steps such as data ingestion, merging, feature engineering, and model preparation.

Below is an overview of the pipeline architecture.

## Architecture & Workflow Diagram

```plaintext
    +------------------------+
    |   Configuration &      |
    |   Setup (config.ini)   |
    +-----------+------------+
                |
                v
    +-----------+----------------------------------+
    | Data Ingestion                               |
    |  - PP Files (scan_cards, load_all_pp_cards)  |
    |  - Result Files (load_results_and_merge)     |
    +-----------+----------------------------------+
                |
                v
    +-----------+------------+
    | Data Merging           |
    |  - Build merge key     |
    |  - Merge PP & Results  |
    +-----------+------------+
                |
                v
    +-----------+------------+
    | Feature Engineering    |
    |  - Select predictive   |
    |    features            |
    |  - Impute missing vals |
    |  - Save engineered CSV |
    +-----------+------------+
                |
                v
    +-----------+------------+
    | Model Preparation &    |
    | Training (Optional)    |
    +------------------------+
```

The diagram shows the pipeline's sequential flow: configuration, data ingestion, merging, feature engineering, and finally (optional) model training.

## Step 1: Configuration and Setup

The pipeline reads its settings (such as the track code and file paths) from a configuration file (`config.ini`) and sets up logging. This tells the pipeline where to find the Past Performance (PP) and result files and how to log its progress.

```python
# Example: Load configuration and setup logging
from data_prep.utilities import load_config, setup_logging
import logging

setup_logging(level=logging.INFO)
config = load_config()
track = config['DEFAULT'].get('track', 'SA')
print(f"Track from config.ini: {track}")
```

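Note: the cell above raises `ModuleNotFoundError: No module named 'data_prep'` unless the `data_prep` package is importable from the notebook's environment (for example, installed or on `sys.path`).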
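To make the configuration step concrete, here is a minimal sketch of what reading `config.ini` could look like using only the standard library. The key names are the ones this notebook reads; the values, the `pp_field_mapping.csv` file name, and the helper `load_config_sketch` are assumptions for illustration and not the actual `load_config` implementation.

```python
import configparser
import logging
import tempfile
from pathlib import Path

# Hypothetical config.ini contents: key names come from this notebook,
# the values are placeholders.
SAMPLE_CONFIG = """\
[DEFAULT]
track = SA
pp_location = data/pp
result_location = data/results
pp_fields_mapping_location = mappings/pp_field_mapping.csv
result_fields_mapping_location = mappings/result_field_mapping.csv
"""


def load_config_sketch(path):
    """Read an INI file and return the parser (illustrative stand-in for load_config)."""
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it managed to parse.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read config file: {path}")
    return parser


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.ini"
    config_path.write_text(SAMPLE_CONFIG)
    sketch_config = load_config_sketch(config_path)

sketch_track = sketch_config["DEFAULT"].get("track", "SA")
logging.basicConfig(level=logging.INFO)
logging.getLogger(__name__).info("Track from config: %s", sketch_track)
```

Failing early when the file cannot be read avoids silently falling back to the default track code.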

## Step 2: Data Ingestion and Merging

### Past Performance (PP) Data

- **scan_cards()** scans the given PP directory (non-recursively) for files matching a given pattern (e.g., `SA*.DRF`).
- **load_all_pp_cards()** loads the PP data from the scanned files, using a PP field mapping CSV.
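The non-recursive scan described above can be sketched with `pathlib`. This is an illustration of the idea, not the real `scan_cards()`; the function name `scan_cards_sketch` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path


def scan_cards_sketch(directory, pattern):
    """Return the sorted files directly inside `directory` that match `pattern`."""
    base = Path(directory)
    if not base.is_dir():
        raise NotADirectoryError(f"Not a directory: {base}")
    # Path.glob without "**" does not descend into subdirectories.
    return sorted(p for p in base.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "SA0101.DRF").touch()
    (root / "SA0102.DRF").touch()
    (root / "GP0101.DRF").touch()               # different track: not matched
    (root / "archive").mkdir()
    (root / "archive" / "SA0103.DRF").touch()   # nested: not scanned
    found_names = [p.name for p in scan_cards_sketch(root, "SA*.DRF")]

print(found_names)  # → ['SA0101.DRF', 'SA0102.DRF']
```

Sorting the result keeps the load order deterministic across runs and platforms.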

### Race Results Data

- The result field mapping CSV (e.g., `result_field_mapping.csv`) defines the structure of the result files.
- **load_results_and_merge()** parses the result files, constructs a result DataFrame, and merges it with the PP DataFrame using a common merge key (`race_key`).

The merged data is saved to `merged_data.csv`.
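As a rough illustration of the merge step, the sketch below builds a `race_key` and joins two small frames with pandas. The components of the key (track, date, race number), every column name other than `race_key` and `res_finish_position`, and the sample rows are assumptions; the real key may be built differently, and per-starter data would likely also need a horse identifier.

```python
import pandas as pd


def build_race_key(df, track_col, date_col, race_col):
    """Join track, date, and race number into one string key per row."""
    return (
        df[track_col].astype(str).str.strip()
        + "_" + df[date_col].astype(str).str.strip()
        + "_" + df[race_col].astype(str).str.strip()
    )


# Hypothetical sample data with one row per race, for simplicity.
pp_demo = pd.DataFrame({
    "track": ["SA", "SA", "SA"],
    "race_date": ["20240101", "20240101", "20240102"],
    "race_number": [1, 2, 1],
    "pp_horse_name": ["Alpha", "Bravo", "Charlie"],
})
results_demo = pd.DataFrame({
    "track": ["SA", "SA"],
    "race_date": ["20240101", "20240101"],
    "race_number": [1, 2],
    "res_finish_position": [3, 1],
})

for frame in (pp_demo, results_demo):
    frame["race_key"] = build_race_key(frame, "track", "race_date", "race_number")

# A left merge keeps every PP row; `validate` raises if a key is duplicated,
# and `indicator` records which rows found a matching result.
merged_demo = pp_demo.merge(
    results_demo[["race_key", "res_finish_position"]],
    on="race_key",
    how="left",
    validate="one_to_one",
    indicator=True,
)
unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(merged_demo[["race_key", "pp_horse_name", "res_finish_position"]])
print("PP rows without a result:", unmatched)
```

Counting unmatched rows after the merge is a cheap check that the key is built the same way on both sides.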

```python
# Example: Ingest and merge data
from data_prep.ingestion import scan_cards, load_all_pp_cards, load_results_and_merge

# Get directories and mapping paths from config
pp_location = config['DEFAULT'].get('pp_location')
result_location = config['DEFAULT'].get('result_location')
pp_field_mapping_csv = config['DEFAULT'].get('pp_fields_mapping_location')
result_field_mapping_csv = config['DEFAULT'].get('result_fields_mapping_location')

# Build file patterns based on track code
pp_track_pattern = f"{track}*.DRF"
res_track_pattern = f"{track}*.*"

# Load PP data
pp_cards = scan_cards(pp_location, pp_track_pattern)
pp_all_df, pp_field_map = load_all_pp_cards(pp_cards, pp_field_mapping_csv)
print("Loaded PP data shape:", pp_all_df.shape)

# Load result data and merge with PP data
merged_df, result_field_map = load_results_and_merge(
    pp_all_df, result_location, result_field_mapping_csv, res_track_pattern
)
print("Merged data shape:", merged_df.shape)
print("Merged columns:", merged_df.columns.tolist())
```

## Step 3: Feature Engineering

The feature engineering step reads the predictive features defined in the result and PP field mappings and keeps only those columns from the merged DataFrame. Instead of dropping rows with missing values, the pipeline imputes them (e.g., with the median for numeric columns and a placeholder for categorical columns). Finally, the engineered DataFrame is saved to `engineered_features.csv`.

```python
# Example: Feature Engineering
from feature_engineering import engineer_features

# Call the feature engineering function
engineered_df = engineer_features(merged_df, result_field_map, pp_field_map)
print("Engineered features shape:", engineered_df.shape)
engineered_df.head()
```
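The imputation rule described in this step (median for numeric columns, a placeholder for categorical ones) can be sketched as follows. This is not the body of `engineer_features`; the helper `impute_missing`, the `"UNKNOWN"` placeholder, and the sample column names are assumptions.

```python
import numpy as np
import pandas as pd


def impute_missing(df, placeholder="UNKNOWN"):
    """Fill numeric gaps with the column median and all other gaps with a placeholder."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    other_cols = out.columns.difference(numeric_cols)
    # fillna with a Series of medians fills each column with its own median.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out[other_cols] = out[other_cols].fillna(placeholder)
    return out


# Hypothetical feature columns with some missing values.
features_demo = pd.DataFrame({
    "pp_speed_rating": [80.0, np.nan, 90.0, 70.0],
    "pp_days_since_last": [14.0, 30.0, np.nan, 21.0],
    "pp_surface": ["D", None, "T", "D"],
})
imputed_demo = impute_missing(features_demo)
print(imputed_demo)
```

The median is less sensitive to extreme values than the mean, which is one reason it is a common default for skewed numeric fields.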

## Step 4: Model Preparation and Basic ML Example

Once we have the engineered features, we prepare the data for modeling. This step typically involves splitting the data into training and test sets (plus a validation set when comparing models) and then training a model. The example below uses a plain train/test split and a simple linear regression model, as a baseline, to predict the race finish position (assumed to be in the column `res_finish_position`).

Adjust the target column and model as necessary.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Set the target column. Adjust if needed.
target_column = "res_finish_position"

# Ensure the target column exists in engineered_df
if target_column not in engineered_df.columns:
    raise ValueError(
        f"Target column '{target_column}' not found. "
        f"Available columns: {engineered_df.columns.tolist()}"
    )

# Separate features and target. LinearRegression needs numeric inputs, so keep
# only numeric feature columns (categorical columns would need encoding first).
X = engineered_df.drop(columns=[target_column]).select_dtypes(include="number")
y = engineered_df[target_column]

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and compute MSE
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
```
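If a separate validation set is wanted, one common approach is to call `train_test_split` twice. The sketch below uses synthetic data rather than `engineered_df`, and the 60/20/20 proportions are an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 numeric features, finish positions 1 to 10.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(1, 11, size=100)

# Hold out 20% as the test set first.
X_rest, X_test_d, y_rest, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Then take 25% of the remaining 80% as validation (20% of the total).
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

split_sizes = (len(X_train_d), len(X_val_d), len(X_test_d))
print(split_sizes)  # → (60, 20, 20)
```

The validation set is then used to compare models or settings, and the test set is touched only once for the final estimate.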

## Step 5: Future Improvements and Roadmap

- **Data Ingestion:**
  - Support for additional data formats (JSON, XML).
  - Enhanced error handling and logging.

- **Data Cleaning & Feature Engineering:**
  - Advanced imputation strategies (e.g., kNN imputation).
  - Outlier detection and removal.
  - Feature scaling and normalization.

- **Modeling Enhancements:**
  - Experiment with ensemble methods (e.g., Random Forest, Gradient Boosting).
  - Hyperparameter tuning and cross-validation.
  - Explore deep learning if data volume permits.

- **Pipeline Automation & Deployment:**
  - Containerization for reproducibility.
  - Workflow orchestration with tools like Airflow or Prefect.
  - Real-time prediction dashboards.

- **Collaboration & Documentation:**
  - Detailed documentation and code comments.
  - Interactive notebooks for exploratory data analysis.
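
Several of the roadmap items above (kNN imputation, feature scaling, an ensemble model, and cross-validation) could be combined in a single scikit-learn `Pipeline`. The following is a sketch on synthetic data, not a tuned configuration; the data, the hyperparameters, and the name `roadmap_pipeline` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with roughly 10% of feature values removed.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = 2.0 * X_syn[:, 0] - X_syn[:, 1] + rng.normal(scale=0.1, size=200)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan

# Imputation and scaling sit inside the pipeline, so they are re-fit on the
# training portion of each fold and never see the held-out rows.
roadmap_pipeline = Pipeline([
    ("impute", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

scores = cross_val_score(
    roadmap_pipeline, X_syn, y_syn, cv=5, scoring="neg_mean_squared_error"
)
print("Mean cross-validated MSE:", -scores.mean())
```

Tree ensembles do not strictly need scaled inputs; the scaling step is kept here so that a different estimator can be swapped in without changing the preprocessing.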

## Summary

- The pipeline ingests PP and result files, merges them on `race_key`, and processes the merged horse race data.
- Feature engineering is performed by selecting only predictive features (as defined by the field mappings) and imputing missing values.
- A basic ML example demonstrates model preparation and training.
- Future improvements include better data cleaning, advanced modeling, and deployment strategies.