## 🐎 Horse-Level Attributes and Their Impact on Win Rate

In this notebook, we explore **attributes of the individual horse** — characteristics known before the race that may influence performance.

Unlike contextual factors like field size or going, these features are about the **horse itself**.

We focus on:
- Physical characteristics (e.g. **age**, **sex**, **weight**)
- Basic identifiers (e.g. **saddlecloth number**, **horse name** — for reference only)
- Whether a horse **finished or failed to finish** (a race outcome, not a pre-race attribute)

This helps us understand whether certain types of horses — by age, sex, or weight — tend to perform better, and whether these factors should be included in later models.


## 📦 Setup: Libraries, Data Access, and Target Variable

This section loads the necessary Python libraries, connects to the database, and sets up the target variable (`won`) for use throughout the notebook.


In [None]:
# Libraries
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias

# Connect to SQLite database
conn = sqlite3.connect("../db/raceform.db")

# Load the main cleaned table
df = pd.read_sql_query("SELECT * FROM data_clean", conn)

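# Define binary target: 1 if horse finished 1st, else 0.
# Coercing to numeric first keeps the comparison valid if `pos` is stored as
# text; non-numeric or missing entries become NaN and count as not won.
df['won'] = (pd.to_numeric(df['pos'], errors='coerce') == 1).astype(int)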

# Keep only flat races (as an independent copy) for consistency
flat_df = df[df['type'].str.lower() == 'flat'].copy()
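
The overview lists whether a horse finished or failed to finish as one of the things to look at. A minimal sketch of how such a flag could be derived alongside `won` is shown below. It runs on a toy frame, and it assumes that finishers carry a numeric `pos` while non-finishers carry a text code; the codes used here (`PU`, `F`) are illustrative and may not match what `data_clean` actually stores.

```python
import pandas as pd

# Toy stand-in for the `pos` column. ASSUMPTION: finishers have a numeric
# position and non-finishers a text code; the codes below are illustrative.
toy = pd.DataFrame({"pos": ["1", "2", "PU", "3", "F", None, "1"]})

# Coerce to numbers; anything non-numeric (codes, missing values) becomes NaN
pos_num = pd.to_numeric(toy["pos"], errors="coerce")

toy["finished"] = pos_num.notna().astype(int)  # 1 if a numeric position exists
toy["won"] = (pos_num == 1).astype(int)        # 1 only for a first-place finish

print(toy)
```

Note that this sketch treats a missing `pos` the same as a non-finish. If missing values mean something else in the real table, they would need to be separated out first.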


### 🎂 Win Rate by Horse Age

Horse age is a fundamental variable in racing. Age influences:
- Physical maturity and development
- Experience in racing conditions
- Eligibility for certain race types (e.g. 2yo maidens, veteran handicaps)

In general:
- **2-year-olds (2yo)** are inexperienced and often race in separate divisions
- **3–5yo** tend to be physically mature and are often at peak performance
- **Older horses (6yo+)** may be more experienced but may also decline in ability or be used in lower-grade races

In this section, we group runners by age and calculate win rates to see whether age has a clear relationship with performance — and to flag any age bands that stand out as over- or underperforming.
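
The three broad groups described above (2yo, 3–5yo, 6yo+) can also be compared directly by binning age. The following is a small sketch on made-up rows, not the notebook's data; the `age_band` column name and the bin edges are assumptions chosen to mirror the bullet list.

```python
import pandas as pd

# Made-up example rows (illustrative only)
toy = pd.DataFrame({
    "age": [2, 2, 3, 4, 5, 5, 6, 8, 9, 3],
    "won": [0, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# pd.cut bins are right-inclusive by default: (1, 2], (2, 5], (5, 12]
toy["age_band"] = pd.cut(
    toy["age"], bins=[1, 2, 5, 12], labels=["2yo", "3-5yo", "6yo+"]
)

# Report runs and wins next to the rate so small bands are easy to spot
band_summary = (
    toy.groupby("age_band", observed=True)["won"]
    .agg(runs="count", wins="sum", win_rate="mean")
    .reset_index()
)
print(band_summary)
```

Keeping `runs` next to `win_rate` matters because a band with few runners can show an extreme rate by chance.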


In [None]:
# Filter valid ages
flat_age_df = flat_df[flat_df['age'].notnull()].copy()
flat_age_df = flat_age_df[flat_age_df['age'].between(2, 12)]  # Typical racing ages

# Group and calculate win rate
age_win_rate = (
    flat_age_df.groupby('age', observed=True)['won']
    .mean()
    .reset_index()
    .rename(columns={'won': 'win_rate'})
)

# Plot
plt.figure(figsize=(10, 5))
sns.barplot(x='age', y='win_rate', data=age_win_rate)
plt.title('Win Rate by Horse Age')
plt.xlabel('Age')
plt.ylabel('Win Rate')
plt.grid(True, axis='y')  # pass True explicitly; a bare call toggles the grid, which whitegrid already enabled
plt.show()
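
A bar chart of raw win rates does not show how many runners sit behind each bar, so an age with few runners can look like a standout purely by chance. One way to check this, sketched below, is to attach a Wilson score interval to each rate. The `wilson_interval` helper is hypothetical (it is not part of the notebook), and the counts are invented for illustration.

```python
import numpy as np
import pandas as pd

def wilson_interval(wins, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    wins = np.asarray(wins, dtype=float)
    runs = np.asarray(runs, dtype=float)
    p = wins / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * np.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half, centre + half

# Invented counts per age, for illustration only
toy = pd.DataFrame({
    "age": [2, 3, 4, 10],
    "runs": [400, 1500, 1200, 20],
    "wins": [40, 165, 120, 4],
})
toy["win_rate"] = toy["wins"] / toy["runs"]
toy["ci_low"], toy["ci_high"] = wilson_interval(toy["wins"], toy["runs"])

print(toy.round(3))
```

In the notebook itself, `runs` and `wins` could be obtained with `flat_age_df.groupby('age')['won'].agg(runs='count', wins='sum')`. Ages whose intervals overlap heavily with their neighbours are weak evidence of a real difference.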
