# 🧹 Data Integrity Checkpoint (2019 Horse Racing Data)

This is the first notebook in our horse racing modelling project.

We’re using historical UK race data to learn how to build our own predictive models — the same way bookmakers, syndicates, and serious bettors do.

But before we do any analysis or modelling, we need to ask:

> 🧠 Can we trust this data?

---

## 📦 What’s This Dataset?

This dataset was published on Kaggle and contains UK and Irish horse racing data from **1990 to 2020**, across two files per year:
- `races_YEAR.csv` — one row per race
- `horses_YEAR.csv` — one row per horse in a race

Each horse entry includes odds, trainer, age, and finishing position.  
Each race entry includes track, distance, conditions, and prize money.

---

## 📍 Why Are We Using It?

- 🆓 It’s free and publicly available  
- 🧠 It’s ideal for learning and prototyping  
- 📊 It includes real-world race outcomes and betting prices  
- 🏇 It’s detailed enough to build serious models — but simple enough for newcomers

We’ve chosen to start with **just 2019** because:
- It’s the last full "normal" season before COVID-19 disrupted the sport
- It’s large enough to explore a wide range of race types
- We can expand later if needed

---

## 🧭 Why This Still Matters (Even If It’s Old)

Although this dataset stops in 2020, it’s perfect for learning:

- We can explore patterns in elite races (like the Epsom Derby)
- We can practice building models using real bookmaker odds and outcomes
- We can learn which features matter — and which don’t

Once we understand how to clean, analyse, and model this historical data, we’ll be better equipped to:
- Scrape or buy live racecards for upcoming races
- Apply our models to races happening today or in the future
- Extend this project into other sports or betting formats

This is our **training ground** — not the final product.

---

## 🔍 What This Notebook Will Do

This notebook is not about modelling. It’s about **understanding**:
- What the raw data really contains
- What might be broken, missing, or misleading
- What needs to be fixed or excluded

This is about **building trust** in our data before using it.

We’ll also begin answering:
- What other data would we need to model specific races like the Epsom Derby?
- What’s missing from this dataset (e.g. sectional times, pace, Betfair odds)?
- When will scraping or paid data be required?

This sets the foundation for every notebook that comes next.


## 📥 Step 1: Load the Raw CSV Files

We’re starting with two files:

- `races_2019.csv` — one row per race (course, date, distance, going, prize)
- `horses_2019.csv` — one row per runner in a race (horse name, odds, position)

We’ll begin by loading both and checking their shapes.

This gives us a basic sense of scale — how many races, how many horses — and acts as a first **checkpoint** before we merge or explore anything further.


In [2]:
import pandas as pd

# Load the raw CSVs
races = pd.read_csv("data/races_2019.csv")
horses = pd.read_csv("data/horses_2019.csv")

# Show shapes of both datasets
print("📁 Races dataset shape:", races.shape)
print("🐎 Horses dataset shape:", horses.shape)


📁 Races dataset shape: (17307, 18)
🐎 Horses dataset shape: (171849, 27)


### 📊 First Sanity Check: Do These Numbers Make Sense?

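- The `races_2019.csv` file contains **17,307 rows**, one per race
- The `horses_2019.csv` file contains **171,849 rows**, one per runner in a race (a horse that ran several times in 2019 appears several times)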

That means we have, on average:

$$
\frac{171,\!849\ \text{horses}}{17,\!307\ \text{races}} \approx 9.9\ \text{horses per race}
$$

This is in line with what we expect:
- Most UK races have **8–14 runners**, depending on race type, distance, and class
- Some elite races (like the Derby) have large fields (16–20), but many everyday races have smaller ones (5–10)

So the two totals are consistent with each other and suggest the dataset is **plausible** at the top level. An average alone cannot confirm completeness: it would look the same if whole races were missing from both files.
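
An average hides the spread, so it can help to look at the distribution of field sizes as well. The sketch below uses a small made-up frame (`horses_demo`) in place of the real `horses` table; only the column names `rid` and `runners` come from the dataset. It assumes that `runners` records the field size for the race, which should be confirmed against the dataset's column descriptions.

```python
import pandas as pd

# Toy stand-in for `horses`; in the notebook, use the loaded DataFrame instead.
horses_demo = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "horseName": list("ABCDEFGHI"),
    "runners": [3, 3, 3, 2, 2, 5, 5, 5, 5],
})

# Rows per race: the field size implied by the horse-level file.
field_sizes = horses_demo.groupby("rid").size()
print(field_sizes.describe())

# Assumption: `runners` holds the field size recorded for the race.
# Races where the row count disagrees deserve a closer look.
declared = horses_demo.groupby("rid")["runners"].first()
mismatched = field_sizes[field_sizes != declared]
print("Races with a row-count mismatch:", mismatched.index.tolist())
```

In the toy data, race 3 has four rows but a recorded field of five, so it is the only race flagged. On the real data, a long list of mismatches would point to missing runner rows or to a different meaning of `runners`.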


## 🔍 Step 2: Check for Missing Values

Before we merge or model, we need to understand how clean the raw data is.

We’ll look at each column in both datasets and count how many values are missing (`NaN` or empty cells). This helps us decide:

- Which columns are safe to use as-is
- Which ones might need cleaning, fixing, or removing
- Whether any essential information (like track, result, or odds) is incomplete


In [5]:
# Show count of missing values in each column
print("📁 Races: Missing values")
print(races.isnull().sum())

print("\n🐎 Horses: Missing values")
print(horses.isnull().sum())


📁 Races: Missing values
rid                0
course             0
time               0
date               0
title              0
rclass          7225
band           10095
ages               0
distance           0
condition          1
hurdles        12644
prizes             0
winningTime        0
prize              2
metric             0
countryCode        0
ncond              0
class              0
dtype: int64

🐎 Horses: Missing values
rid                  0
horseName            0
age                  0
saddle             110
decimalPrice         0
isFav                0
trainerName          0
jockeyName           0
position             0
positionL        25219
dist             42476
weightSt             0
weightLb             0
overWeight      168665
outHandicap     168628
headGear        108472
RPR              18651
TR               63567
OR               69860
father               0
mother               0
gfather            170
runners              0
margin               0
weight 
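
Raw counts are hard to compare across two files that differ in size by a factor of ten, so a share of rows is often easier to read. For example, 168,665 missing `overWeight` values out of 171,849 rows is about 98%. The helper below is a minimal sketch; `missing_report` is a name introduced here for illustration, and `demo` is made-up data.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns that have any gaps."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "pct": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one gap, worst first.
    return report[report["missing"] > 0].sort_values("pct", ascending=False)

# Made-up data for illustration only.
demo = pd.DataFrame({
    "rid": [1, 2, 3, 4],
    "headGear": [None, "b", None, None],
    "saddle": [1.0, 2.0, None, 4.0],
})
report = missing_report(demo)
print(report)
```

In the notebook, `missing_report(races)` and `missing_report(horses)` would give the same information as the counts above, sorted and expressed as percentages.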

### 🧼 What’s Missing — and What It Means

Let’s break down what the missing values tell us:

---

#### 📁 `races_2019.csv` (Race-level data)

| Column         | Missing | Notes |
|----------------|---------|-------|
| `rclass`       | 7,225   | Race class text (e.g. “Group 1”) — not critical since we have numeric `class` |
| `band`         | 10,095  | Often empty in higher-level races — not essential |
| `hurdles`      | 12,644  | Most races don’t have jumps — this is fine and expected |
| `condition`    | 1       | Only one missing going/ground value — we can handle it easily |
| `prize`        | 2       | Minor — we’re not modelling on prize money directly |

✅ **No missing values** in `rid`, `course`, `date`, `distance`, `class`, `metric`, which is excellent.

---

#### 🐎 `horses_2019.csv` (Horse-level data)

| Column         | Missing | Notes |
|----------------|---------|-------|
| `saddle`       | 110     | Minor — not useful for us right now |
| `positionL`    | 25,219  | Labelled version of `position` (e.g. “PU”, “F”) — might help with DNFs |
| `dist`         | 42,476  | Distance behind winner — many missing, might not be reliable |
| `overWeight`, `outHandicap` | 168,665 / 168,628 | Nearly all missing (about 98% of rows); likely not useful in this dataset |
| `headGear`     | 108,472 | Not used in early models — fine to ignore for now |
| `RPR`, `TR`, `OR` | 18,651 / 63,567 / 69,860 | Optional performance ratings; advanced features we may revisit |
| `gfather`      | 170     | Missing only occasionally; not a priority |

✅ **No missing values** in `rid`, `horseName`, `age`, `position`, `decimalPrice` (the odds column), `trainerName`, `jockeyName`, etc., which are our core features.
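
A zero in the `isnull()` count only shows that no cell is `NaN`. A column can still carry placeholder values that stand in for "unknown". The sketch below counts two kinds of candidate: zeros in numeric columns, and blank or dash-only strings in text columns. Which values actually act as placeholders in this dataset is an assumption to check against the column descriptions. A zero is only suspicious in columns where zero is not a valid value, so it is expected in a flag such as `isFav`. `suspect_counts` and `races_demo` are names made up for this example.

```python
import pandas as pd

def suspect_counts(df: pd.DataFrame) -> pd.Series:
    """Count candidate placeholder values per column (not NaN, but possibly 'unknown')."""
    counts = {}
    for col in df.columns:
        values = df[col]
        if pd.api.types.is_numeric_dtype(values):
            # Assumption: a zero may be a placeholder in some numeric columns.
            counts[col] = int((values == 0).sum())
        else:
            # Assumption: blank, "-" or "?" strings may be placeholders.
            text = values.dropna().astype(str).str.strip()
            counts[col] = int(text.isin(["", "-", "?"]).sum())
    return pd.Series(counts)

# Made-up data for illustration only.
races_demo = pd.DataFrame({
    "class": [1, 0, 4, 0],
    "course": ["Ascot", " ", "York", "Epsom"],
})
result = suspect_counts(races_demo)
print(result)
```

Any column with a large count here is worth a manual look before it is treated as complete.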

---

### ✅ Summary

- **Most essential columns are complete**
- **Missing values are expected and manageable**
- We can proceed with merging and structure checks
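
Before merging, it is worth confirming that `rid` links the two files cleanly. The sketch below shows three checks on made-up frames; only the column name `rid` comes from the dataset, and the variable names are illustrative.

```python
import pandas as pd

# Made-up data for illustration only.
races_demo = pd.DataFrame({"rid": [10, 11, 12], "course": ["Ascot", "York", "Epsom"]})
horses_demo = pd.DataFrame({"rid": [10, 10, 11, 13], "horseName": ["A", "B", "C", "D"]})

# 1. `rid` should identify exactly one row in the race-level file.
duplicate_rids = int(races_demo["rid"].duplicated().sum())

# 2. Every runner should point at a race that exists.
orphan_runners = int((~horses_demo["rid"].isin(races_demo["rid"])).sum())

# 3. Every race should have at least one runner.
empty_races = int((~races_demo["rid"].isin(horses_demo["rid"])).sum())

print("Duplicate race ids:", duplicate_rids)
print("Runners without a race:", orphan_runners)
print("Races without runners:", empty_races)
```

On the real data, all three counts should ideally be zero. Non-zero values mean a merge would silently drop or duplicate rows.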

Later notebooks can selectively ignore or impute these fields based on modelling needs.
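
As a rough sketch of what "ignore or impute" could look like, the example below drops columns that are almost entirely empty and, for a partially filled rating, keeps a flag recording which values were missing before filling them. The 90% threshold and the median fill are illustrative choices, not decisions made in this project, and the helper names are made up for the example.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.9) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `max_missing`."""
    share = df.isnull().mean()
    return df[share[share <= max_missing].index]

def impute_with_flag(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill gaps with the column median and record where the gaps were."""
    out = df.copy()
    out[f"{column}_missing"] = out[column].isnull()
    out[column] = out[column].fillna(out[column].median())
    return out

# Made-up data for illustration only.
demo = pd.DataFrame({
    "rid": [1, 1, 2, 2],
    "overWeight": [None, None, None, None],
    "OR": [80.0, None, 75.0, None],
})
cleaned = impute_with_flag(drop_sparse_columns(demo), "OR")
print(cleaned)
```

Keeping the `_missing` flag matters because the absence of a rating may itself carry information, and a model can only use that if the gap is not erased by the fill.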
