# 🧹 Data Integrity Check – UK Flat Racing (2019)

_This notebook establishes trust in the raw data before any modelling is attempted._



We’re using real-world historical data to estimate **pre-race win probabilities**, not just to pick winners — but to understand the market, detect value, and build transparent predictive tools.

---

## 🎯 Project Purpose

This is not a tipping sheet. It’s a **data science training ground**.

We’re using racing as a way to learn:

- End-to-end sports modelling
- Real-world feature engineering
- How to reason about uncertainty, risk, and odds
- How to spot market inefficiencies

We aim to estimate fair odds using only **pre-race data**, and then compare those to bookmaker prices to identify potential value bets.

---

## 📦 About the Dataset

We are working with two CSV files from the **2019 UK flat racing season**:

- `races_2019.csv` – One row per race (race-level data)
- `horses_2019.csv` – One row per runner per race (horse-level data)

The full dataset spans **multiple years**. We're starting with 2019 because:

- 🧪 It’s a clean, complete, pre-COVID season
- 🧱 It offers enough variety to build serious models
- 📉 It helps us focus on modelling structure, not data noise

We’ll incorporate other seasons later **only when necessary** — and with consistent processing.

---

## 🧭 Why This Still Matters (Even If It’s Old)

Although the dataset stops at 2020, it’s perfect for learning:

- We can study elite races like the Epsom Derby
- We can train and evaluate real models with real market prices
- We can learn what matters in racing data — and what doesn’t

Once we’ve mastered this historical data, we’ll be ready to:

- Scrape or buy current racecards
- Apply our models to today’s runners
- Extend this workflow to other sports or betting formats

This is our **training ground**, not our final product.

---

## 🛠️ Reusability and Scale

This notebook is designed to work with **any season** of the dataset.

By changing the `YEAR` parameter at the top, we can reload and reprocess data for 2018, 2020, etc. automatically. All renaming, cleaning, and integrity checks will be applied consistently.

This ensures:

- Reusable code
- Consistent data prep
- Minimal manual work when scaling up

🧩 The full pipeline is **modular and scalable**, even though we begin with one season.

---

## 📊 Data Integrity Notebook – Purpose and Scope

This notebook is not about modelling.

It’s about **building trust in the data**, by understanding:

✅ What the dataset actually contains (and what it doesn’t)  
⚠️ What might be broken, missing, or misleading  
🛠️ What needs to be cleaned, fixed, or excluded before modelling

This is a **prerequisite** for every notebook that follows.

---

## 📌 Why This Matters

We don’t want to feed misunderstood or biased data into a model — especially when we aim to price horses as confidently as a bookmaker would.

Before we train or test anything, we want to know:

- Column meanings are correct
- Errors and placeholders are handled
- We’re focusing on relevant fields
- The data makes sense in the context of UK flat racing

---

## ❓ Questions This Notebook Begins to Answer

- What columns are usable right now?
- What fields require further documentation or cleanup?
- What’s missing from this dataset (e.g. sectional times, Betfair prices)?
- What would we need to scrape or buy to model real betting markets?
- How can we structure this project to grow over time?

---

## 🔦 Scope of Checks

We focus only on the columns relevant to our **initial modelling goal**:

- 🎯 Target variable: `position`
- 🎲 Key inputs: `implied_prob`, `age`, `ratings` (`RPR`, `TR`, `OR`)
- 🧩 Join keys and identifiers: `rid`, `horseName`
- 🏇 Categorical factors we may later encode: trainer, jockey

Other fields will be reviewed **only if and when they become useful** in later analysis.

---

## 📘 On Understanding Column Definitions

Where possible, we’ve reviewed column meanings using:

- Dataset documentation (e.g. Kaggle source or data dictionary)
- Industry knowledge (e.g. Racing Post format, BHA rules)
- Data exploration and pattern validation

We avoid guessing or interpreting field names without checking.

If a field is unclear or undocumented, we will:
- Investigate it through examples and context
- Flag it for deferred analysis
- Avoid using it in modelling until verified

This is essential to maintaining **integrity, reproducibility, and trust** in our modelling pipeline.


## 📥 Step 1 – Load the Raw Data

Before we inspect, clean, or rename anything, we begin by loading the **raw 2019 flat racing data** directly from CSV.

This will give us a true first look at the structure, contents, and any surprises in the dataset — without applying assumptions from earlier work.

We'll set a `YEAR` variable to keep the notebook flexible in case we want to reuse it for other seasons later.


In [5]:
# 📅 Choose the year to load
YEAR = 2019  # You can change this later to another year

# 🗂️ File paths
horses_path = f"data/raw/horses_{YEAR}.csv"
races_path = f"data/raw/races_{YEAR}.csv"

# 📥 Load data
import pandas as pd

horses_raw = pd.read_csv(horses_path)
races_raw = pd.read_csv(races_path)

print("✅ Data loaded")
print("🐎 Horses:", horses_raw.shape)
print("🏁 Races:", races_raw.shape)

# Preview horse-level data
print("🐎 Horse data preview:")
horses_raw.head(3)

# Preview race-level data
print("🏁 Race data preview:")
races_raw.head(3)


✅ Data loaded
🐎 Horses: (171849, 27)
🏁 Races: (17307, 18)
🐎 Horse data preview:
🏁 Race data preview:


Unnamed: 0,rid,course,time,date,title,rclass,band,ages,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
0,11499,Sha Tin (HK),06:30,19/01/01,Wong Leng Handicap (Class 4) (3yo+) (Course C)...,,,3yo+,7f,Good,,"[52492.49, 20260.26, 10590.59, 5525.53, 3223.22]",82.46,92092.0,1407.0,HK,1,0
1,26954,Sha Tin (HK),05:00,19/01/01,Kowloon Peak Handicap (Class 5) (3yo+) (Course...,,,3yo+,1m,Good,,"[39369.37, 15195.2, 7942.94, 4144.14, 2417.42]",95.75,69068.0,1609.0,HK,1,0
2,35478,Fairyhouse (IRE),02:40,19/01/01,Follow Fairyhouse On Social Media Beginners Chase,,,5yo+,2m5f,Good,13 fences,"[8624.0, 2674.0, 1274.0, 574.0]",341.3,13146.0,4223.0,IE,1,0


---

### ✅ Initial Observations

- The structure matches expectations: **one row per horse per race**
- Key columns are present in `horses_raw`, including:
  - `position` (target), `decimalPrice` (odds proxy), `trainerName`, `jockeyName`, `age`, `rid` (race ID)
- Sample values look clean and correctly formatted:
  - ✅ No broken delimiters
  - ✅ No encoding issues
  - ✅ No immediately suspicious values

We now move on to a **schema-level overview** to confirm data types, null values, and column-level structure.




## 📘 Step 2 – Schema Overview

Before we begin checking individual columns, we want to understand the **overall structure** of the dataset.

In this step, we’ll:
- View the data types for each column
- Check for missing values
- Generate summary statistics for numeric fields

This gives us a high-level map of what we're working with and helps us prioritise which columns may need deeper inspection.


In [7]:
print("🐎 Horse-level data:")
horses_raw.info()
horses_raw.describe(include='all').T

print("🏁 Race-level data:")
races_raw.info()
races_raw.describe(include='all').T

🐎 Horse-level data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171849 entries, 0 to 171848
Data columns (total 27 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   rid           171849 non-null  int64  
 1   horseName     171849 non-null  object 
 2   age           171849 non-null  float64
 3   saddle        171739 non-null  float64
 4   decimalPrice  171849 non-null  float64
 5   isFav         171849 non-null  int64  
 6   trainerName   171849 non-null  object 
 7   jockeyName    171849 non-null  object 
 8   position      171849 non-null  int64  
 9   positionL     146630 non-null  object 
 10  dist          129373 non-null  object 
 11  weightSt      171849 non-null  int64  
 12  weightLb      171849 non-null  int64  
 13  overWeight    3184 non-null    float64
 14  outHandicap   3221 non-null    float64
 15  headGear      63377 non-null   object 
 16  RPR           153198 non-null  float64
 17  TR            108282 non-nul

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
rid,17307.0,,,,117783.64159,44869.585137,3.0,100007.5,129795.0,151547.0,172742.0
course,17307.0,276.0,Wolverhampton (AW),593.0,,,,,,,
time,17307.0,452.0,02:00,310.0,,,,,,,
date,17307.0,365.0,19/05/11,99.0,,,,,,,
title,17307.0,14363.0,Betway Handicap,96.0,,,,,,,
rclass,10082.0,7.0,Class 4,3133.0,,,,,,,
band,7212.0,138.0,0-75,607.0,,,,,,,
ages,17307.0,24.0,4yo+,5445.0,,,,,,,
distance,17307.0,60.0,1m,2008.0,,,,,,,
condition,17306.0,19.0,Good,4862.0,,,,,,,


---

## ✅ Step 2 – Schema Overview: Summary

📊 **Horse-level data (`horses_raw`)**
- Structure is consistent: 171,849 rows × 27 columns
- No major encoding or formatting issues
- Key columns like `position`, `decimalPrice`, `age`, `trainerName`, and `jockeyName` are present
- Some fields have significant missing values:
  - `positionL`, `dist`, `RPR`, `TR`, `OR`, and `headGear` have partial coverage
  - `overWeight` and `outHandicap` are sparsely populated and may need justification to include
- Data types are sensible (mostly floats, ints, strings)

📊 **Race-level data (`races_raw`)**
- 17,307 races with 18 fields
- Some columns (like `rclass`, `band`, `hurdles`) are missing values — likely due to international entries
- All races include basic metadata: `course`, `date`, `time`, `distance`, `condition`, `prizes`, `countryCode`
- Not all races are from the UK — e.g., some rows list `HK`, `IE`, etc.  
  🔍 We may need to filter to `countryCode == 'GB'` later depending on our modelling scope

This gives us a clear picture of the dataset's shape and reliability. In the next steps, we’ll validate key columns individually.


## 🔍 Step 3 – Data Integrity Check: `rid` (Race ID)

The `rid` column (race ID) is the key link between the `horses_raw` and `races_raw` tables.

We need to confirm that:

- Every `rid` in `horses_raw` exists in `races_raw` (`foreign key check`)
- Each `rid` in `races_raw` is unique (one row per race)
- There are no missing, null, or malformed values
- The data type is suitable for joining (typically `int64`)

This step ensures our joins will be reliable when we merge race-level info into horse-level data.


In [9]:
# Step 3 – Data Integrity Check: `rid`

# Check data types and nulls
print("🐎 horses_raw['rid'] type:", horses_raw['rid'].dtype)
print("🏁 races_raw['rid'] type:", races_raw['rid'].dtype)
print("🐎 Nulls in horses_raw['rid']:", horses_raw['rid'].isnull().sum())
print("🏁 Nulls in races_raw['rid']:", races_raw['rid'].isnull().sum())

# Check uniqueness
print("🏁 Unique race IDs in races_raw:", races_raw['rid'].nunique())
print("🏁 Total rows in races_raw:", len(races_raw))

# Foreign key check: Are all horse race IDs present in races?
missing_race_ids = set(horses_raw['rid']) - set(races_raw['rid'])
print("❓ Race IDs in horses_raw not found in races_raw:", len(missing_race_ids))


🐎 horses_raw['rid'] type: int64
🏁 races_raw['rid'] type: int64
🐎 Nulls in horses_raw['rid']: 0
🏁 Nulls in races_raw['rid']: 0
🏁 Unique race IDs in races_raw: 17307
🏁 Total rows in races_raw: 17307
❓ Race IDs in horses_raw not found in races_raw: 0


✅ `rid` integrity checks passed:

- Both `horses_raw['rid']` and `races_raw['rid']` are of type `int64`
- No missing values in either table
- All race IDs in `horses_raw` are found in `races_raw` (0 missing)
- All rows in `races_raw` have unique race IDs (`nunique` = total rows)

This confirms that `rid` is a valid primary key in `races_raw` and a valid foreign key in `horses_raw`.  
We can safely use this column for joining additional race-level features into the horse-level dataset.


## 🎯 Step 4 – Data Integrity Check: `position` (Target)

The `position` column represents each horse's finishing place in a race. This is the **target variable** for most modelling approaches.

We need to confirm:

- There are no missing values (or clearly explain any that exist)
- Values are numeric and represent actual finishing positions
- There are no placeholders (e.g. "PU", "UR", "WD", "40") mixed into the column
- The data type is suitable for ranking, comparison, or conversion to classification targets

This step won’t apply transformations yet — we’re simply checking what we have, and whether it’s usable.


In [10]:
# Step 4 – Data Integrity Check: `position`

# Basic data type and null check
print("Data type:", horses_raw['position'].dtype)
print("Null values:", horses_raw['position'].isnull().sum())

# Unique values (sorted for readability)
print("\nUnique values in 'position' column:")
print(sorted(horses_raw['position'].dropna().unique()))

# Frequency of each value (to check for odd entries)
print("\nValue counts:")
print(horses_raw['position'].value_counts().sort_index())


Data type: int64
Null values: 0

Unique values in 'position' column:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 40]

Value counts:
position
1     17260
2     17233
3     17148
4     16845
5     16109
6     14809
7     13194
8     11426
9      9639
10     7979
11     6473
12     5108
13     3607
14     2617
15     1547
16     1085
17      590
18      389
19      224
20      154
21      113
22       90
23       62
24       44
25       25
26       15
27       13
28        9
29        5
30        4
40     8033
Name: count, dtype: int64


✅ The `position` column is present and contains no missing values.

However, we’ve identified a notable quirk:

- All values are integers (✅ good for modelling)
- The range is mostly what we expect: **1 to 30**
- But there's a spike at `40` — which is not a real finishing position

🔎 `40` appears **8,033 times**, suggesting it’s being used as a **placeholder** (likely for runners who did not finish).

We’ll need to confirm this using documentation or contextual clues (e.g. number of runners, external race results).

📌 For now, we will **not transform this column**, but we flag `40` as a likely non-finish marker to be handled in a future step.


---

✅ **Decision: Handle `position == 40` as a placeholder**

To preserve the integrity of the `position` column as a numeric field — while correctly marking non-finishers or invalid results:

- We will replace all instances of `position == 40` with `NaN`
- This avoids falsely implying a horse finished in 40th place
- The column will become `float64` (as pandas promotes numeric columns with missing values)

This approach ensures future modelling pipelines stay clean, numerically safe, and semantically accurate.

🛠️ This transformation will be applied in a later cleaning step — for now, we’ve just identified the issue.


## 💸 Step 5 – Data Integrity Check: `decimalPrice` (Actually Implied Probability)

At first glance, the `decimalPrice` column appears to contain decimal bookmaker odds.  
However, according to the dataset’s documentation:

> **decimalPrice = 1 / Decimal Odds**  
> (i.e. this field is already the **implied probability**, not the raw odds)

This means values close to 1 represent strong favourites, while very small values (near 0) represent longshots.

In this step, we’ll:

- Confirm the values are within the valid range [0, 1]
- Check for zeros, nulls, or outliers
- Review basic distribution to confirm it matches expectations for implied probability

If confirmed, we’ll **rename the column to `implied_prob`** to improve clarity for all downstream steps.


In [12]:
# Step 5 – Confirm `decimalPrice` represents implied probability

print("Data type:", horses_raw['decimalPrice'].dtype)
print("Null values:", horses_raw['decimalPrice'].isnull().sum())

# Check value range
print("Values <= 0:", (horses_raw['decimalPrice'] <= 0).sum())
print("Values > 1:", (horses_raw['decimalPrice'] > 1).sum())

# Summary stats
print("\nSummary statistics:")
print(horses_raw['decimalPrice'].describe())

# Top and bottom 5
print("\nTop 5 implied probabilities:")
print(horses_raw['decimalPrice'].sort_values(ascending=False).head())

print("\nBottom 5 implied probabilities:")
print(horses_raw['decimalPrice'].sort_values().head())


Data type: float64
Null values: 0
Values <= 0: 0
Values > 1: 0

Summary statistics:
count    171849.000000
mean          0.120026
std           0.118543
min           0.001767
25%           0.038462
50%           0.083333
75%           0.163934
max           0.961538
Name: decimalPrice, dtype: float64

Top 5 implied probabilities:
156790    0.961538
44209     0.961538
154008    0.961538
147691    0.961538
117495    0.952381
Name: decimalPrice, dtype: float64

Bottom 5 implied probabilities:
40793     0.001767
59681     0.001767
76062     0.001996
114180    0.001996
168344    0.001996
Name: decimalPrice, dtype: float64


---

## ✅ Step 5 – Summary: `decimalPrice` is Implied Probability

Our results confirm the dataset documentation:

- All values are between 0 and 1
- There are no missing or zero entries
- Distribution is consistent with market-implied win probabilities:
  - Mean ~0.12 (avg win chance per horse in typical field size)
  - Max ~0.96 (strong favourite)
  - Min ~0.0018 (extreme outsider)

🧠 Since this column already contains **implied probability**, not raw decimal odds, we’ll rename it for clarity when we do our transforms.

---

✅ **Decision**: Rename `decimalPrice` → `implied_prob`

This avoids confusion in future steps and makes the feature’s role explicit when modelling.


## 🐴 Step 6 – Data Integrity Check: `age`

The `age` column gives the age of each horse in years.

This is a fundamental modelling feature — older or younger horses perform differently across race types, distances, and classes.

In this step, we’ll:

- Confirm the column is numeric
- Check for missing values
- Review the distribution and range (e.g. are there outliers?)
- Validate that values fall within expected bounds for flat racing (typically 2–10 years old)

Any unexpected ages (e.g. 15+) may need investigation or exclusion.


In [14]:
# Step 6 – Data Integrity Check: `age`

# Basic checks
print("Data type:", horses_raw['age'].dtype)
print("Null values:", horses_raw['age'].isnull().sum())

# Unique values
print("Unique age values:", sorted(horses_raw['age'].unique()))

# Value counts
print("\nAge distribution:")
print(horses_raw['age'].value_counts().sort_index())



Data type: float64
Null values: 0
Unique age values: [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

Age distribution:
age
2.0     17960
3.0     38253
4.0     32792
5.0     27323
6.0     21572
7.0     13224
8.0      8696
9.0      5614
10.0     3436
11.0     1806
12.0      784
13.0      270
14.0      100
15.0        7
16.0       12
Name: count, dtype: int64


#### ✅ Summary – Step 6: `age` Column

- All entries in the `age` column are present and numeric (`float64`).
- Age values range from 2.0 to 16.0.
- The distribution is realistic, with the majority of horses aged 2–8.
- A small number of entries exist for horses aged 15 (7 entries) and 16 (12 entries).

While this initially appeared unusual, we verified real-world examples with a quick google serach — including *Megalala*, a 16-year-old flat racer — confirming that such cases, though rare, are valid.

✅ The `age` column is clean and trustworthy, and all values will be retained.


In [None]:
## 🏇 Step 7 – Data Integrity Check: `trainerName` and `jockeyName`

These two columns identify the trainer and jockey responsible for each runner.

They’re likely to be used as **categorical modelling features**, so we need to:

- Confirm there are no missing values
- Check for inconsistencies like:
  - Leading/trailing whitespace
  - Duplicate entries due to inconsistent capitalisation or spacing
  - Rare or misspelled names

We won’t encode anything yet — we’re simply checking whether the fields are clean and reliable.


In [16]:
# Step 7 – Check trainerName and jockeyName

# Null checks
print("Null values – trainerName:", horses_raw['trainerName'].isnull().sum())
print("Null values – jockeyName:", horses_raw['jockeyName'].isnull().sum())

# Uniqueness and value counts
print("\nUnique trainers:", horses_raw['trainerName'].nunique())
print("Unique jockeys:", horses_raw['jockeyName'].nunique())

print("\nTop 10 most frequent trainers:")
print(horses_raw['trainerName'].value_counts().head(10))

print("\nTop 10 most frequent jockeys:")
print(horses_raw['jockeyName'].value_counts().head(10))

# Whitespace check
has_trainer_whitespace = horses_raw['trainerName'].str.strip().ne(horses_raw['trainerName'])
has_jockey_whitespace = horses_raw['jockeyName'].str.strip().ne(horses_raw['jockeyName'])

print("\nTrainer names with leading/trailing whitespace:", has_trainer_whitespace.sum())
print("Jockey names with leading/trailing whitespace:", has_jockey_whitespace.sum())


Null values – trainerName: 0
Null values – jockeyName: 0

Unique trainers: 3991
Unique jockeys: 2929

Top 10 most frequent trainers:
trainerName
Gordon Elliott            1584
Richard Fahey             1535
Mark Johnston             1495
Joseph Patrick O'Brien    1482
Tim Easterby              1307
Richard Hannon            1297
David O'Meara              977
W P Mullins                967
Michael Appleby            936
A P O'Brien                931
Name: count, dtype: int64

Top 10 most frequent jockeys:
jockeyName
Luke Morris           1267
David Probert         1211
Oisin Murphy          1190
Tom Marquand          1061
Ben Curtis             978
Maxime Guyon           910
P J McDonald           907
Cristian Demuro        888
Silvestre De Sousa     875
Richard Johnson        842
Name: count, dtype: int64

Trainer names with leading/trailing whitespace: 0
Jockey names with leading/trailing whitespace: 0


---

## ✅ Step 7 – Summary: `trainerName` and `jockeyName` Integrity

- ✅ No missing values in either column
- ✅ All names appear well-formatted with no leading/trailing whitespace
- ✅ The number of unique trainers (3,991) and jockeys (2,929) aligns with expectations for a full season
- ✅ Most frequent names are legitimate high-volume professionals (e.g. Gordon Elliott, Richard Fahey, Luke Morris, Oisin Murphy)

No cleaning is required at this stage.  
These columns are ready for future encoding when we begin feature engineering.

📌 Note: We may choose to limit to the top N trainers/jockeys or encode based on win rates later — but for now, the raw values are clean and trustworthy.


## 📈 Step 8 – Data Integrity Check: `RPR`, `TR`, `OR` (Horse Ratings)

These columns represent industry-standard performance ratings:

- `RPR` – Racing Post Rating (performance-based)
- `TR`  – Timeform Rating (independent performance-based)
- `OR`  – Official Rating (assigned by the BHA)

These features may be powerful predictors — but they can also be incomplete.

In this step, we’ll:

- Check data types and missing value counts
- Review ranges and distribution
- Identify whether values are plausible (e.g. not negative or extreme)

We won’t clean or impute anything yet — the goal is to assess coverage and quality.


In [17]:
# Step 8 – Data Integrity Check: RPR, TR, OR

ratings = ['RPR', 'TR', 'OR']

# Check null counts and types
for col in ratings:
    print(f"{col} – Nulls:", horses_raw[col].isnull().sum())
    print(f"{col} – Type:", horses_raw[col].dtype)

# Describe the distributions
print("\nSummary statistics:")
print(horses_raw[ratings].describe().T)

RPR – Nulls: 18651
RPR – Type: float64
TR – Nulls: 63567
TR – Type: float64
OR – Nulls: 69860
OR – Type: float64

Summary statistics:
        count       mean        std  min   25%   50%   75%    max
RPR  153198.0  74.054746  28.783986  1.0  54.0  73.0  94.0  181.0
TR   108282.0  48.638204  27.272410  1.0  28.0  46.0  65.0  177.0
OR   101989.0  81.516311  26.270792  1.0  61.0  77.0  99.0  177.0


---

## ✅ Step 8 – Summary: Ratings Columns (`RPR`, `TR`, `OR`)

These numeric columns provide important performance indicators — but they are only **partially complete**:

### 📊 Missing Values:
- `RPR` missing in ~11% of rows (18,651 / 171,849)
- `TR` missing in ~37%
- `OR` missing in ~41%

This level of missingness is **expected**:
- Not all horses receive every rating, especially in lower-tier races or with newcomers
- `OR` is often absent for unrated or debut runners
- `TR` and `RPR` are not always assigned consistently in foreign races or minor events

### ✅ Value Ranges Look Plausible:
- All ratings are positive, with no corrupt or extreme outliers
- Mean/median values align with published norms (e.g. `OR` around


## 🏁 Step 10 – Quick Audit of Race-Level Context (`races_raw`)

Although this notebook focuses primarily on horse-level data, we briefly audit select race-level columns to understand the structure of our dataset and support future filtering — especially for high-profile races like the Epsom Derby.

We’ll check:

- `distance`: To identify races of Derby length (about 1m4f or 2400m)
- `class`: To help filter top-tier races
- `course`: So we can isolate races run at Epsom
- `condition`: (going) Optional, but potentially useful for feature analysis

This helps ensure we have the metadata we’ll need to subset and model specific race types.


In [19]:
# Step 10 – Audit selected race-level fields

fields_to_check = ['distance', 'class', 'course', 'condition']

for col in fields_to_check:
    print(f"\n🔍 {col} – Nulls:", races_raw[col].isnull().sum())
    print(f"🔍 {col} – Unique values:")
    print(races_raw[col].value_counts().head(10))



🔍 distance – Nulls: 0
🔍 distance – Unique values:
distance
1m      2008
6f      1919
7f      1805
2m      1187
5f      1111
1m2f    1080
1m4f     847
2m4f     630
2m½f     462
3m       441
Name: count, dtype: int64

🔍 class – Nulls: 0
🔍 class – Unique values:
class
0    7225
4    3133
5    2872
6    1597
3    1247
2     694
1     502
7      37
Name: count, dtype: int64

🔍 course – Nulls: 0
🔍 course – Unique values:
course
Wolverhampton (AW)    593
Sha Tin (HK)          513
Kempton (AW)          482
Chelmsford (AW)       475
Lingfield (AW)        371
Newcastle (AW)        371
Deauville (FR)        365
Chantilly (FR)        359
Auteuil (FR)          307
Happy Valley (HK)     292
Name: count, dtype: int64

🔍 condition – Nulls: 1
🔍 condition – Unique values:
condition
Good                4862
Standard            2818
Good To Firm        1959
Soft                1901
Good To Soft        1714
Heavy               1042
Standard To Slow     678
Fast                 501
Very Soft            405

---

## ✅ Step 10 – Summary: Race-Level Fields for Filtering

We've confirmed that key race-level columns are well-formed and available for future filtering:

### 🏇 `distance`
- No missing values
- Most common distances include 5f–2m, as expected
- **1m4f** (Epsom Derby distance) occurs in **847 races**, giving us a good sample for analysis

### 🏆 `class`
- No missing values
- Class 0 = unclassified races (often international or non-handicaps)
- **Classes 1–3** represent higher-tier races suitable for Derby-like modelling

### 📍 `course`
- No missing values
- Frequent venues include Wolverhampton, Sha Tin, and other UK, FR, and HK tracks
- We can filter by `'Epsom'` in this column to locate Derby races directly

### 🌱 `condition` (Going)
- Only 1 missing value — likely a minor omission in source data (We'll retain this race and treat the condition as missing if needed)
- Values include `"Good"`, `"Soft"`, `"Standard"`, etc.
- Clean and usable if we want to explore how ground affects outcomes

---

✅ These race-level fields are ready to support downstream filtering for:
- Derby-specific modelling (class, distance, Epsom course)
- Broader race-type segmentation (e.g. sprints vs stayers)


---

## 🧼 Data Integrity Check – Complete

This notebook has completed a practical audit of the raw data used in our horse racing prediction project.

We’ve:

- 🧾 Loaded and reviewed the structure of two key 2019 datasets (`races_raw`, `horses_raw`)
- 🔍 Investigated and documented key columns relevant to our modelling goal
- 🧠 Cross-referenced ambiguous fields with external documentation (e.g. `decimalPrice`)
- 🛠️ Identified placeholders (like `position == 40`) and confirmed safe handling strategies
- 📊 Audited value ranges, formats, and missingness for numeric and categorical fields
- 🧩 Ensured join keys (like `rid`) are valid and consistent
- 🏇 Briefly assessed race-level context for filtering future subsets (e.g. Epsom Derby races)

We avoided over-processing at this stage. Our goal was **understanding, not transforming**.

---

## ✅ Outcome

We now trust the dataset enough to:

- Begin filtering and transforming fields with purpose
- Build a clean, processed dataset in the next notebook
- Start feature engineering and modelling with confidence

This notebook sets the foundation for everything that follows.

