# 🧹 Data Integrity Checkpoint (2019 Horse Racing Data)

This is the first notebook in our horse racing modelling project.

We’re using historical UK race data to learn how to build our own predictive models — the same way bookmakers, syndicates, and serious bettors do.

But before we do any analysis or modelling, we need to ask:

> 🧠 Can we trust this data?

---

## 📦 What’s This Dataset?

This dataset was published on Kaggle and contains UK and Irish horse racing data from **1990 to 2020**, across two files per year:
- `races_YEAR.csv` — one row per race
- `horses_YEAR.csv` — one row per horse in a race

Each horse entry includes odds, trainer, age, and finishing position.  
Each race entry includes track, distance, conditions, and prize money.

---

## 📍 Why Are We Using It?

- 🆓 It’s free and publicly available  
- 🧠 It’s ideal for learning and prototyping  
- 📊 It includes real-world race outcomes and betting prices  
- 🏇 It’s detailed enough to build serious models — but simple enough for newcomers

We’ve chosen to start with **just 2019** because:
- It’s the last full "normal" season before COVID-19 disrupted the sport
- It’s large enough to explore a wide range of race types
- We can expand later if needed

---

## 🧭 Why This Still Matters (Even If It’s Old)

Although this dataset stops in 2020, it’s perfect for learning:

- We can explore patterns in elite races (like the Epsom Derby)
- We can practice building models using real bookmaker odds and outcomes
- We can learn which features matter — and which don’t

Once we understand how to clean, analyse, and model this historical data, we’ll be better equipped to:
- Scrape or buy live racecards for upcoming races
- Apply our models to races happening today or in the future
- Extend this project into other sports or betting formats

This is our **training ground** — not the final product.

---

## 📊 Data Integrity Notebook – Project Scope & Purpose

### 🔍 What This Notebook Will Do

This notebook is **not about modelling**.

It’s about building **trust** in the raw data — understanding:
- ✅ What the dataset actually contains (and what it doesn’t)
- ⚠️ What might be broken, missing, or misleading
- 🛠️ What needs to be cleaned, fixed, or excluded before modelling

It’s a **prerequisite** for every notebook that follows.

---

### 📌 Why This Matters

We don’t want to feed noisy, misunderstood, or biased data into a model — especially when working on a high-stakes goal like predicting the **2025 Epsom Derby**.

Before we train or test anything, we want confidence that:
- Column meanings are understood
- Errors and placeholders are flagged or handled
- We’ve focused on the fields relevant to our problem
- The data makes sense in the context of UK flat racing

---

### ❓ Questions This Notebook Begins to Answer

- What columns are usable right now, and which need more work?
- What other data would we need to model specific races like the Derby?
- What’s missing from this dataset (e.g. sectional times, pace, Betfair exchange data)?
- When might scraping or paid APIs be required?

---

### 🔦 Scope of Checks

We focus on columns that are **directly relevant** to our modelling goal:
- Target variable: `position`
- Inputs like `implied_prob`, `age`, ratings (`RPR`, `TR`, `OR`)
- Identifiers used in joins or aggregation (`rid`, `horseName`)
- Key categorical factors we may later encode (e.g. trainer, jockey)

Other fields will be reviewed only **if and when** they become useful in later notebooks.


### 📘 On Understanding Column Definitions

Where possible, we have **cross-referenced column meanings** with the dataset’s documentation (or source notes) to ensure accurate interpretation.

We avoid guessing the meaning of ambiguous fields — especially those that could be mislabelled, translated, or context-specific to racing terminology.

If documentation is missing or unclear, we either:
- Investigate further using data patterns and external examples
- Flag the field for deferred use until its meaning is verified

This is essential for preventing bad assumptions and maintaining the integrity of our modelling pipeline.

## 📥 Step 1: Load the Raw CSV Files

We’re starting with two files:

- `races_2019.csv` — one row per race (course, date, distance, going, prize)
- `horses_2019.csv` — one row per runner in a race (horse name, odds, position)

We’ll begin by loading both and checking their shapes.

This gives us a basic sense of scale — how many races, how many horses — and acts as a first **checkpoint** before we merge or explore anything further.


In [2]:
import pandas as pd

# Load the raw CSVs
races = pd.read_csv("data/races_2019.csv")
horses = pd.read_csv("data/horses_2019.csv")

# Show shapes of both datasets
print("📁 Races dataset shape:", races.shape)
print("🐎 Horses dataset shape:", horses.shape)


📁 Races dataset shape: (17307, 18)
🐎 Horses dataset shape: (171849, 27)


### 📊 First Sanity Check: Do These Numbers Make Sense?

- The `races_2019.csv` file contains **17,307 races**
- The `horses_2019.csv` file contains **171,849 rows**, one per runner (a horse that races several times appears several times)

That means we have, on average:

$$
\frac{171,\!849\ \text{runners}}{17,\!307\ \text{races}} \approx 9.9\ \text{runners per race}
$$

This is in line with what we expect:
- Most UK races have **8–14 runners**, depending on race type, distance, and class
- Some elite races (like the Derby) have large fields (16–20), but many everyday races have smaller ones (5–10)

So these totals are consistent and suggest the dataset is **plausible** at the top level. An average alone can't prove completeness, so the later steps check the details.
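
The average hides the shape of the field-size distribution. As a minimal sketch, assuming each row of the horses table is one runner keyed by `rid`, we could count rows per race and look at the spread. The small frame below is made-up data standing in for `horses`, and the thresholds are illustrative assumptions rather than racing rules.

```python
import pandas as pd

# Made-up stand-in for the horses table: one row per runner, keyed by race ID.
sample_horses = pd.DataFrame({
    "rid": [101, 101, 101, 102, 102, 103, 103, 103, 103],
    "horseName": list("ABCDEFGHI"),
})

# Rows per race = field size for that race.
field_sizes = sample_horses.groupby("rid").size()

# Summary of the distribution (min, median, max) rather than just the mean.
print(field_sizes.describe())

# Races with implausibly small or large fields deserve a closer look.
# These thresholds are illustrative assumptions only.
suspicious = field_sizes[(field_sizes < 2) | (field_sizes > 40)]
print("Suspicious field sizes:", len(suspicious))
```

On the real data, swapping `sample_horses` for `horses` would show whether the ~9.9 average comes from a sensible spread or from a few extreme races.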


## 🔍 Step 2: Check for Missing Values

Before we merge or model, we need to understand how clean the raw data is.

We’ll look at each column in both datasets and count how many values are missing (`NaN` or empty cells). This helps us decide:

- Which columns are safe to use as-is
- Which ones might need cleaning, fixing, or removing
- Whether any essential information (like track, result, or odds) is incomplete


In [5]:
# Show count of missing values in each column
print("📁 Races: Missing values")
print(races.isnull().sum())

print("\n🐎 Horses: Missing values")
print(horses.isnull().sum())


📁 Races: Missing values
rid                0
course             0
time               0
date               0
title              0
rclass          7225
band           10095
ages               0
distance           0
condition          1
hurdles        12644
prizes             0
winningTime        0
prize              2
metric             0
countryCode        0
ncond              0
class              0
dtype: int64

🐎 Horses: Missing values
rid                  0
horseName            0
age                  0
saddle             110
decimalPrice         0
isFav                0
trainerName          0
jockeyName           0
position             0
positionL        25219
dist             42476
weightSt             0
weightLb             0
overWeight      168665
outHandicap     168628
headGear        108472
RPR              18651
TR               63567
OR               69860
father               0
mother               0
gfather            170
runners              0
margin               0
weight 

### 🧼 What’s Missing — and What It Means

Let’s break down what the missing values tell us:

---

#### 📁 `races_2019.csv` (Race-level data)

| Column         | Missing | Notes |
|----------------|---------|-------|
| `rclass`       | 7,225   | Race class text (e.g. “Group 1”) — not critical since we have numeric `class` |
| `band`         | 10,095  | Often empty in higher-level races — not essential |
| `hurdles`      | 12,644  | Most races don’t have jumps — this is fine and expected |
| `condition`    | 1       | Only one missing going/ground value — we can handle it easily |
| `prize`        | 2       | Minor — we’re not modelling on prize money directly |

✅ **No missing values** in `rid`, `course`, `date`, `distance`, `class`, or `metric`.

---

#### 🐎 `horses_2019.csv` (Horse-level data)

| Column         | Missing | Notes |
|----------------|---------|-------|
| `saddle`       | 110     | Minor — not useful for us right now |
| `positionL`    | 25,219  | Labelled version of `position` (e.g. “PU”, “F”) — might help with DNFs |
| `dist`         | 42,476  | Distance behind winner — many missing, might not be reliable |
| `overWeight`, `outHandicap` | 168,665 / 168,628 | Nearly all missing; likely not useful in this dataset |
| `headGear`     | 108,472 | Not used in early models; fine to ignore for now |
| `RPR`, `TR`, `OR` | 18,651 / 63,567 / 69,860 | Performance ratings with substantial gaps; advanced features we may revisit |
| `gfather`      | 170     | Missing only occasionally; not a priority |

✅ **No missing values** in `rid`, `horseName`, `age`, `position`, `decimalPrice` (the odds column), `trainerName`, `jockeyName`, etc., which are our core features.

---

### ✅ Summary

- **Most essential columns are complete**
- **Missing values are expected and manageable**
- We can proceed with merging and structure checks

Later notebooks can selectively ignore or impute these fields based on modelling needs.
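
Raw counts are hard to compare across two tables of very different sizes. As a small sketch, a helper like the hypothetical `missing_report` below (not part of the original notebook) could show each column's missing share as a percentage, worst first. The demo frame is made-up data.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Made-up example frame; in the notebook this would be `races` or `horses`.
demo = pd.DataFrame({
    "rid": [1, 2, 3, 4],
    "OR": [80.0, np.nan, np.nan, 75.0],
    "headGear": [None, None, None, "b"],
})
print(missing_report(demo))
```

Calling `missing_report(horses)` would put columns such as `overWeight` and `outHandicap` at the top, which makes the "nearly all missing" judgement easier to see at a glance.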


## 🔍 Step 3: Data Integrity Checks

Before we can trust our analysis or build any models, we need to verify that the data:

- ✅ Contains what it claims to  
- ✅ Matches expectations for a real racing dataset  
- ✅ Hasn't silently corrupted during import or processing  

These checks prevent **subtle bugs** from derailing our project months down the line.

We’ll start by verifying the **date format**, because a first parse of the `date` column returns years 2001 and 2031 in a file that should only cover 2019.

---

### 📅 Check Date Range (Broken Parse?)

Let’s first load the file *with* automatic date parsing:

In [20]:
races = pd.read_csv("data/races_2019.csv", parse_dates=["date"], dayfirst=True)
print("📅 Min date:", races["date"].min())
print("📅 Max date:", races["date"].max())
print("❓ Null date values:", races["date"].isna().sum())

📅 Min date: 2001-01-19 00:00:00
📅 Max date: 2031-12-19 00:00:00
❓ Null date values: 0



That’s clearly wrong — the file is supposed to be from 2019. So we need to inspect the raw format.


### 🧪 Step 1: Reload Without Parsing

To diagnose the date issue properly, we’ll go back and **reload the raw CSV without automatic date parsing**.

This lets us inspect the original format directly, without Pandas making assumptions.

Why this matters:

- If the format is ambiguous (like `19/01/01`), Pandas might guess incorrectly
- We want to verify whether it’s actually **year/month/day**, **day/month/year**, or something else
- That way, we can manually parse it *correctly and consistently* ourselves

Let’s look at the first few rows to see what we’re working with.

In [22]:
raw_races = pd.read_csv("data/races_2019.csv")
raw_races["date"].head(10)

0    19/01/01
1    19/01/01
2    19/01/01
3    19/01/01
4    19/01/01
5    19/01/01
6    19/01/01
7    19/01/01
8    19/01/01
9    19/01/01
Name: date, dtype: object

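That looks like YY/MM/DD. With `dayfirst=True`, Pandas read it as DD/MM/YY instead: `19/01/01` became 19 Jan 2001 and `19/12/31` became 19 Dec 2031, which is exactly the bad range we saw earlier.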
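
To make the misreading concrete, here is a small sketch that parses two strings in the raw layout under both interpretations. The two sample strings are chosen for illustration (the first and last day of 2019), not read from the file.

```python
import pandas as pd

# Two strings in the raw layout: first and last day of 2019.
samples = pd.Series(["19/01/01", "19/12/31"])

# Reading the fields as day/month/year reproduces the bad range.
as_dmy = pd.to_datetime(samples, format="%d/%m/%y")

# Reading them as year/month/day gives the dates we expect.
as_ymd = pd.to_datetime(samples, format="%y/%m/%d")

print(as_dmy.dt.strftime("%Y-%m-%d").tolist())  # ['2001-01-19', '2031-12-19']
print(as_ymd.dt.strftime("%Y-%m-%d").tolist())  # ['2019-01-01', '2019-12-31']
```

The day/month/year reading gives the same 2001 and 2031 dates as the broken parse, which supports the idea that field order, not corrupt data, is the problem.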

### 🧪 Step 2: Confirm Format (Month Range)

We can’t just assume the format is `YY/MM/DD` — we need to test it.

🔍 Why this matters:
- If the **middle part** is ever greater than 12, it can’t be a month
- That would point to a different layout (e.g. `YY/DD/MM`)

🧪 Strategy:
1. Convert the `date` column to strings (if not already)
2. Slice out the **middle two characters** with `str[3:5]` (zero-based positions 3 and 4)
3. Convert those to integers
4. Flag any rows where that value is greater than 12

If none exist, the middle field is consistent with a month. Together with the first field being `19` in the rows shown above, that supports:  
✅ Format is `YY/MM/DD` (e.g. `19/01/01` = 1st Jan 2019)



In [24]:
# Convert date column to string (in case it's datetime already)
raw_dates = raw_races["date"].astype(str)

# Extract the middle two characters (presumed month)
middle_values = raw_dates.str[3:5]

# Check for any invalid month values (greater than 12)
invalid_months = raw_dates[middle_values.astype(int) > 12]
invalid_months


Series([], Name: date, dtype: object)
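
The month check alone cannot tell `YY/MM/DD` apart from `DD/MM/YY`, since both put the month in the middle. A slightly stricter sketch could test all three fields at once. The helper name `check_yy_mm_dd` and the demo values are hypothetical, and the function assumes every value has exactly three `/`-separated numeric parts.

```python
import pandas as pd

def check_yy_mm_dd(dates: pd.Series, expected_year: int = 19) -> dict:
    """Summarise whether every value fits a YY/MM/DD layout for one year."""
    parts = dates.astype(str).str.split("/", expand=True).astype(int)
    parts.columns = ["first", "middle", "last"]
    return {
        # The first field should be the two-digit year on every row.
        "first_is_year": bool((parts["first"] == expected_year).all()),
        # The middle field must be a valid month.
        "middle_is_month": bool(parts["middle"].between(1, 12).all()),
        # The last field must be a valid day of the month.
        "last_is_day": bool(parts["last"].between(1, 31).all()),
        # Values above 12 in the last field rule out a month there.
        "last_exceeds_12": bool((parts["last"] > 12).any()),
    }

# Made-up values in the same layout as the raw column.
demo_dates = pd.Series(["19/01/01", "19/06/15", "19/12/31"])
print(check_yy_mm_dd(demo_dates))
```

If all four flags come back `True` for `raw_races["date"]`, the `YY/MM/DD` reading rests on more than the month range alone.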

### 🧪 Step 3: Parse Dates Explicitly (Now That We’re Confident)

Now that we’ve confirmed the format is `YY/MM/DD` — and there are no invalid month values — we can **safely re-parse the column** using the correct format string.

Why not rely on automatic parsing?

- The earlier parse (with `dayfirst=True`) read the two-digit year as the day
- When a format is ambiguous, it's safer to state it explicitly than to let the parser infer it

We’ll now use:

```python
format="%y/%m/%d"
```

That ensures dates like 19/01/01 are interpreted as 2019-01-01 — not 2001 or 2031.

In [28]:
# Parse the date column correctly using explicit format
races["date"] = pd.to_datetime(raw_races["date"], format="%y/%m/%d", errors="raise")

# Check min/max again to confirm it's fixed
print("📅 Min date:", races["date"].min())
print("📅 Max date:", races["date"].max())

📅 Min date: 2019-01-01 00:00:00
📅 Max date: 2019-12-31 00:00:00


---

### 🧪 Step 4: Validate Decimal Odds (`decimalPrice`)

The `decimalPrice` column represents each horse’s **starting price (SP)** — i.e. the odds at which they started the race.

📌 Why this matters:
- These prices reflect the **market’s belief** about a horse’s chance of winning
- We’ll convert them into **implied probabilities** later for modelling and analysis
- If they’re missing, zero, or negative, it breaks everything

🎯 What we’re checking:
1. Are any odds **missing or null**?
2. Are there any **zero or negative values** (which would be invalid)?
3. Are any values **suspiciously low or high**?

Let’s inspect the data now.


In [31]:
# Load horses data if not already
horses = pd.read_csv("data/horses_2019.csv")  # same path as the Step 1 load

# Check for missing or invalid decimal odds
print("❓ Missing odds:", horses["decimalPrice"].isna().sum())
print("❗ Zero or negative odds:", (horses["decimalPrice"] <= 0).sum())

# Check overall range of values
print("📊 Odds range:")
print("Min:", horses["decimalPrice"].min())
print("Max:", horses["decimalPrice"].max())

# Optional: View very high odds (e.g. 100+)
horses[horses["decimalPrice"] > 100].sort_values("decimalPrice", ascending=False).head()

❓ Missing odds: 0
❗ Zero or negative odds: 0
📊 Odds range:
Min: 0.0017667844522968
Max: 0.9615384615384616


Empty DataFrame (no rows have `decimalPrice` > 100). Columns:
Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,TR,OR,father,mother,gfather,runners,margin,weight,res_win,res_place


---

### ⚠️ Unexpected Finding: `decimalPrice` May Not Be Odds at All

We assumed the `decimalPrice` column contained **decimal starting prices (SPs)** — i.e. odds like 5.0, 10.0, etc.

But our checks showed:

- ✅ No missing or negative values
- ❌ All values are **less than 1**
- 📉 Minimum: ~0.0017, Maximum: ~0.96

This is **impossible** if they’re decimal odds: decimal odds include the returned stake, so they are always greater than 1.0, even for the shortest-priced favourite.

🧠 **Interpretation:**
These values are likely **already implied probabilities** (i.e. `1 / odds`), but the column was mislabelled as `decimalPrice`.

We’ll test this by calculating the **inverse** of the values and inspecting the results.

If we find values in the expected range (e.g. odds of 3.0 to 100+), that will confirm the mistake.

---


In [32]:
# Quick check: what's the average?
print("📈 Mean decimalPrice:", horses["decimalPrice"].mean())

# What if we convert back to decimal odds?
horses["decimal_odds_estimate"] = 1 / horses["decimalPrice"]
horses["decimal_odds_estimate"].describe()


📈 Mean decimalPrice: 0.12002575950098471


count    171849.000000
mean         22.857245
std          31.782141
min           1.040000
25%           6.100000
50%          12.000000
75%          26.000000
max         566.000000
Name: decimal_odds_estimate, dtype: float64

#### ✅ Confirmed: `decimalPrice` = Implied Probability (Mislabelled)

Upon inspection, we realised the `decimalPrice` column does not contain decimal odds, but rather **implied probabilities** (i.e. `1 / odds` already applied).

To confirm this, we computed inverse values and found:

- Mean odds: ~22.9  
- Median odds: ~12.0  
- Max odds: 566.0  

These are consistent with typical horse racing prices and confirm the column was **mislabelled**.

📌 To avoid confusion, we’ll rename the column in memory to `implied_prob`.

We’ll also run a validation check to ensure it behaves like a proper probability:

- No missing values  
- All values fall in the valid range `0 < p ≤ 1`  
- Distribution and extremes are realistic (e.g. strong favourites have high values)



In [50]:
# Rename and Validate decimalPrice → implied_prob

# Rename column in memory
horses.rename(columns={'decimalPrice': 'implied_prob'}, inplace=True)

# Data type and null check
print("Data type:", horses['implied_prob'].dtype)
print("Null values:", horses['implied_prob'].isnull().sum())

# Range checks
print("\nValues <= 0:", (horses['implied_prob'] <= 0).sum())
print("Values > 1:", (horses['implied_prob'] > 1).sum())

# Summary stats
print("\nSummary statistics:")
print(horses['implied_prob'].describe())

# Optional: look at extremes
print("\nTop 5 implied probabilities:")
print(horses['implied_prob'].sort_values(ascending=False).head())

print("\nBottom 5 implied probabilities:")
print(horses['implied_prob'].sort_values().head())

Data type: float64
Null values: 0

Values <= 0: 0
Values > 1: 0

Summary statistics:
count    171849.000000
mean          0.120026
std           0.118543
min           0.001767
25%           0.038462
50%           0.083333
75%           0.163934
max           0.961538
Name: implied_prob, dtype: float64

Top 5 implied probabilities:
156790    0.961538
44209     0.961538
154008    0.961538
147691    0.961538
117495    0.952381
Name: implied_prob, dtype: float64

Bottom 5 implied probabilities:
40793     0.001767
59681     0.001767
76062     0.001996
114180    0.001996
168344    0.001996
Name: implied_prob, dtype: float64


#### ✅ Summary – Step 4: `implied_prob` Column

- Inverting `decimalPrice` gives realistic odds (mean ~22.9, median ~12.0, max 566.0), so the column holds **implied probabilities** (`1 / odds`), not decimal odds.
- The column is now named `implied_prob` in memory to reflect its actual contents.
- No missing values (`nulls = 0`).
- All values fall in the valid range `0 < p ≤ 1`.
- Mean 0.120, median 0.083, max 0.962 (strong favourite), min 0.0018 (longshot).

✅ This column is clean and ready to use as a key input feature for modelling.
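
One further check could support the implied-probability reading. Bookmaker prices include a margin, so the implied probabilities in a race normally sum to a little more than 1 (the overround). The mean of 0.120 across roughly 9.9 runners per race points to a typical total near 1.2. The sketch below uses made-up data, and the plausibility band is an assumption.

```python
import pandas as pd

# Made-up stand-in for horses after renaming decimalPrice to implied_prob.
demo = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2],
    "implied_prob": [0.50, 0.40, 0.25, 0.70, 0.40],
})

# Sum of implied probabilities per race (the bookmaker's "book").
book = demo.groupby("rid")["implied_prob"].sum()
print(book)

# Flag races whose book falls outside a loose, assumed plausibility band.
odd_books = book[(book < 1.0) | (book > 2.0)]
print("Races with unusual book totals:", len(odd_books))
```

Races with a total well below 1 would suggest missing runners or prices, so any flagged `rid` values are worth inspecting by hand.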


### 🔎 Step 5 – Data Integrity Check: `position` Column

#### 📘 Description:
The `position` column is expected to contain the finishing place of each horse in a given race. For modelling purposes, it is crucial to verify that this column is well-structured and clean.

We aim to check:

- Whether all entries are present (i.e. no missing values).
- Whether the data is in a consistent and expected format.
- Whether there are any non-numeric values or placeholders (e.g. `"PU"`, `"UR"`, `"WD"`, `"DNF"`).
- Whether the data type is appropriate for numerical comparisons.




In [37]:
# Step 5 – Data Integrity Check: position Column

# Basic data type and null check
print("Data type and null count:")
print(horses['position'].dtype)
print("Null values:", horses['position'].isnull().sum())

# Display unique values
print("\nUnique values in 'position' column:")
print(horses['position'].unique())

# Frequency counts of each unique value
print("\nValue counts (including possible edge cases):")
print(horses['position'].value_counts(dropna=False))



Data type and null count:
int64
Null values: 0

Unique values in 'position' column:
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 40 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30]

Value counts (including possible edge cases):
position
1     17260
2     17233
3     17148
4     16845
5     16109
6     14809
7     13194
8     11426
9      9639
40     8033
10     7979
11     6473
12     5108
13     3607
14     2617
15     1547
16     1085
17      590
18      389
19      224
20      154
21      113
22       90
23       62
24       44
25       25
26       15
27       13
28        9
29        5
30        4
Name: count, dtype: int64


#### 🔍 Findings:

- The `position` column contains **no missing values** and is of type `int64`, which is appropriate for ranking or sorting.
- All entries are numeric and positive integers, as expected.
- The vast majority of values fall within the expected range for typical flat races (1st–20th), but there is a **notable spike at position `40`**, which appears **8,033 times**. That is more than any position from 10th onwards, and there are no values at all between 31 and 39.

This strongly suggests that `40` is a **placeholder code** rather than a true finishing position — likely representing horses that did not finish, were withdrawn, or otherwise excluded from the final rankings.

We will investigate the meaning of `40` and decide how to handle it in the next sub-step of the pipeline.


#### ✅ Decision:

To preserve the integrity of the `position` column as a numeric type, while marking invalid results like non-finishers:

- We will replace all instances of `position == 40` with `NaN`.
- This avoids falsely implying a horse finished in 40th place.
- The column will become `float64` (due to how pandas handles missing numeric data).
- This approach keeps modelling pipelines clean and compatible with numeric operations.


In [39]:
import numpy as np

# Replace 40s with NaN
horses['position'] = horses['position'].replace(40, np.nan)


# Confirm changes
print(horses['position'].value_counts(dropna=False))
print(horses.dtypes['position'])  # Should now be float64


position
1.0     17260
2.0     17233
3.0     17148
4.0     16845
5.0     16109
6.0     14809
7.0     13194
8.0     11426
9.0      9639
NaN      8033
10.0     7979
11.0     6473
12.0     5108
13.0     3607
14.0     2617
15.0     1547
16.0     1085
17.0      590
18.0      389
19.0      224
20.0      154
21.0      113
22.0       90
23.0       62
24.0       44
25.0       25
26.0       15
27.0       13
28.0        9
29.0        5
30.0        4
Name: count, dtype: int64
float64


#### ✅ Summary – Step 5

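- We inferred from its frequency (8,033 rows, with nothing recorded between 31 and 39) that `position == 40` is a placeholder for non-finishers. This has not yet been verified against the dataset documentation.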
- All such values have been replaced with `NaN` to make their meaning explicit.
- This preserves the column as a numeric type (`float64`), ensuring compatibility with ranking, sorting, and modelling.
- No valid finisher data was removed — all rows are retained.
- We will treat `NaN` values in `position` as clear indicators of non-finishers in all downstream steps.
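
As a sketch of how downstream notebooks might use this convention, the block below derives a `finished` flag and a binary win target from a position column that already has `NaN` for non-finishers. The column names `finished`, `won`, and `position_int` are hypothetical, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the cleaned horses table (40 already replaced by NaN).
demo = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2],
    "position": [1.0, 2.0, np.nan, 1.0, np.nan],
})

# True where the runner has a recorded finishing position.
demo["finished"] = demo["position"].notna()

# Binary win target; NaN compares as False, so non-finishers count as losers.
demo["won"] = demo["position"].eq(1)

# Optional: a nullable integer dtype keeps whole-number positions with missing values.
demo["position_int"] = demo["position"].astype("Int64")

print(demo)
```

Keeping `finished` as its own column means models can treat "did not finish" as information rather than dropping those rows silently.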


### 🔎 Step 6 – Data Integrity Check: `rid` Column

#### 📘 Description:
The `rid` column links each horse entry to a specific race. It is a **crucial foreign key** for grouping horses into races and merging with race-level data.

We want to ensure:

- Each `rid` exists and is not null.
- There are no obviously invalid values (e.g. negative IDs, duplicates with conflicting context).
- It is consistent across datasets (i.e. `rid` in `horses` must match with those in the `races` table).

This check will focus on uniqueness, completeness, and basic structure.


In [42]:
# Step 6 – Data Integrity Check: rid Column

# Basic info
print("Data type:", horses['rid'].dtype)
print("Null values:", horses['rid'].isnull().sum())

# Unique value count
print("Unique race IDs in horses:", horses['rid'].nunique())

# Check for negative or zero race IDs
print("Race IDs <= 0:", (horses['rid'] <= 0).sum())

# Are all horse race IDs also in races dataset?
missing_in_races = ~horses['rid'].isin(races['rid'])
print("Horse race IDs not found in races table:", missing_in_races.sum())



Data type: int64
Null values: 0
Unique race IDs in horses: 17233
Race IDs <= 0: 0
Horse race IDs not found in races table: 0


#### ✅ Summary – Step 6: `rid` (Race ID) Column

- `rid` is present in all rows (no missing values).
- All values are valid integers (`int64`) and strictly positive.
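- There are 17,233 unique race IDs in the `horses` dataset, against 17,307 rows in `races`. If `rid` is unique in `races`, that leaves 74 races with no runner rows, which is worth checking before merging.
- Every `rid` in `horses` is also present in the `races` dataset, confirming referential integrity in that direction.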

✅ This column is clean and ready for use in grouping, joining, or aggregating horses by race.
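
The check above only looks from `horses` to `races`. A minimal sketch of the reverse direction, using made-up frames in place of the real tables, could confirm that `rid` is unique in `races` and list any races that have no runners.

```python
import pandas as pd

# Made-up stand-ins for the two tables.
demo_races = pd.DataFrame({"rid": [1, 2, 3, 4]})
demo_horses = pd.DataFrame({"rid": [1, 1, 2, 2, 2, 4]})

# Race IDs should be unique in the race-level table.
duplicate_rids = int(demo_races["rid"].duplicated().sum())

# Races that have no runner rows at all.
no_runner_mask = ~demo_races["rid"].isin(demo_horses["rid"])
races_without_runners = demo_races.loc[no_runner_mask, "rid"]

print("Duplicate race IDs in races:", duplicate_rids)
print("Races with no runners:", races_without_runners.tolist())
```

Run against the real `races` and `horses` frames, this would show whether the two tables line up one-to-many in both directions before any merge.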

### 🔎 Step 7 – Data Integrity Check: `horseName` Column

#### 📘 Description:
The `horseName` column provides the unique name of each horse entry. It is critical for:

- Identifying individual horses within and across races.
- Preventing duplicates or confusion when aggregating results or building horse-specific features.

In this step, we will check:

- That all entries are non-null.
- That names appear in a consistent format (e.g. no trailing whitespace, case issues, or strange symbols).
- Whether any unexpected duplicates or anomalies exist.


In [45]:
# Step 7 – Data Integrity Check: horseName Column

# Basic info
print("Data type:", horses['horseName'].dtype)
print("Null values:", horses['horseName'].isnull().sum())

# How many unique horse names?
print("Unique horse names:", horses['horseName'].nunique())

# Most common names (possible duplication or missing disambiguation?)
print("\nMost common horse names:")
print(horses['horseName'].value_counts().head(10))

# Check for leading/trailing whitespace
has_whitespace = horses['horseName'].str.contains(r'^\s+|\s+$')
print("\nHorse names with leading/trailing whitespace:", has_whitespace.sum())


Data type: object
Null values: 0
Unique horse names: 41241

Most common horse names:
horseName
Red Stripes        33
Zapper Cass        28
Celerity           27
Pearl Spectre      27
Catapult           27
Caledonian Gold    27
Contingency Fee    26
Alicia Darcy       26
B Fifty Two        26
Tavener            25
Name: count, dtype: int64

Horse names with leading/trailing whitespace: 0


#### ✅ Summary – Step 7: `horseName` Column

- All rows have a non-null `horseName` value.
- The column is stored as `object` (string), as expected.
- There are 41,241 unique horse names in the dataset.
- A small number of names appear 25+ times — the most frequent being *Red Stripes* (33 entries).
- We manually verified that *Red Stripes* ran 33 times in 2019, confirming the data is accurate and does not reflect duplication or name reuse.
- No names contain leading or trailing whitespace.

✅ The `horseName` column is clean and valid. No further cleaning is required.
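
Manual verification works for one name but does not scale. As a hedged sketch, one way to look for name reuse is to count the distinct pedigrees recorded under each name, assuming `father` and `mother` hold the sire and dam. The names and pedigrees below are made up.

```python
import pandas as pd

# Made-up stand-in: two different horses share the name "Bravo".
demo = pd.DataFrame({
    "horseName": ["Alpha", "Alpha", "Bravo", "Bravo"],
    "father": ["Sire A", "Sire A", "Sire B", "Sire C"],
    "mother": ["Dam A", "Dam A", "Dam B", "Dam C"],
})

# Count distinct (father, mother) pairs recorded under each name.
pedigrees_per_name = (
    demo.drop_duplicates(["horseName", "father", "mother"])
        .groupby("horseName")
        .size()
)

# Names linked to more than one pedigree may refer to different horses.
reused_names = pedigrees_per_name[pedigrees_per_name > 1]
print(reused_names)
```

An empty result on the real data would support using `horseName` alone as a horse identifier. Any hits could be spelling variants in the pedigree fields rather than true name reuse, so they need a manual look.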


### 🔎 Step 8 – Data Integrity Check: `age` Column

#### 📘 Description:
The `age` column represents the age of each horse on race day. This is a critical feature for:

- Filtering races (e.g. age-restricted events like the Derby).
- Analysing horse development and performance over time.
- Engineering meaningful features for modelling (e.g. age-relative performance).

In this step, we will check:

- That all entries are non-null and numeric.
- That all values fall within an expected range (e.g. 2 to 15 for UK flat racing).
- Whether any outliers, inconsistencies, or formatting issues are present.


In [46]:
# Step 8 – Data Integrity Check: age Column

# Basic info
print("Data type:", horses['age'].dtype)
print("Null values:", horses['age'].isnull().sum())

# Unique age values
print("Unique age values:", sorted(horses['age'].unique()))

# Value counts
print("\nAge distribution:")
print(horses['age'].value_counts().sort_index())


Data type: float64
Null values: 0
Unique age values: [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

Age distribution:
age
2.0     17960
3.0     38253
4.0     32792
5.0     27323
6.0     21572
7.0     13224
8.0      8696
9.0      5614
10.0     3436
11.0     1806
12.0      784
13.0      270
14.0      100
15.0        7
16.0       12
Name: count, dtype: int64


#### ✅ Summary – Step 8: `age` Column

- All entries in the `age` column are present and numeric (`float64`).
- Age values range from 2.0 to 16.0.
- The distribution is realistic, with the majority of horses aged 2–8.
- A small number of entries exist for horses aged 15 (7 entries) and 16 (12 entries).

While this initially appeared unusual, we verified real-world examples with a quick Google search, including *Megalala*, a 16-year-old flat racer, confirming that such cases, though rare, are valid.

✅ The `age` column is clean and trustworthy, and all values will be retained.
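
Since `age` is stored as `float64` but appears to hold only whole numbers, a later notebook might want an integer column. A small sketch, on made-up values, checks for fractional ages before converting.

```python
import pandas as pd

# Made-up stand-in for the age column, stored as floats like the real one.
demo_age = pd.Series([2.0, 3.0, 16.0, 5.0], name="age")

# Confirm every value is a whole number before converting.
all_whole = bool((demo_age % 1 == 0).all())
print("All ages are whole numbers:", all_whole)

# Convert to a compact integer dtype only if the check passes.
if all_whole:
    demo_age = demo_age.astype("int8")

print(demo_age.dtype)
```

The guard matters because casting a fractional value to an integer would truncate it silently.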
