# 🧠 Lesson 2: Do Favourites Really Win?

This notebook explores one of the most common beliefs in horse racing:

> 📣 **What people say about favourites in racing:**

> “Favourites win more often, but the odds are too short to be profitable.”  
> — [FlatStats.co.uk](https://www.flatstats.co.uk/horseracing/index.php?cmd=article&id=102)

> “While the favourite has a higher chance of winning, it doesn't mean you will make a profit betting on them.”  
> — [GrandNational.Fans](https://www.grandnational.fans/news/how-often-do-favourites-win/)

> “The favourite-longshot bias... is the empirical finding that bettors tend to overvalue longshots and undervalue favourites.”  
> — [Wikipedia: Favourite–Longshot Bias](https://en.wikipedia.org/wiki/Favourite-longshot_bias)

These are commonly accepted ideas in both betting communities and academic literature. In this notebook, we'll test whether they hold true in **elite UK flat races** like the Epsom Derby, using real data from **Derby-type races** in the UK (Class 1, 3yo, ~2400m, flat, 2019).

---

## 🎯 Goal

Quantify how often favourites (and other top-ranked horses by odds) actually win in these races.

---

## 🧪 We'll Explore

- ✅ **Favourite Win Rate** – how often does the shortest-priced horse win?
- 🥈 **2nd & 3rd Favourites** – how do they compare?
- 💰 **Odds vs Outcome** – does a lower price actually mean higher win probability?
- 🔍 **Upsets** – how often does a horse with long odds win?
- 📉 **Any patterns?** – e.g. field size, going, or class?

---

## 💡 Why It Matters

Understanding how favourites perform:
- Sets a **baseline** for our future betting models
- Helps identify **value opportunities**
- Tests whether the market (odds) reflects reality

---

👉 Let’s load the filtered Derby-like dataset and begin by identifying the favourite in each race.


## 📦 Step 1: Load and Filter the Derby-Type Dataset

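We’re recreating the filtered dataset from Lesson 1, starting from the full 2019 horse and race files. The cell below loads and merges the two files on the race ID (`rid`); on its own it does not apply the filter, so the merged table still covers every race in the files (the preview includes Hong Kong runners).

The filter targets **Derby-type races**, defined as: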

- 🇬🇧 UK (GB) flat turf races  
- 🎂 3-year-old runners only  
- 🥇 Class 1 (elite) races  
- 📏 Distance between 2200m and 2600m

This keeps our analysis focused on races **similar in structure to the Epsom Derby**, so we can explore how favourites perform in these specific conditions.


In [3]:
import pandas as pd

# Load the full 2019 data
races = pd.read_csv("data/races_2019.csv")
horses = pd.read_csv("data/horses_2019.csv")

# Merge on Race ID
data = horses.merge(races, on="rid", suffixes=("", "_race"))

# Check merged shape
print("🔗 Merged shape:", data.shape)
data.head(2)


🔗 Merged shape: (171849, 44)


Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,distance,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class
0,11499,Picken,4.0,6.0,0.3125,0,J Size,Joao Moreira,1,,...,7f,Good,,"[52492.49, 20260.26, 10590.59, 5525.53, 3223.22]",82.46,92092.0,1407.0,HK,1,0
1,11499,Noble De Love,6.0,7.0,0.333333,1,F C Lor,Zac Purton,2,nk,...,7f,Good,,"[52492.49, 20260.26, 10590.59, 5525.53, 3223.22]",82.46,92092.0,1407.0,HK,1,0
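
The merged table above still spans every race in the files. The Derby-type filter described in this step could look roughly like the sketch below. It runs on made-up toy rows and rests on several assumptions that the notebook does not confirm: that `metric` is the distance in metres, that `hurdles` is empty for flat races, that `class == 1` marks Class 1, and that filtering on each horse's `age` is a fair stand-in for "3-year-old races". No surface column is visible in the preview, so the turf condition is left out.

```python
import pandas as pd

# Toy rows using column names seen in the merged preview; values are made up.
toy = pd.DataFrame({
    "rid": [1, 1, 2, 3, 4],
    "age": [3.0, 3.0, 3.0, 5.0, 3.0],
    "countryCode": ["GB", "GB", "HK", "GB", "GB"],
    "class": [1, 1, 1, 1, 4],
    "metric": [2400.0, 2400.0, 2400.0, 2400.0, 2400.0],
    "hurdles": [None, None, None, "10 hurdles", None],
})

# Each condition mirrors one bullet of the Derby-type definition (assumed mapping).
derby_mask = (
    (toy["countryCode"] == "GB")           # UK (GB) races
    & (toy["age"] == 3)                    # 3-year-old runners
    & (toy["class"] == 1)                  # Class 1
    & toy["metric"].between(2200, 2600)    # distance in metres (assumed unit)
    & toy["hurdles"].isna()                # no obstacles listed, i.e. flat (assumed)
)

derby_like = toy[derby_mask].copy()
print(derby_like["rid"].tolist())
```

Only the two runners of race 1 survive; the Hong Kong race, the hurdle race and the Class 4 race are dropped.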


## 🔢 Step 2: Understand Starting Prices and the Bookmaker’s Margin

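In this dataset, `decimalPrice` is taken to represent the **SP (Starting Price)**, the odds available just before the race begins. The values in the preview above are all below 1, so it is worth confirming how the column is scaled before relying on it.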

These odds reflect the **market’s belief** about each horse’s chance of winning, and are typically what a bettor would receive unless they took an early price.

---

#### 💡 What Are Decimal Odds?

Decimal odds show your **total return** per £1 bet — including your stake.

| Decimal Odds | Implied Win Probability | Example |
|--------------|--------------------------|---------|
| 2.00         | 50%                      | Even money — returns £2 for every £1 bet |
| 3.00         | 33%                      | Returns £3 for every £1 bet |
| 10.00        | 10%                      | A longshot |

To convert decimal odds to implied probability: `implied_prob = 1 / decimalPrice`

We’ll use this to analyse whether the market prices favourites **accurately, too conservatively, or too optimistically**.
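
As a small worked example of the conversion, the sketch below reproduces the three rows of the table. The function name `implied_probability` is ours, not part of the dataset or of any library, and it assumes true decimal odds (stake included), which are always greater than 1.

```python
def implied_probability(decimal_odds: float) -> float:
    """Convert decimal odds (total return per 1 unit staked) to a win probability."""
    if decimal_odds <= 1.0:
        # Decimal odds include the stake, so a real price is always above 1.
        raise ValueError("decimal odds must be greater than 1.0")
    return 1.0 / decimal_odds


for odds in (2.00, 3.00, 10.00):
    print(f"{odds:.2f} -> {implied_probability(odds):.1%}")
```

The guard is a deliberate choice: a value at or below 1 is not a valid decimal price, so it is better to fail loudly than to return a "probability" of 100% or more.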

#### 🧮 Bookmakers, Overround & the Real World

Unlike a fair market where probabilities would total 100%, bookmakers build in a margin known as the **overround**.

This means:

- The sum of all implied probabilities in a race will usually exceed 100%.
- The excess over 100% is the bookmaker’s margin. In UK horse racing the total is often between 110% and 130%, i.e. a margin of 10 to 30 percentage points.
- In racing lingo, the total is referred to as “the book” (e.g. “the book was 118%”).

So when we analyse favourite odds, we're seeing prices with the overround already built in.

We'll keep this in mind when comparing market odds to actual win rates — especially if we want to identify value.
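
To make the overround concrete, here is a minimal sketch for a single hypothetical five-runner race. The prices are invented for illustration, and the helper names `book_percentage` and `normalised_probabilities` are ours. Dividing each implied probability by the book is one simple way to strip out the margin; it assumes the margin is spread proportionally across runners, which real markets need not do.

```python
def book_percentage(decimal_odds):
    """Sum of implied probabilities for one race (1.0 means a 100% book)."""
    return sum(1.0 / price for price in decimal_odds)


def normalised_probabilities(decimal_odds):
    """Rescale implied probabilities so they sum to 1 (proportional margin assumed)."""
    book = book_percentage(decimal_odds)
    return [(1.0 / price) / book for price in decimal_odds]


# Hypothetical decimal prices for a five-runner race.
race = [2.0, 4.0, 5.0, 5.0, 10.0]

book = book_percentage(race)
print(f"Book: {book:.1%}, overround: {book - 1:.1%}")
print([round(p, 3) for p in normalised_probabilities(race)])
```

For these prices the book comes to roughly 125%, so the favourite's raw implied probability of 50% drops to about 40% once the margin is removed.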


## 🔢 Step 3: Identify the Favourite in Each Race
We want to identify the favourite horse in every race — the one with the **lowest decimal odds** (i.e., the shortest price). Here's how we do that in code:

```python
data["is_favourite"] = data.groupby("rid")["decimalPrice"].transform(lambda x: x == x.min())
```

#### 🔍 What’s Going On Here?

| Part | Explanation |
|------|-------------|
| `groupby("rid")` | Group the dataset by `rid` (race ID), so we analyse horses **within the same race**. |
| `["decimalPrice"]` | We focus only on the column that contains the odds. |
| `.transform(...)` | Applies a function to **each group**, returning a result of the same size as the original — one per horse. |
| `lambda x: x == x.min()` | For each group (i.e. race), this checks if the horse's odds are the **lowest in that race**. If true, it's the favourite. |
| `data["is_favourite"] = ...` | We save the result as a new Boolean column — `True` if the horse is the favourite, `False` otherwise. |

🧠 This handles **ties automatically** — if more than one horse has the same shortest odds, they’ll all be marked as favourites.
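
The tie behaviour is easy to see on a toy table. The sketch below uses the same `groupby(...).transform(...)` expression on two invented races, one of which has joint favourites; the horse names and prices are made up.

```python
import pandas as pd

# Two invented races: race 1 has two horses sharing the shortest price.
toy = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2],
    "horseName": ["A", "B", "C", "D", "E"],
    "decimalPrice": [2.5, 4.0, 2.5, 3.0, 6.0],
})

# Same expression as in the notebook: True where the price equals the race minimum.
toy["is_favourite"] = toy.groupby("rid")["decimalPrice"].transform(lambda x: x == x.min())

# Count how many rows were flagged in each race.
favourites_per_race = toy.groupby("rid")["is_favourite"].sum()
print(toy)
print(favourites_per_race)
```

Race 1 ends up with two flagged rows and race 2 with one, so the number of favourite rows can exceed the number of races.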

In [5]:
# Step 3 — Mark the favourite in each race
data["is_favourite"] = data.groupby("rid")["decimalPrice"].transform(lambda x: x == x.min())

# Check how many favourites we've found
fave_counts = data["is_favourite"].sum()
total_races = data["rid"].nunique()

print(f"✅ Favourites marked: {int(fave_counts)}")
print(f"🧾 Total races in dataset: {total_races}")

# Preview favourites only
data[data["is_favourite"]].head()


✅ Favourites marked: 21337
🧾 Total races in dataset: 17233


Unnamed: 0,rid,horseName,age,saddle,decimalPrice,isFav,trainerName,jockeyName,position,positionL,...,condition,hurdles,prizes,winningTime,prize,metric,countryCode,ncond,class,is_favourite
12,11499,Guerdon Helmet,4.0,13.0,0.003509,0,D E Ferraris,H N Jack Wong,13,5.5,...,Good,,"[52492.49, 20260.26, 10590.59, 5525.53, 3223.22]",82.46,92092.0,1407.0,HK,1,0,True
21,26954,The Invincible,5.0,7.0,0.034483,0,W Y So,Chad Schofield,9,1.5,...,Good,,"[39369.37, 15195.2, 7942.94, 4144.14, 2417.42]",95.75,69068.0,1609.0,HK,1,0,True
26,35478,Home Place,6.0,3.0,0.038462,0,Karl Thornton,Robbie Colgan,3,2.25,...,Good,13 fences,"[8624.0, 2674.0, 1274.0, 574.0]",341.3,13146.0,4223.0,IE,1,0,True
44,3840,Corsecombe,7.0,6.0,0.004975,0,Mark Gillard,Theo Gillard,40,,...,Good To Soft,10 hurdles,"[4548.6, 1335.6, 667.8, 333.9]",278.2,6887.0,3720.5,GB,10,4,True
50,110475,Keep Standing,5.0,10.0,0.009901,0,Philip Fenton,Frank Hayes,4,2.5,...,Soft,10 hurdles,"[7392.0, 2292.0, 1092.0, 492.0, 192.0]",249.9,11460.0,3218.0,IE,5,0,True


## 🔢 Step 4: Calculate Favourite Win Rate

Now that we’ve identified the favourite in each race, we check how often they **actually win**.

We do this by:
- Filtering the dataset to include only favourites (`is_favourite == True`)
- Counting how many had a `position == 1` (i.e. finished first)

This gives us the **baseline win rate for favourites** — a key benchmark to compare with market expectations.


In [7]:
# Only look at rows where horse is favourite
favourites = data[data["is_favourite"]]

# Count how many favourites won (position == 1)
wins = (favourites["position"] == 1).sum()
total = len(favourites)

win_rate = wins / total

print(f"🏁 Favourite wins: {wins} / {total} ({win_rate:.2%})")


🏁 Favourite wins: 421 / 21337 (1.97%)
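
Note that the denominator above is the number of favourite *rows* (21337), not the number of races (17233), because joint favourites are each counted. The two ways of counting can give different rates. The sketch below shows the difference on invented data; it is an illustration of the counting choice, not a recomputation of the notebook's result.

```python
import pandas as pd

# Invented data: race 1 has joint favourites, one of which wins.
toy = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2, 3, 3],
    "position": [1, 2, 3, 2, 1, 1, 2],
    "is_favourite": [True, True, False, True, False, True, False],
})

favs = toy[toy["is_favourite"]]

# Per-row rate: every favourite row counts once (what the cell above computes).
row_rate = (favs["position"] == 1).mean()

# Per-race rate: a race counts as a favourite win if any of its favourites won.
race_rate = (
    favs.assign(won=favs["position"] == 1)
    .groupby("rid")["won"]
    .any()
    .mean()
)

print(f"Per favourite row: {row_rate:.2%}")
print(f"Per race:          {race_rate:.2%}")
```

Here two of four favourite rows win (50%), but two of three races are won by a favourite (about 67%). Which figure is appropriate depends on whether you are modelling a bet on every joint favourite or asking how often the market leader wins.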


## 🧠 Sanity Check: Why Is the Favourite Win Rate So Low?

We calculated a favourite win rate of just **1.97%** — far lower than expected.

---

### 🎯 What Should We Expect?

In UK horse racing, favourites typically win **30–35%** of the time.  
This is supported by decades of industry data and academic research:

- 🏇 **FlatStats.co.uk**: “Favourites win about 35% of UK flat races.”
- 📊 **GrandNational.Fans**: “Across all races, favourites win 32–35% of the time.”
- 📚 Academic studies into the favourite–longshot bias find similar ranges across major racing markets.

So when our result shows **just 1.97%**, something likely isn't right.

---

### 🛠️ Time to Investigate

This is a perfect example of why **data science isn’t just about writing code — it’s about thinking critically**.

Here are some things to check:
- ❓ Are favourites being marked correctly?
- 🔢 Are race results (`position`) stored in an unexpected format (e.g. strings like "1st" or tied values)?
- 🧍 Are there multiple horses marked as favourite in one race?
- 🏇 Are these races structured differently than typical UK flat races?

This is a **learning checkpoint** — so let’s dig in and figure out what’s going on.
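
One way to probe the first item on the checklist is to compare our `is_favourite` flag with the dataset's own `isFav` column. The sketch below uses three rows from race 11499 copied from the previews shown earlier. It assumes that `isFav == 1` is the provider's favourite flag, which the notebook has not confirmed. It also checks the price scale: a decimal price that includes the stake is always above 1, so values below 1 would suggest the column is on a different scale (for example an implied probability), in which case the shortest price would be the *largest* value rather than the smallest.

```python
import pandas as pd

# Rows copied from the notebook's earlier previews of race 11499.
toy = pd.DataFrame({
    "rid": [11499, 11499, 11499],
    "horseName": ["Picken", "Noble De Love", "Guerdon Helmet"],
    "decimalPrice": [0.3125, 0.333333, 0.003509],
    "isFav": [0, 1, 0],  # assumed: 1 = favourite according to the data provider
})

# Same favourite logic as Step 3.
toy["is_favourite"] = toy.groupby("rid")["decimalPrice"].transform(lambda x: x == x.min())

# Rows where our flag disagrees with the provider's flag.
mismatches = toy[toy["is_favourite"] != (toy["isFav"] == 1)]
print(mismatches[["horseName", "decimalPrice", "isFav", "is_favourite"]])

# Are all prices below 1? True decimal odds never are.
prices_below_one = bool((toy["decimalPrice"] < 1).all())
print("All prices below 1:", prices_below_one)
```

On these rows the two flags disagree for Noble De Love and Guerdon Helmet, and every price is below 1. Running the same comparison on the full `data` table would show whether this is a one-off or a systematic pattern.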



## 🔍 Step 5: Investigate Why the Win Rate Is So Low

Our favourite win rate was just **1.97%**, which is far below the expected **30–35%**.

To begin debugging, we’ll investigate whether the `position` column is behaving as expected.

---

### ❓ What Could Be Wrong?

The logic we used was:

```python
favourites["position"] == 1
```



But this only works if the position column is stored as numeric values (e.g. 1, 2, 3).
If the positions are strings (like "1st", "PU", "F", or "WD"), that condition will silently fail.

We’ll inspect the column’s data type and preview a few values to check.
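
To see what "silently fail" looks like, the sketch below compares a string-valued position column with the integer `1`. The codes `"PU"` and `"F"` are illustrative values only. `pd.to_numeric(..., errors="coerce")` is one common way to recover numeric positions while turning non-numeric codes into `NaN`.

```python
import pandas as pd

# Illustrative string positions, including non-finisher codes.
positions = pd.Series(["1", "2", "PU", "1", "F"], dtype=object)

# Comparing strings with an integer raises no error; it just never matches.
naive_wins = int((positions == 1).sum())

# Convert to numbers; anything non-numeric becomes NaN.
numeric = pd.to_numeric(positions, errors="coerce")
real_wins = int((numeric == 1).sum())
non_finishers = int(numeric.isna().sum())

print("Wins with naive comparison:", naive_wins)
print("Wins after conversion:", real_wins)
print("Non-numeric entries:", non_finishers)
```

The naive comparison finds no winners at all, while the converted column finds two and exposes two non-numeric entries.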

In [11]:
# Check if position is numeric or string
print("Data type of position column:", data["position"].dtype)

# Show the first 20 unique values (to see if there are surprises)
print("Sample unique values in 'position':")
print(data["position"].unique()[:20])


Data type of position column: int64
Sample unique values in 'position':
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 40 14 15 16 17 18 19]


### ⚠️ Anomaly Detected: `position = 40` in a 3YO Flat Race?

One of the unique `position` values is **40** — which immediately stands out.

That's suspicious, because:
- Earlier in our EDA, we found that **no Class 1 flat races** had anything close to 40 runners
- Even the Epsom Derby, a large-field race, rarely has more than 18 horses

This suggests that **some part of the data might be misaligned or inconsistent**.

Rather than blindly trust the `runners` column (which comes from a different file), we’ll investigate further by **recalculating the actual number of horses in each race** using the data we have.

This anomaly shows why **we can't assume data is perfect just because it loads without error**.



## 🧩 Recalculating Actual Runners and Investigating Finish Positions

Earlier, we found a `position` value of 40, which is higher than the field size of any race we saw in our earlier EDA. A horse cannot finish 40th in a race with far fewer runners.

This raised an important question:
> Can we really trust the `runners` column from the race dataset?

To be safe, we’ll **recalculate the actual number of runners per race** using the horse-level data. Then, we’ll **look for any remaining impossible finish positions** — and decide whether they should be removed or fixed.

This lets us clean the data *transparently and responsibly*.
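
The idea can be sketched on a toy table first. The example below counts rows per race with `transform("size")` and flags any position larger than that count. It assumes every runner in a race has exactly one row in the horse-level data, which the notebook has not verified; the data is invented.

```python
import pandas as pd

# Invented data: race 1 has three rows, one with an impossible position.
toy = pd.DataFrame({
    "rid": [1, 1, 1, 2, 2, 2, 2],
    "horseName": list("ABCDEFG"),
    "position": [1, 2, 40, 1, 2, 3, 4],
})

# Number of rows (assumed: runners) per race, broadcast back to every row.
toy["actual_runners"] = toy.groupby("rid")["rid"].transform("size")

# A finishing position cannot exceed the number of runners.
impossible = toy[toy["position"] > toy["actual_runners"]]
print(impossible)
```

Only horse C is flagged: it is recorded as 40th in a three-runner race.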




In [12]:
# ✅ Recalculate the actual number of runners per race
actual_runners = data.groupby("rid").size()
data["actual_runners"] = data["rid"].map(actual_runners)

# 🕵️ Investigate horses with impossible finish positions
weird_positions = data[data["position"] > data["actual_runners"]]

# Preview results
print(f"❗ Horses with position > actual number of runners: {len(weird_positions)}")
weird_positions[["rid", "horseName", "position", "actual_runners", "decimalPrice"]].head(10)


❗ Horses with position > actual number of runners: 8014


Unnamed: 0,rid,horseName,position,actual_runners,decimalPrice
28,35478,Maeve's Choice,40,5,0.153846
43,3840,Call Me Westie,40,18,0.009901
44,3840,Corsecombe,40,18,0.004975
45,3840,Fortescue,40,18,0.014925
46,3840,Little Robin,40,18,0.006623
59,110475,Claregate Street,40,15,0.014925
60,110475,Be My Dream,40,15,0.014925
61,110475,Kellyiscool,40,15,0.142857
70,110667,The Druids Nephew,40,9,0.076923
97,114194,Blameitalonmyroots,40,11,0.058824


### 🔍 What Position Values Actually Exist?

Before we assume that `position = 40` means "did not finish", let’s explore all the values in the column.

This helps us confirm:
- What the **valid range** of finishing positions is
- Whether there are **other special cases** (like 0, 99, or missing data)
- How frequent `position = 40` actually is

This is a key principle in responsible data science: **Don’t guess — verify.**


In [13]:
# View the full distribution of position values (sorted)
position_counts = data["position"].value_counts().sort_index()
position_counts

position
1     17260
2     17233
3     17148
4     16845
5     16109
6     14809
7     13194
8     11426
9      9639
10     7979
11     6473
12     5108
13     3607
14     2617
15     1547
16     1085
17      590
18      389
19      224
20      154
21      113
22       90
23       62
24       44
25       25
26       15
27       13
28        9
29        5
30        4
40     8033
Name: count, dtype: int64
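
In the output above the values run from 1 to 30 and then jump straight to 40, and 40 occurs 8033 times, more often than position 10 (7979). A gap like that is a typical sign of a placeholder code. The sketch below shows one rough way to flag such gaps on invented data; a gap alone does not prove a value is a sentinel, it only marks candidates to inspect.

```python
import pandas as pd

# Invented positions with a suspicious jump from 5 to 40.
positions = pd.Series([1, 2, 3, 4, 5, 40, 1, 2, 3, 40, 1, 2])

counts = positions.value_counts().sort_index()

# Distance between each observed value and the previous observed value.
values = counts.index.to_series()
gaps = values.diff()

# Any value that does not directly follow its predecessor is a candidate.
suspects = values[gaps > 1].tolist()
print(counts)
print("Values following a gap:", suspects)
```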

### 📋 Investigating Multiple Races Manually

To confirm that `position = 40` really means the horse didn’t finish, we inspect several races where this value occurs.

We selected races with varying field sizes and different courses. For each one, we:
- Check whether horses with `position = 40` were pulled up, fell, or otherwise did not finish
- Cross-check with external sources (e.g. [Racing Post](https://www.racingpost.com))

If this pattern holds across multiple races, we can safely treat `position = 40` as a consistent placeholder for non-finishers.

In [14]:
# Pick a few known race IDs with position = 40
sample_rids = [110475, 3840, 110667]  # You can add more

# Show all horses in those races with full context
data[data["rid"].isin(sample_rids)][[
    "rid", "horseName", "position", "finished", "decimalPrice", "is_favourite",
    "jockeyName", "trainerName",
    "course", "date", "condition", "class", "distance", "actual_runners"
]].sort_values(["rid", "position"], na_position="last")

KeyError: "['finished'] not in index"
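
The `KeyError` above reports that `finished` is not a column of the merged table. One defensive pattern, sketched below on a toy table, is to check the requested columns against `DataFrame.columns` before selecting, so a missing name is reported rather than crashing the cell. The list `wanted` is a shortened, illustrative version of the one in the cell above.

```python
import pandas as pd

# Toy table with two rows taken from the earlier output for race 3840.
toy = pd.DataFrame({
    "rid": [3840, 3840],
    "horseName": ["Corsecombe", "Fortescue"],
    "position": [40, 40],
    "decimalPrice": [0.004975, 0.014925],
})

wanted = ["rid", "horseName", "position", "finished", "decimalPrice"]

# Split the request into columns that exist and columns that do not.
missing = [col for col in wanted if col not in toy.columns]
present = [col for col in wanted if col in toy.columns]

if missing:
    print("Skipping missing columns:", missing)

view = toy[present].sort_values(["rid", "position"], na_position="last")
print(view)
```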