---
# Data Collection:

**Author:** Xinyan Liao

**Date last edited:** 4/2/2025

**Objective:** Exploratory data analysis (EDA) and backtesting of simple, unidirectional betting strategies.

---

# 1. Introduction
This notebook takes a closer look at simple, directional betting strategies such as betting on the favourite or the home team. We first use exploratory data analysis (EDA) to look for potential betting edges, then backtest the respective strategies over time.

The research questions are kept intentionally vague to give us more wiggle room for exploration but here are some key questions we want to investigate in this notebook:

1. Which bookmakers provide the highest odds?
2. Are there potential mispricings in the odds that are exploitable?
3. When is betting on favourites, underdogs or draws more advantageous?

For our purposes, we will merge the two tables from our SQLite database on each fixture's unique `fixture_id` to form a single table with the odds and match outcomes. We will then pick the best available odds for each strategy for analysis.

---

# 2. Import Libraries
We import the necessary libraries for:
- Data manipulation using `pandas`.
- Interacting with the SQL database using `SQLAlchemy`.
- Generating and saving our visualisations with `lets_plot`.

In [1]:
import pandas as pd
from sqlalchemy import create_engine
from lets_plot import * 
from lets_plot import ggsave
from IPython.display import SVG

---

# 3. Merging Odds Data and Match Outcomes 

In [2]:
engine = create_engine('sqlite:///../data/sports_odds.db')
odds_df = pd.read_sql('SELECT * FROM historical_odds', con=engine)
results_df = pd.read_sql('SELECT * FROM match_results', con=engine)

After inspection, we find that `odds_df` and `results_df` have different numbers of rows. These discrepancies are normal and will not affect our analysis. We therefore use an inner merge in `pandas` so that only fixtures whose `fixture_id` is present in both dataframes are selected for analysis.
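To see which fixtures an inner merge would drop, one option is an outer merge with `indicator=True`. The sketch below uses small toy dataframes (`odds_toy` and `results_toy` are illustrative stand-ins, not the real tables) to show the idea.

```python
import pandas as pd

# Toy stand-ins for odds_df and results_df (values are illustrative only)
odds_toy = pd.DataFrame({"fixture_id": ["AAA", "BBB", "CCC"], "max_home": [2.1, 1.5, 3.0]})
results_toy = pd.DataFrame({"fixture_id": ["BBB", "CCC", "DDD"], "FTR": ["H", "D", "A"]})

# An outer merge with indicator=True labels each row by where its key was found
check = odds_toy.merge(results_toy, on="fixture_id", how="outer", indicator=True)
match_counts = check["_merge"].value_counts()

# Fixtures that an inner merge would drop
unmatched = check.loc[check["_merge"] != "both", "fixture_id"].tolist()
print(match_counts)
print(sorted(unmatched))
```

Running the same check on the real tables would list the `fixture_id` values that appear in only one of them.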

In [3]:
merged_df = odds_df.merge(results_df, on='fixture_id', how='inner')

Unnamed: 0,match_id,commence_time,home_team,away_team,Unibet_home_odds,Unibet_away_odds,Unibet_draw_odds,Sky Bet_home_odds,Sky Bet_away_odds,Sky Bet_draw_odds,...,Grosvenor_draw_odds,Smarkets_home_odds,Smarkets_away_odds,Smarkets_draw_odds,fixture_id,Date,Time,HomeTeam,AwayTeam,FTR
0,2dd4a4f8663e6f835226a5209c614a60,2020-06-17T17:00:00Z,Aston Villa,Sheffield United,3.35,2.32,3.25,3.1,2.25,3.3,...,,,,,AVLSHU170620,17/06/2020,18:00,Aston Villa,Sheffield United,D
1,b1e029a0d989b4c11e843204003044f9,2020-06-17T19:15:00Z,Manchester City,Arsenal,1.35,8.5,5.6,1.36,7.5,5.25,...,,,,,MCIARS170620,17/06/2020,20:15,Man City,Arsenal,H
2,59d68295dc2213634772cd941c91fa11,2020-06-19T19:15:00Z,Tottenham Hotspur,Manchester United,2.75,2.6,3.3,2.7,2.5,3.4,...,,,,,TOTMUN190620,19/06/2020,20:15,Tottenham,Man United,D
3,88352746f45f6beb4e2cb662d9414d0f,2020-06-20T11:30:00Z,Watford,Leicester City,3.4,2.15,3.45,3.25,2.2,3.4,...,,,,,WATLEI200620,20/06/2020,12:30,Watford,Leicester,D
4,065ae59da20562892de52b7f5598ecbf,2020-06-20T16:30:00Z,West Ham United,Wolverhampton Wanderers,3.5,2.15,3.35,3.3,2.2,3.3,...,,,,,WHUWOL200620,20/06/2020,17:30,West Ham,Wolves,A


---

# 4. Extracting the Highest Odds for Each Fixture and Outcome

We understand that consistently getting the best odds for each fixture and outcome may be difficult for a retail bettor. However, this layer of analysis tells us which bookmakers tend to offer better odds for certain teams or outcomes, and it lets us estimate the profitability of our strategies more accurately.
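The `process_row` helper used below lives in a local `functions` module whose source is not shown here. As a rough idea of what such a row-wise helper might do, here is a hypothetical sketch (`best_odds_row` is an invented name), assuming odds columns are named `<bookmaker>_<outcome>_odds` as in the merged table.

```python
import pandas as pd

def best_odds_row(row, outcomes=("home", "away", "draw")):
    """Hypothetical helper: highest quoted odds per outcome and who quoted it."""
    result = {}
    for outcome in outcomes:
        suffix = f"_{outcome}_odds"
        # Columns are assumed to be named '<bookmaker>_<outcome>_odds'
        cols = [c for c in row.index if c.endswith(suffix)]
        quotes = pd.to_numeric(row[cols], errors="coerce").dropna()
        if quotes.empty:
            result[f"max_{outcome}"] = float("nan")
            result[f"{outcome}_bookmaker"] = None
        else:
            best_col = quotes.idxmax()
            result[f"max_{outcome}"] = quotes.max()
            # Strip the suffix to recover the bookmaker name
            result[f"{outcome}_bookmaker"] = best_col[: -len(suffix)]
    return pd.Series(result)

# Two bookmakers' quotes for a single fixture (taken from the first row above)
toy = pd.DataFrame({
    "Unibet_home_odds": [3.35], "Unibet_away_odds": [2.32], "Unibet_draw_odds": [3.25],
    "Sky Bet_home_odds": [3.10], "Sky Bet_away_odds": [2.25], "Sky Bet_draw_odds": [3.30],
})
best = toy.apply(best_odds_row, axis=1)
print(best)
```

Ties are resolved in favour of the first matching column, which is a design choice of this sketch rather than something the notebook specifies.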

In [4]:
# Import the function to process each row and obtain highest odds
from functions import process_row

# Apply the function to each row
new_columns = merged_df.apply(process_row, axis=1)
merged_df = pd.concat([merged_df, new_columns], axis=1)

Unnamed: 0,match_id,commence_time,home_team,away_team,Unibet_home_odds,Unibet_away_odds,Unibet_draw_odds,Sky Bet_home_odds,Sky Bet_away_odds,Sky Bet_draw_odds,...,Time,HomeTeam,AwayTeam,FTR,max_home,home_bookmaker,max_away,away_bookmaker,max_draw,draw_bookmaker
0,2dd4a4f8663e6f835226a5209c614a60,2020-06-17T17:00:00Z,Aston Villa,Sheffield United,3.35,2.32,3.25,3.10,2.25,3.30,...,18:00,Aston Villa,Sheffield United,D,3.35,Unibet,2.41,Marathon Bet,3.52,Marathon Bet
1,b1e029a0d989b4c11e843204003044f9,2020-06-17T19:15:00Z,Manchester City,Arsenal,1.35,8.50,5.60,1.36,7.50,5.25,...,20:15,Man City,Arsenal,H,1.39,Marathon Bet,8.70,Marathon Bet,5.95,Marathon Bet
2,59d68295dc2213634772cd941c91fa11,2020-06-19T19:15:00Z,Tottenham Hotspur,Manchester United,2.75,2.60,3.30,2.70,2.50,3.40,...,20:15,Tottenham,Man United,D,2.88,Betfair,2.64,Marathon Bet,3.70,Marathon Bet
3,88352746f45f6beb4e2cb662d9414d0f,2020-06-20T11:30:00Z,Watford,Leicester City,3.40,2.15,3.45,3.25,2.20,3.40,...,12:30,Watford,Leicester,D,3.52,Marathon Bet,2.22,Marathon Bet,3.75,Marathon Bet
4,065ae59da20562892de52b7f5598ecbf,2020-06-20T16:30:00Z,West Ham United,Wolverhampton Wanderers,3.50,2.15,3.35,3.30,2.20,3.30,...,17:30,West Ham,Wolves,A,3.70,Betfair,2.23,Marathon Bet,3.60,Marathon Bet
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1777,ac9d59dd7555c122948f98b8b2a19c8e,2024-12-08T14:00:00Z,Fulham,Arsenal,,,,6.25,1.44,4.33,...,14:00,Fulham,Arsenal,D,6.40,Smarkets,1.62,BoyleSports,4.33,Sky Bet
1778,8899ce68b0c9e451b17edc3f2e076a6b,2024-12-08T14:00:00Z,Ipswich Town,Bournemouth,,,,3.10,2.15,3.50,...,14:00,Ipswich,Bournemouth,A,3.20,Paddy Power,2.20,Smarkets,3.90,Betfair
1779,d9163188e3e3d9a8bf2525e2f9e3a553,2024-12-08T14:00:00Z,Leicester City,Brighton and Hove Albion,,,,4.75,1.62,4.20,...,14:00,Leicester,Brighton,D,5.00,Paddy Power,1.73,BoyleSports,4.40,Coral
1780,c87f3d3551a57bd560cb056ade831890,2024-12-08T16:30:00Z,Tottenham Hotspur,Chelsea,,,,2.20,2.80,3.75,...,16:30,Tottenham,Chelsea,A,2.30,Paddy Power,3.10,Smarkets,3.75,Sky Bet


In [5]:
merged_df = merged_df[['fixture_id', 'Date', 'home_team', 'away_team', 'FTR', 'max_home', 'home_bookmaker', 'max_away', 'away_bookmaker', 'max_draw', 'draw_bookmaker']]
merged_df

Unnamed: 0,fixture_id,Date,home_team,away_team,FTR,max_home,home_bookmaker,max_away,away_bookmaker,max_draw,draw_bookmaker
0,AVLSHU170620,17/06/2020,Aston Villa,Sheffield United,D,3.35,Unibet,2.41,Marathon Bet,3.52,Marathon Bet
1,MCIARS170620,17/06/2020,Manchester City,Arsenal,H,1.39,Marathon Bet,8.70,Marathon Bet,5.95,Marathon Bet
2,TOTMUN190620,19/06/2020,Tottenham Hotspur,Manchester United,D,2.88,Betfair,2.64,Marathon Bet,3.70,Marathon Bet
3,WATLEI200620,20/06/2020,Watford,Leicester City,D,3.52,Marathon Bet,2.22,Marathon Bet,3.75,Marathon Bet
4,WHUWOL200620,20/06/2020,West Ham United,Wolverhampton Wanderers,A,3.70,Betfair,2.23,Marathon Bet,3.60,Marathon Bet
...,...,...,...,...,...,...,...,...,...,...,...
1777,FULARS081224,08/12/2024,Fulham,Arsenal,D,6.40,Smarkets,1.62,BoyleSports,4.33,Sky Bet
1778,IPSBOU081224,08/12/2024,Ipswich Town,Bournemouth,A,3.20,Paddy Power,2.20,Smarkets,3.90,Betfair
1779,LEIBHA081224,08/12/2024,Leicester City,Brighton and Hove Albion,D,5.00,Paddy Power,1.73,BoyleSports,4.40,Coral
1780,TOTCHE081224,08/12/2024,Tottenham Hotspur,Chelsea,A,2.30,Paddy Power,3.10,Smarkets,3.75,Sky Bet


---

# 5. Which Bookmakers Offer the Best Odds?
First, let us look at the prevalence of bookmakers in providing the best odds. This gives us some insight into which bookmakers we can focus on when deciding to place bets, streamlining the process.

In [6]:
LetsPlot.setup_html()

In [7]:
# Function to get top bookmakers
def get_top_bookmakers(dataframe, column):
    # Share of fixtures (in %) where each bookmaker offers the best odds
    counts = dataframe[column].value_counts(normalize=True) * 100

    # Get the top five bookmakers
    top_counts = counts.nlargest(5).reset_index()
    top_counts.columns = ['Bookmaker', 'Percentage']

    return top_counts

top_home = get_top_bookmakers(merged_df, 'home_bookmaker')
top_away = get_top_bookmakers(merged_df, 'away_bookmaker')
top_draw = get_top_bookmakers(merged_df, 'draw_bookmaker')

In [8]:
top_home_bookmakers = (
    ggplot(top_home, aes(x='Bookmaker', y='Percentage'))
    + geom_bar(stat='identity', fill='#024B04')
    + ggsize(800, 400)
    + labs(
        title='Marathon Bet gives the highest home odds almost 20% of the time!',
        subtitle='The top five bookmakers provide the best odds almost 60% of the time!'
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue')
    )
)
top_home_bookmakers


In [9]:
top_away_bookmakers = (
    ggplot(top_away, aes(x='Bookmaker', y='Percentage'))
    + geom_bar(stat='identity', fill='#8B2E01')
    + ggsize(800, 400)
    + labs(
        title='Marathon Bet gives the highest away odds again!',
        subtitle='Similarly, the top five bookmakers provide the best away odds almost 60% of the time'
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue')
    )
)
top_away_bookmakers

In [10]:
top_draw_bookmakers = (
    ggplot(top_draw, aes(x='Bookmaker', y='Percentage'))
    + geom_bar(stat='identity', fill='grey')
    + ggsize(800, 400)
    + labs(
        title='Marathon Bet gives the highest odds almost 30% of the time, our clear winner!',
        subtitle='Sky Bet and Virgin Bet are additional contenders for good draw odds'
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=20),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue'),
    )
)
top_draw_bookmakers

**Conclusion**: Marathon Bet has the highest probability of providing the best odds for any particular outcome (Home, Away, Draw). This is unsurprising given its reputation as a low-margin bookmaker in the industry. Other strong contenders to consider are Paddy Power, Betclic, William Hill and Unibet.

---

# 6. How Do the Win Rates of Different Strategies Vary Over Time?
Let's look at the win rates of the respective strategies over time to see if there are any significant changes in the dynamics of EPL games (e.g. underdogs taking over).

First, we need to convert the full-time result into the outcome of each strategy. For example, if the team with the lower odds wins, we label the result favourite (F); if the team with the higher odds wins, we label it underdog (U); draws remain D.

In [11]:
def get_result_type(row):
    
    home_odds = row['max_home']
    away_odds = row['max_away']
    result = row['FTR']
        
    if home_odds > away_odds:
        home = 0 # 0 denotes the underdog team
        away = 1
    else:
        home = 1
        away = 0

    if result == 'D':
        result_type = 'D'
    elif result == 'H':
        if home == 0:
            result_type = 'U'
        else:
            result_type = 'F'
    else:
        if away == 0:
            result_type = 'U'
        else:
            result_type = 'F'
    
    return result_type

In [12]:
strategy_df = merged_df.copy()
strategy_df['result_type'] = strategy_df.apply(get_result_type, axis=1)
strategy_df = strategy_df[['fixture_id', 'Date', 'home_team', 'away_team', 'result_type']]
strategy_df

Unnamed: 0,fixture_id,Date,home_team,away_team,result_type
0,AVLSHU170620,17/06/2020,Aston Villa,Sheffield United,D
1,MCIARS170620,17/06/2020,Manchester City,Arsenal,F
2,TOTMUN190620,19/06/2020,Tottenham Hotspur,Manchester United,D
3,WATLEI200620,20/06/2020,Watford,Leicester City,D
4,WHUWOL200620,20/06/2020,West Ham United,Wolverhampton Wanderers,F
...,...,...,...,...,...
1777,FULARS081224,08/12/2024,Fulham,Arsenal,D
1778,IPSBOU081224,08/12/2024,Ipswich Town,Bournemouth,F
1779,LEIBHA081224,08/12/2024,Leicester City,Brighton and Hove Albion,D
1780,TOTCHE081224,08/12/2024,Tottenham Hotspur,Chelsea,U


Now, let's break this down by year so we can visualise how win rates vary over time.

In [13]:
strategy_df['Date'] = pd.to_datetime(strategy_df['Date'], dayfirst=True)
strategy_df['year'] = strategy_df['Date'].dt.year

In [14]:
# Group fixtures by year
result_counts = strategy_df.groupby(['year', 'result_type']).size().reset_index(name='count')
total_matches = strategy_df.groupby('year').size().reset_index(name='total_matches')
result_counts = result_counts.merge(total_matches, on='year')

# Calculate win rates for each strategy
result_counts['win_rate'] = result_counts['count'] / result_counts['total_matches']

In [15]:
# Reshape the data for plotting
win_rate_df = result_counts.pivot(index='year', columns='result_type', values='win_rate').reset_index()
win_rate_df = win_rate_df.rename(columns={'F': 'Bet on Favorite', 'U': 'Bet on Underdog', 'D': 'Bet on Draw'})

# Melt for easier plotting
win_rate_melted = win_rate_df.melt(id_vars='year', var_name='Strategy', value_name='Win Rate')
win_rate_melted['year'] = pd.to_datetime(win_rate_melted['year'], format='%Y')

# Reordering so legends appear in proper order in plot
desired_order = ['Bet on Favorite', 'Bet on Underdog', 'Bet on Draw']
win_rate_melted['Strategy'] = pd.Categorical(win_rate_melted['Strategy'], categories=desired_order, ordered=True)

In [16]:
win_rate_across_time = (
    ggplot(win_rate_melted, aes(x='year', y='Win Rate', color='Strategy'))
    + ggsize(800,400)
    + geom_line(size=1)
    + geom_point(size=2)
    + labs(
        x = 'Time',
        y = 'Win Rate',
        title = 'Expectedly, betting on the favourite team has the highest win rate',
        subtitle = 'Surprisingly, the prevalence of draws has seen a noticeable increase from 2023 to 2024.'
        )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue'),
        legend_position=(0.5, 0.5)
        )
    )

win_rate_across_time

**Conclusion**: As expected, betting on the favourite has the greatest chance of success, at over 50% on average. However, this is priced into the odds (favourites have lower odds) and is not necessarily indicative of profitability.

It is interesting, however, to observe that the prevalence of draws has increased over the last two seasons. We speculate that this might be due to mean reversion in overall team ability, where 'traditional underdogs' like Aston Villa or Nottingham Forest have enjoyed great form and 'traditional powerhouses' like Manchester United have experienced the opposite.
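To make the point about win rate versus profitability concrete, here is a minimal sketch of the break-even arithmetic for a flat-stake bet at decimal odds. The function names and the example numbers are illustrative assumptions, not values from the dataset.

```python
def break_even_win_rate(decimal_odds):
    """Win rate at which a flat-stake bet at these odds breaks even."""
    return 1.0 / decimal_odds

def expected_profit_per_unit(win_rate, decimal_odds):
    """Expected profit per unit staked: win (odds - 1) with prob p, lose 1 otherwise."""
    return win_rate * (decimal_odds - 1.0) - (1.0 - win_rate)

# Illustrative numbers only (not taken from the dataset)
print(break_even_win_rate(1.60))
print(expected_profit_per_unit(0.55, 1.60))  # high win rate, short odds
print(expected_profit_per_unit(0.30, 4.00))  # low win rate, long odds
```

In this toy example the bet that wins more often has a negative expected profit, while the bet that wins less often has a positive one, which is why win rates need to be read alongside the odds.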

---

# 7. Are There Potential Mispricings in Odds Between Home and Away Teams?
Here we want to investigate if home teams are priced to be systematically advantaged compared to away teams and if there are potential deviations between actual and implied probabilities of winning.

First, we will visualise the odds distribution for both home and away teams to get a sense of the implied probabilities of winning for both sides.

Next, we will use the distribution to decide on the categories to bin our odds in and then investigate the actual vs implied probabilities of winning across various odds categories.

### Implied Probability of Winning = 1 / Odds 
This is the formula we're using to calculate implied probability.
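As a small worked example of the formula, the sketch below converts the best available odds for the Manchester City v Arsenal fixture (1.39, 8.70, 5.95 in the table above) into implied probabilities. The normalisation step at the end is an optional extra and an assumption of this sketch; the notebook itself works with the raw 1 / odds values.

```python
def implied_probability(decimal_odds):
    """Implied win probability from decimal odds."""
    return 1.0 / decimal_odds

# Best available odds for one fixture (home, away, draw)
odds = {"home": 1.39, "away": 8.70, "draw": 5.95}
raw = {k: implied_probability(v) for k, v in odds.items()}

# The raw probabilities need not sum to exactly 1; dividing by their total
# rescales them into a proper probability distribution.
total = sum(raw.values())
normalised = {k: p / total for k, p in raw.items()}
print(raw)
print(total)
print(normalised)
```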

In [17]:
stadium_df = merged_df.copy()
stadium_df = stadium_df[['fixture_id','FTR', 'max_home','max_away']]

In [18]:
# Melt the DataFrame
odds_melted = stadium_df.melt(value_vars=['max_home', 'max_away'], 
                      var_name='team_type', value_name='odds')

# Rename categories for better readability
odds_melted['team_type'] = odds_melted['team_type'].replace({
    'max_home': 'Home Odds',
    'max_away': 'Away Odds'
})

In [19]:
odds_density = (
        ggplot(odds_melted, aes(x='odds', fill='team_type', color='team_type'))
        + ggsize(800,400)
        + geom_density(alpha=0.4)  # Add transparency for overlapping densities
        + labs(x='Odds', y='Density', color='Team Type', fill='Team Type',
               title = 'Density Plot of Home vs Away Odds',
               subtitle = 'Home odds peak at 1.6 and have much greater density from the 0 to 4 region')
        + scale_x_continuous(limits=(0, 20))  # Restrict the x-axis to odds up to 20
        + theme(axis_text_x=element_text(size=12, angle=0, hjust=0.5),
             axis_title_x=element_text(size=14),
             axis_title_y=element_text(size=14),
             plot_title=element_text(size=22, face='bold', hjust = 0.5),
             plot_subtitle=element_text(size=16, hjust=0.5, color='blue')
             )
     )
odds_density

**Conclusion:** As expected, home teams are favoured on average: home odds peak at around 1.6, compared with around 2.4 for away odds. The home odds distribution is also noticeably concentrated further to the left, at lower odds.

---

Seeing that most of the odds are clustered between 1 and 10, let's set that as the range of our analysis. The extremes are not useful to analyse given their small sample sizes.

For each bin, the implied probability of winning is calculated from the average odds of that bin. The actual probability of winning is the percentage of matches in that bin that ended in a home/away win.
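The `get_implied_vs_actual` helper is imported from the local `functions` module and its source is not shown. The sketch below is a hypothetical stand-in (`implied_vs_actual_sketch` is an invented name) that follows the description above. It assumes the implied probability is averaged per match within each bin, which is one reasonable reading, and that the output is in long format with `odds_bin`, `Type` and `Probability` columns as used in the plots.

```python
import pandas as pd

def implied_vs_actual_sketch(df, odds_column, bins, outcome_label):
    """Hypothetical stand-in for get_implied_vs_actual (real one not shown)."""
    work = df[[odds_column, "FTR"]].dropna().copy()
    work["odds_bin"] = pd.cut(work[odds_column], bins=bins, include_lowest=True)
    work["implied_prob"] = 1.0 / work[odds_column]
    work["won"] = (work["FTR"] == outcome_label).astype(float)

    # observed=True keeps only bins that actually contain matches
    grouped = work.groupby("odds_bin", observed=True)
    summary = pd.DataFrame({
        "Implied Probability": grouped["implied_prob"].mean(),
        # Share of matches in the bin that ended with the given outcome
        "Actual Probability": grouped["won"].mean(),
    }).reset_index()

    # String labels are easier to use on a categorical plot axis
    summary["odds_bin"] = summary["odds_bin"].astype(str)
    return summary.melt(id_vars="odds_bin", var_name="Type", value_name="Probability")

# Toy data, illustrative only
toy = pd.DataFrame({
    "max_home": [1.2, 1.4, 1.8, 1.9, 2.2, 2.4],
    "FTR": ["H", "H", "H", "D", "A", "H"],
})
long_df = implied_vs_actual_sketch(toy, "max_home", [1.0, 1.5, 2.0, 2.5], "H")
print(long_df)
```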

In [20]:
from functions import get_implied_vs_actual

bins = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 8.0, 10.0]

home_df = get_implied_vs_actual(
    df = stadium_df,
    odds_column = 'max_home',
    bins = bins,
    outcome_label = 'H'
)

away_df = get_implied_vs_actual(
    df = stadium_df,
    odds_column = 'max_away',
    bins = bins,
    outcome_label = 'A'
)



In [21]:
home_win_probabilities = (
    ggplot(home_df, aes(x='odds_bin', y='Probability', color='Type'))
    + geom_line(size=1)
    + ggsize(800, 400)
    + ggtitle('Home Teams: Implied vs Actual Win Probability')
    + labs(x='Home Odds Bin', y='Probability', color='Legend')
    + scale_color_manual(values={'Actual Probability': 'black', 'Implied Probability': '#C74506'})
    + theme(
        axis_text_x=element_text(size=12, angle=45, hjust=1),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        legend_position='bottom'
    )
)
home_win_probabilities

**Conclusion:** Home odds are rather efficiently priced with some home ground advantage observed in teams that are slight favourites and significant underdogs. 

Slight Favourites: Home teams which are priced with a 57% probability of winning have a slightly higher actual win probability at 61%.

Significant Underdogs: Home teams which are priced at around a 20% win probability have an actual win probability that is 3 to 4 percentage points higher.

In [22]:
away_win_probabilities = (
    ggplot(away_df, aes(x='odds_bin', y='Probability', color='Type'))
    + geom_line(size=1)
    + ggsize(800, 400)
    + ggtitle('Away Teams: Implied vs Actual Win Probability')
    + labs(x='Away Odds Bin', y='Probability', color='Legend')
    + scale_color_manual(values={'Actual Probability': 'black', 'Implied Probability': '#7408B9'})
    + theme(
        axis_text_x=element_text(size=12, angle=45, hjust=1),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        legend_position='bottom'
    )
)
away_win_probabilities

**Conclusion:** Away odds are priced almost perfectly when the away team is the favourite or the match is evenly priced. However, significant deviations occur when the away team is the underdog.

Slight Underdogs: Away teams with a 30-40% implied probability of winning have an actual win probability that is almost 10% higher.

Significant Underdogs: Away teams with less than a 20% implied probability of winning have an actual win probability that is almost 10% lower.

---

# 8. Are There Potential Mispricings in Draw Odds?

Now that we've identified some exploitable mispricings in Home and Away odds, let's investigate if draws present the same opportunities for us.

Similarly, let's first visualise the distribution of draw odds before deciding the categories to bin them into for our subsequent analysis.

In [23]:
draw_odds_df = merged_df.copy()
draw_odds_df = draw_odds_df[['fixture_id', 'FTR', 'max_draw']]

In [24]:
draw_odds_density = (
    ggplot(draw_odds_df, aes(x='max_draw'))
    + geom_density(fill='blue', alpha=0.5)  # Add transparency for better visuals
    + ggsize(800,400)
    + scale_x_continuous(limits=(0, 10)) 
    + labs(x='Draw Odds', y='Density',
           title = 'Density Plot of Draw Odds',
           subtitle = 'Draw odds cluster at the range of 3 to 5 with a peak at 3.6')
    + theme(
        axis_text_x=element_text(size=12),
        axis_text_y=element_text(size=12),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        plot_subtitle=element_text(size=16,color = 'blue', hjust = 0.5)
    )
)
draw_odds_density

Seeing that most odds are clustered from 3 to 5, let's opt for greater granularity of bins in that region to have more precise analysis.

In [25]:
draw_bins = [3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

draw_df = get_implied_vs_actual(
    df = draw_odds_df,
    odds_column = 'max_draw',
    bins = draw_bins,
    outcome_label = 'D'
)



In [26]:
draw_probabilities = (
    ggplot(draw_df, aes(x='odds_bin', y='Probability', color='Type'))
    + geom_line(size=1)
    + ggsize(800, 400)
    + ggtitle('Implied vs Actual Draw Probability')
    + labs(x='Draw Odds Bin', y='Probability', color='Legend')
    + scale_color_manual(values={'Actual Probability': 'black', 'Implied Probability': '#C74506'})
    + theme(
        axis_text_x=element_text(size=12, angle=45, hjust=1),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        legend_position='bottom'
    )
)
draw_probabilities

**Conclusion:** Draw odds are almost always overpriced, and by a rather significant margin, so they offer no opportunity to exploit these mispricings.

---

# 9. Are There Any Situations Where Betting on Draws Makes Sense?

In [27]:
draw_strategy_df = merged_df.copy()
draw_strategy_df['odds_difference'] = abs(draw_strategy_df['max_home'] - draw_strategy_df['max_away'])
draw_strategy_df['result_type'] = draw_strategy_df.apply(get_result_type, axis=1)

In [28]:
odds_difference_density = (
    ggplot(draw_strategy_df, aes(x='odds_difference'))
    + geom_density(fill='blue', alpha=0.5)  # Add transparency for better visuals
    + ggsize(800,400)
    + scale_x_continuous(limits=(0, 10)) 
    + labs(x='Odds Difference', y='Density',
           title = 'Density Plot of Home vs Away Odds Difference')
    + theme(
        axis_text_x=element_text(size=12),
        axis_text_y=element_text(size=12),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        plot_subtitle=element_text(size=16,color = 'blue', hjust = 0.5)
    )
)
odds_difference_density

In [29]:
# Bin the strategies based upon the difference in odds
bins = [0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 8.0, 9.0, 10.0]
draw_strategy_df['odds_diff_bin'] = pd.cut(draw_strategy_df['odds_difference'], bins=bins, include_lowest=True)

In [30]:
# Calculate win rates for each strategy
win_rates = (
    draw_strategy_df.groupby('odds_diff_bin')['result_type']
    .value_counts(normalize=True)  # Calculate proportions (win rates)
    .unstack(fill_value=0)         # Pivot to create columns for each strategy
    .reset_index()
)

# Rename columns for clarity
win_rates = win_rates.rename(columns={
    'F': 'Favorite Win Rate',
    'U': 'Underdog Win Rate',
    'D': 'Draw Win Rate'
})

# Convert bins to strings for plotting compatibility
win_rates['odds_diff_bin'] = win_rates['odds_diff_bin'].astype(str)

#Reshape the data for plotting
win_rates_melted = win_rates.melt(
    id_vars='odds_diff_bin',
    value_vars=['Favorite Win Rate', 'Underdog Win Rate', 'Draw Win Rate'],
    var_name='Strategy',
    value_name='Win Rate'
)



In [31]:
win_rate_odds_differences = (
    ggplot(win_rates_melted, aes(x='odds_diff_bin', y='Win Rate', color='Strategy'))
    + geom_line(size=1)
    + geom_point(size=2)
    + ggsize(800,400)
    + ggtitle('Betting on draws has the lowest win rate irrespective of matchup differences!')
    + labs(x='Odds Difference (|Home - Away|)', y='Win Rate', color='Strategy')
    # Keys must match the values in the 'Strategy' column
    + scale_color_manual(values={
        'Favorite Win Rate': 'black',
        'Underdog Win Rate': 'grey',
        'Draw Win Rate': '#d85b06'
    })
    + theme(
        panel_grid = None,
        axis_text_x=element_text(size=12, angle=45, hjust=1),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=20, face='bold', hjust=0.5),
        legend_position = (0.5,0.5)
    )
)
win_rate_odds_differences

**Conclusion:** Unfortunately, betting on draws is not the statistically favourable strategy at any point. From this visualisation, however, we can see that the three outcomes are roughly equally likely in evenly matched fixtures, which makes sense intuitively. In that case, it might not be wise to take unidirectional bets in evenly matched games because the risk is rather high.

---

# 10. Backtesting Various Unidirectional Strategies

Given the data we've collected, we would like to experiment with leveraging mispriced odds in our betting strategies. We therefore compare the profitability of the following strategies:

| Strategy                   | Odds Range   |
|----------------------------|--------------|
| Betting on Home Favourites | 1.25 to 1.75 |
| Betting on Home Underdogs  | 4.5 to 5.5   |
| Betting on Away Underdogs  | 2.5 to 4.0   |
| Betting on Favourites      | NIL          |
| Betting on Underdogs       | NIL          |
| Betting on Draws           | NIL          |

In [32]:
backtest_df = merged_df.copy()

# Organising Dataframe
backtest_df = backtest_df[['fixture_id', 'Date', 'FTR', 'max_home', 'max_away', 'max_draw']]
backtest_df['FTR_simple'] = backtest_df.apply(get_result_type,axis=1)
backtest_df['FTR_discretionary'] = backtest_df['FTR']
backtest_df.drop(columns=['FTR'], inplace=True)

# Obtaining the odds for favourites and underdogs
backtest_df["favourite_odds"] = backtest_df[["max_home", "max_away"]].min(axis=1)
backtest_df["underdog_odds"] = backtest_df[["max_home", "max_away"]].max(axis=1)

# Dropping any games with NaN values for odds
backtest_df = backtest_df.dropna(subset=["max_home", "max_away", "max_draw"]).reset_index(drop=True)

#Sorting by date to ensure validity of backtesting
backtest_df["Date"] = pd.to_datetime(backtest_df["Date"], dayfirst=True, errors='coerce')
backtest_df = backtest_df.sort_values(by="Date").reset_index(drop=True)

backtest_df

Unnamed: 0,fixture_id,Date,max_home,max_away,max_draw,FTR_simple,FTR_discretionary,favourite_odds,underdog_odds
0,AVLSHU170620,2020-06-17,3.35,2.41,3.52,D,D,2.41,3.35
1,MCIARS170620,2020-06-17,1.39,8.70,5.95,F,H,1.39,8.70
2,TOTMUN190620,2020-06-19,2.88,2.64,3.70,D,D,2.64,2.88
3,NORSOU190620,2020-06-19,3.36,2.32,3.90,F,A,2.32,3.36
4,WATLEI200620,2020-06-20,3.52,2.22,3.75,D,D,2.22,3.52
...,...,...,...,...,...,...,...,...,...
1774,LEIBHA081224,2024-12-08,5.00,1.73,4.40,D,D,1.73,5.00
1775,TOTCHE081224,2024-12-08,2.30,3.10,3.75,U,A,2.30,3.10
1776,FULARS081224,2024-12-08,6.40,1.62,4.33,D,D,1.62,6.40
1777,IPSBOU081224,2024-12-08,3.20,2.20,3.90,F,A,2.20,3.20


Next, we flag whether each discretionary strategy applies to a fixture using a binary 0/1 variable. This streamlines the bet simulation: the flag can simply be multiplied by the profit or loss of the bet, so if no bet is made, the bankroll does not change.
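The idea of multiplying the flag by the bet outcome can be sketched on a toy dataframe. The `unit_pnl` column and the toy values are illustrative assumptions. The simulation functions later in this notebook use an explicit loop instead, because the stake depends on the running bankroll.

```python
import pandas as pd

# Toy data, illustrative only
toy = pd.DataFrame({
    "max_home": [1.39, 3.35, 1.60],
    "FTR_discretionary": ["H", "D", "A"],
    "Bet_Home_Favourites": [1, 0, 1],
})

# Profit per unit staked if the bet were placed: odds - 1 on a win, -1 on a loss
won = toy["FTR_discretionary"] == "H"
unit_profit = (toy["max_home"] - 1.0).where(won, -1.0)

# Multiplying by the 0/1 flag zeroes out fixtures where no bet is placed
toy["unit_pnl"] = toy["Bet_Home_Favourites"] * unit_profit
print(toy)
```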

In [33]:
backtest_df["Bet_Home_Favourites"] = ((backtest_df["max_home"] >= 1.25) & (backtest_df["max_home"] <= 1.75)).astype(int)
backtest_df["Bet_Home_Underdogs"] = ((backtest_df["max_home"] >= 4.5) & (backtest_df["max_home"] <= 5.5)).astype(int)
backtest_df["Bet_Away_Underdogs"] = ((backtest_df["max_away"] >= 2.5) & (backtest_df["max_away"] <= 4.0)).astype(int)
backtest_df

Unnamed: 0,fixture_id,Date,max_home,max_away,max_draw,FTR_simple,FTR_discretionary,favourite_odds,underdog_odds,Bet_Home_Favourites,Bet_Home_Underdogs,Bet_Away_Underdogs
0,AVLSHU170620,2020-06-17,3.35,2.41,3.52,D,D,2.41,3.35,0,0,0
1,MCIARS170620,2020-06-17,1.39,8.70,5.95,F,H,1.39,8.70,1,0,0
2,TOTMUN190620,2020-06-19,2.88,2.64,3.70,D,D,2.64,2.88,0,0,1
3,NORSOU190620,2020-06-19,3.36,2.32,3.90,F,A,2.32,3.36,0,0,0
4,WATLEI200620,2020-06-20,3.52,2.22,3.75,D,D,2.22,3.52,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1774,LEIBHA081224,2024-12-08,5.00,1.73,4.40,D,D,1.73,5.00,0,1,0
1775,TOTCHE081224,2024-12-08,2.30,3.10,3.75,U,A,2.30,3.10,0,0,1
1776,FULARS081224,2024-12-08,6.40,1.62,4.33,D,D,1.62,6.40,0,0,0
1777,IPSBOU081224,2024-12-08,3.20,2.20,3.90,F,A,2.20,3.20,0,0,0


Now that we've decided on the dataframe structure let's set up the system for backtesting this strategy. The first thing we need to do is decide the stake limits for each category of strategies:

1. Simple Unidirectional (Betting on Favourites, Underdogs and Draws)
2. Discretionary Unidirectional (Betting on Home Favourites, Home Underdogs and Away Underdogs) 

There are 380 EPL matches over about 9 months, which comes round to about 10 matches per week. Hence, for Simple Unidirectional we would set a stake limit of 10%, meaning we bet 10% of our budget each time on every game.

A simple count below shows that, of the 1778 matches in our dataset, only 1084 qualify for discretionary betting opportunities that capitalise on potentially mispriced odds. In that case, let's adjust our stake limit to 25% so that we can concentrate our bets more and the returns aren't affected by the lower bet counts.

In [34]:
# Count the number of matches where at least one strategy is active (i.e., at least one '1')
bettable_matches = (backtest_df[['Bet_Home_Favourites', 'Bet_Home_Underdogs', 'Bet_Away_Underdogs']].sum(axis=1) > 0).sum()
bettable_matches

np.int64(1084)

Let's set the initial bankroll at 1000. We also set a hard cap of 500 on the stake, as used in the simulation functions below: regardless of the bankroll, a single stake never exceeds 500. This is meant to mimic bettor behaviour more realistically and could be analysed further.
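The stake rule can be sketched as a small helper. The name `capped_stake` is hypothetical and the cap is passed in as a parameter; the simulation functions in this notebook inline the same `min(...)` expression with a cap of 500.

```python
def capped_stake(bankroll, stake_pct, cap):
    """Stake a fixed fraction of the bankroll, but never more than `cap`."""
    return min(stake_pct * bankroll, cap)

small = capped_stake(1000, 0.25, 500)  # the percentage rule binds
large = capped_stake(4000, 0.25, 500)  # the hard cap binds
print(small, large)
```

With a 25% stake, the cap starts to bind once the bankroll exceeds 2000; with a 10% stake, once it exceeds 5000.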

In [35]:
initial_bankroll = 1000  

This is the function to simulate the bankroll of discretionary strategies.

In [36]:
def calculate_discretionary_bankroll(df, strategy_col, odds_col, result_condition, stake_pct=0.25):
    bankroll = initial_bankroll
    bankroll_progress = [bankroll]  # Start bankroll at 1000

    for index, row in df.iterrows():
        if row[strategy_col] == 1:  
            stake = min(stake_pct * bankroll, 500)

            if row["FTR_discretionary"] == result_condition:  # If the bet wins
                bankroll += (row[odds_col] - 1) * stake  # Profit calculation
            else:  # If the bet loses
                bankroll -= stake  # Deduct stake

        bankroll_progress.append(bankroll)

    # Drop the final entry so each row holds the bankroll before that fixture's bet
    return bankroll_progress[: len(df)]

backtest_df["Bankroll_Home_Favourites"] = calculate_discretionary_bankroll(
    backtest_df, strategy_col="Bet_Home_Favourites", odds_col="max_home", result_condition="H", stake_pct=0.25)

backtest_df["Bankroll_Home_Underdogs"] = calculate_discretionary_bankroll(
    backtest_df, strategy_col="Bet_Home_Underdogs", odds_col="max_home", result_condition="H", stake_pct=0.25)

backtest_df["Bankroll_Away_Underdogs"] = calculate_discretionary_bankroll(
    backtest_df, strategy_col="Bet_Away_Underdogs", odds_col="max_away", result_condition="A", stake_pct=0.25)
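
As a sanity check on the bookkeeping, the first five rows of `backtest_df` (values copied from the table output in this section) can be replayed by hand. The sketch below is a standalone re-implementation of the same update rule, not the notebook's function, and `replay_bankroll` is a hypothetical name. It shows that each entry is the bankroll before that row's match is settled, which is why the winning bet on row 1 only shows up in the bankroll from row 2 onwards.

```python
import pandas as pd

# First five rows of backtest_df, copied from the table output in this section
rows = pd.DataFrame({
    "max_home":            [3.35, 1.39, 2.88, 3.36, 3.52],
    "FTR_discretionary":   ["D", "H", "D", "A", "D"],
    "Bet_Home_Favourites": [0, 1, 0, 0, 0],
})

def replay_bankroll(df, start=1000, stake_pct=0.25, max_stake=500):
    bankroll, history = start, []
    for flag, odds, result in zip(df["Bet_Home_Favourites"], df["max_home"],
                                  df["FTR_discretionary"]):
        history.append(bankroll)          # bankroll before this match settles
        if flag == 1:
            stake = min(stake_pct * bankroll, max_stake)
            bankroll += (odds - 1) * stake if result == "H" else -stake
    return history

print([round(b, 2) for b in replay_bankroll(rows)])
# [1000, 1000, 1097.5, 1097.5, 1097.5]
```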

The function below simulates the bankroll for the simple betting strategies, which place a bet on every match.

In [37]:
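def calculate_simple_bankroll(df, strategy_condition, odds_col, stake_pct=0.10, max_stake=500):
    """Simulate the bankroll when betting on every match.

    Entry i of the returned list is the bankroll *before* match i is settled,
    so the first entry is `initial_bankroll` and the last match's outcome is dropped.
    """
    bankroll = initial_bankroll
    bankroll_progress = [bankroll]  # Start bankroll at 1000

    for _, row in df.iterrows():
        stake = min(stake_pct * bankroll, max_stake)  # Percentage stake, capped

        if row["FTR_simple"] == strategy_condition:  # If the bet wins
            bankroll += (row[odds_col] - 1) * stake  # Net profit at decimal odds
        else:  # If the bet loses
            bankroll -= stake  # Deduct stake

        bankroll_progress.append(bankroll)

    return bankroll_progress[: len(df)]  # Ensure length matches DataFrame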

backtest_df["Bankroll_Favourite"] = calculate_simple_bankroll(
    backtest_df, strategy_condition="F", odds_col="favourite_odds", stake_pct=0.1)

backtest_df["Bankroll_Underdog"] = calculate_simple_bankroll(
    backtest_df, strategy_condition="U", odds_col="underdog_odds", stake_pct=0.1)

backtest_df["Bankroll_Draw"] = calculate_simple_bankroll(
    backtest_df, strategy_condition="D", odds_col="max_draw", stake_pct=0.1)
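
Because the simple strategies stake 10% of the current bankroll on every match, each loss multiplies the bankroll by 0.9 and each win at decimal odds `o` multiplies it by `1 + 0.1 * (o - 1)`, as long as the stake cap is not reached. The sketch below reproduces the first rows of the table output in this section: four straight losses for the underdog strategy, then a loss followed by a win at odds 1.39 for the favourite strategy.

```python
initial_bankroll = 1000
stake_pct = 0.10

# Four consecutive losing bets: the bankroll shrinks by 10% each time
underdog_path = [initial_bankroll * (1 - stake_pct) ** n for n in range(5)]
print([round(b, 2) for b in underdog_path])
# [1000.0, 900.0, 810.0, 729.0, 656.1]

# One loss, then a win at decimal odds 1.39
after_loss = initial_bankroll * (1 - stake_pct)
after_win = after_loss * (1 + stake_pct * (1.39 - 1))
print(round(after_win, 2))  # 935.1
```

This multiplicative decay is why a strategy with a negative edge drifts towards zero without ever reaching it exactly.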

In [38]:
backtest_df

index,fixture_id,Date,max_home,max_away,max_draw,FTR_simple,FTR_discretionary,favourite_odds,underdog_odds,Bet_Home_Favourites,Bet_Home_Underdogs,Bet_Away_Underdogs,Bankroll_Home_Favourites,Bankroll_Home_Underdogs,Bankroll_Away_Underdogs,Bankroll_Favourite,Bankroll_Underdog,Bankroll_Draw
0,AVLSHU170620,2020-06-17,3.35,2.41,3.52,D,D,2.41,3.35,0,0,0,1000.000000,1000.000000,1000.000000,1000.000000,1.000000e+03,1.000000e+03
1,MCIARS170620,2020-06-17,1.39,8.70,5.95,F,H,1.39,8.70,1,0,0,1000.000000,1000.000000,1000.000000,900.000000,9.000000e+02,1.252000e+03
2,TOTMUN190620,2020-06-19,2.88,2.64,3.70,D,D,2.64,2.88,0,0,1,1097.500000,1000.000000,1000.000000,935.100000,8.100000e+02,1.126800e+03
3,NORSOU190620,2020-06-19,3.36,2.32,3.90,F,A,2.32,3.36,0,0,0,1097.500000,1000.000000,750.000000,841.590000,7.290000e+02,1.431036e+03
4,WATLEI200620,2020-06-20,3.52,2.22,3.75,D,D,2.22,3.52,0,0,0,1097.500000,1000.000000,750.000000,952.679880,6.561000e+02,1.287932e+03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1774,LEIBHA081224,2024-12-08,5.00,1.73,4.40,D,D,1.73,5.00,0,1,0,0.194394,0.384349,11171.615624,0.033001,2.877125e-11,8.878490e-15
1775,TOTCHE081224,2024-12-08,2.30,3.10,3.75,U,A,2.30,3.10,0,0,1,0.194394,0.288262,11171.615624,0.029701,2.589412e-11,1.189718e-14
1776,FULARS081224,2024-12-08,6.40,1.62,4.33,D,D,1.62,6.40,0,0,0,0.194394,0.288262,12221.615624,0.026731,3.133189e-11,1.070746e-14
1777,IPSBOU081224,2024-12-08,3.20,2.20,3.90,F,A,2.20,3.20,0,0,0,0.194394,0.288262,12221.615624,0.024058,2.819870e-11,1.427304e-14


In [51]:
backtest_df.iloc[170:180]

index,fixture_id,Date,max_home,max_away,max_draw,FTR_simple,FTR_discretionary,favourite_odds,underdog_odds,Bet_Home_Favourites,Bet_Home_Underdogs,Bet_Away_Underdogs,Bankroll_Home_Favourites,Bankroll_Home_Underdogs,Bankroll_Away_Underdogs,Bankroll_Favourite,Bankroll_Underdog,Bankroll_Draw
170,FULWBA021120,2020-11-02,2.5,2.92,3.5,F,H,2.5,2.92,0,0,1,560.318134,311.695304,8431.615624,390.103014,1027.852267,7.293929
171,LEELEI021120,2020-11-02,2.8,2.52,3.53,F,A,2.52,2.8,0,0,1,560.318134,311.695304,7931.615624,448.618467,925.067041,6.564536
172,BHABUR061120,2020-11-06,1.94,4.33,3.78,D,D,1.94,4.33,0,0,0,560.318134,311.695304,8691.615624,516.808474,832.560336,5.908082
173,SOUNEW061120,2020-11-06,1.73,4.8,4.0,F,H,1.73,4.8,1,0,0,560.318134,311.695304,8691.615624,465.127626,749.304303,7.550529
174,WHUFUL071120,2020-11-07,1.91,4.1,3.75,F,H,1.91,4.1,0,0,0,662.576194,311.695304,8691.615624,499.081943,674.373873,6.795476
175,CHESHU071120,2020-11-07,1.37,10.0,5.5,F,H,1.37,10.0,1,0,0,662.576194,311.695304,8691.615624,544.4984,606.936485,6.115929
176,CRYLEE071120,2020-11-07,3.0,2.45,3.35,U,H,2.45,3.0,0,0,0,723.864491,311.695304,8691.615624,564.644841,546.242837,5.504336
177,EVEMUN071120,2020-11-07,2.65,2.62,3.63,F,A,2.62,2.65,0,0,1,723.864491,311.695304,8691.615624,508.180356,655.491404,4.953902
178,LEIWOL081120,2020-11-08,2.3,3.4,3.38,F,H,2.3,3.4,0,0,1,723.864491,311.695304,9501.615624,590.505574,589.942264,4.458512
179,MCILIV081120,2020-11-08,2.05,3.4,3.92,D,D,2.05,3.4,0,0,1,723.864491,311.695304,9001.615624,667.271299,530.948037,4.012661


In [40]:
# Dynamically generate valid year tick marks based on dataset range
year_ticks = pd.date_range(start=backtest_df["Date"].min(), end=backtest_df["Date"].max(), freq="YS")

# Prepare data for Lets-Plot
plot_data = backtest_df.melt(
    id_vars=["Date"], 
    value_vars=["Bankroll_Favourite", "Bankroll_Underdog", "Bankroll_Draw", 
                "Bankroll_Home_Favourites", "Bankroll_Home_Underdogs", "Bankroll_Away_Underdogs"],
    var_name="Strategy", 
    value_name="Bankroll"
)
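
Two details of this cell are easy to miss. With `freq="YS"`, `pd.date_range` only returns year starts that fall inside the range, so a range beginning in mid-2020 gets its first tick at 1 January 2021. And `melt` stacks the bankroll columns into one long column, so the result has one row per match and strategy. A minimal sketch, using the first and last dates from the table output and a toy frame with two hypothetical bankroll columns and made-up values:

```python
import pandas as pd

# Year-start ticks between the first and last match dates shown in the output
ticks = pd.date_range(start="2020-06-17", end="2024-12-08", freq="YS")
print(ticks.year.tolist())  # [2021, 2022, 2023, 2024]

# Toy wide frame with two strategy columns (illustrative values)
wide = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-17", "2020-06-19"]),
    "Bankroll_A": [1000.0, 900.0],
    "Bankroll_B": [1000.0, 1252.0],
})
long = wide.melt(id_vars=["Date"], var_name="Strategy", value_name="Bankroll")
print(long.shape)  # (4, 3): one row per (match, strategy) pair
```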

We define custom legend names and colours so that the strategies are easier to tell apart in the plot.

In [41]:
custom_legend = {
    "Bankroll_Away_Underdogs": "Bet on Away Underdogs",
    "Bankroll_Favourite": "Bet on Favourites",
    "Bankroll_Underdog": "Bet on Underdogs",
    "Bankroll_Draw": "Bet on Draws",
    "Bankroll_Home_Favourites": "Bet on Home Favourites",
    "Bankroll_Home_Underdogs": "Bet on Home Underdogs",
}

custom_colors = {
    "Bankroll_Away_Underdogs": "#e41a1c",  
    "Bankroll_Favourite": "black",  
    "Bankroll_Underdog": "#AA5F10",  
    "Bankroll_Draw": "#078D71",  
    "Bankroll_Home_Favourites": "#0B64A4",  
    "Bankroll_Home_Underdogs": "#4E0991",  
}

In [45]:
bet_simulation = (
    ggplot(plot_data, aes(x="Date", y="Bankroll", color="Strategy"))
    + ggsize(800, 600)
    + geom_line(size=0.5)
    + scale_color_manual(name="Betting Strategies", values=custom_colors, labels=custom_legend)
    + scale_y_continuous(limits=(0, None), expand=[0, 0]) 
    + scale_x_datetime(breaks=year_ticks.tolist(), labels=year_ticks.year.astype(int).astype(str).tolist())  # One tick label per year
    + labs(
        x="Date",
        y="Bankroll (£)",
        title="Only Betting on Away Underdogs is Profitable in the Long Run!",
        subtitle="All other strategies eventually run your budget to zero"
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face="bold", hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color="blue"),
        legend_text=element_text(size=10),
        legend_title=element_text(size=12),
        axis_line_x=element_line(color="black", size=1), 
        axis_line_y=element_line(color="black", size=1),
        legend_position=(0.75, 0.2)  
    )
)

# Show the plot
bet_simulation
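
To put a number on the plot's headline, the final bankrolls can be compared with the starting value of 1000. The sketch below hard-codes the values from the last displayed row of `backtest_df` (index 1777) as an illustration; in the notebook they could be read from the DataFrame instead.

```python
initial_bankroll = 1000

# Bankrolls in the last displayed row (index 1777) of backtest_df
final_bankroll = {
    "Bankroll_Home_Favourites": 0.194394,
    "Bankroll_Home_Underdogs": 0.288262,
    "Bankroll_Away_Underdogs": 12221.615624,
    "Bankroll_Favourite": 0.024058,
    "Bankroll_Underdog": 2.819870e-11,
    "Bankroll_Draw": 1.427304e-14,
}

# Strategies that end above the starting bankroll
profitable = [name for name, value in final_bankroll.items() if value > initial_bankroll]
print(profitable)  # ['Bankroll_Away_Underdogs']

# Final bankroll as a multiple of the starting bankroll
growth = final_bankroll["Bankroll_Away_Underdogs"] / initial_bankroll
print(round(growth, 1))  # 12.2
```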

# 10. Overall Conclusions

**Best Bookmaker for High Odds**: Marathon Bet

**Beneficial Mispricings in Odds**:
1. Home Favourites (Odds from 1.5 to 2): Underpriced Odds
2. Home Underdogs (Odds from 4.5 to 5.5): Underpriced Odds
3. Away Underdogs (Odds from 3 to 4): Significantly Underpriced Odds
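
One way to read these odds bands is through the implied probability of decimal odds, `1 / odds`, which is the win rate at which a bet breaks even. The sketch below ignores the bookmaker's margin, which is a simplifying assumption, and `implied_probability` is a hypothetical helper rather than a function from this notebook.

```python
def implied_probability(decimal_odds):
    """Break-even win probability for a bet at the given decimal odds."""
    return 1 / decimal_odds

bands = {
    "Home Favourites": (1.5, 2.0),
    "Home Underdogs": (4.5, 5.5),
    "Away Underdogs": (3.0, 4.0),
}

for name, (low, high) in bands.items():
    # Higher odds imply a lower break-even win rate
    print(f"{name}: {implied_probability(high):.1%} to {implied_probability(low):.1%}")
# Home Favourites: 50.0% to 66.7%
# Home Underdogs: 18.2% to 22.2%
# Away Underdogs: 25.0% to 33.3%
```

A band favours the bettor when the observed win rate in that band is higher than these break-even rates.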

**Potential Strategy to Experiment With:**
Given the data we've collected, we would like to experiment with leveraging mispriced odds in our betting strategies. Hence, in NB05, we will compare the profitability of:
1. Betting on Home Favourites
2. Betting on Home Underdogs
3. Betting on Away Underdogs

and explore possible ways to optimise this betting process.

In [32]:
ggsave(top_home_bookmakers,"top_home_bookmakers.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(top_away_bookmakers,"top_away_bookmakers.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(top_draw_bookmakers,"top_draw_bookmakers.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(win_rate_across_time,"win_rate_across_time.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(odds_density,"odds_density.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(home_win_probabilities,"home_win_probabilities.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(away_win_probabilities,"away_win_probabilities.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(draw_odds_density,"draw_odds_density.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(draw_probabilities,"draw_probabilities.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(odds_difference_density,"odds_difference_density.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(win_rate_odds_differences,"win_rate_odds_differences.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(bet_simulation,"bet_simulation.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)

'c:\\Users\\Xinyan\\OneDrive\\Desktop\\DS105A\\ds105a-2024-project-good_gamblers\\data\\visualisations\\simple_betting_strategies\\win_rate_odds_differences.svg'