---

# 1. Introduction
In this notebook, we will analyse the following betting strategies:
1. **Betting on the Favourite** (team with lower odds)
2. **Betting on the Underdog** (team with higher odds)
3. **Betting on a Draw**

For our purposes, we merge the two tables from our SQLite database on each fixture's unique `fixture_id` to form a single table of odds and match outcomes. We then pick the best available odds for each outcome and use them to analyse the respective strategies.

---

# 2. Import Libraries
We import the necessary libraries for:
- Data manipulation using `pandas`.
- Interacting with the SQL database using `SQLAlchemy`.
- Generating and saving our visualisations with `lets_plot`.

In [1]:
import pandas as pd
from sqlalchemy import create_engine
from lets_plot import * 
from lets_plot import ggsave
from IPython.display import SVG

---

# 3. Merging Odds Data and Match Outcomes 

In [2]:
engine = create_engine('sqlite:///../data/sports_odds.db')
odds_df = pd.read_sql('SELECT * FROM historical_odds', con=engine)
results_df = pd.read_sql('SELECT * FROM match_results', con=engine)

On inspection, `odds_df` and `results_df` have different numbers of rows. Such discrepancies are normal and will not affect our analysis. We therefore use an inner merge in `pandas`, which keeps only the matches whose `fixture_id` is present in both dataframes.
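To see how many fixtures an inner join would drop, one option is an outer join with `indicator=True`. The sketch below uses toy stand-ins (`odds_toy`, `results_toy`, with made-up values) rather than the real tables, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-ins for odds_df and results_df (values are illustrative only)
odds_toy = pd.DataFrame({"fixture_id": ["AAA", "BBB", "CCC"],
                         "home_odds": [1.5, 2.0, 3.1]})
results_toy = pd.DataFrame({"fixture_id": ["BBB", "CCC", "DDD"],
                            "FTR": ["H", "D", "A"]})

# An outer join with indicator=True adds a `_merge` column saying
# whether each fixture was found in the left table, the right, or both
audit = odds_toy.merge(results_toy, on="fixture_id", how="outer", indicator=True)
coverage = audit["_merge"].value_counts()
print(coverage)

# validate= raises a MergeError if fixture_id is duplicated on either side
inner = odds_toy.merge(results_toy, on="fixture_id", how="inner",
                       validate="one_to_one")
```

Rows labelled `left_only` or `right_only` are the ones an inner merge discards, which makes it easy to check that nothing unexpected is being lost.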

In [3]:
merged_df = odds_df.merge(results_df, on='fixture_id', how='inner')
merged_df.head()

Unnamed: 0,match_id,commence_time,home_team,away_team,Unibet_home_odds,Unibet_away_odds,Unibet_draw_odds,Sky Bet_home_odds,Sky Bet_away_odds,Sky Bet_draw_odds,...,Grosvenor_draw_odds,Smarkets_home_odds,Smarkets_away_odds,Smarkets_draw_odds,fixture_id,Date,Time,HomeTeam,AwayTeam,FTR
0,2dd4a4f8663e6f835226a5209c614a60,2020-06-17T17:00:00Z,Aston Villa,Sheffield United,3.35,2.32,3.25,3.1,2.25,3.3,...,,,,,AVLSHU170620,17/06/2020,18:00,Aston Villa,Sheffield United,D
1,b1e029a0d989b4c11e843204003044f9,2020-06-17T19:15:00Z,Manchester City,Arsenal,1.35,8.5,5.6,1.36,7.5,5.25,...,,,,,MCIARS170620,17/06/2020,20:15,Man City,Arsenal,H
2,59d68295dc2213634772cd941c91fa11,2020-06-19T19:15:00Z,Tottenham Hotspur,Manchester United,2.75,2.6,3.3,2.7,2.5,3.4,...,,,,,TOTMUN190620,19/06/2020,20:15,Tottenham,Man United,D
3,88352746f45f6beb4e2cb662d9414d0f,2020-06-20T11:30:00Z,Watford,Leicester City,3.4,2.15,3.45,3.25,2.2,3.4,...,,,,,WATLEI200620,20/06/2020,12:30,Watford,Leicester,D
4,065ae59da20562892de52b7f5598ecbf,2020-06-20T16:30:00Z,West Ham United,Wolverhampton Wanderers,3.5,2.15,3.35,3.3,2.2,3.3,...,,,,,WHUWOL200620,20/06/2020,17:30,West Ham,Wolves,A


---

# 4. Extracting the Highest Odds for Each Fixture and Outcome

We understand that consistently getting the best odds for each fixture and outcome may be difficult in practice for a retail bettor. However, this layer of analysis tells us which bookmakers could potentially offer better odds for certain teams or outcomes, and it lets us estimate the profitability of each strategy more accurately.
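The `process_row` helper lives in a separate `functions` module and is not shown in this notebook. As a rough idea of what such a helper might do, here is a hypothetical stand-in, `best_odds_sketch`, which assumes the `<Bookmaker>_<outcome>_odds` column naming seen in the merged table. It is a sketch, not the project's actual implementation.

```python
import pandas as pd

def best_odds_sketch(row):
    """Hypothetical stand-in for process_row: best price and bookmaker per outcome."""
    out = {}
    for outcome in ("home", "away", "draw"):
        suffix = f"_{outcome}_odds"
        # Keep only this outcome's columns and drop bookmakers with no quote
        cols = [c for c in row.index if c.endswith(suffix)]
        quotes = pd.to_numeric(row[cols], errors="coerce").dropna()
        if quotes.empty:
            out[f"max_{outcome}"] = float("nan")
            out[f"{outcome}_bookmaker"] = None
        else:
            best_col = quotes.idxmax()
            out[f"max_{outcome}"] = quotes[best_col]
            # Strip the suffix to recover the bookmaker name
            out[f"{outcome}_bookmaker"] = best_col[: -len(suffix)]
    return pd.Series(out)

# Two bookmakers' quotes for the first fixture, plus one with a missing home quote
row = pd.Series({
    "home_team": "Aston Villa",
    "Unibet_home_odds": 3.35, "Unibet_away_odds": 2.32, "Unibet_draw_odds": 3.25,
    "Sky Bet_home_odds": 3.10, "Sky Bet_away_odds": 2.25, "Sky Bet_draw_odds": 3.30,
    "Smarkets_home_odds": float("nan"),
})
best = best_odds_sketch(row)
print(best)
```

Because only two bookmakers are included here, the result differs from the full table, where other bookmakers offer better away and draw prices.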

In [4]:
# Import the function to process each row and obtain highest odds
from functions import process_row

# Apply the function to each row
new_columns = merged_df.apply(process_row, axis=1)
merged_df = pd.concat([merged_df, new_columns], axis=1)
merged_df

Unnamed: 0,match_id,commence_time,home_team,away_team,Unibet_home_odds,Unibet_away_odds,Unibet_draw_odds,Sky Bet_home_odds,Sky Bet_away_odds,Sky Bet_draw_odds,...,Time,HomeTeam,AwayTeam,FTR,max_home,home_bookmaker,max_away,away_bookmaker,max_draw,draw_bookmaker
0,2dd4a4f8663e6f835226a5209c614a60,2020-06-17T17:00:00Z,Aston Villa,Sheffield United,3.35,2.32,3.25,3.10,2.25,3.30,...,18:00,Aston Villa,Sheffield United,D,3.35,Unibet,2.41,Marathon Bet,3.52,Marathon Bet
1,b1e029a0d989b4c11e843204003044f9,2020-06-17T19:15:00Z,Manchester City,Arsenal,1.35,8.50,5.60,1.36,7.50,5.25,...,20:15,Man City,Arsenal,H,1.39,Marathon Bet,8.70,Marathon Bet,5.95,Marathon Bet
2,59d68295dc2213634772cd941c91fa11,2020-06-19T19:15:00Z,Tottenham Hotspur,Manchester United,2.75,2.60,3.30,2.70,2.50,3.40,...,20:15,Tottenham,Man United,D,2.88,Betfair,2.64,Marathon Bet,3.70,Marathon Bet
3,88352746f45f6beb4e2cb662d9414d0f,2020-06-20T11:30:00Z,Watford,Leicester City,3.40,2.15,3.45,3.25,2.20,3.40,...,12:30,Watford,Leicester,D,3.52,Marathon Bet,2.22,Marathon Bet,3.75,Marathon Bet
4,065ae59da20562892de52b7f5598ecbf,2020-06-20T16:30:00Z,West Ham United,Wolverhampton Wanderers,3.50,2.15,3.35,3.30,2.20,3.30,...,17:30,West Ham,Wolves,A,3.70,Betfair,2.23,Marathon Bet,3.60,Marathon Bet
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1777,ac9d59dd7555c122948f98b8b2a19c8e,2024-12-08T14:00:00Z,Fulham,Arsenal,,,,6.25,1.44,4.33,...,14:00,Fulham,Arsenal,D,6.40,Smarkets,1.62,BoyleSports,4.33,Sky Bet
1778,8899ce68b0c9e451b17edc3f2e076a6b,2024-12-08T14:00:00Z,Ipswich Town,Bournemouth,,,,3.10,2.15,3.50,...,14:00,Ipswich,Bournemouth,A,3.20,Paddy Power,2.20,Smarkets,3.90,Betfair
1779,d9163188e3e3d9a8bf2525e2f9e3a553,2024-12-08T14:00:00Z,Leicester City,Brighton and Hove Albion,,,,4.75,1.62,4.20,...,14:00,Leicester,Brighton,D,5.00,Paddy Power,1.73,BoyleSports,4.40,Coral
1780,c87f3d3551a57bd560cb056ade831890,2024-12-08T16:30:00Z,Tottenham Hotspur,Chelsea,,,,2.20,2.80,3.75,...,16:30,Tottenham,Chelsea,A,2.30,Paddy Power,3.10,Smarkets,3.75,Sky Bet


In [5]:
merged_df = merged_df[['fixture_id', 'Date', 'home_team', 'away_team', 'FTR', 'max_home', 'home_bookmaker', 'max_away', 'away_bookmaker', 'max_draw', 'draw_bookmaker']]
merged_df

Unnamed: 0,fixture_id,Date,home_team,away_team,FTR,max_home,home_bookmaker,max_away,away_bookmaker,max_draw,draw_bookmaker
0,AVLSHU170620,17/06/2020,Aston Villa,Sheffield United,D,3.35,Unibet,2.41,Marathon Bet,3.52,Marathon Bet
1,MCIARS170620,17/06/2020,Manchester City,Arsenal,H,1.39,Marathon Bet,8.70,Marathon Bet,5.95,Marathon Bet
2,TOTMUN190620,19/06/2020,Tottenham Hotspur,Manchester United,D,2.88,Betfair,2.64,Marathon Bet,3.70,Marathon Bet
3,WATLEI200620,20/06/2020,Watford,Leicester City,D,3.52,Marathon Bet,2.22,Marathon Bet,3.75,Marathon Bet
4,WHUWOL200620,20/06/2020,West Ham United,Wolverhampton Wanderers,A,3.70,Betfair,2.23,Marathon Bet,3.60,Marathon Bet
...,...,...,...,...,...,...,...,...,...,...,...
1777,FULARS081224,08/12/2024,Fulham,Arsenal,D,6.40,Smarkets,1.62,BoyleSports,4.33,Sky Bet
1778,IPSBOU081224,08/12/2024,Ipswich Town,Bournemouth,A,3.20,Paddy Power,2.20,Smarkets,3.90,Betfair
1779,LEIBHA081224,08/12/2024,Leicester City,Brighton and Hove Albion,D,5.00,Paddy Power,1.73,BoyleSports,4.40,Coral
1780,TOTCHE081224,08/12/2024,Tottenham Hotspur,Chelsea,A,2.30,Paddy Power,3.10,Smarkets,3.75,Sky Bet


---

# 5. EDA: Which Bookmakers Offer the Best Odds?
First, let us look at the prevalence of bookmakers in providing the best odds. This gives us some insight into which bookmakers we can focus on when deciding to place bets, streamlining the process.

In [6]:
LetsPlot.setup_html()

In [7]:
# Function to get top bookmakers
def get_top_bookmakers(dataframe, column):
    """Return the five most frequent bookmakers in `column`, as percentages."""

    # Share (in %) of non-missing entries attributed to each bookmaker
    counts = dataframe[column].value_counts(normalize=True) * 100

    # Get the top bookmakers
    top_counts = counts.nlargest(5).reset_index()
    top_counts.columns = ['Bookmaker', 'Percentage']

    return top_counts

top_home = get_top_bookmakers(merged_df, 'home_bookmaker')
top_away = get_top_bookmakers(merged_df, 'away_bookmaker')
top_draw = get_top_bookmakers(merged_df, 'draw_bookmaker')

In [8]:
top_home_bookmakers = (
    ggplot(top_home, aes(x='Bookmaker', y='Percentage'))
    + geom_bar(stat='identity', fill='#024B04')
    + ggsize(800, 400)
    + labs(
        title='Marathon Bet gives the highest home odds almost 20% of the time!',
        subtitle='The top five bookmakers provide the best odds almost 60% of the time!'
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue')
    )
)
top_home_bookmakers


In [9]:
top_away_bookmakers = (
    ggplot(top_away, aes(x='Bookmaker', y='Percentage'))
    + geom_bar(stat='identity', fill='#8B2E01')
    + ggsize(800, 400)
    + labs(
        title='Marathon Bet gives the highest away odds again!',
        subtitle='Similarly, the top five bookmakers provide the best away odds almost 60% of the time'
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue')
    )
)
top_away_bookmakers

In [10]:
top_draw_bookmakers = (
    ggplot(top_draw, aes(x='Bookmaker', y='Percentage'))
    + geom_bar(stat='identity', fill='grey')
    + ggsize(800, 400)
    + labs(
        title='Marathon Bet gives the highest odds almost 30% of the time, our clear winner!',
        subtitle='Sky Bet and Virgin Bet are additional contenders for good draw odds'
    )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=20),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue'),
    )
)
top_draw_bookmakers

In [11]:
ggsave(top_home_bookmakers,"top_home_bookmakers.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(top_away_bookmakers,"top_away_bookmakers.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)
ggsave(top_draw_bookmakers,"top_draw_bookmakers.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)

'c:\\Users\\Xinyan\\Desktop\\DS105A\\ds105a-2024-project-good_gamblers\\data\\visualisations\\simple_betting_strategies\\top_draw_bookmakers.svg'

**Conclusion**: Marathon Bet has the highest probability of providing the best odds for any particular outcome (Home, Away, Draw). This is unsurprising given its reputation as a low-margin bookmaker in the industry. Other strong contenders to consider are Paddy Power, Betclic, William Hill and Unibet.

---

# 6. EDA: Which Strategy Has the Highest Win Rate?
Before diving into the actual profitability of each strategy, let's look at the win rates of the respective strategies over time to see if there are any significant changes to the dynamics of EPL games (e.g. underdogs taking over).

First, we need to convert the full-time result into the outcome of each strategy: if the team with the lower odds wins, we label the result favourite (`F`); if the team with the higher odds wins, we label it underdog (`U`); a draw stays `D`.

In [12]:
def get_result_type(row):
    
    home_odds = row['max_home']
    away_odds = row['max_away']
    result = row['FTR']
        
    if home_odds > away_odds:
        home = 0 # 0 denotes the underdog team
        away = 1
    else:
        home = 1
        away = 0

    if result == 'D':
        result_type = 'D'
    elif result == 'H':
        if home == 0:
            result_type = 'U'
        else:
            result_type = 'F'
    else:
        if away == 0:
            result_type = 'U'
        else:
            result_type = 'F'
    
    return result_type
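The same labelling can also be written without a row-wise `apply`. The sketch below is one possible vectorised equivalent, checked on four fixtures taken from the table above. As in `get_result_type`, equal home and away odds are treated as the home side being the favourite.

```python
import pandas as pd

# Four fixtures from the merged table (rows 0, 1, 4 and 1780)
toy = pd.DataFrame({
    "max_home": [3.35, 1.39, 3.70, 2.30],
    "max_away": [2.41, 8.70, 2.23, 3.10],
    "FTR": ["D", "H", "A", "A"],
})

# The favourite is the side with the lower price; ties go to the home side
favourite_side = (toy["max_home"] > toy["max_away"]).map({True: "A", False: "H"})

toy["result_type"] = "U"                                      # default: underdog won
toy.loc[toy["FTR"] == favourite_side, "result_type"] = "F"    # favourite won
toy.loc[toy["FTR"] == "D", "result_type"] = "D"               # draw
print(toy)
```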

In [13]:
strategy_df = merged_df.copy()
strategy_df['result_type'] = strategy_df.apply(get_result_type, axis=1)
strategy_df = strategy_df[['fixture_id', 'Date', 'home_team', 'away_team', 'result_type']]
strategy_df

Unnamed: 0,fixture_id,Date,home_team,away_team,result_type
0,AVLSHU170620,17/06/2020,Aston Villa,Sheffield United,D
1,MCIARS170620,17/06/2020,Manchester City,Arsenal,F
2,TOTMUN190620,19/06/2020,Tottenham Hotspur,Manchester United,D
3,WATLEI200620,20/06/2020,Watford,Leicester City,D
4,WHUWOL200620,20/06/2020,West Ham United,Wolverhampton Wanderers,F
...,...,...,...,...,...
1777,FULARS081224,08/12/2024,Fulham,Arsenal,D
1778,IPSBOU081224,08/12/2024,Ipswich Town,Bournemouth,F
1779,LEIBHA081224,08/12/2024,Leicester City,Brighton and Hove Albion,D
1780,TOTCHE081224,08/12/2024,Tottenham Hotspur,Chelsea,U


Now, let's break this down by year so we can visualise how win rates vary over time.

In [14]:
strategy_df['Date'] = pd.to_datetime(strategy_df['Date'], dayfirst=True)
strategy_df['year'] = strategy_df['Date'].dt.year
strategy_df

Unnamed: 0,fixture_id,Date,home_team,away_team,result_type,year
0,AVLSHU170620,2020-06-17,Aston Villa,Sheffield United,D,2020
1,MCIARS170620,2020-06-17,Manchester City,Arsenal,F,2020
2,TOTMUN190620,2020-06-19,Tottenham Hotspur,Manchester United,D,2020
3,WATLEI200620,2020-06-20,Watford,Leicester City,D,2020
4,WHUWOL200620,2020-06-20,West Ham United,Wolverhampton Wanderers,F,2020
...,...,...,...,...,...,...
1777,FULARS081224,2024-12-08,Fulham,Arsenal,D,2024
1778,IPSBOU081224,2024-12-08,Ipswich Town,Bournemouth,F,2024
1779,LEIBHA081224,2024-12-08,Leicester City,Brighton and Hove Albion,D,2024
1780,TOTCHE081224,2024-12-08,Tottenham Hotspur,Chelsea,U,2024


In [15]:
# Count fixtures by year and result type
result_counts = strategy_df.groupby(['year', 'result_type']).size().reset_index(name='count')
total_matches = strategy_df.groupby('year').size().reset_index(name='total_matches')
result_counts = result_counts.merge(total_matches, on='year')

# Calculate win rates for each strategy
result_counts['win_rate'] = result_counts['count'] / result_counts['total_matches']

result_counts

Unnamed: 0,year,result_type,count,total_matches,win_rate
0,2020,D,60,264,0.227273
1,2020,F,137,264,0.518939
2,2020,U,67,264,0.253788
3,2021,D,93,411,0.226277
4,2021,F,219,411,0.532847
5,2021,U,99,411,0.240876
6,2022,D,76,362,0.209945
7,2022,F,204,362,0.563536
8,2022,U,82,362,0.226519
9,2023,D,86,412,0.208738
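As a cross-check on the group-and-merge approach above, `pd.crosstab` with `normalize="index"` produces the per-year shares in one step. The toy frame below is made up purely to show the shape of the result.

```python
import pandas as pd

# Illustrative data only: six fixtures over two years
toy = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020, 2021, 2021],
    "result_type": ["F", "F", "D", "U", "F", "U"],
})

# normalize="index" divides each row by its total, so every year sums to 1
win_rates = pd.crosstab(toy["year"], toy["result_type"], normalize="index")
print(win_rates)
```

The result is already in wide form (one column per result type), so it could replace the later pivot step as well.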


In [16]:
# Reshape the data for plotting
win_rate_df = result_counts.pivot(index='year', columns='result_type', values='win_rate').reset_index()
win_rate_df = win_rate_df.rename(columns={'F': 'Bet on Favorite', 'U': 'Bet on Underdog', 'D': 'Bet on Draw'})

# Melt for easier plotting
win_rate_melted = win_rate_df.melt(id_vars='year', var_name='Strategy', value_name='Win Rate')
win_rate_melted['year'] = pd.to_datetime(win_rate_melted['year'], format='%Y')

# Reordering so legends appear in proper order in plot
desired_order = ['Bet on Favorite', 'Bet on Underdog', 'Bet on Draw']
win_rate_melted['Strategy'] = pd.Categorical(win_rate_melted['Strategy'], categories=desired_order, ordered=True)
win_rate_melted

Unnamed: 0,year,Strategy,Win Rate
0,2020-01-01,Bet on Draw,0.227273
1,2021-01-01,Bet on Draw,0.226277
2,2022-01-01,Bet on Draw,0.209945
3,2023-01-01,Bet on Draw,0.208738
4,2024-01-01,Bet on Draw,0.258258
5,2020-01-01,Bet on Favorite,0.518939
6,2021-01-01,Bet on Favorite,0.532847
7,2022-01-01,Bet on Favorite,0.563536
8,2023-01-01,Bet on Favorite,0.558252
9,2024-01-01,Bet on Favorite,0.561562


In [17]:
win_rate = (
    ggplot(win_rate_melted, aes(x='year', y='Win Rate', color='Strategy'))
    + ggsize(800,400)
    + geom_line(size=1.5)
    + geom_point(size=3)
    + labs(
        x = 'Time',
        y = 'Win Rate',
        title = 'Expectedly, betting on the favourite team has the highest win rate',
        subtitle = 'Surprisingly, the prevalence of draws has seen a noticeable increase from 2023 to 2024.'
        )
    + theme(
        axis_text_x=element_text(size=12, angle=0, hjust=1),
        plot_title=element_text(face='bold', hjust=0.5, size=22),
        plot_subtitle=element_text(size=16, hjust=0.5, color='blue'),
        legend_position=(0.5, 0.5)
        )
    )

win_rate

In [18]:
ggsave(win_rate,"strategy_win_rates.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)

'c:\\Users\\Xinyan\\Desktop\\DS105A\\ds105a-2024-project-good_gamblers\\data\\visualisations\\simple_betting_strategies\\strategy_win_rates.svg'

**Conclusion**: As expected, betting on the favourite has the greatest chance of success, at over 50% on average. However, this is priced into the odds (favourites have lower odds), so a high win rate is not necessarily indicative of profitability.

It is, however, interesting to observe that the prevalence of draws has increased over these two seasons. We speculate that this might be due to mean reversion in the overall abilities of teams: 'traditional underdogs' like Aston Villa or Nottingham Forest have enjoyed great form, while 'traditional powerhouses' like Manchester United have experienced the opposite.
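To make the point about win rates being priced into the odds concrete, the small sketch below relates a win rate to decimal odds. The helper names are ours, and it assumes flat one-unit stakes at a single fixed price, which is a simplification of real betting.

```python
def expected_profit_per_unit(win_rate, decimal_odds):
    """Average profit per unit staked: win (odds - 1) with prob. p, lose 1 otherwise."""
    return win_rate * decimal_odds - 1.0

def break_even_odds(win_rate):
    """Decimal odds at which a strategy with this win rate neither gains nor loses."""
    return 1.0 / win_rate

# A strategy that wins half the time only breaks even at odds of 2.0
print(break_even_odds(0.5))
print(expected_profit_per_unit(0.5, 2.0))

# At shorter odds the same win rate loses money on average
print(expected_profit_per_unit(0.5, 1.8))
```

A strategy is only profitable if the odds on offer are longer than the break-even odds implied by its win rate.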

---

# 7. Are Odds Biased Against Home or Away Teams?


In [22]:
stadium_df = merged_df.copy()
stadium_df = stadium_df[['fixture_id','FTR', 'max_home','max_away']]
stadium_df

Unnamed: 0,fixture_id,FTR,max_home,max_away
0,AVLSHU170620,D,3.35,2.41
1,MCIARS170620,H,1.39,8.70
2,TOTMUN190620,D,2.88,2.64
3,WATLEI200620,D,3.52,2.22
4,WHUWOL200620,A,3.70,2.23
...,...,...,...,...
1777,FULARS081224,D,6.40,1.62
1778,IPSBOU081224,A,3.20,2.20
1779,LEIBHA081224,D,5.00,1.73
1780,TOTCHE081224,A,2.30,3.10


In [23]:
# Melt the DataFrame
odds_melted = stadium_df.melt(value_vars=['max_home', 'max_away'], 
                      var_name='team_type', value_name='odds')

# Rename categories for better readability
odds_melted['team_type'] = odds_melted['team_type'].replace({
    'max_home': 'Home Odds',
    'max_away': 'Away Odds'
})

Unnamed: 0,team_type,odds
0,Home Odds,3.35
1,Home Odds,1.39
2,Home Odds,2.88
3,Home Odds,3.52
4,Home Odds,3.70
...,...,...
3559,Away Odds,1.62
3560,Away Odds,2.20
3561,Away Odds,1.73
3562,Away Odds,3.10


In [36]:
odds_density_plot = (
        ggplot(odds_melted, aes(x='odds', fill='team_type', color='team_type'))
        + ggsize(800,400)
        + geom_density(alpha=0.4)  # Add transparency for overlapping densities
        + labs(x='Odds', y='Density', color='Team Type', fill='Team Type',
               title = 'Density Plot of Home vs Away Odds',
               subtitle = 'Home odds peak at 1.6 and have much greater density in the 0 to 4 region')
        + scale_x_continuous(limits=(0, 20))  # Limit the x-axis to odds between 0 and 20
        + theme(axis_text_x=element_text(size=12, angle=0, hjust=0.5),
             axis_title_x=element_text(size=14),
             axis_title_y=element_text(size=14),
             plot_title=element_text(size=22, face='bold', hjust = 0.5),
             plot_subtitle=element_text(size=16, hjust=0.5, color='blue')
             )
     )
odds_density_plot

In [None]:
ggsave(odds_density_plot,"odds_density_plot.svg", path = "../data/visualisations/simple_betting_strategies", dpi = 300)

**Conclusion:** As expected, home teams tend to be the favourites: home odds peak at about 1.6, compared with about 2.4 for away odds. The home odds distribution also sits noticeably further to the left (towards lower odds).

In [51]:
# Define bins for home and away odds
bins = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 8.0, 10.0]

# Create bins for home and away odds
stadium_df['home_odds_bin'] = pd.cut(stadium_df['max_home'], bins=bins, include_lowest=True)
stadium_df['away_odds_bin'] = pd.cut(stadium_df['max_away'], bins=bins, include_lowest=True)

# Preview the binned data
print(stadium_df[['max_home', 'home_odds_bin', 'max_away', 'away_odds_bin']])

      max_home home_odds_bin  max_away away_odds_bin
0         3.35    (3.0, 3.5]      2.41    (2.0, 2.5]
1         1.39  (0.999, 1.5]      8.70   (8.0, 10.0]
2         2.88    (2.5, 3.0]      2.64    (2.5, 3.0]
3         3.52    (3.5, 4.0]      2.22    (2.0, 2.5]
4         3.70    (3.5, 4.0]      2.23    (2.0, 2.5]
...        ...           ...       ...           ...
1777      6.40    (6.0, 8.0]      1.62    (1.5, 2.0]
1778      3.20    (3.0, 3.5]      2.20    (2.0, 2.5]
1779      5.00    (4.5, 5.0]      1.73    (1.5, 2.0]
1780      2.30    (2.0, 2.5]      3.10    (3.0, 3.5]
1781      1.96    (1.5, 2.0]      3.60    (3.5, 4.0]

[1782 rows x 4 columns]
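One detail of `pd.cut` worth keeping in mind: values outside the outermost bin edges are assigned `NaN` and silently drop out of later group-bys. With the bins above, that applies to any odds greater than 10. The sketch below shows this on three sample values (12.0 is made up) and one possible workaround, an open-ended final bin.

```python
import pandas as pd

bins = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 8.0, 10.0]
sample = pd.Series([1.39, 8.70, 12.0])  # 12.0 is an illustrative long-shot price

# Values above the last edge fall outside every bin and become NaN
binned = pd.cut(sample, bins=bins, include_lowest=True)
print(binned)

# Adding an infinite upper edge creates a catch-all (10.0, inf] bin
open_ended = pd.cut(sample, bins=bins + [float("inf")], include_lowest=True)
print(open_ended)
```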


In [52]:
# Calculate implied probabilities for each bin (using average odds)
# observed=False keeps empty bins in the result
home_implied = stadium_df.groupby('home_odds_bin', observed=False).agg(
    avg_home_odds=('max_home', 'mean')
).reset_index()
home_implied['implied_home_prob'] = 1 / home_implied['avg_home_odds']

away_implied = stadium_df.groupby('away_odds_bin', observed=False).agg(
    avg_away_odds=('max_away', 'mean')
).reset_index()
away_implied['implied_away_prob'] = 1 / away_implied['avg_away_odds']
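Note that taking `1 / mean(odds)` within a bin is not the same as averaging the per-fixture implied probabilities `1 / odds`. Because `1/x` is convex, the second is never smaller than the first, and the gap grows in wide bins such as (6.0, 8.0]. The sketch below illustrates this with made-up odds.

```python
from statistics import mean

# Illustrative odds falling in the wide (6.0, 8.0] bin
odds_in_bin = [6.1, 6.4, 7.9]

# Approach used above: implied probability of the average odds
prob_of_mean = 1 / mean(odds_in_bin)

# Alternative: average of the per-fixture implied probabilities
mean_of_probs = mean(1 / o for o in odds_in_bin)

print(prob_of_mean, mean_of_probs)
```

For the narrow 0.5-wide bins the difference is small, so this mostly matters for the two widest bins.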


In [53]:
# Calculate actual probabilities (win rates) for each bin
home_actual = stadium_df.groupby('home_odds_bin', observed=False)['FTR'].apply(lambda x: (x == 'H').mean()).reset_index(name='actual_home_win_rate')
away_actual = stadium_df.groupby('away_odds_bin', observed=False)['FTR'].apply(lambda x: (x == 'A').mean()).reset_index(name='actual_away_win_rate')


In [54]:
# Merge implied and actual probabilities then melt dataframes for plotting
home_analysis = home_actual.merge(home_implied, on='home_odds_bin')
away_analysis = away_actual.merge(away_implied, on='away_odds_bin')

home_analysis_melted = home_analysis.melt(
    id_vars='home_odds_bin',
    value_vars=['actual_home_win_rate', 'implied_home_prob'],
    var_name='Type',
    value_name='Probability'
)

away_analysis_melted = away_analysis.melt(
    id_vars='away_odds_bin',
    value_vars=['actual_away_win_rate', 'implied_away_prob'],
    var_name='Type',
    value_name='Probability'
)

In [60]:
# Assign custom labels for the legend
home_analysis_melted['Type'] = home_analysis_melted['Type'].replace({
    'actual_home_win_rate': 'Actual Win Rate',
    'implied_home_prob': 'Implied Probability'
})

# Create the plot
p_home = (
    ggplot(home_analysis_melted, aes(x='home_odds_bin', y='Probability', color='Type'))
    + ggsize(800, 400)
    + geom_line(size=1)
    + ggtitle('Home Teams: Implied vs Actual Winning Probability')
    + labs(x='Home Odds Bin', y='Probability', color='Legend')  # Set legend title
    + scale_color_manual(
        values={'Actual Win Rate': 'Black', 'Implied Probability': '#C74506'}  
    )
    + theme(
        axis_text_x=element_text(size=12, angle=45, hjust=1),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        legend_position='bottom'  # Position the legend at the bottom
    )
)
p_home

In [63]:
# Melt the DataFrame to reshape it for plotting
away_analysis_melted = away_analysis.melt(
    id_vars='away_odds_bin',
    value_vars=['actual_away_win_rate', 'implied_away_prob'],
    var_name='Type',
    value_name='Probability'
)

# Assign custom labels for the legend
away_analysis_melted['Type'] = away_analysis_melted['Type'].replace({
    'actual_away_win_rate': 'Actual Win Rate',
    'implied_away_prob': 'Implied Probability'
})

# Create the plot
p_away = (
    ggplot(away_analysis_melted, aes(x='away_odds_bin', y='Probability', color='Type'))
    + ggsize(800, 400)
    + geom_line(size=1)
    + ggtitle('Away Teams: Implied vs Actual Winning Probability')
    + labs(x='Away Odds Bin', y='Probability', color='Legend')  # Set legend title
    + scale_color_manual(
        values={'Actual Win Rate': 'Black', 'Implied Probability': '#7D11B7'}  
    )
    + theme(
        axis_text_x=element_text(size=12, angle=45, hjust=1),
        axis_title_x=element_text(size=14),
        axis_title_y=element_text(size=14),
        plot_title=element_text(size=22, face='bold', hjust=0.5),
        legend_position='bottom'  # Position the legend at the bottom
    )
)
p_away