# Sports Betting EV Prediction Project

In this project, we analyze historical English Premier League data, build predictive models for match outcomes, calculate expected value (EV) for betting opportunities, and simulate potential profits using backtesting.


### Data Loading and Cleaning

We start by loading English Premier League (EPL) historical match data from Football-Data.co.uk (the `E0.csv` season file). We select key columns (home and away teams, full-time result, and Bet365 odds) and compute implied probabilities from the odds, then normalize them so they sum to 1, which removes the bookmaker overround.
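
As a minimal sketch of the overround adjustment described above, the helper below converts three decimal odds into probabilities that sum to 1. The function name and the example odds are illustrative assumptions, not values from the dataset.

```python
def normalized_implied_probs(home_odds, draw_odds, away_odds):
    """Convert decimal odds to implied probabilities that sum to 1."""
    # Raw implied probabilities: 1 / decimal odds
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    # The raw values sum to more than 1; the excess is the bookmaker's margin
    total = sum(raw)
    overround = total - 1
    # Dividing by the total rescales the three values to sum to exactly 1
    return [p / total for p in raw], overround


# Illustrative odds only (not taken from E0.csv)
probs, overround = normalized_implied_probs(2.10, 3.40, 3.60)
print("Normalized probabilities:", [round(p, 3) for p in probs])
print("Overround:", round(overround, 3))
```

This proportional rescaling is the same operation the cleaning cell applies column-wise with `Total_Imp_Prob`; other ways of removing the margin exist, but they are not used here.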


In [None]:
import pandas as pd

# Load the EPL season CSV (E0.csv)
df = pd.read_csv('E0.csv')

# Preview the data
print(df.head())

# Select key columns
df = df[['Date', 'HomeTeam', 'AwayTeam', 'FTR', 'B365H', 'B365D', 'B365A']]

# Rename columns for easier handling
df = df.rename(columns={
    'B365H': 'Home_Odds',
    'B365D': 'Draw_Odds',
    'B365A': 'Away_Odds',
    'FTR': 'Result'
})

# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')

# Drop rows with missing odds
df = df.dropna(subset=['Home_Odds', 'Draw_Odds', 'Away_Odds'])

# Compute implied probabilities (1/odds)
df['Imp_Prob_Home'] = 1 / df['Home_Odds']
df['Imp_Prob_Draw'] = 1 / df['Draw_Odds']
df['Imp_Prob_Away'] = 1 / df['Away_Odds']

# Normalize probabilities (bookmaker overround adjustment)
df['Total_Imp_Prob'] = df['Imp_Prob_Home'] + df['Imp_Prob_Draw'] + df['Imp_Prob_Away']
df['Norm_Prob_Home'] = df['Imp_Prob_Home'] / df['Total_Imp_Prob']
df['Norm_Prob_Draw'] = df['Imp_Prob_Draw'] / df['Total_Imp_Prob']
df['Norm_Prob_Away'] = df['Imp_Prob_Away'] / df['Total_Imp_Prob']

# Encode result: H → 0, D → 1, A → 2
result_map = {'H': 0, 'D': 1, 'A': 2}
df['Result_Code'] = df['Result'].map(result_map)

# Final preview
print(df.head())

# Save cleaned data if needed
df.to_csv('cleaned_epl_data.csv', index=False)


### Exploratory Data Analysis (EDA)

Before modeling, we explore the data to understand its basic characteristics. We review the distribution of match outcomes (home win, draw, away win) and summarize the odds data to identify any class imbalances or unexpected patterns.


In [None]:
print("Match outcome counts:")
print(df['Result'].value_counts())

print("\nOdds summary stats:")
print(df[['Home_Odds', 'Draw_Odds', 'Away_Odds']].describe())


### Model Training and Evaluation

We train a balanced logistic regression model (with class weighting) and a random forest classifier to predict match outcomes. For comparison, we also fit an unbalanced logistic regression model as a retrospective baseline. We evaluate all models using accuracy, precision, recall, and F1 score to determine which performs best and should feed the downstream expected value (EV) calculations.
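
To make the class weighting concrete, here is a small sketch of the weights that scikit-learn's `class_weight='balanced'` option assigns, which its documentation gives as `n_samples / (n_classes * count_of_class)`. The helper name and the label counts below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical outcome counts: 0 = home win, 1 = draw, 2 = away win
labels = [0] * 45 + [1] * 25 + [2] * 30
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

Rarer outcomes such as draws receive larger weights, so errors on them cost more during training. One consequence worth keeping in mind is that reweighting tends to push predicted probabilities toward the rarer classes, so they may no longer match observed outcome frequencies.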


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Features: normalized bookmaker probabilities
features = df[['Norm_Prob_Home', 'Norm_Prob_Draw', 'Norm_Prob_Away']]
target = df['Result_Code']

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Logistic Regression with class balancing
logreg_balanced = LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced')
logreg_balanced.fit(X_train, y_train)
logreg_balanced_preds = logreg_balanced.predict(X_test)
logreg_balanced_probs = logreg_balanced.predict_proba(X_test)

print("Balanced Logistic Regression Performance:")
print(classification_report(y_test, logreg_balanced_preds, zero_division=0))
print("Accuracy:", accuracy_score(y_test, logreg_balanced_preds))


# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)
rf_probs = rf.predict_proba(X_test)

print("\nRandom Forest Performance:")
print(classification_report(y_test, rf_preds, zero_division=0))
print("Accuracy:", accuracy_score(y_test, rf_preds))


### Unbalanced Logistic Regression (Retrospective)

While our main model flow uses the balanced logistic regression setup, we also trained an unbalanced logistic regression as a conceptual baseline. This helped us understand how class imbalance affected model predictions, particularly for underrepresented outcomes like draws.


In [None]:
# Unbalanced Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Train logistic regression WITHOUT class balancing
logreg = LogisticRegression(solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
logreg_preds = logreg.predict(X_test)
logreg_probs = logreg.predict_proba(X_test)

print("Unbalanced Logistic Regression Performance:")
print(classification_report(y_test, logreg_preds, zero_division=0))
print("Accuracy:", accuracy_score(y_test, logreg_preds))

### Expected Value (EV) Calculation and Backtest Simulation

We calculate the expected value (EV) of a $1 stake on each outcome using the model’s predicted probabilities and the bookmaker’s decimal odds. For each match we take the outcome with the highest EV and, if that EV is positive, simulate a $1 bet on it, then report total profit and ROI per bet. This combined analysis helps assess whether following the model’s recommendations would have been profitable historically.
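
As a worked sketch of the arithmetic, assume decimal odds, where a winning $1 bet returns the odds and therefore earns `odds - 1` in profit. The EV of a bet with win probability `p` is then `p * (odds - 1) - (1 - p)`, which simplifies to `p * odds - 1`. The function names and numbers below are illustrative assumptions.

```python
def expected_value(prob_win, decimal_odds, stake=1.0):
    """Expected profit of a bet placed at decimal odds."""
    win_profit = stake * (decimal_odds - 1)  # net gain if the bet wins
    return prob_win * win_profit - (1 - prob_win) * stake


def settle(bet_won, decimal_odds, stake=1.0):
    """Realized profit of a settled bet: odds - 1 per unit if won, else -stake."""
    return stake * (decimal_odds - 1) if bet_won else -stake


# Illustrative numbers: a 30% win probability at odds of 3.60
ev = expected_value(0.30, 3.60)
print(f"EV per $1 stake: {ev:.3f}")

# Break-even check: at p = 1 / odds the EV is zero
print(f"EV at break-even probability: {expected_value(1 / 3.60, 3.60):.3f}")
```

A bet is positive EV only when the model's probability exceeds `1 / odds`, the bookmaker's raw implied probability, so any EV found this way depends entirely on how well calibrated the model's probabilities are.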


In [None]:
# Use balanced model probabilities for backtest
probs_balanced = pd.DataFrame(logreg_balanced_probs, columns=['Prob_Home', 'Prob_Draw', 'Prob_Away'], index=X_test.index)
X_test_copy = X_test.copy()
X_test_copy['True_Result'] = y_test.values
X_test_copy['Home_Odds'] = df.loc[X_test_copy.index, 'Home_Odds'].values
X_test_copy['Draw_Odds'] = df.loc[X_test_copy.index, 'Draw_Odds'].values
X_test_copy['Away_Odds'] = df.loc[X_test_copy.index, 'Away_Odds'].values
X_test_copy = pd.concat([X_test_copy, probs_balanced], axis=1)

# Calculate EV per $1 stake at decimal odds:
# win (odds - 1) with probability p, lose the $1 stake with probability (1 - p)
X_test_copy['EV_Home'] = X_test_copy['Prob_Home'] * (X_test_copy['Home_Odds'] - 1) - (1 - X_test_copy['Prob_Home'])
X_test_copy['EV_Draw'] = X_test_copy['Prob_Draw'] * (X_test_copy['Draw_Odds'] - 1) - (1 - X_test_copy['Prob_Draw'])
X_test_copy['EV_Away'] = X_test_copy['Prob_Away'] * (X_test_copy['Away_Odds'] - 1) - (1 - X_test_copy['Prob_Away'])
X_test_copy['Best_Bet'] = X_test_copy[['EV_Home', 'EV_Draw', 'EV_Away']].idxmax(axis=1)
X_test_copy['Best_Bet_EV'] = X_test_copy[['EV_Home', 'EV_Draw', 'EV_Away']].max(axis=1)

# Backtest: assume $1 per +EV bet
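# Copy the filtered rows so adding the 'Profit' column does not trigger SettingWithCopyWarning
positive_ev_bets = X_test_copy[X_test_copy['Best_Bet_EV'] > 0].copy()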
def check_win(row):
    if row['Best_Bet'] == 'EV_Home' and row['True_Result'] == 0:
        return row['Home_Odds'] - 1  # Profit = odds - stake
    elif row['Best_Bet'] == 'EV_Draw' and row['True_Result'] == 1:
        return row['Draw_Odds'] - 1
    elif row['Best_Bet'] == 'EV_Away' and row['True_Result'] == 2:
        return row['Away_Odds'] - 1
    else:
        return -1  # Loss = -1 stake

positive_ev_bets['Profit'] = positive_ev_bets.apply(check_win, axis=1)
total_profit = positive_ev_bets['Profit'].sum()
num_bets = len(positive_ev_bets)
roi = total_profit / num_bets if num_bets > 0 else 0

print(f"Backtest Results on {num_bets} +EV Bets:")
print(f"Total Profit: ${total_profit:.2f}")
print(f"ROI per bet: {roi:.2f}")


### Results and Discussion

We plot the distribution of EV among the positive EV bets, list the five bets with the highest EV, and summarize the average, maximum, and minimum EV of those bets. These insights help us interpret how promising the model's recommendations are and which betting opportunities stand out.


In [None]:
import matplotlib.pyplot as plt

# Plot distribution of best EV
plt.figure(figsize=(8, 5))
plt.hist(positive_ev_bets['Best_Bet_EV'], bins=20, edgecolor='k')
plt.title('Distribution of Positive Expected Value (EV) Bets')
plt.xlabel('Expected Value')
plt.ylabel('Number of Bets')
plt.show()

# Show top 5 bets by EV
top_ev_bets = positive_ev_bets.sort_values('Best_Bet_EV', ascending=False).head()
print("Top 5 +EV Bets:")
print(top_ev_bets[['Best_Bet', 'Best_Bet_EV', 'Prob_Home', 'Prob_Draw', 'Prob_Away', 'Home_Odds', 'Draw_Odds', 'Away_Odds']])

# Summary stats
mean_ev = positive_ev_bets['Best_Bet_EV'].mean()
max_ev = positive_ev_bets['Best_Bet_EV'].max()
min_ev = positive_ev_bets['Best_Bet_EV'].min()

print("\nSummary of Positive EV Bets:")
print(f"Average EV: {mean_ev:.3f}")
print(f"Max EV: {max_ev:.3f}")
print(f"Min EV: {min_ev:.3f}")


### Limitations and Future Work

While our project flags positive expected value (EV) betting opportunities in historical English Premier League data, it has several limitations.

First, we rely solely on bookmaker odds and historical match outcomes, without incorporating richer features like recent team form, player injuries, or lineup changes. Our models also assume that past betting market inefficiencies will continue into the future, which may not hold true.

Additionally, our sample size is limited to one season, and we only use simple models (logistic regression, random forest). More advanced methods like gradient boosting, ensemble models, or deep learning could potentially improve predictions.

In the future, we aim to incorporate live odds data using APIs, expand the feature set with advanced team and player statistics, and perform longer-term backtests across multiple seasons to better assess strategy robustness.
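
One possible shape for the multi-season backtest mentioned above is a walk-forward split, where each season is tested using only earlier seasons for training. This is a rough sketch under assumed inputs: the helper name and season labels are hypothetical, and no multi-season data is loaded in this project.

```python
def walk_forward_splits(seasons):
    """Yield (train_seasons, test_season) pairs in chronological order."""
    ordered = sorted(seasons)
    # Start at index 1 so every test season has at least one training season
    for i in range(1, len(ordered)):
        yield ordered[:i], ordered[i]


# Hypothetical season labels; each would map to its own season CSV
seasons = ["2019-20", "2020-21", "2021-22", "2022-23"]
splits = list(walk_forward_splits(seasons))
for train, test in splits:
    print(f"train on {train} -> test on {test}")
```

Unlike a random train/test split, this ordering never lets a model see matches played after the ones it is evaluated on.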
