# **Predicting football game outcome through Bayesian reasoning**


**Saleh Mir Mohammad Rezaei** \
Project for Fundamentals of AI course - module 3


---

This project aims to predict football match outcomes using Bayesian reasoning. Match data and relevant features are preprocessed, and probabilistic models are then applied to classify match results. The goal is to use statistical insight to produce accurate and interpretable predictions.





# Step 1. Data Preprocessing
The Football Match Statistics Dataset from Kaggle is used for this project. It contains detailed match statistics for football games, with approximately 100,000 rows and 91 columns covering team and player performance across multiple dimensions. The matches come from 18 leagues: three leagues from each of six countries.



In [34]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

> The dataset is loaded, and an additional column, `match_outcome`, is added to represent the result from the home team's perspective. Out of the more than 90 columns in the dataset, 12 features (listed in the selection cell below) are kept, together with the `match_outcome` target, for classifying match results.



> Null values distort the estimated probability tables, reduce model performance, and make results harder to interpret, so they must be handled before modeling. For numeric features, null values are replaced with the column mean. For the categorical `first_half` score, missing or malformed entries are replaced with '0 - 0', which is treated here as the modal score. This lets the Bayesian network learn from every row without being affected by missing data.
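
A minimal sketch of mean and mode imputation on a toy frame (the `toy` frame and its values are invented for illustration and are not taken from the dataset). The notebook's own cell for `first_half` uses a fixed fallback of '0 - 0' rather than a computed mode; the mode variant is shown here only for comparison.

```python
import pandas as pd

# Toy data; values are invented for illustration only
toy = pd.DataFrame({
    "Shots_on_Goal_Home": [4.0, None, 6.0, 2.0],
    "first_half": ["1 - 0", "0 - 0", None, "0 - 0"],
})

# Numeric column: replace missing values with the column mean
toy["Shots_on_Goal_Home"] = toy["Shots_on_Goal_Home"].fillna(
    toy["Shots_on_Goal_Home"].mean()
)

# Categorical column: replace missing values with the most frequent value
toy["first_half"] = toy["first_half"].fillna(toy["first_half"].mode()[0])

print(toy)
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is worth keeping in mind when the share of missing values is large.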











In [None]:
# Load the data
data = pd.read_csv("Football.csv", sep=',', header=0)

# Create match_outcome column and set value
data['match_outcome'] = data.apply(
          lambda row: 'win'
                      if row['home_score'] > row['away_score']
                      else ('lose' if row['home_score'] < row['away_score']
                      else 'draw'), axis=1)
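
The row-wise `apply` above is easy to read but can be slow on roughly 100,000 rows. As a hedged alternative, the same labels can be produced with a vectorized `numpy.select`. The `toy` frame below is invented for illustration; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy scores; invented for illustration only
toy = pd.DataFrame({"home_score": [2, 0, 1], "away_score": [1, 3, 1]})

# Conditions are checked in order; rows matching neither fall through to 'draw'
conditions = [
    toy["home_score"] > toy["away_score"],
    toy["home_score"] < toy["away_score"],
]
toy["match_outcome"] = np.select(conditions, ["win", "lose"], default="draw")

print(toy["match_outcome"].tolist())
```

As with the lambda version, a row with a missing score satisfies neither comparison and is labeled 'draw', so missing scores should be checked separately if that matters.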

In [36]:
# Select relevant features
selected_columns = [
    'home_team', 'away_team', 'Ball_Possession_Home','Shots_on_Goal_Home',
    'Shots_on_Goal_Away', 'Yellow_Cards_Home', 'Yellow_Cards_Away',
    'first_half', 'Red_Cards_Home', 'Red_Cards_Away', 'Goal_Attempts_Home',
    'Goal_Attempts_Away', 'match_outcome']

# Take an explicit copy so later column assignments modify this frame, not a view
data = data[selected_columns].copy()

In [None]:
# Show number of null values for each feature
print(data.isnull().sum())

In [None]:
# Replace null values with mean of the column for numeric features
columns_to_fill = [
    'Shots_on_Goal_Home', 'Shots_on_Goal_Away', 'Yellow_Cards_Home',
    'Yellow_Cards_Away', 'Red_Cards_Home', 'Red_Cards_Away',
    'Goal_Attempts_Home', 'Goal_Attempts_Away'
]

for col in columns_to_fill:
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
    data[col] = data[col].fillna(data[col].mean())

In [39]:
# Replace missing or malformed first-half scores with '0 - 0'
def valid_score_format(score):
    try:
        x, y = map(int, score.split(' - '))
        return score  # If valid, keep the original score
    except (AttributeError, ValueError):
        # AttributeError: score is not a string (e.g. NaN); ValueError: bad format
        return '0 - 0'

data['first_half'] = data['first_half'].apply(valid_score_format)

# Determine the result directly by splitting and comparing the values
def determine_result(score):
    x, y = map(int, score.split(' - '))
    if x > y:
        return 'win'
    elif x == y:
        return 'draw'
    else:
        return 'lose'

data['first_half'] = data['first_half'].apply(determine_result)

In [None]:
# Remove '%', convert to float, and fill null values with the column mean
data['Ball_Possession_Home'] = data['Ball_Possession_Home'].str.rstrip('%').astype(float)
data['Ball_Possession_Home'] = data['Ball_Possession_Home'].fillna(data['Ball_Possession_Home'].mean())

In [None]:
# Show number of null values
print(data.isnull().sum())



> Further preprocessing prepares the dataset for modeling: **scaling of numerical features** and **encoding of categorical variables**, so that all features share a consistent representation. `MinMaxScaler` and `LabelEncoder` from the scikit-learn library are used for this purpose.
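
To make the two transformations concrete, here is a pure-Python sketch of what min-max scaling (to the default range [0, 1]) and label encoding compute. The helpers `min_max_scale` and `label_encode` are hypothetical stand-ins written for illustration; they are not the scikit-learn classes themselves.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    mapping = {label: idx for idx, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


scaled = min_max_scale([2, 4, 10])  # hypothetical shot counts
codes, mapping = label_encode(["win", "draw", "lose", "win"])

print(scaled)
print(mapping)
```

Because codes follow sorted label order, 'draw', 'lose', and 'win' map to 0, 1, and 2. This matters later when evidence such as `first_half: 1` is passed to a query, since 1 then stands for 'lose'.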

In [42]:
# encoding of categorical variables

# Strip leading and trailing whitespace from team names
data['home_team'] = data['home_team'].str.strip()
data['away_team'] = data['away_team'].str.strip()

# Initialize LabelEncoder
team_encoder = LabelEncoder()

# Fit on all unique team names (combining both columns)
all_teams = pd.concat([data['home_team'], data['away_team']]).unique()
team_encoder.fit(all_teams)

# Transform both columns
data['home_team'] = team_encoder.transform(data['home_team'])
data['away_team'] = team_encoder.transform(data['away_team'])

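# Encode match outcome and first half. LabelEncoder assigns codes in sorted
# label order, so with all three labels present: draw=0, lose=1, win=2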
outcome_encoder = LabelEncoder()
data['match_outcome'] = outcome_encoder.fit_transform(data['match_outcome'])
data['first_half'] = outcome_encoder.fit_transform(data['first_half'])

In [43]:
# scaling of numerical features
numerical_columns = [
    'Goal_Attempts_Home', 'Goal_Attempts_Away', 'Red_Cards_Home',
    'Shots_on_Goal_Home', 'Shots_on_Goal_Away', 'Red_Cards_Away',
    'Ball_Possession_Home', 'Yellow_Cards_Home', 'Yellow_Cards_Away']

# Apply MinMaxScaler
scaler = MinMaxScaler()
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

# Discretize continuous variables into bins
data['Ball_Possession_Home'] = pd.cut(data['Ball_Possession_Home'], bins=10, labels=False)

data['Shots_on_Goal_Home'] = pd.cut(data['Shots_on_Goal_Home'], bins=10, labels=False)
data['Shots_on_Goal_Away'] = pd.cut(data['Shots_on_Goal_Away'], bins=10, labels=False)

data['Yellow_Cards_Home'] = pd.cut(data['Yellow_Cards_Home'], bins=10, labels=False)
data['Yellow_Cards_Away'] = pd.cut(data['Yellow_Cards_Away'], bins=10, labels=False)

data['Red_Cards_Home'] = pd.cut(data['Red_Cards_Home'], bins=5, labels=False)
data['Red_Cards_Away'] = pd.cut(data['Red_Cards_Away'], bins=5, labels=False)

data['Goal_Attempts_Home'] = pd.cut(data['Goal_Attempts_Home'], bins=10, labels=False)
data['Goal_Attempts_Away'] = pd.cut(data['Goal_Attempts_Away'], bins=10, labels=False)
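
As a small illustration of what `pd.cut` with `labels=False` returns, the sketch below bins a few hypothetical scaled values into four equal-width intervals. The values and the bin count are chosen for readability and differ from the notebook's settings.

```python
import pandas as pd

# Hypothetical scaled values in [0, 1]
scaled = pd.Series([0.0, 0.1, 0.3, 0.6, 1.0])

# Four equal-width bins over the observed range; labels=False returns bin indices
bins = pd.cut(scaled, bins=4, labels=False)

print(bins.tolist())
```

Equal-width bins are simple, but for skewed features such as red cards most rows may fall into the first bin, leaving the others nearly empty.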



> Preprocessing ends by exporting the cleaned and prepared data: the processed dataset is saved to a CSV file named "processed_data.csv" using the `to_csv` method, ready for the next step.



In [45]:
# Saving the processed data
data.to_csv("processed_data.csv", index=False)

# Step 2. Bayesian Networks: Football Match Analysis

Bayesian Networks are explored using a dataset on football matches. Multiple Bayesian Network models are constructed, compared, and evaluated based on their structure, scoring metrics, and inference results. Hypotheses are tested through parameter sensitivity analyses and evidence-based scenario simulations, providing valuable insights into the relationships and dependencies within the dataset.

In [None]:
pip install pgmpy

In [46]:
import matplotlib.pyplot as plt
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import TreeSearch, HillClimbSearch, BDeuScore
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination
import networkx as nx

**Data Loading and Exploration**
> Various tools were utilized in this project to facilitate the analysis and modeling of Bayesian Networks. The pgmpy library was employed for Bayesian Network modeling and inference, while matplotlib and networkx were used to visualize network structures. Additionally, pandas was applied for data handling and preprocessing tasks.


> The dataset was thoroughly explored to understand its features and distributions. This process involved loading the dataset, displaying summary statistics, and visualizing data distributions to gain insights and prepare the data for further analysis.





In [47]:
# Load the processed data
data = pd.read_csv("processed_data.csv")

In [48]:
# Removing teams columns and any rows with missing values
data = data.drop(columns=['home_team', 'away_team']).dropna()

In [None]:
# Data Summary
print(data.describe())

In [None]:
# First few rows of the dataset
data.head()

In [None]:
# Visualize data distribution (optional)
data.hist(bins=15, figsize=(15, 10))
plt.suptitle("Dataset Feature Distributions")
plt.savefig("data_distributions.png")


**Bayesian Network Modeling**
> Bayesian Network models were constructed and compared to analyze the dataset's underlying dependencies. Three models were developed: a Manual Model, designed based on domain knowledge; a TreeSearch Model, learned using the TreeSearch algorithm; and a Hill Climb Model, learned through the Hill Climb search algorithm with the BDeu scoring metric.



In [None]:
# Define the manual model structure
manual_model = BayesianNetwork([
    ('Goal_Attempts_Home', 'Shots_on_Goal_Home'),
    ('Goal_Attempts_Away', 'Shots_on_Goal_Away'),
    ('Ball_Possession_Home', 'match_outcome'),
    ('Shots_on_Goal_Home', 'match_outcome'),
    ('Shots_on_Goal_Away', 'match_outcome'),
    ('Yellow_Cards_Home', 'Red_Cards_Home'),
    ('Yellow_Cards_Away', 'Red_Cards_Away'),
    ('first_half', 'match_outcome'),
    ('Red_Cards_Home', 'match_outcome'),
    ('Red_Cards_Away', 'match_outcome')
])

# TreeSearch-based model
ts = TreeSearch(data, root_node='match_outcome')
tree_model = BayesianNetwork(ts.estimate())

# Hill Climb Search-based model
scorer = BDeuScore(data)
hc = HillClimbSearch(data)
hc_model = BayesianNetwork(hc.estimate(scoring_method=scorer))

# Visualize the models
def visualize_model(model, title, filename):
    plt.figure(figsize=(10, 8))
    nx.draw_networkx(model, pos=nx.circular_layout(model), node_size=3000, font_size=10)
    plt.title(title)
    plt.savefig(filename)
    plt.show()

visualize_model(manual_model, "Manual Model", "manual_model.png")
visualize_model(tree_model, "TreeSearch Model", "tree_model.png")
visualize_model(hc_model, "Hill Climb Model", "hc_model.png")
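
For reference, the edges of the manual model imply the following factorization of the joint distribution, where each variable is conditioned only on its parents. The abbreviations are introduced here for brevity: $GA$ = Goal Attempts, $SG$ = Shots on Goal, $BP$ = Ball Possession, $YC$ = Yellow Cards, $RC$ = Red Cards, $FH$ = first half, $MO$ = match outcome, with subscripts $H$ and $A$ for home and away.

$$
\begin{aligned}
P(\text{all variables}) ={}& P(GA_H)\,P(GA_A)\,P(BP_H)\,P(YC_H)\,P(YC_A)\,P(FH) \\
&\times P(SG_H \mid GA_H)\,P(SG_A \mid GA_A)\,P(RC_H \mid YC_H)\,P(RC_A \mid YC_A) \\
&\times P(MO \mid BP_H, SG_H, SG_A, FH, RC_H, RC_A)
\end{aligned}
$$

The last factor has six parents, so its table is by far the largest in the manual model, and many parent combinations are likely to be rare or absent in the data.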

**Model Fitting and Scoring**


> All three Bayesian Network models were fitted to the dataset with `MaximumLikelihoodEstimator`, which estimates each Conditional Probability Table (CPT) from the observed frequencies. BDeu scores were then calculated to assess and compare how well each structure fits the data, giving a common basis for evaluating the models.
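
A minimal sketch of what maximum likelihood estimation means for a single CPT: count how often each child value occurs with each parent value, then normalize within each parent value. The `toy` data are invented for illustration, and `pd.crosstab` is used here only to show the computation, not as a description of pgmpy's internals.

```python
import pandas as pd

# Toy encoded data; values are invented for illustration only
toy = pd.DataFrame({
    "first_half":    [0, 0, 0, 1, 1, 2],
    "match_outcome": [0, 0, 1, 1, 1, 2],
})

# Maximum likelihood estimate of P(match_outcome | first_half):
# count co-occurrences, then normalize within each parent value (column)
cpt = pd.crosstab(toy["match_outcome"], toy["first_half"], normalize="columns")

print(cpt)
```

Each column sums to 1. Combinations that never occur in the data receive probability 0 under maximum likelihood, which is one reason to keep the number of parents and bins moderate.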



In [50]:
# Fit CPTs
models = [('Manual', manual_model), ('TreeSearch', tree_model), ('HillClimb', hc_model)]
for name, model in models:
    print(f"Fitting CPTs for {name} model...")
    model.fit(data, estimator=MaximumLikelihoodEstimator)

# Compare scores
model_scores = {name: scorer.score(model) for name, model in models}
print("\nModel Scores:")
for name, score in model_scores.items():
    print(f"{name} Model: {score}")

Fitting CPTs for Manual model...
Fitting CPTs for TreeSearch model...
Fitting CPTs for HillClimb model...

Model Scores:
Manual Model: -811605.9831872664
TreeSearch Model: -727982.7569958973
HillClimb Model: -686820.9781788085
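
To read these numbers, it helps to look at gaps rather than absolute values. The sketch below assumes the BDeu values are log-scale scores where higher is better, and simply reuses the three figures printed above.

```python
# BDeu scores reported above (higher is better)
scores = {
    "Manual": -811605.9831872664,
    "TreeSearch": -727982.7569958973,
    "HillClimb": -686820.9781788085,
}

best = max(scores, key=scores.get)

# Gap of each model to the best one, in log-score units
gaps = {name: scores[best] - score for name, score in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.1f} below {best}")
```

Note that the Hill Climb search was run with this same BDeu scorer as its objective, so part of its lead over the other two structures is expected by construction.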


In [None]:
node = "first_half"

# Print CPDs for a node in all models
for name, model in models:
    print(f"\nCPD for {node} in {name} model:")
    try:
        cpd = model.get_cpds(node)
        print(cpd)
    except ValueError:
        print(f"No CPD found for '{node}' in {name} model.")

In [None]:
# Perform independence checks and Markov blanket analysis
nodes = ['Ball_Possession_Home', 'Shots_on_Goal_Home', 'Shots_on_Goal_Away',
         'Yellow_Cards_Home', 'Red_Cards_Home','Yellow_Cards_Away', 'Red_Cards_Away',
         'first_half', 'Goal_Attempts_Home', 'Goal_Attempts_Away', 'match_outcome']

for name, model in models:
    print(f"\nAnalysis for {name} model:")
    print(f"Total independence assertions: {len(model.get_independencies().get_assertions())}")
    for node in nodes:
        print(f"Markov blanket of {node}: {model.get_markov_blanket(node)}")
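
The Markov blanket of a node consists of its parents, its children, and the other parents of its children. The sketch below computes it directly from the manual model's edge list in plain Python. The `markov_blanket` helper is written for illustration and is separate from pgmpy's `get_markov_blanket`.

```python
# Edges of the manual model defined earlier in the notebook
edges = [
    ("Goal_Attempts_Home", "Shots_on_Goal_Home"),
    ("Goal_Attempts_Away", "Shots_on_Goal_Away"),
    ("Ball_Possession_Home", "match_outcome"),
    ("Shots_on_Goal_Home", "match_outcome"),
    ("Shots_on_Goal_Away", "match_outcome"),
    ("Yellow_Cards_Home", "Red_Cards_Home"),
    ("Yellow_Cards_Away", "Red_Cards_Away"),
    ("first_half", "match_outcome"),
    ("Red_Cards_Home", "match_outcome"),
    ("Red_Cards_Away", "match_outcome"),
]


def markov_blanket(node, edges):
    """Return parents, children, and co-parents of the node's children."""
    parents = {p for p, c in edges if c == node}
    children = {c for p, c in edges if p == node}
    co_parents = {p for p, c in edges if c in children}
    return (parents | children | co_parents) - {node}


print(sorted(markov_blanket("Shots_on_Goal_Home", edges)))
print(sorted(markov_blanket("Yellow_Cards_Home", edges)))
```

Given its Markov blanket, a node is conditionally independent of every other variable in the network, which is why the blanket is a compact summary of what is directly relevant to that node.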


**Inference and Evidence Scenarios**
> The behavior of the Bayesian Network models was tested under various evidence scenarios to evaluate their performance. Evidence scenarios were defined, and inference was performed for each model. The results were compared across models to assess their consistency and ability to capture the underlying dependencies within the dataset.











In [None]:
# Define inference scenarios
evidence_scenarios = [
    {'Ball_Possession_Home': 2, 'first_half': 1},
    {'Yellow_Cards_Home': 2, 'Shots_on_Goal_Away': 0},
    {'Goal_Attempts_Home': 4, 'Shots_on_Goal_Home': 3},
    {'Goal_Attempts_Home': 1, 'Shots_on_Goal_Home': 3},
]

inferences = [(name, VariableElimination(model)) for name, model in models]

# Perform inference
for idx, evidence in enumerate(evidence_scenarios):
    print(f"\nScenario {idx + 1}: Evidence = {evidence}")
    for name, ve in inferences:
        query_result = ve.query(variables=['match_outcome'], evidence=evidence)
        print(f"{name} Model Result:\n{query_result}")
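
When the evidence does not cover every parent of `match_outcome`, the query has to sum out the unobserved variables. The sketch below shows that step by hand on a simplified chain, Yellow_Cards_Home → Red_Cards_Home → match_outcome. All probabilities are hypothetical, and the other parents of `match_outcome` are ignored to keep the example short.

```python
# Hypothetical CPTs for the chain Yellow_Cards_Home -> Red_Cards_Home -> match_outcome
# P(red | yellow): outer key is the yellow-card bin, inner key is the red-card bin
p_red_given_yellow = {
    0: {0: 0.95, 1: 0.05},
    2: {0: 0.80, 1: 0.20},
}
# P(outcome | red): outer key is the red-card bin
p_outcome_given_red = {
    0: {"win": 0.45, "draw": 0.25, "lose": 0.30},
    1: {"win": 0.25, "draw": 0.25, "lose": 0.50},
}


def outcome_given_yellow(yellow):
    """Sum out the unobserved red-card variable: sum_r P(r | y) * P(outcome | r)."""
    result = {"win": 0.0, "draw": 0.0, "lose": 0.0}
    for red, p_red in p_red_given_yellow[yellow].items():
        for outcome, p_out in p_outcome_given_red[red].items():
            result[outcome] += p_red * p_out
    return result


posterior = outcome_given_yellow(2)
print(posterior)
```

Variable elimination performs the same kind of summation, but orders the sums so that large intermediate tables are avoided.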

**Parameter Sensitivity Analysis**

> The impact of changes in key parameters on the inference results of the Bayesian Network models was analyzed. Parameters and their respective value ranges were defined, and inference was performed for varying parameter values. The resulting outputs were compared to evaluate the sensitivity of the models to parameter variations and to gain insights into their robustness and adaptability.











In [None]:
parameters = ['Yellow_Cards_Home', 'Red_Cards_Home']
values = [0, 1, 3]

sensitivity_results = {}
for param in parameters:
    for value in values:
        evidence = {param: value}
        print(f"\nTesting {param} = {value}")
        for name, ve in inferences:
            result = ve.query(variables=['match_outcome'], evidence=evidence)
            print(f"{name} Model: {result}")
            sensitivity_results[(name, param, value)] = result
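
One hypothetical way to summarize such a sweep is the spread of the win probability across the tested evidence values. The numbers in `p_win` below are invented placeholders, not results from the models; in practice they would be read from the query results collected above.

```python
# Invented placeholder values for P(match_outcome = win | parameter = value);
# these are NOT outputs of the fitted models
p_win = {
    "Yellow_Cards_Home": {0: 0.46, 1: 0.44, 3: 0.41},
    "Red_Cards_Home": {0: 0.47, 1: 0.30, 3: 0.18},
}


def sensitivity_range(by_value):
    """Spread between the largest and smallest probability across evidence values."""
    return max(by_value.values()) - min(by_value.values())


ranges = {param: sensitivity_range(vals) for param, vals in p_win.items()}
most_sensitive = max(ranges, key=ranges.get)

print(ranges)
print(most_sensitive)
```

A larger spread means the predicted outcome reacts more strongly to that parameter. The range is a coarse summary, so the full distributions are still worth inspecting.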

# Step 3. Conclusion


**Final Insights**

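The HillClimb model achieved the highest BDeu score (about -686,821, compared with -727,983 for TreeSearch and -811,606 for the manual model), making it the best-fitting of the three structures under this metric.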
Offensive metrics, particularly Shots on Goal, have the strongest influence on match outcomes, while disciplinary metrics like Yellow/Red Cards negatively impact results.
Bayesian Networks are effective in modeling the interplay of match statistics and outcomes, with structure-learning algorithms providing flexibility for automated insights.

\
**Future Work**

Explore additional features like player performance, match location, and weather conditions to enhance predictions.
Test dynamic Bayesian networks (DBNs) to capture temporal dependencies, such as momentum shifts during a game.
Use advanced scoring methods (e.g., BIC or AIC) to penalize overly complex models and avoid overfitting.