<h1>Matildas Result Prediction SVR Model <i>(June/July 2025 Update)</i></h1>

<h3>Function of the model</h3>
<p>This notebook updates a support vector regression (SVR) model that predicts the outcome of the Australian women's national football team's upcoming matches. It is trained on all home and away results from the past five years (2020 through June 2nd, 2025).</p>

<h2>Model training code with explanations</h2>

<p>Import the required libraries:</p>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score

<p>Read home and away win-loss data over the past 5 years from Excel files, and concatenate separate data files together:</p>

In [None]:
home_data = pd.read_excel('matildas_winlossHomeJuneJuly2025Update.xlsx')
away_data = pd.read_excel('matildas_winlossAwayApril2025Update.xlsx')

home_data['location'] = 'Home'
away_data['location'] = 'Away'

data = pd.concat([home_data, away_data], ignore_index = True)

<p>Combine the Boolean <b>win</b>, <b>draw</b> and <b>lose</b> columns into a single numeric result, where 1 corresponds to a win, 0.5 to a draw and 0 to a loss:</p>

In [3]:
data['result'] = data['win'] * 1 + data['draw'] * 0.5 + data['lose'] * 0
data = data.drop(['win', 'draw', 'lose'], axis = 1)
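<p>As a quick check of the encoding above, the following sketch applies the same weighting to a hypothetical three-match sample (the values are made up, not taken from the Excel files) and verifies that each row has exactly one outcome flag set:</p>

```python
import pandas as pd

# Hypothetical three-match sample; the real data comes from the Excel files.
sample = pd.DataFrame({
    'win':  [True, False, False],
    'draw': [False, True, False],
    'lose': [False, False, True],
})

# Same weighting as the notebook: win -> 1, draw -> 0.5, loss -> 0.
sample['result'] = sample['win'] * 1 + sample['draw'] * 0.5 + sample['lose'] * 0

# Each row should have exactly one flag set, otherwise the encoding is ambiguous.
flags_per_row = sample[['win', 'draw', 'lose']].sum(axis=1)
assert (flags_per_row == 1).all()

print(sample['result'].tolist())
```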

<p>Define the features and target variables in the given dataset:</p>

In [4]:
features = data[['home_team', 'away_team', 'home_score', 'away_score', 'tournament', 'city', 'country', 'location']]
target = data['result']

<p>Define two lists, <b>num_features</b> and <b>cat_features</b>, holding the names of the quantitative and qualitative columns respectively:</p>

In [5]:
num_features = ['home_score', 'away_score']
cat_features = ['home_team', 'away_team', 'tournament', 'city', 'country', 'location']

<p>Preprocess the data with a ColumnTransformer, which standardises the numeric columns using StandardScaler and one-hot encodes the categorical columns using OneHotEncoder:</p>

In [6]:
preprocessor = ColumnTransformer(
    transformers = [
        ('num', StandardScaler(), num_features),
        ('cat', OneHotEncoder(handle_unknown = 'ignore', drop = 'first'), cat_features)
        ])

<p>Define a Pipeline that applies the preprocessing and then fits a support vector regressor (SVR) with a linear kernel:</p>

In [7]:
pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('regressor', SVR(kernel = 'linear'))
])

<p>Split the dataset into training and test sets with test set size 0.2 and random state 42:</p>

In [8]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 42)
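<p>The effect of the two split parameters can be sketched on a hypothetical ten-row dataset (the names <b>toy_features</b> and <b>toy_target</b> are illustrative). A test size of 0.2 holds out two of the ten rows, and fixing the random state makes the split reproducible:</p>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row dataset standing in for the real features and target.
toy_features = pd.DataFrame({'home_score': range(10)})
toy_target = pd.Series([1.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.5, 1.0, 0.0, 1.0])

first = train_test_split(toy_features, toy_target, test_size=0.2, random_state=42)
second = train_test_split(toy_features, toy_target, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(len(X_tr), len(X_te))

# The same random_state selects the same held-out rows on every run.
same_rows = list(X_te.index) == list(second[1].index)
print(same_rows)
```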

<p>Fit the pipeline to the training set data:</p>

In [None]:
pipeline.fit(X_train, y_train)

<p>Generate predictions for the test set:</p>

In [None]:
y_pred = pipeline.predict(X_test)

<h2>Code for outputting prediction results with explanations</h2>

<p>Define dataframe for future matches to predict, corresponding to the upcoming Matildas fixtures as of the June/July 2025 update:</p>

In [None]:
home_team = 'Australia'
away_teams = ['Slovenia', 'Slovenia', 'Panama', 'Panama']
cities = ['Perth', 'Perth', 'Bunbury', 'Perth']
country = 'Australia'
tournament = 'Friendly'
venues = ['HBF Park', 'HBF Park', 'Hands Oval', 'HBF Park']

In [None]:
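future_matches = pd.DataFrame({
    'home_team': [home_team] * 4,
    'away_team': away_teams,
    'tournament': [tournament] * 4,
    'city': cities,
    'country': [country] * 4,
    # NOTE: the training data encodes 'location' as 'Home'/'Away'. Venue names
    # are unseen categories, so the encoder ignores them (handle_unknown = 'ignore').
    'location': venues,
    # Placeholder scores, since the real scores are unknown before kick-off.
    'home_score': [0] * 4,
    'away_score': [0] * 4
})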

<p>Run the fitted pipeline on the future matches to get the raw predictions:</p>

In [None]:
predicted_results = pipeline.predict(future_matches)

<p>Add raw prediction data as a column to the resultant dataframe:</p>

In [14]:
future_matches['raw_predictiondata'] = predicted_results

<p>Define a function for classification of predicted results based on raw prediction data:</p>

In [15]:
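def classify_result(predicted):
    # Results are encoded as 1 (win), 0.5 (draw) and 0 (loss), so the cut-offs
    # sit midway between those values. These thresholds are an assumption.
    if predicted >= 0.75:
        return 'Win'
    elif predicted <= 0.25:
        return 'Lose'
    else:
        return 'Draw'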

<p>Add new column to resultant dataframe with classified result corresponding to win, draw and loss outcomes:</p>

In [16]:
future_matches['predicted_result'] = future_matches['raw_predictiondata'].apply(classify_result)
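<p>One way to reason about the classification step is to label each raw prediction with the nearest encoded outcome (0 for a loss, 0.5 for a draw, 1 for a win). The sketch below does this on made-up prediction values. The helper <b>nearest_outcome</b> is hypothetical and is only one possible rule, not a tuned decision boundary:</p>

```python
import pandas as pd

# Assumption: each label is tied to its encoded value on the 0 / 0.5 / 1 scale.
OUTCOMES = {1.0: 'Win', 0.0: 'Lose', 0.5: 'Draw'}

def nearest_outcome(predicted):
    """Return the label whose encoded value is closest to the prediction."""
    closest = min(OUTCOMES, key=lambda value: abs(value - predicted))
    return OUTCOMES[closest]

# Made-up raw predictions; SVR output is not restricted to the 0..1 range.
raw = pd.Series([0.92, 0.48, 0.10, 1.30, -0.20])
labels = raw.apply(nearest_outcome)
print(labels.tolist())
```

<p>Because the regressor can return values below 0 or above 1, a nearest-value rule still assigns those predictions to a loss or a win respectively.</p>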

<p>Remove home and away score columns from the resultant dataframe, and output the data:</p>

In [17]:
future_matches = future_matches.drop(columns = ['home_score', 'away_score'])

<h2>Resultant dataframe output</h2>

<p>This is the resultant dataframe with predicted data, giving information about upcoming matches for the Australian women's national football team and their predicted outcomes as of the June/July 2025 update.</p>

In [None]:
future_matches

<h3>Explanation of columns</h3>
<ul>
    <li><b>home_team</b> corresponds to the designated home team for the match.</li>
    <li><b>away_team</b> corresponds to the designated away team for the match.</li>
    <li><b>tournament</b> corresponds to the tournament the fixtures are part of, or whether the fixtures are friendly matches.</li>
    <li><b>city</b> corresponds to the city the match is being played in.</li>
    <li><b>country</b> corresponds to the country the match is being played in.</li>
    <li><b>location</b> corresponds to the stadium the match is being played at.</li>
    <li><b>raw_predictiondata</b> corresponds to the raw prediction data output from the prediction model as a floating point number.</li>
    <li><b>predicted_result</b> corresponds to the predicted outcome of the match based on the raw prediction data.</li>
</ul>

<h2>Evaluation of the model</h2>

<h3>Mean absolute error</h3>
<p>This corresponds to the average absolute difference between predicted values and actual values, giving an idea of how wrong predictions are on average.</p>

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print("Mean absolute error:", mae)

<h3>Mean squared error</h3>
<p>This corresponds to the average of the squared errors. Squaring gives more weight to larger errors, which makes this metric sensitive to outliers.</p>

In [None]:
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error:", mse)

<h3>Root mean squared error</h3>
<p>This corresponds to the square root of the mean squared error, which provides the error in the same units as the raw prediction data to facilitate easier interpretation.</p>

In [None]:
rmse = root_mean_squared_error(y_test, y_pred)
print("Root mean squared error:", rmse)
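<p>The three error metrics above can be worked through by hand. The sketch below uses hypothetical values on the same 0 / 0.5 / 1 result scale (not the model's real output) and only the standard library:</p>

```python
import math

# Hypothetical actual results and predictions, not the model's real output.
actual = [1.0, 0.5, 0.0, 1.0]
predicted = [0.5, 0.5, 0.5, 1.0]

errors = [p - a for p, a in zip(predicted, actual)]

# Mean absolute error: average size of the errors.
mae_demo = sum(abs(e) for e in errors) / len(errors)

# Mean squared error: average of the squared errors.
mse_demo = sum(e ** 2 for e in errors) / len(errors)

# Root mean squared error: back on the same scale as the results.
rmse_demo = math.sqrt(mse_demo)

print(mae_demo, mse_demo, rmse_demo)
```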

<h3>R-squared value</h3>
<p>This corresponds to the proportion of the variance in the actual results that the model's predictions explain. A value of 1 indicates a perfect fit, 0 indicates a model no better than always predicting the mean result, and a negative value indicates a worse fit than that.</p>

In [None]:
r2 = r2_score(y_test, y_pred)
print("R-squared value:", r2)
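<p>The R-squared value is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on hypothetical values (not the model's real output):</p>

```python
# Hypothetical actual results and predictions, not the model's real output.
actual = [1.0, 0.5, 0.0, 1.0]
predicted = [0.5, 0.5, 0.5, 1.0]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: squared distance between predictions and actual results.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: squared distance between actual results and their mean.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2_demo = 1 - ss_res / ss_tot
print(r2_demo)
```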

<h3>Accuracy</h3>
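<p>This corresponds to the fraction of test matches whose prediction, thresholded at 0.5 into a win (1) or a loss (0), equals the recorded result. Drawn matches are recorded as 0.5 and can never equal a thresholded value, so they always count as errors under this measure.</p>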

In [None]:
y_pred_binary = [1 if pred >= 0.5 else 0 for pred in y_pred]
accuracy = (y_pred_binary == y_test).mean()
print("Accuracy:", accuracy)
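<p>Since the target has three possible values, a three-class comparison is one alternative way to score the model. The sketch below uses made-up values and a hypothetical helper <b>to_outcome</b> with illustrative cut-offs at 0.25 and 0.75. It contrasts that approach with the two-class threshold at 0.5:</p>

```python
# Hypothetical actual results and raw predictions, not the model's real output.
actual = [1.0, 0.5, 0.0, 1.0, 0.5]
predicted = [0.9, 0.55, 0.1, 0.8, 0.45]

def to_outcome(value):
    """Map a raw prediction to 1.0 (win), 0.5 (draw) or 0.0 (loss)."""
    if value >= 0.75:
        return 1.0
    if value <= 0.25:
        return 0.0
    return 0.5

# Three-class accuracy: draws can be predicted and matched.
three_class = [to_outcome(p) for p in predicted]
three_class_accuracy = sum(t == a for t, a in zip(three_class, actual)) / len(actual)

# Two-class accuracy: thresholding at 0.5 can never match a draw.
two_class = [1 if p >= 0.5 else 0 for p in predicted]
two_class_accuracy = sum(t == a for t, a in zip(two_class, actual)) / len(actual)

print(three_class_accuracy, two_class_accuracy)
```

<p>In this made-up sample two of the five matches are draws, so the two-class score cannot exceed 0.6 however good the raw predictions are.</p>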