# UEFA Predictions

I found the dataset on Kaggle:
https://www.kaggle.com/datasets/raminvali/uefa-champions-league

It contains historical results of the UEFA Champions League for the 2010 to 2021 seasons, scraped from FBref.

It lists every match for each team by date, with detailed statistics for both the team and its opponent.

I did not edit the downloaded file in any way before loading it here.

In [1]:
import pandas as pd
import numpy as np

matches = pd.read_csv("/Users/weronikakieliszek/Downloads/df.csv", index_col=0)
teams = matches["team"].unique()
# Marker column: 0 for historical rows, 1 for matches entered by the user later on
matches["check"] = 0

teams

array(['Lyon', 'Panathinaikos', 'Rubin Kazan', 'Schalke 04',
       'Hapoel Tel Aviv', 'Valencia', 'Rangers', 'Tottenham', 'Inter',
       'Bursaspor', 'Werder Bremen', 'Twente', 'Barcelona',
       'FC Copenhagen', 'Benfica', 'Manchester Utd', 'Braga', 'Ajax',
       'Partizan', 'Chelsea', 'Spartak Moscow', 'Roma', 'Auxerre',
       'Real Madrid', 'Arsenal', 'Milan', 'Bayern Munich', 'Marseille',
       'MÅ\xa0K Å½ilina', 'Shakhtar', 'Basel', 'CFR Cluj', 'Porto',
       'Olympiacos', 'Genk', 'Dortmund', 'Viktoria PlzeÅ\x88',
       'Leverkusen', 'Zenit', 'BATE Borisov', 'APOEL FC', 'Dinamo Zagreb',
       'Manchester City', 'Lille', 'Villarreal', 'Napoli', 'CSKA Moscow',
       'OÈ\x9belul GalaÈ\x9bi', 'Trabzonspor', 'Montpellier', 'Paris S-G',
       'Dynamo Kyiv', 'MÃ¡laga', 'Anderlecht', 'NordsjÃ¦lland', 'Celtic',
       'Galatasaray', 'Juventus', 'Real Sociedad', 'AtlÃ©tico Madrid',
       'Steaua', 'Austria Wien', 'MalmÃ¶', 'Liverpool', 'Ludogorets',
       'Monaco', 'NK Maribor'
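
Several names in the list above (for example 'MÃ¡laga') look like UTF-8 text that was decoded as Latin-1 somewhere along the way. The sketch below shows one possible way to repair such strings; the helper name `fix_mojibake` is hypothetical, and the Latin-1 assumption is a guess based on the output, not something confirmed by the dataset. If the names are repaired like this, any team names entered later have to use the repaired spelling as well.

```python
def fix_mojibake(name: str) -> str:
    """Undo UTF-8 text that was decoded as Latin-1; return the input unchanged if that fails."""
    try:
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind, so leave the value alone
        return name


# A few values copied from the team list above
samples = ["MÅ\xa0K Å½ilina", "MÃ¡laga", "AtlÃ©tico Madrid", "Lyon"]
repaired = [fix_mojibake(s) for s in samples]
print(repaired)
```

On the real data this could be applied with something like `matches["team"].map(fix_mojibake)` (and the same for `team_opp`), again assuming the encoding guess holds.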

The next cell lets the user enter details of upcoming matches. They are appended to the dataset so that the model can predict them later.

In [2]:
num_rows = int(input("How many matches do we want to predict? "))

print("Below, enter the team names (use the list above - the values must match), the binary value whether team No. 1 is playing at home, the date in the yyyy-mm-dd format. If you want to see the match results from the 2022-2023 season, enter 2022 in the season.")

for i in range(num_rows):
    team = input(f"Enter the first team for match no {i+1}: ")
    check = 1  # flags rows entered by the user so they can be found later
    predictors = ["season", "home", "team_opp", "Date"]
    predictor_values = []
    for p in predictors:
        val = input(f"Enter {p} for match no {i+1}: ")
        predictor_values.append(val)

    # Append the new match as a one-row DataFrame; all other columns stay NaN
    new_data = [team, check] + predictor_values
    new_df = pd.DataFrame([new_data], columns=["team", "check"] + predictors)
    matches = pd.concat([matches, new_df], ignore_index=True)

# Print the updated DataFrame
print(matches)

Below, enter the team names (use the list above - the values must match), the binary value whether team No. 1 is playing at home, the date in the yyyy-mm-dd format. If you want to see the match results from the 2022-2023 season, enter 2022 in the season.
      Day        Date             team         team_opp season  score  \
0     Tue  2010-09-14             Lyon       Schalke 04   2010    1.0   
1     Tue  2010-09-14    Panathinaikos        Barcelona   2010    1.0   
2     Tue  2010-09-14      Rubin Kazan    FC Copenhagen   2010    0.0   
3     Tue  2010-09-14       Schalke 04             Lyon   2010    0.0   
4     Tue  2010-09-14  Hapoel Tel Aviv          Benfica   2010    0.0   
...   ...         ...              ...              ...    ...    ...   
2921  Sat  2022-05-28      Real Madrid        Liverpool   2021    1.0   
2922  NaN  2023-04-18           Napoli            Milan   2022    NaN   
2923  NaN  2023-04-18          Chelsea      Real Madrid   2022    NaN   
2924  NaN  20
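
Because `input()` returns strings and the team names must match the dataset exactly, it may help to validate each entry before appending it. The following is only a sketch: `validate_match` and its arguments are hypothetical names, and casting `season` and `home` to integers is an assumption about how those columns are stored in the CSV.

```python
from datetime import date


def validate_match(team, team_opp, season, home, match_date, known_teams):
    """Return a cleaned dict for one future match, or raise ValueError."""
    for name in (team, team_opp):
        if name not in known_teams:
            raise ValueError(f"Unknown team: {name!r}")
    if team == team_opp:
        raise ValueError("A team cannot play itself")
    if str(home) not in ("0", "1"):
        raise ValueError("home must be 0 or 1")
    # date.fromisoformat raises ValueError if the text is not yyyy-mm-dd
    parsed = date.fromisoformat(match_date)
    return {
        "team": team,
        "team_opp": team_opp,
        "season": int(season),
        "home": int(home),
        "Date": parsed.isoformat(),
        "check": 1,  # same marker the notebook uses for user-entered rows
    }


known = {"Napoli", "Milan", "Chelsea", "Real Madrid"}
row = validate_match("Napoli", "Milan", "2022", "1", "2023-04-18", known)
print(row)
```

A dict like `row` can be wrapped in `pd.DataFrame([row])` and concatenated to `matches` in the same way as in the cell above.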

In [3]:
# Parse the dates so that calendar features can be derived from them
matches['Date'] = pd.to_datetime(matches['Date'])

matches["day_of_week"] = matches["Date"].dt.weekday
matches["month"] = matches["Date"].dt.month
matches["year"] = matches["Date"].dt.year
matches["day_of_month"] = matches["Date"].dt.day

In [4]:
# Target from the point of view of `team`: 0 = loss, 1 = draw, 2 = win
matches['result'] = 0
matches.loc[matches['score'] == matches['score_opp'], 'result'] = 1
matches.loc[matches['score'] > matches['score_opp'], 'result'] = 2
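
Note that comparisons with NaN are always false, so the user-entered matches (which have no score yet) keep the default value 0 and look like losses. One possible alternative, sketched here on a toy frame with made-up scores, is to leave the target missing for matches that have not been played; the variable names `toy`, `played` and `outcome` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: three played matches and one future match without scores
toy = pd.DataFrame({
    "score": [2.0, 1.0, 0.0, np.nan],
    "score_opp": [1.0, 1.0, 3.0, np.nan],
})

played = toy["score"].notna() & toy["score_opp"].notna()

# 2 = win, 1 = draw, 0 = loss (the default when neither condition holds)
outcome = np.select(
    [toy["score"] > toy["score_opp"], toy["score"] == toy["score_opp"]],
    [2, 1],
    default=0,
)

# Keep the encoded outcome only where the match was actually played
toy["result"] = pd.Series(outcome, index=toy.index).where(played)
print(toy)
```

With this variant the target column becomes float (because of the NaN), so rows without a result would need to be filtered out before fitting or scoring.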

In [5]:
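# Encode the opponent as an integer category code
matches["opp_code"] = matches["team_opp"].astype("category").cat.codes
# Replace missing values (including the unknown statistics of future matches) with 0
matches.fillna(0, inplace=True)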

In [6]:
matches.dtypes

Day                     object
Date            datetime64[ns]
team                    object
team_opp                object
season                  object
                     ...      
check                    int64
day_of_week              int32
month                    int32
year                     int32
day_of_month             int32
Length: 93, dtype: object

In [7]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=1000, min_samples_split=10, random_state=1)

In [8]:
# Chronological split: train on earlier matches, test on later ones
train = matches[matches["Date"] < "2018-03-13"]
test = matches[matches["Date"] >= "2018-03-13"]

In [9]:
# Predictors

predictors = ["season", "home", "opp_code", "day_of_week", "month", "year", "day_of_month"]

In [10]:
rf.fit(train[predictors], train["result"])

In [11]:
preds = rf.predict(test[predictors])

In [12]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(test["result"], preds)
print(acc)

0.43154761904761907


In [13]:
combined = pd.DataFrame(dict(actual=test["result"], prediction=preds))
pd.crosstab(index=combined["actual"], columns=combined["prediction"])

prediction    0  1    2
actual                 
0           239  7  158
1           103  3   98
2           199  8  193
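
To put the accuracy in context, it can be compared with a baseline that always predicts the most frequent outcome, and broken down into per-class recall. The sketch below recomputes these figures from the confusion matrix above; the array `cm` is copied by hand from that output.

```python
import numpy as np

# Rows are actual outcomes, columns are predictions (0 = loss, 1 = draw, 2 = win)
cm = np.array([
    [239, 7, 158],
    [103, 3, 98],
    [199, 8, 193],
])

total = cm.sum()
accuracy = np.trace(cm) / total            # correct predictions / all predictions
baseline = cm.sum(axis=1).max() / total    # always predict the most common actual class
recall = np.diag(cm) / cm.sum(axis=1)      # share of each actual class predicted correctly

print(round(accuracy, 4), round(baseline, 4))
print(recall.round(3))
```

The accuracy (435 of 1008, about 0.43) is only slightly above the majority-class baseline (404 of 1008, about 0.40), and draws are almost never predicted: only 18 of the 1008 predictions are class 1.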


To improve the model, I want to use more of the data in the set. Most match statistics cannot be used directly as predictors, because their values are not known before the match is played. Instead, I create new columns that hold each statistic's average over the team's previous 3 matches. Those values are known in advance, so the model can use them.

I group the data by team so that the averages are calculated for each team separately, without mixing values from different teams.

In [14]:
grouped_matches = matches.groupby("team")

In [15]:
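def rolling_averages(group, cols, new_cols):
    # Work in chronological order so that each window looks back in time
    group = group.sort_values("Date")
    # closed='left' excludes the current match: only the 3 previous matches are averaged
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # Drop rows that do not yet have 3 previous matches
    group = group.dropna(subset=new_cols)
    return group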
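
To see what `rolling(3, closed='left')` does in the function above, here is a minimal example on a toy frame with made-up scores. It only illustrates the windowing; the dates and values are not from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-08", "2021-01-15", "2021-01-22", "2021-01-29"]
    ),
    "score": [1.0, 2.0, 3.0, 4.0, 5.0],
})

toy = toy.sort_values("Date")
# Each row receives the mean of the 3 rows before it; its own value is excluded
toy["score_rolling"] = toy["score"].rolling(3, closed="left").mean()
print(toy)
```

The first three rows get NaN because fewer than 3 earlier matches exist, the fourth row gets the mean of 1, 2 and 3 (2.0), and the fifth gets the mean of 2, 3 and 4 (3.0). This is why the first three matches of every team are dropped by `dropna`.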

In [16]:
cols = ['score', 'score_opp', '# Pl', 'Age', 'MP', 'Starts', 'Gls', 'Ast', 'G+A', 'G-PK', 'PK', 'PKatt', 'Gls90', 'Ast90', 'G+A90', 'G-PK90', 'G+A-PK', 'Min%', 'Subs', 'Mn/Sub', 'PPM', 'onG', 'onGA', '+/-', '# Pl.1', 'Min', 'GA', 'GA90', 'SoTA', 'Saves', 'Save%', 'W', 'D', 'L', 'CS', 'CS%', 'SoT', 'SoT/90', 'G/SoT', 'Fls', '# Pl_opp', 'Age_opp', 'MP_opp', 'Starts_opp', 'Gls_opp', 'Ast_opp', 'G+A_opp', 'G-PK_opp', 'PK_opp', 'PKatt_opp', 'Gls90_opp', 'Ast90_opp', 'G+A90_opp', 'G-PK90_opp', 'G+A-PK_opp', 'season_opp', 'Min%_opp', 'Subs_opp', 'Mn/Sub_opp', 'PPM_opp', 'onG_opp', 'onGA_opp', '+/-_opp', '# Pl_opp.1', 'Min_opp', 'GA_opp', 'GA90_opp', 'SoTA_opp', 'Saves_opp', 'Save%_opp', 'W_opp', 'D_opp', 'L_opp', 'CS_opp', 'CS%_opp', 'SoT_opp', 'SoT/90_opp', 'G/SoT_opp', 'Fls_opp']
new_cols = [f"{c}_rolling" for c in cols]

In [17]:
matches_rolling = matches.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))


In [18]:
matches_rolling = matches_rolling.droplevel("team")

In [19]:
matches_rolling.index = range(matches_rolling.shape[0])

In [20]:
def make_predictions(data, predictors):
    # Same chronological split as before; refits the global `rf` model
    train = data[data["Date"] < "2018-03-13"]
    test = data[data["Date"] >= "2018-03-13"]
    rf.fit(train[predictors], train["result"])
    preds = rf.predict(test[predictors])
    # Keep the test index so the predictions can be merged back onto the match details
    combined = pd.DataFrame(dict(actual=test["result"], predicted=preds), index=test.index)
    acc = accuracy_score(test["result"], preds)
    return combined, acc

In [21]:
combined, acc = make_predictions(matches_rolling, predictors + new_cols)

In [22]:
acc

0.5742471443406023

In [23]:
combined = combined.merge(matches_rolling[["Date", "team", "team_opp", "check"]], left_index=True, right_index=True)
pd.crosstab(index=combined["actual"], columns=combined["predicted"])

predicted    0  1    2
actual                
0          232  7  135
1           92  9   93
2           77  6  312


In [24]:
# Checking the result for the previously entered match
# I use the column I added earlier to find this entered observation

result = combined[combined['check'] == 1]
result

      actual  predicted       Date           team         team_opp  check
569        0          2 2023-04-19  Bayern Munich  Manchester City      1
837        0          2 2023-04-18        Chelsea      Real Madrid      1
1126       0          2 2023-04-19          Inter          Benfica      1
1775       0          2 2023-04-18         Napoli            Milan      1
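
The `actual` value of 0 for these rows is not a real result: the scores were missing, so the target kept its default. Since these rows fall in the test period, they are also counted as errors in the accuracy above. A minimal sketch of scoring only the played matches, using a toy frame with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    "actual":    [2, 0, 1, 0, 0],
    "predicted": [2, 0, 2, 2, 2],
    "check":     [0, 0, 0, 1, 1],  # 1 marks user-entered future matches
})

# Accuracy over every row, including matches that have no real result yet
acc_all = (toy["actual"] == toy["predicted"]).mean()

# Accuracy over played matches only
played = toy[toy["check"] == 0]
acc_played = (played["actual"] == played["predicted"]).mean()

print(acc_all, acc_played)
```

Applied to the notebook's `combined` frame, this would mean filtering on `combined["check"] == 0` before computing accuracy. If the four rows shown above are the only user-entered matches in the test set, the accuracy would change from 553/963 to 553/959, roughly 0.577.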


In [25]:
for index, row in result.iterrows():
    date = row['Date'].date()  # drop the 00:00:00 time component
    team = row['team']
    predicted = {0: 'loses to', 1: 'draws with', 2: 'beats'}[row['predicted']]
    opponent = row['team_opp']
    comment = f"{date}: {team} {predicted} {opponent}."
    print(comment)

2023-04-19: Bayern Munich beats Manchester City.
2023-04-18: Chelsea beats Real Madrid.
2023-04-19: Inter beats Benfica.
2023-04-18: Napoli beats Milan.
