In [18]:
import numpy as np 
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn import metrics
pd.set_option('display.max_columns', None)

First let's load our (already normalized) data, and specify input and target variables

In [None]:
df = pd.read_csv("PATH\\wszystko_znormalizowane.csv", encoding='utf-8-sig', index_col=[0])
df = df.dropna() #Remove all rows with at least one element, just in case
df = df.drop_duplicates() #Just in case
custom_map = {'Wygrali gospodarze':0,'Remis':1,'Wygrali goście':2} #Rename target variable elements to integers so that these are compatible with classification models
df['Kto_wygrał'] = df['Kto_wygrał'].map(custom_map)
odds1 = df[['Oddsy_gospodarze', 'Oddsy_remis', 'Oddsy_goście', 'Kto_wygrał']] 

df_Xy = df.drop(df.columns[range(10)], axis=1)
y = df_Xy['Kto_wygrał']
X = df_Xy.drop(['Kto_wygrał'], axis=1)

In [20]:
# SFG - Stałe Fragmenty Gry. Basically a percentage of how many goals each team scored from set pieces during the previous season. All data for each game is from the previous season, no more no less.
# Other variable names are pretty much self-explanatory
#df.head(30) #Uncomment if you want to take a look at what we're working with

Let's quickly check how much would we get back if we just gambled randomly:

In [21]:
tabela = odds1.copy()
array = np.random.randint(3, size=2939)
tabela.insert(4, "Randomly_assigned", array)
tabela.head(10)
rand_return_per_100 = 0
for j in range(5000):
    tabela = odds1.copy()
    array = np.random.randint(3, size=2939)
    tabela.insert(4, "Randomly_assigned", array)
    tabela["Wygrana"] = 0.0
    tabela["Wygrana"] = np.where(tabela["Kto_wygrał"] == tabela["Randomly_assigned"], 1, 0.0)
    tabela.loc[tabela.Kto_wygrał == 0,'Wygrana'] = tabela.Oddsy_gospodarze * 100 * tabela.Wygrana
    tabela.loc[tabela.Kto_wygrał == 1,'Wygrana'] = tabela.Oddsy_remis * 100 * tabela.Wygrana
    tabela.loc[tabela.Kto_wygrał == 2,'Wygrana'] = tabela.Oddsy_goście * 100 * tabela.Wygrana
    rand_return_per_100 = rand_return_per_100 + (tabela['Wygrana'].sum())/len(tabela['Wygrana'])
print(rand_return_per_100/5000)

94.39576522626719


Well that means that our model should on average bring at the very least ~94.30$ (per 100 invested).

#### ExpectedReturnNaive algorithm picks scenario with the highest probability, and returns what we would get back by placing bets according to this strategy.

Here we want to observe how betting on the most probable scenario will impact our returns

In [22]:
def ExpectedReturnNaive(__Classifier__, __criterion__, odds, X, y, i, j, t, wyg, acc):
    X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=t, random_state=j)
    meczTree = __Classifier__(criterion=__criterion__, max_depth = i) #From my observations, criterion type's impact is extremely marginal
    meczTree.fit(X_trainset,y_trainset) #Model training ^^^

    clf_predictions = meczTree.predict(X_testset) 
    indices = list(X_testset.index.astype(int))
    table = odds.loc[indices] #Makes sure that we actually allocate correct odds to a correct match
    table['Predicted_Winner'] = pd.Series(clf_predictions)

    table = table.dropna() #Somehow dropping NaN values twice wasn't enough
    table['Predicted_Winner'] = table['Predicted_Winner'].astype(int)
    table['Wygrana'] = 0.0
    table.loc[table.Kto_wygrał == table.Predicted_Winner,'Wygrana'] = 1 #Prediction is correct => assign 1. It's incorrect => leave it at 0
    table.loc[table.Kto_wygrał == 0,'Wygrana'] = table.Oddsy_gospodarze * 100 * table.Wygrana
    table.loc[table.Kto_wygrał == 1,'Wygrana'] = table.Oddsy_remis * 100 * table.Wygrana
    table.loc[table.Kto_wygrał == 2,'Wygrana'] = table.Oddsy_goście * 100 * table.Wygrana
    
    wyg = wyg + (table['Wygrana'].sum())/len(table['Wygrana']) #Average return per 100zł
    acc = acc + 100*(round(metrics.accuracy_score(y_testset, clf_predictions),4)) #Percentage score of correct picks
    return wyg, acc

#### ExpectedReturnAdvanced algorithm calculates Expected Value of each bet (i.e. P(scenario A) x bookmaker odds for scenario A), picks the one with highest EV, and returns what it'd yield back per 100zł placed, over the course of 2015/16 - 2024/25 seasons.

It's may not be the prettiest chunk of code, but tech debt isn't really an issue here. It's purpose is to give each class its own weight (i.e. odds of each scenario), for each sample respectively. Scikit-learn doesn't allow to allocate a dataframe of class weights (one for each observation), so we have to do it manually. Because of that, we can't really train a model to directly search for highest betting returns - however, if probability scores are accurate enough, that's... basically the same thing.

In [23]:
def ExpectedReturnAdvanced(__Classifier__, __criterion__, odds, X, y, i, j, t, wyg, acc):

    X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=t, random_state=j)
    meczTree = __Classifier__(criterion=__criterion__, max_depth = i)
    meczTree.fit(X_trainset,y_trainset) #Model training ^^^

    clf_predictions = meczTree.predict_proba(X_testset) #Notice how this time we are deriving exact probabilities of each scenario
    clf_predictions = pd.DataFrame(clf_predictions)
    indices = list(X_testset.index)
    odds_tree_predictions = odds.loc[indices]
    odds_tree_predictions = odds_tree_predictions.reset_index(drop=True)
    table = odds_tree_predictions.merge(clf_predictions, left_index=True, right_index=True)

    table['Potencjalny_return'] = 0
    table['Potencjalny_return'] = np.where(table['Kto_wygrał'] == 0, table['Oddsy_gospodarze']*100, table['Potencjalny_return']) #Creates a separate column that stores how much we'd get back if we picked the winner correctly
    table['Potencjalny_return'] = np.where(table['Kto_wygrał'] == 1, table['Oddsy_remis']*100, table['Potencjalny_return'])
    table['Potencjalny_return'] = np.where(table['Kto_wygrał'] == 2, table['Oddsy_goście']*100, table['Potencjalny_return'])
    table['Potencjalny_return'] = table['Potencjalny_return'].astype(int)
    table['Return_gospodarze'] = table['Oddsy_gospodarze'] * table[0] #Return_A calculates EV of scenario A
    table['Return_remis'] = table['Oddsy_remis'] * table[1] 
    table['Return_goście'] = table['Oddsy_goście'] * table[2]
    table["Max(EV)"] = table[['Return_gospodarze', 'Return_remis', 'Return_goście']].max(axis=1) #Takes the highest EV out of all scenarios (for each observation)
    table['Max(EV)'] = np.where(table['Max(EV)'] == table['Return_gospodarze'], 0, table['Max(EV)']) #Picks which scenario in a given observation has the highest EV
    table['Max(EV)'] = np.where(table['Max(EV)'] == table['Return_remis'], 1, table['Max(EV)']) 
    table['Max(EV)'] = np.where(table['Max(EV)'] == table['Return_goście'], 2, table['Max(EV)']) #In this model, we would bet exactly on the pick with highest EV
    table['Max(EV)'] = table['Max(EV)'].astype(int)
    table['Actual_return'] = np.where(table['Max(EV)'] == table['Kto_wygrał'], table['Potencjalny_return'], 0) #If scenario with max. EV = scenario that happened, allocate returns into here
    
    wyg = wyg + table['Actual_return'].sum() / len(table) #Average return per 100zł
    acc = acc + 100*(np.count_nonzero(table['Actual_return']))/len(table['Actual_return']) #Percentage score of correct picks

    return wyg, acc

Now let's test how both approaches behave

In [24]:
test_size = 0.2
n = 100
for i in range(1,30):
    return1, return2 = (0,0)
    accuracy1, accuracy2 = (0,0)
    for j in range(n):
        return1, accuracy1 = ExpectedReturnNaive(DecisionTreeClassifier, "entropy", odds1, X, y, i, j, test_size, return1, accuracy1)
    print(f"Most probable scenario | max_depth={i}, test_size = {test_size} ==> Average return per 100zł: {(return1/n):.3f}, Average accuracy score = {(accuracy1/n):.3f}%") 
    for j in range(n):
        return2, accuracy2 = ExpectedReturnAdvanced(DecisionTreeClassifier, "entropy", odds1, X, y, i, j, test_size, return2, accuracy2)
    print(f"Highest EV scenario    | max_depth={i}, test_size = {test_size} ==> Average return per 100zł: {(return2/n):.3f}, Average accuracy score = {(accuracy2/n):.3f}%")
#Here we're defining a random_state, because we want reproducibility - crucial for cross examination.
#We repeat the algorithm n times, because, well, we don't want to hop onto a bad model just because it got a lucky test sample

Most probable scenario | max_depth=1, test_size = 0.2 ==> Average return per 100zł: 97.767, Average accuracy score = 43.754%
Highest EV scenario    | max_depth=1, test_size = 0.2 ==> Average return per 100zł: 92.839, Average accuracy score = 28.449%
Most probable scenario | max_depth=2, test_size = 0.2 ==> Average return per 100zł: 96.774, Average accuracy score = 43.965%
Highest EV scenario    | max_depth=2, test_size = 0.2 ==> Average return per 100zł: 92.365, Average accuracy score = 29.524%
Most probable scenario | max_depth=3, test_size = 0.2 ==> Average return per 100zł: 98.052, Average accuracy score = 43.841%
Highest EV scenario    | max_depth=3, test_size = 0.2 ==> Average return per 100zł: 91.818, Average accuracy score = 29.599%
Most probable scenario | max_depth=4, test_size = 0.2 ==> Average return per 100zł: 97.829, Average accuracy score = 43.649%
Highest EV scenario    | max_depth=4, test_size = 0.2 ==> Average return per 100zł: 91.689, Average accuracy score = 29.726%


As we can see, the "naive" solution doesn't do that bad - it's not profitable, but it performs better than average, especially for max_depth between 3 and 12. However, the supposedly 'advanced" algorithm performs horrifyingly badly.

How come?
- We are betting on less probable scenarios, but ones that wield larger returns. Hence, its' accuracy score is (and should be) below 33%; the issue may be that it shouldn't be that low (it oscilates between 27 and 30%)
- Or maybe the probability scores are inaccurate. If a scenario with odds = 12.00 has a 20% chance of occuring, we should definitely bet on it. But 5%? That's why these values matter.

###### (oh and btw, if you're curious - high values for max_depth between 6 and 8 is due to variance; I checked it for n=1000 and its' return sits around 97%)

#### Let's try other models: This time, we'll go with a Random Forest. Since it takes way longer to compute, we won't make as many repetitions this time.

In [25]:
test_size = 0.25
n = 100
for i in range(1,6):
    return1, return2 = (0,0)
    accuracy1, accuracy2 = (0,0)
    for j in range(n):
        return1, accuracy1 = ExpectedReturnNaive(RandomForestClassifier, "entropy", odds1, X, y, i, j, test_size, return1, accuracy1)
        #return2, accuracy2 = ExpectedReturnAdvanced(RandomForestClassifier, "entropy", odds1, X, y, i, j, test_size, return2, accuracy2)
    print(f"Most probable scenario | max_depth={i}, test_size = {test_size} ==> Average return per 100zł: {(return1/n):.3f}, Average accuracy score = {(accuracy1/n):.3f}%") 
    #print(f"Highest EV scenario    | max_depth={i}, test_size = {test_size} ==> Average return per 100zł: {(return2/n):.3f}, Average accuracy score = {(accuracy2/n):.3f}%")

Most probable scenario | max_depth=1, test_size = 0.25 ==> Average return per 100zł: 98.683, Average accuracy score = 43.982%
Most probable scenario | max_depth=2, test_size = 0.25 ==> Average return per 100zł: 98.377, Average accuracy score = 44.388%
Most probable scenario | max_depth=3, test_size = 0.25 ==> Average return per 100zł: 97.430, Average accuracy score = 44.895%
Most probable scenario | max_depth=4, test_size = 0.25 ==> Average return per 100zł: 97.149, Average accuracy score = 44.993%
Most probable scenario | max_depth=5, test_size = 0.25 ==> Average return per 100zł: 96.825, Average accuracy score = 44.881%


Well that... doesn't look too bad. Still not profitable, but the test size and n repetitions are so large here that variance should be miniscule, and yet it's still not that far off from profitability. I mean, if you pick "correct" parameters, like n=20 and test_size = 0.05, you may find a 16% edge - potentially effective for convincing shareholders that your ideas are very good, a bit less efficient for holding onto your job.

If you want to learn why I commented out the "advanced" algorithm, you can uncomment it and find out yourself.

#### Let's now check how XGBoost performs. It has a different syntax, so we can't just copy previously defined functions.

In [26]:
def XGBoostPredictionsNaive(oddsy, X, y, t, x, i, j, wyg, acc):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=t, random_state=j)
    xgb_train = xgb.DMatrix(X_train, y_train, enable_categorical=True)
    xgb_test = xgb.DMatrix(X_test, y_test, enable_categorical=True)
    params = {
        'objective': 'multi:softmax', #Puts observations into one of 3 categories
        'max_depth': i,
        'learning_rate': x,
        'num_class': 3
    }
    model = xgb.train(params=params, dtrain=xgb_train, num_boost_round=50)
    xgb_predictions = model.predict(xgb_test)
    xgb_predictions = xgb_predictions.astype(int)
    indices = list(X_test.index)
    oddsy_xgb_predictions = oddsy.loc[indices]
    oddsy_xgb_predictions = oddsy_xgb_predictions.reset_index(drop=True)
    oddsy_xgb_predictions.insert(4, 'Predicted_Winner', xgb_predictions)
    tabelka = oddsy_xgb_predictions
    tabelka.loc[tabelka.Kto_wygrał == tabelka.Predicted_Winner,'Wygrana'] = 1
    tabelka.loc[tabelka.Kto_wygrał == 0,'Wygrana'] = tabelka.Oddsy_gospodarze * 100 * tabelka.Wygrana
    tabelka.loc[tabelka.Kto_wygrał == 1,'Wygrana'] = tabelka.Oddsy_remis * 100 * tabelka.Wygrana
    tabelka.loc[tabelka.Kto_wygrał == 2,'Wygrana'] = tabelka.Oddsy_goście * 100 * tabelka.Wygrana
    tabelka['Wygrana'] = tabelka['Wygrana'].fillna(0)
    tabelka['Wygrana'] = tabelka['Wygrana'].astype(int)

    wyg = wyg + (tabelka['Wygrana'].sum())/(len(tabelka['Wygrana']))
    acc = acc + 100*(np.count_nonzero(tabelka['Wygrana']))/len(tabelka['Wygrana'])

    return wyg, acc



def XGBoostPredictionsAdvanced(oddsy, X, y, t, x, i, j, wyg, acc): #Let's try again
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=t, random_state=j)
    xgb_train = xgb.DMatrix(X_train, y_train, enable_categorical=True)
    xgb_test = xgb.DMatrix(X_test, y_test, enable_categorical=True)
    params = {
        'objective': 'multi:softprob', #Predicts a probability score of each scenario
        'max_depth': i,
        'learning_rate': x,
        'num_class': 3
    }
    model = xgb.train(params=params, dtrain=xgb_train, num_boost_round=50)
    xgb_predictions = model.predict(xgb_test)
    xgb_predictions = pd.DataFrame(xgb_predictions)
    indices = list(X_test.index)
    oddsy_xgb_predictions = oddsy.loc[indices]
    oddsy_xgb_predictions = oddsy_xgb_predictions.reset_index(drop=True)
    tabelka = oddsy_xgb_predictions.merge(xgb_predictions, left_index=True, right_index=True)
    tabelka['Potencjalny_return'] = 0 #I thought of putting everything below this part into a separate function, but it caused more issues than I was bothered to fix. Maybe later. It's not necessary, not if we don't have to maintain too much of this code
    tabelka['Potencjalny_return'] = np.where(tabelka['Kto_wygrał'] == 0, tabelka['Oddsy_gospodarze']*100, tabelka['Potencjalny_return'])
    tabelka['Potencjalny_return'] = np.where(tabelka['Kto_wygrał'] == 1, tabelka['Oddsy_remis']*100, tabelka['Potencjalny_return'])
    tabelka['Potencjalny_return'] = np.where(tabelka['Kto_wygrał'] == 2, tabelka['Oddsy_goście']*100, tabelka['Potencjalny_return'])
    tabelka['Potencjalny_return'] = tabelka['Potencjalny_return'].astype(int)
    tabelka['Return_gospodarze'] = tabelka['Oddsy_gospodarze'] * tabelka[0]
    tabelka['Return_remis'] = tabelka['Oddsy_remis'] * tabelka[1]
    tabelka['Return_goście'] = tabelka['Oddsy_goście'] * tabelka[2]
    tabelka["Max(EV)"] = tabelka[['Return_gospodarze', 'Return_remis', 'Return_goście']].max(axis=1)
    tabelka['Max(EV)'] = np.where(tabelka['Max(EV)'] == tabelka['Return_gospodarze'], 0, tabelka['Max(EV)'])
    tabelka['Max(EV)'] = np.where(tabelka['Max(EV)'] == tabelka['Return_remis'], 1, tabelka['Max(EV)'])
    tabelka['Max(EV)'] = np.where(tabelka['Max(EV)'] == tabelka['Return_goście'], 2, tabelka['Max(EV)'])
    tabelka['Max(EV)'] = tabelka['Max(EV)'].astype(int)
    tabelka['Actual_return'] = np.where(tabelka['Max(EV)'] == tabelka['Kto_wygrał'], tabelka['Potencjalny_return'], 0)
    return_calkowity = tabelka['Actual_return'].sum()
    wyg = wyg + return_calkowity / len(tabelka)
    acc = acc + 100*(np.count_nonzero(tabelka['Actual_return']))/len(tabelka['Actual_return'])

    return wyg, acc

In [27]:
test_size = 0.2
n = 50
learning_rate = 0.001

#for i in range(1,6):    Not needed when learning_rate is so low; other values will just yield the same result. We can tune other parameters as well, but that will most likely keep us around 99%
wygrana1, wygrana2 = (0,0)
accuracy1, accuracy2 = (0,0)
for j in range(n):
    wygrana1, accuracy1 = XGBoostPredictionsNaive(odds1, X, y, test_size, learning_rate, 1, j, wygrana1, accuracy1)
    wygrana2, accuracy2 = XGBoostPredictionsAdvanced(odds1, X, y, test_size, learning_rate, 1, j, wygrana2, accuracy2)
print(f"XGBoostPredictionsNaive    | Average return per 100zł for test_size = {test_size}, learning_rate = {learning_rate}, max_depth = {i}:  {wygrana1/n:.3f}zł,  with accuracy_score = {accuracy1/n:.3f}%")
print(f"XGBoostPredictionsAdvanced | Average return per 100zł for test_size = {test_size}, learning_rate = {learning_rate}, max_depth = {i}:  {wygrana2/n:.3f}zł,  with accuracy_score = {accuracy2/n:.3f}%")


XGBoostPredictionsNaive    | Average return per 100zł for test_size = 0.2, learning_rate = 0.001, max_depth = 5:  99.068zł,  with accuracy_score = 44.310%
XGBoostPredictionsAdvanced | Average return per 100zł for test_size = 0.2, learning_rate = 0.001, max_depth = 5:  90.995zł,  with accuracy_score = 24.680%


Well that's just embarassing for the so-called "advanced" algorithm. Percentage scores must be totally off.

At the same time, again, the "naive" algorithm edges close to profitability, but it's so close yet so far.

There's also LogisticRegression that was worth trying out, but honestly, results are comparable to how decision trees performed, i.e. that would bring exactly nothing to the table.
"Naive" score oscilates around 97%, whereas "advanced" sits close to 93%. Again, not even better than random selection.

#### Some conclusions:
- The "naive" algorithm is... not bad. 99% return doesn't sound too impressive, but keep in mind that EV of playing this game is 94.3%, meaning that if this were a zero-sum game, this algorithm would actually find a 5% edge.
 Especially impressive after considering the fact that the only data we are working with here, is each team's performance during the previous season. That's it.
- If this model turned out to be profitable, then it'd be exceptionally easy to maintain. Data insertion would only have to occur once per year.
- The "advanced" algorithm is obviously bad. Something must be off with percentage scores, it does matter if a 40%, 35%, 25% triplet of scenarios is predicted to be a 45%, 40%, 15%. We can see it by how low the accuracy score is; to some extent, it should be lower than 33%, because maximizing EV means that sometimes we should bet for the underdogs. But it can't be as low as 24%.
- I've attempted to calibrate percentage scores using a CalibratedClassifierCV, but with little result. I might come back later and try again.
- Don't gamble. Unless you have a verifiably profitable model. Then do gamble. It might turn out to be a safer investment than buying MAG-7 stocks (maybe not).