# High scoring games ML model #
Utilsing a decision tree to help provide insight into potential high scoring games in attempts to close the house edge.

Uses instruction from https://www.youtube.com/watch?v=0irmDBWLrco

Imports:

In [1]:
#import
import pandas as pd


#read our data
matches= pd.read_csv("data_21_23.csv")

import the ml stuff

In [2]:
from sklearn.ensemble import RandomForestClassifier

In [3]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

# Prepare the data for ML #
We need to prepare our data, inclusing selecting test and training data

Unlike the video, we aren't trying to predict the outcome of the game, just what will be high scoring. As such the useful parameter are differnt.

In [4]:
matches["home_code"]= matches["HomeTeam"].astype("category").cat.codes

In [5]:
matches["away_code"]= matches["AwayTeam"].astype("category").cat.codes

In [6]:
matches['Date'] = pd.to_datetime(matches['Date'])
matches["Day_code"] = matches["Date"].dt.dayofweek

In [7]:
print(matches.head())

  Div       Date   Time        HomeTeam       AwayTeam  FTHG  FTAG FTR  HTHG  \
0  E0 2022-08-05  20:00  Crystal Palace        Arsenal   0.0   2.0   A   0.0   
1  E0 2022-08-06  12:30          Fulham      Liverpool   2.0   2.0   D   1.0   
2  E0 2022-08-06  15:00     Bournemouth    Aston Villa   2.0   0.0   H   1.0   
3  E0 2022-08-06  15:00           Leeds         Wolves   2.0   1.0   H   1.0   
4  E0 2022-08-06  15:00       Newcastle  Nott'm Forest   2.0   0.0   H   0.0   

   HTAG  ... B365CAHA PCAHH  PCAHA  MaxCAHH  MaxCAHA  AvgCAHH  AvgCAHA  \
0   1.0  ...     1.84  2.04   1.88     2.09     1.88     2.03     1.85   
1   0.0  ...     2.03  1.91   2.02     2.01     2.06     1.89     1.99   
2   0.0  ...     2.00  1.93   2.00     1.94     2.04     1.88     2.00   
3   1.0  ...     1.85  2.10   1.84     2.14     1.87     2.08     1.81   
4   0.0  ...     1.96  1.99   1.93     2.19     1.97     2.03     1.86   

   home_code  away_code  Day_code  
0          7          0       4.0  
1 

We also need to create a target colmun detailing whether or not there were >=3 goals in the game.

In [8]:
matches['high_scoring'] = (matches['FTHG'] + matches['FTAG'] >= 3).astype(int)


In [9]:
print(matches.head())

  Div       Date   Time        HomeTeam       AwayTeam  FTHG  FTAG FTR  HTHG  \
0  E0 2022-08-05  20:00  Crystal Palace        Arsenal   0.0   2.0   A   0.0   
1  E0 2022-08-06  12:30          Fulham      Liverpool   2.0   2.0   D   1.0   
2  E0 2022-08-06  15:00     Bournemouth    Aston Villa   2.0   0.0   H   1.0   
3  E0 2022-08-06  15:00           Leeds         Wolves   2.0   1.0   H   1.0   
4  E0 2022-08-06  15:00       Newcastle  Nott'm Forest   2.0   0.0   H   0.0   

   HTAG  ... PCAHH PCAHA  MaxCAHH  MaxCAHA  AvgCAHH  AvgCAHA  home_code  \
0   1.0  ...  2.04  1.88     2.09     1.88     2.03     1.85          7   
1   0.0  ...  1.91  2.02     2.01     2.06     1.89     1.99          9   
2   0.0  ...  1.93  2.00     1.94     2.04     1.88     2.00          2   
3   1.0  ...  2.10  1.84     2.14     1.87     2.08     1.81         10   
4   0.0  ...  1.99  1.93     2.19     1.97     2.03     1.86         15   

   away_code  Day_code  high_scoring  
0          0       4.0       

# Training the model #
Now to choose the predictors: these all need to be data points that are obtainable pre match to feed into the model, and features that are likley to affect the number of goals scored.
I have chosen to include the odds of >2.5 from bet365. This is included to encapsulate the bookmakers opinion. B365 are chosen as they are one of the biggest bookmakers so are likley to maintain good odds. I am also banned from bet365, so can't bet there so might as well use their information for other purposes.

In [10]:
train = matches[matches["Date"]<"2022-01-08"]

In [11]:
test = matches[matches["Date"]>"2022-01-08"]

In [12]:
predictors = ["Day_code","home_code","away_code", "B365>2.5"]

In [13]:
rf.fit(train[predictors],train["high_scoring"])

In [14]:
preds = rf.predict(test[predictors])

In [15]:
from sklearn.metrics import accuracy_score

In [16]:
acc = accuracy_score(test["high_scoring"],preds)

In [17]:
acc

0.5676625659050967

This is theoretically enough to be profitable in the long run, with break even equvilant odds being 1.76 whilst av odds were 1.85. 

Expected profit if this contiuned would be 0.56 x 1.85 = 1.05 or 5%

In [18]:
combined = pd.DataFrame(dict(actual=test["high_scoring"],prediction=preds))

In [19]:
pd.crosstab(index=combined["actual"],columns=combined["prediction"])

prediction,0,1
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,108,160
1,86,215


But as what we will bet on is when the model predcit that there will be a high scoring game, this is what we care about. So lets use that as the accuary.

In [20]:
from sklearn.metrics import precision_score

In [21]:
prec=precision_score(test["high_scoring"],preds)

In [22]:
prec

0.5733333333333334

Mildly higher

## Imporve our model ##
TO imporve the modle we will include rolling average of goals scored against and by these teams up to this game week

In [23]:
grouped_matches = matches.groupby("HomeTeam")

In [24]:
city_home=grouped_matches.get_group("Man City")

In [25]:
def rolling_averages(group, cols, new_cols):
    group = group.sort_values("Date")
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    group = group.dropna(subset=new_cols)
    return group

In [26]:
cols = ["FTHG", "FTAG", "B365>2.5" ]
new_cols =[f"{c}_rolling" for c in cols]

In [27]:
matches_rolling = matches.groupby("HomeTeam").apply(lambda x: rolling_averages(x, cols, new_cols))

In [28]:
matches_rolling = matches_rolling.droplevel('HomeTeam')

In [29]:
matches_rolling.index = range(matches_rolling.shape[0])

In [31]:
def make_predictions(data, predictors):
    train2 = data[data["Date"] < '2022-01-08']
    test2 = data[data["Date"] > '2022-01-08']
    rf.fit(train[predictors], train["high_scoring"])
    preds = rf.predict(test[predictors])
    combined = pd.DataFrame(dict(actual=test["high_scoring"], predicted=preds), index=test.index)
    prec = precision_score(test["high_scoring"], preds)
    return combined, prec

In [32]:
combined, prec = make_predictions(matches_rolling, predictors + new_cols)

In [34]:
prec

0.5196374622356495

## Irony ##
In attempts to improve the model, the model got worse. Possible reasoning: the model worked best when the inputs were what teams are platying, when and the bookmaker odds. When more data is in put, the model precision drops. The bookmaker odds contain within themselfs much of this information, and it would seem a simpler system based on who is playing and what the bookies think is possible less likely to be swayed be trends that don't repeat them selves.

Returning to the original model:

In [63]:
model2 = RandomForestClassifier(n_estimators=100, min_samples_split=10, random_state=3)

In [64]:
model2.fit(train[predictors],train["high_scoring"])

In [65]:
preds = model2.predict(test[predictors])

In [66]:
predictors

['Day_code', 'home_code', 'away_code', 'B365>2.5']

In [69]:
combined = pd.DataFrame(dict(actual=test["high_scoring"],prediction=preds))
pd.crosstab(index=combined["actual"],columns=combined["prediction"])
prec = precision_score(test["high_scoring"], preds)

In [70]:
prec

0.5681818181818182