<a href="https://colab.research.google.com/github/shusritavenugopal/Football-Match-Prediction/blob/main/sheInnovates_FootBall_Match_Winner_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **EPL Match Predictor Using a Random Forest Machine Learning Model**

In [None]:
import pandas as pd

This model uses the data in matches.csv.

The file has more than a thousand rows, and each row is a single match played by one team in the English Premier League (EPL).

In [None]:
matches = pd.read_csv("https://raw.githubusercontent.com/shusritavenugopal/Football-Match-Prediction/main/matches.csv",index_col=0)


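### **Investigating Missing Data**
Each EPL team plays 38 matches in a season, and there are 20 teams in a season. We have data for 2 seasons, so with one row per team per match we should have 38 × 20 × 2 = 1520 rows.

At the end of each season, 3 teams are relegated to the lower league and 3 teams are promoted to the EPL.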
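The expected row count can be checked with a little arithmetic. The sketch below is only an illustration built from the figures stated above; the constant names are made up for this example. It also shows what promotion and relegation imply for the per-team counts: teams present in both seasons should have 76 rows, and teams present in only one season should have 38.

```python
MATCHES_PER_TEAM = 38   # matches each team plays in one season
TEAMS_PER_SEASON = 20
SEASONS = 2
SWAPPED_TEAMS = 3       # teams relegated (and replaced by promoted teams) between the seasons

# One row per team per match.
expected_rows = MATCHES_PER_TEAM * TEAMS_PER_SEASON * SEASONS

# The same total, counted team by team.
teams_in_both = TEAMS_PER_SEASON - SWAPPED_TEAMS   # teams that played both seasons
teams_in_one = 2 * SWAPPED_TEAMS                   # relegated teams plus promoted teams
rows_by_team = (teams_in_both * MATCHES_PER_TEAM * SEASONS
                + teams_in_one * MATCHES_PER_TEAM)

print(expected_rows, rows_by_team, teams_in_both + teams_in_one)
```

If `matches.shape` reports fewer rows than this, some matches are missing from the file.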

In [None]:
matches.shape

In [None]:
matches["team"].value_counts()

In [None]:
matches[matches["team"] == "Liverpool"].sort_values("date")
# Liverpool is missing one season of data

In [None]:
matches["round"].value_counts()
# A few matchweeks have fewer than 39 matches because this data was scraped while the league was still in progress.

# Cleaning our data for Machine Learning

The ML models used here can only work with numeric datatypes. Several of our candidate predictors have the object datatype, so we need to create new columns that encode them as int64 or float64.

In [None]:
matches.dtypes

In [None]:
# Convert the date column to the pandas datetime format. This lets us derive predictors from the date.
matches["date"] = pd.to_datetime(matches["date"])
matches.dtypes

Creating Predictors for Machine Learning Model

In [None]:
matches["venue_code"] = matches["venue"].astype("category").cat.codes
matches
# 0 for an away game, 1 for a home game
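To see where the 0 and 1 come from, here is a minimal sketch on a made-up `venue` column (the toy values are assumptions, not rows from matches.csv). pandas sorts string categories alphabetically before numbering them, so "Away" gets code 0 and "Home" gets code 1.

```python
import pandas as pd

# Toy venue column; the values are illustrative only.
toy = pd.DataFrame({"venue": ["Home", "Away", "Home", "Away"]})

venue_cat = toy["venue"].astype("category")
toy["venue_code"] = venue_cat.cat.codes

# Categories are sorted alphabetically, so position 0 is "Away" and position 1 is "Home".
print(list(venue_cat.cat.categories))
print(toy)
```

The same rule applies to `opp_code`: each opponent gets an integer based on its alphabetical position, which carries no information about how strong the opponent is.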

In [None]:
# A unique integer code for each opponent team
matches["opp_code"] = matches["opponent"].astype("category").cat.codes
matches

In [None]:
matches["hour"] = matches["time"].str.replace(":.+", "", regex=True).astype("int")
matches
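The regular expression in the cell above removes the colon and everything after it, leaving only the hour. A small sketch on made-up kick-off times (assumed to be in the same "HH:MM" format as the `time` column):

```python
import pandas as pd

# Toy kick-off times; the values are illustrative only.
times = pd.Series(["12:30", "15:00", "20:00"])

# Strip the colon and everything after it, then convert what is left to an integer.
hours = times.str.replace(":.+", "", regex=True).astype("int")
print(hours.tolist())
```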

In [None]:
matches["day_code"] = matches["date"].dt.dayofweek
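`dt.dayofweek` numbers the days from Monday = 0 to Sunday = 6. A quick sketch with three hand-picked dates (chosen for illustration, not taken from the dataset):

```python
import pandas as pd

# 2022-01-01 was a Saturday, so the next two days are Sunday and Monday.
dates = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]))

day_codes = dates.dt.dayofweek
print(day_codes.tolist())
```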

# Target: whether the team won or not


In [None]:
matches["target"] = (matches["result"] == "W").astype("int")

In [None]:
matches

# Creating our initial machine learning model

Training a machine learning model.

A random forest is an ensemble of decision trees, where each tree is trained on a slightly different random sample of the data and so ends up with slightly different parameters.

We will import RandomForestClassifier. We are choosing a random forest classifier because it can pick up non-linearities in the data. For example, opp_code has no meaningful linear relationship with the result: the values are just codes for different opponents, so a larger code does not mean a stronger or weaker opponent. This is something a linear model can't pick up.


In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# n_estimators: the number of individual decision trees to train. More trees take longer to train and generally improve accuracy, up to a point.
# min_samples_split: the minimum number of samples a node must contain before it can be split. A higher value reduces the chance of overfitting, but the model may also fit the training data less closely.
# random_state: a fixed seed, so that running the random forest multiple times on the same data gives the same results.

rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

# This is time series data.
Splitting the data into train and test sets: we need to make sure all the test data comes chronologically after the train data. In a real-world application, future data is never available at prediction time, so training must be done on historical data only.
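A minimal sketch of a chronological split on toy data. The rows and the `cutoff` name are assumptions made for illustration. Note that this sketch uses `>=` for the test set so that the cutoff date itself is kept; that is a choice made here, and it differs from the strict `>` comparison used in this notebook.

```python
import pandas as pd

# Toy matches, deliberately out of order.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-12-28", "2022-01-01", "2022-01-02", "2021-08-14"]),
    "target": [1, 0, 1, 0],
})

cutoff = pd.Timestamp("2022-01-01")  # hypothetical cutoff date

train = toy[toy["date"] < cutoff]    # everything strictly before the cutoff
test = toy[toy["date"] >= cutoff]    # the cutoff date and everything after it

print(len(train), len(test))
```

Every training row is older than every test row, and no row is left out of both sets.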

In [None]:
train = matches[matches["date"] < '2022-01-01']

In [None]:
# Both comparisons are strict, so a match dated exactly 2022-01-01 falls in neither set.
test = matches[matches["date"] > '2022-01-01']

In [None]:
predictors = ["venue_code", "opp_code", "hour", "day_code"]

fit() trains the random forest model, using the predictors venue_code, opp_code, hour and day_code to predict the target, which is whether the team won.

In [None]:
rf.fit(train[predictors], train["target"])

Now we can generate predictions using the predict method. Pass in our test data with predictors.

In [None]:
preds = rf.predict(test[predictors])

Evaluation:

Determine the accuracy of the model.
1. Import accuracy_score. accuracy_score is a metric that reports the fraction of all predictions, win or not, that matched the actual result.
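As a rough illustration of what accuracy measures, the sketch below computes it by hand on made-up labels (1 = win, 0 = not a win). The lists are invented for this example and are not model output.

```python
# Made-up labels for eight matches.
actual =    [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 1]

# Accuracy counts every correct prediction, whether it was a win or not.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)
```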

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy = accuracy_score(test["target"], preds)
accuracy

In [None]:
combined = pd.DataFrame(dict(actual=test["target"], prediction=preds))

In [None]:
pd.crosstab(index=combined["actual"], columns=combined["prediction"])

The model's win predictions were incorrect more often than its non-win predictions.

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(combined.index, combined['actual'], label='Actual', marker='o')
plt.scatter(combined.index, combined['prediction'], label='Predicted', marker='x')
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Actual vs Predicted Values')
plt.legend()
plt.show()

In [None]:
from sklearn.metrics import precision_score

precision_score(test["target"], preds)

When we predicted a win, the team actually won only 47% of the time. The precision is pretty bad. We can improve the model.
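Precision looks only at the matches where a win was predicted. A hand-computed sketch on made-up labels (1 = win, 0 = not a win; the lists are invented for illustration):

```python
# Made-up labels for eight matches.
actual =    [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
true_positives = sum(a == 1 and p == 1 for a, p in pairs)   # predicted win, was a win
false_positives = sum(a == 0 and p == 1 for a, p in pairs)  # predicted win, was not a win

# Of all predicted wins, the share that really were wins.
precision = true_positives / (true_positives + false_positives)
print(precision)
```

Here only half of the predicted wins were real wins, even though most of the eight predictions overall were correct, which is why precision and accuracy can tell different stories.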

Improving Precision with Rolling Averages of a Team.  
If we are at matchweek 4, how did a team perform in the previous three matchweeks? We can compute rolling averages over the last three matches to capture recent performance and use them as predictors in our model.

In [None]:
# Group the matches by team so that each team's matches can be processed separately.
grouped_matches = matches.groupby("team")

In [None]:
group = grouped_matches.get_group("Manchester City").sort_values("date")
group

In [None]:
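def rolling_averages(group, cols, new_cols):
    # Sort chronologically so the rolling window looks at earlier matches.
    group = group.sort_values("date")
    # closed='left' excludes the current match, so each average uses only the 3 previous matches.
    rolling_stats = group[cols].rolling(3, closed='left').mean()
    group[new_cols] = rolling_stats
    # Drop rows with missing rolling values, such as a team's first 3 matches.
    group = group.dropna(subset=new_cols)
    return group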
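The `closed='left'` argument is what keeps the current match out of its own average. A minimal sketch on a made-up goals series (the numbers are illustrative only) shows the effect:

```python
import pandas as pd

# Toy goals-for values for five consecutive matches.
goals = pd.Series([1, 2, 3, 4, 5], name="gf")

# Window of 3 that ends just before the current row.
rolling = goals.rolling(3, closed="left").mean()

print(pd.DataFrame({"gf": goals, "gf_rolling": rolling}))
```

The first three rows have no value because fewer than three earlier matches exist, and the fourth row averages the first three matches only. Without `closed='left'`, the current match's own result would be included in its predictor, which is information we would not have before the match.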

These are the columns for which we will compute rolling averages.

In [None]:
# These columns are present in the CSV file. They stand for "goals_for", "goals_against", "shots", "shots_on_target", "distance", "free_kicks", "penalty_kicks", "penalty_kick_attempts"

cols = ["gf", "ga", "sh", "sot", "dist", "fk", "pk", "pkatt"]
new_cols = [f"{c}_rolling" for c in cols]
new_cols

In [None]:
# Let's call rolling_averages for a single group
rolling_averages(group, cols, new_cols)

We have successfully added rolling averages for the gf, ga, sh, sot, dist, fk, pk and pkatt columns (goals for, goals against, shots, shots on target, distance, free kicks, penalty kicks and penalty kick attempts). These rolling columns can now be passed to the ML model as predictors, which should help improve accuracy and precision.

In [None]:
# Let's apply the rolling averages for all the teams and matches. We will groupby teams and compute rolling averages.
matches_rolling = matches.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))
matches_rolling

In [None]:
matches_rolling = matches_rolling.droplevel('team')

In [None]:
matches_rolling

In [None]:
matches_rolling.index = range(matches_rolling.shape[0])

# Retraining our machine learning model using rolling averages as predictors.

In [None]:
def make_predictions(data, predictors):
    train = data[data["date"] < '2022-01-01']
    test = data[data["date"] > '2022-01-01']
    rf.fit(train[predictors], train["target"])
    preds = rf.predict(test[predictors])
    combined = pd.DataFrame(dict(actual=test["target"], predicted=preds), index=test.index)
    precision = precision_score(test["target"], preds)
    return combined, precision

In [None]:
combined, precision = make_predictions(matches_rolling, predictors + new_cols)

In [None]:
precision

In [None]:
combined
pd.crosstab(index=combined["actual"], columns=combined["predicted"])

In [None]:
combined = combined.merge(matches_rolling[['date', "team", "opponent", "result"]], left_index = True, right_index=True )

In [None]:
combined

Combining Home and Away Predictions

In [None]:
class MissingDict(dict):
    __missing__ = lambda self, key: key

map_values = {"Brighton and Hove Albion": "Brighton", "Manchester United": "Manchester Utd", "Newcastle United": "Newcastle Utd", "Tottenham Hotspur": "Tottenham", "West Ham United": "West Ham", "Wolverhampton Wanderers": "Wolves"}
mapping = MissingDict(**map_values)
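The point of the `__missing__` hook is that team names without an entry in the mapping pass through unchanged instead of raising an error or becoming missing values. A small standalone sketch of the same idea, using a hypothetical class name and a shortened mapping:

```python
class FallbackDict(dict):
    """A dict that returns the key itself when the key is not present (hypothetical name)."""

    def __missing__(self, key):
        return key


short_names = FallbackDict({"Tottenham Hotspur": "Tottenham", "West Ham United": "West Ham"})

print(short_names["Tottenham Hotspur"])  # has an entry, so the short name is returned
print(short_names["Arsenal"])            # no entry, so the name is returned as-is
```

`Series.map` honours `__missing__` on dict subclasses, so mapping the `team` column with such a dict renames only the teams listed and leaves the rest alone.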

In [None]:
combined["new_team"] = combined["team"].map(mapping)

In [None]:
merged = combined.merge(combined, left_on=["date", "new_team"], right_on=["date", "opponent"])
merged
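The self-merge pairs each team's row with the row for the same match seen from the opponent's side. A minimal sketch on two made-up rows (the teams, date and predictions are invented for illustration):

```python
import pandas as pd

# One match, recorded once per team.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-02", "2022-01-02"]),
    "team": ["Arsenal", "Chelsea"],
    "opponent": ["Chelsea", "Arsenal"],
    "predicted": [1, 0],
})
toy["new_team"] = toy["team"]  # these two names need no renaming

# Match each row to the row where this team appears as the opponent on the same date.
merged_toy = toy.merge(toy, left_on=["date", "new_team"], right_on=["date", "opponent"])

print(merged_toy[["team_x", "predicted_x", "team_y", "predicted_y"]])
```

Columns ending in `_x` come from the team's own row and columns ending in `_y` from the opponent's row, so `predicted_x == 1` together with `predicted_y == 0` means the two predictions for the match agree.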

In [None]:
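# Keep only matches where one side was predicted to win and the other was not, then count the actual outcomes.
merged[(merged["predicted_x"] == 1) & (merged["predicted_y"] == 0)]["actual_x"].value_counts()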

In [None]:
27 / 40 * 100