# Predict the match results for season 2015-2016 

Build a model to predict the match result (Win, Lose or Draw) of each match in the 2015-2016 season of the English 
Premier League (EPL). You will have to come up with the metrics on which the prediction is based. 
Results are defined from the home team's perspective:

##### win – Home Team goals > Away Team goals 
##### draw – Home Team goals = Away Team goals 
##### lose – Home Team goals < Away Team goals 

__The steps I will take in this notebook:__

1. Load and Explore Data
- Check data structure, missing values, and relevant features.

2. Feature Engineering
- Create meaningful metrics like team performance, home advantage, goal difference, etc.

3. Preprocessing
- Encode categorical variables.
- Normalize or scale numerical features.

4. Train a Classification Model
- Use a machine learning model (e.g., Logistic Regression, Random Forest, or XGBoost) to predict match results.

5. Evaluate the Model
- Use accuracy, confusion matrix, and classification report to assess performance.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


### Load & Explore the Data

In [10]:
# Load the training and test datasets
train_path = "EPL_assignment/epl_matches_train.csv"
test_path = "EPL_assignment/epl_matches_test.csv"

df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

# Display the first few rows of the training dataset
df_train.head()

Unnamed: 0,season,stage,date,match_id,home_team_id,away_team_id,home_player_X1,home_player_X2,home_player_X3,home_player_X4,...,red_card_home_team,red_card_away_team,crosses_home_team,crosses_away_team,corner_home_team,corner_away_team,possession_home_team,possession_away_team,home_team_goal,away_team_goal
0,2008/2009,1,2008-08-17 00:00:00,49337,10260,10261,1,2,4,6,...,0,0,24,9,6,6,55.0,45.0,1,1
1,2008/2009,1,2008-08-16 00:00:00,38136,9825,8659,1,2,4,6,...,0,0,21,7,7,5,66.0,34.0,1,0
2,2008/2009,1,2008-08-16 00:00:00,43276,8472,8650,1,2,4,6,...,0,0,15,19,1,8,46.0,54.0,0,1
3,2008/2009,1,2008-08-16 00:00:00,40671,8654,8528,1,2,4,6,...,0,0,15,27,6,10,52.0,48.0,2,1
4,2008/2009,1,2008-08-17 00:00:00,34633,10252,8456,1,2,4,6,...,0,0,16,16,7,8,52.0,48.0,4,2


In [11]:
# Drop every row that contains at least one missing value
df_train.dropna(inplace=True)

df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2579 entries, 0 to 2659
Data columns (total 90 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   season                     2579 non-null   object 
 1   stage                      2579 non-null   int64  
 2   date                       2579 non-null   object 
 3   match_id                   2579 non-null   int64  
 4   home_team_id               2579 non-null   int64  
 5   away_team_id               2579 non-null   int64  
 6   home_player_X1             2579 non-null   int64  
 7   home_player_X2             2579 non-null   int64  
 8   home_player_X3             2579 non-null   int64  
 9   home_player_X4             2579 non-null   int64  
 10  home_player_X5             2579 non-null   int64  
 11  home_player_X6             2579 non-null   int64  
 12  home_player_X7             2579 non-null   int64  
 13  home_player_X8             2579 non-null   int64  
 1
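The `info()` output above reports 2579 entries while the index still runs up to 2659, so `dropna()` removed some rows. A minimal sketch of how the missing values could be counted before dropping anything is shown here. The toy frame and its values are made up for illustration; only the column names come from the training data.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the training data.
toy = pd.DataFrame({
    "possession_home_team": [55.0, np.nan, 46.0, 52.0],
    "corner_home_team": [6, 7, np.nan, 6],
    "home_team_goal": [1, 1, 0, 2],
})

# Number of missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Number of rows that dropna() would remove.
rows_lost = len(toy) - len(toy.dropna())
print(f"dropna() would remove {rows_lost} of {len(toy)} rows")
```

Running the same two checks on `df_train` before the `dropna()` call would show which columns are responsible for the removed rows.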

### Feature Engineering

#### For train:

In [7]:
# Create target variable 'match_result'
def get_match_result(home_goals, away_goals):
    if home_goals > away_goals:
        return "Win"
    elif home_goals < away_goals:
        return "Lose"
    else:
        return "Draw"

df_train["match_result"] = df_train.apply(lambda x: get_match_result(x["home_team_goal"], x["away_team_goal"]), axis=1)

# Select key features (excluding player IDs and unnecessary columns)
features = [
    "home_team_id", "away_team_id",
    "possession_home_team", "possession_away_team",
    "corner_home_team", "corner_away_team",
    "crosses_home_team", "crosses_away_team",
    "red_card_home_team", "red_card_away_team",
    "home_team_goal", "away_team_goal"
]

df_train_filtered = df_train[features + ["match_result"]]

# Show data overview
df_train_filtered.head()

Unnamed: 0,home_team_id,away_team_id,possession_home_team,possession_away_team,corner_home_team,corner_away_team,crosses_home_team,crosses_away_team,red_card_home_team,red_card_away_team,home_team_goal,away_team_goal,match_result
0,10260,10261,55.0,45.0,6,6,24,9,0,0,1,1,Draw
1,9825,8659,66.0,34.0,7,5,21,7,0,0,1,0,Win
2,8472,8650,46.0,54.0,1,8,15,19,0,0,0,1,Lose
3,8654,8528,52.0,48.0,6,10,15,27,0,0,2,1,Win
5,8668,8655,51.0,49.0,3,4,14,21,0,0,2,3,Lose
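The target above is built row by row with `apply`. As an alternative sketch, the same labels can be derived with whole-column comparisons and `np.select`, which avoids a Python-level call per row. The toy frame below reuses the goal counts of the five rows shown in the output; it is an illustration, not the notebook's own cell.

```python
import numpy as np
import pandas as pd

# Goal counts of the five rows displayed above.
toy = pd.DataFrame({
    "home_team_goal": [1, 1, 0, 2, 2],
    "away_team_goal": [1, 0, 1, 1, 3],
})

# Conditions are checked in order; rows matching neither fall back to "Draw".
conditions = [
    toy["home_team_goal"] > toy["away_team_goal"],
    toy["home_team_goal"] < toy["away_team_goal"],
]
toy["match_result"] = np.select(conditions, ["Win", "Lose"], default="Draw")

print(toy["match_result"].tolist())
# ['Draw', 'Win', 'Lose', 'Win', 'Lose']
```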


In [8]:
# Goal Difference
df_train_filtered["goal_difference"] = df_train_filtered["home_team_goal"] - df_train_filtered["away_team_goal"]

# Possession Difference
df_train_filtered["possession_difference"] = df_train_filtered["possession_home_team"] - df_train_filtered["possession_away_team"]

# Set-Piece Advantage (Corners & Crosses Difference)
df_train_filtered["corner_difference"] = df_train_filtered["corner_home_team"] - df_train_filtered["corner_away_team"]
df_train_filtered["crosses_difference"] = df_train_filtered["crosses_home_team"] - df_train_filtered["crosses_away_team"]

# Red Card Impact
df_train_filtered["red_card_difference"] = df_train_filtered["red_card_away_team"] - df_train_filtered["red_card_home_team"]

# Show data overview
df_train_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train_filtered["goal_difference"] = df_train_filtered["home_team_goal"] - df_train_filtered["away_team_goal"]

Unnamed: 0,home_team_id,away_team_id,possession_home_team,possession_away_team,corner_home_team,corner_away_team,crosses_home_team,crosses_away_team,red_card_home_team,red_card_away_team,home_team_goal,away_team_goal,match_result,goal_difference,possession_difference,corner_difference,crosses_difference,red_card_difference
0,10260,10261,55.0,45.0,6,6,24,9,0,0,1,1,Draw,0,10.0,0,15,0
1,9825,8659,66.0,34.0,7,5,21,7,0,0,1,0,Win,1,32.0,2,14,0
2,8472,8650,46.0,54.0,1,8,15,19,0,0,0,1,Lose,-1,-8.0,-7,-4,0
3,8654,8528,52.0,48.0,6,10,15,27,0,0,2,1,Win,1,4.0,-4,-12,0
5,8668,8655,51.0,49.0,3,4,14,21,0,0,2,3,Lose,-1,2.0,-1,-7,0
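The message printed above is pandas' SettingWithCopyWarning: `df_train_filtered` was created by selecting columns from `df_train`, so pandas cannot tell whether the new columns are meant for the selection or for the original frame. One common way to avoid it, sketched here on a made-up frame, is to take an explicit `.copy()` when the selection is created.

```python
import pandas as pd

# Made-up frame standing in for df_train.
df = pd.DataFrame({
    "home_team_goal": [1, 1, 0],
    "away_team_goal": [1, 0, 1],
    "stage": [1, 1, 1],
})

# .copy() makes the selection an independent DataFrame, so adding a column
# to it is unambiguous and does not touch the original frame.
subset = df[["home_team_goal", "away_team_goal"]].copy()
subset["goal_difference"] = subset["home_team_goal"] - subset["away_team_goal"]

print(subset)
print("goal_difference" in df.columns)  # False: the original is unchanged
```

In this notebook the equivalent change would be to append `.copy()` to the line that creates `df_train_filtered`.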


### Preprocessing

#### For train:

In [None]:
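# Drop the goal columns and goal_difference: each of them reveals match_result
# directly, so keeping any of them would leak the target into the features
df_train_filtered = df_train_filtered.drop(columns=["home_team_goal", "away_team_goal", "goal_difference"])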
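The plan at the top lists encoding categorical variables and scaling numerical features, while the cells in this section only encode the target. The following is a minimal sketch of how both steps could be combined with scikit-learn's `ColumnTransformer`, assuming the team IDs are treated as categories. The toy data, the choice of Logistic Regression and the names `categorical`, `numerical`, `preprocess` and `model` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the match data; the values are made up for illustration.
toy = pd.DataFrame({
    "home_team_id": [10260, 9825, 8472, 8654, 10260, 9825],
    "away_team_id": [9825, 8472, 8654, 10260, 8472, 8654],
    "possession_difference": [10.0, 32.0, -8.0, 4.0, 2.0, -6.0],
    "corner_difference": [0, 2, -7, -4, 1, -2],
})
labels = ["Draw", "Win", "Lose", "Win", "Draw", "Lose"]

categorical = ["home_team_id", "away_team_id"]
numerical = ["possession_difference", "corner_difference"]

# One-hot encode the team IDs and standardise the numeric columns.
preprocess = ColumnTransformer([
    ("teams", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numerical),
])

# Chaining the steps means the encoder and scaler are fitted on training data only.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(toy, labels)
predictions = model.predict(toy)
print(predictions)
```

Tree-based models such as the Random Forest used later do not need scaled inputs, so this step matters mainly if a linear model like Logistic Regression is tried.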

In [None]:
# Convert Categorical Labels
label_encoder = LabelEncoder()
df_train_filtered["match_result"] = label_encoder.fit_transform(df_train_filtered["match_result"])

`LabelEncoder` assigns codes in alphabetical order of the labels, so this results in:
#### Draw → 0
#### Lose → 1
#### Win → 2
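Because `LabelEncoder` sorts the class labels before numbering them, it is safer to read the mapping from the fitted encoder than to assume it. A small sketch on a made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["Win", "Draw", "Lose", "Win"])

# classes_ holds the labels in sorted order; the position is the assigned code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'Draw': 0, 'Lose': 1, 'Win': 2}

# inverse_transform turns predicted codes back into readable labels.
print(list(encoder.inverse_transform([0, 1, 2])))
```

The same lookup on the notebook's `label_encoder` gives the mapping that applies to `match_result`.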

#### For Test:

In [None]:
# Replace missing values with 0 so that no test row is dropped
df_test.fillna(0, inplace=True)
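For the model to score the test set, the test frame needs the same engineered columns as the training frame. One way to keep the two in sync is a small helper applied to both, sketched below. `add_difference_features` is a hypothetical name, and the sketch assumes the test file carries the same raw statistic columns as the training file, which this notebook does not show.

```python
import pandas as pd

def add_difference_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the home-versus-away difference columns added."""
    out = df.copy()
    out["possession_difference"] = out["possession_home_team"] - out["possession_away_team"]
    out["corner_difference"] = out["corner_home_team"] - out["corner_away_team"]
    out["crosses_difference"] = out["crosses_home_team"] - out["crosses_away_team"]
    # Sign follows the training cell: away red cards minus home red cards.
    out["red_card_difference"] = out["red_card_away_team"] - out["red_card_home_team"]
    return out

# Two rows copied from the training output shown earlier, used as a stand-in.
toy_test = pd.DataFrame({
    "possession_home_team": [55.0, 66.0],
    "possession_away_team": [45.0, 34.0],
    "corner_home_team": [6, 7],
    "corner_away_team": [6, 5],
    "crosses_home_team": [24, 21],
    "crosses_away_team": [9, 7],
    "red_card_home_team": [0, 0],
    "red_card_away_team": [0, 0],
})

toy_features = add_difference_features(toy_test)
print(toy_features[["possession_difference", "corner_difference",
                    "crosses_difference", "red_card_difference"]])
```

The helper returns a copy, so the input frame is left untouched and the same call can be reused for `df_train` and `df_test`.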

### Train a Classification Model

In [None]:
# Split dataset into features and target
X = df_train_filtered.drop(columns=["match_result"])
y = df_train_filtered["match_result"]  # target must come from the same frame as X

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train a Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model on the held-out validation split
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_val)

print("Accuracy:", accuracy_score(y_val, y_pred))
print("Classification Report:\n", classification_report(y_val, y_pred))
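The plan also lists a confusion matrix, which the cell above does not compute. A minimal sketch with scikit-learn's `confusion_matrix` follows; the true and predicted labels are made up for illustration and are not results from this model.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted results for eight matches.
y_true = ["Draw", "Lose", "Win", "Win", "Lose", "Draw", "Win", "Win"]
y_pred = ["Draw", "Win", "Win", "Win", "Lose", "Lose", "Win", "Draw"]

# Rows are the true classes, columns the predicted classes, both in this order.
class_order = ["Draw", "Lose", "Win"]
matrix = confusion_matrix(y_true, y_pred, labels=class_order)
print(matrix)
# [[1 1 0]
#  [0 1 1]
#  [1 0 3]]
```

The diagonal counts the correct predictions (5 of 8 here), and the off-diagonal cells show which outcomes get confused with each other. In the notebook, `confusion_matrix(y_val, y_pred)` would give the same view of the validation split.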
