# Business Understanding

## Title: Machine Learning Powered Soccer Match Prediction 


![Soccer Image](./images/cropped_epl_image.webp)


**Authors**: 
1. Dennis Kobia
2. Jane Mwangi
3. Rose Kyalo
4. Brytone Omare
5. Ivy Ndunge
6. Wayne Korir

## 1. Background
The betting industry in Kenya has grown into a multi-billion-dollar sector. According to the Betting Control and Licensing Board of Kenya, the sector has expanded rapidly and contributes significantly to the nation's economy. The latest available data from the board indicates that the industry generated over Ksh 204 billion (approximately USD 2 billion) in revenue in the last fiscal year alone ([Betting Control and Licensing Board - Kenya](https://bclb.co.ke/)).

Within this burgeoning industry, soccer betting has emerged as the undisputed frontrunner, capturing the imagination of millions of Kenyan sports enthusiasts. A survey conducted by GeoPoll, a mobile-based market research platform, revealed that over 60% of mobile gamers in Kenya actively engage in soccer betting, making it the dominant betting category ([GeoPoll](https://www.geopoll.com/blog/the-rise-of-mobile-gaming-in-africa/)).

The allure of soccer betting lies in its dynamic and unpredictable nature, where punters navigate through a plethora of betting options. From predicting the winner of a match to anticipating the total goals scored, the process involves a meticulous analysis of team and player statistics, historical performances, and real-time match dynamics. Kenyan sports enthusiasts, fueled by their passion for the game, seek to translate their knowledge into profitable betting decisions ([Daily Nation](https://www.nation.co.ke/kenya/sports/-/1090/1090/-/h96psf/-/index.html)).

Recognizing the demand for more informed and accurate betting decisions, the intersection of technology and sports betting has witnessed a surge in predictive modeling. The Kenya Gazette, an official government publication, acknowledges the potential of predictive models in providing valuable insights to bettors, helping them make data-driven decisions to enhance their success rate ([Kenya Gazette](https://www.kenyagazette.co.ke/)).

This research project focuses on the development of a simplified soccer match outcome prediction system for the English Premier League. The envisioned product is a user-friendly system where a client inputs a match ID, and the system, powered by machine learning, provides a prediction for the match outcome. Leveraging advanced algorithms, the system aims to offer straightforward and reliable match predictions based on historical data and relevant match features.

Amid the remarkable growth in the Kenyan betting industry, there exists a research gap in the application of machine learning to provide simplified match predictions for soccer enthusiasts. This project aims to address this gap by introducing cutting-edge data science techniques into the realm of Kenyan soccer betting, offering users a practical and accessible tool to enhance their betting experience.

## 2. Problem Statement
The surge in popularity of sports betting, particularly in the vibrant betting industry of Kenya, has prompted an increasing demand for accurate and data-driven predictions, especially in the context of soccer matches. Despite the availability of vast amounts of football-related data, there is a notable gap in providing simplified and user-friendly prediction systems for Kenyan soccer enthusiasts.

### 2.1 Challenges
i. Existing machine learning predictive models often lack accessibility for the average user, requiring a level of expertise in data interpretation.

ii. Current betting prediction platforms may not use advanced machine learning techniques to offer precise and tailored predictions.

### 2.2 Opportunity
The opportunity lies in developing a simplified and user-friendly soccer match outcome prediction system for the English Premier League. This system aims to empower users with accurate match predictions, leveraging machine learning algorithms to process historical data and relevant match features.

## 3. Research Questions

### 3.1 General Research Questions
i. How can machine learning algorithms be effectively employed to predict soccer match outcomes?

ii. What are the key features and historical data points that significantly influence match predictions?

iii. How can a simplified prediction system be designed to cater to the needs of Kenyan soccer betting enthusiasts?

### 3.2 Exploratory Data Analysis (EDA) Research Questions
i. What are the distribution patterns of key performance metrics, such as team ranks, points, and goal differentials, across the English Premier League teams?

ii. Are there any correlations or trends between a team's historical performance metrics and its current rank in the league standings?

iii. How do teams' performance metrics vary when playing at home versus playing away?

iv. Can patterns or trends in match outcomes be identified from each team's results in its last five games?

v. What role does goal difference play in determining the outcome of matches, and how does it correlate with the teams' overall performance?
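
One way to make question iv concrete is a simple "form" score over a team's last five results. The sketch below is illustrative only: the `form_points` helper and the `"W"`/`"D"`/`"L"` encoding are assumptions, not part of the project code, and it uses the standard league scoring of three points for a win and one for a draw.

```python
# Hypothetical helper: score a team's recent form from its latest results.
# Assumes results are ordered oldest to newest and encoded as "W", "D" or "L".
POINTS = {"W": 3, "D": 1, "L": 0}


def form_points(results, window=5):
    """Return the league points earned over the most recent `window` results."""
    recent = results[-window:]
    return sum(POINTS[result] for result in recent)


example_results = ["L", "W", "W", "D", "L", "W", "D"]
print(form_points(example_results))  # scores only the last five: W, D, L, W, D
```

A score like this could then be compared between the home and away team of each match during EDA.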

### 3.3 Additional Machine Learning Model Assessment Questions
i. How sensitive is the prediction model to changes in input features, and which features contribute most significantly to prediction outcomes?

ii. What impact do changes in the training dataset size and composition have on the model's predictive performance?

## 4. Project Objectives

### 4.1 General Objective
**4.1.1** To Develop a Soccer Match Outcome Prediction System: Create a robust and user-friendly machine learning-based system capable of predicting the outcome of English Premier League soccer matches, catering specifically to the preferences and needs of Kenyan soccer betting enthusiasts.

### 4.2 Specific Objectives
**4.2.1** To Implement Advanced Machine Learning Algorithms for Soccer Match Prediction: Integrate sophisticated machine learning algorithms, such as ensemble methods or deep learning models, to enhance the accuracy and predictive power of the soccer match outcome prediction system.

**4.2.2** To Conduct Feature Selection for Enhanced Predictive Insights: Conduct a comprehensive analysis to identify and select key features and historical data points that significantly influence soccer match outcomes, ensuring the inclusion of the most impactful variables in the prediction model.

**4.2.3** To Iteratively Optimize the Prediction Model for Accuracy: Implement an iterative optimization process to continuously refine and improve the prediction model, incorporating feedback and adjusting parameters to achieve the highest possible accuracy in predicting English Premier League soccer match outcomes.

**4.2.4** To Validate Predictions Against Historical Match Data: Validate the prediction system by comparing its predictions against actual outcomes using a historical dataset of English Premier League matches, establishing the reliability and effectiveness of the developed system.

**4.2.5** To Deploy the Optimized Model Behind a Simple, User-Friendly Interface: Deploy the model with Flask or Streamlit and add a simple, intuitive interface for the prediction system, optimizing accessibility for users with varying technical backgrounds and ensuring a seamless experience in obtaining match predictions.
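
For the validation objective, one common approach is to hold out the most recent season(s) so the model is always tested on matches played after its training data. The sketch below is a minimal illustration under stated assumptions: `split_by_season` is a hypothetical helper, and each match is assumed to be a dictionary with a `season` value in the `YYYY/YYYY` format used by the matches data.

```python
def split_by_season(matches, n_test_seasons=1):
    """Hold out the latest `n_test_seasons` seasons as the test set."""
    # "YYYY/YYYY" strings sort chronologically when sorted as text.
    seasons = sorted({match["season"] for match in matches})
    test_seasons = set(seasons[-n_test_seasons:])
    train = [m for m in matches if m["season"] not in test_seasons]
    test = [m for m in matches if m["season"] in test_seasons]
    return train, test


# Illustrative records only; real rows carry many more fields.
matches = [
    {"id": 1, "season": "2021/2022"},
    {"id": 2, "season": "2022/2023"},
    {"id": 3, "season": "2023/2024"},
    {"id": 4, "season": "2023/2024"},
]
train, test = split_by_season(matches)
print(len(train), len(test))
```

Splitting by time rather than at random avoids training on matches that were played after the ones being predicted.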

## 4.3 Project Scope

### 4.3.1 In Scope
- For this project, we will only be predicting the following markets:
  1. Home win, draw, or away win
  2. Total number of goals scored
  3. Bookings, specifically: Number of yellow cards and red cards
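
The three in-scope markets translate into prediction targets roughly as follows. This is a sketch under assumptions: the function names and the argument names (`home_goals`, `yellow_cards`, and so on) are hypothetical and do not refer to specific columns of the dataset.

```python
def match_outcome(home_goals, away_goals):
    """Label a finished match as a home win, draw, or away win."""
    if home_goals > away_goals:
        return "home_win"
    if home_goals < away_goals:
        return "away_win"
    return "draw"


def market_targets(home_goals, away_goals, yellow_cards, red_cards):
    """Collect one target value per in-scope market for a single match."""
    return {
        "outcome": match_outcome(home_goals, away_goals),
        "total_goals": home_goals + away_goals,
        "yellow_cards": yellow_cards,
        "red_cards": red_cards,
    }


print(market_targets(home_goals=2, away_goals=1, yellow_cards=3, red_cards=0))
```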

### 4.3.2 Out of Scope
- In-Depth Player Analysis: Detailed analysis of individual player performance, including statistics such as goals scored by specific players, player injuries, etc.

- Live Match Prediction: Predicting match outcomes in real-time during live matches.

- Betting Odds Calculation: Calculating or providing betting odds for predicted outcomes.

- Player Transfer and Team Management: Analyzing or predicting player transfers, team management decisions, or other off-field aspects.

- Predictions for Other Football Leagues: Extending predictions to other football leagues beyond the English Premier League.


## 5. Success Criteria

### 5.1 Model Performance
#### 5.1.1 Correct Outcome Prediction (Win, Lose, Draw)
- Achieve a baseline prediction accuracy of at least 70% for determining the correct outcome (win, lose, draw) of English Premier League soccer matches.

#### 5.1.2 Consistency
- Demonstrate consistency in predictions across multiple test datasets, indicated by a low variance in prediction accuracy.
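
Criteria 5.1.1 and 5.1.2 can be checked together by computing accuracy on several test sets and then looking at the spread of those scores. The sketch below uses made-up labels purely for illustration; `accuracy` is a hypothetical helper and `"H"`, `"D"`, `"A"` stand for home win, draw, and away win.

```python
from statistics import mean, pstdev


def accuracy(y_true, y_pred):
    """Fraction of matches whose predicted outcome equals the actual outcome."""
    correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative (actual, predicted) label pairs for three separate test sets.
test_sets = [
    (["H", "D", "A", "H"], ["H", "D", "H", "H"]),
    (["A", "A", "H", "D"], ["A", "H", "H", "D"]),
    (["H", "H", "D", "A"], ["H", "H", "D", "A"]),
]

scores = [accuracy(actual, predicted) for actual, predicted in test_sets]
print("per-set accuracy:", scores)
print("mean accuracy:", mean(scores))
print("standard deviation:", pstdev(scores))
```

A small standard deviation relative to the mean would support the consistency criterion; what counts as "low" is a threshold the project would still need to set.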

#### 5.1.3 Correct Outcome Prediction (Over and Under Markets - Total Goals)
- Achieve a prediction accuracy of at least 75% for markets related to the total number of goals scored in English Premier League soccer matches.
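
Over/under markets compare the total goals in a match against a half-goal line such as 2.5, so a match is always either over or under. A minimal sketch, assuming a hypothetical `over_under_label` helper:

```python
def over_under_label(home_goals, away_goals, line=2.5):
    """Return "over" if total goals exceed the line, otherwise "under"."""
    total_goals = home_goals + away_goals
    return "over" if total_goals > line else "under"


# A 2-1 result has three goals, which is over the 2.5 line.
print(over_under_label(2, 1))
# A 1-1 result has two goals, which is under the 2.5 line.
print(over_under_label(1, 1))
```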

#### 5.1.4 Benchmark Comparison
- Outperform or match the performance of existing benchmark models or prediction systems in the context of English Premier League match predictions. 
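
The benchmark is not named here, so as an assumption, a naive reference point is a model that always predicts the most frequent outcome seen in training. Any trained model should at least beat this. The helper name and the sample labels below are illustrative only.

```python
from collections import Counter


def majority_class_baseline(train_outcomes, test_outcomes):
    """Always predict the most common training outcome; return it with its test accuracy."""
    most_common_outcome, _ = Counter(train_outcomes).most_common(1)[0]
    hits = sum(outcome == most_common_outcome for outcome in test_outcomes)
    return most_common_outcome, hits / len(test_outcomes)


# "H" = home win, "D" = draw, "A" = away win (made-up labels).
train_outcomes = ["H", "H", "D", "A", "H", "A"]
test_outcomes = ["H", "A", "H", "D"]
predicted, baseline_accuracy = majority_class_baseline(train_outcomes, test_outcomes)
print(predicted, baseline_accuracy)
```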

## 6. Data Understanding

#### 6.1 Data Source
The primary data source for this project will be the Football Data API ([FootyStats API](https://footystats.org/api/)), which provides comprehensive information on English Premier League teams, matches, and performance metrics.

#### 6.2 Data Points and Descriptions
| Data Point                  | Description                                                  |
|-----------------------------|--------------------------------------------------------------|
| Team Rank                   | The current rank of the team within the English Premier League standings.          |
| Team Name and Crest          | The official name and emblem of the team for identification.                       |
| Played Matches              | The total number of matches played by the team in the current season.              |
| Wins                        | The total number of matches won by the team.                                    |
| Losses                      | The total number of matches lost by the team.                                   |
| Draws                       | The total number of matches drawn by the team.                                  |
| Points                      | The total points earned by the team (three per win, one per draw).                |
| Last Five Games Results     | The outcomes (win, lose, draw) of the team's last five matches.                  |
| Goal Difference             | The numerical difference between goals scored and goals conceded.                 |
| Differential                | A calculated metric representing the difference between wins and losses.         |
| Goals For                   | The total number of goals scored by the team.                                   |
| Goals Against               | The total number of goals conceded by the team.                               |
| Win Percentage              | The percentage of matches won by the team.                                    |
| Won in Group                | The number of matches won by the team when playing in a group.                  |
| Lost in Group               | The number of matches lost by the team when playing in a group.                  |
| Win Percentage in Groups    | The percentage of matches won by the team when playing in a group.             |
| Won at Home                 | The number of matches won by the team when playing at their home stadium.      |
| Won Away                    | The number of matches won by the team when playing away.                        |
| Lost at Home                | The number of matches lost by the team when playing at their home stadium.     |
| Lost Away                   | The number of matches lost by the team when playing away.                       |
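
Several of the data points above are simple functions of the others, which is useful for sanity-checking retrieved values. The sketch below assumes the standard league scoring of three points per win and one per draw; `derive_team_metrics` and its argument names are hypothetical.

```python
def derive_team_metrics(wins, draws, losses, goals_for, goals_against):
    """Recompute the derived standings metrics from a team's basic counts."""
    played = wins + draws + losses
    return {
        "played": played,
        "points": 3 * wins + draws,
        "goal_difference": goals_for - goals_against,
        "differential": wins - losses,
        # Guard against division by zero before a team's first match.
        "win_percentage": 100 * wins / played if played else 0.0,
    }


# Illustrative season totals, not real data.
print(derive_team_metrics(wins=20, draws=8, losses=10, goals_for=65, goals_against=40))
```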

#### 6.3 Data Retrieval Process
1. Use the FootyStats API to retrieve real-time and historical data for English Premier League teams.
2. Implement requests to obtain specific metrics outlined above for each team, considering relevant time frames for historical performance.



In [1]:
# Import the necessary libraries
import requests  # Library for making HTTP requests
import pandas as pd  # Library for data manipulation and analysis


In [2]:
# Define the API key variable to hold the access key for the Footystats API
api_key = "*************************************************" 
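
An alternative to writing the key into the notebook is to read it from an environment variable, which keeps it out of version control. This is a sketch only: `load_api_key` and the variable name `FOOTYSTATS_API_KEY` are assumptions, not something the API or this project defines.

```python
import os


def load_api_key(env_var="FOOTYSTATS_API_KEY"):
    """Read the API key from an environment variable (hypothetical name)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, once the variable has been set in the shell:
# api_key = load_api_key()
```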

In [3]:
# Define the URL for the Football Data API's league list endpoint
url = "https://api.football-data-api.com/league-list"

# Make an HTTP GET request, passing the API key as a query parameter
response = requests.get(url, params={"key": api_key}, timeout=30)

# Stop early if the server returned an HTTP error status (4xx or 5xx)
response.raise_for_status()


In [4]:
# Convert the response content to JSON format using the .json() method
data = response.json()


In [5]:
# Retrieve the keys of the 'data' dictionary to inspect its structure
data.keys()


dict_keys(['success', 'pager', 'metadata', 'data', 'message'])

In [6]:
# Extract the 'country' value from each league dictionary in the 'data' list
# (the loop variable is named 'league' so it does not shadow the outer 'data')
countries = [league.get('country') for league in data['data']]

# Print the first 5 elements of the 'countries' list
print(countries[:5])


['USA', 'Scotland', 'Germany', 'Europe', 'Malaysia']


In [8]:
# Find the index of the value "England" in the 'countries' list
england_index = countries.index("England")

england_index

5
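
Note that `list.index` returns only the first match. If the API lists more than one English competition, collecting every matching position makes it possible to check which entry is the Premier League. In the sketch below, the first six countries mirror the output above; the remaining entries are made up for illustration.

```python
# First six values follow the notebook output; the last two are illustrative.
countries = ["USA", "Scotland", "Germany", "Europe", "Malaysia", "England", "Spain", "England"]

# Collect the position of every entry whose country is England.
england_indices = [index for index, country in enumerate(countries) if country == "England"]
print(england_indices)
```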

In [9]:
# Access the seasons of the league located at 'england_index'
seasons_info = data["data"][england_index]["season"]

# Print the 'seasons_info' variable
seasons_info


[{'id': 9, 'year': 20162017},
 {'id': 10, 'year': 20152016},
 {'id': 11, 'year': 20142015},
 {'id': 12, 'year': 20132014},
 {'id': 161, 'year': 20172018},
 {'id': 246, 'year': 20122013},
 {'id': 1625, 'year': 20182019},
 {'id': 2012, 'year': 20192020},
 {'id': 3119, 'year': 20112012},
 {'id': 3121, 'year': 20102011},
 {'id': 3125, 'year': 20092010},
 {'id': 3131, 'year': 20082009},
 {'id': 3137, 'year': 20072008},
 {'id': 4759, 'year': 20202021},
 {'id': 6135, 'year': 20212022},
 {'id': 7704, 'year': 20222023},
 {'id': 9660, 'year': 20232024}]
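
The seasons are not returned in chronological order, and each `year` packs both calendar years into one integer. A minimal sketch for ordering and labelling them, using three entries copied from the output above (`season_label` is a hypothetical helper):

```python
# A subset of the seasons shown above.
seasons_info = [
    {"id": 9, "year": 20162017},
    {"id": 3137, "year": 20072008},
    {"id": 9660, "year": 20232024},
]


def season_label(year):
    """Turn an integer such as 20162017 into the label "2016/2017"."""
    text = str(year)
    return f"{text[:4]}/{text[4:]}"


# Sorting on the packed integer orders seasons from oldest to newest.
ordered_seasons = sorted(seasons_info, key=lambda season: season["year"])
for season in ordered_seasons:
    print(season["id"], season_label(season["year"]))
```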

In [10]:
# Extract the 'id' values from each dictionary in the 'seasons_info' list using list comprehension
season_ids = [season.get("id") for season in seasons_info]

# Print the 'season_ids' list
print(season_ids)


[9, 10, 11, 12, 161, 246, 1625, 2012, 3119, 3121, 3125, 3131, 3137, 4759, 6135, 7704, 9660]


In [11]:
def get_league_matches(api_key, season_id):
    """
    Retrieve league matches data for a specific season.

    Parameters:
    - api_key (str): API key for accessing the Football Data API.
    - season_id (int): ID of the specific season.

    Returns:
    - response: HTTP response object containing the retrieved data.
    """
    # Endpoint for league matches; the key and season ID go in the query string
    url = "https://api.football-data-api.com/league-matches"
    params = {"key": api_key, "season_id": season_id}

    # Make an HTTP GET request; the timeout prevents an indefinite hang
    response = requests.get(url, params=params, timeout=30)

    # Return the response object
    return response


In [12]:
import pandas as pd

def create_dataframe(api_key, season_ids):
    """
    Create a Pandas DataFrame by fetching and concatenating league matches data for multiple seasons.

    Parameters:
    - api_key (str): API key for accessing the Football Data API.
    - season_ids (list): List of season IDs for which data will be fetched.

    Returns:
    - concatenated_df: Pandas DataFrame containing concatenated league matches data for the specified seasons.
    """
    # Initialize an empty list to store individual DataFrames for each season
    list_of_dfs = []

    # Iterate through each season ID in the provided list
    for season_id in season_ids:
        try:
            # Fetch league matches data for the current season
            response = get_league_matches(api_key, season_id)
            response.raise_for_status()
            data = response.json()

            # Create a DataFrame from the fetched data
            df = pd.DataFrame(data["data"])

            # Append the DataFrame to the list
            list_of_dfs.append(df)
        except (requests.RequestException, KeyError, ValueError) as err:
            # Name the failing season and re-raise instead of exiting the interpreter
            raise RuntimeError(f"Failed to fetch matches for season {season_id}") from err

    # Concatenate the DataFrames in the list to create a single DataFrame
    concatenated_df = pd.concat(list_of_dfs, ignore_index=True)
    
    # Return the concatenated DataFrame
    return concatenated_df


In [13]:
# Create a DataFrame containing league matches data for multiple seasons using the create_dataframe function
matches_df = create_dataframe(api_key, season_ids)

# Display the first few rows of the DataFrame to inspect the data
matches_df.head()


| Unnamed: 0 | id | homeID | awayID | season | status | roundID | game_week | revised_game_week | homeGoals | awayGoals | ... | matches_completed_minimum | over05 | over15 | over25 | over35 | over45 | over55 | btts | homeGoals_timings | awayGoals_timings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2155 | 150 | 108 | 2016/2017 | complete | 19 | 1 | -1 | [45+1, 57] | [47] | ... | 38 | True | True | True | False | False | False | True | [45+1, 57] | [47] |
| 1 | 2156 | 145 | 154 | 2016/2017 | complete | 19 | 1 | -1 | [] | [82] | ... | 38 | True | False | False | False | False | False | False | [] | [82] |
| 2 | 2157 | 143 | 142 | 2016/2017 | complete | 19 | 1 | -1 | [] | [74] | ... | 38 | True | False | False | False | False | False | False | [] | [74] |
| 3 | 2158 | 144 | 92 | 2016/2017 | complete | 19 | 1 | -1 | [5] | [59] | ... | 38 | True | True | False | False | False | False | True | [5] | [59] |
| 4 | 2159 | 147 | 141 | 2016/2017 | complete | 19 | 1 | -1 | [11] | [67] | ... | 38 | True | True | False | False | False | False | True | [11] | [67] |


In [14]:
# Inspect the resulting dataframe
matches_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6460 entries, 0 to 6459
Columns: 215 entries, id to awayGoals_timings
dtypes: bool(7), float64(80), int64(112), object(16)
memory usage: 10.3+ MB


In [15]:
# Inspect the shape of the DataFrame
matches_df.shape

(6460, 215)