# Bayesian Data Analysis - Premier League Project

## Table of Contents
<ul>
    <li>Imports</li>
    <li>Background</li>
    <li>Data</li>
    <li>Model 1: Hierarchical model</li>
    <li>Model 2: Pooled model</li>
    <li>Sources</li>
</ul>

## Imports

In [1]:
import numpy as np
import arviz as az
import stan
import nest_asyncio
import warnings

## Background

<b>Bayesian methods in sports</b><br>
Bayesian methods are increasingly used in sports analytics, driven by the growing volume of statistics gathered from sports and by the advantages of the Bayesian approach (Santos-Fernandez et al., 2019). For example, the Bayesian approach provides probabilistic estimates, lets us add new information to a model, and produces predictions that take uncertainty into account.

<b>Premier League</b><br>
As I am a big football fan, I decided to do this project on the English Premier League (EPL). In the summer of 2021, a big English football club, Manchester United (ManU), decided to re-sign Ronaldo from Juventus for a transfer fee of about £12.85 million. On top of this, Ronaldo's yearly salary is estimated to be about $125 million (Settimi, 2021), at least in this project.

Despite being a Tottenham Hotspur fan, in this project I am using Bayesian methods to do decision analysis for Manchester United (obviously not <I>really</I> :D).

<b>This project</b><br>
In this project we assess whether Manchester United should re-sign Ronaldo in the summer of 2021. As football nowadays is mostly about money and business, we base our decision only on money. To model the amount of money Ronaldo brings to ManU, we use the number of goals Ronaldo will score in a season. However, Ronaldo has not played in the EPL for ages, and Serie A/LaLiga goals are arguably not comparable with EPL goals. We therefore cannot use, for example, the average of Ronaldo's previous season totals as a point estimate for his EPL goals, nor fit a Bayesian model directly to that data. Instead, we model Ronaldo's expected goal scoring with two Bayesian models that also use data from the EPL: a hierarchical model and a pooled model. The models are explained in detail in the model sections of this notebook.

<b>IMPORTANT</b><br>
This project is only an exercise in Bayesian data analysis, and the model presented here is not intended to work in the real world.

## Data

As Ronaldo played 12 seasons (2009-2010 through 2020-2021) outside the EPL, only a small amount of goal-scoring data is available. But the beauty of Bayesian methods is that we can work with a small amount of data and still get a distribution for our predictions.<br>

In this project, for simplicity, we only use goals scored in the EPL, LaLiga or Serie A, and therefore ignore goals scored in, for example, the Champions League. For the Manchester United forwards, we use the goals of the top 3 goal-scoring forwards of each season, as no single player has played for Manchester United during the whole time Ronaldo was away. This can be justified by the fact that we are not interested in these individual players; their role in the model is to bring the ManU way of playing into it. The first column of the file manu.txt corresponds to the season's top scorer (forward or attacking midfielder), and the other two columns to the 2nd and 3rd top scorers.

Ronaldo's goal-scoring data is from ESPN.com and was collected manually, as there is no universal free database for football statistics. The ManU goals are from the Premier League's official website, premierleague.com, and were also collected manually. We use data from season 2009-2010 through season 2020-2021, as these are the seasons Ronaldo played in LaLiga and Serie A.

In [9]:
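# DATA FOR RONALDO
# One integer per line: Ronaldo's league goals for one season.
ronaldo_goals = []
with open("./data/cr7.txt") as file:
    for line in file:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline
            ronaldo_goals.append(int(line))
# Stored under its own name so the ManU cell does not overwrite it.
ronaldo_goals = np.array(ronaldo_goals)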

In [12]:
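# DATA FOR MANU FORWARDS
# One row per season: goals of the top, 2nd and 3rd scorer, comma-separated.
manu_goals = []
with open("./data/manu.txt") as file:
    for line in file:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline
            manu_goals.append([int(value) for value in line.split(",")])
# Stored under its own name so it does not overwrite Ronaldo's data.
manu_goals = np.array(manu_goals)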
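
As the ManU file is collected by hand, it may be worth checking its layout when loading it. The sketch below is one possible way to do that with `np.loadtxt`. The helper name `load_manu_goals` and the sample values are hypothetical placeholders, not part of the original notebook or the real data file. It only assumes the layout described above: three comma-separated columns per season, ordered from the top scorer to the 3rd scorer.

```python
import io
import numpy as np


def load_manu_goals(source):
    """Load the season-by-scorer goal table from a path or file-like object.

    Hypothetical helper: checks the layout described for manu.txt.
    """
    goals = np.loadtxt(source, delimiter=",", dtype=int, ndmin=2)
    if goals.shape[1] != 3:
        raise ValueError(f"expected 3 columns, got {goals.shape[1]}")
    if np.any(goals < 0):
        raise ValueError("goal counts must be non-negative")
    # Columns are ordered top scorer, 2nd, 3rd, so each row must be non-increasing.
    if np.any(np.diff(goals, axis=1) > 0):
        raise ValueError("columns are not ordered from top scorer to 3rd scorer")
    return goals


# HYPOTHETICAL sample in the manu.txt layout (placeholder values, not the real file).
sample = io.StringIO("20,12,9\n18,11,7\n25,10,10\n")
manu_sample = load_manu_goals(sample)
print(manu_sample.shape)
```

With the real file, the call would be `load_manu_goals("./data/manu.txt")`, and the result should have one row per season.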

## Model 1 - Hierarchical Model

### Description

In the first model we model the number of goals Ronaldo is expected to score in a season using a hierarchical model that also uses the goals scored by other Manchester United forwards in previous seasons. As each team in the EPL has its own style of playing, we can argue that a certain number of goals typically results from that style. Therefore, we can assume that the number of goals Ronaldo scores in a ManU shirt is related to the number scored by other ManU forwards, which motivates a hierarchical model that takes advantage of the data we have on the other forwards.<br>

Now, we could take the easy way and base our predictions only on the ManU forward population (goals scored in each season): give weakly informative hyperpriors, sample hyperparameters for each player's prior, choose a suitable likelihood, and sample a predictive distribution for a new player joining the ManU team (Ronaldo). However, that prediction would not be very accurate, as Ronaldo is not some random player coming to the club but one of the best players in the world. How can we add this knowledge of Ronaldo's skills to our model while still keeping the information from the ManU forward population distribution? One possible way, which is used in this model, is to obtain the prior for Ronaldo's goal scoring from the ManU population distribution and to bring Ronaldo's skills into the model through data on his past goal scoring in LaLiga and Serie A.<br>
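
The idea of taking the prior from the ManU population and then updating it with Ronaldo's own record can be illustrated with a much simpler stand-in than a full Stan model. The sketch below is not the model of this notebook. It assumes Poisson-distributed season goals with Gamma-distributed player rates, fits the Gamma population prior to the ManU forwards by moment matching (an empirical-Bayes shortcut rather than full hierarchical inference), and then applies the conjugate Gamma-Poisson update with Ronaldo's seasons. All counts are hypothetical placeholders, and the function names are made up for illustration.

```python
import numpy as np

# HYPOTHETICAL placeholder counts, not the contents of manu.txt or cr7.txt.
manu_goals = np.array([[20, 12, 9], [18, 11, 7], [25, 10, 10], [22, 13, 6]])
ronaldo_goals = np.array([30, 35, 28, 32])


def population_prior(goals):
    """Moment-match a Gamma(alpha, beta) prior on a player's goals-per-season rate.

    Assumes goals ~ Poisson(rate) and rate ~ Gamma(alpha, beta), so that
    E[goals] = alpha / beta and Var[goals] = E[goals] + E[goals]**2 / alpha.
    """
    goals = np.asarray(goals, dtype=float).ravel()
    mean = goals.mean()
    # Overdispersion beyond Poisson noise; floored to keep alpha finite.
    excess = max(goals.var(ddof=1) - mean, 1e-6)
    alpha = mean**2 / excess
    beta = alpha / mean
    return alpha, beta


def update_with_player(alpha, beta, player_goals):
    """Conjugate Gamma-Poisson update with one player's season totals."""
    player_goals = np.asarray(player_goals)
    return alpha + player_goals.sum(), beta + player_goals.size


def predictive_draws(alpha, beta, size, rng):
    """Sample next-season goals: rate ~ Gamma(alpha, beta), goals ~ Poisson(rate)."""
    rates = rng.gamma(shape=alpha, scale=1.0 / beta, size=size)
    return rng.poisson(rates)


rng = np.random.default_rng(0)
alpha0, beta0 = population_prior(manu_goals)
alpha1, beta1 = update_with_player(alpha0, beta0, ronaldo_goals)
draws = predictive_draws(alpha1, beta1, size=4000, rng=rng)

print(f"population prior mean rate: {alpha0 / beta0:.1f}")
print(f"posterior mean rate for the new player: {alpha1 / beta1:.1f}")
print("90% predictive interval:", np.percentile(draws, [5, 95]))
```

In this simplified setting the posterior mean rate is a weighted average of the population mean and the player's own average, so it moves toward the player's record as more of his seasons are added.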

Therefore, as the hierarchical model we get:<br>

### Stan model

### Implementation

### Convergence diagnostics

### Posterior predictive checking

### Sensitivity analysis

## Model 2 - Pooled Model

## Sources

Santos-Fernandez, E., Wu, P. & Mengersen, K. L. (2019). Bayesian statistics meets sports: a comprehensive review. Journal of Quantitative Analysis in Sports. Vol. 15(4). pp. 289-312.

Settimi, C. (2021). The World’s Highest-Paid Soccer Players 2021: Manchester United’s Cristiano Ronaldo Reclaims Top Spot From PSG’s Lionel Messi. Forbes. 