# FifaSkill

A Probabilistic Programming Package for European Soccer Analysis by Vinay Ramesh (vrr2112) and Alek Anchowski (aja2173)


In [8]:
from fifaskill.data_processing import process
from fifaskill.models import regression
import numpy as np
import pandas as pd
import sqlite3

from IPython.display import display

## Introduction

TrueSkill, developed by Microsoft Research, is a general skill-based ranking system built on a prior distribution over each player's skill, described by a mean and a variance (Herbrich, Minka, Graepel). The original system models a player's skill and the variance in their performance, and uses an Expectation Propagation algorithm to update the skill estimate and its variance after every new game the player participates in.

In our project fifaskill, we attempt to extend the TrueSkill method to FIFA games, with the aim of modeling a team's overall skill and predicting future matches. We adapt the TrueSkill approach rather than reproduce it because Expectation Propagation and message passing algorithms are not well supported in Edward. We experiment with modeling offense and defense skills separately, as well as home team advantage.

## Data

We will start in the middle of Box's loop for an overview of the data. We use the [Kaggle Dataset for European Soccer Matches](https://www.kaggle.com/hugomathien/soccer), which consists of matches from 11 leagues over 8 seasons. We are interested in predicting the outcomes of future matches from this data. Because Champions League data is not present in the dataset, we evaluate leagues one at a time rather than jointly as we had originally hoped. The query we use focuses on the English Premier League but can be adapted to any other domestic league by changing the country in the query.

A [Kaggle Notebook by Ashirwad](https://www.kaggle.com/ashirwadsangwan/competitiveness-in-the-european-leagues/notebook) noted that, within this dataset, parity was highest in the English Premier League, which makes this league the most challenging to predict. We reproduce the image below from his notebook analysis to illustrate this.

![alt text](https://github.com/vinoo999/trueskill_augmented/raw/master/final_project/images/competitiveness.png)


In [5]:
# Now we load our EPL Data
db = '../database.sqlite'
query_fname = '../db_queries/detailed_match_query.sql'

conn = sqlite3.connect(db)

# Read the SQL query from disk and close the file once it has been consumed
with open(query_fname, 'r') as q_file:
    data = pd.read_sql(q_file.read(), conn)

data.head(2)

| | id | country_name | league_name | season | stage | date | home_team | away_team | home_team_goal | away_team_goal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1730 | England | England Premier League | 2008/2009 | 1 | 2008-08-16 00:00:00 | Arsenal | West Bromwich Albion | 1 | 0 |
| 1 | 1731 | England | England Premier League | 2008/2009 | 1 | 2008-08-16 00:00:00 | Sunderland | Liverpool | 0 | 1 |


Below we partition our data to get ready for training with our models.

In [6]:
train, test = process.partition_data(data, by_season=True)

display(train.head(3))
display(test.head(3))

| | id | country_name | league_name | season | stage | date | home_team | away_team | home_team_goal | away_team_goal |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1730 | England | England Premier League | 2008/2009 | 1 | 2008-08-16 00:00:00 | Arsenal | West Bromwich Albion | 1 | 0 |
| 258 | 1917 | England | England Premier League | 2008/2009 | 26 | 2009-02-23 00:00:00 | Hull City | Tottenham Hotspur | 1 | 2 |
| 257 | 1915 | England | England Premier League | 2008/2009 | 26 | 2009-02-22 00:00:00 | Fulham | West Bromwich Albion | 2 | 0 |


| | id | country_name | league_name | season | stage | date | home_team | away_team | home_team_goal | away_team_goal |
|---|---|---|---|---|---|---|---|---|---|---|
| 2912 | 4573 | England | England Premier League | 2015/2016 | 26 | 2016-02-13 00:00:00 | Crystal Palace | Watford | 1 | 2 |
| 2913 | 4574 | England | England Premier League | 2015/2016 | 26 | 2016-02-13 00:00:00 | Everton | West Bromwich Albion | 0 | 1 |
| 2914 | 4576 | England | England Premier League | 2015/2016 | 26 | 2016-02-13 00:00:00 | Norwich City | West Ham United | 2 | 2 |


## Model

### Basic Model from TrueSkill

To model a team's performance, we assume a Gaussian prior on the skill $s_i$ of team $i$:
$$
\begin{align*}
  s_i
  &\sim
  \text{Normal}\left(25, \left(\frac{25}{3}\right)^2\right)
\end{align*}
$$

When learning parameters, we expect the true skill $s_i^*$ to be reflected in the mean of the performance distribution above.

Given two teams' performances, $s_1$ and $s_2$, we then model the game's outcome $r$, which we define from the difference in goals scored by the two teams against one another: $r = \mathbf{1}[s_1 - s_2 > 0]$.

In the model, this corresponds to drawing each team's goals from a Poisson distribution whose rate is the performance of that team.

$$
\begin{align*}
    r &= \textbf{1}[\mathcal{P}(s^*_1) - \mathcal{P}(s^*_2) > 0]
\end{align*}
$$

To train the model, given $n$ teams, we obtain a matrix $R \in \mathbb{R}^{n \times n}$ from the database, where an entry $R[\text{team1}, \text{team2}]$ corresponds to the average difference between the goals scored by team1 against team2 and the goals scored by team2 against team1. $R$ is skew-symmetric, meaning $R^T = -R$, or more specifically $R[\text{team1}, \text{team2}] = -R[\text{team2}, \text{team1}]$.
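
A minimal sketch of how such a matrix could be assembled from match records of the form (home team, away team, home goals, away goals). The helper name `build_goal_diff_matrix`, the team labels, and the scorelines are illustrative assumptions and are not part of `fifaskill` or the dataset.

```python
import numpy as np

def build_goal_diff_matrix(matches, teams):
    """Average goal difference R[i, j] of team i against team j (hypothetical helper)."""
    idx = {t: k for k, t in enumerate(teams)}
    n = len(teams)
    total = np.zeros((n, n))   # summed goal differences per ordered pair
    count = np.zeros((n, n))   # number of meetings per pair
    for home, away, home_goals, away_goals in matches:
        i, j = idx[home], idx[away]
        total[i, j] += home_goals - away_goals
        total[j, i] += away_goals - home_goals
        count[i, j] += 1
        count[j, i] += 1
    # Average where the pair has met; pairs that never met stay at 0.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

# Toy fixtures, not taken from the dataset
teams = ["A", "B", "C"]
matches = [("A", "B", 1, 0), ("C", "B", 0, 1), ("B", "A", 2, 2)]
R = build_goal_diff_matrix(matches, teams)
```

Because every match contributes equal and opposite amounts to `R[i, j]` and `R[j, i]`, the result is skew-symmetric by construction.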

Our original implementation, also shown below, converged poorly. We therefore translate the model into a simple Bayesian linear regression.

Here, our input data $X$ is an $m\times n$ matrix, where there are $m$ matches and $n$ teams. Each match is represented as a two-hot vector whose entry is 1 if the team is home, -1 if the team is away, and 0 otherwise. Our weight matrix $W$, with prior $\mathcal{N}\left(25, \left(\frac{25}{3}\right)^2\right)$ on each entry, is an $n \times 1$ matrix denoting the skill of each team. Thus $XW$ outputs $Y$, an $m$-dimensional vector denoting the difference in skill levels of the two teams in each match. Our supervised approach uses the observed goal difference of each match as the target for $Y$.
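
The encoding can be sketched in NumPy as follows. This is only an illustration of the two-hot design matrix under assumed toy fixtures; it draws one sample of $W$ from the prior rather than running inference, and none of the names come from `fifaskill`.

```python
import numpy as np

teams = ["A", "B", "C"]
idx = {t: k for k, t in enumerate(teams)}
# (home, away) pairs; toy fixtures for illustration only
fixtures = [("A", "B"), ("C", "A"), ("B", "C")]

X = np.zeros((len(fixtures), len(teams)))
for row, (home, away) in enumerate(fixtures):
    X[row, idx[home]] = 1.0    # home team
    X[row, idx[away]] = -1.0   # away team

# One draw of the skill vector from the prior Normal(25, (25/3)^2)
rng = np.random.default_rng(0)
W = rng.normal(loc=25.0, scale=25.0 / 3.0, size=(len(teams), 1))

Y = X @ W  # shape (m, 1): home skill minus away skill for each match
```

Each row of `X` sums to zero, so adding a constant to every team's skill leaves `Y` unchanged; only skill differences are identified by the data.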

In [None]:
from fifaskill.models import regression
from fifaskill.models import basic
tsr = regression.TrueSkillRegressor(train)

### Offense Defense Separation

To extend the basic model, we separate each team's skill into two scores, offense $s_{io}$ and defense $s_{id}$, each with its own Gaussian prior. Each team also has a variable offense and defense performance based on the corresponding skill, similar to how skill and performance were related in the basic model. Because we are modeling a discrete variable, goals, we now use a Poisson distribution for the goals scored. Based on the offense and defense performance, we draw from a Poisson distribution twice to determine the number of goals scored and the number of goals allowed. We set the Gaussian prior to the standard normal in order to better match the scale of goals scored and allowed.

$$
\begin{align*}
s_{io} &\sim \mathcal{N}(0,1)\\
s_{id} &\sim \mathcal{N}(0,1)\\
s_i^* &\sim \mathcal{P}(s_{io}-s_{jd})\\
s_j^* &\sim \mathcal{P}(s_{jo}-s_{id})\\
r &= \mathbf{1}[s_i^* - s_j^* > 0]
\end{align*}
$$

We again reduce this to a Bayesian regression model. This is in effect the composition of two log-linear models. Theoretically, this creates a compound Poisson regression model which can indeed be modeled with a simple log-linear model. We will show both cases.
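
To make the generative story concrete, the sketch below simulates one pairing under an assumed log link, so that each Poisson rate is $\exp(s_{io} - s_{jd})$ and stays positive. The link function, the sampled skill values, and the variable names are assumptions for illustration; this is not the code inside `regression.LogLinearOffDef`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offense/defense skills for teams i and j, drawn from N(0, 1)
s_io, s_id, s_jo, s_jd = rng.normal(size=4)

# Assumed log link: exponentiating keeps each Poisson rate positive
rate_i = np.exp(s_io - s_jd)   # expected goals scored by team i
rate_j = np.exp(s_jo - s_id)   # expected goals scored by team j

n_sims = 10_000
goals_i = rng.poisson(rate_i, size=n_sims)
goals_j = rng.poisson(rate_j, size=n_sims)

r = (goals_i - goals_j > 0).astype(int)  # 1 when team i outscores team j
win_prob_i = r.mean()                    # Monte Carlo estimate of P(r = 1)
```

Averaging `r` over many simulated matches gives a simple estimate of how often team $i$ would beat team $j$ under these skills.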

Next we define the new model in Edward.

In [None]:
off_def = regression.LogLinearOffDef(train)
loglin = regression.LogLinear(train)

## Inference

In the definitions of the models above, our classes also run inference when they are constructed. We use a factorized variational distribution in all cases. Most notably, we found that the reparameterization gradient has a smaller variance than the plain score gradient, which can be seen in the final two plots. Our classes therefore replace all `KLqp` instances with `ReparameterizationKLqp` for inference in Edward.
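
The variance gap between the two estimators can be reproduced on a toy problem that is independent of our models. The sketch below estimates the gradient of $\mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2]$ with respect to $\mu$, whose exact value is $2\mu$; the objective and the value of $\mu$ are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
eps = rng.standard_normal(100_000)
z = mu + eps                      # samples from q(z) = Normal(mu, 1)

# Score-function estimator: f(z) * d/dmu log q(z), with f(z) = z^2
score_grads = z**2 * (z - mu)

# Reparameterization estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
reparam_grads = 2.0 * z

true_grad = 2.0 * mu
score_mean, reparam_mean = score_grads.mean(), reparam_grads.mean()
score_var, reparam_var = score_grads.var(), reparam_grads.var()
```

Both estimators are unbiased, so their means agree with `true_grad` up to sampling noise, but the per-sample variance of the score-function estimator is considerably larger on this example.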

In [None]:
from fifaskill.data_processing import visualization
visualization.plot_loss(tsr, off_def, loglin)

In this case `KLqp` defaults to minimizing the $\text{KL}(q\|p)$ divergence measure using the reparameterization gradient. For more details on inference, see the [$\text{KL}(q\|p)$ tutorial](/tutorials/klqp).

## Model Evaluation and Criticism

From an empirical standpoint, we begin with a review of some aggregate rankings for the data. As a reminder, we partitioned the data to remove the final season. 

Here we simulate the season and compare the ranking results for the four models.

Lastly, we use posterior predictive checks to assess how closely the models reproduce the underlying distribution of goals. The models do not appear to represent the true distribution well.
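
As a rough illustration of the ranking step, the sketch below turns a list of scorelines into a league table using the Premier League points rule (3 points for a win, 1 for a draw) with goal difference as the tie-breaker. The function name `league_table` and the toy results are hypothetical and are not taken from `fifaskill` or the dataset.

```python
from collections import defaultdict

def league_table(results):
    """Rank teams by points (3 for a win, 1 for a draw), then goal difference."""
    points = defaultdict(int)
    goal_diff = defaultdict(int)
    for home, away, home_goals, away_goals in results:
        goal_diff[home] += home_goals - away_goals
        goal_diff[away] += away_goals - home_goals
        if home_goals > away_goals:
            points[home] += 3
            points[away] += 0   # ensure the loser still appears in the table
        elif home_goals < away_goals:
            points[away] += 3
            points[home] += 0
        else:
            points[home] += 1
            points[away] += 1
    return sorted(points, key=lambda t: (points[t], goal_diff[t]), reverse=True)

# Toy results: (home, away, home goals, away goals)
results = [("A", "B", 2, 0), ("B", "C", 1, 1), ("C", "A", 0, 1)]
ranking = league_table(results)
```

The same function can be applied to scorelines simulated from each model and to the held-out season, so that the resulting orderings can be compared directly.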

To evaluate our model's predictions of the offense and defense scores for each team, we test them against reference values derived from our dataset. For offense, we average a team's chance creation scores across several seasons, and for defense we average the defense pressure and aggression scores. For future work, it would be useful to determine whether combining these particular scores actually corresponds to a pure offense or defense skill level; there may, for example, be a more complex relationship between pressure and aggression that corresponds to higher defense scores.

In [13]:
query_fname = '../db_queries/off_def.sql'

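with open(query_fname, 'r') as q_file:
    # Note: this rebinds `off_def`, which earlier named the LogLinearOffDef model
    off_def = pd.read_sql(q_file.read(), conn)

# Offense: mean of the pass, cross, and shoot columns
off_def['offense'] = off_def[['pass', 'cross', 'shoot']].mean(axis=1)
# Defense: mean of the pressure and aggression columns
off_def['defense'] = off_def[['pressure', 'aggression']].mean(axis=1)
off_def = off_def[['team_name', 'offense', 'defense']]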

In [14]:
off_def.head(5)

| | team_name | offense | defense |
|---|---|---|---|
| 0 | Arsenal | 42.5 | 48.5 |
| 1 | Aston Villa | 52.722222 | 42.75 |
| 2 | Birmingham City | 58.777778 | 48.75 |
| 3 | Blackburn Rovers | 51.666667 | 50.333333 |
| 4 | Blackpool | 58.833333 | 49.25 |


Next, we compare these scores to our predicted values.
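
Since the dataset scores and the learned skills live on different scales, one reasonable way to compare them is a rank correlation. The sketch below computes Spearman's coefficient by correlating ranks; the `observed` values are the offense scores from the table above, while the `predicted` values are placeholders and are not output from our models.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for inputs without ties."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Offense scores for the five teams in the table above
observed = np.array([42.5, 52.722222, 58.777778, 51.666667, 58.833333])
# Placeholder skill estimates, NOT real output from our models
predicted = np.array([0.3, -0.1, 0.4, 0.0, 0.5])

rho = spearman(observed, predicted)
```

A value of `rho` near 1 would indicate that the model orders teams the same way as the dataset attributes, regardless of the scale of either score.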

### Criticism / Model Revision

As explained earlier, our original conception of the model changed considerably because it became trapped in a local optimum, which led us to adopt the regression models described above.

In [15]:
# Next Steps