# What is the Value of a Player's Contribution?

Who’s really the best in soccer? For a sport as popular as association football, I’m surprised there aren’t more numbers being used in its analysis. So I’ve gone ahead and done my best impersonation of Bill James in coming up with one way to evaluate players.

*Note: This won't run as I use a couple of bespoke files. You'll have to download the data and keep it in the same directory to run this notebook.*

Sabermetrics is baseball statistics taken to an extreme, leveraging computers to evaluate players and teams in often unconventional ways. You can see this in the Oakland A’s 2002 season, when they won 20 consecutive games, a record at the time. The team and its approach are immortalized in the book Moneyball by Michael Lewis.

Analysis of association football (or soccer, as I’ll call it) often leaves these stats out in favor of more traditional methods of evaluating players. I decided to crack open a recent Kaggle dataset and take a deeper look at evaluating these players.

First, we’ll start with goals.

Tracking the number of goals scored is an important starting point, but it doesn’t happen enough in a game to be statistically significant. A top ranking team in the English Premier League like Manchester City only scored 80 goals in the entire 2016–2017 season. That comes out to ~2.11 goals a game. Given the size of a squad, evaluating players solely on their goal production seems an inefficient way to judge them, especially if they don’t score often.
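
As a sanity check on that ~2.11 figure, here is a minimal sketch of the arithmetic. It assumes the standard 38-match Premier League season, which isn't stated above; the variable names are just for illustration.

```python
# Goals per match for the example above.
# Assumption: each Premier League team plays 38 matches in a season.
goals_scored = 80
matches_played = 38

goals_per_match = goals_scored / matches_played
print(f"{goals_per_match:.2f} goals per match")
```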

A better way to evaluate a player is to look at how often he attempts a goal and how often that particular shot-on-goal converts into a real goal. There should be enough data on attempts on goal to make this a meaningful number. It’s similar to the Corsi statistic used in evaluating hockey players.

So this is a two-fold notebook. The first part looks at how shots convert into goals (and goals into games). The second looks at which types of shots on goal resulted in goals, and what percentage of the time.

From this, I could evaluate players based on their ability to take those types of shot. My intuition was that players who take shots on goal that rarely convert into goals are to be valued less than players who take shots on goal that are likely to convert.

Note that this is far from a perfect analysis. For instance, defensive players are not evaluated here (the data I have does not record defensive maneuvers, so they are more difficult to evaluate). I also made a lot of assumptions in order to make this work, which I’ll note when I can as the analysis progresses.

Thus, don’t take it as an end-all-be-all of soccer offensive player analysis, but instead take it as a starting place to look at player contributions in soccer teams.

## First Steps

First we need to import all of our data. I'll assume you grabbed [the data from Kaggle](https://www.kaggle.com/secareanualin/football-events); you'll [need an account to do so](https://www.kaggle.com/?login=true).

Keep in mind that I turned the `dictionary.txt` into a file of dictionaries. I was creating multiple notebooks for analysis so I wanted something I could import multiple times. Therefore, they are not included in this. 

In [None]:
# Imports -- get these out of the way
# It's considered good Pythonic practice to put imports at top

from random import randint
import os

import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np

from IPython.display import display


#--> This is a python-converted set of dictionaries from 'dictionary.txt' that I made
from events_dict import *  

I like [Pandas](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for data manipulations done locally, so I'll be using that. Its DataFrames work similarly to R data frames.

In [None]:
# Get Data and local file dictionaries
events = pd.read_csv("events.csv")
games = pd.read_csv("ginf.csv")

Now that we have the data, we can start to pull out the information we want. 

When looking at a game, taking just the goal information is often insufficient. There are not enough goal events to make this a significant aspect to look at. 

A better approach, taking a cue from the [Corsi metric](https://hockeynumbers.blogspot.com/2007/11/corsi-numbers.html) used in hockey, is to look at shots on goal instead.

We'll coax those out first.

In [None]:
shots = events[events['event_type']==1 ]

One interesting number we'll want to tease out is how many goals the average winning team scores. My gut feeling says this will be a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution).

If we have a narrow enough distribution, we'll be able to say how many games a team will win with a particular player from the number of goals he creates. This is very similar to the [wins above replacement](https://en.wikipedia.org/wiki/Wins_Above_Replacement) statistic used in baseball, where players are evaluated by the number of wins they bring over the average player.

In [None]:
goals_winning_team = []

# go row by row (don't think a one liner would work)
for I,game in games.iterrows():
    if game['fthg'] > game['ftag']:
        goals_winning_team.append(game['fthg'])
    elif game['fthg'] < game['ftag']:
        goals_winning_team.append(game['ftag'])
        
avg_goals = np.mean(goals_winning_team)
std_goals = np.std(goals_winning_team)
        
print("Average Goals Per Game from Winning Team: %0.3f" % avg_goals)
print("Std. of Goals per game from winning team: %0.3f" % std_goals)
print("%% Deviation: +/- %0.3f%%" % ((std_goals/avg_goals)*100))

Ok, so the standard deviation on this is really high.

Evaluating a soccer player by the number of wins he contributes looks to be a crapshoot. There's too much variance in the number of goals a team needs to win to make this a meaningful statistic.
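
As an aside, the row-by-row loop above does have a vectorized equivalent. This is a minimal sketch on a made-up stand-in for `ginf.csv` (the `toy_games` frame and its scores are invented for illustration; only the `fthg` and `ftag` column names come from the dataset):

```python
import pandas as pd

# Toy stand-in for ginf.csv; only the two score columns are needed.
toy_games = pd.DataFrame({
    "fthg": [2, 1, 0, 3],   # home goals at full time
    "ftag": [1, 1, 2, 0],   # away goals at full time
})

# Drop draws, then take the larger score in each remaining row.
decided = toy_games[toy_games["fthg"] != toy_games["ftag"]]
winner_goals = decided[["fthg", "ftag"]].max(axis=1)

print(winner_goals.tolist())
# ddof=0 matches the default behaviour of np.std used in the loop version.
print(winner_goals.mean(), winner_goals.std(ddof=0))
```

On the full dataset this should give the same mean and standard deviation as the loop, just without iterating over rows.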

We'll plot the histogram to get a feel for how much like a Gaussian this looks.

In [None]:
# plot this as a histogram
# 'normed' was removed from newer matplotlib releases; 'density' is its replacement
plt.hist(goals_winning_team, bins=10, density=True)
plt.show()

The distribution is a skewed Gaussian. This makes sense: soccer is fundamentally a low-scoring game.
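
To put a number on "skewed" rather than eyeballing the histogram, one option is the skewness coefficient (the mean cubed z-score). Below is a minimal sketch on made-up goal counts, not the dataset; `population_skewness` is a helper written for this example. `scipy.stats.skew` computes the same quantity with its default settings.

```python
from statistics import mean, pstdev

def population_skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)

# Made-up winning-team goal counts, not taken from the dataset.
toy_winner_goals = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 6]

skew = population_skewness(toy_winner_goals)
print(f"skewness: {skew:.2f}")  # positive means a longer tail to the right
```

A value near zero would suggest a roughly symmetric distribution; a clearly positive value matches the long right tail you'd expect from a low-scoring game.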

Next, we'll look at the average number of shots that convert into goals. This gives a feel for the number of shots a player needs to take in order for a goal to result. Players who hustle and take a lot of shots should intuitively be evaluated more highly.

In [None]:
# Count shots and goals once, then reuse the totals
n_shots = shots.shape[0]
n_goals = shots[shots['is_goal']==True].shape[0]

print("Shots: %d" % n_shots)
print("Goals: %d" % n_goals)
print("Shots that convert: %0.3f%%" % (100 * n_goals / n_shots))

This gives a baseline of the shots that convert into goals. However, we can do better.

Not all shots on goal are the same. We can slice up the data more finely by looking at the kinds of shot taken. The dataset we have allows us to split up the data by the location the shot was taken, where in the net it was aimed, the body part used, the context (e.g. was it a penalty shot?), and if anybody was assisting. 

So let's try that. 

First, we'll have to remove the shots where location is not recorded. 

In [None]:
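# .copy() gives us an independent frame, so the in-place edits and new
# columns below don't trigger pandas' SettingWithCopyWarning
shots = shots[ shots['location']!=19. ].copy()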

Next, we'll use the dictionaries I translated into Python (from `dictionary.txt` in the Kaggle download) to replace the integer codes with readable labels.

You may be curious why I didn't do this earlier. The reason is that string comparisons take more computation than integer or float comparisons. So I tend to translate into strings towards the end and not the beginning. 

In [None]:
shots.replace({'location': location_dict, 
               'shot_place': shot_place_dict, 
               'bodypart': bodypart_dict, 
               'assist_method': assist_method_dict, 
               'situation': situation_dict}, inplace=True)

We'll create a new column that takes in the different strings. This will make for a unique key we can get all the shots on.

If you're curious why the new column is called 'uc', it's short for 'unique column' and it's easier to type out.

In [None]:
shots['uc'] = shots['location'] + ', ' + shots['shot_place'] + ', ' + \
    shots['bodypart'] + ', ' + shots['assist_method'] + ', ' + \
    shots['situation']

We also need a minimum number of occurrences for a particular shot type. After all, shots that rarely occur can't be evaluated on the same level as shots that occur often.

All low-occurrence shots will be valued at zero, solely to keep the printout clean. In practice, you would want to ignore them anyway.
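
The idea of a minimum-occurrence cutoff can be sketched on toy data like this. The shot-type labels and the `MIN_OCCURRENCES` value here are invented for illustration; the notebook itself works on the `uc` column with a cutoff of 100.

```python
from collections import Counter

# Made-up shot-type keys; the real keys come from the 'uc' column.
toy_shot_types = (
    ["box/foot"] * 5 + ["outside box/foot"] * 3 + ["box/header"] * 1
)

MIN_OCCURRENCES = 3  # hypothetical cutoff sized for the toy data

counts = Counter(toy_shot_types)

# Keep only shot types seen often enough to estimate a conversion rate.
frequent = {shot: n for shot, n in counts.items() if n >= MIN_OCCURRENCES}

print(frequent)
```

Anything that falls below the cutoff simply has too few samples for its conversion rate to mean much.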

We'll then plot the distribution of occurrences of unique shot types. I could put the shot type on the x-axis, but experience has told me that a cluttered visualization is worse than no visualization. Therefore, I'll drop it. The plot is mostly there to show the distribution of shot types that get taken.

For curiosity's sake, we'll print out the most common shot taken.

In [None]:
unique_combos = shots['uc'].value_counts().to_dict()
unique_combos_that_scored = shots[ shots['is_goal']==True ]['uc'].value_counts().to_dict()

print("Total Unique Combos: %d" % len(unique_combos))

# Filter out keys that did not score
scoring_keys = [k for k in unique_combos.keys() if k in unique_combos_that_scored.keys()]
unique_combos = {key:unique_combos[key] for key in scoring_keys}

print("Total Unique Combos that Scored: %d" % len(unique_combos))

# Keep only keys with more than 100 occurrences
unique_combos = {key:value for key,value in unique_combos.items() if value > 100}
print("Number of Unique Combos: %d" % len(unique_combos))

items = sorted(unique_combos.items(), key=lambda x: x[1], reverse=True)

# Plot the data
plt.bar(range(len(unique_combos)), [i[1] for i in items], align='center')
plt.show()

# Print out most frequent shot type
print("Most Common Shot Taken => %r: %d occurrences" % (items[0][0],items[0][1]))

Now that we have a unique shot column, we can start to look at conversion percentages. This will be how often that shot converts into a goal expressed as a percentage. 

We can treat this percentage as a fractional contribution toward a goal. That means a shot on goal that converts 60% of the time will be put down as 0.6 goals. We'll give ourselves two significant figures, since we only looked at shot types with at least 100 occurrences.
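
As a small worked example of that conversion, here is a sketch with made-up counts for two hypothetical shot types. Rounding to two decimal places stands in for the two significant figures mentioned above.

```python
# Made-up counts for two shot types (not from the dataset).
shots_taken = {"penalty": 200, "long range": 1000}
goals_scored = {"penalty": 150, "long range": 30}

# Conversion rate = goals / shots, rounded to two decimal places.
conversion = {
    shot: round(goals_scored[shot] / shots_taken[shot], 2)
    for shot in shots_taken
}

print(conversion)
```

Under these toy numbers, every penalty attempt would be booked as 0.75 goals and every long-range attempt as 0.03 goals, whether or not the individual shot went in.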

In [None]:
shots_per_uc = unique_combos
goals_per_uc = shots[ shots['is_goal']==True ]['uc'].value_counts().to_dict()

l = []

for k in shots_per_uc.keys():
    s = shots_per_uc[k]
    g = goals_per_uc[k]
    f = float(g/s)
    
    split_uc = k.split(", ")
    location = split_uc[0]
    shot_place = split_uc[1]
    bodypart = split_uc[2]
    assist_method = split_uc[3]
    situation = split_uc[4]
    
    l.append([location,
              shot_place,
              bodypart,
              assist_method,
              situation,
              s,g,f])


# While the 'uc' column makes typing out easier, display is better if
# we return to the original columnar display
df = pd.DataFrame(l,columns=['location',
                             'shot_place',
                             'bodypart',
                             'assist_method',
                             'situation',
                             'shots',
                             'goals',
                             'percentage_conversion'])

# Finally, we sort the dataframe by the highest conversion rate
df.sort_values(axis=0, by=['percentage_conversion'],
               ascending=False, inplace=True)


In [None]:
# Display the dataframe itself
display(df)

So how can we interpret this data? A few ideas come to mind.

One is that it can influence play. A coach may want to orient players to pursue higher-scoring opportunities (i.e. ones that will more likely convert into a goal).

Another is to evaluate players by their abilities to make these situations happen. Which leads us to the next section...


## Analyzing Players

Let's use this data to evaluate players. 

We'll assume that _only goals matter_ which will leave out defensive players. 

To start, we'll recreate the unique combo column like before and get the number of occurrences for each player. That means that if player X takes shot type Y a total of W times, we want W multiplied by the percentage of the time that shot type converts.
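
To make the tally concrete, here is a minimal sketch with an imaginary player. The shot-type labels and conversion rates are invented, not taken from the table above.

```python
# Hypothetical conversion rates per shot type (made-up values).
conversion_rate = {"penalty": 0.75, "header in box": 0.25, "long range": 0.0}

# Shot types one imaginary player attempted.
player_shots = ["penalty", "header in box", "header in box", "long range"]

# Each attempt is worth its shot type's conversion rate, goal or not.
player_value = sum(conversion_rate[shot] for shot in player_shots)

print(player_value)
```

Note that the long-range attempt adds nothing here, which mirrors how low-occurrence shot types are valued at zero in this analysis.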

This will make more sense as we go along. 

Note that I originally ran this as two separate sheets, which explains why we have to rebuild the uniqueness column.

In [None]:
# Rename the DF from before
events = df

# Rebuild the unique column up
events['key'] = events['location']+'/'+events['shot_place']+'/'+events['bodypart']+'/'+events['assist_method']+'/'+events['situation']

other_conversions = pd.Series(events['percentage_conversion'].values,index=events['key']).to_dict()

# Then we'll rank all the other events as zero as there's 
# insufficient information to make a judgement -- this works out in the calculations
shots['key'] = shots['location']+'/'+shots['shot_place']+'/'+shots['bodypart']+'/'+shots['assist_method']+'/'+shots['situation']
conversions = pd.Series([0 for i in range(shots['key'].shape[0])], index=shots['key']).to_dict()

# Then we combine them 
conversions.update(other_conversions)

Next, we'll get the value of each shot by looking up its key in the dictionary.

In [None]:
# We assign value to every contribution
shots['value'] = shots['key'].apply(lambda x: conversions[x])

# Get all players into a list -- players who are not mentioned (i.e. NaN) are dropped
players = shots['player'].dropna().unique().tolist()
player_contributions_per_game = {}
player_games_number = {}

Finally, we loop through each player and tally up each type of shot taken by its "goal worth", i.e. the percentage of times it converts into a goal.
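
For reference, the same per-game tally can also be expressed as a single groupby. This is only a sketch on a made-up stand-in for the `shots` table (`toy_shots` and its values are invented; `player`, `id_odsp`, and `value` are the column names used in this notebook), not a replacement for the explicit loop.

```python
import pandas as pd

# Toy stand-in for the shots table; 'id_odsp' is the game id and
# 'value' is the conversion rate already assigned to each shot.
toy_shots = pd.DataFrame({
    "player":  ["a", "a", "a", "b"],
    "id_odsp": ["g1", "g1", "g2", "g1"],
    "value":   [0.5, 0.25, 0.25, 0.1],
})

# Sum shot values and count distinct games for each player.
per_player = toy_shots.groupby("player").agg(
    total_value=("value", "sum"),
    games_played=("id_odsp", "nunique"),
)

# Normalize so the contribution is per game.
per_player["contribution_per_game"] = (
    per_player["total_value"] / per_player["games_played"]
)

print(per_player)
```

As in the loop version, "games played" here really means games in which the player took at least one recorded shot.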

In [None]:
for player in players:
    A = shots[ shots['player']==player ]
    games_played = A['id_odsp'].unique().tolist()

    contributions = 0
    for game in games_played:
        contributions += A[ A['id_odsp']==game ]['value'].sum()
        
    # We normalize the contribution such that it's _per-game_
    player_contributions_per_game[player] = float(contributions/len(games_played))
    player_games_number[player] = len(games_played)


Finally, we'll plot the value of each player. This will likely be a Pareto distribution, with the vast majority of players contributing very little in value, especially as defensive players will tend towards zero.

In [None]:
anonymous_contributions = list(player_contributions_per_game.values())
plt.hist(anonymous_contributions,bins='auto')
plt.show()

We'll filter out players who provide no value (e.g. defensive players) and plot that as a distribution. 

In [None]:
A = [x for x in list(player_contributions_per_game.values()) if x > 0.0]

plt.hist(A, bins='auto')
plt.show()

The Pareto distribution indicates to me that soccer is a game of finding superstars. This does not surprise me; most sports are a [winner-takes-all market](https://www.investopedia.com/terms/w/winner-takes-all-market.asp) with a few superstars dominating the game.

Next, we'll look at the baseline player to compare against, to see what kind of an outlier a player ends up being. Normally the baseline is the average, but that will probably be less useful here as outliers will skew the distribution.
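
To see why the mean is a shaky baseline for a distribution like this, consider a minimal sketch with made-up contributions: many near-zero players and a single star.

```python
from statistics import mean, median

# Made-up per-game contributions: many low values and one star player.
toy_contributions = [0.0, 0.0, 0.01, 0.02, 0.03, 0.05, 0.9]

print(f"mean:   {mean(toy_contributions):.3f}")
print(f"median: {median(toy_contributions):.3f}")
```

The single star drags the mean well above what a typical player in the list contributes, while the median stays with the bulk of the players.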

In [None]:
war = np.median(anonymous_contributions)
print("Median Player's Contribution: %f" % war)
print("Mean Player's Contribution: %f" % np.mean(anonymous_contributions))
print("Mean Player's Contribution (zeros dropped): %f" % np.mean(A))

Now for fun, let's look at the top players by contribution! 

We'll filter out people who have played fewer than 10 games, as they likely have insufficient data to properly judge.

In [None]:
# Let's look at the top players in terms of contributions 
B = [(v,k) for k,v in player_contributions_per_game.items()]
B = sorted(B, key=lambda x: x[0], reverse=True)
l = []

for i in B:
    
    if player_games_number[i[1]] < 10:
        continue

    l.append([i[1],i[0],player_games_number[i[1]]])
    #print("%r: %f, played %d games" % (i[1],i[0],player_games_number[i[1]]));
    
df = pd.DataFrame(l,
                 columns=['player','contribution_per_game','games_played']
                )


In [None]:
# Display the data
display(df)

This passes the sniff test. Cristiano Ronaldo [is considered by many to be the best ever](https://www.quora.com/Is-Cristiano-Ronaldo-the-best-soccer-player-of-all-time) so having him at the top makes a lot of sense. The other players are also of similar caliber. None of them are defensive players, which again makes a lot of sense. 

The one outlier is [Jorginho Frello](https://en.wikipedia.org/wiki/Jorginho_(footballer,_born_1991)), who played only 11 games in this dataset and so doesn't have enough data to be judged properly. I actually find it surprising he's the _only_ outlier here. We could raise the minimum number of games a player needs in order to appear, but 10 feels fair given the typical soccer league has ~40 games in a season. That means a player needs to appear in roughly a quarter of the games to be listed here, which feels fair (though admittedly that's a gut feeling).

## Final Thoughts

So there you have it: a full data analysis of the top offensive players and the best places to take shots on goal. 

It's neat if imperfect. A more thorough analysis would want to look at defensive players. If I had the data, I'd use possession time as the baseline metric, again taking inspiration from hockey statistics.

Also, it would be helpful to look at players' net shot conversions to see what types of players they are. Perhaps they convert more often on certain shots than the typical average. However, I'm not 100% convinced there's enough data in this set to support such an analysis. 

Thoughts? Questions? Feel free to shoot me an email at [vincent@saulys.me](mailto:vincent@saulys.me).