---
title: "Building open football player transfer data"
description: "Collating player transfers to and from football clubs in major European leagues."
date: '2018-08-27'
slug: building-open-football-player-transfer-data
tags: ["football", "data-viz", "r"]
draft: no
---
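```{r setup, include=FALSE}
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, echo = FALSE, out.width = '100%', dev = 'png')
# load packages (inferred from the functions used later in this post)
library(readr)       # read_csv()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box(), %>%
# get data
epl_transfers <- read_csv(file="")
```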
Around this time last year I ran [a post on European football transfers](, taking in the 2017/18 season's summer window. I got a bit of an itch to refresh this work when the 2018/19 window hit. The aforementioned itch led to me getting in too deep and scraping all major European league transfers going back to the year 2000, naturally.
Here, I tell a short story about how this open data was built and showcase some visualisation pieces that made use of that effort.[^code] Hopefully this can encourage others to share whatever neat stuff they tap into.
## Building the data
I eventually settled on ~~scraping~~ using [the Guardian's Transfer Interactive]( to power my previous work. This source included transfer timestamps, which allowed for some intra-window time series stuff - it remains handily [hosted by Tom Worville in a public, flat-file format]( However, it isn't really set up for investigating historical trends as the Guardian has only run this interactive since 2017, as far as I can tell.
Enter [Transfermarkt](, a data goldmine of player transfers for a number of major European leagues (e.g. English Premier League, Spanish La Liga, Italian Serie A) and some other oddities (my personal favourite is [this list of father/son combos for national teams - glorious and appreciated]( Season-level stats like player transfers run back for quite a few seasons, joyously arranged in predictable HTML tables for bountiful scraping. I wrote a short scraping program[^scrape] to collect and clean up player transfers for these (and other) leagues, back to the 2000/01 season (N.B. the choice of starting season was entirely arbitrary).
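For a flavour of the approach, here is a minimal sketch of pulling predictable HTML tables into one data frame with rvest and purrr. It is not the actual scraper (that lives in the repo): the inline HTML, the `table.transfers` selector, the column names and the `scrape_season()` helper are all made up for illustration, so it runs without touching any website.

```r
library(rvest)  # read_html(), html_element(), html_table()
library(purrr)  # imap()

# Hypothetical stand-ins for two seasons' worth of parsed pages.
# A real run would call read_html() on a URL instead of a string.
season_pages <- list(
  "2000/01" = read_html('<table class="transfers">
    <tr><th>player</th><th>fee</th></tr>
    <tr><td>Player A</td><td>1.0</td></tr>
    <tr><td>Player B</td><td>2.5</td></tr>
  </table>'),
  "2001/02" = read_html('<table class="transfers">
    <tr><th>player</th><th>fee</th></tr>
    <tr><td>Player C</td><td>0.5</td></tr>
    <tr><td>Player D</td><td>4.0</td></tr>
  </table>')
)

# Extract the (assumed) transfers table from one page and tag it with its season
scrape_season <- function(page, season) {
  tbl <- as.data.frame(html_table(html_element(page, "table.transfers")))
  tbl$season <- season
  tbl
}

# imap() passes each list element plus its name, i.e. (page, season)
season_tables <- imap(season_pages, scrape_season)

# Stack the per-season tables into one flat data frame
transfers <- do.call(rbind, unname(season_tables))
```

The same pattern extends to more leagues by iterating over a longer list of pages; cleaning steps (fee parsing, club names and so on) would slot in inside the helper.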
Et voilà - the data is now freely available, [hosted on GitHub as flat .csv files]( (in accordance with Transfermarkt's [terms of use]([^legal]. Here's a preview, lovingly featuring the deal taking Mr. Igors Stepanovs to Arsenal.
```{r data-preview}
kable(head(epl_transfers), format = "html") %>%
  kable_styling(full_width = FALSE) %>%
  scroll_box(width = "100%")
```
Notice the duplicated player name field `r emo::ji("thinking")`; if you find a fix, holler. For more on the variables included, visit the [repo's readme](
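One possible starting point, offered as a guess rather than a tested fix: if the field turns out to be the name repeated twice back-to-back (e.g. `"Igors StepanovsIgors Stepanovs"`, which is an assumption about the format, not something verified against the data), a backreference regex can collapse it. The `dedupe_name()` helper below is hypothetical.

```r
# Collapse a value that is one string repeated twice back-to-back.
# The {3,} minimum stops short, genuinely repetitive names (e.g. "Kaka")
# from being halved by mistake; values that are not exact doublings
# are returned unchanged.
dedupe_name <- function(x) {
  sub("^(.{3,}?)\\1$", "\\1", x, perl = TRUE)
}

cleaned <- dedupe_name(c("Igors StepanovsIgors Stepanovs", "Thierry Henry", "Kaka"))
```

If the duplication takes another shape (an abbreviated second copy, say), this will simply leave the values untouched, so it is safe to try but may not be sufficient.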
## Sketching some visuals on top
Now for a couple of visualisation pieces I've tried out using this dataset. I've included some notes on my process/workflow for each[^viz], if you're that way inclined. Otherwise, just *~absorb the inspiration~* `r emo::ji("idea")`
I took a look at the value of player buys vs sales for Premier League clubs in the 2018/19 window, using a [Cleveland dot plot]( (AKA 'dumbbell' chart) variant.
> This type of visualisation is an elegant and simple way to show ranges of data (i.e. spend vs sales difference) across multiple categories (i.e. football clubs). I did the initial sketch for this using my standard charting workflow in R (mostly [ggplot2]( and its many extensions, including [Bob Rudis']( charming [ggalt]( which provides this chart type), but I *did* export this into [Adobe Illustrator]( ([Inkscape]( is a fine free alternative) to do good text annotations quicker. The final version therefore includes non-reproducible elements that make refreshing the viz for new transfer windows non-trivial, but that help in telling the stories contained in this view of the data. In this one-off case, I think the trade-off is fine.
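
For the curious, a dumbbell chart can be roughed out in plain ggplot2 with a segment plus two sets of points. This is a minimal sketch rather than the code behind the chart above (which leaned on ggalt): the `window_totals` data frame, its column names and every number in it are placeholders invented for illustration.

```r
library(ggplot2)

# Placeholder data: one row per club, made-up totals in arbitrary units
window_totals <- data.frame(
  club  = c("Club A", "Club B", "Club C"),
  spend = c(70, 25, 10),
  sales = c(20, 30, 0)
)

# Order clubs by net spend so the chart reads top-to-bottom
window_totals$club <- reorder(window_totals$club,
                              window_totals$spend - window_totals$sales)

dumbbell <- ggplot(window_totals, aes(y = club)) +
  # the "bar" of each dumbbell: a line from sales to spend
  geom_segment(aes(x = sales, xend = spend, yend = club), colour = "grey60") +
  # the two "weights"
  geom_point(aes(x = sales, colour = "Sales"), size = 3) +
  geom_point(aes(x = spend, colour = "Spend"), size = 3) +
  labs(x = "Total fees (placeholder units)", y = NULL, colour = NULL)
```

Printing `dumbbell` draws the chart; annotations would then be layered on in code or in a vector editor.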
Next, a look at a single club's season-by-season transfer spend and sales, following the relationship between these two amounts through time.
> This visualisation choice might be a little difficult to follow at first, if it's your first connected scatter (in [this post by Elijah Meeks](, the connected scatter example actually includes a link to an explanation of what's going on). [Steve Haroz + collaborators' research paper]( was invaluable in guiding my first application of this chart format. In short, connected scatters are good at showing how two variables change together over time, provided there is a relatively clear pattern of progression. As with the previous example, this was sketched out in R with ggplot2, with some Illustrator annotation fine-tuning.
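
A connected scatter boils down to a scatter plot whose points are joined in time order. Here is a minimal ggplot2 sketch, again with invented placeholder data (the `club_seasons` data frame and its values are not taken from the dataset). The key detail is `geom_path()`, which joins points in row order, whereas `geom_line()` would re-sort them along the x-axis and lose the sequence.

```r
library(ggplot2)

# Placeholder data: one row per season, already in chronological order
club_seasons <- data.frame(
  season = c("2014/15", "2015/16", "2016/17", "2017/18"),
  sales  = c(10, 35, 20, 60),
  spend  = c(40, 30, 75, 90)
)

connected_scatter <- ggplot(club_seasons, aes(x = sales, y = spend)) +
  # join the seasons in row (i.e. time) order, with an arrow for direction
  geom_path(colour = "grey50", arrow = arrow(length = unit(0.15, "cm"))) +
  geom_point(size = 2) +
  # label each point so the progression can be followed
  geom_text(aes(label = season), vjust = -1, size = 3) +
  labs(x = "Sales (placeholder units)", y = "Spend (placeholder units)")
```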
That's basically it...let me know if you make something `r emo::ji("fire")``r emo::ji("fire")``r emo::ji("fire")`
[^scrape]: For this post I chose to omit lengthy passages on web scraping, so as not to deter non-programmers (insights can be gleaned from the cleaned data without additional code). However, the code used to scrape, clean and analyse the data is publicly available within the `src` directory of the [footy-transfer-data GitHub repo](, featuring [rvest]( (web scraping for R) in conjunction with [purrr]( (iteration tools for R).
[^code]: You can find the R code used to generate this post [here](
[^legal]: Web scraping is a legally/ethically grey area. All effort should be made to verify if scraping a webpage is in accordance with the parent domain's terms of use. A helpful permissions checker for R, [robotstxt](, is invaluable in this pursuit. Use this (or similar in other languages), or at least study the terms of use for the domain in question closely.
[^viz]: For the R code used to sketch the chart examples included in the post (and others that didn't make the cut), try [here](