R package to scrape soccer commentary and statistics from ESPN

Introducing fcscrapR

The goal of fcscrapR is to give R users quick access to the commentary for each soccer game available on ESPN. The commentary data includes basic events such as shot attempts, substitutions, fouls, cards, corners, and video reviews, along with information about the players involved. The data can be accessed in-game as ESPN updates its match commentary. This package was created to help put data in the hands of soccer fans so they can do their own analysis and contribute to reproducible metrics.

Installation

You can install fcscrapR from GitHub with:

# install.packages("devtools")
devtools::install_github("ryurko/fcscrapR")

Example game scraping

Here’s an example of how to scrape a game using fcscrapR. The workhorse function of the package is scrape_commentary(), which takes a game id. The game id appears in the URL for a game, such as the group stage match between Serbia and Costa Rica in the 2018 World Cup: http://www.espn.com/soccer/commentary?gameId=498194

Using this game id, we can easily grab the commentary data frame:

library(fcscrapR)
#> Loading required package: magrittr
srb_crc_commentary <- scrape_commentary(498194)
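
If you have a list of ESPN commentary URLs rather than bare ids, the game id can be pulled out of each URL with base R. This is only a sketch: extract_game_id() is a hypothetical helper, not a function provided by fcscrapR, and it assumes the id always follows "gameId=" as in the URL above.

```r
# Hypothetical helper (not part of fcscrapR): pull the numeric game id
# out of an ESPN commentary URL using base R only.
extract_game_id <- function(url) {
  # Find "gameId=" followed by one or more digits
  match <- regmatches(url, regexpr("gameId=[0-9]+", url))
  # No match means the URL does not carry a game id
  if (length(match) == 0) {
    return(NA_integer_)
  }
  # Drop the "gameId=" prefix and convert the digits to an integer
  as.integer(sub("gameId=", "", match))
}

extract_game_id("http://www.espn.com/soccer/commentary?gameId=498194")
#> [1] 498194
```

The returned value can be passed straight to scrape_commentary().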

Check out the documentation for scrape_commentary() for a description of all of the columns in the commentary data:

colnames(srb_crc_commentary)
#>  [1] "game_id"                 "commentary"             
#>  [3] "match_time"              "team_one"               
#>  [5] "team_two"                "team_one_score"         
#>  [7] "team_two_score"          "half_end"               
#>  [9] "match_end"               "half_begins"            
#> [11] "shot_attempt"            "penalty_shot"           
#> [13] "shot_result"             "shot_by_player"         
#> [15] "shot_by_team"            "shot_with"              
#> [17] "shot_where"              "net_location"           
#> [19] "assist_by_player"        "foul"                   
#> [21] "foul_by_player"          "foul_by_team"           
#> [23] "follow_set_piece"        "assist_type"            
#> [25] "follow_corner"           "offside"                
#> [27] "offside_team"            "offside_player"         
#> [29] "offside_pass_from"       "shown_card"             
#> [31] "card_type"               "card_player"            
#> [33] "card_team"               "video_review"           
#> [35] "video_review_event"      "video_review_result"    
#> [37] "delay_in_match"          "delay_team"             
#> [39] "free_kick_won"           "free_kick_player"       
#> [41] "free_kick_team"          "free_kick_where"        
#> [43] "corner"                  "corner_team"            
#> [45] "corner_conceded_by"      "substitution"           
#> [47] "sub_injury"              "sub_team"               
#> [49] "sub_player"              "replaced_player"        
#> [51] "penalty"                 "team_drew_penalty"      
#> [53] "player_drew_penalty"     "player_conceded_penalty"
#> [55] "team_conceded_penalty"   "half"                   
#> [57] "comment_id"              "stoppage_time"          
#> [59] "team_one_penalty_score"  "team_two_penalty_score" 
#> [61] "match_time_numeric"

We can quickly make a chart showing the number of shot attempts for each team, broken down by result:

# install.packages(c("dplyr", "ggplot2"))
library(ggplot2)
srb_crc_commentary %>%
  dplyr::filter(!is.na(shot_result)) %>%
  ggplot(aes(x = shot_by_team, fill = shot_result)) +
  geom_bar() + labs(x = "Team", y = "Count", 
                    fill = "Shot result",
                    title = "Distribution of shot attempts for Costa Rica vs Serbia by result",
                    caption = "Data from ESPN, accessed with fcscrapR") +
  scale_fill_manual(values = c("darkorange", "darkblue", "darkred", "darkcyan")) +
  theme_bw()
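
The same counts that the chart displays can also be produced as a table. The sketch below uses base R only. count_shots() is a hypothetical helper, not part of fcscrapR, and it assumes only that the data frame has the shot_by_team and shot_result columns listed earlier.

```r
# Hypothetical helper (not part of fcscrapR): tabulate shot attempts by
# team and result from a commentary data frame.
count_shots <- function(commentary) {
  # Keep only rows that record a shot result, mirroring the filter used
  # for the chart
  shots <- commentary[!is.na(commentary$shot_result), ]
  # Cross-tabulate team against result and return one row per combination,
  # with the count stored in a column named "n"
  as.data.frame(
    table(team = shots$shot_by_team, result = shots$shot_result),
    responseName = "n",
    stringsAsFactors = FALSE
  )
}
```

Calling count_shots(srb_crc_commentary) should then return one row for each team and shot result combination, including combinations with a count of zero.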

Gather game ids

Currently, the only function available for getting game ids is scrape_scoreboard_ids(), which pulls the game ids for all soccer matches on ESPN’s soccer scoreboard for a given league or tournament. You must use a league or tournament that has an associated URL in the league_url_data table provided in fcscrapR:

# install.packages("pander")
league_url_data %>%
  head() %>%
  pander::pander()

| name | url |
|---|---|
| show all leagues | http://www.espn.com/soccer/scoreboard/_/league/all |
| fifa world cup | http://www.espn.com/soccer/scoreboard/_/league/fifa.world |
| uefa champions league | http://www.espn.com/soccer/scoreboard/_/league/uefa.champions |
| uefa europa league | http://www.espn.com/soccer/scoreboard/_/league/uefa.europa |
| english premier league | http://www.espn.com/soccer/scoreboard/_/league/eng.1 |
| spanish primera división | http://www.espn.com/soccer/scoreboard/_/league/esp.1 |
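
Because the league name has to match an entry in the table, it can help to check a name before scraping. The sketch below rebuilds the six rows shown above as a standalone data frame so that it runs without the package. lookup_scoreboard_url() is a hypothetical helper, not part of fcscrapR, and lowercasing the input is an assumption based on the names shown here all being lowercase.

```r
# The six rows shown above, copied into a standalone data frame
league_urls <- data.frame(
  name = c("show all leagues", "fifa world cup", "uefa champions league",
           "uefa europa league", "english premier league",
           "spanish primera división"),
  url = c("http://www.espn.com/soccer/scoreboard/_/league/all",
          "http://www.espn.com/soccer/scoreboard/_/league/fifa.world",
          "http://www.espn.com/soccer/scoreboard/_/league/uefa.champions",
          "http://www.espn.com/soccer/scoreboard/_/league/uefa.europa",
          "http://www.espn.com/soccer/scoreboard/_/league/eng.1",
          "http://www.espn.com/soccer/scoreboard/_/league/esp.1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of fcscrapR): return the scoreboard url
# for a league name, or stop with an error if the name is not in the table.
lookup_scoreboard_url <- function(name, league_table = league_urls) {
  # Assumption: names in the table are stored in lowercase
  hit <- league_table$url[league_table$name == tolower(name)]
  if (length(hit) == 0) {
    stop("No scoreboard url found for league: ", name)
  }
  hit[[1]]
}

lookup_scoreboard_url("fifa world cup")
#> [1] "http://www.espn.com/soccer/scoreboard/_/league/fifa.world"
```

With fcscrapR loaded, the same lookup could be run against the full league_url_data table by passing it as league_table.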

Here’s an example of grabbing the World Cup games from June 20th, 2018:

scrape_scoreboard_ids(scoreboard_name = "fifa world cup", 
                      game_date = "2018-06-20") %>%
  pander::pander()
#> Loading required package: XML
#> Loading required package: RCurl
#> Loading required package: bitops

| game_id | team_one | team_two |
|---|---|---|
| 498185 | Portugal | Morocco |
| 498184 | Uruguay | Saudi Arabia |
| 498183 | Iran | Spain |
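
To collect ids over several days, one option is to build the date strings first and then call scrape_scoreboard_ids() once per date. The sketch below covers only the date step, which runs in base R. make_game_dates() is a hypothetical helper, and the commented-out loop is untested: it assumes the function accepts one date per call, as in the example above.

```r
# Hypothetical helper (not part of fcscrapR): build a vector of
# "YYYY-MM-DD" strings, the format used for game_date above.
make_game_dates <- function(from, to) {
  format(seq(as.Date(from), as.Date(to), by = "day"), "%Y-%m-%d")
}

game_dates <- make_game_dates("2018-06-19", "2018-06-21")
game_dates
#> [1] "2018-06-19" "2018-06-20" "2018-06-21"

# With fcscrapR loaded, each date could then be scraped in turn (not run):
# ids_by_date <- lapply(game_dates, function(d) {
#   scrape_scoreboard_ids(scoreboard_name = "fifa world cup", game_date = d)
# })
```

Each element of the resulting list would be a table of game ids like the one above, and its game_id column could be passed on to scrape_commentary().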

Acknowledgements

Many thanks to the sports analytics community on Twitter for guiding me to various resources of soccer data. Big thanks to Brendan Kent for pointing me to the commentary data.